
The Next3 filesystem


By Jonathan Corbet
May 11, 2010
The ext3 filesystem is tried and true, but it lacks a number of features deemed interesting by contemporary users. Snapshots - the ability to quickly capture the state of the filesystem at an arbitrary time - are at the top of many lists. It is currently possible to use the LVM snapshotting feature with ext3, but snapshots taken through LVM have some significant limitations. The Next3 filesystem offers an approach which might prove easier and more flexible: snapshots implemented directly in ext3.

Next3 was developed by CTERA Networks, which has started shipping it on its C200 network-attached storage device. This code has also been posted on SourceForge and proposed for merging into the mainline kernel. The Next3 filesystem adds a simple snapshot feature to ext3 in ways which are (mostly) compatible with the existing on-disk format. It looks like a useful feature, but its path into the mainline looks to be longer than its implementers might have hoped.

The Next3 filesystem is a new filesystem type - it's not just an addition to ext3. At its core, it works by creating a special, magic file to represent each snapshot of the filesystem. These files have the same apparent size as the storage volume as a whole, but they are sparse files, so they take almost no space at the outset. When a change is made to a block on disk, the filesystem must first check whether that block has already been saved in the most recent snapshot. If not, the affected block is moved over to the snapshot file, and a new block is allocated to replace it. Thus, over time, disk blocks migrate to the snapshot file as they are rewritten with new contents.
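That write path is easy to model. Here is a deliberately simplified Python sketch of the move-on-write idea (an illustrative toy only - the real Next3 code is kernel C operating on ext3 block maps, and it transfers ownership of the physical block to the snapshot file rather than copying data; a dictionary keeps the model simple):

    # Toy model of Next3-style snapshots; illustrative only, not kernel code.
    # Each "snapshot file" is modeled as a sparse mapping from block number
    # to that block's saved (old) contents.

    BLOCK_SIZE = 4096

    class Volume:
        def __init__(self, nblocks):
            self.blocks = [bytes(BLOCK_SIZE)] * nblocks   # the live volume
            self.snapshots = []                           # oldest first

        def take_snapshot(self):
            self.snapshots.append({})    # sparse: takes no space at first

        def write_block(self, blockno, data):
            if self.snapshots:
                latest = self.snapshots[-1]
                if blockno not in latest:
                    # First rewrite since the last snapshot: save the old
                    # contents in the snapshot file before replacing them.
                    latest[blockno] = self.blocks[blockno]
            self.blocks[blockno] = data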

Gaining read-only access to a snapshot is a simple matter of doing a loopback mount of the snapshot file as an ext2 filesystem. The snapshot file is sufficiently magic that any attempts to read blocks in the holes (which represent blocks that have not been changed since the snapshot was taken) will be satisfied from a later snapshot - which will have captured the contents of that block when it was eventually changed - or from the underlying storage device. Deleting a snapshot requires moving changed blocks into the previous snapshot, if it exists, because the deleted snapshot holds blocks which are logically part of the earlier snapshots.
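Extending the toy Volume class above, the read and delete paths follow directly from that description (again, just a sketch of the lookup order and merge rule, not CTERA's implementation):

    # Further methods of the toy Volume class above.

    def read_snapshot_block(self, snap_index, blockno):
        # A hole in snapshot snap_index means the block was still unchanged
        # when that snapshot was taken, so its contents at that time live in
        # whichever later snapshot saved them when the block was finally
        # rewritten - or, if it never was, on the live volume itself.
        for snap in self.snapshots[snap_index:]:
            if blockno in snap:
                return snap[blockno]
        return self.blocks[blockno]

    def delete_snapshot(self, snap_index):
        # Blocks held by the deleted snapshot are logically part of every
        # earlier snapshot too, so merge them backwards (never overwriting
        # a block the earlier snapshot already saved).
        victim = self.snapshots.pop(snap_index)
        if snap_index > 0:
            prev = self.snapshots[snap_index - 1]
            for blockno, data in victim.items():
                prev.setdefault(blockno, data)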

The changes to the ext3 on-disk format are minimal, to the point that a Next3 filesystem can be mounted by the ordinary ext3 code. If snapshots exist, though, ext3 cannot be allowed to modify the filesystem, lest the changed blocks fail to be saved in the snapshot. So, when snapshots exist on the filesystem, it will be marked with a feature flag which forces ext3 to mount the filesystem readonly.
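This uses the standard ext2/3/4 feature-flag mechanism: an unrecognized bit in the superblock's "read-only compatible" feature set tells an older driver that it may mount the filesystem read-only but must refuse a read-write mount. A sketch of the check (the snapshot flag value below is invented for the sketch, not Next3's actual on-disk constant):

    # Sketch of ext2/3/4-style ro-compat feature checking.  The snapshot
    # flag value is hypothetical, not Next3's real on-disk constant.
    RO_COMPAT_HAS_SNAPSHOT = 1 << 30    # hypothetical bit set by Next3

    # sparse_super | large_file | btree_dir: what plain ext3 understands
    EXT3_RO_COMPAT_SUPP = 0x0001 | 0x0002 | 0x0004

    def ext3_mount(sb_ro_compat_flags, readonly):
        unknown = sb_ro_compat_flags & ~EXT3_RO_COMPAT_SUPP
        if unknown and not readonly:
            # ext3 sees the snapshot feature it doesn't understand and
            # must insist on a read-only mount.
            raise OSError("unknown ro-compat features - mount read-only")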

On the performance side, the news is said to be mostly good. Writes will take a little longer due to the need to move the old block to a snapshot file. The worst performance impact is seemingly on truncate operations; these may have to save a large number of blocks and can get a lot slower. It is also worth noting that the moving of modified blocks to the snapshot file will, over time, wreck the nice, contiguous on-disk format that ext3 tries so hard to create, with an unfortunate effect on streaming read performance. Files which must not be fragmented can be marked with a special flag which will cause blocks to be copied into the snapshot file rather than moved; that will slow writes further, but will keep the file contiguous on disk.

Next3 developer Amir Goldstein requested relatively quick review of the patches because he is trying to finalize some of the on-disk formatting. The answer he got from Ted Ts'o was probably not quite what he was looking for:

Ext4 is where new development takes place in the ext2/3/4 series. So enhancements such as Next3 will probably not be received with great welcome into ext3.

Amir's response was that, while porting the patches to ext4 is on the "we'll get around to it someday" list, that port is not an easy thing to do. The biggest problem, apparently, is making the movement of blocks into the snapshot file work properly with ext4's extent-oriented format. Beyond that, Amir says, he's not actually trying to get the changes into ext3 - he wants to merge a separate filesystem called Next3 which happens to be mostly compatible with ext3.

The "separate Next3" approach is unlikely to fly very far, though. As Ted put it, ext2, ext3, and ext4 are really just different implementations of the same basic filesystem format; this format has never really been forked. Next3, as a separate filesystem, would be a fork of the format. The fact that Next3 has taken over some data structure fields which are used to different purpose in ext4 has not helped matters:

The "ext" in ext2 stands for "extended", as in the "the second extended file system" for Linux. It perhaps would be better if we had used the term "extensible", since that's the main thing about ext2/3/4 that has given it so much staying power. We've been able to add, in very carefully backwards and forwards compatible way, new features to the file system format. This is why I object to why Next3 uses some fields that overlaps with ext4. It means that e2fsprogs, which supports _one_ and _only_ _one_ file system format, will now need to support two file system formats. And that's not something I want to do.

The answer appears fairly clear: patches adding the snapshot feature might be welcome, but not as a fork of the ext3 filesystem. At a bare minimum, the filesystem format will have to be changed to avoid conflicts with ext4, but the real solution appears to be simply implementing the patches on top of ext4 instead of ext3. That is a fair amount of extra work which might have been avoided had the Next3 developers talked with the community prior to starting to code.

Index entries for this article
Kernel: Filesystems/ext3



The Next3 filesystem

Posted May 13, 2010 4:56 UTC (Thu) by spotter (guest, #12199) [Link] (1 responses)

wondering how this differs from Zachary Peterson's ext3cow

The Next3 filesystem

Posted May 13, 2010 7:16 UTC (Thu) by amir73il (subscriber, #66165) [Link]

ext3cow has inspired Next3, but the final result is fundamentally different.
At the top of the list of differences is compatibility; in other words:
Delete the Next3 snapshot files and you are back to plain old Ext3.

See also:
http://sourceforge.net/apps/mediawiki/next3/index.php?tit...

The Next3 filesystem

Posted May 13, 2010 9:03 UTC (Thu) by ringerc (subscriber, #3071) [Link] (4 responses)

NTFS snapshots (volume shadow copies) are the one thing I really miss as a Linux server admin. LVM snapshots do *not* cut it for all purposes:

- They require LVM, which has its own issues and isn't always desirable

- They require storage to be reserved for them and allocated to them in advance

- They don't gracefully age out and aren't quietly removed when they run out of backing store. In fact, I recently had a server fail to boot because of an LVM snapshot that'd filled up.

- They need the file system to be capable of being mounted read-only from a dirty state. Not all file systems can handle this.

I find LVM snapshots to be well suited to taking backups, where I need to snapshot a volume, read the snapshot contents, unmount the snapshot and destroy it.

I find them rather less than ideal when I just want to keep a few snapshots around to provide coarse versioning, as is so useful on a Windows server with VSS. An in-filesystem snapshot facility would be really, really nice for this sort of thing, and one that didn't require loopback mounts (instead providing virtual directory access or the like) would be truly fantastic for backups.

VSS = Visual SourceSafe?

Posted May 13, 2010 11:44 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (2 responses)

If that's the correct expansion, just say no. It should be a giveaway when you find you need "coarse versioning" for your version control system that what you have is a liability. Rather than thrash around trying to find ways to keep VSS sort-of working, it should be a priority to migrate away from it.

This isn't some partisan thing, I don't care if you migrate to git or Perforce or even Team Foundation Server. But get off Visual SourceSafe. Once you're safely running a real version control system you'll find that your headache goes away without any need for "coarse versioning" using snapshots.

VSS = Volume Shadow copy Service (Yes, I know)

Posted May 13, 2010 13:27 UTC (Thu) by ringerc (subscriber, #3071) [Link] (1 responses)

Argh! That acronym collision drives me nuts.

Here, VSS = "Volume Shadow Copy Service". Yes, Microsoft uses the acronym "VSS" for Volume Shadow Copy Service despite having an existing claim on that TLA via Visual SourceSafe, i.e. VSS.

You'd think they'd refer to it as VSCS, but no....

What I'm talking about is a facility in Microsoft servers (and client operating systems, but it's less important there) that's based on the Volume Shadow Copy Service where they can make automatic snapshots of their file systems on a schedule, and retain them until the total size of all snapshots reaches an admin-configured limit, at which point the oldest snapshot is dropped to make room.

The same underlying snapshot service is used to provide efficient image-based backup. In fact, on Win2k8 you can have the server maintain a bootable backup disk image of itself on a raw disk - I use an iSCSI volume on my Linux backup server. The server uses the volume shadow copy service to only update dirty parts of the image at each backup run. It's nice to have for a Windows-based server OS where, unlike Linux/BSD, you can't just rsync the whole file system contents to another box and expect to be able to boot it.

(For what it's worth, I use svn heavily, though I am drifting git-wards now that I've started actually using it and discovered how seriously nice it is these days. You won't catch me near Visual SourceSafe unless it's with an axe. In an amusing confluence of these two topics, I now maintain all my servers' /etc in git and git-push them to the backup server every night, 'cos it's more convenient than Bacula when reverting changes.)

VSS = Volume Shadow copy Service (Yes, I know)

Posted May 13, 2010 13:48 UTC (Thu) by nye (subscriber, #51576) [Link]

>Here, VSS = "Volume Shadow copy Service". Yes, Microsoft uses the acronym "VSS" for Volume Shadow Copy Service despite having an existing claim on that TLA via Visual Source Safe ie VSS.

>You'd think they'd refer to it as VSCS, but no....

That would be 'Volume Snapshot Service', supposedly. The acronym collision probably never occurred to anyone at Microsoft since (it is widely held, at least) nobody at Microsoft actually uses Visual Source Safe.

It's one of those things that was bought, mangled, and re-released to fill a gap in their product line, targeted at those people who must use all-Microsoft, all-the-time, even if MS themselves think the tool is worthless.

The Next3 filesystem

Posted May 14, 2010 15:36 UTC (Fri) by mebrown (subscriber, #7960) [Link]

BTRFS has exactly what you describe (VSS) at the filesystem level, and is widely held to be the heir apparent to the Linux Filesystem of the Future(TM) crown.

In fact, in Fedora 13, there is a yum plugin that will snapshot the system before every yum transaction, meaning you can roll back to any previous system state if an upgrade goes awry.

The Next3 filesystem

Posted May 13, 2010 10:16 UTC (Thu) by anton (subscriber, #25547) [Link] (46 responses)

I would really like to use snapshots, so I would like to see Next3 in the mainline.

As for ext4 being the future: It does not have a single feature that I want. Worse, it has the misfeature of worse crash consistency. So I am happy that Next3 is based on ext3, not ext4, so I don't have to choose between snapshots and crash consistency.

The Next3 filesystem

Posted May 13, 2010 11:16 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link] (45 responses)

crash consistency = consistently crashing? :-)

The Next3 filesystem

Posted May 13, 2010 13:42 UTC (Thu) by anton (subscriber, #25547) [Link] (44 responses)

Data consistency after crashes or power outages.

The Next3 filesystem

Posted May 13, 2010 16:57 UTC (Thu) by cortana (subscriber, #24596) [Link] (33 responses)

The zero-sized files after a crash? I thought those issues were resolved quite some time ago.

ext4 and data consistency

Posted May 13, 2010 18:38 UTC (Thu) by anton (subscriber, #25547) [Link] (32 responses)

Ted Ts'o still believes that data consistency on OS crashes (not application crashes) is the job of the applications (with fsync() etc.), not of the file system. And most applications don't do that, and those few that try it are probably not well tested against that (because that's extremely hard).

He fixed one particularly frequent cause of data loss in ext4 (involving writing a file, then renaming it across an old one), but nothing else. So people will see data loss with ext4 less frequently than before, but not as infrequently as with ext3 (or has this data loss feature been backported from ext4 to ext3 to give us fewer reasons to stick with ext3?).

ext4 and data consistency

Posted May 13, 2010 20:08 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link] (9 responses)

You are clearly overstating the case and the position. There were several issues fixed. Not just one.

ext4 and data consistency

Posted May 14, 2010 12:54 UTC (Fri) by anton (subscriber, #25547) [Link] (8 responses)

Am I? That's Ted Ts'o's position as reported on, e.g., LWN. But maybe you can show me where I was wrong in my statement of his position. And my impression is that if it were just up to him, he would not have made the rename fix.

ext4 and data consistency

Posted May 14, 2010 14:00 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

You said there was only one fix. There were several, and there are other Ext4 filesystem developers as well. What you state as his position leaves out a lot of nuanced arguments in a complex topic and makes it sound very simplistic. If you can actually show a single case where Ext4 performs less robustly than Ext3, I would be interested.

ext4 and data consistency

Posted May 21, 2010 15:08 UTC (Fri) by Duncan (guest, #6647) [Link] (6 responses)

What bothers me is how they reduced the guarantees and stability of the long mature ext3 filesystem in the aftermath of all this, by defaulting it to data=writeback, a change from the old default data=ordered.

Presumably you used tune2fs or simply fstab to ensure your ext3 mounts remain stable with data=ordered after the kernel in question (was it 2.6.30 or 2.6.31?), right?

What'd be interesting to see is how the distributions have handled it since. Did they go with the new ext3 data=writeback default, or have they either reverted that commit or made their userspace specify data=ordered by default?

I know at least one guy who was complaining of ext3 instability after installing a new kernel due to that; the instability went away when he returned to data=ordered for his ext3 volumes. The context of that discussion was the pan (nntp client) user list, IIRC.

Me, I've been on reiserfs for years on both my main system and (more recently) my netbook, and have been extremely happy with it since data=ordered became its default (2.6.6 according to a google hit on another LWN comment of mine). My most recent experience with extX is on no-journal ext4 formatted USB flash-based thumbdrives, where journaling isn't a good idea. I've been following btrfs with interest, and expect I'll upgrade to it once a few more of the kinks get worked out. (I've seen hints that the current 2.6.35 cycle will reduce the strength of the warning for its kernel config item, but I don't follow the btrfs list or lkml, and any detail of even plans has been harder to come by on the broader community sites such as LWN, HO, LXer, etc, that I follow.)

Duncan

ext4 and data consistency

Posted May 22, 2010 19:15 UTC (Sat) by anton (subscriber, #25547) [Link] (5 responses)

> What bothers me is how they reduced the guarantees and stability of the long mature ext3 filesystem in the aftermath of all this, by defaulting it to data=writeback, a change from the old default data=ordered.

Yes, that's what was at the back of my mind when I wrote about "backporting the data loss feature from ext4 to ext3".

> Presumably you used tune2fs or simply fstab to ensure your ext3 mounts remain stable with data=ordered after the kernel in question (was it 2.6.30 or 2.6.31?), right?

The youngest kernel we have is 2.6.30, and according to /proc/mounts it mounts our ext3 file systems with data=ordered. I guess we will go the fstab route once we get a kernel that defaults to data=writeback.

I am a little worried, though, because of what happened after data=journal was no longer the default; I then read that using data=journal resulted in corrupt file systems; I read that for a significant amount of time, and never read that the bug had been fixed (though I haven't seen such reports for some time).

So if they made data=ordered non-default in 2.6.31 or some other kernel, will they really care if it works? My confidence is limited. We should probably stick with 2.6.30 until we migrate off extX file systems completely.

ext4 and data consistency

Posted May 22, 2010 20:36 UTC (Sat) by nix (subscriber, #2304) [Link] (4 responses)

Sticking with 2.6.30 is foolish. Bugs are fixed in ext[34] all the time, sometimes data loss bugs: by sticking with 2.6.30, you're depriving yourself of all of those.

(btw, you can put mount options in the superblock, and avoid modifying /etc/fstab.)

ext4 and data consistency

Posted May 23, 2010 11:44 UTC (Sun) by anton (subscriber, #25547) [Link] (3 responses)

And new bugs are introduced, and if they are for a non-default option like (now) data=ordered, they won't get noticed in time, and they won't get fixed for quite some time; at least that's what the non-default data=journal episode teaches. So what's higher: the risk of data loss from a well-known kernel, or from a new kernel in a non-default setting? Choosing the latter seems foolish to me.

Modifying fstab is not a big deal; why would I want to avoid it? The problem with doing it in the superblock is that I have to do it again when I transfer the system to another disk.

ext4 and data consistency

Posted May 23, 2010 11:50 UTC (Sun) by cortana (subscriber, #24596) [Link] (1 responses)

How can I check whether my distribution has changed the default value of the option in its kernels?

ext4 and data consistency

Posted May 23, 2010 13:29 UTC (Sun) by anton (subscriber, #25547) [Link]

One way is to mount such a file system with the default value (without overriding the default with tune2fs or in fstab), and then check the actual options in /proc/mounts. That is what I do.

Another way would be to check CONFIG_EXT3_DEFAULTS_TO_ORDERED in the kernel config file.
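The first check is easy to script; for instance, a small illustrative Python version:

    # Report the data journaling mode of each mounted ext3 filesystem.
    # If no data= option appears, the kernel's compiled-in default applies.
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype == "ext3":
                data = [o for o in options.split(",") if o.startswith("data=")]
                print(mountpoint, data[0] if data else "data=<kernel default>")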

ext4 and data consistency

Posted May 23, 2010 13:55 UTC (Sun) by nix (subscriber, #2304) [Link]

Well, OK, you're quite within your rights to stick with an old kernel: but I hope you encounter no other security bugs, or stability bugs, or new hardware, or *anything* else that might require a new kernel!

ext4 and data consistency

Posted May 13, 2010 20:19 UTC (Thu) by drag (guest, #31333) [Link] (21 responses)

There is no file system in Linux that tries to assure that renames are atomic functions.

Ext3, Ext4, XFS, JFS, etc. etc. - all of these have the same consistency problems you're complaining about.

The difference is that due to a fluke of Ext3's design, the window in which 'zero length files' would be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).

What is more, with 2.6.30 a patch was added to Ext4 that attempts to detect the write-then-rename pattern and replicate Ext3's behavior, in order to maintain backwards compatibility with application developers' assumptions about file system behavior with regards to renames.

So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.

---------------------------

I know that this issue has cropped up again due to the fact that in Ubuntu the dpkg program detects if it's running on Ext4 and goes into paranoid mode where it runs 'fsync', whereas with Ext3 it does not. This causes Ubuntu installs to last significantly longer if you choose the 'Ext4' file system.

If the dpkg folks were smart they'd enable paranoid mode on all file systems, except maybe Ext3 (due to Ext3's poor ability to handle that sort of workload).

As far as my personal opinion goes, this is an advantage for using Ext4 over Ext3, since upgrades will be much safer on my laptop...

---------------------------

The one feature that I like about Ext4 is that it takes a minute or two to run a full fsck on my home directory, versus upwards of 15-20 minutes for the same operation on Ext3.

ext4 and data consistency

Posted May 13, 2010 20:45 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (11 responses)

> The difference is that due to a fluke of Ext3's design, the window in which 'zero length files' would be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).

That window must be vanishingly small, because neither I nor anyone else has ever been able to make ext3 create zero-length files in the way you describe. Quirk or not, rename atomicity is an important feature that works just fine on a running filesystem, and filesystems ought to preserve its qualities on a restart. Allowing random garbage to exist on the filesystem after a restart is terrible policy and reflects a profound ignorance on the part of filesystem developers as to how applications and users expect their systems to work.

ext4 and data consistency

Posted May 13, 2010 23:07 UTC (Thu) by njs (subscriber, #40338) [Link] (10 responses)

And ext4 now has rename atomicity over crashes. I also think that this is the right decision, but I wince when I see people tear into filesystem developers over this; if anything, it seems to reflect a profound ignorance of the difficulty of the trade-offs fs developers have to make, the disparity between what people want from a fs and what fs's have historically provided, etc. Keep in mind that if you go two web-pages over, you can find people tearing into POSIX for providing *too* strong guarantees and how we absolutely need to relax them for real-world usage (atime is the obvious example, but there are others). So I can hardly blame fs developers for being *cautious* about introducing strong *new* guarantees.

ext4 and data consistency

Posted May 14, 2010 13:43 UTC (Fri) by anton (subscriber, #25547) [Link] (9 responses)

> [...] trade-offs fs developers have to make, the disparity between what people want from a fs and what fs's have historically provided, etc.

Yes, different people expect different things from file systems.

E.g., I expect data consistency from a file system; Linux file systems don't give any guarantee on that, but at least ext3 does OK in most cases; some people may consider this a fluke (but is Stephen Tweedie, the creator of ext3, among them?), but that's the reality.

Other people expect maximum speed. And for these people Linux provides tmpfs and ext4.

Given this choice, ext4 is certainly not a replacement for ext3 for me.

> Keep in mind that if you go two web-pages over, you can find people tearing into POSIX for providing *too* strong guarantees and how we absolutely need to relax them for real-world usage (atime is the obvious example, but there are others).

Yes, there are different kinds of users. I lost quite a bit of time because Linux does not follow POSIX atime semantics by default anymore. I find them useful in my real-world usage. Those who don't want atime have been able to use noatime for a long time, and now there is relatime, but making it the default (especially with mounts that don't know about strictatime) is a bad practice.

ext4 and data consistency

Posted May 14, 2010 15:53 UTC (Fri) by bronson (subscriber, #4806) [Link] (5 responses)

What on earth do you use atime for? Personally, the last time I ever needed to worry about atime was in the 1990s, and it was very easy to replace.

ext4 and data consistency

Posted May 15, 2010 8:36 UTC (Sat) by anton (subscriber, #25547) [Link] (4 responses)

I use atime to check whether some complex software really does access the files that I think it does.

ext4 and data consistency

Posted May 20, 2010 19:23 UTC (Thu) by oak (guest, #2786) [Link] (3 responses)

> I use atime to check whether some complex software really does access the files that I think it does.

Wouldn't "strace -f" be handier for that kind of thing? With that you notice also a lot of other stuff that the SW does.

Strace-account script gives an overview of file accesses in the strace output:
http://blogs.gnome.org/mortenw/2005/12/14/strace-account/

ext4 and data consistency

Posted May 21, 2010 12:04 UTC (Fri) by anton (subscriber, #25547) [Link] (2 responses)

It would not be handier exactly because it tells me a huge amount of other stuff the software does and that I am not interested in.

ext4 and data consistency

Posted Jun 8, 2010 22:17 UTC (Tue) by elanthis (guest, #6227) [Link] (1 responses)

Meet grep. Grep is your friend. Grep can make your life much easier. Grep is here to help you.

ext4 and data consistency

Posted Jun 9, 2010 9:06 UTC (Wed) by anton (subscriber, #25547) [Link]

And how is that handier than just doing "stat <file>"?

ext4 and data consistency

Posted May 14, 2010 17:33 UTC (Fri) by njs (subscriber, #40338) [Link] (2 responses)

I'm not aware of any common filesystem that provides "data consistency" in any coherent sense, unless you do weird things like mount -o sync. Speed is too important -- Stephen Tweedie didn't make data=journal the default, either. At most you get guarantees in particular situations -- e.g., both ext3 and ext4 guarantee that a rename will not be committed to disk until writes to the file being renamed have been committed to disk. They even both try to guarantee that programmers who do horrible things like truncating the file and *then* rewriting it are somewhat protected from their incompetence.

But maybe there are other cases where ext3 does better than ext4. You must have some excellent ones in mind to lump ext4 in with tmpfs... can you give any examples?

ext4 and data consistency

Posted May 15, 2010 9:19 UTC (Sat) by anton (subscriber, #25547) [Link] (1 responses)

> Speed is too important

For whom? For me data consistency is much more important. Before barriers were supported, we ran ext3 on IDE disks without write caching, and that's really slow. The file system was still fast enough.

> Stephen Tweedie didn't make data=journal the default, either.

Actually he did, at least at the start. Later it got changed (by whom?) to data=ordered; that still has the potential to provide data consistency unless existing files are overwritten.

As for an example: Consider a process writing file A and then file B. With ext4 I expect that it can happen that after recovery B is present and A is not or is empty. With ext3 I expect that this does not happen. But given that I did not find any documented guarantees in Documentation/filesystems/ext3.txt, maybe we should lump ext3 with tmpfs, too.

Still, my search brought up a Linux file system that gives guarantees: In nilfs2.txt it says:

order=strict	Apply strict in-order semantics that preserves sequence
		of all file operations including overwriting of data
		blocks.  That means, it is guaranteed that no
		overtaking of events occurs in the recovered file
		system after a crash.
Yes, that's exactly the guarantee I want to see. This means that any application that keeps its files consistent as visible from other processes will also have consistent files after an OS crash.

ext4 and data consistency

Posted May 16, 2010 3:57 UTC (Sun) by njs (subscriber, #40338) [Link]

> For whom? For me data consistency is much more important

That's fine. I'd like data consistency too. But I still don't mount my disks with -o sync, nor does pretty much anyone else, even most of the people who say they want data consistency. That's the reality that fs developers live in.

Maybe on SSD (where nilfs2 is designed to live), we'll be able to get guaranteed data consistency as a matter of course. That'll be nice if it happens.

ext4 and data consistency

Posted May 13, 2010 21:23 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (1 responses)

> The difference is that due to a fluke of Ext3's design, the window in which 'zero length files' would be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).

My understanding is that ext3 would always have allocated the blocks for the new file and written it before the rename would occur. The 0-length file issue was due to ext4 performing delayed allocation and performing the rename before the data ever got written.

> So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.

This is a misunderstanding. The desired behaviour was that operations occur in order. It's not terribly important to a user if they lose the configuration changes they made before a crash - it's pretty significant if the rename was performed before the data hit disk, resulting in the complete loss of their configuration.

It's true that POSIX doesn't require that filesystems behave this way. There's many things that POSIX doesn't require but which we expect anyway because the alternative is misery.

ext4 and data consistency

Posted May 14, 2010 12:32 UTC (Fri) by ricwheeler (subscriber, #4980) [Link]

You need to be careful not to confuse the rename issue specifically with the need to use fsync() properly to make sure that data is on disk.

Applications still have to understand when to use fsync() properly to move data from the page cache out to persistent storage (on disk, ssd, etc).
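For reference, the safe application-side pattern this subthread keeps circling around - write the new version, fsync() it, then rename() over the old file - looks like this (a minimal Python sketch; real code needs error handling, and the directory fsync at the end is what truly paranoid applications add for durability of the rename itself):

    import os

    def atomic_replace(path, data):
        # Write the new contents (bytes) under a temporary name, force
        # them to disk, and only then rename over the old file, so a
        # crash leaves either the complete old version or the complete
        # new one - never a zero-length or partial file.
        tmp = path + ".tmp"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)              # data blocks hit the disk first
        finally:
            os.close(fd)
        os.rename(tmp, path)          # the atomic commit point
        # Also fsync the containing directory, making the rename itself
        # durable rather than merely ordered after the data.
        dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)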

ext4 and data consistency

Posted May 14, 2010 13:21 UTC (Fri) by anton (subscriber, #25547) [Link] (5 responses)

> There is no file system in Linux that tries to assure that renames are atomic functions.

That may be true (wrt. what happens on crashes; I do hope that they are all atomic wrt state visible to other processes in regular operations); I certainly have never seen any Linux file system give any guarantees about data consistency on crashes. Not doing renames properly would be pretty poor of Linux, though, given that this is a case where even the old BSD FFS goes to extra lengths to ensure at least meta-data consistency (it never cares about your data).

Concerning Linux file systems, I am pretty sure that ext3 with the default data=ordered mode can result in an inconsistent data state if file overwriting is happening, but data consistency would be achievable for files that are freshly created (I don't know if ext3 actually achieves it, though). For ext4 I don't expect any data consistency.

> So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.

Yes, but that's neither necessary nor sufficient for data consistency.

> [...] in Ubuntu the dpkg program detects if it's running on Ext4 and goes into paranoid mode where it runs 'fsync', whereas with Ext3 it does not. This causes Ubuntu installs to last significantly longer if you choose the 'Ext4' file system.

Oh, really? We have dozens of Debian systems running on ext3 (presumably without paranoid mode), and we have not had a single problem with a dpkg database corrupted by the file system. What does Ubuntu do with dpkg that makes a significant difference in the length of the installation life? And where can I find the statistics on which you base this claim?

ext4 and data consistency

Posted May 14, 2010 17:50 UTC (Fri) by njs (subscriber, #40338) [Link] (4 responses)

> We have dozens of Debian systems running on ext3 (presumably without paranoid mode), and we have not had a single problem with a dpkg database corrupted by the file system.

No filesystem goes out and corrupts the dpkg database, but dpkg failing to properly ensure on-disk consistency might make it possible for an untimely power failure (or whatever) to trash its database. How often do you pull the plug while dpkg is running?

That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore. Which is, of course, the whole problem here -- it means that as users we have to rely on external signals, like how I still don't really trust MySQL, because sure, I know they have transactions now, but do I *really* trust a group who was at one point talking about how useless they are to later have the necessary mind-numbing paranoia to catch every edge case? And hey, over here there's Postgres, whose developers clearly *are* absurdly paranoid, excellent...

Or, how you don't trust ext4, even though you have no statistics on it either, because of how Ted Ts'o's messages came across. It's just a mystery to me how his basically sensible posts gave you (and others) this image of him as some kind of data-eating monster.

ext4 and data consistency

Posted May 14, 2010 19:15 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

> That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore.

Virtualization and CoW should have made this much, much easier to test in a fine-grained fashion: halt the VM you're using to do the testing, CoW the file, start a new VM using the CoWed copy and mount it; note if it failed and if so how, kill the VM, remove the CoWed copy of the file, and let the VM run for another few milliseconds (or, if you're being completely pedantic, another instruction!)
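A rough harness along those lines might look like the following (illustrative Python; the image and device names are placeholders, and it assumes qemu-img, qemu-nbd, root access, and e2fsprogs are available):

    import subprocess

    BASE = "testdisk.img"   # raw disk image the halted guest was writing to

    def check_crash_state(step):
        # Snapshot the disk at this crash point with a qcow2 CoW overlay,
        # leaving the original image untouched for the next iteration.
        overlay = "crash-%d.qcow2" % step
        subprocess.run(["qemu-img", "create", "-f", "qcow2",
                        "-b", BASE, "-F", "raw", overlay], check=True)
        # The overlay is qcow2, not raw, so expose it as a block device
        # first (needs root and the nbd kernel module).
        subprocess.run(["qemu-nbd", "-c", "/dev/nbd0", overlay], check=True)
        try:
            # Forced read-only check: did this crash point leave the
            # filesystem metadata inconsistent?  A real harness would
            # also mount it and verify application-level data.
            return subprocess.run(["fsck.ext3", "-n", "-f",
                                   "/dev/nbd0"]).returncode
        finally:
            subprocess.run(["qemu-nbd", "-d", "/dev/nbd0"], check=True)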

ext4 and data consistency

Posted May 14, 2010 19:36 UTC (Fri) by njs (subscriber, #40338) [Link] (1 responses)

That's a neat idea. I don't think we have cycle-accurate VMs in FOSS yet, but it doesn't matter for this, you can do the halt/check after every disk write, not every instruction. It still doesn't solve a major part of the problem -- you also need to exercise all the weird corner cases that only arise under certain sorts of memory pressure, or what happens if the disk is fragmented in *this* way and has *this* queue write depth and that makes the elevator algorithm tempted to reorder writes in an unfortunate way, etc. -- but it'd be really useful!

ext4 and data consistency

Posted May 14, 2010 20:41 UTC (Fri) by nix (subscriber, #2304) [Link]

> I don't think we have cycle-accurate VMs in FOSS yet

They just need to be accurate enough that stuff works. We're not trying to make Second Reality run, here. I can't think of anything that runs on Core 2 but not AMD Phenom because of differing instruction timings!

> all the weird corner cases that only arise under certain sorts of memory pressure

Seems to me that the balloon driver is what we want; it can add memory to the guest on command, can't it also take it away? I don't see why we can't do an analogue of what SQLite does in its testing procedures (use a customized allocator that forces specific allocations to fail). The disk-fragmentation stuff would take a lot more work, probably a custom block allocator, which is a bit tough since the block allocator is one of the things we're trying to test!

ext4 and data consistency

Posted May 15, 2010 9:57 UTC (Sat) by anton (subscriber, #25547) [Link]

> No filesystem goes out and corrupts the dpkg database, but dpkg failing to properly ensure on-disk consistency might make it possible for an untimely power failure (or whatever) to trash its database.

The file system does not have to go out to do it, because it was entrusted with that data; so it can just fail to keep it consistent while staying at home. A good file system will properly ensure on-disk consistency without extra help from applications (beyond applications keeping the files consistent from the view of other processes).

> How often do you pull the plug while dpkg is running?

Never. And I doubt it happens in a significant number of cases for Ubuntu users, either. And the subset of cases where ext3 corrupts the database is even smaller. That's why I questioned drag's claim.

> That's why robustness is so hard -- it's almost impossible to test.

And that's why I find the attitude that not the file system, but applications should be responsible for data consistency in case of an OS crash or power outage absurd. Instead of testing one or a few file systems, thousands of applications would have to be tested.

ext4 and data consistency with dpkg

Posted Jun 18, 2010 5:38 UTC (Fri) by guillemj (subscriber, #49706) [Link]

> I know that this issue has cropped up again due to the fact that in
> Ubuntu the dpkg program detects if it's running on Ext4 and goes into
> paranoid mode where it runs 'fsync', whereas with Ext3 it does not.
> This causes Ubuntu installs to last significantly longer if you choose
> the 'Ext4' file system.
>
> If the dpkg folks were smart they'd enable paranoid mode on all file
> systems, except maybe Ext3 (due to Ext3's poor ability to handle that
> sort of workload)

dpkg has always done fsync() on the internal database; it was only
missing fsync() for the extracted control files from a package
to be installed/upgraded (which include maintainer scripts, for example).

Recently, dpkg started doing fsync() before rename on *all*
file systems for all extracted files from a package (there's actually
never been any kind of file system detection or special "paranoid mode").
It also does now fsync() on all database related directories.

The reason for this has been mainly the zero-length issues with ext4
(appearing even with the recent rename heuristic fixes), as we've had
no previous bug reports of broken systems due to zero-length files on
any other file system. But I consider it was still a bug for something
like dpkg to not fsync() files, just because the package status would
not match the package installed data, which is an issue, but not as
grave as having empty files left around (think boot loader, kernel or
libc as example).

But those changes produced major performance regressions *only* on
ext4 (that we know as of now), so we implemented per package delayed
fsync()s + rename()s, which helped a bit with ext4, but not enough. We
have now switched to use delayed sync() + rename()s *only* on Linux
(because it's the only place where sync() is synchronous) which brings
performance closer to the initial values. ext3 didn't have a noticeable
performance degradation during the implementation iterations.
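The batching described here splits the work into two passes: make every extracted file durable under its temporary name first, then do all the renames. A sketch of the per-package delayed-fsync variant (illustrative Python, not dpkg's actual C code, though .dpkg-new is dpkg's real temporary suffix):

    import os

    def install_package_files(files):    # files: {final_path: contents}
        staged = []
        # Pass 1: write and fsync everything under temporary names, so
        # no rename can be reordered ahead of the data it depends on.
        for path, data in files.items():
            tmp = path + ".dpkg-new"
            fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            try:
                os.write(fd, data)
                os.fsync(fd)
            finally:
                os.close(fd)
            staged.append((tmp, path))
        # Pass 2: commit all the renames only once every file is durable.
        for tmp, path in staged:
            os.rename(tmp, path)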

The still-present zero-length issues and performance issues with fsync()
have been reported to ext4 upstream; the solutions offered were to either
not use fsync() because it's slow and it's not feasible to make it faster,
use non-portable sync() or ignore the problem as it's not a usual case...
(most of the hundreds of duped reports in Ubuntu, which happens to have
ext4 as default file system in latest releases, were due to sudden power
off, and not to system crash which were a minority).

Not to mention this will be an issue if someone happens to port ext4 to
any non-Linux kernel where sync() is asynchronous; then the only options
for developers are either massive performance degradation or possible
data loss in case of abrupt system crashes/shutdown...

> As far as my personal opinion this is a advantage for using Ext4 over
> Ext3 since upgrades will be much safer on my laptop...

Well, whatever happens in maintainer scripts for example is not synced,
so there's still room for data loss with dpkg on ext4...

I've just checked if rpm is doing any kind of sync for extracted files
before rename() and it does not seem so, I'm guessing other packaging
systems might be susceptible to this issue too, but I've not checked.
This is something they might also want to consider doing, in case those
systems start offering ext4 as installation file system, or they might
start suffering the same kind of bug reports as Ubuntu saw. :/

The Next3 filesystem

Posted Jun 9, 2010 2:01 UTC (Wed) by dgm (subscriber, #49227) [Link] (9 responses)

I don't know if it's true or not, so consider the following as a bit of FUD in its literal meaning.

I have read horror stories about I/O controllers and disks messing up command queues when a power failure occurs. This is something that cannot be fixed in any sane way by the OS; the only protection is a good, recent backup. Wouldn't that be the best solution for your case too?

The Next3 filesystem

Posted Jun 9, 2010 8:32 UTC (Wed) by anton (subscriber, #25547) [Link] (8 responses)

I have done experiments on what disks do on power failure, but barriers or turning off disk write caching should help against these reorderings. I have also seen disks that destroy old data and the low-level formatting on power failure. And there are other modes in which you can lose data, so having a good backup is a good idea in any case.

Sure, if we are ready to restore our data from backup every time there is a power failure or OS crash, we can use file systems like tmpfs and ext4 for these data. But many of us want to avoid that hassle in the common case when the disk behaves properly, and we need a file system for that case that behaves properly, too. And just like IBM (now Hitachi) and Maxtor (now Seagate) drives are on my don't-buy list after the problems mentioned above, ext4 is on my don't-use list.

The Next3 filesystem

Posted Jun 10, 2010 0:36 UTC (Thu) by cmccabe (guest, #60281) [Link] (7 responses)

> Sure, if we are ready to restore our data from backup every time there is
> a power failure or OS crash, we can use file systems like tmpfs and ext4
> for these data.

This is a trollish statement. I have lost power and had OS crashes many times with ext4 and never had to restore from backups.

> But many of us want to avoid that hassle in the common
> case when the disk behaves properly, and we need a file system for that
> case that behaves properly, too. And just like IBM (now Hitachi) and
> Maxtor (now Seagate) drives are on my don't-buy list after the problems
> mentioned above, ext4 is on my don't-use list.

Even if you had a filesystem that met all of your requirements (and it's unclear if any real filesystem actually does), no consumer-grade hardware guarantees sane behavior in the event of power loss. Some hard disks are better than others, but a lot have serious problems. Some lie about when data has been flushed to disk. Others corrupt data randomly when power is lost.

Even if you buy only certain brands of drives, manufacturers re-brand hard drives all the time. It's hard to know what you're actually buying.

So *if* your hard drive doesn't ruin power-loss for you anyway, *and* your application is written sloppily enough that it doesn't fsync, *and* this application is critical to your system, then ext3 *might* be more reliable, maybe. Is it possible that you're overreacting?

The Next3 filesystem

Posted Jun 14, 2010 14:01 UTC (Mon) by Cato (guest, #7643) [Link] (6 responses)

You are lucky - the user of one Linux PC that I manage had an unfortunate habit of nudging the reset button when sitting down ... This is rather a pathological case but it resulted in serious data loss on two separate filesystems on different physical disks (one PATA and one SATA), despite no actual power loss. I was using an ext3 plus LVM setup and I'm convinced that the use of hard disk write caching was the problem.

http://lwn.net/Articles/343425/ has more details. Haven't had any more problems since stopping use of write caching and making some other changes such as ext3 data=journal. I also stopped using LVM but I don't think that's a factor - on other PCs I now just use data=journal and turn off hard disk write caching, and still use LVM.

The Next3 filesystem

Posted Jun 15, 2010 10:07 UTC (Tue) by etienne (guest, #25256) [Link] (1 responses)

That is strange: the reset button is not (and has never been) connected to the hard disk; i.e., the hard disk only becomes aware of the reset button being pressed when the BIOS re-initialises the ATA/SATA interface (so a long time after). There isn't any "reset wire" on the ATA or SATA connector.
Historically, that has led to strange bugs - like LILO being able to start after running Windows 3.1 but not from a cold boot (or the opposite), because the hard disk was reconfigured with a different number of heads and sectors per track (BIOS-only times, no LBA).
For all I can see, it is exactly the same nowadays.
The reason has always been to give the hard disk time to finish and write back its cache.

The Next3 filesystem

Posted Jun 15, 2010 20:33 UTC (Tue) by Cato (guest, #7643) [Link]

My mistake - it was actually the power button, not reset, that was pressed repeatedly. So there was power loss, which is probably what caused the problem, given the hard disk write caching.

The Next3 filesystem

Posted Jun 23, 2010 1:41 UTC (Wed) by cmccabe (guest, #60281) [Link] (3 responses)

Up until Linux 2.6.31, write barriers were always disabled while using LVM. So it's unlikely that journaling would have been very effective at preventing data loss in a system using something like ext3 + LVM + Linux 2.6.28.

Personally, I use rsync for a monthly backup and hope for the best. And when you see that first I/O error come out of /dev/sda... throw that thing in the trash. I've never seen a disk "get better" after starting to give I/O timeouts and errors.

C.

The Next3 filesystem

Posted Jun 23, 2010 1:49 UTC (Wed) by cmccabe (guest, #60281) [Link]

Just to clarify. I've never had to restore my data from backup drives. I have had hard disks go "funny" on me. This happened on two disks. Files started becoming unreadable some of the time (but not all) and I/O timeouts started happening. In both cases, I copied over my data from the affected disk to a new disk.

I haven't ever lost data as a result of a power outage, partly because I'm a compulsive user of the save button / command. I also didn't get bitten by the ext4 rename bug / controversy because I was using ext3 at the time. I don't have a UPS at home or work.

The Next3 filesystem

Posted Jun 23, 2010 12:16 UTC (Wed) by Cato (guest, #7643) [Link] (1 responses)

Since I turn off disk write caching, that bypasses the problem of write barriers being disabled in such kernels. For backups, I use DAR (like tar but with granular checksums for easier recovery from corruption) and rsnapshot, which is rsync-based but a true backup system, as it saves multiple versions and runs very fast, like rsync - it works very well as long as you don't have very large files that change frequently.

The Next3 filesystem

Posted Jun 23, 2010 13:30 UTC (Wed) by nix (subscriber, #2304) [Link]

> DAR (like tar but with granular checksums for easier recovery from corruption)

Actually par2 provides that feature. What dar gives you is multi-storage-medium support via running arbitrary scripts to change medium. tar has nothing like it.

> works very well as long as you don't have very large files that change frequently

Like, uh, VM images? I hear they're quite common these days.

(If you've got a lot of those, try rdiff-backup. It's slower than rsnapshot, but when a file changes it stores rdiff-format compressed deltas from the new file to the old one, rather than resaving the entire old file all over again.)

Filesystem Independence

Posted Jun 8, 2010 22:35 UTC (Tue) by BrucePerens (guest, #2510) [Link] (3 responses)

Why do this at the filesystem level? It's possible to do a copy-on-write image as a block device. Then you'd have filesystem independence. It would be sensitive to changes in filesystem metadata: time stamps, free block lists, inodes, etc., but I'm not convinced this would be a tremendous overhead unless you compact the filesystem or do something else that causes all blocks to be written.

Filesystem Independence

Posted Jun 8, 2010 22:45 UTC (Tue) by dlang (guest, #313) [Link] (2 responses)

one possible reason (and I don't know if it's true in this case) would be that by doing this at the filesystem level they can be smarter than something at the block level could be.

some things could be implemented as a journal ('timestamp on this file changed from X at time T' is much shorter than duplicating the entire block)

the filesystem can also be smarter about the location of blocks when it's accessing a block device more directly. One problem with doing snapshots at the block level is that you very quickly end up with a lot of seeking. The filesystem can be smarter about this.

in practice it may or may not matter, but there are enough places where it could matter that it's worth exploring.

Filesystem Independence

Posted Jun 8, 2010 23:02 UTC (Tue) by BrucePerens (guest, #2510) [Link] (1 responses)

> one possible reason (and I don't know if it's true in this case) would be that by doing this at the filesystem level they can be smarter than something at the block level could be.

Yes, of course. At the filesystem level, you can ignore everything except for the data content of directories, files, and symlinks, and some of the metadata that would be returned by stat.

What I don't have so far is proof that being smarter is really smart this time. There is a cost per filesystem format; is it more than 10% greater than the cost of doing this at the block image level?

10% seems like a low cost compared to dealing with Ted, Linus, etc. And if you get filesystem independence too, it's a deal!

Filesystem Independence

Posted Jun 8, 2010 23:05 UTC (Tue) by BrucePerens (guest, #2510) [Link]

Oops, I meant: is the cost of doing this at the block level more than 10% greater than doing it at the filesystem level? I wrote the reverse.


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds