The Next3 filesystem
The ext3 filesystem is tried and true, but it lacks a number of features deemed interesting by contemporary users. Snapshots - the ability to quickly capture the state of the filesystem at an arbitrary time - are at the top of many lists. It is currently possible to use the LVM snapshotting feature with ext3, but snapshots taken through LVM have some significant limitations. The Next3 filesystem offers an approach which might prove easier and more flexible: snapshots implemented directly in ext3.
Next3 was developed by CTERA Networks, which has started shipping it on its C200 network-attached storage device. This code has also been posted on SourceForge and proposed for merging into the mainline kernel. The Next3 filesystem adds a simple snapshot feature to ext3 in ways which are (mostly) compatible with the existing on-disk format. It looks like a useful feature, but its path into the mainline looks to be longer than its implementers might have hoped.
The Next3 filesystem is a new filesystem type - it's not just an addition to ext3. At its core, it works by creating a special, magic file to represent a snapshot of the filesystem. The files have the same apparent size as the storage volume as a whole, but they are sparse files, so they take almost no space at the outset. When a change is made to a block on disk, the filesystem must first check to see whether that block has been saved in the most recent snapshot already. If not, the affected block is moved over to the snapshot file, and a new block is allocated to replace it. Thus, over time, disk blocks migrate to the snapshot file as they are rewritten with new contents.
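The move-on-write scheme described above can be sketched in miniature. This toy Python model is my own construction, not Next3's actual code or on-disk structures: each snapshot is a dict standing in for the sparse snapshot file (block number mapped to saved data).

```python
# Toy model of Next3-style move-on-write (my own sketch, not the real
# implementation): before a block is overwritten, the old version is
# moved into the most recent snapshot's sparse store - but only once.

class Volume:
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.snapshots = []              # newest snapshot is last

    def take_snapshot(self):
        self.snapshots.append({})        # sparse: costs nothing at first

    def write_block(self, n, data):
        if self.snapshots:
            snap = self.snapshots[-1]
            if n not in snap:            # not yet saved since the snapshot
                snap[n] = self.blocks[n] # move the old block into it
        self.blocks[n] = data            # then write the new contents

vol = Volume(4)
vol.write_block(0, b"v1")
vol.take_snapshot()
vol.write_block(0, b"v2")
print(vol.snapshots[0][0])               # b"v1": the preserved old contents
print(vol.blocks[0])                     # b"v2": the live data
```

Note how a snapshot costs nothing at creation time and grows only as blocks are first modified, matching the sparse-file behavior the article describes.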
Gaining read-only access to a snapshot is a simple matter of doing a loopback mount of the snapshot file as an ext2 filesystem. The snapshot file is sufficiently magic that any attempts to read blocks in the holes (which represent blocks that have not been changed since the snapshot was taken) will be satisfied from a later snapshot - which will have captured the contents of that block when it was eventually changed - or from the underlying storage device. Deleting a snapshot requires moving changed blocks into the previous snapshot, if it exists, because the deleted snapshot holds blocks which are logically part of the earlier snapshots.
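The hole-resolution rule for snapshot reads can be sketched the same way (again a simplification of my own, not the Next3 implementation): a hole in snapshot i means the block was unchanged when that snapshot was taken, so the value comes from the first later snapshot that saved it, or from the live volume.

```python
# Resolving a read from snapshot i (my own simplified sketch):
# snapshots are ordered oldest to newest, each a dict of
# block number -> data saved when that block was first rewritten.

def read_block(snapshots, live, i, n):
    for snap in snapshots[i:]:       # this snapshot, then later ones
        if n in snap:
            return snap[n]
    return live[n]                   # never rewritten: still on the volume

snapshots = [{0: b"v1"}, {}]         # snapshot 1 has a hole at block 0
live = {0: b"v3", 1: b"x"}
print(read_block(snapshots, live, 0, 0))   # b"v1" from snapshot 0 itself
print(read_block(snapshots, live, 1, 0))   # b"v3" falls through to live
```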
The changes to the ext3 on-disk format are minimal, to the point that a Next3 filesystem can be mounted by the ordinary ext3 code. If snapshots exist, though, ext3 cannot be allowed to modify the filesystem, lest the changed blocks fail to be saved in the snapshot. So, when snapshots exist on the filesystem, it will be marked with a feature flag which forces ext3 to mount the filesystem readonly.
On the performance side, the news is said to be mostly good. Writes will take a little longer due to the need to move the old block to a snapshot file. The worst performance impact is seemingly on truncate operations; these may have to save a large number of blocks and can get a lot slower. It is also worth noting that the moving of modified blocks to the snapshot file will, over time, wreck the nice, contiguous on-disk format that ext3 tries so hard to create, with an unfortunate effect on streaming read performance. Files which must not be fragmented can be marked with a special flag which will cause blocks to be copied into the snapshot file rather than moved; that will slow writes further, but will keep the file contiguous on disk.
Next3 developer Amir Goldstein requested relatively quick review of the patches because he is trying to finalize some of the on-disk formatting. The answer he got from Ted Ts'o was probably not quite what he was looking for.
Amir's response was that, while porting the patches to ext4 is on the "we'll get around to it someday" list, that port is not an easy thing to do. The biggest problem, apparently, is making the movement of blocks into the snapshot file work properly with ext4's extent-oriented format. Beyond that, Amir says, he's not actually trying to get the changes into ext3 - he wants to merge a separate filesystem called Next3 which happens to be mostly compatible with ext3.
The "separate Next3" approach is unlikely to fly very far, though. As Ted put it, ext2, ext3, and ext4 are really just different implementations of the same basic filesystem format; this format has never really been forked. Next3, as a separate filesystem, would be a fork of the format. The fact that Next3 has taken over some data structure fields which are used for different purposes in ext4 has not helped matters.
The answer appears fairly clear: patches adding the snapshot feature might be welcome, but not as a fork of the ext3 filesystem. At a bare minimum, the filesystem format will have to be changed to avoid conflicts with ext4, but the real solution appears to be simply implementing the patches on top of ext4 instead of ext3. That is a fair amount of extra work which might have been avoided had the Next3 developers talked with the community prior to starting to code.
Index entries for this article: Kernel: Filesystems/ext3
Posted May 13, 2010 4:56 UTC (Thu)
by spotter (guest, #12199)
[Link] (1 responses)
Posted May 13, 2010 7:16 UTC (Thu)
by amir73il (subscriber, #66165)
[Link]
See also:
Posted May 13, 2010 9:03 UTC (Thu)
by ringerc (subscriber, #3071)
[Link] (4 responses)
- They require LVM, which has its own issues and isn't always desirable
- They require storage to be reserved for them and allocated to them in advance
- They don't gracefully age out and aren't quietly removed when they run out of backing store. In fact, I recently had a server fail to boot because of an LVM snapshot that'd filled up.
- They need the file system to be capable of being mounted read-only from a dirty state. Not all file systems can handle this.
I find LVM snapshots to be well suited to taking backups, where I need to snapshot a volume, read the snapshot contents, unmount the snapshot and destroy it.
I find them rather less than ideal when I just want to keep a few snapshots around to provide coarse versioning, as is so useful on a Windows server with VSS. An in-filesystem snapshot facility would be really, really nice for this sort of thing, and one that didn't require loopback mounts (instead providing virtual directory access or the like) would be truly fantastic for backups.
Posted May 13, 2010 11:44 UTC (Thu)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
This isn't some partisan thing, I don't care if you migrate to git or Perforce or even Team Foundation Server. But get off Visual SourceSafe. Once you're safely running a real version control system you'll find that your headache goes away without any need for "coarse versioning" using snapshots.
Posted May 13, 2010 13:27 UTC (Thu)
by ringerc (subscriber, #3071)
[Link] (1 responses)
Here, VSS = "Volume Shadow Copy Service". Yes, Microsoft uses the acronym "VSS" for the Volume Shadow Copy Service despite having an existing claim on that TLA via Visual SourceSafe, i.e. VSS.
You'd think they'd refer to it as VSCS, but no....
What I'm talking about is a facility in Microsoft servers (and client operating systems, but it's less important there) that's based on the Volume Shadow Copy Service where they can make automatic snapshots of their file systems on a schedule, and retain them until the total size of all snapshots reaches an admin-configured limit, at which point the oldest snapshot is dropped to make room.
The same underlying snapshot service is used to provide efficient image-based backup. In fact, on Win2k8 you can have the server maintain a bootable backup disk image of itself on a raw disk - I use an iSCSI volume on my Linux backup server. The server uses the Volume Shadow Copy Service to only update dirty parts of the image at each backup run. It's nice to have for a Windows-based server OS where, unlike Linux/BSD, you can't just rsync the whole file system contents to another box and expect to be able to boot it.
(For what it's worth, I use svn heavily, though am drifting git-wards now that I've started actually using it and discovered how seriously nice it is these days. You won't catch me near Visual SourceSafe unless it's with an axe. In an amusing confluence of these two topics, I now maintain all my servers' /etc in git and git-push them to the backup server every night, 'cos it's more convenient than Bacula when reverting changes.)
Posted May 13, 2010 13:48 UTC (Thu)
by nye (subscriber, #51576)
[Link]
>You'd think they'd refer to it as VSCS, but no....
That would be 'Volume Snapshot Service', supposedly. The acronym collision probably never occurred to anyone at Microsoft since (it is widely held, at least) nobody at Microsoft actually uses Visual Source Safe.
It's one of those things that was bought, mangled, and re-released to fill a gap in their product line, targeted at those people who must use all-Microsoft, all-the-time even if MS themselves think the tool is worthless.
Posted May 14, 2010 15:36 UTC (Fri)
by mebrown (subscriber, #7960)
[Link]
In fact, in Fedora 13, there is a yum plugin that will snapshot the system before every yum transaction, meaning you can rollback to any previous system state if an upgrade goes awry.
Posted May 13, 2010 10:16 UTC (Thu)
by anton (subscriber, #25547)
[Link] (46 responses)
As for ext4 being the future: It does not have a single feature
that I want. Worse, it has the misfeature of worse crash consistency.
So I am happy that Next3 is based on ext3, not ext4, so I don't have
to choose between snapshots and crash consistency.
Posted May 13, 2010 11:16 UTC (Thu)
by rahulsundaram (subscriber, #21946)
[Link] (45 responses)
Posted May 13, 2010 13:42 UTC (Thu)
by anton (subscriber, #25547)
[Link] (44 responses)
Posted May 13, 2010 16:57 UTC (Thu)
by cortana (subscriber, #24596)
[Link] (33 responses)
Posted May 13, 2010 18:38 UTC (Thu)
by anton (subscriber, #25547)
[Link] (32 responses)
He fixed one particularly frequent cause of data loss in ext4 (involving
writing a file, then renaming it across an old one), but nothing else.
So people will see data loss with ext4 less frequently than before,
but not as infrequently as with ext3 (or has this data loss feature
been backported from ext4 to ext3 to give us fewer reasons to stick
with ext3?).
Posted May 13, 2010 20:08 UTC (Thu)
by rahulsundaram (subscriber, #21946)
[Link] (9 responses)
Posted May 14, 2010 12:54 UTC (Fri)
by anton (subscriber, #25547)
[Link] (8 responses)
Posted May 14, 2010 14:00 UTC (Fri)
by rahulsundaram (subscriber, #21946)
[Link]
Posted May 21, 2010 15:08 UTC (Fri)
by Duncan (guest, #6647)
[Link] (6 responses)
Presumably you used tune2fs or simply fstab to ensure your ext3 mounts remain stable with data=ordered after the kernel in question (was it 2.6.30 or 2.6.31?), right?
What'd be interesting to see would be how the distributions have handled it since. Did they go with the new ext3 data=writeback default, or have they either reverted that commit or now default their userspace to specify data=ordered?
I know at least one guy who was complaining of ext3 instability after installing a new kernel due to that; it went away when he returned to data=ordered for his ext3 volumes. The context of that discussion was the pan (nntp client) user list, IIRC.
Me, I've been on reiserfs for years on both my main system and (more recently) my netbook, and have been extremely happy with it since data=ordered became its default (2.6.6 according to a google hit on another LWN comment of mine). My most recent experience with extX is on no-journal ext4 formatted USB flash-based thumbdrives, where journaling isn't a good idea. I've been following btrfs with interest, and expect I'll upgrade to it once a few more of the kinks get worked out. (I've seen hints that the current 2.6.35 cycle will reduce the strength of the warning for its kernel config item, but I don't follow the btrfs list or lkml, and any detail of even plans has been harder to come by on the broader community sites such as LWN, HO, LXer, etc, that I follow.)
Duncan
Posted May 22, 2010 19:15 UTC (Sat)
by anton (subscriber, #25547)
[Link] (5 responses)
I am a little worried, though, because of what happened after
data=journal was no longer the default; I then read that using
data=journal resulted in corrupt file systems; I read that for a
significant amount of time, and never read that this bug has been
fixed (but haven't seen such reports for some time).
So if they made data=ordered non-default in 2.6.31 or some kernel,
will they really care if it works? My confidence is limited. We
should probably better stick with 2.6.30 until we migrate off extx file
systems completely.
Posted May 22, 2010 20:36 UTC (Sat)
by nix (subscriber, #2304)
[Link] (4 responses)
(btw, you can put mount options in the superblock, and avoid modifying /etc/fstab.)
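For the concrete mechanics of both routes (device name and mountpoint below are examples, not from the comments): fstab pins the option per mount, while tune2fs stores it in the superblock so it follows the filesystem rather than the fstab file.

```shell
# /etc/fstab route - pin data=ordered explicitly (example device):
#   /dev/sda2  /  ext3  data=ordered,errors=remount-ro  0  1

# Superblock route - survives fstab rewrites, but must be redone if
# the filesystem is recreated on another disk:
#   tune2fs -o journal_data_ordered /dev/sda2
```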
Posted May 23, 2010 11:44 UTC (Sun)
by anton (subscriber, #25547)
[Link] (3 responses)
Modifying fstab is not a big deal; why would I want to avoid it?
The problem with doing it in the superblock is that I have to do it
again when I transfer the system to another disk.
Posted May 23, 2010 11:50 UTC (Sun)
by cortana (subscriber, #24596)
[Link] (1 responses)
Posted May 23, 2010 13:29 UTC (Sun)
by anton (subscriber, #25547)
[Link]
Another way would be to check CONFIG_EXT3_DEFAULTS_TO_ORDERED in
the kernel config file.
Posted May 23, 2010 13:55 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted May 13, 2010 20:19 UTC (Thu)
by drag (guest, #31333)
[Link] (21 responses)
Ext3, Ext4, XFS, JFS, etc. - all of these have the same consistency problems you're complaining about.
The difference is that, due to a fluke of Ext3's design, the window in which the 'zero length files' would be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).
What is more, with 2.6.30 a patch was added to Ext4 that attempts to detect and replicate Ext3's behavior, in order to maintain backwards compatibility with application developers' assumptions about file system behavior with regard to renames.
So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.
---------------------------
I know that this issue has cropped up again due to the fact that in Ubuntu the dpkg program detects if it's running on Ext4 and goes into paranoid mode where it runs 'fsync', whereas with Ext3 it does not. This causes Ubuntu installs to take significantly longer if you choose the Ext4 file system.
If the dpkg folks were smart they'd enable paranoid mode on all file systems, except maybe Ext3 (due to Ext3's poor ability to handle that sort of workload).
As far as my personal opinion goes, this is an advantage of Ext4 over Ext3, since upgrades will be much safer on my laptop...
---------------------------
The one feature that I like about Ext4 is that it takes a minute or two to run a full fsck on my home directory, versus upwards of 15-20 minutes for the same operation on Ext3.
Posted May 13, 2010 20:45 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (11 responses)
Posted May 13, 2010 23:07 UTC (Thu)
by njs (subscriber, #40338)
[Link] (10 responses)
Posted May 14, 2010 13:43 UTC (Fri)
by anton (subscriber, #25547)
[Link] (9 responses)
E.g., I expect data consistency from a file system; Linux file
systems don't give any guarantee on that, but at least ext3 does ok in
most cases; some people may consider this a fluke (but is Stephen
Tweedie, the creator of ext3 among them?), but that's the reality.
Other people expect maximum speed. And for these people Linux
provides tmpfs and ext4.
Given this choice, ext4 is certainly not a replacement for ext3 for
me.
Posted May 14, 2010 15:53 UTC (Fri)
by bronson (subscriber, #4806)
[Link] (5 responses)
Posted May 15, 2010 8:36 UTC (Sat)
by anton (subscriber, #25547)
[Link] (4 responses)
Posted May 20, 2010 19:23 UTC (Thu)
by oak (guest, #2786)
[Link] (3 responses)
Wouldn't "strace -f" be handier for that kind of thing? With that you also notice a lot of other stuff that the software does.
Strace-account script gives an overview of file accesses in the strace output:
Posted May 21, 2010 12:04 UTC (Fri)
by anton (subscriber, #25547)
[Link] (2 responses)
Posted Jun 8, 2010 22:17 UTC (Tue)
by elanthis (guest, #6227)
[Link] (1 responses)
Posted Jun 9, 2010 9:06 UTC (Wed)
by anton (subscriber, #25547)
[Link]
Posted May 14, 2010 17:33 UTC (Fri)
by njs (subscriber, #40338)
[Link] (2 responses)
But maybe there are other cases where ext3 does better than ext4. You must have some excellent ones in mind to lump ext4 in with tmpfs... can you give any examples?
Posted May 15, 2010 9:19 UTC (Sat)
by anton (subscriber, #25547)
[Link] (1 responses)
As for an example: Consider a process writing file A and then file
B. With ext4 I expect that it can happen that after recovery B is
present and A is not or is empty. With ext3 I expect that this does
not happen. But given that I did not find any documented guarantees
in Documentation/filesystems/ext3.txt, maybe we should lump ext3 with
tmpfs, too.
Still, my search brought up a Linux file system that gives
guarantees: In nilfs2.txt it says:
Posted May 16, 2010 3:57 UTC (Sun)
by njs (subscriber, #40338)
[Link]
That's fine. I'd like data consistency too. But I still don't mount my disks with -o sync, nor does pretty much anyone else, even most of the people who say they want data consistency. That's the reality that fs developers live in.
Maybe on SSD (where nilfs2 is designed to live), we'll be able to get guaranteed data consistency as a matter of course. That'll be nice if it happens.
Posted May 13, 2010 21:23 UTC (Thu)
by mjg59 (subscriber, #23239)
[Link] (1 responses)
My understanding is that ext3 would always have allocated the blocks for the new file and written it before the rename would occur. The 0-length file issue was due to ext4 performing delayed allocation and performing the rename before the data ever got written.
So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.
This is a misunderstanding. The desired behaviour was that operations occur in order. It's not terribly important to a user if they lose the configuration changes they made before a crash - it's pretty significant if the rename was performed before the data hit disk, resulting in the complete loss of their configuration.
It's true that POSIX doesn't require that filesystems behave this way. There's many things that POSIX doesn't require but which we expect anyway because the alternative is misery.
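The ordering that applications were relying on is the classic write-a-new-file-then-rename pattern; with an explicit fsync() it is safe even on a delayed-allocation filesystem. A minimal sketch (the helper name and structure are mine, not from any of the software discussed here):

```python
import os
import tempfile

def atomic_replace(path, data):
    # Write a sibling temp file, force it to disk, then rename over the
    # target: readers (and, thanks to the fsync, crash recovery) see
    # either the old contents or the new, never a zero-length file.
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # data on disk before the rename
        os.rename(tmp, path)         # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

For full durability the containing directory should also be fsync()ed after the rename; that detail is omitted here for brevity.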
Posted May 14, 2010 12:32 UTC (Fri)
by ricwheeler (subscriber, #4980)
[Link]
Applications still have to understand when to use fsync() properly to move data from the page cache out to persistent storage (on disk, ssd, etc).
Posted May 14, 2010 13:21 UTC (Fri)
by anton (subscriber, #25547)
[Link] (5 responses)
Concerning Linux file systems, I am pretty sure that ext3 with the
default data=ordered mode can result in an inconsistent data state if
file overwriting is happening, but data consistency would be
achievable for files that are freshly created (I don't know if ext3
actually achieves it, though). For ext4 I don't expect any data
consistency.
Posted May 14, 2010 17:50 UTC (Fri)
by njs (subscriber, #40338)
[Link] (4 responses)
No filesystem goes out and corrupts the dpkg database, but dpkg failing to properly ensure on-disk consistency might make it possible for an untimely power failure (or whatever) to trash its database. How often do you pull the plug while dpkg is running?
That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore. Which is, of course, the whole problem here -- it means that as users we have to rely on external signals, like how I still don't really trust MySQL, because sure, I know they have transactions now, but do I *really* trust a group who was at one point talking about how useless they are to later have the necessary mind-numbing paranoia to catch every edge case? And hey, over here there's Postgres, whose developers clearly *are* absurdly paranoid, excellent...
Or, how you don't trust ext4, even though you have no statistics on it either, because of how Ted Ts'o's messages came across. It's just a mystery to me how his basically sensible posts gave you (and others) this image of him as some kind of data-eating monster.
Posted May 14, 2010 19:15 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted May 14, 2010 19:36 UTC (Fri)
by njs (subscriber, #40338)
[Link] (1 responses)
Posted May 14, 2010 20:41 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted May 15, 2010 9:57 UTC (Sat)
by anton (subscriber, #25547)
[Link]
Posted Jun 18, 2010 5:38 UTC (Fri)
by guillemj (subscriber, #49706)
[Link]
dpkg has always done fsync() on the internal database, it was only
As of recently, dpkg started doing fsync() before rename on *all*
The reason for this has been mainly the zero-length issues with ext4
But those changes produced major performance regressions *only* on
The still present zero-length issues and performance issues with fsync()
Not to mention this will be an issue if someone happens to port ext4 to
> As far as my personal opinion this is a advantage for using Ext4 over
Well, whatever happens in maintainer scripts for example is not synced,
I've just checked if rpm is doing any kind of sync for extracted files
Posted Jun 9, 2010 2:01 UTC (Wed)
by dgm (subscriber, #49227)
[Link] (9 responses)
I have read horror stories about I/O controllers and disks messing up command queues when a power failure occurs. This is something that cannot be fixed in any sane way by the OS; the only protection is a good recent backup. Wouldn't that be the best solution for your case too?
Posted Jun 9, 2010 8:32 UTC (Wed)
by anton (subscriber, #25547)
[Link] (8 responses)
Sure, if we are ready to restore our data from backup every time
there is a power failure or OS crash, we can use file systems like
tmpfs and ext4 for these data. But many of us want to avoid that
hassle in the common case when the disk behaves properly, and we need
a file system for that case that behaves properly, too. And just
like IBM (now Hitachi) and Maxtor (now Seagate) drives are on my
don't-buy list after the problems mentioned above, ext4 is on my
don't-use list.
Posted Jun 10, 2010 0:36 UTC (Thu)
by cmccabe (guest, #60281)
[Link] (7 responses)
This is a trollish statement. I have lost power and had OS crashes many times with ext4 and never had to restore from backups.
> But many of us want to avoid that hassle in the common
Even if you had a filesystem that met all of your requirements (and it's unclear if any real filesystem actually does), no consumer-grade hardware guarantees sane behavior in the event of power loss. Some hard disks are better than others, but a lot have serious problems. Some lie about when data has been flushed to disk. Others corrupt data randomly when power is lost.
Even if you buy only certain brands of drives, manufacturers re-brand hard drives all the time. It's hard to know what you're actually buying.
So *if* your hard drive doesn't ruin power-loss for you anyway, *and* your application is written sloppily enough that it doesn't fsync, *and* this application is critical to your system, then ext3 *might* be more reliable, maybe. Is it possible that you're overreacting?
Posted Jun 14, 2010 14:01 UTC (Mon)
by Cato (guest, #7643)
[Link] (6 responses)
http://lwn.net/Articles/343425/ has more details. Haven't had any more problems since stopping use of write caching and making some other changes such as ext3 data=journal. I also stopped using LVM but I don't think that's a factor - on other PCs I now just use data=journal and turn off hard disk write caching, and still use LVM.
Posted Jun 15, 2010 10:07 UTC (Tue)
by etienne (guest, #25256)
[Link] (1 responses)
Posted Jun 15, 2010 20:33 UTC (Tue)
by Cato (guest, #7643)
[Link]
Posted Jun 23, 2010 1:41 UTC (Wed)
by cmccabe (guest, #60281)
[Link] (3 responses)
Personally, I use rsync for monthly backups, and hope for the best. And when you see that first I/O error come out of /dev/sda... throw that thing in the trash. I've never seen a disk "get better" after starting to give I/O timeouts and errors.
C.
Posted Jun 23, 2010 1:49 UTC (Wed)
by cmccabe (guest, #60281)
[Link]
I haven't ever lost data as a result of a power outage, partly because I'm a compulsive user of the save button / command. I also didn't get bitten by the ext4 rename bug / controversy because I was using ext3 at the time. I don't have a UPS at home or work.
Posted Jun 23, 2010 12:16 UTC (Wed)
by Cato (guest, #7643)
[Link] (1 responses)
Posted Jun 23, 2010 13:30 UTC (Wed)
by nix (subscriber, #2304)
[Link]
(If you've got a lot of those, try rdiff-backup. It's slower than rsnapshot, but when a file changes it stores rdiff-format compressed deltas from the new file to the old one, rather than resaving the entire old file all over again.)
Posted Jun 8, 2010 22:35 UTC (Tue)
by BrucePerens (guest, #2510)
[Link] (3 responses)
Posted Jun 8, 2010 22:45 UTC (Tue)
by dlang (guest, #313)
[Link] (2 responses)
some things could be implemented as a journal ('timestamp on this file changed from X at time T' is much shorter than duplicating the entire block)
the filesystem can also be smarter about the location of blocks when it's accessing a block device more directly. One problem with doing snapshots at the block level is that you very quickly end up with a lot of seeking. The filesystem can be smarter about this.
in practice it may or may not matter, but there are enough places where it could matter that it's worth exploring.
Posted Jun 8, 2010 23:02 UTC (Tue)
by BrucePerens (guest, #2510)
[Link] (1 responses)
Yes, of course. At the filesystem level, you can ignore everything except for the data content of directories, files, and symlinks, and some of the metadata that would be returned by stat.
What I don't have so far is proof that being smarter is really smart this time. There is a cost per filesystem format, is it more than 10% greater than the cost of doing this at the block image level? 10% seems like a low cost compared to dealing with Ted, Linus, etc. And if you get filesystem independence too, it's a deal!
Posted Jun 8, 2010 23:05 UTC (Tue)
by BrucePerens (guest, #2510)
[Link]
The Next3 filesystem
On top of the differences list is compatibility, or in other words:
Delete the Next3 snapshot files and you are back to plain old Ext3.
http://sourceforge.net/apps/mediawiki/next3/index.php?tit...
The Next3 filesystem
VSS = Visual SourceSafe?
VSS = Volume Shadow copy Service (Yes, I know)
The Next3 filesystem
I would really like to use snapshots, so I would like to see Next3 in
the mainline.
The Next3 filesystem
Data consistency after crashes or power outages.
The Next3 filesystem
Ted Ts'o still believes that data consistency on OS crashes (not
application crashes) is the job of the applications (with fsync()
etc.), not of the file system. And most applications don't do that, and those few that try it
are probably not well tested against that (because that's extremely
hard).
ext4 and data consistency
Am I? That's Ted Ts'o's position as reported on, e.g., LWN. But
maybe you can show me where I was wrong in my statement of his
position. And my impression is that if it was just up to him, he
would not have made the rename fix.
ext4 and data consistency
What bothers me is how they reduced the guarantees and
stability of the long mature ext3 filesystem in the aftermath of all
this, by defaulting it to data=writeback, a change from the old
default data=ordered.
Yes, that's what was at the back of my mind when I wrote about
"backporting the data loss feature from ext4 to ext3".
Presumably you used tune2fs or simply fstab to ensure your
ext3 mounts remain stable with data=ordered after the kernel in
question (was it 2.6.30 or 2.6.31?), right?
The youngest kernel we have is 2.6.30, and according to /proc/mounts
it mounts our ext3 file systems with data=ordered. I guess we will go
the fstab route once we get a kernel that defaults to data=writeback.
ext4 and data consistency
And new bugs are introduced, and if they are for a non-default option
like (now) data=ordered, they won't get noticed in time, and they
won't get fixed for quite some time; at least that's what the
non-default data=journal episode teaches. So what's higher: the risk
of data loss from a well-known kernel, or from a new kernel in a
non-default setting? Choosing the latter seems foolish to me.
ext4 and data consistency
One way is to mount such a file system with the default value (without
overriding the default with tune2fs or in fstab) and then check the
actual options in /proc/mounts. That is what I do.
ext4 and data consistency
The difference is that, due to a fluke of Ext3's design, the window in which the 'zero length files' would be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever)
That window must be vanishingly small, because neither I nor anyone else has ever been able to make ext3 create zero-length files in the way you describe. Quirk or not, rename atomicity is an important feature that works just fine on a running filesystem, and filesystems ought to preserve its qualities on a restart.
Allowing random garbage to exist on the filesystem after a restart is terrible policy and reflects a profound ignorance on the part of filesystem developers as to how applications and users expect their systems to work.
ext4 and data consistency
[...] trade-offs fs developers have to make, the disparity between what
people want from a fs and what fs's have historically provided, etc.
Yes, different people expect different things from file systems.
Keep in mind that if you go two web-pages over, you can
find people tearing into POSIX for providing *too* strong guarantees
and how we absolutely need to relax them for real-world usage (atime
is the obvious example, but there are others).
Yes, there are different kinds of users. I lost quite a bit of time
because Linux does not follow POSIX atime semantics by default
anymore. I find them useful in my real-world usage. Those who don't
want atime have been able to use noatime for a long time, and now
there is relatime, but making it the default (especially with mounts
that don't know about strictatime) is a bad practice.
ext4 and data consistency
I use atime to check whether some complex software really does access
the files that I think it does.
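That kind of check can be done with stat (the temp file below is just an illustration, not anton's actual workflow); note that the relatime default only updates atime when it is older than mtime, so strictatime may be needed for repeated reads to show up:

```shell
# Did anything read this file since it was written? Compare atime/mtime.
f=$(mktemp)
echo data > "$f"
cat "$f" > /dev/null                  # a read access
stat -c 'atime: %x%nmtime: %y' "$f"
rm -f "$f"
```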
ext4 and data consistency
http://blogs.gnome.org/mortenw/2005/12/14/strace-account/
It would not be handier exactly because it tells me a huge amount of other stuff the software does and that I am not interested in.
ext4 and data consistency
And how is that handier than just doing "stat <file>"?
ext4 and data consistency
Speed is too important
For whom? For me data consistency is much more important. Before
barriers were supported, we ran ext3 on IDE disks without write
caching, and that's really slow. The file system was still fast
enough.
Stephen Tweedie didn't make data=journal the default, either.
Actually he did, at least at the start. Later it got changed (by
whom?) to data=ordered; that still has the potential to provide data
consistency unless existing files are overwritten.
order=strict Apply strict in-order semantics that preserves sequence
of all file operations including overwriting of data
blocks. That means, it is guaranteed that no
overtaking of events occurs in the recovered file
system after a crash.
Yes, that's exactly the guarantee I want to see. This means that any
application that keeps its files consistent as visible from other
processes will also have consistent files after an OS crash.
ext4 and data consistency
The difference is that, due to a fluke of Ext3's design, the window in which the 'zero-length files' would be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).
ext4 and data consistency
There is no file system in Linux that tries to assure
that renames are atomic functions.
That may be true (wrt. what happens on crashes; I do hope that they
are all atomic wrt state visible to other processes in regular
operations); I certainly have never seen any Linux file system give
any guarantees about data consistency on crashes. Not doing renames
properly would be pretty poor of Linux, though, given that this is a
case where even the old BSD FFS goes to extra lengths to ensure at
least meta-data consistency (it never cares about your data).
So ya.. apparently that 'fsync' was always needed by
application developers if they wanted to ensure that data was written
to disk in a timely fashion.
Yes, but that's neither necessary nor sufficient for data consistency.
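To make the "neither necessary nor sufficient" point concrete: the sequence usually recommended for a crash-safe replace involves more than a bare fsync() of the data. A minimal sketch in Python (the helper name is mine, not from any of the programs discussed here):

```python
import os

def durable_replace(path, data):
    """Replace path's contents so that, after a crash, readers see
    either the complete old version or the complete new one."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                  # push file data to stable storage first
    finally:
        os.close(fd)
    os.rename(tmp, path)              # atomic switch of the directory entry
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)                 # make the rename itself durable
    finally:
        os.close(dfd)
```

The ordering is the point: fsync() of the file before the rename() is what rules out the zero-length-file case, and the fsync() of the directory is needed before the new name itself can be relied upon after a crash.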
[...] in Ubuntu the dpkg program detects if it's running on
Ext4 and goes into paranoid mode where it runs 'fsync', whereas with
Ext3 it does not. This causes Ubuntu installs to last significantly
longer if you choose the 'Ext4' file system.
Oh, really? We have dozens of Debian systems running on ext3
(presumably without paranoid mode), and we have not had a single
problem with a dpkg database corrupted by the file system. What does
Ubuntu do with dpkg that makes a significant difference in
the length of the installation life? And where can I find the
statistics on which you base this claim?
ext4 and data consistency
That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore.
Virtualization and CoW should have made this much, much easier to test in a fine-grained fashion: halt the VM you're using for the testing, CoW the disk image, start a new VM on the CoWed copy and mount it; note whether it failed and, if so, how; then kill the VM, remove the CoWed copy, and let the original VM run for another few milliseconds (or, if you're being completely pedantic, another instruction!)
ext4 and data consistency
I don't think we have cycle-accurate VMs in FOSS yet
They just need to be accurate enough that stuff works. We're not trying to make Second Reality run, here. I can't think of anything that runs on Core 2 but not AMD Phenom because of differing instruction timings!
all the weird corner cases that only arise under certain sorts of memory pressure
Seems to me that the balloon driver is what we want; it can add memory to the guest on command, can't it also take it away? I don't see why we can't do an analogue of what SQLite does in its testing procedures (use a customized allocator that forces specific allocations to fail). The disk-fragmentation stuff would take a lot more work, probably a custom block allocator, which is a bit tough since the block allocator is one of the things we're trying to test!
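The SQLite approach mentioned above can be mimicked for I/O as well as allocation: interpose a counter that fails exactly the Nth operation, run the update once per possible failure point, and check the consistency invariant after every run. A toy sketch under those assumptions (all names are mine, not SQLite's):

```python
import os

class FaultInjector:
    """Fail exactly the Nth intercepted operation, SQLite-style."""
    def __init__(self, fail_at):
        self.ops, self.fail_at = 0, fail_at

    def step(self):
        self.ops += 1
        if self.ops == self.fail_at:
            raise IOError("injected fault at op %d" % self.ops)

def update(path, data, inj):
    """Rename-based update with an injectable fault before each step."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        inj.step(); f.write(data)
        inj.step(); f.flush(); os.fsync(f.fileno())
    inj.step(); os.rename(tmp, path)

def crash_test(path):
    """After a fault at any step, the file must hold either the
    complete old contents or the complete new contents."""
    for fail_at in range(1, 4):
        with open(path, "wb") as f:
            f.write(b"old")
        try:
            update(path, b"new", FaultInjector(fail_at))
        except IOError:
            pass                      # simulated crash at this step
        with open(path, "rb") as f:
            assert f.read() in (b"old", b"new")
```

This only exercises the application's commit protocol, of course; testing the filesystem's own crash behavior still needs something like the VM trick above.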
ext4 and data consistency
No filesystem goes out and corrupts the dpkg database,
but dpkg failing to properly ensure on-disk consistency might make it
possible for an untimely power failure (or whatever) to trash its
database.
The file system does not have to go out to do it, because it was
entrusted with that data; so it can just fail to keep it consistent
while staying at home. A good file system will properly ensure
on-disk consistency without extra help from applications (beyond
applications keeping the files consistent from the view of other
processes).
How often do you pull the plug while dpkg is running?
Never. And I doubt it happens in a significant number of cases for
Ubuntu users, either. And the subset of cases where ext3 corrupts the
database is even smaller. That's why I questioned drag's claim.
That's why robustness is so hard -- it's almost impossible to test.
And that's why I find the attitude that not the file system, but
applications should be responsible for data consistency in case of an
OS crash or power outage absurd. Instead of testing one or a few file
systems, thousands of applications would have to be tested.
ext4 and data consistency with dpkg
> Ubuntu the dpkg program detects if it's running on Ext4 and goes into
> paranoid mode where it runs 'fsync', whereas with Ext3 it does not. This
> causes Ubuntu installs to last significantly longer if you choose
> the 'Ext4' file system.
>
> If the dpkg folks were smart they'd enable paranoid mode on all file
> systems, except maybe Ext3 (due to Ext3's poor ability to handle that
> sort of workloads)
missing doing fsync() for the extracted control files from a package
to be installed/upgraded (which include maintainer scripts for example).
file systems for all extracted files from a package (there's actually
never been any kind of file system detection or special "paranoid mode").
It also does now fsync() on all database related directories.
(appearing even with the recent rename heuristic fixes), as we've had
no previous bug reports of broken systems due to zero-length files on
any other file system. But I consider it was still a bug for something
like dpkg to not fsync() files, just because the package status would
not match the package installed data, which is an issue, but not as
grave as having empty files left around (think boot loader, kernel or
libc as example).
ext4 (that we know as of now), so we implemented per package delayed
fsync()s + rename()s, which helped a bit with ext4, but not enough. We
have now switched to use delayed sync() + rename()s *only* on Linux
(because it's the only place where sync() is synchronous) which brings
performance closer to the initial values. ext3 didn't have a noticeable
performance degradation during the implementation iterations.
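The per-package batching described here amounts to separating the work into passes, so the expensive flushes are amortized: extract everything, flush everything, then rename everything. A rough sketch of that idea (the function is mine; only the `.dpkg-new` staging suffix is dpkg's):

```python
import os

def install_package(entries):
    """entries: list of (path, bytes). Stage files as *.dpkg-new,
    flush them in one pass, then rename them in one pass, batching
    the fsync()s per package instead of paying one per file inline."""
    staged = []
    for path, data in entries:           # extraction pass
        tmp = path + ".dpkg-new"
        with open(tmp, "wb") as f:
            f.write(data)
        staged.append((tmp, path))
    for tmp, _path in staged:            # flush pass
        fd = os.open(tmp, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    for tmp, path in staged:             # rename pass
        os.rename(tmp, path)
```

The Linux-only variant described in the comment replaces the per-file flush pass with a single sync(), which is cheaper there because Linux's sync() actually waits for the writes to complete.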
have been reported to ext4 upstream, the solutions offered were to either
not use fsync() because it's slow and it's not feasible to make it faster,
use non-portable sync() or ignore the problem as it's not a usual case...
(most of the hundreds of duped reports in Ubuntu, which happens to have
ext4 as the default file system in the latest releases, were due to sudden
power off, and not to system crashes, which were a minority).
any non-Linux kernel where sync() is asynchronous, then the only options
for developers are either massive performance degradation or possible
data loss in case of abrupt system crashes/shutdown...
> Ext3 since upgrades will be much safer on my laptop...
so there's still room for data loss with dpkg on ext4...
before rename() and it does not seem so, I'm guessing other packaging
systems might be susceptible to this issue too, but I've not checked.
This is something they might also want to consider doing, in case those
systems start offering ext4 as installation file system, or they might
start suffering the same kind of bug reports as Ubuntu saw. :/
The Next3 filesystem
I have done experiments on
what disks do on power failure, but barriers or turning off disk
write caching should help against these reorderings. I have also
seen disks that destroy old data and the low-level
formatting on power failure. And there are other modes in which
you can lose data, so having a good backup is a good idea in any case.
The Next3 filesystem
> a power failure or OS crash, we can use file systems like tmpfs and ext4
> for these data.
> case when the disk behaves properly, and we need a file system for that
> case that behaves properly, too. And just like IBM (now Hitachi) and
> Maxtor (now Seagate) drives are on my don't-buy list after the problems
> mentioned above, ext4 is on my don't-use list.
The Next3 filesystem
Historically, that has led to strange bugs - like LILO being able to start after running Windows 3.1 but not from a cold boot (or the opposite), because the hard disk was reconfigured with a different number of heads and sectors per track (BIOS-only times, no LBA).
For all I can see, it is exactly the same nowadays.
The reason has always been to give time to the hard disk to finish and write back its cache.
The Next3 filesystem
Since I turn off disk write caching that bypasses the problem of write barriers being disabled in such kernels.
For backups, I use DAR (like tar, but with granular checksums for easier recovery from corruption) and rsnapshot, which is rsync-based but a true backup system, as it saves multiple versions and runs very fast, like rsync. It works very well as long as you don't have very large files that change frequently.
The Next3 filesystem
DAR (like tar but with granular checksums for easier recovery from corruption)
Actually par2 provides that feature. What dar gives you is multi-storage-medium support via running arbitrary scripts to change medium. tar has nothing like it.
works very well as long as you don't have very large files that change frequently
Like, uh, VM images? I hear they're quite common these days.
Why do this at the filesystem level? It's possible to do a copy-on-write image as a block device. Then you'd have filesystem independence. It would be sensitive to changes in filesystem metadata: time stamps, free block lists, inodes, etc., but I'm not convinced this would be a tremendous overhead unless you compact the filesystem or do something else that causes all blocks to be written.
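As a sketch of what block-level CoW looks like, here is a toy model in Python, with a bytearray standing in for the block device (all names are mine; real implementations like device-mapper snapshots keep the saved blocks on a separate CoW device):

```python
class CowSnapshot:
    """Copy-on-write snapshot of a 'block device' (a bytearray).
    The first write to each block since the snapshot copies the
    original data aside, so the snapshot keeps serving the old
    contents while the device moves on - filesystem-independent,
    since it never interprets the blocks."""
    def __init__(self, base, block_size=4096):
        self.base = base
        self.bs = block_size
        self.saved = {}                    # block nr -> original bytes

    def write_block(self, n, data):
        off = n * self.bs
        if n not in self.saved:            # first write since snapshot
            self.saved[n] = bytes(self.base[off:off + self.bs])
        self.base[off:off + self.bs] = data

    def read_snapshot_block(self, n):
        off = n * self.bs
        return self.saved.get(n, bytes(self.base[off:off + self.bs]))
```

The sensitivity to metadata churn mentioned above shows up here as extra entries in `saved`: every timestamp or free-list update dirties a block and forces a copy, even though no file data changed.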
Filesystem Independence
one possible reason (and I don't know if it's true in this case) would be that by doing this at the filesystem level they can be smarter than something at the block level could be.
Oops, I mean is the cost of doing this at the block level more than 10% greater than doing it at the filesystem level. I wrote the reverse.
Filesystem Independence