An update on bcachefs
The bcachefs filesystem has been under development for a number of years now; according to lead developer Kent Overstreet, it is time to start talking about getting the code upstream. He came to the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) to discuss that in a combined filesystem and storage session. Bcachefs grew out of bcache, which is a block layer cache that was merged into Linux 3.10 in mid-2013.
Five or six years ago, when he was still at Google, creating bcachefs from bcache seemed like it would take a year and 15,000 lines of code, Overstreet said. Now, six years and 50,000 lines of code later, it is a real filesystem. It "turned out really well", he said.
![Kent Overstreet](https://static.lwn.net/images/2018/lsf-overstreet-sm.jpg)
Bcachefs is a general-purpose copy-on-write filesystem with lots of features, including checksumming for blocks, compression, encryption, multiple device support, and, of course, caching. Jens Axboe asked if there was still a clean separation between bcachefs and bcache. Overstreet said that there was; roughly 80% of the code is shared. He has taken out the bcache interfaces in his development tree because there is no need for them as bcachefs can handle all of what bcache can do (and more).
Hannes Reinecke asked about the long-term expectation for bcache and bcachefs; will they coexist or will bcache be removed in favor of bcachefs? Overstreet said that bcache is the prototype for all of the ideas in bcachefs. As part of developing bcachefs, the B-tree code has been fleshed out and polished. Bcache was fast in most cases, but there were some corner cases where it was not; all of that has been fixed in bcachefs.
He said that he would like to get users off of bcache and onto bcachefs. The filesystem has an fsck available to detect and repair problems. A block layer cache does not get the same level of testing that a full filesystem does. By creating and upstreaming bcachefs, he will in some sense be turning it into a real project.
He would prefer not to have both the block layer and filesystem interfaces, since that doesn't really provide anything extra. One major disadvantage of bcache is that writes to the backing device are not copy-on-write, so there are cache-coherency issues. Bcache had ways to deal with those problems, but bcachefs simply eliminates them entirely.
Ted Ts'o asked how many users of bcache there are; how much of a problem is it to get rid of bcache? Axboe said that there are users and a community has formed to develop and maintain it. Ts'o said he would be in favor of eliminating bcache, but if there are users of the feature, that really cannot happen. Reinecke said that SUSE supports bcache in its distributions, so it will need to be maintained for a few years.
The on-disk format is different between bcache and bcachefs, similar to how ext2, ext3, and ext4 have evolved, Overstreet said. If he brought back the block device interfaces into bcachefs, then the filesystem could be a drop-in replacement for bcache. Ts'o noted that before ext3 and ext2 could be dropped, ext4 was able to handle the other two; if bcachefs can support the older bcache devices, the same could be done. Axboe said that perhaps an offline conversion tool could be written. Reinecke said that SUSE will still need bcache as a device for some time, but doesn't care if it is provided by the bcache code or by bcachefs.
Amir Goldstein asked about support for reflink, but Overstreet said that bcachefs does not have that yet. It is one of the easier things on the to-do list, however; the reflink feature uses the same design as in XFS, he said. Other things on that list include erasure coding and, further out, snapshots. Dave Chinner said that reflink is a major feature to be missing from a filesystem these days. Overstreet said that he has gotten much of it working, but the space accounting is not right yet.
Chinner asked if there would be any on-disk format changes that would require "forklift upgrades". The snapshot feature will require on-disk format changes, Overstreet said, but the other features should not. There has not been a need to change the on-disk format for quite some time, which is part of why he thinks it is ready to go upstream.
Chinner wondered where bcachefs is aimed; what are its target users? Overstreet said that the killer feature is performance. The latency tail is "really really good", he said. In tests, it has gotten 14GB/sec writes without major CPU impact and mixed read/write workloads also do well. On every workload the project can find, bcachefs performs as fast as the hardware should go.
Both small and large users will benefit from the filesystem, he said. He has been using it as his root filesystem for several years, there are users running it on servers, and the company that is funding him to work on bcachefs is using it on NAS boxes with up to 60 spindles. He was asked about shingled magnetic recording (SMR) support; both bcache and bcachefs do file data allocation in terms of 1-2MB buckets, which they write to once. That should be fairly SMR-friendly, but he has not worked out how to deal with metadata on SMR devices yet.
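To make the bucket-based allocation scheme described above more concrete, here is a toy Python sketch of an allocator that only ever appends within a fixed-size bucket and never overwrites it in place, which is what makes the approach a reasonable fit for SMR zones. This is purely illustrative: the 2MB bucket size matches the figure mentioned above, but the class and method names are invented for this example and are not bcachefs's actual code.

```python
# Toy sketch (not bcachefs code) of bucket-based allocation: space is carved
# into fixed-size buckets that are only ever appended to, and a bucket is
# never rewritten until the whole bucket is reused.
BUCKET_SIZE = 2 * 1024 * 1024  # 2 MiB, illustrative

class Bucket:
    def __init__(self, index):
        self.index = index
        self.used = 0          # bytes written so far; only ever grows

    def free(self):
        return BUCKET_SIZE - self.used

class BucketAllocator:
    def __init__(self, num_buckets):
        self.buckets = [Bucket(i) for i in range(num_buckets)]
        self.open = None       # bucket currently being filled

    def allocate(self, length):
        """Return (bucket index, offset) for a new extent of `length` bytes.
        Writes always go to the tail of the open bucket; once a bucket is
        full it is left alone until it can be reused as a whole."""
        if length > BUCKET_SIZE:
            raise ValueError("extent larger than a bucket")
        if self.open is None or self.open.free() < length:
            self.open = self._pick_empty_bucket()
        offset = self.open.used
        self.open.used += length
        return self.open.index, offset

    def _pick_empty_bucket(self):
        for bucket in self.buckets:
            if bucket.used == 0:
                return bucket
        raise RuntimeError("no empty buckets; copying GC would reclaim one")

alloc = BucketAllocator(num_buckets=4)
print(alloc.allocate(512 * 1024))   # (0, 0)
print(alloc.allocate(1600 * 1024))  # doesn't fit in bucket 0, so (1, 0)
```

Because each bucket is written strictly sequentially and then left untouched, the write pattern for file data maps naturally onto SMR zones; the open question Overstreet mentions is how to handle metadata, which does not follow this pattern.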
Ts'o wondered about the diversity of devices that had been used in the benchmarking; that would be useful in determining what the strengths and weaknesses of bcachefs are. Has it been tried on older hardware, low-end flash devices, small disks, etc.? From what he has heard, it "starts to sound like snake oil". It has been tested on big RAID devices, high-end NVMe devices, and various other options, but has not been tested on some of the lower-end devices that were asked about, Overstreet said.
The discussion then shifted to whether it was time to get bcachefs into the mainline and how that process would work. Axboe was concerned that the on-disk format may still change to support snapshots and wondered if it made sense to wait until that work was completed. But filesystems can support multiple on-disk formats; Btrfs does it, as Josef Bacik pointed out, and XFS has been doing it for 20 years, Chinner said. Overstreet said that filesystems using the current on-disk format would still be fully supported, just that they would not be able to take snapshots.
Ts'o asked about xfstests and Overstreet said that he uses them all the time; there is a 30-line patch needed to support bcachefs. Once that is added, Ts'o said, he would be happy to add bcachefs to his automated testing regime.
Bacik said that the filesystem and storage developers need to see the code and know that Overstreet will be around to maintain it, at least until there are others who will pick it up. Overstreet had hit all the high points, Bacik said, so he was comfortable with starting the review process.
Overstreet said he would post his patches shortly after LSFMM, but that it is 50,000 lines of code. Chinner said that it needs to be broken up into sane chunks. Bacik agreed, saying that he mostly cared about the interfaces, not the internal B-tree stuff. Chinner said that the user-space APIs and the on-disk format were two places to start; people make "obvious mistakes" in those areas. Next would be the interface to the VFS; generally, reviewers are going to be most interested in things at the periphery. Ts'o suggested that since Overstreet knows the code best, he should highlight places where he is making assumptions about various other parts of the kernel (e.g. the dentry cache, the memory-management subsystem); that would allow reviewers to scrutinize that code.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/bcachefs |
| Conference | Storage, Filesystem, and Memory-Management Summit/2018 |
Posted May 23, 2018 22:54 UTC (Wed)
by harlequin (guest, #119081)
[Link] (1 responses)
Posted May 25, 2018 0:10 UTC (Fri)
by koverstreet (✭ supporter ✭, #4296)
[Link]
Posted May 23, 2018 23:26 UTC (Wed)
by doublez13 (guest, #122213)
[Link] (21 responses)
Bcachefs is starting to sound like a very promising fs, but btrfs has a hell of a head start.
Posted May 23, 2018 23:29 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (16 responses)
Posted May 23, 2018 23:51 UTC (Wed)
by rahvin (guest, #16953)
[Link] (15 responses)
btrfs gets a bad rap for reasons I don't quite understand. All these new filesystems end up with tons of corner cases once they get broad deployment; btrfs was moving pretty fast until it got pushed out of beta, then it slowed down as the developers spent, and continue to spend, tons of time fixing all these corner/edge cases. Although I'd argue that significant resources were pulled away when Oracle bought Sun, as they were a prime backer.
Personally I'll be surprised if bcachefs doesn't run into the same long tail of corner cases once it's mainlined. I remember reading something saying it takes at least 10 years after a filesystem is mainlined before all these issues are dealt with; after all, it was pretty much 10 years after XFS was brought in before it really got stable.
Posted May 24, 2018 1:21 UTC (Thu)
by simcop2387 (subscriber, #101710)
[Link] (3 responses)
Posted May 24, 2018 10:30 UTC (Thu)
by pizza (subscriber, #46)
[Link] (1 responses)
I've had two [might-as-well-be-]total filesystem losses with btrfs, both after clean shutdown/reboot cycles on lightly-loaded, battery-backed hardware that hasn't so much as hiccupped before or since.
Posted May 25, 2018 15:54 UTC (Fri)
by drag (guest, #31333)
[Link]
1st time was when my cat yanked out a PCI card that wasn't screwed in. Corrupted xfs metadata, had to sort through tens of thousands of files in the recovery bin.
2nd time was while using btrfs.
There were also a couple of times, while using btrfs on an external storage device I expanded, when it wedged itself into a corner. Ran out of inodes or something like that. It was a long time ago. Didn't lose data, but the file system was effectively rendered worthless and needed to be rebuilt.
Now these things are all partially my fault, but I have used Linux on garbage-level (sometimes literally) hardware for years and it's been proven to be tough.
I am looking forward to bcachefs getting support from various distro installers.
Posted May 24, 2018 13:11 UTC (Thu)
by dcg (subscriber, #9198)
[Link]
As someone who has been using btrfs for many years without problems, I also feel that btrfs is not appreciated enough (and that bcachefs is over-hyped - Overstreet has criticized btrfs for its stability, but I have seen several cases of people having corruption issues in the bcachefs IRC channel). But I wonder how long that can last if btrfs keeps failing to make progress on basic features like RAID5/6?
Posted May 24, 2018 6:14 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
From my experience, I'm using BTRFS on my personal external RAID tower. It got corrupted every single time I tried its multi-device support. Mostly because of botched RAID rebuilds after power failures or disk replacements.
I've got tired of this and switched to a good old MD-based RAID. I'm still using BTRFS for snapshots, but there's no way in hell I'm going to touch its multi-device support within the next 5 years or so.
Posted May 24, 2018 7:48 UTC (Thu)
by khim (subscriber, #9252)
[Link] (5 responses)
Posted May 24, 2018 7:53 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted May 25, 2018 19:26 UTC (Fri)
by khim (subscriber, #9252)
[Link] (3 responses)
Ok, I'll bite. You say that, back then, the page looked nice and cheerful. Here's the full history.
When exactly was RAID56 not marked in red with the word "Unstable"?
When you ignore warnings written in red... bad things happen... that's just how life is...
Posted May 25, 2018 20:54 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Also, I wasn't aware of this page's existence. RAID56 search on Google gives another page that looked OK.
Posted May 31, 2018 21:45 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
Even md raid has only just fixed it, with the addition of a journal (a feature that might still be experimental). The journal is intended primarily as an optimisation, flushing updates to SSD before saving them to the full raid on spinning rust. It fixes the write hole almost as a side-effect.
(Note, understanding the journal is on my to-do list, I need to learn more about it before I can document it ... :-)
Cheers,
Wol
Posted Jun 2, 2018 17:52 UTC (Sat)
by koverstreet (✭ supporter ✭, #4296)
[Link]
Posted May 24, 2018 21:50 UTC (Thu)
by rahvin (guest, #16953)
[Link] (1 responses)
It just seems like everyone wants to work on something new rather than try to fix what we've got. I get it: fixing bugs is hard and boring compared to developing something new and shiny. But that new and shiny thing is still going to take 10 years to stabilize, just like every filesystem Linux has had.
This is actually a question I'd like to see answered at one of these filesystem conventions: why is it so hard to get experienced developers to work on stabilizing the filesystems we have rather than building new ones?
Posted May 24, 2018 22:51 UTC (Thu)
by nivedita76 (subscriber, #121790)
[Link]

XFS is around 25 years old, and has been in Linux since the early 00's, not new and shiny. BTRFS is the "new and shiny" fs by comparison.
Posted May 24, 2018 7:18 UTC (Thu)
by vadim (subscriber, #35271)
[Link] (1 responses)
Eg, the "raid1 volumes only mountable once RW if degraded" issue was there until very recently and is not the kind of thing one wants to deal with on a RAID setup.
Compression until recently came with warnings.
Compression is still quite half-assed on the UI side. Why can't I easily check how well a file is being compressed? Why do I have to 'defrag' to compress, on an SSD? I don't care if it's fragmented, I just want it compressed. There's no need to move already-compressed data around.
Snapshots are in theory very cool, but in practice easily cause horrible performance issues. Eg, I unwisely had installed snapper and allowed DNF to make a couple dozen snapshots of my filesystem. Cleaning that up took an entire day, on an SSD, with the first snapshots taking hours to remove and completely locking up the machine. I don't want to imagine how that would work on an HDD.
Despite my best efforts, I haven't yet figured out a way to do a btrfs scrub in such a way that it doesn't make the entire desktop lag, even on a computer with an SSD.
Posted May 25, 2018 2:03 UTC (Fri)
by zlynx (guest, #2285)
[Link]
One thing to point out about slow btrfs scrubs is that it depends on your IO scheduler. Use the standard CFQ or BFQ. If you use deadline or noop then of course the scrub is going to use up all the IO because nothing will stop it.
It still causes some extra latency. On my email server I use multi-queue BFQ and I can feel a bit of extra delay when loading each IMAP message from it during a scrub, but I don't think it's that serious.
And on my laptop with NVMe SSD the btrfs scrub is done in about 30 seconds, so it isn't a big deal there either. It reads and verifies at hundreds of megabytes per second.
Posted May 23, 2018 23:49 UTC (Wed)
by EdwardConnolly (guest, #123865)
[Link] (3 responses)
Posted May 24, 2018 6:00 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link] (2 responses)

Incidentally, my main setup is bcache+btrfs ( https://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2... ) but it doesn't seem that bcachefs could be a feature-complete replacement.
Posted May 24, 2018 11:36 UTC (Thu)
by mchouque (subscriber, #62087)
[Link] (1 responses)
If you do RAID 1 on more than 2 devices (say n devices), btrfs only guarantees you have 2 copies of your data, not n.
Moreover, you only read from one disk (IIRC they do load balancing with the PID or something like that, so you really use RAID 1 / 10 for redundancy more than performance).
When you lose a device in RAID1, you can mount the FS but only once... https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_vol...
As for RAID10, it inherits the design features of RAID1. From the doc (https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_...), it says "RAID-10 is built on top of these definitions. Every stripe is split across to exactly 2 RAID-1 sets and those RAID-1 sets are written to exactly 2 devices". The issue is you do not control on which devices your stripe is mirrored.
So a double disk failure on a btrfs RAID10 is more likely to end up with data loss than on a traditional RAID-10.
There was a long discussion about that years ago: https://www.mail-archive.com/[email protected]/...
>The example below is probably a pathological case - but here goes. Let's say in this 4-disk example that chunks are striped as d1,d2,d1,d2 where d1 is the first bit of data and d2 is the second:
>
>Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
>Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
>Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
>Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
>Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10
>
>Lose any two disks and you have a 50% chance on *each* chunk to have lost that chunk. With traditional RAID10 you have a 50% chance of losing the array entirely. With btrfs, the more data you have stored, the chances get closer to 100% of losing *some* data in a 2-disk failure.
>
>In the above example, losing A and B means you lose d3, d6, and d7 (which ends up being 60% of all chunks).
>Losing A and C means you lose d1 (20% of all chunks).
>Losing A and D means you lose d9 (20% of all chunks).
>Losing B and C means you lose d10 (20% of all chunks).
>Losing B and D means you lose d2 (20% of all chunks).
>Losing C and D means you lose d4, d5, AND d8 (60% of all chunks).
>
>The above skewed example has an average of 40% of all chunks failed. As you add more data and randomise the allocation, this will approach 50% - BUT, the chances of losing *some* data is already clearly shown to be very close to 100%.
The bottom line is that RAID1 or 10 as defined by btrfs is not what you think they are if you haven't read the fine print...
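To make the quoted arithmetic concrete, here is a minimal Monte Carlo sketch of the failure mode described in the comment above. It is purely illustrative: the device count, chunk count, and uniform-random placement of each chunk's two copies are assumptions for the sake of the example, not a model of btrfs's actual allocator.

```python
# Rough Monte Carlo sketch: btrfs-style "RAID-10" places each chunk's two
# copies on whichever pair of devices the allocator picks, while a
# traditional RAID-10 uses fixed mirror pairs.
import random
from itertools import combinations

def btrfs_like_loss(num_devices=4, num_chunks=1000, trials=1000):
    """Fraction of two-disk failures that lose at least one chunk."""
    losses = 0
    for _ in range(trials):
        # Each chunk is mirrored on a randomly chosen pair of devices.
        placement = [tuple(random.sample(range(num_devices), 2))
                     for _ in range(num_chunks)]
        failed = set(random.sample(range(num_devices), 2))
        if any(set(pair) <= failed for pair in placement):
            losses += 1
    return losses / trials

def fixed_pair_loss(num_devices=4):
    """Traditional RAID-10: data is lost only if both disks of one
    fixed mirror pair fail."""
    pairs = [{i, i + 1} for i in range(0, num_devices, 2)]
    fatal = sum(1 for combo in combinations(range(num_devices), 2)
                if set(combo) in pairs)
    total = len(list(combinations(range(num_devices), 2)))
    return fatal / total

print("btrfs-like:", btrfs_like_loss())   # approaches 1.0 as chunks grow
print("fixed pairs:", fixed_pair_loss())  # 2/6 ≈ 0.33 for four disks
```

With a few thousand chunks the btrfs-like estimate is essentially 1.0, while fixed mirror pairs lose data in only 2 of the 6 possible two-disk failures on four disks, which is the contrast the quoted mailing-list post is describing.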
Posted May 31, 2018 21:50 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
I won't say more than that mdraid-10 is NOT raid 1+0. Look it up if you can be bothered.
Cheers,
Wol
Posted May 24, 2018 10:09 UTC (Thu)
by TimSmall (guest, #96681)
[Link] (5 responses)
If continuing to support the existing block device interface is not favoured, then perhaps user space tools to allow users to migrate to dm-cache (which I understand got some performance fixes last year in 4.12) would be possible....
Posted May 24, 2018 12:30 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
No, I fear the same rule applies to this as to any fs: people rely on it, so you can't rip it out or randomly change the data format. Though for non-writeback bcache volumes it is probably more practical to change the cache-device format, since you don't want to require changes to the backing device; even writeback bcaches probably just need to write all the data back and could then freely change the cache device. It would be nice if this didn't require downtime for those of us with rootfses on bcache, too.
Posted May 24, 2018 12:54 UTC (Thu)
by TimSmall (guest, #96681)
[Link] (1 responses)
Posted May 24, 2018 14:53 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted May 25, 2018 0:11 UTC (Fri)
by koverstreet (✭ supporter ✭, #4296)
[Link]
Posted May 25, 2018 6:50 UTC (Fri)
by nilsmeyer (guest, #122604)
[Link]
Posted May 25, 2018 6:52 UTC (Fri)
by nilsmeyer (guest, #122604)
[Link]
Posted May 27, 2018 21:35 UTC (Sun)
by meyert (subscriber, #32097)
[Link] (3 responses)

How did this turn into a filesystem? Could bcachefs still act as an SSD cache for a bigger spinning disk?

What is the best solution nowadays to use a small SSD as a cache for a bigger spinning disk?
Posted May 31, 2018 21:59 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Use it as a journal for an md raid?
Cheers,
Wol
Posted Jun 1, 2018 23:11 UTC (Fri)
by Pc5Y9sbv (guest, #41328)
[Link] (1 responses)
You can enjoy the usual benefits of LVM to allocate portions of the SSD or HDD for different purposes, migrating LV content between backing devices, etc. You might make an SSD-only filesystem as well as an SSD-cached filesystem with a larger HDD. You can also drop and add cache on an existing LV without having to reformat the backing filesystem, in case you change your mind about how much SSD to allocate for caching.
I think the cache management commands can be done while a filesystem in the LV is online, but I actually haven't done that yet. I've set up a write-back cached volume using 60GB of SSD for 2TB of nearline disk. My system RAM provides enough buffer cache for reads, so this is mostly to give me a durable write buffer so my programs don't wait for the HDD to complete writes as often.
I'm a little old-fashioned, so I do my RAID mirroring via mdraid and treat those as PVs for LVM. So, I have an SSD mirror backing some small filesystems and cache pool, and an HDD mirror backing bulk filesystems with some cached and some uncached.
Posted Jun 2, 2018 13:34 UTC (Sat)
by meyert (subscriber, #32097)
[Link]
Posted Jun 18, 2018 12:40 UTC (Mon)
by poelzi (guest, #14953)
[Link]
bcache works like a charm (rebooted yesterday, usually I have ~80% hit rate ...). It caches my 2TB media HDD on my notebook through an 80GB SSD partition. I also use it on my home server to cache /home.

--- bcache ---
Device /dev/dm-3 (254:3)
UUID ec13b555-309c-4c03-a34f-10ffbe9d51d2
Block Size 0.50KiB
Bucket Size 2.00MiB
Congested? False
Read Congestion 2.0ms
Write Congestion 20.0ms
Total Cache Size 80.00GiB
Total Cache Used 80.00GiB (100%)
Total Cache Unused 0B (0%)
Dirty Data 0B (0%)
Evictable Cache 80.00GiB (100%)
Replacement Policy [lru] fifo random
Cache Mode writethrough [writeback] writearound none
Total Hits 9573 (67%)
Total Misses 4685
Total Bypass Hits 3422 (15%)
Total Bypass Misses 18947
Total Bypassed 2.70GiB

I'm looking forward to bcachefs as I'm quite disappointed by btrfs and zfs.
Bcache + ext4 has still proven the most stable and pleasantly fast solution so far, thank you very much for this.