An update on bcachefs
The bcachefs filesystem has been under development for a number of years now; according to lead developer Kent Overstreet, it is time to start talking about getting the code upstream. He came to the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) to discuss that in a combined filesystem and storage session. Bcachefs grew out of bcache, which is a block layer cache that was merged into Linux 3.10 in mid-2013.
Five or six years ago, when he was still at Google, creating bcachefs from bcache seemed like it would take a year and 15,000 lines of code, Overstreet said. Now, six years and 50,000 lines of code later, it is a real filesystem. It "turned out really well", he said.
![Kent Overstreet](https://static.lwn.net/images/2018/lsf-overstreet-sm.jpg)
Bcachefs is a general-purpose copy-on-write filesystem with lots of features, including checksumming for blocks, compression, encryption, multiple device support, and, of course, caching. Jens Axboe asked if there was still a clean separation between bcachefs and bcache. Overstreet said that there was; roughly 80% of the code is shared. He has taken out the bcache interfaces in his development tree because there is no need for them as bcachefs can handle all of what bcache can do (and more).
Hannes Reinecke asked about the long-term expectation for bcache and bcachefs; will they coexist or will bcache be removed in favor of bcachefs? Overstreet said that bcache is the prototype for all of the ideas in bcachefs. As part of developing bcachefs, the B-tree code has been fleshed out and polished. Bcache was fast in most cases, but there were some corner cases where it was not; all of that has been fixed in bcachefs.
He said that he would like to get users off of bcache and onto bcachefs. The filesystem has an fsck available to detect and repair problems. A block layer cache does not get the same level of testing that a full filesystem does. By creating and upstreaming bcachefs, he will in some sense be turning it into a real project.
He would prefer not to have both the block layer and filesystem interfaces, since that doesn't really provide anything extra. One major disadvantage of bcache is that writes to the backing device are not copy-on-write, so there are cache-coherency issues. Bcache had ways to deal with those problems, but bcachefs simply eliminates them entirely.
Ted Ts'o asked how many users of bcache there are; how much of a problem is it to get rid of bcache? Axboe said that there are users and a community has formed to develop and maintain it. Ts'o said he would be in favor of eliminating bcache, but if there are users of the feature, that really cannot happen. Reinecke said that SUSE supports bcache in its distributions, so it will need to be maintained for a few years.
The on-disk format is different between bcache and bcachefs, similar to how ext2, ext3, and ext4 have evolved, Overstreet said. If he brought back the block device interfaces into bcachefs, then the filesystem could be a drop-in replacement for bcache. Ts'o noted that before ext3 and ext2 could be dropped, ext4 was able to handle the other two; if bcachefs can support the older bcache devices, the same could be done. Axboe said that perhaps an offline conversion tool could be written. Reinecke said that SUSE will still need bcache as a device for some time, but doesn't care if it is provided by the bcache code or by bcachefs.
Amir Goldstein asked about support for reflink, but Overstreet said that bcachefs does not have that yet. It is one of the easier things on the to-do list, however; the reflink feature uses the same design as in XFS, he said. Other things on that list include erasure coding and, further out, snapshots. Dave Chinner said that reflink is a major feature to be missing from a filesystem these days. Overstreet said that he has gotten much of it working, but the space accounting is not right yet.
Chinner asked if there would be any on-disk format changes that would require "forklift upgrades". The snapshot feature will require on-disk format changes, Overstreet said, but the other features should not. There has not been a need to change the on-disk format for quite some time, which is part of why he thinks it is ready to go upstream.
Chinner wondered where bcachefs is aimed; what are its target users? Overstreet said that the killer feature is performance. The latency tail is "really really good", he said. In tests, it has gotten 14GB/sec writes without major CPU impact and mixed read/write workloads also do well. On every workload the project can find, bcachefs performs as fast as the hardware should go.
Both small and large users will benefit from the filesystem, he said. He has been using it as his root filesystem for several years, there are users running it on servers, and the company that is funding him to work on bcachefs is using it on NAS boxes with up to 60 spindles. He was asked about shingled magnetic recording (SMR) support; both bcache and bcachefs do file data allocation in terms of 1-2MB buckets, which they write to once. That should be fairly SMR-friendly, but he has not worked out how to deal with metadata on SMR devices yet.
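To make the bucket-based allocation scheme described above more concrete, here is a toy Python sketch of an allocator that only ever appends within a fixed-size bucket and never overwrites it in place, which is what makes the approach a reasonable fit for SMR zones. This is purely illustrative: the 2MB bucket size matches the figure mentioned above, but the class and method names are invented for this example and are not bcachefs's actual code.

```python
# Toy sketch (not bcachefs code) of bucket-based allocation: space is carved
# into fixed-size buckets that are only ever appended to, and a bucket is
# never rewritten until the whole bucket is reused.
BUCKET_SIZE = 2 * 1024 * 1024  # 2 MiB, illustrative

class Bucket:
    def __init__(self, index):
        self.index = index
        self.used = 0          # bytes written so far; only ever grows

    def free(self):
        return BUCKET_SIZE - self.used

class BucketAllocator:
    def __init__(self, num_buckets):
        self.buckets = [Bucket(i) for i in range(num_buckets)]
        self.open = None       # bucket currently being filled

    def allocate(self, length):
        """Return (bucket index, offset) for a new extent of `length` bytes.
        Writes always go to the tail of the open bucket; once a bucket is
        full it is left alone until it can be reused as a whole."""
        if length > BUCKET_SIZE:
            raise ValueError("extent larger than a bucket")
        if self.open is None or self.open.free() < length:
            self.open = self._pick_empty_bucket()
        offset = self.open.used
        self.open.used += length
        return self.open.index, offset

    def _pick_empty_bucket(self):
        for bucket in self.buckets:
            if bucket.used == 0:
                return bucket
        raise RuntimeError("no empty buckets; copying GC would reclaim one")

alloc = BucketAllocator(num_buckets=4)
print(alloc.allocate(512 * 1024))   # (0, 0)
print(alloc.allocate(1600 * 1024))  # doesn't fit in bucket 0, so (1, 0)
```

Because each bucket is written strictly sequentially and then left untouched, the write pattern for file data maps naturally onto SMR zones; the open question Overstreet mentions is how to handle metadata, which does not follow this pattern.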
Ts'o wondered about the diversity of devices that had been used in the benchmarking; that would be useful in determining what the strengths and weaknesses of bcachefs are. Has it been tried on older hardware, low-end flash devices, small disks, etc.? From what he has heard, it "starts to sound like snake oil". It has been tested on big RAID devices, high-end NVMe devices, and various other options, but has not been tested on some of the lower-end devices that were asked about, Overstreet said.
The discussion then shifted to whether it was time to get bcachefs into the mainline and how that process would work. Axboe was concerned that the on-disk format may still change to support snapshots and wondered if it made sense to wait until that work was completed. But filesystems can support multiple on-disk formats; Btrfs does it, as Josef Bacik pointed out, and XFS has been doing it for 20 years, Chinner said. Overstreet said that filesystems using the current on-disk format would still be fully supported, just that they would not be able to take snapshots.
Ts'o asked about xfstests and Overstreet said that he uses them all the time; there is a 30-line patch needed to support bcachefs. Once that is added, Ts'o said, he would be happy to add bcachefs to his automated testing regime.
Bacik said that the filesystem and storage developers need to see the code and know that Overstreet will be around to maintain it, at least until there are others who will pick it up. Overstreet had hit all the high points, Bacik said, so he was comfortable with starting the review process.
Overstreet said he would post his patches shortly after LSFMM, but that it is 50,000 lines of code. Chinner said that it needs to be broken up into sane chunks. Bacik agreed, saying that he mostly cared about the interfaces, not the internal B-tree stuff. Chinner said that the user-space APIs and the on-disk format were two places to start; people make "obvious mistakes" in those areas. Next would be the interface to the VFS; generally, reviewers are going to be most interested in things at the periphery. Ts'o suggested that since Overstreet knows the code best, he should highlight places where he is making assumptions about various other parts of the kernel (e.g. the dentry cache, the memory-management subsystem); that would allow reviewers to scrutinize that code.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/bcachefs |
| Conference | Storage, Filesystem, and Memory-Management Summit/2018 |
Posted May 23, 2018 22:54 UTC (Wed)
by harlequin (guest, #119081)
[Link] (1 responses)
Posted May 25, 2018 0:10 UTC (Fri)
by koverstreet (✭ supporter ✭, #4296)
[Link]
Posted May 23, 2018 23:26 UTC (Wed)
by doublez13 (guest, #122213)
[Link] (21 responses)
Bcachefs is starting to sound like a very promising fs, but btrfs has a hell of a head start.
Posted May 23, 2018 23:29 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (16 responses)
Posted May 23, 2018 23:51 UTC (Wed)
by rahvin (guest, #16953)
[Link] (15 responses)
btrfs gets a bad rap for reasons I don't quite understand. All these new filesystems end up with tons of corner cases once they get broad deployment; btrfs was moving pretty fast until it got pushed out of beta, then it slowed down as the developers spent, and continue to spend, tons of time fixing all these corner/edge cases. Although I'd argue that significant resources were pulled away when Oracle bought Sun, as they were a prime backer.
Personally I'll be surprised if bcachefs doesn't run into the same long tail of corner cases once it's mainlined. I remember reading something saying it takes at least 10 years after a filesystem is mainlined before all these issues are dealt with; after all, it was pretty much 10 years after XFS was brought in before it really got stable.
Posted May 24, 2018 1:21 UTC (Thu)
by simcop2387 (subscriber, #101710)
[Link] (3 responses)
Posted May 24, 2018 10:30 UTC (Thu)
by pizza (subscriber, #46)
[Link] (1 responses)
I've had two [might-as-well-be-]total filesystem losses with btrfs, both after clean shutdown/reboot cycles on lightly-loaded, battery-backed hardware that hasn't so much as hiccupped before or since.
Posted May 25, 2018 15:54 UTC (Fri)
by drag (guest, #31333)
[Link]
1st time was when my cat yanked out a PCI card that wasn't screwed in. Corrupted xfs metadata, had to sort through tens of thousands of files in the recovery bin.
2nd time was while using btrfs.
There were also a couple of times, while using btrfs on an external storage device I expanded, when it wedged itself into a corner. Ran out of inodes or something like that. It was a long time ago. Didn't lose data, but the file system was effectively rendered worthless and needed to be rebuilt.
Now these things are all partially my fault, but I have used Linux on garbage-level (sometimes literally) hardware for years and it's been proven to be tough.
I am looking forward to bcachefs getting support from various distro installers.
Posted May 24, 2018 13:11 UTC (Thu)
by dcg (subscriber, #9198)
[Link]
As someone who has been using btrfs for many years without problems, I also feel that btrfs is not appreciated enough (and that bcachefs is over-hyped - Overstreet has criticized btrfs for its stability, but I have seen several cases of people having corruption issues in the bcachefs IRC channel). But I wonder how long that can last if btrfs keeps failing to make progress on basic features like RAID5/6?
Posted May 24, 2018 6:14 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
From my experience, I'm using BTRFS on my personal external RAID tower. It got corrupted every single time I tried its multi-device support. Mostly because of botched RAID rebuilds after power failures or disk replacements.
I've got tired of this and switched to a good old MD-based RAID. I'm still using BTRFS for snapshots, but there's no way in hell I'm going to touch its multi-device support within the next 5 years or so.
Posted May 24, 2018 7:48 UTC (Thu)
by khim (subscriber, #9252)
[Link] (5 responses)
Posted May 24, 2018 7:53 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted May 25, 2018 19:26 UTC (Fri)
by khim (subscriber, #9252)
[Link] (3 responses)
Ok, I'll bite. You say that, back then, the page looked nice and cheerful. Here's the full history.
When exactly was RAID56 not marked in red with the word "Unstable"?
When you ignore warnings written in red... bad things happen... that's just how life is...
Posted May 25, 2018 20:54 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Also, I wasn't aware of this page's existence. RAID56 search on Google gives another page that looked OK.
Posted May 31, 2018 21:45 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
Even md raid has only just fixed it, with the addition of a journal (a feature that might still be experimental). The journal is intended primarily as an optimisation, flushing updates to SSD before saving them to the full raid on spinning rust. It fixes the write hole almost as a side-effect.
(Note, understanding the journal is on my to-do list, I need to learn more about it before I can document it ... :-)
Cheers,
Wol
Posted Jun 2, 2018 17:52 UTC (Sat)
by koverstreet (✭ supporter ✭, #4296)
[Link]
Posted May 24, 2018 21:50 UTC (Thu)
by rahvin (guest, #16953)
[Link] (1 responses)
It just seems like everyone wants to work on something new rather than try to fix what we've got. I get it: fixing bugs is hard and boring compared to developing something new and shiny. But that new and shiny thing is still going to take 10 years to stabilize, just like every filesystem Linux has had.
This is actually a question I'd like to see answered at one of these filesystem conventions: why is it so hard to get experienced developers to work on stabilizing the filesystems we have rather than building new ones?
Posted May 24, 2018 22:51 UTC (Thu)
by nivedita76 (subscriber, #121790)
[Link]

XFS is around 25 years old, and has been in Linux since the early 00's, not new and shiny. BTRFS is the "new and shiny" fs by comparison.
Posted May 24, 2018 7:18 UTC (Thu)
by vadim (subscriber, #35271)
[Link] (1 responses)
Eg, the "raid1 volumes only mountable once RW if degraded" issue was there until very recently and is not the kind of thing one wants to deal with on a RAID setup.
Compression until recently came with warnings.
Compression is still quite half-assed on the UI side. Why can't I easily check how well a file is being compressed? Why do I have to 'defrag' to compress, on an SSD? I don't care if it's fragmented, I just want it compressed. There's no need to move already-compressed data around.
Snapshots are in theory very cool, but in practice easily cause horrible performance issues. Eg, I unwisely had installed snapper and allowed DNF to make a couple dozen snapshots of my filesystem. Cleaning that up took an entire day, on an SSD, with the first snapshots taking hours to remove and completely locking up the machine. I don't want to imagine how that would work on an HDD.
Despite my best efforts, I haven't yet figured out a way to do a btrfs scrub in such a way that it doesn't make the entire desktop lag, even on a computer with an SSD.
Posted May 25, 2018 2:03 UTC (Fri)
by zlynx (guest, #2285)
[Link]
One thing to point out about slow btrfs scrubs is that it depends on your IO scheduler. Use the standard CFQ or BFQ. If you use deadline or noop then of course the scrub is going to use up all the IO because nothing will stop it.
It still causes some extra latency. On my email server I use multi-queue BFQ and I can feel a bit of extra delay when loading each IMAP message from it during a scrub, but I don't think it's that serious.
And on my laptop with NVMe SSD the btrfs scrub is done in about 30 seconds, so it isn't a big deal there either. It reads and verifies at hundreds of megabytes per second.
Posted May 23, 2018 23:49 UTC (Wed)
by EdwardConnolly (guest, #123865)
[Link] (3 responses)
Posted May 24, 2018 6:00 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link] (2 responses)

Incidentally, my main setup is bcache+btrfs ( https://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2... ) but it doesn't seem that bcachefs could be a feature-complete replacement.
Posted May 24, 2018 11:36 UTC (Thu)
by mchouque (subscriber, #62087)
[Link] (1 responses)
If you do RAID 1 on more than 2 devices (say n devices), btrfs only guarantees you have 2 copies of your data, not n.
Moreover, you only read from one disk (IIRC they do load balancing with the PID or something like that, so you really use RAID 1 / 10 for redundancy more than performance).
When you lose a device in RAID1, you can mount the FS but only once... https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_vol...
As for RAID10, it inherits the design features of RAID1. From the doc (https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_...), it says "RAID-10 is built on top of these definitions. Every stripe is split across to exactly 2 RAID-1 sets and those RAID-1 sets are written to exactly 2 devices". The issue is you do not control on which devices your stripe is mirrored.
So a double disk failure on a btrfs RAID10 is more likely to end up with data loss than on a traditional RAID-10.
There was a long discussion about that years ago: https://www.mail-archive.com/[email protected]/...
>The example below is probably a pathological case - but here goes. Let's say in this 4-disk example that chunks are striped as d1,d2,d1,d2 where d1 is the first bit of data and d2 is the second:
>
>Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
>Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
>Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
>Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
>Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10
>
>Lose any two disks and you have a 50% chance on *each* chunk to have lost that chunk. With traditional RAID10 you have a 50% chance of losing the array entirely. With btrfs, the more data you have stored, the chances get closer to 100% of losing *some* data in a 2-disk failure.
>
>In the above example, losing A and B means you lose d3, d6, and d7 (which ends up being 60% of all chunks).
>Losing A and C means you lose d1 (20% of all chunks).
>Losing A and D means you lose d9 (20% of all chunks).
>Losing B and C means you lose d10 (20% of all chunks).
>Losing B and D means you lose d2 (20% of all chunks).
>Losing C and D means you lose d4, d5, AND d8 (60% of all chunks).
>
>The above skewed example has an average of 40% of all chunks failed. As you add more data and randomise the allocation, this will approach 50% - BUT, the chances of losing *some* data is already clearly shown to be very close to 100%.
The bottom line is that RAID1 or 10 as defined by btrfs is not what you think they are if you haven't read the fine print...
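To make the quoted arithmetic concrete, here is a minimal Monte Carlo sketch of the failure mode described in the comment above. It is purely illustrative: the device count, chunk count, and uniform-random placement of each chunk's two copies are assumptions for the sake of the example, not a model of btrfs's actual allocator.

```python
# Rough Monte Carlo sketch: btrfs-style "RAID-10" places each chunk's two
# copies on whichever pair of devices the allocator picks, while a
# traditional RAID-10 uses fixed mirror pairs.
import random
from itertools import combinations

def btrfs_like_loss(num_devices=4, num_chunks=1000, trials=1000):
    """Fraction of two-disk failures that lose at least one chunk."""
    losses = 0
    for _ in range(trials):
        # Each chunk is mirrored on a randomly chosen pair of devices.
        placement = [tuple(random.sample(range(num_devices), 2))
                     for _ in range(num_chunks)]
        failed = set(random.sample(range(num_devices), 2))
        if any(set(pair) <= failed for pair in placement):
            losses += 1
    return losses / trials

def fixed_pair_loss(num_devices=4):
    """Traditional RAID-10: data is lost only if both disks of one
    fixed mirror pair fail."""
    pairs = [{i, i + 1} for i in range(0, num_devices, 2)]
    fatal = sum(1 for combo in combinations(range(num_devices), 2)
                if set(combo) in pairs)
    total = len(list(combinations(range(num_devices), 2)))
    return fatal / total

print("btrfs-like:", btrfs_like_loss())   # approaches 1.0 as chunks grow
print("fixed pairs:", fixed_pair_loss())  # 2/6 ≈ 0.33 for four disks
```

With a few thousand chunks the btrfs-like estimate is essentially 1.0, while fixed mirror pairs lose data in only 2 of the 6 possible two-disk failures on four disks, which is the contrast the quoted mailing-list post is describing.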
Posted May 31, 2018 21:50 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
I won't say more than that mdraid-10 is NOT raid 1+0. Look it up if you can be bothered.
Cheers,
Wol
Posted May 24, 2018 10:09 UTC (Thu)
by TimSmall (guest, #96681)
[Link] (5 responses)
If continuing to support the existing block device interface is not favoured, then perhaps user space tools to allow users to migrate to dm-cache (which I understand got some performance fixes last year in 4.12) would be possible....
Posted May 24, 2018 12:30 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
No, I fear the same rule applies to this as to any fs: people rely on it, so you can't rip it out or randomly change the data format. Though for non-writeback bcache volumes it is probably more practical to change the cache-device format, since you don't want to require changes to the backing device; even writeback bcaches probably just need to write all the data back and could then freely change the cache device. It would be nice if this didn't require downtime for those of us with rootfses on bcache, too.
Posted May 24, 2018 12:54 UTC (Thu)
by TimSmall (guest, #96681)
[Link] (1 responses)
Posted May 24, 2018 14:53 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted May 25, 2018 0:11 UTC (Fri)
by koverstreet (✭ supporter ✭, #4296)
[Link]
Posted May 25, 2018 6:50 UTC (Fri)
by nilsmeyer (guest, #122604)
[Link]
Posted May 25, 2018 6:52 UTC (Fri)
by nilsmeyer (guest, #122604)
[Link]
Posted May 27, 2018 21:35 UTC (Sun)
by meyert (subscriber, #32097)
[Link] (3 responses)

How did this turn into a filesystem? Could bcachefs still act as an SSD cache for a bigger spinning disk?

What is the best solution nowadays to use a small SSD as a cache for a bigger spinning disk?
Posted May 31, 2018 21:59 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Use it as a journal for an md raid?
Cheers,
Wol
Posted Jun 1, 2018 23:11 UTC (Fri)
by Pc5Y9sbv (guest, #41328)
[Link] (1 responses)
You can enjoy the usual benefits of LVM to allocate portions of the SSD or HDD for different purposes, migrating LV content between backing devices, etc. You might make an SSD-only filesystem as well as an SSD-cached filesystem with a larger HDD. You can also drop and add cache on an existing LV without having to reformat the backing filesystem, in case you change your mind about how much SSD to allocate for caching.
I think the cache management commands can be done while a filesystem in the LV is online, but I actually haven't done that yet. I've set up a write-back cached volume using 60GB of SSD for 2TB of nearline disk. My system RAM provides enough buffer cache for reads, so this is mostly to give me a durable write buffer so my programs don't wait for the HDD to complete writes as often.
I'm a little old-fashioned, so I do my RAID mirroring via mdraid and treat those as PVs for LVM. So, I have an SSD mirror backing some small filesystems and cache pool, and an HDD mirror backing bulk filesystems with some cached and some uncached.
Posted Jun 2, 2018 13:34 UTC (Sat)
by meyert (subscriber, #32097)
[Link]
Posted Jun 18, 2018 12:40 UTC (Mon)
by poelzi (guest, #14953)
[Link]
bcache works like a charm (rebooted yesterday, usually I have ~80% hit rate ...). It caches my 2TB media HDD on my notebook through an 80GB SSD partition. I also use it on my home server to cache /home.

--- bcache ---
Device /dev/dm-3 (254:3)
UUID ec13b555-309c-4c03-a34f-10ffbe9d51d2
Block Size 0.50KiB
Bucket Size 2.00MiB
Congested? False
Read Congestion 2.0ms
Write Congestion 20.0ms
Total Cache Size 80.00GiB
Total Cache Used 80.00GiB (100%)
Total Cache Unused 0B (0%)
Dirty Data 0B (0%)
Evictable Cache 80.00GiB (100%)
Replacement Policy [lru] fifo random
Cache Mode writethrough [writeback] writearound none
Total Hits 9573 (67%)
Total Misses 4685
Total Bypass Hits 3422 (15%)
Total Bypass Misses 18947
Total Bypassed 2.70GiB

I'm looking forward to bcachefs as I'm quite disappointed by btrfs and zfs.
Bcache + ext4 has still proven the most stable and pleasantly fast solution so far, thank you very much for this.