Bcachefs

Bcache is done and stable - but work hasn't stopped. Bcachefs is the hot new thing: a next-generation, robust, high-performance copy-on-write filesystem. You could think of it as bcache version two, but it might be more accurate to call bcache the prototype for what's happening in bcachefs - incrementally developing a filesystem has been part of the bcache plan since nearly the beginning.

It's proving to be quite stable, and it's gotten to the point where it's suitable for careful deployment and a wider user base. Please see the bcachefs page for the current status and instructions on getting started.

Developing a filesystem is also not cheap or quick or easy; we need funding! Please chip in on Patreon - the Patreon page also has more information on the motivation for bcachefs and the state of Linux filesystems. If you've been a happy bcache user, your contribution will be particularly appreciated - I didn't ask for contributions when I was working on bcache, but I am now.

Lots more information about bcachefs, including how to try it out, is available on the bcachefs page.

What is bcache?

Bcache is a Linux kernel block layer cache. It allows one or more fast disk drives such as flash-based solid state drives (SSDs) to act as a cache for one or more slower hard disk drives.

Hard drives are cheap and big; SSDs are fast but small and expensive. Wouldn't it be nice if you could transparently get the advantages of both? With Bcache, you can have your cake and eat it too.

Bcache patches for the Linux kernel allow one to use SSDs to cache other block devices. It's analogous to L2ARC for ZFS, but Bcache also does writeback caching (in addition to writethrough caching), and it's filesystem agnostic. It's designed to be switched on with a minimum of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high end storage arrays, and perhaps even embedded systems.

The design goal is to be just as fast as the SSD and cached device (depending on cache hit vs. miss, and writethrough vs. writeback writes) to within the margin of error. It's not quite there yet, mostly for sequential reads, but testing has shown that it is emphatically possible - and in some cases bcache even does better, primarily on random writes.

It's also designed to be safe. Reliability is critical for anything that does writeback caching; if it breaks, you will lose data. Bcache is meant to be a superior alternative to battery-backed RAID controllers, thus it must be reliable even if the power cord is yanked out. It won't return a write as completed until everything necessary to locate it is on stable storage, nor will writes ever be seen as partially completed (or worse, missing) in the event of power failure. A large amount of work has gone into making this work efficiently.

Bcache is designed around the performance characteristics of SSDs. It's designed to minimize write amplification to the greatest extent possible, and never itself does random writes. It turns random writes into sequential writes - first when it writes them to the SSD, and then with writeback caching it can use your SSD to buffer gigabytes of writes and write them all out in order to your hard drive or RAID array. If you've got a RAID6, you're probably aware of the painful random write penalty, and of the expensive controllers with battery backup that people buy to mitigate it. Now you can use Linux's excellent software RAID and still get fast random writes, even on cheap hardware.

User documentation - Documentation/bcache.txt in the bcache kernel tree.

Troubleshooting performance

Programmer's guide

FAQ

Articles about bcache

Old status updates

Getting bcache

Bcache has been merged into the mainline Linux kernel; for the latest stable bcache release use the latest 3.10 or 3.11 stable kernel.
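
If you're not sure whether your distribution kernel was built with bcache, checking the kernel config is a quick sanity check (the config file path below is a common convention and may vary by distro):

grep BCACHE /boot/config-$(uname -r)

CONFIG_BCACHE=y means bcache is built in; CONFIG_BCACHE=m means it's built as a module you may need to load.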

For the userspace tools,

git clone https://evilpiepirate.org/git/bcache-tools.git
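
A rough sketch of building and installing from that tree (this assumes make, pkg-config and the libblkid/libuuid development headers are available; package names vary by distro):

cd bcache-tools
make
sudo make install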

The udev rules, Debian/Ubuntu source package, and Ubuntu PPA are maintained here:

git clone https://github.com/koverstreet/bcache-tools.git

To use the PPA (Ubuntu Raring and Saucy):

sudo add-apt-repository ppa:g2p/storage
sudo apt-get update
sudo apt-get install bcache-tools

The PPA also contains blocks, a conversion tool.

A Fedora package is available in Fedora 20, and maintained here.

Contact information

Mailing list: linux-bcache@vger.kernel.org

IRC: irc.oftc.net #bcache

Author: kent.overstreet@gmail.com

Features

  • A single cache device can be used to cache an arbitrary number of backing devices, and backing devices can be attached and detached at runtime, while mounted and in use (they run in passthrough mode when they don't have a cache) - see the sketch after this list.
  • Recovers from unclean shutdown - writes are not completed until the cache is consistent with respect to the backing device (Internally, bcache doesn't distinguish between clean and unclean shutdown).
  • Barriers/cache flushes are handled correctly.
  • Writethrough, writeback and writearound.
  • Detects and bypasses sequential IO (with a configurable threshold; the bypass can also be disabled entirely).
  • Throttles traffic to the SSD if it becomes congested, detected by latency to the SSD exceeding a configurable threshold (useful if you've got one SSD for many disks).
  • Readahead on cache miss (disabled by default).
  • Highly efficient writeback implementation; dirty data is always written out in sorted order, and if writeback_percent is enabled background writeback is smoothly throttled with a PD controller to keep around that percentage of the cache dirty.
  • Very high performance b+ tree - bcache is capable of around 1M iops on random reads, if your hardware is fast enough.
  • Stable - in production use now.
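
As a concrete example of the runtime attach/detach and tuning knobs above, here's a rough sketch using make-bcache and the sysfs interface (device names and the cache set UUID are placeholders; Documentation/bcache.txt is the authoritative reference):

# format a backing device and a cache device
make-bcache -B /dev/sdb
make-bcache -C /dev/sdc

# if udev hasn't registered the devices already:
echo /dev/sdb > /sys/fs/bcache/register
echo /dev/sdc > /sys/fs/bcache/register

# attach the backing device to the cache set, using the UUID from /sys/fs/bcache/
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# tune at runtime
echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 40 > /sys/block/bcache0/bcache/writeback_percent
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff    # cache sequential IO too

Formatting both devices in a single make-bcache invocation (make-bcache -B /dev/sdb -C /dev/sdc) attaches them automatically.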

Performance

Random notes and microbenchmarks, 7/25/12

On my test machine, I carved up the SSD into two equal size partitions - one for testing the raw SSD, the other set up to cache the spinning disk.

The only bcache settings changed from the defaults are cache_mode = writeback and writeback_percent = 40. (If writeback_percent is nonzero, bcache uses a PD controller to smoothly throttle background writeback and try to keep around that percentage of the cache device dirty.) I also disabled the congested threshold, as having bcache switch to writethrough when the SSD latency spikes (as SSDs tend to do) would screw with the results.
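
For reference, those settings correspond to sysfs writes along these lines (a sketch; the cache set UUID is a placeholder, and the knobs are documented in Documentation/bcache.txt):

echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 40 > /sys/block/bcache0/bcache/writeback_percent
echo 0 > /sys/fs/bcache/<cset-uuid>/congested_read_threshold_us
echo 0 > /sys/fs/bcache/<cset-uuid>/congested_write_threshold_us

Setting the congested thresholds to 0 disables the congestion bypass entirely.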

I didn't trim the SSD before starting, as I would've had to retrim before every benchmark and the results would have had nothing to do with steady state operation anyway. But I did alternate between running a single benchmark on the bcache device and then on the raw SSD, so hopefully the internal state of the SSD didn't skew the results too much. The results seemed to be somewhat repeatable.

This is with an Intel 160 GB MLC SSD - it identifies as "INTEL SSDSA2M160".

For the benchmarks I'm using fio, with this test script:

[global]
randrepeat=1
ioengine=libaio
bs=4k
ba=4k
size=8G
direct=1
gtod_reduce=1
norandommap
iodepth=64

I'm running fio against the raw block device, skipping the filesystem - but for these benchmarks, with anything connected by SATA, it shouldn't matter.
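
The snippet above is only the [global] section; to reproduce something similar you also need a job section pointing fio at the device under test. A minimal sketch, with a placeholder device name:

[randwrite]
rw=randwrite
filename=/dev/bcache0

Point filename at the raw SSD partition instead for the comparison runs.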

Random writes

On the raw SSD, here's the output from fio:

root@utumno:~# fio ~/rw4k

randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/49885K /s] [0 /12.2K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1770
  write: io=8192.3MB, bw=47666KB/s, iops=11916 , runt=175991msec
  cpu          : usr=4.33%, sys=14.28%, ctx=2071968, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0

Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=47666KB/s, minb=48810KB/s, maxb=48810KB/s, mint=175991msec, maxt=175991msec

Disk stats (read/write):
  sdb: ios=69/2097888, merge=0/3569, ticks=0/11243992, in_queue=11245600, util=99.99%

And through bcache:

root@utumno:~# fio ~/rw4k

randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/75776K /s] [0 /18.5K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1914
  write: io=8192.3MB, bw=83069KB/s, iops=20767 , runt=100987msec
  cpu          : usr=3.17%, sys=13.27%, ctx=456026, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0

Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=83068KB/s, minb=85062KB/s, maxb=85062KB/s, mint=100987msec, maxt=100987msec

Disk stats (read/write):
  bcache0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

18.5K iops for bcache, 12.2K for the raw SSD. Bcache is winning because it's sending the SSD sequential writes, but it still has to pay the cost of persisting index updates. Partly this speaks to how thoroughly that's been optimized in bcache, but the relatively high iodepth (64) benefits bcache too - with 64 IOs in flight at a time, it's able to coalesce many index updates into the same journal write.

A high iodepth is representative of lots of workloads, but not all. The results do change if we crank it down:

IO depth of 32: bcache 20.3k iops, raw ssd 19.8k iops

IO depth of 16: bcache 16.7k iops, raw ssd 23.5k iops

IO depth of 8: bcache 8.7k iops, raw ssd 14.9k iops

IO depth of 4: bcache 8.9k iops, raw ssd 19.7k iops

The SSD performance was getting wonky towards the end. Getting consistent numbers out of SSDs with any kind of write workload is difficult at best. I was aiming for fair more than consistent.

If we were benchmarking random 4k writes with an iodepth of 1, bcache would have to do twice as much writing as the raw SSD: each write would require its own journal write to update the index, which by default can't be smaller than 4k (the default block size is 4k - it would be interesting to test with a block size of one sector, but 4k is the default so I'm being fair for now; something to try later).
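
If you do want to try the one-sector block size yourself, the cache device has to be reformatted; a hedged sketch, assuming your version of make-bcache supports the -w/--block option and using a placeholder device name (this destroys the existing cache):

make-bcache -C -w 512 /dev/sdc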

We could perhaps greatly improve the write performance when the IO depth is very small by adding the ability to do full data journalling, but I'm not sure it's worth the effort for real world workloads - it would be a decent amount of work.

Random reads

For reads I'm testing with a warm cache; it isn't particularly meaningful to benchmark bcache vs. the raw SSD when some of bcache's io is going to the spinning disk. (That's an important benchmark, but cache misses should be compared to the raw backing device).

IO depth of 64: bcache 29.5k iops, raw ssd 25.4k iops

IO depth of 16: bcache 28.2k iops, raw ssd 27.6k iops

Not sure why bcache was beating the SSD on random reads, but it was very consistent. The most likely explanation I can think of is that the data bcache was reading happened to be striped slightly more favourably across the SSD's chips - I'm going to say that on random reads bcache's performance is equal to the raw SSD, to within the margin of error.

It's worth noting that the way this test was set up is a worst case for bcache - I'm reading the data the 4k random write tests wrote. This means that the btree is full of 4k extents and is much larger than normal; on all the machines I've looked at that were using bcache for real workloads, the average extent size was somewhere around 100k. The larger btree hurts because each index lookup has to touch quite a bit more memory, and much less of it is in L2. However, it's been my experience from testing on higher end hardware that this doesn't start to noticeably affect things until somewhere around 500k iops, and even then it's not that noticeable.

If there are any other microbenchmarks people would like to see, or benchmarks they'd like analyzed, please send me or the public mailing list an email. I'm particularly interested if anyone can find any performance bugs; I haven't seen any myself in quite a while, but I'm sure there's something left.

Flash cache comparison

Alexandru Ionica graciously shared some performance data recently, showing a comparison with Facebook's flashcache.

http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/

Troubleshooting Notes

Shutdown / Device removal

When the system is shut down the cache stays dirty. That means the backing device is not safe to separate from the caching device unless it is first manually detached or the cache is switched to writethrough and allowed to flush.
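
If the devices do need to be separated, a hedged sequence (bcache0 is a placeholder) is to stop accumulating dirty data, let writeback drain, and only then detach:

# echo writethrough > /sys/block/bcache0/bcache/cache_mode
# cat /sys/block/bcache0/bcache/state
# echo 1 > /sys/block/bcache0/bcache/detach

Wait until state reports "clean" (and dirty_data drops to 0) before detaching or pulling the backing device.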

Automatic attaching

The kernel side of bcache will try to match a cache and its backing device(s) automatically. This happens regardless of the order in which the devices become available to the system.
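
If automatic matching doesn't happen (for example because the udev rules didn't run), devices can be registered by hand through sysfs - a minimal sketch with placeholder device names:

# echo /dev/sdc > /sys/fs/bcache/register
# echo /dev/sda2 > /sys/fs/bcache/register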

Root partition on bcache

In order to have the root partition under bcache you may have to add a boot parameter such as rootdelay=3, to allow the udev rules to run before the system attempts to mount the root filesystem.
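
On a GRUB-based system, one way to add that parameter is to edit /etc/default/grub along these lines (a sketch; adapt to your bootloader and distro):

GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=3"

and then regenerate the bootloader configuration (e.g. with update-grub on Debian/Ubuntu) before rebooting.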

Previously formatted disk or partition

If a partition or disk device does not register in the cache array at boot, it may be because of a rogue superblock. To avoid conflicts, bcache's device auto-detection udev rules will skip any device that does not satisfy both the bcache-probe and blkid checks. The udev rules check for superblocks to identify the file-system type; if a superblock is found that does not match the "bcache" file-system type, the disk will not be added.

# cat /usr/lib/udev/rules.d/61-bcache.rules
....
# Backing devices: scan, symlink, register
IMPORT{program}="/sbin/blkid -o udev $tempnode"
# blkid and probe-bcache can disagree, in which case don't register
ENV{ID_FS_TYPE}=="?*", ENV{ID_FS_TYPE}!="bcache", GOTO="bcache_backing_end"
...
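
To see what the two checks report for a particular device, they can be run by hand (probe-bcache ships with bcache-tools; the device name below is a placeholder):

# blkid -o udev /dev/sda2
# probe-bcache /dev/sda2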

# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID

NAME        MAJ:MIN RM   SIZE TYPE FSTYPE MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat   /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part ext2              93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs              FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache            4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache            ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4   /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318dd

In the above case a partition had been previously formatted with the ext2 file-system. The bcache array was later built with:

# make-bcache -B /dev/sdc /dev/sdd -C /dev/sda2

Because devices /dev/sdc and /dev/sdd were then correctly identified as "bcache" file-systems, they were successfully added automatically at every subsequent boot - but /dev/sda2 was not, and instead required a manual register. The cause was a left-over superblock from the previous ext2 file-system at byte 1024, which make-bcache does not erase during format because bcache's own superblock begins at byte 4096. The problem was then corrected thus:

# dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sda2

After a reboot, all disks were automatically added and the array was correctly assembled:

# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID

NAME        MAJ:MIN RM   SIZE TYPE FSTYPE MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat   /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part bcache            93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
  ├─bcache0 254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
  └─bcache1 254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs              FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache            4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache            ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4   /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318dd

Rogue superblocks can cause other related problems as well.