Bcache Wiki

What is bcache?

[Figure bcache.png: seeks per second under bonnie++ (90% reads, 10% rewrites), with bcache running in synchronous writeback mode with the same SSD and hard drive.]

Hard drives are cheap and big, SSDs are fast but small and expensive. Wouldn't it be nice if you could transparently get the advantages of both? With Bcache, you can have your cake and eat it too.

Bcache is a patch for the Linux kernel to use SSDs to cache other block devices. It's analogous to L2ARC for ZFS, but Bcache also does writeback caching, and it's filesystem agnostic. It's designed to be switched on with a minimum of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high-end storage arrays, and perhaps even embedded systems.

The design goal is to match the speed of the SSD or of the cached device (depending on cache hit vs. miss, and writethrough vs. writeback writes) to within the margin of error. It's not quite there yet, mostly for sequential reads. But testing has shown that it is emphatically possible, and that in some cases bcache can even do better - primarily on random writes.

It's also designed to be safe. Reliability is critical for anything that does writeback caching; if it breaks, you will lose data. Bcache is meant to be a superior alternative to battery-backed RAID controllers, so it must be reliable even if the power cord is yanked out. It won't return a write as completed until everything necessary to locate it is on stable storage, nor will writes ever be seen as partially completed (or worse, missing) in the event of power failure. A large amount of work has gone into making this work efficiently.

Bcache is designed around the performance characteristics of SSDs. It's designed to minimize write inflation to the greatest extent possible, and never itself does random writes. It turns random writes into sequential writes - first when it writes them to the SSD, and then with writeback caching it can use your SSD to buffer gigabytes of writes and write them all out in order to your hard drive or RAID array. If you've got a RAID6, you're probably aware of the painful random write penalty, and the expensive controllers with battery backup people buy to mitigate it. Now, you can use Linux's excellent software RAID and still get fast random writes, even on cheap hardware.

Bcache is currently beta quality software. There aren't any known data corruption bugs and recovering from unclean shutdown is now working beautifully; it ought to be suitable for non critical use provided you have backups.

Future plans

Further off, there are plans to use Bcache's index to implement overcommitted storage. If you're familiar with LVM, it works by allocating logical volumes in units of 4 MB extents; thus you can arbitrarily create and resize LVs. But when you create an LV you have to fully allocate its storage, regardless of whether it'll ever be written to. If you've ever managed servers with lots of random LVs, you've probably experienced firsthand how much of a pain it is to keep track of how much free space you have, resize LVs when the filesystems fill up, etc. - not to mention the huge amount of space that typically gets wasted because you really don't want filesystems to fill up.

But all the work has already been done in Bcache for allocating on demand, and maintaining the index while it's in use - and by using the same index for cached data and the volumes themselves, there will be approximately zero extra runtime overhead. You'll be able to create petabyte sized filesystems with a tiny amount of real storage, resize them arbitrarily, and be able to see exactly how much space you're using. Reading from newly created volumes also won't return old data; sectors that haven't previously been written to will return 0s. This was actually the primary motivation for this feature - for shared hosting, you don't want customers to be able to see other people's data.
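
As a rough illustration of allocating on demand, here is a minimal Python sketch of a thin volume. It is not bcache code: the ThinVolume class, its dict-based index, and the 512 byte sector size are all assumptions made for the example. It only shows the two behaviours described above: real storage is consumed when a sector is first written, and sectors that were never written read back as zeroes.

```python
class ThinVolume:
    """Toy overcommitted volume: a sector is backed only once it is written."""

    SECTOR_SIZE = 512  # bytes; an assumption for this sketch

    def __init__(self, virtual_sectors):
        self.virtual_sectors = virtual_sectors
        self.index = {}  # sector number -> data; stands in for the real index

    def _check(self, sector):
        if not 0 <= sector < self.virtual_sectors:
            raise IndexError("sector out of range")

    def write(self, sector, data):
        self._check(sector)
        if len(data) != self.SECTOR_SIZE:
            raise ValueError("data must be exactly one sector")
        self.index[sector] = bytes(data)  # storage is allocated here, on demand

    def read(self, sector):
        self._check(sector)
        # Never-written sectors return zeroes rather than stale data.
        return self.index.get(sector, bytes(self.SECTOR_SIZE))

    def used_bytes(self):
        return len(self.index) * self.SECTOR_SIZE


# A petabyte-sized volume that only consumes storage for what is written.
vol = ThinVolume(virtual_sectors=2**50 // ThinVolume.SECTOR_SIZE)
vol.write(12345, b"\x01" * ThinVolume.SECTOR_SIZE)
print(vol.used_bytes())  # 512
```

Because the space actually used is just the size of the index's contents, reporting it is cheap, which is the "see exactly how much space you're using" property mentioned above.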

There's also been quite a bit of interest in tiered storage. If you've got a truly large amount of storage, it may be beneficial to have a really large RAID60 of large 7200 rpm drives, and a smaller RAID10 of 15k rpm SAS drives. Nobody wants to manually manage what goes where - and keep track of what data gets accessed the most - so if you could use it as one large pool, and have data migrate between them automatically, so much the better. Once overcommitted storage is implemented, tiered storage should actually be quite easy to add.

Bcache has made amazing progress in the past six months, but there's still work to be done to make it truly production ready, and no shortage of features to implement after that. Completing all this is thus contingent on my ability to afford to continue to work full time on it - any funding and/or hardware that could be contributed would be a great help :) I've received some funding, without which bcache would not be as far along as it is, but I'm definitely not fully funded at this time.

Current status

Bcache looks like it's about ready for non critical use - provided you have backups, and test on your particular setup. Recovering from unclean shutdown has now seen quite a bit of stress testing and is working beautifully. Caveats:

Getting started

Suppose you want to cache the root filesystem on your desktop machine; we'd also like a setup that will work for writeback caching. I personally saw the greatest benefits from writethrough caching on my dev machine - writeback didn't add much, since in contrast to a server there's little reason for software to make the user (you) wait on fsyncs. There was the notorious case of Firefox fsyncing the SQLite database in the main UI thread, but that's finally been fixed. Anyway...

From the bcache-tools git repository, use make-bcache to format your cache device. If you're using a partition, I recommend aligning it to your SSD's erase block size; I believe this is typically 512k for most consumer SSDs and 128k for Intel SSDs. Pass the erase block size to make-bcache, too: ./make-bcache -b512k /dev/foo
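
The alignment arithmetic can be sketched as follows. This is a minimal Python sketch, assuming 512 byte logical sectors; align_up and aligned_start_sector are hypothetical helper names and are not part of bcache-tools. It rounds a proposed partition start up to the next erase block boundary.

```python
SECTOR_SIZE = 512  # bytes per logical sector; an assumption for this sketch


def align_up(offset_bytes, erase_block_bytes):
    """Round offset_bytes up to the next multiple of erase_block_bytes."""
    return -(-offset_bytes // erase_block_bytes) * erase_block_bytes


def aligned_start_sector(start_sector, erase_block_bytes):
    """Return the first sector at or after start_sector on an erase block boundary."""
    return align_up(start_sector * SECTOR_SIZE, erase_block_bytes) // SECTOR_SIZE


# Sector 63 (the traditional DOS partition start) is not aligned to 512k.
print(aligned_start_sector(63, 512 * 1024))    # 1024
# Sector 2048 is already aligned, so it is returned unchanged.
print(aligned_start_sector(2048, 512 * 1024))  # 2048
```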

If you're just using writethrough caching you could load the cache device while your root filesystem is mounted read-only, but for writeback it is imperative that your cache device is loaded before your root filesystem is used at all - which means it's got to be done from initramfs. You also need a separate filesystem for /boot, since GRUB won't know anything about the cache.

On Debian/Ubuntu, this is easy enough. You can drop a script into /etc/initramfs-tools/scripts/local-premount, and it'll be included automatically by update-initramfs. Mine looks like this:

echo /dev/sdd > /sys/kernel/bcache/register_cache
echo "fc3085b5-26e5-4881-9cef-03bf2a704d6f /dev/md0" > /sys/kernel/bcache/register_dev

I read afterwards that update-initramfs notices when you use a program in one of your scripts and includes it in the initramfs, so you should be able to use blkid if you wanted to look up your root filesystem by something other than UUID. Better would be to use the links in /dev/disk/by-uuid - I haven't gotten around to trying either of those though.

Take care to ensure that barriers are disabled. Adding nobarrier to /etc/fstab - at least on Debian/Ubuntu - isn't enough, since the initramfs won't see it; you also need to add it to ROOTFLAGS in /etc/initramfs-tools/initramfs.conf. While you're at it, I would suggest switching the filesystem's journal to writeback mode instead of the default ordered mode (this is the ext3/ext4 data journaling mode, not bcache's writeback caching). Particularly on a single user machine there shouldn't be any reason not to, and due to the way bcache flushes writes to the btree the performance boost will be bigger than normal. This switch is easier to make: tune2fs -o journal_data_writeback will change the default in your filesystem's superblock.

That should be all you need to do - once you've got data in your cache, you should notice rebooting goes somewhat faster :)

I'm not doing anything on shutdown; you could disable caching and unload the cache device in a shutdown script - if you're not using writeback caching - but it still has to be done after the filesystem is remounted read only or the cache contents will go stale, which bcache can't detect so you'll have "interesting" problems when you reboot. If you're using writeback caching, you'd have to wait for all the dirty data to be written out to unload your cache device, which could potentially be a long time to wait for a reboot. There shouldn't be any downside to shutting down with bcache still loaded and running, besides the time it takes for the check when you reboot and bcache loads the cache device. At some point in the future it might be nice to have a mechanism to sync certain metadata and mark the cache as clean - analogous to what md does - but it's a fairly low priority. As mentioned before, since bcache doesn't return writes as completed until they really are completed there shouldn't be any need for a bcache specific sync just to make sure everything is written.

Current status

2010-10-08

Got a mailing list - linux-bcache@vger.kernel.org. Feel free to direct anything bcache related there :)

Just implemented UUIDs. Adding a field in the superblock and having make-bcache generate a UUID was easy enough; figuring out how to get udev to use it to generate the /dev/disk/by-uuid symlink was harder. Eventually we'll want the bcache superblock added to libblkid - for now I added a program (probe-bcache) that works analogously to blkid for bcache, and a udev rule to use it. I added a hook for Debian's initramfs tools to pull it all in to the initramfs - not entirely sure I have that part right yet.

Assuming it all works though, you should be able to do "echo /dev/disk/by-uuid/foo > /sys/kernel/bcache/register_cache", and have it work no matter where your cache device pops up that particular boot.

2010-09-16

It's been a while since I've written one of these...

The most recent Sysbench numbers, on an X25-E, have Bcache at around 80% of the bare SSD and 50-60% better than Flashcache (MySQL transactions per second). I'd post the full benchmarks but I didn't run them myself :)

Stabilizing writeback took longer than I expected; Bcache is way out there on the simplicity vs. performance tradeoff. But it's been rock solid for around a week now, under a great deal of torture testing. I'd very much like to know if anyone can break it - I've fixed a lot of bugs that were only possible to trigger in virtual machines running out of RAM, and there's been only one bug I've been able to trigger on real hardware for maybe a month now.

You still shouldn't trust it with real data, at least in writeback mode, pending more work and testing on unclean shutdowns. There are a few relatively minor issues that need to be fixed before it'll work reliably - they all basically come down to ordering writes correctly. After I fix all the known issues I'll start heavily testing how it handles unclean shutdowns.

Besides that, there isn't a whole lot left before it might be production ready - primarily error handling. IO error handling is written but untested; handling memory allocation failures (and avoiding deadlock) will take more work. Those will need testing with md's faulty layer, and fault injection. IO error handling meant I finally had to write the (much belated) code to unregister cached devices and caches; this is in progress now.

Anyone who's willing to do outside testing should feel free to ask for help, point out areas that need documentation, or otherwise provide input. The more testing it sees, the sooner it'll be production ready - I for one am excited to be using it on my dev machine... just as soon as I have another SSD to use :)

2010-08-07

Writeback is looking pretty stable. Definitely needs optimization, but the current dbench numbers don't look too terrible:

Uncached:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     354819     6.975  3546.510
 Close         261513     0.002     9.702
 Rename         14829   374.939  3840.287
 Unlink         71453   280.598  4041.658
 Qpathinfo     322406     0.013    14.727
 Qfileinfo      55752     0.003     3.293
 Qfsinfo        57571     0.173    16.834
 Sfileinfo      29540     4.680  1968.190
 Find          123221     0.033    11.395
 WriteX        174183     0.940  3501.915
 ReadX         542302     0.006     9.178
 LockX           1080     0.003     0.029
 UnlockX         1080     0.002     0.013
 Flush          25160   303.330  4014.527

Throughput 17.9185 MB/sec (sync dirs)  60 clients  60 procs max_latency=4041.664 ms

Cached:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1217617     3.719  1996.974
 Close         894777     0.002    24.008
 Rename         51682    58.830  1853.280
 Unlink        245414    58.024  2029.589
 Qpathinfo    1104565     0.013    30.958
 Qfileinfo     192845     0.003    29.178
 Qfsinfo       202248     0.184    41.389
 Sfileinfo      99487     7.141  1884.748
 Find          426854     0.033    36.754
 WriteX        602274     1.205  1494.175
 ReadX        1912230     0.006    36.247
 LockX           3978     0.003     0.027
 UnlockX         3978     0.002     0.019
 Flush          85282   148.580  2025.378

Throughput 63.3555 MB/sec (sync dirs)  60 clients  60 procs max_latency=2029.597 ms

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1431528     6.358  2741.227
 Close        1053052     0.002    11.128
 Rename         60734    82.006  2836.559
 Unlink        288169    82.373  2920.550
 Qpathinfo    1296364     0.013   297.144
 Qfileinfo     226278     0.003    17.307
 Qfsinfo       237291     0.193    32.547
 Sfileinfo     117622    13.979  2729.844
 Find          500876     0.033    25.412
 WriteX        707939     2.735  2731.899
 ReadX        2239328     0.006    45.265
 LockX           4622     0.003     0.090
 UnlockX         4622     0.002     0.037
 Flush         101075   184.241  2821.352

Throughput 74.105 MB/sec (sync dirs)  100 clients  100 procs max_latency=2920.561 ms

And using just the SSD:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    2209643     3.867   571.200
 Close        1622947     0.001     5.148
 Rename         93834    84.659   779.036
 Unlink        446251    74.769   779.419
 Qpathinfo    2004984     0.009   133.042
 Qfileinfo     349538     0.002    15.865
 Qfsinfo       367418     0.105    13.548
 Sfileinfo     180564     4.850   491.543
 Find          774428     0.022    16.256
 WriteX       1092433     1.073   336.781
 ReadX        3466949     0.005     9.330
 LockX           7194     0.003     0.137
 UnlockX         7194     0.002     0.045
 Flush         154609    51.690   738.412

Throughput 114.893 MB/sec (sync dirs)  100 clients  100 procs max_latency=779.425 ms
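
To put the four throughput figures above side by side, here is a small Python calculation. The numbers are copied from the dbench output; the labels are mine, and it assumes the unlabelled 100 client table is also a cached run.

```python
# Throughput in MB/sec, copied from the dbench runs above.
throughput_mb_s = {
    "uncached, 60 clients": 17.9185,
    "cached, 60 clients": 63.3555,
    "cached, 100 clients": 74.105,  # assumed to be a cached run
    "SSD only, 100 clients": 114.893,
}

speedup = throughput_mb_s["cached, 60 clients"] / throughput_mb_s["uncached, 60 clients"]
fraction_of_ssd = throughput_mb_s["cached, 100 clients"] / throughput_mb_s["SSD only, 100 clients"]

print(f"cached vs. uncached at 60 clients: {speedup:.2f}x")          # 3.54x
print(f"cached vs. bare SSD at 100 clients: {fraction_of_ssd:.0%}")  # 64%
```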

2010-08-03

Writeback is coming along well. Not seeing any data corruption at all, which is awesome. It's definitely not seen enough testing to be trusted, but writeback introduces some subtle cache coherency issues that were cause for concern, so this bodes well. It's not quite stable - one of my VMs went for around 20 hours before a write hung, another has been able to trigger the hang easily, but my test box hasn't had any issues. That probably won't be too hard to fix when I get around to it.

There's a lot of work to be done on performance. It looks like synchronous btree updates are slower than we want them to be, so next up is adding a switch to turn synchronous mode on and off (so writeback can be tested independently of synchronous btree updates, and the reverse), and it looks like I'm going to have to switch to high res timers for the btree write delays, as I suspected (bcache delays btree writes by a small amount to coalesce key insertions). That scheduling related bug, whatever it is, is also going to have to be fixed soon.

The code to actually recover from an unclean shutdown is still missing, too - with synchronous btree updates everything should be consistent, but it'll still invalidate the entire cache. There's a bit of work that has to be done (bucket generations can be slightly out of date, so it has to walk the btree and check for that) but mostly it's just a matter of testing, and going over the code again looking for anything I missed.

The new elevator code appears to be working correctly now, too. I need to look at some blktraces and whatnot to make sure it's not doing anything dumb, but I haven't seen any obvious bugs. There is a performance issue, at least with the deadline scheduler - when the cache has dirty data, it can keep writes queued up to the cached device for so long that reads get starved. Hopefully this can be solved with configuration. Bcache does set the priority on all the writeback io to the idle priority, but only cfq looks at that (which I haven't tried), not deadline, and I've heard people say that deadline is much preferred for servers.

There's also a bug with the way the writeback code submits its IO that to my knowledge only affects RAID devices. That's next up on my list.

At this point though I'd very much like to get other people trying it out, and hopefully helping track down performance bugs and just seeing how well it works. Benchmarks suck for the moment so don't expect to be awed, but that should rapidly improve now.

And if you like what you see and want to contribute to the continued development of bcache, I've got funding for another month but nothing else lined up so far. Hardware would most definitely be welcome - I'd like to be able to test on more SSDs and more architectures.

Bcache also puts a lot of effort into allocation, some of which could be defeated by the SSD's firmware trying to be overly smart. I'd like to get into contact with people who are writing that firmware so we can see if there's anything that could be done to allow bcache and the firmware to cooperate better, or at least not walk all over each other. If anyone has any relevant information or knows who to talk to, please contact me :)

2010-07-30

Initial implementation of writeback is up, in the bcache-dev branch. There's probably still a heisenbug lurking, but I can't reproduce it at the moment; if it locks up for you, that's what you hit. The writeback code itself though should be working, with caveats - dirty data is correctly flagged as such and retained in the cache, but the code to actually write it out is disabled pending more debugging. For now, it just switches to writethrough when less than half of the buckets in the cache can be reclaimed.
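
The fallback rule in the last sentence can be written down in a few lines. This is a minimal Python sketch, not the kernel code: choose_write_mode is a hypothetical name, and it assumes a bucket counts as reclaimable exactly when it holds no dirty data.

```python
def choose_write_mode(bucket_is_dirty):
    """Pick a write mode from a list of per-bucket dirty flags.

    Assumption for this sketch: a bucket can be reclaimed only if it
    holds no dirty data.
    """
    total = len(bucket_is_dirty)
    if total == 0:
        raise ValueError("cache has no buckets")
    reclaimable = sum(1 for dirty in bucket_is_dirty if not dirty)
    # Fall back to writethrough when less than half the buckets can be reclaimed.
    if 2 * reclaimable < total:
        return "writethrough"
    return "writeback"


print(choose_write_mode([True, False, False, False]))  # writeback
print(choose_write_mode([True, True, True, False]))    # writethrough
```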

Lots still to do, but this is a huge amount of progress.

2010-07-13

The version in the bcache branch is as far as I can tell stable; it passes all my torture tests and I haven't been able to break it. It doesn't have barriers or writeback caching.

Writeback caching is in progress; it'll be in the next version posted. Barriers and full IO tracking are also done, though mostly untested. It now uses a hash table to track the 128 most recent IOs, and the tracking is independent of the process doing the IO (so your RAID resync that does sequential IO on each drive in the array will completely bypass the cache).
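
One way to picture process-independent tracking is a table keyed by the offset at which each recent IO ended, so that an IO starting exactly there extends the same sequential run no matter who issued it. The Python sketch below works under that assumption; RecentIOs and its record method are hypothetical names, and the real data structure may well differ.

```python
from collections import OrderedDict


class RecentIOs:
    """Remember where the most recent IOs ended, regardless of who issued them."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.by_end = OrderedDict()  # end offset -> bytes in that sequential run

    def record(self, offset, length):
        """Record an IO and return the length of the sequential run it extends."""
        # If a recent IO ended exactly where this one starts, continue its run.
        run = self.by_end.pop(offset, 0) + length
        end = offset + length
        self.by_end[end] = run
        self.by_end.move_to_end(end)
        if len(self.by_end) > self.capacity:
            self.by_end.popitem(last=False)  # drop the least recently updated run
        return run


# Two interleaved sequential streams are tracked as two separate runs.
recent = RecentIOs()
print(recent.record(0, 4096))          # 4096
print(recent.record(1_000_000, 4096))  # 4096
print(recent.record(4096, 4096))       # 8192
print(recent.record(1_004_096, 4096))  # 8192
```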

With writeback caching, the updates to the btree have to be written to disk before the write can be returned as completed. For writes, when it adds a key to the btree it sets up the required write(s) and sets a timer; any keys that get inserted before it goes off will get written out at the same time. This code is stable, but will need testing to make sure the btree can still be read and contains what it should if the machine is shut down. The timers are currently hardcoded to 10 ms for normal writes and 4 ms for synchronous writes - this might want tweaking later.
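
The coalescing behaviour can be simulated without real timers by passing in the current time explicitly. The following Python sketch is only an illustration: CoalescingWriter is a hypothetical class, and letting a synchronous insert pull an already armed timer earlier is my assumption, not something stated above. The 10 ms and 4 ms delays are the ones quoted in the text.

```python
class CoalescingWriter:
    """Batch key insertions so that one btree write covers several keys."""

    NORMAL_DELAY_MS = 10  # delay for normal writes, from the text above
    SYNC_DELAY_MS = 4     # delay for synchronous writes, from the text above

    def __init__(self):
        self.pending = []     # keys waiting for the next btree write
        self.deadline = None  # time (ms) at which the pending batch is written
        self.flushed = []     # (time_ms, keys) batches already written

    def insert(self, now_ms, key, sync=False):
        self.tick(now_ms)  # write out any batch whose timer has already expired
        delay = self.SYNC_DELAY_MS if sync else self.NORMAL_DELAY_MS
        new_deadline = now_ms + delay
        # The first key of a batch arms the timer. Assumption: a synchronous
        # key may move an already armed timer earlier, never later.
        if self.deadline is None or new_deadline < self.deadline:
            self.deadline = new_deadline
        self.pending.append(key)

    def tick(self, now_ms):
        """Write out the pending batch if its timer has gone off."""
        if self.deadline is not None and now_ms >= self.deadline:
            self.flushed.append((self.deadline, self.pending))
            self.pending = []
            self.deadline = None


writer = CoalescingWriter()
for t, key in [(0, "a"), (3, "b"), (9, "c")]:
    writer.insert(t, key)
writer.tick(10)
print(writer.flushed)  # [(10, ['a', 'b', 'c'])]
```

In a writeback setup the write that inserted "a" could only be reported as completed once the batch containing it has reached the disk, which is why the delay for synchronous writes is the shorter one.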

Writeback caching splits the writes to the cache from the writes to the cached device. A write that doesn't bypass the cache will initially only be written to the cache; later on dirty keys are read from the cache and written out in sorted order. This is the code I'm working on now; it's mostly written but debugging hasn't started.
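
To see why writing dirty data out in sorted order helps a hard drive, here is a small Python sketch. The extents are made-up (offset, length) pairs in sectors, and seek_distance is a crude hypothetical model of head movement, not a measurement.

```python
def writeback_order(dirty_extents):
    """Return dirty extents sorted by their offset on the cached device."""
    return sorted(dirty_extents, key=lambda extent: extent[0])


def seek_distance(extents, start=0):
    """Total distance (in sectors) the head travels between consecutive writes."""
    position, total = start, 0
    for offset, length in extents:
        total += abs(offset - position)
        position = offset + length
    return total


# (offset on cached device, length) in sectors, in the order the writes arrived
dirty = [(9000, 8), (16, 8), (4096, 16), (24, 8)]

print(writeback_order(dirty))                 # [(16, 8), (24, 8), (4096, 16), (9000, 8)]
print(seek_distance(dirty))                   # 26152
print(seek_distance(writeback_order(dirty)))  # 8968
```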

2010-06-27

Benchmarks: 2 TB Western Digital Green drive cached by a 64 GB Corsair Nova

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   536  92 70825   7 53352   7  2785  99 181433  11  1756  15
Latency             14773us    1826ms    3153ms    3918us    2212us   12480us
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 21959  32 21606   2 +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               283us     422us     464us     283us      27us      46us
1.96,1.96,utumno,1,1277677504,16G,,536,92,70825,7,53352,7,2785,99,181433,11,1756,15,16,,,,,21959,32,21606,2,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14773us,1826ms,3153ms,3918us,2212us,12480us,283us,422us,464us,283us,27us,46us

Uncached:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   672  91 68156   7 36398   4  2837  98 102864   5 269.3   2
Latency             14400us    2014ms   12486ms   18666us     549ms     460ms
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 21228  31 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               265us     555us     465us     283us      74us      45us
1.96,1.96,utumno,1,1277675678,16G,,672,91,68156,7,36398,4,2837,98,102864,5,269.3,2,16,,,,,21228,31,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14400us,2014ms,12486ms,18666us,549ms,460ms,265us,555us,465us,283us,74us,45us

On direct IO bcache is getting within 10% of what the SSD can do on cache hits - both sequential and, more importantly, 4k random reads. There are still some performance bugs that seem to primarily affect buffered IO - the bonnie numbers are promising, particularly in that they show improvement across the board (I'm not doing writeback caching yet), but there's much room for improvement in random reads.

I just finished rewriting the btree writing code and some other extensive cleanups; stability looks much improved though there still are some bugs remaining. Ran two test VMs overnight and both survived; the one running 4x bonnies continuously had some ext4 errors at one point, but kept going - so there's definitely still a race somewhere.

And I finally merged code to free unused buckets, and to track sequential IO and bypass the cache. Previously, if data already in the cache was rewritten, the pointer to the old data would be invalidated but the bucket would not be freed any sooner than normal. Now, garbage collection adds up how much of each bucket contains good data, and frees buckets that are less than a quarter full. It's probably possible to be smarter about which buckets we free; in the future I'll write about all the heuristics that could potentially be tuned.
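
The quarter-full rule can be sketched in a few lines of Python. This is an illustration only: buckets_to_free is a hypothetical name, the 1024 sector bucket size is an assumption, and the input is simply a list of (bucket, sectors) pairs for keys that still point at good data.

```python
BUCKET_SECTORS = 1024  # bucket size in sectors; an assumption for this sketch


def buckets_to_free(live_extents, num_buckets, threshold=0.25):
    """Return the buckets whose good data fills less than `threshold` of the bucket.

    live_extents: iterable of (bucket, sectors) pairs, one per still-valid key.
    """
    live = [0] * num_buckets
    for bucket, sectors in live_extents:
        live[bucket] += sectors  # add up how much good data each bucket holds
    return [b for b, used in enumerate(live) if used < threshold * BUCKET_SECTORS]


# Bucket 0 is nearly full, bucket 1 holds 200 sectors, bucket 2 is exactly a
# quarter full, and bucket 3 holds no good data at all.
extents = [(0, 900), (1, 100), (1, 100), (2, 256)]
print(buckets_to_free(extents, num_buckets=4))  # [1, 3]
```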

Bcache also now tracks the most recent IOs it's seen, both reads and writes. It's done this for a while, so that if multiple processes are adding data to the cache their data gets segregated into their own buckets, improving cache locality. I extended this to keep track of how much IO has been done sequentially, and also track the average IO size it's seen for the most recent processes. This means if a process does, say, a 10 MB read, only the first (configurable) megabyte will be added to the cache - after that, reads will be satisfied from the cache if possible but will not be added to the cache. Even better, if you're doing a copy or a backup and the average IO size is above (by default) 512k, it will add nothing to the cache from that process as long as the average stays over the cutoff.
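
The two cutoffs can be sketched as a small per-process tracker. This Python sketch makes several assumptions: IOTracker and should_cache are hypothetical names, the average is a plain mean over every IO the tracker has seen (the real code may weight recent IOs differently), and the decision is made per IO rather than splitting an IO that straddles the one megabyte mark.

```python
SEQUENTIAL_CUTOFF = 1 << 20  # cache only the first megabyte of a sequential run
AVG_SIZE_CUTOFF = 512 << 10  # bypass while the average IO size is above 512k


class IOTracker:
    """Per-process state used to decide whether an IO is added to the cache."""

    def __init__(self):
        self.last_end = None  # byte offset just past the previous IO
        self.sequential = 0   # bytes in the current sequential run
        self.total_bytes = 0
        self.count = 0

    def should_cache(self, offset, length):
        if offset == self.last_end:
            self.sequential += length  # continues the previous IO
        else:
            self.sequential = length   # a seek starts a new run
        self.last_end = offset + length
        self.total_bytes += length
        self.count += 1
        if self.total_bytes / self.count > AVG_SIZE_CUTOFF:
            return False  # large average IO size: looks like a copy or backup
        return self.sequential <= SEQUENTIAL_CUTOFF


# A 10 MB sequential read issued as 40 IOs of 256k: only the first megabyte
# (four IOs) is added to the cache.
reader = IOTracker()
decisions = [reader.should_cache(i * 256 * 1024, 256 * 1024) for i in range(40)]
print(decisions.count(True))  # 4
```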

There's still a fair amount that can be done here. Currently it's only possible to track one IO per process - in the future this limitation will be removed, meaning if you're caching the drives that make up a RAID5/6, a RAID resync will completely bypass the cache. Also, the number of IOs that can be tracked is limited due to it using a linked list - in the future I'll have to convert it to a heap/hash table or red-black tree, so if you've got a busy server you can track the most recent several hundred IOs. Also, currently writes are always added to the cache - this is because there isn't yet a mechanism to invalidate part of the cache, but that's needed for a number of things and will happen before too long.

Questions? Comments? Feel free to register and add them, or email me - kent.overstreet@gmail.com

BcacheWiki: Bcache (last edited 2011-07-12 03:03:28 by Kent)