Got a mailing list - email@example.com. Feel free to direct anything bcache related there :)
Just implemented UUIDs. Adding a field to the superblock and having make-bcache generate a UUID was easy enough; figuring out how to get udev to use it to generate the /dev/disk/by-uuid symlink was harder. Eventually we'll want the bcache superblock added to libblkid - for now I added a program (probe-bcache) that works analogously to blkid for bcache, and a udev rule to use it. I added a hook for Debian's initramfs-tools to pull it all into the initramfs - not entirely sure I have that part right yet.
Assuming it all works, though, you should be able to do "echo /dev/disk/by-uuid/foo > /sys/kernel/bcache/register_cache" and have it work no matter where your cache device pops up on that particular boot.
It's been a while since I've written one of these...
The most recent Sysbench numbers, on an X25-E, have Bcache at around 80% of the bare SSD and 50-60% better than Flashcache (MySQL transactions per second). I'd post the full benchmarks but I didn't run them myself :)
Stabilizing writeback took longer than I expected; Bcache is way out there on the simplicity vs. performance tradeoff. But it's been rock solid for around a week now, under a great deal of torture testing. I'd very much like to know if anyone can break it - I've fixed a lot of bugs that could only be triggered in virtual machines running out of RAM, and for maybe a month now there's been only one bug I've been able to trigger on real hardware.
You still shouldn't trust it with real data, at least in writeback mode, pending more work and testing on unclean shutdowns. There are a few relatively minor issues that need to be fixed before it'll work reliably - they all come down to ordering writes correctly. Once I've fixed the known issues I'll start heavily testing how it handles unclean shutdowns.
Besides that, there isn't a whole lot left before it might be production ready - primarily error handling. IO error handling is written but untested, handling memory allocation failures (and avoiding deadlock) will take more work. Those will need testing with md's faulty layer, and fault injection. IO error handling meant I finally had to write the (much belated) code to unregister cached devices and caches; this is in progress now.
Anyone who's willing to do outside testing should feel free to ask for help, point out areas that need documentation, or otherwise provide input. The more testing it sees, the sooner it'll be production ready - I for one am excited to be using it on my dev machine... just as soon as I have another SSD to use :)
Writeback is looking pretty stable. Definitely needs optimization, but the current dbench numbers don't look too terrible:
Uncached:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     354819     6.975  3546.510
 Close         261513     0.002     9.702
 Rename         14829   374.939  3840.287
 Unlink         71453   280.598  4041.658
 Qpathinfo     322406     0.013    14.727
 Qfileinfo      55752     0.003     3.293
 Qfsinfo        57571     0.173    16.834
 Sfileinfo      29540     4.680  1968.190
 Find          123221     0.033    11.395
 WriteX        174183     0.940  3501.915
 ReadX         542302     0.006     9.178
 LockX           1080     0.003     0.029
 UnlockX         1080     0.002     0.013
 Flush          25160   303.330  4014.527

Throughput 17.9185 MB/sec (sync dirs) 60 clients 60 procs max_latency=4041.664 ms

Cached:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1217617     3.719  1996.974
 Close         894777     0.002    24.008
 Rename         51682    58.830  1853.280
 Unlink        245414    58.024  2029.589
 Qpathinfo    1104565     0.013    30.958
 Qfileinfo     192845     0.003    29.178
 Qfsinfo       202248     0.184    41.389
 Sfileinfo      99487     7.141  1884.748
 Find          426854     0.033    36.754
 WriteX        602274     1.205  1494.175
 ReadX        1912230     0.006    36.247
 LockX           3978     0.003     0.027
 UnlockX         3978     0.002     0.019
 Flush          85282   148.580  2025.378

Throughput 63.3555 MB/sec (sync dirs) 60 clients 60 procs max_latency=2029.597 ms

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1431528     6.358  2741.227
 Close        1053052     0.002    11.128
 Rename         60734    82.006  2836.559
 Unlink        288169    82.373  2920.550
 Qpathinfo    1296364     0.013   297.144
 Qfileinfo     226278     0.003    17.307
 Qfsinfo       237291     0.193    32.547
 Sfileinfo     117622    13.979  2729.844
 Find          500876     0.033    25.412
 WriteX        707939     2.735  2731.899
 ReadX        2239328     0.006    45.265
 LockX           4622     0.003     0.090
 UnlockX         4622     0.002     0.037
 Flush         101075   184.241  2821.352

Throughput 74.105 MB/sec (sync dirs) 100 clients 100 procs max_latency=2920.561 ms

And using just the SSD:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    2209643     3.867   571.200
 Close        1622947     0.001     5.148
 Rename         93834    84.659   779.036
 Unlink        446251    74.769   779.419
 Qpathinfo    2004984     0.009   133.042
 Qfileinfo     349538     0.002    15.865
 Qfsinfo       367418     0.105    13.548
 Sfileinfo     180564     4.850   491.543
 Find          774428     0.022    16.256
 WriteX       1092433     1.073   336.781
 ReadX        3466949     0.005     9.330
 LockX           7194     0.003     0.137
 UnlockX         7194     0.002     0.045
 Flush         154609    51.690   738.412

Throughput 114.893 MB/sec (sync dirs) 100 clients 100 procs max_latency=779.425 ms
Writeback is coming along well. Not seeing any data corruption at all, which is awesome. It definitely hasn't seen enough testing to be trusted, but writeback introduces some subtle cache coherency issues that were cause for concern, so this bodes well. It's not quite stable - one of my VMs went for around 20 hours before a write hung, another has been able to trigger the hang easily, but my test box hasn't had any issues. That probably won't be too hard to fix when I get around to it.
There's a lot of work to be done on performance. Synchronous btree updates look slower than we want them to be, so next up is adding a switch to turn synchronous mode on and off (so writeback can be tested independently of synchronous btree updates, and vice versa). It also looks like I'm going to have to switch to high-resolution timers for the btree write delays, as I suspected (bcache delays btree writes by a small amount to coalesce key insertions). That scheduling-related bug, whatever it is, will also have to be fixed soon.
The code to actually recover from an unclean shutdown is still missing, too - with synchronous btree updates everything should be consistent, but it'll still invalidate the entire cache. There's a bit of work that has to be done (bucket generations can be slightly out of date, so it has to walk the btree and check for that) but mostly it's just a matter of testing, and going over the code again looking for anything I missed.
The new elevator code appears to be working correctly now, too. I need to look at some blktraces and whatnot to make sure it's not doing anything dumb, but I haven't seen any obvious bugs. There is a performance issue, at least with the deadline scheduler: when the cache has dirty data, it can keep writes queued up to the cached device for so long that reads get starved. Hopefully this can be solved with configuration. Bcache does set all the writeback IO to the idle priority, but only cfq looks at that (and I haven't tried cfq), not deadline - and I've heard people say deadline is much preferred for servers.
There's also a bug in the way the writeback code submits its IO that, to my knowledge, only affects raid devices. That's next up on my list.
At this point though I'd very much like to get other people trying it out, and hopefully helping track down performance bugs and just seeing how well it works. Benchmarks suck for the moment so don't expect to be awed, but that should rapidly improve now.
And if you like what you see and want to contribute to the continued development of bcache, I've got funding for another month but nothing else lined up so far. Hardware would most definitely be welcome - I'd like to be able to test on more SSDs and more architectures.
Bcache also puts a lot of effort into allocation, some of which could be defeated by the SSD's firmware trying to be overly smart. I'd like to get into contact with people who are writing that firmware so we can see if there's anything that could be done to allow bcache and the firmware to cooperate better, or at least not walk all over each other. If anyone has any relevant information or knows who to talk to, please contact me :)
Initial implementation of writeback is up, in the bcache-dev branch. There's probably still a heisenbug lurking, but I can't reproduce it at the moment; if it locks up for you, that's what you hit. The writeback code itself though should be working, with caveats - dirty data is correctly flagged as such and retained in the cache, but the code to actually write it out is disabled pending more debugging. For now, it just switches to writethrough when less than half of the buckets in the cache can be reclaimed.
Lots still to do, but this is a huge amount of progress.
The version in the bcache branch is as far as I can tell stable; it passes all my torture tests and I haven't been able to break it. It doesn't have barriers or writeback caching.
Writeback caching is in progress; it'll be in the next version posted. Barriers and full IO tracking are also done, though mostly untested. It now uses a hash table to track the 128 most recent IOs, independent of the process doing them (so your raid resync that does sequential IO on each drive in the array will completely bypass the cache).
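To make the process-independent tracking concrete, here's a rough sketch in Python (invented names - bcache itself is kernel C, and this is just the idea, not its implementation): each IO is recorded in a fixed-size table keyed by the offset where the next sequential request would start, so a sequential stream is recognized no matter which process submits each piece.

```python
class RecentIOTable:
    """Track the N most recent IOs, keyed by the offset where the next
    sequential request would start. Because the key is an offset rather
    than a process id, the tracking is process-independent."""

    def __init__(self, max_ios=128):
        self.max_ios = max_ios
        self.table = {}   # next expected offset -> sequential bytes so far
        self.order = []   # FIFO of keys, for crude eviction

    def submit(self, offset, size):
        """Record an IO; return total sequential bytes seen for its stream."""
        # If some earlier IO ended exactly here, continue that stream.
        seq = self.table.pop(offset, 0) + size
        if offset in self.order:
            self.order.remove(offset)
        nxt = offset + size
        self.table[nxt] = seq
        self.order.append(nxt)
        # Evict the oldest entry once the table is full.
        if len(self.order) > self.max_ios:
            old = self.order.pop(0)
            self.table.pop(old, None)
        return seq
```

Note that two processes alternating requests on the same stream (as in a raid resync reading each member drive) still accumulate one sequential count per stream, which is exactly what lets the bypass logic catch them.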
With writeback caching, the updates to the btree have to be written to disk before the write can be returned as completed. When bcache adds a key to the btree, it sets up the required write(s) and sets a timer; any keys that get inserted before the timer goes off will get written out at the same time. This code is stable, but will need testing to make sure the btree can still be read and contains what it should if the machine is shut down. The timers are currently hardcoded to 10 ms for normal writes and 4 ms for synchronous writes - this might want tweaking later.
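The timer-based coalescing can be sketched like this (a simplified Python model with invented names, driven by explicit clock ticks rather than real kernel timers; the 10 ms / 4 ms delays are the hardcoded values from above):

```python
class CoalescingWriter:
    """Delay btree node writes briefly so keys inserted close together
    in time go to disk in a single write."""

    NORMAL_DELAY = 0.010   # 10 ms for normal writes
    SYNC_DELAY = 0.004     # 4 ms for synchronous writes

    def __init__(self):
        self.now = 0.0
        self.pending = []     # keys waiting to be written
        self.deadline = None  # when the pending batch must go out
        self.batches = []     # record of what got written together

    def insert(self, key, sync=False):
        self.pending.append(key)
        # A sync insert may pull the deadline earlier, never push it later.
        d = self.now + (self.SYNC_DELAY if sync else self.NORMAL_DELAY)
        if self.deadline is None or d < self.deadline:
            self.deadline = d

    def tick(self, now):
        """Advance the clock; flush the batch if its deadline passed."""
        self.now = now
        if self.deadline is not None and now >= self.deadline:
            self.batches.append(self.pending)
            self.pending, self.deadline = [], None
```

The point of the design is visible in the model: a key inserted 5 ms after the first one rides along in the same write instead of costing a second one.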
Writeback caching splits the writes to the cache from the writes to the cached device. A write that doesn't bypass the cache will initially only be written to the cache; later on, dirty keys are read from the cache and written out in sorted order. This is the code I'm working on now; it's mostly written but debugging hasn't started.
Benchmarks: 2 TB Western Digital Green drive cached by a 64 GB Corsair Nova
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   536  92 70825   7 53352   7  2785  99 181433  11  1756  15
Latency             14773us    1826ms    3153ms    3918us    2212us   12480us
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 21959  32 21606   2 +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               283us     422us     464us     283us      27us      46us
1.96,1.96,utumno,1,1277677504,16G,,536,92,70825,7,53352,7,2785,99,181433,11,1756,15,16,,,,,21959,32,21606,2,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14773us,1826ms,3153ms,3918us,2212us,12480us,283us,422us,464us,283us,27us,46us
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   672  91 68156   7 36398   4  2837  98 102864   5 269.3   2
Latency             14400us    2014ms   12486ms   18666us     549ms     460ms
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 21228  31 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               265us     555us     465us     283us      74us      45us
1.96,1.96,utumno,1,1277675678,16G,,672,91,68156,7,36398,4,2837,98,102864,5,269.3,2,16,,,,,21228,31,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14400us,2014ms,12486ms,18666us,549ms,460ms,265us,555us,465us,283us,74us,45us
On direct IO bcache is getting within 10% of what the SSD can do on cache hits - both sequential and, more importantly, 4k random reads. There are still some performance bugs that seem to primarily affect buffered IO - the bonnie numbers are promising, particularly in that they show improvement across the board (I'm not doing writeback caching yet), but there's much room for improvement in random reads.
I just finished rewriting the btree writing code, along with some other extensive cleanups; stability looks much improved, though some bugs remain. I ran two test VMs overnight and both survived; the one running 4x bonnies continuously had some ext4 errors at one point but kept going - so there's definitely still a race somewhere.
And I finally merged code to free unused buckets, and to track sequential IO and bypass the cache. Previously, if data already in the cache was rewritten, the pointer to the old data would be invalidated but the bucket would not be freed any sooner than normal. Now, garbage collection adds up how much of each bucket contains good data, and frees buckets that are less than a quarter full. It's probably possible to be smarter about which buckets we free, in the future I'll write about all the heuristics that could potentially be tuned.
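The garbage collection pass described above boils down to a simple accounting rule. A rough sketch in Python (illustrative only - the names and data layout are invented, not bcache's actual structures):

```python
def buckets_to_free(buckets, bucket_size, cutoff=0.25):
    """Garbage collection pass: add up the live data in each bucket and
    return the ids of buckets less than `cutoff` full, which can be
    reclaimed ahead of the normal rotation.

    `buckets` maps bucket id -> list of (key, size_bytes, live) extents,
    where live=False means the pointer was invalidated by a rewrite."""
    freed = []
    for bucket_id, extents in buckets.items():
        good = sum(size for _key, size, live in extents if live)
        if good < cutoff * bucket_size:
            freed.append(bucket_id)
    return freed
```

With the default cutoff, a 1024-byte bucket survives GC only if at least 256 bytes of it still point at good data; everything below that gets its (small amount of) live data left to be re-cached on demand and the bucket reused.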
Bcache also now tracks the most recent IOs it's seen, both reads and writes. It's done this for a while, so that if multiple processes are adding data to the cache, each process's data gets segregated into its own buckets, improving cache locality. I extended this to keep track of how much IO has been done sequentially, and also to track the average IO size seen for the most recent processes. This means that if a process does, say, a 10 MB read, only the first (configurable) megabyte will be added to the cache - after that, reads will be satisfied from the cache if possible but won't be added to it. Even better, if you're doing a copy or a backup and the average IO size is above (by default) 512k, nothing from that process will be added to the cache as long as the average stays over the cutoff.
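The two cutoffs combine into one should-we-cache-this decision per read. A minimal Python sketch (names and exact semantics are my assumptions, not bcache's code; the 1 MB and 512k defaults are the values mentioned above):

```python
def should_cache(stream_seq_bytes, avg_io_size,
                 seq_cutoff=1 << 20,      # cache only the first 1 MB of a stream
                 avg_cutoff=512 << 10):   # average IO size cutoff: 512k
    """Decide whether a read's data should be added to the cache.

    stream_seq_bytes: how much sequential IO this stream has done so far
    avg_io_size: the submitting process's recent average IO size
    """
    # A copy or backup workload keeps its average IO size high:
    # add nothing from it, ever.
    if avg_io_size > avg_cutoff:
        return False
    # Otherwise cache only the leading part of any sequential stream;
    # later reads are served from the cache if present but not added.
    return stream_seq_bytes <= seq_cutoff
```

So a random-IO workload (small average, short streams) always caches, while a 10 MB sequential read from an otherwise random workload caches only its first megabyte.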
There's still a fair amount that can be done here. Currently it's only possible to track one IO per process - in the future this limitation will be removed, so that if you're caching the drives that make up a raid5/6, a raid resync will completely bypass the cache. The number of IOs that can be tracked is also limited by its use of a linked list - eventually I'll have to convert it to a heap/hash table or red/black tree, so that a busy server can track the most recent several hundred IOs. And currently writes are always added to the cache; that's because there isn't yet a mechanism to invalidate part of the cache, but one is needed for a number of things and will happen before too long.