2010-10-08
Got a mailing list - linux-bcache@vger.kernel.org. Feel free to direct anything bcache related there :)
Just implemented UUIDs. Adding a field to the superblock and having make-bcache generate a UUID was easy enough; figuring out how to get udev to use it to generate the /dev/disk/by-uuid symlink was harder. Eventually we'll want the bcache superblock added to libblkid - for now I added a program (probe-bcache) that works analogously to blkid for bcache, and a udev rule to use it. I added a hook for Debian's initramfs-tools to pull it all into the initramfs - not entirely sure I have that part right yet.
Assuming it all works though, you should be able to do "echo /dev/disk/by-uuid/foo > /sys/kernel/bcache/register_cache", and have it work no matter where your cache device pops up that particular boot.
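The flow above can be sketched in a few lines. This is a hedged illustration of what the udev rule plus the register step amount to, not bcache's actual tooling: resolve the stable by-uuid symlink to whatever device node the cache landed on this boot, then hand that path to the sysfs register file. The sysfs path comes from the text; the function name, parameters, and UUID are illustrative.

```python
import os

def register_cache_by_uuid(uuid,
                           by_uuid_dir="/dev/disk/by-uuid",
                           register_path="/sys/kernel/bcache/register_cache"):
    """Resolve the stable symlink and register whatever device it points at."""
    link = os.path.join(by_uuid_dir, uuid)
    dev = os.path.realpath(link)   # follow the symlink to today's device node
    with open(register_path, "w") as f:
        f.write(dev + "\n")        # same effect as: echo $dev > register_cache
    return dev
```

The point of the indirection is exactly the one the entry makes: the device node can move between boots, but the by-uuid symlink (maintained by the probe-bcache udev rule) always points at the right one.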
2010-09-16
It's been a while since I've written one of these...
The most recent Sysbench numbers, on an X25-E, have Bcache at around 80% of the bare SSD and 50-60% better than Flashcache (MySQL transactions per second). I'd post the full benchmarks but I didn't run them myself :)
Stabilizing writeback took longer than I expected; Bcache is way out there on the simplicity vs. performance tradeoff. But it's been rock solid for around a week now, under a great deal of torture testing. I'd very much like to know if anyone can break it - I've fixed a lot of bugs that could only be triggered in virtual machines running out of RAM, and there's been only one bug I've been able to trigger on real hardware for maybe a month now.
You still shouldn't trust it with real data, at least in writeback mode, pending more work and testing on unclean shutdowns. There are a few relatively minor issues that need to be fixed before it'll work reliably - they all come down to ordering writes correctly. After I fix the known issues I'll start heavily testing how it handles unclean shutdowns.
Besides that, there isn't a whole lot left before it might be production ready - primarily error handling. IO error handling is written but untested, handling memory allocation failures (and avoiding deadlock) will take more work. Those will need testing with md's faulty layer, and fault injection. IO error handling meant I finally had to write the (much belated) code to unregister cached devices and caches; this is in progress now.
Anyone who's willing to do outside testing should feel free to ask for help, point out areas that need documentation, or otherwise provide input. The more testing it sees, the sooner it'll be production ready - I for one am excited to be using it on my dev machine... just as soon as I have another SSD to use :)
2010-08-07
Writeback is looking pretty stable. Definitely needs optimization, but the current dbench numbers don't look too terrible:
Uncached:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 354819 6.975 3546.510
Close 261513 0.002 9.702
Rename 14829 374.939 3840.287
Unlink 71453 280.598 4041.658
Qpathinfo 322406 0.013 14.727
Qfileinfo 55752 0.003 3.293
Qfsinfo 57571 0.173 16.834
Sfileinfo 29540 4.680 1968.190
Find 123221 0.033 11.395
WriteX 174183 0.940 3501.915
ReadX 542302 0.006 9.178
LockX 1080 0.003 0.029
UnlockX 1080 0.002 0.013
Flush 25160 303.330 4014.527
Throughput 17.9185 MB/sec (sync dirs) 60 clients 60 procs max_latency=4041.664 ms
Cached, 60 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 1217617 3.719 1996.974
Close 894777 0.002 24.008
Rename 51682 58.830 1853.280
Unlink 245414 58.024 2029.589
Qpathinfo 1104565 0.013 30.958
Qfileinfo 192845 0.003 29.178
Qfsinfo 202248 0.184 41.389
Sfileinfo 99487 7.141 1884.748
Find 426854 0.033 36.754
WriteX 602274 1.205 1494.175
ReadX 1912230 0.006 36.247
LockX 3978 0.003 0.027
UnlockX 3978 0.002 0.019
Flush 85282 148.580 2025.378
Throughput 63.3555 MB/sec (sync dirs) 60 clients 60 procs max_latency=2029.597 ms
Cached, 100 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 1431528 6.358 2741.227
Close 1053052 0.002 11.128
Rename 60734 82.006 2836.559
Unlink 288169 82.373 2920.550
Qpathinfo 1296364 0.013 297.144
Qfileinfo 226278 0.003 17.307
Qfsinfo 237291 0.193 32.547
Sfileinfo 117622 13.979 2729.844
Find 500876 0.033 25.412
WriteX 707939 2.735 2731.899
ReadX 2239328 0.006 45.265
LockX 4622 0.003 0.090
UnlockX 4622 0.002 0.037
Flush 101075 184.241 2821.352
Throughput 74.105 MB/sec (sync dirs) 100 clients 100 procs max_latency=2920.561 ms
And using just the SSD:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 2209643 3.867 571.200
Close 1622947 0.001 5.148
Rename 93834 84.659 779.036
Unlink 446251 74.769 779.419
Qpathinfo 2004984 0.009 133.042
Qfileinfo 349538 0.002 15.865
Qfsinfo 367418 0.105 13.548
Sfileinfo 180564 4.850 491.543
Find 774428 0.022 16.256
WriteX 1092433 1.073 336.781
ReadX 3466949 0.005 9.330
LockX 7194 0.003 0.137
UnlockX 7194 0.002 0.045
Flush 154609 51.690 738.412
Throughput 114.893 MB/sec (sync dirs) 100 clients 100 procs max_latency=779.425 ms
2010-08-03
Writeback is coming along well. Not seeing any data corruption at all, which is awesome. It definitely hasn't seen enough testing to be trusted, but writeback introduces some subtle cache coherency issues that were cause for concern, so this bodes well. It's not quite stable - one of my VMs went for around 20 hours before a write hung, another has been able to trigger the hang easily, but my test box hasn't had any issues. That probably won't be too hard to fix when I get around to it.
There's a lot of work to be done on performance. Synchronous btree updates look slower than they should be, so next up is adding a switch to turn synchronous mode on and off (so writeback can be tested independently of synchronous btree updates, and vice versa). It also looks like I'm going to have to switch to high-resolution timers for the btree write delays, as I suspected (bcache delays btree writes by a small amount to coalesce key insertions). That scheduling-related bug, whatever it is, is also going to have to be fixed soon.
The code to actually recover from an unclean shutdown is still missing, too - with synchronous btree updates everything should be consistent, but it'll still invalidate the entire cache. There's a bit of work that has to be done (bucket generations can be slightly out of date, so it has to walk the btree and check for that) but mostly it's just a matter of testing, and going over the code again looking for anything I missed.
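The generation check described above can be sketched like this. This is a hedged toy model, not bcache's actual structures: after an unclean shutdown, on-disk bucket generations can be slightly stale, so a walk of the btree has to drop any pointer whose generation no longer matches its bucket's current generation (a mismatch means the bucket was reclaimed and reused). The key/dict layout here is purely illustrative.

```python
def prune_stale_pointers(btree_keys, bucket_gens):
    """btree_keys: list of {'bucket': int, 'gen': int} (illustrative layout).
    bucket_gens: mapping of bucket number -> current generation."""
    live = []
    for key in btree_keys:
        if bucket_gens.get(key["bucket"]) == key["gen"]:
            live.append(key)   # generation matches: pointer still valid
        # else: the bucket was reused since this key was written; drop it
    return live
```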
The new elevator code appears to be working correctly now, too. I need to look at some blktraces and whatnot to make sure it's not doing anything dumb, but I haven't seen any obvious bugs. There is a performance issue, at least with the deadline scheduler - when the cache has dirty data, it can keep writes queued up to the cached device for so long that reads get starved. Hopefully this can be solved with configuration. Bcache does set all the writeback IO to idle priority, but only cfq honors that (which I haven't tried), not deadline - and I've heard people say deadline is much preferred for servers.
There's also a bug in the way the writeback code submits its IO that, to my knowledge, only affects raid devices. That's next up on my list.
At this point though I'd very much like to get other people trying it out, and hopefully helping track down performance bugs and just seeing how well it works. Benchmarks suck for the moment so don't expect to be awed, but that should rapidly improve now.
And if you like what you see and want to contribute to the continued development of bcache, I've got funding for another month but nothing else lined up so far. Hardware would most definitely be welcome - I'd like to be able to test on more SSDs and more architectures.
Bcache also puts a lot of effort into allocation, some of which could be defeated by the SSD's firmware trying to be overly smart. I'd like to get into contact with people who are writing that firmware so we can see if there's anything that could be done to allow bcache and the firmware to cooperate better, or at least not walk all over each other. If anyone has any relevant information or knows who to talk to, please contact me :)
2010-07-30
Initial implementation of writeback is up, in the bcache-dev branch. There's probably still a heisenbug lurking, but I can't reproduce it at the moment; if it locks up for you, that's what you hit. The writeback code itself though should be working, with caveats - dirty data is correctly flagged as such and retained in the cache, but the code to actually write it out is disabled pending more debugging. For now, it just switches to writethrough when less than half of the buckets in the cache can be reclaimed.
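The fallback described above fits in a few lines. A minimal sketch, with illustrative names: stay in writeback mode only while at least half the cache's buckets can be reclaimed; otherwise drop to writethrough so no more dirty data accumulates while the writeout path is still disabled.

```python
def cache_mode(reclaimable_buckets, total_buckets):
    """Fall back to writethrough when fewer than half the buckets
    in the cache can be reclaimed (per the text's threshold)."""
    if reclaimable_buckets * 2 < total_buckets:
        return "writethrough"
    return "writeback"
```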
Lots still to do, but this is a huge amount of progress.
2010-07-13
The version in the bcache branch is as far as I can tell stable; it passes all my torture tests and I haven't been able to break it. It doesn't have barriers or writeback caching.
Writeback caching is in progress; it'll be in the next version posted. Barriers and full IO tracking are also done, though mostly untested. It now uses a hash table to track the 128 most recent IOs, independent of the process doing them (so a raid resync that does sequential IO on each drive in the array will completely bypass the cache).
With writeback caching, updates to the btree have to be written to disk before the write can be returned as completed. When a write adds a key to the btree, bcache sets up the required write(s) and sets a timer; any keys inserted before it goes off get written out at the same time. This code is stable, but will need testing to make sure the btree can still be read, and contains what it should, after the machine is shut down. The timers are currently hardcoded to 10 ms for normal writes and 4 ms for synchronous writes - this might want tweaking later.
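The timer-coalescing idea above can be modeled in miniature. This is a hedged toy version, not the kernel code: the first insertion arms a deadline (10 ms normal, 4 ms sync, per the text), keys inserted before it fires are flushed together in one write, and the kernel timer is replaced here by an explicit `tick(now)` call so it can run synchronously.

```python
class CoalescingBtreeWriter:
    NORMAL_DELAY = 0.010   # 10 ms, per the text
    SYNC_DELAY = 0.004     #  4 ms, per the text

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn   # called with the batch of pending keys
        self.pending = []
        self.deadline = None

    def insert(self, key, now, sync=False):
        delay = self.SYNC_DELAY if sync else self.NORMAL_DELAY
        self.pending.append(key)
        if self.deadline is None:
            self.deadline = now + delay
        else:
            # a synchronous insert may pull an armed deadline earlier
            self.deadline = min(self.deadline, now + delay)

    def tick(self, now):
        # stands in for the timer firing
        if self.deadline is not None and now >= self.deadline:
            self.flush_fn(list(self.pending))
            self.pending.clear()
            self.deadline = None
```

The tradeoff this models is the one the entry describes: each write pays up to one timer delay of latency, and in exchange every key inserted inside that window shares a single btree write.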
Writeback caching splits the writes to the cache from the writes to the cached device. A write that doesn't bypass the cache is initially written only to the cache; later on, dirty keys are read from the cache and written out in sorted order. This is the code I'm working on now - it's mostly written, but debugging hasn't started.
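A rough model of that split, with illustrative data structures (the real cache is on an SSD, not a dict): writes land in the cache marked dirty, and a background pass later walks the dirty keys in sorted offset order and writes them to the backing device, turning scattered writes into a mostly sequential pass.

```python
def flush_dirty(cache, backing):
    """cache: {offset: (data, dirty_flag)}; backing: {offset: data}.
    Write dirty entries to the backing device in sorted offset order,
    then mark them clean (the data stays in the cache)."""
    for offset in sorted(cache):
        data, dirty = cache[offset]
        if dirty:
            backing[offset] = data
            cache[offset] = (data, False)
```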
2010-06-27
Benchmarks (bonnie++): 2 TB Western Digital Green drive cached by a 64 GB Corsair Nova
Cached:
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
utumno 16G 536 92 70825 7 53352 7 2785 99 181433 11 1756 15
Latency 14773us 1826ms 3153ms 3918us 2212us 12480us
Version 1.96 ------Sequential Create------ --------Random Create--------
utumno -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 21959 32 21606 2 +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency 283us 422us 464us 283us 27us 46us
1.96,1.96,utumno,1,1277677504,16G,,536,92,70825,7,53352,7,2785,99,181433,11,1756,15,16,,,,,21959,32,21606,2,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14773us,1826ms,3153ms,3918us,2212us,12480us,283us,422us,464us,283us,27us,46us
Uncached:
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
utumno 16G 672 91 68156 7 36398 4 2837 98 102864 5 269.3 2
Latency 14400us 2014ms 12486ms 18666us 549ms 460ms
Version 1.96 ------Sequential Create------ --------Random Create--------
utumno -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 21228 31 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency 265us 555us 465us 283us 74us 45us
1.96,1.96,utumno,1,1277675678,16G,,672,91,68156,7,36398,4,2837,98,102864,5,269.3,2,16,,,,,21228,31,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14400us,2014ms,12486ms,18666us,549ms,460ms,265us,555us,465us,283us,74us,45us
On direct IO bcache is getting within 10% of what the SSD can do on cache hits - both sequential and, more importantly, 4k random reads. There are still some performance bugs that seem to primarily affect buffered IO - the bonnie numbers are promising, particularly in that they show improvement across the board (I'm not doing writeback caching yet), but there's much room for improvement in random reads.
I just finished rewriting the btree writing code, along with some other extensive cleanups; stability looks much improved, though some bugs remain. I ran two test VMs overnight and both survived; the one running 4x bonnies continuously hit some ext4 errors at one point but kept going - so there's definitely still a race somewhere.
And I finally merged code to free unused buckets, and to track sequential IO and bypass the cache. Previously, if data already in the cache was rewritten, the pointer to the old data would be invalidated but the bucket would not be freed any sooner than normal. Now, garbage collection adds up how much of each bucket contains good data, and frees buckets that are less than a quarter full. It's probably possible to be smarter about which buckets we free; in the future I'll write about all the heuristics that could potentially be tuned.
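The garbage-collection rule above is easy to sketch. A hedged illustration, not bcache's actual pass: sum the good (still-referenced) data per bucket, and free every bucket that is less than a quarter full - including buckets with no live data at all. Names and structures are illustrative.

```python
def buckets_to_free(all_buckets, live_keys, bucket_size):
    """all_buckets: iterable of bucket numbers.
    live_keys: iterable of (bucket, length) pairs for still-good data.
    Returns the buckets less than a quarter full of good data."""
    good = {b: 0 for b in all_buckets}
    for bucket, length in live_keys:
        good[bucket] += length
    return sorted(b for b, total in good.items() if total < bucket_size / 4)
```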
Bcache also now tracks the most recent IOs it's seen, both reads and writes. It's done this for a while, so that if multiple processes are adding data to the cache, their data gets segregated into their own buckets, improving cache locality. I extended this to keep track of how much IO has been done sequentially, and to track the average IO size seen for the most recent processes. This means that if a process does, say, a 10 MB read, only the first (configurable) megabyte will be added to the cache - after that, reads will be satisfied from the cache if possible but will not be added to it. Even better, if you're doing a copy or a backup and the average IO size is above (by default) 512k, nothing from that process will be added to the cache as long as the average stays over the cutoff.
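A toy version of that heuristic, one tracker per process (matching the current one-IO-per-process limitation): track where the last IO ended and a running average IO size, cache only the first megabyte of a sequential run, and bypass entirely once the average crosses the 512k cutoff. The two thresholds come from the text; the class, the exact running-average formula, and everything else are illustrative assumptions.

```python
SEQ_CACHE_LIMIT = 1 << 20      # cache at most the first 1 MB of a run (configurable)
AVG_SIZE_CUTOFF = 512 << 10    # default 512k average-IO-size cutoff

class IOTracker:
    def __init__(self):
        self.next_offset = None    # where the last IO ended
        self.seq_bytes = 0         # length of the current sequential run
        self.avg_size = 0          # running average IO size

    def should_cache(self, offset, size):
        """Decide whether this IO's data should be added to the cache."""
        if offset == self.next_offset:
            self.seq_bytes += size     # continues the previous IO
        else:
            self.seq_bytes = size      # new run starts here
        self.next_offset = offset + size
        # simple exponentially-weighted average (illustrative weighting)
        self.avg_size = (self.avg_size * 3 + size) // 4
        if self.avg_size > AVG_SIZE_CUTOFF:
            return False               # copy/backup pattern: add nothing
        return self.seq_bytes <= SEQ_CACHE_LIMIT
```

With this shape, small random IOs always get cached, a long sequential read stops being cached after its first megabyte, and a stream of large IOs stops being cached once its average size settles above the cutoff - the three behaviors the entry describes.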
There's still a fair amount that can be done here. Currently only one IO per process can be tracked - in the future this limitation will be removed, meaning if you're caching the drives that make up a raid5/6, a raid resync will completely bypass the cache. The number of IOs that can be tracked is also limited by its use of a linked list - in the future I'll have to convert it to a heap/hash table or red-black tree, so that on a busy server you can track the most recent several hundred IOs. And currently writes are always added to the cache - there isn't yet a mechanism to invalidate part of the cache, but that's needed for a number of things and will happen before too long.