2010-10-08

Got a mailing list - linux-bcache@vger.kernel.org. Feel free to direct anything bcache-related there :)

Just implemented UUIDs. Adding a field to the superblock and having make-bcache generate a UUID was easy enough; figuring out how to get udev to use it to generate the /dev/disk/by-uuid symlink was harder. Eventually we'll want the bcache superblock added to libblkid - for now I added a program (probe-bcache) that works analogously to blkid for bcache, and a udev rule to use it. I also added a hook for Debian's initramfs-tools to pull it all into the initramfs - not entirely sure I have that part right yet.
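
Roughly, the rule would look something along these lines - this is only a sketch, not the rule that actually ships: it assumes probe-bcache prints blkid-style KEY=value pairs (such as ID_FS_UUID) that udev can import, and the path to the binary is made up:

# Hypothetical sketch of the udev rule: run probe-bcache on new block devices,
# import whatever KEY=value pairs it prints, and use the UUID (if any) to
# create the by-uuid symlink.  The program's output and its path are assumptions.
SUBSYSTEM=="block", ACTION=="add|change", IMPORT{program}="/sbin/probe-bcache $tempnode"
ENV{ID_FS_UUID}=="?*", SYMLINK+="disk/by-uuid/$env{ID_FS_UUID}"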

Assuming it all works though, you should be able to do "echo /dev/disk/by-uuid/foo > /sys/kernel/bcache/register_cache", and have it work no matter where your cache device pops up that particular boot.

2010-09-16

It's been a while since I've written one of these...

The most recent sysbench numbers, on an X25-E, have bcache at around 80% of the bare SSD and 50-60% better than Flashcache (MySQL transactions per second). I'd post the full benchmarks, but I didn't run them myself :)

Stabilizing writeback took longer than I expected; bcache is way out on the performance end of the simplicity vs. performance tradeoff. But it's been rock solid for around a week now, under a great deal of torture testing. I'd very much like to know if anyone can break it - I've fixed a lot of bugs that were only possible to trigger in virtual machines running out of RAM, and for maybe a month now there's been only one bug I've been able to trigger on real hardware.

You still shouldn't trust it with real data, at least in writeback mode, pending more work and testing on unclean shutdowns. There are a few relatively minor issues that need to be fixed before it'll work reliably - they all come down to ordering writes correctly. After I fix the known issues I'll start heavily testing how it handles unclean shutdowns.

Besides that, there isn't a whole lot left before it might be production ready - primarily error handling. IO error handling is written but untested; handling memory allocation failures (and avoiding deadlock) will take more work. Both will need testing with md's faulty layer and fault injection. IO error handling meant I finally had to write the (much belated) code to unregister cached devices and caches; this is in progress now.

Anyone who's willing to do outside testing should feel free to ask for help, point out areas that need documentation, or otherwise provide input. The more testing it sees, the sooner it'll be production ready - I for one am excited to be using it on my dev machine... just as soon as I have another SSD to use :)

2010-08-07

Writeback is looking pretty stable. Definitely needs optimization, but the current dbench numbers don't look too terrible:

Uncached:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     354819     6.975  3546.510
 Close         261513     0.002     9.702
 Rename         14829   374.939  3840.287
 Unlink         71453   280.598  4041.658
 Qpathinfo     322406     0.013    14.727
 Qfileinfo      55752     0.003     3.293
 Qfsinfo        57571     0.173    16.834
 Sfileinfo      29540     4.680  1968.190
 Find          123221     0.033    11.395
 WriteX        174183     0.940  3501.915
 ReadX         542302     0.006     9.178
 LockX           1080     0.003     0.029
 UnlockX         1080     0.002     0.013
 Flush          25160   303.330  4014.527

Throughput 17.9185 MB/sec (sync dirs)  60 clients  60 procs max_latency=4041.664 ms

Cached, 60 clients:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1217617     3.719  1996.974
 Close         894777     0.002    24.008
 Rename         51682    58.830  1853.280
 Unlink        245414    58.024  2029.589
 Qpathinfo    1104565     0.013    30.958
 Qfileinfo     192845     0.003    29.178
 Qfsinfo       202248     0.184    41.389
 Sfileinfo      99487     7.141  1884.748
 Find          426854     0.033    36.754
 WriteX        602274     1.205  1494.175
 ReadX        1912230     0.006    36.247
 LockX           3978     0.003     0.027
 UnlockX         3978     0.002     0.019
 Flush          85282   148.580  2025.378

Throughput 63.3555 MB/sec (sync dirs)  60 clients  60 procs max_latency=2029.597 ms

Cached, 100 clients:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1431528     6.358  2741.227
 Close        1053052     0.002    11.128
 Rename         60734    82.006  2836.559
 Unlink        288169    82.373  2920.550
 Qpathinfo    1296364     0.013   297.144
 Qfileinfo     226278     0.003    17.307
 Qfsinfo       237291     0.193    32.547
 Sfileinfo     117622    13.979  2729.844
 Find          500876     0.033    25.412
 WriteX        707939     2.735  2731.899
 ReadX        2239328     0.006    45.265
 LockX           4622     0.003     0.090
 UnlockX         4622     0.002     0.037
 Flush         101075   184.241  2821.352

Throughput 74.105 MB/sec (sync dirs)  100 clients  100 procs max_latency=2920.561 ms

And using just the SSD:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    2209643     3.867   571.200
 Close        1622947     0.001     5.148
 Rename         93834    84.659   779.036
 Unlink        446251    74.769   779.419
 Qpathinfo    2004984     0.009   133.042
 Qfileinfo     349538     0.002    15.865
 Qfsinfo       367418     0.105    13.548
 Sfileinfo     180564     4.850   491.543
 Find          774428     0.022    16.256
 WriteX       1092433     1.073   336.781
 ReadX        3466949     0.005     9.330
 LockX           7194     0.003     0.137
 UnlockX         7194     0.002     0.045
 Flush         154609    51.690   738.412

Throughput 114.893 MB/sec (sync dirs)  100 clients  100 procs max_latency=779.425 ms

2010-08-03

Writeback is coming along well. Not seeing any data corruption at all, which is awesome. It definitely hasn't seen enough testing to be trusted, but writeback introduces some subtle cache coherency issues that were cause for concern, so this bodes well. It's not quite stable - one of my VMs went for around 20 hours before a write hung, another can trigger the hang easily, but my test box hasn't had any issues. That probably won't be too hard to fix when I get around to it.

There's a lot of work to be done on performance. It looks like synchronous btree updates are slower than we want them to be, so next up is adding a switch to turn synchronous mode on and off (so writeback can be tested independently of synchronous btree updates, and the reverse). It also looks like I'm going to have to switch to high-resolution timers for the btree write delays, as I suspected (bcache delays btree writes by a small amount to coalesce key insertions). That scheduling-related bug, whatever it is, is also going to have to be fixed soon.

The code to actually recover from an unclean shutdown is still missing, too - with synchronous btree updates everything should be consistent, but it'll still invalidate the entire cache. There's a bit of work that has to be done (bucket generations can be slightly out of date, so it has to walk the btree and check for that) but mostly it's just a matter of testing, and going over the code again looking for anything I missed.
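
In sketch form, that check might look something like the following - illustrative only, with made-up names, and under the assumption that each btree pointer records the generation the bucket had when the key was written, so the saved bucket generations can lag behind after a crash:

/* Illustrative sketch only - not bcache's actual recovery code.  Assumes each
 * btree pointer carries the generation the bucket had when the pointer was
 * created; a pointer whose generation is newer than the saved bucket
 * generation means the on-disk generations lagged behind. */

#include <stdint.h>

struct bucket {
	uint8_t gen;		/* generation as read back from disk */
};

struct bkey_ptr {
	uint64_t bucket;	/* which bucket the data lives in */
	uint8_t  gen;		/* bucket generation when the key was written */
};

/* Called for every pointer found while walking the btree after an unclean
 * shutdown: pull the bucket's generation forward if the pointer proves the
 * saved value is stale. */
static void fixup_bucket_gen(struct bucket *buckets, const struct bkey_ptr *ptr)
{
	struct bucket *b = &buckets[ptr->bucket];

	if ((int8_t) (ptr->gen - b->gen) > 0)
		b->gen = ptr->gen;
}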

The new elevator code appears to be working correctly now, too. I need to look at some blktraces and whatnot to make sure it's not doing anything dumb, but I haven't seen any obvious bugs. There is a performance issue, at least with the deadline scheduler - when the cache has dirty data, it can keep writes queued up to the cached device for so long that reads get starved. Hopefully this can be solved with configuration. Bcache does set all the writeback IO to the idle priority, but only cfq looks at that (and I haven't tried cfq), not deadline - and I've heard people say that deadline is much preferred for servers.

There's also a bug with the way the writeback code submits its IO that, to my knowledge, only affects raid devices. That's next up on my list.

At this point though I'd very much like to get other people trying it out, and hopefully helping track down performance bugs and just seeing how well it works. Benchmarks suck for the moment so don't expect to be awed, but that should rapidly improve now.

And if you like what you see and want to contribute to the continued development of bcache, I've got funding for another month but nothing else lined up so far. Hardware would most definitely be welcome - I'd like to be able to test on more SSDs and more architectures.

Bcache also puts a lot of effort into allocation, some of which could be defeated by the SSD's firmware trying to be overly smart. I'd like to get in contact with the people writing that firmware so we can see if there's anything that could be done to allow bcache and the firmware to cooperate better, or at least not walk all over each other. If anyone has any relevant information or knows who to talk to, please contact me :)

2010-07-30

Initial implementation of writeback is up, in the bcache-dev branch. There's probably still a heisenbug lurking, but I can't reproduce it at the moment; if it locks up for you, that's what you hit. The writeback code itself, though, should be working, with caveats - dirty data is correctly flagged as such and retained in the cache, but the code to actually write it out is disabled pending more debugging. For now, it just switches to writethrough when less than half of the buckets in the cache can be reclaimed.

Lots still to do, but this is a huge amount of progress.

2010-07-13

The version in the bcache branch is, as far as I can tell, stable; it passes all my torture tests and I haven't been able to break it. It doesn't have barriers or writeback caching.

Writeback caching is in progress; it'll be in the next version posted. Barriers and full IO tracking are also done, though mostly untested. IO tracking now uses a hash table to track the 128 most recent IOs, and it's independent of the process doing the IO (so your raid resync that does sequential IO on each drive in the array will completely bypass the cache).

With writeback caching, the updates to the btree have to be written to disk before the write can be returned as completed. For writes, when it adds a key to the btree it sets up the required write(s) and sets a timer; any keys that get inserted before it goes off will get written out at the same time. This code is stable, but will need testing to make sure the btree can still be read and contains what it should if the machine is shut down. The timers are currently hardcoded to 10 ms for normal writes and 4 ms for synchronous writes - this might want tweaking later.
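
As a sketch of the coalescing idea (userspace-flavoured, with made-up names - not the actual bcache code): the first key inserted into a node arms a deadline, later insertions simply ride along, and the node's pending write is issued once the deadline passes.

/* Userspace-flavoured sketch of the write-coalescing idea above; names and
 * structure are made up for illustration, not taken from bcache. */

#include <stdbool.h>
#include <time.h>

#define WRITE_DELAY_MS		10	/* normal writes */
#define SYNC_WRITE_DELAY_MS	4	/* synchronous writes */

struct btree_node {
	bool		dirty;		/* has keys not yet written out */
	struct timespec	deadline;	/* when the pending write is due */
};

static void arm_deadline(struct btree_node *n, long delay_ms)
{
	clock_gettime(CLOCK_MONOTONIC, &n->deadline);
	n->deadline.tv_nsec += delay_ms * 1000000L;
	if (n->deadline.tv_nsec >= 1000000000L) {
		n->deadline.tv_sec++;
		n->deadline.tv_nsec -= 1000000000L;
	}
}

/* Called on every key insertion: only the first key since the last flush
 * starts the clock; everything inserted before the deadline fires gets
 * written out in the same btree write. */
static void note_insert(struct btree_node *n, bool sync)
{
	if (!n->dirty) {
		n->dirty = true;
		arm_deadline(n, sync ? SYNC_WRITE_DELAY_MS : WRITE_DELAY_MS);
	}
}

/* Polled from a timer: is this node's pending write due yet? */
static bool write_due(const struct btree_node *n)
{
	struct timespec now;

	if (!n->dirty)
		return false;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return now.tv_sec > n->deadline.tv_sec ||
	       (now.tv_sec == n->deadline.tv_sec &&
		now.tv_nsec >= n->deadline.tv_nsec);
}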

Writeback caching splits the writes to the cache from the writes to the cached device. A write that doesn't bypass the cache will initially only be written to the cache; later on, dirty keys are read from the cache and written out in sorted order. This is the code I'm working on now; it's mostly written, but debugging hasn't started.
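
A minimal sketch of that flush pass, with hypothetical helpers standing in for the actual IO - the point is just that dirty keys get sorted by their position on the cached device before being written back, so the backing drive sees mostly-sequential writes:

/* Illustrative only: sort dirty keys by where they land on the cached device
 * and write them back in that order.  The IO helpers mentioned in the loop
 * are hypothetical, not real bcache functions. */

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct dirty_key {
	uint64_t dev_offset;	/* offset of the data on the cached device */
	uint64_t cache_offset;	/* where the dirty data sits in the cache */
	uint32_t sectors;
};

static int cmp_dev_offset(const void *a, const void *b)
{
	const struct dirty_key *ka = a, *kb = b;

	return (ka->dev_offset > kb->dev_offset) -
	       (ka->dev_offset < kb->dev_offset);
}

static void writeback_pass(struct dirty_key *keys, size_t n)
{
	qsort(keys, n, sizeof(*keys), cmp_dev_offset);

	for (size_t i = 0; i < n; i++) {
		/* In the real code: read keys[i] from the cache, write it to
		 * the backing device, then mark the key clean in the btree
		 * (read_from_cache / write_to_backing / mark_clean would be
		 * the hypothetical helpers here). */
		(void) keys[i];
	}
}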

2010-06-27

Benchmarks: 2 TB Western Digital Green drive cached by a 64 GB Corsair Nova

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   536  92 70825   7 53352   7  2785  99 181433  11  1756  15
Latency             14773us    1826ms    3153ms    3918us    2212us   12480us
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
         16 21959  32 21606   2 +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               283us     422us     464us     283us      27us      46us
1.96,1.96,utumno,1,1277677504,16G,,536,92,70825,7,53352,7,2785,99,181433,11,1756,15,16,,,,,21959,32,21606,2,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14773us,1826ms,3153ms,3918us,2212us,12480us,283us,422us,464us,283us,27us,46us

Uncached:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   672  91 68156   7 36398   4  2837  98 102864   5 269.3   2
Latency             14400us    2014ms   12486ms   18666us     549ms     460ms
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
          files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
         16 21228  31 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               265us     555us     465us     283us      74us      45us
1.96,1.96,utumno,1,1277675678,16G,,672,91,68156,7,36398,4,2837,98,102864,5,269.3,2,16,,,,,21228,31,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14400us,2014ms,12486ms,18666us,549ms,460ms,265us,555us,465us,283us,74us,45us

On direct IO bcache is getting within 10% of what the SSD can do on cache hits - both sequential and, more importantly, 4k random reads. There are still some performance bugs that seem to primarily affect buffered IO - the bonnie numbers are promising, particularly in that they show improvement across the board (I'm not doing writeback caching yet), but there's much room for improvement in random reads.

I just finished rewriting the btree writing code, along with some other extensive cleanups; stability looks much improved, though there are still some bugs remaining. Ran two test VMs overnight and both survived; the one running 4x bonnies continuously had some ext4 errors at one point, but kept going - so there's definitely still a race somewhere.

And I finally merged code to free unused buckets, and to track sequential IO and bypass the cache. Previously, if data already in the cache was rewritten, the pointer to the old data would be invalidated but the bucket would not be freed any sooner than normal. Now, garbage collection adds up how much of each bucket contains good data, and frees buckets that are less than a quarter full. It's probably possible to be smarter about which buckets we free; in the future I'll write about all the heuristics that could potentially be tuned.
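
In sketch form (made-up names, not the real code), the reclaim pass boils down to something like this:

/* Illustrative sketch of the reclaim described above - field and function
 * names are made up for the example. */

struct bucket {
	unsigned live_sectors;	/* sectors still referenced by the btree */
	unsigned gen;		/* bumped to invalidate any remaining pointers */
};

static void reclaim_sparse_buckets(struct bucket *buckets, unsigned nbuckets,
				   unsigned bucket_sectors)
{
	for (unsigned i = 0; i < nbuckets; i++) {
		struct bucket *b = &buckets[i];

		/* Less than a quarter of the bucket is still good data:
		 * bump the generation so the remaining pointers go stale,
		 * and hand the bucket back to the allocator (the allocator
		 * hook itself is left out of this sketch). */
		if (b->live_sectors < bucket_sectors / 4) {
			b->gen++;
			b->live_sectors = 0;
		}
	}
}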

Bcache also now tracks the most recent IOs it's seen, both reads and writes. It's done this for a while, so that if multiple processes are adding data to the cache their data gets segregated into their own buckets, improving cache locality. I extended this to keep track of how much IO has been done sequentially, and also to track the average IO size seen for the most recent processes. This means that if a process does, say, a 10 MB read, only the first (configurable) megabyte will be added to the cache - after that, reads will be satisfied from the cache if possible but will not be added to the cache. Even better, if you're doing a copy or a backup and the average IO size is above (by default) 512k, it will add nothing to the cache from that process as long as the average stays over the cutoff.
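
Here's a rough sketch of that heuristic - the struct stands in for one entry in the table of recently seen IOs, the names and the averaging math are made up, and the real cutoffs are configurable as described above:

/* Rough sketch of the sequential-IO heuristic described above; illustrative
 * only, not bcache's actual code. */

#include <stdbool.h>
#include <stdint.h>

#define SEQ_CUTOFF	(1 << 20)	/* cache only the first ~1 MB of a stream */
#define AVG_SIZE_CUTOFF	(512 << 10)	/* skip the cache entirely above 512k avg */

struct io_track {
	uint64_t next_sector;	/* sector right after the previous IO ended */
	uint64_t sequential;	/* bytes this stream has done sequentially */
	uint64_t avg_size;	/* running average IO size */
};

/* Returns true if this IO's data should not be added to the cache (reads can
 * still be satisfied from the cache if the data happens to be there). */
static bool skip_cache_add(struct io_track *t, uint64_t sector, uint64_t bytes)
{
	if (sector == t->next_sector)		/* continues the previous IO */
		t->sequential += bytes;
	else
		t->sequential = bytes;

	t->next_sector = sector + (bytes >> 9);

	/* crude running average of this stream's IO sizes */
	t->avg_size = (t->avg_size * 7 + bytes) / 8;

	/* copies and backups: big average IO size, add nothing at all */
	if (t->avg_size > AVG_SIZE_CUTOFF)
		return true;

	/* long sequential run: the first chunk was cached, the rest isn't */
	return t->sequential > SEQ_CUTOFF;
}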

There's still a fair amount that can be done here. Currently it's only possible to track one IO per process - in the future this limitation will be removed, meaning if you're caching the drives that make up a raid5/6, a raid resync will completely bypass the cache. The number of IOs that can be tracked is also limited by the use of a linked list - in the future I'll have to convert it to a heap/hash table or red-black tree, so if you've got a busy server you can track the most recent several hundred IOs. Finally, writes are currently always added to the cache - this is because there isn't yet a mechanism to invalidate part of the cache, but that's needed for a number of things and will happen before too long.