Bcache Wiki

What is bcache?

Bcache is a Linux kernel block layer cache. It allows one or more fast disk drives such as flash-based solid state drives (SSDs) to act as a cache for one or more slower hard disk drives.

Hard drives are cheap and big, SSDs are fast but small and expensive. Wouldn't it be nice if you could transparently get the advantages of both? With Bcache, you can have your cake and eat it too.

Bcache patches for the Linux kernel allow one to use SSDs to cache other block devices. It's analogous to L2Arc for ZFS, but Bcache also does writeback caching (besides just write through caching), and it's filesystem agnostic. It's designed to be switched on with a minimum of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high end storage arrays, and perhaps even embedded.

The design goal is to be just as fast as the SSD and cached device (depending on cache hit vs. miss, and writethrough vs. writeback writes) to within the margin of error. It's not quite there yet, mostly for sequential reads. But testing has shown that it is emphatically possible, and even in some cases to do better - primarily random writes.

It's also designed to be safe. Reliability is critical for anything that does writeback caching; if it breaks, you will lose data. Bcache is meant to be a superior alternative to battery backed up raid controllers, thus it must be reliable even if the power cord is yanked out. It won't return a write as completed until everything necessary to locate it is on stable storage, nor will writes ever be seen as partially completed (or worse, missing) in the event of power failure. A large amount of work has gone into making this work efficiently.

Bcache is designed around the performance characteristics of SSDs. It's designed to minimize write inflation to the greatest extent possible, and never itself does random writes. It turns random writes into sequential writes - first when it writes them to the SSD, and then with writeback caching it can use your SSD to buffer gigabytes of writes and write them all out in order to your hard drive or raid array. If you've got a RAID6, you're probably aware of the painful random write penalty, and the expensive controllers with battery backup people buy to mitigate them. Now, you can use Linux's excellent software RAID and still get fast random writes, even on cheap hardware.

Bcache is currently beta quality software. There aren't any known data corruption bugs and recovering from unclean shutdown is now working beautifully; it ought to be suitable for non critical use provided you have backups. It has been tested in production environment for many many hours.

Design Overview

The cache set comprises the highest level of the bcache hierarchy. A cache set is a collection of one or more caches and one or more storage devices. Each cache contains a single fast device. The cache set may contain a journal, which keeps a sequential log of all outstanding operations on the cache set. The cache set also contains worker threads for doing garbage collection and dirty page write back. A cache set can be either in synchronous or asynchronous mode. In synchronous mode, all writes are strictly ordered such that the cache set can be recovered from an unclean shutdown. Each slow storage device attached to the cache set can be in either writethrough or writeback mode. In writethrough mode, writes to the cache set are not acknowledged until they have completed on the slow storage device. In writeback mode, writes are acknowledged when the index and the data have been written to the fast device. Background write back will later find these dirty data and asynchronously write them to the slow storage device.

Because the fast block devices are expected to be SSDs, bcache partitions the storage on the fast storage devices into buckets. All buckets on all the devices associated with a cache set have the same bucket size. The bucket size is expected to be the native erase block size of the SSDs (or the least common multiple of the erase block sizes of all the SSDs). Within a bucket, bcache will write sectors sequentially and only once. The minimum number of data written at any one time is determined by the block size of the cache set. The block size is intended to correspond to the optimal write size of the device. When bcache needs to reclaim space on a fast device, it will reclaim a bucket and send the device a discard command covering the entire bucket before it writes sequentially to sectors within the bucket. This allows bcache to work well with SSDs that have very simple controllers. It also means that bcache could be used with raw flash with minor additions to handle ECC, bad block lists and wear leveling.

Bcache implements a very fast index. On every IO request to the devices in the cache set, bcache must very quickly determine where the requested data is located. This index must be fast and space efficient both in memory and on the fast storage devices. The data structure used to implement the index is essentially a B+ tree. Unlike a traditional B+ tree, every node can have more than one key sets. Keys are ordered within key sets but all key sets must be scanned from the newest set towards the oldest when searching for a key. Keys are never removed from a node until such a node is either split or reclaimed by the garbage collection mechanism. Like a traditional B+ tree, only leaf nodes reference data. All internal nodes point to other B+ tree nodes.

When bcache loads, it creates a major device type. Every slow storage device registered with bcache creates a new minor device number that is associated with that storage device. All accesses to the storage device must now be made through the major, minor pair provided by bcache for that storage device. To prevent anyone from inadvertently referencing the storage device without the cache attached (which would potentially show an inconsistent version of the data) storage devices must be created using a special utility before they can be registered with bcache. This utility writes a bcache superblock at the start of the device which bcache then hides from the version of the device it presents through its special major, minor device number. Once this is done, an ext4 (or any other filesystem) partition can be created through bcache and mounted. Should any attempt be made to mount the storage device without bcache, ext4 will not be able to locate its superblock and reject the partition.

Detailed Design

Buckets

Allocation of space on fast devices is done through buckets. A bucket is intended to correspond to an erase block on the SSD. At any given time, nearly all buckets are going to be on a heap. The heap allows buckets to be reclaimed based on how frequently they have been accessed. Every bucket has a priority counter. This counter is increased whenever the bucket serves a cache hit. The priority of all the buckets in the heap is decremented periodically. This allows the least frequently used buckets to be quickly reclaimed and put on a short free list as needed. The generation number of the bucket is incremented when the bucket is reclaimed. This allows for very fast bucket reclaiming because there is no immediate need to traverse the entire index removing pointers that reference the reclaimed bucket. Garbage collection is then responsible for removing stale pointer from the index.

Bcache keeps a short list of open buckets at all times. These buckets are ready to accept new data. Bcache will try to put sequential data into the same bucket, even if the data is separated by other arriving items. Failing that, bcache will try to collect data produced by the same task within a bucket. This strategy is the first stage to merging contiguous or related data sequences. When an open data bucket fills, a new bucket is pulled from the free list.

The layout of a fast device is as follows:

Sector  0 : 4K blank                            // Just in case somebody runs fdisk on this device
Sector  8 : 4K superblock                       // Contains information of bucket size, root of bucket of the index, etc.
Sector  16: 4K UUIDs                            // Used to track the other devices attached to this cache
Sector  24: Bucket priorities and generations   // Requires 3 bytes (16 bit prio, 8 bit gen) per bucket on the device
Bucket   n: Journal                             // Start of journal area
Bucket n+j: Data                                // Data area managed by the bucket heap

At this time, the very first few sectors of the device are fixed to contain the superblock, UUID labels for the devices attached to the cache, bucket priority and generation values and the journal. The bucket priority and generation items and the journal buckets will be integrated into the managed bucket area in the future to fully support the ability to work with a minimal flash controller.

Index

The bcache index is a B+ tree in which each node occupies an entire bucket. This produces a very flat tree structure that often requires only a few pointer traversals to determine the presence or absence of any item. The bcache nodes contain key pointer tuples. A tuple has the following structure:

struct bkey {
    struct {
        uint64_t header_flag        :  1; // Set for the header record, clear for all others
        uint64_t num_ptrs           :  3; // Number of pointers in this tuple
        uint64_t                    : 23; // Reserved for future use
        uint64_t dirty_flag         :  1; // Set if the data is dirty
        uint64_t length             : 16; // Number of sectors in this data item (up to 32MB)
        uint64_t virtual_device     : 20; // Virtual device (currently corresponds to a real device)
    } header;
    struct {
        uint64_t header_flag        :  1; // Clear for all except the header record
        uint64_t sector             : 63; // Sector number in the virtual device (max 2^72 bytes per device)
    } key;
    struct {
        uint64_t header_flag        :  1; // Clear for all except the header record
        uint64_t physical_device    : 11; // Index to a real physical device
        uint64_t offset             : 44; // Sector offset to the physical device (max 2^53 bytes per device)
        uint64_t generation         :  8; // Generation number of the pointer
    } ptr[];
};

This structure is extremely flexible and allows bcache to potentially maintain up to eight cached copies of any particular piece of data. For example, bcache could potentially cache two copies of dirty data on separate SSDs in case one device should go bad there is a backup copy still available to provide full consistency. Every pointer has a generation number which allows buckets to be quickly reclaimed. If the generation number of the pointer does not match the generation number of the bucket, then the cached data has been reclaimed and is no longer valid. At some later point, garbage collection will lazily remove the invalid pointer.

A key has a sector length, so each key can represent up to 32MB of cache data. Generally, the amount represented will be smaller since there are limitations on how much sequential data the block devices associated with the cache set will accept. It is important to note that the key value in the sector field actually indicated the number of the sector at the end of the data. To find the actually starting sector offset of the data referenced by the key, the length must be subtracted from the sector. When searching within the index, the search returns the smallest key greater than the search value.

Unlike a textbook B+tree, there is not one more pointer than there are keys in a node. This is not an issue while searching because each key is the sector following the range represented by the key. This affects insertion because inserting a key in a leaf that extends the range past the last range already in the leaf requires that the new key value be propagated up the tree towards the root.

Another issue that complicates maintaining the index is that it is possible for a new range to be inserted into the index that basically overlaps existing ranges in two adjacent leafs. This results in the insertion of the new range in the left leaf, which supersedes the overlapped ranges in that leaf. A special key to invalidate the existing overlapping ranges in the right leaf must also be added (there is no direct deletion from index node, so a new key to invalidate the required range is appended to the index).

Since a bucket is usually very large, it would be very inefficient to have to insert new keys into a large index node as it starts to fill up. Bcache solves this problem by partitioning the index into subsets. Within each subset, the keys are sorted. New keys are always inserted into the last subset. A subset starts with a special subset header which fits into the standard node index array:

struct bset {
    uint32_t csum;          // Checksum of the subregion
    uint32_t keys;          // Number of keys in this subregion
    uint64_t rand;          // Used to verify that this is a valid subregion
    struct bkey start[0];   // Start of keys in this subset
};

When a subset is first used, a 30 second timer is started. New keys are then inserted into this new subset until the timer expires. When the timer expires, the contents of the subset are written to disk. Since the minimum write size is a block, there is a potential for space to be wasted in the disk version of the index node. This is a small price to pay for ensuring that the written version of the index is relatively current.

When a new subset is prepared to write out to the cache device, the keys are checked to ensure that they are in sorted order. This is an internal sanity check to ensure that the index routines have not corrupted the tree. The subset's checksum is calculated and added to the subset header. To allow tracking of how efficiently the index spaces is used on the device, the global Btree write counter is then incremented and the number of keys in the subset is added to the global keys write counter. A new subset is then set up ready to start accumulating a new set of keys.

Garbage collection will lazily merge written subsets in the memory instance of the index to form longer subsets with sorted keys. This increases the efficiency of future traversals of the index. Garbage collection also detects any keys that need to be removed. Only garbage collection removes keys. On insertion, if any key matching an existing item in the current subset overwrites the previous key, thus effectively deleting the old version. Since the index is always searched from the newest subset towards the oldest, any new key will effectively overwrite any other version of the key in a previous subset of the index node. Finally, since every pointer has a generation, stale pointers do not need to be removed immediately. Garbage collection or node splitting required to grow the index finds any stale or redundant keys and removes them, thus preventing these keys from inordinately consuming valuable memory resources.

Journal

The bcache journal is used to improve performance. It is possible to create cache devices that have no journal, so bcache can work properly with and without a journal. If a journal is present, write requests can be completed as soon as the journal update request completes. This avoids the longer tail latencies that can result when an index insert requires many writes to the caching device because it causes a B+ tree node split that propagates upwards and causes a new level to be added. With the journal, the index updates can progress in the background after the write completes.

Since the index is always kept in a consistent state on the cache device, a power failure should create problems assuming that the cache device is well behaved and does not randomly corrupt data or lose data as a result of a power loss. In such a case, even the journal is going to be of little use since the journal itself may potentially be lost or corrupted. Assuming a well, behaved caching device, a cache set configured to do work in synchronous mode should always be recoverable after a power loss.

If a cache set has a journal, the journal is quickly replayed when the cache set is started and any items in the journal not updated in the index are inserted into the index.

Write-back

When write-back is enabled on a cached storage device there is greater potential for reducing seeks and improving performance. In write-through mode, a write is not completed until it has been written to the storage device. In write-back mode, the write is acknowledged when the data is written to the cache device and the journal update finishes. In the future, the journal will contain a strong checksum of the data so that the write will complete when the journal entry finishes (journal replay will stop if the expected data is not available).

In write-back mode, the dirty flag on the key is set to indicate that the data is not available on the storage device. All read accesses to that data must be serviced from the cache. Any new writes to that same location replace the previous version, potentially saving extra seeks and write operations that would have had to take place on the storage device without the cache. At some point, the dirty data must be written to the storage device. During normal operation, this is done by a background write-back process. There are several tunable parameters that control write-back. The first is a write-back delay parameter. This parameter specifies how long to wait after the dirty data arrives before starting a write-back process. The second is a write-back percentage parameter indicating the minimum percentage of the cache that must be full before write-back will start. This allows for more buffering opportunities and increases the effectiveness of each write-back pass (improves how much data is written back for each movement of the head).

When a write-back pass starts, it write data in order to the drive from smallest sector numbers to the highest. This show increase the effective amount of work done for a single sweep of the head across the disk. Foreground read requests can continue to arrive as write-back is processing. For this reason, write-back requests are sent with lower priority. Despite this, there are obvious opportunities to improve the effectiveness of background write-back by coupling it more closely to the foreground requests. Doing so, however, would complicate the mechanism.

Management

Detection

Assuming that you have installed and booted a kernel that supports bcache, you should have the following directory on the system:

/sys/fs/bcache

If this directory is not present, you do not have a kernel with bcache support and you will not be able to register any devices that you create.

Creation

To create a cache set, you need at least one fast drive. We will call this a cache device and most often this will be an SSD. Bcache requires that the cache device be prepared prior to use much like a file system requires that a file system be created on a device before it is mounted. Creating a cache device is done using the make-bcache utility:

make-bcache -C -b<bucket-size> -w<min-write-size> -j<journal-buckets> cache_device

Likewise, you will need at least one slow storage device. We will call this storage device a backing device, since it provides the backing storage. Bcache also requires that the backing device be prepared before it is used and even before a file system is created on it. Creating a backing device is also done using the make-bcache utility:

make-bcache -B backing_device

Registration

Once the cache and backing devices have been created, they need to be joined to form a cache set. Note that at the current time the memory required by bcache to manage the cache set is allocated at registration time from the container that run the registration command. If this is done from a shell created when you ssh'ed into the machine, you will most likely be in the restricted /sys cgroup and the registration will fail with insufficient memory. You can verify this by looking at the dmesg log. To form a cache set, first register a cache device with bcache:

echo cache_device > /sys/fs/bcache/register

When you register a cache device, bcache creates a new cache set and creates a new directory to help manage and monitor the cache:

/sys/fs/bcache/UUID

Where UUID is the unique id assigned to the cache at creation time. Under this directory are files that can be used to control the cache set and extract statistics from the cache set:

active_journal_entries

Number of journal entries that are still pending for their corresponding B+tree index to update.

block_size

Minimum minimum write block size on the cache set in bytes.

btree_avg_keys_written

Average number of keys per write to the B+tree index during normal index insertion (does not reflect the writes of coalesced items when the are copied to a new node). This indicates how much coalescing is taking place in subsets before they time out and are written.

btree_cache_size

Number of cache buckets currently taken up by the B+tree index.

btree_written

Sum of all B+tree index writes in bytes since the cache set was created or since clear_stats was written to.

bucket_size

Size of a bucket in bytes.

cache<0..n>

A link to each of the cache devices comprising this cache set.

clear_stats

Writing anything to this file resets all the statistics for this cache set.

discard

If not zero, a discard/trim command will be sent to the cache device for each bucket before it is reused. Defaults to on if the caching device support discard/trim commands.

heap_size

Number of buckets current available for reuse. These are buckets that can be trivially reclaimed because they are not being used to hold the B+tree index or don't contain dirty data.

metadata_written

Sum of all B+tree index, bucket priority and journal writes in bytes since the cache set was created or since clear_stats was written to.

nbuckets

Total number of buckets in this cache set.

synchronous

If not zero then the cache set is in synchronous mode and all write requests are ordered. This mode is recommended unless the data does not have to be recovered in the event of an unplanned shutdown. This is the default mode.

tree_depth

Current depth (height) of the B+tree index.

unregister

Writing to this file disables the cache-set and initiates a complete background write-back of all dirty data on all backing devices. No new data will be accepted.

written

This given the total bytes written to the cache. This can be compared to btree_written to determine the amount of write inflation caused by bcache.

You also need to register backing devices with the cache set:

echo backing_device > /sys/fs/bcache/register

When you register a backing device, bcache also creates a new directory under the block device interface for the backing device:

/sys/block/backing_device/bcache

Under this directory are files that can be used to control the backing device and extract statistics:

attach

Used to attach a backing device to a cache set after the initial creation and registration of the device. See the following section for more details.

bypassed

Sum of all requests, both read and writes, that have bypassed the cache

cache_hits, cache_misses, cache_hit_ratio

Hits and misses are counted per individual request as bcache sees them. Partial hits are counted as misses.

clear_stats

Writing anything to this file resets all the statistics for this cache set.

detach

Writing to this file detaches the backing device from its cache set and initiates a complete background write-back of all data.

flush_delay_ms, flush_delay_ms_sync

Optional delay for B+tree writes to allow for more coalescing of updates to the index.

label

Allows an optional label to be associated with the backing device. This can be used by init scripts to determine which device node should be connected to each /dev/bcacheX device.

readahead

This file specifies the size of blocks to read ahead. If it is zero, then read ahead will be off. Common values for enabling read ahead are powers of two between 16k and 256k.

running

If non zero forces a backing device with dirty data on an unavailable cache device to start. This should only be used as a last resort to recover from a failed caching device that is beyond repair.

state

Indicates what state the backing device is in:

sequential_cutoff

Sequential requests larger than this threshold will bypass the cache and go straight to the backing device. The intent for this is to prevent large copy and backup jobs from polluting the cache. This mechanism will likely change in the future to use fadvise information. If zero, sequential cutoff is disabled.

sequential_cutoff_average

If the average request size of a process exceeds this threshold, all requests from this process will bypass the cache. Bypassed requests will continue to affect the average request size of the process. If zero, average sequential cutoff is disabled.

sequential_merge

If non zero, bcache keeps a list of the last 128 requests submitted to compare against all new requests to determine which new requests are sequential continuations of previous requests for the purpose of determining sequential cutoff. This is necessary if the sequential cutoff value is greater than the maximum acceptable sequential size for any single request.

writeback

If this file contains a zero, write-back caching will be disabled to this device. If non-zero, write-back caching will be enabled.

writeback_delay

Number of seconds to wait before initiating background write-back on a cache device after new dirty data arrives. The default is 30 seconds.

writeback_percent

Background write-back on a cache only proceeds when more than the percentage of the cache specified in this file is unavailable. The default is zero. Making it larger forces the cache to wait in hopes that more opportunities for coalescing and buffering become available.

writeback_running

If zero, turns background write-back off completely. Dirty data will continue to collect in the cache devices. This option is intended for testing and benchmarking purposes. It should not be used during normal operation.

Attaching

Once caching and backing devices are registered, backing devices must be attached to a cache set. This only needs to be done once at the time the cache set is first created if the -S option was not used when the backing device was created with make-bcache. After this, simply registering the devices when the system comes up is sufficient. Note that if the backing device is used in write-back mode and dirty data exists for it in a caching device, the /dev/bcache# device for the storage device will not appear until the caching device is registered. If the caching device has been erased or has failed, the only way to expose the data on the backing device is by writting a non zero value into the /sys/block/backing_device/bcache/running file. The device will then appear, but it will probably be in an inconsistent state, so this approach should be used only as a last resort.

Attach a newly created and registered backing device to a cache set using the following command:

echo UUID > /sys/block/backing_device/bcache/attach

Where UUID is the unique id of the cache set and backing_device is the registered backing device to attach.

Provisions exist for detaching a backing device from a cache set. This will force the cache set to write back all dirty data for the backing device and leaves the backing device in a clean, detached state. If this is done, the backing device can then be attached to any other valid cache set.

Mknod

Bcache creates a new block device for each backing device that has been registered and is attached to a cache set with the cache devices present:

/sys/block/bcache#

This directory is not the device itself, but rather the control directory for the device. To create an actual block device that can be operated on, the Linux mknod utility must be used. The dev file in the control directory gives the major:minor number of the bcache device to access the backing device. Creating the device itself requires a few simple commands:

dev=`cat /sys/block/bcache0/dev | sed -e 's/:/ /'`
mknod -m 0660 /dev/hdc3 b $dev

You can now use this device to build a new filesystem on the cached storage device:

mkfs -t ext4 /dev/hdc3
mount /dev/hdc3 /export/hdc3

/export/hdc3 is now a standard directory with an ext4 filesystem but backed by the fast caching device.

Future plans

Further off, there's plans to use Bcache's index to implement overcommited storage. If you're familiar with LVM, it works by allocating logical volumes in units of 4 mb extents; thus you can arbitrarily create and resize LVs. But when you create an LV you have to fully allocate its storage, regardless of whether it'll ever be written to. If you've ever managed servers with lots of random LVs, you've probably experienced first hand how much of a pain it is to keep track of how much free space you have, resize LVs when the filesystems fill up, etc. - not to mention the huge amount of space that typically gets wasted because you really don't want filesystems to fill up.

But all the work has already been done in Bcache for allocating on demand, and maintaining the index while it's in use - and by using the same index for cached data and the volumes themselves, there will be approximately zero extra runtime overhead. You'll be able to create petabyte sized filesystems with a tiny amount of real storage, resize them arbitrarily, and be able to see exactly how much space you're using. Reading from newly created volumes also won't return old data; sectors that haven't previously been written to will return 0s. This was actually the primary motivation for this feature - for shared hosting, you don't want customers to be able to see other people's data.

There's also been quite a bit of interest in tiered storage. If you've got a truly large amount of storage, it may be beneficial to have a really large RAID60 of large 7200 rpm drives, and a smaller RAID10 of 15k rpm SAS drives. Nobody wants to manually manage what goes where - and keep track of what data gets accessed the most - so if you could use it as one large pool, and have data migrate between them automatically, so much the better. Once overcommited storage is implemented, tiered storage should actually be quite easy to add.

Bcache has made amazing progress in the past six months, but there's still work to be done to make it truly production ready, and no shortage of features to implement after that. Completing all this is thus contingent on my ability to afford to continue to work full time on it - any funding and/or hardware that could be contributed would be a great help :) I've received some funding, without which bcache would not be as far along as it is but I'm definitely not fully funded at this time.

Performance

Latest Results

Alexandru Ionica graciously shared some performance data recently, showing a comparison with Facebook's flash cache.

http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/

Old Results

Seeks per second, bonnie++: 90% reads, 10% rewrites
Bcache running in synchronous writeback mode with the same SSD and hard drive:

bcache.png

MoreInformationAboutBcache

Roadmap

AdditionalNotes

FAQ

Current status

2010-10-08

Got a mailing list - linux-bcache@vger.kernel.org. Feel free to direct anything bcache related there :)

Just implemented UUIDs. Adding a field in the superblock and having make-bcache generate a UUID was easy enough, figuring out how to get udev to use it to generate the /dev/disk/by-uuid symlink was harder. Eventually we'll want the bcache superblock added to libblkid - for now I added a program (probe-bcache) that works analogously to blkid for bcache, and a udev rule to use it. I added a hook for debian's initramfs tools to pull it all in to the initramfs - not entirely sure I have that part right yet.

Assuming it all works though, you should be able to do "echo /dev/disk/by-uuid/foo > /sys/kernel/bcache/register_cache", and have it work no matter where your cache device pops up that particular boot.

2010-09-16

It's been awhile since I've written one of these...

The most recent Sysbench numbers, on an X25-E, have Bcache at around 80% of the bare SSD and 50-60% better than Flashcache (MySQL transactions per second). I'd post the full benchmarks but I didn't run them myself :)

Stabilizing writeback took longer than I expected; Bcache is way out there on the simplicity vs. performance tradeoff. But it's been rock solid for around a week now, under a great deal of torture testing. I'd very much like to know if anyone can break it - I've fixed a lot of bugs that were only possible to trigger in virtual machines running out of ram, there's been only one bug I've been able to trigger on real hardware for maybe a month now.

You still shouldn't trust it with real data, at least in writeback mode, pending more work and testing on unclean shutdowns. There's a few relatively minor issues that need to be fixed before it'll work reliably - they're all basically ordering writes correctly. After I fix all the known issues I'll start testing how it handles unclean shutdowns heavily.

Besides that, there isn't a whole lot left before it might be production ready - primarily error handling. IO error handling is written but untested, handling memory allocation failures (and avoiding deadlock) will take more work. Those will need testing with md's faulty layer, and fault injection. IO error handling meant I finally had to write the (much belated) code to unregister cached devices and caches; this is in progress now.

Anyone who's willing to do outside testing should feel free to ask for help, point out areas that need documentation, or otherwise provide input. The more testing it sees, the sooner it'll be production ready - I for one am excited to be using it on my dev machine... just as soon as I have another SSD to use :)

2010-08-07

Writeback is looking pretty stable. Definitely needs optimization, but the current dbench numbers don't look too terrible:

Uncached:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     354819     6.975  3546.510
 Close         261513     0.002     9.702
 Rename         14829   374.939  3840.287
 Unlink         71453   280.598  4041.658
 Qpathinfo     322406     0.013    14.727
 Qfileinfo      55752     0.003     3.293
 Qfsinfo        57571     0.173    16.834
 Sfileinfo      29540     4.680  1968.190
 Find          123221     0.033    11.395
 WriteX        174183     0.940  3501.915
 ReadX         542302     0.006     9.178
 LockX           1080     0.003     0.029
 UnlockX         1080     0.002     0.013
 Flush          25160   303.330  4014.527

Throughput 17.9185 MB/sec (sync dirs)  60 clients  60 procs max_latency=4041.664 ms

Cached:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1217617     3.719  1996.974
 Close         894777     0.002    24.008
 Rename         51682    58.830  1853.280
 Unlink        245414    58.024  2029.589
 Qpathinfo    1104565     0.013    30.958
 Qfileinfo     192845     0.003    29.178
 Qfsinfo       202248     0.184    41.389
 Sfileinfo      99487     7.141  1884.748
 Find          426854     0.033    36.754
 WriteX        602274     1.205  1494.175
 ReadX        1912230     0.006    36.247
 LockX           3978     0.003     0.027
 UnlockX         3978     0.002     0.019
 Flush          85282   148.580  2025.378

Throughput 63.3555 MB/sec (sync dirs)  60 clients  60 procs max_latency=2029.597 ms

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1431528     6.358  2741.227
 Close        1053052     0.002    11.128
 Rename         60734    82.006  2836.559
 Unlink        288169    82.373  2920.550
 Qpathinfo    1296364     0.013   297.144
 Qfileinfo     226278     0.003    17.307
 Qfsinfo       237291     0.193    32.547
 Sfileinfo     117622    13.979  2729.844
 Find          500876     0.033    25.412
 WriteX        707939     2.735  2731.899
 ReadX        2239328     0.006    45.265
 LockX           4622     0.003     0.090
 UnlockX         4622     0.002     0.037
 Flush         101075   184.241  2821.352

Throughput 74.105 MB/sec (sync dirs)  100 clients  100 procs max_latency=2920.561 ms

And using just the SSD:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    2209643     3.867   571.200
 Close        1622947     0.001     5.148
 Rename         93834    84.659   779.036
 Unlink        446251    74.769   779.419
 Qpathinfo    2004984     0.009   133.042
 Qfileinfo     349538     0.002    15.865
 Qfsinfo       367418     0.105    13.548
 Sfileinfo     180564     4.850   491.543
 Find          774428     0.022    16.256
 WriteX       1092433     1.073   336.781
 ReadX        3466949     0.005     9.330
 LockX           7194     0.003     0.137
 UnlockX         7194     0.002     0.045
 Flush         154609    51.690   738.412

Throughput 114.893 MB/sec (sync dirs)  100 clients  100 procs max_latency=779.425 ms

2010-08-03

Writeback is coming along well. Not seeing any data corruption at all, which is awesome. It's definitely not seen enough testing to be trusted, but writeback introduces some subtle cache coherency issues that were cause for concern, so this bodes well. It's not quite stable - one of my VMs went for around 20 hours before a write hanged, another has been able to trigger the hang easily, but my test box hasn't had any issues. That probably won't be too hard to fix when I get around to it.

There's a lot of work to be done on performance. It looks like synchronous btree updates are slower than we want them to be, so next up is adding a switch to turn synchronous mode on and off (so writeback can be tested independently of synchronous btree updates, and the reverse), and it looks like I'm going to have to switch to high res timers for the btree write delays, as I suspected (bcache delays btree writes by a small amount to coalesce key insertions). That scheduling related bug, whatever it is, is also going to have to be fixed soon.

The code to actually recover from an unclean shutdown is still missing, too - with synchronous btree updates everything should be consistent, but it'll still invalidate the entire cache. There's a bit of work that has to be done (bucket generations can be slightly out of date, so it has to walk the btree and check for that) but mostly it's just a matter of testing, and going over the code again looking for anything I missed.

The new elevator code appears to be working correctly now, too. I need to look at some blktraces and whatnot to make sure it's not doing anything dumb, but I haven't seen any obvious bugs. There is a performance issue, at least with the deadline scheduler - when the cache has dirty data, it can keep writes queued up to the cached device for so long that reads get starved. Hopefully this can be solved with configuration. Bcache does set the priority on all the writeback io to the idle priority, but only cfq looks at that (which I haven't tried), not deadline, and I've heard people say that deadline is much preferred for servers.

There's also a bug with the way the writeback code submits its io that to my knowledge only affects raid devices. That's next up on my list.

At this point though I'd very much like to get other people trying it out, and hopefully helping track down performance bugs and just seeing how well it works. Benchmarks suck for the moment so don't expect to be awed, but that should rapidly improve now.

And if you like what you see and want to contribute to the continued development of bcache, I've got funding for another month but nothing else lined up so far. Hardware would most definitely be welcome - I'd like to be able to test on more SSDs and more architectures.

Bcache also puts a lot of effort into allocation, some of which could be defeated by the SSD's firmware trying to be overly smart. I'd like to get into contact with people who are writing that firmware so we can see if there's anything that could be done to allow bcache and the firmware to cooperate better, or at least not walk all over each other. If anyone has any relevant information or knows who to talk to, please contact me :)

2010-07-30

Initial implementation of writeback is up, in the bcache-dev branch. There's probably still a heisenbug lurking, but I can't reproduce it at the moment; if it locks up for you, that's what you hit. The writeback code itself though should be working, with caveats - dirty data is correctly flagged as such and retained in the cache, but the code to actually write it out is disabled pending more debugging. For now, it just switches to writethrough when less than half of the buckets in the cache can be reclaimed.

Lots still to do, but this is a huge amount of progress.

2010-07-13

The version in the bcache branch is as far as I can tell stable; it passes all my torture tests and I haven't been able to break it. It doesn't have barriers or writeback caching.

Writeback caching is in progress, it'll be in the next version posted. Barriers and full IO tracking are also done, though mostly untested. It now uses a hash table to track the 128 most recent IOs, and it's independent of the process doing it (so your raid resync that does sequential IO on each drive in the array will completely bypass the cache).

With writeback caching, the updates to the btree have to be written to disk before the write can be returned as completed. For writes, when it adds a key to the btree it sets up the required write(s) and sets a timer; any keys that get inserted before it goes off will get written out at the same time. This code is stable, but will need testing to make sure the btree can still be read and contains what it should if the machine is shut down. The timers are currently hardcoded to 10 ms for normal writes and 4 ms for synchronous writes - this might want tweaking later.

Writeback caching splits the writes to the cache from the writes to the cached device. A write that doesn't bypass the cache will initially only be written to the cache; later on dirty keys are read from the cache and written out in sorted order. This is the code I'm working on now, it's mostly written but debugging hasn't started.

2010-06-27

Benchmarks: 2 TB Western Digital Green drive cached by a 64 GB Corsair Nova

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   536  92 70825   7 53352   7  2785  99 181433  11  1756  15
Latency             14773us    1826ms    3153ms    3918us    2212us   12480us
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 21959  32 21606   2 +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               283us     422us     464us     283us      27us      46us
1.96,1.96,utumno,1,1277677504,16G,,536,92,70825,7,53352,7,2785,99,181433,11,1756,15,16,,,,,21959,32,21606,2,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14773us,1826ms,3153ms,3918us,2212us,12480us,283us,422us,464us,283us,27us,46us

Uncached:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   672  91 68156   7 36398   4  2837  98 102864   5 269.3   2
Latency             14400us    2014ms   12486ms   18666us     549ms     460ms
Version  1.96       ------Sequential Create------ --------Random Create--------
utumno              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 21228  31 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               265us     555us     465us     283us      74us      45us
1.96,1.96,utumno,1,1277675678,16G,,672,91,68156,7,36398,4,2837,98,102864,5,269.3,2,16,,,,,21228,31,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,14400us,2014ms,12486ms,18666us,549ms,460ms,265us,555us,465us,283us,74us,45us

On direct IO bcache is getting within 10% of what the SSD can do on cache hits - both sequential and, more importantly, 4k random reads. There's still some performance bugs that seem to primarily affect buffered IO - the bonnie numbers are promising, particularly in that they show improvement across the board (I'm not doing writeback caching yet), but there's much room for improvement in random reads.

I just finished rewriting the btree writing code and some other extensive cleanups; stability looks much improved though there still are some bugs remaining. Ran two test VMs overnight and both survived; the one running 4x bonnies continuously had some ext4 errors at one point, but kept going - so there's definitely still a race somewhere.

And I finally merged code to free unused buckets, and to track sequential IO and bypass the cache. Previously, if data already in the cache was rewritten, the pointer to the old data would be invalidated but the bucket would not be freed any sooner than normal. Now, garbage collection adds up how much of each bucket contains good data, and frees buckets that are less than a quarter full. It's probably possible to be smarter about which buckets we free, in the future I'll write about all the heuristics that could potentially be tuned.

Bcache also now tracks the most recent IOs, both read and write that it's seen. It's done this for awhile, so that if multiple processes are adding data to the cache their data gets segregated into their own buckets, improving cache locality. I extended this to keep track of how much IO has been done sequentially, and also track the average IO size it's seen for the most recent processes. This means if a process does say a 10 mb read, only the first (configurable) megabyte will be added to the cache - after that, reads will be satisfied from the cache if possible but will not be added to the cache. Even better, if you're doing a copy or a backup and the average IO size is above (by default) 512k, it will add nothing to the cache from that process as long as the average stays over the cutoff.

There's still a fair amount that can be done here. Currently it's only possible to track one IO per process - in the future this limitation will be removed, meaning if you're caching the drives that make up a raid5/6, a raid resync will completely bypass the cache. Also, the number of IOs that can be tracked is limited due to it using a linked list - in the future I'll have to convert it to a heap/hash table or red/black tree, so if you've got a busy server you can track the most recent several hundred IOs. Also, currently writes are always added to the cache - this is because there isn't yet a mechanism to invalidate part of the cache, but that's needed for a number of things and will happen before too long.

Questions? Comments? Feel free to register and add them, or email me - kent.overstreet@gmail.com

BcacheWiki: Bcache (last edited 2012-05-15 00:47:08 by Kent)