Bcachefs

It's a next generation copy on write filesystem for Linux with a long list of features - tiering/caching, data checksumming, compression, encryption, multiple devices, et cetera.

It's not vaporware - it's a real filesystem you can run on your laptop or server today.

We prioritize robustness and reliability over features and hype: we make every effort to ensure you won't lose data. Bcachefs builds on a codebase with a pedigree - bcache already has a reasonably good track record for reliability (particularly considering how young upstream bcache is, in terms of engineer man-years). Starting from there, bcachefs development has prioritized incremental development, keeping things stable, and aggressively fixing design issues as they are found; the bcachefs codebase is considerably more robust and mature than upstream bcache.

Developing a filesystem is also not cheap or quick or easy; we need funding! Please chip in on Patreon - the Patreon page also has more information on the motivation for bcachefs and the state of Linux filesystems, as well as some bcachefs status updates and information on development.

If you don't want to use Patreon, I'm also happy to take donations via PayPal: kent.overstreet@gmail.com.

Join us in the bcache IRC channel, where we have a small group of bcachefs users and testers: #bcache on OFTC (irc.oftc.net).

Why bcachefs?

For existing bcache users, we've got a particularly compelling argument: a block layer cache implements the core functionality of a filesystem (allocation, reclamation, and mapping from one address space to another). By running a filesystem on top of a block cache, every IO must traverse two different mapping layers - each taking up memory and space for its index, each adding to every IO's latency (and tail latency!), and adding more complexity to your IO path - in particular, making it that much harder to debug performance issues.

By using a filesystem that was designed for caching from the start (among many other things) we're able to collapse the two mapping layers and eliminate a lot of redundant complexity - many things also become easier when they're done within the context of an (appropriately designed) filesystem, versus the block layer - cache coherency, for example, is a tricky problem in bcache but trivial in bcachefs. This has real world impact - some of the most pernicious bugs and performance issues bcache users have hit have been because of the writeback lock, which is needed for cache coherency - that code is all gone in bcachefs.
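
To make the two-layer argument concrete, here is a toy model in Python. It is purely illustrative: the dictionaries, file names and block numbers are hypothetical and bear no relation to the real bcache or bcachefs data structures. It only shows that a stacked setup consults two indexes per read, while a unified filesystem consults one.

```python
# Toy model only: every name and number here is made up for illustration.

# Stacked setup: a filesystem index on top of a block cache index.
fs_index = {("file.txt", 0): 100, ("file.txt", 1): 101}   # (file, offset) -> backing device block
cache_index = {100: ("ssd", 7), 101: ("hdd", 101)}         # backing device block -> (device, block)


def stacked_lookup(name, offset):
    """Resolve a file block through both mapping layers; return (location, lookups)."""
    logical = fs_index[(name, offset)]   # lookup 1: filesystem extent map
    location = cache_index[logical]      # lookup 2: block cache map
    return location, 2


# Unified setup: one index maps file offsets straight to a device location.
unified_index = {("file.txt", 0): ("ssd", 7), ("file.txt", 1): ("hdd", 101)}


def unified_lookup(name, offset):
    """Resolve a file block through a single mapping layer; return (location, lookups)."""
    return unified_index[(name, offset)], 1


print(stacked_lookup("file.txt", 0))
print(unified_lookup("file.txt", 0))
```

Both calls resolve to the same location; the difference is that the stacked path keeps two indexes in memory and walks both on every IO.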

What if you don't care about caching, what if you just want a filesystem that works? Bcachefs is not just targeted at caching - it's meant to be a superior replacement for ext4, xfs and btrfs. We intend to deliver a copy on write filesystem, with all the features you'd expect from a modern copy on write filesystem - but with the performance and robustness to be a very viable replacement for ext4 and xfs. We will deliver on that goal.

Documentation

End user documentation is currently fairly minimal; this would be a very helpful area for anyone who wishes to contribute - I would like the bcache man page in the bcache-tools repository to be rewritten and expanded.

There is some fairly substantial developer documentation: see BcacheGuide.

Getting started

Bcachefs is not upstream, and won't be for a while. If you want to try out bcachefs now, you'll need to be comfortable with building your own kernel. Also, as bcachefs has had many incompatible on disk format changes, you cannot currently build a kernel with support for both bcachefs and the existing, upstream bcache on disk format (this will change prior to bcachefs going upstream).

First, check out the bcache kernel and tools repositories:

git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
git clone -b dev https://evilpiepirate.org/git/bcache-tools.git

Build and install as usual. Then, to format and mount a single device with the default options, run:

bcache format /dev/sda1
mount /dev/sda1 /mnt

See bcache format --help for more options.

Status

Bcachefs can currently be considered beta quality. It has a small pool of outside users and has been quite stable and reliable so far; there's no reason to expect issues as long as you stick to the currently supported feature set. Being a new filesystem, backups are still recommended though.

Performance is generally quite good - typically faster than btrfs, and not far behind xfs/ext4. There are still performance bugs to be found and optimizations we'd like to do, but performance isn't currently the primary focus - the main focus is on making sure it's production quality and finishing the core feature set.

Normal POSIX filesystem functionality is all finished (things like xattrs, ACLs, etc.) - if you're using bcachefs as a replacement for ext4 on a desktop, you shouldn't find anything missing. For servers, NFS export support is still missing (but coming soon) and we don't yet support quotas (probably further off).

The on disk format is not yet set in stone - there will be future breaking changes to the on disk format, but we will make every effort to make transitioning easy for users (e.g. when there are breaking changes there will be kernel branches maintained in parallel that support old and new formats to give users time to transition; users won't be left stranded with data they can't access). We'll need at least one more breaking change for encryption and possibly snapshots, but I'm trying to batch up all the breaking changes as much as possible.

Feature status

  • Full data checksumming

    Fully supported and enabled by default. We do need to implement scrubbing, once we've got replication and can take advantage of it.

  • Compression

    Not quite finished - it's safe to enable, but there's some work left related to copy GC before we can enable free space accounting based on compressed size: right now, enabling compression won't actually let you store any more data in your filesystem than if the data was uncompressed.

  • Tiering

    Works (there are users using it), but recent testing and development has not focused enough on multiple devices to call it supported. In particular, the device add/remove functionality is known to be currently buggy.

  • Multiple devices, replication

    Roughly 80% or 90% implemented, but it's been on the back burner for quite a while in favor of making the core functionality production quality - replication is not currently suitable for outside testing.

  • Encryption

    Implementation is finished, and passes all the tests. The blocker on rolling it out is finishing the design doc and getting outside review (as any changes based on outside review will almost definitely require on disk format changes), as well as finishing up some unrelated on disk format changes (particularly for replication) that I'm batching up with the on disk format changes for encryption.

  • Snapshots

    Snapshot implementation has been started, but snapshots are by far the most complex of the remaining features to implement - it's going to be quite a while before I can dedicate enough time to finishing them, but I'm very much looking forward to showing off what it'll be able to do.

Known issues/caveats

  • Mount time

    We currently walk all metadata at mount time (multiple times, in fact) - on flash this shouldn't even be noticeable unless your filesystem is very large, but on rotating disks expect mount times to be slow.

    This will be addressed in the future - mount times will likely be the next big push after the next big batch of on disk format changes.

  • Fsck

    There is a fsck - it's just in kernel, done at mount time, not in userspace. We shouldn't be missing any checks - we should be able to detect any filesystem inconsistencies. Repair is only implemented for a few inconsistencies, though.

    By default, fsck is run on every mount - mount with -o nofsck if you don't want to run it. Errors are not fixed by default, because I want to make sure I get bug reports if inconsistencies are found - if you do run into fixable errors, mount with -o fix_errors (and send a bug report!).

FAQ

Please ask questions and ask for them to be added here!

Todo list

Current priorities:

  • Encryption is pretty much done - just finished the design doc.

    Cryptographers, security experts, etc. please review: Encryption.

  • Compression is almost done: it's quite thoroughly tested; the only remaining issue is a problem with copygc fragmenting existing compressed extents, which only breaks accounting.

  • NFS export support is almost done: implementing i_generation correctly required some new transaction machinery, but that's mostly done. What's left is implementing a new kind of reservation of journal space for the new, long running transactions.

Breaking changes:

  • Need incompatible superblock changes - the encryption key used up the remaining reserved space. We need:

    • more flag bits
    • a feature bits field
    • bring some structure to the variable length portion, so we can add more crap later - do it like inode optional fields
    • on clean shutdown, write current journal sequence number to superblock - help guard against corruption or an encrypted filesystem being tampered with
  • More bits (once we have feature bits) for "has this feature ever been used", e.g.

    • encryption - if we don't have encrypted data, we don't need to load ciphers
    • compression - if gzip has never been used, we don't need gzip's crazy huge compression workspace
  • journal format tweaks:

    • right now btree node roots are added to every journal entry - we really only need to journal them when they change, and with the generic journal pin infrastructure this'll be easy to implement. This is a slight on disk format change - old kernels won't be able to read filesystems from newer kernels, but it's not a breaking change

    • prio bucket pointers - we also add to every journal entry a pointer to each device's starting prio bucket. This one is more important to fix, because with large numbers of devices we'll be wasting more and more of each journal entry on these prio pointers that mostly aren't changing. We just need to break out this journal entry into one entry per component device (and, as with btree node roots, change it to only journal when it changes).

      When tweaking prio bucket pointers, we should add a random sequence field so we can tell when we've read a valid prio_set that isn't the one we actually wanted.

  • fallocate + compression - calling fallocate is supposed to ensure that a future write call won't return -ENOSPC, regardless of what the file range already contains. We have persistent reservations to support fallocate, but if the file already contains compressed data we currently can't put a persistent reservation where we've already got an extent. We need another type of persistent reservation, that we can add to a normal data extent.

  • checksumming stuff:

    • configurable action for nonfatal IO errors & data checksum errors:
      • RO, continue or threshold
      • absolute threshold, or moving average threshold (error rate)
    • when we get a read error/data checksum error, flip a bit in the key - "has seen read error" - so we don't blow through the global limit on one bad extent
    • global and per device options: per device options take precedence if set, but may be unset
    • how should configuration handle multiple devices? we probably want to just continue by default in single device mode, but in multi device mode kick it RO
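
As a rough illustration of the checksumming items above, here is a minimal Python sketch of how such an error policy might behave. Nothing here reflects actual bcachefs code: the class names, the "continue"/"ro"/"threshold" strings, and the per-extent flag set are all assumptions made for the example, and only the absolute threshold is modelled (a moving average would replace the simple counter).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ErrorPolicy:
    """Hypothetical policy: action is 'continue', 'ro', or 'threshold'."""
    action: str = "continue"
    threshold: int = 0  # absolute error count that triggers read-only


@dataclass
class Device:
    policy: Optional[ErrorPolicy] = None   # None means "unset, use the global policy"
    errors: int = 0
    flagged_extents: set = field(default_factory=set)
    read_only: bool = False


def effective_policy(global_policy, device_policy):
    """Per device options take precedence if set, but may be unset."""
    return device_policy if device_policy is not None else global_policy


def record_read_error(dev, extent, global_policy):
    """Count an error once per extent, then apply the effective policy."""
    if extent in dev.flagged_extents:   # "has seen read error" bit already set
        return dev.read_only
    dev.flagged_extents.add(extent)
    dev.errors += 1
    policy = effective_policy(global_policy, dev.policy)
    if policy.action == "ro":
        dev.read_only = True
    elif policy.action == "threshold" and dev.errors >= policy.threshold:
        dev.read_only = True
    return dev.read_only


global_policy = ErrorPolicy(action="threshold", threshold=2)
dev = Device()
print(record_read_error(dev, "extent-a", global_policy))  # first error on extent-a
print(record_read_error(dev, "extent-a", global_policy))  # same extent, not counted again
print(record_read_error(dev, "extent-b", global_policy))  # second distinct extent
```

The per-extent flag is what keeps one repeatedly read bad extent from exhausting the limit on its own; only distinct extents move the counter.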

Other wishlist items:

  • When we're using compression, we end up wasting a fair amount of space on internal fragmentation because compressed extents get rounded up to the filesystem block size when they're written - usually 4k. It'd be really nice if we could pack them in more efficiently - probably 512 byte sector granularity.

    On the read side this is no big deal to support - we have to bounce compressed extents anyways. The write side is the annoying part. The options are:

    • Buffer up writes when we don't have full blocks to write? Highly problematic, not going to do this.
    • Read modify write? Not an option for raw flash, would prefer it to not be our only option
    • Do data journalling when we don't have a full block to write? Possible solution, we want data journalling anyways
  • Inline extents - good for space efficiency for both small files, and compression when extents happen to compress particularly well.

  • Full data journalling - we're definitely going to want this for when the journal is on an NVRAM device (also need to implement external journalling (easy), and direct journal on NVRAM support (what's involved here?)).

    Would be good to get a simple implementation done and tested so we know what the on disk format is going to be.
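
Returning to the internal fragmentation item above, the space lost to rounding is simple arithmetic. The Python sketch below compares rounding a compressed extent up to a 4k block with rounding it up to a 512 byte sector. The 5000 byte extent size is a made-up example, not a measurement.

```python
def round_up(size, granularity):
    """Round size up to the next multiple of granularity (both in bytes)."""
    return -(-size // granularity) * granularity


def wasted(size, granularity):
    """Bytes lost to padding when an extent of `size` bytes is rounded up."""
    return round_up(size, granularity) - size


compressed = 5000  # hypothetical compressed extent size in bytes
for granularity in (4096, 512):
    print(granularity, round_up(compressed, granularity), wasted(compressed, granularity))
```

For this example extent, 4k rounding occupies 8192 bytes and wastes 3192 of them, while 512 byte rounding occupies 5120 bytes and wastes 120.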