Earthstar May + June update: Squirrel in progress

Published on June 10, 2022 by Sam Gwilym

I'm deep into the development of the next major version of Earthstar, but will be travelling until the beginning of July. Until then, I'd like to leave you with an idea of how Earthstar Squirrel is shaping up.

Like its namesake, Earthstar Squirrel will be small and fast, and will like to store things.

Much faster read/write and sign/verify

Most of Earthstar's computational overhead comes from reading and writing documents, and signing and verifying new and incoming documents. Using Deno's benchmarking tools I was able to get a firm idea of how our different persistence and cryptographic drivers compared, and where they fell short.

This brought forward a few new developments which will make the next version of Earthstar much faster:

Earthstar's memory driver has had some hot-path optimisation (thank you Chris Basham!), which makes some operations (like getting the latest document at a path) thousands of times faster.
A new CryptoDriverSodium driver for the Deno runtime, which performs about 15x faster than the only option Deno users had previously.
A new experimental SqliteFfiDriver, which uses native bindings to the system SQLite (as opposed to the current SQLite driver, which uses a WASM version). It has far faster write speeds but slower read speeds (sometimes you just can't win), making it well suited for use in a replica server.

Altogether, these changes mean that while a replica used to take 25 seconds to ingest 5000 documents, it now takes just one second.

Multi-format support

When we talk about how Earthstar works, a lot of what we're talking about is the es.4 specification. This 'format' describes the rules Earthstar employs, such as what constitutes a valid document, or the shape that document might take.

But to push Earthstar forward we need new formats with new capabilities, and possibly different rules. It's just as important that this progress doesn't come at the cost of leaving existing data behind: users shouldn't have to sacrifice their data for the sake of new features.

So Earthstar needs to be able to support multiple formats at once. This has been challenging because many of Earthstar's APIs are built around the assumption that the es.4 format is the only possible one.

It's also challenging because it means that certain APIs like Replica need to dynamically change the types of documents they accept and return based on the formats which are given to them. Implementing this with TypeScript was not pleasant.
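To give a feel for the typing problem, here is a minimal sketch of a replica whose document type is derived from the formats passed to it. Everything in it is an assumption made for illustration: the `Format` interface, the `DocOf` helper, the `"next"` format name, and the heavily simplified document shapes are not Earthstar's actual API or the real es.4 schema.

```typescript
// Hypothetical sketch: these names and document shapes are illustrative,
// not Earthstar's actual API or the real es.4 schema.

// A format knows its name and how to recognise its own documents.
interface Format<Doc> {
  name: string;
  isValid(doc: unknown): doc is Doc;
}

type DocEs4 = { format: "es.4"; path: string; content: string };
type DocNext = { format: "next"; path: string; text: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

const formatEs4: Format<DocEs4> = {
  name: "es.4",
  isValid(doc: unknown): doc is DocEs4 {
    return isRecord(doc) && doc["format"] === "es.4" &&
      typeof doc["path"] === "string" && typeof doc["content"] === "string";
  },
};

const formatNext: Format<DocNext> = {
  name: "next",
  isValid(doc: unknown): doc is DocNext {
    return isRecord(doc) && doc["format"] === "next" &&
      typeof doc["path"] === "string" && typeof doc["text"] === "string";
  },
};

// Extract the document type a format describes.
type DocOf<F> = F extends Format<infer D> ? D : never;

// The replica's document type is the union of its formats' document types.
class Replica<Formats extends Format<any>[]> {
  private formats: Formats;
  private docs: DocOf<Formats[number]>[] = [];

  constructor(formats: Formats) {
    this.formats = formats;
  }

  // Accept a document only if one of the configured formats recognises it.
  ingest(doc: unknown): boolean {
    for (const format of this.formats) {
      if (format.isValid(doc)) {
        this.docs.push(doc as DocOf<Formats[number]>);
        return true;
      }
    }
    return false;
  }

  getAll(): DocOf<Formats[number]>[] {
    return [...this.docs];
  }
}

const legacyOnly = new Replica([formatEs4]);
const multi = new Replica([formatEs4, formatNext]);

const nextDoc = { format: "next", path: "/notes/1", text: "hello" };
const acceptedByLegacy = legacyOnly.ingest(nextDoc);
const acceptedByMulti = multi.ingest(nextDoc);
const multiDocs = multi.getAll(); // typed as (DocEs4 | DocNext)[]
```

In this sketch `legacyOnly.getAll()` is typed as `DocEs4[]`, while `multi.getAll()` is typed as `(DocEs4 | DocNext)[]`, so the compiler tracks which formats a given replica was configured with.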

Dividing Earthstar's core APIs from its formats is not yet total (for instance, the shape of queries is still very much tied to certain assumptions about format), but it has come a long way.

From here on out, Earthstar APIs will use a new default format unless an alternative is provided by the user.

Attachments

One of the most requested features for Earthstar is support for big files, such as video and audio. There's a whole class of very normal applications that aren't possible without this.

But there are many challenges around implementing blob storage:

Abuse: How can peers only sync blobs they're interested in, and not be forced into downloading (potentially huge or incriminating) data they don't want?
Performance: How should blobs be stored, given that our current storage layers (like SQLite) are not suitable for them? Replicas can be queried and return results with possibly many documents, so how do we prevent their blobs from saturating device memory?
Ergonomics: How can reading and writing blobs to an Earthstar replica feel as straightforward as working with an ordinary object in memory?

The way I've been approaching this is thinking of large blobs as attachments, data associated with (but separate from) a document.

This means syncing can be sparse: peers can learn about attachments, their size, and what they contain without needing to download them first.
It means attachments are stored separately from documents, which allows them to be stored in an appropriate way (e.g. just using the filesystem), and to be loaded on demand instead of whenever a document pops into memory.
While the ideal API would probably just be a blob property on documents, instead users will be able to fetch document attachments and interact with them as bytes in memory (Uint8Array) or as a ReadableStream.
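The separation described above could look roughly like the sketch below: documents carry only a small description of their attachment, while the bytes live in a separate store and are read on demand. All names here (`DocWithAttachment`, `AttachmentStore`, `writeWithAttachment`, `toyHash`) are hypothetical, the hash is a toy non-cryptographic stand-in, and streaming via ReadableStream is left out to keep the example short.

```typescript
// Hypothetical sketch: none of these names are Earthstar's actual API.

// A document carries only a small description of its attachment.
interface DocWithAttachment {
  path: string;
  text: string;
  attachmentHash: string; // identifies the attachment's bytes
  attachmentSize: number; // size in bytes, known without fetching the bytes
}

// Toy 32-bit FNV-1a hash: a stand-in for a real cryptographic content hash.
function toyHash(bytes: Uint8Array): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    h ^= bytes[i];
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Attachments are kept apart from documents (here in memory; a real store
// might use the filesystem) and are only read when explicitly requested.
class AttachmentStore {
  private blobs = new Map<string, Uint8Array>();
  loads = 0; // counts how often attachment bytes were actually read

  put(bytes: Uint8Array): string {
    const hash = toyHash(bytes);
    this.blobs.set(hash, bytes);
    return hash;
  }

  get(hash: string): Uint8Array | undefined {
    this.loads++;
    return this.blobs.get(hash);
  }
}

function writeWithAttachment(
  store: AttachmentStore,
  path: string,
  text: string,
  bytes: Uint8Array,
): DocWithAttachment {
  const attachmentHash = store.put(bytes);
  return { path, text, attachmentHash, attachmentSize: bytes.length };
}

const store = new AttachmentStore();
const songs = [
  writeWithAttachment(store, "/songs/one.mp3", "Song one", new Uint8Array([1, 2, 3])),
  writeWithAttachment(store, "/songs/two.mp3", "Song two", new Uint8Array(1024)),
];

// Listing the library touches documents only, never the attachment bytes.
const listing = songs.map((doc) => `${doc.text} (${doc.attachmentSize} bytes)`);
const loadsAfterListing = store.loads;

// Bytes are fetched only when a specific attachment is wanted.
const firstSong = store.get(songs[0].attachmentHash);
```

This mirrors the music library example: the list of songs and their sizes is available from the documents alone, and bytes are only read for the song someone actually plays.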

There are also some challenges to this approach. Because they stand apart, attachments need their own storage, validation, and syncing methods. But the advantages of being able to manage them independently of the documents they're attached to are great! Imagine an in-browser music library app where all the songs can be listed without needing to download them first.

Because Earthstar will now be syncing both documents and blobs, I knew I would need to revisit synchronisation efficiency. Or Earthstar's lack thereof.

Efficient Sync (and synchronicity)

Earthstar's syncing is anything but efficient right now. It's very bad and slow and I'm sorry.

Currently peers sync by sending all of their documents to each other, every time they sync. This is about as efficient as two friends telling each other their life story every time they meet each other.

Ideally we want peers to transmit as little information as possible to one another to determine what they actually need, and I've been hard at work on this.

One part of this is a new system of messages where peers transmit only a small subset of a document's data, which the partner peer uses to determine whether they want that document or not. This cuts down the amount of data transmitted between peers by a considerable degree.
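A minimal sketch of that exchange might look like this. The post does not say which fields the small message contains, so the choice of `path`, `author`, and `timestamp` is an assumption, as are the names `DocSummary`, `summarise`, and `wants` and the simplified document shape.

```typescript
// Hypothetical sketch: field choice and names are assumptions, not
// Earthstar's actual wire format.

interface Doc {
  path: string;
  author: string;
  timestamp: number;
  content: string;
}

// The small subset of a document that is sent first.
type DocSummary = Pick<Doc, "path" | "author" | "timestamp">;

function summarise(doc: Doc): DocSummary {
  return { path: doc.path, author: doc.author, timestamp: doc.timestamp };
}

// The receiving peer wants a document if it has no version of it at all,
// or only an older one.
function wants(local: Doc[], summary: DocSummary): boolean {
  const existing = local.find(
    (doc) => doc.path === summary.path && doc.author === summary.author,
  );
  return existing === undefined || existing.timestamp < summary.timestamp;
}

const peerA: Doc[] = [
  { path: "/wiki/cats", author: "@suzy", timestamp: 300, content: "Cats are great." },
  { path: "/wiki/dogs", author: "@suzy", timestamp: 100, content: "Dogs are fine." },
  { path: "/wiki/owls", author: "@suzy", timestamp: 200, content: "Owls are wise." },
];

const peerB: Doc[] = [
  { path: "/wiki/cats", author: "@suzy", timestamp: 250, content: "Cats are ok." },
  { path: "/wiki/dogs", author: "@suzy", timestamp: 100, content: "Dogs are fine." },
];

// Peer A sends summaries; peer B replies with the paths it actually wants.
const requested = peerA
  .map(summarise)
  .filter((summary) => wants(peerB, summary))
  .map((summary) => summary.path);
```

Here peer B asks only for the newer version of `/wiki/cats` and for `/wiki/owls`, which it has never seen, so the unchanged `/wiki/dogs` content is never sent.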

But this is not enough by itself. Peers would have to send one of these for every single document in a replica, which could number in the millions. To go back to the metaphor of the two friends, now the friends are just asking "did I tell you about the time I...?" about everything that happened to them in their lives instead.

So we want to whittle these messages down to only the documents the replicas do not have in common.

By complete chance I came across Aljoscha Meyer's master's thesis on range-based set reconciliation, which describes a way to do exactly what I want: determine the difference between two sets of documents in the most efficient way possible.
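The general idea can be sketched as follows: both peers compute a fingerprint over a range of their sorted items, and only when the fingerprints differ do they split the range and compare the halves, until the ranges are small enough to exchange items directly. The code below is a heavily simplified, single-process simulation written under several assumptions: it is not the thesis's protocol or anything Earthstar ships, the XOR-of-hashes fingerprint uses a toy non-cryptographic hash, the cutoff of two items is arbitrary, and the function reads both sets directly where real peers would exchange messages.

```typescript
// Hypothetical, simplified simulation of range-based set reconciliation.

// Toy 32-bit FNV-1a hash of a string (stand-in for a cryptographic hash).
function hashItem(item: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < item.length; i++) {
    h ^= item.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Fingerprint of a set of items: XOR of their hashes, so order is irrelevant.
function fingerprint(items: string[]): number {
  let fp = 0;
  for (const item of items) fp = (fp ^ hashItem(item)) >>> 0;
  return fp;
}

// Items that fall in the half-open range [lower, upper); null means no upper bound.
function inRange(sorted: string[], lower: string, upper: string | null): string[] {
  return sorted.filter((x) => x >= lower && (upper === null || x < upper));
}

interface Diff {
  onlyA: string[]; // items A has and B lacks
  onlyB: string[]; // items B has and A lacks
  fingerprintsCompared: number;
}

// Both inputs must be sorted and free of duplicates.
function reconcile(
  a: string[],
  b: string[],
  lower: string = "",
  upper: string | null = null,
  diff: Diff = { onlyA: [], onlyB: [], fingerprintsCompared: 0 },
): Diff {
  const inA = inRange(a, lower, upper);
  const inB = inRange(b, lower, upper);

  // Matching size and fingerprint: treat the whole range as identical.
  diff.fingerprintsCompared++;
  if (inA.length === inB.length && fingerprint(inA) === fingerprint(inB)) {
    return diff;
  }

  // Small range: cheaper to compare the items themselves.
  if (inA.length <= 2 || inB.length <= 2) {
    for (const x of inA) if (inB.indexOf(x) === -1) diff.onlyA.push(x);
    for (const x of inB) if (inA.indexOf(x) === -1) diff.onlyB.push(x);
    return diff;
  }

  // Otherwise split the range at A's median item and recurse into both halves.
  const mid = inA[Math.floor(inA.length / 2)];
  reconcile(a, b, lower, mid, diff);
  reconcile(a, b, mid, upper, diff);
  return diff;
}

// Sixteen document names, "doc01" to "doc16".
const allDocs: string[] = [];
for (let i = 1; i <= 16; i++) allDocs.push((i < 10 ? "doc0" : "doc") + i);

const replicaA = allDocs.filter((d) => d !== "doc12"); // A is missing doc12
const replicaB = allDocs.filter((d) => d !== "doc05"); // B is missing doc05

const diff = reconcile(replicaA, replicaB);
```

In this toy run the two replicas share fourteen of their fifteen documents, and the differences are found by comparing fingerprints of shrinking ranges; item names are only compared directly once a range is down to a couple of items.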

I'm thrilled, not just because a solution seemingly fell out of the sky, but also because Aljoscha is a singular contributor to the space of distributed systems, having created Bamboo.

Aljoscha is also a great teacher. I had a call with them today that was really instructive, and leaves me excited to implement this when I return.

But our call also revealed a completely bewildering detail: Aljoscha and Cinnamon had discussed Earthstar one year ago, and this discussion was the push which made Aljoscha choose set reconciliation as the subject of their thesis.

What?!

If not for a random post on SSB (thanks @arj) this circle would never have been closed!