Utreexo in Sia: An Overview

There’s a lot of buzz in the Sia community about the upcoming “Utreexo” hardfork, a radical overhaul of the consensus code which promises to drastically improve performance, cut initial sync time to almost nothing, and allow full nodes to run on smartphones, embedded devices, or even in the browser. But what is Utreexo, exactly, and how does it enable these benefits?

All UTXO-based cryptocurrencies use a database to track their outputs. When an output is created, we insert it into the database; when an output is spent, we delete it from the database. Thus, at any given chain height, the database will contain the set of all outputs that can be legally spent in future blocks—and if a transaction tries to spend an output that’s not in the database, we know that it’s invalid.

While this approach is simple to understand, its simplicity comes at a cost. The database is large, and it grows linearly with the number of UTXOs. Furthermore, updating the database is I/O-intensive, especially because the UTXOs are keyed randomly (that is, the outputs spent by any given transaction are unlikely to be stored sequentially on disk, so accessing them requires lots of seeking around within the database).

In 2018, Bitcoin developer Tadge Dryja described how this database could be replaced by a cryptographic state accumulator, which he named Utreexo. This accumulator comprises nothing more than a handful of hashes: specifically, the roots of a set of perfect Merkle trees. (Coincidentally, we’ve been using this exact data structure in Sia for many years now; I even published a paper on it recently.) And yet, we can still “add” and “delete” elements from this accumulator, just like a database. The trick is that, in order to perform these updates, each element must be paired with a Merkle proof. In a way, Utreexo turns the UTXO database inside out: the data itself lives entirely outside the accumulator, which itself only describes the data. As you might expect, this inversion has the unfortunate downside of increasing bandwidth and storage requirements for nodes, since they now need to transmit and store all those Merkle proofs in addition to the regular transaction data.

So: what’s the benefit of this tradeoff? Well, Tadge and his team have already done a fine job answering that question; I recommend Calvin Kim’s post on the subject. But I want to touch upon a more obscure, less sexy advantage of Utreexo: its effect on the programmer, and on the way we architect blockchain software.

I’ll stress again that Utreexo accumulator is exceedingly small; specifically, it requires log2(n) 32-byte hashes, where n is the number of UTXOs. Logarithmic growth is great, because it allows us to use a classic optimization: for finite n, ∃knn < 2k, so O(log2(n)) is really O(k), and k is usually pretty small. In this particular case, k = 32 will be sufficient for the foreseeable future, which means our accumulator will never be larger than 32*32 bytes, or 1 KiB.

1 KiB is small enough to hold on the stack. Small enough to fit in a single disk sector, or a single packet. Small enough to store alongside every block. When it comes to performance, smaller is better—but constant size is highly desirable as well, because it allows you to eliminate dynamic memory allocations, compute offsets at compile time, omit size metadata from serialized objects, and more. So an accumulator that is both small and constant-sized is really optimization-friendly.

With this tiny accumulator, we can do something that (to my knowledge) has never been possible in a cryptocurrency before: we can fit all of the information necessary to fully validate a block in a small, fixed-size block of memory. I call this a “validation context;” in addition to the accumulator itself, it contains the parent height and hash, the difficulty, the total chain work, etc. This means that our block validation function can be pure, operating solely on immutable in-memory data with no I/O or other side effects. We can also represent block application in a purely functional way: applying a block to a validation context yields a new “child” context that will be used to validate the next block in the chain.

Usually, functional purity is something that only FP weenies like me care about. But in this case, it has a significant practical ramification: it means that our most security-critical code can be “airgapped” from the rest of the node software. For example, it could be written in a formally-verified language and compiled into a dynamic library, which could then be loaded by a node implementation written in any language. More radically, validation code could be offloaded to a hardware module: much like the hardware wallets we know and love, we could manufacture a device preloaded with validation firmware—perhaps even accelerated by ASIC cores for hashing and signature verification. Plug it into any untrusted computer, and you can validate the chain at ridiculous speed and with complete confidence.

The key here is that validation contexts are portable. They don’t contain any pointers or resource handles; they’re just inert data, so transferring them to another device is no different from transferring a transaction or a block. This is a game-changer when it comes to blockchain syncing, because you can download these contexts directly from peers, or get them from your VPS, or a friend, or a trusted third party, whatever. Making contexts "cheap" is what enables parallel blockchain syncing: you can acquire multiple contexts spaced evenly throughout the chain, validate each section on a different thread (or CPU, or device!), and then stitch them all together. Or, if you don’t care about validating the entire history, you can just download a recent context and sync the few remaining blocks, giving you—in a matter of seconds—a fully-validating pruned node.

From a non-technical perspective, this is probably the most exciting thing about Utreexo: it enables a future in which full nodes are ubiquitous. No longer will web wallets or lite wallets or phone apps need to depend on third parties; why bother, when they can sync to tip in less time than it takes to load a bloated website?

To be fair, Utreexo isn’t all sunshine and rainbows. Beyond the increased bandwidth and storage requirements, it introduces at least one major annoyance from an implementation perspective: proof maintenance. Remember, every output that you spend now has to be accompanied by a Merkle proof—and every time the accumulator changes (i.e. every block), those proofs need to be updated. This becomes especially ugly when you consider that a block might be mined soon after you broadcast your transaction, causing its proofs to be invalidated mid-relay! Fortunately, there are workarounds for these issues, such as automatically detecting and updating outdated proofs received from your peers, so the impact on UX should be minimal.

Overall, I’m pretty enamored with Utreexo. As a programmer, I appreciate its elegant Merkle tree algorithms and the functional purity it enables. As an engineer, I appreciate its massive performance benefits and its elimination of the UTXO database. And as a cryptocurrency enthusiast, I appreciate its potential to greatly increase the number of full nodes, helping usher in a future where self-sovereignty is the norm.

Adapting Sia for Utreexo

I’ve mostly been describing an idealized version of Utreexo, where we only care about tracking outputs. Our true goal at the Sia Foundation, though, is to adapt this design for use in Sia, which is a bit more complicated. The main challenge is figuring out how to map Sia’s file contracts onto this model. We also need to accomodate siafunds, host announcements, and the Foundation subsidy.

Let’s tackle these in order, from easiest to hardest. First, the subsidy: this is as simple as adding the subsidy addresses to our validation context object and automatically adding the subsidy outputs to the accumulator at the appropriate block heights. The rest of the subsidy logic (governing how the addresses are updated) can remain essentially unchanged.

Siafunds are also pretty easy. Siafund outputs are added to the accumulator alongside regular outputs, hashed with a prefix string to prevent collisions. The current “pool” (the sum of all file contract taxes) is added to the validation context, and is updated whenever a contract is formed. The “claim” behavior is unchanged: sending SF creates an SC dividend, which is added to the accumulator.

File contracts are more challenging. When a contract is formed, we add its metadata to the accumulator. (As with siafunds, we don’t need a separate accumulator; a distinct prefix is sufficient.) When the host submits its storage proof, it includes the original contract and a proof of its presence in the accumulator. Conversely, if the host fails to provide a valid proof within the contract window, anyone can submit a transaction proving that the contract has expired, thus causing the MissedPayout outputs to be added to the accumulator. There’s just one complication: a storage proof requires deterministic entropy, which we currently derive from the hash of the first block in the proof window. To make this compatible with our validation context model, our storage proof transaction needs to include that hash directly, along with a proof that the hash was indeed an ancestor of the current block. Such proofs are fairly trivial to construct and verify, but this does mean that storage proof transactions now contain three separate Merkle proofs: the accumulator proof, the ancestor proof, and the storage proof itself.

Next, the host announcements. This is kind of a trick question, because host announcements aren’t actually part of consensus at all: they are encoded opaquely in a transaction’s ArbitraryData. This presents a problem for pruned renter nodes: in order to build a healthy HostDB, they need to obtain the announcements contained in the blocks that they skipped over; but how can they accomplish this without downloading the whole chain or compromising their trust model? One solution would be to make blocks commit to the HostDB; the actual announcements could then be downloaded from any untrusted party, such as a peer node or a service like SiaStats. Each new renter node would only need to do this once.

Lastly, there’s the matter of the hardfork itself. How do we coordinate a network-wide upgrade that migrates the existing Sia chain to a new, Utreexo-based consensus implementation? There are a few different ways we could go about this, but they all boil down to constructing a validation context that represents the state of the Sia chain at the hardfork height. So it’s quite possible that the network will grind to a (temporary!) halt as every node simultaneously trawls through its consensus.db, building up the accumulator and converting the database to a new format. Interestingly, once this initial validation context has been agreed upon by the network, we’ll be able to hard-code it into future releases of siad, saving new nodes the trouble of duplicating all that effort.

All of that represents an intimidating amount of work. But I also see it as an opportunity: we can make Sia the world’s first Utreexo-native blockchain. For example, our block headers could commit to their parent validation context. That way, after validating the header chain, a node can trustlessly acquire the context for any block. In turn, you’ll be able to initialize a pruned node with just a block hash, since the associated context can be retrieved from any peer. We would then have reduced the information necessary to bootstrap a node to a mere 32 bytes—short enough to pass as a CLI argument, or send via QR code, or even transcribe by hand.

Current Status

For the past year or so, I’ve been working on a “minimum viable Utreexo.” This is a barebones blockchain implementation, written from scratch, with Utreexo deeply integrated throughout the system from day 1. Basically, the plan was to build a really great skeleton, then “graft” Sia onto that skeleton. This approach has given us a chance to learn what Utreexo is really like in practice: both the novel features it enables (like validation contexts), as well as its unique challenges (like keeping proofs updated).

The skeleton, which I’ve named sunyata, still has plenty of rough edges, but it’s long past time for me to share it with the wider crypto community—so here it is. I take solace in the fact that most of the roughness is concentrated in the p2p code, which will need to be completely overhauled for Sia compatibility anyway. To be clear, sunyata is not Sia, nor is it a new altcoin. But if you’re interested in what Utreexo looks like in code, or you want to see a blockchain stripped to its core essence, I encourage you to check it out.

sunyata doesn’t gossip about peers or automatically form new connections, so there isn’t really a testnet. However, you can get a taste of its instant syncing capabilities by connecting to my node and specifying a "checkpoint:"

./sunyata -peer 147.135.16.182:31354 \
  -checkpoint 10000::00000079ec0a99cad7dfad0085bfaf7531b6ce1083039dd6720f4ce825df2111
空 sunyata v0.5.0
Using tempdir /var/folders/3b/47x3x9cs3fxfhw9vtjzbw2vh0000gn/T/sunyata332452216
Downloading checkpoint 10000::25df2111 from 147.135.16.182:31354...Success!
Using wallet seed 0650f67bfde8f04ec814a2ef6a86335b
Listening on [::]:63180
Connecting to 147.135.16.182:31354...Success!
空> height
10660

You’re looking at a fully-validating node, bootstrapped from nothing more than a block hash, in under a second. Try it yourself, and I’m sure you’ll agree: this is the future of UTXO-based cryptocurrency. :)

Anyway, for the rest of 2021, I’ll be working closely with Skynet Labs to adapt sunyata for Sia and prepare the hardfork. This is a huge undertaking, so I’ll be cagey as usual about a delivery date, but suffice it to say I’m optimistic. Now that the groundwork has been laid, we can proceed with confidence. Stay tuned for more updates later this year.