us
Previously, we saw how to
manually add encryption and redundancy to our files. What we didn’t see was
how to manage the additional metadata required by those features. And there is
a lot of metadata: we need to know which hosts the file’s shards are stored
on; which encryption algorithm was used to encrypt the shards; which erasure
code parameters were used; and so on. If we want to retrieve our file later,
we need to save all this metadata somewhere. While there are many ways to
accomplish this, `us` provides a “blessed” format for doing so, called a
metafile.
In this post, we’ll examine the structure of a metafile and how it fits into
the `us` system. We’ll also take a look at another `us` format: the format
that stores your file contracts. Although `us` provides types and functions for
working with these formats, this post won’t walk through that API; instead,
I’d like to focus on the design decisions. So we’ll wrap up with a discussion
of why metafiles and contracts in `us` are fundamentally different from their
`siad` counterparts.
A metafile consists of an index and a set of shards. The
metafile format itself is a gzipped tar archive of the index and shards, each
stored as a separate file. The suggested extension for metafiles is `.usa` –
“a” for “archive.”
The index contains what we typically think of as file metadata: the size,
mode bits, modtime, etc. It also contains Sia-specific information: the
encryption key, the erasure code parameters, and the public keys of each host
the file was stored on. The index is encoded as a simple JSON object; you can
view it directly with `tar xzf [metafile] index -O`.
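If you would rather read the index from a script, the following is a minimal Python sketch of the same operation. It builds a tiny stand-in metafile in memory (index only, no shards) and reads the `index` member back. The field names in the sample index are assumptions made up for illustration, not the real schema; formats.md is the authority on what the index actually contains.

```python
import io
import json
import tarfile


def read_index(metafile_bytes: bytes) -> dict:
    """Return the decoded JSON index from a gzipped tar archive."""
    with tarfile.open(fileobj=io.BytesIO(metafile_bytes), mode="r:gz") as tar:
        member = tar.extractfile("index")  # same member the tar command extracts
        return json.load(member)


def build_sample_metafile(index: dict) -> bytes:
    """Build a stand-in archive that holds only an index member."""
    payload = json.dumps(index).encode()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="index")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Hypothetical field names, for illustration only.
sample_index = {
    "filesize": 1536,
    "mode": 0o644,
    "min_shards": 2,
    "hosts": ["hostA", "hostB"],
}
index = read_index(build_sample_metafile(sample_index))
print(index["filesize"], index["hosts"])
```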
The shards collectively describe the actual bytes that comprise the encrypted, redundant file. Each shard is associated with a single host. Shards consist of a series of binary-encoded “sector slices,” each identifying a sector (by its Merkle root) and an offset and length within that sector. For example, a shard might refer to the first 512 bytes of sector A, followed by 1024 bytes from the middle of sector B; the “logical shard” is then the concatenation of these two slices, for a total of 1536 bytes.
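To make the slice arithmetic concrete, here is a small Python sketch of the example above. The 40-byte layout (a 32-byte Merkle root followed by a 32-bit offset and a 32-bit length) is an assumption chosen for illustration, as is the offset of 2048 standing in for “the middle of sector B”; the real encoding is specified in formats.md.

```python
import struct
from dataclasses import dataclass
from typing import List

# Assumed layout: 32-byte Merkle root, then offset and length as
# little-endian uint32s. Not the real on-disk encoding.
SLICE_FORMAT = "<32sII"
SLICE_SIZE = struct.calcsize(SLICE_FORMAT)


@dataclass(frozen=True)
class SectorSlice:
    merkle_root: bytes
    offset: int
    length: int

    def encode(self) -> bytes:
        return struct.pack(SLICE_FORMAT, self.merkle_root, self.offset, self.length)


def decode_shard(data: bytes) -> List[SectorSlice]:
    """Split a shard into its fixed-size slice records."""
    return [
        SectorSlice(*struct.unpack_from(SLICE_FORMAT, data, pos))
        for pos in range(0, len(data), SLICE_SIZE)
    ]


def logical_length(slices: List[SectorSlice]) -> int:
    """Length of the logical shard: the concatenation of all slices."""
    return sum(s.length for s in slices)


root_a = b"A" * 32  # placeholder roots, not real hashes
root_b = b"B" * 32
shard = SectorSlice(root_a, 0, 512).encode() + SectorSlice(root_b, 2048, 1024).encode()
slices = decode_shard(shard)
print(logical_length(slices))  # 512 + 1024 = 1536
```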
To upload a file, we first create the index and initialize a shard file for
each host. Each time we upload a sector of the file, we’ll append a slice to
the host’s corresponding shard file. When we’re done, we bundle the index and
shard files into a `.tar.gz`. To download, we first need to unzip and un-tar
that file; then, we use the sector slices in each shard to retrieve the
(encrypted, redundant) data stored on hosts; then we use the metadata in the
index to turn that raw data back into our original file.
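The bookkeeping side of that workflow can be sketched in a few lines of Python. This is only an outline under stated assumptions: the network transfer, encryption, and erasure coding are omitted, the slices are treated as opaque bytes, and the `<host>.shard` member names are hypothetical.

```python
import io
import json
import tarfile


def add_member(tar, name, payload):
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def upload(slices_per_host, index):
    """Append one encoded slice per uploaded sector, then bundle everything.

    slices_per_host maps a host name to the encoded slices produced as each
    sector is uploaded to that host.
    """
    shards = {host: bytearray() for host in slices_per_host}
    for host, encoded_slices in slices_per_host.items():
        for encoded in encoded_slices:
            shards[host] += encoded  # the shard file only ever grows
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        add_member(tar, "index", json.dumps(index).encode())
        for host, shard in shards.items():
            add_member(tar, host + ".shard", bytes(shard))
    return buf.getvalue()


def unpack(metafile):
    """First step of a download: recover the index and the per-host shards."""
    with tarfile.open(fileobj=io.BytesIO(metafile), mode="r:gz") as tar:
        members = {m.name: tar.extractfile(m).read() for m in tar.getmembers()}
    index = json.loads(members.pop("index"))
    return index, members


metafile = upload(
    {"hostA": [b"slice1", b"slice2"], "hostB": [b"slice3"]},
    {"hosts": ["hostA", "hostB"]},
)
index, shards = unpack(metafile)
print(index["hosts"], {name: len(data) for name, data in shards.items()})
```

After `unpack`, a real downloader would walk each shard's slices to fetch sector data from the matching host, then use the index to decrypt and reconstruct the file.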
The metafile format was designed to be easy to grok and easy to work with. It uses existing popular formats – tar, gzip, JSON – for everything but the shards, which must be binary-encoded for performance reasons. The shards themselves are intuitive if you are familiar with erasure-coding; this is why the format explicitly makes each shard its own “thing,” rather than having one giant binary blob where the boundaries of each shard are determined by fixed offsets. (Don’t get me wrong, the latter approach has many advantages – but it’s also harder to work with.)
For a more technical description of the metafile format, see formats.md.
Contract files are much simpler than metafiles. They have two parts. First is an immutable header, containing the host’s public key, the contract ID, and the secret key that can sign revisions. Following this is the most recent revision of the contract, which contains information like the total Merkle root of the contract data, how many coins are allocated to the renter and host, etc. We overwrite the revision each time we modify the contract.
In principle, a contract could consist of just a header, because the Sia
protocol allows us to request the most recent revision from the host. We keep a
local copy of the revision for performance and convenience; it means we don't
need to perform any network I/O to answer simple questions like "how many coins
are left in this contract?" But since the revision is non-essential, we can play
fast and loose with it. If it becomes corrupted or lost, we can always ask the
host for their copy. Consequently, we don't need to worry about updating the
revision atomically or calling `fsync` after each update, which can be major
performance bottlenecks. The header, on the other hand, is immutable and must
not be modified. If we accidentally overwrite our secret key, for example,
we'll never be able to revise the contract again. To avoid this, we write the
header exactly once, `fsync` it, and never touch it again.
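Here is a rough Python sketch of that write discipline. The 128-byte header size and the revision fields are hypothetical stand-ins rather than the real layout (see formats.md for that); the point is only the asymmetry between the two regions of the file.

```python
import os
import struct
import tempfile

HEADER_SIZE = 128  # hypothetical fixed header size
# Hypothetical revision: revision number, renter coins, host coins, Merkle root.
REVISION_FORMAT = "<QQQ32s"
REVISION_SIZE = struct.calcsize(REVISION_FORMAT)


def create_contract(path, header, revision):
    if len(header) != HEADER_SIZE:
        raise ValueError("header has the wrong size")
    with open(path, "xb") as f:  # "x" refuses to clobber an existing contract
        f.write(header)
        f.flush()
        os.fsync(f.fileno())  # make the header durable, exactly once
        f.write(revision)


def update_revision(path, revision):
    with open(path, "r+b") as f:
        f.seek(HEADER_SIZE)  # never write before this offset
        f.write(revision)  # no fsync: the host can always resend its copy


def read_contract(path):
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
        revision = struct.unpack(REVISION_FORMAT, f.read(REVISION_SIZE))
    return header, revision


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.contract")
    header = b"\x01" * HEADER_SIZE
    create_contract(path, header, struct.pack(REVISION_FORMAT, 1, 100, 0, b"\x00" * 32))
    update_revision(path, struct.pack(REVISION_FORMAT, 2, 90, 10, b"\xaa" * 32))
    stored_header, stored_revision = read_contract(path)

print(stored_revision[:3])
```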
Contract files also used to contain the Merkle roots of each sector comprising the contract data, but a later upgrade to the Sia protocol made this unnecessary. The renter now only needs to store a single Merkle root – one covering the entire contract – in order to verify that the host has processed its requests correctly. (This root is stored in the revision.)
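To see why a single root is enough to detect a change anywhere in the contract data, consider this simplified Python sketch of a binary Merkle tree over sector roots. The hash function and node prefix here are assumptions, and the construction is not guaranteed to match the tree the Sia protocol uses byte for byte.

```python
import hashlib


def node_hash(left: bytes, right: bytes) -> bytes:
    # Assumed: BLAKE2b-256 with a one-byte prefix marking interior nodes.
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def merkle_root(roots):
    """Combine a non-empty list of sector roots into one 32-byte root."""
    if not roots:
        raise ValueError("need at least one sector root")
    if len(roots) == 1:
        return roots[0]
    split = 1
    while split * 2 < len(roots):  # largest power of two below len(roots)
        split *= 2
    return node_hash(merkle_root(roots[:split]), merkle_root(roots[split:]))


# Placeholder sector roots; real ones are hashes of sector data.
sector_roots = [bytes([i]) * 32 for i in range(5)]
contract_root = merkle_root(sector_roots)

# Changing any one sector root changes the contract-wide root.
tampered = list(sector_roots)
tampered[3] = b"\xff" * 32
print(contract_root != merkle_root(tampered))
```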
For a more technical description of the contract file format, see formats.md.
What do contracts and metafiles have in common? The answer lies not so much in what they are, but rather what they allow you to do. What contracts and metafiles share is that possessing them inherently bestows access rights. If you possess a contract file, you can revise that contract; if you possess a metafile, you can download that file.
The technical term for this is a capability. A capability is “a communicable, unforgeable token of authority.” Contracts and metafiles contain both a reference to an object and a key that permits access to that object – the key is the “unforgeable token of authority.” And both files are communicable: if you send one to your friend, they gain the exact same access rights.
In `siad`, the contract and siafile formats contain references and keys, just
like their `us` counterparts. But crucially, these files are not
communicable. If you send a `siad` contract to a friend, they will not be
able to load it into their own `siad`; the system was simply not designed to
accommodate this sort of operation. Likewise with siafiles, although `siad`
plans to support some form of filesharing soon.
By contrast, `us` encourages treating these files as first-class citizens.
When you finish uploading a file with `user`, it doesn’t just print “done”, it
directly returns a capability, in the form of a metafile. What you do with
that metafile is up to you; you can stick it in a folder, rename it, delete
it, compress it, `rsync` it to a backup server, whatever. The important thing
is that it’s sitting out in the open for you to manipulate, and you manipulate
it directly via the filesystem instead of through a custom API.
A great example of this is how `user` manages “enabled” and “disabled”
contracts: to enable a contract, just create a symlink to it in the
appropriate directory. To disable the contract, just delete the symlink. (This
should sound familiar to anyone who has used nginx.) This sort of
functionality could have been accomplished with an “enabled” list in a config
file, or by adding a `bool` to the contract format, but punting it to the
filesystem gives us this feature for free, and with semantics the user
is already familiar with.
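The symlink convention is easy to drive from a script. The Python sketch below shows the idea; the directory names are hypothetical (modeled on nginx's sites-available and sites-enabled), and it assumes a platform where unprivileged symlinks work, such as Linux or macOS.

```python
import os
import tempfile


def enable(contract_path, enabled_dir):
    """Enable a contract by linking to it from the enabled directory."""
    link = os.path.join(enabled_dir, os.path.basename(contract_path))
    os.symlink(os.path.abspath(contract_path), link)
    return link


def disable(contract_path, enabled_dir):
    """Disable a contract by removing the link; the contract file stays put."""
    os.remove(os.path.join(enabled_dir, os.path.basename(contract_path)))


def enabled_contracts(enabled_dir):
    return sorted(
        name
        for name in os.listdir(enabled_dir)
        if os.path.islink(os.path.join(enabled_dir, name))
    )


with tempfile.TemporaryDirectory() as root:
    available = os.path.join(root, "contracts-available")  # hypothetical names
    enabled = os.path.join(root, "contracts-enabled")
    os.mkdir(available)
    os.mkdir(enabled)
    contract = os.path.join(available, "hostA.contract")
    open(contract, "wb").close()  # empty placeholder contract file

    enable(contract, enabled)
    after_enable = enabled_contracts(enabled)
    disable(contract, enabled)
    after_disable = enabled_contracts(enabled)
    contract_still_exists = os.path.exists(contract)

print(after_enable, after_disable, contract_still_exists)
```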
Of course, there are downsides to shoving these files in the user’s face.
Managing contracts is hard work; you need to pick your hosts carefully and make
sure you renew on time. These tasks are best handled by a sophisticated program,
not a human. That’s why `siad` abstracts your contracts into an
“allowance,” so that you can focus on the high-level goals: how much to spend,
and over what period. Same with files: by ceding control of your files to
`siad`, you allow it to automatically repair them when hosts go offline.
`user`, on the other hand, forces you to make all of these decisions
explicitly, which is both empowering and overwhelming. The good news is that
you can write your own `siad`! That is, you can write programs in any language
that automatically manage your contracts and files. For example, you could set
up a `cron` job that automatically renews your contracts, or a Python script
that regularly scans your hosts and sorts them by latency/throughput/price. So
in the long term, I don’t expect many people to invoke `user` directly.
Instead, they’ll build more sophisticated systems on top of `user` (or `us`)
that are tailored to their specific needs.
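As a taste of what such a script might look like, here is a minimal Python sketch of the host-sorting step. The `HostScan` record, its fields, and the sample numbers are all made up for illustration; a real script would fill them in from its own measurements, and the tie-breaking order is just one reasonable choice.

```python
from dataclasses import dataclass


@dataclass
class HostScan:
    public_key: str
    latency_ms: float
    throughput_mbps: float
    price_per_tb: float


def rank_hosts(scans):
    """Lowest latency first, then highest throughput, then lowest price."""
    return sorted(
        scans,
        key=lambda s: (s.latency_ms, -s.throughput_mbps, s.price_per_tb),
    )


# Made-up measurements, for illustration only.
scans = [
    HostScan("hostA", latency_ms=120.0, throughput_mbps=40.0, price_per_tb=3.0),
    HostScan("hostB", latency_ms=45.0, throughput_mbps=25.0, price_per_tb=5.0),
    HostScan("hostC", latency_ms=45.0, throughput_mbps=60.0, price_per_tb=4.0),
]
ranked = [scan.public_key for scan in rank_hosts(scans)]
print(ranked)
```

Run from `cron`, a script like this could rewrite the set of enabled contracts without any long-lived daemon.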
We examined some of the design decisions of metafiles and contract files, and
saw how they operate as capabilities within the `us` system. We also saw how
`user` takes a “Unix philosophy” approach to contract and metafile management
by encouraging the user to manipulate these files directly.