Crums is a witness microservice providing digital certificates designed to establish when an object was first seen. The object can be anything digital: plain text, a document, an image, a video clip, a signature, a row in a database. Objects are identified by their SHA-256 hash, and it is these hashes that are witnessed and tracked.
To see how the service works, it's useful to consider how (without using this service) you could leave digital evidence on the web proving to anyone in the future that a file existed today or earlier. A simple solution might be to tweet a cryptographic hash (e.g. SHA-256) of your file: since tweets are timestamped by Twitter, then assuming everyone trusts Twitter's inner plumbing (and that there's still no edit button), you've left a digital trail. A link to your tweet might then satisfy a court in the future that you were in possession today of, for example, that document describing your invention. Or you might drop the hash in a conversation on Stack Exchange (again, assuming it's not editable). Or for good measure, just in case Twitter goes out of business, you might embed that hash in a Bitcoin transaction on the blockchain.
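For reference, the hash in question is easy to compute with standard tooling. Here is a small Python sketch (the function name and chunk size are our own choices, not part of any crums.io tooling):

```python
import hashlib

def sha256_hex(path: str) -> str:
    """Return the hex SHA-256 of a file, reading it in 64 KiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Tweeting the resulting 64-character hex string reveals nothing about the file's contents, yet commits you to them.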
Indeed this is exactly the idea behind how crums.io works. It creates an immutable digital trail embedding information about which hashes it has seen. But because this digital trail is fairly compact, it can be recorded and backed up at many places. Instead of having you save a link to a timestamp, the service provides you a text file that is the timestamp. This way, even if crums.io goes offline for whatever reason, chances are your timestamp is still verifiable.
Continuing with the do-it-yourself example above, how would you timestamp 10 files, not just one? You could of course tweet all 10 hashes. But let's say you don't want anyone to know just yet how many files you're timestamping. One approach might be to combine them into one file (say a zip file) and then tweet the hash of that. The downside is that to prove you had any one of those 10 documents at the time of the tweet, you now have to reveal all 10 (the whole zip file) that make up the tweeted hash.
A better DIY method than the zip file idea would be to compute the hash of each file, write those hashes to a separate file, and then tweet the hash of that file of hashes. You save this file of hashes (along with the documents you timestamped), and now you can prove any one of those 10 documents belongs to your tweeted hash without revealing the other 9. The way crums.io works is closer to this second approach.
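A minimal sketch of this second DIY scheme, assuming a listing format of one hex hash per line (the format is our own choice):

```python
import hashlib

def commitment(paths):
    """Hash each file, write the hex hashes one per line, and return
    both the listing and the SHA-256 of that file of hashes."""
    lines = []
    for p in paths:
        with open(p, "rb") as f:
            lines.append(hashlib.sha256(f.read()).hexdigest())
    listing = "\n".join(lines).encode()
    return listing, hashlib.sha256(listing).hexdigest()
```

To prove one document later, you reveal the listing and that single document: the document's hash appears in the listing, and the listing hashes to the tweeted value.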
We use a well-known data structure called a Merkle tree to combine document hashes into a single hash. The idea is similar to the DIY way of writing all the hashes in a single file and then computing the hash of that file of hashes, but with some advantages. A Merkle tree structure looks something like this:
It's a pyramid of hashes. The service publishes one of these roughly every 5 minutes. At the base of each pyramid (Merkle tree) lie the hashes crums.io has seen over those minutes, roughly in the order the service saw them. The layer immediately above the base contains the hash of the combination (concatenation) of each pair of hashes in the base layer below, so there are half as many hashes in this layer. The same pairing rule is applied at successive layers until there is but one hash at the top of the pyramid. We've glossed over some details, but that's essentially how the Merkle tree is built.
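The pairing rule can be sketched in a few lines of Python. The handling of an odd-sized layer here (carrying the last hash up unpaired) is one common convention and an assumption on our part; it is among the details the service's actual tree may handle differently:

```python
import hashlib

def merkle_root(leaves):
    """Reduce a list of hashes to a single root by repeatedly
    hashing the concatenation of each pair in a layer."""
    layer = list(leaves)
    while len(layer) > 1:
        nxt = [hashlib.sha256(layer[i] + layer[i + 1]).digest()
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])  # odd layer: carry last hash up (an assumption)
        layer = nxt
    return layer[0]
```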
The hashes that go into the tree are not actually the hashes the service witnesses; instead, every hash added is derived from a combination of both the witnessed hash and the time it was witnessed. We call the structure containing this hash/time combination a crum.
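As an illustration only (the exact crum encoding is not specified here, so treat this derivation as hypothetical), a leaf hash binding a witnessed hash to its witness time might look like:

```python
import hashlib

def crum_cell_hash(witnessed_hash: bytes, utc_millis: int) -> bytes:
    """Illustrative derivation: bind the witnessed hash to its
    UTC-millisecond witness time. Not crums.io's actual wire format."""
    return hashlib.sha256(witnessed_hash + utc_millis.to_bytes(8, "big")).digest()
```

The point of a construction like this is that the tree then commits to when each hash was seen, not merely that it was seen.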
One nice thing about Merkle trees is that you don't have to know what's in the whole tree in order to show that a particular hash belongs in it. In fact all you need are a few intermediate hashes that establish a path to the root of the tree (top of the pyramid in our parlance). A structure establishing this path is known as a Merkle proof.
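Verifying such a proof needs only the leaf hash, the sibling hashes along the path, and the root. A sketch, where the (sibling, is-left) encoding of the path is our own:

```python
import hashlib

def verify_merkle_proof(leaf, path, root):
    """Rebuild the root from a leaf hash and its path of sibling
    hashes; each path entry says whether the sibling is on the left."""
    h = leaf
    for sibling, sibling_is_left in path:
        pair = sibling + h if sibling_is_left else h + sibling
        h = hashlib.sha256(pair).digest()
    return h == root
```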
In our application, we call this path of hashes from a crum to the root of the tree a crumtrail: it's for the user to keep, and it is their timestamp for the hash they dropped to be witnessed. This timestamp is only a handful of bytes no matter how big the tree: its size grows with the number of digits in the count of hashes in the tree, i.e. logarithmically.
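A back-of-the-envelope bound, assuming SHA-256 (32-byte) hashes and roughly one sibling hash per tree level:

```python
import math

def approx_trail_bytes(hash_count, hash_len=32):
    """Rough upper bound on the hash data in a crumtrail:
    about log2(hash_count) sibling hashes of hash_len bytes each."""
    return math.ceil(math.log2(max(hash_count, 2))) * hash_len
```

Even a tree with a million leaves yields only about 20 sibling hashes, some 640 bytes of hash data.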
The service doesn't remember every hash it's ever seen. (How could it?) No, it remembers at best maybe a few days back. As it publishes new Merkle trees, it drops old ones. But it still saves (remembers) two things about each Merkle tree it's about to drop: the tree's root hash, and its last Merkle proof.
The service keeps these forever. The first, the saved root hash, identifies a Merkle tree the service has published. It provides one way to verify that a given crumtrail file was indeed generated by the service and belongs to the Merkle tree it published.
The second thing always saved, the last Merkle proof, is kept for bookkeeping. The very last hash (in the bottom layer) of any crums.io tree is special: it's the root of the previously published Merkle tree, and its Merkle proof threads the previous Merkle tree to this one. In this way, the service maintains an audit trail of everything it's done, without remembering much.
From time to time the service sprinkles corroborating evidence across third-party online assets in a way that is meant to timestamp the service's state. This is done by publishing the latest Merkle proof linking the previous tree to the new one. A number of such third-party sites are under consideration, including tombstoning on the Bitcoin blockchain. A dedicated GitHub data repo as well as the crums.io Twitter bot will be our first operating examples. Procedures to validate a crumtrail against third-party evidence without appeal to crums.io are spelled out in the technical docs. This way, if crums.io is down for whatever reason, a user's previously saved crumtrail is still verifiable.
Because new Merkle trees are only published every few minutes, the witness workflow is a two-step process.
When the service witnesses (receives) a hash it doesn't remember seeing, it first generates and returns a new crum. This crum is not the final product: it just contains the hash and the witness time (in UTC milliseconds).
Behind the scenes, witnessing a new (i.e. not recently seen) hash triggers its inclusion in an upcoming tree (the next tree or the one right after it). It takes a few minutes, but once that tree is generated, the crumtrail for the same hash becomes available. The user should download and save the hash's crumtrail promptly, within, say, the day. (As discussed above, the information for generating a verifiable crumtrail is not kept indefinitely.)
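From a client's perspective, the two-step flow might look like the following sketch. The endpoint paths, query parameters, and JSON fields below are hypothetical placeholders, not the documented crums.io API; consult the technical docs for the real interface.

```python
import json
import time
import urllib.error
import urllib.request

BASE = "https://crums.io/api"  # hypothetical base URL, for illustration

def witness(hash_hex, urlopen=urllib.request.urlopen):
    """Step 1: submit a hash; the service answers with a crum
    (the hash plus its witness time in UTC milliseconds)."""
    with urlopen(f"{BASE}/witness?hash={hash_hex}") as resp:
        return json.load(resp)

def fetch_crumtrail(hash_hex, retries=5, delay=300,
                    urlopen=urllib.request.urlopen):
    """Step 2: once the next Merkle tree is published (a few minutes
    later), download and save the crumtrail for the same hash."""
    for _ in range(retries):
        try:
            with urlopen(f"{BASE}/crumtrail?hash={hash_hex}") as resp:
                return json.load(resp)
        except urllib.error.HTTPError:
            time.sleep(delay)  # tree not published yet; wait and retry
    raise TimeoutError("crumtrail not yet available")
```

The `urlopen` parameter is injected only so the flow can be exercised without a network; a real client would simply call the service.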
© 2020-2021 crums.io