
Sharing and archiving data sets with Dat

August 27, 2018

This article was contributed by Antoine Beaupré

Dat is a new peer-to-peer protocol that uses some of the concepts of BitTorrent and Git. Dat primarily targets researchers and open-data activists as it is a great tool for sharing, archiving, and cataloging large data sets. But it can also be used to implement decentralized web applications in a novel way.

Dat quick primer

Dat is written in JavaScript, so it can be installed with npm, but there are standalone binary builds and a desktop application (as an AppImage). An online viewer can be used to inspect data for those who do not want to install arbitrary binaries on their computers.
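For those going the npm route, a typical install looks like the following (assuming Node.js and npm are already available; the package is simply named dat):

```shell
# Install the Dat command-line tool globally via npm (requires Node.js).
npm install -g dat

# Confirm the CLI is on the PATH.
dat --version
```

The standalone binary builds avoid the Node.js dependency entirely, at the cost of trusting a prebuilt executable.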

The command-line application allows basic operations like downloading existing data sets and sharing your own. Dat identifies content with a 64-character hex string: a 32-byte ed25519 public key that is used to discover content on the network. For example, this will download some sample data:

    $ dat clone \
      dat://778f8d955175c92e4ced5e4f5563f69bfec0c86cc6f670352c457943666fe639 \
      ~/Downloads/dat-demo

Similarly, the share command is used to share content. It indexes the files in a given directory and creates a new unique address like the one above. The share command starts a server that uses multiple discovery mechanisms (currently, the Mainline Distributed Hash Table (DHT), a custom DNS server, and multicast DNS) to announce the content to its peers. This is how another user, armed with that public key, can download that content with dat clone or mirror the files continuously with dat sync.
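The full round trip looks roughly like this (the placeholder key below stands in for whatever address dat share generates; the directory paths are, of course, arbitrary):

```shell
# On the publisher's machine: index the directory, generate a key pair,
# and start announcing the content to the discovery networks.
dat share ~/datasets/my-results

# On a reader's machine: a one-shot download of the published data...
dat clone dat://<64-hex-key> ~/Downloads/my-results

# ...or a continuously updated mirror that follows the publisher's changes.
dat sync ~/Downloads/my-results
```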

So far, this looks a lot like BitTorrent magnet links updated with 21st century cryptography. But Dat adds revisions on top of that, so modifications are automatically shared through the swarm. That is important for public data sets as those are often dynamic in nature. Revisions also make it possible to use Dat as a backup system by saving the data incrementally using an archiver.

While Dat is designed to work on larger data sets, processing them for sharing may take a while. For example, sharing the Linux kernel source code required about five minutes as Dat worked on indexing all of the files. This is comparable to the performance offered by IPFS and BitTorrent. Data sets with more or larger files may take quite a bit more time.

One advantage that Dat has over IPFS is that it doesn't duplicate the data. When IPFS imports new data, it duplicates the files into ~/.ipfs. For collections of small files like the kernel, this is not a huge problem, but for larger files like videos or music, it's a significant limitation. IPFS eventually implemented a solution to this problem in the form of the experimental filestore feature, but it's not enabled by default. Even with that feature enabled, though, changes to data sets are not automatically tracked. In comparison, Dat operation on dynamic data feels much lighter. The downside is that each set needs its own dat share process.
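For comparison, this is a sketch of the IPFS workaround; the configuration key and flag below are the experimental ones documented by the IPFS project, but treat the exact spelling as an assumption:

```shell
# Enable the experimental filestore so IPFS can reference files in place
# instead of copying them into ~/.ipfs.
ipfs config --json Experimental.FilestoreEnabled true

# Add a large file without duplicating it into the IPFS repository.
ipfs add --nocopy big-video.mkv
```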

Like any peer-to-peer system, Dat needs at least one peer to stay online to offer the content, which is impractical for mobile devices. Hosting providers like Hashbase (which is a pinning service in Dat jargon) can help users keep content online without running their own server. The closest parallel in the traditional web ecosystem would probably be content-delivery networks (CDNs), although pinning services are not necessarily geographically distributed and a CDN does not necessarily retain a complete copy of a website.

[Photo app]

A web browser called Beaker, based on the Electron framework, can access Dat content natively without going through a pinning service. Furthermore, Beaker is essential to get any of the Dat applications working, as they fundamentally rely on dat:// URLs to do their magic. This means that Dat applications won't work for most users unless they install that special web browser. There is a Firefox extension called "dat-fox" for people who don't want to install yet another browser, but it requires installing a helper program. The extension will be able to load dat:// URLs but many applications will still not work. For example, the photo gallery application completely fails with dat-fox.

Dat-based applications look promising from a privacy point of view. Because of its peer-to-peer nature, users regain control over where their data is stored: on their own computer, on an online server, or with a trusted third party. But considering the protocol is not well established in current web browsers, I foresee difficulties in adoption of that aspect of the Dat ecosystem. Beyond that, it is rather disappointing that Dat applications cannot run natively in a web browser given that JavaScript is designed exactly for that.

Dat privacy

An advantage Dat has over other peer-to-peer protocols like BitTorrent is end-to-end encryption. I was originally concerned by the encryption design when reading the academic paper [PDF]:

It is up to client programs to make design decisions around which discovery networks they trust. For example if a Dat client decides to use the BitTorrent DHT to discover peers, and they are searching for a publicly shared Dat key (e.g. a key cited publicly in a published scientific paper) with known contents, then because of the privacy design of the BitTorrent DHT it becomes public knowledge what key that client is searching for.

So in other words, to share a secret file with another user, the public key is transmitted over a secure side-channel, only to then leak during the discovery process. Fortunately, the public Dat key is not directly used during discovery as it is hashed with BLAKE2B. Still, the security model of Dat assumes the public key is private, which is a rather counterintuitive concept that might upset cryptographers and confuse users who are frequently encouraged to type such strings in address bars and search engines as part of the Dat experience. There is a security & privacy FAQ in the Dat documentation warning about this problem:

One of the key elements of Dat privacy is that the public key is never used in any discovery network. The public key is hashed, creating the discovery key. Whenever peers attempt to connect to each other, they use the discovery key.

Data is encrypted using the public key, so it is important that this key stays secure.
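That hashing step can be sketched as follows. The exact inputs (a keyed BLAKE2b-256 hash, with the public key as the hash key and the protocol string "hypercore" as the message) are an assumption based on my reading of the hypercore implementation, not a normative specification:

```shell
# Derive a discovery key from a public Dat key (sketch). Only the derived
# hash ever appears on the discovery networks, never the public key itself.
PUBKEY=778f8d955175c92e4ced5e4f5563f69bfec0c86cc6f670352c457943666fe639
python3 - "$PUBKEY" <<'EOF'
import sys, binascii, hashlib

pub = binascii.unhexlify(sys.argv[1])
# Keyed BLAKE2b with a 32-byte digest; without the public key, peers
# cannot reverse the discovery key back into the read capability.
disc = hashlib.blake2b(b"hypercore", digest_size=32, key=pub)
print(disc.hexdigest())
EOF
```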

There are other privacy issues outlined in the document; it states that "Dat faces similar privacy risks as BitTorrent":

When you download a dataset, your IP address is exposed to the users sharing that dataset. This may lead to honeypot servers collecting IP addresses, as we've seen in Bittorrent. However, with dataset sharing we can create a web of trust model where specific institutions are trusted as primary sources for datasets, diminishing the sharing of IP addresses.

A Dat blog post refers to this issue as reader privacy and it is, indeed, a sensitive issue in peer-to-peer networks. It is how BitTorrent users are discovered and served scary verbiage from lawyers, after all. But Dat makes this a little better because, to join a swarm, you must know what you are looking for already; the only peers who can observe swarm activity are those who already know the public key. This works well for secret content, but for larger, public data sets, it is a real problem; it is why the Dat project has avoided creating a Wikipedia mirror so far.

I found another privacy issue that is not documented in the security FAQ during my review of the protocol. As mentioned earlier, the Dat discovery protocol routinely phones home to DNS servers operated by the Dat project. This implies that the default discovery servers (and an attacker watching over their traffic) know who is publishing or seeking content, in essence discovering the "social network" behind Dat. This discovery mechanism can be disabled in clients, but a similar privacy issue applies to the DHT as well, although that is distributed so it doesn't require trust of the Dat project itself.

Considering those aspects of the protocol, privacy-conscious users will probably want to use Tor or other anonymization techniques to work around those concerns.

The future of Dat

Dat 2.0 was released in June 2017 with performance improvements and protocol changes. Dat Enhancement Proposals (DEPs) guide the project's future development; most work is currently geared toward implementing the draft "multi-writer proposal" in HyperDB. Without multi-writer support, only the original publisher of a Dat can modify it. According to Joe Hand, co-executive-director of Code for Science & Society (CSS) and Dat core developer, in an IRC chat, "supporting multiwriter is a big requirement for lots of folks". For example, while Dat might allow Alice to share her research results with Bob, he cannot modify or contribute back to those results. The multi-writer extension allows Alice to assign trust to Bob so he can have write access to the data.

Unfortunately, the current proposal doesn't solve the "hard problems" of "conflict merges and secure key distribution". The former will be worked out through user interface tweaks, but the latter is a classic problem that security projects typically have trouble solving; Dat is no exception. How will Alice securely trust Bob? The OpenPGP web of trust? Hexadecimal fingerprints read over the phone? Dat doesn't provide a magic solution to this problem.

Another thing limiting adoption is that Dat is not packaged in any distribution that I could find (although I requested it in Debian) and, considering the speed of change of the JavaScript ecosystem, this is unlikely to change any time soon. A Rust implementation of the Dat protocol has started, however, which might be easier to package than the multitude of Node.js modules. In terms of mobile device support, there is an experimental Android web browser with Dat support called Bunsen, which somehow doesn't run on my phone. Some adventurous users have successfully run Dat in Termux. I haven't found an app running on iOS at this point.

Even beyond platform support, distributed protocols like Dat have a tough slope to climb against the virtual monopoly of more centralized protocols, so it remains to be seen how popular those tools will be. Hand says Dat is supported by multiple non-profit organizations. Beyond CSS, Blue Link Labs is working on the Beaker Browser as a self-funded startup and a grass-roots organization, Digital Democracy, has contributed to the project. The Internet Archive has announced a collaboration between itself, CSS, and the California Digital Library to launch a pilot project to see "how members of a cooperative, decentralized network can leverage shared services to ensure data preservation while reducing storage costs and increasing replication counts".

Hand said adoption in academia has been "slow but steady" and that the Dat in the Lab project has helped identify areas that could help researchers adopt the project. Unfortunately, as is the case with many free-software projects, he said that "our team is definitely a bit limited on bandwidth to push for bigger adoption". Hand said that the project received a grant from Mozilla Open Source Support to improve its documentation, which will be a big help.

Ultimately, Dat suffers from a problem common to all peer-to-peer applications, which is naming. Dat addresses are not exactly intuitive: humans do not remember strings of 64 hexadecimal characters well. For this, Dat took a similar approach to IPFS by using DNS TXT records and /.well-known URL paths to bridge existing, human-readable names with Dat hashes. This sacrifices part of the decentralized nature of the project in favor of usability.
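The lookup side of that bridge can be sketched with standard tools. The record contents shown in the comments follow the dat-dns convention as I understand it; treat the exact formats as assumptions, and example.com as a stand-in for a real Dat-enabled domain:

```shell
# Option 1: a DNS TXT record on the domain maps the hostname to a key,
# conventionally of the form "datkey=<64-hex-key>".
dig +short TXT example.com

# Option 2: a well-known HTTPS path on the domain serves the dat:// address
# (conventionally a dat:// URL on the first line, followed by a TTL line).
curl https://example.com/.well-known/dat
```

Either way, resolving the name requires trusting DNS or a TLS-protected web server, which is exactly the centralization the raw hex addresses avoid.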

I have tested a lot of distributed protocols like Dat in the past and I am not sure Dat is a clear winner. It certainly has advantages over IPFS in terms of usability and resource usage, but the lack of packages on most platforms is a big limit to adoption for most people. This means it will be difficult to share content with my friends and family with Dat anytime soon, which would probably be my primary use case for the project. Until the protocol reaches the wider adoption that BitTorrent has seen in terms of platform support, I will probably wait before switching everything over to this promising project.


Index entries for this article
GuestArticles: Beaupré, Antoine



Sharing and archiving data sets with Dat

Posted Aug 28, 2018 2:17 UTC (Tue) by bnewbold (subscriber, #72587) [Link] (2 responses)

Great article! Thanks for your diligence, this is a nice balanced overview to point folks at.

> One advantage that Dat has over IPFS is that it doesn't duplicate the data. When IPFS imports new data, it duplicates the files into ~/.ipfs. For collections of small files like the kernel, this is not a huge problem, but for larger files like videos or music, it's a significant limitation. IPFS eventually implemented a solution to this problem in the form of the experimental filestore feature, but it's not enabled by default. Even with that feature enabled, though, changes to data sets are not automatically tracked. In comparison, Dat operation on dynamic data feels much lighter.

I think the above could be misread. To clarify, dat can currently be configured to *either* track all changes (history) of files in a folder (at the cost of a full duplication of all files and all historical changes), *or* track only the most recent version of files with no duplication (at the cost of losing all history). There is not (yet?) any fancy dat mode which efficiently tracks only deltas (changes) to files with no other file overhead.

I think the latter might be best left to the OS/filesystem layer (copy-on-write for files that have explicitly been copied, as in recent Apple file systems and, I assume, btrfs/zfs).

Sharing and archiving data sets with Dat

Posted Aug 28, 2018 5:01 UTC (Tue) by anarcat (subscriber, #66354) [Link] (1 responses)

Good point! I did notice that during that research but forgot to make that more explicit.

The comparison I made there was more about the default behavior of the basic commands of dat vs ipfs.

With dat, to share data, you basically only need to call dat share and you're done: that creates a magic URL and no data is moved around. History can be kept using an external archiver, which means data is duplicated, but it's not the out-of-the-box behavior.

With IPFS, to share data, you would call ipfs add, which will copy each file (or its chunks, I don't quite remember) to ~/.ipfs to be globally referenced by the ipfs daemon. There's the filestore extension to work around that, but it's not enabled by default. Furthermore, it's not clear to me that changes in the original dataset are automatically tracked the same way they are in dat.

Now this does not address deduplication within the repository: changes between files or history are out of scope of the above statements and the comments I made in the article. I haven't actually researched that aspect in detail for IPFS. From what I understand, dat does have some sort of "chunking" mechanism where only changed chunks are transferred. This is somewhat like how the rsync protocol works, except the chunk size is fixed and rsync uses a "rolling checksum" that is much more efficient with larger files. According to this comment, Rabin fingerprinting was tested but never deployed, and clients do not actually use any sort of chunking for network transfers, and hyperdrive does not use chunking for storage either, which raises the question of where exactly chunking is used in the first place. :)

And I would argue that leaving that to the filesystem is putting a huge load on the shoulders of that thing and the users, who normally never have to worry about such low-level stuff. The software should handle that, not the system, if only to optimize the network.

I hope that makes sense!

Sharing and archiving data sets with Dat

Posted Aug 30, 2018 3:51 UTC (Thu) by zougloub (subscriber, #46163) [Link]

FYI IPFS also can add without duplicating data (https://discuss.ipfs.io/t/does-ipfs-add-duplicate-content...)

Sharing and archiving data sets with Dat

Posted Aug 28, 2018 3:05 UTC (Tue) by ianmcc (subscriber, #88379) [Link] (2 responses)

The article gives an example of getting data via a dat://<key> URL, where <key> is an ed25519 public key. But later it says that the public key is supposed to remain secure and peers use only the 'discovery key', obtained from hashing the ed25519 key. Can someone clarify which key appears in a dat:// url?

Sharing and archiving data sets with Dat

Posted Aug 28, 2018 3:18 UTC (Tue) by bnewbold (subscriber, #72587) [Link] (1 responses)

The key in the URL is the "public key", which generally grants read access to the contents of the archive in question. The "discovery key" is usually never seen by humans.

There are a lot of use cases for dat archives. If you are sharing data with one or a small number of peers, you keep the URL secure and send it out of band to only your intended recipients. Note that any recipient can share the key on to whoever they want. If you're publishing data publicly, obviously you want to spread that `dat://` URL around liberally, not keep it secure.

Sharing and archiving data sets with Dat

Posted Aug 28, 2018 8:22 UTC (Tue) by epa (subscriber, #39769) [Link]

I thought that any kind of "security" based on keeping a URL secret was flawed. The URL mechanism, and most software, is designed with the assumption that the URL is a public identifier and access control is done with other mechanisms, not by whether you know the URL or not. It gets included in referrer logs, for example. Tim B-L outlined these principles in 1997 (and even back then they were just codifying existing practice): https://www.w3.org/DesignIssues/LinkMyths.html

It could be that Dat is quite different in its approach and takes care not to leak the address of a resource, since knowledge of the address acts as access control. But in that case it shouldn't be called a URL.

Sharing and archiving data sets with Dat

Posted Aug 28, 2018 4:46 UTC (Tue) by anarcat (subscriber, #66354) [Link]

Forgot to mention my usual list of contributed comments and bug reports:

Sharing and archiving data sets with Dat

Posted Aug 28, 2018 7:37 UTC (Tue) by sagi (subscriber, #64671) [Link]

Thank you for a very readable introduction to Dat.

Dat is completely new to me. I’m interested to understand how Dat compares to Tahoe-lafs for the mentioned use cases.

Would any of the commenters know?

Sharing and archiving data sets with Dat

Posted Aug 28, 2018 15:39 UTC (Tue) by Tara_Li (guest, #26706) [Link]

Is there some source available for a listing of public data sets that can be accessed using this protocol? I'm not interested in installing a new protocol unless I know there's something worth getting with it - perhaps the NASA Big Blue Marble files, the Gaia database, etc.

Sharing and archiving data sets with Dat

Posted Aug 30, 2018 22:45 UTC (Thu) by flussence (guest, #85566) [Link]

IPFS gets around the protocol support problem by having a built-in HTTP proxy (which can also be stacked behind a regular HTTPS reverse proxy like a webapp). Could Dat do the same thing?

Sharing and archiving data sets with Dat

Posted Sep 2, 2018 18:45 UTC (Sun) by willy (subscriber, #9762) [Link]

"It can be installed using npm". Nope. Nope nope nope nope nope. I stopped reading at that point.

I don't care whether it depends on left-pad or not. The whole concept is broken and I'm not interested in anything so fragile.


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds