Lewis Edwards

Posted: 2026-01-20

Tags: P2P, Code, Networking, Light

P2P: Smart Hashing

This is another idea I figured out 10-15 years ago which, again, hasn't really seen uptake in any meaningful way.

🔗 Background

Hashes are a foundational primitive in cryptography: they allow you to securely "fingerprint" a block of data. There are a few schemes that extend the concept, including Rabin fingerprinting and Merkle trees, but the mechanics of each are outside the scope of this concept, even if the implementation is useful.
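For readers who want a taste of the tree-hash idea without the full mechanics, here is a minimal sketch of building a Merkle root over fixed-size chunks. It is deliberately simplified: real schemes such as the Tiger Tree Hash used by hashing DC++ clients add leaf/node prefixes and different padding rules, and the chunk size here is an arbitrary choice.

```python
import hashlib

def merkle_root(data: bytes, chunk_size: int = 1024 * 1024) -> bytes:
    """Toy Merkle root: hash fixed-size chunks, then hash pairs upwards.
    Real schemes (e.g. THEX/TTH) add leaf/node prefixes and padding rules."""
    level = [hashlib.sha256(data[i:i + chunk_size]).digest()
             for i in range(0, len(data), chunk_size)] or [hashlib.sha256(b"").digest()]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the odd trailing node (one common convention)
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```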

The classical divide between DC++ clients was "hashing" vs "non-hashing". A minority of clients supported both. In more recent protocols like BitTorrent, this distinction is nonexistent: hashing is simply fundamental to how the protocol works. You must hash information to create a torrent of it, and continuously verifying information as it's coming in is simply business as usual when torrenting.

Hashing your data has many benefits: it guarantees integrity and allows you to aggregate data from many sources while still ensuring you have the correct information coming in. This improves performance and efficiency with basically all operations on a large data set — being able to prove that data is correct no matter where it came from opens up a wide range of possibilities.

Hashing clients created large practical problems. You might have multi-terabyte datasets to transfer, and iterating over them to create a set of hashes could take many days of energy-intensive computation. Taking this time before anything ever gets sent generates a massive opportunity cost (environmental impact aside), and creates a giant pool of friction around any protocol that demands it.
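To put rough, purely illustrative numbers on that friction (these figures are mine, not from any benchmark): a full pass over a share is read-bound long before it is CPU-bound, and multiple passes or seek-heavy directories of small files push it from hours into days.

```python
def hashing_hours(total_tb: float, read_mb_per_s: float, passes: int = 1) -> float:
    """Rough wall-clock estimate for hashing a share, assuming the job is
    limited by sequential read speed (illustrative only)."""
    total_mb = total_tb * 1024 * 1024
    return passes * total_mb / read_mb_per_s / 3600

# e.g. an 8 TB share on a spinning disk at ~150 MB/s, with two hash schemes requested:
print(f"{hashing_hours(8, 150, passes=2):.1f} hours")  # ~31 hours, before any retries
```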

🔗 Design Philosophy

🔗 Speculative Data Matching and Optimistic Lazy Evaluation

Before we start downloading, we can cheaply get access to file-system-level metadata: across machines, this means filename, file size, date created and date modified (which ideally should propagate with downloads); locally, we also have inodes/file IDs and any historical hash structures. This is already enough to do a lot of data matching and build a table of pretty-high-probability alternate source candidates.

Each of these sources of evidence is ranked and weighted contextually according to date and other variables.
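As a sketch of what that candidate table could look like (the field names, weights and threshold below are illustrative assumptions, not a defined format): gather the cheap metadata for each remote file, score it against the local target, and keep the high-probability matches.

```python
from dataclasses import dataclass

@dataclass
class FileMeta:
    name: str
    size: int
    mtime: float  # seconds since epoch

def candidate_score(local: FileMeta, remote: FileMeta) -> float:
    """Heuristic prior that a remote file is the same payload as `local`.
    Weights are illustrative; a real implementation would tune or learn them."""
    score = 0.0
    if remote.size == local.size:
        score += 0.6  # an exact size match is strong evidence
    if remote.name.lower() == local.name.lower():
        score += 0.3
    if abs(remote.mtime - local.mtime) < 86400:
        score += 0.1  # modified within a day of each other
    return score

def rank_candidates(local: FileMeta, remotes: list[FileMeta]) -> list[tuple[float, FileMeta]]:
    """Return (score, candidate) pairs, best first, above an arbitrary cut-off."""
    ranked = sorted(((candidate_score(local, r), r) for r in remotes),
                    key=lambda t: t[0], reverse=True)
    return [t for t in ranked if t[0] > 0.5]
```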

As the transfer begins, we can hash data as it is transferred; it doesn't take much before we can reasonably assume that this is, at bare minimum, a copy of the correct file with some degree of completeness and integrity. At that point, we can likely extrapolate a coherent picture of the hash trees belonging to this file, and we can work backwards from there to determine whether or not the data coming in is correct, then replace it later if it isn't.
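A minimal sketch of the verify-as-you-go idea, assuming we have already obtained (or extrapolated) the expected chunk hashes from somewhere: chunks fetched from unverified alternate sources stay marked as speculative until a reference hash confirms or rejects them. The class and state names are mine.

```python
import hashlib
from enum import Enum, auto

class ChunkState(Enum):
    MISSING = auto()
    SPECULATIVE = auto()  # fetched from an unverified alternate source
    VERIFIED = auto()
    REJECTED = auto()     # hash mismatch; must be re-fetched

class SpeculativeFile:
    def __init__(self, expected_chunk_hashes: list[bytes | None], chunk_size: int):
        # expected_chunk_hashes entries may be None while the hash tree is still unknown
        self.expected = expected_chunk_hashes
        self.chunk_size = chunk_size
        self.state = [ChunkState.MISSING] * len(expected_chunk_hashes)

    def accept_chunk(self, index: int, data: bytes, trusted_source: bool) -> ChunkState:
        """Store a chunk; verify immediately if we already know what it should hash to."""
        digest = hashlib.sha256(data).digest()
        if self.expected[index] is None:
            # No reference hash yet: keep the data, but only as a guess.
            self.state[index] = ChunkState.VERIFIED if trusted_source else ChunkState.SPECULATIVE
        elif digest == self.expected[index]:
            self.state[index] = ChunkState.VERIFIED
        else:
            self.state[index] = ChunkState.REJECTED  # re-download this range later
        return self.state[index]
```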

In the worst (rare) case, we can simply download the original file from the original source and discard all of the speculative data. We can also ask the originating source to hash data without sending it to us, to confirm matches.
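That "hash it on your side and just tell me the answer" check could be a single request/response pair. The message shape below is a hypothetical sketch, not part of any existing protocol:

```python
import json

def build_hash_probe(path: str, offset: int, length: int, algo: str = "sha256") -> bytes:
    """Hypothetical probe asking a peer to hash a byte range it already holds,
    so a speculative match can be confirmed without transferring the data."""
    return json.dumps({"cmd": "HASH_RANGE", "path": path,
                       "offset": offset, "length": length, "algo": algo}).encode()
```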

This requires some smarts; it is fundamentally a constraint solver with linear uncertainty metrics.

When we treat hashing less as a one-off canonical definition and more as a living knowledge base which has uncertainty and freshness built-in, we can make better guesses about which data we have and its integrity.
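One way to make "uncertainty and freshness built-in" concrete: every piece of evidence about a file (a size match, an old hash record, a speculative chunk) carries a weight and a timestamp, confidence decays as the evidence ages, and corroborating evidence is combined. The half-life and the combination rule below are illustrative stand-ins for the constraint solver mentioned above.

```python
import time

def evidence_confidence(weight: float, observed_at: float,
                        half_life_days: float = 30.0, now: float | None = None) -> float:
    """Decay a piece of evidence toward zero as it ages;
    half_life_days is an illustrative tuning knob."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - observed_at) / 86400)
    return weight * 0.5 ** (age_days / half_life_days)

def combined_confidence(evidence: list[tuple[float, float]]) -> float:
    """Combine roughly independent (weight, observed_at) evidence as
    1 - prod(1 - c_i): more corroboration means higher confidence, capped at 1."""
    p_wrong = 1.0
    for weight, observed_at in evidence:
        p_wrong *= 1.0 - evidence_confidence(weight, observed_at)
    return 1.0 - p_wrong
```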

🔗 Container Awareness

There is a neat paper relating to this, proposing what they call SET (Similarity Enhanced Transfer). It uses this fabulous stochastic handprinting concept to improve the chances of finding partial matches... which they then concede exist almost entirely because of large media files with identical content streams but different metadata. The need for this clever piece of stochastic trickery can be obviated with a fraction of the work.

Why not simply learn the layout of common media container formats, split them into separate streams as well as metadata, and hash those separately? This gives us identity confidence sooner, makes minor tag changes unimportant, and allows us to reuse payloads across different remuxes.

There's a certain amount of domain-specific effort associated with doing this, but for most real-world uses it's finite, and now we can include things like MP3s with modified tags or video files dubbed into various languages as data sources.
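As a concrete, if simplified, instance of the MP3 case: strip the ID3v2 header and ID3v1 trailer before hashing, so two copies that differ only in tags produce the same payload hash. Real container awareness (MP4, Matroska and friends) needs a proper demuxer; this sketch only handles the tag-skipping part and ignores APE tags and other edge cases.

```python
import hashlib

def mp3_payload_hash(data: bytes) -> bytes:
    """Hash an MP3's audio frames only, ignoring the leading ID3v2 and trailing
    ID3v1 tags, so retagged copies of the same rip hash identically."""
    start, end = 0, len(data)
    if data[:3] == b"ID3" and len(data) >= 10:
        # ID3v2 size is stored as 4 synchsafe bytes (7 bits each) after a 10-byte header.
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        start = 10 + size + (10 if data[5] & 0x10 else 0)  # +10 if a footer is present
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128  # drop the trailing ID3v1 tag
    return hashlib.sha256(data[start:end]).digest()
```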

🔗 Theoretical Benefits

What's neat about this concept is not just that it allows you to share terabytes of data with strong validation of its integrity but without spending days hashing; it's that it collapses the entire spectrum of "simple file upload" to "dedicated torrent infrastructure" down to a repeatable primitive that works for both cases and everything in between.

While this may seem a little elaborate for simply moving data, it could be the foundation of the One File Transfer Method To Rule Them All, one which has the best of all possible worlds. Once perfected, this could simply become the default for shifting data of any size.

Checkin

Version: 1

Written: 2026-01-20

Written on: 7.5mg olanzapine since 2025-11-11

Mental health was: poor - estimate 25% brain