Publishing the AgentSearch-V1 Torrent
A while back I mentioned my interest in the state of the open web and, specifically, in marking 2023 as the last year in which the majority of the web's written data was produced by humans. With the rise of generative AI, the web is now increasingly being populated by machine-written content.
A human-curated dataset could turn out to be a bit of a novelty down the road, and I wanted to contribute to preserving one.
This involved several attempts at downloading the entire dataset using Git (with Git LFS handling the large files), a version control system for tracking changes in files. I then transferred it to my external drive, which, despite not being the fastest, eventually housed the entire dataset, and created a torrent from it. The resulting .torrent file is now available for anyone who's interested:
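If you would rather pull the dataset straight from HuggingFace instead of (or in addition to) using the torrent, a sketch like the one below should work. Note that I downloaded it via Git; this is just an alternative route, and the repo id and destination path are assumptions you should adjust for your own setup.

```python
# Sketch: download the dataset directly from HuggingFace Hub.
# Assumes the repo id is SciPhi/AgentSearch-V1 (check the dataset page) and
# that local_dir points at a drive with roughly 1.2 TB of free space.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="SciPhi/AgentSearch-V1",            # assumed repo id
    repo_type="dataset",
    local_dir="/mnt/external/AgentSearch-V1",   # hypothetical external-drive path
)
```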
A brief side note on what a torrent is and how this works
A torrent file is essentially a key to the world of peer-to-peer file sharing. It does not itself contain the file you wish to download; rather, it holds metadata about the files and folders to be shared, including their names, sizes, and directory structure, along with checksums used to verify each downloaded piece. Crucially, it also contains information about the tracker, a server that coordinates the actions of all peers. Think of a torrent file as a map that guides your computer to find and assemble pieces of your desired file from computers across the globe that are part of the same P2P network.
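To make that concrete, here is roughly what the metadata inside a .torrent file looks like once its bencoded contents are decoded. The field names follow the standard BitTorrent metainfo layout, but the tracker URL, file names, and sizes below are purely illustrative:

```python
# Illustrative only: the logical structure of a decoded .torrent file.
# Field names follow the BitTorrent metainfo format; values are made up.
torrent_metadata = {
    "announce": "http://tracker.example.org/announce",  # tracker URL (hypothetical)
    "info": {
        "name": "AgentSearch-V1",            # top-level folder name
        "piece length": 4 * 1024 * 1024,     # size of each piece, in bytes
        "pieces": b"<concatenated 20-byte SHA-1 hashes, one per piece>",
        "files": [                           # per-file paths and sizes (hypothetical names)
            {"path": ["shard-00000.parquet"], "length": 500_000_000},
            {"path": ["shard-00001.parquet"], "length": 500_000_000},
        ],
    },
}
```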
Peer-to-peer networking is the backbone of this process, a decentralized approach to file sharing that contrasts sharply with traditional, server-based methods. In a P2P network, every participant (or "peer") plays both roles of client and server. This means that instead of downloading a file from a single source, your computer connects to multiple computers that have the file or parts of it. As you begin downloading pieces of the file, you also start uploading parts you've already received to other peers. This simultaneous give-and-take ensures the file is distributed efficiently and can be more resilient and faster than downloading from a single server, especially for files in high demand.
This model not only speeds up the process of file sharing but also distributes the load across multiple points, reducing reliance on any single server and potentially lowering hosting costs.
Getting started is as easy as downloading a free torrent client. I use Transmission, which is available on Mac, Windows, and Linux. Download the torrent file above and open it in Transmission. Choose a location to save the files - remember you'll need ~1.2 terabytes of free space for the entire thing - and the download will begin. You can then also "seed" the parts you've downloaded to other peers.
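If you want to double-check that your chosen drive really has room before kicking things off, a quick standard-library check like this will tell you (the path is a placeholder for wherever you told Transmission to save the files):

```python
# Quick sanity check: confirm the download location has ~1.2 TB free
# before starting the torrent. The path below is a placeholder.
import shutil

REQUIRED_TB = 1.2
usage = shutil.disk_usage("/Volumes/ExternalDrive")  # hypothetical save location
free_tb = usage.free / 1e12

print(f"Free space: {free_tb:.2f} TB")
if free_tb < REQUIRED_TB:
    print(f"Warning: less than {REQUIRED_TB} TB free; the full dataset may not fit.")
```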
Current Availability
If you share my passion for preserving the open web and have insights or resources to contribute to this project, particularly in distributing the dataset or enhancing its security, please connect with me on X. Together, we can explore innovative ways to leverage torrent technology responsibly.
In an ideal world, HuggingFace would use their technology to provide an official torrent with their security/identification built-in. There would be benefits to them in doing this, both as a strategy credit with the open-source community and as a reduction in the infrastructure costs of hosting these resources, should more users embrace torrent technology. Torrents' current association with pirated or copyright-infringing material is a real concern, but should that stop the open-source community from using the technology to democratize access to large datasets such as this one? Let me know if you have any thoughts about this!