By now, you've probably heard that a new wave of Artificial Intelligence (AI) is emerging, possessing great power and the potential to change virtually everything. Opinions vary widely, with some predicting a future of either absolute horror or unending utopia. The impact of AI is often likened to the invention of nuclear technology, which gave rise to both atomic bombs and nuclear power plants. I believe that while AI's impact will likely fall between these extremes, the atomic bomb analogy is insightful for another reason.
Scientists use a method called radiocarbon dating to estimate the age of objects containing organic material. They do this by focusing on a particular type of carbon, carbon-14, which is radioactive. Living things take in carbon-14 while they are alive; once they die, it decays at a known rate (its half-life is about 5,730 years). Thus, by measuring how much carbon-14 remains, we can estimate how long ago the organism died. The less carbon-14, the older it is.
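The arithmetic behind this is simple enough to sketch in a few lines of Python. This is only an illustration of the basic decay relationship, assuming a half-life of 5,730 years and ignoring the calibration curves that real labs apply; the function names are my own.

```python
import math

# Commonly cited half-life of carbon-14, in years (an assumed constant for this sketch).
HALF_LIFE_YEARS = 5730.0


def fraction_remaining(age_years: float) -> float:
    """Fraction of the original carbon-14 left after `age_years` of decay."""
    return 0.5 ** (age_years / HALF_LIFE_YEARS)


def age_from_fraction(fraction: float) -> float:
    """Uncalibrated age in years, given the fraction of carbon-14 still present."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the range (0, 1]")
    # Each halving of the carbon-14 level corresponds to one half-life.
    return -HALF_LIFE_YEARS * math.log2(fraction)


print(age_from_fraction(0.5))   # one half-life: 5730.0
print(age_from_fraction(0.25))  # two half-lives: 11460.0
```

A sample with a quarter of its original carbon-14 has been through two half-lives, so it comes out at roughly 11,460 years old.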
However, all of that changed in the mid-1940s. As nuclear bombs were created and deployed, and then tested in the atmosphere through the 1950s and early 1960s, the overall level of carbon-14 in the atmosphere went up sharply, nearly doubling at its peak. The half-life itself did not change, but the baseline did, so anything from after that period has to be dated against a different calibration than anything from before it. That's how big of a mark it left.
Similarly, the internet, essentially a network of databases and datasets, was almost entirely human-generated until 2023. ChatGPT, released in late November 2022, had by then demonstrated that machines could produce content that is hard to distinguish from that created by humans.
AI-generated content is now ubiquitous, reshaping search results and influencing the creation and editing of articles. In the future, we might pinpoint the moment when AI began significantly contributing to our global knowledge base, analogous to the increase in atomic radiation levels.
One intriguing project I've been following is AgentSearchV1. Available on Hugging Face, this dataset compiles the latest from reputable sources like arXiv, Google Books, OpenWebMath, Stack Exchange, Wikipedia, and more. It represents a snapshot of pre-AI-dominated data, curated by humans in mid-2023.
Realizing this, I did what anyone would do: I attempted to download it. All of it. This turned out to be much more complicated than I thought. I wanted to use git so that I could potentially run diffs against future versions. In total the dataset is just over 2.5TB, and it took several failed attempts before I got it right.
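For anyone attempting something similar, here is a rough sketch of one way to make a git-based download of this size survive interruptions. It is not a record of the exact commands I ran. It assumes `git` and `git-lfs` are installed and that the large files are stored with Git LFS; the repository URL below is also an assumption, and the function names are made up for illustration. The idea is to clone only the small LFS pointer files first, then fetch the real data in a step that can be re-run after a failure.

```python
import os
import subprocess

# Assumed location of the dataset; check the dataset page for the real URL.
REPO_URL = "https://huggingface.co/datasets/SciPhi/AgentSearch-V1"


def lfs_clone_commands(repo_url: str, target_dir: str) -> list[list[str]]:
    """Build the git commands: set up LFS, clone, then pull the LFS objects."""
    return [
        ["git", "lfs", "install"],
        ["git", "clone", repo_url, target_dir],
        ["git", "-C", target_dir, "lfs", "pull"],
    ]


def download(repo_url: str, target_dir: str, max_attempts: int = 5) -> int:
    """Clone pointer files once, then retry the LFS pull until it succeeds.

    Returns the number of pull attempts that were needed.
    """
    install, clone, pull = lfs_clone_commands(repo_url, target_dir)
    # GIT_LFS_SKIP_SMUDGE=1 makes the clone fetch pointer files only,
    # so the initial clone is small and quick.
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    subprocess.run(install, check=True)
    if not os.path.isdir(os.path.join(target_dir, ".git")):
        subprocess.run(clone, check=True, env=env)
    for attempt in range(1, max_attempts + 1):
        # Re-running the pull fetches whatever objects are still missing.
        if subprocess.run(pull).returncode == 0:
            return attempt
    raise RuntimeError(f"git lfs pull still failing after {max_attempts} attempts")
```

Calling `download(REPO_URL, "AgentSearch-V1")` would start the transfer. Because the result is an ordinary git working tree, later versions can still be compared with `git diff`.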
So this 2.5TB human-curated dataset, perhaps the last one containing only human-authored content, is one that I want to see preserved. I've reached out to the authors and am exploring packaging it as a torrent to make it easier for others to access.
Of course, it could turn out that the AI-generated content is actually better than what came before. In any case, I find it interesting and want to see what others will do with it. That's what open source is all about, after all.
Update: You can now download (and seed) the Agent-V1 Dataset here: