In a 2002 television commercial for DirecTV’s blazing fast DSL Internet service, a man is seen aimlessly browsing the Internet, clicking link after link. Suddenly, his computer informs him that he has “reached the end of the Internet” and has “seen everything there is to see.” He stares at the screen in amazement.
Since the late 1990s, jokesters have set up numerous web pages purporting to be the last page of the Internet. Of course, the notion that any human could visit every on-line page is preposterous: according to WorldWideWebSize.com, which publishes a daily estimate of the size of the indexed web, the Internet contained at least 5 billion web pages as of April 6, 2024.
But even though AI might have difficulty re-creating the human hand, it can click through web pages really fast.
In a recent story for The Wall Street Journal, Deepa Seetharaman writes that AI companies are running out of training data after burning through the entire Internet.
Companies racing to develop more powerful artificial intelligence are rapidly nearing a new problem: The internet might be too small for their plans.
Ever more powerful systems developed by OpenAI, Google and others require larger oceans of information to learn from. That demand is straining the available pool of quality public data online at the same time that some data owners are blocking access to AI companies.
Some executives and researchers say the industry’s need for high-quality text data could outstrip supply within two years, potentially slowing AI’s development.
While companies like OpenAI don’t disclose the exact sources of their training data, it is believed that they vacuum up on-line content such as research papers, news articles, and Wikipedia pages. They also likely ingest content from social media platforms, although much of that content is “walled off” from automated bots.
That doesn’t stop those social media platforms from utilizing the data themselves: the WSJ story notes that Meta CEO Mark Zuckerberg recently touted his company’s access to hundreds of billions of publicly shared images and videos across Facebook and Instagram, creating a rather robust training set for Meta’s own ambitious AI goals.
As AI companies begin to run out of training data on the Internet, they're turning to synthetically generated data, but that practice is controversial.
In a story for Futurism, Noor Al-Sibai touches on some of the challenges of using synthetic data sets.
Synthetic data, meanwhile, has been the subject of ample debate in recent months after researchers found last year that training an AI model on AI-generated data would be a digital form of “inbreeding” that would ultimately lead to “model collapse” or “Habsburg AI.”
Habsburg AI refers to the House of Habsburg (or House of Austria), one of the most prominent dynasties in European history. Seeking to consolidate their power, the Habsburgs relied on frequent consanguine marriages between closely related individuals. This inbreeding narrowed the family's gene pool and is believed to have caused health ailments such as epilepsy, insanity, and early death.
Some are now predicting that AI models may suffer a similar fate as they begin to train against data they themselves created, recycling biases and errors and making it increasingly difficult to check the results for reliability.
In a 2023 research paper titled The Curse of Recursion: Training on Generated Data Makes Models Forget, researchers caution about an effect called Model Collapse that could occur when large language models are trained on data produced by other AIs instead of by humans.
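The intuition behind model collapse can be seen in a toy simulation. This is a sketch, not the paper's actual experiment: it assumes each "generation" of model fits a simple Gaussian to the previous generation's output, underrepresents rare tail events (modeled here as simply dropping samples beyond two standard deviations), and then generates the next generation's training data from that fit. The diversity of the data shrinks with each recursive round.

```python
# Toy illustration of recursive training on self-generated data.
# Assumption: "training" is fitting a Gaussian, and the model's tendency
# to underweight rare events is modeled by discarding samples more than
# two standard deviations from the mean before resampling.
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

def simulate_collapse(generations: int = 10, n: int = 1000) -> list[float]:
    """Return the sample standard deviation observed at each generation."""
    # Generation 0: "human" data drawn from a standard normal distribution.
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    spreads = [statistics.stdev(data)]
    for _ in range(generations):
        mu = statistics.mean(data)
        sigma = statistics.stdev(data)
        # The fitted model underweights rare events: keep only samples
        # within two standard deviations of the mean.
        kept = [x for x in data if abs(x - mu) <= 2 * sigma]
        mu = statistics.mean(kept)
        sigma = statistics.stdev(kept)
        # The next generation is "trained" purely on the model's own output.
        data = [random.gauss(mu, sigma) for _ in range(n)]
        spreads.append(statistics.stdev(data))
    return spreads

spreads = simulate_collapse()
print(f"generation 0 spread: {spreads[0]:.3f}")
print(f"generation 10 spread: {spreads[-1]:.3f}")
```

Each round, the truncation step clips roughly 5% of the data from the tails, so the standard deviation contracts by a constant factor and the distribution steadily forgets its rare events, a simplified stand-in for the "forgetting" the paper describes.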
