ChatGPT-like AI models running out of text to train on, claims UC Berkeley professor

AFP

Stuart Russell, an artificial intelligence expert and professor at the University of California, Berkeley, has raised concerns that AI-powered language models such as ChatGPT may be "running out of text in the universe" to train on.

He explained that the technology behind AI bots, which rely on vast amounts of text data, is "starting to hit a brick wall".

Russell shared this insight during an interview with the International Telecommunication Union, a UN communications agency, last week. He emphasised that there is a finite amount of digital text available for these language models to consume.

The implications of this text scarcity may influence the future practices of generative AI developers as they collect data and train their technologies.

However, he maintained his belief that AI will increasingly replace humans in various language-dependent jobs, which he referred to as "language in, language out" tasks during the interview. His comments contributed to the ongoing discussion surrounding the data acquisition practices of OpenAI and other developers of generative AI models.

Creators worried about their work being replicated without consent have raised concerns, as have social media executives dissatisfied with the unrestricted use of their platforms' data. Russell's observations drew attention to another potential vulnerability: the scarcity of text available for training these models.

A November study by Epoch, a group of AI researchers, found that machine learning datasets are likely to exhaust all "high-quality language data" before 2026. The study defined "high-quality" language data as originating from sources like "books, news articles, scientific papers, Wikipedia, and filtered web content".

Today's most popular generative AI tools, powered by large language models (LLMs), were trained on massive amounts of published text extracted from public online sources, including digital news platforms and social media websites. The practice of "data scraping" from the latter was a contributing factor behind Elon Musk's decision to limit daily tweet views, as he previously stated.

Russell highlighted in the interview that OpenAI, in particular, had to supplement its public language data with "private archive sources" to develop GPT-4, the company's most robust and advanced AI model to date. However, he acknowledged in an email to Insider that OpenAI has yet to disclose the exact training datasets used for GPT-4. Recent lawsuits filed against OpenAI allege the use of datasets containing personal data and copyrighted materials in training ChatGPT. Notably, a prominent lawsuit was filed by 16 unnamed plaintiffs, asserting that OpenAI utilised sensitive data like private conversations and medical records.

Another lawsuit, involving comedian Sarah Silverman and two additional authors, accused OpenAI of copyright infringement due to ChatGPT's capability to generate accurate summaries of their work. Authors Mona Awad and Paul Tremblay also filed a similar lawsuit against OpenAI in late June.
