Study Reveals OpenAI’s GPT-4o Trained on Copyrighted Data

Highlights

OpenAI's GPT-4o likely used copyrighted O'Reilly books.

Study highlights need for greater AI data transparency.

Implications stress ethical data sourcing in AI development.

Last updated: 2 April, 2025 - 12:09 pm

By Samantha Reed

A recent investigation by the AI Disclosures Project has found that OpenAI’s GPT-4o model was likely trained on copyrighted material from O’Reilly Media without proper authorization. The finding raises significant concerns about data sourcing practices in the development of advanced language models. The study also highlights potential legal and ethical ramifications for AI developers and content creators alike.

Contents

How Did the Study Determine Data Usage?

What Were the Key Findings?

What Are the Implications for AI Companies?

The research builds on previous examinations of data usage by AI companies and offers empirical evidence that non-public content was likely used as training data. Compared with earlier models, GPT-4o shows a stronger ability to recognize proprietary content, which suggests that newer AI systems have had greater exposure to restricted information. This development prompts a reevaluation of existing data acquisition protocols in the AI industry.

How Did the Study Determine Data Usage?

Researchers used a legally obtained dataset of 34 copyrighted O’Reilly Media books to test whether GPT-4o could distinguish original passages from paraphrased versions. Using the DE-COP membership inference attack, the study measured how reliably the model picked out the original text; consistent recognition suggests that the passages were present in its training data.
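The quiz-style test described above can be sketched roughly as follows. This is a minimal illustration, not the study's code: it assumes each question mixes one verbatim passage with three paraphrases, and `ask_model` is a hypothetical stand-in for a call to the model under test that returns the index of the option it believes is the original.

```python
import random
from typing import Callable, Sequence

def build_question(verbatim: str, paraphrases: Sequence[str],
                   rng: random.Random) -> tuple[list[str], int]:
    """Shuffle the verbatim passage in with its paraphrases.

    Returns the shuffled options and the index of the verbatim one.
    """
    options = [verbatim, *paraphrases]
    rng.shuffle(options)
    return options, options.index(verbatim)

def guess_rate(items: Sequence[tuple[str, Sequence[str]]],
               ask_model: Callable[[list[str]], int],
               seed: int = 0) -> float:
    """Fraction of questions where the model picks the verbatim passage."""
    rng = random.Random(seed)
    correct = 0
    for verbatim, paraphrases in items:
        options, answer = build_question(verbatim, paraphrases, rng)
        if ask_model(options) == answer:
            correct += 1
    return correct / len(items)

# Toy data (hypothetical, not from the study): originals are tagged "orig".
items = [
    ("orig: passage A", ["para A1", "para A2", "para A3"]),
    ("orig: passage B", ["para B1", "para B2", "para B3"]),
]

# Stub "model" that always recognizes the original, for demonstration only.
def oracle(options: list[str]) -> int:
    return next(i for i, text in enumerate(options) if text.startswith("orig"))

print(guess_rate(items, oracle))  # 1.0
```

With four options per question, a model that has never seen the text would be expected to score near 25%; a rate well above that on a given book is the kind of signal a membership inference test looks for.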

What Were the Key Findings?

The study found that GPT-4o achieved an AUROC score of 82% in recognizing paywalled O’Reilly content, substantially higher than the GPT-3.5 Turbo model, which scored just above 50%, roughly the level expected from random guessing. Additionally, GPT-4o recognized non-public material better than publicly accessible samples, which suggests that paywalled content was included in its training data.
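For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen "seen" item receives a higher score than a randomly chosen "unseen" item. The sketch below computes it directly from that definition. The scores are made-up illustrative numbers, not data from the study, and the function name `auroc` is chosen here for illustration.

```python
from typing import Sequence

def auroc(member_scores: Sequence[float],
          nonmember_scores: Sequence[float]) -> float:
    """Probability that a member outscores a non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical recognition scores for texts assumed to be in / not in training.
members = [0.9, 0.8, 0.4]
nonmembers = [0.5, 0.3, 0.2]

print(round(auroc(members, nonmembers), 3))  # 0.889
```

A value of 0.5 means the scores carry no information about membership, while 1.0 means perfect separation, which is why a score just above 50% is read as little or no evidence of exposure.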

What Are the Implications for AI Companies?

“AI companies must prioritize transparency in their data acquisition processes to ensure ethical standards are upheld,” the AI Disclosures Project emphasized. Unauthorized use of copyrighted data could lead to legal challenges and diminish trust in AI technologies. The study advocates for stronger accountability measures and enhanced disclosure practices to safeguard intellectual property rights.

While previous reports hinted at similar issues, this study provides empirical evidence specifically linking OpenAI’s GPT-4o with the unauthorized use of O’Reilly Media’s content. The findings suggest a broader, systemic issue within the AI sector regarding the sourcing of training data, necessitating comprehensive regulatory frameworks to address these challenges effectively.

Robust data licensing agreements and transparent training methodologies are essential for maintaining the integrity of AI development. Implementing the EU AI Act’s disclosure requirements could significantly improve accountability, ensuring that content creators are fairly compensated and informed about the use of their work in training models.

Balancing technological advancement with ethical data use will be crucial for the sustainable growth of AI. Companies must adopt responsible practices that foster innovation while respecting intellectual property rights, ultimately contributing to a more equitable digital ecosystem.


By Samantha Reed

Samantha Reed is a 40-year-old, New York-based technology and popular science editor with a degree in journalism. After beginning her career at various media outlets, her passion and area of expertise led her to a significant position at Newslinker. Specializing in tracking the latest developments in the world of technology and science, Samantha excels at presenting complex subjects in a clear and understandable manner to her readers. Through her work at Newslinker, she enlightens a knowledge-thirsty audience, highlighting the role of technology and science in our lives.
