AI Data Shortage Spurs Interest in Synthetic Alternatives

Highlights

Synthetic data may fill the gap in AI training data.

Risks include biases and lower-quality outputs.

Real-world data remains crucial for complex models.

Last updated: 2 August, 2024 - 3:26 pm

By Samantha Reed

A growing scarcity of quality data for training generative A.I. models has prompted a shift towards synthetic data alternatives. As digital publishers increasingly restrict public data access, future advancements of large language models like OpenAI’s GPT-4 and Google’s Gemini could stall. Synthetic data, generated by machines to imitate authentic human-created data, is being considered to fill this gap, albeit with potential pitfalls if not implemented correctly.

Applications of Synthetic Data in AI Training

Historically, synthetic data has been employed in various applications, including autonomous vehicle systems by companies like Waymo and Tesla. These firms use synthetic data to simulate diverse driving conditions. Today, experts see creative opportunities for synthetic data in training generative A.I. models. For example, OpenAI’s GPT-4 can generate synthetic datasets to fine-tune smaller, specialized models. This approach could enable more targeted advertising and enhance LLM performance in multilingual scenarios.

“Synthetic data plays a crucial role in enhancing our large language models,” said Jigyasa Grover, former machine learning engineer at X and current head of A.I. at Bordo AI. “By generating synthetic datasets, we can train LLMs on diverse scenarios, improving their generalization capabilities.”

Potential Benefits and Risks

Synthetic data also offers a way to navigate sensitive data issues in sectors like healthcare and finance. Hospitals can generate synthetic X-rays for training A.I. models to detect tumors, while governments can use synthetic data to study money laundering. Using synthetic data can help avoid intellectual property disputes, which are increasingly problematic for A.I. companies.

“Synthetic training data could clear a lot of these issues,” said Star Kashman, a tech litigation attorney. “It gets around the hurdle of unintentionally infringing upon other people’s work.”

Challenges with Synthetic Data

Despite its advantages, synthetic data comes with risks. A study published in Nature found that models trained repeatedly on data generated by earlier models produced progressively lower-quality outputs, a phenomenon known as “model collapse.” This could be due to the nascent stage of synthetic data techniques and a shortage of skilled engineers. Additionally, biased synthetic data could lead to legal liabilities if models generate discriminatory or inaccurate outputs.

“You can totally screw things up and make things worse,” warned Kjell Carlsson, head of AI strategy at Domino Data Lab.
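
To make the idea of model collapse concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not the method used in the Nature study: the "model" is just a normal distribution fitted to ten data points, and each new generation is trained only on samples drawn from the previous generation's fit. The function name `simulate_recursive_training` and all parameter values are hypothetical choices made for this example.

```python
import random
import statistics


def simulate_recursive_training(generations=200, sample_size=10, seed=0):
    """Toy model-collapse demo (hypothetical setup, for illustration only).

    Returns the fitted standard deviation at each generation, starting
    with the fit to the original "real" data.
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations + 1):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)
        # The next generation sees only synthetic samples from this fit.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


if __name__ == "__main__":
    spreads = simulate_recursive_training()
    print("spread of first fit:", spreads[0])
    print("spread of last fit: ", spreads[-1])
```

In this toy setup the fitted spread tends to shrink toward zero over many generations, meaning each round of synthetic data carries less of the variety found in the original sample. Real generative models are far more complex, so this should be read only as an analogy for the kind of degradation the study describes.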

Though synthetic data offers a promising way to address data shortages, real-world data remains invaluable. Many enterprises still have untapped data sources that could be utilized for training A.I. models. As such, while synthetic data can supplement real data, it is unlikely to replace it entirely in the near future.

“There’s actually so much data still that can be used to train these specialized models,” said Mayur Pillay, VP of corporate development at Hyperscience. “It’s just embedded at the core of the enterprise.”
