A growing scarcity of quality data for training generative A.I. models has prompted a shift toward synthetic data alternatives. As digital publishers increasingly restrict public data access, further advances in large language models like OpenAI’s GPT-4 and Google’s Gemini could stall. Synthetic data, generated by machines to imitate authentic human-created data, is being considered as a way to fill this gap, albeit with potential pitfalls if it is not implemented correctly.
Applications of Synthetic Data in A.I. Training
Historically, synthetic data has been employed in applications such as autonomous vehicle development at companies like Waymo and Tesla, which use it to simulate diverse driving conditions. Today, experts see creative opportunities for synthetic data in training generative A.I. models. For example, a large model such as OpenAI’s GPT-4 can generate synthetic datasets that are then used to fine-tune smaller, specialized models, an approach that could enable more targeted advertising and improve LLM performance in multilingual settings.
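As a rough sketch of what that might look like in code, the example below uses the OpenAI Python SDK to ask GPT-4 for labeled synthetic examples for a hypothetical support-ticket classifier and writes them to a JSONL file that a smaller model could be fine-tuned on. The prompt, labels, and file name are illustrative placeholders, not any company’s actual pipeline.

```python
# Illustrative sketch: use a large model (GPT-4 via the OpenAI Python SDK)
# to generate synthetic labeled examples, then save them as JSONL that a
# smaller, specialized model could be fine-tuned on. The task, prompt,
# labels, and file name are hypothetical placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["billing", "shipping", "returns"]  # example task: support-ticket routing

def generate_examples(label: str, n: int = 5) -> list[dict]:
    """Ask the large model for n synthetic customer messages for one label."""
    prompt = (
        f"Write {n} short, realistic customer-support messages about '{label}'. "
        "Return one message per line, with no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = [
        line.strip()
        for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]
    return [{"text": text, "label": label} for text in lines]

# Collect a small synthetic dataset and write it to disk for later fine-tuning.
dataset = [example for label in LABELS for example in generate_examples(label)]
with open("synthetic_train.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")
```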
“Synthetic data plays a crucial role in enhancing our large language models,” said Jigyasa Grover, former machine learning engineer at X and current head of A.I. at Bordo AI. “By generating synthetic datasets, we can train LLMs on diverse scenarios, improving their generalization capabilities.”
Potential Benefits and Risks
Synthetic data also offers a way to navigate sensitive data issues in sectors like healthcare and finance: hospitals can generate synthetic X-rays to train A.I. models to detect tumors, and governments can use synthetic data to study money laundering. It can also help companies sidestep intellectual property disputes, an increasingly pressing problem for A.I. developers.
“Synthetic training data could clear a lot of these issues,” said Star Kashman, a tech litigation attorney. “It gets around the hurdle of unintentionally infringing upon other people’s work.”
Challenges with Synthetic Data
Despite its advantages, synthetic data comes with risks. A study published in Nature found that models trained repeatedly on data generated by other models produced progressively lower-quality outputs, a degradation known as “model collapse.” This could be due in part to the nascent state of synthetic data techniques and a shortage of engineers skilled in using them. Additionally, biased synthetic data could expose companies to legal liability if their models generate discriminatory or inaccurate outputs.
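The mechanism behind model collapse can be illustrated with a toy simulation: fit a simple statistical model to data, sample new “synthetic” data from it, refit, and repeat. The sketch below uses NumPy and a one-dimensional Gaussian purely as an analogy; it is not a reproduction of the Nature study’s setup.

```python
# Toy illustration of "model collapse": repeatedly fit a simple model to data
# sampled from the previous generation's model. Over generations the fitted
# parameters drift away from the original distribution, and the variance tends
# to shrink, so the tails of the real data are progressively forgotten.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1_000)  # generation 0: "real" data

for generation in range(1, 6):
    mu, sigma = data.mean(), data.std()        # "train" a Gaussian on the current data
    data = rng.normal(mu, sigma, size=1_000)   # the next generation sees only samples
    print(f"gen {generation}: mean={mu:+.3f}, std={sigma:.3f}")
```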
“You can totally screw things up and make things worse,” warned Kjell Carlsson, head of A.I. strategy at Domino Data Lab.
Though synthetic data offers a promising way to address data shortages, real-world data remains invaluable. Many enterprises still have untapped data sources that could be utilized for training A.I. models. As such, while synthetic data can supplement real data, it is unlikely to replace it entirely in the near future.
“There’s actually so much data still that can be used to train these specialized models,” said Mayur Pillay, VP of corporate development at Hyperscience. “It’s just embedded at the core of the enterprise.”