Five ways IBM is using synthetic data to improve AI models | IBM Research Blog

Synthetic data are computer-generated examples that can augment or replace real data to speed up the training of AI models, protect sensitive data, improve accuracy, or find and mitigate bias and security weaknesses.

“We're entering an era in which our enemies can make anyone say anything at any point in time.” In this viral video from 2018, actor-writer Jordan Peele projected his voice into former President Obama’s moving lips. Peele’s PSA on ‘deepfakes,’ audio and video altered with the intent to mislead, was the first time many people heard of synthetic data.

It won’t be the last. Today, synthetic data are everywhere, driving some of AI’s most innovative applications. They’re commonly used to inject more variety into a training dataset to boost accuracy and reliability in AI model predictions. But fake data are also fodder for pretrained foundation models, allowing researchers to develop new AI applications faster, using fewer real-world examples that are expensive to gather and label.

Synthetic data can help fill in gaps when real data are in short supply. Health care records, financial data, and content on the web, are all increasingly restricted due to privacy and copyright protections. Collecting, labeling, and auditing real data is also expensive, and no matter how much you collect, they never fully capture the complexity of the physical world. This gap between reality and its representation creates vulnerabilities that can cause AIs to make mistakes a human never would. Real data also come with baked-in biases that can reinforce or amplify existing inequities.

Synthetic data for training AI are most often made, perhaps unsurprisingly, using AI. Generative models, the technology Peele used to mimic Obama, are one method. Prompt a generative model like GPT-3 or Dall-E with a sample, and it can draw a picture, write a poem, or narrate a speech in that style. Transformer-based foundation models like BERT are another. IBM recently used this technology to generate more expressive voices for its legacy customer service bots.

Video game graphics engines are a third way of creating synthetic data. IBM researchers recently used the ThreeDWorld virtual environment to create fake images for pretraining a vision model called Task2Sim. Fine-tuned on actual images, Task2Sim outperformed a model trained on real images alone for tasks like identifying skin cancer in medical scans and diseased crops in satellite images.

Fake data seem to also work well for training AIs to recognize human actions in video. A recent study showed that synthetically trained models performed even better than models trained on real data for videos with fewer background objects.

Demand for fake data to feed the AI pipeline is on the rise. By 2025, we’ll need 70% less real data for AI training, the research firm Gartner recently estimated. As synthetic data replace real data, infractions for the misuse of personal data will also drop, the firm predicts. “We believe that synthetic data is important for the future of AI because it solves one of the most pervasive and critical challenges that AI systems face today — the lack of domain-specific, well-labeled, high-volume data at a reasonable cost,” Gartner wrote.

Fake data make experimentation easier and, as a bonus, come automatically labeled. They can also be a tool for probing AI models for security holes or bias. Deployed as adversarial examples, synthetic data can show where an AI model is likely to make mistakes or unfair decisions.

“Synthetic data can be an indispensable testing tool for AI,” said Inkit Padhi, a researcher at IBM and expert on synthetic data. “They can help to make AI models more fair, accurate, and trustworthy.”

Here are five inventive ways that IBM is using synthetic data.

Studying nonsense before trying to learn Urdu, one of Pakistan's official languages, might sound ridiculous. But it’s how IBM researchers are tackling the moonshot of developing AI applications for less dominant languages.

Thousands of spoken languages have relatively few texts in machine-readable form, stalling the development of AI applications for those languages. In a paper spotlighted at ICLR this year, IBM researchers showed that pretraining a language model on a made-up language grounded in images could make it easier to master a low-resource language like Urdu.

“When humans learn to talk, they associate words with visual concepts,” said Yang Zhang, an IBM researcher with the MIT-IBM Watson AI Lab. “We try to mimic that idea here.”

Researchers used a generative model to create some 2 million symbolic “tokens” in a game pairing symbols with natural images. One algorithm receives a visual prompt — like an image of a bedroom — and outputs a sequence of numbers. A second algorithm compares the numbers to a set of images and picks the image that seems like the best match. Eventually, an emergent language arises from these image-grounded symbols.
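The speaker-listener game above can be sketched in miniature. This toy is illustrative only, not the paper's actual model: the speaker quantizes an image's feature vector into discrete tokens, and the listener decodes the message and picks the candidate image whose features are closest.

```python
# Toy image-grounded "emergent language" game (illustrative sketch,
# not the published model). Images are stand-in feature vectors.

def speaker(features, n_levels=8):
    """Quantize each feature in [0, 1) into one of n_levels tokens."""
    return [min(int(f * n_levels), n_levels - 1) for f in features]

def decode(tokens, n_levels=8):
    """Map tokens back to approximate feature values (bin centers)."""
    return [(t + 0.5) / n_levels for t in tokens]

def listener(tokens, candidates, n_levels=8):
    """Pick the candidate image whose features best match the message."""
    target = decode(tokens, n_levels)
    def dist(feats):
        return sum((a - b) ** 2 for a, b in zip(target, feats))
    return min(range(len(candidates)), key=lambda i: dist(candidates[i]))

# Toy "images" represented by 3-dimensional feature vectors.
images = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.7], [0.5, 0.5, 0.5]]
message = speaker(images[1])          # describe image 1 in tokens
guess = listener(message, images)     # listener recovers the index
print(message, guess)                 # → [6, 0, 5] 1
```

In the real system both sides are learned networks and the token vocabulary emerges from training rather than from fixed quantization, but the communicative loop has the same shape.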

Trained on this prototype language, the model was fine-tuned on labeled text in Urdu, Basque, Persian and seven other languages. In the end, the model performed nearly as well on a fill-in-the-blank fluency test as a model pretrained on Spanish, the researchers found. They hypothesize that no matter what language we speak, our visual world is largely the same, creating a common foundation for natural language.

“It’s easier to learn German if you know English, but that’s not the case with non-Indo-European languages like Niger-Congo or Trans-New Guinea," said Chuang Gan, an IBM researcher with the MIT-IBM Watson AI Lab. “Teaching the model an emergent language first can make it easier to learn non-Indo-European languages, while avoiding some of the cultural biases that come with pretraining on a Western language.”

Show an AI model enough Impressionist art and it can learn to paint in that style. But try to design a windshield wiper that way, and it’s almost certain to fail. 

Most moving machines require a linkage mechanism that transfers motion or force from one part to another. Think of the wipers that clear rain and snow from your car windshield: A motor rotates an arm connected to links that move each wiper.

"When you create an image using AI, you can get two pixels wrong, and it doesn't matter," said Faez Ahmed, a mechanical engineering professor at MIT. "But if you're designing a mechanical system, a small change may lead the whole thing to fail. This makes the problem exceedingly challenging."

Most linkage systems today are designed manually because of the high level of precision needed. Using a computer-aided design (CAD) program, engineers move around the joints and bars of a mechanism until they hit on one that can produce the desired movement.

Led by Ahmed at MIT and Akash Srivastava at IBM, researchers want to turn this process on its head. Give the AI a goal, and let it propose a linkage system that can produce the desired movement.

In a recent breakthrough, researchers created an AI-generated dataset of 100 million mechanisms, nearly 1,000 times larger than the next biggest archive of 30,000 mechanisms. The dataset also features mechanisms with up to 20 joints — far more complex than a human could ever dream up.

As linkage systems grow in complexity, they become less and less likely to work, a principle that also applies to AI-generated mechanisms. To create a dataset with 100 million functioning mechanisms, the researchers ran billions of simulations and threw out most of their designs. They were able to run that many simulations, they said, only after figuring out how to speed up the process by 800 times.
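The generate-and-filter idea can be shown with a classic feasibility test. This is a toy stand-in for the authors' pipeline: it samples random four-bar linkages and keeps only those satisfying Grashof's condition (the shortest and longest bars together no longer than the other two), which guarantees at least one bar can fully rotate.

```python
# Toy generate-and-filter dataset creation (illustrative, not the
# authors' simulator): sample random four-bar linkages and keep only
# those passing Grashof's condition, s + l <= p + q.

import random

def is_grashof(bars):
    """Check whether the shortest bar of a four-bar linkage can
    rotate fully: shortest + longest <= sum of the other two."""
    s, p, q, l = sorted(bars)
    return s + l <= p + q

random.seed(0)
candidates = [[random.uniform(1, 10) for _ in range(4)]
              for _ in range(10_000)]
working = [bars for bars in candidates if is_grashof(bars)]
print(f"kept {len(working)} of {len(candidates)} sampled linkages")
```

As in the researchers' dataset, most random designs fail the filter, which is why generating 100 million working mechanisms required billions of simulations.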

They next plan to expand their dataset from 2D planar mechanisms to sliders, cams and gears. “Designing machines using probabilistic generative modeling rather than traditional optimization techniques has the potential to bring more creativity and efficiency into the design process,” said Srivastava, an IBM researcher at the MIT-IBM Watson AI Lab. “I’m excited to see what this AI can help us achieve.”

You can design a mechanism of your own with a demo from Ahmed’s lab.

As children, we learn language with all our senses. The more associations, the easier it is to remember new words. Researchers draw on this principle with Valhalla, an AI model that uses fake images to improve machine translation.

Feed the model a sentence in English, and Valhalla draws a visual representation, using a Dall-E-like transformer. It then extrapolates from the ‘hallucinated’ picture to translate from English to, say, French. “Imagining objects or scenes in our mind’s eye improves their memorability,” said Rameswar Panda, an IBM researcher with the MIT-IBM Watson AI Lab. “We thought machines might be similar.”

Researchers trained their translation model on pairs of sentences in the source and target language, matched with their pictorial representation. Give the model a sentence in one language, and it learns to generate a picture, then use it to predict how the sentence should read in the target language. The team showed that their method produced more accurate translations than a model trained on text alone. It could also handle longer sentences, under-resourced languages, and sentences with missing words.
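The two-stage inference path can be sketched schematically. Everything here is a stub: the real Valhalla system uses trained transformers, while these function names and token mappings are purely illustrative of how the hallucinated visual stream feeds the translator.

```python
# Schematic of a Valhalla-style two-stage pipeline (all models
# stubbed out; names and mappings are illustrative only).

def hallucinate_image(src_tokens):
    """Stand-in for the autoregressive image generator: map source
    tokens to a sequence of discrete visual tokens."""
    return [hash(t) % 512 for t in src_tokens]

def translate(src_tokens, visual_tokens):
    """Stand-in for the multimodal translator, which conditions on
    both the text and the hallucinated visual tokens."""
    # A real decoder would attend over both streams jointly; this
    # stub only shows that both inputs reach the translation step.
    assert len(visual_tokens) == len(src_tokens)
    return [f"fr({t})" for t in src_tokens]

source = ["the", "cat", "sleeps"]
visual = hallucinate_image(source)   # the model's "mind's eye"
target = translate(source, visual)   # translation conditioned on it
print(target)
```

The key design point is that at inference time no real image is needed: the visual representation is generated from the source text itself.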

AI models that ace performance benchmarks in the lab are often highly sensitive to adversarial examples — images and text that have been subtly altered to trigger mistakes. Using publicly available data, IBM researchers recently built a tool to fabricate quote tweets on Twitter to test the robustness of stock prediction models that trawl social media for tips.

The tool selects the tweet of a CEO or other influencer and finds a word in their tweet deemed most likely to flip the stock prediction model. The tool then swaps that word with one that’s semantically similar when it quote-tweets the CEO’s original post.

The substitute word is unlikely to raise any red flags because of its similar meaning, but it’s enough to trigger the stock prediction model to reverse its prediction. After ingesting the fake tweet, a stock picker that might have predicted that a stock price was falling and suggested that investors sell, might reverse its decision, and instead nudge the investor to buy. 
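The attack can be illustrated with a deliberately simple model. The real tool uses learned embeddings and a trained stock predictor; the keyword model and synonym table below are toy stand-ins that show how a near-synonym swap flips a prediction.

```python
# Toy word-swap attack (illustrative only; the real tool searches
# learned embeddings against a trained prediction model).

SYNONYMS = {"growth": "expansion", "strong": "solid", "record": "peak"}

def toy_model(tweet):
    """Stand-in predictor: says 'buy' when it sees enough
    bullish keywords, otherwise 'sell'."""
    bullish = {"growth", "strong", "record"}
    score = sum(w in bullish for w in tweet.lower().split())
    return "buy" if score >= 2 else "sell"

def attack(tweet):
    """Swap one influential word for a near-synonym the model
    doesn't recognize, flipping its prediction."""
    for word, swap in SYNONYMS.items():
        altered = tweet.replace(word, swap)
        if altered != tweet and toy_model(altered) != toy_model(tweet):
            return altered
    return tweet

original = "record growth this quarter"
adversarial = attack(original)
print(toy_model(original), "->", toy_model(adversarial))  # buy -> sell
```

To a human reader the altered tweet means the same thing, which is exactly what makes this class of attack hard to spot.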

“If you want to manipulate stock prices, you don’t need access to an investor’s model or data,” said IBM researcher Dakuo Wang. “You just create a few hundred fake Twitter accounts, pretend to be an investor, and change a word or two when quote tweeting the CEO.”

Language models are sometimes used to scan things like news articles and earnings reports to quickly label their emotional tone as positive or negative. This type of shorthand sentiment analysis is useful for a variety of applications, from investing to running your fantasy football team. But sentiment classifiers can produce biased or misleading results when trained on text with implicit racist, sexist, or ageist assumptions.

In a 2021 paper at AAAI, IBM researchers introduced a tool for creating synthetic text to reduce bias in language classification models. It works by generating a counterfactual conditioned on the class you want to test — a topic, tense, or sentiment — to flip the model's decision.

Take the statement: “my boss is a man.” The tool generates a hypothetical statement with the gender reversed: “my boss is a woman.” Such a minor change shouldn’t cause a classifier to change its “positive” sentiment-rating to “negative,” but in this case it does. To mitigate the bias, the model could be retrained on a dataset augmented with counterfactuals, said IBM’s Padhi, to teach it that the statements are equivalent and should be classified similarly.
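The auditing step can be sketched with a rule-based swap. The paper's tool uses a controlled text generator rather than a word list; the swap table and the deliberately biased classifier below are illustrative assumptions, showing how disagreeing counterfactual pairs are flagged as candidates for retraining data.

```python
# Minimal counterfactual-probing sketch (illustrative; the published
# tool generates counterfactuals with a controlled text generator).

SWAPS = {"man": "woman", "woman": "man", "he": "she", "she": "he"}

def counterfactual(sentence):
    """Flip gendered words to produce a minimally changed sentence."""
    return " ".join(SWAPS.get(w, w) for w in sentence.lower().split())

def audit(classifier, sentences):
    """Return sentence pairs on which the classifier disagrees --
    evidence of bias, and candidates for augmented retraining."""
    flagged = []
    for s in sentences:
        cf = counterfactual(s)
        if classifier(s) != classifier(cf):
            flagged.append((s, cf))
    return flagged

def biased_clf(sentence):
    """A deliberately biased toy classifier, for demonstration."""
    return "negative" if "woman" in sentence else "positive"

pairs = audit(biased_clf, ["my boss is a man"])
print(pairs)  # the flagged pair, ready for retraining data
```

Retraining on a dataset augmented with such flagged pairs, labeled identically, is what teaches the model that the two statements should be classified the same way.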

“Real-world data are rarely free of complications,” he said. “Synthetic data offer a way to probe AI models for problems and correct them in order to make them more fair, robust, and easier to transfer to other tasks.”
