Closing the gap between real and synthetic data

Despite the massive opportunities that synthetic data brings to the table, one of the main challenges it faces is the reality gap.

Synthetic Data

Synthetic data was listed among the five biggest data science trends for 2022, and Gartner named it among its top strategic predictions for this year. In a world that is highly driven by data, privacy and process issues often limit the kind of data researchers require. A promising way out is artificially generated, or synthetic, data.

Various algorithms and tools are used to generate synthetic data, which is then put to use in a wide variety of applications. When used properly, synthetic data can be a good complement to human-annotated data while keeping a project's speed and cost under control.

Yet a neural network can often tell the difference between simulation and reality. This domain gap, also referred to as the uncanny valley, limits the real-world performance of machine learning models trained only in simulation. Closing the gap is an important research and practical challenge for the effective use of synthetic data.

Domain randomisation

Real-world data often contains a large amount of variability. To reproduce this variability in synthetic data generation, researchers are increasingly relying on domain randomisation. In computer vision applications specifically, domain randomisation randomises parameters such as lighting, pose and object textures.
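
To make this concrete, the sketch below draws a fresh set of scene parameters for every synthetic image. The parameter names and ranges are illustrative assumptions, not taken from any particular pipeline; a real setup would feed each sampled configuration to a renderer such as Blender or Unity.

```python
import random

def sample_scene_params():
    """Draw one random scene configuration for a synthetic render.

    All names and ranges here are illustrative; a real pipeline would
    pass this dictionary to a renderer (Blender, Unity, etc.).
    """
    return {
        # Lighting: a random number of lights with random position/intensity.
        "lights": [
            {"position": [random.uniform(-5, 5) for _ in range(3)],
             "intensity": random.uniform(0.2, 3.0)}
            for _ in range(random.randint(1, 4))
        ],
        # Object pose: random translation and rotation (Euler angles, degrees).
        "object_pose": {
            "translation": [random.uniform(-1, 1) for _ in range(3)],
            "rotation_deg": [random.uniform(0, 360) for _ in range(3)],
        },
        # Texture: a random texture id plus a random hue shift.
        "texture_id": random.randrange(1000),
        "hue_shift": random.uniform(-0.5, 0.5),
        # Camera: random field of view and distance from the object.
        "camera": {"fov_deg": random.uniform(40, 90),
                   "distance": random.uniform(1.0, 4.0)},
    }

if __name__ == "__main__":
    # Every synthetic training image gets its own independently sampled scene.
    for i in range(3):
        print(f"scene {i}:", sample_scene_params())
```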

Domain randomisation has been viewed as an alternative to high-fidelity synthetic images. The technique was first introduced by Josh Tobin and his team in a paper titled "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World". In this paper, the researchers described domain randomisation as a promising method for addressing the reality gap: the simulator is randomised to expose the model to a wide range of environments at training time rather than just one. The team worked on the hypothesis that if the variability in simulation is large enough, models trained in simulation will generalise to the real world with no additional training.

In 2018, researchers from NVIDIA presented a domain randomisation approach to training a neural network on complex tasks such as object detection. In this technique, synthetic images were randomly perturbed during training, forcing the network to focus on the relevant features of the task. The results were found to be comparable with those from more expensive, labour-intensive datasets; the team demonstrated that domain randomisation can outperform more photorealistic datasets and, combined with real data, improve on results obtained using real data alone.
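
The NVIDIA perturbations are applied at render time (random lighting, random textures, flying distractors and so on). A rough image-space approximation can nevertheless be sketched with standard torchvision transforms; the specific ranges below are illustrative assumptions, not values from the paper.

```python
import torch
from torchvision import transforms

# Random perturbations applied to each synthetic image at training time.
# These image-space transforms only approximate render-time randomisation.
perturb = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5,
                           saturation=0.5, hue=0.1),            # lighting/colour
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),  # pose jitter
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # random blur
    transforms.ToTensor(),
    # Additive Gaussian pixel noise as a crude stand-in for sensor noise.
    transforms.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1)),
])

# Usage: pass `perturb` as the transform of any image dataset, e.g.
# torchvision.datasets.ImageFolder("synthetic/", transform=perturb),
# where "synthetic/" is a hypothetical folder of rendered images.
```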

A refinement of domain randomisation is structured domain randomisation, which takes the structure and context of a scene into account. Unlike domain randomisation, which places objects and distractors randomly according to a uniform probability distribution, structured domain randomisation places them according to probability distributions that reflect the specific problem at hand. This helps neural networks take context into consideration during detection tasks.
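
A toy sketch of the difference, for a hypothetical driving scene: plain randomisation scatters cars uniformly over the ground plane, while structured randomisation samples them along lanes with plausible spacing. The lane geometry and headway figures below are invented purely for illustration.

```python
import random

def uniform_placement(n_objects, extent=50.0):
    """Plain domain randomisation: cars can land anywhere in the scene."""
    return [(random.uniform(-extent, extent), random.uniform(-extent, extent))
            for _ in range(n_objects)]

def structured_placement(n_lanes=2, lane_width=3.5, extent=50.0):
    """Structured domain randomisation (illustrative): cars are sampled
    along lanes, with gaps drawn from a plausible headway distribution,
    so the generated scene keeps its road-like context."""
    cars = []
    for lane in range(n_lanes):
        # Lateral position: centred on the lane, with small Gaussian jitter.
        x = (lane + 0.5) * lane_width + random.gauss(0, 0.3)
        y = -extent
        while True:
            y += random.expovariate(1 / 12.0)  # ~12 m mean gap between cars
            if y >= extent:
                break
            cars.append((x, y))
    return cars

if __name__ == "__main__":
    print("uniform:   ", uniform_placement(5))
    print("structured:", structured_placement())
```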

Domain adaptation

Despite the popularity of domain randomisation, the technique requires a domain expert to define which parts of the scene must stay invariant. Conversely, increasing photorealism requires an artist to model the specific domain in detail, which increases the cost of generating data. Either way, the exercise undermines cost-effectiveness, a major selling point of synthetic data.

Enter domain adaptation.

Domain adaptation is an approach that enables a model trained on one domain of data to work well on a different target domain. One of the most popular domain adaptation techniques is the use of GANs. Conditional GANs, in particular, take additional inputs that condition the generated output, and image-conditional GANs form a general-purpose framework for image-to-image translation problems. The conditional GAN was proposed in late 2014. In this technique, the GAN architecture is modified by adding a label y as an extra input to the generator, so that it generates data points corresponding to that label; the same label is also added to the discriminator's input so that it can better distinguish real data for each class.
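
A minimal PyTorch sketch of this conditioning, with illustrative layer sizes (e.g. 784-dimensional data as in flattened MNIST): the one-hot label y is concatenated with the generator's noise vector and with the discriminator's data input.

```python
import torch
import torch.nn as nn

NOISE_DIM, N_CLASSES, DATA_DIM = 64, 10, 784  # illustrative sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # The one-hot label y is concatenated with the noise vector z,
        # conditioning the generator on the class it must produce.
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + N_CLASSES, 256), nn.ReLU(),
            nn.Linear(256, DATA_DIM), nn.Tanh(),
        )

    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # The same label is appended to the discriminator's input, so it
        # judges whether x looks real *for that particular label*.
        self.net = nn.Sequential(
            nn.Linear(DATA_DIM + N_CLASSES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, y_onehot):
        return self.net(torch.cat([x, y_onehot], dim=1))

# One conditioned sample: ask the generator for class 3 of 10.
z = torch.randn(1, NOISE_DIM)
y = nn.functional.one_hot(torch.tensor([3]), N_CLASSES).float()
fake = Generator()(z, y)
score = Discriminator()(fake, y)
```

Concatenation is the simplest way to inject the condition; later conditional architectures use embeddings or projection layers instead, but the principle of pairing each sample with its label is the same.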

Source: https://analyticsindiamag.com/closing-the-gap-between-real-and-synthetic-data/