The Synthetic Data Metrics help compare individual aspects of synthetic data vis-à-vis real data
DataCebo, an AI startup spun out of MIT, has introduced Synthetic Data Metrics, a new tool that lets companies compare synthetic data with real-world data sets. Synthetic Data Metrics is an open-source Python library that defines metrics for synthetic data and works across different types of tabular data to evaluate data statistics, efficiency, and privacy.
“For tabular synthetic data, you need to create metrics that quantify how the synthetic data compares to the actual data. Each metric measures a specific aspect of the data – such as coverage or correlation – allowing you to identify which specific elements have been preserved or forgotten in the synthetic data process,” said Neha Patki, co-founder of DataCebo. In other words, the metrics show whether individual aspects of the real data have been preserved or overlooked in the synthetic data.
The tool comes with synthetic data evaluation features such as Category Coverage and Range Coverage, which estimate how well the synthetic data covers the categories and value ranges found in the real data. The Correlation Similarity metric allows software developers and data scientists to compare the correlations found in the synthetic data with those in the real data. According to company sources, there are more than 30 metrics overall, and some are still in development.
The tool was developed primarily to protect data privacy, as most cases result in data being compromised. On the back end, the company says it uses several graphical and deep learning techniques, such as CTGAN, DeepEcho and Copulas. SDM is part of the MIT Data to AI lab’s SDV (Synthetic Data Vault) project, which was started in 2016 and handed over to DataCebo in 2020. The Vault is a mechanism for generating synthetic data and was started to assist companies with creating data models in-situ.
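To make the coverage and correlation metrics mentioned above more concrete, here is a minimal sketch written in plain pandas. It is not the library's own code or API: the function names, the toy columns (`plan`, `age`, `income`) and the exact formulas are assumptions chosen for illustration, and the library's definitions may differ in detail.

```python
import pandas as pd


def category_coverage(real: pd.Series, synthetic: pd.Series) -> float:
    """Fraction of the real column's categories that also appear in the synthetic column."""
    real_categories = set(real.dropna().unique())
    if not real_categories:
        raise ValueError("real column has no categories")
    synthetic_categories = set(synthetic.dropna().unique())
    return len(real_categories & synthetic_categories) / len(real_categories)


def range_coverage(real: pd.Series, synthetic: pd.Series) -> float:
    """Fraction of the real column's [min, max] range spanned by the synthetic column."""
    real_min, real_max = real.min(), real.max()
    if real_max == real_min:
        raise ValueError("real column has zero range")
    overlap = min(real_max, synthetic.max()) - max(real_min, synthetic.min())
    return max(overlap, 0) / (real_max - real_min)


def correlation_similarity(real: pd.DataFrame, synthetic: pd.DataFrame,
                           col_a: str, col_b: str) -> float:
    """1.0 when both tables show the same Pearson correlation, 0.0 when they are opposite."""
    real_corr = real[col_a].corr(real[col_b])
    synthetic_corr = synthetic[col_a].corr(synthetic[col_b])
    return 1 - abs(real_corr - synthetic_corr) / 2


# Hypothetical toy tables, invented for this example only.
real = pd.DataFrame({
    "plan": ["free", "pro", "team", "pro"],
    "age": [20, 30, 40, 60],
    "income": [40, 60, 80, 120],      # rises with age
})
synthetic = pd.DataFrame({
    "plan": ["free", "pro", "pro", "free"],   # "team" is never generated
    "age": [30, 35, 40, 50],                  # narrower than the real range
    "income": [140, 130, 120, 100],           # falls with age
})

print("category coverage:", category_coverage(real["plan"], synthetic["plan"]))
print("range coverage:", range_coverage(real["age"], synthetic["age"]))
print("correlation similarity:", correlation_similarity(real, synthetic, "age", "income"))
```

Each function returns a score between 0 and 1, so a low value points to one specific aspect of the real data, such as a missing category or a reversed relationship between two columns, that the synthetic data failed to preserve.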
The developers are of the opinion that although much is being done around synthetic data, particularly in areas like self-driving cars and image processing, enterprises are the least equipped to use it.
Source: https://www.analyticsinsight.net/mit-startup-introduces-synthetic-data-metric-to-evaluate-synthetic-data/