AI enables researchers to weave ‘synthetic data’ from sets of patient data. But CIOs question whether the generated data sufficiently captures the medical variables useful in research.
Johnson & Johnson employees collaborate at the company’s Irvine, Calif., lab. The company sees potential in using synthetic data for research, but said the technology still needs to mature. PHOTO: JOHNSON & JOHNSON
Healthcare companies have been mesmerized by the possibilities of “synthetic data”—data built out by applying artificial intelligence algorithms to real data sets. But ongoing technology challenges continue to limit widespread industry adoption, with many companies holding back on using it.
For years, drug and health researchers have been experimenting with the technology, which enables them to more freely analyze, for example, the impact of a drug on a given subpopulation, without the typical privacy and regulatory hurdles. A 2021 Gartner study estimated that by next year, 60% of the data used broadly for the development of AI and analytics projects would be synthetically generated.
Reality is nowhere near that, clarified Arun Chandrasekaran, an analyst at the IT research and consulting firm Gartner. In some areas, he said, the generated data has made headway, for instance, as images for training self-driving cars. But in health and drug research, where synthetic data could be particularly useful in generating medical records, adoption remains low.
The technology’s steep cost and the small number of vendors have been a drag on its uptake. But the far larger problem, healthcare companies say, is ensuring that synthetic data accurately represents the target population; in other words, that it is more like real data.
“The complexity and the variability in healthcare and science makes it a really hard problem to solve,” said Jim Swanson, chief information officer of Johnson & Johnson.
One area where Swanson sees promise is in analyzing the long-term impacts and effectiveness of medicines already on the market. Currently the company does this with de-identified patient data—data from which identifiers have been removed or changed, but which could be linked back to the person by other details. (Anonymized data in theory strips out all identifiable information.) Synthetic data could create much larger data sets, including in areas with stringent data curbs, the company said.
A NewYork-Presbyterian hospital in the Wall Street area. PHOTO: ZUMA PRESS
But creating a representative data set is hard, given the many relevant variables in patients: how many medications they are on, whether they smoke, whether they need a hip replacement, among many others, Swanson said. And those variables can change as new scientific discoveries emerge, he added. At the same time, it’s critical that the mix and makeup of variables in the original data be fairly captured in the synthetic data in order to run an accurate analysis.
“You can create synthetic data easily enough, but is it correlated enough to give you a specific and an accurate example?” Swanson said. “That’s the problem you have to solve.”
When creating synthetic data, there’s a trade-off between accuracy and privacy, said Lalana Kagal, principal research scientist at the MIT Computer Science and Artificial Intelligence Lab. Typically, synthetic data is created by running real data through an AI algorithm that re-creates it in a form that is similar, but not identical. The closer the synthetic data is to the original source data, the more accurate it is—but it is also more likely to leak the original data. In addition, it’s unclear exactly how similar synthetic data must be to the source data to be subject to HIPAA laws, she said. The Health Insurance Portability and Accountability Act shields health records.
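The trade-off Kagal describes can be illustrated with a toy sketch. It does not reflect any vendor's or researcher's actual method: the "real" records are made-up pairs of correlated numbers, and the hypothetical `synthesize` function simply blurs each record with random noise rather than using a trained AI model. The helper names (`make_real`, `synthesize`, `mean_nearest_distance`, `evaluate`) and the noise levels are assumptions chosen for illustration. The sketch reports two things for each noise level: how far the synthetic data's correlation drifts from the real one (an accuracy proxy), and how close synthetic records sit to real ones (a rough leakage proxy).

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def make_real(n, rng):
    """Made-up 'real' records: two standardized, correlated variables."""
    records = []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = 0.8 * a + 0.6 * rng.gauss(0, 1)  # built to correlate with a
        records.append((a, b))
    return records


def synthesize(real, noise, rng):
    """Toy generator (an assumption): blur each real record with Gaussian noise."""
    return [(a + rng.gauss(0, noise), b + rng.gauss(0, noise)) for a, b in real]


def mean_nearest_distance(synthetic, real):
    """Average distance from each synthetic record to its closest real record."""
    total = sum(min(math.dist(s, r) for r in real) for s in synthetic)
    return total / len(synthetic)


def evaluate(noise, seed=0, n=300):
    """Return (correlation gap, mean nearest-record distance) for a noise level."""
    rng = random.Random(seed)
    real = make_real(n, rng)
    synthetic = synthesize(real, noise, rng)
    real_corr = pearson(*zip(*real))
    synth_corr = pearson(*zip(*synthetic))
    return abs(real_corr - synth_corr), mean_nearest_distance(synthetic, real)


for noise in (0.05, 0.5, 2.0):
    corr_gap, distance = evaluate(noise)
    print(f"noise={noise}: correlation gap={corr_gap:.3f}, "
          f"nearest-record distance={distance:.3f}")
```

In this toy setup, low noise keeps the correlation between the two variables nearly intact but leaves each synthetic record sitting almost on top of a real one, while high noise pushes records away from their sources and washes out the correlation. Real generators are far more sophisticated, but they face the same tension.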
It’s possible new techniques could be developed to ensure the original data stays private without sacrificing the accuracy of the synthetic data, Kagal said. In the meantime, some companies are hanging back.
In 2021, genomics company Illumina published a promising case study on the use of synthetic data in genomics with technology vendor Gretel. More recently, however, Illumina said synthetic data isn’t a priority in research and development at the company.
At NewYork-Presbyterian, Peter Fleischut, chief information and transformation officer, said he’s more focused on ensuring there are strong enough cybersecurity and privacy systems in place to use real data. He is following developments around synthetic data, he said, but it’s not something the medical center has experimented much with.
“If we’re creating a heart-failure algorithm, we really think that those algorithms should be based on actual data and patients that represent the patients that we serve,” Fleischut said. With synthetic data, “I have not yet been convinced that it’s truly representative of the patients we serve.”
Yet another difficulty is the fledgling vendor market, said Gartner’s Chandrasekaran. Gartner is tracking just over two dozen vendors, but most are startups, founded in the last four to five years. The major cloud providers, which businesses are generally more comfortable working with, have largely stayed out of the market, he said. They might be tempted in once there’s more demand, but it’s hard to generate that demand without them already in the market, he said, calling it “sort of a chicken and egg.”
The health sector’s reticence might be overcome as the technology matures.
Swanson said, “We’re excited about its potential.”
Write to Isabelle Bousquette at isabelle.bousquette@wsj.com
Copyright ©2023 Dow Jones & Company, Inc. All Rights Reserved.
Source: https://www.wsj.com/articles/ai-generated-data-could-be-a-boon-for-healthcareif-only-it-seemed-more-real-5bfe52dd