Real-World Data for Healthcare Research in China: Call for Actions

Highlights

Many new data sources have emerged in response to the increasing demand for real-world evidence to support healthcare decision making in China. Nevertheless, public information regarding these data sources is limited. The existing healthcare real-world data sources have not been systematically evaluated for the purposes of epidemiology, health economics, and outcomes research.

Through a comprehensive evaluation of the major secondary healthcare databases in China, the current study found that despite the large number of healthcare databases, few could be directly used in epidemiology, health economics, and outcomes research due to limited data access and suboptimal data quality (eg, lack of longitudinal follow-up). At the same time, these databases generally have large sample sizes and have the potential to expand the data through chart reviews or prospective data collection, which presents opportunities to further improve real-world data in China.

Researchers and sponsors should carefully evaluate the feasibility of using a secondary database for their research. To efficiently use these databases to generate real-world evidence, 2 fundamental issues need to be addressed—data quality and data access.

Abstract

Objectives

This study aimed to provide an overview of major data sources in China that can be potentially used for epidemiology, health economics, and outcomes research; compare them with similar data sources in other countries; and discuss future directions of healthcare data development in China.

Methods

The study was conducted in 2 phases. First, various data sources were identified through a targeted literature review and recommendations by experts. Second, an in-depth assessment was conducted to evaluate the strengths and limitations of administrative claims and electronic health record data, which were further compared with similar data sources in developed countries.

Results

Secondary databases, including administrative claims and electronic health records, are the major types of real-world data in China. There are substantial variations in available data elements even within the same type of databases. Compared with similar databases in developed countries, the secondary databases in China have some general limitations such as variations in data quality, unclear data usage mechanism, and lack of longitudinal follow-up data. In contrast, the large sample size and the potential to collect additional data based on research needs present opportunities to further improve real-world data in China.

Conclusions

Although healthcare data have expanded substantially in China, high-quality real-world evidence that can be used to facilitate decision making remains limited in China. To support the generation of real-world evidence, 2 fundamental issues in existing databases need to be addressed—data access/sharing and data quality.

Keywords

administrative claims, data access, electronic health records, real-world data

Introduction

Real-world evidence (RWE) has been increasingly used in healthcare decision making. For example, the Food and Drug Administration has published a series of guidance documents on using real-world data (RWD) and RWE to support regulatory decision making for drugs and medical devices.¹,²,³ As another example, the European Medicines Agency has also published related documents on RWD and RWE, though as unofficial guidance.⁴,⁵ Similarly, the China Center for Drug Evaluation under the National Medical Products Administration (NMPA) also published a draft document related to RWE entitled “Key Considerations in Using Real-World Evidence to Support Drug Development (Draft for Public Review)” in May 2019.⁶ Beyond regulatory agencies, there has been an increasing demand for RWD by various stakeholders in healthcare sectors, including health technology assessment agencies, payers, clinicians, and the pharmaceutical industry. Although we have seen an increasing demand for RWD worldwide, such a trend is particularly evident in China, the second largest pharmaceutical market worldwide.⁷

The healthcare system in China has experienced substantial changes in the past decades. Similar to most European countries, China provides national health insurance.⁸,⁹ The current multi-level health insurance system includes the Urban Employee Basic Medical Insurance (UEBMI)¹⁰ for employed urban residents, the Urban Resident Basic Medical Insurance (URBMI)¹¹ for unemployed urban residents, and the New Rural Cooperative Medical System (NRCMS)¹² for rural residents. In addition, commercial health insurance plans can be purchased to supplement the national health plans. UEBMI and URBMI are managed by the National Healthcare Security Administration (NHSA).¹³ They cover both inpatient and outpatient care and provide drug benefits, which were defined by the National Reimbursement Drug List (NRDL).¹⁴ Compared with insurance plans for urban residents, the NRCMS is a relatively new insurance plan that was launched among rural residents in 2003.¹² It was managed by the National Health Commission before 2016, after which it was managed by the NHSA. With much lower premiums, it has limited coverage compared with the other 2 types of insurance. On the supply side, China’s healthcare system is primarily based on public hospitals owned by the city/province. With digital health, the way public hospitals deliver and document patient care has changed substantially. Most urban hospitals have electronic health record (EHR) systems. In some hospitals, mobile applications are used to follow up with patients and provide online consultations. These platforms have generated a significant amount of medical big data. At the same time, the healthcare services in China have also advanced substantially with many innovative treatments and clinical pathways consistent with western countries.
In recent years, the Chinese government made substantial efforts to improve the accessibility of innovative drugs to the Chinese population.¹⁵ The Center for Drug Evaluation at the NMPA implemented a series of new policies to simplify review and streamline data requirements for novel treatments, especially in disease areas with high unmet needs.¹⁶ The newly formed agency, NHSA, also evaluated newly approved drugs in a timely manner to update the NRDL annually.¹⁷,¹⁸

The series of reforms in healthcare have catalyzed the demand for RWE in China. For example, to expedite the access to innovative drugs by Chinese patients, the NMPA allows submissions and approval of novel therapies based on global clinical trials instead of trials among the Chinese population. This means that RWE is required to support unmet needs in relevant disease areas and provide evidence of effectiveness and safety in the Chinese population in postmarketing research. In addition, the NHSA has explicitly included analysis of insurance data and pharmacoeconomic evidence in the evaluation process of the NRDL in the 2019 document.¹⁹ The use of RWE in pharmacoeconomic evaluation is also discussed in detail in the most recent version of China Guidelines for Pharmacoeconomic Evaluations (2019 edition).²⁰ Furthermore, the upcoming reform in the hospital reimbursement system, such as diagnosis-related groups (DRGs) and other payment systems, will rely on the RWE to help hospitals and payers make decisions on the clinical pathways and reimbursement. In addition, with numerous novel treatments available in China, clinicians and patients need to rely on RWE to optimize their treatment decisions. Moreover, pharmaceutical companies need RWE to inform their decisions on clinical development and programs in China. Given these needs in different healthcare functions, we expect that the demand for RWE and RWD at all levels will continue to increase.

Indeed, we have witnessed an unparalleled effort in China to generate RWD from different sources to meet such demand. These initiatives are funded by different entities, such as the government, academic institutes, hospitals, private ventures, and multinational companies. At the national level, the State Council and the National Health Commission have published a series of documents regarding medical big data development, applications, and regulations.²¹,²²,²³,²⁴ At the regional level, there is a substantial investment in integrated data across regional hospitals and linkage to claims data. Medical associations, physician organizations, hospitals, and clinicians also invest in various RWD, mainly in the form of disease-specific registries or single-center clinical studies. Private ventures also have significant input in this process. As a result, various multicenter EHR databases have been developed.

Although this is a welcome trend for researchers and decision makers, many data sources have not been systematically evaluated for research purposes, even for certain databases that have existed for a decade. For new data sources that emerged in the past 2 to 3 years, there is limited public information describing the databases. Therefore, we sought to provide an overview of major data sources in China that can be potentially used for epidemiology, health economics, and outcomes research; compare them with similar data sources in other countries; and discuss future directions of healthcare data development in China. We hope that our research will provide more clarity on the potential applications of the current healthcare data in epidemiology, health economics, and outcomes research and facilitate the development of healthcare data suitable for research purposes.

Methods

We conducted this study in 2 phases. In the first phase, we scanned various data sources through a targeted review of the literature and public information between 2015 and 2019, including Chinese and international health economics and outcomes research conferences, medical big data conferences, and policy conferences. We focused on 3 common types of chronic diseases — oncologic diseases, cardiovascular diseases, and diabetes. We also identified additional data sources recommended by experts and other researchers to get a better understanding of the emerging data sources for which public information is limited. The first phase provided us with a comprehensive understanding of the types of RWD sources in China. We were primarily interested in real-world noninterventional studies, and, thus, excluded pragmatic clinical trials. In the second phase, we conducted an in-depth investigation on the strengths and limitations of administrative claims and EHR data based on published studies, our own research experience, and discussions with data developers and/or data users, and summarized the database characteristics in Table 1. We would like to note that the RWD is rapidly evolving in China; therefore, a traditional literature review methodology would not be suitable for our purpose because the databases that are of most interest to researchers are the emerging databases that have not yet been published. The study provided an up-to-date assessment of RWD in healthcare in China based on the information collected from multiple venues. Although we strived to comprehensively evaluate the databases in the second phase, we also identified substantial information gaps in these databases. We summarized the types of research for which the databases are suitable based on the available information. In addition, we also compared the databases in China with similar types of databases in other countries, which could shed light on future directions of RWD development in China.

Table 1. Information assessed in in-depth database evaluation.

Information assessed
General database information: data type, geographic coverage, disease areas, time frame, etc.
Data collection method or enrollment criteria
Availability of specific data elements
-Demographic variables
-Diagnosis and diagnosis code
-Treatment information, including brand and generic drug names, dose, initiation and ending dates, and procedure names and dates
-Lab tests, including test names, dates, and results
-Other examinations, such as imaging tests, pathology tests, and mutation tests
-Resource use and costs, including cost components
-Data on follow-up visits and methods for follow-up data collection

Results

Summary of RWD in China

The majority of the RWD can be classified into the following types: (1) administrative claims databases (including national and regional levels), (2) EHR (including regional EHR databases, multicenter, and single-center EHR databases), (3) databases linking EHR with claims, (4) cohort or registry data, (5) medical chart review studies, and (6) surveys of the general population and patients.

Administrative claims databases are available at national and regional levels. The National Claims Database is the only national-level claims database. It employs a stratified sampling approach to achieve a nationally representative sample of UEBMI and URBMI beneficiaries.²⁵ It does not cover rural residents who usually enroll in the NRCMS. The database was established in 2008 and is scheduled to be updated annually.¹⁴ The most recent published study used the 2017 data.²⁶ The database includes both inpatient and outpatient claims. In 2017, it covered more than 10 million lives and more than 60 million visits. It resamples the subjects annually. Therefore, although patients can be tracked within the same year, subjects cannot be linked across different calendar years. This substantially limits the utilization of the data for longitudinal research. Various regional claims databases (eg, Guangzhou, Beijing) are available with a limited number of publications.²⁷,²⁸ The regional claims databases generally cover all beneficiaries in the region and provide longitudinal data across multiple years.²⁷,²⁸ Nevertheless, in certain regions (eg, Tianjin), only random samples are available.²⁹,³⁰
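The consequence of annual resampling for follow-up can be illustrated with a minimal sketch. It assumes hypothetical claim records in which patient identifiers are valid only within a sampling year; the field layout and values are illustrative and are not taken from the National Claims Database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claims: (sampling year, within-year patient ID, visit date).
# Assumption: IDs are reassigned at each annual resampling, so "A01" in 2016
# and "A01" in 2017 cannot be treated as the same person.
claims = [
    (2016, "A01", date(2016, 2, 1)),
    (2016, "A01", date(2016, 11, 15)),
    (2017, "A01", date(2017, 3, 10)),
    (2017, "B07", date(2017, 1, 5)),
    (2017, "B07", date(2017, 12, 20)),
]


def follow_up_days(records):
    """Return days between first and last visit per (year, patient ID)."""
    visits = defaultdict(list)
    for year, patient_id, visit_date in records:
        # The key includes the sampling year because records cannot be
        # linked across years under the assumption above.
        visits[(year, patient_id)].append(visit_date)
    return {key: (max(dates) - min(dates)).days for key, dates in visits.items()}


follow_up = follow_up_days(claims)
max_follow_up = max(follow_up.values())
print(follow_up)
print("Longest observable follow-up (days):", max_follow_up)
```

Because the sampling year is part of the grouping key, no patient in this sketch can accumulate more than 1 calendar year of follow-up, which mirrors the limitation described above. A regional claims database with stable identifiers would instead group on the patient ID alone.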

EHR databases have developed rapidly in recent years due to the efforts invested in the integration of EHR systems within and across hospitals in China. Based on geographic regions, EHR databases can be classified as national or regional. National EHR databases refer to the ones that include multiple hospitals in different regions in China (eg, HLT, GennLife, LinkDoc, and DC Health databases).³¹,³²,³³,³⁴ These databases use a convenience sample of hospitals. Although they can be used to evaluate the clinical practice in different regions, the samples are not nationally representative. Regional EHR databases refer to multicenter EHR databases that include all or most of the hospitals in a specific region (eg, the Yinzhou EHR database in Ningbo, a city in Southeast China; the Langchao EHR database; and the Xiamen and Fuzhou EHR databases).³⁵ Based on disease areas, EHR databases can be classified as general EHR databases (eg, regional EHR, HLT, and GennLife) and disease-specific databases (eg, the LinkDoc and DC Health EHR databases are oncology-specific). With a few exceptions, most of the EHR databases were developed recently (ie, within the past 10 years). The sample sizes in these databases varied substantially from tens of thousands to more than 100 million. The number of hospitals included also had a wide range from 17 to a few hundred. In general, data on inpatient visits were more readily available than those on outpatient visits. In certain databases, only inpatient EHRs were available. The reasons are multifold. First, the EHR system is more developed in the inpatient setting because many hospitals do not enforce the use of EHR in the outpatient setting. Moreover, compared with western countries, it is more common that patients in China go to multiple hospitals for outpatient care so the EHR records cannot be systematically captured.
The incompleteness of longitudinal records was a general limitation of the EHR databases, which may be alleviated to a certain extent in regional EHR databases because patient records can be linked across regional hospitals in the system. Despite the availability of these multicenter EHR databases, most publications are still based on single-center EHR studies.

Data linking EHR and claims are generally lacking in China. Certain EHR databases, primarily regional EHRs, can be linked to regional claims databases (eg, Yinzhou EHR database²⁵ and Xiamen EHR database). Nevertheless, detailed information is not available in the public domain nor have there been any published studies based on linked EHR and claims. Generally speaking, such databases need to be supported by local or central health authorities that hold the ownership of the claims databases.

Cohort/Registry data account for most of the publications in real-world studies in China. The most common registries are in oncology and cardiovascular diseases.³⁶ Substantial variations exist with regard to geographic coverage, completeness of clinical information, follow-up time, and data quality. If well-planned and managed, this type of data could provide longitudinal data with rich clinical information that addresses the general limitation in the EHR databases.

Other types of data include medical chart review data and surveys. Because these involve de novo data collection, they were not included in the scope of in-depth investigation. In general, retrospective chart review studies can partially address the limitations of EHR databases. For example, through chart review, physicians could provide more detailed and accurate clinical information that is in the unstructured fields of EHR data. Nevertheless, chart reviews cannot address the issue of missing outpatient records that are not systematically managed by physicians or hospitals. In addition, chart reviews are subject to incomplete follow-up because patients seek care in multiple hospitals.

In terms of data elements in the existing databases, there is substantial variation even within the same type of data. The findings are summarized in Table 2. In general, claims databases only have information on the treatments covered by insurance, and there has been no nationwide standardized coding system for drugs and procedures. In EHR databases, information on oral drugs is often incomplete because most prescriptions are filled in an outpatient setting. Laboratory test results are often available in structured fields, especially in inpatient settings. Nevertheless, the availability of results from procedures, imaging examinations, and pathology varies. Even if the results are available in a database, they are mostly in unstructured fields that are not ready for research use. Total hospitalization cost information is usually available in the EHR databases but the availability of individual cost components varies. Linked EHR and claims combine the information of the 2 types of databases and provide clinical data and longitudinal treatment information. These data are currently only available at the regional level. Cohort and registry data often focus on a specific disease. The data elements vary substantially depending on the purpose and the design. Even though longitudinal follow-up is often built into the design, existing registries with good follow-up records are rare. Nevertheless, they do provide a good opportunity to set up a longitudinal clinically rich database for epidemiology, health economics, and outcomes research.

Table 2. Data elements that are available in different types of RWD.

	Claims	EHR	Linked EHR and claims	Cohort and registry
Demographics	Available	Available	Available	Available
Diagnosis	ICD-10 code	ICD-10 code; unstructured diagnosis description	ICD-10 code; unstructured diagnosis description	Unstructured diagnosis description with or without ICD codes
Pharmacologic treatment	Available but only for drugs covered by insurance	Available and not restricted to drugs covered by insurance; often need to rely on physician notes to reconstruct complete treatment history	Available and not restricted to drugs covered by insurance	Pharmacologic treatments are not available or often incomplete
Drug name	Generic names are available; Various coding systems are used in different regions and no nationwide standardized coding system is available	Generic names; no standardized coding; Brand names in some databases; Incomplete outpatient drug information in some databases because of incomplete outpatient data	Generic names; Brand names in some databases	Varies depending on the purpose of the data collection
Dose	Often available	Available in prescriptions; limited in treatment history notes	Often available	Varies depending on the purpose of the data collection
Initiation and ending dates	Prescription fill dates and quantity	Prescription date; unstructured field, eg, physician notes	Prescription date, prescription fill dates, and quantity	Varies depending on the purpose of the data collection
Nonpharmacologic treatment
Surgery and procedures	Names and dates are available; various coding systems are used in different regions and no nationwide standardized coding system is available; Results are not available	Names and dates are available; no standardized coding; Results may not be available or available in an unstructured field	Names and dates are available; Results may not be available or available in an unstructured field	Varies depending on the purpose of the data collection
Lab tests	Names and dates are available; Results are not available	Names and dates are available; Results are available	Names and dates are available; Results are available	Names and dates are available; Results are often available
Other exams
Imaging tests	Names and dates are available; Results are not available	Names and dates are available; Results may not be available or available in an unstructured field	Names and dates are available; Results may not be available or available in an unstructured field	Varies depending on the purpose of the data collection
Pathology tests	Names and dates are available; Results are not available	Names and dates are available; Results may not be available or available in an unstructured field	Names and dates are available; Results may not be available or available in an unstructured field	Varies depending on the purpose of the data collection
Mutation tests	Many tests are not reimbursable so are not available; Results are not available	Names and dates are available for tests conducted within hospitals; Results are available in unstructured fields	Names and dates are available for tests conducted within hospitals; Results are available in unstructured fields	Varies depending on the purpose of the data collection
Resource use and costs	Total costs and itemized costs are available	Outpatient data often incomplete; Costs vary across databases but total hospitalization costs are generally available	Total costs and itemized costs are available	Often not available
Follow-up visits	Available (The National Claims Database only has data within the same calendar year; regional claims databases have data across multiple years)	Incomplete due to missing outpatient information; Some databases have follow-up surveys	Available	Often available as part of the design but data quality varies

EHR indicates electronic health records; ICD-10, International Classification of Diseases, 10th Revision; RWD, real-world data.

Comparison of Secondary Databases Between China and Other Countries

Secondary databases comprise a major component of RWD and they allow timely research execution. Compared with developed countries, the application of secondary databases in healthcare research is still in its infancy in China. Table 3 provides a detailed comparison of administrative claims databases and EHR databases between China, the United States (US), and the European Union (EU).

Table 3. Comparisons of secondary databases between China and other countries.

	China	United States	EU
Administrative claims
Availability	Yes, at national and regional levels	Yes, at national and regional levels	Not available in all countries. France and Italy have national-level databases; Germany has regional-level databases
Sponsorship/ownership	Government	Government and private	Mostly government
Access	Certain academic groups and organizations; no clear guidelines on data application	Available for noncommercial and commercial research purposes; A clear process of data application	Varies but the majority is available for noncommercial and commercial research purposes; The data application process is generally long (ie, 6 months to 1 year)
Database information	Not available (only through published information); Unclear about data QC and validation	Available in public domains or upon request	Available in public domains or upon request
Key information available	Inpatient and outpatient claims; Pharmacologic and non-pharmacologic treatments that are reimbursable; Demographic information	All claims, pharmacy and medical; Demographic information	All claims, pharmacy and medical; Demographic information
Clinical information	Not available	Limited data available in certain databases (eg, IBM Truven); Possible to link with EHR (eg, PharMetrics)	Generally not available
Longitudinal follow up	The National Claims Database only has data within 1 calendar year; regional claims databases have data across multiple years	Generally available as long as patients are enrolled in the health plan	Yes. Long-term follow-up is available due to the universal healthcare system
Data update	Subject to policy change	Varies from quarterly (commercial claims) to biennially (Medicare)	Varies; often annually
Data lag	Subject to policy change	Generally 6-9 months, longer for government-sponsored data, such as Medicare and Medicaid	Varies; 3 months to 1 year
Ethics reviews	Unclear	Not required	Varies
EHRs
Availability	Yes, at national and regional levels	Yes, at national and regional levels	Varies across countries. New EHR databases are emerging under the initiatives of the European Commission.⁵²
Sponsorship/ownership	Government and private	Government and private	Government and private
Access	Varies; Generally no clear guidelines on data application; Claims data generally not accessible by third-party non-academic organizations; EHR databases sponsored by private ventures may be accessed by researchers after obtaining ERB and HGRAC approval.	Generally available for noncommercial and commercial research purposes; A clear process of data application	Varies, eg, CPRD available for commercial use while single-center data may not; The data application process varies widely. For CPRD, it takes 2-3 months.
Database information	Not available (some only through published information); Unclear about data QC and validation	Available in public domains or upon request	Available in public domains or upon request
Timing of data validation	Basic data validation is conducted, but many variables are generated and validated when individual research projects are conducted.	Mostly completed before data licensing	Mostly completed before data licensing
Key information available	Most inpatient records; Outpatient records are generally incomplete; Treatment information may not be completed in certain databases	Depending on the data collection methods, some include outpatient records only while others include both inpatient and outpatient; Generally include treatment information (procedure, drug, dose, treatment duration, etc.)	Varies; CPRD is an outpatient database but can be linked to HES, the inpatient database; Generally include treatment information (procedure, drug, dose, treatment duration, etc.)
Clinical information	Lab values are available; treatment outcomes (such as response and progression in cancer) and symptoms are generally not available and rely on natural language processing + chart review.	Lab values are available; treatment outcomes (such as response and progression in cancer) and symptoms are generally not available and rely on natural language processing + chart review.	Lab values are available; treatment outcomes (such as response and progression in cancer) and symptoms are generally not available.
Longitudinal follow-up	Longitudinal information is generally incomplete though some have follow-up information from surveys.	Longitudinal follow-up is generally available; records outside the healthcare delivery system are not available.	Longitudinal follow-up is generally available; records outside the healthcare delivery system are not available.
Data update	Non-government-funded databases generally have a real-time update	Near real-time data availability (IBM Explorys) Monthly update (Flatiron)	Varies, monthly update for CPRD
Data lag	Non-government-funded databases have a very short data lag up to 2 months	Up to 1 month	Varies, < 30 days for CPRD
Linking to claims	Certain databases have the ability but none has published information using linked databases.	Several examples linked EHR/registry and claims, including SEER-Medicare; Optum + Humedica; DoD data; and PharMetrics + IQVIA EHR	Limited
Linking to a prospective study	Certain databases (mainly private venture sponsored) conduct regular patient surveys to be included in the database and offer research collaboration with centers to conduct prospective studies.	Not a common feature but certain databases do offer chart review or patient survey for a specific project.	May be feasible with individual centers
Ethics reviews	Required; review and approval by HGRAC are generally required	Not required	Varies

CPRD indicates Clinical Practice Research Datalink; DoD, Department of Defense; EHR, electronic health records; ERB, ethics review board; EU, European Union; HES, Hospital Episode Statistics; HGRAC, Human Genetic Resource Administration of China; QC, quality control; SEER, Surveillance, Epidemiology, and End Results Program.

The secondary databases in China have several limitations that significantly hinder their use in healthcare research. First, data access is limited and not well defined. For example, the National Claims Database can only be accessed by academic researchers²⁵ and the research institute under the predecessor of NHSA (ie, the China Health Insurance Research Association). Data exclusivity applies to many privately sponsored EHR databases as well. This data access/sharing model has significantly limited the wide adoption of these databases in healthcare research, which may explain the limited publications or presentations based on these databases in the literature. In contrast, database access in the US and EU is clearly defined, with most databases available for research use with a licensing fee. This access model not only helps establish the credibility of these data sources but also improves the data quality over time with feedback from users. Another barrier to using the secondary databases in China is the lack of clear regulations related to patient privacy protection in the secondary databases and the approval process for using these databases. For example, the US has a federal law, the Health Insurance Portability and Accountability Act of 1996, which sets the national standards to protect sensitive patient health information from being disclosed without the patient’s consent or knowledge. The EU has created a law, the General Data Protection Regulation, that defines data protection and privacy in the EU and the European Economic Area and addresses the transfer of personal data outside the EU and European Economic Area. For the US and EU secondary databases that comply with these regulations, ethics review is not required, which substantially improves timely access to these databases for research purposes. In China, however, such regulations are not yet established.
Out of an abundance of caution, data developers and researchers opt to apply for approval from the ethics review board, even if only deidentified data are used. Moreover, because there is no central ethics review board in China, data developers and researchers need to apply for the ethics board reviews in individual hospitals or medical centers that contribute to the databases. Because not all hospitals will grant an approval, the actual sample size could be much smaller than the total number in the databases. It is often challenging to estimate the feasible sample size because of the uncertainty of the ethics board review outcomes. To make the issue more complex, the Human Genetic Resource Administration of China (HGRAC) approval is now required by most sponsors to initiate a research project. There is no clear timeline and process for the HGRAC review. In some cases, it could take more than 6 months to get the approval. In case of failing the HGRAC review, the research plan has to be aborted. For government-sponsored databases, the ethics review and HGRAC requirements are less clear with varied practice in reality. In comparison, in the US and EU, even when ethics review is required for certain secondary databases (such as the Surveillance, Epidemiology, and End Results-Medicare Program), the process and timeline are predictable and researchers rarely have to get individual centers’ approval for use of secondary databases that comply with the local regulations on patient privacy protection.
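The effect of center-level approvals on the feasible sample size can be sketched with simple arithmetic. The patient counts and approval probabilities below are hypothetical and serve only to illustrate the calculation; they are not estimates from any database discussed in this article.

```python
# Hypothetical contributing hospitals: (patients in the database, assumed
# probability that the hospital's ethics review board approves the study).
hospitals = [
    (12000, 0.9),
    (8000, 0.5),
    (5000, 0.5),
    (3000, 0.2),
]

# Nominal size if every hospital approved.
total_patients = sum(n for n, _ in hospitals)

# Expected feasible size: each hospital's patients weighted by its
# assumed approval probability.
expected_patients = sum(n * p for n, p in hospitals)

# Conservative planning floor: count only hospitals judged very likely
# to approve (the 0.8 cutoff is an arbitrary illustrative choice).
floor_patients = sum(n for n, p in hospitals if p >= 0.8)

print("Nominal sample size:", total_patients)
print("Expected feasible sample size:", round(expected_patients))
print("Conservative floor:", floor_patients)
```

Under these assumed inputs, the expected feasible sample is well below the nominal database size, and the conservative floor is lower still. In practice the approval probabilities are themselves unknown, which is the source of the planning difficulty described above.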

In addition to data access, another limitation in most of the secondary databases in China is the lack of longitudinal follow up. For example, the National Claims Database only has a maximum of 1-year follow up because of resampling each year.²⁵ In contrast, most claims databases in the US and EU have complete longitudinal records of enrollees as long as they remain in the same health plans. Using regional claims databases may address this limitation; yet, with substantial variations in clinical practice across different regions, a national database is more desirable for certain research objectives. For the EHR databases, although incomplete records are a general limitation globally because of the inability to capture patient care outside the system, the issue is more salient in EHR databases in China.

The lack of transparency in data content, quality, and validation is another concern to researchers. Information on data elements, quality, and validation is limited in the public domain. Even though the list of data elements may be shared by data developers upon request, the availability of data elements may vary across contributing hospitals because it is at each hospital’s discretion what data to share with data developers. Certain hospitals are cautious about sharing the entire EHR data because of concerns about “data leaks.” Although appropriate regulations may help alleviate these issues in the long term, in the short term data developers should set a minimum standard when recruiting hospitals that contribute to the database and exclude hospitals that do not meet it. In addition, detailed information on data curation and harmonization is often not available. Many variables (including certain diagnoses, treatment information, and clinical outcomes) are defined using artificial intelligence (AI) methods based on the unstructured fields in the raw data. There is little information on the performance of these AI algorithms. Staff with a medical background are involved in ascertaining the cases and validating the variables based on their medical knowledge. The data validation standards, however, are not transparent. External validation, where the summary statistics based on the databases are compared with published literature, is not usually performed. As such, data quality issues, such as missing values and data inconsistencies, usually do not become apparent until after the project is initiated. Insufficient information on data quality is likely to substantially limit the use of these databases. In fact, it is common that data developers need to re-extract and manually enter data from the hospitals’ original EHR systems because of a large number of missing values in key variables. This has implications for the timeline of the study, making the secondary EHR databases less attractive.

Despite these limitations, these databases present opportunities to further improve RWD in China. First, they generally have a larger sample size compared with similar databases in other countries, which is attributed to the large population size in China. This creates opportunities for rare disease research, which might not be feasible in other countries with a smaller patient population. In addition, some EHR databases conduct regular follow-up patient surveys through a call center (especially for oncology patients) and provide the follow-up data along with the existing EHR data. Although some EHR vendors in the United States and EU can conduct a patient survey for a specific project, they normally do not include a regular patient follow-up survey as a feature of the database. The survey helps partially address the limitation of lack of long-term outcomes; however, it mainly focuses on death and performance status. The surveys could be further improved by collecting information on treatments and resource use in the outpatient setting. Another distinction is that certain EHR databases in China, mainly those funded by private ventures, can offer to facilitate research collaborations between researchers and certain hospitals or medical centers, which not only allow additional data collection to address the missing data issues in the existing databases but also provide insight from the clinical perspective. This is not commonly provided by EHR database developers in other countries.

Potential Applications of Existing Databases in Epidemiology, Health Economics, and Outcomes Research

Based on the assessment, the current secondary databases are more suitable for cross-sectional studies. In fact, most published or presented studies based on these databases use a cross-sectional design (eg, understanding patient characteristics, treatment compositions, costs per hospitalization and visit).³⁷,³⁸,³⁹

The National Claims Database can generally be used to evaluate the prevalence of a disease and to understand patient characteristics, current treatments, and the economic burden (resource use and costs) associated with a disease or a treatment. It can also be used to evaluate regional variations in these outcomes. Using data from multiple years, the database can also be used to assess trends in the outcomes of interest over time.

Regional claims databases can be used to address similar research topics as the National Claims Database; however, the generalizability to the Chinese population is limited. In contrast, the longitudinal nature of the regional databases can support epidemiologic studies, such as incidence, risk factors, and risk predictions, as well as outcomes research, such as the natural history of a disease, treatment trajectory, treatment adherence and persistence, complications, and economic outcomes associated with a disease and a treatment.

The national-level EHR databases are most suitable to assess diseases and treatments in an inpatient setting, including inpatient clinical outcomes, length of stay, and costs. Certain regional EHR databases can provide longitudinal records, which may be used to address broader research topics in outcomes research due to their richer clinical information compared with the claims databases.⁴⁰ Nevertheless, as discussed previously, there is generally limited information on data elements and quality regarding these databases.

Discussion

Key Considerations for Conducting Research Using Existing Databases

With a large number of available secondary databases in China, researchers should make an informed decision about the suitable data for their research objectives; however, the information on these databases is limited. Although several previous studies have assessed various RWD,²⁵,³⁵,³⁶ to our knowledge, the current study is the first one to provide an in-depth and comprehensive assessment of various databases, including emerging databases, for use in healthcare research.

Given the unique features of the current databases in China, there are specific considerations when forming a research collaboration through data management companies in real-world studies. We recommend that researchers conduct a thorough feasibility assessment before committing to a specific database. Such an assessment should address both logistic questions and study-specific questions. Logistic questions include whether data are deidentified, researchers’ access to individual patient-level data, the ethics committee and HGRAC review processes, data transfer, and timeline. Study-specific questions include sample size, availability and completeness of key variables, definitions and validation of key variables, and completeness of the follow-up data. Through the feasibility assessment, researchers will have a better understanding of the timeline and main limitations of using a specific data source. Even with a thorough feasibility assessment, it is important for researchers to be flexible during the course of the study and allow sufficient time and research funding to handle uncertainties.
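As an illustration of the study-specific part of a feasibility assessment, the sketch below computes the completeness of key variables and the share of patients with sufficient follow-up from a small sample extract. It is only one possible way to summarize these checks. The record layout, the field names (`diagnosis`, `stage`, `follow_up_days`), and the sample values are hypothetical and do not describe any actual database.

```python
def completeness(records, key_variables):
    """Share of records with a non-missing value for each key variable."""
    n = len(records)
    if n == 0:
        return {var: 0.0 for var in key_variables}
    return {
        var: sum(1 for r in records if r.get(var) not in (None, "")) / n
        for var in key_variables
    }


def share_with_follow_up(records, min_days, field="follow_up_days"):
    """Share of records followed for at least `min_days` days."""
    if not records:
        return 0.0
    followed = sum(1 for r in records if (r.get(field) or 0) >= min_days)
    return followed / len(records)


# Hypothetical sample extract; field names and values are illustrative only.
records = [
    {"diagnosis": "C34", "stage": "III", "follow_up_days": 400},
    {"diagnosis": "C34", "stage": None, "follow_up_days": 90},
    {"diagnosis": "C50", "stage": "", "follow_up_days": None},
    {"diagnosis": None, "stage": "II", "follow_up_days": 365},
]

report = completeness(records, ["diagnosis", "stage"])
long_follow_up = share_with_follow_up(records, min_days=365)
for var, share in report.items():
    print(f"{var}: {share:.0%} complete")
print(f"Follow-up of at least 365 days: {long_follow_up:.0%}")
```

Running such checks on a sample extract before committing to a database can surface missing-value problems that would otherwise appear only after the project has started.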

Call for Actions

Consistent with the findings from the previous studies,³⁵,³⁶ our study identified 2 fundamental issues that hinder the generation of RWE in China: data access/sharing and data quality. Addressing these issues requires a collective effort from researchers, information technology (IT) vendors, hospitals, and government agencies.

First, data access needs to follow clear regulations that define data ownership and the approval process while setting standards for confidentiality and data security. In this regard, there are numerous examples in other regions that we can reference.⁴¹,⁴²,⁴³ For example, the Digital Agenda for Europe initiative by the European Commission has fueled tangible policy changes to make databases more open to sharing.⁴⁴ In China, the China RWD and Study Alliance was founded in 2017 with the goal to develop an RWE ecosystem and promote the use of RWE in healthcare and policy decision making.⁴⁵ Beyond these initiatives, regulations similar to the Health Insurance Portability and Accountability Act or General Data Protection Regulation that clearly define protection of personal health information and data transfer in China are urgently needed. Such regulations will play a critical role in forming a healthcare data system that allows the legitimate use of deidentified secondary databases for research use irrespective of funding sources. In addition, establishing a central ethics review board and a clear process of reviewing and approving the use of secondary databases will help expedite data access.

Second, data quality must be substantially improved. It should be recognized that database development is not merely an IT effort. To fill this gap, the China RWD and Study Alliance has developed a series of high-level technical guidance for developing RWD and subsequent studies involving experts in IT, clinical medicine, epidemiology, statistics, health economics, health policy and management, AI, and so on.⁴⁵,⁴⁶,⁴⁷,⁴⁸,⁴⁹,⁵⁰ Other guidelines, such as the one published by the Chinese Thoracic Oncology Group, provided detailed guidance on the design of real-world studies as well as the quality control of RWD.⁵¹ These guidelines emphasize the importance of multidisciplinary collaborations to improve the quality of RWE in China. Specific data standards need to be established to define quality control, data validation, and transparency of data processing methods for individual research projects. Moving forward, it is more important to devote our time to improving the quality of the databases rather than their quantity. Increasing the number of hospitals will not address the fundamental issue of lack of follow-up data and incomplete outpatient records. An efficient approach to enable high-quality RWE could stem from a committed collaboration between individual academic centers and experienced researchers, supported by funding from the government or pharmaceutical industry.

The recent initiative of the National Longitudinal Cohort of Hematological Diseases in China is an example of such collaboration.⁵² The cohort data will be built upon clearly defined research objectives and rigorous research design. The initiative is supported by a dedicated clinical team for the patient follow ups to collect high-quality longitudinal data. Such data can be used to address questions in healthcare and policy decisions, improve quality of care, and facilitate clinical innovations. Such a model can be adopted to generate high-quality RWD in other disease areas as well.

Finally, a functional infrastructure is needed to secure the devoted effort on RWD collection. Given the unique healthcare system in China, the infrastructure is even more critical. Unlike in Western countries, where the private sector plays a major role in the healthcare system and has strong incentives to generate high-quality RWD, hospitals in China are mostly publicly funded. The investment in data integration and management in Chinese hospitals is generally limited. Therefore, the clinical data remain fragmented. To address this issue at the hospital level, setting up an appropriate infrastructure is the key to ensuring data integration and efficient data management. The infrastructure should have a functional unit to streamline RWD collection across disease specialties and different settings (eg, inpatient and outpatient). For example, to effectively implement the National Longitudinal Cohort of Hematological Diseases in China program, the Institute of Hematology and Blood Diseases Hospital has established a Center for Information and Resources that is directly led by the institutional head and consists of IT, biobank, library, bioinformatics, and epidemiology/biostatistics teams. The center enabled the collection of data originally hosted in each department into a central system, where data are processed and managed using the same standards. With a streamlined data processing and management system, researchers can access high-quality data in a timely manner. Admittedly, not all hospitals have the capability of setting up such infrastructure within their organization. With appropriate regulations in place, there is an opportunity for these hospitals to outsource the data management to a vendor with such expertise without risking the disclosure of patient information.

In conjunction with appropriate infrastructures, multiple funding mechanisms via governments, funding agencies, and industries should also be explored and established. As RWD is a common interest among key stakeholders of the healthcare system, plenty of opportunities exist in public-private collaborations to support innovation in treatment and prevention, improve quality of care and patient outcomes, and inform health policies and regulatory and reimbursement decisions. Establishing a sustainable long-term funding mechanism is one of the keys for the healthy development of RWD in the long run.

Conclusions

Administrative claims databases and EHR databases are relatively new in China. Compared with similar databases in developed countries, their major limitations include unclear data access, lack of longitudinal follow-up data, and suboptimal data quality. Nevertheless, with a large sample size and a linked patient survey component, some databases also present opportunities for further improvement in RWD in China.

With the dynamic healthcare environment, RWD will continue to evolve in China. Our goal is to continue to update our knowledge of RWD in China and prompt meaningful changes in improving RWE generation through our research.

Article and Author Information

Author Contributions: Concept and design: Xie, Wu, Wang, Cheng, Zhou, Zhong, Liu

Analysis and interpretation of data: Xie, Wu, Zhou, Zhong, Liu

Drafting of the manuscript: Xie, Wu, Zhou, Zhong, Liu

Critical revision of the paper for important intellectual content: Xie, Wu, Wang, Cheng, Zhou, Zhong, Liu

Supervision: Wang, Cheng

Conflict of Interest Disclosures: Drs Xie, Wu, Zhou, and Zhong are employed by Analysis Group, Inc. Dr Liu is employed by Merck & Co Inc. No other disclosures were reported.

Funding/Support: This work was supported by Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA.

Role of the Funder/Sponsor: The sponsor was involved in all stages of the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and the decision to submit the manuscript for publication.

Supplemental Material

Appendix Table 1.

References

1. US Food and Drug Administration. Submitting documents using real-world data and real-world evidence to FDA for drugs and biologics guidance for industry. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/submitting-documents-using-real-world-data-and-real-world-evidence-fda-drugs-and-biologics-guidance. Accessed December 2, 2019.
2. US Food and Drug Administration. Use of real-world evidence to support regulatory decision-making for medical devices 2017. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/use-real-world-evidence-support-regulatory-decision-making-medical-devices. Accessed December 2, 2019.
3. US Food and Drug Administration. Use of electronic health record data in clinical investigations guidance for industry 2018. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/use-electronic-health-record-data-clinical-investigations-guidance-industry. Accessed December 2, 2019.
4. European Medicines Agency. A common data model for Europe? – Why? Which? How? https://www.ema.europa.eu/en/documents/report/common-data-model-europe-why-which-how-workshop-report_en.pdf (2017). Accessed December 2, 2019.
5. European Medicines Agency. Real-world evidence (RWE) – an introduction; how is it relevant for the medicines regulatory system? https://www.ema.europa.eu/en/documents/presentation/presentation-real-world-evidence-rwe-introduction-how-it-relevant-medicines-regulatory-system-emas_en.pdf. Accessed December 2, 2019.
6. Center for Drug Evaluation. Key considerations in using real-world evidence to support drug development (draft for public review). https://www.cde.org.cn/main/news/viewInfoCommon/7e6fb9fc3f066a966a02130f24dbff1c. Accessed December 2, 2019.
7. IQVIA. The Global Use of Medicine in 2019 and Outlook to 2023. https://www.iqvia.com/en/locations/belgium/newsroom/2019/02/the-global-use-of-medicine-in-2019-and-outlook-to-2023. Accessed December 2, 2019.
8. Y. Zhang, W. Tang, X. Zhang, Y. Zhang, L. Zhang. National Health Insurance Development in China from 2004 to 2011: coverage versus benefits. PLoS One, 10 (5) (2015), Article e0124995.
9. W. Liang, J. Xie, H. Fu, E.Q. Wu. The role of health economics and outcomes research in healthcare reform in China. Pharmacoeconomics, 32 (3) (2014), pp. 231-234.
10. F. Huang, L. Gan. The impacts of China’s urban employee basic medical insurance on healthcare expenditures and health outcomes. Health Econ, 26 (2) (2017), pp. 149-163.
11. J. Pan, S. Tian, Q. Zhou, W. Han. Benefit distribution of social health insurance: evidence from China’s urban resident basic medical insurance. Health Policy Plan, 31 (7) (2016), pp. 853-859.
12. J. Chen, H. Dong, H. Yu, Y. Gu, T. Zhang. Impact of new rural cooperative medical scheme on the equity of health services in rural China. BMC Health Serv Res, 18 (1) (2018), p. 486.
13. Information Office of the State Council. Medical and health services in China. The State Council of the People’s Republic of China. http://english.www.gov.cn/archive/white_paper/2014/08/23/content_281474982986476.htm. Accessed December 2, 2019.
14. National Healthcare Security Administration. Notice of the Ministry of Human Resources and Social Security of the State Medical Insurance Bureau on printing and distributing the catalogue of national basic medical insurance, work injury insurance, and maternity insurance drugs. http://www.nhsa.gov.cn/art/2019/8/20/art_37_1666.html. Accessed November 16, 2020.
15. X. Tan, X. Liu, H. Shao. Healthy China 2030: a vision for health care. Value Health Reg Issues, 12 (2017), pp. 112-114.
16. Center for Drug Evaluation. Announcement of the National Health Commission on optimizing drug registration review and approval. http://www.cde.org.cn/policy.do?method=view&id=33df81db7cdbe45e. Accessed November 16, 2020.
17. National Healthcare Security Administration. Announcement of the national health insurance bureau on the announcement of the work plan for the adjustment of the national medical insurance drug list in 2019. http://www.nhsa.gov.cn/art/2019/4/17/art_37_1214.html. Accessed November 16, 2020.
18. Economic Information Daily. Scientifically construct the dynamic adjustment mechanism of the national medical insurance catalogue – medical insurance catalog adjustments can benefit more patients. http://dz.jjckb.cn/www/pages/webpage2009/html/2019-08/09/content_56222.htm. Accessed November 16, 2020.
19. National Healthcare Security Administration. Announcement of the National Healthcare Security Administration on the “Adjustment Plan of 2019 National Reimbursement Drug List.” http://www.nhsa.gov.cn/art/2019/4/17/art_37_1214.html. Accessed December 2, 2019.
20. G. Liu, S. Hu, J. Wu, J. Wu, Z. Dong, H. Li. China Guidelines for Pharmacoeconomic Evaluations. China Market Press, Beijing (2020).
21. The State Council of the People’s Republic of China. The General Office of the State Council regarding promotion and standardization guiding opinions on the development of health and medical big data applications. http://www.gov.cn/zhengce/content/2016-06/24/content_5085091.htm. Accessed December 2, 2019.
22. Cyberspace Administration of China. Notice on issuing national health and medical big data standards, safety, and service management. http://www.cac.gov.cn/2018-09/15/c_1123432498.htm. Accessed December 2, 2019.
23. The State Council of the People’s Republic of China. Outline for the “Healthy China 2030” initiative. http://www.gov.cn/zhengce/2016-10/25/content_5124174.htm. Accessed December 2, 2019.
24. The State Council of the People’s Republic of China. Guidelines for promoting health and health technology innovation. http://www.gov.cn/xinwen/2016-10/12/content_5118171.htm. Accessed December 2, 2019.
25. Y. Yang, X. Zhou, S. Gao, et al. Evaluation of electronic healthcare databases for post-marketing drug safety surveillance and pharmacoepidemiology in China. Drug Saf, 41 (1) (2018), pp. 125-137.
26. F. Zhu. The medical insurance data tells you: why is the cost of treating the same disease different? https://www.cn-healthcare.com/articlewm/20190531/content-1060878.html. Accessed November 16, 2020.
27. H. Zhang, Y. Sun, D. Zhang, C. Zhang, G. Chen. Direct medical costs for patients with schizophrenia: a 4-year cohort study from health insurance claims data in Guangzhou city, Southern China. Int J Ment Health Syst, 12 (2018), p. 72.
28. L. Zhuo, Y. Cheng, Y. Pan, et al. Prostate cancer with bone metastasis in Beijing: an observational study of prevalence, hospital visits, and treatment costs using data from an administrative claims database. BMJ Open, 9 (6) (2019), Article e028214.
29. X. He, L. Chen, K. Wang, H. Wu, J. Wu. Insulin adherence and persistence among Chinese patients with type 2 diabetes: a retrospective database analysis. Patient Prefer Adherence, 11 (2017), pp. 237-245.
30. X. He, Y. Wang, H. Cong, C. Lu, J. Wu. Impact of optimal medical therapy at discharge on one-year direct medical costs in patients with acute coronary syndromes: a retrospective, observational database analysis in China. Clin Ther, 41 (3) (2019), pp. 456-465.e2.
31. Happy Life Tech. http://www.happylifetech.com/. Accessed December 2, 2019.
32. GennLife. http://www.gennlife.com/. Accessed December 2, 2019.
33. LinkDoc. Care Data Care Life. https://www.linkdoc.com/. Accessed December 2, 2019.
34. Health DC. http://dchealth.com/. Accessed December 2, 2019.
35. X. Sun. Real-world evidence in China – current practices, challenges, strategies, and developments. ISPOR. https://www.ispor.org/docs/default-source/conference-ap-2018/china-2nd-plenary-for-handouts.pdf?sfvrsn=5fbc7719_0. Accessed December 2, 2019.
36. X. Sun, J. Tan, L. Tang, J.J. Guo, X. Li. Real-world evidence: experience and lessons from China. BMJ, 360 (2018), p. j5262.
37. Q. Hui ZG Guo, W.Z. Shi, M.C. Gong, C. Liu, H. Xu, H. Li. The National Cancer Big Data Platform of China – Vision and Status. ISPOR. https://www.ispor.org/heor-resources/presentations-database/presentation/intl2019-742/90815. Accessed December 6, 2019.
38. Y. Ying, L. Zhang, L. Li, et al. Epidemic characteristics of diabetic retinopathy in Ningbo population based on regional health information platform. Chin J Diabetes Mellitus, 9 (10) (2017).
39. L. Ye, L. Zhang, T. Fang, et al. Analysis of epidemiological characteristics and duration of hospitalization of pneumonia in children aged 5 and below in Ningbo. Prev Med, 31 (2) (2019), pp. 202-205.
40. H. Ma, T. Wang, C. Li, et al. Current status of lipid-lowering therapy and low-density lipoprotein cholesterol goal achievement in hospitalized acute coronary syndrome patients in Fuzhou, China: a retrospective real-world study. Poster abstract presented at Congress of the International Society on Thrombosis and Haemostasis Vol. XXVII; June 21, 2019; Berlin, Germany.
41. S. Salas-Vega, A. Haimann, E. Mossialos. Big data and healthcare: challenges and opportunities for coordinated policy development in the EU. Health Syst Reform, 1 (4) (2015), pp. 285-300.
42. A. Pacurariu, K. Plueschke, P. McGettigan, et al. Electronic healthcare databases in Europe: descriptive analysis of characteristics and potential for use in medicines regulation. BMJ Open, 8 (9) (2018), Article e023090.
43. C. Auffray, R. Balling, I. Barroso, et al. Making sense of big data in health research: towards an EU action plan. Genome Med, 8 (1) (2016), p. 71.
44. European Commission. Communication from the Commission: a strategy for smart, sustainable, and inclusive growth. http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52010DC2020&from=EN. Accessed December 6, 2019.
45. X. Sun, J. Tan, W. Wang, et al. Developing technical guidance for real-world data and studies to achieve better production and use of real-world evidence in China. Chin J Evid Based Med, 7 (2019), p. 1.
46. X. Peng, X. Shu, J. Tan, et al. Technical guidance for designing observational studies to assess therapeutic outcomes using real-world data. Chin J Evid Based Med, 19 (7) (2019).
47. J. Tan, X. Peng, X. Shu, et al. Technical guidance for patient registration database. Chin J Evid Based Med, 19 (7) (2019).
48. W. Wang, P. Gao, J. Wu, et al. Technical guidance for developing research databases using existing health and medical data. Chin J Evid Based Med, 19 (7) (2019).
49. Z. Wen, L. Li, Y. Liu, et al. Technical guidance for pragmatic randomized controlled trials. Chin J Evid Based Med, 19 (7) (2019).
50. P. Gao, Y. Wang, J. Luo, et al. Technical guidance for statistical analysis of evaluating treatment outcome research based on real-world data. Chin J Evid Based Med, 19 (7) (2019).
51. Wu Jieping Medical Foundation, Chinese Thoracic Oncology Group. Guidelines for real-world study research. In: 8th Symposium of CTONG; August, 2018; Guangzhou, China.
52. National Longitudinal Cohort of Hematological Diseases in China (NICHE). Advancing blood disease research in China. http://niche-study.com/. Accessed December 6, 2019.

© 2021 Published by Elsevier Inc. on behalf of ISPOR–The professional society for health economics and outcomes research.

Source: https://www.sciencedirect.com/science/article/pii/S2212109921000765