Abstract
Background: The use of routinely recorded electronic health record (EHR) data is increasingly common, especially in epidemiological research. However, data must be processed and prepared for secondary use, and decisions made during this process could significantly impact research outcomes. A demonstration of the extent of these consequences is necessary.
Objective: The aim of this study was to investigate the influence of data processing steps on research outcomes derived from the secondary use of EHR data.
Methods: EHR data from 8 Dutch general practices from 2019 were used. These practices contributed data to 2 research databases: the Academic General Practitioner Development Network registry and the Nivel Primary Care Database. Data were extracted and processed through distinct extraction, transformation, and loading (ETL) pipelines, allowing the evaluation of the impact of different ETL methods by comparing the 2 datasets in three steps: (1) patient demographics, (2) epidemiology of concordant patients, and (3) health service use of patients with 3 diagnoses. A number of similarity indicators, including the number of contacts, regular consultations and visits, prescriptions, and episodes, were compared between the 2 databases. The outcomes were compared by performing paired samples t tests using 99% CIs. Prevalence, number of prescriptions, and number of regular consultations and visits per 1000 patient years were calculated and compared for 3 diagnoses (diabetes mellitus, urinary tract infection, and cough). These outcomes were compared using the SD.
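The two calculations named in the Methods, a rate per 1000 patient years and a paired comparison of the same indicator across the 2 databases, can be illustrated with a minimal sketch. This is not the authors' code: the function names and all numbers below are hypothetical, and only the test statistic is computed. Deriving the 99% CI would additionally need a critical value from the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def rate_per_1000_patient_years(events: int, patient_years: float) -> float:
    """Return the number of events per 1000 patient years of observation."""
    if patient_years <= 0:
        raise ValueError("patient_years must be positive")
    return 1000.0 * events / patient_years


def paired_t_statistic(values_a: list[float], values_b: list[float]) -> float:
    """Return the paired samples t statistic for two equally long series.

    Each position holds the same indicator for the same unit (for example,
    one patient) as recorded in database A and in database B.
    """
    if len(values_a) != len(values_b) or len(values_a) < 2:
        raise ValueError("need two series of equal length with at least 2 pairs")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    sd = stdev(diffs)  # sample SD of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Hypothetical example: 50 prescriptions observed over 2000 patient years.
print(rate_per_1000_patient_years(50, 2000.0))

# Hypothetical contact counts for the same 4 patients in two databases.
contacts_db_a = [3, 5, 4, 6]
contacts_db_b = [2, 3, 3, 3]
print(round(paired_t_statistic(contacts_db_a, contacts_db_b), 3))
```

A paired design fits this comparison because both databases describe the same concordant patients, so the per-patient differences are what reflect the effect of the separate ETL pipelines.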
Results: Differences were observed between the datasets in the number of enrolled patients (Academic General Practitioner [...] similar. All indicator outcomes of the concordant patients showed significant differences between the databases, that is, the number of contacts, prescriptions, and episodes per patient, and the number of regular consultations and visits. Differences in the indicator outcomes for the 3 diagnosis groups varied greatly in SD; however, none of the differences were deemed [...]
Conclusions: The findings highlight the importance of routine health data users' awareness of the different ETL steps involved. Transparency and shared knowledge about these processes are critical, and making them available for research is necessary. [...] knowledge of this type of metadata. Transparency and shared knowledge are particularly important in light of the European [...] the role of transparency, joint decision-making, and the minimization of effects of ETL steps, and on the insight into the individual influence of ETL steps on research outcomes. This could stimulate standardized approaches among data processors and researchers, resulting in increased data interoperability.
| Original language | English |
|---|---|
| Article number | e64628 |
| Number of pages | 16 |
| Journal | Journal of Medical Internet Research |
| Volume | 27 |
| Publication status | Published - 11 Jun 2025 |
Keywords
- ETL
- Data extraction
- Data governance
- Data processing
- Data quality
- Electronic health records
- Extraction, transformation, and loading
- Fitness for purpose
- General practice
- Routine health care data