1. Data sources for pharmacoepidemiological research

1.3. Secondary data

Secondary data are data already collected for another purpose, such as previous research (e.g., a clinical trial) or the facilitation of a process (e.g., hospital discharge records). They are often captured in an automated fashion, e.g., electronically. Although data from RCTs are not frequently used in pharmacoepidemiology, they can be used for secondary analyses. Over the last decades, key data resources, expertise and methodology have been developed that allow the use of such data for pharmacoepidemiology and pharmacovigilance (for more information see the ENCePP Inventory of Data Sources). These data sources can be further linked to prospectively collected medical and non-medical data.

The key limitation of secondary data is that they were not collected to answer the research question at hand; thus, there may be specific validity issues that must be considered. These include completeness of data capture; bias in the assessment of exposure, outcome and covariates; variability between data sources; and the impact of changes over time in the data, the access methodology and the healthcare system.

Administrative data, claims data, data in electronic medical records and other data from digital health sources are predominantly collected in electronic automated databases and can be grouped as secondary data. They are presented in more detail in the following.

1. Administrative Databases

The provision of health care frequently generates digitised data such as claims data, hospital-based electronic data, and medication prescription records, collectively subsumed under the term “administrative databases”. These are predominantly designed for billing and record-keeping purposes and are not primarily intended for research. A disadvantage of administrative data is that the size of the covered population varies over time, e.g., due to migration.

2. Claims Databases

For pharmacoepidemiologic research, claims data (predominantly claims to insurance companies for reimbursement) provide some of the best data on the prescription of medications on an individual basis. Claims processing generates and stores large amounts of data detailing the services provided, often accompanied by supplementary information such as medical diagnoses and/or demographic information of individuals. It has to be recognised that different types of claims may not be kept in the same database; e.g., pharmacy claims may be processed separately from the rest of the medical claims, which requires provisions for linking the different databases.
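The linkage of separately processed claims can be illustrated with a minimal sketch. It assumes that both databases carry a shared pseudonymised identifier (here the hypothetical field `patient_id`); the record layouts, field names and example values are illustrative only and do not describe any particular claims database.

```python
from collections import defaultdict

def link_claims(pharmacy_claims, medical_claims, key="patient_id"):
    """Group pharmacy and medical claims under a shared patient identifier."""
    linked = defaultdict(lambda: {"pharmacy": [], "medical": []})
    for claim in pharmacy_claims:
        linked[claim[key]]["pharmacy"].append(claim)
    for claim in medical_claims:
        linked[claim[key]]["medical"].append(claim)
    return dict(linked)

# Hypothetical example records (field names are assumptions).
pharmacy_claims = [
    {"patient_id": "P001", "atc_code": "C10AA05", "dispensed": "2020-01-15"},
    {"patient_id": "P002", "atc_code": "A10BA02", "dispensed": "2020-02-03"},
]
medical_claims = [
    {"patient_id": "P001", "icd10": "E78.0", "service_date": "2020-01-10"},
    {"patient_id": "P003", "icd10": "I10", "service_date": "2020-03-22"},
]

linked = link_claims(pharmacy_claims, medical_claims)

# Individuals present in only one of the two sources; such gaps bear on
# the completeness of data capture.
unmatched = sorted(
    pid for pid, rec in linked.items()
    if not rec["pharmacy"] or not rec["medical"]
)
```

In this sketch, individuals who appear in only one source are kept and listed in `unmatched` rather than silently dropped, so that incomplete linkage remains visible to the analyst.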

3. Electronic Health Records (EHR)

An Electronic Health Record (EHR) is defined as a longitudinal digital record of patient health information comprising all available data from different sources, including EMRs (see below) (Masnoon et al., 2018). EHRs provide a 360-degree view of an individual’s health and wellness. They also include data from patient portals or external laboratories and can be linked to further data sources. Data from EHRs are increasingly seen as a valuable data source for pharmacoepidemiology, pharmacovigilance, and clinical research requiring similar observational clinical data (Murray, 2014).

4. Electronic Medical Records (EMR)

EMRs are digital versions of the paper charts in clinician offices, clinics, and hospitals. They act as a repository of patient health information and support the capture of and access to this information. EMRs provide administrative functionality and basic connectivity to enable tracking of patients and exchange of information. EMRs offer a higher level of detail than administrative data or patient/provider questionnaires. The EMR is based on real-time data; thus, there is minimal latency between the generation of data and their availability.

As with all secondary data, EMR data are not specifically recorded for research purposes. They may be lacking entries of laboratory reports, progress notes, quality of life measurements, and information on health beliefs or health behaviours. The data contained in EMRs are typically complex, given the longitudinal nature and volume of information, and can be challenging for statistical analysis. Multisite sources can be advantageous in terms of sample size, but there may be problems with data consistency and validity because documentation and coding practices (diagnostic information) or implemented functions may vary across the participating sites.

Despite these limitations, EMR databases are nowadays widely used, often replacing paper medical records as the primary document (for further reference please see the IMI BD4BO project). This facilitates large-scale data collection for pharmacoepidemiologic studies and, by cross-linking digital medical records, may enable researchers and physicians to monitor, among other things, post-marketing safety. Some of the advantages are:

  • The validity of the diagnosis data is superior to that in administrative databases.
  • Cost of data collection is low - helpful when budget is restricted.
  • The representativeness of research is enhanced by covering the entire population of a region.
  • Large sample size permits examining rare outcomes, prescription patterns of infrequently used medications, and the uncommon effects of these medications.
  • There is no recall or interviewer bias (they do not rely on patient recall or interviewers to obtain their data).
  • They can potentially be linked to external electronic databases (e.g., birth and mortality records, claims), via a unique identification number, enhancing the capabilities and scope of research.
  • They are useful when short-term effects of medications and objective, laboratory-driven diagnoses are studied, and when time is limited.

Examples:
The Clinical Practice Research Datalink (CPRD) in the UK, formerly the General Practice Research Database (GPRD), holds population-based anonymised longitudinal medical records from primary care for over 15 million patients and has been extensively used for published research in the field of pharmacoepidemiology. https://www.cprd.com/

The German Pharmacoepidemiological Research Database (GePaRD) is based on claims data from statutory health insurance providers and currently includes information on about 25 million persons who have been insured with one of the participating providers since 2004. In addition to demographic data, GePaRD contains information on prescriptions, outpatient and inpatient services and diagnoses. Per data year, there is information on approximately 20% of the general population from all geographical regions of Germany. https://www.bips-institut.de/en/research/research-infrastructures/gepard.html

5. Social media

Technological advances have dramatically increased the range of data sources that can be used to complement traditional ones and may provide insights into the effectiveness and safety of interventions (see also the Patient-Generated Health Data (PGHD) section). Social media are considered a subset of digital media; their data exist in a computer-readable format on websites, web pages, blogs, vlogs, social networking sites, internet forums, chat rooms and health portals. These data are unsolicited and generated in real time.

The European Commission’s Digital Single Market Glossary defines social media as “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content. It employs mobile and web-based technologies to create highly interactive platforms via which individuals and communities share, co-create, discuss, and modify user-generated content.”

Social media are a heterogeneous data source. They have been used to provide insights into patients’ perception of the effectiveness of medicines and for the collection of patient-reported outcomes. However, it must be acknowledged that the main purpose of most social media publications is not, for example, to report a suspected association between a specific health intervention and an adverse event. For further information see the IMI WEB-RADR European collaborative project, which explored different aspects of the use of social media data as a basis for pharmacovigilance and summarised its recommendations in Recommendations for the Use of Social Media in Pharmacovigilance: Lessons From IMI WEB-RADR (Drug Saf 2019;42(12):1393-1407).

Indeed, one possible use of social media would be as a source of information for signal detection or assessment. Studies have evaluated whether analysis of social media data (specifically Facebook and Twitter posts) could identify pharmacovigilance signals early but, in their respective settings, found that this was not the case. However, a report from 2020 found that social media data could have indicated the outbreak of COVID-19 before the health system did, as certain key words were used much more frequently in posts (https://www.eurosurveillance.org/content/10.2807/1560-7917.ES.2020.25.10.2000199?crawler=true).
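The idea of monitoring how often certain key words appear in posts can be sketched as follows. This is a minimal illustration, not the method used in the cited report: the post format (ISO date plus text), the key words, the substring matching and the doubling threshold are all assumptions chosen for the example.

```python
from collections import Counter
from datetime import date

def weekly_mentions(posts, keywords):
    """Count, per ISO week, the posts that mention at least one key word."""
    counts = Counter()
    for day, text in posts:
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            iso = date.fromisoformat(day).isocalendar()
            counts[(iso[0], iso[1])] += 1  # (ISO year, ISO week)
    return counts

def flag_increases(counts, factor=2.0):
    """Flag weeks whose count is at least `factor` times the mean of all
    earlier weeks that had mentions (weeks without mentions are not in
    `counts` and are therefore ignored in the baseline)."""
    weeks = sorted(counts)
    flagged = []
    for i, week in enumerate(weeks[1:], start=1):
        baseline = sum(counts[w] for w in weeks[:i]) / i
        if counts[week] >= factor * baseline:
            flagged.append(week)
    return flagged

# Hypothetical posts: (ISO date, free text).
posts = [
    ("2020-01-06", "Bad cough since Monday"),
    ("2020-01-07", "Lovely weather today"),
    ("2020-01-14", "Fever again"),
    ("2020-01-20", "Cough and fever all night"),
    ("2020-01-21", "High fever"),
    ("2020-01-22", "Dry cough"),
]
counts = weekly_mentions(posts, keywords=["cough", "fever"])
flagged = flag_increases(counts)
```

Such a frequency count says nothing about who posted or why; any flagged increase would still need the validation steps discussed in this section before it could be treated as a signal.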

The rapidly evolving social media environment presents many challenges, including the need for viable processes for selection, validation and study implementation. There is currently no defined strategy or framework in place to address concerns about generalisability and validity. Social media data can include slang, emojis, false IDs and duplicates, hidden messages (such as Twitter sub-hashtag) or comments intended to foster a false belief. However, more tools and solutions for analysing unstructured, heterogeneous data and free text (natural language[1]) in large volumes (Big Data[2]) are becoming available, especially for pharmacoepidemiology and drug safety research; these employ emerging techniques such as Artificial Intelligence (AI) (with its subfield of machine learning[3]), neural network architectures and deep learning. Social media platforms may constitute an important resource for medical research if their potential can be realised. However, social media can also negatively influence research: for example, young patients connected via social media might inadvertently discover in a chat that they are in the same trial and, by comparing outcomes, infer which study arm they were in, which risks unblinding.
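One of the data quality issues mentioned above, duplicate posts, can be illustrated with a minimal sketch. The normalisation rules used here (lower-casing, removing URLs, collapsing whitespace) are assumptions made for the example; real social media data would typically need more elaborate handling, for instance of reposts with small edits.

```python
import re

def normalise(text):
    """Lower-case the text, remove URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def drop_duplicates(posts):
    """Keep the first occurrence of each post, comparing normalised text."""
    seen = set()
    unique = []
    for post in posts:
        key = normalise(post)
        if key and key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

# Hypothetical posts; the second differs from the first only in case,
# spacing and an appended link.
posts = [
    "Felt dizzy after the new tablet",
    "felt dizzy   after the new tablet https://example.org/x",
    "No side effects so far",
]
unique = drop_duplicates(posts)
```

Exact matching after normalisation is a deliberately conservative choice: it removes obvious copies without merging posts from different individuals who happen to describe similar experiences in different words.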

6. Other Data Sources

There are various other sources of data employed in pharmacoepidemiology that do not correspond to any of the previous categories. These include results of laboratory tests contained in laboratory collections, which may be used as surrogates to monitor the outcome of interest; monitoring systems, such as disease-specific surveillance systems that monitor outbreaks of infectious diseases like seasonal influenza and tuberculosis; and national statistics, such as mortality statistics (death certificate data) and statistics on hospitalisations. Data can also be found outside of healthcare systems. For example, sales data for medical products could be obtained from a pharmacy or a group of pharmacies in a specific area. Sales data have also been utilised as a method of syndromic surveillance to predict disease outbreaks.


[1] Natural Language Processing (NLP): Computer software that can analyse, ‘understand’, and generate language that humans use naturally. It can be used to extract key terms or phrases from bodies of unstructured text, providing insights faster than a human can.
https://www.microsoft.com/en-us/research/group/natural-language-processing/

[3] Machine Learning: A method of data analysis that uses algorithms that iteratively learn from data so that computers can find hidden information without being explicitly programmed where to look. http://www.sas.com/en_us/insights/analytics/machine-learning.html