4. Measurement methodology in PRO research

Patient-Reported Outcome Measures (PROMs) are the tools or instruments used to capture Patient-Reported Outcomes (PROs).

Each PROM includes:

A concept of measurement (e.g. symptoms, sensations, functioning, or quality of life), and
A method of measurement, typically a rating scale of some form.

Once the concept and the items are identified and set out, careful decisions also need to be made about:

how the questions are delivered to patients,
when the questions are delivered to patients,
how answers are recorded, and
how the data is interpreted.

Typically, PROs are measured with questionnaires or surveys that are either:

completed by the patients themselves,
completed by the patients in the presence of the researcher,
completed by the researcher through face-to-face interview or by telephone interview,
completed via different interfaces such as hand-held devices or computers (see ePROMs below).

There are strengths and weaknesses to the different approaches to collecting information. For example, while the use of trained interviewers reduces errors and ensures surveys are completed, trial/treatment resources may not allow for this.

Ensuring Relevance and Accuracy

It is crucial that the approaches and methods used address patients’ perceptions and the actual concepts being measured, rather than focusing on the interviewer and on the way questions are asked.

In the example given in the section above, morning symptoms can be ascertained more reliably if the questionnaire is administered in the morning than if it is completed later in the day.

⚠️ Watch out for interviewer bias: leading questions or inconsistent delivery can affect how patients interpret and answer questions.

Properties of PROMs

The researchers who develop these tools/instruments must make every attempt to ensure that they are measuring concepts important to patients in a way that is repeatable and understandable.

Therefore, instruments must be designed to:

Measure concepts important to patients
Be repeatable, understandable, and interpretable

A well-constructed PROM should either:

Assess a single underlying construct (unidimensional), or
Use separate scales for multiple constructs, ensuring each scale targets one concept.

PROMs may be:

Generic: applicable across various conditions and populations, capturing broad aspects of health or functioning
Condition-targeted: developed to focus on outcomes specifically relevant to a particular disease or patient group

Property	Description	Explanatory notes
Reliability	Measurements are repeatable and consistent, and the instrument must distinguish between true changes in response and changes due to errors in administration. Reliability is a necessary (but not sufficient) component of validity.	Test-retest reliability: the ability of an instrument (e.g. a psychological test) to yield the same result for a single patient at two closely spaced test periods, so that any variation detected reflects the reliability of the instrument rather than changes in the patient's status. For interviewer-administered PROs, this may be considered intra-interviewer reliability. Internal consistency: the consistency of the results within a test, i.e. the items measuring a given construct deliver consistent scores. Inter-interviewer reliability: the extent to which results change when the instrument is administered by two or more interviewers.
Validity
Criterion validity	The extent to which the scores of a PRO instrument are related to a known gold standard measure of the same concept (i.e. correlate with another measure considered more accurate).	In practice, criterion validity is difficult to determine because there is no gold standard for most PROs. There are two types: “predictive” and “concurrent”. ✅Check: Predictive validity is the ability to predict something in the real world that is expected theoretically. For instance, if slowness of movement in Parkinson's disease is measured, one could administer the instrument to the respective patients and determine whether there is a high correlation between the scores and the degree of slowness; a high correlation is evidence of predictive validity. Concurrent validity is the ability of the item to distinguish between groups in the real world as theoretically expected. For example, in the assessment of breathlessness, the item should be able to differentiate between asthma and COPD.
Content validity	The extent to which an instrument measures what it is intended to measure (the concept of interest).	A lack of content validity means that the data gathered may appear useful but is in fact measuring a different effect (off target). ✅Check, for example, by: literature review; expert opinion; evaluation of how well items cover the complete continuum of patient experiences; patient and clinician evaluation of the relevance and comprehensiveness of the content of the measures through qualitative research with the targeted patient population (essential); input from the target patient population to document the understandability and comprehensiveness of the measure; diversity in the demographic and disease characteristics of the target population.
Construct validity	Logical relationships exist between assessment measures.	A lack of construct validity means that the data gathered may appear useful but is in fact measuring a different effect. ✅Check: Evidence that relationships among concepts (domains) and items conform to a priori hypotheses about the logical relationships that should exist with other measures or with characteristics of patients and patient groups. For example, in COPD patients, expect that those with lower treadmill exercise capacity will generally have more dyspnoea (shortness of breath) in daily life than those with higher exercise capacity; similarly, expect substantial correlations between a new measure of emotional function and existing emotional function questionnaires.
Responsiveness	Captures changes over time in the construct being measured, either before and after an intervention or in different disease or treatment states.	If the instrument is not adequately responsive, it may fail to detect cases in which the intervention improves how patients feel, and so return false negative results. ✅Check: Responsiveness is evaluated within specific populations and is not a fixed or inherent property of the item. Determine the relevant, clinically meaningful effect size.
Practicality & Feasibility	Measurements are easy to obtain; the instrument is easy to operate.	If the instrument is difficult to operate, patient response may change but not be measured, or be measured inappropriately. If the instrument requires an inordinate amount of time, compliance is jeopardised. ✅Check: Verify the acceptability of the PROM to patients. Review the instructions, questions, response options and recall period of the PROM.
Interpretability	The degree to which one can assign qualitative meaning (clinical or commonly understood connotations) to a PROM’s quantitative scores or change in scores[1].	Patients, clinicians or payers may not make the best decisions if the output of the tool is difficult to understand. PROMs vary in scale, measurement, and interpretation, making it difficult for patients and physicians alike to reach conclusions when evaluating results from these measures, e.g. in clinical trials. Some metrics indicate a favorable result with higher values, whereas others indicate a favorable result with lower values. This can lead to a lack of patient or clinician acceptance and may limit the PROM’s utility. ✅Check: The meaningfulness of the scores produced by the questionnaire. What does a score mean? What is the minimal clinically important difference (MID)? (see the note below) Should an overall score be computed and presented to support a given claim?
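
Some of the checks in the table can be expressed as simple statistics. The sketch below is illustrative only and is not taken from any particular PROM manual: it computes Cronbach's alpha (a common index of internal consistency), a Pearson correlation between two administrations (one way to look at test-retest reliability) and the standardized response mean (one possible effect size for responsiveness). All function names and all data are hypothetical, and the choice of these three statistics is an assumption; other indices, such as the intraclass correlation coefficient, are also widely used.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(responses):
    """Internal consistency for one scale.

    `responses` holds one list of item scores per respondent.
    """
    k = len(responses[0])                       # number of items in the scale
    items = list(zip(*responses))               # transpose: one tuple per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def pearson_r(x, y):
    """Pearson correlation, e.g. between test and retest total scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def standardized_response_mean(before, after):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [a - b for b, a in zip(before, after)]
    return mean(changes) / stdev(changes)


# Hypothetical data: five patients answering a four-item scale (scored 1 to 5)
responses = [
    [3, 4, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 5, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
]
totals_test = [sum(row) for row in responses]   # first administration
totals_retest = [13, 6, 17, 10, 18]             # hypothetical retest shortly after
totals_after = [17, 9, 19, 12, 24]              # hypothetical post-intervention scores

alpha = cronbach_alpha(responses)
retest_r = pearson_r(totals_test, totals_retest)
srm = standardized_response_mean(totals_test, totals_after)

print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Test-retest correlation: {retest_r:.2f}")
print(f"Standardized response mean: {srm:.2f}")
```

Values of alpha and of the test-retest correlation close to 1 suggest consistent measurement in this sample. As noted in the table, responsiveness depends on the population studied, so an effect size obtained in one group should not be assumed to hold in another.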

Note on minimal clinically important difference (MID)

An important advance in health-related quality of life (HRQoL) research is the concept of the minimal clinically important difference (MID), defined as the smallest difference in score on an HRQoL instrument that patients perceive as beneficial and that would suggest, in the absence of troublesome side effects and excessive cost, a change in the patient’s management. Differences in scores smaller than the MID are considered unimportant, regardless of whether statistical significance is reached. For example, although an average change of 0.15 points on the HAQ-DI (the Health Assessment Questionnaire Disability Index) may be statistically significant in a clinical trial, it may not be perceived as meaningful by study participants and would not meet MID criteria, since the MID for the HAQ-DI is 0.22 points in that study. MID estimates of HRQoL measures have influenced the designs of subsequent clinical trials aimed at improving HRQoL.
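
The HAQ-DI example can be written as a small check. This is a minimal sketch under stated assumptions: the function name and its `higher_is_better` flag are hypothetical, and the example assumes that lower HAQ-DI scores indicate less disability, so an improvement appears as a negative change. The flag is there because some PROMs score a favorable result with higher values and others with lower values.

```python
def meets_mid(mean_change, mid, higher_is_better=True):
    """Return True if the change is favourable and at least as large as the MID."""
    # Convert the raw change into an improvement, whatever the scale direction
    improvement = mean_change if higher_is_better else -mean_change
    return improvement >= mid


HAQ_DI_MID = 0.22        # MID quoted in the example above

# Assumption: lower HAQ-DI means less disability, so improvement is a negative change
observed_change = -0.15  # average change from the example above

print(meets_mid(observed_change, HAQ_DI_MID, higher_is_better=False))  # False
```

The check says nothing about statistical significance. A change can be statistically significant and still fall short of the MID, which is the situation described in the example.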

[1] Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2010). The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. Journal of Clinical Epidemiology, 63(7), 737–745.