1. Measurement methodology in PRO research
1.1. Properties of PROMs
The researchers who develop these tools/instruments must make every attempt to ensure that they are measuring concepts important to patients in a way that is repeatable and understandable.
A well-designed PRO questionnaire should assess a single underlying characteristic or, where it addresses multiple characteristics, should comprise a number of scales that each address a single characteristic.
Questionnaires may be generic (designed for use in any disease population and covering broad aspects of the construct measured) or condition-targeted (developed specifically to measure those aspects of outcome that are important to people with a particular medical condition).
Table 2 below provides an overview of important aspects to be considered in PROMs.
Table 2: Important properties of PROMs
Property |
Description |
Explanatory notes |
RELIABILITY |
Measurements are repeatable and consistent, and must distinguish between changes in response and changes due to errors in administration. Reliability is a necessary (but not sufficient) component of validity. |
1) Test-retest reliability: a measure of the ability of an instrument (e.g., a psychological testing instrument) to yield the same result for a single patient at two different, closely spaced test periods, so that any variation detected reflects the reliability of the instrument rather than changes in the patient's status. May be considered as intra-interviewer reliability for interviewer-administered PROs. |
CRITERION VALIDITY |
The extent to which the scores of a PRO instrument are related to a known gold standard measure of the same concept (that is, they correlate with another measure considered more accurate). |
In practice, criterion validity is difficult to determine because there is no gold standard for most PROs. Check: predictive validity, the ability to predict something in the real world that is expected theoretically. For instance, if slowness of movement in Parkinson's disease is measured, one could take the respective patients' results and determine whether there is a high correlation between the scores and the degree of slowness. A high correlation is evidence of “predictive validity”. Check: concurrent validity, the ability of the item to distinguish between groups in the real world as theoretically expected. For example, in the assessment of breathlessness, the item should be able to differentiate between asthma and COPD. |
“Face validity” (appears to measure concept of interest); “Content validity in context of use” (adequately covers concept/domain of interest) |
The extent to which an instrument measures what it is intended to measure (the concept of interest). |
A lack of content validity means that the data gathered may appear useful but in fact measure a different effect (off target). - Literature review - Expert opinion - Evaluation of how well items deal with the complete continuum of patient experiences - Patient and clinician evaluation of the relevance and comprehensiveness of the content contained in the measures through qualitative research with the targeted patient population (essential) - Input from the target population of patients to document understandability and comprehensiveness of the measure - Diversity in demographic and disease characteristics of the target population |
CONSTRUCT VALIDITY |
Logical relationships should exist between assessment measures. |
A lack of construct validity means that the data gathered may appear useful but in fact measure a different effect. Example: in COPD patients, one would expect patients with lower treadmill exercise capacity generally to have more dyspnoea (shortness of breath) in daily life than those with higher exercise capacity. By the same logic, one would expect substantial correlations between a new measure of emotional function and existing emotional function questionnaires. |
RESPONSIVENESS (also: sensitivity to change or ability to detect change) |
Captures changes over time in the construct being measured, before and after an intervention or in different disease or treatment states. |
If the instrument is not adequately responsive, it may fail to detect a patient response when the intervention improves how patients feel, and so return false negative results. Check: - Responsiveness is evaluated within specific populations and is not a fixed/inherent property of the item - Determine the relevant, clinically meaningful effect size |
PRACTICALITY and FEASIBILITY |
Measurements are easily obtained, and the instrument is easy to operate. |
If the instrument is difficult to operate, patient response may change but not be measured, or be measured inappropriately. If the instrument takes an inordinate amount of time, compliance is jeopardised. Check: - Verify the acceptability of the PROM to patients - Review the instructions, questions, response options and recall period of the PROM |
INTERPRETABILITY |
The degree to which one can assign qualitative meaning (clinical or commonly understood connotations) to a PROM’s quantitative scores or change in scores[1] |
Patients, clinicians or payers may not make the best decisions if the output of the tool is difficult to understand. PROMs vary in scale, measurement, and interpretation, making it difficult for patients and physicians alike to reach conclusions when evaluating results from these measures, e.g., in clinical trials. Some metrics indicate a favorable result with higher values, whereas others indicate a favorable result with lower values. This can lead to a lack of patient or clinician acceptance and may limit the PROM's utility. Check: the meaningfulness of scores produced by the questionnaire - What does a score mean? - What is the minimal clinically important difference (MID)? (see box below) - Should an overall score be computed and presented to support a given claim? |
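Some of the checks in Table 2 are quantitative. As a minimal sketch, and not a method prescribed by this text, test-retest reliability is often summarised by the correlation between two closely spaced administrations, and responsiveness by a standardised effect size. The function names, the use of a Pearson correlation, the effect-size definition (mean change divided by the baseline standard deviation) and all scores below are assumptions made for illustration; other statistics, such as an intraclass correlation coefficient, are often preferred in practice.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(first, second):
    """Pearson correlation between two paired lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    m1, m2 = mean(first), mean(second)
    cov = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    var1 = sum((a - m1) ** 2 for a in first)
    var2 = sum((b - m2) ** 2 for b in second)
    if var1 == 0 or var2 == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var1 * var2)


def effect_size(baseline, follow_up):
    """Mean change from baseline divided by the baseline standard deviation."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    sd = stdev(baseline)
    if sd == 0:
        raise ValueError("effect size is undefined when baseline scores are identical")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / sd


# Hypothetical scores for five patients (illustrative only, not real data).
test_scores = [10, 12, 15, 20, 22]
retest_scores = [11, 12, 14, 21, 22]          # closely spaced retest, status unchanged
post_treatment_scores = [14, 15, 19, 25, 27]  # after an intervention

print(f"test-retest r = {pearson_r(test_scores, retest_scores):.2f}")
print(f"effect size   = {effect_size(test_scores, post_treatment_scores):.2f}")
```

A correlation close to 1 between test and retest suggests stable measurement. What counts as an adequate correlation or a clinically meaningful effect size depends on the instrument and the population, and is not fixed by this sketch.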
Note on minimal clinically important difference (MID)
An important advance in health-related quality of life (HRQoL) research is the concept of the minimal clinically important difference (MID), defined as the smallest difference in score on an HRQoL instrument that patients perceive as beneficial and that would suggest, in the absence of troublesome side effects and excessive cost, a change in the patient’s management. Differences in scores smaller than the MID are considered unimportant, regardless of whether statistical significance is reached. For example, although an average change of 0.15 point on the HAQ-DI (the Health Assessment Questionnaire (HAQ) Disability Index (DI)) may be statistically significant in a clinical trial, it may not be perceived as meaningful by study participants, and would not meet MID criteria since the MID for the HAQ-DI is 0.22 point in that study. MID estimates of HRQoL measures have influenced the designs of subsequent clinical trials aimed at improving HRQoL.
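The HAQ-DI example can be expressed as a simple comparison. The sketch below is illustrative only: the function names and the patient-level changes are hypothetical, the 0.22 threshold is the value quoted above for that study, and the comparison looks only at the magnitude of a change, so the direction of improvement for a given instrument would need to be handled separately.

```python
def meets_mid(change, mid):
    """Return True when the magnitude of a score change reaches the MID."""
    if mid <= 0:
        raise ValueError("the MID must be a positive number")
    return abs(change) >= mid


def responder_rate(changes, mid):
    """Fraction of individual changes whose magnitude reaches the MID."""
    if not changes:
        raise ValueError("need at least one change score")
    return sum(meets_mid(c, mid) for c in changes) / len(changes)


HAQ_DI_MID = 0.22  # MID quoted in the text for that study

# Average change from the example in the text: below the MID.
print(meets_mid(0.15, HAQ_DI_MID))   # False

# Hypothetical patient-level changes (not real data): two of four reach the MID.
print(responder_rate([0.10, 0.25, -0.30, 0.00], HAQ_DI_MID))   # 0.5
```

Reporting the proportion of patients whose change reaches the MID alongside the mean change is one way to make a statistically significant result easier to interpret.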
[1] Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2010). The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. Journal of Clinical Epidemiology, 63(7), 737–745.