1 From the Department of Ophthalmology, Manchester Royal Eye Hospital, United Kingdom, and 2 Health Services Research Unit, London School of Hygiene and Tropical Medicine, United Kingdom.
| Abstract |
|---|
METHODS. Twenty published evaluations of ophthalmic screening/diagnostic tests or technologies were independently assessed by two reviewers for compliance with the following methodological standards: specification of the spectrum composition for populations used in the evaluation, analysis of pertinent subgroups, avoidance of work-up (verification) bias, avoidance of review bias, presentation of precision of results for test accuracy, presentation of indeterminate test results, and presentation of test reproducibility.
RESULTS. Compliance ranged from just 10% (95% CI, 1%–32%) for presentation of test reproducibility data and avoidance of review bias to 70% (95% CI, 46%–88%) for avoidance of work-up bias and presentation of indeterminate test results. Only 5 of the 20 evaluations complied with four or more of the methodological standards and none with more than five of the standards.
CONCLUSIONS. The evaluations of ophthalmic diagnostic tests discussed in this article show limited compliance with accepted methodological standards but are no worse than previously described for evaluations published in general medical journals. Adherence to these standards by researchers can improve the study design and reporting of evaluations of new diagnostic techniques. Limited compliance, combined with a lack of awareness of the standards among users of research evidence, may lead to the inappropriate adoption of new diagnostic technologies, with a consequent waste of health care resources.
| Introduction |
|---|
The performance of a diagnostic test is often referred to as "diagnostic accuracy," (i.e., the extent to which the result of a particular test correctly classifies patients into predefined disease categories). Diagnostic accuracy is usually characterized by the sensitivity and specificity of a test. Although likelihood ratios are considered to be the key indices in making clinical decisions about patients, by providing an explicit tool for revision of diagnostic probabilities according to the test outcomes,5 it is the sensitivities and specificities that are most commonly presented when reporting the diagnostic accuracy of tests.
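As an illustration of how these indices relate to one another, the following minimal sketch derives sensitivity, specificity, and both likelihood ratios from a 2 × 2 table, and then uses a likelihood ratio to revise a pretest probability. The counts, the 10% pretest probability, and the function names are hypothetical and are not drawn from any study reviewed here.

```python
def accuracy_indices(tp, fn, fp, tn):
    """Return (sensitivity, specificity, LR+, LR-) from the cells of a 2x2 table.

    Assumes specificity is neither 0 nor 1, so that both ratios are defined.
    """
    sens = tp / (tp + fn)          # proportion of diseased subjects testing positive
    spec = tn / (tn + fp)          # proportion of nondiseased subjects testing negative
    lr_pos = sens / (1 - spec)     # likelihood ratio for a positive result
    lr_neg = (1 - sens) / spec     # likelihood ratio for a negative result
    return sens, spec, lr_pos, lr_neg


def post_test_probability(pre_test_prob, likelihood_ratio):
    """Revise a pretest probability with a likelihood ratio, working in odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical counts: 50 diseased and 150 nondiseased subjects.
sens, spec, lr_pos, lr_neg = accuracy_indices(tp=45, fn=5, fp=30, tn=120)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, LR+={lr_pos:.2f}, LR-={lr_neg:.3f}")

# Hypothetical pretest probability of 10%, revised after a positive result.
print(f"post-test probability: {post_test_probability(0.10, lr_pos):.2f}")
```

With these assumed counts, a positive result raises the probability of disease from 10% to about one in three, which is the kind of explicit revision that sensitivity and specificity alone do not show.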
Evaluations of diagnostic tests should comply with accepted standards to provide clinically relevant estimates of diagnostic accuracy. However, there is limited compliance with such standards within the general medical literature.6 The purposes of this article are first, to draw attention to the need for researchers to comply with the standards when performing evaluations of diagnostic accuracy and the need for practitioners to appraise published reports of the diagnostic accuracy against the standards before implementing tests in clinical practice, and second, to review published reports of the diagnostic accuracy of ophthalmic tests, both to illustrate the importance of the standards and to estimate the extent of compliance with the standards. Within the context of this article, screening tests will also be included, because evaluations of their performance should comply with the same criteria as those used for the evaluation of diagnostic tests.
| Methods |
|---|
Twenty published evaluations of both established and new ophthalmic tests were selected. Eleven evaluations of diagnostic or screening tests for glaucoma were chosen, spanning structural (clinical examination and imaging of the optic nerve head), physiological (intraocular pressure [IOP] and pattern electroretinogram) and psychophysical (perimetry, contrast sensitivity) tests. To illustrate wider applicability of the standards to a range of ophthalmic conditions, a MEDLINE database search was performed using a recommended strategy.5 A further nine studies were selected from this process, including imaging and photographic, visual function, and laboratory tests. The studies reviewed are drawn predominantly from recent publications within the literature (12 were published between 1995 and 1997, 5 between 1990 and 1994, and 3 before 1990). In each case, selection of the studies was made before assessment with the standards.
Assessment of Compliance with Standards
All articles were independently assessed by two reviewers for compliance with seven widely accepted methodological standards.6–11 Reviewers used the definitions described by Reid et al.6 These standards are summarized in Table 1. Overall agreement between the reviewers was 80% (range, 55%–100%; see Table 2). All instances of disagreement (28 of 140) were resolved by discussion. The majority occurred for three standards (21 disagreements for standards 2, 4, and 6); these disagreements are considered further in the Discussion.
| Results |
|---|
Standard 2: Analysis of Pertinent Subgroups
This standard was met by 11 of the 20 evaluations (55%; 95% CI, 32%–77%). The second standard is closely linked with the first, because it concerns the way in which the sensitivity and specificity of a test can vary for different subgroups in a population. If the population studied includes people with wide-ranging characteristics (e.g., all ages), overall estimates of sensitivity and specificity may disguise considerable variations in performance in different subgroups.12 Thus, even if overall performance is disappointing, a test may perform well in a subgroup; alternatively, overall performance may appear to be good, but may be unacceptably poor in a minority of subjects. It is important to point out, however, that this standard is intended to encourage the reporting of diagnostic accuracy in clinically relevant subgroups; it is not appropriate to search for "good" performance in a subgroup without a priori justification.
An early evaluation of oculokinetic perimetry estimated sensitivity to be more than 80% for the detection of glaucomatous visual field defects.15 Subsequent evaluations of this form of perimetry, which included patients with a range of visual field loss, indicated much lower sensitivity for early visual field loss.20 29
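The way a pooled estimate can disguise subgroup differences can be shown with a short calculation. The sketch below uses hypothetical counts (they are not taken from the perimetry evaluations cited above) for two severity subgroups and compares the pooled sensitivity with the subgroup values.

```python
# Hypothetical counts per subgroup: (diseased eyes detected by the test, diseased eyes tested).
subgroups = {
    "early field loss": (12, 30),
    "advanced field loss": (58, 60),
}


def sensitivity(detected, total):
    """Proportion of diseased eyes that the test classifies as abnormal."""
    return detected / total


# Sensitivity within each subgroup.
by_subgroup = {name: sensitivity(d, n) for name, (d, n) in subgroups.items()}

# Pooled sensitivity across all diseased eyes, ignoring severity.
pooled = sensitivity(
    sum(d for d, _ in subgroups.values()),
    sum(n for _, n in subgroups.values()),
)

print(f"pooled sensitivity: {pooled:.2f}")
for name, value in by_subgroup.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed counts the pooled figure is close to 80%, yet fewer than half of the eyes with early loss are detected, which is why the subgroup must be specified before the data are examined.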
Standard 3: Avoidance of Work-up, or Verification, Bias
Fourteen of the 20 evaluations (70%; 95% CI, 46%–88%) complied with this standard. This bias is introduced when subjects with positive or negative diagnostic test results are selectively referred to receive verification by the validating criterion (i.e., the gold standard), or where the groups of "diseased" and "normal" subjects (based on the gold standard) have been selected according to some clinical factor relating to the disease.
A population-based study of glaucoma screening24 30 provides an example of work-up bias. More than 5000 subjects were screened using a test battery, and diagnostic accuracies for optic disc assessment, IOP, and field screening were reported. However, only those who "failed" one or more of the tests under evaluation were referred for a "definitive ophthalmologic examination." Referred subjects were classified as having or not having glaucoma by this examination (i.e., the gold standard), but the diagnosis of glaucoma could be made only in those subjects who were referred. Thus, there are likely to have been a small number of truly glaucomatous patients who "passed" all screening tests and whose condition was not detected because they were not referred for the gold standard. If all patients had had the definitive examination, the results in these patients would have been classified as false negatives rather than true negatives, suggesting that the reported sensitivity estimate may be biased upward, and the specificity estimate downward (see the Discussion section).
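A numerical sketch may help to show the direction of this bias. The counts below are hypothetical and are not taken from the study described above. The sketch assumes that every subject with a positive result on the test under evaluation receives the gold standard, that only 1 in 10 subjects with a negative result does (for example, because they failed another test in the battery), and that accuracy is then calculated from verified subjects only.

```python
def sens_spec(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical table that would be seen if all 5000 subjects had the gold standard.
full = {"tp": 80, "fn": 20, "fp": 490, "tn": 4410}

# Assumption: all test-positives are verified, but only 1 in 10 test-negatives.
verified = {
    "tp": full["tp"],
    "fn": full["fn"] // 10,
    "fp": full["fp"],
    "tn": full["tn"] // 10,
}

true_sens, true_spec = sens_spec(**full)
obs_sens, obs_spec = sens_spec(**verified)

print(f"full verification:    sensitivity={true_sens:.2f}, specificity={true_spec:.2f}")
print(f"partial verification: sensitivity={obs_sens:.2f}, specificity={obs_spec:.2f}")
```

In this scenario the apparent sensitivity rises and the apparent specificity falls, simply because most test-negative subjects never reach the gold standard. Other ways of handling unverified subjects (for example, assuming that they are all normal) would change the size of the distortion, so the figures are illustrative only.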
Standard 4: Avoidance of Review, or Expectation, Bias
Only 2 of the 20 evaluations (10%; 95% CI, 1%–32%) complied with this standard. Bias can be introduced if the results of the test under evaluation are interpreted with a knowledge of the results of the gold standard (or vice versa).
Studies by Xu et al.31 and Bjerrum14 reported the sensitivity and specificity of diagnostic tests for dry eye (keratoconjunctivitis sicca) in patients with primary Sjögren's syndrome and other connective tissue diseases. However, it is not clear whether the clinician evaluating the tests was masked with respect to the validation status of the subjects.
Standard 5: Precision of Results for Test Accuracy
Only 3 of the 20 evaluations (15%; 95% CI, 3%–38%) complied with this standard. If sensitivity and specificity estimates are reported without a measure of precision, clinicians cannot know the range within which the true values of sensitivity and specificity may lie. For example, the sensitivity estimate of 73% for a laboratory test for ocular sarcoidosis,26 based on only 22 patients, has a 95% CI ranging from 54% to 92%. In contrast, the specificity estimate of 83% has better precision (95% CI, 74%–92%), a reflection of both the higher point estimate and the larger sample size used by the researchers for their nonsarcoid group (n = 70). (Note: The formula for the SE of a proportion, $\sqrt{pq/n}$, is based on a normal approximation to the binomial distribution and can be used to calculate 95% CIs for sensitivity and specificity: $p \pm 1.96\sqrt{pq/n}$, where p represents either sensitivity or specificity, q = 1 − p, and n is the sample size for either sensitivity or specificity. When $np$ or $nq$ is less than 5, the validity of the approximation becomes doubtful, and exact methods should be used to calculate the 95% CI [see Fig. 1].)
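The interval calculation described in the note can be sketched in a few lines. The function name and the rule-of-thumb check are illustrative; the inputs are the point estimates and sample sizes quoted above for the sarcoidosis test.

```python
from math import sqrt


def approx_ci(p, n, z=1.96):
    """95% CI for a proportion p estimated from n subjects (normal approximation).

    Raises ValueError when n*p or n*q is below 5, where the approximation
    becomes doubtful and an exact method should be used instead.
    """
    q = 1 - p
    if n * p < 5 or n * q < 5:
        raise ValueError("n*p or n*q is below 5; use an exact method")
    se = sqrt(p * q / n)           # standard error of the proportion
    return p - z * se, p + z * se


# Sensitivity of 73% based on 22 patients; specificity of 83% based on 70 subjects.
sens_low, sens_high = approx_ci(0.73, 22)
spec_low, spec_high = approx_ci(0.83, 70)

print(f"sensitivity 95% CI: {sens_low:.0%} to {sens_high:.0%}")
print(f"specificity 95% CI: {spec_low:.0%} to {spec_high:.0%}")
```

The sensitivity interval is roughly twice as wide as the specificity interval, reflecting the smaller sample and the estimate lying closer to 50%.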
Standard 6: Presentation of Indeterminate Test Results
Fourteen of the 20 evaluations (70%; 95% CI, 46%–88%) complied with this standard. For a variety of reasons, tests occasionally yield indeterminate results. For example, patient cooperation may be limited or the presence of media opacities may obscure fundus observation. Knowledge of the percentage of indeterminate results is important in deciding how applicable a test may be to a population of interest. The way in which indeterminate results are classified (i.e., as positive, negative, or by excluding them altogether33) affects the estimate of diagnostic accuracy.
A study to determine the effectiveness of videorefraction in screening for significant refractive errors in infants reported a sensitivity of 84% and a specificity of 91% for hyperopia of more than 4.00 D when cycloplegic videorefraction was used.22 However, limited cooperation, failure to obtain adequate blur circles and difficulty in obtaining adequate cycloplegia restricted the final sample size. The prevalence of these indeterminate results was not reported. Because this test is intended to be used for screening, the "untestable" children should arguably have been included and the test results regarded as positive (i.e., requiring further investigation), thereby decreasing the specificity of the test.
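The effect of the classification rule can be made concrete with a small calculation. The counts below are hypothetical: the determinate results are chosen to give the reported 84% and 91%, and the numbers of untestable children are invented for illustration, because the study did not report them.

```python
def sens_spec(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical determinate results.
tp, fn, fp, tn = 42, 8, 18, 182

# Hypothetical indeterminate ("untestable") results, split by true status.
indet_diseased, indet_healthy = 10, 40

policies = {
    "excluded": sens_spec(tp, fn, fp, tn),
    "counted as positive": sens_spec(tp + indet_diseased, fn, fp + indet_healthy, tn),
    "counted as negative": sens_spec(tp, fn + indet_diseased, fp, tn + indet_healthy),
}

for policy, (sens, spec) in policies.items():
    print(f"indeterminate results {policy}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these assumed numbers, treating untestable children as screen positives lowers specificity from 91% to about 76%, whereas treating them as negatives lowers sensitivity to 70%. The estimates therefore cannot be interpreted without knowing how many results were indeterminate and how they were handled.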
Standard 7: Test Reproducibility
Only 2 of the 20 evaluations (10%; 95% CI, 1%–32%) complied with this standard. Limited reproducibility is inevitably reflected in the sensitivity and specificity estimates, and so these estimates provide valid measures of performance that take into account the degree of reproducibility. However, reporting test reproducibility is important to allow the reader to appraise whether the same level of reproducibility can be obtained in the study setting, particularly when a test result is based on expert judgment. For example, experts or proponents of a new test might be expected to be able to apply a new grading system with better reproducibility than nonexperts. Therefore, evidence about test reproducibility should clarify whether the test results are reproducible in "average" or "expert" hands.
| Discussion |
|---|
A review6 of diagnostic evaluations published in four prestigious general medical journals between 1978 and 1993 found widespread noncompliance with these standards, with compliance exceeding 50% for only one of the seven standards. For the present review, compliance ranged from 10% for presentation of test reproducibility data and avoidance of review bias to 70% for avoidance of work-up bias (the standard for which compliance was also the highest in the review by Reid et al.6) and presentation of indeterminate results. Overall, only 25% of studies complied with four or more of the standards, a proportion that is comparable to that found by Reid et al.6 for recently published studies.
We acknowledge that our findings from 20 selected evaluations may not be representative of the ophthalmic literature in general. We selected our sample to include a high proportion of evaluations of glaucoma tests, albeit tests of varying modality, because of the significance of glaucoma in ophthalmology, the need for early detection with diagnostic tests at an early asymptomatic stage, and the considerable research effort in evaluating screening or diagnostic methods. In addition, two of the evaluations we reviewed were from the Baltimore Eye Study,24 30 and these cannot be considered to be independent examples.
Despite the selected nature of our sample, we believe that our findings for compliance, although somewhat imprecise, are nevertheless likely to be reasonably representative of compliance within ophthalmology for four reasons. First, our findings are similar to those reported by Reid et al.6 for publications in high-ranking general medical journals. Second, we have recently conducted a systematic review for one of the standards that found a similarly limited compliance.35 Third, all reports we reviewed were published in peer-reviewed journals, with more than half of them published in high-ranking ophthalmic or general medical journals. Finally, selection of studies was made before assessment of compliance.
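The imprecision referred to here follows directly from the sample of 20 evaluations. The sketch below computes exact binomial (Clopper-Pearson) 95% CIs using only the standard library. It assumes that the intervals quoted in this article were obtained by an exact method of this kind, which the text does not state, and the function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, iterations=60):
    """Find p in (0, 1) with f(p) = target, for f decreasing in p, by bisection."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if f(mid) > target:
            low = mid              # f is still too large, so the root lies to the right
        else:
            high = mid
    return (low + high) / 2


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson CI for x successes in n trials."""
    lower = 0.0 if x == 0 else solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


# Lowest and highest compliance found: 2 of 20 and 14 of 20 evaluations.
for compliant in (2, 14):
    low, high = exact_ci(compliant, 20)
    print(f"{compliant}/20: {low:.1%} to {high:.1%}")
```

Rounded to whole percentages, these intervals agree with the 1%–32% and 46%–88% reported for the lowest and highest levels of compliance, and each spans 30 percentage points or more.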
The limited agreement between the reviewers when assessing standards 2, 4, and 6 may cast doubt on the importance of these particular standards. The particular problem with standard 4 related to interpretation of review bias in instances in which the diagnostic test or validating criterion was automated. The disagreements would not have occurred with a strict interpretation of the standards, and we suggest that this standard should be clarified (see discussion later). Disagreement on standards 2 and 6 arose primarily from a lack of clarity in some articles or the failure by reviewers to identify relevant information, rather than from ambiguity of the standards themselves. These problems would be unlikely to arise if the reporting of compliance with the standards were mandatory, as for RCTs.32
Applicability of the Methodological Standards in Ophthalmology
The level of compliance that we found would usually cast doubt on
the relevance of the reported findings to clinical practice. However,
it is important to consider the applicability of the standards within
ophthalmology, because some of the standards may be less important for
some of the tests included in the review.
First, it might be argued that the standard for test reproducibility is less important for automated tests when the classification of the test result does not depend on expert judgment. All sources of measurement error are reflected in the estimates of sensitivity and specificity. The crucial difference is that when variation in the expertise of observers is not a source of measurement error, the estimates are more likely to be generalizable across settings.
Second, it might be argued that applying the standard for avoidance of review bias is unnecessary if a test is entirely automated, on the assumption that a test result obtained automatically cannot be biased through interpretation by a clinician. However, if an automated diagnostic test is used before validation,22 review bias can still occur because the diagnostic test may influence interpretation of the gold standard; it is only when an automated diagnostic test is used after validation17 that review bias is likely to be avoided. Because of the possibility of an operator influencing an automated test in a subtle way (e.g., in the setting up of test parameters or in the interpretation of the result), we recommend that researchers maintain and report the independence of the test and gold standard procedures, even when one or the other appears to be completely independent of the operator. The standard on avoidance of review bias, as described by Reid et al.,6 does not discuss how automated tests should be judged. Consequently, it was perhaps not surprising that the main source of disagreement between reviewers was interpretation of this standard when a diagnostic test was automated or semiautomated.
Finally, it might be argued that if the main intended application of a screening or diagnostic test is to a specific population only (e.g., stereoacuity tests or videorefraction in infants), then it is not realistic to expect the reporting of indices of accuracy for further subgroups.
Additional Standards
In addition to the seven accepted standards described here, we
believe that there are three further principles that researchers should
adhere to: There should be a clear definition of the gold standard, the
gold standard should be independent of the test under evaluation, and
the population studied should be appropriate for the intended
application of the test.
Definition of the Gold Standard.
We believe that the gold standard should always be clearly defined, even though there may be some overlap between this requirement and standard 3 (i.e., work-up bias). This requirement is particularly important in situations in which it is impracticable or unethical to administer the gold standard to all patients. The overlap between definition of the gold standard and work-up bias is demonstrated in the study by Tielsch et al.30 The gold standard for this evaluation was described as a definitive ophthalmologic examination and, consequently, we scored this evaluation as failing to avoid work-up bias, because only those who failed at least one of the screening tests were referred for this examination. It appears that the researchers regarded the referral of all subjects for a definitive examination as impracticable, a not unreasonable decision. However, the implication of this decision for the gold standard was not spelled out, namely, that the gold standard definition of normality should have become "no disease found on definitive examination or passed all screening tests." We believe that such a statement clarifies the true gold standard definition and highlights the possibility of work-up bias in a way in which the original article did not.
Work-up bias may be unavoidable if the gold standard carries a health risk, making it unethical to administer the gold standard to all subjects (e.g., highly invasive tests). In such cases, the duty of the researchers is to make explicit the validating criterion for normality in the absence of the gold standard; this may be demonstration of normality on a battery of tests, or the continuing absence of disease (demonstrated by whatever means) over a prolonged period of follow-up.
Work-up bias is difficult to avoid in the context of the evaluation of screening tests in which the prior probability of disease is usually very low. A practicable evaluation must either make assumptions about the normality of those who pass the screening test, as discussed above, or select a population for the evaluation that contains a much higher proportion of diseased people than would be expected when screening (e.g., by choosing equal numbers of definitively normal people and people who have been newly referred for investigation). Selection of this kind almost inevitably results in work-up bias, because the reasons for referral are likely to be associated with the results of the screening test.36
Independence of the Gold Standard.
The gold standard should be independent of the diagnostic test under evaluation; that is, the test under evaluation should not be performed as part of the gold standard. This requirement should hold, even when the objective of the evaluation is to investigate the decrease in diagnostic accuracy when one or more elements of the gold standard are omitted. This problem is illustrated by an evaluation of the sensitivity and specificity of a 26-point screening program on the Henson field screener.21 The points tested for the screening program are also tested during the extended program, which was used as the gold standard criterion. The investigators simply calculated the diagnostic accuracy for the screening program by extracting the data from the extended test, rather than by performing the screening and extended tests on separate occasions. This procedure eliminates variability between screening and extended tests that would occur in practice. (In fact, a subsequent evaluation using an independent validating criterion has confirmed high sensitivity and specificity for this particular form of field screening.29) This criticism may also apply to the study by Bjerrum,14 who appears to have performed two of the tests under evaluation as part of the gold standard examination for diagnosis of dry eye.
It is often the case that the results of the test under evaluation and the gold standard are highly correlated, not because of the problem just described, but because the test and the gold standard are measuring similar underlying properties (e.g., aspects of visual function in glaucoma). It is not surprising, therefore, that a field test has higher diagnostic accuracy than IOP or cup-to-disc ratio when screening for glaucoma, when the gold standard includes a definitive perimetric examination.16 24 30 In these circumstances, the evaluation is not invalid, but it is important to be aware of the inherent tautology.
Appropriateness of the Study Population.
It is difficult to recommend including the third additional standard as a true standard because of the subjectivity of judging appropriateness. The appropriateness of the population included in the evaluation has previously been mentioned in relation to the standard on specification of spectrum composition. This point can be graphically illustrated by comparing two datasets on the diagnostic accuracy of IOP, shown as receiver operating characteristic curves in Figure 2.16 30 These curves suggest a considerable difference in diagnostic accuracy, with curve B indicating that IOP is a much better test.
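To indicate how the composition of the study population alone can produce such a difference, the sketch below builds receiver operating characteristic points for the same test in two hypothetical diseased groups. All IOP values are invented for illustration and are unrelated to the datasets in Figure 2. The sketch assumes that an IOP above the cutoff counts as a positive result.

```python
def roc_points(diseased, healthy, cutoffs):
    """Return (cutoff, sensitivity, specificity) for each IOP cutoff in mm Hg."""
    points = []
    for cutoff in cutoffs:
        sens = sum(value > cutoff for value in diseased) / len(diseased)
        spec = sum(value <= cutoff for value in healthy) / len(healthy)
        points.append((cutoff, sens, spec))
    return points


def auc(diseased, healthy):
    """Area under the ROC curve: the probability that a diseased eye has the
    higher IOP in a random diseased/healthy pair (ties count as one half)."""
    wins = sum((d > h) + 0.5 * (d == h) for d in diseased for h in healthy)
    return wins / (len(diseased) * len(healthy))


# Hypothetical IOP values (mm Hg).
healthy = [12, 14, 15, 16, 17, 18, 19, 21]
early_glaucoma = [15, 17, 19, 20, 22, 24]      # population with mostly early disease
advanced_glaucoma = [20, 24, 26, 28, 30, 34]   # population with mostly advanced disease

for label, group in (("early", early_glaucoma), ("advanced", advanced_glaucoma)):
    print(f"{label}: AUC = {auc(group, healthy):.2f}")
    for cutoff, sens, spec in roc_points(group, healthy, cutoffs=(18, 21, 24)):
        print(f"  cutoff > {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The same test applied to the same controls appears far more accurate when the diseased group is dominated by advanced cases, so a curve cannot be interpreted without a description of the population from which it was derived.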
In considering appropriateness, it is also important to highlight the selective nature of populations in some evaluations. Evaluations of the diagnostic accuracy of glaucoma tests often use the results from the Humphrey Visual Field Analyzer (San Leandro, CA) as the gold standard. Sometimes subjects are selected to have prior experience of automated perimetry18 23 or subjects with unreliable test results are excluded.29 Selecting subjects in this way is likely to result in inflated estimates of diagnostic accuracy and gross underestimation of indeterminate results. The prevalence of unreliable subjects is not insignificant and has been estimated to be as high as 45% in glaucomatous subjects and 30% in control subjects.40
A study to evaluate an artificial neural network for the automatic detection of diabetic retinopathy from fundus images provides another example of the selective nature of a study population.17 The sample used to test the system comprised 200 diabetic fundus images and 101 normal fundus images. The researchers concluded that the system could be used as an aid to the screening of diabetic patients for retinopathy. However, given that the normal fundus images do not appear to have included nondiabetic lesions (e.g., age-related maculopathy), the specificity of the system in a screening setting is likely to be worse than reported. It may be appropriate to use selected populations for the preliminary evaluation of a system, but any conclusion about the wider application requires an evaluation on a representative population.
In conclusion, this article has highlighted the importance of complying with methodological standards when evaluating ophthalmic diagnostic tests. Our findings emphasize the need for researchers to comply with standards, so that published estimates of diagnostic accuracy are relevant to clinical practice, and for practitioners to appraise critically evaluations of diagnostic tests against the standards, to avoid being misled by biased (and sometimes overoptimistic) results.
Improved diagnostic accuracy is only one of many steps toward effective treatment, and the use of rigorously evaluated tests cannot guarantee better patient outcomes.5 However, patient care can be expected to improve if ineffective diagnostic tests are avoided, because the widespread use of tests with limited accuracy can have serious health and financial consequences. Ideally, diagnostic tests that show promising accuracy should be subjected to RCTs to determine whether the test results in improved health outcomes.41
| Footnotes |
|---|
Submitted for publication September 15, 1998; revised February 5, 1999; accepted March 10, 1999.
Proprietary interest category: N.