IOVS
(Investigative Ophthalmology and Visual Science. 1999;40:1650-1657.)
© 1999 by The Association for Research in Vision and Ophthalmology, Inc.

Compliance with Methodological Standards When Evaluating Ophthalmic Diagnostic Tests

Robert Harper1 and Barnaby Reeves2

1 From the Department of Ophthalmology, Manchester Royal Eye Hospital, United Kingdom, and 2 Health Services Research Unit, London School of Hygiene and Tropical Medicine, United Kingdom.


    Abstract
 
PURPOSE. To draw attention to the importance of methodological standards when carrying out evaluations of ophthalmic diagnostic tests by reviewing the extent of compliance with these standards in reports of evaluations published within the ophthalmic literature.

METHODS. Twenty published evaluations of ophthalmic screening/diagnostic tests or technologies were independently assessed by two reviewers for compliance with the following methodological standards: specification of the spectrum composition for populations used in the evaluation, analysis of pertinent subgroups, avoidance of work-up (verification) bias, avoidance of review bias, presentation of precision of results for test accuracy, presentation of indeterminate test results, and presentation of test reproducibility.

RESULTS. Compliance ranged from just 10% (95% CI, 1%–32%) for presentation of test reproducibility data and avoidance of review bias to 70% (95% CI, 46%–88%) for avoidance of work-up bias and presentation of indeterminate test results. Only 5 of the 20 evaluations complied with four or more of the methodological standards and none with more than five of the standards.

CONCLUSIONS. The evaluations of ophthalmic diagnostic tests discussed in this article show limited compliance with accepted methodological standards but are no worse than previously described for evaluations published in general medical journals. Adherence to these standards by researchers can improve the study design and reporting of evaluations of new diagnostic techniques. Limited compliance, combined with a lack of awareness of the standards among users of research evidence, may lead to the inappropriate adoption of new diagnostic technologies, with a consequent waste of health care resources.


    Introduction
 
Ophthalmic diagnostic tests help the clinician to make a diagnosis, assess the severity and prognosis of disease, and choose appropriate treatments. Both new and established technologies have been advocated to investigate patients in ophthalmic practice. Before using these technologies to guide clinical decisions, however, clinicians must know how "good" the tests are. Unfortunately, diagnostic tests are frequently not evaluated rigorously before they are made available to clinicians, and once implemented, their performance is sometimes disappointing. For example, the diagnostic value of clinical contrast sensitivity tests in ophthalmic practice has been limited,1 despite considerable enthusiasm when commercial tests were first made available.2 3 4

The performance of a diagnostic test is often referred to as "diagnostic accuracy" (i.e., the extent to which the result of a particular test correctly classifies patients into predefined disease categories). Diagnostic accuracy is usually characterized by the sensitivity and specificity of a test. Although likelihood ratios are considered to be the key indices in making clinical decisions about patients, by providing an explicit tool for revision of diagnostic probabilities according to the test outcomes,5 it is the sensitivities and specificities that are most commonly presented when reporting the diagnostic accuracy of tests.
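
As an illustration of how these indices relate, the following minimal sketch computes sensitivity, specificity, and likelihood ratios from a 2 × 2 table and then uses the positive likelihood ratio to revise a pretest probability. The counts and the 10% pretest probability are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from any of the evaluations reviewed here, and the function names are illustrative.

```python
def accuracy_indices(tp, fn, fp, tn):
    """Return sensitivity, specificity, LR+ and LR- from 2 x 2 counts."""
    sensitivity = tp / (tp + fn)          # diseased subjects who test positive
    specificity = tn / (tn + fp)          # normal subjects who test negative
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return sensitivity, specificity, lr_positive, lr_negative


def post_test_probability(pretest_probability, likelihood_ratio):
    """Revise a probability of disease: convert to odds, multiply, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical counts: 50 diseased and 200 normal subjects.
sens, spec, lr_pos, lr_neg = accuracy_indices(tp=40, fn=10, fp=20, tn=180)

# Hypothetical pretest probability of 10%, revised after a positive result.
post = post_test_probability(0.10, lr_pos)

print(f"sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
print(f"post-test probability after a positive result {post:.2f}")
```

With these assumed counts, a positive result raises the probability of disease from 10% to a little under 50%, which shows why a likelihood ratio is more directly usable at the bedside than sensitivity and specificity alone.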

Evaluations of diagnostic tests should comply with accepted standards to provide clinically relevant estimates of diagnostic accuracy. However, there is limited compliance with such standards within the general medical literature.6 The purposes of this article are first, to draw attention to the need for researchers to comply with the standards when performing evaluations of diagnostic accuracy and the need for practitioners to appraise published reports of the diagnostic accuracy against the standards before implementing tests in clinical practice, and second, to review published reports of the diagnostic accuracy of ophthalmic tests, both to illustrate the importance of the standards and to estimate the extent of compliance with the standards. Within the context of this article, screening tests will also be included, because evaluations of their performance should comply with the same criteria as those used for the evaluation of diagnostic tests.


    Methods
 
Selection of Evaluation Studies
Studies were eligible for review if indices of diagnostic accuracy were reported, the test under evaluation was intended for clinical application, and the findings were reported in peer-reviewed ophthalmic or general medical journals.

Twenty published evaluations of both established and new ophthalmic tests were selected. Eleven evaluations of diagnostic or screening tests for glaucoma were chosen, spanning structural (clinical examination and imaging of the optic nerve head), physiological (intraocular pressure [IOP] and pattern electroretinogram) and psychophysical (perimetry, contrast sensitivity) tests. To illustrate wider applicability of the standards to a range of ophthalmic conditions, a MEDLINE database search was performed using a recommended strategy.5 A further nine studies were selected from this process, including imaging and photographic, visual function, and laboratory tests. The studies reviewed are drawn predominantly from recent publications within the literature (12 were published between 1995 and 1997, 5 between 1990 and 1994, and 3 before 1990). In each case, selection of the studies was made before assessment with the standards.

Assessment of Compliance with Standards
All articles were independently assessed by two reviewers for compliance with seven widely accepted methodological standards.6 7 8 9 10 11 Reviewers used the definitions described by Reid et al.6 These standards are summarized in Table 1 . Overall agreement between the reviewers was 80% (range, 55%–100%; see Table 2 ). All instances of disagreement (28 of 140) were resolved by discussion. The majority occurred for three standards (21 disagreements for standards 2, 4, and 6); these disagreements are considered further in the discussion.


Table 1. Methodological Standards for the Evaluation of Diagnostic Tests

 

Table 2. Agreement between Reviewers for Each of the Seven Standards, Expressed in Terms of Percentage of Agreement and the Kappa Statistic

 
In addition to overall percent agreement, kappa values were calculated to measure the agreement between the reviewers after taking account of agreement expected by chance; kappa values ranged from -0.24 to 1.0 for each criterion (Table 2). Kappa values appear inconsistent with percent agreement in some cases because of the uniformity of the responses across evaluations for some criteria (e.g., standards 4 and 7). When responses are relatively uniform across the examples rated (e.g., standard 7, in which 18 of the 20 evaluations failed to meet the criterion), it is very difficult to obtain a high kappa score because "expected" agreement (calculated from the marginal totals) is also high. Standard 5, to which responses were also relatively uniform, achieved a kappa of 1.00 only because there was 100% agreement.
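
The effect of uniform responses on kappa can be seen in a small worked example. The sketch below computes Cohen's kappa for two reviewers rating 20 evaluations as complying or not complying with a standard. The cell counts are hypothetical (the counts behind Table 2 are not reproduced here); they were chosen so that both reviewers judge only 2 of 20 evaluations to comply.

```python
def cohen_kappa(both_yes, r1_only, r2_only, both_no):
    """Return (observed agreement, expected agreement, kappa) for a 2 x 2 table.

    Undefined (division by zero) when expected agreement equals 1, that is,
    when both reviewers place every evaluation in the same category.
    """
    n = both_yes + r1_only + r2_only + both_no
    observed = (both_yes + both_no) / n
    r1_yes = (both_yes + r1_only) / n      # marginal proportion, reviewer 1
    r2_yes = (both_yes + r2_only) / n      # marginal proportion, reviewer 2
    expected = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# Hypothetical counts: the reviewers agree on 18 of 20 evaluations,
# 17 of which are joint "does not comply" judgments.
observed, expected, kappa = cohen_kappa(both_yes=1, r1_only=1, r2_only=1, both_no=17)

print(f"observed agreement {observed:.2f}")
print(f"expected agreement {expected:.2f}")
print(f"kappa {kappa:.2f}")
```

Under these assumed counts the reviewers agree 90% of the time, yet chance agreement calculated from the marginal totals is already 82%, so kappa is only about 0.44.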


    Results
 
Table 3 summarizes the extent of compliance of the 20 studies reviewed12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 with the seven standards, after disagreements between reviewers had been resolved.


Table 3. Compliance of Evaluation Studies with the Seven Methodological Standards (Table 1)

 
Standard 1: Specification of Spectrum Composition
Twelve of the 20 studies (60%, 95% CI, 36%–81%) complied with this standard. Characterizing the study population is important, because the sensitivity and specificity of a test can be markedly influenced by the demographic and clinical composition of the population studied (e.g., age, sex, ethnicity, disease severity, comorbidity). Although this standard merely requires that the spectrum composition is specified, this standard is important, because it allows the reader to judge whether the estimates of diagnostic accuracy reported by the evaluation can be applied to the population in which the reader wants to use the test (see the Discussion section).

Standard 2: Analysis of Pertinent Subgroups
This standard was met by 11 of the 20 evaluations (55%; 95% CI, 32%–77%). The second standard is closely linked with the first, because it concerns the way in which the sensitivity and specificity of a test can vary for different subgroups in a population. If the population studied includes people with wide-ranging characteristics (e.g., all ages), overall estimates of sensitivity and specificity may disguise considerable variations in performance in different subgroups.12 Thus, even if overall performance is disappointing, a test may perform well in a subgroup; alternatively, overall performance may appear to be good, but may be unacceptably poor in a minority of subjects. It is important to point out, however, that this standard is intended to encourage the reporting of diagnostic accuracy in clinically relevant subgroups; it is not appropriate to search for "good" performance in a subgroup without a priori justification.

An early evaluation of oculokinetic perimetry estimated sensitivity to be more than 80% for the detection of glaucomatous visual field defects.15 Subsequent evaluations of this form of perimetry, which included patients with a range of visual field loss, indicated much lower sensitivity for early visual field loss.20 29

Standard 3: Avoidance of Work-up, or Verification, Bias
Fourteen of the 20 evaluations (70%, 95% CI, 46%–88%) complied with this standard. This bias is introduced when subjects with positive or negative diagnostic test results are selectively referred to receive verification by the validating criterion (i.e., the gold standard), or where the groups of "diseased" and "normal" subjects (based on the gold standard) have been selected according to some clinical factor relating to the disease.

A population-based study of glaucoma screening24 30 provides an example of work-up bias. More than 5000 subjects were screened using a test battery, and diagnostic accuracies for optic disc assessment, IOP, and field screening were reported. However, only those who "failed" one or more of the tests under evaluation were referred for a "definitive ophthalmologic examination." Referred subjects were classified as having or not having glaucoma by this examination (i.e., the gold standard), but the diagnosis of glaucoma could be made only in those subjects who were referred. Thus, there are likely to have been a small number of truly glaucomatous patients who "passed" all screening tests and whose condition was not detected because they were not referred for the gold standard. If all patients had had the definitive examination, the results in these patients would have been classified as false negatives rather than true negatives, suggesting that the reported sensitivity estimate may be biased upward, and the specificity estimate downward (see the Discussion section).

Standard 4: Avoidance of Review, or Expectation, Bias
Only 2 of the 20 evaluations (10%; 95% CI, 1%–32%) complied with this standard. Bias can be introduced if the results of the test under evaluation are interpreted with a knowledge of the results of the gold standard (or vice versa).

Studies by Xu et al.31 and Bjerrum14 reported the sensitivity and specificity of diagnostic tests for dry eye (keratoconjunctivitis sicca) in patients with primary Sjögren’s syndrome and other connective tissue diseases. However, it is not clear whether the clinician evaluating the tests was masked with respect to the validation status of the subjects.

Standard 5: Precision of Results for Test Accuracy
Only 3 of the 20 evaluations (15%; 95% CI, 3%–38%) complied with this standard. If sensitivity and specificity estimates are reported without a measure of precision, clinicians cannot know the range within which the true values of sensitivity and specificity may lie. For example, the sensitivity estimate of 73% for a laboratory test for ocular sarcoidosis26 based on only 22 patients has a 95% CI ranging from 54% to 92%. In contrast, the specificity estimate of 83% has better precision (95% CI, 74%–92%), a reflection of both the higher point estimate and the larger sample size used by the researchers for their nonsarcoid group (n = 70). (Note: The formula for the SE of a proportion, $\sqrt{pq/n}$, is based on a normal approximation to the binomial distribution and can be used to calculate 95% CIs for sensitivity and specificity: $p \pm 1.96\sqrt{pq/n}$, where p represents either sensitivity or specificity, q = 1 - p, and n is the sample size for either sensitivity or specificity. When np or nq is less than 5, the validity of the approximation becomes doubtful, and exact methods should be used to calculate the 95% CI [see Fig. 1].)
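
The two intervals quoted above can be checked with a few lines of code. The following minimal sketch applies the normal-approximation formula from the note to the reported point estimates and sample sizes; the function name is illustrative, and the point estimates are used as reported (73% and 83%) rather than recomputed from raw counts.

```python
from math import sqrt


def normal_approx_ci(p, n, z=1.96):
    """95% CI for a proportion p from n subjects: p +/- z * sqrt(p * q / n)."""
    q = 1 - p
    se = sqrt(p * q / n)
    return p - z * se, p + z * se


# Sensitivity of 73% from 22 patients with sarcoidosis.
sens_low, sens_high = normal_approx_ci(0.73, 22)

# Specificity of 83% from 70 nonsarcoid subjects.
spec_low, spec_high = normal_approx_ci(0.83, 70)

print(f"sensitivity 95% CI: {sens_low:.0%} to {sens_high:.0%}")
print(f"specificity 95% CI: {spec_low:.0%} to {spec_high:.0%}")
```

For the sensitivity estimate, nq is 22 × 0.27, or roughly 6, which is only just above the threshold of 5 mentioned in the note, so the approximation is close to the limit of its validity.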



 
Figure 1. Illustration of the breadth of exact binomial 95% CIs as a function of the sample estimate of the proportion of interest and sample size. From outside to center, the pairs of lines represent sample sizes of 20, 40, 60, 100, 200, and 500. Note the 95% CI is at its widest for a proportion equal to 0.5 and narrows as the proportion tends to 0 or 1. To use the graph, read off the upper and lower 95% CIs and simply add and subtract the sample estimate; for example, a sample estimate of 0.5 (i.e., a sensitivity or specificity of 50%), based on a sample size of 100, has a 95% CI that ranges from 0.4 to 0.6.
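
For readers without access to the graph, exact binomial limits can also be computed directly. The sketch below is one possible implementation, assuming that the exact method intended is the Clopper-Pearson interval (the text does not name the method); it finds each limit by bisection on the binomial cumulative distribution and uses only the Python standard library. Function names are illustrative.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes in n trials."""

    def solve(cdf_at, target):
        # cdf_at(p) decreases as p increases; bisect for cdf_at(p) == target.
        low, high = 0.0, 1.0
        for _ in range(100):
            mid = (low + high) / 2
            if cdf_at(mid) > target:
                low = mid
            else:
                high = mid
        return (low + high) / 2

    # Lower limit: P(X >= k) = alpha / 2, i.e. P(X <= k - 1) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binomial_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= k) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binomial_cdf(k, n, p), alpha / 2)
    return lower, upper


# 2 of 20 and 14 of 20 evaluations complying with a standard.
low_2, high_2 = exact_binomial_ci(2, 20)
low_14, high_14 = exact_binomial_ci(14, 20)

print(f"2/20:  {low_2:.0%} to {high_2:.0%}")
print(f"14/20: {low_14:.0%} to {high_14:.0%}")
```

The same function applies to a sensitivity or specificity estimate by passing the number of correctly classified subjects as k and the number of diseased (or normal) subjects as n.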

 
There is widespread use of statistical CIs in the ophthalmic literature, and journals usually require CIs to be specified for descriptive estimates and analytic comparisons.32 However, the same journals seem less vigilant for evaluations of diagnostic accuracy.

Standard 6: Presentation of Indeterminate Test Results
Fourteen of the 20 evaluations (70%; 95% CI, 46%–88%) complied with this standard. For a variety of reasons, tests occasionally yield indeterminate results. For example, patient cooperation may be limited or the presence of media opacities may obscure fundus observation. Knowledge of the percentage of indeterminate results is important in deciding how applicable a test may be to a population of interest. The way in which indeterminate results are classified (i.e., as positive, negative, or by excluding them altogether33 ) affects the estimate of diagnostic accuracy.

A study to determine the effectiveness of videorefraction in screening for significant refractive errors in infants reported a sensitivity of 84% and a specificity of 91% for hyperopia of more than 4.00 D when cycloplegic videorefraction was used.22 However, limited cooperation, failure to obtain adequate blur circles and difficulty in obtaining adequate cycloplegia restricted the final sample size. The prevalence of these indeterminate results was not reported. Because this test is intended to be used for screening, the "untestable" children should arguably have been included and the test results regarded as positive (i.e., requiring further investigation), thereby decreasing the specificity of the test.
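
The direction of this effect can be shown with a small worked example. Because the number of untestable children was not reported, every count in the sketch below is hypothetical: the testable counts were chosen only so that they give a sensitivity of 84% and a specificity of 91%, and the untestable counts are invented for illustration.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from 2 x 2 counts."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for children in whom a test result was obtained.
tp, fn, fp, tn = 42, 8, 18, 182

# Hypothetical untestable children, split by true refractive status.
untestable_hyperopic = 5
untestable_normal = 25

# Analysis 1: untestable children excluded altogether.
sens_excluded, spec_excluded = sensitivity_specificity(tp, fn, fp, tn)

# Analysis 2: untestable children counted as test positive (referred on).
sens_positive, spec_positive = sensitivity_specificity(
    tp + untestable_hyperopic, fn, fp + untestable_normal, tn
)

print(f"excluded:          sensitivity {sens_excluded:.0%}, specificity {spec_excluded:.0%}")
print(f"counted positive:  sensitivity {sens_positive:.0%}, specificity {spec_positive:.0%}")
```

Under these assumed counts, treating untestable children as positive lowers specificity from 91% to about 81%, because every untestable child without hyperopia becomes a false positive.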

Standard 7: Test Reproducibility
Only 2 of the 20 evaluations (10%; 95% CI, 1%–32%) complied with this standard. Limited reproducibility is inevitably reflected in the sensitivity and specificity estimates, and so these estimates provide valid measures of performance that take into account the degree of reproducibility. However, reporting test reproducibility is important to allow the reader to appraise whether the same level of reproducibility can be obtained in the study setting, particularly when a test result is based on expert judgment. For example, experts or proponents of a new test might be expected to be able to apply a new grading system with better reproducibility than nonexperts. Therefore, evidence about test reproducibility should clarify whether the test results are reproducible in "average" or "expert" hands.


    Discussion
 
The seven quality standards described for evaluations of diagnostic tests have a role comparable to that of more familiar quality standards for randomized controlled trials (RCTs),34 which have led to an improvement in the design of RCTs and better critical appraisal by readers of published reports.

A review of diagnostic evaluations published in four prestigious general medical journals between 1978 and 19936 found widespread noncompliance with these standards, with compliance exceeding 50% for only one of the seven standards. For the present review compliance ranged from 10% for presentation of test reproducibility data and avoidance of review bias to 70% for avoidance of work-up bias (the standard for which compliance was also the highest in the review by Reid et al.6 ) and presentation of indeterminate results. Overall, only 25% of studies complied with four or more of the standards, a proportion that is comparable to that found by Reid et al.6 for recently published studies.

We acknowledge that our findings from 20 selected evaluations may not be representative of the ophthalmic literature in general. We selected our sample to include a high proportion of evaluations of glaucoma tests, albeit tests of varying modality, because of the significance of glaucoma in ophthalmology, the need for early detection with diagnostic tests at an early asymptomatic stage, and the considerable research effort in evaluating screening or diagnostic methods. In addition, two of the evaluations we reviewed were from the Baltimore Eye Study,24 30 and these cannot be considered to be independent examples.

Despite the selected nature of our sample, we believe that our findings for compliance, although somewhat imprecise, are nevertheless likely to be reasonably representative of compliance within ophthalmology for four reasons. First, our findings are similar to those reported by Reid et al.6 for publications in high-ranking general medical journals. Second, we have recently conducted a systematic review for one of the standards that found a similarly limited compliance.35 Third, all reports we reviewed were published in peer-reviewed journals, with more than half of them published in high-ranking ophthalmic or general medical journals. Finally, selection of studies was made before assessment of compliance.

The limited agreement between the reviewers when assessing standards 2, 4, and 6 may cast doubt on the importance of these particular standards. The particular problem with standard 4 related to interpretation of review bias in instances in which the diagnostic test or validating criterion was automated. The disagreements would not have occurred with a strict interpretation of the standards, and we suggest that this standard should be clarified (see discussion later). Disagreement on standards 2 and 6 arose primarily from a lack of clarity in some articles or the failure by reviewers to identify relevant information, rather than from ambiguity of the standards themselves. These problems would be unlikely to arise if the reporting of compliance with the standards were mandatory, as for RCTs.32

Applicability of the Methodological Standards in Ophthalmology
The level of compliance that we found would usually cast doubt on the relevance of the reported findings to clinical practice. However, it is important to consider the applicability of the standards within ophthalmology, because some of the standards may be less important for some of the tests included in the review.

First, it might be argued that the standard for test reproducibility is less important for automated tests when the classification of the test result does not depend on expert judgment. All sources of measurement error are reflected in the estimates of sensitivity and specificity. The crucial difference is that when variation in the expertise of observers is not a source of measurement error, the estimates are more likely to be generalizable across settings.

Second, it might be argued that applying the standard for avoidance of review bias is unnecessary if a test is entirely automated, on the assumption that a test result obtained automatically cannot be biased through interpretation by a clinician. However, if an automated diagnostic test is used before validation,22 review bias can still occur because the diagnostic test may influence interpretation of the gold standard; it is only when an automated diagnostic test is used after validation17 that review bias is likely to be avoided. Because of the possibility of an operator influencing an automated test in a subtle way (e.g., in the setting up of test parameters or in the interpretation of the result), we recommend that researchers maintain and report the independence of the test and gold standard procedures, even when one or the other appears to be completely independent of the operator. The standard on avoidance of review bias, as described by Reid et al.,6 does not discuss how automated tests should be judged. Consequently, it was perhaps not surprising that the main source of disagreement between reviewers was interpretation of this standard when a diagnostic test was automated or semiautomated.

Finally, it might be argued that if the main intended application of a screening or diagnostic test is to a specific population only (e.g., stereoacuity tests or video refraction in infants), then it is not realistic to expect the reporting of indices of accuracy for further subgroups.

Additional Standards
In addition to the seven accepted standards described here, we believe that there are three further principles that researchers should adhere to: There should be a clear definition of the gold standard, the gold standard should be independent of the test under evaluation, and the population studied should be appropriate for the intended application of the test.

Definition of the Gold Standard.
We believe that the gold-standard should always be clearly defined, even though there may be some overlap between this requirement and standard 3 (i.e., work-up bias). This requirement is particularly important in situations in which it is impracticable or unethical to administer the gold standard to all patients. The overlap between definition of the gold standard and work-up bias is demonstrated in the study by Tielsch et al.30 The gold standard for this evaluation was described as a definitive ophthalmologic examination and, consequently, we scored this evaluation as failing to avoid work-up bias, because only those who failed at least one of the screening tests were referred for this examination. It appears that the researchers regarded the referral of all subjects for a definitive examination as impracticable, a not unreasonable decision. However, the implication of this decision for the gold standard was not spelled out—namely, that the gold standard definition of normality should have become "no disease found on definitive examination or passed all screening tests." We believe that such a statement clarifies the true gold-standard definition, and highlights the possibility of work-up bias in a way in which the original article did not.

Work-up bias may be unavoidable if the gold standard carries a health risk, making it unethical to administer the gold standard to all subjects (e.g., highly invasive tests). In such cases, the duty of the researchers is to make explicit the validating criterion for normality in the absence of the gold standard; this may be demonstration of normality on a battery of tests, or the continuing absence of disease (demonstrated by whatever means) over a prolonged period of follow-up.

Work-up bias is difficult to avoid in the context of the evaluation of screening tests in which the prior probability of disease is usually very low. A practicable evaluation must either make assumptions about the normality of those who pass the screening test, as discussed above, or select a population for the evaluation that contains a much higher proportion of diseased people than would be expected when screening (e.g., by choosing equal numbers of definitively normal people and people who have been newly referred for investigation). Selection of this kind almost inevitably results in work-up bias, because the reasons for referral are likely to be associated with the results of the screening test.36

Independence of the Gold Standard.
The gold standard should be independent of the diagnostic test under evaluation—that is, the test under evaluation should not be performed as part of the gold standard. This requirement should hold, even when the objective of the evaluation is to investigate the decrease in diagnostic accuracy when one or more elements of the gold standard are omitted. This problem is illustrated by an evaluation of the sensitivity and specificity of a 26-point screening program on the Henson field screener.21 The points tested for the screening program are also tested during the extended program, which was used as the gold standard criterion. The investigators simply calculated the diagnostic accuracy for the screening program by extracting the data from the extended test, rather than by performing the screening and extended tests on separate occasions. This procedure eliminates variability between screening and extended tests that would occur in practice. (In fact, a subsequent evaluation using an independent validating criterion has confirmed high sensitivity and specificity for this particular form of field screening.29 ) This criticism may also apply to the study by Bjerrum,14 who appears to have performed two of the tests under evaluation as part of the gold-standard examination for diagnosis of dry eye.

It is often the case that the results of the test under evaluation and the gold standard are highly correlated, not because of the problem just described, but because the test and the gold standard are measuring similar underlying properties (e.g., aspects of visual function in glaucoma). It is not surprising, therefore, that a field test has higher diagnostic accuracy than IOP or cup-to-disc ratio when screening for glaucoma, when the gold standard includes a definitive perimetric examination.16 24 30 In these circumstances, the evaluation is not invalid, but it is important to be aware of the inherent tautology.

Appropriateness of the Study Population.
It is difficult to recommend including the third additional standard as a true standard because of the subjectivity of judging appropriateness. The appropriateness of the population included in the evaluation has previously been mentioned in relation to the standard on specification of spectrum composition. This point can be graphically illustrated by comparing two datasets on the diagnostic accuracy of IOP, shown as receiver operating characteristic curves in Figure 2 .16 30 These curves suggest a considerable difference in diagnostic accuracy, with curve B indicating that IOP is a much better test.



 
Figure 2. Receiver operating characteristic curves for tonometry, drawn from the data of Daubs and Crick16 (curve A, open circles) and Tielsch et al.30 (curve B, closed circles). The data points represent the sensitivity/specificity at different levels of IOP (in millimeters of mercury).

 
Closer inspection of the two curves suggests that they differ primarily with respect to the specificity estimates because both curves have comparable sensitivity estimates (e.g., <50% for an IOP >22 mm Hg). This finding is consistent with other epidemiologic studies of glaucoma.37 38 However, the specificity estimates at this level of IOP vary from approximately 75% in curve A to more than 90% in curve B. Because specificity estimates are derived from the normal subjects in the sample under evaluation, this discrepancy must be attributable to differences between the studies in the populations of normal subjects. Daubs and Crick16 evaluated glaucoma diagnostic tests in a hospital population. The normal subjects did not have field loss but had been referred to King’s College Hospital as "suspects" (Crick, personal communication). Raised IOP is a common indicator for referral, so these "suspects" are likely to have included a higher proportion of people with raised IOP, and the specificity findings are correspondingly lower than expected. In the population-based study of Tielsch et al.,30 the specificity of more than 90% for an IOP cutoff criterion of more than 22 mm Hg is more consistent with the known prevalence of ocular hypertension.39 Consequently, their data are more representative of the performance of tonometry for screening.

In considering appropriateness, it is also important to highlight the selective nature of populations in some evaluations. Evaluations of the diagnostic accuracy of glaucoma tests often use the results from the Humphrey Visual Field Analyzer (San Leandro, CA) as the gold standard. Sometimes subjects are selected to have prior experience of automated perimetry18 23 or subjects with unreliable test results are excluded.29 Selecting subjects in this way is likely to result in inflated estimates of diagnostic accuracy and gross underestimation of indeterminate results. The prevalence of unreliable subjects is not insignificant and has been estimated to be as high as 45% in glaucomatous subjects and 30% in control subjects.40

A study to evaluate an artificial neural network for the automatic detection of diabetic retinopathy from fundus images provides another example of the selective nature of a study population.17 The sample used to test the system comprised 200 diabetic fundus images and 101 normal fundus images. The researchers concluded that the system could be used as an aid to the screening of diabetic patients for retinopathy. However, given that the normal fundus images do not appear to have included nondiabetic lesions (e.g., age-related maculopathy), the specificity of the system in a screening setting is likely to be worse than reported. It may be appropriate to use selected populations for the preliminary evaluation of a system, but any conclusion about the wider application requires an evaluation on a representative population.

In conclusion, this article has highlighted the importance of complying with methodological standards when evaluating ophthalmic diagnostic tests. Our findings emphasize the need for researchers to comply with these standards, so that published estimates of diagnostic accuracy are relevant to clinical practice, and for practitioners to critically appraise evaluations of diagnostic tests against the standards, to avoid being misled by biased (and sometimes overoptimistic) results.

Improved diagnostic accuracy is only one of many steps toward effective treatment, and the use of rigorously evaluated tests cannot guarantee better patient outcomes.5 However, patient care can be expected to improve if ineffective diagnostic tests are avoided, because the widespread use of tests with limited accuracy can have serious health and financial consequences. Ideally, diagnostic tests that show promising accuracy should be subjected to RCTs to determine whether the test results in improved health outcomes.41


    Footnotes
 
Reprint requests: Robert Harper, Department of Ophthalmology, Manchester Royal Eye Hospital, Oxford Road, Manchester, M13 9WH, UK.

Submitted for publication September 15, 1998; revised February 5, 1999; accepted March 10, 1999.

Proprietary interest category: N.


    References

  1. Moseley, MJ, Hill, AR (1994) Contrast sensitivity testing in clinical practice Br J Ophthalmol 78,795-797
  2. Arden, GB, Jacobson, JJ (1978) A simple grating test for contrast sensitivity: preliminary results indicate value for screening in glaucoma Invest Ophthalmol Vis Sci 17,23-32
  3. Ginsberg, AP (1984) A new contrast sensitivity vision test chart Am J Optom Physiol Opt 61,403-407
  4. Della Sala, S, Bertoni, G, Somazzi, L, Stubbe, F, Wilkins, AJ (1985) Impaired contrast sensitivity in diabetic patients with and without retinopathy: a new technique for rapid assessment Br J Ophthalmol 69,136-142
  5. Deeks JJ, Morris JM. Evaluating diagnostic tests. Obstet Gynaecol. In press.
  6. Reid, MC, Lachs, MS, Feinstein, AR (1995) Use of methodological standards in diagnostic test research: getting better but still not good JAMA 274,645-651
  7. Ransohoff, DF, Feinstein, AR (1978) Problems of spectrum and bias in evaluating the efficacy of diagnostic tests N Engl J Med 299,926-930
  8. Cooper, LS, Chalmers, TC, McCally, M, Berrier, J, Sacks, HS (1988) The poor quality of early evaluations of magnetic resonance imaging JAMA 259,3277-3280
  9. Arroll, B, Schechter, MT, Sheps, SB (1988) The assessment of diagnostic tests: a comparison of medical literature in 1982 and 1985 J Gen Intern Med 3,443-447
  10. Jaeschke, A, Guyatt, GH, Sackett, DL, for the Evidence-based Medicine Working Group (1994) Users’ guides to the medical literature, III: how to use an article about a diagnostic test, A: are the results of the study valid? JAMA 271,389-391
  11. Jaeschke, A, Guyatt, GH, Sackett, DL, for the Evidence-based Medicine Working Group (1994) Users’ guides to the medical literature, III: how to use an article about a diagnostic test, B: what are the results and will they help me in caring for patients? JAMA 271,703-707
  12. Ariyasu, RG, Lee, PP, Linton, KP, LaBree, LD, Azen, SP, Siu, AL (1996) Sensitivity, specificity, and predictive values of screening tests for eye conditions in a clinic-based population Ophthalmology 103,1751-1760
  13. Birch, E, Williams, C, Hunter, J, Lapa, MC (1991) Random dot stereoacuity in preschool children J Pediatr Ophthalmol Strabismus 34,217-222
  14. Bjerrum, KB (1996) Test and symptoms in keratoconjunctivitis sicca and their correlation Acta Ophthalmologica 74,436-441
  15. Damato, BE, Chyla, J, McClure, E, Jay, JL, Allan, JD (1990) A hand-held OKP chart for the screening of glaucoma: preliminary evaluation Eye 4,632-637
  16. Daubs, J, Crick, RP (1980) Epidemiological analysis of the King’s College Hospital glaucoma data Res Clin Forums 2,41-59
  17. Gardner, GG, Keating, D, Williamson, TH, Elliott, AT (1996) Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool Br J Ophthalmol 80,940-944
  18. Graham, SL, Drance, SM, Chauhan, BC, et al (1996) Comparison of psychophysical and electrophysiological testing in early glaucoma Invest Ophthalmol Vis Sci 37,2651-2662
  19. Harding, SP, Broadbent, DM, Neoh, C, White, MC, Vora, J. (1995) Sensitivity and specificity of photography and direct ophthalmoscopy in screening for sight threatening eye disease: the Liverpool diabetic eye study BMJ 311,1131-1135
  20. Harper, RA, Hill, AR, Reeves, BC (1994) Effectiveness of unsupervised oculokinetic perimetry for detecting glaucomatous visual field defects Ophthalmol Physiol Opt 14,199-202
  21. Henson DB, Bryson H. Clinical results with the Henson–Hamblin CFS 2000. In: Greve EL, Heijl A, eds. Seventh International Visual Field Symposium. Dordrecht: Dr. W. Junk 1987:233–238.
  22. Hodi, S. (1994) Screening of infants for significant refractive error using videorefraction Ophthalmol Physiol Opt 14,310-313
  23. Johnson, CA, Samuels, SJ (1997) Screening for glaucomatous visual field loss with frequency doubling perimetry Invest Ophthalmol Vis Sci 38,413-425
  24. Katz, J, Tielsch, JM, Quigley, HA, Javitt, J, Witt, K, Sommer, A. (1993) Automated suprathreshold screening for glaucoma: the Baltimore Eye Survey Invest Ophthalmol Vis Sci 34,3271-3277
  25. Mikelberg, FS, Parfitt, CM, Swindale, SL, Graham, SL, Drance, SM, Gosine, R. (1995) Ability of the Heidelberg Retina Tomograph to detect early glaucomatous visual field loss J Glaucoma 4,242-247
  26. Power, WJ, Neves, RA, Rodriguez, A, Pedroza–Seres, M, Foster, CS (1995) The value of combined serum angiotensin-converting enzyme and gallium scan in diagnosing ocular sarcoidosis Ophthalmology 102,2007-2011
  27. Smolek, MK, Klyce, SD (1997) Current keratoconus detection methods compared with a neural network approach Invest Ophthalmol Vis Sci 38,2290-2299
  28. Sommer, A, Enger, C, Witt, K (1987) Screening for glaucomatous visual field loss with automated threshold perimetry Am J Ophthalmol 103,681-684
  29. Sponsel, WE, Ritch, R, Stamper, R, et al (1995) Prevent blindness America visual field screening study Am J Ophthalmol 120,699-708
  30. Tielsch, JM, Katz, J, Singh, K, et al (1991) A population-based evaluation of glaucoma screening: The Baltimore Eye Survey Am J Epidemiol 134,1102-1110
  31. Xu, KP, Yagi, Y, Toda, I, Tsubota, K. (1995) Tear function index: a new measure of dry eye Arch Ophthalmol 113,84-88
  32. Begg, C, Cho, M, Eastwood, S, et al (1996) Improving the quality of reporting of randomized controlled trials. The CONSORT statement JAMA 276,637-639
  33. Simel, DL, Feussner, JR, Delong, ER, Matchar, DB (1987) Intermediate, indeterminate and uninterpretable diagnostic test results Med Decis Making 7,107-114
  34. Moher, D, Jadad, AR, Nichol, G, et al (1995) Assessing the quality of randomised controlled trials: an annotated bibliography of checklists Control Clin Trials 16,62-73
  35. Harper, R, Reeves, B. (1999) Reporting of precision for estimates of diagnostic accuracy: a review BMJ 318,1322-1323
  36. Harper, RA, Reeves, BC (1999) Glaucoma screening: the importance of combined test data Optom Vis Sci 318,1322-1323
  37. Leibowitz HM, Krueger DE, Maunder LR, et al. The Framingham Eye Study monograph. Surv Ophthalmol. 1980;24(Suppl):335–610.
  38. Bengtsson, B. (1981) The prevalence of glaucoma Br J Ophthalmol 65,46-49
  39. David, R. (1986) Ocular hypertension Cairns, J eds. Glaucoma, 551-567 Grune & Stratton London.
  40. Katz, J, Sommer, A. (1988) Reliability indexes of automated perimetric tests Arch Ophthalmol 106,1252-1254
  41. Holland, WW, Stewart, S. (1990) Screening in health care London: Nuffield Provincial Hospitals Trust


