From the Departments of ¹Bio-Medical Physics and Bio-Engineering and ³Ophthalmology, University of Aberdeen, Foresterhill, Aberdeen, United Kingdom; and the ²Eye Clinic, Aberdeen Royal Infirmary, Foresterhill, Aberdeen, United Kingdom.
| Abstract |
|---|
METHODS. Two clinical datasets were used to evaluate the performance of the automated turnover measurement system. The first consisted of 10 patients who had two fluorescein angiograms (FAs) acquired a year apart. These data were analyzed, both manually and using the automated system, to investigate the inter- and intraobserver variations associated with manual measurement and to assess the performance of the automated system. The second dataset contained FAs from a further 25 patients. This dataset was analyzed only with the automated system to investigate some properties of microaneurysm turnover, in particular the differing detection sensitivities of new, static, and regressed microaneurysms.
RESULTS. Manual measurements exhibited large inter- and intraobserver variation. The sensitivity and specificity of the automated system were similar to those of the human observers. However, the automated measurements were more consistent, an important condition for accurate turnover quantification. Regressed MAs were more difficult to detect reliably than new MAs, which were themselves more difficult to detect reliably than static MAs.
CONCLUSIONS. The automated system was shown to be fast, reliable, and repeatable, making it suitable for processing large numbers of images. Performance was similar to that of trained manual observers.
The accuracy of manual measurements is limited by the unavoidable variation in observer performance. Although counting the total number of MAs in an image has been shown to be reasonably reproducible, matching individual MAs in serial images is much less so, because an error in either image results in an erroneously classified MA. Human variation derives from two sources: unconscious variations in the observer's criteria for identifying an MA, and fatigue. A fully automated system is not affected by either of these problems.
We have described previously a fully automated computer system for detecting and quantifying MAs in fluorescein angiograms and red-free images, with performance similar to that of trained clinicians.8,9,10 It was found that automatic alignment of the follow-up image with the baseline image allows MA turnover to be calculated without manual intervention. The system was shown to be fast, reliable, and repeatable, making it suitable for processing large numbers of images.
In this study two clinical datasets were used to evaluate the automated turnover measurement system. The first consisted of 10 patients in whom two fluorescein angiograms were acquired a year apart. These data were analyzed both manually and using the automated system to investigate the inter- and intraobserver variation associated with manual measurement and to assess the performance of the automated system. The second dataset contains fluorescein angiograms from 25 different patients. This larger dataset was analyzed using the automated system alone to investigate further some properties of microaneurysm turnover, in particular the differing sensitivities for the detection of new, static, and regressed microaneurysms.
| Materials and Methods |
|---|
First Clinical Dataset.
Fluorescein angiograms (FAs) with a 35° field of view were acquired using a fundus camera (TRC50XT; Topcon Optical, Tokyo, Japan). The images were recorded on 35-mm film (Tri-X; Eastman Kodak, Rochester, NY). Ten type I diabetic subjects, with diabetic retinopathy grading ranging from "no retinopathy" to "severe," were selected from a larger study, in which they had two good-quality FAs taken 12 months apart. The images were graded using the interim Early Treatment Diabetic Retinopathy Study (ETDRS) severity scale based on seven-field color photography.11 Early venous stage images were chosen for digitization (1024 × 1024 8-bit gray-scale pixels, using a digital camera; MegaPlus; Eastman Kodak). The mean time after injection of the fluorescein for the digitized image frames, ±1 SD, was 24.7 ± 8.1 (range, 14-41) seconds (this notation will be used throughout).
Second Clinical Dataset.
A second, larger group of images was chosen from a separate study for analysis with the automated system. Images from 25 subjects (4 women and 21 men; 11 with type I and 14 with type II diabetes) were acquired using the same fundus camera model as before. However, in this study film was not used, and the images were digitized directly with the CCD camera (MegaPlus; Eastman Kodak). Ophthalmologist grading of the baseline patient images, using the EURODIAB scale (developed as part of the European Programme into the epidemiology and prevention of diabetes)12 based on two-color slide photographs, identified 16 patients with mild and 9 with moderate retinopathy. At follow-up there were 13 with mild, 10 with moderate, and 1 with severe retinopathy (one of the baseline moderate cases was ungraded at follow-up). Overall, there was no change in grading in 16 patients; 5 progressed from mild to moderate, 2 changed from moderate to mild, and 1 progressed from moderate to severe. The average age of the subjects was 53.4 ± 13.1 (25-74) years. The mean time after injection for the chosen digitized frames was 34.2 ± 13.2 (11.4-76.8) seconds. The mean time between the baseline and follow-up images was 213.4 ± 51.0 (142-302) days.
Manual Annotation of MAs
Manual observers were used to create the reference standard, assess intra- and interobserver variation, and evaluate the performance of the automated method using the 10 pairs of images in the first dataset.
Software was written to aid observers in marking the MAs that they identified. The baseline and follow-up images were displayed side by side on the computer monitor. The FA images were viewed as negatives (i.e., the MAs appeared as dark lesions), and shade correction9 was applied to the images to maintain optimal contrast both inside and outside the foveal avascular zone (Fig. 1). The observers were able to note the positions of most of the MAs with a single mouse click, and a region-growing algorithm then delineated the full extent of the lesion. The MAs were shown overlaid in green after selection. Occasionally, the region-growing algorithm failed for lower contrast lesions. In such cases the operator was able to "paint" the lesion manually using the mouse. The images were always presented to the observers in the same order.
The MA detector was applied to the data twice, using two different operating points. The second setting had slightly greater lesion detection sensitivity, but reduced specificity.
The Reference Standard
An ophthalmologist who was experienced in identifying and counting microaneurysms in angiographic images and also in grading retinal images using standard grading techniques was chosen as the standard observer. The same software used by the other manual observers was used to annotate MA positions for the reference standard.
The standard observer performed the analysis on two occasions 6 months apart. A total of 270 MAs were identified in the set of 20 images during the first session. Six months later, 292 MAs were marked, of which 246 matched MAs found during the first session. Seventy MAs were marked in only one of the sessions (24 in the first session, and 46 in the second).
The reproducibility (r) is defined as the ratio of the number of MAs that matched in both sessions to the number of unique MAs found in both sessions expressed as a percentage, given by
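$$ r = \frac{N_{\mathrm{matched}}}{N_{\mathrm{unique}}} \times 100\% $$
where $N_{\mathrm{matched}}$ is the number of MAs marked in both sessions and $N_{\mathrm{unique}}$ is the number of unique MAs marked in either session (the matched MAs plus those marked in only one session).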
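As a worked check of this definition, the following minimal sketch computes the reproducibility from the standard observer's session counts given above (270 and 292 MAs marked, 246 matched). The function and parameter names are illustrative assumptions, not part of the original software.

```python
def reproducibility(n_first: int, n_second: int, n_matched: int) -> float:
    """Percentage of unique MAs that were marked in both sessions."""
    only_first = n_first - n_matched    # marked in the first session only
    only_second = n_second - n_matched  # marked in the second session only
    n_unique = n_matched + only_first + only_second
    return 100.0 * n_matched / n_unique


# Standard observer: 270 MAs in session 1, 292 in session 2, 246 matched.
r = reproducibility(n_first=270, n_second=292, n_matched=246)
print(f"reproducibility = {r:.1f}%")
```

With these counts there are 316 unique MAs, 70 of which were marked in only one session.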
Manual Measurement Evaluation
Interobserver variation was determined using 9 observers (7 ophthalmologists and 2 medical physicists). All the observers had experience in grading or analyzing retinal images containing retinopathy. They examined the 10 image pairs from the first patient dataset and marked the positions of MAs they found using the computer program described earlier.
Four of the observers performed the manual measurement a further two times (with at least 1 week between sessions) to assess intraobserver variation.
Automated Measurement Evaluation
To evaluate basic MA detection using the MA detector (at both operating points) the 10 pairs of images were treated as 20 independent images and compared with the standard result. Turnover results were calculated from the 10 pairs of images (using both detector operating points) and the numbers of static, new, and regressed MAs compared with the reference standard.
The second study of 25 patients was analyzed using the automated system. The average number of MAs per image was measured and turnover analysis performed to determine the numbers of static, new, and regressed MAs.
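The classification into static, new, and regressed MAs can be sketched as follows. This is only an illustration under stated assumptions: the images are taken to be already aligned, each MA is reduced to an (x, y) centroid, and a greedy nearest-neighbour match within a hypothetical tolerance `max_dist` stands in for whatever matching rule the actual system uses, which is not described here.

```python
from math import hypot


def classify_turnover(baseline, follow_up, max_dist=5.0):
    """Classify MAs from two aligned images as static, new, or regressed.

    baseline, follow_up: lists of (x, y) MA centroids in pixels.
    max_dist: hypothetical matching tolerance in pixels (an assumption).
    Returns (static, new, regressed) as lists of centroids.
    """
    unmatched = list(follow_up)
    static, regressed = [], []
    for b in baseline:
        # Nearest follow-up MA that has not been matched yet, if any.
        best = min(
            unmatched,
            key=lambda f: hypot(f[0] - b[0], f[1] - b[1]),
            default=None,
        )
        if best is not None and hypot(best[0] - b[0], best[1] - b[1]) <= max_dist:
            static.append(b)        # present in both images
            unmatched.remove(best)
        else:
            regressed.append(b)     # present at baseline only
    new = unmatched                 # present at follow-up only
    return static, new, regressed


# Made-up coordinates, for illustration only.
baseline = [(10, 10), (50, 50), (80, 20)]
follow_up = [(11, 9), (81, 21), (30, 70)]
static, new, regressed = classify_turnover(baseline, follow_up)
print(len(static), len(new), len(regressed))
```

A detection error in either image changes the class of the affected MA, which is why turnover counts are more sensitive to detector errors than simple MA counts.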
Repeated Measurements at Different Sensitivities
Each type of MA (static, new, and regressed) may be misclassified by an error in one or both of the images. If the probability of an error is independent of MA type, then changes in the overall MA detection sensitivity should affect the three types equally.
To test the null hypothesis that there is no difference in the detectability of the three MA types, the automated detector was run twice on each dataset, the second time using a setting with greater specificity but lower sensitivity. The numbers of static, new, and regressed MAs that were detected at both sensitivity settings were recorded.
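One way to compare the proportions of two MA types detected at both settings is a chi-square test on a 2 × 2 table. The sketch below uses the Pearson statistic without continuity correction, which is an assumption, since the test variant and software used are not stated. The counts are hypothetical and chosen only for illustration.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Rows are MA types; columns are "detected at both settings" and
    "detected at one setting only". Returns (statistic, p_value), 1 dof.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 88 of 100 static MAs and 66 of 100 regressed MAs
# detected at both settings.
stat, p = chi2_2x2(88, 12, 66, 34)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")

# Hypothetical counts: 88 of 100 static MAs and 89 of 100 new MAs.
stat_sn, p_sn = chi2_2x2(88, 12, 89, 11)
print(f"chi2 = {stat_sn:.2f}, p = {p_sn:.2f}")
```

If detection errors were independent of MA type, the proportions retained at both settings would be expected to be similar for all three types.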
Free-Response Receiver Operator Characteristic Curves
The MA detection and MA turnover results for both the manual observers and the automated MA detector were plotted as free-response receiver operator characteristic (FROC) curves.16 The FROC graph plots sensitivity against the mean number of false-positives per image. A best fit (in a least-squares error sense) curve was calculated for basic MA detection using the FROC model by Chakraborty.17 The turnover results do not show interpolated curves, because they are a multiple-class ROC problem, and the standard FROC model is not applicable.
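Each observer or detector operating point contributes one point to an FROC plot. The sketch below computes such a point. Purely as an illustration, it treats the standard observer's first session (270 MAs in 20 images) as the reference and the second session (292 marked, 246 matched) as the observer under test. The function name is an assumption.

```python
def froc_point(true_pos, false_pos, n_reference, n_images):
    """Return (sensitivity, mean false positives per image) for one observer."""
    sensitivity = true_pos / n_reference
    fp_per_image = false_pos / n_images
    return sensitivity, fp_per_image


# Second session versus first session: 246 matched, 292 - 246 = 46 unmatched.
sens, fp = froc_point(true_pos=246, false_pos=292 - 246, n_reference=270, n_images=20)
print(f"sensitivity = {sens:.3f}, false positives per image = {fp:.1f}")
```

Unlike a conventional ROC curve, the horizontal axis is not bounded by 1, because the number of possible false-positive marks in an image has no fixed upper limit.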
| Results |
|---|
0.16). However, although the mean number of MAs apparently indicates a reasonably static MA population, turnover analysis found only 61 of the MAs to be truly static: there were 78 new MAs and 46 MAs that had regressed. Hence, only 57% of the baseline MAs were found in the follow-up images, a large change in the MA population that is not reflected by the mean number of MAs.
Interobserver variation was determined by comparing nine observers with the reference standard. The results are shown in the FROC plots in Figures 2, 3, 4, and 5. Figure 2 shows the result for basic MA detection. The curve was fitted to the results of the manual observers, including only the first result from observers who completed the assessment three times and excluding the standard observer (because this result was used to create the reference standard). Despite all the observers having previous experience grading retinopathy, there was considerable variation in both the sensitivities and false-positive rates between observers.
Intraobserver variation was determined by three observers who performed the manual measurements on three separate occasions. Table 1 shows the results from the three observers. For each observer, the results of the second and third sessions were compared with those from their first session. The total number of MAs marked in each session is listed, together with the percentage of MAs marked that were also marked on the first visit, the percentage of the MAs marked in the first session that were detected in the subsequent session, and the reproducibility value as defined earlier. The more conservative the observer (i.e., the fewer total MAs), the more reproducible the result.
The MA turnover results for the 10 pairs of images from the first patient group, comparing the manual observers and the automated system with the reference standard, are shown as FROC graphs: static MAs in Figure 3, new MAs in Figure 4, and regressed MAs in Figure 5. As before, the graphs include only the first results from observers who completed the assessment three times, and the results of the standard observer are not shown.
The automated detector was applied to the second clinical dataset. Overall, the mean number of MAs per image was 49.1 ± 50.0 (1-245). In the baseline images the average was 41.4 ± 49.5 (1-245), compared with 56.9 ± 50.4 (6-204) in the follow-up images. As for the first dataset, the difference failed to achieve significance at the 5% level (Wilcoxon test; P = 0.08). Neither was there a significant difference between the mean number of MAs at baseline and in the follow-up images after categorizing the patients by their retinopathy grades at baseline: The mean number of MAs per image in the mild group was 30.1 ± 21.4 (1-74) at baseline and 48.8 ± 43.1 (6-187) at follow-up. In the moderate group, the mean number of MAs per image was 61.6 ± 76.0 (3-245) at baseline and 71.2 ± 61.4 (10-204) at follow-up. The number of MAs increased in 16 patients, decreased in 8, and remained constant in 1. Once again, the small change in the absolute number of MAs did not reflect the high level of turnover. A total of 22.1 ± 32.1 (1-163) MAs per patient were static compared with 34.8 ± 35.6 (2-171) new MAs and 19.3 ± 20.3 (0-82) regressed MAs.
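The relationship between turnover counts and absolute counts can be made explicit with a short calculation. Baseline MAs are the static plus regressed MAs, and follow-up MAs are the static plus new MAs. The sketch below applies this to the totals reported for the first dataset (61 static, 78 new, 46 regressed) and to the per-patient means of the second dataset. The function name and the returned fields are illustrative assumptions.

```python
def turnover_summary(static, new, regressed):
    """Relate turnover counts to absolute MA counts at each visit."""
    baseline = static + regressed   # MAs present at baseline
    follow_up = static + new        # MAs present at follow-up
    return {
        "baseline": baseline,
        "follow_up": follow_up,
        "net_change": follow_up - baseline,
        "retained_pct": 100.0 * static / baseline,
    }


first = turnover_summary(static=61, new=78, regressed=46)
second = turnover_summary(static=22.1, new=34.8, regressed=19.3)

for name, summary in (("first dataset", first), ("second dataset", second)):
    print(
        f"{name}: baseline {summary['baseline']:.1f}, "
        f"follow-up {summary['follow_up']:.1f}, "
        f"retained {summary['retained_pct']:.0f}%"
    )
```

For the second dataset the sums reproduce the reported baseline and follow-up means (41.4 and 56.9 MAs per image), so the net change is small compared with the number of MAs that appeared or disappeared.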
Repeat Measurements at Different Sensitivities
The automated system was applied to the data twice: once using a setting with higher sensitivity and once using a setting with higher specificity. In the first dataset 88% of static MAs, 89% of new MAs, and 66% of regressed MAs were detected at both settings. The difference between the proportions of static MAs and new MAs was not significant (χ² test; P = 0.8). However, the difference between the proportion of regressed MAs and that of either the static or the new MAs was significant (χ² test; P < 0.01).
In the second dataset 93% of static MAs, 87% of new MAs, and 80% of regressed MAs were detected at both settings. The differences between the proportions of the three MA types were all significant (χ² test; P < 0.01).
| Discussion |
|---|
Simple counting of MAs in an image has been shown to be reasonably robust: errors of omission tend fortuitously to be balanced by errors of inclusion.1,2,3,18 In contrast, both of these errors confound turnover measurement. Furthermore, under nonideal conditions, factors such as fatigue and distraction also increase the number of errors. Observer errors generate "turnover noise": artifactual turnover caused by false-negative and false-positive MA identifications. Turnover noise arises from two related sources: straightforward errors detecting MAs (described by the sensitivity and false-positive rate) and shifts in the operator decision criteria, leading to variations in the sensitivity and specificity between the baseline and follow-up sessions. The shift in the decision criteria is inevitable with human observers;19 for example, slightly smaller lesions may be accepted in the second session, thereby including MAs in the selection process that were summarily excluded during the first session. Table 1 shows the large variation of the three observers in the study, each of whom analyzed the same images on three separate occasions.
The absolute level of turnover noise is difficult to measure. There is no independent, reliable, and noninvasive method for counting MAs currently available to act as a gold standard. Therefore, it was necessary to designate the observer with the greatest experience of grading retinopathy images as the reference standard. The other observers, both manual and automated, were compared with this standard.
The level of turnover noise due to human variation may be estimated, without an independent gold standard, by comparison of repeat measurements on the same images where the true result (i.e., zero turnover) is known. The most consistent observer in this study, the standard observer, had a reproducibility of 78%. This is equal to the best reproducibility achieved in any turnover study published to date (previous results range from 40% to 78%).7,14,15 Even so, 22% (70/316) of the MAs marked in the two sessions were indicated inconsistently. This underestimates the total turnover error because it fails to consider consistent errors (i.e., false-positive and false-negative lesions in both sessions). Nevertheless, this statistic is useful because it represents a lower bound on the uncertainty associated with manual turnover measurements. However, as will be described, the actual uncertainty depends on the relative proportions of static, new, and regressed MAs present.
These results are probably approaching the best performance possible by manual measurement. Turnover noise is greater when different observers grade the baseline and follow-up images, because of interobserver variation. For the most consistent results, the baseline and follow-up images should be annotated at the same time and by the same observer, to ensure that similar selection criteria are applied to both images. Unfortunately, this soon becomes impractical as the number of follow-up sessions increases.
The turnover noise associated with the automated method cannot be estimated in the same way; the reproducibility is 100%, since the computer will always return the same result for the same pair of images. From the FROC graphs in Figures 2, 3, 4, and 5, the performance of the automated detector was similar to that of the manual observers. Although the automated method was apparently not as sensitive as some of the manual observers (though it was more specific than most), for robust turnover measurement, the consistency of the automated system may be more important than its slightly poorer sensitivity.
Turnover analysis of the first clinical dataset (shown on the FROC graphs in Figures 3, 4, and 5) revealed an asymmetry in the detection sensitivities for static, new, and regressed MAs. Regressed MAs in particular appeared more difficult to detect reliably than static and new MAs. This finding could have been spurious: for instance, if the original images were biased in some way (e.g., if the first-session images were, on average, of poorer quality), or if all the observers were biased (e.g., by virtue of the order of presentation of the images), or if the reference standard alone was biased. These latter two possibilities were ruled out by the repeated automated measurement, using two different sensitivity settings. This demonstrated an asymmetry similar to that found by the manual observers, in which regressed MAs appeared more difficult to detect reliably. Finally, the likelihood that the original images in the first dataset were intrinsically biased was greatly reduced by demonstrating the same effect in the second, larger patient dataset.
The different sensitivities appear to be a genuine phenomenon wherein static MAs are intrinsically easier to detect than new and regressed MAs. This is probably because the appearance and disappearance of MAs are not instantaneous. Instead, MAs pass at different rates through an intermediate stage in which their identification is equivocal. The effect appears most pronounced during regression. Consequently, the uncertainty associated with the turnover results is greatest for regressed MAs and least for static MAs.
Diabetic retinopathy is a condition that progresses relatively slowly. It was necessary for trials such as the Diabetes Control and Complications Trial (DCCT)20 and the UK Prospective Diabetes Study (UKPDS)21 to recruit a large number of patients and observe them for many years, to enable trial end points to be reached, typical end points being new vessel formation or macular thickening. However, diabetic retinopathy is predominantly a disease of capillary occlusion. MA turnover is likely to produce more sensitive measures of these changes than relying on later complications. In contrast to measures of absolute counts, which have been shown in the current study to hide the continual process of capillary occlusion and remodeling, turnover measures provide information about the dynamic nature of the disease at the capillary level. Given a larger dataset it will be interesting to show whether rates of turnover of the new, static, and regressed MAs correlate with the current retinopathy grading and whether they provide a useful prognostic indicator of disease development.
In summary, an automated system for quantifying MA turnover was developed and compared with manual measurements. The automated system was fast and was shown to be reliable, making it suitable for processing studies containing large numbers of images. The system also worked with red-free images. Although fewer MAs were visible on red-free images, the noninvasive nature of the procedure is attractive, and work has been undertaken to investigate whether red-free turnover correlates with the turnover on angiograms. The automated system may have value in a screening context, in treatment evaluation, or for research on the dynamic nature and behavior of the MA population.
| Footnotes |
|---|
Disclosure: K.A. Goatman, None; M.J. Cree, None; J.A. Olson, None; J.V. Forrester, None; P.F. Sharp, None
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.
Corresponding author: Keith A. Goatman, Department of Bio-Medical Physics and Bio-Engineering, University of Aberdeen, Foresterhill, Aberdeen, AB25 2ZD, UK; k.a.goatman{at}biomed.abdn.ac.uk.