1From the Hamilton Glaucoma Center and the 2Institute for Neural Computation, University of California, San Diego, California; the 3Computational Neurobiology Laboratories, The Salk Institute, La Jolla, California; and the 4VA San Diego Healthcare System, San Diego, California.
| Abstract |
|---|
METHODS. Seventy-two eyes of 72 healthy control subjects (average age = 64.3 ± 8.8 years, visual field mean deviation = 0.71 ± 1.2 dB) and 92 eyes of 92 patients with glaucoma (average age = 66.9 ± 8.9 years, visual field mean deviation = −5.32 ± 4.0 dB) were imaged with SLP with variable corneal compensation (GDx VCC; Laser Diagnostic Technologies, San Diego, CA). RVM and SVM learning classifiers were trained and tested on SLP-determined RNFL thickness measurements from 14 standard parameters and 64 sectors (approximately 5.6° each) obtained in the circumpapillary area under the instrument-defined measurement ellipse (total 78 parameters). Ten-fold cross-validation was used to train and test RVM and SVM classifiers on unique subsets of the full 164-eye data set and areas under the receiver operating characteristic (AUROC) curve for the classification of eyes in the test set were generated. AUROC curve results from RVM and SVM were compared to those for 14 SLP software-generated global and regional RNFL thickness parameters. Also reported was the AUROC curve for the GDx VCC software-generated nerve fiber indicator (NFI).
RESULTS. The AUROC curves for RVM and SVM were 0.90 and 0.91, respectively, and increased to 0.93 and 0.94 when the training sets were optimized with sequential forward and backward selection (resulting in reduced dimensional data sets). AUROC curves for optimized RVM and SVM were significantly larger than those for all individual SLP parameters. The AUROC curve for the NFI was 0.87.
CONCLUSIONS. Results from RVM and SVM trained on SLP RNFL thickness measurements are similar and provide accurate classification of glaucomatous and healthy eyes. RVM may be preferable to SVM, because it provides a Bayesian-derived probability of glaucoma as an output. These results suggest that these machine learning classifiers show good potential for glaucoma diagnosis.
A related machine learning classifier, the relevance vector machine (RVM), recently has been introduced,14 15 which, unlike SVM, incorporates probabilistic output (probability of class membership, e.g., probability of glaucoma) through Bayesian inference. Its decision function depends on fewer input variables than SVM, possibly allowing better classification estimates for small data sets with high dimensionality (i.e., a large number of input variables).15 In the present study we compared the performance of RVM and SVM for classifying eyes as healthy or glaucomatous using SLP data.
| Methods |
|---|
Each study participant underwent a comprehensive ophthalmic evaluation, including review of medical history, best-corrected visual acuity testing, slit-lamp biomicroscopy, intraocular pressure measurement with Goldmann applanation tonometry, gonioscopy, dilated fundus examination with a 78-D lens, simultaneous stereoscopic optic disc photography (TRC-SS; Topcon Instruments Corp. of America, Paramus, NJ), and standard automated perimetry (SAP) with the 24-2 Swedish Interactive Threshold Algorithm (SITA; Humphrey Field Analyzer II; Carl Zeiss Meditec, Dublin, CA). To be included in the study, participants had to have a best-corrected acuity better than or equal to 20/40, spherical refraction within ±5.0 D, cylinder correction within ±3.0 D, and open angles on gonioscopy. Eyes with coexisting retinal disease, uveitis, or nonglaucomatous optic neuropathy were excluded.
For labeling eyes during classifier training, glaucomatous eyes were defined as those with repeatable (two consecutive) SAP results outside normal limits by pattern standard deviation (PSD; P < 5%) or Glaucoma Hemifield Test (GHT). The first abnormal SAP was on or before the imaging date. Neither optic disc appearance nor intraocular pressure was part of the inclusion criteria for the glaucoma group. Average SAP mean deviation (MD) of the glaucomatous eyes was −5.32 ± 4.0 dB (range, −20.14 to 0.26 dB). According to the scale of Hodapp et al.16 for glaucoma severity, 54 (59%) patients had early, 24 (26%) had moderate, and 14 (15%) had severe visual field defects. The mean age of the patients with glaucoma was 66.9 ± 8.9 years (range, 48.7 to 85.9).
Healthy eyes were defined as those with healthy-appearing optic discs on clinical examination, SAP results (MD, PSD, GHT) within normal limits, and no history of intraocular pressure > 22 mm Hg. Average SAP MD of the healthy eyes was 0.71 ± 1.22 dB (range, −3.92 to 1.87 dB) and was significantly different from that of glaucomatous eyes (t-test, P < 0.001). The mean age of the healthy participants was 64.3 ± 8.8 years (range, 48.2 to 86.8) and was similar to that of the patients with glaucoma (t-test, P = 0.07).
This research adhered to the tenets of the Declaration of Helsinki. Informed consent was obtained from each participant, and the University of California, San Diego, Human Research Protection Program approved all methodology.
Scanning Laser Polarimetry
Study participants underwent ocular imaging with the commercially available SLP with variable corneal compensation, the GDx VCC (software version 5.01; Laser Diagnostic Technologies, San Diego, CA). Scanning laser polarimetry measures the retardation of light reflected from the birefringent RNFL fibers and provides an estimated RNFL thickness based on the linear relationship between observed retardation, measured using a prototype instrument, and RNFL thickness, determined histologically.17 Details of this technique have been described previously.18 19 Because corneal polarization axis and magnitude affect SLP measurements and are not similar across eyes,20 21 the GDx VCC employs a variable corneal polarization compensator that allows eye-specific compensation. After determining the axis and magnitude of corneal polarization in each eye by macular scanning,22 three appropriately compensated retinal polarization images per eye were automatically obtained and combined to form each mean image used for analysis. Only well-focused, evenly illuminated, and centered scans with residual anterior segment retardation ≤15.0 nm and atypical scan scores <25, determined by GDx VCC software, were included (cutoffs suggested by written communication, Michael Sinai, PhD, Laser Diagnostic Technologies, June 2004). The atypical scan score indicates the presence of atypical patterns of retardation that can generate spurious RNFL thickness measurements.
We trained the machine learning classifiers on RNFL thickness measurements from 14 standard RNFL measurements (described in detail elsewhere5; Table 1) in addition to RNFL thickness measurements from 64 sectors (approximately 5.6° each) obtained in the circumpapillary area under the instrument-defined measurement ellipse (total, 78 parameters). Sector 1 was located temporally, with sectors 16 and 48 located superiorly and inferiorly, respectively (i.e., results were normalized to a right eye). These measurements were determined automatically by GDx VCC software ver. 5.01. We also conducted a subanalysis in which only the 64 sectoral RNFL thickness measurements were included in the machine learning classifier training set to determine which RNFL sectors were most important for classifying eyes as healthy or glaucomatous in our sample.
The SVM was implemented by using Platt's sequential minimal optimization algorithm in commercial software (MatLab, ver. 5.0; The MathWorks, Natick, MA). For classification of the SLP data, Gaussian (nonlinear) kernels of various widths were tested, and a Gaussian kernel with width = (2 × number of input variables) was chosen because it gave the highest area under the receiver operating characteristic (AUROC) curve using 10-fold cross-validation. The penalty for error/margin tradeoff C was 1.0.
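As an illustration only, the kernel decision function described above can be sketched in Python. The exp(-d²/width²) parameterization, the function names, and the explicit coefficient list are assumptions made for this sketch; the study itself used a MatLab implementation whose exact kernel convention is not stated here.

```python
import math

def gaussian_kernel(x, z, width):
    """Gaussian (RBF) kernel between two feature vectors.

    Assumed convention: exp(-||x - z||^2 / width^2). Other
    conventions (for example a factor of 2 in the denominator)
    are equally common.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (width ** 2))

def svm_decision(x, support_vectors, coeffs, bias, width):
    """Signed SVM output: sum_i coeff_i * K(x, sv_i) + bias.

    Each coefficient is the product of a learned multiplier and
    the class label (+1 or -1) of its support vector. The sign of
    the result is the binary class decision.
    """
    total = sum(c * gaussian_kernel(x, sv, width)
                for sv, c in zip(support_vectors, coeffs))
    return total + bias
```

A positive return value would be read as one class and a negative value as the other; the magnitude is a distance-like score and not a probability.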
Because SVM does not model the data distribution, but instead directly minimizes the classification error, the resultant output is a binary decision. Although a binary decision is sufficient for many applications, it is difficult to arrive at a meaningful disease-versus-no disease cutoff for glaucoma. This concern can be alleviated with a new machine learning classifier, the RVM.14 15 The RVM has the same functional form as the SVM within a Bayesian framework. This classifier is a sparse Bayesian model that provides probabilistic predictions (e.g., probability of glaucoma based on the training examples) through Bayesian inference.15 Its decision function depends on fewer input data (i.e., it is sparser) than a comparable SVM, because SVM minimizes the training error under the constraint of maximum smoothness, requiring more decision points.14 The benefit of a sparser classifier is that its results are more generalizable (i.e., it decreases overfitting). RVM predictions are more reliable than SVM predictions because they are directly generated through Bayesian inference, whereas SVM can provide pseudoprobabilistic outputs (i.e., between 0 and 1.0) only through postprocessing. In classification, RVM outputs probabilities of class membership rather than point estimates like SVM. This provides a conditional distribution that allows the expression of uncertainty in the prediction.30 A Medline search indicates that the RVM has not been applied to clinical data.
The RVM was implemented using a commercially available algorithm (SparseBayes ver. 1.0; Microsoft Research, Cambridge, UK, for MatLab, The MathWorks). For classification of the SLP data, a Gaussian kernel with width = (2 × number of input variables) was chosen because it gave the highest AUROC curve with 10-fold cross-validation.
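To make the probabilistic output concrete, the following Python sketch shows how a trained RVM classifier might turn a kernel-weighted sum over its relevance vectors into a probability of class membership through a logistic (sigmoid) link, which is the standard form for RVM classification. The function names and the idea of passing the kernel as a callable are illustrative assumptions; this is not the SparseBayes code, and training of the weights is not shown.

```python
import math

def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def rvm_probability(x, relevance_vectors, weights, bias, kernel):
    """Probability of the positive class (e.g., glaucoma) for one eye.

    `relevance_vectors` and `weights` are assumed to come from a
    previously trained model; `kernel` is any function K(x, z).
    """
    activation = bias + sum(w * kernel(x, rv)
                            for rv, w in zip(relevance_vectors, weights))
    return sigmoid(activation)
```

An output near 0.5 reflects an activation near zero, that is, an eye lying close to the decision boundary.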
Analyses
AUROC curves for classifying eyes as healthy or glaucomatous were determined for each machine learning classification technique and each individual parameter automatically provided by the GDx VCC software (Table 1). Significant differences in AUROC curves among RVM, SVM, and all individual parameters were determined by using the method of DeLong et al.31 We also reported classification results of the GDx VCC NFI, but did not compare these results to the other machine learning classifier results, because we thought such a comparison was somewhat biased (see the Discussion section).
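For readers unfamiliar with the AUROC, a minimal Python sketch of the rank-based (Mann-Whitney) estimate is given here. It illustrates the quantity being reported and is not the software used in the study; the function name and the toy scores are hypothetical, and the DeLong comparison of two AUROCs is not implemented.

```python
def auroc(scores_pos, scores_neg):
    """Area under the ROC curve from classifier scores.

    Equals the probability that a randomly chosen positive case
    (e.g., a glaucomatous eye) scores higher than a randomly chosen
    negative case (a healthy eye), with ties counted as one half.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auroc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

The quadratic loop is adequate for a sample of 164 eyes; a sort-based version would be preferable for much larger samples.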
Training and Testing Machine Learning Classifiers
Ten-fold cross-validation was used to train and test RVM and SVM classifiers to avoid training and testing on the same data. First, glaucomatous and healthy eyes were randomly divided into 10 approximately equal, exhaustive, and mutually exclusive subsets. Next, classifiers were trained on 9 subsets and subsequently tested on the 10th subset. This sequence was repeated 10 times, with each subset serving as the test set one time, so that each tested eye was never part of its training set and was tested only once. The test results from 164 eyes were then used to plot the bias-corrected ROC curve. Sensitivities at 75% and 90% specificities, arbitrarily chosen to represent moderate and high specificity, respectively, also were reported, although these values can be estimated from visual inspection of ROC curves.
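The partitioning scheme described above can be sketched in Python as follows. This is a simplified, unstratified illustration: the helper names, the fixed random seed, and the `train_fn` / `predict_fn` callables are assumptions introduced for the sketch and stand in for whichever classifier is being evaluated.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Split indices 0..n_items-1 into k exhaustive, mutually
    exclusive subsets of approximately equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_scores(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Return one out-of-sample score per item.

    Each item is scored exactly once, by a model that never saw it
    during training. `train_fn(X, y)` returns a model and
    `predict_fn(model, x)` returns a score for one item.
    """
    scores = [None] * len(labels)
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            scores[i] = predict_fn(model, features[i])
    return scores
```

With 164 eyes and 10 folds, this split yields four subsets of 17 eyes and six of 16; the pooled out-of-sample scores are what an ROC curve would then be computed from.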
As the dimensionality of the data sets (number of parameters) is relatively large but the size of the data sets (number of observations) is relatively small, we used sequential forward selection and backward elimination to reduce the data dimension to alleviate the "curse of dimensionality" (reduced classifier performance caused by the forced inclusion of irrelevant parameters in the solution set).30 For the sake of simplicity, for RVM these techniques were performed using RVM and for SVM these techniques were performed using SVM, although RVM can be optimized using SVM and vice versa. For forward selection, we started with an empty feature set and sequentially added parameters that improved the performance of the feature set the most, until peak performance (e.g., highest AUROC curve after which inclusion of additional parameters decreased the AUROC curve) was reached. For backward elimination, we started with the full-dimensional feature set (e.g., 78 RNFL thickness measurements) and sequentially deleted the parameters whose removal improved the performance of the feature set the most, until performance began to decline.
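The two greedy search procedures can be sketched in Python as shown here. The `evaluate` callable is a placeholder for "train the classifier on this feature subset and return its cross-validated AUROC"; the function names and the stopping rules on ties are assumptions of this sketch, not details taken from the study.

```python
def forward_selection(candidates, evaluate):
    """Start empty; repeatedly add the feature that raises the
    score the most; stop when no addition improves the score."""
    selected, remaining = [], list(candidates)
    best_score = float("-inf")
    while remaining:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

def backward_elimination(candidates, evaluate):
    """Start with all features; repeatedly drop the feature whose
    removal helps the score the most; stop when every possible
    removal would lower the score."""
    selected = list(candidates)
    best_score = evaluate(selected)
    while len(selected) > 1:
        scored = [(evaluate([g for g in selected if g != f]), f)
                  for f in selected]
        top_score, drop = max(scored, key=lambda t: t[0])
        if top_score < best_score:
            break
        selected.remove(drop)
        best_score = top_score
    return selected, best_score
```

Both searches are greedy, so neither is guaranteed to find the globally best subset, and the two may return different feature sets for the same data.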
Similar to using 10-fold cross-validation to minimize bias in the testing and training of the full-dimension RVM and SVM, we used cross-validation to minimize bias in the test sets used during the process of optimizing the feature sets. The data were randomly divided into five approximately equal-sized subsets. Four of the five subsets were used as feature-selection sets to determine the optimized feature set, selected based on the maximum AUROC curve when using 10-fold cross-validation. The optimized feature set was then trained on the initial four subsets, resulting in a single classifier. Next, this classifier was tested on the remaining single subset, and a bias-corrected AUROC curve was generated. This sequence was repeated five times, with each partition serving as the test set one time, resulting in five unbiased estimates of the AUROC curve. This technique for optimization is discussed in greater detail elsewhere.24
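The nesting of feature selection inside an outer five-fold split can be sketched in Python as follows. The three callables (`select_features`, `fit`, `score`) are hypothetical placeholders: in the procedure described above, `select_features` would itself run the forward or backward search with 10-fold cross-validation on the four selection subsets only.

```python
import random

def split_folds(indices, k, seed):
    """Shuffle the indices and deal them into k disjoint folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_estimates(n_items, select_features, fit, score, outer_k=5, seed=0):
    """Return one held-out performance estimate per outer fold.

    The held-out fold is never seen by feature selection or by
    model fitting, so each estimate is free of selection bias.
    """
    estimates = []
    for test_idx in split_folds(range(n_items), outer_k, seed):
        held_out = set(test_idx)
        selection_idx = [i for i in range(n_items) if i not in held_out]
        features = select_features(selection_idx)  # inner CV happens here
        model = fit(selection_idx, features)       # train on the four subsets
        estimates.append(score(model, features, test_idx))
    return estimates
```

The key design point is that the outer test fold is set aside before any feature selection takes place; selecting features on all data first and cross-validating afterwards would leak information into the test sets.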
Several optimized feature sets were investigated for RVM and SVM using both the full data set (i.e., 78 RNFL thickness measurements) and the sector-only data set (i.e., 64 sectoral RNFL thickness measurements). However, we report only results from the optimized feature set for each technique that resulted in the largest AUROC curve.
| Results |
|---|
ROC Curve Areas and Sensitivities
AUROC curves for nonoptimized RVM and SVM trained using the full (78 RNFL thickness measurements) data set were 0.90 and 0.91, respectively. AUROC curves for nonoptimized RVM and SVM, trained using the sector-only (64-dimensional) data set, were both 0.91. AUROC curves for individual SLP parameters ranged from 0.51 (for symmetry) to 0.87 (for normalized inferior area). The AUROC curves for all individual SLP parameters, except for normalized inferior area, were significantly lower than AUROC curves for RVM and SVM analysis of RNFL thickness (method of DeLong et al.,31 P ≤ 0.01; Table 2). The AUROC curve for the NFI was 0.87 and was significantly higher than the AUROC curves for all other individual RNFL measurements (method of DeLong et al., P ≤ 0.01) except for normalized inferior area (0.87) and inferior RNFL thickness (0.84).
Sensitivities at 90% specificity were 79% for RVM and 76% for SVM (Table 2). These values improved to 84% and 77% for optimized RVM and SVM, respectively. Sensitivity at 90% specificity for global RNFL thickness measurements ranged from 15% (superior-to-nasal ratio) to 61% (average RNFL thickness). Sensitivity at 90% specificity of the NFI was 66%.
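Sensitivity at a fixed specificity can be read off an ROC curve or computed directly from classifier scores. The Python sketch below shows one way to do the latter, assuming that higher scores indicate glaucoma and that a case is called positive when its score is strictly above the cutoff; the function name and the toy scores are hypothetical.

```python
import math

def sensitivity_at_specificity(scores_pos, scores_neg, specificity):
    """Sensitivity at the lowest cutoff that reaches the target specificity.

    The cutoff is placed at a healthy-eye score so that at least
    `specificity` of healthy eyes fall at or below it.
    """
    ranked_neg = sorted(scores_neg)
    # Small tolerance guards against floating-point round-up in the product.
    needed = math.ceil(specificity * len(ranked_neg) - 1e-9)
    cutoff = ranked_neg[needed - 1] if needed >= 1 else float("-inf")
    detected = sum(1 for s in scores_pos if s > cutoff)
    return detected / len(scores_pos)

healthy = [i / 10 for i in range(1, 11)]      # 0.1, 0.2, ..., 1.0
glaucoma = [0.95, 0.85, 0.99, 0.5]
print(sensitivity_at_specificity(glaucoma, healthy, 0.90))  # 0.5
```

With small samples the achievable specificities are discrete, so the value returned corresponds to the nearest attainable specificity at or above the target.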
Best RNFL Sectors Determined by Optimization
To determine what individual RNFL sectors are most informative for classifying healthy and glaucomatous eyes in our sample, we determined the "best" reduced-dimensional data sets (defined as those with the largest AUROC curves) by optimizing the sector-only data set (i.e., 64 sector RNFL thickness measurements) using RVM and SVM with forward and backward selection. Results are shown in Table 3. In all cases the highest ranked sectors were those in the inferior temporal circumpapillary quadrant. A significant number of sectors in the superior temporal and superior nasal quadrants also were identified. However, only two sectors in the inferior nasal quadrant were identified.
Figure 4 shows GDx VCC retardation images from three glaucomatous eyes assigned the probabilities of 0.14 (A), 0.49 (B), and 0.99 (C). (Again, eyes were selected as closest to 0.0, 0.5, and 1.0 probabilities, respectively.) These eyes all had abnormal SAP results by definition. Eye (A) had a SAP MD of 4.41 dB and a PSD of 7.93 dB. GDx VCC NFI output was 26, suggesting a low likelihood of glaucomatous damage. Eye (B) had a SAP MD of 3.51 dB and a PSD of 3.09 dB. GDx VCC NFI output was 22, also suggesting a low likelihood of glaucomatous damage. Eye (C) had a SAP MD of 8.53 dB and a PSD of 7.61 dB. GDx VCC NFI output was 98, indicating a very high likelihood of glaucomatous damage.
| Discussion |
|---|
Probabilistic results from the RVM indicated that most of the glaucomatous eyes fell within the 51% to 100% probability range and most healthy eyes fell within the 0% to 50% range, although some overlap was observed. Figures 3B and 3C may suggest that SLP-trained RVM can detect eyes with normal SAP results that have glaucomatous abnormalities detectable by other diagnostic tests, although this suggestion is not conclusive. The eye shown in Figure 3B, with a SAP MD of 0.10 dB and PSD of 1.56 dB (i.e., normal result), was assigned a 48% probability of belonging to the glaucoma group. At the time of SLP imaging, this eye had an abnormal short-wavelength automated perimetry result with MD of 5.99 dB (P < 5%), PSD of 4.40 dB (within normal limits), and GHT result within normal limits. However, Heidelberg Retina Tomograph (HRT II; Heidelberg Engineering, Dossenheim, Germany) Moorfields Regression Analysis results were all within normal limits. The eye shown in Figure 3C, with a SAP MD of 1.58 dB and PSD of 1.49 dB (normal result), was assigned a 92% probability of belonging to the glaucoma group. At the time of SLP imaging, this eye had a normal short-wavelength automated perimetry result with MD of 2.60 dB (within normal limits), PSD of 4.40 dB (within normal limits), and GHT result within normal limits. The Heidelberg Retina Tomograph Moorfields Regression Analysis of the superior nasal sector was assigned a borderline result.
In the above cases, RVM probabilistic output can be used clinically to help determine the posttest probability of disease. If the output is greater than 50% (0.50), the probability of having glaucoma is increased compared with the pretest probability. If the output is <50%, the probability of having glaucoma is decreased.
Machine learning classifier analyses, using forward and backward selection on sector-only RNFL thickness measurements, were used to determine the RNFL sectors most essential for classifying healthy and glaucomatous eyes in our sample. We determined that sectors in the inferior temporal quadrant were most important, followed by sectors in the superior temporal and the superior nasal quadrants. In general, inferior nasal sectors were not included. These results are in agreement with results from other studies in which SLP, CSLO, or optical coherence tomography were used, indicating that the inferior temporal RNFL and neuroretinal rim are the most important regions for discriminating between healthy and early-stage glaucomatous eyes.24 32 33 The difference in the relative importance of superior nasal and inferior nasal quadrants for classifying eyes as healthy or glaucomatous may be due, in part, to the presence of split superior RNFL bundles in many eyes.34 Based on viewing many SLP images, we suspect that the effect of split superior RNFL bundles is that the superior nerve fiber bundle is displaced nasally in many eyes, thus increasing this region's importance in glaucoma discrimination.
The GDx VCC currently includes an SVM-based classification parameter, the NFI. We did not compare our RVM or SVM results directly with results from the NFI because we suspected that the outcome of such a comparison would be biased in favor of our RVM/SVM, because we trained and tested our classifiers on very similar data sets. Both the training and testing data sets were constrained by the selection criteria for study inclusion, and each data set reflected the age, race, and glaucoma severity characteristics of our clinic population, which were not necessarily similar to that used to develop the NFI. For instance, the mean age of our patients with glaucoma was 66.9 ± 8.9 years (range, 48 to 89). The mean age of the patients with glaucoma included in the development of the NFI was similar (65.4 ± 13.33 years), but the NFI data set included 10% of patients under the age of 46 years (Sinai M, Laser Diagnostics Technologies, personal communication, June 2004). The mean SAP MD of patients in the present study was −5.3 ± 4.0 dB (range, −20.14 to +0.26). Although the mean SAP MD of patients included in the development of the NFI was similar (−5.42 ± 6.11 dB), the range was greater than that in our study (−31.57 to +2.81 dB; Sinai M, Laser Diagnostics Technologies, personal communication, June 2004). These differences in study population characteristics of the machine classifier training sets would be likely to confound any direct comparison between our techniques and the NFI.
Results for the NFI and other standard RNFL thickness measurements are similar to those in other work reported from our laboratory (using GDx VCC and a prototype instrument, with similar inclusion criteria)4 5 8 and others. For instance, Tannenbaum et al.35 reported AUROC curves for average, superior, and inferior RNFL thicknesses of 0.81, 0.87, and 0.85, respectively. The AUROC curve for superior thickness in their study was higher than that reported in our study (0.78). This may be a function of the inclusion in their study of more advanced cases, suggested by the range of SAP MD to approximately 30 dB, or by the presence of a larger number of subjects with inferior visual field defects.
The current results, demonstrating that machine learning classification techniques are superior to single parameters for discriminating between healthy and glaucomatous eyes using optical imaging technologies, support previously published results on this topic.12 24 However, in prior publications on this topic, CSLO data were used for machine learning classifier training (with the exception of studies reporting results from the standard software-provided machine learning classifiers available with the current and previous versions of the commercially available SLP).
Previous studies using back-propagated, multi-layer perceptrons trained using global CSLO optic disc topography parameters found good discrimination between normal and glaucomatous eyes with a diagnostic precision of 80%36 and 92% (with an AUROC curve of 0.94).37 More recently, "bagging" (boot-strap aggregating) classification trees, trained using a large number of global and regional HRT parameters, showed decreases in the normal versus glaucoma misclassification error, compared with linear discriminant functions.38 39 For instance, reported misclassification was 15% using the bagged classification tree, compared with approximately 20% for previously published linear discriminant analyses.38 39 In a study similar to the current one, SVM trained on CSLO global and regional topographic optic disc parameters were shown to improve on multi-layer perceptron techniques and previously published linear discriminant functions for differentiating between mild and moderate glaucoma.12 AUROC curves for nonoptimized neural network techniques ranged from 0.94 to 0.95, compared with 0.85 to 0.91 for statistic-based methods. After forward selection and backward elimination were applied, the discriminating ability of the SVM increased significantly to 0.97. These findings are similar to those from the present study.
AUROC curve results from our machine learning classifier techniques may be somewhat overestimated, because we used cross-validation instead of truly independent training and test sets. Although RVM and SVM were trained and tested on different data, each data set was generated from the same rather homogeneous pool. This fact may exaggerate somewhat the differences in classification ability between RVM and SVM and standard SLP RNFL thickness parameters, although we were careful to employ cross-validation to separate training and test sets at all steps in training, testing, and optimizing our machine learning classifiers.
In addition, when using optimized techniques on sector-only data to identify RNFL sectors that were most important for discriminating between healthy and glaucomatous eyes, the resultant measurement regions (i.e., RNFL sectors) may be specific to the constraints of the early-to-moderate glaucoma damage inclusion criterion of the present study. It is possible that in more advanced glaucoma, or with a larger training set, different sectors would be identified as most important for the classification task. Our training set may have included more examples of inferior temporal RNFL defects than other defects, because of the glaucoma severity investigated or because of the size of the training set. This possibility suggests that the development (training and testing) of different classifiers for different degrees of glaucoma severity may be necessary. In addition, a larger training set including more examples of glaucomatous eyes is desirable.
Overall, our results showed that optimized RVM and SVM, trained on SLP RNFL thickness measurements, classify glaucomatous and healthy eyes more accurately than current software-provided RNFL thickness measurements. These results suggest that these machine learning classifiers show good potential for glaucoma diagnosis. Moreover, results from relevance vector machine analyses showed that most glaucomatous eyes were assigned a high probability of being glaucomatous, based on labeled training examples, and that most healthy eyes were assigned a low probability of being glaucomatous. RVM and SVM performed similarly at classifying eyes as healthy or glaucomatous. Because RVM output provides a Bayesian-derived probability of glaucoma and SVM output does not, RVM classifiers are likely to provide more information than SVM classifiers for the diagnosis of glaucoma.
| Footnotes |
|---|
Submitted for publication September 21, 2004; revised November 22, 2004; accepted December 18, 2004.
Disclosure: C. Bowd, None; F.A. Medeiros, None; Z. Zhang, None; L.M. Zangwill, None; J. Hao, None; T.-W. Lee, None; T.J. Sejnowski, None; R.N. Weinreb, Laser Diagnostic Technologies (F); M.H. Goldbaum, None
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.
Corresponding author: Christopher Bowd, Hamilton Glaucoma Center, University of California, San Diego, La Jolla, CA 92037-0946; cbowd{at}eyecenter.ucsd.edu.