From the 1Department of Computing, Curtin University of Technology, Perth, Western Australia, Australia; the 2School of Psychology, University of Western Australia, Crawley, Western Australia, Australia; the 3Legacy Clinical Research and Technology Center, Discoveries in Sight, Devers Eye Institute, Portland, Oregon; and the 4Department of Optometry and Vision Sciences, University of Melbourne, Victoria, Australia.
| Abstract |
|---|
|
|
|---|
METHODS. A computerized visual field simulation model was developed to compare the performance (accuracy, precision, and number of presentations) of the three algorithms. SQ implemented aspects of the SITA algorithm that are in the public domain. The simulation was tested by using standard automated perimetry (SAP) visual field data from 265 normal subjects and 163 observers with glaucomatous visual field loss and by exploring the effect of response variability and response errors on algorithm performance.
RESULTS. SQ was faster than FT or ZEST, with a comparable mean error when simulating field tests on patients. Point-wise analysis revealed similar error and standard deviation of error as a function of threshold for FT and SQ. If the initial estimate of threshold for either procedure was incorrect, the means and standard deviations of the error increased markedly. ZEST produced more accurate thresholds than did the other two strategies when the initial estimate was removed from the true threshold.
CONCLUSIONS. When simulated patients made errors, the accuracy and precision of sensitivity estimates were poor when the initial estimate of threshold either overestimated or underestimated the true threshold. This was particularly so for FT and SQ. ZEST demonstrated more consistent error properties than the other two measures.
In recent years, a new generation of perimetric test algorithms based on maximum-likelihood principles has been developed. One such approach is the family of Swedish interactive threshold algorithms (SITAs) that are commercially available for the Humphrey Field Analyzer. The SITA strategy is a hybrid of both staircase and maximum-likelihood threshold procedures and was developed specifically for automated perimetry.4 5 6 SITA Standard reduces the test time for assessment of the central 30° of the visual field by up to 50% compared with the test times required by the FT strategy.5 6 The reduction in test duration is achieved in several ways4 7: (1) more efficient threshold estimation based on maximum-likelihood principles results in a reduced number of presentations; (2) false-positive responses are estimated without the use of catch trials; (3) the interstimulus interval is altered to match the patient's speed of response; and (4) SITA repeats testing if the threshold returned is more than 12 dB from an initial estimate of threshold, whereas FT repeats if the threshold is more than 4 dB from the initial estimate.
Another maximum-likelihood test procedure that has been applied successfully to perimetry is ZEST (zippy estimation by sequential testing).8 9 10 11 ZEST has been shown to determine efficiently the thresholds for frequency-doubling technology (FDT) perimetry10 11 and is available commercially for SAP in the Medmont perimeter (Medmont Pty. Ltd., Camberwell, Victoria, Australia) and in the Humphrey Matrix, a new FDT perimeter. As it is based on maximum likelihood principles, ZEST shares some features with SITA but is computationally simpler.
Given the marked reduction in test times afforded by newer threshold strategies, there is strong motivation for them to replace the FT strategy as the standard procedure both in clinical practice and research. SITA Standard has been thoroughly evaluated in clinical populations and has been found to return thresholds that are qualitatively comparable to FT.5 6 12 13 14 SITA Standard has also been shown to have lower global test-retest variability in comparison with FT estimates.14 15 16 However, newer strategies are computationally more complicated than first-generation staircase strategies, and a full understanding of their performance may not be revealed by such global comparisons. This is evidenced by a recent study by Artes et al.,16 which provides a detailed examination of differences between FT and SITA strategies and reveals that the differences in threshold estimates returned by these procedures vary with threshold in a nonlinear manner.
Although FT is used as a quasistandard, threshold estimates returned by this procedure are often highly variable, particularly with increasing deficit depth.1 13 14 16 This lack of precision means that many repeated tests are required to obtain a reliable threshold estimate, which has practical limitations when testing patients. Furthermore, it is impossible to evaluate the accuracy of the mean threshold estimate obtained from repeated testing, because a patient's true threshold is not known. Hence, thresholds returned by FT are not an adequate standard against which to measure the accuracy of other strategies. Computer simulation of visual field assessment is the ideal tool for evaluating test performance and has been successfully applied to the study of perimetric algorithms.2 4 11
This study was designed to investigate the accuracy, precision, and number of presentations required of two recent algorithms (ZEST and staircase-QUEST, a SITA-like approach). The FT algorithm was evaluated for comparison. Staircase-QUEST (SQ) implements those aspects of the SITA family of algorithms that are available within the public domain. We explored the performance of these algorithms, first by using a visual field approach, designed to be similar to clinical visual field assessment for both normal and glaucomatous visual fields. We also evaluated the performance of each of the test strategies as a function of true threshold for specific initial threshold estimates. This enabled evaluation of the performance of the algorithms in all situations, rather than simply in cases that commonly occur in practice. By focusing on all aspects of an algorithm's performance, subtle but clinically relevant differences can be revealed.
| Methods |
|---|
|
|
|---|
Each test procedure was assessed using the observer without response variability described earlier and two additional simulated observer groups: low-variability and high-variability observers. For these observers, both response variability and patient errors were incorporated in the simulation. Response variability was simulated by repeated sampling of a Gaussian distribution with a mean equal to the input threshold. The standard deviation of the Gaussian distribution was set to 1.0 dB for low-variability observers and 2.0 dB for high-variability observers. False-positive and false-negative rates were incorporated as a probability that the subject would respond yes or no, respectively, irrespective of what stimulus was presented. False-positive and false-negative rates of 15% were used for low-variability observers and 30% for high-variability observers.
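The simulated observer described above can be sketched in a few lines. This is a minimal illustration rather than the simulation code used in the study: the function name `observer_says_seen`, the dictionaries `LOW` and `HIGH`, and the convention that a stimulus is seen when its dB value does not exceed the momentary threshold are all assumptions made for the example, as is the reading that a false-positive or false-negative draw overrides the stimulus entirely.

```python
import random

def observer_says_seen(stimulus_db, threshold_db, sd_db, fp_rate, fn_rate, rng):
    """Return True if the simulated observer responds 'seen'.

    Assumed model: with probability fp_rate the answer is 'yes' and with
    probability fn_rate it is 'no', whatever the stimulus; otherwise the
    stimulus is compared with a Gaussian sample centred on the input threshold.
    """
    u = rng.random()
    if u < fp_rate:                      # false positive
        return True
    if u < fp_rate + fn_rate:            # false negative
        return False
    # Response variability: momentary threshold ~ N(threshold, sd).
    momentary = rng.gauss(threshold_db, sd_db) if sd_db > 0 else threshold_db
    # Assumed dB convention: higher dB means a dimmer stimulus.
    return stimulus_db <= momentary

# Parameter values taken from the text; the names are hypothetical.
LOW = dict(sd_db=1.0, fp_rate=0.15, fn_rate=0.15)
HIGH = dict(sd_db=2.0, fp_rate=0.30, fn_rate=0.30)
```

Calling `observer_says_seen(28, 30, rng=random.Random(1), **LOW)` draws one response from a low-variability observer whose input threshold is 30 dB.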
Visual Field Simulation
Test procedures were run on visual fields simulating patient testing. The input visual fields comprised 265 normal and 163 glaucomatous visual fields (24-2 FT strategy) supplied by one of the authors (CAJ). Written informed consent was obtained from all subjects, in accordance with the Declaration of Helsinki. The mean age of the normal patients was 47 ± 16 (SD) years, and the mean age of patients with glaucoma was 61 ± 13 years. The glaucomatous visual fields ranged from mild to severe visual field damage (median mean deviation [MD] = -1.81 dB, 5th percentile = +2.14 dB, 95th percentile = -22.55 dB).
All three test procedures require an initial estimate of threshold at each location of the visual field. We followed the approach of the Humphrey Field Analyzer 24-2 "growth pattern" for determining these initial estimates.17 With this approach, four seed locations have the threshold estimated by using the mean sensitivity of 541 normal patients as a starting value. These four locations are marked A in Figure 1, which shows a 24-2 stimulus presentation pattern in the format for a left eye. Once these four locations have been tested, their threshold values are used as the initial estimate for their immediate neighbors, the points labeled B in Figure 1. Remaining points derive their initial estimates by averaging their immediate neighbors that have already been tested. The averaging process is restricted so that it does not cross the horizontal midline, but it may cross the vertical midline. The simulation assumed that all A locations were fully determined before beginning any B locations. Similarly, all B locations were determined before commencing C locations and all Cs completed before commencing any of the locations labeled D.
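The neighbor-averaging rule can be illustrated with a short sketch. It assumes that locations are stored as (x, y) coordinates in degrees on a 6° grid with no point lying on either midline, and that "immediate neighbors" means the eight surrounding grid points; the function names and the dictionary `tested` are hypothetical, and the actual A to D labeling of Figure 1 is not reproduced.

```python
def neighbours(loc, spacing=6):
    """Eight surrounding grid points of loc (assumed meaning of 'immediate')."""
    x, y = loc
    return [(x + dx, y + dy)
            for dx in (-spacing, 0, spacing)
            for dy in (-spacing, 0, spacing)
            if (dx, dy) != (0, 0)]

def initial_estimate(loc, tested):
    """Average the already-tested neighbours in the same vertical hemifield.

    `tested` maps (x, y) -> measured threshold in dB. Neighbours across the
    horizontal midline (opposite sign of y) are ignored; neighbours across
    the vertical midline are allowed.
    """
    same_hemifield = [tested[n] for n in neighbours(loc)
                      if n in tested and (n[1] > 0) == (loc[1] > 0)]
    if not same_hemifield:
        return None  # no usable neighbour has been tested yet
    return sum(same_hemifield) / len(same_hemifield)
```

With `tested = {(3, 3): 30, (9, 3): 28, (3, -3): 10}`, the location (9, -3) takes its estimate from (3, -3) alone, because its other tested neighbors lie above the horizontal midline.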
Test Procedures
Full-Threshold Algorithm.
The FT algorithm was based on that of the Humphrey Field Analyzer.17 It consists of a staircase procedure that begins with 4-dB luminance changes until the first response reversal (seeing to nonseeing or vice versa). After the first response reversal, the step size is reduced to 2 dB. The procedure terminates after two reversals, and the threshold estimate is the last-seen intensity. If the difference between the measured threshold and the initial estimate is greater than 4 dB then a second staircase is initiated.17 The current estimate is used to derive the starting value for the second staircase. In cases in which a second staircase was initiated, our simulation reported the threshold estimate as the mean of the two staircase results.
The commercial instrument additionally doubly determines 10 locations (the four seed locations and six additional locations) to determine short-term fluctuation.17 We did not implement these double determinations, because we are determining precision by replicating the simulation multiple times. Hence, FT assessment using the HFA requires, on average, 50 to 60 more presentations per visual field than reported herein.
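The Full-Threshold staircase described above lends itself to a compact sketch. The code below is an illustration under stated assumptions, not the HFA implementation: the 0 to 40 dB stimulus range, stopping when the staircase is pinned at a range limit, returning the lower limit when nothing was seen, and starting the second staircase exactly at the first estimate are choices made here for the example. The names `staircase_4_2` and `full_threshold` are hypothetical, and `respond` is any function that maps a stimulus level in dB to True (seen) or False.

```python
def staircase_4_2(start_db, respond, lo=0, hi=40):
    """One 4-2 dB staircase; returns (last-seen level, presentations)."""
    level, step = start_db, 4
    reversals, presentations = 0, 0
    previous, last_seen = None, None
    while True:
        seen = respond(level)
        presentations += 1
        if seen:
            last_seen = level
        if previous is not None and seen != previous:
            reversals += 1
            step = 2                      # 2-dB steps after the first reversal
        previous = seen
        if reversals == 2:                # terminate after two reversals
            break
        # Seen -> dimmer stimulus (higher dB); not seen -> brighter (lower dB).
        nxt = min(hi, max(lo, level + step if seen else level - step))
        if nxt == level:                  # pinned at a range limit (assumption)
            break
        level = nxt
    return (lo if last_seen is None else last_seen), presentations

def full_threshold(initial_db, respond):
    """FT estimate: retest if the result is more than 4 dB from the initial estimate."""
    first, n1 = staircase_4_2(initial_db, respond)
    if abs(first - initial_db) <= 4:
        return float(first), n1
    second, n2 = staircase_4_2(first, respond)
    return (first + second) / 2, n1 + n2  # mean of the two staircases
```

For an error-free observer with a true threshold of 27 dB and an initial estimate of 30 dB, `full_threshold(30, lambda s: s <= 27)` presents 30, 26, and 28 dB and reports the last-seen level of 26 dB without a retest.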
Zippy Estimation by Sequential Testing.
Our ZEST implementation within the computer simulation was similar to the one we have described previously.11 The ZEST procedure is based on a maximum-likelihood determination described elsewhere.8 9 For each stimulus location, an initial probability density function (pdf) is defined that states, for each possible threshold, the probability that any patient will have that threshold (after adjusting for normal aging effects). We used the combined pdf approach recommended by Vingrys and Pianta,9 where the pdf is a weighted combination of normal and abnormal thresholds. The normal pdf gives a probability for each possible patient threshold, assuming that the location is "normal," whereas the abnormal pdf gives probabilities assuming the location is "abnormal." Our normal and abnormal pdfs were derived from empiric data as shown in Figures 2A and 2B. The patient set used to determine these pdfs consisted of 541 normal and 315 glaucomatous visual fields and was different from the input to the simulation. For each location, the lower 95th percentile for normal performance was determined from the 541 normal visual fields. The abnormal pdf was derived from the 315 patients with glaucoma by including only those thresholds that were below the lower 95th percentile for normal subjects. For both normal and abnormal pdfs, threshold estimates were pooled across all locations. For each test location, the normal pdf was adjusted along the threshold axis so that its mode was at the initial estimate of threshold, and then the abnormal and normal pdfs were combined in a ratio of 1:4. A small nonzero pedestal was added to the normal pdf, to ensure that all thresholds were represented with nonzero probability in the combined pdf. This is shown in Figure 2C, for an initial estimate of 32 dB.
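The construction of the combined pdf can be sketched as follows. The empiric pdfs of Figure 2 are not available here, so the two toy pdfs below are invented purely for illustration; the 1-dB grid from 0 to 40 dB, the pedestal value of 0.001, and the function names are likewise assumptions. Only the shift of the normal pdf to the initial estimate, the pedestal on the normal pdf, and the 1:4 abnormal-to-normal ratio come from the text.

```python
def shift_to_mode(pdf, new_mode):
    """Translate pdf along the threshold axis so that its mode sits at new_mode."""
    n = len(pdf)
    old_mode = max(range(n), key=pdf.__getitem__)
    offset = new_mode - old_mode
    return [pdf[i - offset] if 0 <= i - offset < n else 0.0 for i in range(n)]

def combined_pdf(normal, abnormal, initial_estimate, pedestal=0.001):
    """Mix abnormal and normal pdfs 1:4 after shifting the normal pdf."""
    shifted = [p + pedestal for p in shift_to_mode(normal, initial_estimate)]
    normal_area, abnormal_area = sum(shifted), sum(abnormal)
    mix = [0.8 * s / normal_area + 0.2 * a / abnormal_area
           for s, a in zip(shifted, abnormal)]
    total = sum(mix)
    return [m / total for m in mix]      # area of exactly one

# Invented toy pdfs (index = threshold in dB), NOT the empiric data.
toy_normal = [max(0.0, 1 - abs(t - 30) / 5) for t in range(41)]
toy_abnormal = [1.0 if t <= 20 else 0.0 for t in range(41)]
prior = combined_pdf(toy_normal, toy_abnormal, initial_estimate=32)
```

With these toy inputs, `prior` peaks at 32 dB and remains nonzero at every threshold because of the pedestal.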
Staircase-QUEST.
The staircase-QUEST (SQ) algorithm was designed to mimic the primary functions of SITA.4 The SITA approach to determining thresholds consists of four components:
Our SQ algorithm outputs the results of components 1 and 2, before postprocessing. We did not implement components 3 and 4, because aspects of this postprocessing are not available in the literature.
The SQ algorithm proceeds as follows. For each location, the stimulus is presented at an initial estimated threshold value. Subsequent stimulus intensities are determined as for the FT algorithm, that is, using a staircase procedure with initial step sizes of 4 dB followed by 2 dB after the first reversal. However, SQ differs from FT in determining when to terminate the staircase and in the final threshold estimate.
In conjunction with the staircase, two probability functions (pfs) are maintained. (We do not use the term pdf as for ZEST, because the area under the SITA probability functions appears not to be one. See Figure 1 in Ref. 4.) One pf gives the probability for each possible patient threshold, assuming that the location is abnormal, whereas the other maintains probabilities for thresholds that are normal. We begin with the same normal and abnormal pfs as in the ZEST procedure (Figs. 2A, 2B). Before the sequence of stimulus presentations begins for each location, the normal pf is translated along the threshold axis so that its mode aligns with the initial estimate for that particular location.
After each presentation, new pfs are determined based on the previous patient response (seen or not seen). Similar to ZEST, the rule for generating the new pf is to multiply the old pf by a likelihood function, but the 50% location of the likelihood function is aligned with the presented staircase value, not the mean or mode of the pf. Both pfs were maintained independently. The same likelihood function was used as for ZEST (Fig. 2D). There are two termination rules for SQ, which are the same as those used for SITA. The staircase terminates when either one of the pfs has a sufficiently small variance, or if two reversals are achieved in the staircase (in this latter case, the termination rule is the same as FT). SQ reports the most likely mode of the two pfs as the threshold for the location, irrespective of the basis of staircase termination.
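The pf update and the final SQ estimate can be sketched as below. This is an illustration only: the function names are hypothetical, the pfs are assumed to be lists indexed by threshold in dB, and "most likely mode" is interpreted here as the mode of whichever pf has the higher peak. The likelihood is taken to be a cumulative Gaussian with a standard deviation of 1.5 dB, as stated in the Discussion; the ERF-based termination test is not sketched.

```python
import math

def p_seen(stimulus_db, threshold_db, sd_db=1.5):
    """Cumulative-Gaussian probability of 'seen'; equals 0.5 when the
    candidate threshold coincides with the presented stimulus."""
    z = (threshold_db - stimulus_db) / (sd_db * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))

def update_pf(pf, stimulus_db, seen):
    """Multiply the pf by the likelihood centred on the presented value.

    The result is deliberately not renormalised, since the SITA pfs
    do not appear to have unit area.
    """
    updated = []
    for threshold_db, p in enumerate(pf):
        like = p_seen(stimulus_db, threshold_db)
        updated.append(p * (like if seen else 1 - like))
    return updated

def sq_estimate(normal_pf, abnormal_pf):
    """Mode of the pf with the higher peak (assumed reading of the rule)."""
    best = max((normal_pf, abnormal_pf), key=max)
    return max(range(len(best)), key=best.__getitem__)
```

After each staircase presentation, both pfs would be passed through `update_pf` with the same stimulus level and response, so that the two functions evolve independently.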
The SITA algorithm uses the error-related factor (ERF)4 to determine whether the variance of either pf is sufficiently narrow to terminate the staircase procedure, where
![]() |
andrew/barramundi/sap.html. This formulation of ERF allows for more error (increased variance) when thresholds are close to normal and requires smaller variances in pf when thresholds are abnormal. According to simulations performed by the developers of the SITA Standard algorithm, terminating the staircase when ERF reaches 0.69 works well in practice.4 Similar to the SITA developers, we tuned ERF in our experiments to obtain the best performance from SQ, and report herein experiments using an ERF of 0.70.
If the threshold estimate returned from SQ is more than 12 dB from the initial estimate, a second staircase is initiated. This staircase is commenced at the current threshold estimate. The mode of the normal pf is also moved to the current estimate. This retest rule is based on that used by SITA.4
| Results |
|---|
|
|
|---|
Inspection of the upper panels of Figures 5, 6, and 7 reveals that the number of presentations necessary to terminate the procedures increased with the level of inaccuracy of the initial estimate. This occurred more rapidly for FT than for SQ; hence, for any particular initial estimate, SQ is quick to terminate over a wider range of actual thresholds. When the true threshold was close to the initial estimate, ZEST was slower than the other two procedures; however, when the initial estimate was in error, ZEST used a number of presentations comparable to the number in SQ.
Inspection of the middle panels of Figures 5, 6, and 7 reveals that the error distributions for SQ and FT were similar and rather symmetrical about the initial estimate in patients with low or high variability. If the initial estimate either overestimated or underestimated true sensitivity, the mean error increased. Furthermore, the standard deviation of the error increased markedly. In contrast, the error performance of ZEST was more robust, with lower mean errors when the initial estimate was incorrect than in the other two strategies. For observers with low variability, the standard deviation of the error for ZEST was much lower and more consistent across the range of thresholds than were those of the other two procedures.
| Discussion |
|---|
|
|
|---|
In our simulation, SQ was based on the details of SITA that appear in the public domain. Our purpose was to demonstrate the underlying principles of the hybrid staircase-Bayesian approach incorporated in SITA. SQ is not the same as SITA. First, the pf used is not the same as in the commercial version, and second, SITA incorporates postprocessing analysis. The postprocessing aspects of SITA are likely to be equally applicable to those of any test strategy. SITA was developed to have error properties similar to those of FT, but to return thresholds using fewer stimulus presentations.4 5 19 SQ meets these development goals, and so we assume that it is likely to be representative of the underlying principles of SITA. One further aspect of SITA that is not incorporated in SQ is that SITA alters pfs during the test based on the pfs of neighboring values. The details of these alterations are not published in the literature, and therefore we could not incorporate them in our SQ simulations.
Inspection of Figure 4 shows that for simulated patients with low variability, the difference in the mean error across the field between SQ and FT was approximately 1 dB in normal observers and in those with glaucoma. This compares favorably with the approximately 1-dB difference reported between SITA and FT in clinical studies.5 14 16 18 It has been suggested that the difference between thresholds returned by SITA and FT may be caused in part by a reduction in fatigue in the shorter SITA examination.19 However, several studies have argued that factors other than fatigue are more likely to explain the difference.15 16 18 In addition, our simulation results suggest that the differences between SITA and FT estimates are unlikely to be due to differential effects of fatigue, but rather to the mechanics of the test algorithms. FT returned the last-seen presentation, whereas SQ/SITA returned the most likely mode of the two pfs used in the procedure. As ZEST returned the mean of the final pdf, which provided a less biased estimate than the mode,8 a slightly different threshold again was returned by ZEST, because of this factor alone. Inspection of Figures 4, 5, 6, and 7 reveals that the differences in error between SQ and FT varied with threshold, a finding that is broadly compatible with that of Artes et al.16
The performance of both ZEST and SQ depends in part on the choice of pdfs, the choice of likelihood function, and the particular termination rules imposed. We used empiric pdfs based on normal and abnormal thresholds measured for SAP and chose to use a hybrid normal+abnormal pdf for ZEST, because results in previous studies suggest this approach works well.9 Thresholds were pooled across locations to form the normal and abnormal pdfs, resulting in a broader pdf than if locations were treated separately. Initial inspection of location-specific pdfs revealed that the shape of abnormal pdfs was highly aberrant in some locations because of sampling issues; hence, the decision to pool across locations. The broader pdfs produced by pooling create a more uniform combined pdf that increases the number of presentations required for ZEST to terminate with marginal improvements in accuracy and precision.8 10 Although our pdfs were based on empiric thresholds, the specific derivation of pdfs for Bayesian test strategies is somewhat arbitrary. These pdfs may be different from those used in both the commercial application of ZEST on the Medmont perimeter and SITA in the Humphrey Field Analyzer; however, they were based on a large number of empiric thresholds and so may be assumed to represent reasonably the underlying population distribution of thresholds.
The likelihood function used within the ZEST and SQ procedures affects both the spread of errors and the number of trials needed to reduce the errors to an acceptable level.8 20 The likelihood function used in these experiments was the discrete version of a cumulative Gaussian with a standard deviation of 1.5 dB. This slope is similar to that found for empiric frequency-of-seeing curves measured for SAP in normal observers.21 We also evaluated numerous other likelihood functions within the simulator and found that this function resulted in SQ terminating with an average number of presentations and a precision similar to those reported for SITA.4 5 6 We maintained the same likelihood function for ZEST to facilitate comparison between the mechanics of the procedures.
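To make the role of the likelihood function concrete, the following sketch runs a ZEST-style loop with a cumulative Gaussian of 1.5 dB standard deviation. Several details are assumptions rather than the study's settings: the stimulus is placed at the rounded mean of the current pdf, the procedure stops when the pdf's standard deviation falls to `stop_sd` (an arbitrary 1.5 dB here) or after `max_presentations`, and the function and parameter names are hypothetical. Returning the mean of the final pdf follows the text.

```python
import math

def p_seen(stimulus_db, threshold_db, sd_db=1.5):
    """Discrete cumulative-Gaussian likelihood of a 'seen' response."""
    z = (threshold_db - stimulus_db) / (sd_db * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))

def zest(prior, respond, stop_sd=1.5, max_presentations=50):
    """Return (threshold estimate, presentations); pdf index = threshold in dB."""
    pdf = list(prior)
    presentations = 0
    while True:
        total = sum(pdf)
        pdf = [p / total for p in pdf]                      # keep unit area
        mean = sum(t * p for t, p in enumerate(pdf))
        sd = math.sqrt(sum(p * (t - mean) ** 2 for t, p in enumerate(pdf)))
        if sd <= stop_sd or presentations >= max_presentations:
            return mean, presentations                      # mean of final pdf
        stimulus = round(mean)                              # assumed placement
        seen = respond(stimulus)
        presentations += 1
        # Bayesian update: multiply by the likelihood of the observed response.
        pdf = [p * (p_seen(stimulus, t) if seen else 1 - p_seen(stimulus, t))
               for t, p in enumerate(pdf)]
```

Starting from a flat prior over 0 to 40 dB and an error-free observer with a true threshold of 25 dB, `zest([1.0] * 41, lambda s: s <= 25)` narrows the pdf around 25 to 26 dB within a handful of presentations. A shallower likelihood (larger `sd_db`) would make each response less informative and lengthen the test.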
Termination rules for SQ were chosen to be the same as those for SITA: SQ ends by using a dynamic termination criterion based on whether the spread of the pf becomes sufficiently narrow, or if two reversals are achieved in the staircase. It is also possible to terminate adaptive procedures after a fixed number of presentations, which has been shown to result in errors similar to those obtained using a dynamic criterion.20 We chose a dynamic termination criterion for ZEST to keep it similar to SQ. The parameters chosen for each of pdf, likelihood function, and termination criterion may be suboptimal; however, optimizing SQ and ZEST falls beyond the scope of this study.
A difference between the simulation and human performance is that our variability models (no, low, and high variability) were kept fixed across the visual field. These variability models incorporate both response variability and patient errors. Response variability is known to increase with deficit depth.21 22 23 Hence, in a given patient, responses may range from having no variability to high variability at different locations within the visual field. We present three variability conditions chosen to represent the end points of the range of response variability and patient response errors: no errors and 30% false-positive and false-negative responses (a commonly used cutoff criterion for acceptable performance), as well as the middle of this range, and assess performance for all possible stimulus levels for each of these conditions (Figs. 5, 6, 7). An alternate approach would have been to increase response variability with increasing deficit depth. Although this alternate approach may more closely represent average clinical performance, the approach taken provides far greater information regarding the underlying performance of the three algorithms and their tolerance to variability, enabling the assessment of the algorithms for situations that are uncommon but still occur at times (for example, locations in which threshold is normal but the subject's responses have high variability). In practice, the results with any individual patient may be a hybrid of the three variability models presented and can be determined from the data shown in Figures 5, 6, and 7. It is also possible that our choice of having equivalent numbers of false positives and false negatives is not representative of typical performance. Indeed, typical patients may be likely to have either 15% false-positive or false-negative responses, but not both. It is to be expected that significant response biases in one direction only (for example, false positives) will introduce a more severe systematic error than that shown in our low-variability group, but may reduce the standard deviation of the error.
For all the test strategies, if the initial estimate for the procedure is close to the true threshold, then the procedures are fast and accurate. This is likely to happen in most real cases, because of the preponderance of normal thresholds and the use of the growth pattern to determine the initial estimate. This is reflected in the visual field simulations shown in Figure 3, which demonstrates small absolute errors when averaged across the visual field for all test procedures. However, as the point-wise analysis shows, in locations in which the initial estimate is wrong (either an underestimate or overestimate) the procedures can take a long time and have reduced accuracy. This is especially true of SQ and FT, despite the fact that these procedures incorporate an error-checking retest strategy. For retested locations, the HFA FT procedure provides the results of both determinations with no interpretation instructions. In these situations we chose to take an average. For SQ, locations are retested only if more than 12 dB from the initial estimate of threshold. This relaxed retest policy favors fewer presentations over improved accuracy and precision.
ZEST clearly outperforms SQ and FT when the initial estimate is removed from the true threshold. In practice, this occurs in a minority of locations (such as on the edge of a scotoma); however, determining accurate and repeatable thresholds in these locations is essential for monitoring progression of visual field loss. ZEST shows more consistent error response properties, irrespective of initial estimate and deficit depth, than do the other two procedures, although it is slower to terminate. The test time for ZEST can be decreased by altering the termination rule (for example, terminating after four presentations makes it comparable to SQ); however, this is achieved at the expense of accuracy and precision.
Both SQ and FT (and to a lesser extent ZEST) have similar limitations: when patients make response errors, both the mean error and the standard deviation of the error increase when the initial estimate is not close to the true threshold. In principle, it is possible to compensate for a purely systematic error. Such compensation is not sufficient for SQ and FT, because not only does the mean error increase with greater disparity of the initial estimate but the standard deviation of the error also increases. ZEST performs better than the other two strategies under these conditions; however, the errors still increase markedly when patients respond unreliably. Although SITA has provided welcome benefits over FT in reducing test time, further improvements in the accuracy and precision of visual field assessment should be possible. For better detection of visual field loss and particularly for better monitoring of progression, test procedures that reduce both the mean error and standard deviation of the error for locations with abnormal thresholds are needed.
| Footnotes |
|---|
Submitted for publication January 11, 2003; revised May 28 and June 25, 2003; accepted June 30, 2003.
Disclosure: A. Turpin, None; A.M. McKendrick, None; C.A. Johnson, None; A.J. Vingrys, Medmont Pty. Ltd. (C, F)
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.
Corresponding author: Allison M. McKendrick, School of Psychology, University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia; allisonm{at}psy.uwa.edu.au.
| References |
|---|
|
|
|---|