Model-based selection of most informative diagnostic tests and test parameters

Given the complexity of most brain and body processes, it is often not possible to relate experimental data from an individual to the underlying subject-specific physiology or pathology. Computer simulations of these processes have been suggested to assist in establishing such a relation. However, the aforementioned complexity and the required simulation accuracy impose considerable challenges. To date, the best-case scenario is varying the model parameters to fit previously recorded experimental data. Confidence intervals can be given in the units of the data, but usually not for the model parameters that are the ultimate interest of the diagnosis. We propose a likelihood-based fitting procedure, operating in the model-parameter space and providing confidence intervals for the parameters under diagnosis. The procedure is capable of running parallel to the measurement, and can adaptively set test parameters to the values that are expected to provide the most diagnostic information. Using a pre-defined acceptable confidence interval, the experiment continues until the goal is reached. As an example, the approach was tested with a simplistic three-parameter auditory model and a psychoacoustic binaural tone-in-noise detection experiment. For a given number of trials, the model-based measurement steering provided 80% more information.


Introduction
Audiological diagnostics has the goal of identifying the physiological cause of a hearing impairment, and of quantifying its extent (e.g., [1]). Due to the complexity of the auditory system, it is a far from trivial exercise to identify a specific cause from non-invasively recorded data. Experienced audiologists and elaborate experimental protocols allow for diagnosing most of the common causes, but even in these "easy" cases, the description of the cause is qualitative. The audiologist gathers a wealth of quantitative data, such as an audiogram, a tympanogram, an auditory brainstem response (ABR), and so forth. Inevitably, however, the diagnosis is usually not quantitative with respect to the physiological cause. It is not going to be "patient X has 60-80% loss of auditory nerve fibers at frequencies above 2000 Hz and a 5-10 mV reduced endocochlear potential". But these physiological parameters would be most useful for providing optimal treatment, at least in theory, if a patient's auditory system could be fully characterized by a large set of such parameters. To derive physiological parameters from experimental data, an accurate computational model is required.
To date, however, such models only exist for isolated parts of the auditory system; especially models of the auditory periphery have evolved into comprehensive and broadly tested frameworks (e.g., [2][3][4]). A second, equally large, problem is that for diagnostics, we need to calculate the inverse function that brings us from a data set to the model parameters [5]. The number of physiological parameters will be too large to expect that an arbitrarily collected data set will be large enough to confine the parameter space. These limitations do not, however, need to stop us from developing a theoretical framework for model-based auditory diagnostics. Pioneering work from Panda et al. [6] has already demonstrated the general applicability of models in auditory diagnostics. They used a model of the auditory periphery [2], and altered its parameters to simulate data from several patients. For each patient, they changed only one parameter at a time; all other parameters were kept at their default values. They could then speculate that the respective non-default parameter is a possible description of the respective patient's peripheral pathology. However, the study was limited to qualitative fitting, and was only used for the analysis of a single, non-default parameter per patient, which runs counter to the common observation of co-morbidities.
With a model at hand, it is also possible to orchestrate the experiment for an efficient diagnosis: It should be possible to calculate which experiment, and which experimental condition, can be expected to generate the most useful information to confine the parameters. Moreover, with the model steering the experiment, any new data is going to live-update the diagnosis. Parameter estimates and confidence intervals are generated continually and can even be considered to determine the end of the experiment.
A common example is the maximum-likelihood procedure, an established, adaptive psychophysical measurement technique that sets an adaptive stimulus variable to the value at which the most information can be obtained. What "most information" means can be defined by the operator and will heavily influence the adaptation process. Typically, it is either a threshold value or the parameters of a psychometric function. Essentially, the psychometric function is already a simple (non-physiological) model that has two or three parameters. A related approach also exists for fast audiogram measurement [7]. Both the classical maximum-likelihood procedure and [7] use a simple "model" of how a stimulation parameter (e.g., level) influences the response. The goal is to minimize confidence intervals on conventional result scales, such as a "fraction of correct responses" scale. In contrast, an automated, model-based diagnosis aims at maximizing information about a set of model parameters, e.g., the amount of hair-cell loss. A related approach has been used for a time-efficient estimation of the band-importance-function parameters of the speech intelligibility index [8], but not for estimating parameters of a processing model.
The main goal of the current study is therefore to provide a first conceptual attempt at model-steered diagnostics. Section 2 describes the concept in very general terms, as it can also be applied to non-auditory diagnostics. In Section 3, the concept will be applied to a specific example. An auditory model will be used to steer a psychoacoustic experiment aimed at identifying three parameters of an artificial test subject.

Formalism
It is assumed that the behavior of a certain brain or other body function can be described as a function f that has been implemented as a computer model. The goal is to identify the model instance f_j that is optimal for reproducing experimental responses r of an individual. In this general view, the different instances f_j can represent one model function operating with different parameter sets, and/or different model functions.
For any experimental condition or stimulus s_i, the output of the model y_{i,j} = f_j(s_i) must have the same format as the experimental data d_i = r(s_i); i.e., the model has to operate as an artificial subject. Last, the set of experimental conditions s_1, ..., s_N must allow for disambiguating between the possible model instances. In other words, an s_i must exist such that any two different model instances, e.g., f_1 and f_2, produce different outputs.
For solving the inverse problem, it is necessary to compare all model data y_{i,j} against all corresponding experimental data d_i. The y_{i,j} is effectively a large matrix of simulated data, which can be pre-calculated, if desired. With the introduction of an error cost function c(d_i, y_{i,j}), the optimization problem is formalized as a cost minimization:

ĵ = arg min_j Σ_{i=1..N} c(d_i, y_{i,j}).    (1)
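The cost minimization over the table y can be sketched in a few lines. The following Python snippet (the study's own implementation is in Matlab; Python is used here purely for illustration) assumes, as one possible choice, a squared-error cost function:

```python
import numpy as np

def best_instance(d, y, cost=lambda d_i, y_ij: (d_i - y_ij) ** 2):
    """Pick the model instance j that minimizes sum_i c(d_i, y_ij).

    d : length-N array of experimental data
    y : N x J table of simulated data (pre-calculated, if desired)
    """
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    total_cost = cost(d[:, None], y).sum(axis=0)  # one total cost per instance j
    return int(np.argmin(total_cost))
```

For example, with d = [1.0, 2.0] and the two instances y = [[0.9, 2.0], [2.1, 0.0]], the first instance (j = 0) has the smaller total cost and is selected.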

Maximum likelihood
The cost function will typically increase with an increasing difference between the experimental data d_i and the corresponding simulated data y_{i,j}. However, it will also critically depend on the probability density functions of d_i and y_{i,j}. In theory, the latter can be infinitely precise, but at least the experimental data has some uncertainty. However, it is also possible that y_{i,j} is a probability distribution rather than a singular value. In any case, when at least one of the two quantities is a distribution, the maximum-likelihood method can be used as an elegant substitute for the general cost function (Eq. (1)). In the following, we assume a precise simulation value y_{i,j} and a probability distribution of d_i. When defining p(d_i, y_{i,j}) as the probability of d_i at the simulated value y_{i,j}, the likelihood estimator can be written as:

L_j = Π_{i=1..N} p(d_i, y_{i,j}),    (2)

where L is the estimated likelihood. To be able to derive the function more easily and to avoid numerical problems, the logarithm is applied to the formula:

ln(L_j) = Σ_{i=1..N} ln p(d_i, y_{i,j}).    (3)

The instance j resulting in the maximum of L_j or ln(L_j) corresponds to the most likely model f_j underlying the data set d_1, ..., d_N.
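The log-likelihood of Eq. (3) can be sketched as follows (a minimal Python illustration; the Gaussian form of p(d_i, y_{i,j}) with a known standard deviation sigma is an assumption of this sketch, not prescribed by the framework):

```python
import math

def log_likelihood(d, y_j, sigma):
    """ln(L_j) = sum_i ln p(d_i, y_ij), here with Gaussian p centred on y_ij."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - (d_i - y_ij) ** 2 / (2.0 * sigma ** 2)
               for d_i, y_ij in zip(d, y_j))

def most_likely_instance(d, y, sigma=1.0):
    """Index j of the model instance maximizing ln(L_j) over the table y."""
    scores = [log_likelihood(d, y_j, sigma) for y_j in y]
    return scores.index(max(scores))
```

Working in the log domain, as the text notes, turns the product of Eq. (2) into a numerically stable sum.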

Parametric approach
Instead of the arbitrary model instances f_j, in most cases a fixed functional relation is assumed, and instances differ only in their parameters: f_j(s_i) = f(m_1, ..., m_l, s_i). The goal is then to determine the model parameters m_1, ..., m_l that have the maximum likelihood of underlying a certain data set (Fig. 1). While the difference may appear to be only terminological, it quickly becomes essential in practice, when the goal is not only to find the most likely set of parameters, but also to estimate confidence intervals or probability density functions for each model parameter. Analogous to the likelihood L_j of instance j, the parametric likelihood is written as L(m_1, ..., m_l). The total likelihood for parameter m_1 can then be written as:

L(m_1) = Σ_{m_2} ... Σ_{m_l} L(m_1, ..., m_l).    (4)

In the common case that L(m_1) has a single maximum and tails off monotonically on both sides, diagnostically relevant confidence intervals can be specified. Likelihoods of all other parameters m_2, ..., m_l are derived accordingly.
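On a discretized parameter grid, the marginalization of Eq. (4) and a simple likelihood-weighted estimate per parameter can be sketched as (Python for illustration; the use of the weighted standard deviation as a confidence-interval width is one possible convention):

```python
import numpy as np

def marginal(L, keep_axis):
    """L(m_k): sum the full likelihood grid over all other parameter axes."""
    other = tuple(a for a in range(L.ndim) if a != keep_axis)
    return L.sum(axis=other)

def estimate(values, L_k):
    """Likelihood-weighted mean and standard deviation of one parameter,
    usable as a point estimate and a confidence-interval width."""
    w = L_k / L_k.sum()
    mean = float((values * w).sum())
    std = float(np.sqrt(((values - mean) ** 2 * w).sum()))
    return mean, std
```

For a 2-D grid L over (m_1, m_2), `marginal(L, 0)` sums over m_2 and returns L(m_1).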

Model-based measurement steering
Our diagnostic goal is to minimize or to limit the confidence intervals of the diagnostic parameters. These can be m_1, ..., m_l or any subset or mathematical function of these primary parameters, for example, the body-mass index rather than mass and height.
The skilled experimenter chooses experimental conditions that cause a reduction of the confidence intervals in the diagnostically relevant parameters. In an ideal case, the experiment should be optimized towards efficiently collecting diagnostic information. For this task, the model-based approach is instrumental, but suffers the same problem as the human experimenter: without knowing the parameters of the test subject, it is not possible to define a complete set of experimental conditions upfront. An ongoing adjustment of the experimental conditions at runtime is a solution for this problem.
From a theoretical point of view, it would be ideal to determine the most informative experimental condition, execute the experiment, update the estimated likelihoods and repeat these steps in a loop until the confidence intervals are sufficiently small or until the allocated time has been expended (Fig. 2). However, this may not be the most practical approach, and in most cases, a cost function for changing the experimental conditions must be added.
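The loop just described can be sketched generically as follows (Python for illustration; every callable name here is a placeholder for an application-specific component, not part of the published framework):

```python
def steer(candidates, expected_ci, run_trial, update, current_ci,
          ci_target, max_trials):
    """Generic measurement-steering loop (cf. Fig. 2).

    expected_ci(s) : predicted confidence-interval sum after presenting s
    run_trial(s)   : queries the (artificial) subject, returns the response
    update(s, r)   : folds the response into the likelihood
    current_ci()   : present confidence-interval sum
    """
    trials = 0
    while current_ci() > ci_target and trials < max_trials:
        s = min(candidates, key=expected_ci)  # most informative next stimulus
        update(s, run_trial(s))
        trials += 1
    return trials
```

In a toy setting where every trial halves the confidence-interval sum, the loop runs exactly until the target is undercut or the trial budget is spent.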
The arguably most central question of model-based experiment steering is how to combine the confidence intervals into a single scalar value representing the diagnostic fidelity. In the simplest case, all relative standard deviations can be added, and the sum can be minimized. If some parameters are more important, they can be weighted. If relations of primary parameters are of diagnostic value, the confidence intervals of the respective relations must be used. In most cases, it is not diagnostically optimal to minimize the volume of a multidimensional confidence space. Such a criterion does not lead to small confidence intervals for each parameter, but rather to co-varying parameter estimates, i.e., eccentric (ellipsoid) confidence spaces.

Figure 1. General approach for model-based diagnostics using a likelihood procedure. The model function is used to generate the look-up table y, and it is assumed that this function works in the same way as the subject. The subject data is obtained experimentally. For illustration, a folder represents all data from either a subject or from a specific model instance. All possible combinations of model parameters make up the folder shelf, representative of the model table y. Consider that for the respective folder, each new measurement generates a new sheet of paper containing the stimulus information and the response. The goal is to find the model instance (folder) that has the maximum likelihood of having generated the experimental data. The resulting likelihood surface (left side of the purple rectangle) has the same dimension as the folder shelf. In addition to the likelihood of each instance, confidence intervals can be calculated for each model parameter, which represent the model-based diagnosis.

Application to psychophysics
So far, the aim was to describe a general concept. Within this subsection, the concept is specified step-by-step until it can be applied to an alternative forced-choice procedure.
First, we assume that y_{i,j} = f_j(s_i) is known. If f is an artificial observer, it will provide binary wrong-or-correct answers, while we require the correct rate. Therefore, many repetitions have to be simulated with the same stimulus until the underlying binomial statistics guarantee the desired accuracy. Furthermore, the large number of different model and stimulus instances causes an enormous computational load, even if a single run of f is very fast. In this case, the simulations have to be carried out before the experiments, and y_{i,j} is stored as a look-up table. The concept is much better suited for analytical models that directly provide the correct rate. Many approaches are expected to lie between the two extremes, e.g., a numerical model that is capable of estimating the response probabilities more effectively than via an artificial observer.
While model parametrization is quintessential for diagnostic purposes, stimulus parametrization is not necessary within the present concept. The only exception can be a fundamental parameter such as stimulus level or frequency difference that is conventionally used as the independent, or adaptive, variable. It may be necessary to treat each value of such a fundamental variable as a new stimulus instance, but if the psychometric functions have a constant steepness for all model parameter sets and stimuli, different values of the fundamental variable do not need to constitute a new stimulus instance. In this common case, a constant ratio exists between d′ and the fundamental variable that carries all the information about the psychometric function [9]. The variable should then be set to the estimated most informative value of the psychometric function, as in conventional maximum-likelihood adaptive procedures targeting only the threshold but not the slope [10].
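When d′ is proportional to the fundamental variable, a single constant fixes the whole psychometric function. A minimal sketch (using the standard 2-AFC relation Pc = Φ(d′/√2) for simplicity; the experiment in Sect. 3 uses 3-AFC, whose percent-correct requires numerical integration and is not shown here):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pc_2afc(dprime):
    """Fraction correct in 2-AFC: Phi(d' / sqrt(2))."""
    return phi(dprime / math.sqrt(2.0))

def psychometric(level, k):
    """With d' = k * level, the single constant k carries all the
    information about the psychometric function."""
    return pc_2afc(k * level)
```

At d′ = 0 the function is at chance (50% for 2-AFC) and grows monotonically with the fundamental variable.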
Even beyond such a tentative fundamental variable, the proposed procedure can be understood as closely related to the maximum-likelihood adaptive procedure. The next stimulus is chosen to maximize the information to be learned about the model parameters. The main difference compared to the classic likelihood method is that the fitting function is not a typical sigmoidal psychometric function going from chance to certainty but rather a more abstract model function with more parameters to estimate. Above all, it remains a maximum-likelihood method, intending to reduce the measurement time. Also, just as in all maximum-likelihood procedures, "unlikely" responses in the early phase of the measurement will increase the duration.

Figure 2. Figure 1 extended with the model-based measurement steering. The likelihood is calculated after every trial, to evaluate the predefined termination criterion. As long as it is not fulfilled, the next stimulus is determined by simulating the next trial, adding each possible stimulus s_i to the data d. For every simulated stimulus, a likelihood function with the associated standard deviations is determined, and the stimulus that leads to the smallest standard deviation of the parameter estimates is presented next.

Example
In this section, the theoretical concept described in the previous sections is applied to a practical example. Binaural tone-in-noise detection data was simulated for two different stimulus parameters (stimulus bandwidth and noise delay) as well as for tone level, the fundamental stimulus variable. The model has the three parameters auditory filter bandwidth (BW), phase noise (σ_IPD), and decision noise (σ_d) [11]. The experimental procedure, the model, and the model-based steering algorithm were implemented in Matlab as a procedure for the psychophysical measurement package AFC [12]. The framework can operate stand-alone, but it can also interface with any experimental software or with any model that provides the basic artificial-subject conventions [13]. This also includes several models and data sets from the auditory modeling toolbox [14], for which a general interface is available. The Matlab code is freely available on Zenodo (https://doi.org/10.5281/zenodo.5211870 [15]).

Experiment
A 300 ms interaurally antiphasic 500 Hz target tone (Sπ) was temporally centered in 380 ms of white masking noise with a center frequency of 500 Hz and bandwidths of 25 Hz, 50 Hz, 100 Hz, 200 Hz, 400 Hz, 700 Hz, and 1 kHz. An interaural time difference of 0 ms, 2 ms, 4 ms, 6 ms, 8 ms, or 10 ms was applied to the interaurally correlated masking noise (Nτ). In addition, an interaurally uncorrelated masking-noise condition was used, corresponding to an infinite noise delay. Note that the interaural delay is always an integer multiple of the 500 Hz tone period, resulting in a constant average π phase difference between the target tone and the masking noise. The noise was presented at a constant 45.5 dB spectral level. A 20 ms raised-sine gate was independently applied to both target and masker after introducing the interaural delay. While the last section referred to stimulus instances that are not necessarily parametrized, in the following, stimuli are described by their two noise parameters: bandwidth and interaural delay (τ).
The measurement was set up as a 3-interval, 3-alternative forced-choice (AFC) procedure. Two reference intervals contained only the masker, whereas the randomly chosen target interval contained the masker with the tone.

Model
The model used in this example is a slightly modified version of the so-called IPD model [16,17]. The auditory filters are simulated by a gammatone filterbank [14,18]. Here, only a single 4th-order filter with the center frequency equal to the 500 Hz target was employed. In contrast to [16,17], the bandwidth of the filter was not set to a fixed value, but constituted the first diagnosis parameter, e.g., to phenomenologically model the consequences of an outer hair-cell loss (e.g., [19]). Very simplified hair-cell processing is simulated by half-wave rectification and a subsequent 5th-order low-pass filter with a cut-off frequency of 770 Hz [20]. In addition, the signal was taken to the power of 0.4 to simulate compression. Binaural processing was then simulated by an IPD extraction process with high temporal resolution (for details see [17]). White noise was added to the resulting phase difference, to simulate binaural processing limitations. The standard deviation of this noise was adjustable and constituted the second model parameter.
The model concept is based on the assumption that the tone is detectable in the noise masker if it induces a large enough increase in IPD fluctuations. Therefore, the decision stage first quantifies these IPD fluctuations by calculating their mean deviation from zero. After taking the cosine of the IPD fluctuation, the obtained value is mostly identical to the classical cross-correlation coefficient [21]. To map changes of this value to a scale that is proportional to a subject's correlation discrimination sensitivity, a Fisher Z-transformation is applied. The individual discrimination sensitivity on this scale is simulated with the third diagnostic parameter: a random value is taken from a Gaussian distribution with a standard deviation of σ_d and added to the resulting Fisher Z-transformed quantity to form the final decision variable. This simulates an imperfect internal representation at the level of the decision-making process. Last, the artificial observer expects the tone to be present in the interval with the lowest decision variable.
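The decision stage described above can be sketched as follows (Python for illustration; the IPD samples are assumed to come from the binaural stage, and the clipping of the correlation value before the Fisher Z-transform is our own numerical guard, not part of the published model):

```python
import math
import random

def decision_variable(ipd_samples, sigma_d, rng):
    """Mean IPD deviation from zero -> cosine (approx. correlation
    coefficient) -> Fisher Z-transform -> additive Gaussian decision noise."""
    mean_dev = sum(abs(p) for p in ipd_samples) / len(ipd_samples)
    rho = math.cos(mean_dev)
    rho = max(min(rho, 0.999999), -0.999999)  # numerical guard for atanh
    return math.atanh(rho) + rng.gauss(0.0, sigma_d)

def choose_interval(intervals, sigma_d, seed=0):
    """Artificial observer: expect the tone in the interval with the
    lowest decision variable (largest IPD fluctuations)."""
    rng = random.Random(seed)
    dv = [decision_variable(s, sigma_d, rng) for s in intervals]
    return dv.index(min(dv))
```

With sigma_d = 0 the observer is deterministic: the interval with the largest IPD fluctuation always yields the lowest Fisher Z value and is picked as the target.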

Conventional simulation
To get a better understanding of the effect of the individual model parameters on the detection thresholds, the corresponding experiment was simulated with different parameter settings of the model using a 3-interval, 1-up 2-down procedure [22]. Initially, the level was changed by 4 dB and, after the 2nd and 4th reversal, the step size was reduced to 2 and 1 dB, respectively. An additional 6 reversals were measured at this step size, and the average of these reversal levels was taken as the threshold obtained in that run. In each condition, 10 runs were simulated. The simulation results obtained with four different model parameter sets are shown as an example in Figure 3. The first set resembles the assumed parameters of a well-performing, normal-hearing subject (Fig. 3A). The values selected here were taken from Dietz et al. [11], but also agree well with data from Bernstein and Trahiotis [23]. In each of the other sets, one parameter has been doubled while the other two remain at the default value. All three model parameters affect the simulated thresholds of the experiment in different ways. This is an important prerequisite for the inverse problem of model-based diagnostics: to identify the model parameters from the data.
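The staircase used for these conventional simulations can be sketched as follows (a generic transformed 1-up 2-down implementation; respond(level) stands in for the artificial subject and is an assumption of this sketch):

```python
def staircase_1up2down(respond, start_level, steps=(4.0, 2.0, 1.0),
                       final_reversals=6):
    """1-up 2-down staircase (converges near 70.7% correct [22]).
    The step size shrinks after the 2nd and 4th reversal; the threshold is
    the mean of the last `final_reversals` reversal levels."""
    level, direction, correct_run = start_level, 0, 0
    reversals, step_idx = [], 0
    while True:
        if respond(level):
            correct_run += 1
            move = -1 if correct_run == 2 else 0  # down after 2 correct
            if move:
                correct_run = 0
        else:
            correct_run, move = 0, 1              # up after 1 wrong
        if move:
            if direction and move != direction:   # a reversal occurred
                reversals.append(level)
                if len(reversals) in (2, 4):
                    step_idx += 1                 # shrink the step size
                if len(reversals) == 4 + final_reversals:
                    return sum(reversals[-final_reversals:]) / final_reversals
            direction = move
            level += move * steps[step_idx]
```

With a deterministic toy subject that answers correctly at or above 5 dB, the track oscillates around that boundary and the estimate lands in the 4-6 dB region.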
Specifically, an increase in filter bandwidth increases thresholds for broadband noise maskers, while leaving the 25 and 50 Hz noise thresholds unchanged (Fig. 3B). An increase in phase noise primarily affects the conditions with zero or small values of τ (Fig. 3C), whereas the decision noise increases the thresholds for all conditions fairly uniformly, but also increases the variances (Fig. 3D).

Model-steered simulation
To pre-calculate the required look-up table y, psychometric functions for all 49 stimuli were simulated using all possible combinations of model parameters. Each model parameter range was discretized into 8 values with a 1/3-octave spacing, 0.2-1.0 for the two noise parameters, and 40-200 Hz for the bandwidth, resulting in 512 possible model parameter sets and a total of 25 088 test conditions. In order to obtain the desired correct rate for each of these conditions, the binary simulation result (correct or wrong interval) requires repetitive testing to obtain a correct rate or even the psychometric function. As deduced in Section 2.5, the correct rate at a single level is sufficient in the case of a condition-independent psychometric function steepness [9]. In the present example, however, the steepness changes somewhat with σ_d. We therefore decided to store both a steepness and a threshold value for each condition. To derive the two parameters, a psychometric function was estimated from 1500 stimulus presentations distributed equally from 25 to 75 dB in a 1.5 dB grid and a subsequent logistic fit. Arguably more elegantly, the threshold and slope can also be obtained following the classical maximum-likelihood procedure [24]. The threshold was defined at the most informative level of the psychometric function [25], also called the sweet point. In the current example, this point is at a correct rate of 72.9%.
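The logistic description of the psychometric function and the 72.9% sweet point can be sketched as follows (the exact parameterization used in the study is not given in the text, so the form below, with a chance level of 1/3 for the 3-AFC task, is an assumption):

```python
import math

CHANCE = 1.0 / 3.0  # guessing rate in a 3-AFC task

def logistic_pc(level, threshold, slope):
    """Correct rate of a logistic psychometric function scaled from 1/3 to 1."""
    return CHANCE + (1.0 - CHANCE) / (1.0 + math.exp(-slope * (level - threshold)))

def level_at(pc, threshold, slope):
    """Invert the logistic, e.g. to find the level of the 72.9% 'sweet point'."""
    p = (pc - CHANCE) / (1.0 - CHANCE)
    return threshold + math.log(p / (1.0 - p)) / slope
```

A round trip through `level_at` and `logistic_pc` recovers the requested correct rate, which is how the stored threshold/steepness pair can be turned back into a predicted correct rate at any level.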
Once the table was generated, the actual model-steered measurement as described in Section 2.4 started: after each response, the likelihood was calculated in the 3-dimensional parameter space (Sect. 2.2). From this likelihood space, the three standard deviations were calculated (Sect. 2.2) and summed to obtain the final variable to be minimized. Moreover, this variable was also calculated for all 49 possible next stimuli. For each stimulus, the updated variable was calculated for both an assumed correct and an assumed wrong response from the subject. In order to obtain a single value from each of the 49 pairs, the two values of each pair were averaged, but weighted according to their expected probability of occurrence (i.e., 72.9% correct, 27.1% wrong). Then the stimulus with the smallest weighted average was selected for the next presentation. This was repeated until the previously defined criterion was met: in our case, the average relative standard deviation of the three parameters fell below 10%, or 1500 trials had been completed. Figure 4 shows an example result of the procedure as described in Section 3.4 with an artificial test subject. The parameter set for this artificial test subject corresponds to that of a normal-hearing person [11]: BW: 79 Hz, σ_IPD: 0.33, σ_d: 0.38. The values were deliberately chosen so that they are not contained directly in the model table y, but lie between the values in the table. The stimulus parameter set for the first trial was arbitrarily set to τ = 0 ms and a bandwidth of 25 Hz at 65 dB SPL. At this high level, a "correct" response can be expected for all (artificial) subjects, a common choice for maximum-likelihood procedures.
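The probability-weighted stimulus selection just described can be sketched as follows (ci_if_correct and ci_if_wrong are hypothetical callables, standing in for re-evaluating the summed standard deviations after folding the respective hypothetical response into the likelihood):

```python
def expected_ci(stimulus, ci_if_correct, ci_if_wrong, p_correct=0.729):
    """Expected post-trial criterion for one candidate stimulus: the two
    hypothetical outcomes averaged, weighted by their probability."""
    return (p_correct * ci_if_correct(stimulus)
            + (1.0 - p_correct) * ci_if_wrong(stimulus))

def next_stimulus(candidates, ci_if_correct, ci_if_wrong, p_correct=0.729):
    """Candidate with the smallest expected confidence-interval sum."""
    return min(candidates,
               key=lambda s: expected_ci(s, ci_if_correct, ci_if_wrong,
                                         p_correct))
```

The 72.9% weight is the correct rate at the sweet point; a stimulus whose outcome barely changes the confidence intervals is thereby never preferred over one that shrinks them in either outcome.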

Results and discussion
After about 500 trials, the phase noise and decision noise could be estimated with an accuracy similar to the discretization step size. In the displayed example, the first termination criterion, a relative standard deviation of only 10%, was met after 1121 trials. Note that this standard deviation is even less than the discretization step size. The slightly larger relative standard deviation of the bandwidth estimate can be explained by the fact that this parameter, compared to the other two, in general has a smaller influence on thresholds. A single-step (i.e., 33%) change in estimated phase noise (as it happened near trial 800) caused an immediate 67% change in estimated filter bandwidth, because the two parameters are covariant, but with differently strong impacts at this point of the parameter space.
As we used an equal weighting of all three diagnosis parameters, it was very important that all parameters had a similar leverage on the thresholds, to minimize problems such as that just described. If a discrete parameter space is used, the threshold change caused by a single model parameter step should be of the same order of magnitude as behaviorally discriminable sensitivity levels, i.e., a few dB in the present example. Across all possible stimuli, changing a single model parameter by one step (here 33%) resulted in a change of simulated detection thresholds of up to 4.3 dB for phase noise (almost only for stimuli with τ = 0), up to 1.9 dB for a step change in decision noise, but similar for all stimuli, and up to 3.6 dB for bandwidth, but only for wider-band stimuli with τ = 2 ms or τ = 4 ms. While grid optimization may further improve performance, the present grid appears to offer three fairly comparable dimensions with reasonable discretization steps.
It was also noticeable that the mean from the fitted Gaussian can deviate from the maximum of the likelihood. This is because of the occasionally asymmetric shape of the likelihood. Therefore, the Gaussian fit is problematic, but still has practical merit.
Overall, after 1121 trials, the filter bandwidth (true value 79 Hz) was estimated at 82.05 ± 10.57 Hz, the phase noise (true value 0.33) at 0.34 ± 0.03, and the decision noise (true value 0.38) at 0.41 ± 0.03.
In Figure 5, the same experiment was carried out as above (Fig. 4), with the difference that the model used as an artificial test subject was modified: instead of the gammatone filter bank used to simulate the basilar membrane, a DRNL (dual resonance nonlinear filter) filter bank was used [26]. Therefore, the model used to calculate the model table y was no longer identical to the model that functioned as an artificial test subject. Visual inspection of Figure 5 and a comparison of the estimated parameters reveal that despite the modification, the system was able to meaningfully estimate the parameters of the test subject.
As in Figure 4, this initially only applied to the parameter combination that represents a normal-hearing subject. The next step was to check the extent to which the system is able to correctly estimate different parameter combinations, and thus subjects with different hearing abilities. Figure 6 shows the final estimates of the internal parameters of different artificial test subjects with different parameter combinations: four different values were selected for each of the three parameters, and simulations were carried out for all of the resulting 64 parameter combinations. Each simulation was stopped after the accuracy for all three parameters reached the 1-σ confidence interval. The model chosen for the artificial test subject contained the gammatone filter bank. It can be seen that the two noise parameters can be estimated relatively accurately, regardless of their value, although the accuracy decreased somewhat at high values. The bandwidth estimate was also generally good, but with slightly larger confidence intervals. One exception was the underestimation of the filter bandwidth at 160 Hz. Due to the above-mentioned covariance of bandwidth and phase noise, the bandwidth likelihood function was not always Gaussian, but sometimes had noticeable side peaks (see Fig. 4), each corresponding to a certain phase-noise value. In addition, the bandwidth likelihood functions were arbitrarily limited by our grid to 200 Hz, truncating the upper side peak for the 160-Hz condition. Hence, the fitted mean value was systematically biased towards lower values.

Figure 5. Example result of the model-based measurement steering with an artificial test subject. In general, the same experiment is shown as in Figure 4, with the difference that here, the model used as the artificial test subject was modified: the gammatone filter, which simulates the basilar membrane, was replaced with a DRNL filter. Due to this new filter, the equivalent rectangular bandwidth was 69 Hz (at 70 dB SPL) instead of 79 Hz in the previous example. The other two parameters were identical to the previous example (σ_IPD: 0.33, σ_d: 0.38).

Figure 6. Final estimates of the internal parameters of the artificial test subject using 64 different parameter combinations. The results for the individual parameters are presented with 95% confidence intervals. Note that the resolutions of the 3 parameter values are the same, but to obtain better readability, the abscissa for the bandwidth is shown in an abbreviated form.
With conventional maximum likelihood, but without the model-based steering, the relative standard deviations were about 1.3-1.5 times larger (see Fig. 7). Equivalently, without the model-based steering, almost 80% more trials (e.g., for an average standard deviation of 33%) were required to achieve the same accuracy.
Insights can also be obtained by analyzing which stimuli were selected by the steering algorithm, and perhaps even in which temporal order. In the present example, it was observed that the algorithm primarily selected stimuli from the edges of the possible parameter ranges, and from 2 to 4 ms delays at large bandwidths (Fig. 8A).
Within the first 20 trials, the stimulus selection was fairly equally distributed, with a small preference for 25 Hz (Fig. 8B). A little later in each run, the 25 Hz preference is more pronounced. This behavior can be understood by comparison with Figure 3: at first, the only parameter that could be estimated irrespective of the other two parameters is decision noise. All conditions were influenced by decision noise in a similar way, but especially the 25 Hz conditions were not influenced by filter bandwidth at all, so they are initially overrepresented. In particular, the 25 Hz, τ = ∞ condition was not influenced by either bandwidth or phase noise, so it is, at least in retrospect, understandably the most frequent condition and always caused an improvement of the estimates of decision noise. In Figure 3, the largest influence of filter bandwidth was observed under conditions with a large noise bandwidth and τ = 2 or 4 ms. These conditions were also moderately overrepresented (see Fig. 8).

General discussion
We have demonstrated conceptually (Sect. 2) and via simulation examples (Sect. 3) that a processing model of a brain or body function can be used to steer a measurement procedure such that the selected measurement instance maximizes the information about the respective function. In a diagnostic context, "information" refers to the model parameters, and the information maximization corresponds to the minimization of confidence intervals. In a research context, the same approach may also help to select measurement conditions that optimally disambiguate between different model concepts, i.e., different functional I/O relations, irrespective of parameters. From a user perspective, in addition to the time efficiency, the main benefit is that the diagnostic accuracy is inherently tracked. This allows for well-informed termination criteria, and avoids negative surprises about missed levels of significance that can happen with the common sequential measurement-analysis approach.
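As a minimal, self-contained illustration of such a steering loop (a hypothetical one-parameter psychometric example, not the paper's three-parameter model): each trial presents the stimulus that minimizes the expected posterior standard deviation of the parameter, averaged over the possible responses, in the spirit of Bayesian adaptive procedures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a psychometric threshold "theta" is estimated from
# yes/no trials; each trial's stimulus level is chosen to minimize the
# expected posterior standard deviation of theta.
thetas = np.linspace(0.0, 1.0, 101)   # parameter grid
levels = np.linspace(0.0, 1.0, 21)    # candidate stimulus levels

def p_yes(theta, level, slope=10.0):
    """Probability of a 'yes' response (logistic psychometric function)."""
    return 1.0 / (1.0 + np.exp(-slope * (level - theta)))

def posterior_std(post):
    mean = np.sum(thetas * post)
    return np.sqrt(np.sum((thetas - mean) ** 2 * post))

post = np.full_like(thetas, 1.0 / thetas.size)   # flat prior
true_theta = 0.6                                 # simulated observer
for _ in range(50):
    # Expected posterior std for each candidate level, averaged over the
    # two possible responses weighted by their predicted probabilities.
    scores = []
    for lv in levels:
        py = p_yes(thetas, lv)
        p_r = np.sum(post * py)                   # predicted P("yes")
        post_yes = post * py / p_r
        post_no = post * (1.0 - py) / (1.0 - p_r)
        scores.append(p_r * posterior_std(post_yes)
                      + (1.0 - p_r) * posterior_std(post_no))
    lv = levels[int(np.argmin(scores))]           # most informative stimulus
    resp = rng.random() < p_yes(true_theta, lv)   # simulated response
    post = post * (p_yes(thetas, lv) if resp else 1.0 - p_yes(thetas, lv))
    post /= post.sum()

print(f"estimate: {np.sum(thetas * post):.2f} +/- {posterior_std(post):.2f}")
```

In the full procedure, the same selection rule runs over a multidimensional stimulus grid and a precomputed model table rather than a closed-form psychometric function.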
A critical element of the proposed framework is the optimization criterion. In the present example, the sum of the three standard deviations was minimized. If there is diagnostic interest in only some parameters, the others can be left out of the optimization criterion, but will still be confined to some degree as a by-product. Further, a diagnostic priority can be incorporated as an increased weighting of the respective parameter's confidence interval. A preliminary version of the framework used the volume of the three-dimensional likelihood space that exceeded a certain likelihood threshold. This criterion was of no diagnostic value, because it provided no incentive to disentangle covarying parameters; in fact, such a simple approach may even maximize covariances. Illustratively speaking, in a two-dimensional space, our approach works towards a circle-shaped confidence area, whereas the preliminary version resulted in strongly elliptical areas, or even lines, as confidence areas: a smaller area (or volume, in general), but no confinement of the individual parameters. The optimization criterion also has to match the diagnostic requirements. In the present example (Sect. 3), it can happen that two parameters are estimated very accurately while the third retains a large confidence interval. If the diagnosis requires a maximum confidence interval for certain parameters, the optimization criterion has to be modified accordingly, to ensure that the procedure targets the termination criterion. Last, in a discrete implementation, the optimization process can depend on the resolution of the model-parameter grid. In the process of designing the example, a discretization step size that caused notable differences in the simulated results in at least one condition was found to be ideal. Again, to optimize this critical design parameter, diagnostic ambition and measurement efficiency must be aligned. If the effect sizes of steps in the various parameters differ, additional problems may occur: a single discrete step in one dimension may then cause a multi-step covariance of a second parameter, which can severely hamper the confinement of that parameter. Near the boundary of the parameter grid, such a covariance can also cause a bias, as observed in one condition in Figure 6.
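The difference between a volume-based and a marginal-standard-deviation criterion can be illustrated with synthetic two-dimensional Gaussian likelihoods of equal determinant (a sketch with made-up numbers, not the paper's model): the half-maximum "confidence area" is virtually identical, but the correlated likelihood has a much larger sum of marginal standard deviations.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 201)
X, Y = np.meshgrid(x, x)

def criteria(rho):
    """Sum of marginal std devs and half-max area (in grid cells) of a
    bivariate Gaussian with correlation rho, scaled to unit determinant."""
    z = (X**2 - 2.0 * rho * X * Y + Y**2) / np.sqrt(1.0 - rho**2)
    L = np.exp(-0.5 * z)
    L /= L.sum()
    std_sum = np.sqrt(np.sum(X**2 * L)) + np.sqrt(np.sum(Y**2 * L))
    area = np.count_nonzero(L > 0.5 * L.max())
    return std_sum, area

round_sum, round_area = criteria(0.0)    # circular likelihood
ellip_sum, ellip_area = criteria(0.95)   # strongly correlated likelihood
print(f"rho=0.00: sum of stds {round_sum:.2f}, half-max cells {round_area}")
print(f"rho=0.95: sum of stds {ellip_sum:.2f}, half-max cells {ellip_area}")
```

A criterion that minimizes only the area would treat both cases as equally good, even though the correlated case confines neither parameter individually.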
Despite the proof of concept, various obstacles currently prohibit practical applications. First and foremost, an accurate and detailed general-purpose model is required that captures the complete I/O chain from stimulation to recording or perception. Any simplification (which is inevitable) carries the risk of hidden parameters that may corrupt the diagnosis. Even if we assume that a general-purpose model will exist at some point in the future, it will likely have far too many parameters for the proposed procedure to be applicable. At least, this is already the case for today's most comprehensive auditory models [3, 4]. The only feasible option appears to be a very short processing chain, e.g., direct measurements of ear processes, such as tympanometry [27] or otoacoustic emissions. Even the model of a single neuron and its presynaptic input cannot be unambiguously parametrized, even with a multidimensional stimulus parameter space [28]. However, any diagnosis is always a quantitatively imperfect estimate, and must rather be evaluated based on its practical merits. If several parameters or several models have similar consequences and cannot easily be disambiguated, it may not be meaningful to increase the effort until they are. The most likely way forward is through highly simplified models that may need to be developed specifically for this purpose. The two parameters of the Plomp model [29], for instance, are of outstanding practical importance [30], even though they are only very indirectly related to physiology.
Schuknecht [31, 32] presented a very promising dataset from human inner ears that revealed the pathological causes of hearing loss. A distinction was made between four predominant pathology types: sensory, neural, metabolic, and mechanical. The study showed how these four types differ in their effects on the audiometric hearing threshold. Since the various pathologies can be mapped with just a few parameters, and these were correlated with a very common measurement, the dataset offers high diagnostic potential. The different phenotypes have also been used successfully with various machine-learning classifiers to determine the etiology of a hearing loss [33].
Another practical problem is the potentially very large changes of stimulus conditions, or even a frequent change of experiment or task. Normally, behavioral experiments have carefully designed presentation orders, to counterbalance training effects or to allow for familiarization with certain tasks or stimulus features. To date, such aspects are usually not included in models, so that the variable measurement conditions may not only irritate the subject at times, but any resulting reduction in performance will be misinterpreted by the model. A potential solution could be a blocked presentation, in which the steering process is only performed after a given number of constant stimuli has been presented. A second option is to add a parameter-change cost function that reduces the procedure's willingness to jump within the stimulus parameter space (Sect. 2.4). Nonetheless, attention, motivation, training, fatigue, and similar influences can be expected to remain a significant limitation of the suggested approach when used in the context of psychophysical experiments.
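A parameter-change cost function of this kind could be sketched as follows (all names and numbers hypothetical): the expected confidence-interval score of each candidate stimulus is inflated in proportion to its distance from the previously presented stimulus, so that the steering prefers staying in place unless a jump promises a clear information gain.

```python
def penalized_score(expected_score, stimulus, previous, weight=0.01):
    """Inflate a candidate's score by its distance from the last stimulus."""
    change = sum(abs(a - b) for a, b in zip(stimulus, previous))
    return expected_score + weight * change

# Candidate stimuli as (bandwidth in Hz, delay in ms) with expected
# confidence-interval scores (all numbers made up for illustration):
previous = (25, 1)
candidates = {(25, 1): 0.32, (25, 4): 0.30, (160, 4): 0.28}
best = min(candidates,
           key=lambda s: penalized_score(candidates[s], s, previous))
print(best)   # (25, 1): staying put wins despite a slightly worse raw score
```

The weight trades measurement efficiency against the stability of the presented conditions and would have to be tuned for the task at hand.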
Yet a further problem arises from the computational load of most models. First, the calculation of the multidimensional model table may require dedicated computational resources. Second, the computation of the steering procedure has to be completed while the previous stimulus is being presented to the subject; otherwise, a time lag before the next presentation will reduce the duration benefit and will be potentially annoying for the subject.
Critically speaking, the diagnosis of the artificial observer (Sect. 3) amounts to modeling a model with itself. Getting out what was put in is the lowest possible target. The minor variation between training model and testing model demonstrated in Figure 5 is a first step towards the larger discrepancies between any biological system and its model. However, even such a self-validation has the advantage that the steering process can be studied without concern for unaccounted influences. For example, if the actual measurements are to be conducted in a conventional way, a prior investigation of the steering with differently parameterized artificial observers can reveal very useful insights about how to design the measurement. At the very least, it identifies which test conditions are the most informative and is thus instrumental in defining the measurement protocol. We had not expected to find that the uncorrelated 25-Hz wide noise masker was by far the most informative (Fig. 8), but after studying the steering process, it appears obvious. Moreover, different parameters can be used to test whether certain artificial subjects require very different test conditions than others. If this is the case, a decision-tree-based design may help to select certain testing conditions only for some patients. This would constitute a hybrid approach between conventional measurements and the proposed technique, in a similar spirit to the work by Sanchez Lopez et al. [30, 34].

Conclusion
We conclude that model-based experiment steering is possible and has at least theoretical advantages over sequential measure-and-fit approaches. Practical problems and the lack of sufficiently accurate models will initially prohibit most diagnostic applications. Instead, the method will be instrumental in providing insights for designing better condition tables and experimental decision trees, while experiments are expected to be mostly executed in a conventional way.