A comparative study of eight human auditory models of monaural processing

A number of auditory models have been developed using diverging approaches, either physiological or perceptual, but they share comparable stages of signal processing, as they are inspired by the same constitutive parts of the auditory system. We compare eight monaural models that are openly accessible in the Auditory Modelling Toolbox. We discuss the considerations required to make the model outputs comparable to each other, as well as the results for the following model processing stages or their equivalents: Outer and middle ear, cochlear filter bank, inner hair cell, auditory nerve synapse, cochlear nucleus, and inferior colliculus. The discussion includes a list of recommendations for future applications of auditory models.


Introduction
Computational auditory models reflect our fundamental knowledge about hearing processes and have been accumulated during decades of research (e.g., [1]). Models are used to derive conclusions, reproduce findings, and develop future applications. Usually, models are built in stages that reflect basic parts of the auditory system, such as cochlear filtering, mechanoneural interface, and neural processing, by applying signal-processing methods such as bandpass filtering or envelope processing [2]. Models of monaural processing often form a basis for binaural models (e.g., [3]) and more complex models of auditory-based multimodal cognition (e.g., [4]). For this reason, combined with the increasing popularity of reproducible research [5], it is not surprising that there is an increasing number of auditory models available as software packages (e.g., [6,7]).
However, models must be used with caution because they approximate auditory processes and are designed and evaluated under a specific configuration for a specific set of input sounds. While the evaluation conditions are selected to test the main properties of the simulated stages, models may provide different predictions when processing unseen sounds. Combined with the wide and low-threshold availability of model implementations, there is a chance of applying a model outside its specific signal or parameter range. Thus, studies comparing models' properties and configurations are important to model users. For example, Saremi et al. [12] compared seven models of cochlear filtering with respect to various parameters describing the nonlinear filtering process of an active cochlea, and Lopez-Poveda [13] compared eight models of the auditory periphery based on the reproduction of auditory-nerve properties. Other related studies focus on a specific application (e.g., [10,11,[14][15][16][17]) or provide an introduction to modelling frameworks [18,19].
In the current study, we compare various monaural auditory models that approximate subcortical neural processing. For this comparison, we use model configurations that reduce the heterogeneity across model outputs, indicating advantages and disadvantages of these configuration choices. These configurations are evaluated using the same set of sound stimuli across models. The selected set of stimuli illustrates critical model properties that can be used as a guideline for the choice of a specific model. These properties include fast and slow temporal aspects, i.e., temporal fine structure and temporal envelope, that are evaluated for a wide range of presentation levels.
We selected a number of auditory models that met two main criteria. First, the selected models describe the auditory path beginning with the acoustic input up to subcortical neural stages, in the cochlear nucleus (brainstem) and the inferior colliculus (midbrain). Consideration of these stages extends previous comparisons of auditory periphery models [12,13]. Second, the model implementations are publicly available and validated to simulate psychoacoustic performance and/or physiological properties. We use the implementations available in the Auditory Modelling Toolbox (AMT) [8,9,20].
Based on our inclusion criteria, some models are excluded from the comparison, e.g., models that have only been evaluated at the level of cochlear filtering, such as models based on Hopf bifurcation [21] and the model of asymmetric resonators with fast-acting compression [22]. Other cochlear models, such as the Gammatone filter bank [23], the dual-resonance nonlinear model [24], the chirp filter bank [25] and the transmission-line model from [26], are included as modules in the selected models. We further excluded models whose structure did not contain one or more of the relevant processing stages that we evaluated in this study (Stages 3-6 in Fig. 2). Two models that entered this category are the power spectrum models, EPSM [27] (no stages [3][4][5] and GPSM [10,11] (no stage 5). Lastly, we did not include models focusing on specific psychoacoustic metrics [28][29][30], despite the fact that such models are often based on comparable auditory stages as those described in this study.
For the sake of simplicity, our analyses are focused on the comparison across models rather than on a comparison with experimental data. Nevertheless, we provide experimental references to the simulations that are illustrated throughout this paper. Additionally, to encourage reproducible research in auditory modelling, all our paper figures can be retrieved using AMT 1.1, including (raw) waveform representations of intermediate model outputs.
The paper is structured as follows: In Section 2 we provide a brief description of the processing stages in the selected auditory models. Their specific configurations are described in Section 3. The model comparison is presented in Section 4 and contains a description of the set of test stimuli as well as a detailed numerical description of the simulation results. Section 5 starts with a list of applications of the evaluated models, including some general considerations for the application of auditory models in further modelling work. Note that although the detailed analysis in Section 4 is relevant for model users who are interested in a transparent and accurate description of the illustrated model outputs, readers who are only interested in a bigger picture, as it is covered elsewhere (e.g., [1,2,13]), may wish to go directly to Sections 5 and 6.

Models
We define three model families, classified by their objectives [40], which translate into three different levels of detail in simulating the cochlear processing, as schematised in Figure 1. The selected models are listed in Table 1 and are labelled throughout this paper by the last name of the first author and the year of the corresponding publication. This naming system directly reflects the corresponding model functions implemented in AMT 1.1 [8].
We define the family of biophysical models (Fig. 1a) that use a transmission line consisting of many resonant stages coupled by the cochlear fluid. Biophysical models aim at exploring how the properties of the system emerge from biological-level mechanisms, needing a fine-grained description at this level. The biophysical models are represented by verhulst2015 [34] and its extended version, verhulst2018 [35] (model version 1.2 [41,42]). We further define phenomenological models which primarily predict physiological properties of the system, using a more abstract level of detail than the biophysical models. The phenomenological models considered here rely on dynamically adapted bandpass-filtering stages combined with nonlinear mappings (Fig. 1b) and are represented by zilany2014 [32] and its extended version bruce2018 [36], both combined with the same-frequency inhibitionexcitation (SFIE) stages for subcortical processing [43]. Further approximation is given by functional-effective models [44], which target the simulation of behavioural (perceptual) performance rather than the direct simulation of neural representations. These models usually approximate the cochlear processing by using static bandpass filtering with an optional nonlinear mapping (Fig. 1c). The linear effective models are represented by dau1997 [31] and osses2021 [39] and the nonlinear effective models are represented by king2019 [37] and relanoiborra2019 [38]. Given that for each model a similar level of approxima- Figure 1. Model families used in this study. The families are defined by the level of detail in simulating the cochlear processing and are sorted by their complexity from left to right. a) Biophysical models using a nonlinear transmission line that contains resonating stages coupled by the cochlear fluid; b) Phenomenological models using nonlinear filters dynamically controlled by an outer-hair-cell (OHC) model; c) Functional effective models using linear filters, optionally combined with static nonlinear filters.
tion has been generally used in the design of subsequent model stages, we use the defined categories to reflect the nature of the entire model. The defined categories are not meant to represent a hard boundary for model classification. Hence, we do not discard the existence of other moreor less-detailed models than the selected biophysical and effective models, respectively.
The selected monaural models share common stages of signal processing, as indicated in the schematic diagrams of Figure 2, with some stages even using the same (digital) implementation. Each model stage mimics, with greater or lesser detail, underlying hearing processes along the ascending auditory pathway. The thick vertical lines in Figure 2 indicate the intermediate model outputs which are the basis for our evaluation. Note that these stages are, conceptually speaking, independent of each other. However, because of nonlinear interactions between them, the processing performed by these stages is not commutative and thus requires a step-by-step approach.

Outer ear
The listener's head, torso, and pinna filter incoming sounds. The ear-canal resonance further emphasises frequencies around 3000 Hz [45]. These effects can be accounted for by filtering the sound with a head-related transfer function (HRTF) (e.g., [46]) or by applying a headphone-to-tympanic-membrane transfer function, as used in relanoiborra2019 and osses2021. The other six selected models that do not include an outer-ear filter, implicitly assume that either the outer ear does not introduce a significant effect in the subsequent sound processing chain, or that the sounds are presented near the tympanic membrane, as is the case for a sound presentation using in-ear earphones.

Middle ear
Six of the eight evaluated models include a stage of middle-ear filtering. The transfer functions of the middleear filters used in these models are shown in Figure 3. The transfer functions in verhulst2015 and verhulst2018 approximate the middle-ear forward pressure gain ("M1" in [47]). The humanised zilany2014 and bruce2018 models use a linear middle-ear filter [48,49] that approximates the forward-pressure measurements from [50,51]. The middle-ear filters in relanoiborra2019 and osses2021 are those designed to represent stapes velocity near the oval window of the cochlea [24,52]. These models use the same filter implementation, but the filter in osses2021 includes a gain factor to provide a 0-dB amplitude in the frequency range of the passband and a fixed group-delay compensation.
Middle-ear filtering not only introduces a bandpass characteristic to the incoming signal (Fig. 3), but also affects the operating range of cochlear compression in models relying on nonlinear cochlear processing, i.e., verhulst2015, verhulst2018, zilany2014, bruce2018, and relanoiborra2019. The passband gains of the middle-ear filters are indicated in Table 2 and range between À66.9 dB (relanoiborra2019) and +24 dB (verhulst2015). In nonlinear models, lower and higher passband gains vary the actual input level to the filter bank, shifting the onset of cochlear compression to higher and lower knee points, respectively.

Cochlear filtering
A cochlear filter bank performs a spectral analysis of incoming signals to simulate the mechanical oscillations of the basilar membrane (BM) and organ of Corti at different points along the cochlea. The BM and organ of Corti are thought to have multiple modes of vibration which could drive the transduction currents of the IHCs depending on their amplitudes. The cochlear filtering stages in the evaluated models all aim to describe the filtering properties of the main mode(s) of vibration that explain the frequencytuning curves of AN fibres. Some of the models include further components aimed at capturing the filtering behaviour of additional modes of BM and/or organ of Corti vibration to better explain the level-dependent nonlinear filtering of the cochlea. All the included cochlear filtering stages produce a set of N time-domain signals, for N simulated characteristic frequencies (CFs). Each cochlear section is assumed to either have relatively sharp frequency tuning (Eq. (1), [53]) or broader tuning (Eq. (2), [54]). For CFs expressed in Hz: The models verhulst2015 and verhulst2018 use a transmission-line model fitted to otoacoustic estimates of human cochlear filtering [26], whose outputs represent BM velocity [34,35], which provides the input to a model stage that maps BM velocity to IHC stereociliar deflection. In zilany2014 and bruce2018, the filtering is based on a chirp filter bank [25,55] tuned to a human cochlea [48,49,56], that contains two parallel processing paths of static and outer-hair-cell (OHC) controlled filters (C2 and C1 in [32]), representing BM and organ of Corti motion that drive two separate IHC transduction functions Osses and Kohlrausch (2021) [39] [ 55,57]. The cochlear filters in these models are assumed to be tuned according to equation (1). 1 In dau1997 and osses2021, the linear Gammatone filter bank from [23] is used. King2019 uses the Gammatone filter bank from [23] followed by a compressive stage acting above a given knee point. In relanoiborra2019, the cochlear processing is simulated by the dual-resonance nonlinear (DRNL) filter bank [24]. The cochlear filters of these models are assumed to be tuned according to equation (2).

Inner hair cell
The inner hair cells (IHCs) transform the mechanical BM and organ of Corti oscillations into receptor potentials, subsequently initiating neuronal discharges in the auditory nerve (AN) [58]. In the most simple approach, the IHC processing can be simulated as an envelope extractor that removes phase information for high stimulus frequencies, implemented as a half-wave rectification followed by a lowpass (LP) filter. This approach is used in dau1997, king2019, relanoiborra2019, and osses2021, in which the LP filters have À3-dB cut-off frequencies (f cut-off ) between 1000 and 2000 Hz. In zilany2014, bruce2018, and verhulst2015, a nonlinear transformation is applied to the output of the cochlear filter bank, followed by a cascade of LP filters with f cut-off of 3000 Hz (zilany2014 and bruce2018) and 1000 Hz (verhulst2015). The resulting f cut-off of each model ranges between 642 Hz (verhulst2015) and 1000 Hz (dau1997, relanoiborra2019, king2019), as indicated in Table 2. In verhulst2018, a more sophisticated IHC model is used [59], that is implemented as a threechannel non-spiking Hodgkin-Huxley type model, with each of the channels representing mechanoelectrical and (fast and slow) potassium-gated processing [35,59]. 1 The models verhulst2015 and verhulst2018 use an updated version of equation (1): Q ERB ¼ 11:46 Á CF =1000 ð Þ 0:25 [34,35]. This equation produces sharp tuning curves but with slightly lower Q-factors compared to those given by equation (1), with values between 6.8 (CF = 125 Hz) and 19.3 (CF = 8000 Hz).   Table 2. Model configurations and numeric results. Middle ear: Details of frequency response of the middle-ear filters. Cochlear filter bank: 40 dB and 100 dB refer to filter characteristics between 160 and 8000 Hz in response to white noises of 40 and 100 dB SPL, respectively. IHC: Parameters and frequency response of the LP filter structures. Subcortical processing: Theoretical and estimated BMF of the modulation filter with the closest BMF to 100 Hz and peak-to-peak amplitude of the simulated IC outputs in response to 70-dB-peSPL clicks of positive (A) and negative (ÀA) (see Sect. 4.4 for more details). Performance: Time required to process a 1-s long stimulus using each of the selected auditory models, between Stages 1 and 6 ( Fig. 2). All f cut-off in this table were measured at the À3 dB points of the amplitude spectrum.

Auditory nerve
The transduction from IHC receptor potentials into patterns of neural activity can be derived from the interaction between the IHC and AN. Several AN synapse models have been inspired by the three-store diffusion model [58], assuming that the release of synaptic material is managed in three storage compartments. For steady-sound inputs, this model predicts a rapid neural firing shortly after the sound onset with a decreasing rate towards a plateau discharge rate, a phenomenon called adaptation (e.g., [60]).
The AN synapse models in verhulst2015, verhulst2018, and zilany2014 are based on [58], but zilany2014 further incorporates a power-law adaptation following the diffusion model from [32]. The synapse model in bruce2018 uses a diffusion model based on [61] to: (1) have limited release sites, and (2) come after the power-law adaptation instead of before it [36]. The outputs of these models simulate the firing of neurons having a specific spontaneous rate of high-, medium-, and/or lowspontaneous rates.
The effective models, on the other hand, rely on a more coarse AN simulation, expressed in arbitrary units (a.u.). In king2019, adaptation is simulated by applying a highpass filter with a cut-off frequency of 3 Hz [37]. In dau1997, relanoiborra2019, and osses2021, adaptation is simulated by so-called adaptation loops [44] that introduce a nearly logarithmic compression to stationary input signals and a linear transformation for fast signal fluctuations (Appendix B in [39]). The arbitrary units of these transformed outputs are named model units (MUs).

Subcortical neural processing
AN firing patterns propagate to higher stages along the auditory pathway, first through the auditory brainstem, then towards more cortical regions [62]. On its way, AN spiking is mapped onto fluctuation patterns by neurons that are sensitive to the amplitude of low-frequency fluctuations [63]. This fluctuation sensitivity has been approximated using various approaches. Our analyses focus on model approximations of the modulation processing circuits of the ventral cochlear nucleus (CN) and inferior colliculus (IC) [43], as well as on different modulation-filter-bank variants [27,31]. As a result, we exclude the analysis of other subcortical structures such as those that play a particular role in the binaural interaction between ears (e.g., the dorsal cochlear nucleus and superior olive) [62,64].
The modulation processing in the ventral CN and IC can be simulated using the same-frequency inhibitionexcitation (SFIE) model, resulting in a widely tuned modulation filter (Q-factor % 1) with a best-modulation frequency (BMF) depending on the parameters of the model [33,43]. The SFIE model has already been used in combination with the biophysical and phenomenological models described here. For example, zilany2014 has been combined with the SFIE model using between one and three modulation filters (e.g., [65]). Or, verhulst2015 and verhulst2018 have used the SFIE model with one modulation filter centred at a BMF of 82.4 Hz (see Tab. 2) [35,41]. Further, bruce2018 can be combined with the SFIE model in the UR EAR 2020b toolbox [66]. Note that zilany2014, verhulst2015, and verhulst2018 have used the output of their mean firing rate generator-an output that can be conceptualised as peri-stimulus time histograms (PSTHs) [67]-as an input to the SFIE model. In bruce2018, because of the stochastic processes in its spike generator, repeated processing of the same stimulus is recommended to obtain a faithful PSTH that can appropriately account for power-law adaptation properties (see Sect. 3 in [36]).
The effective models, on the other hand, approximate the subcortical neural processing based on the modulation-filter-bank concept [27,31]. In dau1997, king2019, relanoiborra2019, and osses2021, linear modulation filter banks are used, covering a range of BMFs up to 1000 Hz. In dau1997, twelve modulation filters with a Q-factor of 2 and overlapped at their À3 dB points are used. The same modulation filters are used in relanoiborra2019 and osses2021, but an additional 150-Hz LP filter is applied [27,68] and the number of filters is limited so that the highest BMF is less than a quarter of the corresponding CF [69]. In king2019, the filter bank is used with a wider tuning (Q = 1, as suggested in [27,70]), using ten 50%-overlapped filters having a maximum BMF of 120 Hz [37].

Model configuration
We evaluated the intermediate model outputs that are indicated by thick vertical black lines in Figure 2. The evaluation points are located after the cochlear filter bank (Stage 3), the IHC processing stage (Stage 4), the AN synapse stage or equivalent (Stage 5), and after the IC processing stage or equivalent (Stage 6). Starting with the default parameters of each model, we introduced small adjustments to obtain the most comparable model outputs. All comparisons can be reproduced with the function exp_osses2022 from AMT 1.1 [8].

Level scaling
The same set of sound stimuli was used as input to all models. The waveform amplitudes were assumed to represent sound pressure expressed in Pascals (Pa). The models zilany2014, verhulst2015, verhulst2018, bruce2018, and relanoiborra2019 use this level convention and did not require further level scaling. The models dau1997, king2019, and osses2021 interpret sound pressures between À1 and 1 Pa as amplitudes in the range ±0.5, thus a factor of 0.5 (attenuation by 6 dB) was applied to the generated stimuli to meet the level convention of these models. For these latter models, which include mostly level-independent stages, such calibration is relevant because the adaptation loops (used in dau1997, osses2021, also extensible to relanoiborra2019) include level-dependent scaling (Eqs. (B1)-(B3) in [39]).
In king2019, a calibrated knee point (default of 30 dB) is used in its cochlear compression stage (Stage 3). All signal levels are reported as root-mean-square (rms) values referenced to 20 lPa, in dB sound pressure level (dB SPL).

Cochlear filtering
The phenomenological and effective models can be set to simulate responses at any CF. However, because of the nature of the transmission-line structure, the selected biophysical models have a discrete tonotopy that translates into a discrete set of available CFs.
The models verhulst2015 and verhulst2018 were set to 401 cochlear sections spaced at Dx = 0.068 mm with tonotopic distances x n ranging between x 1 = 3.74 mm and x 401 = 30.9 mm, that are related to CFs between CF 1 = 12010 Hz and CF 401 = 113 Hz, according to the base-to-apex mapping [71], where x n (in mm) can be obtained as x 1 + Dx Á (n À 1), and A = 165.4188 Hz, a = 61.765 1/m, k = 0.85, and A 0 = 20682 Hz. Note that when reporting results, we indicate the cochlear section number n and its corresponding CF n . The cochlear-filtering parameters of zilany2014 and bruce2018 were those adapted to a human cochlea [48,49]. Moreover, in order to analyse separately the effects of cochlear filtering and IHC processing in zilany2014, the C1-and C2-path outputs from the chirp filter bank were added and analysed before the IHC nonlinear mapping was applied. This analysis follows a similar rationale as analysing the main output of the DRNL filter bank in relanoiborra2019 (see Fig. 3a from [24]).
Finally, in king2019, we used a compression factor of 0.3 for all simulated CFs, which is different from the onechannel (on-CF) compression used in [37,72].

Inner hair cell and auditory nerve
Default parameters were used for the IHC and AN stages of the evaluated effective models. However, the biophysical and phenomenological models require the choice of parameters to simulate a population of AN fibres. For each CF we simulated 20 fibres, having either high-(HSR), medium-(MSR), or low-spontaneous rates (LSR), distributed in a 0.6-0.2-0.2 ratio [73,74], resulting in a 12-4-4 configuration (HSR-MSR-LSR). Note that for verhulst2015 and verhulst2018, this deviates from the standard 13-3-3 configuration [34,35]. For verhulst2015 and verhulst2018, the spontaneous rates of each fibre type were 68.5, 10, and 1 spikes/s for HSR, MSR, and LSR, respectively, as used in human-tuned simulations [35]. For zilany2014, the spontaneous rates of each fibre type were 100, 4, and 0.1 spikes/s and for bruce2018 were 70, 4, and 0.1 spikes/s for HSR, MSR, and LSR, respectively. We further disabled the random fractional noise generators in zilany2014 and bruce2018 [75], and the random spontaneous rates in bruce2018 ("std" from Tab. I in [36] was set to zero). With this configuration, the mean-rate synapse outputs of verhulst2015, verhulst2018, zilany2014, and bruce2018 are deterministic. For this reason, to obtain population responses, we simulated the AN processing of each type of neuron only once and then weighted them by factors of 0.6, 0.2, and 0.2 for HSR, MSR, and LSR fibres, respectively. In contrast, the PSTH outputs that are reported for zilany2014 and bruce2018 are not deterministic, requiring the simulation of each AN fibre for each CF. Therefore, PSTH population responses were obtained by counting the average number of spikes in time windows of 0.5 ms across 100 repetitions of the corresponding stimuli.

Subcortical neural processing
The default configuration of the model stages of subcortical processing (Stage 6, Fig. 2) differs in the number of modulation filters (from 1 to 12) and in their tuning across models. In our study, we use only one modulation filter targeting a BMF of approximately 80 Hz (see "Theoretical BMF" in Tab. 2) using a Q-factor of approximately 1 for zilany2014, verhulst2015, verhulst2018, bruce2018, and king2019, and a Q-factor of 2 for dau1997, relanoiborra2019, and osses2021.
For the biophysical and phenomenological models, we used the SFIE model [33,43] using two different configurations. The SFIE model [43] integrated in verhulst2015 and verhulst2018 has CN parameters with excitatory and inhibitory time constants of s exc = 0.5 ms and s inh = 2 ms, a delay D = 1 ms, and a strength of inhibition of S = 0.6. The IC stage uses s exc = 0.5 ms, s inh = 2 ms [35], D = 2 ms, and S = 1.5 [43], achieving a BMF of 82.4 Hz.
For zilany2014 and bruce2018, the SFIE model is a separate stage [33], implemented as carney2015 in AMT 1.1, where either the mean-rate (zilany2014) or the PSTH outputs (bruce2018) are used as inputs. In our analysis, we only used the output of the band-enhanced IC cell, which corresponds to the SFIE model from [43]. The CN parameters were identical to those for the biophysical models. The IC parameters were s exc = 1.11 ms, s inh = 1.67 ms, D = 1.1 ms, and S = 0.9, achieving a BMF of 83.9 Hz [33]. Note the different inhibition strength S between models. In the biophysical models, the IC output is dominated by inhibitory responses (S > 1) whereas in the phenomenological models the IC output is dominated by excitatory responses (S < 1).

Evaluation
In this section we analyse the outputs of the eight selected auditory models in a number of test conditions, whose results are presented in Figures 4-15. We aimed at a comparison across models and thus, for the sake of clarity, we refrained from a direct comparison to ground-truth references from physiological data. However, such a comparison is important and interesting. For this reason, we provide references where similar experimental and/or simulation analyses have been presented. These references are indicated as "Literature" in the caption of the corresponding figure. Alternatively, the outputs of the biophysical and phenomenological models may be considered as referential because they have been primarily developed to reflect physiological (human or animal) responses to sounds. This latter assumption always requires a careful consideration, especially when translating findings from animal to human physiology.

Cochlear filtering
Sound processing in the cochlea depends not only on the frequency but also on the level of the input stimulus [78]. The amplitude of the BM vibration displacement increases for higher levels, following an amplitude growth that comprises linear and compressive regimes [79]. We illustrate this level dependency for a set of pure tones and white noises. The pure tones had frequencies of 500 or 4000 Hz, with a duration of 100 ms. The white noise was generated as a fixed 3-s long waveform with a flat spectrum between 20 and 20000 Hz. All sounds were gated on and off with a 10-ms raised-cosine ramp and had levels of 40, 70, and 100 dB SPL. The obtained responses are shown in Figure 4, which were vertically shifted by the gains indicated in Table 2. These gains were derived for each model using the 1000-Hz pure tone of 100 dB SPL as a reference. The frequency responses were level-dependent for zilany2014, verhulst2015, verhulst2018, king2019, and relanoiborra2019. For verhulst2015, verhulst2018, and relanoiborra2019, we further observed a change in the location of their maximum amplitude (green triangles in Fig. 4, fourth row). Although a shift of the responses with increasing the stimulus level is supported by experimental data [76,78,80], this argument was later challenged [81] and rather attributed to shallower upper tails in the responses. A final observation is that king2019, zilany2014, verhulst2015 and verhulst2018, showed a certain amount of distortion in the frequency responses of 70-and 100-dB SPL sounds (Fig. 4, second and third rows). For the responses to white noises at CF = 502 Hz, the distortions occur in the upper tail of the cochlear responses as a side effect of the (amplitude) compression stage. This frequency-response distortion is most prominent in king2019 due to its . Filter bank responses to pure tones at 500 Hz and 4000 Hz (top two rows) and spectral magnitudes of single-channel responses to white noises (bottom two rows) for sounds of 40-, 70-, or 100-dB SPL (bottom-to-top coloured curves, respectively). In the first two rows, the responses represent excitation patterns at the simulated CFs. The coloured markers indicate the amplitudes at the (off-frequency) CFs one ERB N below (grey) and one ERB N above (yellow or orange) the on-frequency CF (502 or 4013 Hz). These markers are highlighted using the same colours in Figure 5. In the third and fourth rows, the average FFT response of two cochlearfilters (CFs of 502 Hz or 4013 Hz) in response to six 500-ms white noise sections are shown in grey, and the corresponding smoothed responses are shown in colour. This type of responses was used to assess the quality factors of Figure 6. The dashed black lines indicate the corresponding À3-dB filter bandwidths. All responses were shifted vertically by the reference gains given in Table 2 (see the text for details). Literature: Figure 1A from [76], Figure 2C-E from [77], and Figure 2 from [12].
broken-stick compression (rise of the amplitudes to the power of 0.3). 2

Compressive growth
The set of pure tones was further used to assess the curves relating the input stimulus levels with levels at the output of the cochlear filter banks, known as input-output (I/O) curves. For this analysis we included stimulus levels between 0 and 100 dB SPL in steps of 10 dB. The I/O curves were obtained for (1) the on-frequency CF tuned to the frequency of the input stimulus, and (2) the offfrequency responses of cochlear filters tuned to one equivalent rectangular bandwidth number (ERB N ) [54] below and above the stimulus frequency.
The obtained I/O curves are shown in Figure 5 for onfrequency (left panels) and off-frequency simulations (±1 ERB N , middle and right panels). The I/O curves were vertically shifted by the reference gains indicated in Table 2 (as in Fig. 4). As expected for the level-independent Gammatone filters used in dau1997 and osses2021, the curves were linear in all panels of Figure 5. For the remaining models, more compressive behaviour was observed for on-frequency curves (left panels) while more linear curves were obtained for off-frequency CFs (middle and right panels), except for relanoiborra2019 and king2019, that had on-and off-frequency compression.
For zilany2014/bruce2018, the I/O curves were fairly linear in response to 500-Hz tones (top panels) for both on-and off-frequency CFs. For 4000-Hz tones, a prominent compressive behaviour was observed in the on-frequency curves (Fig. 5d) where, additionally, the curve for verhulst2018 turned from a compressive to a linear regime for signal levels above 80 dB. The off-frequency I/O curves obtained for verhulst2018 were similar to those for verhulst2015 but had overall lower and higher amplitudes for the pure tones of 500 Hz (Fig. 5b-c) and 4000 Hz (Fig. 5e-f)), respectively, as a consequence of the differences in their middle-ear filters (see Fig. 3). The tendency to a more linear regime in off-frequency CFs has been shown previously [79]. This is in fact the basis for having compression only applied to the on-frequency channel in king2019 [37,72]. However, the default compression rate of 0.3 for the on-frequency channel with no compression for off-frequency channels leads to an unrealistic level balance between on-and off-frequency channels.

Frequency selectivity: Filter tuning
The frequency selectivity of each filter bank was computed in response to the described frozen noise waveform, presented at 40, 70, and 100 dB SPL. The estimates of frequency selectivity were obtained from FFT responses averaged across 500-ms non-overlapped analysis windows, meaning that the estimates were obtained from six statistically-independent noise sections.
The frequency response of thirty-two filters with CFs between 126 Hz (n = 396 in Eq. (3)) and 9587 Hz (n = 24 in Eq. (3)) at steps of n = 12 bins was obtained. For illustration purposes, we also included in this analysis the on-frequency CFs used in Figure 5 (CF = 502 Hz, n = 305; CF = 4013 Hz, n = 112), whose obtained responses are shown in Figure 4. For each filter response, a quality factor Q À3 dB = CF/BW was obtained, where BW is the bandwidth defined by the lower and upper 3-dB down points of each filter transfer function (black dashed lines in Fig. 4).
The frequency-selectivity simulations for each of the filter banks are shown in Figure 6 (1) and (2) are indicated as light and dark grey traces in Figure 6. Note that with this comparison, we assume that the Q factors within one ERB are similar to Q À3 dB values. The results for 40-dB noises show that the frequency selectivity follows either the analytical tuning of equation (1) (zilany2014, bruce2018, verhust2015, and verhulst2018) or the tuning of equation (2) (dau1997, relanoiborra2019, king2019, and osses2021). When looking at the results for higher levels (Fig. 6b-c), no change in tuning was observed for dau1997 and osses2021, as expected for linear models. For the nonlinear models, the results for 70-dB noises in Figure 6b showed overall lower Q factors, but with only a small change for king2019 and relanoiborra2019. The results for 100-dB noises in Figure 6c showed a further lowering of Q factors in the biophysical and phenomenological models, reaching values as low as Q % 2 in verhulst2015, lower Q factors for frequencies up to about 4000 Hz in relanoiborra2019, and virtually unaffected Q factors in king2019. A closer inspection to the outputs of king2019 revealed that there was a filter broadening as a consequence of its broken-stick nonlinearity stage, but this broadening predominantly affected the frequency responses outside the range defined by the 3-dB bandwidth used to derive the Q factors (see Fig. 4e). To illustrate the Q-factor transition when increasing the signal level in each model, the difference between Q factors obtained from 40-to 100-dB noises is shown in Figure 6d, where a decrease in Q factor with increasing signal level is represented by a positive Q-factor difference.
Additionally, we observed that relanoiborra2019 and king2019 introduce a change in selectivity at overall higher levels compared to the biophysical and phenomenological models. A closer look at this aspect revealed that this change occurs because relanoiborra2019 and king2019 only apply compression after the bandpass filtering and, therefore, lower level signals are used as input for their compression (broken-stick) module.

Frequency selectivity: Number of filters
The number of filters in a filter bank is relevant for several model applications because too few filters can lead to a loss of signal information (e.g., [84]) and too many filters may unnecessarily increase the computational costs. The number of filters is a free parameter in zilany2014/ bruce2018, but is fixed for verhulst2015 and verhulst2018 to yield an accurate precision of the transmission-line solver [85]. The remaining models use by default one ERB-wide bands (dau1997, king2019, and osses2021), or have an overlap every 0.5 ERB N (relanoiborra2019).
Here, we report the minimum number of filters that are required to obtain a filter bank with overlapping at À3-dB points of the individual filter responses. Using the empirical Q-factors of Figure 6, we assessed the number of filters that would be required to cover a frequency range between 126 Hz (n = 396 in Eq. (3)) and the first filter with its upper cut-off frequency equal or greater than 8000 Hz. The number of filters derived from the 40-dB and 100-dB frequency tuning curves (Fig. 6, panels c and d) are shown Literature: Figure 4 from [53] and Figure 4B from [12].
in Table 2, including the average filter bandwidth in ERB N for the corresponding model.
For the biophysical models, the filters were much wider at the higher level than for the other models, with average bandwidths being as wide as 3.05 ERB N for verhulst2015 and 2.30 ERB N for verhulst2018. This contrasts with the 1.57 ERB N for zilany2014 and bruce2018 and the 1.15 ERB N or less for the remaining models. These bandwidths are a consequence of the fastacting (sample-by-sample) compression that is applied just before the transmission-line in the biophysical models and the slower-acting bandwidth control in zilany2014 (denoted as the "control path" in the chirp filter bank). While cochlear filters are generally wider at high sound levels (e.g., [76,86]), the appropriate tuning must be evaluated depending on the species' characteristics, the tested CFs, and the type of evaluated excitation signals.

IHC processing: Phase locking to temporal fine structure
To illustrate the loss in phase locking to temporal fine structure with increasing stimulus frequency, we simulated IHC responses to pure tones with frequencies between 150 Hz (n = 387 in Eq. (3)) and 4013 Hz (n = 112 in Eq. (3)) spaced at n = 25 bins, resulting in 12 test frequencies. The tones were generated at 80 dB SPL, with a duration of 100 ms, and were gated on and off with 5-ms raisedcosine ramps. The simulated waveforms, that are assumed to approximate the IHC potential, are displayed and described in terms of AC (fast-varying) and DC (average bias) components, and the simulated resting potentials (V rest ). The AC potential was assessed from the peak-to-peak amplitudes as V AC = V peak,max À V peak,min .
The obtained IHC waveforms are shown in Figure 7. Within each panel, bottom to top waveforms represent on-frequency simulations for the test signals, from low to high frequency carriers, respectively. For some model outputs, the four highest carriers (1870 f c 4013 Hz) were amplified by a factor of 3 to improve waveform visibility. The simulated voltages before the tone onset, i.e., the resting potential V rest , were equal to 0 for all models except for verhulst2018, where V rest was À57.7 mV (not schematised in Fig. 7). It seems clear, however, that the decrease of peak-to-peak AC voltage towards high frequencies-a measure of the residual amount of temporal fine structure-is significantly different across models. When increasing the CFs from 1099 to 4013 Hz, three models showed  V AC reductions of less than 76.0% (king2019: 59.7%decrease from 3.12 Á 10 À3 to 1.26 Á 10 À3 a.u.; dau1997: 62.6%-decrease from 0.097 to 0.037 a.u.; and verhulst2018: 76.0%-decrease from 39.2 to 9.4 mV), while the other five models showed V AC reductions of at least 92.5%. From the low-frequency IHC waveforms (bottom-most waveforms in each panel), it can be seen that the simulated amplitudes of dau1997, king2019, relanoiborra2019, and osses2021 did not go below their V rest (horizontal grid lines in Fig. 7) as a result of the applied half-wave rectification process. Furthermore, zilany2014/bruce2018 and verhulst2015 have V peak,min amplitudes of À66 mV and À4.7 mV, respectively. Despite the different range in their minimum voltages, there is a strong qualitative resemblance between waveforms (green and red traces in the figure). In fact, both these models use the same type of IHC nonlinearity (compare Eqs. (17)-(18) from [55] with Eqs. (4)-(5) from [34]) and the same LP filter implementation, albeit with a different filter order and cut-off frequency (see Tab. 2).
The obtained AC/DC ratios are shown in Figure 8, where a reduction in phase locking is given by a lower ratio. For all models, the ratio decreased with increasing frequency. All AC/DC curves, except those for verhulst2018, overlap well at low frequencies with ratios between 2.1 and 5.9 (below 1000 Hz), decreasing to ratios between 0.06 (osses2021) and 0.83 (dau1997) at 4013 Hz. Although the AC/DC curve for verhulst2018 showed the highest overall ratios between 137.4 at 460 Hz down to 1.3 at 4013 Hz, due to its nearly zero DC voltages towards low frequencies (see the fairly symmetric waveforms around the horizontal grid in Fig. 7d), we still observed the systematic decrease in ratio with increasing frequency. If we further focus on the AC/DC curves in the frequency range between 600 and 1000 Hz, where the phase-locking is expected to start declining [82], all models showed monotonically decreasing curves starting from about 833 Hz (except for verhulst2018, that always showed a decreasing tendency). The lowest ratios were observed for osses2021, followed by the similarly steep curve of zilany2014. Finally, a similar AC/DC curve was obtained for relanoiborra2019 and verhulst2015.

AN firing patterns
Simulations included AN responses to pure tones and to amplitude-modulated (AM) tones from which rate-level functions expressed as onset and steady-state responses were obtained. With these benchmarks we attempt to characterise model responses at the output of the AN synapse stage or their equivalent, with a particular interest in the phenomenon of adaptation [60,75]. We comment on how adaptation is affected by the type of output of Stage 5, using either the approximations from the effective models, the average or instantaneous firing rate estimates of the phenomenological models (zilany2014, bruce2018), or the average rates of the biophysical models (verhulst2015, verhulst2018).

Adaptation
To illustrate the effect of auditory adaptation, we obtained AN model responses to a 4000-Hz pure tone of 70 dB SPL, duration of 300 ms, that was gated on and off with a cosine ramp of 2.5 ms. The obtained AN responses are shown in Figure 9. All responses had an amplitude overshoot just after the tone onset which then decreased to a plateau (e.g., between 300 and 340 ms, grey dashed lines). After the tone offset (t = 350 ms), the AN responses showed an undershoot with decreased amplitudes that subsequently returned to their resting level. This stereotypical behaviour is related to the AN adaptation process (e.g., [60]).
The waveforms from effective models using the adaptation loops (dau1997, relanoiborra2019, osses2021) are shown in Figure 9a, where their amplitudes had values between À230.5 MU and 1440.2 MU (dau1997), with a strong onset overshoot and a resting position at 0 MU. For king2019 (Fig. 9b), a mild overshoot was observed, whose maximum amplitude (1.52 Á 10 À3 a.u.) was higher in absolute value than that for the undershoot (À1.08 Á 10 À3 a.u.). With an observed steady-state peak-to-peak amplitude of 0.87 Á 10 À3 a.u. king2019 is, at this stage, the model that preserves the most temporal fine structure.
For the phenomenological models (zilany2014 and bruce2018), the simulated waveforms using their two types of AN synapse outputs are shown in Figure 9c-d, based on a PSTH (dark green or brown curves) and mean-rate synapse output (light green or brown curves). The obtained PSTH and mean rate responses in zilany2014 differ in their steady-state values (lower values for the PSTH estimate), while for bruce2018 the difference is in their onset responses, with almost no onset adaptation in the simulated mean-rate output. For the biophysical models (Fig. 9e), the AN synapse outputs represent mean firing rates where a stronger effect of adaptation was observed for verhulst2018 (sky blue), with a plateau after onset that was reached after about 150 ms (at t % 200 ms) while for verhulst2015 (red) the plateau is reached shortly after the tone onset.

Rate-level functions
Rate-level functions were simulated for a 4000-Hz pure tone presented at levels between 0 and 100 dB SPL with a duration of 300 ms, gated on and off with 2.5-ms cosine ramps. The obtained results are shown in Figures 10 and  11 for rate-level curves in the steady-state regime and for onset responses, respectively. For all models, average rates are shown (coloured traces) while for the phenomenological and biophysical models (panels c-h), the simulated response for the three types of neurons (HSR, MSR, and LSR) are shown (grey traces).
For the phenomenological and biophysical models, the discharge curves in Figure 10c-h tend to saturate towards higher levels, which is in line with experimental evidence (e.g., [87]). One difference between these curves is that they start to increase at slightly higher levels for the biophysical (from~20 dB SPL) than for the phenomenological models (from~0 dB SPL).
For the effective models ( Fig. 10a and b), with the exception of relanoiborra2019, the simulated rates did not show saturation as a function of level.  In relanoiborra2019, the simulated rates were between 70.2 and 83 MU for signal levels beyond 40 dB. This saturation effect results from the combined action of the nonlinear cochlear filter (Stage 3) with the later expansion stage (Stage 5, Fig. 2) that precedes the adaptation loops. Despite the overall lack of saturation in the evaluated effective models when looking at the steady-state outputs, a different situation is observed for the onset responses of Figure 11, where the responses of the models using adaptation loops had a prominent onset saturation (dau1997: 1443 MU for levels above 50 dB; relanoiborra2019: 1435 MU for levels above 30 dB; osses2021: 614 MU for levels above 50 dB). Other interesting aspects to highlight are that: (1) almost no onset effect is observed in the mean-rate output of bruce2018; (2) king2019 does not account for any type of saturation as the signal level increases (Figs. 10b and 11b). It should be noted that although hard saturation (as in Fig. 11a) has not been experimentally observed for onset AN responses, a decrease in the rate of growth of onset-rate curves with level is expected [87], a condition that is not met in king2019 or verhulst2015.

AM model responses
Model responses were obtained for a 500-ms 4000-Hz pure tone that was sinusoidally modulated in amplitude (modulation index of 100%) at a rate f mod = 100 Hz, presented at 60 dB SPL, including up/down ramps of 2.5 ms. The initial (0-50 ms) and later (350-400 ms) portions of the simulated responses are shown in the left and right panels of Figure 12, respectively. In all models, the modulation rate of 100 Hz is visible as amplitude fluctuations with the corresponding periodicity of 10 ms. In addition, adaptation was observed with stronger simulated responses immediately after the tone onset (left panels) than during the steady-state portion of the response (right panels).
For the effective models with adaptation loops (dau1997, relanoiborra2019, osses2021), the maximum amplitudes (Fig. 12a, left) were much lower in osses2021 than for dau1997 and relanoiborra2019, due to the stronger overshoot limitation. For these models, it was also observed that their phases are not perfectly aligned due to the outer-and middle-ear filters that introduced a delay into relanoiborra2019 (black traces run "ahead" the blue traces of dau1997), while the group-delay compensation in osses2021 (Sect. 2.2) seemed to overcompensate the alignment of the simulated waveforms (purple traces run "behind" the blue traces). In the right panel, the dynamic range of relanoiborra2019 (black traces) is lower than for osses2021 and dau1997, which have very similar steady-state amplitudes. The reduced dynamic range in relanoiborra2019 is mainly due to the nonlinear cochlear compression of the filter bank that interacts further with the expansion stage. In king2019 (Fig. 12b), a small effect of adaptation was observed with a maximum onset response of 0.88 Á 10 À3 a.u. (left panel) that decreases to a local maximum amplitude of 0.24 Á 10 À3 a.u. during the steady-state response (right panel).
The AN responses produced by verhulst2015 and verhulst2018 (Fig. 12e) showed an overshoot reaching firing rates of 598.5 and 565.2 spikes/s, respectively. After the onset, the overshoot effect quickly disappeared in verhulst2015, reaching a maximum local rate of 251 spikes/s during the second modulation cycle and 222 spikes/s between 370 and 400 ms. In contrast, verhulst2018 adapted more slowly after the onset with a maximum rate of 319 spikes/s in response to the second modulation cycle, while the response continued adapting reaching a maximum rate of 176 spikes/s between times 370 and 400 ms.
For zilany2014 (Fig. 12c) and bruce2018 (Fig. 12d), the mean-rate and PSTH outputs are shown as lighter and darker traces, respectively. It can be observed that in zilany2014, the AM modulations showed a similar mean-rate and PSTH excursions of about 100 spikes/s (Fig. 12b, right: mean rates between 194 and 295 spikes/s; PSTHs with rates between 94 and 196 spikes/s), but the PSTHs had overall lower rates. In bruce2018, a greater AM fluctuation is observed for the PSTH outputs (darker brown traces) with an excursion of 185 spikes/s (Fig. 12d, right: rates between 56 and 241 spikes/s) compared with the 40 spikes/s (rates between 121 and 161 spikes/s) of its mean-rate output. Additionally, bruce2018 showed a limited effect of adaptation in its mean-rate outputs, along with a shallower AM response in comparison to the obtained PSTH. We will not focus on the mean-rate output of this model, because (1) their authors validated the model primarily using PSTHs, recommending the use of the AN synapse output for further processing [36], (2) the model using PSTH outputs can be used as input for subcortical processing stages from the UR EAR toolbox [66], and (3) all the studies that we have so far identified using bruce2018 consistently used PSTH outputs [88,89].
It should be noted that zilany2014, from the same model family, has been extensively validated using both mean-rate and PSTHs outputs. In fact, for studies where psychoacoustic aspects have been investigated (e.g., [65]) there is a tendency to use the mean-rate model outputs.

Synchrony capture
Model responses were obtained for a complex tone of 50 dB SPL formed by three sinusoids of equal peak amplitude and frequencies of 414 Hz (9.6 ERB N ), 650 Hz (12.6 ERB N ), and 1000 Hz (15.6 ERB N ). This type of complex tone with more carriers and greater range of frequencies is commonly used in studies of profile analysis (e.g., [65]) and it is useful to explain an AN property named "synchrony capture" [57,63] that is believed to play a relevant role in the neural coding of supra-threshold speech sounds [33,63]. When synchrony capture occurs, the neural activity in on-frequency channels is driven primarily by one frequency component in the harmonic complex, such that there are minimal fluctuations due to the fundamentalfrequency envelope, while off-frequency channels exhibit fluctuating AN patterns at the fundamental frequency. To illustrate whether the evaluated models account for synchrony capture, the model outputs in response to the described complex tone were obtained for frequencies between 415 Hz (n = 320 in Eq. (3)) and 1007 Hz (n = 245 in Eq. (3)) for CFs spaced at approximately 1 ERB N (Dn = 12 or 13), resulting in three on-CF and four off-CF channels. The obtained simulations are shown in Figure 13 for a 30-ms window (between 220 and 250 ms). For each waveform, a schematic metric of envelope fluctuation was obtained and shown as thick grey lines. Those envelope fluctuations were constructed by connecting consecutive local maxima that had amplitudes above the mean responses (onset excluded) of each simulated channel. Subsequently, the standard deviation of the obtained envelope estimate was (arbitrarily) divided by one thirtieth of the amplitude scales shown in the insets of each panel (e.g., divided by 800/30 MU for dau1997, relanoiborra2019, and osses2021). The obtained estimates were drawn as maroon circles and connected with dashed lines along the right vertical axes in Figure 13 (dimensionless scale with labels between 0 and 6, as indicated in panels a and b), where higher values indicate greater envelope fluctuation variability. The resulting envelope scale allows for a direct comparison between models. In Figure 13 it can be observed that for all models, the on-frequency channels had nearly flat envelope fluctuations. The variability estimate averaged across onfrequency bins (at 415, 662, and 1007 Hz) ranged between 0.11 (king2019) and 1.71 (bruce2018). The variability estimate across off-frequency bins (at 489, 574, 769, and 881 Hz) ranged between 0.74 (king2019) and 4.09 (relanoiborra2019, with a maximum deviation of 6.95 at CF = 881 Hz in Fig. 13g). For all models the off-CF variability was greater than the on-CF variability, with king2019 being the least sensitive model to code envelope fluctuations.

Subcortical neural processing
We show two sets of figures to schematise the subcortical processing of the evaluated models.

Modulation transfer function
The first set of figures represents a modulation transfer function (MTF) in response to 100% AM tones modulated at f mod rates between 10 and 130 Hz (steps of 5 Hz). The tones were centred at 1000 Hz, had a duration of 300 ms, included 5-ms up/down ramps, and were presented at 30 and 70 dB SPL. For this analysis, 100 ms in the last portion of the simulated responses were used (between times 190 and 290 ms). The MTFs were derived from the maximum of the simulated responses. The responses were normalised to the corresponding maximum estimate over the set of tested f mod values, so that the MTF of each model had a maximum value of 1. The resulting MTFs are shown in Figure 14.
The results in Figure 14 show that the models produce bandpass-shaped MTFs with estimated BMFs between 35 Hz (zilany2014) and 70 Hz (dau1997, relanoiborra2019, and osses2021) that are below the theoretical BMFs (see Tab. 2). It is interesting to observe that the sharpest MTFs were obtained not only for dau1997 and osses2021 (both designed with Q = 2), but also for king2019 (which has a Q = 1), while a wider tuning was observed for the remaining models, including relanoiborra2019 (which has a Q = 2).
For the biophysical and phenomenological models, the MTFs obtained for the 70-dB AM tones (Fig. 14b) were different than those obtained for 30 dB (Fig. 14a). For these models, the MTFs were no longer bell-shaped and seemed to act as lowpass filters, which is inline with physiological evidence indicating that regions of "enhancement" in MTFs of low level-signals can become regions of "suppression" for higher presentation levels (see, e.g., Fig. 4 from [92]).
The effective models were more insensitive to the change in presentation levels. The only exception to this is relanoiborra2019, where a narrower MTF was obtained in Figure 14b (compared with panel a). The models dau1997, osses2021, and king2019 have MTFs that are qualitatively similar across presentation levels.

Response to clicks of alternating polarity
The second set of figures focuses on simulating the response to a typical click train as used in the assessment of auditory brainstem responses (ABRs) [95]. We used a click train with a repetition rate of 10 Hz and a duration of 1 s (i.e., containing 10 clicks). The clicks had an alternating polarity (amplitude A or ÀA) and were presented at 70 dB peak-equivalent SPL (dB peSPL) [96], i.e., using A = 0.1789 Pa. Each individual click had a duration of 100 ls. For this processing, the simulated outputs of Stage 6 of each model (see Fig. 2) were averaged across CFs to obtain a broadband representation, i.e., all simulated representations were added together and then divided by the  [93]. Figure 15. Simulated IC responses using one modulation filter (BMF % 80 Hz) to a click train of alternating polarity with a total duration of 1 s, repetition rate of 10 Hz and click duration of 100 ls. Only the responses to the two last clicks are shown, whose peak-to-peak amplitudes are indicated in Table 2. Literature: Figure 1 from [94] and Figures 8-9 from [95].
number of CFs [34,35]. This type of output can be used to derive a peak-to-peak or peak-to-trough amplitude correlate of the wave-V ABR component [95]. For this processing, we used the default number of CFs for the biophysical and effective models, while for zilany2014 and bruce2018, 50 CFs were obtained between CF n = 133.7 Hz (n = 393, Eq. (3)) and CF n = 12010 Hz (n = 1), spaced at n = 8 bins to roughly meet the number of filters from Table 2. The obtained click responses are shown in Figure 15 and are illustrated for the last two clicks (of amplitudes A and ÀA) of the test click train.
The biophysical models provided click responses that had positive and negative amplitudes (Figs. 15e-f), which was not the case for the phenomenological models that also use the SFIE model. This is because verhulst2015 and verhulst2018 assume that a population response can be obtained from the sum of single neuron activity (as, e.g., in [56]), with no half-wave rectification in the SFIE model (a non-explicit choice of the authors [34,35]) that, after scaling [35,41], results in a simplified neural representation that correlates with changes in electrical dipoles visible in scalp-recorded potentials [35].
The effective models, that use the modulation-filterbank concept, showed only positive amplitudes for all filters with BMFs ! 10 Hz [31] due to their envelope extraction, a phase-insensitive ("venelope") processing [37,70]. For modulation frequencies below 10 Hz, the perceptual models preserve the phase information, something that is not illustrated in Figure 15 (nor in Fig. 14).
Finally, the simulated peak-to-peak amplitudes in response to the last positive and negative clicks of the pulse train (ninth and tenth click, shown in Fig. 15) are shown in the entries "Click A" and "Click ÀA" of Table 2. From those amplitudes, it can be observed that there are models that have higher peak-to-peak amplitudes in response to positive clicks (zilany2014, verhulst2015, bruce2018, relanoiborra2019) and others where higher amplitudes are observed in response to the clicks of negative polarity (dau1997, verhulst2018, king2019, osses2021). Although we do not discuss the significance of this polarity sensitivity, this aspect has been a matter of discussion, in particular for electrical hearing, where it has been found that evoked potentials in response to positive and negative polarity clicks represent one of the differences between humans (e.g., [97]) and other mammals (e.g., [98]), whose responses are more sensitive to stimulation with clicks of negative and positive polarities, respectively.

Computational costs
The computational cost required to run each model was measured using the same click train as described in the previous section. Therefore, we assessed the time required to process an input signal of 1-s duration between Stages 1 and 6 of each model (Fig. 2). This metric aims at providing a relative notion of the processing times across models. Note that some model implementations can use parallel processing, which was disabled in this evaluation. The assessment was performed on a personal computer equipped with an Intel Core i5-10210UR, 1.6-GHz processor with 16 GB of RAM memory.
The results of the computational costs used by each model are given in the entry "Performance" of Table 2. The time required by the models to process one frequency channel ranged between~0.02 s (osses2021, dau1997) and 2.5 s (bruce2018). For individual frequency channels, the biophysical models (verhulst2015 and verhulst2018) showed moderate calculation times between 0.3 and 0.8 s. However, these models always require (internally) the simulation of the whole discretised cochlea with 1000 cochlear sections, independent of the number of user-requested cochlear channels (default number of 401 for the Verhulst models). This means for the current simulations, that the reported processing times of 122.9 and 319.5 s for verhulst2015 and verhulst2018, respectively, cannot be further reduced, even if the user requests the simulation of fewer CFs. In contrast, in any model based on a parallel filter bank, including zilany2014 and bruce2018, each cochlear section is independent of each other, and a user-defined number of frequency channels can be simulated, which vastly reduces the computation time for different model configurations.
Due to the long processing time of the evaluated biophysical models, their implementations include an option of parallel processing (also available in the original implementation of bruce2018 [36]), where multiple input signals can be processed simultaneously. The number of signals that can be processed in parallel will depend on the number of threads of the host computer. As a further solution to the long processing time, Stages 2-5 of verhulst2018 (transmission-line, IHC, and AN modules) and bruce2018 (generating mean PSTHs) have been approximated using deep neural networks in [99,100] and [101], respectively.

Models in perspective
The stimuli and comparison measures used in our evaluation (Sect. 4) were chosen to reflect relevant temporal and spectral properties of the models in a normal-hearing condition. Our evaluation provides an objective view, accompanied by a graphical representation of how the model responses reflect specific aspects of the hearing process in their model structure.
The content of this study might be considered as a guideline for model selection, but the motivation was not to select a "winner" among the different evaluated models. We compared models which were verified in different experimental conditions or in connection to different hearing applications, and we only presented raw model outputs (Figs. 7,9,12,13,15) or outputs transformed to characterise specific hearing properties (Figs. 4-6, 8, 10, 11, 14). These outputs reflect model responses to a very specific dataset that may not be suitable to appropriately verify all models. Any model, though, to be informative, needs to be verifiable and falsifiable, such that it is possible to understand both its essential characteristics and features that explain the predictive power in a given range of conditions, as well as its limitations that make transparent where the model fails. In the following sections we provide a brief overview of the context in which each of the selected models has been used and include general recommendations for further applications.

Applications of the evaluated auditory models
Dau1997 is a monaural model that has been used to simulate a number of psychoacoustic tasks including tonein-noise and AM detection experiments using a forcedchoice paradigm (e.g., [31,44]). To enable the model for the comparison between two or more sounds, the output of Stage 6 ( Fig. 2) is used as input to a decision back-end based on a signal-detection-theory (SDT) framework, the template-matching approach. This framework, extended to adopt two templates, has been recently validated to account for the perceptual similarity between two sounds using osses2021 [39].
The models zilany2014 and bruce2018 can account for elevated hearing thresholds due to OHC ("Cochlear gain loss" in Stage 3) or IHC impairment ("IHC loss" in Stage 4) [55]. The AN stage (Stage 5) includes two types of outputs: An actual spike generator and an analytical mean-rate synapse output. The spike generator has been primarily used to simulate physiological data, including the phenomenon of short-and long-term adaptation [75]. The mean-rate synapse output using zilany2014 has been used to simulate specific psychoacoustic tasks [65,102], including speech intelligibility predictions [103].
The models verhulst2015 and verhulst2018 were initially designed to simulate otoacoustic emissions [26] and can account for elevated hearing thresholds due to OHC impairment ("Cochlear gain loss" in Stage 3). Furthermore, they allow to study effects of the gradual disconnection of AN fibres, known as synaptopathy, on auditory brainstem responses [35,104]. When coupled with a decision back-end, they have been used to simulate psychoacoustic performance in simultaneous tonein-noise and high-rate AM tasks (f mod~1 00-120 Hz) [105,106].
The model relanoiborra2019 can predict speech intelligibility [38] when coupled with a decision back-end stage [38,107]. Relying on the prediction power of earlier model implementations [77,108], relanoiborra2019 should be able to (1) account for elevated thresholds based on OHC and IHC impairment [108], and (2) to predict a number of psychoacoustic tasks including simultaneous and forward masking and amplitude modulation [77]. Our results showed that relanoiborra2019 accounts well for hearing properties such as nonlinearities in the cochlear processing and auditory adaptation, including a saturation behaviour similar to that of the AN physiological models.
The model king2019 was designed to simulate perceptual tasks of amplitude-and frequency-modulation detection, primarily at low modulation rates (f mod 20 Hz). The model's decision back-end includes deterministic limitations (suboptimal template matching strategies) or stochastic limitations such as internal additive noise, multiplicative noise [109], and memory noise [72,110]. The model can be adapted to simulate hearing impairment by modifying its compression parameters (knee point and compression rate), and by increasing the bandwidth of the underlying cochlear filters. Despite the simplicity of this model-in fact one of its strengths-we have shown in this paper that the model can account for several of the comparison metrics, with the exception of the broadening of cochlear filters at higher presentation levels (Fig. 6), the adaptation saturation (Figs. [10][11], and the coding of fluctuation profiles (with minimal difference in amplitude fluctuations in Fig. 13).
In the context of this special issue on binaural hearing, it is worth mentioning a number of binaural applications that rely on the evaluated monaural auditory models: The lowpass modulation filter (similar to dau1997) [31,44] served as the basis for a model of binaural masking that uses a decision stage based on the equalisation-cancellation theory [116]. This model was later extended to predict perceptual attributes of room acoustics [117][118][119]. The model zilany2014 has been used to predict (1) the sensitivity to interaural time and level differences by estimating the disparity between left and right AN responses using a decision back-end based on shuffled cross-correlograms [120], and (2) the median-plane sound localisation for various profiles of sensorineural hearing loss (OHC impairment) [121]. Finally, bruce2018 has been used to simulate the lateralisation of high-frequency stimuli in a coincidence-counting model [88].

Simplified auditory representations
When an auditory model is used to broaden our understanding of auditory processes [1,2], it is required that the model be as complete as possible. More details in the model often come at the price of a more computationallyexpensive implementation. Such a level of detail is represented in the selected biophysical and phenomenological models, that attempt to shed light on the mechanisms behind the cellular and neural elements included in auditory processing. On the other hand, effective models have a more epistemic status providing an intelligible but simplified representation of the process. These models can guide the design of new experiments or facilitate the development of listener-targeted products. Such model simplification, however, potentially reduces the number of effects a model can account for, leading to an actual narrowing of its application field. An example of a successful model simplification is presented in [27], where MTFs were simulated using only a stage of envelope extraction followed by a modulation filter bank, omitting the stages of cochlear filtering and auditory adaptation. This model, however, is not thought to predict the performance in listening conditions where the omitted model stages do play a role as it is the case (in this example) for forward-masking tasks.
Peripheral auditory models are often combined with a decision back-end module converting simulated responses into (1) a behavioural response that reflects detectability or discriminability of a sound (e.g., [2]), or into (2) perceptual metrics to estimate, e.g., loudness [28], perceived reverberation [117,118], and sound-source localisation [122,123]. For successful simulations, the decision stage should appropriately weight the information contained in the model representations. An analysis of weighted timefrequency representations (time, audio frequency, and/or modulation frequency) can reveal what portions of the simulated responses are more relevant (e.g., [39,124]).
It is important to note that the simplification of auditory models based on statistical methods or machine learning processes requires a careful interpretation. While these approaches might be well suited to achieve goals such as real-time processing (e.g., [99]) in applications of speech perception (e.g., [101]) or in the prediction of evoked potentials [89], they limit the modular comprehension of each auditory stage, especially if multiple model stages are approximated (as in [89,101]).
In a recent study [89], firing rates of cortical A1 neurons in ferrets were approximated using several time-frequency representations ranging from simple short-time Fourier transforms to more detailed models of AN synapses (including bruce2018) to which a linear-nonlinear (LNL) encoder was used. Based on their separately-fitted encoders, the authors concluded that cortical processing in ferrets perform a "very simple signal transformation," without discussing how different the linear and nonlinear components in each of their encoders were. Despite the success of the authors in approximating neural responses in ferrets, we believe that it is difficult to know whether the "simple transformation" is indeed related to the underlying mechanisms of hearing (the cortical processing in ferrets) or rather is related to the complexity of operations in the fitted encoders.

Considerations for further modelling work
The following is a list of aspects that we recommend to keep in mind for further auditory modelling work, based on the general observations of this study: If the evaluated sounds are assumed to be reproduced via loudspeakers, or supra-aural or circumaural headphones, we recommend to use an outer-ear module as in relanoiborra2019, osses2021, or to apply an HRTF (as in [57]). Although we did not evaluate this effect, such an omission implicitly assumes that the outer ear (compare the grey and black lines for relanoiborra2019 and osses2021 in Fig. 3) does not influence the coding of incoming signals in the ascending auditory pathway.
The results in Figures 6 and 14 show that there are nonlinear interactions between model stages as a function of level and for different types of signals. This suggests that different sets of stimuli are required to characterise the behaviour of complex processes such as that of nonlinear filter banks. In other words, models may not always act as a linear time invariant (LTI) system. In Section 4.1.3, we suggested a minimum number of filters for each filter bank to roughly meet a À3-dB filter crossing (Tab. 2, "40 dB: Number of bands"). The required number of bands may vary from application to application and depend on the type of sounds that are to be simulated. This choice can be particularly critical in models where the number of bands are a free parameter (here zilany2014 and bruce2018). For models that are used as front-ends to machine-learning applications, Lyon [22] suggested a "not-too-sparse set of channels" with about a 50% overlap between filters, i.e., twice the number of channels that we recommend in Table 2. It is important to keep in mind, however, that our estimation was based on model responses to white noises, which are sustained signals in time and broadband in frequency. At higher presentation levels, where nonlinear filter banks act as compressors, similar estimations using sine tones (sustained narrowband signals, as in Fig. 5) or clicks (transient broadband sounds) may result in a different number of required bands. Different simulation results can be expected when evaluating mean-rate and PSTH outputs of models including AN synapse stages as shown in Figures 9-12. The particular choice of the type of output depends on the target application of the model. The spike generator is primarily used to simulate physiological data (e.g., [36,75], while the mean-rate synapse output is typically used to simulate specific psychoacoustic tasks (e.g., [65,102]). The choice of a set of stimuli to test and validate a specific model is crucial. As we stated in Section 1, the simulation of "unseen" (arbitrary) sounds may produce model outputs that have not been previously validated (or at least not reported) by the model developers. Actually, an unexpected model behaviour may not be strictly related to an unseen sound, but rather to an unseen sound property. For example, the models with adaptation loops have historically had an oversensitivity to transient sounds (e.g., [39,125,126]), leading to model versions with limited onset responses to counteract this effect [31,39] or have used stimuli with smoother onsets in their evaluation.
When using large datasets, where the stimuli are split into training and validation data (e.g., [16,101,112]), the stimuli should contain representative samples of the relevant sound properties that the model user wishes to test. A practice like this can help to support (or not) the applicability of a specific model to sounds that may have not been even validated before, the "unseen sounds", reducing (or generating awareness of) the potential limitations of the test model.

Conclusions
In this study we compared eight monaural models of human auditory processing that simulate responses-with different levels of accuracy-up to the level of the inferior colliculus in the midbrain. We described and quantified the similarities and differences among model implementations and derived a minimum number of filters required for those stages to ensure the preservation of auditory information based on our estimates of frequency selectivity.
We showed that despite the differences in model design that result in more physiologically-(biophysical and phenomenological models) or perceptually-plausible approximations (effective models), all the models can account for a number of basic hearing properties. Examples of these properties are the phase-locking reduction in inner-hair-cell processing and the phenomenon of auditory adaptation. Still, an in-depth understanding of each of the model stages is required when selecting a model for a specific application. We encourage future users to be explicitly aware of the specific datasets of sounds and experimental paradigms upon which their models have been evaluated, as well as of other underlying model limitations. To this end, a comparison across model implementations provides a guideline for their selection and an excellent way to challenge the capabilities of different models.

Data availability statement
The implementations of the evaluated models (see Tab. 1) and the model comparison (function exp_osses2022 to reproduce Figs. 4-15) are publicly available as part of the AMT toolbox (https://www. amtoolbox.org) [8] as of version 1.1.0 [9].