Effect of phoneme variations on blind reverberation time estimation

This study focuses on an unexplored aspect of the performance of algorithms for blind reverberation time (T) estimation: the effect that a speech signal's phonetic content has on the value of the T estimate obtained from the reverberant version of that signal. To this end, the performance of three algorithms is assessed on a set of logatome recordings artificially reverberated with room impulse responses from four rooms with T20 values in the [0.18, 0.55] s interval. Analyses of variance showed that the null hypotheses of equal means of estimation errors can be rejected at the 0.05 significance level for the interaction terms between the factors "vowel", "consonant", and "room". The results of Tukey's multiple comparison procedure revealed both similarities and differences in the behaviour of the algorithms, where the differences stem from implementation details such as the number of frequency bands and whether T is estimated continuously or only on selected, so-called speech decay, segments of the signal.


Introduction
Reverberation time (T) is one of the most important objective measures indicating the severity of reverberation in an enclosure. It is the time it takes for a steady-state sound energy level to decay by 60 dB after an abrupt cessation of the sound source [1]. Since the acoustic absorption coefficients of room boundaries, objects, and the contained air are all frequency dependent, reverberation time is also frequency dependent and, hence, commonly measured in octave or 1/3-octave bands using existing standardised methods [1][2][3].
Due to the detrimental effects reverberation has on speech signals captured when microphones (e.g., the ones in personal electronic devices such as mobile phones, videoconference systems, and hearing aids) are positioned in the far field of a sound source (e.g., a person speaking), a number of algorithms for reverberation time estimation from the received reverberant speech signals, the so-called blind algorithms, have been developed in the last two decades. An estimate of reverberation time obtained with these algorithms can then be used in the process of speech dereverberation for either speech enhancement or speech/speaker recognition purposes [4][5][6].
One of possibly many classifications of these algorithms, a classification that is based on their complexity, was proposed in a paper by Eaton et al. [7], where the algorithms were grouped into three classes: Analytical with or without Bias Compensation (ABC) class, Single Feature with Mapping (SFM) class, and Machine Learning with Multiple Features (MLMF) class.
Algorithms of the ABC class are based on a stochastic time-domain model of room impulse responses (RIRs), with no information on speech signal characteristics included. One of the first algorithms of this class was developed by Ratnam et al. [8,9]. The variance of blind T estimates was reduced with the introduction of a sound decay detection step first proposed by Kendrick et al. [10], and later by Löllmann et al. [11] who developed a different variant of the step. Further work presented modifications either to the model or to the preprocessing of the reverberant signal [12][13][14][15][16].
For the second class, SFM, there exists an intermediate feature whose value is estimated and from which, via a polynomial mapping function, a reverberation time estimate is obtained. This feature is commonly a statistic related to the distribution of decay rates obtained in the time-frequency (TF) domain [17][18][19][20][21]. The last class, MLMF, contains classifiers trained on a large speech corpus that was artificially reverberated with a set of room impulse responses [12,22,23].
In the same paper, "The ACE Challenge" [7], both the most recent and the most comprehensive comparative study of the performance of blind reverberation time estimation algorithms, the performance of 25 algorithms was assessed on a set of reverberant speech signals. Since the main objective of the Challenge was to assess the influence of the additive noise term on estimation quality, the reverberant speech signals had various types and levels of measured noise added to them. The results showed that the machine learning algorithms were outperformed by the less complex algorithms of the ABC and SFM classes. Interestingly, the results of the ACE Challenge are in agreement with the results of an earlier investigation conducted by Shabtai et al. [24], which indicated that a simpler approach for blind room volume estimation, one that uses abrupt stops present in the reverberant speech signal, outperforms more complex approaches based on dereverberation or the use of speech recognition features.
Another very comprehensive and equally important study was performed by Kinoshita et al. [25], whose results stress the importance of having an accurate blind estimate of T for the performance of a class of noise-robust state-of-the-art speech enhancement algorithms. In their paper [25], the performance of 25 algorithms for speech enhancement (SE) and of 49 algorithms for automatic speech recognition (ASR) was assessed on artificially reverberated speech utterances (T values of 0.3, 0.6, and 0.7 s were considered) as well as on speech recordings from a room with a T of 0.7 s. For the SE task, the quality of the dereverberated speech signals was evaluated using both objective measures and a listening test. When one-channel algorithms were considered, the second best performing SE algorithm was from the "Statistical RIR modelling" class, a class that needs a blind estimate of reverberation time, which is then used in a statistical, time-domain model of room impulse responses. Finally, the dereverberated speech signal is obtained with spectral subtraction, a technique of low computational complexity that is robust against additive noise.
Also in 2016, the results of a short study inspired by Ratnam's and Kendrick's initial observations, based on a single speech recording, that a speech signal's offsets introduce errors in the blind T estimation process [8,12] were presented in a conference paper [26]. The data indicated that there exists a relationship between the values of the estimates obtained with one blind T estimation algorithm and the speech signal, calling for a more in-depth and extensive follow-up study, one that could answer the question of whether this was an isolated case or a general rule. Therefore, the main objective of this study is to explore the outputs of several state-of-the-art blind reverberation time estimation algorithms and discover how they are related to the characteristics of the speech signal: the type and order of the phonemes and, additionally, the speaker's sex. To this end, three of the best performing algorithms from the ACE Challenge were selected for the statistical analysis of their T estimates.
The remainder of this paper is structured as follows: in Section 2, the speech corpus, room impulse responses, and algorithms for blind reverberation time estimation are introduced. This section closes with the details of the T acquisition steps. The results of the analyses of variance and multiple comparison tests of reverberation time estimation errors are presented in Section 3 and discussed in Section 4. Finally, Section 5 summarises the major findings and discusses the objective, speech corpus related, limitations of this study.

Methodology
In this section, the speech corpus and room impulse responses are introduced and their characteristics presented. After that, the principles of operation of the three selected state-of-the-art algorithms for blind reverberation time estimation are stated. The section closes with the details of the blind reverberation time estimate acquisition process.

Speech corpus
The OLdenburg LOgatome (OLLO) corpus version 2.0 [27], a large speech corpus freely available for research purposes that holds recordings of 150 logatomes uttered by 50 speakers of both sexes (25 women), was selected as the most appropriate corpus for this investigation's objectives. It consists of a set of 80 logatomes of the consonant-vowel-consonant (CVC) form and a set of 70 vowel-consonant-vowel (VCV) logatomes, where each of these 150 logatomes was uttered three times by 40 German and 10 French speakers in their normal speaking style. The logatomes that make up those two sets are presented in Tables 1 and 2. In this study, the VCV set of logatomes will stand as a lexical representative for the languages with a predominance of words finishing with an open syllable (such as Italian), while the CVC set will perform an equivalent function for the languages with a predominance of words finishing with a closed syllable, one of them being the English language [28,29].

Room impulse responses
Thirty-two room impulse responses were taken from the Aachen Impulse Response (AIR) database v1.4 [30], of which six were measured in the room named "booth", another six in the "office" room, while 10 impulse responses were from the "meeting" room and another 10 from the "lecture" room. In addition to these room impulse responses from the AIR database, three room impulse responses measured in the listening room of the Department of Electroacoustics, at the Faculty of Electrical Engineering and Computing, University of Zagreb (Croatia), were also used. All of these 35 RIRs were obtained without the use of a dummy head and with a sampling rate of 48 kHz. For each RIR, the associated ground truth value of reverberation time T20 was obtained from its full-band energy decay curve (EDC) in the [-5, -25] dB interval, in accordance with the ISO 3382 standard [1]. The average full-band T20 values for the five rooms were as follows: "booth" 0.18 s, "meeting" room 0.28 s, "listening" room 0.47 s, "office" 0.55 s, and "lecture" room 0.82 s.
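The ground-truth extraction step described above can be sketched as follows. This is a minimal illustration (in Python rather than the Matlab used in the study, and with our own function name) of Schroeder backward integration and a least-squares fit on the [-5, -25] dB range of the EDC, not the exact evaluation code used in the study:

```python
import numpy as np

def t20_from_rir(rir, fs):
    """Estimate T20 from a room impulse response via Schroeder
    backward integration and a linear fit on the [-5, -25] dB
    range of the energy decay curve (cf. ISO 3382)."""
    # Schroeder backward integration -> energy decay curve in dB
    edc = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    # Samples where the EDC lies within the [-5, -25] dB interval
    idx = np.where((edc_db <= -5.0) & (edc_db >= -25.0))[0]
    t = idx / fs
    # Least-squares line gives the decay rate in dB/s
    slope, _ = np.polyfit(t, edc_db[idx], 1)
    # Extrapolate the fitted slope to a 60 dB decay
    return -60.0 / slope
```

For an ideal exponential decay the extrapolated value coincides with the true decay time; for measured RIRs the noisy tail should be truncated first, which is omitted here for brevity.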
According to a relatively recent reverberation time measurement study by Diaz and Pedrero [31], in which the authors presented the values of reverberation time for more than 8000 bedrooms and 3000 living rooms, the average value of reverberation time for furnished spaces can be found in the [0.26, 0.77] s interval; the average T value across rooms was smallest in the 4 kHz octave band for the rooms in the 10-20 m³ range (0.26 s), while the largest average value of T occurred for the rooms in the 90-100 m³ range in the 125 Hz octave band (0.77 s). Thus, it can be concluded that the selected set of measured RIRs, with its ground truth values of reverberation time in the [0.18, 0.82] s interval, covers well the interval of the most common T values found in living spaces nowadays.

Blind reverberation time estimation algorithms
The first algorithm for blind reverberation time estimation used in this study was developed by Eaton et al. [18]. In this algorithm, the sound decay rates are estimated on the mel-STFT signal representation for each mel-frequency band separately by applying a least-squares linear fit to the log-energy envelopes of the signal. It was shown that the negative-side variance (NSV), defined as the variance of the negative gradients in the distribution of decay rates, is related to reverberation time and is, therefore, used as the input to a polynomial mapping function whose output is the estimate of reverberation time. For this algorithm, the estimated level of signal-to-noise ratio (SNR) defines which decay rates will be chosen to form the aforementioned distribution.
In this investigation, a copy of the Matlab implementation of the algorithm [32], provided on-line by its authors, was used. During the estimation process, the values of the parameters were not changed from the originally set values and were as follows: frame length -16 ms, overlap -25%, and the number of mel-frequency bands -31. Since in this study no noise was added to the reverberant signal, and in order to make this algorithm's results comparable to the results of the remaining two algorithms that do not adapt to the changes in the SNR, the Matlab code was adjusted so that it automatically gives an estimate of T as if the estimated SNR had the highest possible value. This is the mode of operation where all the negative gradients (from all the mel-bands and time frames) form the distribution from which the NSV is calculated.
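The NSV feature itself can be illustrated with a short sketch. This is not the authors' released implementation [32]: simple linear frequency bands stand in for the 31 mel bands, frames do not overlap, and the 10-frame gradient window is our own choice.

```python
import numpy as np

def nsv(x, fs, frame_len=0.016, n_bands=31, grad_win=10):
    """Sketch of the negative-side-variance (NSV) feature: per-band
    log-energy envelopes, least-squares gradients over a sliding
    window, then the variance of the negative gradients only."""
    hop = int(frame_len * fs)                       # no overlap here
    n_frames = len(x) // hop
    frames = x[:n_frames * hop].reshape(n_frames, hop)
    spec = np.abs(np.fft.rfft(frames * np.hanning(hop), axis=1)) ** 2
    # Crude linear band grouping stands in for the mel filterbank
    bands = np.array_split(spec, n_bands, axis=1)
    env = np.stack([10 * np.log10(b.sum(axis=1) + 1e-12) for b in bands])
    t = np.arange(grad_win) * frame_len
    grads = []
    for band_env in env:
        for i in range(len(band_env) - grad_win):
            # Least-squares linear fit to the log-energy envelope
            grads.append(np.polyfit(t, band_env[i:i + grad_win], 1)[0])
    grads = np.asarray(grads)
    return float(np.var(grads[grads < 0]))          # the NSV
```

In the full algorithm this value is then passed through a polynomial mapping function to obtain the T estimate; that mapping is omitted here.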
In the second algorithm, developed by Prego et al. [20,21], the blind estimation of reverberation time is carried out in the STFT domain for each frequency band independently on detected speech free-decay regions. Estimation of T is performed with a procedure that closely resembles Schroeder's standardised method, with the upper decibel limit of the energy decay curve calculated for a detected speech free-decay region set to -5 dB. The lower limit is set by finding the decibel range with the highest value of the regression coefficient when the ranges -60, -40, -20, and -10 dB are considered, in that order of preference. The final reverberation time estimate is obtained as the output of a linear mapping function, which serves to correct for the differences in the dynamic ranges of the detected free decays, with the median value of all previously calculated sub-band T estimate medians serving as its input.
An implementation of the algorithm in the form of a Matlab script, with no SNR compensation step included, was kindly provided by its authors. During the algorithm evaluation process, the values of the parameters were not changed from the original ones and were as follows: frame length -50 ms, overlap -25%, and the number of frequency bands -1025.
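A heavily simplified sketch of the free-decay idea is given below. It keeps only the core of the method (per-band detection of monotonically decreasing log-energy runs and the median-of-medians pooling) and omits the -5 dB upper limit, the adaptive lower-limit selection, and the final linear mapping, replacing the per-decay regression with a two-point slope; all parameter values are our own choices.

```python
import numpy as np

def prego_sketch(x, fs, frame_len=0.05, min_run=4):
    """Simplified sub-band free-decay T estimation: per STFT bin,
    find monotonically decreasing log-energy runs, convert each
    run's slope to a T value, then take the median of the per-band
    medians."""
    hop = int(frame_len * fs)
    n = len(x) // hop
    frames = x[:n * hop].reshape(n, hop) * np.hanning(hop)
    e_db = (10 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) ** 2
                          + 1e-12)).T              # (bands, frames)
    band_medians = []
    for band in e_db:
        ts = []
        i = 0
        while i < len(band) - 1:
            j = i
            while j + 1 < len(band) and band[j + 1] < band[j]:
                j += 1                              # extend the decay run
            if j - i + 1 >= min_run:                # long enough free decay
                slope = (band[j] - band[i]) / ((j - i) * frame_len)
                ts.append(-60.0 / slope)            # dB/s -> T
            i = j + 1
        if ts:
            band_medians.append(np.median(ts))
    return float(np.median(band_medians)) if band_medians else None
```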
The last algorithm whose sensitivity to speech signal characteristics is assessed, developed by Löllmann et al. [11], is an algorithm in which the estimates of reverberation time are acquired from the time-domain version of the input reverberant speech signal. For this algorithm as well, the detected speech decay sequences serve as an approximation of the true RIR. Since RIRs are modelled with Polack's stochastic model [4], the reverberation time estimate is obtained as the parameter value for which the likelihood of the observed decay sequence is maximal. In order for the speech decays to be detected, the reverberant signal is processed using 2 s long overlapping rectangular windows with a 20 ms shift. Each frame is then partitioned into 30 subframes of equal length and their respective energies are calculated, after which the frame is checked for a monotonic decrease in energy between at least three adjacent subframes starting from the first. If a monotonic decrease in energy is found, the corresponding subframes are used for the maximum-likelihood estimation of reverberation time.
In this study, an implementation of the algorithm was created in Matlab by following the steps outlined in [11]. In order to reduce the variance of the estimates, only the estimates obtained from the speech decay sequences with a dynamic range higher than or equal to 20 dB were kept. After quantization of those estimates in steps of 25 ms, a histogram was formed, and the final estimate was calculated as the mode of the distribution.

[Table 1. CVC set of logatomes. For a given row, ten logatomes are constructed using the pair of the 1st and 3rd phonemes combined with one of the vowels given in the 2nd phoneme column. Columns: 1st phoneme, 2nd phoneme, 3rd phoneme.]
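The maximum-likelihood step can be illustrated with a small profile-likelihood grid search under Polack's model, where a detected decay x[k] is modelled as zero-mean Gaussian noise with exponentially decaying variance. The 25 ms grid step mirrors the quantization mentioned above, but the function below is a sketch, not the implementation used in this study:

```python
import numpy as np

def ml_decay_t(x, fs, t_grid=np.arange(0.1, 1.01, 0.025)):
    """Maximum-likelihood decay fit of one detected speech decay under
    Polack's model: x[k] ~ N(0, sigma^2 * exp(-2*rho*k/fs)), with
    rho = 3*ln(10)/T.  Profile likelihood over a grid of T values."""
    k = np.arange(len(x))
    best_t, best_ll = None, -np.inf
    for t in t_grid:
        rho = 3.0 * np.log(10.0) / t
        var_env = np.exp(-2.0 * rho * k / fs)     # variance envelope
        sigma2 = np.mean(x ** 2 / var_env)        # ML sigma^2 for this T
        ll = -0.5 * np.sum(np.log(sigma2 * var_env)
                           + x ** 2 / (sigma2 * var_env))
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t
```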

Data acquisition
During the initial testing phase, it was observed that the two algorithms operating in the time-frequency domain need recordings longer than a speech file containing the three concatenated repetitions of a logatome by one speaker (about 5 s) in order to give a T estimate. Therefore, for women and men separately, the three repetitions of a logatome uttered by 25 speakers were concatenated into a single speech file. The resulting 300 speech files (150 for women and 150 for men) were then artificially reverberated using each of the 35 measured RIRs. Before their convolution with the speech files, the room impulse responses were down-sampled to 16 kHz in order to conform to this lower sampling rate of the speech corpus. Finally, blind estimation of reverberation time was carried out using the algorithms from Section 2.3, after which the estimation errors were calculated as the difference between the estimate and the ground truth T20 value of the room impulse response.
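This reverberation and error-computation step can be sketched as below; `resample_poly` and `fftconvolve` are our own choices of tools, since the text does not state how the down-sampling and convolution were carried out:

```python
import numpy as np
from scipy.signal import resample_poly, fftconvolve

def reverberate(speech_16k, rir_48k):
    """Down-sample a 48 kHz RIR to the corpus rate (16 kHz, i.e. by a
    factor of 3) and convolve it with the concatenated recording."""
    rir_16k = resample_poly(rir_48k, up=1, down=3)
    return fftconvolve(speech_16k, rir_16k)

def estimation_error(t_estimate, t20_ground_truth):
    """Error convention used throughout: estimate minus ground truth."""
    return t_estimate - t20_ground_truth
```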
The inspection of the estimation errors revealed that two of the algorithms performed quite unsatisfactorily for the RIRs from the room with the largest ground truth T 20 ("lecture"); the first one with a mean error of 0.2 s and a 2.5 ms wide 95% confidence interval, and the second one with a mean error of 0.45 s and a 2 ms wide 95% confidence interval. Given these problems with accuracy, it was decided not to use any of the estimates obtained from the "lecture" room in the following analyses of variance (ANOVAs).
Finally, before the analyses of variance were performed, the normality of the data was verified for the six datasets (one per combination of the three algorithms and the two logatome sets) using the Jarque-Bera test with the significance level set to 0.05.
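As an illustration, this normality check can be performed with the Jarque-Bera test available in SciPy; the helper name and the data used in any example are ours:

```python
import numpy as np
from scipy.stats import jarque_bera

def errors_look_normal(errors, alpha=0.05):
    """Jarque-Bera normality check as used before the ANOVAs: returns
    True when the null hypothesis of normality cannot be rejected at
    the given significance level."""
    stat, p = jarque_bera(errors)
    return bool(p > alpha)
```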

Analysis of variance
Four-way ANOVA tests with interaction terms, the α value set to 0.05, and type-3 sums of squares were performed on the aforementioned six datasets, with the factors being "speakers' sex", "consonant", "vowel", and "room".
Although some of the interactions of the factor "speakers' sex" with the other factors were statistically significant (p < 0.05), Tukey's multiple comparison tests revealed that those interactions were ordinal, while the values of the effect size measure η² for the interactions of this factor with the remaining three factors were in the [0.07, 2.22]% interval. Furthermore, across the datasets, the differences between the marginal means of the female (F) and male (M) levels were in the [10, 30] ms range, with the η² of the "speakers' sex" factor itself in the [0.5, 4.58]% interval.
Since those differences were relatively small, both when compared with the order of magnitude of the estimation errors that occur when standardised methods for reverberation time measurement are used, and when compared with the differences observed between the marginal means of the levels of the other factors, the ANOVAs were performed again, this time without the "speakers' sex" factor. The degrees of freedom (df), the F statistic, the p value, and the effect size measure η² for each factor and their interaction terms are presented in Table 3 for the CVC set and in Table 4 for the VCV set.

CVC set of logatomes
Figures 1-3 display the marginal means for the three interactions ("consonant" × "room", "vowel" × "room", and "vowel" × "consonant") for the CVC set of logatomes. For Eaton's algorithm (Fig. 1a), a pattern is present for the first three rooms: the CVCs of the /fVf/ and /zVs/ type give estimates of reverberation time that are significantly higher in value than those for the CVCs that contain plosives, which is reflected in the η² of 9.58 for the "consonant" factor. Furthermore, it can be observed that the estimates for the plosive CVCs are not significantly different from each other, indicating that it is primarily the manner of articulation of the consonants that defines the value of the estimate. The results for the last room ("office") are somewhat different, indicating the sensitivity of this algorithm to larger values of reverberation time, and are responsible for a p < 0.001 for the "consonant" × "room" interaction, with an η² of 4.26.
As for the "vowel" × "room" interaction (Fig. 1b), small and generally statistically non-significant upward shifts in the value of the marginal means are present between the adjacent vowels for the first three rooms, while for the "office" room the shifts are larger and cause a p < 0.001 for this interaction term, but with a small η² of 2.82.
Finally, Figure 1c shows the "vowel" × "consonant" interaction (with a small effect size of η² = 2.01), indicating that, when averaging across rooms, the distance between the marginal means for the fricative CVCs and the plosive CVCs depends on the central vowel.
The "consonant" × "room" interaction for Prego's algorithm is presented in Figure 2a. The marginal means for the /fVf/ and /zVs/ type of CVCs are significantly higher than those for the logatomes that contain plosives, as reflected in a p < 0.001 for the "consonant" term with η² = 14.39. Furthermore, the distance between the means depends on the value of reverberation time, and the marginal means for the CVC pairs that differ from each other only in the voicing of the first plosive (e.g., /dVt/ and /tVt/) are not statistically different.
From Figure 2b, it can be observed that the "vowel" × "room" interaction is similar to the one presented in Figure 1b: the marginal means increase from /a/ to /ʊ/, and from /a:/ to /u/. Figure 2c shows the "vowel" × "consonant" interaction, revealing a very consistent pattern across consonants for the vowels /a/, /ɛ/, /ɪ/, /a:/, /e/, and /i/, with η² = 7.94 for this interaction term.
The panel showing the "consonant" × "room" interaction for Löllmann's algorithm (Fig. 3a) indicates that the difference in the marginal means between the /fVf/ and /zVs/ CVCs and the CVC logatomes that contain plosives is significant and depends on the value of reverberation time (with η² = 9.23 for the interaction term and η² = 11.04 for the "consonant" term). Again, the place of articulation of a plosive is not a predictor of the value of the estimate, while the influence of vowels changes with the change in the true value of reverberation time (Fig. 3b), reflected in a p < 0.001 for the "vowel" × "room" interaction with η² = 5.80.

VCV set of logatomes
Figures 4-6 display the marginal means for the three interactions ("consonant" × "room", "vowel" × "room", and "vowel" × "consonant") for the VCV set of logatomes. Figure 4 gives the results for Eaton's algorithm and shows that, for all four rooms, the estimates obtained from the VCVs with an unvoiced plosive or the affricate /ts/ are significantly lower in value than the estimates obtained from the nasals /m/ and /n/, the voiced fricative /v/, and the liquid /l/, reflected in an η² of 13.66 for the "consonant" term and a very small η² value of 1.87 for its interaction with "room". The marginal means for the voiced plosives and unvoiced fricatives can be found in between.
As for the "vowel" × "room" interaction, Figure 4b shows that the T estimates are lower for the vowels /a/ and /ɛ/ than for the remaining three vowels, with the value of the difference between the vowels /a/ and /ʊ/ depending on the factor "room", where η² = 7.90 for the interaction term and η² = 15.70 for the "vowel" term.
Finally, panel 4c indicates that the influence of consonants in the VCV type of logatomes is very consistent across vowels (and, therefore, the interaction term has a very small effect size, η² = 1.35): the VCV logatomes that contain either an unvoiced plosive or the affricate /ts/ produce significantly lower estimates of reverberation time than the VCVs with a nasal /m/ or /n/, the voiced fricative /v/, or the liquid /l/, while the estimates for the remaining central consonants from Table 2 can be found in between.
The results of the multiple comparison test for Prego's algorithm are presented in Figure 5. For this algorithm, the behaviour across consonants is different from the one observed in Figure 4a but, more importantly, it is very consistent across rooms, which is reflected in a negligible η² of 0.28 for the interaction term and a larger η² of 3.74 for the "consonant" term. Figure 5b reveals the strong influence that vowels have on this algorithm's output; the differences in the values of the marginal means between the VCVs that contain an /a/ and those that contain an /ʊ/ are from 0.1 s to 0.25 s, depending on the room in question, where the effect of the interaction is small (η² = 3.14), while the effect of the "vowel" term is very large (η² = 44.97). Figure 5c presents the "vowel" × "consonant" interaction (η² = 7.18), where a pattern is visible: in the case the outer vowel is an /a/ or an /ɛ/, the value of the reverberation time estimate is essentially the same regardless of the central consonant while, when going from the vowel /ɪ/, to /ɔ/, and finally /ʊ/, the influence of the type of consonant embedded in the VCV on the value of the estimate increases drastically. The change in the value of the estimate across vowels for a fixed consonant is largest for the voiced fricative /v/ and smallest for the unvoiced fricative /ʃ/. Finally, Figure 6 presents the interactions between the factors for Löllmann's algorithm. The influence of consonants strongly depends on the "room" factor, as shown in panel 6a, with an η² of 9.72.
For the first room ("booth"), the inter-consonant differences are on the order of hundreds of milliseconds: the estimates of reverberation time for the VCVs that contain an unvoiced plosive or an affricate are approximately 0.25 s lower than those for the VCV logatomes that contain a nasal or a liquid. As the true value of reverberation time for "booth" is 0.18 s, this observed difference is almost 140% of that value. For the other two rooms from the Aachen database, the influence of the central consonant is also quite strong and statistically significant between some consonant pairs. The same cannot be said for the "listening" room, where the values for the nasals and the liquid are indeed again higher but the difference is not statistically significant.
Panel 6b indicates that for the smaller values of reverberation time (the first three rooms) the influence of vowels is not significant while, for the fourth room, a pattern similar to the one previously observed for Eaton's algorithm emerges, causing a p < 0.001 with η² = 4.01 for this interaction term. Unlike the results for the previous two algorithms, where a relatively high level of consistency of both the vowel and consonant behaviour was observed across rooms, panel 6c is not so informative, since it hides the fact that the influence of consonants depends quite strongly on the "room" factor, that is, on the true value of reverberation time.

Discussion
In this section, the most likely sources of the patterns of reverberation time estimation errors observed in the previous section are proposed. They are based on the results of analyses of the spectral characteristics of the phonemes used, the ratios of the spectra of adjacent speech sounds, and the durations of these phonemes.

CVC set of logatomes
For all three blind reverberation time estimation algorithms, the following regularities could be observed in Figures 1-3:
- when averaging the estimates across vowels (i.e., the "consonant" × "room" interaction), the CVCs that contained fricatives produced a higher estimate of T than the CVCs that contained plosives;
- when averaging the estimates across consonants (i.e., the "vowel" × "room" interaction): for the TF algorithms, the CVCs that contained the vowel /a/ gave lower T estimates than the CVCs that contained the vowel /ʊ/, and the same could be observed for the /a:/ and /u/ vowel pair; for the time-domain algorithm, the vowels from the V1 and V2 vowel groups formed two separate clusters of marginal means for the first two rooms.
The first point can be explained by the fact that vowels are a class of speech sounds up to 40 dB higher in energy level than the consonants surrounding them, thus forming an "energy arc" [33]. Consequently, the blind estimates of T will primarily be obtained from the vowel decay segments of a reverberant speech signal, occurring when the vowel is either the final phoneme in a word or when it is followed by a consonant.
Given that, it is reasonable to assume that the differences in the T estimates between the fricative CVCs and the plosive CVCs can be attributed to the influence of the consonant that follows the central vowel: on the one hand, an abrupt closure of the vocal tract at the beginning of the closed phase of a plosive [28] causes an almost instantaneous stop of the preceding vowel, while, on the other hand, fricatives have relatively time-stable frequency profiles which disrupt the energy decay of the vowel in the higher frequency bands, consequently inducing larger values of the reverberation time estimates.
To give an appropriate explanation for the second point, two figures (Figs. 7 and 8) must first be introduced. Figure 7 shows the average RMS-normalised spectra of the 10 vowels, obtained from the CVC logatome recordings using the time stamps of the beginning and the end of the central vowel given in the speech segmentation text files, an integral part of the OLLO corpus. The second figure, Figure 8, presents another important vowel feature, its duration, which was also obtained with the help of those speech segmentation text files. From Figure 7, it can be observed that the very likely cause of the increase in the values of the marginal means from /a/ to /ʊ/ and from /a:/ to /u/ can be found in the decrease of the overall energy above 1 kHz, starting from the vowel /a/ to /ʊ/ (Fig. 7a) and from the vowel /a:/ to /u/ (Fig. 7b), which, in the end, makes the decays of the vowels /a/ and /a:/ of "better quality" than those of the vowels /ʊ/ and /u/.
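The two per-vowel quantities shown in Figures 7 and 8 can be computed from the segmentation time stamps roughly as follows; the FFT size, window, and function names are our assumptions:

```python
import numpy as np

def vowel_spectrum(signal, fs, t_start, t_end, n_fft=1024):
    """RMS-normalised magnitude spectrum (in dB) of one vowel segment,
    cut out of a recording with the OLLO segmentation time stamps;
    averaging such spectra across recordings gives a Figure 7 curve."""
    seg = signal[int(t_start * fs):int(t_end * fs)]
    seg = seg / (np.sqrt(np.mean(seg ** 2)) + 1e-12)   # RMS normalisation
    spec = np.abs(np.fft.rfft(seg * np.hanning(len(seg)), n=n_fft))
    return 20 * np.log10(spec + 1e-12)

def vowel_duration(t_start, t_end):
    """Vowel duration straight from the segmentation time stamps."""
    return t_end - t_start
```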
Finally, the vowel group V1 and V2 clusters of the marginal means observed for the first two rooms for Löllmann's algorithm (Fig. 3b) as well as for Eaton's algorithm (Fig. 1b) could be explained by the difference in the average duration of the vowels belonging to those two groups, as shown in Figure 8.

VCV-CVC comparison
As explained in the previous sub-section, for the CVC logatomes the estimates of T were obtained primarily from the vowel-consonant (V-C) transition segment of the recordings. The VCV logatomes, on the other hand, contain two vowels and, consequently, the recordings will have two decays from which the final T estimates will be obtained: the V-C transition and the V-silence transition for the right VCV vowel. Figure 9 shows the average RMS-normalised spectra of the left vowel obtained from the recordings of the VCV logatomes. It can be noticed that the spectra look, as expected, almost exactly the same as the ones in Figure 7a for the V1 group of CVC vowels.
The differences in the energy profiles of the vowels are again very visible; /a/ and /ɛ/ have substantially more energy than /ɔ/ and /ʊ/ in the region above 1.5 kHz, and /ɪ/ is similar to /ɛ/ but with a lower inter-formant minimum at about 1.6 kHz and a lower formant peak in the 3 kHz area. Figure 10 presents the average ratios of the spectrum of the left vowel and the spectrum of the central consonant (VC ratios) calculated from the same logatome recording. These VC ratios are calculated and presented only for the central consonants that have relatively stable time-frequency characteristics, and not for the plosives and the affricate /ts/, which are classes of transitory speech sounds. It can be observed that these average VC ratios decrease as one goes from /a/ to /ʊ/, indicating that, during the vowel-consonant transition, there will on average be more frequency bands with free decays of a higher decibel range for the vowels /a/ and /ɛ/ than for the vowels /ɔ/ and /ʊ/. Figure 11 presents the corresponding average VC ratios obtained from the CVC recordings for the consonants /f/ and /s/. It is visible that the VC ratios for the same vowel-consonant pairs of the VCV and CVC recordings have the same frequency profiles. The only difference is that the VC ratios are a few decibels higher for the CVC recordings, meaning that the vowel in a CVC was stressed more by the speakers than the first vowel of a VCV. The similarity of both the average spectrum profiles of the vowels (Figs. 7 and 9) and the VC ratios (Figs. 10 and 11) enables the results obtained for the CVC logatomes to be compared with those for the VCV logatomes that contain the consonants /p/, /t/, /k/, /f/, and /s/.
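The VC ratios can be computed per recording roughly as sketched below. Whether and how the segment spectra are level-normalised is not stated in the text, so the raw (un-normalised) spectra are used here, which lets overall level differences, such as the stronger vowel stress in the CVCs, show up in the ratio; the FFT size and function names are our assumptions:

```python
import numpy as np

def db_spectrum(seg, n_fft=1024):
    """Magnitude spectrum of a segment in dB; no RMS normalisation, so
    overall level differences remain visible in the ratio below."""
    win = np.hanning(len(seg))
    return 20 * np.log10(np.abs(np.fft.rfft(seg * win, n=n_fft)) + 1e-12)

def vc_ratio_db(signal, fs, vowel_span, consonant_span, n_fft=1024):
    """Vowel-to-consonant spectral ratio in dB for one recording, cut
    with the segmentation time stamps; positive values mark bands where
    the vowel dominates the adjacent consonant."""
    v = signal[int(vowel_span[0] * fs):int(vowel_span[1] * fs)]
    c = signal[int(consonant_span[0] * fs):int(consonant_span[1] * fs)]
    return db_spectrum(v, n_fft) - db_spectrum(c, n_fft)
```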
By comparing the results presented in Figures 1a and 4a for Eaton's algorithm, it can be concluded that the V-silence decay provides new negative gradients for this algorithm, so that the final estimate of T for the VCVs that contain an /f/ or /s/ is closer in value to the estimates obtained with unvoiced plosives than was the case for the CVC logatomes. This effect, caused by the presence of the new V-silence decay, is even more pronounced for Prego's (Figs. 2a and 5a) and Löllmann's algorithms (Figs. 3a and 6a).

Influence of the central C
As described in Section 2.3, the first algorithm, Eaton's, continuously calculates decay rates on the reverberant speech signal after its decomposition into mel-frequency bands, and estimates the full-band value of reverberation time from the distribution of those decay rates. Since it uses a window of fixed length to estimate the decay rates, it is sensitive to the duration of the gap between two high-energy speech sounds, such as the two vowels in a VCV logatome. Furthermore, the two main differences between the unvoiced and voiced plosive pairs are: (i) the existence of a voice bar for the latter, and (ii) a longer voice onset time (VOT) for the former, where the VOT is defined as the difference between the time of the plosive burst and the onset of voicing of the following vowel [28]. Figure 12 shows the durations of the central consonants of the VCV recordings obtained using the time stamps of the beginning and the end of the consonant given in the speech segmentation text files. It can be observed that, in line with previous research [28], the durations of the unvoiced plosives, the unvoiced fricatives, and the affricate /ts/, a phoneme composed of those two speech classes, are longer than those of the voiced phonemes. Given these results and the results of the "consonant" × "room" interactions (Fig. 4a), it can be concluded that this algorithm's output is sensitive to both the inter-vowel gap duration and the frequency distribution and energy level of the consonant, where the latter is responsible for the difference between the results for the voiced plosives and those for the voiced fricative /v/, the nasals, and /l/.
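The fixed-window decay-rate idea can be made concrete with a short sketch. This is not Eaton's implementation (the mel filterbank, noise handling, and the learned mapping from the gradient distribution to T are all omitted); the function names, frame sizes, and the synthetic free decay are assumptions.

```python
import numpy as np

def frame_log_energy(x, frame=256, hop=128, eps=1e-12):
    """Short-time log-energy envelope in dB."""
    n = 1 + (len(x) - frame) // hop
    e = np.array([np.mean(x[i * hop:i * hop + frame] ** 2) for i in range(n)])
    return 10 * np.log10(e + eps)

def decay_gradients(env_db, fs, hop=128, win=10):
    """Slopes (dB/s) of a fixed-length window slid over the envelope."""
    t = np.arange(win) * hop / fs
    return np.array([np.polyfit(t, env_db[i:i + win], 1)[0]
                     for i in range(len(env_db) - win)])

# Synthetic free decay: white noise whose envelope falls by 60 dB in
# T_true seconds, i.e. an energy slope of -60/T_true dB/s.
fs, T_true = 16000, 0.3
t = np.arange(0, 0.5, 1 / fs)
rng = np.random.default_rng(1)
x = rng.standard_normal(t.size) * 10 ** (-3 * t / T_true)
g = decay_gradients(frame_log_energy(x), fs)
T_est = -60.0 / np.median(g)   # crude summary of the gradient distribution
```

On real speech only a fraction of the windows fall on clean free decays, which is exactly why the gap between two vowels, and thus the consonant separating them, matters for this class of estimator.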
The second, Prego's, algorithm differs from Eaton's primarily in two respects: (i) the much larger number of frequency bands used for the time-frequency decomposition of the reverberant speech signal, and (ii) the fact that it searches for free-decay regions and calculates their corresponding EDCs, from which an estimate of reverberation time is obtained. Since the minimum EDC decibel range this algorithm finds acceptable for T calculation is 10 dB, and given the average VC ratios (Fig. 10), it can now be understood why this algorithm shows a strong dependence on the vowel used in a VCV: in contrast to the stability of the reverberation time estimates for /a/ and /ɛ/ across consonants, for the vowels /ɔ/ and /ʊ/ there simply is not much high-frequency energy and, consequently, the interaction between the vowel and the consonant that follows becomes much more pronounced for a large subset of frequency bands.
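The free-decay/EDC step can be illustrated with Schroeder backward integration and a line fit over a 10 dB range, the minimum range mentioned above. This is a simplified stand-in for a single frequency band of such an algorithm; the names and the synthetic decay are assumptions.

```python
import numpy as np

def edc_db(x, eps=1e-12):
    """Schroeder energy decay curve of a free-decay segment, in dB."""
    e = np.cumsum(x[::-1] ** 2)[::-1]   # backward-integrated energy
    return 10 * np.log10(e / e[0] + eps)

def t_from_edc(edc, fs, db_range=10.0):
    """Fit a line over the first `db_range` dB of the EDC; T = -60/slope."""
    end = np.argmax(edc < -db_range)    # first sample below the range
    slope = np.polyfit(np.arange(end) / fs, edc[:end], 1)[0]  # dB per second
    return -60.0 / slope

# Synthetic single-band free decay with T_true = 0.4 s.
fs, T_true = 16000, 0.4
t = np.arange(0, 0.6, 1 / fs)
rng = np.random.default_rng(2)
decay = rng.standard_normal(t.size) * 10 ** (-3 * t / T_true)
T_est = t_from_edc(edc_db(decay), fs)
```

When a band's free decay spans less than the 10 dB minimum, as happens at high frequencies for the low-energy vowels, that band simply contributes no estimate, shifting the burden onto whichever bands remain.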
Lastly, Löllmann's algorithm differs from the previous two in that it operates on the signal's time series, while its speech decay detection procedure is similar in essence to the one used in Prego's algorithm. Because it operates on the full-band time series, the quality of a vowel decay will depend on the overall energy of the consonant that follows. This is in agreement with the results for "booth" (Fig. 6b), where the unvoiced plosives produce the smallest values of the T estimate, the voiced plosives and fricatives larger ones, and the nasals and /l/ the largest ones. The difference between the unvoiced and voiced plosives most probably stems from the presence of the energy of the voice bar for the latter. At the same time, the results for the affricate /ts/, somewhat perplexing due to the existence of the high-energy /s/ part, must also be carefully considered: they are the same as for the plosive /t/, meaning that for the T value of 0.18 s the plosive part of this unvoiced affricate is long enough to enable good reverberation time estimation on the V-/t/ decay segment. For this algorithm, the interactions change as T increases, and for the larger values of T ("office") the vowels become the second source of estimate variability.
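The maximum-likelihood core of a time-domain decay fit can be sketched as a grid search over candidate T values, assuming the decay segment is white noise with an exponentially decaying envelope. This toy omits Löllmann's decay detection, bias compensation, and recursive implementation; every name and constant is an assumption.

```python
import numpy as np

def ml_decay_time(x, fs, t_grid):
    """ML fit of x[i] ~ N(0, s2 * exp(-2*rho*t_i)) via grid search over T."""
    t = np.arange(len(x)) / fs
    best_ll, best_T = -np.inf, None
    for T in t_grid:
        rho = 3.0 * np.log(10.0) / T          # 60 dB amplitude decay in T s
        s2 = np.mean(x ** 2 * np.exp(2 * rho * t))   # profiled-out variance
        ll = -0.5 * len(x) * np.log(s2) + rho * np.sum(t)
        if ll > best_ll:
            best_ll, best_T = ll, T
    return best_T

# Synthetic time-domain free decay with T_true = 0.25 s.
fs, T_true = 16000, 0.25
t = np.arange(0, 0.4, 1 / fs)
rng = np.random.default_rng(3)
x = rng.standard_normal(t.size) * np.exp(-3.0 * np.log(10.0) * t / T_true)
T_est = ml_decay_time(x, fs, np.arange(0.10, 0.61, 0.01))
```

Because the fit uses the raw time series rather than per-band envelopes, any residual consonant energy at the tail of the vowel decay biases the fitted slope, which is consistent with the consonant-driven behaviour discussed above.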

Conclusion
The results presented in this study demonstrate that, in addition to their well-explored sensitivity to background noise, the values of the T estimates of the state-of-the-art algorithms for blind reverberation time estimation are strongly influenced by the phonemes present in the speech material. This influence of the phonemes on the value of the reverberation time estimates is statistically significant not just for the algorithm based on a stochastic time-domain model of room impulse responses, which uses no information on speech signal characteristics, but also for the more complex algorithms, whose mapping parameter values have been adjusted during the training phase in order to compensate for the speech signal characteristics.
Since reverberation time rarely exceeds the value of one second in dwellings [1,31], very large relative (i.e., percentage) errors in blind T estimation can occur in the electronic devices that process captured speech signals, both for the languages with a predominance of open syllables and for the ones with a predominance of closed syllables, such as the omnipresent English language. In addition, many of the differences between the marginal means for different vowel-consonant combinations were much larger than 42 ms, the value of the subjective difference limen, a measure describing the T resolution of the human auditory system, defined as the minimal change in the value of T that human subjects can register in listening tests [34].
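As a back-of-the-envelope illustration (using only the T20 endpoints of the rooms in this study), the 42 ms limen translates into sizeable relative quantities:

```python
# Relative size of the 42 ms subjective difference limen [34] at the
# T20 endpoints of the rooms used in this study.
jnd = 0.042                                        # seconds
rel = {T: round(100 * jnd / T) for T in (0.18, 0.55)}
# At T = 0.18 s one limen is already about 23 % of T; at 0.55 s, about 8 %.
```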
Despite the corpus-related limitations of this study, primarily the small number of consonants present in the CVC set, the results implicitly indicate ways in which the algorithms can be adapted to become more robust to the speech signal: for the time-frequency algorithms, the frequency bands could be selected so that the ones in which the vowels /ɔ/ and /ʊ/ have a relatively low energy level are not used for the estimation of the decay rates/EDCs, while for the algorithm operating in the time domain, since the values of its estimates are primarily consonant-driven for smaller values of reverberation time, only the V-C transitions of one of the many classes of consonants should be utilized.