Open Access
Issue
Acta Acust.
Volume 8, 2024
Article Number 48
Number of page(s) 15
Section Musical Acoustics
DOI https://doi.org/10.1051/aacus/2024038
Published online 08 October 2024

© The Author(s), Published by EDP Sciences, 2024

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

The parameters of fundamental frequency (F0) and spectral envelope (SE) play a key role in the acoustical description of sustained musical instrument sounds. Whereas F0 corresponds to the repetition rate of sounds and usually to the perceived pitch [1], SE corresponds to the coarse spectral shape of sounds, and thus largely affects timbre [2]. Recent acoustical analyses have shown a strong covariance between F0 and SE properties for most musical instruments found in the Western symphony orchestra [3]. Specifically, it was shown that for most classes of musical instrument sounds, the spectral centroid (SC, spectral center of gravity) covaries with F0. Moreover, instrument classification using spectral shape was impaired when the F0 range of training and test sets did not overlap. The goal of the present work is to explore this topic in greater depth with psychoacoustic experiments and, specifically, to test how the systematic (mis)alignment of F0 and SE affects the perceptual evaluation of sounds in the form of pleasantness and brightness judgments. By using an acoustical analysis-synthesis model, we ensure that the acoustical manipulations remain within a realistic range.

Most experimental studies of pitch and timbre perception, and their interrelation, have used well-controlled synthetic sounds, either controlled according to a source-filter approach or by using a fixed amplitude profile of partial tones [1, 4]. For instance, [5] had participants discriminate changes of F0 or SC with both dimensions varying simultaneously. Sounds were synthetic tone complexes synthesized according to a source-filter approach. Interaction effects were found, although it remained unclear whether such effects bear perceptual implications in realistic acoustical scenarios because sound synthesis was not constrained by real-world sound properties. Similarly, using tones that yield ambiguous frequency shifts, Siedenburg et al. [6] found strong interactions of F0 and SE properties on spectral shift perception (i.e., whether tones were judged to go up or down). In the literature on speech perception, it has long been argued that formant patterns or spectral shape are not sufficient to account for vowel perception [7]. Accordingly, McPherson and McDermott [8] found analogous interaction effects between F0 and SE for both realistic vowel and musical instrument sounds. McAdams and colleagues even found strong effects of F0 on the affective dimensions of musical instrument sounds [9] as well as F0 effects on instrument identification [10]. Even though these studies have clearly demonstrated the relation between F0 and SE, it remains unclear how the magnitude of the acoustical differences in F0 and SE relates to the perceptual interaction effects, because only selected sound examples were tested without an underlying parametric model. In the present study, we thus provide a model of SE properties that captures most of the variance in musical instrument sounds (including singing voices) and use the corresponding parameter space as a basis for perceptual evaluation.

Concretely, we consider harmonic sounds with a clearly defined F0. Each of these sounds has an SE defined as the specific shape or contour of the corresponding auditory spectrum. Ideally, this contour runs as a continuous curve through all harmonics. A harmonic sound with the same SE can be synthesized using additive synthesis. In this case, we call F0 and SE congruent, because they coincide in the natural acoustical world. However, SEs can also be synthesized with misaligned F0s, yielding an incongruent pairing. In the present study, SEs were computed for a large set of acoustic sounds and a latent space was derived via principal component analysis (PCA). The rationale of this approach was that we sought to characterize effects of the most fundamental latent variables underlying the acoustic structure (and possibly the perception) of harmonic sounds with regard to F0-SE congruency and did not want to capture idiosyncratic properties of individual sounds.

Experiment 1 had participants rate synthesized sounds according to their pleasantness and auditory brightness. We hypothesized that sounds synthesized to possess a congruent F0-SE relation would be rated as more pleasant than sounds synthesized with an incongruent F0-SE relation and that sounds outside the boundaries of the populated latent space (where no actual sound was located) would be rated as less pleasant compared to the inner space. We further hypothesized that auditory brightness ratings would correlate with one of the dimensions of the component space, given the central role of the brightness dimension in timbre spaces [11]. In Experiment 2, a spectrally more fine-grained analysis-synthesis approach concentrated on four particular instruments and measured sound pleasantness and plausibility. The following section describes the processes of acoustical modelling in greater depth.

2 Acoustical modelling

Figure 1 presents an overview of the analysis-synthesis approach. The analysis part will be described in detail in this section. The synthesis part for experiments 1 and 2 will be explained in more detail in their respective sections.

Figure 1

Overview of the analysis and synthesis process. Darker colors correspond to diagrams for Experiment 1 and lighter colors for Experiment 2. [A1] Original waveform, here for a low (F♯3) and high (F♯5) note of a clarinet. Waveforms are only shifted along the time axis for visualization purposes. [A2] Long-term spectra from a fast Fourier transform (FFT) with a frequency resolution of 1 Hz; with (dark) and without (light) manipulation. [A3] ERB-magnitude spectra (RMS-normalized) for manipulated (dark) and original (light) spectra. [A4] ERB-frequency cepstral coefficients (EFCCs) 1 to 13 from a discrete cosine transform (DCT, type II). The lower graph corresponds to the original spectra and the upper to the manipulated spectra. CC1 (manipulated) was left out in the subsequent analysis as indicated by the dashed vertical line in the upper graph. [A5] Two-dimensional latent space from principal component analysis (PCA) on CCs 2 to 13. Smaller gray dots correspond to all analyzed notes across the data set. Experiment 2: [S32] Adjusted EFCCs from true envelope computation. [S22] Spectral envelopes from an inverse discrete cosine transform (IDCT); dotted lines correspond to the original EFCCs from A4 compared to the true envelope as solid lines. [S12] Synthesized waveforms from additive synthesis of harmonics underneath the spectral envelopes. Experiment 1: [S31] Reconstructed EFCCs after dimensionality reduction from the two-dimensional latent space. [S21] Spectral envelopes from an IDCT; dotted lines correspond to the EFCCs of the manipulated spectra in A4 compared to the approximated spectral envelopes from the latent space as solid lines. [S11] Synthesized waveforms from additive synthesis of harmonics underneath the spectral envelopes.

ERB-frequency cepstral coefficients (EFCCs) were computed from the sounds’ long-term frequency spectra as follows: Amplitudes from a fast Fourier transform (FFT) with a frequency resolution of 1 Hz were grouped using a 128-band ERB filterbank. The resulting ERB-magnitudes were RMS-normalized, log10-transformed, and processed by a discrete cosine transform (DCT, type II) for the sake of decorrelation [12]. By only considering the first 13 EFCCs, the analysis-synthesis approach focussed on the coarse spectral shape rather than the spectral fine structure. The computation of the EFCCs was identical to that in a previous publication [3]. A subsequent inverse discrete cosine transform (IDCT) of the retained coefficients yields a smooth approximation of the spectral envelope.
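The cepstral step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it starts from already-computed ERB-band magnitudes (the filterbank itself is not specified here), assumes strictly positive magnitudes, and uses an orthonormal DCT-II scaling, which is an assumption. The function names `dct2`, `idct2`, `efcc`, and `smooth_envelope` are hypothetical.

```python
import math

def dct2(x):
    """Orthonormal DCT type II of a sequence x (scaling is an assumption)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(x)))
    return out

def idct2(coeffs, n):
    """Inverse of dct2 for an n-point signal; coefficients not given count as zero."""
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k, c in enumerate(coeffs)))
    return out

def efcc(band_magnitudes, n_coeffs=13):
    """RMS-normalize the band magnitudes, take log10, keep the first n_coeffs DCT-II terms."""
    rms = math.sqrt(sum(m * m for m in band_magnitudes) / len(band_magnitudes))
    log_spectrum = [math.log10(m / rms) for m in band_magnitudes]
    return dct2(log_spectrum)[:n_coeffs]

def smooth_envelope(coeffs, n_bands=128):
    """Smooth log-magnitude envelope obtained by inverting the truncated coefficients."""
    return idct2(coeffs, n_bands)
```

Truncating to 13 coefficients before the inverse transform discards the rapidly varying cosine terms, which is what removes the harmonic fine structure from the reconstructed envelope.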

To allow for the resynthesis of sounds with mismatched F0s, a manipulation of SEs was introduced in Experiment 1. Because the modelled SEs would act as a spectral filter for a harmonic series in the resynthesis, care needed to be taken to allow for resynthesis of sounds with lower F0s than that of a given analyzed sound. Prior to ERB-filtering, the spectra were thus manipulated by setting the SE level for frequencies below F0 to the amplitude level of F0 (see Fig. 1A2). This step essentially removed the low-frequency cutoff of the SE. It thus allowed sounds with energy in low-order harmonics to be synthesized in Experiment 1, even if the F0 of the synthesized sound was far below the F0 of the analyzed sound.
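As a rough illustration of this manipulation, the sketch below overwrites all bins below F0 with the amplitude found at the F0 bin. It assumes a magnitude spectrum sampled on a uniform grid (1 Hz per bin by default) and takes the single bin nearest to F0 as "the amplitude level of F0"; both the function name `fill_below_f0` and that reading are assumptions.

```python
def fill_below_f0(spectrum, f0_hz, resolution_hz=1.0):
    """Return a copy of the spectrum in which every bin below F0 carries the F0-bin amplitude."""
    f0_bin = int(round(f0_hz / resolution_hz))
    if not 0 <= f0_bin < len(spectrum):
        raise ValueError("F0 lies outside the spectrum")
    # Bins 0 .. f0_bin-1 are replaced; bins from F0 upwards are left untouched.
    return [spectrum[f0_bin]] * f0_bin + list(spectrum[f0_bin:])
```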

Dimensionality reduction was used to increase interpretability of the dimensions of the feature space (e.g., by allowing for 2D maps, see Fig. 2) and to focus the analysis-synthesis approach on the most fundamental (and least idiosyncratic) features of instrument sounds in Experiment 1. Principal component analysis (PCA) was applied to the matrix of cepstral coefficients. The first two principal components explained 62% of the variance of coefficients 2–13 (leaving out the DC component). In Experiment 1, only the first two dimensions were used for analysis/resynthesis.
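The projection onto two principal components, and the way back from a point in the latent space to a full coefficient vector, can be sketched as below. This is a standard-library illustration using power iteration with deflation on the covariance matrix; it is not the authors' code, and in practice a linear-algebra library would normally be used. The names `fit_pca`, `project`, and `reconstruct` are hypothetical, and the sketch assumes the leading eigenvalues are distinct.

```python
import math

def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def fit_pca(rows, n_components=2, n_iter=500):
    """Return (mean, components, explained_variance_ratios) via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_var = sum(cov[j][j] for j in range(d))
    components, ratios = [], []
    for comp in range(n_components):
        # Deterministic, normalized start vector.
        v = [1.0 / (j + 1 + comp) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = _matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigval = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        components.append(v)
        ratios.append(eigval / total_var)
        # Deflation: remove the component just found before searching for the next one.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]
    return mean, components, ratios

def project(row, mean, components):
    """Coordinates of one coefficient vector in the latent space."""
    return [sum((x - m) * c for x, m, c in zip(row, mean, comp)) for comp in components]

def reconstruct(scores, mean, components):
    """Approximate coefficient vector for a chosen position in the latent space."""
    return [m + sum(s * comp[j] for s, comp in zip(scores, components))
            for j, m in enumerate(mean)]
```

Here `rows` would hold one vector of coefficients 2–13 per analyzed sound, and `reconstruct` corresponds to reading off coefficients at an arbitrary grid point of the two-dimensional space.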

Figure 2

The component space. Individual panels depict 12 exemplary instruments (from the total of 50); grey dots correspond to all the analyzed 2000 sounds.

A set of around 1900 recorded harmonic tones from 50 different sustained orchestral instruments was analyzed, extracted from the Vienna Symphonic Library (www.VSL.co.at). Sounds were generated at a dynamic level of mf with a nominal duration of 250 ms plus a decay of variable duration, yielding sounds of around 0.5–1 s overall duration. For further details on the set of analyzed sounds, see Siedenburg et al. [3]. The component space mapped out spectral similarities in the sound set, where clusters could mainly be explained by instrument family membership (see Fig. 2). A correlation analysis between sounds’ spectral centroids (SCs) and the first two dimensions showed a moderate negative correlation for PC1 (Spearman’s ρ(1878) = −.65, p < .001) and only a weak negative correlation with PC2 (ρ(1878) = −.18, p < .001), indicating that a substantial part of variance in the component space could be described by the SC. Some instruments (e.g., from the clarinet or flute classes) showed distinct F0-dependent trajectories, potentially corresponding to different F0-registers.

To resynthesize sounds, additive synthesis was used with a harmonic series of 32 partials in sine phase, corresponding to a given F0. The sound would then reflect the SE of a selected position in the component space, as shown in Figure 3.

Figure 3

Reconstructed spectral envelopes from PC1 and PC2 across the component space (directions indicated by red arrows) with color-coded spectral centroid (SC) values.

3 Experiment 1

Experiment 1 measured elementary effects of F0-SE congruency on pleasantness and brightness ratings of resynthesized harmonic sounds. That is, our interest was to assess whether incongruent (i.e., mismatched) F0-SE pairings would yield lower pleasantness ratings compared to congruent (i.e., matched) F0-SE pairings. Because the signal manipulations were meant to respect the parameter space of naturally occurring sounds, we used a parametric space of SEs as described above. Sounds were drawn from the inner region of the space (where natural instrument sounds actually lie) and, as a comparison, also from outer regions of the space (where no sound in the database was found).

3.1 Methods

3.1.1 Participants

The experiment had 41 participants (29 online participants and 12 participants in the laboratory). Both online and lab groups consisted of participants with self-reported normal hearing, and both groups were recruited via the online job board of the University of Oldenburg. Participants were compensated for their time. Online participants had a mean age of 24.6 years (SD = 2.6). On-site participants had a mean age of 24.8 years (SD = 3.1). Both participant groups reported mixed musical training experience. Online participants also completed a questionnaire on their musical perception and training skills. The Goldsmiths Musical Sophistication Index (Gold-MSI), an established method for measuring participants’ musical abilities, was used for this purpose [13]. Of the 29 online participants, 21 responded. This online subset scored on average 46.2 (SD = 8.7) of 63 on musical perception and 20.8 (SD = 10.8) of 49 on musical training.

3.1.2 Stimuli

The experiment contained three F0 conditions: congruent, incongruent, and fixed. As a control, the fixed condition used a single F0 at the median F0 of the set of analyzed sounds (297 Hz, D4). The F0s for the congruent and incongruent conditions were extracted from the underlying F0 distribution in the component space. For a selected point in the space, the median F0 values among the ten nearest grid points were chosen for the congruent condition. This resulted in a set of 58 different F0s used in the congruent condition. The same overall set of F0s was used for the incongruent condition, but their rank order was inverted. For instance, an SE with the fifth-highest F0 in the congruent set would be synthesized with the fifth-lowest F0 in the incongruent condition. This approach resulted in a systematic variation of the incongruency level in semitones according to the initial rank.
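The rank inversion and the resulting incongruency level can be sketched as follows. The sketch assumes that the inversion is applied to the set of distinct F0 values, and that the incongruency level is the signed interval in semitones between the congruent and the incongruent F0; the function names are hypothetical.

```python
import math

def invert_f0_ranks(congruent_f0s):
    """Map each F0 to the F0 with the mirrored rank (k-th highest becomes k-th lowest)."""
    unique = sorted(set(congruent_f0s))
    mirror = dict(zip(unique, reversed(unique)))
    return [mirror[f] for f in congruent_f0s]

def incongruency_semitones(f0_congruent, f0_incongruent):
    """Signed distance between the two F0s in semitones."""
    return 12.0 * math.log2(f0_incongruent / f0_congruent)
```

With this mapping, F0s near the middle of the ranking are barely shifted, whereas those at the extremes are shifted by nearly the full F0 range, which is why the incongruency level varies systematically with the initial rank.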

Furthermore, sounds were synthesized in two distinct regions of the component space. The inner space corresponded to the region that fell within the range of the analyzed sounds, and the outer space corresponded to regions that were beyond this range. The inner space comprised 45 grid points and the outer space 36 grid points, i.e. a total of 81 grid points. See Figure 4A for the chosen grid points, also indicating that the outer space was sampled more sparsely. Synthesis of sounds from the outer regions would allow us to assess whether sounds with extrapolated (i.e., “unnatural”) SE parameter values would be evaluated differently compared to their counterparts from the inner region of the SE space.

Figure 4

(A) Grid points for the congruent condition. (B) Grid points for the incongruent condition.

For the sound pleasantness task, the entire gridded component space was used. In the brightness task, participants were only presented with sounds synthesized along PC1 (0, ±5, ±10, ±15, ±20) and PC2 (0, ±5, ±10, ±15). Three congruency conditions resulted in a total number of 243 trials for the pleasantness task and 48 trials for the brightness task.

All sounds had a duration of 500 ms including generic 75 ms on- and offsets from a Tukey window and were generated using additive synthesis with a harmonic series of 32 partials in sine phase. Furthermore, the sounds were equalized in loudness using RMS-normalization.
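A minimal sketch of such a stimulus generator is given below. It assumes the spectral envelope is available as a function returning a level in dB for a given frequency, skips partials at or above the Nyquist frequency, and implements the Tukey window as raised-cosine ramps; the sampling rate, the target RMS value, and the name `synthesize_tone` are assumptions that are not taken from the text.

```python
import math

def synthesize_tone(f0_hz, envelope_db, dur_s=0.5, ramp_s=0.075,
                    n_partials=32, fs=44100, target_rms=0.1):
    """Additive synthesis of sine-phase harmonics shaped by a spectral envelope."""
    n = int(round(dur_s * fs))
    n_ramp = int(round(ramp_s * fs))
    # (frequency, linear amplitude) for each harmonic below the Nyquist frequency.
    partials = [(h * f0_hz, 10.0 ** (envelope_db(h * f0_hz) / 20.0))
                for h in range(1, n_partials + 1) if h * f0_hz < fs / 2.0]
    signal = []
    for i in range(n):
        t = i / fs
        sample = sum(a * math.sin(2.0 * math.pi * f * t) for f, a in partials)
        # Raised-cosine onset and offset ramps (flat in between, as in a Tukey window).
        edge = min(i, n - 1 - i)
        if edge < n_ramp:
            sample *= 0.5 * (1.0 - math.cos(math.pi * edge / n_ramp))
        signal.append(sample)
    rms = math.sqrt(sum(s * s for s in signal) / n)
    if rms == 0.0:
        raise ValueError("no partial below the Nyquist frequency")
    # RMS normalization so that all stimuli share the same level.
    return [s * target_rms / rms for s in signal]
```

For example, `synthesize_tone(297.0, lambda f: -6.0 * math.log2(f / 297.0))` would produce a tone at the fixed-condition F0 with a purely illustrative envelope falling by 6 dB per octave.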

3.1.3 Procedure

The experiment was conducted via Testable (https://www.testable.org/) for both online and lab sessions. Participants first completed a headphone test [14], and if they answered at least five of the six trials correctly, they were allowed to continue with the main experiment. In the first part of Experiment 1, participants gave sound pleasantness ratings. They first completed the congruent and incongruent conditions (in the same block in random order). On every trial, they were presented with a single sound and were asked to judge its pleasantness on a 6-point Likert scale (very unpleasant to very pleasant). Subsequently, they completed the fixed F0 block. After a short break, they continued with the second part of Experiment 1, where they judged the brightness of a subset of sounds along the first and second PC on a 6-point Likert scale from very dull to very bright. Before this, they were introduced to the notion of brightness by listening to intervals of filtered white noise with increasing energy towards high frequencies.

3.2 Results

The sound pleasantness ratings of every participant were range-normalized to lie within the interval from 1 (very unpleasant) to 6 (very pleasant). We used a criterion level of α = .001 for the statistical analysis. There were no significant differences between the overall pleasantness rating means of online (M = 3.30, SD = 1.29) and lab (M = 3.29, SD = 1.32) participants; two-sample t-test: t(9961) = −0.35, p = .728. For the congruent condition, there was a marginal difference between online (M = 3.35, SD = 1.29) and lab (M = 3.51, SD = 1.30) participants; t(3319) = −3.08, p = .002. However, there were strong correlations between the mean rating profiles of in-lab and online participants in all conditions (Spearman’s ρ(79) > .82, p < .001).
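One plausible reading of the range normalization is a per-participant linear rescaling of the lowest and highest rating used onto the ends of the scale, sketched below. The handling of a participant who gave a single constant rating is an assumption, as is the function name.

```python
def range_normalize(ratings, low=1.0, high=6.0):
    """Linearly rescale one participant's ratings so that they span [low, high]."""
    lo, hi = min(ratings), max(ratings)
    if hi == lo:
        # Assumption: a constant rating profile is mapped to the scale midpoint.
        return [(low + high) / 2.0] * len(ratings)
    return [low + (r - lo) * (high - low) / (hi - lo) for r in ratings]
```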

Figure 5A presents the average ratings for the factors of F0 congruency and inner/outer space for every participant together with the group average. For the inner space, pleasantness rating means were in the range between mildly unpleasant and mildly pleasant [congruent: 3.58 (SD = 0.43), incongruent: 3.50 (SD = 0.54), fixed: 3.41 (SD = 0.56)]. For the outer space, mean ratings were found to be mildly unpleasant [congruent: 3.17 (SD = 0.32), incongruent: 2.97 (SD = 0.54), fixed: 3.00 (SD = 0.51)].

Figure 5

(A) Sound pleasantness ratings across congruency condition and space position. Points correspond to average ratings of individual participants. Error bars correspond to bootstrapped 95% confidence intervals. (B) Linear regression between incongruency level (ICLVL) and sound pleasantness ratings from the inner space for the incongruent condition.

The pleasantness data were analysed using a linear mixed-effects model [15] implemented in R [16] with an effects coding scheme; see the supplementary materials for the full model results. As random effects, the model included by-participant intercepts; as fixed effects, the model included the predictors of space (outer, inner), congruency (congruent, incongruent, fixed), F0 as linear and quadratic terms, as well as interaction terms between the space and congruency predictors ($R_{\mathrm{cond}}^2$ = .16, $R_{\mathrm{marg}}^2$ = .04).

As expected, sounds from outer portions of the space were consistently rated as less pleasant across all three congruency conditions (β = −0.23 [−0.27, −0.19], p < .001). However, there was no main effect of congruency itself (p = .040). There was a weak interaction of the congruency and space factors based on lower ratings in the outer space for incongruent compared to congruent F0s (space:congr.: β = −0.04 [−0.08, −0.01], p < .012). Importantly, there were strong effects of F0, both as linear (β = 0.59 [0.29, 0.88], p < .001) and quadratic terms (β = −0.04 [−0.06, −0.02], p < .002). That is, pleasantness ratings were affected considerably by sounds’ F0.

We further considered the results from the incongruent condition in more detail, in particular for sounds in the inner space. Figure 5B shows pleasantness ratings for the incongruent inner space over the level of incongruency in semitones. Within the first two octaves, sounds were mostly perceived as rather pleasant, but ratings seemed to drop off for levels of incongruency of more than two octaves. Notably, there was a negative correlation between pleasantness ratings and the level of incongruency ($R^2$ = .33, p < .001). This suggests that pleasantness ratings declined with increasing levels of incongruency between F0 and SE.

The brightness ratings are shown in Figure 6A. Here, the fixed F0 condition is of particular interest because changes in F0 do not confound the ratings. Brightness ratings are highest for negative regions of PC1, following a monotonically decreasing trend centered slightly off the zero position of PC1 (Spearman’s ρ(7) = −.87, p < .005). Separate LME models with by-participant intercepts, restricted to the data that varied along PC1 or PC2, were used to analyse brightness ratings in the fixed F0 condition (PC1: $R_{\mathrm{cond}}^2$ = .41, $R_{\mathrm{marg}}^2$ = .23; PC2: $R_{\mathrm{cond}}^2$ = .35, $R_{\mathrm{marg}}^2$ = .002). An LME model confirmed that there was a robust negative effect of PC1 on brightness ratings (β = −0.05 [−0.06, −0.04], p < .001). However, there was no effect of PC2 on brightness ratings in the fixed F0 condition (ρ(5) = −0.04, p = .96; β = 0.01 [0.00, 0.01], p = .275). Ratings in the other two conditions showed different result patterns, with strong effects of F0 differences. Two additional LME models were used to analyse the data across PC1 and PC2. As random effects, both models included by-participant intercepts; as fixed effects, the models included the predictors of congruency (congruent, incongruent), the respective PC, F0 as a linear term, as well as interaction terms between the congruency and PC predictors (PC1: $R_{\mathrm{cond}}^2$ = .36, $R_{\mathrm{marg}}^2$ = .33; PC2: $R_{\mathrm{cond}}^2$ = .23, $R_{\mathrm{marg}}^2$ = .21). Effects of PC1 extended to the congruent and incongruent F0 conditions, even when F0 was accounted for in the LME model (PC1: β = −0.04 [−0.04, −0.03], p < .001; F0: β = 0.93 [0.71, 1.15], p < .001). This implies that brightness ratings were strongly associated with the PC1 dimension independently of F0.

Figure 6

(A) Sound brightness ratings across the first PC for all conditions. Error bars correspond to bootstrapped 95% confidence intervals. (B) Sound brightness ratings across the second PC for all conditions. Error bars correspond to bootstrapped 95% confidence intervals. (C) Linear regression between F0 and sound brightness ratings for the congruent and incongruent conditions.

There were also marginally higher brightness ratings for incongruent sounds compared to congruent sounds (β = 0.10 [0.01, 0.20], p < .034). However, PC2 did not exert any robust influence on brightness ratings (PC2: β = 0.00 [−0.01, 0.02], p < .408), although F0 did play an important role for ratings along PC2 (F0: β = 0.45 [0.18, 0.72], p < .001). Contrary to PC1, no effect of congruency was seen for PC2 (β = −0.11 [−0.29, 0.07], p = .232). When F0 varied, there were robust correlations between F0 and brightness ratings in both (congruent and incongruent) conditions (raw $R^2$ > .55, p < .001), see Figure 6C. Overall, this confirms that F0 (unless controlled as in the fixed condition) plays a major role in sound brightness ratings. At the same time, the impact of PC1 is not limited to the fixed F0 condition but also plays an independent role in congruent and incongruent F0 conditions.

3.3 Discussion

We used an analysis-synthesis approach to test for the role of congruency between F0 and SE in sound pleasantness and brightness ratings. Based on an analysis of around 1900 musical instrument and voice sounds, a two-dimensional synthesis space was derived that explained around 60% of the variance in the SE data. Sounds with congruent and incongruent F0-SE relations were synthesized. Participants rated the perceived pleasantness and brightness of these sounds.

Our results indicated that congruent pairings of F0 and SE were perceived as more pleasant compared to incongruent pairings when the level of incongruency exceeded around two octaves. A previous study indicated that the F0 bandwidth for “timbre invariance” may, in fact, extend up to two octaves for musically trained participants [17]. In the present case, however, we suspect that the modest effect of congruency, which only became visible in the post hoc correlation analysis, may be due to the present synthesis settings with a relatively coarse approximation of the spectral shape. Specifically, only 12 cepstral coefficients were used to avoid direct sampling of spectral fine-structure information from the resolved portions of the ERB representation. Also, the lower portions of the envelope (frequencies below F0) were extrapolated, thus effectively equalizing the effect of low-frequency information. This extrapolation and the coarse sampling with twelve EFCCs were adopted as precautions in order not to explicitly sample a low-frequency cutoff and spectral fine-structure information, respectively. Despite these precautionary measures, we found effects of congruency, which suggests that congruency may play a role in the perception of actual instrument sounds.

A second important finding was that sounds synthesized from outer regions of the space without direct correspondence to existing sounds were perceived as less pleasant. In the realm of musical consonance perception, there is a long-standing debate regarding the roles of biological and cultural roots [18, 19]. Similar questions could be raised concerning preference toward spectral shapes: why did our participants consistently rate SEs from outer regions of the parameter space as less pleasant compared to SEs from the inner region? An interpretation considering cultural constraints could relate this finding to a mere-exposure effect related to the prevalence of natural SEs: in their musical enculturation, participants would often encounter SEs akin to exemplars from the inner region of the SE space, but very rarely SEs akin to exemplars of the outer region of the space. Notably, this effect turned out to be robust, despite our deliberately coarse approximation of SE shape in the synthesis process.

Direct brightness ratings along the two PC axes were collected in the second part of Experiment 1. Paired with appropriate instruction (as in the present experiment), direct brightness ratings have been shown to be the most reliable way to assess brightness perception [20]. In the fixed F0 condition, ratings followed a reversed logistic shape with lower PC1 values yielding higher perceived brightness values. Brightness differences along PC2 seemed to be negligible. These findings support the notion that the dimension of auditory brightness corresponds to a major dimension of the acoustical structure of musical instrument sounds, even when the effects of F0 are controlled for. This aspect also aligns with the correlation of SC and PC1. Considering the two conditions with F0 variation, our results further align with the literature that F0 cues strongly affect brightness judgments [21, 22]. In summary, we observed both an effect of F0 as well as an independent effect of the first dimension of the SE space on brightness ratings.

4 Experiment 2

In Experiment 2, we sought to study the relation of F0 and SE at a different level of granularity. Whereas the sounds from Experiment 1 were based on a 2-dimensional projection using PCA, the second experiment included the full array of EFCCs for four selected instruments as a type of case study. With this acoustically more fine-grained synthesis approach, we sought to explore the effect of F0-SE congruency with regard to instrument-dependent effects. As a secondary measure, sound plausibility ratings were collected. These were used to contextualize the results of the present synthesis method.

4.1 Methods

4.1.1 Participants

The experiment had 38 participants (16 male, 21 female, 1 other) who participated online via Testable (https://www.testable.org/). All of the participants had self-reported normal hearing, at least 2 years of musical instrument training, and were recruited via the online job board of the University of Oldenburg. Participants were compensated for their time. Mean age of the participants was 25.6 years (SD = 4.7 years). According to the Goldsmiths Musical Sophistication Index (Gold-MSI) [13], participants scored on average 45.8 (SD = 6.3) of 63 on musical perception and 28.0 (SD = 6.3) of 49 on musical training.

4.1.2 Stimuli

Sets of 500 ms synthesized instrument sounds were generated for four specific instruments: violin, alto voice (female, singing an /a/ as in father), clarinet, and tuba. Hence, this selection represents the string, woodwind, and brass instruments, as well as vocal sounds. Considering the F0-range of Western instruments, the violin includes sounds with the highest F0, alto voice and clarinet represent the middle F0-range, and the tuba represents the lower end. The clarinet arguably features distinct timbres in its separate F0-registers.

Spectral envelopes were extracted in a similar way to that described in Section 3.1.2. However, no manipulation of the spectra at frequencies below F0 was performed. Instead, the true envelope was computed [23]. This algorithm iteratively updates the resulting spectral envelope $A_i(k)$ by computing the maximum of the original spectrum $X(k)$ and the current spectral envelope $C_{i-1}(k)$:

$$ A_i(k)=\max\left(20\cdot \log_{10}(X(k)),\,C_{i-1}(k)\right). $$(1)

In our case, $C_i(k)$ was obtained by cepstral smoothing of $A_i(k)$ with a DCT at iteration $i$. Since $C_0(k)$ stems from a 128-band ERB-filtered spectrum, $X(k)$ represents the maximum of the original spectrum within each band $k$. The procedure is initialized with $A_0(k) = 20 \cdot \log_{10}(X(k))$ and repeated until $C_i(k) + \Delta \geq A_i(k)$, with a threshold $\Delta$ = 2 dB.
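The iteration in equation (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [23] or of the authors: cepstral smoothing is realized by keeping the first `n_coeffs` coefficients of an orthonormal DCT-II, the per-band maxima are assumed to be strictly positive, and an iteration cap `max_iter` is added as a safeguard. All function and parameter names are hypothetical.

```python
import math

def cepstral_smooth(x, n_coeffs):
    """Smooth x by keeping only the first n_coeffs orthonormal DCT-II coefficients."""
    n = len(x)
    basis = [[math.sqrt((1.0 if k == 0 else 2.0) / n)
              * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)]
             for k in range(n_coeffs)]
    coeffs = [sum(b * v for b, v in zip(row, x)) for row in basis]
    return [sum(c * row[i] for c, row in zip(coeffs, basis)) for i in range(n)]

def true_envelope(band_max, n_coeffs=13, delta_db=2.0, max_iter=200):
    """Return (envelope_db, converged) following the update rule of equation (1)."""
    x_db = [20.0 * math.log10(v) for v in band_max]
    a = list(x_db)                        # A_0(k) = 20 log10 X(k)
    c = cepstral_smooth(a, n_coeffs)      # C_0(k)
    for _ in range(max_iter):
        # Stop once the smoothed curve lies within delta_db of the target everywhere.
        if all(ci + delta_db >= ai for ci, ai in zip(c, a)):
            return c, True
        a = [max(xi, ci) for xi, ci in zip(x_db, c)]   # A_i(k), equation (1)
        c = cepstral_smooth(a, n_coeffs)               # C_i(k)
    converged = all(ci + delta_db >= ai for ci, ai in zip(c, a))
    return c, converged
```

Because each update replaces spectral valleys by the current smoothed curve, the envelope is pushed upwards towards the spectral peaks rather than averaging over peaks and valleys, which is the main difference from plain cepstral smoothing.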

All sounds had a duration of 500 ms including generic 75 ms on- and offsets from a Tukey window and were generated using additive synthesis with a harmonic series of 32 partials in sine phase. Furthermore, the sounds were equalized in loudness using RMS-normalization.

Sounds were then resynthesized by means of additive synthesis with a harmonic series (in sine phase) corresponding to a given F0 and a particular SE shape. To test for effects of congruency between F0 and SE, three partially incongruent conditions comprised fixed SE shapes from the instruments’ low, middle, and high registers. These sample points are referred to as the low register anchor (LRA), middle register anchor (MRA), and high register anchor (HRA), and the corresponding SEs as lrSE, mrSE, and hrSE. Figure 7 depicts the filtered harmonics for the lrSE in all four instruments. The mrSE and hrSE are given as a direct comparison. The pitches corresponding to the register anchors can be seen in Table 1. The entire pitch range included 19 pitches, sampled from F♯1 to F♯7 in steps of major thirds (4 semitones), dividing each octave into three pitches. The individual register anchors were then selected by choosing the sample pitch closest to the register’s center, with the lower pitch chosen in case of a tie. See Table 1 for an overview. For each of the lrSE, mrSE, and hrSE envelopes, sounds were then synthesized across the full F0 range, resulting in varying degrees of incongruent F0-SE pairings as outlined in Figure 7. As a control condition, congruent F0-SE pairs were created across the instruments’ F0-range. This resulted in a total of 269 instrument sounds (violin: 69, alto voice: 65, clarinet: 68, tuba: 67).
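The pitch grid and the anchor selection can be sketched as follows. The sketch works in MIDI note numbers with C4 = 60 and A4 = 440 Hz, and takes the register center to be the midpoint in semitones between the register boundaries; these conventions and the names `PITCH_GRID`, `midi_to_hz`, and `register_anchor` are assumptions.

```python
F_SHARP_1, F_SHARP_7 = 30, 102            # MIDI numbers, assuming C4 = 60
PITCH_GRID = list(range(F_SHARP_1, F_SHARP_7 + 1, 4))   # steps of major thirds

def midi_to_hz(midi_note, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note (the A4 reference is an assumption)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)

def register_anchor(low_midi, high_midi, grid=PITCH_GRID):
    """Grid pitch closest to the register center; the lower pitch wins a tie."""
    center = (low_midi + high_midi) / 2.0
    # Sorting key: distance to the center first, then the pitch itself to break ties downwards.
    return min(grid, key=lambda m: (abs(m - center), m))
```

Six octaves sampled at three pitches per octave, plus the upper end point, give the 19 grid pitches mentioned above.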

Figure 7

Harmonics (blue vertical lines) from the low register spectral envelope (lrSE) synthesized at low (top panel rows), middle (middle panel rows), and high (bottom panel rows) register anchors (LRA, MRA, HRA) for all instruments. Spectral envelopes from the middle and high registers (mrSE, hrSE) for comparison in red and yellow, respectively.

Table 1

Instruments’ F0-range, lower boundaries of the instruments’ middle and high registers, and register anchors from the low (lrSE), middle (mrSE), and high (hrSE) registers.

4.1.3 Procedure

Participants first completed a headphone test [14]. The experiment then consisted of two parts: rating of the sound pleasantness and the sound plausibility. For rating the pleasantness, four instrument-specific sets with sounds from all conditions were presented. Randomization was performed at the level of sets and conditions. The participants were asked to rate the sounds on a 6-point scale (very unpleasant, unpleasant, mildly unpleasant, mildly pleasant, pleasant, very pleasant). There was a total of 269 trials. A similar procedure was chosen for the plausibility ratings but with a reduced number of sounds comprising only the register anchors. Sounds for all four instruments were again presented in individual sets but in the order violin, alto voice, clarinet, and tuba. Within the instrument sets, sounds from all conditions were randomized. Participants rated the sounds on a scale from 1 to 6 (very implausible to very plausible) according to the sounds’ plausibility as compared with their internal, subjective reference of the particular acoustical instrument. This resulted in a total of 48 trials.

4.2 Results

The sound pleasantness and plausibility ratings of every participant were range-normalized to lie within the interval from 1 (very unpleasant/very implausible) to 6 (very pleasant/very plausible). We used a criterion level of α = .001 for the statistical analysis.
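The normalization step can be sketched as follows, assuming that "range-normalized" refers to a per-participant linear (min-max) rescaling onto the interval from 1 to 6; the exact formula is not given in the text, so both the function `range_normalize` and its handling of constant ratings are illustrative assumptions.

```python
def range_normalize(ratings, lo=1.0, hi=6.0):
    """Linearly rescale one participant's ratings so that min -> lo and max -> hi."""
    r_min, r_max = min(ratings), max(ratings)
    if r_max == r_min:
        # Degenerate case (constant ratings): map everything to the scale midpoint.
        return [(lo + hi) / 2.0 for _ in ratings]
    scale = (hi - lo) / (r_max - r_min)
    return [lo + (r - r_min) * scale for r in ratings]

# Hypothetical participant who only used the scale steps 2 to 5.
example = range_normalize([2, 3, 5, 4])
```

Under this assumption, a participant who used only part of the scale is stretched to the full range, which makes rating profiles comparable across participants.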

4.2.1 Sound pleasantness

Figure 8 presents the average sound pleasantness ratings for all stimuli and conditions over the musical pitch (F0) for each instrument. Overall, ratings followed an inverted U-shape centered around the middle of the global F0-range for the congruent, lrSE, and mrSE conditions. The hrSE condition showed a flattened response towards lower pitches and thus an effective shift of its maximum towards higher pitches. This behavior was seen in all instruments except for the tuba. Stronger local differences between conditions were seen in the clarinet (lrSE vs. mrSE vs. congruent) and the alto voice (mrSE vs. congruent).

Figure 8

Sound pleasantness ratings across pitch (F0) for the low (lrSE), middle (mrSE), high (hrSE) register spectral envelopes, and congruent condition. (A) Empirical data with bootstrapped 95% confidence intervals. (B) Fixed effects of a linear mixed-effects model. Vertical lines represent register anchors (color) and F0-range (gray).
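For readers unfamiliar with bootstrapped confidence intervals, a minimal sketch of a percentile bootstrap for a mean rating is given here. The percentile method, the number of resamples, the resampling of individual ratings, and the example ratings are all assumptions; the text does not specify how the intervals in the figure were computed.

```python
import random

def bootstrap_ci_mean(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * (n_boot - 1))
    hi_idx = (n_boot - 1) - lo_idx  # symmetric index from the upper end
    return means[lo_idx], means[hi_idx]

# Hypothetical ratings of one stimulus across participants.
ratings = [2.5, 3.0, 3.5, 4.0, 2.0, 3.0, 4.5, 3.5, 3.0, 2.5]
ci_low, ci_high = bootstrap_ci_mean(ratings)
```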

The data were analyzed using a linear mixed-effects model in R with an effects coding scheme; see the supplementary materials for the full model statistics. As random effects, the model included by-participants intercepts; as fixed effects, the model included the predictors of register congruency (lrSE, mrSE, hrSE), pitch as linear and quadratic terms, as well as interaction terms between pitch and register congruency predictors.

For the effects coding scheme the congruent condition was coded as −1, centering the effects at the mean of the group means. To also include the effect of the congruent condition, which is the “hidden” factor, the model was rerun with a different contrast order. Pitch was coded as indices in octaves and centered at F♯4 which resulted in 19 indices from −3 to 3. For each of the four instruments an individual model was constructed (violin: $R_{\mathrm{cond}}^2$ = .39, $R_{\mathrm{marg}}^2$ = .16; alto voice: $R_{\mathrm{cond}}^2$ = .46, $R_{\mathrm{marg}}^2$ = .24; clarinet: $R_{\mathrm{cond}}^2$ = .45, $R_{\mathrm{marg}}^2$ = .28; tuba: $R_{\mathrm{cond}}^2$ = .51, $R_{\mathrm{marg}}^2$ = .34).
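The coding of the fixed-effects predictors can be illustrated with a short sketch. It is written in Python for illustration only (the analysis itself was run in R), and the names `effects_code`, `pitch_index`, and `design_row`, the MIDI numbering (F♯4 = 66), and the restriction of interactions to the linear pitch term are assumptions rather than details taken from the model specification.

```python
CONDITIONS = ["lrSE", "mrSE", "hrSE", "congruent"]  # last level is the reference

def effects_code(condition, levels=CONDITIONS):
    """Sum-to-zero (effects) coding: one column per non-reference level;
    the reference level is coded -1 in every column."""
    if condition not in levels:
        raise ValueError(f"unknown condition: {condition}")
    columns = levels[:-1]
    if condition == levels[-1]:
        return [-1] * len(columns)
    return [1 if condition == c else 0 for c in columns]

def pitch_index(midi, center_midi=66):
    """Pitch in octaves relative to F#4 (MIDI 66 under the assumed numbering)."""
    return (midi - center_midi) / 12.0

def design_row(condition, midi):
    """Fixed-effects predictors for one trial: intercept, condition codes,
    linear and quadratic pitch, and pitch-by-condition interactions."""
    codes = effects_code(condition)
    p = pitch_index(midi)
    return [1.0] + codes + [p, p * p] + [p * c for c in codes]

# Pitch indices for the 19 sample pitches (F#1 = MIDI 30 to F#7 = MIDI 102).
indices = [pitch_index(m) for m in range(30, 103, 4)]
```

Because the reference level is coded −1 in every column, each coefficient is a deviation from the mean of the condition means, and the reference level's own deviation is the negative sum of the other coefficients. This is why a rerun with a different contrast order is needed to report it directly.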

All instruments showed significant effects of pitch as quadratic terms (β < −0.13, p < .001). Clarinet and tuba sounds also showed significant effects of pitch as linear terms (β < −0.08, p < .001). Consistent effects of congruency were seen for all instruments, except for the tuba. The congruent condition received the most pleasant ratings (violin, alto voice, clarinet: β > −0.29, p < .001; tuba: p = .344). The lrSE was also significantly higher than the mean of condition means, but not as strong as the congruent condition (violin, alto voice, clarinet: β > 0.12, p ≤ .001; tuba: p = .712). The hrSE was strongly rated as least pleasant (violin, alto voice, clarinet: β < −0.51, p < .001; tuba: p = .098). Mixed interaction effects of pitch and congruency condition were seen for all instruments but the tuba.

Concerning instrument-specific effects, violin sounds received a weighted average rating of 2.87 (SD = 0.35), the lowest for all instruments. Besides main effects of pitch and congruency, interactions of pitch and congruency were seen for the congruent condition (β = −0.16 [−0.23, −0.09], p < .001) and the mrSE (β = 0.09 [0.04, 0.13], p < .001). Rating profiles for the congruent condition exhibited a rippled course within octaves three and five. The maximum average pleasantness rating for the mrSE was offset by one octave at D5 as compared to the expected location of the MRA at D4. Above D5, lrSE and mrSE received slightly higher ratings than the congruent condition. Overall, sound pleasantness ratings for violin sounds showed that, with the exception of the hrSE, partially incongruent conditions did not differ significantly from the congruent condition and in some cases even received higher ratings. However, the conspicuous ripple effect in the congruent condition was not observed in other conditions and would need to be investigated in more detail.

Alto voice sounds received a weighted average rating of 3.16 (SD = 0.64). Besides main effects of pitch and congruency, significant interaction effects were seen for the lrSE and mrSE (β < −0.10, p ≤ .001). The interaction of pitch and the congruent condition was also significant (β = 0.23 [0.08, 0.37], p = .002). Regarding individual trajectories, the lrSE was rated almost identically to the congruent condition within the natural F0-range of the alto voice. Thus, the maximum average pleasantness ratings for the lrSE were seen one octave higher around A♯4. Maxima for the mrSE and hrSE were at their expected register anchors. Still, it should be noted that the mrSE showed a constant offset between the congruent condition and the lrSE between the MRA and HRA. Also, the hrSE was rated more pleasant than the mrSE around the HRA. The results for vocal alto sounds show that voice-like and F0-independent formants are reflected in the sound pleasantness ratings for the congruent and the lrSE condition, while not being captured with the mrSE above its anchor point. At the end of the vocal alto range, the hrSE is comparable to the congruent and lrSE condition.

Clarinet sounds received a weighted average rating of 3.16 (SD = 0.64). Besides main effects of pitch and congruency, significant interaction effects were seen for all congruency conditions. The congruent condition and the hrSE showed the strongest interaction effects (congr.: β = −0.20 [−0.28, −0.13], p < .001; hrSE: β = 0.22 [0.17, 0.26], p < .001). Regarding individual trajectories, maximum average pleasantness ratings for the mrSE and hrSE were seen at their respective register anchors MRA and HRA. Only the lrSE was rated most pleasant around A♯4, more than one octave above the expected LRA. Among all instruments, clarinet sounds had the most variability in sound pleasantness ratings. The lrSE in particular showed a distinct local minimum between the LRA and HRA. The mrSE exhibited a strong peak around its respective register anchor. Pleasantness ratings dropped noticeably above the HRA. In contrast to the results of the alto sounds, the partially congruent spectral envelopes in the low and middle register are only on par with the pleasantness ratings of the congruent condition within a narrow bandwidth, otherwise decreasing drastically. However, this behavior diminishes in the high register.

Tuba sounds received a weighted average rating of 3.75 (SD = 0.23), the highest for all instruments. There were main effects of pitch only; in contrast to all other instruments, there were neither main effects of congruency nor interactions of pitch and congruency. Only an interaction of pitch and hrSE was marginally significant (β = 0.06 [0.02, 0.11], p = .006). As with the other instruments, maximum average pleasantness ratings were seen around F♯4.

In summary, the results indicate that the effects of register congruency differed across instruments. While the congruent condition had a significantly positive effect on pleasantness ratings for violin, alto voice, and clarinet sounds, there was no significant influence for tuba sounds. However, the relationship between pleasantness ratings and F0 was consistent across instruments suggesting a curvilinear relationship. Besides this global inverted U-shape most instruments showed more fine-grained local differences between congruency conditions and their interaction with F0 that appeared even below one octave of F0 separation.

4.2.2 Sound plausibility

Figure 9 presents the average sound plausibility ratings over the instrument-specific register anchors. Effects of the lrSE were not as strong as for the pleasantness ratings. Congruent, lrSE, and mrSE ratings were very similar. Only the lrSE in the middle register for the clarinet was visibly lower. Average ratings for the clarinet and tuba sounds were higher than those for violin and alto voice sounds.

Figure 9

Sound plausibility ratings over F0-register for the low (lrSE), middle (mrSE), high (hrSE) register spectral envelopes, and congruent condition. Error bars correspond to bootstrapped 95% confidence intervals.

The data were analyzed using a linear mixed-effects model in R with an effects coding scheme; see the supplementary materials for the full model results. As random effects, the model included by-participants intercepts; as fixed effects the model included the predictors of F0-register (low, mid, high), register congruency (congr., lrSE, mrSE, hrSE), as well as interaction terms between F0-register and register congruency predictors.

For the effects coding scheme the high register and the congruent condition were both coded as −1, centering the effects at the grand mean. To also include their main effects for statistical analysis, the model was rerun with a different contrast order. For each of the four instruments an individual model was constructed (violin: $R_{\mathrm{cond}}^2$ = .37, $R_{\mathrm{marg}}^2$ = .13; alto voice: $R_{\mathrm{cond}}^2$ = .43, $R_{\mathrm{marg}}^2$ = .04; clarinet: $R_{\mathrm{cond}}^2$ = .31, $R_{\mathrm{marg}}^2$ = .14; tuba: $R_{\mathrm{cond}}^2$ = .36, $R_{\mathrm{marg}}^2$ = .08).

A combined model was constructed for the effect of instrument on plausibility ratings. The model results confirmed that the differences in average instrument ratings were significant (clarinet, tuba: β > 0.55, p < .001; violin, alto voice: β < −0.52, p < .001).

In the separate models, main effects of register were seen for violin and tuba sounds. Register congruency had main effects on plausibility ratings for violin and clarinet sounds.

Violin sounds received an average plausibility rating of 3.02 (SD = 1.52). The model results indicate that the low register was rated as significantly less plausible (β = −0.54 [−0.70, −0.38], p < .001); conversely, the high register was rated as significantly more plausible (β = 0.51 [0.35, 0.67], p < .001). A main effect of congruency was seen for the hrSE, receiving lower ratings (β = −0.37 [−0.56, −0.17], p < .001). Interaction effects between register and congruency were seen in the hrSE within the middle and high register (mid:hrSE: β = −0.48 [−0.75, −0.20], p = .001; high:hrSE: β = 0.61 [0.34, 0.89], p < .001). An interaction of high register and mrSE was only marginally significant (β = −0.34 [−0.62, −0.07], p = .015).

Vocal sounds received an average plausibility rating of 3.40 (SD = 1.52). There was no main effect of register (p > .282) but a marginally significant main effect of congruency for the hrSE (β = −0.28 [−0.47, −0.10], p = .003). The hrSE showed an interaction effect with the high register (β = 0.44 [0.18, 0.70], p = .001).

Clarinet sounds received an average plausibility rating of 4.47 (SD = 1.39). There was no main effect of register (p > .076). Main effects of congruency were seen for the mrSE and hrSE, indicating higher ratings for the mrSE (β = 0.32 [0.14, 0.51], p = .001) and lower ratings for the hrSE (β = −0.61 [−0.80, −0.43], p < .001). Interaction effects of register and congruency were complex. Most strikingly there was a negative interaction effect for the high register and the congruent condition (β = −0.54 [−0.81, −0.28], p < .001) and a positive interaction effect for the high register and the hrSE (β = 0.77 [0.51, 1.03], p < .001).

Tuba sounds received an average plausibility rating of 4.81 (SD = 1.25). Main effects of register were seen in the low and high register, indicating higher ratings in the low register (β = 0.30 [0.17, 0.43], p < .001) and lower ratings in the high register (β = −0.49 [−0.62, −0.35], p < .001). There were no main effects of congruency (p > .128) and no interaction effects of register and congruency (p > .130).

Figure 10 shows the plausibility ratings over the pleasantness ratings, visualizing the correlation between the two types of judgements. Alto voice and clarinet showed a strong positive correlation between plausibility and pleasantness ($R^2$ > .85, p < .001). Tuba sounds were negatively correlated ($R^2$ > .53, p = .008). The regression for violin sounds was rather inconclusive ($R^2$ > .29, p = .038) due to a few strong deviations from the regression line.
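Because $R^2$ is non-negative, the direction of each relation has to be read from the sign of the correlation coefficient (or, equivalently, the regression slope). The sketch below shows this for a simple linear regression with one predictor, where $R^2$ equals the squared Pearson correlation. The function names and the example values are hypothetical and are not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_squared_and_direction(x, y):
    """R^2 of a one-predictor linear regression plus the sign of the relation."""
    r = pearson_r(x, y)
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return r * r, direction

# Hypothetical mean ratings per stimulus (pleasantness, plausibility).
pleasantness = [2.0, 2.5, 3.0, 3.5, 4.0]
plausibility_up = [3.1, 3.4, 4.0, 4.3, 4.9]
plausibility_down = [4.9, 4.3, 4.0, 3.4, 3.1]

r2_up, dir_up = r_squared_and_direction(pleasantness, plausibility_up)
r2_down, dir_down = r_squared_and_direction(pleasantness, plausibility_down)
```

In this toy example both data sets yield the same $R^2$ but opposite directions, which mirrors how a moderate $R^2$ can accompany a negative relation such as the one reported for the tuba.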

Figure 10

Sound plausibility ratings over pleasantness ratings for violin, alto voice, clarinet, and tuba. Low, mid, high register and congruent data points depicted as upward triangles, circles, downward triangles, and squares, respectively.

In summary, clarinet and tuba sounds received significantly higher plausibility ratings than violin and alto voice sounds. Again, instrument-dependent effects were seen with no congruency differences for alto voice and tuba. Of all instruments, the clarinet showed the most variation in plausibility ratings, especially in the hrSE and lrSE.

4.3 Discussion

We investigated synthesized sounds from four instruments to validate findings from Experiment 1 with a finer resolution in SE modulations as compared to the averaged and extrapolated representation from the component space. Sounds with congruent and partially incongruent F0-SE relations with respect to instruments’ low, middle, and high register anchor points were synthesized. Participants rated the perceived sound pleasantness and plausibility of these sounds. In the following discussion we distinguish the notions of global F0 (across-instrument) from F0-register (within-instrument). We acknowledge that the term register may have different meanings in the literature depending on the context.

Our results indicated that a consistent effect of register congruency across all instruments was only seen for the high register spectral envelope (hrSE). However, there were also effects in other register conditions for individual instruments, such as in the mrSE of the alto voice and the lrSE of the clarinet. These findings suggest that the high-pass characteristics exhibited in sounds when resynthesized at lower pitches, i.e. shifting the fine structure underneath the fixed envelope towards lower frequencies, played a major role in the decrease in sound pleasantness ratings. This finding is consistent with the well-established importance of lower-order harmonics for pitch perception [1]. Effects for resynthesis of higher pitched sounds, i.e. shifting the fine structure underneath the fixed envelope towards higher frequencies, were not as straightforward and were instrument-dependent.

4.3.1 Global F0 effects

On a global level, pleasantness ratings followed an inverted U-shape distribution for all instruments. These shapes are comparable to the data from sound preference ratings across global F0 for different instrument families [9], where string, brass, and woodwind sounds were preferred in the middle F0-region (in their study, the authors refer to the partitioning of the global pitch range as pitch register). In our data, even for tuba sounds maximum pleasantness was seen in the middle F0-range. This suggests that pleasantness ratings were globally affected by F0 or pitch, independent of the instruments’ F0-range. Figure 11 shows the mean pleasantness ratings across all instrument sounds for each participant as well as the overall mean. It is of interest to note that the global maximum is positioned in a range around F♯4. It has been shown that maximum toneness (a measure for pitch clarity) of general sounds, including musical instruments, is evoked around the pitch D4 [24], which is a major third apart from maximum pleasantness in our study. Interestingly, this maximum also corresponds with the average musical pitch found in cross-cultural musical works [25]. A similar pitch center was also found for the average pitch across all sustained orchestral instruments at D♯4 [3]. Combined, these findings clearly demonstrate that participants’ judgements on sound pleasantness were driven by F0.

Figure 11

Distribution of mean sound pleasantness ratings for all instruments and stimuli. The solid black line depicts the mean with bootstrapped 95% confidence interval shaded in gray.

In a recent study, musical instrument identification was tested across F0-register [10]. Participants were trained to identify instruments based on sounds from a single pitch or small pitch region. They then had to generalize instrument identity across the instruments’ entire F0-range. Although highest identification performance was seen for most instruments in the vicinity of the training range, the tuba was an outlier with the highest proportion correct in its low register, despite training in the high register. Furthermore, most instruments were strongly affected by pitch, both as linear and quadratic terms.

Comparing plausibility and pleasantness in the present study, strong positive correlations between the judgements were seen for synthesized alto voice and clarinet sounds. The analysis for violin sounds was inconclusive, and tuba sounds, in fact, showed a moderate negative correlation between pleasantness and plausibility. That is, curiously, with increasing pleasantness sounds become less plausible. In contrast to the pleasantness ratings being globally affected by F0, sound plausibility might be constrained to the individual instruments’ F0-range. It has to be noted that for the pleasantness task participants were not primed with instrument labels, whereas in the plausibility task instrument labels were provided. To better quantify the effect of pitch on plausibility ratings and their correlation with pleasantness ratings, more data points and possibly more instruments are needed.

In summary, we found similar global effects of F0 for pleasantness ratings as seen in other studies with comparable rating scales. Only the plausibility ratings did not show a global F0 effect across all four instruments.

4.3.2 Instrument-specific effects

Apart from strong global F0 effects on pleasantness ratings, more fine-grained effects were seen for individual instruments. The clarinet sounds in particular showed an interesting behavior. Compared to the congruent condition, the lrSE showed a dip in pleasantness ratings towards the middle register anchor (MRA). Past this point, ratings increased again and were on par with the congruent condition for even higher F0s. Considering the spectral modulation characteristics of the low register of the clarinet, even harmonics are not present or highly attenuated compared to odd harmonics, resulting in stronger modulations captured in lower cepstral coefficients (CCs). For the lrSE of the clarinet, when keeping the SE constant and shifting the fine structure by increasing F0, the first harmonic or fundamental moves into the dip created by the “missing” even, second harmonic, as seen in Figure 7. This mismatch of modulation characteristics could explain the decrease in pleasantness ratings for the lrSE around the MRA. The effect was also seen in the plausibility ratings.

Overall, the plausibility results suggest that synthesizing the SE from different registers across the instruments’ range does not affect ratings except for the hrSE condition. However, the synthesized tuba sounds appear to be unaffected by the high-pass filter characteristic of the hrSE for both pleasantness and plausibility ratings. A reason for this finding could be the hypothesized inherently weak intensity of the fundamental tone for mezzoforte notes [26]. In this case, high-pass filtering does not change the structure of the lower harmonics much (see Figure 7) and sounds are still perceived to be as pleasant and plausible as those from the congruent condition. Interestingly, in contrast to the low pleasantness ratings for tuba sounds in the low register, plausibility ratings were high in this register as well as the middle register. A similar observation can be made for the violin sounds, where tones were rated least pleasant in the higher register compared to higher plausibility ratings in the same register. There seems to be an inverse relationship between pleasantness and plausibility for these two instruments, at least for congruent sounds. Figure 10 shows the correlation between the two judgements for all instruments. Whereas alto voice and clarinet show a strong positive correlation between pleasantness and plausibility, the correlation is moderate and negative for tuba sounds and inconclusive for violin sounds.

In summary, instrument-dependent effects of F0 on pleasantness ratings converge with the notion of a co-dependence of F0 and SE in acoustical instrument sounds. In comparison, plausibility ratings were more robust and only decreased for extreme mismatches of F0 and SE in particular instruments.

5 General discussion

In Experiments 1 and 2 we investigated synthesized acoustic instrument sounds based on their SE and F0 statistics. Experiment 1 included a more general case across all 50 instruments from our acoustical analysis. SEs from a PCA space were synthesized with congruent and incongruent F0-pairings. Experiment 2 focused on four exemplary instruments in detail. Although both experiments featured similar synthesis approaches there are some distinctions to be made that also have implications for the overall interpretation of our results.

Our goal in the analysis/synthesis approach of Experiment 1 was to establish a complete synthesis across all F0s. For SEs with an F0 shifted towards higher frequencies this does not principally pose a problem. But for lower F0s it inevitably results in a high-pass characteristic, as also seen in Experiment 2. The method of using a manipulation of the original spectra was therefore not only done for synthesis reasons but also for analysis reasons. The aforementioned high-pass characteristic results in a pseudo-correlation with F0 (mostly reflected in the second cepstral coefficient). If no manipulation to the spectra or the CCs is performed, the resulting clusters in the PCA space will be heavily influenced by F0 and thus not fully represent the variability of the shape of the spectral envelope, independent of the source (i.e., F0). We are not aware of any publication that shows different or similar methods to eliminate such a pseudo-correlation.

5.1 Rating attributes

Whereas listeners were asked to rate sound pleasantness and brightness in Experiment 1, we collected ratings of sound pleasantness and plausibility in Experiment 2. It may be argued that other attributes such as familiarity, naturalness, or plausibility may be equally apt for revealing differences between congruent and incongruent F0-SE relations. In fact, pleasantness ratings may not be the most powerful attribute for discriminating between different classes of sounds, because listeners are not asked to explicitly make a reference to stored memory representations or acoustic templates. Yet, what we aimed at in this study was to obtain a rather open response that captured how well these highly synthetic sounds “fit the ear”, similar to a whole body of work on consonance judgements [18, 19]. In the latter work on consonance, two tones with a specific frequency interval are presented simultaneously and listeners can easily judge whether they find the interval pleasant or unpleasant. In the present timbre study, an F0 is presented with a specific SE. Our results show that even in this case, there are clear differences in pleasantness judgments that partially reflect the congruency of these two variables.

5.2 Interpretation of the PCA space

The results from Experiment 2 showed differences between instruments in the susceptibility of sound pleasantness and sound plausibility ratings to deviations in F0 and SE. Besides strong effects in the hrSE condition for most instruments, more detailed discrepancies between the congruent and the lrSE and mrSE became clear for certain instruments, but especially for clarinet sounds. This behavior might be linked to the properties of the clarinet cluster in Figure 2. The cluster is rather spread out across large parts of the overall space of instrument sounds, thus exhibiting a high degree of dispersion. Compared to the clarinet cluster, tuba sounds appear more concentrated in a single spot with a low degree of cluster dispersion. Since the PCA space visualizes differences in SE shapes, it can be used to interpret and maybe even predict differences in sound pleasantness and plausibility ratings between the congruent condition and the fixed SE conditions. This is a research avenue that should be explored in future work.

5.3 Limitations of the synthesis method

Our modeling approach for the cepstral SE representation in both experiments was motivated by the aim to focus on fundamental characteristics rather than specific idiosyncrasies of sounds. Naturally, the synthesized sounds lacked realism due to the approximation of the spectra. In Experiment 1 this was especially prevalent due to the intermediate step of creating a PCA space with even more reduced spectral variation. But this was indeed a design choice in order to focus on the most basic interactions of SE and F0 as already mentioned in the Discussion of Experiment 1. In Experiment 2, even though a more fine-grained approach was taken by utilizing the true envelope computation there are still limitations. An instrument’s timbral identity is also captured in spectro-temporal cues, especially prevalent in the onset portion of sounds [27, 28], which were deliberately not considered in this study. All instruments had a generic attack and decay formed by a Tukey window. Moreover, all SEs were based on sounds at a playing level of mezzoforte. An instrument’s timbre is dependent on the dynamic level [3] and effects of congruency might be less or more pronounced at different sound intensities. The behavior across the whole dynamic range should be investigated in future research. Also, more instruments, possibly of electronic origin, should be included in such research to draw a bigger picture of the effects of F0 and instrument on sound pleasantness and plausibility ratings. In particular, a possible distinction between pleasantness and plausibility ratings based on global or instrument-specific F0 effects needs to be validated.
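The generic attack and decay mentioned above can be illustrated with a minimal sketch of a Tukey (tapered cosine) window, which is flat in the middle and has raised-cosine ramps at both ends. The taper fraction `alpha`, the window length, and the function name `tukey_window` are assumptions; the values used for the stimuli are not stated in this section.

```python
import math

def tukey_window(n_samples, alpha=0.1):
    """Tukey window: raised-cosine ramps covering a fraction alpha of the
    window (alpha/2 at each end) and a flat top of 1.0 in between."""
    if n_samples < 1:
        raise ValueError("n_samples must be positive")
    if n_samples == 1 or alpha <= 0:
        return [1.0] * n_samples
    alpha = min(alpha, 1.0)
    last = n_samples - 1
    edge = alpha * last / 2.0  # ramp length in samples at each end
    window = []
    for n in range(n_samples):
        d = min(n, last - n)  # distance to the nearer end of the window
        if d < edge:
            window.append(0.5 * (1.0 - math.cos(math.pi * d / edge)))
        else:
            window.append(1.0)
    return window

# Apply the same onset and offset ramp to an arbitrary (here constant) signal.
window = tukey_window(101, alpha=0.2)
shaped = [w * 1.0 for w in window]
```

Multiplying every synthesized tone by the same window gives all stimuli an identical temporal envelope, so that differences between stimuli are carried by F0 and SE rather than by instrument-specific onsets.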

6 Conclusions

Experiment 1 showed that congruency between F0 and SE significantly increased the pleasantness of sounds when discrepancies in F0 exceeded two octaves. Together with the finding that spectral envelopes within the vicinity of analyzed sounds were significantly rated as more pleasant than sounds from “outside”, this suggests that there is a preference for sounds where pitch and spectral characteristics match naturally, reflecting typical listening experiences in the real world. Furthermore, the first component of the latent space of spectral envelopes closely corresponded to brightness ratings, indicating a direct correspondence between the acoustical structure of natural sounds and an important facet of timbre perception. A more fine-grained synthesis in Experiment 2 showed that even small deviations in F0 and SE can noticeably affect the pleasantness of certain instrument sounds, although their effect size was relatively weak compared to a global F0 effect. However, this effect was not maintained for two instruments for the more robust sound plausibility ratings. We interpret the present results as evidence for a co-dependence of F0 and SE properties in the perceptual evaluation of sounds. Overall, these findings highlight the importance of considering both pitch and spectral properties in tandem, as their interaction appears crucial for shaping our perceptual evaluations of sound quality and timbre.

Funding

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project ID 352015383 – SFB 1330 A6. This work was also supported by a Freigeist Fellowship of the Volkswagen Foundation.

Conflicts of interest

The authors declare no conflict of interest.

Data availability statement

The research data associated with this article are available in Zenodo, under the reference S. Jacobsen, K. Siedenburg: Exploring the relation between fundamental frequency and spectral envelope in the perception of musical instrument sounds – sound files and participant responses, Zenodo, 2024. https://doi.org/10.5281/zenodo.11110480.

Supplementary material

Full model statistics of the linear mixed-effects models.

References

  1. C.J. Plack, A.J. Oxenham: Overview: the present and future of pitch, in: C.J. Plack, R.R. Fay, A.J. Oxenham, A.N. Popper (Eds.), Pitch, Springer, New York, NY, 2005. ISBN 978-0-387-28958-8, https://doi.org/10.1007/0-387-28958-5_1.
  2. K. Siedenburg, C. Saitis, S. McAdams, A.N. Popper, R.R. Fay: Timbre: acoustics, perception, and cognition. Springer Handbook of Auditory Research, Springer Nature, Heidelberg, Germany, 2019. https://doi.org/10.1007/978-3-030-14832-4.
  3. K. Siedenburg, S. Jacobsen, C. Reuter: Spectral envelope position and shape in orchestral instrument sounds, Journal of the Acoustical Society of America 149, 6 (2021) 3715–3727. https://doi.org/10.1121/10.0005088.
  4. S. McAdams: The perceptual representation of timbre, in: K. Siedenburg, C. Saitis, S. McAdams, A. Popper, R. Fay (Eds.), Timbre: acoustics, perception, and cognition, Springer, Cham, 2019, pp. 23–57. https://doi.org/10.1007/978-3-030-14832-4_2.
  5. E.J. Allen, A.J. Oxenham: Symmetric interactions and interference between pitch and timbre, Journal of the Acoustical Society of America 135, 3 (2014) 1371–1379. https://doi.org/10.1121/1.4863269.
  6. K. Siedenburg, J. Graves, D. Pressnitzer: A unitary model of auditory frequency change perception, PLoS Computational Biology 19, 1 (2023) 1–30. https://doi.org/10.1371/journal.pcbi.1010307.
  7. D. Maurer: Acoustics of the Vowel: Preliminaries, Peter Lang International Academic Publishers, 2016. https://doi.org/10.3726/978-3-0343-2391-8.
  8. M.J. McPherson, J.H. McDermott: Relative pitch representations and invariance to timbre, Cognition 232 (2023) 105327. https://doi.org/10.1016/j.cognition.2022.105327.
  9. S. McAdams, C. Douglas, N.N. Vempala: Perception and modeling of affective qualities of musical instrument sounds across pitch registers, Frontiers in Psychology 8 (2017) 153. https://doi.org/10.3389/fpsyg.2017.00153.
  10. S. McAdams, E. Thoret, G. Wang, M. Montrey: Timbral cues for learning to generalize musical instrument identity across pitch register, Journal of the Acoustical Society of America 153, 2 (2023) 797–811. https://doi.org/10.1121/10.0017100.
  11. S. McAdams, K. Siedenburg: Perception and cognition of musical timbre, in: D.J. Levitin, P.J. Rentfrow (Eds.), Foundations in music psychology: theory and research, MIT Press, Cambridge, MA, 2019, pp. 71–120.
  12. M. Caetano, C. Saitis, K. Siedenburg: Audio content descriptors of timbre, in: K. Siedenburg, C. Saitis, S. McAdams, A. Popper, R. Fay (Eds.), Timbre: acoustics, perception, and cognition, Springer, Cham, 2019, pp. 297–333. https://doi.org/10.1007/978-3-030-14832-4_11.
  13. D. Müllensiefen, B. Gingras, J. Musil, L. Stewart: The musicality of non-musicians: an index for assessing musical sophistication in the general population, PLoS One 9, 2 (2014) e89642. https://doi.org/10.1371/journal.pone.0089642.
  14. A.E. Milne, R. Bianco, K.C. Poole, S. Zhao, A.J. Oxenham, A.J. Billig, M. Chait: An online headphone screening test based on dichotic pitch, Behavior Research Methods 53, 4 (2021) 1551–1562. https://doi.org/10.3758/s13428-020-01514-0.
  15. B.T. West, K.B. Welch, A.T. Galecki: Linear mixed models: a practical guide using statistical software, Chapman and Hall/CRC, Boca Raton, FL, 2022. https://doi.org/10.1201/9781003181064.
  16. D. Bates, M. Mächler, B. Bolker, S. Walker: Fitting linear mixed-effects models using lme4, Journal of Statistical Software 67, 1 (2015) 1–48. https://doi.org/10.18637/jss.v067.i01.
  17. K.M. Steele, A.K. Williams: Is the bandwidth for timbre invariance only one octave? Music Perception 23, 3 (2006) 215–220. https://doi.org/10.1525/mp.2006.23.3.215.
  18. J.H. McDermott, A.F. Schultz, E.A. Undurraga, R.A. Godoy: Indifference to dissonance in native amazonians reveals cultural variation in music perception, Nature 535, 7613 (2016) 547–550. https://doi.org/10.1038/nature18635.
  19. P. Harrison, M.T. Pearce: Simultaneous consonance in music perception and composition, Psychological Review 127, 2 (2020) 216–244. https://doi.org/10.1037/rev0000169.
  20. C. Saitis, K. Siedenburg: Brightness perception for musical instrument sounds: relation to timbre dissimilarity and source-cause categories, Journal of the Acoustical Society of America 148, 4 (2020) 2256–2266. https://doi.org/10.1121/10.0002275.
  21. E. Schubert, J. Wolfe: Does timbral brightness scale with frequency and spectral centroid?, Acta Acustica united with Acustica 92, 5 (2006) 820–825.
  22. J. Marozeau, A. de Cheveigné: The effect of fundamental frequency on the brightness dimension of timbre, Journal of the Acoustical Society of America 121, 1 (2007) 383–387. https://doi.org/10.1121/1.2384910.
  23. A. Röbel, X. Rodet: Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation, in: International Conference on Digital Audio Effects, Madrid, Spain, 2005, pp. 30–35. https://hal.science/hal-01161334. [Google Scholar]
  24. D. Huron: Tone and voice: A derivation of the rules of voice-leading from perceptual principles, Music Perception 19, 1 (2001) 1–64. https://doi.org/10.1525/mp.2001.19.1.1. [CrossRef] [Google Scholar]
  25. D. Huron, R. Parncutt: An improved model of tonality perception incorporating pitch salience and echoic memory, Psychomusicology: A Journal of Research in Music Cognition 12, 2 (1993) 154. https://doi.org/10.1037/h0094110. [CrossRef] [Google Scholar]
  26. J. Meyer: Akustik und musikalische Aufführungspraxis: Leitfaden für Akustiker, Tonmeister, Musiker, Instrumentenbauer und Architekten, Bochinsky, Bergkirchen, Germany, 1995. [Google Scholar]
  27. K. Siedenburg: Specifying the perceptual relevance of onset transients for musical instrument identification, Journal of the Acoustical Society of America 145, 2 (2019) 1078–1087. https://doi.org/10.1121/1.5091778. [CrossRef] [PubMed] [Google Scholar]
  28. K. Siedenburg, M.R. Schädler, D. Hülsmeier: Modeling the onset advantage in musical instrument recognition, Journal of the Acoustical Society of America 146, 6 (2019) EL523–EL529. https://doi.org/10.1121/1.5141369. [CrossRef] [PubMed] [Google Scholar]

Cite this article as: Jacobsen S. & Siedenburg K. 2024. Exploring the relation between fundamental frequency and spectral envelope in the perception of musical instrument sounds. Acta Acustica, 8, 48.

All Tables

Table 1

Instruments’ F0-range, lower register boundaries of instruments’ middle and high registers, and register anchors from the low (lrSE), middle (mrSE), and high (hrSE) registers.

All Figures

Figure 1

Overview of the analysis and synthesis process. Darker colors correspond to diagrams for Experiment 1 and lighter colors for Experiment 2. [A1] Original waveform, here for a low (F♯3) and high (F♯5) note of a clarinet. Waveforms are only shifted along the time axis for visualization purposes. [A2] Long-term spectra from a fast Fourier transform (FFT) with a frequency resolution of 1 Hz; with (dark) and without (light) manipulation. [A3] ERB-magnitude spectra (RMS-normalized) for manipulated (dark) and original (light) spectra. [A4] ERB-frequency cepstral coefficients (EFCCs) 1 to 13 from a discrete cosine transform (DCT, type II). The lower graph corresponds to the original spectra and the upper to the manipulated spectra. CC1 (manipulated) was left out in the subsequent analysis as indicated by the dashed vertical line in the upper graph. [A5] Two-dimensional latent space from principal component analysis (PCA) on CCs 2 to 13. Smaller gray dots correspond to all analyzed notes across the data set. Experiment 2: [S32] Adjusted EFCCs from true envelope computation. [S22] Spectral envelopes from an inverse discrete cosine transform (IDCT); dotted lines correspond to the original EFCCs from A4 compared to the true envelope as solid lines. [S12] Synthesized waveforms from additive synthesis of harmonics underneath the spectral envelopes. Experiment 1: [S31] Reconstructed EFCCs after dimensionality reduction from the two-dimensional latent space. [S21] Spectral envelopes from an IDCT; dotted lines correspond to the EFCCs of the manipulated spectra in A4 compared to the approximated spectral envelopes from the latent space as solid lines. [S11] Synthesized waveforms from additive synthesis of harmonics underneath the spectral envelopes.
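The cepstral smoothing idea behind panels A3 to A4 and S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `dct2`, `idct2`, and `hz_to_erb_number`, the orthonormal DCT scaling, and the toy band levels are all hypothetical, and the Glasberg and Moore ERB-number formula is used only as a common choice because the caption does not state which ERB scale was applied.

```python
import math

def hz_to_erb_number(f_hz):
    """ERB-number scale after Glasberg and Moore (assumed here, not confirmed by the caption)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def dct2(x):
    """Orthonormal DCT type II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c, n):
    """Inverse of the orthonormal DCT-II; coefficients beyond len(c) count as zero."""
    out = []
    for i in range(n):
        s = 0.0
        for k, ck in enumerate(c):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * ck * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

# Toy band levels on an ERB-spaced axis: a coarse bump plus fine ripple (hypothetical data).
levels = [math.exp(-((i - 10) / 6.0) ** 2) + 0.1 * ((-1) ** i) for i in range(40)]

# Keeping only the first 13 coefficients retains the coarse spectral shape.
coeffs = dct2(levels)[:13]
envelope = idct2(coeffs, len(levels))
```

Truncating the coefficient vector before the inverse transform is what turns a detailed spectrum into a smooth envelope; the number 13 follows the caption, while everything else here is illustrative.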

Figure 2

The component space. Individual panels depict 12 exemplary instruments (from the total of 50); gray dots correspond to all 2000 analyzed sounds.

Figure 3

Reconstructed spectral envelopes from PC1 and PC2 across the component space (directions indicated by red arrows) with color-coded spectral centroid (SC) values.
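For readers unfamiliar with the spectral centroid used for the color coding, a minimal sketch of the usual definition (the amplitude-weighted mean frequency) is given here. The function name `spectral_centroid` and the choice of linear magnitude weighting on a Hz axis are assumptions; the caption does not specify the exact weighting or frequency scale used in the paper.

```python
def spectral_centroid(freqs_hz, magnitudes):
    """Amplitude-weighted mean frequency (spectral center of gravity)."""
    total = sum(magnitudes)
    if total <= 0:
        raise ValueError("magnitudes must have a positive sum")
    return sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total

# Hypothetical three-partial spectrum: more energy in higher partials raises the centroid.
sc_flat = spectral_centroid([100.0, 200.0, 300.0], [1.0, 1.0, 1.0])
sc_tilted = spectral_centroid([100.0, 200.0, 300.0], [1.0, 1.0, 4.0])
```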

Figure 4

(A) Grid points for the congruent condition. (B) Grid points for the incongruent condition.

Figure 5

(A) Sound pleasantness ratings across congruency condition and space position. Points correspond to average ratings of individual participants. Error bars correspond to bootstrapped 95% confidence intervals. (B) Linear regression between incongruency level (ICLVL) and sound pleasantness ratings from the inner space for the incongruent condition.
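The bootstrapped 95% confidence intervals mentioned in this and the following captions can be illustrated with a percentile bootstrap of the mean. This is a sketch under assumptions: the caption does not say which bootstrap variant or how many resamples were used, and the function name `bootstrap_ci_mean`, its parameters, and the example ratings are hypothetical.

```python
import random
import statistics

def bootstrap_ci_mean(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical ratings from ten participants on an arbitrary scale.
ratings = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0, 4.0, 3.5]
ci_low, ci_high = bootstrap_ci_mean(ratings)
```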

Figure 6

(A) Sound brightness ratings across the first PC for all conditions. Error bars correspond to bootstrapped 95% confidence intervals. (B) Sound brightness ratings across the second PC for all conditions. Error bars correspond to bootstrapped 95% confidence intervals. (C) Linear regression between F0 and sound brightness ratings for the congruent and incongruent conditions.

Figure 7

Harmonics (blue vertical lines) from the low register spectral envelope (lrSE) synthesized at low (top panel rows), middle (middle panel rows), and high (bottom panel rows) register anchors (LRA, MRA, HRA) for all instruments. Spectral envelopes from the middle and high registers (mrSE, hrSE) for comparison in red and yellow, respectively.
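The principle shown in this figure, a fixed spectral envelope sampled at the harmonics of different F0 values and then rendered by additive synthesis, can be sketched as below. All names (`harmonic_amplitudes`, `additive_synthesis`, `toy_envelope`) and the single-peak envelope are hypothetical illustrations rather than the envelopes or code used in the study; the F0 values approximate F♯3 and F♯5.

```python
import math

def harmonic_amplitudes(envelope, f0_hz, f_max_hz):
    """Sample an envelope function at the harmonics k * f0 up to f_max."""
    n_harm = int(f_max_hz // f0_hz)
    return [(k * f0_hz, envelope(k * f0_hz)) for k in range(1, n_harm + 1)]

def additive_synthesis(partials, duration_s, sample_rate):
    """Sum of sinusoids with zero starting phase; partials are (frequency, amplitude) pairs."""
    n = round(duration_s * sample_rate)
    return [
        sum(a * math.sin(2.0 * math.pi * f * t / sample_rate) for f, a in partials)
        for t in range(n)
    ]

def toy_envelope(f_hz):
    # Hypothetical single-peak envelope centered at 500 Hz; not taken from the paper.
    return math.exp(-((f_hz - 500.0) / 400.0) ** 2)

# The same envelope is sampled densely at a low F0 and sparsely at a high F0.
low = harmonic_amplitudes(toy_envelope, 185.0, 4000.0)   # roughly F#3
high = harmonic_amplitudes(toy_envelope, 740.0, 4000.0)  # roughly F#5
wave = additive_synthesis(high, 0.01, 16000)
```

With a high F0, only a few harmonics fall under the envelope, so a low-register envelope transposed upward is represented by far fewer partials than at its original pitch.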

Figure 8

Sound pleasantness ratings across pitch (F0) for the low (lrSE), middle (mrSE), and high (hrSE) register spectral envelopes and for the congruent condition. (A) Empirical data with bootstrapped 95% confidence intervals. (B) Fixed effects of a linear mixed-effects model. Vertical lines represent register anchors (color) and F0-range (gray).

Figure 9

Sound plausibility ratings over F0-register for the low (lrSE), middle (mrSE), and high (hrSE) register spectral envelopes and for the congruent condition. Error bars correspond to bootstrapped 95% confidence intervals.

Figure 10

Sound plausibility ratings over pleasantness ratings for violin, alto voice, clarinet, and tuba. Low, mid, high register and congruent data points depicted as upward triangles, circles, downward triangles, and squares, respectively.

Figure 11

Distribution of mean sound pleasantness ratings for all instruments and stimuli. The solid black line depicts the mean with bootstrapped 95% confidence interval shaded in gray.
