Issue 
Acta Acust.
Volume 8, 2024



Article Number  43  
Number of page(s)  13  
Section  Speech  
DOI  https://doi.org/10.1051/aacus/2024054  
Published online  30 September 2024 
Scientific Article
Design of a computational method to optimise acoustic output of the human vocal tract
^{1}
Department of Mechanics, Biomechanics and Mechatronics, Faculty of Mechanical Engineering, Czech Technical University in Prague, Technická 4, 160 00 Prague 6, Czech Republic
^{2}
Department of Mechanics, Biomechanics and Mechatronics, Faculty of Mechanical Engineering, Czech Technical University in Prague, Technická 4, 160 00 Prague 6, Czech Republic
^{*} Corresponding author: jaroslav.storkan@fs.cvut.cz
Received:
12
February
2024
Accepted:
20
August
2024
The influence of the geometric configuration of the human vocal tract (HVT) on the character of acoustic energy distribution during phonation of the vowel [a:] has been analysed. The computationally efficient mathematic models of the HVT have been assembled based on super elements and an isoparametric element with higher degree of polynomial shape function. The assembled models enable the easy and quick geometrical reconfiguration of the HVT and they can be used for the timeconsuming optimization process with aim to find the suitable geometric configuration of the HVT to generated the socalled singer’s formant.
Key words: Human vocal tract / Finite element method / Singer's formant
© The Author(s), Published by EDP Sciences, 2024
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
To assess how a human’s vocal tract (HVT) geometry affects their phonation characteristics, it is necessary to develop computationally efficient mathematical models. In [1], a method was presented how to tune the shape of the simplified model of HVT to given sets of formant frequencies. In [2] was derived the vocal tract length sensitivity function, which enables to evaluate changes in formant frequency due to modification of the vocal tract length. The sensitivity analyses of the simplified model of HVT to position of the formants in the frequency range are solved e.g. [3–7]. Creating the fully parametric models of HVT in 3D domain for simulating human phonation is a very challenging computational problem [8–13]. For the 3D acoustic analysis of HVT, the finite element method (FEM) is almost exclusively used. The computational complexity of this method increases with the number of used finite elements describing the solved system. Direct geometric reconfiguration of the finite element mesh represents a very complex problem requiring a detailed analysis of the solved system [14]. If the aim is to optimize for physiologically permissible configurations of the HVT, then FEM is not a suitable method, due to its computational complexity. One way to assemble an efficient HVT computational model is to use socalled superelements [15, 16], where the reduced mass and stiffness matrices of the individual superelement are derived by a suitable reduction technique. Another way to formulate an efficient computational model of HVT is to derive a specialization isoparametric finite element with higher polynomial shape function. In this article both these approaches were used to find the physiologically permissible geometric configuration of the HVT producing maximal value of acoustic energy in the frequency span about 3 [kHz] for phonation of vowel [a:]. This increase of the acoustic energy is known as so called singer’s formant when 3rd and 4th resonant peaks of the frequency spectrum of phonation characteristics are shifted closer together and causes significantly increase of the acoustic energy in the frequency span about 3 [kHz].
2 Aim of this work
The influence of the geometric configuration of the HVT on the distribution of resonant peaks in the frequency spectrum of the human speech is analysed e.g. [17–19]. Evaluating the sensitivity of computational models to change their configuration generally requires very timeconsuming calculations. The main aim of this work is to suggest and derive an efficient computational model of the HVT that can be used for time consuming optimization analyses. The first method uses so called superelements. The initial geometry of the HVT is divided into separate subregions. These subregions are then meshed as conventional finite elements. The assembled mass and stiffness matrices of the super element are further reduced by a suitable reduction technique. The reduction method must decrease the number of degrees of freedom and guarantees the coincidence of the lowest eigenfrequencies of the reduced and nonreduced super element. The second method uses a special isoparametric element with higher degree of polynomial shape function. This element exploits the typical geometrical configuration of the HVT which is remarkably like a branching canal.
3 Finite element model of the vocal tract
The initial geometrical configuration of the HVT for the vowel [a:] was taken from [20, 21]. The CT LightSpeed VCT GE64 device was used for the imaging. The measurement resulted in 181 sections of 0.625 mm thickness. The image resolution was 512 × 512 pixels. The measurement was carried out in the supine position, while the patient was pronouncing the vowel [a:]. The geometric model was created by analysing the sagittal CT crosssections that were transferred to the FE model (Fig. 1). The geometric model includes the parallel cavities of the piriform sinuses and the epiglottic vallecula.
Figure 1 FEM mesh for vowel [a:]. 
The mathematical description of the finite element is given by the row vector N_{e} [–] which contain shape functions that distribute the nodal values of the unknown pressure quantities throughout the entire volume of the element. The acoustic pressure distribution in the element’s volume is acquiring by the multiplying the shape functions matrix by the column vector of the nodal values pressure P_{e} [Pa].
The acoustic problem is described by wave equation (2)
where ρ_{0} [kgm^{−3}] is the density of the environment, c_{0} [ms^{−1}] is the speed of sound, r [kgm^{−3} s^{−1}] is the volume damping coefficient, V_{e} [m^{3}] is the element volume and ∆ [m^{−2}] is the Laplace operator. The finite element method uses a socalled weak formulation and the Galerkin method [22]. The differential equation of the strong formulation is scalarly multiplied by the test function δp^{'} and the resulting expression is integrated over the solved volume
The equation is converted to the frequency domain substituting a time derivative for iω where ω [s^{−1}] is the angular frequency. Green’s first identity is applied to the equation. The pressure gradient in the resulting surface integral can be expressed using the acoustic particle velocity
where ∇ [m^{−1}] is the Nabla operator, n [−] is the unit normal to the boundary ∂V_{e} [m^{2}], and v' [ms^{−1}] is the acoustic particle velocity. These expressions can be simplified as follows:
where M_{e} [s^{2}m] is the mass matrix of the element
B_{e} [sm] is the damping matrix of the element (the specific acoustic impedance has been used here Z_{s} = p'/(n^{T}v'))
K_{e} [m] is the element’s stiffness matrix
and V_{e} [kgs^{2}] is the input velocity vector
Equation (5) mathematically defines the acoustic properties of the element. After discretising the space of finite elements, global matrices describing the entire solved system are assembled.
Figure 1 illustrates the FEM mesh model of the HVT for a vowel [a:]. This model has a total of 36,553 nodes. The mesh model is unstructured and the element types used are quadratic tetrahedrons with ten nodes. There are 186,570 elements in the model. The model is created using commercial Ansys software.
The airflow is not stationary due to vocal folds oscillations, it is periodic and can be approximated using an analytical model. The LiljencrantsFant model (LF model) [23] is one of the basic glottal flow models that can be used for this purpose. This model has four parameters. The setting of the excitation of the LF model was chosen by base frequency pulses of 100 [Hz], time constants t_{e} = 0.006 [s], t_{a} = 0.00023 [s], t_{p} = 0.0045 [s], and an amplitude E_{e} = 0.4 [m^{3}s^{−2}]. Figure 2 shows the pulse shape and control parameters. A velocitytype boundary condition is modelled at the level of the vocal folds.
Figure 2 Left: LF pulse excitation for the time derivative of the glottal volume velocity used for modelling the phonation. Right: Corresponding integral (glottal airflow volume velocity pulse). 
Acoustic energy is emitted from the mouth into the surrounding space. The external environment model is replaced by the emitted acoustic impedance. This is derived under the assumption of a circular oral opening, constant acoustic particle velocity throughout the oral opening, and an infinite external environment (piston radiation model) [24]. In such a case, the specific acoustic impedance at the output of the HVT can be defined by the following relationship
where J_{1} [–] and H_{1} [–] are the first order Bessel and Struve functions, respectively, and R [m] is the radius of the oral opening under consideration. The disadvantage of this method is the nonlinear frequency dependence of the acoustic impedance. The boundary surface of the vocal tract was modelled as a rigid wall.
A disadvantage of the described FEM model is the large number of degrees of freedom (DoFs). The dimensions of the mass, damping, and stiffness matrices of the model are equal to the number of DoFs of the model. Although the matrices are symmetric and sparse, their dimensions are generally so large (in this case: 36,553) that the model can be described as very computationally demanding. The computational requirements depend on the type of analysis, the computer power, and the performance optimisation of the solver.
4 Model reduction method
The most commonly used method of reducing the computational requirements of large models is a reduction at the level of the M and K matrices. The principle is to reduce the number of DoFs. The resulting reduced model can be understood as a superelement. The basic reduction method is the Guyan reduction method [25], which is generalised by IRS method.
The IRS reduction is based on a model containing stiffness and mass matrices. The DoFs are divided into those that are kept, P_{m}, and those that are reduced, P_{s}, [26]
The mathematical model is transformed into the following form
For the reduction, it is necessary to find the map P in the set P_{m}. This is ensured by the transformation matrix T_{IRS} by the following relation
The transformation matrix T_{IRS} is used to transform the full mathematical description into the reduced DoFs
is defined by the following relation [27]
where T_{s} is the transformation matrix of the static Guyan reduction
and the matrix S is defined by the following relationship
The formula for T_{IRS} is implicit because it depends on the reduced system. Therefore, it must be calculated iteratively. The first iteration step is solved using the Guyan reduction T_{IRS,1} = T_{s}. After obtaining the first reduced mass and stiffness matrices, the iteration can be continued using the IRS method
The convergence of this iteration scheme is accelerated if T_{s} is replaced by T_{IRS,i} in the second part. The final reduction matrix is determined after convergence. The form of the model is then
5 Properties of IRS reduction
The IRS reduction [27] uses simplified assumptions that introduce errors into the computed model. These errors depend on the number and location of the reduced and retained nodes and on the number of iterations of the IRS method. Therefore, the effect of the IRS reduction is tested on the model properties. The motivation for using reduction methods to modify models was to reduce the computational requirements, allowing optimal computations to be performed in a reasonable time. The expression for the transformation matrix of the IRS method is implicit and is solved iteratively, which makes the calculation more complex. Therefore, a convergence test of the IRS iteration scheme was carried out. This test is important for selecting the optimal number of iterations that guarantees a compromise between the computational requirements and the error generated.
The formants are used to study the importance of the HVT in vowel formation. From an acoustic point of view, the formant is a local frequency maximum in the voice spectrum, which is created by the modulation of acoustic pressure by a transfer function of the HVT between the vocal folds and the mouth. These local extremes are always in the area around the eigenfrequency of the acoustic oscillations of the solved geometry. The task is therefore to find these eigenfrequencies.
For the purposes of this work, the geometric model of the measured HVT had to be divided into subsequent volumes. The crosssections are always planar and the resulting geometry is shown in Figure 3. The main HVT cavity consists of 16 volumes. In addition, the model contains two parts that form the piriform sinuses and four parts that close the model. The vallecula are a part of the main cavity. All parts of the divided HVT were reduced using the IRS method. The aim of the reduction was to retain only the nodes at the intersection of the dividing planes with the HVT surface. This reduced the total number of DoFs from 36,553 to 1,154. Almost 97% of the model size is reduced. Figure 4 shows the relative error of the eigenfrequency of models as a function of reduction rate in DoFs. These graphs are used to evaluate the 10 smallest eigenfrequencies for the each HVT part and each degree of reduction. To reduce the amount of data in the image, the eigenfrequency with the worst result is always displayed. It can be seen that there is practically no error up to a reduction of 70–80% of the DoFs. If only the DoFs on the perimeter of the dividing surface are retained, the errors are in the tenths of a percent range. Different degrees of reduction were achieved at different parts of the HVT. This is because the ratio between the number of retained DoFs and the total number of DoFs is different. In some parts, only 86% of the DoFs were reduced and the maximum reduction reached 97%.
Figure 3 Vocal tract for vowel [a:] divided into subsystems. 
Figure 4 Relationship between relative error and reduction rate. 
The computational time increases with the number of iterations. The optimal number of iterations was determined by a numerical experiment. The relative error of the eigenfrequencies was monitored as a function of the number of iterations (Fig. 5). The eigenfrequencies were reevaluated and the worst result was always used for illustration purposes. The error is up to 68% after the first iteration, and the average error is approximately 20%. The error decreases with the number of iterations. The decrease slows down after the fifth iteration. The error does not exceed 1.7% after the 10th iteration and the maximum error after the 20th iteration is 1.08%.
Figure 5 Relationship between relative error and number of IRS iterations. 
The test carried out using the IRS method only evaluated the eigenfrequencies and showed a low distortion of the results (less than 1%). After the IRS reduction, the model has a lower number of DoFs and therefore an even lower number of eigenfrequencies and oscillation shapes. The IRS maintains the lowest frequency. The model retains the lowest eigenfrequency even if it does not contain the nodes necessary to describe the respective oscillation shape.
Superelements were created for all parts of the HVT. As shown in Figure 3, this resulted in a total of 22 superelements, based on which the entire HVT computational model was built.
The CPU time to obtain the reduced model from the full model is 229.8 s. Reduced model has 1,154 nodes (Fig. 6), which is about 3% of the original model. The computational requirements increase with its size. The processing time saving is significant. In this case, no emitted acoustic impedance was used at the level of the mouth. This boundary condition was modelled as zero acoustic pressure. This can be interpreted as an emission without any losses.
Figure 6 Nodes of a complete model assembled from superelements. 
The use of superelements and reduction methods confirmed the applicability for reducing computational requirements. Comparing the size of the original model with the reduced model is not sufficient to evaluate the effectiveness of these methods. It is also necessary to evaluate the computational requirements of the reduction itself and the number of runs of the resulting model. Most of the computational time is spent in the inversion of the K_{s,s} and M_{red} matrices. The IRS method of creating superelements is only practical when the resulting model is to be used repeatedly.
6 Designed isoparametric element
The HVT has a characteristic geometry that can be described as a curved canal, with a dominant longitudinal dimension and less dominant transverse dimensions. The crosssection is different from the circular one. The designed element has a triangular base in the reference state, which extends to the third dimension (Fig. 7). The triangular base is based on the shape functions of a higher degree polynomial but most of the nodes are eliminated. The base has nodes only at the vertices and on one wall.
Figure 7 Derived isoparametric element. 
The topology is chosen so that the side with the most nodes forms the surface of the HVT and the nodes outside the surface are in the centre of the crosssection. The HVT is divided longitudinally. As the outer sides of the surface contain many nodes, it is possible to take into account the complex shape of the HVT crosssection. The base functions of the 6th degree and 6 elements were used in every crosssection. Overall, every crosssection is described using 36 nodes on the surface and one central node. The elements in the longitudinal direction are linear. Therefore, it is necessary to cut a sufficient number of crosssections in the HVT geometry.
The derivation of the base functions is based on a 2D triangular 6th degree polynomial element (Fig. 8 on the left). The derivation of this 2D element is described in [28]. Area coordinates are used here (Fig. 8 on the right), which are defined as the ratio of the defined area to the area of the element [29]. The first step in the algorithm is to assign the coordinates L_{1}, L_{2}, L_{3} to all nodes. The parameters α, β, γ for each node are calculated from the node coordinates
where n is the degree of the polynomial. It is now possible to express an explicit relationship for the node shape function
where
Figure 8 Left: Fullplanar 6th order element. Right: Definition of surface coordinates. 
The functions and are defined in a similar way to function [28]. The parameters α, β, γ determine for which node the given shape function is valid. This procedure creates a complete element. The desired element does not contain any nodes. So redundant nodes are eliminated. The shape functions of the redundant nodes are added to the shape functions of the remaining nodes in the chosen ratio.
Once the redundant nodes have been eliminated, the redundant surface coordinates are converted into independent coordinates η, ξ [29]
The resulting system of the triangular base shape functions of the derived element is as follows
The 2D triangular element is extracted into the spatial dimension by a linear shape function. The planar element exists in the η, ξ plane and its extraction is on the ζ ∈ 〈0, 1〉 axis. Two new shape functions are created from the shape function of each node. This results in a final system of 16 shape functions of the element from Figure 7
The derived shape functions are valid in the volume of the element. The volume is defined by the set, V in the space of local coordinates as
The resulting element is a 6th degree polynomial in a planar base and a 1st degree polynomial in the third dimension. This allows complex shapes to be modelled by using only six elements in each crosssection. The integration scheme combines 2D integration in the triangular base of the 7th order of accuracy with 13 integration points [29] (Fig. 9) and 1D integration in the perpendicular direction with 4 integration points. The element contains a total of 52 integration points.
Figure 9 Integration points in a triangular base. 
Conditionality of the derived element must be checked in all geometric configurations in which it is used. This was verified by numerical tests, which verified its conditionality on the tested crosssectional shapes. The conditionality of the element is calculated from the transfer matrix between the reference and the actual shape of the element. Conditionality was tested only in 2D. Figure 10 illustrates the conditionality of the elements with a circular crosssection. The conditionality ranges at intervals of approximately (1.7, 2.6). Ideally, the conditionality is 1 and infinite conditionality means singularity.
Figure 10 Conditionality of elements in circular geometry. 
The HVT is geometrically more complex than a tube (circular crosssection). Therefore, a second conditionality test was performed. Figure 11 illustrates the test for significantly nonconvex geometry. Here, it depends on how the elements are distributed in the geometry. Therefore, two variants are shown. The worst conditionality is 10.3.
Figure 11 Conditionality of elements in nonconvex geometry. 
The element mass matrices and stiffness matrices for c_{0} = 1 ms^{−1} are full. The structure of the mass and stiffness matrices is illustrated in Figure 12. This shows that the higher values of the stiffness matrix are located at the node positions that are on the wall of the element, forming the HVT shell. In contrast, higher values of the mass matrix are found at the node positions that are not on this wall.
Figure 12 Visualisation of the values of the derived element mass (left) and stiffness (right) matrices. The diagonal elements of the central nodes are highlighted. 
7 Properties of the derived isoparametric element
Combining the 6 derived elements results in a prismatic volume, with 1 node in the centre of the bases and 36 nodes along the perimeter of the bases, 74 nodes in total. This is similar to the superelement created by the IRS method. Each of these groups of six elements can fill the HVT volume surrounded by two crosssections.
There is a significant difference between the two types of elements. The IRS approach explicitly does not contain internal nodes, but their effect is included in the resulting superelement, whereas the derived isoparametric element does not contain these nodes. If only one group of six elements is used, it is not possible to find a higher longitudinal oscillation shape. The same problem occurs with the transverse oscillation shapes. By using a sufficient number of layers composed of a group of six derived elements, the qualitative results are the same as for the fullscale FEM models reduced by the IRS method. An advantage of the superelements from the IRS method is the ability to find the eigenfrequency of the oscillation shape that cannot be described by the nodes used. In contrast, the derived element model is numerically much more effective. The assembly of one group of six derived elements is direct. The shape functions are derived in general and transforming them into the necessary geometry is very easy. It is therefore easy to generate their stiffness and mass matrices. This is done only six times for a layer, while creating a superelement represents the assembly of a complete FEM model from ordinary elements containing thousands of DoFs. When reducing nodes, it is necessary to invert the matrix by a dimension equal to the number of reduced DoFs and then invert the resulting mass matrix in each iteration. If tens of nodes are to exist after the reduction, it is necessary to invert the matrix by thousands of rows. This is computationally demanding. Building a model directly from derived elements would be expected to be more efficient than simply reducing a full model, the time required to assemble the full model is not taken into account. Moreover, if the time to assemble a new model was considered, this ratio would be even greater.
The HVT model (Fig. 13) for the vowel [a:], divided into the parts shown in Figure 3, can be discretised by the derived isoparametric elements, using sixteen layers of elements. The number of layers of elements was chosen with regard to the shape of the HVT, the wavelength of the natural vibration shapes and the chosen linear shape function of the elements in the longitudinal direction. The HVT is modelled to include parallel cavities. The piriform sinuses and the vallecula are modelled by six additional elements that are attached to the main cavity from the side. In total, the model contains 949 nodes. There are 17 crosssections with nodes in the main HVT cavity, which is sufficient for modelling lower longitudinal oscillation shapes without introducing discretisation errors [30]. The shape of a pressure wave that moves between the vocal folds and the mouth is represented by the longitudinal oscillation. The lower longitudinal oscillation shapes are the most important for voice production. The speed of sound is chosen to be 350 m/s and the walls of the HVT are assumed to be stiff. The HVT is excited by velocity pulses of the LF model. It is not necessary to know the shape of these pulses to find the eigenfrequencies as the boundary condition used is of a homogeneous velocity. The boundary condition at the HVT output in the mouth is defined as the emitted acoustic impedance. The CPU time to build this HVT FEM model is 0.6 s.
Figure 13 The finite element mesh of the vocal tract for vowel [a:] assembled from derived isoparametric elements. 
The eigenfrequencies are shown in Table 1, together with a description of the respective oscillation shape. The first four oscillation shapes are longitudinal. The first transverse oscillation shape (pressure wave perpendicular to the longitudinal direction) is located at the widest part of the HVT with a frequency of 4261.9 [Hz]. For the human voice, the most important frequency range is 300 [Hz]–5 [kHz]. There are five longitudinal oscillations and one transverse oscillation in this range. The first four longitudinal shapes are shown in Figure 14 and the first transverse shape is shown in Figure 15.
Figure 14 Acoustic pressure of the first 4 longitudinal oscillation shapes F_{1} = 530.4 [Hz], F_{2} = 1146.6 [Hz], F_{3} = 3061.2 [Hz], F_{4} = 3876.1 [Hz]. 
Figure 15 Acoustic pressure of the transverse oscillation shape F = 4261.9 [Hz]. 
Eigenfrequency and description of the shape of the model’s oscillations.
The character of a vowel is given by its formants. These are the most represented frequencies in the voice spectrum. When the HVT is excited with velocity pulses, the spectrum of the pulses is modulated by the HVT transmission characteristic. The frequency peaks represent formants when the excitation is broadband. The transmission characteristic is shown in Figure 16, which illustrates the significance of the character of the oscillation shape. The first six resonance peaks correspond to the longitudinal oscillation frequencies. There are also two transverse oscillations among these two shapes that are not visible in the transmission characteristics. This is given by the position of their nodes at the excitation and emission points. Therefore, longitudinal oscillation shapes are important for voice production.
Figure 16 Transmission characteristics between input excitation and emitted acoustic pressure. Top plot shows the amplitude characteristics and bottom plot shows phase characteristics. 
8 Energy maximisation in the frequency band
From the point of view of the vocal abilities of professional singers, it is advantageous to be able to emphasize certain frequencies of the spectrum when singing. The human ear is most sensitive around 4 [kHz]. If the singing dominantly represents frequencies around 3 [kHz], then the singing will be much clearer, and when sung with an orchestra, the orchestra will not drown out the singing. This is called the singer’s formant and is particularly important for opera singers.
The singer’s formant has been the subject of research [31, 32]. It has been found that it can be achieved by an appropriate ratio between the crosssectional area of the HVT in the epiglottis and the piriform sinuses [33]. According to the results of the study [33], the ratio between the crosssectional area before the piriform sinuses and the crosssectional area after their connection should be less than 1/6. Therefore, the assembled model was used to verify the hypothesis. The FEM model was parameterised to allow for changes in the crosssectional area of the main HVT cavity in all crosssections and the crosssectional area of the piriform sinuses (Fig. 17). The crosssections area of the main HVT cavity in piriform sinuses connection (A_{3} [m^{2}]) was varied and the crosssections area after the piriform sinuses connection remained constant. The effect on the frequency of the formants was monitored in Figure 18 as function of ratio between area A_{3} and original area A_{03}.
Figure 17 Geometric modifications of the epilarynx used for improving transmission characteristic of the HVT. 
Figure 18 Position of formants depending on the vocal tract crosssection and the piriform sinuses ratio. 
The analysis showed that the third and fourth formants can be brought closer together with a suitable crosssection ratio, to a distance of 224 [Hz] from the original 815 [Hz]. Moving the two resonance peaks closer to each other leads to an increase in the transfer characteristics, as seen in Figure 19. This can be interpreted as the singer’s formant. This double peak is located in the area around 3 [kHz], which is consistent with observations made on opera singers.
Figure 19 Comparison of the transmission characteristic reference model and the transmission characteristics optimised model. Top plot shows the amplitude characteristics and bottom plot shows phase characteristics. 
Figure 19 compares the frequency characteristics of the modified model for generating the singer’s formant with the original geometry model. It can be seen that the geometric modification affected neither the region below 2 [kHz] nor the region above 7 [kHz]. Since it did not affect the frequency of the first two formants, the vowel identity did not change. The maximum emphasis in the area of the singer’s formant (Fig. 20) increased from the original 60 [dB] to 79 [dB], which corresponds to a tenfold increase. The average emphasis in the 2750–3250 [Hz] band increased from 43 [dB] by 17 [dB] to 60 [dB], which is also about a tenfold increase in acoustic pressure.
Figure 20 Region of the singer’s formant in the transmission characteristic. 
To evaluate the amount of acoustic energy in the monitored frequency band, it is necessary to excite the transmission function, for which the LF model is used. Acoustic energy density is defined by the following relation [24].
Acoustic energy density is made up of a potential (pressure) and a kinetic part that changes with time. Only the pressure part of the energy density is used to evaluate the acoustic energy density. When converted to the frequency domain, the acoustic energy density is defined as follows:
The glottal flow pulses that excite the HVT are periodic and contain a fundamental frequency and higher harmonic components. Their spectrum is not continuous. The acoustic energy is then simply the sum of the frequency components. The amount of acoustic energy density in the monitored frequency band of the original model is 2.86e7 [Jm^{−3}]. The modified model, which simulates the singer’s formant emits 3.17e5 [Jm^{−3}] of energy density in the same frequency band. A comparison with the singer’s formants shows a 111 times higher amount of energy in the monitored frequency band.
9 Conclusions
A new isoparametric element was derived for the direct assembly of FEM HVT models. In parallel, a spatial FEMHVT model was assembled from ordinary elements. From this, superelements were generated using the IRS method. The resulting HVT models were compared. The aim of these methods is to reduce the computational requirements of the model. The HVT geometry for the vowel [a:] was obtained from CT measurements, and it is very detailed. All models even include the piriform sinuses and the epiglottic vallecula. All the models used give satisfactory compliance in terms of the results of the formant configuration of [a:]. However, the computational requirements vary considerably.
The FEM model created from ordinary elements has 36,553 DoFs and 186,570 elements. Assembling the model requires the calculation of stiffness and mass matrices for each element. This is done by numerically integrating the shape functions over a selected number of integration points. The solution then lies in the analysis of the resulting matrices.
The analysis of the resulting matrices may be more effective if the matrices are reduced using reduction methods. In this case, it is necessary to first construct a complete FEM model and reduce it. If the degree of reduction is high, then the largest number of demands used on the inversion of the stiffness submatrix takes place during the IRS procedure. In terms of size, it is close to a full matrix. Although the IRS method reduces the DoFs from 36,553 to 1154, thus reducing the computational demands of the formants to negligible, this saving is negatively offset by the computation of the inversion stiffness submatrix. As a result, this procedure is not suitable for a onetime analysis as the savings correspond to the requirements of the reduction itself. Performing the IRS reduction of the model took 229.8 s of CPU time. This time includes only the reduction itself, without the time necessary to build the full FEM model.
The final choice used is to apply a new element to the FEM. The obvious advantage of this procedure is that it does not require assembling the model from ordinary elements, creating a calculationefficient model directly, rather than by simplifying an extensive model. This model has 132 elements, and a 7th accuracy order integration scheme is used (4 layers with 13 integration points in a triangular plane). The total number of integration points is 6864, which is more than 100 times less than the assembled ordinary model. The total number of DoFs is 949, which represents a significant saving even in the assembly of the global matrix, as there is less demand on the computer’s memory. The CPU time to build this FEM model of the HVT is 0.6 s. Comparing the requirements at the level of computational time, the model assembled from the new isoparametric elements is approximately 383 times faster than the simple IRS reduction. Both CPU times are evaluated on the same office computer in the MATLAB environment. Using a more powerful computer and an optimised solver would reduce the CPU times, but the ratio would remain similar. If the time needed to assemble the full FEM model were added, the ratio would be even higher. The model built from new isoparametric elements, and the model derived by the IRS reduction have a similar number of DoFs and, therefore, their computational complexity is similar.
This work compares two approaches to creating effective FEM HVT models. It was found that the wellknown IRS method is less advantageous than using the new element described here. There is a significant difference in efficiency. The model using the new element is an order of magnitude faster. One disadvantage of the new element is that it is designed for the HVT geometry. In an acoustic model, which is not geometrically similar to the HVT, this element does not offer any advantages. In contrast, the IRS method is universally applicable. The designed isoparametric element was tested by a practical calculation of the singer’s formant. The calculation confirmed the general assumptions for its design.
The model using the derived isoparametric element was used for a sensitivity study aimed at finding the optimal geometric configuration for generating a singer’s formant. It has been confirmed that a suitable ratio between the crosssection of the piriform sinuses and the laryngeal part of the HVT brings the 3rd and 4th formant closer together in the region of 3 [kHz]. This singer’s formant provides a 17 [dB] higher acoustic pressure than the original model with the geometry obtained from the CT measurement. The amount of acoustic energy is 111 times higher than the original model when the model is excited by the LF model.
Funding
The research has been supported by the grant GAČR No. 2411121S Redistribution of acoustic energy output in human voice and its effect on vocal folds loading using computer and physical modeling. Additional support was from the grant CTU in Prague No. 9301225201X000 Optimizing the acoustic output with minimal possible load on the vocal cords due to changes in the human vocal tract using computerphysical modeling.
Conflicts of interest
Author declared no conflict of interests.
Data availability statement
The data presented in this study are available on request from the corresponding author.
References
 W. Kreuzer, C.H. Kasess: Tuning of vocal tract model parameters for nasals using sensitivity functions, Journal of the Acoustical Society of America 137, 2 (2015) 1021–1031. https://doi.org/10.1121/1.4906158. [CrossRef] [PubMed] [Google Scholar]
 S. Adachi, H. Takemoto, T. Kitamura, P. Mokhtari, K. Honda: Vocal tract length perturbation and its application to malefemale vocal tract shape conversion, Journal of the Acoustical Society of America 121, 6 (2007) 3874–3885. https://doi.org/10.1121/1.2730743. [CrossRef] [PubMed] [Google Scholar]
 B.H. Story: Technique for tuning vocal tract area functions based on acoustic sensitivity functions, Journal of the Acoustical Society of America 119, 2 (2006) 715–718. https://doi.org/10.1121/1.2151802. [CrossRef] [PubMed] [Google Scholar]
 B.H. Story: Vowel acoustics for speaking and singing, Acta Acustica united with Acustica 90, 4 (2004) 629–640. [Google Scholar]
 M. Mrayati, R. Carré, B. Guérin: Distinctive regions and modes: a new theory of speech production, Speech Communication 7, 3 (1988) 257–286. https://doi.org/10.1016/01676393(88)900738. [CrossRef] [Google Scholar]
 R. Carré: From an acoustic tube to speech production, Speech Communication 42 (2004) 227–240. https://doi.org/10.1016/j.specom.2003.12.001. [CrossRef] [Google Scholar]
 G. Fant, S. Pauli: Spatial characteristics of vocal tract resonance modes, in: Proceedings of Speech Communication Seminar, vol. 74, Stockholm, Sweden, 1–3 August, 1975, pp. 121–132. [Google Scholar]
 M. Arnela, S. Dabbaghchian, R. Blandin, O. Guasch, O. Engwall, A. Van Hirtum, X. Pelorson: Influence of vocal tract geometry simplifications on the numerical simulation of vowel sounds, Journal of the Acoustical Society of America 140 (2016) 1707–1718. https://doi.org/10.1121/1.4962488. [CrossRef] [PubMed] [Google Scholar]
 K. Motoki: Threedimensional acoustic field in vocaltract, Acoustical Science and Technology 20 (2002) 207–212. https://doi.org/10.1250/ast.23.207. [CrossRef] [Google Scholar]
 V. Ribeiro, K. Isaieva, J. Leclere, P. Vuissoz, Y. Laprie: Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated, Speech Communication 141 (2022) 1–13. https://doi.org/10.1016/j.specom.2022.04.004. [CrossRef] [Google Scholar]
 S. Dabbaghchian, M. Arnela, O. Engwall, O. Guasch: Reconstruction of vocal tract geometries from biomechanical simulations, International Journal for Numerical Methods in Biomedical Engineering 35 (2018) e3159. https://doi.org/10.1002/cnm.3159. [Google Scholar]
 Y. Kagawa, Y. Ohtani, R. Shimoyama: Vocal tract shape identification from formant frequency spectra – a simulation using threedimensional boundary element models, Journal of Sound and Vibration 203, 4 (1997) 581–596. [CrossRef] [Google Scholar]
 D. Mohapatra, M. Fleischer, V. Zappi, P. Birkholz, S. Fels: Threedimensional finitedifference timedomain acoustic analysis of simplified vocal tract shapes, in: in Proceeding of the Interspeech 2022, Incheon, Korea, 18–22 September, 2022, pp. 764–768. https://doi.org/10.21437/Interspeech.202210649. [Google Scholar]
 M. Arnela, D. Ureña: Tuned twodimensional vocal tracts with piriform fossae for the finite element simulation of vowels, Journal of Sound and Vibration 537 (2022) 117168. https://doi.org/10.1016/117168. [CrossRef] [Google Scholar]
 T. Vampola, J. Horáček, J.G. Švec: FE Modeling of human vocal tract acoustics. Part I: production of Czech vowels, Acta Acustica united with Acustica 94 (2008) 433–447. [CrossRef] [Google Scholar]
 T. Vampola, J. Horáček, J.G. Švec: Modeling the influence of piriform sinuses and valleculae on the vocal tract resonances and antiresonances, Acta Acustica united with Acustica 101, 3 (2015) 594–602. [CrossRef] [Google Scholar]
 A.E.D.S. Antonetti, J.D.S. Vitor, M. Guzmán, C. Calvache, A.G. Brasolotto, K.C.A. Silverio: Efficacy of a semioccluded vocal tract exercisestherapeutic program in behavioral dysphonia: a randomized and blinded clinical trial, Journal of Voice 37, 2 (2023) 215–225. https://doi.org/10.1016/j.jvoice.2020.12.008. [CrossRef] [PubMed] [Google Scholar]
 M. Guzman, G. Miranda, C. Olavarria, S. Madrid, D. Muñoz, M. Leiva, L. Lopez, C. Bortnem: Computerized tomography measures during and after artificial lengthening of the vocal tract in subjects with voice disorders, Journal of Voice 31, 1 (2017) 124.e1–124.e10. https://doi.org/10.1016/j.jvoice.2016.01.003. [CrossRef] [PubMed] [Google Scholar]
 M. Saldias, M. Guzman, G. Miranda, A.M. Laukkanen: A computerized tomography study of vocal tract setting in hyperfunctional dysphonia and in belting, Journal of Voice 33, 4 (2019) 412–419. https://doi.org/10.1016/j.jvoice.2018.02.001. [CrossRef] [PubMed] [Google Scholar]
 T. Vampola, A.M. Laukkanen, J. Horáček, J.G. Švec: Vocal tract changes caused by phonation into a tube: a case study using computer tomography and finiteelement modeling, Journal of the Acoustical Society of America 129 (2011) 310–315. [CrossRef] [PubMed] [Google Scholar]
 T. Vampola, A.M. Laukkanen, J. Horáček, J.G. Švec: Finite element modelling of vocal tract changes after voice therapy, Applied and Computational Mechanics 5 (2011) 77–88. [Google Scholar]
 C. Johnson: Numerical solution of partial differential equations by the finite element method, Cambridge University Press, Cambridge, 1987. [Google Scholar]
 G. Fant, J. Liljencrants, Q. Lin: A fourparameter model of glottal flow, STLQPSR 26, 4 (1985) 1–13. [Google Scholar]
 A. D. Pierce: Acoustics: An introduction to its physical principles and applications, Springer International Publishing, Cham, 2019, ISBN 9783030112141, https://doi.org/10.1007/9783030112141. [Google Scholar]
 R.J. Guyan: Reduction of stiffness and mass matrices, American Institute of Aeronautics and Astronautics Journal 3, 2 (1965) 380. [CrossRef] [Google Scholar]
 M.A. Blair, T.S. Camino, J.M. Dickens: An iterative approach to a reduced mass matrix, in: Proceedings of the 9th International Modal Analysis Conference, Florence, Italy, 15–18 April, 1991, pp. 621–626. [Google Scholar]
 M.I. Friswell, S.D. Garvey, J.E.T. Penny: Model reduction using dynamics and iterated IRS techniques, Journal of Sound and Vibration 186, 2 (1995) 311–323. [CrossRef] [Google Scholar]
 P. Silvester: Highorder polynomial triangular finite elements for potential problems, International Journal of Engineering Science 7, 8 (1969) 849–861. [CrossRef] [Google Scholar]
 K.J. Bathe: Finite elements procedures, Prentice Hall Inc., Upper Saddle River, NJ, 1996. ISBN 0133014584. [Google Scholar]
 G.C. Diwan, M.S. Mohamed: Pollution studies for high order isogeometric analysis and finite element for acoustic problems, Computer Methods in Applied Mechanics and Engineering 350 (2019) 701–718. https://doi.org/10.1016/j.cma.2019.03.031. [CrossRef] [Google Scholar]
 T. Ikävalko, A.M. Laukkanen, A. McAllister, R. Eklund, E. Lammentausta, M. Leppävuori, M.T. Nieminen: Three professional singers’ vocal tract dimensions in operatic singing, kulning, and edge – a multiple case study examining loud singing, Journal of Voice 38, 5 (2024) 1253.e11–1253.e27. https://doi.org/10.1016/j.jvoice.2022.01.024. [CrossRef] [PubMed] [Google Scholar]
 B. Story: The vocal tract in singing, in: G. Welch, D. Howard, J. Nix (Eds.), The Oxford handbook of singing, Oxford University Press, Oxford, 2016, pp. 1–21. https://doi.org/10.1093/oxfordhb/9780199660773.013.012. [Google Scholar]
 E. Yanagisawa, J. Estill, S. Kmucha, S. Leder: The contribution of aryepiglottic constriction to “ringing” voice quality – a videolaryngoscopic study with acoustic analysis, Journal of Voice 3, 4 (1989) 342–350. [CrossRef] [Google Scholar]
Cite this article as: Štorkán J. & Vampola T. 2024. Design of a computational method to optimise acoustic output of the human vocal tract. Acta Acustica, 8, 43.
All Tables
All Figures
Figure 1 FEM mesh for vowel [a:]. 

In the text 
Figure 2 Left: LF pulse excitation for the time derivative of the glottal volume velocity used for modelling the phonation. Right: Corresponding integral (glottal airflow volume velocity pulse). 

In the text 
Figure 3 Vocal tract for vowel [a:] divided into subsystems. 

In the text 
Figure 4 Relationship between relative error and reduction rate. 

In the text 
Figure 5 Relationship between relative error and number of IRS iterations. 

In the text 
Figure 6 Nodes of a complete model assembled from superelements. 

In the text 
Figure 7 Derived isoparametric element. 

In the text 
Figure 8 Left: Fullplanar 6th order element. Right: Definition of surface coordinates. 

In the text 
Figure 9 Integration points in a triangular base. 

In the text 
Figure 10 Conditionality of elements in circular geometry. 

In the text 
Figure 11 Conditionality of elements in nonconvex geometry. 

In the text 
Figure 12 Visualisation of the values of the derived element mass (left) and stiffness (right) matrices. The diagonal elements of the central nodes are highlighted. 

In the text 
Figure 13 The finite element mesh of the vocal tract for vowel [a:] assembled from derived isoparametric elements. 

In the text 
Figure 14 Acoustic pressure of the first 4 longitudinal oscillation shapes F_{1} = 530.4 [Hz], F_{2} = 1146.6 [Hz], F_{3} = 3061.2 [Hz], F_{4} = 3876.1 [Hz]. 

In the text 
Figure 15 Acoustic pressure of the transverse oscillation shape F = 4261.9 [Hz]. 

In the text 
Figure 16 Transmission characteristics between input excitation and emitted acoustic pressure. Top plot shows the amplitude characteristics and bottom plot shows phase characteristics. 

In the text 
Figure 17 Geometric modifications of the epilarynx used for improving transmission characteristic of the HVT. 

In the text 
Figure 18 Position of formants depending on the vocal tract crosssection and the piriform sinuses ratio. 

In the text 
Figure 19 Comparison of the transmission characteristic reference model and the transmission characteristics optimised model. Top plot shows the amplitude characteristics and bottom plot shows phase characteristics. 

In the text 
Figure 20 Region of the singer’s formant in the transmission characteristic. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.