Overview on state-of-the-art numerical modeling of the phonation process

– Numerical modeling of the human phonatory process has become more and more in focus during the last two decades. The increase in computational power and the use of high-performance computation (HPC) yielded more complex models being closer to the actual ﬂ uid-structure-acoustic interaction (FSAI) within the human phonatory process. However, several different simulation approaches with varying mathematical complexity and focus on certain parts of the phonatory process exist. Currently, models are suggested based on ordinary differential equations (reduced order models) but also on partial differential equations based on continuum mechanics as e


Introduction
Speech and the underlying phonation and voice production are fundamental prerequisites for a functioning social life and societies.However, many underlying biomechanical aspects are not fully understood, especially the fluid-structure-acoustic-interaction (FSAI) of the phonation aspects.In phonation, the FSAI is composed of the viscous airflow (fluid), the myoelastic motion of the vocal fold tissue (structure), and the production of the acoustic voice signal (acoustics), see Figure 1.The most natural examination approach would be human in-vivo studies, however due to the restricted access to the human deep respiratory tract, the interrelations between airflow, vocal fold dynamics and resulting acoustics in the larynx cannot be investigated in humans.Hence, numerical models have been suggested going back to the 1960s.Since then, increased biomechanical knowledge on the phonatory process and computational power enabled to develop numerical models, simulating the phonatory process, which become more and more realistic.
For simulating the phonation process, several approaches are suggested providing complementary infor-mation.In general, these models assemble of simple reduced order models and highly resolved continuum based models: Reduced order models: Models with high computational efficiency due to the simplification of the glottal flow and/or vocal fold mechanics.These models allow large-scale, parametric investigations of the phonatory physics which are otherwise impractical in fullyresolved phonation models.Computational Fluid Dynamics (CFD): Concentration on the occurring flow dynamics based on continuum mechanics.These models mainly focus on the airflow characteristics for different glottal and vocal tract configurations and use static or prescribed vocal fold oscillations.Commonly, these models are discretized by numerical methods as the Finite-Volume-Method (FVM), in rare cases also by Finite-Difference-Method (FDM) or the Finite-Element-Method (FEM).Fluid-Structure Interaction (FSI): High-end computational models that consider the interaction between vocal folds biomechanics commonly simulated by FEM and airflow.FSAI: High-end computational models that additionally allow to compute the acoustic source terms and the sound field.Hereby, FEM or FVM are the common methods to simulate the sound generation and propagation.Models combined with machine learning methods: To increase computational efficiency of highly spatially and temporally resolved models to estimate underlying biomechanical model parameters.
However, specific characteristics of current continuum based models as the direct vocal fold contact or an individualized geometry are still topics of basic research.As a consequence, those model are often highly computationally expensive (i.e., 3D-FEM or FVM methods).In 2011, a comprehensive review of numerical models, based on methodology, of human phonation was published by Alipour et al. [1].Hence, this work will focus and provide an overview on computational models published between 2011 and 2022.However, since then, other reviews have been published that focused more on the voice physiology components but not on the mathematical approaches of the simulations [2,3], experimental and numerical modeling [4], and mechanical characterization of vocal fold tissue, being highly important for input in numerical simulations [5].
2 Pure CFD models without FSI Historically, the starting point for the numerical modeling of the phonation process was either the pure simulation of the laryngeal fluid dynamics or the pure simulation of the vocal folds vibration based on spring-mass elements with the simplest aerodynamic excitation functions, i.e. step functions of subglottal pressure.As the mass-spring model also constitutes the simplest strategy to model fluid-structure interaction, those models are discussed in the according section further below.In this section, the focus lies on models that purely simulate the fluid dynamics in the larynx.Thereby, those models do not only include static geometric boundary conditions but also models with driven or imposed vocal fold motion.Especially models with imposed vocal fold motion are of high potential with regard to the cost-benefit efficiency.A summary of studies discussed here is displayed in Table 1 for static vocal fold models and in Table 2 for models with vibrating vocal folds with imposed or driven motion pattern.

Fluid flow modeling
In all but the very simplified, reduced-order models of phonation, the fluid flow is described by Navier-Stokes equations (NSE) for viscous Newtonian fluid.Some studies [6][7][8][9][10] use the full NSE for the compressible fluid, whose continuous form captures both fluid dynamics and acoustic perturbations.Since the Mach number in glottal flow does not exceed M = 0.3, incompressible NSE used e.g. in most studies represent a reasonable choice, too, and result in lower model complexity and computational costs.The incompressible model cannot capture acoustic waves.However, decoupled aeroacoustic simulations can be still run on top of incompressible simulation results.
The question whether compressible flow effects have a large impact on the simulations results has not been clarified yet.De Luzan et al. [11] found only negligible differences between the compressible and incompressible flow field in their 3D static vocal folds model.In contrast, Hájek et al. [12] compared both fluid properties by taking into account the complete fluid-structure interaction and found reasonable differences in the vocal folds motion pattern and therewith in the flow field.However, they used a 2D model which implicitly means to have constant pressure conditions in longitudinal direction of the vocal folds at every time step.Therefore, it is questionable, whether the model overestimates the impact of acoustics on flow and vocal fold dynamics.
The airflow in the trachea is usually laminar, but is dominated by complex turbulent structures in the supraglottal region.Numerical simulation of the highly unsteady turbulent airflow is a challenging issue.Laminar models (i.e.no turbulence modeling) introduce inaccuracy, since they neglect turbulent momentum transfer.Reynolds-Averaged (RANS) models face difficulties for massively separated flows, and are inappropriate for aeroacoustic simulations because they provide only mean flow solution with turbulent fluctuations averaged out.Since Direct Numerical Simulations (DNS) are unfeasible due to prohibitive computational cost, the most promising approach is Large Eddy Simulations (LES), where large turbulent scales are resolved and small scales modeled.

Vocal fold models
The geometry of the computational models is manifold depending on the focus of the respective study.The variations incorporate different shapes of the vocal folds, including or omitting the ventricular folds (also called false vocal folds) and the vocal tract.Regarding the vocal fold shape, most studies used the so-called "M5 model geometry" shown in Figure 2 which was first introduced as experimental static model by Scherer et al. [13].Although not officially defined as benchmark model, the M5 model became widely accepted for experimental and computational studies including static [14][15][16][17][18][19][20] as well as vibrating vocal folds of all types, i.e. driven dynamics [21][22][23][24][25][26][27][28][29][30].Its big advantage is possibility of the parametric variation of the model contour for representing different shapes of the glottal duct.Thus, several computational studies investigated the flow through static vocal folds at different instances during the oscillation cycle represented by different convergent or divergent transglottal shapes of the coronal glottal duct [14-17, 19, 20].Most of these models are quasi-2D with a uniform rectangular glottis and without ventricular folds and vocal tract except the model by de Luzan et al. [20] who performed simulations with 3D M5 models with uniform or divergent glottal duct.Beside the M5 model, other customized and/or physiologically based model types [11,31] as well as those with a much simpler geometry of the vocal folds as e.g.half-cylinders [32,33] have been applied.
For models with predefined vocal fold motion, the M5 model geometry also exhibited a large spreading.Except for Zheng et al. [21], all M5 based models are 3D models with rectangular [22,24] or semi-elliptical glottis shape [25][26][27][28].Other studies used also simplified vocal fold shapes [34] or more realistic but still simplified geometry as in Zörner et al. [35] being realized in a 2D model.

Imposed vibration
There are two groups with imposed vibration of the vocal folds, those with purely medial-lateral oscillations [21,22,25] and those with additional rotation motion of the medial surface to reproduce the convergent-to-divergent change of the glottal duct [24][25][26][27][28][29][30][34][35][36] during an oscillation cycle which is characteristic for the mucosal wave-like motion of human vocal folds.The numerical realization of the vocal folds motion and the handling of the numerical mesh is only rarely described, e.g. for Zheng et al. [21] being the Immersed Boundary Method (IBM) and for Sadeghi and Falk with colleagues [25][26][27][28] being the Overset Mesh Method (OVM) implemented in the commercial CFD solver STAR-CCM + (Siemens AG) [37].For the other studies, the applied numerical methods to realize the vocal fold motion within the numerical mesh are not explicitly described.

Supraglottal geometry: Ventricular folds and vocal tract
Besides the geometry of the immediate glottal region with the two vocal folds, the upstream and especially downstream regions of the computational models are essential parts to study the laryngeal aerodynamics.Especially the ventricular folds immediately downstream of the vocal folds have been of special interest in computational studies [20,27,33].In contrast to the vocal fold geometry, the shape of the ventricular folds is less standardized sometimes with half-circular shape [32,33], some circular with divergent expansion [23,24] and some transferred from the M5 model geometry [25,27,28].The important geometric parameters in this context and thus focus of the related studies [11,20,32,33] are the length of the ventricles being the space between vocal and ventricular folds and the gap between the ventricular folds.Also the pure aerodynamic influence of the ventricular folds were in the focus of computational studies [24,27].
Including the remaining vocal tract up the mouth is of particular interest with regard to the sound generation.Thus, only few larynx models involved a complete vocal tract model in simplified straight form [9,10,[28][29][30].

Study
The impermeable walls in all presented larynx models were characterized as no-slip boundary conditions at which the flow velocity is either zero or equal to the velocity of the respective wall being the case at the surfaces of the vocal folds.Thereby, pure medial-lateral [21,22] or a superimposed medial-lateral and rotatory vibration pattern [24,25,34] was used to move the vocal folds.Another approach was to apply the motion pattern of the vocal folds transfered from simulation with FSI to evaluate the validity of the flow field [35].
Computationally, the adaption of the numerical mesh around the vocal folds is realized if acknowledged in the studies by IBM [21] or the OVM [25,28] which is a combination of a fixed background mesh and a small deformable and highly resolved mesh around the vocal folds [38][39][40].
The flow through the glottis was mostly forced by a constant pressure gradient between the inlet and the outlet boundary with absolute values between 300 Pa and 2.45 kPa as shown in Tables 1 and 2. Other models used constant flow velocity or flow rate inlet in combination with constant pressure outlet boundary conditions [31][32][33].In one case, de Luzan et al. [20] applied an unsteady pressure gradient with static vocal folds to generate a flow field representing different instances during an oscillation cycle of the vocal folds.

Model verification
In general, there are two steps that have to be performed to show the validity of computational models based on FDM, FVM and FEM.
1. Grid independence study: Different numerical meshes of the simulation case with different spatial mesh resolutions are applied and selected flow parameters of the simulated physical field (flow, structure or acoustics) are computed.The goal of the grid study is to find the grid with the lowest resolution (number of grid points or volumes) that produces similar results compared to grids with higher resolution.This ensures physically correct flow field solutions with the smallest simulation wall-time.

Experimental validation: Physical parameters
extracted from simulations and equivalent experiments at the same location within the physical field are compared in temporal and/or spectral representation to ensure that the simulation case was modeled correctly.
The verification information in the discussed studies are summarized in Tables 3 and 4. Whereas the grid study is mandatory for every simulation and explicitly described in the summarized studies, the experimental validation is difficult in case of the human phonation.The reason is the inaccessibility of the respiratory tract for experimental sensors in living persons without anesthesia.Thus, the computational models summarized here are based on simplified geometrical models, mostly M5 or similar simplified shapes.Furthermore, some studies explicitly include accompanying experimental data to validate the computational results [14,26].

Topics of pure CFD models
The static models are based on the quasi-steady approximation [41,42] that describes the pulsatile flow field as a series of steady and fully developed flow fields.Thus the main topic of the studies using static vocal fold models is the characteristics of the laryngeal flow field for different glottal duct shapes representing different instances during an oscillation cycle of the vocal folds.The second topic is the influence of the ventricular folds on the flow field with regard to interaction the shear layer vertices from the glottal jet with the ventricular folds and the glottal flow resistance [9,10,20,32,33].
All studies with imposed vocal folds motion focus on the unsteady flow field development downstream of the vocal folds.More specific topics are the differences of results pro- duced by 2D and 3D models [34], variations of the motion characteristics (pure lateral vs. lateral-rotatory motion of the medial surface) [25], the differences in glottal jet evolution for different glottis shapes [24] and again the influence of the ventricular folds on the phonation process [27].Additionally, differences of results of FSI and imposed oscillation models with identical motion patterns have been analyzed [35] as well as the computational accuracy and effort of those models for a potential clinical application [25].

Acoustics
From the 26 studies included in Tables 3 and 4 with pure CFD or CFD with predefined vocal fold oscillation, 13 studies additionally simulated and analyzed the acoustic outcome (listed in Tab. 5).These flow-induced voice generation models are discussed with special attention to their reproducibility and validity.Furthermore, we cluster studies that explicitly build upon each other (see Tab. 5).In doing so, the studies of [43][44][45][46] are closely related to each other and are e.g.discussed by the study of [28].Schoder et al. [43] validated the aeroacoustic model and [46] discussed the filtering of flow-induced sound sources.In [44], the perturbed convective wave equation (PCWE) source term is evaluated in detail.The studies [22,23,35] identify a second cluster and are discussed by considering the study [24].Both study clusters used the FEM solver openCFS to obtain the acoustic solution [47].

Geometries
The analyzed studies (see Tab. 5) either used vocal tract geometry tuned to the formants of a vowel or an MRI-based (magnetic resonance imaging) geometry [10].All studies omitted the vocal fold motion during the acoustic propagation simulation.To conclude, the studies considered aimed for a realistic representation of the geometrical details in 3D or the formants of the respective vowels investigated.

Aeroacoustic models
The studies used different aeroacoustic models (see Tab. 5), with a strong tendency to use a viscous acoustic splitting technique.In [24] and the related studies, Lighthill's equation [48] and the acoustic perturbation equations (APE-2) [49] were applied and compared using the finite element method.The source terms of both theories were visualized to evaluate the differences and to get more insight.In [9], the acoustic radiation for different Finnish vowels ([a], [e], [i], [o], and [u]) was computed directly by considering a compressible fluid flow in the CFD simulation using the compressible form of the Navier-Stokes equations solved by the finite volume method.When resolving compressible flow directly with a CFD simulation, all interactions of the acoustic mode and the fluid dynamics directly captured.In contrast to a hybrid aeroacoustic workflow, the compressible CFD simulation requires more computational processing resources.Regarding Powell's manipulation of Lighthill's source term, [10] visualized the Lamb vector inside the upper airways.Finally, [28][29][30] computed the acoustic field based on the PCWE theory, which allows distinguishing between flow and acoustics inside the flow field (according to the finding in [50]).The PCWE model is an exact reformulation of the APE-2 [49] and was discretized by the finite element method.Since the original PCWE needs a definition of the mean flow field from a given simulation, the PCWE equation cannot be coupled in parallel to a flow simulation.Both the wave operator and the source term require the flow field during the whole simulation time of the acoustic simulation in order to pre- compute the source term and the mean flow.Recently, the PCWE theory was extended in [51] to account for moving geometries conveniently and depend solely on the instantaneous incompressible flow field.Similar to the reformulation of the PCWE, the LPCE was reformulated into a convective wave equation depending on instantaneous incompressible flow field in [52].Another benefit of this new model is that it can easily be coupled to surface motion.In contrast, to the recently applied aeroacoustic model using the finite element method, integral methods and the boundary element method have been used to model the far-field propagation as has been summarized in [1].

Material and boundary conditions
The material and boundary condition data is summarized in Table 6.All studies used acoustic sound-hard walls as boundary condition for the tissue, a reasonable assumption regarding the huge impedance change from the air to the tissue.For both, the boundary to the lungs and to the free-field in front of the mouth non-reflecting boundary conditions are used (e.g., perfectly matched layer or absorbing boundary condition).An acoustic analogy was used in [10], which incorporates the free-field condition using Green's function.Furthermore, the acoustic boundary condition towards the lower airways is not given explicitly in [9].The temperature affecting the speed of sound was typically chosen in the ambient range, significantly lower than the body temperature.This can be improved with wave equations accounting for a variable temperature.

Model verification
The authors analyzed the examined human phonation models in detail.They compared the results to literature findings and a grid study was conducted for the numerical methods.In addition to that, the flow and acoustic models of [28] were validated by a synthetic experimental setup with satisfactory agreement (see Fig. 3).The few validated cases show a need for accessible experimental data in human voice production (Tab.7).

Variation of model parameters
The studies related to [24] provided much numerical groundwork for hybrid aeroacoustic workflow, being later used in [28].Zörner et al. [24] studied different vocal fold oscillation types, which were recently investigated systematically in [28].The work in [29,30] evaluated the influence of different LES subgrid models on the acoustic radiation.Both studies [9,10] showed the applicability towards realistic vocal tract geometries and the changes when considering different vowels ( [a], [e], [i], [o], and [u]).In [44], the aeroacoustic source term of the PCWE was analyzed.A compensation effect of the individual source terms was found and later explained in [53].

FSI in phonation
The interaction between the airflow and the elastic structure of the vocal folds lies at the core of the phonation process.Historically, the first computational models of phonation tried to consider the coupling between these two physical domains, disregarding both the interaction between the acoustics and structure, and the delicate interrelation between the fluid dynamics and acoustics.The computational studies of this type published up to 2011 are well summarized in the review of Alipour et al. [1].
In the recent ten years, it has become more and more evident that the assumption of decoupled acoustics seriously impairs the accuracy and applicability of computational models of phonation.Both experimental and numerical evidence suggests that during phonation, the interaction with the acoustic field significantly affects vocal fold oscillation and should not be neglected.A notable example is the subglottal and supraglottal acoustic resonance, which can considerably influence vocal fold self-oscillation.Consequently, Table 6.An overview of the acoustic boundary conditions and the flow temperature.Abbreviation: ABC = absorbing boundary condition, PML = perfectly matched layer.

ID Lungs
Free-field T/°C [24] PML PML [28] ABC/PML PML 20 [9] ABC 27 [10] 2 7 [29,30] PML PML 20 in the last decade the studies focusing purely on fluid-structure interaction (FSI) are becoming rather rare.Yet, in human phonation the coupling between the fluid and structural fields is by far the strongest interaction and its modeling deserves significant attention.
When the vocal folds are adducted to phonation position and exposed to expiratory airflow, the elastic structure is subject to unsteady aerodynamic forces, which can lead to flow-induced vocal fold oscillations.As the vocal folds move, they change the geometry of the channel and influence back the airflow.In the case of modal phonation, this effect is very strong, since the vocal folds collide and completely close the channel for a considerable portion of the oscillation period.A further important characteristic is the energy transfer: for stable oscillations, there must be positive net energy transfer from the airflow to the structure over one oscillation cycle.

Structural modeling
Structural models of the vocal folds usually include up to four different tissue layers.The simplest models, e.g.[7,54], use a single isotropic layer.Models trying to capture at least the basic physiology [55,56] include two isotropic or transverse isotropic layers (body and cover), more complex models [6,57] even four layers: body, ligament, superficial lamina propria and epithelium.
The constitutive relation between the stress r and strain e is either linear elastic r = E, hyperelastic (calculated from the strain energy density function), or viscoelastic r ¼ E þ g_ e, where E is the Young modulus and g viscosity.The linear elastic and hyperelastic models can include optional Rayleigh damping B ¼ c 1 M þ c 2 K introduced on the level of mass and stiffness matrices M and K, with constants c 1 and c 2 tuned from experiments.

FSI modeling
From the modeling perspective, FSI poses serious challenges.First, the information between the fluid and solid domains (aerodynamic forces and new geometry of the domain boundary) has to be exchanged in each time-step of the simulation.Theoretically, in a monolithic approach the FSI problem could be discretized into a single matrix.However, this method is very rarely used, since it results in huge ill-conditioned linear systems.Moreover, the fluid and structural domains are often discretized using different numerical methodsthe structural dynamics are almost exclusively solved by the Finite Element Method (FEM), while the primary choice for the fluid dynamics is the Finite Volume Method (FVM).Thus, the FSI in phonation is almost always solved by a coupling algorithm, where each domain is solved separately and the boundary conditions on the common interface are exchanged iteratively.When the time-step of the simulation is sufficiently low, the coupled approach offers acceptable accuracy, lower model complexity and computational costs.
The second challenge is the fact that both the structural and fluid computational domains are time-dependent, since they deform due to vocal fold motion.The first possible approach is the usage of moving computational meshes, conforming to the interface (vocal fold surface).In this case, the equations are usually rewritten in the Arbitrary Lagrangian-Eulerian (ALE) approach, which uses a mapping between the reference (fixed) and current (moving) domains [38,58].ALE equations are commonly implemented both in commercial and open-source codes, and preserve the order of accuracy and convergence properties of the numerical methods on static meshes.However, in the case of phonation modeling, the ALE approach faces problems in the glottal region.As the vocal folds approach and glottis gets narrow (or even closes completely), the ALE mapping often produces highly distorted or even inverted mesh elements and causes serious convergence issues.
The second approach, which was pioneered by Mittal, Luo, Seo et al. [59][60][61], is the Immersed Boundary Method (IBM).Instead of using complex moving body-fitted meshes, the simulations are run on fixed and simple Cartesian grids, which do not conform to the vocal fold surface (see Fig. 4).This greatly simplifies the task of grid generation and avoids problems with mesh deformation.On the other hand, the difficult task in IBM is tracking of the solid interface and treatment of the boundary conditions at the vocal fold surface.Imposing these boundary conditions is not straightforward, can decrease the order of accuracy and cause parallelization and performance issues.
The third possible method on how to deal with moving objects or channel walls is the OVM approach, as described in Section 2.3.None of the FSI studies listed in Table 8 used this method.Only studies with predefined vocal fold dynamics applied OVM recently, e.g.[28,44] as described in Section 2.

Overview of the FSI models
The studies published between 2011 and 2022 focusing on fluid-structure interaction in human phonation are listed in Table 8.Note that only models with full two-way FSI, which do not solve for acoustics, are included.The oneway FSI models (with forced vocal fold motion) are commented in Section 2, and the FSI studies including the acoustic interactions are described in Section 4. Surprisingly, with the exception of [56], all the other models of this class published between 2011 and 2022 are only two-dimensional and the geometry of the larynx is highly simplifiedthere is no attempt to model the ventricular folds and the vocal tract is a straight channel.Also, none of the authors tries to model the turbulence in the supraglottal region, all simulations are laminar.The fluid meshes are relatively coarse, even for 2D simulations.Clearly, these studies do not aim to model the phonation in the most accurate way.Instead, they focus on the modeling of the fluid-structure interaction itself, and try to investigate the effect of the model parameters on the flow-induced vibration.
The most important boundary condition for the FSI problem is the condition for the fluid flow at inlet of the flow domain.Most of the studies use constant pressure, ranging between 600 Pa [6, 57] and 1.8 kPa [56].Feistauer et al. [7] selected a less physiologically relevant condition of constant velocity at inlet.In [54], both prescribed velocity and constant pressure are tested, and the influence of the boundary condition on flow-induced oscillation is investigated.
About half of the studies, i.e. [6,8,56], report on grid dependence tests, the others do not.Verification and validation in phonation modeling is a tough issue, since benchmark problems are nonexistent and comparison with in-vivo experiments is highly complicated by inter-subject variability and poor repeatability of measurements.There are some experimental data of synthetic larynx models, though, but these models are 3D models which makes a quantitative comparison of the similated data difficult.Thus, the only paper which tried to validate the results of numerical simulations with experimental data is the study of Jiang et al. [56], which is also the only 3D study in the FSI modeling group.

Fluid-Structure-Acoustic Interaction (FSAI) modeling
A logical next step in modeling the human phonation process includes an acoustic model in the incompressible fluid-structure-interaction simulation or investigates a compressible fluid-structure-interaction simulation leading to a so-called Fluid-Structure-Acoustic Interaction (FSAI) model.After some contributions from 2005 until the early years of the last decade, a few contributions dealt with FSAI in human phonation.A sample of six publications closely related to each other is discussed in this section.Furthermore, the literature study on FSAI showed that several publications with strongly misleading FSAI-claiming titles occurred in the last ten years.Table 9 shows a summary of recent FSAI models in the field of human voice production and compares the development to previous studies [62,63].The study [64] is an extension of the study [63].The other studies build up on the knowledge of [62][63][64] and further develop the capabilities of the immersed-boundary FEM for human phonation applications.

Geometries
The 3D studies jointly aimed for a realistic description of the upper human airways (see Fig. 5).The 2D studies aim to investigate the underlying computational models and limit themselves to a more simplistic geometry.

Study [ID]
Year Table 10 reports the geometrical properties of the reported models.Recent studies are directed toward a realistic description of human physiology, including first attempts modeling the contact using a small artificial gap between the closing vocal folds.The 3D studies used a curved vocal tract based on an in vivo-based "neutral" vowel model [65].It was superimposed onto a realistic airway center line from the in vivo MRI measurement [66].

Models and material properties
The investigated studies modeled the solid mechanics, fluid dynamics, and acoustics with slightly different numerical techniques and material parameters; both are discussed in this section.For the numerical discretization, FEM is used and the studies aiming for realistic 3D geometry combined it with IBM.The vocal folds are modeled by a multilayered structure with the material parameters given for the layers in Table 11.In [64], the tissue is modeled by a hyperelastic Ogden model.The density, Poisson's ratio, and bulk modulus values of both layers were q = 1070 kg/m 3 , m p = 0.49, and 100 kPa, respectively.Damping was simulated using a Rayleigh damping scheme with coefficients a = 24.1915and = 0.000127.The vocal fold tissue was modeled as viscoelastic, transversely isotropic material in the other four studies [67][68][69].A constant density q = 1043 kg/m 3 , in-plane transversal and longitudinal Poisson ratio (m p = 0.9, m pz = 0) inside the layers is used.Experimental estimates for these simulation parameters can be found in [1].A four layer model was considered [12].In addition to the body, cover and ligament and epithelium layer with a layer thickness of 0.05 mm was used.The stiffness parameters of the epithelium are the following E p = 25 kPa and m p = 0.49.Study [70] is not covered in is not covered in Table 11 since the vocal fold material was modeled by a three-mass model.
The fluid parameters are defined by the air temperature given in Table 12.In the studies of [67], the viscosity is artificially reduced by 1/4 to reduce the computational costs of the turbulent structures arising.The flow is typically modeled as viscous incompressible flow of a Newtonian fluid, with the material parameters adapted to air.The coupling to the acoustic is performed with a Linearized Perturbed Compressible Equations model (LPCE) in four studies and with a weakly compressible simulation in one of the studies.As recently shown [51], a single scalar wave equation could replace the four equations LPCE model without further restrictions.Furthermore, no study used a direct numerical simulation or LES to describe the turbulent scales realistically.Finally, all studies aim to investigate the relevance of the acoustic feedback concerning the generated voice signal.

Applied loads
As depicted in Table 12, the vocal fold motion was driven by the inlet pressure throughout the selected studies.The range varied between 250 Pa and 1000 Pa.

Boundary conditions
The boundary conditions of the solid mechanic equations are a fixation (zero displacements) of the vocal folds at the cartilage and a full coupling condition of the velocity and the stresses at the interface to the fluid.The contact is modeled artificially by a contact plane and a residual gap.A pressure inlet and outlet condition drive the flow, and a nopenetration and a no-slip condition are applied to the walls.For the acoustics equation in four studies [67][68][69][70], soundhard wall boundary conditions are applied at the tissue walls.A total reflecting boundary condition models the free field radiation.Towards the lower airways, a Dirichlet boundary condition was described in [67,68] or an anechoic termination was used in [69,70].

Model verification
An overview about the model verification of FSAI models is shown in Table 13.The study [12,64] compared the results to results from literature as the other investigation analyzed here.Furthermore, the model verification was performed by refining the grid and time step sizes.The computational models used in [67][68][69] build upon established FSI models from previous years and the experience gained there.In [70], a grid independence study was carried out.Additionally, the ressults of the applied methods were benchmarked with a proprietary solver.

Variation of model parameters
In [64], the impact of a varying length of the subglottal tract and the variations induced by modeling the fluid as slightly incompressible are investigated.The results of this study are promising, and the applicability to 3D and realistic geometries should be investigated in the future.In [12], an incompressible and compressible flow simulation are applied to model the human phonation process.The study [68] investigated a varying longitudinal cover and ligament layers thickness.It was found that the varied thickness resulted in up to 24% stiffness reduction in the middle and up to 47% stiffness increase near the anterior and posterior ends of the vocal fold.However, the average stiffness is not affected.A longitudinal layer thickness variation result in a more energy-efficient vibration than the vibra- tions with a constant layer thickness.In [69], subglottic stenosis's effect on glottal flow dynamics, vocal fold vibration, and acoustics during voice production was investigated.The results show that subglottic stenosis affects voice production only when its severity is beyond a threshold.For the glottal flow rate and acoustics, this threshold is 75%, and at 90% for the vocal fold vibrations.

Reduced-order models of phonation
Fully resolving the complex physics involved in human voice production in three-dimensions is computationally challenging and expensive, as shown in the sections above.
The high computational costs make it impractical to use these computational models for extensive parametric investigation of the physics of voice production or practical applications.Thus, an important goal of voice research is to develop reduced-order models of phonation that are computationally efficient enough to allow large-scale parametric voice simulations for e.g.parameter estimation [71,72] or predicting clinical intervention outcomes.Since communication is the main interest of voice production, reducedorder models in principle can be developed such that perceptually relevant physiologic features are sufficiently represented and features of minimum perceptual relevance are simplified [2], thus improving computational efficiency.Such simplification can be applied to both the fluid and structure sides.
On the fluid side, while recent experimental studies showed that the glottal flow is highly three-dimensional and exhibits many complex phenomena such as flow separation, intraglottal vortex generation, shear layer instabilities, and transition to turbulence [60,[73][74][75], the acoustic and perceptual relevance of these three-dimensional flow features, particularly to the harmonic component of voiced speech, remains unclear.In an experimental study, Zhang and Neubauer [76] quantified the acoustic relevance of supraglottal vortical structures by experimentally disturbing the supraglottal flow field and observing its impact [ 91] Curved [65] 307 Cylindrical cs [68] [ 91] Curved [65] 307 Cylindrical cs [70] [ 118] Pipe 200 20 mm width [69] [ 91] Curved [65] 340 Cylindrical cs [12] [ 13] /i/ 250 - on the produced voice.They showed that alterations in the supraglottal flow field produced no significant changes in the produced sound, suggesting that the three-dimensional supraglottal flow may be simplified in phonation models.
The relevance of intraglottal vortices to voice production was also questioned and found to be small [77][78][79], especially for subglottal pressures typical of normal-intensity phonation.While more studies will be needed to establish the relevance of three-dimensional flow features to voice production, these studies suggest that many three-dimensional flow features may be simplified to some degree to improve computational efficiency of phonation models.
In early models of phonation such as the two-mass and three-mass models, these three-dimensional flow features are often neglected and the glottal flow is simplified to be one-dimensional (1D).Many studies have aimed to quantify the accuracy of one-dimensional flow models by comparing to either simulations solving the Navier-Stokes equations or experiment.Decker and Thomson [63] compared 1D flow models to a 2D Navier-Stokes flow model, and their results showed that the 1D flow models were able to predict the flow rate, intraglottal pressure, and glottal width with reasonable accuracy, particularly at conditions of small glottal width.At large glottal width, 1D models overestimated the flow rate.In a series of studies, Luo and colleagues [80][81][82] proposed improvements to the 1D flow model, and showed that when coupled to a 3D vocal fold model, these 1D flow models were able to predict phonation frequency, vibration amplitude, and vertical phase difference in vocal fold vibration with reasonable agreement with those from 3D flow models.Yoshinaga et al. [83] evaluated the accuracy of 1D flow models in left-right asymmetric vocal fold conditions, and showed that the 1D flow model was able to predict vocal fold vibratory patterns and selected voice outcome measures with reasonable agreement with the predictions from the three-dimensional Navier-Stokes-based flow model (Fig. 6).An interesting finding of the study by Yoshinaga et al. [83] is that such agreement was reached despite differences in the predicted flow pressure on vocal fold surface, indicating that vocal fold properties play a larger role than the glottal flow in determining the overall pattern of vocal fold vibration and the produced voice.
Studies also showed that predictions from phonation models using a low-dimensional flow model compared well with experiments, despite the many complex flow phenomena being neglected in these models [84][85][86].
On the structural side, the vocal folds are known to exhibit nonlinear, anisotropic, viscoelastic mechanical behavior.Although there have been recent efforts toward developing structurally-based constitutive models [5,[87][88][89][90], implementation of these constitutive models is computationally expensive, especially in three-dimensional continuum models.As a result, they have yet to be used in phonation models.In phonation models, the vocal folds have been modeled as hyperelastic, often isotropic, materials (e.g., [6,86]).The computational cost is generally high, particularly when a large-deformation, large-strain formulation is used.
One simplification often made to improve computational speed is to simplify the vocal folds as linear elastic materials.It is generally understood that the elastic moduli in these models should be interpreted as the tangent elastic moduli around a specific deformation state of the vocal fold, which would vary with vocal fold posturing.Thus, the effect of material nonlinearity, often due to changes in vocal fold posturing, can be investigated by parametric variations in the elastic moduli and vocal fold geometry in these  models.The linear elasticity simplification has been widely used in phonation models (e.g., [62,[91][92][93]).
Another simplification often made in vocal fold models is to assume small-strain deformation of the vocal folds, which neglects geometric nonlinearity in the vocal folds.However, Chang et al. [94] showed that while this simplification was able to predict qualitative trends of phonation, neglecting geometric nonlinearity led to significant differences in the glottal gap width, flow rate, and impact stress, as errors due to the small-strain simplification was amplified by vocal fold contact and fluid-structure interaction.
Further improvement in computational efficiency on the structural side can be achieved by reducing the order of the system governing equations of the vocal folds.Zhang [95,96] proposed an eigenmode-based reduced-order model, in which the system governing equations of the vocal fold are projected onto the space spanned by the in vacuo eigenmodes of the vocal folds.He showed that the model was able to predict phonation frequency, sound pressure level, closed quotient of vocal fold vibration, and the amplitude difference between the first and second harmonics (H1-H2) in the output voice spectrum with reasonable accuracy with the use of the first 100 vocal fold in vacuo eigenmodes [96].This number is significantly smaller than the degrees of freedom in typical finite element models of phonation which is often in the order of tens of thousands, thus significantly improving computational efficiency.With the improved computational efficiency, this reduced-order model has been used in a series of large-scale, three-dimensional, parametric studies, using either simplified [97][98][99] or MRI-based realistic vocal fold geometry [93,100].The large number of conditions investigated (about 200,000 vocal fold conditions) allowed for the first time a systematic investigation of the global cause-effect relationship between vocal fold physiology (vocal fold geometry, stiffness, and subglottal pressure) and voice outcomes (vocal fold vibration, glottal flow, and voice acoustics) in a large range of vocal conditions.
Due to challenges in experimentally measuring material properties and geometry of the vocal folds in human or animal models, computational models of phonation are often qualitatively validated by their ability to produce voice measures (e.g., fundamental frequency, vocal intensity, and various flow-based measures) that are within typical human range; e.g., [56,81,82,86,101,102].However, the comparison was often qualitative or limited to a small number of voice conditions.Due to the experimental challenges, fully-resolved fluid-structure-acoustics methods as discussed in earlier sections will likely play an important role in the development and validation of reduced-order models of phonation, by systematically understanding the acoustic and perceptual relevance of different physical components and gradually integrating them into reduced-order models.

Numerical modeling and machine learning
As described above, developing and applying numerical simulation models enables the analysis of fundamental biomechanical behaviour of the phonatory process and the occurring fundamental FSAI processes.However, for reproducing a specific phonatory process or specific vocal fold vibrations as e.g.recorded with high-speed imaging [103], corresponding model parameters have to be adapted.Automated estimation of biomechanical model parameters from numerical simulations of the phonatory process has been performed for many years [104].Applied probabilistic approaches [105] and optimization approaches [55] belong also in the broader area of artificial intelligence.However, in contrast to machine learning methods that need training data to develop the corresponding mathematical model behind, these approaches directly compute automatically the desired model parameters.
Combining numerical phonatory simulation models with machine learning methods is a relatively new approach, but is becoming more and more popular.Since 2019 several studies have been performed.The first study reporting parameter estimation of a lumped mass model (two-mass model -2MM) using a recurrent neural network was Gomez et al. 2019 [106].They trained their model with trajectories generated by the 2MM and used 288 ex-vivo vocal fold dynamics to test their model, see Figure 7.The suggested work showed that the subglottal pressure estimation is in the same accuracy range as classical optimization techniques [102] by reducing the online computational costs to a minimum.
Computing the glottal midline in HSV data during phonation is highly important for judging the dynamic vocal fold left-right symmetry.So far, the needed midline detection was a semi-automatic task needing user interaction and was hence, to a certain degree, user dependent [107].To overcome this shortcoming, Kist et al. 2020 [108] applied deep neural networks (DNN) on a six-massmodel (6MM) to automatically estimate the glottal midline detection.They showed that DNNs outperformed semiautomatic approaches.However, all methods, semi-or fully automatic, achieved sufficient accurate results.Finally, they suggested the so called "GlottisNet" that enables the simultaneous estimation of the glottis midline and the segmentation of the glottal area.
Li et al. [109] suggest a new numerical 1D flow model that contains an analytical computation of the entrance effect and the pressure loss in the glottis.This was derived by training the 1D flow model model with flow data based on a complex 3D fluid-structure interaction model.For training, they used a regression approach.
Zhang et al. [110] introduced a so called generalized reduced-order model (ROM) where they estimate glottal flow and glottal pressure distribution along the vocal fold based on subglottal pressure and the glottal shape provided by universal kinematics equations (UKE).Within the ROM, a given glottal shape is optimized (i.e.matched) in each time-step by the UKE using a genetic optimization (GA) algorithm.The DNN, trained with a Navier-Stokes fluid solver model, estimates the glottal flow and glottal pressure distribution.Based on this, a FEM solid solver computes the glottal shape in the next time-step.In a subsequent study, Zhang et al. [111] improved the accuracy of the same parameter estimation by using a long short-term memory (LSTM) network and also reduced the computational time by discarding the GA optimization.
Zhang [112][113][114] developed a simulation-based machine learning model to solve the inverse problem of voice production.The goal was to estimate vocal fold physiology (geometry, position, and mechanical properties) and subglottal pressure from the produced voice outcome.With the improved computational efficiency of the reduced-order model developed in his earlier work [95,97,98], he was able to perform voice production simulations for a large number of voicing conditions.In Zhang [114], a total of 221,400 vocal fold conditions were simulated.Results from these simulations were then used to train a neural network to estimate vocal fold physiology and subglottal pressure from the produced voice.Unlike other machine learning studies that used time-series data as input, Zhang [112] used perceptually-important voice features extracted from the simulated voice outcomes as input to the neural network, which significantly reduced the amount of input data and training efficiency.The voice features included measures of voice acoustics (fundamental frequency, vocal intensity, etc.), aerodynamics (mean and peak-to-peak glottal flow, maximum flow declination rate, etc.), and vocal fold vibration [113].The output of the neural network included vocal fold length, medial-lateral depth, vertical thickness, glottal angle (a measure of vocal fold approximation), vocal fold stiffness along the transverse and longitudinal direction, and subglottal pressure.While the neural network was trained using simulation data, comparison to excised human larynx experiment showed that the neural network was about to predict the subglottal pressure and vocal fold geometry with reasonable accuracy [112].Zhang [114] further applied this neural network to data from human subjects.The results showed reasonable accuracy in estimating the subglottal pressure, and the neural network was able to qualitatively differentiate soft and loud voice production regarding differences in the subglottal pressure and degree of vocal fold adduction.

Conclusion
This overview on numerical approaches for simulating the human phonation process illustrates the diversity of current research.The various approaches concentrate on different components of the phonatory process but also demonstrate the differences of research groups on how the phonatory process should be simulated and analysed.There is still an intensive discussion on what components of the phonatory process are more important and should be considered in simulations and what can be neglected.A promising way to determine what components of the phonatory process are essential may be answered by FSAI simulations in the future.Most importantly, FSAI models that account for the contact of the vocal folds are expected to shed new light on the phonatory process.However, to deliver this expectation, the model validation of the accuracy based on experimental data is essential although being very challenging for FSAI models.This validation step could potentially path the way for driven vocal folds models based on clinical imaging procedures and their usage in patient-tailored applications.As a final conclusion, there is no doubt that in future more and more models will combine simulations with machine learning techniques [115].

Figure 1 .
Figure 1.(A) Head with location of larynx.(B) Illustration of fluid-structure-acoustic interaction during phonation.

Figure 3 .
Figure 3. Sound pressure level of simulations using the time derivative of the incompressible pressure, the convective part, and the full PCWE source term as acoustic source in comparison with experimental measurement data at a microphone positioned 1 m from the end of the vocal tract.Figure adapted from [44].

Figure 4 .
Figure 4. Difference between the body-fitted moving meshes in the ALE approach (left) and static nonconforming IBM grids (right).

Figure 5 .
Figure 5. (A) The computational domain and geometry of the vocal folds, larynx, and vocal tract.(B) The inner-layer structure of the vocal fold as well as the boundary conditions applied on vocal fold walls.Figure adapted from [67].

Figure 6 .
Figure 6.Voice outcome measures predicted from 1D flowbased reduced-order models agree reasonably well with those from 3D Navier-Stokes-based flow models in left-right symmetric and asymmetric vocal fold conditions (as quantified by the leftright stiffness asymmetry ratio Q in the abscissa).The measures include from the top to bottom the fundamental frequency f0, left-right vocal fold vibration amplitude ratio, left-right phase difference in vocal fold vibration, and maximum flow declination rate (MFDR).Figure adapted from [83].
Figure 6.Voice outcome measures predicted from 1D flowbased reduced-order models agree reasonably well with those from 3D Navier-Stokes-based flow models in left-right symmetric and asymmetric vocal fold conditions (as quantified by the leftright stiffness asymmetry ratio Q in the abscissa).The measures include from the top to bottom the fundamental frequency f0, left-right vocal fold vibration amplitude ratio, left-right phase difference in vocal fold vibration, and maximum flow declination rate (MFDR).Figure adapted from [83].

Figure 7 .
Figure 7. Schematic of biomechanical parameter estimation by adapting a two-mass-model towards vocal fold vibrations recorded by high-speed imaging by using a recurrent neural network.Figure adopted from [106].

Table 3 .
Overview of the recent CFD models with static geometry with regard to grid study, experimental validation and study topic.Abbreviation: FvF = false vocal folds.*Only the first author is mentioned.

Table 4 .
Overview of the recent CFD models with externally imposed geometry motion with regard to grid study, experimental validation and study topic.

Table 7 .
An overview of the model verification.

Table 9 .
An overview of the recent FSAI models.Abbreviations: VF = vocal folds, FvF = false vocal folds, VT = vocal tract, D = spatial dimensions of the model, c = compressible flow model, ic+a = hybrid approach using a coupling between an incompressible flow model and an acoustic model, L = layer, M = mass model.*Only the first author is mentioned.

Table 10 .
An overview of the recent FSAI models, denoted by the ID.Abbreviations: VF = vocal folds, VT = vocal tract, L = length of the model, cs = cross-section.

Table 11 .
An overview of the mechanical parameters of the VF in FSAI models.Abbreviations: E p and E pz = transversal and longitudinal Young's modulus, respectively, G pz = longitudinal shear modulus, g = damping ratio.

Table 12 .
An overview of the fluid parameters of the FSAI model.

Table 13 .
An overview of the model verification procedure of the FSAI model.