Learning complementary representations via attention-based ensemble learning for cough-based COVID-19 recognition

Cough sounds have shown promise as a potential marker for distinguishing COVID individuals from non-COVID ones. In this paper, we propose an attention-based ensemble learning approach to learn complementary representations from cough samples. Unlike most traditional schemes, such as simple max or average fusion, the proposed approach fairly considers the contribution of the representation generated by each single model. The attention mechanism is further investigated at the feature level and the decision level. Evaluated on the Track-1 test set of the DiCOVA challenge 2021, the experimental results demonstrate that the proposed feature-level attention-based ensemble learning achieves the best performance (AUC: 77.96%), resulting in an 8.05% improvement over the challenge baseline.


Introduction
Cough sounds from patients with different respiratory illnesses have been shown to have distinct latent features, which can be extracted and fed into machine learning models for recognition purposes [1]. In the COronaVIrus Disease 2019 (COVID-19) pandemic, coughing is one of the main modes of COVID-19 dissemination [2]. It is worthwhile to investigate the feasibility of automatic cough-based COVID-19 recognition, as it is potentially cheaper and faster than the pre-existing diagnosis methods, e.g., polymerase chain reaction testing.
Deep learning tends to outperform traditional machine learning by learning highly non-linear transformations for disease detection [3]. However, most COVID-related cough sound databases [4,5] are small-scale, making it challenging to train deep learning models. Transfer learning is a promising way to transfer the knowledge learnt from large-scale datasets to a new small dataset. Image-based models trained on ImageNet [6] were successfully employed for audio classification [3], and audio-based models trained on AudioSet [7] performed better than image-based models in [8]. In our study, both image-based and audio-based models, as well as feed-forward deep neural networks (DNNs), are applied and compared.
Ensemble learning takes full advantage of multiple models for better performance. For example, features extracted by histogram of oriented gradients and a Convolutional Neural Network (CNN) model were simply concatenated for X-ray-based COVID-19 recognition [9]. Either majority voting or calculating the max/average score of the predicted probabilities was employed to fuse the predictions at the decision level [10]. However, conventional fusion methods have difficulty integrating each model's contribution into the final results when fusing multiple models. In this regard, weighted fusion was used to sum up all features or predictions with weight values [11]. A neural network was trained to fuse multiple representations [12], and an attention-based fusion was attempted to learn the weights of each model [13]. Nevertheless, few studies have investigated and compared feature-level and decision-level attention-based fusions.
To this end, for the first time we propose assembling multiple cough-based COVID-19 recognition models (i.e., a feed-forward DNN model, an image-based model, and an audio-based model) with feature-/decision-level attention, by assuming attention can estimate each feature item's or each prediction's contribution to complementing multiple representations. Key results demonstrate that the attention-based ensemble learning effectively outperforms single models and other conventional fusion methods.

Methodology
In our study, three single-model representations are learnt from hand-crafted features, colourful log Mel spectrogram images, and original log Mel spectrograms, respectively. The attention-based ensemble learning is further proposed to extract helpful information from each representation.

Single-model representations
From the perspective of the input data format of neural networks, three single-model representations are extracted.
Hand-crafted-feature-based representations. To explore the performance of hand-crafted features, three feature sets are extracted, including a log Mel feature set, a Mel Frequency Cepstral Coefficients (MFCC) feature set, and a Computational Paralinguistics ChallengE (ComParE) feature set [14]. The log Mel and MFCC feature sets respectively calculate 26 Mel bins and 14 MFCCs as Low-Level Descriptors (LLDs). To those LLDs, we apply 100 functionals from the ComParE feature set, resulting in 2600 log Mel features and 1400 MFCC features for each audio wave. The ComParE feature set generates 6373 features for each audio signal. A feed-forward DNN model is used to process the hand-crafted features. The activations from the intermediate Fully Connected (FC) layers are further extracted as the hand-crafted-feature-based representations for the later ensemble learning.
Deep image-from-audio-based representations. Two typical CNN models pre-trained on ImageNet [6], VGG11 [15] and ResNet34 [16], are used to extract features from colourful log Mel spectrogram images [17], which have three channels, matching the original image inputs of the pre-trained models. Both models are fine-tuned with added FC layers for the final results. Similar to the hand-crafted-feature-based representations, the deep image-from-audio-based representations are extracted from the added FC layers.
Deep audio-based representations. The audio-based features herein are extracted by pre-trained models learnt from AudioSet [7]. Both CNN14_16k and ResNet38 models [18] are fine-tuned with added FC layers on the log Mel spectrograms. Similarly, the deep audio-based representations are extracted from these extra FC layers.
Although deep audio-based representations outperformed deep image-from-audio-based representations by mitigating the gap between natural images and time-frequency representations of audio waves in [8], we assume there are hidden differences between the two representations. Therefore, both of them are used in this study.

Attention-based ensemble learning
The representations to be fused are defined as a tensor $R$ with a shape of $(L, N_m)$, where $L$ is each representation's length, and $N_m$ denotes the number of the representations.

Feature-level attention
From the $N_m$ representations, attention-based feature-level fusion intends to learn a new vector based on the contribution of each feature item (Fig. 1a). A one-dimensional (1D) convolutional layer with a kernel size of 1 and an output channel number of $L$, followed by a sigmoid function, is applied to the tensor $R$, outputting a new tensor $A_f$ with the same shape as $R$. Afterwards, $A_f$ is normalised and multiplied with $R$ using

$$M_f = \tilde{A}_f \odot R,$$

where $\tilde{A}_f$ denotes the normalised $A_f$ and $M_f$ is the element-wise multiplication result. The normalised $A_f$ is considered as the contribution (i.e., weight) of each feature item. $M_f$ is then summed up along the axis across multiple representations to generate a vector, which is fed into an FC layer to compute the final probability. To enhance the flexibility of the feature-level attention, the output channel number of the convolutional layer could be a different number $L'$. In this case, an additional convolutional layer with an output channel number of $L'$ should be added to process $R$ so that the outputs from the two convolutional layers have the same size $(L', N_m)$.
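The feature-level attention described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function and variable names are hypothetical, the weights are random rather than trained, and the normalisation is assumed to make the weights of each feature item sum to one across the $N_m$ representations, since the text does not state the normalisation axis. A 1D convolution with a kernel size of 1 is written as a matrix product, which is equivalent.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def feature_level_attention(R, W_att, b_att, w_fc, b_fc):
    """R: (L, N_m) stacked representations; W_att: (L, L); b_att, w_fc: (L,)."""
    # A kernel-size-1 Conv1d applies the same linear map at every position,
    # so it reduces to a matrix product over the channel axis.
    A_f = sigmoid(W_att @ R + b_att[:, None])      # (L, N_m)
    # ASSUMPTION: normalise across representations so that the weights of
    # each feature item sum to 1.
    A_norm = A_f / A_f.sum(axis=1, keepdims=True)  # (L, N_m)
    M_f = A_norm * R                               # element-wise product
    fused = M_f.sum(axis=1)                        # (L,) fused vector
    # Final FC layer mapping the fused vector to one probability.
    return sigmoid(w_fc @ fused + b_fc), A_norm


rng = np.random.default_rng(0)
L, N_m = 256, 3                                    # 256-dim vectors, 3 models
R = rng.standard_normal((L, N_m))
W_att = 0.05 * rng.standard_normal((L, L))         # untrained toy weights
b_att = np.zeros(L)
w_fc = 0.05 * rng.standard_normal(L)
prob, weights = feature_level_attention(R, W_att, b_att, w_fc, 0.0)
print(weights.shape, float(prob))
```

Each row of `weights` holds the relative contribution of the three single models to one feature item, which is the quantity the text interprets as the learnt weight.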

Decision-level attention
The decision-level attention processes $R$ with two 1D convolutional layers with a kernel size of 1 and an output channel number of 1, generating two new vectors (Fig. 1b). One of the convolutional layers, followed by a sigmoid function, takes the original tensor $R$ and outputs a vector $A_{d1}$ with a length of $N_m$. Next, $A_{d1}$ is normalised and multiplied with the other newly generated vector $A_{d2}$ by

$$M_d = \tilde{A}_{d1} \odot A_{d2},$$

where $\tilde{A}_{d1}$ denotes the normalised $A_{d1}$ and $M_d$ is the element-wise multiplication result. Afterwards, $M_d$ is summed up along the axis across multiple representations for the ultimate probability value. Compared to the feature-level attention, the decision-level attention learns the weight values for each representation rather than each feature item, as the output channel number of the convolutional layers is related to the class number.
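A comparable NumPy sketch of the decision-level attention is given below. It is illustrative only: the names are hypothetical, the weights are random, the normalisation is assumed to make the $N_m$ weights sum to one, and the weighted sum is assumed to be a logit that a final sigmoid turns into a probability (consistent with the logits-based loss mentioned in the experimental setup, but not stated explicitly for this step).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decision_level_attention(R, w_att, b_att, w_cla, b_cla):
    """R: (L, N_m); w_att, w_cla: (L,) kernel-size-1 conv weights, 1 channel."""
    # Attention branch: one weight per representation.
    A_d1 = sigmoid(w_att @ R + b_att)   # (N_m,)
    # Second branch: one score per representation (left linear here).
    A_d2 = w_cla @ R + b_cla            # (N_m,)
    # ASSUMPTION: normalise so the N_m weights sum to 1.
    A_norm = A_d1 / A_d1.sum()
    M_d = A_norm * A_d2                 # element-wise product
    # ASSUMPTION: the sum is a logit; sigmoid maps it to a probability.
    return sigmoid(M_d.sum()), A_norm


rng = np.random.default_rng(0)
L, N_m = 256, 3
R = rng.standard_normal((L, N_m))
w_att = 0.05 * rng.standard_normal(L)   # untrained toy weights
w_cla = 0.05 * rng.standard_normal(L)
prob_d, weights_d = decision_level_attention(R, w_att, 0.0, w_cla, 0.0)
print(weights_d, float(prob_d))
```

Only $N_m$ weights are produced here, against $L \times N_m$ in the feature-level variant.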
With fewer weight values to learn, the decision-level attention is coarser-grained than the feature-level one.
Experimental results

Database
The Track-1 dataset of the DiCOVA challenge 2021 [5] consists of 1040 cough sounds recorded from 1040 subjects (non-COVID: 965, COVID: 75). Five train-validation folds are split from the dataset, leading to 772 non-COVID and 50 COVID samples in each training set, and 193/25 non-COVID/COVID samples in each validation set. Moreover, an additional blind test set consists of 233 audio samples. All cough sound recordings are sampled at 44.1 kHz and stored in FLAC format. For performance evaluation, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) [5] is assessed. Additionally, the specificity is calculated at 80% sensitivity.
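The two evaluation measures can be illustrated with a small pure-Python sketch. It is not the challenge's scoring script: the function names are hypothetical, the AUC is computed through the pairwise ranking formulation, and the specificity is taken as the best value among thresholds whose sensitivity is at least 80%, without interpolation between ROC points.

```python
def auc_roc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def specificity_at_sensitivity(labels, scores, target=0.8):
    """Best specificity over thresholds reaching at least `target` sensitivity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        sensitivity = sum(p >= t for p in pos) / len(pos)
        if sensitivity >= target:
            specificity = sum(n < t for n in neg) / len(neg)
            best = max(best, specificity)
    return best


# Toy data: 1 = COVID, 0 = non-COVID (made up for illustration).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.65, 0.5, 0.4, 0.3, 0.1]
auc = auc_roc(labels, scores)
spec = specificity_at_sensitivity(labels, scores)
print(auc, spec)
```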

Experimental setup
We re-sample all audio recordings to 16 kHz. To generate 224-frame log Mel spectrograms as required by the image-based models, the audio samples are segmented by a non-overlapping sliding window with a length of 57,600 samples (3.6 s at 16 kHz). The window length and the overlap during the short-time Fourier transform are set to 512 and 256, respectively. We set the number of Mel bins to 64 and 128 to align with the input dimension requirements of the pre-trained audio-based and image-based models, respectively. The log Mel spectrograms are converted into images with the "jet" colourmap for the image-based models. The mixup approach [18] is utilised to augment the extracted hand-crafted features and (colourful) log Mel spectrograms 1.5 times, respectively.
For the hand-crafted-feature-based representations, the feed-forward DNN models consist of three FC layers with 1024, 256, and 1 neurons. For the other two types of representations, three trainable FC layers with the same numbers of neurons are added after the pre-trained single models. All representations are extracted from the respective second FC layer. The layers before the inherent FC layers of the pre-trained models (the second FC layer of VGG11, the first FC layers of ResNet34, ResNet38, and CNN14_16k) are frozen, and the others are removed. Specifically, the final convolutional block of ResNet34 is set to be trainable, since the transferability of its unique FC layer is limited. During training, the model parameters are updated for 30 epochs with a batch size of 16. The "Adam" optimiser with an initial learning rate of 0.001 is empirically chosen, and the learning rate decays by a factor of 0.1 after every 10 epochs for a stable training process. We apply the binary cross-entropy with logits loss as the loss function.
For experimental comparison, we fuse the representations by max and average fusions at the feature/decision level. The feature-level max and average fusions respectively calculate the maximum and average values across multiple 256-dimensional representations from the second-last FC layers, while the decision-level ones compute the maximum and average predicted probabilities. The inputs of both feature-level and decision-level attentions are the outputs from the second-last FC layers.
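The relation between the segment length and the 224 spectrogram frames can be checked with a short calculation. The sketch assumes that the segment length of 57,600 is counted in audio samples at 16 kHz, that the hop size equals the window length minus the overlap (512 - 256 = 256), that no padding is applied before the transform, and that a trailing remainder shorter than one segment is discarded; none of these details is stated explicitly in the text, and the function names are hypothetical.

```python
def num_stft_frames(num_samples, win_length=512, overlap=256):
    """Number of STFT frames for an unpadded signal."""
    hop_length = win_length - overlap
    return 1 + (num_samples - win_length) // hop_length


def segment(signal, segment_length=57_600):
    """Cut a signal into non-overlapping segments, dropping the remainder."""
    last_start = len(signal) - segment_length
    return [signal[i:i + segment_length]
            for i in range(0, last_start + 1, segment_length)]


frames = num_stft_frames(57_600)
duration_s = 57_600 / 16_000
chunks = segment([0.0] * 130_000)       # dummy 130,000-sample recording
print(frames, duration_s, len(chunks))  # 224 3.6 2
```

Under these assumptions, each segment yields exactly 224 frames, matching the input width expected by the ImageNet-trained models.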
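The step decay of the learning rate can be written as a one-line schedule. The sketch below assumes zero-indexed epochs and a decay applied at epochs 10 and 20; the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.1, step=10):
    """Step decay: multiply the rate by `decay` after every `step` epochs."""
    return initial_lr * decay ** (epoch // step)


# Learning rate for each of the 30 training epochs.
schedule = [learning_rate(e) for e in range(30)]
print(schedule[0], schedule[10], schedule[20])
```

The training thus uses approximately 1e-3 for epochs 0 to 9, 1e-4 for epochs 10 to 19, and 1e-5 for epochs 20 to 29.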
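The conventional baselines can be sketched in a few lines of pure Python. The function names and the toy numbers are hypothetical; real representations would be 256-dimensional, and how the fused feature vector is subsequently classified is not detailed in the text, so the sketch stops at the fusion step.

```python
def fuse_features(representations, mode="avg"):
    """Element-wise max or mean across equal-length representation vectors."""
    if mode == "max":
        return [max(items) for items in zip(*representations)]
    return [sum(items) / len(items) for items in zip(*representations)]


def fuse_decisions(probabilities, mode="avg"):
    """Max or mean of the probabilities predicted by the single models."""
    if mode == "max":
        return max(probabilities)
    return sum(probabilities) / len(probabilities)


# Toy 2-dimensional representations from three single models.
reps = [[1.0, 5.0], [3.0, 2.0], [2.0, 2.0]]
feat_max = fuse_features(reps, "max")       # [3.0, 5.0]
feat_avg = fuse_features(reps, "avg")       # [2.0, 3.0]
dec_max = fuse_decisions([0.2, 0.7, 0.6], "max")
dec_avg = fuse_decisions([0.2, 0.7, 0.6], "avg")
print(feat_max, feat_avg, dec_max, dec_avg)
```

Both fusions apply fixed, input-independent rules, whereas the attention-based variants learn the weights from the data.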

Results and discussions
In Table 1, the results on the test set are better than those on the validation sets in most cases, probably because all 1040 cough sounds are used to train the model that is evaluated on the test set, whereas smaller training sets are used during cross-validation. Both log Mel and ComParE features perform well on the test set, indicating it is promising to extract log Mel spectrograms as the CNNs' inputs. Log Mel features perform better than MFCC features on the test set, perhaps because MFCC features are highly compressed and 14 coefficients are not enough to represent the characteristics of COVID cough sounds. The deep audio-based representations mostly outperform the deep image-from-audio-based representations, perhaps because audio-based models are more suitable for audio-related tasks than image-based models. The best representations of the three types are from the feed-forward DNN model on the ComParE features, VGG11 on the colourful log Mel spectrogram images, and ResNet38 on the log Mel spectrograms, respectively.
The above representations are further processed by ensemble learning. Compared to the official baseline (i.e., average validation AUC: 68.81%, test AUC: 69.91%), the results of the three single-model representations are similar or slightly inferior on the validation and test sets. The ensemble learning approaches with complementary single-model representations improve on most single models and mostly outperform the baseline system. At both the feature and decision levels, the average fusion performs better than the max fusion, perhaps because max fusion neglects beneficial representations when selecting the maximum values only. Remarkably, the feature-/decision-level attention-based fusion outperforms both max and average fusions, indicating attention-based fusion can better complement multiple representations. Notably, the feature-level attention obtains the best AUC of 77.96% on the test set, which is a significant improvement over the baseline system's performance (p < 0.05 by a one-tailed z-test).
Apart from AUC values, the specificity results are also compared in Table 1. Higher specificity corresponds to a lower false positive rate. Both attention-based fusion models achieve a high specificity of 59.38% on the test set. The feature-level average fusion has the highest specificity (61.46%), while it performs slightly worse than the attention-based fusions on AUC (73.72%), perhaps because its ROC curve is not stable across all thresholds. To analyse the two attention-based fusion methods, the average Receiver Operating Characteristic (ROC) curves over the five-fold validation sets are depicted in Figure 2. We can see that both attention-based fusion methods yield higher True Positive Rates (TPRs) at given False Positive Rates (FPRs) than chance. The ROC curves illustrate that the attention-based fusion approaches are adequate to recognise COVID-19 from cough sounds.

Conclusions and future work
Three single-model representations were extracted in this work, i.e., hand-crafted-feature-based representations, deep image-from-audio-based representations, and deep audio-based representations. The proposed attention-based ensemble learning was further applied to learn these complementary representations. On the DiCOVA challenge 2021 Track-1 database, both feature- and decision-level attention-based fusions outperformed the single-model classifiers and the max/average fusion for COVID-19 recognition. In future efforts, we will augment the training data by using more COVID-19-related databases and more data augmentation methods, e.g., SpecAugment [18]. The two attention-based fusion mechanisms will be further compared and analysed on other acoustic tasks, such as speech emotion recognition [19].

Bold values: The feature-level attention (specificity: 59.38%, AUC: 77.96%) outperforms the other two feature-level fusions (i.e., feature-max and feature-average (avg)), and the decision-level attention (specificity: 59.38%, AUC: 77.36%) performs the best among the three decision-level fusions.

Figure 2. The average ROC curves of the attention-based ensemble learning on the validation sets.