Introducing the Multi-Task Speech Corpus for COVID-19

A speech corpus is a collection, or database, of speech audio files and their textual descriptions. It can be used to build acoustic models for speech technologies or to research various aspects of speech recognition and speech identification in linguistics. "Corpora" is the plural of "corpus" and refers to a set of such databases. Here we will talk about a speech corpus for COVID-19 and how its multi-task nature can help during the pandemic.

Creating a dataset on any topic is not an easy task. Before starting, spend some time researching the topic online to find out whether a dataset already exists and whether the available resources can be reused. Speech data is especially tricky, and it takes considerable work to balance time and resources.

Speech Corpus for COVID-19

In recent times, help has been needed to slow the rising rate of infections and ease the workload at hospitals. With a speech corpus on COVID-19, health screening applications can analyze the symptoms of a suspected case and detect abnormalities in the input he or she provides. Tracking this novel infection early is a challenge, but it helps patients recover faster and more easily. Such applications make the situation a little easier to tackle, because the voice of the suspected case can be recorded and analyzed. COVID-19 detection can be approached by looking at the difference between a normal speech signal and the coughing of a suspected case, as the signals vary in each case.
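To make the idea of "signals that vary" a little more concrete, here is a minimal sketch of two simple frame-level measurements, short-time energy and zero-crossing rate. The article does not say which acoustic measurements a screening application uses, so these two are an assumption chosen for illustration, and the two signals below are synthetic stand-ins (a pure tone and random noise), not real speech or cough recordings.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def frame_features(signal, frame_len=400, hop=200):
    """Slide a window over the signal; return (energy, zcr) for each frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append((short_time_energy(frame), zero_crossing_rate(frame)))
    return feats


def mean_zcr(signal):
    """Average zero-crossing rate over all frames of the signal."""
    feats = frame_features(signal)
    return sum(zcr for _, zcr in feats) / len(feats)


# Synthetic stand-ins (assumptions for illustration, not real recordings).
SAMPLE_RATE = 16000
rng = random.Random(0)
# A 150 Hz tone as a rough proxy for a sustained, voiced sound.
vowel_like = [
    math.sin(2 * math.pi * 150 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 4)
]
# Uniform noise as a rough proxy for a noisy, cough-like burst.
cough_like = [rng.uniform(-1, 1) for _ in range(SAMPLE_RATE // 4)]

print("tone  mean ZCR:", mean_zcr(vowel_like))
print("noise mean ZCR:", mean_zcr(cough_like))
```

On these toy inputs the noisy signal crosses zero far more often than the tone. A real application would need actual recordings and, most likely, a richer feature set than these two numbers.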


The Multi-Task Nature of the Corpus:

Once a corpus is created or collected, it opens up a variety of uses. Specifically, a speech-corpus-based application for COVID-19 can provide a remote solution for many casualties, low-cost detection, convenience for its users, tracking of fatal symptoms, and more. These applications investigate the acoustic parameters of the speech segments in a sample against a database that contains negatively tested samples with mild symptoms and positively tested samples with moderate and mild symptoms, with an accuracy of more than 80% using machine learning or AI algorithms.

Collecting/Creating the Corpus:

At present, speech recordings of patients who tested positive for COVID-19 are not commonly available, because patients take part less in speaking activities while they are ill. Also, no dataset is available for public use. However, smart applications can be used to collect speech samples from positively tested patients. For example, an application can assign verbal tasks to the patient: holding a vowel sound for 5 seconds or for as long as they can, repeating some words until the application beeps, and repeating nasal phrases until the timer beeps. Once a task is complete, the application directs the patient to the next one.
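The task sequence described above could be represented in a collection app roughly as follows. This is only a sketch: the class and function names, the prompts, and the 10-second limits are hypothetical, and only the 5-second vowel hold comes from the description above. The recorder is a stub so the example runs without a microphone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbalTask:
    name: str
    prompt: str
    max_seconds: float  # time after which the app would beep


# Hypothetical task list; only the 5 s vowel limit comes from the article.
TASKS = [
    VerbalTask("sustained_vowel", "Hold a vowel sound for as long as you can.", 5.0),
    VerbalTask("word_repetition", "Repeat the words until you hear the beep.", 10.0),
    VerbalTask("nasal_phrase", "Repeat the nasal phrase until the timer beeps.", 10.0),
]


def run_session(tasks, record):
    """Run the tasks in order.

    `record` is any callable that takes a task and returns how many seconds
    the patient actually spoke; the result is capped at the task's limit.
    """
    results = []
    for task in tasks:
        seconds = min(record(task), task.max_seconds)
        results.append({"task": task.name, "seconds": seconds})
    return results


def fake_recorder(task):
    """Stub recorder: a short vowel hold, and over-long speech otherwise."""
    return 3.2 if task.name == "sustained_vowel" else 12.0


session = run_session(TASKS, fake_recorder)
for entry in session:
    print(entry)
```

Passing the recorder in as a callable keeps the task flow separate from the audio capture, so the same sequence could be driven by a real microphone recorder or by a test stub.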

Exploring Features in the Corpus:

After the speech inputs are collected, acoustic features are investigated and extracted for every corpus entry. The most informative features are then selected, which helps avoid overfitting and underfitting. These features are passed to machine learning classifiers, and the resulting accuracy is used to settle on the most reliable features. In one experiment, the features extracted from the nasal task produced the highest baseline classification accuracy, 68%. This was attributed to the presence of phonemic sounds in nasal speech that distinguish positively tested cases from negatively tested ones.
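The comparison of per-task feature sets by classifier accuracy can be sketched as below. The article does not name the classifier or the validation scheme, so the nearest-centroid classifier and the leave-one-out evaluation here are assumptions. The feature vectors are made-up numbers for two hypothetical tasks, chosen only to exercise the code; they do not reproduce the 68% figure or any real measurement.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(samples):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    `samples` is a list of (feature_vector, label) pairs. Each sample is
    held out in turn and classified using centroids built from the rest.
    """
    correct = 0
    for i, (vec, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        by_label = {}
        for v, lab in train:
            by_label.setdefault(lab, []).append(v)
        centroids = {lab: centroid(vs) for lab, vs in by_label.items()}
        predicted = min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))
        if predicted == label:
            correct += 1
    return correct / len(samples)


# Made-up feature vectors for two hypothetical tasks (not real data).
task_a = [
    ([0.40, 0.5], "positive"), ([0.60, 0.5], "positive"), ([0.50, 0.5], "positive"),
    ([0.50, 0.5], "negative"), ([0.45, 0.5], "negative"), ([0.55, 0.5], "negative"),
]
task_b = [
    ([0.9, 0.8], "positive"), ([1.0, 0.7], "positive"), ([0.8, 0.9], "positive"),
    ([0.1, 0.2], "negative"), ([0.2, 0.1], "negative"), ([0.0, 0.3], "negative"),
]

scores = {"task_a": loo_accuracy(task_a), "task_b": loo_accuracy(task_b)}
best = max(scores, key=scores.get)
print(scores, "-> most separable task:", best)
```

Holding each sample out before it is classified is one way to guard against the overfitting mentioned above, since a feature set is never scored on data it was fitted to.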


During the pandemic, help has been needed to slow the rising rate of infections and ease the workload at hospitals. A speech corpus for COVID-19 can be used to identify suspected positive cases and to learn more about this contagious outbreak. We hope this article gave you the information about the speech corpus for COVID-19 that you were looking for!