I came up with this idea while working on Voca. Since I don't intend to include it in my PhD thesis, write a research paper about it (I don't have time to perform an experimental evaluation), or make money by patenting it (thanks to the extremely slow Rice OTT process), I thought it might at least make a good blog entry.
In this blog post, I will discuss a novel approach to improving speech recognition that will in turn help improve the accuracy of the Voca desktop app (if and when I decide to work on it). This approach can be applied to any speech-to-text software that is to be distributed to the masses (e.g. Siri).
1. Background
1.1 Introduction
Automatic speech recognition (ASR) involves predicting the most likely sequence of words W for the input audio data A. To do so, speech recognizers use the following optimization function[2]:
\begin{align*} \hat{W} &= \operatorname*{arg\,max}_{W} P(W \mid A) & \\ &= \operatorname*{arg\,max}_{W} \dfrac{P(W) P(A \mid W)}{P(A)} & \dots \text{ Bayes' theorem} \\ &= \operatorname*{arg\,max}_{W} P(W) P(A \mid W) & \dots \text{ } P(A) \text{ does not depend on } W \end{align*}
The term P(W) is the probability of occurrence of a given sequence of words and is defined by the language model. At a high level, the language model captures the inherent restrictions imposed by the English language. For example, a language model might imply that the sentence “it is raining heavily in houston” is more likely than “raining it is in heavily houston”.
The term P(A|W) is the probability of the audio data given the sequence of words and is captured by the acoustic model, which at a high level is a probabilistic mapping from acoustic sounds to words. The acoustic model is usually obtained by training hidden Markov models on audio from a small subset of users. The users who train the acoustic model are referred to as trainers in subsequent sections.
To summarize, the speech recognizer searches for the sequence of words that maximizes the above function using the acoustic and the language model. The output of this search procedure is a tuple < words W, confidence >, where confidence refers to the value of the above function for W. For example: <“it is raining heavily in houston”, 0.9 >
Thus, the accuracy of a speech recognizer depends heavily on these two models. Since only the acoustic model varies with the user, in the rest of the discussion we will refer only to the acoustic model and assume the use of a standardized language model (e.g. HUB4).
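To make the search concrete, here is a minimal sketch in Python. It assumes the recognizer has already narrowed the search down to a handful of candidate sentences; the probabilities and the names `candidates`, `log_score` and `decode` are invented for illustration and do not come from any real recognizer. It works in log space, which is the usual way to avoid numerical underflow when multiplying small probabilities.

```python
import math

# Hypothetical scores for three candidate transcriptions of one audio clip A.
# P(W) would come from the language model and P(A|W) from the acoustic model;
# the numbers below are made up for illustration.
candidates = {
    "it is raining heavily in houston": {"p_w": 1e-6, "p_a_given_w": 2e-4},
    "raining it is in heavily houston": {"p_w": 1e-12, "p_a_given_w": 2e-4},
    "it is reigning heavily in houston": {"p_w": 1e-9, "p_a_given_w": 3e-4},
}


def log_score(scores):
    """log P(W) + log P(A|W). P(A) is dropped: it is the same for every W."""
    return math.log(scores["p_w"]) + math.log(scores["p_a_given_w"])


def decode(candidates):
    """Return the candidate word sequence with the highest score."""
    return max(candidates, key=lambda w: log_score(candidates[w]))


best = decode(candidates)
print(best)
```

A real recognizer cannot enumerate every possible sentence like this, so it searches the space incrementally, but the quantity being maximized is the same.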
1.2 Roles
There are three types of people involved in the speech recognition process:
- Developers write the code for the speech recognition software.
- Trainers record their voice and give the audio files to the developers. Developers then use these audio files to train the acoustic model.
- Users are the general audience who use the speech recognition software. The number of trainers is usually much smaller than the number of users.
1.3 Steps in speech recognition process
The steps in a typical speech recognition API/software are as follows:
1. Before releasing the API/software:
(a) Developers encourage trainers to record their voice.
(b) Developers use the audio files to train a universal model for all the users. The developers may also use information provided by experts to improve the language model.
(c) Developers bundle together the model and the search algorithm and release the API/software.
2. After releasing the API/software:
(a) The search algorithm in the API/software tries to find what the user is saying by comparing the user’s acoustic signal with its universal model using various statistical methods.
1.4 Challenges
However, speech recognition is an inherently hard problem due to issues like:
- Differences in speaking style due to accent, regional dialect, sex, age, etc. (e.g. the acoustic signature of a young Chinese girl saying some words will probably differ from that of an older German person speaking those exact same words).
- Lack of natural pauses to detect word boundaries.
- Computational issues like the size of the search space, homophones (words that sound the same), echo effects due to the surroundings, and background noise.
2. Summary
Instead of using one universal model that works for everyone, we propose the following approach:
Train k different models in step 1(b) of section 1.3 (where k is a reasonably small number, for example 100) such that each of these models tries to capture the speaking style of a particular group of people. Then, recommend the model that best suits the user depending on his/her acoustic similarity with each of the above groups.
2.1 Why will this idea work in practice?
- It is a well-known fact that the accuracy of a speech recognizer is higher if the trainer and the user are the same person.
- Most users, however, do not want to spend hours training the model.
- Some people are more acoustically similar than others. For example, a model trained on a young native American student will suit a similar student well, but might not suit an old Indian lady. Hence, if the trainer and the user speak very similarly, the model will have higher accuracy than a universal model.
- We would like to point out that users can be clustered into groups by the way they speak, and that the number of these groups is finite. This is basically a soft assignment. For example, I would fall in the category American accent with probability 0.2, but in the category Indian accent with probability 0.7. This knowledge can be incorporated into the recognition process.
2.2 Related Works
An area that is close to our invention is speaker-adaptive modeling[1], where the universal model is adapted to the user based on training on speaker-dependent features. This means the parameters of the acoustic model are split into two parts:
- Speaker-independent parameters
- Speaker-dependent parameters
Speaker-adaptive training has the following drawbacks:
- Training is a significantly compute-intensive task and might not be a viable option for smartphones, netbooks, or even laptops. Offloading the training to a server will involve more network usage (which in turn will increase the cost of speech recognition for the user and/or the developer).
- Training might require expert intervention.
- It still is not an out-of-the-box solution, since an offline solution will require some training to make it speaker-adaptive.
For n users (where n ≫ k), speaker-adaptive systems will in effect have n models, whereas our clustering-based model assignment approach will have k models. Since the k models are trained on the server, model training will also benefit from expert intervention without any additional cost to the user. It is important to note that model assignment does not require any expert intervention and is not the same as speaker-adaptive training.
3. Detailed discussion
3.1 Key conditions
1. Every user can be represented as a mixture over the acoustic models. Example: let’s say that user U1 is sampled with mixture proportions {0.9, 0.1} over models m1 and m2 respectively. This means that if we run recognition on an audio file generated by user U1 on both models, it is likely that model m1 will produce output that is 9 times more accurate than that of model m2.
2. When we create a model, the job of the expert is to make sure that:
(a) Each model is representative of a distinct accent.
(b) Every user (on average) is expected to have high affinity to only 1 model.
3. Corollaries based on the above points:
(a) A model is simply a cluster of users.
(b) A user can belong to multiple clusters.
3.2 Modified steps for speech recognition
We propose a modified methodology that will improve the accuracy of speech recognition:
1. Before releasing the API/software:
(a) Developers encourage trainers to record their voice.
(b) Using the model described in section 3.3, we get k models (for example, k = 100).
(c) Choose one of these k models as the default model. Also, set a value for the parameter n (for example, n = 3).
(d) For i = 1 to n do:
- Select a new phrase p_i such that all k models have at least one audio file that contains the phrase p_i or one very similar to it. Hint: during the training process, ask every trainer to speak the phrase p_i. Use edit distance and some tolerance delta (= 5) to find similar phrases.
(e) Developers bundle together the default model and the search algorithm and release the API/software.
2. After releasing the API/software:
(a) During installation (or during periodic re-tuning), for i = 1 to n do:
- Ask the user to speak the phrase p_i and send the audio file to the server.
- For each model j, do:
  - Do speech-to-text transcription for the given audio file.
  - Let p′_{i,j} be the output phrase of the transcription.
  - Let ε_{i,j} be a measure of the error in the transcription of phrase p_i on model j. For example: ε_{i,j} = Edit-Distance(p′_{i,j}, p_i). Note that, depending on the training data, an expert might choose to use a different metric (before releasing the API), such as the acoustic similarity between a random audio file in the model and the audio file provided by the user.
(b) Return to the user the model that has the minimum error over all the phrases.
(c) The user uses the model that is sent by the server rather than a universal model.
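The server-side selection in step 2 can be sketched as follows. This is only an illustration under a few assumptions of mine: the edit distance is computed over words rather than characters, "minimum error for all the phrases" is read as the smallest total error summed over the n phrases, and the transcriptions are passed in as plain strings (in practice they would come from running each of the k models on the user's audio). The names `edit_distance` and `choose_model` are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two phrases."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def choose_model(phrases, transcripts):
    """Pick the model whose transcriptions are closest to the prompted phrases.

    phrases:     the n prompted phrases [p_1, ..., p_n]
    transcripts: {model_id: [output for p_1, ..., output for p_n]}
    """
    def total_error(model_id):
        outputs = transcripts[model_id]
        return sum(edit_distance(out, ref) for out, ref in zip(outputs, phrases))

    return min(transcripts, key=total_error)


# Made-up transcriptions of n = 2 phrases by three models.
phrases = ["it is raining heavily in houston", "please call me tomorrow morning"]
transcripts = {
    "default": ["it is raining heavy in houston", "please call me tomorrow morning"],
    "m1": ["it is raining heavily in houston", "please call me tomorrow morning"],
    "m2": ["it is training heavily in boston", "please fall me tomorrow morning"],
}
print(choose_model(phrases, transcripts))
```

Summing the errors is the simplest way to combine the n phrases. An expert could just as well swap in the maximum error, or a different metric altogether, as noted in step 2(a).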
3.3 Model for clustering the trainers
Step 1: Create two sets:
- Set 1 (set of models): models that have been found to contain similar users. Initial value = {}.
- Set 2 (set of trainers not yet clustered): put all the trainers in Set 2.
Step 2: Train an acoustic model M using the audio files of the trainers in Set 2.
Step 3: Run recognition on the audio files of the trainers in Set 2 ⇒ you get εi (i.e. the edit distance between the output phrase of the recognizer and the true phrase). This means some of the audio files are held out from training. It might also be interesting to have no held-out files and see the error when the training and test data are the same.
Step 4: Find the maximal subset of the trainers whose Zi matches the trainers for all the phrases ⇒ train a model using these trainers and add it to Set 1 (after a sanity test, i.e. this model should be able to correctly transcribe a random set of the audio files that it was trained on).
Step 5: Keep the remaining trainers in Set 2 and go to Step 2.
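The loop in steps 1 to 5 can be sketched as below. Several things here are my assumptions rather than part of the procedure above: acoustic training and recognition are hidden behind two hypothetical callables, `train` and `error`; the matching criterion of step 4 is read as "the trainer's error under the current model is at most `tol`"; the sanity test of step 4 is omitted; and when no trainer matches, the remaining trainers are kept as one final cluster so that the loop terminates.

```python
from collections import Counter


def cluster_trainers(trainers, train, error, tol=0):
    """Greedy clustering of trainers into models.

    trainers: list of trainer ids
    train(ids) -> model trained on the audio of those trainers (hypothetical)
    error(model, trainer) -> edit-distance error of the model on that
                             trainer's held-out audio (hypothetical)
    Returns a list of (model, trainer ids) pairs, i.e. Set 1.
    """
    models = []                 # Set 1: models found so far
    remaining = list(trainers)  # Set 2: trainers not yet clustered
    while remaining:
        candidate = train(remaining)                                       # Step 2
        matched = [t for t in remaining if error(candidate, t) <= tol]     # Steps 3-4
        if not matched:
            # Assumption: no trainer fits, so keep the rest as one cluster.
            models.append((candidate, remaining))
            break
        models.append((train(matched), matched))
        remaining = [t for t in remaining if t not in matched]             # Step 5
    return models


# Toy stand-ins: each trainer has an accent label, a "model" is just the most
# common accent among its trainers, and the error is 0 only for that accent.
accent = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "c1": "C"}


def toy_train(ids):
    return Counter(accent[t] for t in ids).most_common(1)[0][0]


def toy_error(model, trainer):
    return 0 if accent[trainer] == model else 3


clusters = cluster_trainers(list(accent), toy_train, toy_error)
print(clusters)
```

With real audio, `tol` would probably need to be larger than 0, since even a well-matched model rarely transcribes every held-out phrase perfectly.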
4. References
- [1] X. Huang and K. F. Lee. On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. IEEE Transactions on Speech and Audio Processing, 1(2):150–157, April 1993.
- [2] Geoffrey Zweig and Michael Picheny. Advances in large vocabulary continuous speech recognition. Volume 60 of Advances in Computers, pages 249–291. Elsevier, 2004.