On-Line Frame-Synchronous Compensation of Non-Stationary Noise

April 15, 2004

I am currently pursuing a Ph.D. in Computer Science in the PAROLE work group of the LORIA research institute in France. My dissertation deals with robustness issues in automatic speech recognition and is due in October 2004. My research is supervised by Irina Illina, Dominique Fohr and Jean-Paul Haton.
Robust automatic speech recognition
An automatic speech recognition (ASR) system suffers a significant degradation in performance when used in a test condition that does not match its training environment. This mismatch is mostly due to additive noise sources and to discrepancies in channels and speakers. These mismatch sources may be non-stationary, and little a priori information about them is available. Several techniques have been proposed to enhance speech in a robust manner; they generally fall into three broad categories. In the first class, robust signal processing is used to reduce the sensitivity of the speech features to possible distortions. In the second class, models of the noise and channel are directly incorporated into the recognition process. In the third class, compensation methods modify the feature vectors of the test signal to bring them closer to the trained models. The algorithms studied in my research fall into this last category. More specifically, our approach belongs to the Stochastic Matching (SM) framework, whose fundamentals were proposed in [1]. In that paper, the parameters of a compensation function are estimated so as to maximize the likelihood of the transformed speech sequence given the set of acoustic models. The parameters are obtained through several Expectation-Maximization (EM) steps and naturally rely on the optimal sequence of states. The most appealing aspect of the SM framework is that it does not need any a priori information on the nature or level of the corrupting noise: theoretically, only the test sentence to be decoded is needed to perform compensation. Frame-synchronous algorithms are naturally appealing for coping with slowly varying non-stationary noise sources, even though they often face convergence problems linked to the scarcity of data. Off-line compensation algorithms exist that cope with this sort of naturally varying acoustic environment, but the duration of the computation involved is not compatible with everyday applications. Our techniques are fully frame-synchronous: the parameters of the compensation functions are updated at each time frame, in parallel with the recognition process.
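To make the SM idea concrete, here is a minimal sketch of one EM re-estimation step for an additive cepstral bias, assuming the mismatch model x_t = y_t - b and diagonal-covariance Gaussian states. The function name, array shapes, and the choice of a pure bias transform are illustrative assumptions, not the exact formulation of the papers.

```python
import numpy as np

def em_bias_step(Y, means, variances, gammas):
    """One EM re-estimation of an additive bias b (mismatch model x = y - b).
    Y: (T, D) observed features; means, variances: (S, D) diagonal Gaussians;
    gammas: (T, S) state posteriors from the previous E-step.
    Per-dimension ML update:
      b = sum_{t,s} gamma_t(s) (y_t - mu_s) / sigma_s^2
        / sum_{t,s} gamma_t(s) / sigma_s^2
    """
    num = np.einsum('ts,tsd->d', gammas, (Y[:, None, :] - means[None]) / variances[None])
    den = np.einsum('ts,sd->d', gammas, 1.0 / variances)
    return num / den
```

With a single zero-mean unit-variance state and all posteriors equal, this reduces to the mean residual between the observations and the model, as expected for a maximum-likelihood bias.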
Frame synchronous compensation algorithm
In the frame-synchronous compensation mode, the complete statistics (forward-backward probabilities) needed in the classical SM framework are difficult to obtain because the end of the sentence is not yet available. One solution is to approximate these statistics with forward probabilities alone. The basic idea of our method is as follows. First, we make the hypothesis that, during the Viterbi alignment, the states with the highest forward probabilities model the speech observations well [2, 3]. Then, the parameters of the mismatch function are estimated so as to increase the likelihood of the observations given those states. Consequently, this on-line algorithm performs compensation in parallel with recognition and does not need any a priori information on the nature of the noise; the parameters of the compensation transform are estimated frame by frame. Compared with classical frame-synchronous compensation methods such as Cepstral Mean Normalization and Spectral Subtraction, our algorithms gave significant improvements. For example, the first version of our algorithm yielded up to 15.5 % improvement in word error rate over Spectral Subtraction on the VODIS database. The French database VODIS (Voice-Operated Driver Information System) was recorded in a moving car, in various driving situations, by 200 speakers. Similarly, a 27.8 % improvement over frame-synchronous Cepstral Mean Normalization was obtained.
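The per-frame loop can be sketched as below. This is only an illustration of the principle (emission likelihoods stand in for forward probabilities, and an exponential forgetting factor tracks the slowly varying bias); the function and parameter names are my own, not the algorithm of the papers.

```python
import numpy as np

def frame_sync_compensate(frames, means, variances, rho=0.98):
    """Sketch of frame-synchronous bias compensation.
    At each frame: apply the current bias estimate, weight the states by
    their (Gaussian) emission likelihood as a stand-in for forward
    probabilities, then refresh the bias with forgetting factor rho so it
    can follow slowly varying noise."""
    D = frames.shape[1]
    bias = np.zeros(D)
    compensated = []
    for y in frames:
        x = y - bias                                  # compensate with current estimate
        # state weights from Gaussian log-likelihoods (softmax-normalized)
        ll = -0.5 * np.sum((x - means) ** 2 / variances + np.log(variances), axis=1)
        w = np.exp(ll - ll.max())
        w /= w.sum()
        # frame-level bias target, blended into the running estimate
        target = y - w @ means
        bias = rho * bias + (1 - rho) * target
        compensated.append(x)
    return np.array(compensated), bias
```

On a stationary additive bias, the running estimate converges geometrically toward the true offset at a rate set by rho.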
To improve on the previously presented method, we proposed a structural state-based transformation [4]. This approach is motivated by several observations. First, it is often assumed that observations which are similar are affected in a similar manner by variations in the environment; hence, a set of subspace-specific transformations should give better results. Second, subspace-specific transformations face a data-scarcity problem that can be overcome by the use of a hierarchical transformation: a tree of transformations. For each node of this tree, a transformation function is estimated according to the observations of the current sentence. If the transformation associated with a node is poorly estimated, that of its parent is used instead.
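The parent-fallback mechanism can be sketched as follows, here with a scalar bias per node and a simple frame-count reliability test. The class, the threshold, and the reliability criterion are illustrative assumptions; the papers' actual estimation criterion may differ.

```python
# Hierarchical (tree-structured) bias transforms: each node covers a
# subspace of the acoustic space; a node's bias is used only if enough
# frames supported its estimate, otherwise we back off to the parent.
class TransformNode:
    def __init__(self, parent=None, min_frames=3):
        self.parent = parent
        self.min_frames = min_frames   # reliability threshold (illustrative)
        self.count = 0
        self.bias_sum = 0.0

    def accumulate(self, residual):
        """Add one frame's residual (observation minus model mean)."""
        self.count += 1
        self.bias_sum += residual

    def bias(self):
        """This node's bias if reliably estimated, else the parent's."""
        if self.count >= self.min_frames:
            return self.bias_sum / self.count
        if self.parent is not None:
            return self.parent.bias()
        return 0.0  # root fallback: no compensation
```

A leaf that has seen too few frames thus transparently inherits the more robust, coarser estimate of its ancestors.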
On-line compensation for non-stationary noise
As a second step, we explored the ability of our algorithm to cope with abrupt changes in the acoustic environment [5]. In real-life conditions, ASR systems may face the unexpected and sudden occurrence of noise (for example, a window being opened while driving). No information is available on the onset time, level or nature of the sudden noise, yet a compensation algorithm should account for such changes within a short time period. In this scope, two problems can be explored: detection of environment changes, and adaptation of the compensation strategy to the new environment. Consequently, we studied a new version of the previously presented algorithm that takes abrupt environment changes into account. At each time frame, the distance between the incoming speech frame and the most probable emitting state is computed. When a sudden change occurs in the test acoustic environment, this distance changes quickly. We detect this disruption using several well-known detection algorithms, such as Shewhart control charts, the Bayesian Information Criterion (BIC) and an adaptation of the Spectral Variation Function (SVF). The bias is then set to the value corresponding to the closest environment previously observed. This approach gave impressive improvements over classical compensation methods on artificially corrupted data (noise added from the middle of a clean test sentence to its end). For instance, we obtained up to 32.4 % improvement in phoneme error rate over the baseline on this type of data.
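As an illustration of the detection step, here is a minimal Shewhart-style control chart over the per-frame model distances: a frame raises an alarm when its distance leaves the k-sigma band estimated from the preceding window of frames. The window length, the k factor, and the function name are illustrative choices, not the papers' settings.

```python
import numpy as np

def shewhart_detect(distances, window=20, k=3.0):
    """Flag abrupt environment changes with a Shewhart control chart.
    distances: 1-D array of per-frame distances to the most probable state.
    A frame t is an alarm when |d_t - mean| > k * std, with mean and std
    computed over the previous `window` frames."""
    alarms = []
    for t in range(window, len(distances)):
        ref = distances[t - window:t]
        mu, sigma = ref.mean(), ref.std()
        if abs(distances[t] - mu) > k * max(sigma, 1e-8):
            alarms.append(t)
    return alarms
```

On a distance trace that jumps from a stable level to a much higher one, the first alarm lands at the jump frame, which is where the bias would be reset to the closest previously observed environment.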
References

[1] A. Sankar and C.-H. Lee, "A Maximum Likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Transactions on Speech and Audio Processing, pp. 190–202, 1996.
[2] V. Barreaud, I. Illina, and D. Fohr, "On-Line Frame-Synchronous Compensation of Non-Stationary Noise," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Apr. 2003.
[3] V. Barreaud, I. Illina, and D. Fohr, "On-Line Frame-Synchronous Noise Compensation," in Proceedings of the 15th International Congress of Phonetic Sciences, Aug. 2003.
[4] V. Barreaud, I. Illina, D. Fohr, and F. Korkmazsky, "Structural State-Based Frame Synchronous Compensation," in Proceedings of the European Conference on Speech Communication and Technology, Sept. 2003.
[5] V. Barreaud, I. Illina, and D. Fohr, "On-Line Compensation for Non-Stationary Noise," in Proceedings of the Automatic Speech Recognition and Understanding Workshop, Nov. 2003.
[6] V. Barreaud, I. Illina, D. Fohr, and V. Colotte, "Compensation en milieu abruptement," in Proceedings of the Journées d'Études sur la Parole, Apr. 2004.