Emulating Speech-To-Text Reliability with ITU Difficulty Scores


Posted on December 16, 2011

The P.800 Difficult Percentage (or Difficulty Score) is a method recommended by the International Telecommunication Union’s Telecommunication Standardization Sector (ITU-T) for testing transmission quality in one’s own laboratory. We adopted this method in our feasibility study to enable Freedom Fone for emergency data exchange. The project studied the design challenges of exchanging Freedom Fone interactive voice data with the Sahana Disaster Management System. This entailed taking situational reports supplied in spoken form by Sarvodaya Community Emergency Response Team (CERT) members and transforming them to text. Getting good-quality, noise-free voice recordings was difficult, so applying any kind of speech-to-text software to automatically transform the audio to text was impossible, not in the lab but in a realistic setting.

There are two paradigms for implementing speech-to-text software: 1) speaker-dependent and 2) speaker-independent. In 1) the users first have to train the system (also known as a voice recognition system), typically by training artificial neural network software that learns to react to known patterns of words and pronunciations for a particular human voice. In 2) the software does not require any training and anyone can speak. It may be restricted to a domain with a limited vocabulary to remove uncertainties that may arise relative to a fully open system.

Figure 1

To emulate the speaker-dependent and speaker-independent scenarios, the research devised two methods incorporating the difficult percentage as the underlying measure.

Figure 1 shows results from a formal survey. The survey required the Freedom Fone users (n=51) to record the answer to a question by choosing from a list of possible answers (a multiple-choice questionnaire). In this case, the voice quality Evaluators (m=3) had a sense of the answers and could predict them even if there was some distortion. Thus, it mimics a trained system. For example, for a question that asked “what was the hazard event type”, the possible answers would be: cyclone, tsunami, floods, landslide.

The results in Figure 2 were obtained with Freedom Fone users (n=41) submitting field observation reports pertaining to any incident of their choice. In this exercise, there were no predetermined answers to select from, and users were free to supply any pertinent information. The Evaluators (m=7) did not have prior knowledge of the possible answers; thus, it mimics an untrained system, with the uncertainties that may arise. For example, if the voice recording was heavily distorted and only the first letter of the word could be heard, then “[ts]unami” could be mistaken for “[c]yclone”.

Figure 2

The results show that with a speaker-dependent system 95% of the information could be clearly deciphered, as opposed to a speaker-independent system, where only 70% was clear (blue areas in Figure 1 and Figure 2). This is not surprising; the outcomes are intuitive. In our study, reliability had two components: efficiency and voice quality. The voice quality also took into consideration the Mean Opinion Score and the Comparison Category Rating. The researchers wish to acknowledge that there may be disagreement about the sample sizes and the number of Evaluators. These results are not ideal for drawing a “for-all” kind of conclusion. However, at this early stage of the research, the method provides a quick and easy way to draw initial conclusions.
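For readers who want to reproduce this kind of tally, here is a minimal sketch of how a difficulty percentage might be computed. It assumes each Evaluator gives a simple yes/no judgement on whether a recording was difficult to understand. The function name and the ratings below are hypothetical and are not the study's data.

```python
from typing import Iterable


def difficulty_percentage(had_difficulty: Iterable[bool]) -> float:
    """Return the percentage of ratings that reported difficulty.

    Each item is one Evaluator's yes/no judgement for one recording
    (True = the recording was difficult to understand).
    """
    votes = list(had_difficulty)
    if not votes:
        raise ValueError("at least one rating is required")
    # True counts as 1 and False as 0, so sum() gives the number of "yes" votes.
    return 100.0 * sum(votes) / len(votes)


# Hypothetical ratings for illustration only: 2 of 10 judged difficult.
ratings = [False, False, True, False, False, False, False, True, False, False]

difficult = difficulty_percentage(ratings)
clear = 100.0 - difficult
print(f"difficult: {difficult:.1f}%, clear: {clear:.1f}%")
```

The "clear" share is simply the complement of the difficulty percentage. Under this assumption, that complement corresponds to the share of information that could be clearly deciphered.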
