March 2013
ARTICLES
SPEECH RECOGNITION TECHNOLOGY
Begum Sacak

Speech Recognition Technology

Research in speech recognition technology has been carried out for almost six decades. Attempts to devise a speech recognition program started in the 1950s, and one of the first concrete efforts was made at RCA Laboratories in 1956, where Olson and Belar tried to recognize 10 distinct syllables spoken by a single speaker. Similarly, Fry (1959) and Denes (1959) at University College London built a phoneme recognizer that used statistical information about allowable sequences of phonemes in English to improve overall phoneme accuracy in words consisting of two or more phonemes. Also in 1959, a vowel recognizer developed at MIT Lincoln Laboratories recognized 10 vowels embedded in consonant environments in a speaker-independent way, that is, without being tuned to a particular speaker.

The 1960s brought further improvements in the area, mostly from Japanese researchers. Suzuki and Nakata (1961) designed a hardware vowel recognizer in which the channels were connected to vowel decision circuits so that the device could apply decision logic to speech recognition. Sakai and Doshita (1962) and Nagata, Kato, and Chiba (1963) made similar attempts.

One of the most influential findings of this period, however, came from RCA Laboratories, where researchers developed time normalization methods that rely on determining where speech starts and ends. This reduced the variability of recognition scores (Martin, Nelson, & Zadell, 1964). Also in the 1960s, Reddy (1966) pioneered research on the dynamic tracking of phonemes, a line of work that was later taken up at Carnegie Mellon University.

In the 1970s, applications and extensions of speech recognition programs were widely investigated by Russian, Japanese, and U.S. scholars. In addition, Rabiner, Levinson, Rosenberg, and Wilpon (1979) worked toward speaker-independent recognition systems by applying clustering algorithms.

In the 1980s, research shifted toward statistical modeling methods, especially the hidden Markov model (Ferguson, 1980), and by the mid-1980s this modeling technique was widely applied. In this era, the emphasis was on large vocabulary, continuous speech recognition systems. Much of this work has been carried out under the Defense Advanced Research Projects Agency (DARPA), which has since investigated how to achieve high word accuracy on a continuous speech recognition, database management task (Lee, Rabiner, Pieraccini, & Wilpon, 1990).

Use of Speech Recognition Programs in CALL Environments

In learning a new language, being able to produce the language has become as important as comprehending its content and rules. This need for production in the target language has created a need for instructional materials that aim to enhance learners' productive abilities. In this respect, computer-assisted language learning (CALL) methods and applications offer alternative ways to support or replace the traditional student-teacher interaction. However, the technologies designed to support language learning are not widely used by language instructors.

One reason is that there is no single, accepted theoretical framework for CALL systems. Another is the lack of conclusive scientific evidence for the benefits of computers in language learning. Although the technology is not yet sophisticated enough to reproduce the full social and communicative abilities of a human, there are still various ways of applying computer technology to language learning, and these offer certain potential advantages in language learning environments.

Among these benefits, CALL technologies offer learners abundant practice time as well as a stress-free environment. With automatic speech recognition (ASR), oral tasks can be practiced more easily because the computer itself recognizes the learner's speech and gives feedback, so the learner feels much more involved and, in contrast to traditional methods, can enjoy the activities more.

One thing ASR programs achieve is decoding speech signals and categorizing them according to phonetic and syntactic models. In this respect, ASR is useful because learners can carry out interactive dialogues with the computer. Furthermore, they can answer multiple-choice questions in a speech-enabled environment (Neri, Cucchiarini, & Strik, 2003). The dialogues can take the form of closed-response or open-response designs. The closed-response design is based on given information, such as charts or graphical information, so the learner has to infer the answer and give a spoken response that sticks to that information. Most of the time, such activities take place in goal-driven tasks or games. These activities also provide a form of positive feedback because learners can only make progress by uttering the desired spoken responses.

In the open-response design, the response is left up to the learner, and the system does not provide help. It might include stimulus-response or simulated real-life conversation tasks. In stimulus-response tasks, it is possible to practice basic items such as word meanings or equivalents (Waters, 1995). In simulated real-life conversations, the aim is to maintain a continuous flow of communication. This design poses some problems: the system has to give the learner feedback on his or her preceding utterance, so all kinds of possible errors would have to be included in the system, and the main difficulty is that learner utterances may be either valid or grammatically incorrect. An alternative design that rarely rejects the learner's errors may lead to wrong answers from the system (due to misrecognition of the stimulus), so the flow of communication may be interrupted (Ehsani & Knodt, 1998).
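The closed-response design described above can be illustrated with a minimal sketch. It assumes the ASR component has already produced a text transcript of the learner's utterance; the function names, the sample item, and the feedback messages are hypothetical and are not taken from any of the cited systems.

```python
import re
from typing import Set, Tuple


def normalize(utterance: str) -> str:
    """Lowercase and drop punctuation so small transcript differences do not matter."""
    return re.sub(r"[^a-z0-9 ]", "", utterance.lower()).strip()


def closed_response_step(transcript: str, expected: Set[str]) -> Tuple[bool, str]:
    """Return (advance, feedback) for one turn of a closed-response task.

    The learner only makes progress when the transcript matches one of the
    responses that the task designer anticipated.
    """
    accepted = {normalize(answer) for answer in expected}
    if normalize(transcript) in accepted:
        return True, "Correct. Moving on to the next item."
    return False, "Please try again using the information in the chart."


# Hypothetical item: a timetable chart shows a train departing at 9:30.
expected_answers = {"The train leaves at nine thirty", "It leaves at nine thirty"}
advance, feedback = closed_response_step("It leaves at nine thirty.", expected_answers)
print(advance, feedback)
```

Because the set of acceptable answers is fixed in advance, the recognizer only has to choose among a few alternatives, which is one reason closed-response tasks are easier to support than open-response conversation.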

Automatic speech recognition programs are also useful in giving the learner immediate feedback on his or her output. In pronunciation training especially, corrective feedback is of great importance, and it should not rely on the learner's own judgments because these might be misleading. The program can analyze the acoustic as well as the temporal properties of the speech. ASR programs can give either segmental or suprasegmental feedback. For segmental feedback, the speech processing part of the program both recognizes the learner's speech and processes it in order to give the correct feedback. The speech recognition program uses a model of native-like speech in the target language as a reference. The errors may be further analyzed by human speech experts, although this is not always the case. By determining how close the learner's performance is to this reference, the program can evaluate the learner's pronunciation (Neri et al., 2003). Ehsani and Knodt (1998) also mention another speech processing technique in which language learners' typical pronunciation errors are included in the processing system so that, when processing takes place, the system can easily detect the wrong pronunciation and offer the correct one for that error.

In suprasegmental feedback, the intonation and stress patterns of the target language, and how effectively they are used, are important. Pitch contours make it possible to present these patterns visually (Ehsani & Knodt, 1998). It has been found that including contours with the audio feedback is more effective in language learning (de Bot, 1983; James, 1976). In addition to contours, speech waveforms, spectrum information, a graphical display of the speaker's face, or vocal tract information can also be given (Ehsani & Knodt, 1998).
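The idea of comparing a learner's output with a native-like reference, and of building anticipated errors into the system, can be sketched as follows. This is only an illustration under stated assumptions: it presumes the recognizer has already produced a sequence of phoneme labels (ARPAbet-style labels are used here as an assumption), the table of known errors and its advice texts are hypothetical, and the symbolic alignment from Python's standard difflib module stands in for the acoustic scoring that a real system would perform.

```python
from difflib import SequenceMatcher

# Hypothetical table of anticipated learner substitutions:
# (expected phoneme, produced phoneme) -> corrective advice.
KNOWN_ERRORS = {
    ("TH", "S"): "Place the tongue between the teeth for 'th'.",
    ("IH", "IY"): "Keep the vowel short and relaxed.",
}


def segmental_feedback(reference, produced):
    """Align two phoneme lists and return (closeness, feedback messages).

    closeness is difflib's similarity ratio between 0.0 and 1.0; it is a
    rough symbolic stand-in for a pronunciation score.
    """
    matcher = SequenceMatcher(a=reference, b=produced, autojunk=False)
    messages = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # this stretch matches the reference
        expected, actual = reference[i1:i2], produced[j1:j2]
        if tag == "replace" and len(expected) == len(actual):
            # One-to-one substitutions: look each one up in the error table.
            for exp, act in zip(expected, actual):
                advice = KNOWN_ERRORS.get((exp, act), "Listen to the model and try again.")
                messages.append(f"Expected {exp}, heard {act}. {advice}")
        else:
            # Insertions, deletions, or uneven substitutions: report them as-is.
            messages.append(f"Expected {expected}, heard {actual}.")
    return matcher.ratio(), messages


# The word "think" (TH IH NG K) pronounced with an initial S.
score, notes = segmental_feedback(["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"])
print(score, notes)
```

Listing anticipated substitutions explicitly lets the feedback name the specific error rather than merely reporting that the utterance differed from the reference.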

ASR programs can also be useful in detecting pronunciation errors because they can tell learners which points they need to improve. This is crucial for second language learning because it makes learners aware of exactly where they commit errors and what to do next. Moreover, the system can suggest how to improve these points so that, with the help of explicit information, learners can focus on potential errors and prevent them from recurring. However, the feedback should clearly address the learner's error and be easy to understand.

Conclusion

Speech recognition programs can be very useful in CALL environments, especially for speaking and pronunciation practice. They can be used in a variety of engaging activities, including dialogues, pronunciation training, and open-response and closed-response speaking tasks for multiple purposes. Although the technology has some shortcomings, it can be used in classroom environments as a tool that enhances communication skills and replaces part of the teacher-student interaction, which lowers the learner's stress level. In addition, these programs might help with learners' speech-related errors because most speech processing systems give the learner immediate corrective feedback. In the future, speech processing systems can be expected to become much more sophisticated, to give more accurate feedback, and to maintain a healthy flow of communication, because they will have a much more human-like linguistic competence with which learners can practice speaking the target language in simulations of real-life situations.

References

de Bot, K. (1983). Visual feedback of intonation: Effectiveness and induced practice behavior. Language and Speech, 26, 331–350.

Denes, P. (1959). The design and operation of the mechanical speech recognizer at University College London. Journal of the British Institution of Radio Engineers, 19, 219–229.

Ehsani, F., & Knodt, E. (1998). Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm. Language Learning & Technology, 2(1), 45–60.

Ferguson, J. (Ed.). (1980). Hidden Markov models for speech. Princeton, NJ: IDA.

Fry, D. B. (1959). Theoretical aspects of mechanical speech recognition. Journal of the British Institution of Radio Engineers, 19, 211–218.

James, E. (1976). The acquisition of prosodic features of speech using a speech visualizer. International Review of Applied Linguistics, 14, 227–243.

Lee, C. H., Rabiner, L. R., Pieraccini, R., & Wilpon, J. G. (1990). Acoustic modeling for large vocabulary speech recognition. Computer Speech and Language, 4, 127–165.

Martin, T. B., Nelson, A. L., & Zadell, H. J. (1964). Speech recognition by feature abstraction techniques (Tech. Report AL-TDR-64-176). Air Force Avionics Lab.

Nagata, K., Kato, Y., & Chiba, S. (1963). Spoken digit recognizer for Japanese language. NEC Research Development, 6.

Neri, A., Cucchiarini, C., & Strik, H. (2003, August). Automatic speech recognition for second language learning: How and why it actually works. International Congress of Phonetic Sciences (pp. 1157–1160).

Olson, H. F., & Belar, H. (1956). Phonetic typewriter. Journal of the Acoustical Society of America, 28, 1072–1081.

Rabiner, L., & Juang, B. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: PTR Prentice-Hall.

Rabiner, L. R., Levinson, S. E., Rosenberg, A. E., & Wilpon, J. G. (1979). Speaker independent recognition of isolated words using clustering techniques. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27, 336–349.

Reddy, D. R. (1966). An approach to computer speech recognition by direct analysis of the speech wave (Tech. Report No. C549). Stanford, CA: Stanford University, Computer Science Department.

Sakai, T., & Doshita, S. (1962). The phonetic typewriter, information processing. Proceedings of the International Federation for Information Processing Congress, Munich.

Suzuki, J., & Nakata, K. (1961). Recognition of Japanese vowels: Preliminary to the recognition of speech. Journal of Radio Research Laboratory, 37(8), 193–212.

Waters, R. (1995). The audio interactive tutor. Computer Assisted Language Learning, 8, 325–354.


Begum Sacak is a Turkish MA graduate student at Ohio University. Her main areas of interest are psycholinguistics and language acquisition.