Speech Recognition Technology
Research in speech recognition technology has been carried out
for almost six decades. Attempts to devise a speech recognition program
began in the 1950s, and one of the first concrete efforts was made at
RCA Laboratories in 1956, where Olson and Belar tried to recognize 10
single syllables spoken by a single speaker. Similarly, Fry (1959) and
Denes (1959) at University College London built a phoneme recognizer
that used statistical information about allowable phoneme sequences in
English to improve overall phoneme accuracy in words consisting of two
or more phonemes. Also in 1959, a vowel recognizer was developed at MIT
Lincoln Laboratories; it recognized 10 vowels embedded in consonant
environments in a speaker-independent way, that is, without being tuned
to a particular speaker.
The 1960s brought further advances in the area, mostly by
Japanese researchers. Suzuki and Nakata (1961) designed a hardware vowel
recognizer in which the channels of a filter-bank spectrum analyzer were
connected to vowel decision circuits so that the device could apply
logic in speech recognition. Sakai and Doshita (1962) and Nagata, Kato,
and Chiba (1963) made similar attempts.
One of the most influential findings in this area of research,
however, was made at RCA Laboratories. Researchers there developed time
normalization methods based on detecting where speech starts and ends,
which reduced the variability of recognition scores (Martin, Nelson, &
Zadell, 1964). Also in the 1960s, Reddy (1966) pioneered research on the
dynamic tracking of phonemes, a line of work that was later taken up at
Carnegie Mellon University.
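The idea of locating where speech starts and ends can be illustrated
with a minimal Python sketch. It is not the RCA method, whose details
are not given here; it only assumes that frames containing speech carry
more short-time energy than silence. The function names, the frame
length, and the threshold are hypothetical choices made for
illustration.

```python
def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def find_endpoints(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of the speech region, or None.

    A frame counts as speech when its energy exceeds the threshold
    (an assumed, hand-picked value). The end index is exclusive.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Toy signal: silence, a short burst of "speech", silence.
signal = [0.0] * 8 + [0.5, -0.5, 0.4, -0.4, 0.3, -0.3, 0.5, -0.5] + [0.0] * 8
print(find_endpoints(signal))  # (8, 16)
```

Trimming every utterance to its detected endpoints before comparison is
one simple way in which recognition scores could become less variable.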
In the 1970s, applications and extensions of speech recognition
programs were widely investigated by Russian, Japanese, and U.S.
scholars. Among them, Rabiner, Levinson, Rosenberg, and Wilpon (1979)
worked toward speaker-independent recognition systems by applying
clustering algorithms.
In the 1980s, research shifted toward statistical modeling
methods, especially the hidden Markov model (Ferguson, 1980), which was
widely applied by the mid-1980s. In this era, the emphasis was on large
vocabulary, continuous speech recognition systems. Much of this work
was carried out under the Defense Advanced Research Projects Agency,
whose program aimed to achieve high word accuracy in continuous speech
recognition on a database management task (Lee, Rabiner, Pieraccini, &
Wilpon, 1990).
Use of Speech Recognition Programs in CALL Environments
In learning a new language, being able to produce the language
has become as important as comprehending its content and rules. This
need for production has in turn created a need for instructional
materials that aim to enhance productive abilities in the target
language. In this respect, computer-assisted language learning (CALL)
methods and applications offer alternative ways to support or replace
traditional student-teacher interaction. However, the technologies
designed to support language learning are not widely used by language
instructors.
One reason is that there is no single, accepted theoretical
framework for CALL systems. Another is that scientific evidence of the
benefits of computers in language learning is lacking. Although the
technology is not yet sophisticated enough to reproduce the full social
and communicative abilities of a human, there are still various ways of
applying computer technology to language learning, and these offer
certain potential advantages in language learning environments.
Among these benefits, CALL technologies offer abundant practice
time for the learner as well as a stress-free environment. With
automatic speech recognition (ASR), oral tasks can be practiced more
easily because the computer itself recognizes the learner's speech and
gives feedback, so the learner feels more involved and, in contrast to
traditional methods, may enjoy the activities more.
ASR programs analyze speech signals and map them onto phonetic
and syntactic models. In this respect, ASR is useful because learners
can carry out interactive dialogues with the computer. They can also
answer multiple-choice questions in a speech-enabled environment (Neri,
Cucchiarini, & Strik, 2003). The dialogues can follow closed-response or
open-response designs. A closed-response design is based on given
information (e.g., charts or other graphical information), so the
learner must infer the answer and give a spoken response that sticks to
that information. Such activities usually take place in goal-driven
tasks or games. They also provide a form of positive feedback because
learners can make progress only by uttering the desired spoken
responses. In an open-response design, the response is left up to the
learner, and the system does not provide help. This design might include
stimulus-response tasks or simulated real-life conversations. In
stimulus-response tasks, learners can practice basic items such as word
meanings or equivalents (Waters, 1995). In simulated real-life
conversations, the aim is to maintain a continuous flow of
communication. This design poses problems because the system has to give
the learner feedback on his or her preceding utterance; to do so, it
would have to anticipate every possible error, and some unexpected
utterances are valid while others are grammatically incorrect. An
alternative design that rarely rejects the learner's errors may produce
wrong answers from the system (due to misrecognition of the stimulus),
which can interrupt the flow of communication (Ehsani & Knodt, 1998).
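The closed-response logic described above, in which the learner
advances only by uttering one of the desired responses, can be sketched
roughly as follows. The sketch assumes that a recognizer has already
turned the learner's speech into text; the function names, the example
sentences, and the acceptance threshold of 0.8 are hypothetical.
Python's standard difflib module stands in for whatever matching a real
system would use.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def best_match(hypothesis, expected_responses):
    """Return (similarity, response) for the closest expected response."""
    hyp = normalize(hypothesis)
    scored = [
        (SequenceMatcher(None, hyp, normalize(response)).ratio(), response)
        for response in expected_responses
    ]
    return max(scored)


def closed_response_step(hypothesis, expected_responses, accept_threshold=0.8):
    """Accept the utterance only if it is close enough to a desired response.

    accept_threshold is an assumed tuning value, not taken from any
    cited system.
    """
    score, response = best_match(hypothesis, expected_responses)
    if score >= accept_threshold:
        return ("accept", response)
    return ("reject", None)


# Hypothetical task: the learner reads a departure time off a timetable.
expected = ["the train leaves at nine", "the train leaves at ten"]
print(closed_response_step("The train leaves at NINE", expected))
print(closed_response_step("i like pizza", expected))
```

Lowering accept_threshold makes the system reject less often, at the
cost of accepting more misrecognized utterances, which is the trade-off
noted above.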
Automatic speech recognition programs are also useful for giving
the learner immediate feedback on his or her output. Corrective feedback
is especially important in pronunciation training, and it should not
rely on the learner's own judgments, which might be misleading. The
program can analyze both the acoustic and the temporal properties of
the speech. ASR programs can give either segmental or suprasegmental
feedback. For segmental feedback, the speech processing component both
recognizes the learner's speech and processes it in order to give the
appropriate feedback. The speech recognition program uses a model of
native-like speech in the target language as a reference. The errors may
be further analyzed by human speech experts, although this is not always
the case. By determining how close the learner's performance is to this
reference, the program can give feedback that helps raise the learner's
success rate (Neri et al., 2003). Ehsani and Knodt (1998) also mention
another speech processing technique in which language learners' typical
pronunciation errors are included in the processing system so that,
during processing, the system can readily detect a mispronunciation and
offer the correct pronunciation. In suprasegmental feedback, what
matters is the intonation and stress patterns of the target language and
how effectively the learner uses them. These patterns can be presented
visually as intonation contours (Ehsani & Knodt, 1998). Including
contours alongside the audio feedback has been found to be more
effective in language learning (de Bot, 1983; James, 1976). In addition
to contours, speech waveforms, spectrum information, a graphical display
of the speaker's face, or vocal tract information can also be displayed
(Ehsani & Knodt, 1998).
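As a rough illustration of segmental feedback, the following Python
sketch compares a learner's phoneme sequence with a native-like
reference and lists the differences. It assumes that a recognizer has
already produced phoneme labels; the labels, the function name, and the
use of difflib for alignment are assumptions made for illustration, not
the method of any system cited here.

```python
from difflib import SequenceMatcher


def segmental_feedback(reference, learner):
    """List the differences between a reference and a learner phoneme sequence.

    Each item is (kind, expected_phonemes, produced_phonemes), where kind
    is "substitution", "omission", or "addition".
    """
    feedback = []
    matcher = SequenceMatcher(None, reference, learner, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            feedback.append(("substitution", reference[i1:i2], learner[j1:j2]))
        elif tag == "delete":
            feedback.append(("omission", reference[i1:i2], []))
        elif tag == "insert":
            feedback.append(("addition", [], learner[j1:j2]))
    return feedback


# Hypothetical labels: the learner says "sink" where "think" was expected.
reference = ["TH", "IH", "NG", "K"]
learner = ["S", "IH", "NG", "K"]
print(segmental_feedback(reference, learner))
```

A system that also stores learners' typical errors, as in the technique
Ehsani and Knodt (1998) mention, could look up each reported
substitution and attach a targeted hint to it.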
The program can also be useful for detecting pronunciation
errors, because such systems can identify the points learners need to
improve and report them. This is crucial for second language learning
because it makes learners aware of exactly where they make errors and
what to do next. Moreover, the system can suggest how to improve these
points, so that learners, given explicit information, can focus on
potential errors and prevent them in the future. However, the feedback
should clearly address the learner's error and should be easy to
understand.
Conclusion
Speech recognition programs can be very useful in CALL
environments, especially for speaking and pronunciation practice. They
can be used in a variety of engaging activities, including dialogues,
pronunciation training, and closed-response and open-response speaking
tasks for multiple purposes. Although the technology has some
shortcomings, it can be used in classroom environments as a tool that
enhances communication skills and replaces part of the teacher-student
interaction, which lowers the learner's stress levels. In addition,
these programs might help with learners' speech-related errors because
most speech processing systems give the learner immediate corrective
feedback. In the future, speech processing systems can be expected to
become more sophisticated, to give more accurate feedback, and to
maintain a smooth flow of communication, because they will have a more
human-like linguistic competence with which learners can practice
speaking the target language in simulations of real-life situations.
References
de Bot, K. (1983). Visual feedback of intonation: Effectiveness and
induced practice behavior. Language and Speech, 26, 331–350.
Denes, P. (1959). The design and operation of the mechanical speech
recognizer at University College London. Journal of the British
Institution of Radio Engineers, 19, 219–229.
Ehsani, F., & Knodt, E. (1998). Speech technology in
computer-aided language learning: Strengths and limitations of a new
CALL paradigm. Language Learning & Technology, 2(1), 45–60.
Ferguson, J. (Ed.). (1980). Hidden Markov models for speech. Princeton, NJ: IDA.
Fry, D. B. (1959). Theoretical aspects of mechanical speech
recognition. Journal of the British Institution of Radio Engineers, 19,
211–218.
James, E. (1976). The acquisition of prosodic features of speech using a
speech visualizer. International Review of Applied Linguistics, 14,
227–243.
Lee, C. H., Rabiner, L. R., Pieraccini, R., & Wilpon, J. G.
(1990). Acoustic modeling for large vocabulary speech
recognition. Computer Speech and Language, 4, 127–165.
Martin, T. B., Nelson, A. L., & Zadell, H. J. (1964). Speech
recognition by feature abstraction techniques (Tech. Report
AL-TDR-64-176). Air Force Avionics Lab.
Nagata, K., Kato, Y., & Chiba, S. (1963). Spoken digit recognizer
for Japanese language. NEC Research Development, 6.
Neri, A., Cucchiarini, C., & Strik, H. (2003, August). Automatic
speech recognition for second language learning: How and why it
actually works. International Congress of Phonetic Sciences (pp.
1157–1160).
Olson, H. F., & Belar, H. (1956). Phonetic typewriter. Journal
of the Acoustical Society of America, 28, 1072–1081.
Sakai, T., & Doshita, S. (1962). The phonetic typewriter,
information processing. Proc. International Federation for Information
Processing Congress, Munich.
Suzuki, J., & Nakata, K. (1961). Recognition of Japanese
vowels—Preliminary to the recognition of speech. Journal of Radio
Research Laboratory, 37(8), 193–212.
Rabiner, L., & Juang, B. (1993). Fundamentals of speech
recognition. Englewood Cliffs, NJ: PTR Prentice-Hall.
Rabiner, L. R., Levinson, S. E., Rosenberg, A. E., & Wilpon, J.
G. (1979). Speaker independent recognition of isolated words using
clustering techniques. IEEE Transactions on Acoustics, Speech, and
Signal Processing, 27, 336–349.
Reddy, D. R. (1966). An approach to computer speech recognition by
direct analysis of the speech wave (Tech. Report No. C549). Stanford,
CA: Stanford University, Computer Science Department.
Waters, R. (1995). The audio interactive tutor. Computer Assisted Language Learning, 8, 325–354.
Begum Sacak is a Turkish MA graduate student at Ohio
University. Her main interest areas are psycholinguistics and language
acquisition.