Speech Recognition

Victor Zue, Ron Cole, & Wayne Ward

MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

1 Defining the Problem
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in a later section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in the figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment: a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.
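To make the finite-state case concrete, the sketch below encodes a tiny, invented word network as an adjacency map (the vocabulary and transitions are illustrative only, not taken from the article) and checks whether a word sequence follows a permissible path.

    # A toy finite-state word network: for each word, the permissible
    # following words are listed explicitly. All entries are invented
    # for illustration.
    network = {
        "<s>": ["show", "list"],
        "show": ["all", "the"],
        "list": ["all", "the"],
        "all": ["ships"],
        "the": ["ships"],
        "ships": ["</s>"],
    }

    def accepts(words):
        """Return True if the word sequence is a permissible path from <s> to </s>."""
        state = "<s>"
        for word in words:
            if word not in network.get(state, []):
                return False
            state = word
        return "</s>" in network.get(state, [])

    print(accepts("show all ships".split()))  # True
    print(accepts("ships show all".split()))  # False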
One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see a later section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and placement of the microphone.
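Taking that loose definition literally, perplexity can be computed as the inverse geometric mean of the probabilities a language model assigns to the words of a test sequence. The sketch below uses toy probabilities (not figures from the article) purely to illustrate the arithmetic.

    import math

    def perplexity(word_probabilities):
        """Inverse geometric mean of the per-word probabilities assigned by
        a language model to a test sequence."""
        log_sum = sum(math.log(p) for p in word_probabilities)
        return math.exp(-log_sum / len(word_probabilities))

    # A model that always allows exactly 11 equally likely words (roughly
    # the digit task mentioned later) has perplexity 11.
    print(perplexity([1.0 / 11] * 8))  # ~11.0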
Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in American English. At word boundaries, contextual variations can be quite dramatic, making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.
The figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10--20 msec (see later sections for signal representation and, in section 11.3, digital signal processing). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.
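A minimal sketch of such a fixed-rate front end is given below, assuming an 8 kHz signal, a 10 msec frame shift, a 25 msec analysis window, and a single crude log-energy measurement per frame; these choices are illustrative, and real systems compute much richer features.

    import math

    def frame_features(samples, sample_rate=8000, shift_ms=10, window_ms=25):
        """Cut a digitized signal into overlapping frames at a fixed rate and
        compute one crude feature (log energy) per frame."""
        shift = int(sample_rate * shift_ms / 1000)
        window = int(sample_rate * window_ms / 1000)
        features = []
        for start in range(0, len(samples) - window + 1, shift):
            frame = samples[start:start + window]
            energy = sum(s * s for s in frame) / len(frame)
            features.append(math.log(energy + 1e-10))
        return features

    # Half a second of a synthetic 440 Hz tone as stand-in input.
    signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(4000)]
    print(len(frame_features(signal)))  # one frame every 10 msec of signal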
Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see a later section). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is known as context-dependent acoustic modeling.
Word-level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.
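The sketch below illustrates the pronunciation-network idea with two hypothetical lexicon entries; the phoneme strings are invented for illustration and are not drawn from any actual system lexicon.

    # Hypothetical pronunciation network: each word lists alternate phoneme
    # paths, and the search algorithm may match any one of them.
    pronunciations = {
        "tomato": [
            ["t", "ah", "m", "ey", "t", "ow"],
            ["t", "ah", "m", "aa", "t", "ow"],
        ],
        "and": [
            ["ae", "n", "d"],
            ["ah", "n"],  # reduced form common in fluent speech
        ],
    }

    def matches(word, phonemes):
        """True if the observed phoneme sequence follows one of the allowed paths."""
        return phonemes in pronunciations.get(word, [])

    print(matches("and", ["ah", "n"]))  # True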
The dominant recognition paradigm of the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in later sections, including section 11.2. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems (see section 11.5).
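The doubly stochastic idea can be illustrated with a deliberately tiny model: hidden states generate observations probabilistically, and a Viterbi search recovers the most likely hidden state sequence. The two states, two observation symbols, and all probabilities below are invented for illustration and bear no relation to a real acoustic model.

    # Toy HMM: two hidden states, two observation symbols, made-up numbers.
    states = ["S1", "S2"]
    start = {"S1": 0.6, "S2": 0.4}
    trans = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
    emit = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

    def viterbi(observations):
        """Return (probability, state sequence) of the most likely hidden path."""
        # best[s] = (probability of the best path ending in state s, that path)
        best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
        for obs in observations[1:]:
            new_best = {}
            for s in states:
                prob, path = max(
                    (best[prev][0] * trans[prev][s] * emit[s][obs], best[prev][1])
                    for prev in states
                )
                new_best[s] = (prob, path + [s])
            best = new_best
        return max(best.values())

    probability, path = viterbi(["a", "a", "b", "b"])
    print(path, probability)  # ['S1', 'S1', 'S2', 'S2'] and its path probability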
An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternative approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.

2 State of the Art
Comments about the state of the art need to be made in the context of specific applications, which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.
Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, and ATIS; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency to spur technology development among its contractors, they have nevertheless gained worldwide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.
Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13, respectively).
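One product of those evaluation standards is the now-standard word error rate metric. The sketch below is a generic edit-distance implementation (not tied to any particular evaluation suite) that scores a hypothesis against a reference transcription.

    def word_error_rate(reference, hypothesis):
        """(substitutions + insertions + deletions) / number of reference words,
        computed with the usual edit-distance dynamic program."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words and the
        # first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,               # deletion
                              d[i][j - 1] + 1,               # insertion
                              d[i - 1][j - 1] + substitution)
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("there is a gas shortage", "there is a gash shortage"))  # 0.2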
Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large-scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware---a feat unimaginable only a few years ago.
One of the most popular, and potentially most useful, tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best-known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent word error rate on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.
High-perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.
With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch-tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with those numbers. AT&T, on the other hand, has installed a call routing system that uses speaker-independent word-spotting technology to detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.
At present, several very-large-vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain, such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited-vocabulary, speaker-independent, continuous dictation capability is realized.

3 Future Directions
In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology and the infrastructure needed to support the work. The key research challenges are summarized below. For speech recognition, research in the following areas was identified:

Robustness:

In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.
Portability:

Portability refers to the goal of rapidly designing, developing, and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time consuming and expensive.

Adaptation:

How can systems continuously adapt to changing conditions (new speakers, microphones, tasks, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.
Language Modeling:

Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps by incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.
Confidence Measures:

Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, just that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.
Out-of-Vocabulary Words:

Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.
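A minimal sketch of the vocabulary check this implies is shown below: words outside the system vocabulary are mapped to an explicit unknown-word token instead of being silently forced onto some in-vocabulary word. The vocabulary and the token name are illustrative assumptions.

    # Illustrative system vocabulary and an explicit unknown-word token.
    vocabulary = {"show", "all", "flights", "from", "boston", "to", "denver"}
    UNK = "<unk>"

    def tag_out_of_vocabulary(words):
        """Replace any word that is not in the system vocabulary with <unk>,
        so that downstream components can detect it rather than mis-recognize
        it as an in-vocabulary word."""
        return [w.lower() if w.lower() in vocabulary else UNK for w in words]

    print(tag_out_of_vocabulary("show all flights from Boston to Narita".split()))
    # ['show', 'all', 'flights', 'from', 'boston', 'to', '<unk>']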
Spontaneous Speech:

Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions, and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.
64、></p><p> Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and the user's intentions
65、(e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.</p><p&g
Modeling Dynamics:

Systems assume a sequence of input frames which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.
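One common, partial response to this problem is to append difference ("delta") features that describe how each measurement changes between neighbouring frames. The sketch below is a generic illustration of that idea, with made-up numbers, and is not a description of any specific system discussed in the article.

    def add_delta_features(frames):
        """Append to each frame its first difference with the previous frame,
        so the feature vector reflects local dynamics as well as the static
        measurements."""
        extended = []
        for i, frame in enumerate(frames):
            previous = frames[i - 1] if i > 0 else frame
            delta = [a - b for a, b in zip(frame, previous)]
            extended.append(frame + delta)
        return extended

    # Three toy frames with two static measurements each.
    print(add_delta_features([[1.0, 2.0], [1.5, 2.5], [2.5, 2.0]]))
    # [[1.0, 2.0, 0.0, 0.0], [1.5, 2.5, 0.5, 0.5], [2.5, 2.0, 1.0, -0.5]]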