
Graduation Project (Thesis) Foreign Literature Translation

School: College of Automation Engineering
Major: □ Automation  □ Measurement and Control Technology and Instruments
Name:

Student ID:
Attachments: 1. Translated text of the foreign literature; 2. Original text.

Attachment 1: Translated text

Improved speech recognition method for intelligent robot

2. Overview of speech recognition

Speech recognition has recently received more and more attention because of its important theoretical significance and practical value. Up to now, most speech recognition has been based on conventional linear system theory, such as the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW). With deeper study it has been found that the speech signal is a complex nonlinear process; if speech recognition research is to achieve a breakthrough, nonlinear system theory must be introduced. Recently, with the development of nonlinear system theories such as artificial neural networks (ANN), chaos and fractals, applying these theories to speech recognition has become possible. The study in this paper is therefore based on ANN, with chaos and fractal theory introduced into the speech recognition process.

Speech recognition can be divided into speaker-dependent and speaker-independent modes. In the speaker-dependent mode the pronunciation model is trained by a single person: the trainer's commands are recognized quickly, while other people's commands are recognized slowly or not at all. In the speaker-independent mode the pronunciation model is trained by people of different ages, sexes and regions, so the system can recognize the commands of a whole group of speakers. Generally, speaker-independent systems are more widely used because users need no training procedure. In speaker-independent systems, extracting speech features from the speech signal is therefore a fundamental problem.

Speech recognition, which comprises training and recognition, can be regarded as a pattern recognition task. Generally, the speech signal can be viewed as a time sequence characterized by a hidden Markov model. Through feature extraction the speech signal is transformed into feature vectors that serve as observations. In the training procedure these observations are fed into the estimation of the HMM model parameters, which include the probability density functions of the observations in their corresponding states, the transition probabilities between states, and so on. After parameter estimation the trained model can be applied to the recognition task: the input signal is recognized as the corresponding word, and the accuracy can be evaluated. The whole process is shown in Fig. 1.

Fig. 1 Block diagram of the speech recognition system

3. Theory and method

Extracting speaker-independent features from the speech signal is a fundamental problem of speech recognition systems. The most popular methods for solving it use Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). Both are linear procedures based on the assumption that speaker characteristics arise from vocal-tract resonances; the resulting features form the basic spectral structure of the speech signal. The nonlinear information in the speech signal, however, is not easily captured by these linear feature-extraction methods, so we use the fractal dimension to measure nonlinear speech turbulence.

This paper investigates and implements a speech recognition system using both traditional LPCC and nonlinear multiscale fractal dimension feature extraction.

3.1 Linear Predictive Cepstral Coefficients

The linear prediction coefficients (LPC) are a parameter set obtained from linear prediction analysis of speech; they describe correlation characteristics between adjacent speech samples. Linear prediction analysis rests on the following basic concept: a speech sample can be approximated by a linear combination of several past samples. By the least-squares principle, applied to the difference between the real speech samples in a given (short-time) analysis frame and the predicted samples, a unique set of prediction coefficients is determined.

The LPC can be used to estimate the cepstrum of the speech signal, a special processing method in short-time cepstral analysis. The system function of the vocal-tract model is obtained by linear prediction analysis as follows (equation (1); the equation images were not reproduced in the source), where p is the linear prediction order, a_k (k = 1, 2, ..., p) are the prediction coefficients, and h(n) is the impulse response. Suppose the cepstrum of h(n) is ĥ(n); then (1) can be expanded as (2). Substituting (1) into (2) and differentiating both sides, (2) becomes (3).
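For reference, the standard textbook forms of these relations, which equations (1)-(5) appear to correspond to, are:

```latex
% Standard LPC/LPCC relations (textbook forms; the source's own equation
% images (1)-(5) were not reproduced and may differ in detail).
H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}
\qquad \text{(all-pole vocal-tract model)}

\hat{H}(z) = \log H(z) = \sum_{n=1}^{\infty} \hat{h}(n)\, z^{-n}
\qquad \text{(cepstrum of the impulse response)}

% Differentiating both sides and equating coefficients of z^{-n}
% yields the recursion that converts the LPC set a_k into LPCC:
\hat{h}(1) = a_1, \qquad
\hat{h}(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\,\hat{h}(k)\,a_{n-k}
\quad (1 < n \le p), \qquad
\hat{h}(n) = \sum_{k=n-p}^{n-1} \frac{k}{n}\,\hat{h}(k)\,a_{n-k}
\quad (n > p).
```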

From this, equation (4) is obtained, and ĥ(n) can then be computed from the prediction coefficients. The cepstral coefficients calculated by (5) are called LPCC, and n denotes the LPCC order.

Before extracting the LPCC parameters, the speech signal must undergo pre-emphasis, frame segmentation, windowing, endpoint detection, etc. The endpoint detection of the Chinese command word "Forward" is shown in Fig. 2, and the speech waveform of the word together with the LPCC parameter waveform after endpoint detection is shown in Fig. 3.

Fig. 2 Endpoint detection of the Chinese command word "Forward"

Fig. 3 Speech waveform of the Chinese command word "Forward" and the LPCC parameter waveform after endpoint detection

3.2 Speech fractal dimension computation

The fractal dimension is a quantitative value derived from the scaling relation that defines a fractal, and a measure of the self-similarity of its structure; the fractal measure is the fractal dimension [6-7].

From the viewpoint of measurement, the fractal dimension extends from integers to fractions, breaking the limitation that the dimension of a general topological set must be an integer; fractional dimension is largely an extension of dimension in Euclidean geometry.

There are many definitions of fractal dimension, e.g. the similarity dimension, Hausdorff dimension, information dimension, correlation dimension, capacity dimension and box-counting dimension. Among them the Hausdorff dimension is the oldest and the most important; it is defined as in [3] (equation not reproduced in the source), where M_ε(F) denotes the number of units of size ε needed to cover the subset F.

After endpoint detection, the speech waveform of the Chinese command word "Forward" and its fractal dimension waveform are shown in Fig. 4.

Fig. 4 Speech waveform of the Chinese command word "Forward" and the fractal dimension waveform after endpoint detection

3.3 Improved feature extraction method

Considering the respective advantages of LPCC and the fractal dimension in representing the speech signal, we mix the two in feature extraction: the fractal dimension characterizes the self-similarity, periodicity and randomness of the speech waveform in time, while the LPCC features provide good speech quality and a high recognition rate.

Because of the obvious advantages of artificial neural networks, namely nonlinearity, self-adaptivity and a strong self-learning ability, their good classification and input-output mapping capabilities make them well suited to speech recognition.

Since the number of input nodes of an ANN is fixed, the feature parameters are time-normalized before being fed to the network [9]. In our experiments the LPCC and the fractal dimension of each sample pass separately through the time-normalization network: the LPCC is 4 frames of data (LPCC1, LPCC2, LPCC3, LPCC4, each 14-dimensional) and the fractal dimension is normalized to 12 frames of data (FD1, FD2, ..., FD12, each one-dimensional), so that the feature vector of each sample has 4*14 + 12*1 = 68 dimensions, with the first 56 dimensions being LPCC and the remaining 12 the fractal dimension. Such a feature vector represents both the linear and the nonlinear characteristics of the speech signal.
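The 68-D layout just described can be sketched directly. The following is a minimal illustration of the concatenation step only (the time-normalization network itself is not reproduced; the function name is my own):

```python
import numpy as np

def assemble_feature_vector(lpcc_frames, fd_frames):
    """Concatenate 4 frames of 14-D LPCC with 12 frames of 1-D fractal
    dimension into the 68-D ANN input described in the paper:
    [LPCC1..LPCC4 | FD1..FD12] -> 4*14 + 12*1 = 68 dimensions,
    with the first 56 entries being LPCC and the last 12 fractal dimension."""
    lpcc_frames = np.asarray(lpcc_frames, dtype=float)
    fd_frames = np.asarray(fd_frames, dtype=float)
    assert lpcc_frames.shape == (4, 14), "expect 4 frames of 14-D LPCC"
    assert fd_frames.shape == (12,), "expect 12 scalar fractal dimensions"
    return np.concatenate([lpcc_frames.ravel(), fd_frames])  # shape (68,)
```

Keeping the LPCC block first preserves the ordering the paper specifies, so a downstream network can slice the linear and nonlinear parts apart again if needed.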

Architectures and features of ASR

Automatic speech recognition (ASR) is a cutting-edge technology that allows a computer, or even a hand-held PDA (Myers, 2000), to identify words read aloud or spoken into any sound-recording device. The ultimate goal of ASR technology is 100% accuracy for all words intelligibly spoken by any person, regardless of vocabulary size, background noise or speaker variables (CSLU, 2002). However, most ASR engineers admit that the current accuracy level for a large vocabulary unit of speech is still below 90%. For example, Dragon's Naturally Speaking and IBM's products report baseline recognition accuracy of only 60% to 80%, depending on accent, background noise and speaking style (Ehsani & Knodt, 1998). More expensive systems that surpass these two include Subarashii (Bernstein, et al., 1999), EduSpeak (Franco, et al., 2001), Phonepass (Hinks, 2001), the ISLE Project (Menzel, et al., 2001) and RAD (CSLU, 2003). The accuracy of speech recognition can be expected to improve.

Among the several types of speech recognizers used in ASR products, the Hidden Markov Model (HMM) is considered the dominant algorithm and has proven the most effective for handling large units of speech (Ehsani & Knodt, 1998). A detailed account of how HMMs work is beyond the scope of this paper but can be found in any text on language processing; among the best are Jurafsky & Martin (2000) and Hosom, Cole, and Fanty (2003). In short, an HMM computes the probability that the incoming signal matches a database containing hundreds of recordings of native phonemes (Hinks, 2003, p. 5). That is, an HMM-based recognizer computes how closely the phonemes of an input utterance approach the corresponding probabilistic model. A high score means good pronunciation, a low score poor pronunciation (Larocca, et al., 1991).

Although speech recognition has commonly been used for purposes such as business dictation and special-needs accessibility, its share of the language-learning market has grown dramatically in recent years (Aist, 1999; Eskenazi, 1999; Hinks, 2003). Early ASR-based software programs adopted template-based recognition systems that perform pattern matching using dynamic programming or other time-normalization techniques (Dalby & Kewley-Port, 1999). These programs include Talk to Me (Auralog, 1995), the Tell Me More Series (Auralog, 2000), Triple-Play Plus (Mackey & Choi, 1998), New Dynamic English (DynEd, 1997), English Discoveries (Edusoft, 1998), and See it, Hear It, SAY IT! (CPI, 1997). Most of these programs give no feedback on pronunciation accuracy beyond indicating the written dialogue choice that most closely pattern-matches the user's utterance; learners are not told how accurate their pronunciation was. In particular, Neri (2002) comments on programs such as Talk to Me and Tell Me More…

A visual signal lets learners compare their intonation with that of the model speaker.

Learner pronunciation accuracy is typically given as a score out of 7 (higher is better).

Words whose pronunciation is distorted are identified and clearly flagged.

Attachment 2: Original text (photocopy)

Improved speech recognition method for intelligent robot

2. Overview of speech recognition

Speech recognition has received more and more attention recently due to its important theoretical meaning and practical value [5]. Up to now, most speech recognition is based on conventional linear system theory, such as the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW). With the deep study of speech recognition, it has been found that the speech signal is a complex nonlinear process. If the study of speech recognition is to break through, nonlinear-system theory methods must be introduced. Recently, with the development of nonlinear-system theories such as artificial neural networks (ANN), chaos and fractals, it has become possible to apply these theories to speech recognition. Therefore, the study of this paper is based on ANN, and chaos and fractal theories are introduced to process speech recognition.

Speech recognition is divided into two modes, speaker dependent and speaker independent. Speaker dependent refers to a pronunciation model trained by a single person; the identification rate for the training person's orders is high, while other people's orders are recognized at a low rate or not at all. Speaker independent refers to a pronunciation model trained by persons of different age, sex and region, so it can identify the orders of a group of persons. Generally, speaker independent systems are more widely applied, since users need no training procedure.

Speech recognition can be viewed as a pattern recognition task, which includes training and recognition. Generally, the speech signal can be viewed as a time sequence and characterized by the powerful hidden Markov model (HMM). Through feature extraction, the speech signal is transferred into feature vectors which act as observations. In the training procedure, these observations are used to estimate the model parameters of the HMM. These parameters include the probability density functions for the observations in their corresponding states, the transition probabilities between states, and so on. After parameter estimation, the trained model can be applied to the recognition task.

Fig. 1 Block diagram of speech recognition system

3. Theory and method

Extraction of speaker independent features from the speech signal is the fundamental problem of speaker recognition systems.

The standard methodology for solving this problem uses Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). Both methods are linear procedures based on the assumption that speaker features have properties caused by the vocal tract resonances. These features form the basic spectral structure of the speech signal. However, the nonlinear information in speech signals is not easily extracted by current feature-extraction methods, so we use the fractal dimension to measure nonlinear speech turbulence.

This paper investigates and implements a speaker identification system using both traditional LPCC and nonlinear multiscaled fractal dimension feature extraction.

3.1 Linear Predictive Cepstral Coefficients

The linear prediction coefficient (LPC) is a parameter set obtained when we do linear prediction analysis of speech; it describes correlation characteristics between adjacent speech samples. Linear prediction analysis is based on the following basic concept: a speech sample can be estimated approximately by a linear combination of some past speech samples. According to the principle of minimizing the squared sum of the differences between the real speech samples in a certain (short-time) analysis frame and the predicted samples, a unique set of prediction coefficients is determined.

The LPC coefficients can be used to estimate the speech signal cepstrum. This is a special processing method in the analysis of the short-time cepstrum of speech. The system function of the channel model is obtained by linear prediction analysis as follows.

Where p represents the linear prediction order, a_k (k = 1, 2, ..., p) represent the prediction coefficients, and the impulse response is represented by h(n). Suppose the cepstrum of h(n) is ĥ(n); then (1) can be expanded as (2).

The cepstrum coefficients calculated in the way of (5) are called LPCC; n represents the LPCC order.
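As a concrete sketch of this procedure, the following minimal NumPy implementation performs pre-emphasis, Hamming-window framing, Levinson-Durbin LPC analysis and the standard LPCC recursion. It is an illustration under stated assumptions, not the authors' code: the frame length of 256 samples with 50% overlap is my choice, and order p = 14 is taken from the paper's 14-D frames.

```python
import numpy as np

def preemphasis(x, alpha=0.95):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frames(x, size=256, step=128):
    """Split the signal into overlapping Hamming-windowed frames."""
    n = 1 + max(0, (len(x) - size) // step)
    w = np.hamming(size)
    return np.stack([x[i * step:i * step + size] * w for i in range(n)])

def lpc(frame, p=14):
    """Levinson-Durbin recursion on the frame autocorrelation r[0..p]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
    a = np.zeros(p)
    e = r[0]                                   # prediction error energy
    for i in range(p):
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / e  # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]     # update lower-order coeffs
        a = a_new
        e *= (1.0 - k * k)
    return a                                    # prediction coefficients a_1..a_p

def lpcc(a, n_c=14):
    """Cepstral recursion c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_c)
    for n in range(1, n_c + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - 1 - k]
        c[n - 1] = acc
    return c
```

The recursion in `lpcc` is the standard one obtained by differentiating log H(z), so the LPCC follow directly from the LPC set with no explicit FFT.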

Before we extract the LPCC parameters, we should carry out pre-emphasis, framing, windowing and endpoint detection on the speech signal, etc. The endpoint detection of the Chinese command word "Forward" is shown in Fig. 2; the speech waveform of the Chinese command word "Forward" and the LPCC parameter waveform after endpoint detection are shown in Fig. 3.

3.2 Speech Fractal Dimension Computation

Fractal dimension is a quantitative value derived from the scaling relation in the sense of fractals, and also a measure of the self-similarity of a structure; the fractal measure is the fractal dimension [6-7]. From the viewpoint of measurement, fractal dimension is extended from integer to fraction, breaking the limit of the general topological set dimension being an integer. Fractal dimension, mostly fractional, is an extension of dimension in Euclidean geometry.

There are many definitions of fractal dimension, e.g., similarity dimension, Hausdorff dimension, information dimension, correlation dimension, capacity dimension, box-counting dimension, etc. Among them, the Hausdorff dimension is the oldest and also the most important; for any set, it is defined as in [3].

Where M_ε(F) denotes how many units of size ε are needed to cover the subset F.

In this paper, the box-counting dimension (DB) of F is obtained by partitioning the plane with square grids of side ε; with N(ε) the number of squares that intersect the set, it is defined as in [8].

The speech waveform of the Chinese command word "Forward" and the fractal dimension waveform after endpoint detection are shown in Fig. 4.

3.3 Improved feature extraction method

Considering the respective advantages of LPCC and fractal dimension in expressing the speech signal, we mix both into the feature signal; that is, the fractal dimension denotes the self-similarity, periodicity and randomness of the speech time waveform, while the LPCC feature is good for speech quality and a high identification rate.

Due to the ANN's obvious advantages of nonlinearity, self-adaptability, robustness and self-learning, its good classification and input-output mapping abilities are suitable for resolving the speech recognition problem.

Since the number of ANN input nodes is fixed, time regularization is carried out on the feature parameters before they are input to the neural network [9]. In our experiments, the LPCC and fractal dimension of each sample need to pass separately through the time regularization network. The LPCC is 4-frame data (LPCC1, LPCC2, LPCC3, LPCC4; each frame parameter is 14-D), and the fractal dimension is regularized to 12-frame data (FD1, FD2, ..., FD12; each frame parameter is 1-D), so that the feature vector of each sample is 4*14 + 12*1 = 68-D, with the first 56 dimensions being LPCC and the remaining 12 being fractal dimension.

Architectures and Features of ASR

ASR is a cutting-edge technology that allows a computer or even a hand-held PDA (Myers, 2000) to identify words that are read aloud or spoken into any sound-recording device. The ultimate purpose of ASR technology is to allow 100% accuracy with all words that are intelligibly spoken by any person regardless of vocabulary size, background noise, or speaker variables (CSLU, 2002). However, most ASR engineers admit that the current accuracy level for a large vocabulary unit of speech (e.g., the sentence) is still below 90%.

Among several types of speech recognizers used in ASR products, both implemented and proposed, the Hidden Markov Model (HMM) is one of the most dominant algorithms and has proven to be an effective method of dealing with large units of speech (Ehsani & Knodt, 1998). Detailed descriptions of how the HMM works go beyond the scope of this paper and can be found in any text concerned with language processing; among the best are Jurafsky & Martin (2000) and Hosom, Cole, and Fanty (2003). Put simply, the HMM computes the probability of a match between the input signal and a database containing hundreds of recordings of native phonemes (Hinks, 2003, p. 5).

While ASR has been commonly used for such purposes as business dictation and special-needs accessibility, its market presence for language learning has increased dramatically in recent years (Aist, 1999; Eskenazi, 1999; Hinks, 2003). Early ASR-based software programs adopted template-based recognition systems which perform pattern matching using dynamic programming or other time normalization techniques (Dalby & Kewley-Port, 1999). These programs include Talk to Me (Auralog, 1995), the Tell Me More Series (Auralog, 2000), Triple-Play Plus (Mackey & Choi, 1998), New Dynamic English (DynEd, 1997), English Discoveries (Edusoft, 1998), and See it, Hear It, SAY IT! (CPI, 1997).

A visual signal allows learners to compare their intonation to that of the model speaker.
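The box-counting dimension D_B described in section 3.2 can be estimated numerically. The sketch below is a hypothetical illustration rather than the paper's implementation (the grid sizes and the normalization to the unit square are my assumptions): it counts the grid boxes of side ε that contain at least one point of a planar set, such as the graph (n, x[n]) of a speech frame, and fits the slope of log N(ε) against log(1/ε).

```python
import numpy as np

def box_counting_dimension(points, sizes):
    """Estimate D_B = lim log N(eps) / log(1/eps) for a set of 2-D points:
    count occupied grid boxes at each box size eps, then fit a line
    through the points (log 1/eps, log N(eps)); its slope is D_B."""
    pts = np.asarray(points, dtype=float)
    pts = (pts - pts.min(axis=0)) / np.ptp(pts, axis=0)   # normalize to unit square
    counts = []
    for eps in sizes:
        cells = np.floor(pts / eps).astype(int)           # grid cell of each point
        counts.append(len({tuple(c) for c in cells}))     # occupied boxes N(eps)
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope
```

For a smooth curve such as a straight line the estimate approaches 1; the graph of a speech waveform typically yields a fractional value between 1 and 2, which is what serves as the nonlinear feature.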
