Foreign Literature Translation: Web-based Possibilistic Language Models for Automatic Speech Recognition

Chinese text: 3,230 characters
Foreign literature translation
Source: Computer Speech and Language (2014)

Translated text:

Web-based possibilistic language models for automatic speech recognition

This paper describes a new kind of language model based on possibility theory. The purpose of these new models is to make better use of the data available on the Web for language modeling. The models aim to integrate information about impossible word sequences. We address the two main problems raised by this kind of model: how to estimate the measures for word sequences and how to integrate such a model into an automatic speech recognition (ASR) system.

We propose a word-sequence possibilistic measure and a practical estimation method based on word-sequence statistics, which is particularly well suited to estimation from Web data. For using these models in a classical ASR engine, which relies on a probabilistic model of the speech recognition process, we propose several strategies and formulations. This work is evaluated on two typical usage scenarios: broadcast news transcription, for which very large training sets are available, and, in a specialized domain, the transcription of medical videos with only very limited training data.

The results show that on the specialized-domain task the possibilistic models provide significantly lower word error rates, where classical n-gram models fail because of the lack of training material. On broadcast news, the probabilistic models remain better than the possibilistic ones. However, a log-linear combination of the two kinds of models outperforms each model used alone, which indicates that the possibilistic models bring information that the probabilistic models do not capture.

1. Introduction

State-of-the-art large-vocabulary continuous speech recognition (LVCSR) systems are based on n-gram language models estimated on text collections composed of billions of words. These models have proven their efficiency in a wide range of applications, but their accuracy depends on the availability of huge and relevant training corpora, which cannot be guaranteed for low-resource languages or for specific domains.

One of the most popular approaches for dealing with this lack of training data consists in collecting text material on the Internet and estimating n-gram statistics on these automatically collected datasets. The approach benefits from two interesting characteristics of the Internet: large coverage and continuous updating.

Coverage relies on the fact that the Web may be viewed as a close-to-infinite corpus in which most linguistic realizations can be found. The Internet provides a linguistic coverage much larger than the text collections usually involved in LM training.

Updating is provided by users who continuously add documents containing new words and new idiomatic forms. This last point has been widely exploited for various aspects of statistical language modeling, typically for new-word discovery, n-gram model adaptation, and the scoring of unseen n-grams.

However, technical issues related to the size and instability of the Internet's contents limit the exploitation of this large coverage and updating for statistical language modeling. The standard approach would be to regularly collect all the data available on the Internet and to estimate n-gram models on the resulting corpus. Such a technique is clearly unfeasible; some authors have proposed solutions intended to make the estimation of huge LMs feasible: Guthrie and Hepple (2010) tackled the problem of reducing the memory occupied by sparse n-gram models; fast smoothing techniques were proposed in Brants et al. (2007); and approaches based on distributed data storage and processing were published in Ghemawat et al. (2003), Chang et al. (2006), and elsewhere. In the end, even though software and hardware technologies keep progressing, training an up-to-date LM on the whole Web content remains a challenging problem.
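The distributed storage-and-processing line of work cited above rests on counting n-grams with a map-and-reduce pattern. As a rough, single-process sketch of that pattern only, not of the systems described in Ghemawat et al. or Chang et al., in Python:

    from collections import Counter
    from itertools import chain

    def map_ngrams(document, n=3):
        # Map step: emit a (ngram, 1) pair for every n-gram of one document.
        words = document.split()
        return [(" ".join(words[i:i + n]), 1) for i in range(len(words) - n + 1)]

    def reduce_counts(pairs):
        # Reduce step: sum the counts emitted for each distinct n-gram key.
        counts = Counter()
        for key, value in pairs:
            counts[key] += value
        return counts

    # In a real distributed setting the map step runs on many document shards
    # in parallel and the reduce step on partitions of the key space; here the
    # per-document outputs are simply chained together.
    documents = ["the cat sat on the mat", "the cat ate"]
    print(reduce_counts(chain.from_iterable(map_ngrams(d) for d in documents)))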

Another issue is related to the distribution of word sequences on the Web. These distributions are poorly reliable because of the diversity of the document sources, the variability of production and usage contexts, and other factors. The distributions are not only unreliable; they may also fail to match a targeted application context, which determines the potential topics, the speech styles, the language level, and so on.

Considering these practical and theoretical limits of using the whole Web, most previous studies extract a relevant and tractable Web subset, which is then used as a classical corpus for estimating n-gram statistics. Such corpora are obtained by automatically querying search engines. The query-composition technique determines the accuracy of the corpus in terms of coverage, language style, etc. Unfortunately, querying is based on prior knowledge, or on the automatic extraction of a domain-related description that may be incomplete or inaccurate. Moreover, independently of the query-composition technique, the retrieved data depend on the search strategies used inside the commercial engines, and these strategies may be fully or partially confidential.

Even though these methods have been applied successfully in various application contexts, some authors have tried to derive further benefit from the specificities of the Web by using dynamic n-gram estimation approaches. In Berger and Miller (1998), a just-in-time adaptation process was proposed, based on an online analysis of the document topic and fast LM updating. In Zhu and Rosenfeld (2001), the authors proposed a back-off technique that estimates the probability of a word sequence by counting the number of Web documents that contain it; this count is the number of hits returned by a search engine queried with the targeted word sequence. That article focused on adapting an LM to a specialized domain, but it introduced the idea of using a search engine for the ad hoc estimation of linguistic scores. We extended this idea in Oger et al. (2009a), where we proposed an efficient way of using Web search-engine hit counts as probabilities in an ASR system. Such a dedicated n-gram estimate provides up-to-date statistics, but it does not address the reliability problem of Web statistics.

To address this problem, we proposed in Oger et al. (2009b) language models that consider the existence of word sequences rather than their frequency of occurrence. These models are based on possibility theory, which provides a theoretical framework for dealing with uncertainty, and we estimate them through Web querying.

Probability-based language models perform well in most situations, especially on high- and medium-frequency events. The estimation of low-frequency event probabilities generally relies on a back-off or smoothing strategy, which leads to less reliable probabilities. The proposed possibilistic language models operate only on these low-frequency events, by measuring their plausibility, something that is not actually measured by the smoothing and back-off techniques usually used to estimate the probabilities of such events. The proposed possibilistic language models therefore do not replace probability-based language models; rather, they complement them in the situations where the latter are unreliable, namely on low-frequency events. Possibilistic language models aim to estimate the plausibility of these low-frequency events, in order to filter them out when the main language model wrongly assigns them a higher probability than it should.

This paper presents an in-depth study of possibilistic language models. We state the motivations and the theoretical foundations of these models, present a method for empirically estimating possibilities, and describe new ways of integrating them into an ASR system. The possibilistic models are compared with, and combined with, classical n-gram models estimated both on the Web and on a classical text corpus.

Experiments are conducted on two tasks: broadcast news transcription, for which large training materials are available, and the transcription of medical videos dedicated to training surgeons. The latter application context corresponds to a very specialized domain with only low resources available.

The rest of this paper is organized as follows. Section 2 provides a step-by-step description of possibilistic Web language models, starting from classical corpus-based probabilistic models. Section 3 presents various strategies for integrating possibilistic language models into a statistical ASR system. Section 4 describes the experimental setup and the comparative experiments that were conducted. Finally, Section 5 concludes and proposes some perspectives.

2. From corpus probabilities to Web possibilities

In this section, we present new approaches for improving language modeling by using a new data source, the Web, and a new theoretical framework, possibility theory. We first describe the classical corpus-based probabilistic language models used in most state-of-the-art speech recognition systems. Then, we introduce a new approach for estimating these probabilities from the Web. Finally, we propose to use concepts from possibility theory to build a new measure that can be estimated on the Web as well as on classical closed corpora: the possibilistic measure.

2.1 Corpus-based probabilities

In the ASR domain, language models are mainly designed for estimating the prior probability P(W) of a word sequence W:

    W = (w_1, w_2, ..., w_n),  w_i ∈ V

This probability may be decomposed into a product of conditional probabilities:

    P(W) = ∏_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})    (1)

This formula assumes that a word w_i can be predicted from the sequence of preceding words. Globally, an n-gram model consists of a collection of conditional probabilities that will be used in the ASR engine for predicting a word given a partially transcribed hypothesis.

As expressed in Eq. (1), a word's probability depends on the whole linguistic history. In practice, such long-term dependencies cannot be estimated, because of their complexity and of the limits of the corpus: the amount of training data required for estimating such long word sequences would be huge, and a direct estimate of high-order n-gram statistics (n > 6) is usually impossible. Therefore, most state-of-the-art ASR systems use only 4-gram or 5-gram models.

Some alternative approaches to linguistic scoring have been proposed to make the estimation of long-sequence probabilities feasible, mainly with neural networks, which offer efficient (but implicit) inference and smoothing mechanisms. Nevertheless, the ideal situation would be to estimate accurate probabilities directly on an exhaustive corpus in which all possible sentences could be found. This would mean that the ASR problem could be viewed as the search for the correct transcript within a closed collection of text documents.
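To make Eq. (1) and the n-gram truncation concrete, here is a minimal Python sketch (an illustration with invented probabilities, not the implementation used in the paper):

    import math

    # Toy trigram table P(w | two previous words); the values are invented.
    TRIGRAM_PROBS = {
        ("<s>", "<s>", "the"): 0.4,
        ("<s>", "the", "cat"): 0.2,
        ("the", "cat", "sat"): 0.1,
    }
    FLOOR = 1e-6  # crude stand-in for a real smoothing/back-off scheme

    def log_prob(sentence, n=3):
        # Chain rule of Eq. (1), truncating each history to the n-1 last words.
        words = ["<s>"] * (n - 1) + sentence.split()
        total = 0.0
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            total += math.log(TRIGRAM_PROBS.get(history + (words[i],), FLOOR))
        return total

    print(log_prob("the cat sat"))  # = log(0.4) + log(0.2) + log(0.1)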

Original text:

Web-based possibilistic language models for automatic speech recognition

This paper describes a new kind of language models based on the possibility theory. The purpose of these new models is to better use the data available on the Web for language modeling. These models aim to integrate information relative to impossible word sequences. We address the two main problems of using this kind of model: how to estimate the measures for word sequences and how to integrate this kind of model into the ASR system.

We propose a word-sequence possibilistic measure and a practical estimation method based on word-sequence statistics, which is particularly suited for estimating from Web data. We develop several strategies and formulations for using these models in a classical automatic speech recognition engine, which relies on a probabilistic modeling of the speech recognition process. This work is evaluated on two typical usage scenarios: broadcast news transcription with very large training sets and transcription of medical videos in a specialized domain with only very limited training data.

The results show that the possibilistic models provide significantly lower word error rates on the specialized domain task, where classical n-gram models fail due to the lack of training materials. For the broadcast news, the probabilistic models remain better than the possibilistic ones. However, a log-linear combination of the two kinds of models outperforms all the models taken individually, which indicates that possibilistic models bring information that is not modeled by probabilistic ones.

Keywords: speech processing; language modeling; possibility theory

1. Introduction

State-of-the-art large vocabulary continuous speech recognition (LVCSR) systems rely on n-gram language models that are estimated on text collections composed of billions of words. These models have demonstrated their efficiency in a wide scope of applications, but their accuracy depends on the availability of huge and relevant training corpora, which may not exist, for instance for low-resource languages or for specific domains.

One of the most popular approaches for dealing with this lack of training data consists in collecting text material on the Internet and estimating classical n-gram statistics on these automatically collected datasets (Kemp and Waibel, 1998; Bulyko et al., 2003). This approach benefits from two interesting characteristics of the Internet: large coverage and continuous updating.

Coverage relies on the fact that the Web may be viewed as a close-to-infinite corpus, where most of the linguistic realizations may be found. The Internet provides a linguistic coverage significantly larger than the text corpora usually involved in LM training (Keller and Lapata, 2003).

Updating is provided by users who continuously add documents containing new words and new idiomatic forms. This last point has been largely exploited for various aspects of statistical language modeling, typically for new-word discovery (Asadi et al., 1990; Bertoldi and Federico, 2001; Allauzen and Gauvain, 2005), n-gram model adaptation, unseen n-gram scoring (Keller and Lapata, 2003), etc.

Nevertheless, exploiting this large coverage and updating for statistical language modeling is limited by technical issues related to the size and instability of the Internet contents. The standard approach would be to regularly collect all data available on the Internet and to estimate n-gram models on the resulting corpus. Such a technique is clearly unfeasible; some authors proposed solutions that are supposed to enable the estimation of huge LMs: Guthrie and Hepple (2010) tackled the problem of reducing the memory occupied by sparse n-gram models; fast smoothing techniques were proposed in Brants et al. (2007); and approaches based on distributed data storage and processing were published in Ghemawat et al. (2003), Chang et al. (2006), etc. Finally, even though software and hardware technologies keep improving, training an up-to-date LM on the whole Web content remains a challenging problem.

Another issue is related to word sequence distribution on the Web. These distributions are poorly reliable due to the diversity of the document sources, the variability of production and usage contexts, etc. Distributions are not only unreliable, but they may also not match a targeted application context, which determines the potential topics, speech styles, language level, etc.

Considering these practical and theoretical limits of using the whole Web, most of the previous studies consisted in extracting relevant and tractable Web subsets, which are used as classical corpora for estimating n-gram statistics. Corpora are obtained by automatically querying search engines (Monroe et al., 2002; Wan and Hain, 2006; Lecorve et al., 2008). The query composing technique determines the corpus accuracy in terms of coverage, language styles, etc. Unfortunately, querying is based on prior knowledge, or on the automatic extraction of a domain-related description that may be incomplete or inaccurate. Moreover, independently of the query composing technique, the retrieved data depend on the search strategies used inside the commercial engines, which may be fully or partially confidential.
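As a hedged illustration of what query composition can look like (one simple recipe, pairing frequent in-domain terms, not the method of the papers cited above):

    from collections import Counter
    from itertools import combinations, islice

    STOPWORDS = {"of", "the", "a", "and", "to"}

    def compose_queries(seed_text, max_queries=10):
        # Pair the most frequent content words of a small in-domain seed
        # text into two-word Web queries.
        words = [w.lower() for w in seed_text.split() if w.lower() not in STOPWORDS]
        top_terms = [w for w, _ in Counter(words).most_common(5)]
        return [" ".join(p) for p in islice(combinations(top_terms, 2), max_queries)]

    seed = "laparoscopic surgery of the gallbladder requires a laparoscopic camera"
    print(compose_queries(seed))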

Even if these methods were successfully applied in various application contexts, some authors tried to get further benefit from the Web specificities by using dynamic approaches of n-gram estimation. In Berger and Miller (1998), a just-in-time adaptation process, based on an on-line analysis of the document topic and fast LM updating, is proposed. In Zhu and Rosenfeld (2001), the authors proposed a back-off technique which estimates a word sequence probability by counting the number of Web documents that contain it, i.e., the number of hits returned by a search engine queried with the targeted word sequence. That paper focused on LM adaptation to a specialized domain, but it introduced the idea of using a search engine for the ad hoc estimation of linguistic scores. We extended this idea in Oger et al. (2009a), where we proposed an efficient way of using Web search-engine hit counts as probabilities in an ASR system. Such a dedicated n-gram estimate provides up-to-date statistics but does not address the problem of the reliability of Web statistics. To tackle this problem, we proposed in Oger et al. (2009b) language models that consider the existence of word sequences rather than their frequencies. These models rely on the possibility theory, which provides a theoretical framework for dealing with uncertainty.
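A minimal sketch of the hit-count idea (our illustration: hits() is a hypothetical stub standing in for a search-engine API, and the back-off penalty is invented):

    def hits(phrase):
        # Hypothetical stub: number of documents matching the exact phrase
        # on a search engine; replaced here by fixed toy values.
        return {"the cat": 4300, "the cat sat": 120}.get(phrase, 0)

    def web_conditional(history, word, backoff=0.4):
        # Estimate P(word | history) from document counts, backing off to
        # shorter histories when counts are missing, in the spirit of
        # Zhu and Rosenfeld (2001).
        penalty = 1.0
        while history:
            denom = hits(" ".join(history))
            num = hits(" ".join(history + [word]))
            if denom > 0 and num > 0:
                return penalty * num / denom
            history = history[1:]  # back off to a shorter history
            penalty *= backoff
        return penalty * 1e-9      # nothing attested at any order

    print(web_conditional(["the", "cat"], "sat"))  # 120 / 4300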

Probability-based language models perform well in most situations, especially on high- and medium-frequency events. The estimation of low-frequency event probabilities generally relies on a back-off or smoothing strategy, which leads to less reliable probabilities. The proposed possibilistic language models only operate on these low-frequency events, by measuring their plausibility, which is not actually measured by the smoothing and back-off techniques used to estimate the probabilities of these events. The proposed models therefore do not replace probability-based language models but complement them in the situations where the latter are unreliable, i.e., on low-frequency events. The possibilistic language models aim to estimate the plausibility of these low-frequency events, in order to filter them out when the main language model wrongly assigns them a higher probability than it should.
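The following sketch illustrates how a plausibility measure of this kind could complement a probabilistic score (a toy construction: the min-based measure, the existence set, the threshold, and the penalty are all invented for the example; the paper's actual measure and integration strategies are presented in the next sections):

    SEEN = {"the cat sat", "cat sat on"}  # toy set of attested trigrams

    def possibility(sentence, n=3):
        # A sentence is as possible as its least possible n-gram; an unseen
        # n-gram keeps a small residual possibility instead of a zero.
        words = sentence.split()
        ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        return min((1.0 if g in SEEN else 0.1) for g in ngrams) if ngrams else 1.0

    def rescore(log_prob_score, sentence, threshold=0.5, penalty=-20.0):
        # Filter step: penalize a hypothesis whose possibility is too low.
        return log_prob_score + (penalty if possibility(sentence) < threshold else 0.0)

    print(possibility("the cat sat on"))    # all trigrams attested -> 1.0
    print(possibility("the cat sat down"))  # "cat sat down" unseen -> 0.1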

This paper presents an in-depth study of possibilistic language models. We will state motivations and theoretical foundations of these models, as well as present a method for empirically estimating possibilities and new ways to integrate them in an ASR system. Possibilistic models are compared and combined with classical n-gram probabilities estimated on both Web and classical text corpora.

Experiments are conducted on two tasks: broadcast news transcription, for which large training materials are available, and transcription of medical videos that are dedicated to training surgeons. The latter application context corresponds to a very specialized domain with only low resources available.

The rest of the paper is organized as follows. The next section proposes a step-by-step description of possibilistic Web models, starting from classical corpus probabilistic models. Section 3 presents various strategies for the integration of possibilistic language models into a statistical ASR system. Section 4 describes the experimental setup and the comparative experiments that were conducted. Finally, Section 5 concludes and proposes some perspectives.

2. From corpus probabilities to Web possibilities

In this section, we present new approaches for improving language modeling by using a new data source, the Web, and a new theoretical framework, the possibility theory. We first describe the classical corpus-based probabilistic language models, as used in most of the state-of-the-art speech recognition systems. Then, we introduce a new approach for estimating these probabilities from the Web. Finally, we propose to use concepts from the possibility theory for building a new measure that can be estimated on the Web as well as on classical closed corpora: the possibilistic measure.

2.1 Corpus-based probabilities

In the ASR domain, language models are mainly designed with the purpose of estimating the prior probability P(W) of a word sequence W:

    W = (w_1, w_2, ..., w_n),  w_i ∈ V

This probability may be decomposed as the product of conditional probabilities:

    P(W) = ∏_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})    (1)

This formula assumes that a word w_i could be predicted only from the preceding word sequence. Globally, n-gram models consist in a collection of conditional probabilities that will be used, in the ASR engine, for the prediction of a word, given a partially transcribed hypothesis.

As expressed in Eq. (1), a word probability depends on the whole linguistic history. In practice, such long-term dependencies cannot be estimated, due to complexity and to the limits of the corpus: the amount of training data required for estimating such long sequences would be huge, and it is usually impossible to perform a direct estimate of n-gram statistics of high order (n > 6). Therefore, most state-of-the-art ASR systems use only 4- or 5-gram models.
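In equation form, this truncation is the usual Markov approximation (our notation, consistent with Eq. (1)):

    P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1}),  with n = 4 or 5 in practice.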

Some alternative approaches for linguistic scoring were proposed to enable the estimation of long-sequence probabilities, mainly with neural networks that offer efficient (but implicit) inference and smoothing mechanisms (Bengio et al., 2006; Mnih and Hinton, 2007). Nevertheless, the ideal situation would be to estimate accurate probabilities directly on an exhaustive corpus, in which all the possible sentences could be found. This would mean that the ASR problem could be viewed as the search for the correct transcript in a closed collection of text documents.
