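&lt;p&gt; Translated foreign literature (Chinese version)&lt;/p&gt;
&lt;p&gt; The Contribution of Parsing to Prosodic Phrasing in an Experimental Text-to-Speech System&lt;/p&gt;
&lt;p&gt;&lt;b&gt; Introduction&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; We describe an experimental text-to-speech system that uses a deterministic parser and prosody rules to generate phrase-level pitch and duration information for English input. This information is used to annotate the input sentence, which is then processed by the text-to-speech programs currently under development at Bell Labs. In constructing the system, our goal has been to test the hypotheses (i) that information available in the syntax tree, in particular grammatical functions such as subject-predicate and head-complement, is by itself useful in determining prosodic phrasing for synthetic speech, and (ii) that a syntactic parser that specifies grammatical functions can be used to determine prosodic phrasing for synthesized speech.&lt;/p&gt;
&lt;p&gt; Although certain connections between syntax and prosody are well known (e.g. the influence of part of speech on stress in words like progress, or the setting off of parenthetical expressions), very little practical knowledge is available on which aspects of syntax might be connected to prosodic phrasing. In many studies, investigators have sought connections between constituent structure and prosody (e.g. Cooper and Paccia-Cooper 1980; Umeda 1982; Gee and Grosjean 1983) but, with the exception of Selkirk (1984), they have tended to ignore the representation of grammatical functions in the syntax tree. Moreover, previous work has not been explicit enough to provide the basis for a full system implementation. Based on our study of prosodic phrasing in recorded human speech, we decided to emphasize three aspects of structure that relate to phrasing: syntactic constituency, grammatical function, and constituent length. The results of this study, which we discuss in detail, have been implemented as a set of prosody rules in an experimental text-to-speech system.&lt;/p&gt;
&lt;p&gt; Two important features characterize our system. First, the input to our prosody system is a parse tree generated by a version of the deterministic parser Fidditch (Hindle 1983). The left-corner search strategy of this parser and, in particular, its determinism give Fidditch the speed that makes online text-to-speech production feasible. In building a parse tree, Fidditch identifies the core subject-verb-object relations but makes no attempt to represent adjunct or modifier relations. Thus relative clauses, adverbials, and other non-argument constituents have no specified position in the tree and no specified semantic role. Second, the rules of the prosody system build a prosody tree by referring both to the syntactic structure and to structure built at earlier stages. The result is a hierarchical representation that supports the view, also proposed in Selkirk (1984), that grammatical function information is related to prosodic phrasing only indirectly, through different levels of processing. Informal tests of the system show that it is capable of producing a significant improvement in the prosodic quality of the resulting synthesized speech. Our investigations of the system's problems, which we describe, have not revealed any serious counterexample to our basic approach. In many cases, it appears that problems with the current version can be resolved by taking our approach a step further and including lexical information required by the parser as another factor in the determination of prosodic phrasing.&lt;/p&gt;
&lt;p&gt;&lt;b&gt; Text-to-Speech&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; Most text-to-speech systems comprise two components: pronunciation rules and a speech synthesizer. Pronunciation rules convert the input text into a phonetic transcription; this information may also be supplemented by a dictionary that provides information about the part of speech, stress pattern and phonetic makeup of particular words. The speech synthesizer then converts this phonetic transcription into a series of speech parameters, which are subsequently processed to produce digitized speech. While these systems tend to perform quite well on word pronunciation, they fall short when it comes to providing good prosody for complete sentences. Current text-to-speech systems have no access to the syntactic and semantic properties of a sentence that influence phrase-level prosody. Hence rules for sentence prosody, when they are provided at all, typically depend on superficial aspects of text (e.g. punctuation) and on heuristics that vary widely in sophistication. Although such techniques often add a more natural quality to the resulting synthetic speech, they can fail in important respects, for example by omitting the prosodic event between a lengthy subject and its predicate, or by failing to mark prominence and prosodic boundaries correctly.&lt;/p&gt;
&lt;p&gt; Several authors (e.g. Allen 1976; Elovitz et al. 1976; Luce et al. 1983) have suggested that prosodic differences between synthetic and natural speech are the primary, unaddressed factor leading to difficulties in the comprehension of fluent synthetic speech. The relation between phrase-level prosody and its sources, however, is so poorly understood that we have no good sense of the degree to which different levels of explanation (syntactic, semantic, or pragmatic) are applicable. We currently have reasonable tools for automatic syntactic analysis of text, but nothing comparably developed for semantic or pragmatic analysis. An obvious goal, therefore, is to explore the extent to which phrase-level prosody can be explained from the syntax tree and to develop a detailed description of this relationship. A further goal is to turn the insights gained from this relationship into a system that can work with a speech synthesizer. This lets us test our description more fully and may also advance text-to-speech technology.&lt;/p&gt;
&lt;p&gt;&lt;b&gt; Syntactic Structure and Prosodic Phrasing&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; Beyond the word level, however, there has been little investigation of systematic connections between syntactic structure and prosodic phrasing. The psycholinguistic and acoustic investigations of Cooper and Paccia-Cooper (1980), Umeda (1982) and Gee and Grosjean (1983) and the prosodic theory of Selkirk (1984) are among the more notable studies and represent the two main approaches to syntax/prosody relations. In Cooper and Paccia-Cooper (1980) and Umeda (1982), the connection from syntax to prosodic phrasing is not mediated by any filtering process: they propose that specific prosodic phrasing can be derived directly from the syntactic structure by associating particular syntactic nodes (or constituent boundaries) with phonetic values, whether a pause, segmental lengthening, or the blocking of cross-word phonological rules. In contrast, Gee and Grosjean (1983) and Selkirk (1984) hold that syntax is related to prosody only indirectly: prosodic phrasing is derived by rules that refer to left-to-right order, length (or branching pattern) and, in Selkirk's case, grammatical function, as well as constituent membership, in order to infer a hierarchical prosodic structure. Although the respective positions are quite clear, none of these studies is conclusive. All of the frameworks lack the detail and formality that would permit extensive testing, and most consider only a small number of sentences and sentence types.&lt;/p&gt;
&lt;p&gt; To develop our analysis, we first examined prosodic phrasing in the speech of one of us reading prose from various texts, including four instruction manuals. These texts were later augmented by a professional reading of a prose story. The boundaries between prosodic phrases were identified and then classed according to their syntactic context and semantic function.&lt;/p&gt;
&lt;p&gt;&lt;b&gt; Text-to-Speech Synthesis&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; The programs that make up the speech component are described in Liberman and Buchsbaum (personal communication). These programs take character text as input and produce digitized speech output. By annotating the input text to this system, many aspects of its operation can be overridden or modified: e.g. the location of major and minor phrase boundaries, the stress given to words, the transcription of words and the boundaries between them, the timing of segments, and details of the pitch contour. As we will show, with our prosody system we are able to produce strings in which four boundary levels are identified and perceptually distinguished, using the current text-to-speech system annotations.&lt;/p&gt;
&lt;p&gt;&lt;b&gt; Prosodic Phrasing&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; The prosody rules use information about constituent structure, grammatical role, and length to map a surface structure tree into a prosody tree, which identifies the location of phrase boundaries (signified by the nodes) and the relative strength of each boundary (signified by a number in the node). It is this information that is used to annotate the input text with escape sequences that provide the text-to-speech system with instructions about prosodic phrasing.&lt;/p&gt;
&lt;p&gt; In formulating our rules for building the prosodic structure, we began with the idea of simply implementing the model of Gee and Grosjean (1983). This model, initially proposed to predict a form of psychological data describing subjective sentence structure known as performance structure, determines prosodic boundaries from a syntactic tree, but assumes rather than explicitly presents a syntactic component.&lt;/p&gt;
&lt;p&gt; We were initially attracted to the Gee and Grosjean model because of its emphasis on relative boundary weighting, i.e., on the determination of the strength of a given boundary with respect to the other boundaries in the sentence. We found that in the data we had collected, this weighting played an important role. In fact, we incorporated directly into our system one method of doing this weighting, namely Gee and Grosjean's rule to determine the strengths of the prosodic boundaries around a verb phrase using relative length (measured as the number of terminal nodes).&lt;/p&gt;
&lt;p&gt; As we extended the Gee and Grosjean model to create an algorithm adequate for use in a general-purpose system, our algorithm diverged from its starting point, reflecting our attempts to correct weaknesses and lacunae that we encountered in the Gee and Grosjean model. That we encountered these problems is not surprising given the difference between our goals and those of Gee and Grosjean.&lt;/p&gt;
&lt;p&gt; The most important difference between the Gee and Grosjean model and our current algorithm involves the factors determining boundary weight. Gee and Grosjean assume that this weight depends only on the number of syntactic nodes, their left-to-right order and, in the case of the verb phrase, constituent length. In contrast, our data, consistent with the theoretical analysis of Selkirk (1984), indicate that boundary strength depends on the grammatical function that a constituent plays in a given sentence. In particular, we observe differences in boundary strength among these functions, as discussed below. Our adjunction rules are derived for the most part from Selkirk's account. We have also made use of the idea, which Gee and Grosjean (1983) take largely from the work of Selkirk, that certain syntactic heads mark off phonological phrase boundaries and provide the basic prosodic constituents for higher-level analysis. Our prosody rules run in four independent stages. Each stage builds on the previous stage, so that the rules can refer to both syntactic and prosodic structure as they build successively higher levels of prosodic structure.&lt;/p&gt;
&lt;p&gt;&lt;b&gt; Conclusions&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; We have described an on-line experimental system that uses prosody rules to infer prosodic phrasing from constituent structure, grammatical functions, and length considerations. The system contains three modules: a deterministic parser, a set of prosodic phrasing rules, and an algorithm to convert the output of the prosodic phrasing rules into signals for the Bell Labs text-to-speech system.&lt;/p&gt;
&lt;p&gt; A Unit Selection-based Speech Synthesis Approach for Chinese Mandarin Text-to-Speech&lt;/p&gt;
&lt;p&gt;&lt;b&gt; Introduction&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; A text-to-speech system converts free text into speech; it reads the text aloud for people. Text-to-speech systems have a wide range of applications.&lt;/p&gt;
&lt;p&gt; A typical text-to-speech system consists of three main parts: text analysis, prosody generation and speech synthesis. The text analysis part interprets the text and determines the pronunciation of each sentence. The prosody generation part generates parameters that control the variability of the speech. The speech synthesis part generates the speech utterance according to the pronunciation and prosody requirements.&lt;/p&gt;
&lt;p&gt; In the past two decades, many approaches have been used to synthesize speech. They fall into two main categories: rule-based formant synthesis and concatenation synthesis. Formant synthesis generates speech using a set of rules, usually derived from a long process of experiments. This approach needs little computer memory, but the speech quality is limited by the approach itself. Concatenation synthesis, in contrast, uses pre-recorded speech units as templates. During synthesis, the units are modified using signal processing techniques and then joined to form an utterance. This approach usually needs more memory, but the speech quality is correspondingly better. As technology develops, however, people are no longer satisfied with the machine-like speech produced by signal processing.&lt;/p&gt;
&lt;p&gt; Normal concatenation synthesis works by keeping a small unit inventory in the system. During synthesis, a unit is selected and then modified using signal processing techniques according to the prosody features. Synthesis in this way can generate speech of relatively high quality; however, the synthetic speech is more or less distorted by the signal processing. A simple way to generate good speech is to store large quantities of segments of human speech in a database and, when generating, concatenate all the needed segments without any modification. Of course, the longer the stored segments selected for concatenation, the more natural the generated speech. As each speech unit may have many variants in different contexts or prosodic situations, this approach needs a large memory to store a large number of speech segments. The approach was not practical some years ago because of limits on computing power and memory. With the development of hardware, using a large speech corpus as synthesis units for direct concatenation has become possible. Unit selection-based speech synthesis (or corpus-based synthesis) has been applied to English and other languages for some years. Some attempts (Liu and Wang, 1998; Chu et al., 2001; Wang et al., 2000; Li et al., 2001) have been made at Chinese TTS using the unit selection approach. Wu et al. (2001) also proposed a scheme that selects the phonetically and linguistically best units and then applies prosodic modifications. However, all the proposed approaches have limitations in the application of proper prosody. Without proper prosody consideration, the quality of the generated speech may sometimes be poor. This paper is concerned with how to apply prosody in unit selection-based&lt;/p&gt;
&lt;p&gt;&lt;b&gt; 2 Unit Selection Model&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; A unit selection model has a well-organized unit database. The database contains the speech units from a large corpus, which is carefully designed to have a large coverage of all phonetic and prosodic variants of each unit. In the database, each speech unit has a number of possible variants, which are suitable to appear in different phonetic and prosodic environments. The large speech corpus is analyzed offline and all the calculated features are stored in a unit database. In the database, each unit instance is described by a feature vector. Each feature may take discrete or continuous values. The features cover both the unit itself and its context. The features of the unit itself are used to select a unit that meets the segmental requirements, while the context features are used to select the best context-matched unit, which may reduce the discontinuity between selected units. Corpus-based synthesis is in effect a concatenative pattern-matching process. During synthesis, the task is to select the units that best match the target units in pronunciation and prosody. At the same time, the discontinuity between selected units should be as small as possible. To meet these requirements, two costs are defined for synthesis. One is the unit cost, which describes how close a selected unit is to the desired unit. The other is the connection cost, which describes the degree of continuity between selected units. The total cost is the weighted sum of the two costs.&lt;/p&gt;
&lt;p&gt;&lt;b&gt; 3 Unit Selection Process&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; The speech synthesis process accepts information from the prosody generation part and searches the speech unit database to find a proper unit for every target speech unit. The unit selection process is illustrated in Figure 1. In the figure, the target sentence is “今天很熱” (it is very hot today), which consists of 4 syllables. Each syllable has a set of candidate units. The thick line and thick-edged boxes indicate the selected unit sequence. In the unit selection process, to get the best speech, we have to consider (1) whether a candidate unit is appropriate compared with the target unit, and (2) the smoothness of the connection between selected units. The selection process therefore seeks the best path among all possible paths through the connection lattice. The search follows a cost function that describes the appropriateness of a unit and the smoothness between two units.&lt;/p&gt;
&lt;p&gt;&lt;b&gt; 4 Corpus&lt;/b&gt;&lt;/p&gt;
&lt;p&gt; As mentioned earlier, a large speech corpus is used in unit selection-based synthesis. The speech corpus consists of a large collection of utterances, from which the synthesis units are extracted. Ideally it covers context-dependent units and prosodic variants as fully as possible. However, it is usually impossible to build a very large speech corpus with complete coverage of unit variants. Because constructing a large, high-quality corpus is very expensive, a balance is usually struck between coverage and size.&lt;/p&gt;
&lt;p&gt; In this research, we built a corpus of around 38,000 syllables. The script of this speech corpus was selected from a large text corpus (around 300M Chinese characters). The corpus is designed to cover the frequently used context-independent and context-dependent syllables as fully as possible. We use the PKU People's Daily text corpus as a reference for real-world text to evaluate the script of the corpus. We calculated that the built corpus covers 99.8% of the syllable occurrences in the PKU corpus.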
When unit contexts are grouped by initial and final classes (we define 11 initial classes and 10 final classes), the corpus covers 76.8% of the context-class unit occurrences in the PKU text corpus. With this coverage, we consider the corpus suitable for unit selection-based synthesis.&lt;/p&gt;
&lt;p&gt; Foreign literature (English original) &lt;/p&gt;
&lt;p&gt; THE CONTRIBUTION OF PARSING TO PROSODIC PHRASING IN AN EXPERIMENTAL TEXT-TO-SPEECH SYSTEM&lt;/p&gt;
&lt;p&gt; INTRODUCTION &lt;/p&gt;
&lt;p&gt; We describe an experimental text-to-speech system that uses a deterministic parser and prosody rules to generate phrase-level pitch and duration information for English input. This information is used to annotate the input sentence, which is then processed by the text-to-speech programs currently under development at Bell Labs. In constructing the system, our goal has been to test the hypotheses (i) that information available in the syntax tree, in particular grammatical functions such as subje&lt;/p&gt;
&lt;p&gt; Although certain connections between syntax and prosody are well-known (e.g. the influence of part of speech on stress in words like progress, or the setting off of parenthetical expressions), very little practical knowledge is available on which aspects of syntax might be connected to prosodic phrasing. In many studies, investigators have sought connections between constituent structure and prosody (e.g. Cooper and Paccia-Cooper 1980; Umeda 1982; Gee and Grosjean 1983) but, with the exception of&lt;/p&gt;
&lt;p&gt; Two important features characterize our system. First, the input to our prosody system is a parse tree generated by a version of the deterministic parser Fidditch (Hindle 1983). The left-corner search strategy of this parser and, in particular, its determinism, give Fidditch the speed that makes online text-to-speech production feasible. In building a parse tree, Fidditch identifies the core subject-verb-object relations but makes no attempt to represent adjunct or modifier relations. Thus rel&lt;/p&gt;
&lt;p&gt; Informal tests of the system show that it is capable of producing a significant improvement in the prosodic quality of the resulting synthesized speech. Our investigations of the system's problems, which we describe, have not revealed any serious counterexample to our basic approach. In many cases, it appears that problems with the current version can be resolved by taking our approach a step further, and including lexical information required by the parser as another factor in the determination &lt;/p&gt;
&lt;p&gt; TEXT-TO-SPEECH &lt;/p&gt;
&lt;p&gt; Most text-to-speech systems comprise two components: pronunciation rules and a speech synthesizer. Pronunciation rules convert the input text into a phonetic transcription; this information may also be supplemented by a dictionary that provides information about the part of speech, stress pattern and phonetic makeup of particular words. The speech synthesizer then converts this phonetic transcription into a series of speech parameters which are subsequently processed to produce digitized speech.&lt;/p&gt;
&lt;p&gt; While these systems tend to perform quite well on word pronunciation, they fall short when it comes to providing good prosody for complete sentences. Current text-to-speech systems have no access to the syntactic and semantic properties of a sentence that influence phrase-level prosody. Hence rules for sentence prosody, when they are provided at all, typically depend on superficial aspects of text (e.g. punctuation) and on heuristics that vary widely in sophistication. Although such techniques of&lt;/p&gt;
&lt;p&gt; Several authors (e.g. Allen 1976; Elovitz et al. 1976; Luce et al. 1983) have suggested that prosodic differences between synthetic and natural speech are the primary, unaddressed factor leading to difficulties in the comprehension of fluent synthetic speech. The relation between phrase-level prosody and its sources, however, is so poorly understood that we have no good sense of the degree to which different levels of explanation--syntactic, semantic, or pragmatic--are applicable. We currently h&lt;/p&gt;
&lt;p&gt; SYNTACTIC STRUCTURE AND PROSODIC PHRASING&lt;/p&gt;
&lt;p&gt; Beyond the word level, however, there has been little investigation of systematic connections between syntactic structure and prosodic phrasing. The psycholinguistic and acoustic investigations of Cooper and Paccia-Cooper (1980), Umeda (1982) and Gee and Grosjean (1983) and the prosodic theory of Selkirk (1984) are among the more notable studies and represent the two main approaches to syntax/prosody relations. In Cooper and Paccia-Cooper (1980) and Umeda (1982), the connection from syntax to pr&lt;/p&gt;
&lt;p&gt;
To develop our analysis, we first examined prosodic phrasing in the speech of one of us reading prose from various texts, including four instruction manuals. These texts were later augmented by a professional reading of a prose story. The boundaries between prosodic phrases were identified and then classed according to their syntactic context and semantic function. &lt;/p&gt;
&lt;p&gt; Text-to-speech Synthesis&lt;/p&gt;
&lt;p&gt; The programs that make up the speech component are described in Liberman and Buchsbaum (personal communication). These programs take character text as input and produce digitized speech output. By annotating the input text to this system, many aspects of its operation can be overridden or modified: e.g. the location of major and minor phrase boundaries, the stress given to words, the transcription of words and the boundaries between them, the timing of segments, and details of the pitch contour. As we will show, with our prosody system we are able to produce strings in which four boundary levels are identified and perceptually distinguished, using the current text-to-speech system annotations. &lt;/p&gt;
&lt;p&gt; Prosodic Phrasing &lt;/p&gt;
&lt;p&gt; The prosody rules use information about constituent structure, grammatical role, and length to map a surface structure tree into a prosody tree. The prosody tree identifies the location of phrase boundaries (signified by the nodes) and the relative strength of each boundary (signified by a number in the node).
It is this information that is used to annotate the input text with escape sequences that provide the text-to-speech system with instructions about prosodic phrasing. &lt;/p&gt;
&lt;p&gt; In formulating our rules for building the prosodic structure, we began with the idea of simply implementing the model of Gee and Grosjean (1983). This model, initially proposed to predict a form of psychological data describing subjective sentence structure known as performance structure, determines prosodic boundaries from a syntactic tree, but assumes rather than explicitly presents a syntactic component.&lt;/p&gt;
&lt;p&gt; We were initially attracted to the Gee and Grosjean model because of its emphasis on relative boundary weighting, i.e., on the determination of the strength of a given boundary with respect to the other boundaries in the sentence. We found that in the data we had collected, this weighting played an important role. In fact, we incorporated directly into our system one method of doing this weighting, namely Gee and Grosjean's rule to determine the strengths of the prosodic phrase boundaries around&lt;/p&gt;
&lt;p&gt; As we extended the Gee and Grosjean model to create an algorithm adequate for use in a general-purpose system, our algorithm diverged from its starting point, reflecting our attempts to correct weaknesses and lacunae that we encountered in the Gee and Grosjean model. That we encountered these problems is not surprising given the difference between our goals and those of Gee and Grosjean.&lt;/p&gt;
&lt;p&gt; The most important difference between the Gee and Grosjean model and our current algorithm involves the factors determining boundary weight. Gee and Grosj&lt;/p&gt;
&lt;p&gt; Our adjunction rules are derived for the most part from Selkirk's account. We have also made use of the idea, which Gee and Grosjean (1983) take largely from the work of Selkirk, that certain syntactic heads mark off phonological phrase boundaries, and provide the basic prosodic constituents for higher level analysis. &lt;/p&gt;
&lt;p&gt; Our prosody rules run in four independent stages. Each stage builds on the previous stage, so that the rules can refer to both syntactic and prosodic structure as they build successively higher levels of prosodic structure.&lt;/p&gt;
&lt;p&gt; CONCLUSIONS &lt;/p&gt;
&lt;p&gt; We have described an on-line experimental system that uses prosody rules to infer prosodic phrasing from constituent structure, grammatical functions, and length considerations. The system contains three modules: a deterministic parser, a set of prosodic phrasing rules, and an algorithm to convert the output of the prosodic phrasing rules into signals for the Bell Labs text-to-speech system.&lt;/p&gt;
&lt;p&gt; A Unit Selection-based Speech Synthesis Approach for Chinese Mandarin Text-to-Speech&lt;/p&gt;
&lt;p&gt; 1 Introduction &lt;/p&gt;
&lt;p&gt; A text-to-speech (TTS) system converts free text into speech, reading the text aloud for people. Such systems have a wide range of applications. &lt;/p&gt;
&lt;p&gt; A typical text-to-speech system consists of three main parts: text analysis, prosody generation and speech synthesis. The text analysis part interprets the text and determines the sound of each sentence. The prosody generation part generates parameters that control the variability of the speech. The speech synthesis part generates the speech utterance based on the pronunciation and prosody requirements.&lt;/p&gt;
&lt;p&gt;
In the past decades, many approaches have been used to synthesize speech. They can be classified into two main categories: rule-based formant synthesis and concatenation synthesis. Formant synthesis generates speech using a set of rules, usually derived from a long process of experiments. This approach needs little computer memory, but the speech quality is limited by the approach itself. Concatenation synthesis, however, uses some pre-recorded speech units as te&lt;/p&gt;
&lt;p&gt; Normal concatenation synthesis works by keeping a small unit inventory in the system. During synthesis, a unit is selected and then modified using signal processing techniques according to the prosody features. Synthesis in this way can generate speech of relatively high quality. However, the synthetic speech is more or less distorted by the signal processing.&lt;/p&gt;
&lt;p&gt; A simple way to generate good speech is to store large quantities of segments of human speech in a database and, when generating, concatenate all the needed segments without any modification. Of course, the longer the stored segments selected for the concatenation, the more natural the generated speech. As each speech unit may have many variants in different contexts or prosodic situations, this approach needs a large memory to store a large number of speech segments. The approach was not practical some years ago because of the limits of computing power and memory. With the development of hardware, the use of a large speech corpus as synthesis units for direct concatenation has become possible. &lt;/p&gt;
&lt;p&gt; Unit selection-based speech synthesis (or corpus-based synthesis) has been applied to English and other languages for some years. Some attempts (Liu and Wang, 1998; Chu et al., 2001; Wang et al., 2000; Li et al., 2001) have been made at Chinese TTS using the unit selection approach. Wu et al. (2001) also proposed a scheme that selects the phonetically and linguistically best units and then applies prosodic modifications. &lt;/p&gt;
&lt;p&gt; However, all the proposed approaches have limitations in the application of proper prosody. Without proper prosody consideration, the quality of the generated speech may sometimes be poor. This paper is concerned with how to apply prosody in unit selection-based synthesis.&lt;/p&gt;
&lt;p&gt; 2 Unit Selection Model &lt;/p&gt;
&lt;p&gt; A unit selection model has a well-organized unit database. The database contains the speech units from a large corpus, which is carefully designed to have a large coverage of all phonetic and prosodic variants of each unit. In the database, each speech unit has a number of possible variants, which are suitable to appear in different phonetic and prosodic environments. The large speech corpus is analyzed offline and all the calculated features are stored in a unit database. In the database, each &lt;/p&gt;
&lt;p&gt; 3 Unit Selection Process &lt;/p&gt;
&lt;p&gt; The speech synthesis process accepts information from the prosody generation part and searches the speech unit database to find a proper unit for every target speech unit. The unit selection process is illustrated in Figure 1. In the figure, the target sentence is “今天很熱 (it is very hot today)”, which consists of 4 syllables. Each syllable has a set of candidate units. The thick line and thick-edged boxes indicate the selected unit sequence. In the unit selection process, to get the best speech, we have t&lt;/p&gt;
&lt;p&gt;&lt;b&gt; 4 Corpus &lt;/b&gt;&lt;/p&gt;
&lt;p&gt; As we have mentioned earlier, a large speech corpus is used in unit selection-based synthesis. The speech corpus consists of a large collection of utterances. The units for synthesis are extracted from the corpus. Ideally it covers context-dependent units and prosodic variants as fully as possible. However, it is usually impossible to build a very large speech corpus that has complete coverage of unit variants. As the cost of constructing a large corpus with high quality is very expens&lt;/p&gt;
&lt;p&gt; In this research, we built a corpus of around 38,000 syllables. The script of this speech corpus was selected from a large text corpus (around 300M Chinese characters). The corpus is designed to cover the frequently used context-independent and context-dependent syllables as fully as possible. We use the PKU People's Daily text corpus as a reference for real-world text to evaluate the script of the corpus. We calculated that the built corpus covers 99.8% of syllable occurrences in the PKU co
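
The coverage figure quoted above can be read as a token-weighted ratio: each syllable occurrence in the reference text counts once, and it is covered if the same syllable appears somewhere in the recording script. The paper does not spell out its procedure, so the Python sketch below is only one plausible reading. The function name `occurrence_coverage`, the pinyin-with-tone-digit encoding, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Set


def occurrence_coverage(reference_units: Iterable[Hashable],
                        covered_units: Set[Hashable]) -> float:
    """Fraction of unit occurrences in the reference text whose unit type
    also appears in the speech corpus script (token-weighted coverage)."""
    counts = Counter(reference_units)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty reference text: define coverage as zero
    covered = sum(n for unit, n in counts.items() if unit in covered_units)
    return covered / total


# Toy data, purely illustrative (hypothetical pinyin syllables with tone digits).
reference = ["jin1", "tian1", "hen3", "re4", "jin1", "tian1", "leng3", "jin1"]
script_units = {"jin1", "tian1", "hen3", "re4"}

# 7 of the 8 reference occurrences are covered; only "leng3" is missing.
print(f"{occurrence_coverage(reference, script_units):.3f}")  # prints 0.875
```

Because the function only requires hashable items, context-dependent units could be passed as tuples in the same way; how contexts are grouped into classes is left open here.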