杨允言IunnUn-gian14.PPT

楊允言 Iunn Un-gian
2008.7.14
台語文特性分析及其處理技術
Written Taiwanese: Its Characteristic Analysis and Processing Techniques

Slide 2: Vita
- 1984-1988: NTU CSIE, undergraduate
- 1990/8-1994/1: Sinica IIS, assistant
- 1991-1993: NTHU IS, graduate
- 1994/2-1996/11: NTU CC, programmer
- 1996: migrated to Hualian

Slide 3: Vita-2
- 1999: Dahan I.T. CSIE, lecturer
- 2003/8 - : assistant prof.
- 2004 - : NTU CSIE PhD program
- Journal: IJCLCLP 12(4)
- Projects: NSC 3, NMTL 1, Academia Historica 1

Slide 4: Outline
- Introduction
- Resources and Survey of Written Taiwanese Processing
- Coding and I/O of POJ
- Tone Sandhi Problem and Algorithm

Slide 5: Outline-2
- Word Segmentation and Tagging Methods
- Corpora Collection and Annotation
- Some Applications of Written Taiwanese Corpora
- Conclusion and Future Work

Slide 6: 1. Introduction
1.1 Background
- Population: 46M (2005)
- Distribution: Taiwan, Singapore, Malaysia, Brunei, China, Thailand, Philippines, Indonesia
- Rank: 21
- Confusing names: Southern-Min? Amoy? Taiwanese?

Slide 7: 1. Introduction-2
1.2 Different Scripts
- Han Characters Script
- Romanization Script (POJ)
- Han-Romanization Mixed Script
- Others: Kana, Phonetic Symbols, Proverb

Slide 8: 1. Introduction-3
1.3 Phonemes of Taiwanese
- Initials (18)
- Vowels (86)
- Tones (7)
- Compared with Mandarin: 2726 vs 1200 legal syllables

Slide 9: 1. Introduction-4
1.4 Some Key Points
- Not yet standardized
- The POJ characters are separated into different zones in the Unicode set
- Need to annotate phonetic markers in corpora
- Interact with Taiwanese group

Slide 10: 1. Introduction-5
1.5 Motivation
- My mother tongue
1.6 Definition and Glossary
1.7 Goal of This Dissertation
1.8 Organization

Slide 11: 2. Resources and Survey
2.1 Resources
- Input method
- Dictionary
- Corpus
- Word segmentation
- Scripts conversion
- Text-to-speech
2.2 Survey

Slide 12: 3. Coding and I/O of POJ
3.1 POJ Character Code
- Unicode encoding
3.2 Two Kinds of POJ Representation
- POJ and numbered POJ

Slide 13: 3. Coding and I/O of POJ-2
3.3 Retrieval of POJ
- Issue: both case-sensitive and case-insensitive
- 2-stage retrieval: execute an SQL command, then filter the results
- Fuzzy retrieval: toneless, glottal stop, checked syllable, vowel
- Examples

Slide 14: 3. Coding and I/O of POJ-3
3.4 Display of POJ
- Strategy: Unicode (with specific fonts) or graphics
- POJ to numbered POJ: lâng → la5ng → lang5
- Numbered POJ to POJ: lang5 → la5ng → lâng
- Priority: o a e u i n m; ou5o5u ou5

Slide 15: 3. Coding and I/O of POJ-4
3.5 Word Processing Utilities for POJ
- Phoneme segmentation: backward direction
- Spelling checker
- Syllable / word / sentence count

Slide 16: 4. Tone Sandhi
4.1 Tone Sandhi Problem
Types of tone sandhi:
- Normal sandhi
- Following sandhi
- Neutral sandhi
- Double sandhi
- Pre- sandhi
- Triplicate sandhi
- Rising sandhi

Slide 17: 4. Tone Sandhi-2
4.1 Tone Sandhi Problem
- The most complicated among the Sinitic languages
- Need to find the boundary of the tone sandhi group

Slide 18: 4. Tone Sandhi-3

Slide 19: 4. Tone Sandhi-4
4.2 Implementation
- Training and test data: POJ
- Tag set: A (adj), C (conj), D (adv), G (postposition), I (interjection), M (special marker), N (noun), P (prep), R (pron), S (time), T (aux), V (verb)
- Taiwanese-Mandarin dict & Chinese electronic dict

Slide 20: 4. Tone Sandhi-5
4.3 Rule-based Algorithm
- 20 rules
- Syllable / word / POS / sentence level
4.4 Result
- Training data: 97.39%
- Test data: 88.98%

Slide 21: 5. Word Seg and Tagging
5.1 Word Segmentation
- For Han-Romanization mixed script
- Forward maximal matching (FMM) vs backward maximal matching (BMM)
- 看台語: 看台 語 (FMM) or 看 台語 (BMM)?
- Ambiguity resolved statistically: P(看)P(台語) vs P(看台)P(語)

Slide 22: 5. Word Seg and Tagging-2
5.2 POS Tagging
- Data: POJ and HR mixed parallel corpus
- Tag set: CKIP Chinese tagset
- Taiwanese-Mandarin dict
- Chinese bigram training data

Slide 23: 5. Word Seg and Tagging-3

Slide 24: 5. Word Seg and Tagging-4
5.2 POS Tagging
Example (Taiwanese word, POJ, Mandarin translation candidates, tag):
- 因為 in-ūi 由於;因為 (Cbb)
- 等待 tán-thāi 留待;等待 (VK)
- 朋友 pêng-iú 友人;朋友 (Na)
- , , (COMMACATEGORY)
- 心適 sim-sek 好玩;好玩兒;有趣;風趣;愉快;稀奇;鬧著玩 (VH)

Slide 25: 5. Word Seg and Tagging-5
5.2 POS Tagging
- Result: 91.49%
- Error analysis:
  - Wrong Chinese translation word
  - No best Chinese translation to select
  - Unknown word
  - Proper noun
  - Propagation error

Slide 26: 6. Collect/Annotate Corpora
6.1 Corpora Collection
- POJ (3M+ syllables)
- Han-Romanization Mixed (5M+ syllables)
- Sources:
  - Project results
  - Articles in magazines
  - Academic papers

Slide 27: 6. Collect/Annotate Corpora-2
6.2 Raw Corpus Pre-processing
- Space between “-” and characters
- “-” between Han characters and POJ
- Alignment

Slide 28: 6. Collect/Annotate Corpora-3
6.3 Corpus Annotation
- POS
- Semantic annotation
- Phonetic annotation
- Special pattern marker

Slide 29: 7. Corpora Applications
7.1 Basic Count
- Syllable / word count
- Zipf law
- Proportion of POJ in Han-Romanization mixed script
- Suggestion of orthography for inconsistent word usage

Slide 30: 7. Corpora Applications-2
7.2 Concordancer System
- For language learning
- For syntax study
7.3 Collocation
- MI & Correlation (2)
- VN, NV, AN, NN

Slide 31: 7. Corpora Applications-3
7.4 Lexical Change and Variation
- Two periods: before / after 1945
- Register:
  - Japanese loanwords
  - Mandarin loanwords
  - Church register

Slide 32: 7. Corpora Applications-4
7.4 Lexical Change and Variation
- Two Taiwanese Bible versions (New Testament): 1916 and 1972
- Dialect difference
- Common words: 31%
- 43% of words disappeared after 5 decades

Slide 33: 7. Corpora Applications-5
7.5 Language Learning and Test
7.6 Coarticulation

Slide 34: 7. Corpora Applications-6
7.7 POJ / HR Mixed Script Conversion
POJ to HR mixed:
- Kin-a2-jit8 thinn-khi3 chin ho2 → 今仔日天氣真好
- Lookup dictionary
- Bigram, unigram (5M syllables training data)
- (input method)

Slide 35: 7. Corpora Applications-7
7.7 POJ / HR Mixed Script Conversion
HR mixed to POJ:
- 今仔日天氣真好 → Kin-a2-jit8 thinn-khi3 chin ho2
- Word segmentation
- Lookup dictionary
- Bigram, unigram (3M syllables / words training data)

Slide 36: 8. Future Work
8.1 Summary
8.2 Future Work
- Parser
- Machine translation
- OCR
- Put corpora to LDC

Slide 37: 8. Future Work
I hope this dissertation will turn into a written Taiwanese processing textbook (written in Taiwanese or Mandarin).
敬請指教 Kèng-chhiáⁿ chí-kàu Please advise.
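
The numbered-POJ to POJ display step in section 3.4 can be illustrated with a minimal sketch. It assumes trailing tone numbers (lang5), uses the letter priority o a e u i n m listed on the slide, and maps tone numbers to the usual POJ combining marks. The function names are hypothetical, and special cases such as the dotted o or the nasal marker are not handled.

```python
import re
import unicodedata

# Combining marks for the POJ tones that carry a diacritic.
# Tones 1, 4 and 6 have no entry here and are written unmarked.
TONE_MARKS = {
    "2": "\u0301",  # acute
    "3": "\u0300",  # grave
    "5": "\u0302",  # circumflex
    "7": "\u0304",  # macron
    "8": "\u030d",  # vertical line above
}

# Letter priority as listed on the slide: o a e u i n m.
PRIORITY = "oaeuinm"


def mark_position(letters: str) -> int:
    """Index of the letter that should carry the tone mark, or -1."""
    lowered = letters.lower()
    for letter in PRIORITY:
        pos = lowered.find(letter)
        if pos != -1:
            return pos
    return -1


def numbered_to_poj(syllable: str) -> str:
    """Convert one syllable with a trailing tone number to diacritic POJ."""
    match = re.fullmatch(r"([A-Za-z]+)([1-8])", syllable)
    if not match:
        return syllable  # no tone number: leave the syllable unchanged
    letters, tone = match.groups()
    mark = TONE_MARKS.get(tone)
    pos = mark_position(letters)
    if mark is None or pos == -1:
        return letters  # unmarked tone, or no letter able to carry a mark
    marked = letters[: pos + 1] + mark + letters[pos + 1:]
    # NFC composes letter + mark into one code point where one exists.
    return unicodedata.normalize("NFC", marked)


def convert_text(text: str) -> str:
    """Convert every syllable in a hyphen- or space-separated string."""
    return re.sub(r"[A-Za-z]+[1-8]?", lambda m: numbered_to_poj(m.group(0)), text)
```

For example, convert_text("Kin-a2-jit8 thinn-khi3 chin ho2") marks only the syllables that carry a tone number and leaves the others as they are.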
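
The 2-stage retrieval in section 3.3 (run an SQL command, then filter) might look like the following sketch. It is an assumption-laden illustration, not the system described in the slides: it uses SQLite, a hypothetical lexicon table with a precomputed toneless column, and numbered POJ so that SQLite's ASCII-only NOCASE collation is sufficient.

```python
import re
import sqlite3


def strip_tone(numbered: str) -> str:
    """Drop tone digits but keep letter case: 'Lang5' -> 'Lang'."""
    return re.sub(r"[1-9]", "", numbered)


def build_db(words):
    """Create an in-memory table (hypothetical schema) of numbered-POJ words."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lexicon (poj TEXT, toneless TEXT)")
    conn.executemany(
        "INSERT INTO lexicon VALUES (?, ?)",
        [(w, strip_tone(w).lower()) for w in words],
    )
    return conn


def retrieve(conn, query, case_sensitive=False, fuzzy_tone=False):
    # Stage 1: a broad, case-insensitive SQL lookup.
    if fuzzy_tone:
        sql = "SELECT poj FROM lexicon WHERE toneless = ? ORDER BY rowid"
        rows = conn.execute(sql, (strip_tone(query).lower(),))
    else:
        sql = "SELECT poj FROM lexicon WHERE poj = ? COLLATE NOCASE ORDER BY rowid"
        rows = conn.execute(sql, (query,))
    hits = [row[0] for row in rows]
    # Stage 2: filter in application code when exact case is required.
    if case_sensitive:
        if fuzzy_tone:
            hits = [h for h in hits if strip_tone(h) == strip_tone(query)]
        else:
            hits = [h for h in hits if h == query]
    return hits
```

Keeping the SQL stage case-insensitive and deferring the exact-case check to the filter lets one index serve both query modes.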
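
The FMM / BMM ambiguity from section 5.1 can be reproduced with a short sketch. The lexicon contains only the four words from the slide's 看台語 example, and the unigram probabilities are made-up illustrative numbers, not values from the corpus.

```python
from math import prod


def fmm(text, lexicon, max_len=4):
    """Forward maximal matching: take the longest lexicon word from the left."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def bmm(text, lexicon, max_len=4):
    """Backward maximal matching: take the longest lexicon word from the right."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def pick(candidates, unigram_prob, unseen=1e-9):
    """Choose the segmentation with the larger product of unigram probabilities."""
    return max(candidates, key=lambda seg: prod(unigram_prob.get(w, unseen) for w in seg))


lexicon = {"看", "台語", "看台", "語"}
# Hypothetical probabilities, for illustration only.
probs = {"看": 0.01, "台語": 0.005, "看台": 0.0001, "語": 0.001}
forward = fmm("看台語", lexicon)
backward = bmm("看台語", lexicon)
best = pick([forward, backward], probs)
```

With these assumed numbers, P(看)P(台語) exceeds P(看台)P(語), so the backward segmentation is selected.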
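
For the collocation measure in section 7.3, one common reading of MI is pointwise mutual information over adjacent word pairs, log2(P(x,y) / (P(x)P(y))). The sketch below assumes that reading and simple relative-frequency estimates. The slides do not state the exact formula used, so treat it as an illustration only.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Pointwise mutual information for every adjacent pair in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the pair
        p_x = unigrams[x] / n_uni    # relative frequency of each word
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Pairs can then be filtered by POS pattern (VN, NV, AN, NN) and ranked by score.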
