Slide 1
楊允言 Iunn Un-gian, 2008.7.14
台語文特性分析及其處理技術
Written Taiwanese: Its Characteristic Analysis and Processing Techniques

Slide 2: Vita
  1984-1988 NTU CSIE, undergraduate
  1990/8-1994/1 Sinica IIS, assistant
  1991-1993 NTHU IS, graduate
  1994/2-1996/11 NTU CC, programmer
  1996 migrated to Hualien

Slide 3: Vita-2
  1999 Dahan I.T. CSIE, lecturer
  2003/8 - assistant prof.
  2004 - NTU CSIE PhD program
  Journal : IJCLCLP 12(4)
  Project : NSC 3, NMTL 1, Academia Historica 1

Slide 4: Outline
  Introduction
  Resources and Survey of Written Taiwanese Processing
  Coding and I/O of POJ
  Tone Sandhi Problem and Algorithm

Slide 5: Outline-2
  Word Segmentation and Tagging Methods
  Corpora Collection and Annotation
  Some Applications of Written Taiwanese Corpora
  Conclusion and Future Work

Slide 6: 1. Introduction
1.1 Background
  Population : 46M (2005)
  Distribution : Taiwan, Singapore, Malaysia, Brunei, China, Thailand, Philippines, Indonesia
  Rank : 21
  Confused Name : Southern-Min ? Amoy ? Taiwanese ?

Slide 7: 1. Introduction-2
1.2 Different Scripts
  Han Characters Script
  Romanization Script (POJ)
  Han-Romanization Mixed Script
  Others : Kana, Phonetic Symbols, Proverb

Slide 8: 1. Introduction-3
1.3 Phonemes of Taiwanese
  Initials (18)
  Vowels (86)
  Tones (7)
  Compared with Mandarin : legal syllables 2726 vs 1200

Slide 9: 1. Introduction-4
1.4 Some Key Points
  Not yet standardized
  The POJ characters are separated into different zones of the Unicode set
  Need to annotate phonetic markers in corpora
  Interact with Taiwanese group

Slide 10: 1. Introduction-5
1.5 Motivation
  My mother tongue
1.6 Definition and Glossary
1.7 Goal of This Dissertation
1.8 Organization

Slide 11: 2. Resources and Survey
2.1 Resources
  Input method
  Dictionary
  Corpus
  Word segmentation
  Scripts conversion
  Text-to-speech
2.2 Survey

Slide 12: 3. Coding and I/O of POJ
3.1 POJ Character Code
  Unicode encoding
3.2 Two Kinds of POJ Representation
  POJ and numbered POJ

Slide 13: 3. Coding and I/O of POJ-2
3.3 Retrieval of POJ
  Issue : both case-sensitive and case-insensitive
  2-stage retrieval : execute SQL command, then filter
  Fuzzy retrieval : toneless, glottal stop, checked syllable, vowel
  Examples

Slide 14: 3. Coding and I/O of POJ-3
3.4 Display of POJ
  Strategy : Unicode (with specific fonts) or graph
  POJ to numbered POJ : lâng → la5ng → lang5
  Numbered POJ to POJ : lang5 → la5ng → lâng
  Priority : o a e u i n m
    ou5 o5u ou5

Slide 15: 3. Coding and I/O of POJ-4
3.5 Word Processing Utilities for POJ
  Phoneme segmentation : backward direction
  Spelling checker
  Syllable / word / sentence count

Slide 16: 4. Tone Sandhi
4.1 Tone Sandhi Problem
  Types of tone sandhi
    Normal sandhi
    Following sandhi
    Neutral sandhi
    Double sandhi
    Pre- sandhi
    Triplicate sandhi
    Rising sandhi

Slide 17: 4. Tone Sandhi-2
4.1 Tone Sandhi Problem
  Most complicated among the Sinitic language family
  Need to find the boundaries of tone sandhi groups

Slide 18: 4. Tone Sandhi-3

Slide 19: 4. Tone Sandhi-4
4.2 Implementation
  Training and test data : POJ
  Tag set : A (adj), C (conj), D (adv), G (postposition), I (interjection), M (special marker), N (noun), P (prep), R (pron), S (time), T (aux), V (verb)
  Taiwanese-Mandarin dict & Chinese electronic dict

Slide 20: 4. Tone Sandhi-5
4.3 Rule-based Algorithm
  20 rules
  Syllable / word / POS / sentence level
4.4 Result
  Training data : 97.39%
  Test data : 88.98%

Slide 21: 5. Word Seg and Tagging
5.1 Word Segmentation
  For Han-Romanization mixed
  Forward maximal matching (FMM) vs Backward maximal matching (BMM)
  看台語 : 看台 語 (FMM) or 看 台語 (BMM)?
  Ambiguous : decide statistically, P(看)P(台語) vs P(看台)P(語)

Slide 22: 5. Word Seg and Tagging-2
5.2 POS Tagging
  Data : POJ and HR mixed parallel corpus
  Tag set : CKIP Chinese tagset
  Taiwanese-Mandarin dict
  Chinese bigram training data

Slide 23: 5. Word Seg and Tagging-3

Slide 24: 5. Word Seg and Tagging-4
5.2 POS Tagging
  Example :
    因為 in-ūi 由於;因為 (Cbb)
    等待 tán-thāi 留待;等待 (VK)
    朋友 pêng-iú 友人;朋友 (Na)
    , , (COMMACATEGORY)
    心適 sim-sek 好玩;好玩兒;有趣;風趣;愉快;稀奇;鬧著玩 (VH)
    心適 sim-sek 好玩;好玩兒;有趣;風趣;愉快;稀奇;鬧著玩 (VH)

Slide 25: 5. Word Seg and Tagging-5
5.2 POS Tagging
  Result : 91.49%
  Error analysis :
    Wrong Chinese translation word
    No best Chinese translation to select
    Unknown word
    Proper noun
    Propagation error

Slide 26: 6. Collect/Annotate Corpora
6.1 Corpora Collection
  POJ (3M+ syllables)
  Han-Romanization Mixed (5M+ syllables)
  Sources :
    Project results
    Articles in magazines
    Academic papers

Slide 27: 6. Collect/Annotate Corpora-2
6.2 Raw Corpus Pre-process
  Space between “-” and char
  “-” between Han char and POJ
  Alignment

Slide 28: 6. Collect/Annotate Corpora-3
6.3 Corpus Annotation
  POS
  Semantic annotation
  Phonetic annotation
  Special pattern marker

Slide 29: 7. Corpora Applications
7.1 Basic Count
  Syllable / word count
  Zipf's law
  Proportion of POJ in Han-Romanization mixed script
  Suggestion of orthography for inconsistent word usage

Slide 30: 7. Corpora Applications-2
7.2 Concordancer system
  For language learning
  For syntax study
7.3 Collocation
  MI & Correlation (2)
  VN, NV, AN, NN

Slide 31: 7. Corpora Applications-3
7.4 Lexical Change and Variation
  Two periods : before / after 1945
  Register : Japanese loanwords, Mandarin loanwords, church register

Slide 32: 7. Corpora Applications-4
7.4 Lexical Change and Variation
  Two Taiwanese Bible versions (New Testament) : 1916 and 1972
  Dialect difference
  Common words : 31%
  43% of words disappeared after 5 decades

Slide 33: 7. Corpora Applications-5
7.5 Language Learning and Test
7.6 Coarticulation

Slide 34: 7. Corpora Applications-6
7.7 POJ / HR mixed script conversion
  POJ to HR mixed
    Kin-a2-jit8 thinn-khi3 chin ho2 → 今仔日天氣真好
    Lookup dictionary
    Bigram, unigram (5M syllables training data)
    (input method)

Slide 35: 7. Corpora Applications-7
7.7 POJ / HR mixed script conversion
  HR mixed to POJ
    今仔日天氣真好 → Kin-a2-jit8 thinn-khi3 chin ho2
    Word segmentation
    Lookup dictionary
    Bigram, unigram (3M syllables / words training data)

Slide 36: 8. Future Work
8.1 Summary
8.2 Future Work
  Parser
  Machine translation
  OCR
  Put corpora into LDC

Slide 37: 8. Future Work
  I hope this dissertation will become a textbook on written Taiwanese processing (written in Taiwanese or Mandarin)
  敬請指教 Kèng-chhiáⁿ chí-kàu Please advise.
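
Example for 3.3 (Retrieval of POJ). The slides name a 2-stage retrieval: run an SQL command, then filter the result. The sketch below is one possible reading of that idea, not the author's implementation. It assumes SQLite through Python's standard sqlite3 module, where LIKE is case-insensitive for ASCII letters by default; the table name corpus, the column name text, and both function names are hypothetical.

```python
import sqlite3


def build_demo_db(lines):
    """Create an in-memory table of corpus lines (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany("INSERT INTO corpus (text) VALUES (?)", [(s,) for s in lines])
    return conn


def two_stage_search(conn, term, case_sensitive=False):
    """Stage 1: case-insensitive SQL match. Stage 2: optional exact-case filter."""
    # Stage 1: SQLite's LIKE ignores case for ASCII letters by default.
    # Note: '%' and '_' inside `term` would act as wildcards here.
    rows = conn.execute(
        "SELECT text FROM corpus WHERE text LIKE ? ORDER BY id",
        ("%" + term + "%",),
    ).fetchall()
    hits = [row[0] for row in rows]
    # Stage 2: keep only exact-case matches when the user asks for them.
    if case_sensitive:
        hits = [t for t in hits if term in t]
    return hits
```

Usage: with conn = build_demo_db(["Tai5-oan5 lang5", "tai5-gi2 bun5"]), the call two_stage_search(conn, "tai5") returns both lines, while two_stage_search(conn, "tai5", case_sensitive=True) keeps only the second.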
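
Example for 3.4 (Display of POJ). The numbered-POJ-to-POJ step can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the tone mark is attached to the first letter found when scanning the priority order listed on the slide (o a e u i n m), tones 1 and 4 are left unmarked, and the usual POJ combining marks are used (2 acute, 3 grave, 5 circumflex, 7 macron, 8 vertical line above). The function names are hypothetical, and special letters such as o͘ or superscript n are not handled.

```python
import re
import unicodedata

# Combining diacritics for the marked POJ tones; tones 1 and 4 carry no mark.
TONE_MARKS = {
    "2": "\u0301",  # acute
    "3": "\u0300",  # grave
    "5": "\u0302",  # circumflex
    "7": "\u0304",  # macron
    "8": "\u030d",  # vertical line above
}

# Priority order as listed on the slide (an assumption about its exact meaning).
PRIORITY = "oaeuinm"


def numbered_to_poj(syllable):
    """Convert one numbered syllable such as 'lang5' to diacritic form."""
    match = re.fullmatch(r"([A-Za-z]+)([1-8])", syllable)
    if not match:
        return syllable  # not a numbered syllable, leave untouched
    letters, tone = match.groups()
    mark = TONE_MARKS.get(tone)
    if mark is None:
        return letters  # tones without a diacritic: just drop the digit
    lowered = letters.lower()
    for ch in PRIORITY:
        pos = lowered.find(ch)
        if pos != -1:
            # Intermediate form la5ng: the mark goes right after the chosen letter.
            marked = letters[: pos + 1] + mark + letters[pos + 1 :]
            return unicodedata.normalize("NFC", marked)
    return letters


def convert_text(text):
    """Convert every numbered syllable in a hyphenated or spaced text."""
    return re.sub(r"[A-Za-z]+[1-8]", lambda m: numbered_to_poj(m.group(0)), text)
```

Usage: numbered_to_poj("lang5") yields lâng, matching the lang5 → la5ng → lâng example on the slide. Some letters (for example i with tone 8) have no precomposed Unicode character, so the result keeps a combining mark and relies on the font, which is consistent with the slide's remark about specific fonts.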
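
Example for 5.1 (Word Segmentation). The 看台語 case can be sketched as forward and backward maximal matching followed by a unigram comparison, P(看)P(台語) vs P(看台)P(語). This is an illustrative sketch only: the word counts are invented for the demonstration, the smoothing for unseen words is a crude assumption, and all names are hypothetical rather than taken from the dissertation.

```python
# Invented counts for illustration only; they are not corpus statistics.
DEMO_COUNTS = {"看": 50, "台語": 40, "看台": 2, "語": 5, "台": 10}


def fmm(text, lexicon, max_len=4):
    """Forward maximal matching: take the longest known word from the left."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i : i + n]
            if n == 1 or cand in lexicon:  # single characters always pass
                words.append(cand)
                i += n
                break
    return words


def bmm(text, lexicon, max_len=4):
    """Backward maximal matching: take the longest known word from the right."""
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = text[j - n : j]
            if n == 1 or cand in lexicon:
                words.append(cand)
                j -= n
                break
    return words[::-1]


def unigram_score(words, counts):
    """Product of unigram probabilities; unseen words get a small pseudo-count."""
    total = sum(counts.values())
    score = 1.0
    for w in words:
        score *= counts.get(w, 0.5) / total
    return score


def segment(text, counts):
    """Use FMM and BMM; when they disagree, keep the higher-scoring result."""
    lexicon = set(counts)
    forward, backward = fmm(text, lexicon), bmm(text, lexicon)
    if forward == backward:
        return forward
    return max((forward, backward), key=lambda ws: unigram_score(ws, counts))
```

Usage: with the demo counts, fmm gives 看台 / 語, bmm gives 看 / 台語, and segment("看台語", DEMO_COUNTS) keeps the backward result because P(看)P(台語) is larger.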