Computational Linguistics in Taiwan: Starting from COLING2002 in Taipei


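Computational Linguistics in Taiwan: Starting from COLING2002 in Taipei
- 黃居仁 Chu-Ren Huang, Academia Sinica
- 2002.11.5, Institute of Computational Linguistics, Peking University

Congress Venue: Howard International House
National Palace Museum
National Concert Hall and Shin Kuang Tower
Scenes from Yangmingshan National Park
Museum IHP, Academia Sinica

Date
- August 24 - 25, 2002 (tutorial)
- August 26 - 30, 2002 (conference)
- August 31 - September 1, 2002 (workshop)
Venue
- Howard International House: Main Conference
- Academia Sinica: Tutorials & Workshops

Organized by
- Association for Computational Linguistics and Chinese Language Processing
- Institute of Information Science, Academia Sinica
- Institute of Linguistics, Academia Sinica
- National Tsing Hua University
- http://www.coling2002.sinica.edu.tw

COLING2002 in Taipei: Origins and Significance
- The 19th conference in the series (held every two years)
- The first time it was held outside Europe, North America, and Japan
- Taipei won the right to host in 1998, ahead of other bidders (notably Korea)
- In 1988 there was talk of Beijing hosting the 1992 conference, but the conference will not return to another Asian country in the near future either
- Already scheduled: Geneva 2004, Melbourne 2006, Manchester 2008

Organization of COLING2002
- Under the Auspices of the International Committee on Computational Linguistics
- General Chair: 李遠哲
- Co-chairs: 王小川, 李德財, 何大安
- Program Committee Chair: Winfried Lenders, University of Bonn
- Organizing Committee Chair: 黃居仁 (Chu-Ren Huang)
- Executive Secretaries: 簡立峰, 曾淑娟

COLING2002 Program Committee
- Chair: Winfried Lenders, ICCL Member
- 17 program committee members in total, each responsible for reviewing the papers of one area
- Four are specialists in Chinese computational linguistics; with three Japanese scholars, seven members come from Asia
- Jason S. Chang: Statistical Approaches to NLP I
- Chu-Ren Huang: Oriental Language Processing
- Benjamin K. Tsou: Discourse Analysis and Summarization
- Shiwen Yu: Chinese Language Processing

COLING2002 in Taipei, Basic Figures I: Sessions
- Tutorials: 4 topics, 6 sessions
- Main conference: 5 parallel sessions per time slot, 64 sessions in total (including 4 general sessions)
- Post-conference workshops: 10 workshops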

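- Workshop sessions: 6 parallel sessions (spread over two days; two of the workshops ran for both days)
- Lunchtime concerts: 3

COLING2002 in Taipei, Basic Figures II: Papers Presented
- Keynote speeches: 3
- Main conference: 196 papers (selected from 450 submissions)
- Post-conference workshops: about 130 papers
- Total: about 330 papers

COLING2002 in Taipei, Basic Figures III: Participation
- Total participants: 540
- Countries represented: more than 32
- Main conference: 445 participants
- Tutorials: 200 participants, about 270 attendances
- Post-conference workshops: 300 participants, about 450 attendances

COLING2002 in Taipei, Reference Data I: Participants by Region
Rank      Region           Participants
1         Taiwan           130
2         Japan            98
3         United States    89
4         South Korea      35
5 (tie)   Mainland China   29
5 (tie)   Germany          29
7         France           21
8         United Kingdom   18
9         Hong Kong        11
10 (tie)  Australia        9
10 (tie)  Spain            9

COLING2002 in Taipei, Reference Data II: Asia and the Chinese-Speaking World
- The largest participation by Asian scholars so far: at least 10 Asian countries and regions took part
- Four of the six countries and regions with the most participants are in Asia (the previous 18 conferences, including the two held in Japan, were dominated by scholars from Europe and North America)
- The largest participation by Chinese-speaking scholars so far: nearly 180 in total
- The largest participation by mainland Chinese scholars so far (previously no more than 10; 29 this time)
- The most enthusiastic participation by Taiwanese scholars so far (previously no more than 15; over 130 this time)

COLING2002 in Taipei, Reference Data III: Conference Content
- The first COLING of the new century
- The largest number of post-conference workshops so far: ten in total (8 one-day and 2 two-day workshops)
- The first international workshop on Chinese language processing (The First SigHAN Workshop on Chinese Language Processing, organized by the international group newly approved under the ACL)

COLING2002 in Taipei, Reference Data IV: Scholarships
- COLING2002 Scholarship: 1. Zeman, Daniel (Czech Republic); 2. Liakata, Maria (Greece); 3. Lu, Ya-juan (People's Republic of China). Alternates: 1. Pekar, Viktor (Russia); 2. Bohmova, Alena (Czech)
- MRSA fellowship: 1. Yao, Jian-min; 2. Li, Xin. Alternates: 1. Luo, Xiao; 2. Sun, Jian
- SIGHAN Student Fellowship: 1. Lu Yajuan; 2. Peng Fuchun
- SigHAN workshop Scholarship: 1. Kevin ZHANG; 2. Liang HUANG; 3. Jianhua TAO; 4. Yuqi ZHANG; 5. Chunfa YUAN; 6. Yinan PENG

COLING2002 in Taipei: Overview of Conference Content
- Main formats: keynote speeches, main conference papers, tutorials, workshops, panel discussions
- Main themes: Semantic Web, Chinese language processing, interdisciplinary integration

Some Distinctive Features of COLING2002
- Interdisciplinary: evolution and genetics; bioinformatics; (computational) psycholinguistics; Chinese language processing; digital archives and language archives
- Out of the ivory tower: environmental; cultural

COLING2002 Content: Keynotes
- William S. Y. Wang, City University of Hong Kong: Computer Modeling of Language Evolution
- Hans Uszkoreit, Saarland University: New Chances for Deep Linguistic Processing
- Charles Fillmore, U.C. Berkeley: FrameNet and the Linking Between Semantic and Syntactic Relations

COLING2002 Content: Papers
- 60 sessions with three papers each (plus 4 poster sessions with 5 to 8 papers each)
- Popular topics with several sessions (three each): Information Extraction, Question-Answering, Word Sense Disambiguation, Learning Systems
- Traditional topics (four sessions each): Machine Translation, Parsing Methods

COLING2002 Content: Tutorials I
Computational Linguistics and Chinese Language Processing
- Intelligent Character Encoding: Chin-chun Hsieh, Academia Sinica
- Treebanking and Parsing: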
  Keh-jiann Chen, Academia Sinica
- Corpus-based Methods in Chinese Morphology: Richard Sproat, AT&T Labs

COLING2002 Content: Tutorials II
Bio-Informatics
- Bioinformatics and NLP Issues: T. Takagi and T. Takai, Univ. of Tokyo; K. Fukuda, National Institute of Advanced Industrial Science and Technology
- Some Applications of NLP Techniques for Modeling Biological Sequences: Aravind Joshi, Univ. of Pennsylvania
- The Application of Information Extraction in Bio-Informatics: Jun-Ichi Tsujii, University of Tokyo

COLING2002 Content: Tutorials III
- Probabilistic Computational Psycholinguistics: Dan Jurafsky, University of Colorado
- Open-Domain Textual Question Answering: Sanda M. Harabagiu and Dan Moldovan, Univ. of Texas

COLING2002 Content: Post-Conference Workshops I
- SEMANET: Building and Using Semantic Networks
- COMPUTERM 2002: Second International Workshop on Computational Terminology
- The 6th Conference on Natural Language Learning 2002 (CoNLL-2002) (in conjunction with WVLC)

COLING2002 Content: Post-Conference Workshops II
- The 3rd Workshop on Asian Language Resources and International Standardization
- Machine Translation in Asia
- The First SIGHAN Workshop on Chinese Language Processing

COLING2002 Content: Post-Conference Workshops III
- Multilingual Summarization and Question Answering
- Grammar Engineering and Evaluation
- The 2nd Workshop on NLP and XML (NLPXML-2002)
- A Roadmap for Computational Linguistics

Research Trends Suggested by the COLING2002 Program
- Interdisciplinary integration and new directions:
  - Biology and genomics research: bioinformatics
  - Cognitive science, neuroscience, and psychology research: computational psycholinguistics
  - Document and archive research: language coordinates (語言座標) for digital archives
- Chinese language processing is flourishing and growing in importance
- Alongside the development of the Semantic Web: how can natural language processing take a key position in the future Web?

Chinese Language Processing in 10 Years
Chu-Ren Huang 黃居仁
- Panel at the First SIGHAN Workshop on Chinese Language Processing, a COLING2002 Post-Conference Workshop
- 1 September 2002, Academia Sinica, Taipei
- Moderator: Benjamin K. Tsou
- Panelists: 朱邦復, 黃居仁 (Chu-Ren Huang), 周明

CLP in the past 10 years
A review of what happened in the past ten years in Chinese Language Processing (1992-2002), from a somewhat personal perspective
- 1992, Corpora: completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING 92, pp. 1214-1217); untagged and non-segmented, but searchable
- 1992, Segmentation Standard: announcement of the first national standard for word segmentation by the PRC government, GB 13715 (信息處理用現代漢語分詞規範)
- 1993, Lexicon: completion and release of the first version of the CKIP lexicon (with the category set and ICG thematic roles); first version of K. Chen's parser for Chinese
- 1994, Corpus: 10th anniversary of the automation of Chinese historical textual databases; completion of the pre-Qin Classical Chinese corpus at Academia Sinica
- 1995, Corpus: completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus
- 1996, Research Institutes: 10th anniversary of the Institute of Computational Linguistics at Peking University; 10th anniversary of the Chinese Knowledge Information Processing Group at Academia Sinica
- 1996, Anthology of Papers: Readings in Chinese Natural Language Processing (Journal of Chinese Linguistics Monograph), editors Huang, Chen, and Tsou
- 1996 November, Sinica Corpus on Web: the first fully searchable language corpus on the WWW (to the best of my knowledge), http://www.sinica.edu.tw/ftms-bin/kiwi.sh
- 1997: publication of the first Chinese dictionary compiled directly from a corpus (黃居仁, 陳克健, 賴慶雄: 國語日報量詞典); the Tenth Annual ROCLING conference; HK's return to Chinese sovereignty
- 1998, KnowledgeNet: release of HowNet, the first full-fledged Chinese and English-Chinese LKB, http:/
- Segmentation Standard: official announcement of CNS14366 for Taiwan
- 2000, Treebanks: simultaneous completion and announcement of two Chinese treebanks, the Penn Chinese Treebank and the Sinica Treebank; ACL workshop on Chinese Language Processing
- 2001, Society: formal approval of the formation of ACL SigHAN, the first international organization on Chinese Language Processing
- 2002: First SigHAN workshop on Chinese Language Processing; formal launch of Hsieh's Intelligent Character Encoding System (a sustainable solution to the missing character problem); COLING2002 in Taipei

CLP in the next 10 years
- A fictional projection of what may (or may not) happen in Chinese Language Processing in the next 10 years. Since it is fictional: the following is pure speculation, and no responsibility is taken for any resemblance to actual events.
- Several forces will conspire to make Chinese content a significant part of the Web:
  - One-fourth of web users will speak Chinese
  - 15%-20% of web content will be originally in Chinese (see also Ed Hovy's prediction)
  - 朱邦復's grand vision: 900 million farmers online
  - Prediction by 周明, manager of the natural language group at Microsoft Research: 500 million Chinese people online
- Within the first 5 years:
  - At least two
    versions of Chinese wordnet will be completed
  - CWN will utilize various levels of semantic information encoded in the Chinese language, including, but not limited to: components (部件), radicals, roots (in compounds), and classifier constructions
- Hsieh and Chuang's intelligent character encoding scheme will:
  - make the missing character problem obsolete
  - create an open (i.e. efficient resource sharing) environment for character fonts
- Segmentation and Tagging: a new and successful trend will be a hybrid model where segmentation and POS tagging are performed together in one step and are not modularized
- Documentation of minority languages in China and in Taiwan will have immediate urgency, since half of the remaining languages are predicted to die within the next five years
- On the other hand, completion of a comprehensive diachronic Chinese corpus covering nearly 3000 years will make it both the definitive corpus for historical linguistics and an essential source for social and evolutionary studies of human languages
- With the Semantic Web slowly becoming a reality:
  - Most Chinese web services and pages will use semantic features of the language as a skeletal part of their ontology
  - The semantic information is realized from components, radicals, root meaning, classifier collocation, etc.
  - Many other language sites will also adapt the skeletal ontology of Chinese
- Some other unsystematic guesses:
  - The population of Chinese computational linguists will exceed that of computational linguists working on any other language
  - Code-switching (between Chinese and English) will become even more popular. English words (either transcribed or in alphabet form) will take up more than 10% of a random corpus.
- Conclusion: the bottom line is that we really CANNOT foresee what will happen in the next 10 years of CLP, but we do see promising trends and an exciting future for our field. If the past ten years are any indication, we can expect breakthroughs that will really bring Chinese NLP into our lives, and bring the Chinese language to the world stage that is the Web.

Future Directions for Computational Linguistics and Chinese Language Processing: Semantic Web, Wordnets, and Ontologies
- Starting from the COLING2002 panel discussions on the Semantic Web

What Is the Semantic Web?
- Semantic Web: a new form of Web content whose meaning computers can understand, and which is bound to bring a new wave of the Web revolution
- Source: 科學人 (Chinese edition of Scientific American), August 2002, pp. 46-56; Scientific American, May 2001

The Semantic Web Will Become the Next-Generation Internet
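- Berners-Lee, Tim, James Hendler and Ora Lassila. The Semantic Web. Scientific American, May 2001.
- The lead author, Tim Berners-Lee, is the inventor of the World Wide Web, so his statements about where the Web is heading cannot be ignored
- Will the Semantic Web make computational linguistics more important, or will it replace computational linguistics?

The Semantic Web at COLING2002
- Semantic Web: A New Challenge for Language Technology (main conference panel)
  - Chair: Hans Uszkoreit
  - Panelists: Ed Hovy, Paul Buitelaar, Chu-Ren Huang
- The Roles of Natural Language and XML in the Semantic Web (NLPXML workshop panel)
  - Panel: Nancy Ide, Laurent Romary, Graham Wilcock, Paul Buitelaar (principal investigator of an EU science and technology project integrating the Semantic Web and language technology)

From the World Wide Web to the Semantic Web
- The World Wide Web is still only a medium through which people exchange documents; machines cannot make automatic use of the information in it
- If we add pages written specifically for computers to read, we can turn the existing Web into the Semantic Web

How Does a Computer Read Meaning?
- It uses RDF (Resource Description Framework) and URIs (Uniform Resource Identifiers) to link to related pages and resources
- After finding keywords through hyperlinks, it uses an ontology to define the keywords and to carry out logical inference

Ontology
- A description and definition of the knowledge content and information structure of any page or resource
- A document written in RDF (or a similar language) that clearly defines the relations between concepts and the logical rules for inference
- Note that information science treats an ontology as the underlying structure of knowledge and information, which differs greatly from the original definition of ontology in philosophy

The Evolution of Knowledge
- If designed well, the Semantic Web will help the overall evolution of human knowledge
- The ontologies of web pages provide complete descriptions of different knowledge systems
- URIs explicitly describe each concept within each knowledge system
- The Semantic Web will help concepts to be communicated and knowledge systems to be integrated
- http://www.w3.org/2001/sw
- http://www.SemanticWeb.org
- Http:/

Our First Question
- What language will the Semantic Web use?

Answer A
- English. Of course it is English, because English is already the most widely used language on the WWW

However: the Web and Chinese Language Processing in Ten Years
- 朱邦復's grand vision: 900 million farmers online
- Prediction by 周明, manager of the natural language group at Microsoft Research: 500 million Chinese people online
- Personal prediction: one in every four Internet users worldwide will use Chinese
- Source: panel on Chinese Language Processing: 10 Years from Now. The First SigHan Workshop on Chinese Language Processing. COLING2002. Sept. 1. Taipei.

Answer B
- Any Language(s). Any language will do, because the Semantic Web communicates through knowledge, not through language
- Other languages: OWL, XML, etc.

Our Second Question
- Since the Semantic Web relies mainly on ontologies, is knowledge management for particular languages and texts still needed?

Answer A
- No, because ontologies exist independently of any particular language or text

Answer B
- Of course it is, because each particular language or text is a unique knowledge system, and a complete ontology can only be built by correctly organizing and analyzing its knowledge content

How Ontologies Change
- Where does the richness of knowledge come from? From culture, domain, environment, ethnic group, social class, media, discipline, era, and so on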
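- How is the richness of knowledge expressed? Through a shared language and vocabulary (that is, sublanguages or jargon, and domain vocabularies or specialized dictionaries)

From the Ontology (of Knowledge) to the Ontology of Language
- Every language has its own well-structured ontology:
  - A language (including its sublanguages) can express all knowledge (within a domain)
  - The exchange of knowledge between speaker and hearer is usually effective and accurate
- No artificial ontology will surpass a linguistic ontology in the number or the accuracy of its users
- A linguistic ontology is in any case a necessary resource for processing text archives

How Can the Ontology of a Language Be Represented?
- A wordnet (WordNet) is the most direct representation of a linguistic ontology
- The components of a wordnet:
  - All the words of a language (each unique pairing of a lemma and a sense is defined as one word)
  - All the concepts the language expresses (that is, all the senses)
  - A set of basic lexical semantic relations

The Architecture of a Wordnet
- Taking the sense as the basis, all words that share a sense are placed in one synonym set (SynSet)
- A synset is the set of all words that express the same concept
- All synsets are linked by defined lexical semantic relations, which amounts to establishing the semantic relations between all concepts
- Besides synonymy, antonymy, and near-synonymy, the more important relations include hypernymy, hyponymy, and functional relations

The Basic Knowledge Structure a Wordnet Provides
- Synset: a lexically driven unit of concept (knowledge)
- Semantic Relation: the basic relation for linking concepts and deriving knowledge
- How does an expert (such as a reference librarian) find material that a keyword search cannot find?

Wordnets and the Thesaurus (in Library Science)
- A thesaurus is built on the notion of the equivalence class and in principle marks only synonymy
- It can quickly find fixed, predefined categories

Equivalence Classes vs. Relational Classes
- download, copy, record
- digital file, electronic file, machine-readable file, program
- music
- photographs
- novels
- Beethoven, 李玟, Stephen King, Steve Martin, Harry Potter
- (How about read, enjoy, buy?)

Knowledge That Wordnets Lack
- Classification by domain of use and by knowledge
- Links between knowledge and concepts across domains and across languages
- EuroWordNet and HowNet have already made some of these links

Problem: Classifications Are Defined Differently in Different Times and Places
- Sweet potato (蕃薯): local
- Taro (芋頭): foreign

Cross-Lingual Lexical Correspondence Is Defined by Sense Relations
- Monolingual and multilingual wordnets are both built on sense relations
- Translating Lexical Semantic Relations: The First Step Towards Multilingual Wordnets. Chu-Ren Huang, I-Ju E. Tseng, Dylan Tsai. SEMANET: Building and Using Semantic Networks

DomOnto: Domain Ontology
- Each domain will be represented by its own linguistic ontology (i.e. a wordnet)
- Each domain wordnet will be linked to the general ontology via lemmas through semantic relations
- A lexicon marked with domain tags will serve as a convenient index

Work Towards Domain Ontology
- Bilingual Domain Lexicon for FishBase (4,000 lemmas, half of them in WordNet)
- Domain Ontology: Zoological
- Links to General Ontology (WordNet 1.7): FoodFish, GameFish, AquariumFish, FreshWaterFish

Domain Ontology and Incremental Knowledge
- An individual's learning is limited, but knowledge is not
- Domain knowledge changes daily and is often beyond what people outside the domain can fully know
- So combine the domain knowledge of many people on the Web
- Now: a web interface through which users update domain knowledge
- In the future: extract the ontologies of individual web pages directly

Academia Sinica's Chinese WordNet
- Since 2000
- English-Chinese bilingual (first stage)
- Language Coordinates project (語言座標計畫): http://corpus.ling.sinica.edu.tw/project/LanguageArchive
- Ckip.iis.sinica.edu.tw/Ontology

Building and Serving Reference Resources for Language Coordinates (語言座標)

What Are Language Coordinates?
- Language coordinates describe the HOW and WHAT of archive content; supplemented by the WHEN and WHERE of time and space, they help project and interpret the knowledge in archive content
- As the Semantic Web develops, digital archives will be usable by it only through the mediation of ontologies. Language coordinates provide the necessary infrastructure for extracting ontological knowledge from archives and for converting concepts between different ontologies.

The What and How of Digital Archives
- What: the knowledge and information content of the archive. The knowledge content of a text is language or writing.
- How: the way knowledge and information are expressed. For expressing and transmitting non-textual knowledge, language and writing are least constrained by the medium and easiest for people to understand.

Planning and Building Language Coordinates
- Linking archives and aligning their knowledge depends on language (especially vocabulary) to tie them together
- Language coordinates span time (language change across historical periods), space (dialect differences), language (multilingual comparison), domain (specialized vocabulary), social class, and so on

Goal: a Chinese Wordnet as the Coordinates of Archive Knowledge
- Build an English-Chinese aligned wordnet as basic material for bilingual metadata, authority files, and ontologies
- Build a Chinese wordnet as the baseline for Chinese archive ontologies and authority files
- Build a multilingual lexical database (covering correspondences between ancient and modern forms, traditional and simplified characters, English and Chinese, and so on)
- Take part in international collaborations such as Open Lexical Infrastructure (for the Web), Open Semantic Lexicon Foundations, and European Lexical Infrastructure and Technology, to build the basic framework for exchanging and sharing lexical knowledge and ontologies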
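
The wordnet architecture described in the slides (synsets that group the words sharing one sense, linked to each other by lexical semantic relations such as hypernymy) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the data model of Academia Sinica's Chinese WordNet: the class names, the relation label "hypernym", and the toy English-Chinese entries are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Synset:
    """One concept: every lemma that shares a single sense."""
    synset_id: str
    lemmas: Set[str]
    # relation name -> ids of the synsets it points to
    relations: Dict[str, Set[str]] = field(default_factory=dict)


class ToyWordNet:
    """Hypothetical, minimal wordnet: synsets plus typed links between them."""

    def __init__(self) -> None:
        self.synsets: Dict[str, Synset] = {}
        self.index: Dict[str, Set[str]] = {}  # lemma -> synset ids

    def add_synset(self, synset_id: str, lemmas: Iterable[str]) -> None:
        lemma_set = set(lemmas)
        self.synsets[synset_id] = Synset(synset_id, lemma_set)
        for lemma in lemma_set:
            self.index.setdefault(lemma, set()).add(synset_id)

    def relate(self, source_id: str, relation: str, target_id: str) -> None:
        """Link two concepts with a named lexical semantic relation."""
        links = self.synsets[source_id].relations
        links.setdefault(relation, set()).add(target_id)

    def synonyms(self, lemma: str) -> Set[str]:
        """Other lemmas found in any synset that contains this lemma."""
        found: Set[str] = set()
        for synset_id in self.index.get(lemma, set()):
            found |= self.synsets[synset_id].lemmas
        found.discard(lemma)
        return found

    def hypernym_closure(self, synset_id: str) -> Set[str]:
        """All ancestor concepts reachable through 'hypernym' links."""
        seen: Set[str] = set()
        stack = [synset_id]
        while stack:
            current = stack.pop()
            for parent in self.synsets[current].relations.get("hypernym", set()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def build_toy_wordnet() -> ToyWordNet:
    """Illustrative bilingual entries only; not taken from any real wordnet."""
    wn = ToyWordNet()
    wn.add_synset("animal", ["animal", "動物"])
    wn.add_synset("fish", ["fish", "魚"])
    wn.add_synset("food_fish", ["food fish", "食用魚"])
    wn.relate("fish", "hypernym", "animal")
    wn.relate("food_fish", "hypernym", "fish")
    return wn


if __name__ == "__main__":
    wn = build_toy_wordnet()
    print(wn.synonyms("fish"))
    print(wn.hypernym_closure("food_fish"))
```

Placing the English and Chinese lemmas in the same synset is one simple way to picture an aligned bilingual wordnet. A real resource would more likely keep one wordnet per language and link the two through explicit cross-lingual relations, as the slide on translating lexical semantic relations suggests.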
