Lecture slides: Introduction to Cross-Language Information Retrieval (课件跨语言资讯检索导论.ppt)

Uploaded by: 本田雅阁 Document ID: 2601310 Upload date: 2019-04-16 Format: PPT Pages: 66 Size: 363.51 KB

Introduction to Cross-Language Information Retrieval (跨語言資訊檢索導論)
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue.
- Digital libraries: the OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
- World Wide Web: around 40% of Internet users do not speak English, yet 80% of Web sites are still in English.

Language Populations in the Real World
[Chart. Source: ( http:/]
[Chart (Statistics from Euro-Marketing Associates, 1998). Labels: Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean]
[Chart. Source: http:/ ... from Euro-Marketing Associates, 1999). Share of Chinese speakers: 6.1%; share of French speakers: 8.8% (1998)]

Language Populations on the Internet
[Chart]

Internet Content
(Network Wizards Jan 99 Internet Domain Survey)
[Chart. Labels, in order: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish. Values, in order: 33,878; 1,687; 1,684; 654; 546; 473; 458; 432; 546]
- 40% of Internet users do not understand English, but 80% of Internet content is in English.
[Chart. (Source: http:/)]

What is Cross-Language Information Retrieval?
- Definition: select information in one language based on queries in another.
- Terminology
  - Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
  - Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Generalization: Multi- & Cross-Lingual Information Access
[Diagram]

MLIR Applications
- Multilingual information access in a multilingual country, organization, enterprise, etc.
- Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary).
- Monolingual users may retrieve images by taking advantage of multilingual captions.
- Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Why is Cross-Language Information Retrieval Important?
- More information workers with less time require fast access to global resources:
  - global B2B interactions (virtual enterprises)
  - global B2C interactions (online trading, travelling)
  - time-critical information (translation comes too late)

History
- 1970: Salton runs retrieval experiments with a small English/German dictionary.
- 1972: Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation.
- 1978: ISO Standard 5964 for developing multilingual thesauri (revised in 1985).
- 1990: Latent Semantic Indexing (LSI) applied to CLIR.

History (Continued)
- 1994: 1st PhD thesis on CLIR, by Khaled Radwan.
- 1996: Similarity thesaurus applied to CLIR (ETH Zurich).
- 1996: Dictionary-based retrieval applied to CLIR (UMass & XEROX Grenoble).
- 1997: Generalized Vector Space Model (GVSM) applied to CLIR (CMU).

History (Continued)
- 1997: CLIR (Cross-Language Information Retrieval) track starts within TREC.
- 1998: NTCIR starts in Japan.
- 1999: TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.
- 2000: CLEF starts in Europe.

An Architecture of Multilingual Information Access
[Diagram]

Major Problems of CLIR
- Queries and documents are in different languages: translation
- Words in a query may be ambiguous: disambiguation
- Queries are usually short: expansion

Major Problems of CLIR (Continued)
- Queries may have to be segmented: segmentation
- A document may be written in several languages: language identification

Enhancing Traditional Information Retrieval Systems
- Which part(s) should be modified for CLIR?
[Diagram: Documents, Queries, Document Representation, Query Representation, Comparison; modification points labeled (1) to (4)]

Enhancing Traditional Information Retrieval Systems (Continued)
- (1): text translation
- (2): vector translation
- (3): query translation
- (4): term vector translation
- (1) and (2), (3) and (4): interlingual form
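
As a rough illustration of modification point (4), term vector translation, the sketch below maps a weighted English query vector into Chinese term space through a translation table and then compares it with a document vector by cosine similarity. The slides do not say how the mapping is carried out, so the table entries, the weights, and the names `TRANSLATION_TABLE`, `translate_vector`, and `cosine` are all assumptions made for this example.

```python
import math
from collections import defaultdict

# Hypothetical translation table: source term -> {target term: weight}.
# The entries and weights are invented for illustration.
TRANSLATION_TABLE = {
    "mountain": {"山": 1.0},
    "painting": {"畫": 0.7, "圖": 0.3},
}


def translate_vector(query_vec, table):
    """Map a source-language term vector into target-language term space."""
    target = defaultdict(float)
    for term, weight in query_vec.items():
        # Terms missing from the table contribute nothing (a coverage gap).
        for target_term, share in table.get(term, {}).items():
            target[target_term] += weight * share
    return dict(target)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = {"mountain": 1.0, "painting": 1.0}   # English query representation
doc = {"山": 2.0, "圖": 1.0}                  # Chinese document representation
translated = translate_vector(query, TRANSLATION_TABLE)
score = cosine(translated, doc)
```

Query translation, point (3), would instead rewrite the query text before any vector is built; here the ambiguity of "painting" is carried into the target vector as split weight rather than resolved up front.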

What are the Problems?
- Ambiguous terms (e.g., performance)
- Multiword phrases may correspond to single-word phrases (e.g., South Africa = 南非, Südafrika)
- Coverage of the vocabulary
- There is not a one-to-one mapping between two languages
- Translating queries automatically (lack of syntax)
- Translating documents automatically (performance, ...)
- Computing mixed result lists

Cross-Language Information Retrieval
[Diagram]

Query Translation Based CLIR
English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents

Translating the 400 Million non-English Pages of the WWW
- ... would take 100000 days (300 years) on one fast PC.
- Or, 1 month on 3600 PCs.

Knowledge-Based
- Examples
  - Subject Thesaurus: hierarchical and associative relations; a unique term assigned to each node.
  - Concept List: term space partitioned into concept spaces.
  - Term List: list of cross-language synonyms.
  - Lexicon: machine-readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
- Hull and Grefenstette 1996
- Take the concatenation of all term translations.
  - E: politically motivated civil disturbances
  - F: troubles civils a caractere politique
  - trouble: turmoil, discord, trouble, unrest, disturbance, disorder
  - civil: civil, civilian, courteous
  - caractere: character, nature
  - politique: political, diplomatic, politician, policy
- Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8%.
- Errors: multi-word expressions and ambiguity

Selection Strategy: Select All (Continued)
- Davis 1997 (TREC5)
- Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
- Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%

Evaluation Method
- Average Precision (5-, 9-, 11-points)
- Model
[Diagram: three runs over the TREC Spanish Corpus, each through a Mono IR Engine: Spanish Query; English Query via Bilingual Dictionary to Spanish Equivalents; English Query via POS Bilingual Dictionary to Spanish Equivalents by POS]

Selection Strategy: Select N
- Simple word-by-word translation
  - Each query term is replaced by the word or group of words given for the first sense of the term's definition.
  - 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)
- Word/phrase translation
  - Take at most three translations of each word, one from each of the first three senses.
  - Take the phrase translation if it appears in the dictionary.
  - 30-50% worse than good translation
- Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
  - WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)
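
The selection strategies on the slides above can be sketched in a few lines. This is a minimal illustration, not the code used in the cited experiments: the dictionary reuses the French-to-English entries from the Select All slide, but the grouping of translations into senses, the decision to pass unknown terms through untranslated, and all function names are assumptions. The helper `relative_effectiveness` reads the reported percentages as the ratio of cross-language to monolingual average precision, which is an inference that happens to reproduce the 59.8% and 49.12% figures.

```python
# Toy bilingual dictionary: term -> list of senses, each sense a list of
# translations. Words come from the slide; the sense grouping is assumed.
DICTIONARY = {
    "trouble": [["turmoil", "discord"], ["trouble", "unrest"],
                ["disturbance", "disorder"]],
    "civil": [["civil"], ["civilian"], ["courteous"]],
    "caractere": [["character"], ["nature"]],
    "politique": [["political"], ["diplomatic"], ["politician"], ["policy"]],
}


def select_all(terms, dictionary):
    """Select All: concatenate every translation of every query term."""
    out = []
    for term in terms:
        senses = dictionary.get(term)
        if not senses:
            out.append(term)  # assumption: unknown terms pass through as-is
            continue
        for sense in senses:
            out.extend(sense)
    return out


def select_first_sense(terms, dictionary):
    """Word-by-word: keep the word(s) listed under the first sense only."""
    out = []
    for term in terms:
        senses = dictionary.get(term)
        out.extend(senses[0] if senses else [term])
    return out


def select_n(terms, dictionary, n=3):
    """Take at most n translations per term, one from each of the first n senses."""
    out = []
    for term in terms:
        senses = dictionary.get(term)
        if not senses:
            out.append(term)
            continue
        out.extend(sense[0] for sense in senses[:n])
    return out


def relative_effectiveness(cross_language_ap, monolingual_ap):
    """Cross-language average precision as a percentage of the monolingual run."""
    return round(100.0 * cross_language_ap / monolingual_ap, 2)


# Lemmatized content words of "troubles civils a caractere politique".
query_terms = ["trouble", "civil", "caractere", "politique"]
expanded_all = select_all(query_terms, DICTIONARY)
expanded_first = select_first_sense(query_terms, DICTIONARY)
expanded_top3 = select_n(query_terms, DICTIONARY, n=3)
```

With this toy dictionary, Select All turns a four-word query into fifteen target terms, which shows concretely how ambiguity adds extraneous terms to the translated query.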

Selection Strategy: Select Best N
- Hayashi, Kikui and Susaki 1997
  - Search for a dictionary entry corresponding to the longest sequence of words from left to right.
  - Choose the most frequently used word (or phrase) in a text corpus collected from the WWW.
  - No report for this query translation approach.
- Davis 1997 (TREC5)
  - POS disambiguation
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3%

Corpus-Based Approaches
- Categorization
  - Term-Aligned
  - Sentence-Aligned
  - Document-Aligned (Parallel, Comparable)
  - Unaligned
- Usage
  - Setup Thesaurus
  - Vector Mapping

Term-Aligned Corpora
- Fine-grained alignment in parallel corpora
- Oard 1996
- Term alignment is a challenging problem.
[Diagram: Parallel Bilingual Corpus -> Cooccurrence Statistics -> Translation Tables -> Machine Translation System, which turns an English Query into a Spanish Query]

Sentence-Aligned Corpora
- Davis & Dunning 1996 (TREC4)
- High-frequency Terms

Brief Summary
- Dictionary-based methods
  - Specialized vocabulary not in the dictionaries will not be translated.
  - Ambiguities will add extraneous terms to the query.
- Parallel/comparable corpora-based methods
  - Parallel corpora are not always available.
  - Available corpora tend to be relatively small or to cover only a small number of subjects.
  - Performance depends on how well the corpora are aligned.

Brief Summary (Continued)
- Dictionaries are very useful.
  - Achieve 50% on their own
- Parallel corpora have limitations.
  - Domain shifts
  - Term alignment accuracy
- Dictionaries and corpora are complementary.
  - Dictionaries provide broad and shallow coverage.
  - Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
- What knowledge can be employed?
  - lexical knowledge
  - corpus knowledge
  - ...

Hybrid Methods (Continued)
- Query Expansion
- Issue 1: context
  - Pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
  - Local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.

Hybrid Methods (Continued)
- Issue 2: when
  - before query translation
  - after query translation

Hybrid Methods (Continued)
- Ballesteros & Croft 1997
[Diagram labels, in extraction order: Original Spanish TREC Queries; human translation; English (BASE) Queries; Spanish Queries; automatic dictionary translation; English Queries; query expansion; Spanish Queries; query expansion; Spanish Queries; automatic dictionary translation; INQUERY]

Hybrid Methods (Continued)
- Performance Evaluation
  - Pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
  - Post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
  - Combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
  - 32% below a monolingual baseline

Cross-Language Evaluation Forum
- A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
- Extension of the CLIR track at TREC (1997-1999)

Main Goals
- Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  - CLIR system evaluation, testing and tuning
  - Comparison and discussion of results

CLEF 2000 Task Description
- Four evaluation tracks in CLEF 2000
  - multilingual information retrieval
  - bilingual information retrieval
  - monolingual (non-English) information retrieval
  - domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
- Multi-media: selecting suitable media to represent contents
- Multi-linguality: decreasing the language barriers
- Multi-culture: integrating multiple cultures

NPDM Project
- Palace Museum, Taipei, one of the famous museums in the world
- NSC supports a pioneering study of a digital museum project, NPDM, starting from 2000
  - Enamels from the Ming and Ching Dynasties
  - Famous Album Leaves of the Sung Dynasty
  - Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
- Standardization: a standard metadata protocol is indispensable for the interchange of resources with other museums.
- Multimedia: a suitable presentation scheme is required.
- Internationalization
  - to share the valuable resources of NPDM with users of different languages
  - to utilize knowledge presented in a foreign language

Translingual Issue
- CLIR
  - to allow users to issue queries in one language to access documents in another language
  - the query language is English and the document language is Chinese
- Two common approaches
  - Query translation
  - Document translation

Resources in NPDM pilot
- an enamel, a calligraphy, a painting, or an illustration
- MICI-DC: Metadata Interchange for Chinese Information
- Fields accessible to users
  - Short descriptions vs. full texts
  - Bilingual versions vs. Chinese only
- Fields for maintenance only

Search Modes
- Free search: users describe their information need using natural languages (Chinese or English)
- Specific topic search: users fill in specific fields denoting authors, titles, dates, and so on

Example
- Information need: retrieve “Travelers Among Mountains and Streams, Fan Kuan” (“范寬谿山行旅圖”)
- Possible queries
  - Author: Fan Kuan; Kuan, Fan
  - Time: Sung Dynasty
  - Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  - Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

ECIR in NPDM
[Diagram]

Specific Topic Search
- Proper names are important query terms
  - Creators such as “林逋” (Lin Pu), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
  - Emperors such as “康熙” (Kang-hsi), “乾隆” (Chien-lung), “徽宗” (Hui-tsung), etc.
  - Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ching), etc.

Name Transliteration
- The alphabets of Chinese and English are totally different
- Wade-Giles (WG) and Pinyin are two famous systems to romanize Chinese in libraries
- Backward transliteration
  - Transliterate target language terms back to source language ones
  - Chen, Huang, and Tsai (COLING, 1998)
  - Lin and Chen (ROCLING, 2000)

Name Mapping Table
- Divide a name into a sequence of Chinese characters, and transform each character into phonemes
- Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name
- Example: “林逋” -> “ ” -> “Lin Pu” (WG)

Name Similarity
- Extract the named entity from the query
- Select the most similar named entity from the name mapping table
- Naming sequence/scheme
  - LastName FirstName1, e.g., Chu Hsi (朱熹)
  - FirstName1 LastName, e.g., Hsi Chu (朱熹)
  - LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  - FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  - Any order, e.g., Tao Ning Hsu (許道寧)
  - Any transliteration, e.g., Ju Shi (朱熹)

Title
- “谿山行旅圖”: “Travelers among Mountains and Streams”
- “travelers”, “mountains”, and “streams” are basic components
- Users can express their information need through the d
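
The matching step on the Name Similarity slide can be sketched as follows. The slides do not state which similarity measure is used, so this example substitutes a simple stand-in: names are split into syllables, sorted so that word order and hyphenation stop mattering, and compared with the standard library's `difflib.SequenceMatcher`. The table contents, the function names, and the choice of measure are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical name mapping table: canonical Wade-Giles form -> Chinese name.
NAME_TABLE = {
    "Chu Hsi": "朱熹",
    "Hsu Tao-ning": "許道寧",
    "Fan Kuan": "范寬",
    "Lin Pu": "林逋",
}


def syllables(name):
    """Lowercase a romanized name and split it on spaces, commas and hyphens."""
    return sorted(s for s in re.split(r"[\s,\-]+", name.lower()) if s)


def similarity(a, b):
    """Order-insensitive string similarity between two romanized names."""
    return SequenceMatcher(
        None, " ".join(syllables(a)), " ".join(syllables(b))
    ).ratio()


def best_match(query_name, table=NAME_TABLE):
    """Return (canonical form, Chinese name) of the most similar table entry."""
    canonical = max(table, key=lambda entry: similarity(query_name, entry))
    return canonical, table[canonical]


# Variants taken from the naming schemes listed on the slide.
variants = ["Hsi Chu", "Tao-ning Hsu", "Tao Ning Hsu", "Kuan, Fan", "Ju Shi"]
matches = {v: best_match(v) for v in variants}
```

Sorting the syllables covers the reordered and re-hyphenated variants exactly, while the character-level ratio gives a rough score for a different romanization such as "Ju Shi". A production system would more likely compare at the phoneme level, as the Name Mapping Table slide suggests.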
