Intelligent Information Retrieval (智能信息检索)


Prof. Du Xiaoyong (杜小勇), Renmin University of China; Prof. Wen Ji-Rong (文继荣), Microsoft Research Asia

Course Introduction
Course history:
- On September 27, 2007, we held a joint academic workshop on Internet data management with Microsoft Research Asia, at which Dr. Wen Ji-Rong was appointed as an adjunct researcher.
- In the spring semester of 2008, the Intelligent Information Retrieval course was offered for the first time in cooperation with Microsoft.
- Nine researchers from MSRA gave a total of eleven lectures.
- Reference: 从智能信息检索看微软前瞻性课程 (Microsoft's forward-looking courses seen through Intelligent Information Retrieval), 计算机教育 (Computer Education).

Teaching style
- IR fundamentals plus topical lectures; the topical lectures are given by Microsoft researchers and are very dense in information.
- Assessment: (1) Choose a topic and write a survey-style report covering: What is the research problem? What is the theoretical foundation of the field? Where are the technical difficulties?

What are the current approaches and methods for solving these problems? What are the experimental methods and evaluation systems for studying them? Present your own viewpoint. (2) Hand the printed report to the teaching assistant in the last class; it will be distributed to the relevant teachers for review. (3) Regular assessment, mainly based on participation in the discussions.

Course content
- Fundamentals: basic models, basic implementation techniques, evaluation methods
- Core techniques and systems: Ranking, Information Extraction, Log Mining, System Implementation
- Applications: Image search, Multi-language search, Knowledge Management, ...

Reading Materials:

- R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999
- E. M. Voorhees, D. K. Harman, TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005
- K. S. Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997
- Proceedings of SIGIR, SIGMOD, WWW

Course schedule
http:/
Contact: Du Xiaoyong (杜小勇), Information Building 0459; Wen Ji-Rong (文继荣), JRWEN

Introduction to IR
Prof. Xiaoyong Du

What is Information Retrieval?
- Definition by comparison with related fields.
- Definition by examples: a library system such as CALIS; a search engine such as Google or Baidu.
- Definition by content: IR = <D, Q, R>, where D is the document collection, Q is the set of user queries, and R is the relevance/similarity degree between a query qi and a document dj.

IR System Architecture
The architecture diagram shows the classical IR pipeline (WEB, Crawler, Unstructured Data, Indexer, Index, Ranker, User Interface), extended for Web IR with a Query log / Feedback component, an Extractor, and a Data Miner.

Content
IR model; system architecture; evaluation and benchmark; key techniques.

Key techniques include media-related operators, indexing, information extraction (IE), classification and clustering, link analysis, relevance evaluation, and so on.

Related Areas
Natural language processing, large-scale distributed computing, databases, data mining, information science, artificial intelligence, ...

Model

IR Model
Representation: how to represent a document/query?

Possible representations: bag-of-words, sequence-of-words, links between documents, semantic networks. Similarity/relevance evaluation: sim(dj, q) = ?

Two broad classes of models
- Retrieval models based on textual content: the Boolean model, the vector space model, the probabilistic model, the statistical language model.
- Other retrieval models not based on content: collaboration-based models, models based on link analysis, association-based models.

Classical IR Models - Basic Concepts
Bag-of-words model: each document is represented by a set of representative keywords or index terms.

The importance of the index terms is represented by weights associated with them. Notation:
- ki: an index term
- dj: a document
- t: the total number of index terms
- K = {k1, k2, ..., kt}: the set of all index terms
- wij ≥ 0: a weight associated with the pair (ki, dj); the weight wij quantifies the importance of the index term ki for describing the content of document dj.

wij = 0 indicates that the term does not belong to the document.
- vec(dj) = (w1j, w2j, ..., wtj): the weighted vector associated with document dj
- gi(vec(dj)) = wij: a function that returns the weight of term ki in document dj
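To make the notation concrete, here is a minimal sketch (not from the slides) of the weighted bag-of-words representation, with a toy index-term set K and gi implemented as a simple lookup; all names and weights are illustrative.

```python
# A minimal sketch of the weighted bag-of-words representation: each document
# d_j becomes a weight vector over the index-term set K, and g_i(vec(d_j)) is
# just a positional lookup. Vocabulary and weights are illustrative.

K = ["information", "retrieval", "model", "ranking"]  # index terms k_1 .. k_t

def to_vector(term_weights):
    """Build vec(d_j) = (w_1j, ..., w_tj); terms absent from the doc get weight 0."""
    return [term_weights.get(k, 0.0) for k in K]

def g(i, vec_dj):
    """g_i(vec(d_j)) = w_ij, the weight of term k_i in document d_j."""
    return vec_dj[i]

d1 = to_vector({"information": 0.5, "retrieval": 0.8})
print(d1)        # [0.5, 0.8, 0.0, 0.0]
print(g(1, d1))  # 0.8, the weight of "retrieval" in d1
```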

Classic IR Models - Basic Concepts
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query. A ranking is based on fundamental premises regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance

Each set of premises leads to a distinct IR model.

The Boolean Model
- A simple model based on set theory.
- Queries are specified as Boolean expressions with precise semantics and a neat formalism, e.g. q = ka ∧ (kb ∨ ¬kc).
- Terms are either present or absent, so wij ∈ {0, 1}.
- Consider q = ka ∧ (kb ∨ ¬kc); in disjunctive normal form, vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) over (ka, kb, kc).

vec(qcc) = (1,1,0) is one conjunctive component of vec(qdnf).

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM)

The Boolean Model
For q = ka ∧ (kb ∨ ¬kc):
sim(q, dj) = 1 if there exists vec(qcc) such that vec(qcc) ∈ vec(qdnf) and, for all ki, gi(vec(dj)) = gi(vec(qcc));
sim(q, dj) = 0 otherwise.
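The matching rule above can be written down directly. Below is a minimal sketch (not from the slides), assuming the query has already been converted to disjunctive normal form and each document is just the set of terms it contains; the helper name boolean_sim and the toy data are illustrative.

```python
# A minimal sketch of Boolean retrieval over a DNF query.

def boolean_sim(doc_terms, q_dnf, query_terms):
    """Return 1 iff some conjunctive component of q_dnf agrees with the
    document on the presence/absence of every query term, else 0."""
    for cc in q_dnf:
        if all((t in doc_terms) == bool(cc[t]) for t in query_terms):
            return 1
    return 0

# q = ka AND (kb OR NOT kc)  ->  q_dnf = (1,1,1) OR (1,1,0) OR (1,0,0) over (ka, kb, kc)
query_terms = ["ka", "kb", "kc"]
q_dnf = [{"ka": 1, "kb": 1, "kc": 1},
         {"ka": 1, "kb": 1, "kc": 0},
         {"ka": 1, "kb": 0, "kc": 0}]

print(boolean_sim({"ka", "kb"}, q_dnf, query_terms))  # 1, matches component (1,1,0)
print(boolean_sim({"kb", "kc"}, q_dnf, query_terms))  # 0, ka is missing
```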

Drawbacks of the Boolean Model
- Exact matching, no ranking.
- Awkward: the information need has to be translated into a Boolean expression.
- Too simple: the Boolean queries formulated by users are most often too simplistic.
- Unsatisfactory results: the Boolean model frequently returns either too few or too many documents in response to a user query.

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM)

The Vector Model
- Non-binary weights provide for partial matching.
- These term weights are used to compute a degree of similarity between a query and each document.

- A ranked set of documents provides for better matching.

The Vector Model
Define:
- wij > 0 whenever ki ∈ dj
- wiq ≥ 0 associated with the pair (ki, q)
- vec(dj) = (w1j, w2j, ..., wtj) and vec(q) = (w1q, w2q, ..., wtq)
- Index terms are assumed to occur independently within the documents; that means the vector space is orthonormal.

The t index terms form an orthonormal basis for a t-dimensional space; in this space, queries and documents are represented as weighted vectors.

The Vector Model
sim(q, dj) = cos(θ) = vec(dj) · vec(q) / (|dj| · |q|) = Σi wij · wiq / (|dj| · |q|)
Since wij ≥ 0 and wiq ≥ 0, we have 0 ≤ sim(q, dj) ≤ 1.

A document is retrieved even if it matches the query terms only partially. (The slide's figure shows the document vector dj and the query vector q separated by an angle θ.)

The Vector Model
sim(q, dj) = Σi wij · wiq / (|dj| · |q|); the KEY question is how to compute the weights wij and wiq. A good weight must take two effects into account. The first is quantification of intra-document content (similarity): the tf factor, the term frequency within a document.

The second is quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency. The TF*IDF formula is wij = tf(i,j) · idf(i).

The Vector Model
Let N be the total number of docs in the collection, ni the number of docs that contain ki, and freq(i,j) the raw frequency of ki within dj.

A normalized tf factor is given by tf(i,j) = freq(i,j) / maxl freq(l,j), where the maximum is taken over all terms kl in dj. The idf factor is computed as idf(i) = log(N/ni); the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.

The Vector Model
The tf-idf weighting scheme, wij = tf(i,j) · log(N/ni), is among the best term-weighting schemes. For the query term weights, wiq = (0.5 + 0.5 · freq(i,q) / maxl freq(l,q)) · log(N/ni), or the weights may be specified by the user. The vector model with tf-idf weights is a good ranking strategy for general collections; it is usually as good as the known ranking alternatives, and it is also simple and fast to compute.
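A small sketch (not from the slides) of the tf-idf weighting and cosine ranking just described, using the normalized tf factor, idf(i) = log(N/ni), and the 0.5-augmented query weights; the toy collection and function names are illustrative.

```python
import math
from collections import Counter

# tf(i,j) = freq(i,j) / max_l freq(l,j), idf(i) = log(N / n_i), w_ij = tf * idf;
# query weights use the 0.5-augmented tf variant, ranking uses cosine similarity.

docs = {
    "d1": "information retrieval ranking model".split(),
    "d2": "boolean model set theory".split(),
    "d3": "vector model ranking ranking".split(),
}

N = len(docs)
df = Counter()                               # n_i: number of docs containing term k_i
for terms in docs.values():
    df.update(set(terms))
idf = {t: math.log(N / n) for t, n in df.items()}

def doc_weights(terms):
    freq = Counter(terms)
    max_f = max(freq.values())
    return {t: (f / max_f) * idf[t] for t, f in freq.items()}

def query_weights(terms):
    freq = Counter(terms)
    max_f = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_f) * idf.get(t, 0.0) for t, f in freq.items()}

def cosine(wq, wd):
    dot = sum(wq[t] * wd.get(t, 0.0) for t in wq)
    norm_q = math.sqrt(sum(v * v for v in wq.values()))
    norm_d = math.sqrt(sum(v * v for v in wd.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

wq = query_weights("ranking model".split())
ranking = sorted(((cosine(wq, doc_weights(t)), d) for d, t in docs.items()), reverse=True)
print(ranking)  # documents sorted by decreasing cosine similarity to the query
```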

The Vector Model
Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query

Disadvantages:
- assumes independence of index terms

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM)

Probabilistic Model
Objective: to capture the IR problem within a probabilistic framework. Given a user query, there is an ideal answer set.

Querying is treated as specifying the properties of this ideal answer set (a clustering-like view). But what are these properties? Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set) and improve it by iteration.

Probabilistic Model
Basic ideas:
- An initial set of documents is retrieved somehow.

- The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected).
- The IR system uses this information to refine the description of the ideal answer set.
- By repeating this process, it is expected that the description of the ideal answer set will improve.

- The description of the ideal answer set is modeled in probabilistic terms.

Probabilistic Ranking Principle
The probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations.

Let R be the ideal answer set. But how do we compute the probabilities? What is the sample space?

The Ranking
The probabilistic ranking is computed as sim(q, dj) = P(dj relevant-to q) / P(dj non-relevant-to q).
Definitions: wij ∈ {0, 1}; P(R | vec(dj)) is the probability that the given doc is relevant; P(¬R | vec(dj)) is the probability that the doc is not relevant.

The Ranking
sim(dj, q) = P(R | vec(dj)) / P(¬R | vec(dj))
          = [P(vec(dj) | R) · P(R)] / [P(vec(dj) | ¬R) · P(¬R)]
          ∝ P(vec(dj) | R) / P(vec(dj) | ¬R)
P(vec(dj) | R) is the probability of randomly selecting the document dj from the set R of relevant documents.

The Ranking
Assuming term independence,
sim(dj, q) ∝ [ ∏(gi(dj)=1) P(ki | R) · ∏(gi(dj)=0) P(¬ki | R) ] / [ ∏(gi(dj)=1) P(ki | ¬R) · ∏(gi(dj)=0) P(¬ki | ¬R) ]
P(ki | R) is the probability that the index term ki is present in a document randomly selected from the set R of relevant documents; P(¬ki | R) is the probability that it is not.

The Ranking
Taking logarithms and dropping factors that are constant for all documents given the query:
sim(dj, q) ∝ Σi wiq · wij · ( log[ P(ki | R) / (1 - P(ki | R)) ] + log[ (1 - P(ki | ¬R)) / P(ki | ¬R) ] )
using P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R).

The Initial Ranking
With the same formula, how do we compute the probabilities P(ki | R) and P(ki | ¬R)?

Estimates based on initial assumptions: P(ki | R) = 0.5 and P(ki | ¬R) = ni / N, where ni is the number of docs that contain ki. Use this initial guess to retrieve an initial ranking, then improve upon it.

Improving the Initial Ranking
Keeping the same sim formula, let V be the set of docs initially retrieved (the top-ranked ones) and Vi the subset of retrieved docs that contain ki. Re-evaluate the estimates as P(ki | R) = |Vi| / |V| and P(ki | ¬R) = (ni - |Vi|) / (N - |V|), and repeat recursively.

Improving the Initial Ranking
To avoid problems with |V| = 1 and |Vi| = 0, smooth the estimates:
P(ki | R) = (|Vi| + 0.5) / (|V| + 1),  P(ki | ¬R) = (ni - |Vi| + 0.5) / (N - |V| + 1)
or, alternatively,
P(ki | R) = (|Vi| + ni/N) / (|V| + 1),  P(ki | ¬R) = (ni - |Vi| + ni/N) / (N - |V| + 1).
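As a rough sketch (not from the slides), the per-term weight log[P(ki|R)/(1-P(ki|R))] + log[(1-P(ki|¬R))/P(ki|¬R)] can be computed from the initial guesses and from the smoothed re-estimates; a document's score is then the sum of wiq · wij times this weight over the query terms. All numbers below are made up for illustration.

```python
import math

# Probabilistic-model term weight under the initial guesses and after the
# smoothed re-estimation from a retrieved set V with V_i docs containing k_i.

def initial_term_weight(n_i, N):
    p_r = 0.5             # P(k_i | R), initial assumption
    p_nr = n_i / N        # P(k_i | ~R), initial assumption
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

def reestimated_term_weight(V_i, V, n_i, N):
    p_r = (V_i + 0.5) / (V + 1)               # smoothed P(k_i | R)
    p_nr = (n_i - V_i + 0.5) / (N - V + 1)    # smoothed P(k_i | ~R)
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

# Collection of N = 1000 docs; term k_i occurs in n_i = 50 of them.
print(initial_term_weight(50, 1000))
# After the first round, |V| = 10 docs were retrieved and |V_i| = 6 contain k_i.
print(reestimated_term_weight(6, 10, 50, 1000))
```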

Discussion
- Advantage: docs are ranked in decreasing order of their probability of relevance.
- Disadvantages: the initial estimates for P(ki | R) must be guessed, and the method does not take the tf and idf factors into account.

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM)

Document Representation
Bag-of-words, bag-of-facts, bag-of-sentences, bag-of-nets.

Document = bag of words versus document = bag of sentences, where a sentence is a word sequence, e.g. p(南京市长) · p(江大桥 | 南京市长), p(中国大学人民).

What is a LM?
A "language" is a probability distribution over its alphabet; the distribution reflects how likely any symbol sequence is to be a sentence (or any other linguistic unit) of that language. This probability distribution is called a language model. Given a language, the probability of any "sentence" (symbol string) occurring can be estimated. For example, for English consider p1(a quick brown dog), p2(dog brown a quick), p3(brown dog 棕熊), p4(棕熊). If p1 = p2, the model is called a first-order language model; otherwise it is a higher-order language model.

Basic Notation
- M: the language we are trying to model; it can be thought of as a source.
- s: an observation (a string of tokens).
- P(s | M): the probability of the observation "s" in M, that is, the probability of getting "s" when randomly sampling from M.

Basic Notation
Let S = s1 s2 ... sn be any sentence. Then P(S) = P(s1) · P(s2 | s1) · ... · P(sn | s1, s2, ..., sn-1).

Under an n-gram model, P(si | s1, s2, ..., si-1) = P(si | si-n+1, ..., si-1); for n = 1 (the unigram model), P(si | s1, s2, ..., si-1) = P(si).
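A minimal sketch (not from the slides) of unigram and bigram estimation by simple counting (maximum likelihood, no smoothing), just to make the factorization above concrete; the toy text is illustrative.

```python
from collections import Counter

tokens = "a quick brown dog saw a quick brown fox".split()

uni = Counter(tokens)                   # unigram counts
bi = Counter(zip(tokens, tokens[1:]))   # bigram counts
total = len(tokens)

def p_unigram(w):
    return uni[w] / total

def p_bigram(w, prev):
    # P(s_i | s_{i-1}) = count(prev, w) / count(prev)
    return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0

def p_sentence_unigram(sentence):
    p = 1.0
    for w in sentence.split():
        p *= p_unigram(w)   # word order is ignored under the unigram model
    return p

print(p_sentence_unigram("quick brown dog"))
print(p_bigram("brown", "quick"))  # P(brown | quick) = 1.0 in this toy text
```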

How can we use LMs in IR?
Use an LM to model the process of query generation: every document in a collection defines a "language", and P(s | MD) is the probability that its author would write down the string "s". Now suppose "q" is the user's query: P(q | MD) is the probability of "q" under random sampling from document D, and it can be treated as the rank of document D in the collection.

Major issues in applying LMs
- What kind of language model should we use? Unigram or higher-order models?
- How can we estimate the model parameters? Basic model or advanced model? Data smoothing approaches.

What kind of model is better?
Unigram model, bigram model, or higher-order model?

Unigram LMs
Words are "sampled" independently of each other, so the joint probability decomposes into a product of marginals: P(xyz) = p(x) p(y) p(z), e.g. P("brown dog") = P("dog") · P("brown"). The probabilities are estimated by simple counting.

Higher-order Models
- n-gram: condition on the preceding words
- Cache: condition on a window
- Grammar: condition on a parse tree
Are they useful? Parameter estimation is very expensive!

Comparison
Song and Croft report that mixing a unigram language model with a bigram language model performs about 8% better than using the unigram model alone.

However, Victor Lavrenko points out that the higher-order models used by Song and Croft do not always outperform the unigram-only model. David R. H. Miller also reports that mixing unigram and bigram language models performs better than the unigram model alone. Some studies also suggest that word order has little effect on retrieval results.

Major issues in applying LMs
- What kind of language model should we use? Unigram or higher-order models?
- How can we estimate the model parameters? Basic model or advanced model? Data smoothing approaches.

Estimation of parameters
Given a string of text S (= Q or D), estimate its LM Ms. Basic LMs: maximum-likelihood estimation, the zero-frequency problem, discounting techniques, interpolation methods.

Maximum-likelihood estimation
Let V be the vocabulary of M, Q = q1 q2 ... qm a query with qi ∈ V, and S = d1 d2 ... dn a document.

Let Ms be the language model of S. P(Q | Ms) is called the query likelihood. P(Ms | Q) = P(Q | Ms) P(Ms) / P(Q) ∝ P(Q | Ms) P(Ms) can be treated as the ranking score of document S, so we need to estimate P(Q | Ms) and P(Ms).

Maximum-likelihood estimation
There are two ways to estimate P(Q | Ms): the multivariate Bernoulli model and the multinomial model.

The Bernoulli model only considers whether a word occurs in the query, not how many times; the query is viewed as the outcome sequence of |V| mutually independent Bernoulli trials:
P(Q | Ms) = ∏(w∈Q) P(w | Ms) · ∏(w∉Q) (1 - P(w | Ms))

Maximum-likelihood estimation
The multinomial model views the query as the outcome sequence of a multinomial experiment, so it accounts for how many times each word occurs in the query:
P(Q | Ms) = ∏(qi∈Q) P(qi | Ms) = ∏(w∈Q) P(w | Ms)^#(w,Q)
Both approaches reduce to estimating P(w | Ms); that is, the IR problem is turned into the problem of estimating a document language model, so results from LM research can be reused.

Maximum-likelihood estimation
The simplest approach is maximum-likelihood estimation: count the relative frequencies of words in S, i.e. P(w | Ms) = #(w, S) / |S|. This suffers from the zero-frequency problem (caused by data sparseness): if some event does not occur in S, its estimated probability is 0. That is not correct, and we need to avoid it.
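A minimal sketch (not from the slides) of the maximum-likelihood document model and the multinomial query likelihood, showing how a single unseen query word drives the whole product to zero; the toy document and function names are illustrative.

```python
from collections import Counter

# Maximum-likelihood document model P(w|Ms) = #(w,S)/|S| and the
# multinomial query likelihood, with no smoothing applied.

doc = "intelligent information retrieval uses language models for retrieval".split()
counts = Counter(doc)

def p_ml(w):
    return counts[w] / len(doc)

def query_likelihood(query):
    p = 1.0
    for w in query.split():
        p *= p_ml(w)   # an unseen word contributes probability 0
    return p

print(query_likelihood("information retrieval"))  # > 0
print(query_likelihood("web retrieval"))           # 0.0: "web" never occurs in the doc
```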

Discounting Method
Laplace correction (the add-one approach): add 1 to every count and normalize, P(w | Ms) = (#(w, S) + 1) / (|S| + |V|). This is problematic for large vocabularies (when |V| is too large). Ref.: Chen SF and Goodman JT, An empirical study of smoothing techniques for language modeling, Proc. 34th Annual Meeting of the ACL, 1996.

Smoothing methods
- Additive smoothing methods
- The Jelinek-Mercer method
- The Dirichlet method

Additive smoothing methods
PML(w | Ms) = (#(w, S) + c) / (|S| + c·|V|); when c = 1, this is the Laplace smoothing method.

The Jelinek-Mercer method
Discounting methods treat all unseen words in the same way, but in reality they still differ; we can use some background knowledge (a reference model), for example knowledge of the English language:
P(s | Ms) = λ · PML(s | Ms) + (1 - λ) · P(s | REF)
PML(s | Ms) is the conditional (document) probability and P(s) = P(s | REF) is the prior (background) probability. Set λ to be a constant, independent of the document and the query.
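A rough sketch (not from the slides) of Jelinek-Mercer smoothing in the query-likelihood setting, interpolating the document's ML estimate with a collection-wide reference model; the value λ = 0.7 and the toy collection are illustrative assumptions.

```python
from collections import Counter

# Jelinek-Mercer smoothing: P(w|Ms) = lam * P_ML(w|Ms) + (1 - lam) * P(w|REF),
# where REF is a collection-wide reference (background) model.

collection = {
    "d1": "intelligent information retrieval with language models".split(),
    "d2": "smoothing methods for language models in retrieval".split(),
}

ref_counts = Counter(w for terms in collection.values() for w in terms)
ref_total = sum(ref_counts.values())

def p_jm(w, doc_terms, lam=0.7):
    p_ml = doc_terms.count(w) / len(doc_terms)
    p_ref = ref_counts[w] / ref_total
    return lam * p_ml + (1 - lam) * p_ref

def query_likelihood(query, doc_terms):
    p = 1.0
    for w in query.split():
        p *= p_jm(w, doc_terms)
    return p

query = "smoothing retrieval"
for name, terms in collection.items():
    # no document gets a zero score, unlike the pure maximum-likelihood estimate
    print(name, query_likelihood(query, terms))
```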

The effect of smoothing on retrieval performance
- Zhai CX, Lafferty J, A study of smoothing methods for language models applied to ad hoc information retrieval, ACM SIGIR 2001.
- Zhai CX, Lafferty J, A study of smoothing methods for language models applied to information retrieval, ACM TOIS 22(2): 179-214.
Smoothing plays two roles: estimation, which solves the zero-probability problem, and query modeling, which eliminates or reduces the influence of noise.

Translation Models
Basic LMs do not address word synonymy. P(q | M) = Σw P(w | M) · P(q | w), where P(q | w) captures the relation between q and w: if q and w are near-synonyms, this value is larger. P(q | w) can be computed from word co-occurrence, shared word stems, dictionaries, and so on; this is the key to the method. P(w | M) is the probability of w under the language model. (See the sketch below.)

LM Tools
LEMUR, www.cs.cmu.edu/lemur, a CMU/UMass joint project; C++, good documentation, forum-b
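A minimal sketch (not from the slides) of the translation-model score P(q|M) = Σw P(w|M)·P(q|w); the document model and the translation-table values below are toy assumptions, since in practice P(q|w) would be estimated from co-occurrence, shared stems, or a dictionary.

```python
# Translation-model score for a single query word against a document model.

doc_model = {"automobile": 0.4, "engine": 0.3, "repair": 0.3}  # P(w|M)

translation = {                 # P(q|w), illustrative values
    ("car", "automobile"): 0.8,
    ("car", "engine"): 0.1,
}

def translation_score(q_word, doc_model, translation):
    return sum(p_w * translation.get((q_word, w), 0.0) for w, p_w in doc_model.items())

print(translation_score("car", doc_model, translation))  # 0.4*0.8 + 0.3*0.1 = 0.35
```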
