龙星计划课程：信息检索 (Dragon Star Program Course: Information Retrieval) — Course Overview & Background


龙星计划课程：信息检索 (Dragon Star Program Course: Information Retrieval)
Course Overview & Background

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/czhai, czhai@cs.uiuc.edu

Outline
- Course overview
- Essential background
  - Probability & statistics
  - Basic concepts in information theory
  - Natural language processing

Course Overview

Course Objectives
- Introduce the field of information retrieval (IR)
  - Foundation: basic concepts, principles, methods, etc.
  - Trends: frontier topics
- Prepare students to do research in IR and/or related fields
  - Research methodology (general and IR-specific)
  - Research proposal writing
  - Research project (to be finished after the lecture period)

Prerequisites
- Proficiency in programming (C++ is needed for the assignments)
- Knowledge of basic probability & statistics (necessary for understanding the algorithms deeply)
- Big plus: knowledge of related areas
  - Machine learning
  - Natural language processing
  - Data mining
  - ...

Course Management
- Teaching staff
  - Instructor: ChengXiang Zhai (UIUC)
  - Teaching assistants: Hongfei Yan (Peking Univ), Bo Peng (Peking Univ)
- Course website: http:/
- Course group discussion: http:/
- Questions: first post them on the group discussion forum; if they remain unanswered, bring them to the office hours (first office hour: June 23, 2:30-4:30 pm)

Format & Requirements
- Lecture-based
  - Morning lectures: foundation & trends
  - Afternoon lectures: IR research methodology
  - Readings are usually available online
- 2 assignments (based on morning lectures)
  - Coding (C++), experimenting with data, analyzing results, open explorations (5 hours each)
- Final exam (based on morning lectures): 1:30-4:30 pm, June 30; practice questions will be available

Format & Requirements (cont.)
- Course project (Mini-TREC)
  - Work in teams
  - Phase I: create test collections (3 hours, done within the lecture period)
  - Phase II: develop algorithms and submit results (done in the summer)
- Research project proposal (based on afternoon lectures)
  - Work in teams
  - 2-page outline done within the lecture period; full proposal (5 pages) due later

Coverage of Topics: IR vs. TIM
- (Diagram) Text Information Management (TIM); Information Retrieval (IR); multimedia, etc.
- IR and TIM will be used interchangeably

What is Text Info. Management?
- TIM is concerned with technologies for managing and exploiting text information effectively and efficiently
- Importance of managing text information
  - The most natural way of encoding knowledge (think about scientific literature)
  - The most common type of information (how much textual information do you produce and consume every day?)
  - The most basic form of information (it can be used to describe other media of information)
  - The most useful form of information!

Text Management Applications
- (Diagram) Access: select information; Mining: create knowledge; Organization: add structure/annotations

Examples of Text Management Applications
- Search
  - Web search engines (Google, Yahoo, ...)
  - Library systems
- Recommendation
  - News filter
  - Literature/movie recommender
- Categorization
  - Automatically sorting emails
- Mining/Extraction
  - Discovering major complaints from email in customer service
  - Business intelligence
  - Bioinformatics
- Many others

Elements of Text Info Management Technologies
- (Diagram) Components over text: natural language content analysis, search, filtering, categorization, summarization, clustering, extraction, mining, visualization
- (Diagram) Retrieval applications (information access), mining applications (knowledge acquisition), information organization

Text Management and Other Areas
- (Diagram) TM algorithms and TM applications sit between text and the user, drawing on: storage/compression, probabilistic inference, machine learning, natural language processing, human-computer interaction, software engineering, the Web; spanning computer science and information science

Related Areas
- (Diagram) Information retrieval among related areas: databases; library & information science; machine learning, pattern recognition, data mining; natural language processing; statistics, optimization; software engineering, computer systems; applications (Web, bioinformatics); organized along models, algorithms, applications, and systems

Publications/Societies (Incomplete)
- Information retrieval: ACM SIGIR, ACM CIKM, TREC
- Databases: ACM SIGMOD, VLDB, PODS, ICDE
- Learning/mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD
- NLP: ACL, COLING, EMNLP, ANLP, HLT
- Information science: ASIS, JCDL
- Applications: WWW, RECOMB, PSB, ISMB
- Software/systems: SOSP, OSDI
- Statistics

Schedule: available at http:/

Essential Background 1: Probability & Statistics

Prob/Statistics & Text Management
- Probability & statistics provide a principled way to quantify the uncertainties associated with natural language
- They allow us to answer questions like:
  - Given that we observe "baseball" three times and "game" once in a news article, how likely is it about "sports"? (text categorization, information retrieval)
  - Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)

Basic Concepts in Probability
- Random experiment: an experiment with an uncertain outcome (e.g., tossing a coin, picking a word from text)
- Sample space: all possible outcomes, e.g., tossing 2 fair coins, S = {HH, HT, TH, TT}
- Event: E ⊆ S; E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face)
  - Impossible event (∅), certain event (S)
- Probability of an event: 0 ≤ P(E) ≤ 1, s.t. P(S) = 1 (the outcome is always in S)
- P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ (e.g., A = same face, B = different face)
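The two-coin example can be checked by direct enumeration. The following sketch (my own illustration in Python, not part of the lecture; names like `prob` and `same_face` are invented) treats events as subsets of S and assigns each outcome probability 1/|S| under the uniform distribution:

```python
from itertools import product
from fractions import Fraction

# Sample space for tossing 2 fair coins: S = {HH, HT, TH, TT}
S = [''.join(toss) for toss in product('HT', repeat=2)]

def prob(event):
    """Probability of an event (a subset of S) under the uniform distribution."""
    return Fraction(len(event), len(S))

all_heads = {s for s in S if s == 'HH'}          # E = {HH}
same_face = {s for s in S if s[0] == s[1]}       # A = {HH, TT}
diff_face = {s for s in S if s[0] != s[1]}       # B = {HT, TH}

print(prob(all_heads))                   # 1/4
print(prob(same_face), prob(diff_face))  # 1/2 1/2
# A and B are disjoint, so P(A ∪ B) = P(A) + P(B); here A ∪ B = S and P(S) = 1
print(prob(same_face | diff_face) == prob(same_face) + prob(diff_face))  # True
```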

Basic Concepts of Prob. (cont.)
- Conditional probability: P(B|A) = P(A∩B)/P(A)
  - P(A∩B) = P(A)P(B|A) = P(B)P(A|B)
  - So, P(A|B) = P(B|A)P(A)/P(B) (Bayes' rule)
  - For independent events, P(A∩B) = P(A)P(B), so P(A|B) = P(A)
- Total probability: if A1, ..., An form a partition of S, then
  - P(B) = P(B∩S) = P(B∩A1) + ... + P(B∩An)  (why?)
  - So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai) / [P(B|A1)P(A1) + ... + P(B|An)P(An)]
  - This allows us to compute P(Ai|B) based on P(B|Ai)

Interpretation of Bayes' Rule
- Hypothesis space: H = {H1, ..., Hn}; evidence: E
- P(Hi|E) = P(E|Hi)P(Hi) / P(E)
  - Posterior probability of Hi; prior probability of Hi; likelihood of the data/evidence if Hi is true
- If we want to pick the most likely hypothesis H*, we can drop P(E)
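As a concrete check of Bayes' rule and the total-probability expansion, here is a small sketch with invented numbers (the priors and likelihoods below are illustrative only, not from the lecture): two topics form a partition of the hypothesis space, and we compute the posterior of each topic given the evidence that "baseball" occurs.

```python
# Hypothetical priors and likelihoods, chosen only for illustration
prior = {'sport': 0.3, 'other': 0.7}         # P(Hi), over a partition of hypotheses
likelihood = {'sport': 0.6, 'other': 0.05}   # P(E|Hi): prob. of observing "baseball" under each topic

# Total probability: P(E) = sum_i P(E|Hi) * P(Hi)
p_evidence = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' rule: P(Hi|E) = P(E|Hi) * P(Hi) / P(E)
posterior = {h: likelihood[h] * prior[h] / p_evidence for h in prior}
print(posterior)   # {'sport': ~0.84, 'other': ~0.16}

# Picking the most likely hypothesis H* needs only the numerators,
# because P(E) is the same for every Hi
best = max(prior, key=lambda h: likelihood[h] * prior[h])
print(best)        # 'sport'
```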

Random Variable
- X: S → ℝ (a "measure" of the outcome), e.g., number of heads; all same face?
- Events can be defined according to X
  - E(X=a) = {si | X(si) = a}
  - E(X≥a) = {si | X(si) ≥ a}
- So, probabilities can be defined on X
  - P(X=a) = P(E(X=a))
  - P(a≤X) = P(E(a≤X))
- Discrete vs. continuous random variables (think of "partitioning the sample space")
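Continuing the two-coin sketch above (again my own illustration rather than the slide's), a random variable is just a function of the outcome, and events such as E(X=a) inherit their probabilities from the sample space:

```python
from itertools import product
from fractions import Fraction

S = [''.join(t) for t in product('HT', repeat=2)]   # {HH, HT, TH, TT}

def X(outcome):
    """Random variable: number of heads in the outcome."""
    return outcome.count('H')

def prob(event):
    return Fraction(len(event), len(S))

E_X_eq_1 = {s for s in S if X(s) == 1}   # E(X=1)  = {HT, TH}
E_X_ge_1 = {s for s in S if X(s) >= 1}   # E(X>=1) = {HH, HT, TH}
print(prob(E_X_eq_1), prob(E_X_ge_1))    # 1/2 3/4
```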

An Example: Doc Classification
- Sample space: S = {x1, ..., xn} (documents); for 3 topics and four words, n = ?

  Doc   Topic      the  computer  game  baseball
  X1    sport       1      0        1      1
  X2    sport       1      1        1      1
  X3    computer    1      1        0      0
  X4    computer    1      1        1      0
  X5    other       0      0        1      1
  ...

- Events
  - Esport = {xi | topic(xi) = "sport"}
  - Ebaseball = {xi | baseball(xi) = 1}
  - Ebaseball,¬computer = {xi | baseball(xi) = 1 & computer(xi) = 0}
- Conditional probabilities: P(Esport | Ebaseball), P(Ebaseball | Esport), P(Esport | Ebaseball,¬computer), ...
- An inference problem: suppose we observe that "baseball" is mentioned; how likely is the topic to be "sport"?
- Thinking in terms of random variables: Topic T ∈ {"sport", "computer", "other"}, "baseball" B ∈ {0, 1}; P(T="sport"|B=1), P(B=1|T="sport"), ...
- P(T="sport"|B=1) ∝ P(B=1|T="sport") P(T="sport")
- But P(B=1|T="sport") = ?  P(T="sport") = ?
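On just the five documents shown in the table (the trailing "..." means the real collection is larger), the probabilities in this inference problem can be estimated by counting. The sketch below is my own illustration of that counting and of the Bayes'-rule route to the same answer:

```python
# The five example documents from the slide: (topic, the, computer, game, baseball)
docs = [
    ('sport',    1, 0, 1, 1),   # X1
    ('sport',    1, 1, 1, 1),   # X2
    ('computer', 1, 1, 0, 0),   # X3
    ('computer', 1, 1, 1, 0),   # X4
    ('other',    0, 0, 1, 1),   # X5
]
BASEBALL = 4   # column index of the word "baseball"

# Direct counting: P(T="sport" | B=1) = |E_sport ∩ E_baseball| / |E_baseball|
e_baseball = [d for d in docs if d[BASEBALL] == 1]
e_both = [d for d in e_baseball if d[0] == 'sport']
print(len(e_both) / len(e_baseball))     # 2/3 on this tiny sample

# The same quantity via Bayes' rule: P(T="sport"|B=1) = P(B=1|T="sport") P(T="sport") / P(B=1)
n_sport = sum(d[0] == 'sport' for d in docs)
p_sport = n_sport / len(docs)                                                        # 2/5
p_b_given_sport = sum(d[BASEBALL] == 1 for d in docs if d[0] == 'sport') / n_sport   # 1.0
p_b = len(e_baseball) / len(docs)                                                    # 3/5
print(p_b_given_sport * p_sport / p_b)   # again 2/3
```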

Getting to Statistics ...
- P(B=1|T="sport") = ?  (parameter estimation)
- If we see the results of a huge number of random experiments, we can estimate it by the relative frequency of B=1 among the "sport" outcomes
- But what if we only see a small sample (e.g., 2)? Is this estimate still reliable?
- In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)

Parameter Estimation
- General setting: given a (hypothesized & probabilistic) model that governs the random experiment
- The model gives a probability p(D|θ) of any data D, which depends on the parameter θ
- Now, given actual sample data X = {x1, ..., xn}, what can we say about the value of θ?
- Intuitively, take your best guess of θ; "best" means "best explaining/fitting the data"
- Generally an optimization problem

Maximum Likelihood vs. Bayesian
- Maximum likelihood estimation
  - "Best" means "the data likelihood reaches its maximum"
  - Problem: small samples
- Bayesian estimation
  - "Best" means being consistent with our "prior" knowledge while explaining the data well
  - Problem: how to define the prior?
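To make the contrast concrete, the sketch below estimates P(B=1|T="sport") from a tiny sample in both ways: the maximum-likelihood estimate is the relative frequency, while a simple Bayesian point estimate with a Beta prior amounts to adding pseudo-counts. The sample and the prior strength are invented for illustration; the lecture does not prescribe these particular numbers.

```python
# Tiny invented sample: whether "baseball" appeared in two "sport" articles
sample = [1, 1]

# Maximum likelihood: relative frequency -> 1.0, which over-commits on only 2 observations
ml_estimate = sum(sample) / len(sample)

# Bayesian point estimate with a Beta(alpha, beta) prior on the Bernoulli parameter.
# The posterior mean adds alpha and beta as pseudo-counts, pulling the estimate
# toward the prior mean alpha / (alpha + beta); the prior here is a guess.
alpha, beta = 2.0, 2.0
bayes_estimate = (sum(sample) + alpha) / (len(sample) + alpha + beta)

print(ml_estimate)      # 1.0
print(bayes_estimate)   # (2 + 2) / (2 + 4) = 0.666...
```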

Illustration of Bayesian Estimation
- (Figure) Posterior: p(θ|X) ∝ p(X|θ)p(θ)

Maximum Likelihood Estimate
- Data: a document d with counts c(w1), ..., c(wN) and length |d|
- Model: multinomial distribution M with parameters {p(wi)}
- Likelihood: p(d|M)
- Maximum likelihood estimator: M̂ = argmaxM p(d|M)
- We'll tune the p(wi) to maximize l(d|M), using the Lagrange multiplier approach and setting the partial derivatives to zero; the ML estimate is p(wi) = c(wi)/|d|
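The closed form p(wi) = c(wi)/|d| is easy to compute; here is a minimal sketch of the estimator, with a toy document of my own (not from the lecture):

```python
from collections import Counter

def multinomial_mle(doc_tokens):
    """Maximum likelihood estimate of a unigram (multinomial) model: p(w) = c(w) / |d|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)   # document length |d|
    return {w: c / total for w, c in counts.items()}

d = "the baseball game the game".split()   # toy document
print(multinomial_mle(d))
# {'the': 0.4, 'baseball': 0.2, 'game': 0.4}
```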

What You Should Know
- Probability concepts: sample space, event, random variable, conditional probability, multinomial distribution, etc.
- Bayes' formula and its interpretation
- Statistics: know how to compute the maximum likelihood estimate

Essential Background 2: Basic Concepts in Information Theory

Information Theory
- Developed by Shannon in the 1940s
- Maximizing the amount of information that can be transmitted over an imperfect communication channel
- Data compression (entropy)
- Transmission rate (channel capacity)

Basic Concepts in Information Theory
- Entropy: measuring the uncertainty of a random variable
- Kullback-Leibler divergence: comparing two distributions
- Mutual information: measuring the correlation of two random variables

Entropy: Motivation
- Feature selection: if we use only a few words to classify docs, what kind of words should we use?
  - P(Topic | "computer"=1) vs. P(Topic | "the"=1): which is more random?
- Text compression: some documents (less random) can be compressed more than others (more random); can we quantify the "compressibility"?
- In general, given a random variable X following distribution p(X):
  - How do we measure the "randomness" of X?
  - How do we design an optimal coding for X?

Entropy: Definition
- Entropy H(X) measures the uncertainty/randomness of random variable X: H(X) = -Σx p(x) log2 p(x)
- Example: (Figure) H(X) of a coin plotted against P(Head), ranging from 0 up to 1.0 bit
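The definition can be evaluated directly. The sketch below (mine, not the slide's) computes H(X) for a coin with varying P(Head), reproducing the curve in the figure: 0 bits at the extremes and a peak of 1.0 bit at P(Head) = 0.5.

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), with the convention 0 * log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for p_head in [0.0, 0.1, 0.3, 0.5, 0.7, 1.0]:
    print(p_head, entropy([p_head, 1 - p_head]))
# Entropy is 0 for a completely biased coin and peaks at 1 bit for a fair coin
```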

Entropy: Properties
- Minimum value of H(X): 0. What kind of X has the minimum entropy?
- Maximum value of H(X): log M, where M is the number of possible values of X. What kind of X has the maximum entropy?
- Related to coding

Interpretations of H(X)
- Measures the "amount of information" in X
  - Think of each value of X as a "message"
  - Think of X as a random experiment (20 questions)
- Minimum average number of bits needed to compress values of X
  - The more random X is, the harder it is to compress
  - A fair coin has the maximum information, and is hardest to compress
  - A biased coin has some information, and can be compressed to less than 1 bit on average
  - A completely biased coin has no information, and needs only 0 bits

Conditional Entropy
- The conditional entropy of a random variable Y given another X expresses how much extra information one still needs to supply on average to communicate Y, given that the other party already knows X
- H(Topic | "computer") vs. H(Topic | "the")?

Cross Entropy H(p,q)
- What if we encode X with a code optimized for a wrong distribution q? Expected # of bits = H(p,q) = -Σx p(x) log2 q(x)
- Intuitively, H(p,q) ≥ H(p), and this can be shown mathematically
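A quick numeric check of H(p,q) ≥ H(p), with a made-up true distribution p and a wrong coding distribution q (the numbers are illustrative only); the gap H(p,q) - H(p) is the KL divergence mentioned earlier:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected bits per symbol when X ~ p is encoded with a code optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]     # true distribution (illustrative)
q = [1/3, 1/3, 1/3]     # wrong distribution used to design the code

h_p = entropy(p)
h_pq = cross_entropy(p, q)
print(h_p, h_pq, h_pq - h_p)   # H(p) ≈ 1.16, H(p,q) ≈ 1.58, gap ≈ 0.43 ≥ 0
```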
