Advanced Search Methods: graduation thesis foreign-literature translation (高级搜索方法毕业论文外文翻译.doc)


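Suzhou University of Science and Technology (苏州科技学院), undergraduate graduation thesis

Appendix A: Translation

Advanced search methods

A search on key words such as linear and algebra can easily turn up hundreds or even thousands of documents, some of which may have nothing to do with linear algebra. If we increase the number of search words and require that every search word be matched, we run the risk of excluding some crucial documents. Rather than matching every word in the expanded search list, our database search should give priority to the documents that match most of the key words with relatively high frequencies. To do this, we need to find the columns of the database matrix A that are closest to the search vector x. One way to measure how closely two vectors are related is to define the angle between them. We will do this in Section 1 of Chapter 5.

After we have learned about the singular value decomposition, we will also revisit the information retrieval application (Chapter 6, Section 5). This decomposition makes it easier to find an approximation to the database matrix, which greatly speeds up searches. It often has the added benefit of filtering out noise; that is, using the approximate version of the database matrix may automatically eliminate documents that use key words in unwanted, irrelevant contexts. For example, a dental student and a mathematics student might both use calculus as one of their search words. A mathematics search that uses the approximate database matrix is likely to eliminate all documents about dentistry. By the same reasoning, mathematics documents would be filtered out of the dental student's search.

Web search and page ranking

Modern Web searches can easily involve billions of documents with hundreds of thousands of key words. In fact, as of March 2004 there were more than 4 billion Web pages on the Internet, and it is not uncommon for search engines to acquire or update as many as 10 billion Web pages in a single day. Although the database matrix for pages on the Internet is extremely large, searches can be simplified greatly because the matrices and search vectors are sparse; that is, most of the entries in any column are 0.

For Internet searches, the better search engines do simple matching searches to find all pages that match the key words, but they do not order them on the basis of the relative frequencies of the key words. Because of the commercial nature of the Internet, people who want to sell products may deliberately repeat key words to ensure that their Web site ranks highly in any relative frequency search. In fact, it is easy to list a key word hundreds of times. If the font color of the word matches the background color of the page, the viewer will not know that the word is repeated.

Web searches need a more sophisticated algorithm for ranking the pages that contain all of the key search words. One such approach rests on a matrix model for assigning probabilities in certain random processes. This type of model is called a Markov process or a Markov chain. In Section 3 of Chapter 6 we will see how to use Markov chains to model Web surfing and obtain rankings of Web pages.

Relative frequency searches

Searches of noncommercial databases generally find all documents containing the key search words and then order the documents by relative frequency. In this case the entries of the database matrix should represent the relative frequencies of the key words in the documents. Suppose, for example, that in the dictionary of all key words of the database the 6th word is algebra and the 8th word is applied, where all words are listed alphabetically. If, say, document 9 in the database contains a total of 200 occurrences of key words from the dictionary, and the word algebra occurs 10 times in the document and the word applied occurs 6 times, then the relative frequencies of these words are 10/200 and 6/200, and these are the corresponding entries of the database matrix.

Appendix B: Foreign-language original

Advanced search methods

A search for key words such as linear and algebra could easily turn up hundreds of documents, some of which may not even be about linear algebra. If we were to increase the number of search words and require that all search words be matched, then we would run the risk of excluding some crucial linear algebra documents. Rather than match all words of the expanded search list, our database search should give priority to those documents that match most of the key words with high relative frequencies. To accomplish this, we need to find the columns of the database matrix A that are "closest" to the search vector x. One way to measure how close two vectors are is to define the angle between the vectors. We will do this in Section 1 of Chapter 5.

We will also revisit the information retrieval application after we have learned about the singular value decomposition (Chapter 6, Section 5). This decomposition can be used to find a simpler approximation to the database matrix, which will speed up the searches dramatically. Often it has the added advantage of filtering out noise; that is, using the approximate version of the database matrix may automatically have the effect of eliminating documents that use key words in unwanted contexts. For example, a dental student and a mathematics student could both use calculus as one of their search words. A mathematics search using an approximate database matrix is likely to eliminate all documents relating to dentistry. Similarly, the mathematics documents would be filtered out in the dental student's search.

Web search and page ranking

Modern Web searches could easily involve billions of documents with hundreds of thousands of key words. Indeed, as of March 2004, there were more than 4 billion Web pages on the Internet, and it is not uncommon for search engines to acquire or update as many as 10 billion Web pages in a single day. Although the database matrix for pages on the Internet is extremely large, searches can be simplified dramatically since the matrices and search vectors are sparse; that is, most of the entries in any column are 0s.

For Internet searches, the better search engines will do simple matching searches to find all pages matching the key words, but they will not order them on the basis of the relative frequencies of the key words. Because of the commercial nature of the Internet, people who want to sell products may deliberately make repeated use of key words to ensure that their Web site is highly ranked in any relative frequency search. In fact, it is easy to surreptitiously list a key word hundreds of times. If the font color of the word matches the background color of the page, then the viewer will not be aware that the word is listed repeatedly.

For Web searches, a more sophisticated algorithm is necessary for ranking the pages that contain all of the key search words. One such approach rests on a matrix model for assigning probabilities in certain random processes. This type of model is referred to as a Markov process or a Markov chain. In Section 3 of Chapter 6 we will see how to use Markov chains to model Web surfing and obtain rankings of Web pages.

Relative frequency searches

Searches of noncommercial databases generally find all documents containing the key search words and then order the documents based on the relative frequencies. In this case the entries of the database matrix should represent the relative frequencies of the key words in the documents. Suppose, for example, that in the dictionary of all key words of the database the 6th word is algebra and the 8th word is applied, where all words are listed alphabetically. If, say, document 9 in the database contains a total of 200 occurrences of key words from the dictionary, and if the word algebra occurred 10 times in the document and the word applied occurred 6 times, then the relative frequencies for these words would be 10/200 and 6/200, and the corresponding entries in the database matrix would be 0.05 (row 6, column 9) and 0.03 (row 8, column 9).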
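The arithmetic for document 9 above can be sketched in Python. This is only an illustration under stated assumptions, not the textbook's own procedure: the function name `relative_frequencies` and the placeholder entry that lumps together the remaining 184 key-word occurrences are hypothetical; only the counts 10, 6, and 200 come from the text.

```python
from fractions import Fraction


def relative_frequencies(keyword_counts):
    """Return each key word's share of all key-word occurrences in one document."""
    total = sum(keyword_counts.values())
    if total == 0:
        # A document with no key words gets an all-zero column.
        return {word: Fraction(0) for word in keyword_counts}
    return {word: Fraction(count, total) for word, count in keyword_counts.items()}


# Counts for document 9. "algebra" and "applied" are taken from the text;
# the placeholder key is an assumption that stands in for the other
# 184 key-word occurrences so that the total is 200.
doc9_counts = {"algebra": 10, "applied": 6, "(other key words)": 184}

# This dictionary plays the role of column 9 of the database matrix.
doc9_column = relative_frequencies(doc9_counts)

print(float(doc9_column["algebra"]))  # 0.05
print(float(doc9_column["applied"]))  # 0.03
```

Exact fractions are used so that 10/200 and 6/200 are stored without rounding; converting to `float` only at the end reproduces the two decimal values.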

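The earlier remark that the closeness of two vectors can be measured by the angle between them can also be sketched as a ranking routine. The text defers the definition to Chapter 5, so what follows assumes the standard formula, cosine of the angle = (x · y) / (‖x‖ ‖y‖). The toy matrix of three key words and three documents, the query vector, and the names `cosine` and `rank_documents` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; treat it as no match.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(columns, query):
    """Return document indices ordered from closest to farthest from the query."""
    scores = [(cosine(column, query), j) for j, column in enumerate(columns)]
    # Larger cosine means smaller angle; ties are broken by document index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scores]


# Hypothetical columns of a database matrix: one list per document,
# one relative frequency per key word.
columns = [
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 0.8],
    [0.9, 0.0, 0.1],
]
query = [1.0, 1.0, 0.0]  # search vector marking the first two key words

ranking = rank_documents(columns, query)
print(ranking)  # [0, 2, 1]
```

Document 0 uses both query words in equal proportion, so its column points in the same direction as the search vector and ranks first. Document 1 is dominated by a word that is not in the query, so it ranks last.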