Computer science foreign-literature translation: Efficient URL caching for World Wide Web crawling (excerpt)


About 4,300 words, 21,000 English characters, 5,500 Chinese characters.

Source: Broder A Z, Najork M, Wiener J L. Efficient URL caching for world wide web crawling[C]// 2003:679-689.

Efficient URL caching for world wide web crawling

Andrei Z. Broder, Marc Najork, Janet L. Wiener

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web [...]

Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

Keywords: Caching, Crawling, Distributed crawlers, URL caching, Web graph models, Web crawlers

1. INTRODUCTION

A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of [...]

Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download. In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today’s web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be config[...]
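The role of the DUE described above can be illustrated with a small sketch. It assumes an LRU cache (one of the policies the paper studies) placed in front of the full set of discovered URLs. The class name `DuplicateUrlEliminator`, the method `is_new`, and the in-memory `seen` set (a stand-in for a much larger, possibly disk-resident store) are hypothetical and not taken from Mercator.

```python
from collections import OrderedDict


class DuplicateUrlEliminator:
    """Toy DUE: a small LRU cache in front of the full set of seen URLs."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.cache = OrderedDict()  # recently seen URLs, least recent first
        self.seen = set()           # assumption: stands in for the full URL set
        self.hits = 0
        self.misses = 0

    def is_new(self, url):
        """Return True if the URL has not been discovered before."""
        if url in self.cache:
            # Cache hit: the URL is a known duplicate, no full lookup needed.
            self.cache.move_to_end(url)
            self.hits += 1
            return False
        # Cache miss: fall back to the (expensive) full membership test.
        self.misses += 1
        new = url not in self.seen
        self.seen.add(url)
        self.cache[url] = True
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used URL
        return new


due = DuplicateUrlEliminator(cache_size=2)
stream = ["a", "b", "a", "c", "b", "a"]
frontier = [u for u in stream if due.is_new(u)]  # URLs forwarded for download
```

Only cache misses reach the full set, so the hit rate directly determines how often the expensive lookup is avoided.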

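Chinese translation (rendered in English)

Efficient URL caching for world wide web crawling

Abstract

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) turn this plan from a trivial programming exercise into a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about 1,000 times per second, and thus the membership test (c) must be executed well over 10,000 times per second against a set too large to hold in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to keep in main memory a (dynamic) subset of the URLs already seen. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl [...]

Keywords: Caching, crawling, distributed crawlers, URL caching, web graph models, web crawlers

1. Introduction

A recent Pew Foundation study states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over half of all Americans had used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of search technology, namely the process of collecting the web pages that eventually constitute the search engine corpus. Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is: Fetch a page; Parse it to extract all linked URLs; F[...]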
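The three-step algorithm from the abstract, steps (a) to (c), can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' crawler: `TOY_WEB` is a hypothetical in-memory link graph standing in for real HTTP fetches, `fetch` and `extract_links` are placeholder helpers, and the breadth-first visiting order is a choice made here for simplicity.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": [],
}


def fetch(url):
    """(a) 'Fetch' a page; here this is only a lookup in the toy graph."""
    return TOY_WEB.get(url, [])


def extract_links(page):
    """(b) 'Parse' the page; a toy page is already its list of links."""
    return list(page)


def crawl(seed):
    """Visit every page reachable from seed, each exactly once."""
    seen = {seed}             # every URL discovered so far
    frontier = deque([seed])  # URLs waiting to be fetched
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in extract_links(fetch(url)):
            if link not in seen:  # (c) the membership test
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl("A")
```

In this toy setting the `seen` set fits in memory; the paper's concern is the case where it does not, so each membership test in step (c) becomes expensive and a cache in front of it pays off.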
