計(jì)算機(jī)外文翻譯---基于網(wǎng)絡(luò)爬蟲(chóng)的有效url緩存_第1頁(yè)
已閱讀1頁(yè),還剩24頁(yè)未讀 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、<p><b>  外文資料原文</b></p><p>  Efficient URL Caching for World Wide Web Crawling</p><p>  Andrei Z. Broder</p><p>  IBM TJ Watson Research Center</p><p>

2、  19 Skyline Dr</p><p>  Hawthorne, NY 10532</p><p>  abroder@us.ibm.com</p><p>  Marc Najork</p><p>  Microsoft Research</p><p>  1065 La Avenida</p>

3、;<p>  Mountain View, CA 94043</p><p>  najork@microsoft.com</p><p>  Janet L. Wiener</p><p>  Hewlett Packard Labs</p><p>  1501 Page Mill Road</p><p

4、>  Palo Alto, CA 94304</p><p>  janet.wiener@hp.com</p><p><b>  ABSTRACT</b></p><p>  Crawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (

5、b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this pl

6、an from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) m</p><p>

7、;  A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web cr

8、awling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms

9、 with various cache sizes, using actual log data extracted from</p><p>  1. INTRODUCTION</p><p>  A recent Pew Foundation study [31] states that “Search engines have become an indispensable util

10、ity for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In t

11、his paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.</p><p>  Search engines collect pages in many

12、 ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basi

13、c algorithm is</p><p>  (a) Fetch a page</p><p>  (b) Parse it to extract all linked URLs</p><p>  (c) For all the URLs not seen before, repeat (a)–(c)</p><p>  Crawlin

14、g typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected pa

15、ge, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenom

16、enon.</p><p>  If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategi

17、es for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they

18、 are easy to implement and taught in many introductory algorithms classes. (See for inst</p><p>  However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design c

19、hallenge because of the following two factors. </p><p>  1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, histori

20、cally, the web has doubled every 9-12 months.</p><p>  2. Web pages are changing rapidly. If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages th

21、at change by a third or more, about 7% of all web pages change weekly [17]. </p><p>  These two factors imply that to obtain a reasonably fresh and 679 complete snapshot of the web, a search engine must craw

22、l at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs tha

23、t is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicat</p><p>  A crucial way to speed up the membersh

24、ip test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: ran

25、dom replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests.

26、We found that simple c</p><p>  The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an int

27、roduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations

28、are depicted and discussed in Section 5, and our recommendations for practical algorithms and data struct</p><p>  2. CRAWLING</p><p>  Web crawlers are almost as old as the web itself, and nume

29、rous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.&

30、lt;/p><p>  The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to dis

31、k; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.</p><p>  The original Google crawler, described in [7], implements the different crawler components as diffe

32、rent processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URL

33、s, which are then fed to the URL Server. The various processes communicate via the file system.</p><p>  For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses

34、a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component.

35、A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead.</p><p>  Cho and Garcia-Moli

36、na’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL

37、space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning an URL to a C-pro

38、c based on some property of the URL, such</p><p>  The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers an URL for which

39、 it is not responsible, sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.</p><p>  UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of

40、multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.</p&

41、gt;<p>  Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and

42、 schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-m

43、ounted file system. The crawling application reads those saved pages, extrac</p><p>  Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to

44、 download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier

45、as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs),</p><p>  Many of the distributed web crawlers d

46、escribed above, namely Mercator [29], WebFountain [16], UbiCrawler[4], and Cho and Molina’s crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and send

47、s these links to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL

48、 to a peer crawler goe</p><p>  3. CACHING</p><p>  In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between

49、size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hiera

50、rchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a f</p><p>  in the context of a web proxy caching web pages [26, Chapte

51、r 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.</p><p&g

52、t;  Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items.1 At each unit of time, the cache receives a request for an item. If t

53、he requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the

54、 cache. Otherwise, the algorithm must choose either to evict an </p><p>  Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized b

55、y the miss ratio for a given size cache. In general, caching is successful for two reasons:</p><p>  _ Non-uniformity of requests. Some requests are much more popular than others. In our context, for instanc

56、e, a link to yahoo.com is a much more common occurrence than a link to the authors’ home pages.</p><p>  _ Temporal correlation or locality of reference. Current requests are more likely to duplicate request

57、s made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, tem

58、poral correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2, and second, because pages </p><p>  Because of these tw

59、o factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways.</p><p>  We now de

60、scribe some standard caching algorithms, whose performance we evaluate in Section 5.</p><p>  3.1 Infinite cache (INFINITE)</p><p>  This is a theoretical algorithm that assumes that the size of

61、 the cache is larger than the number of distinct requests.</p><p>  3.2 Clairvoyant caching (MIN)</p><p>  More than 35 years ago, L´aszl´o Belady [2] showed that if the entire sequenc

62、e of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because

63、 it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.</p><p>  3.3 Least recently used (LRU)</p><p>  The LRU algorithm evicts the item in

64、the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefor

65、e the number of misses will be minimized in the spirit of Belady’s algorithm.</p><p>  Despite the admonition that “past performance is no guarantee of future results”, sadly verified by the current state of

66、 the stock markets, in practice, LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in c

67、aching situations where the items are large.</p><p><b>  3.4 CLOCK</b></p><p>  CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0;M

68、1; : : : ;Mk corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arri

69、ves, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked locat</p><p>  3.5 Random replacem

70、ent (RANDOM)</p><p>  Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced.</p><p>  In m

71、ost practical situations, random replacement performs worse than CLOCK but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Secti

72、on 6.</p><p>  3.6 Static caching (STATIC)</p><p>  If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any

73、 point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.</p><p>  There are two issues with this appro

74、ach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in

75、most practical situations.</p><p>  In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the cu

76、rrent crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) Of course, for simulation purposes we can do a

77、 first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, </p><p>  The second issue above is the very reason we decided to test STATIC: if STATIC performs

78、well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests

79、are highly correlated.</p><p>  4. EXPERIMENTAL SETUP</p><p>  We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted

80、a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.</p><p>  4.1 Mercator crawler architecture

81、</p><p>  A Mercator crawling system consists of a number of crawling processes, usually </p><p>  running on separate machines. Each crawling process is responsible for a subset of all web serv

82、ers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.</p><p>  Each worker thread repeatedly performs the following operations:

83、it obtains a URL from the URL Frontier, which is a diskbased data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for s

84、hort); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic pr</

85、p><p>  The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL’s host name. Since most links are relative, most of the URLs (81.5% in our experiment)

86、will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes. Both the stream of local URLs and the stream of URLs received from peer crawlers flow in

87、to the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been di</p><p>  Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it p

88、ossible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the

89、disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead</p><p>  Mercator performs an approximation o

90、f a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the

91、scheduling of downloads is moderated by Mercator’s politeness policy, which limits the load placed by the crawler on any particular web server. Mercator’s politeness policy guarantees that no server ever receives multipl

92、e requests from Mercator</p><p>  4.2 Our web crawl</p><p>  Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM,

93、144 GB of disk2, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet’s backbone.</p><p>  The crawl ran from July 12 until Septem

94、ber 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million

95、of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however,

96、 the median number of</p><p>  The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the e

97、ffectiveness of various caching algorithms, we instrumented Mercator’s Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.&

98、lt;/p><p>  After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two

99、URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique finge

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 眾賞文庫(kù)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論