版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領
文檔簡介
1、Web Crawling,Based on the slides by Filippo Menczer @Indiana University School of Informatics in Web Data Mining by Bing Liu,,1,Outline,Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUnivers
2、al crawlersCrawler ethics and conflicts,,2,Q: How does a search engine know that all these pages contain the query terms? A: Because all of those pages have been crawled,,3,,Crawler:basic idea,starting pages(seeds)
3、,,,,4,Many names,CrawlerSpiderRobot (or bot)Web agentWanderer, worm, …And famous instances: googlebot, scooter, slurp, msnbot, …,,5,Googlebot & you,,6,Motivation for crawlers,Support universal search engines (Go
4、ogle, Yahoo, MSN/Windows Live, Ask, etc.)Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.Business intelligence: keep track of potential competitors, partnersMonitor Web sites
5、of interestEvil: harvest emails for spamming, phishing…… Can you think of some others?…,,7,A crawler within a search engine,,8,Web,Text index,PageRank,,,Page repository,,,,googlebot,,Text & link analysis,,Query,hit
6、s,,Ranker,,,,,One taxonomy of crawlers,Many other criteria could be used:Incremental, Interactive, Concurrent, Etc.,,9,Outline,Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUniversal crawl
7、ersCrawler ethics and conflicts,,10,Basic crawlers,This is a sequential crawlerSeeds can be any list of starting URLsOrder of page visits is determined by frontier data structureStop criterion can be anything,Graph t
8、raversal (BFS or DFS?),Breadth First SearchImplemented with QUEUE (FIFO) Finds pages along shortest pathsIf we start with “good” pages, this keeps us close; maybe other good stuff…Depth First SearchImplemented with
9、 STACK (LIFO)Wander away (“l(fā)ost in cyberspace”),,12,A basic crawler in Perl,Queue: a FIFO list (shift and push)my @frontier = read_seeds($file);while (@frontier && $tot < $max) {my $next_link = shift @fron
10、tier;my $page = fetch($next_link);add_to_index($page);my @links = extract_links($page, $next_link);push @frontier, process(@links);}A workable example,,13,Implementation issues,Don’t want to fetch same page twice!
11、Keep lookup table (hash) of visited pagesWhat if not visited but in frontier already?The frontier grows very fast!May need to prioritize for large crawlsFetcher must be robust!Don’t crash if download failsTimeout m
12、echanismDetermine file type to skip unwanted filesCan try using extensions, but not reliableCan issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests,,14,More implement
13、ation issues,FetchingGet only the first 10-100 KB per pageTake care to detect and break redirection loopsSoft fail for timeout, server not responding, file not found, and other errors,,15,More implementation issues: P
14、arsing,HTML has the structure of a DOM (Document Object Model) treeUnfortunately actual HTML is often incorrect in a strict syntactic senseCrawlers, like browsers, must be robust/forgivingFortunately there are tools t
15、hat can helpE.g. tidy.sourceforge.netMust pay attention to HTML entities and unicode in textWhat to do with a growing number of other formats?Flash, SVG, RSS, AJAX…,More implementation issues,Stop wordsNoise words t
16、hat do not carry meaning should be eliminated (“stopped”) before they are indexed E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…Typically syntactic markersTypically the most common termsTypically kept in a negat
17、ive dictionary10–1,000 elementsE.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_wordsParser can detect these right away and disregard them,,17,More implementation issues,Conflation and thesauriIdea: impro
18、ve recall by merging words with same meaningWe want to ignore superficial morphological features, thus merge semantically similar tokens{student, study, studying, studious} => studiWe can also conflate synonyms int
19、o a single form using a thesaurus30-50% smaller indexDoing this in both pages and queries allows to retrieve pages about ‘automobile’ when user asks for ‘car’Thesaurus can be implemented as a hash table,,18,More imple
20、mentation issues,StemmingMorphological conflation based on rewrite rulesLanguage dependent!Porter stemmer very popular for Englishhttp://www.tartarus.org/~martin/PorterStemmer/Context-sensitive grammar rules, eg:“I
21、ES” except (“EIES” or “AIES”) --> “Y”Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.Porter has also developed Snowball, a language to create stemming algorithms in any languagehttp://snowball.tartarus.org/E
22、x. Perl modules: Lingua::Stem and Lingua::Stem::Snowball,,19,More implementation issues,Static vs. dynamic pagesIs it worth trying to eliminate dynamic pages and only index static pages?Examples:http://www.census.gov/
23、cgi-bin/gazetteerhttp://informatics.indiana.edu/research/colloquia.asphttp://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452http://www.imdb.com/Name?Menczer,+Ericohttp://www.imdb.com/name/nm057880
24、1/Why or why not? How can we tell if a page is dynamic? What about ‘spider traps’?What do Google and other search engines do?,,20,More implementation issues,Relative vs. Absolute URLsCrawler must translate relative UR
25、Ls into absolute URLsNeed to obtain Base URL from HTTP header, or HTML Meta tag, or else current page path by defaultExamplesBase: http://www.cnn.com/linkto/Relative URL: intl.htmlAbsolute URL: http://www.cnn.com/li
26、nkto/intl.htmlRelative URL: /US/Absolute URL: http://www.cnn.com/US/,,21,More implementation issues,URL canonicalizationAll of these:http://www.cnn.com/TECHhttp://WWW.CNN.COM/TECH/http://www.cnn.com:80/TECH/http:/
27、/www.cnn.com/bogus/../TECH/Are really equivalent to this canonical form:http://www.cnn.com/TECH/In order to avoid duplication, the crawler must transform all URLs into canonical formDefinition of “canonical” is arbit
28、rary, e.g.:Could always include portOr only include port when not default :80,,22,More on Canonical URLs,Some transformation are trivial, for example:http://informatics.indiana.eduhttp://informatics.indiana.edu/ht
29、tp://informatics.indiana.edu/index.html#fragmenthttp://informatics.indiana.edu/index.htmlhttp://informatics.indiana.edu/dir1/./../dir2/http://informatics.indiana.edu/dir2/http://informatics.indiana.edu/%7Efil/http
30、://informatics.indiana.edu/~fil/http://INFORMATICS.INDIANA.EDU/fil/http://informatics.indiana.edu/fil/,,23,More on Canonical URLs,Other transformations require heuristic assumption about the intentions of the author o
31、r configuration of the Web server:Removing default file namehttp://informatics.indiana.edu/fil/index.htmlhttp://informatics.indiana.edu/fil/This is reasonable in general but would be wrong in this case because the de
32、fault happens to be ‘default.asp’ instead of ‘index.html’Trailing directoryhttp://informatics.indiana.edu/filhttp://informatics.indiana.edu/fil/This is correct in this case but how can we be sure in general that ther
33、e isn’t a file named ‘fil’ in the root dir?,,24,More implementation issues,Spider trapsMisleading sites: indefinite number of pages dynamically generated by CGI scripts Paths of arbitrary depth created using soft direc
34、tory links and path rewriting features in HTTP serverOnly heuristic defensive measures:Check URL length; assume spider trap above some threshold, for example 128 charactersWatch for sites with very large number of URL
35、sEliminate URLs with non-textual data typesMay disable crawling of dynamic pages, if can detect,,25,More implementation issues,Page repositoryNaïve: store each page as a separate fileCan map URL to unique filena
36、me using a hashing function, e.g. MD5This generates a huge number of files, which is inefficient from the storage perspectiveBetter: combine many pages into a single large file, using some XML markup to separate and id
37、entify themMust map URL to {filename, page_id}Database optionsAny RDBMS -- large overheadLight-weight, embedded databases such as Berkeley DB,,26,Concurrency,A crawler incurs several delays:Resolving the host name i
38、n the URL to an IP address using DNSConnecting a socket to the server and sending the requestReceiving the requested page in responseSolution: Overlap the above delays by fetching many pages concurrently,,27,Architect
39、ure of a concurrent crawler,,28,Concurrent crawlers,Can use multi-processing or multi-threadingEach process or thread works like a sequential crawler, except they share data structures: frontier and repositoryShared da
40、ta structures must be synchronized (locked for concurrent writes)Speedup of factor of 5-10 are easy this way,,29,Outline,Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUniversal crawlersCr
41、awler ethics and conflicts,,30,Universal crawlers,Support universal search enginesLarge-scaleHuge cost (network bandwidth) of crawl is amortized over many queries from usersIncremental updates to existing index and ot
42、her data repositories,,31,Large-scale universal crawlers,Two major issues:PerformanceNeed to scale up to billions of pagesPolicyNeed to trade-off coverage, freshness, and bias (e.g. toward “important” pages),,32,Larg
43、e-scale crawlers: scalability,Need to minimize overhead of DNS lookupsNeed to optimize utilization of network bandwidth and disk throughput (I/O is bottleneck)Use asynchronous socketsMulti-processing or multi-threadin
44、g do not scale up to billions of pagesNon-blocking: hundreds of network connections open simultaneouslyPolling socket to monitor completion of network transfers,,33,High-level architecture of a scalable universal crawl
45、er,,34,Several parallel queues to spread load across servers (keep connections alive),DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching,Optimize use of network bandwidth,Opt
46、imize disk I/O throughput,Huge farm of crawl machines,Universal crawlers: Policy,CoverageNew pages get added all the timeCan the crawler find every page?FreshnessPages change over time, get removed, etc.How frequent
47、ly can a crawler revisit ?Trade-off!Focus on most “important” pages (crawler bias)?“Importance” is subjective,,35,Web coverage by search engine crawlers,This assumes we know the size of the entire the Web. Do we? Can
48、you define “the size of the Web”?,Maintaining a “fresh” collection,Universal crawlers are never “done”High variance in rate and amount of page changesHTTP headers are notoriously unreliableLast-modifiedExpiresSoluti
49、onEstimate the probability that a previously visited page has changed in the meanwhilePrioritize by this probability estimate,,37,Estimating page change rates,Algorithms for maintaining a crawl in which most pages are
50、fresher than a specified epochBrewington & Cybenko; Cho, Garcia-Molina & PageAssumption: recent past predicts the future (Ntoulas, Cho & Olston 2004)Frequency of change not a good predictorDegree of chang
51、e is a better predictor,,38,Do we need to crawl the entire Web?,If we cover too much, it will get staleThere is an abundance of pages in the WebFor PageRank, pages with very low prestige are largely uselessWhat is th
52、e goal?General search engines: pages with high prestige News portals: pages that change oftenVertical portals: pages on some topicWhat are appropriate priority measures in these cases? Approximations?,,39,Breadth-fir
53、st crawlers,BF crawler tends to crawl high-PageRank pages very earlyTherefore, BF crawler is a good baseline to gauge other crawlersBut why is this so?,Najork and Weiner 2001,Bias of breadth-first crawlers,The structur
54、e of the Web graph is very different from a random networkPower-law distribution of in-degreeTherefore there are hub pages with very high PR and many incoming linksThese are attractors: you cannot avoid them!,Outline,
55、Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUniversal crawlersCrawler ethics and conflicts,,42,Crawler ethics and conflicts,Crawlers can cause trouble, even unwillingly, if not properly
56、designed to be “polite” and “ethical”For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!Server administrator and users will be upsetCrawler dev
57、eloper/admin IP address may be blacklisted,,43,Crawler etiquette (important!),Identify yourselfUse ‘User-Agent’ HTTP header to identify crawler, website with description of crawler and contact information for crawler de
58、veloperUse ‘From’ HTTP header to specify crawler developer emailDo not disguise crawler as a browser by using their ‘User-Agent’ stringAlways check that HTTP requests are successful, and in case of error, use HTTP err
59、or code to determine and immediately address problemPay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:redirection loopsspider traps,,44,Crawler etiquette (important
60、!),Spread the load, do not overwhelm a serverMake sure that no more than some max. number of requests to any single server per unit time, say < 1/secondHonor the Robot Exclusion ProtocolA server can specify which p
61、arts of its document tree any crawler is or is not allowed to crawl by a file named ‘robots.txt’ placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txtCrawler should always check, parse, and obey this
62、 file before sending any requests to a serverMore info at:http://www.google.com/robots.txthttp://www.robotstxt.org/wc/exclusion.html,,45,More on robot exclusion,Make sure URLs are canonical before checking against rob
63、ots.txt Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawlerLet’s look at some examples to understand the protocol…,,46,www.apple.com/robots.txt,,47,# robots.txt for
64、http://www.apple.com/ User-agent: *Disallow:,All crawlers…,…can go anywhere!,www.microsoft.com/robots.txt,,48,# Robots.txt file for http://www.microsoft.com User-agent: *Disallow: /canada/Library/mnp/2/aspx/Disall
65、ow: /communities/bin.aspxDisallow: /communities/eventdetails.mspxDisallow: /communities/blogs/PortalResults.mspxDisallow: /communities/rss.aspxDisallow: /downloads/Browse.aspxDisallow: /downloads/info.aspxDisallow:
66、 /france/formation/centres/planning.aspDisallow: /france/mnp_utility.mspxDisallow: /germany/library/images/mnp/Disallow: /germany/mnp_utility.mspxDisallow: /ie/ie40/Disallow: /info/customerror.htmDisallow: /info/sm
67、art404.aspDisallow: /intlkb/Disallow: /isapi/#etc…,All crawlers…,…are not allowed in these paths…,,www.springer.com/robots.txt,,49,# Robots.txt for http://www.springer.com (fragment)User-agent: GooglebotDisallow: /
68、chl/*Disallow: /uk/*Disallow: /italy/*Disallow: /france/*User-agent: slurpDisallow:Crawl-delay: 2User-agent: MSNBotDisallow:Crawl-delay: 2User-agent: scooterDisallow:# all othersUser-agent: *Disallow: /
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經權益所有人同意不得將文件中的內容挪作商業(yè)或盈利用途。
- 5. 眾賞文庫僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
- 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- lecture graph
- lecture one
- lecture9
- ec111lecture15
- ec111lecture15
- aprimerineconometrictheory2016lecture_1
- 【托福聽力備考】tpo12聽力文本——lecture 3
- lecture7(英文)
- lecture8(英文)
- lecture4(ch4)
- lecture6_1設計
- ec111_lecture2
- The History of Civilization in Europe(from Lecture Six to Lecture Seven)漢譯實踐及譯后總結_15926.pdf
- c型題破題思路、步驟lecture
- The History of Civilization in Europe(from Lecture Six to Lecture Seven)漢譯實踐及譯后總結.pdf
- 通信原理lecture4 sampling-
- lecture 6 - macro environmental factors affecting crm
- lecture 4 移動通信網絡
- 通信原理lecture1 class introduction-
- 專業(yè)八級mini-lecture填詞總結
評論
0/150
提交評論