lecture3_第1頁
已閱讀1頁,還剩51頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領

文檔簡介

1、Web Crawling,Based on the slides by Filippo Menczer @Indiana University School of Informatics in Web Data Mining by Bing Liu,,1,Outline,Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUnivers

2、al crawlersCrawler ethics and conflicts,,2,Q: How does a search engine know that all these pages contain the query terms? A: Because all of those pages have been crawled,,3,,Crawler:basic idea,starting pages(seeds)

3、,,,,4,Many names,CrawlerSpiderRobot (or bot)Web agentWanderer, worm, …And famous instances: googlebot, scooter, slurp, msnbot, …,,5,Googlebot & you,,6,Motivation for crawlers,Support universal search engines (Go

4、ogle, Yahoo, MSN/Windows Live, Ask, etc.)Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.Business intelligence: keep track of potential competitors, partnersMonitor Web sites

5、of interestEvil: harvest emails for spamming, phishing…… Can you think of some others?…,,7,A crawler within a search engine,,8,Web,Text index,PageRank,,,Page repository,,,,googlebot,,Text & link analysis,,Query,hit

6、s,,Ranker,,,,,One taxonomy of crawlers,Many other criteria could be used:Incremental, Interactive, Concurrent, Etc.,,9,Outline,Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUniversal crawl

7、ersCrawler ethics and conflicts,,10,Basic crawlers,This is a sequential crawlerSeeds can be any list of starting URLsOrder of page visits is determined by frontier data structureStop criterion can be anything,Graph t

8、raversal (BFS or DFS?),Breadth First SearchImplemented with QUEUE (FIFO) Finds pages along shortest pathsIf we start with “good” pages, this keeps us close; maybe other good stuff…Depth First SearchImplemented with

9、 STACK (LIFO)Wander away (“l(fā)ost in cyberspace”),,12,A basic crawler in Perl,Queue: a FIFO list (shift and push)my @frontier = read_seeds($file);while (@frontier && $tot < $max) {my $next_link = shift @fron

10、tier;my $page = fetch($next_link);add_to_index($page);my @links = extract_links($page, $next_link);push @frontier, process(@links);}A workable example,,13,Implementation issues,Don’t want to fetch same page twice!

11、Keep lookup table (hash) of visited pagesWhat if not visited but in frontier already?The frontier grows very fast!May need to prioritize for large crawlsFetcher must be robust!Don’t crash if download failsTimeout m

12、echanismDetermine file type to skip unwanted filesCan try using extensions, but not reliableCan issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests,,14,More implement

13、ation issues,FetchingGet only the first 10-100 KB per pageTake care to detect and break redirection loopsSoft fail for timeout, server not responding, file not found, and other errors,,15,More implementation issues: P

14、arsing,HTML has the structure of a DOM (Document Object Model) treeUnfortunately actual HTML is often incorrect in a strict syntactic senseCrawlers, like browsers, must be robust/forgivingFortunately there are tools t

15、hat can helpE.g. tidy.sourceforge.netMust pay attention to HTML entities and unicode in textWhat to do with a growing number of other formats?Flash, SVG, RSS, AJAX…,More implementation issues,Stop wordsNoise words t

16、hat do not carry meaning should be eliminated (“stopped”) before they are indexed E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…Typically syntactic markersTypically the most common termsTypically kept in a negat

17、ive dictionary10–1,000 elementsE.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_wordsParser can detect these right away and disregard them,,17,More implementation issues,Conflation and thesauriIdea: impro

18、ve recall by merging words with same meaningWe want to ignore superficial morphological features, thus merge semantically similar tokens{student, study, studying, studious} => studiWe can also conflate synonyms int

19、o a single form using a thesaurus30-50% smaller indexDoing this in both pages and queries allows to retrieve pages about ‘automobile’ when user asks for ‘car’Thesaurus can be implemented as a hash table,,18,More imple

20、mentation issues,StemmingMorphological conflation based on rewrite rulesLanguage dependent!Porter stemmer very popular for Englishhttp://www.tartarus.org/~martin/PorterStemmer/Context-sensitive grammar rules, eg:“I

21、ES” except (“EIES” or “AIES”) --> “Y”Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.Porter has also developed Snowball, a language to create stemming algorithms in any languagehttp://snowball.tartarus.org/E

22、x. Perl modules: Lingua::Stem and Lingua::Stem::Snowball,,19,More implementation issues,Static vs. dynamic pagesIs it worth trying to eliminate dynamic pages and only index static pages?Examples:http://www.census.gov/

23、cgi-bin/gazetteerhttp://informatics.indiana.edu/research/colloquia.asphttp://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452http://www.imdb.com/Name?Menczer,+Ericohttp://www.imdb.com/name/nm057880

24、1/Why or why not? How can we tell if a page is dynamic? What about ‘spider traps’?What do Google and other search engines do?,,20,More implementation issues,Relative vs. Absolute URLsCrawler must translate relative UR

25、Ls into absolute URLsNeed to obtain Base URL from HTTP header, or HTML Meta tag, or else current page path by defaultExamplesBase: http://www.cnn.com/linkto/Relative URL: intl.htmlAbsolute URL: http://www.cnn.com/li

26、nkto/intl.htmlRelative URL: /US/Absolute URL: http://www.cnn.com/US/,,21,More implementation issues,URL canonicalizationAll of these:http://www.cnn.com/TECHhttp://WWW.CNN.COM/TECH/http://www.cnn.com:80/TECH/http:/

27、/www.cnn.com/bogus/../TECH/Are really equivalent to this canonical form:http://www.cnn.com/TECH/In order to avoid duplication, the crawler must transform all URLs into canonical formDefinition of “canonical” is arbit

28、rary, e.g.:Could always include portOr only include port when not default :80,,22,More on Canonical URLs,Some transformation are trivial, for example:http://informatics.indiana.eduhttp://informatics.indiana.edu/ht

29、tp://informatics.indiana.edu/index.html#fragmenthttp://informatics.indiana.edu/index.htmlhttp://informatics.indiana.edu/dir1/./../dir2/http://informatics.indiana.edu/dir2/http://informatics.indiana.edu/%7Efil/http

30、://informatics.indiana.edu/~fil/http://INFORMATICS.INDIANA.EDU/fil/http://informatics.indiana.edu/fil/,,23,More on Canonical URLs,Other transformations require heuristic assumption about the intentions of the author o

31、r configuration of the Web server:Removing default file namehttp://informatics.indiana.edu/fil/index.htmlhttp://informatics.indiana.edu/fil/This is reasonable in general but would be wrong in this case because the de

32、fault happens to be ‘default.asp’ instead of ‘index.html’Trailing directoryhttp://informatics.indiana.edu/filhttp://informatics.indiana.edu/fil/This is correct in this case but how can we be sure in general that ther

33、e isn’t a file named ‘fil’ in the root dir?,,24,More implementation issues,Spider trapsMisleading sites: indefinite number of pages dynamically generated by CGI scripts Paths of arbitrary depth created using soft direc

34、tory links and path rewriting features in HTTP serverOnly heuristic defensive measures:Check URL length; assume spider trap above some threshold, for example 128 charactersWatch for sites with very large number of URL

35、sEliminate URLs with non-textual data typesMay disable crawling of dynamic pages, if can detect,,25,More implementation issues,Page repositoryNaïve: store each page as a separate fileCan map URL to unique filena

36、me using a hashing function, e.g. MD5This generates a huge number of files, which is inefficient from the storage perspectiveBetter: combine many pages into a single large file, using some XML markup to separate and id

37、entify themMust map URL to {filename, page_id}Database optionsAny RDBMS -- large overheadLight-weight, embedded databases such as Berkeley DB,,26,Concurrency,A crawler incurs several delays:Resolving the host name i

38、n the URL to an IP address using DNSConnecting a socket to the server and sending the requestReceiving the requested page in responseSolution: Overlap the above delays by fetching many pages concurrently,,27,Architect

39、ure of a concurrent crawler,,28,Concurrent crawlers,Can use multi-processing or multi-threadingEach process or thread works like a sequential crawler, except they share data structures: frontier and repositoryShared da

40、ta structures must be synchronized (locked for concurrent writes)Speedup of factor of 5-10 are easy this way,,29,Outline,Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUniversal crawlersCr

41、awler ethics and conflicts,,30,Universal crawlers,Support universal search enginesLarge-scaleHuge cost (network bandwidth) of crawl is amortized over many queries from usersIncremental updates to existing index and ot

42、her data repositories,,31,Large-scale universal crawlers,Two major issues:PerformanceNeed to scale up to billions of pagesPolicyNeed to trade-off coverage, freshness, and bias (e.g. toward “important” pages),,32,Larg

43、e-scale crawlers: scalability,Need to minimize overhead of DNS lookupsNeed to optimize utilization of network bandwidth and disk throughput (I/O is bottleneck)Use asynchronous socketsMulti-processing or multi-threadin

44、g do not scale up to billions of pagesNon-blocking: hundreds of network connections open simultaneouslyPolling socket to monitor completion of network transfers,,33,High-level architecture of a scalable universal crawl

45、er,,34,Several parallel queues to spread load across servers (keep connections alive),DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching,Optimize use of network bandwidth,Opt

46、imize disk I/O throughput,Huge farm of crawl machines,Universal crawlers: Policy,CoverageNew pages get added all the timeCan the crawler find every page?FreshnessPages change over time, get removed, etc.How frequent

47、ly can a crawler revisit ?Trade-off!Focus on most “important” pages (crawler bias)?“Importance” is subjective,,35,Web coverage by search engine crawlers,This assumes we know the size of the entire the Web. Do we? Can

48、you define “the size of the Web”?,Maintaining a “fresh” collection,Universal crawlers are never “done”High variance in rate and amount of page changesHTTP headers are notoriously unreliableLast-modifiedExpiresSoluti

49、onEstimate the probability that a previously visited page has changed in the meanwhilePrioritize by this probability estimate,,37,Estimating page change rates,Algorithms for maintaining a crawl in which most pages are

50、fresher than a specified epochBrewington & Cybenko; Cho, Garcia-Molina & PageAssumption: recent past predicts the future (Ntoulas, Cho & Olston 2004)Frequency of change not a good predictorDegree of chang

51、e is a better predictor,,38,Do we need to crawl the entire Web?,If we cover too much, it will get staleThere is an abundance of pages in the WebFor PageRank, pages with very low prestige are largely uselessWhat is th

52、e goal?General search engines: pages with high prestige News portals: pages that change oftenVertical portals: pages on some topicWhat are appropriate priority measures in these cases? Approximations?,,39,Breadth-fir

53、st crawlers,BF crawler tends to crawl high-PageRank pages very earlyTherefore, BF crawler is a good baseline to gauge other crawlersBut why is this so?,Najork and Weiner 2001,Bias of breadth-first crawlers,The structur

54、e of the Web graph is very different from a random networkPower-law distribution of in-degreeTherefore there are hub pages with very high PR and many incoming linksThese are attractors: you cannot avoid them!,Outline,

55、Motivation and taxonomy of crawlersBasic crawlers and implementation issuesUniversal crawlersCrawler ethics and conflicts,,42,Crawler ethics and conflicts,Crawlers can cause trouble, even unwillingly, if not properly

56、designed to be “polite” and “ethical”For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!Server administrator and users will be upsetCrawler dev

57、eloper/admin IP address may be blacklisted,,43,Crawler etiquette (important!),Identify yourselfUse ‘User-Agent’ HTTP header to identify crawler, website with description of crawler and contact information for crawler de

58、veloperUse ‘From’ HTTP header to specify crawler developer emailDo not disguise crawler as a browser by using their ‘User-Agent’ stringAlways check that HTTP requests are successful, and in case of error, use HTTP err

59、or code to determine and immediately address problemPay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:redirection loopsspider traps,,44,Crawler etiquette (important

60、!),Spread the load, do not overwhelm a serverMake sure that no more than some max. number of requests to any single server per unit time, say < 1/secondHonor the Robot Exclusion ProtocolA server can specify which p

61、arts of its document tree any crawler is or is not allowed to crawl by a file named ‘robots.txt’ placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txtCrawler should always check, parse, and obey this

62、 file before sending any requests to a serverMore info at:http://www.google.com/robots.txthttp://www.robotstxt.org/wc/exclusion.html,,45,More on robot exclusion,Make sure URLs are canonical before checking against rob

63、ots.txt Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawlerLet’s look at some examples to understand the protocol…,,46,www.apple.com/robots.txt,,47,# robots.txt for

64、http://www.apple.com/ User-agent: *Disallow:,All crawlers…,…can go anywhere!,www.microsoft.com/robots.txt,,48,# Robots.txt file for http://www.microsoft.com User-agent: *Disallow: /canada/Library/mnp/2/aspx/Disall

65、ow: /communities/bin.aspxDisallow: /communities/eventdetails.mspxDisallow: /communities/blogs/PortalResults.mspxDisallow: /communities/rss.aspxDisallow: /downloads/Browse.aspxDisallow: /downloads/info.aspxDisallow:

66、 /france/formation/centres/planning.aspDisallow: /france/mnp_utility.mspxDisallow: /germany/library/images/mnp/Disallow: /germany/mnp_utility.mspxDisallow: /ie/ie40/Disallow: /info/customerror.htmDisallow: /info/sm

67、art404.aspDisallow: /intlkb/Disallow: /isapi/#etc…,All crawlers…,…are not allowed in these paths…,,www.springer.com/robots.txt,,49,# Robots.txt for http://www.springer.com (fragment)User-agent: GooglebotDisallow: /

68、chl/*Disallow: /uk/*Disallow: /italy/*Disallow: /france/*User-agent: slurpDisallow:Crawl-delay: 2User-agent: MSNBotDisallow:Crawl-delay: 2User-agent: scooterDisallow:# all othersUser-agent: *Disallow: /

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
  • 4. 未經權益所有人同意不得將文件中的內容挪作商業(yè)或盈利用途。
  • 5. 眾賞文庫僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
  • 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論