版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報或認(rèn)領(lǐng)
文檔簡介
1、<p> 中文3700字,2000英文單詞,10500英文字符</p><p> 文獻(xiàn)出處:Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.</p><p> MapR
2、educe: Simplified Data Processing on Large Clusters</p><p> Jeffrey Dean and Sanjay Ghemawat</p><p> jeff@google.com, sanjay@google.com</p><p> Google, Inc.</p><p>&
3、lt;b> Abstract</b></p><p> MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value
4、pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown i
5、n the paper.</p><p> Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitio
6、ning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine ommunication. This allows programmers without any experience with p
7、arallel and distributed systems to easily utilize the resources of a large distributed system.</p><p> Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a
8、 typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers and the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce
9、 jobs are executed on Google's clusters every day.</p><p> 1 Introduction</p><p> Over the past five years, the authors and many others at Google have implemented hundreds of that process
10、large amounts of raw data,such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries
11、of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input dat</p><p> As a reaction to this c
12、omplexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing i
13、n a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical ?
14、record? in our input in order to comp</p><p> The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, co
15、mbined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.</p><p> Section 2 describes the basic programming model and gives several examples.</p&g
16、t;<p> Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. </p><p> Section 4 describes several renements of the programming mo
17、del that we have found useful. </p><p> Section 5 has performance measurements of our implementation for a variety of tasks. </p><p> Section 6 explores the use of MapReduce within Google incl
18、uding our experiences in using it as the basis for a rewrite of our production indexing system. </p><p> Section 7 discusses related and future work.</p><p> 2 Programming Model</p><
19、;p> The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.</p><p&g
20、t; Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes t
21、hem to the Reduce function. </p><p> The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smalle
22、r set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values
23、 that are too large to fit in memory.</p><p> 2.1 Example</p><p> Consider the problem of counting the number of occurrences of each word in a large collection of documents.The user would writ
24、e code similar to the following pseudo-code:</p><p> The map function emits each word plus an associated count of occurrences (just `1' in this simple example). The reduce function sums together all cou
25、nts emitted for a particular word.</p><p> In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The use
26、r then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this examp
27、le.</p><p><b> 2.2 Types</b></p><p> Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by th
28、e user have associated types:</p><p> I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore,the intermediate keys and values are from the same domain
29、 as the output keys and values.</p><p> Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.</p>
30、;<p> 2.3 More Examples</p><p> Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.</p><p> Distributed Grep: The map functio
31、n emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.</p><p> Count of URL Access Frequency: The map fun
32、ction processes logs of web page requests and outputs The reduce function adds together all values for the same URL and emits <URL, total count> pair.</p><p> ReverseWeb-Link Graph: The map function o
33、utputs <target,source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair<target,
34、list(source)></p><p> Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of hword; frequencyi pairs. The map function emits a
35、 hhostname; term vector pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vector
36、s together, throwing away infrequent terms, and then emits a final <hostname,term vector> pair.</p><p> Inverted Index: The map function parses each document,and emits a sequence of <word, document
37、 ID>pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word; list(document ID)> pair. The set of all output pairs forms a simple inverted index. It i
38、s easy to augment this computation to keep track of word positions.</p><p> Distributed Sort: The map function extracts the key from each record, and emits a <key, record> pair. The reduce function em
39、its all pairs unchanged. This computation depends on the partitioning facilities described inSection 4.1 and the ordering properties described in Section 4.2.</p><p> 3 Implementation</p><p>
40、Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be</p><p> suitable for a small shared-memory machine
41、, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.</p><p> This section describes an implementation targeted to the computing environment in wid
42、e use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:</p><p> (1) Machines are typically dual-processor x86 processors running Linux, with 2-4 G
43、B of memory per machine.</p><p> (2) Commodity networking hardware is used . typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisecti
44、on bandwidth.</p><p> (3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.</p><p> (4) Storage is provided by inexpensive IDE disks attached
45、directly to individual machines. A distributed _le system [8] developed in-house is used to manage the data</p><p> stored on these disks. The _le system uses replication to provide availability and reliabi
46、lity on top of unreliable hardware.</p><p> (5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
47、</p><p> 3.1 Execution Overview</p><p> The Map invocations are distributed across multiplemachines by automatically partitioning the input data into a set of M splits. The input splits can be
48、 processed</p><p> in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The n
49、umber of partitions (R) and the partitioning function are speci_ed by the user. </p><p> Figure 1 shows the overall _ow of a MapReduce operation in our implementation. When the user program calls the MapRed
50、uce function, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):</p><p> 1. The MapReduce library in the user program _rst splits the inp
51、ut _les into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.</p><p
52、> 2. One of the copies of the program is special . the master. The rest are workers that are assigned work by the master. There areM map tasks and R reduce tasks to assign. The master picks idle workers and assigns e
53、ach one a map task or a reduce task.</p><p> 3. A worker who is assigned a map task reads thecontents of the corresponding input split. It parseskey/value pairs out of the input data and passes each pair to
54、 the user-de_ned Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.</p><p> 4. Periodically, the buffered pairs are written to local disk, partitioned into R
55、 regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.</p><p>
56、 5. When a reduce worker is noti_ed by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate da
57、ta, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediat
58、e data is too large to fit in memory, an external sort is use</p><p> 6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and
59、the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a _nal output _le for this reduce partition.</p><p> 7. When all map tasks an
60、d reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. </p><p> After successful completion, the outpu
61、t of the mapreduce execution is available in the R output files (one per reduce task, with _le names as speci_ed by the user).</p><p> Typically, users do not need to combine these R output files into one f
62、iles -they often pass these _les as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.</p><p> MapRedu
63、ce的:簡化數(shù)據(jù)處理大型集群</p><p><b> 摘要</b></p><p> MapReduce是一個編程模型,也是一個處理和生成超大數(shù)據(jù)集的算法模型的相關(guān)實現(xiàn)。用戶首先創(chuàng)建一個Map函數(shù)處理一個基于key/value pair的數(shù)據(jù)集合,輸出中間的基于key/value pair的數(shù)據(jù)集合;然后再創(chuàng)建一個Reduce函數(shù)用來合并所有的具有相同中間key
64、值的中間value值?,F(xiàn)實世界中有很多滿足上述處理模型的例子,本論文將詳細(xì)描述這個模型。</p><p> MapReduce架構(gòu)的程序能夠在大量的普通配置的計算機(jī)上實現(xiàn)并行化處理。這個系統(tǒng)在運行時只關(guān)心:如何分割輸入數(shù)據(jù),在大量計算機(jī)組成的集群上的調(diào)度,集群中計算機(jī)的錯誤處理,管理集群中計算機(jī)之間必要的通信。采用MapReduce架構(gòu)可以使那些沒有并行計算和分布式處理系統(tǒng)開發(fā)經(jīng)驗的程序員有效利用分布式系統(tǒng)的豐
65、富資源。</p><p> 我們的MapReduce實現(xiàn)運行在規(guī)??梢造`活調(diào)整的由普通機(jī)器組成的集群上:一個典型的MapReduce計算往往由幾千臺機(jī)器組成、處理以TB計算的數(shù)據(jù)。程序員發(fā)現(xiàn)這個系統(tǒng)非常好用:已經(jīng)實現(xiàn)了數(shù)以百計的MapReduce程序,在Google的集群上,每天都有1000多個MapReduce程序在執(zhí)行。</p><p><b> 1 介紹 </b&
66、gt;</p><p> 在過去的5年里,包括本文作者在內(nèi)的Google的很多程序員,為了處理海量的原始數(shù)據(jù),已經(jīng)實現(xiàn)了數(shù)以百計的、專用的計算方法。這些計算方法用來處理大量的原始數(shù)據(jù),比如,文檔抓?。愃凭W(wǎng)絡(luò)爬蟲的程序)、Web請求日志等等;也為了計算處理各種類型的衍生數(shù)據(jù),比如倒排索引、Web文檔的圖結(jié)構(gòu)的各種表示形勢、每臺主機(jī)上網(wǎng)絡(luò)爬蟲抓取的頁面數(shù)量的匯總、每天被請求的最多的查詢的集合等等。大多數(shù)這樣的數(shù)據(jù)
67、處理運算在概念上很容易理解。然而由于輸入的數(shù)據(jù)量巨大,因此要想在可接受的時間內(nèi)完成運算,只有將這些計算分布在成百上千的主機(jī)上。如何處理并行計算、如何分發(fā)數(shù)據(jù)、如何處理錯誤?所有這些問題綜合在一起,需要大量的代碼處理,因此也使得原本簡單的運算變得難以處理。</p><p> 為了解決上述復(fù)雜的問題,我們設(shè)計一個新的抽象模型,使用這個抽象模型,我們只要表述我們想要執(zhí)行的簡單運算即可,而不必關(guān)心并行計算、容錯、數(shù)據(jù)分
68、布、負(fù)載均衡等復(fù)雜的細(xì)節(jié),這些問題都被封裝在了一個庫里面。設(shè)計這個抽象模型的靈感來自Lisp和許多其他函數(shù)式語言的Map和Reduce的原語。我們意識到我們大多數(shù)的運算都包含這樣的操作:在輸入數(shù)據(jù)的“邏輯”記錄上應(yīng)用Map操作得出一個中間key/value pair集合,然后在所有具有相同key值的value值上應(yīng)用Reduce操作,從而達(dá)到合并中間的數(shù)據(jù),得到一個想要的結(jié)果的目的。使用MapReduce模型,再結(jié)合用戶實現(xiàn)的Map和R
69、educe函數(shù),我們就可以非常容易的實現(xiàn)大規(guī)模并行化計算;通過MapReduce模型自帶的“再次執(zhí)行”(re-execution)功能,也提供了初級的容災(zāi)實現(xiàn)方案。</p><p> 這個工作(實現(xiàn)一個MapReduce框架模型)的主要貢獻(xiàn)是通過簡單的接口來實現(xiàn)自動的并行化和大規(guī)模的分布式計算,通過使用MapReduce模型接口實現(xiàn)在大量普通的PC機(jī)上高性能計算。</p><p> 第
70、二部分描述基本的編程模型和一些使用案例。</p><p> 第三部分描述了一個經(jīng)過裁剪的、適合我們的基于集群的計算環(huán)境的MapReduce實現(xiàn)。</p><p> 第四部分描述我們認(rèn)為在MapReduce編程模型中一些實用的技巧。</p><p> 第五部分對于各種不同的任務(wù),測量我們MapReduce實現(xiàn)的性能。</p><p>
71、第六部分揭示了在Google內(nèi)部如何使用MapReduce作為基礎(chǔ)重寫我們的索引系統(tǒng)產(chǎn)品,包括其它一些使用MapReduce的經(jīng)驗。</p><p> 第七部分討論相關(guān)的和未來的工作。</p><p><b> 2 編程模型 </b></p><p> MapReduce編程模型的原理是:利用一個輸入key/value pair集合來產(chǎn)生
72、一個輸出的key/value pair集合。MapReduce庫的用戶用兩個函數(shù)表達(dá)這個計算:Map和Reduce。</p><p> 用戶自定義的Map函數(shù)接受一個輸入的key/valuepair值,然后產(chǎn)生一個中間key/value pair值的集合。MapReduce庫把所有具有相同中間key值I的中間value值集合在一起后傳遞給reduce函數(shù)。</p><p> 用戶自定義
73、的Reduce函數(shù)接受一個中間key的值I和相關(guān)的一個value值的集合。Reduce函數(shù)合并這些value值,形成一個較小的value值的集合。一般的,每次Reduce函數(shù)調(diào)用只產(chǎn)生0或1個輸出value值。通常我們通過一個迭代器把中間value值提供給Reduce函數(shù),這樣我們就可以處理無法全部放入內(nèi)存中的大量的value值的集合。</p><p><b> 2.1 例子</b><
74、;/p><p> 例如,計算一個大的文檔集合中每個單詞出現(xiàn)的次數(shù),下面是偽代碼段:</p><p> Map函數(shù)輸出文檔中的每個詞、以及這個詞的出現(xiàn)次數(shù)(在這個簡單的例子里就是1)。Reduce函數(shù)把Map函數(shù)產(chǎn)生的每一個特定的詞的計數(shù)累加起來。</p><p> 另外,用戶編寫代碼,使用輸入和輸出文件的名字、可選的調(diào)節(jié)參數(shù)來完成一個符合MapReduce模型規(guī)范
75、的對象,然后調(diào)用MapReduce函數(shù),并把這個規(guī)范對象傳遞給它。用戶的代碼和MapReduce庫鏈接在一起(用C++實現(xiàn))。附錄A包含了這個實例的全部程序代碼。</p><p><b> 2.2 類型</b></p><p> 盡管在前面例子的偽代碼中使用了以字符串表示的輸入輸出值,但是在概念上,用戶定義的Map和Reduce函數(shù)都有相關(guān)聯(lián)的類型:</p&
76、gt;<p> 比如,輸入的key和value值與輸出的key和value值在類型上推導(dǎo)的域不同。此外,中間key和value值與輸出key和value值在類型上推導(dǎo)的域相同。</p><p> 我們的C++中使用字符串類型作為用戶自定義函數(shù)的輸入輸出,用戶在自己的代碼中對字符串進(jìn)行適當(dāng)?shù)念愋娃D(zhuǎn)換。</p><p><b> 2.3 更多的例子</b&g
77、t;</p><p> 這里還有一些有趣的簡單例子,可以很容易的使用MapReduce模型來表示:</p><p> 分布式的Grep:Map函數(shù)輸出匹配某個模式的一行,Reduce函數(shù)是一個恒等函數(shù),即把中間數(shù)據(jù)復(fù)制到輸出。</p><p> 計算URL訪問頻率:Map函數(shù)處理日志中web頁面請求的記錄,然后輸出(URL,1)。Reduce函數(shù)把相同URL的
78、value值都累加起來,產(chǎn)生(URL,記錄總數(shù))結(jié)果。</p><p> 倒轉(zhuǎn)網(wǎng)絡(luò)鏈接圖:Map函數(shù)在源頁面(source)中搜索所有的鏈接目標(biāo)(target)并輸出為(target,source)。Reduce函數(shù)把給定鏈接目標(biāo)(target)的鏈接組合成一個列表,輸出(target,list(source))。</p><p> 每個主機(jī)的檢索詞向量:檢索詞向量用一個(詞,頻率)列
79、表來概述出現(xiàn)在文檔或文檔集中的最重要的一些詞。Map函數(shù)為每一個輸入文檔輸出(主機(jī)名,檢索詞向量),其中主機(jī)名來自文檔的URL。Reduce函數(shù)接收給定主機(jī)的所有文檔的檢索詞向量,并把這些檢索詞向量加在一起,丟棄掉低頻的檢索詞,輸出一個最終的(主機(jī)名,檢索詞向量)。</p><p> 倒排索引:Map函數(shù)分析每個文檔輸出一個(詞,文檔號)的列表,Reduce函數(shù)的輸入是一個給定詞的所有(詞,文檔號),排序所有的
80、文檔號,輸出(詞,list(文檔號))。所有的輸出集合形成一個簡單的倒排索引,它以一種簡單的算法跟蹤詞在文檔中的位置。</p><p> 分布式排序:Map函數(shù)從每個記錄提取key,輸出(key,record)。Reduce函數(shù)不改變?nèi)魏蔚闹?。這個運算依賴分區(qū)機(jī)制(在4.1描述)和排序?qū)傩?在4.2描述)。</p><p><b> 3 實現(xiàn) </b></p
81、><p> MapReduce模型可以有多種不同的實現(xiàn)方式。如何正確選擇取決于具體的環(huán)境。例如,一種實現(xiàn)方式適用于小型的共享內(nèi)存方式的機(jī)器,另外一種實現(xiàn)方式則適用于大型NUMA架構(gòu)的多處理器的主機(jī),而有的實現(xiàn)方式更適合大型的網(wǎng)絡(luò)連接集群。</p><p> 本章節(jié)描述一個適用于Google內(nèi)部廣泛使用的運算環(huán)境的實現(xiàn):用以太網(wǎng)交換機(jī)連接、由普通PC機(jī)組成的大型集群。在我們的環(huán)境里包括:&l
82、t;/p><p> 1. x86架構(gòu)、運行Linux操作系統(tǒng)、雙處理器、2-4GB內(nèi)存的機(jī)器。</p><p> 2. 普通的網(wǎng)絡(luò)硬件設(shè)備,每個機(jī)器的帶寬為百兆或者千兆,但是遠(yuǎn)小于網(wǎng)絡(luò)的平均帶寬的一半。</p><p> 3. 集群中包含成百上千的機(jī)器,因此,機(jī)器故障是常態(tài)。</p><p> 4. 存儲為廉價的內(nèi)置IDE硬盤。一個內(nèi)部分
83、布式文件系統(tǒng)用來管理存儲在這些磁盤上的數(shù)據(jù)。文件系統(tǒng)通過數(shù)據(jù)復(fù)制來在不可靠的硬件上保證數(shù)據(jù)的可靠性和有效性。</p><p> 5. 用戶提交工作(job)給調(diào)度系統(tǒng)。每個工作(job)都包含一系列的任務(wù)(task),調(diào)度系統(tǒng)將這些任務(wù)調(diào)度到集群中多臺可用的機(jī)器上。</p><p><b> 3.1 執(zhí)行概括</b></p><p> 通
84、過將Map調(diào)用的輸入數(shù)據(jù)自動分割為M個數(shù)據(jù)片段的集合,Map調(diào)用被分布到多臺機(jī)器上執(zhí)行。輸入的數(shù)據(jù)片段能夠在不同的機(jī)器上并行處理。使用分區(qū)函數(shù)將Map調(diào)用產(chǎn)生的中間key值分成R個不同分區(qū)(例如,hash(key) mod R),Reduce調(diào)用也被分布到多臺機(jī)器上執(zhí)行。分區(qū)數(shù)量(R)和分區(qū)函數(shù)由用戶來指定。</p><p> 圖1展示了我們的MapReduce實現(xiàn)中操作的全部流程。當(dāng)用戶調(diào)用MapReduce
85、函數(shù)時,將發(fā)生下面的一系列動作(下面的序號和圖1中的序號一一對應(yīng)):</p><p> 1. 用戶程序首先調(diào)用的MapReduce庫將輸入文件分成M個數(shù)據(jù)片度,每個數(shù)據(jù)片段的大小一般從16MB到64MB(可以通過可選的參數(shù)來控制每個數(shù)據(jù)片段的大小)。然后用戶程序在機(jī)群中創(chuàng)建大量的程序副本。</p><p> 2. 這些程序副本中的有一個特殊的程序–master。副本中其它的程序都是wo
86、rker程序,由master分配任務(wù)。有M個Map任務(wù)和R個Reduce任務(wù)將被分配,master將一個Map任務(wù)或Reduce任務(wù)分配給一個空閑的worker。</p><p> 3. 被分配了map任務(wù)的worker程序讀取相關(guān)的輸入數(shù)據(jù)片段,從輸入的數(shù)據(jù)片段中解析出key/value pair,然后把key/value pair傳遞給用戶自定義的Map函數(shù),由Map函數(shù)生成并輸出的中間key/value
87、pair,并緩存在內(nèi)存中。</p><p> 4. 緩存中的key/value pair通過分區(qū)函數(shù)分成R個區(qū)域,之后周期性的寫入到本地磁盤上。緩存的key/value pair在本地磁盤上的存儲位置將被回傳給master,由master負(fù)責(zé)把這些存儲位置再傳送給Reduce worker。</p><p> 5. 當(dāng)Reduce worker程序接收到master程序發(fā)來的數(shù)據(jù)存儲位
88、置信息后,使用RPC從Map worker所在主機(jī)的磁盤上讀取這些緩存數(shù)據(jù)。當(dāng)Reduce worker讀取了所有的中間數(shù)據(jù)后,通過對key進(jìn)行排序后使得具有相同key值的數(shù)據(jù)聚合在一起。由于許多不同的key值會映射到相同的Reduce任務(wù)上,因此必須進(jìn)行排序。如果中間數(shù)據(jù)太大無法在內(nèi)存中完成排序,那么就要在外部進(jìn)行排序。</p><p> 6. Reduce worker程序遍歷排序后的中間數(shù)據(jù),對于每一個唯
89、一的中間key值,Reduce worker程序?qū)⑦@個key值和它相關(guān)的中間value值的集合傳遞給用戶自定義的Reduce函數(shù)。Reduce函數(shù)的輸出被追加到所屬分區(qū)的輸出文件。</p><p> 7. 當(dāng)所有的Map和Reduce任務(wù)都完成之后,master喚醒用戶程序。在這個時候,在用戶程序里的對MapReduce調(diào)用才返回。</p><p> 在成功完成任務(wù)之后,MapRedu
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 眾賞文庫僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 工程數(shù)據(jù)處理及數(shù)據(jù)庫技術(shù)
- 數(shù)據(jù)庫外文翻譯---關(guān)系數(shù)據(jù)庫的結(jié)構(gòu)
- 外文翻譯----數(shù)據(jù)庫和數(shù)據(jù)庫系統(tǒng)
- sql數(shù)據(jù)庫外文翻譯
- 數(shù)據(jù)庫設(shè)計外文翻譯
- 外文翻譯---數(shù)據(jù)庫管理
- 數(shù)據(jù)庫編程外文翻譯
- 數(shù)據(jù)庫管理外文翻譯
- 數(shù)據(jù)庫設(shè)計外文翻譯
- 數(shù)據(jù)庫管理-外文翻譯
- 圖形數(shù)據(jù)庫的分布式數(shù)據(jù)處理.pdf
- 流數(shù)據(jù)處理在金融數(shù)據(jù)庫中的應(yīng)用.pdf
- 外文翻譯----數(shù)據(jù)庫和數(shù)據(jù)倉庫
- 基于MapReduce的復(fù)雜結(jié)構(gòu)數(shù)據(jù)處理.pdf
- 外文翻譯----數(shù)據(jù)庫和數(shù)據(jù)倉庫
- 數(shù)據(jù)庫安全【外文翻譯】(01)
- 數(shù)據(jù)庫更改管理[外文翻譯]
- 數(shù)據(jù)庫管理系統(tǒng)外文翻譯
- 數(shù)據(jù)庫設(shè)計外文翻譯3
- 數(shù)據(jù)庫畢業(yè)設(shè)計外文翻譯--數(shù)據(jù)庫的空間組織
評論
0/150
提交評論