Hadoop Distributed File System: Architecture and Design (Foreign-Literature Translation)



Foreign-Literature Translation

The Hadoop Distributed File System: Architecture and Design

1 Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but it also differs from them significantly. HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines. It provides high-throughput access to data and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.

2 Assumptions and Goals

2.1 Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of servers, each storing part of the file system's data. The system has a huge number of components and any of them may fail, which means that some component of HDFS is always non-functional. Detecting faults and recovering from them quickly and automatically is therefore a core architectural goal of HDFS.

2.2 Streaming Data Access

Applications that run on HDFS differ from general-purpose applications: they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use. High throughput of data access matters more than low latency. POSIX imposes many hard requirements that HDFS applications do not need, so POSIX semantics have been changed in a few key areas to increase data throughput.

2.3 Large Data Sets

Applications that run on HDFS have large data sets. A typical HDFS file is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. A single HDFS instance should support tens of millions of files.

2.4 Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. Once a file has been created, written, and closed, it need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A Map/Reduce application or a web crawler fits this model perfectly. There are plans to extend the model in the future to support appending writes to files.
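As a rough illustration of the write-once-read-many model, the sketch below shows a file object that accepts writes only until it is closed and can then be read any number of times. The class and method names are hypothetical and chosen for this example; they are not part of any HDFS API.

```python
class WriteOnceFile:
    """Toy model of a write-once-read-many file (illustrative names, not an HDFS API)."""

    def __init__(self, name):
        self.name = name
        self._chunks = []      # data written while the file is still open
        self._closed = False

    def write(self, data: bytes) -> None:
        # Writes are only legal between creation and close.
        if self._closed:
            raise PermissionError(f"{self.name} is closed and can no longer be changed")
        self._chunks.append(data)

    def close(self) -> None:
        # After this point the contents are fixed.
        self._closed = True

    def read(self) -> bytes:
        # Reads can be repeated as often as needed.
        return b"".join(self._chunks)
```

Because the contents never change after close, every reader sees the same bytes and no coordination between readers is needed, which is the simplification the section describes.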

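2.5 "Moving Computation is Cheaper than Moving Data"

A computation requested by an application is more efficient the closer it runs to the data it operates on, especially when the data set is huge. Running the computation near the data reduces network congestion and increases the overall throughput of the system. It is clearly better to move the computation to the data than to move the data to where the application is running. HDFS provides interfaces that let applications move themselves closer to where the data is located.

2.6 Portability Across Heterogeneous Hardware and Software Platforms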

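HDFS has been designed with portability across platforms in mind. This facilitates the adoption of HDFS as a platform for large-scale data applications.

3 NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and regulates client access to files. There is usually one DataNode per node in the cluster, and each manages the storage attached to the node it runs on. HDFS exposes a file system namespace and lets users store data in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from file system clients, and they create, delete, and replicate blocks under the direction of the NameNode.

The NameNode and DataNode are designed to run on ordinary commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run the NameNode or a DataNode. Because Java is highly portable, HDFS can be deployed on a wide range of machines. A typical deployment has one machine that runs only the NameNode, while each of the other machines in the cluster runs one DataNode instance. The architecture does not preclude running multiple DataNodes on the same machine, but that is rare in practice.

Having a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata, and the system is designed so that user data never flows through the NameNode.

4 The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to that of most existing file systems: users can create, remove, move, or rename files. HDFS does not currently support user disk quotas or access permissions, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The NameNode maintains the file system namespace. Any change to the namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that HDFS should maintain. The number of copies of a file is called its replication factor, and this information is also stored by the NameNode.

5 Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks except the last are the same size. All blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be set when the file is created and changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions about block replication. It periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport lists all blocks on that DataNode.

5.1 Replica Placement: The First Baby Steps

Replica placement is critical to HDFS reliability and performance. An optimized replica placement policy is an important feature that distinguishes HDFS from most other distributed file systems. It is a feature that needs a lot of tuning and experience. HDFS uses a rack-aware policy to improve data reliability, availability, and network bandwidth utilization. The current replica placement policy is only a first step in this direction. The short-term goals of implementing it are to validate it on production systems, observe its behavior, and build a foundation for testing and researching more sophisticated policies.

Large HDFS instances run on clusters of computers that commonly span many racks. Communication between two machines in different racks has to go through switches. In most cases, bandwidth between machines in the same rack is greater than bandwidth between machines in different racks.

Through a rack awareness process, the NameNode determines the rack id each DataNode belongs to. A simple but non-optimal policy is to place replicas on different racks. This prevents data loss when an entire rack fails and allows reads to use bandwidth from multiple racks. The policy distributes replicas evenly across the cluster, which makes it easy to balance load on component failure. However, it increases the cost of writes because a write has to transfer blocks to multiple racks.

In the common case, where the replication factor is three, the HDFS placement policy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. This policy reduces inter-rack write traffic, which improves write performance. Rack failures are far less likely than node failures, so the policy does not compromise data reliability or availability. At the same time, because a block is placed on only two distinct racks rather than three, the policy reduces the aggregate network bandwidth used when reading data. With this policy, the replicas are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining replicas are evenly distributed across the other racks. The policy improves write performance without compromising data reliability or read performance.

The current default replica placement policy described here is a work in progress.

5.2 Replica Selection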

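To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica closest to the reader. If a replica exists on the same rack as the reader, that replica is used. If an HDFS cluster spans multiple data centers, a client likewise reads first from a replica in the local data center.

5.3 Safemode

On startup, the NameNode enters a special state called Safemode. A NameNode in Safemode does not replicate data blocks. The NameNode receives Heartbeat and Blockreport messages from all DataNodes. A Blockreport lists the data blocks a DataNode holds. Each block has a specified minimum number of replicas. A block is considered safely replicated once the NameNode has confirmed that this minimum number of replicas exists. After a configurable percentage of blocks has been confirmed as safely replicated (plus an additional 30 seconds), the NameNode exits Safemode. It then determines which blocks still have fewer than the specified number of replicas and replicates those blocks to other DataNodes.

6 The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to record every change to file system metadata. For example, creating a file in HDFS causes the NameNode to insert a record into the EditLog; likewise, changing the replication factor of a file inserts a record into the EditLog. The NameNode stores the EditLog in a file in its local host OS file system. The entire file system namespace, including the mapping of blocks to files and the file properties, is stored in a file called the FsImage, which is also kept in the NameNode's local file system.

The NameNode keeps an image of the entire file system namespace and the file Blockmap in memory. This key metadata structure is designed to be compact, so a NameNode with 4 GB of RAM is enough to support a large number of files and directories. When the NameNode starts up, it reads the EditLog and FsImage from disk, applies all transactions from the EditLog to the in-memory FsImage, and flushes this new version of the FsImage to local disk. It then deletes the old EditLog, because its transactions have already been applied to the FsImage. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the NameNode starts up; support for periodic checkpointing is planned for the near future.

The DataNode stores HDFS data in files in its local file system and has no knowledge of HDFS files. It stores each HDFS block in a separate local file. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories as needed. Creating all local files in one directory is not optimal, because the local file system might not efficiently support a large number of files in a single directory. When a DataNode starts up, it scans its local file system, generates a list of all HDFS blocks that correspond to these local files, and sends this list to the NameNode as a report. This report is the Blockreport.

7 Communication Protocols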

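All HDFS communication protocols are layered on top of TCP/IP. A client connects to a configurable TCP port on the NameNode and talks to it using the ClientProtocol. The DataNodes talk to the NameNode using the DatanodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the NameNode never initiates RPCs; it only responds to RPC requests issued by clients or DataNodes.

8 Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failure are NameNode failures, DataNode failures, and network partitions.

8.1 Data Disk Failure, Heartbeats and Re-Replication

Each DataNode periodically sends a Heartbeat to the NameNode. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of Heartbeats, marks DataNodes without recent Heartbeats as dead, and stops forwarding new IO requests to them. Any data stored on a dead DataNode is no longer available to HDFS. The death of a DataNode may cause the replication factor of some blocks to fall below the specified value. The NameNode constantly tracks which blocks need to be replicated and starts replication whenever necessary. Re-replication may be needed when a DataNode becomes unavailable, a replica becomes corrupted, a hard disk on a DataNode fails, or the replication factor of a file is increased.

8.2 Cluster Rebalancing

The HDFS architecture supports data rebalancing schemes. If the free space on a DataNode falls below a certain threshold, a scheme might automatically move data from that DataNode to other DataNodes with free space. If demand for a particular file suddenly rises, a scheme might also create additional replicas of that file and rebalance other data in the cluster. These rebalancing schemes are not yet implemented.

8.3 Data Integrity

A block fetched from a DataNode may arrive corrupted. Corruption can be caused by faults in a DataNode's storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data received from each DataNode matches the checksum stored in the associated checksum file. If it does not, the client can choose to retrieve that block from another DataNode that has a replica.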
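The checksum check described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: CRC-32 (via Python's `zlib.crc32`) stands in for whatever checksum HDFS actually uses, which the text does not name, and the function names `compute_checksums` and `read_block` are hypothetical.

```python
import zlib


def compute_checksums(blocks):
    """On file creation: one checksum per block, to be stored alongside the file."""
    return [zlib.crc32(block) for block in blocks]


def read_block(index, replicas, checksums):
    """Return the first replica of block `index` whose checksum matches.

    `replicas` holds the bytes received from each DataNode that has the block;
    a mismatch makes the client fall through to the next DataNode.
    """
    for data in replicas:
        if zlib.crc32(data) == checksums[index]:
            return data
    raise IOError(f"all replicas of block {index} are corrupt")
```

For example, if the first DataNode returns a damaged copy of a block, `read_block` skips it and returns the intact copy from the second DataNode.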

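8.4 Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. If these files are corrupted, the whole HDFS instance becomes non-functional. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either the FsImage or the EditLog is synchronously applied to each copy. This synchronous updating of multiple copies may reduce the number of namespace transactions per second that the NameNode can support. However, this cost is acceptable because, although HDFS applications are data-intensive, they are not metadata-intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Automatic restart and failover of the NameNode to another machine is not yet supported.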
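To make the relationship between the FsImage and the EditLog concrete, here is a minimal sketch of a checkpoint: the logged transactions are replayed on top of the image, and the log can then be emptied. The record layout (`"create"`, `"set_replication"`, `"delete"` tuples) and the dictionary-based namespace are assumptions made for this example, not the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one EditLog record to the in-memory namespace (path -> metadata)."""
    op = edit[0]
    if op == "create":
        _, path, replication = edit
        namespace[path] = {"replication": replication}
    elif op == "set_replication":
        _, path, replication = edit
        namespace[path]["replication"] = replication
    elif op == "delete":
        _, path = edit
        del namespace[path]
    else:
        raise ValueError(f"unknown edit: {op}")


def checkpoint(fsimage, editlog):
    """Replay the EditLog on a copy of the FsImage; return (new_fsimage, empty_editlog)."""
    namespace = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in editlog:
        apply_edit(namespace, edit)
    # Every logged transaction is now reflected in the image, so the log can start empty.
    return namespace, []
```

The sketch also suggests why losing either structure is serious: the image alone misses every change made since the last checkpoint, and the log alone has nothing to be replayed onto.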

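8.5 Snapshots

Snapshots support storing a copy of data as of a particular instant in time. One use of snapshots is to roll a corrupted HDFS instance back to a previously known good point in time. HDFS does not currently support snapshots, but support is planned for a future release.

9 Data Organization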

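9.1 Data Blocks

HDFS is designed to support very large files. Applications suited to HDFS are those that deal with large data sets. These applications write their data only once but read it one or more times, and they require reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB blocks, and if possible, each block resides on a different DataNode.

9.2 Staging

A client request to create a file does not reach the NameNode immediately. Instead, the HDFS client initially caches the file data in a temporary local file. Application writes are transparently redirected to this temporary file. Only when the temporary file has accumulated more than one block of data does the client contact the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. It then replies to the client with the identity of the DataNode and the destination data block. The client then flushes the block of data from the local temporary file to the specified DataNode. When the file is closed, the remaining un-flushed data in the temporary file is also transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation to its persistent log. If the NameNode dies before the file is closed, the file is lost.

This approach was adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. Without client-side buffering, network speed and network congestion would significantly reduce throughput. The approach is not without precedent: earlier file systems such as AFS used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher data upload performance.

9.3 Replication Pipelining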

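When a client writes data to an HDFS file, the data is first written to a local temporary file. Suppose the file has a replication factor of three. When the local file has accumulated a full block of data, the client retrieves from the NameNode a list of DataNodes that will host the replicas of that block. The client then starts sending the data to the first DataNode. The first DataNode receives the data in small portions (4 KB), writes each portion to its local repository, and at the same time transfers that portion to the second DataNode in the list. The second DataNode does the same: it receives each portion, writes it to its repository, and forwards it to the third DataNode. Finally, the third DataNode receives the data and stores it locally. A DataNode can thus receive data from the previous node in the pipeline while forwarding it to the next one, so data is pipelined from one DataNode to the next.

10 Accessibility

HDFS offers applications several ways to access it. Users can interact with HDFS data from the command line through the DFSShell interface, through the Java API, or through a C language wrapper for that API. Files in an HDFS instance can also be browsed with a web browser. Access through the WebDAV protocol is under development.

10.1 DFSShell

HDFS organizes user data in the form of files and directories. It provides a command-line interface called DFSShell that lets a user interact with the data in HDFS.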

The syntax of this command set is similar to that of other shells (for example bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action                                                  Command
Create a directory named /foodir                        bin/hadoop dfs -mkdir /foodir
Remove a file named /foodir/myfile.txt                  bin/hadoop dfs -rm /foodir/myfile.txt
View the contents of a file named /foodir/myfile.txt    bin/hadoop dfs -cat /foodir/myfile.txt

DFSShell is intended for applications that use a scripting language to interact with the file system.

10.2 DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These commands can be used only by an HDFS administrator. Here are some sample action/command pairs:

Action                                   Command
Put the cluster in Safemode              bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes             bin/hadoop dfsadmin -report
Decommission DataNode datanodename       bin/hadoop dfsadmin -decommission datanodename

10.3 Browser Interface

A typical HDFS install runs a web server on a configurable TCP port to expose the HDFS namespace. Users can navigate the HDFS namespace and view the contents of its files with a web browser.

11 Space Reclamation

11.1 File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS renames it and moves it to the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After that time expires, the NameNode deletes the file from the HDFS namespace. Deleting the file causes the blocks associated with it to be freed. Note that there can be a delay between the time a user deletes a file and the time HDFS free space increases.

A user can undelete a file as long as it remains in the /trash directory. To do so, the user navigates the /trash directory and retrieves the file. The /trash directory contains only the latest copy of each deleted file. The /trash directory is like any other directory, with one special feature: HDFS applies a policy to automatically delete files from it. The current default policy is to delete files that have been in /trash for more than 6 hours. In the future, this policy will be configurable through a well-defined interface.

11.2 Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and frees the space. Once again, there may be a delay between the completion of the setReplication API call and the appearance of free space in the cluster.

Original English Text

The Hadoop Distributed File System: Architecture and Design

1 Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable

2 Assumptions and Goals

2.1 Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

2.2 Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increas

2.3 Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

2.4 Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

2.5 “Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer

2.6 Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

3 NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of D

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster ru

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
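The division of labour between the NameNode and the DataNodes can be sketched as below: the NameNode keeps only metadata (which blocks make up a file and where each block lives), while the block contents go directly to DataNode storage. The class `ToyNameNode`, the helper functions, and the round-robin choice of DataNode are all assumptions made for this illustration and do not describe the real placement logic.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut file contents into fixed-size blocks; only the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Holds metadata only: the blocks of each file and the location of each block."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.file_blocks = {}       # path -> list of block ids
        self.block_locations = {}   # block id -> DataNode name
        self._next_block_id = 0

    def allocate_block(self, path):
        block_id = self._next_block_id
        self._next_block_id += 1
        # Round-robin placement is an assumption made for this sketch.
        node = self.datanodes[block_id % len(self.datanodes)]
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = node
        return block_id, node


def store_file(namenode, storage, path, data, block_size):
    """Client-side write: ask the NameNode where each block goes, then send the bytes there."""
    for block in split_into_blocks(data, block_size):
        block_id, node = namenode.allocate_block(path)
        storage[node][block_id] = block   # user data never passes through the NameNode
```

Here `storage` is a plain dictionary standing in for the local disks of the DataNodes; the NameNode object never sees the bytes of any block.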

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
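One way to picture why the NameNode stores the replication factor is to compare it with the replicas that DataNodes actually report. The sketch below is a hedged illustration only: the function name `under_replicated` and the shape of its three arguments are hypothetical, and the real bookkeeping in the NameNode is certainly more involved.

```python
from collections import Counter


def under_replicated(replication_factor, file_blocks, block_reports):
    """Return {block_id: missing_replicas} for blocks below their file's replication factor.

    replication_factor: path -> desired number of replicas
    file_blocks:        path -> list of block ids
    block_reports:      DataNode name -> iterable of block ids it currently holds
    """
    live = Counter()
    for blocks in block_reports.values():
        live.update(set(blocks))   # count each DataNode at most once per block
    missing = {}
    for path, blocks in file_blocks.items():
        for block_id in blocks:
            shortfall = replication_factor[path] - live[block_id]
            if shortfall > 0:
                missing[block_id] = shortfall
    return missing
```

A block that appears in the result is a candidate for re-replication; a block that is absent already has at least as many replicas as its file requests.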

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and h

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

3.1 Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this po

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggre

The current, default replica placement policy described here is a work in progress.

3.2 Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

3.3 Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a confi

6 The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system nam

The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old E

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all loc
