Analysis and Application of Distributed File System Technology Based on Hadoop
Published: 2018-07-17 07:47
[Abstract]: With the rapid development of the Internet (chiefly the mobile Internet) and the emerging Internet of Things, we live in an era of explosive data growth. IDC estimates that the total volume of data generated and created worldwide in 2011 was 1.8 ZB, and that the global volume of information doubles every two years. Data on this scale naturally poses enormous challenges for storage and management; IDC's report further notes that the growth of global storage capacity has fallen far behind the growth of the data itself. Storing so much data on a single device is barely feasible with current storage technology, and concentrating it on one device would also severely hamper later analysis. Spreading data across many storage devices is therefore the preferred approach to massive-scale storage today, which in turn requires a distributed file system to manage those devices, coordinate their work, and give users good data-access performance. The Hadoop Distributed File System (HDFS), a system similar to the Google File System (GFS), is a strong answer to this massive-storage requirement. First, it is open source and free, and it has been deployed on many clusters with remarkable results. Second, HDFS offers high fault tolerance, high reliability, high scalability, and high throughput; these properties provide a safe storage environment for massive data and great convenience for processing very large data sets. It also integrates well with the MapReduce programming model and provides applications with high-throughput data access. This thesis first surveys, in chronological order, the representative distributed file systems of each era and their characteristics, and then analyzes the architecture and operating principles of HDFS in detail. Building on a study of HDFS high availability, and combining the strengths of the BackupNode and AvatarNode approaches, we design a highly available distributed file system that we call HADFS. It not only provides a hot-standby node for the NameNode but also fails over to that standby automatically when the NameNode goes down, without users noticing the switch. Finally, using HDFS as the underlying storage layer, we design a cloud disk system that supports uploading and downloading files, creating folders, and deleting files. The system is built on the SSH framework and uses the WebDAV protocol to transfer data to and from HDFS, cleanly decoupling the cloud disk's front end from the underlying storage.
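The automatic NameNode failover described above can be illustrated with a minimal sketch. This is an assumed simplification, not the thesis's actual implementation: a monitor records heartbeats from the active NameNode and, after repeated misses, promotes the hot standby so that clients keep using a single entry point. All names (`FailoverMonitor`, `heartbeat`, `"nn1"`, `"nn2"`) are illustrative.

```python
# Hypothetical sketch of hot-standby NameNode failover (not from the thesis).
# A monitor counts consecutive missed heartbeats; after a threshold it
# promotes the standby, so users never see the entry point change.

class FailoverMonitor:
    def __init__(self, primary, standby, max_failures=3):
        self.active = primary          # node clients are currently routed to
        self.standby = standby         # hot standby with replicated metadata
        self.failures = 0              # consecutive missed heartbeats
        self.max_failures = max_failures

    def heartbeat(self, alive):
        """Record one heartbeat result and return the current active node."""
        if alive:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures and self.standby:
                # Promote the standby; reset state for the new active node.
                self.active, self.standby = self.standby, None
                self.failures = 0
        return self.active
```

In this sketch the switch is transparent in the same sense the thesis describes: callers always ask the monitor for the active node, so a failover changes the answer without changing the interface.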
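The cloud disk's separation of front end from storage can likewise be sketched: each user action maps onto a standard WebDAV method sent to an HDFS gateway, so the front end never touches HDFS directly. The base URL, port, and path layout below are assumptions for illustration only.

```python
# Assumed mapping of cloud-disk actions onto WebDAV methods (illustrative;
# the gateway address "http://namenode:9800/webdav" is hypothetical).

WEBDAV_METHODS = {
    "upload": "PUT",      # create or overwrite a file
    "download": "GET",    # fetch file contents
    "mkdir": "MKCOL",     # create a folder (a WebDAV "collection")
    "delete": "DELETE",   # remove a file or folder
}

def webdav_request(action, path, base="http://namenode:9800/webdav"):
    """Translate a cloud-disk action into a (HTTP method, URL) pair."""
    method = WEBDAV_METHODS[action]
    return method, base.rstrip("/") + "/" + path.lstrip("/")
```

Because the front end only emits standard WebDAV requests, the underlying store could in principle be swapped without changing the user-facing code, which is the decoupling the thesis aims for.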
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year conferred]: 2013
[CLC classification]: TP333; TP316.4
Article ID: 2129658
Link: http://www.wukwdryxk.cn/kejilunwen/jisuanjikexuelunwen/2129658.html