Hadoop Overview


2008.01.15 유현정

Hadoop

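• Brief History
– 2005: Started by Doug Cutting (developer of Lucene and Nutch)
  • Grew out of the distributed-scaling issues of the Nutch open-source search engine
– 2006: Full backing from Yahoo (hired Doug Cutting and a dedicated team)
– 2008: Promoted to an Apache top-level project
– Current release (as of 2009.1): 0.19.0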


• Written in Java
• Apache License
• Many components: HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, etc.

Hadoop Architecture


• Main functions
– Distributed file system
– Distributed computing

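• Distributed File System (DFS)
– A system that stores files in one large virtual space formed by pooling the storage of servers connected over a network
– Designed to meet two requirements at once: very large data, such as index files built by analyzing the content of web pages worldwide, and an enormous volume of transactions
– Examples
  • OwFS (Owner based File System), jointly developed by NHN and KAIST
  • Sun Microsystems' NFS
  • Microsoft's distributed file system
  • IBM's Transarc DFS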

Google File System(GFS)

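• Hadoop's DFS implements the basic concepts of the Google File System (GFS) as they are
• GFS characteristics
– Uses inexpensive commodity hardware such as PCs
– Solves problems in software instead of using expensive equipment such as NAS
– Must handle a large number of large files (hundreds of MB to several GB)
– Does not take additional backups of the data
– Machines can be added and removed freely
– Keeps providing service without a separate recovery procedure even when a particular node fails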


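• Two daemon servers (running at the user application level)
– GFS master: manages file metadata such as file name and size
– GFS chunkserver: stores the actual file data
• A single file of hundreds of MB to several GB or more is split into pieces and stored on multiple chunkservers
• A piece of a split file = chunk (default: 64 MB)
• GFS client: processes files by communicating with the daemon servers
– Performs operations such as creating, reading, and writing files
– Provided as an API; internally it talks to the servers over sockets or similar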
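The chunking rule above can be sketched in a few lines. This is only an illustration: the function name `split_into_chunks` and the tuple layout are assumptions made here, not part of GFS; only the 64 MB default comes from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB default chunk size, as stated on the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (chunk_index, offset, length) tuples that cover the whole file.

    Hypothetical helper for illustration; every chunk is full-size except
    possibly the last one.
    """
    chunks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((index, offset, length))
        offset += length
        index += 1
    return chunks


if __name__ == "__main__":
    one_gb = 1024 ** 3
    print(len(split_into_chunks(one_gb)))  # a 1 GB file needs 16 chunks of 64 MB
```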


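• How it works
– 1. The application uses the API provided by the file system to create a GFS client and request a file operation
– 2. The GFS client asks the GFS master for information about the file
– 3. The GFS master returns the information for the requested file from the file metadata it manages
  • The returned data includes the file size, the number of chunks the file is split into, the chunk size, and the addresses of the chunkservers that store each chunk
– 4. The GFS client connects to the chunkserver that stores the chunk and requests the file operation
– 5. The GFS chunkserver performs the operation on the actual file data
⇒ The client and the master exchange only information about the file; the actual file data moves and is processed between the client and the chunkservers
⇒ This is intended to minimize the load on the master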
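The five steps above can be modelled as a small in-memory simulation. All class and method names here (`Master`, `ChunkServer`, `Client`, `lookup`, `read_file`) are hypothetical and chosen for illustration; the real GFS protocol is not public as an API. The point of the sketch is that the master returns only locations, while bytes flow from the chunkservers.

```python
class Master:
    """Holds metadata only: which chunks make up a file and where they live."""

    def __init__(self):
        self.metadata = {}  # filename -> list of (chunk_id, chunkserver_name)

    def lookup(self, filename):
        return self.metadata[filename]


class ChunkServer:
    """Holds the actual chunk bytes."""

    def __init__(self):
        self.chunks = {}  # chunk_id -> bytes

    def read(self, chunk_id):
        return self.chunks[chunk_id]


class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # name -> ChunkServer

    def read_file(self, filename):
        # Steps 2-3: ask the master for chunk locations (metadata only).
        locations = self.master.lookup(filename)
        # Steps 4-5: fetch the bytes directly from the chunkservers.
        return b"".join(
            self.chunkservers[server].read(chunk_id)
            for chunk_id, server in locations
        )


if __name__ == "__main__":
    cs_a, cs_b = ChunkServer(), ChunkServer()
    cs_a.chunks["c0"] = b"hello "
    cs_b.chunks["c1"] = b"world"
    master = Master()
    master.metadata["/logs/a.txt"] = [("c0", "A"), ("c1", "B")]
    client = Client(master, {"A": cs_a, "B": cs_b})
    print(client.read_file("/logs/a.txt"))
```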


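• Replication: one of the most important features
– A chunk is not stored on just one chunkserver; copies are stored on multiple chunkservers
  • Default: 3 replicas
– If a chunk has fewer replicas than the configured number, for example because a chunkserver went down, the master arranges for new replicas to be created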
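The master's repair duty described above might look like the following sketch. The function `re_replicate`, its data layout, and the alphabetical choice of target servers are assumptions for illustration only; the slide does not say how GFS picks new replica locations.

```python
REPLICATION = 3  # default replica count, as stated on the slide


def re_replicate(chunk_locations, live_servers, replication=REPLICATION):
    """Drop dead servers from each chunk's replica set and add live ones.

    chunk_locations: dict chunk_id -> set of server names holding a replica.
    live_servers: set of server names that are currently up.
    Returns a new dict; the input is not modified.
    """
    repaired = {}
    for chunk_id, servers in chunk_locations.items():
        alive = set(servers) & live_servers
        # Candidates are live servers that do not already hold this chunk.
        candidates = sorted(live_servers - alive)
        missing = max(0, replication - len(alive))
        # A chunk with no surviving replica cannot be copied from anywhere.
        if alive:
            alive |= set(candidates[:missing])
        repaired[chunk_id] = alive
    return repaired


if __name__ == "__main__":
    locations = {"c0": {"A", "B", "C"}, "c1": {"B", "C", "D"}}
    # Server B has gone down; A, C, D, E are still up.
    print(re_replicate(locations, {"A", "C", "D", "E"}))
```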


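• Advantages of replication
– Service continues without interruption even if a chunkserver holding a chunk goes down
– Because replicas always exist, the file system itself provides mirroring comparable to RAID 1
  • In practice, because Google does not use expensive storage equipment such as NAS, it runs its systems at a far lower cost than competitors such as Yahoo
– When reads concentrate on a particular file or chunk, the load that would fall on a single chunkserver can be spread out, not only across servers but also across physical devices such as disk heads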


• Disadvantages of replication
– GFS creates replicas synchronously: file creation is marked complete only after the replicas have been fully stored. As a result, writes are basically 3 or more times slower than on an ordinary file system.
  • GFS may therefore be unsuitable for online real-time systems; it suits systems in which a file is created once and only read afterwards
  • Typical services: search, video, images, mail (MIME files)
– Storing 3 replicas requires 3 times the disk space
  • Because GFS can use the cheap disks attached to PCs, this can hardly be called a disadvantage relative to enterprise environments that rely on expensive storage such as NAS
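The disk-space cost of replication is simple arithmetic, sketched below. Both helper names are hypothetical, and the sketch ignores metadata and file system overhead; it only restates the "3 replicas means 3 times the space" point.

```python
def replicated_footprint(logical_bytes, replication=3):
    """Raw disk space consumed when every chunk is stored `replication` times."""
    raw = logical_bytes * replication
    return {"raw_bytes": raw, "overhead_bytes": raw - logical_bytes}


def usable_capacity(raw_capacity_bytes, replication=3):
    """Logical data that fits into a cluster with the given raw capacity."""
    return raw_capacity_bytes // replication


if __name__ == "__main__":
    tb = 1024 ** 4
    print(replicated_footprint(10 * tb))  # 10 TB of data occupies 30 TB of disk
    print(usable_capacity(300 * tb) // tb)  # 300 TB of raw disk holds 100 TB
```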

Hadoop Distributed File System(HDFS)

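• A file system written purely in Java
• Shares many similarities with existing distributed file systems, but also differs from them significantly
– Highly fault-tolerant
– Designed to be deployed on low-cost hardware
– Provides high-throughput access to application data
– Suitable for applications that have large data sets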

Assumptions & Goals of HDFS

• 1. Hardware failure
– An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.
– The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional.
– Therefore, detecting faults and recovering from them quickly and automatically is a core architectural goal of HDFS.
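A short calculation shows why "some component is always non-functional" at this scale. The sketch assumes that components fail independently and that each is down with the same probability at any given moment; both the independence assumption and the example numbers are illustrative, not taken from the slides.

```python
def prob_any_failed(n_components, p_fail):
    """Probability that at least one of n independent components is down.

    Uses 1 - (1 - p)^n, which holds only under the independence assumption.
    """
    return 1.0 - (1.0 - p_fail) ** n_components


if __name__ == "__main__":
    # Hypothetical numbers: each machine is down 0.1% of the time.
    for n in (10, 1000, 5000):
        print(n, round(prob_any_failed(n, 0.001), 3))
```

Even with a small per-machine failure probability, the chance that something is broken grows quickly with cluster size, which is why recovery has to be automatic.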


• 2. Streaming data access
– HDFS is suited to batch processing
– HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
– HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
– Strength: high throughput of data access

* POSIX (Portable Operating System Interface): a set of standard operating system interfaces based on the Unix operating system


• 3. Large data sets
– Applications that run on HDFS have large data sets
– A typical file in HDFS is gigabytes to terabytes in size
– HDFS is therefore tuned to support large files
  • It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster
  • It should support tens of millions of files in a single instance


• 4. Simple coherency model
– HDFS applications need a write-once-read-many access model for files.
– This assumption simplifies data coherency issues and enables high-throughput data access
– A Map/Reduce application or a web crawler application fits perfectly with this model.
– Support for appending writes is planned for the future
  • Scheduled to be included in Hadoop 0.19 but is not available yet
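The write-once-read-many model can be illustrated with a toy store that rejects any second write to the same path. `WriteOnceStore` is a hypothetical class made up for this sketch and is not how HDFS enforces the rule; it only shows why readers never have to worry about a file changing underneath them.

```python
class WriteOnceStore:
    """Toy file store: a path can be created once and then only read."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        if path in self._files:
            # No overwrite and no append: the content is fixed after creation.
            raise FileExistsError(path)
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]


if __name__ == "__main__":
    store = WriteOnceStore()
    store.create("/crawl/page-0001", b"<html>...</html>")
    print(store.read("/crawl/page-0001"))
    try:
        store.create("/crawl/page-0001", b"changed")
    except FileExistsError as err:
        print("rejected second write to", err)
```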


• 5. "Moving Computation is Cheaper than Moving Data"
– A computation requested by an application is more efficient when it runs near the data it operates on
– This is especially true when the data sets are very large
– This minimizes network congestion and increases the overall throughput of the system
– The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running.
– HDFS therefore provides interfaces that let applications move themselves closer to where the data is located
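One way to picture "moving computation to the data" is a scheduler that prefers nodes already holding a replica of the block a task needs. The greedy policy, the function name `assign_tasks`, and the slot model below are assumptions for illustration; they are not the scheduling algorithm Hadoop actually uses.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block's task to a node, preferring nodes that hold the block.

    block_locations: dict block_id -> list of node names holding a replica.
    free_slots: dict node name -> number of tasks the node can still accept.
    Returns dict block_id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for block_id in sorted(block_locations):
        local = [n for n in block_locations[block_id] if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has capacity: fall back to any free node,
            # which means the block must travel over the network.
            remote = sorted(n for n, free in slots.items() if free > 0)
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b0": ["A", "B"], "b1": ["A"], "b2": ["A"]}
    print(assign_tasks(blocks, {"A": 2, "B": 1}))
```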


• 6. Portability across heterogeneous hardware and software platforms
– HDFS is designed to be easily portable from one platform to another
– This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Features of HDFS

• A master/slave architecture

• A single NameNode, a master server for an HDFS cluster
– Manages the file system namespace
  • Executes file system namespace operations like opening, closing, and renaming files and directories
– Regulates access to files by clients
– Determines the mapping of blocks to DataNodes
⇒ Simplifies the architecture of the system
⇒ NameNode = arbitrator and repository for all HDFS metadata


• A number of DataNodes, usually one per node in the cluster
– Manage storage attached to the nodes that they run on
  • Perform block creation, deletion, and replication upon instruction from the NameNode


• The NameNode and DataNodes are pieces of software designed to run on commodity machines.

• These machines typically run a GNU/Linux OS.

• HDFS is built using the Java language; any machine that supports Java can run the NameNode or DataNode software.


• A typical deployment has a dedicated machine that runs only the NameNode software.

• Each of the other machines in the cluster runs one instance of the DataNode software.

-> The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.


• Single namespace for the entire cluster
– Managed by a single NameNode
– Files are write-once
– Optimized for streaming reads of large files
• Files are broken into large blocks
– HDFS is a block-structured file system
– Default block size
  • In HDFS: 64 MB vs. in other systems: 4 or 8 KB
– These blocks are stored in a set of DataNodes
• Client talks to both the NameNode and DataNodes
– Data is not sent through the NameNode
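To get a feel for the block-size comparison above, the sketch below counts how many blocks a file occupies at each size. The helper `block_count` is hypothetical; the interpretation that fewer blocks means fewer entries for the single NameNode to track is an inference from the slides, not a statement in them.

```python
MB = 1024 * 1024
KB = 1024


def block_count(file_size, block_size):
    """Number of blocks needed to hold file_size bytes (last block may be partial)."""
    return (file_size + block_size - 1) // block_size


if __name__ == "__main__":
    one_gb = 1024 * MB
    print(block_count(one_gb, 64 * MB))  # 16 blocks at 64 MB
    print(block_count(one_gb, 4 * KB))   # 262144 blocks at 4 KB
```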


• DataNodes holding blocks of multiple files with a replication factor of 2

• The NameNode maps the filenames onto the block ids
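The two mappings described here (filename to block ids on the NameNode, block id to DataNodes) can be sketched as plain dictionaries. The round-robin placement and the function name `place_blocks` are assumptions made for illustration; the slides do not say how HDFS chooses DataNodes for replicas.

```python
def place_blocks(files, datanodes, replication=2):
    """Build the NameNode-style mappings for a set of files.

    files: dict filename -> number of blocks in that file.
    datanodes: list of DataNode names.
    Returns (name_to_blocks, block_to_nodes).
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    name_to_blocks = {}
    block_to_nodes = {}
    next_block_id = 1
    for filename in sorted(files):
        ids = []
        for _ in range(files[filename]):
            block_id = next_block_id
            next_block_id += 1
            # Round-robin placement (an assumption): replicas of a block go
            # to consecutive DataNodes, so they never share a node.
            start = (block_id - 1) % len(datanodes)
            block_to_nodes[block_id] = [
                datanodes[(start + r) % len(datanodes)]
                for r in range(replication)
            ]
            ids.append(block_id)
        name_to_blocks[filename] = ids
    return name_to_blocks, block_to_nodes


if __name__ == "__main__":
    names, blocks = place_blocks({"/a": 2, "/b": 3}, ["dn1", "dn2", "dn3"])
    print(names)
    print(blocks)
```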


• Built from a cluster of DataNodes
• Each server serves data blocks over the network
• Data is also served over HTTP so that all content can be accessed from a web browser or other clients


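• Requires one special server, the NameNode
-> A single point of failure for an HDFS installation: if the NameNode goes down, the file system goes down with it
• A secondary NameNode is sometimes run
• Most installations use a single NameNode
• The replay process takes 30 minutes or more on a large cluster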
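The slide does not define the replay process; the sketch below assumes it means re-applying a log of recorded namespace changes to rebuild the NameNode's in-memory state after a restart. The record format and the function name `replay` are hypothetical. Because every logged operation is applied in order, the time needed grows with the length of the log.

```python
def replay(edit_log):
    """Rebuild the set of paths in the namespace by applying edits in order.

    edit_log: iterable of tuples such as ("create", path),
    ("delete", path), or ("rename", src, dst). Hypothetical format.
    """
    namespace = set()
    for op, *args in edit_log:
        if op == "create":
            namespace.add(args[0])
        elif op == "delete":
            namespace.discard(args[0])
        elif op == "rename":
            src, dst = args
            if src in namespace:
                namespace.remove(src)
                namespace.add(dst)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


if __name__ == "__main__":
    log = [
        ("create", "/a"),
        ("create", "/b"),
        ("rename", "/a", "/c"),
        ("delete", "/b"),
    ]
    print(replay(log))
```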


The File System Namespace

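• The file system namespace hierarchy is similar to that of other file systems:
– Create and delete files
– Move a file from one directory to another
– Rename files
• User quotas and access permissions: not implemented
• Hard links and soft links: not supported
• However, the architecture does not preclude implementing these features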


• NameNode: maintains the file system namespace
• Any change to the file system namespace or its properties is recorded by the NameNode
• An application can specify the number of replicas of a file that HDFS should maintain.
– The number of copies of a file is called the replication factor of that file.
– This information is also stored by the NameNode
