Introduction to GlusterFS (はじめてのGlusterFS), @doryokujin, 1st GlusterFS Meetup, 2011/09/14


This deck summarizes GlusterFS based on the documentation, the white paper, and hands-on experience.


Page 1: はじめてのGlusterFS

Introduction to GlusterFS

@doryokujin, 1st GlusterFS Meetup, 2011/09/14

Page 2: はじめてのGlusterFS

・Takahiro Inoue (井上 敬浩), age 26

・twitter: doryokujin

・Data mining engineer

・Lead of MongoDB JP

・Interested in NoSQL and Hadoop

・Marathon best: 2 hours 33 minutes

About me

Page 3: はじめてのGlusterFS

1. Introduction

2. Setup Tutorial

3. No Metadata Server Approach

4. GlusterFS Application Demo

Agenda

Page 5: はじめてのGlusterFS

1. Features

Page 6: はじめてのGlusterFS

Approach

(Excerpt from the Gluster, Inc. scale-out white paper, 2011, p. 6; www.gluster.com)

If a customer has found that the performance levels are acceptable, but wants to increase capacity by 25%, they could add another 4, 1 TB drives to each server, and will not generally experience performance degradation. (i.e., each server would have 16, 1 TB drives). (See Config. B, above). Note that they do not need to upgrade to larger, or more powerful hardware, they simply add 8 more inexpensive SATA drives.

On the other hand, if the customer is happy with 24 TB of capacity, but wants to double performance, they could distribute the drives among 4 servers, rather than 2 servers (i.e. each server would have 6, 1 TB drives, rather than 12). Note that in this case, they are adding 2 more low-price servers, and can simply redeploy existing drives. (See Config. C, above)

If they want to both quadruple performance and quadruple capacity, they could distribute among 8 servers (i.e. each server would have 12, 1 TB drives). (See Config. D, below)

Note that by the time a solution has approximately 10 drives, the performance bottleneck has generally already moved to the network. (See Config. D, above)


Improving performance through horizontal scaling

Page 7: はじめてのGlusterFS


Increasing capacity through horizontal scaling

Improving data durability through replication

Approach

Page 8: はじめてのGlusterFS

(Excerpt from the same white paper, p. 7:)

So, in order to maximize performance, we can upgrade from a 1 Gigabit Ethernet network to a 10 Gigabit Ethernet network. Note that performance in this example is more than 25x that which we saw in the baseline. This is evidenced by an increase in performance from 200 MB/s in the baseline configuration to 5,000 MB/s. (See Config. E, below)

As you will note, the power of the scale-out model is that both capacity and performance can scale linearly to meet requirements. It is not necessary to know what performance levels will be needed 2, or 3 years out. Instead, configurations can be easily adjusted as the need demands.

Improving performance through faster transports such as InfiniBand

Approach

Page 9: はじめてのGlusterFS

[1] SOFTWARE ONLY
・No dependency on a particular OS or specific hardware
・Can also be deployed on Amazon Web Services or VMware

[2] OPEN SOURCE
・AGPL license

[3] COMPLETE STORAGE OS STACK
・distributed memory management
・I/O scheduling
・software RAID
・self-healing

Features

Page 10: はじめてのGlusterFS

[4] USER SPACE (<=> KERNEL SPACE)
・Easy to install and upgrade
・No kernel expertise needed; knowledge of C is enough

[5] MODULAR, STACKABLE ARCHITECTURE
・large numbers of large files
・huge numbers of very small files
・environments with cloud storage
・various transport protocols

Features

Page 11: はじめてのGlusterFS

[6] DATA STORED IN NATIVE FORMATS
・Works on EXT3, EXT4, and XFS
・POSIX compliance enables shell scripting

[7] NO METADATA WITH THE ELASTIC HASH ALGORITHM
・File locations are determined by the Elastic Hashing Algorithm
・i.e. no dedicated server for storing metadata is required
・i.e. No Single Point Of Failure !!

Features

Page 12: はじめてのGlusterFS

[8] Rich volume management features (a minimal command sketch follows after this list)

・Tuning Volume Options

・Expanding Volumes

・Shrinking Volumes

・Migrating Volumes

・Rebalancing Volumes

・Stopping Volumes

・Deleting Volumes

・Triggering Self-Heal on Replicate

Features
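The slide only names these operations. As a rough sketch of what a couple of them look like on the command line (GlusterFS 3.x CLI syntax; host4 and the option value below are only examples), expanding and rebalancing the 'log' volume created later in the tutorial would be something like:

# expanding: add a brick on a new server, then rebalance the data
➜ sudo gluster volume add-brick log host4:/mnt/glusterfs/server/log
➜ sudo gluster volume rebalance log start
➜ sudo gluster volume rebalance log status

# tuning: set a volume option (the value here is only an example)
➜ sudo gluster volume set log performance.cache-size 256MB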

Page 13: はじめてのGlusterFS

(a) Easy to adopt: deb and rpm packages are provided

(b) Simple architecture: the server-side component bundles local file systems into a Volume, and the client-side component mounts that Volume to use it

(c) POSIX compliant: file sets of tens of GB can be casually aggregated with shell scripting (see the sketch after this list)

(d) Easy to manage: no metadata server needed, and replication is simple too

(e) Hadoop support: even large-scale data can be analyzed

What I like about Gluster
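"Casual aggregation with shell scripting" can be as plain as piping files on the mounted volume through standard tools; the path and log format in this sketch are hypothetical:

➜ cat /mnt/glusterfs/client/log/access-2011-09-*.log \
    | awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' \
    | sort -rn | head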

Page 14: はじめてのGlusterFS

2. Setup Tutorial

Page 15: はじめてのGlusterFS

Terminology

Brick: the unit of storage in glusterfs, expressed as a directory path on a server in the Trusted Storage Pool.

glusterd: the "Gluster management daemon", started on every server in the Trusted Storage Pool.

Server / Client: data is distributed across the Bricks on the Servers; a Client accesses the data by mounting a Volume.

Trusted Storage Pool: a storage pool is a trusted network of storage servers.

Volume: a logical collection of Bricks; Clients mount on a per-Volume basis.

Volfile: the configuration file used by glusterfs processes ("/etc/glusterd/vols/VOLNAME").

Page 16: はじめてのGlusterFS

[1] Download + make

➜ BETA_LINK="http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/glusterfs-3.3beta2.tar.gz"

➜ wget $BETA_LINK

➜ tar zxvf glusterfs-3.3beta2.tar.gz

➜ cd glusterfs-3.3beta2

➜ ./configure && make

➜ sudo make install
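After installation, a quick sanity check that the binaries landed on the PATH (the administration guide later relies on the same command before setting up geo-replication):

➜ glusterfs --version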

Page 17: はじめてのGlusterFS

[2] Start the glusterd daemon and register servers into the trusted storage pool

# Start Gluster management daemon for each server

➜ sudo /etc/init.d/glusterd start

# Adding Servers to Trusted Storage Pool

➜ for HOST in host1 host2 host3; do

gluster peer probe $HOST; done

#=> Probe successful

Probe successful

Probe successful

Page 18: はじめてのGlusterFS

[3] Checking the servers in the trusted pool

➜ sudo gluster peer status

#=> Number of Peers: 3

Hostname: host1

Uuid: 81982001-ba0d-455a-bae8-cb93679dbddd

State: Peer in Cluster (Connected)

Hostname: host2

Uuid: 03945cd4-7487-4b2c-9384-f006a76dfee5

State: Peer in Cluster (Connected)...

Page 19: はじめてのGlusterFS

[4-1] Creating a distribute volume

# Create a distribute Volume named 'log'

➜ SERVER_LOG_PATH="/mnt/glusterfs/server/log"

➜ sudo gluster volume create log transport tcp \
    host1:$SERVER_LOG_PATH \
    host2:$SERVER_LOG_PATH \
    host3:$SERVER_LOG_PATH

Distributed across host1, host2, and host3

Page 20: はじめてのGlusterFS

$ sudo gluster volume start log # start Volume

$ sudo gluster volume info log

#=> Volume Name: log

Type: Distribute

Status: Started

Number of Bricks: 12

Transport-type: tcp

Bricks:

Brick1: delta1:/mnt/glusterfs/server/log

Brick2: delta2:/mnt/glusterfs/server/log

Brick3: delta3:/mnt/glusterfs/server/log

...

[5] Starting the volume

Page 21: はじめてのGlusterFS

# Create a distribute replicate Volume named ‘repository’

➜ SERVER_REPO_PATH="/mnt/glusterfs/server/repository"

➜ sudo gluster volume create repository \
    replica 2 transport tcp \
    host1:$SERVER_REPO_PATH host2:$SERVER_REPO_PATH \
    host3:$SERVER_REPO_PATH host4:$SERVER_REPO_PATH

host1 and host2 form an active-active replication pair

host3 and host4 form an active-active replication pair

[4-2] Creating a distribute replicate volume

Page 22: はじめてのGlusterFS

$ sudo gluster volume info repository

#=> Volume Name: repository

Type: Distributed-Replicate

Status: Started

Number of Bricks: 2 x 2 = 4

Transport-type: tcp

Bricks:

Brick1: delta1:/mnt/glusterfs/server/repository

Brick2: delta2:/mnt/glusterfs/server/repository

Brick3: delta3:/mnt/glusterfs/server/repository

Brick4: delta4:/mnt/glusterfs/server/repository

(Brick1 and Brick2 sync, Brick3 and Brick4 sync; the two replica pairs are distributed)

[4-2] Checking the distribute replicate volume

Page 23: はじめてのGlusterFS

[1] Gluster Native Client (FUSE)

・high concurrency, performance, transparent failover

・POSIX compliant; the getfattr command works

・suited to large files

[2] NFS

・NFS v3

・the getfattr command is not available(?)

・suited to huge numbers of small files

Client-side access methods

Page 24: はじめてのGlusterFS

[6] Mounting the volume

# Mount the 'log' Volume

➜ CLIENT_LOG_PATH="/mnt/glusterfs/client/log"

# native client
➜ sudo mount -t glusterfs \
    -o log-level=WARNING,log-file=/var/log/gluster.log \
    localhost:log $CLIENT_LOG_PATH

# nfs
➜ sudo mount -t nfs -o mountproto=tcp \
    localhost:log $CLIENT_LOG_PATH
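To make the mount persistent across reboots, the 3.x administration guide documents /etc/fstab entries along these lines; treat this as a hedged sketch reusing the paths above:

localhost:/log  /mnt/glusterfs/client/log  glusterfs  defaults,_netdev,log-level=WARNING,log-file=/var/log/gluster.log  0 0
# or, via NFS:
# localhost:/log  /mnt/glusterfs/client/log  nfs  defaults,_netdev,vers=3,mountproto=tcp  0 0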

Page 25: はじめてのGlusterFS

[7] Checking the mounts

➜ df -h

#=>

Filesystem Size Used Avail Use% Mounted on

/dev/sdb 1.9T 491G 1.4T 27% /mnt/disk2

...

localhost:repository 11T 4.0G 11T 1%

/mnt/glusterfs/client/repository

localhost:log 8.3T 3.6T 4.3T 46%

/mnt/glusterfs/client/log

Page 26: はじめてのGlusterFS

[8] Getting the physical location

# Get Physical Location

➜ sudo getfattr -m . -n trusted.glusterfs.pathinfo \
    /mnt/glusterfs/client/repository/some_file

#=>
# file: /mnt/glusterfs/client/repository/some_file
trusted.glusterfs.pathinfo="(
  <REPLICATE:repository-replicate-0>
  <POSIX:host1:/mnt/glusterfs/server/repository/some_file>
  <POSIX:host2:/mnt/glusterfs/server/repository/some_file>
)"
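A quick way to see how a set of files is spread across bricks is to loop the same query over the mount (the file glob here is hypothetical):

➜ for F in /mnt/glusterfs/client/repository/*; do
    sudo getfattr -m . -n trusted.glusterfs.pathinfo "$F"
  done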

Page 27: はじめてのGlusterFS

Hadoop support (from 3.3 beta2)

・Runs "getfattr -m . -n trusted.glusterfs.pathinfo" to determine the physical location and volume type (from beta2, this output format changed significantly to support Hadoop)

・Supports data locality and taking input from replicas as well

・Performance is still an unknown, but being freed from the operational worries of HDFS metadata management is a big benefit

・And, thanks to POSIX compliance, "looking at the data" can be done casually and in depth with shell scripts even at large scale

Page 28: はじめてのGlusterFS

・(Distributed) Striped Volume

・Geo-Replication

Geo-Replication Volume

Replicate Volume: mirroring across clusters; high availability; synchronous replication

Geo-Replication: mirroring across geographically distant clusters; backing up of data for disaster recovery; asynchronous replication

Page 29: はじめてのGlusterFS

(Excerpt from the Gluster File System Administration Guide 3.2, pp. 47-48:)

9.2.1. Exploring Geo-replication Deployment Scenarios

GlusterFS Geo-replication provides an incremental replication service over Local Area Networks (LANs), Wide Area Networks (WANs), and across the Internet. This section illustrates the most common deployment scenarios for GlusterFS Geo-replication, including the following:

Geo-replication over LAN: you can configure GlusterFS Geo-replication to mirror data over a Local Area Network.

Geo-replication over WAN: you can configure GlusterFS Geo-replication to replicate data over a Wide Area Network.

Geo-replication over the Internet: you can configure GlusterFS Geo-replication to mirror data over the Internet.

Multi-site cascading Geo-replication: you can configure GlusterFS Geo-replication to mirror data in a cascading fashion across multiple sites.

9.2.2. GlusterFS Geo-replication Deployment Overview

Deploying GlusterFS Geo-replication involves the following steps:

1. Verify that GlusterFS v3.2 is installed and running on systems that will serve as masters, using the following command:

#glusterfs --version
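The guide then has you configure and start a replication session. As a hedged sketch following the 3.2 guide's command syntax (the master volume and slave target below are hypothetical):

➜ sudo gluster volume geo-replication repository example.com:/data/remote_dir start
➜ sudo gluster volume geo-replication repository example.com:/data/remote_dir status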

Page 30: はじめてのGlusterFS

3. No Metadata Server Approach

Page 31: はじめてのGlusterFS

CENTRALIZED METADATA SYSTEMS

(Excerpt from the Gluster, Inc. white paper, p. 13; Figure 4: Centralized Metadata Approach)

3.2 DISTRIBUTED METADATA SYSTEMS

An alternative approach is to forego a centralized metadata server in favor of a distributed metadata approach. In this implementation, the index of location metadata is spread among a large number of storage systems.

While this approach would appear on the surface to address the shortcomings of the centralized approach, it introduces an entirely new set of performance and availability issues.

1. Performance Overhead: Considerable performance overhead is introduced as the various distributed systems try to stay in sync with data via the use of various locking and synching mechanisms. Thus, most of the performance scaling issues that plague centralized metadata systems plague distributed metadata systems as well. Performance degrades as there is an increase in files, file operations, storage systems, disks, or the randomness of I/O operations. Performance similarly degrades as the average file size decreases.

While some systems attempt to counterbalance these effects by creating dedicated solid state drives with high performance internal networks for metadata, this approach can become prohibitively expensive.

2. Corruption Issues: Distributed metadata systems also face the potential for serious corruption issues. While the loss or corruption of one distributed node won't take down the entire system, it can corrupt the entire system. When metadata is stored in multiple locations, the requirement to maintain it synchronously also implies significant risk related to situations when the metadata is not properly kept in synch, or in the event it is actually damaged. The worst possible scenario involves apparently-successful updates to file data and metadata to separate locations, without correct synchronous maintenance of metadata, such that there is no longer perfect agreement among the multiple instances. Furthermore, the chances of a corrupted storage system increase exponentially with the number of systems. Thus, concurrency of metadata becomes a significant challenge.

Page 32: はじめてのGlusterFS

(Continued from the white paper, p. 14:)

Figure 5, below, illustrates a typical distributed metadata server implementation. It can be seen that this approach also results in considerable overhead processing for file access, and by design has built-in exposure for corruption scenarios. Here again we see a legacy approach to scale-out storage not congruent with the requirement of the modern data center or with the burgeoning migration to virtualization and cloud computing.

Figure 5: Decentralized Metadata Approach

3.3 AN ALGORITHMIC APPROACH (NO METADATA MODEL)

As we have seen so far, any system which separates data from location metadata introduces both performance and reliability concerns. Therefore, Gluster designed a system which does not separate metadata from data, and which does not rely on any separate metadata server, whether centralized or distributed.

Instead, Gluster locates data algorithmically. Knowing nothing but the path name and file name, any storage system node and any client requiring read or write access to a file in a Gluster storage cluster performs a mathematical operation that calculates the file location. In other words, there is no need to separate location metadata from data, because the location can be determined independently.

We call this the Elastic Hashing Algorithm, and it is key to many of the unique advantages of Gluster. While a complete explanation of the Elastic Hashing Algorithm is beyond the scope of this document, the following is a simplified explanation that should illuminate some of the guiding principles of the Elastic Hashing Algorithm.

DISTRIBUTED METADATA SYSTEMS

Page 33: はじめてのGlusterFS

ALGORITHMIC APPROACH (NO METADATA SERVER MODEL)

[Elastic Hashing Algorithm]

・Each operation locates its metadata independently via the algorithm, so it is fast

・Locating metadata is barely affected by data growth or additional servers, so performance scales almost linearly

・Far less prone to inconsistencies caused by metadata mismatches, which makes for a 'safe' distributed file system

Page 34: はじめてのGlusterFS

(Excerpt from the white paper, p. 15:)

The benefits of the Elastic Hashing Algorithm are fourfold:

1. The algorithmic approach makes Gluster faster for each individual operation, because it calculates metadata using an algorithm, and that approach is faster than retrieving metadata from any storage media.

2. The algorithmic approach also means that Gluster is faster for large and growing individual systems because there is never any contention for any single instance of metadata stored at only one location.

3. The algorithmic approach means Gluster is faster and achieves true linear scaling for distributed deployments, because each node is independent in its algorithmic handling of its own metadata, eliminating the need to synchronize metadata.

4. Most importantly, the algorithmic approach means that Gluster is safer in distributed deployments, because it eliminates all scenarios of risk which are derived from out-of-synch metadata (and that is arguably the most common source of significant risk to large bodies of distributed data).

To explain how the Elastic Hashing Algorithm works, we will examine each of the three words (algorithm, hashing, and elastic.)

Let's start with "algorithm." We are all familiar with an algorithmic approach to locating data. If a person goes into any office that stores physical documents in folders in filing cabinets, that person should be able to find the "Acme" folder without going to a central index: anyone in the office would know that the "Acme" folder is located in the "A" filing cabinet, filed alphabetically between its neighboring folders.

Similarly, one could implement an algorithmic approach to data storage that used a similar "alphabet" algorithm to locate files. For example, in a ten system cluster, one could assign all files whose names begin with the letter "A" to disk 1, and so on through the alphabet up to disk 10. Figure 6, below, illustrates this concept.

Figure 6: Understanding EHA: Algorithm

[1] Location is determined by lexicographic order: a file for "Acme" goes to Disk 1, one for "Do" goes to Disk 2

ALGORITHMIC APPROACH (NO METADATA SERVER MODEL)

Page 35: はじめてのGlusterFS

(Excerpt from the white paper, p. 16:)

Because it is easy to calculate where a file is located, any client or storage system could locate a file based solely on its name. Because there is no need for a separate metadata store, the performance, scaling, and single point-of-failure issues are solved.

Of course, an alphabetic algorithm would never work in practice. File names are not themselves unique, certain letters are far more common than others, we could easily get hotspots where a group of files with similar names are stored, etc.

3.4 THE USE OF HASHING

To address some of the abovementioned shortcomings, you could use a hash-based algorithm. A hash is a mathematical function that converts a string of arbitrary length into a fixed-length value. People familiar with hash algorithms (e.g. the SHA-1 hashing function used in cryptography or various URL shorteners like bit.ly) will know that hash functions are generally chosen for properties such as determinism (the same starting string will always result in the same ending hash) and uniformity (the ending results tend to be uniformly distributed mathematically). Gluster's Elastic Hashing Algorithm is based on the Davies-Meyer hashing algorithm.

In the Gluster algorithmic approach, we take a given pathname/filename (which is unique in any directory tree) and run it through the hashing algorithm. Each pathname/filename results in a unique numerical result.

For the sake of simplicity, one could imagine assigning all files whose hash ends in the number 1 to the first disk, all which end in the number 2 to the second disk, etc. Figure 7, below, illustrates this concept.

Figure 7: Understanding EHA: Hashing

[2] A hash-based algorithm keyed on the pathname/filename; this example uses mod 10

ALGORITHMIC APPROACH (NO METADATA SERVER MODEL)
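The mod-10 idea above can be mimicked in a couple of lines of shell. This is only a toy illustration of "hash the name, take it modulo the number of bricks": cksum stands in for the Davies-Meyer hash Gluster actually uses, and the file names are arbitrary.

➜ BRICKS=10
➜ for F in /reports/acme.doc /reports/acorn.doc /logs/app.log; do
    H=$(printf '%s' "$F" | cksum | awk '{ print $1 }')
    echo "$F -> brick $(( H % BRICKS ))"
  done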

Page 36: はじめてのGlusterFS

Virtual Volumes

[Virtual Volumes]

・How does the system cope with servers being added or removed, or with files being renamed?

・The basic algorithm itself is never changed

・Instead, the physical placement behind the "Virtual Volumes" is changed

・Virtual Volumes are kept separate from the actual Volumes

Page 37: はじめてのGlusterFS

(Excerpt from the white paper, p. 17:)

Of course, questions still arise. What if we add or delete physical disks? What if certain disks develop hotspots? To answer that, we'll turn to a set of questions about how we make the hashing algorithm "elastic".

3.5 MAKING IT ALL ELASTIC: PART I

In the real world, stuff happens. Disks fail, capacity is used up, files need to be redistributed, etc.

Gluster addresses these challenges by:

1. Setting up a very large number of virtual volumes
2. Using the hashing algorithm to assign files to virtual volumes
3. Using a separate process to assign virtual volumes to multiple physical devices

Thus, when disks or nodes are added or deleted, the algorithm itself does not need to be changed. However, virtual volumes can be migrated or assigned to new physical locations as the need arises. Figure 8, below, illustrates the Gluster "Elastic Hashing Algorithm".

Figure 8: Understanding EHA: Elasticity

For most people, the preceding discussion should be sufficient for understanding the Elastic Hashing Algorithm. It oversimplifies in some respects for pedagogical purposes. (For example, each folder is actually assigned its own hash space.). Advanced discussion on Elastic Volume Management, Moving, or Renaming, and High Availability follows in the next section, Advanced Topics.

ALGORITHMIC APPROACH (NO METADATA MODEL)

[3] The algorithm stays unchanged; only the physical placement within the Virtual Volumes is modified

Page 38: はじめてのGlusterFS

[1] Migration starts in the background

[2] Meanwhile, a pointer is first created at the new location

[3] The client, following the algorithm, goes to look at the new location; the pointer there redirects it to the old location

[4] Once migration is complete, the pointer is removed

※ Is this "pointer" the file shown below? ↓

When a rename or move changes the physical placement

➜ ll /mnt/glusterfs/server/result*

---------T 1 doryokujin 0 Sep 13 16:40 result.host1

-rw-r--r-- 1 doryokujin 4044654 Sep 13 16:40 result.host6
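Most likely yes: the zero-byte, mode ---------T entry looks like a DHT link file. One way to check, assuming the trusted.glusterfs.dht.linkto extended attribute that 3.x DHT link files usually carry, would be:

➜ sudo getfattr -n trusted.glusterfs.dht.linkto /mnt/glusterfs/server/result.host1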

Page 39: はじめてのGlusterFS

・How do the "logical" and "physical" volumes interact?

・What does the Fix Layout phase of the rebalance command actually change: the algorithm, or the Virtual Volumes?

・If a file is written directly into the server-side path, it is sometimes visible from the client-side path and sometimes not. What exactly is the behavior here? (the following is from my own experience)

(a) distributed volume: visible

(b) replicate volume: visible on only one of the replicas

(c) nfs mount: not visible

Open questions

Page 40: はじめてのGlusterFS

・With a 2-replica volume, if one server of a pair dies, will a new replica be created on a surviving server (i.e. are the replica pairs and replica count always maintained)?

・If so, is the new pair fixed again, or does the data get re-replicated across multiple servers?

Open questions

Page 41: はじめてのGlusterFS

4. GlusterFS Application Demo (bonus)

Page 42: はじめてのGlusterFS

・Using GlusterFS as log storage

・Aggregating the logs with a MapReduce framework (a rough shell stand-in for the idea follows after this list)

- Implemented in Python

- Uses Redis as a shared parameter space, enabling machine learning / iterative jobs

- Uses Redis for in-memory / pipelined MR

- Has servers run the Map/Reduce steps via Fabric's SSH features

- Tighter integration with R and MongoDB

Introduction
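The author's framework is in Python with Redis and Fabric, none of which is shown here. As a crude shell-only stand-in for the map-then-reduce idea over the mounted log volume (paths and the log field number are hypothetical):

# map: count per-URL hits for each log file, one partial result per input file
➜ for F in /mnt/glusterfs/client/log/*.log; do
    awk '{ print $7 }' "$F" | sort | uniq -c > "/tmp/map.$(basename $F).out" &
  done; wait
# reduce: merge the partial counts and show the top URLs
➜ awk '{ sum[$2] += $1 } END { for (k in sum) print sum[k], k }' /tmp/map.*.out | sort -rn | head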

Page 43: はじめてのGlusterFS

Demo time

Page 44: はじめてのGlusterFS

Thank you very much