
GPFS-FPO: A Cluster File System for Scale-Out Applications

Dinesh Subhraveti, IBM Research


Motivation: SANs are built for latency, not bandwidth.

Advantages of using GPFS:
- High scale (thousands of nodes, petabytes of storage), high performance, high availability, data integrity
- POSIX semantics, workload isolation, enterprise features (security, snapshots, backup/restore, archive, asynchronous caching and replication)

Challenges:
- Adapt GPFS to shared-nothing clusters
- Maximize application performance relative to cost
- Failure is common, and the network is a bottleneck

[Figure: a traditional SAN architecture hits an I/O bottleneck, while a shared-nothing environment scales out: 1 server -> 1 GB/s, 100 servers -> 100 GB/s]

[Chart: throughput for sequential write, sequential read, random read, random write, backward read, record rewrite, and stride read, comparing gpfs-san vs. gpfs-snc]

GPFS-FPO: Motivation


[Architecture diagram: nodes 1..N, each with local or rack-attached disks, solid-state disk cache, and its own share of data and metadata; applications (InfoSphere BigInsights, POSIX, MapReduce, Cloud) sit on top, with an optional SAN tier alongside]

Key properties:
- Multiple block sizes
- Replicated metadata: no single point of failure
- Multi-tiered storage with automated policy-based migration
- Smart caching
- Data layout control
- Disaster recovery
- High scale: 1000 nodes, PBs of data

GPFS-FPO: Architecture


Hardware configuration:
- 4 nodes; per node: 16 cores, 96 GB memory, 12 JBOD disks, 10 Gbps network, data disk cache enabled

YCSB Workload A configuration:
- YCSB: region heap size 6 GB, presplit 100, record count 1M, threads = 8, 4 region servers
- HDFS: 1 metadata replica, 3 data replicas, 48 data disks
- GPFS: 3 metadata replicas, 3 data replicas, page pool 32 GB, system pool 256 KB block size, data pool 2 MB block size, data block group factor 128, 48 data disks
- HBase: major compaction disabled; minor compaction threshold 3, server threads 50/node

Similar results for YCSB Workloads B-F

Excellent Performance


Locality Awareness: Make applications aware of where data is located

- The scheduler assigns tasks to the nodes where data chunks reside, so it needs to know the location of each chunk
- A GPFS API maps each chunk to its node location
- The map is compacted to optimize its size (see the sketch after the figure notes below)

[Figure: a GPFS connector passes locality information to the task engine, which reads the location map (e.g., chunk D1 -> node N1); the scheduler then schedules F(D1) on N1]

[Chart: Grep application runtime in minutes, with vs. without locality]
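The scheduling loop below is a minimal Python sketch of how such a location map could be consumed; get_chunk_locations(), the node names, and the tie-breaking policy are illustrative stand-ins, not the actual GPFS connector API.

```python
# Minimal sketch of locality-aware scheduling, assuming a hypothetical
# get_chunk_locations() stand-in for the GPFS connector.

def get_chunk_locations(path):
    """Stand-in for the connector call: chunk id -> nodes holding that chunk."""
    return {"D1": ["N1", "N3"], "D2": ["N2"], "D3": ["N3", "N4"]}

def compact_location_map(chunk_locations):
    """Invert to node -> chunks, the form a task scheduler actually consumes."""
    node_to_chunks = {}
    for chunk_id, nodes in chunk_locations.items():
        for node in nodes:
            node_to_chunks.setdefault(node, []).append(chunk_id)
    return node_to_chunks

def schedule(path, free_nodes):
    """Run F(chunk) on a node that holds the chunk locally whenever possible."""
    plan = []
    for chunk_id, nodes in get_chunk_locations(path).items():
        local = [n for n in nodes if n in free_nodes]
        plan.append((chunk_id, local[0] if local else sorted(free_nodes)[0]))
    return plan

print(compact_location_map(get_chunk_locations("/gpfs/input")))
print(schedule("/gpfs/input", {"N1", "N2", "N4"}))
# -> [('D1', 'N1'), ('D2', 'N2'), ('D3', 'N4')]
```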


Chunks: allow applications to define their own logical block size

- Workloads make two types of accesses:
  - Small blocks (e.g., 512 KB) for index file lookups
  - Large blocks (e.g., 64 MB) for file scans
- Solution: multiple block sizes in the same file system
- New allocation scheme (see the sketch after the chart note below):
  - A block group factor is set per block size
  - Effective block size = block size * block group factor
  - File blocks are laid out based on the effective block size

[Figure: under the new allocation policy, an application "chunk" maps to a group of FS blocks]

[Chart: Grep runtime in minutes for (metablock factor, block size) pairs (512, 256 KB), (128, 1 MB), (32, 4 MB), (8, 16 MB), (2, 64 MB), all giving an effective block size of 128 MB; caching and pre-fetching effects appear at the larger block sizes]
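As a quick check of the formula above, a small Python sketch; the values come from the chart annotations and the benchmark configuration listed earlier.

```python
# Effective block size = FS block size * block group (metablock) factor.
KB, MB = 1024, 1024 * 1024

def effective_block_size(block_size, block_group_factor):
    return block_size * block_group_factor

# Every (metablock factor, block size) pair on the grep chart resolves to the
# same 128 MB effective block size:
for factor, size in [(512, 256 * KB), (128, 1 * MB), (32, 4 * MB),
                     (8, 16 * MB), (2, 64 * MB)]:
    assert effective_block_size(size, factor) == 128 * MB

# The earlier benchmark configuration (data pool 2 MB blocks, factor 128)
# gives 256 MB chunks:
print(effective_block_size(2 * MB, 128) // MB, "MB")  # -> 256 MB
```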


Write Affinity: Allow applications to dictate layout

- Wide striping with multiple replicas:
  - No locality of data on a particular node
  - Bad for commodity architectures due to excessive network traffic
- Write affinity for a file or a set of files on a given node:
  - All relevant files are allocated on the node, near the node, or in the same rack
- Configurability of replicas (see the sketch after the figure notes below):
  - Each replica can be configured for either wide striping or write affinity

[Figure: reading File 1 under three layouts across nodes N1-N4: a wide-striped layout reads over local disk + network; a write-affinity layout reads over only the local disk; write affinity combined with wide-striped replicas reads over only the local disk and allows parallel recovery from the wide-striped replica]

[Chart: TeraGen runtime in minutes, write affinity vs. wide striping]
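A minimal sketch of the per-replica choice, assuming a toy cluster model; in GPFS-FPO the choice is a configuration property, and the node-selection logic here is purely illustrative.

```python
# Toy model: each replica of a block is either written to the writing node
# (write affinity) or wide-striped across the rest of the cluster.
import random

def place_replicas(writer, nodes, affinity_per_replica=(True, False, False)):
    remote = [n for n in nodes if n != writer]
    placement = []
    for affine in affinity_per_replica:
        # write affinity -> local disk, so later reads stay off the network;
        # wide striping  -> some other node, usable for recovery and rebalancing
        placement.append(writer if affine else random.choice(remote))
    return placement

print(place_replicas("N1", ["N1", "N2", "N3", "N4"]))
# e.g. ['N1', 'N3', 'N2']  (first copy local, the other two spread out)
```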


- Observation:
  - Data is suited for GPFS-FPO-style allocation
  - Metadata is suited for regular GPFS allocation
- Make the allocation type a storage pool property:
  - Metadata pool: no write affinity
  - Data pools: write affinity
- GPFS-FPO has distributed metadata:
  - No single point of failure, but too many metadata nodes increase the probability of any 3 metadata nodes going down -> loss of data (assumes replication = 3; see the sketch after the figure note below)
- Typical metadata profile in GPFS-FPO:
  - One or two metadata nodes per rack (failure domain)
  - Random access patterns

Hybrid allocation: treat metadata and data differently

[Figure: racks 1-3, with metadata allocated without write affinity and data allocated with write affinity]
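The "too many metadata nodes" point can be made concrete with a rough model; this Python sketch assumes independent node failures with an illustrative per-node failure probability, which is my simplification rather than anything stated on the slide.

```python
# With replication = 3, a metadata block is lost only if all 3 nodes holding
# its replicas are down at once. When metadata is spread across many nodes,
# essentially every 3-node combination holds some block, so loss occurs
# whenever ANY 3 nodes are down simultaneously -- an event that gets more
# likely as the number of metadata nodes grows.
from math import comb

def p_at_least_three_down(n_nodes, p):
    """P(>= 3 of n_nodes down), assuming independent failures with prob. p."""
    return sum(comb(n_nodes, k) * p**k * (1 - p)**(n_nodes - k)
               for k in range(3, n_nodes + 1))

print(p_at_least_three_down(6, 0.01))     # ~2e-05  (one or two metadata nodes per rack)
print(p_at_least_three_down(1000, 0.01))  # ~1.0    (metadata spread over every node)
```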


Pipelined Replication: Efficient replication of data

[Figure: a write pipeline from the client through N1 -> N2 -> N3, with blocks B0, B1, B2 spread across disks 1-5 on nodes 1-5]

[Chart: TeraGen runtime in minutes, single-source vs. pipelined replication]

- Higher degree of replication
- Pipelining for lower latency
- A single source does not utilize the bandwidth effectively (see the model below)
- Pipelining is able to saturate a 1 Gbps Ethernet link
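The bandwidth argument can be sketched with a trivial model; the uniform-link assumption is mine, not from the slide.

```python
# Why single-source replication leaves bandwidth on the table: the writer's
# uplink must carry one copy per replica, while a pipeline (client -> N1 ->
# N2 -> N3) forwards each block hop by hop, so every link carries it once.
def ingest_rate_gbps(link_gbps, replicas, pipelined):
    if pipelined:
        return link_gbps          # steady state: limited by a single link
    return link_gbps / replicas   # the writer pushes `replicas` copies itself

print(ingest_rate_gbps(1.0, 3, pipelined=True))   # 1.0   -> saturates 1 Gbps Ethernet
print(ingest_rate_gbps(1.0, 3, pipelined=False))  # ~0.33 -> a third of the link
```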


Fast Recovery: Recover quickly from disk failures

- Handling failures is policy driven:
  - Incorporated node-failure and disk-failure triggers
  - Policy-based recovery: restripe on a node or disk failure, or alternatively rebuild the disk when it is replaced
- Fast recovery:
  - Incremental recovery: keep track of changes during the failure and recover only what is needed (see the sketch after the figure notes below)
  - Distributed restripe: the restripe load is spread out over all the nodes, which quickly figure out which blocks need to be recovered when a node fails

[Figure: a well-replicated file on N1-N3; while N2 is rebooted, new writes go to N1 and N3; when N2 comes back, it catches up by copying the missed writes (deltas) from N1 and N3 in parallel]

[Chart: Grep runtime in minutes vs. number of failed nodes (0, 1, 2)]
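A minimal Python sketch of the incremental catch-up step; the dirty-block log, node names, and transfer stub are illustrative, not GPFS internals.

```python
# While a node is offline, record which blocks it missed and which surviving
# replica holds a fresh copy; when it returns, copy only those deltas,
# pulling from the source nodes in parallel.
from concurrent.futures import ThreadPoolExecutor

missed = {}  # block_id -> surviving node holding a fresh replica

def record_missed_write(block_id, surviving_node):
    missed[block_id] = surviving_node

def fetch_block(block_id, source_node):
    return (block_id, source_node)   # placeholder for the real block transfer

def catch_up(max_parallel=8):
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        done = list(pool.map(lambda item: fetch_block(*item), missed.items()))
    missed.clear()
    return done

record_missed_write("blk-7", "N1")
record_missed_write("blk-9", "N3")
print(catch_up())  # [('blk-7', 'N1'), ('blk-9', 'N3')]
```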


- Robust: GPFS-FPO - no single point of failure; HDFS - NameNode vulnerability; MapR - no single point of failure
- Data integrity: GPFS-FPO - high; HDFS - evidence of data loss; MapR - new file system
- Scale: GPFS-FPO - thousands of nodes; HDFS - thousands of nodes; MapR - hundreds of nodes
- POSIX compliance: GPFS-FPO - full, supports a wide range of applications; HDFS - limited; MapR - limited
- Data management: GPFS-FPO - security, backup, replication; HDFS - limited; MapR - limited
- MapReduce performance: good for all three
- Workload isolation: GPFS-FPO - supports disk isolation; HDFS - no support; MapR - limited support
- Traditional application performance: GPFS-FPO - good; HDFS - poor with random reads and writes; MapR - limited

Comparison with HDFS and MapR


- 4Q12: GA
- 1H13:
  - WAN caching
  - Advanced write affinity
  - KVM support
- 1Q14:
  - Integrate with Platform Computing scheduler: DirectShuffle
  - Encryption
  - Disaster recovery

Roadmap


http://almaden.ibm.com