putting wings on the elephant

Putting Wings on the Elephant Pritam Damania Facebook, Inc.

Upload: hadoop-summit

Post on 10-May-2015


Page 1: Putting Wings on the Elephant

Putting Wings on the Elephant

Pritam Damania

Facebook, Inc.

Page 3: Putting Wings on the Elephant

Putting wings on the Elephant!

Pritam Damania

Software Engineer

April 2, 2014

Page 4: Putting Wings on the Elephant

Agenda

1. Background

2. Major Issues in I/O path

3. Read Improvements

4. Write Improvements

5. Lessons learnt

Page 5: Putting Wings on the Elephant

High level Messages Architecture

[Diagram: Application Server and HBase, with labels Message, Write, and Ack]

Page 6: Putting Wings on the Elephant

HBase Cluster Physical Layout

▪ Multiple clusters/cells for messaging

▪ 20 servers/rack; 5 or more racks per cluster

Rack #1: ZooKeeper Peer + HDFS Namenode; 19x Region Server + Data Node + Task Tracker

Rack #2: ZooKeeper Peer + Standby Namenode; 19x Region Server + Data Node + Task Tracker

Rack #3: ZooKeeper Peer + Job Tracker; 19x Region Server + Data Node + Task Tracker

Rack #4: ZooKeeper Peer + HBase Master; 19x Region Server + Data Node + Task Tracker

Rack #5: ZooKeeper Peer + Backup HBase Master; 19x Region Server + Data Node + Task Tracker

Page 7: Putting Wings on the Elephant

Write Path Overview

[Diagram: RegionServer with Memstore; Write Ahead Log and HFiles stored in HDFS]

Page 8: Putting Wings on the Elephant

HDFS Write Pipeline

[Diagram: the Regionserver sends 64k packets through a pipeline of three Datanodes, each with OS page cache and Disk; an Ack is returned]

Page 9: Putting Wings on the Elephant

Read Path Overview

[Diagram: a Get reaches the RegionServer, which holds the Memstore; HFiles are stored in HDFS]

Page 10: Putting Wings on the Elephant

Problems in R/W Path

• Skewed disk usage

• High disk IOPS

• High p99 latency for reads and writes

Page 11: Putting Wings on the Elephant

Improvements in Read Path

Page 12: Putting Wings on the Elephant

Disk Skew

[Diagram: three Datanodes, each with OS page cache and Disk]

• HDFS block size: 256MB

• HDFS block resides on single disk

• Fsync of 256MB hitting single disk

Page 13: Putting Wings on the Elephant

Disk Skew - Sync File Range

[Diagram: Block File Written on Linux FileSystem as a sequence of 64k packets, with sync_file_range every 1MB and an fsync at the end]

▪ sync_file_range(SYNC_FILE_RANGE_WRITE)

▪ Initiates Async write
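The pattern on this slide can be sketched roughly as follows. This is a minimal illustration, not the HDFS datanode code: the function names are hypothetical, the call goes through ctypes to the Linux libc wrapper, and the hint is simply skipped on platforms where sync_file_range is not available.

```python
import ctypes
import os

SYNC_FILE_RANGE_WRITE = 2      # flag value from <linux/fs.h>
PACKET_SIZE = 64 * 1024        # packet size shown on the slide
SYNC_INTERVAL = 1024 * 1024    # "sync_file_range every 1MB"


def _load_sync_file_range():
    """Return the libc sync_file_range function, or None if unavailable."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        func = libc.sync_file_range
    except (OSError, AttributeError, TypeError):
        return None
    func.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    func.restype = ctypes.c_int
    return func


def write_block(path, packets):
    """Write packets to path, requesting writeback after every SYNC_INTERVAL bytes.

    Returns the number of points at which a writeback hint was due.
    """
    sync_file_range = _load_sync_file_range()
    hints = 0
    written = 0
    last_synced = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for packet in packets:
            written += os.write(fd, packet)
            if written - last_synced >= SYNC_INTERVAL:
                if sync_file_range is not None:
                    # Starts writeback of the newly written range without
                    # waiting for it; the return value is ignored because
                    # this is only a hint.
                    sync_file_range(fd, last_synced, written - last_synced,
                                    SYNC_FILE_RANGE_WRITE)
                last_synced = written
                hints += 1
        os.fsync(fd)  # one final flush when the block file is complete
    finally:
        os.close(fd)
    return hints
```

Because dirty data is pushed out roughly 1MB at a time, the final fsync should have far less left to flush than a full 256MB block.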

Page 14: Putting Wings on the Elephant

High IOPS

• Messages workload is random read

• Small preads (~4KB) on datanodes

• Two iops for each pread

[Diagram: a pread on the Datanode reads the checksum from the Checksum file and reads the data from the Block file]

Page 15: Putting Wings on the Elephant

High IOPS - Inline Checksums

[Diagram: HDFS Block laid out as repeated 4096 byte Data Chunk followed by 4 byte Checksum]

• Checksums inline with data

• Single iop for pread
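A toy version of the inline layout might look like the sketch below. The chunk and checksum sizes come from the slide; the use of CRC32, the byte order, and the function names are assumptions made for illustration and are not taken from the HDFS code.

```python
import zlib

CHUNK_SIZE = 4096      # data bytes per chunk, from the slide
CHECKSUM_SIZE = 4      # checksum bytes per chunk, from the slide
STRIDE = CHUNK_SIZE + CHECKSUM_SIZE


def encode_block(data):
    """Lay out data as [chunk][checksum][chunk][checksum]... in one byte string."""
    out = bytearray()
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        out += chunk
        # CRC32 is an assumed checksum; the slide only says "4 byte Checksum".
        out += zlib.crc32(chunk).to_bytes(CHECKSUM_SIZE, "big")
    return bytes(out)


def read_chunk(block, chunk_index):
    """Fetch a chunk and its checksum with one contiguous read, then verify."""
    start = chunk_index * STRIDE
    raw = block[start:start + STRIDE]
    chunk, stored = raw[:-CHECKSUM_SIZE], raw[-CHECKSUM_SIZE:]
    if zlib.crc32(chunk).to_bytes(CHECKSUM_SIZE, "big") != stored:
        raise ValueError("checksum mismatch in chunk %d" % chunk_index)
    return chunk
```

Since the checksum sits right after its data, a small pread covers both with one contiguous range instead of touching a block file and a separate checksum file.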

Page 16: Putting Wings on the Elephant

High IOPS - Results

[Graphs: No. of Put and Get above one second; Put avg time; Get avg time]

Page 17: Putting Wings on the Elephant

HBase Locality - HDFS Favored Nodes

▪ Each region’s data on 3 specific datanodes

▪ On failure locality preserved

▪ Favored nodes persisted at HBase layer

[Diagram: RegionServer and Local Datanode]

Page 18: Putting Wings on the Elephant

Hbase Locality - Solution

• Persisting info in NameNode complicated

• Region Directory:

▪ /*HBASE/<tablename>/<regionname>/cf1/…

▪ /*HBASE/<tablename>/<regionname>/cf2/…

• Build Histogram of locations in directory

• Pick lowest frequency to delete

[Chart: histogram of block locations across Datanodes D1 to D6; axis ticks 0, 4000, 8000]
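One reading of the histogram step, as a small sketch: count how often each datanode appears among the blocks under a region directory, then, when a replica of some block has to be removed, drop the one on the least common datanode. The data structures and function names here are hypothetical; the slides do not show how the NameNode represents this.

```python
from collections import Counter


def location_histogram(block_locations):
    """Count how many blocks in a region directory each datanode holds.

    block_locations maps a block id to the list of datanodes with a replica.
    """
    histogram = Counter()
    for datanodes in block_locations.values():
        histogram.update(datanodes)
    return histogram


def replica_to_delete(replicas, histogram):
    """Among one block's replicas, pick the datanode rarest in the directory."""
    # Ties are broken by datanode name so the choice is deterministic.
    return min(replicas, key=lambda node: (histogram[node], node))
```

With most blocks on D1, D2 and D3 and a stray extra replica on D5, the histogram points at D5 as the one to delete, which keeps the region's data on its usual three datanodes.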

Page 19: Putting Wings on the Elephant

More Improvements

• Keep fds open

• Throttle re-replication

Page 20: Putting Wings on the Elephant

Improvements in Write Path

Page 21: Putting Wings on the Elephant

Hbase WAL

[Diagram: the Regionserver writes to three Datanodes, each with OS page cache and Disk]

• Packets never hit disk

• > 1s outliers!

Page 22: Putting Wings on the Elephant

Instrumentation

1. Write to OS cache

2. Write to TCP buffers

3. sync_file_range(SYNC_FILE_RANGE_WRITE)

Steps 1 and 3 showed outliers > 1s!
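Per-step timing of this kind can be approximated with a small helper like the one below. It is only a sketch: the class and step names are made up for illustration, and the slides do not describe how the instrumentation was actually built.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

OUTLIER_SECONDS = 1.0   # the slides flag calls slower than one second


class StepTimer:
    """Record how long each named step takes and count slow calls."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.samples = defaultdict(list)

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised.
            self.samples[name].append(self._clock() - start)

    def outliers(self, threshold=OUTLIER_SECONDS):
        """Return, per step, how many samples exceeded the threshold."""
        return {name: sum(1 for s in durations if s > threshold)
                for name, durations in self.samples.items()}
```

Wrapping each of the three steps in `with timer.step("..."):` gives a per-step outlier count, which is enough to tell which step the slow calls come from.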

Page 23: Putting Wings on the Elephant

Use of strace

Page 24: Putting Wings on the Elephant

Interesting Observations

• write(2) outliers correlated with busy disk

• Reproducible by artificially stressing disk

dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000

Page 25: Putting Wings on the Elephant

Test Program

File Written on Linux FileSystem

▪ 64k writes, sync_file_range every 1MB: No Outliers!

▪ Alternating 63k and 1k writes, sync_file_range every 1MB: Outliers Reproduced!

Page 26: Putting Wings on the Elephant

Some suspects

• Too many dirty pages

• Linux stable pages

• Kernel trace points revealed stable pages to be the culprit

Page 27: Putting Wings on the Elephant

Stable Pages

[Diagram: WriteBack of an OS page to a Persistent Store (Device with Integrity Checking), with a Kernel Checksum and a Device Checksum]

• Checksum Error

• Solution – Lock pages under writeback

Page 28: Putting Wings on the Elephant

Explanation of Write Outliers

[Diagram: a WAL write to a 4k OS Page is blocked while WriteBack (sync_file_range) of that page to the Persistent Store is in progress]
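A rough way to see why write sizes matter here: an append that starts in the middle of a 4k page has to modify a page the previous write already dirtied, and that page may be under writeback. The sketch below only counts such appends. It is an illustrative calculation under that assumption, with a hypothetical function name, and not a reproduction of the kernel behaviour.

```python
PAGE_SIZE = 4096   # "OS Page 4k" on the slide


def writes_touching_dirty_page(write_sizes, page_size=PAGE_SIZE):
    """Count appends that start in the middle of a page.

    Such a write modifies a page that the previous write already dirtied,
    so it can stall if that page is locked for writeback.
    """
    offset = 0
    count = 0
    for size in write_sizes:
        if offset % page_size != 0:
            count += 1
        offset += size
    return count
```

For sixteen 64k appends the count is zero, since every write starts on a page boundary. For sixteen rounds of 63k followed by 1k, every 1k append starts mid-page, which lines up with the pattern in which the test program reproduced the outliers.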

Page 29: Putting Wings on the Elephant

Solution ?

Patch : http://thread.gmane.org/gmane.comp.file-systems.ext4/35561

Page 30: Putting Wings on the Elephant

sync_file_range ?

• sync_file_range not async for > 128 write requests

• Solution – Use threadpool
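The threadpool idea can be sketched as below: the writer hands each range to a pool and returns immediately, so a sync call that blocks only stalls a worker thread. The class name, the worker count, and the caller-supplied sync_range callable are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncRangeSyncer:
    """Run range syncs on worker threads so the write path never waits."""

    def __init__(self, sync_range, workers=4):
        # sync_range(fd, offset, nbytes) is supplied by the caller, for
        # example a wrapper around sync_file_range; worker count is arbitrary.
        self._sync_range = sync_range
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, fd, offset, nbytes):
        """Queue the sync and return a Future without blocking."""
        return self._pool.submit(self._sync_range, fd, offset, nbytes)

    def close(self):
        """Wait for queued syncs to finish and release the threads."""
        self._pool.shutdown(wait=True)
```

The writer keeps appending packets after calling submit, and the returned Future can be checked later if the result of the sync matters.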

Page 31: Putting Wings on the Elephant

Results

P99 Write latency to OS cache (in ms)

Page 32: Putting Wings on the Elephant

Per request profiling

• Entire profile of client requests

• Full profile of pipeline write

• Full profile of pread

• Lot of visibility!

Page 33: Putting Wings on the Elephant

Interesting Profiles

• In-memory operations > 1s

• No Java GC

• Correlated with busy root disk

• Reproducible by stressing root disk

Page 34: Putting Wings on the Elephant

Investigation

• Use lsof

• /tmp/hsperfdata_hadoop/<pid> suspicious

• Disable using -XX:-UsePerfData

• Stalls disappeared !

• -XX:-UsePerfData breaks jps, jstack

• Mount /tmp/hsperfdata_hadoop/ on tmpfs

Page 35: Putting Wings on the Elephant

Result

p99 WAL write latency(in ms)

Page 37: Putting Wings on the Elephant

Lessons learnt

• Instrumentation is key

• Per request profiling is very useful

• Understanding of Linux kernel and fs is important

Page 38: Putting Wings on the Elephant

Acknowledgements

▪ Hairong Kuang

▪ Siying Dong

▪ Kumar Sundararajan

▪ Binu John

▪ Dikang Gu

▪ Paul Tuckfield

▪ Arjen Roodselaar

▪ Matthew Byng-Maddick

▪ Liyin Tang

Page 39: Putting Wings on the Elephant

FB Hadoop code

• https://github.com/facebook/hadoop-20

Page 40: Putting Wings on the Elephant

Questions ?

Page 41: Putting Wings on the Elephant

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
