
Page 1: Gfarm Fs Tatebe Tip2004

Joint Techs Workshop, TIP 2004, Jan 28, 2004, Honolulu, Hawaii

National Institute of Advanced Industrial Science and Technology

Trans-Pacific Grid Datafarm

Osamu Tatebe, Grid Technology Research Center, AIST

On behalf of the Grid Datafarm Project

Page 2: Gfarm Fs Tatebe Tip2004


Key points of this talk

Trans-Pacific Grid file system and testbed

70 TBytes disk capacity, 13 GB/sec disk I/O performance

Trans-Pacific file replication [SC2003 Bandwidth Challenge]

1.5 TB of data transferred in an hour

Multiple high-speed Trans-Pacific networks: APAN/TransPAC (2.4 Gbps OC-48 POS, 500 Mbps OC-12 ATM) and SuperSINET (2.4 Gbps x 2, 1 Gbps available), giving the 3.9 Gbps theoretical peak (2.4 + 0.5 + 1.0)

6,000 miles

Stable 3.79 Gbps out of the theoretical peak of 3.9 Gbps (97%) using 11 node pairs (MTU 6000B)

We won the "Distributed Infrastructure" award!

Page 3: Gfarm Fs Tatebe Tip2004


[Background] Petascale Data Intensive Computing

[Photos: detector for the ALICE experiment; detector for the LHCb experiment]

High Energy Physics (CERN LHC, KEK Belle): ~MB/collision, 100 collisions/sec, ~PB/year; 2,000 physicists in 35 countries

Astronomical Data Analysis (SUBARU telescope): data analysis of the whole archive, TB~PB/year/telescope

10 GB/night, 3 TB/year

Page 4: Gfarm Fs Tatebe Tip2004


[Background 2] Large-scale File Sharing

P2P – an exclusive and special-purpose approach

Napster, Gnutella, Freenet, . . .

Grid technology – file transfer, metadata management

GridFTP, Replica Location Service

Storage Resource Broker (SRB)

Large-scale file system – a general approach

Legion, Avaki [Grid, no replica management]

Grid Datafarm [Grid]

Farsite, OceanStore [P2P]

AFS, DFS, . . .

Page 5: Gfarm Fs Tatebe Tip2004


Goals and features of Grid Datafarm

Goal

Dependable data sharing among multiple organizations

High-speed data access and high-speed data processing

Grid Datafarm: Grid File System – a global, dependable virtual file system

Integrates CPU + storage

Parallel & distributed data processing

Features

Secured based on the Grid Security Infrastructure

Scalable with data size and usage scenarios

Location-transparent data access

Automatic and transparent replica access for fault tolerance

High-performance data access and processing by accessing multiple dispersed storages in parallel (file-affinity scheduling)

Page 6: Gfarm Fs Tatebe Tip2004


Grid Datafarm (1): Gfarm file system – World-wide virtual file system [CCGrid 2002]

Transparent access to dispersed file data in a Grid

POSIX I/O APIs, and native Gfarm APIs for extended file view semantics and replication

Maps the virtual directory tree to physical files

Automatic and transparent replica access for fault tolerance and access-concentration avoidance

[Figure: Gfarm File System. A virtual directory tree (/grid containing ggf and jp; jp → aist → gtrc; files file1–file4) is mapped by file system metadata onto physical file locations across file system nodes, with file replica creation placing copies of file1 and file2 on multiple nodes.]
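To make the mapping concrete, here is a minimal sketch (hypothetical names and data structures, not the actual Gfarm metadata schema or API) of how a metadata lookup plus replica selection yields both location transparency and failover:

```python
# Minimal sketch of the virtual-path -> physical-replica mapping
# described above. Names and structures are illustrative only.
import random

# File system metadata: virtual path -> replica locations (host, physical path)
metadata = {
    "/grid/jp/aist/gtrc/file1": [("node-a.aist.example", "/spool/0001"),
                                 ("node-b.aist.example", "/spool/0001")],
    "/grid/jp/aist/gtrc/file2": [("node-c.aist.example", "/spool/0002")],
}

def open_replica(virtual_path, is_alive):
    """Resolve a virtual path and pick any reachable replica.

    Trying replicas in random order both spreads load and gives
    transparent failover when a node is down."""
    replicas = metadata[virtual_path]
    for host, phys in random.sample(replicas, len(replicas)):
        if is_alive(host):
            return host, phys
    raise IOError("no live replica for " + virtual_path)

# Example: node-a is down, so file1 is served from its replica on node-b.
host, phys = open_replica("/grid/jp/aist/gtrc/file1",
                          is_alive=lambda h: h != "node-a.aist.example")
print(host, phys)
```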

Page 7: Gfarm Fs Tatebe Tip2004


Grid Datafarm (2): High-performance data access and processing support [CCGrid 2002]

World-wide parallel and distributed processing

An aggregate of files = a superfile

Data processing of a superfile = parallel and distributed data processing of its member files

Local file view (SPMD parallel file access)

File-affinity scheduling (“Owner computes”)

[Figure: a superfile holding one year of astronomical archival data in the Grid File System is processed by 365 parallel analyses on virtual CPUs – world-wide parallel & distributed processing.]
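A toy sketch of file-affinity ("owner computes") scheduling as described above, using hypothetical names rather than the real Gfarm scheduler API: each member file of the superfile is processed on a node that already stores it, so computation moves to the data.

```python
# Toy illustration of file-affinity ("owner computes") scheduling:
# each member file of a superfile runs its analysis on a node that
# already stores that file. Hypothetical names, not the Gfarm API.

# Superfile = set of member files; each maps to its owner node(s).
superfile = {f"night{i:03d}.fits": [f"node{i % 4}"] for i in range(365)}

def schedule(superfile):
    """Assign each member file to one of the nodes that own it."""
    plan = {}
    for fname, owners in superfile.items():
        plan.setdefault(owners[0], []).append(fname)  # owner computes
    return plan

plan = schedule(superfile)
for node, files in sorted(plan.items()):
    print(node, "processes", len(files), "files")  # ~365/4 files per node
```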

Page 8: Gfarm Fs Tatebe Tip2004


Transfer technology in long fat networks

Bandwidth and latency between the US and Japan

1~10 Gbps, 150~300 msec RTT

TCP acceleration

Adjustment of congestion window

Multiple TCP connections

HighSpeed TCP, Scalable TCP, FAST TCP

XCP (not TCP)

UDP-based acceleration

Tsunami, UDT, RBUDP, atou, . . .

Bandwidth prediction without packet loss
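The congestion-window adjustment listed above can be made concrete with a toy AIMD loop in the HighSpeed TCP style: the additive increase a(w) grows and the multiplicative decrease b(w) shrinks as the window w grows, so large windows recover from loss far faster than standard TCP. The table values below are illustrative only, not the actual RFC 3649 tables.

```python
# Toy HighSpeed-TCP-style congestion window dynamics.
# Illustrative a(w)/b(w) values; RFC 3649 defines the real tables.

# (window threshold, additive increase per RTT, multiplicative decrease)
TABLE = [(0, 1, 0.50), (38, 2, 0.40), (1000, 10, 0.25), (10000, 40, 0.12)]

def params(w):
    """Return (a, b) for the largest threshold not exceeding w."""
    a, b = 1, 0.5
    for thresh, ai, bi in TABLE:
        if w >= thresh:
            a, b = ai, bi
    return a, b

w = 100.0
for rtt in range(1000):
    a, b = params(w)
    w += a                   # additive increase each RTT
    if rtt % 200 == 199:     # simulate a loss every 200 RTTs
        w *= (1.0 - b)       # gentler decrease at large windows
print("final cwnd (segments):", int(w))
```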

Page 9: Gfarm Fs Tatebe Tip2004


Multiple TCP streams sometimes considered harmful . . .

Multiple TCP streams achieve good bandwidth, but they excessively congest the network; in fact, this would be “shooting oneself in the foot.”

[Figure: bandwidth (Mbps) vs. time over APAN/TransPAC LA–Tokyo (2.4 Gbps), 10 msec averages. Three TCP streams (TxBW0–TxBW2) and their total (TxTotal) oscillate wildly between roughly 0 and 2.8 Gbps: high oscillation, not stable, too much congestion. The streams compensate for each other, and the aggregate flow is too much for the network; the bandwidth needs to be limited appropriately.]

Page 10: Gfarm Fs Tatebe Tip2004


A programmable network testbed device GNET-1

Programmable hardware network testbed

WAN emulation: latency, bandwidth, packet loss, jitter, . . .

Precise measurement: bandwidth at 100 usec resolution; latency and jitter between two GNET-1s

General purpose, very flexible!

Large high-speed memory blocks

Page 11: Gfarm Fs Tatebe Tip2004


IFG-based pace control by GNET-1

[Figures: shaping by GNET-1 (700 Mbps x 3 over APAN LA–Tokyo, 2.4 Gbps). Bandwidth (Mbps) vs. time shows each of the three transmit streams (TxBW0–TxBW2) held flat at 700 Mbps. A receive-side plot shows two 700 Mbps streams (RxBW0, RxBW1) passing through a 1 Gbps GNET-1 bottleneck with flow control enabled: NO PACKET LOSS.]

GNET-1 provides:

Precise traffic pacing at any data rate by changing the IFG (Inter-Frame Gap)

A packet-loss-free network using a large (16 MB) input buffer
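The pacing arithmetic behind this is straightforward. A small sketch, assuming standard Ethernet framing (8-byte preamble, 12-byte minimum gap) and that only the gap is stretched; the exact overheads GNET-1 accounts for may differ:

```python
# Back-of-envelope IFG pacing: at a fixed line rate, each frame occupies
# preamble + frame + IFG bytes of wire time, so the achieved rate is
#   rate = frame / (preamble + frame + IFG) * line_rate.
# Solving for IFG gives the gap needed to pace at a target rate.
# (A sketch; the exact overheads accounted for by GNET-1 may differ.)

PREAMBLE = 8   # bytes, standard Ethernet preamble + SFD
MIN_IFG = 12   # bytes, standard minimum inter-frame gap

def ifg_for_rate(target_bps, line_bps=1e9, frame_bytes=6018):
    """Inter-frame gap (bytes) needed to pace frames down to target_bps."""
    ifg = frame_bytes * (line_bps / target_bps - 1.0) - PREAMBLE
    return max(ifg, MIN_IFG)

# Pace a GigE port to 700 Mbps with 6000-byte-MTU frames (~6018 B on wire):
print(round(ifg_for_rate(700e6)), "byte IFG")  # roughly 2571 bytes
```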

Page 12: Gfarm Fs Tatebe Tip2004


Summary of technologies for performance improvement

[Disk I/O performance] Grid Datafarm – a Grid file system with high-performance data-intensive computing support

A world-wide virtual file system that federates local file systems of multiple clusters

It provides scalable disk I/O performance for file replication over high-speed network links and for large-scale data-intensive applications

Trans-Pacific Grid Datafarm testbed: 5 clusters in Japan, 3 clusters in the US, and 1 cluster in Thailand provide 70 TBytes of disk capacity and 13 GB/sec of disk I/O performance

It supports file replication for fault tolerance and access-concentration avoidance

[World-wide high-speed network efficient utilization] GNET-1 – a gigabit network testbed device

Provides IFG-based precise rate-controlled flow at any rate

Enables stable and efficient Trans-Pacific network use with HighSpeed TCP

Page 13: Gfarm Fs Tatebe Tip2004


Trans-Pacific Grid Datafarm testbed: network and cluster configuration

[Network diagram: Japanese sites (AIST, Titech, Univ Tsukuba, KEK, NII, Maffin; Tsukuba WAN at 10 Gbps) reach the US via APAN Tokyo XP over APAN/TransPAC (2.4 Gbps POS to Los Angeles; 622 Mbps OC-12 ATM toward Chicago) and over SuperSINET (2.4 Gbps to New York, 1 Gbps available), then across Abilene to Indiana Univ, SDSC, and the SC2003 booth in Phoenix (5 Gbps); Kasetsart Univ, Thailand also participates. Clusters: 32 nodes / 23.3 TBytes / 2 GB/sec; 16 nodes / 11.7 TBytes / 1 GB/sec; 16 nodes / 11.7 TBytes / 1 GB/sec; 7 nodes / 3.7 TBytes / 200 MB/sec; 10 nodes / 1 TByte / 300 MB/sec; 147 nodes / 16 TBytes / 4 GB/sec. Measured rates: 2.34 Gbps (LA route), 950 Mbps (New York route), 500 Mbps (Chicago route). Trans-Pacific theoretical peak: 3.9 Gbps. Gfarm disk capacity: 70 TBytes; disk read/write: 13 GB/sec.]

Page 14: Gfarm Fs Tatebe Tip2004


Scientific Data for Bandwidth Challenge

Trans-Pacific file replication of scientific data

For transparent, high-performance, and fault-tolerant access

Astronomical Object Survey on Grid Datafarm [HPC Challenge participant]

World-wide data analysis on the whole archive

652 GBytes of data observed by the SUBARU telescope

N. Yamamoto (AIST)

Large configuration data from Lattice QCD

Three sets of hundreds of gluon field configurations on a 24^3*48 4-D space-time lattice (3 sets x 364.5 MB x 800 = 854.3 GB)

Generated by the CP-PACS parallel computer at the Center for Computational Physics, Univ. of Tsukuba (300 Gflops x years of CPU time) [Univ Tsukuba Booth]
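The quoted total works out if "GB" is read as 1024 MB; a quick check:

```python
# Quick check of the lattice QCD data volume quoted above.
sets, configs, mb_per_config = 3, 800, 364.5
total_mb = sets * configs * mb_per_config   # 874800 MB
print(total_mb / 1024, "GB")                # ~854.3 GB (taking 1 GB = 1024 MB)
```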

Page 15: Gfarm Fs Tatebe Tip2004


Network bandwidth on the APAN/TransPAC LA route

[Figure: bandwidth (Gbps) over time, no pacing vs. pacing at 2.3 Gbps (900 + 900 + 500 Mbps). Setup: PC clusters behind switches at each end, a Force10 E600 and a Juniper M20 router with GNET-1 pacing, LA–Tokyo over the 2.4 Gbps APAN/TransPAC link (10G and 3G access links); RTT 141 ms. Result: a stable transfer rate of 2.3 Gbps.]

Page 16: Gfarm Fs Tatebe Tip2004


APAN/TransPAC LA route (1)

Page 17: Gfarm Fs Tatebe Tip2004


APAN/TransPAC LA route (2)

Page 18: Gfarm Fs Tatebe Tip2004


APAN/TransPAC LA route (3)

Page 19: Gfarm Fs Tatebe Tip2004


File replication between Japan and the US (network configuration)

[Network diagram: PC clusters in Tokyo and Tsukuba, behind switches, a Force10 E600, a Juniper M20, and GNET-1 pacing devices, replicate files to the SC2003 booth in Phoenix over three routes: APAN/TransPAC via Los Angeles (10G / 2.4G / 3G links, RTT 141 ms), APAN/TransPAC via Chicago (500 Mbps / 1 Gbps), and SuperSINET via New York and Abilene (2.4 Gbps with 1 Gbps available / 1 Gbps); the other two routes have RTTs of 285 ms and 250 ms.]

Page 20: Gfarm Fs Tatebe Tip2004


File replication performance between Japan and the US (total)

Page 21: Gfarm Fs Tatebe Tip2004


APAN/TransPAC Chicago

Pacing at 500 Mbps, quite stable

Page 22: Gfarm Fs Tatebe Tip2004


APAN/TransPAC LA (1)

After re-pacing from 800 to 780 Mbps, quite stable

Page 23: Gfarm Fs Tatebe Tip2004


APAN/TransPAC LA (2)

After re-pacing of LA (1), quite stable

Page 24: Gfarm Fs Tatebe Tip2004


APAN/TransPAC LA (3)

After re-pacing of LA (1), quite stable

Page 25: Gfarm Fs Tatebe Tip2004


SuperSINET NYC: re-pacing from 930 to 950 Mbps

Page 26: Gfarm Fs Tatebe Tip2004


Summary

Efficient use near the peak rate in long fat networks

IFG-based precise pacing within the packet-loss-free bandwidth using GNET-1

-> packet-loss-free network

Stable network flow even with HighSpeed TCP

Disk I/O performance improvement

Parallel disk access using Gfarm

Trans-Pacific file replication performance: 3.79 Gbps out of the theoretical peak of 3.9 Gbps (97%) using 11 node pairs (MTU 6000B)

1.5 TB of data transferred in an hour
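A quick arithmetic aside (not from the slides): the aggregate rate is consistent with the per-pair limit described next.

```python
# Simple consistency check: 3.79 Gbps across 11 node pairs averages
# just under the ~400 Mbps per-pair limit imposed by the
# buffer-flush workaround described below.
print(3.79e9 / 11 / 1e6, "Mbps per node pair")  # ~344.5 Mbps
```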

Linux 2.4 kernel problem during file replication (transfer)

Network transfer stopped within a few minutes when the buffer cache was flushed to disk

A Linux kernel bug?

Defensive solution: set a very short interval for buffer-cache flushing (see the sketch below)

This limits the file transfer rate to 400 Mbps for one node pair
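A minimal sketch of this workaround, assuming the Linux 2.4 /proc/sys/vm/bdflush interface; the field index and values below are placeholders, since the exact tuple layout varies across 2.4 releases and the slides do not give the settings actually used.

```python
# Sketch of the defensive workaround: shorten the kernel's buffer-cache
# flush interval so large flushes never stall network transfers for long.
# On Linux 2.4 this is tuned via /proc/sys/vm/bdflush, a whitespace-
# separated tuple of integers; WHICH field is the flush interval (in
# jiffies) depends on the 2.4 release, so treat the index and values
# below as placeholders, not the experiment's real settings.

BDFLUSH = "/proc/sys/vm/bdflush"
INTERVAL_FIELD = 4      # placeholder index of the interval field
SHORT_INTERVAL = 100    # placeholder: 100 jiffies ~ 1 second

def shorten_flush_interval():
    with open(BDFLUSH) as f:
        fields = f.read().split()
    fields[INTERVAL_FIELD] = str(SHORT_INTERVAL)
    with open(BDFLUSH, "w") as f:
        f.write(" ".join(fields))

if __name__ == "__main__":
    shorten_flush_interval()  # requires root on a 2.4 kernel
```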

Successful Trans-Pacific-scale data analysis . . . but a scalability problem with the LDAP server used as the metadata server

Further improvement needed

Page 27: Gfarm Fs Tatebe Tip2004


Future work

Standardization effort with the GGF Grid File System WG

Foster (world-wide) storage sharing and integration

Dependable data sharing and high-performance data access among several organizations

Application areas

High energy physics experiment

Astronomical data analysis

Bioinformatics, . . .

Dependable data processing in eGovernment and eCommerce

Other applications that need dependable file sharing among several organizations

Page 28: Gfarm Fs Tatebe Tip2004


Special thanks to

Hirotaka Ogawa, Yuetsu Kodama, Tomohiro Kudoh, Satoshi Sekiguchi (AIST), Satoshi Matsuoka, Kento Aida (Titech), Taisuke Boku, Mitsuhisa Sato (Univ Tsukuba), Youhei Morita (KEK), Yoshinori Kitatsuji (APAN Tokyo XP), Jim Williams, John Hicks (TransPAC/Indiana Univ)

Eguchi Hisashi (Maffin), Kazunori Konishi, Jin Tanaka, Yoshitaka Hattori (APAN), Jun Matsukata (NII), Chris Robb (Abilene)

Tsukuba WAN NOC team, APAN NOC team, NII SuperSINET NOC team

Force10 Networks

PRAGMA, ApGrid, SDSC, Indiana University, Kasetsart University