gtc japan 2014
Post on 25-Jan-2015
637 Views
Preview:
DESCRIPTION
TRANSCRIPT
(')��!� YLOa�$)b¬JS��ßÿzGuã«�
éÖDĝD�³Ì��¯D�¯ð�ÉËÀs�uG�
��(���,9,7� ����� ����������
������������������> �����Ì�����$��ĉȧ���$��Ďȧ�����Ç�� ��������
�������������78/0;�����:,.5;���$)���7<06�+087�+��������.8:0;��C�����$)��"*�����(0;6,�� �+�C���!�!����������&������!�A��''���������C� ��
����������C�%�&��=66�-4;0.<487�#9<4.,6��7B74-,7/�
��������''��� ��(��������$��� =;<:0���$�'��(,90���$���'8:,20(05�' ����C� ��
TSUBAME2 System Overview 11PB (7PB HDD, 4PB Tape, 200TB SSD)
�
Y�
“Global'Work'Space”'#1�
SFA10k'#5�
“Global'Work'Space”'#2� “Global'Work'Space”'#3�
SFA10k'#4�SFA10k'#3�SFA10k'#2�SFA10k'#1�
/data0'� /work0� /work1'''''/gscr�
“cNFS/Clusterd'Samba'w/'GPFS”''
HOME�
System'applicaJon�
“NFS/CIFS/iSCSI'by'BlueARC”''
HOME�
iSCSI�
Infiniband'QDR'Networks�
SFA10k'#6�
GPFS#1� GPFS#2� GPFS#3� GPFS#4�
Parallel'File'System'Volumes�Home'Volumes�
QDR'IB(×4)'×'20� 10GbE'×'2�QDR'IB'(×4)'×'8�
1.2PB�3.6(PB�
/data1'�
''
''
Thin'nodes� 1408nodes'''(32nodes'x44'Racks)'
HP'Proliant'SL390s'G7'1408nodesyyyyyyyyyyyy�CPU:'Intel'Westmere`EP''2.93GHz'''''''''''6cores'×'2'='12cores/node'GPU:'NVIDIA'Tesla'K20X,'3GPUs/node'Mem:'54GB'(96GB)'SSD:''60GB'x'2'='120GB'(120GB'x'2'='240GB)y'
Medium'nodes�
HP'Proliant'DL580'G7'24nodes''CPU:'Intel'Nehalem`EX'2.0GHz''''''''''8cores'×'2'='32cores/node'GPU:'NVIDIA''Tesla'S1070,''''''''''''NextIO'vCORE'Express'2070'Mem:128GB'SSD:'120GB'x'4'='480GB'
''
Fat'nodes�
HP'Proliant'DL580'G7'10nodes'CPU:'Intel'Nehalem`EX'2.0GHz''''''''''8cores'×'2'='32cores/node''GPU:'NVIDIA'Tesla'S1070'Mem:'256GB'(512GB)'SSD:'120GB'x'4'='480GB'
¯¯¯¯¯¯�
yyCompu.ng(Nodes³Q17.1PFlops(SFP),(5.76PFlops(DFP),(224.69TFlops(CPU),(~100TB(MEM,(~200TB(SSD(
Interconnets:yFullKbisec.on(Op.cal(QDR(Infiniband(Network�
''
Voltaire'Grid'Director'4700''×12'IB'QDR:'324'ports'
Core'Switch'
''
Edge'Switch'
''
Edge'Switch'(/w'10GbE'ports)'
Voltaire'Grid'Director'4036'×179'IB'QDR':'36'ports'
Voltaire''Grid'Director'4036E'×6'IB'QDR:34ports'''10GbE:''2port'
12switches'
6switches'179switches'
2.4(PB(HDD(+((�4PB(Tape�
例��:,93�����3<<9�???2:,93���8:2� ý�ßÿj��ZĘěÓ¦Vq�l�bāÎRa�wjzGu¼O ��v�Gi�
! ºÜZÂă�TEPS(Traversed Edges Per Second) ! ÅûXá¬�¥�(Cybersecurity, Medical Informatics,
Social Networks, Data Enrichment, Symbolic Networks) ! µTZgG~�
! concurrent search(Breadth First Search : BFS) ! optimization (Single Source Shortest Path) ! edge-oriented (Maximal Independent Set)
! ý�ßÿj��[ZĈ¬ ! ÁƦ·bíK�Green Graph500 ��v�Gi
! http://green.graph500.org/
• ��9�6#��QKronecker'Graph'�������'<'(BFS)'!?'– ��.)�'16'(=m/n)'�I���4��ª¦±��.)'32'�+��ª¦���%}²'– ¤ª¨° QSCALE'�8|��ª¦B-�Q3)'2SCALE'',)'2SCALE'+'4'�0���'– �²SCALE30'���z10'3'172',�4��ª¦±+�³344',²'
• ¨�®¨©«����D±G�²¯J��N����|A/'– PG������z:¨©«��� �="�� A�{�'
Input parameters • SCALE • edgefactor (=16) �
Graph'GeneraJon�
Graph'ConstrucJon� BFS� ValidaJon� results �
64 iterations�
(')��!� �Z�ßÿzGuã«ÊĐWQUZÕÒ�• dis��GuZġĕ�
– "*�����(0;6,�� �+�� �Ě�
• �G|IS`Z���ġĕõN�äğ¢�ÅJ�– �����G|�C�������G|�
• �cw{x�Gè¸Z�=,6�&,46�%�&��7B74-,7/�– ��(-9;Z���esio����|Ā�
• ÞÃ×�G|Y''�bġĕ�– oqy�¨V ��(�ü§�
�• �ßÿq{�Gp�
– =;<:0W�$�'Zć¬�– �$�M_Xa���öô����$�M_XayG�öô�
�
(')��!� �Z�ßÿzGuã«ÊĐWQUZÕÒ�• dis��GuZġĕ�
– "*�����(0;6,�� �+�� �Ě�
• �G|IS`Z���ġĕõN�äğ¢�ÅJ�– �����G|�C�������G|�
• �cw{x�Gè¸Z�=,6�&,46�%�&��7B74-,7/�– ��(-9;Z���esio����|Ā�
• ÞÃ×�G|Y''�bġĕ�– oqy�¨V ��(�ü§�
�• �ßÿq{�Gp�
– =;<:0W�$�'Zć¬�– �$�M_Xa���öô����$�M_XayG�öô�
�
Ć¹Zq�l�ZÄđWò¾�• ĄĒ»Zď�¢È�F±ċìªF�Ĕ§ª�
– ��swnZ��vldªF�}Gldª�– dis��GuZÑî�
'
• ÍĠĊØZÝñFÅąčª�– ±Ĝ�·���FÏ¿®���ZĂ��
• � �'���$�!��'((�!&�!��&0&�!����!���0<.�
– zGuï°Zlq{��·ÓFåÚƦ�NÈ�'
F*H�
�$!�18
67!
>(L
)����B-�Q ��°ª¥«¡�
��D!
���¬�«�§
:M�
&@9CO�
�£°�� K�!�2�
�B-¢° I/O
Ć¹Zq�l�ZÄđWò¾�• ĄĒ»Zď�¢È�F±ċìªF�Ĕ§ª�
– ��swnZ��vldªF�}Gldª�– dis��GuZÑî�
'
• ÍĠĊØZÝñFÅąčª�– ±Ĝ�·���FÏ¿®���ZĂ��
• � �'���$�!��'((�!&�!��&0&�!����!���0<.�
– zGuï°Zlq{��·ÓFåÚƦ�NÈ�'
F*H�
�$!�18
67!
>(L
)����B-�Q ��°ª¥«¡�
��D!
���¬�«�§
:M�
&@9CO�
�£°�� K�!�2�
�B-¢° I/O
���}��¦£���� A~µ'�$!�18¯�K��¨©«;5¯'E������°ª¥«¡��
�� ���4236@��..060:,<0/�!,9&0/=.0�• '�ßÿzGuĄĒã«��G��Gi'
– dis��GuW±Ĝ�·���bġĕQS�q�l�b ê�
– þªRa���Ząč·bÐĖ�'
• Ûę�– ²¤Zz�eqZ'��b¬´YĈ¬�
• ����Y^a©Ċ�• �)����#907"*!��0<.�
– ����ÑHZdis��GuVZ�ėqkG��j��
• (')��!� �– �$)VZ#=<�81�.8:0XzGuë«F�d�m�r�Z½¬�
• �$)��$)¡q{�G��jzGuàç�Z²Ĉª�
• �$)�GqZ¶£tG{��– dis��Gu¼OZ²ĈXzGu�èóZ½¬�
• ��'�fG�w{�
Hamar'Overview�
Map�
Distributed'Array�
Rank'0� Rank'1� Rank'n�
Local'Array� Local'Array� Local'Array� Local'Array�
Reduce�Map�
Reduce�
Map�
Reduce�Shuffle�
Shuffle�
Data'Transfer'between'ranks�
Shuffle�Shuffle�
Local'Array� Local'Array� Local'Array� Local'Array�
Device(GPU)'Data�
Host(CPU)'Data� Memcpy''
(H2D,'D2H)�
Virtualized'Data'Object�
Map/Reduce'code'sample�class'MapImpl':'public'hamar::funcJon::cuda::Map<MapContext>'{'''public:'''''''__host__'__device__'Operate(MapContext'*context)'{''''''''''KeyType'key'='context`>input_key();''''''''''ValueType'value'='context`>input_value();'''''''''context`>Emit(key,'value);''''''}'}''class'ReduceImpl':'public'hamar::funcJon::cuda::Reduce<ReduceContext>'{'''public:''''''___host__'__device__''Operate(ReduceContext'*context)'{''''''''''KeyType'key'='context`>input_key();''''''''''ValueType'values'='context`>input_values();''''''''''int'n'='context`>num_input_values();''''''''''ValueType'sum'='values[0]'+'…'+'values[n];''''''''''context`>Emit(key,'sum);''''''}'}'
Map/Reduce'code'sample'(cont’d)�int'main()'{'''''MapImpl'map;''''ReduceImple'reduce;'''''Environment'env;''''env.Init();''//'MPI/CUDA'IniJalizaJon'''''Directory'object(&env);''''object.Init(path);'''''object.Map(map);''''object.Reduce(reduce);'''''object.Destroy();'''''env.Destroy();''//'MPI/'CUDA'FinalizaJon''}�
Highly'Accelerated'MapReduce'with''Out`of`core'support'on'GPUs�
Map�
Reduce�
Map�
Reduce�
Map�
Reduce�
• Hierarchical'memory'management'for'large`scale''data'parallel'processing'using'mulJ`GPUs'– Support'out`of`core'processing'on'GPU'devices'– Overlapping'computaJon'and'communicaJon'
Map�
Reduce�
GPU�
CPU�
Memcpy''(H2D,'D2H)�Processing''for'each'chunk�
WY�
Shuffle� Shuffle�
Map/Reduce'ImplementaJon�• IniJalizaJon'before'each'operaJon'
– Remove'unnecessary'keys'– Reordering'data'structures'
• OpJmizaJons'for'GPU'accelerators'– Assign'a'warp'(32'threads)'per'key'for'avoiding'warp'divergence'in'
Map/Reduce'– Overlapping'computaJon'on'GPU'and'data'transfer'between'CPU'and'
GPU'
Map/'Reduce�
Map/'Reduce�
Sort�Sort�
Scan�
Sort'key`value'for'Scan�
Compact'keys'to'unique�
Overlap'computaJon'and'data'transfer�
GPU`based'External'Sort'ImplementaJon�
CPU�GPU�
1.'Divide'input'data'into'chunks,'then'sort'on'GPU'for'each'chunk�
2.'Swap'intermediate''''''data'on'CPU�
GPU�
3.'Sort'intermediate'data'on'GPU�
*1:'Y.'Ye'et'al.,'“GPUMemSort:'A'High'Performance'Graphics'Co`processors'SorJng'Algorithm'for'Large'''''''''Scale'In`Memory'Data”,'GSTF'InternaJonal'Journal'on'CompuJng,'2011'
• Out`of`core'GPU'sorJng'algorithm'*1'– Adopted'Sample`based'Parallel'SorJng'Algorithm'– Overlapping'computaJon'on'GPU'and'data'transfer'between'CPU'and'GPU'
WZ�
ApplicaJon'Example':'GIM`V'Generalized'IteraJve'Matrix`Vector'mulJplicaJon*1�
• Easy'descripJon'of'various'graph'algorithms'by'implemenJng'combine2,'combineAll,'assign'funcJons'
• PageRank,'Random'Walk'Restart,'Connected'Component'– v’#=#M#×G#v''where'
v’i'='assign(vj','combineAllj'({xj#|'j#='1..n,'xj#='combine2(mi,j,'vj)}))''(i'='1..n)'
– IteraJve'2'phases'MapReduce'operaJons'
´� ×G�v’i� mi,j�
vj�
v’� M�
combineAll(and(assign((stage2)�
combine2((stage1)�
assign� v�
*1':'Kang,'U.'et'al,'“PEGASUS:'A'Peta`Scale'Graph'Mining'System`'ImplementaJon''and'ObservaJons”,'IEEE'INTERNATIONAL'CONFERENCE'ON'DATA'MINING'2009�
Straigh|orward'implementaJon'using'Hamar�
Weak'Scaling'Performance''[Sato,'Shirahata'et'al.'Cluster2014]'�
• PageRank'applicaJon'on'TSUBAME'2.5'• Data'size'is'larger'than'GPU'memory'capacity�
0'
500'
1000'
1500'
2000'
2500'
3000'
0' 200' 400' 600' 800' 1000' 1200'
Perform
ance([MEdges/sec]�
Number(of(Compute(Nodes�
SCALE(23(K(24(per(Node�
1CPU'(S23'per'node)'1GPU'(S23'per'node)'2CPUs'(S24'per'node)'2GPUs'(S24'per'node)'3GPUs'(S24'per'node)'
2.81'GE/s'on'3072'GPUs'(SCALE'34)�
2.10x'Speedup'(3'GPU'v'2CPU)�
Breakdown�• Performance'on'3'GPUs'compared'with'2'CPUs'
– SCALE'33,'1024'nodes'– Map:'2.82x,'Reduce:'1.11x,'Sort:'5.04x'speedup'
• Overlapping'communicaJon'effecJvely�
0'
10000'
20000'
30000'
40000'
50000'
60000'
70000'
1CPU' 1GPU' 2CPUs' 2GPUs' 3GPUs'
Elapsed(.me([ms]�
Map'
Shuffle'
Reduce'
Sort'
Others'
Towards(Mul.level(data(management((on(Hamar(using(GPUs(and(NVMs([GTC2014]�
Mother'board�
'''''''''''''''''''''''''''''''''''''RAID'card�
mSATA� mSATA� mSATA� mSATA�
0'
1000'
2000'
3000'
4000'
5000'
6000'
7000'
8000'
9000'
0' 5' 10' 15' 20'
Bandwidth([MB/s]�
#(mSATAs�
Raw'mSATA'4KB'RAID0'1MB'RAID0'64KB'
0'
0.5'
1'
1.5'
2'
2.5'
3'
3.5'
0.274'0.547'1.09' 2.19' 4.38' 8.75' 17.5' 35' 70' 140'
Throughuput([GB/s]�
Matrix(Size([GB]�
Raw'8'mSATA'8'mSATA'RAID0'(1MB)'8'mSATA'RAID0'(64KB)'
I/O'performance'of'mulJple'mSATA'SSD� I/O'performance'from'GPU'to'mulJple'mSATA'SSDs�
�(7.39(GB/s(from((
16(mSATA(SSDs((Enabled(RAID0)(
�(3.06(GB/s(from((
8(mSATA(SSDs(to(GPU(
How(to(design(local(storage(for(nextKgen(supercomputers(?(K(Designed(a(local(I/O(prototype(using(16(mSATA(SSDs(
¯Capacity:((4TB(¯Read(bandwidth:(8(GB/s(
SorJng'for'Rapidly'Increasing'Datasets'[Shamoto,'Sato'et'al]'�
• The'need'to'process'huge'datasets'is'increasing'due'to'growth'of'data'collecJon'in'various'fields'– Sensor'data'– SNS'network'
• Fast'sorJng'methods'– Distributed'SorJng:'SorJng'for'distributed'system'
• Spli~er`based'parallel'sort'• Radix'sort'• Merge'sort'
– SorJng'on'heterogeneous'architectures'• Many'sorJng'algorithms'are'accelerated'by'many'cores'and'high'memory'bandwidth.'
• SorJng'for'large`scale'heterogeneous'systems'remains'unclear'
ExisJng'SorJng'Algorithms�SpligerKbased(parallel(sor.ng(
– The'flow'of'the'algorithm'1. local'sort:'Each'process'sorts'its'own'array'2. Select'spli0ers:'Choose'criteria'for'data'segmentaJon'3. Data'transfer:'Transfer'data'segments'4. Local'merge:'Merge'sorted'arrays'
– Low'communicaJon'costs'x 'ComputaJon'costs'starts'dominaJng'the'overall'performance(
(
Sor.ng(on(GPU(– There'are'many'a~empts'to'accelerate'sorJng'
• Thrust'sort[D.merrill'et'al.,'2011]'– Fast'sorJng'for'one'compute'node'
• A'GPU'external'sort[Y.'Ye'et'al.,'2010]'– Handle'GPU'memory'overflows'
• A'mulFGnode'GPU'sort[K.'L.'Spafford'et'al.,'2011]'– Does'not'sort'huge'data'sets'
U.lize(GPU(accelerators(for(spligerKbased(parallel(sor.ng�
GPU'implementaJon'for'Spli~er`based'Parallel'SorJng�
• Offloading'the'most'Jme`consuming'phase'to'GPU'accelerators�
0
20
40
4 8 16 32 64 128
256
512
1024
2048
# of proccesses (2 proccesses per node)
Elap
sed
time[
s]
synchronization costsdata transfer and Mergelocal sort (original)merge (remaining arrays)select splitters
select'spli~ers�
data'transfer�
merge�
'�
'�
GPU�
local'sort�'� unsorted�
sorted�
'�
'�
• 2'~'1024'nodes'(4'~'2048'GPUs)'on'TSUBAME2.5'• 2'processes'per'node'and'each'node'has'2GB'64bit'integer�
Weak'Scaling'Performance�
0
10000
20000
30000
0 500 1000 1500 2000# of proccesses (2 proccesses per node)
Keys
/sec
ond(
mill
ions
)
HykSort 1threadHykSort 6threadsHykSort GPU + 6threads
GPU(implementa.on(based(on(mul.Kthreaded(
implementa.on�
Mul.Kthreaded(implementa.on�
SingleKthreaded(implementa.on�
x1.4�
x3.6�
When'the'#'of'processes'is'2048�
K20x x4 faster than K20x
0
20000
40000
60000
0 500 1000 1500 2000 0 500 1000 1500 2000# of proccesses (2 proccesses per node)
Key
s/se
cond
(mill
ions
)HykSort 6threadsHykSort GPU + 6threadsPCIe_10PCIe_100PCIe_200PCIe_50Prediction of our implementation
Performance'PredicJon�• PCIe_#:'#GB/s'bandwidth'of'interconnect'between'CPU'and'GPU'
8.8%'reducJon'of'overall'runJme'when'the'accelerators'work'4'Jmes'faster'than'K20x�
x2.2'speedup'when'the'#'of'PCI'bandwidth'increase'to'50GB/s�
\W]�• (')��!� YLOa�$)b¬JS��ßÿzGuã«�÷ZPĞø�– ��!�&���ßÿzGuĄĒã«��G��Gi��– '964<<0:�-,;0/Z�ētG{���
• �$)Zæù¢X½¬NúÔ¢�– tG{Eqh��E0<.�
• �$)Z���YÙ\_XJßÿZzGuner�– úâ¢Xq{�G��jzGuë«F�#=<�81�.8:0d�m�r�Z½¬�
top related