TRANSCRIPT
Parallel Programming in the Real World!
• Andrew Brownsword, Intel • Niall Dalton • Goetz Graefe, HP Labs • Russell Williams, Adobe
• Moderator: Luis Ceze, UW-CSE
Andrew Brownsword
MIC SW Architect
Intel Corporation
“Real” software is…
Large, unwieldy, and long lived
Much longer & larger than intended
especially by the authors!
Written by many people
Of widely ranged skills, styles, agendas, experiences
Many have moved on to [next_task..next_life]
Hard to change
Even with language-aware tools
“Real” software is…
Expressed in and defined by programming models
Many levels of abstraction, concreteness, explicitness, constraints
○ Best balance depends on goals & changes over time
Compilers only understand the programming model…
…not the abstractions built in it
“Real” parallel software is…
Supposed to be fast
But performance is extremely fragile
Supposed to be robust
Data races, deadlocks, livelocks, etc
Composability
Supposed to be maintainable
Critical aspects often hidden in the details
1. Brief survey of the “real world” landscape
2. Implications for programming models
Games
Games – code & platform
Medium-to-large codebases
50K-5M lines of code, largely C++
May have large tool chains & online infrastructure
Many diverse sub-systems running at once
“Soft” real-time, broad range of data set sizes
Frame-oriented scheduling (mostly)
Many sequencing dependencies between tasks (see the task-graph sketch after this list)
Target hardware “is what it is”
Phones to servers, performance is critical
Multi-core, heterogeneous, SIMD, GPU, networks
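The frame-oriented scheduling with sequencing dependencies described above can be pictured as a per-frame task graph. A minimal C++ sketch, with hypothetical task wiring rather than any real engine's scheduler:

#include <functional>
#include <thread>
#include <vector>

// Minimal per-frame task graph sketch (hypothetical task names, not any
// particular engine). Ready tasks run as a parallel wave; finishing a wave
// releases dependent tasks for the next wave.
struct Task {
    std::function<void()> work;
    std::vector<Task*>    dependents;      // tasks that must run after this one
    int                   remaining_deps = 0;
};

void run_frame(const std::vector<Task*>& frame) {
    std::vector<Task*> ready;
    for (Task* t : frame)
        if (t->remaining_deps == 0) ready.push_back(t);

    while (!ready.empty()) {
        std::vector<std::thread> wave;
        for (Task* t : ready) wave.emplace_back(t->work);   // run the wave in parallel
        for (std::thread& th : wave) th.join();

        std::vector<Task*> next;
        for (Task* t : ready)
            for (Task* d : t->dependents)
                if (--d->remaining_deps == 0) next.push_back(d);
        ready = std::move(next);
    }
}
// e.g. game logic feeds animation and physics, which feed graphics; the same
// drain-the-graph loop repeats every frame.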
Games – many systems
Graphics
Environment
Audio
Animation
Game logic
AI & scripting
Physics
User Interface
Inputs
Network
I/O (streaming)
Data conversion / processing
Note that this slide is a gross over-simplification!
Games – dev process
Short development cycles
Severe code churn, high pressure to deliver quickly
Middleware & game “engine” use common
Rapidly changing feature requirements
Fast iteration during development is critical
Code & architecture maintenance nightmare
Substantial volume & variety of media data
Content team size greatly exceeds engineering’s
Porting between diverse platforms is common
HPC
HPC – code & platform
Widely varying codebases & domains
10K-10M+ lines of code
Diverse programming models ○ Fortran, C/C++, MPI, OpenMP dominate
Few kernels*, BIG data*
Correctness & robustness are critical*
Epic, titanic, gargantuan data sets*
Varied target hardware
Workstations to large clusters
Often purchased for the application
* Usually
HPC – dev process
Ongoing development cycles
Code is generally never re-written, lasts for decades
Huge, poorly understood legacy code
Heavy library use (math, solvers, communications, etc.)
Correctness, verifiability, robustness
Dependability of results is crucial
Very, very long running times are common
Portability across generations & platforms
Tuning ‘knobs’ exposed rather than changing the code (see the sketch after this list)
Outlast HW, tools, vendors, prog. models, authors
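One way the tuning-‘knob’ point plays out in practice is to read knobs from the environment or an input deck rather than hard-coding them. A minimal C++ sketch, assuming a hypothetical SOLVER_BLOCK_SIZE variable:

#include <cstdlib>
#include <string>

// Hypothetical tuning knob: block size for a tiled kernel, overridable at run
// time so re-tuning for a new machine does not require editing the code.
static int block_size() {
    if (const char* s = std::getenv("SOLVER_BLOCK_SIZE"))   // assumed knob name
        return std::stoi(s);
    return 64;                                               // conservative default
}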
Programming Models
Programming Models
Design of the model has formative impact on software written in it
Abstractions to avoid over-specifying details
Concrete to allow control over solution
Explicit to keep critical detail visible
Constraints to allow effective optimization
For parallelism: Top desirable attributes
Top factors to address
Desirable Attributes…
Integration
With other models: Existing model, enables gradual adoption
Peer models, there is no single silver bullet
Layered models, enables DSLs & interop
With runtimes: Interaction & interop within processes
Resource management (processors, memory, I/O)
With tools: Build systems, analysis tools, design tools
Debuggers, profilers, etc.
Portability
Hardware, OS & vendor independence
Standard, portable models live longer
Investment in software is very expensive
Re-writing is often simply not an option
Even seemingly small changes can be extraordinarily expensive
○ Testing & validation costs
○ Architectural implications
Composability
Real software is large and complex
Built by many people
Built out of components
Subject to intricate system level behaviours
Programming models must facilitate and support these aspects
Factors To Address…
Concurrent Execution
Multiple levels to achieve performance
Vectorization (SIMD & throughput optimization)
Parallelization (multi/many-core)
Distributed (cluster-level)
Internet (loosely coupled, client/server, cloud services)
Programming model needs to express each level
Each level brings >10x potential
Cannot afford a different decomposition at each level (see the sketch below)
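A minimal sketch of expressing one decomposition at two of these levels, using the OpenMP constructs already common in this space (threads across cores, SIMD lanes within a core); a cluster-level decomposition would partition n across MPI ranks the same way:

#include <cstddef>

// Same loop, decomposed across threads (parallel for) and SIMD lanes (simd).
void saxpy(float a, const float* x, float* y, std::ptrdiff_t n) {
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}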
Data Organization & Access
FLOPS are cheap, bandwidth is not
Severe and worsening imbalance
No sign of this changing
Optimizing data access is usually key to achieving performance & power
Existing models do very little to address this
Access patterns usually implicit, layouts explicit
Changing data layout requires changing code
Different hardware, algorithms & models demand different layouts (see the AoS/SoA sketch below)
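The layout point can be made concrete with the classic array-of-structures versus structure-of-arrays choice; a minimal C++ sketch with hypothetical particle fields:

#include <vector>

// Array-of-structures (AoS): natural to write, accessed as particles[i].x.
struct ParticleAoS { float x, y, z, mass; };
// std::vector<ParticleAoS> particles;

// Structure-of-arrays (SoA): each field is contiguous, which vector units and
// GPUs can stream, but switching requires touching every access site --
// exactly the "changing data layout requires changing code" problem above.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;    // accessed as particles.x[i]
};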
Specialization
Hardware is diversifying
Heterogeneous processors (CPU, GPU, etc)
Fixed function hardware
System-on-chip
Driven by power & performance
Tight integration needed for fine-grained interactions & data
NYSE TAQ record counts
An elegant weapon… for a more civilized age
Trade lifecycle
• Strategy Development & Testing
• Strategy Deployment & Management
• Data Storage & Analysis
• Post Trade Analysis & Compliance
- Trade History
- Exchange latency
- Historical DB
- Data Capture
- Data Publishing
- Analytics
- Back Testing
- Optimization
- Research Systems
- Matching
- Execution
- Risk Management
Follow the data
!"#$%&'()*&+,(-."/&001+2!"#$!"%##$&'(#$)'*'$+,-".'/
!"#$%$&'("'#)*&*+('%#,-#.,%)/0("'#1*2$+#,"#*"#34$"0#5&(4$"#!&.6(0$.0/&$#735!8#06*0#
,--$&2#06$#*1(9(0:#0,#*"*9:;$#$<0&$%$9:#9*&'$#*%,/"02#,-#$4$"0#7*"+#,06$&8#+*0*#-&,%#
+(2)*&*0$#2,/&.$2#=(06#4$&:#9,=#9*0$".:#*"+#6('6#06&,/'6)/0
34 0 1234'(/$12(5,6#(*,'7$0 )#8#39#%$:;$<=:=
>*%#'3,(-$)'*'$?('7/*,8@A12347#B$CD#(*$E%28#@@,(-F
G(HI#32%/$?('7/*,8@AG(HI#32%/$)'*'9'@#F
I'@@$+,@*2%,8$6'*'$'('7/*,8@A127J3('%$)'*'$>*2%#F
>*%#'3
6'*'
K4@$6'*'
>J9HI,77,@#82(6
I,(J*#L+2J%
>#82(6
Different cores for different chores
[Chart: FPGAs, GPUs, Many Cores, and Low-Power SOCs plotted from Application-Specific to Broad Applicability against Ease of Programming.]
Don’t fight the last war
[Timeline, 1990-2020, marking the Era of the CLOCK WARS, the Era of the CORE WARS, and the Era of the EFFICIENCY WARS.]
Softer HW or harder SW?
TELAS 2™ Card
Product Brief

An ultra-low-latency trading system on a card
TELAS 2™ (Trading Engine Latency-reduction and Acceleration System) is a revolutionary new product from Accensus. It combines flexible 1/10Gbps network connectivity with a powerful in-line Altera® Stratix® IV FPGA and the new Tilera TILE-Gx8036™ 36-core processor running Linux, along with up to 24GB (8GB FPGA, 16GB processor) of DDR3 RAM, on a single PCI Express card.

The TELAS 2™ architecture is designed from the outset to be ultra-low-latency, allowing data to be received from the wire at up to 40Gbps, optionally processed in-line by the FPGA, then delivered to the processor where it is handled by applications written in C/C++ running on an optimized Linux OS. The data can remain on the card for processing or be transferred to the host if required. The transmit path is simply the reverse. This architecture gives application developers complete freedom to implement ultra-low-latency applications in C/C++, in VHDL/Verilog, or split across both. User applications may reside entirely on the card or be split across the card and the host server.

TELAS 2™ is built to be installed in commodity servers. It provides a significantly lower-latency network path to and from the card CPU than that of the server, and supplements server capacity with significant additional compute and memory resources, extending existing rack space and power utilisation.

PCI Express Card comprising:
• 4 SFP/SFP+ cages providing flexible 1/10Gbps connectivity
• Tilera TILE-Gx8036™ Processor (64-bit, 36 cores, 1.2GHz) with two dedicated 1600MHz DDR3 SODIMMs
• Altera® Stratix® IV FPGA (531,200 equivalent LEs) with one dedicated DDR3 SODIMM
• High-precision oven-compensated quartz oscillator (OCXO) providing ultra-accurate timestamping capability
• Generation 2 (5Gbps) x8 PCI Express bus providing 40Gbps between card and host

Key Features:
• Optimised ultra-low-latency path between the “wire” and the CPU
• FPGA sits inline between the card’s network interfaces and the CPU, allowing it to process/manipulate raw frames in-line with the CPU’s network interfaces
• FPGA and CPU are also connected by a dedicated 20Gbps PCI Express bus allowing efficient DMA transfer between them
• Low power footprint of <75W allowing multiple cards to be used in a single server if required
• Tilera’s Multicore Development Environment™ (MDE) included with Accensus user-space libraries

Accensus also offers a subscription service to a range of Exchange Line Handlers designed and implemented to run on the TELAS 2™ card. These are implemented on the FPGA, leaving all CPU cores available for application development. Each Line Handler provides full packet-to-message decode, line arbitration, and transformation to optimised C structs. Also planned for early 2012 are a user-space TCP/IP stack with FPGA offload optimised for order management systems and advanced time-stamping functionality. Please contact [email protected] for further information.

[Block diagram: 4x SFP/SFP+ ports, FPGA with one DDR3 SODIMM, CPU with two DDR3 SODIMMs, OCXO, PCI Express interface.]

© 2011 Accensus LLC, 200 South Wacker Drive, Chicago, IL 60606, USA
Data reduction
target NIC. For verification, we have used the NEC Japan stock information (see Section II) as events. The date of each event is replaced by a sequence number in order to continuously input events at the cycle level. As shown in Fig. 10, the logic successfully detects five change points, and it achieves 20Gbps event processing performance; the clock frequency is 156MHz and the data width is 128 bits (i.e., 156MHz x 128 bits = 19,968Mbps). It should be noted that it is difficult for SQL-based systems to detect such change points without having support for procedural languages.
[Figure 10. Implementation of our motivating example: the NIC feeds price data (yen) into smoothing & change-point analysis, and change points are flagged against the smoothed series.]
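The excerpt does not spell out the smoothing or change-point algorithm. Purely as an illustration of the kind of computation Figure 10 refers to, an exponential smoother with a deviation threshold might look like this in C++ (alpha and threshold are made-up parameters, not the paper's):

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: smooth a price series and flag indices where the raw
// value departs from the smoothed value by more than `threshold` yen.
std::vector<std::size_t> change_points(const std::vector<double>& price,
                                       double alpha, double threshold) {
    std::vector<std::size_t> points;
    double smoothed = price.empty() ? 0.0 : price.front();
    for (std::size_t i = 0; i < price.size(); ++i) {
        smoothed = alpha * price[i] + (1.0 - alpha) * smoothed;   // EWMA smoothing
        if (std::fabs(price[i] - smoothed) > threshold)
            points.push_back(i);
    }
    return points;
}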
B. Logic Usage
We have evaluated how much the above implementation increases slice logic utilization of our NIC FPGA. Since the number of occupied slices increases only 2.7% (see Table II), the entire logic, including our event processing adapter, can be efficiently implemented.
TABLE II. LOGIC USAGE IN OUR MOTIVATING EXAMPLE
Item                        Available   Increase
Number of slice registers   207,360     +2,492 (1.2%)
Number of slice LUTs        207,360     +4,290 (2.1%)
Number of occupied slices   51,840      +1,378 (2.7%)
VI. CONCLUSION
The requirements for fast complex event processing will necessitate hardware acceleration using reconfigurable devices. Key to the success of our work is the logic automatically generated from our C-based event language. With this language, we have achieved both higher event processing performance and higher flexibility for application designs than those of SQL-based CEP systems. We have demonstrated the world’s fastest (20Gbps) event processing performance in a financial trading application on an FPGA-based NIC. In future work, we intend to employ partial reconfiguration for run-time function replacement.
ACKNOWLEDGMENTS
We thank T. Iihoshi, K. Ichino, O. Itoku, S. Kamiya, S. Karino, S. Morioka, A. Motoki, M. Nishihara, M. Petersen, H. Tagato, M. Tani, A. Tsuji, and N. Yamagaki for their great contributions.
[Photo labels: Air Vents, Console Port, Management Port, USB Port, 16 Base SFP/SFP+ Ports, 8 FX SFP/SFP+ Ports, Clock Input.]
somedata = (1,2,3); otherdata = [1,2,3]; dict = [`a=1, `b=2];
// Note these are different types, list vs. vector.
type area = `FX | `Equities Int | `FixedIncome Double Double
results@node0 with f = #(id :: symbol; profit :: double)
f = {(id, area, pnl)
    var profit = area ? `FX : pnl*.98
                      | `Equities x : pnl-x
                      | `FixedIncome x y : pnl*(x-y);
    insert (id, profit) into results }
jobs = select id, area, parameters from strategies where date==today()
simulate = {(job)
    var pnl = sum(random * 1..10);
    if (pnl > 100) {send (job.id, job.area, pnl) to results; (`ok, pnl)}
    else (`fail, pnl) }
job_status = @[select distinct processors from places] { <-[(x){begin simulate(x)} each jobs] }
failed_jobs = select (status, pnl) from job_status where status==`fail
// run some code in place A; block until it’s done
@A {code}
// start an activity f in place B and return immediately
@B begin {f}
// run some code in place C, taking ownership of data
@c with data {…}   // bind data to place
// distribute data over place1 and place2
@[place1, place2] data;
// redundant copies of data in place1 and place2; also works for redundant computation
@[place1], [place2] data;
// Run f in the fastest place we can
var c = select core-id from processors where max frequency
@c {f(`somedata)}
// Queue work in parent
@parent {code}
// Reply
@reply {code}
Open Problems
• Machines are already beyond our ability to program productively with high performance
• It’s getting harder to observe, understand, debug & tune our broken programs/machines
• Where is the inconsistency coming from?
• What implicit effects are we suffering from?
• How do we cope with increasing diversity?
• What do we need to give up to get some help?
Hot topics in parallelism in data management
Goetz Graefe Hewlett-Packard Laboratories
Palo Alto, Cal. – Madison, Wis.
History
• Concurrency among independent transactions
  Each transaction single-threaded; 1960s, 1970s, …
• Parallel query processing (within a transaction)
  Teradata 1983-84: specialized hardware
  Gamma 1984-88: off-the-shelf hardware
  Pipelines for algebraic execution
  Partitioning intermediate results
Transactions
ACID = atomicity, consistency, isolation, durability
• User transactions
  Database contents: queries & updates
  Locks held to transaction commit
  Rollback using recovery log
• System transactions
  Database representation changes, e.g., B-tree node split
  In-memory data structures, “latches”
Two types of transactions
                       User transactions             System transactions
Invocation source      User request                  System-internal
Database effects       Logical database contents     Physical database representation
Data location          Database or buffer pool       In-memory page images
Invocation overhead    New thread                    Same thread
Locks                  Acquire & retain              Test for conflicts
Commit overhead        Force log to stable storage   No forcing
Logging                Full “redo” & “undo”          “Redo” only, usually
Failure recovery       Rollback                      Completion
Hardware opportunity   Non-volatile memory           Transactional memory
Two types of concurrency control
              Locks                                  Latches
Separate…     User transactions                      Threads
Protect…      Database contents                      In-memory data structures
During…       Entire transactions                    Critical sections
Modes         Shared, exclusive, update,             Shared, exclusive
              intention, escrow, schema
Deadlock…     Detection & resolution                 Avoidance
… by…         Waits-for graph analysis, timeout,     Coding discipline, lock leveling
              transaction abort, partial rollback,
              lock de-escalation
Kept in…      Lock manager’s hash table              Protected data structure
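A minimal sketch of the distinction in code (illustrative only, not any particular DBMS's API): a latch is a short critical section around an in-memory structure such as a B-tree node, while a lock is a logical claim registered with a lock manager and retained until the user transaction commits.

#include <mutex>
#include <shared_mutex>

// Latch: protects the in-memory page image only for the duration of the
// structural change (e.g., a B-tree node split), then is released.
struct BTreeNode {
    std::shared_mutex latch;
    // ... keys, children ...
};

void split_node(BTreeNode& node) {
    std::unique_lock guard(node.latch);   // critical section only
    // perform the split, logged as a system transaction ...
}                                         // latch released here

// Lock: a logical claim on database content, acquired through a lock manager
// and held until the user transaction commits (hypothetical API, sketch only):
//   lock_manager.acquire(txn, key, LockMode::Exclusive);
//   ... transaction body ...
//   lock_manager.release_all(txn);       // at commit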
Current trends and challenges
• Scalability
  Query processing versus map-reduce (Hadoop etc.)
  Data mining, business intelligence, analytics
  Utilities (load, reorganization, …)
• Implementation techniques
  Low-level synchronization
  Transactional memory
  Non-volatile memory
  Other novel hardware
Me: Russell Williams. My product: Photoshop
• Huge cross-platform code base on a single-threaded framework
• Parallel computation since the mid-90s using basic parallel_for (see the sketch after this list)
• Scaling falls off beyond 4 cores for many operations.
• Must trade off throughput for latency
• Proliferation of thread pools
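The “basic parallel_for” pattern referred to above typically slices an image into rows or tiles; a minimal TBB-based sketch (illustrative only, not Photoshop's actual framework):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Illustrative tiled parallel_for over image rows.
void brighten(unsigned char* pixels, int width, int height, int delta) {
    tbb::parallel_for(tbb::blocked_range<int>(0, height),
        [=](const tbb::blocked_range<int>& rows) {
            for (int y = rows.begin(); y != rows.end(); ++y)
                for (int x = 0; x < width; ++x) {
                    int v = pixels[y * width + x] + delta;
                    pixels[y * width + x] = v > 255 ? 255 : (unsigned char)v;
                }
        });
}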
Challenges — structure of the problem
• Asynchrony vs. parallel compute
• Available parallelism
  • Amdahl’s law vs. events, views, PCI bus (see the worked example below)
  • On server, parallelize per user. On desktop: one user
  • Bandwidth limited — FLOPS / memory reference
• 80-core chips not coming; software can’t use ‘em.
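The scaling ceiling follows directly from Amdahl’s law. With illustrative numbers (not Photoshop measurements): if a fraction p = 0.9 of an operation parallelizes,

  S(n) = 1 / ((1 - p) + p/n)
  S(4) = 1 / (0.1 + 0.9/4) ≈ 3.1        S(∞) = 1 / 0.1 = 10

so four cores already give about a third of the best possible speedup, and no number of additional cores pushes past 10x.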
Challenges — structure of the solutions
• Heterogeneous environment
  • C machine vs. data parallel, high latency, high throughput
  • Different cache / memory hierarchies
• Rapidly changing hardware landscape
  • Discrete -> Integrated GPU
  • SSE -> AVX -> AVX2 -> AVX3
  • __m128i vDst = _mm_cvttps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(vSum0), vInvArea));
• Variety, rapid evolution, and fragmentation of tools
  • CUDA / DX / OpenCL / C++ AMP, vectorizing compilers
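For context, the intrinsic line above is the sort of thing found inside a box-filter inner loop; a minimal sketch of assumed surrounding code (not Photoshop source, and whether the conversion rounds, _mm_cvtps_epi32, or truncates, _mm_cvttps_epi32, is not clear from the slide):

#include <emmintrin.h>   // SSE2 intrinsics

// Assumed context: four box-filter sums are scaled by the reciprocal of the
// box area and converted back to 32-bit integers.
__m128i average4(__m128i vSum0, float invArea) {
    __m128  vInvArea = _mm_set1_ps(invArea);
    __m128i vDst = _mm_cvttps_epi32(
        _mm_mul_ps(_mm_cvtepi32_ps(vSum0), vInvArea));
    return vDst;
}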
Desktop GFlops (8-core 3.5GHz Sandy Bridge + AMD 6950)
[Bar chart, 0-3000 GFlops, for Straight C++, Intrinsics / Auto-vec, TBB / GCD, and OpenGL / OpenCL / C++ AMP, broken down into GPU, CPU Vector Unit, Multi-thread, and Scalar GFlops.]