TRANSCRIPT
Parallel Programming in the Real World!
• Andrew Brownsword, Intel • Niall Dalton • Goetz Graefe, HP Labs • Russell Williams, Adobe
• Moderator: Luis Ceze, UW-CSE
Andrew Brownsword
MIC SW Architect
Intel Corporation
“Real” software is…
Large, unwieldy, and long lived
Much longer & larger than intended
especially by the authors!
Written by many people
Of widely ranged skills, styles, agendas, experiences
Many have moved on to [next_task..next_life]
Hard to change
Even with language-aware tools
“Real” software is…
Expressed in and defined by programming models
Many levels of abstraction, concreteness, explicitness, constraints
○ Best balance depends on goals & changes over time
Compilers only understand the programming model…
…not the abstractions built in it
“Real” parallel software is…
Supposed to be fast
But performance is extremely fragile
Supposed to be robust
Data races, deadlocks, livelocks, etc
Composability
Supposed to be maintainable
Critical aspects often hidden in the details
1. Brief survey of the “real world” landscape
2. Implications for programming models
Games
Games – code & platform
Medium-to-large codebases
50K-5M lines of code, largely C++
May have large tool chains & online infrastructure
Many diverse sub-systems running at once
“Soft” real-time, broad range of data set sizes
Frame-oriented scheduling (mostly)
Many sequencing dependencies between tasks (see the task-graph sketch after this list)
Target hardware “is what it is”
Phones to servers, performance is critical
Multi-core, heterogeneous, SIMD, GPU, networks
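The frame-oriented scheduling with sequencing dependencies described above can be pictured as a per-frame task graph. A minimal C++ sketch, with hypothetical task wiring rather than any real engine's scheduler:

#include <functional>
#include <thread>
#include <vector>

// Minimal per-frame task graph sketch (hypothetical task names, not any
// particular engine). Ready tasks run as a parallel wave; finishing a wave
// releases dependent tasks for the next wave.
struct Task {
    std::function<void()> work;
    std::vector<Task*>    dependents;      // tasks that must run after this one
    int                   remaining_deps = 0;
};

void run_frame(const std::vector<Task*>& frame) {
    std::vector<Task*> ready;
    for (Task* t : frame)
        if (t->remaining_deps == 0) ready.push_back(t);

    while (!ready.empty()) {
        std::vector<std::thread> wave;
        for (Task* t : ready) wave.emplace_back(t->work);   // run the wave in parallel
        for (std::thread& th : wave) th.join();

        std::vector<Task*> next;
        for (Task* t : ready)
            for (Task* d : t->dependents)
                if (--d->remaining_deps == 0) next.push_back(d);
        ready = std::move(next);
    }
}
// e.g. game logic feeds animation and physics, which feed graphics; the same
// drain-the-graph loop repeats every frame.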
Games – many systems
Graphics
Environment
Audio
Animation
Game logic
AI & scripting
Physics
User Interface
Inputs
Network
I/O (streaming)
Data conversion / processing
Note that this slide is a gross over-simplification!
Games – dev process
Short development cycles
Severe code churn, high pressure to deliver quickly
Middleware & game “engine” use common
Rapidly changing feature requirements
Fast iteration during development is critical
Code & architecture maintenance nightmare
Substantial volume & variety of media data
Content team size greatly exceeds engineering’s
Porting between diverse platforms is common
HPC
HPC – code & platform
Widely varying codebases & domains
10K-10M+ lines of code
Diverse programming models ○ Fortran, C/C++, MPI, OpenMP dominate
Few kernels*, BIG data*
Correctness & robustness are critical*
Epic, titanic, gargantuan data sets*
Varied target hardware
Workstations to large clusters
Often purchased for the application
* Usually
HPC – dev process
Ongoing development cycles
Code is generally never re-written, lasts for decades
Huge, poorly understood legacy code
Heavy library use (math, solvers, communications, etc.)
Correctness, verifiability, robustness
Dependability of results is crucial
Very, very long running times are common
Portability across generations & platforms
Tuning ‘knobs’ exposed rather than changing the code (see the sketch after this list)
Outlast HW, tools, vendors, prog. models, authors
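One way the tuning-‘knob’ point plays out in practice is to read knobs from the environment or an input deck rather than hard-coding them. A minimal C++ sketch, assuming a hypothetical SOLVER_BLOCK_SIZE variable:

#include <cstdlib>
#include <string>

// Hypothetical tuning knob: block size for a tiled kernel, overridable at run
// time so re-tuning for a new machine does not require editing the code.
static int block_size() {
    if (const char* s = std::getenv("SOLVER_BLOCK_SIZE"))   // assumed knob name
        return std::stoi(s);
    return 64;                                               // conservative default
}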
Programming Models
Programming Models
Design of the model has formative impact on software written in it
Abstractions to avoid over-specifying details
Concrete to allow control over solution
Explicit to keep critical detail visible
Constraints to allow effective optimization
For parallelism: Top desirable attributes
Top factors to address
Desirable Attributes…
Integration
With other models: Existing model, enables gradual adoption
Peer models, there is no single silver bullet
Layered models, enables DSLs & interop
With runtimes: Interaction & interop within processes
Resource management (processors, memory, I/O)
With tools: Build systems, analysis tools, design tools
Debuggers, profilers, etc.
Portability
Hardware, OS & vendor independence
Standard, portable models live longer
Investment in software is very expensive
Re-writing is often simply not an option
Even seemingly small changes can be extraordinarily expensive
○ Testing & validation costs
○ Architectural implications
Composability
Real software is large and complex
Built by many people
Built out of components
Subject to intricate system level behaviours
Programming models must facilitate and support these aspects
Factors To Address…
Concurrent Execution
Multiple levels to achieve performance
Vectorization (SIMD & throughput optimization)
Parallelization (multi/many-core)
Distributed (cluster-level)
Internet (loosely coupled, client/server, cloud services)
Programming model needs to express each level
Each level brings >10x potential
Cannot afford a different decomposition at each level (see the sketch below)
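A minimal sketch of expressing one decomposition at two of these levels, using the OpenMP constructs already common in this space (threads across cores, SIMD lanes within a core); a cluster-level decomposition would partition n across MPI ranks the same way:

#include <cstddef>

// Same loop, decomposed across threads (parallel for) and SIMD lanes (simd).
void saxpy(float a, const float* x, float* y, std::ptrdiff_t n) {
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}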
Data Organization & Access
FLOPS are cheap, bandwidth is not
Severe and worsening imbalance
No sign of this changing
Optimizing data access is usually key to achieving performance & power
Existing models do very little to address this
Access patterns usually implicit, layouts explicit
Changing data layout requires changing code
Different hardware, algorithms & models demand different layouts (see the AoS/SoA sketch below)
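The layout point can be made concrete with the classic array-of-structures versus structure-of-arrays choice; a minimal C++ sketch with hypothetical particle fields:

#include <vector>

// Array-of-structures (AoS): natural to write, accessed as particles[i].x.
struct ParticleAoS { float x, y, z, mass; };
// std::vector<ParticleAoS> particles;

// Structure-of-arrays (SoA): each field is contiguous, which vector units and
// GPUs can stream, but switching requires touching every access site --
// exactly the "changing data layout requires changing code" problem above.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;    // accessed as particles.x[i]
};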
Specialization
Hardware is diversifying
Heterogeneous processors (CPU, GPU, etc)
Fixed function hardware
System-on-chip
Driven by power & performance
Tight integration needed for fine-grained interactions & data
NYSE TAQ record counts
An elegant weapon… for a more civilized age
Trade lifecycle
• Strategy Development & Testing
• Strategy Deployment & Management
• Data Storage & Analysis
• Post Trade Analysis & Compliance
- Trade History
- Exchange latency
- Historical DB
- Data Capture
- Data Publishing
- Analytics
- Back Testing
- Optimization
- Research Systems
- Matching
- Execution
- Risk Management
Follow the data
!"#$%&'()*&+,(-."/&001+2!"#$!"%##$&'(#$)'*'$+,-".'/
!"#$%$&'("'#)*&*+('%#,-#.,%)/0("'#1*2$+#,"#*"#34$"0#5&(4$"#!&.6(0$.0/&$#735!8#06*0#
,--$&2#06$#*1(9(0:#0,#*"*9:;$#$<0&$%$9:#9*&'$#*%,/"02#,-#$4$"0#7*"+#,06$&8#+*0*#-&,%#
+(2)*&*0$#2,/&.$2#=(06#4$&:#9,=#9*0$".:#*"+#6('6#06&,/'6)/0
34 0 1234'(/$12(5,6#(*,'7$0 )#8#39#%$:;$<=:=
>*%#'3,(-$)'*'$?('7/*,8@A12347#B$CD#(*$E%28#@@,(-F
G(HI#32%/$?('7/*,8@AG(HI#32%/$)'*'9'@#F
I'@@$+,@*2%,8$6'*'$'('7/*,8@A127J3('%$)'*'$>*2%#F
>*%#'3
6'*'
K4@$6'*'
>J9HI,77,@#82(6
I,(J*#L+2J%
>#82(6
Different cores for different chores
[Chart: FPGAs, GPUs, Many Cores, and Low-Power SOCs plotted from Application-Specific to Broad Applicability against Ease of Programming.]
Don’t fight the last war
[Timeline, 1990-2020, marking the Era of the CLOCK WARS, the Era of the CORE WARS, and the Era of the EFFICIENCY WARS.]
Softer HW or harder SW?
TELAS 2™ Card
Product Brief

An ultra-low-latency trading system on a card
TELAS 2™ (Trading Engine Latency-reduction and Acceleration System) is a revolutionary new product from Accensus. It combines flexible 1/10Gbps network connectivity with a powerful in-line Altera® Stratix® IV FPGA and the new Tilera TILE-Gx8036™ 36-core processor running Linux, along with up to 24GB (8GB FPGA, 16GB processor) of DDR3 RAM, on a single PCI Express card.

The TELAS 2™ architecture is designed from the outset to be ultra-low-latency, allowing data to be received from the wire at up to 40Gbps, optionally processed in-line by the FPGA, then delivered to the processor where it is handled by applications written in C/C++ running on an optimized Linux OS. The data can remain on the card for processing or be transferred to the host if required. The transmit path is simply the reverse. This architecture gives application developers complete freedom to implement ultra-low-latency applications in C/C++, in VHDL/Verilog, or split across both. User applications may reside entirely on the card or be split across the card and the host server.

TELAS 2™ is built to be installed in commodity servers. It provides a significantly lower-latency network path to and from the card CPU than that of the server, and supplements server capacity with significant additional compute and memory resources, extending existing rack space and power utilisation.

PCI Express Card comprising:
• 4 SFP/SFP+ cages providing flexible 1/10Gbps connectivity
• Tilera TILE-Gx8036™ Processor (64-bit, 36 cores, 1.2GHz) with two dedicated 1600MHz DDR3 SODIMMs
• Altera® Stratix® IV FPGA (531,200 equivalent LEs) with one dedicated DDR3 SODIMM
• High-precision oven-compensated quartz oscillator (OCXO) providing ultra-accurate timestamping capability
• Generation 2 (5Gbps) x8 PCI Express bus providing 40Gbps between card and host

Key Features:
• Optimised ultra-low-latency path between the “wire” and the CPU
• FPGA sits inline between the card’s network interfaces and the CPU, allowing it to process/manipulate raw frames in-line with the CPU’s network interfaces
• FPGA and CPU are also connected by a dedicated 20Gbps PCI Express bus allowing efficient DMA transfer between them
• Low power footprint of <75W allowing multiple cards to be used in a single server if required
• Tilera’s Multicore Development Environment™ (MDE) included with Accensus user-space libraries

Accensus also offers a subscription service to a range of Exchange Line Handlers designed and implemented to run on the TELAS 2™ card. These are implemented on the FPGA, leaving all CPU cores available for application development. Each Line Handler provides full packet-to-message decode, line arbitration, and transformation to optimised C structs. Also planned for early 2012 are a user-space TCP/IP stack with FPGA offload optimised for order management systems and advanced time-stamping functionality. Please contact [email protected] for further information.

[Block diagram: 4x SFP/SFP+ ports, FPGA with one DDR3 SODIMM, CPU with two DDR3 SODIMMs, OCXO, PCI Express interface.]

© 2011 Accensus LLC, 200 South Wacker Drive, Chicago, IL 60606, USA
Data reduction
target NIC. For verification, we have used the NEC Japan stock information (see Section II) as events. The date of each event is replaced by a sequence number in order to continuously input events at the cycle level. As shown in Fig. 10, the logic successfully detects five change points, and it achieves 20Gbps event processing performance; the clock frequency is 156MHz and the data width is 128 bits (i.e., 156MHz x 128 bits = 19,968Mbps). It should be noted that it is difficult for SQL-based systems to detect such change points without having support for procedural languages.
[Figure 10. Implementation of our motivating example: the NIC feeds price data (yen) into smoothing & change-point analysis, and change points are flagged against the smoothed series.]
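The excerpt does not spell out the smoothing or change-point algorithm. Purely as an illustration of the kind of computation Figure 10 refers to, an exponential smoother with a deviation threshold might look like this in C++ (alpha and threshold are made-up parameters, not the paper's):

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: smooth a price series and flag indices where the raw
// value departs from the smoothed value by more than `threshold` yen.
std::vector<std::size_t> change_points(const std::vector<double>& price,
                                       double alpha, double threshold) {
    std::vector<std::size_t> points;
    double smoothed = price.empty() ? 0.0 : price.front();
    for (std::size_t i = 0; i < price.size(); ++i) {
        smoothed = alpha * price[i] + (1.0 - alpha) * smoothed;   // EWMA smoothing
        if (std::fabs(price[i] - smoothed) > threshold)
            points.push_back(i);
    }
    return points;
}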
B. Logic Usage
We have evaluated how much the above implementation increases slice logic utilization of our NIC FPGA. Since the number of occupied slices increases only 2.7% (see Table II), the entire logic, including our event processing adapter, can be efficiently implemented.
TABLE II. LOGIC USAGE IN OUR MOTIVATING EXAMPLE
Item                        Available   Increase
Number of slice registers   207,360     +2,492 (1.2%)
Number of slice LUTs        207,360     +4,290 (2.1%)
Number of occupied slices   51,840      +1,378 (2.7%)
VI. CONCLUSION
The requirements for fast complex event processing will necessitate hardware acceleration using reconfigurable devices. Key to the success of our work is the logic automatically generated from our C-based event language. With this language, we have achieved both higher event processing performance and higher flexibility for application designs than those of SQL-based CEP systems. We have demonstrated the world’s fastest (20Gbps) event processing performance in a financial trading application on an FPGA-based NIC. In future work, we intend to employ partial reconfiguration for run-time function replacement.
ACKNOWLEDGMENTS
We thank T. Iihoshi, K. Ichino, O. Itoku, S. Kamiya, S. Karino, S. Morioka, A. Motoki, M. Nishihara, M. Petersen, H. Tagato, M. Tani, A. Tsuji, and N. Yamagaki for their great contributions.
[Photo labels: Air Vents, Console Port, Management Port, USB Port, 16 Base SFP/SFP+ Ports, 8 FX SFP/SFP+ Ports, Clock Input.]
somedata = (1,2,3); otherdata = [1,2,3]; dict = [`a=1, `b=2];
// Note these are different types, list vs. vector.
type area = `FX | `Equities Int | `FixedIncome Double Double
results@node0 with f = #(id :: symbol; profit :: double)
f = {(id, area, pnl)
    var profit = area ? `FX : pnl*.98
                      | `Equities x : pnl-x
                      | `FixedIncome x y : pnl*(x-y);
    insert (id, profit) into results }
jobs = select id, area, parameters from strategies where date==today()
simulate = {(job)
    var pnl = sum(random * 1..10);
    if (pnl > 100) {send (job.id, job.area, pnl) to results; (`ok, pnl)}
    else (`fail, pnl) }
job_status = @[select distinct processors from places] { <-[(x){begin simulate(x)} each jobs] }
failed_jobs = select (status, pnl) from job_status where status==`fail
// run some code in place A; block until it’s done
@A {code}
// start an activity f in place B and return immediately
@B begin {f}
// run some code in place C, taking ownership of data
@c with data {…}   // bind data to place
// distribute data over place1 and place2
@[place1, place2] data;
// redundant copies of data in place1 and place2; also works for redundant computation
@[place1], [place2] data;
// Run f in the fastest place we can
var c = select core-id from processors where max frequency
@c {f(`somedata)}
// Queue work in parent
@parent {code}
// Reply
@reply {code}
Open Problems
• Machines are already beyond our ability to program productively with high performance
• It’s getting harder to observe, understand, debug & tune our broken programs/machines
• Where is the inconsistency coming from?
• What implicit effects are we suffering from?
• How do we cope with increasing diversity?
• What do we need to give up to get some help?
Hot topics in parallelism in data management
Goetz Graefe Hewlett-Packard Laboratories
Palo Alto, Cal. – Madison, Wis.
History
• Concurrency among independent transactions
  Each transaction single-threaded; 1960s, 1970s, …
• Parallel query processing (within a transaction)
  Teradata 1983-84: specialized hardware
  Gamma 1984-88: off-the-shelf hardware
  Pipelines for algebraic execution
  Partitioning intermediate results
Transactions
ACID = atomicity, consistency, isolation, durability
• User transactions
  Database contents: queries & updates
  Locks held to transaction commit
  Rollback using recovery log
• System transactions
  Database representation changes, e.g., B-tree node split
  In-memory data structures, “latches”
Two types of transactions
                       User transactions             System transactions
Invocation source      User request                  System-internal
Database effects       Logical database contents     Physical database representation
Data location          Database or buffer pool       In-memory page images
Invocation overhead    New thread                    Same thread
Locks                  Acquire & retain              Test for conflicts
Commit overhead        Force log to stable storage   No forcing
Logging                Full “redo” & “undo”          “Redo” only, usually
Failure recovery       Rollback                      Completion
Hardware opportunity   Non-volatile memory           Transactional memory
Two types of concurrency control
              Locks                                  Latches
Separate…     User transactions                      Threads
Protect…      Database contents                      In-memory data structures
During…       Entire transactions                    Critical sections
Modes         Shared, exclusive, update,             Shared, exclusive
              intention, escrow, schema
Deadlock…     Detection & resolution                 Avoidance
… by…         Waits-for graph analysis, timeout,     Coding discipline, lock leveling
              transaction abort, partial rollback,
              lock de-escalation
Kept in…      Lock manager’s hash table              Protected data structure
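A minimal sketch of the distinction in code (illustrative only, not any particular DBMS's API): a latch is a short critical section around an in-memory structure such as a B-tree node, while a lock is a logical claim registered with a lock manager and retained until the user transaction commits.

#include <mutex>
#include <shared_mutex>

// Latch: protects the in-memory page image only for the duration of the
// structural change (e.g., a B-tree node split), then is released.
struct BTreeNode {
    std::shared_mutex latch;
    // ... keys, children ...
};

void split_node(BTreeNode& node) {
    std::unique_lock guard(node.latch);   // critical section only
    // perform the split, logged as a system transaction ...
}                                         // latch released here

// Lock: a logical claim on database content, acquired through a lock manager
// and held until the user transaction commits (hypothetical API, sketch only):
//   lock_manager.acquire(txn, key, LockMode::Exclusive);
//   ... transaction body ...
//   lock_manager.release_all(txn);       // at commit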
Current trends and challenges
• Scalability
  Query processing versus map-reduce (Hadoop etc.)
  Data mining, business intelligence, analytics
  Utilities (load, reorganization, …)
• Implementation techniques
  Low-level synchronization
  Transactional memory
  Non-volatile memory
  Other novel hardware
Me: Russell Williams. My product: Photoshop
• Huge cross-platform code base on a single-threaded framework
• Parallel computation since the mid-90s using basic parallel_for (see the sketch after this list)
• Scaling falls off beyond 4 cores for many operations.
• Must trade off throughput for latency
• Proliferation of thread pools
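The “basic parallel_for” pattern referred to above typically slices an image into rows or tiles; a minimal TBB-based sketch (illustrative only, not Photoshop's actual framework):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Illustrative tiled parallel_for over image rows.
void brighten(unsigned char* pixels, int width, int height, int delta) {
    tbb::parallel_for(tbb::blocked_range<int>(0, height),
        [=](const tbb::blocked_range<int>& rows) {
            for (int y = rows.begin(); y != rows.end(); ++y)
                for (int x = 0; x < width; ++x) {
                    int v = pixels[y * width + x] + delta;
                    pixels[y * width + x] = v > 255 ? 255 : (unsigned char)v;
                }
        });
}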
Challenges — structure of the problem
• Asynchrony vs. parallel compute
• Available parallelism
  • Amdahl’s law vs. events, views, PCI bus (see the worked example below)
  • On server, parallelize per user. On desktop: one user
  • Bandwidth limited — FLOPS / memory reference
• 80-core chips not coming; software can’t use ‘em.
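The scaling ceiling follows directly from Amdahl’s law. With illustrative numbers (not Photoshop measurements): if a fraction p = 0.9 of an operation parallelizes,

  S(n) = 1 / ((1 - p) + p/n)
  S(4) = 1 / (0.1 + 0.9/4) ≈ 3.1        S(∞) = 1 / 0.1 = 10

so four cores already give about a third of the best possible speedup, and no number of additional cores pushes past 10x.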
Challenges — structure of the solutions
• Heterogeneous environment
  • C machine vs. data parallel, high latency, high throughput
  • Different cache / memory hierarchies
• Rapidly changing hardware landscape
  • Discrete -> Integrated GPU
  • SSE -> AVX -> AVX2 -> AVX3
  • __m128i vDst = _mm_cvttps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(vSum0), vInvArea));
• Variety, rapid evolution, and fragmentation of tools
  • CUDA / DX / OpenCL / C++ AMP, vectorizing compilers
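For context, the intrinsic line above is the sort of thing found inside a box-filter inner loop; a minimal sketch of assumed surrounding code (not Photoshop source, and whether the conversion rounds, _mm_cvtps_epi32, or truncates, _mm_cvttps_epi32, is not clear from the slide):

#include <emmintrin.h>   // SSE2 intrinsics

// Assumed context: four box-filter sums are scaled by the reciprocal of the
// box area and converted back to 32-bit integers.
__m128i average4(__m128i vSum0, float invArea) {
    __m128  vInvArea = _mm_set1_ps(invArea);
    __m128i vDst = _mm_cvttps_epi32(
        _mm_mul_ps(_mm_cvtepi32_ps(vSum0), vInvArea));
    return vDst;
}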
Desktop GFlops (8-core 3.5GHz Sandy Bridge + AMD 6950)
[Bar chart, 0-3000 GFlops, for Straight C++, Intrinsics / Auto-vec, TBB / GCD, and OpenGL / OpenCL / C++ AMP, broken down into GPU, CPU Vector Unit, Multi-thread, and Scalar GFlops.]