Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with the World’s First...

Post on 24-Jun-2015


DESCRIPTION

This presentation is from the 2012 Strata Conference and looks at the synergies of column storage and map-reduce. The presentation, titled “Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with The World’s First Searchable Genotype Database,” was given by Fernanda Foertter, HPC Scientific Programmer at Genus plc, and Jim Tommaney, CTO at Calpont, in March 2012. They discussed how the team at Genus found an innovative way to store and access the huge volumes of data generated when modeling genotypes. The presentation also covered the benefits of column storage and how InfiniDB’s built-in map-reduce enables high-performance Big Data analytics. A copy of the presentation is on YouTube at: http://www.youtube.com/watch?v=m55CeVYCTSk&feature=player_detailpage

TRANSCRIPT

Calpont InfiniDB® Accelerating Data Insights

Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with The World’s First Searchable Genotype Database Strata Conference 2012

Calpont Proprietary and Confidential

InfiniDB® Scalable. Fast. Simple. Copyright © 2012 Calpont. All Rights Reserved.

Today’s Agenda

• Introduction of today’s speakers
• What is InfiniDB?
• Announced today: InfiniDB 3
• Huge Data Analytics: InfiniDB Empowers New Research with The World’s First Searchable Genotype Database
• Questions
• More information and resources

Today’s Presenters

• Fernanda Foertter, HPC Administrator / Scientific Programmer, Genus plc
• Jim Tommaney, Chief Technology Officer, Calpont Corporation

What is InfiniDB?


Calpont Corporation

• Company
  o Privately held and backed
  o Offices: Dallas (Headquarters), Silicon Valley
• Business
  o Scale-out MPP analytic database
  o MySQL Columnar + Map Reduction
  o Commercial Open Core model
• Products
  o InfiniDB Enterprise: Forthcoming 4th major release
  o InfiniDB Community: Modified Open Source license

Calpont Mission: To provide a highly scalable data platform that enables analytic business decisions as timely as customers and markets dictate.

Innovative Companies Turning to InfiniDB


What is InfiniDB?

InfiniDB®: Scalable. Fast. Simple.

What is InfiniDB?

Full-Featured SQL

Familiar MySQL Look and Feel

Big Data Analytics Engine

Game Changing Performance


Focus on Analytics Workloads

InfiniDB is …
• Engineered for large queries
• Engineered for ad-hoc flexibility
• Analytics, not OLTP
• Unique combination of columnar + map-reduce


InfiniDB – Two Tier Architecture

Purpose built for big data analytics.
• User Module (UM): Understands SQL.
• Performance Module (PM): Operates on data blocks.

or … Single Server

InfiniDB Performance Foundations

The Power and Scale of Map-Reduce plus Transformational I/O Efficiency

Power and Scalability of Map-Reduce

SQL Operations are mapped to Performance Module threads
• Parallel/Distributed Data Access
• Parallel/Distributed Joins (Inner, Outer)
• Parallel/Distributed Sub-queries (From, Where, Select)
• Parallel/Distributed Group By, Distinct, and Aggregation
• Extensible with Parallel/Distributed User Defined Functions

Results are returned to User Module in Reduce Phase

Map ↓↓↓↓↓ Reduce ↑↑↑↑↑
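The map and reduce phases described on this slide can be pictured with a small Python sketch of a distributed GROUP BY with SUM. It is only an illustration of the general pattern, not InfiniDB code: the `partitions` data, the function names, and the use of a thread pool to stand in for Performance Module threads are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows held by one
# Performance Module (PM). Rows are (group_key, value) pairs.
partitions = [
    [("A", 3), ("B", 1)],
    [("A", 2), ("C", 5)],
    [("B", 4), ("C", 1)],
]

def map_partial_sum(rows):
    """Map phase: a worker aggregates only the rows of its own partition."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial

def reduce_partials(partials):
    """Reduce phase: the coordinator (the UM role) merges partial sums."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)

def group_by_sum(parts):
    """Run the map phase in parallel, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(map_partial_sum, parts))
    return reduce_partials(partials)

result = group_by_sum(partitions)
print(result)
```

Because each worker returns a small partial aggregate rather than raw rows, the reduce step only has to merge one entry per group per partition.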


Power and Scalability of Map-Reduce

InfiniDB is not: a Hadoop-style map-reduce framework.

InfiniDB is: a custom-built and highly optimized map-reduce framework for queries.

Transformational I/O Efficiency

Techniques to Avoid Unnecessary I/O
o Vertical Partitioning: read only the columns required
o Horizontal Partitioning: focus on the rows required
o Just-in-time materialization
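The three I/O-avoidance techniques named on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not InfiniDB's storage engine: the extent size, the per-extent min/max bookkeeping, and the `table`, `build_column`, and `select_where_equal` names are all hypothetical.

```python
# Hypothetical column store: every column is split into fixed-size extents,
# and each extent keeps the min and max of the values it holds.
EXTENT_ROWS = 4  # illustrative size only

def build_column(values):
    """Split one column into extents annotated with min/max."""
    extents = []
    for start in range(0, len(values), EXTENT_ROWS):
        chunk = values[start:start + EXTENT_ROWS]
        extents.append({"start": start, "min": min(chunk),
                        "max": max(chunk), "values": chunk})
    return extents

table = {
    "batch":   build_column([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]),
    "reading": build_column([10, 12, 15, 18, 20, 25, 30, 40, 50, 55, 60, 64]),
    "comment": build_column(["-"] * 12),  # never touched by the query below
}

def select_where_equal(table, filter_col, target, output_col):
    """Return output_col values for rows where filter_col == target."""
    extents_scanned = 0
    matching_rows = []
    # Vertical partitioning: only the filter column is read here.
    for extent in table[filter_col]:
        # Horizontal partitioning: skip extents whose range cannot match.
        if not (extent["min"] <= target <= extent["max"]):
            continue
        extents_scanned += 1
        for offset, value in enumerate(extent["values"]):
            if value == target:
                matching_rows.append(extent["start"] + offset)
    # Just-in-time materialization: the output column is touched only
    # for the rows that survived the filter.
    result = []
    for row in matching_rows:
        extent = table[output_col][row // EXTENT_ROWS]
        result.append(extent["values"][row % EXTENT_ROWS])
    return result, extents_scanned

values, scanned = select_where_equal(table, "batch", 5, "reading")
print(values, scanned)
```

In this toy query only one of the three `batch` extents is scanned, and the `comment` column is never read at all.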

Transformational I/O Efficiency

Techniques for Efficient I/O
o Columnar compression reduces I/O from disk
o Global data buffer cache can reduce disk I/O
o Real-time decompression accelerates reads from disk
o Avoidance of Random I/O
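To see why storing a column contiguously tends to compress well, here is a minimal run-length encoding sketch in Python. The slides do not say which compression algorithm InfiniDB uses, so run-length encoding is only a stand-in chosen for clarity, and the sample `column` values are made up.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Hypothetical column of small integer codes with long runs of repeats.
column = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 0]

encoded = rle_encode(column)
print(encoded)
print(rle_decode(encoded) == column)
```

Values of one column share a type and often repeat, so fewer bytes have to come off disk; decoding is a cheap in-memory step compared with the read it saves.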

Simple - Automatic Everything

• Vertical Partitioning
• Horizontal Partitioning
• Compression
• Compression Algorithm Selection
• Distribution of data across disk resources
• Distribution of work across CPU resources

InfiniDB 3 Announced Today

InfiniDB 3: It is Now Possible...


InfiniDB 3


Where I Work


Breeding Values

Genetic Evaluation


Phenotype: Meat Quality


Selection for Lean Growth

1980 2005


Halothane Gene (1991)

• Gene is associated with:
  o High carcass yield
  o Stress triggers hyperthermia
  o Poor meat quality

X (Nn/nn)

(NN)

DNA Marker Use

[Timeline: 1990 to 2009]

• 1991: HAL
• 1994: ESR
• 1998: RN & MC4R
• 1999: FUT1 & PRKAG3
• 2003: MIS
• 2004: Large-scale SNP discovery

• 1991 - 2002: Single genes, QTL, candidate genes
• Large-scale SNP discovery, genome scans, sequencing

Sudden Data Growth

Porcine SNP Panel Density

[Chart: Number of SNPs (axis 0 to 70,000) by year, 2004 to 2009]

Sudden Data Growth

Sample Collection

[Chart: Animals (cumulative), axis 0 to 16,000,000, by Year, 1991 to 2011]

[Chart: Tissue (cumulative), axis 0 to 3,500,000, by Year, 1991 to 2011]

Genetic Evaluation

EBV (estimated breeding value) for each trait: Lean Yield, Meat Quality, Robustness, Feed efficiency, etc.

Index = a1 × EBV1 + a2 × EBV2 + . . . (a1, a2, … are the economic weights)
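The index on this slide is a weighted sum of per-trait breeding values, which can be worked through in a short Python sketch. The trait names follow the slide, but every number below is invented purely for illustration, and the `selection_index` function name is an assumption.

```python
def selection_index(ebvs, weights):
    """Index = a1 * EBV1 + a2 * EBV2 + ... over the traits in both dicts."""
    if ebvs.keys() != weights.keys():
        raise ValueError("every trait needs both an EBV and an economic weight")
    return sum(weights[trait] * ebvs[trait] for trait in ebvs)

# Hypothetical EBVs for one animal (illustrative values only).
ebvs = {
    "lean_yield": 2.0,
    "meat_quality": 1.0,
    "robustness": 0.5,
    "feed_efficiency": -1.0,
}

# Hypothetical economic weights a1, a2, ... (illustrative values only).
weights = {
    "lean_yield": 3.0,
    "meat_quality": 2.0,
    "robustness": 4.0,
    "feed_efficiency": 1.5,
}

index = selection_index(ebvs, weights)
print(index)  # 3.0*2.0 + 2.0*1.0 + 4.0*0.5 + 1.5*(-1.0) = 8.5
```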

Data Pipeline


Genomic Data Deluge


Project: Genotyping DB

The Need
• Accumulating SNP chip data
• Difficulty searching through
• Next Gen Sequencing
• Cheaper SNP chips
• LOTS of animals
• Other projects needed the data

Other Considerations
• Store large data…BIG data
• Scalable
• Alternative to Oracle
• Minimally impact infrastructure
• Easy for scientists to use

What Do Vendors Provide for Genotype Data?

nothing


Think Outside the (Vendor’s) Box…


All Databases are Not Created Equal


All Vehicles are Not Created Equal


Genomic Data


SNP Data

Animal ID   SNP1   SNP2   SNP3   …   SNP65K
1           0      1      2      1   2
2           1      1      0      0   0
3
4
5           1      2      2      0   2
XXXX

Single Research Cohort

What about selection and cohort comparisons?


Column Bases Make More Sense
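A small Python sketch can show why a column layout suits cohort questions over wide SNP data. The genotype values reuse the SNP1 to SNP3 entries for animals 1, 2 and 5 from the SNP Data slide; the dictionary layout and the `genotype_counts` helper are assumptions made for illustration, not the schema used at Genus.

```python
from collections import Counter

# Genotype calls coded 0/1/2, stored one list per SNP (column-wise).
# Position i in every list belongs to animal_ids[i].
animal_ids = [1, 2, 5]
snp_columns = {
    "SNP1": [0, 1, 1],
    "SNP2": [1, 1, 2],
    "SNP3": [2, 0, 2],
}

def genotype_counts(snp, cohort):
    """Count genotype codes at one SNP for the animals in a cohort.

    Only the single requested SNP column is read, no matter how many
    other SNP columns the table holds.
    """
    column = snp_columns[snp]
    return Counter(
        call for animal, call in zip(animal_ids, column) if animal in cohort
    )

cohort_a = {1, 5}
cohort_b = {1, 2, 5}
print(genotype_counts("SNP3", cohort_a))
print(genotype_counts("SNP1", cohort_b))
```

With a row layout, the same question would pull every SNP value for each animal in the cohort before discarding all but one of them.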


InfiniDB: Parallel Columnar DB


Complicated Searches are Faster!


Scales for a Fraction of the Cost

• Compression: Up 75%
• Speed vs RDBMS: 15X faster
• Scalability: 100s of TB, parallel queries/ingest
• Cost vs Oracle: 25%

Future Projects: Imputation

$150 $150

$15 $15


Caution: Data multiplies in a BIG way


Conclusions

• Helps to have a deep understanding of the scientific problems being solved
• Have a good understanding of the data access pattern
• Tool should solve 80% of the highest use patterns
• Use combination of software, hardware knowledge to improve performance
• Think “out of the vendor box”, especially where research is cutting edge
• Take the lead to show new tools users may not even be aware they want/need

Questions


More Information on InfiniDB

Visit us at:
o www.Calpont.com
o www.InfiniDB.org
o Visit Booth #414 to register to win an iPad 3