huge data analytics: calpont infinidb columnar dbms empowers new research with the world’s first...


Upload: calpont-corporation

Post on 24-Jun-2015


DESCRIPTION

This presentation is from the 2012 Strata Conference and looks at the synergies of column storage and map-reduce. The presentation, titled “Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with The World’s First Searchable Genotype Database,” was given by Fernanda Foertter, HPC Scientific Programmer at Genus plc, and Jim Tommaney, CTO at Calpont, in March 2012. They discussed how the team at Genus found an innovative way to store and access the huge volumes of data generated in modeling genotypes. The presentation also covered the benefits of column storage and how InfiniDB’s built-in map-reduce enables high-performance Big Data analytics. A copy of the presentation is on YouTube at: http://www.youtube.com/watch?v=m55CeVYCTSk&feature=player_detailpage

TRANSCRIPT

Page 1: Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with The World’s First Searchable Genotype Database

Calpont InfiniDB® Accelerating Data Insights

Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with The World’s First Searchable Genotype Database Strata Conference 2012

Calpont Proprietary and Confidential

Page 2:

InfiniDB® Scalable. Fast. Simple. Copyright © 2012 Calpont. All Rights Reserved.

Today’s Agenda

• Introduction of today’s speakers
• What is InfiniDB?
• Announced today: InfiniDB 3
• Huge Data Analytics: InfiniDB Empowers New Research with The World’s First Searchable Genotype Database
• Questions
• More information and resources

Page 3:

Today’s Presenters

• Fernanda Foertter, HPC Administrator / Scientific Programmer, Genus plc
• Jim Tommaney, Chief Technology Officer, Calpont Corporation

Page 4:

What is InfiniDB?

Page 5:

Calpont Corporation

• Company
  o Privately held and backed
  o Offices: Dallas (Headquarters), Silicon Valley
• Business
  o Scale-out MPP analytic database
  o MySQL Columnar + Map Reduction
  o Commercial Open Core model
• Products
  o InfiniDB Enterprise: forthcoming 4th major release
  o InfiniDB Community: modified Open Source license

Calpont Mission: To provide a highly scalable data platform that enables analytic business decisions as timely as customers and markets dictate.

Page 6:

Innovative Companies Turning to InfiniDB

Page 7:

What is InfiniDB?

Scalable. Fast. Simple.

Page 8:

What is InfiniDB?

Full-Featured SQL

Familiar MySQL Look and Feel

Big Data Analytics Engine

Game Changing Performance

Page 9:

Focus on Analytics Workloads

InfiniDB is …
• Engineered for large queries
• Engineered for ad-hoc flexibility
• Analytics, not OLTP
• Unique combination of columnar + map-reduce

Page 10:

What is InfiniDB?

Scalable. Fast. Simple.

Page 11:

InfiniDB – Two Tier Architecture

Purpose built for big data analytics.
• User Module (UM): understands SQL.
• Performance Module (PM): operates on data blocks.

or … Single Server

Page 12:

InfiniDB Performance Foundations

The Power and Scale of Map-Reduce plus Transformational I/O Efficiency

Page 13:

Power and Scalability of Map-Reduce

SQL operations are mapped to Performance Module threads:
• Parallel/Distributed Data Access
• Parallel/Distributed Joins (Inner, Outer)
• Parallel/Distributed Sub-queries (From, Where, Select)
• Parallel/Distributed Group By, Distinct, and Aggregation
• Extensible with Parallel/Distributed User Defined Functions

Results are returned to the User Module in the Reduce phase.

Map ↓↓↓↓↓ Reduce ↑↑↑↑↑
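
As a rough illustration of the map/reduce split described on this slide, the sketch below runs a GROUP BY-style count over several data partitions in parallel worker threads (standing in for Performance Module threads) and merges the partial results in a final step (standing in for the User Module). The function names and the sample rows are hypothetical. This is not InfiniDB code, only a minimal model of the idea.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_partition(rows):
    """Map step: one worker computes a partial GROUP BY count for its partition."""
    partial = Counter()
    for breed, _weight in rows:
        partial[breed] += 1
    return partial


def reduce_partials(partials):
    """Reduce step: merge the per-partition partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def parallel_group_count(partitions, workers=4):
    """Fan the partitions out to worker threads, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_partials(partials)


# Hypothetical sample data: (breed, weight) rows split across three partitions.
partitions = [
    [("A", 101.0), ("B", 98.5)],
    [("A", 110.2), ("A", 95.1)],
    [("B", 102.3), ("C", 99.9)],
]
result = parallel_group_count(partitions)
print(result)
```

Each worker sees only its own partition, so the map step needs no coordination; only the small partial results travel to the reduce step.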

Page 14:

Power and Scalability of Map-Reduce

Map ↓↓↓↓↓ Reduce ↑↑↑↑↑

InfiniDB is not: … a Hadoop-style map-reduce framework.

Page 15:

Power and Scalability of Map-Reduce

Map ↓↓↓↓↓ Reduce ↑↑↑↑↑

InfiniDB is: … a custom-built and highly optimized map-reduce framework for queries.

Page 16:

Transformational I/O Efficiency

Techniques to Avoid Unnecessary I/O
  o Vertical partitioning: read only the columns required
  o Horizontal partitioning: focus on the rows required
  o Just-in-time materialization
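
The three techniques above can be sketched in a few lines. In this toy model each column is stored separately and split into fixed-size slices that carry their own min/max values; a query scans only the filter column, skips slices whose range cannot match, and fetches output columns only for matching row positions. The `Extent` class, the min/max metadata, and the sample table are assumptions made for illustration, not a description of InfiniDB's actual storage format.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """A horizontal slice of one column, with min/max kept as metadata."""
    values: list
    lo: int
    hi: int


def build_column(values, extent_size):
    """Split a column into fixed-size extents and record each extent's range."""
    extents = []
    for start in range(0, len(values), extent_size):
        chunk = values[start:start + extent_size]
        extents.append(Extent(chunk, min(chunk), max(chunk)))
    return extents


def select_where_equal(table, filter_col, target, output_cols, extent_size):
    """Return output_cols for rows where filter_col == target.

    Only the filter column is scanned (vertical partitioning), extents whose
    min/max range excludes the target are skipped (horizontal partitioning),
    and output columns are read only at matching row positions
    (just-in-time materialization).
    """
    matches = []
    extents_read = 0
    for i, extent in enumerate(table[filter_col]):
        if target < extent.lo or target > extent.hi:
            continue  # skip this extent without reading its values
        extents_read += 1
        for offset, value in enumerate(extent.values):
            if value == target:
                matches.append(i * extent_size + offset)
    rows = []
    for pos in matches:
        ext, off = divmod(pos, extent_size)
        rows.append(tuple(table[col][ext].values[off] for col in output_cols))
    return rows, extents_read


# Hypothetical table with three columns and two extents per column.
EXTENT_SIZE = 4
raw = {
    "animal_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "year": [2004, 2004, 2005, 2005, 2008, 2008, 2009, 2009],
    "snp1": [0, 1, 2, 1, 0, 0, 2, 1],
}
table = {name: build_column(vals, EXTENT_SIZE) for name, vals in raw.items()}
rows, extents_read = select_where_equal(table, "year", 2008, ["animal_id"], EXTENT_SIZE)
print(rows, extents_read)
```

In this example the `snp1` column is never read, and the first `year` extent is skipped on its metadata alone.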

Page 17:

Transformational I/O Efficiency

Techniques for Efficient I/O
  o Columnar compression reduces I/O from disk
  o Global data buffer cache can reduce disk I/O
  o Real-time decompression accelerates reads from disk
  o Avoidance of random I/O
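
To see why storing a column contiguously helps compression, the sketch below packs one column of small integer codes into bytes, compresses it, and restores it. Python's standard `zlib` module is used purely as a stand-in, since the slides do not name InfiniDB's compression algorithm, and the sample column is hypothetical.

```python
import zlib


def compress_column(values):
    """Pack a column of integers (0..255) into bytes and compress it with zlib."""
    raw = bytes(values)
    return raw, zlib.compress(raw)


def decompress_column(blob):
    """Decompress a zlib-compressed column back into a list of integers."""
    return list(zlib.decompress(blob))


# Hypothetical column: genotype codes (0, 1, 2) for 10,000 animals at one SNP.
column = [(i * 7) % 3 for i in range(10_000)]
raw, blob = compress_column(column)
restored = decompress_column(blob)
print(len(raw), len(blob))
```

Fewer bytes on disk means fewer bytes to read, provided decompression is cheap relative to the I/O saved. Real genotype columns are less regular than this synthetic one, so actual ratios would differ.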

Page 18:

Simple - Automatic Everything

• Vertical Partitioning
• Horizontal Partitioning
• Compression
• Compression Algorithm Selection
• Distribution of data across disk resources
• Distribution of work across CPU resources

Page 19:

InfiniDB®

Scalable. Fast. Simple.

Page 20:

InfiniDB 3 Announced Today

Page 21:

InfiniDB 3: It is Now Possible...

InfiniDB 3

Page 22:

Today’s Presenters

• Fernanda Foertter, HPC Administrator / Scientific Programmer, Genus plc
• Jim Tommaney, Chief Technology Officer, Calpont Corporation

Page 23:

Where I Work

Page 24:

Breeding Values

Genetic Evaluation

Page 25:

Phenotype: Meat Quality

Page 26:

Selection for Lean Growth

1980 2005

Page 27:

Selection for Lean Growth

1980 2005

Page 28:

Halothane Gene (1991)

• Gene is associated with:
  o High carcass yield
  o Stress triggers hyperthermia
  o Poor meat quality

X (Nn/nn)

(NN)

Page 29:

DNA Marker Use

[Figure: timeline, 1990 to 2009]

• 1991: HAL
• 1994: ESR
• 1998: RN & MC4R
• 1999: FUT1 & PRKAG3
• 2003: MIS
• 2004: Large-scale SNP discovery

Phases:
• 1991 - 2002: Single genes, QTL, candidate genes
• Large-scale SNP discovery, genome scans, sequencing

Page 30:

Sudden Data Growth

[Figure: Porcine SNP Panel Density. Number of SNPs (axis 0 to 70,000) by year, 2004 to 2009.]

Page 31:

Sudden Data Growth

Sample Collection

[Figure: Animals (cumulative) by year, 1991 to 2011; axis 0 to 16,000,000.]

[Figure: Tissue (cumulative) by year, 1991 to 2011; axis 0 to 3,500,000.]

Page 32:

Genetic Evaluation

EBV: Lean Yield, Meat Quality, Robustness, Feed efficiency, etc.

Index = a1 × EBV1 + a2 × EBV2 + . . .

where a1, a2, . . . are economic weights
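
A small worked example of the index formula on this slide: each animal's index is the sum of its estimated breeding values (EBVs), each multiplied by the economic weight for that trait. The trait names, weights, and EBV numbers below are invented for illustration and do not come from the presentation.

```python
def selection_index(ebvs, weights):
    """Index = a1*EBV1 + a2*EBV2 + ... over the traits named in weights."""
    return sum(weights[trait] * ebvs[trait] for trait in weights)


# Hypothetical economic weights (the a's) and per-animal EBVs.
weights = {"lean_yield": 2.0, "meat_quality": 1.5, "feed_efficiency": 3.0}
animals = {
    "A": {"lean_yield": 1.0, "meat_quality": 0.0, "feed_efficiency": 0.5},
    "B": {"lean_yield": 0.5, "meat_quality": 2.0, "feed_efficiency": 0.0},
}

# Rank animals from highest to lowest index.
ranking = sorted(
    animals,
    key=lambda name: selection_index(animals[name], weights),
    reverse=True,
)
print(ranking)
```

With these made-up numbers animal A scores 2.0 + 0.0 + 1.5 = 3.5 and animal B scores 1.0 + 3.0 + 0.0 = 4.0, so B ranks first.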

Page 33:

Data Pipeline

Page 34:

Genomic Data Deluge

Page 35:

Project: Genotyping DB

The Need
• Accumulating SNP chip data
• Difficulty searching through
• Next Gen Sequencing
• Cheaper SNP chips
• LOTS of animals
• Other projects needed the data

Other Considerations
• Store large data…BIG data
• Scalable
• Alternative to Oracle
• Minimally impact infrastructure
• Easy for scientists to use

Page 36:

What Do Vendors Provide for Genotype Data?

nothing

Page 37:

Think Outside the (Vendor’s) Box…

Page 38:

All Databases are Not Created Equal

Page 39:

All Vehicles are Not Created Equal

Page 40:

Genomic Data

Page 41:

SNP Data

Animal ID  SNP1  SNP2  SNP3  …  SNP65K
1          0     1     2     1  2
2          1     1     0     0  0
3
4
5          1     2     2     0  2

XXXX
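
The table above has one row per animal and one column per SNP. A minimal sketch of searching such data column by column follows: each SNP is held as its own list, and a query reads only the SNP columns it names. The values are the three filled-in rows from the slide (animals 1, 2, and 5); the layout and the `find_animals` helper are hypothetical and are not the schema Genus used.

```python
# Genotype codes (0, 1, 2) stored one list per SNP column,
# with positions aligned to animal_ids.
animal_ids = [1, 2, 5]
snp_columns = {
    "SNP1": [0, 1, 1],
    "SNP2": [1, 1, 2],
    "SNP3": [2, 0, 2],
}


def find_animals(conditions):
    """Return IDs of animals matching every (snp_name -> genotype) condition.

    Only the SNP columns named in the conditions are read; all other
    columns are never touched.
    """
    candidates = set(range(len(animal_ids)))
    for snp_name, genotype in conditions.items():
        column = snp_columns[snp_name]
        candidates = {i for i in candidates if column[i] == genotype}
    return [animal_ids[i] for i in sorted(candidates)]


hits = find_animals({"SNP1": 1, "SNP3": 2})
print(hits)
```

On a panel with tens of thousands of SNP columns, a query like this one would still read just two of them.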

Page 42:

Single Research Cohort

What about selection and cohort comparisons?

Page 43:

Column Bases Make More Sense

Page 44:

InfiniDB: Parallel Columnar DB

Page 45:

Complicated Searches are Faster!

Page 46:

Scales for a Fraction of the Cost

• Compression: up to 75%
• Speed vs RDBMS: 15X faster
• Scalability: 100’s TB, parallel queries/ingest
• Cost vs Oracle: 25%

Page 47:

Future Projects: Imputation

$150 $150

$15 $15

Page 48:

Caution: Data multiplies in a BIG way

Page 49:

Conclusions

• Helps to have a deep understanding of the scientific problems being solved
• Have a good understanding of the data access pattern
• Tool should solve 80% of the highest use patterns
• Use combination of software, hardware knowledge to improve performance
• Think “out of the vendor box”, especially where research is cutting edge
• Take the lead to show new tools users may not even be aware they want/need

Page 50:

Questions

Page 51:

More Information on InfiniDB

Visit us at:
  o www.Calpont.com
  o www.InfiniDB.org
  o Visit Booth #414 to register to win an iPad 3

Page 52:

Enter for a Chance to Win an iPad 3