Real-time Analytics and Data Ingestion on TBs of Data using PostgreSQL
Utku Azman, Director – R&D
The Problem
• Processing vast amounts of data for insights
• Providing human real-time interaction
• Keeping up with high-velocity data
• Managing complexity & cost
Every 60 seconds: [infographic]
Solution?
• Fast analytics
• Real-time data
• Scalability / high availability
Solution – Approach 1
Integrating multiple database technologies
• Fast analytics
• Real-time data
• Scalability / high availability
[Diagram labels: Data; (Real-time) and (Offline); Analytics and Operations; DWH; DWH-on-Hadoop; Pre-aggregates; Production SQL; Production NoSQL]
Complex & expensive
Solution – Approach 2
Unified analytics/operations database that scales
• Fast analytics
• Real-time data
• Scalability / high availability
AND comes with a community-driven and open ecosystem
Why PostgreSQL?
“By 2018, more than 70% of new in-house applications will be developed on an Open Source DBMS” (Gartner)
[Chart: PostgreSQL Rising – Which DB do you use? (Hacker News). Share of responses (0% to 50%) in 2010 and 2014 for PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Cassandra]
Source: Gartner: State of Open Source RDBMS, 2015; Hacker News
Who are we?
• Citus Data, based in San Francisco since 2011
• Built CitusDB – Scalable PostgreSQL
• Open sourced columnar storage and sharding extensions for PostgreSQL
Result: Real-time Big Data on PostgreSQL
• Analyze billions of events
• Apply hundreds of filters on the fly
• Get responses in < seconds
• Serve millions of end-users
• Update millions of records in minutes
With the simplicity of maintaining ONE database
Our Approach
• Fast analytics
• Real-time data
• Scalability / high availability
Extending PostgreSQL
• Always benefit from latest advancements
• Support all datatypes, extensions, tools
• Leverage community and ecosystem
How?
• PostgreSQL internals: utilize hooks
• Data types: use foreign data wrappers
• Community, features: sync with every major release
Familiar, Extensible, Rich
[Architecture diagram]
Data sources: app servers (clickstream; events, transactions), machine generated data, other (traditional) data sources
CitusDB (Scalable PostgreSQL): data storage and retrieval; PG tools, connectors
SQL (ODBC / JDBC): real-time analytics (e.g. Tableau, custom)
Flexibility and familiarity of PostgreSQL:
- Data types
- Storage formats
- Extensions
- Connectors, tools, documentation, more
Our Approach
• Fast analytics
• Real-time data
• Scalability / high availability
1. Massive Parallelization
• Massively parallelized queries
• Multi-threaded processing
• Push compute to data
[Diagram: the CitusDB master receives a PostgreSQL query on the Events table and sends one PostgreSQL query per shard (E1, E2, E3) to the workers. CitusDB worker 1 holds E1 and E3'; CitusDB worker 2 holds E2 and E1'; CitusDB worker N holds E3 and E2'.]
Avoiding the pull-data-to-master approach
SELECT avg(price), max(price)
FROM   items
WHERE  quantity > 10
[Diagram: machines #1, #2 … N send row data to the master (pull data), which causes an I/O bottleneck and heavy compute on the master.]
Instead: pushing compute to data
SELECT avg(price), max(price)
FROM   items
WHERE  quantity > 10
Push compute: each machine runs
SELECT sum(price), count(*), max(price)
FROM   items
WHERE  quantity > 10
[Diagram: machines #1, #2 … N each return sum, count, and max to the master, which combines them]
avg = Σ sum_i / Σ count_i
max = max({max_1 ... max_N})
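The push-compute rewrite can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CitusDB's implementation: the in-memory `shards` lists stand in for the items table on each machine, and the names `Partial`, `worker_partial`, and `master_combine` are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class Partial(NamedTuple):
    """Per-worker partial aggregate: sum(price), count(*), max(price)."""
    total: float
    count: int
    maximum: Optional[float]


def worker_partial(rows: Iterable[dict]) -> Partial:
    """Runs next to the data: filter locally and return only three numbers."""
    prices = [r["price"] for r in rows if r["quantity"] > 10]
    return Partial(sum(prices), len(prices), max(prices) if prices else None)


def master_combine(partials: Iterable[Partial]) -> tuple:
    """Runs on the master: avg = sum of sums / sum of counts, max = max of maxes."""
    partials = list(partials)
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    maxima = [p.maximum for p in partials if p.maximum is not None]
    avg = total / count if count else None
    return avg, (max(maxima) if maxima else None)


# Hypothetical data: the items table split over three machines.
shards = [
    [{"price": 10.0, "quantity": 20}, {"price": 30.0, "quantity": 5}],
    [{"price": 20.0, "quantity": 11}],
    [{"price": 60.0, "quantity": 50}, {"price": 5.0, "quantity": 1}],
]
avg_price, max_price = master_combine(worker_partial(s) for s in shards)
```

Only three values per machine cross the network, regardless of how many rows match the filter, which is the point of avoiding the pull-data approach.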
2. Columnar Analytics
• Columnar projections (read only relevant columns)
• Skip indexes (skip over irrelevant rows)
• PostgreSQL integration (statistics, native formats)
• Compression (more data fits in memory)
With row storage (PostgreSQL)
• Read 700 columns instead of 5
• >39 GB of unnecessary I/O
Input type   Estimated input rate   Cost to query performance
Memory       10 GB/s                3.9 seconds
SSD          600 MB/s               >60 seconds
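The cost figures on this slide follow from dividing the unnecessary I/O by the input rate. The short sketch below reproduces that arithmetic; it assumes decimal units (1 GB = 10^9 bytes) and treats the slide's ">39 GB" as exactly 39 GB, so the results are lower bounds. The function name `scan_seconds` is hypothetical.

```python
GB = 10**9
MB = 10**6


def scan_seconds(bytes_to_read: float, bytes_per_second: float) -> float:
    """Time to stream a given amount of data at a given input rate."""
    return bytes_to_read / bytes_per_second


# Lower bound on the extra I/O of reading 700 columns instead of 5.
unnecessary_io = 39 * GB

memory_seconds = scan_seconds(unnecessary_io, 10 * GB)  # memory at 10 GB/s
ssd_seconds = scan_seconds(unnecessary_io, 600 * MB)    # SSD at 600 MB/s
```

At memory speed the extra I/O costs about 3.9 seconds; at SSD speed it costs about 65 seconds, consistent with the ">60 seconds" entry.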
Compression with columnar store
[Chart: table sizes normalized to 1.0 for regular, columnar, and columnar w/ compression; ~4x compression]
Bottomline: Fast Analytics
[Chart: CitusDB – Scalable PostgreSQL (Columnar) compared with Impala 2.0.0 and SparkSQL 1.1.0]
PostgreSQL can be faster than Impala, SparkSQL!
Our Approach
• Fast analytics
• Real-time data
• Scalability / high availability
Scalability / High Availability
• Replication
• “Automagic” failure handling
• Dynamic rebalancing/scaling
[Diagram: the master node holds shard and shard placement metadata. Many small data shards are spread over the workers: worker node #1 holds shards 1 3 4 6 7 9 …; worker node #2 holds 1 2 4 5 7 8 …; worker node #N holds 2 3 5 6 8 9 …]
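The shard layout in the diagram is what a round-robin placement with two replicas per shard would produce on three workers. The sketch below shows that scheme as an assumption for illustration; the slides do not state which placement policy CitusDB uses, and the names `place_shards` and `shards_per_worker` are hypothetical.

```python
from collections import defaultdict


def place_shards(shard_ids, workers, replication_factor=2):
    """Assign each shard to `replication_factor` consecutive workers, round-robin."""
    if replication_factor > len(workers):
        raise ValueError("replication factor exceeds number of workers")
    placements = {}
    for index, shard_id in enumerate(shard_ids):
        placements[shard_id] = [
            workers[(index + offset) % len(workers)]
            for offset in range(replication_factor)
        ]
    return placements


def shards_per_worker(placements):
    """Invert the shard -> workers metadata into worker -> sorted shard list."""
    by_worker = defaultdict(list)
    for shard_id, nodes in placements.items():
        for node in nodes:
            by_worker[node].append(shard_id)
    return {node: sorted(ids) for node, ids in by_worker.items()}


# Nine shards on three workers, as in the diagram.
placements = place_shards(range(1, 10), ["worker-1", "worker-2", "worker-3"])
layout = shards_per_worker(placements)
```

Here `placements` plays the role of the shard placement metadata kept on the master, and `layout` gives the per-worker view shown on the slide.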
“Automagically” Handle Failures
SELECT avg(price), max(price)
FROM   items
WHERE  quantity > 10
[Diagram: nodes #1 to #4 each hold fixed-size blocks of data. The legend marks the data queried and, once a node fails, the replicas used for the failing blocks.]
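One way to picture the failure handling is a planner that, for every block, picks a replica on a node that is still healthy. The sketch below is a simplified illustration under that assumption; it covers only replica selection, not mid-query retry, and the `placements` metadata and the name `pick_placements` are hypothetical.

```python
def pick_placements(placements, failed_nodes):
    """For each shard, choose the first replica that is not on a failed node."""
    chosen = {}
    for shard_id, nodes in placements.items():
        healthy = [n for n in nodes if n not in failed_nodes]
        if not healthy:
            raise RuntimeError(f"no healthy replica for shard {shard_id}")
        chosen[shard_id] = healthy[0]
    return chosen


# Hypothetical metadata: six blocks, two replicas each, spread over four nodes.
placements = {
    1: ["node-1", "node-2"],
    2: ["node-2", "node-3"],
    3: ["node-3", "node-4"],
    4: ["node-4", "node-1"],
    5: ["node-1", "node-3"],
    6: ["node-2", "node-4"],
}
normal = pick_placements(placements, failed_nodes=set())
degraded = pick_placements(placements, failed_nodes={"node-1"})
```

With node-1 marked as failed, blocks 1 and 5 are read from their replicas on node-2 and node-3, and the query still covers every block.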
Dynamically Scale Out
[Diagram: shards of 512 MB each. Node #1 holds shards 1 3 4 6 7 9 …; node #2 holds 1 2 4 5 7 8 …; node #3 holds 2 3 5 6 8 9 …; node #4 is shown without shard numbers.]
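Scaling out with many small shards can be illustrated with a greedy rebalancer that moves shard placements from the fullest node to a newly added one. This is a sketch of one possible policy, assumed for illustration only; the slides do not describe CitusDB's rebalancing algorithm, and the name `rebalance` is hypothetical.

```python
def rebalance(layout, new_node):
    """Greedily move placements from the fullest node to the new node until even."""
    layout = {node: list(shards) for node, shards in layout.items()}  # work on a copy
    layout[new_node] = []
    moves = []
    while True:
        donor = max(layout, key=lambda n: len(layout[n]))
        if len(layout[donor]) - len(layout[new_node]) <= 1:
            break
        # Skip shards the new node already holds so replicas stay on distinct nodes.
        candidates = [s for s in layout[donor] if s not in layout[new_node]]
        if not candidates:
            break
        shard = candidates[0]
        layout[donor].remove(shard)
        layout[new_node].append(shard)
        moves.append((shard, donor, new_node))
    return layout, moves


# Shard layout from the diagram, before node #4 takes any shards.
layout = {
    "node-1": [1, 3, 4, 6, 7, 9],
    "node-2": [1, 2, 4, 5, 7, 8],
    "node-3": [2, 3, 5, 6, 8, 9],
}
new_layout, moves = rebalance(layout, "node-4")
```

Because each shard is small (512 MB on the slide), every move in `moves` is a cheap copy, and only a few moves are needed to even out the cluster.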
Bottomline: Scalability…that works
• Mid-query recovery from failures
• Hundreds of nodes, thousands of CPU cores
• Dynamic rebalancing and scaling
• Petabytes of space
Our Approach
• Fast analytics
• Real-time data
• Scalability / high availability
Real-time INSERTS/UPDATES
INSERT INTO customer_reviews ...
Single-shard INSERT, replication factor: 2
[Diagram: the master sends the INSERT to the worker nodes. Worker node #1 holds shards 1 3 4 6 7 9 …; worker node #2 holds 1 2 4 5 7 8 …; worker node #3 holds 2 3 5 6 8 9 …]
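A single-shard INSERT with replication factor 2 can be sketched as: find the one shard that owns the row, then write the row to both placements of that shard. The sketch below assumes hash partitioning on a distribution column, which the slide does not state; the column names, the `storage` dictionary, and the functions `shard_for_key` and `route_insert` are hypothetical.

```python
import zlib


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a distribution-column value to a 1-based shard id via a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % shard_count + 1


def route_insert(row, key_column, placements, storage):
    """Write one row to every placement of the single shard that owns its key."""
    shard_id = shard_for_key(str(row[key_column]), len(placements))
    for node in placements[shard_id]:
        storage.setdefault((node, shard_id), []).append(row)
    return shard_id


# Hypothetical layout: three shards, each placed on two of three workers.
placements = {
    1: ["worker-1", "worker-2"],
    2: ["worker-2", "worker-3"],
    3: ["worker-3", "worker-1"],
}
storage = {}  # (node, shard_id) -> rows, standing in for the worker tables
row = {"customer_id": "C42", "rating": 5}
shard_id = route_insert(row, "customer_id", placements, storage)
```

The row ends up on exactly two workers, and no other shard is touched, which keeps individual writes cheap while analytics queries run over all shards.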
Bottomline: Unified Analytics/Operations
[Chart: Scalable PostgreSQL cluster performance by number of nodes, showing real-time operations (transactions per sec, TPS) and real-time analytics (query completion time, sec) simultaneously]
Putting it all together
We start with…
• PG performance up by 40% with each new release between 7.4 and 9.3
• High performance JSONB introduced in 9.4 (Dec-2014)
• Feature parity with Oracle
• 100’s of developers contributing
• Same day productivity for developers, DBAs, analysts
• Tools, extensions, libraries, forums
…and teach PostgreSQL new tricks
• 100x analytics performance with massive parallelization and columnar analytics
• Scalability & high availability with dynamic horizontal scaling to 100s of nodes
• Real-time insights on very large data with unified analytics/operations
Example Application in Production
• Cloudflare
  – CDN with >5% global internet traffic
  – >100 billion network events processed per day
  – Analytics dashboard serving 2,000,000+ end users
  – Real-time data ingest and sub-second queries across billions of rows for analytics
Scaling PostgreSQL with CitusDB at Cloudflare
https://www.citusdata.com/blog
• Trillions of events
• Billions of 1-minute aggregations
• Real-time data ingestion
• 25ms – 2sec query times
Demo: Real-time Analytics Dashboards
What worked for Cloudflare
• PostgreSQL compatibility
  – Trusted DB
  – Extension mechanisms, multi-structured data
  – Community, documentation
• Performance
  – Parallelization across millions of shards
  – Fast responses to both customer facing & BI queries
• PostgreSQL extensions
  – Hstore: keep sparse data efficiently
  – HLL: fast unique count approximations
• Dynamic scaling
  – Grow cluster as needed
• High availability
  – Real-time recovery from failures
Summary: CitusDB Applications
• Cloudflare example:
  – Real-time analytics
  – Scalable & high availability PostgreSQL
• Other uses in production:
  – More interactive dashboards (e.g. funnel analytics)
  – NoSQL use cases (JSONB, low latency writes)
  – Simplification of complex, multi-tiered DWH + Analytics
Questions
• Email
  – [email protected]
  – [email protected]
• Visit
  – www.citusdata.com
  – www.citusdata.com/blog