Building Your BI System - HadoopCon Taiwan 2015

Upload: bryan-yang

Post on 18-Feb-2017

Category:

Data & Analytics


TRANSCRIPT

Build your BI System
Practice in data lake ecosystem
Bryan@Vpon Data

Experience
Vpon Data Engineer
TWM, Keywear, Nielsen
Bryan's notes for data analysis
http://bryannotes.blogspot.tw
Spark.TW
LinkedIn
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79

About me

Agenda
User Story
Data Lake
Framework of BI

Deal With Big Data

Small Retailer

MORE COMPLEX and Big

http://www.slideteam.net/technology-powerpoint-templates/mobile-phones.html

3 Kinds of Problems

https://kavyamuthanna.wordpress.com/category/big-data/

Big data brings problems in 3 ways.
Variety: many kinds of data types, data sources, and databases
Volume: log data, transaction data, crawler data
Velocity: real time, near real time, batch

Big Data, Big Problem

http://www.mn.uio.no/ifi/studier/masteroppgaver/nd/masteroppgave_cloud_bigdata_hpc.html

Vpon is a big data advertising company. We receive and produce a large amount of data every day.

Big data, big cost
The cost of data storage
  What data should be kept?
  How long?
The cost of data management
  Are the machines and infrastructure easy to maintain?
  Data flow (ETL)?
The time cost of data processing
  How long can the users wait?
Accessibility of the data
Human costs you cannot see
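The storage questions above ("what to keep" and "how long") can be turned into rough arithmetic. The sketch below is only an illustration: the function name, the daily volume, and the retention periods are hypothetical and are not Vpon's figures. The default replication factor of 3 is the HDFS default.

```python
def stored_terabytes(daily_gb: float, retention_days: int, replication: int = 3) -> float:
    """Raw disk needed, in TB, to keep `retention_days` of data.

    replication=3 mirrors the HDFS default; all other inputs are assumptions.
    """
    return daily_gb * retention_days * replication / 1024


# Hypothetical inputs for comparison only.
one_year = stored_terabytes(daily_gb=100, retention_days=365)
ninety_days = stored_terabytes(daily_gb=100, retention_days=90)
print(f"365 days: {one_year:.1f} TB, 90 days: {ninety_days:.1f} TB")
```

Shortening retention or keeping only aggregated data reduces the first cost on the list, but it may raise the later ones if users still need the raw history.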

A Real Case

So Many AdHoc Queries

Sales

Marketing

Finance

Business

We receive many ad hoc queries every day. Queries come from every department, such as business development, sales, account services, and R&D.

For example, how many users a day, how many requests a day, click rate, etc.

Even a Simple Query
Q: Hi, please tell me how many users we have had since the beginning?
A: select count(1) from log
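A small sketch of why even this query deserves care. It assumes the log holds one row per event and has a user identifier column; the table layout and the `user_id` column are hypothetical, and SQLite stands in for the real cluster. Under that assumption, count(1) returns the number of events, while the number of users needs a distinct count. Both forms read the whole history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical log layout: one row per event.
conn.execute("CREATE TABLE log (user_id TEXT, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [
        ("u1", "request", "2015-09-01"),
        ("u1", "click", "2015-09-01"),
        ("u2", "request", "2015-09-02"),
        ("u1", "request", "2015-09-03"),
    ],
)

# count(1) counts log rows (events); every row of the table is scanned.
total_rows = conn.execute("SELECT count(1) FROM log").fetchone()[0]

# Counting users needs a distinct count over a user identifier.
total_users = conn.execute("SELECT count(DISTINCT user_id) FROM log").fetchone()[0]

print(total_rows, total_users)  # prints: 4 2
```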

https://myreelpov.wordpress.com/2012/12/23/which-story-do-you-prefer-life-of-pi/life-of-pi-2/
Your Life
Boss
Family and Lover
Customers
Data Ocean

Overview
Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. (Wikipedia)

Different Features
                Price       Performance  Accessibility
Hadoop          Low         Medium       Low
SQL Server      Low-Medium  Depends on   Medium
Data Warehouse  High        High         Medium
BI System       High        Depends on   High

http://www.datalytyx.com/big-data-data-lakes/

Why Data Lake

http://thesologuide.com/332/the-seesaw-of-success-when-taking-a-rest-is-best/

Hive
Created at Facebook
Data warehouse in the Hadoop ecosystem
HiveQL (SQL-like interface)
Metastore (stores the schema of the data; schema on read)
UDF
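The "schema on read" idea can be sketched in a few lines. This is a plain Python illustration, not Hive code: the raw text, the column names, and the `read_with_schema` helper are all hypothetical. The point is that the data is stored as-is and the schema, kept separately as in a metastore, is applied only when a query reads the data.

```python
import csv
import io

# Raw data is stored untouched; nothing validates it at write time.
RAW_LOG = "u1\tclick\t2015-09-01\nu2\trequest\t2015-09-02\nu3\tclick\n"

# The "metastore" entry: column names and types kept apart from the data.
SCHEMA = [("user_id", str), ("event", str), ("day", str)]


def read_with_schema(raw, schema):
    """Apply the schema while reading; missing fields become None."""
    for fields in csv.reader(io.StringIO(raw), delimiter="\t"):
        row = {}
        for i, (name, cast) in enumerate(schema):
            row[name] = cast(fields[i]) if i < len(fields) else None
        yield row


rows = list(read_with_schema(RAW_LOG, SCHEMA))
clicks = sum(1 for r in rows if r["event"] == "click")
print(rows[2], clicks)
```

The short third record is still readable: its missing column simply comes back empty, which is roughly how a schema-on-read system treats malformed input instead of rejecting it at load time.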

http://www.stratapps.net/intro-hive.php

http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/

One More Thing

Teradata
Massively Parallel Processing
Each processor handles different threads of the program, and each processor has its own operating disk
Teradata SQL is fully certified for SQL-92
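A minimal sketch of the massively parallel idea: rows are spread across workers by a hash of their key, each worker aggregates only its own slice, and the small partial results are merged. The worker count, the CRC32 hash, and the use of threads are assumptions made for illustration; this is not Teradata's actual hashing or execution engine.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # hypothetical number of processing units


def owner(key: str) -> int:
    """Pick the worker that stores a row, from a hash of its key."""
    return zlib.crc32(key.encode()) % NUM_WORKERS


def local_count(rows):
    """Each worker aggregates only the rows in its own slice."""
    return Counter(event for _, event in rows)


rows = [(f"u{i}", "click" if i % 3 == 0 else "request") for i in range(1000)]

# Distribute rows so each worker owns a disjoint slice of the table.
slices = [[] for _ in range(NUM_WORKERS)]
for row in rows:
    slices[owner(row[0])].append(row)

# Threads stand in for separate processors; merge the partial results.
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    partials = list(pool.map(local_count, slices))
total = sum(partials, Counter())
print(total)
```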

http://www.slideshare.net/alam7/module-02-teradata-basics

https://www.safaribooksonline.com/library/view/teradata-architecture-for

http://www.teradatawiki.net/2013/08/Teradata-AMP.html

Tableau
Visualization tool
Connects to many kinds of databases
VizQL
Tableau Server

http://www.clearpeaks.com/blog/tableau/tableau-8-2-new-features

https://www.youtube.com/watch?v=fYpy04vmG_o

http://metricx.com/services/business-intelligence-services/tableau-consulting/tableau-server-learn/

Jenkins
Manage ETL processes
Free, with many plugins
Monitor job status and dependencies
Integration with Git and SVN
Email alerts
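What "monitor job status and dependencies" means for an ETL chain can be sketched without Jenkins itself. The job names and the `run_pipeline` helper below are hypothetical, and the example uses the Python standard library (`graphlib`, Python 3.9+) rather than any Jenkins API: jobs run in dependency order, and everything downstream of a failed job is skipped.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL jobs; each job lists the jobs it depends on.
DEPENDENCIES = {
    "extract_logs": set(),
    "clean_logs": {"extract_logs"},
    "load_warehouse": {"clean_logs"},
    "build_daily_stats": {"load_warehouse"},
    "refresh_dashboard": {"build_daily_stats"},
}


def run_pipeline(dependencies, run_job):
    """Run jobs in dependency order; skip everything downstream of a failure."""
    status = {}
    for job in TopologicalSorter(dependencies).static_order():
        if any(status[dep] != "ok" for dep in dependencies[job]):
            status[job] = "skipped"
        else:
            status[job] = "ok" if run_job(job) else "failed"
    return status


# Simulate one failing job; a real step would call a script or a remote shell.
result = run_pipeline(DEPENDENCIES, lambda job: job != "load_warehouse")
print(result)
```

In this simulated run the two upstream jobs succeed, `load_warehouse` fails, and the statistics and dashboard jobs are skipped, which is the situation an email alert would report.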

User Interface

Joblist

ip:port

Job Name List

Build Steps

call python script

call the remote shell
call local shell script

Build Graph

(Diagram: build graph of nine jobs, each node labeled "Job Name")

Let's put it all together

Hadoop Cluster 1
Hadoop Cluster 2
Teradata
Tableau Server
User
Data Transfer
Request
ETL
Live Query

Too Slow
Data Slicing

Hadoop Cluster 1
Hadoop Cluster 2
Teradata
Tableau Server
User
Data Transfer
Request
ETL
Extract Data

Insufficient Space
Data Slicing

Hadoop Cluster 1
Hadoop Cluster 2
Teradata
Tableau Server
User
Data Transfer
Request
ETL
Extract Data Every Day
Table
View

Statistical Tables
Data Slicing

User Experience Tuning
120X Faster
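The statistical-table step in the last diagram can be sketched as follows. SQLite stands in for the warehouse, and the table names, columns, and sample rows are hypothetical. A daily job aggregates one day (one slice) of the raw log into a small table, and a view on top of it gives the dashboard a stable name to query, so the dashboard never scans the raw log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical raw log: one row per event.
conn.execute("CREATE TABLE log (day TEXT, user_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [
        ("2015-09-01", "u1", "request"),
        ("2015-09-01", "u1", "click"),
        ("2015-09-01", "u2", "request"),
        ("2015-09-02", "u1", "request"),
    ],
)

# Small statistical table that the dashboard reads instead of the raw log.
conn.execute(
    "CREATE TABLE daily_stats "
    "(day TEXT PRIMARY KEY, users INT, requests INT, clicks INT)"
)


def build_stats(day):
    """Daily ETL step: aggregate one day of raw log into one summary row."""
    conn.execute(
        """
        INSERT OR REPLACE INTO daily_stats
        SELECT day,
               count(DISTINCT user_id),
               sum(event = 'request'),
               sum(event = 'click')
        FROM log WHERE day = ? GROUP BY day
        """,
        (day,),
    )


for day in ("2015-09-01", "2015-09-02"):
    build_stats(day)

# A view gives the dashboard a stable name and a derived click rate.
conn.execute(
    "CREATE VIEW daily_ctr AS "
    "SELECT day, users, 1.0 * clicks / requests AS click_rate FROM daily_stats"
)
report = conn.execute("SELECT * FROM daily_ctr ORDER BY day").fetchall()
print(report)  # [('2015-09-01', 2, 0.5), ('2015-09-02', 1, 0.0)]
```

Because the job is keyed by day, rerunning it for one day replaces only that day's row, which keeps a failed or late extract cheap to repair.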

How to choose the components of your BI framework?
The cost of data storage
The cost of data management
The time cost of data processing

Considerations and suggestions
Time is money
Trade HDD space or money for time
Understand the components and their relationships
Balance the needs against the costs
A good framework will help business growth

Cost Curve

Business Growth
Cost of Business Growth

Hardware
* More nodes
* More memory
* Graphics cards

Software
* Spark
* Tez
* Tachyon
* Algorithm

In the Future

Cloud
* EC2
* BigQuery
* Bluemix
* SAP

Thank you for listening
Special thanks to Vpon
Hood, Meiyen, Gil and the OPS Team

Q & A