Building Your BI System - HadoopCon Taiwan 2015
TRANSCRIPT
Build Your BI System
Practice in a data lake ecosystem
Bryan@Vpon Data
About me
Experience: Vpon Data Engineer; TWM, Keywear, Nielsen
Bryan's notes for data analysis: http://bryannotes.blogspot.tw
Spark.TW
LinkedIn: https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
Agenda
User Story
Data Lake
Framework of BI
Deal With Big Data
Small Retailer
MORE COMPLEX and Big
http://www.slideteam.net/technology-powerpoint-templates/mobile-phones.html
3 Kinds of Problems
https://kavyamuthanna.wordpress.com/category/big-data/
Big data brings problems in three ways.
Variety: many kinds of data types, data sources, and databases
Volume: log data, transaction data, crawler data
Velocity: real time, near real time, batch
Big Data, Big Problem
http://www.mn.uio.no/ifi/studier/masteroppgaver/nd/masteroppgave_cloud_bigdata_hpc.html
Vpon is a big data advertising company. We receive and produce a large amount of data every day.
Big data, big cost
The cost of data storage
What data do we keep?
How long?
The cost of data management
Are the machines and infrastructure easy to maintain?
Data flow (ETL)?
The time cost of data processing
How long can the users wait?
Accessibility of the data
Human costs you cannot see
A Real Case
So Many AdHoc Queries
Sales
Marketing
Finance
Business
We receive many ad hoc queries every day. The queries come from every department: business development, sales, account services, R&D, and so on.
For example, how many users a day, how many requests a day, click rate, etc.
Even a Simple Query
Q: Hi, please tell me how many users we have had from the beginning?
A: select count(1) from log
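To make the one-line answer concrete, here is a minimal sketch that runs the same query against a tiny in-memory table. SQLite stands in for the real log store, and the `user_id` and `event` columns are assumptions for illustration; the slide does not show the schema of `log`. Note that `count(1)` counts rows, so if the log holds several rows per user, a `count(DISTINCT ...)` variant may be closer to the question. Either way the query has to scan the whole log.

```python
import sqlite3

# In-memory SQLite stands in for the real log store (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (user_id TEXT, event TEXT)")  # hypothetical columns
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [
        ("u1", "request"),
        ("u1", "click"),
        ("u2", "request"),
        ("u3", "request"),
        ("u2", "click"),
    ],
)

# The query from the slide: counts every row in the log.
total_rows = conn.execute("SELECT count(1) FROM log").fetchone()[0]

# A variant that counts each user once, assuming a user_id column exists.
distinct_users = conn.execute(
    "SELECT count(DISTINCT user_id) FROM log"
).fetchone()[0]

print(total_rows, distinct_users)
```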
https://myreelpov.wordpress.com/2012/12/23/which-story-do-you-prefer-life-of-pi/life-of-pi-2/
Your Life
Boss
Family and Lover
Customers
Data Ocean
Overview
Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. (Wikipedia)
Different Features
                Price        Performance   Accessibility
Hadoop          Low          Medium        Low
SQL Server      Low-Medium   Depends on    Medium
Data Warehouse  High         High          Medium
BI System       High         Depends on    High
http://www.datalytyx.com/big-data-data-lakes/
Why Data Lake
http://thesologuide.com/332/the-seesaw-of-success-when-taking-a-rest-is-best/
Hive
Created at Facebook
Data warehouse in the Hadoop ecosystem
HiveQL (SQL-like interface)
Metastore (stores the schema of the data; schema on read)
UDF
http://www.stratapps.net/intro-hive.php
http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
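The "schema on read" idea from the Hive slide can be sketched in a few lines of plain Python. This is a toy illustration, not Hive code: the raw text, the column names, and the `read_with_schema` helper are all hypothetical. The point is that the data is stored as written and the schema is applied only when the data is read, so a value that does not fit its declared type surfaces as a null at query time instead of being rejected at load time.

```python
import csv
import io

# Raw data is stored untouched; no schema is enforced when it is written.
raw_lines = "u1\t2015-09-19\t3\nu2\t2015-09-19\tnot_a_number\n"

# Hypothetical schema, playing the role of a table definition kept in a metastore.
schema = [("user_id", str), ("day", str), ("clicks", int)]


def read_with_schema(text, schema):
    """Apply the schema only while reading; unparseable fields become None."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(raw_lines, schema)
print(rows)
```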
One More Thing
Teradata
Massively Parallel Processing
Each processor handles different threads of the program, and each processor itself has its own operating disk
Teradata SQL is fully certified at SQL-92
http://www.slideshare.net/alam7/module-02-teradata-basics
https://www.safaribooksonline.com/library/view/teradata-architecture-for
http://www.teradatawiki.net/2013/08/Teradata-AMP.html
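As a rough illustration of the massively parallel idea, the sketch below spreads rows across a fixed number of workers by hashing a key, lets each worker aggregate only its own rows, and then merges the partial results. It is a toy model under stated assumptions: the worker count, the CRC32 hash, and the function names are hypothetical and are not Teradata's actual distribution algorithm, and the workers run one after another here rather than in parallel.

```python
from collections import defaultdict
import zlib

NUM_WORKERS = 4  # hypothetical number of parallel units


def worker_for(key):
    # Stable hash so the same key always lands on the same worker.
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS


def distribute(rows):
    """Assign each row to exactly one worker based on its key."""
    shards = defaultdict(list)
    for user_id, clicks in rows:
        shards[worker_for(user_id)].append((user_id, clicks))
    return shards


def local_sum(shard):
    """Work done by one worker, using only the rows it owns."""
    totals = defaultdict(int)
    for user_id, clicks in shard:
        totals[user_id] += clicks
    return totals


def parallel_sum(rows):
    merged = {}
    for shard in distribute(rows).values():
        # Each shard could be processed on a separate machine;
        # a key lives on one worker only, so results never overlap.
        merged.update(local_sum(shard))
    return merged


rows = [("u1", 2), ("u2", 5), ("u1", 1), ("u3", 4)]
result = parallel_sum(rows)
print(result)
```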
Tableau
Visualization tool
Connects to many kinds of databases
VizQL
Tableau Server
http://www.clearpeaks.com/blog/tableau/tableau-8-2-new-features
https://www.youtube.com/watch?v=fYpy04vmG_o
http://metricx.com/services/business-intelligence-services/tableau-consulting/tableau-server-learn/
Jenkins
Manage ETL processes
Free, with many plugins
Monitor job status and dependencies
Communicates with Git and SVN
Email alerts
User Interface
Job list
ip:port
Job Name List
Build Steps
call python script
call the remote shell
call local shell script
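A Python script called from a Jenkins build step might look like the minimal sketch below. Everything in it is an assumption for illustration: the `run_etl` and `main` names, the sample rows, and the rule for what counts as a loadable row. The one thing it relies on is that a shell build step in Jenkins is marked as failed when the command exits with a non-zero status, which is what makes job monitoring and email alerts work.

```python
import sys


def run_etl(rows):
    """Hypothetical ETL step: keep rows that have a user_id and count them."""
    loaded = [r for r in rows if r.get("user_id")]
    if not loaded:
        raise RuntimeError("no rows loaded")
    return len(loaded)


def main(rows):
    """Return 0 on success and 1 on failure, for use as a process exit status."""
    try:
        count = run_etl(rows)
    except Exception as exc:
        print(f"ETL failed: {exc}", file=sys.stderr)
        return 1
    print(f"ETL loaded {count} rows")
    return 0


sample = [{"user_id": "u1"}, {"user_id": ""}, {"user_id": "u2"}]
exit_code = main(sample)
# In a real job script the last line would be: sys.exit(exit_code)
```

Returning the status from `main` instead of exiting inside it keeps the ETL logic easy to test outside Jenkins.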
Build Graph
(Diagram: a graph of jobs, each node labeled "Job Name")
Let's put it all together
Hadoop Cluster 1
Hadoop Cluster 2
Teradata
Tableau Server
User
Data Transfer
Request
ETL
Live Query
Too Slow
Data Slicing
Hadoop Cluster 1
Hadoop Cluster 2
Teradata
Tableau Server
User
Data Transfer
Request
ETL
Extract Data
Insufficient Space
Data Slicing
Hadoop Cluster 1
Hadoop Cluster 2
Teradata
Tableau Server
User
Data Transfer
Request
ETL
Extract Data Every Day
Table
View
Statistical Tables
Data Slicing
User Experience Tuning
120X Faster
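The table, view, and statistical-table arrangement from the preceding slides can be sketched as follows. This is a minimal illustration, not the setup used in the talk: SQLite stands in for the warehouse, and the table names, column names, and sample rows are hypothetical. A daily ETL step aggregates the raw log once into a small statistical table, and a view on top of it exposes the click rate, so the dashboard queries a few summary rows instead of scanning the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
# Hypothetical raw log table with one row per event.
conn.execute("CREATE TABLE log (day TEXT, user_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [
        ("2015-09-18", "u1", "request"),
        ("2015-09-18", "u1", "click"),
        ("2015-09-18", "u2", "request"),
        ("2015-09-19", "u1", "request"),
        ("2015-09-19", "u3", "request"),
        ("2015-09-19", "u3", "click"),
    ],
)

# Daily ETL step: aggregate once and store the result as a small statistical table.
conn.execute(
    """
    CREATE TABLE daily_stats AS
    SELECT day,
           count(DISTINCT user_id) AS users,
           sum(event = 'request') AS requests,
           sum(event = 'click') AS clicks
    FROM log
    GROUP BY day
    """
)

# A view exposes the derived metric to the dashboard without copying data.
conn.execute(
    "CREATE VIEW daily_click_rate AS "
    "SELECT day, users, 1.0 * clicks / requests AS click_rate FROM daily_stats"
)

stats = conn.execute("SELECT * FROM daily_click_rate ORDER BY day").fetchall()
print(stats)
```

The trade-off is extra storage and a daily job in exchange for much cheaper reads, which is the "space and money for time" exchange the talk goes on to recommend.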
How to choose the components in your BI framework?
The cost of data storage
The cost of data management
The time cost of data processing
Considerations and suggestions
Time is money
Trade HDD space and money for time
Understand the components and their relationships
Balance needs against costs
A good framework helps business growth
Cost Curve
Business Growth
Cost of Business Growth
In the Future
Hardware
* More nodes
* More memory
* Graphics cards
Software
* Spark
* Tez
* Tachyon
* Algorithm
Cloud
* EC2
* BigQuery
* Bluemix
* SAP
Thank you for listening
Special thanks: Vpon
Hood, Meiyen, Gil and the OPS team
Q & A