Modern Data Warehousing

Post on 08-Sep-2014

Category:

Data & Analytics

DESCRIPTION

The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle “big data” and provide fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Parallel Data Warehouse (PDW) from Microsoft, which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and PDW. I will give an overview of the PDW hardware and software architecture, identify what makes PDW different, and demonstrate the increased performance. In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.

TRANSCRIPT

  • Modern Data Warehousing: Insights on Any Data of Any Size. James Serra, Microsoft PDW Technology Solution Professional. JamesSerra.com
  • About Me: Business Intelligence Consultant, in IT for 28 years. Microsoft PDW Technology Solution Professional (TSP). Owner of Serra Consulting Services, specializing in end-to-end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack. Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, and PDW developer. Been perm, contractor, consultant, and business owner. Presenter at PASS Business Analytics Conference and PASS Summit. MCSE for SQL Server 2012: Data Platform and BI. SME for SQL Server 2012 certs. Contributing writer for SQL Server Pro magazine. Blog at JamesSerra.com. SQL Server MVP. Author of the book Reporting with Microsoft SQL Server 2012.
  • Agenda: traditional data warehouse & modern data warehouse; APS architecture; Hadoop & PolyBase; performance and scale; appliance benefits; summarize/questions.
  • Data sources: Will your current solution handle future needs?
  • Data sources: non-relational data.
  • Roadblocks to evolving to a modern data warehouse (each solution and the issue with that solution). Keep legacy investment: limited scalability & ability to handle new data. Buy new tier-one hardware appliance: high acquisition/migration costs & no Hadoop. Acquire big data solution (Hadoop): significant training & still siloed. Acquire business intelligence solution: complex with low adoption.
  • Introducing the Microsoft Analytics Platform System: your turnkey modern data warehouse appliance. Relational and non-relational data in a single appliance; enterprise-ready Hadoop; integrated querying across Hadoop and APS using T-SQL; direct integration with Microsoft BI tools such as Power BI; near real-time performance with In-Memory; scale-out to accommodate your growing data; remove DW bottlenecks with MPP SQL Server; concurrency that fuels rapid adoption; industry's lowest DW price/TB; value through a single appliance solution; value with flexible hardware options using commodity hardware; free up space on SAN.
  • Hardware and software engineered together: the ease of an appliance. Co-engineered with HP, Dell, and Quanta best practices. Leading performance with commodity hardware. Pre-configured, built, and tuned software and hardware. Integrated support plan with a single Microsoft contact. Components: PDW, HDInsight, PolyBase.
  • APS Architecture: Microsoft Analytics Platform System (APS), formerly known by its code name Project Madison, was released in December 2010 (version 1). PDW is Microsoft's reworking of the DatAllegro Inc. massively parallel processing (MPP) product, which was started in 2003 and which Microsoft acquired in September 2008. Version 2 of PDW was made available in March 2013. It was renamed from SQL Server Parallel Data Warehouse (PDW) to Analytics Platform System (APS) in April 2014 (it still includes the PDW region as well as a new HDInsight/Hadoop region). PolyBase was introduced with version 2 of PDW and has new features in PDW v2 AU1 (April 2014). Case studies: http://www.microsoft.com/casestudies/Case_Study_Search_Results.aspx?Type=1&Keywords=%22Parallel%20Data%20Warehouse%22&LangID=46
  • APS Logical Architecture (overview). Diagram: one control node (SQL, DMS) and four compute nodes, each with SQL, DMS, and balanced storage. Compute node, the "worker bee" of APS: runs SQL Server 2012 APS; contains a slice of each database. Control node, the "brains" of the APS: also runs SQL Server 2012 APS; holds a "shell" copy of each database (metadata, statistics, etc.); the "public face" of the appliance. Data Movement Services (DMS): part of the "secret sauce" of APS; moves data around as needed; enables parallel operations among the compute nodes (queries, loads, etc.).
  • APS Logical Architecture (querying). Diagram: one control node (SQL, DMS) and four compute nodes, each with SQL, DMS, and balanced storage. 1) User connects to the appliance (control node) and submits a query. 2) Control node query processor determines the best *parallel* query plan. 3) APS distributes sub-queries to each compute node. 4) Each compute node executes the query on its subset of data. 5) Each compute node returns a subset of the response to the control node. 6) If necessary, the control node does any final aggregation/computation. 7) Control node returns results to the user.
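The seven steps above follow a scatter/gather pattern. The Python sketch below is only an illustration of that pattern, not APS code: the function names, the sample data, and the "total quantity per store" query are all assumptions made for the example.

```python
from collections import defaultdict

def compute_node_partial(rows):
    """Step 4: a compute node aggregates only its own slice of the data."""
    partial = defaultdict(int)
    for store, qty in rows:
        partial[store] += qty
    return dict(partial)

def control_node_query(slices):
    """Steps 3-7: send the sub-query to every slice, then merge the partials."""
    # On a real MPP system these calls would run in parallel on separate nodes.
    partials = [compute_node_partial(rows) for rows in slices]
    final = defaultdict(int)
    for partial in partials:            # step 6: final aggregation
        for store, qty in partial.items():
            final[store] += qty
    return dict(final)                  # step 7: result returned to the user

# Hypothetical data: four compute nodes, each holding (store, quantity) rows.
slices = [
    [("A", 5), ("B", 2)],
    [("A", 1), ("C", 7)],
    [("B", 3)],
    [("C", 1), ("A", 4)],
]
result = control_node_query(slices)
print(result)
```

Each node returns only a small partial result, so the control node merges a handful of rows rather than scanning the full data set itself.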
  • APS Data Layout Options. Star schema: Time Dim (Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day); Store Dim (Store Dim ID, Store Name, Store Mgr, Store Size); Product Dim (Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc); Customer Dim (Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email); Sales Fact (Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold). Replicated table: copied to each compute node (in the diagram, the dimension tables TD, PD, SD, and CD appear on all four compute nodes). Distributed table: spread across compute nodes based on hash (in the diagram, SalesFact).
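A small Python sketch of why the two layout options work well together: when a dimension is replicated and the fact table is distributed, every node can join its fact slice locally. The sample rows, the sale id column, and the modulo used in place of the appliance's hash function are assumptions made for illustration.

```python
NUM_NODES = 4

# Hypothetical sample data, not taken from the slides.
store_dim = {1: "Downtown", 2: "Airport"}   # Store Dim ID -> Store Name
sales_fact = [                              # (sale id, Store Dim ID, Dollars Sold)
    (101, 1, 10.0), (102, 2, 5.0), (103, 1, 2.5), (104, 2, 7.5), (105, 1, 1.0),
]

# Replicated table: every compute node holds a full copy of the dimension.
nodes = [{"store_dim": dict(store_dim), "sales": []} for _ in range(NUM_NODES)]

# Distributed table: each fact row lands on exactly one node. A modulo on the
# sale id stands in for the real hash function, which is not described here.
for sale_id, store_id, dollars in sales_fact:
    nodes[sale_id % NUM_NODES]["sales"].append((store_id, dollars))

def local_join(node):
    """Join a node's fact slice to its local dimension copy (no data movement)."""
    return [(node["store_dim"][store_id], dollars)
            for store_id, dollars in node["sales"]]

joined = [row for node in nodes for row in local_join(node)]
print(joined)
```

Because each node already has the whole dimension, no rows need to move between nodes to answer the join.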
  • Data distribution. Diagram: each compute node holds eight distributions of the table, FactSales_A through FactSales_H. The table is created with:

    CREATE TABLE FactSales (
        ProductKey INT NOT NULL,
        OrderDateKey INT NOT NULL,
        DueDateKey INT NOT NULL,
        ShipDateKey INT NOT NULL,
        ResellerKey INT NOT NULL,
        EmployeeKey INT NOT NULL,
        PromotionKey INT NOT NULL,
        CurrencyKey INT NOT NULL,
        SalesTerritoryKey INT NOT NULL,
        SalesOrderNumber VARCHAR(20) NOT NULL
    )
    WITH (
        DISTRIBUTION = HASH(ProductKey),
        CLUSTERED INDEX (OrderDateKey),
        PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901))
    );

    The control node sends the Create Table SQL to each compute node (Compute Node 1, Compute Node 2, ... Compute Node X). Each compute node runs Create Table FactSales_A, Create Table FactSales_B, Create Table FactSales_C, and so on through Create Table FactSales_H. The table metadata is created on the control node.
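To make the effect of DISTRIBUTION = HASH(ProductKey) concrete, here is a rough Python sketch that places rows into FactSales_A through FactSales_H on each node. The CRC32 hash, the two-node appliance, and the `locate` helper are assumptions for the example; the slides do not describe the actual hash function.

```python
import zlib
from collections import Counter

NUM_COMPUTE_NODES = 2          # assumed appliance size for the example
DISTRIBUTIONS_PER_NODE = 8     # FactSales_A .. FactSales_H on every node
SUFFIXES = "ABCDEFGH"

def locate(product_key):
    """Map a ProductKey to (compute node number, physical table name).

    CRC32 is only a stand-in for the appliance's internal hash function.
    """
    h = zlib.crc32(str(product_key).encode("utf-8"))
    slot = h % (NUM_COMPUTE_NODES * DISTRIBUTIONS_PER_NODE)
    node, dist = divmod(slot, DISTRIBUTIONS_PER_NODE)
    return node + 1, "FactSales_" + SUFFIXES[dist]

# Count how many of 1,000 hypothetical product keys land in each distribution.
placement = Counter(locate(key) for key in range(1, 1001))
for (node, table), count in sorted(placement.items()):
    print(f"Compute Node {node} {table}: {count} keys")
```

The same key always maps to the same distribution, which is what lets a query on one ProductKey go to a single physical table. A skewed distribution column would show up here as very uneven counts.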
  • APS: balanced across servers and within. Largest table: 600,000,000,000 rows. Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000 per node. In each server, randomly distributed to 8 tables: 1,875,000,000 per table. Each partition (2 years of data partitioned by week): 18,028,846. As an end user or DBA you think about 1 table: LineItem. You run select * from LineItem. APS is an appliance, simple to use! You don't care or need to know that there are actually 320 tables representing your 1 logical table.
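The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes the counts are rows and that "2 years partitioned by week" means 104 partitions; the variable names are illustrative.

```python
total_rows = 600_000_000_000
compute_nodes = 40
distributions_per_node = 8
weekly_partitions = 2 * 52          # assumption: 2 years of weekly partitions

rows_per_node = total_rows // compute_nodes
rows_per_distribution = rows_per_node // distributions_per_node
rows_per_partition = rows_per_distribution // weekly_partitions  # rounded down
physical_tables = compute_nodes * distributions_per_node

print(f"{rows_per_node:,} rows per compute node")
print(f"{rows_per_distribution:,} rows per distribution table")
print(f"{rows_per_partition:,} rows per partition")
print(f"{physical_tables} physical tables behind one logical table")
```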
  • 1/4 rack: 15 TB (raw); 1/2 rack: 30 TB (raw); full rack: 60 TB (raw); 1 rack: 75.5 TB (raw); 1 1/2 rack: 90.6 TB (raw); 2 rack: 120.8 TB (raw); 3 rack: 181.2 TB (uncompressed). 2 to 56 compute nodes (32 to 896 cores). 1 to 7 racks. 1, 2, or 3 TB drives. 15 TB to 1.2 PB uncompressed; 75 TB to 6 PB user data (5:1). Up to 7 spare nodes available across the entire appliance. Dual InfiniBand: 56 Gbps.
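The capacity ranges on this slide are consistent with each other, as the quick check below shows. It assumes 1 PB is counted as 1,000 TB and that the 5:1 figure is a compression ratio applied to the uncompressed capacity; both are assumptions, not statements from the slide.

```python
COMPRESSION_RATIO = 5               # the "5:1" figure on the slide
TB_PER_PB = 1000                    # assumption: decimal petabytes

min_uncompressed_tb = 15
max_uncompressed_tb = 1200          # 1.2 PB expressed in TB

min_user_data_tb = min_uncompressed_tb * COMPRESSION_RATIO
max_user_data_pb = max_uncompressed_tb * COMPRESSION_RATIO // TB_PER_PB

# Core counts: 32 cores at 2 nodes and 896 cores at 56 nodes.
cores_per_node_small = 32 // 2
cores_per_node_large = 896 // 56

print(min_user_data_tb, "TB to", max_user_data_pb, "PB of user data")
print(cores_per_node_small, "cores per compute node")
```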
  • Microsoft Analytics Platform System: your turnkey modern data warehouse appliance.
  • What is big data and why is it valuable to the business? An evolution in the nature and use of data in the enterprise. Chart labels: data complexity (variety and velocity); petabytes/volume; value to the business; historical analysis, insight analysis, predictive analytics, predictive forecasting.
  • Hadoop cluster. Diagram labels: Core Services, Operational Services, Data Services; HDFS, Sqoop, Flume, NFS, Load & Extract, WebHDFS, Oozie, Ambari, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon; many compute & storage nodes. Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware.
  • Complex query and analysis with big data today: steep learning curve, slow and inefficient. Move HDFS into the warehouse before analysis (ETL). Learn new skills (T-SQL). Build, integrate, manage, maintain, support. Diagram labels: Hadoop ecosystem; new data sources.
  • APS delivers enterprise-ready Hadoop with HDInsight: manageable, secured, and highly available Hadoop integrated into the appliance. High performance tuned within the appliance. End-user authentication with Active Directory. Accessible insights for everyone with Microsoft BI tools. Managed and monitored using System Center. 100% Apache Hadoop. Components: SQL Server Parallel Data Warehouse, Microsoft HDInsight, PolyBase. Leverage your existing T-SQL skills.
  • APS appliance overview. Diagram: Parallel Data Warehouse workload and HDInsight workload, on top of fabric and hardware, within the appliance. A region is a logical container within an appliance. Each workload contains the following boundaries: security, metering, servicing.
  • Query Hadoop data with T-SQL using PolyBase: bringing the worlds of big data and the data warehouse together for users and IT. Provides a single T-SQL query model (semantic layer) for APS and Hadoop with rich features of T-SQL, including joins without ETL. Uses the power of MPP to enhance query execution performance. Supports Windows Azure HDInsight to enable new hybrid cloud scenarios. Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera. Use existing SQL skillset, no IT intervention. Diagram (select in, result set out; federated querying): SQL Server Parallel Data Warehouse with PolyBase, connected to Microsoft HDInsight HDP 1.3, Hortonworks HDP 2.0 (Windows, Linux), Cloudera CDH 4.3 (Linux), Windows Azure HDInsight, and others? AU1: Windows Azure storage blob (WASB).
  • Use cases where PolyBase simplifies using Hadoop data: bringing islands of Hadoop data together; high-performance queries against Hadoop data (predicate pushdown); archiving data warehouse data to Hadoop (move; Hadoop as cold storage); exporting relational data to Hadoop (copy; Hadoop as backup/DR, analysis, cloud use); importing Hadoop data into the data warehouse (copy; Hadoop as staging area).
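Predicate pushdown, mentioned in the use cases above, means filtering at the data source so that fewer rows travel to the warehouse. The Python sketch below illustrates only the general idea; it is not how PolyBase is implemented, and the data and function names are hypothetical.

```python
# Hypothetical "Hadoop-side" data: 1,000 rows spread over the years 2010-2014.
hadoop_rows = [{"year": 2010 + i % 5, "amount": i} for i in range(1000)]

def scan_without_pushdown(rows, predicate):
    """Transfer every row, then filter on the warehouse side."""
    transferred = list(rows)
    return [r for r in transferred if predicate(r)], len(transferred)

def scan_with_pushdown(rows, predicate):
    """Filter at the source, so only matching rows are transferred."""
    transferred = [r for r in rows if predicate(r)]
    return transferred, len(transferred)

def is_2014(row):
    return row["year"] == 2014

plain_result, plain_moved = scan_without_pushdown(hadoop_rows, is_2014)
pushed_result, pushed_moved = scan_with_pushdown(hadoop_rows, is_2014)

print("rows moved without pushdown:", plain_moved)
print("rows moved with pushdown:", pushed_moved)
```

Both scans return the same answer; the difference is how much data crosses the network to produce it.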
  • Big data insights for anyone: native Microsoft BI integration to create new insights with familiar tools. Tools like Power BI minimize IT intervention for discovering data. T-SQL for DBAs and power users to join relational and Hadoop data. Hadoop tools like MapReduce, Hive, and Pig for data scientists. Leverages high adoption of Excel, Power View, Power Pivot, and SSAS. Audiences: power users, data scientists, and everyone else using Microsoft BI tools.
  • Performance limitations and scale with a traditional data warehouse. Scale up: diminishing scale as requirements grow (forklift upgrades). Rowstore: sub-optimal performance for many data warehouse queries. Diagram: querying data by row, with rows R1 to R6 of columns C1 to C4 stored on data pages 1 to 3.
  • Scale-out technologies in the Analytics Platform System. Scale-out Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven). Multiple nodes with dedicated CPU, memory, and storage (shared-nothing). Incrementally add hardware for near-linear scale to multi-PB (no need to delete older data, stage). Handles query complexity and concurrency at scale. No forklift of prior warehouse to increase capacity. Start small with a few-terabyte warehouse. Diagram: PDW and PDW/HDInsight units scaling from 0 TB to 6 PB.
  • Store data in columnar format for massive compression. Load data into or out of memory for next-generation performance. Updateable and clustered for real-time trickle loading. No secondary indexes required. Up to 100x faster queries (updatable clustered columnstore vs. table with customary indexing). Up to 15x more compression. Diagram labels: columnstore index representation; parallel query execution; query; results.
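A toy Python sketch of the two ideas behind columnar storage: a query that needs one column reads only that column, and runs of repeated values compress well. This is an illustration under simple assumptions (run-length encoding, hypothetical rows), not a description of how the columnstore index is actually encoded.

```python
from itertools import groupby

# Hypothetical fact rows: (month, state, quantity).
rows = [
    ("2014-01", "WA", 10),
    ("2014-01", "WA", 20),
    ("2014-01", "OR", 5),
    ("2014-02", "OR", 7),
]
column_names = ("month", "state", "qty")

# Columnar layout: one list per column instead of one tuple per row.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

rle_month = run_length_encode(columns["month"])

# SUM(qty): a rowstore scan touches every value of every row, while a
# columnstore scan touches only the qty column.
total_qty = sum(columns["qty"])
values_read_rowstore = len(rows) * len(column_names)
values_read_columnstore = len(columns["qty"])

print(rle_month, total_qty, values_read_rowstore, values_read_columnstore)
```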
  • PDW MPP vs. SQL Server SMP: 2B-row fact sales table. Nine different queries, including simple counts; sum/min/max with group-bys; multiple inner joins with 3-5 dimension tables; multiple sub-queries across the big fact table.
  • Concurrency that fuels rapid adoption: great performance with mixed workloads. Diagram labels: ERP, CRM, LOB apps; Hadoop / big data; ETL/ELT with SSIS, DQS, MDS; ETL/ELT with DWLoader; Analytics Platform System (PDW, HDInsight, PolyBase); SQL Server SMP; BI tools, reporting and cubes; ad hoc queries; intra-day; near real-time; fast ad hoc; columnstore; PolyBase; CRTAS; link table; real-time ROLAP / MOLAP; DirectQuery; SNAC.
  • APS provides the industry's lowest DW appliance price/TB. Reshaped hardware specs through software innovation. Significantly lower price per TB than the closest competitor. Lower storage costs with Windows Server 2012 Storage Spaces. Small cost gap between multiple clustered HP DL980s with SAN vs. APS 1/4 rack. Chart: price per terabyte for leading vendors (Oracle, EMC, IBM, Teradata, Microsoft); price per TB of user-available storage (compressed), in thousands, on an axis from $0 to $30. NOTE: Orange line indicates average price per TB.
  • Virtualized architecture overview. Diagram: base unit with Host 1 to Host 4, connected by IB and Ethernet, with direct attached SAS to economical disk storage; virtual machines CTL, MAD, AD, VMM, Compute 1, and Compute 2. Software details: APS engine; DMS Manager; SQL Server 2012 Enterprise Edition (APS build). All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines. Fabric or workload in Hyper-V virtual machines. Fabric virtual machine, management server (MAD01), and control server (CTL) share one server. APS agent that runs on all hosts and all virtual machines. DWConfig and Admin Console. Windows Storage Spaces and Azure Storage blobs. Does not require expertise in Hyper-V or Windows.
  • APS High-Availability. Diagram: Control Host, Compute Host 1, Compute Host 2, and a Failover Host, connected by Infiniband1, Infiniband2, Ethernet1, and Ethernet2; virtual machines FAB, AD, VMM, MAD, CTL, Compute 1 VM, and Compute 2 VM; X marks on several components. No single point of failure. No need for SQL Server clustering.
  • Less DBA maintenance/monitoring: no index creation; no deleting/archiving data to save space; management simplicity (System Center); no blocking; no logs; no query hints; no wait states; no IO tuning; no query optimization/tuning; no index reorgs/rebuilds; no managing filegroups; no shrinking/expanding databases; no managing physical servers; no patching servers and software. RESULT: DBAs spend more of their time as architects and not babysitters!
  • The no-compromise modern data warehouse solution: Microsoft's turnkey modern data warehouse appliance, the Analytics Platform System. Summary of benefits: improved query performance; faster data loading; improved concurrency; less DBA maintenance; limited training needed; use familiar BI tools; ease of appliance deployment; improved data compression; scalability; high availability; PolyBase; integration with cloud-born data; HDInsight/Hadoop integration; data warehouse consolidation. Bold = benefits of APS over upgrading to SQL Server 2014.
  • Questions? James Serra, Microsoft PDW Technology Solution Professional. Follow me at: @JamesSerra. Link to me at: www.linkedin.com/in/JamesSerra. Visit my blog at: JamesSerra.com. Blog about PDW topics: http://www.jamesserra.com/archive/category/pdw/