TRANSCRIPT
Dipl.-Inform. Volker Stöffler – [email protected]
SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less
Agenda
Introduction: What is SAP IQ - in a nutshell
Architecture, Idea, Background
Exercise: Create a database and database objects
What makes SAP IQ eligible for Big Data Scenarios
(Un-) Limits, Scalability Aspects
Exercise: Populate the database using bulk load
Ad-hoc Queries
What IQ is good at
Exercise: Run predefined or own queries against your database
Learning Objective
After completing this session, you will be able to:
Recognize the benefits of data compression mechanisms in Big Data scenarios.
Describe how ad-hoc queries against raw fact data give you the flexibility to evaluate these data along just the dimensions you want NOW.
Match evaluation patterns against the data structures offered by SAP IQ.
What is SAP IQ - in a nutshell
Architecture, Idea, Background
Real-Time Evaluation on Very Large Tables with SAP IQ
SAP IQ is a pure-bred Data Warehouse engine designed for Very Large Databases.
Like SAP HANA, it utilizes a Columnar Data Store. Unlike SAP HANA, it stores data on disk and utilizes RAM to cache parts of it.
Data Compression multiplies the range of Storage Resources:
Dictionary Compression for repeating column values
Storage compression for all data structures
Storage required for data can be 30% - 80% less than in traditional RDBMS.
SAP IQ integrates seamlessly with core components of the Big Data ecosystem:
SAP HANA via Smart Data Access / Extended Storage
Hadoop via Component Integration Service or Table User Defined Functions
SAP IQ Terms
Columnar Data Store: In traditional (OLTP style) RDBMS, the various column values of a data row are stored together, making access to a single complete row very efficient. In a columnar data store, the column values of many rows are stored together. A row is distributed over various column vectors.
Row ID: Since a row does not exist as a memory entity, it exists as a Row ID indicating the position in the various column vectors.
Cardinality: The number of unique / distinct column values for a column.
Optimized Fast Projection: The SAP IQ term for dictionary compression.
Bitmap Index: Since a row exists as a Row ID only, columns of low cardinality can be reflected as (usually sparsely populated) bitmaps where each bit represents one row. There is one bitmap per unique value. A set bit indicates a row with that value.
SAP IQ for Big Data Scenarios
What makes SAP IQ eligible for Big Data Scenarios
(Un-) Limits, Scalability Aspects
Data Acquisition
Big Data – Data is acquired through bulk mechanisms
fast – SAP IQ holds the Guinness World Record of 34.3 TB / hour (2014)
cost efficient – Runs on standard hardware
versatile – IQ can load from a wide variety of data sources including leading RDBMSs and Hadoop
scalable – Parallel processing of load data streams
Procedure: SAP IQ Data Acquisition
Incoming data (row oriented, tabular result set or data file / data pipe)
Transformation to vertical
Dictionary Compression (where applicable)
Storage Compression as data is written to disk
Auxiliary Indexes (incremental or non-incremental)
Green blocks are eligible for massively parallel execution
Optimized Fast Projection – Dictionary Compression
Eligible columns have a metadata Lookup Table:
Each distinct value is represented once in the lookup table.
Each column value is stored as its position in the lookup table.
Lookup Table size depends on column data type and cardinality:
Number of rows in the lookup table = cardinality.
Lookup table row size is calculated from the column data type.
Supported up to cardinality 2^31 (2,147,483,647).
Column Vector size depends on number of rows and column cardinality:
Each column value is represented by as few bits as required to store the cardinality in binary.
E.g. a column with a cardinality of 9 .. 16 requires 4 bits / row; a column with a cardinality of 513 .. 1024 requires 10 bits / row.
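The bits-per-row figures above follow directly from the cardinality. A minimal sketch of the arithmetic (`bits_per_row` is an illustrative name, not an IQ API):

```python
import math

def bits_per_row(cardinality):
    """Smallest number of bits that can encode `cardinality` distinct lookup positions."""
    return max(1, math.ceil(math.log2(cardinality)))

# Matches the examples above: cardinality 9..16 -> 4 bits, 513..1024 -> 10 bits.
for c in (9, 16, 513, 1024):
    print(c, bits_per_row(c))
```

This is why lowering cardinality (e.g. by splitting a timestamp into date and time parts) can shrink a column vector substantially.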
Data Storage
Big Data – SAP IQ can maintain as many containers (files / raw devices) as the OS allows, each up to 4 TB in size
Life Cycle – SAP IQ can organize the database for different kinds of storage, reflecting data life cycle or temperature
cost efficient – Data compression reduces the disk footprint
integrated – SAP IQ can integrate with SAP HANA to hold data no longer hot enough for in-memory storage, and with Hadoop to age out even colder data
Compression – Raw data size typically reduced by 30 – 70%
SAP IQ Storage (Un-) Limitations
SAP IQ maximum database size: number of files times the maximum file size the OS allows.
Maximum file size supported by IQ is 4 TB.
Organized as DBSpaces consisting of up to 2000 files.
Up to 2^48 - 1 rows per table (a 15-digit decimal number).
Table size is only limited by database size.
Special declarations are required to extend a table beyond the size of a DBSpace.
Up to 2^32 indexes per table.
Up to 45000 columns per table (recommended limit: 10000).
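A quick check of the row limit quoted above, confirming that 2^48 - 1 really is a 15-digit number:

```python
# Row limit quoted above: up to 2**48 - 1 rows per table.
max_rows = 2**48 - 1
print(max_rows)             # 281474976710655
print(len(str(max_rows)))   # 15 decimal digits
```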
Big Data Specific Features – Very Large Database Option
Semantic Partitioning – Can control data location by data values
Read-Only DBSpaces – When fully populated with archive data, DBSpaces can be declared read-only and excluded from full backups
I/O Striping – Auxiliary indexes can be separated from raw data
Data Aging – Tables or partitions (through semantic partitioning) with cold data can be assigned to cheaper storage
I/O Striping – Tables can be distributed over multiple devices by column and / or by partition
Background – What are we doing – Storage Containers
[Diagram: storage containers – Catalog Store, Temp Store, System Main Store, User Data Stores]
Background – What are we doing – System Storage Containers
First, we create Catalog Store, System Main Store and Temporary Store (0CreateDB.SQL)
Catalog Store: …database '...\FlightStats.db'…
Database handle – one file system file (accompanied by a .log)
Holds system tables
Grows on demand
System Main Store: …IQ path '...\FlightStatsMain.IQ'…
One or multiple file system files or raw devices
Holds system data
Specified current and optionally reserved size for later extension
Temp Store: …temporary path '...\FlightStatsTemp_00.IQ'…
One or multiple file system files or raw devices
Holds temporary data (work tables, temporary tables, processing data)
Specified current and optionally reserved size for later extension
Next, we create a User Data Store (1AdjustExtendDB.SQL)
Background – What are we doing – User Storage Containers
Background – What are we doing – Create Tables and Indexes
Then, we create Tables and Indexes (2TablesIndexes.SQL)
Table: …create table FlightsOnTime… – standard SQL, except the IQ UNIQUE clause (here used to bypass dictionary compression)
Indexes: Various index types, many applying to one column:
LF – Low Fast, for low cardinality columns
HNG – High Non Group, for parallel calculation of totals and averages
DATE – for the low cardinality elements of date values
… more and details to follow
Ad-hoc Queries
What IQ is good at
Real-Time Evaluation on Very Large Tables with SAP IQ

Product  Acct. Rep  State  Year  Quarter  Revenue
IQ       Steve      TX     2013  1        600
ASE      Bill       OK     2013  1        515
ESP      Tom        MA     2013  1        780
HANA     Steve      AZ     2013  1        340
HANA     Tom        NJ     2013  1        375
IQ       Tom        PH     2013  1        410
ESP      Greg       CA     2013  1        875
HANA     Steve      TX     2013  1        724
IQ       Bill       CO     2013  2        415
ESP      Steve      TX     2013  2        655
HANA     Bill       UT     2013  2        820
HANA     Tom        NH     2013  2        570
Data Processing

Big Data – Columnar Data Store allows evaluation of very large numbers of rows; irrelevant columns have no impact on query performance
scalable – Server or query workload can be distributed across multiple machines (Multiplex / PlexQ)
efficient – Bitmap indexes allow complex aggregations through elementary binary operators
Pipeline Processing – Subsequent query operators can start before completion of previous operators
scalable – I/O striping across all eligible disk containers
Showcase: Grouped Average Calculation in 2 Dimensions
We have a numeric fact value (like number or value of items sold) for which we want to calculate total or average values.
Assumptions:
Every fact row has one out of 23 status values. We're only interested in status 'current' or 'historic'. These two make up ~98% of the stored data.
Every fact row is assigned to a geography. The geography dimension has a cardinality of ~100, but we're only interested in 8 of them (e.g. AT, BE, CH, DE, FR, IE, NL, UK).
Every fact row is assigned to a product line. There are 43 of them, and we'll evaluate them all.
Showcase: Sample Data Excerpt – Low Fast (LF) Index

Status    Geo  PL | current historic pending | DE DK ES UK | PL2 PL3 PL4 PL5
current   ES    3 |    1        0       0    |  0  0  1  0 |  0   1   0   0
current   DK    5 |    1        0       0    |  0  1  0  0 |  0   0   0   1
pending   UK    9 |    0        0       1    |  0  0  0  1 |  0   0   0   0
current   UK   16 |    1        0       0    |  0  0  0  1 |  0   0   0   0
historic  DE   29 |    0        1       0    |  1  0  0  0 |  0   0   0   0
current   NL    2 |    1        0       0    |  0  0  0  0 |  1   0   0   0
historic  FR    4 |    0        1       0    |  0  0  0  0 |  0   0   1   0
historic  GA    5 |    0        1       0    |  0  0  0  0 |  0   0   0   1
current   DE   16 |    1        0       0    |  1  0  0  0 |  0   0   0   0
current   AT   31 |    1        0       0    |  0  0  0  0 |  0   0   0   0
current   IT   24 |    1        0       0    |  0  0  0  0 |  0   0   0   0
(One bitmap per distinct value; only a selection of the Geo and PL bitmaps is shown.)
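The LF bitmaps above can be reproduced with a small sketch that uses a Python integer as the bit vector (bit i set means row i carries that value). This illustrates the idea only; it is not IQ's actual storage layout:

```python
# Status column from the sample excerpt above, in row order.
status = ["current", "current", "pending", "current", "historic", "current",
          "historic", "historic", "current", "current", "current"]

def lf_bitmaps(column):
    """One bitmap (as an int) per distinct value; bit i <=> row i has that value."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

bm = lf_bitmaps(status)
# Printed right-to-left: the rightmost bit is row 0.
print(format(bm["current"], "011b"))   # -> 11100101011
```

Reading the printed bitmap from right to left reproduces the "current" column of the table above.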
Threads: 8*43
Threads: 8
Procedure: Showcase initial process steps
Filter: Create a combined bitmap current OR historic
Permutation 1: Create a combination of this bitmap (AND) with each of AT, BE, CH, DE, FR, IE, NL, UK
Permutation 2: Create an AND combination of each bitmap with each product line
Intermediate Result: 8*43 bitmaps, each indicating the row set for a combination of Geo and PL
Pipeline Execution
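The filter and permutation steps above can be sketched at toy scale with integers as bitmaps. The bitmap values and dimension sizes below are made up for illustration, and the names are not IQ internals:

```python
# Toy bitmaps over 4 rows (bit i == row i); values are illustrative only.
status_bm = {"current": 0b0111, "historic": 0b1000, "pending": 0b0000}
geo_bm    = {"DE": 0b0011, "FR": 0b1100}
pl_bm     = {"PL2": 0b0101, "PL3": 0b1010}

# Filter: combined bitmap for status current OR historic.
wanted = status_bm["current"] | status_bm["historic"]   # 0b1111

# Permutations 1 and 2: one independent AND chain per (Geo, PL) combination --
# in IQ, each of these could run as its own thread.
groups = {(g, p): wanted & geo_bm[g] & pl_bm[p]
          for g in geo_bm for p in pl_bm}

# Each entry is the row set for one Geo / PL combination.
print(groups[("DE", "PL2")])
```

The point of the showcase is that each entry of `groups` is computed by nothing more than bitwise AND / OR, with no dependency between entries.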
Showcase: Bit Slice Sum Calculation with HNG Index
Value  16  8  4  2  1
 23     1  0  1  1  1
 11     0  1  0  1  1
 17     1  0  0  0  1
  5     0  0  1  0  1
 15     0  1  1  1  1
 24     1  1  0  0  0
 16     1  0  0  0  0
  7     0  0  1  1  1
 12     0  1  1  0  0
 11     0  1  0  1  1
 25     1  1  0  0  1
• As an auxiliary index structure, numeric values can be stored in bit slices
• This is called a High Non Group (HNG) Index
• Every bit position is represented by its own bitmap
• E.g. for an unsigned smallint (2 bytes; 0 .. 65535), 16 bitmaps are stored
• Each represents a power of 2 (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768)
Threads: 8*43
Threads: 8*43*16
Procedure: Showcase final process steps
Intermediate Result: 8*43 bitmaps, each indicating the row set for a combination of Geo and PL
Permutation 3: AND combine each bitmap with each HNG bit slice and count the resulting set bits
Accumulation: Multiply the number of set bits with the weight of the bit and add up for each Geo / PL combination
Result: (Up to) 8*43 result rows
Pipeline Execution
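The bit-slice sum described above can be sketched as follows. The sample values are taken from the HNG table above; `bit_slices` and `hng_sum` are illustrative names, not IQ functions:

```python
# Values from the HNG sample table above; max value 25 fits in 5 bits.
values = [23, 11, 17, 5, 15, 24, 16, 7, 12, 11, 25]

def bit_slices(vals, width=5):
    """One bitmap (int) per bit position; bit i of slice b is bit b of vals[i]."""
    slices = [0] * width
    for row_id, v in enumerate(vals):
        for b in range(width):
            if v >> b & 1:
                slices[b] |= 1 << row_id
    return slices

def hng_sum(slices, rowset=None):
    """SUM over the column, optionally restricted to the rows set in `rowset`."""
    total = 0
    for b, s in enumerate(slices):
        if rowset is not None:
            s &= rowset          # Permutation 3: AND with the group's row set
        total += bin(s).count("1") << b   # popcount * 2^b (bit weight)
    return total

slices = bit_slices(values)
print(hng_sum(slices))   # same as sum(values)
```

Passing one of the Geo / PL row-set bitmaps as `rowset` yields the group's subtotal with the same popcount-and-weight loop, which is exactly the accumulation step above.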
Showcase: Summary – Why this is efficient
We're utilizing a very high number of threads.
These can be executed in parallel if sufficient cores are available, but they don't have to be.
They introduce no overhead and are completely independent of each other.
They could even be executed on different nodes in a PlexQ setup.
The operations executed are technically trivial and highly efficient on every hardware.
The intermediate results fit into hardware registers.
The persistent input bitmaps can be distributed over multiple disks for I/O striping.
The intermediate bitmaps can be expected to fit in the cache:
128 MBytes for 1G rows per bitmap (uncompressed)
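A quick check of the figure above: at one bit per row, a bitmap over 2^30 (~1 G) rows occupies 2^30 / 8 bytes:

```python
# Uncompressed bitmap footprint: one bit per row.
rows = 2**30                     # ~1 G rows
bytes_per_bitmap = rows // 8     # 8 bits per byte
print(bytes_per_bitmap // 2**20) # -> 128 (MiB)
```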
Scalability Aspects
Load time vs. number of existing rows:
Incremental indexes (for low cardinality data) are insensitive to the number of existing rows.
Non-incremental (B-Tree) indexes are in principle sensitive to the number of existing rows; this impact is minimized using tiered B-Trees.
Query execution time vs. number of cores:
Most analytics-style queries can efficiently scale out to a high number of CPU cores. Increasing processing power can be expected to produce an adequate gain in response time.
Query execution time vs. number of rows:
Typically, query execution time rises linearly with the number of rows, or slower (due to pipeline execution).
Multinode Setup (Multiplex / PlexQ):
Processing power and RAM are not restricted to the capabilities of a single box.
Using SAP IQ
Standard SQL – SAP IQ is addressed using standard SQL – easy to use for developers familiar with other RDBMS
OLAP – The SQL dialect is enhanced by OLAP extensions, bringing analytics into the database server
Reporting Tools – All reporting tools supporting at least one of the standard APIs can retrieve data from SAP IQ
Import / Export – ASCII files are the most versatile data exchange format – SAP IQ reads from and writes to these
Standard APIs – ODBC, JDBC, OLE-DB, OpenClient – simply use your preferred client (unless it's proprietary)
Consistency - Concurrency
Snapshot Isolation – SAP IQ uses Snapshot Isolation
No Blocks – Read operations never get into lock conflicts. This minimizes the impact of data provisioning.
Parallel – If CPU cores are available, typical analytics operations can massively utilize them
Full Consistency – Data visible to a reader is always consistent – nothing like dirty reads, non-repeatable reads or phantom rows
Integration into the SAP Big Data Landscape
HANA integration:
Near-Line Storage for SAP BW systems
Smart Data Access / HANA Extended Storage
Hadoop integration:
User defined functions in IQ to access Hadoop data, and Table Parametrized Functions (TPF)
Event Stream Processing:
SAP ESP comes with a native adapter for SAP IQ
Reporting / Predictive Data Analysis:
Standard APIs (ODBC / JDBC / …) available for SAP and third party products
OLAP in the database removes workload from the reporting systems
Thank you!
Contact information:
Volker Stöffler
DB-TecKnowledgy
Independent Consultant
Germany – 70771 Leinfelden-Echterdingen
mailto: [email protected]
http://scn.sap.com/people/volker.stoeffler