Big Data and BI Tools - BI Reporting for Bay Area Startups User Group



DESCRIPTION

This presentation was given at the July 8, 2014 meeting of the BI Reporting for Bay Area Start-ups user group. Content creation: Infocepts/DWApplications. Presented by: Scott Mitchell, DWApplications.

TRANSCRIPT

Page 1: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

www.infocepts.com

BI Reporting SF Bay User Group, 08 July 2014

BI Reporting for Bay Area Start-ups

Presented by: Scott Mitchell, DWApplications

Page 2: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


Presenter – Scott Mitchell Background

• Currently based in the San Francisco Bay Area; consultant working for DWApplications, partnering with Infocepts for their off-shore blended staffing capacity

• BI and DW experience: started working with BI/DW tools in 1997 (17 years); has worked on all sides of the fence – reporting, DBA, ETL, solution architect – with significant experience in Agile BI application integration

• Previous implementations – start-ups: ePredix, Telephia/Nielsen Mobile, Quantros, mFoundry/FIS, TradePulse, iQ-ity; enterprise: Victoria's Secret, eBay, Ross, Safeway, Bank of America, VISA

Page 3: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

BIG Data

Page 4: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


Big Data Agenda

• BIG Data: standard BIG Data reference architecture; 5/7 Vs of BIG Data; Hadoop ecosystem; connecting Hadoop; components of the Hadoop ecosystem

• BIG Data questions: what Hadoop can do vs. what Hadoop can't do; when to use Hadoop; when not to use Hadoop; when to choose BIG Data over an RDBMS; can Big Data and a traditional RDBMS co-exist?; RDBMS, BIG Data, or both?; real-time analytics using Big Data

• BIG Data platform comparison

• BI tool comparison

Page 5: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Standard BIG Data Reference Architecture

http://thinkbiganalytics.com/leading_big_data_technologies/big-data-reference-architecture/

Page 6: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


5 Vs of BIG Data

Volume: This is the aspect that comes to most people's minds when they think of Big Data. Volumes of data have increased exponentially in recent times; it is not uncommon for businesses to deal with petabytes of data, and analysis is typically performed over the entire data set, not just a sample.

Velocity: Big Data is not just about volume, though. Just as important is the rate of change of the data. For a large volume of data that doesn't change very often, analysis that takes hours or days to complete may be acceptable; but if the dataset is growing by terabytes per day, or the data is changing rapidly, the processing time of the analysis becomes much more important.

Variety: Big Data is not always structured, and it is not always easy to put into a relational database. Big Data includes data types such as videos, music files, emails, unstructured Word documents and social media feeds. Dealing with a variety of structured and unstructured data greatly increases the complexity of both storing and analyzing Big Data.

Page 7: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


5 Vs of BIG Data

Veracity: When we are dealing with a high volume, velocity and variety of data, it is inevitable that not all of the data will be 100% correct; there will be dirty data. The question is: how clean is good enough for the analysis to be performed? Often the data does not need to be perfect, but it does need to be close enough to gain relevant insight. Depending on the application, the veracity, or verification, of the data may be essential or simply "nice to have".

Value: This is the most important aspect of big data. It costs a lot of money to implement IT infrastructure to store big data, and businesses are going to require a return on that investment. At the end of the day, if you can't extract value from your data, there is no point in building the capability to store and manage it.

Page 8: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


Additional Vs – Part of 7Vs of BIG Data

Additionally, some experts also add:

Validity: The interpreted data has a sound basis in logic or fact; it results from logical inferences over matching data. One of the most common errors is confusing correlation with causation. The context of the data becomes very important.

Visibility: The state of being able to see or be seen is implied. Data from disparate sources needs to be stitched together so that it is visible to the technology stack making up Big Data. Critical data that is otherwise available, but not visible to Big Data processes, may be one of the Achilles' heels of the Big Data paradigm. Conversely, unauthorized visibility is a risk.

Page 9: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Hadoop Ecosystem

[Diagram legend: components that can directly use YARN; components using the MapReduce framework; SQL-based database tools]

Page 10: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Connecting Hadoop

[Diagram: BI tools, ETL tools and databases connecting to Hadoop via JDBC/ODBC and native interfaces]

Page 11: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Modules of Hadoop Ecosystem

• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

• Hadoop YARN: A framework for job scheduling and cluster resource management

• Hadoop Common: The common utilities that support the other Hadoop modules

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
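The MapReduce module listed above is easiest to grasp through the classic word-count example. The sketch below is a rough illustration of the programming model only, in plain Python; it is not the Hadoop Java API, and every function name in it is invented:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as Hadoop does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups}

# Classic word count: the mapper emits (word, 1), the reducer sums the counts.
def wc_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def wc_reducer(word, counts):
    return sum(counts)

lines = ["Big Data and BI tools", "BI reporting for big data"]
result = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
print(result)
```

On a real cluster the map and reduce functions run in parallel across nodes and the shuffle moves data over the network, but the three-phase shape is the same.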

Page 12: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Other Components

• HBase: A scalable, distributed database that supports structured data storage for large tables

• Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying

• Pig: A high-level data-flow language and execution framework for parallel computation, used for constructing extract, transform, load (ETL) data flows

• ZooKeeper: A high-performance coordination service for distributed applications

• Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner

Page 13: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Gartner's 12 Dimensions of Big Data – Extreme Information

There are three tiers of information management in the model, with four dimensions in each tier.

Page 14: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Quantification

Physical data characteristics, with reference to:

Volume – High volume of data generated during different timeframes

Velocity – Speed of data collection, processing and access: real-time, near real-time, historical and older

Variety – Various structures of data from different data sources: unstructured (websites, sensors, social media, etc.), semi-structured (XML, web services, etc.) and structured (transactional systems)

Complexity – Individual data sets with different standards, business domain rules and storage formats for each asset type

Page 15: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Access Enablement and Control

Information control based on the nature of the data and the information it provides (e.g. confidential HR, finance and sales data, customer details, negative tweets, etc.)

Classification – Classification of data into various classes depending on the information hidden in it (sensitive, non-sensitive, private, public, etc.)

Contracts – Governance rules of the enterprise data governance framework to allow access to specific data (e.g. agreements on who will share what information, and how)

Pervasiveness – Spread and availability of data across various levels of the organization, depending on organizational requirements and the detail of the information in the data (e.g. how long data remains active, how long an aggregation of data is valid for summary reports, when data refreshes, etc.)

Technology enablement – Specifications for tools and technology: controlling the empowerment of users to access various functionalities of tools and technologies to get information from data (e.g. security roles in MicroStrategy)

Page 16: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Qualification and Assurance

Fidelity – Reliability of the data source and authenticity of the data

Linked data – Association of data with its context (affiliation)

Validation of data – Validity of data for its business use case and rules

Perishability – Longevity: how long data remains relevant to its context and analysis; aging of data while retaining its state and originality

Page 17: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

BIG Data Questions

Page 18: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


When to use Hadoop?

Your Data Sets Are Really Big – If your data is in GBs, use Excel, a SQL BI tool on Postgres, or some similar combination; but if your data is in terabytes or petabytes, Hadoop's superior scalability will save you a considerable amount of time and money

You Celebrate Data Diversity – It doesn't matter whether your raw data is structured (like out of an ERP system), semi-structured (like XML and log files), unstructured (like video files), or all three: Hadoop and its forgiving schema will gobble it up

You Have Strong Programming Skills – Hadoop is written in Java, and therefore requires Java programming skills to master. That is changing with new tools in the Hadoop ecosystem, but right now it largely remains a venue for strong Java skills

You Are Building an 'Enterprise Data Hub' for the Future – If you work for a large enterprise, you might sign up for Hadoop even if your data isn't particularly massive, diverse or fast at this point in time. It might make sense to start experimenting with Hadoop to be ready to take advantage when the elephant really starts sizzling and goes mainstream in a few years

You Find Yourself Throwing Away Perfectly Good Data – Hadoop can store petabytes of data. If you find that you are throwing away potentially valuable data because it costs too much to archive, you may find that setting up a Hadoop cluster allows you to retain this data and gives you the time to figure out how best to make use of it

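The "data diversity" point above is essentially schema-on-read: raw records are stored as-is and only interpreted at query time. A minimal sketch in Python, where the three record formats (an ERP-style row, a JSON event, a log line) are invented for illustration:

```python
import json, re

# Heterogeneous raw records stored as-is, as a Hadoop cluster would hold them.
raw = [
    "1001,2014-07-08,49.99",                              # ERP-style CSV row
    '{"order_id": 1002, "amount": 15.50}',                # JSON event
    "2014-07-08 12:01:33 INFO order=1003 amount=7.25",    # application log line
]

def parse(record):
    """Schema-on-read: interpret each record at query time, not on load."""
    if record.startswith("{"):
        doc = json.loads(record)
        return doc["order_id"], doc["amount"]
    m = re.search(r"order=(\d+) amount=([\d.]+)", record)
    if m:
        return int(m.group(1)), float(m.group(2))
    order_id, _, amount = record.split(",")
    return int(order_id), float(amount)

orders = [parse(r) for r in raw]
total = sum(amount for _, amount in orders)
print(orders, round(total, 2))
```

The schema lives in the reader, so a new record format only requires a new parsing branch, not a reload of the data.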

Page 19: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


When not to use Hadoop?

You Want to Store Sensitive Data – One of the things Hadoop is not particularly good at today is storing sensitive data. Hadoop today has only basic data and access security, and while these features are improving by the month, the risk of accidentally losing personally identifiable information due to Hadoop's less-than-stellar security capabilities is probably not worth it

You Want to Replace Your Data Warehouse – A majority of data pros still say that Hadoop is complementary to a traditional data warehouse, not a replacement for it. The superior economics of Hadoop-based storage make it an excellent place to land raw data and pre-process it before siphoning it over to a traditional data warehouse to run analytic workloads

You Want to Delete or Update Data Frequently – Hive does not support DELETE and UPDATE commands, so if there is a business need for frequent deletion or updating of data, Hadoop is not the way to go
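Because Hive (as of this presentation) lacks UPDATE and DELETE, Hadoop-side data is usually kept append-only, and updates are emulated by appending new record versions and letting the latest version win. A sketch of that pattern in Python, with invented data:

```python
# Append-only "updates": never modify in place; append a new version of the
# record and reconstruct the current state by taking the latest version per key.
events = [
    {"id": 1, "status": "new",     "ts": 1},
    {"id": 2, "status": "new",     "ts": 2},
    {"id": 1, "status": "shipped", "ts": 3},   # logical UPDATE of id=1
    {"id": 2, "status": None,      "ts": 4},   # tombstone: logical DELETE of id=2
]

latest = {}
for event in sorted(events, key=lambda e: e["ts"]):
    latest[event["id"]] = event            # later versions overwrite earlier ones

# Drop tombstones (records whose latest version marks them deleted).
current = {k: v for k, v in latest.items() if v["status"] is not None}
print(sorted(current))
```

In Hive this same idea typically shows up as periodically rewriting a partition, or as a view that picks the latest row per key.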

Page 20: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


When to use BIG Data technologies over RDBMS? When you can no longer achieve the desired results with your RDBMS

• When data is highly unstructured, e.g. scanner data, social media data, streaming data, videos, documents, tweets, photos, etc.

• When data is huge in volume and complexity (greater than 1 TB, and complex)

• Customers adopt BIG Data for specific roles – especially exploratory data-science sandboxes and unstructured-data staging

And for some very technical, issue-oriented reasoning:

• Count Distinct Queries: A count distinct query by definition has to process every record, including sorting and counting, and this becomes a difficult problem when the volume of data is huge. Mixing one or more distinct aggregates with non-distinct aggregates in the same select list, or mixing two or more distinct aggregates, causes further performance issues, as it leads to spooling and re-reading of intermediate results.

• Cursors: A cursor steps through a table row by row. If you are doing analysis with some kind of case statement using a cursor on each row, and the table is of any significant size, this is a very bad situation. Cursors are fine for iterating through small metadata tables, but RDBMS engines are not optimized for stepping through large datasets one entry at a time
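To make the two anti-patterns concrete, here is a toy illustration using Python's built-in sqlite3 module as a stand-in RDBMS (the table and data are invented; at five rows both approaches are instant, but on billions of rows the row-by-row loop and the full distinct scan are exactly what hurts):

```python
import sqlite3

# Stand-in RDBMS with an invented sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (2, 5.0), (1, 7.5), (3, 2.0), (2, 1.0)])

# Cursor anti-pattern: stepping through the table row by row on the client.
seen = set()
for (customer_id,) in conn.execute("SELECT customer_id FROM sales"):
    seen.add(customer_id)          # per-row work the engine cannot optimize

# Set-based count distinct: better, but the engine still has to sort/hash
# every record, which is what becomes painful at terabyte scale.
(distinct_customers,) = conn.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM sales").fetchone()
print(len(seen), distinct_customers)
```

Both answers agree; the difference is where the full-table work happens, and at big data volumes that work is what gets pushed onto a distributed engine.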

Page 21: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


• Alter Table: Suppose you have a customer's big data warehouse with a table X that is so big, and so important, with so many columns, that altering it – adding a column, changing a column's data type, or running any similar DDL operation – takes a long time to complete. Such operations need to be planned and executed very carefully, as they lock the table for the whole operation until the statement completes. In addition, if the column you are adding has a NOT NULL clause, it is very painful, as the DBMS has to insert default values into all of the existing rows, which may overburden your transaction logs.

• Data Merge and Mashup (Structured Meets Unstructured): Most retailers today have both an online and an in-store presence. Consider a scenario where you have customers' online product-search data (search logs) from the retailer's website for the last 15 days, their past in-store purchase history (RDBMS), their in-store charge card transaction data, and daily commute pattern data from their cellphone provider. If you want to build an analytical model that combines these myriad sources to send custom discount offers valid at a specific store located along the customer's daily commute path, you would need to combine all of these sources of data. It's difficult to deal with unstructured data using an RDBMS, let alone to combine unstructured data with structured.

Page 22: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


When using Big Data technologies like Hadoop and Hive, do we still need a standard RDBMS to perform analytics? No

Hive is essentially a data warehouse infrastructure that provides data summarization and ad hoc querying. It performs the role of a data warehouse platform for the organization's structured data in the Hadoop ecosystem. The long-term Hadoop vision is that an organization can rely completely on the Hadoop ecosystem for analytics, even in the absence of an RDBMS.

However, right now:
- Hadoop is IT-heavy, and business users need IT hand-holding
- It lacks highly accessible self-service tools for business users
- Hadoop does not have extensive pre-existing adapters for ERP systems
- It would require significant investment to re-write the advanced ETL feeding the DW

Do I need an RDBMS, a BIG Data database, or both? It varies from one organization to another. As organizations become aware of their data and their needs, they will be in a better position to decide which technology fits their requirements. As covered earlier, structured vs. unstructured data and the volume and complexity of data are major attributes that can help in deciding.

Page 23: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


How close can we get to real-time analytics using BIG Data technologies (rather than having to move data through ETL processes)? Truly real-time, streaming analytics is possible with BIG Data

The Hadoop ecosystem already has many customer examples where analytics is genuinely real-time or streaming.

Learn from this recently concluded Hadoop Summit keynote how a large trucking agency tracks events such as starting and stopping, and violations such as speeding, excessive braking and unsafe tail distance, while its trucks are on the road delivering goods.

The system also provides interactive views of historical data, to see how other routes have performed on violations. http://hadoopsummit.org/san-jose/keynote-day2/

Page 24: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


Can we replace RDBMS with BIG Data databases some day? Yes and no

Why yes?

• BIG Data ecosystems like Hadoop already have components that can handle unstructured as well as traditional structured data.

• RDBMS is expensive, even with a terabyte or two of data. The license fees and hardware needed to run even a 2-3 TB DWH and BI solution on an RDBMS-based system are massive. BIG Data technologies are quickly filling this gap, offering stable ecosystems without hampering performance or budget.

Why no?

• RDBMS has been around for ages, is mature, and has a lot of helpful tools. And transactional applications are still the thing RDBMS handles best; we don't yet see anything from the BIG Data technologies that tackles them as well.

• Hadoop's inventor Doug Cutting feels so. He recently opined that Hadoop is "augmenting and not replacing". He mentions things like doing payroll: the real nuts-and-bolts workloads for which people have been using RDBMS will not be a good fit for Hadoop or other BIG Data platforms

Page 25: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


Augment your EDW with Hadoop, adding new capabilities and insight

- Continue to store summary structured data from your OLTP and back-office systems in the EDW.
- Store unstructured data that does not fit nicely into "tables" in Hadoop. This means all the communication with your customers – phone logs, customer feedback, GPS locations, photos, tweets, emails, text messages, etc. – can be stored in Hadoop, a lot more cost-effectively.
- Correlate data in your EDW with the data in your Hadoop cluster to get better insight about your customers, products, equipment, etc. You can now use this data for analytics that are computation-intensive, such as clustering and targeting. Run ad hoc analytics and models against your data in Hadoop while you are still transforming and loading your EDW.
- Do not build Hadoop capabilities within your enterprise in a silo. Hadoop and other big data technologies should work in tandem with, and extend the value of, your existing data warehouse and analytics technologies.
- Data warehouse vendors are adding Hadoop and MapReduce capabilities to their offerings, while Hadoop is taking on more traditional DW activities
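The "correlate EDW data with Hadoop data" step can be shown in miniature. In this sketch a dict stands in for a warehouse summary table and a list of strings stands in for raw logs landed in Hadoop; every name and value is invented for illustration:

```python
from collections import Counter

# Stand-ins: "EDW" holds structured summary rows; "Hadoop" holds raw text logs.
edw_summary = {"cust-1": {"lifetime_spend": 1200.0},
               "cust-2": {"lifetime_spend": 300.0}}

hadoop_logs = [
    "2014-07-08 cust-1 viewed product=tv",
    "2014-07-08 cust-1 viewed product=tv",
    "2014-07-08 cust-2 viewed product=radio",
]

# Derive a behavioral signal from the raw logs (the Hadoop-side job) ...
views = Counter(line.split()[1] for line in hadoop_logs)

# ... then correlate it with the warehouse summary (the EDW side).
enriched = {cust: {**attrs, "recent_views": views.get(cust, 0)}
            for cust, attrs in edw_summary.items()}
print(enriched["cust-1"]["recent_views"])
```

The point is the division of labor: cheap Hadoop storage holds the raw, bulky data and produces compact signals, which are then joined back to the governed summary data in the EDW.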

Page 26: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


Big Data Tool Comparison

Page 27: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Big Data Technologies Comparison

| Feature | Cassandra | HBase | Hive | MongoDB |
|---|---|---|---|---|
| Description | Wide-column store based on ideas of BigTable and DynamoDB | Wide-column store based on Apache Hadoop and on concepts of BigTable | Data warehouse software for querying and managing large distributed datasets, built on Hadoop | One of the most popular document stores |
| Developer | Apache Software Foundation | Apache Software Foundation | Apache Software Foundation | MongoDB, Inc. |
| Initial release | 2008 | 2008 | 2012 | 2009 |
| License | Open source | Open source | Open source | Open source |
| Implementation language | Java | Java | Java | C++ |
| Server operating systems | BSD, Linux, OS X, Windows | Linux, Unix, Windows | All OS with a Java VM | Linux, OS X, Solaris, Windows |
| Database model | Wide column store | Wide column store | Relational DBMS | Document store |
| Data scheme | Schema-free | Schema-free | Yes | Schema-free |
| Transaction concepts | No | No | No | No |

Page 28: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Big Data Technologies Comparison

| Feature | Cassandra | HBase | Hive | MongoDB |
|---|---|---|---|---|
| Typing | Yes | No | Yes | Yes |
| Secondary indexes | Restricted | No | Yes | Yes |
| SQL | No | No | No | No |
| APIs and other access methods | Proprietary protocol | Java API, RESTful HTTP API, Thrift | JDBC, ODBC, Thrift | Proprietary protocol using JSON |
| Partitioning methods | Sharding | Sharding | Sharding | Sharding |
| Durability | Yes | Yes | Yes | Yes |
| Server-side scripts | No | Yes | Yes | JavaScript |
| Triggers | Yes | Yes | No | No |
| Replication methods | Selectable replication factor | Selectable replication factor | Selectable replication factor | Master-slave replication |
| MapReduce | Yes | Yes | Yes | Yes |

Page 29: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Big Data Technologies Comparison

| Feature | Cassandra | HBase | Hive | MongoDB |
|---|---|---|---|---|
| Supported programming languages | C#, C++, Clojure, Erlang, Go, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Scala | C, C#, C++, Groovy, Java, PHP, Python, Scala | C++, Java, PHP, Python | ActionScript, C, C#, C++, Clojure, ColdFusion, D, Dart, Delphi, Erlang, Go, Groovy, Haskell, Java, JavaScript, Lisp, Lua, MATLAB, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Scala, Smalltalk |
| Consistency concepts | Eventual consistency, immediate consistency | Immediate consistency | Eventual consistency | Eventual consistency, immediate consistency |
| Foreign keys | No | No | No | No |
| Concurrency | Yes | Yes | Yes | Yes |
| User concepts | Access rights for users can be defined per object | Access control lists (ACLs) | Access rights for users, groups and roles | Users can be defined with full access or read-only access |

Page 30: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


BI Tool Comparison

Page 31: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

BI Landscape

| Vendor Category | Vendors |
|---|---|
| Megavendors | IBM, Microsoft, Oracle, SAP |
| Large Independent Vendors | Information Builders, MicroStrategy, SAS |
| Data Discovery Vendors | Qlik, Tableau, Tibco Spotfire |
| Open Source | Actuate, Jaspersoft, Pentaho |
| SaaS | Birst |
| Small Independent Vendors | Bitam, Salient, Panorama, Logi Analytics, Targit, GoodData, arcplan, Infor, Alteryx, Pyramid Analytics, Board International, Prognoz, Yellowfin |

Page 32: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Gartner's 17 Categories

Information Delivery

1. Reporting – Ability to create print-ready and interactive reports
2. Dashboards – Multi-object, linked reports in an intuitive and interactive display
3. Ad hoc report/query – Ability for end users to create their own reports
4. Microsoft Office Integration – How the tool integrates with the Office suite
5. Mobile BI – Ability to deliver to mobile devices using the native features of mobile

Analysis

6. Interactive Visualization – Exploring the data beyond pie/bar charts; includes heat maps, geographic maps, scatter plots, etc.
7. Search-based Data Discovery – Easily search structured and unstructured data sources
8. Geospatial and Location Intelligence – Ability to show relationships on interactive maps using geographic, spatial and time information
9. Embedded Advanced Analytics – Leverages statistical function libraries, Predictive Model Markup Language (PMML) and R-based models
10. OLAP – Fast, multidimensional access and manipulation of the data

Page 33: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group

Gartner's 17 Categories

Integration

11. BI Infrastructure and Administration – Shared security, metadata, administration, object model, query engine and scheduling/distribution
12. Metadata Management (MDM) – Centralized and robust way to administer and manage dimensions, facts, performance, report layouts, etc.
13. Business User Data Mashup and Modeling – Code-free, drag-and-drop, user-driven ability to mix and match different data
14. Development Tools – Programmatic and visual tools for developing reports, dashboards and analysis
15. Embeddable Analytics – Includes a software development kit (SDK) for truly customizing, porting and embedding analysis both within and outside the platform
16. Collaboration – Ability to share and discuss
17. Support for Big Data – Ability to query hybrid, columnar and array-based data sources, MapReduce and NoSQL databases

Page 34: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


BI Platforms Comparison – Gartner

| Tool | Strengths | Weaknesses |
|---|---|---|
| Actuate | Release of BIRT iHub 3: consistent, streamlined interface with better integration across the product line; expanded big data connectivity and mashup capabilities; functionality and ease of use rated high | Deterioration of market understanding, user experience and contract experience; overall product capability score below average; not highly used for dashboarding, ad hoc analysis and interactive visualization/discovery |
| Jaspersoft | End-to-end BI; first pay-as-you-go BI server on AWS; low cost of ownership | Capabilities scored below average; used narrowly in organizations; below-average data volumes; weak embeddable analytics and advanced analytics |
| Pentaho | Low cost of ownership; ranked high for development tools; investing in and launching emerging analytic application capabilities (Big Data Layer, Instaview, Storm and Splunk) | Customer experience, product quality and support below average; difficult to use and implement |
| Qlik | Launch of redesigned visualization experience, Natural Analytics (Q3/Q4 2014); ease of use for analysis and development; associative search eliminates some complex SQL; strong on dashboards, visualizations, mashups, collaboration, mobile and big data support | Not enterprise-ready: lacks MDM, infrastructure and embeddability; limited compared with other stand-alone data discovery vendors in visual-based interactive exploration and analysis; major rearchitecting poses risks to current customers and could lose market traction |
| Tableau | Highly intuitive, visual-based data discovery, dashboarding and data mashup capabilities; high customer satisfaction and experience; reusability, scalability and embeddability; wide range of support for data access | Used as a complement, not the standard; inflexible in negotiations / high maintenance fees; ability to address governance and broader BI functionality a work in progress |

Page 35: Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


BI Platforms Comparison – Gartner (continued)

| Tool | Strengths | Weaknesses |
|---|---|---|
| MicroStrategy | Go-to platform for the most complex deployments; organic integration and superior product quality; the choice where mobile is a strategic requirement; big data integration; visual data discovery and multi-TB in-memory engine (in development) | Steep initial learning curve (Mobile/VI combating that); cost of software; longest to develop reports (along with SAP); blurred marketing message |
| SAP BusinessObjects | Large deployments and enterprise BI standards, with integration key; heavy investment in visual data discovery and embeddable analytics; expansion of BI Customer Success initiative | Hard to use and do complex analysis; software quality issues / difficult to migrate; high cost and hard sale; integration concerns / questions on BI commitment |
| IBM Cognos | Handles some of the largest deployments; Watson Analytics (2014): smart data discovery; simplified licensing model | Unrecognizable differentiation in market; cost, poor performance, lack of ease of use and support quality are all customer concerns; scores low / not reaching business benefits |
| Oracle BIEE | Leader in information management; integration, pre-built solutions and large-scale deployments; large network of partners | Unavailability of complex types / advanced analytics; requires sophisticated BI-related competencies; scores low in quality and late with mobile |
| Tibco | Aims to stay ahead of the curve with aggressive development/acquisition; quality, functionality and ease of use rated high; used for complex analyses | Large, complex reports take a long time to develop; dashboards rated average; administration, development and MDM rated below average; support staff coverage not always adequate |
| Microsoft | Ubiquitous BI across products: it is already there and being used; attractive packaging and pricing; investing heavily in cloud; Excel widely used, with accelerated investment in feature releases | Mobile BI, interactive visualization and MDM are product weaknesses; multiproduct complexity in on-premises or hybrid deployments; do-it-yourself approach puts the onus on the customer |