open source business intelligence overview
Open Source Business Intelligence Tools
Alex Meadows, TriLUG, January 2012
Agenda
Business Intelligence Overview
Review of OSBI Tools
Data Warehousing
Data Integration
Reporting/OLAP
Visualization
Statistical Analysis/Predictive Analytics
Tonight's agenda will basically be an overview of the different areas of BI, starting from the back-end with data warehousing and data integration and moving into the front-end with reporting, visualization, and statistical analysis.
What Is Business Intelligence?
Utilizing technology to identify and analyze trends in data to make better business decisions.
So it's important for us to level-set and define what BI really is. It has quickly become one of the most important fields in the business world because it allows businesses to make better, faster decisions.
Source: Back In Business, Klimberg, Miori (www.informs.org)
Overlapping Fields
BI is not just one field, but many overlapping fields. One can't just look at IT and say that it is BI. It takes experts in data management, process modeling, and statistics to really make a BI program deliver the best return on investment.
Source: Competing on Analytics; Thomas Davenport, Jeanne Harris
Competing On Analytics
Thomas Davenport discovered that businesses actually go through a very predictable pattern while developing the ability to make better business decisions through data. Analytically impaired companies are those that are more 'gut driven'. They make decisions based on conjecture and feeling, not on the actual data in their systems! At the top are the analytical competitors. These companies make all of their business decisions with good data to back them up. Some examples of those companies are: Amazon, Harrah's Entertainment, and Zynga.
Phases of Growth
In keeping with the same pyramid structure, there is also a clear path to the types of tools used when companies develop a BI program. Usually reporting is the start because companies need to know what happened. As mentioned, all of these tools could be used in silos throughout an analytically impaired company. A silo example would be one employee who builds complex spreadsheets because there's no other way to report on their department.
As companies move up the Analytic Competitor pyramid, more of these tools are utilized and integrated throughout the organization. To reach their full potential, companies must not only start using their data to make decisions; they also need built-in systems that can take the data, filter it based on business requirement criteria, and automatically change their workflow based on that data.
The Three Types of Questions
What happened?
How was performance last week?
What is currently happening?
How is performance right now?
What will happen?
What can I do to reach our goals?
BI's area of focus boils down to essentially three areas: past, present, and future. Taking performance as an example, one could start utilizing reporting tools to answer questions like 'How was my server's performance last week?'. At this point, the data is probably still coming from production systems, and the reporting itself can actually hinder the performance the company wants to report on. As the company matures, questions quickly arise not only about past performance, but also about how performance is trending and how well those systems are currently performing. Dashboards and other data visualization tools can report both trending and current performance. By this time, most companies would have at least started a rudimentary data warehouse for performance reasons.
Many companies stop there at present performance. It takes a lot of effort to move into predictive analytics because more data-oriented skills are needed. Answering with confidence about future performance based on historical trends is the ultimate goal of BI.
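To make the three questions concrete, here is a minimal Python sketch. The response-time numbers are made up for illustration, and the function names are hypothetical. The straight-line fit is only one simple assumption about how 'what will happen' might be estimated; real predictive work would lean on a proper statistics tool.

```python
from statistics import mean

# Hypothetical daily average response times (ms) for one server, oldest first.
response_ms = [120, 125, 130, 128, 135, 140, 145]


def what_happened(history):
    """Past: average performance over the whole period."""
    return mean(history)


def what_is_happening(history):
    """Present: the most recent observation."""
    return history[-1]


def what_will_happen(history, days_ahead=1):
    """Future: extend a least-squares straight line fitted to the history."""
    n = len(history)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(history)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * (n - 1 + days_ahead)


print("Last week's average:", what_happened(response_ms))
print("Right now:", what_is_happening(response_ms))
print("Tomorrow (trend estimate):", what_will_happen(response_ms))
```

The first two functions only need reporting and dashboard style access to the data; the third is where statistical skills start to matter, which is why many companies stall before reaching it.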
Data Warehousing
Store data outside of application/normal business environment (i.e. ERP systems)
Specific for reporting/analytics
Modeling Styles
3NF (normal database modeling)
Data Marts (aka star schemas)
Data Vault (hybrid 3NF/Data Mart)
Anchor Modeling (6NF)
Any good BI program starts with a data warehouse. You can think of a warehouse as a specialized database that offloads historical data from your production environment. It does a lot more than that as well. Unlike a production system, a data warehouse stores deltas, the changes in the data set that would otherwise be lost forever. For example, if you have a table that stores an employee's first name, the production system would only store the current value. If an employee named Robert changed his name to Bob and then to Sally, your production database would never remember the first two values. The data warehouse would not only store all three values, but also the time each change occurred and how long each value was valid.
The other neat thing about data warehouses is how they integrate data from across an organization. If a company has an ERP, an online website, and an external data set, the warehouse can integrate those three systems' data into one cohesive data set.
There are many different modeling styles for a data warehouse. The traditional methodologies are very similar to what is used in an ideal database environment. Third normal form is the standard normalization you would see in a typical database, while data marts move the data into a format that is better suited for reporting and analysis by end users. In the Data Warehousing 2.0 line, there is data vault modeling, which is a hybrid of the first two, and anchor modeling. Anchor modeling is interesting in that it is actually sixth normal form and can get pretty complex.
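The name-change example can be sketched in a few lines of Python. This is only an illustration of the idea of keeping every version with a validity period (roughly what data mart modelers call a Type 2 slowly changing dimension); the class, function names, and dates are all hypothetical and not taken from any particular warehouse product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class NameVersion:
    employee_id: int
    first_name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


def record_change(history: List[NameVersion], employee_id: int,
                  new_name: str, changed_on: date) -> None:
    """Close the current version (if any) and append the new one."""
    for row in history:
        if row.employee_id == employee_id and row.valid_to is None:
            row.valid_to = changed_on
    history.append(NameVersion(employee_id, new_name, changed_on))


def name_as_of(history: List[NameVersion], employee_id: int,
               when: date) -> Optional[str]:
    """Return the name that was valid on a given day, or None if unknown."""
    for row in history:
        if (row.employee_id == employee_id
                and row.valid_from <= when
                and (row.valid_to is None or when < row.valid_to)):
            return row.first_name
    return None


# Made-up dates for the Robert / Bob / Sally example.
history: List[NameVersion] = []
record_change(history, 1, "Robert", date(2010, 1, 4))
record_change(history, 1, "Bob", date(2011, 3, 1))
record_change(history, 1, "Sally", date(2011, 11, 15))

for row in history:
    print(row)
```

A production table would have overwritten the name twice and ended up with a single row; here all three versions survive, and `name_as_of` can answer what the name was on any given day.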
Data Warehousing
Databases
MySQL, Postgres, etc
Columnar Data Stores
Infobright*, LucidDB, InfiniDB*, etc.
Hybrid Data Warehouse Databases
Greenplum* (both RDBMS and Columnar)
NoSQL
Hadoop, CouchDB, MongoDB, etc.
*Hardware and/or Software limitations in community editions
There are actually quite a few options for warehousing in OS, from more traditional databases that work well with 3NF to columnar data stores that are highly optimized for data marts. NoSQL has also become an option because it can store the unstructured and semi-structured data that could never be stored in a normal warehouse environment.
RDBMS vs Columnar
Source: http://www.calpont.com/column-oriented-database-bi
Columnar data stores basically flip the data from rows into columns. In a typical row-based database, filtering on the last name column means scanning columns one through three of every row. In a columnar store, each column is stored as its own row, so the last name row can be filtered on directly and the other aggregations can be performed as fast as those rows can be read.
The other neat thing about columnar databases is that many of them are smart enough to learn how users query their data sets. They can actually trim and grow their indexes accordingly so that users will get huge performance gains.
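The row versus column layout can be sketched with plain Python lists and dictionaries. This is a toy illustration with made-up data and hypothetical function names, not how Infobright, LucidDB, or any other engine is actually implemented; it leaves out compression and indexing entirely.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "first_name": "Ann", "last_name": "Smith", "sales": 100},
    {"id": 2, "first_name": "Bob", "last_name": "Jones", "sales": 250},
    {"id": 3, "first_name": "Cy", "last_name": "Smith", "sales": 75},
]


def to_columns(records):
    """Flip the rows into a column-oriented layout: one list per column."""
    return {key: [record[key] for record in records] for key in records[0]}


def total_sales_row_store(records, last_name):
    """Walk whole records, even though only two fields are needed."""
    return sum(r["sales"] for r in records if r["last_name"] == last_name)


def total_sales_column_store(columns, last_name):
    """Read only the last_name and sales columns; the others are never touched."""
    matches = [i for i, value in enumerate(columns["last_name"]) if value == last_name]
    return sum(columns["sales"][i] for i in matches)


columns = to_columns(rows)
print(total_sales_row_store(rows, "Smith"))
print(total_sales_column_store(columns, "Smith"))
```

Both functions return the same answer. The difference is in how much data has to be read to get there, which is what matters once the table is wide and has millions of rows.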
NoSQL?
Not Only SQL
Unstructured/semi-structured data
Huge (multi-terabyte to petabyte+) data sets
Source: http://www.information-management.com/specialreports/20040622/1005301-1.html
NoSQL tools are able to store 'documents' in a highly compressed way so that PB+ data sets can be quickly filtered through. This is the tool that warehousers have wanted for years, but it is only now starting to go mainstream! Unstructured and semi-structured data sets could not easily be searched until now. It's easily the proverbial gold mine. Look at Facebook or Twitter and you can see where this could be a huge advantage for understanding customer bases.
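To show what 'semi-structured' means in practice, here is a small Python sketch using JSON documents that do not share a fixed set of fields. The documents and the `find` helper are invented for illustration; this is not the query API of Hadoop, CouchDB, or MongoDB, only the general idea of filtering records that have no common schema.

```python
import json

# Made-up social-media style documents; note that the fields differ per document.
raw_posts = [
    '{"user": "ann", "text": "Love the new release", "tags": ["release"]}',
    '{"user": "bob", "text": "Checkout page is slow", "location": {"city": "Raleigh"}}',
    '{"user": "cy", "text": "Great support", "tags": ["support", "praise"], "likes": 12}',
]
documents = [json.loads(post) for post in raw_posts]


def find(docs, field, predicate):
    """Return documents that contain `field` and whose value satisfies `predicate`."""
    return [doc for doc in docs if field in doc and predicate(doc[field])]


praised_support = find(documents, "tags", lambda tags: "support" in tags)
located = find(documents, "location", lambda loc: loc.get("city") == "Raleigh")

print([doc["user"] for doc in praised_support])
print([doc["user"] for doc in located])
```

A relational table would need a column for every possible field up front; here a document that lacks a field is simply skipped by the filter.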
Data Integration
Syncing data across systems
Includes:
ETL (Extract, Transform, Load)
MDM (Master Data Management)
EAI (Enterprise Application Integration)
EII (Enterprise Information Integration)
Where data warehouses are the backend storage system, data integration acts as the plumbing. DI moves data from source systems into a warehouse or other application. There are many types of DI, from ETL which is moving, cleaning, and loading data, to MDM, which is moving and syncing data across systems, and more.
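The extract, transform, load steps can be sketched using only the Python standard library. This is a hand-rolled illustration under stated assumptions: the CSV content, the `orders` table, and the cleaning rules are all made up, and tools like Talend and Kettle do this through their own designers and engines rather than code like this.

```python
import csv
import io
import sqlite3

# Made-up source extract with messy customer names and a missing amount.
SOURCE_CSV = """order_id,customer,amount
1, Acme ,100.50
2,globex,
3,ACME,20
"""


def extract(text):
    """Extract: read raw records from the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: trim and normalize names, convert types, drop incomplete rows."""
    cleaned = []
    for record in records:
        amount = record["amount"].strip()
        if not amount:
            continue  # assumption: rows without an amount are rejected
        cleaned.append(
            (int(record["order_id"]), record["customer"].strip().title(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse database
load(transform(extract(SOURCE_CSV)), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()
print(totals)
```

Because the two spellings of the customer name are normalized during the transform step, the warehouse can total them as one customer, which is the kind of cleanup the plumbing is there to do.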
There are two big OS DI tools, Talend and Pentaho K.E.T.T.L.E.
Talend
Data Management Tool Suite
ETL
MDM
Data Profiling
Data Quality
Code generator
Eclipse based
Extensible plugin architecture
Pentaho K.E.T.T.L.E.
Kettle Extraction, Transport, Transformation, and Loading Environment
Focus on ETL
Extensible plugin architecture
Engine based
Reporting
Focus: Historical Analysis
Now that the back-end has been covered, we can start climbing the pyramid of front-end tools. Reporting is the start of this climb and usually where most organizations start since it is the easiest to implement.
Reporting Options
Features compared: MDX, Pivot Table, Charting, SQL, Other Sources*, Drill Through, Parameterized
*Other Sources: Flat Files, NoSQL, etc.
Tools compared:
BIRT
Pentaho
JasperReports
SQL Power Wabit
Saiku
There are quite a few options out there, and these are some of the more popular ones. The comparison is only taking into account the actual reporting tool and not their server-side component, if applicable.
BIRT is an Eclipse-based tool, so if you're using Eclipse you may want to consider it. Pentaho's Report Designer and JasperReports are stand-alone tools. All three use a style of design known as banded reports, where data elements are essentially dragged and dropped onto a layout canvas. All three have server-side components, and all three report designers can embed reports into existing applications (e.g. web apps, Java apps).
The neat thing about Saiku and SQL Power Wabit is that they are both built to handle OLAP cubes as well as normal reporting. Saiku's Interactive Reporting tool is still in beta, but is looking very impressive. Saiku is a thin-client analytics tool that can be embedded in BI servers or live as its own stand-alone tool.
BIRT Example
Some charts generated in BIRT.
Here is a screenshot of Pentaho's Report Designer. Each line of the report is the 'banded row' mentioned earlier.
Visualization
Focus: Trending and Present
Visualization is the next area of our tour. In a nutshell, visualizations take very complex data and make it very easy to interpret and take action.
This dashboard is from Stephen Few's Information Dashboard Design book. Notice how it is not flashy, with muted colors that really help to draw attention to the bright red circles. There is a lot of information packed into this space. From trends, to current performance and pacing, it's all here and in plain sight. Usually dashboards like this will also have a drill through ability. For example, clicking on an alert will take you to a more detailed report or view of the data so that a decision can be made on how to react.
Visualizations can also be fun, and can even describe themselves. XKCD has quite a few such examples.
Notice how much information is packed into such a small space, yet can still be understood.
Pentaho CDE/CDF
Dashboard framework and editor built into Pentaho BI Server
Community developed; uses open web languages (JavaScript, HTML, etc.).
There is really only one OS tool that I have been able to find that builds dashboards akin to Few's. Pentaho's Community Dashboard Framework and Editor was designed by WebDetails and adopted by Pentaho. It is still a stand-alone library.
This is a sample dashboard that WebDetails built for a training course on the tools. Notice that the same principles used by Few are applied here.
Statistics/Predictive Analytics
Focus: All relevant data used to predict outcomes
We've reached the top of our tour of BI. Statistical and predictive analysis is the goal, and OS provides quite a few options.
Statistics/Predictive Analytics
R: stats oriented
Weka: machine learning oriented
RapidMiner: mixed
Originally YALE
Weka and R Plugins
Like SAS Enterprise Miner
Here's a pic of RapidMiner at work.
BI From Reporting to Statistical Analysis
Areas covered: ETL, Metadata, Reporting, Dashboards, OLAP***, Statistics, Automated Decisions
Suites compared: Jaspersoft*, Pentaho**, SpagoBI* **
* Utilizes Talend ETL
** Utilizes Weka Data Mining
*** All use Mondrian for OLAP, with different front ends
Of note, there are three companies providing an OSBI suite of tools. The biggest difference between them is their communities. Jaspersoft's and SpagoBI's suites are not totally in their control because they have licensed Talend for their ETL and Metadata tools.
All three use Pentaho's Mondrian OLAP engine.
Pentaho and SpagoBI license the use of Weka as part of their suite of tools.
Shameless Plug
RTP Pentaho User Group
On LinkedIn (soon to be also on Meetup)
Meets quarterly
Yes, I have to put in a shameless plug. I am the Community Leader for the local Pentaho User Group. We are currently on LinkedIn (www.linkedin.com/groups/RTP-Pentaho-User-Group-3674498) and will soon be on Meetup. We're currently meeting quarterly and are looking for speakers.