Behind the Scenes of BICC in the Cloud
A Whitepaper
Rick F. van der Lans
Independent Business Intelligence Analyst
R20/Consultancy
January 2015
Sponsored by Inergy Analytical Solutions
Copyright © 2015 R20/Consultancy. All rights reserved. Inergy is a trademark of Inergy Analytical Solutions BV. Trademarks of companies referenced in this document are the sole property of their respective owners.
Table of Contents

1 Management Summary
2 Why Business Intelligence in the Cloud?
3 BI in the Cloud: Five Different Levels of Unburdening
   Level 1: Hardware in the Cloud
   Level 2: Database in the Cloud
   Level 3: Data Warehouse in the Cloud
   Level 4: BI Solution in the Cloud
   Level 5: BICC in the Cloud
4 The Path from Data to Business Insights
   4.1 Data is Added to a Transaction System
   4.2 Data is Transmitted to the Cloud
   4.3 Data is Stored in a Holding Area
   4.4 Data is Loaded in the Data Warehouse
   4.5 Data is Made Available for Reporting and Analytics
   4.6 Data is Turned into Business Insights
   4.7 Continuous Activities
5 Inergy’s BICC in the Cloud
   5.1 Overall Description of Inergy
   5.2 Data is Added to a Transaction System
   5.3 Data is Transmitted to the Cloud
   5.4 Data is Stored in a Holding Area
   5.5 Data is Loaded in the Data Warehouse
   5.6 Data is Made Available for Reporting and Analytics
   5.7 Data is Turned into Business Insights
6 Two Case Studies
About the Author Rick F. van der Lans
About Inergy Analytical Solutions
1 Management Summary
Cloud Computing and Business Intelligence – BI in the Cloud, SaaS BI, managed BI services, and cloud analytics are all terms that somehow relate to the deployment of business intelligence (BI) in the cloud. Cloud computing in general has been widely accepted by organizations. Therefore, it was to be expected that vendors would offer services to run business intelligence systems and data warehouse systems in the cloud.
Why BI in the Cloud? – An organization’s prime interest in BI systems is to use their data for reporting and analytics with the intention to support and improve their decision‐making processes. Their interest is not in designing, developing, testing, operating, and maintaining database servers, ETL programs, and so on. Still, if organizations have decided to run their BI systems on premise, those are the activities they are responsible for and for which they need specialists. The key reason for organizations to invest in cloud computing is to unburden themselves. This definitely applies to migrating BI to the cloud. By moving the entire BI system or some of its components to the cloud, most of the BI tasks are outsourced to the cloud vendor.
Five Levels of BI in the Cloud – Many vendors offer some form of BI in the cloud. These different forms can be classified into five levels, where the first offers the lowest level of unburdening and the last the highest:
Hardware in the Cloud
Database in the Cloud
Data Warehouse in the Cloud
BI Solution in the Cloud
BICC in the Cloud
The Whitepaper – BICC in the Cloud (BICCC) offers the highest level of organizational unburdening. The BICC in the Cloud vendor hides the complexities of designing, developing, operating, managing, and maintaining a BI environment. Because a BICCC vendor hides so many of their tasks, it may not be clear what they do exactly and how much work it involves. Therefore, this whitepaper presents a view behind the scenes. It describes all the activities performed by such vendors. It’s like getting a glimpse of how airports get all the luggage from the airplanes to the right belt in a baggage claim area.

The whitepaper describes the five different forms of BI in the Cloud with their respective levels of organizational unburdening. For each level, the responsibilities and activities that are taken over by the vendor of the cloud service are explained. In addition, a detailed description of BICC in the Cloud is given by following the path that data travels from its inception in a transaction system all the way up to when it’s presented in a report that hopefully leads to valuable business insights. The whitepaper ends with a description of the services and technologies offered by the BICC in the Cloud vendor Inergy Analytical Solutions.
2 Why Business Intelligence in the Cloud?
Cloud Computing – Cloud computing doesn’t need an introduction or explanation anymore. Books and articles on this topic can fill entire cabinets1,2. Also, many studies have shown the commercial success of cloud computing. For example, Salesforce.com3 predicts that by 2020 the cloud computing market will exceed 241 billion US$. Depending on their needs, organizations can now choose between many different forms of cloud, including private, public, and hybrid clouds. It’s hard to imagine that there still are organizations that do not use cloud computing in some form. Even families use the cloud when they share their files in cloud storage services such as Dropbox or OneDrive.
BI in the Cloud – Organizations can rent storage and processing power in the cloud, they can place their internally‐developed transaction systems in the cloud, they can operate their database servers in the cloud, or migrate their ERP system to the cloud. New applications, such as the popular Salesforce and NetSuite, are not even available for on‐premise usage but run in the cloud only. It’s hard to find a form of computing that cannot be migrated to the cloud. Therefore, it wasn’t unexpected that it became possible to run business intelligence (BI) and data warehouse systems in the cloud. Today, many vendors offer some form of BI in the Cloud (sometimes referred to as SaaS BI or Cloud BI).

Designing, developing, running, managing, tuning, and changing BI systems requires many software components and a wide range of analytical and technical skills. As with every migration to the cloud, when a BI system is migrated to the cloud, many of those software components and skills no longer have to be available on‐premise. The cloud gives organizations a choice: they can place their hardware, databases, and applications in the cloud or run them on‐premise. In the latter case, the organization itself is responsible for every aspect, including installing new versions of tools periodically, running backup/restore jobs, tuning and optimizing modules, extending storage, and so on. This requires in‐depth knowledge of all these aspects.

Let’s use the analogy of a car as an example. Nowadays, the way on‐premise BI systems are developed and used is like buying all the components of a car, assembling them to form a car, tuning it, driving it, maintaining it, tuning it even more, replacing faulty components, and so on. BI in the Cloud is more like a rental car. The customer indicates which one he wants, and he uses it when he wants. When the car starts malfunctioning, the rental car company takes care of it.
Elasticity Through Virtualization – Cloud computing means virtualization. Cloud technology hides many aspects of an application or database server. It’s as if there is an unlimited amount of storage and computing power available. Organizations don’t see which servers or storage devices are being used, they don’t see that their application has been moved to a more powerful machine, or that extra internal memory has been made available. The cloud is like one large environment with unlimited resources. The real configuration has been virtualized, just as internal memory was virtualized years ago.

With virtualization comes a popular advantage of cloud computing: elasticity. If an organization requires more storage or more computing power, they just have to switch it on. There is no need to order extra machines, install them, test them, and so on. It’s as if the available computing power can be extended like a rubber band can be stretched. Plus, if the need for more computing power is only temporary, the elasticity concept kicks in, and the extra resources can be returned.
1 M.J. Kavis, Architecting the Cloud: Design Decisions for Computing Service Models (SaaS, PaaS, and IaaS), John Wiley, January 2014.
2 J. Hurwitz et al., Cloud Computing for Dummies, For Dummies, October 2009.
3 Salesforce.com, A Complete History of Cloud Computing, January 2012; see http://www.salesforce.com/uk/socialsuccess/cloud‐computing/the‐complete‐history‐of‐cloud‐computing.jsp
This makes it easy for customers when they have to deal with scheduled and unscheduled short peak workloads. The elasticity aspect of cloud computing is like hitting a magic button in your car that suddenly adds ten more seats, allowing you to drive the entire soccer team to the match. And when you have delivered those players safely back home, you hit the button again, and your car has four seats again.
The Cloud and Unburdening the Organization – Although elasticity is an important aspect of cloud computing, its main benefit is unburdening. Cloud is all about unburdening organizations. Take a BI environment as an example. As indicated, designing, developing, running, managing, tuning, and changing BI systems requires many software components and a wide range of analytical and technical skills. This is a heavy burden for an organization. Specialists must be employed or hired; machines must be acquired and installed; data center space, computing power, desk space, and so on, must be made available. These specialists have to be trained when new products are installed. They must be trained to keep their analytical skills on par, and so on. Running a BI environment on premise involves a major investment.

Let’s go back to the car analogy again. When you need to get from San Francisco to Los Angeles, you can use your own car, you can rent a car, or you can fly or take a train. Driving your own car means that you’re responsible for making sure the car is in the right condition to drive 500 miles. Tires ok? Oil ok? Insurance ok? Do the wipers and lights still work fine? Is the air pressure of the spare tire ok? Also, when you damage your car during the trip, you have to organize that the car is towed to a garage, and you’re responsible for organizing another car. In addition, you must do all the driving yourself, so when you arrive you may be a little tired. During the drive, you must take care of yourself as well: eat, drink, rest.

When you rent a car, the rental company checks the state of the car. You only have to take care of the food and drinks. And if you do damage the rental car, they will organize that the damaged car is towed away, and they will bring a replacement car. You only have to fill in a form. It’s only the driving you have to do yourself. When you take a flight to Los Angeles, besides buying a ticket, you don’t have to do much. You just have to be sure you arrive on time at the airport, and they will take care of the rest, including the food and drinks and the navigation, and you even arrive more relaxed than when you had driven a car.

So, the difference between using your own car versus renting a car or flying is the level of unburdening. Flying offers the highest level of unburdening. When you drive your own car, there is no unburdening. It’s all up to you! Car rental companies offer a medium level of unburdening. With each option you experience a different level of unburdening, but, as we all know, each comes with a different price tag as well. So, it’s not about migrating BI systems to the cloud, it’s not about the cloud technology itself, in fact, it’s not even about elasticity: cloud computing is about unburdening organizations, and this is especially true for business intelligence systems in the cloud.
3 BI in the Cloud: Five Different Levels of Unburdening

Migrating a BI environment to the cloud can be done in many different ways. An organization can decide to migrate the entire BI environment unchanged from on‐premise machines to machines in the cloud, so that all the activities related to the servers and storage units are outsourced. In this case, the organization is still responsible for installing and managing the database servers, running the ETL jobs,
and so on. On the other hand, one can also migrate the hardware, the software, and the management of the BI environment to the cloud. This reduces the amount of work to be done on‐premise and the required number of BI and technical specialists. In conclusion, for BI systems a wide range of alternatives exists, each with a different level of unburdening; the first example offers less unburdening than the second. The following five levels of unburdening organizations with their BI environment can be identified; see also Figure 1:
Hardware in the Cloud
Database in the Cloud
Data Warehouse in the Cloud
BI Solution in the Cloud
BICC in the Cloud
Figure 1 Five different forms of BI in the Cloud with their respective levels of organizational unburdening.
This section describes these five levels in ascending order of unburdening.
3.1 Level 1: Hardware in the Cloud

Running machines on‐premise in your own data center is not just a matter of buying and installing the machines. It involves many complex and time‐consuming tasks, such as:
Organizing redundant or backup power supply
Setting up and managing a fallback environment
Organizing redundant data communication connections
Establishing environmental controls, such as air conditioning and fire suppression

In addition, customers have to consider aspects such as the proximity of the data center to available power grids, telecommunication infrastructures, networking services, transportation lines, and emergency services. These aspects can all affect costs, risk, security, and other factors.
An alternative to running your own data center is to use computing and storage resources in the cloud. Many vendors offer such resources; popular ones are Amazon and Google. Instead of having the machines and storage devices on premise, they reside at the cloud vendor’s data centers. Organizations are free to determine what they want to run on those machines: database servers, applications, reporting tools, ETL tools, anything. The cloud vendor is responsible for making the required computing and storage resources available. Requesting more resources is a matter of pushing a button.

This form of BI in the cloud is called Hardware‐in‐the‐Cloud. Here, all the activities described in this section are handled by the cloud vendor, which clearly unburdens the customers. However, many of the typical BI tasks remain unchanged. The organization is still responsible for designing and managing the entire architecture, for scheduling ETL jobs, solving problems when a crash happens, database server optimization and tuning, backup and restore of the data warehouse, and so on.
3.2 Level 2: Database in the Cloud

Vendors of a Database‐in‐the‐Cloud service offer fast SQL database servers running in the cloud. Examples are Rackspace with MySQL and CloudPostgres with PostgreSQL. These database servers, and the hardware they run on, have been tuned and optimized by experts. When performance, scalability, or concurrency issues arise, those experts will intervene and try to solve the problem. Customers do not have to care about backup/recovery issues, expansion of required disk storage, enlarging the database buffer, installing and upgrading to new versions of the database server, and unloading and loading of large portions of the database in case of upgrades. All these activities are handled by the vendor. Most of the activities formerly done by internal DBAs are now handled by the vendor.

Commonly, Database‐in‐the‐Cloud vendors deliver all the services that a Hardware‐in‐the‐Cloud vendor offers. For example, customers don’t need to worry about lack of disk storage capacity or sufficient internal physical memory. Customers using a Database‐in‐the‐Cloud service are still responsible for safely transporting the data to the cloud, for transforming and cleansing the data, efficiently loading the data, developing and maintaining all the reports, scheduling and managing ETL jobs, designing star schemas, updating report definitions, and so on.

The workload generated by a BI environment is completely dominated by queries. The database servers offered by a Database‐in‐the‐Cloud vendor are generic products. They are designed by the manufacturer, and set up and tuned by the service vendor, to run a mixed workload consisting of queries and transactions. So, they are not perfectly set up for a pure BI environment, and that may cause performance challenges.

The amount of outsourced activities with Database‐in‐the‐Cloud should not be underestimated. Just look at how many DBAs an average company has on its payroll to keep all the databases up and running. Database‐in‐the‐Cloud does offer a respectable level of unburdening to organizations.
3.3 Level 3: Data Warehouse in the Cloud

The next level of unburdening is Data‐Warehouse‐in‐the‐Cloud. This form of BI in the Cloud resembles Database‐in‐the‐Cloud. Here, a database server is made available as a service to customers. The
difference is that the platform offered has been optimized to support a BI workload, and nothing else. Examples are Amazon with Redshift, Kognitio with WX2, and Teradata with their Teradata appliances. These services are better suited for a BI workload for two reasons. First, the database servers in use are analytical SQL database servers. These products are designed and developed for reporting and analytics, like big trucks have been designed and developed for transporting heavy loads. Many private and public benchmarks have shown that these analytical SQL products can really handle larger BI workloads and are able to run queries faster4. Second, these database servers, and the hardware they run on, have been tuned and optimized by experts to run at their best. So, all the parameters are set to make reports run fast. The big difference with Database‐in‐the‐Cloud is raw query performance.
3.4 Level 4: BI Solution in the Cloud

Vendors of BI‐Solution‐in‐the‐Cloud services offer a full and integrated set of tools to develop and run a complete BI system. Everything is there, including ETL tools, database servers, and tools for reporting and analytics. The vendor guarantees that all the products work together seamlessly, are installed properly, and are tuned and optimized. BI‐Solution‐in‐the‐Cloud services can be divided into two groups.

The first group offers all the hardware and all the tools required to develop and run a BI system. Some of these services dictate which tools can be used. In fact, many of them go even further: they prescribe tools they developed themselves. This means that existing reports and ETL programs developed by customers can’t be migrated to these proprietary environments; redevelopment is always required. Other services do allow customers to choose their own preferred tools. In both cases, the BI systems themselves, including the database structures, ETL logic, and reports, still have to be designed, developed, and tested. The customer determines whether he leaves this to the vendor or not.

The second group of BI‐Solution‐in‐the‐Cloud services offers pre‐built ETL programs and pre‐built reports. For example, GoodData offers a wide range of pre‐built reports. If customer data is stored in Salesforce.com or NetSuite, GoodData can extract that data from these cloud applications, load it in the data warehouse, and make the pre‐built reports available. In fact, pre‐built reports can be available to customers a few hours after the customer has given the BI‐Solution‐in‐the‐Cloud vendor access to their data in these cloud applications. There are also BI‐Solution‐in‐the‐Cloud vendors that do the same with data stored in an SAP system; this data can be available for reporting and analytics within hours. In these examples, there is no need for the customers to develop or design anything. These vendors are responsible for tuning and optimizing the machines and database servers, they will schedule and manage the ETL jobs, and so on. When customers require other reports, ones that are not pre‐built yet, these services become identical to the first group.

BI‐Solution‐in‐the‐Cloud offers a high level of unburdening, but the drawback can be reduced flexibility. Maybe it’s not possible to use other BI tools to access the data. In addition, if such a vendor closes shop, there may not be a chance to take the solution and port it to another environment.
4 Transaction Processing Performance Council, TPC‐H – Top Ten Performance Results, December 2014; see http://www.tpc.org/tpch/results/tpch_perf_results.asp
3.5 Level 5: BICC in the Cloud

Many definitions of Business Intelligence Competency Center (BICC) exist. TechTarget defines it as5: “A BICC is a team of people that, in its most fully realized form, is responsible for managing all aspects of an organization's BI strategy, projects and systems.” This and other definitions clearly indicate that a BICC is about people. It consists of specialists who know how to analyze reporting needs, design data warehouses, and run ETL jobs, and who are responsible for all the operational activities of a BI environment, including those when a problem occurs.

If a BICC is about people, then the same applies to a BICC‐in‐the‐Cloud service. It not only unburdens an organization from activities related to installing and tuning hardware and software tools, and from activities related to design, development, and maintenance, but also from the countless tedious operational activities that make a BI environment run smoothly and flawlessly. A BICC‐in‐the‐Cloud service takes most of the development and operational work out of the hands of the organization. If, for example, a load program crashes unexpectedly, the vendor has to have the procedures and specialists in place to fix the problem; they have to monitor data load speeds to identify possible performance degradation (if loading is taking longer and longer, a solution must be implemented); and when new reports must be developed or changed, they supply the specialists to interview the users, implement the new reports, and train the users in how to use the reports efficiently.

What makes BICCC special is that it offers full support for operational activities executed by human specialists. So, it’s not only software and hardware elasticity, but also specialist elasticity. When, unexpectedly, more DBAs, ETL programmers, or report developers are required, the BICCC vendor can deliver. They have a pool of specialists available to choose from. Such a pool of specialists is very hard to justify financially in an on‐premise situation, because the costs can’t be distributed over several BI environments. A vendor of BICCC can distribute them.

Note: More forms of cloud services exist, such as Applications‐in‐the‐Cloud and Open‐Data‐in‐the‐Cloud, but because they’re not relevant to BI, they fall outside the scope of this whitepaper.
5 See http://searchbusinessanalytics.techtarget.com/definition/business‐intelligence‐competency‐center‐BICC
3.6 Summary

Table 1 contains a high‐level overview of the activities handled by the five levels of BI in the Cloud. The levels on the right‐hand side clearly unburden organizations more extensively.

| Activities handled by cloud vendor | Hardware in the Cloud | Database in the Cloud | Data Warehouse in the Cloud | BI Solution in the Cloud | BICC in the Cloud |
|---|---|---|---|---|---|
| Installing, upgrading, and tuning computing power and storage capacity | Yes | Yes | Yes | Yes | Yes |
| Installing, upgrading, and tuning database servers for any type of workload | | Yes | Yes | Yes | Yes |
| Installing, upgrading, and tuning analytical SQL database servers | | | Yes | Yes | Yes |
| Installing, upgrading, and tuning all tools | | | | Yes | Yes |
| Delivering pre‐built reports and analysis | | | | Yes | |
| Delivering 24x7 operational activities | | | | | Yes |
| Analysis, modeling, and development of new reports, dashboards, analytics, and ETL programs | | | | | Yes |

Table 1 A high-level overview of the activities handled by the five levels of BI in the Cloud.

When a level of unburdening has to be selected, costs are weighed against the level of unburdening. Commonly, the higher the level of unburdening, the higher the costs of the cloud solution. But these costs must be weighed against designing, developing, and running a BI environment on‐premise and having to have BI specialists available. This comparison is difficult to make, because determining the full on‐premise costs is complex. Many hidden costs exist, ranging from clear‐cut costs, such as acquiring new hardware, up to the training costs for a BI specialist to improve his interviewing skills.

Note: This classification of five levels suggests that every vendor fits in exactly one level. This is not always the case. Commercial vendors do exist that combine multiple levels of unburdening or offer a mixed form.
4 The Path from Data to Business Insights

This section discusses in detail the services offered by a BICCC vendor by following the path that data travels. Data starts its life when it’s entered in a transaction system. Next, it follows a long and winding road before it becomes part of a report that leads to valuable business insights. In other words, it’s a long path from being produced to being consumed. If we follow this path of data from the viewpoint of a BICCC vendor, it passes the following stages:
Data is added to a transaction system
Data is transmitted to the cloud
Data is stored in a holding area
Data is loaded in the data warehouse
Data is aggregated for reporting purposes
Data is made available for reporting and analytics
Data is interpreted by business users

For each of these stages, there are technical aspects and service aspects. The next subsections describe these aspects in detail for each of the seven stages.
4.1 Data is Added to a Transaction System

Every day, new data is entered in transaction systems. Users enter data in internally‐developed systems, ERP systems, or cloud‐based CRM systems; customers enter new data through the company’s website; and sensors in machines enter data automatically. The life of data starts when it’s entered in one of these transaction systems. Nowadays, organizations also deploy data they haven’t produced themselves. For example, they use customer data coming from Dun & Bradstreet, they use data from social media networks, or they access open data sources. All that data is not created by the organization itself, but it can be regarded as new data.

This first stage, in which new data is entered in transaction systems, falls outside the scope of a BICCC vendor. It’s the customer’s responsibility to make sure that these applications run correctly. The BICCC becomes involved after new data has been entered.
4.2 Data is Transmitted to the Cloud
Technical Aspects – Somehow, new data must be extracted from the transaction systems and transmitted to the cloud. Technically, there are many different ways to copy the data. These are the four most popular solutions (a short sketch contrasting the first two follows the list):
Full copy: With full copy all the data is extracted from the transaction systems and copied to files that are transmitted to the cloud vendor using a pull or push mechanism. All the data means all the customers, invoices, call detail records, and so on. Making a full copy is technically simple, because no complex logic is required to identify the data that has been updated, inserted, or deleted since the previous extraction. Creating full copies can take quite some time; therefore, this process can only take place infrequently, for example once every 24 hours.
Incremental copy: With this solution only updated, inserted, or deleted data is written to files. Transmitting files that contain only the changes minimizes network traffic. In addition, because less data is processed, more frequent processing may be possible, such as once every 8 hours.
Trickle copy: Trickle copy resembles incremental copy; the difference is that the frequency of extracting and transmitting data to the cloud is higher. So, instead of transmitting new data a few times per day, files are processed every half hour or hour. Trickle copy requires an extract technology that is tightly integrated with the transaction system itself.
Real‐time copy: With real‐time copy, when new data is entered, it’s transmitted to the cloud almost instantaneously. So, a constant stream of data flows from the transaction systems to the cloud. For this form of copying, enterprise service buses and message queues are commonly used.
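To make the difference between the first two approaches concrete, here is a minimal sketch. It assumes a DB-API-style connection and a hypothetical last_modified column on the source table; the function names and the CSV output format are illustrative, not a prescribed implementation.

```python
# Minimal sketch: full copy versus incremental copy of one source table.
import csv
import sqlite3  # stand-in for any DB-API-compatible database driver
from datetime import datetime

def full_copy(conn, table, out_path):
    """Extract every row of the table into a file for transmission."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in conn.execute(f"SELECT * FROM {table}"):
            writer.writerow(row)

def incremental_copy(conn, table, since, out_path):
    """Extract only the rows changed since the previous extraction run.
    Note: capturing deletes needs extra machinery (e.g. a delete log)."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE last_modified > ?", (since,))
        for row in rows:
            writer.writerow(row)
    return datetime.now()  # keep this as the watermark for the next run
```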
Network bandwidth plays a crucial role when determining how to extract and transmit data. For example, with a 10 Mb/s connection roughly 1 MB can be transmitted per second, which equals approximately 3,600 MB per hour and 86.4 GB per day. A 200 Mb/s connection reaches 72 GB per hour and 1.7 TB per day. So, if a network with a 200 Mb/s bandwidth is available and all the data to be transmitted exceeds 2 TB, the full copy approach would not work: it would take more than 24 hours to transmit all the data. Network bandwidth therefore plays a big role in determining the right form of copying.
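The arithmetic above is easy to reproduce. The sketch below estimates transfer time; the 0.8 efficiency factor is an assumption that matches the rough "10 Mb/s moves about 1 MB/s" rule of thumb used in the text.

```python
# Back-of-the-envelope transfer-time check for a given line speed.

def transfer_hours(volume_gb, bandwidth_mbit_s, efficiency=0.8):
    """Hours needed to ship volume_gb over a bandwidth_mbit_s line."""
    mbytes_per_s = bandwidth_mbit_s / 8 * efficiency   # Mb/s -> MB/s
    return volume_gb * 1024 / mbytes_per_s / 3600

print(transfer_hours(86.4, 10))   # ~24 h: one day's capacity at 10 Mb/s
print(transfer_hours(2048, 200))  # ~29 h: 2 TB doesn't fit a 24 h window
```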
Service Aspects – A BICCC must support as many solutions for copying as possible. The preferred solution depends on the requirements of the customer and the characteristics of their data. For example, if users need to see data not older than a few minutes, then a full or incremental copy is not useful, and if the amount of new data created per day is relatively small, a full copy may be simpler than an incremental copy. Whatever the best solution is, a BICCC must offer the technology and the services fitting the customer requirements.

Things can go wrong when new data is transmitted to the cloud. In fact, this is the most fragile stage. The solution must be able to deal with, for example, transmission failures, file structure corruption, unavailability of the network, network crashes, security breaches, and so on. Compared to twenty years ago, availability and reliability have improved dramatically, but things still go wrong. Most of these problems can’t be prevented, and not all of them can be solved purely technically afterwards. What if it’s time to transmit the new file, and it’s not there? Or, what if the file has been transmitted, but after its arrival it contains an incorrect structure or set of characters? What if a portion of the file hasn’t arrived? In many cases, just rerunning the job is not the solution. Specialists have to come in. Someone has to check what has happened. Someone has to call the customer and ask why the file is not there. Someone may have to restart a network component and restart the transmission process. Someone may have to inform the customer that the corrupted file has to be recreated. For specialists to be able to react quickly, it’s not sufficient that they’re experts on the tools and technologies. They must also be familiar with the customer’s system and its idiosyncrasies.

Another important task of a BICCC is monitoring. Continuous monitoring of network traffic is important to see trends and to prevent problems. For example, monitoring the required time to transmit data may show that transmission times are increasing and that within a few weeks the data won’t arrive in the cloud on time anymore to be available for reporting. In such a situation it may be necessary that the customer switches to another form of copying, for example, from full copy to incremental copy, or from incremental to trickle copy. Although the monitoring can be done with software, again, it will be specialists who have to study the problem, recommend that the customer switch to another copying approach, and probably help with implementing the new solution.
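Trend monitoring of this kind is straightforward to sketch. The example below fits a straight line through recent transmission durations and estimates how many days remain before the batch window is exceeded; the measurement data and the window size are made up for illustration.

```python
# Extrapolate transmission times to warn before the batch window is hit.
from statistics import linear_regression  # Python 3.10+

def days_until_window_exceeded(history, window_minutes):
    """history: list of (day_index, duration_minutes) measurements.
    Returns None when transmission times are stable or shrinking."""
    xs = [day for day, _ in history]
    ys = [minutes for _, minutes in history]
    slope, intercept = linear_regression(xs, ys)
    if slope <= 0:
        return None  # no worrying trend
    return (window_minutes - intercept) / slope - max(xs)

# Transmissions growing ~2.5 minutes/day against a 240-minute window:
history = [(1, 200), (2, 202), (3, 205)]
print(days_until_window_exceeded(history, 240))  # ~14 days left to react
```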
4.3 Data is Stored in a Holding Area
Technical Aspects – After transmitted data has been received correctly, it’s stored in a holding area. This holding area contains the data in its original form as received from the customer. So, it hasn’t been processed or cleansed. The holding area keeps the history of the data going back to a date agreed upon with the customer.
The history of all the data is kept for several reasons. First, in case data gets lost, it can be reconstructed using the data in the holding area. Second, when full copies are transmitted, the historical data in the holding area can be used to determine which data is actually new. In other words, the holding area is used to derive the incremental copy.

BICCC vendors can implement a holding area in many different ways. For example, they can select a Data Vault‐like6 database structure that stores all the data. They may also opt for data structures in the holding area that resemble those in the transaction systems, so that loading data in the holding area is a straightforward 1:1 copy process. And the holding area may simply be a directory in which all the transmitted files are stored.

Failures may occur when new data is loaded in the holding area due to incorrect or corrupted data. If portions of the data have already been loaded, an unload is required. This is not always easy to automate. The problems have to be fixed (possibly manually), and load jobs have to be re‐executed.
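To make the derivation of an incremental copy from the holding area concrete, here is a minimal sketch. It assumes both the previous and the new full copy fit in memory as dictionaries keyed by primary key; a real holding area would run this comparison inside the database.

```python
# Derive the incremental copy by comparing a new full copy against the
# previous one kept in the holding area.

def derive_delta(previous, current):
    """Return what was inserted, updated, and deleted since last time."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserted, updated, deleted

prev = {1: ("Jones", "NL"), 2: ("Smith", "UK")}
curr = {1: ("Jones", "BE"), 3: ("Brown", "DE")}
print(derive_delta(prev, curr))
# -> inserted {3: ...}, updated {1: ...}, deleted {2: ...}
```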
Service Aspects – In this stage, the role of the BICCC is primarily a monitoring and management one. Aspects to be monitored include the data load time in the holding area, the file correctness, and the growth of the holding area:
Monitoring the load time is important. There is only so much time available to load data. An increasing load time eventually leads to scheduling problems. If this happens, specialists have to optimize the solution, or come up with a faster solution.
A file is correct when it has been transmitted correctly and can be processed correctly. Monitoring file correctness can be done by counting the number of load failures. Specialists can decide whether the file correctness is stable, improving, or declining. They can’t really act on it themselves, but they can inform the customer that an unwanted trend is visible. The customer may want to react by changing or hardening the programs that create the files.
Monitoring the holding area’s growth is important for a BICCC to make storage space available on time. It’s unacceptable that a job for loading data in the holding area fails because there is no available storage space left.
Many tools exist for monitoring the aspects described. Such tools can be programmed to indicate when a problem has occurred or is about to occur. But what’s more important is, again, the human aspect. If a problem occurs, specialists must react. They must call the customer, schedule a meeting to discuss possible solutions, and they have to implement these solutions. All these activities are typical for a BICCC.
4.4 Data is Loaded in the Data Warehouse
Technical Aspects – In Stage 4, data stored in the holding area is copied to another storage solution, one developed for analytics and reporting. This solution can be developed in different ways. First, it can be one large data warehouse in which all the data is stored in a star schema or normalized form. Such a data warehouse traditionally contains all the data (historical and current) in the most detailed form. Second, it can be a data warehouse plus a set of derived data stores, such as data marts developed with SQL database servers, multi‐dimensional cubes, Excel spreadsheets, or just simple comma‐separated files. Derived data stores commonly contain (slightly) aggregated data and don’t contain all the available data but a subset. Derived data stores are added to speed up query performance and to present an easier‐to‐understand data structure to users and reports. And sometimes
6 Dan E. Linstedt, Data Vault Series 1 ‐ Data Vault Overview, July 2002, see http://www.tdan.com/view‐articles/5054/
derived data stores are created because the reporting tools demand the data to be stored and structured in a certain form, such as a star schema. Third, many data stores can be developed without a data warehouse at all. Whatever the solution is, data must be copied from the holding area, transformed, and loaded. It’s common that traditional ETL tools are deployed for this process. Depending on the solution, two ETL steps may be required: step one copies the data from the holding area to the data warehouse, and step two copies it from the data warehouse to the derived data stores. Data stored in the holding area has a certain structure, and it must be rearranged to fit the data structure of the data warehouse and derived data stores. Also, loading new data may involve unloading existing data. This depends on the copying approach and the technical solution.

In the process of copying, data must be prepared first. The correctness and consistency of the data must be checked. Are the right codes used, are names spelled correctly, and are the values realistic? Preparation of data is executed to improve the data quality level and, with that, the quality of report results. The data quality rules that have to be applied must be specified by the customer and implemented by the BICCC vendor.
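What such data quality rules can look like in practice is sketched below. The specific rules (valid country codes, a realistic value range, non-empty names) are hypothetical examples of what a customer might specify; real rules are agreed per customer.

```python
# Hypothetical data quality rules applied during data preparation.

VALID_COUNTRY_CODES = {"NL", "BE", "DE", "UK", "IT"}

def check_row(row):
    """Return a list of data quality violations for one source row."""
    issues = []
    if row["country"] not in VALID_COUNTRY_CODES:
        issues.append(f"unknown country code: {row['country']}")
    if not 0 < row["amount"] < 1_000_000:  # realistic value range
        issues.append(f"implausible amount: {row['amount']}")
    if not row["name"].strip():
        issues.append("empty customer name")
    return issues

print(check_row({"country": "XX", "amount": -5, "name": "Jansen"}))
# -> ['unknown country code: XX', 'implausible amount: -5']
```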
Service Aspects – A BICCC vendor is responsible for designing and developing the ETL jobs that copy the data to the desired data stores. They are also responsible for scheduling and running all these jobs. Performance plays an important role in this entire process. For example, when the loading of data in the holding area has finished at midnight and the SLA states that all new data must be available before 4am, there is a maximum of four hours available for this stage. So, it’s crucial that an efficient and fast solution is developed. Besides running all the ETL jobs, this stage must be monitored closely. These are some of the aspects that must be monitored:
Did all the jobs run correctly and was all the data processed? Imagine that some data was not copied. Users may not notice this because the reporting results look correct.
Is the data load time still acceptable? If not, can the database server be optimized and tuned? Or is it time to introduce more high‐end hardware?
How fast is the size of the data warehouse and/or derived data stores growing? If the size of the data warehouse is leading to performance or scalability problems, it may be useful to archive some of the (c)old data.
A BICCC can’t afford to be late. It must see things coming and must react on time. Specialists must have in‐depth understanding of the hardware and software in use.
4.5 Data is Made Available for Reporting and Analytics
Technical Aspects – After data is stored for reporting and analytics, it must be made available for users. The following aspects play an important role: query performance, network delay, data access control, and data encryption. Besides user‐friendliness and intuitiveness, query performance is an important aspect of the overall user experience. Sluggish reports are bad for user satisfaction and have a negative impact on user productivity. Therefore, it’s important that query performance is monitored. Many factors can
decrease the query performance, such as the ever‐growing size of the data warehouse, a workload escalation (more queries per user, more users, and more complex queries), or a database server that has problems balancing the entire workload.

In the same way that network bandwidth plays a role when data is transmitted from the customer site to the cloud (see Section 4.2), it plays a role when the report results are transmitted back to users. How much data is transmitted depends heavily on the tools used. Some tools pull a large quantity of data from the cloud in one go into the internal memory of the client machine, while others extract data piecemeal (depending on how much data can be presented on the screen). Both solutions lead to a specific network delay. Users consider the time it takes to transmit results over the network as part of the performance. Especially in a cloud environment, it’s important that this aspect is optimized.

Data in the data warehouse and derived data stores must be protected against deliberate or accidental unauthorized use. In other words, not everyone should have access to all the data in the database. So, data access control rules must be specified. Results of reports shipped over the network must be encrypted, especially because many of the results contain business‐related data; some results can be highly confidential.
Service Aspects – A BICCC vendor must monitor four aspects: query performance, network delay, data access control, and data encryption. If the performance of queries starts to degrade, the BICCC must react. There are many techniques they can apply: optimization and rewriting of queries, defining additional indexes for direct access, repartitioning tables, tuning buffer space, and so on. Tools can help with this, but 99% of the work must be done by technical specialists. Network traffic of results must be monitored continuously to recognize degradation of network transmission times and to prevent problems. This is identical to monitoring the network traffic in Stage 2. The data access rules themselves do not have to be monitored; these are defined once and then simply work. But what can be monitored is whether specific users repeatedly try to access data they are not supposed to access. It may be useful to investigate what the reason for this is.
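As a small illustration of that last point, the sketch below scans denied-access events from an audit log and flags users who repeatedly hit data they may not see. The log format and the threshold are assumptions, not a specific product's interface.

```python
# Flag users who repeatedly trigger access-control denials.
from collections import Counter

def flag_repeat_offenders(denied_events, threshold=5):
    """denied_events: iterable of (user, table) pairs from the audit log."""
    per_user = Counter(user for user, _ in denied_events)
    return [user for user, count in per_user.items() if count >= threshold]

audit_log = [("u12", "salaries")] * 6 + [("u7", "sales")]
print(flag_repeat_offenders(audit_log))  # ['u12'] is worth investigating
```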
4.6 Data is Turned into Business Insights
Technical Aspects – To turn data into business insights on which an organization can act, the data must be presented in the right form with the right interface. One user wants to see his data in a spreadsheet, another prefers a dashboard look, and a third likes to work with a self‐service BI tool. A BICCC must support as many forms of reporting and analytics as possible, including:
Executive Reporting
Classic OLAP / Reporting
Dashboarding
Operational Reporting (BI)
Analytics (forecasting, predictive modeling)
Sandboxing
Embedded Analytics
Big Data Analytics
Unstructured Data Analytics
360 Degree Reporting
Transaction Reporting

Many customers use transactional systems, such as SAP, Salesforce.com, and NetSuite. In these systems the database structure is always the same. In addition, many of the reports that the users require are also known. In this case, a BICCC can offer a predefined set of reports and dashboards.
Service Aspects – A BICCC can help to analyze the reporting needs of users, and may be responsible for developing and maintaining the reports. They may help users to use their reports more efficiently and teach new users how to work with the reports. Monitoring this stage of the path of data is not straightforward. Some BICCCs do monitor user happiness through interviews and questionnaires.
4.7 Continuous Activities

Besides the activities and stages related to the path of data as described in this section, there are other activities for which a BICCC is responsible. Examples are vendor release management, human resource management, 24x7 management, upgrades of standard software, and test cycles. These activities must be executed with minimal interference to the customer. The fact that four of the five DBAs have the flu, that patches must be installed to avoid an ETL job failure, or that a new software version has to be tested before it’s made operational, is not the customer’s responsibility. A BICCC is about unburdening the customer, and that includes all such activities.

The agreement between a customer and a BICCC vendor must cover a number of SLA aspects, including:
The average query performance (including and excluding network delay)
The maximum query performance (including and excluding network delay)
The availability of the data
Downtime (scheduled and non‐scheduled)

To measure whether SLAs have been met, continuous monitoring is an important activity of the BICCC vendor. They must be able to inform their customers about how the real measurements of, for example, downtime and query performance relate to the SLAs, as illustrated in the sketch below.
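A minimal sketch of such an SLA check follows. The targets and the measurements are illustrative assumptions; real SLA values are agreed per customer.

```python
# Check measured values against assumed SLA targets.
from statistics import mean

def sla_report(query_times_s, downtime_min, period_min,
               avg_target_s=5.0, max_target_s=30.0, avail_target_pct=99.5):
    """Return (measured value, target met?) per SLA aspect."""
    availability = 100 * (1 - downtime_min / period_min)
    return {
        "average query (s)": (mean(query_times_s),
                              mean(query_times_s) <= avg_target_s),
        "maximum query (s)": (max(query_times_s),
                              max(query_times_s) <= max_target_s),
        "availability (%)": (round(availability, 2),
                             availability >= avail_target_pct),
    }

# One month (43,200 minutes) with 90 minutes of unscheduled downtime:
print(sla_report([1.2, 3.4, 8.0, 2.1], downtime_min=90, period_min=43_200))
```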
5 Inergy’s BICC in the Cloud

This section describes one BICC in the Cloud vendor in detail: Inergy Analytical Solutions. The first subsection describes the services they offer, and the subsequent subsections explain how they have implemented the stages described in Section 4.
5.1 Overall Description of Inergy
Introduction – Inergy sometimes refers to their BICCC service as Managed BI Services. They provide two
services. First, they can architect, design, develop, test, implement, and maintain BI systems for customers. Second, they can operate, manage, and maintain operational BI systems running on their own BICC in the Cloud service. Where other vendors have focused on delivering one of the two services, Inergy’s objective has always been to unburden customers in all aspects of business intelligence.

With respect to their BICCC service, Inergy runs the BI environments for a large number of customers. They unburden their customers by deploying the right hardware, software, and specialists. They often use the iceberg metaphor to illustrate how they work. The customer sees and experiences only a small portion of the entire BI factory: first, they make their data available for upload to the cloud, and second, that data is made available for reporting and analytics. Customers don’t have to concern themselves with network hiccups, overloading disk space, upgrades to new versions, declining query performance, restarting failed ETL jobs, and so on. They only see the tip of the iceberg. A large part of all the work and the infrastructure (the part of the iceberg that’s under water) is shielded.
The Cloud Infrastructure – Currently, Inergy uses IBM’s analytical SQL appliance called IBM PureData System for Analytics (formerly called Netezza) as the hardware and database platform for storing the holding area and the data warehouse. In addition, they prefer Informatica PowerCenter for ETL processing and the MicroStrategy BI platform for reporting and analytics. Note that the use of these tools is optional; reports and ETL programs can be developed with other tools if the customer so prefers.

Inergy has developed several dedicated software components to guide data on its path from the transaction systems to the user reports. These components have been designed specifically to be able to monitor all the aspects of a BICCC service in as much detail and as extensively as possible. Standard software components do not contain such extensive monitoring features. Together, the software components form a robust factory for transmitting and processing all the data. They’re continuously improved based on customer experiences.
The Agreements – Two agreements outline the relationship between Inergy and the customer. The first is an agreement defining the SLAs. These SLAs deal with performance aspects, data availability, downtime, and so on. As an example, Figure 2 shows the overall SLA dashboard for a specific customer.
Figure 2 The SLA dashboard for a specific customer.

To meet all the SLAs, every aspect of the BI environment is monitored in detail, and specialists are on standby 24 hours a day. When a job crashes in the middle of the night and can’t be recovered, specialists are literally woken up to fix the problem. In addition, groups of specialists are assigned to specific customers, so that they understand their specific issues. These specialists have been involved in report development, defining data quality rules, setting up data access rules, and so on. They know the customer’s BI requirements and solutions.

The second agreement is called the data delivery agreement. This agreement specifies how the system is developed. For example, it describes the form of copying, the encryption mechanism used, and the format of the files.
Two BICCs – Note that Inergy’s BICC in the Cloud service doesn’t replace the customer’s own BICC. In general, Inergy focuses on the more technical aspects and the customer on the functional aspects. It always comes down to a cooperation between the two BICCs.
5.2 Data is Added to a Transaction System

Inergy is not involved in this stage. This is the responsibility of the customer.
5.3 Data is Transmitted to the Cloud

Inergy supports all four forms of copying specified in Section 4.2: full copy, incremental copy, trickle copy, and real‐time copy:
A classic form of full copy is supported for those customers that don’t have the technology to
identify the changes made to the data since the previous copy operation. In this case, customers must implement a solution that periodically creates files containing all the data. When full copy is deployed, Inergy offers a solution to optimize network traffic. This solution compares the new full copy with the previous full copy, and determines the delta. Only the delta is uploaded to the cloud.
For full and incremental copy, Inergy allows refreshes every 8 hours (or longer).
Trickle copy is supported as well.
When real‐time copying is required, special software components must be installed close to the transaction systems to intercept all the new data which is then uploaded to the cloud.
On the cloud side, Inergy’s software checks whether all the data was transmitted correctly by, for example, doing cross checks. Note that this form of checking is not the same as data quality checking; the data itself is not changed or corrected in this stage. Inergy checks whether the data that was received is still 100% the same as the data that was sent.
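A minimal sketch of what such a cross check could look like follows, assuming the sender ships a small manifest (row count and SHA-256 checksum) alongside each file. Inergy's actual checks are not published, so this is purely illustrative.

```python
# Verify on arrival that a received file is 100% identical to what was
# sent, using a manifest with the expected row count and SHA-256 digest.
import hashlib

def verify_upload(path, expected_rows, expected_sha256):
    sha = hashlib.sha256()
    rows = 0
    with open(path, "rb") as f:
        for line in f:   # stream the file; no need to load it in memory
            sha.update(line)
            rows += 1
    if rows != expected_rows:
        return f"row count mismatch: got {rows}, expected {expected_rows}"
    if sha.hexdigest() != expected_sha256:
        return "checksum mismatch: file corrupted in transit"
    return "OK"
```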
5.4 Data is Stored in a Holding Area

Uploaded data is placed in the holding area. Inergy stores all the data in the holding area on their IBM PureData System for Analytics platform. Data is loaded without any changes. The structures of the receiving tables are almost identical to the structures of the uploaded files, which are identical to those of the transaction systems. One column is added to store the date and time on which the data was loaded.

Data is not removed from this holding area. This allows customers to go back to older versions of the data if required, for example, to recover lost or damaged data, or for compliance reasons. In other words, all the historical data remains available.

Most of the software used for loading data in the holding area has been developed by Inergy itself: a highly efficient and maintainable solution. If the structure of the uploaded data changes, the amount of work required by Inergy to implement this change is minimal. As indicated, Inergy uses an advanced monitoring system for all aspects of the BICC, including for loading data. As an example, Figure 3 shows this monitor in action.
Figure 3 Screenshot showing the Inergy monitor for loading data.
5.5 Data is Loaded in the Data Warehouse

All the data for reporting and analytics is stored in a data warehouse implemented on the IBM PureData System for Analytics platform as well. For each customer a separate data warehouse is
developed. Inergy tries to avoid developing derived data stores. If specific data must be available with a specific data structure, the view mechanism is used. But if the customer or the reports dictate physical data marts, they are developed. In general, though, Inergy prefers to work without derived data stores to keep the architecture as simple and flexible as possible. In addition, they make all the detailed data available to users, and they don’t restrict reporting and analytical capabilities by only making aggregated data available.

For copying data from the holding area to the data warehouse, an ELT approach is used. During Stage 3, data is loaded in its original form in dedicated tables in the holding area. In this stage, relevant data is extracted from the holding area, transformed, and stored in the real data warehouse tables. Inergy has selected this ELT approach because the IBM platform is able to execute these in‐database transformations very fast. One of the reasons is that it requires no data transmission, and many operations are executed in parallel.

For a long time, Inergy has used the IBM platform for storing all the data. Lately, new data storage platforms have been introduced, of which Hadoop is the most popular. The characteristics of Hadoop are data storage scalability, processing scalability, and a relatively low price/performance ratio. It has been designed specifically for big data. Inergy has adopted this data storage platform as well.
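The ELT pattern itself is easy to illustrate. The sketch below uses SQLite purely as a stand-in for the warehouse platform; the table and column names are hypothetical. The point is only that the transformation runs as one set-based SQL statement inside the database rather than in an external ETL tool.

```python
# ELT: load raw data first, then transform it inside the database engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holding_sales (order_id, order_ts, customer_id, amount_cents);
    CREATE TABLE dw_customer  (customer_key, source_id);
    CREATE TABLE dw_sales     (order_id, order_date, customer_key, amount_eur);
    INSERT INTO holding_sales VALUES (1, '2015-01-07 09:30:00', 'C42', 1999);
    INSERT INTO dw_customer  VALUES (7, 'C42');
""")

# The transformation runs where the data lives (the "T" after "EL"):
conn.execute("""
    INSERT INTO dw_sales
    SELECT h.order_id,
           DATE(h.order_ts),                  -- normalize to a date
           c.customer_key,                    -- resolve the surrogate key
           ROUND(h.amount_cents / 100.0, 2)   -- cents to euros
    FROM holding_sales h
    JOIN dw_customer c ON c.source_id = h.customer_id
""")
print(conn.execute("SELECT * FROM dw_sales").fetchall())
# -> [(1, '2015-01-07', 7, 19.99)]
```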
5.6 Data is Made Available for Reporting and Analytics

Inergy supports different network protocols for different types of data usage. Data access rules can be specified to protect against deliberate or accidental unauthorized use of tables. Inergy monitors query performance and network delay. For example, Figure 4 contains a screenshot of the monitor indicating CPU utilization, memory utilization, and I/O. All these factors have an impact on performance. Therefore, Inergy allows such aspects to be monitored in detail.
Figure 4 The Inergy monitor provides extensive and detailed information on all aspects of processing. This screenshot shows CPU and memory utilization and I/O.
5.7 Data is Turned into Business Insights

For report development, Inergy prefers the use of the MicroStrategy BI platform, but if customers prefer, other tools can be used. Currently, Inergy does not offer pre‐defined reports or dashboards.
6 Two Case Studies

This section contains two short case studies of organizations that implemented their BI environments using Inergy’s BICCC solution. Both illustrate how extensive the BICCC service offered by Inergy is.
6.1 PostNL

PostNL is a parcel service operating in The Netherlands, Belgium, the UK, Germany, and Italy. Inergy has architected and developed their BI environment and operates it. A wide range of management reports and dashboards is available for the different divisions. Analytics is supported on all available data to support the core processes: distribution, shipments, and route optimization.

The data warehouse developed is an active data warehouse. Real‐time copying is used so that reports and dashboards show the current situation. This allows PostNL to offer real‐time track and trace features for parcels. When customers of PostNL track their parcel, information is retrieved from the data warehouse running in the cloud, and they see almost 100% up‐to‐date data. Parcels can be traced via the Web, e‐mail, and SMS.

The facts: PostNL ships 135 million parcels per year, and on certain days more than 1 million. 98% of these parcels are delivered within 24 hours.
6.2 Intergamma

Intergamma is a retail company specializing in do‐it‐yourself products. They operate 375 stores across The Netherlands and Belgium. Their BI environment has been designed and developed by Inergy and is operated by Inergy in the cloud. Reports are used across the entire organization and include reports on sales, margins, stock, shop performance, budgets, shelf planning, and market overviews.
About the Author Rick F. van der Lans

Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, database technology, and data virtualization. He works for R20/Consultancy (www.r20.nl), a consultancy company he founded in 1987.

Rick is chairman of the European Enterprise Data and Business Intelligence Conference (organized annually in London). He writes for SearchBusinessAnalytics.Techtarget.com, B‐eye‐Network.com7, and other websites. He introduced the business intelligence architecture called the Data Delivery Platform in 2009 in a number of articles8, all published at BeyeNetwork.com. The Data Delivery Platform is an architecture based on data virtualization.

He has written several books on SQL. Published in 1987, his popular Introduction to SQL9 was the first English book on the market devoted entirely to SQL. After more than twenty‐five years, this book is still being sold, and it has been translated into several languages, including Chinese, German, and Italian. His latest book10, Data Virtualization for Business Intelligence Systems, was published in 2012.

For more information please visit www.r20.nl, or email to [email protected]. You can also get in touch with him via LinkedIn and via Twitter @Rick_vanderlans.
About Inergy Analytical Solutions

Inergy supports data‐driven organizations in collecting, organizing, and analyzing data. Inergy delivers consultancy and complete solutions in the field of Business Intelligence, Big Data, Data Warehousing, Analytics, and Data Discovery. In addition, Inergy offers a comprehensive BICC in the Cloud service. Inergy is based in The Netherlands and works for many clients from a wide range of industries, including retail (do‐it‐yourself, fashion, and food), finance, parcel services, media, and leisure.
7 See http://www.b‐eye‐network.com/channels/5087/articles/
8 See http://www.b‐eye‐network.com/channels/5087/view/12495
9 R.F. van der Lans, Introduction to SQL; Mastering the Relational Database Language, fourth edition, Addison‐Wesley, 2007.
10 R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.