why real-time analytics?

The Chimp Way: Using the right tool for each job

At Infochimps, we abide by the philosophy that you should use the right tool for each job. Why lock in to one set of technologies or techniques? Depending on what you are trying to accomplish - the questions you want to ask of your data, or the applications and visualizations you build on top of that data - different technologies are best suited for each unique task. You should have all the best tools at your fingertips for each task. Infochimps excels at systems and technology integration -- we can take your existing tools, add powerful new ones from our kit, and glue them together into a unified whole.

We also strongly embrace open source technologies as part of a complete data solution. Not only do you benefit from the active participation of the open source community, you also aren’t limited to a proprietary vendor’s finite feature set and integration connectors. We use Hadoop, Elasticsearch, Flume, Ironfan, and Wukong, among other world-class open source tools that work flexibly with each other and the rest of the tools in your enterprise.

© 2012 Infochimps, Inc. All rights reserved.

Why Real-Time Analytics?

Explore the technology that enables real-time analytics and streaming data processing, and how it differs from the world of Hadoop and batch analytics.

The Hadoop & NoSQL conundrum

Hadoop is a powerful framework for Big Data analytics. It simplifies the analysis of massive sets of data by distributing the computation load across many processes and machines. Hadoop embraces a map/reduce framework, which means analytics are performed as batch processes. Depending on the quantity of data and the complexity of the computation, running a set of Hadoop jobs could take anywhere from a few minutes to many days. Batch analytics tool sets like Hadoop are great for doing one-off reports, a recurring schedule of periodic runs, or setting up dedicated data exploration environments. However, waiting hours for the analysis you need means you aren’t able to get real-time answers from your data. Hadoop analysis ends up being a rear-view mirror instead of a pulse on the moment.

NoSQL databases are extremely powerful, but come with certain challenges of their own

At Infochimps we use Hadoop to run map/reduce jobs against scalable NoSQL data stores like HBase, Cassandra, or Elasticsearch. These databases are extremely good at enabling fast queries against many terabytes of data, but each makes certain tradeoffs to enable this ability. One major tradeoff, common across all three of these examples, is the lack of SQL-like joins: you cannot combine data from one database table with data from another table at query time.

The usual way we work around this tradeoff is to practice denormalization. Imagine we’re asking a question such as “Find all posts that contain the phrase ‘Coca-Cola’ from all authors based in Spokane, Washington”. In a traditional relational (SQL) database, a table of “posts” would join against a table of “authors” using a shared key like an author’s ID number. In NoSQL databases, denormalization consists of inserting a copy of the author’s record into each row of their posts. Rather than joining the posts table with the authors table during the query, as SQL would, all the authors’ data is already contained within the posts table before the query runs.
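To make the contrast concrete, here is a minimal Python sketch of the same question answered both ways. The sample authors, posts, and field names are hypothetical and stand in for real tables; no particular database is implied.

```python
# Hypothetical sample data; table and field names are illustrative only.
authors = {
    1: {"name": "Ann", "city": "Spokane", "state": "WA"},
    2: {"name": "Bob", "city": "Austin", "state": "TX"},
}

posts = [
    {"post_id": 10, "author_id": 1, "text": "I love Coca-Cola"},
    {"post_id": 11, "author_id": 2, "text": "Coca-Cola with lunch"},
    {"post_id": 12, "author_id": 1, "text": "Rainy day in town"},
]


def query_with_join(posts, authors, phrase, city):
    """Relational style: look up each post's author at query time."""
    return [
        p["post_id"]
        for p in posts
        if phrase in p["text"] and authors[p["author_id"]]["city"] == city
    ]


def denormalize(posts, authors):
    """Copy the author record into every post row ahead of time."""
    return [{**p, "author": dict(authors[p["author_id"]])} for p in posts]


def query_denormalized(rows, phrase, city):
    """Denormalized style: each row already carries what the query needs."""
    return [
        r["post_id"]
        for r in rows
        if phrase in r["text"] and r["author"]["city"] == city
    ]


rows = denormalize(posts, authors)
joined_result = query_with_join(posts, authors, "Coca-Cola", "Spokane")
denormalized_result = query_denormalized(rows, "Coca-Cola", "Spokane")
print(joined_result, denormalized_result)
```

Both queries return the same post, but the second never touches the authors table. The cost is storage: each author record is duplicated once per post.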

The question then becomes: when should the denormalization of our NoSQL database occur? One option is to use Hadoop to “backfill” denormalized data from normalized tables before running these kinds of queries. This approach is perfectly workable, but it suffers from the same “rear-view mirror” problem as Hadoop-based batch analytics -- we still cannot perform complex queries of real-time data. What if we could write denormalized data on the fly, writing each incoming Twitter post into a row in the posts table and augmenting that row with information on the author in real-time? This would keep all data denormalized at all times, always ready for downstream applications to run complex queries and generate rich, real-time business insights. Real-time analytics and stream processing make this possible.
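A minimal sketch of that write-time approach, in Python. The author lookup, the in-memory posts table, and the field names are all assumptions made for illustration; a real deployment would read from a live feed and write to an actual data store.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical author lookup; in practice this might be a cache or key-value store.
AUTHORS: Dict[int, dict] = {
    1: {"name": "Ann", "city": "Spokane"},
    2: {"name": "Bob", "city": "Austin"},
}


def enrich(stream: Iterable[dict], authors: Dict[int, dict]) -> Iterator[dict]:
    """Attach the author record to each post as it arrives."""
    for post in stream:
        # Unknown authors get an empty record rather than stopping the stream.
        author = authors.get(post["author_id"], {})
        yield {**post, "author": dict(author)}


# Stand-in for a live feed of incoming posts.
incoming = iter([
    {"post_id": 10, "author_id": 1, "text": "first post"},
    {"post_id": 11, "author_id": 2, "text": "second post"},
    {"post_id": 12, "author_id": 9, "text": "post from an unknown author"},
])

posts_table: Dict[int, dict] = {}  # stand-in for the denormalized posts table
for row in enrich(incoming, AUTHORS):
    posts_table[row["post_id"]] = row  # each row is query-ready the moment it lands

print(posts_table[10]["author"]["city"])
```

Because enrichment happens one post at a time, no backfill job is needed: every row in the table is already denormalized when a query arrives.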


Real-time + Big Data = Stream Processing

In situations where you need to make well-informed, real-time decisions, good data isn’t enough. It must be timely and actionable. As a mutual fund operator, you can’t wait hours to analyze whether or not it’s the right moment to sell 200,000 stock shares. As CMO, you can’t wait days to see if there is a PR crisis occurring around your brand. The time window for data analysis is shrinking, and you need a different set of tools to get these on-the-fly answers.

Batch Versus Streaming

Consider two hypothetical sandwich makers. Each company makes great sandwiches, but chooses to deliver them to their customers either in batches or in near real-time.


The Batch Sub Shop can provide large quantities of sandwiches by leveraging many people to accomplish the overall project. Similarly, batch analytics can leverage multiple machines to accomplish a set of analytics jobs. By adding more resources, we can increase the speed with which the tasks are accomplished, but at a higher cost.

Contrast that with the Streaming Sub Shop, which doesn’t deliver a huge set of sandwiches all at once, but does quickly create sandwiches on the fly. The process aims to get a sandwich in the customer’s hand as soon as possible. Real-time analytics works the same way by processing data the moment it is collected. If the data is coming in too quickly, we can flexibly increase the resources that support our real-time workflow. Is the toasting process the bottleneck of our production line? We can easily add a couple more toasters.

As you can imagine, the ideal sandwich company probably combines both the ability to cater large orders ahead of time and an in-store, made-to-order business. Likewise, your organization can leverage both batch analytics and real-time analytics depending on your business needs. Batch analytics is the most efficient way to process a large quantity of data when the results are not time-sensitive. Real-time analytics and stream processing are the answer when the timeliness of your insights is important, when you need to scalably process a very large influx of live data, or when NoSQL databases alone cannot answer the questions you are asking.
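The difference shows up even in a very small calculation. The Python sketch below computes an average price both ways: once over a complete batch, and once incrementally as each value arrives. The prices and the names batch_mean and RunningMean are hypothetical.

```python
def batch_mean(values):
    """Batch style: wait for the full data set, then compute once."""
    return sum(values) / len(values)


class RunningMean:
    """Streaming style: keep a small state and update it per event."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean: nudge the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean


prices = [10.0, 12.0, 11.0, 13.0]  # hypothetical price ticks

running = RunningMean()
snapshots = [running.update(p) for p in prices]  # an answer after every tick

final_batch = batch_mean(prices)  # a single answer, only once all data is in
print(snapshots, final_batch)
```

Both approaches end on the same number, but the streaming version had a usable answer after every tick rather than only at the end.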


How Does Real-Time Analytics Work?

1. Collect real-time data. Real-time data is being generated all the time. If you are a mutual fund operator, it’s real-time stock price data. If you are a CMO, it’s real-time social media posts and Google search results. Typically this data is live streaming data. That means the moment the stock price changes, we can grab that data point - like a faucet of running water. We collect live data by “hooking a hose up” to the faucet stream to capture that information in real-time. These “hoses” go by many names, including scrapers, collectors, agents, and listeners.

2. Process the data as it flows in. The key to real-time analytics is that we cannot wait until later to do things to our data; we must analyze it instantly. Stream processing (also known as streaming data processing) is the term used for doing things to data instantly as it’s collected. Actions that you can perform in real-time include splitting data, merging it, doing calculations, connecting it with outside data sources, forking data to multiple destinations, and more.

3. Reports and dashboards access processed data. Now that data has been processed, it is reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Just seconds after the data was collected, it is now visible in your charts and tables. Since real-time analytics and stream processing are flexible frameworks, you can utilize whatever tools you prefer, whether that’s Tableau, Pentaho, GoodData, a custom application, or something else. Integration is Infochimps’ forte.
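The three steps above can be sketched as a chain of Python generators. This is only an illustration under stated assumptions: the tick data, the collect, process, and deliver functions, and the in-memory list standing in for a reporting database are all hypothetical.

```python
def collect(feed):
    """Step 1: the "hose". Yields each data point as the feed produces it."""
    for tick in feed:
        yield tick


def process(events):
    """Step 2: compute the price change per symbol as each tick flows through."""
    last_price = {}
    for event in events:
        previous = last_price.get(event["symbol"])
        change = None if previous is None else event["price"] - previous
        last_price[event["symbol"]] = event["price"]
        yield {**event, "change": change}


def deliver(events, store):
    """Step 3: write each processed event to the store behind the dashboards."""
    for event in events:
        store.append(event)


# Hypothetical stock ticks standing in for a live feed.
feed = [
    {"symbol": "ACME", "price": 10.0},
    {"symbol": "ACME", "price": 10.5},
    {"symbol": "XYZ", "price": 5.0},
    {"symbol": "ACME", "price": 10.25},
]

dashboard_rows = []  # stand-in for a reporting database
deliver(process(collect(feed)), dashboard_rows)
changes = [row["change"] for row in dashboard_rows]
print(changes)
```

Because generators are lazy, each tick passes through all three steps before the next one is read, which mirrors how a stream processor handles events one at a time.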


What Can You Do With Stream Processing?

Augment

• Enhance your sales leads - IP addresses of visitors to your website are augmented by the “company name” associated with that visitor if they are coming from an enterprise. Email addresses get linked to Twitter handles and Facebook handles to help your sales team leverage social selling.

• Real-time social media analytics - tweets that mention the brands you are tracking are augmented with a sentiment score (how positive or negative the comment was) and an influencer score (such as Klout). Know instantly if positive news breaks or a PR crisis arises. Instantly gain insight into how influential people are and on what topics.

Process and Transform

• On-the-fly analytics reporting - Reformat a tweet on the fly to fit into an agency’s data model so that the data is visible in our reporting application immediately upon landing in the database.

• SQL-like data queries - Implement a denormalization policy to allow for doing complex JOIN-like queries in real-time in downstream analytics applications.

• Stock price algorithms - Implement your stock analyzer algorithm mid-stream. Instantly after an updated stock price is received, the data is processed through the algorithm, and placed in your reporting database.

Calculate

• Usage monitoring - Track the number of social media posts mentioning your client company’s brand. See at any given moment how much a brand is buzzing, and even set up tiered pricing based on how many social posts you are collecting on a client’s behalf.
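As one illustration of the "Calculate" category, the following Python sketch keeps a running count of brand mentions and maps each count to a pricing tier. The brand names, tier names, and thresholds are invented for the example, and the substring match is deliberately naive.

```python
from collections import Counter

BRANDS = ["acme", "globex"]  # hypothetical brands being tracked

# Hypothetical tiers: (minimum number of posts, tier name), in ascending order.
TIERS = [(0, "free"), (3, "standard"), (5, "premium")]


def record_post(counts, text, brands):
    """Update the running mention counts for a single incoming post."""
    lowered = text.lower()
    for brand in brands:
        if brand in lowered:  # naive substring match, enough for a sketch
            counts[brand] += 1


def tier_for(count, tiers):
    """Return the highest tier whose threshold the count has reached."""
    name = tiers[0][1]
    for threshold, tier in tiers:
        if count >= threshold:
            name = tier
    return name


stream = [
    "Acme rocks",
    "I use ACME daily",
    "globex is ok",
    "acme vs globex",
    "nothing relevant here",
]

counts = Counter()
for post in stream:
    record_post(counts, post, BRANDS)  # counts are current after every post

acme_tier = tier_for(counts["acme"], TIERS)
globex_tier = tier_for(counts["globex"], TIERS)
print(dict(counts), acme_tier, globex_tier)
```

Since the counts are updated per post, the current buzz level and pricing tier can be read at any moment without rerunning a batch job.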


Real-time analytics with the Infochimps Platform

Apache Flume

While initially built for log collection and routing, Flume has evolved to confidently serve the roles of general data transport and streaming data processing. Flume does more than reliably deliver data from a source to a destination. With the right optimizations, a single Flume system can ingest many terabytes of data per day, from thousands of data sources. As data flows in, you can do things to that data, such as add additional data, do calculations, run algorithms, split data, merge data, etc. In Flume lingo, these actions are powered by scripts called decorators, which perform the stream processing required for real-time analytics.
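The idea of chaining decorators can be illustrated in a few lines of Python. This is a conceptual sketch only and does not use Flume's actual API: the Event type, the add_word_count and tag_source functions, and run_chain are all hypothetical names.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Decorator = Callable[[Event], Event]  # takes an event, returns a modified copy


def add_word_count(event: Event) -> Event:
    """Calculation step: annotate the event with the size of its body."""
    return {**event, "word_count": len(str(event["body"]).split())}


def tag_source(event: Event) -> Event:
    """Augmentation step: make sure every event records where it came from."""
    return {**event, "source": event.get("source", "unknown")}


def run_chain(event: Event, decorators: List[Decorator]) -> Event:
    """Pass one event through each decorator in order."""
    for decorate in decorators:
        event = decorate(event)
    return event


chain = [add_word_count, tag_source]
result = run_chain({"body": "real-time analytics with streams"}, chain)
print(result)
```

Each step is small and independent, so new processing can be added by appending another function to the chain rather than rewriting the pipeline.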

Infochimps Data Delivery Service

Infochimps uses Apache Flume for the Data Delivery Service (DDS), our reliable data transport and real-time analytics engine for the Infochimps Platform. Infochimps DDS adds important enhancements to the Flume open-source tool including:


• Seamless integrations with your existing environment and data sources

• Optimizations for highly scalable data collection and distributed ETL (extract, transform, load)

• Tool set for rapid development of decorators which perform the stream processing

• Flexible delivery framework to send data to any type and quantity of databases or file systems

• Rapid solution development and deployment, along with our expert Big Data methodology and best practices

Infochimps has extensive experience implementing the DDS, both for clients and for our internal data flows including massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and much more.

Single-purpose ETL solutions are rapidly being replaced with multi-node, multi-purpose data integration platforms -- the universal glue that connects systems together and makes Big Data analytics feasible. Today, companies are taking advantage of Amazon Web Services for a few processes, on-premise or outsourced data centers for others, NoSQL databases, relational databases, cloud storage -- the list goes on. Data Delivery Service is compatible with all of those environments, making your data transport needs an implementation detail, not an analytics bottleneck.


About Infochimps

Our mission is to make the world’s data more accessible. Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets, which organizations can connect to their own data.

Contact Us

Infochimps, Inc.
1214 W 6th St. Suite 202
Austin, TX 78703

1-855-DATA-FUN (1-855-328-2386)

[email protected]

Twitter: @infochimps

Get a free Big Data consultation

Let’s talk Big Data in the enterprise!

Get a free conference with the leading big data experts regarding your enterprise big data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, tools, etc. Find out how other companies are solving similar problems. Learn best practices and get recommendations -- free.
