
Turning Big Data into Opportunity: The Data Lake

by Mark Herman ([email protected])
and Michael Delurey ([email protected])

Table of Contents

Introduction
A New Mindset
Ingesting Data into the Data Lake
Opening Up the Data
Tagging the Data
A New Way of Storing Data
Accessing the Data for Analytics
Bottom-Line Savings and Top-Line Growth

Introduction

Big data by itself does not create opportunity. The most successful, competitive organizations will be the ones with the ability to turn that data into game-changing paths to new kinds of value: cost savings, revenue growth, and operational effectiveness.

Organizations today are amassing so much information so quickly that they are reaching a tipping point. They are gaining remarkable potential to use big data in new ways, and redefine the very nature of how they do business. Yet none of this is guaranteed. Current tools cannot easily integrate disparate data collections, or fully use the kinds of “unstructured” data—such as photographs, doctors’ examination notes, and social media posts—that hold the most promise.

The bigger that data gets, the more impractical these tools become, in terms of time, cost, and analytic ability. The conventional approaches have, in effect, created a glass ceiling with big data. Organizations may be able to envision new opportunities for growth and effectiveness, and yet have no method of reaching them. There is, however, a way around the glass ceiling. Booz Allen Hamilton has developed a revolutionary approach known as the “data lake” that removes the current constraints.

With the data lake, an organization’s repository of information—structured and unstructured, along with streaming and batch data—is consolidated in a single, large table. The entire body of information in the data lake is available for every inquiry, and all at once—a capability that can create powerful new knowledge and insight. And because the data lake simplifies virtually every aspect of the loading, storing, and accessing of data, it provides business and government with substantial cost savings and efficiencies.

The data lake is now being used in a wide range of business and government applications. For example, it is helping a pharmaceutical company bring successful new drug compounds to market up to three times faster than was previously possible. It is enabling hospitals to more quickly identify and treat life-threatening infections. And it is helping the US military integrate its intelligence sources to track insurgents and others who are planting improvised explosive devices (IEDs).

In these and other instances, the data lake is creating the kinds of opportunities that would have been prohibitively expensive and time-consuming to pursue with conventional tools. Instead of being left behind by big data, organizations are now using it to compete and win in our digitally enabled economy.

A New Mindset

Many organizations are now collecting large amounts of data in the cloud. But the data lake is an entirely different model—it does not just bring data together, it helps connect and integrate the data so that its full value can be realized. Even in the cloud, data is stored in rigid, regimented data structures—essentially data silos—that are difficult to connect, limiting our ability to see the big picture. Despite its promise to revolutionize data analysis, the cloud does not truly integrate data—it simply makes the data silos taller and fatter.

The data lake is not an incremental advance, but rather represents a completely new mindset. Big data requires organizations to stop thinking in terms of data mining and data warehouses—the equivalent of industrial processes—and to consider how data can be more fluid and expansive, like in a data lake.

Organizations may be concerned that by consolidating and connecting their data, they might be making it more vulnerable. Just the opposite is true. The data lake incorporates a new, granular level of security and privacy that is not available with conventional techniques.¹

¹ See the Booz Allen Viewpoint, "Enabling Cloud Analytics with Data-Level Security: Tapping the Full Value of Big Data and the Cloud," http://www.boozallen.com/media/file/Enabling_Cloud_Analytics_with_Data-Level_Security.pdf

Ingesting Data into the Data Lake

As with much of the conventional approach, the process of preparing the data for analysis, known as extract/transform/load (ETL), tends to be highly inefficient in terms of the resources used. At many organizations, analysts may spend as much as 80 percent of their time preparing the data, leaving just 20 percent for conducting actual analysis. The reason is that with each new line of inquiry, a specific data structure and analytic is custom-built. All information entered into the data structure must first be converted into a recognizable format, often a slow, painstaking task. For example, an analyst might be faced with merging several data sources that each use different fields. The analyst must decide which fields to use and whether new ones need to be created. The more complex the query, the more data sources that typically must be homogenized. Formatting also carries the risk of data-entry errors. By contrast, data from a wide range of sources is smoothly and easily ingested into the data lake.

More importantly, there are no requirements for rigid data structures—and so no need for formal data formatting as the information is loaded. In a data lake, indexing is not done en masse at the time of ingestion, a time-consuming part of the traditional ETL process. Instead, indices and relationships can be derived over time to enrich the information base, and applied at analysis time to create "views" tailored to a specific inquiry, reducing the time needed to operationalize the data.
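A few lines of code make the contrast concrete. The sketch below is a minimal illustration of this schema-on-read idea, not Booz Allen's implementation; the record fields and function names are invented. Records from incompatible sources load as-is, and a "view" reconciling them is assembled only when an analysis asks for it.

    # Minimal sketch of schema-on-read ingestion (hypothetical names).
    lake = []  # the "data lake": an unstructured collection of records

    def ingest(record, source):
        """Store a record exactly as received, noting only its source."""
        lake.append({"source": source, "payload": record})

    # Two sources with different field names load without reconciliation.
    ingest({"cust_name": "Ada Lovelace", "balance": 1200}, source="billing")
    ingest({"fullName": "Ada Lovelace", "visits": 7}, source="web_logs")

    def view(field_map):
        """Apply a schema on read: map source-specific field names onto
        a common one only at analysis time."""
        for rec in lake:
            for field, common in field_map.items():
                if field in rec["payload"]:
                    yield {common: rec["payload"][field],
                           "source": rec["source"]}

    # Reconcile the naming conventions only when the question is asked.
    for row in view({"cust_name": "name", "fullName": "name"}):
        print(row)

Note that nothing was decided at load time; the field mapping lives with the question, not with the storage.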

The data lake might be thought of as a giant collection grid, like a spreadsheet—with billions of rows and billions of columns available to hold data. Each cell of the grid contains a piece of data—a document, perhaps, or maybe a paragraph, or even a single word from the document. Cells might contain names, photographs, incident reports, or Twitter feeds—anything and everything. It does not matter where in the grid each bit of information is located. It also makes no difference where the data comes from, whether it is formatted, or how it might relate to any other piece of information in the data lake. The data simply takes its place in the cell, and after only minimal preparation by analysts, is ready for use.
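The grid can be pictured in code as a sparse mapping from (row, column) coordinates to values, so that a cell exists only when it holds data. A minimal sketch, with invented coordinates:

    # The "giant grid" as a sparse map from (row, column) to cell values.
    grid = {}  # {(row, column): value} -- only occupied cells exist

    def put(row, column, value):
        """Place a piece of data in a cell; no schema, no fixed layout."""
        grid[(row, column)] = value

    # Anything can occupy a cell: a paragraph, a name, a tweet.
    put("doc-001", "paragraph-3", "Patient presented with fever...")
    put("acct-42", "holder_name", "Ada Lovelace")
    put("tweet-9", "text", "Service outage downtown?")

    # Retrieval needs only the coordinates; no table schema is involved.
    print(grid[("acct-42", "holder_name")])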

The image of the grid helps describe the difference between data mining and the data lake. If we want to mine precious metals, we have to find where they are, then dig deep to retrieve them. But imagine if, when the Earth was formed, nuggets of precious metals had been laid out in a big grid on top of the ground. We could just walk along, picking up what we wanted. The data lake makes information just as readily available.

The process of placing the data in open cells as it comes in gives the ingest process remarkable speed. Large amounts of data that might take 3 weeks to prepare using conventional cloud computing can be placed into the data lake in as little as 3 hours. This enables organizations to achieve substantial savings in IT resources and manpower. Just as important, it frees analysts for the more important task of finding connections and value in the data. Many organizations today are trying to “do more with less.” That is difficult with the conventional approach, but becomes possible, for the first time, with the data lake.

Opening Up the Data

The ingest process of the data lake also removes another disadvantage of the conventional approach—the need to pre-define our questions. With conventional computing techniques, we have to know in advance what kinds of answers we are looking for and where in the existing data the computer needs to look to answer the inquiry. Analysts do not really ask questions of the data—they form hypotheses well in advance of the actual analysis, and then create data structures and analytics that will enable them to test those hypotheses. The only results that come back are the ones that the custom-made databases and analytics happen to provide.

What makes this exercise even more constraining is that the data supporting an analysis typically contains only a portion of the potentially available information. Because the process of formatting and structuring the data is so time-intensive, analysts have no choice but to cull the data by some method. One of the most prevalent techniques is to discount (and even ignore) unstructured data. This simplifies the data ingest, but it severely reduces the value of the data for analysis.

Hampered by these severe limitations, analysts can pose only narrow questions of the data. And there is a risk that the data structures will become closed-loop systems—echo chambers that merely validate the original hypotheses. When we ask the system what is important, it points to the data that we happened to put in. The fact that a particular piece of data is included in a database tends to make it de facto significant—it is important only because the hypothesis sees it that way.

With the data lake, data is ingested with a wide-open view as to the queries that may come later. Because there are no structures, we can get all of the data in—all 100 variables, or 500, or any other number, so that the data in its totality becomes available. Organizations may have a great deal of data stored in the cloud, but without the data lake they cannot easily connect it all, and discover the often-hidden relationships in the world around us. It is in those relationships that knowledge and insight—and opportunity—reside.

Tagging the Data

The data lake also provides organizations with value in the way the data itself is managed. When a piece of data is ingested, certain details, called metadata (or "data about the data"), are added so that the basic information can be quickly located and identified. For example, an investor's portfolio balance (the data) might be stored with the name of the investor, the account number, the location of the account, the types of investments, the country the investor lives in, and so on. These metadata "tags" serve the same purpose as old-style card catalogues, which allowed readers to find a book by searching the author, title, or subject. As with the card catalogues, tags enable us to find particular information from a number of different starting points—but with today's tagging abilities, we can characterize data in nearly limitless ways. The more tags, the more complex and rich the analytics can become.
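A short sketch suggests how such tagging might work in practice. The tag names and records here are hypothetical, and an inverted index stands in for the card catalogue:

    # Illustrative sketch of metadata tagging (names are invented).
    from collections import defaultdict

    tag_index = defaultdict(set)   # tag -> ids of the data carrying it
    data_store = {}                # id  -> the datum itself

    def ingest(datum_id, datum, tags):
        """Store a datum and index it under each of its metadata tags."""
        data_store[datum_id] = datum
        for tag in tags:
            tag_index[tag].add(datum_id)

    ingest("bal-1001", 52000.00,
           tags={"investor:A.Lovelace", "account:1001",
                 "country:UK", "type:portfolio_balance"})

    def find(tag):
        """Locate data from any starting point, the way a card
        catalogue finds a book by author, title, or subject."""
        return [data_store[i] for i in tag_index[tag]]

    print(find("country:UK"))  # [52000.0]

A pivot, in this model, is simply a second find on another tag attached to the same record.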

With the tags, we can look not only for connections and patterns in the data, but in the tags themselves. As an example of how this technology can be applied, tags were used to help a major pharmaceutical company find connections in a wide range of public data sources to identify drug compounds with few adverse reactions and a high likelihood of clinical and commercial success. Those sources have included market and social media data—to help determine the need—as well as data on clinical development, structural analysis, disease structures, and patents—to determine where there might be a gap. Data from those sources were tagged and ingested into a data lake, enabling the pharmaceutical company to identify the most promising compounds. With conventional techniques, those compounds would have been needles in a haystack, but tags and the data lake help them stand out brightly.

The data lake allows us to ask questions and search for patterns using either the data itself, the tags themselves, or a combination of both. We can begin our search with any piece of data or tag—for example, a market analysis or the existing patents on a type of drug—and pivot off of it in any direction to look for connections.

While the process of tagging information is not new, the data lake uses it in a unique way—as the primary method of locating and managing the data. With the tags, the rigid data structures that so limit the conventional approach are no longer needed.

Along with the streamlined ingest process, tags help give the data lake its speed. When organizations need to update or search the data in new ways, they do not have to tear down and rebuild data structures, as in the conventional method. They can simply update the tags already in place.
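In code terms, such an update is a set operation on the tags rather than a schema migration. A toy illustration, with invented names:

    # Toy illustration: searching the data in a new way means updating
    # tags in place, not tearing down and rebuilding a data structure.
    tags = {"doc-7": {"country:UK", "status:draft"}}

    # "Restructuring" is a set operation on the tags already in place.
    tags["doc-7"].discard("status:draft")
    tags["doc-7"].add("status:final")

    print(tags["doc-7"])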

Tagging all of the data, and at a much more granular level than is possible in the conventional cloud approach, greatly expands the value that big data can provide. Information in the data lake is not random and chaotic, but rather is purposeful. The tags help make the data lake like a viscous medium that holds the data in place, and at the same time fosters connections.

The tags also provide a strong new layer of security. We can tag each piece of data, down to the image or paragraph in a document, with the relevant restrictions, authorities, and security and privacy levels. Organizations can establish rules regarding which information can be shared, with whom, and under what circumstances. With the conventional approach, the primary obstacle to information sharing is not technology, but rather the concern that secure information will be compromised. The data lake, by contrast, makes it possible for business and government organizations to easily share information, confident that security, privacy, and other rules governing the data will be strictly maintained. The security of data in the data lake has been proven in highly secure environments within the US government, where the highest levels of precision in security and privacy are required.
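Cell-level enforcement of this kind can be sketched as a comparison between the authorizations a user holds and those a cell requires. The sketch below is a simplification, not the paper's implementation; production systems (Apache Accumulo's visibility labels, for example) use richer expressions:

    # Simplified sketch of data-level security: each cell carries the
    # authorizations required to read it (labels are invented).
    cells = [
        {"value": "routine maintenance log", "required": set()},
        {"value": "patient exam notes",      "required": {"medical"}},
        {"value": "source-sensitive report", "required": {"intel", "noforn"}},
    ]

    def readable_by(user_auths):
        """Return only the cells whose required authorizations the user
        holds; enforcement happens per cell, not per database."""
        return [c["value"] for c in cells if c["required"] <= user_auths]

    print(readable_by({"medical"}))          # first two cells only
    print(readable_by({"intel", "noforn"}))  # first and third cells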

A New Way of Storing Data

With the conventional approach, data storage is expensive—even in the cloud. The reason is that so much space is wasted. Imagine a spreadsheet combining two data sources, an original one with 100 fields and the other with 50. The process of combining means that we will be adding 50 new "columns" to the original spreadsheet. Rows from the original will hold no data for the new columns, and rows from the new source will hold no data for the original ones. The result will be a great many empty cells. This is wasted storage space, and it creates ample opportunity for errors.

In the data lake, however, every cell is filled—no space is wasted. This makes it possible to store vast amounts of data in far less space than would be required for even relatively small conventional cloud databases. With the conventional approach, organizations must continually reinvest in infrastructure as analytic needs change. Connecting the data silos, for example, typically requires reconfiguring and even expanding the infrastructure. But with the data lake, the infrastructure becomes a stable platform. Organizations do not need to continually rebuild and reconfigure their infrastructure. Their initial investment in infrastructure is both enduring and cost-effective.
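The arithmetic behind the savings is straightforward. The sketch below uses the 100- and 50-field example from above; the row counts are invented for illustration:

    # Dense vs. sparse storage for the merged-spreadsheet example
    # (field counts from the text; row counts are hypothetical).
    rows_a, fields_a = 1000, 100   # original source
    rows_b, fields_b = 1000, 50    # second source

    # Dense merge: every row now spans all 150 columns, filled or not.
    dense_cells = (rows_a + rows_b) * (fields_a + fields_b)

    # Sparse layout: only cells that actually hold data are stored.
    sparse_cells = rows_a * fields_a + rows_b * fields_b

    print(dense_cells)                     # 300000 cells allocated
    print(sparse_cells)                    # 150000 cells actually used
    print(1 - sparse_cells / dense_cells)  # 0.5: half the dense table is empty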

The data lake’s almost limitless capacity also enables organizations to store data in a variety of different forms, to aid in later analysis. A financial institution, for example, could store records of certain transactions converted into all of the world’s major currencies. Or, a company could translate every document on a particular subject into Chinese, and store it until it might be needed.

One of the more transformative aspects of the data lake is that it stores every type of data equally—not just structured and unstructured, but also batch and streaming. Batch data is typically collected on an automated basis and then delivered for analysis en masse—for example, the utility meter readings from homes. Streaming data is information from a continuous feed, such as video surveillance.

Formatting unstructured, batch, and streaming data inevitably strips it of much of its richness. And even if a portion of the information can be put into a conventional cloud database, we are still constrained by limited, pre-defined questions. The data lake imposes no such constraints. When unstructured, batch, and streaming data are ingested, analytics can take advantage of the tagging approach to look for patterns that naturally emerge. All types of data, and the value they hold, now become fully accessible.
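Because nothing in the ingest path depends on how the data arrives, one function can serve both a batch delivery and a continuous feed. A minimal sketch, with invented sources:

    # One ingest path for both batch and streaming data (sources are
    # invented for illustration).
    import itertools

    lake = []

    def ingest(records, source):
        """Batch or stream, every record lands in the lake the same way."""
        for rec in records:
            lake.append({"source": source, "payload": rec})

    # Batch: utility meter readings delivered en masse.
    ingest([{"meter": i, "kwh": 12.5 + i} for i in range(3)], source="meters")

    # Streaming: a continuous feed, consumed incrementally; a generator
    # stands in here for live video-surveillance metadata.
    def camera_feed():
        for frame in itertools.count():
            yield {"frame": frame, "motion": frame % 2 == 0}

    ingest(itertools.islice(camera_feed(), 3), source="camera_7")

    print(len(lake))  # 6 records, all ingested through the identical path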

The US military is taking advantage of this capability to help track insurgents and others who are planting improvised explosive devices (IEDs) and other bombs. Many of the military’s data sources include unstructured data, and using the conventional approach—with its extensive preparation—had proved unwieldy and time-consuming. With the data lake, the military is now able to quickly integrate and analyze its vast array of disparate data sources—including its unstructured data—giving military commanders unprecedented situational awareness. This is another example of why simply amassing large amounts of data does not create a data lake. The military was collecting an enormous quantity of data, but without the data lake could not make full use of it to try to stop IEDs. Commanders have reported that the current approach—which has the data lake as its centerpiece—is saving more lives, and at a lower operating cost than the traditional methods.

Accessing the Data for Analytics

One of the chief drawbacks of the conventional approach, which the cloud does not ameliorate, is that it essentially samples the data. When we have questions (or want to test hypotheses), we select a sample of the available data and apply analytics to it. The problem is that we are never quite sure we are pulling the right sample—that is, whether it is really representative of the whole. The data lake eliminates sampling. We no longer have to guess about which data to use, because we are using it all.

With the data lake, our information is available for analysis on-demand, when the need arises. The conventional approach not only requires extensive data preparation, but it is difficult to change databases as queries change. Say the pharmaceutical company wants to add new data sources to identify promising drug compounds, or perhaps wants to change the type of financial analyses it uses. With the conventional approach, analysts would have to tear down the initial data and analytics structures, and re-engineer new ones. With the data lake, analysts would simply add the new data, and ask the new questions.

Because there is no need to continually engineer and re-engineer data structures, the data lake also becomes accessible to non-technical subject matter experts. They no longer need to rely on computer scientists and others to explore the data—they can ask the questions themselves. Subject matter experts best understand the needs and goals of their organizations, and the data lake helps make it possible for them to identify where a specific opportunity may lie. This might entail pinpointing a promising area for revenue growth that has been overlooked by competitors, or finding ways to execute a government agency’s mission faster and more effectively, as in the military’s search for insurgents and IEDs.

The data lake sets the stage for the advanced, high-powered analytics that can point the way to top-line business growth, and help government agencies achieve their mission goals in better ways. Analytics that search for connections and look for patterns have long been hamstrung by being confined to limited, rigid datasets and databases. The data lake frees them to search for knowledge and insight across all of the data. In essence, it allows the analytics, for the first time, to reach their true potential.

A version of the data lake, for example, helped researchers from Booz Allen and a large hospital chain in the Midwest gain surprising insights into severe sepsis and septic shock, life-threatening conditions brought on by serious infection. Using the data lake, researchers consolidated the electronic medical records of tens of thousands of past patients with sepsis, and found unexpected patterns in how their conditions progressed. Those insights prompted the hospital chain to begin a program to more quickly identify and treat current sepsis patients. The program was credited with saving nearly 100 lives during just its first 9 months.

Bottom-Line Savings and Top-Line Growth

Virtually every aspect of the data lake creates cost savings and efficiencies, from freeing up analysts to the ability to easily and inexpensively scale to an organization's growing data. While the conventional methods have worked in the past, they are simply too costly and cumbersome in the age of big data. The data lake gives organizations a reset, in a sense, allowing them to distribute their resources to obtain optimal efficiency and effectiveness. That is critical in today's economic climate. Organizations can address budgetary constraints while significantly expanding, rather than limiting, their data analysis.

At the same time, the data lake helps organizations to reach and then exploit the tipping point of opportunity. Ultimately, the real value of big data lies in big analytics—the capacity to help us do things not just cheaper and better, but in ways we have not yet imagined. For government, this can mean new paradigms for mission success. For business, it can show the way to entire new areas of revenue growth.

As big data grows even larger in the coming years, it will increasingly be used by organizations to differentiate themselves and compete in the marketplace. The winners will be the ones with the greatest ability to extract knowledge and insight from that data, and use it to remake their futures. The data lake opens that door.


About Booz Allen Hamilton

Booz Allen Hamilton has been at the forefront of strategy and technology consulting for nearly a century. Today, Booz Allen is a leading provider of management and technology consulting services to the US government in defense, intelligence, and civil markets, and to major corporations, institutions, and not-for-profit organizations. In the commercial sector, the firm focuses on leveraging its existing expertise for clients in the financial services, healthcare, and energy markets, and for international clients in the Middle East. Booz Allen offers clients deep functional knowledge spanning strategy and organization, engineering and operations, technology, and analytics—which it combines with specialized expertise in clients' mission and domain areas to help solve their toughest problems.

The firm’s management consulting heritage is the basis for its unique collaborative culture and operating model, enabling Booz Allen to anticipate needs and opportunities, rapidly deploy talent and resources, and deliver enduring results. By combining a consultant’s problem-solving orientation with deep technical knowledge and strong execution, Booz Allen helps clients achieve success in their most critical missions—as evidenced by the firm’s many client relationships that span decades. Booz Allen helps shape thinking and prepare for future developments in areas of national importance, including cybersecurity, homeland security, healthcare, and information technology.

Booz Allen is headquartered in McLean, Virginia, employs approximately 25,000 people, and had revenue of $5.86 billion for the 12 months ended March 31, 2012. For over a decade, Booz Allen’s high standing as a business and an employer has been recognized by dozens of organizations and publications, including Fortune, Working Mother, G.I. Jobs, and DiversityInc. More information is available at www.boozallen.com. (NYSE: BAH)

Contacts

Mark Herman, Executive Vice President, [email protected], 703-902-5986

Michael Delurey, Principal, [email protected], 703-902-6858


The most complete, recent list of offices and their addresses and telephone numbers can be found at www.boozallen.com.

Principal Offices

Huntsville, Alabama

Sierra Vista, Arizona

Los Angeles, California

San Diego, California

San Francisco, California

Colorado Springs, Colorado

Denver, Colorado

District of Columbia

Orlando, Florida

Pensacola, Florida

Sarasota, Florida

Tampa, Florida

Atlanta, Georgia

Honolulu, Hawaii

O’Fallon, Illinois

Indianapolis, Indiana

Leavenworth, Kansas

Aberdeen, Maryland

Annapolis Junction, Maryland

Hanover, Maryland

Lexington Park, Maryland

Linthicum, Maryland

Rockville, Maryland

Troy, Michigan

Kansas City, Missouri

Omaha, Nebraska

Red Bank, New Jersey

New York, New York

Rome, New York

Dayton, Ohio

Philadelphia, Pennsylvania

Charleston, South Carolina

Houston, Texas

San Antonio, Texas

Abu Dhabi, United Arab Emirates

Alexandria, Virginia

Arlington, Virginia

Chantilly, Virginia

Charlottesville, Virginia

Falls Church, Virginia

Herndon, Virginia

McLean, Virginia

Norfolk, Virginia

Stafford, Virginia

Seattle, Washington

www.boozallen.com/cloud ©2013 Booz Allen Hamilton Inc.

12.032.12G