
MIS 433 :: Big Data :: Dvorsky, State, Turvin 1

Big Data MIS 433 – Spring 2013 Authors: Biran Dvorsky, Eric State, Patrick Turvin

2013

Instructor: Yuanfeng Cai 5/1/2013


Table of Contents

Abstract

Big Data

  What is Big Data?

  Making Big Data Valuable

    Map Reduce

    Hadoop

Healthcare

  Big Data in the Cancer Fight

  New Discoveries

Insurance

  Future of Big Data in Insurance

Conclusion

References


Abstract

Big data is a popular phrase in technology news. Few know what it is, and even fewer truly understand it, yet many people are working with it and on improving it as a viable business tool. Big data is clearly in its infancy, and it will increasingly become a mainstay of business practice. It is a big topic, and one that several companies are built entirely around. We decided to take a quick look at some of the big data technologies available now, and at how big data techniques are being used in two industries: healthcare and insurance.

Our research only skimmed the surface, and it opened our eyes to a vast number of technologies, styles, programs, companies, and ideologies. We came away with a greater understanding of what big data is, what role it plays, and where it will go in the future.


Big Data

If you run a business and are hearing about big data, you likely have many questions about what it is and what impact it will have on your business. Many businesses are hearing the term for the first time and are only now realizing how much information they have collected. Many are asking, “Now what do we do with it all?” That question can be answered in so many ways that many businesses simply don’t know where to start. Over the past ten years, storing all this data has been easy: falling prices for hard drives and other storage solutions have allowed companies to scale their storage to match their collections. Companies have now collected so much data that they struggle to process it into business intelligence that might be of value to them.

We will try to provide insight into what companies are doing to solve the problems that arise when they work with big data. We will look at popular big data technologies, such as MapReduce and Hadoop, and at how big data is being used in the healthcare and insurance industries.

What is Big Data?

First, let’s define big data. In the executive summary of its report “Big data: The next frontier for innovation, competition, and productivity,” the McKinsey Global Institute defines it as follows:

“Big Data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data.

They go on to define big data loosely, assuming that technology will advance over time and that the dataset size that qualifies as big data will grow as storage capacities grow. The McKinsey Global Institute also makes clear that big data does not mean the same thing for all industries; the definition is specific to each industry.

This definition is adaptable enough for many businesses to recognize when a problem of theirs falls under the heading of big data. Information stores are growing faster than ever before. One startling statistic in the executive summary is that 15 of the 17 sectors in the United States have more data stored per company than the U.S. Library of Congress. According to a response to an article posted by Leslie Johnston, an employee of the U.S. Library of Congress, on March 8th, 2013, the Library had over 6.5 petabytes of digital archives. Its archives are growing at a rate of 15 TB a day, and she goes on to mention that the data is all backed up on geo-redundant tapes to minimize the risk of data loss. Ten years ago, hard drives for personal computers were very small. The graph below, from the Ars Technica article “Information explosion: how rapidly expanding storage spurs innovation,” shows hard drive capacity growing almost exponentially, much like the volume of data being stored.


Figure 1 - Storage capacity has grown exponentially
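To put the Library of Congress figures in perspective, the short sketch below works through the arithmetic. It assumes decimal units (1 PB = 1,000 TB) and a constant growth rate of 15 TB a day; both are our simplifying assumptions, and the function names are only illustrative.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1,000 TB


def days_to_reach(current_pb, target_pb, growth_tb_per_day):
    """Days needed to grow from current_pb to target_pb at a constant daily rate."""
    return (target_pb - current_pb) * TB_PER_PB / growth_tb_per_day


def yearly_growth_pb(growth_tb_per_day, days=365):
    """Petabytes added in one year at a constant daily rate."""
    return growth_tb_per_day * days / TB_PER_PB


# Figures quoted in the text: 6.5 PB stored, growing by 15 TB a day.
doubling_days = days_to_reach(6.5, 13.0, 15)
added_per_year = yearly_growth_pb(15)
print(f"Days to double: {doubling_days:.0f}")
print(f"PB added per year: {added_per_year:.3f}")
```

Under these assumptions the archive would double in a little over 14 months, which gives a sense of how quickly "big" becomes a moving target.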

Making Big Data Valuable

With storage technology ever more available and so much data being collected and stored, companies are looking for ways to process it and extract value from what they have collected. The technology community is facing a challenge it has never faced before, and with the right tools companies can hope to gain a qualitatively deeper level of insight into their own data.

Hadoop and MapReduce came up throughout our research. The two are mentioned in almost the same breath in just about every article on big data. Because Hadoop is open source, many ‘flavors’ can be developed to meet the needs of any company. Another advantage of Hadoop is that it is built on a platform that adapts to almost any architecture and does not depend on particular hardware. Hadoop is written in the Java programming language, which was originally developed by Sun Microsystems and is now maintained by Oracle. The development of Hadoop and the MapReduce framework will shape how the industry draws useful insight from its data collections in the future.

Hadoop can be used to support data-intensive distributed applications. One of its main features is that it is written on a platform that adapts easily to whatever server hardware a business might have. Because Hadoop is written in Java and runs on the Java Runtime Environment, it can run on large clusters of non-specialized commodity hardware.

Surprisingly, the Hadoop software and framework are not all that old. Hadoop was created by two individuals, Doug Cutting and Mike Cafarella, and was named after a yellow stuffed elephant belonging to Cutting’s son. A New York Times article published on March 17, 2009, titled “Hadoop, a Free Software Program, Finds Uses Beyond Search,” describes Hadoop’s beginnings as an offshoot of Google’s MapReduce. When Hadoop was getting started, Google was publishing very little about its MapReduce technology, but what it did publish was enough to give Cutting a start on developing his own technology, Hadoop. Hadoop got its start at Yahoo, Inc., when executives there realized what Cutting was working on and hired him to bring it to them. According to the New York Times article, within six months Hadoop had become a critical part of Yahoo, and a year later it was super-critical to operations. At the time the article was published, Yahoo was reportedly serving over three hundred million people a month.

At Hadoop’s core is MapReduce, which makes what it does possible. MapReduce is a programming model developed by Google for processing large data sets on a distributed cluster of computers. According to Wikipedia, where we began our general research, MapReduce lets programmers produce parallel distributed programs more easily by requiring them to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand.


Map Reduce

MapReduce is a framework for processing large data sets across a large number of nodes or clusters. With this type of processing, it does not matter how the data is structured; it can be unstructured or structured in a database. The data flow diagram below shows how MapReduce is organized and how it works.

Figure 2 - MapReduce spreads workload over several 'Nodes'

The MapReduce process is fairly simple, and it makes processing large data sets easier. The master node (pictured in the center) takes the problem data, divides the input into smaller sub-problems, and distributes them to the worker nodes. Once the worker nodes finish processing their sub-problems, they send the processed data back to the master node. After the master node has collected all the worker nodes’ solutions, it reduces them into one combined solution, which is the complete solution to the original problem.
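The split, process, and combine steps described above can be sketched with the classic word-count example. This is a minimal single-machine illustration in plain Python, not Hadoop's API: the function names are our own, and each text chunk simply stands in for the share of data a worker node would receive.

```python
from collections import defaultdict


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all the values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then reduce the grouped results."""
    mapped = []
    for chunk in chunks:  # each chunk stands in for one worker's sub-problem
        mapped.extend(map_words(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big insight"])
print(counts)
```

On a real cluster the map calls would run in parallel on different machines; here they run one after another, which keeps the sketch short while preserving the structure of the model.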

Combining what we have learned about MapReduce, we can apply the same framework to Hadoop. Hadoop, Apache’s open source solution to big data, is a framework that can bring this technology to businesses. Hadoop also has many related projects, such as Hive, HBase, and ZooKeeper, to name a few, which serve different roles within the same ecosystem; the core idea remains streaming data through the Map() and Reduce() steps and bringing the processed pieces together as one solution to the large data set.

Hadoop

According to Apache’s website and the FAQ on its wiki, Hadoop works as follows. Because Hadoop can run on commodity hardware, the data may be spread across many servers. To process data efficiently, Hadoop needs to know where the data is located and which processor is closest to it. If the closest server cannot process the data for some reason, such as a failure, the next closest server in the rack is chosen instead. If that still cannot happen, Hadoop continues choosing the next closest server until the data can be processed. The idea is to process the data as close as possible to where it is stored, which uses the hardware efficiently and reduces network traffic on the server backbone.
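The fallback order described above can be illustrated with a small sketch. This is not Hadoop's actual scheduler; the server records, the `healthy` flag, and the three distance levels (same machine, same rack, other rack) are assumptions we make for illustration.

```python
def pick_server(servers, data_node, data_rack):
    """Return the name of the closest healthy server to the data, or None."""

    def distance(server):
        if server["name"] == data_node:
            return 0  # the machine that already holds the data
        if server["rack"] == data_rack:
            return 1  # another machine in the same rack
        return 2  # a machine in a different rack

    healthy = [server for server in servers if server["healthy"]]
    if not healthy:
        return None
    return min(healthy, key=distance)["name"]


cluster = [
    {"name": "node-a", "rack": "rack-1", "healthy": False},  # holds the data, but is down
    {"name": "node-b", "rack": "rack-1", "healthy": True},
    {"name": "node-c", "rack": "rack-2", "healthy": True},
]
print(pick_server(cluster, data_node="node-a", data_rack="rack-1"))
```

Because the server holding the data is marked as failed in this example, the sketch falls back to another machine in the same rack before considering a machine in a different rack.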

Built into the Hadoop framework is the Hadoop Distributed File System (HDFS). The goal of HDFS is to keep data replicated on different machines throughout the cluster. It does this to minimize the risk that a rack outage, caused by either a power or a switch failure, makes data unavailable. With this redundancy, the data can still be processed even during an outage.
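To show why spreading copies across racks protects against a rack outage, here is a simplified sketch. It places replicas round-robin across racks, which is our own simplification and not the placement policy HDFS actually uses; the rack and node names are hypothetical.

```python
def place_replicas(nodes_by_rack, replication=3):
    """Choose (rack, node) pairs for replicas, cycling through the racks in turn."""
    racks = sorted(nodes_by_rack)
    queues = {rack: list(nodes_by_rack[rack]) for rack in racks}
    chosen = []
    while len(chosen) < replication and any(queues.values()):
        for rack in racks:
            if queues[rack] and len(chosen) < replication:
                chosen.append((rack, queues[rack].pop(0)))
    return chosen


def survives_rack_outage(placement, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(rack != failed_rack for rack, _ in placement)


layout = {"rack-1": ["n1", "n2"], "rack-2": ["n3"]}
placement = place_replicas(layout)
print(placement)
print(survives_rack_outage(placement, "rack-1"))
```

If all three copies had been placed in one rack, a single switch or power failure would make the data unreachable; with copies in two racks, either rack can fail and one copy remains.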

Figure 3 below visualizes the processes involved in Hadoop. On the left, the data to be processed is taken in and replicated at least three times (an IBM standard for its Hadoop clusters) for redundancy. The compute cluster (center) then maps each piece to the closest processor, which processes the data and sends the results to the reduce step, where the separate results are joined into one large result, pictured on the right.

Through our research and reading of various articles, we found that Hadoop is very ‘mainstream’ and is used by many companies today to tackle their big data problems. The ‘Who uses Hadoop’ link on Apache’s Hadoop website leads to an extensive list of organizations that use Hadoop for their computing. Major players that develop for the Hadoop framework, use it, or both include Facebook, Yahoo, LinkedIn, Hortonworks, and Cloudera, to name a few.

Many of us use services from these companies on a day-to-day basis, and because those services are business-critical functions, it is very important that they work with maximum reliability and uptime. Because the Hadoop framework is open source, it can improve through the contributions of all. When users contribute to a common cause like the Hadoop framework, the result is not only high-quality software but also bugs that many contributors can fix quickly. Another benefit of Hadoop is that you can ‘try before you buy’: it is free, so you can see whether it is right for you. This matters because a company wants to make the least risky moves that give a high return on investment, in the best interest of its shareholders and its own wallet. Hadoop has a bright future and does not appear to be going away anytime soon.

Figure 3 - Hadoop uses Map and Reduce functions to produce results


Healthcare

Several companies and agencies are trying to use big data in the fight against cancer. Many believe that the amount of medical data in the world can yield major insights into diseases, allowing us to treat them better and eventually cure them. By storing information on patients’ cancer status, treatment, and genetic profile, big data analysis can examine how individual cancers interact with an individual’s genes and how they respond to different types of treatment. By continuing to collect more detailed information, and doing so millions of times, machine algorithms can detect trends and identify which treatments are likely to succeed or fail given a patient’s likely response.

Big Data in the Cancer Fight

Ayasdi, a company based in Palo Alto, is working on applying a new technique called Topological Data Analysis to big data sets on cancer treatment. Topological Data Analysis (TDA) creates images, or maps, that represent the data in a conceptually useful way and lets the user interact with them to better understand the nature of the data sets.

Figure 4 - TDA visualization of data from the Miller-Reaven diabetes study


TDA allows users to look at and interact with data in a highly visual way. Data sets can be plotted and viewed through filters (called lenses) that show density, centrality, variance, and other useful metrics. It’s important to note that this visualization is not the same as simply graphing the data on a Euclidean plot; Ayasdi explains:

The image is not obtained by any standard method in which one projects

on two or three coordinates, and views a scatterplot. Rather, it uses

intrinsic properties of the data in question, and produces a combinatorial

representation which can then be laid out in a convenient form.
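One widely described way of building such a combinatorial representation is the Mapper construction: apply a lens to every point, cover the lens values with overlapping intervals, cluster the points in each interval, and connect clusters that share points. The sketch below is a toy version under our own assumptions (a one-dimensional lens, a simple distance threshold `eps` for clustering, hypothetical function names); we do not claim it matches Ayasdi's software.

```python
import math


def connected_groups(points, members, eps):
    """Split member indices into groups whose points are chained within eps."""
    remaining = set(members)
    groups = []
    while remaining:
        seed = remaining.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[current], points[j]) <= eps}
            remaining -= near
            group |= near
            frontier.extend(near)
        groups.append(frozenset(group))
    return groups


def mapper_graph(points, lens, n_intervals=2, overlap=0.5, eps=1.5):
    """Return (nodes, edges): clusters per lens interval, linked when they share points."""
    values = [lens(p) for p in points]
    low, high = min(values), max(values)
    width = (high - low) / n_intervals
    nodes = []
    for i in range(n_intervals):
        start = low + i * width - overlap * width
        end = low + (i + 1) * width + overlap * width
        members = [idx for idx, v in enumerate(values) if start <= v <= end]
        nodes.extend(connected_groups(points, members, eps))
    edges = {(a, b)
             for a in range(len(nodes))
             for b in range(a + 1, len(nodes))
             if nodes[a] & nodes[b]}
    return nodes, edges


sample = [(0.0,), (1.0,), (2.0,), (3.0,)]
nodes, edges = mapper_graph(sample, lens=lambda p: p[0])
print(nodes, edges)
```

The result is a small graph rather than a scatterplot: each node is a group of similar records, and an edge means two groups overlap, which is the kind of shape a researcher could then lay out and explore.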

Figure 5 - An example of how tumor data can be displayed to assist researchers

Several companies and organizations are working hard to gather the world’s cancer data in one place. Most hospitals’ databases do not interact well and are poorly connected. Once the data is united, many believe that valuable information can be extracted using big data techniques.

New Discoveries

Ayasdi recently identified a new subgroup of breast cancer that, according to its analysis, will not require extensive treatment if found in women of a certain genotype. This information is incredibly valuable to doctors, who may one day look up a patient’s genotype and cross-reference it with a cancer type, stage, and group to find the historically most effective treatment.

Figure 6 - Data maps of various breast cancer types

As doctors’ understanding of cancer continues to increase, Ayasdi believes that this extensive database can eventually lead to a cure for cancer. Because there are so many types of cancer, it is unlikely that one cure will treat them all, so the medical community will need to continue tracking treatments and their effectiveness.


Insurance

The insurance industry is new to big data. Insurers tend to be late adopters because of how important their information is, what that information costs, and how much risk they carry by the nature of the industry. The industry is aware of big data and the benefits it holds if implemented and cultivated; the effort of implementing it is what keeps the industry from fully committing. Currently only the so-called "supercarriers" (e.g. Progressive, Allstate, Nationwide) have the budget and the people to do so. Slightly smaller companies, classed as "large", "are aware of big data at a high level, but not so much in terms of what it means to them specifically" (Burger). These companies understand that big data will be a must in the near future, and they are actively researching it so that they can implement it efficiently. Midsized and smaller companies have only begun to research big data. Because of their size and limitations, big data is quite possibly out of their reach for now. As the technology becomes better, more efficient, and cheaper, this may change.

The few companies able to use big data are applying it to improve customer experience, risk assessment, and financial maximization. The idea behind improving customer experience, or exposure, is to make sure the customer receives the best care. To do so, insurance companies store customer data from both internal and external sources. Externally, they are widening their data collection to include customers' use of the company website and trends and usage on social media websites. On the company website they can see which product is viewed most, what trends exist, if any, and the general flow of traffic. From these trends they can see, for example, whether customers who purchase personal auto insurance are more likely to purchase home insurance. This allows the insurance company to tailor its suggestions and to see which products need to be dropped or improved. Including social media provides the same benefits as the website, but on a much larger scale. Not only can insurers see trends, they can draw those trends across platforms (Facebook, Twitter, Google+). The biggest advantage is the exposure the company receives: a larger demographic is exposed to the insurance company through word of mouth, posts, reviews, tweets, and 'likes'. This exposure also allows the insurance company to draw on those trends and venture into previously untouched markets.

Internal use of big data is a bit different. Instead of looking for trends across groups of customers, it looks at the trends of individual customers. This is done by drawing information from many sources inside the company (Billing, Claims, Policy) and watching for trends. One use is spotting a customer whose policy is constantly lapsing. The company can then draw information from policy, billing, and finance and see, for example, that the customer is carrying a policy larger than what is required. From there the insurance company can reassess the policy for the customer to help mitigate the problem. In this case big data helps to lean out and tailor policies.
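The auto-and-home trend mentioned above comes down to comparing two rates: how often customers in general hold home insurance, and how often auto customers do. The sketch below uses made-up customer records and our own function name purely for illustration; a real insurer would draw these records from its website and policy systems.

```python
def purchase_rates(customers, have, then):
    """Return (overall rate of `then`, rate of `then` among holders of `have`)."""
    if not customers:
        return 0.0, 0.0
    holders = [products for products in customers.values() if have in products]
    overall = sum(then in products for products in customers.values()) / len(customers)
    among = sum(then in products for products in holders) / len(holders) if holders else 0.0
    return overall, among


# Hypothetical records: customer id -> set of products held.
records = {
    "c1": {"auto", "home"},
    "c2": {"auto", "home"},
    "c3": {"auto"},
    "c4": {"fire"},
    "c5": {"home"},
    "c6": {"fire"},
}
overall, among_auto = purchase_rates(records, have="auto", then="home")
print(f"Home insurance overall: {overall:.2f}, among auto customers: {among_auto:.2f}")
```

When the rate among auto customers is clearly higher than the overall rate, as in this invented data, suggesting home insurance to auto customers is a reasonable way to tailor recommendations.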

Even though the insurance company gains exposure and improved customer support, that does not mean everyone can become a customer. By nature, insurance is built on risk. Insurance companies have to be able to calculate and manage the risk they take on by adding new customers. The larger insurance companies are using big data to assess a person's or company's risk more accurately before signing them as customers. For a new customer, the insurance company can compare the provided information against current customers to see how much of a risk the applicant is in comparison, and decide whether it is willing to take on that risk. This can be done for all lines of insurance: personal and commercial auto, home, fire, property, and more. The ability to assess this risk quickly puts the insurance company ahead of its competitors.

The mitigation of risk leads directly to financial maximization. The largest advantage is being able to price each policy accurately. Insurance policies are rated on the amount of risk a person presents. Accurate rating ensures that the customer receives a competitive quote and that the insurance company is neither overpaid nor underpaid for that policy. As a further use of big data, the company can compare each quote against similar quotes to verify its accuracy. For customers who "bundle" their car and home insurance, or more, the insurance company can draw on big data to compare all possible insurance combinations that meet the policy requirements. This ensures that the customer gets the best price and that the insurance company provides the best insurance policy.

Future of Big Data in Insurance

Not all insurance companies can currently provide this level of care. In the future, big data will have to become cheaper and readily available to everyone. The insurance industry depends on accurate information, so the integrity of data, both internal and external, has to be assured and maintained. One current trend is the integration of multiple "lines". "Lines" are the different types of insurance: auto, personal, fire, property, and more. The trend stems from the need to assess risk more accurately. Suppose a customer has commercial, home, and auto insurance and wants to add property insurance. The insurance company can assess not only the risk of the property itself but also the risk of the customer based on the other current policies. The customer may be low risk for auto insurance but high risk for commercial and home insurance, so the insurance company might assess the portfolio as a whole as high risk, whereas before the combination the customer could have been considered low risk for the property. Combining "lines" enables the insurance company to assess risk and price accurately and to strengthen customer care.


Conclusion

Clearly, big data can prove very valuable to many people and applicable to many industries. As new techniques are developed to handle incredible amounts of data, the usefulness of big data as a concept will continue to prove itself. Right now big data is somewhat of a buzzword and a very difficult conceptual problem, but it is sure to become a mainstay of how technology helps businesses, governments, and eventually individuals. We looked at just a bit of what big data is, what it does, and specific uses in the healthcare and insurance industries. Over the coming years, we are confident that big data will become less of an external research project and more a part of everyday classwork.


References

Figure 1: Hutchinson, Lee. "Information Explosion: How Rapidly Expanding Storage Spurs Innovation." Ars Technica. Ars Technica, 27 Sept. 2011. Web. 28 Apr. 2013. <http://arstechnica.com/business/2011/09/information-explosion-how-rapidly-expanding-storage-spurs-innovation/>.

Figure 2: "MapReduce." Wikipedia. Wikimedia Foundation, 24 Apr. 2013. Web. 28 Apr. 2013. <http://en.wikipedia.org/wiki/MapReduce>.

Figure 3: "jStart and Solutions - IBM." IBM Emerging Technologies. IBM, n.d. Web. 28 Apr. 2013. <http://www-01.ibm.com/software/ebusiness/jstart/hadoop/>.

Figure 4: "Introduction to Topological Data Analysis." Ayasdi. Web. 2013. <http://www.ayasdi.com/_downloads/Introduction_to_Topological_Data_Analysis.pdf>.

Figure 5: Nicolau, Monica, Arnold J. Levine, and Gunnar Carlsson. "Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival." Ayasdi, 25 Feb. 2011. <http://www.ayasdi.com/resources>.

Figure 6: Farr, Christina. "A cure for cancer? This 'big data' startup says it can deliver." Venture Beat, 16 Jan. 2013. <http://venturebeat.com/2013/01/16/ayasdi/>.

Burger, Kathy. "Big Data Not Yet A Best Practice In Insurance." Insurance and Technology. Ed. UBM Tech, 20 Mar. 2013. Web. 30 Apr. 2013. <http://www.insurancetech.com/architecture-infrastructure/big-data-not-yet-a-best-practice-in-insu/240151052>.

Carlsson, Gunnar. "Topology and Data." Ayasdi, 29 Jan. 2009. <http://www.ayasdi.com/_downloads/Topology_and_Data.pdf>.

Johnston, Leslie. "How Many Libraries of Congress Does It Take?" The Signal Digital Preservation. United States Library of Congress, 23 Mar. 2013. Web. 29 Apr. 2013. <http://blogs.loc.gov/digitalpreservation/2012/03/how-many-libraries-of-congress-does-it-take/>.

Manyika, James, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela H. Byers. "Insights & Publications." Big Data: The next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, May 2011. Web. 29 Apr. 2013. <http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation>.

Marza, Glen, et al. "FAQ - Hadoop." Hadoop Wiki. Apache.org, 28 Nov. 2013. Web. 29 Apr. 2013. <http://wiki.apache.org/hadoop/FAQ>.

Nicolau, Monica, Arnold J. Levine, and Gunnar Carlsson. "Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival." Ayasdi, 25 Feb. 2011. <http://www.ayasdi.com/resources>.

Vance, Ashlee. "Hadoop, a Free Software Program, Finds Uses Beyond Search." The New York Times. The New York Times, 17 Mar. 2009. Web. 29 Apr. 2013. <http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html?_r=0>.

"Welcome to Apache™ Hadoop®!" Welcome to Apache™ Hadoop®! Apache.org, 26 Apr. 2013. Web. 29 Apr. 2013. <http://hadoop.apache.org/index.html>.

"What Does Big Data Really Mean for Insurers?" S.A.S. Strategy Meets Action, Aug. 2012. Web. 30 Apr. 2013. <http://www.sas.com/resources/whitepaper/wp_49547.pdf>.

Winslow, Rob. "'Big Data' for Cancer Care." Wall Street Journal, 26 Mar. 2013. <http://online.wsj.com/article/SB10001424127887323466204578384732911187000.html>.
