MIS 433 :: Big Data :: Dvorsky, State, Turvin 1
Big Data
MIS 433 – Spring 2013
Authors: Biran Dvorsky, Eric State, Patrick Turvin
Instructor: Yuanfeng Cai
5/1/2013
Table of Contents
Abstract
Big Data
What is Big Data?
Making Big Data Valuable
Map Reduce
Hadoop
Healthcare
Big Data in the Cancer Fight
New Discoveries
Insurance
Future of Big Data in Insurance
Conclusion
References
Abstract
Big data is a popular phrase in technology news. Few know what it is, and even fewer truly
understand it. Still, many people are working with it and working to improve it as a viable
business tool. Big data is clearly in its infancy, and it will increasingly become a mainstay of
business practice. It is a big topic, and one that several companies are built entirely around.
We decided to take a quick look at some of the big data technologies available now, and at how
big data techniques are being used in two industries: healthcare and insurance.
Our research only skimmed the surface, but it opened our eyes to a vast number of
technologies, styles, programs, companies, and ideologies. We came away with a greater
understanding of what big data is, what role it plays, and where it will go in the future.
Big Data
If you run a business and are hearing about big data, you probably have many questions about
what it is and what impact it will have on your business. Many businesses are hearing the term
for the first time and are only now realizing how much information they have collected. Many
companies are asking, “Now what do we do with it all?” That question can be answered in so many
different ways that many businesses do not know where to start. Over the past ten years, storing
all this data has been easy, as falling prices for hard drives and other storage solutions have
allowed companies to scale their storage to match their collections. Companies have now
collected so much data that they struggle to process it into business intelligence that might
be of value to them.
We will try to provide insight into what companies are doing to solve the problems that
arise when working with big data. We will look at popular big data technologies, such as
MapReduce and Hadoop, and at how big data is being used in the healthcare and insurance
industries.
What is Big Data?
First, let’s define what big data is. In the executive summary of its report “Big data: The
next frontier for innovation, competition, and productivity,” the McKinsey Global Institute
defines it as follows:
“Big Data” refers to datasets whose size is beyond the ability of typical database
software tools to capture, store, manage, and analyze. This definition is intentionally
subjective and incorporates a moving definition of how big a dataset needs to be in order
to be considered big data.
The authors keep the definition loose, noting that they assume technology will continue to
advance over time, so the size of dataset that qualifies as big data will also increase as
storage capacities grow. The McKinsey Global Institute also made it clear that big data does
not mean the same thing in every industry; the definition is specific to each sector.
This definition is adaptable enough that many businesses can use it to decide when a
problem falls under the topic of big data. Information stores are growing faster now than at
any time in the past. One startling statistic from the executive summary is that in 15 of the
17 sectors in the United States, companies have more data stored per company than the U.S.
Library of Congress. According to a response to an article posted by Leslie Johnston, an
employee of the U.S. Library of Congress, on March 8th, 2013, the Library had over 6.5 petabytes
of digital archives. Its archives are growing at a rate of 15 TB a day, and she goes on to
mention that the data is all backed up on geo-redundant tapes to minimize the risk of data
loss. Ten years ago, hard drives for personal computers were very small. The graph in Figure 1,
from the Ars Technica article “Information explosion: how rapidly expanding storage spurs
innovation,” shows that hard drive capacity has grown almost exponentially, much like the
amount of data being stored.
Figure 1 - Storage capacity has grown exponentially
Making Big Data Valuable
With the ever-increasing availability of storage technology, and with all the data that is
currently being collected and stored, companies are looking for ways to process it and get
value out of what they have collected. The technology community is on the verge of something
entirely new, and with the right tools companies can hopefully gain a qualitatively deeper
level of insight into, and value from, their own data.
Two topics that came up throughout our research were Hadoop and MapReduce. The two are
mentioned in almost the same breath in just about every article about big data. Because Hadoop
is open source, many ‘flavors’ can be developed to meet the needs of any company. Another nice
feature of Hadoop is that it is built on a platform that is adaptable to almost any
architecture and does not depend on the particular hardware available. Hadoop is written in
the Java programming language, which was originally developed by Sun Microsystems and is now
maintained by Oracle. The development of Hadoop and the MapReduce framework will be very
important to how the industry gains useful insight from its data collections in the future.
Hadoop can be used to support data-intensive distributed applications. One of its main
features is that it is written on a platform that adapts easily to whatever server hardware a
business might have. Because Hadoop is written in Java and runs on the Java Runtime
Environment, it can run on large clusters of non-specialized commodity hardware.
Surprisingly, the Hadoop software and framework are not all that old. Hadoop was created by
two individuals, Doug Cutting and Mike Cafarella. It was named after a yellow stuffed elephant
belonging to Cutting’s son, which served as an inspiration for the Hadoop project. A New York
Times article published on March 17, 2009, titled “Hadoop, a Free Software Program, Finds Uses
Beyond Search,” describes Hadoop’s beginnings as an offshoot of Google’s MapReduce. At the time
Hadoop was getting started, Google was publishing very little about its MapReduce technology.
What it did publish was enough to give Cutting a start on developing his own technology, called
Hadoop. Hadoop got its start at Yahoo, Inc. when executives there realized what Cutting was
working on and hired him to bring it to them. According to the New York Times article, within
six months Hadoop had become a critical part of Yahoo, and a year later it was super-critical
to operations. At the time the article was published, Yahoo was reportedly serving over three
hundred million people a month.
At its core, Hadoop implements a key foundation that makes what it does possible:
MapReduce. MapReduce is a programming model, developed by Google, for processing large data
sets, and it is designed for distributed computing on clusters of computers. According to
Wikipedia, which we used as a starting point for our general research, MapReduce allows
programmers to produce parallel distributed programs more easily by requiring them to write
only the simpler Map() and Reduce() functions, which focus on the logic of the specific
problem at hand.
Map Reduce
MapReduce is a framework for processing large data sets across a large number of nodes or
clusters. With this type of processing, it does not matter how the data is structured; it can
be unstructured, or structured in a database. The data flow diagram in Figure 2 below shows how
MapReduce is structured and how it ‘visually works’.
Figure 2 - MapReduce spreads workload over several 'Nodes'
The MapReduce process is fairly simple in how it makes processing large data sets easier.
The master node (pictured in the center) takes the problem data, divides the input into smaller
sub-problems, and distributes them to the worker nodes. Once the worker nodes finish processing
their share of the data, they send the processed data back to the master node. After the master
node collects all of the worker nodes’ solutions, it reduces them into one large solution,
which is the complete solution to the original problem.
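The split, process, and reduce flow described above can be illustrated with a minimal, single-machine sketch in Python. This is not Hadoop's or Google's implementation: the word-count task and the function names (`split_input`, `map_chunk`, `reduce_counts`, `word_count`) are assumptions chosen for illustration, and threads stand in for worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_chunks):
    """Master step: divide the input into roughly equal sub-problems."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Worker step: count the words in one sub-problem."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Master step: merge the workers' partial results into one answer."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def word_count(lines, num_workers=3):
    """Run the whole split -> map -> reduce flow on a list of text lines."""
    chunks = split_input(lines, num_workers)
    # Threads stand in for worker nodes in this sketch (an assumption).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce_counts(partial_counts)


if __name__ == "__main__":
    sample = ["big data is big", "data about data", "hadoop processes big data"]
    print(word_count(sample))
```

In a real cluster the chunks would live on different machines and the reduce step is typically spread across workers as well; the sketch only mirrors the simplified master/worker picture given in the text.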
Having looked at the MapReduce framework, we can now apply it to Hadoop. Hadoop, Apache’s
open source solution for big data, is a framework that can bring this technology to businesses.
Hadoop has many related projects, such as Hive, HBase, and ZooKeeper, to name a few. Each plays
a different role in the ecosystem, but they serve the same broad purpose as the Map() and
Reduce() framework itself: processing large sets of data and bringing the results together as
one solution.
Hadoop
According to Apache’s website and the FAQ on its wiki page, Hadoop works as follows.
Because Hadoop can run on commodity hardware, the data may be held on many servers. For Hadoop
to process data efficiently, it needs to know where the data is located and which processor is
closest to it. If that server cannot process the data for whatever reason, such as a failure,
the next closest server in the rack is chosen instead. If that still cannot happen, Hadoop
continues to choose the next closest server until one can process the data. The whole idea of
this methodology is to process the data as close as possible to where it is stored, which uses
the hardware efficiently and reduces network traffic on the server backbone.
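The "closest server first" idea can be sketched in a few lines of Python. This is an illustrative model only, not Hadoop's scheduler: the `Node` type, the three-level `distance` function (same server, same rack, other rack), and `pick_node` are all hypothetical names and assumptions.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    name: str
    rack: str


def distance(a: Node, b: Node) -> int:
    """Hypothetical distance: 0 = same server, 1 = same rack, 2 = other rack."""
    if a == b:
        return 0
    if a.rack == b.rack:
        return 1
    return 2


def pick_node(data_node: Node, cluster: list, failed: set) -> Optional[Node]:
    """Choose the closest server to the data that has not failed."""
    candidates = [n for n in cluster if n.name not in failed]
    if not candidates:
        return None  # no healthy server is left to process the data
    return min(candidates, key=lambda n: distance(data_node, n))


if __name__ == "__main__":
    cluster = [Node("a1", "rackA"), Node("a2", "rackA"), Node("b1", "rackB")]
    # The data lives on a1; a1 has failed, so a rack neighbour is preferred.
    print(pick_node(cluster[0], cluster, failed={"a1"}))
```

With no failures the server holding the data is chosen; as servers fail, the choice moves to the same rack and only then to another rack, which is what keeps traffic off the backbone.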
Built into the Hadoop framework is the Hadoop Distributed File System (HDFS). The goal of
HDFS is to keep data replicated on different machines throughout the cluster. HDFS does this to
minimize the impact of a rack outage caused by either a power failure or a switch failure. With
this type of redundancy, the data can still be processed even during an outage.
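To make the redundancy argument concrete, here is a small Python sketch that spreads copies of a piece of data across racks and checks whether a copy survives a rack outage. The round-robin placement is a simplifying assumption for illustration and is not HDFS's actual placement policy; `place_replicas` and `survives_rack_outage` are hypothetical names.

```python
def place_replicas(racks, replication=3):
    """Pick `replication` distinct servers, rotating across racks.

    `racks` maps a rack name to the list of server names in that rack.
    """
    if not racks:
        raise ValueError("the cluster has no racks")
    placements = []
    rack_names = sorted(racks)
    position = 0
    while len(placements) < replication:
        rack = rack_names[position % len(rack_names)]
        slot = position // len(rack_names)
        if slot < len(racks[rack]):
            placements.append((rack, racks[rack][slot]))
        elif all(slot >= len(servers) for servers in racks.values()):
            raise ValueError("not enough servers for the requested replication")
        position += 1
    return placements


def survives_rack_outage(placements, failed_rack):
    """True if at least one copy lives outside the failed rack."""
    return any(rack != failed_rack for rack, _ in placements)


if __name__ == "__main__":
    cluster = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2"]}
    copies = place_replicas(cluster)
    print(copies, survives_rack_outage(copies, "rackA"))
```

Because the copies land on more than one rack, losing any single rack to a power or switch failure still leaves at least one readable copy.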
Figure 3 below is a visualization of the processes involved with Hadoop. On the left, the
data that needs to be processed is taken in, broken up, and copied at least three times (an IBM
standard for their Hadoop clusters) for redundancy. Then the compute cluster (center) maps each
piece to the closest processor, which processes the data and sends the results to the reducing
process, where all the separate results are joined together to produce one large result,
pictured on the right.
From our research and reading of various articles, Hadoop is very ‘mainstream’ and is used
by many companies today to tackle their big data issues. Apache’s Hadoop website, under the
‘Who uses Hadoop’ link, has an extensive list of entities that use Hadoop for their
organization’s computing. Some of the major players that develop for the Hadoop framework, use
it, or both are companies like Facebook, Yahoo, LinkedIn, Hortonworks, and Cloudera, to name a
few.
Many of us use services from these companies on a day-to-day basis, and it is very
important that these services work with maximum reliability and uptime, as they are
business-critical functions. Because the Hadoop framework is open source, it can become better
through the contributions of all. When all users contribute to a common cause like the Hadoop
framework, the result is not only high-quality software but also bugs that can be fixed easily
by many contributors. Another great benefit of Hadoop is that you can ‘try before you buy’: it
is free, so you can see whether it is right for you. This is valuable because a company wants
to make the least risky moves that give a high return on investment, in the best interest of
its shareholders and its wallet. Hadoop has a very good future in sight and does not appear to
be leaving anytime soon.
Figure 3 - Hadoop uses Map and Reduce functions to produce results
Healthcare
Several companies and agencies are trying to use big data in the fight against cancer.
Many believe that with the amount of medical data in the world, we can find major insights into
diseases, treat them better, and eventually cure them. By storing information on patients’
cancer status, treatment, and genetic profile, big data analysis can look at how individual
cancers interact with an individual’s genes and how they respond to different types of
treatment. By continuing to collect more detailed information, and doing so millions of times,
machine algorithms can detect trends and identify treatments that are likely to succeed or fail
depending on a patient’s likely response.
Big Data in the Cancer Fight
Ayasdi, a company based in Palo Alto, is working on a new technique called Topological
Data Analysis to make use of the big data sets on cancer treatment. Topological Data Analysis
(TDA) creates images, or maps, that represent the data in a conceptually useful way and let the
user interact with them to better understand the nature of the data sets.
Figure 4 - TDA visualization of data from the Miller-Reaven diabetes study
TDA allows users to look at and interact with data in a highly visual way. Data sets can
be plotted and viewed through filters (called lenses) to show density, centrality, variance,
and other useful metrics. It’s important to note that this visualization is not the same as
simply graphing the data on a Euclidean plot; Ayasdi explains:
The image is not obtained by any standard method in which one projects
on two or three coordinates, and views a scatterplot. Rather, it uses
intrinsic properties of the data in question, and produces a combinatorial
representation which can then be laid out in a convenient form.
Figure 5 - An example of how tumor data can be displayed to assist researchers
Several companies and organizations are working hard to get the world’s cancer data in
one place. Most hospitals’ databases do not interact well and are poorly connected. Once the
data can be united, many believe that valuable information can be extracted using big data
techniques.
New Discoveries
Ayasdi recently identified a new subgroup of breast cancer that, according to its data
derivations, will not require extensive treatment if found in women of a certain genotype. This
information is incredibly valuable to doctors, who may one day be able to look up a patient’s
genotype and cross-reference it with a cancer type, stage, and group to find the historically
most effective treatment.
Figure 6 - Data maps of various breast cancer types
As doctors’ understanding of cancer continues to increase, Ayasdi believes that this
extensive database can eventually lead to a cure for cancer. Since there are so many types of
cancer, it is unlikely that one cure will treat them all, so the medical community will need to
continue to track treatments and their effectiveness.
Insurance
The insurance industry is new to big data. Insurers tend to be late adopters because of how
important their information is, how much such information costs, and how much risk they take on by
being in such an industry. The industry is aware of big data and the benefits it holds if implemented
and cultivated; the effort of implementing such an undertaking is what keeps the industry from fully
committing. Currently, only the so-called "supercarriers" (e.g. Progressive, Allstate, Nationwide) have
the budget and the people to do so. Other, slightly smaller companies, the "large" carriers, "are aware
of big data at a high level, but not so much in terms of what it means to them specifically" (Burger).
These companies understand that big data will be a must in the near future, and they are actively
researching it so they can implement it efficiently. Midsized and smaller companies have only begun to
research big data. Because of their size and limitations, big data is quite possibly out of their reach
for now. As the technology becomes better, more efficient, and cheaper, this may change.
The few companies that are able to use big data are using it to improve customer experience,
risk assessment, and financial maximization. The idea behind improving customer experience, or
exposure, is to make sure the customer is receiving the best care. To get this information, insurance
companies are storing customer data from internal and external sources. Externally, they are widening
their data collection to include customers’ use of the company website and trends and usage on social
media websites. On the company website, they can see which product is viewed most, what trends exist,
if any, and the general flow of traffic. With trends, they can see whether customers who purchase
personal auto insurance are more likely to purchase home insurance. This allows the insurance company
to tailor its suggestions, and it also lets the company see which products need to be dropped or
improved. Including social media provides the same benefits as the website, but on a much larger scale.
Not only can companies see trends, but they can draw those trends from across platforms (Facebook,
Twitter, Google+). The biggest advantage is the exposure the company receives: a larger demographic is
exposed to the insurance company through word of mouth, posts, reviews, tweets, and 'likes'. This
exposure also allows the insurance company to draw on those trends and venture into previously
untouched markets.
Internal use of big data is a bit different. Instead of looking for trends between groups of
customers, it is used to look at individual customers’ trends. This is done by drawing information from
many sources internal to the company (Billing, Claims, Policy) and watching for trends. One use is
spotting a customer who is constantly lapsing on a policy. The company can then draw information from
policy, billing, and finance to see whether the reason is that the customer is carrying a policy larger
than what is required. From there, the insurance company can reassess the policy for the customer to
help mitigate the problem. In this case, big data helps trim and tailor policies.
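The cross-selling trend mentioned above (whether auto customers are more likely to buy home insurance) comes down to comparing two purchase rates. The following Python sketch shows that comparison on made-up data; the customer records and the `purchase_rate` helper are hypothetical and only illustrate the arithmetic.

```python
def purchase_rate(customers, product, given=None):
    """Share of customers holding `product`.

    If `given` is set, only customers who already hold `given` are counted.
    Each customer is represented as a set of product names.
    """
    pool = [c for c in customers if given is None or given in c]
    if not pool:
        return 0.0
    return sum(1 for c in pool if product in c) / len(pool)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    customers = [
        {"auto", "home"},
        {"auto", "home"},
        {"auto"},
        {"home"},
        {"life"},
        {"life"},
    ]
    overall = purchase_rate(customers, "home")
    among_auto = purchase_rate(customers, "home", given="auto")
    print(f"home overall: {overall:.2f}, home among auto customers: {among_auto:.2f}")
```

In this invented sample, half of all customers hold home insurance but two thirds of auto customers do, which is the kind of gap that would justify suggesting home insurance to auto customers.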
Even though the insurance company has the exposure and increased customer support, that
does not mean everyone is able to become a customer. By nature, insurance is built on risk. Insurance
companies have to be able to calculate and manage the risk taken on by adding new customers. The
larger insurance companies are using big data to more accurately assess a person’s or company’s risk
before signing them as customers. For a new customer, the insurance company can compare the provided
information to that of current customers to see how much of a risk the applicant is in comparison, and
decide whether it is willing to take on that risk. This can be done for all avenues of insurance:
personal and commercial auto, home, fire, property, and more. Having the ability to quickly assess this
risk puts the insurance company ahead of its competitors.
The mitigation of risk leads directly to financial maximization. The largest advantage is being
able to accurately price each policy issued. Insurance policies are rated on the amount of risk a
person presents. This accuracy makes sure the customer receives a competitive quote and the insurance
company neither overcharges nor undercharges for the policy. As another level of big data, the company
can then compare each quote against other similar quotes to verify its accuracy. For customers who
"bundle" their car and home insurance, or more, the insurance company can draw information from big
data to compare all possible insurance combinations based on the policy requirements. This makes sure
the customer is getting the best price and the insurance company is providing the best insurance
policy.
Future of Big Data in Insurance
Not all insurance companies are currently able to provide this level of care. In the future, big
data will have to become cheaper and readily available to everyone. Because the insurance industry
depends on accurate information, the integrity of data, both internal and external, has to be assured
and maintained. One current trend moving forward is the integration of multiple "lines". "Lines" are
the different types of insurance: auto, personal, fire, property, and more. The growth stems from the
need to assess risk more accurately. Suppose a customer has commercial insurance, home insurance, and
auto insurance, and wants to add property. The insurance company can assess not only the risk from the
property itself but also the risk of the customer based on the other current policies. The customer
can be low risk with auto insurance but high risk with commercial and home insurance, so the insurance
company might assess the portfolio as high risk as a whole, whereas before the combination the customer
could have been rated low risk on the property alone. This combination of "lines" enables the insurance
company to more accurately assess risk, set prices, and strengthen customer care.
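The property example above can be sketched as a small calculation in Python. The scoring scheme is purely an assumption for illustration (risk scores between 0 and 1 per line, a simple average, and a 0.5 cut-off); real insurers' rating models are not described in our sources, and `portfolio_risk` is a hypothetical helper.

```python
def portfolio_risk(line_scores, threshold=0.5):
    """Label a customer from the average of per-line risk scores.

    `line_scores` maps a line of insurance to a score from 0 (low) to 1 (high).
    Returns a (label, average) pair. The averaging rule is an assumption.
    """
    if not line_scores:
        raise ValueError("at least one line of insurance is required")
    average = sum(line_scores.values()) / len(line_scores)
    label = "high" if average >= threshold else "low"
    return label, average


if __name__ == "__main__":
    # Invented scores, for illustration only.
    property_only = {"property": 0.2}
    all_lines = {"auto": 0.2, "commercial": 0.9, "home": 0.8, "property": 0.2}
    print(portfolio_risk(property_only)[0])  # rated on the property alone
    print(portfolio_risk(all_lines)[0])      # rated across all lines
```

Rated on the property alone, this invented customer looks low risk; once the commercial and home lines are folded in, the same customer is labeled high risk, which is the effect the combination of lines is meant to capture.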
Conclusion
Clearly, big data can prove to be very valuable to many people and applicable to many
industries. As new techniques are developed to handle incredible amounts of data, the usefulness
of big data as a concept will continue to prove itself. While big data is right now somewhat of
a buzzword, and a very difficult conceptual problem, it’s sure to become a mainstay in how
technology can help businesses, governments, and eventually individuals. We looked at just a bit
of what big data is, what it does, and specific uses in the healthcare and insurance industries.
Over the coming years, we are confident that big data will become less of an external research
project and more a part of everyday classwork.
References
Figure 1: Hutchinson, Lee. "Information Explosion: How Rapidly Expanding Storage Spurs
Innovation." Ars Technica. Ars Technica, 27 Sept. 2011. Web. 28 Apr. 2013.
<http://arstechnica.com/business/2011/09/information-explosion-how-rapidly-expanding-storage-spurs-innovation/>.
Figure 2: "MapReduce." Wikipedia. Wikimedia Foundation, 24 Apr. 2013. Web. 28 Apr. 2013.
<http://en.wikipedia.org/wiki/MapReduce>.
Figure 3: "jStart and Solutions - IBM." IBM Emerging Technologies. IBM, n.d. Web. 28 Apr.
2013. <http://www-01.ibm.com/software/ebusiness/jstart/hadoop/>.
Figure 4: "Introduction to Topological Data Analysis." Ayasdi. Web. 2013.
<http://www.ayasdi.com/_downloads/Introduction_to_Topological_Data_Analysis.pdf>.
Figure 5: Nicolau, Monica, Arnold J. Levine, and Gunnar Carlsson. "Topology based data
analysis identifies a subgroup of breast cancers with a unique mutational profile and
excellent survival." Ayasdi, 25 Feb. 2011. <http://www.ayasdi.com/resources>.
Figure 6: Farr, Christina. "A cure for cancer? This 'big data' startup says it can deliver."
Venture Beat, 16 Jan. 2013. <http://venturebeat.com/2013/01/16/ayasdi/>.
Burger, Kathy. "Big Data Not Yet A Best Practice In Insurance." Insurance and Technology. Ed.
UBM Tech, 20 Mar. 2013. Web. 30 Apr. 2013.
<http://www.insurancetech.com/architecture-infrastructure/big-data-not-yet-a-best-practice-in-insu/240151052>.
Carlsson, Gunnar. "Topology and Data." Ayasdi, 29 Jan. 2009.
<http://www.ayasdi.com/_downloads/Topology_and_Data.pdf>.
Johnston, Leslie. "How Many Libraries of Congress Does It Take?" The Signal Digital
Preservation. United States Library of Congress, 23 Mar. 2013. Web. 29 Apr. 2013.
<http://blogs.loc.gov/digitalpreservation/2012/03/how-many-libraries-of-congress-does-it-take/>.
Manyika, James, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles
Roxburgh, and Angela H. Byers. "Insights & Publications." Big Data: The next Frontier
for Innovation, Competition, and Productivity. McKinsey Global Institute, May 2011.
Web. 29 Apr. 2013.
<http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation>.
Marza, Glen, et al. "FAQ - Hadoop." Hadoop Wiki. Apache.org, 28 Nov. 2013. Web. 29 Apr.
2013. <http://wiki.apache.org/hadoop/FAQ>.
Nicolau, Monica, Arnold J. Levine, and Gunnar Carlsson. "Topology based data analysis
identifies a subgroup of breast cancers with a unique mutational profile and excellent
survival." Ayasdi, 25 Feb. 2011. <http://www.ayasdi.com/resources>.
Vance, Ashlee. "Hadoop, a Free Software Program, Finds Uses Beyond Search." The New York
Times. The New York Times, 17 Mar. 2009. Web. 29 Apr. 2013.
<http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html?_r=0>.
"Welcome to Apache™ Hadoop®!" Welcome to Apache™ Hadoop®! Apache.org, 26 Apr.
2013. Web. 29 Apr. 2013. <http://hadoop.apache.org/index.html>.
"What Does Big Data Really Mean for Insurers?" S.A.S. Strategy Meets Action, Aug. 2012.
Web. 30 Apr. 2013. <http://www.sas.com/resources/whitepaper/wp_49547.pdf>.
Winslow, Rob. "'Big Data' for Cancer Care." Wall Street Journal, 26 Mar. 2013.
<http://online.wsj.com/article/SB10001424127887323466204578384732911187000.html>.