Janusz Szwabiński, prac.im.pwr.wroc.pl/~szwabin/assets/bdata/1.pdf

TRANSCRIPT
Introduction into Big Data analytics
Janusz Szwabiński
Contact data
● office hours (C-11 building, room 5.16):
– Monday, 13:30-15:00
– Thursday, 14:00-16:30
– preferably make an appointment via email, providing details of your problem
● http://prac.im.pwr.wroc.pl/~szwabin/index
Course overview
● Introduction to Big Data
● Big data platforms
● In-memory big data platform – Spark
● MapReduce
● Big data analytics
● Big data visualization
● Machine learning
● Project presentations
Bibliography
● Flach, Peter, “Machine Learning”, Cambridge University Press, 2012
● Holmes, Alex, “Hadoop in practice”, Manning Publications, 2013
● Provost, Foster, Fawcett, Tom, “Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking”, O’Reilly, 2013
● Loshin, David, “Big Data Analytics. From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph”, Morgan Kaufmann, 2013
● http://hadoop.apache.org/
● http://spark.apache.org/
● http://storm.apache.org/
● http://kafka.apache.org/
● deRoos, Dirk, “Hadoop for Dummies”, 2014
Assessment
● the lab is graded pass/fail
● final project:
– project ideas in the next talk
– 2-3 students per team (single-author projects are not allowed)
– project proposal presentation in the 6th lab
– final presentation (up to 15 min, during the last 2 lectures)
– the project grade is the final grade for the course
Outlook of today’s talk
● Examples of big data applications
● Definition and characteristics of big data
● Techniques towards big data
● More examples of big data applications
Target knows you are pregnant...
● customers have (had?) a Guest ID number:
– tied to their credit card, name, or email address
– a bucket that stores a history of everything they've bought
– any demographic information Target has collected from them or bought from other sources
● historical buying data for all the ladies who had signed up for Target baby registries in the past
– identified about 25 products that allowed Target to assign each shopper a “pregnancy prediction” score
– a fictional Target shopper who is 23, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug → there’s an 87% chance that she’s pregnant and that her delivery date is sometime in late August
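A score like this can be sketched as a logistic model over purchase indicators. Every product name and weight below is invented for illustration; the slide does not describe Target's actual model:

```python
import math

# Hypothetical weights for a few of the ~25 predictive products
# (all names and numbers invented for illustration).
WEIGHTS = {
    "cocoa_butter_lotion": 1.2,
    "oversized_purse": 0.8,
    "zinc_supplement": 0.7,
    "magnesium_supplement": 0.7,
    "bright_blue_rug": 0.5,
}
BIAS = -2.0

def pregnancy_score(basket):
    """Logistic score in [0, 1] computed from a set of purchased items."""
    z = BIAS + sum(WEIGHTS.get(item, 0.0) for item in basket)
    return 1.0 / (1.0 + math.exp(-z))

basket = {"cocoa_butter_lotion", "oversized_purse",
          "zinc_supplement", "magnesium_supplement", "bright_blue_rug"}
print(round(pregnancy_score(basket), 2))  # → 0.87
```

With these made-up weights, the fictional basket from the slide scores about 0.87.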
Target knows you are pregnant...
● Target started sending coupons for baby items to customers according to their pregnancy scores
– an angry man went into a Target outside of Minneapolis, complaining that his teen daughter was getting coupons for baby clothes
– Target’s manager apologized, and then called the customer a few days later to apologize again
– the man admitted his daughter was pregnant...
● source: https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#e7c161266686
World according to Google
● Google uses 57 different variables or "signals" to create search results tailored specifically for you:
– Search history
– Location
– Active browser
– Computer being used
– Language configured
World according to Google
Author: Eli Pariser
How much data do we generate?
How much data do we generate?
Source: IDC
How much data do we generate?
Source: Industry Tap

Each unit below is 1000× the previous one, with a length analogy for scale:
● 1 megabyte (a large novel) – a tiny ant (~1.5 mm)
● 1 gigabyte (information in the human genome) – height of a person (~1.7 m)
● 1 terabyte (annual world literature production) – length of the Rędziński bridge (1.8 km)
● 1 petabyte (all US academic research libraries) – length of New Zealand (1700 km)
● 1 exabyte (1/3 of annual production of information) – diameter of the Sun (1 391 400 km)
How much data do we generate?
2.5 exabytes a day!!!
● about 852 500 million (852 500 000 000!) 3-min songs in mp3 format
● about 10 million Blu-ray discs
● 90 years of movies in HD quality
Source: http://www.northeastern.edu/levelblog/
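The daily figure can be sanity-checked with rough arithmetic. Both file sizes below are assumptions; the song count lands near the slide's order of magnitude, while the disc count depends heavily on the assumed disc capacity:

```python
# Order-of-magnitude check on the "2.5 exabytes a day" figure.
# Assumed sizes (illustrative): ~3 MB per 3-minute mp3, 25 GB per Blu-ray disc.
DAY_BYTES = 2.5e18       # 2.5 exabytes
MP3_BYTES = 3e6          # ~3 MB per song
BLURAY_BYTES = 25e9      # single-layer Blu-ray disc

songs = DAY_BYTES / MP3_BYTES
discs = DAY_BYTES / BLURAY_BYTES
print(f"{songs:.1e} songs, {discs:.1e} Blu-ray discs")
```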
Big data – 5V classification
Big data
● the term in use since the 1990s, with some giving credit to John Mashey for coining or at least making it popular
● data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time
● unstructured, semi-structured and structured data; however, the main focus is on unstructured data
● "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data
● requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale
Big data
● Volume - the quantity of generated and stored data; the size of the data determines the value and potential insight, and whether it can actually be considered big data or not
● Variety - big data draws from text, images, audio, video; plus it completes missing pieces through data fusion
● Velocity - big data is often available in real-time
● Veracity - data quality of captured data can vary greatly, affecting the accurate analysis
Big data
● Google
● Facebook
● Youtube
● Instagram
● Wikipedia
● Alibaba
These companies (and others) are collecting petabytes of data every minute.
Why do they do that?
Reason #1
● they can afford it!
● storage prices have dropped significantly over the last 3 decades
Reason #2
● they can monetize it!
– everything is personalized (recommendations, ads, offers, promotions, newsfeeds)
– cool products can be built
● Google Maps
● Apple Siri
– other companies pay to mine the data
● Twitter Firehose
● Facebook Topic Data
Data centers
Have a look at
https://www.google.com/about/datacenters/inside/streetview/
Big data paradigm
Huge data centers: lots of servers running sophisticated software to process TBs/PBs of data
Big data paradigm
● only a handful of companies in the world have all of the above
● smaller companies may use cloud services like AWS, Microsoft Azure or GCP
● Netflix, Pinterest or Airbnb run their entire business using just cloud services
Big data - history
● Big data repositories have existed in many forms, often built by corporations with a special need
● Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s
● Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system
– Teradata systems were the first to store and analyze 1 terabyte of data in 1992
– Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves according to Kryder's Law
– Teradata installed the first petabyte class RDBMS based system in 2007
– As of 2017, there are a few dozen petabyte class Teradata relational databases installed, the largest of which exceeds 50 PB
– Systems up until 2008 were 100% structured relational data
– since then, Teradata has added unstructured data types including XML, JSON, and Avro
Big data - history
● in 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query
– the system stores and distributes structured, semi-structured, and unstructured data across multiple servers
– users can build queries in a C++ dialect called ECL
● In 2004, LexisNexis acquired Seisint Inc. and in 2008 acquired ChoicePoint, Inc. and their high-speed parallel processing platform
– the two platforms were merged into HPCC (or High-Performance Computing Cluster) Systems
– in 2011, HPCC was open-sourced under the Apache v2.0 License
● In 2004, Google published a paper on a process called MapReduce
– a parallel processing model
– an associated implementation was released to process huge amounts of data
– the framework was very successful
● An implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop
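The model can be illustrated in a few lines of Python. This is a single-process word-count sketch of the map, shuffle and reduce phases, not the distributed implementation:

```python
from collections import defaultdict
from itertools import chain

# Single-process sketch of the MapReduce model (word count); the real
# framework runs map and reduce tasks in parallel across many machines.
def map_phase(doc):
    # map: emit (key, value) pairs for each input record
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: combine the values collected for one key
    return key, sum(values)

docs = ["big data big platforms", "data platforms"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'platforms': 2}
```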
Big data - history
● Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reduce)
● Quantcast File System became available around 2011
– an open-source distributed file system software package
– large-scale MapReduce or other batch-processing workloads
– alternative to Hadoop, intended to deliver better performance and cost-efficiency for large-scale processing clusters
● MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications
– handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records
● 2012 studies showed that a multiple-layer architecture is one option to address the issues that big data presents
– a distributed parallel architecture distributes data across multiple servers
– this type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks
Big data - history
● data lake
– a method of storing data within a system or repository, in its natural format
– facilitates the collocation of data in various schemata and structural forms, usually object blobs or files
– a single store of all data in the enterprise ranging from raw data to transformed data which is used for various tasks including reporting, visualization, analytics and machine learning
– the data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and even binary data (images, audio, video)
Big data - history
● one of the first lakes - the distributed file system used in Apache Hadoop
● many companies have now entered into this space: Hortonworks, Google, Microsoft, Zaloni, Teradata, Cloudera, Amazon, Cazena
● criticism:
– Sean Martin, CTO of Cambridge Semantics: “We see customers creating big data graveyards, dumping everything into HDFS [Hadoop Distributed File System] and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents”
– the concept is fuzzy and arbitrary
● data swamp - a deteriorated data lake that is inaccessible to its intended users and provides little value
Big data – technologies
● Techniques for analyzing data:
– A/B testing
– machine learning
– natural language processing
● Big data technologies
– business intelligence
– cloud computing
– databases
● Visualization
– charts
– graphs
– other displays of the data
A/B testing
● a controlled experiment with two variants, A and B
● a form of statistical hypothesis testing or "two-sample hypothesis testing"
● in online settings, such as web design, the goal of A/B testing is to identify changes to web pages that increase or maximize an outcome of interest
– formally the current web page is associated with the null hypothesis
● two versions (A and B) are compared, which are identical except for one variation that might affect a user's behavior
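The two-sample test behind A/B testing can be sketched as a two-proportion z-test; the traffic and conversion numbers below are made up for illustration:

```python
import math

# Two-proportion z-test for an A/B experiment (illustrative numbers).
def ab_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)      # conversion rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# A: 200 conversions out of 10 000 visitors; B: 260 out of 10 000
z, p = ab_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # p < 0.05: reject H0 (the current page)
```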
Machine learning
● a field of computer science that gives computers the ability to learn without being explicitly programmed
● the name Machine learning was coined in 1959 by Arthur Samuel
● evolved from the study of pattern recognition and computational learning theory in artificial intelligence
● explores the study and construction of algorithms that can learn from and make predictions on data
● employed in a range of computing tasks: email filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), computer vision
● closely related to (and often overlaps with) computational statistics
● it has strong ties to mathematical optimization
● difficult to carry out with big data
Natural language processing
● a field of computer science, artificial intelligence concerned with the interactions between computers and human (natural) languages
● concerned with programming computers to fruitfully process large natural language data
● challenges frequently involve speech recognition, natural language understanding, and natural language generation.
Business Intelligence
● comprises the strategies and technologies used by enterprises for the data analysis of business information
● provides historical, current and predictive views of business operations
● common functions of business intelligence technologies include:
– reporting - a human readable report on operating and financial data
– data mining - discovering patterns in large data sets
– process mining - analysis of business processes based on event logs
– complex event processing - a method of tracking and analyzing (processing) streams of information (data) about things that happen (events), and deriving a conclusion from them
– business performance management - a set of management and analytic processes that enable businesses to define strategic goals and then measure and manage performance against those goals (financial planning, operational planning, business modeling, consolidation and reporting, monitoring of key performance indicators linked to strategy)
– benchmarking - comparing one's business processes and performance metrics to industry bests and best practices from other companies
– text mining - deriving high-quality information from text
– predictive and prescriptive analytics - predictions about future or otherwise unknown events
Business Intelligence
● can handle large amounts of structured and sometimes unstructured data
● allows easy interpretation of these big data
● identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability
Cloud computing
● an IT paradigm that enables ubiquitous access to shared pools of configurable system resources and higher-level services
● those resources can be rapidly provisioned with minimal management effort, often over the Internet
● relies on sharing of resources to achieve coherence and economies of scale, similar to a public utility
Big data - technologies
● tensor-based computation (for multidimensional data)
● massively parallel-processing (MPP) databases
● search-based applications
● data mining
● distributed file systems
● distributed databases
● cloud and HPC-based infrastructure
● Internet
Big data - technologies
● shared storage architectures – storage area network (SAN) and network-attached storage (NAS) – are perceived as relatively slow, complex, and expensive
– these qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost
● latency should be avoided whenever and wherever possible
– therefore, direct-attached storage (DAS) is often preferred
Big data – computing architecture
Assumptions:
● you work for an e-commerce startup generating 1 TB of weblogs every day
● at the end of a day you want to publish a report on traffic for that day
Big data – computing architecture
Options:
● use a single powerful server
– 1 TB hard disk drive (minimum)
– there is an upper bound for the disk size (12 TB in 2019)
– max transfer speed of data for such a drive ~100 MB/s
– time required to read the data ~10 000 s ≈ 2.8 h
● distribute data on multiple servers
– if data is divided into 10 blocks of 100 GB each…
– time required to read the data drops to one tenth (~17 min)
– max disk size is no longer a problem
– cheaper processors may be used to read smaller blocks of data
● distribute data and process it on multiple servers!
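The arithmetic behind the two options above can be written out explicitly (the ~100 MB/s sequential read speed is the slide's assumption):

```python
# Back-of-the-envelope read times for the 1 TB-per-day weblog scenario.
TB = 1e12          # bytes in 1 TB (decimal)
SPEED = 100e6      # ~100 MB/s sequential read per drive (assumption)

single = TB / SPEED           # one server reading the whole file
distributed = single / 10     # 10 servers reading 100 GB blocks in parallel
print(f"single server: {single/3600:.1f} h, 10 servers: {distributed/60:.1f} min")
```

Parallel reads scale the time down linearly with the number of servers, which is exactly what the third option exploits by also processing the data where it is stored.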
Big data – case studies
● The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second
– nearly 600 million collisions per second
– after filtering and refraining from recording more than 99.99995% of these streams, there are 100 collisions of interest per second
– only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012)
– this becomes nearly 200 petabytes after replication
– if all sensor data were recorded in LHC, the data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication
Big data – case studies
● The Square Kilometre Array - a radio telescope built of thousands of antennas
– expected to be operational by 2024
– antennas are expected to gather 14 exabytes and store one petabyte per day
– considered one of the most ambitious scientific projects ever undertaken
Big data – case studies
● Walmart handles more than 1 million customer transactions every hour
– databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data
– it is the equivalent of 167 times the information contained in all the books in the US Library of Congress
Big data – case studies
● eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising
● Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers
– the core technology is Linux-based
– as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB
● Facebook handles 50 billion photos from its user base
● Google was handling roughly 100 billion searches per month as of August 2012
● Oracle NoSQL Database has been tested to pass the 1M ops/sec mark with 8 shards and proceeded to hit 1.2M ops/sec with 10 shards
Big data – criticism
● no understanding of the underlying empirical micro-processes that lead to the emergence of the typical network characteristics of Big Data
● often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes
● decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is"
– in order to make predictions in changing environments, it would be necessary to have a thorough understanding of the system's dynamics, which requires theory
● threat to privacy represented by increasing storage and integration of personally identifiable information
● neglect of basic principles, such as choosing a representative sample, out of excessive concern with merely handling the huge amounts of data
● big data analysis is often shallow compared to analysis of smaller data sets
– in many projects no large-scale analysis actually takes place; the real challenge lies in the extract-transform-load (ETL) stage of data preprocessing
● big data is a buzzword and a "vague term", but at the same time an "obsession" with entrepreneurs, consultants, scientists and the media
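The preprocessing work that the ETL criticism refers to can be sketched as a minimal extract-transform-load pipeline; the record layout and cleaning rules below are purely illustrative:

```python
# A minimal extract-transform-load (ETL) pipeline, the preprocessing
# stage the criticism refers to. The data and cleaning rules are made up.
import csv
import io

raw = "id,amount\n1, 10.5 \n2,\n3,7"   # messy source: stray whitespace, a missing value

def extract(text):
    # parse the raw source into dict records
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # clean and normalise: strip whitespace, drop rows with missing amounts
    out = []
    for row in rows:
        amount = row["amount"].strip()
        if amount:
            out.append({"id": int(row["id"]), "amount": float(amount)})
    return out

def load(rows, store):
    # stand-in for a bulk insert into a data warehouse
    store.extend(rows)

warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)   # [{'id': 1, 'amount': 10.5}, {'id': 3, 'amount': 7.0}]
```

Even in this toy version, most of the code is cleaning and reshaping data rather than analysing it, which is exactly the point of the criticism.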
Big data – critiques of the “V” model
● it centres on computational scalability
● it neglects the perceptibility and understandability of information
● Cognitive Big Data model:
– data completeness: understanding of the non-obvious from data;
– data correlation, causation, and predictability: causality is not an essential requirement for achieving predictability;
– explainability and interpretability: humans want to understand, and will only accept, what they can interpret, which algorithms still struggle to support;
– level of automated decision making: algorithms that support automated decision making and algorithmic self-learning;
Interesting applications – Deep Blue
● a chess-playing computer developed by IBM
● Deep Blue won its first game against a world champion on 10 February 1996, when it defeated Garry Kasparov in game one of a six-game match
● Kasparov won three and drew two of the following five games, defeating Deep Blue by a score of 4–2
● Deep Blue was then heavily upgraded, and played Kasparov again in May 1997
● by winning the six-game rematch 3½–2½, it became the first computer system to defeat a reigning world champion in a match under standard chess tournament time controls
Interesting applications – Watson
● a question answering computer system capable of answering questions posed in natural language, developed by IBM
● specifically developed to answer questions on the quiz show Jeopardy!
● in 2011, the Watson computer system competed on Jeopardy! against former winners Brad Rutter and Ken Jennings, winning the first-place prize of $1 million
● Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage
– full text of Wikipedia
– no access to the Internet during the game
● in 2013, IBM announced that Watson’s first commercial application would be for utilization management decisions in lung cancer treatment at Memorial Sloan Kettering Cancer Center, New York City
Interesting applications – AlphaGo
● a computer program that plays the board game Go
● developed by Alphabet Inc.'s Google DeepMind in London
● in October 2015, AlphaGo became the first computer Go program to beat a human professional Go player without handicaps on a full-sized 19×19 board
● in March 2016, it beat Lee Sedol in a five-game match, the first time a computer Go program had beaten a 9-dan professional without handicaps