Janusz Szwabiński, prac.im.pwr.wroc.pl/~szwabin/assets/bdata/1.pdf

TRANSCRIPT
Introduction into Big Data analytics
Janusz Szwabiński
Contact data
● office hours (C-11 building, room 5.16):
– Monday, 13:30-15:00
– Thursday, 14:00-16:30
– preferably make an appointment via email, providing details of your problem
● http://prac.im.pwr.wroc.pl/~szwabin/index
Course overview
● Introduction to Big Data
● Big data platforms
● In-memory big data platform – Spark
● MapReduce
● Big data analytics
● Big data visualization
● Machine learning
● Project presentations
Bibliography
● Flach, Peter, “Machine Learning”, Cambridge University Press, 2012
● Holmes, Alex, “Hadoop in practice”, Manning Publications, 2013
● Provost, Foster, Fawcett, Tom, “Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking”, O’Reilly, 2013
● Loshin, David, “Big Data Analytics. From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph”, Morgan Kaufmann, 2013
● http://hadoop.apache.org/
● http://spark.apache.org/
● http://storm.apache.org/
● http://kafka.apache.org/
● deRoos, Dirk, “Hadoop for Dummies”, 2014
Assessment
● the lab is graded pass/fail
● final project:
– project ideas in the next talk
– 2-3 students per team (single-author projects are not allowed)
– project proposal presentation in the 6th lab
– final presentation (up to 15 min, during the last 2 lectures)
– the project grade is the final grade for the course
Outlook of today’s talk
● Examples of big data applications
● Definition and characteristics of big data
● Techniques towards big data
● More examples of big data applications
Target knows you are pregnant...
● customers have (had?) a Guest ID number:
– tied to their credit card, name, or email address
– a bucket that stores a history of everything they've bought
– any demographic information Target has collected from them or bought from other sources
● historical buying data for all the ladies who had signed up for Target baby registries in the past
– identified about 25 products that allowed Target to assign each shopper a “pregnancy prediction” score
– a fictional Target shopper who is 23, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug → there’s an 87% chance that she’s pregnant and that her delivery date is sometime in late August
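A score like this can be sketched as a logistic model over purchase indicators. Every product name and weight below is invented for illustration; the slide does not describe Target's actual model:

```python
import math

# Hypothetical weights for a few of the ~25 predictive products
# (all names and numbers invented for illustration).
WEIGHTS = {
    "cocoa_butter_lotion": 1.2,
    "oversized_purse": 0.8,
    "zinc_supplement": 0.7,
    "magnesium_supplement": 0.7,
    "bright_blue_rug": 0.5,
}
BIAS = -2.0

def pregnancy_score(basket):
    """Logistic score in [0, 1] computed from a set of purchased items."""
    z = BIAS + sum(WEIGHTS.get(item, 0.0) for item in basket)
    return 1.0 / (1.0 + math.exp(-z))

basket = {"cocoa_butter_lotion", "oversized_purse",
          "zinc_supplement", "magnesium_supplement", "bright_blue_rug"}
print(round(pregnancy_score(basket), 2))  # → 0.87
```

With these made-up weights, the fictional basket from the slide scores about 0.87.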
Target knows you are pregnant...
● Target started sending coupons for baby items to customers according to their pregnancy scores
– an angry man went into a Target outside of Minneapolis, complaining that his teen daughter was getting coupons for baby clothes
– Target’s manager apologized, and then called the customer a few days later to apologize again
– the man admitted his daughter was pregnant...
● source: https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#e7c161266686
World according to Google
● Google uses 57 different variables or "signals" to create search results tailored specifically for you:
– Search history
– Location
– Active browser
– Computer being used
– Language configured
World according to Google
Author: Eli Pariser
How much data do we generate?
How much data do we generate?
Source: IDC
How much data do we generate?
Source: Industry Tap

Each unit below is 1000× the previous one, with a length analogy for scale:
● 1 megabyte (a large novel) – a tiny ant (~1.5 mm)
● 1 gigabyte (information in the human genome) – height of a person (~1.7 m)
● 1 terabyte (annual world literature production) – length of the Rędziński bridge (1.8 km)
● 1 petabyte (all US academic research libraries) – length of New Zealand (1700 km)
● 1 exabyte (1/3 of annual production of information) – diameter of the Sun (1 391 400 km)
How much data do we generate?
2.5 exabytes a day!!!
● about 852 500 million (852 500 000 000!) 3-min songs in mp3 format
● about 10 million Blu-ray discs
● 90 years of movies in HD quality
Source: http://www.northeastern.edu/levelblog/
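The daily figure can be sanity-checked with rough arithmetic. Both file sizes below are assumptions; the song count lands near the slide's order of magnitude, while the disc count depends heavily on the assumed disc capacity:

```python
# Order-of-magnitude check on the "2.5 exabytes a day" figure.
# Assumed sizes (illustrative): ~3 MB per 3-minute mp3, 25 GB per Blu-ray disc.
DAY_BYTES = 2.5e18       # 2.5 exabytes
MP3_BYTES = 3e6          # ~3 MB per song
BLURAY_BYTES = 25e9      # single-layer Blu-ray disc

songs = DAY_BYTES / MP3_BYTES
discs = DAY_BYTES / BLURAY_BYTES
print(f"{songs:.1e} songs, {discs:.1e} Blu-ray discs")
```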
Big data – 5V classification
Big data
● the term in use since the 1990s, with some giving credit to John Mashey for coining or at least making it popular
● data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time
● unstructured, semi-structured and structured data; however, the main focus is on unstructured data
● "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data
● requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale
Big data
● Volume - the quantity of generated and stored data; the size of the data determines the value and potential insight, and whether it can actually be considered big data or not
● Variety - big data draws from text, images, audio, video; plus it completes missing pieces through data fusion
● Velocity - big data is often available in real-time
● Veracity - data quality of captured data can vary greatly, affecting the accurate analysis
Big data
● Google
● Facebook
● Youtube
● Instagram
● Wikipedia
● Alibaba
These companies (and others) are collecting petabytes of data every minute.
Why do they do that?
Reason #1
● they can afford it!
● storage prices have dropped significantly over the last 3 decades
Reason #2
● they can monetize it!
– everything is personalized (recommendations, ads, offers, promotions, newsfeeds)
– cool products can be built
● Google Maps
● Apple Siri
– other companies pay to mine the data
● Twitter Firehose
● Facebook Topic Data
Data centers
Have a look at
https://www.google.com/about/datacenters/inside/streetview/
Big data paradigm
Huge data centers: lots of servers running sophisticated software to process TBs/PBs of data
Big data paradigm
● only a handful of companies in the world have all of the above
● smaller companies may use cloud services like AWS, Microsoft Azure or GCP
● Netflix, Pinterest or Airbnb run their entire business using just cloud services
Big data - history
● Big data repositories have existed in many forms, often built by corporations with a special need
● Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s
● Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system
– Teradata systems were the first to store and analyze 1 terabyte of data in 1992
– Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves according to Kryder's Law
– Teradata installed the first petabyte class RDBMS based system in 2007
– As of 2017, there are a few dozen petabyte class Teradata relational databases installed, the largest of which exceeds 50 PB
– Systems up until 2008 were 100% structured relational data
– since then, Teradata has added unstructured data types including XML, JSON, and Avro
Big data - history
● in 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query
– the system stores and distributes structured, semi-structured, and unstructured data across multiple servers
– users can build queries in a C++ dialect called ECL
● In 2004, LexisNexis acquired Seisint Inc. and in 2008 acquired ChoicePoint, Inc. and their high-speed parallel processing platform
– the two platforms were merged into HPCC (or High-Performance Computing Cluster) Systems
– in 2011, HPCC was open-sourced under the Apache v2.0 License
● In 2004, Google published a paper on a process called MapReduce
– a parallel processing model
– an associated implementation was released to process huge amounts of data
– the framework was very successful
● An implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop
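The model can be illustrated in a few lines of Python. This is a single-process word-count sketch of the map, shuffle and reduce phases, not the distributed implementation:

```python
from collections import defaultdict
from itertools import chain

# Single-process sketch of the MapReduce model (word count); the real
# framework runs map and reduce tasks in parallel across many machines.
def map_phase(doc):
    # map: emit (key, value) pairs for each input record
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: combine the values collected for one key
    return key, sum(values)

docs = ["big data big platforms", "data platforms"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'platforms': 2}
```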
Big data - history
● Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reduce)
● Quantcast File System became available around 2011
– an open-source distributed file system software package
– large-scale MapReduce or other batch-processing workloads
– alternative to Hadoop, intended to deliver better performance and cost-efficiency for large-scale processing clusters
● MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications
– handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records
● 2012 studies showed that a multiple-layer architecture is one option to address the issues that big data presents
– a distributed parallel architecture distributes data across multiple servers
– this type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks
Big data - history
● data lake
– a method of storing data within a system or repository, in its natural format
– facilitates the collocation of data in various schemata and structural forms, usually object blobs or files
– a single store of all data in the enterprise ranging from raw data to transformed data which is used for various tasks including reporting, visualization, analytics and machine learning
– the data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and even binary data (images, audio, video)
Big data - history
● one of the first lakes - the distributed file system used in Apache Hadoop
● many companies have now entered into this space: Hortonworks, Google, Microsoft, Zaloni, Teradata, Cloudera, Amazon, Cazena
● criticism:
– Sean Martin, CTO of Cambridge Semantics: “We see customers creating big data graveyards, dumping everything into HDFS [Hadoop Distributed File System] and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents”
– the concept is fuzzy and arbitrary
● data swamp - a deteriorated data lake that is inaccessible to its intended users and provides little value
Big data – technologies
● Techniques for analyzing data:
– A/B testing
– machine learning
– natural language processing
● Big data technologies
– business intelligence
– cloud computing
– databases
● Visualization
– charts
– graphs
– other displays of the data
A/B testing
● a controlled experiment with two variants, A and B
● a form of statistical hypothesis testing or "two-sample hypothesis testing"
● in online settings, such as web design, the goal of A/B testing is to identify changes to web pages that increase or maximize an outcome of interest
– formally the current web page is associated with the null hypothesis
● two versions (A and B) are compared, which are identical except for one variation that might affect a user's behavior
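The two-sample test behind A/B testing can be sketched as a two-proportion z-test; the traffic and conversion numbers below are made up for illustration:

```python
import math

# Two-proportion z-test for an A/B experiment (illustrative numbers).
def ab_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)      # conversion rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# A: 200 conversions out of 10 000 visitors; B: 260 out of 10 000
z, p = ab_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # p < 0.05: reject H0 (the current page)
```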
Machine learning
● a field of computer science that gives computers the ability to learn without being explicitly programmed
● the name Machine learning was coined in 1959 by Arthur Samuel
● evolved from the study of pattern recognition and computational learning theory in artificial intelligence
● explores the study and construction of algorithms that can learn from and make predictions on data
● employed in a range of computing tasks: email filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), computer vision
● closely related to (and often overlaps with) computational statistics
● it has strong ties to mathematical optimization
● difficult to carry out with big data
Natural language processing
● a field of computer science, artificial intelligence concerned with the interactions between computers and human (natural) languages
● concerned with programming computers to fruitfully process large natural language data
● challenges frequently involve speech recognition, natural language understanding, and natural language generation.
Business Intelligence
● comprises the strategies and technologies used by enterprises for the data analysis of business information
● provides historical, current and predictive views of business operations
● common functions of business intelligence technologies include:
– reporting - a human readable report on operating and financial data
– data mining - discovering patterns in large data sets
– process mining - analysis of business processes based on event logs
– complex event processing - a method of tracking and analyzing (processing) streams of information (data) about things that happen (events), and deriving a conclusion from them
– business performance management - a set of management and analytic processes that enable businesses to define strategic goals and then measure and manage performance against those goals (financial planning, operational planning, business modeling, consolidation and reporting, monitoring of key performance indicators linked to strategy)
– benchmarking - comparing one's business processes and performance metrics to industry bests and best practices from other companies
– text mining - deriving high-quality information from text
– predictive and prescriptive analytics - predictions about future or otherwise unknown events
Business Intelligence
● can handle large amounts of structured and sometimes unstructured data
● allows easy interpretation of these big data
● identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability
Cloud computing
● an IT paradigm that enables ubiquitous access to shared pools of configurable system resources and higher-level services
● those resources can be rapidly provisioned with minimal management effort, often over the Internet
● relies on sharing of resources to achieve coherence and economies of scale, similar to a public utility
Big data - technologies
● tensor-based computation (for multidimensional data)
● massively parallel-processing (MPP) databases
● search-based applications
● data mining
● distributed file systems
● distributed databases
● cloud and HPC-based infrastructure
● Internet
Big data - technologies
● shared storage architectures – storage area network (SAN) and network-attached storage (NAS) – are perceived as relatively slow, complex, and expensive
– these qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost
● latency should be avoided whenever and wherever possible
– therefore, direct-attached storage (DAS) is often preferred
Big data – computing architecture
Assumptions:
● you work for an e-commerce startup generating 1 TB of weblogs every day
● at the end of a day you want to publish a report on traffic for that day
Big data – computing architecture
Options:
● use a single powerful server
– 1 TB hard disk drive (minimum)
– there is an upper bound for the disk size (12 TB in 2019)
– max transfer speed of data for such a drive ~100 MB/s
– time required to read the data ~10 000 s ≈ 2.8 h
● distribute data on multiple servers
– if data is divided into 10 blocks of 100 GB each…
– time required to read the data drops to one tenth (~17 min)
– max disk size is no longer a problem
– cheaper processors may be used to read smaller blocks of data
● distribute data and process it on multiple servers!
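The arithmetic behind the two options above can be written out explicitly (the ~100 MB/s sequential read speed is the slide's assumption):

```python
# Back-of-the-envelope read times for the 1 TB-per-day weblog scenario.
TB = 1e12          # bytes in 1 TB (decimal)
SPEED = 100e6      # ~100 MB/s sequential read per drive (assumption)

single = TB / SPEED           # one server reading the whole file
distributed = single / 10     # 10 servers reading 100 GB blocks in parallel
print(f"single server: {single/3600:.1f} h, 10 servers: {distributed/60:.1f} min")
```

Parallel reads scale the time down linearly with the number of servers, which is exactly what the third option exploits by also processing the data where it is stored.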
Big data – case studies
● The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second
– nearly 600 million collisions per second
– after filtering and refraining from recording more than 99.99995% of these streams, there are 100 collisions of interest per second
– only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012)
– this becomes nearly 200 petabytes after replication
– if all sensor data were recorded in LHC, the data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication
Big data – case studies
● The Square Kilometre Array - a radio telescope built of thousands of antennas
– expected to be operational by 2024
– antennas are expected to gather 14 exabytes and store one petabyte per day
– considered one of the most ambitious scientific projects ever undertaken
Big data – case studies
● Walmart handles more than 1 million customer transactions every hour
– databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data
– it is the equivalent of 167 times the information contained in all the books in the US Library of Congress
Big data – case studies
● eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising
● Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers
– the core technology is Linux-based
– as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB
● Facebook handles 50 billion photos from its user base
● Google was handling roughly 100 billion searches per month as of August 2012
● Oracle NoSQL Database has been tested to pass the 1M ops/sec mark with 8 shards and proceeded to hit 1.2M ops/sec with 10 shards
Big data – criticism
● no understanding of the underlying empirical micro-processes that lead to the emergence of the typical network characteristics of Big Data
● often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes
● decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is"
– in order to make predictions in changing environments, it would be necessary to have a thorough understanding of the system's dynamics, which requires theory
● threat to privacy represented by increasing storage and integration of personally identifiable information
● neglect of basic principles, such as choosing a representative sample, out of excessive concern with merely handling the huge amounts of data
● big data analysis is often shallow compared to analysis of smaller data sets
– in many projects no large-scale analysis actually takes place; the real challenge lies in the extract-transform-load (ETL) stage of data preprocessing
● big data is a buzzword and a "vague term", but at the same time an "obsession" with entrepreneurs, consultants, scientists and the media
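The preprocessing work that the ETL criticism refers to can be sketched as a minimal extract-transform-load pipeline; the record layout and cleaning rules below are purely illustrative:

```python
# A minimal extract-transform-load (ETL) pipeline, the preprocessing
# stage the criticism refers to. The data and cleaning rules are made up.
import csv
import io

raw = "id,amount\n1, 10.5 \n2,\n3,7"   # messy source: stray whitespace, a missing value

def extract(text):
    # parse the raw source into dict records
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # clean and normalise: strip whitespace, drop rows with missing amounts
    out = []
    for row in rows:
        amount = row["amount"].strip()
        if amount:
            out.append({"id": int(row["id"]), "amount": float(amount)})
    return out

def load(rows, store):
    # stand-in for a bulk insert into a data warehouse
    store.extend(rows)

warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)   # [{'id': 1, 'amount': 10.5}, {'id': 3, 'amount': 7.0}]
```

Even in this toy version, most of the code is cleaning and reshaping data rather than analysing it, which is exactly the point of the criticism.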
Big data – critiques of the “V” model
● it centres on computational scalability
● it neglects the perceptibility and understandability of information
● Cognitive Big Data model:
– data completeness: understanding of the non-obvious from data;
– data correlation, causation, and predictability: causality is not an essential requirement for achieving predictability;
– explainability and interpretability: humans want to understand, and will only accept, what they can interpret, which algorithms still struggle to support;
– level of automated decision making: algorithms that support automated decision making and algorithmic self-learning;
Interesting applications – Deep Blue
● a chess-playing computer developed by IBM
● Deep Blue won its first game against a world champion on 10 February 1996, when it defeated Garry Kasparov in game one of a six-game match
● Kasparov won three and drew two of the following five games, defeating Deep Blue by a score of 4–2
● Deep Blue was then heavily upgraded, and played Kasparov again in May 1997
● by winning the six-game rematch 3½–2½, it became the first computer system to defeat a reigning world champion in a match under standard chess tournament time controls
Interesting applications – Watson
● a question answering computer system capable of answering questions posed in natural language, developed by IBM
● specifically developed to answer questions on the quiz show Jeopardy!
● in 2011, the Watson computer system competed on Jeopardy! against former winners Brad Rutter and Ken Jennings, winning the first-place prize of $1 million
● Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage
– full text of Wikipedia
– no access to the Internet during the game
● in 2013, IBM announced that Watson’s first commercial application would be for utilization management decisions in lung cancer treatment at Memorial Sloan Kettering Cancer Center, New York City
Interesting applications – AlphaGo
● a computer program that plays the board game Go
● developed by Alphabet Inc.'s Google DeepMind in London
● in October 2015, AlphaGo became the first computer Go program to beat a human professional Go player without handicaps on a full-sized 19×19 board
● in March 2016, it beat Lee Sedol in a five-game match, the first time a computer Go program had beaten a 9-dan professional without handicaps