![Page 1: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture1.pdf · • Contact Mr. Jarad Downing jarad@cs.uky.edu for obtaining an account and knowing the ... Tom White • ISBN-13: 978-1491901632](https://reader033.vdocument.in/reader033/viewer/2022060300/5f081d817e708231d42069e1/html5/thumbnails/1.jpg)
CS 626 Large Scale Data Science
Jun Zhang
Department of Computer Science
University of Kentucky
Based on materials prepared by Dr. Licong Cui
Lecture 1 – Introduction
Outline
Course Logistics
Student Introduction
Introduction to Big Data
Course Logistics
• Class hours: TR 12:30 pm – 1:45 pm
• Class location: F. Paul Anderson Tower, Room 255
• Office hours: MW 9:00 am – 10:00 am
• Course documents: http://www.cs.uky.edu/~jzhang/CS626/cs626.html
  o Syllabus
  o Files
    - Slides
    - Homework and Project Assignments
Course Description
• Data => Actionable information
• Big Data Techniques
  – Hadoop/MapReduce
  – HBase
  – Hive
  – Pig
  – Spark
• Real-world data science problems
Prerequisites and Expected Background
• Algorithm design and analysis
• Database systems (e.g., MySQL)
• Programming languages
  – Java (preferred)
  – Python
• Linux basics (e.g., ssh, scp)
• Your own computer requirements:
  – 64-bit OS
  – 10+ GB RAM
Alternative Hardware Systems
• Use the CS Department's OpenStack cluster
• Contact Mr. Jarad Downing (jarad@cs.uky.edu) to obtain an account and learn the requirements
• The Cloudera system has been installed on OpenStack
• More information about OpenStack is at: https://www.cs.uky.edu/docs/users/openstack.html
What Do You Need for the OpenStack Cluster?
• Connect to the UK campus network via VPN; see https://www.cs.uky.edu/docs/users/vpn.html
• Install NoMachine, available at https://www.nomachine.com
• Use your UK ID address and the credentials (cloudera/cloudera) to connect.
Textbook (Optional)
• Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition)
• Author: Tom White
• ISBN-13: 978-1491901632
• ISBN-10: 1491901632
Grading Criteria
• Homework/Programming assignments (40%)
• Paper presentation (20%)
• Project (30%)
  – Project team: each team consists of up to 3 members
  – Clear statement of contribution for each team member
  – Deliverables: mid-project report (5%), live demos (5%), and final project report (20%)
• Attendance and participation (10%)
  – Attendance: 5%
  – Participation: 5% (participating in class discussions)
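The weights above combine into a single course percentage. The following is a minimal Python sketch, assuming each component is scored on a 0 to 100 scale; the dictionary keys and the function name `course_percentage` are hypothetical and not part of the course materials.

```python
# Component weights in percent, taken from the grading criteria (they sum to 100).
WEIGHTS = {
    "homework": 40,
    "paper_presentation": 20,
    "project": 30,                    # 5 mid-project report + 5 live demos + 20 final report
    "attendance_participation": 10,   # 5 attendance + 5 participation
}


def course_percentage(scores):
    """Return the weighted course percentage.

    `scores` maps each component name to a 0-100 score. That scale is an
    assumption made for this sketch; the slides do not specify one.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student scores, for illustration only.
example = {
    "homework": 90,
    "paper_presentation": 80,
    "project": 85,
    "attendance_participation": 100,
}
print(course_percentage(example))
```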
Grading Scale
85 – 100% = A
75 – 84% = B
60 – 74% = C
< 60% = E
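The scale can be read as a simple lookup. The sketch below is illustrative only: the slide does not say how fractional percentages such as 84.5 are handled, so it assumes that each band's lower bound is inclusive and that no rounding is applied. The function name `letter_grade` is hypothetical.

```python
def letter_grade(percentage):
    """Map a course percentage to a letter grade.

    Assumption (not stated in the slides): lower bounds are inclusive and
    no rounding is applied, so 84.5 maps to B rather than A.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percentage >= 85:
        return "A"
    if percentage >= 75:
        return "B"
    if percentage >= 60:
        return "C"
    return "E"


for p in (92, 84.5, 60, 59.9):
    print(p, letter_grade(p))
```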
Course Policies
• Academic Integrity
  – Independently complete homework/programming assignments.
  – Proper acknowledgement is required if you borrow ideas or content from other sources.
• Submission Policy
  – See each assignment for deadlines.
  – Late submissions will not be accepted.
Course Policies
• Attendance Policy
  – In order to meet federal regulations, the instructor will monitor student participation in this class through attendance or assignments. Students whose attendance or participation cannot be determined at least once during the first three weeks of the semester may be dropped from the course.
Course Policies
• Attendance Policy
  – University policy: students are expected to withdraw from the class if more than 20% of the classes scheduled for the semester are missed (excused or unexcused).
• Excused Absences
  – http://www.uky.edu/Ombud/
Student Introduction
Introduction to Big Data
Why Big Data?
  o What launched the Big Data era?
  o What makes Big Data valuable?
Characteristics of Big Data
What Launched the Big Data Era?
Retail
  o 2 billion products sold in 2014
Social media
  o 204 million emails/min
  o 1.8 million likes, 200,000 photos/min
  o 278,000 tweets/min
  o 40,000 queries/sec, 3.5 billion/day
Healthcare
  o Samaritan Medical Center, Watertown, NY: 120 TB as of 2013
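The per-second and per-day query figures describe the same rate, which a quick unit conversion confirms. The Python sketch below also scales the per-minute email figure to a daily total; the helper name `per_day` is hypothetical, and the calculation assumes the quoted rates hold steadily over a full day.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440


def per_day(rate, units_per_day):
    """Scale a per-second or per-minute rate to a daily total."""
    return rate * units_per_day


queries_per_day = per_day(40_000, SECONDS_PER_DAY)
emails_per_day = per_day(204_000_000, MINUTES_PER_DAY)

print(f"{queries_per_day:,} queries/day")   # 3,456,000,000, i.e. about 3.5 billion
print(f"{emails_per_day:,} emails/day")     # 293,760,000,000
```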
What Makes Big Data Valuable?
Big Data => Better Models => Higher Precision
Example: Recommendation Engines
Example: Using Big Data to Help Patients
Big Data for precision medicine
  o Personalized healthcare
  o Predict/Prevent disease
Data sources
  o Genome
  o Sensors
  o Electronic Health Record (EHR)
  o People
Genome Data
200 GB/genome
Sensor Data
Electronic Health Record (EHR)
People-generated Data: Fitness Device Data
2-5 GB/day
How Big Data Can Help?
Integration
Genome Data
Sensor Data
Electronic Health Records
People-generated Data
How Big Data Can Help?
Integration => Personalization => Precision
Basic principles for big data integration
• Create a common understanding of data definitions
• Develop a set of data services to qualify the data and make it consistent and ultimately trustworthy
• Set up a streamlined way to integrate your big data sources and systems of record
Characteristics of Big Data – 6V’s
• Volume
• Variety
• Velocity
• Veracity
• Valence
• Value
Volume of big data
• The amount of data
• Facebook has 250 billion images and 2.5 trillion posts (2016)
• The amount of data is ever increasing
• How to store the data
• How to process the data
Variety of big data
• An ever-increasing number of different forms of data
• Photographs, sensor data, tweets, encrypted packages
• Traditional data tables
• E-mail messages, with attachments
• Photos, videos, and audio recordings
Velocity of big data
• The speed at which big data is created, stored, and/or analyzed.
• Facebook users upload 900 million photos every day
• Packet analysis for cybersecurity
• Search engine queries
• Internet of Things
Veracity of big data
• Quality and trustworthiness of data
• Accuracy, precision, reliability
• Any bias, noise, or abnormality in the data?
• Falsification?
• No good data, no good results
Valence of big data
• Connectedness of big data in the form of graphs
• Data items bond with each other
• Forming connections between disparate data
• Positive valence and negative valence
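One way to make connectedness concrete is to treat data items as graph nodes and measure how many of the possible pairwise connections actually exist. The Python sketch below uses that ratio as a valence score. This definition is an assumption made for illustration (the slide does not give a formula), and the function name `valence` and the sample items are hypothetical.

```python
def valence(items, connections):
    """Ratio of actual to possible pairwise connections, from 0.0 to 1.0.

    Assumed definition: connections are undirected, and duplicates,
    self-loops, and pairs naming unknown items are ignored.
    """
    items = set(items)
    n = len(items)
    if n < 2:
        return 0.0  # no pair of items exists, so nothing can be connected
    actual = {
        frozenset(pair)
        for pair in connections
        if len(set(pair)) == 2 and set(pair) <= items
    }
    possible = n * (n - 1) // 2
    return len(actual) / possible


# Four hypothetical data items linked in a chain: 3 of 6 possible connections.
items = ["genome", "sensor", "ehr", "fitness"]
links = [("genome", "ehr"), ("ehr", "sensor"), ("sensor", "fitness")]
print(valence(items, links))
```

A score near 1.0 indicates densely connected data, while a score near 0.0 indicates mostly isolated items.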
Value of big data
• The ability to convert big data information into a monetary reward
• The final goal of big data
• Data mining?
• Decisions and results