CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
1. INTRODUCTION TO BIG DATA

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

What is Big Data?

CS535 Big Data | Computer Science Department | Colorado State University

Big Data
• Things one can do at a large scale that cannot be done at a smaller one
  • To extract new insights
  • To create new forms of value
• Big Data is about analytics of huge quantities of data in order to infer probabilities
• Big Data is NOT about trying to “teach” a computer to “think” like humans
• Providing a quantitative dimension it never had before

The three (or four) Vs in Big Data
• Volume
  • Voluminous
  • It does not have to be a certain number of petabytes or any fixed quantity.
• Velocity
  • How fast is the data coming in?
  • How fast do you need to be able to analyze and utilize it?
• Variety
  • Number of sources or incoming vectors
• Veracity
  • Can you trust the data itself, the source of the data, or the process?
  • User entry errors, redundancy, corruption of the values
  • Data cleaning

Who is using Big Data?

Photo Credit: https://datafloq.com/read/car-manufacturers-are-using-big-data/1204

Connected cars
• A single plug-in hybrid car generates up to 25 gigabytes per hour
• Connected cars
  • $130 billion
  • Traffic problems: re-routing based on the volume of traffic
  • Alerts the driver when road conditions are hazardous by automatically activating the anti-lock brakes
  • This information is shared with the vehicles that are nearby

The Artemis project: Saving “preemies” using Big Data
• The Artemis project
  • Dr. Carolyn McGregor
  • Toronto’s Hospital for Sick Children, University of Ontario Institute of Technology, and IBM
• Captures and processes the patients’ data in real time
  • 16 different data streams
  • Heart rate, respiration rate, temperature, blood pressure, and blood oxygen level
  • Around 1,260 data points per second
• The system detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear
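
To give a feel for the velocity involved, the quoted rate can be scaled up to an hour and a day. This is a rough sketch only: it assumes the figure of about 1,260 data points per second is sustained around the clock, which the slide does not state, and the function name `points_in` is made up for illustration.

```python
# Figure quoted on the slide: "around 1,260 data points per second".
POINTS_PER_SECOND = 1_260
SECONDS_PER_HOUR = 60 * 60


def points_in(hours: float) -> int:
    """Data points collected in `hours`, assuming a constant rate (an assumption)."""
    return round(POINTS_PER_SECOND * SECONDS_PER_HOUR * hours)


if __name__ == "__main__":
    print(f"per hour: {points_in(1):,}")
    print(f"per day:  {points_in(24):,}")
```

Under that assumption, the 24-hour early-warning window alone covers roughly 109 million data points.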

Look Who’s Peeking at Your Paycheck
• Experian’s Income Insight
  • Estimates people’s income level
  • Based on their credit history
  • Trains the estimation model using selected credit history and tax information from the IRS

Karen Blumenthal, “Look Who’s Peeking at Your Paycheck”, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles/SB10001424052748703672104574654211904801106

Related research areas
• Storage systems
  • How can we efficiently resolve queries on massive amounts of input data?
  • The input dataset may be presented in the form of a distributed data stream
• Machine learning
  • How can we efficiently solve large-scale machine learning problems?
  • The input data may be massive and stored in a distributed cluster of machines
• Distributed computing
  • How can we efficiently solve large-scale optimization problems in distributed computing environments?
  • For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large-scale graphs?

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
2. COURSE INTRODUCTION

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

Big Data Lab at Colorado State University
• Director: Sangmi Pallickara
• Algorithmic and systems design
  • Scalable analytics over voluminous datasets on complex distributed architectures
• Research has been deployed in the following domains
  • Precision agriculture, atmospheric science, environmental biology, ecology, civil engineering, bioinformatics, and public health

Big Data Lab at Colorado State University
• Awards
  • Cochran Family Professorship 2018-2021
  • IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher) 2018
  • National Science Foundation CAREER Award 2016
• Funded by
  • The National Science Foundation
  • The Advanced Research Projects Agency-Energy (Department of Energy)
  • Department of Homeland Security
  • The Environmental Defense Fund
  • Google, Amazon, and Hewlett Packard

People

Saptashwa Mitra
Sam Armstrong
Aaron Pereira
Kartik Khurana
Paahuni Khandelwal
Sangmi Pallickara
Undergraduate researchers at CURC
Walid Budgaga
Dan Rammer
Ryan Becwar
Laksheen Mendis
Kevin Brewwiler
Caleb Carlson

Goal of this course
• Understand fundamental concepts in Big Data Analytics
  • Computing Systems + Scalable Algorithms and Models
• Learn about existing technologies and how to apply them

[Figure: computing systems; algorithms and models. Labels: analytics, predictive models, graph models, storage systems and middleware, computing frameworks, specialized modeling tools]

Communications [1/2]

• Course Website
  • http://www.cs.colostate.edu/~cs535
  • Announcements: Check the course website at least twice a week.
  • Schedule (course materials, readings, assignments)
  • Policies
• Canvas
  • Assignment submission
  • Grades
• Piazza
  • Discussion board

Communications [2/2]

• Contact Me
  • [email protected]
  • Office hours: Fridays 10:00 AM to 11:00 AM and by appointment
  • Office: CSB456
  • URL: http://www.cs.colostate.edu/~sangmi
• Contact GTAs
  • Paahuni Khandelwal
  • Mohamed Chaabane
  • Office hours (in CSB120 and online office hours TBA)

Research Group Meeting
When: 1:30-2:30pm Fridays
Where: CSB305

Course Structure

• Big Data Technology (Week 1 ~ Week 4)
• GEAR I: Peta-scale Storage Systems (Week 5, 6)
• GEAR II: Machine Learning for Big Data (Week 7, 8)
• GEAR III: Big Graph Analysis (Week 9, 10)
• GEAR IV: Large Scale Recommendation Systems and Social Media (Week 11, 12)
• GEAR V: Algorithmic Techniques for Big Data (Week 13, 14)

Course Structure | Part A: Big Data Technology
• Week 1 ~ Week 4
• Big Data Technology
• Purposes
  • Understand concepts of Big Data computing environment
  • Hands-on experience
• Topics
  • Introduction to Big Data
  • Lambda Model
  • Quick view of MapReduce
  • Introduction to Apache Spark
  • Analytics with Apache Storm

Course Structure | Part B: GEAR Sessions
• What is the GEAR Session?
  • Guided Exploration for Big Data Analytics Research
• Goals
  • Guided learning environment for advanced research topics in Big Data
  • Understanding different aspects of Big Data research with lectures and discussions
• Sessions
  • Session I. Peta-scale Storage Systems
  • Session II. Machine Learning for Big Data
  • Session III. Big Graph Analysis
  • Session IV. Large Scale Recommendation Systems and Social Media
  • Session V. Algorithmic Techniques for Big Data
• Duration: 2 weeks/session
  • Up to 3 lectures
  • 1 student-led research discussion

Course Structure | Part B: GEAR Sessions
• Lectures
  • Introducing fundamental concepts
• Student-led discussions
  • Students present research papers
  • 3 papers per session
  • We will have 5 sessions
• GEAR is a team effort
  • Each team will submit 1 critical review for each GEAR Session
  • Each team will lead 1 discussion session this semester

Course Structure | Part B: GEAR Sessions
• Example session
  • Session 4: Recommendation systems and multi-media data mining
• Lectures
  • Concepts of recommendation systems
  • Latent factor model
• Student-led discussions
  • Recommendation systems used by Airbnb and Alibaba
  • Route and cost estimation model by Uber

Course Component | Programming Assignments
• Programming Assignment 1
  • Implementing a link analysis algorithm over the Wikipedia pages using Apache Spark
  • Due on 2/18 5:00PM
  • The description of PA1 is available now.
• Programming Assignment 2
  • Implementing real-time Twitter stream analysis using Apache Storm
  • Due on 3/10 5:00PM

Course Component | Programming Assignments
• All programming assignments are individual submissions
• Submissions should be made via Canvas
• Late policy for assignment submissions
  • Up to a maximum of 2 days past the deadline
  • A 10% penalty per day will be applied
• Grading will be based on a software demo of the programming assignment in CSB120
• Each assignment will count for 10% of the total score of this course
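
The late policy can be read as a small calculation. The sketch below is one possible interpretation, not the official grading script: it assumes a partial day counts as a full day, that the 10% is taken from the earned score, and that anything more than 2 days late scores zero. The function name `late_adjusted_score` is hypothetical.

```python
import math

MAX_LATE_DAYS = 2          # "up to a maximum of 2 days past the deadline"
PENALTY_PCT_PER_DAY = 10   # "10% penalty per day"


def late_adjusted_score(raw_score: float, hours_late: float) -> float:
    """Score after the late penalty, under the assumptions stated above."""
    if hours_late <= 0:
        return raw_score                      # on time: no penalty
    days_late = math.ceil(hours_late / 24)    # assumption: partial days round up
    if days_late > MAX_LATE_DAYS:
        return 0.0                            # assumption: no longer accepted
    return raw_score * (100 - PENALTY_PCT_PER_DAY * days_late) / 100


if __name__ == "__main__":
    for hours in (0, 5, 30, 60):
        print(hours, late_adjusted_score(90, hours))
```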

Course Component | Quizzes
• 7-8 online quizzes
• 48-hour window (with a 20-minute duration)
• The two lowest scores will be dropped
• Quizzes will count for 15% of the total score of this course
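
As an illustration of the drop-two-lowest rule, the sketch below computes the quiz contribution to the course total. It assumes every quiz is scored out of 100 and weighted equally, which the slide does not specify; `quiz_contribution` is a made-up name.

```python
QUIZ_WEIGHT = 15.0   # quizzes count for 15% of the course total
DROPPED = 2          # the two lowest scores are dropped


def quiz_contribution(scores: list[float]) -> float:
    """Points (out of 15) earned from quizzes, each scored out of 100 (assumed)."""
    kept = sorted(scores)[DROPPED:]   # discard the two lowest scores
    if not kept:
        return 0.0
    return QUIZ_WEIGHT * sum(kept) / (100.0 * len(kept))


if __name__ == "__main__":
    # Seven quizzes; 50 and 60 are dropped, leaving an average of 88.
    print(quiz_contribution([100, 80, 60, 90, 70, 100, 50]))
```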

Course Component | Term Project
• Objectives
  • Students identify their topics for the term project
  • Students provide a methodology to solve their problem
  • Students implement a software solution
  • Students provide an evaluation scheme for their software
• Requirements
  • Team effort
  • Deep learning aspect using TensorFlow

Course Component | Term Project
• Term project grading (40% of this course)
  • Term project planning: 1% (e.g. team members, a tentative title of your project)
  • Proposal document: 5%
  • Final paper: 24%
  • Final demonstration: 2%
  • Final presentation (peer review): 3%
  • Participation: 5%
• Late policy for the deliverable submissions
  • Up to a maximum of 2 days past the deadline
  • A 10% penalty per day will be applied

Course Component | Term Project
• Highlights of the Previous Term Projects
  • Supporting Emergency Response During Natural Disasters with Twitter Data
  • Winning Words in the Supreme Court
  • Rapid, Progressive Sub-Graph Explorations for Interactive Visual Analytics over Large-Scale Graph Datasets (published and won the best paper award in IEEE/ACM International Conference on Big Data Computing, Application, and Technology 2019)
  • Mendel: A Distributed Storage System for Efficient Sequence Alignment and Similarity Searching (published in IEEE IPDPS 2016)
  • Processing Smart Grid Data In Real Time (DEBS grand challenge 2014)

Course Component | Term Project
• Highlights of the Previous Term Projects
  • “Time to Answer” for Questions on stackoverflow.com using Map Reduce
  • Analysis of words for spell checking in search queries using digitized books and articles
  • Efficient Boolean Symmetric Searchable Encryption
  • Big Data I/O Performance Improvement Using Buffered B-Tree Algorithm
  • Node and Metadata Visualization in a Distributed Hash Table
  • Who is Building Wikipedia?

Course Component | Team Efforts
• CS535 requires team efforts
  • Large-scale analysis is usually performed by a team of experts
• How do we build a team?
  • Post your advertisement on the Piazza discussion board TODAY!
  • List
    • Your strengths/backgrounds
    • Interests (topics or even dataset)
• We will have 15 teams this semester
  • Each team must include 1 or 2 on-line students
  • Each team should not exceed 4 members
• Peer-reviewed participation score
  • 8% of the course grade will be the participation score
  • 7% will be the peer-reviewed score at the end of the semester

Course Component | Team Efforts
• How can I be a GOOD team member?
• Evaluation categories
  • Frequency of participation in tasks
  • Task implementation
  • Communication
• Each category will be assessed by your peer teammates (including yourself!)

Note: These images are examples.

Grading | Overview

Category                   Percentage of Final Grade
Programming Assignments    20%
  Assignment 1: 10%
  Assignment 2: 10%
Term Project               40%
  D0: Term project planning: 1%
  D1: Proposal document: 5%
  D2: Final paper: 24%
  D2: Final demonstration: 2%
  D3: Final presentation: 3%
  Participation: 5%
Quizzes                    15%
GEAR Sessions              25% (3% participation included)

8% of these scores will be participation score

7% of these scores will be peer-evaluated team participation score
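
The weights in the grading overview combine into a course total as a weighted sum. The sketch below only restates the percentages listed in the overview; the dictionary keys and the function name `final_percentage` are illustrative, and each category score is assumed to be expressed on a 0 to 100 scale.

```python
# Category weights from the grading overview (percent of the final grade).
WEIGHTS = {
    "programming_assignments": 20,
    "term_project": 40,
    "quizzes": 15,
    "gear_sessions": 25,
}

# Term project breakdown from the same overview; it should add up to 40.
TERM_PROJECT_PARTS = {
    "planning": 1,
    "proposal": 5,
    "final_paper": 24,
    "final_demonstration": 2,
    "final_presentation": 3,
    "participation": 5,
}


def final_percentage(category_scores: dict[str, float]) -> float:
    """Weighted course total; each category score is on a 0-100 scale (assumed)."""
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    example = {
        "programming_assignments": 90,
        "term_project": 80,
        "quizzes": 100,
        "gear_sessions": 70,
    }
    print(final_percentage(example))
```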

Grading | Final Letter Grade

Letter Grade    Total Percentage
A               90.00% and higher
B               80.00 ~ 89.99%
C               70.00 ~ 79.99%
D               60.00 ~ 69.99%
F               Below 60.00%
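
The letter-grade cutoffs amount to a simple threshold lookup. In this sketch the function name `letter_grade` is made up, and it assumes no rounding is applied before the comparison, so a total such as 89.995% stays a B.

```python
# Lower bounds from the letter-grade table, highest first.
CUTOFFS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]


def letter_grade(total_percentage: float) -> str:
    """Map a course total (0-100) to a letter grade, without rounding (assumed)."""
    for lower_bound, letter in CUTOFFS:
        if total_percentage >= lower_bound:
            return letter
    return "F"


if __name__ == "__main__":
    for total in (95.0, 89.99, 70.0, 59.99):
        print(total, letter_grade(total))
```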

Course Component | Course Policy
• No make-up for missed quizzes and exams
  • Except where the student has provided advance written notice to the instructor based on an emergency
  • Supporting paperwork will be requested
• The two lowest quiz scores will be dropped at the end of the semester

Course Component | Course Policy
• No cell phones in the class.
• No laptops in the class.
  • If you need to use a laptop during lectures, please sit in the back row.
  • I will ask you to turn off your laptop if needed.

More importantly,
• Attend the class, ask questions, and discuss
• Check the course web page and Canvas regularly
• Try new technologies and apply them
• Share your experiences with other students in class

Questions?
