Transcript
Page 1: CS535 Big Data 1/22/2020 Sangmi Lee Pallickaracs535/slides/week1-A-6.pdf · •Guided Exploration for Big Data Analytics Research •Goals •Guided learning environment for advanced

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 1

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY1. INTRODUCTION TO BIG DATA

Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535

What is Big Data?

CS535 Big Data | Computer Science Department | Colorado State University

Big Data• Things one can do at a large scale that cannot be done at a smaller one

• To extract new insights• Create new forms of values

• Big Data is about analytics of huge quantities of data in order to infer probabilities• Big Data is NOT about trying to “teach” a computer to “think” like humans

• Providing a quantitative dimension it never had before

CS535 Big Data | Computer Science Department | Colorado State University

The three(or four) Vs in Big Data• Volume

• Voluminous• It does not have to be certain number of petabytes or quantity.

• Velocity• How fast the data is coming in?• How fast you need to be able to analyze and utilize it

• Variety • Number of sources or incoming vectors

• Veracity• Can you trust the data itself, source of the data, or the process?• User entry errors, redundancy, corruption of the values• Data cleaning

CS535 Big Data | Computer Science Department | Colorado State University

Who is using Big Data?

CS535 Big Data | Computer Science Department | Colorado State University CS535 Big Data | Computer Science Department | Colorado State University

Page 2: CS535 Big Data 1/22/2020 Sangmi Lee Pallickaracs535/slides/week1-A-6.pdf · •Guided Exploration for Big Data Analytics Research •Goals •Guided learning environment for advanced

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 2

CS535 Big Data | Computer Science Department | Colorado State University

Photo Credit:https://datafloq.com/read/car-manufacturers-are-using-big-data/1204

CS535 Big Data | Computer Science Department | Colorado State University

Connected cars• Single hybrid plug-in car generates up to 25 gigabytes per hour

• Connected cars• $130 billion • Traffic problem, re-routing based on the volume of traffic• Alerts driver when a road conditions are hazardous by automatically activating anti-lock break

• This information is shared by the vehicles that are nearby

CS535 Big Data | Computer Science Department | Colorado State University CS535 Big Data | Computer Science Department | Colorado State University

The Artemis project: Saving “preemies” using Big Data• The Artemis project

• Dr. Carolyn McGregor

• Toronto’s Hospital for Sick Children, University of Ontario Institute of Technology and IBM• Captures and process the patients’ data in real time

• 16 different data streams

• Heart rate, respiration rate, temperature, blood pressure and blood oxygen level• Around 1,260 data points per second

• System detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear

CS535 Big Data | Computer Science Department | Colorado State University CS535 Big Data | Computer Science Department | Colorado State University

Page 3: CS535 Big Data 1/22/2020 Sangmi Lee Pallickaracs535/slides/week1-A-6.pdf · •Guided Exploration for Big Data Analytics Research •Goals •Guided learning environment for advanced

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 3

Look Who’s Peeking at Your Paycheck• Experian’s Income Insight

• Estimates people’s income level• Based on their credit history • Trains the estimation model using selected credit history and tax information from IRS

KAREN BLUMENTHAL, “Look W ho’s Peeking at Your Paycheck”, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles SB10001424052748703672104574654211904801106

CS535 Big Data | Computer Science Department | Colorado State University

Related research areas• Storage systems

• How can we efficiently resolve queries on massive amounts of input data?• The input dataset may be presented in the form of a distributed data stream

• Machine learning• How can we efficiently solve large-scale machine learning problems?• The input data may be massive; stored in a distributed cluster of machines

• Distributed computing• How can we efficiently solve large-scale optimization problems in distributed computing environments?• For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large

scale graphs?

CS535 Big Data | Computer Science Department | Colorado State University

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY2. COURSE INTRODUCTION

Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535

Big Data Lab at Colorado State University• Director: Sangmi Pallickara

• Algorithmic and systems design• Scalable analytics over voluminous datasets

on complex distributed architectures

• Research has been deployed in the following domains• Precision agriculture, atmosphere science,

environmental biology, ecology, civil engineering, bioinformatics, and public health

CS535 Big Data | Computer Science Department | Colorado State University

Big Data Lab at Colorado State University• Awards

• Cochran Family Professorship 2018-2021• IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher) 2018• National Science Foundation CAREER Award 2016

• Funded by • The National Science Foundation• The Advanced Research Projects Agency-Energy (Department of Energy)

• Department of Homeland Security• The Environmental Defense Fund• Google, Amazon, and Hewlett Packard

CS535 Big Data | Computer Science Department | Colorado State University

People

Saptashwa Mitra

Sam Armstrong

Aaron PereiraKartik Khurana Paahuni Khandelwal

Sangmi Pallickara

Undergraduate researchers at CURC

Walid BudgagaDan Rammer

Ryan Becwar

CS535 Big Data | Computer Science Department | Colorado State University

Laksheen Mendis

Kevin Brewwiler

Caleb Carlson

Page 4: CS535 Big Data 1/22/2020 Sangmi Lee Pallickaracs535/slides/week1-A-6.pdf · •Guided Exploration for Big Data Analytics Research •Goals •Guided learning environment for advanced

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 4

Goal of this course • Understanding fundamental concepts in Big Data Analytics

• Computing Systems + Scalable Algorithms and Models

• Learn about existing technologies and how to apply themComputing

systemsAlgorithms and models

Analytics

Predictive models

Graph models

Storage systems and middle ware

Computing frameworks

Specialized modeling tools

CS535 Big Data | Computer Science Department | Colorado State University

Communications [1/2]

• Course Website• http://www.cs.colostate.edu/~cs535

• Announcements: Check the course website at least twice a week.

• Schedule (course materials, readings, assignments)

• Policies

• Canvas

• Assignment submission

• Grades

• Piazza• Discussion board

CS535 Big Data | Computer Science Department | Colorado State University

Communications [2/2]

• Contact Me• [email protected]• Office hour: Friday 10:00AM ~ 11:00AM and by appointment• Office: CSB456• URL: http://www.cs.colostate.edu/~sangmi

• Contact GTAs• Paahuni Khandelwal• Mohamed Chaabane• Office hours (in CSB120 and online office hours TBA)

CS535 Big Data | Computer Science Department | Colorado State University

Research Group MeetingWhen: 1:30-2:30pm FridaysWhere: CSB305

Course Structure

Big Data TechnologyWeek 1 ~ Week 4

GEAR I: Peta-scale Storage SystemsWeek 5, 6

GEAR II: Machine Learning for Big DataWeek 7, 8

GEAR III: Big Graph AnalysisWeek 9, 10

GEAR IV: Large Scale Recommendation Systems and Social Media: Week 11, 12

CS535 Big Data | Computer Science Department | Colorado State University

GEAR V: Algorithmic Techniques for Big Data Week 13, 14

Course Structure | Part A: Big Data Technology• Week 1 ~ Week 4• Big Data Technology

• Purposes• Understand concepts of Big Data computing environment

• Hands-on experience

• Topics• Introduction to Big Data• Lambda Model

• Quick view of MapReduce • Introduction to Apache Spark

• Analytics with Apache Storm

CS535 Big Data | Computer Science Department | Colorado State University

Course Structure | Part B: GEAR Sessions• What is the GEAR Session?

• Guided Exploration for Big Data Analytics Research

• Goals• Guided learning environment for advanced research topics in Big Data• Understanding different aspects of Big Data research with lectures and discussions

• SessionsSession I. Peta-scale Storage SystemsSession II. Machine Learning for Big DataSession III. Big Graph AnalysisSession IV. Large Scale Recommendation Systems and Social MediaSession V. Algorithmic Techniques for Big Data

• Duration: 2 weeks/session• Up to 3 lectures• 1 student-led research discussion

CS535 Big Data | Computer Science Department | Colorado State University

Page 5: CS535 Big Data 1/22/2020 Sangmi Lee Pallickaracs535/slides/week1-A-6.pdf · •Guided Exploration for Big Data Analytics Research •Goals •Guided learning environment for advanced

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 5

Course Structure | Part B: GEAR Sessions• Lectures

• Introducing fundamental concepts

• Student-led discussions• Students present research papers • 3 papers per session• We will have 5 sessions

• GEAR is a team effort• Each team will submit 1 critical review for each GEAR Session• Each team will lead 1 discussion session in this semester

CS535 Big Data | Computer Science Department | Colorado State University

Course Structure | Part B: GEAR Sessions• Example session

• Session 4: Recommendation systems and multi-media data mining

• Lectures• Concepts of the recommendation systems

• Latent factor model

• Student-led discussions• Recommendation systems used by Airbnb and Alibaba• Route and cost estimation model by Uber

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Programing Assignments• Programming Assignment 1

• Implementing link analysis algorithm over the Wikipedia pages using Apache Spark• Due on 2/18 5:00PM• The description of PA1 is available now.

• Programming Assignment 2• Implementing real-time Twitter stream analysis using Apache Storm• Due on 3/10 5:00PM

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Programing Assignments• All of the programming assignments are individual submission• Submission should be via canvas• Late policy for the assignment submissions

• Up to a maximum of 2 day past the deadline. • 10% penalty per day will be applied

• Grading will be based on a software demo of the programming assignment in CSB120• Each assignment will count 10% of total score of this course

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Quizzes• 7-8 online quizzes• 48 hour window (with the 20 minute duration)• Two lowest scores will be eliminated• Quizzes will count 15% of total score of this course

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Term Project• Objectives

• Students identify their topics for the term project• Students provide methodology to solve their problem• Students implement software solution• Students provide evaluation scheme for their software

• Requirements• Team effort• Deep learning aspect using TensorFlow

CS535 Big Data | Computer Science Department | Colorado State University

Page 6: CS535 Big Data 1/22/2020 Sangmi Lee Pallickaracs535/slides/week1-A-6.pdf · •Guided Exploration for Big Data Analytics Research •Goals •Guided learning environment for advanced

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 6

Course Component | Term Project• Term project grading (40% of this course)

• Term project planning: 1% (e.g. team members, a tentative title of your project)• Proposal document: 5%

• Final paper: 24% • Final demonstration: 2%• Final presentation (peer review): 3%• Participation: 4%

• Late policy for the deliverable submissions• Up to a maximum of 2 day past the deadline. • 10% penalty per day will be applied

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Term Project• Highlights of the Previous Term Projects

• Supporting Emergency Response During Natural Disasters with Twitter Data

• Winning Words in the Supreme Court

• Rapid, Progressive Sub-Graph Explorations for Interactive Visual Analytics over Large-Scale Graph

Datasets, (published and won the best paper award in IEEE/ACM International Conference on Big

Data Computing, Application, and Technology 2019)

• Mendel: A Distributed Storage System for Efficient Sequence Alignment and Similarity Searching

(published in IEEE IPDPS 2016)

• Processing Smart Grid Data In Real Time (DEBS grand challenge 2014)

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Term Project• Highlights of the Previous Term Projects

• Time to Answer” for Questions on stackoverflow.com using Map Reduce

• Analysis of words for spell checking in search queries using digitized books and articles

• Efficient Boolean Symmetric Searchable Encryption

• Big Data I/O Performance Improvement Using Buffered B-Tree Algorithm

• Node and Metadata Visualization in a Distributed Hash Table

• Who is Building Wikipedia?

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Team Efforts• CS535 requires team efforts

• Large scale analysis usually performed by a team of experts

• How do we build a team?• Post your advertisement on the Piazza discussion board TODAY!• List

• Your strengths/backgrounds• Interests (topics or even dataset)

• We will have 15 teams this semester• Each team must include 1 or 2 on-line students• Each team should not exceed 4 members

• Peer-reviewed participation score• 8% of course grade will be the participation score• 7% will be the peer-reviewed score at the end of the semester

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Team Efforts• How can I be a GOOD team member?

• Evaluation categories• Frequency of participation in tasks• Task implementation• Communication

• Each category will be assessed by your peer teammates (including yourself!)

CS535 Big Data | Computer Science Department | Colorado State University

Note: These images are examples.

Grading | OverviewCategory Percentage of Final GradeProgramming Assignments 20%

Assignment 1: 10% Assignment 2: 10%

Term Project 40%D0: Term project planning: 1%D1: Proposal document: 5%D2: Final paper: 24% D2: Final demonstration:2%D3: Final presentation: 3%Participation: 5%

Quizzes 15%

GEAR Sessions 25% (3% participation included)

CS535 Big Data | Computer Science Department | Colorado State University

Page 7: CS535 Big Data 1/22/2020 Sangmi Lee Pallickaracs535/slides/week1-A-6.pdf · •Guided Exploration for Big Data Analytics Research •Goals •Guided learning environment for advanced

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 7

Grading | OverviewCategory Percentage of Final GradeProgramming Assignments 20%

Assignment 1: 10% Assignment 2: 10%

Term Project 40%D0: Term project planning: 1%D1: Proposal document: 5%D2: Final paper: 24% D2: Final demonstration:2%D3: Final presentation: 3%Participation: 5%

Quizzes 15%

GEAR Sessions 25% (3% participation included)

CS535 Big Data | Computer Science Department | Colorado State University

8% of these scores will be participation score

7% of these scores will be peer-evaluated team participation score

Grading | Final Letter Grade

Letter Grade Total PercentageA 90.00 % and higherB 80.00 ~ 89.99 %C 70.00 ~ 79.99 %D 60.00 ~ 69.99 %F Below 60.00 %

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Course Policy• No make-up for missed quizzes and exams

• Except for the case where student provided an advance written notice to the instructor based on an emergency• Supporting paper works will be requested

• Two lowest quiz scores will be eliminated at the end of semester

CS535 Big Data | Computer Science Department | Colorado State University

Course Component | Course Policy• No Cell-phones in the class.

• No Laptops in the class.

• If you need to use a laptop during lectures, please sit in the back row.

• I will ask you to turn off your laptop if needed.

CS535 Big Data | Computer Science Department | Colorado State University

More importantly,• Attend the class, ask questions, and discuss• Check the course web page and canvas regularly • Try new technologies and apply them• Share your experiences with other students in class

CS535 Big Data | Computer Science Department | Colorado State University

Questions?

CS535 Big Data | Computer Science Department | Colorado State University


Top Related