cs535 big data 1/22/2020 sangmi lee pallickaracs535/slides/week1-a-6.pdf · •guided exploration...
TRANSCRIPT
CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 1
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY1. INTRODUCTION TO BIG DATA
Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535
What is Big Data?
CS535 Big Data | Computer Science Department | Colorado State University
Big Data• Things one can do at a large scale that cannot be done at a smaller one
• To extract new insights• Create new forms of values
• Big Data is about analytics of huge quantities of data in order to infer probabilities• Big Data is NOT about trying to “teach” a computer to “think” like humans
• Providing a quantitative dimension it never had before
CS535 Big Data | Computer Science Department | Colorado State University
The three(or four) Vs in Big Data• Volume
• Voluminous• It does not have to be certain number of petabytes or quantity.
• Velocity• How fast the data is coming in?• How fast you need to be able to analyze and utilize it
• Variety • Number of sources or incoming vectors
• Veracity• Can you trust the data itself, source of the data, or the process?• User entry errors, redundancy, corruption of the values• Data cleaning
CS535 Big Data | Computer Science Department | Colorado State University
Who is using Big Data?
CS535 Big Data | Computer Science Department | Colorado State University CS535 Big Data | Computer Science Department | Colorado State University
CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 2
CS535 Big Data | Computer Science Department | Colorado State University
Photo Credit:https://datafloq.com/read/car-manufacturers-are-using-big-data/1204
CS535 Big Data | Computer Science Department | Colorado State University
Connected cars• Single hybrid plug-in car generates up to 25 gigabytes per hour
• Connected cars• $130 billion • Traffic problem, re-routing based on the volume of traffic• Alerts driver when a road conditions are hazardous by automatically activating anti-lock break
• This information is shared by the vehicles that are nearby
CS535 Big Data | Computer Science Department | Colorado State University CS535 Big Data | Computer Science Department | Colorado State University
The Artemis project: Saving “preemies” using Big Data• The Artemis project
• Dr. Carolyn McGregor
• Toronto’s Hospital for Sick Children, University of Ontario Institute of Technology and IBM• Captures and process the patients’ data in real time
• 16 different data streams
• Heart rate, respiration rate, temperature, blood pressure and blood oxygen level• Around 1,260 data points per second
• System detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear
CS535 Big Data | Computer Science Department | Colorado State University CS535 Big Data | Computer Science Department | Colorado State University
CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 3
Look Who’s Peeking at Your Paycheck• Experian’s Income Insight
• Estimates people’s income level• Based on their credit history • Trains the estimation model using selected credit history and tax information from IRS
KAREN BLUMENTHAL, “Look W ho’s Peeking at Your Paycheck”, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles SB10001424052748703672104574654211904801106
CS535 Big Data | Computer Science Department | Colorado State University
Related research areas• Storage systems
• How can we efficiently resolve queries on massive amounts of input data?• The input dataset may be presented in the form of a distributed data stream
• Machine learning• How can we efficiently solve large-scale machine learning problems?• The input data may be massive; stored in a distributed cluster of machines
• Distributed computing• How can we efficiently solve large-scale optimization problems in distributed computing environments?• For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large
scale graphs?
CS535 Big Data | Computer Science Department | Colorado State University
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY2. COURSE INTRODUCTION
Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535
Big Data Lab at Colorado State University• Director: Sangmi Pallickara
• Algorithmic and systems design• Scalable analytics over voluminous datasets
on complex distributed architectures
• Research has been deployed in the following domains• Precision agriculture, atmosphere science,
environmental biology, ecology, civil engineering, bioinformatics, and public health
CS535 Big Data | Computer Science Department | Colorado State University
Big Data Lab at Colorado State University• Awards
• Cochran Family Professorship 2018-2021• IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher) 2018• National Science Foundation CAREER Award 2016
• Funded by • The National Science Foundation• The Advanced Research Projects Agency-Energy (Department of Energy)
• Department of Homeland Security• The Environmental Defense Fund• Google, Amazon, and Hewlett Packard
CS535 Big Data | Computer Science Department | Colorado State University
People
Saptashwa Mitra
Sam Armstrong
Aaron PereiraKartik Khurana Paahuni Khandelwal
Sangmi Pallickara
Undergraduate researchers at CURC
Walid BudgagaDan Rammer
Ryan Becwar
CS535 Big Data | Computer Science Department | Colorado State University
Laksheen Mendis
Kevin Brewwiler
Caleb Carlson
CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 4
Goal of this course • Understanding fundamental concepts in Big Data Analytics
• Computing Systems + Scalable Algorithms and Models
• Learn about existing technologies and how to apply themComputing
systemsAlgorithms and models
Analytics
Predictive models
Graph models
Storage systems and middle ware
Computing frameworks
Specialized modeling tools
CS535 Big Data | Computer Science Department | Colorado State University
Communications [1/2]
• Course Website• http://www.cs.colostate.edu/~cs535
• Announcements: Check the course website at least twice a week.
• Schedule (course materials, readings, assignments)
• Policies
• Canvas
• Assignment submission
• Grades
• Piazza• Discussion board
CS535 Big Data | Computer Science Department | Colorado State University
Communications [2/2]
• Contact Me• [email protected]• Office hour: Friday 10:00AM ~ 11:00AM and by appointment• Office: CSB456• URL: http://www.cs.colostate.edu/~sangmi
• Contact GTAs• Paahuni Khandelwal• Mohamed Chaabane• Office hours (in CSB120 and online office hours TBA)
CS535 Big Data | Computer Science Department | Colorado State University
Research Group MeetingWhen: 1:30-2:30pm FridaysWhere: CSB305
Course Structure
Big Data TechnologyWeek 1 ~ Week 4
GEAR I: Peta-scale Storage SystemsWeek 5, 6
GEAR II: Machine Learning for Big DataWeek 7, 8
GEAR III: Big Graph AnalysisWeek 9, 10
GEAR IV: Large Scale Recommendation Systems and Social Media: Week 11, 12
CS535 Big Data | Computer Science Department | Colorado State University
GEAR V: Algorithmic Techniques for Big Data Week 13, 14
Course Structure | Part A: Big Data Technology• Week 1 ~ Week 4• Big Data Technology
• Purposes• Understand concepts of Big Data computing environment
• Hands-on experience
• Topics• Introduction to Big Data• Lambda Model
• Quick view of MapReduce • Introduction to Apache Spark
• Analytics with Apache Storm
CS535 Big Data | Computer Science Department | Colorado State University
Course Structure | Part B: GEAR Sessions• What is the GEAR Session?
• Guided Exploration for Big Data Analytics Research
• Goals• Guided learning environment for advanced research topics in Big Data• Understanding different aspects of Big Data research with lectures and discussions
• SessionsSession I. Peta-scale Storage SystemsSession II. Machine Learning for Big DataSession III. Big Graph AnalysisSession IV. Large Scale Recommendation Systems and Social MediaSession V. Algorithmic Techniques for Big Data
• Duration: 2 weeks/session• Up to 3 lectures• 1 student-led research discussion
CS535 Big Data | Computer Science Department | Colorado State University
CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 5
Course Structure | Part B: GEAR Sessions• Lectures
• Introducing fundamental concepts
• Student-led discussions• Students present research papers • 3 papers per session• We will have 5 sessions
• GEAR is a team effort• Each team will submit 1 critical review for each GEAR Session• Each team will lead 1 discussion session in this semester
CS535 Big Data | Computer Science Department | Colorado State University
Course Structure | Part B: GEAR Sessions• Example session
• Session 4: Recommendation systems and multi-media data mining
• Lectures• Concepts of the recommendation systems
• Latent factor model
• Student-led discussions• Recommendation systems used by Airbnb and Alibaba• Route and cost estimation model by Uber
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Programing Assignments• Programming Assignment 1
• Implementing link analysis algorithm over the Wikipedia pages using Apache Spark• Due on 2/18 5:00PM• The description of PA1 is available now.
• Programming Assignment 2• Implementing real-time Twitter stream analysis using Apache Storm• Due on 3/10 5:00PM
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Programing Assignments• All of the programming assignments are individual submission• Submission should be via canvas• Late policy for the assignment submissions
• Up to a maximum of 2 day past the deadline. • 10% penalty per day will be applied
• Grading will be based on a software demo of the programming assignment in CSB120• Each assignment will count 10% of total score of this course
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Quizzes• 7-8 online quizzes• 48 hour window (with the 20 minute duration)• Two lowest scores will be eliminated• Quizzes will count 15% of total score of this course
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Term Project• Objectives
• Students identify their topics for the term project• Students provide methodology to solve their problem• Students implement software solution• Students provide evaluation scheme for their software
• Requirements• Team effort• Deep learning aspect using TensorFlow
CS535 Big Data | Computer Science Department | Colorado State University
CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 6
Course Component | Term Project• Term project grading (40% of this course)
• Term project planning: 1% (e.g. team members, a tentative title of your project)• Proposal document: 5%
• Final paper: 24% • Final demonstration: 2%• Final presentation (peer review): 3%• Participation: 4%
• Late policy for the deliverable submissions• Up to a maximum of 2 day past the deadline. • 10% penalty per day will be applied
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Term Project• Highlights of the Previous Term Projects
• Supporting Emergency Response During Natural Disasters with Twitter Data
• Winning Words in the Supreme Court
• Rapid, Progressive Sub-Graph Explorations for Interactive Visual Analytics over Large-Scale Graph
Datasets, (published and won the best paper award in IEEE/ACM International Conference on Big
Data Computing, Application, and Technology 2019)
• Mendel: A Distributed Storage System for Efficient Sequence Alignment and Similarity Searching
(published in IEEE IPDPS 2016)
• Processing Smart Grid Data In Real Time (DEBS grand challenge 2014)
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Term Project• Highlights of the Previous Term Projects
• Time to Answer” for Questions on stackoverflow.com using Map Reduce
• Analysis of words for spell checking in search queries using digitized books and articles
• Efficient Boolean Symmetric Searchable Encryption
• Big Data I/O Performance Improvement Using Buffered B-Tree Algorithm
• Node and Metadata Visualization in a Distributed Hash Table
• Who is Building Wikipedia?
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Team Efforts• CS535 requires team efforts
• Large scale analysis usually performed by a team of experts
• How do we build a team?• Post your advertisement on the Piazza discussion board TODAY!• List
• Your strengths/backgrounds• Interests (topics or even dataset)
• We will have 15 teams this semester• Each team must include 1 or 2 on-line students• Each team should not exceed 4 members
• Peer-reviewed participation score• 8% of course grade will be the participation score• 7% will be the peer-reviewed score at the end of the semester
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Team Efforts• How can I be a GOOD team member?
• Evaluation categories• Frequency of participation in tasks• Task implementation• Communication
• Each category will be assessed by your peer teammates (including yourself!)
CS535 Big Data | Computer Science Department | Colorado State University
Note: These images are examples.
Grading | OverviewCategory Percentage of Final GradeProgramming Assignments 20%
Assignment 1: 10% Assignment 2: 10%
Term Project 40%D0: Term project planning: 1%D1: Proposal document: 5%D2: Final paper: 24% D2: Final demonstration:2%D3: Final presentation: 3%Participation: 5%
Quizzes 15%
GEAR Sessions 25% (3% participation included)
CS535 Big Data | Computer Science Department | Colorado State University
CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 7
Grading | OverviewCategory Percentage of Final GradeProgramming Assignments 20%
Assignment 1: 10% Assignment 2: 10%
Term Project 40%D0: Term project planning: 1%D1: Proposal document: 5%D2: Final paper: 24% D2: Final demonstration:2%D3: Final presentation: 3%Participation: 5%
Quizzes 15%
GEAR Sessions 25% (3% participation included)
CS535 Big Data | Computer Science Department | Colorado State University
8% of these scores will be participation score
7% of these scores will be peer-evaluated team participation score
Grading | Final Letter Grade
Letter Grade Total PercentageA 90.00 % and higherB 80.00 ~ 89.99 %C 70.00 ~ 79.99 %D 60.00 ~ 69.99 %F Below 60.00 %
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Course Policy• No make-up for missed quizzes and exams
• Except for the case where student provided an advance written notice to the instructor based on an emergency• Supporting paper works will be requested
• Two lowest quiz scores will be eliminated at the end of semester
CS535 Big Data | Computer Science Department | Colorado State University
Course Component | Course Policy• No Cell-phones in the class.
• No Laptops in the class.
• If you need to use a laptop during lectures, please sit in the back row.
• I will ask you to turn off your laptop if needed.
CS535 Big Data | Computer Science Department | Colorado State University
More importantly,• Attend the class, ask questions, and discuss• Check the course web page and canvas regularly • Try new technologies and apply them• Share your experiences with other students in class
CS535 Big Data | Computer Science Department | Colorado State University
Questions?
CS535 Big Data | Computer Science Department | Colorado State University