big data & data science

BIG DATA & DATA SCIENCE S.P.S.SENDANAYAKA

Upload: piyumi-sendanayaka

Post on 11-Apr-2017

Category: Education


TRANSCRIPT

BIG DATA & DATA SCIENCE

S.P.S.SENDANAYAKA

05/02/2023 2

What is big data?

Ways big data differs from small data

Characteristics of big data

Structure of big data

Who’s generating big data

Storing big data

Why big data

Big data analytics

Applications of big data analytics

Risks of big data

Tools used in big data

Data science

Data scientist

Skills of data scientist

Challenges in data science

Data processing and big data

Content

Big data is larger in size than conventional data sets.

Big data requires different approaches: new techniques, tools, and architectures.

It is difficult to process using traditional database and software techniques.

It is used to enhance decision making, provide insight and discovery, and support and optimize processes.

What is big data?

1. Goal
Small data: usually gathered for a specific goal.
Big data: may have a goal in mind when it is first started, but things can evolve or take unexpected directions.

2. Location
Small data: usually in one place.
Big data: can be in multiple files on multiple servers, on computers in different geographic locations.

3. Structure & content
Small data: highly structured.
Big data: can be unstructured.

4. Data preparation
Small data: usually prepared by the end user for their own purposes.
Big data: often prepared by one group of people, analyzed by a second group, and then used by a third group.

5. Longevity
Small data: usually kept for a specific amount of time after the project is over.
Big data: kept in perpetuity; the data is going to stay there for a very long time.

6. Measurements
Small data: typically measured with a single protocol.
Big data: may be measured using different protocols.

7. Reproducibility
Small data: can usually be reproduced in its entirety if something goes wrong in the process.
Big data: it may not be possible to start over again if something has gone wrong.

8. Stakes
Small data: if things go wrong, the costs are limited; it is not an enormous problem.
Big data: if things go wrong, the costs can be very high.

9. Introspection
Small data: things tend to be well organized, and individual data points can be identified.
Big data: things can be so complex, with many files and many formats, that individual data points are hard to identify.

10. Analysis
Small data: usually possible to analyze all of the data at once, in a single procedure, from a single computer file.
Big data: the data is so enormous and spread across so many files and servers that it usually cannot be analyzed in a single procedure.

Ways big data differs from small data

Volume: data quantity

Velocity: data speed

Variety: data types

Three characteristics of big data

Structured: most traditional sources, such as data stored in fields in a database

Semi-structured: XML documents

Unstructured: video, photos, audio

Structure of big data
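
The three structure levels above can be illustrated with a small sketch. It uses only the Python standard library; the sample records, field names, and byte values are made up for illustration and do not come from the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every row has the same fixed fields, as in a database table.
# (Sample data is hypothetical.)
csv_text = "id,name,age\n1,Ann,34\n2,Bob,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: XML describes itself with tags, and fields may be absent.
xml_text = (
    "<people>"
    "<person id='1'><name>Ann</name><age>34</age></person>"
    "<person id='2'><name>Bob</name></person>"
    "</people>"
)
root = ET.fromstring(xml_text)
people = [
    {"id": p.get("id"), "name": p.findtext("name"), "age": p.findtext("age")}
    for p in root.findall("person")
]

# Unstructured: raw bytes (photo, audio, video) with no field layout at all.
# These four bytes mimic the start of a typical JPEG file.
blob = bytes([0xFF, 0xD8, 0xFF, 0xE0])

print(rows[1]["age"])    # always present in structured data
print(people[1]["age"])  # missing field: findtext returns None
print(len(blob))         # only the size is known without further decoding
```

The point of the sketch is that structured data can be queried by field name directly, semi-structured data needs code that tolerates missing fields, and unstructured data needs a separate decoding step before any analysis is possible.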

Scientific instruments

Mobile devices

Sensor technology and networks

Social media networks
◦ Facebook
◦ Twitter
◦ YouTube

Who’s generating big data?

Distributed storage

SAN: Storage Area Network

Cloud storage

Scale-out NAS

Object storage

Hadoop

Storing big data

The growth of big data is enabled by:
◦ Increase of storage capacities
◦ Increase of processing power
◦ Availability of data (different data types)

Why big data?

Monitoring and anomaly detection
◦ Monitoring can be very helpful when you know what you're looking for and you need a notification when that thing occurs.
◦ It detects when a specific event occurs.
◦ Anomaly detection, on the other hand, describes a situation in which the user wants to know when something unusual happens.
◦ The user is looking for a notification of unusual activity without necessarily knowing in advance what that activity might be.

Big Data Analytics
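
The difference between monitoring and anomaly detection can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the function names `monitor` and `detect_anomalies`, the sensor readings, the threshold of 30, and the z-score cutoff of 2.0 are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def monitor(readings, threshold):
    """Monitoring: report positions where a known, predefined event occurs
    (here, a reading above a fixed threshold)."""
    return [i for i, x in enumerate(readings) if x > threshold]

def detect_anomalies(readings, z_cutoff=2.0):
    """Anomaly detection: report positions of unusual values without a
    predefined rule, using distance from the mean in standard deviations."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # all readings identical, nothing is unusual
    return [i for i, x in enumerate(readings) if abs(x - mu) / sigma > z_cutoff]

# Hypothetical temperature readings with one spike and one dropout.
readings = [20.1, 19.8, 20.3, 20.0, 35.7, 19.9, 20.2, 4.5, 20.1, 20.0]

print(monitor(readings, threshold=30))  # only the event we asked about
print(detect_anomalies(readings))       # anything far from typical
```

With these readings, the monitor only reports the high spike it was told to look for, while the anomaly detector also flags the unexpectedly low value that nobody defined a rule for.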

Prescriptive analytics
◦ Prescriptive analytics suggests what to do.
◦ Prescriptive analytics can identify optimal solutions, often for the allocation of scarce resources.

Predictive analytics
◦ Predictive analytics forecasts what is likely to occur in the future.
◦ Methods and algorithms: regression analysis, machine learning, and neural networks

Big data analytics
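
As an illustration of predictive analytics, the sketch below fits a straight line by ordinary least squares, the simplest form of the regression analysis listed above, and uses it to forecast the next value. The function name `fit_line` and the monthly sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales observed in months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(slope, intercept, forecast)
```

Real predictive models usually involve many variables and a validation step, but the idea is the same: learn a relationship from past data, then apply it to a point that has not been observed yet.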

Applications of big data analytics

Smarter health care

Homeland security

Traffic control

Manufacturing

Multi-channel sales

Telecom

Trading analytics

Search quality

Loss of data security

Loss of data privacy

High cost

Bad analytics

Bad data

Risks of big data

Hive
◦ Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster.

Hadoop
◦ Hadoop is flexible enough to work with multiple data sources, either aggregating multiple sources of data to do large-scale processing or reading data from a database to run processor-intensive machine learning jobs.

MapReduce
◦ MapReduce is a programming paradigm that allows job execution to scale massively across thousands of servers or clusters of servers.

Pig

WibiData

MongoDB

Amazon S3

Types of tools used in big data
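
The MapReduce paradigm can be sketched with a word count, the usual introductory example. The code below runs in a single Python process and only imitates the three stages (map, shuffle, reduce); it does not use the Hadoop API, and the function names and sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

# Hypothetical input; on a cluster each document could sit on a different server.
documents = ["big data is big", "data science uses data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` depends only on one key, both stages can be spread across many machines, which is where the scalability described above comes from.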

Data science is the discipline that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products.

Data science affects academic and applied research in many domains:
◦ machine translation, speech recognition, robotics, search engines

The goal of data science is to turn data into data products.

Data science

Methods that scale to big data are of particular interest in data science.

Big data solutions are often focused on organizing and preprocessing the data instead of analysis.

Data science uses data preparation, statistics, predictive modeling, and machine learning to investigate problems in various domains such as agriculture, marketing optimization, fraud detection, and risk management.

Data science

Data scientists use their data and analytical ability to:
◦ find and interpret rich data sources
◦ manage large amounts of data despite hardware, software, and bandwidth constraints
◦ merge data sources
◦ ensure consistency of datasets
◦ create visualizations to aid in understanding data

Data scientist

Data Management – Data collection, storage, cleaning, filtering, integration

Large-scale Parallel Data Processing – Parallel computing

Statistics and Machine Learning – Data modeling, inference, prediction, pattern recognition

Skills of data scientist

Preparing data (noisy, incomplete, diverse, streaming …)

Analyzing data (scalable, accurate, real-time, advanced methods, probabilities and uncertainties ...)

Representing analysis results, i.e. the data product (storytelling, interactive, explainable …)

Challenges in data science

Data science process

Data processing and big data

THANK YOU