big data & data science
05/02/2023 2
What is big data?
Ways big data differs from small data
Characteristics of big data
Structure of big data
Who’s generating big data
Storing big data
Why big data
Big data analytics
Applications of big data analytics
Risks of big data
Tools used in big data
Data science
Data scientist
Skills of data scientist
Challenges in data science
Data processing and big data
Content
Big data refers to data sets that are much larger than conventional ones.
It requires different approaches: new techniques, tools, and architectures.
It is difficult to process using traditional database and software techniques.
It is used to enhance decision making, provide insight and discovery, and support and optimize processes.
What is big data?
Way | Small data | Big data
1. Goal | Usually gathered for a specific goal | May have a goal in mind when first started, but things can evolve or take unexpected directions
2. Location | Usually in one place | Can be in multiple files on multiple servers in different geographic locations
3. Structure and content | Highly structured | Can be unstructured
4. Data preparation | Usually prepared by the end user for their own purposes | Often prepared by one group of people, analyzed by a second group, and then used by a third
5. Longevity | Usually kept for a specific amount of time after the project is over | Kept in perpetuity; the data stays in place for a very long time
6. Measurements | Typically measured with a single protocol | May be measured using different protocols
7. Reproducibility | Can usually be reproduced in its entirety if something goes wrong in the process | May not be possible to start over if something goes wrong
8. Stakes | If things go wrong the costs are limited; it is not an enormous problem | Failures can be very costly
9. Introspection | Tends to be well organized; individual data points can be identified | Can be so complex, with many files and many formats, that individual data points are hard to identify
10. Analysis | Usually possible to analyze all of the data at once, in a single procedure, from a single computer file | The data is so large and spread across so many files and servers that it often cannot be analyzed in one pass
Ways big data differs from small data
Volume – data quantity
Velocity – data speed
Variety – data types
Three characteristics of big data
Structured – most traditional sources: data stored in fields in a database
Semi-structured – XML documents
Unstructured – video, photos, audio
Structure of big data
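A minimal sketch of the three structure types from the slide above, using only the Python standard library. The sample records, field names, and byte values are hypothetical and chosen purely for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every record has the same fields in fixed columns.
csv_text = "id,name,city\n1,Ada,London\n2,Lin,Taipei\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: tags describe the data, but records may differ in shape.
xml_text = """
<people>
  <person id="1"><name>Ada</name><city>London</city></person>
  <person id="2"><name>Lin</name><phone>555-0100</phone></person>
</people>
"""
root = ET.fromstring(xml_text)
people = [
    {"id": p.get("id"), **{child.tag: child.text for child in p}}
    for p in root.findall("person")
]

# Unstructured: raw bytes with no field layout to query against.
photo_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # hypothetical image bytes

print(rows[1]["city"])            # a column lookup works for every row
print(sorted(people[1].keys()))   # the second person has no "city" field
print(len(photo_bytes))           # only the size is known without decoding
```

The structured rows can be queried by column name, the XML records have to be inspected one by one because their fields vary, and the raw bytes need a dedicated decoder before any analysis is possible.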
Scientific instruments
Mobile devices
Sensor technology and networks
Social media networks
◦ Facebook
◦ Twitter
◦ YouTube
Who’s generating big data?
Distributed storage
SAN: Storage Area Network
Cloud storage
Scale-out NAS
Object storage
Hadoop
Storing big data
The growth of big data depends on:
◦ Increased storage capacity
◦ Increased processing power
◦ Availability of data (different data types)
Why big data?
Monitoring and anomaly detection
◦ Monitoring is helpful when you know what you are looking for and need a notification when that thing occurs.
◦ It detects when a specific event occurs.
◦ Anomaly detection, on the other hand, covers the situation in which the user wants to know when something unusual happens.
◦ The user wants a notification of unusual activity without necessarily knowing in advance what that activity might be.
Big Data Analytics
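The difference between monitoring and anomaly detection can be sketched as follows. This is an illustration only: the sensor readings, the fixed limit, and the z-score cutoff of 2.0 are assumptions, and a z-score is just one simple way to define "unusual".

```python
from statistics import mean, stdev

def monitor(readings, limit):
    """Monitoring: report positions where a known event (reading > limit) occurs."""
    return [i for i, x in enumerate(readings) if x > limit]

def find_anomalies(readings, z_cutoff=2.0):
    """Anomaly detection: report positions that sit far from the typical value."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # all readings identical, nothing stands out
    return [i for i, x in enumerate(readings) if abs(x - mu) / sigma > z_cutoff]

# Hypothetical sensor readings; the value 3 is an unexpected drop.
readings = [20, 21, 19, 20, 22, 21, 20, 3, 21, 20]

alerts = monitor(readings, limit=21)   # we know in advance what to look for
unusual = find_anomalies(readings)     # we only know that "unusual" matters

print(alerts)
print(unusual)
```

The monitor flags only the event it was told about (a reading above the limit), while the anomaly detector picks up the sudden drop that nobody specified in advance.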
Prescriptive analytics
◦ Prescriptive analytics suggests what to do.
◦ It can identify optimal solutions, often for the allocation of scarce resources.
Predictive analytics
◦ Predictive analytics estimates what is likely to occur in the future.
◦ Methods and algorithms: regression analysis, machine learning, and neural networks
Big data analytics
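As a small illustration of predictive analytics, the sketch below fits a straight line by ordinary least squares (the simplest form of regression analysis) and uses it to forecast the next value. The monthly sales figures and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6

print(slope, intercept, forecast)
```

Real data rarely lies on a perfect line, so in practice the forecast would come with an error estimate; machine learning and neural network methods follow the same fit-then-predict pattern with more flexible models.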
Smarter health care
Homeland security
Traffic control
Manufacturing
Multi-channel sales
Telecom
Trading analytics
Search quality
Applications of big data analytics
Loss of data security
Loss of data privacy
High cost
Bad analytics
Bad data
Risks of big data
Hive
◦ Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster.
Hadoop
◦ Hadoop is flexible enough to work with multiple data sources, either aggregating them for large-scale processing or reading data from a database to run processor-intensive machine learning jobs.
MapReduce
◦ MapReduce is a programming paradigm that lets a job scale across thousands of servers or clusters of servers.
Pig
WibiData
MongoDB
Amazon S3
Types of tools used in big data
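The MapReduce paradigm can be sketched in a single Python process with the classic word-count example. This is not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the two sample documents are assumptions for illustration. On a real cluster, the map and reduce calls would run on many servers in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input; each document could live on a different server.
documents = ["big data needs big tools", "data science uses data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across machines without the pieces needing to coordinate, which is what makes the approach scale.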
Data science is the discipline that draws on computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products.
Data science affects academic and applied research in many domains:
◦ machine translation, speech recognition, robotics, search engines
The goal of data science is to turn data into data products.
Data science
Methods that scale to big data are of particular interest in data science.
Big data solutions are often focused on organizing and preprocessing the data instead of analysis.
Data science uses data preparation, statistics, predictive modeling, and machine learning to investigate problems in various domains such as agriculture, marketing optimization, fraud detection, and risk management.
Data science
Data scientists use their data and analytical ability to:
◦ find and interpret rich data sources
◦ manage large amounts of data despite hardware, software, and bandwidth constraints
◦ merge data sources
◦ ensure consistency of datasets
◦ create visualizations to aid in understanding data
Data scientist
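Two of the tasks above, merging data sources and ensuring consistency, might look like this in miniature. The two sources (`crm`, `billing`), their fields, and the rule that the second source wins on conflict are all assumptions made for the sketch.

```python
def merge_sources(first, second):
    """Join two sources on record id and list fields whose values disagree."""
    merged, conflicts = {}, []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) & set(b)):
            if a[field] != b[field]:
                conflicts.append((record_id, field, a[field], b[field]))
        # Assumption: on conflict the second source overrides the first.
        merged[record_id] = {**a, **b}
    return merged, conflicts

# Hypothetical sources keyed by customer id.
crm = {
    1: {"name": "Ada", "email": "ada@example.com"},
    2: {"name": "Lin", "email": "lin@example.com"},
}
billing = {
    2: {"email": "lin@example.org", "balance": 40},
    3: {"email": "sam@example.com", "balance": 0},
}

merged, conflicts = merge_sources(crm, billing)

print(sorted(merged))
print(conflicts)
```

Returning the conflicts alongside the merged records keeps the inconsistency visible, so a person or a later rule can decide which value is correct instead of one source silently overwriting the other.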
Data Management – Data collection, storage, cleaning, filtering, integration
Large-scale Parallel Data Processing – Parallel computing
Statistics and Machine Learning – Data modeling, inference, prediction, pattern recognition
Skills of data scientist
Preparing data (noisy, incomplete, diverse, streaming ...)
Analyzing data (scalable, accurate, real-time, advanced methods, probabilities and uncertainties ...)
Representing analysis results, i.e. the data product (storytelling, interactive, explainable ...)
Challenges in data science
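One way to cope with streaming and incomplete data is to update statistics one value at a time instead of storing the whole stream. The sketch below uses Welford's online algorithm for the mean and sample variance; the class name `RunningStats`, the sample stream, and the choice to skip missing values are assumptions.

```python
class RunningStats:
    """Mean and sample variance updated one value at a time (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

# Hypothetical stream; None marks an incomplete record.
stream = [4.0, None, 7.0, 13.0, None, 16.0]

stats = RunningStats()
for reading in stream:
    if reading is None:
        continue  # assumption: incomplete records are skipped, not imputed
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the property that matters when the data cannot be held or revisited in full.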