introduccion_final.pdf
Contents
• Why BIG DATA?
• Characteristics of Big Data
• Big Data Technologies and Tools
• Fundamental Big Data Paradigms
• Data Mining
• Visualization
SLIDE 1
Why BIG DATA?
We are drowning in data but starving for knowledge!
SLIDE 2
Why BIG DATA?
• The model of generating and consuming data has changed
– Old model: a few companies generate data, and everyone else consumes it
– New model: all of us generate data, and all of us consume it
SLIDE 3
Who generates and uses data?
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• They are hindered by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner
SLIDE 4
Evolution
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
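The difference between the first two stages can be shown with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database; the `sales` table and its rows are made up for illustration, and RTAP (analytics over live streams) is not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL)")

# OLTP-style work: small transactions, each touching a few rows.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO sales (store, amount) VALUES (?, ?)", ("north", 20.0))
    conn.execute("INSERT INTO sales (store, amount) VALUES (?, ?)", ("north", 35.0))
    conn.execute("INSERT INTO sales (store, amount) VALUES (?, ?)", ("south", 12.5))

# OLAP-style work: one read-only query that aggregates over many rows.
totals = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```

In practice the analytical query would usually run against a separate data warehouse rather than the transactional database; a single database is used here only to keep the sketch short.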
SLIDE 5
Big Data
• “Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities” (zdnet.com)
• The big deal about big data is the potential for getting more value more quickly from more data, at a lower cost and with greater agility. (Brian Hopkins, zdnet)
SLIDE 6
Big Data
“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
SLIDE 7
Characteristics of Big Data
SLIDE 8
Characteristics of Big Data: Volume
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in collected/generated data
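The two figures on this slide imply an average yearly growth rate. The sketch below works it out, assuming steady compound growth over the 11 years from 2009 to 2020 (an assumption; the slide does not give the shape of the curve). The function name is illustrative.

```python
def implied_annual_growth(start_zb: float, end_zb: float, years: int) -> float:
    """Return the constant yearly growth rate that takes start_zb to end_zb."""
    return (end_zb / start_zb) ** (1 / years) - 1

# Figures from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
growth_factor = 35 / 0.8  # close to the 44x quoted on the slide
rate = implied_annual_growth(0.8, 35, 2020 - 2009)

print(f"total growth: {growth_factor:.1f}x")
print(f"implied yearly growth: {rate:.1%}")
```

Under that assumption the data volume grows by roughly 41% per year, which means it doubles about every two years.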
SLIDE 9
Characteristics of Big Data: Variety
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
To extract knowledge, all these types of data need to be linked together
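As a rough illustration of linking different types of data, the sketch below merges three made-up sources (numeric purchases, free-text reviews, and location coordinates) into one profile per customer. The sources, field names, and the shared customer id are all assumptions for the example.

```python
from collections import defaultdict

# Three hypothetical sources with different shapes, all keyed by customer id.
purchases = [("c1", 19.99), ("c2", 5.00), ("c1", 42.50)]          # numeric records
reviews = {"c1": "Fast delivery, great price", "c3": "Too slow"}   # free text
locations = [("c1", (40.4, -3.7)), ("c2", (41.4, 2.2))]            # coordinates

def link_by_customer(purchases, reviews, locations):
    """Merge the three sources into one profile per customer id."""
    profiles = defaultdict(lambda: {"spent": 0.0, "review": None, "location": None})
    for customer, amount in purchases:
        profiles[customer]["spent"] += amount
    for customer, text in reviews.items():
        profiles[customer]["review"] = text
    for customer, coords in locations:
        profiles[customer]["location"] = coords
    return dict(profiles)

profiles = link_by_customer(purchases, reviews, locations)
print(profiles["c1"])
```

A customer missing from one source (here `c3` has a review but no purchases or location) still gets a profile, with the missing fields left empty.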
SLIDE 10
Characteristics of Big Data: Velocity
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
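The healthcare example can be sketched as an online check that processes each reading as it arrives, without storing the stream. This is a minimal sketch, not a clinical method: the class name, the threshold of 3 standard deviations, the warm-up length, and the readings are all assumptions. The running mean and variance use Welford's one-pass update.

```python
import math

class OnlineMonitor:
    """Flags a reading that deviates strongly from the readings seen so far."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5):
        self.threshold = threshold  # standard deviations that count as abnormal
        self.warmup = warmup        # readings to observe before raising alerts
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, value: float) -> bool:
        """Return True if value is abnormal, then fold it into the statistics."""
        abnormal = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            abnormal = std > 0 and abs(value - self.mean) > self.threshold * std
        # Welford's update: one pass over the data, constant memory.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return abnormal

monitor = OnlineMonitor()
heart_rate = [72, 75, 71, 74, 73, 72, 140, 74]  # made-up readings
alerts = [bpm for bpm in heart_rate if monitor.observe(bpm)]
print(alerts)
```

Note that an abnormal reading is still folded into the statistics, which widens the accepted range afterwards; a real monitor might exclude flagged values or use a sliding window instead.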
SLIDE 11
Big Data: 3V’s
SLIDE 12
Even 4V’s!
SLIDE 13
Big Data Bubble?
© 2013 KDnuggets
Gartner Hype Cycle
Big Data
Gartner VP says Big Data is falling into the Trough of Disillusionment, Jan 2013
SLIDE 14
Challenges
• The bottleneck is in technology
– New architectures, algorithms, and techniques are needed
• Also in technical skills
– Experts in using the new technology and dealing with big data
SLIDE 15
Big Data Technologies and Tools
SLIDE 16
Architecture
SLIDE 18
Fundamental Paradigms
• MapReduce
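The slide names MapReduce without detail, so here is a minimal single-machine sketch of its three steps (map, shuffle, reduce) using the classic word-count task. The function names and input documents are illustrative; a real framework such as Hadoop runs the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data big value", "data mining finds value in data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, both steps can be distributed without the workers sharing state.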
SLIDE 19
Fundamental Paradigms
• CAP Theorem
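The CAP theorem says that when a network partition occurs, a distributed data store has to give up either consistency or availability. The toy model below illustrates that choice with two replicas; it is not a real replication protocol, and the class names and the "CP"/"AP" mode labels are assumptions made for the example.

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoNodeStore:
    """Toy store with two replicas; mode is "CP" or "AP"."""

    def __init__(self, mode: str):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when the replicas cannot reach each other

    def write(self, value) -> bool:
        """Write through replica A; return False if the write is refused."""
        if self.partitioned:
            if self.mode == "CP":
                return False      # stay consistent by giving up availability
            self.a.value = value  # stay available; replica B is now stale
            return True
        self.a.value = self.b.value = value  # normal case: replicate to both
        return True

    def read_from_b(self):
        return self.b.value

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True

cp_accepted = cp.write("v2")  # refused: both replicas still agree on "v1"
ap_accepted = ap.write("v2")  # accepted: A holds "v2", B still returns "v1"
print(cp_accepted, cp.read_from_b(), ap_accepted, ap.read_from_b())
```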
SLIDE 20
Business Intelligence
• Statistics
• Data mining
• Knowledge Discovery in Data (KDD)
• Predictive Analytics
• Business Analytics
• Data Science
• Data Analytics
• …
Same Core Idea:
Finding Useful Patterns in Data
Different Emphasis
SLIDE 21
Data Mining
SLIDE 22
Why?
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
SLIDE 23
Why?
• Data is collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
– In classifying and segmenting data
– In hypothesis formation
SLIDE 24
What is it?
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
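One simple kind of "meaningful pattern" is a pair of items that is often bought together. The sketch below counts co-occurring pairs in a few made-up shopping baskets; the function name, the baskets, and the support threshold are assumptions, and real association-rule miners such as Apriori handle larger itemsets and prune the search.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() makes ("bread", "milk") and ("milk", "bread") the same pair
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
]
patterns = frequent_pairs(baskets, min_support=2)
print(patterns)
```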
SLIDE 25
Origins
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure labels: Machine Learning/Pattern Recognition; Statistics/AI; Data Mining; Database systems]
SLIDE 26
CRISP-DM
• Why should there be a standard process?
– The data mining process must be reliable and repeatable by people with little data mining background.
SLIDE 27
CRISP-DM
• Why should there be a standard process?
– Allows projects to be replicated
– Aid to project planning and management
– Allows the scalability of new algorithms
SLIDE 28
CRoss-Industry Standard Process for Data Mining
“The CRISP-DM Model: The New Blueprint for Data Mining”, Colin Shearer, Journal of Data Warehousing, Volume 5, Number 4, p. 13-22, 2000
SLIDE 29
CRISP-DM
SLIDE 30
CRISP-DM
• Business Understanding:
– Understand the project objectives and requirements; define the data mining problem
• Data Understanding:
– Collect initial data and become familiar with it; identify data quality problems
• Data Preparation:
– Select tables, records, and attributes; transform and clean the data
• Modeling:
– Select and apply modeling techniques; calibrate their parameters
• Evaluation:
– Evaluate how well the model achieves the business objectives and addresses open issues
• Deployment:
– Deploy the resulting model; implement a repeatable data mining process
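Three of the phases above (Data Preparation, Modeling, Evaluation) can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the customer records, the choice of a nearest-centroid classifier, and the train/test split are not part of CRISP-DM, which does not prescribe any particular technique. The business-facing phases and deployment are not shown.

```python
# Hypothetical records: (monthly_spend, visits, churned); None marks a missing value.
raw = [(20.0, 1, 1), (25.0, 2, 1), (None, 3, 0), (80.0, 9, 0),
       (90.0, 8, 0), (15.0, 1, 1), (85.0, 7, 0), (30.0, 2, 1)]

def prepare(records):
    """Data Preparation: drop incomplete records, split features from the label."""
    clean = [r for r in records if None not in r]
    return [(r[:2], r[2]) for r in clean]

def build_model(train):
    """Modeling: nearest-centroid classifier with one centroid per class."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))

    def predict(x):
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
        return min(centroids, key=squared_distance)

    return predict

def evaluate(predict, test):
    """Evaluation: share of held-out records classified correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)

data = prepare(raw)
train, test = data[:5], data[5:]
model = build_model(train)
accuracy = evaluate(model, test)
print(f"records kept: {len(data)}, accuracy on held-out data: {accuracy:.0%}")
```

Keeping each phase in its own function mirrors the point of a standard process: any step can be rerun or replaced without rewriting the others.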
SLIDE 31
CRISP-DM
• Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
• Data Understanding: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
• Data Preparation: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
• Modeling: Select Modeling Technique, Generate Test Design, Build Model, Assess Model
• Evaluation: Evaluate Results, Review Process, Determine Next Steps
• Deployment: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
SLIDE 32
CRISP-DM
• Business Understanding and Data Understanding
SLIDE 33
CRISP-DM
• Knowledge acquisition techniques
Knowledge Acquisition, Representation, and Reasoning
Turban, Aronson, and Liang, Decision Support Systems and Intelligent Systems, 7th Edition, Prentice Hall, 2005
SLIDE 34
DM Tools
• Open Source
– Weka
– Orange
– R-Project
– KNIME
• Commercial
– SPSS
– Clementine
– SAS Miner
– Matlab
– …
SLIDE 35
DM Tools
• Weka 3.6
– Java
– Excellent library, average interface
– http://www.cs.waikato.ac.nz/ml/weka/
• Orange
• R-Project
• KNIME
SLIDE 36
DM Tools
• Weka 3.6
• Orange
– C++ and Python
– Average library, good interface
– http://orange.biolab.si/
• R-Project
• KNIME
SLIDE 37
DM Tools
• Weka 3.6
• Orange
• R-Project
– Similar to Matlab and Maple
– Powerful libraries, average interface; too slow for file access
– http://cran.es.r-project.org/
• KNIME
SLIDE 38
DM Tools
• Weka 3.6
• Orange
• R-Project
• KNIME
– Java
– Includes Weka, Python and R-Project
– Powerful libraries, good interface
– http://www.knime.org/download-desktop
SLIDE 39
DM Tools
• Let’s install KNIME!
SLIDE 40
Visualization
SLIDE 41
Visualization
SLIDE 42