Data Science
Fabrice Rossi
CEREMADE, Université Paris Dauphine
2019
Data Science?
the Science of Data!
Data?
Informal definitions
1. Collection of measured and recorded values
2. Set of values of entities (subjects) with respect to variables (attributes)
Data is not
1. information: structured, post-processed, organized data
2. knowledge: high level understanding
Remarks
- data tends to be used as a singular mass noun (as information)
- datum is seldom used
- a data set: a collection of data/datum
Example
    age  job            marital  education  default  balance  housing
1   30   unemployed     married  primary    no       1787     no
2   33   services       married  secondary  no       4789     yes
3   35   management     single   tertiary   no       1350     yes
4   30   management     married  tertiary   no       1476     yes
5   59   blue-collar    married  secondary  no       0        yes
6   35   management     single   tertiary   no       747      no
7   36   self-employed  married  tertiary   no       307      yes
8   39   technician     married  secondary  no       147      yes
9   41   entrepreneur   married  tertiary   no       221      yes
10  43   services       married  primary    no       -88      yes
11  39   services       married  secondary  no       9374     yes
12  43   admin.         married  secondary  no       264      yes
13  36   technician     married  tertiary   no       1109     no
14  20   student        single   secondary  no       502      no
15  31   blue-collar    married  secondary  no       360      yes
16  40   management     married  tertiary   no       194      no
17  56   technician     married  secondary  no       4073     no
18  37   admin.         single   tertiary   no       2317     yes
19  25   blue-collar    single   primary    no       -221     yes
20  31   services       married  secondary  no       132      no
Data Science?
A possible definition
Tools, techniques, systems, processes, etc. that extract information and possibly knowledge from data, using the scientific method
Examples of standard applications
- spam filtering:
  - data: emails sorted into categories (acceptable versus spam)
  - information: a computer program that infers from the content of an email its category
- movie recommendations:
  - data: a movie database (with movie descriptions), user ratings for some movies
  - information: a computer program that infers the rating of an unseen movie for a given user
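As a rough illustration of the spam filtering example, the sketch below learns a tiny naive Bayes classifier from a handful of tagged emails and uses it to infer the category of new text. The toy messages, the function names (`tokenize`, `train`, `classify`) and the choice of naive Bayes itself are assumptions made for illustration; the slides do not prescribe a method.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lowercase words separated by whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per tag from (text, tag) pairs, tag in {"spam", "ok"}."""
    word_counts = {"spam": Counter(), "ok": Counter()}
    doc_counts = Counter()
    for text, tag in examples:
        doc_counts[tag] += 1
        word_counts[tag].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ok"])
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the tag with the highest (log) posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in word_counts:
        score = math.log(doc_counts[tag] / total_docs)
        total_words = sum(word_counts[tag].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace smoothing avoids zero probabilities
            score += math.log(
                (word_counts[tag][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical tagged emails standing in for the "data"
TAGGED_EMAILS = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ok"),
    ("lecture notes and agenda", "ok"),
]
model = train(TAGGED_EMAILS)
print(classify(model, "cheap money"))
print(classify(model, "monday lecture agenda"))
```

Here the tagged emails play the role of the data and the trained `model` plus `classify` play the role of the extracted information.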
From data to knowledge
Data ≠ information
Hierarchy:
1. data
2. information/insights
3. knowledge
SoundHound/Shazam
1. data: sound (digital recording)
2. information (low level): fingerprint of the recording
3. information (high level): metadata of the song associated with the fingerprint
4. knowledge: musical genre, band history, etc.
Data science
Older/other names
- data mining
- knowledge discovery in databases (KDD)
- business intelligence/analytics
- statistics (!)
- big data
- artificial intelligence (AI)
The confusion is damaging!
- big data induces specific problems and specific solutions
- AI is used in data science
Big Data?
Operational definition
Data sets that are too big to be processed by a single computer or by a limited number of computers (say fewer than 10)
Remarks
- a definition by size is meaningless: computer processing power grows in parallel with storage capacity
- solutions adapted to large scale data sets are generally wasteful when applied to standard scale data sets
Specific solutions
Big data cannot be
- stored
- queried
- processed
on a single computer (even a powerful one!)
Specific solutions
- e.g. Google:
  - in 2016, an estimated 2.5 million servers
  - Google File System
  - Bigtable
  - MapReduce
- open source: Apache Hadoop
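To give an idea of the MapReduce programming model mentioned on this slide, here is a minimal single-process sketch of a word count. It only imitates the map, shuffle and reduce phases in plain Python; the function names are hypothetical and this is neither Google's nor Hadoop's actual API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word of one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real deployment each document (or block) would be mapped on a
    # different machine; here everything runs sequentially.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big servers", "data data"])
print(counts)
```

The interest of the model is that the map and reduce calls are independent of each other, so a framework can spread them over many machines.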
Consequences of the confusion
A typical story
1. we want to do predictive modelling but we confuse that with big data
2. we read tutorials and books about big data and find discussions about predictive modelling, reinforcing our confusion
3. we use the dominant open source solution, Hadoop
4. performance is very bad so we add more computers, reinforcing our belief that we do big data magic
5. wash, rinse, repeat
2019 version
Replace
1. big data by AI (and/or deep learning)
2. Hadoop by TensorFlow
Outline
General Introduction
The Data Science value chain
Data science skills and knowledge
Value chain
Value chain concept
- business oriented concept
- process point of view
- activity of an organization:
  - a process that receives inputs and produces outputs in a series of steps
  - each step increases the “value” of the objects it transforms
Data science case
- natural representation in terms of steps
- “value”: information content, operability
Data science steps
A possible breakdown of the data science process
1. Data generation and acquisition
2. Data preprocessing
3. Data storage
4. Data management and querying
5. Data analysis
Recurring examples
Spam filtering
Overall goal: automatic detection of unsolicited emails
Product recommendation
Overall goal: automatic recommendation of products to consumers
Activity tracker
Overall goal: fitness reporting and health related advice
Data generation and acquisition
The input phase
- data collection is the first phase of any data science project
- generation and acquisition are the low level part of this collection phase
Generation (design)
- measurements of entities
- what are the entities?
- what is measured?
Acquisition (engineering)
- how to conduct measurements?
- how to retrieve the measurement results?
Spam filtering
Basic view
- email: text content + tag (acceptable/spam)
- acquisition by the email service
More advanced generation
- leveraging email structure
  - body: main text
  - subject: text
  - sender, recipient: email addresses
  - more technical headers: date, server, authentication aspects, etc.
- usage analysis for web mail
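Leveraging the email structure can be sketched with Python's standard `email` package, which splits a raw message into headers and body. The sample message, the addresses and the `extract_features` helper are hypothetical; a real filter would handle multipart messages and many more headers.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical raw email, as it could be received by the email service
RAW_EMAIL = """\
From: Alice Example <alice@example.org>
To: bob@example.org
Subject: Lecture notes
Date: Mon, 02 Sep 2019 10:15:00 +0200

Please find the notes attached.
"""


def extract_features(raw):
    """Turn one raw email into a record of separately usable fields."""
    msg = message_from_string(raw)
    return {
        "sender": parseaddr(msg["From"])[1],   # keep only the address
        "recipient": parseaddr(msg["To"])[1],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "body": msg.get_payload().strip(),     # single-part message assumed
        "tag": None,                           # acceptable/spam, set later
    }


features = extract_features(RAW_EMAIL)
print(features)
```

Each field can then be measured and stored on its own, instead of treating the email as one block of text.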
Product recommendation
Generation
- shopping cart: association between consumers and products
- consumer profile: as many variables as possible (location, gender, age, etc.), previous shopping activity
- product profile: text description, image, reviews and comments, previous sales
- web site usage
Acquisition
- shopping cart: as part of the buying process
- user profile: during registration, solicited afterwards
- product profile: as part of the inclusion of the product in the merchant offer
Activity tracker
Generation
- personal information: age, gender, weight, etc.
- heart rate
- temperature
- step counts
- position
- etc.
Acquisition
- user profile: during registration
- dual approach: tracker + external device
- on tracker collection (and preprocessing!)
- periodic synchronization with a connected device and through it to a server
Data revolution
Data has never been easier to acquire!
Online services
- e-commerce
- software as a service: email, document editing, etc.
- log generation
Social networks
- massive sharing
- very rich and complex data
Open data
- large sources of data mainly from governmental agencies and scientific projects
- already organized, documented, etc.
Data Science Revolution
[Figure: diagram with nodes "Digital Services", "Data", "Data Science" and "Open data", linked by arrows labelled "produce", "feeds", "creates" and "feeds"]
Permanent connection
Smart phones
- permanent connection
- permanent tracking
- enormous computational resources
Internet of things
- home appliances
- smart home
Virtuous circle of data science
With accurate data come accurate predictions
- data quality as part of the collection
- to benefit from a service based on personal data, your personal data must be accurate!
Examples
- spam filtering
  - manual tagging of misclassified spam
  - manual inspection of the junk folder
- product recommendation
  - actual buying/viewing act
  - reviews and ratings
- activity tracking: usage=collection!
Generation issues
- conceptual level
- practical trade off
  - large scale collection: large scale data!
  - small scale collection: missing important data!
- conceptual trade off
  - low level data (e.g. raw data)
    - easier to measure
    - potential high acquisition costs
    - minimal risk of information loss
  - high level data (e.g. aggregated data)
    - more complex measurement strategies
    - can reduce acquisition costs and noise
    - risk of missing details
Acquisition issues
- engineering level (mostly)
- seldom under the responsibility of a data scientist
- sensors
  - e.g. accelerometer, GPS chip
  - typical difficulties
    - noise and calibration
    - data volume
    - reliability
- distributed acquisition (e.g. via smartphones)
Value analysis
Value creation
- uncollected data is lost!
- from zero to possibly something
Value creation compromise
- however data acquisition is not (always) free!
- another aspect of the trade off between large scale and focused collection
- standard strategy: transfer part of the acquisition cost to the clients, e.g.
  - email tagging, product reviews
  - activity tracker (you have to buy it!)
  - smart phone resources (local processing and connectivity)
GDPR
General Data Protection Regulation
- EU law in application since May 2018
- restricts personal data collection and processing
  - explicit collection and redistribution
  - collect only what is needed
  - data protection (anonymization)
  - explicit consent
  - right to withdraw consent, right of access, right of portability, right to be forgotten
Impacts
- limits personal data collection and processing
- but clarifies and simplifies certain aspects
- long term consequences are unclear
- ongoing very active research on privacy preserving data science
Data preprocessing
Transformations before storage
- data science projects leverage multiple data sources
- data should be integrated
- but also cleaned or compressed
Data integration
- reference formats (for instance UTC for time)
- links between data sources (e.g. mapping between user ids)
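The two integration ideas of this slide (a reference time format and a mapping between user ids) can be sketched as follows. The id mapping, the event fields and the helper names are all hypothetical; real sources rarely come with such a clean correspondence.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping between the user ids of two data sources
SHOP_TO_TRACKER_ID = {"shop-17": "trk-a3", "shop-42": "trk-b9"}


def to_utc(local_iso, utc_offset_hours):
    """Convert a local timestamp with a known UTC offset to UTC."""
    local = datetime.fromisoformat(local_iso).replace(
        tzinfo=timezone(timedelta(hours=utc_offset_hours))
    )
    return local.astimezone(timezone.utc)


def integrate(shop_event):
    """Rewrite a shop event with the reference user id and a UTC time."""
    return {
        "user": SHOP_TO_TRACKER_ID[shop_event["shop_user"]],
        "time_utc": to_utc(
            shop_event["local_time"], shop_event["utc_offset"]
        ).isoformat(),
        "item": shop_event["item"],
    }


record = integrate(
    {
        "shop_user": "shop-17",
        "local_time": "2019-09-02T10:15:00",
        "utc_offset": 2,
        "item": "running shoes",
    }
)
print(record)
```

Once every source is expressed with the same ids and the same time reference, records from different sources can be compared and joined.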
Data compression
Compressed formats
- obvious (and almost mandatory) for media applications
- audio, image and video
- especially relevant for smart phones and smart objects with limited storage and processing capabilities
- extremely important for distributed data collection
Advanced semantic compression
- music signature (e.g. Shazam/SoundHound)
- activity detection (activity trackers)
Data cleaning and general processing
Sensors
- inconsistent value removal (e.g. very noisy GPS positions)
- averaging and other low level processing
Declarative data
- bogus value detection (such as fake phone numbers)
- auto correction
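A minimal sketch of the two sensor-side operations, applied to a made-up heart rate series: readings that jump implausibly far from the previous accepted value are dropped, then a moving average smooths the rest. The threshold, the window size and the function names are assumptions, and the jump rule is deliberately naive (it trusts the first reading).

```python
def remove_inconsistent(values, max_jump):
    """Drop readings that differ by more than max_jump from the last kept one."""
    cleaned = []
    for value in values:
        if not cleaned or abs(value - cleaned[-1]) <= max_jump:
            cleaned.append(value)
    return cleaned


def moving_average(values, window):
    """Average each run of `window` consecutive readings."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical heart rate readings with one obviously bogus value
heart_rate = [60, 62, 61, 180, 63, 65, 64]
cleaned = remove_inconsistent(heart_rate, max_jump=20)
smoothed = moving_average(cleaned, window=3)
print(cleaned)
print(smoothed)
```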
Spam filtering
Data integration
- linking emails to web mail usage
- integration with external verification tools (e.g. DKIM)
- emails shared between users
Compression
- deduplication for shared emails
- reduced representation for deleted emails
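Deduplication of shared emails can be sketched with content hashing: a body received by several users is stored once and each mailbox only keeps a reference to it. The `EmailStore` class is hypothetical and ignores headers, attachments and deletion.

```python
import hashlib


class EmailStore:
    """Toy store keeping one copy of each distinct email body."""

    def __init__(self):
        self.bodies = {}     # digest -> body, stored once
        self.mailboxes = {}  # user -> list of digests

    def deliver(self, user, body):
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self.bodies.setdefault(digest, body)  # no-op if already stored
        self.mailboxes.setdefault(user, []).append(digest)

    def read(self, user):
        return [self.bodies[d] for d in self.mailboxes.get(user, [])]


store = EmailStore()
for user in ("alice", "bob", "carol"):
    store.deliver(user, "Newsletter: September issue")
store.deliver("alice", "Private note")
print(len(store.bodies), "bodies stored for 4 deliveries")
```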
Product recommendation
Data integration
- linking web usage with product and user descriptions
- “equivalent” product detection
  - softcover/hardcover
  - collector editions
  - multiple vendors (e.g. Amazon Marketplace)
Cleaning
- fake reviews
- spam in comments or reviews
- badly written product descriptions from external vendors
Activity tracker
Semantic compression
- activity detection
  - elementary (standing, walking, etc.)
  - advanced sport related
- event detection
- multiple resolution compression
Cleaning
- GPS positions
- heart beat detection
Preprocessing issues
A broad range of methods
- from well known methods
  - standard formats (UTC, WGS 84, etc.)
  - standard compression tools (MP3, JPEG, H.265, etc.)
  - classical signal processing techniques (moving average)
- to sophisticated ones
  - music signature
  - activity detection
  - etc.
Blurry distinction
- advanced preprocessing methods are data science methods!
- recursive structure: complex data collection can be seen as a full data science pipeline
Value analysis
Value creation
- obvious for data cleaning
- obvious for some specific schemes
  - activity detection
    - compressed data
    - higher information content
    - limited disclosure (vendor lock-in!)
  - outlier removal
    - compressed data
    - cleaner data
Compromise
- similar to the acquisition compromise
- user related, e.g. for smart phones (battery versus bandwidth)
Data storage
The easy part!
- the data revolution is a consequence of the data storage revolution
- storage cost evolution (constant dollar analysis, reference 2017)
    Year  Drive        Capacity  Price
    1956  first HDD    5 MB      400 K$
    1995  typical HDD  1 GB      1350 $
    2005  typical HDD  200 GB    250 $
    2010  typical HDD  1 TB      90 $
    2018  typical HDD  4 TB      110 $
- transparent HDD aggregation is readily available
  - e.g. Synology FlashStation FS3017: 11 K$
  - 24 HDD: 96 TB seen as a single hard drive
  - expandable to 200 TB
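The storage cost figures of this slide are easier to compare once converted to dollars per gigabyte. The short computation below does this, assuming decimal units (1 GB = 1000 MB, 1 TB = 1000 GB); the variable and function names are illustrative only.

```python
# (year, capacity in GB, price in dollars), taken from the slide figures,
# assuming decimal units: 1 GB = 1000 MB, 1 TB = 1000 GB
HDD_PRICES = [
    (1956, 0.005, 400_000),
    (1995, 1, 1350),
    (2005, 200, 250),
    (2010, 1000, 90),
    (2018, 4000, 110),
]


def dollars_per_gb(capacity_gb, price):
    return price / capacity_gb


cost = {year: dollars_per_gb(cap, price) for year, cap, price in HDD_PRICES}
ratio_1956_to_2018 = cost[1956] / cost[2018]

for year, value in cost.items():
    print(f"{year}: {value:,.4f} $/GB")
print(f"1956 cost / 2018 cost: {ratio_1956_to_2018:.2e}")
```

With these figures the cost per gigabyte goes from about 80 million dollars in 1956 to under 3 cents in 2018.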
Data storage issues
None!
Big data
- a bit more complicated in the big data context
- distributed storage can be more efficient than centralized storage
- specific solutions
  - mixed “small” computers
  - provide processing power and storage
  - reduce the bottleneck associated with centralized storage
  - see Apache Hadoop (HDFS)
Data management and querying
Using data
- raw storage of data is arguably useless
- data science is based on data manipulation
  - selections: looking at specific items
  - links: finding related objects
  - aggregations: group analysis
RDBMS
- relational database management system
- standard and obvious solution for such needs
- a minimal level of ease with SQL and RDBMS is mandatory for a data scientist
- SQL-like queries are available in many systems (e.g. in Python and R)
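The three kinds of manipulation listed on this slide (selections, links, aggregations) map directly onto SQL. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `customer` and `purchase` tables and their content are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE purchase (customer_id INTEGER REFERENCES customer(id),
                       product TEXT, price REAL);
INSERT INTO customer VALUES (1, 'Ann', 'Paris'), (2, 'Bob', 'Lyon');
INSERT INTO purchase VALUES (1, 'book', 12.0), (1, 'lamp', 30.0),
                            (2, 'book', 12.0);
""")

# Selection: looking at specific items
paris = conn.execute(
    "SELECT name FROM customer WHERE city = 'Paris'"
).fetchall()

# Link: finding related objects through a join
book_buyers = conn.execute("""
    SELECT c.name FROM customer c
    JOIN purchase p ON p.customer_id = c.id
    WHERE p.product = 'book' ORDER BY c.name
""").fetchall()

# Aggregation: group analysis
totals = conn.execute("""
    SELECT c.name, SUM(p.price) FROM customer c
    JOIN purchase p ON p.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

print(paris, book_buyers, totals)
```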
NoSQL?
Definition
- non relational databases
- and also, in some cases, non reliable database systems!
Non relational data
- some data types are not adapted to the relational model
  - graph data (social network): too relational!
  - text data: not relational enough!
- specific database management systems have been designed to handle those data types
- important complement to RDBMS
Non reliable database systems
The (pseudo) CAP theorem
A distributed system cannot be simultaneously (Brewer, circa 2000)
- consistent (a read receives the most recent write as answer)
- available (every read receives an answer)
- partition tolerant (the system operates even when messages between its parts are lost)
NoSQL
- RDBMS are generally ACID: strong form of consistency
- some NoSQL systems achieve high performance by dropping consistency guarantees
- this induces strong risks of e.g. losing data
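What dropping consistency guarantees can mean in practice is sketched below with a toy key-value store whose replica is updated lazily. It is not modelled on any particular NoSQL system; the class and method names are hypothetical and the "crash" is simulated by hand.

```python
class LazyReplicatedStore:
    """Toy store: writes go to the primary, the replica is updated later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet propagated to the replica

    def write(self, key, value):
        # Acknowledged immediately, before the replica has seen the write
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)

    def replicate(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def crash_primary(self):
        """Simulate losing the primary: unpropagated writes vanish."""
        lost = len(self.pending)
        self.primary = dict(self.replica)
        self.pending.clear()
        return lost


store = LazyReplicatedStore()
store.write("balance", 100)
store.replicate()
store.write("balance", 40)
stale = store.read_from_replica("balance")  # the read misses the last write
lost_writes = store.crash_primary()         # the acknowledged write is gone
print(stale, lost_writes, store.primary["balance"])
```

The stale read violates consistency as defined above, and the crash shows how an acknowledged write can be lost.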
Examples
Spam filtering
- raw acquisition:
  - emails in e.g. Maildir
  - log files
- post storage: RDBMS, text oriented database
Product recommendation
- base storage in an RDBMS (strong consistency requirements!)
- complementary storage in dedicated databases (text, images, reviews)
Activity tracker
- RDBMS for general data
- possibly specialized databases for e.g. trajectories
Data management and querying issues
Big data
- very active research field
- back to SQL with the NewSQL movement (e.g. Google Spanner)
In practice
- RDBMS as an obvious basic solution
- large scale data should be moved to dedicated databases if needed
- database design and optimization is a complex subject!
Data analysis
Information and knowledge extraction
- data mining, analytics, machine learning, AI (!)
- from exploratory aspects
  - dashboard
  - visualization
  - clustering
- to predictive modelling
  - recommendation
  - targeted advertising
  - filtering
Examples
Spam filtering
- automatic tagging of incoming emails
- supervised learning
Product recommendation
- automatic rating of products (ranking)
- frequent association (unsupervised learning)
Activity tracker
- personal coaching (supervised learning)
- global user base analysis (unsupervised learning)
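The frequent association idea of the product recommendation example can be sketched by counting which pairs of products appear together in shopping carts. The carts, the `frequent_pairs` function and the support threshold are made up; real association rule mining uses more elaborate algorithms and measures.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(carts, min_support):
    """Return product pairs found together in at least min_support carts."""
    pair_counts = Counter()
    for cart in carts:
        # sorted(set(...)) gives each unordered pair a single canonical form
        for pair in combinations(sorted(set(cart)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical shopping carts
CARTS = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["stove", "gas"],
    ["tent", "sleeping bag", "gas"],
]
pairs = frequent_pairs(CARTS, min_support=2)
print(pairs)
```

No tag or rating is needed here, which is why this kind of analysis counts as unsupervised learning.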
Data science skills and knowledge
Computer engineering
- programming
- parallel programming
- distributed programming
- system programming
- database manipulation
Computer science
- computational complexity
- algorithmic thinking
- relational models
- distributed systems
Data science skills and knowledge
Mathematics
- probability
- statistics
- optimization
Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
http://creativecommons.org/licenses/by-sa/4.0/
Version
Last git commit: 2020-01-06
By: Fabrice Rossi
Git hash: f5aa192604fbf4b5727574b1c7135c0706141f96
Changelog
- September 2019: initial version