data science with...
TRANSCRIPT
Data Science with PostgreSQL
Data Science with PostgreSQL
Balázs Bárány
Data Scientist
pgconf.eu 2015
Data Science with PostgreSQL
Contents
Introduction � What is Data Science?Process model
Tools and methods of Data Scientists
Data Science with PostgreSQLBusiness & data understandingPreprocessingModeling
Evaluation
Deployment
Summary
Data Science with PostgreSQL
Introduction � What is Data Science?
Sexiest job of the 21st century
I According to Google, LinkedIn, ...
I Who is a Data Scientist?
Data Science with PostgreSQL
Introduction � What is Data Science?
Sexiest job of the 21st century
I According to Google, LinkedIn, ...
I Who is a Data Scientist?
Data Science with PostgreSQL
Introduction � What is Data Science?
Data Science Venn Diagram
(c) Drew Conway, 2010. CC-BY-NC
Data Science with PostgreSQL
Introduction � What is Data Science?
Tasks of data scientists
I Get data from various sources
I Big data?
I Mash up & format for analysis
I Analyze & visualize
I Predict & prescribe
I Operationalize
Data Science with PostgreSQL
Introduction � What is Data Science?
Tasks of data scientists
I Get data from various sources
I Big data?
I Mash up & format for analysis
I Analyze & visualize
I Predict & prescribe
I Operationalize
Data Science with PostgreSQL
Introduction � What is Data Science?
Tasks of data scientists
I Get data from various sources
I Big data?
I Mash up & format for analysis
I Analyze & visualize
I Predict & prescribe
I Operationalize
Data Science with PostgreSQL
Introduction � What is Data Science?
Tasks of data scientists
I Get data from various sources
I Big data?
I Mash up & format for analysis
I Analyze & visualize
I Predict & prescribe
I Operationalize
Data Science with PostgreSQL
Introduction � What is Data Science?
Tasks of data scientists
I Get data from various sources
I Big data?
I Mash up & format for analysis
I Analyze & visualize
I Predict & prescribe
I Operationalize
Data Science with PostgreSQL
Introduction � What is Data Science?
Process model
The Data Mining process
Cross Industry Standard Process for Data Mining (Kenneth Jensen/Wikimedia Commons)
Data Science with PostgreSQL
Tools and methods of Data Scientists
Tools and methods
Tools and methods
Data Science with PostgreSQL
Tools and methods of Data Scientists
Scripting and programming
I R
I Python with extensions
I Octave/Matlab, other mathematic languages
I Hadoop and Big Data programming libraries (mostly Java)
I Cloud services
Data Science with PostgreSQL
Tools and methods of Data Scientists
Integrated GUI tools
I (partly) Open Source: RapidMiner, KNIME, Orange
I Data Warehouse tools extended for analytics: Pentaho, Talend
I Many commercial tools, e. g. SAS, IBM SPSS
I Hadoop-related newcomers: e. g. Datameer
Data Science with PostgreSQL
Tools and methods of Data Scientists
Data Infrastructure
I Databases and data stores
I Relational, NoSQLI Hadoop clustersI In-memory
I Data streams
I Free-form data: text, images, video, audio, ...
I Web APIs
I Open Data
Data Science with PostgreSQL
Tools and methods of Data Scientists
Data acquisition and preprocessing
I Data ingestion in raw format
I Joining, aggregating, �ltering, calculating, ...
I Data cleansing
I Missing valuesI Abnormal values
I Result: data set suitable for analytics
Data Science with PostgreSQL
Tools and methods of Data Scientists
Data acquisition and preprocessing
I Data ingestion in raw format
I Joining, aggregating, �ltering, calculating, ...
I Data cleansing
I Missing valuesI Abnormal values
I Result: data set suitable for analytics
Data Science with PostgreSQL
Tools and methods of Data Scientists
Data acquisition and preprocessing
I Data ingestion in raw format
I Joining, aggregating, �ltering, calculating, ...
I Data cleansing
I Missing valuesI Abnormal values
I Result: data set suitable for analytics
Data Science with PostgreSQL
Tools and methods of Data Scientists
Data acquisition and preprocessing
I Data ingestion in raw format
I Joining, aggregating, �ltering, calculating, ...
I Data cleansing
I Missing valuesI Abnormal values
I Result: data set suitable for analytics
Data Science with PostgreSQL
Tools and methods of Data Scientists
Predictive Modeling
I Supervised and unsupervised methods
I Target variable known or not
I Classi�cation (supervised): Prediction of a class or category
I Regression (supervised): Prediction of numeric value
I Clustering (unsupervised): Automatic �grouping� of data
I Association analysis, outlier detection, time series prediction,...
Data Science with PostgreSQL
Tools and methods of Data Scientists
Predictive Modeling
I Supervised and unsupervised methods
I Target variable known or not
I Classi�cation (supervised): Prediction of a class or category
I Regression (supervised): Prediction of numeric value
I Clustering (unsupervised): Automatic �grouping� of data
I Association analysis, outlier detection, time series prediction,...
Data Science with PostgreSQL
Tools and methods of Data Scientists
Predictive Modeling
I Supervised and unsupervised methods
I Target variable known or not
I Classi�cation (supervised): Prediction of a class or category
I Regression (supervised): Prediction of numeric value
I Clustering (unsupervised): Automatic �grouping� of data
I Association analysis, outlier detection, time series prediction,...
Data Science with PostgreSQL
Tools and methods of Data Scientists
Deployment and operationalization
I Model application to new data => prediction + con�dence
I What to do with predictions?
I Store in ERP or CRM
I Tell someone (email, popup)
I Add a label (e. g. mark email as spam)
I Interrupt �nancial transaction => prescription
I Order supplies => prescription
I ...
Data Science with PostgreSQL
Tools and methods of Data Scientists
Deployment and operationalization
I Model application to new data => prediction + con�dence
I What to do with predictions?
I Store in ERP or CRM
I Tell someone (email, popup)
I Add a label (e. g. mark email as spam)
I Interrupt �nancial transaction => prescription
I Order supplies => prescription
I ...
Data Science with PostgreSQL
Tools and methods of Data Scientists
Deployment and operationalization
I Model application to new data => prediction + con�dence
I What to do with predictions?
I Store in ERP or CRM
I Tell someone (email, popup)
I Add a label (e. g. mark email as spam)
I Interrupt �nancial transaction => prescription
I Order supplies => prescription
I ...
Data Science with PostgreSQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Doing Data Science with PostgreSQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Caveats
I This stu� is not easy
I Must be root and postgres
I Maintain your PostgreSQL yourselfI Able to compile stu�
I You should ask ;-)
I your bossI co-workersI customer
Data Science with PostgreSQL
Data Science with PostgreSQL
Caveats
I This stu� is not easy
I Must be root and postgres
I Maintain your PostgreSQL yourselfI Able to compile stu�
I You should ask ;-)
I your bossI co-workersI customer
Data Science with PostgreSQL
Data Science with PostgreSQL
Caveats
I This stu� is not easy
I Must be root and postgres
I Maintain your PostgreSQL yourselfI Able to compile stu�
I You should ask ;-)
I your bossI co-workersI customer
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Business understanding
I What is the purpose of the business?
I What are existing processes?
I Drivers of business success
I Project goals and challenges
I Availability of data and resources
I Success criteria
I Not a technical activity, PostgreSQL can't help much
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Business understanding
I What is the purpose of the business?
I What are existing processes?
I Drivers of business success
I Project goals and challenges
I Availability of data and resources
I Success criteria
I Not a technical activity, PostgreSQL can't help much
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Business understanding
I What is the purpose of the business?
I What are existing processes?
I Drivers of business success
I Project goals and challenges
I Availability of data and resources
I Success criteria
I Not a technical activity, PostgreSQL can't help much
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding
I Existing data
I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)
I Connecting separate data sources
I Simple or complex IDs
I Data size
I Too smallI Too big
I Suitability for predictive modeling
I Target variable?I Attribute types
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding
I Existing data
I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)
I Connecting separate data sources
I Simple or complex IDs
I Data size
I Too smallI Too big
I Suitability for predictive modeling
I Target variable?I Attribute types
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding
I Existing data
I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)
I Connecting separate data sources
I Simple or complex IDs
I Data size
I Too smallI Too big
I Suitability for predictive modeling
I Target variable?I Attribute types
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding
I Existing data
I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)
I Connecting separate data sources
I Simple or complex IDs
I Data size
I Too smallI Too big
I Suitability for predictive modeling
I Target variable?I Attribute types
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding with PostgreSQL
I Get data into PostgreSQL
I Classical import processI Foreign Data Wrappers
I Analyze data distribution
I Group by and aggregate
I Count, Count Distinct, Min, Max
I Count NULLsI Search for missing links (incomplete foreign keys)
I Analyze �surprizes�
I Impossible valuesI Missing values in �required� �elds
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding with PostgreSQL
I Get data into PostgreSQL
I Classical import processI Foreign Data Wrappers
I Analyze data distribution
I Group by and aggregate
I Count, Count Distinct, Min, Max
I Count NULLsI Search for missing links (incomplete foreign keys)
I Analyze �surprizes�
I Impossible valuesI Missing values in �required� �elds
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding with PostgreSQL
I Get data into PostgreSQL
I Classical import processI Foreign Data Wrappers
I Analyze data distribution
I Group by and aggregate
I Count, Count Distinct, Min, Max
I Count NULLsI Search for missing links (incomplete foreign keys)
I Analyze �surprizes�
I Impossible valuesI Missing values in �required� �elds
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding with PostgreSQL � summary
I Good SQL knowledge required
I Tedious manual process
I repetitiveI not suitable for large number of attributes
I No built-in visualization
I Or maybe...
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding with PostgreSQL � summary
I Good SQL knowledge required
I Tedious manual process
I repetitiveI not suitable for large number of attributes
I No built-in visualization
I Or maybe...
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
SQL barchart output
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Bar chart from GUI tool
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Boxplot output
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding wrap up
I DBMS not built for this
I It can support more specialized tools
I Introduction: R
I �A free software environment for statistical computing andgraphics�
I Available in PostgreSQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Data understanding wrap up
I DBMS not built for this
I It can support more specialized tools
I Introduction: R
I �A free software environment for statistical computing andgraphics�
I Available in PostgreSQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
PL/R: A statistical language for PostgreSQL
I R as a standalone language
I Mathematical and statistical methodsI Powerful visualization functionsI Classical, modern and bleeding edge modelingI Arrays and data frames are central data typesI Operates only in memory
I PL/R: R as a loadable procedural language for PostgreSQL
I First released in 2003 by Joe ConwayI License: GPL
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
PL/R: A statistical language for PostgreSQL
I R as a standalone language
I Mathematical and statistical methodsI Powerful visualization functionsI Classical, modern and bleeding edge modelingI Arrays and data frames are central data typesI Operates only in memory
I PL/R: R as a loadable procedural language for PostgreSQL
I First released in 2003 by Joe ConwayI License: GPL
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
R usage outside of PostgreSQL
I Development environments
I RStudio (AGPL or commercial, local & web)I RKWard, Cantor (KDE projects)I StatET (Eclipse)
I Frontends
I R CommanderI DeducerI Rattle
I Web framework: Shiny (AGPL or commercial)
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Working with R in PostgreSQL
I Install functions in the database
Example
select install_rcmd('
myfunction <-function(x)
{print(x)}
');
I Install without function body
Example
CREATE FUNCTION rnorm
(n integer, mean double precision, sd double precision)
RETURNS double precision[]
AS �
LANGUAGE 'plr';
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Using R in PostgreSQL for data understanding
I Advanced visualization
I Data distributions
I Advanced statistics
I Execution in the database
I Clumsy, but direct data access
I Execution outside
I Simple and interactive, but data transfer
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Using R in PostgreSQL for data understanding
I Advanced visualization
I Data distributions
I Advanced statistics
I Execution in the database
I Clumsy, but direct data access
I Execution outside
I Simple and interactive, but data transfer
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding
Using R in PostgreSQL for data understanding
I Advanced visualization
I Data distributions
I Advanced statistics
I Execution in the database
I Clumsy, but direct data access
I Execution outside
I Simple and interactive, but data transfer
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing
I What databases are built for
I Rows: very dynamic
I Easy to create new rows by joiningI Easy to �lter
I Columns: not so much
I Easy to create new columnsI Only explicit access
I Wider interpretation of preprocessing
I Enrichment with external dataI New attributes from existing onesI Recoding, recalculationI Missing value handling
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing
I What databases are built for
I Rows: very dynamic
I Easy to create new rows by joiningI Easy to �lter
I Columns: not so much
I Easy to create new columnsI Only explicit access
I Wider interpretation of preprocessing
I Enrichment with external dataI New attributes from existing onesI Recoding, recalculationI Missing value handling
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing
I What databases are built for
I Rows: very dynamic
I Easy to create new rows by joiningI Easy to �lter
I Columns: not so much
I Easy to create new columnsI Only explicit access
I Wider interpretation of preprocessing
I Enrichment with external dataI New attributes from existing onesI Recoding, recalculationI Missing value handling
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: organizing work�ow
I Common Table Expressions
I organize processing stepsI partial and intermediate results
Example
WITH source AS (
SELECT *, ROW_NUMBER() OVER () AS rownum
FROM source_table
),
no_missings AS (
SELECT *
FROM source
WHERE field1 IS NOT NULL
AND field2 IS NOT NULL
)
etc.
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: attribute creation
I Aggregation
I Partial aggregation by window functions
I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)
I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)
I Comparing to previous/next value
I att - LAG(att, 1) OVER (ORDER BY ...)
I Much easier in SQL than programming languages and datamining tools
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: attribute creation
I Aggregation
I Partial aggregation by window functions
I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)
I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)
I Comparing to previous/next value
I att - LAG(att, 1) OVER (ORDER BY ...)
I Much easier in SQL than programming languages and datamining tools
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: attribute creation
I Aggregation
I Partial aggregation by window functions
I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)
I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)
I Comparing to previous/next value
I att - LAG(att, 1) OVER (ORDER BY ...)
I Much easier in SQL than programming languages and datamining tools
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: attribute creation
I Aggregation
I Partial aggregation by window functions
I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)
I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)
I Comparing to previous/next value
I att - LAG(att, 1) OVER (ORDER BY ...)
I Much easier in SQL than programming languages and datamining tools
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: attribute creation
I Aggregation
I Partial aggregation by window functions
I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)
I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)
I Comparing to previous/next value
I att - LAG(att, 1) OVER (ORDER BY ...)
I Much easier in SQL than programming languages and datamining tools
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: enrichment
I PostGIS for geodata
I Foreign data wrappers (see PostgreSQL Wiki)
I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)
I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)
I Big Data (Hadoop Hive, Impala)
I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)
I Write your own in C or Python or Ruby
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: enrichment
I PostGIS for geodata
I Foreign data wrappers (see PostgreSQL Wiki)
I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)
I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)
I Big Data (Hadoop Hive, Impala)
I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)
I Write your own in C or Python or Ruby
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: enrichment
I PostGIS for geodata
I Foreign data wrappers (see PostgreSQL Wiki)
I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)
I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)
I Big Data (Hadoop Hive, Impala)
I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)
I Write your own in C or Python or Ruby
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: enrichment
I PostGIS for geodata
I Foreign data wrappers (see PostgreSQL Wiki)
I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)
I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)
I Big Data (Hadoop Hive, Impala)
I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)
I Write your own in C or Python or Ruby
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: enrichment
I PostGIS for geodata
I Foreign data wrappers (see PostgreSQL Wiki)
I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)
I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)
I Big Data (Hadoop Hive, Impala)
I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)
I Write your own in C or Python or Ruby
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing
Preprocessing: enrichment
I PostGIS for geodata
I Foreign data wrappers (see PostgreSQL Wiki)
I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)
I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)
I Big Data (Hadoop Hive, Impala)
I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)
I Write your own in C or Python or Ruby
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Model development
I Machine learning algorithms not well suited for SQL
I Some attempts to build them
I Naive Bayes, Linear RegressionI Di�cult for more advanced algorithms
I Better done in specialized language or tool
I PL/RI PL/Python
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Model development
I Machine learning algorithms not well suited for SQL
I Some attempts to build them
I Naive Bayes, Linear RegressionI Di�cult for more advanced algorithms
I Better done in specialized language or tool
I PL/RI PL/Python
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Model development
I Machine learning algorithms not well suited for SQL
I Some attempts to build them
I Naive Bayes, Linear RegressionI Di�cult for more advanced algorithms
I Better done in specialized language or tool
I PL/RI PL/Python
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
PL/Python
I Python procedural language available in PostgreSQL
I scikit-learn: Machine learning toolbox for Python
I Classi�cation, regression, clusteringI Model selection, validationI Preprocessing
I matplotlib: Generic and statistical plotting library
I PL/Python is an alternative to PL/R
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
PL/Python
I Python procedural language available in PostgreSQL
I scikit-learn: Machine learning toolbox for Python
I Classi�cation, regression, clusteringI Model selection, validationI Preprocessing
I matplotlib: Generic and statistical plotting library
I PL/Python is an alternative to PL/R
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation of modeling results
I Models return predictions
I Prediction can be compared to known result (target variable)
I Measures of model performance: Accuracy, precision, recall, ...
I Results on the training set meaningless
I Split validation
I Cross validation
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation of modeling results
I Models return predictions
I Prediction can be compared to known result (target variable)
I Measures of model performance: Accuracy, precision, recall, ...
I Results on the training set meaningless
I Split validation
I Cross validation
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation of modeling results
I Models return predictions
I Prediction can be compared to known result (target variable)
I Measures of model performance: Accuracy, precision, recall, ...
I Results on the training set meaningless
I Split validation
I Cross validation
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation results
I �Good result� depends on the application
I If not good enough,
I get more data
I do more preprocessing
I select better classi�er
I optimize classi�er parameters
I Cycle: preprocessing - modeling - evaluation
I Better done in data mining environment
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation results
I �Good result� depends on the application
I If not good enough,
I get more data
I do more preprocessing
I select better classi�er
I optimize classi�er parameters
I Cycle: preprocessing - modeling - evaluation
I Better done in data mining environment
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation results
I �Good result� depends on the application
I If not good enough,
I get more data
I do more preprocessing
I select better classi�er
I optimize classi�er parameters
I Cycle: preprocessing - modeling - evaluation
I Better done in data mining environment
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation results
I �Good result� depends on the application
I If not good enough,
I get more data
I do more preprocessing
I select better classi�er
I optimize classi�er parameters
I Cycle: preprocessing - modeling - evaluation
I Better done in data mining environment
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation results
I �Good result� depends on the application
I If not good enough,
I get more data
I do more preprocessing
I select better classi�er
I optimize classi�er parameters
I Cycle: preprocessing - modeling - evaluation
I Better done in data mining environment
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation results
I �Good result� depends on the application
I If not good enough,
I get more data
I do more preprocessing
I select better classi�er
I optimize classi�er parameters
I Cycle: preprocessing - modeling - evaluation
I Better done in data mining environment
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling
Evaluation results
I �Good result� depends on the application
I If not good enough,
I get more data
I do more preprocessing
I select better classi�er
I optimize classi�er parameters
I Cycle: preprocessing - modeling - evaluation
I Better done in data mining environment
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment
I Advantages of deployment in the database:
I Less overhead
I Instant application using triggers
I Well-known execution environment
I Functionality available over standard interface
I Some models easily expressed in SQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment
I Advantages of deployment in the database:
I Less overhead
I Instant application using triggers
I Well-known execution environment
I Functionality available over standard interface
I Some models easily expressed in SQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment
I Advantages of deployment in the database:
I Less overhead
I Instant application using triggers
I Well-known execution environment
I Functionality available over standard interface
I Some models easily expressed in SQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment
I Advantages of deployment in the database:
I Less overhead
I Instant application using triggers
I Well-known execution environment
I Functionality available over standard interface
I Some models easily expressed in SQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment
I Advantages of deployment in the database:
I Less overhead
I Instant application using triggers
I Well-known execution environment
I Functionality available over standard interface
I Some models easily expressed in SQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment of PL/R or PL/Python models
I Model developed in database or outside
I Put into global context
I PL/R: load(�saved object�, .GlobalEnv)I PL/Python: Global dictionary �GD�
I Application function in matching language
I Uses existing modelI Returns target variable
I Trigger func or UPDATE uses application function
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment of PL/R or PL/Python models
I Model developed in database or outside
I Put into global context
I PL/R: load(�saved object�, .GlobalEnv)I PL/Python: Global dictionary �GD�
I Application function in matching language
I Uses existing modelI Returns target variable
I Trigger func or UPDATE uses application function
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment
Deployment of PL/R or PL/Python models
I Model developed in database or outside
I Put into global context
I PL/R: load(�saved object�, .GlobalEnv)I PL/Python: Global dictionary �GD�
I Application function in matching language
I Uses existing modelI Returns target variable
I Trigger func or UPDATE uses application function
Data Science with PostgreSQL
Summary
Summary
I PostgreSQL's support for data science tasks
I Best: preprocessing, deployment
I Modern SQL for preprocessing
I Foreign Data Wrappers for data integration
I Procedural languages for data mining
Data Science with PostgreSQL
Summary
Summary
I PostgreSQL's support for data science tasks
I Best: preprocessing, deployment
I Modern SQL for preprocessing
I Foreign Data Wrappers for data integration
I Procedural languages for data mining
Data Science with PostgreSQL
Summary
Summary
I PostgreSQL's support for data science tasks
I Best: preprocessing, deployment
I Modern SQL for preprocessing
I Foreign Data Wrappers for data integration
I Procedural languages for data mining
Data Science with PostgreSQL
Summary
Summary
I PostgreSQL's support for data science tasks
I Best: preprocessing, deployment
I Modern SQL for preprocessing
I Foreign Data Wrappers for data integration
I Procedural languages for data mining
Data Science with PostgreSQL
Summary
Questions?
I Balázs Bárány, <[email protected]>
I https://datascientist.at/