Post on 13-Jun-2020

Machine Learning for Knowledge Dissemination in Creative Economies

Krzysztof Pampuch

• What is machine learning?

• Basic terminology

• Systematics of ML methods

• How to measure the quality of our model

• Selected methods of ML

• What does ML look like in everyday practice?

[Diagram labels: Statistics, Computer Science]

Machine learning (ML) is a category of algorithms that allow software applications to become more accurate in predicting outcomes without being explicitly programmed.

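| Observation no. | Sepal length | Sepal width | Petal length | Petal width | Label |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 3 | 6.4 | 3.5 | 4.5 | 1.2 | Versicolor |
| … | … | … | … | … | … |
| 100 | 5.9 | 3.0 | 5.0 | 1.8 | Virginica |

• Rows: observations
• Measurement columns: features (predictors)
• Last column: label (predicted variable)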
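To make the terminology concrete, here is a minimal sketch that stores the four rows visible in the table as a feature matrix and a label list. The variable names `rows`, `X` and `y` are illustrative choices, not something the slides prescribe.

```python
# The four rows shown in the table, as (features, label) pairs.
rows = [
    ([5.1, 3.5, 1.4, 0.2], "Setosa"),
    ([4.9, 3.0, 1.4, 0.2], "Setosa"),
    ([6.4, 3.5, 4.5, 1.2], "Versicolor"),
    ([5.9, 3.0, 5.0, 1.8], "Virginica"),
]

# X holds the features (predictors): one row per observation,
# one column per feature.
X = [features for features, _ in rows]

# y holds the label (predicted variable): one entry per observation.
y = [label for _, label in rows]

n_observations = len(X)
n_features = len(X[0])
classes = sorted(set(y))

print("observations:", n_observations)
print("features per observation:", n_features)
print("classes:", classes)
```

A supervised model would learn a mapping from each row of `X` to the matching entry of `y`.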

The McCulloch-Pitts neuron (1943)

Frank Rosenblatt's neuron, the perceptron (1957)

Learning concept:

Machine learning
• unsupervised
  • clustering
  • dimensionality reduction
• supervised
  • classification
  • regression
• reinforcement learning

Quantitative
• can be expressed using specific units of measurement

Qualitative
• can be described only by words, can't be ordered

Criteria:
• Efficiency
• Stability
  • For other samples
  • Over time
• Interpretability

• We split the dataset into:
  • Train set - used for training a model
  • Validation set - used to choose the best model
  • Test set - used to make sure that our model is stable

[Figure: the dataset divided into train | validation | test]
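The three-way split can be sketched in a few lines. This is only an illustration: the 60/20/20 proportions, the function name `train_val_test_split` and the fixed seed are assumptions, since the slides do not specify them.

```python
import random


def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of `data` and cut it into train / validation / test."""
    items = list(data)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test_part = items[:n_test]
    val_part = items[n_test:n_test + n_val]
    train_part = items[n_test + n_val:]
    return train_part, val_part, test_part


train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```

The model is fitted on `train_set`, competing models are compared on `val_set`, and `test_set` is touched only once at the end.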

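[Figure: k-fold cross-validation; in each round a different part of the data is the test set and the rest is the training set]

Each observation is used exactly once for testing and k-1 times for training

The quality of a model is the mean of the scores computed over all k test sets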
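A minimal sketch of the k-fold scheme described above, working on indices only. The helper names `k_fold_indices` and `cross_val_score` and the `score_fold` callback are hypothetical; in practice the observations are usually shuffled before the folds are cut.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for each of the k folds over n observations."""
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def cross_val_score(n, k, score_fold):
    """Average a per-fold score over all k folds.

    `score_fold(train_idx, test_idx)` is expected to train a model on the
    training indices and return its score on the test indices.
    """
    scores = [score_fold(tr, te) for tr, te in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Each index lands in exactly one test fold and in the training part of the remaining k-1 folds, which is the property stated above.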

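The expected error on a test set:

$E(y_i - \hat{y}_i)^2 = \mathrm{Var}(\hat{y}_i) + [\mathrm{Bias}(\hat{y}_i)]^2 + \mathrm{Var}(\varepsilon)$

$\mathrm{Var}(\hat{y}_i)$ - variance

$\mathrm{Bias}(\hat{y}_i)$ - bias

$\mathrm{Var}(\varepsilon)$ - variance of a random component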

• The bias reflects the error we make when approximating reality with a model

• The variance reflects how much the prediction would change if a different set of data were used to train the model

• The variance of the random component is independent of the process being modeled and is irreducible

• Best situation: negligible bias and variance

• The more "flexible" the method, the lower the bias

• The more "flexible" the method, the higher the variance
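The trade-off can be checked with a small simulation. This is a sketch under stated assumptions: the true function `f`, the noise level, the evaluation point `x0 = 0.25` and the two models (a rigid "always predict the mean" model and a flexible 1-nearest-neighbour model) are all chosen for illustration and do not come from the slides.

```python
import math
import random


def f(x):
    """True (normally unknown) relationship, assumed here for illustration."""
    return math.sin(2 * math.pi * x)


def predict_mean(xs, ys, x0):
    """Rigid model: ignores x0 and predicts the mean training label."""
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    """Flexible model: copies the label of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias2_and_variance(predict, x0, n=20, sigma=0.3, n_sims=2000, seed=0):
    """Estimate squared bias and variance of predict(x0) over many training sets."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]  # fixed design on [0, 1]
    preds = []
    for _ in range(n_sims):
        # Each simulation draws a fresh noisy training set.
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        preds.append(predict(xs, ys, x0))
    mean_pred = sum(preds) / n_sims
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_sims
    bias2 = (mean_pred - f(x0)) ** 2
    return bias2, variance


bias2_rigid, var_rigid = bias2_and_variance(predict_mean, x0=0.25)
bias2_flex, var_flex = bias2_and_variance(predict_1nn, x0=0.25)
print(f"mean model: bias^2={bias2_rigid:.4f}, variance={var_rigid:.4f}")
print(f"1-NN model: bias^2={bias2_flex:.4f}, variance={var_flex:.4f}")
```

In this setup the rigid model shows the larger squared bias and the flexible model the larger variance, which mirrors the two statements above.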

• Goal: to fit a linear function to our data

• $y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \epsilon$

• How to find the model coefficients?
  • By minimizing the cost function: $L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

• Disadvantages: sensitivity to outliers, poor modeling of nonlinear relationships
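For a single predictor (p = 1) the minimum of the squared-error cost has a closed form, which the sketch below uses. The function names and the toy data are illustrative assumptions; the last lines show the sensitivity to outliers by replacing one label with an extreme value.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = beta0 + beta1 * x for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def cost(xs, ys, beta0, beta1):
    """Sum of squared residuals L for the given coefficients."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data lying exactly on y = 1 + 2x
beta0, beta1 = fit_simple_linear_regression(xs, ys)
print("clean data:", beta0, beta1, "cost:", cost(xs, ys, beta0, beta1))

# One outlier in the last observation pulls the fitted slope away from 2.
ys_outlier = ys[:-1] + [30.0]
beta0_out, beta1_out = fit_simple_linear_regression(xs, ys_outlier)
print("with outlier:", beta0_out, beta1_out)
```

With several predictors the same cost is minimized, but the solution is usually computed with linear algebra routines rather than by hand.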

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

• Values in the range [0, 1]
• Interpretation: how much of the variance of the data does the model explain?
• $\bar{y}$ - mean value of $y$
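A short sketch of the computation, assuming the usual definition of the coefficient of determination (one minus the residual sum of squares divided by the total sum of squares). The function name `r_squared` and the numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions explain all of the variance.
r2_perfect = r_squared(y_true, y_true)

# Always predicting the mean explains none of it.
r2_mean = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])

# Predictions that are close but not exact fall in between.
r2_close = r_squared(y_true, [1.5, 2.0, 3.0, 3.5])

print(r2_perfect, r2_mean, r2_close)
```

A model that always predicts the mean serves as the baseline: it scores 0, and a perfect model scores 1.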

• Misclassification rate: $MR = 1 - \frac{\sum_i f_{ii}}{\sum_{i,j} f_{ij}}$

• Accuracy: $ACC = 1 - MR$

• Multi-class log-loss: $MLL = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$

• ROC, AUC, F-measure: $F_1 = \frac{2TP}{2TP + FP + FN}$

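| Predicted value \ True value | 0 | 1 | 2 |
|---|---|---|---|
| 0 | $f_{00}$ | $f_{01}$ | $f_{02}$ |
| 1 | $f_{10}$ | $f_{11}$ | $f_{12}$ |
| 2 | $f_{20}$ | $f_{21}$ | $f_{22}$ |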

| Predicted value \ True value | 1/T | 0/N |
|---|---|---|
| 1/T | $TP$ | $FP$ |
| 0/N | $FN$ | $TN$ |
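The classification measures listed above can be computed directly from a confusion matrix. The sketch below assumes the layout of the tables (rows are predicted values, columns are true values); the function names and all counts are made up for illustration. ROC and AUC are not covered here.

```python
import math


def misclassification_rate(cm):
    """cm[i][j] = number of observations predicted as class i with true class j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 1 - correct / total


def accuracy(cm):
    return 1 - misclassification_rate(cm)


def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def multiclass_log_loss(y_onehot, probs):
    """y_onehot[i][j] is 1 if observation i belongs to class j, else 0;
    probs[i][j] is the predicted probability of class j for observation i."""
    n = len(y_onehot)
    total = 0.0
    for y_row, p_row in zip(y_onehot, probs):
        for y, p in zip(y_row, p_row):
            if y:  # terms with y = 0 contribute nothing
                total += y * math.log(p)
    return -total / n


# Made-up 3-class confusion matrix with 20 observations, 16 on the diagonal.
cm = [
    [5, 1, 0],
    [1, 4, 1],
    [0, 1, 7],
]
mr = misclassification_rate(cm)
acc = accuracy(cm)
f1 = f1_score(tp=8, fp=2, fn=2)
mll = multiclass_log_loss([[1, 0], [0, 1]], [[0.5, 0.5], [0.5, 0.5]])
print(mr, acc, f1, mll)
```

The log-loss example uses a model that is maximally unsure between two classes, which gives a loss of log 2.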

[Figure panels: K-means, DBSCAN]
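Of the two clustering methods named here, K-means is simple enough to sketch in a few lines. The version below is Lloyd's algorithm with hand-picked initial centers and a fixed number of iterations; both choices, the function names and the toy points are assumptions for illustration. DBSCAN is not shown.

```python
def nearest_center(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))


def k_means(points, initial_centers, n_iter=10):
    """Alternate between assigning points to centers and updating the centers."""
    centers = list(initial_centers)
    for _ in range(n_iter):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, initial_centers=[points[0], points[3]])
print("centers:", centers)
print("labels:", labels)
```

The number of clusters has to be fixed in advance through the number of initial centers, and the result can depend on where those centers start.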

[Diagram: Data → Feature engineering → Train set / Test set → Learning → Model → Model validation]

• Data almost never has the desired format
• Often we have to acquire data from many sources
• Volume, inflow rate
• Examples of problems:
  • Storage of terabytes of data
  • Data from various DBMS + external data
  • Data refreshing and retention
  • Consistency of data types
  • Unstructured data
  • Character encoding, number and date formats

• The most time-consuming activity

• The type of processing required depends on the type of data and the problem

• Generating features – manual vs automatic:

[Figure: time axis with the stages preprocessing, dimensionality reduction, prediction]

• Examples of feature generation:

Text
• Regular expressions
• Tokenization
• Lemmatization
• Bag-of-words
• TF-IDF

Customer data
• Total payments
• Balance on accounts
• Number of logins
• Demographic data

Audio / video
• Signal framing
• LPC, MFCC
• Color/gradient histograms
• SIFT, SURF
• Bag-of-words
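For the text column, the chain from regular expressions to TF-IDF can be sketched with the standard library alone. The tokenization rule, the function names and the particular TF-IDF variant (term frequency times log(N / document frequency)) are assumptions; other weightings are in common use, and lemmatization is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens with a regular expression."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf = count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = bag_of_words(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cats and the dogs.",
]
weights = tf_idf(docs)
print(bag_of_words(tokenize(docs[0])))
print(weights[0])
```

A word that occurs in every document, such as "the" here, gets weight 0, while a word unique to one document gets the highest weight. Without lemmatization "cat" and "cats" are counted as different terms.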

• High dimensionality of the space of features:

• Degrades the predictive power of models

• Introduces redundancy (variable correlation)

• Leads to overfitting

• Requires larger data sets to achieve the same goal

• Increases the computational effort

• And besides… decision-makers do not like complex models and many variables

• So let’s reduce the dimensionality!

• Principle of operation (most often):
  • The most accurate reproduction of the data in a space of lower dimensionality
  • The best possible highlighting of the information that differentiates the values of the predicted variable

Feature extraction:

$[x_1, x_2, \ldots, x_n]^T \xrightarrow{f} [y_1, y_2, \ldots, y_k]^T$

Feature selection:

$[x_1, x_2, \ldots, x_n]^T \rightarrow [x_{i_1}, x_{i_2}, \ldots, x_{i_k}]^T$

$k < n$
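The difference between the two approaches can be shown on a toy matrix. In this sketch, selection keeps k of the original columns (ranked by variance) and extraction builds k new columns as y = f(x), with f taken to be a linear map. The variance criterion, the matrix `W` and the data are hypothetical choices, since the slides do not name a specific method.

```python
def variance(column):
    """Population variance of a sequence of numbers."""
    m = sum(column) / len(column)
    return sum((v - m) ** 2 for v in column) / len(column)


def select_features(X, k):
    """Feature selection: keep the k original columns with the highest variance."""
    columns = list(zip(*X))
    ranked = sorted(range(len(columns)),
                    key=lambda j: variance(columns[j]), reverse=True)
    keep = sorted(ranked[:k])  # preserve the original column order
    return [[row[j] for j in keep] for row in X], keep


def extract_features(X, W):
    """Feature extraction: y = W x for every row x, with W of shape k x n."""
    return [[sum(w * x for w, x in zip(w_row, row)) for w_row in W]
            for row in X]


# Toy data: n = 3 features, the middle one is almost constant.
X = [
    [1.0, 10.0, 0.5],
    [2.0, 10.1, 0.1],
    [3.0, 9.9, 0.9],
    [4.0, 10.0, 0.5],
]

X_selected, kept = select_features(X, k=2)
print("kept columns:", kept)

# Hypothetical 2 x 3 map: the first new feature mixes two original ones.
W = [
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
X_extracted = extract_features(X, W)
print("first extracted row:", X_extracted[0])
```

Selected features keep their original meaning, which helps interpretability, while extracted features are new combinations of the original ones.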
