Looking for Something Special: Outlier Detection in R

Ágnes Salánki [email protected] Budapest BI Forum 2015

Upload: salankia

Post on 13-Apr-2017

PhD student in Computer Engineering

Fault Tolerant Systems Research Group

Availability of 99.99%

2011: “We need a method to detect erroneous observations.”

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins 1980)

[Figure: timeline of outlier detection algorithms, 1970 to 2010: isodepth, mve, db, lof, mcd, bacon, s-h-esd, fast-mcd]
PISA 2012 results

Children’s math and reading scores by country

[Plot labels: China-Shanghai, Qatar, Peru, Japan, Indonesia, Colombia]

[Figure: algorithm timeline, 1970 to 2010: isodepth, mve, db, lof]

[Plot labels: China-Shanghai, Kazakhstan, Japan, Costa Rica, Colombia]

[Figure: algorithm timeline, 1970 to 2010, with 1987 marked: isodepth, mve, db, lof]

[Plot labels: China-Shanghai, Kazakhstan, Montenegro, Peru, Albania, Qatar]

[Figure: algorithm timeline, 1970 to 2010, with 1987 marked: isodepth, mve, db, lof]

[Plot labels: China-Shanghai, Japan, Costa Rica]

[Figure: algorithm timeline, 1970 to 2010: isodepth, mve, db, lof]

[Plot labels: China-Shanghai, Macao, Liechtenstein]

Happy families are all alike; every unhappy family is unhappy in its own way. /Anna Karenina/

Fault Tolerant Systems Research Group

Outliers: high communication workload

Only planned system maintenance that moved large amounts of data

[Figure: timeline of outlier detection algorithms, 1970 to 2010: isodepth, mve, db, lof, mcd, bacon, s-h-esd, fast-mcd]

[Table: algorithms (isodepth, MVE, DB, LOF) by tool (R, Scikit-learn, RapidMiner, WEKA, ELKI)]
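
To make one of the listed algorithms concrete, here is a minimal sketch of a distance-based detector. It assumes that "DB" on the slides refers to the DB(p, D) outliers of Knorr and Ng, where an object is an outlier if at least a fraction p of the objects in the data set lie farther than distance D from it. The function name `db_outliers`, the parameter values, and the sample scores are all hypothetical and are not taken from the talk or from the PISA data. Python with only the standard library is used so the sketch runs without extra packages; it is not the code from the talk.

```python
import math


def db_outliers(points, p, d):
    """Return the indices of DB(p, d) outliers.

    A point is flagged when at least a fraction p of all points in the
    data set lie farther than distance d from it (Euclidean distance).
    This is the naive O(n^2) check, fine for small data sets only.
    """
    n = len(points)
    flagged = []
    for i, a in enumerate(points):
        # Count the points that are farther than d from point i.
        far = sum(1 for b in points if math.dist(a, b) > d)
        if far >= p * n:
            flagged.append(i)
    return flagged


# Hypothetical (math, reading) score pairs, NOT real PISA figures.
scores = [
    (500, 498),
    (495, 502),
    (505, 500),
    (498, 495),
    (502, 505),
    (613, 570),  # deliberately far from the rest
]

print(db_outliers(scores, p=0.8, d=50.0))  # prints [5]
```

The result depends strongly on the choice of p and D, which is one reason the different algorithms in the talk label different countries.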

salankia

R packages: depth, fields, robustX, DMwR

Pictures

Forrest Gump, Judit Polgár, Garry Kasparov

Outlier detection applications in finance, security, medicine, police surveillance

1977 and 1987 pictures

GitHub code: https://github.com/salankia/OutlierDetection-Budapest-BI-2015