Alexander Gammerman - Machine Learning for Big Data

Machine Learning for Big Data. Alexander Gammerman, Computer Learning Research Centre, Royal Holloway, University of London. Trends in Big Data, STFC/RUSI: Big Data for Security and Resilience, March 7th, 2014.


DESCRIPTION

This is a presentation delivered by Alexander Gammerman at the STFC Futures / RUSI Conference Series: Data for Security and Resilience 2014.

TRANSCRIPT

Page 1: Alexander Gammerman - Machine Learning for Big Data

Machine Learning for Big Data

Alexander Gammerman

Computer Learning Research Centre

Royal Holloway, University of London

Trends in Big Data

STFC/RUSI: Big Data for Security and Resilience

March 7th, 2014


Page 2: Alexander Gammerman - Machine Learning for Big Data

Layout

1 Debunking the myth

2 Machine Learning (Data Analytics)

3 Trends in Machine Learning for Big Data

4 Conclusions


Page 3: Alexander Gammerman - Machine Learning for Big Data

"Fashionable" pursuit

AI, Cybernetics, Neural Networks, Expert Systems, Big Data?

Big Data, small data, any data: what we need is Data Analysis, or Data Analytics, or Machine Learning.


Page 4: Alexander Gammerman - Machine Learning for Big Data

Machine Learning: what is it?

ML is the intersection of Statistics and Computer Science.

Statistics deals with inferences to obtain valid conclusions from data under various models and assumptions.

Computer Science considers what is computable, develops efficient algorithms, and is concerned with data storage and manipulation.

ML takes the past data, "learns" from it, and tries to find rules and regularities in the data in order to make predictions for future examples. Efficient algorithms have to be developed to make valid predictions.


Page 5: Alexander Gammerman - Machine Learning for Big Data

Computer Learning Research Centre (CLRC) at Royal Holloway, University of London

Established in 1998 to develop machine learning theory, including the design of efficient algorithms for data analysis.

CLRC Fellows include several prominent researchers, such as Vapnik and Chervonenkis (the two founders of statistical learning theory), Shafer (co-founder of the Dempster-Shafer theory), Rissanen (inventor of the Minimum Description Length principle), and Levin (one of the three founders of the theory of NP-completeness, who also made fundamental contributions to Kolmogorov complexity).


Page 6: Alexander Gammerman - Machine Learning for Big Data

Recent years: explosion of interest in machine-learning methods, in particular statistical learning theory. Statistical learning theory: similar goals to statistical science, but

it is nonparametric and

concerned with the problem of prediction.


Page 7: Alexander Gammerman - Machine Learning for Big Data

Problems and Current Techniques

Classical techniques: small-scale, low-dimensional data. But conceptual and computational difficulties for high-dimensional data. Validity of predictions. Confidence measures. Online prediction.

Current techniques for the dimensionality problem: Support Vector Machine (Vapnik, 1995, 1998; Vapnik and Chervonenkis, 1974); Kernel Methods.

New technique for the validity problem: Conformal Predictors.


Page 8: Alexander Gammerman - Machine Learning for Big Data

Projects

Compact Descriptors for Automatic Target Identification (with QinetiQ).

Statistical profiling of offenders (with the Home Office).

Material identification with atmospheric corrections (with Waterfall Solutions).

Unmixing spectra (with QinetiQ).

Anomaly detection (vehicles) (with Thales).

Fault Diagnosis (with Marconi Instruments).


Page 9: Alexander Gammerman - Machine Learning for Big Data

Projects – cont’d

Abdominal Pain (with Western General Hospital, Edinburgh).

Ovarian Cancer (with Institute for Women’s Health, UCL).

Depression (with Institute of Psychiatry, King’s College).

Child Leukemia (with Royal London Hospital).

Heart Diseases (with Institute for Women’s Health, UCL).

Analysis of microarrays (with Veterinary Laboratory Agency, DEFRA).

Protein-Protein Interaction (EU project).


Page 10: Alexander Gammerman - Machine Learning for Big Data

How much data do we need to answer our questions?

Big Data: V^3

Volume: Gigabyte (10^9); Terabyte (10^12); Petabyte (10^15); Exabyte (10^18); Zettabyte (10^21).

Variety: structured, semi-structured, unstructured; text, image, audio, video.

Velocity: dynamic; time-varying, etc.

Plus: high-dimensionality

But: if the answer is a Zettabyte, what is the question?

The global data supply reached 2.8 zettabytes (ZB) in 2012, or 2.8 trillion GB, but just 0.5% of this is used for analysis, according to the Digital Universe Study. Volumes of data are projected to reach 40 ZB by 2020, or 5,247 GB per person.
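The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the decimal (SI) units listed on the slide, and the "implied population" is derived here from the quoted numbers rather than stated in the talk.

```python
# Decimal (SI) byte units, as listed on the slide.
GB = 10**9
ZB = 10**21


def zb_to_gb(zettabytes):
    """Convert zettabytes to gigabytes."""
    return zettabytes * (ZB // GB)


supply_2012_gb = zb_to_gb(2.8)                 # 2.8 ZB expressed in GB
analysed_2012_zb = 2.8 * 0.005                 # the 0.5% said to be used for analysis
projected_2020_gb = zb_to_gb(40)               # 40 ZB expressed in GB
implied_population = projected_2020_gb / 5247  # people implied by 5,247 GB each

print(f"2012 supply:        {supply_2012_gb:.3e} GB")
print(f"analysed in 2012:   {analysed_2012_zb:.3f} ZB")
print(f"implied population: {implied_population:.3e}")
```

The conversion confirms that 2.8 ZB is 2.8 trillion GB, and the per-person figure corresponds to a world population of a little over 7.6 billion.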


Page 11: Alexander Gammerman - Machine Learning for Big Data

We don’t need big data per se: we need to have a problem first, and then decide how much data we need to solve it.

If a child wants to learn the concept of a car, he or she doesn’t need to see a million or a billion cars; 10 or 100 are enough.

If we want to predict digits, we can learn on the first 100 or 1,000 digits and then identify the next one confidently and with high accuracy.


Page 12: Alexander Gammerman - Machine Learning for Big Data

Figure: USPS data


Page 13: Alexander Gammerman - Machine Learning for Big Data

Figure: Conformal Predictors on USPS data: online cumulative multiple predictions at different confidence levels ("Hedging Predictions in Machine Learning" by A. Gammerman and V. Vovk, The Computer Journal (2007) 50 (2): 151-163).


Page 14: Alexander Gammerman - Machine Learning for Big Data

In fact, this is a well-known concept in machine learning. In the past, people thought that the larger the training set, the more accurate the results. But the founders of statistical learning theory, V. Vapnik and A. Chervonenkis, showed that it is not just the length of the training data that matters: another characteristic, called "capacity", is more important.
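One way to see the role of capacity is through a commonly quoted form of Vapnik's bound (Vapnik, 1995): with probability at least 1 - eta, the true risk is at most the empirical risk plus sqrt((h (ln(2n/h) + 1) - ln(eta/4)) / n), where n is the number of training examples and h is the VC dimension, a measure of capacity. The sketch below only evaluates that second term; the function name and the example values of n and h are illustrative assumptions, not taken from the talk.

```python
import math


def vc_confidence_term(n, h, eta=0.05):
    """Capacity term of the bound for n training examples and VC dimension h."""
    if n <= 0 or h <= 0 or not 0 < eta < 1:
        raise ValueError("n and h must be positive and eta must lie in (0, 1)")
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


# Same amount of data with low and high capacity, then high capacity with more data.
for n, h in [(1_000, 10), (1_000, 500), (100_000, 500)]:
    print(f"n={n:>7}  h={h:>4}  term={vc_confidence_term(n, h):.3f}")
```

With the training-set size held fixed, the term grows with h, so a low-capacity class of functions can give a tighter guarantee than simply collecting more examples for a high-capacity one.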


Page 15: Alexander Gammerman - Machine Learning for Big Data

Trends in Machine Learning for Big Data

How do we make machine learning algorithms scale to large datasets? There are two main approaches: (1) developing parallelizable ML algorithms and integrating them with large parallel systems, and (2) developing more efficient algorithms.

Data growth is driving the need for parallel and online algorithms and models that can handle this "Big Data".

Need to explore the computational foundations associated with performing these analyses in the context of parallel and cloud architectures.


Page 16: Alexander Gammerman - Machine Learning for Big Data

Large-scale modeling techniques and algorithms include

transductive and inductive models,

online compression models (extension of conformal predictors),

graphical models,

deep learning and semi-supervised learning algorithms,

clustering algorithms,

parallel learning algorithms.

The computational techniques provide a basic foundation in large-scale programming, ranging from the basic "parfor" to parallel abstractions such as MapReduce (Hadoop) and GraphLab.
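The map-and-reduce pattern behind these abstractions can be sketched on a single machine. The example below is an illustration only, not Hadoop or GraphLab code: the function names are hypothetical, and a thread pool from the Python standard library stands in for a cluster. Each chunk is summarised independently (map), and the partial summaries are merged (reduce) to obtain the mean and population variance.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: summarise one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(data, n_chunks=4):
    """Mean and population variance computed from per-chunk summaries."""
    if not data:
        raise ValueError("data must be non-empty")
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    n, total, total_sq = reduce(combine, partials)
    mean = total / n
    return mean, total_sq / n - mean * mean


mean, var = parallel_mean_variance(list(range(1, 101)))
print(mean, var)
```

Because the per-chunk summaries are small and merging them is associative, the same structure carries over to settings where the chunks live on different machines.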


Page 17: Alexander Gammerman - Machine Learning for Big Data

Transduction

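Figure: Induction and Transduction [V. Vapnik, 1995]. (Diagram: induction, i.e. learning, leads from data (past examples) to general knowledge; deduction leads from general knowledge to the particular (future examples); transduction leads directly from data to the particular.)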


Page 18: Alexander Gammerman - Machine Learning for Big Data

Why use conformal predictions?

Why, after 100 years of research in statistics, do we need yet another method of prediction?

It is simple and rigorous.

Given any of a wide range of learning/statistical prediction methods, conformal prediction can be used as a wrapper to provide a measure of confidence.

It is valid under weak assumptions.

It limits the fraction of prediction mistakes from the start. (Crudely, a predictor can either make a prediction or else say "don’t know", possibly in a graded way, such as giving a wide prediction interval.)

It works in practice.
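The wrapper idea can be illustrated with a small transductive conformal predictor built around nearest neighbours. This is a sketch under stated assumptions, not the implementation used in the talk: the nonconformity measure (distance to the nearest example with the same label divided by distance to the nearest example with a different label) is one common choice, the data points are made up, and all function names are hypothetical. It needs Python 3.8 or later for `math.dist`.

```python
import math


def nonconformity(i, xs, ys):
    """1-NN score: nearest same-label distance over nearest other-label distance."""
    same = [math.dist(xs[i], xs[j]) for j in range(len(xs))
            if j != i and ys[j] == ys[i]]
    other = [math.dist(xs[i], xs[j]) for j in range(len(xs)) if ys[j] != ys[i]]
    d_same = min(same, default=math.inf)
    d_other = min(other, default=math.inf)
    if d_same == d_other:  # covers the 0/0 and inf/inf cases
        return 1.0
    if d_other == 0:
        return math.inf
    return d_same / d_other


def p_values(train_x, train_y, x_new, labels):
    """p-value of each candidate label for x_new."""
    result = {}
    for label in labels:
        xs = train_x + [x_new]
        ys = train_y + [label]
        scores = [nonconformity(i, xs, ys) for i in range(len(xs))]
        # Fraction of examples at least as strange as the new one.
        result[label] = sum(s >= scores[-1] for s in scores) / len(scores)
    return result


def prediction_set(pvals, epsilon):
    """Labels that cannot be rejected at significance level epsilon."""
    return {label for label, p in pvals.items() if p > epsilon}


# Made-up data: two well-separated clusters labelled "a" and "b".
train_x = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
           (3.0, 3.0), (3.2, 3.1), (3.1, 2.8), (2.9, 3.2)]
train_y = ["a"] * 4 + ["b"] * 4

pvals = p_values(train_x, train_y, (0.15, 0.15), ["a", "b"])
print(pvals)
print(prediction_set(pvals, 0.2))
print(prediction_set(pvals, 0.05))
```

Lowering epsilon (asking for higher confidence) can only enlarge the prediction set; with so few examples, demanding 95% confidence here means both labels have to be kept. Under the exchangeability assumption, the long-run fraction of prediction sets that miss the true label is bounded by epsilon.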


Page 19: Alexander Gammerman - Machine Learning for Big Data

Conclusions

"It took Deep Thought 7.5 million years to answer the ultimate question. As nobody knew what the ultimate question to Life, The Universe and Everything actually was, nobody knows what to make of the answer (42)."

Nowadays, as John Poppelaars noticed, many people think that Big Data will help to find the ultimate question.

But I already know that it is not Big Data, and the answer is not 42, but Machine Learning.
