Predictive Science

Upload: peter-nordin

Post on 08-Apr-2018


TRANSCRIPT


    Predictive Science: a Tautology

    Peter Nordin


    The Answer is:


    The asymmetry of similarity

    ! What thing is this like?


    ! And what is this like?


    A heuristic measure of the amount of
    information: Shannon's guessing game

    1. Pony?

    2. Cow?

    3. Dog?

    ...

    345. Pegasus!

    345!


    Science is Prediction

    ! When does the next solar eclipse in Europe occur?

    ! The next solar eclipse in Europe will happen on
    August 12, 2026.


    Science is Compression


    The Model, Science, and Prediction


    The Turkey and the issue
    with inductive predictions (1)


    The Turkey and the issue
    with inductive predictions (2)


    Mandatory Reading


    All Real Science is
    Predictive Science

    ! Predict when the sun will set tomorrow

    ! Predict whether you will be sick or well after taking
    this medicine

    ! Predict what will happen in this project if this
    methodology is used


    How to predict
    anything:

    1. Collect facts

    2. Find a short model fitting all the facts

    3. Extrapolate that model into the future;
    a model's probability is determined by its length
    (shorter means more probable; a minimal sketch follows below)

    4. Meta loop: collect and include facts
    about your model-finding adventures,
    go to step 2, and use them for planning
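    A minimal sketch of steps 1-3, assuming a toy set of candidate models and using
    2^(-description length) as the model weight. The model list and the bit counts
    below are illustrative assumptions, not from the slides:

```python
# Step 1: collected facts (a toy numeric sequence).
facts = [1, 2, 3, 4]

# Step 2: candidate models, each with a rough description length in bits
# (the models and bit counts are illustrative assumptions).
candidate_models = [
    ("linear: x_i = i", lambda i: i, 20),
    ("quartic polynomial", lambda i: i**4 - 10*i**3 + 35*i**2 - 49*i + 24, 60),
]

# Keep only models consistent with all the facts (Epicurus),
# and weight each survivor by 2^(-length) (Occam).
consistent = [(name, f, bits) for name, f, bits in candidate_models
              if all(f(i + 1) == x for i, x in enumerate(facts))]
total = sum(2.0 ** -bits for _, _, bits in consistent)

# Step 3: extrapolate one step into the future with the weighted models.
next_index = len(facts) + 1
for name, f, bits in consistent:
    weight = 2.0 ** -bits / total
    print(f"{name}: predicts {f(next_index)} with weight {weight:.6f}")
```

    Both models fit the observed facts, but the shorter (linear) one carries almost
    all of the weight, so the mixture predicts 5 rather than 29.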


    Companies and

    Prediction

    ! A company is a collection of people predicting risk from actions

    ! No risk, no gain


    Recent progress


    Recent advances:
    Universal Learning Algorithms

    There is a theoretically optimal way of predicting the future,
    given the past. It can be used to define an optimal (though
    noncomputable) rational agent that maximizes its expected reward
    in almost arbitrary environments sampled from computable
    probability distributions.


    Recent advances:

    All Scientists: Physicists, economists, and other scientists make predictions
    based on observations. So does everybody in daily life. Did you know that there
    is a theoretically optimal way of predicting? Every scientist should know about it.

    Normally we do not know the true conditional probability distribution
    p(next event | past). But assume we do know that p is in some set P of
    distributions. Choose a fixed weight w_q for each q in P such that the
    w_q add up to 1 (for simplicity, let P be countable). Then construct the
    Bayes mix M(x) = Sum_q w_q q(x), and predict using M instead of the
    optimal but unknown p. (A minimal sketch of such a mixture follows below.)

    How wrong is it to do that? The recent exciting work of Marcus Hutter
    (funded through Juergen Schmidhuber's SNF research grant "Unification of
    Universal Induction and Sequential Decision Theory") provides general and
    sharp loss bounds:

    Let L_M(n) and L_p(n) be the total expected losses of the M-predictor and
    the p-predictor, respectively, for the first n events. Then L_M(n) - L_p(n)
    is at most of the order of sqrt[L_p(n)]. That is, M is not much worse than p.
    And in general, no other predictor can do better than that! In particular,
    if p is deterministic, then the M-predictor soon won't make any errors any more.

    If P contains ALL computable distributions, then M becomes the celebrated
    enumerable universal prior. That is, after decades of somewhat stagnating
    research we now have sharp loss bounds for Ray Solomonoff's universal (but
    incomputable) induction scheme (1964, 1978).

    Alternatively, reduce M to what you get if you just add up weighted
    estimated future finance data probabilities generated by 1000 commercial
    stock-market prediction software packages. If only one of them happens to
    work fine (but you do not know which), you still should get rich.
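    A minimal sketch of the Bayes mix described above, assuming a small finite set P
    of i.i.d. Bernoulli "experts". The particular experts and prior weights are
    illustrative choices, not from the slides:

```python
# Candidate distributions q in P: i.i.d. Bernoulli(theta) experts (illustrative).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [1.0 / len(thetas)] * len(thetas)   # prior weights w_q, summing to 1

def predict_next(weights, thetas):
    """M(next bit = 1 | past) = sum_q w_q(past) * q(1)."""
    return sum(w * t for w, t in zip(weights, thetas))

def update(weights, thetas, bit):
    """Bayesian update of the mixture weights after observing one bit."""
    new = [w * (t if bit == 1 else 1.0 - t) for w, t in zip(weights, thetas)]
    norm = sum(new)
    return [w / norm for w in new]

# Observe a sequence that is mostly ones; M should start predicting 1 confidently.
sequence = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
for bit in sequence:
    p1 = predict_next(weights, thetas)
    print(f"M(next=1 | past) = {p1:.3f}, observed {bit}")
    weights = update(weights, thetas, bit)
```

    If one expert happens to match the data source, its posterior weight quickly
    dominates, which is the intuition behind the "1000 stock-market packages" remark.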


    Intelligence

    ! Is compression

    ! If used for prediction



    Art?


    Theory Pyramid

    Undecidable stuff etc.

    Optimal Cognition

    Algorithmic Information Theory

    Optimal prediction

    Experimental planning

    Turing-complete repr.

    Bayes etc.

    Multivariate distribution stats

    Single-variable distribution stats


    Agent


    Formal Agent Model


    Gödel machine


    Artificial Intelligence

    ! Information-theoretic, Statistical, and Philosophical
    Foundations of Artificial Intelligence


    Universal AI

    Universal Artificial Intelligence = Decision Theory + Universal Induction

    Decision Theory = Probability + Utility Theory

    Universal Induction = Ockham + Bayes + Turing


    Pieces of the puzzle

    ! Philosophical issues: the common principle
    behind their solution is Occam's simplicity
    principle. Based on Occam's and Epicurus'
    principles, Bayesian probability theory, and
    Turing's universal machine, Solomonoff
    developed a formal theory of induction.

    ! The sequential/online setup considered
    in this presentation, placed into the wider
    machine learning context.


    What is Intelligence?

    ! Informal definition of (artificial) intelligence:

    ! Intelligence measures an agent's ability to achieve
    goals in a wide range of environments.

    ! Emergent: features such as the ability to learn and
    adapt, or to understand, are implicit in the above
    definition, as these capacities enable an agent to
    succeed in a wide range of environments.

    ! The science of Artificial Intelligence is concerned
    with the construction of intelligent
    systems/artifacts/agents and their analysis.


    The Hierarchy

    ! Induction -> Prediction -> Decision -> Action

    ! Having (or acquiring, learning, inducing) a model of
    the environment an agent interacts with allows the
    agent to make predictions and utilize them in its
    decision process of finding a good next action.

    ! Induction infers general models from specific
    observations/facts/data, usually exhibiting regularities
    or properties or relations in the latter.

    ! Example Induction: Find a model of the world economy.

    ! Prediction: Use the model for predicting the future
    stock market.

    ! Decision: Decide whether to invest assets in stocks or
    bonds.

    ! Action: Trading large quantities of stocks
    influences the market.


    Sequence

    ! Example 2:

    ! Digits of a computable number. Extend
    14159265358979323846264338327950288419716939937?

    ! Looks random?! Frequency estimate: n = length of the
    sequence, k_i = number of occurrences of digit i. The
    probability of the next digit being i is k_i/n.
    Asymptotically, k_i/n -> 1/10 (seems to be) true.

    ! But we have the strong feeling that (i.e. with high
    probability) the next digit will be 5, because the
    previous digits were the expansion of π.

    ! Conclusion: We prefer answer 5, since we see more
    structure in the sequence than just random digits.


    Sequence 2

    ! Example 3:

    ! Number sequences. Sequence: x1, x2, x3, x4, x5, ... = 1, 2, 3, 4, ?, ...

    ! x5 = 5, since x_i = i for i = 1..4.

    ! x5 = 29, since x_i = i^4 - 10i^3 + 35i^2 - 49i + 24.
    Conclusion: We prefer 5, since the linear relation involves
    fewer arbitrary parameters than the 4th-order polynomial.
    (A quick check of both rules follows below.)

    ! Sequence:
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, ?

    ! 61, since this is the next prime.

    ! 60, since this is the order of the next simple group.

    ! Conclusion: We prefer answer 61, since primes are a
    more familiar concept than simple groups. (On-Line
    Encyclopedia of Integer Sequences)
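    A quick check (not from the slides) that both candidate rules fit the observed
    prefix 1, 2, 3, 4 while disagreeing on x5:

```python
def linear(i):
    return i

def quartic(i):
    # 4th-order polynomial that also passes through (1,1), (2,2), (3,3), (4,4)
    return i**4 - 10*i**3 + 35*i**2 - 49*i + 24

print([linear(i) for i in range(1, 6)])   # [1, 2, 3, 4, 5]
print([quartic(i) for i in range(1, 6)])  # [1, 2, 3, 4, 29]
```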


    Occam?

    ! Occam's Razor to the rescue

    ! Is there a unique principle which allows us to
    formally arrive at a prediction which coincides
    (always?) with our intuitive guess, or, even better,
    which is (in some sense) most likely the best or
    correct answer?

    ! Yes! Occam's razor: use the simplest explanation
    consistent with past data (and use it for prediction).
    It works, for the examples presented and for many more.
    Actually, Occam's razor can serve as a foundation of
    machine learning in general, and is even a fundamental
    principle (or maybe even the mere definition) of science.

    ! Problem: It is not a formal/mathematical objective
    principle. What is simple for one may be complicated
    for another.


    Blue Emeralds?

    ! Grue Emerald Paradox

    ! Hypothesis 1: All emeralds are green.

    ! Hypothesis 2: All emeralds found until the year 2010 are
    green; thereafter all emeralds are blue.

    ! Which hypothesis is more plausible? H1! Justification?

    ! Occam's razor: take the simplest hypothesis consistent
    with the data. It is the most important principle in
    machine learning and science.


    Views on probabilities

    ! Uncertainty and Probability

    ! The aim of probability theory is to describe
    uncertainty. Sources/interpretations of uncertainty:

    ! Frequentist: probabilities are relative frequencies
    (e.g. the relative frequency of tossing heads).

    ! Objectivist: probabilities are real aspects of the
    world (e.g. the probability that some atom decays in
    the next hour).

    ! Subjectivist: probabilities describe an agent's degree
    of belief (e.g. it is (im)plausible that extraterrestrials
    exist).


    What we need

    ! Kolmogorov complexity

    ! Universal Distribution

    ! Inductive Learning


    Principle of
    Indifference (Epicurus)

    ! Keep all hypotheses that are consistent with the
    facts


    Occam's Razor

    ! Among all hypotheses consistent with the
    facts, choose the simplest

    ! Newton's rule #1 for doing natural
    philosophy:

    ! "We are to admit no more causes of natural
    things than such as are both true and
    sufficient to explain their appearances"


    Question

    ! What does simplest mean?

    ! How to define simplicity?

    ! Can a thing be simple under one definition

    and not under another?


    Bayes' Rule

    ! P(H|D) = P(D|H)*P(H)/P(D)

    ! P(H) is often considered the initial degree
    of belief in H

    ! In essence, Bayes' rule is a mapping from the prior
    probability P(H) to the posterior
    probability P(H|D), determined by D
    (a small worked example follows below)
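    A tiny worked example of the rule above, with two hypotheses and made-up
    numbers (the priors and likelihoods are purely illustrative):

```python
# Two hypotheses about a coin: fair vs. biased towards heads (illustrative numbers).
prior = {"fair": 0.5, "biased": 0.5}
likelihood_heads = {"fair": 0.5, "biased": 0.9}   # P(D = heads | H)

# Observe D = heads; apply P(H|D) = P(D|H) * P(H) / P(D).
p_d = sum(likelihood_heads[h] * prior[h] for h in prior)
posterior = {h: likelihood_heads[h] * prior[h] / p_d for h in prior}
print(posterior)   # {'fair': 0.357..., 'biased': 0.642...}
```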


    How to get P(H)?

    ! By the law of large numbers, we can
    get P(H|D) if we use many examples

    ! But we want as much information as possible
    from only a limited number of data points

    ! P(H) may be unknown, uncomputable, or
    may not even exist

    ! Can we find a single probability
    distribution to use as the prior
    distribution in every case, with
    approximately the same result as
    if we had used the real distribution?


    Hume on Induction

    ! Induction is impossible, because we can only
    reach conclusions by using known data and
    methods.

    ! So the conclusion is logically already
    contained in the start configuration.


    Only one algorithm?


    Solomonoff's Theory of
    Induction

    ! Maintain all hypotheses consistent with the
    data

    ! Incorporate Occam's Razor: assign the
    simplest hypotheses the highest probability

    ! Use Bayes' rule


    Kolmogorov
    Complexity

    ! K(s) is the length of the shortest program
    which, on no input, prints out s

    ! K(s) <= n + c for any n-bit string s

    ! K(s) is objective (programming-language
    independent) by the Invariance Theorem


    Universal Distribution

    ! P(s) = 2^(-K(s))

    ! We use K(s) to describe the complexity of an
    object. By Occam's Razor, the simplest
    should have the highest probability.
    (A rough computable proxy is sketched below.)
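    K(s) itself is uncomputable (as a later slide notes), but a general-purpose
    compressor gives a crude, computable upper-bound proxy. A heuristic sketch,
    not from the slides, using zlib:

```python
import os
import zlib

def approx_K_bits(s: bytes) -> int:
    """Crude upper-bound proxy for K(s): zlib-compressed length, in bits."""
    return 8 * len(zlib.compress(s, 9))

def approx_universal_prob(s: bytes) -> float:
    """P(s) ~ 2^(-K(s)), with K(s) replaced by the compressor-based proxy."""
    return 2.0 ** (-approx_K_bits(s))

regular = b"ab" * 500          # highly regular 1000-byte string
random_ish = os.urandom(1000)  # incompressible-looking 1000-byte string

print(approx_K_bits(regular), approx_K_bits(random_ish))
print(approx_universal_prob(regular))
# The regular string gets a far shorter description, hence a (vastly) higher P(s).
```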


    Problem: Sum_s P(s) > 1

    ! For every n, there exists an n-bit string s with K(s)
    = log n, so P(s) = 2^(-log n) = 1/n

    ! 1/2 + 1/3 + ... > 1


    Levin's improvement

    ! Use prefix-free programs

    ! A set of programs, no one of which is a
    prefix of any other

    ! Kraft's inequality:

    ! Let l1, l2, ... be a sequence of natural
    numbers. There is a prefix code with this
    sequence as the lengths of its binary code words
    iff Sum_n 2^(-l_n) <= 1
    (numeric check below)
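    A small numeric check of Kraft's inequality; the code lengths below are
    illustrative examples:

```python
# Code word lengths of a valid binary prefix code, e.g. {0, 10, 110, 111}.
lengths = [1, 2, 3, 3]
print(sum(2 ** -l for l in lengths))     # 1.0 <= 1, so such a prefix code exists

# Lengths that violate the inequality: no prefix code can have them.
print(sum(2 ** -l for l in [1, 1, 2]))   # 1.25 > 1
```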


    Multiplicative
    domination

    ! Levin proved that there exists a constant c such that
    c * m(s) >= p(s), where c depends on p but not on s
    (m is the universal distribution, p any computable distribution)

    ! If the true prior distribution is computable, then
    using the single fixed universal distribution m
    is almost as good as using the actually true
    distribution itself


    ! Turing's thesis: the universal Turing
    machine can compute all intuitively
    computable functions

    ! Kolmogorov's thesis: Kolmogorov
    complexity gives the shortest
    description length among all
    description lengths that can be
    effectively approximated, according to
    intuition

    ! Levin's thesis: the universal
    distribution is the largest
    distribution among all the distributions
    that can be effectively approximated,
    according to intuition


    Universal Bet

    ! Street gambler Bob tosses a coin and offers:

    ! If the next toss is heads (1), Bob gives Alice $2

    ! If the next toss is tails (0), Alice pays Bob $1

    ! Is Bob honest?

    ! Side bet: flip the coin 1000 times, record the
    result as a string s

    ! Alice pays $1, Bob pays Alice 2^(1000-K(s)) $


    ! Good offer: under a fair coin, Bob's expected payout is

    Sum_{|s|=1000} 2^(-1000) * 2^(1000-K(s)) = Sum_{|s|=1000} 2^(-K(s)) <= 1

    so in expectation Bob pays out no more than the $1 Alice paid.


    Notice

    ! The complexity of a string is non-computable


    Conclusion

    ! Kolmogorov complexity: optimal effective
    description of objects

    ! Universal Distribution: optimal effective
    probability of objects

    ! Both are objective and absolute


    The most neutral possible prior

    ! Suppose we want a prior so neutral that
    it never rules out a model

    ! Possible, if we limit ourselves to
    computable models

    ! Mixture of all (computable) priors,
    with weights that decline fairly fast
    (the standard form of the mixture is given below)

    ! Then, this mixture multiplicatively
    dominates all priors

    ! Though neutral priors
    will mean slow learning

    ! m(x) is the universal prior
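    The mixture formula itself did not survive extraction; a standard form of the
    construction referred to above (with the common illustrative weight choice
    w_i = 2^(-K(p_i))) is:

```latex
m(x) \;=\; \sum_i w_i \, p_i(x),
\qquad w_i > 0,\quad \sum_i w_i \le 1,
\qquad\text{hence}\qquad
m(x) \;\ge\; w_j \, p_j(x) \ \ \text{for every computable prior } p_j \text{ and all } x .
```

    The last inequality is the multiplicative domination mentioned on the slide:
    no computable model is ever ruled out, only discounted by its fixed weight.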


    The most neutral possible coding
    language

    ! Universal programming languages (Java, Matlab, UTMs, etc.)

    ! K(x) = length of the shortest program in Java, Matlab, a UTM, ...
    that generates x (K is uncomputable)

    ! Invariance theorem:

    ! For any languages L1, L2, there exists a constant c
    such that for all x, |K_L1(x) - K_L2(x)| <= c

    ! Mathematically justifies talk of K(x), not K_Java(x), K_Matlab(x), ...


    So does this mean that the choice of
    language doesn't matter?

    ! Not quite!

    ! c can be large

    ! And, for any L1 and c0, there exist L2 and x such that

    ! |K_L1(x) - K_L2(x)| >= c0

    ! The problem of the one-instruction code for the
    entire data set

    But Kolmogorov complexity can be made
    concrete


    Compact Universal Turing
    machines

    ! 210 bits (lambda calculus), 272 bits (combinators)

    Not much room to hide, here!


    Neutral priors and Kolmogorov
    complexity

    ! A key result:

    ! K(x) = -log2 m(x) + O(1)

    ! where m is a universal prior

    ! Analogous to Shannon's source
    coding theorem

    ! And for any computable q:

    ! K(x) <= -log2 q(x) + O(1)

    ! for typical x drawn from q(x)

    ! Any data x that is likely for any
    sensible probability distribution has low K(x)


    Prediction by simplicity

    ! Find the shortest program/explanation for the current
    corpus (binary string)

    ! Predict using that program

    ! Strictly, use a weighted sum of
    explanations, weighted by brevity


    Prediction is possible (Solomonoff, 1978)

    Summed error has a finite bound
    (a standard statement of the bound is given below)

    ! s_j is the summed squared error between the
    prediction and the true probability on item j

    ! So the prediction error converges [faster than
    1/(n log n)] with corpus size n

    ! Computability assumptions only (no
    stationarity needed)
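    The bound itself is not shown in the transcript; a standard statement of
    Solomonoff's convergence theorem, for a computable true measure mu and the
    universal predictor M, is (up to the exact constant):

```latex
\sum_{j=1}^{\infty} s_j \;=\; \sum_{j=1}^{\infty}
\mathbb{E}_{\mu}\!\left[\big(M(x_j \mid x_{<j}) - \mu(x_j \mid x_{<j})\big)^2\right]
\;\le\; \tfrac{\ln 2}{2}\, K(\mu) \;<\; \infty .
```

    Because the total summed error is finite, the per-item error must eventually
    shrink faster than 1/(n log n), which is the rate quoted on the slide.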


    Summary so far

    ! Simplicity/Occam: close and deep
    connections with Bayes

    ! Defines a universal prior (i.e., based on
    simplicity)

    ! Can be made concrete

    ! General prediction results

    ! A convenient dual framework to Bayes,
    when codes are easier than probabilities


    Methods


    Infrastructure
