
    CMSC 25025 / Stat 37601

    Machine Learning and Large Scale Data Analysis

    Tuesday, April 21


    For Today

    •   Mixtures (redux)

    •  Bayesian inference (redux)

    •  Topic models


    Mixtures

    •  Key technique: Mixture models

    •  Mixtures have latent variables

    •  Flexible tool

    •  Simple and difficult at the same time


    Gaussian Mixture

    [Figure: plot of the mixture density p(x) for x from −4 to 6.]

    p(x) = (2/5) φ(x; −1.25, 1) + (3/5) φ(x; 2.95, 1)

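    A minimal Python sketch of this density (numpy and scipy are assumptions of this note, not part of the slides):

        import numpy as np
        from scipy.stats import norm

        # p(x) = (2/5) phi(x; -1.25, 1) + (3/5) phi(x; 2.95, 1)
        def p(x):
            return 0.4 * norm.pdf(x, -1.25, 1.0) + 0.6 * norm.pdf(x, 2.95, 1.0)

        x = np.linspace(-4, 6, 500)
        print(x[np.argmax(p(x))], p(x).max())  # the taller mode sits near x = 2.95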

    Bumps and More Bumps  (MacKay and Williams)

    A mixture of k Gaussians can have more than k modes.

    [Figure: contour plot of a two-dimensional Gaussian mixture (axes from −4 to 4) and a zoomed-in view (axes from −0.5 to 1) showing extra modes.]


    Mixtures

    •  Mixture of f and g:

       p(x) = η f(x) + (1 − η) g(x)

       This is the simplest, most common kind of latent variable model.

    •  Hidden variable representation: define Z ∼ Bernoulli(η) and

       p(x) = Σ_{z∈{0,1}} p(x | z) p(z)

       with p(x | z = 1) = f(x), p(x | z = 0) = g(x), and p(z) = η^z (1 − η)^(1−z), so that P(Z = 1) = η.

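    A small sketch of the hidden variable representation above: draw Z first, then x given z, and check that the empirical mean matches the marginal mixture (numpy assumed; η = 2/5 as in the earlier Gaussian mixture):

        import numpy as np

        rng = np.random.default_rng(0)
        eta, n = 0.4, 100_000

        # First draw the latent Z ~ Bernoulli(eta), then x | z.
        z = rng.binomial(1, eta, size=n)
        x = np.where(z == 1,
                     rng.normal(-1.25, 1.0, size=n),   # f, chosen when z = 1
                     rng.normal(2.95, 1.0, size=n))    # g, chosen when z = 0

        # Under the mixture, E[x] = eta * (-1.25) + (1 - eta) * 2.95.
        print(x.mean(), eta * (-1.25) + (1 - eta) * 2.95)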

    Gaussian Mixture: All the Key Concepts

    [Figure: the two-component Gaussian mixture density p(x) again, for x from −4 to 6.]


    Bayesian Inference

    The parameter θ of a model is viewed as a random variable.

    Inference is usually carried out as follows:

    •  Choose a generative model  p (x  | θ) for the data.

    •  Choose a prior distribution π(θ) that expresses beliefs about the parameter before seeing any data.

    •  After observing data Dn = {x1, . . . , xn}, update beliefs and calculate the posterior distribution p(θ | Dn).


    Bayes’ Theorem

    The posterior distribution can be written as

    p(θ | x1, . . . , xn) = p(x1, . . . , xn | θ) π(θ) / p(x1, . . . , xn) = Ln(θ) π(θ) / cn ∝ Ln(θ) π(θ)

    where Ln(θ) = ∏_{i=1}^n p(xi | θ) is the likelihood function and

    cn = p(x1, . . . , xn) = ∫ p(x1, . . . , xn | θ) π(θ) dθ = ∫ Ln(θ) π(θ) dθ

    is the normalizing constant, which is also called the evidence.

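    The evidence cn rarely has a closed form, but for a one-dimensional θ it can be approximated numerically. A minimal sketch (the Beta(2, 2) prior and the toy data here are illustrative assumptions, not from the slides):

        import numpy as np

        theta = np.linspace(1e-4, 1 - 1e-4, 1000)
        data = np.array([1, 0, 1, 1, 0, 1, 0, 1])      # toy Bernoulli observations

        prior = theta * (1 - theta)                    # Beta(2, 2), up to a constant
        lik = theta**data.sum() * (1 - theta)**(len(data) - data.sum())

        unnorm = lik * prior
        cn = unnorm.sum() * (theta[1] - theta[0])      # evidence by Riemann sum
        posterior = unnorm / cn                        # now integrates to one
        print(theta[np.argmax(posterior)])             # posterior mode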

    Example

    X ∼ Bernoulli(θ) with data Dn = {x1, . . . , xn}. Prior Beta(α, β) distribution:

    πα,β(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1)

    Let s = Σ_{i=1}^n xi be the number of “successes.”

    The posterior distribution θ | Dn is Beta(α + s, β + n − s). The posterior mean is a mixture:

    θ̄ = (α + s) / (α + β + n) = [n / (α + β + n)] θ̂ + [(α + β) / (α + β + n)] θ0

    where θ̂ = s/n is the MLE and θ0 = α/(α + β) is the prior mean.


    Example

    n  =  15 points sampled as X   ∼ Bernoulli(θ = 0.4), with s  = 7 heads.

    [Figure: two panels, “good prior” and “bad prior,” each showing the prior distribution (black, dashed), the likelihood function (blue, dotted), and the posterior distribution (red, solid) over θ ∈ [0, 1].]

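    The conjugate update is one line of code. A sketch with scipy; the two Beta priors below are illustrative stand-ins for the “good” and “bad” priors in the figure:

        from scipy.stats import beta

        n, s = 15, 7                       # 7 heads in n = 15 Bernoulli(0.4) draws

        for a, b in [(2, 2), (20, 2)]:     # assumed "good" and "bad" priors
            post = beta(a + s, b + n - s)  # posterior Beta(alpha + s, beta + n - s)
            w = n / (a + b + n)            # weight on the MLE
            mix = w * (s / n) + (1 - w) * (a / (a + b))
            print(post.mean(), mix)        # the two numbers agree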

    Dirichlet

    The multinomial model with a Dirichlet prior is the generalization of the Bernoulli/Beta model:

    Dirichletα(θ) = [Γ(Σ_{j=1}^K αj) / ∏_{j=1}^K Γ(αj)] ∏_{j=1}^K θj^(αj−1)

    where α = (α1, . . . , αK) ∈ R^K_+ is a non-negative vector.

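    By the same conjugacy, observed category counts (n1, . . . , nK) turn a Dirichlet(α) prior into a Dirichlet(α + n) posterior. A minimal sketch (the counts are hypothetical, chosen to match the n = 20 example on the next slide):

        import numpy as np

        alpha = np.array([6.0, 6.0, 6.0])      # Dirichlet(6, 6, 6) prior
        counts = np.array([11, 5, 4])          # hypothetical multinomial counts, n = 20

        alpha_post = alpha + counts            # Dirichlet posterior parameters
        print(alpha_post / alpha_post.sum())   # posterior mean of theta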

    Example

    [Figure: four contour plots — the Dirichlet(6, 6, 6) prior, the likelihood function with n = 20, the posterior distribution with n = 20, and the posterior distribution with n = 200.]


    Summary

    •  Mixtures are latent variable models

    •  The mixing weight encodes a hidden variable

    •  Computing with mixtures uses basic probabilistic reasoning

    •  But can get complicated

    •  Topic models are flexible mixture models for complex data like documents and images (next)


    Ball and Elephants


    Captioning

    Generated captions for two images (www.cs.toronto.edu/~nitish/nips2014demo/):

    Image 1 (bird):
      there is a large bird on the water
      a small bird sitting on top of a lake
      a large white bird standing on the water on a beach
      a bird is on the water on a beach
      a bird that is standing in the water

    Image 2 (baseball):
      a professional baseball game is played in the middle of the field
      several players at the end of a baseball game
      a group of players playing a baseball game
      the baseball players are playing games at the field
      a baseball players are playing with a game and fans


    Intro to Topic Modeling

    Some of the following slides are from Dave Blei’s 2011 tutorial on Topic Modeling:

    http://www.cs.princeton.edu/~blei/topicmodeling.html

    A survey paper describing many of these ideas in more detail is here:

    http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

    See also:

    http://awards.acm.org/award_winners/blei_3974465.cfm


    Discover topics from a corpus

    Four discovered topics (top words):

    human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences

    evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common

    disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis

    computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations


    Model the evolution of topics over time

    [Figure: two time-series scatter plots, 1880–2000, of topic-word frequencies: “Theoretical Physics” (RELATIVITY, LASER, FORCE) and “Neuroscience” (NERVE, OXYGEN, NEURON).]


    Model connections between topics

    [Figure: a graph of connected topics discovered from Science, with word clusters such as genetics (mutant, mutations, gene), molecular biology (protein, binding, kinase, phosphorylation), medicine (patients, disease, treatment, drugs), immunology (t cells, antigens, immune response), neuroscience (neurons, synaptic, cortical), ecology (species, forest, populations), earth science (earthquakes, volcanic, mantle, climate), physics (magnetic, superconductivity, quantum, laser), chemistry (reactions, molecules, enzymes), astronomy (stars, galaxies, universe), and science policy (research, funding, education).]


    Annotate images

    [Figure: six images with predicted annotations.]

    SKY WATER TREE MOUNTAIN PEOPLE
    SCOTLAND WATER FLOWER HILLS TREE
    SKY WATER BUILDING PEOPLE WATER
    FISH WATER OCEAN TREE CORAL
    PEOPLE MARKET PATTERN TEXTILE DISPLAY
    BIRDS NEST TREE BRANCH LEAVES


    Discover influential articles

    [Figure: weighted influence (0.000–0.030) of selected articles by year, 1880–2000.]

    Jared M. Diamond, Distributional Ecology of New Guinea Birds, Science (1973) [296 citations]

    W. B. Scott, The Isthmus of Panama in Its Relation to the Animal Life of North and South America, Science (1916) [3 citations]

    William K. Gregory, The New Anthropogeny: Twenty-Five Stages of Vertebrate Evolution, from Silurian Chordate to Man, Science (1933) [3 citations]

    Derek E. Wildman et al., Implications of Natural Selection in Shaping 99.4% Nonsynonymous DNA Identity between Humans and Chimpanzees: Enlarging Genus Homo, PNAS (2003) [178 citations]


    Predict links between articles

    Query document: Markov chain Monte Carlo convergence diagnostics: A comparative review

    Links predicted by the RTM (ψe):
      Minorization conditions and convergence rates for Markov chain Monte Carlo
      Rates of convergence of the Hastings and Metropolis algorithms
      Possible biases induced by MCMC convergence diagnostics
      Bounding convergence time of the Gibbs sampler in Bayesian image restoration
      Self regenerative Markov chain Monte Carlo
      Auxiliary variable methods for Markov chain Monte Carlo with applications
      Rate of Convergence of the Gibbs Sampler by Gaussian Approximation
      Diagnosing convergence of Markov chain Monte Carlo algorithms
      Exact Bound for the Convergence of Metropolis Chains

    Links predicted by LDA + Regression:
      Self regenerative Markov chain Monte Carlo
      Minorization conditions and convergence rates for Markov chain Monte Carlo
      Gibbs-markov models
      Auxiliary variable methods for Markov chain Monte Carlo with applications
      Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models
      Mediating instrumental variables
      A qualitative framework for probabilistic inference
      Adaptation for Self Regenerative MCMC


    Characterize political decisions

    dod,defense,defense and appropriation,military,subtitle
    veteran,veterans,bills,care,injury
    people,woman,american,nation,school
    producer,eligible,crop,farm,subparagraph
    coin,inspector,designee,automobile,lebanon
    bills,iran,official,company,sudan
    human,vietnam,united nations,call,people
    drug,pediatric,product,device,medical
    child,fire,attorney,internet,bills
    surveillance,director,court,electronic,flood
    energy,bills,price,commodity,market
    land,site,bills,interior,river
    child,center,poison,victim,abuse
    coast guard,vessel,space,administrator,requires
    science,director,technology,mathematics,bills
    computer,alien,bills,user,collection
    head,start,child,technology,award
    loss,crop,producer,agriculture,trade
    bills,tax,subparagraph,loss,taxable
    cover,bills,bridge,transaction,following
    transportation,rail,railroad,passenger,homeland security
    business,administrator,bills,business concern,loan
    defense,iraq,transfer,expense,chapter
    medicare,medicaid,child,chip,coverage
    student,loan,institution,lender,school
    energy,fuel,standard,administrator,lamp
    housing,mortgage,loan,family,recipient
    bank,transfer,requires,holding company,industrial
    county,eligible,ballot,election,jurisdiction
    tax credit,budget authority,energy,outlays,tax


    Organize and browse large corpora


    This tutorial

    •  What are topic models?

    •  What kinds of things can they do?

    •  How do I compute with a topic model?

    •  What are some unanswered questions in this field?

    •  How can I learn more?


    Uber Topics

    Hi Prof. Lafferty,

    I took your ML+LSDA course last Spring. The course was super helpful, and I just wanted to let you know that I’m currently using Latent Dirichlet Allocation at my current job at Uber!

    We’re using LDA to discover topics in rider feedback – when riders write comments about their driver after the trip. We’re trying to find topics such as ’unprofessional driver’, ’driver no-show’, ’sexual harassment’, etc. LDA has worked really well with this – so thank you for covering it in much detail in your course.


    Bag Demo


    Introduction to Topic Modeling


    Probabilistic modeling

    1   Data are assumed to be observed from a generative probabilistic process that includes hidden variables.

        •  In text, the hidden variables are the thematic structure.

    2   Infer the hidden structure using posterior inference.

        •  What are the topics that describe this collection?

    3   Situate new data into the estimated model.

        •  How does a new document fit into the topic structure?


    Latent Dirichlet allocation (LDA)

    Simple intuition: Documents exhibit multiple topics.


    Generative model for LDA

    [Figure: four example topics with word probabilities, a document, and its topic proportions and assignments.]

    Topic 1: gene 0.04, dna 0.02, genetic 0.01, ...
    Topic 2: life 0.02, evolve 0.01, organism 0.01, ...
    Topic 3: brain 0.04, neuron 0.02, nerve 0.01, ...
    Topic 4: data 0.02, number 0.02, computer 0.01, ...

    •   Each topic is a distribution over words

    •   Each document is a mixture of corpus-wide topics

    •   Each word is drawn from one of those topics (see the sketch below)

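    A minimal numpy sketch of this generative process (the vocabulary size, topic count, and document length below are toy values, not from the slides):

        import numpy as np

        rng = np.random.default_rng(0)
        K, V, D, N = 4, 1000, 5, 50      # topics, vocabulary, documents, words/doc
        eta, alpha = 0.01, 0.1           # Dirichlet hyperparameters

        # Each topic beta_k is a distribution over the V words.
        beta = rng.dirichlet(np.full(V, eta), size=K)

        docs = []
        for d in range(D):
            theta = rng.dirichlet(np.full(K, alpha))    # per-document topic proportions
            z = rng.choice(K, size=N, p=theta)          # per-word topic assignments
            w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
            docs.append(w)
        # docs now holds D synthetic documents of N word ids each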

    The posterior distribution

    [Figure: topics, documents, and topic proportions and assignments.]

    •  In reality, we only observe the documents

    •  The rest of the structure consists of hidden variables


    The posterior distribution

    [Figure: the same topics, documents, and topic proportions and assignments.]

    •  Our goal is to  infer the hidden variables

    •   I.e., compute their distribution conditioned on the documents

    p(topics, proportions, assignments | documents)


    LDA as a graphical model

    [Graphical model: α → θd → zd,n → wd,n ← βk ← η. θd: per-document topic proportions; zd,n: per-word topic assignment; wd,n: observed word; βk: topics; α: proportions parameter; η: topic parameter.]

    •  Encodes our assumptions about the data

    •  Connects to algorithms for computing with data

    •   See Pattern Recognition and Machine Learning  (Bishop, 2006).


    LDA as a graphical model

    [Graphical model: same figure as above.]

    •  Nodes are random variables; edges indicate dependence.

    •  Shaded nodes are observed.

    •  Plates indicate replicated variables.


    LDA as a graphical model

    [Graphical model: same figure as above.]

    The joint distribution factorizes as

    ∏_{i=1}^K p(βi | η) ∏_{d=1}^D ( p(θd | α) ∏_{n=1}^N p(zd,n | θd) p(wd,n | β1:K, zd,n) )

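    To make the factorization concrete, here is a self-contained sketch (scipy assumed) that generates one document (D = 1 for brevity) and evaluates the log of this joint:

        import numpy as np
        from scipy.stats import dirichlet

        rng = np.random.default_rng(0)
        K, V, N = 2, 5, 10
        eta, alpha = np.full(V, 0.5), np.full(K, 0.5)

        beta = rng.dirichlet(eta, size=K)                    # topics
        theta = rng.dirichlet(alpha)                         # topic proportions
        z = rng.choice(K, size=N, p=theta)                   # topic assignments
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words

        logp = (sum(dirichlet.logpdf(b, eta) for b in beta)  # prod_k p(beta_k | eta)
                + dirichlet.logpdf(theta, alpha)             # p(theta | alpha)
                + np.sum(np.log(theta[z]))                   # prod_n p(z_n | theta)
                + np.sum(np.log(beta[z, w])))                # prod_n p(w_n | beta, z_n)
        print(logp)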

    LDA

    [Graphical model: the LDA plate diagram, with θd, Zd,n, Wd,n inside plates N and D, βk inside plate K, and hyperparameters α and η.]

    •  This joint defines a posterior.

    •  From a collection of documents, infer

       •  Per-word topic assignments zd,n
       •  Per-document topic proportions θd
       •  Per-corpus topic distributions βk

    •  Then use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, exploration, ...


    LDA

    [Graphical model: the same plate diagram as above.]

    Approximate posterior inference algorithms

    •  Mean field variational methods  (Blei et al., 2001, 2003)

    •   Expectation propagation  (Minka and Lafferty, 2002)

    •  Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)

    •  Collapsed variational inference (Teh et al., 2006)

    •  Online variational inference  (Hoffman et al., 2010)

    Also see Mukherjee and Blei (2009) and Asuncion et al. (2009).

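    As a concrete starting point, here is a short sketch with scikit-learn’s LatentDirichletAllocation, whose "online" learning method implements the online variational inference of Hoffman et al. (2010). The library choice and the toy corpus are assumptions of this note, not something the slides prescribe:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = ["gene dna genetic sequencing",       # toy corpus; a real corpus
                "brain neuron nerve cortex",         # would be far larger
                "gene dna brain neuron"]

        X = CountVectorizer().fit_transform(docs)    # document-term count matrix
        lda = LatentDirichletAllocation(n_components=2,
                                        learning_method="online",  # Hoffman et al. (2010)
                                        random_state=0).fit(X)
        print(lda.transform(X))                      # per-document topic proportions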

    Example inference

    [Graphical model: the same plate diagram as above.]

    •   Data: The OCR’ed collection of  Science  from 1990–2000

    •  17K documents

    •  11M words

    •  20K unique terms (stop words and rare words removed)

    •   Model: 100-topic LDA model using variational inference.


    Example inference

    [Figure: bar plot of the inferred topic proportions (“Probability,” 0.0–0.4) for one document across topics 1–100; only a handful of topics have probability noticeably above zero.]

    Example inference (II)

    Four discovered topics (top words):

    problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic

    model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models

    selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive

    species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem


    Used to explore and browse document collections


    Aside: The Dirichlet distribution

    •  The Dirichlet distribution is an exponential family distribution over the simplex, i.e., positive vectors that sum to one:

       p(θ | α) = [Γ(Σi αi) / ∏i Γ(αi)] ∏i θi^(αi−1)

    •  It is conjugate to the multinomial. Given a multinomial observation, the posterior distribution of θ is a Dirichlet.

    •  The parameter α controls the mean shape and sparsity of θ.

    •  The topic proportions are a K-dimensional Dirichlet. The topics are a V-dimensional Dirichlet.

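    The next few figures show draws from a 10-dimensional symmetric Dirichlet at different values of α. A small numpy sketch that reproduces the qualitative effect:

        import numpy as np

        rng = np.random.default_rng(0)
        for a in [100, 10, 1, 0.1, 0.01, 0.001]:
            theta = rng.dirichlet(np.full(10, a))
            # Large alpha: near-uniform draws; small alpha: nearly all
            # mass on a single coordinate (sparse draws).
            print(a, np.round(theta, 3))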

    α = 1

    [Figure: 15 draws from a 10-dimensional symmetric Dirichlet with α = 1; each panel plots the 10 coordinate values.]


    α = 10

    [Figure: 15 draws with α = 10.]


    α = 100

    [Figure: 15 draws with α = 100; draws are nearly uniform.]


    α = 1

    [Figure: 15 draws with α = 1, shown again for comparison.]


    α = 0.1

    [Figure: 15 draws with α = 0.1.]


    α = 0.01

    [Figure: 15 draws with α = 0.01.]


    α = 0.001

    [Figure: 15 draws with α = 0.001; nearly all the mass falls on a single coordinate.]


    Why does LDA “work”?

    Why does the LDA posterior put “topical” words together?

    •  Word probabilities are maximized by dividing the words among the topics. (More terms means more mass to be spread around.)

    •  In a mixture, this is enough to find clusters of co-occurring words.

    •  In LDA, the Dirichlet on the topic proportions can encourage sparsity, i.e., a document is penalized for using many topics.

    •  Loosely, this can be thought of as softening the strict definition of “co-occurrence” in a mixture model.

    •  This flexibility leads to sets of terms that more tightly co-occur.


    Summary of LDA

    [Graphical model: the LDA plate diagram again.]

    •  LDA can

       •  visualize the hidden thematic structure in large corpora
       •  generalize new data to fit into that structure

    •  Builds on Deerwester et al. (1990) and Hofmann (1999)

    •  It is a mixed membership model (Erosheva, 2004)

    •  Relates to multinomial PCA (Jakulin and Buntine, 2002)

    •  Was independently invented for genetics (Pritchard et al., 2000)


    Implementations of LDA

    There are many available implementations of topic modeling:

    LDA-C∗       A C implementation of LDA
    HDP∗         A C implementation of the HDP (“infinite LDA”)
    Online LDA∗  A Python package for LDA on massive data
    LDA in R∗    Package in R for many topic models
    LingPipe     Java toolkit for NLP and computational linguistics
    Mallet       Java toolkit for statistical NLP
    TMVE∗        A Python package to build browsers from topic models

    ∗ available at www.cs.princeton.edu/~blei/