Page 1: Method Seminar

Method Seminar

Tutorial: Using Stanford Topic Modeling Toolbox

Lili Lin

Page 2: Method Seminar

Contents

Introduction

Getting Started
◦ Prerequisites
◦ Installation

Toolbox Running
◦ Latent Dirichlet Allocation Model (LDA Model)
◦ Labeled LDA Model

Page 4: Method Seminar

Introduction

http://nlp.stanford.edu/software/tmt/tmt-0.4/

The Stanford Topic Modeling Toolbox (TMT) was written at the Stanford NLP Group by Daniel Ramage and Evan Rosen and was first released in September 2009.

It trains topic models (e.g. LDA, Labeled LDA) and runs inference with them to create summaries of text.

Page 5: Method Seminar

Introduction - LDA Model

LDA is an unsupervised topic model.

The user needs to define some important parameters, such as the number of topics.

It is hard to choose the number of topics.

Even with some top terms for each topic, it is still difficult to interpret the content of the extracted topics.

Page 6: Method Seminar

Introduction – Labeled LDA Model

Labeled LDA is a supervised topic model for credit attribution in multi-labeled corpora.

If one of the columns in your input text file contains labels or tags that apply to the document, you can use Labeled LDA to discover which parts of each document go with each label, and to learn accurate models of the words best associated with each label globally.

Page 7: Method Seminar

Contents

Introduction

Getting Started
◦ Prerequisites
◦ Installation
◦ Simple Testing

Toolbox Running
◦ LDA Model
◦ Labeled LDA Model

Page 8: Method Seminar

Prerequisites

A text editor (e.g. TextWrangler) for creating TMT processing scripts.

TMT scripts are written in Scala, but no knowledge of Scala is required to get started.

An installation of Java 6SE or greater: http://java.com/en/download/index.jsp.

Windows, Mac, and Linux are supported.

Page 9: Method Seminar

Installation

Download the TMT executable (tmt-0.4.0.jar) from http://nlp.stanford.edu/software/tmt/tmt-0.4/

Double-click the jar file to open the toolbox, or run it from the command line: java -jar tmt-0.4.0.jar

You should see a simple GUI.

Page 10: Method Seminar

Simple Testing

Example data and scripts for simple testing
◦ Download the example data file: pubmed-oa-subset.csv
◦ Download the first testing script: example-0-test.scala

Note: the data file and the script should be put into the same folder.

Page 11: Method Seminar

Simple Testing - GUI

Load script: File → Open script

Page 12: Method Seminar

Simple Testing - GUI

Edit script: val pubmed = CSVFile("pubmed-oa-subset.csv")

Page 13: Method Seminar

Simple Testing - GUI

Run the script: click the ‘Run’ button

Page 14: Method Seminar

Simple Testing - Command Line

Page 15: Method Seminar

Contents

Introduction

Getting Started
◦ Prerequisites
◦ Installation

Toolbox Running
◦ Latent Dirichlet Allocation Model (LDA Model)
◦ Labeled LDA Model

Page 16: Method Seminar

LDA Model – Data Preparation

173,777 astronomy papers were collected from the Web of Science (WOS), covering the period from 1992 to 2012.

In the file ‘astro_wos_lda.csv’, every record includes the paper ID (first column), the title (second column) and the publication year (third column).

Page 17: Method Seminar

LDA Training – Script Loading

File → Open script → Navigate to example-2-lda-learn.scala → Open

Page 18: Method Seminar

LDA Training – Data Loading

Edit script:
◦ val source = CSVFile("astro_wos_lda.csv")
◦ Column(2) ~>

Note: if your text spans two or more columns, such as the third and fourth columns, you can use ‘Columns(3,4) ~> Join(" ") ~>’ in place of ‘Column(2) ~>’.

Page 19: Method Seminar

LDA Training – Parameter Selection

Edit script: val params = LDAModelParams(numTopics = 30, dataset = dataset, topicSmoothing = 0.01, termSmoothing = 0.01)
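For orientation, the data-loading and parameter edits above sit inside a script shaped roughly like the sketch below. It follows the structure of the toolbox's bundled example-2-lda-learn.scala. The IDColumn(1) stage, the tokenizer chain, the filter thresholds, the CVB0 trainer and the maxIterations value are that example's settings and are assumptions here, not values stated on the slides; check every name against your downloaded copy of the script.

```scala
// Sketch of a TMT 0.4 LDA training script, modeled on example-2-lda-learn.scala.
// Run it through the toolbox (GUI or `java -jar tmt-0.4.0.jar <script>`), not plain Scala.
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;

// Column 1 holds the paper ID (assumed to be used as the document ID)
val source = CSVFile("astro_wos_lda.csv") ~> IDColumn(1);

// Assumed tokenizer settings, copied from the bundled example
val tokenizer = {
  SimpleEnglishTokenizer() ~>            // split on whitespace and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // drop tokens that are not words or numbers
  MinimumLengthFilter(3)                 // keep terms with at least 3 characters
}

val text = {
  source ~>
  Column(2) ~>                           // title column; Columns(3,4) ~> Join(" ") for several columns
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>                       // collect term counts, needed by the filters below
  TermMinimumDocumentCountFilter(4) ~>   // assumed: drop terms seen in fewer than 4 documents
  TermDynamicStopListFilter(30) ~>       // assumed: drop the 30 most common terms
  DocumentMinimumLengthFilter(5)         // assumed: keep documents with at least 5 terms
}

val dataset = LDADataset(text);

// Parameters as given on the slide
val params = LDAModelParams(numTopics = 30, dataset = dataset,
  topicSmoothing = 0.01, termSmoothing = 0.01);

// The output folder name is built from the dataset and parameter signatures
val modelPath = file("lda-" + dataset.signature + "-" + params.signature);

// Train with CVB0; model snapshots are written under modelPath
TrainCVB0LDA(params, dataset, output = modelPath, maxIterations = 1000);
```

Because the text here is a title column only, the document-length filter may discard short titles; that threshold is worth revisiting for this dataset.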

Page 20: Method Seminar

LDA Training – Model Training

Run: the script fails with an Out of Memory error because the dataset is large.

Page 21: Method Seminar

LDA Training – Model Training

Change the memory size → Run

Page 22: Method Seminar

LDA Training – Output Generation

Output folder: lda-b2aa1797-30-751edefe

◦ description.txt : A description of the model saved in this folder

◦ document-topic-distributions.csv : A csv file containing the per-document topic distribution for each document in the training dataset

◦ 00000-01000 : Snapshots of the model during training
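One common next step is to look up the strongest topic for each document in document-topic-distributions.csv. The sketch below is standalone Scala (no TMT classes) and assumes each row is a document ID followed by one probability per topic; that layout, the object name DominantTopic and the sample rows are assumptions for illustration, so inspect your own file before relying on it.

```scala
// Standalone Scala sketch; the rows below are made-up sample data, not toolbox output.
object DominantTopic {
  // Returns (documentId, indexOfLargestProbability, thatProbability) for one CSV row.
  // Assumes the row has an ID field followed by at least one numeric field.
  def dominant(row: String): (String, Int, Double) = {
    val fields = row.split(",").map(_.trim)
    val probs = fields.tail.map(_.toDouble)
    val (p, idx) = probs.zipWithIndex.maxBy(_._1)
    (fields.head, idx, p)
  }

  def main(args: Array[String]): Unit = {
    val sampleRows = Seq(
      "doc-1,0.10,0.70,0.20",
      "doc-2,0.55,0.05,0.40"
    )
    for (row <- sampleRows) {
      val (id, topic, p) = dominant(row)
      println(id + " -> topic " + topic + " (p = " + p + ")")
    }
  }
}
```

Topic indices here are zero-based positions within the row; mapping them to the topics listed in summary.txt is left to the reader to verify.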

Page 23: Method Seminar

LDA Training – Output Generation

Files in each snapshot folder:

◦ /params.txt : Model parameters used during training

◦ /tokenizer.txt : Tokenizer used to tokenize text for use with this model

◦ /summary.txt : Human-readable summary of the topic model, with the top 20 terms per topic and how many word instances of each have occurred

◦ /log-probability-estimate.txt : Estimate of the log probability of the dataset at this iteration

◦ /term-index.txt : Mapping from terms in the corpus to ID numbers

◦ /description.txt : A description of the model saved in this iteration

◦ /topic-term-distributions.csv.gz : For each topic, the probability of each term in that topic

Page 24: Method Seminar

LDA Training – Command Line

java -Xmx4G -jar tmt-0.4.0.jar example-2-lda-learn.scala

Page 25: Method Seminar

LDA Inference – Script Loading

File → Open script → Navigate to example-3-lda-infer.scala → Open

Page 26: Method Seminar

LDA Inference – Trained Model Loading

Edit script: val modelPath = file("lda-b2aa1797-30-751edefe")

Page 27: Method Seminar

LDA Inference – Data Loading

Edit script:
◦ val source = CSVFile("astro_wos_lda.csv")
◦ Column(2) ~>

Note: here we reuse the training dataset as the inference data; in practice the inference data should be a new dataset.
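The model-loading and data-loading edits fit into an inference script along the lines of the sketch below, which follows the structure of the bundled example-3-lda-infer.scala. Loading with LoadCVB0LDA assumes the model was trained with the CVB0 trainer, and IDColumn(1) and the literal output base name are assumptions for this dataset; verify the calls against your copy of the example script.

```scala
// Sketch of a TMT 0.4 LDA inference script, modeled on example-3-lda-infer.scala.
// Run it through the toolbox, not plain Scala.
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;

// Folder produced by the training run
val modelPath = file("lda-b2aa1797-30-751edefe");
val model = LoadCVB0LDA(modelPath);      // assumes a CVB0-trained model

// Inference data (the slides reuse the training file)
val source = CSVFile("astro_wos_lda.csv") ~> IDColumn(1);

val text = {
  source ~>
  Column(2) ~>                           // same text column as in training
  TokenizeWith(model.tokenizer.get)      // reuse the tokenizer stored with the model
}

// Index terms with the trained model's vocabulary
val dataset = LDADataset(text, termIndex = model.termIndex);

// Base name for output files, placed inside the model folder
val output = file(modelPath, "astro_wos_lda");

// Per-document topic distributions for the inference data
// (file name spelled as it appears on the output slide)
val perDocTopicDistributions = InferCVB0DocumentTopicDistributions(model, dataset);
CSVFile(output + "-document-topic-distributuions.csv").write(perDocTopicDistributions);
```

The bundled script also writes the usage and top-terms files; those steps are omitted here to keep the sketch short.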

Page 28: Method Seminar

LDA Inference – Model Inference

Change the memory size → Run

Page 29: Method Seminar

LDA Inference – Output Generation

Navigate to the folder ‘lda-b2aa1797-30-751edefe’

◦ astro_wos_lda-document-topic-distributuions.csv : A csv file containing the per-document topic distribution for each document in the inference dataset

◦ astro_wos_lda-top-terms.csv: A csv file containing the top terms in the inference dataset for each topic

◦ astro_wos_lda-usage.csv

Page 30: Method Seminar

LDA Inference – Command Line

java -Xmx4G -jar tmt-0.4.0.jar example-3-lda-infer.scala

Page 31: Method Seminar

LLDA Model – Data Preparation

4,770 metformin papers were collected from PubMed, covering the period from 1997 to 2011.

Training data: metformin_train_data_llda.csv (2,798 papers). Every record includes the paper ID (first column), the bio-term list (second column), the title (third column) and the abstract (fourth column). Every record has at least 3 bio-terms.

Inference data: metformin_infer_data_llda.csv (4,770 papers). Every record includes the paper ID (first column), the title (second column) and the abstract (third column).

Page 32: Method Seminar

LLDA Training – Script Loading

File → Open script → Navigate to example-6-llda-learn.scala → Open

Page 33: Method Seminar

LLDA Training – Data Loading

Edit script:
◦ val source = CSVFile("metformin_train_data_llda.csv")
◦ Columns(3,4) ~> Join(" ") ~> (text: title and abstract)
◦ Column(2) ~> (labels: bio-term list)
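To show where the two column selections go, here is a sketch of a Labeled LDA training script modeled on the bundled example-6-llda-learn.scala. The text pipeline reads the title and abstract columns and the label pipeline reads the bio-term column. The whitespace label tokenizer (which assumes bio-terms are separated by whitespace), IDColumn(1), all filter thresholds and maxIterations come from that example and are assumptions here, not settings confirmed by the slides.

```scala
// Sketch of a TMT 0.4 Labeled LDA training script, modeled on example-6-llda-learn.scala.
// Run it through the toolbox, not plain Scala.
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;

val source = CSVFile("metformin_train_data_llda.csv") ~> IDColumn(1);

// Assumed tokenizer settings, copied from the bundled example
val tokenizer = {
  SimpleEnglishTokenizer() ~>
  CaseFolder() ~>
  WordsAndNumbersOnlyFilter() ~>
  MinimumLengthFilter(3)
}

// Document text: title (column 3) and abstract (column 4) joined with a space
val text = {
  source ~>
  Columns(3,4) ~> Join(" ") ~>
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>
  TermMinimumDocumentCountFilter(4) ~>   // assumed threshold
  TermDynamicStopListFilter(30) ~>       // assumed threshold
  DocumentMinimumLengthFilter(5)         // assumed threshold
}

// Labels: the bio-term list in column 2
val labels = {
  source ~>
  Column(2) ~>
  TokenizeWith(WhitespaceTokenizer()) ~> // assumes whitespace-separated labels
  TermCounter() ~>
  TermMinimumDocumentCountFilter(10)     // assumed: drop labels on fewer than 10 documents
}

val dataset = LabeledLDADataset(text, labels);
val modelParams = LabeledLDAModelParams(dataset);

// Output folder name is built from the dataset and parameter signatures
val modelPath = file("llda-cvb0-" + dataset.signature + "-" + modelParams.signature);

TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
```

If the bio-terms contain spaces or use another delimiter, the label tokenizer needs to change accordingly.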

Page 34: Method Seminar

LLDA Training – Model Training

Run

Page 35: Method Seminar

LLDA Training – Output Generation

Output folder: llda-cvb0-bd54e9b6-176-1213c7f4-222a08a4

◦ description.txt : A description of the model saved in this folder

◦ document-topic-distributions.csv : A csv file containing the per-document topic distribution for each document in the training dataset

◦ 00000-01000 : Snapshots of the model during training

Page 36: Method Seminar

LLDA Training – Output Generation

Files in each snapshot folder:

◦ /params.txt : Model parameters used during training

◦ /tokenizer.txt : Tokenizer used to tokenize text for use with this model

◦ /summary.txt : Human-readable summary of the topic model, with the top 20 terms per topic and how many word instances of each have occurred

◦ /term-index.txt : Mapping from terms in the corpus to ID numbers

◦ /description.txt : A description of the model saved in this iteration

◦ /label-index.txt : Topics extracted after LLDA training

◦ /topic-term-distributions.csv.gz : For each topic, the probability of each term in that topic

Page 37: Method Seminar

LLDA Training – Command Line

java -Xmx4G -jar tmt-0.4.0.jar example-6-llda-learn.scala

Page 38: Method Seminar

LLDA Inference – Jar Script

The TMT toolbox doesn’t provide a script for LLDA inference.

A Java program, packaged as ‘llda-infer.jar’, was written to conduct LLDA inference.

Page 39: Method Seminar

LLDA Inference – Command Line

java -jar llda-infer.jar metformin_infer_data_llda.csv llda-cvb0-bd54e9b6-176-1213c7f4-222a08a4 metformin_infer_result.csv

Page 40: Method Seminar

LLDA Inference – Output Generation

A file named metformin_infer_result.csv will be generated after LLDA inference.

Page 41: Method Seminar

Thanks. Any questions?