Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University

Post on 14-Apr-2017


Who am I?

Why re-invent the wheel?

Reasons for using Node.js

• JavaScript - language of the Web

• Platform-agnostic (all operating systems, browsers, CLIs and desktop applications)

• V8 engine is fast enough to handle text mining tasks (often faster than Python or R)

• Core streams can handle real-time data & large amounts of text

Drawback: Besides a few popular packages such as natural, there is no ecosystem of good text mining modules yet.

Use Case: deidentify

• Software for de-identification of protected health information in free-text medical record data

• Developed as part of a research project at CMU

The Challenge

Unstructured data might account for more than 80% of all data collected.

Text Mining Overview

Typical Text Mining Tasks

• Sentiment analysis

• Cluster analysis and topic modeling: find hidden patterns or groupings in data

Getting practical

Sentiment Analysis of "State of the Union" addresses by President Obama

const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const words = require( '@stdlib/datasets/afinn-111' );
const tm = require( 'text-miner' );

// Convert the word list of [ word, score ] pairs to a dictionary...
const len = words.length;
const dict = {};
for ( let i = 0; i < len; i++ ) {
    dict[ words[i][0] ] = words[i][1];
}

// Select the addresses given by President Obama...
const obamaSpeeches = getSpeeches({
    'president': [ 'Barack Obama' ]
});

// Build a corpus and normalize the texts...
let obamaCorpus = new tm.Corpus(
    obamaSpeeches.map( x => x.text )
)
    .trim()
    .toLower()
    .removeInterpunctuation();

// Calculate sentiments...
const docs = obamaCorpus.getTexts();
const sentiments = [];
for ( let i = 0; i < docs.length; i++ ) {
    // Split each document into tokens and sum up the scores of known words
    const tokens = docs[ i ].split( ' ' );
    let score = 0;
    for ( let j = 0; j < tokens.length; j++ ) {
        const val = dict[ tokens[ j ] ];
        if ( val ) { score += val; }
    }
    sentiments.push( score );
}

Pre-Processing

sentiments = [ 69, 47, 266, 75, 234, 234, 163, 157 ]
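The scoring step above can also be sketched without any external packages. The following is a minimal illustration, not the talk's implementation: the mini-lexicon and the helper names `tokenize` and `sentimentScore` are hypothetical, and the word scores are made up in the style of an AFINN list. Unlike the slide code, it checks for own properties so that tokens such as "constructor" do not pick up inherited object keys.

```javascript
// Hypothetical mini-lexicon in the AFINN style: word -> integer valence.
const lexicon = { good: 3, great: 3, bad: -3, crisis: -3, hope: 2 };

// Lowercase the text, replace anything that is not a letter or whitespace
// with a space, and split on runs of whitespace.
function tokenize( text ) {
    return text
        .toLowerCase()
        .replace( /[^a-z\s]/g, ' ' )
        .split( /\s+/ )
        .filter( Boolean );
}

// Sum the valence of every token that appears in the lexicon.
function sentimentScore( text, dict ) {
    let score = 0;
    for ( const token of tokenize( text ) ) {
        // Own-property check avoids inherited keys such as "constructor"
        if ( Object.prototype.hasOwnProperty.call( dict, token ) ) {
            score += dict[ token ];
        }
    }
    return score;
}

console.log( sentimentScore( 'Great hope, despite the crisis.', lexicon ) );
```

A lexicon sum like this ignores negation and word order, which is one reason the raw scores are best read as a rough comparison between documents rather than as absolute measurements.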

Topic Modeling

Goal: find documents which share the same themes (e.g. politics, business, sports)

Latent Dirichlet Allocation

• Probabilistic model for text documents by Blei et al.

• Documents are assumed to have a distribution over topics

• Very popular because it is easy to extend
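For reference, the generative story behind the model can be written compactly. The notation below is the conventional one and is an assumption here, since the slides do not define symbols: K topics, D documents, N_d words in document d, and Dirichlet hyperparameters alpha and eta (the prior on the topic-word distributions corresponds to the smoothed variant of the model).

```latex
\begin{align*}
\beta_k &\sim \operatorname{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \operatorname{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \operatorname{Categorical}(\beta_{z_{d,n}})
\end{align*}
```

Here theta_d is the per-document distribution over topics mentioned in the bullet above, z_{d,n} is the topic assigned to the n-th word of document d, and w_{d,n} is the observed word.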

const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const lda = require( '@stdlib/nlp/latent-dirichlet-allocation' );
const tm = require( 'text-miner' );

let speeches = getSpeeches({ range: [ 1930, 2010 ] })
    .map( ( e ) => e.text );

let corpus = new tm.Corpus( speeches );
corpus = corpus
    .toLower()
    .removeInterpunctuation()
    .removeWords( tm.STOPWORDS.EN );

let docs = corpus.getTexts();

let model = lda( docs, 3 );
model.fit( 1000, 100, 10 );

lda( <Documents Array>, <Number of Topics> )

model.fit( <Iterations>, <Burnin>, <Thinning> )
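To make the three arguments of fit more concrete, here is a small sketch of the usual MCMC convention: samples from the first burn-in iterations are discarded, and afterwards only every thinning-th sample is retained. The helper `keptIterations` is hypothetical and only illustrates this general convention; the exact bookkeeping inside the package may differ.

```javascript
// Return the (1-based) iteration numbers whose samples would be retained
// under the common "discard burn-in, then keep every k-th draw" scheme.
function keptIterations( iterations, burnin, thinning ) {
    const kept = [];
    for ( let i = burnin + 1; i <= iterations; i++ ) {
        // Count iterations since the end of burn-in; keep every thinning-th one
        if ( ( i - burnin ) % thinning === 0 ) {
            kept.push( i );
        }
    }
    return kept;
}

// Same settings as in the call above: 1000 iterations, burn-in 100, thinning 10
const kept = keptIterations( 1000, 100, 10 );
console.log( kept.length, kept[ 0 ], kept[ kept.length - 1 ] );
```

Under this convention the settings above leave 90 retained samples, so a longer burn-in or a larger thinning interval trades the number of samples for lower dependence between them.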

Results for SOTU addresses from 1930 to 2010

Topic  Words

1      world, peace, war, nations, free, people, great, nation, united, freedom, power, military, american, men, defense, time, forces, strength

2      america, people, american, years, americans, year, work, make, children, congress, tonight, time, tax, country, government, health, budget, care

3      government, year, federal, program, congress, economic, states, national, administration, million, policy, public, dollars, legislation, programs, billion, system, years, united, fiscal

Text Analysis using the Command Line

Rationale

• Data pipelines using UNIX shell commands

• The commands in a shell pipeline run in parallel

• Memory usage: the V8 engine has a default limit of 1.76 GB on 64-bit machines (changeable via --max_old_space_size=<size>)

• Use stream processing instead of batch processing to avoid high memory usage
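As a rough sketch of the stream-processing idea, the following counts word frequencies line by line using only Node.js core modules, so that only the current line and the running counts are held in memory. The function names `addLine` and `countWords` are hypothetical, and the in-memory `Readable.from` input is a stand-in for `process.stdin` or a file stream in a real pipeline.

```javascript
const { Readable } = require( 'stream' );
const readline = require( 'readline' );

// Update a running word-frequency table with the tokens of one line.
function addLine( counts, line ) {
    const tokens = line.toLowerCase().split( /\s+/ );
    for ( const token of tokens ) {
        if ( token ) {
            counts[ token ] = ( counts[ token ] || 0 ) + 1;
        }
    }
    return counts;
}

// Consume any readable stream line by line instead of loading it whole.
async function countWords( input ) {
    const counts = {};
    const rl = readline.createInterface({ input, crlfDelay: Infinity });
    for await ( const line of rl ) {
        addLine( counts, line );
    }
    return counts;
}

// Stand-in input; replace with process.stdin to use the script in a pipeline
const input = Readable.from( [ 'peace and war\n', 'war and peace\n' ] );
countWords( input ).then( ( counts ) => console.log( counts ) );
```

With `process.stdin` as the input, a script like this can sit at the end of a UNIX pipeline and start counting while upstream commands are still producing output.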

LIVE DEMONSTRATION