text mining with node.js - philipp burckhardt, carnegie mellon university

Text Mining with Node.js

Philipp Burckhardt

Carnegie Mellon University

Who am I?

Why re-invent the wheel?

Reasons for using Node.js

• JavaScript - language of the Web

• Platform-agnostic (all operating systems, browser, CLIs and desktop applications)

• V8 engine is fast enough to handle text mining tasks (faster than Python or R)

• Core streams can handle real-time data & large amounts of text

Drawback: besides a few popular packages such as natural, there is no ecosystem of good text mining modules yet.

Use Case: deidentify

• Software for de-identification of protected health information in free-text medical record data

• Developed as part of research project at CMU
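As a rough illustration of the task, rule-based de-identification can be sketched with a few regular expressions. The function `redact` and both patterns below are hypothetical and chosen only for this example; they are not the API or the method of the deidentify package, and real protected health information needs far broader coverage.

```javascript
// Hypothetical patterns for illustration only; real PHI detection needs far more than this.
const PATTERNS = [
  { label: '[DATE]', regex: /\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g },
  { label: '[PHONE]', regex: /\b\d{3}-\d{3}-\d{4}\b/g }
];

// Replace every match of each pattern with its placeholder label.
function redact( text ) {
  return PATTERNS.reduce( ( out, p ) => out.replace( p.regex, p.label ), text );
}

console.log( redact( 'Seen on 03/14/2015, call 412-555-0100.' ) );
// prints "Seen on [DATE], call [PHONE]."
```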

The Challenge

Unstructured data might account for more than 80% of all data collected.

Text Mining Overview

Typical Text Mining Tasks

• Sentiment analysis

• Cluster analysis and topic modeling: find hidden patterns or groupings in data
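Grouping documents usually starts from some measure of how similar two texts are. One common choice, shown here as a minimal sketch, is the cosine similarity between word-count vectors. The helper names `termCounts` and `cosine` and the three sample sentences are assumptions made for this example.

```javascript
// Count how often each lower-cased word occurs in a text.
function termCounts( text ) {
  const counts = new Map();
  for ( const word of text.toLowerCase().match( /[a-z]+/g ) || [] ) {
    counts.set( word, ( counts.get( word ) || 0 ) + 1 );
  }
  return counts;
}

// Cosine similarity between two term-count maps (0 means no shared words).
function cosine( a, b ) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for ( const [ word, n ] of a ) {
    dot += n * ( b.get( word ) || 0 );
    normA += n * n;
  }
  for ( const n of b.values() ) {
    normB += n * n;
  }
  return ( normA && normB ) ? dot / Math.sqrt( normA * normB ) : 0;
}

const docs = [ 'Taxes and the federal budget', 'The budget and new taxes', 'Peace between nations' ];
const vecs = docs.map( termCounts );
console.log( cosine( vecs[ 0 ], vecs[ 1 ] ) > cosine( vecs[ 0 ], vecs[ 2 ] ) ); // prints true
```

A clustering algorithm would then place the first two sentences in the same group, since they share most of their words, and the third in a separate one.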

Getting practical

Sentiment Analysis of "State of the Union" addresses by President Obama
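One simple way to score sentiment is a lexicon lookup: every word carries a positive or negative weight, and the weights are summed per text. The sketch below assumes a toy six-word lexicon invented for this example; it is not the word list or the method used for the analysis in this talk.

```javascript
// Toy lexicon (an assumption for illustration); real word lists contain thousands of scored entries.
const LEXICON = new Map([
  [ 'good', 1 ], [ 'great', 2 ], [ 'hope', 1 ],
  [ 'bad', -1 ], [ 'crisis', -2 ], [ 'war', -2 ]
]);

// Sum the scores of all known words; unknown words contribute 0.
function sentiment( text ) {
  const words = text.toLowerCase().match( /[a-z']+/g ) || [];
  return words.reduce( ( sum, w ) => sum + ( LEXICON.get( w ) || 0 ), 0 );
}

console.log( sentiment( 'A great nation with hope' ) ); // prints 3
console.log( sentiment( 'War and crisis' ) );           // prints -4
```

Scores from such a lookup ignore negation and context, so they are best read as a coarse signal across whole speeches rather than per sentence.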

Topic Modeling

Goal: find documents which share the same themes (e.g. politics, business, sports)

Latent Dirichlet Allocation

• Probabilistic model for text documents by Blei et al.

• Documents are assumed to have a distribution over topics

• Very popular because of its expandability

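// Load the State of the Union dataset, the LDA implementation, and text-miner
const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const lda = require( '@stdlib/nlp/latent-dirichlet-allocation' );
const tm = require( 'text-miner' );

// Extract the text of all addresses given between 1930 and 2010
let speeches = getSpeeches({ range: [ 1930, 2010 ] })
    .map( ( e ) => e.text );

// Normalize: lower-case, strip punctuation, drop English stop words
let corpus = new tm.Corpus( speeches );
corpus = corpus
    .toLower()
    .removeInterpunctuation()
    .removeWords( tm.STOPWORDS.EN );

// Fit a model with three topics
let docs = corpus.getTexts();
let model = lda( docs, 3 );
model.fit( 1000, 100, 10 );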

lda( <Documents Array>, <Number of Topics> )

model.fit( <Iterations>, <Burnin>, <Thinning> )
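In sampling-based fitting, burn-in and thinning decide which iterations contribute to the final estimates: early iterations are discarded, and afterwards only every n-th sample is kept. The sketch below assumes the convention "skip the first `burnin` iterations, then keep every `thin`-th one"; the exact bookkeeping inside the library may differ, and `keptIterations` is a hypothetical helper written for this example.

```javascript
// Return the iteration indices whose samples are kept, under the ASSUMED convention
// "skip the first `burnin` iterations, then keep every `thin`-th one".
function keptIterations( iterations, burnin, thin ) {
  const kept = [];
  for ( let i = burnin; i < iterations; i += thin ) {
    kept.push( i );
  }
  return kept;
}

const kept = keptIterations( 1000, 100, 10 );
console.log( kept.length );        // prints 90
console.log( kept.slice( 0, 3 ) ); // prints [ 100, 110, 120 ]
```

Under that assumption, the call `model.fit( 1000, 100, 10 )` would average over 90 of the 1000 iterations.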

Results for SOTU addresses from 1930 to 2010

Topic Words

1 world, peace, war, nations, free, people, great, nation, united, freedom, power, military, american, men, defense, time, forces, strength

2 america, people, american, years, americans, year, work, make, children, congress, tonight, time, tax, country, government, health, budget, care

3 government, year, federal, program, congress, economic, states, national, administration, million, policy, public, dollars, legislation, programs, billion, system, years, united, fiscal

Text Analysis using the Command Line

Rationale

• Data pipelines using UNIX shell commands

• Commands in a shell pipeline run in parallel as separate processes

• Memory usage

  • V8 engine has a default limit of 1.76 GB on a 64-bit machine (changeable via --max_old_space_size=<size>)

  • Use stream processing instead of batch processing to avoid high memory usage
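The stream-processing idea can be sketched with Node's built-in `readline` module: the input is consumed line by line, so memory use depends on the number of distinct words rather than on the size of the input. The function name `countWords` and the in-memory demo input are assumptions made for this example.

```javascript
const readline = require( 'readline' );
const { Readable } = require( 'stream' );

// Count word frequencies line by line, so only one line plus the counts is held in memory.
async function countWords( input ) {
  const counts = new Map();
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  for await ( const line of lines ) {
    for ( const word of line.toLowerCase().match( /[a-z']+/g ) || [] ) {
      counts.set( word, ( counts.get( word ) || 0 ) + 1 );
    }
  }
  return counts;
}

// Demo with an in-memory stream; in a shell pipeline, pass process.stdin instead.
const demo = Readable.from( [ 'Peace and war\n', 'war and peace\n' ] );
countWords( demo ).then( ( counts ) => {
  for ( const [ word, n ] of counts ) {
    console.log( `${word}\t${n}` );
  }
});
```

With `process.stdin` as input, a script like this can sit between other UNIX commands, for example after `cat` and before `sort`, and emit tab-separated word counts.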

LIVE DEMONSTRATION