Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Reasons for using Node.js
• JavaScript - language of the Web
• Platform-agnostic (all operating systems, browsers, CLIs, and desktop applications)
• V8 engine is fast enough to handle text mining tasks (faster than Python or R)
• Core streams can handle real-time data & large amounts of text
Drawback: Besides a few popular packages such as natural, there is no ecosystem of good text mining modules yet.
Use Case: deidentify
• Software for de-identification of protected health information in free-text medical record data
• Developed as part of a research project at CMU
const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const afinn = require( '@stdlib/datasets/afinn-111' );
const tm = require( 'text-miner' );

// The dataset package exports a function returning [ word, score ] pairs...
const words = afinn();

// Convert to a dictionary mapping each word to its sentiment score...
const dict = {};
for ( let i = 0; i < words.length; i++ ) {
    dict[ words[ i ][ 0 ] ] = words[ i ][ 1 ];
}

const obamaSpeeches = getSpeeches({
    'president': [ 'Barack Obama' ]
});

// Pre-process: trim whitespace, lowercase, and strip punctuation...
const obamaCorpus = new tm.Corpus(
    obamaSpeeches.map( x => x.text )
)
    .trim()
    .toLower()
    .removeInterpunctuation();

// Calculate sentiments: sum the scores of all known words in each speech...
const docs = obamaCorpus.getTexts();
const sentiments = [];
for ( let i = 0; i < docs.length; i++ ) {
    const tokens = docs[ i ].split( ' ' );
    let score = 0;
    for ( let j = 0; j < tokens.length; j++ ) {
        const val = dict[ tokens[ j ] ];
        // Only add numeric scores; words missing from the dictionary are skipped.
        if ( typeof val === 'number' ) { score += val; }
    }
    sentiments.push( score );
}
Pre-Processing
sentiments = [ 69, 47, 266, 75, 234, 234, 163, 157 ]
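The lexicon-based scoring shown above can be tried without any dependencies. The following is a minimal sketch, assuming a tiny made-up lexicon in place of AFINN-111 and plain string operations as a rough stand-in for the text-miner cleaning steps; the names toyLexicon, normalize, and scoreText are hypothetical and not part of any library.

```javascript
// Hypothetical toy lexicon standing in for AFINN-111 (scores are made up).
const toyLexicon = { good: 3, great: 3, peace: 2, bad: -3, war: -2 };

// Rough stand-in for trim / toLower / removeInterpunctuation.
function normalize( text ) {
    return text
        .toLowerCase()
        .replace( /[^a-z0-9\s]/g, '' )
        .replace( /\s+/g, ' ' )
        .trim();
}

// Sum the lexicon scores of all tokens; unknown words contribute nothing.
function scoreText( text, lexicon ) {
    const tokens = normalize( text ).split( ' ' );
    let score = 0;
    for ( const token of tokens ) {
        if ( Object.prototype.hasOwnProperty.call( lexicon, token ) ) {
            score += lexicon[ token ];
        }
    }
    return score;
}

console.log( scoreText( 'Great news: peace is good.', toyLexicon ) ); // 8
```

Because each word is scored independently, negation ("not good") and context are ignored; the resulting numbers are best read as a coarse signal.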
Latent Dirichlet Allocation
• Probabilistic model for text documents by Blei et al.
• Documents are assumed to have a distribution over topics
• Very popular because of its extensibility
const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const lda = require( '@stdlib/nlp/latent-dirichlet-allocation' );
const tm = require( 'text-miner' );

// Load the text of all addresses given between 1930 and 2010...
const speeches = getSpeeches({ range: [ 1930, 2010 ] })
    .map( ( e ) => e.text );

// Pre-process: lowercase, strip punctuation, and remove English stopwords...
const corpus = new tm.Corpus( speeches )
    .toLower()
    .removeInterpunctuation()
    .removeWords( tm.STOPWORDS.EN );

// Fit a model with 3 topics (1000 iterations, burn-in of 100, thinning of 10)...
const docs = corpus.getTexts();
const model = lda( docs, 3 );
model.fit( 1000, 100, 10 );
lda( <Documents Array>, <Number of Topics> )
model.fit( <Iterations>, <Burnin>, <Thinning> )
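Burn-in and thinning are standard terms from sampling-based fitting: early iterations are discarded, and only every n-th iteration afterwards is kept. A minimal sketch of that bookkeeping follows, assuming zero-based iteration indices and that an iteration is kept when it is past the burn-in and divisible by the thinning interval. The helper retainedIterations is hypothetical, and the library's exact indexing may differ.

```javascript
// Hypothetical helper: list the iteration indices whose samples would be kept.
function retainedIterations( iterations, burnin, thinning ) {
    const kept = [];
    for ( let i = 0; i < iterations; i++ ) {
        // Skip the burn-in phase, then keep every `thinning`-th iteration.
        if ( i >= burnin && i % thinning === 0 ) {
            kept.push( i );
        }
    }
    return kept;
}

const kept = retainedIterations( 1000, 100, 10 );
console.log( kept.length, kept[ 0 ], kept[ kept.length - 1 ] ); // 90 100 990
```

Under these assumptions, the settings from the slide would average over 90 samples rather than all 1000 iterations.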
Results for SOTU addresses from 1930 to 2010
Topic   Words
1       world, peace, war, nations, free, people, great, nation, united, freedom, power, military, american, men, defense, time, forces, strength
2       america, people, american, years, americans, year, work, make, children, congress, tonight, time, tax, country, government, health, budget, care
3       government, year, federal, program, congress, economic, states, national, administration, million, policy, public, dollars, legislation, programs, billion, system, years, united, fiscal
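A table like the one above is typically read off a fitted model by sorting each topic's vocabulary by probability and keeping the top entries. A minimal sketch, assuming a hypothetical vocabulary vocab and a made-up topic-word probability matrix phi (one row per topic); neither comes from the model fitted on the SOTU addresses.

```javascript
// Hypothetical vocabulary and made-up topic-word probabilities (rows sum to 1).
const vocab = [ 'peace', 'war', 'tax', 'health', 'federal' ];
const phi = [
    [ 0.40, 0.35, 0.05, 0.05, 0.15 ],
    [ 0.05, 0.05, 0.40, 0.35, 0.15 ]
];

// Return the n most probable words for one topic.
function topWords( probs, vocabulary, n ) {
    return probs
        .map( ( p, idx ) => ({ word: vocabulary[ idx ], prob: p }) )
        .sort( ( a, b ) => b.prob - a.prob )
        .slice( 0, n )
        .map( ( entry ) => entry.word );
}

phi.forEach( ( row, k ) => {
    console.log( k + 1, topWords( row, vocab, 3 ).join( ', ' ) );
});
```

Note that the same word can rank highly in several topics, as "federal" does here and as "government" and "years" do in the table.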
Rationale
• Data pipelines using UNIX shell commands
• Processing of shell commands is done in parallel
• Memory usage
  • V8 engine has a default limit of 1.76 GB on a 64-bit machine (changeable via --max_old_space_size=<size>)
  • Use stream processing instead of batch processing to avoid high memory usage
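The difference between stream and batch processing can be illustrated with Node's core modules alone. The following is a minimal sketch, assuming a simple word-frequency task; the function countWords and the file name mentioned in the comment are hypothetical.

```javascript
const readline = require( 'readline' );
const { Readable } = require( 'stream' );

// Count word frequencies from any readable stream, holding only the current
// line (plus the running counts) in memory instead of the whole text.
async function countWords( input ) {
    const counts = new Map();
    const rl = readline.createInterface({ input, crlfDelay: Infinity });
    for await ( const line of rl ) {
        for ( const word of line.toLowerCase().split( /\W+/ ) ) {
            if ( word ) {
                counts.set( word, ( counts.get( word ) || 0 ) + 1 );
            }
        }
    }
    return counts;
}

// Demo with an in-memory stream whose chunks split a line in the middle.
// For a real corpus, pass fs.createReadStream( 'speeches.txt' ) instead
// (hypothetical file name).
const demo = Readable.from([
    Buffer.from( 'the quick\nbrown ' ),
    Buffer.from( 'fox\nThe end\n' )
]);
countWords( demo ).then( ( counts ) => {
    console.log( counts.get( 'the' ), counts.get( 'fox' ) ); // 2 1
});
```

Memory use here grows with the size of the vocabulary rather than with the size of the input, which is what keeps large files below the heap limit.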