3. introduction to text mining

Introduction to Text Mining

Agenda

• Defining Text Mining

• Structured vs. Unstructured Data

• Why Text Mining

• Some Text Mining Ambiguities

• Pre-processing the Text

Text Mining

• The discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources

Previously unknown means:

• Discovering genuinely new information

• Discovering new knowledge vs. merely finding patterns is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft

Unstructured means:

• Free naturally occurring text

• As opposed to HTML, XML, etc.

Text Mining vs. Data Mining

• Data in data mining is a series of numbers. Data for text mining is a collection of documents.

• Data mining methods see data in spreadsheet format. Text mining methods see data in document format.

Structured vs. Unstructured Data

• Structured data:

• Loadable into “spreadsheets”

• Arranged into rows and columns

• Each cell filled or could be filled

• Data mining friendly

• Unstructured data:

• Microsoft Word, HTML, PDF documents, PPTs

• Usually converted into semi-structured XML

• Not structured into cells

• Variable record length, notes, free form survey-answers

• Text is relatively sparse, inconsistent and not uniform

• Also images, video, music etc.

Why Text Mining?

• Leveraging text should improve decisions and predictions

• Text mining is gaining momentum:

• Sentiment analysis (Twitter, Facebook)

• Predicting the stock market

• Predicting churn

• Customer influence

• Customer service and help desk

• Not to mention Watson

Why Is Text Mining Hard?

• Language is ambiguous

• Context is needed to clarify

• The same word can have different meanings (homographs)

• Bear (verb) – to support or carry

• Bear (noun) – a large animal

• Different words can mean the same thing (synonyms)

• Language is subtle

• Concept / word extraction usually results in a huge number of dimensions

• Thousands of new fields

• Each field typically has low information content (sparse)

• Misspellings, abbreviations, spelling variants

• These render search engines, SQL queries, etc. ineffective
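
The dimensionality and sparsity points above can be made concrete with a toy document-term matrix. This is only a minimal sketch: the three documents are invented for illustration, and plain whitespace splitting stands in for real tokenization.

```python
# Toy illustration of how word extraction creates many sparse fields.
# The documents below are invented for this example.
docs = [
    "the bank raised its interest rates",
    "mary walked along the bank of the river",
    "customers complained about slow help desk service",
]

# One field (column) per unique word across all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Document-term matrix: one row per document, one count per vocabulary word.
matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

total_cells = len(docs) * len(vocabulary)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / total_cells

print(len(vocabulary), "fields;", f"{sparsity:.0%} of cells are zero")
```

Even with three short documents, most cells are zero; with a real corpus the number of fields runs into the thousands and the matrix becomes far sparser.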

Some Text Mining Ambiguities

• Homonymy: same word, different meaning

• Mary walked along the bank of the river

• HarborBank is the richest bank in the city

• Synonymy: different words with similar or the same meaning; one word can often be substituted for the other without changing the meaning, though not in every context

• Miss Nelson became a kind of big sister to Benjamin

• Miss Nelson became a kind of large sister to Benjamin

• Polysemy: same word or form, but different, albeit related, meanings

• The bank raised its interest rates yesterday

• The store is next to the newly constructed bank

• The bank appeared first in Italy in the Renaissance

• Hyponymy: concept hierarchy or subclass

• Animal (noun) – cat, dog

• Injury – broken leg, contusion

Seven Types of Text Mining

• Search and Information Retrieval – storage and retrieval of text documents, including search engines and keyword search

• Document Clustering – Grouping and categorizing terms, snippets, paragraphs or documents using clustering methods

• Document Classification – Grouping and categorizing snippets, paragraphs or documents using data mining classification methods, based on models trained on labelled examples

• Web Mining – Data and text mining on the internet, with a specific focus on the scale and interconnectedness of the web

• Information Extraction – Identification and extraction of relevant facts and relationships from unstructured text

• Natural Language Processing – Low-level language processing and understanding tasks (e.g., part-of-speech tagging)

• Concept extraction – Grouping of words and phrases into semantically similar groups

Text Mining – Some Definitions

• Document – a sequence of words and punctuation, following the grammatical rules of the language.

• Term – usually a word, but can be a word-pair or phrase

• Corpus – a collection of documents

• Lexicon – set of all unique words in corpus
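
As a rough illustration of these definitions, the sketch below builds a lexicon from a tiny corpus. The two documents and the helper name `terms` are made up for this example, and terms are limited to single words.

```python
# Corpus: a collection of documents (two invented one-sentence documents here).
corpus = [
    "Text mining finds new information in text.",
    "Data mining finds patterns in numbers.",
]

def terms(document):
    """Split a document into lowercase single-word terms, dropping punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in document.lower())
    return cleaned.split()

# Lexicon: the set of all unique words in the corpus.
lexicon = set()
for document in corpus:
    lexicon.update(terms(document))

print(sorted(lexicon))
```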

Pre-processing the Text

• Text Normalization

• Parts of Speech Tagging

• Removal of stop words

Stop words – common words that don’t add meaningful content to the document

• Stemming

• Removing suffixes and prefixes, leaving the root or stem of the word

• Term weighting

• POS Tagging

• Tokenization

Text Normalization

• Case

• Make all lower case (if you don’t care about proper nouns, titles, etc.)

• Clean up transcription and typing errors

• e.g., “do n’t”, “movei”

• Correct misspelled words

• Phonetically

• Use fuzzy matching algorithms such as Soundex, Metaphone or string edit distance

• Dictionaries

• Use POS and context to make a good guess
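
String edit distance, one of the fuzzy matching options listed above, can be sketched as follows. This is a plain Levenshtein distance written for illustration; the word list and the helper name `correct` are hypothetical, and a practical spell corrector would also use word frequency and context.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def correct(word, dictionary):
    """Return the dictionary word closest to `word` by edit distance."""
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))

dictionary = ["movie", "mining", "text", "walk"]  # hypothetical word list
print(correct("movei", dictionary))
```

Note that plain Levenshtein counts a swapped pair of letters as two edits, so if “move” were also in the word list it would be closer to “movei” (one deletion) than “movie” is. This is one reason POS and context are needed to make a good guess.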

Parts of Speech Tagging

• Useful for recognizing names of people, places, organizations, titles

• English language

• Minimum set includes nouns, verbs, adjectives, adverbs, prepositions, conjunctions

POS Tags from the Penn Treebank

Tag    Description
CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNPS   Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
TO     to
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense

Example of Tagging

• In this talk, Mr. Pole discussed how Target was using Predictive Analytics including descriptions of using potential value models, coupon models, and yes predicting when a woman is due

• In/IN this/DT talk/NN, Mr./NNP Pole/NNP discussed/VBD how/WRB Target/NNP was/VBD using/VBG Predictive/NNP Analytics/NNP including/VBG descriptions/NNS of/IN using/VBG potential/JJ value/NN models/NNS, coupon/NN models/NNS, and yes predicting/VBG when/WRB a/DT woman/NN is due/JJ
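
To show why tags are useful for recognizing names, the sketch below reads a fragment of the tagged sentence and groups consecutive proper nouns (NNP) into candidate names. It assumes the text is already tagged in `word/TAG` form; it does not perform the tagging itself, and the grouping rule is a simplification chosen for this example.

```python
tagged_text = (
    "Mr./NNP Pole/NNP discussed/VBD how/WRB Target/NNP was/VBD "
    "using/VBG Predictive/NNP Analytics/NNP"
)

# Split each "word/TAG" token at the last slash into a (word, tag) pair.
pairs = [tuple(token.rsplit("/", 1)) for token in tagged_text.split()]

# Group consecutive proper nouns (NNP) into candidate names.
names, current = [], []
for word, tag in pairs:
    if tag == "NNP":
        current.append(word)
    elif current:
        names.append(" ".join(current))
        current = []
if current:
    names.append(" ".join(current))

print(names)  # ['Mr. Pole', 'Target', 'Predictive Analytics']
```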

Tokenization

• Converts streams of characters into words

• Main clue (in English): whitespace

• No single algorithm always ‘works’

• Some languages do not have white space (Chinese, Japanese)
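
A minimal regular-expression tokenizer for English might look like the sketch below. The pattern is one possible choice, not a standard: it keeps simple contractions together and emits each punctuation mark as its own token.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:'\w+)* keeps contractions such as "don't" together;
    # [^\w\s] emits each remaining punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(tokenize("Mr. Pole didn't stop; he kept talking."))
# ['Mr', '.', 'Pole', "didn't", 'stop', ';', 'he', 'kept', 'talking', '.']
```

The abbreviation “Mr.” is split into two tokens here, which illustrates why no single rule works in every case.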

Stemming

• Normalizes / unifies variations of the same word

• ‘walking’, ‘walks’, ‘walked’ → ‘walk’

• Inflectional stemming

• Remove plurals

• Normalize verb tenses

• Remove other affixes

• Stemming to root

• Reduce word to most basic element

• More aggressive than inflectional

• ‘denormalization’ → ‘norm’

• ‘apply’, ‘applications’, ‘reapplied’ → ‘apply’
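
Inflectional stemming can be sketched with a few suffix rules. This is a deliberately naive illustration, not the Porter stemmer or any library's algorithm; the suffix list and the minimum stem length of three characters are assumptions made for this example.

```python
def inflectional_stem(word):
    """Strip one common inflectional suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["walking", "walks", "walked", "walk"]:
    print(word, "->", inflectional_stem(word))
```

All four forms reduce to “walk”. Rules this simple fail on irregular forms and on spelling changes at the suffix boundary (for example “applied”), which is why practical stemmers carry many more rules or fall back on a dictionary.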

Common English Stop Words

• a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, these, they, this, to, was, will, with

• Stop words are very common and rarely provide useful information for information extraction and concept extraction

• Removing stop words also reduces dimensionality
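
Stop word removal is a simple filter over the token list. The sketch below uses a stop word set along the lines of the list above; the function name `remove_stop_words` is chosen for this example, and whitespace splitting again stands in for real tokenization.

```python
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "these", "they", "this", "to", "was", "will", "with",
}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = "The store is next to the newly constructed bank".split()
print(remove_stop_words(tokens))
# ['store', 'next', 'newly', 'constructed', 'bank']
```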

Dictionaries and Lexicons

• Highly recommended, but can be very time-consuming

• Reduce the set of keywords to focus on

• Words of interest

• Dictionary words

• Increase the set of keywords to focus on

• Proper nouns

• Acronyms

• Titles

• Numbers

• Key ways to use a dictionary

• Local dictionary (specialized words)

• Stop words and too-frequent words

• Stemming – reduce stems to dictionary words

• Synonyms – replace synonyms with root words in the list

• Resolve abbreviations and acronyms
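
Two of the dictionary uses above, replacing synonyms with a root word and resolving abbreviations, can be sketched as simple lookups. The dictionary entries and the function name `normalize` are hypothetical and exist only for this illustration; a real local dictionary would be built for the domain at hand.

```python
# Hypothetical local dictionary entries, invented for illustration.
SYNONYMS = {"large": "big", "huge": "big"}
ABBREVIATIONS = {"pos": "part of speech", "nlp": "natural language processing"}

def normalize(tokens):
    """Lowercase tokens, map synonyms to a root word, and expand abbreviations."""
    result = []
    for token in tokens:
        token = token.lower()
        token = SYNONYMS.get(token, token)
        # An expanded abbreviation may contribute several tokens.
        result.extend(ABBREVIATIONS.get(token, token).split())
    return result

print(normalize(["NLP", "needs", "huge", "dictionaries"]))
# ['natural', 'language', 'processing', 'needs', 'big', 'dictionaries']
```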

Sentiment Analysis Workflow

Content Retrieval

Content Extraction

Corpus Generation

Corpus Transformation

Corpus Filtering

Sentiment Calculation

(Workflow phases shown in the diagram: Web Data Retrieval, Corpus Pre-Processing, Sentiment Analysis)

Sentiment Indicators

• $\text{polarity} = \frac{p - n}{p + n}$

• $\text{subjectivity} = \frac{p + n}{N}$

• $\text{positive sentiments per reference} = \frac{p}{N}$

• $\text{negative sentiments per reference} = \frac{n}{N}$

• $\text{sentiment differences per reference} = \frac{p - n}{N}$
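
The five indicators can be computed directly from three counts. The slides do not define the symbols, so this sketch assumes that p and n are the numbers of positive and negative terms found in a document and N is the total number of terms. The function name, the dictionary keys, and the choice to return 0.0 for polarity when no sentiment terms are found are likewise assumptions made for this example.

```python
def sentiment_indicators(p, n, N):
    """Compute the five indicators from counts p, n and total N (N > 0).

    Assumed meaning: p = positive terms, n = negative terms, N = all terms.
    """
    return {
        # Polarity is undefined when p + n == 0; 0.0 is an arbitrary fallback.
        "polarity": (p - n) / (p + n) if (p + n) else 0.0,
        "subjectivity": (p + n) / N,
        "positive_per_reference": p / N,
        "negative_per_reference": n / N,
        "difference_per_reference": (p - n) / N,
    }

scores = sentiment_indicators(p=6, n=2, N=40)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

With 6 positive and 2 negative terms out of 40, polarity is 0.50 and subjectivity is 0.20: the text leans positive, but only a fifth of its terms carry sentiment.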
