Text mining

www.decideo.fr/bruley Text mining [email protected] Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …

Upload: hashim-davis

Post on 31-Dec-2015



TRANSCRIPT

Page 1: Text mining

www.decideo.fr/bruley

Text mining

[email protected]

Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …

Page 2: Text mining

Information context

A large amount of information is available in textual form in databases and online sources

At this scale, manual analysis cannot effectively extract the useful information

Automatic tools are therefore needed to analyze large text collections

Page 3: Text mining

Text mining definition

The objective of text mining is to exploit the information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.

The results matter both for analyzing the collection and for providing intelligent navigation and browsing methods

Page 4: Text mining

Text mining pipeline

Unstructured text (implicit knowledge)

Structured content (explicit knowledge)

Information extraction

Semantic metadata

Knowledge discovery

Information retrieval

Semantic search / Data mining

Page 5: Text mining

Text mining process

Text preprocessing: syntactic/semantic text analysis

Features generation: bag of words

Features selection: simple counting, statistics

Text/data mining: classification (supervised learning), clustering (unsupervised learning)

Analyzing results: mapping/visualization, result interpretation

Iterative and interactive process
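
The first three steps of this process can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any tool named in these slides: the tokenizer is a plain regular expression, feature selection is a document-frequency threshold, and the function names, the sample documents and the `min_df` parameter are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Preprocessing: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Feature generation: one term-count Counter per document."""
    return [Counter(tokenize(d)) for d in docs]

def select_features(bags, min_df=2):
    """Feature selection by simple counting: keep terms that occur
    in at least min_df documents (document frequency)."""
    # Iterating a Counter yields each distinct term once per document.
    df = Counter(term for bag in bags for term in bag)
    return sorted(t for t, n in df.items() if n >= min_df)

def vectorize(bag, vocab):
    """Project one bag of words onto the selected vocabulary."""
    return [bag[t] for t in vocab]

# Hypothetical sample collection.
docs = [
    "Text mining extracts patterns from text.",
    "Data mining finds patterns in data.",
    "Clustering groups similar documents.",
]
bags = bag_of_words(docs)
vocab = select_features(bags, min_df=2)
vectors = [vectorize(b, vocab) for b in bags]
```

The resulting vectors are what a classifier (supervised) or a clustering algorithm (unsupervised) would consume in the mining step; in practice the loop is repeated with different preprocessing and selection choices.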

Page 6: Text mining

Text mining actors

Publishers: enriched content, annotation tools, tools for authors, new applications based on annotation layers, richer cross-linking based on content …

Analysts: empowers them, annotating research output, hypothesis generation, summarisation of findings, focused semantic search …

Libraries: linking between institutional repositories, access to richer metadata, aggregation, aids to subject analysis/classification …

Page 7: Text mining

Challenges in text mining

The data collection is “free text”: it is not well organized (semi-structured or unstructured)

There is no uniform access across sources; each source has its own storage and algebra (examples: email, databases, applications, web)

A fivefold heterogeneity: semantic, linguistic, structure, format, size of the information unit

Learning techniques for processing text typically need annotated training data

XML as the common model allows:
– Manipulating data with standards
– Mining that becomes closer to data mining
– RDF is emerging as a complementary model

The more structure you can explore, the better you can mine

Page 8: Text mining

Data source administration

Intranet

Internet

On-line databank

Information provider

File system, databases, EDMS

Web crawling

Format filter

XML normalisation: subject, author, text corpora, keywords
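
As an illustration of the XML normalisation step, the sketch below wraps one collected document in a common XML record carrying the four fields named on the slide (subject, author, text, keywords). The element names, the `normalise` function and the sample values are assumptions made for this example; the slides do not specify a schema.

```python
import xml.etree.ElementTree as ET

def normalise(subject, author, text, keywords):
    """Build a common XML record for one document, whatever its source format.
    Element names are hypothetical, chosen only for this sketch."""
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    ET.SubElement(doc, "text").text = text
    kw = ET.SubElement(doc, "keywords")
    for k in keywords:
        ET.SubElement(kw, "keyword").text = k
    return doc

# Hypothetical record, as it might come out of a crawler and a format filter.
record = normalise(
    subject="Text mining",
    author="A. Author",
    text="Raw text pulled from a crawled page.",
    keywords=["xml", "mining"],
)
xml_string = ET.tostring(record, encoding="unicode")
```

Once every source is mapped to the same record shape, downstream mining code can read the fields with standard XML tooling instead of one parser per source.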

Page 9: Text mining

Text mining tasks

TM

• Text analysis tools
- Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
- Categorization
- Summarization
- Clustering: hierarchical clustering, binary relational clustering

• Web searching tools
- Text search engine
- NetQuestion Solution
- Web crawler

Page 10: Text mining

Information extraction

Extract domain-specific information from natural language text

– Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
• Constructed by hand
• Automatically learned from hand-annotated training data

– Need a semantic lexicon (dictionary of words with semantic category labels)
• Typically constructed by hand
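
A hand-built pattern dictionary and semantic lexicon can be sketched as follows. This is a toy example: the two regular expressions mirror the “traveled to <x>” and “presidents of <x>” patterns above, while the slot names, the lexicon entries and the sample sentence are invented for illustration, and the capitalized-word heuristic for `<x>` is an assumption.

```python
import re

# <x> is approximated as one or more capitalized words (an assumption).
NAME = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"

# Dictionary of extraction patterns, constructed by hand.
PATTERNS = {
    "destination": re.compile(r"\btraveled to " + NAME),
    "led_entity": re.compile(r"\bpresidents? of " + NAME),
}

# Semantic lexicon: words with semantic category labels (hypothetical entries).
LEXICON = {"Paris": "LOCATION", "Acme": "ORGANIZATION"}

def extract(text):
    """Return (slot, filler, semantic category) triples found in the text."""
    facts = []
    for slot, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            filler = match.group(1)
            facts.append((slot, filler, LEXICON.get(filler, "UNKNOWN")))
    return facts

facts = extract("The envoy traveled to Paris. She met the president of Acme.")
```

Learning the patterns automatically, as the slide mentions, would replace the hand-written `PATTERNS` table with one induced from hand-annotated training data.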

Link Analysis

Query Log Analysis

Metadata Extraction

Keyword Ranking

Intelligent Match

Duplicate Elimination

Page 11: Text mining

Document collections treatment

Categorization

Clustering

Page 12: Text mining

Text Mining example: Obama vs. McCain

Page 13: Text mining

Aster Data position for Text Analysis

Data Acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)

Pre-Processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)

Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)

Analytic Applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)

Third-Party Tools Fit

Aster Data Fit

Aster Data Value: massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries

Page 14: Text mining

Aster Data Value for Text Analytics

• Ability to store and process massive volumes of text data
– Massively parallel data stores and massively parallel analytics engine
– SQL-MapReduce framework enables in-database processing for specialized text analytics tools

• Tools and extensibility for processing diverse text data
– SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
– Pre-built functions for text processing

• Flexible platform for building and processing diverse analytics
– SQL-MapReduce framework enables creation of flexible, reusable analytics
– Embedded MapReduce processing engine for high-performance analytics
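
For readers unfamiliar with the MapReduce style of processing that these points rely on, here is a generic term-count sketch in plain Python. It only illustrates the map and reduce phases on a tiny in-memory table; it is not Aster Data's SQL-MapReduce API, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc_id, text):
    """Map: emit a (term, 1) pair for every token of one document row.
    doc_id is unused here; a real job might emit it for per-document counts."""
    return [(term, 1) for term in text.lower().split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each term."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)

# Hypothetical rows of a text table: (document id, text).
rows = [(1, "big text data"), (2, "text analytics on big data")]

# Each row can be mapped independently, which is what makes the
# map phase easy to spread over many parallel workers.
mapped = chain.from_iterable(map_phase(i, t) for i, t in rows)
counts = reduce_phase(mapped)
```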

Page 15: Text mining

Aster Data Capabilities for Text Data

Pre-built SQL-MapReduce functions for text processing

• Data transformation utilities
- Pack: compress multi-column data into a single column
- Unpack: extract nested data for further analysis

• Web log analysis
- Sessionization: identify unique browsing sessions in clickstream data

• Text analysis
- Text parser: general tool for tokenizing, stemming, and counting text data
- nGram: split text into component parts (words & phrases)
- Levenshtein distance: compute “distance” between words
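
Two of the text-analysis functions listed on this slide, n-gram splitting and Levenshtein distance, are standard techniques and can be sketched in plain Python. The code below shows the underlying algorithms only; it does not reproduce the signatures or behaviour of the Aster Data functions, and the names `ngrams` and `levenshtein` are chosen for this example.

```python
def ngrams(text, n):
    """Split text into overlapping word n-grams (n=1 gives words, n=2 gives phrases)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]

bigrams = ngrams("text mining finds patterns", 2)
distance = levenshtein("kitten", "sitting")
```

A small distance between two words is commonly used to match misspellings or name variants before counting terms.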

Diagram labels (Aster Data nCluster): Data; Aster Data Analytic Foundation (SQL, SQL-MapReduce); Custom and Packaged Analytics; App