Text mining
www.decideo.fr/bruley
Text mining
Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
Information context
A large amount of information is available in textual form in databases and online sources
At this scale, manual analysis and effective extraction of useful information are not feasible
Automatic tools are therefore needed to analyze large textual collections
Text mining definition
The objective of Text Mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.
The results matter both for analyzing the collection and for providing intelligent navigation and browsing methods
Text mining pipeline
Unstructured text (implicit knowledge)
Structured content (explicit knowledge)
Information extraction
Semantic metadata
Knowledge discovery
Information retrieval
Semantic search / Data mining
Text mining process
Text preprocessing: syntactic/semantic text analysis
Features generation: bag of words
Features selection: simple counting, statistics
Text/data mining: classification (supervised learning), clustering (unsupervised learning)
Analyzing results: mapping/visualization, result interpretation
Iterative and interactive process
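The first steps of this process can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the tokenizer is a lowercase regular expression rather than real syntactic or semantic analysis, feature selection keeps terms by document frequency (one possible form of "simple counting"), and the function names, the `min_df` threshold, and the sample documents are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Text preprocessing (simplified): lowercase and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Features generation: one term-frequency Counter per document."""
    return [Counter(tokenize(doc)) for doc in docs]

def select_features(bags, min_df=2):
    """Features selection by simple counting: keep terms that occur
    in at least min_df documents (min_df is an assumed threshold)."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(term for term, n in doc_freq.items() if n >= min_df)

def vectorize(bags, vocab):
    """Turn each bag into a fixed-length count vector over the vocabulary."""
    return [[bag[term] for term in vocab] for bag in bags]

# Hypothetical sample documents.
docs = [
    "Text mining finds patterns in text.",
    "Data mining finds trends in data.",
    "Mining text needs structure.",
]
bags = bag_of_words(docs)
vocab = select_features(bags, min_df=2)
vectors = vectorize(bags, vocab)
print(vocab)
print(vectors)
```

The resulting count vectors are the kind of input a classifier or clustering algorithm would consume in the text/data mining step.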
Text mining actors
Publishers
– Enriched content
– Annotation tools
– Tools for authors
– New applications based on annotation layers
– Richer cross-linking based on content
– …
Analysts
– Empowers them
– Annotating research output
– Hypothesis generation
– Summarisation of findings
– Focused semantic search
– …
Libraries
– Linking between institutional repositories
– Access to richer metadata
– Aggregation
– Aids to subject analysis/classification
– …
Challenges in text mining
The data collection is “free text” and is not well organized (semi-structured or unstructured)
There is no uniform access across sources; each source has its own storage and algebra (examples: email, databases, applications, web)
A fivefold heterogeneity: semantic, linguistic, structure, format, and size of the unit of information
Learning techniques for processing text typically need annotated training data
XML as the common model allows:
– Manipulating data with standards
– Mining that becomes closer to data mining
– RDF is emerging as a complementary model
The more structure you can explore, the better the mining
Data source administration
Intranet
Internet
On-line databank
Information provider
File system, databases, EDMS
Web crawling
XML normalisation: subject, author, text corpora, keywords
Format filter
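As a rough illustration of the XML normalisation step, the sketch below wraps one collected document in a common XML record carrying the four fields the slide names (subject, author, text, keywords). The slide does not specify a schema, so the element names, the `normalise` function, and the sample values are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

def normalise(subject, author, text, keywords):
    """Build one XML record with assumed element names:
    <document><subject/><author/><text/><keywords><keyword/>...</keywords></document>"""
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    ET.SubElement(doc, "text").text = text
    kw_parent = ET.SubElement(doc, "keywords")
    for kw in keywords:
        ET.SubElement(kw_parent, "keyword").text = kw
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(doc, encoding="unicode")

# Hypothetical record; the values are placeholders, not real data.
record = normalise(
    subject="Text mining",
    author="A. Author",
    text="Body of the document after format filtering.",
    keywords=["xml", "mining"],
)
print(record)
```

Once every source is mapped to the same record shape, downstream mining code can read documents from an intranet, a databank, or a crawler without caring where they came from.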
Text mining tasks
TM
• Text analysis tools
– Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
– Categorization
– Summarization
– Clustering: hierarchical clustering, binary relational clustering
• Web searching tools
– Text search engine
– NetQuestion Solution
– Web crawler
Information extraction
Extract domain-specific information from natural language text
– Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
• Constructed by hand
• Automatically learned from hand-annotated training data
– Need a semantic lexicon (dictionary of words with semantic category labels)
• Typically constructed by hand
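A minimal sketch of how such a pattern dictionary and semantic lexicon could work together, using regular expressions. The two patterns mirror the examples above; the slot labels, the capitalized-word heuristic for filling <x>, the lexicon entries, and the sample sentence are all hand-built assumptions for illustration, not part of any particular extraction system.

```python
import re

# Hypothetical hand-built dictionary of extraction patterns.
# The named group "x" plays the role of the <x> slot; it is filled by
# one or more capitalized words (a crude stand-in for a real parser).
EXTRACTION_PATTERNS = {
    r"traveled to (?P<x>[A-Z][a-z]+(?: [A-Z][a-z]+)*)": "travel_destination",
    r"presidents? of (?P<x>[A-Z][a-z]+(?: [A-Z][a-z]+)*)": "presided_entity",
}

# Hypothetical semantic lexicon: word -> semantic category label.
SEMANTIC_LEXICON = {"paris": "location", "france": "location"}

def extract(text):
    """Return (slot, filler, semantic category) for every pattern match."""
    results = []
    for pattern, slot in EXTRACTION_PATTERNS.items():
        for match in re.finditer(pattern, text):
            filler = match.group("x")
            category = SEMANTIC_LEXICON.get(filler.lower(), "unknown")
            results.append((slot, filler, category))
    return results

sentence = "The minister traveled to Paris to meet the president of France."
facts = extract(sentence)
print(facts)
```

Both resources here are written by hand, which is exactly the cost the slide points out; learning the patterns instead requires hand-annotated training data.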
Link Analysis
Query Log Analysis
Metadata Extraction
Keyword Ranking
Intelligent Match
Duplicate Elimination
Document collections treatment
Categorization
Clustering
Text Mining example: Obama vs. McCain
Aster Data position for Text Analysis
Data Acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)
Pre-Processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)
Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)
Analytic Applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)
Third-Party Tools Fit
Aster Data Fit
Aster Data Value: massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries
Aster Data Value for Text Analytics
• Ability to store and process massive volumes of text data
– Massively parallel data stores and massively parallel analytics engine
– SQL-MapReduce framework enables in-database processing for specialized text analytics tools
• Tools and extensibility for processing diverse text data
– SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
– Pre-built functions for text processing
• Flexible platform for building and processing diverse analytics
– SQL-MapReduce framework enables creation of flexible, reusable analytics
– Embedded MapReduce processing engine for high-performance analytics
Aster Data Capabilities for Text Data
Pre-built SQL-MapReduce functions for text processing
• Data transformation utilities
- Pack: compress multi-column data into a single column
- Unpack: extract nested data for further analysis
• Web log analysis
- Sessionization: identify unique browsing sessions in clickstream data
• Text analysis
- Text parser: general tool for tokenizing, stemming, and counting text data
- nGram: split text into component parts (words & phrases)
- Levenshtein distance: compute “distance” between words
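To make two of the text analysis functions concrete, here is a plain-Python sketch of word n-grams and of the edit distance between two words. It illustrates the general techniques only; it is not the Aster Data SQL-MapReduce implementation, and the function names and signatures below are assumptions chosen for this example.

```python
def ngrams(text, n):
    """Split text on whitespace and return all runs of n consecutive words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]

print(ngrams("text mining finds patterns", 2))
print(levenshtein("kitten", "sitting"))
```

The distance computation keeps only two rows of the dynamic-programming table, so memory grows with the length of the second word rather than with the product of both lengths.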
[Diagram: Aster Data nCluster, with elements labelled Apps, Custom and Packaged Analytics, Aster Data Analytic Foundation, SQL / SQL-MapReduce, and Data]