content processing architecture and applications - introduction to text mining
CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS
Introduction to text mining – Warsaw University of Technology
Plan
Findwise – who we are, what we do
What is content?
Why content processing is important
Content processing and information retrieval
Technology for content processing
Methods for content processing
Examples of usage
Findwise – Search Driven Solutions
• Founded in 2005
• Offices in Sweden, Denmark, Norway, Poland and Australia
• 90 employees
Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value.
• Paweł Wróblewski & Marcin Goss
WHAT IS CONTENT?
Content ≥ Information
From the business point of view INFORMATION is the key to success.
“Information can only be an asset when it enables a task to be completed.” “The value is in the outcome of the task, not in the information itself.” Martin White
Employee productivity (The hidden cost… IDC April 2006):
“the cost for wasted time on the part of professional searching, but not finding relevant information, amounts to $5.3 million annually for an enterprise with 1000 knowledge workers.”
Information is hidden
Big Data is commonly described with 3V:
1. Variety
Human generated vs. Machine generated
Text & Multimedia
2. Volume
Up to Petabytes
3. Velocity
Stream of data
GBs per day, hour, minute, second
Information lives in the context
The right Information is hidden in text.
Text forms a context:
word -> sentence -> paragraph -> chapter -> document
Content processing is about extracting required information from the context.
WHY CONTENT PROCESSING IS IMPORTANT?
Why content processing is important
To get the right information in seconds
• Usage of faceted search
To tag large document sets consistently
• Usage of an automatic extractor
To build a semantic database
• Extraction of concepts with linkage to taxonomy/ontology
To perform document classification
• Extraction of entities with grouping / clustering
Examples from publicly available websites [live show]
Conclusion
Content processing is a set of techniques enabling text analytics.
Content processing leverages the value of data stored in companies, improving data consumption.
Content processing used with search engines helps find information in any context.
• Enterprise Findability
• E-commerce
TECHNOLOGY FOR CONTENT PROCESSING
General architecture of search engines
Content Processing – the idea
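[Diagram: content processing stages between the document and the index]
Stages shown: Format Conversion, Language Detection, Lemmas (tenses, forms), Spell Checking, Synonyms, Entities, Geography, Companies, People, Taxonomy, Classification, Vectorizer, Scopifier, Custom PLUG-IN
Target: index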
Document
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.
The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
Input: byte stream
Output: structured document ready to be indexed
Content Processing – the implementation
Hydra is used to refine content before it hits the index. Every document fetched from a source runs through a targeted pipeline made up of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise have created a large number of such stages, each with one small purpose: to enhance the content of the item. Additional stages can be created to serve customer-specific functionality.
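The pipeline idea can be sketched in a few lines of Python. This is an illustration only and does not use Hydra's actual API: the stage functions, the field names ("raw", "text", "language", "tokens") and the toy language heuristic are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def decode_bytes(doc: Document) -> Document:
    """Format conversion: turn the raw byte stream into text."""
    doc["text"] = doc.pop("raw").decode("utf-8")
    return doc


def detect_language(doc: Document) -> Document:
    """Toy heuristic standing in for a real language detector."""
    words = set(doc["text"].lower().split())
    doc["language"] = "en" if words & {"the", "and", "of"} else "unknown"
    return doc


def tokenize(doc: Document) -> Document:
    """Split the text on whitespace (real tokenizers are more careful)."""
    doc["tokens"] = doc["text"].split()
    return doc


def run_pipeline(raw: bytes, stages: List[Stage]) -> Document:
    """Pass one document through every stage, in order."""
    doc: Document = {"raw": raw}
    for stage in stages:
        doc = stage(doc)
    return doc


# Each stage does one small job; the pipeline is just their ordered list.
PIPELINE: List[Stage] = [decode_bytes, detect_language, tokenize]
result = run_pipeline(
    b"Venus Williams raced into the second round of the French Open",
    PIPELINE,
)
print(result["language"], result["tokens"][:2])
```

Adding customer-specific behaviour then means writing one more function and appending it to the list.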
Hydra - example
Select the stages to use in the pipeline: the left column corresponds to the “market”, and the right column lists the stages in use.
Hydra - example
Modify the format of the date to only include year.
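A stage of this kind could look roughly like the sketch below. It is not Hydra code: the function name, the field names "date" and "year", the input format and the sample document are assumptions chosen for illustration.

```python
from datetime import datetime
from typing import Any, Dict


def year_only_stage(
    doc: Dict[str, Any],
    source_field: str = "date",
    target_field: str = "year",
    date_format: str = "%Y-%m-%d",
) -> Dict[str, Any]:
    """Copy only the year of a date field into a new metadata field."""
    value = doc.get(source_field)
    if value is None:
        return doc  # nothing to do for documents without a date
    try:
        doc[target_field] = str(datetime.strptime(value, date_format).year)
    except ValueError:
        pass  # leave documents with unparseable dates untouched
    return doc


# Made-up sample document.
doc = year_only_stage({"title": "Sample report", "date": "2013-04-18"})
print(doc["year"])
```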
Hydra - example
The new year metadata can be used as a facet
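What a facet amounts to can be shown with a small sketch: count the documents per field value, then filter on the value the user picks. A search engine computes this inside the index; the in-memory version below, with its made-up documents and function names, is only an assumption-laden illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def facet_counts(docs: List[Dict[str, Any]], field: str) -> Counter:
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


def filter_by_facet(
    docs: List[Dict[str, Any]], field: str, value: Any
) -> List[Dict[str, Any]]:
    """Keep only the documents whose field equals the selected facet value."""
    return [doc for doc in docs if doc.get(field) == value]


# Made-up documents; "D" has no year and is ignored by the facet.
docs = [
    {"title": "A", "year": "2011"},
    {"title": "B", "year": "2012"},
    {"title": "C", "year": "2012"},
    {"title": "D"},
]
year_facet = facet_counts(docs, "year")
selected = filter_by_facet(docs, "year", "2012")
print(year_facet.most_common(), [d["title"] for d in selected])
```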
Hydra - example
Map every author field to a metadata field called author.
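The mapping can be sketched as a stage that collects several source fields into one. The source field names ("creator", "written_by") and the sample values are hypothetical; real sources will use their own names.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical names under which different sources store the author.
AUTHOR_FIELDS = ("author", "creator", "written_by")


def map_author_fields(
    doc: Dict[str, Any],
    source_fields: Sequence[str] = AUTHOR_FIELDS,
    target: str = "author",
) -> Dict[str, Any]:
    """Collect every author-like field into one list under `target`."""
    authors: List[str] = []
    for field in source_fields:
        value = doc.pop(field, None)
        if value is None:
            continue
        # Accept both a single name and a list of names.
        for name in value if isinstance(value, list) else [value]:
            if name not in authors:
                authors.append(name)
    if authors:
        doc[target] = authors
    return doc


mapped = map_author_fields({"title": "x", "creator": "Author One"})
merged = map_author_fields(
    {"creator": "Author One", "written_by": ["Author One", "Author Two"]}
)
print(mapped, merged)
```

Normalising to a single field means one facet and one query field cover every source.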
Pipeline A
Pipeline B
Hydra - example
In the search result…
Hydra is Open Source
http://findwise.github.com/Hydra/
METHODS FOR CONTENT PROCESSING
Named entity recognition – statistical classifiers
• OpenNLP (http://opennlp.apache.org/)
• Markov chains
• Mallet (http://mallet.cs.umass.edu/)
• Conditional random fields
Input:
Mark has been in London since Mary dumped him.
Output:
<person>Mark</person> has been in <place>London</place> since <person>Mary</person> dumped him.
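The output format above can be produced from a list of entity spans, whichever classifier found them. In the sketch below the spans are written by hand rather than predicted; the helper names are assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def span_of(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of `phrase` and label it."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def tag_entities(text: str, entities: List[Span]) -> str:
    """Wrap each non-overlapping span in <label>...</label> tags."""
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        parts.append(text[cursor:start])
        parts.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Mark has been in London since Mary dumped him."
# Hand-written spans standing in for classifier output.
entities = [
    span_of(text, "Mark", "person"),
    span_of(text, "London", "place"),
    span_of(text, "Mary", "person"),
]
tagged = tag_entities(text, entities)
print(tagged)
```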
Classifiers - training
• Training set - language corpora
• (http://nkjp.pl/) for Polish
Set of manually tagged texts in given language. Preferably from various sources, various topics.
Tokens   PoS tags    Name tags
He       Pronoun     O
went     Verb        O
to       Prep.       O
United   Adjective   Place
States   Noun        Place
.        Interp      O
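Training data in this three-column shape is easy to load. The sketch below parses the rows shown above and joins consecutive tokens that share a name tag into one entity; the whitespace-separated layout and the function names are assumptions, not the format of any particular corpus.

```python
from typing import List, Tuple

TRAINING_SAMPLE = """\
He Pronoun O
went Verb O
to Prep. O
United Adjective Place
States Noun Place
. Interp O
"""

Row = Tuple[str, str, str]  # (token, PoS tag, name tag)


def parse_rows(sample: str) -> List[Row]:
    """Split each non-empty line into (token, PoS tag, name tag)."""
    rows = []
    for line in sample.splitlines():
        if line.strip():
            token, pos, tag = line.split()
            rows.append((token, pos, tag))
    return rows


def entity_spans(rows: List[Row]) -> List[Tuple[str, str]]:
    """Join runs of tokens with the same non-O tag into one entity.

    Limitation: two adjacent entities of the same type would be merged.
    """
    spans = []
    current_tokens: List[str] = []
    current_tag = "O"
    for token, _pos, tag in rows:
        if tag != current_tag and current_tokens:
            spans.append((" ".join(current_tokens), current_tag))
            current_tokens = []
        if tag != "O":
            current_tokens.append(token)
        current_tag = tag
    if current_tokens:
        spans.append((" ".join(current_tokens), current_tag))
    return spans


rows = parse_rows(TRAINING_SAMPLE)
print(entity_spans(rows))
```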
Classifiers – supervised training
• Training input
• Features extracted from each token
token: text, PoS tag, token class
prev token: text, PoS tag, token class
next token: text, PoS tag, token class
previous tags assigned
• Token class examples: lowercase alphabetic, digits, contains number and letter, contains number and a hyphen, all caps, all caps with dots in between ...
• Training output
• <place> <location> <person>
• <B-place> <I-place> <L-place> <U-place>
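The training input and output listed above can be sketched as follows. The token class names, the "<pad>" marker, the two-tag history and the function names are assumptions; OpenNLP and Mallet define their own feature generators. The B/I/L/U prefixes are read here as begin, inside, last and unit (single-token) positions.

```python
import re
from typing import Dict, List


def token_class(token: str) -> str:
    """Map a token to one of the coarse classes named on the slide."""
    if re.fullmatch(r"(?:[A-Z]\.)+", token):
        return "caps_with_dots"  # e.g. U.S.
    if token.isalpha() and token.isupper():
        return "all_caps"
    if token.isalpha() and token.islower():
        return "lowercase"
    if token.isdigit():
        return "digits"
    has_digit = any(ch.isdigit() for ch in token)
    if has_digit and "-" in token:
        return "number_and_hyphen"
    if has_digit and any(ch.isalpha() for ch in token):
        return "number_and_letter"
    return "other"


def token_features(
    tokens: List[str], pos_tags: List[str], i: int, previous_tags: List[str]
) -> Dict[str, object]:
    """Features for token i: itself, both neighbours, and the tag history."""
    feats: Dict[str, object] = {}
    for prefix, j in (("", i), ("prev_", i - 1), ("next_", i + 1)):
        inside = 0 <= j < len(tokens)
        feats[prefix + "text"] = tokens[j] if inside else "<pad>"
        feats[prefix + "pos"] = pos_tags[j] if inside else "<pad>"
        feats[prefix + "class"] = token_class(tokens[j]) if inside else "<pad>"
    feats["prev_tags"] = tuple(previous_tags[-2:])
    return feats


def to_bilou(tags: List[str]) -> List[str]:
    """Turn plain name tags into B-/I-/L-/U- position tags."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        starts = i == 0 or tags[i - 1] != tag
        ends = i == len(tags) - 1 or tags[i + 1] != tag
        prefix = "U" if starts and ends else "B" if starts else "L" if ends else "I"
        out.append(f"{prefix}-{tag}")
    return out


tokens = ["He", "went", "to", "United", "States", "."]
pos_tags = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
feats = token_features(tokens, pos_tags, 3, ["O", "O", "O"])
print(feats)
print(to_bilou(["O", "O", "O", "place", "place", "O"]))
```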
Classifiers – approaches
„Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie” (“The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton”)
Location? Organisation name? Person name?
• One classifier for all name-types
• faster
• automatically resolves conflicts
• One classifier per name-type
• slower, memory consuming
• provides more information
EXAMPLES
Naive approach
People's names often overlap with location names:
- Kazimierz
- Washington
Company names may come from common language:
- Oracle
- Dialog
Conclusion: dictionaries are not enough without contextual analysis
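The problem is easy to reproduce with a plain dictionary lookup. The tiny gazetteer below holds only the names from this slide, and the sentence is made up; both are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

# Tiny gazetteer built from the names on this slide.
GAZETTEER: Dict[str, Set[str]] = {
    "person": {"Kazimierz", "Washington"},
    "location": {"Kazimierz", "Washington"},
    "company": {"Oracle", "Dialog"},
}


def dictionary_lookup(tokens: List[str]) -> List[Tuple[int, str, List[str]]]:
    """Return (position, token, labels) for every token found in a dictionary."""
    matches = []
    for i, token in enumerate(tokens):
        labels = sorted(
            label for label, names in GAZETTEER.items() if token in names
        )
        if labels:
            matches.append((i, token, labels))
    return matches


# Made-up sentence: "Oracle" is meant here as a prophet, not the company.
matches = dictionary_lookup("The Oracle told Washington to visit Kazimierz".split())
for position, token, labels in matches:
    print(position, token, labels)
```

The lookup tags "Oracle" as a company and cannot choose between person and location for the other two names; only the surrounding words can settle either question.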
Findwise implementation
QUESTIONS?
Paweł Wróblewski pawel.wroblewski@findwise.com
Marcin Goss marcin.goss@findwise.com