content processing architecture and applications - introduction to text mining

CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS

Introduction to text mining – Warsaw University of Technology

Plan

Findwise – who we are, what we do

What is content?

Why content processing is important

Content processing and information retrieval

Technology for content processing

Methods for content processing

Examples of usage

Findwise – Search Driven Solutions

•  Founded in 2005

•  Offices in Sweden, Denmark, Norway, Poland and Australia

•  90 employees

Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value.

•  Paweł Wróblewski & Marcin Goss

WHAT IS CONTENT?

Content ≥ Information

From the business point of view INFORMATION is the key to success.

“Information can only be an asset when it enables a task to be completed.” “The value is in the outcome of the task, not in the information itself.” Martin White

Employee productivity (The hidden cost… IDC April 2006):

“the cost for wasted time on the part of professional searching, but not finding relevant information, amounts to $5.3 million annually for an enterprise with 1000 knowledge workers.”

Information is hidden

Big Data is commonly described with 3V:

1.  Variety

Human generated vs. Machine generated

Text & Multimedia

2.  Volume

Up to Petabytes

3.  Velocity

Stream of data

GBs per day, hour, minute, second

Information lives in the context

The right information is hidden in text.

Text forms a context:

word -> sentence -> paragraph -> chapter -> document

Content processing is about extracting required information from the context.

WHY IS CONTENT PROCESSING IMPORTANT?

Why content processing is important

To get the right information in seconds

•  Usage of faceted search

To tag a large document set consistently

•  Usage of an automatic extractor

To build a semantic database

•  Extraction of concepts with linkage to a taxonomy/ontology

To perform document classification

•  Extraction of entities with grouping / clustering

Examples from publicly available websites [live show]

Conclusion

Content processing is a set of techniques enabling text analytics.

Content processing leverages the value of data stored in companies, improving data consumption.

Content processing used with search engines helps find information in any context.

•  Enterprise Findability

•  E-commerce

TECHNOLOGY FOR CONTENT PROCESSING

General architecture of search engines

Content Processing – the idea

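[Diagram: a document passes through a chain of processing stages before it reaches the index. Stage labels shown: Format Conversion, Language Detection, Lemmas (tenses, forms), Spell Checking, Synonyms, Entities, Geography, Companies, People, Taxonomy Classification, Vectorizer, Scopifier, Custom plug-in.]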

Document

PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.

The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.

Input: byte stream

Output: structured document ready to be indexed

Content Processing – the implementation

Hydra refines content before it reaches the index. Every document fetched from a source runs through a targeted pipeline made up of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise has created a large number of such stages, each with one small purpose: to enhance some aspect of the item's content. Additional stages can be created to serve customer-specific functionality.
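The pipeline-of-stages idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Hydra's actual API: the `Document` and `Stage` types, the three stage functions and `run_pipeline` are all hypothetical names, and a document is assumed to be a plain dictionary of fields.

```python
from typing import Callable, Dict, List

# Assumption: a document is a plain dict of field name -> value.
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def decode_bytes(doc: Document) -> Document:
    """Format conversion: replace the raw byte stream with decoded text."""
    out = dict(doc)
    raw = out.pop("raw")
    if isinstance(raw, bytes):
        out["text"] = raw.decode("utf-8", errors="replace")
    else:
        out["text"] = str(raw)
    return out


def normalise_whitespace(doc: Document) -> Document:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    out = dict(doc)
    out["text"] = " ".join(str(out["text"]).split())
    return out


def count_tokens(doc: Document) -> Document:
    """Add a simple metadata field derived from the content."""
    out = dict(doc)
    out["token_count"] = len(str(out["text"]).split())
    return out


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


fetched = {"raw": b"PARIS (Reuters) -  Venus Williams raced\ninto the second round"}
indexed = run_pipeline(fetched, [decode_bytes, normalise_whitespace, count_tokens])
print(indexed)
```

Each stage does one small job and returns a new document, so stages can be added, removed or reordered without touching the others.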

Hydra - example

Select the stages to use in the pipeline. The left column corresponds to the “market”, and the right column lists the stages used.

Hydra - example

Modify the format of the date to include only the year.
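A stage of this kind might look roughly like the sketch below. It is not taken from Hydra: the function name `date_to_year`, the field names `date` and `year`, and the input format `%Y-%m-%d` are assumptions made for the example.

```python
from datetime import datetime
from typing import Dict


def date_to_year(
    doc: Dict[str, str],
    source_field: str = "date",      # hypothetical input field
    target_field: str = "year",      # hypothetical output field
    date_format: str = "%Y-%m-%d",   # assumed format of the source date
) -> Dict[str, str]:
    """Return a copy of doc with the year extracted into target_field."""
    out = dict(doc)
    value = out.get(source_field)
    if value is None:
        return out
    try:
        out[target_field] = str(datetime.strptime(value, date_format).year)
    except ValueError:
        # Unparseable dates are left alone rather than guessed at.
        pass
    return out


print(date_to_year({"title": "French Open report", "date": "2012-05-28"}))
```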

Hydra - example

The new year metadata can be used as a facet.
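To illustrate what a facet is, the sketch below counts how many hits fall under each value of a field. A search engine computes this internally; the `facet_counts` helper and the sample hits here are invented purely for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def facet_counts(docs: Iterable[Dict[str, str]], field: str) -> Counter:
    """Count documents per value of the given field, skipping documents without it."""
    return Counter(doc[field] for doc in docs if field in doc)


# Invented search hits, each already enriched with a "year" field.
hits = [
    {"title": "A", "year": "2011"},
    {"title": "B", "year": "2012"},
    {"title": "C", "year": "2012"},
    {"title": "D"},
]
year_facet = facet_counts(hits, "year")
print(year_facet.most_common())
```

The counts are what a user interface would show next to each year, letting the user narrow the result list with one click.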

Hydra - example

Map every author field to a metadata field called author.

Pipeline A

Pipeline B
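Field mapping of this kind can be sketched as follows. The alias names in `AUTHOR_ALIASES` are hypothetical examples of what different sources might call the author field, and `map_author` is an illustrative name rather than an existing Hydra stage.

```python
from typing import Dict, Tuple

# Hypothetical field names that different sources might use for the author.
AUTHOR_ALIASES: Tuple[str, ...] = ("creator", "dc.creator", "written_by")


def map_author(doc: Dict[str, str], aliases: Tuple[str, ...] = AUTHOR_ALIASES) -> Dict[str, str]:
    """Return a copy of doc with every alias field folded into 'author'."""
    out = dict(doc)
    for alias in aliases:
        if alias in out:
            value = out.pop(alias)
            # Keep an existing 'author' value if there already is one.
            out.setdefault("author", value)
    return out


print(map_author({"title": "Report", "dc.creator": "Mark"}))
```

With one consistent `author` field, documents from different sources can share a single facet or filter.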

Hydra - example

In the search result…

Hydra is Open Source

http://findwise.github.com/Hydra/

METHODS FOR CONTENT PROCESSING

Named entity recognition – statistical classifiers

•  OpenNLP (http://opennlp.apache.org/)

•  Markov chains

•  Mallet (http://mallet.cs.umass.edu/)

•  Conditional random fields

Input:

Mark has been in London since Mary dumped him.

Output:

<person>Mark</person> has been in <place>London</place> since <person>Mary</person> dumped him.
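The output format above can be produced from a list of recognised spans. The sketch below assumes the classifier has already decided which character ranges are names; `find_span` and `annotate` are hypothetical helpers, and the spans are supplied by hand rather than predicted.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def find_span(text: str, surface: str, label: str) -> Span:
    """Locate the first occurrence of surface in text and label it."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def annotate(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span in <label>...</label> markup."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Mark has been in London since Mary dumped him."
spans = [
    find_span(text, "Mark", "person"),
    find_span(text, "London", "place"),
    find_span(text, "Mary", "person"),
]
tagged = annotate(text, spans)
print(tagged)
```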

Classifiers - training

•  Training set - language corpora

•  NKJP, the National Corpus of Polish (http://nkjp.pl/), for Polish

A set of manually tagged texts in the given language, preferably from various sources and on various topics.

Tokens    PoS tags     Name tags
He        Pronoun      O
went      Verb         O
to        Prep.        O
United    Adjective    Place
States    Noun         Place
.         Interp       O

Classifiers – supervised training

•  Training input

•  Features extracted from each token

token: text, PoS tag, token class

prev token: text, PoS tag, token class

next token: text, PoS tag, token class

previous tags assigned

•  Token class examples: lowercase alphabetic, digits, contains number and letter, contains number and a hyphen, all caps, all caps with dots in between ...
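The feature set listed above can be sketched as a function that turns one token position into a dictionary of features. This is a minimal illustration, not the feature extractor of OpenNLP or Mallet: the function names, the feature keys, the `<pad>` and `<start>` markers and the extra `initial_cap` class are all assumptions.

```python
import re
from typing import Dict, List


def token_class(token: str) -> str:
    """Assign a coarse shape class to a token (classes follow the list above)."""
    if token.isalpha() and token.islower():
        return "lowercase"
    if token.isdigit():
        return "digits"
    if re.fullmatch(r"(?:[A-Z]\.)+", token):
        return "all_caps_dotted"
    if token.isalpha() and token.isupper():
        return "all_caps"
    has_digit = any(ch.isdigit() for ch in token)
    has_letter = any(ch.isalpha() for ch in token)
    if has_digit and has_letter:
        return "letters_and_digits"
    if has_digit and "-" in token:
        return "digits_and_hyphen"
    if token[:1].isupper() and token[1:].islower():
        return "initial_cap"  # assumption: not named on the slide
    return "other"


def token_features(tokens: List[str], pos_tags: List[str], i: int,
                   previous_labels: List[str]) -> Dict[str, str]:
    """Features for token i: current, previous and next token plus earlier labels."""
    feats: Dict[str, str] = {}
    for prefix, j in (("cur", i), ("prev", i - 1), ("next", i + 1)):
        inside = 0 <= j < len(tokens)
        feats[f"{prefix}.text"] = tokens[j].lower() if inside else "<pad>"
        feats[f"{prefix}.pos"] = pos_tags[j] if inside else "<pad>"
        feats[f"{prefix}.class"] = token_class(tokens[j]) if inside else "<pad>"
    # Only the two most recent labels are used here (an arbitrary choice).
    feats["prev_labels"] = "|".join(previous_labels[-2:]) or "<start>"
    return feats


tokens = ["He", "went", "to", "United", "States", "."]
pos_tags = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
features = token_features(tokens, pos_tags, 3, ["O", "O", "O"])
print(features)
```

A statistical classifier is then trained on pairs of such feature dictionaries and the manually assigned name tags.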

•  Training output

•  <place> <location> <person>

•  <B-place> <I-place> <L-place> <U-place>
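The second output scheme marks the position of each token inside a name. Reading the prefixes as Begin, Inside, Last and Unit (single-token name) is an assumption based on the common BILOU convention. The sketch below converts plain per-token labels into that scheme, assuming that adjacent tokens with the same label belong to the same name.

```python
from typing import List


def to_bilou(labels: List[str]) -> List[str]:
    """Convert plain labels ('O' or a name type) to B-/I-/L-/U- prefixed labels."""
    encoded = []
    last_index = len(labels) - 1
    for i, label in enumerate(labels):
        if label == "O":
            encoded.append("O")
            continue
        starts = i == 0 or labels[i - 1] != label
        ends = i == last_index or labels[i + 1] != label
        if starts and ends:
            prefix = "U"   # single-token name
        elif starts:
            prefix = "B"   # first token of a longer name
        elif ends:
            prefix = "L"   # last token of a longer name
        else:
            prefix = "I"   # token inside a longer name
        encoded.append(f"{prefix}-{label}")
    return encoded


# "He went to United States ." from the training table.
print(to_bilou(["O", "O", "O", "place", "place", "O"]))
```

Position-aware labels let the classifier learn where names begin and end, at the cost of more output classes.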

Classifiers – approaches

„Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie” (Polish: “The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton”)

Location? Organisation name? Person name?

•  One classifier for all name-types

•  faster

•  automatically resolves conflicts

•  One classifier per name-type

•  slower, memory consuming

•  provides more information
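The trade-off can be illustrated with per-type confidence scores. In the sketch below each name type has its own score per token; `candidates` keeps every type above a threshold (the extra information), while `merge_per_type` forces one label per token (the conflict resolution a single classifier does implicitly). The function names, the threshold and all scores are invented for illustration and do not come from a trained model.

```python
from typing import Dict, List

Scores = Dict[str, List[float]]  # name type -> one confidence score per token


def candidates(scores: Scores, i: int, threshold: float = 0.5) -> List[str]:
    """All name types whose classifier is confident about token i."""
    return sorted(t for t, s in scores.items() if s[i] >= threshold)


def merge_per_type(scores: Scores, threshold: float = 0.5) -> List[str]:
    """Pick the single best-scoring type per token, or 'O' if none is confident."""
    length = len(next(iter(scores.values())))
    merged = []
    for i in range(length):
        best_type = max(scores, key=lambda t: scores[t][i])
        merged.append(best_type if scores[best_type][i] >= threshold else "O")
    return merged


tokens = ["Jana", "Nowaka", "organizuje", "turniej", "w", "Sheratonie"]
scores = {  # invented numbers, for illustration only
    "person":       [0.9, 0.8, 0.1, 0.0, 0.0, 0.2],
    "organisation": [0.6, 0.6, 0.1, 0.1, 0.0, 0.3],
    "location":     [0.1, 0.1, 0.0, 0.0, 0.1, 0.7],
}
print(candidates(scores, 0))
print(merge_per_type(scores))
```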

EXAMPLES

Naive approach

Person names often overlap with location names:

- Kazimierz

- Washington

Company names may come from common language:

- Oracle

- Dialog

Conclusion: dictionaries are not enough without contextual analysis
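The weakness of the naive approach shows up as soon as the dictionaries are queried. In the sketch below the three dictionaries are tiny hand-made samples built from the names above, and `dictionary_lookup` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Tiny hand-made dictionaries, for illustration only.
DICTIONARIES: Dict[str, Set[str]] = {
    "person": {"Kazimierz", "Washington", "Mary"},
    "location": {"Kazimierz", "Washington", "London"},
    "company": {"Oracle", "Dialog"},
}


def dictionary_lookup(token: str) -> List[str]:
    """Return every name type whose dictionary contains the token."""
    return sorted(name_type for name_type, entries in DICTIONARIES.items() if token in entries)


print(dictionary_lookup("Washington"))
print(dictionary_lookup("Oracle"))
```

"Washington" comes back with two types and nothing in the lookup says which one is right, while "Oracle" is always reported as a company even when the text uses it as an ordinary word. Only the surrounding context can settle either case.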

Findwise implementation

QUESTIONS?

Paweł Wróblewski pawel.wroblewski@findwise.com

Marcin Goss marcin.goss@findwise.com
