CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS Introduction to text mining – Warsaw University of Technology

Upload: findwise

Post on 12-Jul-2015


Page 1: Content Processing Architecture and Applications - Introduction to Text Mining

CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS

Introduction to text mining – Warsaw University of Technology

Page 2: Content Processing Architecture and Applications - Introduction to Text Mining

Plan

Findwise – who we are, what we do

What is content?

Why content processing is important

Content processing and information retrieval

Technology for content processing

Methods for content processing

Examples of usage

Page 3: Content Processing Architecture and Applications - Introduction to Text Mining

Findwise – Search Driven Solutions

• Founded in 2005

• Offices in Sweden, Denmark, Norway, Poland and Australia

• 90 employees

Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value.

• Paweł Wróblewski & Marcin Goss

Page 4: Content Processing Architecture and Applications - Introduction to Text Mining

WHAT IS CONTENT?

Page 5: Content Processing Architecture and Applications - Introduction to Text Mining

Content ≥ Information

From the business point of view INFORMATION is the key to success.

“Information can only be an asset when it enables a task to be completed.” “The value is in the outcome of the task, not in the information itself.” Martin White

Employee productivity (The hidden cost… IDC April 2006):

“the cost for wasted time on the part of professional searching, but not finding relevant information, amounts to $5.3 million annually for an enterprise with 1000 knowledge workers.”

Page 6: Content Processing Architecture and Applications - Introduction to Text Mining

Information is hidden

Big Data is commonly described with the 3 Vs:

1. Variety

• Human generated vs. machine generated

• Text & multimedia

2. Volume

• Up to petabytes

3. Velocity

• Stream of data

• GBs per day, hour, minute, second

Page 7: Content Processing Architecture and Applications - Introduction to Text Mining

Information lives in the context

The right information is hidden in text.

Text forms a context:

word -> sentence -> paragraph -> chapter -> document

Content processing is about extracting required information from the context.
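The context hierarchy above can be made concrete with a minimal sketch that splits a text into paragraphs, sentences and words. The function name `split_context` and the regular expressions are illustrative assumptions, not part of the original slides; the chapter level is left out, and the sentence splitter is deliberately naive (it breaks on ".", "!" or "?" followed by whitespace).

```python
import re


def split_context(document: str) -> list[list[list[str]]]:
    """Split a document into paragraphs -> sentences -> words."""
    # Paragraphs are assumed to be separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", document.strip()) if p]
    result = []
    for paragraph in paragraphs:
        # Naive sentence boundary: whitespace that follows ., ! or ?
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        # Words are runs of word characters; punctuation is dropped.
        result.append([re.findall(r"\w+", s) for s in sentences])
    return result


sample = "Venus Williams won. She advanced quickly.\n\nThe German lost."
structure = split_context(sample)
```

Each word in `structure` can be addressed by its paragraph, sentence and position, which is the kind of context later processing steps rely on.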

Page 8: Content Processing Architecture and Applications - Introduction to Text Mining

WHY IS CONTENT PROCESSING IMPORTANT?

Page 9: Content Processing Architecture and Applications - Introduction to Text Mining

Why content processing is important

To get the right information in seconds

• Usage of faceted search

To tag a large document set consistently

• Usage of an automatic extractor

To build a semantic database

• Extraction of concepts with linkage to a taxonomy/ontology

To perform document classification

• Extraction of entities with grouping / clustering

Examples from publicly available websites [live show]

Page 10: Content Processing Architecture and Applications - Introduction to Text Mining

Conclusion

Content processing is a set of techniques enabling text analytics.

Content processing leverages the value of data stored in companies, improving data consumption.

Content processing used with search engines helps find information in any context.

• Enterprise Findability

• E-commerce

Page 11: Content Processing Architecture and Applications - Introduction to Text Mining

TECHNOLOGY FOR CONTENT PROCESSING

Page 12: Content Processing Architecture and Applications - Introduction to Text Mining

General architecture of search engines

Page 13: Content Processing Architecture and Applications - Introduction to Text Mining

Content Processing – the idea

[Diagram: processing stages applied to a document on its way to the index. Labels: Lemmas (tenses, forms), Spell Checking, Synonyms, Format Conversion, Language Detection, Entities, Custom PLUG-IN, Taxonomy Classification, Vectorizer, Geography, Companies, People, Scopifier, index]

Document:

PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.

The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.

Input: byte stream

Output: structured document ready to be indexed
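As a rough illustration of "byte stream in, structured document out", the sketch below decodes raw bytes and pulls the news dateline into separate fields. The field names (`city`, `agency`, `body`), the function name and the dateline pattern are assumptions made for this example only; a real pipeline would apply many more stages.

```python
import re

# Hypothetical dateline pattern: "CITY (Agency) - " at the start of the text.
DATELINE = re.compile(r"^(?P<city>[A-Z][A-Z ]+) \((?P<agency>[^)]+)\) - ")


def to_structured_document(raw: bytes, encoding: str = "utf-8") -> dict:
    """Decode a byte stream and split it into indexable fields."""
    text = raw.decode(encoding, errors="replace").strip()
    doc = {"body": text}
    match = DATELINE.match(text)
    if match:
        doc["city"] = match.group("city").title()
        doc["agency"] = match.group("agency")
        # Keep only the text after the dateline as the body.
        doc["body"] = text[match.end():]
    return doc


raw = b"PARIS (Reuters) - Venus Williams raced into the second round of the French Open."
doc = to_structured_document(raw)
```

With fields like these in place, a search engine can filter or facet on `city` and `agency` instead of treating the whole text as one opaque string.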

Page 14: Content Processing Architecture and Applications - Introduction to Text Mining

Content Processing – the implementation

Hydra is used to refine content before it reaches the index. Every document fetched from a source runs through a targeted pipeline made up of a number of stages. A stage can be thought of as an "app" in the App Store or the Android Market. Findwise has created a large number of such stages, each with one small purpose: to enhance the content of the item. Additional stages can be created to serve customer-specific functionality.
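The idea of small, single-purpose stages chained into a pipeline can be sketched in a few lines. This is not Hydra's API: the `Stage` type, `run_pipeline` and both example stages are hypothetical names chosen for illustration, and documents are modelled as plain dictionaries.

```python
from typing import Callable

Document = dict[str, object]
Stage = Callable[[Document], Document]


def trim_body(doc: Document) -> Document:
    """Stage: strip surrounding whitespace from the body field."""
    return {**doc, "body": str(doc.get("body", "")).strip()}


def add_word_count(doc: Document) -> Document:
    """Stage: add a word_count field derived from the body."""
    return {**doc, "word_count": len(str(doc.get("body", "")).split())}


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


processed = run_pipeline({"body": "  Hydra refines content  "}, [trim_body, add_word_count])
```

Because every stage has the same shape (document in, document out), a customer-specific stage can be added to the list without touching the others.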

Page 15: Content Processing Architecture and Applications - Introduction to Text Mining

Hydra - example

Select stages to use in the pipeline. The left column corresponds to the "market", and the right column shows the stages in use.

Page 16: Content Processing Architecture and Applications - Introduction to Text Mining

Hydra - example

Modify the format of the date to only include the year.
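A stage of this kind might look like the sketch below. The field names `date` and `year`, the input format `%Y-%m-%d` and the function name are assumptions; the slide does not show how the stage is actually configured in Hydra.

```python
from datetime import datetime


def year_stage(doc: dict, source_field: str = "date",
               target_field: str = "year", fmt: str = "%Y-%m-%d") -> dict:
    """Add a year-only field derived from a full date field."""
    value = doc.get(source_field)
    if not isinstance(value, str):
        return doc  # nothing to convert
    try:
        year = datetime.strptime(value, fmt).year
    except ValueError:
        return doc  # leave documents with unparseable dates untouched
    return {**doc, target_field: str(year)}


doc = year_stage({"title": "Report", "date": "2013-05-27"})
```

Documents whose date is missing or malformed pass through unchanged, so one bad record does not stop the pipeline.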

 

 

Page 17: Content Processing Architecture and Applications - Introduction to Text Mining

Hydra - example

The new year metadata can be used as a facet
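What a facet on the year field amounts to can be sketched as a count per value plus a filter. A search engine computes this over its index; the in-memory version below, with the hypothetical helper `facet_counts` and made-up sample documents, only illustrates the idea.

```python
from collections import Counter


def facet_counts(docs: list[dict], field: str) -> Counter:
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


docs = [
    {"title": "A", "year": "2012"},
    {"title": "B", "year": "2013"},
    {"title": "C", "year": "2013"},
    {"title": "D"},  # no year metadata, so it is not counted
]
counts = facet_counts(docs, "year")

# Selecting a facet value narrows the result list.
only_2013 = [d["title"] for d in docs if d.get("year") == "2013"]
```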

Page 18: Content Processing Architecture and Applications - Introduction to Text Mining

Hydra - example

Map every author field to a metadata field called author.

Pipeline  A  

 

 

 

Pipeline  B  
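The mapping step can be sketched as follows, assuming that pipelines A and B handle sources that name the author field differently (the slide shows only the pipeline labels). The alias names and the function `map_author` are hypothetical.

```python
# Hypothetical source-specific names for the author field.
AUTHOR_ALIASES = ("author", "Author", "creator", "dc:creator", "written_by")


def map_author(doc: dict, aliases: tuple[str, ...] = AUTHOR_ALIASES) -> dict:
    """Copy the first non-empty alias into a single 'author' field."""
    for name in aliases:
        value = doc.get(name)
        if value:
            # Drop all source-specific variants and keep one canonical field.
            mapped = {k: v for k, v in doc.items() if k not in aliases}
            mapped["author"] = value
            return mapped
    return doc


from_a = map_author({"creator": "Jan", "title": "X"})
from_b = map_author({"written_by": "Anna"})
```

After this stage both sources expose the same `author` field, so a single facet or filter covers all of them.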

 

 

 

Page 19: Content Processing Architecture and Applications - Introduction to Text Mining

Hydra - example

In the search result…

 

 

Page 20: Content Processing Architecture and Applications - Introduction to Text Mining

Hydra is Open Source

http://findwise.github.com/Hydra/

Page 21: Content Processing Architecture and Applications - Introduction to Text Mining

METHODS FOR CONTENT PROCESSING

Page 22: Content Processing Architecture and Applications - Introduction to Text Mining

Named entity recognition – statistical classifiers

• OpenNLP (http://opennlp.apache.org/)

• Markov chains

• Mallet (http://mallet.cs.umass.edu/)

• Conditional random fields

Input:

Mark has been in London since Mary dumped him.

Output:

<person>Mark</person> has been in <place>London</place> since <person>Mary</person> dumped him.
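The output format above can be reproduced from character spans with a small helper. The spans here are written by hand and merely stand in for what a trained classifier would return; this is not the OpenNLP or Mallet API, and `tag_entities` is a hypothetical name.

```python
def tag_entities(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Wrap non-overlapping (start, end, label) character spans in inline tags."""
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        parts.append(text[cursor:start])  # untouched text before the entity
        parts.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    parts.append(text[cursor:])  # remaining text after the last entity
    return "".join(parts)


sentence = "Mark has been in London since Mary dumped him."
# Hand-made spans standing in for classifier output.
found = [("Mark", "person"), ("London", "place"), ("Mary", "person")]
spans = [(sentence.index(w), sentence.index(w) + len(w), label) for w, label in found]
tagged = tag_entities(sentence, spans)
```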

 

Page 23: Content Processing Architecture and Applications - Introduction to Text Mining

Classifiers - training

• Training set - language corpora

• (http://nkjp.pl/) for Polish

A set of manually tagged texts in a given language, preferably from various sources and on various topics.

Tokens    PoS tags     Name tags

He        Pronoun      O

went      Verb         O

to        Prep.        O

United    Adjective    Place

States    Noun         Place

.         Interp       O

Page 24: Content Processing Architecture and Applications - Introduction to Text Mining

Classifiers – supervised training

• Training input

• Features extracted from each token

token: text, PoS tag, token class

prev token: text, PoS tag, token class

next token: text, PoS tag, token class

previous tags assigned

• Token class examples

lowercase alphabetic, digits, contains number and letter, contains number and a hyphen, all caps, all caps with dots in between ...

• Training output

• <place> <location> <person>

• <B-place> <I-place> <L-place> <U-place>
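The training input and output described above can be sketched as follows. The class names, the feature keys and the two-tag history are assumptions chosen for this example; the B/I/L/U prefixes are read here as begin, inside, last and unit (single-token) markers, and adjacent entities of the same type would be merged by this simple converter.

```python
import re


def token_class(token: str) -> str:
    """Assign one of the token classes listed on the slide, else 'other'."""
    if token.isalpha() and token.islower():
        return "lowercase"
    if token.isdigit():
        return "digits"
    if re.fullmatch(r"(?:[A-Z]\.)+", token):
        return "caps_with_dots"
    if token.isalpha() and token.isupper():
        return "all_caps"
    if re.fullmatch(r"\d+(?:-\d+)+", token):
        return "number_and_hyphen"
    if any(c.isdigit() for c in token) and any(c.isalpha() for c in token):
        return "number_and_letter"
    return "other"


def features(tokens: list[str], pos_tags: list[str], i: int,
             previous_tags: list[str]) -> dict:
    """Features for token i: current, previous and next token plus tag history."""
    feats = {}
    for offset, prefix in ((0, "cur"), (-1, "prev"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"{prefix}_text"] = tokens[j]
            feats[f"{prefix}_pos"] = pos_tags[j]
            feats[f"{prefix}_class"] = token_class(tokens[j])
        else:
            feats[f"{prefix}_text"] = "<pad>"  # sentence boundary
    feats["prev_tags"] = tuple(previous_tags[-2:])
    return feats


def to_bilou(labels: list[str]) -> list[str]:
    """Convert per-token labels ('O' or a type name) to B-/I-/L-/U- tags."""
    tags = []
    for i, label in enumerate(labels):
        if label == "O":
            tags.append("O")
            continue
        starts = i == 0 or labels[i - 1] != label
        ends = i == len(labels) - 1 or labels[i + 1] != label
        prefix = "U" if starts and ends else "B" if starts else "L" if ends else "I"
        tags.append(f"{prefix}-{label}")
    return tags


tokens = ["He", "went", "to", "United", "States", "."]
pos = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
labels = ["O", "O", "O", "place", "place", "O"]
bilou = to_bilou(labels)
feats = features(tokens, pos, 3, bilou[:3])
```

The pairs of `features(...)` and the corresponding tag from `to_bilou(...)` are what a sequence classifier would be trained on.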

 

Page 25: Content Processing Architecture and Applications - Introduction to Text Mining

Classifiers – approaches

„Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie” (English: "The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton")

Location? Organisation name? Person name?

• One classifier for all name types

• faster

• automatically resolves conflicts

• One classifier per name type

• slower, memory consuming

• provides more information
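With one classifier per name type, overlapping candidates have to be reconciled in a separate step. One possible strategy, sketched below, keeps the highest-scoring span and drops anything that overlaps it. The strategy, the function name and the scores are all assumptions invented for this illustration; the slides do not say how conflicts are resolved.

```python
def resolve_conflicts(candidates: list[tuple[int, int, str, float]]):
    """Greedy resolution over (start, end, label, score) token spans, end exclusive."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c[3], reverse=True):
        start, end = cand[0], cand[1]
        # Keep the candidate only if it overlaps nothing chosen so far.
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append(cand)
    return sorted(chosen)


tokens = ["Warszawskie", "Koło", "Brydżowe", "im.", "Jana", "Nowaka",
          "organizuje", "turniej", "w", "Sheratonie"]
# Made-up outputs of three per-type classifiers (scores are illustrative only).
candidates = [
    (0, 6, "organisation", 0.9),  # Warszawskie Koło Brydżowe im. Jana Nowaka
    (4, 6, "person", 0.7),        # Jana Nowaka
    (0, 1, "location", 0.4),      # Warszawskie
    (9, 10, "location", 0.8),     # Sheratonie
]
resolved = resolve_conflicts(candidates)
```

The dropped candidates are exactly the extra information a per-type setup provides, so an application may prefer to keep them as nested annotations instead of discarding them.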

 

Page 26: Content Processing Architecture and Applications - Introduction to Text Mining

EXAMPLES

Page 27: Content Processing Architecture and Applications - Introduction to Text Mining

Naive approach

People's names often coincide with location names:

- Kazimierz

- Washington

Company names may be words from everyday language:

- Oracle

- Dialog

Conclusion: dictionaries are not enough without contextual analysis
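The ambiguity is easy to reproduce with a plain dictionary lookup. The tiny dictionaries below are hypothetical and built from the slide's own examples; a lookup of this kind sees only the token, never its neighbours.

```python
# Hypothetical gazetteers built from the examples on this slide.
PEOPLE = {"Kazimierz", "Washington"}
PLACES = {"Kazimierz", "Washington"}
COMPANIES = {"Oracle", "Dialog"}


def dictionary_lookup(token: str) -> list[str]:
    """Return every name type whose dictionary contains the token."""
    labels = []
    if token in PEOPLE:
        labels.append("person")
    if token in PLACES:
        labels.append("place")
    if token in COMPANIES:
        labels.append("company")
    return labels


ambiguous = dictionary_lookup("Washington")
# At the start of a sentence the ordinary word "Dialog" is capitalised too,
# so the lookup reports a company whether or not one is meant.
false_hit = dictionary_lookup("Dialog")
```

The lookup returns two labels for "Washington" and cannot tell the company "Dialog" from the ordinary word, which is why the surrounding tokens have to be taken into account.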

Page 28: Content Processing Architecture and Applications - Introduction to Text Mining

Findwise implementation

Page 29: Content Processing Architecture and Applications - Introduction to Text Mining

QUESTIONS?

Page 30: Content Processing Architecture and Applications - Introduction to Text Mining

Paweł Wróblewski pawel.wroblewski@findwise.com

Marcin Goss marcin.goss@findwise.com