tms 09 text mining creating semantics in the real...
TRANSCRIPT
Prof. Dr. Stefan Wrobel
Dr. Gerd Paass, Andreas Schäfer, Dr. Stefan Eickeler
Text Mining: Creating Semantics in the Real World
Fraunhofer IAIS: Intelligent Analysis and Information Systems
250 people: scientists, project engineers, technical and administrative staff, students
Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
Joint research groups and cooperation with
Core research areas:
Machine learning/data mining
Multimedia pattern recognition
Visual Analytics
Process Intelligence
Adaptive robotics
Cooperating objects
Directors: T. Christaller, S. Wrobel (exec.)
Where is all the knowledge we lost with information?
T. S. Eliot
Thomas Stearns Eliot, OM (September 26, 1888 – January 4, 1965)
US-born British poet, dramatist and literary critic
Brainyquote.com
Outline
Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar
Conclusion
Internet Trends
Convergence
Ubiquitous intelligent systems
Users as producers
Users as producers
Web 2.0, Social Web, Crowdsourcing
Exploding growth of content
Media providers transform from content to confidence providers, competing
with social communities
Users expect full interactivity and control
Quality control, confidence, choice and searching are becoming central
Drowning in Data …
Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes
Size of digital universe [IDC]:
2007: 161 exabytes
2010: 998 exabytes
The data iceberg
Structured data (20%): database tables, Excel spreadsheets, other data with fixed structure
Unstructured data (80%): email, Notes, Word documents, PDF, PowerPoint, other text, images, video, audio
Drowning in Unstructured Data …
Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes
… and need meaning!
Semantics: The need for meaning
Knowledge will be the driving force of business excellence
Quality of services increasingly distinguished by the amount of knowledge they can use
Enormous savings if existing unstructured documents could be used without needing to structure them first
cf. failures of knowledge management!
The challenge of semantics
[Diagram: a very large set of (electronic) documents is turned into an intelligent service, either by manual structuring or by intelligent data and text mining technologies]
Text Mining is cool, since … the entire world works for us!
215,675,903 websites (Netcraft, March 2009)
19 200 000 000 webpages (Yahoo, Aug 2005)
29 700 000 000 webpages (boutell.com, Jan 2007)
Google index: 26 million (1998), 1 billion (2000), 1 trillion (1,000,000,000,000) unique URLs (25 July 2008)
=> perhaps quadrillions of words (images, videos) …
And most of them put together meaningfully (somewhat)!
=> smart algorithms can build on that.
The basic idea
If two words occur frequently in the same context
- page, paragraph, sentence, part-of-speech
Then there must be some semantic relation between them
Add in a lot of statistics, algorithms, intelligence…
AND YOU CAN DO A LOT!
raw material (web, documents, …)
+ correlations and statistics
+ intelligent data mining algorithms
You can create (a bit of) semantics!
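The basic idea (words that often share a context are probably related) can be sketched with a simple co-occurrence statistic. The following is a minimal sketch, assuming sentence-level contexts and pointwise mutual information (PMI) as the statistic; the talk does not name a specific measure, and the function name and toy corpus are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(sentences):
    """Return PMI scores for all word pairs that share at least one sentence."""
    word_counts = Counter()   # number of sentences containing each word
    pair_counts = Counter()   # number of sentences containing both words
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    n = len(sentences)
    scores = {}
    for (a, b), joint in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), probabilities estimated per sentence
        scores[(a, b)] = math.log2(joint * n / (word_counts[a] * word_counts[b]))
    return scores

# Invented toy corpus: each string is one context (sentence).
toy_corpus = [
    "bayern wins the match",
    "bayern loses the match",
    "the chancellor gives a speech",
    "the chancellor visits berlin",
]
pmi = cooccurrence_pmi(toy_corpus)
print(pmi[("bayern", "match")])  # pair that always occurs together
print(pmi[("bayern", "the")])    # "the" occurs everywhere, so it carries no signal
```

A positive score means the two words appear together more often than chance would suggest; a word that appears in every context scores zero with everything.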
Automated Clustering [Paass 07]
100 000 documents
<document>
…
<title>Bayern München verlor Tabellenführung und Elber beim 1 : 1 in Wolfsburg</title>
<text>Ausgerechnet der VfL Wolfsburg hat den FC Bayern München vom Thron der Fußball - Bundesliga gestoßen .
Mit dem 1 : 1 ( 0 : 1 ) gelang den Wolfsburgern am Samstag der erste Punkt im sechsten Spiel gegen den
Deutschen Rekordmeister . Durch das Remis und den gleichzeitigen Sieg von Konkurrent Bayer Leverkusen
beim TSV 1860 München verlor der FC Bayern die Tabellenführung . Carsten Jancker ( 29 . ) hatte die Gäste in
Führung gebracht . Doch vor 20 400 Zuschauern im ausverkauften VfL - Stadion wurden die Bayern für ihre
pomadige Spielweise durch den Wolfsburger Ausgleichstreffer von Andrzej Juskowiak ( 60 . ) bestraft . Zudem
verlor das Team von Trainer Ottmar Hitzfeld auch noch Stürmer Giovane Elber ( 80 . ) . Er sah wegen einer
Tätlichkeit gegen VfL - Abwehrspieler Holger Ballwanz die Rote Karte . Die Bayern gingen ersatzgeschwächt in
die Partie . Vor allem das Fehlen des verletzten Regisseurs Stefan Effenberg und des ebenfalls angeschlagenen
Mehmet Scholl machte sich bemerkbar . Die Wolfsburger mussten weiter auf die Abwehrspieler Claus Thomsen
und Thomas Hengen sowie den gesperrten Waldemar Kryger verzichten . Die Münchener konnten ihre Ausfälle
anfangs besser kompensieren . Aus einer gestärkten Deckung , die vor der Pause nur selten von den
Wolfsburger Stürmern Juskowiak und Jonathan Akpoborie gefordert wurde , kontrollierten die Bayern die
Partie . Mit ihrer Taktik hatten sie nach knapp einer halben Stunde Erfolg : Jancker spitzelte den Ball nach
einem abgefälschten Freistoß von Michael Tarnat ins Tor . Der Brasilianer Paolo Sergio ( 14 . ) hätte sogar schon
früher sein Team in Führung schießen können . Doch traf er aus 14 m nur die Oberkante der Latte des VfL -
Tores . Die Gastgeber besaßen nur eine Möglichkeit in der ersten Halbzeit , als der starke Spielmacher Dorinel
Munteanu ( 37 . ) mit einem Schuss an dem großartig reagierenden Nationaltorhüter Oliver Kahn scheiterte .
Nach dem Wechsel wurden die Wolfsburger mutiger und munterer . Sie übernahmen langsam das Kommando .
Beim Ausgleichstreffer durch Juskowiak half die Bayern - Deckung allerdings mit . Samuel Kuffour verlor den
Ball an den polnischen Nationalspieler , Juskowiak zog sofort ab und ließ dem besten Bayern - Spieler Kahn
keine Chance . Danach bemühten sich die Münchner noch einmal und erhöhten den Druck . Doch klare
Möglichkeiten besaßen sie nicht mehr . In der hektischen Schlussphase verlor Elber die Nerven , so dass die
Bayern Glück hatten , in Unterzahl nicht auch noch zu verlieren .</text>
<dpa_TextEndCode>dpa yyni ce jo</dpa_TextEndCode>
</document>
…
Unsupervised hierarchical term clustering: dpa data
Team sport: Spiel Bundesliga Team Trainer Sieg Mannschaft Niederlage Samstag Platz Saison Erfolg Punkte Pokal Nationalspieler …
  Not football: Finale Frankfurt deutsche Meister Hamburg Zuschauer Zuschauern Männer Halle WM Titelverteidiger Final EM …
    Basketball + Boxing: Basketball Berlin Weltmeister Bonn Kampf K Hagen Trier Würzburg LBA Playoff Runde Box Berliner Klitschko Titel …
    Handball: Kiel Handball Magdeburg Flensburg HSG VfL TV THW Tore Bad Wuppertal Lemgo Bundesliga Handewitt …
  Football: FC Trainer Fußball München Spieler Bayern Mannschaft Saison Hertha Stürmer Stadion Spiel SV Dortmund Coach …
    German League: Minute Tor VfL Schiedsrichter Zuschauer Minuten Führung Tore Hansa Eintracht Schalke Bundesliga Karte Wolfsburg …
    European League: Bayern League Champions Fußball UEFA United Hinspiel Cup Manager Vertrag Leeds Club Fans Real Hitzfeld …
Text Mining Market Size
„The text mining market has roughly $50-100 million annual product
revenue, and is growing at roughly 40-60% annually.“ [Monash 06, texttechnologies.com]
Sounds small …
But then …
• Several research sites devoted to the technology
So the real market must be somewhere else …
…
The Text Mining Market … is called “Text Analytics”
Primary areas:
Web search, site search
knowledge management, enterprise portals
Information collection, extraction, harvesting
Email handling, security, spam and phishing filtering
Market research
Online advertising
Specialized markets
• litigation, juridical
• Patent search
[cf. Monash/2008]
Application Field Market Research: Germany 1.6 billion, growing
http://www.adm-ev.de/zahlen.html
Both ad-hoc studies and panels can benefit from text mining
Enterprise Search as a text mining market
More than $1.2 billion in 2010
Software revenue in million $ [Gartner 2008]:
Year      2006   2007   2008   2009   2010
Revenue    717    860    989   1108   1219
Companies
Outline
Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Wikinger, Fraunhofer Web
• Structuring and Monitoring: Semantic Map, EmotionRadar
Conclusion
Text Mining Tasks
Document classification, scoring and/or ranking, isolated retrieval
• Assign a class, score or rank to an entire document
In-collection, linked retrieval and organization
• Find documents in a collection
• Link results to other results
Information and relation extraction
• Extract pieces of information, fill particular relations
Overview and monitoring of collections
• Give summary impression of information in a collection or source
Spotting Faked Offers at Internet Auctions
Techniques to sell fakes
• Put faked products on an internet auction platform, e.g. eBay
• Describe product as forged, falsified, e.g. “very similar to XXX”
Aspects
• Infringement of registered trademarks
• Violation of patents
• Enormous sales volume
Counter Measures
Use trainable classifiers
• Compile training set of genuine and faked internet auction offers
• Train classifiers to detect these classes
• Use text, format information, etc. as features
• Use different classifiers for different brands / products
Apply to new internet auction offers
• Ban faked offers from auction
• Update classifiers to new techniques
[Figure: fakes and originals in a feature space (x1, x2), separated by a hyperplane]
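The picture of fakes and originals separated by a hyperplane can be sketched with a small linear classifier. The following is a minimal sketch, assuming a perceptron over bag-of-words features; the talk does not say which classifier was used, and the example offers and the brand name "brandx" are invented for illustration.

```python
from collections import defaultdict

def featurize(text):
    """Bag-of-words features: the set of lowercased tokens in the offer text."""
    return set(text.lower().split())

def train_perceptron(examples, epochs=10):
    """examples: list of (text, label) with label +1 (fake) or -1 (original)."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = bias + sum(weights[f] for f in feats)
            if label * score <= 0:      # wrong side of the hyperplane: adjust it
                for f in feats:
                    weights[f] += label
                bias += label
    return weights, bias

def predict(weights, bias, text):
    """Return +1 (fake) or -1 (original) for a new offer text."""
    score = bias + sum(weights.get(f, 0.0) for f in featurize(text))
    return 1 if score > 0 else -1

# Invented training offers for one hypothetical brand.
training = [
    ("watch very similar to brandx", 1),
    ("replica brandx bag cheap", 1),
    ("original brandx watch with receipt", -1),
    ("genuine brandx bag boxed", -1),
]
weights, bias = train_perceptron(training)
print(predict(weights, bias, "brandx watch very similar to original"))
print(predict(weights, bias, "genuine brandx watch boxed"))
```

Training one such model per brand, as the slide suggests, would simply mean calling `train_perceptron` once per brand-specific training set.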
Results
A classifier was developed and tested
Similar techniques as for spam detection
Good results: F-value >> 90%
The German Federal Court of Justice: Internet auction providers have to filter the auctions using appropriate methods to detect faked offers.
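For readers unfamiliar with the F-value quoted above: it is the harmonic mean of precision and recall. The following is a minimal sketch of the computation; the counts in the usage example are invented and are not the results of the study.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-value from raw counts; beta=1 weights precision and recall equally."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented counts: 90 fakes found, 10 false alarms, 10 fakes missed.
example_f1 = f_measure(90, 10, 10)
print(round(example_f1, 3))
```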
Phishing
E-mail fraud
• Send official-looking e-mail
• Include web link or form
• Ask for confidential information, e.g. password, account details
Attacker uses information to withdraw money, enter computer system, etc.
AntiPhish
Project AntiPhish
• Develop content-based phishing filters
• Include other clues like whitelists
Trainable and adaptive filters
• Adapt to new phishing attacks
• Anticipate attacks
Consortium
Fraunhofer IAIS (DE)
Symantec (GB, IRL)
Tiscali (IT)
Nortel (FR)
K.U. Leuven (BE)
Phishing: Defense Techniques
Workflow
• Obtain training data from email stream
• Extract features
• Estimate and update classifiers and filters
• Integrate new filters into email filtering framework
• Deploy at internet service provider
• Deploy at central wireless packet switch
Approach: Multiple feature sets
Basic Features
Dynamic Markov Chains
DMC Details
Latent Topic Models
Class-Specific Topic Models
Feature Processing and Selection
Test Corpora
Overall Result
The THESEUS research program in Germany [theseus-program.com]
Deutsche Nationalbibliothek
Deutsche Thomson OHG (DTO)
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)
empolis GmbH
Festo AG
Fraunhofer-Gesellschaft (7 Institutes)
Friedrich-Alexander-Universität Erlangen
FZI Forschungszentrum Informatik
Institut für Rundfunktechnik GmbH (IRT)
intelligent views gmbh
Ludwig-Maximilians-Universität (LMU)
moresophy GmbH
LYCOS Europe
mufin GmbH
ontoprise GmbH
SAP AG
Siemens AG
Technische Universität Darmstadt
Technische Universität Dresden
Technische Universität München
Universität Karlsruhe (TH)
Verband Deutscher Maschinen- und Anlagenbau e.V. (VDMA)
[Figure (Wess/07): axes syntactic vs. semantic, single author vs. multiple authors]
The THESEUS use cases
ALEXANDRIA: The Internet Knowledge Platform
CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage
MEDICO: Semantic Image Search in Medicine
ORDO: Personal Ordered Knowledge Management
PROCESSUS: Semantic Business Processes
TEXO: Business Webs in the Internet of Things
CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage
Publishers, libraries, broadcasters, etc. are interested in using, distributing and selling their archive content
In analog form, archives are threatened by deterioration, are not linked, are difficult to use, and are huge.
Goals:
Digitization, optimization of quality, availability
Indexing, semantic and social linking and intelligent search, communities
Rescue of cultural heritage, preventing losses from deterioration
Project duration: until 2012
Showcases Semantic Digital Libraries
225 years Neue Zürcher Zeitung NZZ
GDR music archive German National Library
CONTENTUS Workflow
1 Digitization
2 Automated optimization of quality
3 Automated generation of metadata
4 Semantic linking of metadata
5 Open knowledge networks: user augmentation
6 Semantic access to knowledge and content
Shell: Data generation: registered users / communities, algorithms with acceptable quality. Quality control: self control (see Wikipedia)
Mantle: Controlled quality. Data generation: automatically generated through high-quality algorithms. Quality control: training and improvement of algorithms
Core: Guaranteed quality. Data generation and correction: libraries, museums, universities, experts, etc. Quality control: schooling, rules, advisory boards. Highest stability, highest persistence
Digitization (workflow step 1)
High-throughput methods
Modern book scanners: thousands of pages per day
Almost fully automatic
Data volumes: 70 TB (NZZ), petabytes to exabytes (DNB)
Automated optimization of quality (workflow step 2)
Development of intelligent algorithms for optimizing print, images, sound & movies
Automated generation of presentation formats
Margin removal
Sharpening, straightening
Denoising, declicking
Scratch removal
Automated generation of metadata (workflow step 3)
Structural and content-related metadata
OCR, speech, music, video recognition
Structure analysis and type recognition
Linking with current norms & standards
Semantic linking (workflow step 4)
Link-up with related media
Incorporation of external knowledge sources (metadata systems, Wikipedia, …)
Disambiguation, classification, relation extraction
Determining meaning
The words of natural language are often ambiguous
Example: Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“ (Die Zeit, 18.7.08)
[English: Strauß scoffed about Kohl: “He will never become chancellor.” As common nouns, “Kohl” also means cabbage and “Strauß” means ostrich or bouquet.]
» For each word / term, find a meaning
» Subproblems:
» Part-of-speech recognition: noun, verb, adjective, …
» Named entity recognition: people, places, organizations, …
» Assignment of concepts: plant, bird, politician, …
Named entity recognition
» Analyze the surroundings of words
» “Kohl” in a sentence with “Kanzler” (chancellor): probably “person”
» “Kohl” in a sentence with “kochen” (to cook): probably “vegetable”
Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“
» Statistical model for person names
» Word + Surroundings -> word is a person
» Training using annotated sentences.
» Automatic Recognition of words / phrases that represent people
Conditional Random Field Model
» Observed words X1,…,Xn
» Category of words Y1,…,Yn
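$$p(Y_1,\ldots,Y_n \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \sum_{k=1}^{K} \lambda_k f_k(Y_{t-1}, Y_t, X) \right)$$
» λ_k: weight of property f_k; K: number of properties; Z(X): normalization constant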
» Properties f may depend on two subsequent states and on all observed words
Example
» Property f10293 has value 1,
- if Yt-1 = “PER” and Yt = “PER” and
- Xt has value “Müller”.
Otherwise its value is 0.
[Lafferty, McCallum, Pereira 01]
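The example property can be written down directly as a function of two subsequent states and the observed words. The following is a minimal sketch, assuming a linear-chain CRF with only two labels and brute-force normalization over all label sequences (feasible only for toy inputs); the second property, both weights, and the example sentence are invented for illustration.

```python
import math
from itertools import product

LABELS = ["PER", "O"]

def f_per_per_mueller(y_prev, y_t, words, t):
    """Property from the slide: 1 if Y_{t-1}=PER, Y_t=PER and X_t='Müller', else 0."""
    return 1.0 if y_prev == "PER" and y_t == "PER" and words[t] == "Müller" else 0.0

def f_capitalized_per(y_prev, y_t, words, t):
    """Hypothetical second property: a capitalized word labeled PER."""
    return 1.0 if y_t == "PER" and words[t][:1].isupper() else 0.0

# Pairs (f_k, lambda_k); the weights are invented, normally they are learned.
FEATURES = [(f_per_per_mueller, 2.0), (f_capitalized_per, 0.5)]

def score(labels, words):
    """Sum over positions t and properties k of lambda_k * f_k(Y_{t-1}, Y_t, X)."""
    total = 0.0
    for t in range(len(words)):
        y_prev = labels[t - 1] if t > 0 else "START"
        total += sum(lam * f(y_prev, labels[t], words, t) for f, lam in FEATURES)
    return total

def probability(labels, words):
    """p(Y | X): exp(score) divided by the sum over all label sequences."""
    z = sum(math.exp(score(list(c), words))
            for c in product(LABELS, repeat=len(words)))
    return math.exp(score(labels, words)) / z

words = ["Herr", "Gerd", "Müller", "sagte"]
print(score(["O", "PER", "PER", "O"], words))
print(probability(["O", "PER", "PER", "O"], words))
```

Real CRF implementations replace the brute-force sum with dynamic programming (the forward algorithm), since the number of label sequences grows exponentially with sentence length.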
Modeling of names: features for a CRF model
» Title FirstName Connective LastName
» Properties, recorded for the words xt-2, xt-1, xt, xt+1, xt+2
» Word, stem, part of speech
» Prefix, suffix (3 letters)
» Shape properties: capital character at the beginning, only numbers, contains numbers, mix of capital / non-capital characters, contains hyphens
» LDA topic model class
» Contained in list of first names, contained in list of last names
Work in progress
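The feature list above can be sketched as a function that collects properties for a window of two words on either side of the current word. This is a minimal sketch under stated assumptions: stem, part of speech and LDA topic class are left out because they need external tools, the name lists are passed in by the caller, and all feature names and the exact shape definitions are invented for illustration.

```python
def word_shape_features(word):
    """Shape properties of a single word."""
    return {
        "init_cap": word[:1].isupper(),
        "all_digits": word.isdigit(),
        "has_digit": any(c.isdigit() for c in word),
        # assumed reading of "mix capital / no capital": a capital after position 0
        "mixed_case": any(c.isupper() for c in word[1:]) and any(c.islower() for c in word),
        "has_hyphen": "-" in word,
    }

def token_features(words, t, first_names=frozenset(), last_names=frozenset(), window=2):
    """Features for position t, recorded for words x_{t-2} .. x_{t+2}."""
    feats = {}
    for offset in range(-window, window + 1):
        i = t + offset
        if not 0 <= i < len(words):
            continue  # window reaches past the sentence boundary
        w = words[i]
        feats[f"{offset}:word"] = w.lower()
        feats[f"{offset}:prefix3"] = w[:3]
        feats[f"{offset}:suffix3"] = w[-3:]
        for name, value in word_shape_features(w).items():
            feats[f"{offset}:{name}"] = value
        feats[f"{offset}:in_first_names"] = w in first_names
        feats[f"{offset}:in_last_names"] = w in last_names
    return feats

sentence = ["Dr.", "Helmut", "Kohl", "sagte"]
feats = token_features(sentence, 2, first_names={"Helmut"}, last_names={"Kohl"})
print(feats["0:word"], feats["-1:in_first_names"], feats["0:suffix3"])
```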
Identity of names
» There are several people named “Helmut Kohl”
» Helmut Kohl, born 1930, Chancellor
» Helmut Kohl, born 1943, Referee
» Helmut Kohl, textile merchant
» … 99 further hits in the telephone book
» Identification in Wikipedia
» Compare the words of the Wikipedia article with the text in which “Helmut Kohl” was found
» Similar words -> same person
» Automated assignment: person name -> Wikipedia article
Assignment of similarity from the environment
Simple algorithm for assigning people to Wikipedia articles
» Occurrence in text: Helmut Kohl
» Description using characteristic terms -> u
» Wikipedia article on Kohl
» Description by characteristic terms -> w
» Comparison using a distance metric, for example cosine distance d(w,u)
» Implemented in a prototype
» Further approach: assignment as a classification task
f(w,u) = 0 or 1
Master's thesis
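The comparison step can be sketched as follows: both the text around the mention and each candidate Wikipedia article are turned into term vectors, and the article with the smallest cosine distance wins. This is a minimal sketch, assuming raw term counts as the "characteristic terms"; the article snippets and the mention context are invented stand-ins, not real Wikipedia text.

```python
import math
from collections import Counter

def term_vector(text):
    """Describe a text by its term counts."""
    return Counter(text.lower().split())

def cosine_distance(w, u):
    """d(w, u) = 1 - cosine similarity; 0 for identical direction, 1 for no overlap."""
    dot = sum(w[t] * u[t] for t in w.keys() & u.keys())
    norm_w = math.sqrt(sum(v * v for v in w.values()))
    norm_u = math.sqrt(sum(v * v for v in u.values()))
    if norm_w == 0 or norm_u == 0:
        return 1.0
    return 1.0 - dot / (norm_w * norm_u)

def best_article(mention_context, articles):
    """Pick the article title whose term vector is closest to the mention context."""
    u = term_vector(mention_context)
    return min(articles, key=lambda title: cosine_distance(term_vector(articles[title]), u))

# Invented candidate descriptions for two people with the same name.
articles = {
    "Helmut Kohl (chancellor)": "chancellor cdu politician germany government",
    "Helmut Kohl (referee)": "referee football match whistle",
}
chosen = best_article("the chancellor met the cdu politician", articles)
print(chosen)
```

The classification variant mentioned on the slide would replace `min` over distances with a trained function f(w,u) that outputs 0 or 1 for each candidate pair.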
Semantic Interpretation
Semantic categories currently assigned in the Contentus prototype:
» Names: people, organizations, points in time, places, …
» Assignment to Wikipedia articles
Under development:
» Hypernyms in ontology (GermaNet): supersenses of nouns and verbs
» Clusters of words with similar meaning: topics
» Relations between names / concepts, e.g. “Bertolt Brecht” studied in “München”
» Classes of documents: politics, economy, …
Knowledge store
» Further information for entities that were found in the text
» Dates, publications
» Number of inhabitants, topological relationships
» Social networks
» Who knows whom?
» Who was at the same place at the same time?
» Who influenced whom?
Example: Helmut Kohl
Date of birth: 30.4.1930
Place of birth: Ludwigshafen
Spouse: Hannelore K.
Education: historian
Religion: Catholic
Party: CDU
Example: Berlin
Area: 891 km²
Inhabitants: 3,420,786
GDP: €83.6 billion
Elevation: 34 to 115 m
Latitude: 52° 31′ N
Longitude: 13° 25′ E
Buchmesse Frankfurt 10.Oktober 2008 | 88
Knowledge store: Format
» Factual knowledge as logical expressions:
» <subject> <predicate> <object>
» Semantic-Web-Standards
» RDF
» RDFS
» OWL
» Technical Basis
» Database MySQL
» Triple-Store Jena + Joseki
» Query language
» SPARQL
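The triple format can be illustrated without any Semantic Web tooling. The following is a minimal sketch, assuming an in-memory list of (subject, predicate, object) tuples in place of the Jena triple store and a pattern-matching function in place of SPARQL; the predicate names are invented, and the values are taken from the Helmut Kohl and Berlin examples in this talk.

```python
# Each fact is one <subject> <predicate> <object> triple.
TRIPLES = [
    ("Helmut_Kohl", "birthPlace", "Ludwigshafen"),
    ("Helmut_Kohl", "party", "CDU"),
    ("Helmut_Kohl", "spouse", "Hannelore K."),
    ("Berlin", "areaKm2", "891"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples fitting the pattern; None plays the role of a query variable."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# "Which party does Helmut Kohl belong to?" (object left open as a variable)
party = match(TRIPLES, subject="Helmut_Kohl", predicate="party")
print(party)
# "What do we know about Helmut Kohl?" (predicate and object left open)
print(match(TRIPLES, subject="Helmut_Kohl"))
```

A SPARQL query expresses the same kind of pattern declaratively, with variables in the positions that are `None` here.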
Linking of knowledge
» Semantic integration of data and information from different sources
» DBPedia: an interpreted form of Wikipedia
» GeoNames Ontology: all the places in the world
» Catalogue of the German National Library: books and publications
» Based on open standards
» W3C Semantic Web Stack
» RDF, RDFS, OWL, SPARQL
[Diagram: the sources are linked in a triple store]
Knowledge sources
» DBPedia (www.dbpedia.org)
» GeoNames Ontology
» Already in RDF/OWL-Format
» Person reference database PND
» Topic reference database SWD
» Online catalogue OPAC
» Partial export to RDF
» Found entities in the text
» Identification using Wikipedia
» Linking with DBPedia data via links
Open knowledge networks (workflow step 5)
Further annotations from experts and users
• Completions, corrections
• Cooperation with the ALEXANDRIA project in THESEUS
• Suitable measures to assure high quality of data
Open knowledge networks: The Multiple Shell Model
Outer shell: Open knowledge network. Data generation: registered users / communities, algorithms. Quality control: self control (cf. Wikipedia)
Mantle: Controlled quality. Data generation: algorithms of high quality. Quality control: training and improvement of algorithms
Core: Assured quality. Data generation and correction: libraries, universities, museums, groups of experts, etc. Quality control: fixed rules, committees; maximal stability and persistence
Cf. Wikinger [Bröcker et al. 08]!
Semantic search (workflow step 6: semantic access to knowledge and content)
The knowledge network
• Digital, multimedia data
• Content is semantically linked
• Is enriched from external sources and user groups
Access
• Structure by ontology
• Content relationships become clear
• “Knowledge exploration” is possible
The Contentus Demonstrator
Example of a classified webpage
Match with the document model = 80%
Classification as = Projects
Workflow for semantic processing of documents
[Diagram components: Crawl documents; Pre-processing; Categorization; Entity recognition; Using the document model; Using the structure model; Extracted metadata; Extracted search regions; Search index; Knowledge store]
Watch out for the relaunch in summer 2009!
Emotion Radar
Which issues are important to people, and where are the emotional discussions in blogs and discussion forums?
Goal: market research, …
Emotion Radar example: Looking at two large automobile companies A and B*
Selection and crawling of discussion forums
• Criteria: search engine ranking, size, activity
• Period used: January 2008 to January 2009
• Storage: 2 GB of data, crawled over a period of 7 days
Structure analysis of the discussion forums:
• Manufacturer A:
– Number of postings: 188,487
– Monthly number of new postings: approx. 1,500
– Number of threads: 21,613
– Number of authors: 15,445
• Manufacturer B:
– Number of postings: 406,814
– Monthly number of new postings: approx. 2,700
– Number of threads: 38,758
– Number of authors: 21,919
* anonymized
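The structure figures above can be condensed into simple activity ratios for comparing the two forums. The sketch below uses only the posting, thread, and author counts from the slide. The ratio names and the helper `activity_ratios` are illustrative choices, not measures reported in the talk.

```python
# Counts taken from the slide (manufacturers are anonymized as A and B).
forums = {
    "A": {"postings": 188_487, "threads": 21_613, "authors": 15_445},
    "B": {"postings": 406_814, "threads": 38_758, "authors": 21_919},
}

def activity_ratios(stats):
    # Average thread length and average number of postings per author.
    return {
        "postings_per_thread": stats["postings"] / stats["threads"],
        "postings_per_author": stats["postings"] / stats["authors"],
    }

ratios = {name: activity_ratios(stats) for name, stats in forums.items()}

for name, r in ratios.items():
    print(
        f"Manufacturer {name}: "
        f"{r['postings_per_thread']:.1f} postings/thread, "
        f"{r['postings_per_author']:.1f} postings/author"
    )
```

On these figures, forum B has both longer threads and more postings per author than forum A.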
Case study: Internet postings related to the introduction of a new car model in Germany in 2008
Events marked on the timeline:
• Manufacturer publishes first pictures
• Manufacturer publishes further product features/start of sales
• Cars are delivered
* anonymized
Partially automated emotion analysis shows a mood swing from positive to negative (“in love” -> “angry”)
Emotions over time:
– “in love”
– “hoping”
– “turned off”
– “interested”
– “surprised”
– “angry”
– “angry”, “proud”
Events marked on the timeline:
• Manufacturer publishes first pictures
• Manufacturer publishes further product features/start of sales
• Cars are delivered
* anonymized
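The talk does not say how the emotion labels were assigned, only that the analysis was partially automated. One simple way to picture the automated part is a lexicon lookup that counts emotion cue words per project phase. Everything in the sketch below is an assumption for illustration: the cue words in `EMOTION_LEXICON`, the function names, and the phase labels passed in.

```python
from collections import Counter

# Hypothetical cue-word lexicon (assumption, not taken from the talk).
EMOTION_LEXICON = {
    "love": "in love",
    "beautiful": "in love",
    "hope": "hoping",
    "ugly": "turned off",
    "unbelievable": "angry",
    "disappointed": "angry",
    "proud": "proud",
}

def tag_emotions(posting):
    # Count the emotion labels triggered by cue words in one posting.
    cleaned = (w.strip(".,!?").lower() for w in posting.split())
    return Counter(EMOTION_LEXICON[w] for w in cleaned if w in EMOTION_LEXICON)

def dominant_emotion_by_phase(postings):
    """postings: iterable of (phase, text); returns phase -> most frequent emotion."""
    per_phase = {}
    for phase, text in postings:
        per_phase.setdefault(phase, Counter()).update(tag_emotions(text))
    return {
        phase: counts.most_common(1)[0][0]
        for phase, counts in per_phase.items()
        if counts
    }
```

In a partially automated setting, an analyst would typically review and correct such labels, since a plain lexicon misses irony, negation, and forum slang.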
Topic recognition shows a change in the product features that are discussed, from design to gasoline consumption
Topics and emotions over time:
– Vehicle length, design: “in love”
– Gearshift, efficiency, technology, fuel consumption: “hoping”
– Chrome trim, perceived quality, “giant fish mouth”: “turned off”
– Sunroof, steering wheel, on-board computer, audio system: “interested”
– Prices in the car configurator and price list: “surprised”
– Test drives, fuel consumption: “angry”
– Fuel consumption, folding key, audio: “angry”
– Neighbors: “proud”
Events marked on the timeline: first photos, deliveries
Sample posting: “... hard to believe how much gasoline this small car uses!” (Kalle83)
* anonymized
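A shift in discussed topics like the one on this slide can be illustrated by counting topic mentions per phase. The sketch below is a keyword-matching stand-in, not the topic recognition method used in the talk. The topic names, the keywords in `TOPIC_KEYWORDS`, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical topic keywords (assumption); a real system would curate or learn these.
TOPIC_KEYWORDS = {
    "design": {"design", "length", "chrome"},
    "consumption": {"consumption", "gasoline", "fuel"},
    "price": {"price", "prices", "configurator"},
}

def topics_in(text):
    # A topic counts as mentioned if any of its keywords occurs in the posting.
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {topic for topic, keywords in TOPIC_KEYWORDS.items() if words & keywords}

def topic_counts_by_phase(postings):
    """postings: iterable of (phase, text); returns phase -> Counter of topics."""
    counts = {}
    for phase, text in postings:
        counts.setdefault(phase, Counter()).update(topics_in(text))
    return counts
```

Each posting contributes at most one count per topic, so a long rant about fuel consumption does not outweigh several short postings on the same subject.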
How can manufacturers use these text mining results?
• Short term: e.g. early recognition of relevant topics (fuel consumption) and preparation of an appropriate response (fuel-saving driver training, fuel-efficient tires, proactive communication)
• Long term: continuous monitoring of emotional topics
* anonymized
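Early recognition of a relevant topic can be pictured as watching for topics whose share of postings jumps from one period to the next. The sketch below is one simple way to do that and is not described in the talk: the function names, the `min_increase` threshold of 0.10, and the example counts in the usage note are all illustrative assumptions.

```python
def topic_share(counts, total_postings):
    # Convert raw topic mention counts into shares of all postings in a period.
    return {topic: n / total_postings for topic, n in counts.items()}

def rising_topics(previous, current, min_increase=0.10):
    """Topics whose share of postings grew by at least min_increase."""
    return sorted(
        topic
        for topic, share in current.items()
        if share - previous.get(topic, 0.0) >= min_increase
    )
```

With made-up counts, `rising_topics(topic_share({"design": 60, "consumption": 5}, 100), topic_share({"design": 30, "consumption": 40}, 100))` flags only `"consumption"`, which is the kind of signal that could trigger a proactive response.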
Summary
Text Mining is cool!
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
We can do a lot with Text Mining in the Real World!
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar