NLP and Big Data
Shanxi HPC Research Center
Xiaoge Li
WBDB2013, Xi’an, China
TRANSCRIPT
Introduction
The Internet is a big, unstructured knowledge base
NLP & IE: "understand" human language
Unstructured data → structured data
Problems
Human language changes: "Let's Google it!", net language (LOL, 给力 "awesome"), compound words (JFK airport)
Domain knowledge: domain-specific training sets
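Chinese tokenization
小菊/nr 的/u 生活/vn 很/d 给/v 力/vg
小菊/nr 的/u 生活/vn 很/d 给力/a
("Xiaoju's life is awesome": the net word 给力 is split into 给 + 力 in the first segmentation and kept as a single adjective in the second)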
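The difference between the two segmentations above can be reproduced with a very small dictionary-based tokenizer. This is only a sketch: forward maximum matching is used here as a stand-in, not the engine described in these slides, and the function name `forward_max_match`, the toy lexicon, and the `max_len` parameter are assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon entry that matches;
    a character with no match becomes a single-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy lexicon (an assumption): it does not yet know the net word 给力.
base_lexicon = {"小菊", "的", "生活", "很", "给", "力"}
sentence = "小菊的生活很给力"

before = forward_max_match(sentence, base_lexicon)
after = forward_max_match(sentence, base_lexicon | {"给力"})

print(" / ".join(before))
print(" / ".join(after))
```

With the base lexicon, 给力 falls apart into 给 and 力; once the new word is in the lexicon it is kept whole, which is why new-word acquisition matters for tokenization quality.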
NLP needs big data
Unsupervised (weakly supervised) learning
Knowledge acquisition
Relationships
New words
NE gazetteer
System Architecture
Linux Cluster
HDFS
Knowledge acquisition
NLP & IE
MapReduce
HBase
Entity graph
Information fusion
Knowledge acquisition
Large-scale corpus from the Web
Weakly supervised learning
Bootstrapping technique
MapReduce, HBase
Location NE and new word: P = 87.28%, 72.1%
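The bootstrapping idea can be sketched as a loop that alternates between learning contexts from known entities and using those contexts to find new ones. This is a minimal single-machine illustration, not the MapReduce/HBase implementation behind the slide: the function name `bootstrap`, the (previous word, next word) pattern form, the `min_support` threshold, and the English toy corpus are all assumptions.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=2, min_support=2):
    """Grow an entity list from seeds using surrounding-word patterns."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: count the (previous word, next word) contexts of known entities.
        contexts = Counter()
        for words in sentences:
            for i in range(1, len(words) - 1):
                if words[i] in known:
                    contexts[(words[i - 1], words[i + 1])] += 1
        # Keep only contexts seen often enough to be trusted.
        patterns = {c for c, n in contexts.items() if n >= min_support}
        # Step 2: every word found inside a trusted context is a candidate.
        found = {
            words[i]
            for words in sentences
            for i in range(1, len(words) - 1)
            if (words[i - 1], words[i + 1]) in patterns
        }
        if found <= known:  # nothing new was learned, so stop early
            break
        known |= found
    return known


# Toy corpus (hypothetical sentences, already tokenized).
corpus = [
    "he lives in Beijing now".split(),
    "she lives in Shanghai now".split(),
    "they work in Qingdao now".split(),
    "he ate in silence today".split(),
]

locations = bootstrap(corpus, seeds={"Beijing", "Shanghai"})
print(sorted(locations))
```

The two seeds establish the context ("in", "now"), which then picks up Qingdao; "silence" is not added because its context was never seen around a known location.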
Chinese NLP & IE engine
Pipeline
FST & statistical mixture model
Input: plain text
Output: structured XML
MapReduce
Speed: 500 KB/s on 10 nodes
Information Object
Named Entity
Person
Organization
Location
Product
Time
Event (事件)
Pre-defined Event
General Event
Profile and Event
Example Profile
In Concept-Based Profile, its attributes are filled by its participant profiles.
Information Network
NLP
• Tokenization
• POS
• Shallow parsing
• Deep parsing
IE
• NE tag
• CE linkage
• NE Profile
• Profile Merge
Cross Document Information fusion
Hierarchical clustering
MapReduce, HBase
Half a million profiles
Computing complexity
P = 94.65%, R = 88.24%, F = 91.33%
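The reported F value is consistent with the usual balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition; the function name `f_measure` and the `beta` parameter are illustrative choices, not part of the original slides.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for cross-document fusion, in percent.
f = f_measure(94.65, 88.24)
print(round(f, 2))
```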
Multi-dimensional Information Graph
Orange: Location
Gray: Organization
Blue: Person
Source: 2012 People’s Daily
Query: China Agricultural University
Expand 1 level
Organization-Organization Network
Query: China Agricultural University, filter: Organization
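A query with "expand 1 level" and a type filter can be pictured as a bounded breadth-first walk over a typed entity graph. The sketch below is an assumption about how such a query might work, not the system's actual query interface: the class `EntityGraph`, its methods, and the placeholder entities are all hypothetical.

```python
from collections import defaultdict


class EntityGraph:
    """Undirected graph whose nodes carry an entity type."""

    def __init__(self):
        self.kind = {}                  # entity name -> type
        self.edges = defaultdict(set)   # entity name -> linked entity names

    def add_entity(self, name, kind):
        self.kind[name] = kind

    def link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def expand(self, query, levels=1, keep=None):
        """Return entities within `levels` hops, optionally of one type only."""
        seen = {query}
        frontier = {query}
        for _ in range(levels):
            reached = set()
            for node in frontier:
                for neighbor in self.edges.get(node, ()):
                    if neighbor in seen:
                        continue
                    if keep is not None and self.kind.get(neighbor) != keep:
                        continue
                    reached.add(neighbor)
            seen |= reached
            frontier = reached
        return sorted(seen - {query})


# Placeholder entities (hypothetical, not taken from the People's Daily corpus).
g = EntityGraph()
g.add_entity("University A", "organization")
g.add_entity("Ministry B", "organization")
g.add_entity("Institute E", "organization")
g.add_entity("City C", "location")
g.add_entity("Person D", "person")
g.link("University A", "Ministry B")
g.link("University A", "City C")
g.link("University A", "Person D")
g.link("Ministry B", "Institute E")

one_level = g.expand("University A", levels=1)
orgs_one_level = g.expand("University A", levels=1, keep="organization")
orgs_two_levels = g.expand("University A", levels=2, keep="organization")
print(one_level, orgs_one_level, orgs_two_levels)
```

Filtering during the walk, rather than afterwards, means a filtered query only follows paths made of the requested entity type, which matches the idea of an organization-organization network.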
Location-Person Network
Query: Qingdao Port (青岛港), filter: Location
Person-Location Network
Query: Kim Il-sung (金日成)
Future Work
Query language
Graph mining
Enhance NLP engine
Visualization
Questions?
Thank you