Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Dr. Maria Grineva

Systems Group @ ETH Zurich

Sunday, April 7, 13

Today’s Search is Based On Links

• Full-text search is the main way to access information on the Web

• The goal of Web search engines: find the pages most relevant to the user’s query

• Google uses the Web’s hyperlinks to compute the relevance of a Web page (PageRank)
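To make the link-based idea concrete, here is a minimal sketch of a PageRank-style power iteration. It is a textbook simplification, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy graph (invented): "c" has two inbound links, "b" and "d" have none.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the pages without inbound links ("b" and "d") end up with the lowest score, which is the situation the next slide describes.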

22 March 2011 Systems Group @ ETH Zurich for Hasler Foundation

Domains Without Links

• PageRank does not work when documents are not interlinked

• Breaking news and blog posts - must be searchable in real time, before any links to them have been created

• Enterprise databases - documents are not well interconnected because of organizational silos and the limited number of people who create and use them

Web-based User-Generated Knowledge Bases

• To rank and organize documents that are not well interlinked, we need additional knowledge bases:

• Wikipedia - online encyclopedia

• Twitter - real-time microblogging service

The Goal of This Project

Develop a technology that automatically extracts semantic information:

• from Wikipedia - term meanings, relationships, ontologies ...

• from Twitter - real-time information about breaking news, trends, people’s opinions ...

and applies this information to organize:

• news and blogs on the Web

• documents in enterprise databases

We will release our technology as an open source software framework

Semantic Text Analysis Using Wikipedia

• Leveraging Wikipedia to improve text analysis methods:

• Comprehensive coverage (6M terms vs. 65K in Britannica)

• Continuously brought up-to-date

• Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, info-boxes)

• New algorithms:

• Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic Inference

• Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds

• Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation

• Zero-cost deployment and customization: No need to train methods, no human labor, no “cold start” problem

Basic Technique: Semantic Relatedness of Terms

• We analyze Wikipedia’s link structure to compute the semantic relatedness of Wikipedia terms

• We use the Dice measure with weighted hyperlinks (bi-directional links, direct links, “see also” links, etc.)

Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
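A minimal sketch of a weighted Dice measure over link neighborhoods follows. The slides do not give the exact formula or the weights, so both are assumptions: the weight values per link type are hypothetical, and the formula is one common weighted generalization that reduces to the classic Dice coefficient 2|A∩B| / (|A| + |B|) when all weights are 1.

```python
# Hypothetical weights per link type; the actual values are not given in the slides.
LINK_WEIGHTS = {"see_also": 1.0, "bidirectional": 0.8, "direct": 0.5}


def weighted_dice(neighbors_a, neighbors_b):
    """Relatedness of two terms from their weighted link neighborhoods.

    Each argument maps a neighboring article title to the weight of the
    link that connects it to the term.
    """
    shared = neighbors_a.keys() & neighbors_b.keys()
    overlap = sum(neighbors_a[n] + neighbors_b[n] for n in shared)
    total = sum(neighbors_a.values()) + sum(neighbors_b.values())
    return overlap / total if total else 0.0


# Invented toy neighborhoods for two terms.
term_a = {"X": LINK_WEIGHTS["see_also"], "Y": LINK_WEIGHTS["direct"]}
term_b = {"Y": LINK_WEIGHTS["direct"], "Z": LINK_WEIGHTS["see_also"]}
score = weighted_dice(term_a, term_b)
```

Because the shared neighbor "Y" is only connected by low-weight direct links, the score is lower than it would be with equal weights.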

Word Sense Disambiguation

• Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians

• We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text

• Example: when “Platform” is mentioned in the context of implementation, open-source, web-server, HTTP, the context points to its computing sense
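The disambiguation step can be sketched as picking the candidate meaning that is most related to the surrounding context terms. This is only an illustration: the link sets below are invented toy data rather than real Wikipedia links, the candidate titles are hypothetical, and a plain (unweighted) Dice coefficient stands in for the weighted measure.

```python
def dice(links_a, links_b):
    """Plain Dice coefficient over two sets of linked article titles."""
    total = len(links_a) + len(links_b)
    return 2.0 * len(links_a & links_b) / total if total else 0.0


def disambiguate(candidates, context, links):
    """Return the candidate meaning with the highest total relatedness to the context."""
    scores = {
        meaning: sum(dice(links[meaning], links[term]) for term in context)
        for meaning in candidates
    }
    return max(scores, key=scores.get)


# Invented toy link sets; real ones would come from a Wikipedia snapshot.
TOY_LINKS = {
    "Computing platform": {"Software", "Operating system", "Web server", "Open source"},
    "Railway platform": {"Train", "Station", "Track"},
    "Implementation": {"Software"},
    "Open source": {"Software"},
    "Web server": {"HTTP", "Software", "Open source"},
    "HTTP": {"Web server", "Software"},
}

meaning = disambiguate(
    candidates=["Computing platform", "Railway platform"],
    context=["Implementation", "Open source", "Web server", "HTTP"],
    links=TOY_LINKS,
)
```

With this toy data the railway sense shares no neighbors with any context term, so the computing sense is selected.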

Prototype of a Semantic Search Engine for the Blogosphere

Twitter - A Real-Time News Medium

• ~200M users all over the world posting short messages (tweets) via mobile devices and web browsers

• ~140M tweets per day

• Twitter is an open social network where anyone can follow anyone

• Retweets - a mechanism for fast news spreading

Following + Retweets: Twitter is the Fastest News Medium

• Twitter reacts faster than mainstream media: Haiti Earthquake, Hudson River plane crash

• Everyone can be a reporter: real-time updates on the revolutions in Tunisia, Egypt, Libya, Iran ...

Extracting Useful Information From Twitter

• Popularity of a URL

• Sentiment and opinions about a news story (tweets containing the news URL)

• Trending topics: what is being actively discussed right now

• Personalization of news based on a user’s friend connections: The Tweeted Times http://tweetedtimes.com
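As a rough illustration of the first item, URL popularity can be approximated by counting how many tweets mention each URL. This sketch works on plain tweet texts and does not use the Twitter API; the regular expression, the punctuation stripping, and the sample tweets are assumptions made for the example.

```python
import re
from collections import Counter

# Simplified URL pattern; real tweets usually contain shortened links
# that would need to be expanded first.
URL_PATTERN = re.compile(r"https?://\S+")


def url_popularity(tweets):
    """Count the number of tweets that mention each URL."""
    counts = Counter()
    for text in tweets:
        urls = {url.rstrip(".,!?)") for url in URL_PATTERN.findall(text)}
        counts.update(urls)  # each URL counts once per tweet
    return counts


# Invented sample tweets.
sample_tweets = [
    "Big news http://example.com/a",
    "RT Big news http://example.com/a",
    "Something else http://example.com/b, worth a read",
]
popularity = url_popularity(sample_tweets)
top_url = popularity.most_common(1)
```

The same counting scheme applies to hashtags or extracted terms when looking for trending topics.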

The Tweeted Times: a personalized newspaper generated from a user’s Twitter account

Sunday, April 7, 13

At the Systems Layer

• A scalable distributed architecture is required:

• Hadoop (MapReduce software framework) for batch processing of Wikipedia snapshots

• Real-time analytics based on a distributed key-value store for online Twitter stream processing
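The batch side can be pictured with a small in-memory imitation of the map, shuffle, and reduce phases. This is plain Python, not Hadoop code, and the job itself (collecting the inbound links of every article) is an assumed example of what processing a Wikipedia snapshot might involve. The function names and sample pages are hypothetical.

```python
import re
from collections import defaultdict

# Matches the target of a wiki link such as [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def map_page(title, wikitext):
    """Map phase: emit one (target, source) pair per outgoing link."""
    for target in WIKILINK.findall(wikitext):
        yield target.strip(), title


def reduce_links(target, sources):
    """Reduce phase: collect the distinct pages that link to `target`."""
    return target, sorted(set(sources))


def run_job(pages):
    """Run map, shuffle (group by key), and reduce in memory."""
    grouped = defaultdict(list)
    for title, wikitext in pages.items():
        for key, value in map_page(title, wikitext):
            grouped[key].append(value)
    return dict(reduce_links(key, values) for key, values in grouped.items())


# Invented sample pages.
sample_pages = {"A": "see [[B]] and [[C|label]]", "B": "[[C]]"}
inbound = run_job(sample_pages)
```

In a real deployment the map and reduce functions would run in parallel over partitions of the snapshot, with the framework handling the grouping step.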

Scalable Real-Time Analytics Based On Distributed Key-Value Store

• At Systems Group, we are working on a system for real-time analytics based on Cassandra:

• We extend Cassandra with:

• push-style procedure for real-time analytics

• incremental computations (alternative to batch-processing) - processing data as it arrives from the stream
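The two extensions can be illustrated with a small in-memory sketch. It does not use Cassandra or its API: a Python dictionary stands in for the key-value store, and the class name, method names, and threshold are hypothetical. Each arriving item updates its aggregate immediately (incremental computation), and registered callbacks are invoked on every update (push style) instead of a batch job recomputing everything.

```python
from collections import defaultdict


class IncrementalCounter:
    """Keeps per-key counts up to date as items arrive and pushes updates."""

    def __init__(self):
        self.store = defaultdict(int)  # stand-in for a distributed key-value store
        self.triggers = []

    def on_update(self, callback):
        """Register a callback that receives (key, new_count) after each update."""
        self.triggers.append(callback)

    def process(self, key):
        """Handle one stream item: update its count, then notify the callbacks."""
        self.store[key] += 1
        for callback in self.triggers:
            callback(key, self.store[key])


# Report a hashtag as trending when its count reaches a hypothetical threshold of 3.
trending = []
counter = IncrementalCounter()
counter.on_update(lambda key, count: trending.append(key) if count == 3 else None)

for hashtag in ["#egypt", "#libya", "#egypt", "#egypt", "#libya"]:
    counter.process(hashtag)
```

The result for each key is available as soon as the item is processed, which is the property that batch processing lacks.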

References

• Prototype of the semantic search engine Blognoon: http://blognoon.com

• The Tweeted Times - personalized newspaper based on a user’s Twitter account: http://tweetedtimes.com

• Triggy: a system for real-time analytics: http://www.systems.ethz.ch/research/projects
