Post on 18-Dec-2014
Semantic Data Search and Analysis Using Web-based User-Generated
Knowledge Bases
Dr. Maria Grineva
Systems Group @ ETH Zurich
Sunday, April 7, 13
Today’s Search is Based On Links
• Full-text search is the main way to access information on the Web
• The goal of Web search engines: find the most relevant pages for the user’s query
• Google uses the Web’s hyperlinks to compute the relevance of a Web page (PageRank)
22 March 2011 Systems Group @ ETH Zurich for Hasler Foundation
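As an illustration of link-based relevance, here is a minimal sketch of the textbook PageRank power iteration in Python. It is not Google's production algorithm; the toy graph, the damping factor of 0.85, and the function name `pagerank` are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links: dict mapping a page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to its link targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up with the highest rank, and the page nobody links to ends up with the lowest.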
Domains Without Links
• PageRank does not work when documents are not interlinked
• Breaking news and blog posts - must be available in real time, before any links to them have been created
• Enterprise databases - documents are not well interconnected because of organizational silos and the limited number of people who create and use them
Web-based User-Generated Knowledge Bases
• To rank and organize documents that are not interlinked well, we need additional knowledge bases:
• Wikipedia - Online encyclopedia
• Twitter - real-time microblogging service
The Goal of This Project
Develop a technology that automatically extracts semantic information:
• from Wikipedia - term meanings, relationships, ontologies ...
• from Twitter - real-time information about breaking news, trends, people’s opinions ...
and applies this information to organize:
• news and blogs on the Web
• documents in enterprise databases
We will release our technology as an open source software framework
Semantic Text Analysis Using Wikipedia
• Leveraging Wikipedia to improve text analysis methods:
• Comprehensive coverage (6M terms vs. 65K in Britannica)
• Continuously brought up-to-date
• Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, info-boxes)
• New algorithms:
• Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic Inference
• Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds
• Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation
• Zero-cost deployment and customization: No need to train methods, no human labor, no “cold start” problem
Basic Technique: Semantic Relatedness of Terms
• We analyze Wikipedia’s link structure to compute the semantic relatedness of Wikipedia terms
• We use the Dice measure with weighted hyperlinks (bi-directional links, direct links, “see also” links, etc.)
Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008
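A minimal Python sketch of a Dice measure over weighted links may help make the idea concrete. The slides do not give the exact formula or weights, so everything here is an assumption: the weights in `LINK_WEIGHTS`, the generalization used in `weighted_dice` (which reduces to the plain Dice coefficient 2|A∩B| / (|A| + |B|) when all weights are 1), and the toy link data.

```python
# Hypothetical weights per link type; the actual values are not given in the slides.
LINK_WEIGHTS = {"see_also": 3.0, "bidirectional": 2.0, "direct": 1.0}


def weighted_dice(neighbors_a, neighbors_b):
    """Relatedness of two articles from their weighted link neighborhoods.

    neighbors_a, neighbors_b: dicts mapping a linked article to the link type.
    Returns a value in [0, 1]; 0 means no shared neighbors.
    """
    weights_a = {n: LINK_WEIGHTS[t] for n, t in neighbors_a.items()}
    weights_b = {n: LINK_WEIGHTS[t] for n, t in neighbors_b.items()}
    total = sum(weights_a.values()) + sum(weights_b.values())
    if total == 0:
        return 0.0
    shared = weights_a.keys() & weights_b.keys()
    # Weight of the shared neighbors, counted from both sides, over total weight.
    return sum(weights_a[n] + weights_b[n] for n in shared) / total


# Toy link neighborhoods (illustrative only, not real Wikipedia data).
ibm = {"Computer": "bidirectional", "Mainframe": "direct", "Software": "see_also"}
apple = {"Computer": "direct", "Software": "direct", "IPhone": "bidirectional"}
relatedness = weighted_dice(ibm, apple)
```

With these toy inputs the shared neighbors "Computer" and "Software" carry 7 of the 10 total weight units, so `relatedness` is 0.7.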
Word Sense Disambiguation
• Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians
• We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text
• Example: Platform is mentioned in the context of implementation, open-source, web-server, HTTP
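One way to picture context-based disambiguation is to pick the candidate sense that is most related to the surrounding terms. The Python sketch below does this with a plain (unweighted) Dice coefficient over link sets; the scoring rule, the function names, and the toy link data are assumptions for illustration, not the exact method from the slides.

```python
def dice(links_a, links_b):
    """Plain Dice coefficient between two sets of linked articles."""
    if not links_a and not links_b:
        return 0.0
    return 2 * len(links_a & links_b) / (len(links_a) + len(links_b))


def disambiguate(candidates, context, links):
    """Return the candidate sense with the highest total relatedness to the context.

    candidates: possible senses of an ambiguous term (e.g. from a disambiguation page).
    context: unambiguous terms found near the ambiguous one.
    links: dict mapping each article to the set of articles it is linked with.
    """
    def score(sense):
        return sum(dice(links[sense], links[term]) for term in context)

    return max(candidates, key=score)


# Toy link sets (illustrative only, not real Wikipedia data).
links = {
    "Computing platform": {"Software", "Operating system", "Web server", "Open source"},
    "Railway platform": {"Train", "Railway station", "Passenger"},
    "Web server": {"HTTP", "Software", "Open source"},
    "HTTP": {"Web server", "Internet", "Software"},
    "Open source": {"Software", "License"},
}
sense = disambiguate(
    candidates=["Computing platform", "Railway platform"],
    context=["Web server", "HTTP", "Open source"],
    links=links,
)
```

Here the computing sense shares links with every context term while the railway sense shares none, so the computing sense is selected.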
Prototype of a Semantic Search Engine for the Blogosphere
Twitter - A Real-Time News Medium
• ~200M users all over the world posting short messages (tweets) via mobile devices and web browsers
• ~140M tweets per day
• Twitter is an open social network where everyone can follow everyone
• Retweets - a mechanism for fast news spreading
Following + Retweets: Twitter is the Fastest News Medium
• Twitter reacts faster than mainstream media: Haiti earthquake, Hudson River plane crash
• Everyone can be a reporter: real-time updates on the revolutions in Tunisia, Egypt, Libya, Iran ...
Extracting Useful Information From Twitter
• Popularity of a URL
• Sentiments, opinions about a news story (tweets containing the news URL)
• Trending topics: what is being actively discussed right now
• Personalization of news based on the user’s friend connections: The Tweeted Times http://tweetedtimes.com
The Tweeted Times: a personalized newspaper generated from the user’s Twitter account
At the Systems Layer
• A scalable distributed architecture is required:
• Hadoop (MapReduce software framework) for batch processing of Wikipedia snapshots
• Real-time analytics based on a distributed key-value store for online Twitter stream processing
Scalable Real-Time Analytics Based On Distributed Key-Value Store
• At the Systems Group, we are working on a system for real-time analytics based on Cassandra:
• We extend Cassandra with:
• push-style procedure for real-time analytics
• incremental computations (alternative to batch-processing) - processing data as it arrives from the stream
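The contrast between incremental, push-style processing and batch processing can be sketched in a few lines of Python. This is only an illustration of the idea: the class `IncrementalUrlPopularity`, its callback, and the threshold are hypothetical, an in-memory `Counter` stands in for the distributed key-value store, and no Cassandra API is used.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")


class IncrementalUrlPopularity:
    """Keeps per-URL tweet counts up to date as each tweet arrives."""

    def __init__(self, threshold, on_popular):
        self.counts = Counter()       # stand-in for the key-value store
        self.threshold = threshold    # count at which a URL is reported
        self.on_popular = on_popular  # push-style callback

    def on_tweet(self, text):
        # Update only the counters touched by this tweet; no batch recomputation.
        for url in URL_PATTERN.findall(text):
            self.counts[url] += 1
            if self.counts[url] == self.threshold:
                # Push the result to the subscriber the moment it becomes true.
                self.on_popular(url, self.counts[url])


alerts = []
tracker = IncrementalUrlPopularity(
    threshold=2, on_popular=lambda url, count: alerts.append((url, count))
)
for tweet in [
    "Breaking: http://example.com/a",
    "Something else http://example.com/b",
    "RT Breaking: http://example.com/a",
]:
    tracker.on_tweet(tweet)
```

Each incoming tweet costs a constant amount of work per URL, and the subscriber is notified as soon as a URL crosses the threshold rather than after the next batch job.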
References
• Prototype of the semantic search engine Blognoon: http://blognoon.com
• The Tweeted Times - personalized newspaper based on user’s Twitter account: http://tweetedtimes.com
• Triggy: a system for real-time analytics: http://www.systems.ethz.ch/research/projects