toward next generation search: business, product, science, infrastructure, and talent
Post on 30-Dec-2015
Toward Next Generation Search: Business, Product, Science, Infrastructure, and Talent
William I. Chang
Chief Scientist, Baidu.com
wchang @ baidu.com
History
• Synopsis of Web search evolution
• Themes and principles
• Challenges and opportunities
• Possible steps toward the next generation of search
Outline
Three Laws of Search (Infoseek ~1996-8)
1. Phrases are more basic than single words
2. Confidence in the source is more important than the content
3. “Facet” is more powerful than the relational model
Fundamental Theorem of Search: the search space can be factored, where each dimension is a taxonomy
• NLP
– Billions of terms, proper names
– Lexical analysis and special cases: capitalization, contraction, acronyms, possessives, etc.
– Word stemming, phrase stemming
– Phrase extraction and query-rewriting, e.g. home run record
• Leveraging user input and community recommendation
– Query suggestion by log mining
– Selection and ranking using link analysis and anchor-text indexing
• Birth of Adversarial Information Retrieval (anti-spam)
First Generation: ~1996-2000
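The "Fundamental Theorem" on this slide (a search space factored into dimensions, each a taxonomy) can be illustrated with a small faceted index. This is only a sketch under stated assumptions: the toy documents, the facet names (`topic`, `region`, `type`), and the helper functions are all hypothetical and are not taken from the Infoseek system.

```python
from collections import defaultdict

# Hypothetical toy catalogue: each document is tagged along independent facets,
# and each facet value is a path in that facet's taxonomy.
DOCS = {
    "d1": {"topic": "sports/baseball", "region": "us/ca", "type": "news"},
    "d2": {"topic": "sports/baseball", "region": "jp", "type": "stats"},
    "d3": {"topic": "sports/soccer", "region": "us/ny", "type": "news"},
    "d4": {"topic": "finance/stocks", "region": "us/ca", "type": "news"},
}


def build_facet_index(docs):
    """Map (facet, taxonomy node) -> doc ids, indexing every ancestor node too."""
    index = defaultdict(set)
    for doc_id, facets in docs.items():
        for facet, path in facets.items():
            parts = path.split("/")
            for depth in range(1, len(parts) + 1):
                index[(facet, "/".join(parts[:depth]))].add(doc_id)
    return index


def facet_search(index, all_ids, **constraints):
    """Pick one taxonomy node per facet and intersect the matching doc sets."""
    result = set(all_ids)
    for facet, node in constraints.items():
        result &= index.get((facet, node), set())
    return sorted(result)


INDEX = build_facet_index(DOCS)
print(facet_search(INDEX, DOCS, topic="sports", region="us"))           # ['d1', 'd3']
print(facet_search(INDEX, DOCS, topic="sports/baseball", type="news"))  # ['d1']
```

Because each facet is indexed independently, a query is just one node per dimension, and narrowing or broadening a single dimension does not disturb the others.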
“The Internet is a place where one can always find someone to help answer any question or get anything done.”
• Web Oracle model (proposed in 1998)
– Online communities: BBS, Mailing list, eGroup, Usenet News…
– FAQ documents on the Internet
– FAQ Finder & Builder as a community killer-app
– Intelligent search
• User-generated content: blogs, MySpace, YouTube…
• Tagging
• Communities around knowledge: Wikipedia, Baidupedia…
• Question-answering communities:
– Naver, Yahoo! Answers, Sina iAsk, Baidu iKnow…
• People search: LinkedIn, Facebook…
Second Generation: ~2001-present
“The Internet is a matching network.”
• Personalized search results
– Based on locale, personal profile, search and browse history
– Personal ranking function, source selection, keyword filtering
– Personal search agent: spider, summarizer, Q&A agent
• Integration of search and recommendation (pull and push)
– Subscription through automatic personalization
• Content, media, events, products and services…
– Matching things with people, etc.
– Shopping assistant, information integrator
• Predictive recommendation with feedback
– Always-on and environment-aware
– Do you like this? Make it more (or less) custom, please.
– Is your taste like mine? How to evaluate the evaluator?
Third Generation
• First Generation
– Searching for information or content using NLP techniques
– Based on community recommendation of content or keywords
– Little or no personalization
• Second Generation
– Aimed at resolving problems or finding people, entertainment
– Centered around community-created content
– Group customization
• Third Generation
– More integrated into people’s daily lives and needs
– Predictive, locale and environment-aware
– True personalization
Summary
• Phrases are the conceptual units
– Accurate name extraction and matching
– Query rewriting & suggestion, “no quotes”, typos are OK
– Understanding user needs, semantic match, machine translation
• Confidence in the source
– Leverage community recommendation to filter content
– Tagging, blogging, SMS forwarding
– Community-created content is more interesting
• People helping people
– Answer any question or get anything done
• Internet is more and more part of people’s everyday lives
– Ubiquitous, always on, environment aware
– Universal messaging/delivery of content, better integration
– On-the-spot advice, e.g. personalized shopping
Principles
• Search ranking function is an incredibly complex, possibly non-decomposable multi-objective optimization problem:
– Recall-precision tradeoff
– Weighting of multiple terms
– Textual quality and specificity, information richness
– Unique and original content
– Popularity vs authority
– Freshness, timeliness
– User needs, domain specific query
• A search engine is a database of massive size that must be continually refreshed and updated in near real time, with high QoS requirements (response time, uptime) and throughput/efficiency requirements (search itself is free).
• Search has to be built around user behavior analysis at massive scale in order to respond to the constantly changing WWW environment. This analysis has to be automated, self-adaptive, and near-real-time.
Challenges for Search
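To make the multi-objective tension concrete, the sketch below collapses a few of the listed signals into a single weighted score and shows how the ordering flips when the weights change. It rests on a loud assumption: the slide says the ranking function is possibly non-decomposable, so a linear weighted sum is a deliberate simplification, and the signal names, values, and weights here are invented for illustration.

```python
# Hypothetical per-document signals in [0, 1], named after items on the slide.
SIGNALS = ("text_quality", "originality", "popularity", "authority", "freshness")

DOCS = [
    {"id": "viral_post", "text_quality": 0.4, "originality": 0.5,
     "popularity": 0.9, "authority": 0.2, "freshness": 0.9},
    {"id": "reference_page", "text_quality": 0.8, "originality": 0.7,
     "popularity": 0.4, "authority": 0.9, "freshness": 0.3},
]

# Two illustrative weightings; neither is claimed to be what any engine uses.
POPULARITY_HEAVY = {"text_quality": 0.1, "originality": 0.1,
                    "popularity": 0.5, "authority": 0.1, "freshness": 0.2}
AUTHORITY_HEAVY = {"text_quality": 0.2, "originality": 0.1,
                   "popularity": 0.1, "authority": 0.5, "freshness": 0.1}


def score(doc, weights):
    """Scalarize the objectives as a weighted sum (a simplifying assumption)."""
    return sum(weights[name] * doc[name] for name in SIGNALS)


def rank(docs, weights):
    """Return document ids ordered by descending score."""
    ordered = sorted(docs, key=lambda doc: score(doc, weights), reverse=True)
    return [doc["id"] for doc in ordered]


print(rank(DOCS, POPULARITY_HEAVY))  # ['viral_post', 'reference_page']
print(rank(DOCS, AUTHORITY_HEAVY))   # ['reference_page', 'viral_post']
```

The same two documents swap places under the two weightings, which is the "popularity vs authority" tradeoff in miniature: there is no single ordering that is best on every objective.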
• (China) Each year, data size doubles and the user base doubles, so load grows roughly fourfold (2 × 2 = 4), placing financial strain on service providers. Data centers and electricity are scarce resources.
• Many distributed systems are in operation, but they need to be flexible and reconfigurable without sacrificing much efficiency.
• How to beneficially direct traffic between search and other services? What types of advertising will users accept? How to be context sensitive and user-sensitive?
Challenges for a Search Service
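The 2x2=4 arithmetic compounds quickly. The sketch below assumes, as the slide's shorthand suggests, that serving load scales with the product of data size and user base; real costs need not scale this simply, and the function name and defaults are illustrative only.

```python
def load_multiplier(years, data_growth=2.0, user_growth=2.0):
    """Relative load after `years`, assuming load ~ data size x user base."""
    return (data_growth * user_growth) ** years


for year in range(4):
    print(year, load_multiplier(year))
# 0 1.0
# 1 4.0
# 2 16.0
# 3 64.0
```

Under this assumption, three years of doubling on both axes means roughly 64 times the original load, which is why data centers and electricity become the binding constraints.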
• The WWW as a social network has become balkanized. We need new “people” search engines that let people find and help other people, yet protect privacy and reputation.
• (China) The emergence of nascent commerce infrastructure poses huge challenges. Commerce platforms need to support safe transactions, advertising and brand marketing, and need to seamlessly integrate online and offline services.
• Education, government, media, and Internet as agents for social engineering?
Challenges for Society and the Internet
• Transparency of advertising effectiveness vs secrecy of matching algorithm
• Ad targeting and audience segmentation
• Convergence of different forms of online advertising: search, display, contextual, behavioral
• Convergence of online and traditional advertising: brand marketing, local advertising (classifieds, yellowpages), direct marketing
• Integration of online and offline services
• Ubiquitous, mobile applications
Toward Next Generation Search: Business
• Ease of use: AND vs OR, “soft” AND, synonyms and concept search
• Query term suggestion (does it hurt?)
• Community Q&A: mining FAQs, routing to experts, “Wiki-Answers”
• Factoid extraction
• Open platform to accommodate topic/user/task-specific search engines
Toward Next Generation Search: Product
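One possible reading of the "soft" AND bullet is: keep documents that match most, but not necessarily all, of the query terms, and rank them by how many they match. The slide does not define the term, so this interpretation, the `min_coverage` threshold, and the toy documents are all assumptions.

```python
def tokenize(text):
    """Lowercase whitespace tokenization; real engines do far more."""
    return set(text.lower().split())


def coverage(query, doc):
    """Fraction of distinct query terms that appear in the document."""
    terms = tokenize(query)
    return len(terms & tokenize(doc)) / len(terms) if terms else 0.0


def search(query, docs, mode="soft", min_coverage=0.6):
    """Filter by AND / OR / soft AND, then order by descending coverage."""
    if mode == "and":
        keep = lambda c: c == 1.0           # every query term must match
    elif mode == "or":
        keep = lambda c: c > 0.0            # any query term may match
    elif mode == "soft":
        keep = lambda c: c >= min_coverage  # most query terms must match
    else:
        raise ValueError(f"unknown mode: {mode}")
    scored = [(coverage(query, doc), doc) for doc in docs]
    kept = [pair for pair in scored if keep(pair[0])]
    return [doc for _, doc in sorted(kept, key=lambda pair: pair[0], reverse=True)]


DOCS = [
    "home run record broken in baseball",
    "baseball home run leaders",
    "marathon record",
    "stock market news",
]

print(search("home run record", DOCS, mode="and"))   # first document only
print(search("home run record", DOCS, mode="soft"))  # first two documents
print(search("home run record", DOCS, mode="or"))    # first three documents
```

Strict AND is precise but brittle, OR is forgiving but noisy; the soft variant sits between them, which is the ease-of-use tradeoff the bullet points at.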
• Relevance vs user satisfaction
• Session behavior and modeling: term additions
• Result diversity; avoidance of “abandonment”
• How to evaluate the efficiency of incremental information discovery by a search engine
• TF*IDF revisited
Toward Next Generation Search: Science
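As a baseline for the "TF*IDF revisited" bullet, here is a minimal sketch of classic TF*IDF scoring, using raw term frequency and idf = log(N / df). The slide does not say which variant or revision is meant, so this is the textbook form only; the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        total = 0.0
        for term in query.lower().split():
            if doc_freq[term]:  # skip terms absent from the whole collection
                total += term_freq[term] * math.log(n_docs / doc_freq[term])
        scores.append(total)
    return scores


DOCS = ["the home run record", "the home of the brave", "the record store"]
print(tf_idf_scores("home run", DOCS))
print(tf_idf_scores("the", DOCS))  # a term in every document carries no weight
```

The rare term "run" dominates the score while the ubiquitous "the" contributes nothing, which is the behavior any revised weighting scheme would be measured against.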
• FLASH memory SSD: fast read, slow write
• Data analysis platform
• Development platform
• Internal and external use of P2P technologies
• Search engine as a platform
Toward Next Generation Search: Infrastructure
• Recruitment
• Talent development
Toward Next Generation Search: Talent
Increasingly leverage the collective intelligence of users and communities, in a manner that is self-adaptive, scalable, and (near) real-time, in order to support ubiquitous, integrated online and offline services.
In Conclusion
Thank you
wchang @ baidu.com