brave new search world

Post on 15-Apr-2017

Category:

Internet

TRANSCRIPT

1

Brave New Search World

Ran Hock
Online Strategies
ran@onstrat.com

2

Brave New Search World

• The nature of “search” is changing radically.
• Structure is being created from (relatively) unstructured data.
• The “Semantic Web” is becoming an actuality.
• Natural Language Processing (NLP) and other technologies are being extensively applied to search and search-related activities.

3

Brave New Search World

• These technologies are making the following kinds of things happen:
– “Knowledge graphs”
– “Entity” identification in numerous applications
– Natural language search statements
– Actual searching of images (not just of image metadata)
• These advances are coming not just from Google but from numerous services, especially for “news” search.

4

Some Themes/Perspectives

• What is happening is more evolutionary than revolutionary. Many, but not all, of the "pieces" of the technology have been around for a while.
• Structure is being derived from (not total) chaos. We are going from words to meaning.
• Google isn’t the only player here.
• We can take real advantage of the developments.
• Using what you already know about “search” is important.

5

Unstructuredness of Data

• Part of the “organization of knowledge” problem
• Particularly acute for textual material
• To a computer, a “word” is a string of characters bounded by spaces or punctuation and has no “meaning”.

• When we are searching for something, we are searching for meaningful things, not character strings.

• Meaning can be derived from context by the use of NLP.
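
The "string of characters" view can be made concrete with a minimal Python sketch. The sample sentences and the tokenizing rule are assumptions chosen for illustration, not taken from the slides:

```python
import re

def tokenize(text):
    """Split text into the character strings a computer 'sees' as words."""
    # A "word" here is a maximal run of letters, digits, or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

a = tokenize("Jaguar unveils a new car.")
b = tokenize("A jaguar was seen near the river.")

# The string "jaguar" matches in both sentences, although the meanings differ.
shared = set(a) & set(b)
print(shared)
```

Plain string matching treats the car maker and the animal as the same "word"; telling them apart takes context, which is where NLP comes in.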

6

Where We Were Recently

• Boolean Logic
– Actually a precursor/example of Artificial Intelligence (AI) applied to “search”.
– Still a part of search AI

• Boolean is (from our infancy) a central aspect of how we think, a part of our “consciousness”

• Old approach: Searching by concepts

7

Where We Were Recently

“Old” (circa 1975 – 2???) search strategy (searching by “concepts”)

OR
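
Concept-based searching is commonly built by ORing alternative terms within each concept and ANDing the concepts together. The AND step is an assumption here (only the OR survives from the slide), and the documents and terms are made up for illustration. A minimal Python sketch:

```python
def matches(doc_words, concepts):
    """True if the document contains at least one term from every concept."""
    # OR within a concept (any), AND across concepts (all).
    return all(any(term in doc_words for term in concept) for concept in concepts)

# Hypothetical mini-collection of documents.
docs = {
    "d1": "teenagers and smoking habits",
    "d2": "adolescents who use tobacco",
    "d3": "smoking salmon at home",
}

# Two concepts, each a set of alternative terms.
concepts = [
    {"teenagers", "adolescents", "youth"},
    {"smoking", "tobacco", "cigarettes"},
]

hits = [doc_id for doc_id, text in docs.items()
        if matches(set(text.split()), concepts)]
print(hits)  # ['d1', 'd2']
```

The third document mentions "smoking" but nothing from the first concept, so it is excluded.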

8

Where We Were Recently (cont.)

• Ranking of web search results was/is based on a wide range (ca. 200) of factors, or “signals”

• User-controlled field searching (intitle: etc.)

• Etc.

9

The “Newer” Technologies

• Semantic Web Technologies
• Artificial Intelligence (AI) used at a broad level and utilizing various AI subfields
• AI - Expert Systems approaches
• AI - Natural Language Processing (NLP)
• AI - NLP - Entity identification (extraction, disambiguation, classification, etc.)
• AI - Machine Learning
• Big Data processing

10

Technologies: The Semantic Web

• W3C “informal” definition – "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (from Tim Berners-Lee et al., The Semantic Web. Scientific American, May 2001.)

11

Technologies: The Semantic Web

• Essence:
– “strings to things”
– “words to meaning”
• Technologically accomplished on webpages by means of a specialized XML markup language, etc.

12

Technologies: The Semantic Web

• Idea born pre-1999
• In practice, also requires other technologies such as Natural Language Processing, etc.
• 2006 - Berners-Lee and colleagues stated that: "This simple idea…remains largely unrealized".

• 2013 - more than four million Web domains contained Semantic Web markup.

13

Technologies: AI - Expert Systems

• Search results ranking has long used an “expert systems” approach, mimicking what an experienced researcher looks for:
– Words appearing in the title
– Number of times cited (linked-to)
– Proximity of words
– Words in the abstract
– Words in headings
– Etc.

• This will continue, more and more automatically.
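
A rule-based scorer in this spirit might look like the following Python sketch. The weights, field names, and sample documents are hypothetical, and only three of the listed signals are modeled; it is not how any actual search engine ranks:

```python
def min_gap(positions_a, positions_b):
    """Smallest distance in words between any occurrence of two terms."""
    return min(abs(i - j) for i in positions_a for j in positions_b)

def score(doc, query_terms, weights):
    """Combine a few hand-written 'expert' signals into one number."""
    title = doc["title"].lower().split()
    body = doc["body"].lower().split()
    s = 0.0
    # Signal 1: query words appearing in the title.
    s += weights["title"] * sum(t in title for t in query_terms)
    # Signal 2: number of times cited (linked to).
    s += weights["links"] * doc["inlinks"]
    # Signal 3: proximity of the first two query words in the body.
    pos = [[i for i, w in enumerate(body) if w == t] for t in query_terms[:2]]
    if len(pos) == 2 and all(pos):
        s += weights["proximity"] / max(1, min_gap(pos[0], pos[1]))
    return s

# Hypothetical documents and weights.
docs = [
    {"title": "Home energy tips",
     "body": "a solar water heater and a roof panel", "inlinks": 30},
    {"title": "Solar panel efficiency",
     "body": "solar panel efficiency depends on temperature", "inlinks": 12},
]
weights = {"title": 5.0, "links": 0.1, "proximity": 2.0}
query = ["solar", "panel"]

ranked = sorted(docs, key=lambda d: score(d, query, weights), reverse=True)
print([d["title"] for d in ranked])
```

With these weights the page with both query words in its title outranks the more heavily linked page, which mirrors the judgment an experienced researcher would make.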

14

Technologies: Natural Language Processing

• A part of artificial intelligence and computational linguistics

• Deals with helping computers “understand” written and spoken languages

• Plays a key role in voice input for search, natural language search statements, translations, and more.

15

Technologies: Natural Language Processing

Google's syntactic systems
• predict part-of-speech tags for each word in a given sentence,
• identify morphological features such as gender and number,
• label relationships between words, such as subject, object, modification, etc.,
• leverage large amounts of unlabeled data,
• incorporate neural net technology.

research.google.com/pubs/NaturalLanguageProcessing.html

16

Technologies: Natural Language Processing

Google’s semantic systems
• identify entities in free text,
• label them with types (such as person, location, or organization),
• cluster mentions of those entities within and across documents (co-reference resolution),
• incorporate multiple sources of knowledge and information to aid with analysis of text.

research.google.com/pubs/NaturalLanguageProcessing.html

17

Technologies: Entity Extraction

• A.k.a. named-entity recognition, entity identification
• Complementary to other natural language processing
• Identifies things, people, places, etc. within text (and speech).
• Relates to the idea of concepts referred to earlier.
• Because “text” is based on language, “structure” is there but the structure is not readily evident to a computer.

18

Technologies: Entity Extraction

• Context-based connections allow discernment of different meanings of a word.

• Entity extraction draws inferences based on the logical content of the data.

• Entity extraction may be the single most important tool for bringing structure to unstructured data, specifically text.

• Also used for search query “suggestions”.
• An excellent example is found in Silobreaker.
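
Context-based disambiguation can be illustrated with a toy Python sketch. The gazetteer, cue words, and type labels below are invented for the example; production systems such as Silobreaker's are not described in the slides and certainly work differently:

```python
import re

# Hypothetical gazetteer: surface string -> candidate entities with context cues.
GAZETTEER = {
    "jaguar": [
        {"entity": "Jaguar (animal)", "type": "Animal",
         "cues": {"cat", "jungle", "prey", "river"}},
        {"entity": "Jaguar Cars", "type": "Organization",
         "cues": {"car", "engine", "sedan", "dealer"}},
    ],
    "paris": [
        {"entity": "Paris, France", "type": "Location",
         "cues": {"france", "seine", "city"}},
    ],
}

def extract_entities(text):
    """Return (surface word, entity, type) for each known name in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    context = set(words)
    found = []
    for w in words:
        candidates = GAZETTEER.get(w)
        if not candidates:
            continue
        # Disambiguate: pick the candidate whose cue words overlap the context most.
        best = max(candidates, key=lambda c: len(c["cues"] & context))
        found.append((w, best["entity"], best["type"]))
    return found

print(extract_entities("The dealer in Paris sold a Jaguar sedan."))
print(extract_entities("A jaguar stalked its prey by the river."))
```

The same string "jaguar" resolves to the car maker in the first sentence and to the animal in the second, purely from the surrounding words.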

19

20

21

22

Technologies: Machine Learning

Computers teaching themselves

Google RankBrain
• Used in processing search results, part of Google’s Hummingbird search algorithm
• A way of interpreting a search statement in order to find web pages that may not have the specific words in the search statement.
• Uses patterns from seemingly unconnected other “complex” searches to find similarities in the current search, then applies that information to find the most likely useful content.

• Google regards this as the third most important signal.

23

Technologies: Big Data

• The existence of “big data” collections provides unprecedented opportunities for computational approaches that help computers “understand” text.

• In neural-network experiments on identifying entities in images, the accuracy of machine learning algorithms improves vastly when used with large pools of data.

• "...Google’s search engine queries a 100 petabyte index that incorporates over 200 indicators and whose algorithms change more than 500 times per year."

24

Specific Applications of These (and Other) Technologies

• Continued gradual incorporation of “expert” techniques

• Natural language search statements
• Search by voice
• Image recognition and search: search of images, search by image, and facial recognition
• Knowledge Graphs
• Entities in news search

25

Gradual Incorporation of “Expert” Techniques

• An “ordinary” search isn’t what it used to be.
• Google has now quietly taken over more of the “old” “professional searcher” techniques and now automatically adds not just word variants, but synonyms.

26

Gradual Incorporation of “Expert” Techniques

• Suggested searches (based on known connections and not just based on your character string)

A "data-driven" approach (trillions of words) vs. "rules". Not just word variants.

• The old “synonyms” (~diet) option didn’t just go away. It is now applied automatically. (Few people use the OR.)
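
What automatic synonym expansion amounts to can be sketched in Python. The synonym table is a hand-written stand-in (an assumption); a search engine would derive such relationships from data rather than a fixed list:

```python
# Hypothetical synonym table; a real engine would learn these from data.
SYNONYMS = {
    "diet": ["nutrition", "food"],
    "car": ["automobile", "auto"],
}

def expand_query(query):
    """Rewrite each term as an OR group of the term and its synonyms."""
    groups = []
    for term in query.lower().split():
        variants = [term] + SYNONYMS.get(term, [])
        if len(variants) > 1:
            groups.append("(" + " OR ".join(variants) + ")")
        else:
            groups.append(term)
    return " ".join(groups)

print(expand_query("healthy diet"))  # healthy (diet OR nutrition OR food)
```

The searcher types two words; the engine quietly runs the OR that a professional searcher would once have typed by hand.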

27

Gradual Incorporation of “Expert” Techniques

• “Did you mean” is now more often “Showing results for”

28

Gradual Incorporation of “Expert” Techniques

• “Fuzzy Logic” – As well as searching for words that are “close”, Google may drop some of your “concepts” for some records

29

Gradual Incorporation of “Expert” Techniques

– If Google “thinks” you want specific facts and “sees” a matching answer, you may get that immediately.

30

Specific Applications: Natural Language Search Statements

• Don’t hesitate to use them!

• The above two searches give different (and relevant) answers

• This is especially important for Google Now and Siri!

31

Specific Applications: Voice Search

• Apple (iOS) – Siri
• Google – Google Now
• Bing – Cortana (recently deceased?)
• These “expect” natural language, so natural language will yield the best results.

32

Specific Applications: Image Recognition and Search: Search of Images

Not much recent obvious change in Bing’s or Google’s regular image search, but:
• “Categorization” (aspect of entity extraction) is now shown on image search results pages
• Google, Microsoft (Bing) and Apple are heavy into research on image identification and classification.

• What’s happening/coming can be anticipated by looking at Google Photos.

33

Specific Applications: Image Recognition and Search: Search of Images

Bing Image Search

34

Specific Applications: Image Recognition and Search: Search of Images

35

Specific Applications: Image Recognition and Search: Search of Images

• In December 2015, Microsoft beat out 5 competitors (including Google) in the ImageNet contest for machine recognition of images

• Machines were trained to recognize images using a “deep neural networking” method.

• Competitors must locate and identify objects from 100,000 photographs found in Flickr and search engines and then place them in 1,000 object categories.

• Microsoft, the winner, had an error rate of 3.5 percent for classification and 9 percent for localization.

• Machine learning using neural networking is also very successfully used for translations, such as in Skype’s new translation offering
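
The ImageNet classification task is usually scored by "top-5" error: a guess counts as correct if the true label is among the system's five highest-ranked answers. Assuming that is the measure behind the 3.5 percent figure, a Python sketch with made-up predictions shows how such an error rate is computed:

```python
def top_k_error(predictions, labels, k=5):
    """Fraction of images whose true label is not among the first k guesses."""
    misses = sum(label not in ranked[:k]
                 for ranked, label in zip(predictions, labels))
    return misses / len(labels)

# Hypothetical ranked guesses for four images, best guess first.
predictions = [
    ["cat", "dog", "fox", "wolf", "lynx", "bear"],
    ["car", "bus", "truck", "van", "tram", "bike"],
    ["oak", "elm", "pine", "fir", "ash", "yew"],
    ["cup", "mug", "bowl", "jar", "pot", "pan"],
]
labels = ["lynx", "bike", "oak", "vase"]

print(top_k_error(predictions, labels, k=5))  # 0.5
print(top_k_error(predictions, labels, k=1))  # 0.75
```

In this toy data two of the four true labels fall outside the top five guesses, giving a top-5 error of 50 percent.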

36

Specific Applications: Image Recognition and Search: Search by Image

37

Specific Applications: Image Recognition and Search: Entity and Facial Recognition in Google Photos

38

Specific Applications: Knowledge Graphs

• Knowledge graphs do not originate with Google (but Google has made the term widely known).

• “Knowledge graph theory was initiated by C. Hoede, a discrete mathematician at the University of Twente and F.N. Stokman, a mathematical sociologist at the University of Groningen, both in the Netherlands.” (ca 1982) http://doc.utwente.nl/64931/1/memo1876.pdf

39

Specific Applications: Google Knowledge Graph

• The Google Knowledge Graph, overall, is a database about “things” and the connections between those things.

• Delivers and summarizes key facts about people, places, things.

• The selection of those facts is based on connections regarding that entity and related entities and on what other users have asked about that entity.

40

Specific Applications: Google Knowledge Graph

• Launched May 2012
• At its heart, Google Knowledge Graph is a database of facts.
• At that time it contained 18 billion facts between 570 million objects.
• The kinds of things included vary with the kind of entity.
• Content comes primarily from Wikipedia, World Factbook, Freebase/Wikidata, plus other sources.
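
A "database of facts between objects" can be pictured as a store of subject-predicate-object triples. The Python sketch below is a toy illustration only: the class, its method names, and the sample facts are assumptions, not Google's data model.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy fact store: each fact links a subject to an object via a predicate."""

    def __init__(self):
        self.facts = defaultdict(list)  # subject -> [(predicate, object), ...]

    def add(self, subject, predicate, obj):
        self.facts[subject].append((predicate, obj))

    def about(self, subject):
        """All (predicate, object) facts recorded for one entity."""
        return list(self.facts.get(subject, []))

    def connected(self, subject):
        """Entities linked to the subject in either direction."""
        outgoing = {o for _, o in self.facts.get(subject, [])}
        incoming = {s for s, pairs in self.facts.items()
                    for _, o in pairs if o == subject}
        return outgoing | incoming

kg = KnowledgeGraph()
kg.add("Marie Curie", "born_in", "Warsaw")
kg.add("Marie Curie", "field", "Physics")
kg.add("Pierre Curie", "spouse", "Marie Curie")
kg.add("Warsaw", "capital_of", "Poland")

print(kg.about("Warsaw"))
print(sorted(kg.connected("Marie Curie")))

# Scale check using the figures on the slide: facts per object.
facts_per_object = 18_000_000_000 / 570_000_000
print(round(facts_per_object, 1))  # 31.6
```

By the slide's numbers, each object carries roughly 32 facts on average, which is why following connections between entities is so productive.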

41

42

43

Specific Applications: Google Knowledge Graph

• The key power of Google Knowledge Graph lies in its utilization of connections between entities as searched for by other users.

• At present, its main weakness is its heavy, un-vetted reliance on Wikipedia, which is not always right, e.g., the Wikipedia article on Knowledge Graph.

44

WRONG!

45

46

Bing’s Knowledge Graph

• Named “Snapshot”, it uses Bing’s Satori technology

• Launched in June 2012
• Utilizes Wikipedia, Freebase, Qwiki, LinkedIn, Britannica, etc.
• Builds into results interactive features such as audio and video

47

48

49

Specific Applications: News Applications

Examples of News Sites Effectively Using These Technologies

• Silobreaker (example shown earlier)
• EMM

50

Specific Applications: News Applications

EMM – Europe Media Monitor
• From the European Commission
• Computerized analysis of news trends and story content
• Makes extensive use of NLP techniques for entity extraction and clustering
• “Organizes” a vast quantity of knowledge very efficiently.

51

52

53

54

So, How do we as researchers take advantage of this?

• Get in the habit of using what's new (Siri, Google Now, natural language). Join the Evolution!

• Actually pay attention to Google Instant (suggestions).

• Don't forsake the old. There are times when you need to turn the auto-pilot off and take charge.

• Ask questions you didn't bother asking before [because you didn't think the search engine would do it.]

55

So, how do we as researchers take best advantage of this?

• Increase awareness of information quality criteria

• Worry a bit:
– Worrisome - the general public's further reliance on quick, single, local, Twitter-length answers
– Worrisome - localization
– Worrisome - "echo chambers"
– "Machines making decisions on our behalf"

• Enjoy the new.

56

Questions?

Ran Hock
Online Strategies
ran@onstrat.com
