PROJECT 1: SEARCH ENGINE
TRANSCRIPT
WE SCHOOL
CASE STUDY – SEARCH ENGINES AS IMAGE BUILDERS
HEENA JAISINGHANI
DPGD/JL13/1836
SPECIALIZATION: GENERAL MANAGEMENT
PRIN. L.N. WELINGKAR INSTITUTE OF
MANAGEMENT DEVELOPMENT & RESEARCH
YEAR OF SUBMISSION: MARCH 2015
Page 1 of 75
ANNEXURE 1
FLOW CHART INDICATING THE BASIC ELEMENTS OF THE PROJECT
To reach “Search Engine as Image Builder” we first need to know Search
Engine Optimization (SEO):
Code techniques that minimize the use of Flash and frames
Keywords or keyword phrases that fit the target market
Linking strategy
ANNEXURE 3
UNDERTAKING BY CANDIDATE
I declare that the project entitled Case Study: Search Engines as Image Builders is my own
work conducted as part of my syllabus.
I further declare that the project work presented has been prepared personally by me and is
not sourced from any outside agency. I understand that any such malpractice will have very
serious consequences, and my admission to the program will be cancelled without any refund
of fees.
I am also aware that I may face legal action if I engage in such malpractice.
Heena Jaisinghani
(Signature of Candidate)
Table of contents
Introduction
Background
Methodology
Conclusions & Recommendations
Limitations
Bibliography
Introduction
“Search Engine as Image Builder” can be approached from two points of view:
1) How a company builds its image using search engines, and
2) How an image gets built through a search engine spider simulator together with Search Engine Optimization.
Let us start with the second topic, which will lead us to the first.
Search Engine Optimization (SEO) is the process of making web pages easy to find, easy to
crawl, and easy to categorize, and also of making those pages rank high for certain
keywords or search terms.
The technique behind image building is the search engine spider simulator.
Basically, all search engine spiders function on the same principle – they crawl the Web and
index pages, which are stored in a database; various algorithms are later used to determine
the page ranking, relevancy, etc. of the collected pages. While the algorithms of calculating
ranking and relevancy widely differ among search engines, the way they index sites is more
or less uniform and it is very important that you know what spiders are interested in and what
they neglect.
Businesses are growing more aware of the need to understand and implement at least the
basics of search engine optimization (SEO). But if you read a variety of blogs and websites,
you’ll quickly see that there’s a lot of uncertainty over what makes up “the basics.” Without
access to high-level consulting and without a lot of experience knowing what SEO resources
can be trusted, there’s also a lot of misinformation about SEO strategies and tactics.
Below are techniques for the use of SEO:
1. Commit yourself to the process: SEO isn’t a one-time event. Search engine algorithms
change regularly, so the tactics that worked last year may not work this year. SEO requires a
long-term outlook and commitment.
2. Be patient: SEO isn’t about instant gratification. Results often take months to see, and this
is especially true the smaller you are, and the newer you are to doing business online.
3. Ask a lot of questions when hiring an SEO company: It’s your job to know what kind of
tactics the company uses. Ask for specifics. Ask if there are any risks involved. Then get
online yourself and do your own research—about the company, about the tactics they
discussed, and so forth.
4. Become a student of SEO: If you’re taking the do-it-yourself route, you’ll have to become
a student of SEO and learn as much as you can.
5. Have web analytics in place at the start: You should have clearly defined goals for your
SEO efforts, and you’ll need web analytics software in place so you can track what’s working
and what’s not.
6. Build a great web site: Ask yourself, “Is my site really one of the 10 best sites in the
world on this topic?” Be honest. If it’s not, make it better.
7. Include a site map page: Spiders can’t index pages that can’t be crawled. A site map will
help spiders find all the important pages on your site, and help the spider understand your
site’s hierarchy. This is especially helpful if your site has a hard-to-crawl navigation menu. If
your site is large, make several site map pages. Keep each one to fewer than 100 links; 75 at
most is advisable to be safe.
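The site map advice above can be sketched in code. The following is a minimal illustration (not part of the original project) of splitting a site's URLs into XML sitemap files of at most 75 links each; the domain and page names are hypothetical.

```python
def build_sitemaps(urls, max_links=75):
    """Return a list of XML sitemap documents, each holding <= max_links URLs."""
    sitemaps = []
    for start in range(0, len(urls), max_links):
        chunk = urls[start:start + max_links]
        entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in chunk)
        sitemaps.append(
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>"
        )
    return sitemaps

# Hypothetical example: 160 pages become three sitemap files (75 + 75 + 10).
pages = [f"https://example.com/page-{i}.html" for i in range(160)]
maps = build_sitemaps(pages)
print(len(maps))  # 3
```

A real site would also list these files in a sitemap index and submit it via the search engines' webmaster tools.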
8. Make SEO-friendly URLs: Use keywords in your URLs and file names, such as
yourdomain.com/red-widgets.html. Don’t overdo it, though. A file with 3+ hyphens tends to
look spammy and users may be hesitant to click on it. Use hyphens in URLs and file names,
not underscores. Hyphens are treated as a “space,” while underscores are not.
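The URL advice in point 8 can be illustrated with a small helper. This is a hypothetical sketch, not part of the original project: it lowercases a page title, turns spaces and underscores into hyphens, and strips other punctuation.

```python
import re

def seo_slug(title):
    """Turn a page title into a hyphenated, SEO-friendly file name."""
    slug = title.lower()
    slug = re.sub(r"[_\s]+", "-", slug)     # spaces/underscores -> hyphens
    slug = re.sub(r"[^a-z0-9-]", "", slug)  # drop other punctuation
    return slug.strip("-") + ".html"

print(seo_slug("Red Widgets"))  # red-widgets.html
```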
9. Do keyword research at the start of the project: If you’re on a tight budget, use the free
versions of Keyword Discovery or WordTracker, both of which also have more powerful
paid versions. Ignore the numbers these tools show; what’s important is the relative volume
of one keyword to another. Another good free tool is Google’s AdWords Keyword Tool,
which doesn’t show exact numbers.
10. Open up a PPC account: Whether it’s Google’s AdWords, Microsoft adCenter or
something else, this is a great way to get actual search volume for your keywords. Yes, it
costs money, but if you have the budget it’s worth the investment. It’s also the solution if you
didn’t like the “Be patient” suggestion above and are looking for instant visibility.
11. Use a unique and relevant title and meta description on every page: The page title is
the single most important on-page SEO factor. It’s rare to rank highly for a primary term (2-3
words) without that term being part of the page title. The meta description tag won’t help you
rank, but it will often appear as the text snippet below your listing, so it should include the
relevant keyword(s) and be written so as to encourage searchers to click on your listing.
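To audit point 11 across many pages, the title and meta description can be extracted programmatically. Below is a minimal sketch using only the Python standard library; the HTML snippet is hypothetical.

```python
from html.parser import HTMLParser

class TitleMetaParser(HTMLParser):
    """Collect the <title> text and the meta description of a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data  # text between <title> and </title>

page = ('<html><head><title>Red Widgets | Example Shop</title>'
        '<meta name="description" content="Hand-made red widgets."></head></html>')
p = TitleMetaParser()
p.feed(page)
print(p.title)        # Red Widgets | Example Shop
print(p.description)  # Hand-made red widgets.
```

Run over a site's pages, this makes it easy to spot missing or duplicated titles and descriptions.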
12. Write for users first: Google, Yahoo, etc., have pretty powerful bots crawling the web,
but to my knowledge these bots have never bought anything online, signed up for a
newsletter, or picked up the phone to call about your services. Humans do those things, so
write your page copy with humans in mind. Yes, you need keywords in the text, but don’t
stuff each page like a Thanksgiving turkey. Keep it readable.
13. Create great, unique content: This is important for everyone, but it’s a particular
challenge for online retailers. If you’re selling the same widget that 50 other retailers are
selling, and everyone is using the boilerplate descriptions from the manufacturer, this is a
great opportunity. Write your own product descriptions, using the keyword research you did
earlier (see #9 above) to target actual words searchers use, and make product pages that blow
the competition away. Plus, retailer or not, great content is a great way to get inbound links.
14. Use your keywords as anchor text when linking internally: Anchor text helps tell
spiders what the linked-to page is about. Links that say “click here” do nothing for your
search engine visibility.
15. Build links intelligently: Begin with foundational links like trusted directories. (Yahoo
and DMOZ are often cited as examples, but don’t waste time worrying about DMOZ
submission. Submit it and forget it.) Seek links from authority sites in your industry. If local
search matters to you (more on that coming up), seek links from trusted sites in your
geographic area — the Chamber of Commerce, local business directories, etc. Analyze the
inbound links to your competitors to find links you can acquire, too. Create great content on a
consistent basis and use social media to build awareness and links.
16. Use press releases wisely: Developing a relationship with media covering your industry
or your local region can be a great source of exposure, including getting links from trusted
media web sites. Distributing releases online can be an effective link building tactic, and
opens the door for exposure in news search sites. Only issue a release when you have
something newsworthy to report. Don’t waste journalists’ time.
17. Start a blog and participate with other related blogs: Search engines, Google
especially, love blogs for the fresh content and highly-structured data. Beyond that, there’s no
better way to join the conversations that are already taking place about your industry and/or
company. Reading and commenting on other blogs can also increase your exposure and help
you acquire new links. Put your blog at yourdomain.com/blog so your main domain gets the
benefit of any links to your blog posts. If that’s not possible, use blog.yourdomain.com.
18. Use social media marketing wisely. If your business has a visual element, join the
appropriate communities on Flickr and post high-quality photos there. If you’re a service-
oriented business, use Quora and/or Yahoo Answers to position yourself as an expert in your
industry. Any business should also be looking to make use of Twitter and Facebook, as social
information and signals from these are being used as part of search engine rankings for
Google and Bing. With any social media site you use, the first rule is don’t spam! Be an
active, contributing member of the site. The idea is to interact with potential customers, not
annoy them.
19. Take advantage of local search opportunities. Online research for offline buying is a
growing trend. Optimize your site to catch local traffic by showing your address and local
phone number prominently. Write a detailed Directions/Location page using neighbourhoods
and landmarks in the page text. Submit your site to the free local listings services that the
major search engines offer. Make sure your site is listed in local/social directories such as
CitySearch, Yelp, Local.com, etc., and encourage customers to leave reviews of your
business on these sites, too.
20. Take advantage of the tools the search engines give you. Sign up for
Google Webmaster Central, Bing Webmaster Tools and Yahoo Site Explorer to learn more
about how the search engines see your site, including how many inbound links they’re aware
of.
21. Diversify your traffic sources. Google may bring you 70% of your traffic today, but
what if the next big algorithm update hits you hard? What if your Google visibility goes away
tomorrow? Newsletters and other subscriber-based content can help you hold on to
traffic/customers no matter what the search engines do. In fact, many of the DOs on this
list—creating great content, starting a blog, using social media and local search, etc.—will
help you grow an audience of loyal prospects and customers that may help you survive the
whims of search engines
Background
This section shows the behind-the-scenes concept of the search engine spider simulator and the techniques mentioned above.
Are Your Hyperlinks Spiderable?
The search engine spider simulator can be of great help when trying to figure out if the
hyperlinks lead to the right place. For instance, link exchange websites often put fake links to
your site with JavaScript (using mouseover events and the like to make the link look genuine),
but this is not actually a link that search engines will see and follow. Since the spider
simulator would not display such links, you'll know that something about the link is wrong.
It is highly recommended to use the <noscript> tag alongside JavaScript-based menus. The
reason is that JavaScript-based menus are not spiderable, and all the links in them will be
ignored. The solution to this problem is to put all menu item links in the
<noscript> tag. The <noscript> tag can hold a lot, but please avoid using it for link stuffing or
any other kind of SEO manipulation.
If you happen to have tons of hyperlinks on your pages (although it is highly recommended to
have fewer than 100 hyperlinks on a page), then you might have a hard time checking if they
are OK. For instance, if you have pages that display “403 Forbidden”, “404 Page Not Found” or
similar errors that prevent the spider from accessing the page, then it is certain that the page
will not be indexed. It is necessary to mention that a spider simulator does not deal with 403
and 404 errors, because it checks where links lead to, not whether the target of the link is in
place, so you need to use other tools to check whether the targets of hyperlinks are the
intended ones.
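A tiny version of the link-checking idea above: extract only the real <a href> links, the way a spider sees them, and filter out JavaScript pseudo-links. This is a hypothetical sketch using only the standard library.

```python
from html.parser import HTMLParser

class LinkSpider(HTMLParser):
    """Collect hrefs a search engine spider could actually follow."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Spiders ignore JavaScript pseudo-links and empty anchors.
            if href and not href.lower().startswith("javascript:"):
                self.links.append(href)

page = ('<a href="/about.html">About</a>'
        '<a href="javascript:void(0)" onclick="go()">Fake link</a>'
        '<a href="https://example.com/red-widgets.html">Widgets</a>')
spider = LinkSpider()
spider.feed(page)
print(spider.links)  # ['/about.html', 'https://example.com/red-widgets.html']
```

The “fake” exchange link is invisible to this parser, which is exactly what a spider simulator reveals.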
Looking for Your Keywords
While there are specific tools, like the Keyword Playground or the Website Keyword
Suggestions, which deal with keywords in more detail, search engine spider simulators also
help you see, through the eyes of a spider, where keywords are located in the text of the page.
Why is this important? Because keywords in the first paragraphs of a page weigh more than
keywords in the middle or at the end. And if keywords visually appear to us to be on the top,
this may not be the way spiders see them. Consider a standard Web page with tables. In this
case the code that describes the page layout (like navigation links or separate cells with text
that are the same sitewide) might come first chronologically and, what is worse, can be so long
that the actual page-specific content will be screens away from the top of the page.
Are Dynamic Pages Too Dynamic to Be Seen at All?
Dynamic pages (especially ones with question marks in the URL) are also an extra that
spiders do not love, although many search engines do index dynamic pages as well. Running
the spider simulator will give you an idea how well your dynamic pages are accepted by
search engines.
Meta Keywords and Meta Description
Meta keywords and the meta description, as the names imply, are found in <meta> tags in
the <head> of an HTML page. Meta keywords and descriptions were once the most
important criteria for determining the relevance of a page, but search engines now employ
alternative mechanisms for determining relevancy, so you can safely skip listing keywords
and a description in meta tags (unless you want to add instructions there for the spider about
what to index and what not to; apart from that, meta tags are not very useful anymore).
Meta tags are a great way for webmasters to provide search engines with information about
their sites. Meta tags can be used to provide information to all sorts of clients, and each
system processes only the meta tags they understand and ignores the rest. Meta tags are added
to the <head> section of your HTML page and generally look like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description" content="Author: A.N. Author, Illustrator: P. Picture, Category: Books, Price: £9.24, Length: 784 pages">
</head>
</html>
Methodology
Here we come to know how the whole process started.
Finding information on the World Wide Web had been a difficult and frustrating task, but
became much more usable with breakthroughs in search engine technology in the late 1990s.
A web search engine is a software system that is designed to search for information on the
World Wide Web. The search results are generally presented in a line of results often referred
to as search engine results pages (SERPs). The information may be a mix of web pages,
images, and other types of files. Some search engines also mine data available in databases or
open directories. Unlike web directories, which are maintained only by human editors, search
engines also maintain real-time information by running an algorithm on a web crawler. (A
Web crawler is an Internet bot that systematically browses the World Wide Web, typically
for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, or
an automatic indexer.)
History
Further information: Timeline of web search engines
Timeline (full list)

Year  Engine           Current status
1993  W3Catalog        Inactive
1993  Aliweb           Inactive
1993  JumpStation      Inactive
1993  WWW Worm         Inactive
1994  WebCrawler       Active, Aggregator
1994  Go.com           Active, Yahoo Search
1994  Lycos            Active
1994  Infoseek         Inactive
1995  AltaVista        Inactive, redirected to Yahoo!
1995  Daum             Active
1995  Magellan         Inactive
1995  Excite           Active
1995  SAPO             Active
1995  Yahoo!           Active, launched as a directory
1996  Dogpile          Active, Aggregator
1996  Inktomi          Inactive, acquired by Yahoo!
1996  HotBot           Active (lycos.com)
1996  Ask Jeeves       Active (rebranded ask.com)
1997  Northern Light   Inactive
1997  Yandex           Active
1998  Google           Active
1998  Ixquick          Active, also as Startpage
1998  MSN Search       Active as Bing
1998  empas            Inactive (merged with NATE)
1999  AlltheWeb        Inactive (URL redirected to Yahoo!)
1999  GenieKnows       Active, rebranded Yellowee.com
1999  Naver            Active
1999  Teoma            Inactive, redirects to Ask.com
1999  Vivisimo         Inactive
2000  Baidu            Active
2000  Exalead          Active
2000  Gigablast        Active
2003  Info.com         Active
2003  Scroogle         Inactive
2004  Yahoo! Search    Active, launched own web search (see Yahoo! Directory, 1995)
2004  A9.com           Inactive
2004  Sogou            Active
2005  AOL Search       Active
2005  GoodSearch       Active
2005  SearchMe         Inactive
2006  Soso             Active
2006  Quaero           Inactive
2006  Ask.com          Active
2006  Live Search      Active as Bing, launched as rebranded MSN Search
2006  ChaCha           Active
2006  Guruji.com       Inactive
2007  wikiseek         Inactive
2007  Sproose          Inactive
2007  Wikia Search     Inactive
2007  Blackle.com      Active, Google Search
2008  Powerset         Inactive (redirects to Bing)
2008  Picollator       Inactive
2008  Viewzi           Inactive
2008  Boogami          Inactive
2008  LeapFish         Inactive
2008  Forestle         Inactive (redirects to Ecosia)
2008  DuckDuckGo       Active
2009  Bing             Active, launched as rebranded Live Search
2009  Yebol            Inactive
2009  Mugurdy          Inactive due to a lack of funding
2009  Scout (Goby)     Active
2009  NATE             Active
2010  Blekko           Active
2010  Cuil             Inactive
2010  Yandex           Active, launched global (English) search
2011  YaCy             Active, P2P web search engine
2012  Volunia          Inactive
2013  Halalgoogling    Active, Islamic / Halal filter search
During early development of the web, there was a list of webservers edited by Tim Berners-
Lee and hosted on the CERN webserver. One historical snapshot of the list in 1992 remains,
but as more and more webservers went online the central list could no longer keep up. On the
NCSA(National Center for Supercomputing Applications) site, new servers were announced
under the title "What's New!"
The first tool used for searching on the Internet was Archie. The name stands for "archive"
without the "v". It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter Deutsch,
computer science students at McGill University in Montreal. The program downloaded the
directory listings of all the files located on public anonymous FTP (File Transfer Protocol)
sites, creating a searchable database of file names; however, Archie did not index the contents
of these sites since the amount of data was so limited it could be readily searched manually.
In June 1993, Matthew Gray, then at MIT (Massachusetts Institute of Technology), produced
what was probably the first web robot (a software application that runs automated tasks
over the Internet), the Perl-based World Wide Web Wanderer (Perl is a family of high-level,
general-purpose, interpreted, dynamic programming languages, including Perl 5 and Perl 6),
and used it to generate an index called 'Wandex'.
The purpose of the Wanderer was to measure the size of the World Wide Web, which it did
until late 1995. The web's second search engine Aliweb appeared in November 1993. Aliweb
did not use a web robot, but instead depended on being notified by website administrators of
the existence at each site of an index file in a particular format.
JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to find web
pages and to build its index, and used a web form (which allows a user to enter data that
is sent to a server for processing) as the interface to its query program. It was thus the first
WWW resource-discovery tool to combine the three essential features of a web search engine
(crawling, indexing, and searching) as described below. Because of the limited resources
available on the platform it ran on, its indexing and hence searching were limited to the titles
and headings found in the web pages the crawler encountered.
One of the first "all text" crawler-based search engines was WebCrawler, which came out in
1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which
has become the standard for all major search engines since. It was also the first one widely
known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was
launched and became a major commercial endeavor.
Soon after, many search engines appeared and vied for popularity. These included Magellan,
Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most
popular ways for people to find web pages of interest, but its search function operated on its
web directory (it specializes in linking to other web sites and categorizing those
links.) rather than its full-text copies of web pages. Information seekers could also
browse the directory instead of doing a keyword-based search.
Google adopted the idea of selling search terms in 1998 from a small search engine company
named goto.com (which relates to internet advertising). This move had a significant effect on
the search engine business, which went from struggling to one of the most profitable
businesses on the internet.
In 1996, Netscape (it’s an American computer services company, best known for Netscape
Navigator, its web browser) was looking to give a single search engine an exclusive deal as
the featured search engine on Netscape's web browser. There was so much interest that
instead Netscape struck deals with five of the major search engines: for $5 million a year,
WE SCHOOL
each search engine would be in rotation on the Netscape search engine page. The five engines
were Yahoo!, Magellan, Lycos, Infoseek, and Excite.
Search engines were also known as some of the brightest stars in the Internet investing frenzy
that occurred in the late 1990s. Several companies entered the market spectacularly,
receiving record gains during their initial public offerings. Some have since taken down their
public search engine and market enterprise-only editions, such as Northern Light.
Many search engine companies were caught up in the dot-com bubble, a speculation-driven
market boom that peaked in 1999 and ended in 2001.
Around 2000, Google's search engine rose to prominence. The company achieved better
results for many searches with an innovation called PageRank (it is a way of measuring the
importance of website pages), as was explained in the paper The Anatomy of a Large-Scale
Hypertextual Web Search Engine written by Sergey Brin and Larry Page, the eventual
founders of Google. This iterative algorithm
ranks web pages based on the number and PageRank of other web sites and pages that link
there, on the premise that good or desirable pages are linked to more than others. Google also
maintained a minimalist interface to its search engine. In contrast, many of its competitors
embedded a search engine in a web portal. In fact, Google search engine became so popular
that spoof engines emerged such as Mystery Seeker.
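The iterative idea behind PageRank can be sketched in a few lines. This is a textbook-style illustration, not Google's actual implementation; the three-page link graph is hypothetical, and 0.85 is the commonly cited damping factor.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank
    for _ in range(iterations):
        new = {}
        for p in pages:
            # A page's rank is shared equally among the pages it links to.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

# Hypothetical graph: A and C link to B, B links back to A.
graph = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # B
```

B ends up with the highest rank because two pages link to it, matching the premise that good or desirable pages are linked to more than others.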
By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo!
acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003.
Yahoo! used Google's search engine until 2004, when it launched its own search
engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In
WE SCHOOL
early 1999 the site began to display listings from Looksmart (it is an American, publicly
traded, online advertising company founded in 1995), blended with results from
Inktomi. For a short time in 1999, MSN Search used results from AltaVista instead. In
2004, Microsoft began a transition to its own search technology, powered by its own web
crawler (called msnbot). Microsoft's rebranded search engine, Bing, was launched on June 1,
2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search
would be powered by Microsoft Bing technology.
Before going ahead with further details, I would like to highlight aggregators.
“Aggregators” is the buzzword of choice for the various online companies that gather
information from fragmented marketplaces into a single portal to make life easier for
everyone. A classic example is online airline and hotel reservations.
How web search engines work
A search engine operates in the following order:
1. Web crawling (it is an Internet bot that systematically browses the World Wide Web,
typically for the purpose of Web indexing)
2. Indexing (it collects, parses, and stores data to facilitate fast and accurate
information retrieval. Index design incorporates interdisciplinary concepts from
linguistics, cognitive psychology, mathematics, informatics, and computer science)
3. Searching (is a query that a user enters into a web search engine to satisfy his or her
information needs.)
Each of these is explained below.
WE SCHOOL
Web search engines work by storing information about many web pages, which they retrieve
from the HTML markup of the pages. These pages are retrieved by a Web crawler (sometimes
also known as a spider) – an automated bot which follows every link on the site. The site
owner can exclude specific pages by using robots.txt.
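The robots.txt exclusion mentioned here can be checked with Python's standard library. A small sketch with a hypothetical robots.txt for example.com, parsed from a string rather than fetched so the example is self-contained:

```python
import urllib.robotparser

# Hypothetical robots.txt: block /private/, allow everything else.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/index.html"))      # True
print(rp.can_fetch("*", "https://example.com/private/a.html"))  # False
```

A well-behaved crawler makes exactly this check before requesting each page.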
The search engine then analyzes the contents of each page to determine how it should be
indexed (for example, words can be extracted from the titles, page content, headings, or
special fields called meta tags, which are part of a web page's head section). Data about web
pages is stored in an index database for use in later queries. A query from a user can be a
single word. The index helps find information relating to the query as quickly as possible.
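The crawl, index, and search sequence described above can be sketched end to end. This is a hypothetical, self-contained illustration: the “web” is an in-memory dictionary of page text and outgoing links, where a real crawler would fetch URLs and parse HTML.

```python
# A toy web: URL -> (page text, outgoing links).
WEB = {
    "/home":    ("welcome to red widgets", ["/widgets", "/about"]),
    "/widgets": ("buy red widgets online", ["/home"]),
    "/about":   ("about our widget shop", ["/home"]),
}

def crawl(start):
    """Follow every link reachable from the start page (the 'spider')."""
    seen, queue = set(), [start]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        queue.extend(WEB[url][1])
    return seen

def index(urls):
    """Build a word -> set-of-URLs mapping for fast lookup."""
    idx = {}
    for url in urls:
        for word in WEB[url][0].split():
            idx.setdefault(word, set()).add(url)
    return idx

def search(idx, word):
    """Answer a one-word query from the index, not the live pages."""
    return sorted(idx.get(word, set()))

idx = index(crawl("/home"))
print(search(idx, "widgets"))  # ['/home', '/widgets']
```

Note that the query is answered entirely from the index; the pages themselves are never re-read at search time, which is what makes lookups fast.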
Some search engines, such as Google, store all or part of the source page (referred to as a
cache) as well as information about the web pages, whereas others, such as AltaVista, store
every word of every page they find. This cached page always holds the actual search text
since it is the one that was actually indexed, so it can be very useful when the content of the
current page has been updated and the search terms are no longer in it. This problem might be
considered a mild form of linkrot, and Google's handling of it increases usability by
satisfying user expectations that the search terms will be on the returned webpage. This
satisfies the principle of least astonishment, since the user normally expects that the search
terms will be on the returned pages. Increased search relevance makes these cached pages
very useful as they may contain data that may no longer be available elsewhere.
High-level architecture of a standard Web crawler
When a user enters a query into a search engine (typically by using keywords), the engine
examines its inverted index and provides a listing of best-matching web pages according to
its criteria, usually with a short summary containing the document's title and sometimes parts
of the text. The index is built from the information stored with the data and the method by
which the information is indexed. From 2007 the Google.com search engine has allowed one
to search by date by clicking "Show search tools" in the leftmost column of the initial search
results page, and then selecting the desired date range. Most search engines support
the use of the Boolean operators AND, OR, and NOT to further specify the Web search
query.
Boolean operators are for literal searches that allow the user to refine and extend the terms of
the search. The engine looks for the words or phrases exactly as entered. Some search
engines provide an advanced feature called proximity search, which allows users to define the
distance between keywords. There is also concept-based searching, where the search involves
using statistical analysis on pages containing the words or phrases you search for. As well,
natural language queries allow the user to type a question in the same form one would ask it
of a human. An example of such a site is ask.com.
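Boolean operators over an index map naturally onto set operations. In this hypothetical sketch, each keyword maps to the set of pages containing it, and AND, OR, and NOT become intersection, union, and difference:

```python
# Hypothetical postings: keyword -> set of pages containing it.
postings = {
    "red":     {"p1", "p2", "p4"},
    "widgets": {"p2", "p3", "p4"},
    "cheap":   {"p4"},
}

# red AND widgets: pages containing both words
print(sorted(postings["red"] & postings["widgets"]))    # ['p2', 'p4']
# red OR cheap: pages containing either word
print(sorted(postings["red"] | postings["cheap"]))      # ['p1', 'p2', 'p4']
# widgets NOT cheap: pages with "widgets" but not "cheap"
print(sorted(postings["widgets"] - postings["cheap"]))  # ['p2', 'p3']
```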
The usefulness of a search engine depends on the relevance of the result set it gives back.
While there may be millions of web pages that include a particular word or phrase, some
pages may be more relevant, popular, or authoritative than others. Most search engines
employ methods to rank the results to provide the "best" results first. How a search engine
decides which pages are the best matches, and what order the results should be shown in,
varies widely from one engine to another. The methods also change over time as Internet
usage changes and new techniques evolve. There are two main types of search engine that
have evolved: one is a system of predefined and hierarchically ordered keywords that humans
have programmed extensively. The other is a system that generates an "inverted index" (In
computer science, an inverted index (also referred to as postings file or inverted file) is an
index data structure storing a mapping from content, such as words or numbers, to its
locations in a database file, or in a document or a set of documents. The purpose of an
inverted index is to allow fast full text searches, at a cost of increased processing when a
document is added to the database) by analyzing the texts it locates. This second form relies
much more heavily on the computer itself to do the bulk of the work.
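The inverted index defined above can be built in a few lines. A minimal sketch with hypothetical documents:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc id -> text. Returns word -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)  # processing cost paid when a doc is added
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    "d1": "search engines crawl the web",
    "d2": "the web is large",
    "d3": "search engines rank pages",
}
idx = build_inverted_index(docs)
print(idx["search"])  # ['d1', 'd3']
print(idx["web"])     # ['d1', 'd2']
```

As the surrounding text notes, the extra work happens when a document is added; full-text queries afterwards are a single dictionary lookup.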
Most Web search engines are commercial ventures supported by advertising revenue and thus
some of them allow advertisers to have their listings ranked higher in search results for a fee.
Search engines that do not accept money for their search results make money by running
search related ads alongside the regular search engine results. The search engines make
money every time someone clicks on one of these ads.
Market share
Here we will come to know which site has the highest market share
Google is the world's most popular search engine, with a market share of 68.69 per cent.
Baidu comes in a distant second, answering 17.17 per cent online queries.
The world's most popular search engines are:
Search engine   Market share in October 2014
Google          58.01%
Baidu           29.06%
Bing            8.01%
Yahoo!          4.01%
AOL             0.21%
Ask             0.10%
Excite          0.00%
East Asia and Russia
East Asian countries and Russia constitute a few places where Google is not the most popular
search engine. Soso is more popular than Google in China.
Yandex commands a marketshare of 61.9 per cent in Russia, compared to Google's 28.3 per
cent. In China, Baidu is the most popular search engine. South Korea's homegrown
search portal, Naver, is used for 70 per cent online searches in the country. Yahoo! Japan
and Yahoo! Taiwan are the most popular avenues for internet search in Japan and Taiwan,
respectively.
Search engine bias
Although search engines are programmed to rank websites based on some combination of
their popularity and relevancy, empirical studies indicate various political, economic, and
social biases in the information they provide. These biases can be a direct result of
economic and commercial processes (e.g., companies that advertise with a search engine can
also become more popular in its organic search results) and of political processes (e.g., the
removal of search results to comply with local laws). Organic search results are listings on
search engine results pages that appear because of their relevance to the search terms, as
opposed to being advertisements; in contrast, non-organic search results may include
pay-per-click advertising. For example, Google will not surface certain Neo-Nazi websites in
France and Germany, where Holocaust denial is illegal.
Biases can also be a result of social processes, as search engine algorithms are frequently
designed to exclude non-normative viewpoints in favor of more "popular" results. Indexing
algorithms of major search engines skew towards coverage of U.S.-based sites, rather than
websites from non-U.S. countries.
Google Bombing (The terms Google bomb and Googlewashing refer to the practice of
causing a web page to rank highly in search engine results for unrelated or off-topic search
terms by linking heavily.) is one example of an attempt to manipulate search results for
political, social or commercial reasons.
Customized results and filter bubbles
Many search engines such as Google and Bing provide customized results based on the user's
activity history. This leads to an effect that has been called a filter bubble. The term describes
a phenomenon in which websites use algorithms to selectively guess what information a user
would like to see, based on information about the user (such as location, past click behavior
and search history). As a result, websites tend to show only information that agrees with the
user's past viewpoint, effectively isolating the user in a bubble that tends to exclude contrary
information. Prime examples are Google's personalized search results and Facebook's
personalized news stream. According to Eli Pariser, who coined the term, users get less
exposure to conflicting viewpoints and are isolated intellectually in their own informational
bubble. Pariser related an example in which one user searched Google for "BP" and got
investment news about British Petroleum, while another searcher got information about the
Deepwater Horizon oil spill, and noted that the two search results pages were "strikingly
different". The bubble effect may have negative implications for civic discourse,
according to Pariser.
Since this problem has been identified, competing search engines have emerged that seek to
avoid this problem by not tracking or "bubbling" users.
Faith-based search engines
The global growth of the Internet and the popularity of electronic content in the Arab and
Muslim world during the last decade have encouraged faith adherents, notably in the Middle
East and the Asian subcontinent, to "dream" of their own faith-based, i.e. "Islamic", search
engines or filtered search portals that would enable users to avoid accessing forbidden
websites such as pornography, and would allow them to access only sites compatible with
the Islamic faith. Shortly before the Muslim holy month of Ramadan, Halalgoogling, which
collects results from other search engines like Google and Bing, was introduced to the world
in July 2013 to present halal results to its users, nearly two years after I’mHalal, another
search engine initially launched in September 2011 to serve the Middle East internet, had to
close its search service due to what its owner blamed on a lack of funding.
While a lack of investment and the slow pace of technology in the Muslim world, as the
main consumers or targeted end users, have hindered progress and thwarted the success of a
serious Islamic search engine, the spectacular failure of heavily invested Muslim lifestyle
web projects like Muxlim, which received millions of dollars from investors like Rite
Internet Ventures, has, according to the I’mHalal shutdown notice, made almost laughable
the idea that the next Facebook or Google can only come from the Middle East if you
support your bright youth. Yet Muslim internet experts have for years been determining
what is or is not allowed according to the "Law of Islam", categorizing websites as either
"halal" or "haram". All the existing and past Islamic search engines are merely custom
searches indexed or monetized by major web search giants like Google, Yahoo and Bing,
with certain filtering systems applied to ensure that their users cannot access haram sites,
which include sites featuring nudity, homosexuality, gambling, or anything else deemed to
be anti-Islamic.
Another religiously oriented search engine is Jewogle, the Jewish version of Google, and yet
another is SeekFind.org, a Christian website whose filters prevent users from seeing anything
on the internet that attacks or degrades their faith.
So far we have studied how search engines are built and their contribution to today's
high-tech environment. Now we will look at how, with the help of this technology, an image
builds visibility.
How do I increase my site visibility to search engines?
These days you don’t have to limit your search to just websites. Many other forms of content
are easy to find, including images. No matter what you’re looking for, an image is (for better
or worse) just one image search away.
You may wonder, however, how image search works. How are images sorted and classified,
making it possible to find tens or hundreds of relevant results? Perhaps you’re just curious, or
perhaps you run a site and want to know so you can improve your own ranking. In either
case, taking a deeper look could be helpful.
Some people assume that image search is conducted via fancy algorithms that determine what
an image is about and then index it. I know that's where I started. As it turns out, however,
old-fashioned text is one of the most important factors in an image's ranking.
More specifically, the file name matters. Go ahead – do an image search. What do the top
results have in common? Almost invariably, it’s a portion of their file name. Most of the top
results for “pizza” have the word pizza in the file name.
That might seem obvious. But actually, it’s not. Most digital photographs, for example, will
start life with a file name like “1020302.jpg.” It’s only later that they’re re-named. For
webmasters, ensuring that a relevant file name is given to an image is just as basic and
important as making sure that a webpage’s keyword appears in that page’s metadata title
and/or description. But it’s not automatic. It takes constant effort.
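For webmasters, that renaming chore can be scripted. The sketch below is a minimal example; the camera-style file name and the keywords are invented for illustration:

```python
from pathlib import Path

def keyword_filename(path, keywords):
    # turn "1020302.jpg" plus ["Margherita", "Pizza"] into "margherita-pizza.jpg",
    # so the target keyword appears in the image's file name
    slug = "-".join(k.lower() for k in keywords)
    return path.with_name(slug + path.suffix)

renamed = keyword_filename(Path("1020302.jpg"), ["Margherita", "Pizza"])
```

On a real site the rename would then be applied with Path.rename before the image is published.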
We are now at the final step of building an image search engine — accepting a query image
and performing an actual search.
Let’s take a second to review how we got here:
Step 1: Defining Your Image Descriptor. Before we even consider building an
image search engine, we need to consider how we are going to represent and quantify
our image using only a list of numbers (i.e. a feature vector). We explored three
aspects of an image that can easily be described: color, texture, and shape. We can use
one of these aspects, or many of them.
Step 2: Indexing Your Dataset. Now that we have selected a descriptor, we can
apply the descriptor to extract features from each and every image in our dataset. The
process of extracting features from an image dataset is called “indexing”. These
features are then written to disk for later use. Indexing is also a task that is easily
made parallel by utilizing multiple cores/processors on our machine.
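The parallel-indexing idea can be sketched as below. Everything here is illustrative: describe() is a stand-in grayscale-histogram descriptor (not the color descriptor used later), and a thread pool is used so the sketch stays portable; swapping in multiprocessing.Pool would spread the work across CPU cores as described:

```python
from multiprocessing.dummy import Pool  # thread pool; multiprocessing.Pool uses real cores
import numpy as np

def describe(image):
    # stand-in descriptor: an 8-bin grayscale histogram, normalized to sum to 1
    hist, _ = np.histogram(image, bins=8, range=(0, 256))
    return hist / hist.sum()

def index_dataset(images, workers=4):
    # apply the descriptor to every image in parallel
    with Pool(workers) as pool:
        features = pool.map(describe, images)
    # key the index by image position (a real index would use filenames)
    return dict(enumerate(features))

# tiny synthetic "dataset" of four solid-gray images
images = [np.full((32, 32), v) for v in (0, 64, 128, 255)]
index = index_dataset(images)
```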
Step 3: Defining Your Similarity Metric. In Step 1, we defined a method to extract
features from an image. Now, we need to define a method to compare our feature
vectors. A distance function should accept two feature vectors and then return a value
indicating how “similar” they are. Common choices for similarity functions include
(but are certainly not limited to) the Euclidean, Manhattan, Cosine, and Chi-Squared
distances.
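As a concrete illustration, here is how those four distances might be computed for two made-up 3-bin feature vectors (cosine is written as a dissimilarity so that, like the others, smaller means more similar):

```python
import numpy as np

a = np.array([0.25, 0.25, 0.50])
b = np.array([0.30, 0.30, 0.40])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 distance
manhattan = np.sum(np.abs(a - b))           # L1 / city-block distance
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
chi2 = 0.5 * np.sum((a - b) ** 2 / (a + b + 1e-10))
```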
Finally, we are now ready to perform our last step in building an image search engine:
Searching and Ranking
The Query
Before we can perform a search, we need a query.
The last time you went to Google, you typed in some keywords into the search box, right?
The text you entered into the input form was your “query”.
Google then took your query, analyzed it, and compared it to their gigantic index of
webpages, ranked them, and returned the most relevant webpages back to you.
Similarly, when we are building an image search engine, we need a query image.
Query images come in two flavors: an internal query image and an external query image.
As the name suggests, an internal query image already belongs in our index. We have already
analyzed it, extracted features from it, and stored its feature vector.
The second type of query image is an external query image. This is the equivalent to typing
our text keywords into Google. We have never seen this query image before and we can’t
make any assumptions about it. We simply apply our image descriptor, extract features, rank
the images in our index based on similarity to the query, and return the most relevant results.
Let’s think back to our similarity metrics for a second and assume that we are using the
Euclidean distance. The Euclidean distance has a nice property called the Coincidence
Axiom, implying that the function returns a value of 0 (indicating perfect similarity) if and
only if the two feature vectors are identical.
Example: If I were to search for an image already in my index, then the Euclidean distance
between the two feature vectors would be zero, implying perfect similarity. This image would
then be placed at the top of my search results since it is the most relevant. This makes sense
and is the intended behavior.
How strange it would be if I searched for an image already in my index and did not find it in
the #1 result position. That would likely imply that there was a bug in my code somewhere or
I’ve made some very poor choices in image descriptors and similarity metrics.
Overall, using an internal query image serves as a sanity check. It allows you to make sure
that your image search engine is functioning as expected.
Once you can confirm that your image search engine is working properly, you can then
accept external query images that are not already part of your index.
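This sanity check is easy to simulate with a toy index; the filenames and 3-bin vectors below are invented, and the chi-squared distance stands in for whatever metric you chose:

```python
import numpy as np

def chi2(a, b, eps=1e-10):
    # chi-squared distance between two feature vectors
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

index = {
    "shire-01.png": np.array([0.6, 0.3, 0.1]),
    "mordor-01.png": np.array([0.1, 0.1, 0.8]),
    "rivendell-01.png": np.array([0.3, 0.4, 0.3]),
}

# internal query: reuse a feature vector that is already in the index;
# its own image should come back first, at distance zero
query = index["shire-01.png"]
results = sorted((chi2(features, query), name) for name, features in index.items())
```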
The Search
So what’s the process of actually performing a search? Check out the outline below:
1. Accept a query image from the user
A user could be uploading an image from their desktop or from their mobile device. As
image search engines become more prevalent, I suspect that most queries will come from
devices such as iPhones and Droids. It’s simple and intuitive to snap a photo of a place,
object, or something that interests you using your cellphone, and then have it automatically
analyzed and relevant results returned.
2. Describe the query image
Now that you have a query image, you need to describe it using the exact same image
descriptor(s) as you did in the indexing phase. For example, if I used an RGB color histogram
with 32 bins per channel when I indexed the images in my dataset, I am going to use the same
32-bin-per-channel histogram when describing my query image. This ensures that I have a
consistent representation of my images. After applying my image descriptor, I now have a
feature vector for the query image.
3. Perform the Search
To perform the most basic method of searching, you need to loop over all the feature vectors
in your index. Then, you use your similarity metric to compare the feature vectors in your
index to the feature vectors from your query. Your similarity metric will tell you how
“similar” the two feature vectors are. Finally, sort your results by similarity.
Looping over your entire index may be feasible for small datasets. But if you have a large
image dataset, like Google or TinEye, this simply isn’t possible. You can’t compute the
distance between your query features and the billions of feature vectors already present in
your dataset.
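For small datasets, the linear scan above can at least be vectorized so that the query is compared against every indexed feature vector in a single NumPy operation; the vectors below are illustrative, and truly large engines replace this scan with approximate nearest-neighbour indexes:

```python
import numpy as np

def rank(index_features, query, eps=1e-10):
    # chi-squared distance from the query to every row of the index at once
    d = 0.5 * np.sum((index_features - query) ** 2 /
                     (index_features + query + eps), axis=1)
    return np.argsort(d)  # indices of the most similar images first

index_features = np.array([[0.1, 0.1, 0.8],
                           [0.6, 0.3, 0.1],
                           [0.3, 0.4, 0.3]])
order = rank(index_features, np.array([0.6, 0.3, 0.1]))
```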
4. Display Your Results to the User
Now that we have a ranked list of relevant images we need to display them to the user. This
can be done using a simple web interface if the user is on a desktop, or we can display the
images using some sort of app if they are on a mobile device. This step is pretty trivial in the
overall context of building an image search engine, but you should still give thought to the
user interface and how the user will interact with your image search engine.
Summary
So there you have it, the four steps of building an image search engine, from front to back:
1. Define your image descriptor.
2. Index your dataset.
3. Define your similarity metric.
4. Perform a search, rank the images in your index in terms of relevancy to the user, and
display the results to the user.
Here is a concrete example of an image search engine built along the lines of the explanation above.
Think about it this way. When you go to Google and type “Lord of the Rings” into the search
box, you expect Google to return pages to you that are relevant to Tolkien’s books and the
movie franchise. Similarly, if we present an image search engine with a query image, we
expect it to return images that are relevant to the content of that image. Hence, image search
engines are more commonly known in academic circles as Content Based Image Retrieval
(CBIR) systems.
So what’s the overall goal of our Lord of the Rings image search engine?
The goal, given a query image from one of our five different categories, is to return the
category’s corresponding images in the top 10 results.
That was a mouthful. Let’s use an example to make it more clear.
If I submitted a query image of The Shire to our system, I would expect it to give me all 5
Shire images in our dataset back in the first 10 results. And again, if I submitted a query
image of Rivendell, I would expect our system to give me all 5 Rivendell images in the first
10 results.
Make sense? Good. Let’s talk about the four steps to building our image search engine.
The 4 Steps to Building an Image Search Engine
On the most basic level, there are four steps to building an image search engine:
Define your descriptor: What type of descriptor are you going to use? Are you describing
color? Texture? Shape?
Index your dataset: Apply your descriptor to each image in your dataset, extracting a set of
features.
Define your similarity metric: How are you going to define how “similar” two images are?
You’ll likely be using some sort of distance metric. (a metric or distance function is a
function that defines a distance between elements of a set. A set with a metric is called a
metric space. A metric induces a topology on a set but not all topologies can be generated by
a metric.) Common choices include Euclidean, Cityblock (Manhattan), Cosine, and chi-
squared to name a few.
Searching: To perform a search, apply your descriptor to your query image, and then ask
your distance metric to rank how similar your images are in your index to your query images.
Sort your results via similarity and then examine them.
Step #1: The Descriptor – A 3D RGB Color Histogram
Our image descriptor is a 3D color histogram in the RGB color space with 8 bins per red,
green, and blue channel.
The best way to explain a 3D histogram is to use the conjunctive AND. This image descriptor
will ask a given image how many pixels have a Red value that falls into bin #1 AND a Green
value that falls into bin #2 AND a Blue value that falls into bin #1. This process is repeated
for each combination of bins; however, it is done in a computationally efficient manner.
When computing a 3D histogram with 8 bins, OpenCV will store the feature vector as an (8,
8, 8) array. We’ll simply flatten it and reshape it to (512,). Once it’s flattened, we can easily
compare feature vectors together for similarity.
Ready to see some code? Okay, here we go:
3D RGB Histogram in OpenCV and Python
Python

1   # import the necessary packages
2   import numpy as np
3   import cv2
4
5   class RGBHistogram:
6       def __init__(self, bins):
7           # store the number of bins the histogram will use
8           self.bins = bins
9
10      def describe(self, image):
11          # compute a 3D histogram in the RGB colorspace,
12          # then normalize the histogram so that images
13          # with the same content, but either scaled larger
14          # or smaller will have (roughly) the same histogram
15          hist = cv2.calcHist([image], [0, 1, 2],
16              None, self.bins, [0, 256, 0, 256, 0, 256])
17          hist = cv2.normalize(hist)
18
19          # return our 3D histogram as a flattened array
20          return hist.flatten()
As you can see, an RGBHistogram class has been defined. The reason for this is that you
rarely ever extract features from a single image alone. You instead extract features from an
entire dataset of images. Furthermore, you expect that the features extracted from all images
utilize the same parameters — in this case, the number of bins for the histogram. It wouldn’t
make much sense to extract a histogram using 32 bins from one image and then 128 bins for
another image if you intend on comparing them for similarity.
Let’s take the code apart and understand what’s going on:
Lines 6-8: Here I am defining the constructor for the RGBHistogram. The only parameter we
need is the number of bins for each channel in the histogram. Again, this is why I prefer using
classes instead of functions for image descriptors — by putting the relevant parameters in the
constructor, you ensure that the same parameters are utilized for each image.
Line 10: You guessed it. The describe method is used to “describe” the image and return a
feature vector.
Line 15: Here we extract the actual 3D RGB Histogram (or actually, BGR since OpenCV
stores the image as a NumPy array, but with the channels in reverse order). We assume
self.bins is a list of three integers, designating the number of bins for each channel.
Line 16: It’s important that we normalize the histogram in terms of pixel counts. If we used
the raw (integer) pixel counts of an image, then shrunk it by 50% and described it again, we
would have two different feature vectors for identical images. In most cases, you want to
avoid this scenario. We obtain scale invariance by converting the raw integer pixel counts
into real-valued percentages. For example, instead of saying bin #1 has 120 pixels in it, we
would say bin #1 has 20% of all pixels in it. Again, by using the percentages of pixel counts
rather than raw, integer pixel counts, we can assure that two identical images, differing only
in size, will have (roughly) identical feature vectors.
Line 20: When computing a 3D histogram, the histogram will be represented as a NumPy
array with (N, N, N) bins. In order to more easily compute the distance between histograms,
we simply flatten this histogram to have a shape of (N ** 3,). Example: When we instantiate
our RGBHistogram, we will use 8 bins per channel. Without flattening our histogram, the
shape would be (8, 8, 8). But by flattening it, the shape becomes (512,).
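Both points, normalization and flattening, can be confirmed in a couple of lines (the counts are made up, echoing the 120-pixels-is-20% example above):

```python
import numpy as np

# a raw 3D histogram: 8 bins per channel, filled with made-up pixel counts
hist = np.zeros((8, 8, 8))
hist[0, 0, 0] = 120.0
hist[1, 2, 3] = 480.0

# normalizing turns raw counts into percentages (scale invariance) ...
normalized = hist / hist.sum()
# ... and flattening turns the (8, 8, 8) cube into a (512,) feature vector
flat = normalized.flatten()
```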
Now that we have defined our image descriptor, we can move on to the process of indexing our dataset.
Step #2: Indexing our Dataset
Okay, so we’ve decided that our image descriptor is a 3D RGB histogram. The next step is to
apply our image descriptor to each image in the dataset.
This simply means that we are going to loop over our 25 image dataset, extract a 3D RGB
histogram from each image, store the features in a dictionary, and write the dictionary to file.
Yep, that’s it.
In reality, you can make indexing as simple or complex as you want. Indexing is a task that is
easily made parallel. If we had a four core machine, we could divide the work up between the
four cores and speedup the indexing process. But since we only have 25 images, that’s pretty
silly, especially given how fast it is to compute a histogram.
Let’s dive into some code: Indexing an Image Dataset using Python
Python
1   # import the necessary packages
2   from pyimagesearch.rgbhistogram import RGBHistogram
3   import argparse
4   import cPickle
5   import glob
6   import cv2
7
8   # construct the argument parser and parse the arguments
9   ap = argparse.ArgumentParser()
10  ap.add_argument("-d", "--dataset", required = True,
11      help = "Path to the directory that contains the images to be indexed")
12  ap.add_argument("-i", "--index", required = True,
13      help = "Path to where the computed index will be stored")
14  args = vars(ap.parse_args())
15
16  # initialize the index dictionary to store our quantified
17  # images, with the 'key' of the dictionary being the image
18  # filename and the 'value' our computed features
19  index = {}
Alright, the first thing we are going to do is import the packages we need.
The --dataset argument is the path to where our images are stored on disk and the --index
option is the path to where we will store our index once it has been computed.
Finally, we’ll initialize our index — a builtin Python dictionary type. The key for the
dictionary will be the image filename. We’ve made the assumption that all filenames are
unique, and in fact, for this dataset, they are. The value for the dictionary will be the
computed histogram for the image.
Using a dictionary for this example makes the most sense, especially for explanation
purposes. Given a key, the dictionary points to some other object. When we use an image
filename as a key and the histogram as the value, we are implying that a given histogram H is
used to quantify and represent the image with filename K.
Again, you can make this process as simple or as complicated as you want. More complex
image descriptors make use of term frequency-inverse document frequency weighting (tf-idf)
and an inverted index, but for the time being, let’s keep it simple.
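To give a flavour of that more complex route, here is a toy inverted index; the "visual word" ids and filenames are invented. Only images sharing at least one word with the query need to be scored, which is what makes the structure fast:

```python
# per-image visual-word ids (purely illustrative)
image_words = {
    "shire-01.png": [3, 7, 9],
    "mordor-01.png": [2, 7],
}

# build the inverted index: word id -> set of image filenames containing it
inverted = {}
for name, words in image_words.items():
    for w in words:
        inverted.setdefault(w, set()).add(name)

# candidate images for a query containing words 7 and 9
query_words = [7, 9]
candidates = set().union(*(inverted.get(w, set()) for w in query_words))
```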
Indexing an Image Dataset using Python
Python
1   # initialize our image descriptor -- a 3D RGB histogram with
2   # 8 bins per channel
3   desc = RGBHistogram([8, 8, 8])
Here we instantiate our RGBHistogram. Again, we will be using 8 bins for each, red, green,
and blue, channel, respectively.
Indexing an Image Dataset using Python
Python
1   # use glob to grab the image paths and loop over them
2   for imagePath in glob.glob(args["dataset"] + "/*.png"):
3       # extract our unique image ID (i.e. the filename)
4       k = imagePath[imagePath.rfind("/") + 1:]
5
6       # load the image, describe it using our RGB histogram
7       # descriptor, and update the index
8       image = cv2.imread(imagePath)
9       features = desc.describe(image)
10      index[k] = features
Here is where the actual indexing takes place. Let’s break it down:
Line 2: We use glob to grab the image paths and start to loop over our dataset.
Line 4: We extract the “key” for our dictionary. All filenames are unique in this sample
dataset, so the filename itself will be enough to serve as the key.
Lines 8-10: The image is loaded off disk and we then use our RGBHistogram to extract a
histogram from the image. The histogram is then stored in the index.
Indexing an Image Dataset using Python
Python
1   # we are now done indexing our image -- now we can write our
2   # index to disk
3   f = open(args["index"], "w")
4   f.write(cPickle.dumps(index))
5   f.close()
Now that our index has been computed, let’s write it to disk so we can use it for searching later on.
Step #3: The Search
We now have our index sitting on disk, ready to be searched.
The problem is, we need some code to perform the actual search. How are we going to
compare two feature vectors and how are we going to determine how similar they are?
This question is better addressed first with some code.
Building an Image Search Engine in Python and OpenCV
Python
1   # import the necessary packages
2   import numpy as np
3
4   class Searcher:
5       def __init__(self, index):
6           # store our index of images
7           self.index = index
8
9       def search(self, queryFeatures):
10          # initialize our dictionary of results
11          results = {}
12
13          # loop over the index
14          for (k, features) in self.index.items():
15              # compute the chi-squared distance between the features
16              # in our index and our query features -- using the
17              # chi-squared distance which is normally used in the
18              # computer vision field to compare histograms
19              d = self.chi2_distance(features, queryFeatures)
20
21              # now that we have the distance between the two feature
22              # vectors, we can update the results dictionary -- the
23              # key is the current image ID in the index and the
24              # value is the distance we just computed, representing
25              # how 'similar' the image in the index is to our query
26              results[k] = d
27
28          # sort our results, so that the smaller distances (i.e. the
29          # more relevant images) are at the front of the list
30          results = sorted([(v, k) for (k, v) in results.items()])
31
32          # return our results
33          return results
34
35      def chi2_distance(self, histA, histB, eps = 1e-10):
36          # compute the chi-squared distance
37          d = 0.5 * np.sum([((a - b) ** 2) / (a + b + eps)
38              for (a, b) in zip(histA, histB)])
39
40          # return the chi-squared distance
41          return d
First off, most of this code is just comments. Don’t be scared that it’s 41 lines. Let’s
investigate what’s going on:
Lines 4-7: The first thing I do is define a Searcher class and a constructor with a single
parameter — the index. This index is assumed to be the index dictionary that we wrote to file
during the indexing step.
Line 11: We define a dictionary to store our results. The key is the image filename (from the
index) and the value is how similar the given image is to the query image.
Lines 14-26: Here is the part where the actual searching takes place. We loop over the image
filenames and corresponding features in our index. We then use the chi-squared distance to
compare our color histograms. The computed distance is then stored in the results dictionary,
indicating how similar the two images are to each other.
Lines 30-33: The results are sorted in terms of relevancy (the smaller the chi-squared
distance, the more relevant/similar the image) and returned.
Lines 35-41: Here we define the chi-squared distance function used to compare the two
histograms. In general, the difference between large bins vs. small bins is less important and
should be weighted as such. This is exactly what the chi-squared distance does. We provide
an epsilon dummy value to avoid those pesky “divide by zero” errors. Images will be
considered identical if their feature vectors have a chi-squared distance of zero. The larger the
distance gets, the less similar they are.
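That weighting behaviour is easy to verify numerically with a standalone copy of the chi-squared distance: the same absolute gap of 0.1 costs less between two large bins than between two small ones:

```python
import numpy as np

def chi2_distance(histA, histB, eps=1e-10):
    # standalone mirror of the Searcher's chi-squared distance
    histA, histB = np.asarray(histA), np.asarray(histB)
    return 0.5 * np.sum((histA - histB) ** 2 / (histA + histB + eps))

identical = chi2_distance([0.2, 0.8], [0.2, 0.8])   # identical histograms
large_bins = chi2_distance([0.80], [0.70])          # gap of 0.1 between large bins
small_bins = chi2_distance([0.15], [0.05])          # same gap between small bins
```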
So there you have it, a Python class that can take an index and perform a search.
Now it’s time to put this searcher to work.
Step #4: Performing a Search
Finally. We are closing in on a functioning image search engine.
But we’re not quite there yet. We need a little extra code to handle loading the images off
disk and performing the search:
Building an Image Search Engine in Python and OpenCV
Python
1   # import the necessary packages
2   from pyimagesearch.searcher import Searcher
3   import numpy as np
4   import argparse
5   import cPickle
6   import cv2
7
8   # construct the argument parser and parse the arguments
9   ap = argparse.ArgumentParser()
10  ap.add_argument("-d", "--dataset", required = True,
11      help = "Path to the directory that contains the images we just indexed")
12  ap.add_argument("-i", "--index", required = True,
13      help = "Path to where we stored our index")
14  args = vars(ap.parse_args())
15
16  # load the index and initialize our searcher
17  index = cPickle.loads(open(args["index"]).read())
18  searcher = Searcher(index)
First things first. Import the packages that we will need. We then define our arguments in the
same manner that we did during the indexing step. Finally, we use cPickle to load our index
off disk and initialize our Searcher.
Python
1   # loop over images in the index -- we will use each one as
2   # a query image
3   for (query, queryFeatures) in index.items():
4       # perform the search using the current query
5       results = searcher.search(queryFeatures)
6
7       # load the query image and display it
8       path = args["dataset"] + "/%s" % (query)
9       queryImage = cv2.imread(path)
10      cv2.imshow("Query", queryImage)
11      print "query: %s" % (query)
12
13      # initialize the two montages to display our results --
14      # we have a total of 25 images in the index, but let's only
15      # display the top 10 results; 5 images per montage, with
16      # images that are 400x166 pixels
17      montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
18      montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")
19
20      # loop over the top ten results
21      for j in xrange(0, 10):
22          # grab the result (we are using row-major order) and
23          # load the result image
24          (score, imageName) = results[j]
25          path = args["dataset"] + "/%s" % (imageName)
26          result = cv2.imread(path)
27          print "\t%d. %s : %.3f" % (j + 1, imageName, score)
28
29          # check to see if the first montage should be used
30          if j < 5:
31              montageA[j * 166:(j + 1) * 166, :] = result
32          # otherwise, the second montage should be used
33          else:
34              montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result
35
36      # show the results
37      cv2.imshow("Results 1-5", montageA)
38      cv2.imshow("Results 6-10", montageB)
39      cv2.waitKey(0)
Most of this code handles displaying the results. The actual “search” is done in a single line
(Line 5). Regardless, let’s examine what’s going on:
Line 3: We are going to treat each image in our index as a query and see what results we get
back. Normally, queries are external and not part of the dataset, but before we get to that,
let’s just perform some example searches.
Line 5: Here is where the actual search takes place. We treat the current image as our query
and perform the search.
Lines 8-11: Load and display our query image.
Lines 17-35: In order to display the top 10 results, I have decided to use two montage images.
The first montage shows results 1-5 and the second montage results 6-10. The name of the
image and distance is provided on Line 27.
Lines 36-39: Finally, we display our search results to the user.
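The montage slicing can be isolated into a tiny sketch, with solid-color blocks standing in for real result images:

```python
import numpy as np

# five fake "result" images, each 166 pixels tall and 400 wide, 3 channels
results = [np.full((166, 400, 3), v, dtype="uint8") for v in (10, 60, 110, 160, 210)]

# stack them vertically into one 830x400 montage, as the listing does
montage = np.zeros((166 * 5, 400, 3), dtype="uint8")
for j, result in enumerate(results):
    montage[j * 166:(j + 1) * 166, :] = result
```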
So there you have it. An entire image search engine in Python.
Figure: Search Results using Mordor-002.png as a query. Our image search engine is able to return images from Mordor and the Black Gate.
Let’s start at the ending of The Return of the King using Frodo and Sam’s ascent into the
volcano as our query image. As you can see, our top 5 results are from the “Mordor”
category.
Perhaps you are wondering why the query image of Frodo and Sam is also the image in the
#1 result position? Well, let’s think back to our chi-squared distance. We said that an image
would be considered “identical” if the distance between the two feature vectors is zero. Since
we are using images we have already indexed as queries, they are in fact identical and will
have a distance of zero. Since a value of zero indicates perfect similarity, the query image appears
in the #1 result position.
Now, let’s try another image, this time using The Goblin King in Goblin Town:
Figure: Search Results using Goblin-004.png as a query. The top 5 images returned are from Goblin Town.
The Goblin King doesn’t look very happy. But we sure are happy that all five images from
Goblin Town are in the top 10 results.
Finally, here are three more example searches for Dol-Guldur, Rivendell, and The Shire.
Again, we can clearly see that all five images from their respective categories are in the top
10 results.
Figure: Using images from Dol-Guldur (Dol-Guldur-004.png), Rivendell (Rivendell-
003.png), and The Shire (Shire-002.png) as queries.
But clearly, this is not how all image search engines work. Google allows you
to upload an image of your own. TinEye allows you to upload an image of your own. Why
can’t we? Let’s see how we can perform a search using an image that we haven’t already
indexed:
Building an Image Search Engine using Python and OpenCV
Python
 1 # import the necessary packages
 2 from pyimagesearch.rgbhistogram import RGBHistogram
 3 from pyimagesearch.searcher import Searcher
 4 import numpy as np
 5 import argparse
 6 import cPickle
 7 import cv2
 8
 9 # construct the argument parser and parse the arguments
10 ap = argparse.ArgumentParser()
11 ap.add_argument("-d", "--dataset", required = True,
12     help = "Path to the directory that contains the images we just indexed")
13 ap.add_argument("-i", "--index", required = True,
14     help = "Path to where we stored our index")
15 ap.add_argument("-q", "--query", required = True,
16     help = "Path to query image")
17 args = vars(ap.parse_args())
18
19 # load the query image and show it
20 queryImage = cv2.imread(args["query"])
21 cv2.imshow("Query", queryImage)
22 print "query: %s" % (args["query"])
23
24 # describe the query in the same way that we did in
25 # index.py -- a 3D RGB histogram with 8 bins per
26 # channel
27 desc = RGBHistogram([8, 8, 8])
28 queryFeatures = desc.describe(queryImage)
29
30 # load the index and perform the search
31 index = cPickle.loads(open(args["index"]).read())
32 searcher = Searcher(index)
33 results = searcher.search(queryFeatures)
34
35 # initialize the two montages to display our results --
36 # we have a total of 25 images in the index, but let's only
37 # display the top 10 results; 5 images per montage, with
38 # images that are 400x166 pixels
39 montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
40 montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")
41
42 # loop over the top ten results
43 for j in xrange(0, 10):
44     # grab the result (we are using row-major order) and
45     # load the result image
46     (score, imageName) = results[j]
47     path = args["dataset"] + "/%s" % (imageName)
48     result = cv2.imread(path)
49     print "\t%d. %s : %.3f" % (j + 1, imageName, score)
50
51     # check to see if the first montage should be used
52     if j < 5:
53         montageA[j * 166:(j + 1) * 166, :] = result
54
55     # otherwise, the second montage should be used
56     else:
57         montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result
58
59 # show the results
60 cv2.imshow("Results 1-5", montageA)
61 cv2.imshow("Results 6-10", montageB)
62 cv2.waitKey(0)
Lines 2-17: This should feel like pretty standard stuff by now. We are importing our packages
and setting up our argument parser, although you should note the new argument --query. This
is the path to our query image.
Lines 20-21: We’re going to load your query image and show it to you, just in case you
forgot what your query image is.
Lines 27-28: Instantiate our RGBHistogram with the exact same number of bins as during our
indexing step. We then extract features from our query image.
Lines 31-33: Load our index off disk using cPickle and perform the search.
Lines 39-62: Just as in the code above, these lines simply display our search results.
For our external queries we will use two new images that are not in our index: one of
Rivendell and one of The Shire. These two images will be our queries.
Check out the results below:
Figure: Using external Rivendell (Left) and the Shire (Right) query images. For both cases,
we find the top 5 search results are from the same category.
In this case, we searched using two images that we haven’t seen previously. The one on the
left is of Rivendell. We can see from our results that the other 5 Rivendell images in our
index were returned, demonstrating that our image search engine is working properly.
On the right, we have a query image from The Shire. Again, this image is not present in our
index. But when we look at the search results, we can see that the other 5 Shire images were
returned from the image search engine, once again demonstrating that our image search
engine is returning semantically similar images.
Summary
Here we’ve explored how to create an image search engine from start to finish.
The first step was to choose an image descriptor — we used a 3D RGB histogram to
characterize the color of our images. We then indexed each image in our dataset using our
descriptor by extracting feature vectors (i.e. the histograms). From there, we used the
chi-squared distance to define "similarity" between two images. Finally, we glued all the pieces
together and created a Lord of the Rings image search engine.
Having seen this example, we can now see that, despite the differences between text search and image search, the two are correlated.
Most webmasters don't see any difference between image alt text and image title, and mostly
keep them the same. A great discussion over at Google Webmaster Groups provides exhaustive
information on the differences between an image alt attribute and an image title, and standard
recommendations on how to use them.
Alt text is meant to be an alternative information source for those people who have chosen
to disable images in their browsers and those user agents that are simply unable to “see” the
images. It should describe what the image is about and get those visitors interested to see it.
Without alt text, an image will be displayed as an empty icon. In Internet Explorer, alt text also
pops up when you hover over an image.
Plus, Google officially confirmed it mainly focuses on alt text when trying to understand
what an image is about. The image title (the name speaks for itself) should provide
additional information and follow the rules of a regular title: it should be relevant, short,
catchy, and concise (a title "offers advisory information about the element for which it is
set"). In Firefox and Opera it pops up when you hover over an image.
So based on the above, we can discuss how to properly handle them:
Both tags are primarily meant for visitors (though alt text seems more important for
crawlers) – so provide explicit information on an image to encourage views.
Include your main keywords in both, but change them up. Keyword stuffing in alt
text and title is still keyword stuffing, so keep them relevant and meaningful.
Another good point to take into consideration:
According to Aaron Wall, alt text is crucially important when used for a site-
wide header banner.
One of the reasons Aaron Wall was so motivated to change the tagline of his site recently
was that the new site design contained the site's logo as a background image. The logo
link was a regular static link, but it had no anchor text, only a link title to describe the link. If
you do not look at the source code, the link title attribute can seem like an image alt tag when
you scroll over it, but to a search engine they do not look the same. A link title is not
weighted anywhere near as aggressively as an image alt tag is.
The old link title on the header link for his site was "search engine optimization book". While
the site ranks #6 and #8 for that query in Google, neither of the ranking pages is the
homepage (the tools page and sales letter rank). That shows that Google currently places
negligible, if any, weight on link titles.
If the only link to your homepage is a logo, check the source code to verify you are using
descriptive image alt text.
Conclusion & Recommendations
Conclusion:
We conclude that, with the help of the above-mentioned technologies, any company can build
its image through search engines in a way that is also helpful to society.
While nobody can guarantee top level positioning in search engine organic results, proper
search engine optimization can help. Because the search engines, such as Google, Yahoo!,
and Bing, are so important today it is necessary to make each page in a Web site conform to
the principles of good SEO as much as possible.
To do this it is necessary to:
Understand the basics of how search engines rate sites
Use proper keywords and phrases throughout the Web site
Avoid giving the appearance of spamming the search engines
Write all text for real people, not just for search engines
Use well-formed alternate attributes on images
Make sure that the necessary meta tags (and title tag) are installed in the head of each Web page
Have good incoming links to establish popularity
Make sure the Web site is regularly updated so that the content is fresh
Recommendations:
The following recommendations are made on the basis of the overall usage of recommender systems:
Overview
Recommender systems typically produce a list of recommendations in one of two ways –
through collaborative or content-based filtering. Collaborative filtering approaches build
a model from a user's past behaviour (items previously purchased or selected and/or
numerical ratings given to those items) as well as similar decisions made by other users, then
use that model to predict items (or ratings for items) that the user may have an interest in.
Content-based filtering approaches utilize a series of discrete characteristics of an item in
order to recommend additional items with similar properties. These approaches are often
combined.
Main Article
Collaborative filtering:
One approach to the design of recommender systems that has seen wide use is collaborative
filtering. Collaborative filtering methods are based on collecting and analyzing a large
amount of information on users’ behaviors, activities or preferences and predicting what
users will like based on their similarity to other users. A key advantage of the collaborative
filtering approach is that it does not rely on machine analyzable content and therefore it is
capable of accurately recommending complex items such as movies without requiring an
"understanding" of the item itself. Many algorithms have been used in measuring user
similarity or item similarity in recommender systems. For example, the k-nearest neighbour
(k-NN) approach and the Pearson Correlation.
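The Pearson correlation mentioned above measures how similarly two users rate the same items. Here is a small Python 3 sketch (the rating vectors are made-up values over items both users have rated):

```python
import numpy as np

def pearson_similarity(u, v):
    # Pearson correlation between two users' ratings over the same items
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    du, dv = u - u.mean(), v - v.mean()
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom else 0.0

# users who rank items in the same order correlate perfectly
print(pearson_similarity([1, 2, 3, 4], [2, 4, 6, 8]))  # -> 1.0
```

A value of 1.0 means perfect agreement, -1.0 perfect disagreement; a k-NN recommender would use such scores to pick the k most similar neighbours.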
Collaborative Filtering is based on the assumption that people who agreed in the past will
agree in the future, and that they will like similar kinds of items as they liked in the past.
When building a model from a user's profile, a distinction is often made between explicit and
implicit forms of data collection.
Examples of explicit data collection include the following:
Asking a user to rate an item on a sliding scale.
Asking a user to search.
Asking a user to rank a collection of items from favourite to least favourite.
Presenting two items to a user and asking him/her to choose the better one of them.
Asking a user to create a list of items that he/she likes.
Examples of implicit data collection include the following:
Observing the items that a user views in an online store.
Analysing item/user viewing times
Keeping a record of the items that a user purchases online.
Obtaining a list of items that a user has listened to or watched on his/her computer.
Analyzing the user's social network and discovering similar likes and dislikes
The recommender system compares the collected data to similar and dissimilar data collected
from others and calculates a list of recommended items for the user. Several commercial and
non-commercial examples are listed in the article on collaborative filtering systems.
One of the most famous examples of collaborative filtering is item-to-item collaborative
filtering (people who buy x also buy y), an algorithm popularized by Amazon.com's
recommender system.
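The "people who buy x also buy y" idea can be sketched by counting item co-occurrences across purchase baskets (a Python 3 illustration; the baskets and item names are made up):

```python
from collections import Counter
from itertools import combinations

# hypothetical purchase baskets (item names are illustrative only)
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"book", "pen"},
]

# count how often each pair of items was bought together
co = Counter()
for basket in baskets:
    for x, y in combinations(sorted(basket), 2):
        co[(x, y)] += 1

def also_bought(item):
    # items most frequently co-purchased with `item`, best first
    pairs = [(pair, n) for pair, n in co.items() if item in pair]
    pairs.sort(key=lambda p: -p[1])
    return [(a if b == item else b, n) for (a, b), n in pairs]

print(also_bought("book"))  # -> [('lamp', 2), ('pen', 2)]
```

Production systems like Amazon's work at a vastly larger scale with precomputed similarity tables, but the co-occurrence intuition is the same.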
Facebook, MySpace, LinkedIn, and other social networks use collaborative filtering to
recommend new friends, groups, and other social connections (by examining the network of
connections between a user and their friends). Twitter uses many signals and in-memory
computations for recommending who to follow to its users.
Collaborative filtering approaches often suffer from three problems:
Cold Start
Scalability
Sparsity
Cold Start: These systems often require a large amount of existing data on a user in order to
make accurate recommendations.
Scalability: In many of the environments that these systems make recommendations in, there
are millions of users and products. Thus, a large amount of computation power is often
necessary to calculate recommendations.
Sparsity: The number of items sold on major e-commerce sites is extremely large. The most
active users will only have rated a small subset of the overall database. Thus, even the most
popular items have very few ratings. A particular type of collaborative filtering algorithm uses
matrix factorization, a low-rank matrix approximation technique.
Collaborative filtering methods are classified as memory-based and model-based collaborative
filtering. A well-known example of a memory-based approach is the user-based algorithm,
and of a model-based approach the Kernel-Mapping Recommender.
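The low-rank matrix approximation mentioned above can be sketched with a truncated SVD (a Python 3 / NumPy illustration on a tiny made-up rating matrix, not a production factorization):

```python
import numpy as np

# a tiny user-item rating matrix (illustrative values): the first two
# users agree with each other and disagree with the third
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
])

# rank-2 approximation via truncated SVD: R ~= U_k diag(s_k) Vt_k
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# the low-rank reconstruction stays close to the original ratings
print(round(float(np.abs(R - R_hat).max()), 3))  # -> 0.5
```

On a sparse real matrix, the learned low-rank factors fill in the missing entries, which is what makes factorization useful against the sparsity problem.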
Content-based filtering
Another common approach when designing recommender systems is content-based filtering.
Content-based filtering methods are based on a description of the item and a profile of the
user's preferences. In a content-based recommender system, keywords are used to describe
the items; besides, a user profile is built to indicate the type of item this user likes. In other
words, these algorithms try to recommend items that are similar to those that a user liked in
the past (or is examining in the present). In particular, various candidate items are compared
with items previously rated by the user and the best-matching items are recommended. This
approach has its roots in information retrieval and information filtering research.
To abstract the features of the items in the system, an item presentation algorithm is applied.
A widely used algorithm is the tf–idf representation (short for term frequency–inverse
document frequency, a numerical statistic that is intended to reflect how important a word is
to a document in a collection), also called the vector space representation.
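The tf–idf statistic itself is simple to compute; here is a Python 3 sketch over three made-up one-line "documents":

```python
import math

# three tiny "documents" (illustrative text)
docs = [
    "the hobbit walked to mordor",
    "the ring went to mordor",
    "the shire is green",
]

def tf_idf(term, doc, corpus):
    words = doc.split()
    tf = words.count(term) / len(words)          # term frequency
    df = sum(term in d.split() for d in corpus)  # document frequency
    return tf * math.log(len(corpus) / df)       # tf times idf

# a rare word outweighs one that appears in every document
print(tf_idf("mordor", docs[0], docs) > tf_idf("the", docs[0], docs))  # -> True
```

A word such as "the", present in every document, gets an idf of log(1) = 0, so it contributes nothing to the item's feature vector.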
To create a user profile, the system mostly focuses on two types of information:
1) A model of the user's preference.
2) A history of the user's interaction with the recommender system.
Basically, these methods use an item profile (i.e. a set of discrete attributes and features)
characterizing the item within the system. The system creates a content-based profile of users
based on a weighted vector of item features. The weights denote the importance of each
feature to the user and can be computed from individually rated content vectors using a
variety of techniques.
Simple approaches use the average values of the rated item vector while other sophisticated
methods use machine learning techniques such as
Bayesian Classifiers (In machine learning, naive Bayes classifiers are a family of
simple probabilistic classifiers based on applying Bayes' theorem with strong (naive)
independence assumptions between the features.)
Cluster Analysis (is the task of grouping a set of objects in such a way that objects in
the same group (called a cluster) are more similar (in some sense or another) to each
other than to those in other groups (clusters)).
Decision Trees (A decision tree is a decision support tool that uses a tree-like graph
or model of decisions and their possible consequences, including chance event
outcomes, resource costs, and utility.) &
Artificial Neural Networks (In machine learning, artificial neural networks (ANNs)
are a family of statistical learning algorithms inspired by biological neural networks
(the central nervous systems of animals, in particular the brain) and are used to
estimate or approximate functions that can depend on a large number of inputs and
are generally unknown.) in order to estimate the probability that the user is going to like
the item.
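The simple averaging approach mentioned above can be sketched as follows (a Python 3 / NumPy illustration; the item names, feature vectors and ratings are all made up):

```python
import numpy as np

# hypothetical item feature vectors (e.g. genre weights) and one user's ratings
item_features = {
    "film_a": np.array([1.0, 0.0, 1.0]),
    "film_b": np.array([1.0, 1.0, 0.0]),
    "film_c": np.array([0.0, 1.0, 1.0]),
    "film_d": np.array([0.0, 0.0, 1.0]),
}
ratings = {"film_a": 5.0, "film_b": 4.0}

# the simple approach: a rating-weighted average of the rated item vectors
profile = sum(r * item_features[i] for i, r in ratings.items())
profile = profile / sum(ratings.values())

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# recommend the unrated item whose features best match the profile
scores = {i: cosine(profile, v) for i, v in item_features.items()
          if i not in ratings}
print(max(scores, key=scores.get))  # -> film_c
```

The weighted profile vector plays the role described in the text: each component's weight reflects how much the user has liked items carrying that feature.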
Direct feedback from a user, usually in the form of a like or dislike button, can be used to
assign higher or lower weights on the importance of certain attributes.
A key issue with content-based filtering is whether the system is able to learn user
preferences from user's actions regarding one content source and use them across other
content types. When the system is limited to recommending content of the same type as the
user is already using, the value from the recommendation system is significantly less than
when other content types from other services can be recommended. For example,
recommending news articles based on browsing of news is useful, but it's much more useful
when music, videos, products, discussions etc. from different services can be recommended
based on news browsing.
Hybrid Recommender Systems
Recent research has demonstrated that a hybrid approach, combining collaborative filtering
and content-based filtering could be more effective in some cases. Hybrid approaches can be
implemented in several ways: by making content-based and collaborative-based predictions
separately and then combining them; by adding content-based capabilities to a collaborative-
based approach (and vice versa); or by unifying the approaches into one model.
Several studies have empirically compared the performance of hybrid methods with the pure
collaborative and content-based methods and demonstrated that hybrid methods can provide
more accurate recommendations than pure approaches. These methods can also be used to
overcome some of the common problems in recommender systems such as cold start and the
sparsity problem.
Netflix is a good example of hybrid systems. They make recommendations by comparing the
watching and searching habits of similar users (i.e. collaborative filtering) as well as by
offering movies that share characteristics with films that a user has rated highly (content-
based filtering).
A variety of techniques have been proposed as the basis for recommender systems:
collaborative, content-based, knowledge-based, and demographic techniques. Each of these
techniques has known shortcomings, such as the well-known cold-start problem for
collaborative and content-based systems (what to do with new users with few ratings) and the
knowledge engineering bottleneck in knowledge-based approaches (a knowledge base being a
technology used to store complex structured and unstructured information used by a computer system).
A hybrid recommender system is one that combines multiple techniques together to achieve
some synergy between them.
Collaborative: The system generates recommendations using only information about rating
profiles for different users. Collaborative systems locate peer users with a rating history
similar to the current user and generate recommendations using this neighbourhood.
Content-based: The system generates recommendations from two sources: the features
associated with products and the ratings that a user has given them. Content-based
recommenders treat recommendation as a user-specific classification problem and learn a
classifier for the user's likes and dislikes based on product features.
Demographic: A demographic recommender provides recommendations based on a
demographic profile of the user. Recommended products can be produced for different
demographic niches, by combining the ratings of users in those niches.
Knowledge-based: A knowledge-based recommender suggests products based on inferences
about a user’s needs and preferences. This knowledge will sometimes contain explicit
functional knowledge about how certain product features meet user needs.
The term hybrid recommender system is used here to describe any recommender system that
combines multiple recommendation techniques together to produce its output. There is no
reason why several different techniques of the same type could not be hybridized, for
example, two different content-based recommenders could work together, and a number of
projects have investigated this type of hybrid: NewsDude, which uses both naive Bayes and
kNN classifiers in its news recommendations, is just one example.
Seven hybridization techniques:
1) Weighted: The score of different recommendation components are combined
numerically.
2) Switching: The system chooses among recommendation components and applies the
selected one.
3) Mixed: Recommendations from different recommenders are presented together.
4) Feature Combination: Features derived from different knowledge sources are
combined together and given to a single recommendation algorithm.
5) Feature Augmentation: One recommendation technique is used to compute a feature
or set of features, which is then part of the input to the next technique.
6) Cascade: Recommenders are given strict priority, with the lower priority ones
breaking ties in the scoring of the higher ones.
7) Meta-level: One recommendation technique is applied and produces some sort of
model, which is then the input used by the next technique.
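The first of these techniques, the weighted hybrid, can be sketched in a few lines (a Python 3 illustration; the component scores and the 0.6/0.4 weights are made-up values that would normally be tuned):

```python
# scores from two hypothetical recommendation components for three items
collaborative_scores = {"item_1": 0.9, "item_2": 0.4, "item_3": 0.7}
content_scores       = {"item_1": 0.2, "item_2": 0.8, "item_3": 0.6}

W_CF, W_CB = 0.6, 0.4  # assumed weights; in practice these are tuned

# combine the component scores numerically, per the Weighted technique
hybrid = {item: W_CF * collaborative_scores[item] + W_CB * content_scores[item]
          for item in collaborative_scores}

# rank items by the combined score
ranking = sorted(hybrid, key=hybrid.get, reverse=True)
print(ranking)  # -> ['item_3', 'item_1', 'item_2']
```

Note how item_3, which neither component ranks first, wins overall because it scores well on both; that is precisely the synergy a weighted hybrid aims for.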
Beyond Accuracy
Typically, research on recommender systems is concerned with finding the most accurate
recommendation algorithms. However, there are a number of other factors that are also important.
Diversity - Users tend to be more satisfied with recommendations when there is a higher
intra-list diversity, i.e. items from e.g. different artists.
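One simple way to quantify intra-list diversity is the fraction of item pairs in the list that differ (a Python 3 sketch; the artist labels are illustrative):

```python
from itertools import combinations

# artist labels for two hypothetical recommendation lists
recs_a = ["artist_x", "artist_x", "artist_x", "artist_y"]  # mostly one artist
recs_b = ["artist_x", "artist_y", "artist_z", "artist_w"]  # all different

def intra_list_diversity(items):
    # fraction of item pairs that differ (1 = all distinct, 0 = all identical)
    pairs = list(combinations(items, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

print(intra_list_diversity(recs_a))  # -> 0.5
print(intra_list_diversity(recs_b))  # -> 1.0
```

Real systems usually replace the simple equal/unequal test with a pairwise dissimilarity measure over item features, but the averaging idea is the same.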
Recommender Persistence - In some situations it is more effective to re-show
recommendations, or let users re-rate items, than showing new items. There are several
reasons for this. Users may ignore items when they are shown for the first time, for instance,
because they had no time to inspect the recommendations carefully.
Privacy - Recommender systems usually have to deal with privacy concerns because
users have to reveal sensitive information. Building user profiles using collaborative filtering
can be problematic from a privacy point of view. Many European countries have a strong
culture of data privacy and every attempt to introduce any level of user profiling can result in
a negative customer response. A number of privacy issues arose around the dataset offered by
Netflix for the Netflix Prize competition. Although the data sets were anonymised in order to
preserve customer privacy, in 2007, two researchers from the University of Texas were able
to identify individual users by matching the data sets with film ratings on the Internet Movie
Database. As a result, in December 2009, an anonymous Netflix user sued Netflix in Doe v.
Netflix, alleging that Netflix had violated U.S. fair trade laws and the Video Privacy
Protection Act by releasing the datasets. This led in part to the cancellation of a second
Netflix Prize competition in 2010. Much research has been conducted on ongoing privacy
issues in this space. Ramakrishnan et al. have conducted an extensive overview of the
trade-offs between personalization and privacy and found that the combination of weak ties
and other data sources can be used to uncover the identities of users in an anonymised dataset.
User Demographics - Beel et al. found that user demographics may influence how satisfied
users are with recommendations. In their paper they show that elderly users tend to be
more interested in recommendations than younger users.
Robustness - When users can participate in the recommender system, the issue of fraud must
be addressed.
Serendipity - Serendipity is a measure of "how surprising the recommendations are". For
instance, a recommender system that recommends milk to a customer in a grocery store
might be perfectly accurate, but it is still not a good recommendation because milk is an
obvious item for the customer to buy.
Trust - A recommender system is of little value for a user if the user does not trust the
system. Trust can be built by a recommender system by explaining how it generates
recommendations, and why it recommends an item.
Labelling - User satisfaction with recommendations may be influenced by the labeling of the
recommendations. For instance, in the cited study, the click-through rate (CTR, a way of
measuring the success of an online advertising or email campaign by the number of users
who clicked on a specific link) for recommendations labelled as "Sponsored" was lower
(CTR=5.93%) than the CTR for identical recommendations labelled as "Organic"
(CTR=8.86%). Interestingly, recommendations with no label performed best (CTR=9.87%)
in that study.
Mobile Recommender Systems
One growing area of research in the area of recommender systems is mobile recommender
systems. With the increasing ubiquity of internet-accessing smart phones, it is now possible to
offer personalized, context-sensitive recommendations. This is a particularly difficult area of
research, as mobile data is more complex than the data recommender systems often have to
deal with (it is heterogeneous, noisy, requires spatial and temporal auto-correlation, and has
validation and generality problems). Additionally, mobile recommender systems suffer from a
transplantation problem - recommendations may not apply in all regions (for instance, it
would be unwise to recommend a recipe in an area where all of the ingredients may not be
available).
One example of a mobile recommender system is one that offers potentially profitable
driving routes for taxi drivers in a city. This system takes as input data in the form of GPS
traces of the routes that taxi drivers took while working, which include location (latitude and
longitude), time stamps, and operational status (with or without passengers). It then
recommends a list of pickup points along a route that will lead to optimal occupancy times
and profits. This type of system is obviously location-dependent, and as it must operate on a
handheld or embedded device, the computation and energy requirements must remain low.
Another example of mobile recommendation is what (Bouneffouf et al., 2012) developed for
professional users. This system takes as input data the GPS traces of the user and his agenda
to suggest suitable information depending on the user's situation and interests. The system uses
machine learning techniques and a reasoning process in order to dynamically adapt the mobile
system to the evolution of the user's interests. The authors called their algorithm hybrid-ε-
greedy.
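The ε-greedy idea underlying such an algorithm can be sketched as follows (a generic Python 3 illustration of plain ε-greedy, not Bouneffouf et al.'s actual hybrid variant; the reward estimates are made-up values):

```python
import random

def epsilon_greedy(estimates, epsilon=0.1, rng=random):
    # explore a random arm with probability epsilon,
    # otherwise exploit the arm with the highest estimated reward
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=estimates.__getitem__)

random.seed(0)
estimates = [0.2, 0.8, 0.5]  # assumed reward estimates for three items
choices = [epsilon_greedy(estimates) for _ in range(1000)]
print(choices.count(1) / len(choices) > 0.8)  # -> True: mostly the best arm
```

The epsilon parameter controls the exploration/exploitation trade-off that the text describes: most of the time the best-known item is shown, but occasionally an unexplored one is tried.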
Mobile recommendation systems have also been successfully built using the Web of Data as
a source for structured information. A good example of such a system is
SMARTMUSEUM. The system uses semantic modelling, information retrieval and
machine learning techniques in order to recommend content matching the user's interests,
even when the evidence of the user's interests is initially vague and based on heterogeneous
information.
Risk-Aware Recommender Systems
The majority of existing approaches to RS focus on recommending the most relevant
documents to the users using the contextual information and do not take into account the risk
of disturbing the user in specific situations. However, in many applications, such as
recommending personalized content, it is also important to incorporate the risk of upsetting
the user into the recommendation process, in order not to recommend documents to users in
certain circumstances - for instance, during a professional meeting, early in the morning, or late at night.
Therefore, the performance of the RS depends on the degree to which it has incorporated the
risk into the recommendation process.
Risk Definition: "The risk in recommender systems is the possibility to disturb or to upset
the user which leads to a bad answer of the user".
In response to these problems, the authors have developed a dynamic risk-sensitive
recommendation system called DRARS (Dynamic Risk-Aware Recommender System),
which models context-aware recommendation as a bandit problem. This system combines
a content-based technique and a contextual bandit algorithm. They have shown that DRARS
improves on the Upper Confidence Bound (UCB) policy, the best algorithm currently available,
by computing an optimal exploration value to maintain a trade-off between exploration
and exploitation based on the risk level of the current user's situation. The authors conducted
experiments in an industrial context with real data and real users and showed that taking
into account the risk level of users' situations significantly increased the performance of the
recommender systems.
The Netflix Prize
One of the key events that energized research in recommender systems was the Netflix prize.
From 2006 to 2009, Netflix sponsored a competition, offering a grand prize of $1,000,000 to
the team that could take an offered dataset of over 100 million movie ratings and return
recommendations that were 10% more accurate than those offered by the company's existing
recommender system. This competition energized the search for new and more accurate
algorithms. On 21 September 2009, the grand prize of US$1,000,000 was given to the
BellKor's Pragmatic Chaos team using tiebreaking rules.
The most accurate algorithm in 2007 used an ensemble method of 107 different algorithmic
approaches, blended into a single prediction.
Predictive accuracy is substantially improved when blending multiple predictors. Our
experience is that most efforts should be concentrated in deriving substantially different
approaches, rather than refining a single technique. Consequently, our solution is an
ensemble of many methods. Many benefits accrued to the web due to the Netflix project.
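The blending described above amounts to a weighted combination of several predictors' outputs; a minimal Python 3 sketch (the predictions and weights are illustrative values, not Netflix's, and real blend weights would be fit on a hold-out set):

```python
# outputs of three hypothetical predictors for one user-item rating,
# blended with weights that would normally be fit on a hold-out set
predictions = [3.8, 4.2, 3.5]
weights     = [0.5, 0.3, 0.2]

blended = sum(w * p for w, p in zip(weights, predictions))
print(round(blended, 2))  # -> 3.86
```

The winning Netflix solutions used over a hundred such component predictors, but the combination step is conceptually this simple.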
A second contest was planned, but was ultimately cancelled in response to an ongoing lawsuit
and concerns from the Federal Trade Commission. (The Federal Trade Commission (FTC) is
an independent agency of the United States government, established in 1914 by the Federal
Trade Commission Act. Its principal mission is the promotion of consumer protection and the
elimination and prevention of anticompetitive business practices, such as coercive monopoly.)
Multi-criteria Recommender Systems
Multi-Criteria Recommender Systems (MCRS) can be defined as Recommender Systems that
incorporate preference information upon multiple criteria. Instead of developing
Page 72 of 75
WE SCHOOL
recommendation techniques based on a single criterion value (the overall preference of user
u for the item i), these systems try to predict a rating for unexplored items of u by exploiting
preference information on multiple criteria that affect this overall preference value. Several
researchers approach MCRS as a Multi-criteria Decision Making (MCDM) problem, and
apply MCDM methods and techniques to implement MCRS systems.
The limitations of SEO
One important characteristic of an expert and professional consultant is that he or she
understands that the theories and techniques in their field are subject to quite concrete and
specific limitations. In this page, I review the important ones that apply to search engine
optimization and website marketing and their practical significance.
SEO is hand made to order. If you have planned a product launch, a wedding or been a party
to a lawsuit, you know that the best laid plans are seldom executed without some major
changes. Life is simply too complex. SEO is another one of those human activities because,
to be effective, it is hand made for a specific site and business.
There are other important limitations which you need to understand and take into
consideration.
Searching is an evolving process from the point of view of providers (the search engines),
users and website owners. What worked yesterday may not work today and may be counter-
productive or harmful tomorrow. As a result, monitoring or regular checks of the key search
engines and directories are required to maintain a high ranking once it is achieved.
Quality is everything. Since virtually everything we do to improve a site's ranking can be discovered and copied by anyone who knows where to look, innovations tend to be short-lived. Moreover, search engines are always on the lookout for exploits that manipulate their ranking algorithms. The only thing that cannot be copied or exploited is high-quality, valuable content, especially when others link to it for those reasons; it can only be trumped by content of even higher quality and value.
The cost of SEO is rising. More expertise is required than before, and this trend will continue. The techniques employed are more sophisticated, complex and time-consuming. There are fewer worthwhile search engines and directories that offer free listings, paid-placement costs are rising, and the best keywords are expensive.
The search lottery. Search engines index only a fraction of the billions of pages on the Web, for various technological reasons that change over time but will nonetheless mean, for the foreseeable future, that searching is akin to a lottery. SEO improves the odds but cannot remove the uncertainty altogether.
SEO is a marketing exercise, and accordingly the same old business rules apply. You can sell almost anything to someone once, but businesses are built and prosper through repeat customers, to whom reputation, brand and goodwill matter. Quality and value of content are the key; such content remains elusive, expensive and difficult to source, but for websites it is the only basis of effective marketing using SEO techniques.
Suffice it to say that if your site is included in a search engine and you achieve a ranking high enough for your requirements, then these limitations are simply costs of doing business in cyberspace.
Bibliography
The subject of this project, search engines, itself forms much of the bibliography. The project was prepared with the help of the following sites, and with guidance from Mr C.P. Venkatesh in the Digital Marketing Workshop:
www.google.com
www.searchcounsel.com
www.searchenginejournal.com
www.pyimagesearch.com
www.seobook.com