why we need an independent index of the web
![Page 1: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/1.jpg)
Why we need an independent index of the Web
Dirk Lewandowski, http://www.bui.haw-hamburg.de/lewandowski.html, @Dirk_Lew
Society of the Query Conference, Amsterdam, 7/11/2013
![Page 2: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/2.jpg)
The “local copy” of the Web
• Web indexing
– New, changed, and deleted documents
– “Holy grail” of keeping the index complete and current
Risvik, K. M., & Michelsen, R. (2002). Search engines and web dynamics. Computer Networks, 39(3), 289–302.
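The three cases on this slide (new, changed, deleted documents) can be illustrated with a minimal sketch. It assumes the index's copy and a fresh crawl are both available as simple URL-to-content mappings; the function names and the example URLs are hypothetical, and a real crawler would work incrementally rather than comparing full snapshots.

```python
from __future__ import annotations

import hashlib


def fingerprint(content: str) -> str:
    """Return a stable hash of a document's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_snapshots(indexed: dict[str, str], crawled: dict[str, str]) -> dict[str, set[str]]:
    """Compare the index's local copy with a fresh crawl (both URL -> content).

    Both mappings are hypothetical in-memory stand-ins for an index and a crawl.
    """
    indexed_urls, crawled_urls = set(indexed), set(crawled)
    common = indexed_urls & crawled_urls
    return {
        "new": crawled_urls - indexed_urls,        # on the web, not yet in the index
        "deleted": indexed_urls - crawled_urls,    # in the index, no longer on the web
        "changed": {u for u in common
                    if fingerprint(indexed[u]) != fingerprint(crawled[u])},
    }


index_copy = {"a.example/1": "old text", "a.example/2": "same", "a.example/3": "gone soon"}
fresh_crawl = {"a.example/1": "new text", "a.example/2": "same", "a.example/4": "brand new"}
changes = diff_snapshots(index_copy, fresh_crawl)
print(changes)
```

Keeping the index "complete and current" then amounts to applying all three sets to the local copy before the web changes again.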
![Page 3: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/3.jpg)
Representation of documents in a search engine
Referring documents → Document → Metadata (examples)
• Referring documents: anchor text
• Document: heading 1, heading 2
• Metadata
– From the source code: title, description, keywords, author
– From the document (document info): length, date, decay, name of the author
– From the Web: PageRank, number of citations
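One way to picture this representation is as a single record per document that combines the three sources named on the slide. The sketch below is only an illustration: the class name, field names, and the `searchable_text` helper are assumptions, not the schema of any actual search engine.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentRepresentation:
    """Hypothetical per-document record; all field names are illustrative."""
    url: str
    # From the source code
    title: str = ""
    description: str = ""
    keywords: List[str] = field(default_factory=list)
    meta_author: str = ""
    # From the document (document info)
    length: int = 0
    date: Optional[str] = None
    decay: Optional[float] = None  # listed on the slide without a definition
    document_author: str = ""
    # From referring documents
    anchor_texts: List[str] = field(default_factory=list)
    # From the Web
    pagerank: float = 0.0
    citation_count: int = 0

    def searchable_text(self) -> str:
        """Join the textual fields a query could be matched against."""
        parts = [self.title, self.description, *self.keywords, *self.anchor_texts]
        return " ".join(p for p in parts if p)


doc = DocumentRepresentation(
    url="https://example.org/flights",
    title="Flight deals",
    keywords=["flights", "travel"],
    anchor_texts=["cheap flights", "book a flight"],
    citation_count=2,
)
print(doc.searchable_text())
```

Note that the anchor texts come from other pages, so a document can be found for words it never contains itself.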
![Page 4: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/4.jpg)
The User’s Perspective
• Everyone uses search engines (Purcell, Brenner & Rainie, 2012; van Eimeren & Frees, 2012)
• Market is dominated by Google (ComScore data)
• Users rely on
– Google’s method of ordering results
– Google’s method of collecting data
→ If Google hasn’t seen and indexed a document, or hasn’t kept its copy up to date, the document can’t be found with a search query.
![Page 5: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/5.jpg)
Freshness of Web search engines (see Lewandowski, Wahlig & Meyer-Bautor, 2006; Lewandowski, 2008)
Original (as of yesterday) vs. Google’s copy (as of yesterday)
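Freshness can be made concrete by comparing, for each page, the date of its last change with the date the engine last indexed it. The sketch below is an assumption-laden illustration, not the method of the cited studies: the function names, the sample dates, and the rule that a copy taken on the day of the change counts as current are all hypothetical.

```python
from datetime import date
from typing import List, Tuple


def lag_days(last_changed: date, last_indexed: date) -> int:
    """Days by which the last change postdates the last crawl; 0 if the copy is current."""
    return max((last_changed - last_indexed).days, 0)


def freshness_report(pages: List[Tuple[date, date]]) -> Tuple[float, float]:
    """Return (share of current copies, mean lag in days) for (changed, indexed) pairs."""
    lags = [lag_days(changed, indexed) for changed, indexed in pages]
    up_to_date = sum(1 for lag in lags if lag == 0)
    return up_to_date / len(lags), sum(lags) / len(lags)


# Hypothetical sample of (last changed, last indexed) dates
pages = [
    (date(2013, 11, 6), date(2013, 11, 6)),   # copy is current
    (date(2013, 11, 6), date(2013, 11, 2)),   # change came 4 days after the crawl
    (date(2013, 10, 1), date(2013, 11, 1)),   # indexed after the last change
    (date(2013, 11, 5), date(2013, 10, 28)),  # change came 8 days after the crawl
]
share_current, mean_lag = freshness_report(pages)
print(share_current, mean_lag)
```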
![Page 6: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/6.jpg)
What about the alternatives to Google?
• Many “seems to be” search engines
– Accessing the data of another search engine
– Representing nothing more than an alternative user interface to one of the better-known engines
– In many cases, that turns out to be Google
– E.g., in Germany, the major internet portals T-Online, GMX, AOL, and web.de all display results obtained from Google
![Page 7: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/7.jpg)
Why is one search engine not enough?
• We need more than one search engine to ensure that a broad range of opinions is represented in the search market.
• Users should have the choice between different worldviews which originate as a product of algorithm-based search result generation
• Ideology-free search algorithms are simply not possible
![Page 8: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/8.jpg)
Alternative Search Engine Indexes
• There are only a handful of search engines that operate their own indexes, due to costs and technical complexity
• Search engine start-ups
– Use an existing external index
– Focus on a specialised topic (which requires only a small index)
– Aggregate data from different search engines (meta search engine)
• Actual search engine start-ups like Blekko and Duck Duck Go are more the exception than the rule
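The meta search approach mentioned above only sees the ranked result lists of other engines and has to merge them. The slides do not name a merging method; the sketch below uses reciprocal rank fusion, a common technique swapped in purely for illustration. The engine names and result URLs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked URL lists: each list contributes 1 / (k + rank) per URL."""
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical top results returned by two source engines
engine_a = ["x", "y", "z"]
engine_b = ["y", "w", "x"]
merged = reciprocal_rank_fusion([engine_a, engine_b])
print(merged)
```

The merge works only with positions in the source lists. Nothing about the underlying documents is available, which is the limitation of building on someone else's index.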
![Page 9: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/9.jpg)
Partner model
• “Real” search engine providers such as Google and Bing operate their own search engines but also provide their search results to partners
• All the major web portals have now embraced this model.
• Income through ads; revenue-sharing
• Attractiveness of the model
– The search engine provider encounters only minimal costs
– The operator of the portal no longer needs to go to the great expense of running its own search engine.
– The partner index model has served to thin out the competition in the search industry.
![Page 10: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/10.jpg)
Access to Search Engine Indexes
• Application programming interfaces (APIs)
– No direct access to the search engine index
– Limited number of top results which have already been ranked by the search engine provider
– Access via APIs is similar to what is occurring at the meta-search engines
– The representation of the document in the source search engine is also not included
![Page 11: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/11.jpg)
Alternative Search Engines
• What constitutes an “alternative search engine”?
– All search engines that are not Google? (“Google Killers”, e.g., Cuil)
– Some alternatives are not perceived as such because they are considered to be simply the same as Google (e.g., Bing)
– Search engines which explicitly position themselves as an alternative to Google through a regional approach (e.g., Seekport)
– New approaches to search / “Real alternatives”: Alternative approaches to gathering and representing web content
![Page 12: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/12.jpg)
Public Support for Search Engine Technology?
• Quaero/Theseus: Funding a “Google Killer”?
– Quaero: Technologies for multimedia searching.
– Theseus: Semantic technologies for business-to-business applications (without focusing exclusively on search).
• The proposal to provide government funding for search engine technology has been subject to intense criticism in the past
• Establish a single alternative?
• A number of factors could cause it to fail
– Poor marketing
– Graphic design of the user interface
– ...
• Regardless of the reason, a failure of the new search engine would result in the entire publicly funded initiative failing.
![Page 13: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/13.jpg)
Economic perspective
• Only the largest internet companies are able to afford large indexes.
• Microsoft is the only company besides Google to possess a comprehensive search engine index.
• Yahoo gave up on its own index several years ago
• It appears that operating a dedicated index is attractive to practically no one, and there are hardly any candidates with the necessary financial resources in any case
![Page 14: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/14.jpg)
The Solution
• Create the conditions that will make establishing alternative search engines possible
• We can expect that the possibilities an independent index presents would benefit a number of different companies, individuals, and institutions.
• The result will be fair competition to develop the best concepts for using the data provided by the index.
![Page 15: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/15.jpg)
Vision
• “An index of the web that can be accessed at fair conditions for everyone”
– “Everyone” means that anyone who is interested can access the index.
– “Fair conditions” does not mean that access to the index must be free of charge for everyone. A certain number of document requests per day should be available at no cost in order to promote non-profit projects.
– “Access” to the index can be defined as the ability to automatically query the index with ease.
– The concept “index of the web” is intended to cover as much of the web as possible
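The "fair conditions" idea, a certain number of free document requests per day with charges beyond that, can be sketched as a simple per-client daily counter. Everything here is a hypothetical illustration: the class name, the quota size, and the client identifiers are assumptions, and the slides do not specify how such a quota would be enforced.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Tuple


@dataclass
class DailyQuota:
    """Hypothetical tracker for free document requests per client and day."""
    free_requests_per_day: int
    _used: Dict[Tuple[str, date], int] = field(default_factory=dict)

    def is_free(self, client_id: str, today: date) -> bool:
        """Record one request and report whether it falls within the free quota."""
        key = (client_id, today)
        self._used[key] = self._used.get(key, 0) + 1
        return self._used[key] <= self.free_requests_per_day


quota = DailyQuota(free_requests_per_day=2)
day_one = date(2013, 11, 7)
first_day = [quota.is_free("nonprofit-project", day_one) for _ in range(3)]
next_day = quota.is_free("nonprofit-project", date(2013, 11, 8))
print(first_day, next_day)
```

Requests beyond the free quota are not rejected in this sketch; they are only flagged as chargeable, which matches the idea that access stays open to everyone.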
![Page 16: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/16.jpg)
Funding and operation
• Funding
– This type of project cannot be supported by any one country alone. The only feasible option is a pan-European initiative.
• Who would operate the index?
– Existing research institution or newly-founded institution
– The operator of the index should not obtain the exclusive right to determine the way in which the documents are used or made available (→ Board of trustees)
![Page 17: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/17.jpg)
Conclusion: Advantages of an independent index of the web
• Motivate companies, institutions, and developers pursuing personal projects to create their own search applications.
• The data available on the web is so boundless that it lends itself to countless applications in a broad range of fields.
• Enable applications we are not yet capable of even imagining.
• An open structure, transparency with respect to access, and the assurance of permanent availability thanks to state sponsorship would lay the groundwork for innovation.
![Page 18: Why we need an independent index of the Web](https://reader033.vdocument.in/reader033/viewer/2022051612/54c44ab74a795916078b4709/html5/thumbnails/18.jpg)
Thank you
Prof. Dr. Dirk Lewandowski
Hochschule für Angewandte Wissenschaften Hamburg
dirk.lewandowski@haw-hamburg.de
Twitter: Dirk_Lew
http://www.bui.haw-hamburg.de/lewandowski.html
http://www.searchstudies.org