epl 660: lab 1 general info, exercise 1, b-trees, apache lucene … · 2011-01-31 · department of...


EPL 660: Lab 1
General Info, Exercise 1, B-Trees, Apache Lucene

Andreas Kamilaris
University of Cyprus, Department of Computer Science

Created by Andreas Kamilaris for EPL660

Research on the Web of Things

General info

• Every Friday 18:00-19:30.
• Check the course Web site for the schedule.
• Lab content: exercises, general questions, tutorials, tool demonstrations.
• Exercise deadlines: 23:59 on the delivery day.
• Email submission: [email protected]

Tutorials info

• Review of tools for Information Retrieval.
• Every lab session introduces a tool.
• A variety of libraries and tools:
– Apache Lucene
– Apache Solr
– Apache Tika
– Hadoop
– Nutch

Program info

Date | Topic | Description
28/01 | Apache Lucene | Introduction to Apache Lucene; Background Information for B-Trees
4/02 | Apache Lucene | Getting Started with Apache Lucene; Demonstration of a simple scenario
11/02 | Apache Solr | Introduction to Apache Solr; Demonstration of a simple scenario
18/02 | Apache Tika | Introduction to Apache Tika; Demonstration of a simple scenario
25/02 | No Tutorial | Absence of Assistant
4/03 | Hadoop | Background information about MapReduce; Introduction to Hadoop
11/03 | Hadoop | Getting Started with Hadoop; Demonstration of a simple scenario
18/03 | Nutch | Background Information about Crawling; Introduction to Nutch
25/03 | No Tutorial | Public Holiday
01/04 | No Tutorial | Public Holiday
8/04 | Nutch | Getting Started with Nutch
15/04 | Projects' Presentations | Presentation of the students' final project

1st Programming Exercise

• Create a doc-based inverted index.
• Records have the format: term | Frequency | Positional Posting List
• Include stemming using the Porter Stemmer algorithm.
• Include detection of stop-words.
• Search terms using B-Trees.
• The B-Tree must be of order 4.
• Add skip pointers to the inverted index for performance reasons.
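To make the record format concrete, here is a minimal Python sketch of a positional inverted index with stop-word detection. It is an illustration under stated assumptions, not the required solution: "Frequency" is taken to mean the number of documents containing the term, the stop-word list is a tiny made-up sample, and stemming, the B-Tree dictionary and skip pointers are left out. The names `STOP_WORDS`, `tokenize` and `build_index` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build term -> {"frequency": ..., "postings": {doc_id: [positions]}}.

    docs maps a document id to its text. "frequency" is assumed here to be
    the number of documents that contain the term.
    """
    positions = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, token in enumerate(tokenize(docs[doc_id])):
            if token in STOP_WORDS:
                continue  # stop words are skipped but still occupy a position
            positions[token].setdefault(doc_id, []).append(pos)
    return {
        term: {"frequency": len(by_doc), "postings": by_doc}
        for term, by_doc in positions.items()
    }


docs = {1: "The star wars", 2: "Star of the stars"}
index = build_index(docs)
print(index["star"])
```

Without stemming, "star" and "stars" remain separate terms; applying the Porter Stemmer to each token before indexing would merge them.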

1st Programming Exercise

• Deadline is 8th February 2011.
• You need to include:
– Source code with comments.
– Executable files.
– Brief documentation.
• E-mail submission including a zip attachment.

Introduction to B-Trees

• A B-Tree of order m is an m-way tree (a tree where each node may have up to m children) in which:
1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree.
2. all leaves are on the same level.
3. all non-leaf nodes except the root have at least ⌈m / 2⌉ children.
4. the root is either a leaf node, or it has from two to m children.
5. a leaf node contains no more than m – 1 keys.
• B-trees are always balanced!

Why use B-Trees

• Accessing a large amount of data in secondary memory is slow.
• Many algorithms were introduced to make search faster by optimizing access to the required data in secondary memory.
• B-Trees are effective here: each node holds many keys, so a search visits only a few nodes and therefore needs few accesses to secondary memory.
• B-Trees are used in many database management systems.

An example B-Tree

A B-tree of order 5 containing 26 items (each bracket is one node):

Root: [26]
Level 2: [6 12] [42 51 62]
Leaves: [1 2 4] [7 8] [13 15 18 25] [27 29] [45 46 48] [53 55 60] [64 70 90]

Note that all the leaves are at the same level.

Searching a B-Tree

Search for the item #48:

Root: [26]
Level 2: [6 12] [42 51 62]
Leaves: [1 2 4] [7 8] [13 15 18 25] [27 29] [45 46 48] [53 55 60] [64 70 90]

• 48 is greater than 26, so follow the right child of the root to [42 51 62].
• 48 lies between 42 and 51, so follow the second child to the leaf [45 46 48].
• 48 is found in that leaf.
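The search can be sketched in Python as follows. This is a minimal illustration, not code from the lab: the `Node` class and the `search` function are hypothetical names, and the tree is built by hand from the 26-item example. The function returns the keys of every node it visits so the path is visible.

```python
from bisect import bisect_left


class Node:
    """A B-tree node: sorted keys plus child nodes (no children for a leaf)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def search(node, key):
    """Return the list of visited nodes' keys if key is found, else None."""
    path = []
    while node is not None:
        path.append(node.keys)
        # Index of the first key that is >= the search key.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return path
        # Descend into the child between keys[i-1] and keys[i], if any.
        node = node.children[i] if node.children else None
    return None


# The 26-item example tree, built by hand.
root = Node([26], [
    Node([6, 12], [Node([1, 2, 4]), Node([7, 8]), Node([13, 15, 18, 25])]),
    Node([42, 51, 62], [
        Node([27, 29]), Node([45, 46, 48]),
        Node([53, 55, 60]), Node([64, 70, 90]),
    ]),
])

print(search(root, 48))  # visits the root, [42 51 62], then [45 46 48]
```

Binary search inside each node is a design choice; for small nodes a linear scan would do equally well.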

Constructing a B-Tree

• Suppose we start with an empty B-tree and keys arrive in the following order:
1 12 8 2 25 6 14 28 17 7 52 16 48 68 3 26 29 53 55 45
• We want to construct a B-tree of order 5.
• The first four items go into the root: [1 2 8 12]
• Putting the fifth item in the root would violate condition 5.
• Therefore, when 25 arrives, pick the middle key to make a new root.

Constructing a B-Tree

Root: [8]
Leaves: [1 2] [12 25]

6, 14, 28 get added to the leaf nodes:

Root: [8]
Leaves: [1 2 6] [12 14 25 28]

Constructing a B-Tree

Adding 17 to the right leaf node would over-fill it, so we take the middle key, promote it (to the root) and split the leaf:

Root: [8 17]
Leaves: [1 2 6] [12 14] [25 28]

7, 52, 16, 48 get added to the leaf nodes:

Root: [8 17]
Leaves: [1 2 6 7] [12 14 16] [25 28 48 52]

Constructing a B-Tree

Adding 68 causes us to split the rightmost leaf, promoting 48 to the root, and adding 3 causes us to split the leftmost leaf, promoting 3 to the root; 26, 29, 53, 55 then go into the leaves:

Root: [3 8 17 48]
Leaves: [1 2] [6 7] [12 14 16] [25 26 28 29] [52 53 55 68]

Adding 45 causes a split of the leaf [25 26 28 29], and promoting 28 to the root then causes the root to split.

Constructing a B-Tree

Root: [17]
Level 2: [3 8] [28 48]
Leaves: [1 2] [6 7] [12 14 16] [25 26] [29 45] [52 53 55 68]

Guidelines for constructing a B-Tree

1. Attempt to insert the new key into a leaf by searching for the proper position.
2. If the leaf is not full, then insert the key and you are done.
3. If this would result in that leaf becoming too big, split the leaf into two, promoting the middle key to the leaf's parent.
4. If this would result in the parent becoming too big, split the parent into two, promoting the middle key.
5. This strategy might have to be repeated all the way to the top.
6. If necessary, the root is split in two and the middle key is promoted to a new root, making the tree one level higher.
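The guidelines above can be sketched in Python as follows. This is one possible reading, not the lab's reference solution: the class and method names (`Node`, `BTree`, `insert`, `levels`) are hypothetical, keys are assumed to be distinct, and an over-full node is detected after the key has been placed in it.

```python
from bisect import bisect_left, insort


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list for a leaf


class BTree:
    def __init__(self, order):
        self.order = order  # a node holds at most order - 1 keys
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:
            # Guideline 6: the root was split, so the tree grows one level.
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, key):
        """Insert key below node; return (middle key, new right node) on split."""
        if not node.children:
            insort(node.keys, key)  # guidelines 1-2: place the key in the leaf
        else:
            i = bisect_left(node.keys, key)
            split = self._insert(node.children[i], key)
            if split is not None:
                # Guidelines 3-4: accept the key promoted from the child.
                middle, right = split
                node.keys.insert(i, middle)
                node.children.insert(i + 1, right)
        if len(node.keys) < self.order:
            return None  # the node is not over-full
        mid = len(node.keys) // 2
        middle = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return middle, right

    def levels(self):
        """Return the keys of every node, grouped level by level."""
        result, level = [], [self.root]
        while level:
            result.append([n.keys for n in level])
            level = [c for n in level for c in n.children]
        return result


tree = BTree(order=5)
for k in [1, 12, 8, 2, 25, 6, 14, 28, 17, 7, 52, 16, 48, 68, 3, 26, 29, 53, 55, 45]:
    tree.insert(k)
for level in tree.levels():
    print(level)
```

Fed the twenty keys of the worked example, the sketch ends with root [17], internal nodes [3 8] and [28 48], and six leaves.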

Time complexity of a B-Tree

• Search/Insert/Delete all visit at most the nodes on one path from the root to a leaf.
• The number of nodes visited is therefore bounded by the height of the tree.
• The height of the tree is O(log n), where n is the number of items in the B-Tree. Because every non-root internal node has at least ⌈m / 2⌉ children, the base of the logarithm is at least ⌈m / 2⌉.
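As a rough numeric illustration of the logarithmic bound, the Python sketch below computes the largest possible height (counted in edges from root to leaf) of a B-tree of a given order holding n keys. It rests on an assumption the definition above does not state: that every non-root node, leaves included, holds at least t - 1 keys with t = ⌈m / 2⌉, which gives n ≥ 2·t^h - 1. The function name `max_height` is hypothetical.

```python
def max_height(n, order):
    """Largest height h (root-to-leaf edges) with 2 * t**h - 1 <= n.

    Assumes n >= 1 and that every non-root node holds at least t - 1 keys,
    where t = ceil(order / 2).
    """
    t = -(-order // 2)  # integer ceiling of order / 2
    h = 0
    while 2 * t ** (h + 1) - 1 <= n:
        h += 1
    return h


print(max_height(20, 5))         # the 20-key construction example
print(max_height(1_000_000, 100))
```

Under this assumption, a million keys in a tree of order 100 never need more than three edges, that is, four levels.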

Tutorial 1: Apache Lucene Overview

What is Apache Lucene?

“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.”

- from http://lucene.apache.org/

What is Apache Lucene?

• Lucene is specifically an API, not an application.
• The hard parts have been done; the easy programming has been left to you.
• You can build a search application that is specifically suited to your needs.
• You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).

Availability

• Freely available (no cost)
• Open Source
– Apache License, version 2.0: http://www.apache.org/licenses/LICENSE-2.0
– Download from: http://www.apache.org/dyn/closer.cgi/lucene/java/

Features

• Ranked Searching
• Flexible Queries
– Phrases, Wildcards, etc.
• Field-specific Queries
– e.g. title, artist, album
• Sorting

Ranked Searching

1. Phrase Matching
2. Keyword Matching
– Prefer more unique terms first
• takes into account the uniqueness of each term when determining a document's relevance score
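One common way to quantify the "uniqueness" of a term is inverse document frequency (idf): the fewer documents contain a term, the more weight it gets. The Python sketch below shows the textbook form log(N / df). It is an illustration only and is not Lucene's actual scoring formula; the function name and the sample documents are made up.

```python
import math


def inverse_document_frequency(term, documents):
    """Textbook idf: log(N / df), or 0.0 if the term occurs in no document.

    documents is a list of sets of terms.
    """
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0
    return math.log(len(documents) / df)


documents = [{"star", "wars"}, {"star", "trek"}, {"star", "dust"}]
print(inverse_document_frequency("star", documents))  # in every document
print(inverse_document_frequency("wars", documents))  # in one document only
```

A term that appears in every document, like "star" here, contributes nothing to the ranking, while the rarer "wars" contributes more.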

Flexible Queries

• Phrases: "star wars"
• Wildcards: star*, Bra?il
• Ranges: {star TO stun}, [2006 TO 2007]
• Boolean Operators: star AND wars

This is just a small subset of the types of queries that Lucene can support. Some query types such as wildcard and range queries have a potential to cause heavy load on the Lucene server, so Lucene makes it easy to disable certain types of queries while allowing all others to proceed through the system. This gives programmers better control and allows the system performance to be more predictable.

Field-specific Queries

• For example:
title:"star wars" AND director:"George Lucas"

Sorting

• Can sort by any field in a Document
– For example, by Price, Release Date, Amazon Sales Rank, etc.
• By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also supported.

Documents

• A document can represent anything textual:
– Word Document
– DVD (the textual metadata only)
– Website Member (name, ID, etc.)
• A Lucene Document need not refer to an actual file on a disk; it could also correspond to a row in a relational database.

• Each developer is responsible for turning their own data sets into Lucene Documents.

• Lucene comes with a number of 3rd party contributions, including examples for parsing structured data files such as XML documents and Word files.

Indexes

• Lucene employs inverted indexing (like most full-text-based search engines).
• Indexes track term frequencies.
• Every term maps back to the Documents that contain it.
• This index is what allows Lucene to quickly locate every document currently associated with a given set of input search terms.

Basic Indexing

An index consists of one or more Lucene documents.

1. Create a document:
– A document consists of one or more fields; each field is a name-value pair. Example: a field commonly found in applications is title. In the case of a title field, the field name is title and the value is the title of that item.
– Add one or more fields to the document.
2. Add the document to an index:
– Indexing involves adding documents to an IndexWriter.
3. The indexer will analyze the document:
– We can provide specialized analyzers such as StandardAnalyzer.

Analyzing

• Analyzers control how the text is broken into terms, which are then used to index the document.
• Analyzers can be used to remove stop words, and they can also perform stemming.
• Lucene comes with a default analyzer that works well for unstructured English text; however, it often performs incorrect normalizations on non-English texts.

• Lucene makes it easy to build custom Analyzers, and provides a number of helpful building blocks with which to build your own.

• Lucene even includes a number of stemming algorithms for various languages, which can improve document retrieval accuracy when the source language is known at indexing time.

Basic Searching

Searching requires an index to have already been built.

1. Create a Query:
• Usually via QueryParser, which parses user input, or by constructing a Query object such as MultiPhraseQuery directly.
2. Open an Index.
3. Search the Index:
• E.g. via IndexSearcher.
• Use an Analyzer (as before).
4. Iterate through returned Documents:
• Extract out needed results.
• Extract out result scores (if needed).

Lucene as a Web Service

1. Design an HTTP query syntax
– GET queries
– XML for results
2. Wrap Tomcat around core code
• Tomcat is an open source software implementation of the Java Servlet and JavaServer Pages technologies
3. Write a Client Library

Scalability Limits

• 3 main scalability factors:
– Query Rate
– Index Size
– Update Rate

Query Rate Scalability

• Lucene is already fast:
– Built-in simple cache mechanism
• Easy solution for heavy workloads:
– Add more query servers behind a load balancer
– Can grow as your traffic grows

Index Size Scalability

• Can easily handle millions of documents
– Lucene is very commonly deployed in systems with tens of millions of documents.
• Although query performance can degrade as more documents are added to the index, the growth factor is very low.
• The main index-size limits you are likely to run into are disk capacity and disk I/O.

If you need a bigger index:
• Built-in methods allow queries to span multiple remote Lucene indexes
– Can merge multiple remote indexes at query time.

Lucene Installation

1. Download the latest version of Lucene (v3.0.3) from: http://www.apache.org/dyn/closer.cgi/lucene/java/
2. Add the files lucene-core-{version}.jar and lucene-demos-{version}.jar to your Java CLASSPATH.
3. Start programming!
4. (Optional step) Go to the Lucene-{version}/src/demo/org/apache/lucene/demo directory and start editing the files IndexFiles.java and SearchFiles.java.

Useful Info

• Official Apache Lucene site: http://lucene.apache.org/java/docs/
• Lucene-java Wiki: http://wiki.apache.org/lucene-java/FrontPage?action=show&redirect=FrontPageEN
• Lucene Intro (java.net): http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
• Lucene Tutorial.com: http://www.lucenetutorial.com/