
Apache Solr Technical Document


Contents

Requirements
Solution - Solr
    Features
    Typical Solr Setup Diagram
    Basic Solr Concepts
        1. Indexing
        2. How Solr represents data
    Installing Solr
    Starting Solr
    Indexing Data
    Searching
        Faceting
        Highlighting
        Spell Checking
        Relevance
    Shutdown
    Screen Shots
Apache SolrCloud
    Features
    Simple two shard cluster
    Dealing with high volume of data
    Dealing with failure
    Synchronization of data (added/updated in DB) with Solr
    Limitations
    Screen Shots
    Integration with .Net using SolrNet


Requirements

a. Fast and full text search capabilities

b. Optimized handling of high volumes of data and web traffic

c. Highly and linearly scalable on demand

d. Pluggable into any platform

e. Near real time search and indexing

f. Flexible and adaptable with XML, JSON, CSV configuration

Solution - Solr

Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results.

Features

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML, JSON and HTTP

Comprehensive HTML Administration Interfaces

Linearly scalable, auto index replication, auto failover and recovery

Near Real-time indexing

Flexible and Adaptable with XML configuration

Extensible Plugin Architecture

Easily manage multilingual support


Typical Solr Setup Diagram

Figure 1 Typical Solr Setup Diagram

Basic Solr Concepts

In this document, we'll cover the basics of what you need to know about Solr in order to use it.

1. Indexing

Solr is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead.

This is like retrieving pages in a book related to a keyword by scanning the index at the back of a book, as opposed to searching every word of every page of the book.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).

Solr stores this index in a directory called index in the data directory.
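As a purely hypothetical illustration, suppose two documents are indexed: doc1 containing "solr is fast" and doc2 containing "solr search". The inverted index maps each token to the documents that contain it:

fast -> doc1
is -> doc1
search -> doc2
solr -> doc1, doc2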


2. How Solr represents data

In Solr, a Document is the unit of search and index.

An index consists of one or more Documents, and a Document consists of one or more Fields.

Schema

Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml. It is not advisable to change the schema after documents have been added to the index.

The schema declares:

o what kinds of fields there are

o which field should be used as the unique/primary key

o which fields are required

o how to index and search each field
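A minimal sketch of what such a schema.xml might look like is shown below; the field names are illustrative, and the field types (string, text_general) are assumed to be defined in the schema's types section:

<schema name="example" version="1.5">
  <fields>
    <!-- unique/primary key; required on every document -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <!-- a full-text field that is analyzed and searched -->
    <field name="name" type="text_general" indexed="true" stored="true"/>
  </fields>
  <!-- declares which field acts as the unique key -->
  <uniqueKey>id</uniqueKey>
</schema>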

Field Types

In Solr, every field has a type.

Examples of basic field types available in Solr include:

o float

o long

o double

o date

o text

Defining a field

Here's what a field declaration looks like:

<field name="id" type="text" indexed="true" stored="true"multiValued="true"/>

o name: Name of the field

o type: Field type

o indexed: whether this field should be added to the inverted index


o stored: whether the original value of this field should be stored

o multiValued: whether this field can have multiple values

The indexed and stored attributes are important: they determine whether a field can be searched and whether its original value can be returned in search results.

Analysis

When data is added to Solr, it goes through a series of transformations before being added to the index. This is called the analysis phase. Examples of transformations include lower-casing, stemming, etc. The end result of the analysis is a series of tokens which are then added to the index. Tokens, not the original text, are what are searched when you perform a search query.

Indexed fields are fields which undergo an analysis phase, and are added to the index.
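As a rough sketch of how analysis is configured in schema.xml, a text field type chains a tokenizer and filters; the classes below are standard Solr factories, but this particular combination is illustrative rather than copied from the example schema:

<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <!-- split the text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lower-case every token so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>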

Term Storage

When displaying search results to users, they generally expect to see the original document, not the machine-processed tokens. That is the purpose of the stored attribute: it tells Solr to store the original text in the index as well.

Sometimes there are fields which aren't searched, but need to be displayed in the search results. You accomplish that by setting the field attributes to stored=true and indexed=false.
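For example, a display-only field (the field name here is hypothetical) could be declared as:

<field name="summary" type="string" indexed="false" stored="true"/>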

So, why wouldn't you store all the fields all the time?

Because storing fields increases the size of the index, and the larger the index, the slower the search. In terms of physical computing, we'd say that a larger index requires more disk seeks to get to the same amount of data.


Installing Solr

You should have JDK 6 or above installed.

Begin by unzipping the Solr release and changing your working directory to be the "example" directory.

unzip -q apache-solr-4.1.0.zip

cd apache-solr-4.1.0/example/

Starting Solr

Solr comes with an example directory which contains some sample files we can use.

We start this example server with java -jar start.jar.

cd example

java -jar start.jar

You should see something like this in the terminal.

2011-10-02 05:20:27.120:INFO::Logging to STDERR via org.mortbay.log.StdErrLog

2011-10-02 05:20:27.212:INFO::jetty-6.1-SNAPSHOT

....

2011-10-02 05:18:27.645:INFO::Started SocketConnector@0.0.0.0:8983

Solr is now running! You can now access the Solr Admin webapp by loading http://localhost:8983/solr/admin/ in your web browser.

Indexing Data

We're now going to add some sample data to our Solr instance.

The exampledocs folder contains some XML files that we can post to Solr from the command line:

cd exampledocs

java -jar post.jar solr.xml monitor.xml


That produces:

SimplePostTool: POSTing files to http://localhost:8983/solr/update.

SimplePostTool: POSTing file solr.xml

SimplePostTool: POSTing file monitor.xml

SimplePostTool: COMMITting Solr index changes.

This response tells us that the POST operation was successful.

You can also index all of the sample data, using the following command (assuming your command line shell supports the *.xml notation):

cd exampledocs

java -jar post.jar *.xml

Searching

Let's see if we can retrieve the documents we just added by loading the URL below in a browser. Since Solr accepts HTTP requests, you can use your web browser to communicate with Solr: http://localhost:8983/solr/select?q=*:*&wt=json

This returns the following JSON result:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "wt": "json",
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "3007WFP",
        "name": "Dell Widescreen UltraSharp 3007WFP",
        "manu": "Dell, Inc.",
        "includes": "USB cable",
        "weight": 401.6,
        "price": 2199,
        "popularity": 6,
        "inStock": true,
        "store": "43.17614,-90.57341",
        "cat": [
          "electronics",
          "monitor"
        ],
        "features": [
          "30\" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast"
        ]
      }
    ]
  }
}
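Beyond matching all documents, the standard query parameters can shape the results. For example (a sketch using fields from the sample data), the following request searches the name field, returns only a few fields, and sorts by price:

http://localhost:8983/solr/select?q=name:monitor&fl=id,name,price&sort=price+desc&wt=json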

Faceting

Faceting is the arrangement of search results into categories based on indexed terms. Searchers are presented with the indexed terms, along with numerical counts of how many matching documents were found for each term. Faceting makes it easy for users to explore search results, narrowing in on exactly the results they are looking for.
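For example, to facet the sample data on the cat field, add the standard facet parameters to the query:

http://localhost:8983/solr/select?q=*:*&wt=json&rows=0&facet=true&facet.field=cat

The facet_counts section of the response then lists each category value together with its document count.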


Highlighting

Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response. The fragments are included in a special section of the response (the highlighting section), and the client uses the formatting clues also included to determine how to present the snippets to users.
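For example, the following query (using standard highlighting parameters and a field from the sample data) asks Solr to return highlighted fragments of the name field:

http://localhost:8983/solr/select?q=name:monitor&wt=json&hl=true&hl.fl=name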

Spell Checking

The Spellcheck component is designed to provide inline query suggestions based on other, similar, terms.
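As a sketch, assuming a request handler with the spellcheck component enabled (the example solrconfig.xml ships a /spell handler; treat the handler name as an assumption here), a misspelled term can be checked like this:

http://localhost:8983/solr/spell?q=monitr&spellcheck=true&wt=json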

Relevance

Relevance is the degree to which a query response satisfies a user who is searching for information.

The relevance of a query response depends on the context in which the query was performed. A single search application may be used in different contexts by users with different needs and expectations. For example, a search engine of climate data might be used by a university researcher studying long-term climate trends, a farmer interested in calculating the likely date of the last frost of spring, a civil engineer interested in rainfall patterns and the frequency of floods, and a college student planning a vacation to a region and wondering what to pack. Because the motivations of these users vary, the relevance of any particular response to a query will vary as well.

Shutdown

To shut down Solr, from the terminal where you launched Solr, hit Ctrl+C. This will shut down Solr cleanly.

Link: http://lucene.apache.org/solr/3_6_2/doc-files/tutorial.html

http://www.solrtutorial.com/

https://cwiki.apache.org/confluence/display/solr/


Screen Shots

Figure 2 Solr Admin UI-Dashboard Screen

Figure 3 Solr Admin UI-Collection Detail Screen


Figure 4 Solr Admin UI-Query Result Screen

Figure 5 Solr Admin UI-Fetching Data from Database Using DataImportHandler


Figure 6 Solr Admin UI-Schema.xml Screen

Figure 7 Solr Admin UI-SolrConfig.xml Screen


Figure 8 Solr Admin UI-Core Admin Detail Screen

Figure 9 Solr Admin UI-Java Properties Screen


Apache SolrCloud

SolrCloud is the name of a set of new distributed capabilities in Solr. Passing the parameters that enable these capabilities lets you set up a highly available, fault tolerant cluster of Solr servers. Use SolrCloud when you want high scale, fault tolerant, distributed indexing and search capabilities.

Solr embeds and uses Zookeeper as a repository for cluster configuration and coordination - think of it as a distributed filesystem that contains information about all of the Solr servers.

Note: reset all configurations and remove the documents indexed in the tutorial above before going through the cloud features.

Features

Centralized Apache ZooKeeper based configuration

Automated distributed indexing/sharding - send documents to any node and they will be forwarded to the correct shard

Near Real-Time indexing

Transaction log ensures no updates are lost even if the documents are not yet indexed to disk

Automated query failover, index leader election and recovery in case of failure

No single point of failure

Simple two shard cluster

Figure 10 Simple Two Shard Cluster Image


This example simply creates a cluster consisting of two solr servers representing two different shards of a collection.

Since we'll need two solr servers for this example, simply make a copy of the example directory for the second server -- making sure you don't have any data already indexed.

rm -r example/solr/collection1/data/*
cp -r example example2

This command starts up a Solr server and bootstraps a new solr cluster.

cd example
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

-DzkRun causes an embedded zookeeper server to be run as part of this Solr server.

-Dbootstrap_confdir=./solr/collection1/conf causes the local configuration directory ./solr/collection1/conf to be uploaded as the "myconf" config. The name "myconf" is taken from the "collection.configName" parameter below.

-Dcollection.configName=myconf sets the config to use for the new collection.

-DnumShards=2 sets the number of logical partitions we plan on splitting the index into.

Browse to http://localhost:8983/solr/#/~cloud to see the state of the cluster (the zookeeper distributed filesystem).

You can see from the zookeeper browser that the Solr configuration files were uploaded under "myconf", and that a new document collection called "collection1" was created. Under collection1 is a list of shards, the pieces that make up the complete collection.

Now start the second server, pointing it at the cluster - it will automatically be assigned to shard2 because we don't explicitly set the shard id:

cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

-Djetty.port=7574 is just one way to tell the Jetty servlet container to use a different port.


-DzkHost=localhost:9983 points to the Zookeeper ensemble containing the cluster state. In this example we're running a single Zookeeper server embedded in the first Solr server. By default, an embedded Zookeeper server runs at the Solr port plus 1000, so 9983.

If you refresh the zookeeper browser, you should now see both shard1 and shard2 in collection1. View http://localhost:8983/solr/#/~cloud.

Next, index some documents.

cd exampledocs
java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar ipod_video.xml
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar monitor.xml
java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar mem.xml

And now, a request to either server results in a distributed search that covers the entire collection:

http://localhost:8983/solr/collection1/select?q=*:*

If at any point you wish to start over fresh or experiment with different configurations, you can delete all of the cloud state contained within zookeeper by simply deleting the solr/zoo_data directory after shutting down the servers.


Dealing with high volume of data

Solution: If the data volume grows, add more shards (or split existing shards), backed by additional physical memory and storage, in the existing SolrCloud cluster. A sketch of adding a node is shown below.
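As a rough sketch with the example setup above (the directory name and port are illustrative), additional capacity can be added by starting another copy of the example directory against the same ZooKeeper; because numShards is already satisfied, the new node joins as a replica of an existing shard:

cp -r example exampleB
cd exampleB
java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar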

Figure 11 Creating Shard and Replica when volume goes high

Link: http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-from-500000-volumes-5-million-volumes-and-beyond


Dealing with failure

Solution:

a. Failure of Zookeeper: Run ZooKeeper as an ensemble on separate servers (an odd number, typically three, so that a majority is still available if one goes down), because ZooKeeper maintains all the cluster state and configuration information.

b. Failure of a Solr shard: We can create a replica of each shard, so if the node hosting a shard goes down, the replica can serve requests (see the sketch below).
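A minimal sketch, assuming a separate three-node ZooKeeper ensemble is already running on the ports shown (ports here are illustrative), is to point every Solr node at the full ensemble so the cluster keeps working if any single ZooKeeper server or Solr node fails:

java -Djetty.port=7574 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar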

Figure 12 Diagram which handling failure scenario

Link:

https://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble


Synchronization of data (added/updated in DB) with Solr

Solution:

a. We can create a cron job which fetches new/changed data from the database and updates the index in Solr (see the sketch below).

b. Another option is that whenever data is added/updated from the frontend, after inserting/updating the data in the database from the business layer, we can add a piece of code which adds/updates the data in Solr using its update APIs (since we integrate with .Net, we can use the SolrNet library, which provides such add/update APIs).
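As a sketch of option (a), assuming the DataImportHandler is configured with a /dataimport request handler for the collection (handler name and schedule are illustrative), a cron entry can periodically trigger a delta import over HTTP:

# crontab entry: run a delta import every 10 minutes
*/10 * * * * curl -s "http://localhost:8983/solr/collection1/dataimport?command=delta-import&clean=false"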

Link: http://wiki.apache.org/solr/DataImportHandler#Scheduling

http://stackoverflow.com/questions/6463844/how-to-index-data-in-solr-from-database-automatically

Limitations

1. No more than 50 to 100 million documents per node.

2. No more than 250 fields per document.

3. No more than 250K characters per document.

4. No more than 25 faceted fields.

5. No more than 32 nodes in your SolrCloud cluster.

6. Don't return more than 250 results on a query.

A major driving factor for Solr performance is RAM. Solr requires sufficient memory for two separate things: One is the Java heap, the other is "free" memory for the OS disk cache.

It is strongly recommended that Solr runs on a 64-bit Java. A 64-bit Java requires a 64-bit operating system, and a 64-bit operating system requires a 64-bit CPU. There's nothing wrong with 32-bit software or hardware, but a 32-bit Java is limited to a 2GB heap, which can result in artificial limitations that don't exist with a larger heap.
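For example, a hedged sketch of starting the example server with an explicit heap on a 64-bit JVM (the sizes are illustrative and should be tuned to the index size, leaving the remaining RAM to the OS disk cache):

java -Xms512m -Xmx4g -jar start.jar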

Link: http://lucene.472066.n3.nabble.com/Solr-limitations-td4076250.html

https://wiki.apache.org/solr/SolrPerformanceProblems


Screen Shots

Figure 13 Solr Admin UI-Cloud Screen

Figure 14 Solr Admin UI-Zookeeper maintains Cluster State Information that is shown in Tree Screen


Figure 15 Solr Admin UI-Cloud Graph Screen

Figure 16 Solr Admin UI-Cluster Information Screen


Integration with .Net using SolrNet

Solr exposes REST APIs which can be used for interacting with Solr; however, the documents returned as search results need to be deserialized into actual object containers. SolrNet is a .NET library for interacting with Solr. It provides convenient and easy APIs to search, add, and update data in Solr. Further information on SolrNet is available at https://github.com/mausch/SolrNet

Figure 17 Integration with .Net