nosql, apache solr and apache hadoop

NoSQL: Apache SOLR

Apache Hadoop

By Dmitry Kan for NerdCamp, April 23, 2011


Dilbert: expert in NoSQL

•The term NoSQL was coined in 1998 by Carlo Strozzi. Of the current NoSQL movement he notes that it "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." (Wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google

•Data storage: a billion gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data is represented easily (in SQL it takes multiple relational tables)
•Performance: the more data an SQL database holds, the more likely its performance is to degrade

•NoSQL is not:
•… SQL, and it is not relational
•… a replacement for SQL, but a complement to it
•… bound to a fixed schema, and it has no joins
•… designed to "scale up" (vertical scaling, as with an RDBMS); it "scales out" instead, spreading the load over many commodity systems (horizontal scaling)

NoSQL Categories

•Key-value stores: a big hash table with caching mechanisms
•Column family stores: keys point to multiple columns (Google's BigTable)
•Document databases: documents are collections of other key-value collections
•Graph databases: nodes, relationships between nodes, and properties of nodes
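The document model above ("collections of other key-value collections") can be sketched with plain Java maps. This is only an illustration under stated assumptions: `DocumentSketch`, the field names and the example values are hypothetical, and no particular document database stores data exactly this way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DocumentSketch {
    /** Builds a hypothetical blog-post document: a map whose values may be maps or lists. */
    static Map<String, Object> blogPost() {
        Map<String, Object> author = new LinkedHashMap<>();
        author.put("name", "Jon Foobar");
        author.put("weblog", "http://example.org/blog");

        List<String> tags = new ArrayList<>();
        tags.add("nosql");
        tags.add("solr");

        Map<String, Object> post = new LinkedHashMap<>();
        post.put("title", "Hello NoSQL");
        post.put("author", author); // nested key-value collection
        post.put("tags", tags);     // nested list, no join table needed
        return post;
    }

    /** Follows a dotted path such as "author.name" through nested maps; null if absent. */
    @SuppressWarnings("unchecked")
    static Object get(Map<String, Object> doc, String path) {
        Object current = doc;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<String, Object>) current).get(key);
        }
        return current;
    }
}
```

In a relational schema the author and the tags would typically live in separate tables; here they travel inside the document itself.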

Major NoSQL players

•Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage Service)
•Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
•BigTable: Google's proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB

Querying NoSQL DB:

•Data model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query specification for graph DBs

SPARQL Protocol and RDF Query Language (courtesy of about.com and IBM)

Example of retrieving the URL of a blogger:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}

stats!

Some stats from Information Week, via about.com (2010):

•44% of business IT professionals haven't heard of NoSQL
•1% say NoSQL is a strategic direction

Some stats from NerdCamp (April 2011):

•10% have heard of and used NoSQL
•Many more people know about the cloud, which may increasingly become a driving platform behind NoSQL

Does the world of NoSQL have enough mass to appeal to IT now?

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.”

Created by Yonik Seeley at CNET

Features:

•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich document handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML and JSON APIs


Books

http://lucene.apache.org/solr/

http://lucene.apache.org/solr/tutorial.html

http://lucene.apache.org/java/docs/index.html


Companies using SOLR

Drupal

April 2011

Overview of current state

Current version: Apache Solr 3.1 (March 31, 2011)

License: ASL 2.0

Features:

•Faceted navigation
•Hit highlighting
•Geo search: filter and sort by distance
•Spellcheck and auto-suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible

Operating system support: all with a Java VM, including Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants

App-server support: Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™, JBoss™ and many more

Java version requirement: Java JDK 1.5 or later

Client API support: Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP and more

Faceted search

•A technique for refining search results

•Concept composition:

• Article + in English + about nerdcamp

• Finnish rap + < 1 minute + released in 2001

•Types:

• Standard facets (list of facets with values)

• Hierarchical facet values (taxonomy of facet values)

• Range / query facets: by date, by price, by alphabet, by interval
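A faceted request is an ordinary Solr query with facet parameters added. The sketch below builds such a URL with the standard `facet`, `facet.field` and `fq` parameters; the base URL, the class name `FacetQuerySketch` and the field names `language` and `type` are assumptions chosen for illustration, not part of any real schema.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

final class FacetQuerySketch {
    /** URL-encodes one parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds a select URL that asks for facet counts and applies filter queries. */
    static String build(String baseUrl, String query,
                        List<String> facetFields, List<String> filters) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("/select?q=").append(encode(query))
                .append("&facet=true");
        for (String field : facetFields) {
            url.append("&facet.field=").append(encode(field)); // one facet list per field
        }
        for (String filter : filters) {
            url.append("&fq=").append(encode(filter)); // narrows results after a facet is clicked
        }
        return url.toString();
    }
}
```

Each facet value the user selects becomes one more `fq` filter, which is how "Article + in English + about nerdcamp" is composed step by step.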

Spatial Search

Combines location data with text data

•Represent spatial data in the index

•Filter by some spatial concept such as a bounding box or other shape

•Sort by distance

•Score/boost by distance

•Example field values (latitude,longitude):

<field name="store">45.17614,-93.87341</field>
<field name="store">40.7143,-74.006</field>
<field name="store">37.7752,-122.4232</field>

•bbox: bounding box filter (the bbox is a range of latitudes and longitudes that encompasses the circle of radius d)

•geodist: the distance function
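The two spatial concepts above can be sketched in plain Java. This is not Solr's implementation: `GeoSketch` is a hypothetical class, the haversine formula stands in for the distance function, and the bounding box is a simple approximation that ignores the poles and the date line.

```java
final class GeoSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in km between two lat/lon points (haversine formula). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to guard against rounding slightly above 1 for near-antipodal points.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Returns {minLat, minLon, maxLat, maxLon} of a box around the circle of radius dKm. */
    static double[] boundingBox(double lat, double lon, double dKm) {
        double dLat = Math.toDegrees(dKm / EARTH_RADIUS_KM);
        double dLon = Math.toDegrees(dKm / (EARTH_RADIUS_KM * Math.cos(Math.toRadians(lat))));
        return new double[] {lat - dLat, lon - dLon, lat + dLat, lon + dLon};
    }

    /** Cheap range check; this is why a bbox filter is faster than an exact distance filter. */
    static boolean inBox(double[] box, double lat, double lon) {
        return lat >= box[0] && lat <= box[2] && lon >= box[1] && lon <= box[3];
    }
}
```

A typical use is to filter candidates with `inBox` first and compute `haversineKm` only for the survivors, for example to sort stores by distance from the user.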

Hit highlighting

Example from solr admin

Spellcheck and autosuggest

Spellcheck:

•Query suggestion for a misspelled query term

http://localhost:8983/solr/spell?q=hell+ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>dell</str>
      </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion">
        <str>ultrasharp</str>
      </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
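On the client side, the collated suggestion can be pulled out of a response like the one above with the JDK's built-in XML and XPath APIs. This is a minimal sketch: `SpellcheckSketch` is a hypothetical helper, and it assumes the response has already been fetched into a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class SpellcheckSketch {
    /** Returns the collated suggestion from a spellcheck response, or "" if there is none. */
    static String collation(String responseXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(responseXml.getBytes(StandardCharsets.UTF_8)));
            // XPath string evaluation yields "" when no node matches.
            return XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//lst[@name='spellcheck']//str[@name='collation']", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse spellcheck response", e);
        }
    }
}
```

The returned string is what a "Did you mean ...?" link would display and resubmit as the new query.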

Autosuggest:

Example with solr and jquery

Advanced sorting, ranking and searching

•sort=score+asc

•sort=Author+desc,score+desc

•boosting single documents

•Term frequency (tf)
•Inverse document frequency (idf)
•Coordination factor (coord): the more of the query terms match, the greater the score
•Field length (fieldNorm): the shorter the matching field is in number of indexed terms, the greater the document's score

•AND, OR, NOT, NEAR, fuzzy search

•Smashing~0.7 yields more results than just Smashing
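The ranking factors listed above (tf, idf, coord, fieldNorm) can be combined in a small scoring sketch. It follows the shape of Lucene's classic default similarity (tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), fieldNorm as 1 / sqrt(field length)), but it is a simplification: query normalization and boosts are left out, and `ScoringSketch` and its parameters are hypothetical names.

```java
import java.util.Collections;
import java.util.List;

final class ScoringSketch {
    /**
     * Scores one field of one document against the query terms.
     * docFreqs[i] is the number of documents containing queryTerms.get(i).
     */
    static double score(List<String> queryTerms, List<String> fieldTerms,
                        int numDocs, int[] docFreqs) {
        if (queryTerms.isEmpty() || fieldTerms.isEmpty()) {
            return 0.0;
        }
        double fieldNorm = 1.0 / Math.sqrt(fieldTerms.size()); // shorter field, higher norm
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < queryTerms.size(); i++) {
            int freq = Collections.frequency(fieldTerms, queryTerms.get(i));
            if (freq == 0) {
                continue;
            }
            matched++;
            double tf = Math.sqrt(freq);
            double idf = 1.0 + Math.log((double) numDocs / (docFreqs[i] + 1)); // rare terms weigh more
            sum += tf * idf * idf * fieldNorm;
        }
        double coord = (double) matched / queryTerms.size(); // share of query terms that matched
        return coord * sum;
    }
}
```

With this sketch a document matching both of two query terms outranks one matching only a single term, and the same match in a shorter field scores higher than in a longer one.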

Distributed and replicated search

Before doing this:

•Consider vertical scaling (a faster, better machine)
•Rethink the data model (which data goes to which Solr index)
•Turn off logging on updates (and/or searches)
•Redesign your index: make as many fields as possible non-indexed and non-stored, as far as your use cases allow
•Check your Internet connection

Extensibility

Plugins:

•Query parser: extend LuceneQParserPlugin

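public class NerdCampQParserPlugin extends LuceneQParserPlugin {
    public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
        // Custom query handling goes here; this stub delegates to the stock Lucene parser.
        return super.createParser(qstr, localParams, params, req);
    }
}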

SOLR I/O

•Nutch (crawler)

•CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, e.g. PDF), your own format

•Output: XML, JSON, Python, javabin, CSV, …, your own format

SOLR Processing Pipeline

•On each step, a document gets transformed
•Stop word removal
•Stemming
•(Smart) tokenization
•N-grams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
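A few of the pipeline steps above can be sketched as a chain of plain Java transformations. This is an illustration only and does not use Solr's analyzers: `PipelineSketch`, its tiny stop-word list and its deliberately naive stemmer are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class PipelineSketch {
    // Hypothetical stop-word list; real lists are longer and language-specific.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "is"));

    /** Tokenizes on non-alphanumerics, lowercases, drops stop words, then stems. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            tokens.add(stem(token));
        }
        return tokens;
    }

    /** Naive stemmer: strips one trailing "s" from longer words (not a real algorithm). */
    static String stem(String token) {
        boolean plural = token.length() > 3 && token.endsWith("s") && !token.endsWith("ss");
        return plural ? token.substring(0, token.length() - 1) : token;
    }

    /** Letter-level n-grams of a single token, useful for partial matching. */
    static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }
}
```

The same chain has to run at index time and at query time, otherwise the terms in the index and the terms in the query will not line up.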

Solr on the cloud

•Hadoop: MapReduce
•ZooKeeper: at least 3 ZooKeepers to have 1-2 managing your zoo
•Batch indexing, no realtime search yet

Hadoop vital components: Core and API

•MapReduce: computation model
•HDFS
•I/O
•ZooKeeper
•Pig (adds a level of abstraction for processing large datasets)
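The MapReduce computation model can be illustrated with the usual word-count example, written here in plain Java without the Hadoop API. `MapReduceSketch` is a hypothetical single-process stand-in: in Hadoop the map and reduce calls would run on many machines and the grouping step (the shuffle) would move data between them.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceSketch {
    /** Map phase: emits a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce phase: sums all values emitted for one key. */
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    /** Driver: maps every line, groups values by key (the shuffle), then reduces each group. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }
}
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many commodity machines, which is the scale-out idea applied to computation.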

Solr on the cloud

Does it shine? Yes, but not fully

References

[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide. http://bit.ly/fFQOYI
[2] Sarah Pidcock (2011-01-31). "Dynamo: Amazon's Highly Available Key-value Store". WATERLOO. p. 2/22. http://www.cs.uwaterloo.ca/~kdaudjee/courses/cs848/slides/sarah1.pdf. Retrieved 2011-04-05. "Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL. http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
[14] Using Nutch with SOLR, http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/
