Transcript
Page 1: NoSQL, Apache SOLR and Apache Hadoop

NoSQL: Apache SOLR

Apache Hadoop

By Dmitry Kan for NerdCamp, April 23, 2011

Page 2: NoSQL, Apache SOLR and Apache Hadoop

Dilbert: expert in NoSQL

Page 3: NoSQL, Apache SOLR and Apache Hadoop

•The term NoSQL was coined in 1998 by Carlo Strozzi, who notes that the current NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." (Wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google

•Data storage: a billion gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data structures are represented easily (they would need multiple relational tables in SQL)
•Performance: the more data an SQL database holds, the more likely its performance is to degrade

•NoSQL is not:
•… SQL, and not relational
•… a replacement for SQL, but a complement to it
•NoSQL has no fixed schema and no joins
•NoSQL does not "scale up" (vertical scaling, as with an RDBMS), but rather "scales out": horizontal scaling, spreading the load over many commodity systems

Page 4: NoSQL, Apache SOLR and Apache Hadoop

NoSQL Categories

•Key-value stores: a big hash table with caching mechanisms
•Column family stores: keys point to multiple columns (Google’s BigTable)
•Document databases: documents are collections of other key-value collections
•Graph databases: nodes, relationships between nodes, and node properties
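To make the four categories concrete, here is a minimal Python sketch that models each one with plain in-memory structures. All keys, names and values are made up for illustration; real stores add persistence, distribution and caching on top of these shapes.

```python
# Illustrative in-memory models of the four NoSQL categories.
# All keys and values below are hypothetical sample data.

# Key-value store: one opaque value per key.
key_value = {"session:42": "logged-in"}

# Column family store: a row key points to multiple named columns.
column_family = {
    "user:1": {"name": "Alice", "city": "Helsinki"},
    "user:2": {"name": "Bob"},  # rows need not share the same columns
}

# Document database: a document is a nested key-value collection.
document = {
    "_id": "post-1",
    "title": "NoSQL overview",
    "comments": [{"author": "Bob", "text": "Nice"}],
}

# Graph database: nodes, relationships between nodes, and node properties.
nodes = {"alice": {"age": 30}, "bob": {"age": 25}}
edges = [("alice", "FOLLOWS", "bob")]


def followers_of(person):
    """Return the nodes that have a FOLLOWS relationship to `person`."""
    return [src for src, rel, dst in edges if rel == "FOLLOWS" and dst == person]
```

Note that the nested comment list in the document model would need a separate table and a join in a relational schema.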

Major NoSQL players
•Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage Service)
•Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
•BigTable: Google’s proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB

Querying NoSQL DBs:
•Data-model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query specification for graph DBs

Page 5: NoSQL, Apache SOLR and Apache Hadoop

SPARQL: SPARQL Protocol and RDF Query Language (courtesy of about.com and IBM)
Example of retrieving the URL of a blogger

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}
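The two triple patterns in the WHERE clause share the variable ?contributor, so the query is effectively a join over triples. The Python sketch below imitates that evaluation on a small in-memory list. The triples and the helper name weblogs_of are hypothetical stand-ins for the contents of bloggers.rdf, not a real SPARQL engine.

```python
# Toy evaluation of the query's two triple patterns over in-memory triples.
# The triples are hypothetical stand-ins for the contents of bloggers.rdf.
FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
    ("_:a", FOAF + "name", "Jon Foobar"),
    ("_:a", FOAF + "weblog", "http://example.org/jon"),
    ("_:b", FOAF + "name", "Jane Doe"),
    ("_:b", FOAF + "weblog", "http://example.org/jane"),
]


def weblogs_of(name):
    """Bind ?contributor via foaf:name, then read foaf:weblog as ?url."""
    contributors = {s for s, p, o in triples if p == FOAF + "name" and o == name}
    return [o for s, p, o in triples
            if p == FOAF + "weblog" and s in contributors]
```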

stats!

Page 6: NoSQL, Apache SOLR and Apache Hadoop

Some stats from Information Week via about.com (2010):
•44% of business IT professionals haven’t heard of NoSQL
•1% consider NoSQL a strategic direction

Some stats from NerdCamp (April 2011):
•10% have heard of and used NoSQL
•Many more people know about the cloud, which may increasingly become a driving platform behind NoSQL

Does the world of NoSQL have enough mass to appeal to IT now?

Page 7: NoSQL, Apache SOLR and Apache Hadoop

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.”

Created by Yonik Seeley at CNET

Features:
•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich document handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML & JSON APIs

http://lucene.apache.org/solr/
http://lucene.apache.org/solr/tutorial.html
http://lucene.apache.org/java/docs/index.html

Books

Page 8: NoSQL, Apache SOLR and Apache Hadoop

Companies using SOLR

drupal

Page 9: NoSQL, Apache SOLR and Apache Hadoop
Page 10: NoSQL, Apache SOLR and Apache Hadoop

April 2011

Overview of current state

Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Features:
•Faceted navigation
•Hit highlighting
•Geo search: filter and sort by distance
•Spellcheck and auto-suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible

Operating system support: all with a Java VM, including Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants
App-server support: Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™, JBoss™ and many more
Java version requirement: Java JDK 1.5 or later
Client API support: Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++

Page 11: NoSQL, Apache SOLR and Apache Hadoop

Faceted search

•A technique for refining search results

•Concept composition:

• Article + in English + about nerdcamp

• Finnish rap + < 1 minute + released in 2001

•Types:

• Standard facets (list of facets with values)

• Hierarchical facet values (taxonomy of facet values)

• Range / query facets: by date, by price, by alphabet, by interval
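A rough Python sketch of the three ideas above: standard facet counts, refinement by composing constraints, and range facets. The documents, field names and helper functions are hypothetical; Solr computes facets inside the index rather than by scanning lists like this.

```python
from collections import Counter

# Hypothetical documents; field names are made up for illustration.
docs = [
    {"type": "article", "lang": "en", "year": 2011},
    {"type": "article", "lang": "fi", "year": 2001},
    {"type": "song", "lang": "fi", "year": 2001},
    {"type": "article", "lang": "en", "year": 2009},
]


def facet_counts(results, field):
    """Standard facet: count how many results carry each value of `field`."""
    return Counter(doc[field] for doc in results)


def refine(results, **constraints):
    """Concept composition: keep results matching every chosen facet value."""
    return [d for d in results
            if all(d[f] == v for f, v in constraints.items())]


def range_facet(results, field, start, end, gap):
    """Range facet: count results per [lo, lo + gap) interval of `field`."""
    counts = {lo: 0 for lo in range(start, end, gap)}
    for d in results:
        if start <= d[field] < end:
            counts[start + (d[field] - start) // gap * gap] += 1
    return counts
```

For example, refine(docs, type="article", lang="en") narrows the result set the same way "Article + in English" does on the slide, and facet_counts can then be recomputed on the narrowed set.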

Page 12: NoSQL, Apache SOLR and Apache Hadoop

Spatial Search

Combines location data with text data

•Represent spatial data in the index

•Filter by some spatial concept such as a bounding box or other shape

•Sort by distance

•Score/boost by distance

<field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
<field name="store">40.7143,-74.006</field> <!-- NYC store -->
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->

•bbox: bounding box filter (the bbox is a range of latitudes and longitudes that encompasses the circle of radius d)

•geodist: the distance function
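The sketch below shows, in Python, how a bounding-box filter and a distance sort fit together, using the store coordinates from the example above. It assumes the haversine formula and a mean Earth radius of 6371 km. The functions geodist, bbox and search are hypothetical helpers named after the Solr concepts, not Solr's own implementation, and the box calculation assumes the circle does not reach a pole or cross the date line.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

# Store coordinates copied from the <field name="store"> examples.
stores = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}


def geodist(p, q):
    """Great-circle (haversine) distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bbox(center, d_km):
    """Lat/lon ranges that enclose the circle of radius d_km around center."""
    lat, lon = center
    r = d_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat), (lon - dlon, lon + dlon)


def search(center, d_km):
    """Filter stores by bounding box, then sort the survivors by distance."""
    (lat_lo, lat_hi), (lon_lo, lon_hi) = bbox(center, d_km)
    hits = [name for name, (lat, lon) in stores.items()
            if lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi]
    return sorted(hits, key=lambda name: geodist(center, stores[name]))
```

The box is cheap to test with simple range comparisons, which is why it is useful as a first filter; points in the corners of the box can still lie farther away than d, so an exact radius filter would need a geodist check as well.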

Page 13: NoSQL, Apache SOLR and Apache Hadoop

Hit highlighting

Example from solr admin

Page 14: NoSQL, Apache SOLR and Apache Hadoop

Spellcheck and autosuggest

Spellcheck:

•Query suggestion for a misspelled query term

http://localhost:8983/solr/spell?q=hell+ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>dell</str>
      </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion">
        <str>ultrasharp</str>
      </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
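A client typically pulls two things out of this response: the per-term suggestions and the collated query. Here is a minimal Python sketch using the standard library's ElementTree. The function name parse_spellcheck is hypothetical, and the input is the spellcheck section shown above rather than a full Solr response.

```python
import xml.etree.ElementTree as ET

# The spellcheck section of the response shown above.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
"""


def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query or None)."""
    suggestions_node = ET.fromstring(xml_text).find("lst[@name='suggestions']")
    suggestions = {
        term.get("name"): [s.text for s in term.findall("arr[@name='suggestion']/str")]
        for term in suggestions_node.findall("lst")
    }
    collation = suggestions_node.findtext("str[@name='collation']")
    return suggestions, collation
```

The collation is the piece to show the user as a "Did you mean" link, since it is already a complete rewritten query.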

Autosuggest:

Example with solr and jquery

Page 15: NoSQL, Apache SOLR and Apache Hadoop

Advanced sorting, ranking and searching

•sort=score+asc

•sort=Author+desc,score+desc

•boosting single documents

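•Term frequency (tf): the more often a queried term occurs in a document, the greater the score

•Inverse document frequency (idf): the rarer a term is across the index, the more it contributes to the score

•Coordination factor (coord): the more of the queried terms match, the greater the score

•Field length (fieldNorm): the shorter the matching field is in number of indexed terms, the greater the document’s score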

•AND, OR, NOT, NEAR, fuzzy search

•Smashing~0.7 yields more results than just Smashing
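To see how tf, idf, coord and fieldNorm interact, here is a toy scoring function in Python. The formulas are loosely modeled on Lucene's classic TF-IDF similarity (square-root tf, logarithmic idf, inverse-square-root length norm), but this is a simplified sketch: query normalization and boosts are left out, and the function names are hypothetical.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of occurrences."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_terms, num_docs, doc_freqs):
    """Toy score = coord * fieldNorm * sum(tf * idf) over matching query terms."""
    matched = [t for t in query_terms if t in doc_terms]
    coord = len(matched) / len(query_terms)        # share of query terms matched
    field_norm = 1.0 / math.sqrt(len(doc_terms))   # shorter fields score higher
    return coord * field_norm * sum(
        tf(doc_terms.count(t)) * idf(num_docs, doc_freqs[t]) for t in matched
    )
```

With a two-term query, a two-term field that matches both terms outscores a four-term field with the same matches (fieldNorm), and both outscore a field that matches only one term (coord).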

Page 16: NoSQL, Apache SOLR and Apache Hadoop

Distributed and replicated search

Before doing this:
•Consider vertical scaling (a faster and better machine)
•Rethink the data model (what data goes to which Solr index)
•Remove logging on updates (and/or searches)
•Redesign your index: make as many fields as possible non-indexed and non-stored (depending on your use cases)
•Check your Internet connection

Page 17: NoSQL, Apache SOLR and Apache Hadoop

Extensibility

Plugins:

•Query parser: extend LuceneQParserPlugin

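import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;
public class NerdCampQParserPlugin extends LuceneQParserPlugin {
    public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
        // Delegate to the default Lucene parser; custom parsing logic would go here.
        return super.createParser(qstr, localParams, params, req);
    }
}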

Page 18: NoSQL, Apache SOLR and Apache Hadoop

SOLR I/O

•Nutch (crawler)

•CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, such as PDF), your own format

•Output: XML, JSON, Python, javabin, CSV…, your own format
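For the XML input path, documents are sent as an update message of the form <add><doc><field name="...">...</field></doc></add>. The Python sketch below builds such a message from plain dicts. The helper name to_solr_add_xml and the sample fields are hypothetical, and posting the result to a Solr update handler is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize dicts as a Solr XML update message: <add><doc><field/></doc></add>."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for v in values:  # multi-valued fields repeat the <field> element
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")
```

Calling to_solr_add_xml([{"id": "1", "title": "NerdCamp"}]) yields one <doc> with two <field> elements; list values produce one <field> per item.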

Page 19: NoSQL, Apache SOLR and Apache Hadoop

SOLR Processing Pipeline
•On each step, a document gets transformed
•Stop word removal
•Stemming
•(Smart) tokenization
•N-grams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
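The idea of a pipeline in which each step transforms the output of the previous one can be sketched in a few lines of Python. The stop word list, the naive suffix-stripping stemmer and all function names are assumptions made for illustration; Solr configures the equivalent chain declaratively with its own tokenizers and filters.

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}  # tiny illustrative list


def tokenize(text):
    """Split text into word tokens on non-word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    return [re.sub(r"(ing|s)$", "", t) if len(t) > 4 else t for t in tokens]


def char_ngrams(token, n):
    """Letter-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def analyze(text):
    """Run the steps in order; each step transforms the previous output."""
    tokens = tokenize(text)
    for step in (lowercase, remove_stop_words, stem):
        tokens = step(tokens)
    return tokens
```

Order matters here: lowercasing runs before stop word removal so that "The" is caught by the lowercase stop list.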

Page 20: NoSQL, Apache SOLR and Apache Hadoop

Solr on the cloud
Hadoop: MapReduce
ZooKeeper: at least 3 ZooKeepers to have 1-2 managing your Zoo
Batch indexing, no realtime search yet

Hadoop vital components: Core and API

MapReduce: computation model
HDFS
I/O
ZooKeeper
Pig (adds a level of abstraction for processing large datasets)
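As a reminder of what the MapReduce computation model looks like, here is a single-process Python sketch of the classic word count. It only imitates the map, shuffle and reduce phases in memory; the function names are hypothetical and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, shuffle by key, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on different machines, and the shuffle moves data between them; batch index building fits this model because each document can be processed independently.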

Page 21: NoSQL, Apache SOLR and Apache Hadoop

Solr on the cloud
Does it shine? Yes, but not fully

Page 22: NoSQL, Apache SOLR and Apache Hadoop

References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide. Sarah Pidcock (2011-01-31). http://bit.ly/fFQOYI
[2] "Dynamo: Amazon’s Highly Available Key-value Store". http://www.cs.uwaterloo.ca/: WATERLOO. p. 2/22. Retrieved 2011-04-05. "Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL, http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Page 23: NoSQL, Apache SOLR and Apache Hadoop

References
[14] Using Nutch with SOLR, http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/

