NoSQL, Apache SOLR and Apache Hadoop

NoSQL: Apache SOLR, Apache Hadoop. By Dmitry Kan for NerdCamp, April 23, 2011

Post on 20-May-2015







NoSQL (Not Only SQL) is sometimes described as a superset of relational SQL databases and sometimes as a set that merely intersects with them. The concept is still taking shape, but one thing is already clear: NoSQL addresses the task of storing and retrieving large volumes of data in systems under high load. There is another important angle: NoSQL systems can store and efficiently search unstructured or semi-structured data, such as completely raw or preprocessed documents. Using Apache SOLR, a world-class document retrieval system (a performant HTTP wrapper around Apache Lucene), as a reference, we will look at its use cases, horizontal and vertical scalability, faceted search, distribution and load balancing, crawling, extensibility, linguistic support, integration with relational databases and much more. Dmitry Kan will briefly touch upon the *hot* topic of cloud computing using the famous Apache Hadoop project and will help the audience see whether SOLR shines through the cloud.


  • 1. NoSQL: Apache SOLR, Apache Hadoop. By Dmitry Kan for NerdCamp, April 23 2011

2. Dilbert: expert in NoSQL

3. The acronym NoSQL was coined in 1998 (Carlo Strozzi): as the NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately NoREL, or something to that effect." (Wikipedia)
NoSQL = Not Only SQL
Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google
Data storage: billion gigabytes (GB) of data
Interconnected data: hyperlinks, blog pingbacks, social networks
Complex data structure: hierarchical nested data structures are handled easily (they need multiple relational tables in SQL)
Performance: the more data in SQL, the more likely it is to degrade
NoSQL is not: SQL, relational, or a replacement for SQL, but a complement...
There is no fixed schema and no joins...
Does not scale up (RDBMS, vertical scaling), but rather scales out (spreading the load over many commodity systems): horizontal scaling

4. NoSQL Categories
Key-value stores: big hashtable with caching mechanisms
Column family stores: keys point to multiple columns (Google's BigTable)
Document databases: documents are collections of other key-value collections
Graph databases: nodes, relationships between nodes, and node properties
Major NoSQL players
Dynamo: key-value, used in Amazon S3 (Simple Storage Service)
Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
BigTable: Google's proprietary column-oriented DB (App Engine)
CouchDB: open-source document-oriented NoSQL DB (as well as MongoDB)
Neo4j: open-source graph DB
Querying NoSQL DB:
Data model specific
RESTful interfaces or query APIs
SPARQL: declarative query specification for graph DBs

5. Simple Protocol And RDF Query Language (courtesy of and IBM)
Example of retrieving the URL of a blogger
    PREFIX foaf
    SELECT ?url
    FROM
    WHERE {
      ?contributor foaf:name "Jon Foobar" .
      ?contributor foaf:weblog ?url .
    }
stats!
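The two triple patterns in the SPARQL example share the ?contributor variable, so the query is effectively a join: find the subject whose foaf:name matches, then read that subject's foaf:weblog. A minimal Python sketch of that matching logic follows. It assumes a plain in-memory list of triples; the identifiers and URLs in `TRIPLES` and the helper `weblogs_of` are hypothetical and are not part of any SPARQL engine.

```python
# Hypothetical in-memory triples (subject, predicate, object); not real data.
TRIPLES = [
    ("person:1", "foaf:name", "Jon Foobar"),
    ("person:1", "foaf:weblog", "http://example.org/jon/blog"),
    ("person:2", "foaf:name", "Jane Doe"),
    ("person:2", "foaf:weblog", "http://example.org/jane/blog"),
]


def weblogs_of(name, triples):
    """Return the weblog URLs of every contributor whose foaf:name equals `name`."""
    # First pattern: ?contributor foaf:name "<name>"
    contributors = {s for s, p, o in triples if p == "foaf:name" and o == name}
    # Second pattern: ?contributor foaf:weblog ?url, joined on ?contributor
    return [o for s, p, o in triples if p == "foaf:weblog" and s in contributors]


print(weblogs_of("Jon Foobar", TRIPLES))
```

A real graph store would index the triples rather than scan them twice; the scan is only meant to show how the shared variable ties the two patterns together.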
6. Some stats from Information Week (2010):
44% of business IT professionals haven't heard of NoSQL
1%: NoSQL is a strategic direction
Some stats from NerdCamp (April 2011):
10% have heard of and used NoSQL
Many more people know about the cloud, which can increasingly become a driving platform behind NoSQL
Does the world of NoSQL have enough mass to appeal to IT now?

7. Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.
Created by Yonik Seeley at CNET
Features:
Full-text search
Hit highlighting
search (Dynamic clustering)
DB integration
Rich doc handling
Books
Geospatial search
Distributed search
Replication
REST-like HTTP/XML & JSON APIs

8. Companies using SOLR
drupal

9. Overview of current state, April 2011
Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Operating system support: all with a Java VM, including Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants
App-server support: Apache Tomcat, Jetty, Resin, WebLogic, WebSphere, GlassFish, dmServer, JBoss and many more
Java version requirement: Java JDK 1.5 or later
Client API support: Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++
Features:
Faceted navigation
Hit highlighting
GEO search: filter and sort by distance
Spellcheck and auto suggest
Advanced ranking and sorting
Distributed and replicated search
Structured / unstructured search
Rich plugin architecture, extensible

10. Faceted search
A technique for refining search results
Concept composition:
  Article + in English + about nerdcamp
  Finnish rap + < 1 minute + released in 2001
Types:
  Standard facets (list of facets with values)
  Hierarchical facet values (taxonomy of facet values)
  Range / query facets: by date, by price, by alphabet, by interval
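The concept composition and facet types on the faceted search slide can be illustrated without a search server. The sketch below is plain Python over a hypothetical list of documents: it filters by composed concepts, counts standard facet values, and buckets a numeric field into a range facet. The field names (`type`, `lang`, `year`) and the helper functions are assumptions made for illustration; this is not how Solr implements faceting internally.

```python
from collections import Counter

# Hypothetical documents; field names are illustrative only.
DOCS = [
    {"title": "NerdCamp recap", "type": "article", "lang": "en", "year": 2011},
    {"title": "Solr tips", "type": "article", "lang": "en", "year": 2010},
    {"title": "Hadoop guide", "type": "article", "lang": "fi", "year": 2011},
    {"title": "Short rap", "type": "song", "lang": "fi", "year": 2001},
]


def apply_filters(docs, **filters):
    """Concept composition: keep documents whose fields equal every filter value."""
    return [d for d in docs if all(d.get(f) == v for f, v in filters.items())]


def facet_counts(docs, field):
    """Standard facet: count how many documents carry each value of `field`."""
    return Counter(d[field] for d in docs if field in d)


def range_facet(docs, field, start, end, gap):
    """Range facet: count documents per [lo, lo + gap) bucket of a numeric field."""
    counts = {}
    for lo in range(start, end, gap):
        counts[(lo, lo + gap)] = sum(1 for d in docs if lo <= d[field] < lo + gap)
    return counts


articles_en = apply_filters(DOCS, type="article", lang="en")
print(facet_counts(DOCS, "lang"))
print(facet_counts(articles_en, "year"))
print(range_facet(DOCS, "year", 2000, 2012, 4))
```

Each refinement step narrows the document set first and recomputes the counts afterwards, which is what lets a user see how many results remain behind every facet value before clicking it.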
11. Spatial Search
Combines location data with text data
Represent spatial data in the index
Filter by some spatial concept such as a bounding box or other shape
Sort by distance
Score/boost by distance
45.17614,-93.87341
40.7143,-74.006
37.7752,-122.4232
bbox: bounding box filter (bbox is a range of lats and lons that encompasses the circle of radius d)
geodist: the distance function

12. Hit highlighting
Example from solr admin

13. Spellcheck and autosuggest
Spellcheck:
Query suggestion for a misspelled query term
http://localhost:8983/solr/spell?q=hellultrashar&spellcheck=true&spellcheck.collate=true&
Response (markup lost): suggestions "dell" and "ultrasharp", collated as "dell ultrasharp"
Autosuggest:
Example with solr and jquery

14. Advanced sorting, ranking and searching
sort=score+asc
sort=Author+desc,score+desc
Boosting single documents
Term Frequency: tf
Inverse Document Frequency: idf
Co-ordination Factor: coord (the more of the queried terms match, the greater the score)
Field Length: fieldNorm (the shorter the matching field is in number of indexed terms, the greater the document's score)
AND, OR, NOT, NEAR, fuzzy search
Smashing~0.7 yields more results than just Smashing

15. Distributed and replicated search
Before doing this:
Consider vertical scaling (faster and better machine)
Rethink the data model (what data goes to which solr index)
Remove logging on updates (and / or searches)
Redesign your index: make as many fields as possible non-indexed and non-stored (use cases)
Check your Internet connection

16. Extendability
Plugins:
Query parser: extend LuceneQParserPlugin
    public class NerdCampQParserPlugin extends LuceneQParserPlugin {
        @Override
        public QParser createParser(String qstr, SolrParams localParams,
                                    SolrParams params, SolrQueryRequest req) {
            // Placeholder: delegate to the default Lucene parser; custom parsing goes here
            return super.createParser(qstr, localParams, params, req);
        }
    }

17. SOLR I/O
Nutch (crawler)
CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, like pdf), your format
Output: xml, json, python, javabin, csv, your format
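As an illustration of the XML input route on the SOLR I/O slide, here is a small Python sketch that serializes dictionaries into Solr's XML update message, where each document is a `<doc>` of `<field name="...">` elements wrapped in `<add>`. The helper name `to_solr_add_xml` and the sample document are hypothetical, and sending the message to an update handler is deliberately left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize dicts into a Solr XML update message: <add><doc><field/></doc></add>."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes several <field> elements with the same name
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; field names must match whatever the index schema defines.
xml_message = to_solr_add_xml(
    [{"id": "talk-1", "title": "NoSQL at NerdCamp", "tag": ["solr", "hadoop"]}]
)
print(xml_message)
```

Building the message with an XML library rather than string concatenation keeps special characters in field values escaped correctly.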
18. SOLR Processing Pipeline
On each step, a document gets transformed
Stop words removal
Stemming
(smart) Tokenization
Ngrams (letter level and word level)
Regular expressions
Lowercasing
Reversed wildcard
Duplicate removal

19. Solr on the cloud
Hadoop: MapReduce
ZooKeeper: at least 3 ZooKeepers to have 1-2 managing your Zoo
Batch indexing, no realtime search yet
Hadoop vital components:
  Core and API
  MapReduce -- computation model
  HDFS I/O
  ZooKeeper
  Pig (adds level of abstraction for processing large datasets)

20. Solr on the cloud
Does it shine? Yes, but not fully

21. References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, Guide
Sarah Pidcock (2011-01-31).
[2] "Dynamo: Amazon's Highly Available Key-value Store". p. 2/22. Retrieved 2011-04-05. "Dynamo: a highly available and scalable distributed data store"
[3][4][5] (look for SimpleDB)
[6][7][8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL
[9][10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11][12][13]

22. References
[14] Using Nutch with SOLR,
[15][16]