

Flexible Analytics for Large Data Sets Using Elasticsearch

Copyright 2012 TNR Global, LLC
May be reprinted with permission

Christopher Miles

Storing and analyzing large amounts of data are two very different challenges, but what if we could combine them? Traditional databases certainly provide the flexible query tools we need in order to answer our questions, but often our data lacks the uniformity these products require or is simply too large. In this paper we will store a large amount of data in one of the leading NoSQL stores, Elasticsearch, and leverage the powerful indexing tools it provides to generate detailed analytics in response to exploratory, ad-hoc queries.

    Introduction

Big Data promises access to the entirety of our accumulated data and the ability to mine this data for nuggets of golden information that can revolutionize the way we handle our problem space. Unfortunately, much of this analysis needs to be done in batches, and that isn't conducive to the kind of interactive exploration that many of us need.

Tools like HBase, for instance, can handle large amounts of information, but in order to provide metrics over the entire data set we often need to clearly lay out what those metrics are when we construct the layout of our data store; these values need to be calculated as we load in our data. Additional measures that become critical afterwards can only be added after costly re-processing of the relevant portion of our data set, perhaps even our entire data set. Tools like Hadoop enable us to process huge amounts of data quickly and efficiently, but these processing jobs are often run in batches. Depending on the size of the data set and the complexity of processing, a particular job could take anywhere from a few minutes to several weeks to finish.

We are looking for a solution that answers our questions as soon as we ask them: something that can handle very large data sets and provide access to both our most granular pieces of data and aggregated information. We'd like this solution to be flexible enough to answer a wide range of questions, and ideally fast enough that we can engage in something more like a dialogue and less like postal correspondence. While we have yet to find this idealized tool, we have discovered that Elasticsearch can provide a good amount of the speed and flexibility we are seeking.

    Overview

    Before we dive into details, let us outline the various components of our solution.

    Indexing and Data Storage

Elasticsearch provides both an indexing service and a data store. It does not require a detailed schema, although one may be provided if needed. Any data that we can express as JSON can be easily stored and indexed with Elasticsearch. Unlike other search services, we can always retrieve our raw JSON data when we need it (for instance, if we need to re-index our data set).

One of the unique features of Elasticsearch that makes it especially well suited for our purposes is the ease with which we can scale our solution. It provides us with the ability to either add or remove resources (that is, individual machines running Elasticsearch) at any time. We might do this in order to support a growing data set or, perhaps, to satisfy an increasing number of requests and improve the performance of our solution. While other products are looking to add this ability to painlessly scale (Lucid Imagination is working on this functionality in their development builds of Solr), they are not there yet.
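As a quick way to watch this behavior, the cluster health API reports the state of the cluster and its shards as nodes come and go; a minimal check against a local node looks like the following (adjust the host and port for your own cluster):

curl 'http://localhost:9200/_cluster/health?pretty=true'

The response includes the cluster status (green, yellow or red) along with the number of nodes and the counts of active, relocating, initializing and unassigned shards.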

Providing a tutorial on installing and configuring Elasticsearch is outside the scope of this document; however, there is an abundance of information online. We have another white paper available that walks through setting up a cluster suitable for development purposes. In addition, the Elasticsearch project has a very detailed tutorial that describes deployment and setup in Amazon's EC2 environment.


    Storing Raw Data

The next piece of the puzzle is to process our data set and load it into Elasticsearch. Certainly some of us have data that is already in JSON or a JSON-like format, but most of us do not. In any case, we will want to do some level of processing on this data as we load it into our data store. We may want to normalize values or engage in some level of sanity checking.

Hadoop provides an ideal solution to this type of problem. It enables us to store a large data set over many individual machines (a cluster), providing us with low-cost access to a very large amount of storage space. Hadoop also provides us with the ability to harness the processing power of these machines so that we can digest this data as quickly as possible.

While Hadoop will provide the mechanism we need to handle our initial loading of data, there's no need for us to rely on a batch processing mechanism forever. Once we have loaded our existing data, we'll have our crawlers load additional log data as they do their work. We'll see this data ingested into Elasticsearch and reflected in our index in near real time.

In this document we'll be loading log data from our web crawlers. We run web crawls for several clients; in this case we'll be processing the log files that cover a good-sized data set: over 250 million web pages per crawler run. This type of data provides a good example set: there are several fields that we can use to provide aggregate statistics, and there will be many different ways people will want to slice and dice this data set for analysis. Typical log entries look something like the following:

Table 1: Sample Crawl Log Entry

Date        Time      URL                   Response  Index Status
2011-06-25  21:33:03  http://www.acme.com/  200       New
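The exact on-disk layout varies from crawler to crawler; as a rough illustration, a raw log line carrying the entry above, arranged the way the parsing code later in this paper expects it (timestamp, response, index status and URL separated by whitespace), might look like this:

2011-06-25-21:33:03 200 new http://www.acme.com/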

For our initial load of data we'll be collecting these log files from the individual crawler machines and consolidating them onto our Hadoop cluster for processing (approximately 115GB of raw data per crawler run).

    Extracting, Transforming and Loading Data

The last component of our solution involves extracting the data we want from our log files (in our case, pretty much all of it), transforming that data into a format our data store can handle (JSON for Elasticsearch), and then loading this data into our data store. This process of extracting, transforming and loading is commonly referred to as the ETL process or cycle.

Once again we'll be leveraging Hadoop in order to spread the processing of the log data over all of our machines; we want to get the job done as quickly as possible.
While processing data in batches isn't something we want to do often, when it's unavoidable Hadoop provides a great way to perform these types of tasks. As Hadoop processes our data, it will write this data to a RabbitMQ message queue. At the same time, Elasticsearch will consume this data and add it to our index.

After our initial load of log data, we'll want additional information to be loaded as soon as it's produced. We have full control over our crawlers (we use the open-source Heritrix product), so altering the crawler to log this information directly to our message queue will be a painless configuration change. For products that don't provide that level of customization, other solutions exist. We have had reasonable success with both logstash and Flume.

It's important to decouple the process that extracts and transforms the source data from the process that performs the indexing. Certainly we could write directly to Elasticsearch from our ETL process, but our primary issue with that layout is that the speed at which we can extract and transform our data is tied directly to the speed of our indexer. When we're processing large amounts of data we want to complete that process as quickly as possible. The message queue will effectively sever our ETL process from our indexing process, letting each run at its own speed.

Figure 1: Message Queue Separates ETL Process from Indexing (Hadoop Map/Reduce Process → Message Queue → Elasticsearch Index)

Another benefit of breaking these two processes apart is that we can halt one without impacting the other. If we needed to bring our Elasticsearch cluster offline for some reason, data that is being loaded will simply accumulate in the message queue. When the cluster comes back online and indexing resumes, Elasticsearch will consume messages off the queue until it catches up.

In this paper we'll be writing our Hadoop ETL process in Clojure, a Lisp dialect that runs on the Java Virtual Machine (JVM). While we are very comfortable implementing solutions directly in Java, we have found that scripting languages like Clojure let us get our solutions off the ground much faster. Hadoop supports a wide variety of languages via its streaming utility, including both Ruby and Python; you'll want to choose the language that is the best fit for your environment.


Providing a tutorial on installing and configuring Hadoop is outside the scope of this document. Rather, we will be concentrating on configuring Elasticsearch, loading our data with Hadoop and then querying Elasticsearch to get the aggregate data we need for our reporting application. There are many resources available on the internet that cover the installation and configuration of Hadoop, including the Hadoop project itself and Cloudera.

The RabbitMQ project provides great documentation for their product. In addition to detailed installation instructions for the major platforms, they also provide information specific to Amazon's EC2 environment. We recommend setting up a clustered environment; this will make it much easier to add and remove nodes from your cluster and ensure that data is never lost.

    Configuring Elasticsearch

Before we start indexing our data, let's go over some of the more important configuration options for our Elasticsearch cluster. While we certainly can go far with the out-of-the-box settings, there are a couple that bear a closer look. We will not examine every single setting; Elasticsearch is a very flexible product that can be molded to fit a great many use cases. Rather, we'll be looking at the ones that we need to think about in order to get up and running.

    Index Sharding and Cluster Nodes

By default, Elasticsearch will break our index into a set of five shards; each shard represents a chunk of our index that we can potentially migrate to another machine in our cluster. In addition, we can choose to have one or more backup copies of each shard. This is an important feature: it enables us to remove a machine (or node) from our cluster without being concerned about losing any data. By default, Elasticsearch will configure one copy (or replica) of each shard in our index.

Deciding how many shards we'd like to comprise our index represents all of the configuration we need in order to have a resilient index that is easy to scale. Elasticsearch itself will decide which shards live on which servers in order to provide a responsive cluster that can handle the loss of one or more nodes (depending on the size of the cluster). When only one node is available, all of our shards will reside on that one node (obviously this is our least fault-tolerant configuration). As nodes are added to our cluster, Elasticsearch will move both active and replica shards out to these new nodes. Constrained by the size of the cluster, it will do its best to ensure that no one node serves both the active and replica copies of the same shard. When a node is removed from our cluster, Elasticsearch will promote the missing shards' replicas to active shards and will create new replica shards if necessary. We have another paper available that demonstrates this functionality in a development environment.

It's important to take a moment and think about how large your data set may grow and how many nodes you may eventually add to your cluster. The default settings of five shards with one replica per shard will let you scale your cluster up to ten nodes, each node containing one shard; five nodes will be hosting active shards, the other five will be hosting the replicas.

We will be using Amazon EC2's high-memory double extra large instances; each instance will have 34.2GB of memory. Elasticsearch isn't the only thing we'll be running on these machines, so we've allocated 25GB of RAM to Elasticsearch itself, leaving just over 9GB available for other uses. With five machines, that provides us with 125GB of RAM available solely for managing our index. Certain Elasticsearch operations (for instance faceting and sorting) require loading all of the data in the result set into RAM. For instance, if we wanted to facet on the server crawled across the entire data set, Elasticsearch will load all of the data in our host field into memory and then calculate the aggregate values. To be clear, Elasticsearch will only load data from the requested result set into memory. If we wanted to facet on the server for all requests occurring in the month of June, 2011, then only the host field for transactions that fall in June, 2011 will be loaded; a much smaller set of data that requires much less memory to be available.

We will be storing the logs from the last four crawler runs in our index, and we would like the freedom to facet on data across the entire result set. Our data set will be approximately one billion rows, or 460GB of raw data. Each row of data is composed of four distinct items (transaction date and time, the URL crawled, the source server's response and our indexer's response), so one fourth of our data set is 115GB; that in turn would mean that if we were to facet on a field across our entire data set, each node in our five-machine cluster might need to load 23GB of data into RAM.

These numbers are conservative approximations; there are many factors that they do not take into account (the size of each field of data, how much larger a field may become when indexed, etc.). Still, they provide us with some numbers that we can wrap our heads around. Our five-shard index on a five-node cluster looks like it will meet our needs as long as we do not facet on any fields that have been tokenized by our indexer. In this scenario, however, we cannot grow our index capacity further (we can add five more nodes but they will only host replica shards, increasing performance but not the maximum index size). Instead we'll configure Elasticsearch to use ten shards (with one replica each) even though we currently plan on running a five-node cluster. With this configuration we will always have the ability to add more replica nodes to increase our query performance and more primary nodes to increase the maximum capacity of our index.

Once we have begun adding data to our index we will no longer have the option to add or remove shards. The number of shards that we choose now will stay with the index over its entire lifespan. This means that if you decide you want to increase the number of shards after you have added data to your index, you will need to create an entirely new index and migrate the data from the old to the new. A shard represents a piece (or partition) of our index; as more nodes are added, shards are balanced across these machines. Once we have added enough nodes that each one hosts only one shard, we will have reached the maximum size of our cluster.

These settings are configured in the Elasticsearch configuration file, elasticsearch.yml. While you may configure these settings on a per-index basis, it is our recommendation that the settings in the configuration file reflect the largest index you are willing to support. This will prevent someone in your organization from accidentally creating an index that cannot be scaled to take advantage of your cluster's full capacity.
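For the ten-shard, one-replica layout described above, the relevant entries in elasticsearch.yml are the standard index settings; adjust the values to match your own plans:

index.number_of_shards: 10
index.number_of_replicas: 1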

The amount of memory allocated to Elasticsearch can be set in the elasticsearch.in.sh file. A sample is provided with the Elasticsearch installation; you can copy the file to one of the suggested locations and customize it there.
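As a rough sketch, allocating the 25GB heap discussed earlier amounts to a single setting in that file; the exact variable names vary slightly between Elasticsearch versions (older builds use the ES_MIN_MEM and ES_MAX_MEM pair instead):

ES_HEAP_SIZE=25g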

    Ensuring Safe Cluster Restarts with the Gateway Feature

Elasticsearch provides a gateway function that will store the state of your indexes and cluster between full restarts (that is, restarts that involve taking every node offline). As your index and cluster configuration change, Elasticsearch will write these changes to the gateway. After a restart, each node will look to the gateway to restore its state. By default Elasticsearch will use the local gateway, storing this information on the local file system of each node. Unfortunately the local gateway disables caching index data in memory, so it's best to use a gateway that leverages a shared file system.

Since we are already using Hadoop to load our data into our index, we will be using the Hadoop gateway. Note that there are other options available; for instance, the Amazon S3 gateway also presents an attractive option for our setup. You will need to configure these settings in the elasticsearch.yml file.
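As an illustration only, a Hadoop gateway entry in elasticsearch.yml follows the general shape below; the exact keys depend on the version of the Hadoop gateway plugin you install, and the HDFS URI and path here are placeholders:

gateway:
  type: hdfs
  hdfs:
    uri: hdfs://namenode:8020
    path: /elasticsearch/gateway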

Once again, this is a setting that will stay with your index over its entire lifespan. If you decide to change your gateway implementation after adding data to your index, you'll need to configure a new index with the new settings and then migrate your data from the old to the new.

    Elasticsearch Head

While Elasticsearch does not come with a web-based interface out of the box, the Elasticsearch Head project provides this functionality. It can be used to quickly gauge the status of our cluster, and detailed information on each index and node is also kept within easy reach. We can browse through our index or even run queries through the provided interface.


Figure 2: Sample Screenshot of Elasticsearch Head

Installation of the plugin is painless; we recommend installing Elasticsearch Head on every node. In this configuration, any active node may be used to access the web-based interface.
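On the Elasticsearch releases this paper targets, the plugin can be installed from each node's installation directory with a single command (the GitHub coordinates come from the Elasticsearch Head project page):

bin/plugin -install mobz/elasticsearch-head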

    RabbitMQ River

Elasticsearch provides processes that run inside of the Elasticsearch cluster and may pull (or be pushed) data that will then be added to an index; this is called a river. We'll be using the RabbitMQ river; the project also provides several other implementations, including one for CouchDB. You may have as many of these processes as needed. In practice, one node of your cluster will be assigned the task of running and monitoring a particular river; only one instance of each river will run at a time. When the node responsible for a particular river is taken offline, another node will be assigned that river and will take over processing.

    Creating Our Index

With Elasticsearch up and running, the only thing left is to create our index. In line with Elasticsearch's philosophy of reasonable defaults, we could simply skip this step altogether; a new index will be created the first time we attempt to use it to store data. Elasticsearch will attempt to figure out what data type is in each field of our document by looking at the JSON data and, in some cases, it will attempt to parse out the data in text fields. For instance, when a new text field is detected, Elasticsearch will attempt to parse that field as a date; if it succeeds, that field will be treated as a date value (the raw text associated with the JSON document will also be retained). This behavior is configurable, and the documentation goes into greater detail.

We will be providing a mapping for our index up front. We know what our data will look like, and creating the mapping is not arduous. We'll use the command line tool curl for illustration purposes. Depending on your scenario, it may make sense to have your application create missing indexes when it attempts to load in data.

    curl -XPUT http://node1.tnrglobal.com:9200/crawl_log/

We're going to be storing each transaction from our log files (that is, every line from every log file) in our index. The command below adds a mapping to the index we just created, telling Elasticsearch what type of data will be in our transaction documents and how to treat that data. While we are providing a schema for our data, this is not required; Elasticsearch does not demand one. If we provided documents with more or fewer fields, Elasticsearch would not complain.

curl -XPUT http://node1.tnrglobal.com:9200/crawl_log/transaction/_mapping -d '{
  "transaction" : {
    "properties" : {
      "host" : { "type" : "string" },
      "host_raw" : { "type" : "string", "index" : "not_analyzed" },
      "url" : { "type" : "string" },
      "url_hash" : { "type" : "string", "index" : "not_analyzed" },
      "timestamp" : { "type" : "date" },
      "response" : { "type" : "string" },
      "status" : { "type" : "string" }
    }
  }
}'

The host field will contain only the server host name portion of a transaction's URL. In addition to indexing the name of the host, we're also storing the same data in a not_analyzed form; this means that the host_raw field will be searchable but it won't be processed in any other way (i.e. tokenized). We will be able to search the host_raw field for exact host name matches (e.g. www.acme.com) and use the host field for bits and pieces of host names (e.g. www and acme). In addition, if we were to facet over the host_raw field we could categorize each transaction by its host without incurring the overhead of loading tokenized field data into memory.

While we index the url field from the log file, we're also storing a hash of that field. One of the problems with the url field is that we don't have a great deal of control over what it looks like. Some URLs are very lengthy, and that could cause problems if we were to use them when constructing links in our reporting application. For this reason we store a hash of each URL as well; if we need to generate a link for a URL, or if we need to fetch the transactions for a specific URL, we can use the hash instead of the actual URL text.
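For example, pulling back every transaction for one specific URL becomes a simple term query against the un-analyzed hash field; the hash value here is taken from the sample document shown later in this paper:

curl -XPOST http://node1.tnrglobal.com:9200/crawl_log/_search -d '{
  "query" : { "term" : { "url_hash" : "2f5960385dc2bcb15a4fbd9898114b3e" } }
}'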

The only other interesting field is the timestamp field; this will contain the date and time the transaction was logged. When we load in our data we'll be combining the date and time fields from our log entries into this one field. When we provide data for Elasticsearch to index, it is going to try to parse the timestamp value we provide using the Joda Time "date with optional time" format. This will handle most sanely formatted dates; we'll be formatting our dates like so: yyyy-MM-ddTHH:mm:ss.

Lastly, we create the Elasticsearch river process that will monitor our RabbitMQ message queue and add new entries to our index. Again, we're using curl for illustration purposes; another approach may make more sense depending on the scenario.

curl -XPUT http://node1.tnrglobal.com:9200/_river/crawl_queue/_meta -d '{
  "type" : "rabbitmq",
  "rabbitmq" : {
    "host" : "localhost",
    "port" : 5672,
    "user" : "guest",
    "pass" : "guest",
    "vhost" : "/",
    "queue" : "index",
    "routing_key" : "index",
    "exchange_type" : "direct",
    "exchange_durable" : true,
    "queue_durable" : true,
    "queue_auto_delete" : false
  },
  "index" : {
    "bulk_size" : 100,
    "bulk_timeout" : "10ms",
    "ordered" : false
  }
}'

Be sure to create the appropriate queue on the RabbitMQ side as well. From there you can decide whether you want to mirror the queue across the cluster, etc.
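As one convenient option, if the RabbitMQ management plugin is enabled, its rabbitmqadmin tool can declare the durable queue named in the river configuration above:

rabbitmqadmin declare queue name=index durable=true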

    Extract, Transform and Load (ETL)

With our index up and running, we're ready to start loading in data! Our next step will be to create a Hadoop map/reduce job that will read through our source log files and write that data to our RabbitMQ message queue. As items are written to this queue, Elasticsearch will consume those items and add them to our index.

We have found that many organizations that aren't already using a framework (like Hadoop) that leverages the map/reduce approach to data processing tend to be reluctant to adopt these tools. While map/reduce may seem counter-intuitive at first, we feel that it's much easier to grasp with a straightforward example. To that end we are going to go into greater detail on how we use this approach to process our log data.

To give you an idea of what our JSON documents will look like, the command below will insert one transaction into our index.


curl -XPUT node1.tnrglobal.com:9200/crawl_log/transaction/9f419923fe473293b3e2da8a3ead0797 -d '{
  "host" : "jobs.telegraph.co.uk",
  "host_raw" : "jobs.telegraph.co.uk",
  "url" : "http://jobs.telegraph.co.uk/search-results-rss.aspx?discipline=33",
  "url_hash" : "2f5960385dc2bcb15a4fbd9898114b3e",
  "timestamp" : "2012-04-18T17:27:12",
  "response" : "200",
  "status" : "new"
}'

We're using a hash of the URL combined with the date and time of the transaction to generate a unique ID. We do this so that if we have a problem during indexing and need to re-load a particular file, we won't end up with more than one document for any particular log entry.
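The exact hashing scheme doesn't matter much as long as it is deterministic. As an illustration (not necessarily the scheme we use), an MD5 over the URL and timestamp concatenated can be produced from the shell like so:

echo -n 'http://jobs.telegraph.co.uk/search-results-rss.aspx?discipline=332012-04-18T17:27:12' | md5sum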

We'll be using Clojure (a Lisp that runs on the JVM) to implement our mapper and reducer tasks. While we are comfortable using Java for these tasks, we find we can get the same job completed much faster when we use a scripting language like Clojure. The Hadoop Streaming API enables you to use almost any language you like.

    Project Setup

Writing a Hadoop job is not as much of a chore as you might think. Clojure projects typically use the Leiningen tool to manage the build process and dependent libraries. After installing Leiningen (also a straightforward task) we ask it to create a new project for us.

    $ lein new crawl-log-loader

A new project will be created. We can then replace the project definition file (project.clj) with the text below.

(defproject crawl-log-loader "1.0"
  :description "Load Crawler Log data"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [clojure-hadoop "1.4.1"]
                 [com.mefesto/wabbitmq "0.2.0"]
                 [org.clojure/tools.logging "0.2.3"]
                 [org.clojure/tools.cli "0.2.1"]]
  :dev-dependencies [[org.codehaus.jackson/jackson-mapper-asl "1.9.2"]
                     [org.slf4j/jcl104-over-slf4j "1.4.3"]
                     [org.slf4j/slf4j-log4j12 "1.4.3"]]
  :main crawl-log-loader.core)

The Clojure Hadoop library provides the tools we need to implement our job, and WabbitMQ is used to communicate with our RabbitMQ server. We include the Clojure logging and command-line utilities to make our application easier to manage; the remaining three libraries are required for any Hadoop job.

Next we open up our main project file and import the libraries that we'll be using to write our job.

(ns crawl-log-loader.core
  (:use [clojure.tools.logging]
        [clojure.tools.cli]
        [com.mefesto.wabbitmq])
  (:require [clojure.string :as string]
            [clojure-hadoop.gen :as gen]
            [clojure-hadoop.imports :as imp])
  ;; NOTE: the Hadoop job classes used below (Job, Text, LongWritable,
  ;; Path and friends) still need to be imported; the clojure-hadoop.imports
  ;; namespace provides helpers for pulling them in.
  (:import [org.apache.hadoop.util Tool]
           [java.net URL]
           [java.text SimpleDateFormat]
           [org.apache.commons.logging LogFactory]
           [org.apache.commons.logging Log]))

We import the logging library in order to log our progress, then WabbitMQ. The command-line tools are used to parse the arguments passed in from the command line, specifically the location of the source files to process. Next we import the Clojure Hadoop tools and the Java URL class for parsing the logged data. Lastly, we use the Apache Log4j library to perform the actual logging.

Now we implement our job. We're going to do this a little backwards in order to keep things easy to follow. First we'll set up our Tool, the object that will represent our job as a whole. We'll then code up our mapper, reducer and a couple of support functions that we'll need to parse our log data and write to our RabbitMQ queue.


    Tool

Our Tool will handle our command line arguments and then set up the rest of our job.

(defn tool-run
  "Provides the main function needed to bootstrap the Hadoop application."
  [^Tool this args-in]

  ;; define our command line flags and parse out the provided
  ;; arguments
  (let [[options args banner]
        (cli args-in
             ["-h" "--help"
              "Show usage information" :default false :flag true]
             ["-p" "--path" "HDFS path of data to consume"]
             ["-o" "--output" "HDFS path for the output report"])]

    ;; display the help message
    (if (:help options)
      (do (println banner) 0)

      ;; setup and run our job
      (do
        (doto (Job.)
          (.setJarByClass (.getClass this))
          (.setJobName "crawl-log-load")
          (.setOutputKeyClass Text)
          (.setOutputValueClass LongWritable)
          (.setMapperClass (Class/forName "crawl-log-loader.core_mapper"))
          (.setReducerClass (Class/forName "crawl-log-loader.core_reducer"))
          (.setInputFormatClass TextInputFormat)
          (.setOutputFormatClass TextOutputFormat)
          (FileInputFormat/setInputPaths (:path options))
          (FileOutputFormat/setOutputPath (Path. (:output options)))
          (.waitForCompletion true))
        0))))

Parsing command line arguments is a dull but important function of our Tool. As you can see above, we use the CLI library both to set up our arguments and to parse those arguments out into a hash-map. More information on how this library works can be found on the project's web page.

Hadoop wants our application to return a status code that indicates either healthy completion of the job or that we exited under an error condition. We return 0 to indicate that our job exited normally. For this job it's more important to plow through our data set; any log entries that give us a problem will be logged and then we'll move on and try the next entry. We'd rather discard some data than halt a job that's been running for a couple of hours.

If our application isn't invoked with the -h or --help flag, we set up the Hadoop job. We create a new Job object and set several fields. Note that we set the output key and value classes.


The main purpose of this job is to load data into RabbitMQ (and by extension Elasticsearch), but we'll also output the number of transactions per host. This could be used for any number of things; we'll be using it to double-check the data that ends up in our index.

We then set the mapper and reducer classes; we'll write those up next. We also set the input and output formats: the TextInputFormat reads plain text files line by line, a good fit for our log input, and the TextOutputFormat writes out plain text files.

    Mapper

Next we code up our mapping function. Even though we're covering this after the code for the Tool, this would appear above the tool in the actual source code.

(defn mapper-map
  "Provides the function for handling our map task. We parse the data,
  post it to our queue and then write out the host and 1. This output
  is used to provide a summary report that details the number of URLs
  logged per host."
  [this key value ^MapContext context]

  ;; parse the data
  (let [parsed-data (parse-data value)]

    ;; post the data to our queue
    (post-log-data parsed-data)

    ;; write our counter for our reduce task
    (.write context
            (Text. (:host parsed-data))
            (LongWritable. 1))))

This is straightforward: each line of our source log files is provided to our mapping function in the form of a key (the byte offset at which the line was read) and a value (the entry in the log file). For our purposes the key is not useful and can be safely ignored. The entry in the log file, on the other hand, represents our data, and we parse it out into its components with the parse-data function.

Next, we pass our parsed log data over to our post-log-data function. This function will write the log entry to our RabbitMQ queue, where it can be read and loaded by Elasticsearch.

Lastly, we output the host component of the crawled URL along with the number 1. Later on in the process, our reduce function will add together all the values (that is, the 1s) for each URL host; this will comprise our final report detailing how many URLs were crawled for each host.


    Reducer

    Next we code up our reducer function.

(defn reducer-reduce
  "Provides the function for our reduce task. We sum the values for
  each host yielding the number of URLs logged per host."
  [this key values ^ReduceContext context]

  ;; sum the values for each host
  (let [sum (reduce + (map (fn [^LongWritable value]
                             (.get value))
                           values))]

    ;; write out the total
    (.write context key (LongWritable. sum))))

This function is also very simple. We map over the incoming values (Hadoop wraps these values in LongWritable instances), pull out the actual numbers and reduce those values (by adding them together) into our final sum. We write out the key, the name of the host, and the sum, the total number of transactions for this host.

These two functions clearly illustrate the components of the map/reduce cycle. One function maps over all of the incoming data and produces values that are later reduced by the other function. Because these functions are stateless, neither needs any knowledge of the other. This is what lets us spread the processing of this data over all of the nodes in our cluster without having to spend a lot of time and processing power communicating between them.


    Support Functions

We called two support functions from our mapping function; these are detailed here. The first parses out the individual fields of data from our log file entries.

(defn parse-data
  "Parses a String representing a row of data from a crawler log entry
  into a hash-map of data."
  [text]

  ;; parse out the row of data by splitting on whitespace
  (let [data (string/split (str text) #"\s+")]

    (if (< 3 (count data))

      (cond

        (not (= 19 (count (nth data 0))))
        (warn (str "Invalid timestamp for logged row: \"" text "\""))

        (not (< 0 (count (nth data 1))))
        (warn (str "Missing response for logged row: \"" text "\""))

        (not (< 0 (count (nth data 2))))
        (warn (str "Missing status for logged row: \"" text "\""))

        (not (< 0 (count (nth data 3))))
        (warn (str "Missing URL for logged row: \"" text "\""))

        :else
        (try

          ;; parse out our host
          (let [host (parse-host (nth data 3))]

            ;; parse out the date and time
            (try
              (let [timestamp (.parse (SimpleDateFormat. "yyyy-MM-dd-HH:mm:ss")
                                      (nth data 0))]

                {:timestamp (.format (SimpleDateFormat. "yyyy-MM-dd'T'HH:mm:ss")
                                     timestamp)
                 :response (nth data 1)
                 :status (nth data 2)
                 :url (nth data 3)
                 :host host})

              (catch Exception exception
                (warn (str "Couldn't parse timestamp from logged row: \"" text "\"")))))

          (catch Exception exception
            (warn (str "Couldn't parse URL for logged row: \"" text "\"")))))

      (warn (str "Couldn't parse data from logged row: \"" text "\"")))))

First we split our log entry into fields by breaking it apart on whitespace. We double-check to make sure that we have at least the four items of data that we need to continue. Next we check each piece of data and verify that it's at least present. We then attempt to parse the date and time of our entry and assemble a data structure that represents this log entry.

Our other function, post-log-data, simply writes this data to our message queue using the Elasticsearch bulk data format. The Elasticsearch river will then read these messages and apply them to our index.
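For reference, the bulk format pairs a newline-terminated action line with a line of document source; a single-entry message carrying the sample transaction from earlier would look roughly like this:

{ "index" : { "_index" : "crawl_log", "_type" : "transaction", "_id" : "9f419923fe473293b3e2da8a3ead0797" } }
{ "host" : "jobs.telegraph.co.uk", "host_raw" : "jobs.telegraph.co.uk", "url" : "http://jobs.telegraph.co.uk/search-results-rss.aspx?discipline=33", "url_hash" : "2f5960385dc2bcb15a4fbd9898114b3e", "timestamp" : "2012-04-18T17:27:12", "response" : "200", "status" : "new" }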

With our job coded up, we can tell Leiningen to build our project. We want all of the dependent libraries to be included in one complete JAR file in order to keep deployment to our Hadoop cluster easy.

    $ lein uberjar

Once the build completes our JAR file will be created; we can then copy it over to our Hadoop cluster and run our job. We'll tell it where to find our log files on the cluster by passing in the path with the -p flag and where to write out the completed report with the -o flag.

As our job runs we can monitor how quickly Hadoop is processing our data, as well as how fast Elasticsearch is indexing that data, by inspecting the RabbitMQ web-based administration interface. We can check on the indexing side of the process and view the data loaded into Elasticsearch in near real time through Elasticsearch Head.
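Launching the job looks something like the following; the JAR name reflects Leiningen's standalone naming convention and the HDFS paths are placeholders for your own:

$ hadoop jar crawl-log-loader-1.0-standalone.jar -p /data/crawl-logs -o /reports/crawl-log-counts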

    Indexing Log Data in Near Real Time

Once our initial load of historical data has been completed, we don't want to rely on a recurring batch process in order to keep our index up to date. Instead we'll alter our web crawlers to write this data directly to our message queue as crawls are conducted. Once again we will reap the benefit of isolating our processing function from our indexing function; the speed of our crawler will not be impacted by the speed of our indexer. The crawlers will write messages to the queue at their own speed, and Elasticsearch will read and process these messages at its own pace.

You may not have the option to alter the application that is producing your log data. Both logstash and Flume provide solutions here, and they can handle a great many products and use cases. You can delegate to one of these tools the task of writing your data to the message queue as it arrives.

    Analytics and Interactive Reports

With our data loaded and indexed we can now move on to querying that data and, eventually, providing an interactive reporting tool for our customers. Elasticsearch's strong and flexible faceting support makes this an easy task.

    Queries and Faceting

First, we'll author a query that will provide us with the total number of transactions for each type of web server response the crawler encountered. For instance, pages that were loaded without issue will have a 200 code; pages that were missing or had an incorrect URL might have a 404 code.

curl -X POST 'http://node1.tnrglobal.com:9200/crawl_log/_search?pretty=true' -d '{
  "query" : { "query_string" : { "query" : "+_type:transaction" } },
  "facets" : {
    "responses" : { "terms" : { "field" : "response" } }
  },
  "size" : 0
}'

The query is in two parts. The first asks Elasticsearch to return all of the documents in our index that are of the type transaction; this is, in fact, every document in our data set. Next we provide faceting instructions: we request one set of facets called "responses" (we can call each set of facets whatever we like). We then tell Elasticsearch that we'd like a term facet over the document field response.

A term facet returns the number of items that match each term in the provided field. By default the returned data will be ordered by the number of documents; that is, the term with the most matching documents will be the first in the result set. Terms with zero matching documents will not be returned at all (unless the all_terms parameter is set).

The data returned by Elasticsearch is listed below; we've elided the information in the middle to save space.


{
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 254881619,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "responses" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 254881619,
      "other" : 874096,
      "terms" : [ {
        "term" : "200",
        "count" : 208331697
      }, {
        "term" : "302",
        "count" : 21151858
      }, {
        "term" : "304",
        "count" : 6795804
      },
      ...
      {
        "term" : "500",
        "count" : 779267
      }, {
        "term" : "403",
        "count" : 330583
      } ]
    }
  }
}

Elasticsearch tells us the number of shards that responded to our request and whether any failed to respond (if enough nodes were offline we would receive partial results). The hits stanza details the number of documents that matched our query and the score that represents the best match.

Our faceting data is returned in the last stanza. In the terms value, we can see each term and the number of documents that match that term. This is the kind of aggregate data that we will use for our reporting tool; the data above might fit best in a pie chart, for instance.

    In addition to the term facet, Elasticsearch provides several other types of facet queries.


The range facet lets you specify a set of ranges; Elasticsearch will return both the number of documents that fall within each range as well as aggregated data based on a field.

The histogram and date histogram facets return data across intervals. For instance, using the date histogram facet you can facet across a date field and return documents grouped by an interval of time, perhaps by day, week or month.

Query and filter facets provide the number of documents that match a particular query or filter.

Statistical facets allow you to compute statistical data based on the value of a numeric field. This includes computations like the total, sum of squares, mean, minimum, maximum, etc.

Elasticsearch provides a terms_stats facet that allows you to combine a term facet with a statistical facet, faceting over a set of terms in one field while also calculating a statistical value on another.

A geo_distance facet can be used to calculate ranges of distances from a provided geographical location as well as aggregated information (like a total).

We won't detail the use of every type of facet; the Elasticsearch documentation provides examples for each one. However, we will provide an example of using the date histogram facet as it is a good match for our problem space.

The query below will display the number of pages crawled by day.

curl -X POST 'http://node1.tnrglobal.com:9200/crawl_log/_search?pretty=true' -d '{
  "query" : { "query_string" : { "query" : "+_type:transaction" } },
  "size" : 0,
  "facets" : {
    "histogram" : {
      "date_histogram" : { "field" : "timestamp", "interval" : "day" }
    }
  }
}'

Elasticsearch will fetch all of our documents and then bucket them by the day of their timestamps. The returned data will be in order by date, with the earliest date listed first.


{
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 254881619,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "histogram" : {
      "_type" : "date_histogram",
      "entries" : [ {
        "time" : 1308873600000,
        "count" : 111658
      }, {
        "time" : 1308960000000,
        "count" : 10361119
      }, {
        "time" : 1309046400000,
        "count" : 8331424
      }, {
        "time" : 1309132800000,
        "count" : 5845844
      }, {
        "time" : 1309219200000,
        "count" : 4863707
      },
      ...
      ]
    }
  }
}

The time field provided in the returned data contains the time as UTC milliseconds since the epoch (similar to a POSIX time value but with millisecond granularity).
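For example, on a system with GNU date the first bucket's timestamp can be converted back to a calendar date by dropping the milliseconds:

$ date -u -d @1308873600
Fri Jun 24 00:00:00 UTC 2011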

We've been concentrating on facet queries, as these provide one of the main tools you'll need when creating a reporting tool. Elasticsearch is also an accomplished indexing tool and provides a wide range of queries built around its own query language. Under the hood, Elasticsearch leverages the powerful Apache Lucene search engine and provides access to all of the features that Lucene provides.


    Constructing an Interactive Reporting Tool

We won't provide a tutorial on how to create a reporting tool; everyone's needs will vary greatly, as will the tools they might use to build such a solution. Typically you'll want to create a web-based application using a lightweight framework, perhaps something like Flask for Python, Compojure for Clojure or even something implemented in PHP. Whatever language you have the most experience with will be the best choice.

Instead we'll provide an overview of an application that we've created to meet this need. We ended up implementing our solution in Python with the Flask framework and used the Twitter Bootstrap template to rapidly develop our solution. The pie chart depicting the response codes encountered by the crawler is pictured below.

Figure 3: Response Chart

Using the date histogram facet, it was simple and painless to offer a bar chart depicting the transactions by date.

Figure 4: Transactions by Day

As the customer drills down into data by host, we provide the more traditional table-based detail. In addition to listing the transactions for the selected host, we can also provide the same pie and bar charts by faceting on the results for the selected host. Pictured below is a traditional table of matching items.

Figure 5: Host-Specific Transaction Table

Below our table of transactions, we provide colorful charts that clearly illustrate what the listed transactions have in common, in this case highlighting the fact that the vast majority of URLs for this particular site were crawled without issue.

Figure 6: Host-Specific Response Chart


    Conclusion

It is clear that Elasticsearch can gracefully handle the large data sets that are typically associated with Big Data. By planning ahead for your maximum estimated storage load, you can make some smart choices that will ensure your index can keep pace with your growing data set. The faceting functions provided by Elasticsearch allow you to query and explore your data set as you collect it, in stark contrast to tools like HBase where it's necessary to plan out what aggregate values you'd like to track and then calculate those values at load time. Once your initial loading of data is completed, you can feed data to your index as you collect it, ensuring that your customers can make use of this information as quickly as possible.

This is not to say that there is no place for tools like HBase! The faceting features provided by Elasticsearch are very flexible but cannot cover every possible scenario. They can, however, allow for the wide range of ad-hoc querying that we all need in order to actively explore a large data set.

    TNR Global, LLC

    245 Russell Street

    Suite 10

    Hadley, MA 01035

    [email protected]

For updates to information in this paper and other search technologies we work with, visit: http://www.tnrglobal.com/blog
