
Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

Search and Time Series Databases

Course: Sistemi e Architetture per Big Data (Systems and Architectures for Big Data), academic year 2016/17

Valeria Cardellini

The reference Big Data stack

Valeria Cardellini - SABD 2016/17


[Figure: the reference Big Data stack, with layers Resource Management, Data Storage, Data Processing and High-level Interfaces, plus Support / Integration]

Why search platforms?

•  How to find documents that match queries?
    –  With text search, faster than with RDBMSs
•  How to obtain specific features?
    –  Such as highlighting, spatial search, suggestions, guided navigation, …



Search engines

•  Most popular search platforms:
    –  Apache Solr
    –  Elasticsearch

•  ETL process



Apache Solr

•  Scalable, highly reliable and open-source framework for searching data

•  Built on Apache Lucene
    –  Open-source library for indexing and search
    –  Used by Solr for full-text search
•  Can index documents written in XML, JSON, CSV and binary formats
•  Runs as Java Web application
•  Provides a REST-like web service that exposes services to manage the lifecycle of documents in the index (indexing, querying, …)
•  Used by many popular Web apps (Apple, Instagram, LinkedIn, …)
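The indexing and full-text search that Solr delegates to Lucene rest on an inverted index, which maps each term to the documents that contain it. The following is a minimal Python sketch of that idea only; the tokenizer, the function names and the sample documents are illustrative assumptions, not Lucene's actual analysis chain or API.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Solr is built on Apache Lucene",
    2: "Elasticsearch is also based on Apache Lucene",
    3: "InfluxDB is a time series database",
}
index = build_index(docs)
print(search(index, "apache lucene"))  # [1, 2]
print(search(index, "time series"))    # [3]
```

A lookup touches only the posting sets of the query terms instead of scanning every document, which is the main reason text search is faster here than a table scan in a relational database.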



Solr: key features

•  Faceting
    –  To group the results based on specific field or defined criteria, providing the count of each subset
    –  Example: shopping site can provide facets to narrow search results by manufacturer or price
•  Auto-suggest
    –  To present list of possible query terms
•  Spell check
    –  To suggest corrected spelling of query terms
•  Highlighting
•  Document clustering
    –  To group related documents in the search results
•  Spatial search
    –  To filter search results based on location
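Faceting, as described above, amounts to counting the results per value of a field and letting the user narrow the result set by one of those values. The sketch below reproduces that behaviour on an in-memory list; the product records and the helper names are hypothetical, and Solr computes facets inside the index rather than over Python lists.

```python
from collections import Counter

# Hypothetical search results for a shopping site.
products = [
    {"name": "Laptop A", "manufacturer": "Acme", "price": 900},
    {"name": "Laptop B", "manufacturer": "Acme", "price": 1200},
    {"name": "Laptop C", "manufacturer": "Globex", "price": 700},
]


def facet_counts(results, field):
    """Count how many results fall under each value of the given field."""
    return Counter(doc[field] for doc in results if field in doc)


def narrow(results, field, value):
    """Keep only the results whose field equals the selected facet value."""
    return [doc for doc in results if doc.get(field) == value]


print(dict(facet_counts(products, "manufacturer")))  # {'Acme': 2, 'Globex': 1}
print([doc["name"] for doc in narrow(products, "manufacturer", "Acme")])
# ['Laptop A', 'Laptop B']
```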



Solr: key features

•  Pagination and ranking of search results
•  Results grouping
    –  To group the results based on a grouping field and return the top documents in each group
•  Near real-time search
    –  To search documents almost immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
•  More Like This
    –  Identifies other documents that are similar to one in a result set



Solr feature example



Solr components



Solr components

•  Request Handlers: handle a request at a URL
    –  E.g.: /select
•  Search Components: part of a Search Handler, a componentized request handler
    –  Includes: Query, Faceting, Highlighting, Debug, …
    –  Distributed Search capable
•  Update Handlers: handle an indexing request
•  Update Processors chain: per-handler componentized chain that handles updates
•  Query Parser plugins
    –  Mix and match query types in a single request
    –  Function plugins for Function Query
•  Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
•  Response Writers: serialize and stream response to client
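As an illustration of how a request handler, search components and a response writer meet in one request, the sketch below builds a URL for the /select handler with the faceting component enabled and a JSON response. The parameters q, rows, wt, facet and facet.field are standard Solr query parameters; the host, the collection name "products" and the field names are assumptions made for the example.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, facet_fields=(), rows=10):
    """Build a query URL for the /select request handler of a collection."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        # Enable the faceting search component and list the fields to count.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance and collection.
url = solr_select_url(
    "http://localhost:8983/solr", "products", "name:laptop", ["manufacturer"]
)
print(url)
# http://localhost:8983/solr/products/select?q=name%3Alaptop&rows=10&wt=json&facet=true&facet.field=manufacturer
```

The URL is only constructed here, not sent; fetching it requires a running Solr server that hosts the assumed collection.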



Scaling Solr: SolrCloud

•  How to provide distributed indexing and search capabilities?
    –  Up to millions of users and millions of indexed documents
•  SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers
    –  Enables and simplifies horizontal scaling of a search index through replication and sharding
    –  Sharding: incoming queries are distributed to the shards in the collection, and the shards' responses are merged into a single result
    –  Replication: to handle higher concurrent query load by spreading the requests to multiple servers

•  No master node to allocate nodes, shards and replicas

•  SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
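The sharding behaviour described above (send the query to every shard, then merge the partial results) can be sketched as a small scatter-gather loop. This is a toy model under stated assumptions: shards are plain dictionaries, the score is a simple term count, and the function names are hypothetical; it does not reflect SolrCloud's actual ranking or networking.

```python
import heapq


def search_shard(shard, term):
    """Score each document in one shard by how often the term occurs."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return hits


def distributed_search(shards, term, top_k=3):
    """Send the query to every shard, then merge the partial results."""
    partial = [search_shard(shard, term) for shard in shards]
    merged = heapq.nlargest(top_k, (hit for hits in partial for hit in hits))
    return [doc_id for _score, doc_id in merged]


# Two hypothetical shards of one collection.
shards = [
    {"d1": "solr search search", "d2": "zookeeper coordination"},
    {"d3": "search index", "d4": "search search search engine"},
]
print(distributed_search(shards, "search"))  # ['d4', 'd1', 'd3']
```

Replication would add copies of each shard, so that the per-shard call could be served by any replica; that part is left out of the sketch.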



Solr distributed architecture



Elasticsearch

•  Distributed, multitenant-capable and scalable full-text search engine with REST-based interface and schema-free JSON documents

•  Search engine based on Apache Lucene
•  Developed in Java
•  Distributed
    –  Indices can be divided into shards and each shard can have zero or more replicas
    –  Each server hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s)
    –  Rebalancing and routing are done automatically
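To make the routing step concrete: a coordinating node picks the shard for a document roughly as the hash of a routing value (by default the document id) modulo the number of primary shards. The sketch below uses zlib.crc32 as a stand-in hash, which is an assumption made for a self-contained example (Elasticsearch uses a Murmur3 hash), and the function names are hypothetical.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    """Pick a shard as hash(routing key) modulo the number of primary shards.

    crc32 is a stand-in hash chosen for this sketch, not the one
    Elasticsearch actually uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def route_documents(doc_ids, num_primary_shards):
    """Group document ids by the shard they would be delegated to."""
    placement = {shard: [] for shard in range(num_primary_shards)}
    for doc_id in doc_ids:
        placement[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return placement


doc_ids = [f"doc-{i}" for i in range(10)]
placement = route_documents(doc_ids, num_primary_shards=3)
for shard, ids in placement.items():
    print(shard, ids)
```

Because the shard is derived from the key alone, any node can compute where a document lives without consulting a central directory, as long as the number of primary shards stays fixed.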


Elastic (ELK) Stack

•  Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack, previously known as ELK)

•  Logstash
    –  Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
•  Kibana
    –  Data visualization platform



Solr vs. Elasticsearch

•  Solr
    –  Mature, widely deployed product
    –  Active and large developer community
    –  Provides a highly detailed functional environment; a wide range of plug-ins is available
•  Elasticsearch
    –  Newer, but already very widely used
    –  Focus on extracting value from data generally, and not just on search
    –  Part of ELK stack
    –  Schema-free and document-oriented



•  Elasticsearch vs Solr on Google Trends

Time series database (TSDB)

•  How to analyze DevOps monitoring, application metrics, IoT sensor data?
    –  Time series databases (TSDBs) provide an effective and lightweight solution
•  Optimized for handling high-volume time series data
    –  Time series: a sequence of data points (arrays of numbers) indexed by time (a datetime or a datetime range), e.g.:
        •  Time series of stock prices (price curve)
        •  Time series of energy consumption (load profile)
        •  Log of temperature values (temperature trace)
•  Optimized for providing complex logic to analyze time series data
    –  Queries for historical data, replete with time ranges, roll-ups and arbitrary time zone conversions, are difficult in a DBMS
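The kind of query mentioned above (a time range, a roll-up and a time zone conversion) can be sketched in a few lines of plain Python. The temperature trace, the UTC+2 display zone and the function names are assumptions for illustration; a TSDB would run the equivalent logic inside the storage engine.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def in_range(points, start, end):
    """Keep the (timestamp, value) points with start <= timestamp < end."""
    return [(ts, value) for ts, value in points if start <= ts < end]


def roll_up_hourly(points):
    """Average (timestamp, value) points into one value per hour."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vs) / len(vs) for hour, vs in sorted(buckets.items())}


# Hypothetical temperature trace: one sample every 15 minutes, in UTC.
start = datetime(2017, 5, 1, 10, 0, tzinfo=timezone.utc)
points = [(start + timedelta(minutes=15 * i), 20.0 + i) for i in range(8)]

hourly = roll_up_hourly(in_range(points, start, start + timedelta(hours=2)))
local = timezone(timedelta(hours=2))  # assumed display time zone (UTC+2)
for hour, mean in hourly.items():
    print(hour.astimezone(local).strftime("%H:%M"), mean)
# 12:00 21.5
# 13:00 25.5
```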



TSDB: overview

•  Create, enumerate, update and destroy various time series and organize them in some fashion
    –  Series may be organized hierarchically and have companion metadata
    –  Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
    –  Filter on arbitrary patterns (e.g., day of the week, low value, high value)
    –  Provide additional statistical functions that are targeted to time series data
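The whole-series calculations and pattern filters listed above can be illustrated with series stored as dictionaries from date to value. This is a minimal sketch; the representation, the function names and the load values are assumptions, not the API of any particular TSDB.

```python
from datetime import date, timedelta


def add_series(a, b):
    """Combine two series into a new one by summing values at shared dates."""
    return {t: a[t] + b[t] for t in sorted(a.keys() & b.keys())}


def scale(series, factor):
    """Multiply every value of a series by a constant factor."""
    return {t: v * factor for t, v in series.items()}


def filter_weekday(series, weekday):
    """Keep only the points that fall on the given weekday (Monday is 0)."""
    return {t: v for t, v in series.items() if t.weekday() == weekday}


monday = date(2017, 5, 1)  # 1 May 2017 was a Monday
load_a = {monday + timedelta(days=i): 10.0 + i for i in range(7)}
load_b = {monday + timedelta(days=i): 1.0 for i in range(7)}

total = add_series(load_a, load_b)
print(list(total.values()))                     # [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
print(filter_weekday(total, 0))                 # {datetime.date(2017, 5, 1): 11.0}
print(list(scale(load_b, 2.0).values())[:2])    # [2.0, 2.0]
```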



TSDB: some products

•  Some open-source products
    –  CrateDB https://crate.io
    –  Chronix http://www.chronix.io
    –  Graphite https://graphiteapp.org
        •  Stores numeric time-series data and renders graphs of this data on demand
    –  InfluxDB https://www.influxdata.com
    –  KairosDB https://kairosdb.github.io
        •  Stores its time series in Cassandra
    –  OpenTSDB http://opentsdb.net
        •  Stores its time series in HBase
    –  Riak-TS http://basho.com/products/riak-ts/



InfluxDB

•  Written in Go
•  Supports high write loads and large data set storage
•  Conserves space through downsampling
    –  And by automatically expiring and deleting unwanted data; backup and restore are also supported
•  Provides easy-to-use SQL-like query language for interacting with data
•  Provides simple, high performing write and query HTTP(S) APIs, e.g.:
    –  To create a database
        curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
    –  To write data
        curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
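The same two HTTP calls, plus a read through the SQL-like query language, can be prepared from Python with the standard library. The sketch below only builds the requests; the helper names are hypothetical, the endpoints and parameters (/query, /write, db, q) are those of the InfluxDB 1.x HTTP API used in the curl commands above, and the SELECT statement is an assumed example query.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8086"  # assumed local InfluxDB 1.x instance


def create_database_request(name):
    """POST to /query with the statement form-encoded in the body."""
    body = urlencode({"q": f"CREATE DATABASE {name}"}).encode("utf-8")
    return Request(f"{BASE_URL}/query", data=body, method="POST")


def write_request(db, line):
    """POST one point to /write for the given database."""
    url = f"{BASE_URL}/write?{urlencode({'db': db})}"
    return Request(url, data=line.encode("utf-8"), method="POST")


def query_url(db, statement):
    """Build a GET URL for /query that runs a statement against a database."""
    return f"{BASE_URL}/query?{urlencode({'db': db, 'q': statement})}"


create = create_database_request("mydb")
write = write_request(
    "mydb",
    "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000",
)
print(create.get_method(), create.full_url, create.data)
print(write.get_method(), write.full_url)
print(query_url("mydb", "SELECT \"value\" FROM \"cpu_load_short\" WHERE \"region\"='us-west'"))
```

Passing one of these Request objects to urllib.request.urlopen would actually send it, which requires a running InfluxDB server at the assumed address.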


InfluxDB datastore

•  Data organized by time series, which contain a measured value, like “cpu_load” or “temperature”
•  Time series have zero to many points, one for each discrete sample of the metric
•  Points consist of:
    –  time (a timestamp)
    –  a measurement (e.g., “cpu_load”)
    –  at least one key-value field (the measured value itself, e.g., “value=0.64” or “temperature=21.2”)
    –  zero to many key-value tags containing any metadata about the value (e.g., “host=server01”, “region=EMEA”, “dc=Frankfurt”)



InfluxDB datastore

•  General format of points:
    <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]
•  Examples of points:
    –  cpu,host=serverA,region=us_west value=0.64
    –  payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
    –  stock,symbol=AAPL bid=127.46,ask=127.48
    –  temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
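The point format above can be produced by a short formatter. The following is a simplified sketch: the function names are hypothetical, tags are sorted by key only to make the output deterministic, and escaping of spaces, commas and quotes is deliberately left out, so it is not a complete implementation of the line protocol.

```python
def format_value(value):
    """Render a field value: booleans as true/false, integers with an 'i'
    suffix, floats as plain numbers, anything else as a quoted string."""
    if isinstance(value, bool):  # checked first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value) + '"'


def format_point(measurement, fields, tags=None, timestamp_ns=None):
    """Build one line: measurement[,tags] fields [timestamp in nanoseconds]."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = measurement
    for key, val in sorted((tags or {}).items()):
        head += f",{key}={val}"
    body = ",".join(f"{k}={format_value(v)}" for k, v in fields.items())
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


print(format_point("cpu", {"value": 0.64}, {"host": "serverA", "region": "us_west"}))
# cpu,host=serverA,region=us_west value=0.64
print(format_point("stock", {"bid": 127.46, "ask": 127.48}, {"symbol": "AAPL"}))
# stock,symbol=AAPL bid=127.46,ask=127.48
print(format_point("payment", {"licenses": 3}, {"device": "mobile"}, 1434067467100293230))
# payment,device=mobile licenses=3i 1434067467100293230
```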



InfluxDB datastore

•  A measurement is like a SQL table, where the primary index is time

•  With respect to DBMS:
    –  No need to define schemas up-front
    –  Null values are not stored



InfluxDB stack

•  Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)



References

•  Apache Solr Reference Guide, http://bit.ly/2scksQF
•  InfluxDB Version 1.2 Documentation, http://bit.ly/2ryagFT
•  Dunning and Friedman, “Time Series Databases”, O’Reilly, 2015.

