
Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

Search Engines and Time Series Databases

Corso di Sistemi e Architetture per Big Data A.A. 2017/18

Valeria Cardellini

The reference Big Data stack

Valeria Cardellini - SABD 2017/18


Resource Management

Data Storage

Data Processing

High-level Interfaces

Support / Integration

Why search engines?

•  How to add search to NoSQL data stores?
–  E.g., key-value data stores

•  How to find documents that match queries?
–  With text search faster than in RDBMSs

•  How to obtain specific features?
–  Such as highlighting, spatial search, suggestions, guided navigation, …



Search engines

•  Most popular search engines:
–  Apache Solr
–  Elasticsearch

•  ETL process



Apache Solr

•  Scalable, highly reliable, open-source framework for searching data

•  Built on Apache Lucene
–  Open-source library for indexing and search
–  Used by Solr for full-text search

•  Can index documents written in
–  XML, JSON, CSV and binary formats

•  Runs as a standalone application service

•  Provides a REST-like web service that exposes services to manage the lifecycle of documents in the index (indexing, querying, …)

•  Used by many popular Web apps (Apple, Instagram, LinkedIn, …)



Solr: key features

•  Faceting
–  To group the results based on a specific field or defined criteria, providing the count of each subset
–  Example: a shopping site can provide facets to narrow search results by manufacturer or price

•  Auto-suggest
–  To present a list of possible query terms

•  Spell check
–  To suggest corrected spelling of query terms

•  Highlighting

•  Document clustering
–  To group related documents in the search results

•  Spatial search
–  To filter search results based on location
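As a rough illustration of what faceting computes, the sketch below counts matching documents per value of one field, which is the information a facet exposes alongside the results. The sample documents and the `facet_counts` helper are hypothetical and only mimic the idea in plain Python; in Solr itself, faceting is requested through query parameters such as `facet=true` and `facet.field=<field>`.

```python
from collections import Counter

# Hypothetical search results for a shopping site (illustrative data only).
results = [
    {"name": "Laptop A", "manufacturer": "Acme", "price": 900},
    {"name": "Laptop B", "manufacturer": "Acme", "price": 1200},
    {"name": "Laptop C", "manufacturer": "Globex", "price": 750},
]


def facet_counts(docs, field):
    """Count how many documents fall under each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


# Print one line per facet value, most frequent first.
for value, count in facet_counts(results, "manufacturer").most_common():
    print(f"{value}: {count}")
```

A user interface would typically show these counts next to each manufacturer so that the user can narrow the result set with one click.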



Solr: key features

•  Pagination and ranking of search results

•  Results grouping
–  To group the results based on a grouping field and return the top documents in each group

•  Near real-time search
–  To search documents immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)

•  More Like This
–  To identify other documents that are similar to one in a result set
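Pagination and result grouping are driven by query parameters. The following sketch builds such parameters for a 1-based page number, using Solr's `start`, `rows`, `group` and `group.field` parameters. The `page_params` helper, the query string and the `manu` field name are assumptions made for illustration.

```python
from urllib.parse import urlencode


def page_params(query, page, page_size=10, group_field=None):
    """Build Solr query parameters for a 1-based result page."""
    params = {
        "q": query,
        "start": (page - 1) * page_size,  # offset of the first result to return
        "rows": page_size,                # number of results per page
    }
    if group_field is not None:
        # Ask Solr to group results by the given field.
        params["group"] = "true"
        params["group.field"] = group_field
    return params


# Query string for the third page of results, grouped by an assumed "manu" field.
print(urlencode(page_params("cat:electronics", page=3, group_field="manu")))
```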



Solr feature example



Solr components



Solr components

•  Request Handlers: handle a client request at a URL
–  To query, a GET request to the /select handler
–  To index a document, a POST request to the /update handler

•  Response Writers: serialize and stream the response to the client

•  Search Components: parts of a Search Handler, a componentized request handler
–  Include: Query, Faceting, Highlighting, Debug, …
–  Distributed Search capable

•  Update Handlers: handle an indexing request

•  Update Processors chain: per-handler componentized chain that handles updates

•  Query Parsing plugins
–  Mix and match query types in a single request
–  Function plugins for Function Query

•  Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
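To make the /update handler concrete, here is a minimal sketch that prepares (but does not send) a POST request carrying JSON documents. The base URL, the `techproducts` core name, the sample document and the `build_index_request` helper are assumptions for a local test setup; `commit=true` asks Solr to make the documents searchable right away.

```python
import json
from urllib.request import Request

# Assumed local Solr core; adjust to your own installation.
SOLR_CORE_URL = "http://localhost:8983/solr/techproducts"


def build_index_request(docs):
    """Prepare a POST request to the /update handler with a JSON list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_CORE_URL}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document used only for illustration.
req = build_index_request([{"id": "doc-1", "name": "Example product", "cat": ["electronics"]}])
print(req.get_method(), req.full_url)
# To send it to a running Solr instance: urllib.request.urlopen(req)
```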


Basic searching

•  Solr can be queried via
–  REST clients, curl, wget, Chrome Postman, etc., as well as via native clients available for many programming languages

•  Example: to search all documents in the index via curl
   curl "http://localhost:8983/solr/techproducts/select?indent=on&q=*:*"

•  Example: to search for a single term
   curl "http://localhost:8983/solr/techproducts/select?q=foundation"

•  Example: to search all “electronics” documents in the index
   curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"

See https://bit.ly/2GDLn3G
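The same queries can be issued from a program. The sketch below rebuilds the three curl URLs with Python's standard library and adds a small `search` function that would fetch and decode the JSON response. The helper names are hypothetical, and `search` assumes a Solr instance with the `techproducts` example core running locally, so it is defined but not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_SELECT_URL = "http://localhost:8983/solr/techproducts/select"


def select_url(query, **extra):
    """Build a /select URL for the given query string and extra parameters."""
    params = {"q": query, **extra}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


def search(query, **extra):
    """Run the query against a live Solr instance and return the matching documents."""
    with urlopen(select_url(query, wt="json", **extra)) as resp:
        return json.load(resp)["response"]["docs"]


print(select_url("*:*", indent="on"))    # all documents
print(select_url("foundation"))          # single term
print(select_url("cat:electronics"))     # field-restricted search
```

Note that `urlencode` percent-encodes characters such as `:` and `*`; Solr decodes them back, so the queries are equivalent to the curl examples.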



Scaling Solr: SolrCloud

•  How to provide distributed indexing and search capabilities?
–  Up to millions of users and millions of indexed documents

•  SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers
–  Enables and simplifies horizontal scaling of a search index through replication and sharding
–  Sharding: incoming queries are distributed to the shards in the collection, which respond with merged results
–  Replication: to handle higher concurrent query load by spreading the requests across multiple servers

•  No master node to allocate nodes, shards and replicas

•  SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
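The sharding idea can be sketched in a few lines: documents are routed to a shard by hashing their id, each shard answers a query over its own documents, and the partial results are merged into one ranked list. This is a deliberately simplified in-memory model, not SolrCloud's actual routing or scoring; the CRC32 hash, the term-count score and all names are assumptions for illustration.

```python
import zlib
from heapq import nlargest

NUM_SHARDS = 3
shards = [[] for _ in range(NUM_SHARDS)]  # each shard holds a subset of the documents


def shard_for(doc_id):
    """Pick a shard deterministically from the document id (simplified hash routing)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def index(doc):
    shards[shard_for(doc["id"])].append(doc)


def search_shard(shard, term, top_k):
    """Score documents of one shard by how often the term occurs; return local top-k."""
    hits = [(doc["text"].lower().split().count(term), doc["id"]) for doc in shard]
    return nlargest(top_k, [hit for hit in hits if hit[0] > 0])


def search(term, top_k=3):
    """Query every shard, then merge the partial results into a global top-k."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, top_k)]
    return [doc_id for _, doc_id in nlargest(top_k, partial)]


for doc in [
    {"id": "a", "text": "solr solr search"},
    {"id": "b", "text": "search engine"},
    {"id": "c", "text": "solr cloud"},
]:
    index(doc)

print(search("solr"))
```

Replication would add copies of each shard on other servers so that concurrent queries can be spread across them; it is omitted here to keep the sketch short.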



Solr distributed architecture



Elasticsearch

•  Distributed, multitenant-capable and scalable full-text search engine with REST-based interface and schema-free JSON documents

•  Search engine based on Apache Lucene

•  Developed in Java

•  Distributed
–  Indices can be divided into shards and each shard can have zero or more replicas
–  Each server hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s)
–  Rebalancing and routing are done automatically
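A minimal sketch of the REST-based interface with schema-free JSON documents follows. It prepares one request that indexes a document and one that runs a full-text `match` query, without sending them. The local endpoint on port 9200, the `articles` index, the document content and the `json_request` helper are assumptions; the exact document path also depends on the Elasticsearch version in use.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch endpoint


def json_request(method, path, payload):
    """Prepare an HTTP request with a JSON body for the Elasticsearch REST API."""
    return Request(
        f"{ES_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Index a document with id 1 into a hypothetical "articles" index (no schema defined up front).
index_req = json_request("PUT", "/articles/_doc/1", {"title": "Search engines and time series databases"})

# Full-text search on the "title" field using the query DSL.
search_req = json_request("POST", "/articles/_search", {"query": {"match": {"title": "search"}}})

for req in (index_req, search_req):
    print(req.get_method(), req.full_url)
# To send them to a running cluster: urllib.request.urlopen(index_req), and so on.
```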


Elastic (ELK) Stack

•  Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack)

•  Logstash
–  Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch

•  Kibana
–  Data visualization platform



Solr vs. Elasticsearch

•  Solr
–  Mature, widely deployed product
–  Active and large developer community
–  Provides a highly detailed functional environment; a wide range of plug-ins is available

•  Elasticsearch
–  Newer, but already very widely used
–  Focus on extracting value from data generally, and not just on search
–  Part of the ELK stack
–  Schema-free and document-oriented



•  Elasticsearch vs Solr on Google Trends

Time series database (TSDB)

•  How to analyze DevOps monitoring data, application metrics, and sensor data from smart factories, smart cities, or smart vehicles?

•  Time series databases (TSDBs)
–  A possible solution, not the only one!

•  Optimized for handling high-volume time series data
–  Time series: sequence of data points (arrays of numbers) indexed by time (a date time or a date time range), e.g.:
   •  Stock prices (price curve)
   •  Energy consumption (load profile)
   •  Temperature values (temperature trace)

•  Optimized for providing complex logic to analyze time series data
–  Queries for historical data, replete with time ranges, roll-ups and arbitrary time zone conversions, are difficult in a DBMS
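A roll-up aggregates raw points into coarser time buckets. The sketch below averages a temperature trace into hourly buckets in plain Python, to show the kind of operation a TSDB provides natively. The sample readings and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean


def roll_up(points, bucket):
    """Average (timestamp, value) points over fixed-size time buckets."""
    size = bucket.total_seconds()
    buckets = defaultdict(list)
    for ts, value in points:
        # Align the timestamp to the start of its bucket.
        start = ts.timestamp() // size * size
        buckets[start].append(value)
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), mean(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical temperature trace (UTC timestamps, degrees Celsius).
day = datetime(2018, 5, 1, tzinfo=timezone.utc)
points = [
    (day + timedelta(hours=10), 20.0),
    (day + timedelta(hours=10, minutes=20), 21.0),
    (day + timedelta(hours=10, minutes=40), 22.0),
    (day + timedelta(hours=11, minutes=10), 23.0),
]

for start, avg in roll_up(points, timedelta(hours=1)):
    print(start.isoformat(), avg)
```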


TSDB: overview

•  Create, enumerate, update and destroy various time series and organize them in some fashion
–  Series may be organized hierarchically and have companion metadata
–  Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
–  Filter on arbitrary patterns (e.g., day of the week, low value, high value)
–  Provide additional statistical functions that are targeted to time series data
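Two of the operations listed above, combining series into a new one and filtering on a pattern such as the day of the week, can be sketched as follows. The meter readings and helper names are hypothetical; a real TSDB would expose these as built-in functions or query clauses.

```python
from datetime import date

# Hypothetical daily energy readings (kWh) for two meters, keyed by date.
meter_a = {date(2018, 5, 4): 10.0, date(2018, 5, 5): 4.0, date(2018, 5, 6): 3.5}
meter_b = {date(2018, 5, 4): 7.5, date(2018, 5, 5): 2.0, date(2018, 5, 6): 1.5}


def add_series(a, b):
    """Combine two series into a new one by summing values at shared timestamps."""
    return {t: a[t] + b[t] for t in sorted(a.keys() & b.keys())}


def filter_weekdays(series, weekdays):
    """Keep only points whose day of the week is in `weekdays` (Monday == 0)."""
    return {t: v for t, v in series.items() if t.weekday() in weekdays}


total = add_series(meter_a, meter_b)
weekend = filter_weekdays(total, {5, 6})  # Saturday and Sunday
print(weekend)
```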



TSDB: some products

•  Some open-source products
–  CrateDB https://crate.io
–  Chronix http://www.chronix.io
–  Graphite https://graphiteapp.org
   •  Stores numeric time-series data and renders graphs of this data on demand
–  InfluxDB https://www.influxdata.com
–  KairosDB https://kairosdb.github.io
   •  Stores its time series in Cassandra
–  OpenTSDB http://opentsdb.net
   •  Stores its time series in HBase
–  Riak-TS http://basho.com/products/riak-ts/
   •  NoSQL key/value store optimized for time series data with masterless architecture (similar to Riak-KV)


InfluxDB

•  Written in Go

•  Supports high write loads and large data set storage

•  Conserves space through downsampling and by automatically expiring and deleting unwanted data; also supports backup and restore

•  Provides an easy-to-use SQL-like query language for interacting with data

•  Provides simple, high-performing write and query HTTP(S) APIs, e.g.:
–  To create a database
   curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
–  To write data
   curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
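The two curl calls above translate directly to Python's standard library. The sketch prepares the equivalent requests for the `/query` and `/write` endpoints without sending them; the helper names are hypothetical and a local InfluxDB 1.x instance on port 8086 is assumed.

```python
from urllib.parse import urlencode
from urllib.request import Request

INFLUX_URL = "http://localhost:8086"  # assumed local InfluxDB 1.x instance


def create_database_request(name):
    """Prepare the POST to /query that creates a database (form-encoded body)."""
    body = urlencode({"q": f"CREATE DATABASE {name}"}).encode("utf-8")
    return Request(f"{INFLUX_URL}/query", data=body, method="POST")


def write_request(db, line):
    """Prepare the POST to /write that stores one point given in line protocol."""
    url = f"{INFLUX_URL}/write?{urlencode({'db': db})}"
    return Request(url, data=line.encode("utf-8"), method="POST")


create_req = create_database_request("mydb")
point = "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000"
write_req = write_request("mydb", point)

print(create_req.get_method(), create_req.full_url, create_req.data)
print(write_req.get_method(), write_req.full_url)
# To execute against a running server: urllib.request.urlopen(create_req), then urlopen(write_req)
```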



InfluxDB data store

•  Data organized by time series, which contain a measured value, like “cpu_load” or “temperature”

•  Time series have zero to many points, one for each discrete sample of the metric

•  Points consist of:
–  time (a timestamp)
–  a measurement (e.g., “cpu_load”)
–  at least one key-value field (the measured value itself, e.g., “value=0.64” or “temperature=21.2”)
–  zero to many key-value tags containing any metadata about the value (e.g., “host=server01”, “region=EMEA”, “dc=Frankfurt”)



InfluxDB data store

•  General format of points:
   <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]

•  Examples of points:
–  cpu,host=serverA,region=us_west value=0.64
–  payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
–  stock,symbol=AAPL bid=127.46,ask=127.48
–  temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
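The point format above is easy to generate programmatically. The following sketch serializes a measurement, tags, fields and an optional nanosecond timestamp into one line. The `format_point` helper is hypothetical and simplified: it escapes commas, equals signs and spaces in names, writes integers with the `i` suffix, and sorts tags by key, but it does not cover every corner of the format (for example, backslashes inside names).

```python
def _escape(text, chars):
    """Backslash-escape each of the given special characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    """Render a field value: booleans, integers (with 'i' suffix), floats, or quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def format_point(measurement, fields, tags=None, timestamp_ns=None):
    """Serialize one point as: measurement[,tags] fields [timestamp]."""
    if not fields:
        raise ValueError("a point needs at least one field")
    line = _escape(measurement, ", ")
    for key, value in sorted((tags or {}).items()):
        line += f",{_escape(key, ',= ')}={_escape(str(value), ',= ')}"
    line += " " + ",".join(
        f"{_escape(key, ',= ')}={_field_value(value)}" for key, value in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


print(format_point("cpu", {"value": 0.64}, {"host": "serverA", "region": "us_west"}))
print(format_point("stock", {"bid": 127.46, "ask": 127.48}, {"symbol": "AAPL"}))
```

When the timestamp is omitted, as in the two calls above, the server assigns its own current time to the point.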



InfluxDB data store

•  A measurement is like a SQL table, where the primary index is time

•  With respect to DBMS:
–  No need to define schemas up-front
–  Null values are not stored

•  InfluxDB limitation
–  Horizontal scalability: clustered installation available only as an enterprise product
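Because a measurement behaves like a SQL table indexed by time, it can be queried with SQL-like statements through the HTTP `/query` endpoint. The sketch below builds such a query URL; the `query_url` helper is hypothetical, and the database and measurement names (`mydb`, `cpu_load_short`) are assumed to exist on a local InfluxDB 1.x instance.

```python
from urllib.parse import urlencode

INFLUX_QUERY_URL = "http://localhost:8086/query"  # assumed local InfluxDB 1.x instance


def query_url(db, influxql):
    """Build a GET URL that runs an InfluxQL statement against the given database."""
    return f"{INFLUX_QUERY_URL}?{urlencode({'db': db, 'q': influxql})}"


# Mean CPU load over the last hour, rolled up in 10-minute windows.
mean_load = (
    'SELECT MEAN("value") FROM "cpu_load_short" '
    "WHERE time >= now() - 1h GROUP BY time(10m)"
)

print(query_url("mydb", mean_load))
# The URL can be fetched with curl or urllib.request.urlopen; the reply is JSON.
```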



InfluxDB stack

•  Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)

•  To realize a MAPE (Monitor, Analyze, Plan, Execute) control loop

See https://www.influxdata.com/time-series-platform/

InfluxDB stack

•  Telegraf: plugin-driven server agent for collecting and reporting metrics and events
–  Input plugins or integrations to source a variety of metrics
–  Output plugins to send metrics to a variety of other data stores, services, and message queues (InfluxDB, Graphite, OpenTSDB, Kafka, MQTT, …)

•  Chronograf: administrative user interface and visualization engine
–  To build dashboards with real-time visualizations of data and to create alerting and automation rules

•  Kapacitor: native data processing engine
–  To process both stream and batch data from InfluxDB
–  E.g., to perform specific actions (e.g., dynamic load balancing) based on alerts (e.g., above load threshold)
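The alert-then-act pattern described for Kapacitor can be illustrated with one iteration of a threshold-based control loop. This is plain Python, not Kapacitor's own scripting language or API; the `mape_step` helper, the thresholds and the replica-count action are assumptions chosen for the example.

```python
def mape_step(load_samples, replicas, high=0.8, low=0.3):
    """Run one control-loop iteration and return the new replica count.

    load_samples: recently monitored load values in [0, 1]
    replicas:     current number of service replicas
    """
    # Analyze: summarize the monitored samples.
    avg_load = sum(load_samples) / len(load_samples)

    # Plan: scale out on a high-load alert, scale in when mostly idle.
    if avg_load > high:
        return replicas + 1
    if avg_load < low and replicas > 1:
        return replicas - 1

    # No alert: keep the current configuration.
    return replicas


print(mape_step([0.90, 0.95], replicas=2))  # high load, one replica is added
print(mape_step([0.50, 0.55], replicas=2))  # normal load, nothing changes
```

In the TICK stack, Telegraf would supply the samples, InfluxDB would store them, and the returned decision would be executed by whatever component manages the replicas.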



