Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica
Search Engines and Time Series Databases
Course: Systems and Architectures for Big Data, Academic Year 2017/18
Valeria Cardellini
The reference Big Data stack
Valeria Cardellini - SABD 2017/18
Resource Management
Data Storage
Data Processing
High-level Interfaces Support / Integration
Why search engines?
• How to add search to NoSQL data stores?
  – E.g., key-value data stores
• How to find documents that match queries?
  – With text search that is faster than in RDBMSs
• How to obtain specific features?
  – Such as highlighting, spatial search, suggestions, guided navigation, …
Search engines
• Most popular search engines:
  – Apache Solr
  – Elasticsearch
• ETL process
Apache Solr
• Scalable, highly reliable, open-source framework for searching data
• Built on Apache Lucene
  – Open-source library for indexing and search
  – Used by Solr for full-text search
• Can index documents written in
  – XML, JSON, CSV and binary formats
• Runs as a standalone application service
• Provides a REST-like web service that exposes services to manage the lifecycle of documents in the index (indexing, querying, …)
• Used by many popular Web apps (Apple, Instagram, LinkedIn, …)
Solr: key features
• Faceting
  – To group the results based on a specific field or defined criteria, providing the count of each subset
  – Example: a shopping site can provide facets to narrow search results by manufacturer or price
• Auto-suggest
  – To present a list of possible query terms
• Spell check
  – To suggest corrected spelling of query terms
• Highlighting
• Document clustering
  – To group related documents in the search results
• Spatial search
  – To filter search results based on location
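As a rough illustration of faceting, the sketch below counts how many documents fall into each value of a field, and builds the query string that would ask Solr to do the same. The sample documents and the field name "manufacturer" are hypothetical; `facet` and `facet.field` are standard Solr query parameters.

```python
from collections import Counter
from urllib.parse import urlencode

# Hypothetical sample documents, used only for illustration.
DOCS = [
    {"name": "Laptop A", "manufacturer": "Acme"},
    {"name": "Laptop B", "manufacturer": "Acme"},
    {"name": "Laptop C", "manufacturer": "Globex"},
]


def facet_counts(docs, field):
    """Count documents per distinct value of `field` (what a facet reports)."""
    return Counter(doc[field] for doc in docs if field in doc)


def facet_query(q, field):
    """Build a query string that enables faceting on `field`."""
    return urlencode({"q": q, "facet": "true", "facet.field": field})


if __name__ == "__main__":
    print(dict(facet_counts(DOCS, "manufacturer")))
    print(facet_query("laptop", "manufacturer"))
```

In a real deployment the counting is done by Solr over the index; the local `Counter` only mimics the shape of the result.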
Solr: key features
• Pagination and ranking of search results
• Results grouping
  – To group the results based on a grouping field and return the top documents in each group
• Near real-time search
  – To search documents almost immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
• More Like This
  – To identify other documents that are similar to one in a result set
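Pagination can be sketched with Solr's `start` and `rows` query parameters. The helper name `page_params` and the 1-based page numbering are assumptions made for this example.

```python
from urllib.parse import urlencode


def page_params(q, page, rows=10):
    """Return the query string for a 1-based result page of `rows` documents."""
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    # `start` is the zero-based offset of the first document to return.
    return urlencode({"q": q, "start": (page - 1) * rows, "rows": rows})


if __name__ == "__main__":
    print(page_params("foundation", page=3))
```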
Solr feature example
Solr components
• Request Handlers: handle a client request at a URL
  – To query, a GET request to the /select handler
  – To index a document, a POST request to the /update handler
• Response Writers: serialize and stream the response to the client
• Search Components: parts of a Search Handler, a componentized request handler
  – Include: Query, Faceting, Highlighting, Debug, …
  – Distributed Search capable
• Update Handlers: handle an indexing request
• Update Processors chain: per-handler componentized chain that handles updates
• Query Parsing plugins
  – Mix and match query types in a single request
  – Function plugins for Function Query
• Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
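The idea of a text analysis chain (a tokenizer followed by token filters) can be sketched in a few lines. This is not Solr's or Lucene's implementation: the regular-expression tokenizer, the stop-word list, and all function names are assumptions chosen for illustration.

```python
import re

# Hypothetical stop-word list; real analyzers ship language-specific lists.
STOPWORDS = {"a", "an", "the", "of", "and"}


def tokenize(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters=(lowercase, remove_stopwords)):
    """Analyzer: run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze("The Foundation of Search"))
```

The order of the filters matters: lowercasing first lets the stop-word filter match "The" as well as "the".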
Basic searching
• Solr can be queried via
  – REST clients, curl, wget, Postman, etc., as well as via native clients available for many programming languages
• Example: to search all documents in the index via curl
  curl "http://localhost:8983/solr/techproducts/select?indent=on&q=*:*"
• Example: to search for a single term
  curl "http://localhost:8983/solr/techproducts/select?q=foundation"
• Example: to search all “electronics” documents in the index
  curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"
See https://bit.ly/2GDLn3G
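The same /select queries can be built programmatically. The sketch below only constructs the URLs (it does not contact a server) and assumes the local Solr instance and the techproducts collection used in the curl examples; `select_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL, taken from the curl examples.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(collection, q, **params):
    """Build the URL of a query against the /select handler of `collection`."""
    query = urlencode({"q": q, **params})
    return f"{SOLR_BASE}/{collection}/select?{query}"


if __name__ == "__main__":
    print(select_url("techproducts", "*:*", indent="on"))
    print(select_url("techproducts", "foundation"))
    print(select_url("techproducts", "cat:electronics"))
```

Characters such as `*` and `:` are percent-encoded by `urlencode`; the server decodes them back to the original query.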
Scaling Solr: SolrCloud
• How to provide distributed indexing and search capabilities?
  – Up to millions of users and millions of indexed documents
• SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers
  – Enables and simplifies horizontal scaling of a search index through replication and sharding
  – Sharding: incoming queries are distributed to the shards in the collection, and their partial results are merged into a single response
  – Replication: to handle higher concurrent query load by spreading the requests across multiple servers
• No master node to allocate nodes, shards and replicas
• SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
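A toy sketch of sharding: documents are routed to a shard by hashing their id, and a query is sent to every shard and the partial results are merged. This is not SolrCloud's routing or ranking; the CRC32 hash, the term-count score, and all names are assumptions made for illustration.

```python
import heapq
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard for a document by hashing its id (illustrative hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index(shards, doc_id, text):
    """Store the document in exactly one shard."""
    shards[shard_for(doc_id, len(shards))][doc_id] = text


def search_shard(shard, term, k):
    """Return the top-k (score, doc_id) hits of one shard; score = term count."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search(shards, term, k=10):
    """Fan the query out to all shards and merge the partial top-k lists."""
    term = term.lower()
    partial = [hit for shard in shards for hit in search_shard(shard, term, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


if __name__ == "__main__":
    shards = [dict() for _ in range(3)]
    index(shards, "d1", "solr solr search")
    index(shards, "d2", "solr cloud")
    index(shards, "d3", "zookeeper")
    print(search(shards, "solr"))
```

Replication would add copies of each shard on other servers so that concurrent queries can be spread among them; it is omitted here to keep the sketch short.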
Solr distributed architecture
Elasticsearch
• Distributed, multitenant-capable and scalable full-text search engine with REST-based interface and schema-free JSON documents
• Search engine based on Apache Lucene
• Developed in Java
• Distributed
  – Indices can be divided into shards and each shard can have zero or more replicas
  – Each server hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s)
  – Rebalancing and routing are done automatically
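The shard and replica counts of an index are chosen through its settings. The sketch below builds the JSON body that could be sent when creating an index and computes how many shard copies the cluster would host; the helper names and the example values are assumptions, while `number_of_shards` and `number_of_replicas` are Elasticsearch index settings.

```python
import json


def index_settings(num_shards, num_replicas):
    """Build the settings body for creating an index."""
    return {
        "settings": {
            "number_of_shards": num_shards,      # primary shards
            "number_of_replicas": num_replicas,  # replicas per primary shard
        }
    }


def total_shard_copies(num_shards, num_replicas):
    """Each primary shard exists once plus once per replica."""
    return num_shards * (1 + num_replicas)


if __name__ == "__main__":
    print(json.dumps(index_settings(3, 1)))
    print(total_shard_copies(3, 1))
```

With 3 primary shards and 1 replica each, the cluster distributes 6 shard copies across its servers.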
Elastic (ELK) Stack
• Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack)
• Logstash
  – Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
• Kibana
  – Data visualization platform
Solr vs. Elasticsearch
• Solr
  – Mature, widely deployed product
  – Active and large developer community
  – Provides a highly detailed functional environment; a wide range of plug-ins is available
• Elasticsearch
  – Newer, but already very widely used
  – Focus on extracting value from data generally, and not just on search
  – Part of ELK stack
  – Schema-free and document-oriented
• Elasticsearch vs Solr on Google Trends
Time series database (TSDB)
• How to analyze DevOps monitoring, application metrics, sensor data from smart factories, smart cities, or smart vehicles? Time series databases (TSDBs)
  – A possible solution, not the only one!
• Optimized for handling high-volume time series data
  – Time series: sequence of data points (arrays of numbers) indexed by time (a date time or a date time range), e.g.:
    • Stock prices (price curve)
    • Energy consumption (load profile)
    • Temperature values (temperature trace)
• Optimized for providing complex logic to analyze time series data
  – Queries for historical data, replete with time ranges, roll-ups and arbitrary time zone conversions, are difficult in a general-purpose DBMS
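To make "roll-up" concrete, the sketch below averages raw points into fixed-size time windows and reports each window start in a chosen time zone. The input format (Unix seconds paired with a value) and the function name are assumptions; a TSDB would run the equivalent logic inside its query engine.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean


def roll_up(points, window_seconds, tz=timezone.utc):
    """Average (unix_seconds, value) points per window of `window_seconds`.

    Windows are aligned to the Unix epoch; `tz` only affects how the
    window start is presented.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_seconds].append(value)
    return [
        (datetime.fromtimestamp(start, tz), mean(values))
        for start, values in sorted(buckets.items())
    ]


if __name__ == "__main__":
    samples = [(0, 1.0), (1800, 3.0), (3600, 5.0)]
    for start, avg in roll_up(samples, 3600, tz=timezone(timedelta(hours=1))):
        print(start.isoformat(), avg)
```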
TSDB: overview
• Create, enumerate, update and destroy various time series and organize them in some fashion
  – Series may be organized hierarchically and have companion metadata
  – Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
  – Filter on arbitrary patterns (e.g., day of the week, low value, high value)
  – Provide additional statistical functions that are targeted to time series data
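Two of these operations, combining series and filtering on a pattern, can be sketched on plain dictionaries that map Unix seconds to values. The representation and the function names are assumptions for illustration only.

```python
from datetime import datetime, timezone


def add_series(a, b):
    """Combine two series into a new one by summing values at shared timestamps."""
    return {ts: a[ts] + b[ts] for ts in sorted(a.keys() & b.keys())}


def filter_weekday(series, weekday):
    """Keep only the points that fall on `weekday` (Monday=0 ... Sunday=6, UTC)."""
    return {
        ts: value
        for ts, value in series.items()
        if datetime.fromtimestamp(ts, timezone.utc).weekday() == weekday
    }


if __name__ == "__main__":
    a = {0: 1.0, 86400: 2.0}
    b = {0: 10.0, 86400: 20.0, 172800: 30.0}
    total = add_series(a, b)
    print(total)
    print(filter_weekday(total, 3))  # 1970-01-01 was a Thursday
```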
TSDB: some products
• Some open-source products
  – CrateDB https://crate.io
  – Chronix http://www.chronix.io
  – Graphite https://graphiteapp.org
    • Stores numeric time-series data and renders graphs of this data on demand
  – InfluxDB https://www.influxdata.com
  – KairosDB https://kairosdb.github.io
    • Stores its time series in Cassandra
  – OpenTSDB http://opentsdb.net
    • Stores its time series in HBase
  – Riak-TS http://basho.com/products/riak-ts/
    • NoSQL key/value store optimized for time series data with masterless architecture (similar to Riak-KV)
InfluxDB
• Written in Go
• Supports high write loads and large data set storage
• Conserves space through downsampling and by automatically expiring and deleting unwanted data
  – Also supports backup and restore
• Provides an easy-to-use SQL-like query language for interacting with data
• Provides simple, high-performing write and query HTTP(S) APIs, e.g.:
  – To create a database
    curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
  – To write data
    curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
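The two curl calls can be mirrored with the Python standard library. The sketch below only builds the request objects, so it runs without a server; it assumes the InfluxDB 1.x endpoints and the local address from the curl examples, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed address, taken from the curl examples.
BASE_URL = "http://localhost:8086"


def create_database_request(name, base_url=BASE_URL):
    """POST /query with a form-encoded CREATE DATABASE statement."""
    body = urlencode({"q": f"CREATE DATABASE {name}"}).encode("ascii")
    return Request(f"{base_url}/query", data=body, method="POST")


def write_request(db, line, base_url=BASE_URL):
    """POST /write?db=<db> with one point in the request body."""
    query = urlencode({"db": db})
    return Request(f"{base_url}/write?{query}", data=line.encode("utf-8"), method="POST")


if __name__ == "__main__":
    req = write_request(
        "mydb",
        "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000",
    )
    print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it, which requires a running InfluxDB instance.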
InfluxDB data store
• Data organized by time series, which contain a measured value, like “cpu_load” or “temperature”
• Time series have zero to many points, one for each discrete sample of the metric
• Points consist of:
  – time (a timestamp)
  – a measurement (e.g., “cpu_load”)
  – at least one key-value field (the measured value itself, e.g. “value=0.64”, or “temperature=21.2”)
  – and zero to many key-value tags containing any metadata about the value (e.g. “host=server01”, “region=EMEA”, “dc=Frankfurt”)
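The structure of a point can be modeled as a small data class. This is an illustrative model, not InfluxDB's internal representation; the class and attribute names are assumptions, and grouping points into a series by measurement plus tag set reflects our reading of the InfluxDB documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

FieldValue = Union[float, int, bool, str]


@dataclass(frozen=True)
class Point:
    measurement: str                       # e.g. "cpu_load"
    fields: Dict[str, FieldValue]          # at least one measured value
    tags: Dict[str, str] = field(default_factory=dict)  # optional metadata
    time_ns: Optional[int] = None          # timestamp in nanoseconds

    def __post_init__(self):
        if not self.fields:
            raise ValueError("a point needs at least one field")

    def series_key(self):
        """Points with the same measurement and tag set belong to one series."""
        return (self.measurement, tuple(sorted(self.tags.items())))


if __name__ == "__main__":
    p = Point("cpu_load", {"value": 0.64}, {"host": "server01", "region": "EMEA"})
    print(p.series_key())
```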
InfluxDB data store
• General format of points:
  <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]
• Examples of points:
  – cpu,host=serverA,region=us_west value=0.64
  – payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
  – stock,symbol=AAPL bid=127.46,ask=127.48
  – temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
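A minimal sketch of a formatter for this point format follows. It deliberately ignores the escaping rules for spaces, commas and quotes, so it is only safe for simple keys and values; the function names are assumptions. Integers get the `i` suffix seen in `licenses=3i`, and tags are written in sorted key order.

```python
def format_value(value):
    """Render one field value; no escaping is performed (simplification)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value) + '"'


def format_point(measurement, fields, tags=None, timestamp_ns=None):
    """Build '<measurement>[,tags] <fields> [timestamp]' for one point."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = measurement
    for key in sorted(tags or {}):
        head += f",{key}={tags[key]}"
    body = ",".join(f"{k}={format_value(v)}" for k, v in fields.items())
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    print(format_point("stock", {"bid": 127.46, "ask": 127.48}, {"symbol": "AAPL"}))
```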
InfluxDB data store
• A measurement is like a SQL table, where the primary index is time
• With respect to DBMS:
  – No need to define schemas up-front
  – Null values are not stored
• InfluxDB limitation
  – Horizontal scalability: clustered installation available only as enterprise product
InfluxDB stack
• Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)
• To realize a MAPE (Monitor, Analyze, Plan, Execute) control loop
See https://www.influxdata.com/time-series-platform/
InfluxDB stack
• Telegraf: plugin-driven server agent for collecting and reporting metrics and events
  – Input plugins or integrations to source a variety of metrics
  – Output plugins to send metrics to a variety of other data stores, services, and message queues (InfluxDB, Graphite, OpenTSDB, Kafka, MQTT, …)
• Chronograf: administrative user interface and visualization engine
  – To build dashboards with real-time visualizations of data and to create alerting and automation rules
• Kapacitor: native data processing engine
  – To process both stream and batch data from InfluxDB
  – E.g., to perform specific actions (e.g., dynamic load balancing) based on alerts (e.g., above load threshold)
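The alert-then-act pattern can be sketched as a loop over incoming points that triggers a callback when a threshold is exceeded. This is plain Python, not Kapacitor's own scripting language; the function name, the point format and the callback are assumptions for illustration.

```python
def alert_on_threshold(points, threshold, action):
    """Call `action(ts, value)` for every point above `threshold`.

    Returns the timestamps for which the action fired.
    """
    fired = []
    for ts, value in points:
        if value > threshold:
            action(ts, value)  # e.g. trigger load balancing
            fired.append(ts)
    return fired


if __name__ == "__main__":
    load = [(1, 0.50), (2, 0.90), (3, 0.95)]
    alert_on_threshold(load, 0.8, lambda ts, v: print(f"t={ts}: load {v} above threshold"))
```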
References
• Apache Solr Reference Guide, http://bit.ly/2scksQF
• InfluxDB v.1.5 Documentation, https://bit.ly/2GtM8Ie
• Dunning and Friedman, “Time Series Databases”, O’Reilly, 2015.