TRANSCRIPT

Elasticsearch
CS 4422/7263 Information Retrieval
Jiho Noh, Kennesaw State University

Contents from:
• Sundog Education / Elasticsearch
• Official Elasticsearch tutorials
Installing and Basics
Elasticsearch — Overview
• Open source search and analytics engine for all types of data
• Started off as scalable Apache Lucene
• Part of the ELK Stack (Elasticsearch, Logstash, and Kibana)
• Elasticsearch maintains “shards”, where each shard is an inverted index of documents
• But not just for full-text search!
• Can handle structured data, and can aggregate data quickly
• Often a faster solution than Hadoop/Spark/Flink/etc.
Shards
• Documents are hashed (and split) into a particular shard
[Figure: an index split into multiple shards]
• Each shard may be on a different node in a cluster
• Every shard is a self-contained Lucene index of its own
Fault Tolerance
• This index has two primary shards and two replicas
• Write requests are routed to the primary shard, then replicated
• Read requests are routed to the primary or any replica
• The number of primary shards cannot be changed once set up
[Figure: primary shards (Primary0, Primary1) and replicas (Replica0, Replica1) distributed across node 1, node 2, and node 3]
Logstash / Beats
• Tools for publishing data into Elasticsearch
• FileBeat monitors log files, parses them, and imports them into Elasticsearch in near-real-time
• Logstash reads data from machines and feeds it into Elasticsearch
• General-purpose system; not just for log files
Kibana
• Web UI for searching and visualizing
• Complex aggregations, graphs, charts
• Often used for log analysis
Elasticsearch Installation
We are going to use a Linux (e.g., Ubuntu) system for installing and running Elasticsearch 7.
What if I use Mac or Windows?
• Cloud computing resources, such as Amazon EC2
• VirtualBox + Ubuntu on your machine
Elasticsearch Installation
• Elasticsearch official installation guide
Installing on Ubuntu
Before the installation, import the Elasticsearch signing key (PGP).
• Download and install the key
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
• Install the apt-transport-https package
sudo apt-get install apt-transport-https
• Save the repository definition
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list
• The following command will install the Elasticsearch package on your machine
sudo apt-get update && sudo apt-get install elasticsearch
Configuring Elasticsearch
• Edit the configuration file
sudo vi /etc/elasticsearch/elasticsearch.yml
• Uncomment (and edit) the following lines:
node.name: node-1
network.host: 0.0.0.0
discovery.seed_hosts: ["127.0.0.1"]
cluster.initial_master_nodes: ["node-1"]
Running Elasticsearch with systemd
• The following commands will enable the Elasticsearch service as a daemon:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
• Elasticsearch can be started and stopped using these commands:
sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service
Checking that Elasticsearch is running
curl -XGET 127.0.0.1:9200
• This should give you a response something like this:
{
  "name" : "node-1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "_UFe2UXmQka_Zmrkijn0IA",
  "version" : {
    "number" : "7.14.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "dd5a0a2acaa2045ff9624f3729fc8a6f40835aa1",
    "build_date" : "2021-07-29T20:49:32.864135063Z",
    "build_snapshot" : false,
    "lucene_version" : "8.9.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
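The response body is ordinary JSON, so any language with a JSON parser can consume it. A minimal Python sketch (response text abbreviated from the example output above):

```python
import json

# The body returned by `curl -XGET 127.0.0.1:9200` is plain JSON.
# (Abbreviated from the example response above.)
response_body = """
{
  "name": "node-1",
  "cluster_name": "elasticsearch",
  "version": {"number": "7.14.0", "lucene_version": "8.9.0"},
  "tagline": "You Know, for Search"
}
"""

info = json.loads(response_body)
print(info["version"]["number"])  # 7.14.0
print(info["tagline"])            # You Know, for Search
```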
HTTP and RESTful API’s
HTTP Request
Component Description
Method The “verb” of the request {GET, POST, PUT, DELETE, etc.}
Protocol The protocol and version used for the communication (e.g., HTTP/1.1)
Host The web server you want to talk to
URL The resource being requested
Body Auxiliary data needed in some cases
Headers User-agent, content-type, etc.
HTTP Response example
>> curl -ivs --raw https://www.kennesaw.edu | less
HTTP/1.1 200 OK
Date: Tue, 17 Aug 2021 18:34:02 GMT
Server: Apache
X-Powered-By: PHP/5.4.16
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

20c8
<!DOCTYPE html>
<html lang="en" id="_63608cfd-163b-4208-b021-dd4e916bb815">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Kennesaw State University in Georgia</title>
...
RESTful API’s
• REpresentational State Transfer
• Web service using HTTP requests
Examples:
• GET: retrieve information (like search results)
• PUT: insert or replace new information
• DELETE: delete existing information
RESTful API’s
Six guiding constraints:
• Client-server architecture
• Statelessness; every request (and response) must be self-contained
• Cacheability; responses can be cached
• Layered system
• Code on demand (i.e., sending JavaScript)
• Uniform interface; your data should have some structure and be predictable
Why REST?
Whatever language or system you use, it must support HTTP requests and responses. Thus, communicating with Elasticsearch is language- and system-independent.
• Learning how to write an HTTP request for a particular command, and how to parse the response, is all we need.
The Curl Command
A way to issue HTTP requests from the command line
curl -H 'Content-Type: application/json' <URL> -d '<BODY>'
Python example
import requests
r = requests.get('https://xkcd.com/1906/')
Examples
curl -X PUT "localhost:9200/my-index-000001/_doc/1" -H 'Content-Type: application/json' -d'
{
"@timestamp": "2099-11-15T13:12:00",
"message": "GET /search HTTP/1.1 200 1070000",
"user": {
"id": "kimchy"
}
}
'
Insert a document API: PUT /<target>/_doc/<_id>
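Because the API is just HTTP, the same indexing call can be composed from any language. A sketch in Python that builds the method, URL, and body for the PUT above (the helper name is made up for illustration; it only constructs the request, it does not send it):

```python
import json

def index_doc_request(host, index, doc_id, doc):
    """Build the pieces of an index-document call: PUT /<target>/_doc/<_id>."""
    url = f"http://{host}/{index}/_doc/{doc_id}"
    return "PUT", url, json.dumps(doc)

method, url, body = index_doc_request(
    "localhost:9200", "my-index-000001", 1,
    {"@timestamp": "2099-11-15T13:12:00",
     "message": "GET /search HTTP/1.1 200 1070000",
     "user": {"id": "kimchy"}})
print(method, url)  # PUT http://localhost:9200/my-index-000001/_doc/1
```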
Examples
curl -X GET "localhost:9200/my-index-000001/_search?from=40&size=20"
-H 'Content-Type: application/json' -d'
{
"query": {
"term": {
"user.id": "kimchy"
}
}
}
'
Search API: GET /<target>/_search
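The search body is plain JSON as well. A small Python helper (illustrative only, not part of any client library) that builds the term query used above:

```python
import json

def term_query(field, value):
    """Build the JSON body of a term query: exact match on a single field."""
    return json.dumps({"query": {"term": {field: value}}})

print(term_query("user.id", "kimchy"))
# {"query": {"term": {"user.id": "kimchy"}}}
```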
Indexing
Example data
• The Complete Works of William Shakespeare
• Suitably parsed into fields
• Download the JSON file
• The Shakespeare dataset has the following structure:
{
  "line_id": INT,
  "play_name": "String",
  "speech_number": INT,
  "line_number": "String",
  "speaker": "String",
  "text_entry": "String"
}
Create an index for the corpus
curl -X PUT "localhost:9200/shakespeare?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "line_id": {"type": "integer"},
      "play_name": {"type": "keyword"},
      "speech_number": {"type": "integer"},
      "line_number": {"type": "text"},
      "speaker": {"type": "keyword"},
      "text_entry": {"type": "text"}
    }
  }
}'
Create Index with mapping API: PUT /<index>
Mapping
• Mapping is a schema definition
• Elasticsearch sets up an index with reasonable defaults
• But mostly you will need to provide particular mapping properties
• Field types
• string, byte, short, integer, long, float, double, boolean, date
• Field index; do you want the field to be indexed?
• analyzed / not_analyzed / no
• Field analyzer; define your tokenizer and token filter
• standard / whitespace / simple / english, etc.
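As a sketch, the create-index body for the Shakespeare corpus can also be assembled programmatically. This only builds the JSON (it does not call Elasticsearch); the field names come from the dataset structure above:

```python
import json

# Map each Shakespeare dataset field to an Elasticsearch field type.
field_types = {
    "line_id": "integer",
    "play_name": "keyword",
    "speech_number": "integer",
    "line_number": "text",
    "speaker": "keyword",
    "text_entry": "text",  # analyzed, so full-text queries work on it
}

mapping = {
    "settings": {"number_of_shards": 2, "number_of_replicas": 1},
    "mappings": {"properties": {f: {"type": t} for f, t in field_types.items()}},
}
print(json.dumps(mapping, indent=2))
```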
Loading data
curl "localhost:9200/shakespeare/_mapping"
Get the mapping
Insert data by loading the dataset in bulk
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare_6.0.json
curl "localhost:9200/_cat/indices"
List all the indexes
Search a phrase
curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "text_entry": "all that glitters is not gold"
    }
  }
}'
Update a single entry
curl -X POST "localhost:9200/shakespeare/_update/62427" -H 'Content-Type: application/json' -d'
{
  "doc": {
    "text_entry": "ALL THAT GLITTERS IS NOT GOLD;"
  }
}'
Update API: POST /<target>/_update/<_id>
Search
"Query Lite" search
curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text_entry": "glitters"
    }
  }
}'
curl "localhost:9200/shakespeare/_search?q=text_entry:glitters&pretty"
via URI search
"Query Lite" search
curl "localhost:9200/shakespeare/_search?q=text_entry:%22make%20choice%22~3&pretty"
• URL needs to be encoded
• Cryptic
• Security issue if exposed to end users
• Fragile and difficult to debug
It's better to send a JSON request
Queries and Filters
• Filters ask a yes/no question of your data
• Queries return data in terms of relevance
Use filters when you can — they are faster and cacheable
Some types of filters
Filter Description
term filter by exact values
terms match if any exact values in a list match
range find numbers or dates in a given range (gt, gte, lt, lte)
exists find documents where a field exists
missing find documents where a field is missing
bool combine filters with Boolean logic (must, must_not, should)
Some types of queries
Query Description
match_all returns all documents (default)
match searches analyzed results, such as full text search
multi_match run the same query on multiple fields
match_phrase matching exact phrases or word proximity
combined_fields matches over multiple fields as if they had been indexed into one combined field
query_string supports the Lucene query string syntax
bool works like a bool filter, but results are scored by relevance
Example of query and filter contexts
GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Search" }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [
        { "term": { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}
Pagination
curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "from": 10,
  "size": 3,
  "query": {
    "match_phrase": { "text_entry": "life is" }
  }
}'
• The from parameter defines the number of hits to skip, defaulting to 0.
• The size parameter is the maximum number of hits to return.
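Assuming pages are numbered from 1, the from/size arithmetic is just the following (illustrative helper, not an Elasticsearch API):

```python
def page_params(page, page_size):
    """Translate a 1-based page number into Elasticsearch from/size."""
    return {"from": (page - 1) * page_size, "size": page_size}

print(page_params(1, 3))  # {'from': 0, 'size': 3}
print(page_params(4, 3))  # {'from': 9, 'size': 3}
```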
Importing Data
Different ways of importing data
• Stand-alone scripts can submit bulk documents via the REST API
• Logstash and Beats can stream data from logs, S3, databases, and more
• AWS systems can stream in data via Lambda or Kinesis Firehose
• Kafka, Spark, and more have Elasticsearch integration add-ons
Write a script
• Write a script that generates a list of JSON entries
• Read in data from a data source
• Transform it into JSON bulk inserts
• Submit via HTTP / REST to your Elasticsearch cluster
• Bulk API:
curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'
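A script that emits this bulk format can be very small. The sketch below (helper name and sample documents chosen for illustration) turns a list of records into the action-line/source-line NDJSON pairs that _bulk expects:

```python
import json

def to_bulk_ndjson(index, docs, id_field):
    """Serialize docs into an Elasticsearch _bulk payload:
    one {"index": ...} action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

docs = [
    {"line_id": 4, "speaker": "KING HENRY IV",
     "text_entry": "So shaken as we are, so wan with care,"},
]
print(to_bulk_ndjson("shakespeare", docs, "line_id"))
```

The resulting string can be POSTed to localhost:9200/_bulk with Content-Type: application/x-ndjson.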
Importing with client libraries
• Elasticsearch client libraries are available for most languages
• Python has an elasticsearch package
Logstash
Logstash is a data loader
[Figure: Logstash pipelines between data sources (files, S3, Beats, Kafka) and destinations (Elasticsearch, AWS, Hadoop, MongoDB)]

Logstash
Logstash is a data loader
• Server-side data processing pipeline
• It can parse, transform, and filter data
• It can derive structure from unstructured data
• It can anonymize personal data or exclude it entirely
• It can do geo-location lookups
• It can scale across many nodes
• A huge list of filter plugins is available
Installing and configuring Logstash
sudo apt-get install openjdk-8-jre-headless
sudo apt-get update && sudo apt-get install logstash
sudo vi /etc/logstash/conf.d/apache-access.conf
• Import data from the Apache access_log into Elasticsearch
The grok filter plugin can parse log data and program output.
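A minimal sketch of what apache-access.conf might contain (the log file path is an assumption; COMBINEDAPACHELOG is a standard grok pattern shipped with Logstash):

```conf
input {
  file {
    # Assumed location of the Apache access log.
    path => "/home/student/access_log"
    start_position => "beginning"
  }
}
filter {
  grok {
    # Parse each line with the standard combined Apache log pattern.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    # Use the log's own timestamp as the event time.
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```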
Running Logstash
cd /usr/share/logstash/
sudo bin/logstash --path.config /etc/logstash/conf.d/apache-access.conf
• Run Logstash as specified in the config file
• Check the indices of elasticsearch for the Logstash created data
curl -XGET "127.0.0.1:9200/_cat/indices?v"
Importing CSV file with Logstash
filter {
  csv {
    separator => ","
    skip_header => "true"
    columns => ["movieId", "title", "genres"]
  }
}
• CSV filter plugin
Kibana
Installing Kibana
sudo apt install kibana
sudo vi /etc/kibana/kibana.yml
(change server.host to 0.0.0.0)
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable kibana.service
sudo /bin/systemctl start kibana.service
• Kibana's port number is 5601
• e.g., http://10.80.34.86:5601