The Enterprise Technology Driven, Marketing Services Company

Aasif Bagdadi, Director of Engineering

https://www.linkedin.com/in/aasifbagdadi


Unique Data Asset

Automotive transactional data on 61% of US households

[Chart: Vehicles, Customers, and Households for Dealer, Aftermarket, and TOTAL; axis from 0 to 100,000,000; callout: 81 Million Households]

Dealer Group

Dealer A

Household (Owner, Driver)

Vehicles (Make, Model, Year, Age, Mileage)

Transactions (Purchases, Services)

Data Mining (Driving Habits, Segmentations, Loyalty, Lifetime Value)

Marketing

List Manager - Entity Relationship

▪ Customer/Household: Name, Address, Distance, Email, Phone, Wireless, NCOA, Compliance, Federal DNC, EBR
▪ Vehicle Profile: VIN, Make, Model, Year, Sale Date, Sale Amount, Last Observed Mileage, Lease, Loan, Warranty, Extended Warranty, Pre Paid Maintenance, AMPD
▪ Deal: Owner, Purchase Date, Purchase Amount, Sales Person, Lease, Loan, Warranty, Odometer
▪ Service: Service Date, Mileage, Service Advisor, Warranty Pay, Internal Pay, Customer Pay, Parts, Labor, Services Performed, Services Declined, Discounts
▪ Campaigns: Customer, Date, Communication, Channel, Offers
▪ Responders: Response Date, Transactions, Days to Respond
▪ Forecast Communications: Date, Communications
▪ Ownership: Store, Store Group, OEM, Data Boundary
▪ Uploaded List: Conquest list or any list acquired from external sources

Purchase

List Manager: Complex Search

Customer / Household

Vehicle

Campaigns

Service

Future Communications

List Manager

▪ Find customers that are within 50 miles
▪ Find customers that have bought { Make } in the last { Y } years
▪ That have serviced their vehicles between { M1 } and { M2 } months in the past
▪ That had the following service performed: {Opcode1} or {Opcode2}
▪ That had the following service declined: {ASR1} or {ASR2}
▪ That had been mailed between the dates {D1} and {D2}
▪ And have not yet responded
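As a rough illustration of how such criteria combine, the sketch below ANDs a few of them over in-memory records. The `Customer` fields, the `matches` helper, and the sample data are all hypothetical; none of the List Manager versions is implemented this way.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Customer:
    # Hypothetical fields, loosely based on the entities listed earlier.
    name: str
    distance_miles: float
    make: str
    purchase_date: date
    mailed_on: Optional[date]
    responded: bool


def matches(c, max_miles, make, bought_since, mailed_from, mailed_to):
    """Return True only when every criterion holds (criteria are ANDed)."""
    return (
        c.distance_miles <= max_miles
        and c.make == make
        and c.purchase_date >= bought_since
        and c.mailed_on is not None
        and mailed_from <= c.mailed_on <= mailed_to
        and not c.responded
    )


# Made-up sample records for illustration only.
customers = [
    Customer("Ann", 12.0, "Honda", date(2013, 5, 1), date(2014, 1, 10), False),
    Customer("Bob", 80.0, "Honda", date(2013, 6, 1), date(2014, 1, 12), False),
    Customer("Cy", 20.0, "Honda", date(2013, 7, 1), date(2014, 1, 15), True),
]

selected = [
    c.name
    for c in customers
    if matches(c, 50, "Honda", date(2012, 1, 1), date(2014, 1, 1), date(2014, 1, 31))
]
print(selected)
```

Bob is excluded by distance and Cy because he has already responded, so only Ann remains.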

Adhoc Search

List Manager: Advanced Search

List Manager V 1.0

▪ 2005 – 2006 time frame
▪ SQL Server based
▪ Dynamic SQL
▪ Implementation:
  · Table-valued function for each entity
  · Batch processes
  · Requests are queued
  · A job applies all the search criteria
▪ Pros:
  · Simple to use
  · Easy to build
  · Data is available to search in near real time
▪ Cons:
  · Slow: it took hours just to get a count
  · Did not provide results in real time
  · No caching

Using SQL

List Manager v2.0

▪ 2010 time frame
▪ Uses SSAS (SQL Server 2008 R2)
▪ Apply the search criteria and get the counts extremely fast
▪ Lists can be batched
▪ Implementation:
  · SSAS cubes (MOLAP)
  · Dynamic MDX
  · Use MDX to query the count
  · Use MDX to get the keys
  · Use dimensions / attributes to filter
  · Join with SQL on the keys to get the list details (name, address, etc.)
▪ Pros:
  · Extremely fast
  · Sub-second response on counts
  · MDX queries are cached
▪ Cons:
  · Complex MDX
  · Cube refresh / partition reprocessing
  · Dimension size constraint of 4 GB (SQL Server 2012 has options to overcome this limit)
  · Cube changes require the entire cube to be offline
  · Weak scale-out options

Using Cubes

List Manager 3.0

▪ 2014 (currently in development)
▪ Uses Elasticsearch in the cloud
▪ API layer written in Node.js
▪ Front end (C#, MVC, jQuery, JSON)
▪ Change data (CDC, selected columns and tables, multiple databases)
▪ Data pump (C#, multithreaded, Windows service, compressed JSON, bulk API)
▪ Pros:
  · Good scale-out
  · High availability
  · Optimized for search
  · Heavy caching (filters)
  · Read-only replica
  · Document based
  · Allows better control of incremental data changes
  · Addresses the Volume, Velocity & Variety of data (a.k.a. Big Data)
▪ Cons:
  · Technology is still emerging

USING ELASTIC SEARCH

Big Data

Define Big Data

Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Said differently, the volume, velocity or variety of data is too great.

Elasticsearch
• Real time
• Search & analytics engine
• Distributed
• Scales massively
• High availability
• RESTful API
• JSON over HTTP
• Schema free
• Multi-tenancy
• Open source
• Lucene based

API

curl -XGET localhost:9200/?pretty

Request parts: Verb (GET, PUT …), Node, Port, Path

{
  "name" : "Exploding Man",
  "tagline" : "You Know, for Search",
  "ok" : true,
  "status" : 200,
  "version" : {
    "number" : "0.90.7",
    "snapshot_build" : false
  }
}


Input Data

PUT /myapp/tweet/1 -d '
{
  "tweet": "I think #elasticsearch is AWESOME",
  "nick": "@clintongormley",
  "name": "Clinton Gormley",
  "date": "2013-06-03",
  "rt": 5,
  "loc": {
    "lat": 13.4,
    "lon": 52.5
  }
}'

PUT /index/type/id

{
  "_index": "myapp",
  "_type": "tweet",
  "_id": "1",
  "_version": 1,
  "ok": true
}

Retrieve Data
• GET /myapp/tweet/1

{
  "_index": "myapp",
  "_type": "tweet",
  "_id": "1",
  "_version": 1,
  "exists": true,
  "_source": { ...OUR TWEET... }
}

Update Data
• PUT /myapp/tweet/1 -d '
{
  "tweet": "I know #elasticsearch is AWESOME",
  "nick": "@clintongormley",
  "name": "Clinton Gormley",
  "date": "2013-06-03",
  "rt": 5,
  "loc": {
    "lat": 13.4,
    "lon": 52.5
  }
}'

{
  "_index": "myapp",
  "_type": "tweet",
  "_id": "1",
  "_version": 2,
  "ok": true
}

# atomic delete and put

Delete Data
• DELETE /myapp/tweet/1

{
  "_index": "myapp",
  "_type": "tweet",
  "_id": "1",
  "_version": 3,
  "ok": true,
  "found": true
}

RDBMS lingo

MySQL/Oracle/SQL Server => Databases => Tables => Columns/Rows
Elasticsearch => Indices => Types => Documents with Properties

• An Elasticsearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).

Glossary
• Node: A node is a running instance of elasticsearch which belongs to a cluster.
• Shard: A shard is a single Lucene instance. It is a low-level "worker" unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards.
• Primary Shard: Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.
• Replica Shard: Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes: a) provide failover, b) increase performance.

Glossary
• Index: An index is like a database in a relational database.
• Type: A type is like a table in a relational database. Each type has a list of fields that can be specified for documents of that type.
• Document: A JSON document which is stored in elasticsearch. It is like a row in a table in a relational database.
• Field: A document contains a list of fields, or key-value pairs. The value can be a simple (scalar) value (e.g. a string, integer, date), or a nested structure like an array or an object. A field is similar to a column in a table in a relational database.
• Mapping: A mapping is like a schema definition in a relational database. The mapping defines how each field in the document is analyzed.
• Routing: When you index a document, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default the routing value is derived from the ID of the document.
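The routing rule can be sketched as a hash of the routing value taken modulo the number of primary shards. In the sketch below, `pick_shard` is a hypothetical helper and `zlib.crc32` is only a stand-in for whatever hash function Elasticsearch uses internally.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Map a routing value (by default the document ID) to a shard number.

    crc32 is an assumption standing in for the engine's real hash function.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# The same ID always lands on the same shard while the shard count is unchanged.
ids = [str(i) for i in range(1, 101)]
placement_3 = {doc_id: pick_shard(doc_id, 3) for doc_id in ids}

# With a different shard count, many IDs would map to a different shard.
placement_4 = {doc_id: pick_shard(doc_id, 4) for doc_id in ids}
moved = [doc_id for doc_id in ids if placement_3[doc_id] != placement_4[doc_id]]
print(len(moved), "of", len(ids), "documents would be looked up on a different shard")
```

Because the shard is derived from the shard count, changing the number of primary shards would send lookups for existing documents to the wrong shard.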

Core Field Types
• Strings: string
• Datetimes: date
• Whole numbers: byte, short, integer, long
• Floats: float, double
• Booleans: boolean
• Objects: object
• Also: multi_field, ip, geo_point, geo_shape

Auto Detection of Field
• "foo bar" → string
• "2013-01-01" → date
• 10 → byte, short, integer, long
• 10.0 → float, double
• true → boolean
• { foo: "bar" } → object
• ["foo","bar"] → no special mapping; any field can have multiple values
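A minimal sketch of this kind of detection, written against parsed JSON values. `detect_field_type` is a hypothetical helper that returns the broad families from the list above, not the concrete type Elasticsearch would pick.

```python
from datetime import date


def detect_field_type(value):
    """Guess a broad field family from a parsed JSON value (illustrative only)."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "whole number"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # raises ValueError for non-date strings
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays need no special mapping: the type comes from the elements.
        return detect_field_type(value[0]) if value else "unknown"
    return "unknown"


for sample in ["foo bar", "2013-01-01", 10, 10.0, True, {"foo": "bar"}, ["foo", "bar"]]:
    print(repr(sample), "->", detect_field_type(sample))
```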

Some more Glossary
• Term: A term is an exact value that is indexed in elasticsearch. The terms foo, Foo, FOO are NOT equivalent.
• Text: Text (or full text) is ordinary unstructured text, such as this paragraph. By default, text will be analyzed into terms, which is what is actually stored in the index. Text fields need to be analyzed at index time in order to be searchable as full text, and keywords in full text queries must be analyzed at search time to produce (and search for) the same terms that were generated at index time.
• Analysis: Analysis is the process of converting full text to terms. Depending on which analyzer is used, these phrases: FOO BAR, Foo-Bar, foo,bar will probably all result in the terms foo and bar. These terms are what is actually stored in the index.
• Tokenizer: Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
• Facets: They enable you to calculate and summarize data about the current query on the fly. They can be used for all sorts of tasks such as dynamic counting of result values or even distribution histograms. Facets only perform their calculations one level deep, and they can't be easily combined.
• Aggregations: Aggregations are similar to facets in many ways, and overcome the limitations of facets. Indeed, aggregations are meant to eventually replace facets altogether. Facets should be considered deprecated and will likely be removed in a future major release. One of the major limitations of facets is that they cannot be nested: you can't have facets of facets. The ability to nest aggregations therefore brings a great deal of power that was missing in facets.
• The two broad families of aggregations are metrics aggregations and bucket aggregations. Metrics aggregations calculate some value (like an average) over a set of documents, and bucket aggregations group documents into buckets.
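The two families, and the nesting of a metric inside a bucket, can be sketched in plain Python. The sample tweets and the helper names `terms_buckets` and `avg_metric` are made up for illustration and are not Elasticsearch APIs.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample documents.
tweets = [
    {"nick": "@mary", "rt": 5},
    {"nick": "@mary", "rt": 1},
    {"nick": "@john", "rt": 4},
]


def terms_buckets(docs, field):
    """Bucket aggregation: group documents by the value of one field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def avg_metric(docs, field):
    """Metrics aggregation: compute one value over a set of documents."""
    return mean(doc[field] for doc in docs)


# Nesting: the metric runs once per bucket, which facets could not do.
result = {
    key: {"doc_count": len(docs), "avg_rt": avg_metric(docs, "rt")}
    for key, docs in terms_buckets(tweets, "nick").items()
}
print(result)
```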

http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-aggregations.html#_metrics_aggregations

http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-aggregations.html#_bucket_aggregations

• Schemaless, document oriented
  • No need to configure a schema up front
  • No need for slow ALTER TABLE-like operations
• Define a mapping (schema) to customize the indexing process
  • Require fields to be of a certain type
  • Mark text fields that should not be analyzed

Distributed & Highly Available
• Multiple nodes running in a cluster
  • Acting as a single service
  • Nodes in the cluster either store data or just help speed up search queries
• Sharding
  • Indices are sharded (the number of shards is configurable)
  • Each shard can have zero or more replicas
  • Replicas on different servers for failover
• Master
  • Automatic master detection + failover
  • Responsible for distribution/balancing of shards

A Single-Node Cluster with an Index
• All 3 primary shards are allocated to Node 1
• No replica shards are allocated, because there is no second node to hold them
• A single node means a single point of failure
• Cluster health: Yellow

Add Failover
• Add a second node to the cluster by configuring the same cluster name
• 3 replica shards have been allocated
• Cluster health: Green
• Now 6 shards, so there is redundancy

Scale horizontally
• A 3-node cluster
• One shard each from Node 1 and Node 2 has moved to Node 3
• Better performance, as each node's hardware resources (CPU, RAM, I/O) are shared among fewer shards

Scale some more
• More nodes can be added
• More replicas can be added
• This allows faster searches
• Allows better redundancy
• However, the number of primary shards is fixed at the moment an index is created.
• Effectively, the maximum amount of data that can be stored in the index is defined by this number.
• Is this a limitation …..

Coping with Node Failure
• Kill the master node
• A new master is elected (Node 2)
• Primary shards 1 and 2 were lost
• Cluster health: Red
• Nodes 2 and 3 have replicas of these shards, which are now promoted to primaries
• Cluster health: Yellow
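The three health colours follow a simple rule: red when any primary shard is unassigned, yellow when all primaries are assigned but some replica is not, and green when everything is assigned. A minimal sketch, where `cluster_health` and the shard dictionaries are illustrative and not the cluster health API:

```python
def cluster_health(shards):
    """shards: list of dicts with boolean keys 'primary' and 'assigned'."""
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"     # at least one primary is missing: some data is unavailable
    if any(not s["assigned"] for s in shards):
        return "yellow"  # all primaries are live, but some replica has no home
    return "green"       # every primary and replica is assigned


def make_shards(primaries_assigned, replicas_assigned):
    """Build a shard list from two lists of booleans (hypothetical helper)."""
    return (
        [{"primary": True, "assigned": a} for a in primaries_assigned]
        + [{"primary": False, "assigned": a} for a in replicas_assigned]
    )


single_node = make_shards([True, True, True], [False, False, False])
two_nodes = make_shards([True, True, True], [True, True, True])
primaries_lost = make_shards([False, False, True], [True, True, True])

for name, shards in [("single node", single_node),
                     ("two nodes", two_nodes),
                     ("primaries lost", primaries_lost)]:
    print(name, "->", cluster_health(shards))
```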

Beauty of Elasticsearch
• In Elasticsearch, all data in every field is indexed by default. That is, every field has a dedicated inverted index for fast retrieval. And, unlike most other databases, it can use all of those inverted indices in the same query, to return results at breathtaking speed.

Document
• A document is the top-level or root object, which is serialized into JSON and stored in Elasticsearch under a unique ID.
• A field or property can be a string, a number, a boolean, another object, an array of values, or some other specialized type such as a string representing a date or an object representing a geolocation.

Document metadata
• _index: Where the document lives
• _type: The class of object that the document represents
• _id: The unique identifier for the document
  • Elasticsearch will auto-generate an ID if none is specified

Creating a Document

Pagination

size = number of results
from = number of results to skip

GET /_search?size=5&from=0
GET /_search?size=5&from=5
GET /_search?size=5&from=10
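The three requests above follow from multiplying a zero-based page number by the page size. A small sketch, where `page_params` and `search_url` are hypothetical helpers and only the `size` and `from` parameter names come from the slide:

```python
from urllib.parse import urlencode


def page_params(page: int, size: int = 5) -> dict:
    """Translate a zero-based page number into size/from parameters."""
    return {"size": size, "from": page * size}


def search_url(page: int, size: int = 5) -> str:
    """Build the relative search URL for one page of results."""
    return "/_search?" + urlencode(page_params(page, size))


for page in range(3):
    print("GET", search_url(page))
```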

Search (basic)
• GET /_search?q=mary
  → user named "Mary"
  → tweets by "Mary"
  → tweet mentioning "@mary"

• _all field
  • String value from all fields

• GET /_search?q=2013 -> 12 results
• GET /_search?q=2013-06-03 -> 12 results!!
• GET /_search?q=date:2013-06-03 -> 1 result

Mapping (field definitions)

{
  "tweet" : {
    "properties" : {
      "tweet" : { "type" : "string" },
      "name" : { "type" : "string" },
      "nick" : { "type" : "string" },
      "date" : { "type" : "date" },
      "rt" : { "type" : "long" },
      "loc" : {
        "type" : "object",
        "properties" : {
          "lat" : { "type" : "double" },
          "lon" : { "type" : "double" }
        }
      }
    }
  }
}

GET /myapp/tweet/_mapping

date = type:date
_all = type:string

date = 2013-06-03
_all = 2013,06,03

Exact Value vs Full Text

Exact values: 104.5, 2013-01-01, true, Foo, foo

Full text: The quick brown fox jumped over the lazy dog

Inverted Index
→ separate words / terms
→ sort unique terms
→ list docs containing terms
→ normalize terms

Document 1 terms: The,brown,dog,fox,jumped,lazy,over,quick,the
Document 2 terms: Quick,brown,dogs,foxes,in,lazy,leap,over,summer
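The steps above can be sketched in a few lines of Python. The second sentence is an assumption reconstructed from the second term list, and `build_inverted_index` is a hypothetical helper; Lucene's actual data structures are far more elaborate.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    # Assumed sentence, reconstructed from the second list of terms.
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs, normalize=None):
    """Map each unique term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():          # separate words / terms
            if normalize is not None:
                term = normalize(term)     # normalize terms (optional)
            index[term].add(doc_id)        # list docs containing the term
    return {term: sorted(ids) for term, ids in sorted(index.items())}


raw_index = build_inverted_index(docs)
normalized_index = build_inverted_index(docs, normalize=str.lower)

# Without normalization "Quick" and "quick" are different terms.
print(raw_index["Quick"], raw_index["quick"])
# After lowercasing, a search for "quick" finds both documents.
print(normalized_index["quick"])
```

Lowercasing alone does not make "fox" match "foxes"; that takes stemming, which is part of analysis.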

Analysis
• The index analysis module acts as a configurable registry of analyzers. Analyzers are used both to break analyzed fields into terms when a document is indexed and to process query strings. An analyzer maps to the Lucene Analyzer.
• Analyzer = tokenizer + token filters

Standard Analyzer

"The Quick Brown Fox jumped over the Lazy Dog!"

Standard tokenizer: The,Quick,Brown,Fox,jumped,over,the,Lazy,Dog
Lowercase filter: the,quick,brown,fox,jumped,over,the,lazy,dog
Stopwords filter: quick,brown,fox,jumped,over,lazy,dog
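The tokenizer-plus-filters pipeline can be imitated with three small functions. This is a rough sketch: the regular expression and the short `STOPWORDS` set are assumptions, and the real standard tokenizer follows Unicode text segmentation rules.

```python
import re

# Assumed, deliberately short stopword list for illustration.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def standard_tokenizer(text):
    """Split on anything that is not a word character (a simplification)."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Analyzer = tokenizer + token filters, applied in order."""
    return stop_filter(lowercase_filter(standard_tokenizer(text)))


sentence = "The Quick Brown Fox jumped over the Lazy Dog!"
print(standard_tokenizer(sentence))
print(lowercase_filter(standard_tokenizer(sentence)))
print(analyze(sentence))
```

The lowercase filter runs before the stop filter so that "The" and "the" are both removed.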

English Analyzer

Standard tokenizer, lowercase filter
English stemmer: the,quick,brown,fox,jump,over,the,lazi,dog
English stopwords: quick,brown,fox,jump,over,lazi,dog

Filters vs Queries

Filters
• Exact matching
• Binary yes/no
• Fast
• Cacheable

Queries
• Full text
• Relevance-scored text search
• Heavier
• Not cacheable

Queries

As a general rule, queries should be used instead of filters:
• for full text search
• where the result depends on a relevance score

Some Query Types
• Match: specifies a field to search
  • _all is also a field
• Match (boolean)
  • The default match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator flag can be set to or or and to control the boolean clauses (defaults to or). The minimum number of should clauses to match can be set using the minimum_should_match parameter.
• Match (phrase)
  • The match_phrase query analyzes the text and creates a phrase query out of the analyzed text

{
  "match" : {
    "message" : "this is a test"
  }
}

{
  "match" : {
    "message" : {
      "query" : "this is a test",
      "operator" : "and"
    }
  }
}

{
  "match_phrase" : {
    "message" : "this is a test"
  }
}
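To make the difference between the three forms concrete, here is a toy matcher over analyzed terms. `analyze`, `match`, and `match_phrase` are simplified stand-ins that ignore relevance scoring entirely; they only show which documents would qualify.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def match(doc_text, query_text, operator="or"):
    """Boolean match: any query term ("or") or every query term ("and")."""
    doc_terms = set(analyze(doc_text))
    hits = [term in doc_terms for term in analyze(query_text)]
    return all(hits) if operator == "and" else any(hits)


def match_phrase(doc_text, query_text):
    """Phrase match: the query terms must appear adjacent and in order."""
    doc_terms, query_terms = analyze(doc_text), analyze(query_text)
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))


message = "this is only a test"
print(match(message, "this is a test"))                  # some terms present
print(match(message, "this is a test", operator="and"))  # all terms present
print(match_phrase(message, "this is a test"))           # "only" breaks the phrase
```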

Multi Match
• Multi match query
  • Multiple fields to search
  • Fields can be identified using wildcards
  • Fields can be boosted

{
  "multi_match" : {
    "query" : "Will Smith",
    "fields" : [ "title", "*_name" ]
  }
}

Bool Query
• A query that matches documents matching boolean combinations of other queries.

{
  "bool" : {
    "must" : {
      "term" : { "user" : "kimchy" }
    },
    "must_not" : {
      "range" : {
        "age" : { "from" : 10, "to" : 20 }
      }
    },
    "should" : [
      { "term" : { "tag" : "wow" } },
      { "term" : { "tag" : "elasticsearch" } }
    ],
    "minimum_should_match" : 1,
    "boost" : 1.0
  }
}
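The way the must, must_not, and should clauses combine can be mimicked with plain predicates. This sketch evaluates the query above against in-memory dictionaries; `term`, `range_`, and `bool_query` are hypothetical helpers, scoring and boost are ignored, and the range is treated as inclusive on both ends.

```python
def term(field, value):
    """Exact-value clause; for list-valued fields, any element may match."""
    def check(doc):
        actual = doc.get(field)
        return value in actual if isinstance(actual, list) else actual == value
    return check


def range_(field, from_, to):
    """Range clause, assumed inclusive on both ends."""
    return lambda doc: field in doc and from_ <= doc[field] <= to


def bool_query(must=(), must_not=(), should=(), minimum_should_match=0):
    def check(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        matched = sum(1 for clause in should if clause(doc))
        return matched >= minimum_should_match
    return check


query = bool_query(
    must=[term("user", "kimchy")],
    must_not=[range_("age", 10, 20)],
    should=[term("tag", "wow"), term("tag", "elasticsearch")],
    minimum_should_match=1,
)

print(query({"user": "kimchy", "age": 30, "tag": ["wow"]}))  # passes every clause
print(query({"user": "kimchy", "age": 15, "tag": ["wow"]}))  # excluded by must_not
```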

Filters
• For binary yes/no searches
• For queries with exact values

• Filters can be great candidates for caching ( _cache

Aggregations
• Allow real-time data analytics
• Better than facets, which will be deprecated
• Build analytic information over a set of documents. The context of the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed query/filters of the search request).
• Can be nested
• Two families of aggregation
  • Bucketing
  • Metric

Aggregation Types
• Min / Max / Sum / Avg aggregation
• Stats / Extended stats aggregation
• Value count / Percentile / Cardinality aggregation
• Filter / Missing aggregation
• Nested / Reverse nested aggregation
• Terms aggregation
• Range / Date range aggregation
• IPv4 aggregation
• Histogram aggregation
• Geo distance aggregation
