Post on 25-Feb-2016

Page 1: The Enterprise Technology Driven, Marketing Services Company

The Enterprise Technology Driven, Marketing Services Company

Aasif Bagdadi, Director of Engineering

https://www.linkedin.com/in/aasifbagdadi


Page 2: The Enterprise Technology Driven, Marketing Services Company

Unique Data Asset

Automotive transactional data on 61% of the US Households

[Chart labels: Dealer, Aftermarket, TOTAL; Vehicles, Customers, Households; axis from 0 to 100,000,000; callout: 81 Million Households]

Page 3: The Enterprise Technology Driven, Marketing Services Company

Dealer Group

Dealer A

Household (Owner, Driver)

Vehicles (Make, Model, Year, Age, Mileage)

Transactions (Purchases, Services)

Data Mining (Driving Habits, Segmentation, Loyalty, Lifetime Value)

Marketing

Page 4: The Enterprise Technology Driven, Marketing Services Company

List Manager - Entity Relationship

▪ Customer/Household: Name, Address, Distance, Email, Phone, Wireless, NCOA, Compliance, Federal DNC, EBR
▪ Vehicle Profile: VIN, Make, Model, Year, Sale Date, Sale Amount, Last Observed Mileage, Lease, Loan, Warranty, Extended Warranty, Pre Paid Maintenance, AMPD
▪ Deal: Owner, Purchase Date, Purchase Amount, Sales Person, Lease, Loan, Warranty, Odometer
▪ Service: Service Date, Mileage, Service Advisor, Warranty Pay, Internal Pay, Customer Pay, Parts, Labor, Services Performed, Services Declined, Discounts
▪ Campaigns: Customer, Date, Communication, Channel, Offers
▪ Responders: Response Date, Transactions, Days to Respond
▪ Forecast Communications: Date, Communications
▪ Ownership: Store, Store Group, OEM, Data Boundary
▪ Uploaded List: Conquest List or any list acquired from external sources

Page 5: The Enterprise Technology Driven, Marketing Services Company

Purchase

List Manager: Complex Search

Customer / Household

Vehicle

Campaigns

Service

Future Communications

Page 6: The Enterprise Technology Driven, Marketing Services Company

List Manager

▪ Find Customers that are within 50 miles
▪ Find Customers that have bought { Make } in the last { Y } years
▪ That have serviced their vehicles between { M1 } and { M2 } months in the past
▪ That had the following service performed: {Opcode1} or {Opcode2}
▪ That had the following service declined: {ASR1} or {ASR2}
▪ That had been mailed between {D1} and {D2} dates
▪ And have not yet responded

Adhoc Search
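The criteria above combine with a logical AND. A minimal sketch of that idea in plain Python, assuming hypothetical record fields (`distance_miles`, `make`, `declined`, `mailed`, `responded`) that are not taken from the actual List Manager schema:

```python
from datetime import date

# Hypothetical customer records; every field name here is an assumption.
customers = [
    {"name": "A", "distance_miles": 12, "make": "Honda",
     "declined": {"ASR1"}, "mailed": date(2015, 11, 5), "responded": False},
    {"name": "B", "distance_miles": 70, "make": "Honda",
     "declined": {"ASR1"}, "mailed": date(2015, 11, 5), "responded": False},
    {"name": "C", "distance_miles": 8, "make": "Honda",
     "declined": {"ASR2"}, "mailed": date(2015, 11, 9), "responded": True},
]


def matches(c, max_miles, make, declined_any, mailed_from, mailed_to):
    """Return True when a record satisfies every criterion (logical AND)."""
    return (
        c["distance_miles"] <= max_miles
        and c["make"] == make
        and bool(c["declined"] & declined_any)      # any of the declined codes
        and mailed_from <= c["mailed"] <= mailed_to
        and not c["responded"]
    )


selected = [
    c["name"] for c in customers
    if matches(c, 50, "Honda", {"ASR1", "ASR2"}, date(2015, 11, 1), date(2015, 11, 30))
]
```

Only customer A passes every test: B is too far away and C has already responded.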

Page 7: The Enterprise Technology Driven, Marketing Services Company

List Manager: Advanced Search

Page 8: The Enterprise Technology Driven, Marketing Services Company

List Manager V 1.0

▪ 2005 – 2006 time frame
▪ SQL Server based
▪ Dynamic SQL
▪ Implementation:
  · Table-valued function for each entity
  · Batch processes
  · Requests are queued
  · A job applies all the search criteria
▪ Pros:
  · Simple to use
  · Easy to build
  · Data is available to search in almost real time
▪ Cons:
  · Slow. Took hours just to get a count
  · Did not provide results in real time
  · No caching

Using SQL

Page 9: The Enterprise Technology Driven, Marketing Services Company

List Manager v2.0

▪ 2010 time frame
▪ Uses SSAS (SQL Server 2008 R2)
▪ Applies the search criteria and gets the counts extremely fast
▪ List can be batched
▪ Implementation:
  · SSAS cubes (MOLAP)
  · Dynamic MDX
  · Use MDX to query the count
  · Use MDX to get the keys
  · Use dimensions / attributes to filter
  · Mash with SQL on keys to get the list details (name, address, etc.)
▪ Pros:
  · Extremely fast
  · Sub-second response on counts
  · MDX queries are cached
▪ Cons:
  · Complex MDX
  · Cube refresh / partition reprocessing
  · Dimension size constraint of 4 GB (SQL Server 2012 has options to overcome this limit)
  · Cube changes require the entire cube to be offline
  · Weak scale-out options

Using Cubes

Page 10: The Enterprise Technology Driven, Marketing Services Company

List Manager 3.0

▪ 2014 (currently in development)
▪ Uses Elasticsearch in the cloud
▪ Layer of API written in Node.js
▪ Front end (C#, MVC, jQuery, JSON)
▪ Change Data (CDC, selected columns and tables, multiple databases)
▪ Data Pump (C#, multi-threaded, Windows service, compressed JSON, bulk API)
▪ Pros:
  · Good scale-out
  · High availability
  · Optimized for search
  · High caching (filters)
  · Read-only replica
  · Document based
  · Allows better control of incremental data changes
  · Solves Volume, Velocity & Variety of data (a.k.a. Big Data)
▪ Cons:
  · Technology is still emerging

Page 11: The Enterprise Technology Driven, Marketing Services Company

USING ELASTIC SEARCH

Big Data

Page 12: The Enterprise Technology Driven, Marketing Services Company

Define Big Data

Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Said differently, the volume, velocity or variety of data is too great.

Page 13: The Enterprise Technology Driven, Marketing Services Company

Elasticsearch
• Real time
• Search & analytics engine
• Distributed
• Scales massively
• High availability
• RESTful API
• JSON over HTTP
• Schema free
• Multi-tenancy
• Open source
• Lucene based

Page 14: The Enterprise Technology Driven, Marketing Services Company

API

curl -XGET localhost:9200/?pretty

Request parts: Verb (GET, PUT …), Node, Port, Path

{
  "name" : "Exploding Man",
  "tagline" : "You Know, for Search",
  "ok" : true,
  "status" : 200,
  "version" : {
    "number" : "0.90.7",
    "snapshot_build" : false
  }
}
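The four request parts (verb, node, port, path) can be assembled programmatically. A minimal sketch using only the Python standard library; `build_request` is a hypothetical helper name, and the request is only constructed here, not sent:

```python
from urllib.parse import urlunsplit
from urllib.request import Request


def build_request(verb, node, port, path, query=""):
    """Assemble verb + node + port + path into an HTTP request object."""
    url = urlunsplit(("http", f"{node}:{port}", path, query, ""))
    return Request(url, method=verb)


# Same call as the curl example: GET localhost:9200/?pretty
req = build_request("GET", "localhost", 9200, "/", "pretty")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it, assuming a node is actually listening on that host and port.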

Page 15: The Enterprise Technology Driven, Marketing Services Company

Input Data

PUT /index/type/id

PUT /myapp/tweet/1 -d '
{
  "tweet": "I think #elasticsearch is AWESOME",
  "nick": "@clintongormley",
  "name": "Clinton Gormley",
  "date": "2013-06-03",
  "rt": 5,
  "loc": {
    "lat": 13.4,
    "lon": 52.5
  }
}'

Response:
{ "_index": "myapp", "_type": "tweet", "_id": "1", "_version": 1, "ok": true }

Page 16: The Enterprise Technology Driven, Marketing Services Company

Retrieve Data
• GET /myapp/tweet/1

{
  "_index": "myapp",
  "_type": "tweet",
  "_id": "1",
  "_version": 1,
  "exists": true,
  "_source": { ...OUR TWEET... }
}

Page 17: The Enterprise Technology Driven, Marketing Services Company

Update Data
• PUT /myapp/tweet/1 -d '
{
  "tweet": "I know #elasticsearch is AWESOME",
  "nick": "@clintongormley",
  "name": "Clinton Gormley",
  "date": "2013-06-03",
  "rt": 5,
  "loc": {
    "lat": 13.4,
    "lon": 52.5
  }
}'

{ "_index": "myapp", "_type": "tweet", "_id": "1", "_version": 2, "ok": true }

# atomic delete and put

Page 18: The Enterprise Technology Driven, Marketing Services Company

Delete Data
• DELETE /myapp/tweet/1

{ "_index": "myapp", "_type": "tweet", "_id": "1", "_version": 3, "ok": true, "found": true }
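The `_version` values in the index, update and delete responses (1, 2, 3) follow a simple counter. A toy in-memory model of that behaviour, written as a sketch only; `TinyIndex` is a hypothetical class and not part of any Elasticsearch client:

```python
class TinyIndex:
    """In-memory stand-in that mimics the _version counter seen in the responses."""

    def __init__(self):
        self._docs = {}      # (type, id) -> source document
        self._versions = {}  # (type, id) -> last version number

    def put(self, doc_type, doc_id, source):
        key = (doc_type, doc_id)
        self._versions[key] = self._versions.get(key, 0) + 1
        self._docs[key] = dict(source)  # the whole document is replaced
        return {"_type": doc_type, "_id": doc_id, "_version": self._versions[key]}

    def get(self, doc_type, doc_id):
        key = (doc_type, doc_id)
        if key not in self._docs:
            return {"_type": doc_type, "_id": doc_id, "exists": False}
        return {"_type": doc_type, "_id": doc_id, "exists": True,
                "_version": self._versions[key], "_source": self._docs[key]}

    def delete(self, doc_type, doc_id):
        key = (doc_type, doc_id)
        found = self._docs.pop(key, None) is not None
        if found:
            self._versions[key] += 1  # a delete also bumps the version
        return {"_type": doc_type, "_id": doc_id,
                "_version": self._versions.get(key), "found": found}


index = TinyIndex()
created = index.put("tweet", "1", {"tweet": "I think #elasticsearch is AWESOME"})
updated = index.put("tweet", "1", {"tweet": "I know #elasticsearch is AWESOME"})
deleted = index.delete("tweet", "1")
```

An update is modelled as a full replacement of the stored document, which matches the "atomic delete and put" note on the update slide.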

Page 19: The Enterprise Technology Driven, Marketing Services Company

RDBMS lingo

MySQL/Oracle/SQL Server => Databases => Tables => Columns/Rows
Elasticsearch => Indices => Types => Documents with Properties

• An Elasticsearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).

Page 20: The Enterprise Technology Driven, Marketing Services Company

Glossary
• Node: A node is a running instance of elasticsearch which belongs to a cluster.
• Shard: A shard is a single Lucene instance. It is a low-level "worker" unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards.
• Primary Shard: Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.
• Replica Shard: Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes: a) improve failover, b) increase performance.

Page 21: The Enterprise Technology Driven, Marketing Services Company

Glossary
• Index: An index is like a database in a relational database.
• Type: A type is like a table in a relational database. Each type has a list of fields that can be specified for documents of that type.
• Document: A JSON document which is stored in elasticsearch. It is like a row in a table in a relational database.
• Field: A document contains a list of fields, or key-value pairs. The value can be a simple (scalar) value (e.g. a string, integer, date), or a nested structure like an array or an object. A field is similar to a column in a table in a relational database.
• Mapping: A mapping is like a schema definition in a relational database. The mapping defines how each field in the document is analyzed.
• Routing: When you index a document, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default the routing value is derived from the ID of the document.
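The routing rule can be sketched as a hash of the routing value taken modulo the number of primary shards. In this sketch `zlib.crc32` is only a stand-in hash chosen for illustration (Elasticsearch uses its own hash function), and `pick_shard` is a hypothetical name:

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Map a routing value (by default the document ID) to a primary shard number."""
    # crc32 is an assumption made for this sketch, not the real hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


# The same ID always lands on the same shard while the shard count is unchanged.
placements_3 = {doc_id: pick_shard(str(doc_id), 3) for doc_id in range(100)}
placements_4 = {doc_id: pick_shard(str(doc_id), 4) for doc_id in range(100)}
moved = [doc_id for doc_id in placements_3 if placements_3[doc_id] != placements_4[doc_id]]
```

Because the shard count is part of the formula, changing it would send many existing IDs to a different shard, which is one way to see why the number of primary shards is fixed when an index is created.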

Page 22: The Enterprise Technology Driven, Marketing Services Company

Core Field Types
• Strings: string
• Datetimes: date
• Whole numbers: byte, short, integer, long
• Floats: float, double
• Booleans: boolean
• Objects: object
• Also: multi_field, ip, geo_point, geo_shape

Page 23: The Enterprise Technology Driven, Marketing Services Company

Auto Detection of Field
• "foo bar" → string
• "2013-01-01" → date
• 10 → byte, short, integer, long
• 10.0 → float, double
• true → boolean
• { foo: "bar" } → object
• ["foo","bar"] → No special mapping. Any field can have multi-values
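The detection table can be approximated in a few lines. A simplified sketch, assuming the widest numeric type is always picked (long for whole numbers, double for floats) and that only `YYYY-MM-DD` strings count as dates; `guess_type` is a hypothetical name and the real detection logic has more rules:

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def guess_type(value):
    """Guess a field type from a parsed JSON value."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "date" if ISO_DATE.match(value) else "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):      # arrays take the type of their elements
        return guess_type(value[0]) if value else None
    return None


detected = {k: guess_type(v) for k, v in {
    "tweet": "foo bar", "date": "2013-01-01", "rt": 10,
    "score": 10.0, "ok": True, "loc": {"foo": "bar"}, "tags": ["foo", "bar"],
}.items()}
```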

Page 24: The Enterprise Technology Driven, Marketing Services Company

Some more Glossary
• Term: A term is an exact value that is indexed in elasticsearch. The terms foo, Foo, FOO are NOT equivalent.
• Text: Text (or full text) is ordinary unstructured text, such as this paragraph. By default, text will be analyzed into terms, which is what is actually stored in the index. Text fields need to be analyzed at index time in order to be searchable as full text, and keywords in full text queries must be analyzed at search time to produce (and search for) the same terms that were generated at index time.
• Analysis: Analysis is the process of converting full text to terms. Depending on which analyzer is used, these phrases: FOO BAR, Foo-Bar, foo,bar will probably all result in the terms foo and bar. These terms are what is actually stored in the index.

Page 25: The Enterprise Technology Driven, Marketing Services Company

• Tokenizer: Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
• Facets: They enable you to calculate and summarize data about the current query on the fly. They can be used for all sorts of tasks such as dynamic counting of result values or even distribution histograms. Facets only perform their calculations one level deep, and they can't be easily combined.
• Aggregations: Aggregations are similar to facets in many ways, and overcome the limitations of facets. Indeed, aggregations are meant to eventually replace facets altogether. Facets should be considered deprecated and will likely be removed in a future major release. One of the major limitations of facets is that you can't have facets of facets; that is, facets cannot be nested. The ability to nest aggregations therefore brings a great deal of power that was missing in facets.
• The two broad families of aggregations are metrics aggregations and bucket aggregations. Metrics aggregations calculate some value (like an average) over a set of documents, and bucket aggregations group documents into buckets.
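The difference between the two families, and what nesting adds, can be shown with a toy model in plain Python. This is a sketch over made-up documents; `terms_buckets` and `avg_metric` are hypothetical helpers, not Elasticsearch calls:

```python
from collections import defaultdict
from statistics import mean

# Made-up documents for illustration only.
docs = [
    {"tag": "wow", "rt": 5},
    {"tag": "wow", "rt": 1},
    {"tag": "elasticsearch", "rt": 6},
]


def terms_buckets(docs, field):
    """Bucket aggregation: group documents by the value of a field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def avg_metric(docs, field):
    """Metric aggregation: compute one value over a set of documents."""
    return mean(doc[field] for doc in docs)


# Nesting: a metric is computed inside each bucket.
result = {
    key: {"doc_count": len(bucket), "avg_rt": avg_metric(bucket, "rt")}
    for key, bucket in terms_buckets(docs, "tag").items()
}
```

Each bucket carries its own document count plus a metric computed only over the documents in that bucket, which is the nesting that facets could not express.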

Page 26: The Enterprise Technology Driven, Marketing Services Company

Schemaless, Document Oriented
• No need to configure schema upfront
• No need for slow ALTER TABLE-like operations
• Define mapping (schema) to customize the indexing process
  • Require fields to be of certain type
  • If you want text fields that should not be analyzed

Page 27: The Enterprise Technology Driven, Marketing Services Company

Distributed & Highly Available
• Multiple nodes running in a cluster
  • Acting as a single service
  • Nodes in cluster that store data or nodes that just help in speeding up search queries
• Sharding
  • Indices are sharded (number of shards is configurable)
  • Each shard can have zero or more replicas
  • Replicas on different servers for failover
• Master
  • Automatic master detection + failover
  • Responsible for distribution/balancing of shards

Page 28: The Enterprise Technology Driven, Marketing Services Company

A Single Node Cluster with An Index
• All 3 primary shards allocated to Node 1
• No replication nodes
• A single node means single point of failure
• Health of the cluster: Yellow

Page 29: The Enterprise Technology Driven, Marketing Services Company

Add Failover
• Add one node to the cluster by configuring the cluster name
• 3 replica shards have been allocated
• Cluster health: Green
• Now 6 shards. There is redundancy

Page 30: The Enterprise Technology Driven, Marketing Services Company

Scale horizontally
• A 3 node cluster
• One shard each from Node 1 and Node 2 has moved to Node 3
• Better performance as hardware resources (CPU, RAM, I/O) are shared

Page 31: The Enterprise Technology Driven, Marketing Services Company

Scale some more
• More nodes can be added
• More replicas can be added
• This will allow faster searches
• Allows better redundancy
• However, the number of primary shards is fixed at the moment an index is created
• Effectively, the maximum amount of data that can be stored in the index is defined by this number
• Is this a limitation …..

Page 32: The Enterprise Technology Driven, Marketing Services Company

Coping with Node Failure
• Kill the master node
• Elect a new master (Node 2)
• Primary shards 1 and 2 were lost
• Cluster health: Red
• Nodes 2 & 3 have replicas of these shards, which are now promoted to primaries
• Cluster health: Yellow
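The colour changes across the last few slides follow one rule: red when any primary shard is unassigned, yellow when all primaries are assigned but some replica is not, green otherwise. A minimal sketch of that rule; the shard dictionaries and `cluster_health` are a hypothetical representation, not the cluster health API:

```python
def cluster_health(shards):
    """shards: list of dicts with 'primary' (bool) and 'assigned' (bool) keys."""
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"      # at least one primary is missing
    if any(not s["assigned"] for s in shards):
        return "yellow"   # all primaries are live, some replica is not
    return "green"


def make_shards(primaries_assigned, replicas_assigned, count=3):
    """Build `count` primaries and `count` replicas with the given assignment state."""
    primaries = [{"primary": True, "assigned": primaries_assigned} for _ in range(count)]
    replicas = [{"primary": False, "assigned": replicas_assigned} for _ in range(count)]
    return primaries + replicas


single_node = cluster_health(make_shards(True, False))   # replicas have nowhere to go
two_nodes = cluster_health(make_shards(True, True))      # every shard is assigned
lost_primary = make_shards(True, True)
lost_primary[0]["assigned"] = False                      # node holding a primary died
after_failure = cluster_health(lost_primary)
```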

Page 33: The Enterprise Technology Driven, Marketing Services Company

Beauty Of Elasticsearch
• In Elasticsearch, all data in every field is indexed by default. That is, every field has a dedicated inverted index for fast retrieval. And, unlike most other databases, it can use all of those inverted indices in the same query, to return results at breathtaking speed.

Page 34: The Enterprise Technology Driven, Marketing Services Company

Document
• A document refers to the top-level or root object which is serialized into JSON and stored in Elasticsearch under a unique ID.
• A field or property can be a string, a number, a boolean, another object, an array of values, or some other specialized type such as a string representing a date or an object representing a geolocation.

Page 35: The Enterprise Technology Driven, Marketing Services Company

Document metadata
• _index: Where the document lives
• _type: The class of object that the document represents
• _id: The unique identifier for the document
  • Elasticsearch will auto-generate an ID if not specified

Page 36: The Enterprise Technology Driven, Marketing Services Company

Creating a Document

Page 37: The Enterprise Technology Driven, Marketing Services Company

Pagination
size = number of results
from = results to skip

GET /_search?size=5&from=0
GET /_search?size=5&from=5
GET /_search?size=5&from=10
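The three requests above are pages 1, 2 and 3 with five results per page, so `from` is `(page - 1) * size`. A small sketch of that arithmetic; `page_url` is a hypothetical helper:

```python
from urllib.parse import urlencode


def page_url(page, size=5):
    """Build the search path for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return "/_search?" + urlencode({"size": size, "from": (page - 1) * size})


urls = [page_url(p) for p in (1, 2, 3)]
```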

Page 38: The Enterprise Technology Driven, Marketing Services Company

Search (basic)
• GET /_search?q=mary
  → user named "Mary"
  → tweets by "Mary"
  → tweet mentioning "@mary"
• _all field
  • String value from all fields

Page 39: The Enterprise Technology Driven, Marketing Services Company

• GET /_search?q=2013 -> 12 results
• GET /_search?q=2013-06-03 -> 12 results!!
• GET /_search?q=date:2013-06-03 -> 1 result

Page 40: The Enterprise Technology Driven, Marketing Services Company

Mapping (field definitions)

GET /myapp/tweet/_mapping

{
  "tweet" : {
    "properties" : {
      "tweet" : { "type" : "string" },
      "name" : { "type" : "string" },
      "nick" : { "type" : "string" },
      "date" : { "type" : "date" },
      "rt" : { "type" : "long" },
      "loc" : {
        "type" : "object",
        "properties" : {
          "lat" : { "type" : "double" },
          "lon" : { "type" : "double" }
        }
      }
    }
  }
}

Types: date = type:date, _all = type:string
Indexed values: date = 2013-06-03, _all = 2013,06,03

Page 41: The Enterprise Technology Driven, Marketing Services Company

Exact Value Vs Full Text

Exact values: 104.5, 2013-01-01, true, Foo, foo

Full text: The quick brown fox jumped over the lazy dog

Page 42: The Enterprise Technology Driven, Marketing Services Company

Inverted Index
→ separate words / terms
→ sort unique terms
→ list docs containing terms
→ normalize terms

Terms of document 1: The,brown,dog,fox,jumped,lazy,over,quick,the
Terms of document 2: Quick,brown,dogs,foxes,in,lazy,leap,over,summer
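The steps above can be sketched in a few lines of Python. The first text is the sentence used elsewhere in these slides; the second is reassembled from the second term list, so its word order is an assumption. Normalization here is just lowercasing:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",  # word order assumed
}


def build_inverted_index(docs):
    """Map each normalized term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # separate terms, normalize by lowercasing
            index[term].add(doc_id)
    # Sort the unique terms, then list the documents for each one.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


inverted = build_inverted_index(docs)
```

Looking up `inverted["brown"]` returns both documents, while `inverted["fox"]` returns only the first, because without stemming "fox" and "foxes" remain different terms.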

Page 43: The Enterprise Technology Driven, Marketing Services Company

Analysis
• The index analysis module acts as a configurable registry of Analyzers that can be used in order to both break indexed (analyzed) fields when a document is indexed and process query strings. It maps to the Lucene Analyzer.
• Analyzer: tokenizer + token filters

Page 44: The Enterprise Technology Driven, Marketing Services Company

Standard Analyzer

Input: "The Quick Brown Fox jumped over the Lazy Dog!"

Standard tokenizer: The,Quick,Brown,Fox,jumped,over,the,Lazy,Dog

Lowercase filter: the,quick,brown,fox,jumped,over,the,lazy,dog

Stopwords filter (removes "the"): quick,brown,fox,jumped,over,lazy,dog
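The three stages chain together as tokenizer, then lowercase filter, then stopwords filter. A simplified sketch in Python, assuming a regex word tokenizer and a small illustrative stopword set; neither is the actual Lucene implementation:

```python
import re

# Small illustrative subset of English stopwords (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}


def tokenize(text):
    """Split on anything that is not a word character, dropping punctuation."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Analyzer = tokenizer + token filters, applied in order."""
    return remove_stopwords(lowercase(tokenize(text)))


terms = analyze("The Quick Brown Fox jumped over the Lazy Dog!")
```

The lowercase filter has to run before the stopwords filter, otherwise the capitalised "The" would slip through.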

Page 45: The Enterprise Technology Driven, Marketing Services Company

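English Analyzer

standard tokenizer → lowercase filter → english stemmer → english stopwords

Terms after tokenizing and lowercasing: the,quick,brown,fox,jumped,over,the,lazy,dog

The english stemmer reduces words to their stems (for example, jumped becomes jump), and the english stopwords filter drops common words such as "the".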

Page 46: The Enterprise Technology Driven, Marketing Services Company

Filters Vs Queries

Filters
• Exact matching
• Binary yes/no
• Fast
• Cacheable

Queries
• Full text
• Relevance scoring text search
• Heavier
• Not cacheable

Page 47: The Enterprise Technology Driven, Marketing Services Company

Queries

As a general rule, queries should be used instead of filters:
• for full text search
• where the result depends on a relevance score

Page 48: The Enterprise Technology Driven, Marketing Services Company

Some Query Types
• Match: specifies a field to search
  • _all is also a field
• Match (boolean)
  • The default match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator flag can be set to or or and to control the boolean clauses (defaults to or). The minimum number of should clauses to match can be set using the minimum_should_match parameter.
• Match (phrase)
  • The match_phrase query analyzes the text and creates a phrase query out of the analyzed text

{ "match" : { "message" : "this is a test" } }

{ "match" : { "message" : { "query" : "this is a test", "operator" : "and" } } }

{ "match_phrase" : { "message" : "this is a test" } }
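The difference between the three query bodies can be illustrated with a toy matcher. This is a sketch of the semantics only, assuming a trivial lowercase-and-split analyzer and ignoring relevance scoring; `match` and `match_phrase` here are plain Python functions, not client calls:

```python
def analyze(text):
    """Trivial analyzer assumed for this sketch: lowercase and split on whitespace."""
    return text.lower().split()


def match(doc_text, query_text, operator="or"):
    """'or' needs any query term in the document, 'and' needs all of them."""
    doc_terms = set(analyze(doc_text))
    hits = [term in doc_terms for term in analyze(query_text)]
    return all(hits) if operator == "and" else any(hits)


def match_phrase(doc_text, query_text):
    """The query terms must appear next to each other, in the same order."""
    doc_terms, query_terms = analyze(doc_text), analyze(query_text)
    n = len(query_terms)
    if n == 0:
        return False
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))


message = "this is only a drill"
or_hit = match(message, "this is a test")                   # shares "this", "is", "a"
and_hit = match(message, "this is a test", operator="and")  # "test" is missing
phrase_hit = match_phrase("I said this is a test today", "this is a test")
```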

Page 49: The Enterprise Technology Driven, Marketing Services Company

Multi Match
• Multi match query
  • Multiple fields to search
  • Fields can be identified using wildcards
  • Fields can be boosted

{
  "multi_match" : {
    "query" : "Will Smith",
    "fields" : [ "title", "*_name" ]
  }
}

Page 50: The Enterprise Technology Driven, Marketing Services Company

Bool Query
• A query that matches documents matching boolean combinations of other queries.

{
  "bool" : {
    "must" : { "term" : { "user" : "kimchy" } },
    "must_not" : { "range" : { "age" : { "from" : 10, "to" : 20 } } },
    "should" : [
      { "term" : { "tag" : "wow" } },
      { "term" : { "tag" : "elasticsearch" } }
    ],
    "minimum_should_match" : 1,
    "boost" : 1.0
  }
}
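How the clauses of this bool query combine can be sketched as an evaluator over plain dictionaries. This covers matching only, not scoring or boost, and it assumes the range bounds are inclusive; `term`, `in_range` and `bool_match` are hypothetical helpers:

```python
def term(doc, field, value):
    """Exact-value test; list fields match when any element equals the value."""
    found = doc.get(field)
    return value in found if isinstance(found, list) else found == value


def in_range(doc, field, low, high):
    """Range test with inclusive bounds (an assumption of this sketch)."""
    found = doc.get(field)
    return found is not None and low <= found <= high


def bool_match(doc, minimum_should_match=1):
    """Mirror of the bool query above: must AND NOT must_not AND enough should clauses."""
    must = term(doc, "user", "kimchy")
    must_not = in_range(doc, "age", 10, 20)
    should_hits = sum([term(doc, "tag", "wow"), term(doc, "tag", "elasticsearch")])
    return must and not must_not and should_hits >= minimum_should_match


# Made-up documents for illustration.
docs = [
    {"user": "kimchy", "age": 30, "tag": ["wow"]},            # matches
    {"user": "kimchy", "age": 15, "tag": ["wow"]},            # excluded by must_not
    {"user": "kimchy", "age": 30, "tag": ["other"]},          # no should clause hit
    {"user": "someone", "age": 30, "tag": ["elasticsearch"]}, # fails must
]
results = [bool_match(d) for d in docs]
```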

Page 51: The Enterprise Technology Driven, Marketing Services Company

Filters
• For binary yes/no searches
• For queries with exact values
• Filters can be great candidates for caching ( _cache

Page 52: The Enterprise Technology Driven, Marketing Services Company

Aggregations
• Allow real-time data analytics
• Better than facets. Facets will be deprecated
• Build analytic information over a set of documents. The context of the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed query/filters of the search request).
• Can be nested
• Two families of aggregation
  • Bucketing
  • Metric

Page 53: The Enterprise Technology Driven, Marketing Services Company
Page 54: The Enterprise Technology Driven, Marketing Services Company

Aggregation Types
• Min / Max / Sum / Avg aggregation
• Stats / Extended stats aggregation
• Value count / Percentile / Cardinality aggregation
• Filter / Missing aggregation
• Nested / Reverse nested aggregation
• Terms aggregation
• Range / Date range aggregation
• IPv4 aggregation
• Histogram aggregation
• Geo distance aggregation

Page 55: The Enterprise Technology Driven, Marketing Services Company

Q&A