pyatl meetup, oct 8, 2015

Processing Big Data Predictive Analytics

PyATL Meetup 10/18/2015


Who Am I

• Roy Russo
• VP Engineering, Predikto


Why Am I Here?

& (Big Data) Predictive Analytics


Agenda

• What we do
• Problems we faced
• How we solved them
• Rationale
• What’s good. What’s not.


Who is Predikto?

• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers
• Big Data Architects
• Global 1000

… and we don’t suck.


What we do?

• Actionable Predictive Maintenance
• Predictive Analytics
• Real-time health scoring
• Unified asset health view
• SaaS

HOW DOES PREDICTIVE ANALYTICS WORK?


Data Sources


How We Do It

[Diagram: data sources feed the PREDIKTO ENTERPRISE PLATFORM, which produces TIME-BASED ACTIONABLE PREDICTIONS]

• TRAIN MGT SYSTEMS: INTELLITRAIN, GE RM&D, EMD, NYAB LEADER, MOTIVEPOWER, WABTEC
• EAM: SAP, INFOR, ORACLE, MAXIMO
• OTHER: BEACONS, WILD, WEATHER, TCIS, CUSTOM APPS, UMLER
• PREDIKTO ENTERPRISE PLATFORM: PREDIKTO INPUT APIs, DATA TRANSFORM. ENGINE, MAX MACHINE LEARNING ENGINE, PREDIKTO OUTPUT APIs & DASHBOARDS


The Pipeline

[Diagram: three pipeline stages]

• Predikto data Pipeline (Data Aggregation/ETL): ETL, Standard JSON
• “MAX” (Machine Learning/Analytics): AutoDynamic Feature Engineering, AutoDynamic Feature Selection
• Operational Integration (Outbound APIs/Integration): Email, SMS, Integration, UI Data Store


High-Level Requirements

Data Processing

• ETL on LARGE datasets
  • Fast
  • Commodity hardware
• Feature scoring/selection on LARGE datasets
• Scale horizontally
• Runs from Instruction-set

Data Querying

• Visualize LARGE datasets
  • Time-Series
  • Fast
  • Commodity hardware
• Scale horizontally
• Support dynamic querying
  • Differing Schema


Why Spark?

• ETL:
  • Shared Memory
  • Not Disk-Bound
  • Distributed workloads
• Feature scoring/selection on LARGE datasets
  • Same as above
• Scale horizontally
  • New node = more capacity
  • Spin up. Spin down.
• Runs from Instruction-set
  • DAGs
• Python Devs… PySpark


Implementing Spark

• Use DAGs
  • Directed Acyclic Graphs
  • Config-Driven Workflows
• Tune to Job
  • Workers / Core
  • Memory Tuning
  • CPU or RAM?
• Cons:
  • Steep learning curve
  • Dev & Ops
  • Documentation
  • Exception handling
  • Native Scala
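
One way to picture "Tune to Job" is a per-job profile that gets rendered into `spark-submit --conf` arguments. The sketch below is only an illustration under assumptions: the job names, script name, and values are hypothetical, and only the three property names (`spark.executor.memory`, `spark.executor.cores`, `spark.cores.max`) are real Spark settings.

```python
# Hypothetical per-job tuning profiles; the job names and values are
# illustrative assumptions, not settings taken from the talk.
JOB_PROFILES = {
    "etl_wide_rows": {          # RAM-bound: fewer cores, more memory each
        "spark.executor.memory": "8g",
        "spark.executor.cores": "2",
        "spark.cores.max": "16",
    },
    "feature_scoring": {        # CPU-bound: more cores, less memory each
        "spark.executor.memory": "2g",
        "spark.executor.cores": "4",
        "spark.cores.max": "32",
    },
}


def spark_submit_args(job_name, script):
    """Build a spark-submit argument list from the job's tuning profile."""
    args = ["spark-submit"]
    for key, value in sorted(JOB_PROFILES[job_name].items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(script)
    return args


print(" ".join(spark_submit_args("etl_wide_rows", "etl_job.py")))
```

Keeping the profile next to the job definition makes the "CPU or RAM?" decision explicit for each workload rather than one cluster-wide default.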


Spark UI


More on DAGs

class comma_to_decimal(BaseDagTask):
    @staticmethod
    def run(sc, config, rdds, log):
        from shippable.dag_tasks._utils import safe_map

        out = []
        for idx, rdd in enumerate(rdds):
            if rdd is not None:
                # Columns to convert, given as a comma-separated string in the task config
                cols = [str(x).strip() for x in config['cols'].split(',')]
                # Drop thousands separators, then turn the decimal comma into a point
                rdd = safe_map(rdd, lambda x: {k: v if str(k) not in cols else str(v).replace('.', '').replace(',', '.') for k, v in x.items()}, log)
                out.append(rdd)
            else:
                out.append(None)

        return out, None

"datetimes_ones": {
    "after": "datatypes_ones",
    "type": "convert_datetimes",
},
"comma_convert": {
    "after": "datetimes_ones",
    "type": "comma_to_decimals",
},
"parentids_ones": {
    "after": "devids_ones",
    "type": "comma_convert",
},
"timestamps_ones": {
    "after": "parentids_ones",
    "type": "rename",
},
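
To show how a config with "after" and "type" keys can drive a workflow, here is a minimal sketch. It is not the framework from the slide: plain lists of dicts stand in for RDDs, and the task registry, the `args` key, and the step names are hypothetical. It only illustrates ordering steps by their "after" dependency and dispatching on "type".

```python
# Hypothetical stand-ins: plain lists of dicts replace RDDs, and the task
# registry and step names are illustrative, not the speaker's framework.

def comma_to_decimal(records, cols):
    """Turn '1.234,56' into '1234.56' for the named columns only."""
    return [
        {k: str(v).replace(".", "").replace(",", ".") if k in cols else v
         for k, v in rec.items()}
        for rec in records
    ]


def rename(records, mapping):
    """Rename keys according to mapping; other keys pass through."""
    return [{mapping.get(k, k): v for k, v in rec.items()} for rec in records]


TASK_TYPES = {"comma_to_decimal": comma_to_decimal, "rename": rename}


def execution_order(workflow):
    """Order step names so each runs after the step named in its 'after'."""
    ordered, pending = [], dict(workflow)
    while pending:
        ready = [name for name, step in pending.items()
                 if step.get("after") is None or step["after"] in ordered]
        if not ready:
            raise ValueError("cycle or missing dependency: %s" % sorted(pending))
        for name in sorted(ready):
            ordered.append(name)
            del pending[name]
    return ordered


def run_workflow(workflow, records):
    """Run every step in dependency order, feeding each output to the next."""
    for name in execution_order(workflow):
        step = workflow[name]
        records = TASK_TYPES[step["type"]](records, step["args"])
    return records


# Steps are listed out of order on purpose; "after" decides the sequence.
workflow = {
    "rename_cols": {"after": "fix_decimals", "type": "rename",
                    "args": {"temp": "temperature"}},
    "fix_decimals": {"after": None, "type": "comma_to_decimal",
                     "args": ["temp"]},
}

result = run_workflow(workflow, [{"id": "a1", "temp": "1.234,56"}])
print(result)
```

Because the order comes from the config, adding or reordering steps is an edit to the config rather than to the code.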


Why Elasticsearch?

• Time-Series Data
  • Fast-Reads
  • Fast writes with Bulk Inserts
  • Asynch
• Dynamic querying
  • Differing schema
  • Everything is indexed
• Visualization
  • GeoJSON
• Scale horizontally
  • New node = more capacity
• Spark-ES Connector
• REST API & Python lib
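
The "Bulk Inserts" bullet refers to Elasticsearch's `_bulk` endpoint, which takes a newline-delimited body: one action line, then one document line, per record. A minimal sketch of building that body with only the standard library follows. The index name, field names, and readings are made up, and the optional `doc_type` argument is an assumption for older Elasticsearch versions that expect a `_type`.

```python
import json


def bulk_body(index, docs, doc_type=None):
    """Serialize docs into the newline-delimited body the _bulk API expects.

    Each document becomes two lines: an action line, then the source line.
    The body must end with a newline.
    """
    action = {"_index": index}
    if doc_type is not None:        # assumption: only for versions that use types
        action["_type"] = doc_type
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": action}, sort_keys=True))
        lines.append(json.dumps(doc, sort_keys=True))
    return "\n".join(lines) + "\n"


# Hypothetical sensor readings; the index and field names are made up.
readings = [
    {"asset": "loco-17", "ts": 1444262400, "temp_c": 81.5},
    {"asset": "loco-17", "ts": 1444262460, "temp_c": 82.1},
]
body = bulk_body("readings", readings)
print(body)
```

Sending many documents in one request like this is what makes the write path fast compared with indexing one document per call.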


ES-Hadoop

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=[es_uri])

# some_var must already be a JSON-encoded list string
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var + '}}}}}'

es_conf = {
    "es.nodes": es_host_name,
    "es.port": es_host_port,
    "es.resource": es_index + '/' + es_mapping,
    "es.query": query
}

# sc is an existing SparkContext
es_RDD = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

# Each element is a (key, document) pair; keep only the document
dicts = es_RDD.map(lambda x: x[1]).map(lambda x: reformat_location(x))
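
Building the query by string concatenation works only if the inserted value is already valid JSON. A possible alternative, sketched here with hypothetical host, index, and mapping names, is to build the query as a dict and serialize it with `json.dumps`. The `filtered` query shape is copied from the slide and belongs to the Elasticsearch versions of that time; later releases dropped it in favor of `bool` with a `filter` clause.

```python
import json


def build_es_conf(host, port, index, mapping, epochs):
    """Return an ES-Hadoop conf dict whose query is serialized with json.dumps."""
    query = {
        "query": {
            "filtered": {
                "filter": {"terms": {"date_epoch": list(epochs)}}
            }
        }
    }
    return {
        "es.nodes": host,
        "es.port": str(port),           # conf values are passed as strings
        "es.resource": "{}/{}".format(index, mapping),
        "es.query": json.dumps(query),
    }


# Hypothetical host, index, and mapping names.
es_conf = build_es_conf("localhost", 9200, "readings", "sensor",
                        [1444262400, 1444348800])
print(es_conf["es.query"])
```

The resulting dict can be passed as the `conf` argument in the same way as the hand-built one.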


Implementing Elasticsearch

• Tune
  • Memory-Bound
  • Large Drives (SSDs)
  • Front w/ Load Balancer
• Spark
• Schema:
  • Understand your customer’s data!
  • Typing is important
    • Timestamps, GeoHASH, Floats, etc…
• Cons:
  • Moderate learning curve
  • Query syntax
  • Security
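
To make "Typing is important" concrete, the sketch below declares explicit field types and coerces raw string values before indexing. The field names and sample record are hypothetical. `date`, `geo_point`, and `float` are Elasticsearch field types, and how the `properties` block is wrapped in a full mapping request depends on the Elasticsearch version.

```python
import json

# Hypothetical field names; only the type names come from Elasticsearch.
PROPERTIES = {
    "timestamp": {"type": "date"},
    "location": {"type": "geo_point"},
    "temp_c": {"type": "float"},
}


def coerce(record):
    """Cast raw string values to the Python types the mapping implies."""
    return {
        "timestamp": int(record["timestamp"]),   # epoch milliseconds
        "location": str(record["location"]),     # geohash string
        "temp_c": float(record["temp_c"]),
    }


mapping_body = json.dumps({"properties": PROPERTIES}, sort_keys=True)
doc = coerce({"timestamp": "1444262400000",
              "location": "dn5bpsby",
              "temp_c": "81.5"})
print(mapping_body)
print(doc)
```

Declaring types up front avoids relying on whatever type Elasticsearch infers from the first document it sees for a field.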


Architecture


Implementing Elasticsearch

PREDICT | PREVENT | PERFORM

http://www.predikto.com/company/careers
