adm graphics-2003


Upload: john-b-cook-pe-ceo

Post on 19-Jul-2015


Category:

Engineering



Itinerary

“A Traveler's Guide”

• About ADM

– Data Mining

– Visualization

– Intelligent Software

– Real-Time Web Applications

• Technology

• Examples

Data Mining?

• “The search for valuable knowledge in massive volumes of data” (Weiss and Indurkhya)

• Data Mining Tool Box

– signal processing, advanced statistics, machine learning, chaos theory, advanced visualization

• Why?

– Continuously maximize yields, throughput, profit

– Continuously minimize problems

• How?

– Learn/quantify important cause-effect relationships

– Computer models developed directly from data

• are “virtual processes” that behave like the real processes

• predict future outcomes, evaluate alternatives, show the best pathway forward

More on Data Mining

• Data have properties that must be measured for optimal use

– uni / multivariate relationships

– periodicity / chaos / noise

– orthogonality / redundancy

– continuity / segmentation

– dynamics

• temporal: time delays, prediction horizon

• dimensions: inertia, historical uniqueness

• Data Mining

– Maximizes/Extracts “information content”

– Automates discovery

– Integrates your data with your business

20 Years of Medical Imaging for GE and Siemens Applied to Data Mining

About ADM

• New Company

• Data Mining & Visualization Services and Software

• Founders have 40+ years

– engineering, artificial intelligence/expert systems, complex programming, signal processing, clustering/classification, machine learning, advanced visualization, data mining

– Automotive, Environmental, Medical, Metals, Oil & Gas, Polymers, Electronics

– special expertise in dynamical systems that constantly change/evolve

• Fastest, most skilled anywhere

Technology

A View of Processes

[Figure: a physical process drawn as a “deterministic dynamical system” with inputs x1 to x8 and outputs y1 to y3; behavior is labeled multiply periodic, chaotic, or stochastic.]

non-stochastic effects should be predictable, therefore controllable

Multiply Periodic

(Fourier approximations)

• people

• lab tests

• controls tuning

• raw materials

• weather

Chaos

[Figure: Chaos Example, Lorenz attractor: power spectrum, 3D delay plot (“orbitals”), and prediction from = -10.]

“extreme sensitivity to changes in boundary conditions”
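The sensitivity quoted on this slide can be illustrated with a minimal sketch. It assumes the classic Lorenz parameters (sigma = 10, rho = 28, beta = 8/3), a fixed-step Runge-Kutta integrator, and a perturbation of the starting state; none of these choices come from the slides, and the function names are illustrative only.

```python
def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz equations one time step with 4th-order Runge-Kutta."""

    def f(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2))
    k3 = f(shifted(state, k2, dt / 2))
    k4 = f(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_over_time(steps=5000, perturbation=1e-8):
    """Distance between two trajectories whose starting points differ slightly."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + perturbation, 1.0, 1.0)  # assumed perturbation size
    distances = []
    for _ in range(steps):
        a, b = lorenz_step(a), lorenz_step(b)
        distances.append(sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5)
    return distances


distances = separation_over_time()
print(f"separation after first step: {distances[0]:.2e}")
print(f"largest separation reached:  {max(distances):.2e}")
```

The two runs start almost identically, yet the separation grows until it is comparable to the size of the attractor, which is why long-range prediction of a chaotic signal is limited to a finite horizon.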

Role of a Process Model

[Figure: a process model maps inputs x1 to x8 to outputs y1 to y3.]

• Things you CAN control: control setpoints, e.g., pressure, temperature, speed

• Things you CAN’T control: raw material properties, e.g., density, surface area, molecular weight; other state variables, e.g., humidity, amb. temperature

• What you want to know: quality measures, e.g., strength, clarity, thickness

Deterministic vs. Empirical

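Empirical Models

• Statistics, e.g., the least-squares line $y = Ax + B$:

$A = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$

$B = \dfrac{\sum y \sum x^2 - \sum x \sum xy}{n \sum x^2 - (\sum x)^2}$

• Neural Networks

First Principles Models

• $E = mc^2$

• $\dfrac{\partial u}{\partial t} - \left( \dfrac{\partial^2 u}{\partial x^2} + \dfrac{\partial^2 u}{\partial y^2} + \dfrac{\partial^2 u}{\partial z^2} \right) = 0$

Production, Economic, Environment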
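As a small worked example of an empirical model, the closed-form coefficients A and B from the statistics side of this slide can be computed directly from data. This is a sketch only; the function name `fit_line` and the sample points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope A and intercept B for y = A*x + B."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = n * sxx - sx ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (n * sxy - sx * sy) / denom
    b = (sy * sxx - sx * sxy) / denom
    return a, b


# Illustrative data lying exactly on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

For these four points the sums are n = 4, Σx = 6, Σy = 16, Σxy = 34, Σx² = 14, so the shared denominator is 20, A = 40/20 = 2, and B = 20/20 = 1.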

Interpolation / Extrapolation

[Figure: model space with points P1 to P4 and Pw to Pz, showing regions where the model interpolates and regions where the model extrapolates; designs are labeled “a good design”, “a mediocre design”, and “a bad design”.]

Historical Data

• noisy, small data

• designed experiments

About Neural Networks

• Inspired by the Brain

– get complicated behaviors from lots of “simple” interconnected devices - neurons and synapses

– models are synthesized from example data

• machine learning

[Figure: network mapping inputs x1 to x5 to outputs y1 and y2.]

About Neural Networks

• Non-linear Multivariate Curve Fitting

– the modeler prescribes inputs, outputs, hidden layer neurons, and connections

– “Weights” are the unknown coefficients that are determined by the computer from examples using an error minimizing “learning algorithm”

[Figure: network with an input layer (x1 to x5), a hidden layer, and an output layer (y1, y2); “weights” wi ... wi+n control connections; the network is trained on a set of input/output examples.]
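To make the idea of weights fitted by an error minimizing learning algorithm concrete, here is a minimal sketch of a one-input, one-output network with a single hidden layer, trained by stochastic gradient descent. The layer size, learning rate, tanh activation, and target curve y = x² are all assumptions chosen for illustration; the slides do not say which learning algorithm ADM used.

```python
import math
import random


def train_tiny_network(examples, hidden=4, epochs=2000, rate=0.05, seed=0):
    """Fit y = sum_j v[j] * tanh(w[j] * x + b[j]) + c to (x, y) examples."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(hidden)]  # input-to-hidden weights
    b = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden biases
    v = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden-to-output weights
    c = 0.0                                          # output bias

    def predict(x):
        return sum(vj * math.tanh(wj * x + bj) for wj, bj, vj in zip(w, b, v)) + c

    def mean_squared_error():
        return sum((predict(x) - y) ** 2 for x, y in examples) / len(examples)

    history = [mean_squared_error()]
    for _ in range(epochs):
        for x, y in examples:
            h = [math.tanh(wj * x + bj) for wj, bj in zip(w, b)]
            err = sum(vj * hj for vj, hj in zip(v, h)) + c - y
            for j in range(hidden):
                # Error signal propagated back through the tanh unit.
                grad_pre = err * v[j] * (1 - h[j] ** 2)
                v[j] -= rate * err * h[j]
                w[j] -= rate * grad_pre * x
                b[j] -= rate * grad_pre
            c -= rate * err
        history.append(mean_squared_error())
    return predict, history


# Illustrative input/output examples: a non-linear curve sampled on [-1, 1].
examples = [(i / 10, (i / 10) ** 2) for i in range(-10, 11)]
predict, history = train_tiny_network(examples)
print(f"mean squared error before training: {history[0]:.4f}")
print(f"mean squared error after training:  {history[-1]:.4f}")
```

The modeler fixes the structure (one input, four hidden neurons, one output); only the weights change during training, which matches the division of labor described on the slide.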

About Neural Networks

• Shifts Modeling Focus

– from smaller data/big deterministic modeling effort

– to bigger data/smaller modeling effort

– combine with optimization (search) methods

• real-time prediction

• resource allocation

– deterministic + error correcting ANN hybrids

Response Surfaces

Water Disinfection Trihalomethanes Formation

[Figure: response surface fitted by a non-linear ANN model, representing normal behavior; annotations: no data, deviation from normal, better conditions?]

Response Surfaces

Savannah River Saltwater Migration

Optimizing With Models

[Figure: a process model maps inputs x1 to x8 to outputs y1 to y3. The inputs are varied by an optimization program (search routine); the outputs are evaluated for goodness through the Performance Index.]

PI = a y1 + b y2 + c y3

GOAL: determine values of inputs (within controllable range) to optimize Performance Index while meeting constraints.
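The search loop described above can be sketched as follows. Everything specific here is an assumption made for illustration: `process_model` is a hypothetical stand-in for a fitted model, the PI weights and the constraint are invented, and plain random search is used only because it is the simplest search routine, not because the slides name it.

```python
import random


def process_model(x):
    """Hypothetical stand-in for a fitted process model: 2 inputs, 3 outputs."""
    x1, x2 = x
    y1 = 10.0 - (x1 - 3.0) ** 2   # e.g., a yield that peaks at x1 = 3
    y2 = 5.0 - (x2 - 1.0) ** 2    # e.g., a quality measure that peaks at x2 = 1
    y3 = x1 + x2                  # e.g., a cost that grows with both inputs
    return y1, y2, y3


def performance_index(y, a=1.0, b=1.0, c=-0.5):
    """PI = a*y1 + b*y2 + c*y3, with assumed weights."""
    y1, y2, y3 = y
    return a * y1 + b * y2 + c * y3


def random_search(bounds, constraint, trials=5000, seed=0):
    """Sample inputs inside the controllable range and keep the best feasible one."""
    rng = random.Random(seed)
    best_x, best_pi = None, float("-inf")
    for _ in range(trials):
        x = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        y = process_model(x)
        if not constraint(x, y):
            continue  # infeasible candidates are discarded
        pi = performance_index(y)
        if pi > best_pi:
            best_x, best_pi = x, pi
    return best_x, best_pi


# Controllable range for each input and an assumed constraint y3 <= 3.
best_x, best_pi = random_search(
    bounds=[(0.0, 5.0), (0.0, 5.0)],
    constraint=lambda x, y: y[2] <= 3.0,
)
print(best_x, best_pi)
```

For this toy model the constrained optimum sits on the boundary x1 + x2 = 3 at (2.5, 0.5) with PI = 13, so the search result should approach that value from below.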

Control Possibilities

Polymer Packaging Film Intrinsic Viscosity

[Figure: actual vs. predicted intrinsic viscosity, before and after; inputs labeled process and weather.]

Industrial

“Sometimes it runs good, sometimes bad, and we don’t know why.”

[Figure labels: heater, dryer, air jets.]

Discrete Event Prediction

Unexplained Polymer Film Production Shutdowns

[Figure: temperature and viscosity signals (20 minute interval, 4 minute ramp) shown 19 minutes before, 12 minutes before, and 1 minute before an event; labels: temperature related web breaks, initial web break.]

Representation

Dynamics

• can require multiple delays for same variable

• delays may vary

Different events due to different causes are detected at different times prior to occurrence

Off-Spec Production

Synthetic Textile Fiber Quality

[Figure: three panels of Process Temp (C) vs. days since July 1, 1998, plotted against Waste (lbs), Q3 (lbs), and Ambient Temp (C).]

• During period of high off-spec, process tracks ambient temperature.

Semi-Quantitative Data

Polymer Resin Solid State Polymerization

[Figure: frequency distributions at Ambient Pressure = 29.41 and Ambient Pressure = 30.26; labels: heater, dryer, air jets.]

Environmental Compliance

Water Disinfection Trihalomethanes Formation

[Figure: Trihalomethanes (ppb), 0 to 160, vs. date, 7/1/00 to 3/28/01; series: FINISHED THM (ppb), Control Model, Virtual Sensor; annotation: straight predictions.]

• EPA regulated carcinogen

• Different models for

– prediction

– control (gains)

• $$$ Savings by optimizing use of ClO2 vs. Cl2

Customer Feedback

[Figure: feedback loop linking Process Engineering, Tech Service, and the Customer. The output of your process is the input to your customer’s process; customer feedback links your continuous improvement to your customer’s continuous improvement.]

Customer Performance

Synthetic Textile Fiber Quality

[Figure: three time-series panels, 1999 to 2000, with state labels AL, SC, and NC. Opelika MJS Red Buttons vs. Denier Composite (DENVMACV). Calhoun MJS Red Buttons vs. H_WHIT Composite (HWHITCV). Frontier 32.5 Single Ends Break, 2 wk avg (F32_5SEB) vs. FUSE Composite, 2 wk avg, delay = 15 days (FUSEC).]

Setpoints Quality?

Synthetic Textile Fiber Quality

[Figure: four time-series panels, June 1999 to July 2000. CRIMP: CRIMP_P (CRIMPPA, CRIMPPB) vs. U3VI3523.SP (I3523SP). DENIER: Denier Composite (DENVMAC) vs. U3PP3727.SP and U3PP3727.PV (P3727SP, P3727PV). ELONG: Daily Average ELONGA Composite (ELONGAC) vs. U3PF3291.PV and L16PT409.PV (F3291PV, PT409PV). FUSE: FUSE Composite (FUSEC) vs. S3P27845.PV (27845PV).]

Contract Optimization (electricity purchasing relative to usage)

[Figure: Kosa Spartanburg Baselines, kw or kwh (24000 to 44000) vs. date, 1998 to F-99; series: D1+D2 LOAD TOTAL, D1+D2 USE kwh/hour, D1+D2 USE Billing Demand.]

Evaluate Contract Costs and Options

[Figure: Kosa Spartanburg Load Shifting Scenarios, kw (37,000 to 43,000) and $/kwh (0.01 to 0.19) vs. date, 9/1 to 9/8, 1998; series: Current D1+D2 kw, Best Case D1+D2 kw, Worst Case D1+D2 kw, RTP($/kw).]

Load Control Scenarios

Real-Time Pricing and Electricity Deregulation

Natural Systems

surface water

Estuarine Water Quality

[Figure: Dissolved oxygen (mg/L) and Temperature (degree Celsius) vs. date and time, 8/21/93 to 8/24/93; panels: Water temperature, Dissolved oxygen; series: Measured, Neural Network, BRANCH/BLTM.]

• Mixing - Tides, Flows from 3 Rivers

• Weather (T, P, Dew Point)

• Point Discharge Wastewater Treatment Plants

• Non-Point Discharges - rainfall, 50% overbank storage

Pollution in Estuaries

[Figure: estuary at High Tide and Mean Tide, showing wastewater discharges, non-point discharges from tidal flooding and from rain, and gauging stations.]

[Figure: raw signals, 2 months, vs. Time (hours): Dissolved Oxygen Concentration (mg/L), Water-Temperature (deg. C), Specific Conductivity (µ-siemens/cm), Water Level (feet). Spectral analysis shows low frequency and broadband content with peaks at 6.2 hr, 8.2 hr, 12.4 hr, and 25.6 & 24 hr.]

[Figure: signal decomposition, chaotic component via digital filtering, for the same four signals.]

[Figure: SC (micro-siemens/cm) and DO (mg/L) vs. date, 6/19/93 to 9/11/93.]

[Figure: signal decomposition, high frequency, periodic components via digital filtering, for the same four signals.]
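The spectral analysis and decomposition steps named in these figures can be sketched on synthetic data. This example assumes hourly sampling, an ideal (brick-wall) filter applied in the frequency domain, and a made-up signal with 12.4 hr and 6.2 hr components; the slides do not say which digital filters were used, so the filter here is a swapped-in stand-in.

```python
import cmath
import math


def dft(signal):
    """Discrete Fourier transform (direct O(n^2) form, fine for short records)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def split_by_period(signal, cutoff_hours, dt_hours=1.0):
    """Split a signal into (slow, fast) parts at a cutoff period."""
    n = len(signal)
    spectrum = dft(signal)
    slow = [0j] * n
    fast = [0j] * n
    for k, value in enumerate(spectrum):
        cycles = min(k, n - k)  # fold negative frequencies onto positive ones
        period = float("inf") if cycles == 0 else n * dt_hours / cycles
        (slow if period > cutoff_hours else fast)[k] = value
    return inverse_dft(slow), inverse_dft(fast)


# Synthetic hourly record: mean level, slow drift, and two tidal-like periods.
n = 248
signal = [
    5.0
    + 0.5 * math.sin(2 * math.pi * t / 248.0)
    + 1.0 * math.sin(2 * math.pi * t / 12.4)
    + 0.3 * math.sin(2 * math.pi * t / 6.2)
    for t in range(n)
]

spectrum = dft(signal)
peak = max(range(1, n // 2), key=lambda k: abs(spectrum[k]))
dominant_period = n / peak
slow, fast = split_by_period(signal, cutoff_hours=30.0)
print(f"dominant period: {dominant_period:.1f} hr")
```

The slow part carries the mean and drift, the fast part carries the periodic components, and the two add back to the original record, so each can be modeled separately.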

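[Figure: model structure with labeled blocks: Chaotic and Low Frequency Data, ANN, DO t=0; High Frequency Data, ANN, DO final; inputs SC’s t=x and WL’s t=y. A second diagram shows Chaotic and Low Frequency Data, ANN, SC’s ti, and DO t=0. Plotted series: Measured, Training, Test.]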

Beaufort River Estuary

[Figure: 24 hr Derivative of DOA (mg/L) and Southside BOD (LB/Day) vs. Sequential but Non-Consecutive Data Point Number, 1 to 66; series: EDOa6611, SSBOD_D; labels: SS, SP, PI.]

R = -0.357, R2 = 0.128

R2siso = 0.130

[Figure: gains by delay, panels labeled = 1, = 2 day, and = 3 days.]

BOD Effect on DO

Inputs = WL, SC, TP, Rainfall, BOD

R2ANN = 0.57

N = 61 points

[Figure: response surface of delta DOs-DOm; annotations: flow towards gauge, TP = 20 C, 61 Points, increasing BOD, No Data.]

Natural Systems

groundwater

Groundwater Modeling

Upper Floridan Aquifer

[Figure: Well Locations (100x100 miles), Surface Contour, and Well Histories (18 years).]

Sub-Regions Behave Differently

[Figure: map of a 25x30 mile sub-region.]

Groundwater Example (Cluster monitoring wells according to behavior)

Accuracy by Cluster

[Figure: Actual vs. Prediction of Normalized Water Level above Sea Level for clusters C1, C3, C6, and C10; history from April 1982 to October 1998.]

[Figure: Aquifer “Ceiling” surface; labels: Gulf of Mexico, North; max elevation above sea level ~ 180 feet.]
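Clustering monitoring wells according to behavior, as in the groundwater example above, can be sketched with a basic k-means on the well histories. The slides do not name the clustering method, so k-means, the squared Euclidean distance, and the synthetic rising and falling histories below are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length histories."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(series, k, iterations=50, seed=0):
    """Group time series by shape; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(s) for s in rng.sample(series, k)]
    labels = [0] * len(series)
    for _ in range(iterations):
        # Assignment step: each well joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(s, centers[c])) for s in series]
        # Update step: each center becomes the mean history of its members.
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Synthetic normalized water levels: three rising wells, three falling wells.
rising = [[0.1 * t + offset for t in range(10)] for offset in (0.0, 0.05, 0.1)]
falling = [[1.0 - 0.1 * t + offset for t in range(10)] for offset in (0.0, 0.05, 0.1)]
wells = rising + falling

labels, centers = kmeans(wells, k=2)
print(labels)
```

Once wells are grouped this way, a separate model can be fitted per cluster, which is one way to read the “Accuracy by Cluster” results.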

Oil & Gas

Problem

• Area A is a coal bed methane field

• Data for 59 wells was compiled by a petroleum engineering group in CO to determine if an artificial neural network* (ANN) could be used to predict total gas production for the life of each well.

[Figure: Well Locations, Normalized UTMY (m) vs. Normalized UTMX (m), and a 3D Range Model of Area A (16x32 km, vertical scale ~ 200m); north arrows.]

* A form of “machine learning” from AI.

Variable Types

• Static Variables

– e.g., depth, permeability

– for each well, treated like they do not vary in time

• in reality some probably do

– some measurements are just estimates with large error

– limited “information content”, e.g. one value per well

• Time Series, a.k.a. Signals

– e.g., water and gas production rates

– values vary in time

– large “information content”, dozens of values per well

– variability in time

• caused by the underlying process physics

• a detailed surrogate for an explicit description of the physics

• Synthetic Variables - are computed by equations or models

Static Variables

• Potential Model Inputs

– Geometric Variables - SURFace ELEVation, COAL ELEVation, COAL DEPTH, COAL THICKness

– COAL PERMeability, COAL POROsity - from cores, logs, and engineering estimates

– KV CONFining - permeability in vertical direction

Water & Gas Time Series

[Figure: detail of Monthly Gas Produced (MMCF) and Monthly Water Produced (bbl) vs. Production Month for Well 1, Well 2, and Well 3, out of 59 wells.]

Wells are similar, but are different in detail

Synthetic Variables

• Potential Model Inputs

– GIP - estimated gas in place from geometric variables

• Model Outputs

– CGP 25 - cumulative gas production to 25 psi, estimated by reservoir simulator using static variables

– CGP 50 - as for CGP 25, but to 50 psi.

[Figure: scatter plots. CGP 50 vs. 25: R2 = 0.99. CGP 25 vs. GIP: R2 = 0.39, R2ANN = 0.52.]

“Model the Model”

• A computer model can be treated as a black box. The box “maps” multiple inputs to an output.

• If, for all combinations of input values, the box computes a continuously differentiable output, its map can be learned very accurately by an ANN.

• A cause of non-differentiable output is switching by programmed logic in the model’s computer code.

• Discontinuous maps can be segmented by clustering, then modeled using multiple ANNs.

[Figure: Reservoir Simulator as a black box mapping inputs x1 to x6 to output y; a continuously differentiable output y plotted against x1 and x5.]

Chaotic Systems

• Modeled using Phase Space Reconstruction

• Takens Theorem, univariate systems (1980)

– $x(t) = F[x(t - p - d),\ x(t - p - 2d),\ \ldots,\ x(t - p - nd)]$

– current state-of-the-art

– implies

• x can be predicted at time t from an optimal number n of previous measurements (n is called the “dimension”)

• n measurements spaced at optimal time delays nd produce an optimal prediction

• Requirements/Options

– multivariate, variable delays - much better than Takens

– completely extracts “information content”
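The univariate form above amounts to building a table of lagged values. A minimal sketch, assuming integer sample indices for t, p, and d; the function name `delay_embed` and the toy series are illustrative only.

```python
def delay_embed(x, n, d, p=0):
    """Rows of ([x(t-p-d), x(t-p-2d), ..., x(t-p-nd)], x(t)) for every usable t."""
    rows = []
    first_t = p + n * d  # earliest t for which all n lagged values exist
    for t in range(first_t, len(x)):
        inputs = [x[t - p - j * d] for j in range(1, n + 1)]
        rows.append((inputs, x[t]))
    return rows


# Toy series whose values equal their own index, so the lags are easy to read.
series = list(range(10))
rows = delay_embed(series, n=3, d=2, p=1)
for inputs, target in rows:
    print(inputs, "->", target)
```

Each row pairs n delayed measurements with the value to be predicted; any mapping F (a look-up, a regression, or an ANN) can then be fitted to these rows.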

Phase Space Reconstruction

• Select d = 6 months by inspection (optimal delays can be computed for larger data sets).

• MWA = 6-month moving window average of water and gas production to remove high frequency variability.

[Figure: one well’s history, WATER MWA and GAS MWA, Water (BBL/m) & Gas (MCF/m) vs. Month of Production, 0 to 78.]
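A 6-month moving window average can be sketched as below. The slides do not say whether the window is trailing or centered, so this version assumes a trailing window and simply skips the first months that lack a full window; the sample values are made up.

```python
def moving_window_average(values, window=6):
    """Trailing moving average; output i covers values[i : i + window]."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]  # slide the window by one month
        averages.append(running / window)
    return averages


# Made-up monthly production values.
monthly = [1, 2, 3, 4, 5, 6, 7]
smoothed = moving_window_average(monthly, window=6)
print(smoothed)
```

Seven months with a 6-month window yield two averages: (1+2+3+4+5+6)/6 = 3.5 and (2+3+4+5+6+7)/6 = 4.5.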

Model CGP by ANN

• Develop succession of models with longer histories as inputs, checking R2ANN as you go.

• Point count N decreases with longer histories because some well histories are less complete.

• R2ANN at 6 months = 0.68, 0.93 at 36 months…good enough!

• RMSE = 344 MCF at 36 months relative to 6500 MCF actual full scale (5%).

[Figure: N (point count) and R2ANN x 100 (left axis, 0 to 100) and CUM GAS 25 Prediction RMSE (MMCF) (right axis, 0 to 800) vs. Month of Production, 6 to 36.]
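The accuracy figures quoted above can be reproduced with two short functions. This sketch assumes R2 means the coefficient of determination, 1 - SS_res/SS_tot; the slides do not define it, and the actual and predicted values below are invented.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Invented values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(f"R2 = {r_squared(actual, predicted):.2f}, RMSE = {rmse(actual, predicted):.3f}")

# The slide's relative error: RMSE of 344 against a full scale of 6500.
relative_error_percent = 344 / 6500 * 100
print(f"{relative_error_percent:.1f}% of full scale")
```

The last line gives about 5.3%, which the slide rounds to 5%.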

Results

[Figure: two panels of Predicted CGP25 (BCF) vs. CGP25 (BCF). Actual & Predicted using 6 months of history, R2ANN = 0.68. Actual & Predicted using 36 months of history, R2ANN = 0.93.]

Coal Gas Summary

• Time series have higher information content, less noise than static variables.

• Phase Space Reconstruction

– leverages hidden physics manifest in time series variability

– high accuracy

– extensible to other gas fields, collections of fields, other problems, other domains

Conclusions

• Your process - room for improvement?

• Data Mining

– fast, powerful, decisive

– automates knowledge acquisition, produces predictive models

– solves problems that are unsolvable by any other means.

• Advanced visualization makes results understandable to all.

Archive

Modeling Chaos

• Takens Theorem (1980), univariate systems

– $x(t) = F[x(t - p - d),\ x(t - p - 2d),\ \ldots,\ x(t - p - nd)]$

• each t represents a vector of measurements

• p = the “time delay” of the most recent measurement available

– implies a “prediction horizon”

• n and d = “dynamical invariants”

– analogous to amplitudes, periods, and phases in periodic systems

• n + 1 = “embedding dimension”

– the number of previous measurements

– implies an optimal number of previous measurements

– d = characteristic “time delay” of the attractor

– implies an optimal spacing in time for the previous measurements

• F is an arbitrary mapping function, e.g., “look up”, regression, or ANN, whatever gives the best results

Modeling Chaos

• ADM, multivariate “Takens”

– $y(t) = F[x_i(t - p_i),\ x_i(t - p_i - d_i),\ \ldots,\ x_i(t - p_i - n_i d_i)]$

• $i$, $p_i$, $n_i$, and $d_i$ are dynamical invariants

• $i$ = number of input variables

• $x_i$ = model input variables

– implies an optimal set

• $p_i$ = time delay of peak (optimal) correlation between $y$ and each $x_i$

• $n_i + 1$ = the embedding dimension of the attractor of each $x_i$

• $d_i$ = characteristic “time delay” of the attractor $x_i$

• F is an arbitrary mapping function, generally an ANN

Modeling Chaos

• ADM generalization

– $y(t) = F[x_i(t - p_i),\ x_i(t - p_i - d_{ij}),\ \ldots,\ x_i(t - p_i - d_{in})]$

• $d_{in}$ replaces $d_i$, a variable delay

• For medium data sets (300 to 1000 vectors)

– $y(t) = F[x_i(t - p_i),\ x'_i(t - p_i - d_{ij}),\ \ldots,\ x'_i(t - p_i - d_{in})]$

• replace $x_i$ with derivatives $x'_i$ to mitigate the tendency of aggressive regression techniques to overfit data

• also amplifies effects of small changes

• Dynamical Invariants computed by systematic search
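The multivariate, variable-delay form can be sketched as a table builder in which each input variable has its own leading delay and its own list of extra delays. The data layout (`specs` as a dict of `(p, delays)` pairs), the variable names "a" and "b", and the toy series are assumptions for illustration; the systematic search for the invariants is not shown.

```python
def multivariate_embed(inputs, y, specs):
    """Rows of (features, y(t)), where each input i contributes
    x_i(t - p_i) followed by x_i(t - p_i - d) for every d in its delay list."""
    max_lag = max(p + max(delays, default=0) for p, delays in specs.values())
    rows = []
    for t in range(max_lag, len(y)):
        features = []
        for name, (p, delays) in specs.items():
            series = inputs[name]
            features.append(series[t - p])
            features.extend(series[t - p - d] for d in delays)
        rows.append((features, y[t]))
    return rows


# Toy inputs with values that make the chosen lags easy to read.
inputs = {
    "a": list(range(10)),             # 0, 1, ..., 9
    "b": [10 * v for v in range(10)]  # 0, 10, ..., 90
}
y = [100 + v for v in range(10)]

# Hypothetical invariants: "a" has p = 1 and delays 2 and 5; "b" has p = 0 and delay 3.
specs = {"a": (1, [2, 5]), "b": (0, [3])}
rows = multivariate_embed(inputs, y, specs)
print(rows[0])
```

With these settings the first usable time is t = 6, giving the features [a(5), a(3), a(0), b(6), b(3)]. Replacing a lagged series with its differences before embedding would correspond to the derivative variant described for medium data sets.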