Post on 26-Dec-2015
1
A Case Study of Bayesian Modeling on a Real World Problem
RAM Energy Energester/Enziro
Bob Mattheys, Malcolm Farrow, Giles Oatley, Garen Arevian, Souvik Banerjee
2
ISS – Intelligent Systems Solutions
Group of researchers/academics
Working with CAS (Centre for Adaptive Systems)
Remit:
Provide Technology Transfer and Expertise to Industry
Assist NE SMEs and stimulate business growth
Obtain funding, e.g. SMART Awards, GONE, etc.
3
ISS Projects
RAM Energy – Intelligent Data Analysis
Neptune Engineering – Intelligent Diagnostics
HASS – Back-office system/DBase
Hart Biological – Back-office system/DBase, process manufacturing
Etc.
4
RAM Energy
Founded 2000
Clients in Oil/Gas, Energy, Process, Manufacturing, Haulage Industry
Products: Energester + Enziro
Ester-based synthetic lubricants and greases, enzymatic cleaning solutions, absorbents and blasting media
Better lubrication, heat dissipation and vibration reduction than oil or grease in isolation and conventional additives
5
RAM Energy
Problem
Demonstrate effectiveness and cost efficiency
Data collected by RAM Energy:
very large
major differences across the various sectors
Assist RAM Energy in structuring their data collection and storage in general
Heavy haulage industry
6
RAM Energy
Trials
RAM Energy carried out select trials with clients. These included:
Monitored consumption prior to Energester use
Monitored consumption post Energester use
Use of control vehicles (no Energester use)
Temperature data collected
7
RAM Energy Haulage
Data collected via diesel receipts
Information consisted of:
Card number (allocated to registration number)
Vehicle registration
Date
Fuel
Mileage
8
Registration Number  Date      Reg Entered  Fuel Added  Mileage
J577PWL              20020901  DX51MYT      276.19      128504
J577PWL              20020902  DX51MTY      296.51      129130
J577PWL              20020904  DX51MYT      288.88      999
J577PWL              20020905  J577PWL      235.95      666
J577PWL              20020907  J577PWL      346         1
J577PWL              20020907  J577PWL      234.86      1
J577PWL              20020908  DX51NYT      211         99999
J577PWL              20020909  DX51MYT      447.73      11
J577PWL              20020910  51           286.24      4717
J577PWL              20020910  DX51MYT      253.07      135300
J577PWL              20020911  DX51MYT      281         1
J577PWL              20020912  51           220.66      1000
J577PWL              20020912  DX51MYT      260         1
J577PWL              20020913  DU02PBY      325         1
J577PWL              20020914  DU02PBY      255.59      109705
J577PWL              20020915  DU02RBY      267.17      110296
J577PWL              20020915  2            267.62      120889
J577PWL              20020916  DU02PBY      182.16      111563
J577PWL              20020916  DU52PBY      260.02      112043
J577PWL              20020917  2            263.91      2646
J577PWL              20020917  DU02PBY      224.81      113223
J577PWL              20020918  2            251.09      3773
J577PWL              20020918  DU02PBY      224.67      114513
9
RAM Energy
Analysis
Performed using Excel spreadsheets
Discrete mpg (mileage since last fill/diesel input)
Some cumulative mpg (using total mileage/total diesel input to date)
Attempt to normalise using mean temperature records
Some regression analysis
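The discrete and cumulative mpg figures described on this slide can be sketched as below. This is only an illustration, not the original spreadsheet logic: it assumes fuel is recorded in litres, that each fill replaces the fuel used since the previous fill, and that odometer readings are already clean. The function name `mpg_series` is hypothetical.

```python
LITRES_PER_IMPERIAL_GALLON = 4.54609  # assumption: receipts record litres


def mpg_series(fills):
    """fills: list of (odometer_miles, litres_added) in date order.

    Returns (discrete, cumulative) mpg lists, one value per fill after the first.
    Assumes each fill tops up exactly what was burned since the previous fill.
    """
    discrete, cumulative = [], []
    total_miles = 0.0
    total_gallons = 0.0
    for (prev_odo, _), (odo, litres) in zip(fills, fills[1:]):
        miles = odo - prev_odo
        gallons = litres / LITRES_PER_IMPERIAL_GALLON
        total_miles += miles
        total_gallons += gallons
        discrete.append(miles / gallons)                # mileage since last fill / fuel input
        cumulative.append(total_miles / total_gallons)  # total mileage / total fuel to date
    return discrete, cumulative
```

The cumulative series smooths out partial fills, which is one reason it tends to be steadier than the discrete one.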
10
[Figure: Fuel Consumption, Rover 75 W608 UOH. MPG (vertical axis, 32 to 52) against Fill No. (horizontal axis, 1 to 71). Series: Discrete MPG, Cumulative MPG, Adjusted MPG.]
11
RAM Energy Results
                   No seasonal adjustment  With seasonal adjustment
After Energester   42.94                   43.46
Before Energester  42.66                   42.64
Percentage gain    0.64%                   1.92%
12
RAM Energy Problems
Missing data consisted of:
Driver information (who?)
Loading information (full/empty)
Length of journey
Type of journey (long haul vs short haul)
Urban or motorway conditions
Etc.
13
RAM Energy Conclusion
Results very poor and inconclusive
14
Database
Excel sheets were converted to an Access database with deletion of unnecessary rows and columns.
The Access database was then imported into SQL Server for data query and subsequent analysis
15
Data Cleansing
Brief outline of most obvious problems with the data
1. Card Number
2. Registration Number
3. Date
4. Fuel Added
5. Mileage
16
Card Number
There were duplicate Card Numbers for (presumably) the same Card, e.g. 85944 and 0085944
In a few cases, for a given Registration Number, there appear additional Card Numbers, e.g. for ‘N151EUB’ there are the Card Numbers: 38195, 0038195, 56408
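One way to handle the leading-zero duplicates is to normalise card numbers before grouping them by registration. The sketch below is an assumption about how this could be done, not the cleansing code actually used; `normalise_card` and `cards_by_registration` are hypothetical names.

```python
from collections import defaultdict


def normalise_card(card):
    """Strip whitespace and leading zeros so '0085944' and '85944' compare equal."""
    stripped = str(card).strip().lstrip("0")
    return stripped or "0"


def cards_by_registration(rows):
    """rows: iterable of (registration, card_number) pairs.

    Returns {registration: set of normalised card numbers}.
    """
    cards = defaultdict(set)
    for registration, card in rows:
        cards[registration].add(normalise_card(card))
    return dict(cards)
```

For the ‘N151EUB’ example, 38195 and 0038195 merge into one card, leaving 56408 as a genuinely different number that still needs a manual decision.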
17
Registration Number
Registration numbers seemed to be always entered correctly
However, the field Reg Entered did not always tally with this
RAM recommendation to ignore
18
Date
Dates were entered very consistently and preserved:
the ordering
the distance between dates
the actual date
An important question was: CAN WE PRESUME THE DATE IS ALWAYS ENTERED CORRECTLY?
If this was so, then this provided us with a convenient check on the Mileage, as Date and Mileage should both increase together.
19
Fuel
Outlier identification
Very small and very large values easily detected over large dataset
Take mean of the sample and flag as outliers data more than 3 or 4 SDs away from the mean
Very small values, e.g. 0 or 1, assumed as bogus values
9999, 999, etc. taken to be bogus values
Some small and large values mistyped, with either the decimal place occurring too soon (e.g. 38.6 instead of 386) or extra digits added (e.g. 3860 instead of 386)
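The outlier rule on this slide might look like the sketch below. It is a minimal illustration: the set of placeholder values and the choice to exclude them before computing the mean and SD are assumptions (leaving a 9999 in the sample would inflate the SD and hide real outliers). `flag_fuel_outliers` is a hypothetical name.

```python
import statistics

# Assumption: illustrative placeholder values treated as bogus outright.
BOGUS = {0, 1, 999, 9999}


def flag_fuel_outliers(values, k=3.0):
    """Label each fuel value 'bogus', 'outlier' (more than k SDs from the mean) or 'ok'."""
    plausible = [v for v in values if v not in BOGUS]
    mean = statistics.mean(plausible)
    sd = statistics.stdev(plausible)
    flags = []
    for v in values:
        if v in BOGUS:
            flags.append("bogus")
        elif abs(v - mean) > k * sd:
            flags.append("outlier")
        else:
            flags.append("ok")
    return flags
```

A value such as 38.6 typed for 386 may sit within 3 SDs of the mean and pass this test, so a mean-and-SD rule catches extra-digit errors more readily than misplaced decimals.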
20
Fuel
Difficult errors
e.g. 693392: could be 69392? What if 693399?
Data must be flagged as erroneous
21
Mileage
Some values were entered as {0,1,999,9999,2,3,5,10,111,1111,123,789, etc}
If we can presume that the Date is a sensible value, then in a dataset where there are only a few missing or obviously incorrect values for the Mileage, these values can be amended as follows
22
Mileage
Day  Mileage  Spurious?
11   300
12   400
13   500      ?
14   450      ?
We do not know if the day 13 entry is wrong, or day 14. So we can look ahead:
23
Mileage
Day  Mileage  Spurious?
11   300
12   400
13   500
14   450      ?
15   510
Or
Day  Mileage  Spurious?
11   300
12   400
13   500      ?
14   450
15   470
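The look-ahead reasoning in the two tables can be written as a small rule. This is a sketch of the idea only, assuming the dates are trusted and only one of the two readings is wrong; `spurious_index` is a hypothetical helper.

```python
def spurious_index(mileage, i):
    """Given a decrease between mileage[i] and mileage[i + 1], look one reading ahead.

    Returns the index judged spurious (i or i + 1), or None if the next
    reading does not settle the question.
    """
    a, b, nxt = mileage[i], mileage[i + 1], mileage[i + 2]
    if nxt >= a:
        return i + 1  # series resumes above a, so b was the dip
    if b <= nxt < a:
        return i      # series continues from b, so a was the spike
    return None
```

With the day 11 to 15 readings from the slide, `[300, 400, 500, 450, 510]` marks the 450 entry and `[300, 400, 500, 450, 470]` marks the 500 entry.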
24
Mileage
Trans Quantity (Fuel Added)      Odometer (Mileage)
182.04                           55525
236                              0
290                              1
268.33                           57589
Collapsed to:
Trans Quantity (Fuel Added)      Odometer (Mileage)
182.04                           55525
236 + 290 + 268.33 = 794.33      57589
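The collapsing step can be sketched as follows: fuel from rows with an unusable odometer reading is carried forward and added to the next row with a trustworthy reading. The set of placeholder odometer values is an assumption for illustration, and `collapse_fills` is a hypothetical name.

```python
# Assumption: odometer values treated as placeholders rather than real readings.
BOGUS_ODOMETER = {0, 1, 999, 9999}


def collapse_fills(rows):
    """rows: list of (fuel_added, odometer) in date order.

    Returns rows with bogus-odometer fills merged into the next valid reading.
    Fuel on trailing bogus rows (no later valid reading) is dropped.
    """
    collapsed = []
    pending_fuel = 0.0
    for fuel, odometer in rows:
        pending_fuel += fuel
        if odometer in BOGUS_ODOMETER:
            continue  # carry this fuel forward
        collapsed.append((round(pending_fuel, 2), odometer))
        pending_fuel = 0.0
    return collapsed
```

Applied to the four rows on the slide, this yields the two collapsed rows shown there, with 794.33 against odometer 57589.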
25
Mileage
Small and very large values could be ignored
Problem was determining whether any of the remaining data was valid (data validation)
Evaluating the degree of correlation between the increasing Date and the supposed increasing Mileage
Useful approaches for estimating rank-orderedness and correlation between lists:
Spearman’s coefficient of rank correlation
Kendall’s Tau
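Both rank measures can be computed directly from a list of dates and the matching odometer readings. The sketch below is a plain-Python illustration (Kendall's tau in its simplest tau-a form, with ties counted as neither concordant nor discordant); it is not the validator used in the project, and all function names are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = average
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)
```

A vehicle whose mileage rises with every date scores 1 on both measures; values well below 1 suggest that many of the remaining odometer entries are out of order.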
26
Data Cleansing
27
Ram Energy Data Validator
30
Bayesian - Approach
In the Bayesian approach to statistical inference, we express uncertain beliefs about things in terms of probability
E.g. that there is a 50% chance that the average fuel consumption of a vehicle will be less than 30mpg
Can use probabilities in this way to describe uncertainty about things we do not know
E.g. amount of fuel in a vehicle’s tank at 10.00am yesterday
31
Bayesian - Approach
Once we accept this view of probability, the principle for learning from data is simple
Before we see the data, we have a probability distribution based on our knowledge up to that point: the prior distribution
When we see the data, our probability distribution changes in the light of the new information in the data: the posterior distribution
32
Bayesian - Approach
Calculation used to get from the prior distribution to the posterior distribution
Uses Bayes’ theorem
Hence Bayesian statistics
Very straightforward interpretation of the results when using this method
Posterior distribution tells us how likely it is that various things are true, after we have used the evidence in the data
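The step from prior to posterior can be illustrated on a grid of candidate values, where Bayes' theorem reduces to "multiply prior by likelihood, then normalise". The sketch below is purely illustrative: the observations, the flat prior and the known measurement SD are all made-up assumptions, chosen to echo the "less than 30mpg" example.

```python
import math


def grid_posterior(grid, prior, observations, sd):
    """Bayes' theorem on a grid: posterior is proportional to prior x likelihood."""
    unnormalised = []
    for theta, p in zip(grid, prior):
        # Normal likelihood with known sd; constant factors cancel on normalising.
        log_lik = sum(-0.5 * ((y - theta) / sd) ** 2 for y in observations)
        unnormalised.append(p * math.exp(log_lik))
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


grid = [20 + i / 10 for i in range(201)]   # candidate mean mpg values, 20.0 to 40.0
prior = [1 / len(grid)] * len(grid)        # flat prior: very little information
observed_mpg = [31.0, 32.5, 30.5]          # hypothetical observations
posterior = grid_posterior(grid, prior, observed_mpg, sd=2.0)

# Posterior probability that the average consumption is below 30 mpg.
prob_below_30 = sum(p for theta, p in zip(grid, posterior) if theta < 30)
```

Under the flat prior roughly half the probability lies below 30 mpg; after three observations above 30 the posterior probability of that event is much smaller.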
33
Bayesian - Approach
Different observers can have different prior beliefs, and this means that their posterior distributions will also be different
Can make the prior distribution represent very little information
In practice the prior then tends to have little effect on the posterior
One advantage of this approach is that it is straightforward to calculate what we expect various things to be after seeing the data
For example, can calculate a posterior probability distribution for the cost savings of applying the fuel additive to a whole vehicle fleet
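A posterior distribution for fleet-wide cost savings can be obtained by pushing posterior draws of the additive effect through a cost formula. The sketch below assumes, only for illustration, that the posterior for the effect (in litres per mile) is approximately normal; the mean, SD, fleet mileage and fuel price used in any call are hypothetical inputs, not results from the study.

```python
import random


def fleet_savings_draws(effect_mean, effect_sd, fleet_miles, price_per_litre,
                        n=10000, seed=0):
    """Monte Carlo draws of the fleet cost saving.

    effect_*: assumed normal posterior for the additive effect in litres per mile
    (negative means less fuel used). Returns a list of n savings values.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        effect = rng.gauss(effect_mean, effect_sd)
        draws.append(-effect * fleet_miles * price_per_litre)
    return draws
```

From the draws one can read off summaries such as the mean saving or the fraction of draws with a positive saving, which is the posterior probability that the additive pays off before its own cost is counted.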
34
Bayesian - Model
The basic model used is a regression, with fuel used as the dependent variable and distance travelled as one of the explanatory variables
Each observation corresponds to the time between two successive additions of fuel to the fuel tank
We expect zero fuel to be used if zero distance were travelled, but the amount of fuel used is not necessarily proportional to the distance travelled
For example, fuel efficiency may be greater on longer journeys
35
Bayesian - Model
In the simplest form of the model, assume that fuel used is proportional to distance travelled
The constant of proportionality is the slope of the line on a graph
Various other forms of relationship were also investigated.
While distance travelled is most obvious explanatory variable, there are several other variables and factors which must be taken into account
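For the simplest proportional model, fuel = slope × distance + noise, the posterior for the slope has a closed form if the noise SD is treated as known and the slope is given a normal prior. This is a textbook conjugate sketch under those assumptions, not the model actually fitted in the project; `slope_posterior` and its parameters are hypothetical.

```python
def slope_posterior(distances, fuels, prior_mean, prior_sd, noise_sd):
    """Posterior (mean, sd) of the slope in fuel = slope * distance + noise.

    Assumes a Normal(prior_mean, prior_sd) prior on the slope and
    Normal(0, noise_sd) noise with noise_sd known.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = sum(d * d for d in distances) / noise_sd ** 2
    post_precision = prior_precision + data_precision
    weighted_sum = sum(d * f for d, f in zip(distances, fuels)) / noise_sd ** 2
    post_mean = (prior_mean * prior_precision + weighted_sum) / post_precision
    return post_mean, post_precision ** -0.5
```

With a vague prior (large `prior_sd`) the posterior mean is close to the least-squares slope through the origin, which matches the remark that a weak prior has little effect on the posterior.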
36
Bayesian - Factors
Vehicle Types
Type of vehicle has an effect
Individual vehicles of the same type may also have different characteristics
Effect of individual vehicles (within a type) was regarded as a random effect
Vehicles seen as a sample from all vehicles of that type
37
Bayesian - Factors
Drivers
Driver identified by card number
Drivers closely associated with vehicles
In this case, difficult to separate effects of vehicles from the effects of drivers
However, if this were not the case, then it would be possible to make inferences about individual drivers as well as individual vehicles
38
Bayesian - Factors
Time of year
Fuel efficiency may be affected by ambient temperature/meteorological variables
Ideally use meteorological data
Obtained data for this purpose
But, as a first step, a simple substitute is to use the time of year, e.g. month
39
Bayesian - Factors
Presence of fuel additive
The main question of interest is, “How does the use of the fuel additive affect fuel consumption?”
40
Bayesian - Complications
Fuel: not known
How full the fuel tank was before or after fuel was added
Precisely how much fuel was used between fills
True tank content regarded as a latent or “hidden” variable
Such variables can be built into a Bayesian analysis
41
Bayesian - Complications
Data entry errors
Graph of odometer readings against date for a single vehicle shows the general pattern, with spurious values
This is built into the model by allowing certain prior probabilities for errors of different types
The analysis can thus “recognise” errors by calculating posterior probabilities that a reading is an error of the various types
Those values which have large posterior probabilities of being erroneous are, in effect, ignored by the rest of the analysis.
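The idea of "recognising" errors through posterior probabilities can be shown with a two-component mixture: a reading is either genuine (close to the value expected from the date) or an error (could be anything). The sketch below uses a single error type with a uniform distribution, which is a simplifying assumption; the slides describe several error types, and every parameter here is hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def prob_error(reading, expected, sd, prior_error, max_reading):
    """Posterior probability that an odometer reading is a data-entry error.

    Genuine readings: Normal(expected, sd). Errors: uniform on [0, max_reading].
    prior_error is the prior probability that any given reading is an error.
    """
    likelihood_ok = normal_pdf(reading, expected, sd)
    likelihood_error = 1.0 / max_reading
    numerator = prior_error * likelihood_error
    return numerator / (numerator + (1 - prior_error) * likelihood_ok)
```

A reading of 999 for a vehicle expected to be near 129,000 miles gets a posterior error probability of essentially 1, while a reading a few hundred miles from the expected value keeps a probability near 0 and so continues to inform the rest of the analysis.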
42
Bayesian - Conclusions
Prototype Bayesian models were successfully run
Demonstrated feasibility of approach for this problem
However:
Need to overcome problems of missing data
Uncertainty over when additive would be expected to have an effect
Pattern of this effect
Confounding of additive effect with the effects of other factors such as the changing seasons
43
Bayesian Results
Posterior probability density for the effect of the additive, in litres per mile
44
Conclusions
Recommendations:
Design of better trials and data acquisition
Collection of ambient temperatures, etc.
Future Directions:
Fraud detection
Efficiency of individual drivers/vehicles
Patterns of work, optimisation