
Oil & Gas Analytics: Cluster Analysis of Fuel Price History

APPLIES TO

Cluster Analysis of Fuel Price History. SAP Predictive Analysis. For more information, visit the Predictive Analysis Space.

SUMMARY

SAP Predictive Analysis is a software product that SAP is actively bringing to market. Its functionality and areas of application are under constant development. In this paper we discuss one possible area of application for SAP Predictive Analysis: dividing fuel price history accumulated over several years into statistically meaningful clusters. Cluster analysis of fuel price history differs in an important way from cluster analysis used to segment fuel stations, fuel depots or any other "static" objects that exist in the framework of Oil & Gas Secondary Distribution and Retail Network Operations: its outcome is a grouping of the price history data points into a number of statistical "patterns" (i.e., clusters). This paper is not scientific research; it discusses a possible practical way to use SAP Predictive Analysis for visual and insightful analysis of fuel price history.

Author: Sergey LUKYANCHIKOV
Company: SAP France S.A.
Created on: February 11, 2013

AUTHOR BIO

Sergey LUKYANCHIKOV, SAP France S.A., is an Analytics Solution Principal at SAP Performance & Insight Optimization (SAP PIO) and a specialist in Energy, Utilities and Services.

http://scn.sap.com/community/predictive-analysis

http://www.sap.com/services/portfolio/predictive-analytics/index.epx


TABLE OF CONTENTS

MAJOR CORRELATIONS IN FUEL PRICE HISTORY DATA

IDENTIFICATION OF FUEL PRICE HISTORY CLUSTERS
Selection of Descriptive Clustering Variables
Configuring R-K-Means Algorithm
Applying R-K-Means Algorithm

VISUALIZATION OF FUEL PRICE HISTORY CLUSTERS
Cluster 5
Clusters 2 and 4
Clusters 1 and 3
Stability of Clusters

ASSESSMENT OF INFLUENCING FACTORS
Selection of Influencing Factor Candidates
Configuring R-CNR Tree Algorithm
Applying R-CNR Tree Algorithm

VISUALIZATIONS OF INFLUENCING FACTORS
Influencing Factor – Week
Influencing Factor – Year

RELATED CONTENT



MAJOR CORRELATIONS IN FUEL PRICE HISTORY DATA

To explore how SAP Predictive Analysis handles cluster analysis, we generated a set of fuel price history data. To do so, we studied a number of publicly available sources of fuel price statistics and then carefully modeled and simulated, in a separate dataset, some of the most typical "behaviors" encountered in those sources. That dataset is the source of abstract data for our fuel price clustering prototype in SAP Predictive Analysis: abstract prices for an abstract fuel, expressed in an abstract currency, evolve there over 13 years and across 9 abstract geographical regions. Two variables are measured in our source data:

Lowest Own Price – the lowest of all the weekly average prices for the abstract fuel applied by the fuel stations operated (directly plus via a network of "softly regulated" dealers) by our imaginary fuel distribution company

Average Competitor Price – the weekly average of all the prices for the same abstract fuel applied by the fuel stations operated by the competitors of our imaginary fuel distribution company
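As an illustration of the two definitions above, here is a minimal Python sketch of how both measures could be derived from station-level price observations. The record layout, the station IDs and the prices are hypothetical, and reading Lowest Own Price as the minimum of the per-station weekly averages is an assumption about the definition, not a description of how the source dataset was built.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (station_id, is_own_station, week, price).
records = [
    ("S1", True, 1, 1.40), ("S1", True, 1, 1.44),
    ("S2", True, 1, 1.38), ("S2", True, 1, 1.40),
    ("C1", False, 1, 1.45), ("C2", False, 1, 1.41),
]

def weekly_measures(records):
    """Return {week: (lowest_own_price, average_competitor_price)}."""
    own = defaultdict(lambda: defaultdict(list))  # week -> station -> prices
    comp = defaultdict(list)                      # week -> competitor prices
    for station, is_own, week, price in records:
        if is_own:
            own[week][station].append(price)
        else:
            comp[week].append(price)
    result = {}
    # Only weeks that have both own and competitor observations are reported.
    for week in sorted(set(own) & set(comp)):
        # Lowest Own Price: minimum of the per-station weekly averages.
        lowest_own = min(mean(prices) for prices in own[week].values())
        # Average Competitor Price: mean of all competitor prices in the week.
        result[week] = (lowest_own, mean(comp[week]))
    return result

measures = weekly_measures(records)
```

In this made-up week, station S2 has the lower weekly average among the own stations, so its average becomes the Lowest Own Price, while the two competitor prices are simply averaged.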

In the screenshot below, the price history data points are plotted against both of the above variables:

As one can see, there is a significant degree of correlation between the own and competitor prices. Nevertheless, a number of own fuel stations in certain regions may apply higher prices than the competition. In our further prototyping exercise, we will see whether such differences in pricing behavior between our imaginary company and its competition could help identify clusters in the overall fuel price history data.

IDENTIFICATION OF FUEL PRICE HISTORY CLUSTERS

Since we are interested in how the overall fuel price history is clustered relative to the "interplay" of own and competitor prices, it is logical to include both Lowest Own Price and Average Competitor Price as clustering variables. What other variables available in the source data should be involved in the cluster identification process?

Selection of Descriptive Clustering Variables

Let us open our source data in the Prepare view of SAP Predictive Analysis:



We see that apart from "inevitable" clustering inputs such as Lowest Own Price and Average Competitor Price, there are several other variables that serve as descriptive characteristics of the fuel price history data points:

Year – the calendar year

Week – the calendar week

Region Code – the code of the geographical region

Year and Week are an obvious choice, since without them our cluster analysis would lack a "time dimension". Whether to include Region Code among the clustering variables is less clear. We preferred not to include it, in order to capture in the resulting clusters a "pure interplay" of the own and competitor prices over time.

Configuring R-K-Means Algorithm

One of the most common algorithms used for cluster analysis is k-means. In the Predict view of SAP Predictive Analysis, the R-K-Means algorithm node comes right after the source data acquisition node in our analysis process:



In the properties of the R-K-Means node, the following four clustering variables are selected:

Lowest Own Price

Average Competitor Price

Year

Week

The maximum number of k-means clusters is set to 5.

Applying R-K-Means Algorithm

The analysis process in SAP Predictive Analysis, when executed up to and including the R-K-Means node, delivers the following algorithm summary graphs:



From the variety of information contained in the algorithm summary graphs, we should single out the following fundamental "message": there are five statistically meaningful "patterns" in our fuel price history data. The most general characteristics of those "patterns" are provided in the "Cluster Density and Distance" graph:

Clusters 2 and 4 – contain 1512 data points each, and each is characterized by the highest degree (of all five clusters) of data point "dispersion" inside the cluster (relative to the "cluster center")

Cluster 5 – contains 1323 data points and is characterized by the second highest degree (of all five clusters) of data point "dispersion" inside the cluster

Clusters 1 and 3 – contain 855 and 864 data points respectively, and each is characterized by the lowest degree (of all five clusters) of data point "dispersion" inside the cluster
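The two indicators summarized above, cluster size and dispersion around the cluster center, can be reproduced outside the tool with a few lines of plain k-means. The following Python sketch is only an illustration: it is not the implementation behind the R-K-Means node, it uses two synthetic groups of (own price, competitor price) points instead of the paper's data, it starts from explicitly chosen centers to stay reproducible, and it takes the mean Euclidean distance to the center as an assumed stand-in for the "dispersion" shown in the graph.

```python
import math
import random

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def kmeans(points, initial_centers, iterations=100):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    centers = list(initial_centers)
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                # Move the center to the coordinate-wise mean of its members.
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, assign(points, centers)

def cluster_summary(points, centers, labels):
    """Per cluster: (number of points, mean distance to the cluster center)."""
    summary = {}
    for c, center in enumerate(centers):
        dists = [math.dist(p, center) for p, l in zip(points, labels) if l == c]
        summary[c] = (len(dists), sum(dists) / len(dists) if dists else 0.0)
    return summary

# Synthetic data (an assumption for illustration): one tight and one loose group.
rng = random.Random(42)
tight = [(1.0 + rng.uniform(-0.01, 0.01), 1.0 + rng.uniform(-0.01, 0.01)) for _ in range(30)]
loose = [(2.0 + rng.uniform(-0.2, 0.2), 2.0 + rng.uniform(-0.2, 0.2)) for _ in range(50)]
points = tight + loose

centers, labels = kmeans(points, initial_centers=[points[0], points[-1]])
summary = cluster_summary(points, centers, labels)
```

Here `summary[0]` describes the tight group and `summary[1]` the loose one, so the second value of each pair plays the role that "dispersion" plays in the list above. With variables on very different scales (such as Year next to a price), rescaling the inputs before clustering is a common precaution that this sketch leaves out.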

The other algorithm summary graphs (specifically, "Cluster Variable Comparison" and "Cluster Comparison") provide additional information about the relative "influence" of a specific variable on the data points in a specific cluster and, vice versa, about the relative "profile" of a specific cluster in the data point distribution along the value range of a specific variable. In particular, "Cluster Comparison" suggests that cluster 5 should represent the "least disturbed" pattern in the available fuel price history. In the following chapter we will visually verify the meaning of the algorithm summary graphs.

VISUALIZATION OF FUEL PRICE HISTORY CLUSTERS

The results of the R-K-Means algorithm run are shown in the screenshot below:



We need to understand, at least intuitively, how what we have seen in the algorithm summary graphs in the previous chapter connects to what we see in the cluster visualization above.

Cluster 5

Data points belonging to cluster 5 are visualized as the "most linear" pattern compared to the other clusters. This is a somewhat expected result: in the previous chapter, while taking a quick look at the "Cluster Comparison" graph, we noticed that cluster 5 was the "least disturbed" one (by the four clustering variables) among all five clusters. In business terms, cluster 5 represents a "close match" pattern in the pricing behavior of our company and its competition. Under this pattern, both our company and the competitors tend to apply similar fuel prices.

Clusters 2 and 4

Data points from clusters 2 and 4, as shown in the above visualization, form the "most loosely scattered" patterns compared to the patterns represented by the other clusters. Referring to the "Cluster Density and Distance" graph from the previous chapter, we should acknowledge that the data point "density" indicator there corresponds to what we visually observe in the above screenshot. The business meaning of clusters 2 and 4 could be formulated as a "strong price volatility" pattern: own prices are either substantially lower or substantially higher than competitor prices.

Clusters 1 and 3

As to clusters 1 and 3, the above visualization of their data points gives two "moderately scattered" patterns in comparison with the other clusters. On the one hand, the "Cluster Density and Distance" graph from the previous chapter communicates the lowest values of the data point "density" indicator for clusters 1 and 3 (i.e., the lowest dispersion around the cluster center). On the other hand, since that "density" is measured relative to the "cluster center", it does not prevent the data points of those clusters from still having a visible scatter due to pattern non-linearity.

Clusters 1 and 3, if projected onto business, should be interpreted as "moderate price volatility" patterns. In this case, own prices may visibly deviate from competitor prices, but the deviations would be limited in scope and/or could be linked to specific influencing factors (seasons, price change rate, etc.).
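The business readings above ("close match", "moderate" and "strong" price volatility) are based on visual inspection. One possible way to put numbers next to them is to compute, per cluster, the correlation between own and competitor prices and the mean absolute price differential. The Python sketch below assumes rows of the form (cluster ID, own price, competitor price); the rows are made up for illustration and the two indicators are a suggestion, not measures used in the paper.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def profile_by_cluster(rows):
    """Return {cluster_id: (correlation, mean absolute price differential)}."""
    groups = defaultdict(list)
    for cluster_id, own, competitor in rows:
        groups[cluster_id].append((own, competitor))
    profile = {}
    for cluster_id, pairs in groups.items():
        own, competitor = zip(*pairs)
        mean_abs_diff = sum(abs(o - c) for o, c in pairs) / len(pairs)
        profile[cluster_id] = (pearson(own, competitor), mean_abs_diff)
    return profile

# Made-up rows: cluster 5 tracks the competition closely, cluster 2 does not.
rows = [
    (5, 1.00, 1.01), (5, 1.10, 1.11), (5, 1.20, 1.21), (5, 1.30, 1.31),
    (2, 1.00, 1.30), (2, 1.40, 1.10), (2, 1.20, 1.50), (2, 1.60, 1.25),
]
profile = profile_by_cluster(rows)
```

In this toy input, the rows labeled 5 show a correlation close to 1 and a small differential, while the rows labeled 2 show a weak correlation and a much larger differential, which is the kind of contrast the "close match" and "strong price volatility" readings describe.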



Stability of Clusters

As the analysis progresses, an important question arises: how stable are the clusters identified by SAP Predictive Analysis? In more practical terms, would the distribution of the fuel price history data points among the clusters change drastically if we ran the R-K-Means algorithm several times on the same source data sample? The easiest way to answer is simply to run R-K-Means several more times (see the screenshots below).

Re-Run 1:

Re-Run 2:

Re-Run 3:

Re-Run 4:
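Re-runs of this kind can also be compared numerically rather than only by eye. Since k-means cluster IDs are arbitrary labels, a fair comparison has to ignore how the clusters happen to be numbered. The Python sketch below is one minimal way to do that, assuming that the cluster IDs of two runs are available as lists over the same data points. It uses 0-based IDs and tiny made-up label lists, and it tries every renaming of the IDs, which is affordable for five clusters (120 permutations).

```python
from itertools import permutations

def best_label_agreement(labels_a, labels_b, k):
    """Share of points that agree under the best renaming of run B's cluster IDs."""
    # counts[i][j] = number of points with ID i in run A and ID j in run B.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        counts[a][b] += 1
    best_hits, best_map = -1, None
    # perm[i] is the run-B ID that is matched with run-A ID i.
    for perm in permutations(range(k)):
        hits = sum(counts[i][perm[i]] for i in range(k))
        if hits > best_hits:
            best_hits, best_map = hits, perm
    return best_hits / len(labels_a), best_map

# Made-up example: the same five groups, numbered differently in the two runs.
run_a = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
run_b = [3, 3, 0, 0, 4, 4, 1, 1, 2, 2]
agreement, mapping = best_label_agreement(run_a, run_b, k=5)
```

An agreement close to 1 means that the two runs found practically the same groups of data points, whatever IDs and colors were assigned to them.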

Our conclusions after several re-runs can be summarized as follows: based on visual evaluation, SAP Predictive Analysis identifies a set of five stable patterns (i.e., patterns whose perceived visual profile, density and linearity are distinct and practically unchanging). The cluster IDs assigned to those patterns (1, 2, 3, 4 and 5) may differ from one run to another, as may the color coding (which follows the IDs), but this does not affect the fact that the five identified patterns remain almost the same. Because of the changing color coding, the first impression could be that, for example, the pattern behind cluster 5 changes drastically (compare the shape of cluster 5 in re-runs 1 and 2 with its shape in re-runs 3 and 4). This impression is wrong: the "most linear" pattern that is assigned to cluster 5 in re-runs 1 and 2 is simply assigned to different cluster numbers in re-runs 3 and 4. In other words, the "most linear" pattern neither disappears nor substantially changes its shape; it is just visualized under a different cluster ID and in a different color.

ASSESSMENT OF INFLUENCING FACTORS

The next logical step in the analysis is to turn to the identified clusters and try to find the rules according to which a given fuel price history data point is assigned to a given cluster. In the subsequent chapters we will try to identify the factors that influence the assignment of the data points to the clusters, and via a set of rules we will assess the degree of influence of those factors.

Selection of Influencing Factor Candidates

We will start with the full set of clustering variables and consider all of them influencing factor candidates:

Lowest Own Price

Average Competitor Price

Year

Week




The outcome of our analysis should be a shortlist of the above variables. A clustering variable is added to the influencing factor shortlist if, depending on its value, a given data point is assigned with high probability to a given cluster.

Configuring R-CNR Tree Algorithm

C&R tree is one of the most commonly used algorithms for discovering probability-based classification rules (sets of such rules are often graphed as "decision trees": hierarchies of classification rules in which the greater the "classification power" of a rule, the closer to the root of the decision tree it is placed). In the Predict view of SAP Predictive Analysis, the node that implements the R-CNR Tree algorithm follows the R-K-Means node in the analysis process:

In the properties of the R-CNR Tree node, the following four independent variables are selected:

Lowest Own Price

Average Competitor Price

Year

Week



The following dependent variable is selected:

ClusterNumber – the cluster ID generated in the previous node by R-K-Means

The algorithm's method is set to "Classification", the minimum number of data points for a "branch" or "leaf/vertex" split-off is set to 10, and the split-off criterion is set to "Gini".

Applying R-CNR Tree Algorithm

The analysis process in SAP Predictive Analysis, when executed up to and including the R-CNR Tree node, delivers the following algorithm summary graph:

The above algorithm summary graph contains the decision tree based on the classification rules discovered by R-CNR Tree. Reading through that graph attentively, we realize that of the influencing factor candidates identified in the previous chapter, only the following variables have survived as real influencing factors:

Year – the calendar year

Week – the calendar week

The rule closest to the tree root is based on the Week variable. It could be formulated like this: if Week < 12.5 then Cluster = 5 (expected percentage of processed data points = 21.81%); if Week >= 12.5 then process further rules (expected percentage of processed data points = 78.19%). The rules at the next two levels down the decision tree are based on Week as well, so we can draw a major conclusion: the calendar week number has the strongest influence on cluster assignment. At level "root minus four" (i.e., for calendar week numbers equal to or greater than 38), the Year variable starts to play its role and becomes the influencing factor. The influence of Year continues one more level down from the root. At calendar year 2006, Week becomes the influencing factor again, and the "last" rule in the hierarchy is based on Week.

VISUALIZATIONS OF INFLUENCING FACTORS

To explain a bit better the insights obtained via the R-CNR Tree algorithm summary graph in the previous chapter, we would like to suggest the following visualizations:



Influencing Factor - Week

Influencing Factor - Year

In the above visualizations, the following variables are plotted against the identified influencing factors, calendar years (Year) and calendar weeks (Week):

Lowest Own Price

Average Competitor Price


In the "Influencing Factor – Week" visualization, both the Lowest Own Price and Average Competitor Price variables are averaged per calendar week across all the calendar years. In the "Influencing Factor – Year" visualization, the same variables are shown with their actual values in each of the calendar years. We will try to connect what we see in the above visualizations with the insights obtained from the algorithm summary graph.

Influencing Factor – Week

The classification rules based on Week occupy the first three levels underneath the decision tree root in the R-CNR Tree algorithm summary graph. Depending on the range to which the calendar week of a given data point belongs, one cluster ID or another is assigned to the data point. As we saw in the "Visualization of Fuel Price History Clusters" chapter, each cluster ID corresponds to a unique fuel pricing pattern, characterized mainly by lower or higher volatility of own prices relative to competitor prices. In the above "Influencing Factor – Week" visualization we can see that the classification rules from the algorithm summary graph are properly reflected:

Before week 12, on average, the movement of own prices correlates closely with the movement of competitor prices, with the price differential being small (i.e., the pattern that is represented by cluster 5)

Between weeks 12 and 25, the correlation of own and competitor prices becomes worse, the price differential grows substantially (i.e., the pattern that is represented by cluster 4)

Between weeks 25 and 38, the correlation of own and competitor prices does not improve, the price differential remains visible (i.e., the pattern that is represented by cluster 2)

Beyond week 38, Year becomes the major influencing factor and provides discrimination between clusters 1 and 3
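The root rule on Week can be checked against the cluster sizes reported earlier, and the same numbers illustrate what the "Gini" split criterion measures. The Python sketch below uses the standard Gini impurity formula (one minus the sum of squared class shares). It assumes that the branch Week < 12.5 contains exactly the data points of cluster 5; that is consistent with the reported 21.81% but is not stated explicitly in the paper. It is a worked calculation, not the internals of the R-CNR Tree node.

```python
def gini(counts):
    """Gini impurity of a node holding the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Cluster sizes as reported in the "Cluster Density and Distance" graph.
sizes = {1: 855, 2: 1512, 3: 864, 4: 1512, 5: 1323}
total = sum(sizes.values())        # all data points seen by the tree root
share_left = sizes[5] / total      # share routed to the Week < 12.5 branch

parent = gini(list(sizes.values()))
left = gini([sizes[5]])            # assumed pure branch: cluster 5 only
right = gini([n for cluster_id, n in sizes.items() if cluster_id != 5])

# Impurity after the split is the size-weighted average of both branches.
weighted = share_left * left + (1 - share_left) * right
decrease = parent - weighted       # the impurity reduction achieved by the split
```

The five cluster sizes add up to 6066 data points, and 1323 of 6066 is 21.81%, which matches the percentage quoted for the rule Week < 12.5. Under the purity assumption the left branch has zero impurity, so the split lowers the overall impurity, which is what makes such a rule a good candidate for the top of the tree.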

Influencing Factor – Year

The classification rules based on Year occupy the fourth and fifth levels underneath the decision tree root in the R-CNR Tree algorithm summary graph. Beyond calendar week 38, depending on the range to which the calendar year of a given data point belongs, one cluster ID or another is assigned to the data point. In the above "Influencing Factor – Year" visualization we can see that the classification rules from the algorithm summary graph are properly reflected:

Before year 2005, on average, the movement of own and competitor prices is driven by calendar weeks rather than by calendar years. From the "Influencing Factor – Week" visualization we learn that the data points that lie beyond week 38 in the years before 2005 have a poor correlation between own and competitor prices, plus a relatively small price differential (i.e., the pattern that is represented by cluster 1)

Starting from year 2005, the movement of own and competitor prices is driven by both calendar weeks and calendar years. Specifically, the data points beyond week 38 in the years starting from year 2005 have a relatively close own/competitor correlation and a rather small price differential (i.e., the pattern that is represented by cluster 3)

In year 2006, a "fight" of two patterns – the one of cluster 1 with the one of cluster 3 – can be observed, which results in the predominance during year 2006 of cluster 3 between calendar weeks 38 and 45, and of cluster 1 beyond week 45 (till the end of year 2006)
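Taken together, the Week and Year rules described in this chapter amount to a short cascade of conditions. The Python sketch below writes that cascade down as a function. Only the first threshold (12.5) is quoted exactly in the text; the cut points at weeks 25, 38 and 45 and the handling of year 2005 are read off the prose and are assumptions, since the decision tree graph itself may use slightly different values.

```python
def assign_cluster(week, year):
    """Approximate the described rule cascade; returns a cluster ID (1 to 5)."""
    if week < 12.5:      # threshold quoted in the text
        return 5
    if week < 25:        # assumed cut point ("between weeks 12 and 25")
        return 4
    if week < 38:        # assumed cut point ("between weeks 25 and 38")
        return 2
    # From week 38 onward, Year takes over as the influencing factor.
    if year < 2005:
        return 1
    if year == 2006 and week > 45:   # late 2006 falls back to cluster 1
        return 1
    return 3

examples = {
    (5, 2003): assign_cluster(5, 2003),
    (20, 2010): assign_cluster(20, 2010),
    (30, 2001): assign_cluster(30, 2001),
    (40, 2003): assign_cluster(40, 2003),
    (40, 2006): assign_cluster(40, 2006),
    (50, 2006): assign_cluster(50, 2006),
}
```

Written this way, it is easy to see that Year only matters for the last part of the calendar year, and that 2006 is the one year in which Week is consulted a second time.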



RELATED CONTENT

SAP PIO

For more information, visit the Predictive Analysis Space.

http://www.sap.com/services/portfolio/predictive-analytics/index.epx

http://scn.sap.com/community/predictive-analysis

© 2013 SAP AG. All rights reserved.

