

Uncertainty Quantification Using Distances and Kernel Methods – Application to a Deepwater Turbidite Reservoir

Céline Scheidt and Jef Caers Stanford Center for Reservoir Forecasting, Stanford University

Abstract

Petroleum reservoir properties are often modeled using well-established geostatistical methods. These methods can rapidly generate large numbers of realizations, each of which respects the geological constraints input into the algorithms. Uncertainty in the reservoir properties can be quantified by evaluating a large number of reservoir models. However, since flow simulation can be extremely time consuming, it is often not practical to run flow simulation on each model. The engineer must then select a subset of realizations to quantify uncertainty in reservoir properties. The traditional way to select a subset of realizations is to rank them using static properties (e.g. OOIP). One selects a set of realizations associated with particular quantiles (P10, P50, and P90, for example), and performs full-field simulations for each model. One drawback of this method is that ranking techniques are highly dependent on the static property used. An alternative way to quantify uncertainty is the experimental design methodology, but this method is not well suited to uncertainty in reservoir model realizations. In this paper, we propose a new method to select realizations using the concept of distance. Starting from a large set of realizations, a distance function measuring the “dissimilarity” between any two geostatistical realizations is defined. The distance function can be tailored to the particular problem, in this case flow responses. Using multi-dimensional scaling based on the distance, the realizations can be mapped into a Euclidean space. This space can then be modeled using kernel techniques, such as kernel clustering, to select a subset of representative realizations with properties similar to those of the larger set. Without losing accuracy, production uncertainty can then be quantified from flow simulation on this subset of realizations, reducing computing time significantly. This method is well suited to quantifying uncertainty on hundreds or potentially thousands of reservoir models with reasonable CPU demand.

A case study is presented on a deepwater turbidite offshore reservoir in West Africa. The reservoir is modeled using 4 facies whose spatial distribution is uncertain due to uncertain facies proportions, location, channel shape, etc. Multiple alternative training images are defined to capture the spatial uncertainty of the prior model. Many realizations are then generated with these training images as input. Distances between realizations are calculated using a fast streamline simulator, which requires minimal CPU time, and a small subset of realizations is selected using a kernel k-means clustering algorithm. Uncertainty quantification is performed by running full-field flow simulation on the subset of realizations. We show that quantifying uncertainty on the subset yields statistics similar to those of the full set.


1. Introduction

Stochastic spatial simulation is widely used to generate multiple, alternative realizations of a spatial phenomenon which characterize the uncertainty of the phenomenon. Because post-processing these realizations by flow simulation requires a large computation time, it is usually not possible to evaluate all of them. To circumvent this challenge, the traditional approach consists in ranking realizations according to a static measure, such as the original oil in place. Ranking techniques are used to select realizations that represent the P10, P50 and P90 quantiles of the response (Ballin et al., 1992). This approach is effective when the ranking measure is highly correlated with the response of interest. However, ranking is often based on rather simple statistics extracted from the realization (e.g. original oil in place), which may not correctly capture the simulation behavior. These statistical measures often have a poor correlation with the response measured by the flow simulator.

In this paper, we employ a different technique to identify the subset of realizations which will be evaluated by flow simulation to compute the statistics (P10, P50, P90) of the response of interest. This method, called the Distance Kernel Method (DKM), was proposed in 2007 by Scheidt and Caers. It is based on the definition of a dissimilarity distance between the realizations, which indicates how similar two realizations are in terms of their associated response of interest. In other words, the distance is defined in order to have a good correlation with the flow response of interest. The principal idea is to rely on the distance to identify a few typical realizations (in terms of flow behavior) and thus cover the spread of uncertainty accurately by performing only a small number of simulations. The small subset of realizations is selected to have statistics similar to those of the entire set.

The following section gives a summary of the methodology. We then present an application of the DKM to a real field case from West Africa, in which 72 realizations of the reservoir were used for uncertainty quantification. We compare our results with more typical ranking approaches. We end the paper with some conclusions and discussion.
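To make the traditional ranking approach described above concrete, the following is a minimal Python sketch that ranks realizations by a static measure and picks those sitting at the P10, P50 and P90 ranks. The helper name `select_by_ranking` and the OOIP values are illustrative assumptions, not taken from the study.

```python
import numpy as np

def select_by_ranking(static_measure, quantiles=(0.10, 0.50, 0.90)):
    """Return indices of the realizations whose static measure sits at the
    requested quantile ranks (hypothetical helper, for illustration only)."""
    static_measure = np.asarray(static_measure, dtype=float)
    order = np.argsort(static_measure)      # realization indices, low to high
    n = len(order)
    # Position in the sorted list closest to each quantile
    ranks = [int(round(q * (n - 1))) for q in quantiles]
    return [int(order[r]) for r in ranks]

# Made-up OOIP values for 11 realizations
ooip = [105, 98, 120, 87, 110, 93, 101, 115, 90, 108, 96]
selected = select_by_ranking(ooip)
```

Only the selected realizations would then be flow-simulated. The selection depends entirely on the chosen static measure, which is the weakness discussed above.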

2. General Description of the Methodology

The principle of the methodology, illustrated for facies models, is summarized in Figure 1. Starting with multiple (NR) realizations of a spatial phenomenon (e.g. facies representation in a reservoir) generated using any algorithm, a dissimilarity distance matrix is constructed (Figure 1a and 1b). This NR x NR matrix contains the “distance” between any two model realizations, which describes how similar two reservoir models are in terms of geological properties and flow behavior. The distance can be calculated in any manner; the only requirement is that it be well correlated with the flow response(s) of interest. Although this requirement is the same for ranking methods, we show that using distances results in a more robust solution. Examples of such distances are the Hausdorff distance (Suzuki and Caers, 2006), time-of-flight based distances (Park and Caers, 2007), and flow-based distances using fast flow simulators (Scheidt and Caers, 2007).

Figure 1: Proposed workflow for uncertainty quantification: (a) distance between two models, (b) distance matrix, (c) models mapped in Euclidean space, (d) feature space, (e) pre-image construction, (f) P10, P50, P90 estimation

The distance matrix is then used to map all realizations into a Euclidean space R (Figure 1c), using a technique called multidimensional scaling (MDS). MDS translates the dissimilarity matrix into a configuration of points in n-dimensional Euclidean space (Borg and Groenen, 1997). Each point in this map represents a realization (Figure 2). The points are arranged so that their Euclidean distances correspond as closely as possible to the dissimilarity distances between the realizations.
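As an illustration of the mapping step, here is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. The text does not state which MDS variant was used, so the classical form is an assumption, and the function name `classical_mds` and the toy data are hypothetical.

```python
import numpy as np

def classical_mds(dist, n_dims=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean distances
    approximate the entries of the symmetric dissimilarity matrix `dist`."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared dissimilarities to obtain a Gram matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the eigenvectors associated with the largest eigenvalues
    eigval, eigvec = np.linalg.eigh(gram)
    top = np.argsort(eigval)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigval[top], 0.0, None))
    return eigvec[:, top] * scale

# Toy check: four points on a unit square are recovered up to rotation/reflection
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
d = np.linalg.norm(square[:, None, :] - square[None, :, :], axis=-1)
coords = classical_mds(d, n_dims=2)
d_embedded = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Comparing `d_embedded` with `d` indicates how well a given number of dimensions reproduces the dissimilarities.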

Figure 2: Multidimensional Scaling (MDS): each point represents a reservoir model in 2D space.

At this point, we could group the points in Euclidean space R using principal component analysis or clustering algorithms and select representative points (realizations) for each cluster. However, these algorithms assume that the structure of the points in R is linear. In most cases the structure of the points in R is nonlinear (Figure 1c); we therefore use kernel methods to transform the Euclidean space R into a new space F, called the feature space (Figure 1d). The goal of the kernel transform is to make the relationship between the points in this new space behave more linearly (Schölkopf and Smola, 2002), so that standard linear tools for pattern detection (such as principal component analysis, cluster analysis, dimensionality reduction, etc.) can be used more successfully. These tools allow the selection of a few representative points, in our case reservoir models that have different flow behavior, among a potentially very large set. In reservoir engineering, kernel theory has been used by Sarma et al. (2006) in the context of inversion of flow data and production optimization.

After applying the kernel transform, we employ the classical k-means algorithm in the feature space F, also called kernel k-means (KKM), to determine a subset of points defined by the cluster centroids (see below). Each cluster thus contains realizations that are similar in terms of flow response. The number of points (realizations) in a cluster defines the weight associated with its representative realization. The subset of models selected by KKM is defined to be small enough to allow uncertainty quantification (e.g. P10, P50, P90 quantiles) through flow simulation.

In order to identify the representative realizations and visualize the location of the centroids in Euclidean space R (Figure 1e), the centroid locations are mapped back from the feature space F using the Schölkopf fixed-point algorithm (Schölkopf and Smola, 2002). The representative realization associated with each cluster is defined as the realization which is closest in R to the back-mapped cluster centroid.
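The clustering step can be sketched as follows in Python with NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation. The representative of each cluster is taken as the member closest to the centroid in feature space, which replaces the fixed-point pre-image step described above. The deterministic initialization, the function names and the toy data are all hypothetical.

```python
import numpy as np

def centroid_distances(K, labels, n_clusters):
    """Squared feature-space distance from every point to every cluster centroid."""
    n = K.shape[0]
    dist = np.full((n, n_clusters), np.inf)   # empty clusters stay at infinity
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        m = members.size
        dist[:, c] = (np.diag(K)
                      - 2.0 * K[:, members].sum(axis=1) / m
                      + K[np.ix_(members, members)].sum() / m ** 2)
    return dist

def kernel_kmeans(K, n_clusters, n_iter=100):
    """Kernel k-means on a precomputed kernel matrix K.
    Returns labels, representative indices and cluster weights."""
    n = K.shape[0]
    # Assumed initialization: evenly spaced seed points, nearest seed by kernel value
    seeds = np.linspace(0, n - 1, n_clusters).astype(int)
    labels = K[:, seeds].argmax(axis=1)
    for _ in range(n_iter):
        new_labels = centroid_distances(K, labels, n_clusters).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    dist = centroid_distances(K, labels, n_clusters)
    reps, weights = [], []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:
            # Simplification: nearest member in feature space, not a pre-image
            reps.append(int(members[dist[members, c].argmin()]))
            weights.append(int(members.size))
    return labels, reps, weights

# Toy data: two well-separated groups of three points, Gaussian kernel (sigma = 2)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dist / (2.0 * 2.0 ** 2))
labels, reps, weights = kernel_kmeans(K, n_clusters=2)
```

The returned `weights` are the cluster sizes, which is how each representative realization is weighted when quantiles are computed.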


Appendix A describes the methodology in Figure 1 in more detail.

3. Application to a real field case (WCA)

3.1. General description of the case

The West Coast African (WCA) reservoir is a deepwater turbidite offshore reservoir located in a slope valley. The reservoir lies offshore in 1600 feet of water, 4600 feet below sea level. This case has been studied by previous authors. Hoffman (2005) employed the Probability Perturbation Method on a WCA reservoir model, simultaneously integrating prior geological information (training image), seismic data, and production data to obtain a geologically consistent history-matched model. Maharaja (2006, 2007) applied a spatial bootstrap method using several different scenarios and training images to quantify the uncertainty on net-to-gross of the WCA reservoir. Seismic data for the WCA reservoir is of good quality and allows the large-scale structural setting (canyon) to be identified with good confidence. The WCA field is a slope valley system divided into four structural blocks with different fluid contacts: the West, Central, East and Southeast (Figure 3).

Figure 3: Structural and stratigraphic compartments: the West, Central, East and Southeast units

The depositional facies filling the slope valley cannot easily be inferred from the seismic data. However, four depositional facies were interpreted from the well logs: shale (Facies 1), poor quality sand #1 (Facies 2), poor quality sand #2 (Facies 3) and good quality channels (Facies 4). The amount of sand is about 55% of the gross volume. The three sand facies are classified by their petrophysical properties measured at the wells.


The good quality channel facies represents about 28% of the reservoir. The description of the two other sand facies is uncertain. They have been interpreted as levees or debris flows. These two facies account for 10% to 15% of the total reservoir. The depositional uncertainty for the facies is expressed through different training images (TI), which are presented in Section 3.2. The reservoir is produced with 28 wells, of which 20 are production wells and 8 are water injection wells. The locations of the wells are displayed in Figure 4. Production wells are colored in red and injection wells in blue.

Figure 4: Location of the 28 wells. Red are production wells and blue are injection wells. Different colors in the grid represent different fluid regions

Reservoir model

The reservoir model has dimensions of 78 x 59 x 116, but there are only around 100,000 active gridblocks (the exact number varies for each simulation). The reservoir is approximately one mile long, one-half of a mile wide and 800 feet thick. The initial fluid contacts, pressures and oil properties vary between segments, but the water-oil contact depth is approximately 5440 feet, the initial pressure is around 2300 psi and the oil gravity is around 24° API. An aquifer exists to the East of the reservoir.

MPS Models

The MPS facies geometry is simulated using the multi-point geostatistical algorithm snesim (Strebelle, 2002). The facies are conditioned to the available data: the training image, well data and seismic data in this case. The seismic inversion cube was generated and then transformed into a facies probability cube by calibration to the well data. The four facies of the MPS models were then populated with porosity, Vshale (shale volume fraction) and permeability. The porosity was obtained using sequential Gaussian simulation (SGS), one for each facies. Vshale was simulated using SGS with collocated cokriging (the porosity cube being used as soft data), and the permeability was modeled using a log transform of the Vshale cube. The porosity and permeability values were then exported to the flow simulator (Chears).

3.2. Uncertainty quantification in the WCA reservoir

As mentioned above, the description of the facies filling the slope valley is subject to uncertainty. Twelve TIs are used in this case study, representing uncertainty in the facies representation. The TIs can be divided into two families depending on how Facies 2 and 3 are characterized: levees are modeled as ellipses, whereas debris flows are modeled as sinuous patterns. In addition, the TIs differ from one another with respect to the channel width, width/thickness ratio and sinuosity of the channels. The 12 TIs used in this case study are presented in Figure 5.

Figure 5: Training images (TI 1 to TI 12) used to generate the 72 realizations, shown for layer 84


The differences in the properties of the TIs are presented in Table 1 (L = low, M = medium, H = high).

| TI    | Facies Architecture | Channel Thickness | Width/Thickness Ratio | Channel Sinuosity |
|-------|---------------------|-------------------|-----------------------|-------------------|
| TI 1  | Chan + Deb          | L                 | L                     | L                 |
| TI 2  | Chan + Deb          | H                 | L                     | L                 |
| TI 3  | Chan + Deb          | H                 | H                     | L                 |
| TI 4  | Chan + Lev          | H                 | H                     | H                 |
| TI 5  | Chan + Deb          | L                 | H                     | H                 |
| TI 6  | Chan + Lev          | H                 | L                     | H                 |
| TI 7  | Chan + Deb          | L                 | L                     | H                 |
| TI 8  | Chan + Lev          | H                 | L                     | L                 |
| TI 9  | Chan + Lev          | L                 | H                     | L                 |
| TI 10 | Chan + Lev          | L                 | L                     | H                 |
| TI 11 | Chan + Deb          | H                 | H                     | H                 |
| TI 12 | Chan + Lev + Deb    | M                 | M                     | M                 |

Table 1: Properties of the Training Images

In addition to the different TIs, uncertainty is assumed to be present in the facies proportions as represented in the facies probability cubes. Three different facies probability cubes were used in this study. Thus, the geostatistical realizations were created by varying the TIs and facies probability cubes as input to the multi-point geostatistical algorithm. To include spatial uncertainty, two realizations were generated for each combination of TI and facies probability cube, leading to a total of 72 possible realizations of the WCA reservoir.

In order to have a “reference” for this case, all 72 realizations were simulated using a full flow simulator (Chears). Note that this is often not possible for real field cases, since the flow simulations are too time consuming. In this particular instance, each Chears simulation required approximately 2.5 hours of CPU time, requiring about 8 days in total to run the 72 flow simulations. The differing properties of the 72 realizations induce important differences in flow response (Figure 6). Figure 6A shows that after 1200 days of production, the difference in field cumulative oil production between the two extreme models is 29.7 MMSTB. Figure 6B shows the corresponding difference in cumulative water production (36.9 MMSTB).
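The count of realizations and the total simulation time quoted above can be checked with a short Python sketch. The variable names and the seed index are illustrative assumptions. The numbers are those stated in the text.

```python
from itertools import product

n_training_images = 12      # TIs listed in Table 1
n_probability_cubes = 3     # facies probability cubes
n_seeds = 2                 # realizations per (TI, cube) combination

# One entry per realization: (training image, probability cube, seed)
design = list(product(range(1, n_training_images + 1),
                      range(1, n_probability_cubes + 1),
                      range(1, n_seeds + 1)))
n_realizations = len(design)                      # 12 * 3 * 2 = 72

hours_per_run = 2.5                               # approximate Chears CPU time per run
total_days = n_realizations * hours_per_run / 24  # 180 h = 7.5 days, i.e. about 8 days
```

Enumerating the design explicitly also gives each realization a label (TI, cube, seed), which is convenient when tracing a selected realization back to its inputs.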

Figure 6: (A) Cumulative oil production for the 72 realizations as a function of time; (B) cumulative water production for the 72 realizations as a function of time

The objective of the next section is to assess uncertainty by calculating the P10, P50 and P90 quantiles of the production without knowing the flow response of all 72 models. The responses of interest in this paper are the cumulative oil production and cumulative water production. In the following sections we describe the application of the DKM to this case in detail, and then present the results. Subsequently, we provide a comparison of the results with the traditional ranking method.

3.3. Distance Kernel Method

The first step of the methodology is to define a distance between realizations which measures how similar two realizations are in terms of properties and flow behavior.

Definition of the dissimilarity matrix between realizations

The distance between any two realizations is calculated using a streamline simulator (3DSL), which allows a fast simulation of the response, cumulative oil production (CumOil) or cumulative water production (CumWater) in this case. Flow-based distances are often well correlated to the response obtained by the full flow simulator. The distance is defined as:

$$\delta_{ij} = \sqrt{\sum_{t_k} \left( R_i^{\text{streamline}}(t_k) - R_j^{\text{streamline}}(t_k) \right)^2} \qquad (1)$$

where:
- $R$ represents either CumOil or CumWater from streamline simulation
- $t_k$ represents the time steps of the streamline simulation.
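A minimal Python sketch of the dissimilarity matrix follows, assuming Eq. 1 is the square root of the summed squared differences between two streamline response curves (a Euclidean distance between curves). The function name `streamline_distance_matrix` and the response values are hypothetical.

```python
import numpy as np

def streamline_distance_matrix(responses):
    """Pairwise distances between response curves.

    responses: array of shape (n_realizations, n_timesteps) holding one
    streamline response curve (e.g. CumOil) per realization.
    """
    responses = np.asarray(responses, dtype=float)
    # diff[i, j, k] = R_i(t_k) - R_j(t_k)
    diff = responses[:, None, :] - responses[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Made-up response curves for three realizations at two time steps
curves = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])
delta = streamline_distance_matrix(curves)
```

The result is symmetric with a zero diagonal, as expected of a dissimilarity matrix.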


We use only the late period time steps between 791 and 1215 days to ensure that the water breakthrough has already occurred, thus allowing for a better characterization of the dissimilarities between the realizations. Recall that the distance measure and the response of interest must be correlated for this method to be efficient. In this case, the responses of interest are the cumulative oil production and cumulative water production for the field. Here, the correlation coefficient between the distance and the difference in flow for the full set of 72 simulations is 0.85 for CumOil, and 0.51 for CumWater. This degree of correlation should be sufficient to obtain good results. The dissimilarity distance matrix is then constructed by calculating the dissimilarity distance between any two realizations.
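The correlation check described above can be sketched as follows in Python. It assumes that the "difference in flow" is the absolute difference of a scalar full-simulation response (for example, cumulative production at the final time). This is an interpretation, and the function name and data are hypothetical.

```python
import numpy as np

def distance_response_correlation(dist, full_response):
    """Pearson correlation between pairwise dissimilarities and pairwise
    absolute differences in a scalar full-simulation response."""
    dist = np.asarray(dist, dtype=float)
    full_response = np.asarray(full_response, dtype=float)
    iu = np.triu_indices(dist.shape[0], k=1)     # count each pair once
    response_diff = np.abs(full_response[:, None] - full_response[None, :])
    return float(np.corrcoef(dist[iu], response_diff[iu])[0, 1])

# Made-up example: a proxy whose distances track the response differences exactly
proxy = np.array([1.0, 2.0, 4.0, 7.0])
full = 3.0 * proxy + 5.0
dist = np.abs(proxy[:, None] - proxy[None, :])
rho = distance_response_correlation(dist, full)
```

In practice the full-simulation responses are not available for all realizations, so a check of this kind can only be made on a reference case such as this one or on a small pilot subset.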

Multi-Dimensional Scaling (MDS)

Using the dissimilarity distance matrix defined previously, all the realizations are mapped into a Euclidean space R using multi-dimensional scaling (Figure 7A for CumOil, 7B for CumWater). In both cases, a 2D space is sufficient to ensure that the Euclidean distance between any two points in this space is similar to the dissimilarity distance (correlation of 0.99 for both CumOil and CumWater). As a consequence, very little information is lost by considering, from this step of the methodology onward, only Euclidean distances and not dissimilarity distances.

Figure 7: 2D Euclidean space of uncertainty resulting from MDS: (A) uncertainty in oil response, (B) uncertainty in water response

Kernel K-Means

The next step of the DKM consists of defining a kernel function to transform the mapping space R into a space F with improved linear variation. The kernel employed to define the feature space F is Gaussian (Appendix A, Eq. 2), with a bandwidth parameter taken to be 20% of the range of distances in the Euclidean space. Note that the Gaussian kernel is well adapted to this study since it requires only the Euclidean distances between points. The kernel k-means (KKM) algorithm is then applied to the points in the Euclidean space. The number of clusters is user-defined, and is the maximum number of flow simulations the engineer can afford given available CPU time. In the case of cumulative oil, we selected 7 simulations (which requires approximately 1.3 days of simulation, and is 10% of the total number of existing realizations). Figure 8 (A) shows the back-mapped centroids as blue squares. To quantify the uncertainty in water production, since the correlation coefficient between the dissimilarity distance and the actual cumulative water production is smaller than in the case of CumOil, we must select one additional simulation to obtain satisfactory results. The 8 selected realizations in the case of CumWater are presented in Figure 8 (B).
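The bandwidth choice can be illustrated with a short Python sketch. Two assumptions are made here that the text does not confirm: the Gaussian kernel is taken in the common form exp(-d^2 / (2 sigma^2)), and the "range of distance" is taken as the maximum minus the minimum pairwise distance. The function name and coordinates are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(coords, bandwidth_fraction=0.20):
    """Gaussian kernel matrix with bandwidth set to a fraction of the
    range of pairwise Euclidean distances (assumed definition)."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Range of distances over distinct pairs: max minus min (assumption)
    iu = np.triu_indices(len(coords), k=1)
    sigma = bandwidth_fraction * (dist[iu].max() - dist[iu].min())
    kernel = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    return kernel, sigma

# Made-up MDS coordinates: pairwise distances are 5, 10 and 5
coords = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
kernel, sigma = gaussian_kernel_matrix(coords)
```

A smaller bandwidth makes the kernel more local, so the fraction is a tuning choice rather than a fixed rule.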

Figure 8: 2D Euclidean space of uncertainty: (A) cumulative oil, (B) cumulative water. The blue points represent the realizations selected by KKM among all realizations

The 7 or 8 points selected by KKM represent the realizations whose flow responses are assumed to have the same statistics as the 72 realizations. Full flow simulations are performed on those 7 or 8 realizations. Uncertainty quantification is then performed by calculating the P10, P50 and P90 quantiles on these selected models as a function of time. Figure 9 (A) shows the cumulative oil production as a function of time for the 7 selected realizations. Figure 9 (B) shows the estimated quantiles (in blue) as a function of time, together with the quantiles resulting from the entire set of realizations (in red). Figures 10 (A) and 10 (B) show the same results for the water response. Recall that each simulated realization is weighted by the number of realizations in the corresponding cluster. The representative realizations rarely have the same weight, and the weights can vary significantly between clusters.
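The weighting described above can be sketched in Python by repeating each simulated response as many times as its cluster has members before taking quantiles. The response values and cluster weights are made up for illustration, and the function name `weighted_quantiles` is hypothetical.

```python
import numpy as np

def weighted_quantiles(values, weights, probs=(0.10, 0.50, 0.90)):
    """Quantiles of `values` when each value is repeated `weights` times."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=int)
    # Each simulated response counts once per realization in its cluster
    expanded = np.repeat(values, weights)
    return np.quantile(expanded, probs)

# Made-up responses of 7 selected realizations at one time step,
# with cluster sizes that sum to 72
responses = [50.0, 55.0, 60.0, 62.0, 66.0, 70.0, 80.0]
cluster_sizes = [5, 10, 15, 12, 14, 10, 6]
p10, p50, p90 = weighted_quantiles(responses, cluster_sizes)
```

Applying the same function at each time step gives quantile curves as a function of time.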

Figure 9: Cumulative oil production (CUMOIL, MSTB) as a function of time (days): (A) for the 7 selected simulations and (B) resulting P10, P50 and P90 values. The red curves are the quantiles for all 72 realizations (exhaustive set), whereas the blue curves are for the 7 selected simulations using KKM.

Figure 10: Cumulative water production (CUMWATER, MSTB) as a function of time (days): (A) for the 8 selected simulations and (B) resulting P10, P50 and P90 values. The red curves are the quantiles for all 72 realizations (exhaustive set), whereas the blue curves are for the 8 selected simulations using KKM.

We observe in Figures 9 (B) and 10 (B) that the estimated quantiles closely match those of the exhaustive set. In Figures 11 and 12, we present the density of the cumulative oil and water production for the 72 realizations (red) and for the weighted 7 and 8 selected realizations, respectively (blue), at two different times. Again we observe that the subset of realizations has characteristics similar to those of the entire set of possible realizations.

Figure 11: Density of CumOil (CUMOIL, MSTB) for all 72 realizations (red) and the 7 selected realizations (blue) at (A) 607 days and (B) 1156 days. The red curves are the densities for all 72 realizations (exhaustive set), whereas the blue curves are for the 7 selected simulations using KKM.

Figure 12: Density of CumWater (CUMWATER, MSTB) for all 72 realizations (red) and the 8 selected realizations (blue) at (A) 607 days and (B) 1156 days. The red curves are the densities for all 72 realizations (exhaustive set), whereas the blue curves are for the 8 selected simulations using KKM. The effect of the k-means cluster weights on the representative realizations is evident in the blue curves for KKM.

Note that two different distances were used for the cumulative oil production and cumulative water production. Thus, the points in the Euclidean space are not the same, and therefore the selected simulations for the two responses also differ. The same distance could be employed for both responses; however, the CumOil distance was poorly correlated with cumulative water production (0.22). In order to assess uncertainty on several responses at a time, the selected distance should be correlated with the difference in flow for all responses. Such a distance might be less correlated with the difference in each individual response; however, selecting a few more simulations can increase the accuracy of the method.


The next example illustrates a joint quantification of CumOil and CumWater using the same distance. In order for the distance to be reasonably correlated with both responses, the distance is defined as in Eq. 1, except that the sum considers both streamline responses:

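$$\delta_{ij} = \sqrt{\sum_{t_k} \left[ \left( \mathrm{CumOil}_i^{\text{streamline}}(t_k) - \mathrm{CumOil}_j^{\text{streamline}}(t_k) \right)^2 + \left( \mathrm{CumWater}_i^{\text{streamline}}(t_k) - \mathrm{CumWater}_j^{\text{streamline}}(t_k) \right)^2 \right]}$$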

The same time steps as previously are used. For this distance, we obtain a correlation between the distance and the difference in flow of 0.66 for CumOil and 0.45 for CumWater. Since the correlations have decreased, KKM is applied to select 10 realizations instead of 7 or 8. The 10 selected realizations are presented in the MDS space (blue squares) in Figure 13 (A). The resulting estimated quantiles for cumulative oil production and cumulative water production are presented in Figure 13 (B) and (C), respectively. Accurate uncertainty quantification is observed for both responses.

Figure 13: KKM for CumOil and CumWater: (A) 10 selected realizations, (B) resulting P10, P50 and P90 values for CumOil, (C) resulting P10, P50 and P90 values for CumWater.

[Panels: (A) MDS space, all realizations and selected realizations; (B) CUMOIL (MSTB) versus Time (days), exhaustive set and KKM; (C) CUMWATER (MSTB) versus Time (days), exhaustive set and KKM.]

As illustrated in this section, the use of the DKM selects an effective sample of realizations which accurately characterize uncertainty when evaluated. We now demonstrate the flexibility of our methodology by applying it to quantify flow uncertainty for individual wells. In this case, the response of interest is the cumulative oil and water production for each well, and the distance is the difference in cumulative oil (and water) production at each well. We expect a much larger variation in flow on a well basis than on a field basis, hence much larger uncertainty, particularly in water production.

4. Uncertainty Quantification for Individual Wells

In this section, we apply the DKM to individual wells. Since the distance in this method is response specific, the following examples use the same distance as defined in Eq. 1, where the difference in cumulative oil (or water) production is now calculated for the individual well between each pair of streamline simulations. The same time steps are used. All other parameters (kernel type, kernel bandwidth parameter) are specified as described previously. We study three wells (W1, W2 and W3) in this paper. Figure 14 shows the cumulative oil production as a function of time for each well. Figure 15 shows the cumulative water production as a function of time for each well.

Figure 14: Cumulative oil production as a function of time: (A) W1, (B) W2 and (C) W3

Figure 15: Cumulative water production as a function of time: (A) W1, (B) W2 and (C) W3

Well W1 has the largest uncertainty in both cumulative oil production and cumulative water production. For this well, the correlation coefficient between the distance and the actual difference in flow is 0.83 for CumOil and 0.70 for CumWater. KKM was applied to select 8 representative realizations for W1 for each response. The selected realizations in MDS space are presented in Figure 16 (A) for CumOil, and the weighted quantiles obtained from flow simulation of those 8 realizations are shown in Figure 16 (B). Figure 17 shows the corresponding results for CumWater. An accurate representation of uncertainty is obtained in both cases.
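To make the notion of weighted quantiles concrete, a minimal sketch is given here. It assumes that each selected realization carries a weight proportional to the number of realizations in its cluster, and it adopts one of several possible conventions for a weighted quantile (the smallest response whose cumulative weight reaches the target probability). The function name and arguments are illustrative only.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Weighted quantile of the simulated responses at one time step.

    values:  responses of the selected realizations
    weights: one weight per selected realization (e.g. cluster sizes)
    q:       target cumulative probability, between 0 and 1
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    # Empirical weighted CDF evaluated at the sorted responses.
    cdf = np.cumsum(w) / w.sum()
    idx = np.searchsorted(cdf, q, side="left")
    return v[min(idx, len(v) - 1)]
```

Applying this function at every time step, for each of the three target probabilities, yields quantile curves that can be compared with those of the exhaustive set.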


Figure 16: W1: (A) Selected realizations for simulation for CumOil, (B) Resulting quantiles

[Panels: (A) MDS space, all realizations and selected realizations; (B) CumOil (MSTB) versus Time (days), exhaustive set and KKM.]

Figure 17: W1: (A) Selected realizations for simulation for CumWater, (B) Resulting quantiles

[Panels: (A) MDS space, all realizations and selected realizations; (B) CumWater (MSTB) versus Time (days), exhaustive set and KKM.]

For well W2, the distance has a correlation coefficient with the true response difference of 0.77 for CumOil and 0.59 for CumWater. Figure 18 demonstrates that 8 simulations are not sufficient in the case of CumOil. If we increase the number of simulations to 12, the results improve significantly (Figure 19).

Figure 18: W2: (A) Selected realizations for simulation for CumOil, (B) Resulting quantiles

[Panels: (A) MDS space, all realizations and selected realizations; (B) CumOil (MSTB) versus Time (days), exhaustive set and KKM.]


Figure 19: W2: (A) Selected realizations for simulation for CumOil, (B) Resulting quantiles

[Panels: (A) MDS space, all realizations and selected realizations; (B) CumOil (MSTB) versus Time (days), exhaustive set and KKM.]

Concerning the cumulative water production, we observe in Figure 20 (B) a small over-estimation of the P90 for 8 simulations.

Figure 20: W2: (A) Selected realizations for simulation for CumWater, (B) Resulting quantiles

[Panels: (A) MDS space, all realizations and selected realizations; (B) CumWater (MSTB) versus Time (days), exhaustive set and KKM.]

Well W3 has a smaller degree of uncertainty in cumulative oil and water production (see Figure 14 and Figure 15). In this case, the correlation coefficient between the distance and the difference in flow is lower, at 0.50 for CumOil and 0.33 for CumWater. Since KKM selects realizations based on the distance, the selected realizations may not estimate the quantiles accurately. Figures 21 and 22 show the results of applying KKM to both responses for 8 flow simulations, the same number as for well W1.


Figure 21: W3: (A) Selected realizations for simulation for CumOil, (B) Resulting quantiles

[Panels: (A) MDS space, all realizations and selected realizations; (B) CumOil (MSTB) versus Time (days), exhaustive set and KKM.]

Figure 22: W3: (A) Selected realizations for simulation for CumWater, (B) Resulting quantiles

[Panels: (A) MDS space, all realizations and selected realizations; (B) CumWater (MSTB) versus Time (days), exhaustive set and KKM.]

Although the correlation coefficient between the distance and the difference in flow is not high, Figures 21 (B) and 22 (B) show accurate quantile estimation for W3. This might be because the uncertainty for this well is relatively small, so that an incorrect selection of realizations has less impact. However, this result may simply be fortuitous and should not be considered a general result. The definition of an application-specific distance is one of the keys to this workflow. Indeed, the better the correlation between the distance and the difference in flow, the better the estimation of the production quantiles. However, a smaller correlation coefficient does not necessarily result in inaccurate quantile estimations. In most cases, increasing the number of simulations is sufficient to obtain satisfactory results. This is discussed in the next section.


5. Robustness of the DKM method

The performance of the DKM relies on a proper selection of the distance - meaning a distance which is well correlated to the actual difference in flow. In this section, we propose to evaluate the quality of the DKM as a function of the value of this correlation coefficient. The quality of the DKM is measured by the error observed in the quantile estimations. The objective function, with units of MSTB, is defined as:

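$$OF = \frac{1}{3\,N_{ts}} \sum_{t_k} \left( \left| P_{10}^{\mathrm{Exhaustive}}(t_k) - P_{10}^{\mathrm{KKM}}(t_k) \right| + \left| P_{50}^{\mathrm{Exhaustive}}(t_k) - P_{50}^{\mathrm{KKM}}(t_k) \right| + \left| P_{90}^{\mathrm{Exhaustive}}(t_k) - P_{90}^{\mathrm{KKM}}(t_k) \right| \right) \qquad (2)$$

where:

- $t_k$ represents the time at which the quantiles are estimated
- $N_{ts}$ is the number of timesteps
- $P_i^{\mathrm{KKM}}$ are the quantiles resulting from KKM
- $P_i^{\mathrm{Exhaustive}}$ are the quantiles tabulated from the exhaustive set of 72 realizations.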
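For illustration, the objective function can be evaluated as sketched here. The sketch assumes that the error is the mean absolute difference between the two sets of quantile curves, averaged over the three quantiles and all time steps. The function name and the array layout are hypothetical.

```python
import numpy as np

def quantile_error(p_exhaustive, p_kkm):
    """Mean absolute error between two sets of quantile curves.

    Both inputs have shape (3, n_timesteps): one row each for
    P10, P50 and P90, one column per time step t_k.
    The result has the units of the response (e.g. MSTB).
    """
    p_exhaustive = np.asarray(p_exhaustive, dtype=float)
    p_kkm = np.asarray(p_kkm, dtype=float)
    n_ts = p_exhaustive.shape[1]
    return np.abs(p_exhaustive - p_kkm).sum() / (3.0 * n_ts)
```

A value of zero means that the quantile curves from KKM coincide with those of the exhaustive set at every time step.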

Accurate quantile estimation would be obtained if the correlation coefficient between the true distance and the proxy distance were one. To test the robustness of the method with regard to this correlation, we start with the true distance (ρ = 1) and then degrade the correlation by adding noise to the difference in cumulative oil production obtained by Chears. For each degree of correlation, Figure 23 plots the error in quantile estimation, as given in Eq. 2, as a function of the number of clusters defined by KKM. The red line corresponds to 400 MSTB, which represents an error of approximately 0.5% of the cumulative oil production at the end of the simulation. All values of the objective function under that threshold represent KKM quantile estimations that visually superimpose on the quantiles of the exhaustive set (for example, the same or better quality of results as in Figure 17 (B)).
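One way to set up such an experiment is sketched here. The noise model is not specified in the text, so this sketch assumes symmetric, zero-mean Gaussian noise added to each pairwise difference; the function name, the `noise_std` parameter and the seed are hypothetical.

```python
import numpy as np

def degrade_distance(true_dist, noise_std, seed=0):
    """Add symmetric Gaussian noise to a pairwise distance matrix and
    report the resulting correlation with the original distances."""
    true_dist = np.asarray(true_dist, dtype=float)
    n = true_dist.shape[0]
    rng = np.random.default_rng(seed)
    # Draw one noise value per pair and mirror it to keep symmetry.
    upper = np.triu(rng.normal(0.0, noise_std, size=(n, n)), k=1)
    noisy = np.abs(true_dist + upper + upper.T)  # keep distances non-negative
    np.fill_diagonal(noisy, 0.0)
    iu = np.triu_indices(n, k=1)
    rho = np.corrcoef(true_dist[iu], noisy[iu])[0, 1]
    return noisy, rho
```

Increasing `noise_std` step by step would produce a family of proxy distances with decreasing correlation, each of which can then be passed through MDS and KKM.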


Figure 23: Evolution of DKM quality as a function of correlation and number of clusters

[Nine panels, one per correlation coefficient (1, 0.9, 0.85, 0.78, 0.67, 0.57, 0.48, 0.35 and 0.2), each plotting Error in Quantiles (MSTB) against Number of Clusters.]

Figure 23 shows clearly that the higher the correlation, the smaller the error in the quantile estimation. A high correlation also results in a smoother OF curve as a function of the number of clusters. The number of flow simulations required to obtain an accurate assessment of uncertainty increases as the correlation decreases. In other words, with a lower correlation, more "luck" is involved in picking realizations for flow simulation when the number of flow simulations is small. In addition, we observe irregularities in the variation of the objective function. The reason is that the KKM algorithm calculates cluster centroids, which do not usually coincide with an existing realization. In this case, we select the realization closest to each centroid for flow simulation, and this realization may not be a good representative of the cluster. As the correlation decreases, the probability that the realization closest to the centroid is non-representative increases. We now compare the results obtained by KKM for the total field cumulative oil production to traditional ranking methods.


6. Comparison with Ranking Methods

In this section, we compare the results obtained by the DKM with the classical technique, which consists of ranking the realizations according to a specific measure and then selecting realizations for flow simulation based on that ranking.

Ranking technique review

The idea of ranking stochastic realizations was first published in the context of stochastic reservoir modeling in 1992 (Ballin et al.). The central principle behind ranking realizations is to use some simple measure to rank realizations and then run full flow simulation with fewer realizations - for example on those which represent P10, P50 and P90. This would define the bounds of the uncertainty without performing a large number of fine-scale flow simulations. The central goal of ranking is to exploit a relatively simple, rapid (often static) measure to accurately select geological realizations that correspond to the targeted percentiles of the production responses. The ranking and selecting of realizations must be tailored to the flow process. It is well known that a particular ranking measure must be highly correlated to production response. Conventional ranking measures can be, for example, original oil in place or connectivity (McLennan and Deutsch, 2005). Streamline simulation (Gilman et al. 2002) and tracer simulation (Ballin et al., 1992) have also been employed. However, there is no unique ranking index when there are multiple flow response variables and no single ranking measure is always perfect.

Comparison of quantile estimation

Two different measures were used to rank the 72 realizations for the cumulative oil production: original oil in place (OOIP) and the cumulative oil production obtained by streamline simulation at the end of the production period (1215 days). The ranking measure is calculated for each realization. Figure 24 shows the scatter plot and the correlation coefficient between each ranking measure and the cumulative oil production obtained by the full flow simulator (Chears). OOIP has a good correlation with the cumulative oil production (0.72), and the streamline measure provides an even better one (0.93).


Figure 24: Cumulative Oil production as a function of (A) OOIP and (B) Streamlines

[Panels: (A) CUMOIL from full simulation versus OOIP; (B) CUMOIL from full simulation versus CUMOIL from streamlines.]

The traditional ranking method is applied for both measures. It consists of first ranking the realizations according to their associated ranking measure and then selecting the three realizations corresponding to the P10, P50 and P90 realizations for flow simulation. Results are presented in Figure 25.
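The selection step of the ranking method can be sketched as follows. The sketch assumes that the realization at a given quantile is the one whose rank is closest to that fraction of the sorted list; the function name and this rounding rule are illustrative choices rather than a prescribed procedure.

```python
import numpy as np

def select_by_rank(measure, quantiles=(0.1, 0.5, 0.9)):
    """Return the indices of the realizations located at the requested
    quantiles of a ranking measure (e.g. OOIP or streamline CumOil)."""
    measure = np.asarray(measure, dtype=float)
    order = np.argsort(measure)  # realization indices, lowest measure first
    n = len(order)
    positions = [int(round(q * (n - 1))) for q in quantiles]
    return [int(order[p]) for p in positions]
```

Only the realizations returned by this function are submitted to the full flow simulator, so the quality of the result depends entirely on how well the ranking measure orders the realizations with respect to the true response.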

Figure 25: Quantiles P10, P50 and P90 resulting from ranking measures (3 simulations): (A) OOIP and (B) Streamlines simulations

[Panels: CUMOIL (MSTB) versus Time (days), exhaustive set and ranking, for each ranking measure.]

Figure 25 (A) shows that the use of OOIP as the ranking measure leads to a quasi-superimposition of the P10 and P50 curves. Slight differences between the two curves exist, although in Figure 25 the curves appear identical. This is due to a poor estimation of the OOIP for the realization selected as P10. The OOIP values of the realizations selected as P10 and P50 are different, but the responses associated with those two realizations are similar. In the case of the streamline-based measure, the quantile estimation encounters the same problem. To improve those results and to compare quantile curves based on the same number of flow simulations, we apply the ranking method using 7 simulations equally spaced according to the ranking measure. The P10, P50 and P90 quantiles are then estimated from those 7 simulations.


Figure 26: Quantiles P10, P50 and P90 resulting from ranking measures (7 simulations): (A) OOIP and (B) Streamlines simulations

[Panels: CUMOIL (MSTB) versus Time (days), exhaustive set and ranking, for each ranking measure.]

Figure 26 shows a great improvement of the estimation of the quantiles of the cumulative oil production. The quantiles are well estimated, although less accurately than the DKM method (Figure 9). Note that in many applications, the correlation coefficient between the OOIP and the cumulative oil production is smaller, thus using OOIP as a ranking measure would lead to less accurate results.

7. Conclusions

We have presented results for a distance-based method for uncertainty quantification of a spatial phenomenon. The method relies on a reasonable correlation between the distance measure and the response variables of interest. Using an application-specific distance is an important additional tool which makes the task of response uncertainty quantification more effective. In general, each new type of application will require investigation of a new distance, which requires more work but produces more reward. For similar types of problems, these distances can then be re-used. In our example of assessing subsurface flow uncertainty, we use streamline simulation to obtain the distances, which often correlate well with the differences in production response using standard flow simulation. We apply the distance-kernel method to a real field case. The reservoir properties are modeled with 4 facies, and are described by 12 different training images and 3 different facies probability cubes. Uncertainty quantification on the WCA reservoir has been performed with only a small number of simulations. The statistics obtained by flow simulation on the few realizations selected by the DKM are very similar to those obtained by simulation on the entire set of 72 realizations. A comparison with the traditional ranking method shows that, for the same number of flow simulations, our method estimates the quantiles more accurately than this widely used technique.


Acknowledgements

The authors would like to acknowledge SCRF sponsors and Chevron for their support. We would like to thank Chevron for providing the data and permission to publish the results. Many thanks as well to Sebastien Strebelle and Alexandre Castellini from Chevron for responding to many questions. Also, we would like to acknowledge Darryl Fenwick from StreamSim Technologies for his help using 3DSL and for many useful discussions.

References

Ballin, P.R., Journel, A.G., and Aziz, K. [1992] Prediction of Uncertainty in Reservoir Performance Forecast, JCPT, no. 4

Borg, I., Groenen, P. [1997] Modern multidimensional scaling: theory and applications. New-York, Springer

Buhmann, J. M., [1995] Data clustering and learning: The Handbook of Brain Theory and Neural Networks, MIT Press, p. 278-281

Dhillon, I. S., Guan, Y. and Kulis, B. [2004] Kernel k-means, Spectral Clustering and Normalized Cuts, KDD, August 22-25, 2004, Seattle, Washington, USA

Gilman, J.R., Meng, H.-Z., Uland, M. J., Dzurman, P.J., Cosic, S. [2002] Statistical Ranking of Stochastic Geomodels Using Streamline Simulation: A Field Application. SPE Annual Technical Conference and Exhibition, SPE 77374.

Hoffman, B. T. [2005], Geologically consistent history matching while perturbing facies, PhD thesis, Stanford University

Maharaja, A. [2006], Assessing uncertainty on net-to-gross at the appraisal: Application to a west Africa deep-water reservoir, SCRF report 19, Stanford University

Maharaja, A. [2007], Global uncertainty of a West-Africa reservoir, SCRF report 20, Stanford University

McLennan, J.A., and Deutsch, C.V. [2005] Ranking Geostatistical Realizations by Measures of Connectivity, SPE/PS-CIM/CHOA.

Ng, A. Y., Jordan, M. and Weiss, Y., [2001], On spectral clustering: Analysis and an algorithm: In Advances in Neural Information Processing Systems 14

Park, K., and Caers, J. [2007] History Matching in Low-Dimensional Connectivity-Vector Space, SCRF report 20, Stanford University

Sarma, P., Durlofsky, L. J., Aziz, K. and Chen, W. H., [2007] A New Approach to Automatic History Matching using Kernel PCA, SPE Reservoir Simulation Symposium, Houston, Texas, USA, SPE 106176


Scheidt, C., and Caers, J. [2007] A workflow for Spatial Uncertainty Quantification using Distances and Kernels, SCRF report 20, Stanford University

Schöelkopf, B., Smola, A. [2002] Learning with Kernels, MIT Press, Cambridge, MA.

Shawe-Taylor, J., Cristianini, N., [2004], Kernel Methods for Pattern Analysis: Cambridge University Press, 462 p.

Shi, J., and Malik, J., [2000], Normalized-cut and image segmentation: IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 22, no. 8, p. 888-905.

Strebelle, S., [2002], Conditional Simulation of Complex Geological Structures using Multiple-point Statistics, Mathematical Geology, 34, 1-22.

Suzuki, S., Caers, J. [2006] History matching with an uncertain geological scenario. SPE Annual Technical Conference and Exhibition, SPE 102154.


Appendix A: Multidimensional Scaling

Multidimensional scaling (MDS) is a technique used to translate the dissimilarity matrix into a configuration of points in nD Euclidean space (Borg and Groenen, 1997). The points in this spatial representation are arranged in such a way that their Euclidean distances correspond as much as possible (in a least-squares sense) to the dissimilarities of the objects. Thus, one measure of a successful MDS procedure is a good correlation between the Euclidean distance and the dissimilarity distance. The classical MDS algorithm rests on the fact that the coordinate matrix X can be derived by eigenvalue decomposition from a Gram matrix B, which is obtained by converting the dissimilarity matrix D into a scalar product. The following steps summarize the algorithm of classical MDS:

1. Construct a matrix A with elements $a_{ij} = -\frac{1}{2}\delta_{ij}^2$

2. Construct a matrix B by centering A: $B = HAH$, using the matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$

3. Extract the p largest positive eigenvalues $\lambda_1, \ldots, \lambda_p$ of B and the corresponding p eigenvectors $e_1, \ldots, e_p$.

4. A p-dimensional spatial configuration of the $N_R$ objects is derived from the coordinate matrix $X = E_p \Lambda_p^{1/2}$, where $E_p$ is the matrix of p eigenvectors and $\Lambda_p$ is the diagonal matrix of p eigenvalues of B, respectively.
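The four steps can be sketched with NumPy as follows. This is a minimal illustration under the assumption that the dissimilarity matrix is symmetric with a zero diagonal; the function name is hypothetical.

```python
import numpy as np

def classical_mds(delta, p):
    """Classical MDS: map a dissimilarity matrix to p-dimensional points.

    delta: symmetric (n, n) matrix of dissimilarities
    p:     maximum number of dimensions to keep
    """
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    A = -0.5 * delta ** 2                    # step 1
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = H @ A @ H                            # step 2
    eigvals, eigvecs = np.linalg.eigh(B)     # B is symmetric
    order = np.argsort(eigvals)[::-1][:p]    # step 3: p largest eigenvalues
    lam, E = eigvals[order], eigvecs[:, order]
    positive = lam > 0                       # discard non-positive eigenvalues
    return E[:, positive] * np.sqrt(lam[positive])  # step 4: X = E Lambda^(1/2)
```

When the dissimilarities are exact Euclidean distances between points in p dimensions, the pairwise distances between the rows of the returned matrix reproduce them up to numerical precision.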

Classical MDS assumes the distances to be Euclidean. However, in many applications, the data are not distances as measured from a map, but rather dissimilarity data. When applying classical MDS to dissimilarities, it is assumed that the proximities behave like real measured distances. The advantage of classical MDS is that it provides an analytical solution, requiring no iterative procedures. For the mapping to be exact, the matrix B derived from D should be positive semi-definite. However, the mapping can still be done accurately by considering only the positive eigenvalues, provided that the negative eigenvalues are of small amplitude. Note that since the map obtained by MDS is derived solely from the dissimilarities in the matrix, the absolute location of the points is irrelevant. The map can be subject to translation, rotation, and reflection without impact on the methodology. Only the distances in mapping space R are of interest. For more details about MDS methodologies, see Borg and Groenen (1997).


Appendix B: Kernel K-means and Spectral Clustering

Kernel theory was recently developed in the fields of neural computing and pattern recognition; in particular, kernel principal component analysis (KPCA) is often used as a tool to remove noise from computerized images. In the petroleum field, kernel theory has been used by Sarma et al. (2007) in the context of history matching and production optimization. Kernel methods consist of mapping the given data points from their input space R to some high-dimensional feature space F using a multidimensional function $\Phi : R \rightarrow F$. The feature space F is assumed to have a better linear variation than R; in other words, points in F are linearly separable. Thus, tools that require a linear relationship between data to be efficient can be applied in F instead of R.

In our application, we use kernel methods to transform points generated by MDS in the nonlinear space R into a space F with improved linear variation. Once in this linear space, standard tools such as principal component analysis and cluster analysis are employed to analyze the point structure. In this paper, we consider only uncertainty quantification using kernel clustering methods, although Scheidt and Caers (2007) showed that the use of KPCA gives similar results.

It has been pointed out that kernel methods can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such as PCA or clustering techniques like k-means (Schöelkopf and Smola, 2002). Indeed, the kernel method has been developed for computing dot products in feature spaces. One main advantage of kernel methods for algorithms requiring dot products is that there is no need to map the points explicitly from space R to F: all necessary dot products in space F can be evaluated directly from pairs of points in input space R. The function that returns these dot products is called a kernel function k, and is given by:

$$k(\mathbf{x}, \mathbf{y}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle \qquad (1)$$

Thus, even if the space F and the mapping Φ are complicated, algorithms such as KPCA or kernel clustering can be formulated in such a way that only the dot product in F is needed (Eq. 1).

Kernel function

Applying KPCA to a dataset requires the definition of the dot-product in the feature space F, i.e. the definition of a kernel function. A commonly used kernel function is the Gaussian kernel (radial basis function), given by:

$$k(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2} \right) \quad \text{with } \sigma > 0 \qquad (2)$$


In our application, we consider a Gaussian kernel for all the cases (Eq. 2). The parameter σ controls the flexibility of the kernel. For small values of σ, the kernel matrix becomes close to identity matrix (K=I). On the other hand, large values of σ gradually reduce the kernel to a constant function (K=1). Shi et al. (2000) recommended choosing the kernel width as 10% to 20% of the range of the distance between samples. After many tests, it appears that choosing σ as 20% of the range of the distance between points is usually robust.
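A possible implementation of the Gaussian kernel matrix with this bandwidth rule is sketched here. The sketch interprets the "range of the distance between points" as the difference between the largest and smallest pairwise distances, which is an assumption; the function name and the `sigma_fraction` parameter are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma_fraction=0.2):
    """Gaussian kernel matrix for points X of shape (n_points, n_dims).

    sigma is set to a fraction of the range (max minus min) of the
    pairwise distances between distinct points.
    """
    X = np.asarray(X, dtype=float)
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    dist = np.sqrt(sq_dist)
    iu = np.triu_indices_from(dist, k=1)      # each pair counted once
    sigma = sigma_fraction * (dist[iu].max() - dist[iu].min())
    # Note: sigma is zero if all pairwise distances are equal.
    K = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return K, sigma
```

Here `X` would be the coordinates of the realizations in the MDS space, so that the Euclidean distances entering the kernel approximate the application-specific dissimilarities.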

Kernel K-Means Clustering (KKM)

Clustering algorithms are suited to our problem. Cluster analysis aims to discover the internal organization of a dataset by finding structure within the data in the form of clusters. Hence, the data is broken down into a number of groups composed of similar objects. This methodology is widely used both in multivariate statistical analysis and in machine learning. Defining clusters consists of identifying an a priori fixed number of centers and assigning each point to the cluster with the closest center. In this work, we apply the classical k-means algorithm in feature space F to determine a subset of points defined by the cluster centroids. The k-means algorithm tries to assign points to k clusters $S_i$ by minimizing the expected squared distance between the points of the cluster and its center $\mu_i$:

$$J = \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in S_i} \left\| \mathbf{x}_j - \mu_i \right\|^2$$

The algorithm starts by partitioning randomly the input points into k initial sets Si. It then calculates the mean point, or centroid µi, of each set. Then, every point is assigned to the cluster whose centroid is closest to that point. These two steps are alternated until convergence, which is obtained when the points no longer switch clusters (or alternatively centroids are no longer changed). The k-means procedure requires a method for measuring the distance between two points in the high-dimensional feature space F. Once again, this Euclidean distance can always be computed using the inner product information through the equality:

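$$\left\| \Phi(\mathbf{x}) - \Phi(\mathbf{z}) \right\|^2 = \langle \Phi(\mathbf{x}), \Phi(\mathbf{x}) \rangle + \langle \Phi(\mathbf{z}), \Phi(\mathbf{z}) \rangle - 2 \langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle = k(\mathbf{x}, \mathbf{x}) + k(\mathbf{z}, \mathbf{z}) - 2 k(\mathbf{x}, \mathbf{z})$$

Note that this equality is only true for Euclidean distance, hence the necessity of the MDS procedure prior to performing KKM. For an overview of clustering techniques, see Buhmann (1995), and Shawe-Taylor and Cristianini (2004) for specific information about kernel clustering techniques.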
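The same identity extends to the distance between a point and a cluster centroid in feature space, which is all that k-means needs. The sketch below is one way to write kernel k-means from a precomputed kernel matrix, followed by the selection of the realization closest to each centroid. The function names, the handling of empty clusters and the use of cluster sizes as weights are assumptions made for the example.

```python
import numpy as np

def centroid_distances(K, labels, n_clusters):
    """Squared feature-space distance from every point to every centroid,
    computed from the kernel matrix K only."""
    n = K.shape[0]
    dist = np.full((n, n_clusters), np.inf)
    for c in range(n_clusters):
        members = labels == c
        m = members.sum()
        if m == 0:
            continue  # an empty cluster attracts no points
        dist[:, c] = (np.diag(K)
                      - 2.0 * K[:, members].sum(axis=1) / m
                      + K[np.ix_(members, members)].sum() / m ** 2)
    return dist

def kernel_kmeans(K, init_labels, n_clusters, max_iter=100):
    """Alternate centroid update and reassignment until labels are stable."""
    labels = np.asarray(init_labels).copy()
    for _ in range(max_iter):
        new_labels = centroid_distances(K, labels, n_clusters).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def representatives(K, labels, n_clusters):
    """Index of the point closest to each centroid, and the cluster sizes."""
    dist = centroid_distances(K, labels, n_clusters)
    reps, weights = [], []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        reps.append(int(members[dist[members, c].argmin()]))
        weights.append(int(members.size))
    return reps, weights
```

The initial labels can come from any partition of the points; the quality of the final clusters depends on this starting point, because the procedure only converges to a local minimum.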


Using a random partition of the input points to initialize the clusters is not optimal, because the k-means algorithm is subject to many local minima. In this work, we propose a two-step approach, which consists of initializing KKM with the results of an alternative to k-means called spectral clustering.

Spectral Clustering as initialization of KKM

A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. These algorithms cluster points using eigenvectors of matrices derived from the data. The algorithm proposed in Ng et al. (2001), used here to initialize KKM, is as follows:

1. Form the affinity matrix $K_{ij} = \exp\left( -\|x_i - x_j\|^2 / 2\sigma^2 \right)$

2. Define D to be the diagonal matrix whose (i,i)-element is the sum of the i-th row of K, and construct the matrix $L = D^{-1/2} K D^{-1/2}$

3. Find $v_1, \ldots, v_k$, the eigenvectors of L associated with the k largest eigenvalues, and form the matrix $V = [v_1, \ldots, v_k]$ by stacking the eigenvectors in columns

4. Form the matrix Y from V by renormalizing each of V's rows to have unit length, i.e. $Y_{ij} = V_{ij} / \left( \sum_j V_{ij}^2 \right)^{1/2}$

5. Treat each row of Y as a point in $\mathbb{R}^k$ and cluster the rows into k clusters via k-means (the initial centroids are chosen such that they are 90 degrees apart)

6. Assign the original point $x_i$ to cluster j if and only if row i of the matrix Y was assigned to cluster j
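Steps 2 to 4 can be sketched as follows, starting from an affinity matrix that has already been formed as in step 1. The function name is hypothetical, and the sketch assumes that every row of K has a positive sum and that no row of V is entirely zero.

```python
import numpy as np

def spectral_embedding(K, k):
    """Row-normalized matrix of the k leading eigenvectors of
    L = D^(-1/2) K D^(-1/2), where D holds the row sums of K."""
    K = np.asarray(K, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    L = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]   # step 2
    eigvals, eigvecs = np.linalg.eigh(L)                # L is symmetric
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]       # step 3
    return V / np.linalg.norm(V, axis=1, keepdims=True) # step 4
```

Each row of the returned matrix is a point on the unit sphere. For well-separated groups, rows belonging to the same group nearly coincide and rows from different groups are nearly orthogonal, which is why an ordinary k-means on these rows (step 5) tends to be much less sensitive to its starting point.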

The spectral approach solves a relaxed problem: it computes the first k eigenvectors of the matrix $L = D^{-1/2} K D^{-1/2}$. This maps the original points to a lower-dimensional space, in which a discrete clustering solution is obtained. One can treat the resulting partitioning as a good initialization for kernel k-means on the full dataset. This two-layer approach, which first runs spectral clustering to get an initial partitioning and then refines that partitioning by running KKM, typically results in a robust partitioning of the data (Dhillon et al., 2004).