A General Approach for Assimilation of Multi‐scale, Multi‐type Data
Yoram Rubin, UC Berkeley
Outline
• Perspective on data assimilation: Spatial variability in geologic media
• The Method of Anchored Distributions for data assimilation (principles, application, computational tools)
• Forward looking:
– Measurement Theory (what to measure, where to measure, how to maximize information yield)
– Open‐source community tool for data assimilation.
The Media
• Geological media (soil, rock) are complex and spatially‐variable
• Larger observation scales reveal additional length scales of variability (multiple length‐scales)
• Characterization (mapping of soil properties) is subject to large uncertainty due to scarcity of data and spatial variability
Scale‐dependent variability
Small scale
Large Scale
Type-A and Type-B data
Hubbard and Rubin, 2005
Figure: relative scales of investigation and relative resolution of acquisition approaches
Relative scales of investigation: Laboratory or Point (~10^-3 to 10 m); Local (~10^-1 to 10^2 m); Regional (~10^1 to 10^5 m)
Relative resolution: High (~10^-4 to 1 m); Moderate (~10^-1 to 10 m); Low (~1 to 10^2 m)
Acquisition approaches shown: Core/Tank Measurements; Wellbore Logging; Crosshole measurements and well tests; Surface Geophysics; Airborne/Satellite
• Acquisition approaches near the small‐scale end of the chart provide high‐resolution information over small spatial extents
• Acquisition approaches near the large‐scale end of the chart provide low‐resolution information over large spatial extents
Example from ocean circulation
Another example is the mapping of ocean circulation, which relies on a variety of data types (e.g., temperature, density, velocity vector components) obtained from ship surveys, moored instruments, buoys drifting freely on or floating below the ocean surface, and satellites.
These data are measured over a wide range of scales and frequencies, and they need to be assimilated to yield accurate circulation models.
What are the challenges?
• Multi‐type, multi‐scale data assimilation:
– Multiple perspectives are expected to provide a more coherent 3D image of the subsurface…
– But data are collected at different scales, and many types of data are only weakly or indirectly related to hydrogeologic parameters. How, then, can information be gleaned for conditioning the target variable(s)?
What is MAD?
MAD (Method of Anchored Distributions) is a stochastic inverse modeling approach that:
– can be used to assimilate data from multiple sources
– is not constrained by modeling choices or specific data types
– reduces the computational effort using sparse parameterization
Rubin, Y., X. Chen, H. Murakami, and M. Hahn, 2010, Water Resources Research
MAD principles
• Data classification
– Type‐A data: Direct or local
– Type‐B data: Indirect and non‐local
• Localization: A strategy for unifying multiple data types using anchors.
• Projection: A geostatistical model is used for modeling global trends and for projecting data from measurements onto un‐sampled locations.
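The projection step can be illustrated with a minimal sketch. It assumes a 1‐D field, a known mean, and an exponential covariance model, and it uses simple kriging as one possible geostatistical projector; the function names and parameter values are hypothetical and are not taken from the MAD software.

```python
import math


def exp_cov(h, variance, length):
    """Exponential covariance C(h) = variance * exp(-|h| / length) (assumed model)."""
    return variance * math.exp(-abs(h) / length)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def simple_krige(x_obs, y_obs, x0, mean, variance, length):
    """Project point values onto the un-sampled location x0 (simple kriging)."""
    cov = [[exp_cov(xi - xj, variance, length) for xj in x_obs] for xi in x_obs]
    c0 = [exp_cov(x0 - xi, variance, length) for xi in x_obs]
    w = solve(cov, c0)  # kriging weights
    est = mean + sum(wi * (yi - mean) for wi, yi in zip(w, y_obs))
    var = variance - sum(wi * ci for wi, ci in zip(w, c0))
    return est, var


# Hypothetical point values of Y at three locations along a transect
est, var = simple_krige([0.0, 10.0, 25.0], [-1.0, 0.5, 2.0], 12.0,
                        mean=0.0, variance=1.0, length=5.0)
```

Near a measurement the estimate follows the data and the kriging variance shrinks; far from all measurements the estimate relaxes to the assumed mean (the global trend) and the variance returns to the prior variance.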
Systematic classification of data
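Spatial variable of interest: $Y(x)$
Entire field: $\tilde{Y}$
Type‐A: data that give point values of $Y$, either directly or through models: $z_a = Y(x_a) + \varepsilon_a$
Type‐B: data that are a function of the field: $z_b = M(\tilde{Y}) + \varepsilon_b$
(notation following Rubin et al., 2010; $\varepsilon_a$ and $\varepsilon_b$ are errors and $M$ is a forward model)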
Anchors are the carriers of information relevant to Y
• Anchors are intended to capture the information in the Type‐B data that is relevant to the target variables
• Anchors are statistical distributions of the target variables that can be assumed (priors) or inferred through Bayesian data assimilation.
• The distributions represent measurement errors and the quality of the relationship between the data and the target variables.
• Anchors are placed at strategic locations (in terms of information yield)
Figure: schematic showing the locations of Type‐A data, Type‐B data, and anchors
Inverse modeling with MAD
• The model is defined through the joint statistical distribution of a vector that includes the structural parameters and the anchors
– Global trends are captured via the geostatistical model
– Local effects are captured by the anchors
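The joint distribution can be written out as a short sketch. The symbols are notation assumed here rather than taken from the slides: $\theta$ for the structural parameters, $\vartheta$ for the anchors, and $z_a$, $z_b$ for the Type‐A and Type‐B data. The factorization follows from Bayes' rule and the chain rule of probability.

```latex
p(\theta, \vartheta \mid z_a, z_b)
  \;\propto\;
  \underbrace{p(\theta \mid z_a)\, p(\vartheta \mid \theta, z_a)}_{\text{prior of structural parameters and anchors}}
  \;\times\;
  \underbrace{p(z_b \mid \theta, \vartheta, z_a)}_{\text{likelihood of the Type-B data}}
```

In this form the first two factors carry the global trend and the local information available before the Type‐B data are considered, and the last factor is where the forward model enters.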
The Architecture of MAD
• Block 1: Problem Setup, Derive Prior (Fixed)
• Block 2: Likelihood Analysis (Fixed)
• Block 3: Derive Posterior, Diagnostics (Fixed)
• Forward Model Driver (Portable)
• Forward Model (External Software)
Why is the block structure important?
• There is a need for a general, easy‐to‐apply computational tool for data assimilation that is modular, assumption‐free, and not tied to any particular modeling tool.
• Such a tool would put enormous expertise in the hands of environmental scientists, sparing them the need to re‐create data assimilation solutions for each project.
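The value of the block structure can be illustrated with a toy sketch in which the three fixed blocks never call the forward model directly; only the driver does, so the model can be swapped without touching the rest. Everything here is an assumption made for illustration: a single scalar anchor, a N(0, 1) prior, a Gaussian likelihood, importance weighting of prior samples, and a hypothetical linear forward model standing in for external software. It is not the MAD implementation.

```python
import math
import random


def sample_prior(n, rng):
    """Block 1: draw candidate anchor values from an assumed N(0, 1) prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def toy_forward_model(anchor):
    """Hypothetical stand-in for external software: anchor -> Type-B datum."""
    return 2.0 * anchor + 1.0


def forward_driver(model, samples):
    """Portable driver: the only component that calls the forward model."""
    return [model(s) for s in samples]


def likelihood(predictions, observed, sigma):
    """Block 2: likelihood of the observed Type-B datum (Gaussian form assumed)."""
    return [math.exp(-0.5 * ((observed - p) / sigma) ** 2) for p in predictions]


def posterior_mean(samples, weights):
    """Block 3: likelihood-weighted posterior mean of the anchor."""
    total = sum(weights)
    return sum(s * w for s, w in zip(samples, weights)) / total


def run_sketch(model, observed, sigma=0.5, n=20000, seed=0):
    """Chain the blocks; swapping `model` requires no change to any block."""
    rng = random.Random(seed)
    samples = sample_prior(n, rng)
    predictions = forward_driver(model, samples)
    weights = likelihood(predictions, observed, sigma)
    return posterior_mean(samples, weights)


# Same blocks, two different forward models
mean_linear = run_sketch(toy_forward_model, observed=3.0)
mean_identity = run_sketch(lambda a: a, observed=1.0)
```

For this linear Gaussian toy case the exact posterior means are 16/17 and 4/5, so the sampled values should land close to those numbers; with a real simulator only `toy_forward_model` would be replaced.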
Example
Data Classification
– Type‐A data: observations that are a function of a point value
• Small‐scale pump tests (EBF)
• Core samples
– Type‐B data: non‐local observations, including:
• Pumping tests
• Tracer tests
• Geophysical data
Figure: Normalized hydraulic conductivity profiles for Wells 399-2-8, 399-2-16, 399-2-13, 399-2-14, 399-2-19, 399-2-18, 399-2-17, 399-2-7, 399-2-12, and 399-2-15. Horizontal axis: normalized Ki (0.0 to 0.2); vertical axis: depth (30 to 60 ft bgs). Markers indicate locations where test conditions resulted in non-representative EBF profiles.
Anchors capture local features and improve prediction
Figure: two panels, "Point values" and "Point values + anchors", with transect labels 2-7 and 2-10. Legend: unknown true field (transect), estimated mean field, measured point values, anchors.
Summary
• Anchored distributions (anchors) are statistical localization devices that represent the target variables at the smallest scale, conditioned on multi‐scale data.
• Anchors can be used for assimilating multiple types of data and migrating information across scales.
• Flexible data classification makes MAD applicable in multiple disciplines.
What’s on the horizon:
Measurement Theory
• A theory that addresses the following questions:
– What data to collect? Where? How many? At what frequency?
– How to answer these questions when addressing different goals?
– How to maximize the information yield from measurements?
What’s the problem?
• Current approaches for site characterization (and monitoring) are “need to know everything” approaches (suitable for research sites) whereas in many cases characterization should be “goal‐oriented” and should be viewed in the context of the application.
• In applications, we must deal with budgetary constraints, and so we need to make choices.
• A rational framework is needed for sound planning and prioritizing.
The first concept: Looking beyond hydrogeology
Tools are needed for comparing the contributions of hydrogeological and non‐hydrogeological data to the overall benefit of the project, recognizing that there may be contributors to uncertainty other than hydrogeology.
Such a comparison can be made using Comparative Information Yield Curves (de Barros et al., WRR, 2009, 2010)
The second concept: Hypothesis‐driven Approach for Site Characterization
• Characterization is performed in support of a hypothesis that can be either accepted or rejected, for example:
– Hypothesis: a water supply well is not in real danger of being contaminated
– Hypothesis: a contaminated site will enhance cancer risk in humans due to exposure of some sort
• The challenge is to design a data acquisition strategy that leads to accepting or rejecting the hypothesis with an a priori defined confidence level: characterization is viewed as the means for achieving a goal (that is, confirming or rejecting a hypothesis), not as a goal unto itself
• The goal is to minimize the risk of making the wrong decision (Nowak et al., WRR, 2011)
Venn diagram of decisions/events
H0: null hypothesis; H1: alternative hypothesis
• Accept H0 when true: correct decision
• Accept H1 when true: correct decision
• Accept H1 when not true: Type‐I error
• Accept H0 when not true: Type‐II error
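The link between data quality and the two kinds of wrong decision can be shown with a small Monte Carlo sketch. All of it is assumed for illustration: the true state is drawn from N(0, 1), H0 holds when the true state is at or below a threshold, and the decision is made by comparing a noisy estimate of the state with the same threshold. The function name and parameters are hypothetical.

```python
import random


def error_rates(threshold, noise_sd, n=50000, seed=1):
    """Estimate P(Type-I error) and P(Type-II error) for a threshold rule."""
    rng = random.Random(seed)
    type1 = type2 = n_h0 = n_h1 = 0
    for _ in range(n):
        truth = rng.gauss(0.0, 1.0)                  # hypothetical true state
        estimate = truth + rng.gauss(0.0, noise_sd)  # state inferred from noisy data
        h0_true = truth <= threshold
        accept_h0 = estimate <= threshold
        if h0_true:
            n_h0 += 1
            type1 += int(not accept_h0)  # accepted H1 although H0 is true
        else:
            n_h1 += 1
            type2 += int(accept_h0)      # accepted H0 although H1 is true
    return type1 / n_h0, type2 / n_h1


# Compare a poorly informed decision with a well informed one
alpha_noisy, beta_noisy = error_rates(threshold=1.0, noise_sd=1.0)
alpha_good, beta_good = error_rates(threshold=1.0, noise_sd=0.1)
```

In this toy setting, reducing the noise (that is, collecting more informative data) lowers both error rates, which is the sense in which a data acquisition strategy can be designed against a target confidence level.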
How to increase information yield?
• Where to place anchors (or other localization devices)? At locations where they are most efficient in gleaning information from observations.
• Stochastic sensitivity analysis coupled with Monte‐Carlo simulations identifies promising locations for placing anchors (Yang et al., WRR, in press).
Strategic placement of localization devices (such as pilot‐points and anchors) in inverse modeling schemes (Yang et al., Water Resources Research, 2012, in press).
Bayesian MAD
Data Assimilation: CT Scan for the Earth
Data assimilation for subsurface investigations: multiple data sources (of different quality and resolution), including direct measurements and tomographic data, are used simultaneously, leading to a more coherent 3D image of soil properties.
Hydraulic Conductivity log