Designing observational biologging studies to assess
the causal effect of instrumentation
Matthieu Authier1*, Clara Péron1, Alain Mante2, Patrick Vidal3 and David Grémillet1,4
1Centre d'Écologie Fonctionnelle et Évolutive, CEFE-CNRS UMR 5175, 1919 Route de Mende, Montpellier Cedex 5, 34 293, France; 2Conservatoire d'Espaces Naturels de Provence-Alpes-Côte d'Azur, Réserve Naturelle Nationale de l'Archipel de Riou, 166 avenue de Hambourg – Immeuble le Sud, Marseille, 13008, France; 3Sémaphore de Pomègues – Le Frioul, Parc Maritime des Iles du Frioul, Marseille, 13001, France; and 4DST-NRF Centre of Excellence, FitzPatrick Institute, University of Cape Town, Rondebosch, 7701, South Africa
Summary
1. Biologging has improved ecological knowledge on an increasing number of species for more than 2 decades.
Most studies looking at the incidence of tags on behavioural, physiological or demographic parameters rely on
‘control’ individuals chosen randomly within the population, assuming that they will be comparable with
equipped individuals. This assumption is usually untestable and untenable since biologging studies are more
observational than experimental, and often involve small sample sizes. Notably, background characteristics of
wild animals are, most of the time, unknown. Consequently, investigating any causal effect of instrumentation is
a difficult task, subject to hidden biases.
2. We describe the counterfactual model to causal inference which was implicit in early biologging studies. We
adopted methods developed in social and political sciences to construct a posteriori an appropriate control group.
Using biologging data collected on Scopoli’s shearwaters (Calonectris diomedea) from a small Mediterranean
island, we used this method to achieve objective causal inference on the effect of instrumentation on breeding per-
formance and divorce.
3. Our method revealed that the sample of instrumented birds was nonrandom. After identification of a relevant
control group, we found no carry-over effects of instrumentation on breeding performance (taking into account
imperfect detection probability) or divorce rate in Scopoli’s shearwaters.
4. Randomly chosen control groups can be both counterproductive and ethically dubious via unnecessary addi-
tional disturbance of populations. The counterfactual approach, which can correct for selection bias, has wide
applicability to biologging within long-term studies.
Introduction
There is no controversy over the beneficial impact of the bio-
logging revolution for wildlife ecology (Ropert-Coudert et al.
2012, but see Hebblewhite & Haydon 2010). Biologging is the
‘use of miniaturized animal-attached tags for logging and/or
relaying data about an animal’s movements, behaviour, physiol-
ogy and/or environment’ (Rutz & Hays 2009). Knowledge on
the ecology of elusive animals, in particular marine species,
greatly increased over the last two decades, as epitomized by
seabird research (Wilson et al. 2002). Seabirds have been the
topic of a sustained wealth of biologging studies (Vandenabe-
ele, Wilson & Grogan 2011). The percentage of publications
addressing the potential for detrimental effects of tags on
seabirds, however, did not increase over the same period
(Vandenabeele, Wilson & Grogan 2011).
Although guidelines and suggestions on instrumentation
and animal welfare have been issued over the years (Wilson,
Grant & Duffy 1986; Phillips, Xavier & Croxall 2003; Hawkins
2004; Wilson & McMahon 2006; Casper 2009), a shortcoming
of impact studies is often the control group. Biologging is pla-
gued by a catch-22 effect (Barron, Brawn & Weatherhead
2010): behaviours upon which we expect an adverse effect from
tags may not be observable without the latter. Wilson, Grant
& Duffy (1986) identified this issue and proposed to use linear
regression to infer the behaviour of untagged animals. If the
underlying statistical model is correct, this approach may pre-
dict values for the same animal, as if it had not been equipped.
The conditional tense betrays a counterfactual in the terminol-
ogy of causal inference (Rubin 2006). Indeed, causal inference
concerns what would happen following an intervention,
hypothetical or real (Gelman & Hill 2007).
Sample size is another issue. Random sampling guarantees
that background characteristics of animals will be balanced on
average between a control group and an instrumented group.
Yet, any specific study, especially small sample-sized ones, will
have some bias due to imbalance (Gelman & Hill 2007, p. 172).
A small sample cannot reproduce all the essential features of
the target population, although belief in the contrary is
widespread (Tversky & Kahneman 1971). Because only a small
number of expensive tags can typically be deployed
*Correspondence author. E-mail: authierm@gmail.com
© 2013 The Authors. Methods in Ecology and Evolution © 2013 British Ecological Society
Methods in Ecology and Evolution 2013, 4, 802–810 doi: 10.1111/2041-210X.12075
(Hebblewhite & Haydon 2010), the assumption of no selection
bias is strong. The healthy-looking bird with shiny feathers is
more likely to be instrumented with an expensive tag than an
emaciated, shabby-plumaged one. Ethics and animal welfare
considerations actually forbid the second bird to be instrumen-
ted. Assessing the impact of instrumentation demands a mean-
ingful control sample, which is a group of birds that could
have been equipped but were not. Causal questions on instru-
mentation can only be unambiguously addressed if such a con-
trol group exists. In general, a potent threat to causal inference
is selection bias, that is, bias due to inadequate choice of a con-
trol sample.
Some studies of the impact of instrumentation reported no
short or long-term effects, for example on large animals
(McMahon et al. 2008). A recent meta-analysis on birds
reported an overall negative impact but also that breeding
success and survival were larger for birds equipped with
larger tags (Barron, Brawn & Weatherhead 2010). Similarly,
Grémillet et al. (2005) found that the resighting rate of Arctic
Great Cormorants (Phalacrocorax carbo) 2 years after instru-
mentation was higher for birds which had been equipped with
internal heart-rate data loggers. It is difficult to believe the
causal interpretation of these results, which may rather be
statistical artefacts, such as Simpson’s or Lord’s paradox (see
Appendix S1).
Simpson’s paradox occurs whenever the relationship
between two categorical variables differs depending upon
whether subgroups are accounted for in an analysis or not. Its
resolution often lies in causal reasoning: only variables that are
unaffected by the treatment should be accounted for (see
Appendix S1). This truism stems from the definition of a cause
for an effect: (i) the putative cause happened before the effect
(spatio-temporal contiguity); (ii) putative cause and effect co-
vary; and (iii) other potential putative causes that may affect
the phenomenon are ruled implausible (confounders are neu-
tralized). Lord’s paradox is more subtle (see Appendix S1,
Lord 1967; Holland & Rubin 1983), but illustrates the impor-
tance of defining relevant comparisons and clearly stating all
assumptions underlying estimates implied to be causal (Rubin,
Stuart & Zanutto 2004). Lord’s paradox occurs whenever a
control group is missing: conclusions on the cause of effects
may result more from seemingly innocuous statistical assump-
tions rather than data (King & Zeng 2007; Arah 2008). Causal
inference aims at predicting what would have happened, had a
treated unit been left as a control. That is, inference proceeds
by estimating some unobserved outcomes, either implicitly
(classical approach) or explicitly (Rubin, Stuart & Zanutto
2004; Rubin 2006). In both cases, assumptions and definitions
are required to avoid paradoxical results.
Methods
Our focus is two-fold: (i) how to find a control group in an observa-
tional study; and (ii) how to ensure that the control group is adequate
for causal inference. These steps will help assess whether unambigu-
ous causal inference is possible. Observational studies and randomized
experiments have often been opposed in ecology (Sagarin & Pauchard
2010). Rather than a dichotomy, they represent two ends on a contin-
uum of suitability to infer the cause of effects (Rubin 2007). In a ran-
domized control trial (RCT), treatment is randomly allocated to units
such that both known and unknown confounders are evenly distrib-
uted between control and treated units: each unit has a nonzero proba-
bility of receiving each treatment, independently of other units. This is
usually not the case in an observational study, but the latter may be
conceptualized as a ‘broken RCT’ (Rubin 2006). Below, we detail how to mend a posteriori an observational study as if it were an RCT, drawing from methods developed in the political and social sciences (Rubin 2006, 2007, 2008; Sekhon 2009; Austin 2011; Sekhon 2011). Methods are summarized in Fig. 1.
TREATMENT
The first step in any causal analysis is to define the treatment of interest.
The aim of causal inference is to investigate what would happen to an
outcome variable following a potential intervention or manipulation.
A clearly defined treatment enables the identification of appropriate
comparisons, irrespective of the technical methods used to estimate the
effects of a cause on an outcome (Rubin, Stuart &Zanutto 2004).
POTENTIAL OUTCOMES
Causal inference has a fundamental problem (Rubin 1978): a unit (individual) is either treated (T_i = 1) or not (T_i = 0) but cannot be both. Its observed response is as follows:

y_i,obs = T_i × y_i(1) + (1 − T_i) × y_i(0)   eqn 1

y_i(1) and y_i(0) are potential outcomes, of which only one will effectively materialize. Either y_i(1) is observed and y_i(0) becomes the counterfactual, or y_i(0) is observed and y_i(1) becomes the counterfactual. The counterfactual model (Eqn 1, Fig. 2a) has two core characteristics (Rubin 1978): first, it defines a causal effect as a comparison of potential outcomes on a common set of units. The causal effect for a unit can be the difference y_i(1) − y_i(0), and the average causal effect is E[y(1) − y(0)]. Second, the counterfactual model stresses the importance of study design by insisting on the assignment mechanism. The assignment mechanism is the hypothetical or real rule that guided the decision whether to treat a unit (Fig. 2a). It describes which potential outcome is observed: y_i(1) or y_i(0). Causal inference in an observational study is a doubly missing data problem, with both the assignment mechanism and one of the potential outcomes missing (Fig. 2b).
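The bookkeeping behind Eqn 1 can be sketched in a few lines of Python. The data are simulated and purely illustrative (the latent ‘quality’ covariate, the −0·05 tag effect and the quality-driven assignment rule are all assumptions), but they show how a non-random assignment mechanism biases the naive treated-minus-control difference away from the true average causal effect:

```python
import math
import random

random.seed(42)

# Purely illustrative: each unit carries BOTH potential outcomes y(0), y(1),
# but only one is observed, selected by the (non-random) assignment T.
n = 10000
units = []
for _ in range(n):
    quality = random.gauss(0, 1)            # latent body condition (invented)
    y0 = 0.6 + 0.1 * quality                # outcome if left untagged
    y1 = y0 - 0.05                          # assumed true tag effect: -0.05
    # healthier-looking birds are more likely to be tagged (selection bias)
    t = 1 if random.random() < 1 / (1 + math.exp(-quality)) else 0
    y_obs = t * y1 + (1 - t) * y0           # Eqn 1
    units.append((t, y0, y1, y_obs))

n_treated = sum(t for t, _, _, _ in units)
true_att = sum(y1 - y0 for t, y0, y1, _ in units if t == 1) / n_treated
naive = (sum(y for t, _, _, y in units if t == 1) / n_treated
         - sum(y for t, _, _, y in units if t == 0) / (n - n_treated))

print(round(true_att, 3))   # -0.05 by construction
print(round(naive, 3))      # pushed upwards by selection on quality
```

The naive comparison of observed group means does not recover −0·05 because the tagged group was healthier to begin with.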
STABLE UNIT TREATMENT-VALUE ASSUMPTION
Data y_obs are assumed fixed and randomness stems from the assign-
ment mechanism: this is the Stable Unit Treatment-Value Assumption
(SUTVA, Gelman et al. 2003, p. 201). SUTVA entails (i) only one ver-
sion of the treatment and (ii) no interference between units: the poten-
tial outcome observed for a given unit is independent of the treatment
assignment for other units (Sekhon 2009). If SUTVA is violated, there
are more than two potential outcomes, which complicates the identifi-
cation of causes (Fig. S1). Instances where SUTVA does not hold are
beyond the scope of this study (see Chapter 6 of Gelman et al. 2003).
Potential outcomes stress the importance of time for causal infer-
ence: before exposure to a treatment, two outcomes are possible. The
familiar notation y_obs eclipses the assignment mechanism. Familiar
regression modelling implicitly relies on counterfactuals, but does not
necessarily correct for selection bias (Gelman & Hill 2007; King & Zeng 2007). Counterfactual outcomes may be predicted with this approach, but may also be very sensitive to modelling assumptions (see Appendix S1, Gelman & Hill 2007; King & Zeng 2007).
PROPENSITY SCORE
Designing an observational studymeans reconstructing the assignment
mechanism with a probability model for treatment Ti given covariates
(Xi). Covariates are any variables unaffected by instrumentation and
include pretreatment variables such as age or sex. If intermediate out-
comes are included, Simpson’s paradox may arise. Assuming that no
confounder is omitted:
e_i = Pr(T_i | X_i)   eqn 2
where ei is the propensity score or the probability of a unit receiving
treatment as a function of observed covariates (Rubin 2008).
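As a sketch of Eqn 2, propensity scores can be estimated with any binary regression of treatment on covariates. The paper itself uses a robit link with shrinkage priors; the plain logistic fit below, by batch gradient ascent on invented covariates (standardized mass and sex), only conveys the idea:

```python
import math
import random

random.seed(1)

# Invented data: tagging probability depends on standardized mass and sex.
def simulate(n=400):
    data = []
    for _ in range(n):
        mass, sex = random.gauss(0, 1), random.randint(0, 1)
        p = 1 / (1 + math.exp(-(-1.5 + 1.0 * mass + 0.5 * sex)))
        data.append(([1.0, mass, sex], 1 if random.random() < p else 0))
    return data

def fit_logistic(data, lr=0.5, epochs=1500):
    """Maximum likelihood by batch gradient ascent (no shrinkage prior)."""
    beta = [0.0, 0.0, 0.0]
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, t in data:
            p = 1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for j in range(3):
                grad[j] += (t - p) * x[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

data = simulate()
beta = fit_logistic(data)
# estimated propensity scores e_i = Pr(T_i | X_i)
scores = [1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, x))))
          for x, _ in data]
print(0.0 < min(scores) and max(scores) < 1.0)   # True: scores lie in (0, 1)
```

Units with estimated scores pushed towards 0 or 1 would be the ones for which no realistic counterfactual exists.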
MATCHING ON PROPENSITY SCORE
The propensity score is a balancing score, such that the (statistical) distribution of covariates for a given value of e_i is the same whether a unit received treatment or not (Rubin 2006). The propensity score is the coarsest many-to-one balancing score, meaning that covariate balance between control and treatment can be achieved by matching solely on e_i (Rubin 2007). Given a single value of e_i, a suitable control for a treated individual is simply an untreated one with a similar value of e_i.
While e_i is known in an RCT, e_i is missing in an observational study and must be estimated (ê_i). In biologging, animals that had no chance to be instrumented (e_i = 0) or were bound to be equipped (e_i = 1) cannot be used for causal inference. Realistic counterfactuals entail 0 < e_i < 1. For example, in burrow-nesting seabirds, nest accessibility (burrow depth) affects trap-ability. An additional complication stems
[Figure 1: flowchart — 1) Define Treatment; 2) Define Outcome(s); 3) SUTVA: Is the causal effect identifiable? What are the required assumptions?; 4) Estimate Propensity Scores; 5) Matching; 6) Placebo Tests; 7) Unveil Outcomes; 8) Estimate ATT. Untenable assumptions, no relevant background variable, no suitable match or always-positive placebo tests all lead to ‘data uninformative for causal inference’.]
Fig. 1. Flowchart for designing an observational study to mimic a randomized control trial. Dotted arrows symbolize feedback loops. ATT stands for ‘Average Treatment effect on the Treated’ and is the causal effect of interest.
from imperfect detection: some animals may be more trappable than
others, and such unobserved heterogeneity may give rise to Simpson’s
paradox.
PLACEBO TESTS
Propensity score matching mimics randomization after data collection,
but still assumes no hidden bias. Matching methods have the potential
to correct for selection bias, but may nevertheless fail. A painful, but
potentially valid, conclusionmay be that the data at hand are not infor-
mative on some relevant causal effects. In order to check that matching
has indeed corrected bias, one may test that no important variable is
omitted in Eqn 2 by comparing control and treated samples for a dif-
ference that should be null by design. Thus, the strong ignorability
assumption behind propensity score estimation may be checked with
placebo tests (Sekhon 2009). Positive placebos may suggest that further
assumptions are required, to which results may be sensitive (see Appen-
dix S1).
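A placebo test can be as simple as a likelihood-ratio (G) test that a pre-treatment proportion, such as detection the year before deployment, is equal between the matched groups. A minimal sketch, with invented detection counts (the paper's own counts are not reported here):

```python
import math

# Hypothetical placebo test: compare a PRE-treatment proportion between
# matched equipped and control samples. Counts below are invented.
def lrt_two_proportions(k1, n1, k2, n2):
    """Likelihood ratio (G) test for equality of two binomial proportions."""
    def loglik(k, n, p):
        if p <= 0 or p >= 1:
            # boundary estimate (k = 0 or k = n): likelihood is 1, loglik 0
            return 0.0 if k in (0, n) else float("-inf")
        return k * math.log(p) + (n - k) * math.log(1 - p)
    p_pool = (k1 + k2) / (n1 + n2)
    g = 2 * (loglik(k1, n1, k1 / n1) + loglik(k2, n2, k2 / n2)
             - loglik(k1, n1, p_pool) - loglik(k2, n2, p_pool))
    # chi-square survival function with 1 df: Pr(X > g) = erfc(sqrt(g / 2))
    p_value = math.erfc(math.sqrt(g / 2))
    return g, p_value

g, p = lrt_two_proportions(24, 29, 18, 29)   # invented detection counts
print(round(g, 2), round(p, 2))
```

A clearly non-null difference on such a by-design-null quantity would signal that the propensity score model omitted a relevant variable.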
OBSERVED OUTCOMES
One crucial aspect of an RCT is that outcomes are unavailable
when the study is implemented. To mimic this important feature,
outcomes of interest should be withheld until a suitable control
group is identified to avoid data snooping (Rubin 2008). Matching
does not require knowledge of the observed outcomes (see exam-
ples in Sekhon 2011).
CAUSAL EFFECT
Causal effects are average treatment effects on the treated (ATT):

ATT = E[y(1) − y(0) | T = 1] = E[y(1) | T = 1] − E[y(0) | T = 1]   eqn 3

With a suitable control group, a consistent estimate of the counterfactual E[y(0) | T = 1] is E[y(0) | T = 0]. Alternatively, a model can be used to predict counterfactuals ŷ_i,pred = y_i(0), which are then compared with observed values y_i,obs = y_i(1) (Rubin 1978; Gelman & Hill 2007).
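Once each treated unit has a matched control, Eqn 3 reduces to a mean of pairwise differences. A toy sketch with invented 0/1 outcomes:

```python
# Invented 0/1 outcomes (e.g. breeding success) for 10 matched pairs:
# treated birds supply y(1); their matched controls stand in for y(0).
treated_y = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
control_y = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1]

# ATT = E[y(1) | T = 1] - E[y(0) | T = 1], estimated over matched pairs
att = sum(t - c for t, c in zip(treated_y, control_y)) / len(treated_y)
print(att)   # 0.0: no estimated effect in this toy sample
```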
We will now illustrate propensity score matching to correct for selec-
tion bias with an investigation on the impact of tags on Scopoli’s shear-
waters (Calonectris diomedea). Managers and scientists may be
concerned whether instrumenting seabirds causes divorce or interferes
with breeding performance the following year. Divorce, a potentially
costly event (Choudhury 1995), is defined as a bird pairing with a new
partner (at time t + 1) in spite of its former mate (at time t) being simul-
taneously present and alive (Choudhury 1995).
Material
FIELD WORK
Deployments were carried out between mid-July and mid-September
2011 on Riou Island (43°10′34″ N, 5°23′10″ E), off Marseille, France.
Thirty-four GPS were deployed on one partner from 34 different active
nests of Scopoli’s shearwaters. Population size is estimated at 280–300
breeding pairs (Anselme & Durand 2012). Rats (but not cats) are present on the island, but subject to a regulation program. Breeding
activity was determined as part of the long-term demographic monitor-
ing program run since 1976 by the Conservatoire d’Espaces Naturels
de Provence-Alpes-Côte d’Azur (CEN-PACA).
Birds were caught inside their underground burrows, at night, after
chick feeding. GPS were attached to back feathers using Tesa® tape (Tesa s.a.s., Savigny le Temple, France). Total weight of a GPS was 20 g (4·0 cm × 2·2 cm × 0·8 cm), corresponding to 3·1% and 3·6% of average body mass for males and females, respectively. Equipped
birds were weighed with a spring scale at deployment. In addition to
GPS, time-depth recorders (TDR, ⌀ 8 mm × 11 mm, weighing 2·7 g) were attached with Tesa® tape on tail feathers on all but six birds. The average body mass of equipped males was 630 g (range 580–760 g) and 550 g (range 490–600 g) for equipped females. Out of the 34 deployed
GPS, 31 were subsequently recovered, usually within 4 days after at
[Figure 2: schematic of the Rubin Causal Model — for unit (bird) i, the assignment mechanism allocates treatment (Tag, T_i = 1) or control (No Tag, T_i = 0), each leading to the potential outcomes Divorce or No divorce: y_i(1) = 1, y_i(1) = 0, y_i(0) = 1, y_i(0) = 0.]
Fig. 2. The Rubin Causal Model or counterfactual model. (a) A randomized control trial detailing the two steps of (i) assigning units to either the control or treatment conditions before (ii) recording outcomes of causal interest. At the design stage, both potential outcomes are still possible. (b) An observational study: the missing red arrow emphasizes that the assignment mechanism is missing and must be inferred to construct a valid control group for causal inference.
least one foraging trip at sea (maximum: 4 trips). Upon recapture, GPS
and TDR were retrieved, and the tip of two primary feathers was
clipped for isotopic analyses. Subsequently, 21 (out of 31) birds were
equipped with a geolocator (GLS, ⌀ 8 mm × 35 mm, weighing 3·6 g)
mounted on a plastic ring. Handling time was kept ≤10 minutes.
Among the initial 34 equipped birds, three individuals were not metal-banded and were excluded from the analysis.
Instrumentation had no obvious short-term impacts on birds;
they all performed foraging trips and returned to their nesting bur-
rows at night. We did not assess short-term impacts because of
the following: (i) such studies already exist (Igual et al. 2005; Vil-
lard, Bonenfant & Bretagnolle 2011); and (ii) they are potentially
biased by inadequate choice of control birds. Also, (iii) using a
control group is both logistically and ethically challenging when
working on small, vulnerable populations because it doubles the
disturbance of animals. Finally, we already examined the data at
the end of the 2011 field season: no instrumented bird was a failed
breeder. Because outcomes for the 2012 breeding season (divorce,
breeding decision and success) were unknown, we could objectively
design an observational study.
From the CEN-PACA database, we extracted the life histories of all
shearwaters breeding on Riou Island in 2011. Sex was behaviourally
determined from calls. Most birds were ringed as adults, and only one
instrumented bird was ringed as a chick. Body masses correspond to
the average adult mass of birds across all resighting events before 2011.
Birds with missing sex or adult body mass information were excluded,
yielding a total of 183 birds with no missing information.
PROPENSITY SCORE ESTIMATION
To control for trap-ability, we modelled the propensity score (the probability of equipping a bird with tags) as:

robit(e_i) = Intercept + b_1 × Sex_i + b_2 × Ringed as Chick_i + b_3 × Mass_i + b_4 × Nb. Prev. Breedings_i + b_5 × Nb. Prev. Captures_i   eqn 4

Because instrumentation is a rare event, we used the cdf of a Student t distribution of location 0, scale 1·5484 and 7 degrees of freedom as a robust link function (Fig. S2; Liu 2004). Using data augmentation
(Albert & Chib 1993), we used shrinkage regression with a horseshoe
prior (Carvalho, Polson & Scott 2010) to achieve automatic variable
selection in propensity score estimation (Greenland 2008). Balance was
graphically assessed (see Table S1 and Fig. S3 for results without
shrinkage). Model fitting was performed with WINBUGS (Lunn et al.
2000) called from R (R Development Core Team 2012). Prior specifica-
tions are available as Supporting Information.
MATCHING
We matched individuals (without replacement) according to their estimated linear propensity scores [robit(e_i)] with the R package MATCHING (Sekhon 2011): we used Mahalanobis metric matching within a propensity score caliper (Rubin 2006). Caliper width was set to 1/4 of the standard deviation in estimated linear propensity scores (s_e) (Sekhon 2011). Only birds whose linear propensity score satisfied robit(e_t) − (s_e/4) ≤ robit(e_i) ≤ robit(e_t) + (s_e/4) were considered suitable matches for a bird equipped with tags (denoted with the subscript t). We excluded as potential match any partner of an equipped bird.
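The caliper rule above can be re-implemented in a few lines. This is a hypothetical greedy nearest-neighbour sketch with invented scores, not the actual analysis (which used the R package MATCHING); with a single score, the Mahalanobis metric reduces to an absolute difference:

```python
import statistics

# Invented linear propensity scores for tagged and untagged birds.
tagged   = {"t1": -0.2, "t2": 0.4, "t3": 1.8, "t4": 0.1}
untagged = {"c1": -0.15, "c2": 0.5, "c3": 0.35, "c4": -1.2, "c5": 0.05}

s_e = statistics.pstdev(list(tagged.values()) + list(untagged.values()))
caliper = s_e / 4   # 1/4 of the standard deviation of the scores

pairs = {}
available = dict(untagged)
# greedy matching without replacement, processing treated birds in fixed order
for t_id, t_score in sorted(tagged.items()):
    candidates = [(abs(c_score - t_score), c_id)
                  for c_id, c_score in available.items()
                  if abs(c_score - t_score) <= caliper]
    if candidates:
        _, best = min(candidates)   # nearest control within the caliper
        pairs[t_id] = best
        del available[best]         # without replacement

print(pairs)   # t3 (score 1.8) finds no control within the caliper
```

Birds such as t3, with no untagged bird inside the caliper, stay unmatched and drop out of the causal estimate, exactly as happened for four instrumented birds in the study.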
Initially, 29 of the 31 equipped birds could be matched. No matches
were found for the two individuals corresponding to the two rightmost
red strips on Fig. 3. In order to check that the matching procedure
worked, we performed a placebo test (Sekhon 2011). A placebo tests
for a causal effect that is null by definition (Fig. 1). We assessed
whether the probability of detecting a bird in 2010, that is the year prior
to tag deployment, was different between equipped and control birds.
This placebo revealed that equipped birds were more likely to be detected in 2010 than control birds (difference in proportions: 0·20 ± 0·13, Likelihood Ratio Test = 2·4, P = 0·12). Although not statistically significant at the 5% level, this result suggested a biased sample due to (unobserved) trap-ability.
To remedy this, we required equipped birds that were not detected in 2012 to be matched with control birds that were also not detected in 2012, and likewise for detected individuals. Detection in 2012 was not our out-
come of interest. Moreover, this covariate did not enter the estimation
of propensity scores since, when birds were instrumented in 2011, it was
impossible to tell whether they would be detected in 2012. We think that conditioning on detection in 2012 is adequate because (i) it is not one of our outcomes of interest; and (ii) the monitoring of the Riou population is performed by a dedicated field team independent of our research team. Any conscious or unconscious bias that we may have had in looking for previously equipped birds did not affect data collection in 2012.
With this constraint, 27 equipped birds were matched (lower panel
of Fig. 3). The placebo test revealed no obvious bias (difference in
proportions: 0·12 ± 0·14, Likelihood Ratio Test = 0·8, P = 0·37). Covariate balance between equipped and control samples was satisfactory (Fig. 4).
OBSERVED OUTCOMES
With this suitable control group, we addressed two causal questions:
1. Does instrumentation affect the breeding performance of a bird the
following year?
2. Does instrumentation cause a bird to change partner the following
year?
Imperfect detectability of individuals remains an issue. We used a
multistate capture–recapture model to predict counterfactuals. Our
sample consists of 54 birds that were alive in 2011. Their life histories
spanned 2004–2012. We assumed that all birds survived in 2012.
Because survival is perfect until 2011 by design, death can only occur
in 2012, but is confounded by imperfect detection. For any year, a
bird could be either (i) nonbreeding, (ii) a failed breeder, (iii) a suc-
cessful breeder or (iv) not seen. There are thus three different states
and nine possible transitions (Fig. S4). A bird was considered a
successful breeder if its chick fledged, a failed breeder if it failed to
do so after laying an egg. Birds caught on the colony, but for
which no egg or chick was found in the nest, were assumed to be
nonbreeding.
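The state coding can be sketched by tallying year-to-year transitions among the three states from life-history strings. The histories below are invented; ‘0’ marks ‘not seen’, which is an observation rather than a state:

```python
from collections import Counter

# Three true states: N = nonbreeding, F = failed breeder, S = successful
# breeder; '0' = not seen. Life histories below are invented.
states = ("N", "F", "S")
histories = [
    "NSS0S",    # bird detected in 4 of 5 years
    "FSSSN",
    "0NFSS",
]

transitions = Counter()
for h in histories:
    seen = [(year, s) for year, s in enumerate(h) if s != "0"]
    for (y1, s1), (y2, s2) in zip(seen, seen[1:]):
        if y2 == y1 + 1:            # only count year-to-year transitions
            transitions[(s1, s2)] += 1

# 3 states -> at most 9 distinct transition types (Fig. S4)
print(len(states) ** 2)             # 9
print(transitions[("S", "S")])      # 4 in these toy histories
```

In the real model, gaps caused by non-detection are not discarded as here but handled probabilistically through the detection process.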
We deleted all observations in 2012 for equipped birds to predict them from the model. These predictions correspond to what would have been observed if these birds had not been equipped. We compared predicted (ŷ_i,pred = y_i(0)) and observed (y_i,obs = y_i(1)) values with both χ² and likelihood ratio tests:

P_value = Pr(χ²_pred > χ²_obs)   eqn 5

A Bayesian P_value (Gelman, Meng & Stern 1996) close to 0·5 reflects no causal effect of instrumentation on breeding performance the following year: observed data are similar to predicted counterfactuals. A P_value close to either 0 or 1 betrays model misfit, suggesting a causal effect.
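Eqn 5 is a posterior predictive check. A minimal sketch, using an invented binomial ‘breeding success’ model with a uniform prior rather than the paper's multistate model:

```python
import random

random.seed(7)

# Posterior predictive (Bayesian) P-value: for each posterior draw of the
# parameter, simulate replicated data and compare discrepancies. Here the
# 'model' is a simple binomial breeding success; counts are invented.
k_obs, n = 18, 27            # e.g. 18 of 27 birds bred successfully

def chisq(k, n, p):
    """Chi-square discrepancy of a binomial count against its expectation."""
    e1, e0 = n * p, n * (1 - p)
    return (k - e1) ** 2 / e1 + ((n - k) - e0) ** 2 / e0

draws = 4000
exceed = 0
for _ in range(draws):
    # posterior draw from Beta(1 + k_obs, 1 + n - k_obs) (uniform prior)
    p = random.betavariate(1 + k_obs, 1 + n - k_obs)
    k_rep = sum(1 for _ in range(n) if random.random() < p)
    if chisq(k_rep, n, p) > chisq(k_obs, n, p):
        exceed += 1

p_value = exceed / draws
print(round(p_value, 2))     # a value near 0 or 1 would betray misfit
```

Because the simulated model matches the data-generating process, the P-value lands in the unremarkable middle of (0, 1), the pattern reported for the shearwater analysis.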
We checked the Goodness-of-fit (GOF) of the multistate model with U-CARE (Choquet et al. 2009), which was adequate (global test, GOF = 9·4, P = 0·97, df = 21). Model fitting was performed with WINBUGS (Lunn et al. 2000) called from R (R Development Core Team 2012). Prior specifications are available as Supporting Information.
Finally, we compared the proportion of control birds which changed
partner in 2012 to that of equipped birds.
Results
There was no causal effect of instrumenting a bird with tags on
its breeding performance the following year. Results from the
multistate capture–recapture model are summarized in
Tables S2, S3 and Fig. S6. Bayesian P_values were 0·7 and 0·5 for the χ² and likelihood ratio test, respectively: observed outcomes were not different from their predicted counterfactuals.
Among the 13 control birds with a known partner in 2012,
two changed partners compared with 2011. Among the 12
equipped birds with a known partner in 2012, none changed
partners compared with 2011. In the latter case, the zero
numerator is problematic for classical inference (Winkler,
Smith & Fryback 2002) but informative priors offer a solution
(Seaman, Seaman & Stamey 2012). Using data from Swatschek, Ristow & Wink (1994) from a colony of Scopoli’s shearwaters in Crete, we elicited an informative prior. Divorce rate in this Cretan population was between 3·6% (perfect detection scenario) and 18·8% (conservative scenario). In contrast, divorce rate on Lavezzi Island, Corsica, where black rats were present, was 23·1% (Thibault 1994). Black rats also occur on Riou, yet we could not determine whether they were also present in the study population of Swatschek, Ristow & Wink (1994). We elicited an informative Beta prior by matching its first quartile with the value 3·6% and its third quartile with the value 18·8%. The resulting prior is an informative Beta(0·86, 5·77) distribution (Fig. 5) with an effective sample size of 7 (0·86 + 5·77).
Posterior mean divorce rates among equipped birds and control birds were, respectively, 0·03 (95% credible interval 0·00–0·17) and 0·13 (0·03–0·33) (means are bracketed by a 95% credible interval following Louis & Zeger 2009). The difference was −0·09 (−0·29 to 0·07), indicating no causal effect of instrumentation on divorce rate the following year (Fig. 5).
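The quartile-matching elicitation can be checked by Monte Carlo with the standard library alone. Assuming the Beta(0·86, 5·77) prior above, its empirical quartiles should fall near the target divorce rates of 3·6% and 18·8%:

```python
import random

random.seed(2013)

# Monte Carlo check of the elicited prior: draw from Beta(0.86, 5.77)
# and compare empirical quartiles with the elicitation targets.
n = 100_000
samples = sorted(random.betavariate(0.86, 5.77) for _ in range(n))
q1, q3 = samples[n // 4], samples[3 * n // 4]

print(round(q1, 3), round(q3, 3))   # should sit near 0.036 and 0.188
print(round(0.86 + 5.77))           # effective sample size, ~7
```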
Discussion
After explicitly correcting for selection bias, we found no effect
on mate fidelity of instrumenting Scopoli’s shearwaters from
Riou Island with tags for 3–10 days during the chick-rearing
period. Taking into account imperfect detection probability,
we found no effect on breeding performance one year after
instrumentation either.
[Figure 3: paired histograms of linear propensity scores (x-axis, roughly −4 to 2) against frequency, for instrumented (top) and not-instrumented/control (bottom) birds.]
Fig. 3. Results from the robit regression for propensity score estimation of the 183 breeding Scopoli’s shearwaters detected in 2011 on Riou Island. Red strips correspond to birds that were instrumented in 2011 and light-coloured strips to birds that were not tagged. Individuals represented as light-coloured bands on the leftmost part correspond to birds breeding in 2011 that had an extremely small probability of being equipped and for which there is no similar equipped bird. Likewise, the four rightmost red bands correspond to equipped birds that had the largest probability of being equipped and for which no matches were available (lower panel).
[Figure 4: Tukey plots of standardized covariates (panels: Sex, Chick, Mass, Prev. Breed, Prev. Obs and Prop. score) for the All, Tag and Control groups.]
Fig. 4. One-to-one matching with Mahalanobis metric within propensity score caliper (Rubin 2006). Covariate balance is illustrated by means of Tukey plots for the 183 breeding birds in 2011, the 27 equipped birds and the corresponding 27 identified controls. The point represents the median, and the thick line the interquartile range. Thin-lined fences were computed as in Dümbgen & Riedwyl (2007) to illustrate asymmetry. Covariates were standardized.
ASSUMPTIONS AND LIMITS
Four instrumented birds could not be matched (Fig. 3): our
causal estimate does not cover the whole possible range of
observations, even if none of these four birds divorced. We also assumed that GPS instrumentation, not geolocators or clipping two feather tips, is the sole differential treatment, which conforms with the Stable Unit Treatment-Value Assumption (SUTVA). Without SUTVA, there are more than two potential outcomes, which complicates the identification of causes. We deployed several tags on single individuals, a potential SUTVA violation. Because all tags were externally attached and the GPS, which was systematically fitted, was the largest device, we assumed that SUTVA holds. Our study is not atypical with respect to other published ones. Our estimated ATT
was imprecise because of small sample size, another character-
istic of biologging (Hebblewhite & Haydon 2010). To increase
precision, many-to-one propensity score matching may be
used, although it may also cause further attrition in the sample
if k matches are unavailable for each treated unit. Matching
with replacement is another possibility (Sekhon 2011), but
beyond the scope of the present study.
Both Igual et al. (2005) and Villard, Bonenfant &
Bretagnolle (2011) investigated the impact of instrumentation
on Scopoli’s shearwaters. Their causal effect was the impact of
instrumenting at least one mate of a breeding pair with tags.
SUTVA does not hold since the probability of equipping a bird may depend on whether its mate was instrumented or not (Fig. S1A). As John Tukey famously declared, ‘an approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem’. We should
nevertheless strive to provide precise answers to ethics commit-
tees andmanagers for them tomake the best possible decisions
(Wilson&McMahon 2006).
Biologging studies can have several goals (studying foraging and the impact of tags), raising the possibility that none of them can be attained satisfactorily. Learning about the effects of tags and about the foraging ecology of animals simultaneously may not be possible with the same data. A neat distinction between the numerical representation of a phenomenon (the ATT) and its empirical representation (a bird seen with a different mate following its instrumentation) is essential for a fruitful discussion between scientists and managers. Defining the aim of the study and the causal effect of interest before examining the data is paramount. To further guarantee objectivity, outcomes of interest must be kept hidden from the analyst until a suitable control group has been found. Designing observational studies as if they were randomized controlled trials is important for the credibility of researchers with ethics committees and wildlife managers.
DESIGN VERSUS ANALYSIS
Scientists deploy expensive telemetric tags to collect data on the ecology and physiology of wild animals in their natural environment. The sample of equipped animals may be unconsciously biased towards good-quality or easily recapturable individuals. Valid inferences may be drawn from this sample, but extrapolation to the larger population involves additional assumptions (Gelman & Hill 2007). Instrumentation is a rare event, concerning a potentially nonrepresentative fraction of the population. In our study of Scopoli's shearwaters, the dual goals of estimating representative demographic rates and estimating a causal effect were not attainable because of selection bias. The low precision of the demographic estimates (Tables S2 and S3) makes them of little use. Suppose, for example, that in a capture–recapture study <5% of animals were instrumented. A multistate model is fitted to the observed data, with an indicator variable (or a stratum) for instrumented individuals. Suppose further that the model is deemed acceptable if it accommodates 95% of the life histories. If the 5% of misfits are precisely the instrumented animals, the estimated vital rates are still reasonable, but it is risky to give a causal interpretation to the regression coefficient for instrumentation because the model does not provide an adequate fit to these animals.
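The precision cost of such a rare treatment can be shown with a back-of-the-envelope calculation. The numbers below are hypothetical (a cohort of 200 birds with a divorce rate of 0.15 in both groups), not taken from the study:

```python
import math

def wald_se_diff(p1, n1, p0, n0):
    """Wald standard error of a difference between two proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)

# 5% instrumented (10 vs. 190) versus a balanced design (100 vs. 100):
se_rare = wald_se_diff(0.15, 10, 0.15, 190)
se_balanced = wald_se_diff(0.15, 100, 0.15, 100)
print(round(se_rare, 3), round(se_balanced, 3))
```

With only 10 instrumented animals, the standard error of the estimated tag effect is more than twice that of the balanced design, so even a sizeable effect of instrumentation would be poorly resolved.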
Our aim with capture–recapture modelling was to account for imperfect detection probability among instrumentable birds. Because the estimated causal effect concerns 'instrumentable' animals, we cannot generalize results to the Riou population, nor determine the causal effect of instrumentation on a 'typical' individual without first defining 'typical'.
ETHICAL IMPLICATIONS
Fig. 5. Posterior distributions of the divorce rate for control and equipped birds. The informative prior used in the analysis is depicted in grey. Points symbolize the median, thick lines a 50% credible interval and thin lines a 95% credible interval.

The ethical issue raised by our work is whether it is worth assessing the causal effect of instrumentation by sampling 'control' individuals, when 'control' is a strong and untestable assumption. Our study highlights the necessity of finding a suitable control group before collecting punctual data (data that are not part of a systematic monitoring effort) on random individuals if the aim is to test for instrumentation effects. One must explicitly spell out an assignment mechanism before carrying out the study (for example, tossing a fair coin). In the case of punctual instrumentation within a long-term monitoring study, background characteristics can also be used prior to deployment to define a set of similar individuals which will either be equipped or serve as controls. Causal inference is then straightforward because the assignment mechanism is specified a priori. A power analysis should also be carried out to assess whether meaningful effects can be detected given the planned number of tag deployments (Igual et al. 2005).
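Such a prospective power analysis can be sketched with a normal approximation for a two-group comparison of proportions. The effect size below (tagging raising the divorce rate from 0.10 to 0.20) is purely hypothetical, chosen only to show how power scales with the planned number of deployments:

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_proportions(p0, p1, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided z-test for a difference
    in proportions, with equal group sizes."""
    se = math.sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / n_per_group)
    return norm_cdf(abs(p1 - p0) / se - z_crit)

# Power for 20, 50 and 200 birds per group:
for n in (20, 50, 200):
    print(n, round(power_two_proportions(0.10, 0.20, n), 2))
```

With the small group sizes typical of biologging studies (tens of birds per group), power to detect even a doubling of the divorce rate stays well below the conventional 0.8 target, which argues for planning sample sizes before deployment rather than hoping to rescue the comparison afterwards.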
The scope for correcting selection bias after the experiment is limited without detailed knowledge of the animals' background characteristics. Propensity score methods are intrinsically a posteriori: they may only be useful within long-term studies. In the case of a punctual biologging study on a population of unknown characteristics, propensity scores cannot be used to reconstruct the assignment mechanism. Collecting data on wild animals must be scientifically and ethically justified. Collecting data on a random control group may be unjustified when causal inference is not guaranteed: an ill-defined control group may imply unnecessary disturbance of particularly vulnerable individuals. Randomly sampling a control group may nonetheless be useful to control for large detrimental effects which may occur during fieldwork (a significant increase in trip duration or mass loss, for instance). Fortunately, with the miniaturization of data loggers, large effects are becoming less likely.
Conclusion
We detailed how to assess the causal impact of biologging on instrumented animals by trying to recover a posteriori a suitable control group. The grim picture painted by the limited research on the impact of biologging (Vandenabeele, Wilson & Grogan 2011) may partly result from the lack of guidelines for identifying meaningful control groups. Vandenabeele, Wilson & Grogan (2011) and Wilson & McMahon (2006) briefly mentioned this issue but offered no guidelines. Our incremental contribution is to suggest existing methods to fill that gap.

Fig. 1 details how to design an observational study to explicitly assess the impact of biologging on animals. Propensity score matching is, however, not a panacea: it assumes no hidden bias, and it cannot easily be used in catch-22 cases such as the study of foraging efficiency or heart-rate frequency, where different modelling approaches (hydrodynamics, flight mechanics) may be more appropriate (Hazekamp, Mayer & Osinga 2010). A pluralistic approach is clearly needed, within which the counterfactual model should be seriously considered.
Acknowledgements
The long-term monitoring study of Scopoli's shearwaters on Riou Island is approved by the Centre de Recherches par le Baguage des Populations d'Oiseaux (CRBPO, Paris). Access to protected areas and tag deployments were approved by the ethics board of the Conservatoire d'Espace Naturels de Provence-Alpes-Cote d'Azur. Bird instrumentation was carried out under personal animal experimentation permits #34-369 (D. Grémillet) and #34-505 (C. Péron) delivered by the Direction Départementale de la Protection des Populations. We thank CEN-PACA staff in charge of the long-term demographic monitoring of Scopoli's shearwaters on Riou Island: Jean Patrick Durand, Célia Pastorelli, Nicolas Bazin, Timothée Cuchet and Lorraine Anselme. We thank Pierrick Giraudet and Léo Martin for tag deployment, Emmanuelle Cam for the multistate models' goodness-of-fit tests and Olivier Gimenez for the multistate capture–recapture model BUGS code. Emmanuelle Cam, Olivier Gimenez and Christophe Barbraud offered suggestions on an early version of the manuscript. We thank Jarrod Hadfield and two anonymous reviewers for helpful and constructive comments. The authors declare no conflict of interest.
References
Albert, J. & Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.
Anselme, L. & Durand, J. (2012) The Cory's Shearwater Calonectris diomedea diomedea, updated state of knowledge and conservation of the nesting populations of the small Mediterranean Islands. Monography Initiative PIM, Conservatoire d'Espaces Naturels de Provence Alpes Cotes d'Azur.
Arah, O. (2008) The role of causal reasoning in understanding Simpson's Paradox, Lord's Paradox, and the suppression effect: covariate selection in the analysis of observational studies. Emerging Themes in Epidemiology, 5, 5.
Austin, P. (2011) An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399–424.
Barron, D., Brawn, J. & Weatherhead, P. (2010) Meta-analysis of transmitter effects on avian behaviour and ecology. Methods in Ecology and Evolution, 1, 180–187.
Carvalho, C., Polson, N. & Scott, J. (2010) The horseshoe estimator for sparse signals. Biometrika, 97, 465–480.
Casper, R. (2009) Guidelines for the instrumentation of wild birds and mammals. Animal Behaviour, 78, 1477–1483.
Choquet, R., Lebreton, J., Gimenez, O., Reboulet, A. & Pradel, R. (2009) U-CARE: utilities for performing goodness-of-fit tests and manipulating CApture–REcapture data. Ecography, 32, 1071–1074.
Choudhury, S. (1995) Divorce in birds: a review of the hypotheses. Animal Behaviour, 50, 413–429.
Dümbgen, L. & Riedwyl, H. (2007) On fences and asymmetry in box-and-whiskers plots. American Statistician, 61, 356–359.
Gelman, A. & Hill, J. (2007) Data Analysis Using Regression and Multilevel/Hierarchical Models, 1st edn. Cambridge University Press, Cambridge, UK.
Gelman, A., Meng, X.L. & Stern, H. (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–807.
Gelman, A., Carlin, J., Stern, H. & Rubin, D. (2003) Bayesian Data Analysis, 2nd edn. Chapman & Hall/CRC, Boca Raton, Florida, USA.
Greenland, S. (2008) Invited commentary: variable selection versus shrinkage in the control of multiple confounders. American Journal of Epidemiology, 167, 523–529.
Grémillet, D., Kuntz, G., Woakes, A.J., Gilbert, C., Robin, J., Le Maho, Y. & Butler, P. (2005) Year-round recordings of behavioural and physiological parameters reveal the survival strategy of a poorly insulated diving endotherm during the arctic winter. Journal of Experimental Biology, 208, 4231–4241.
Hawkins, P. (2004) Bio-logging and animal welfare: practical refinements. Memoirs of the National Institute for Polar Research, 58, 58–68.
Hazekamp, A., Mayer, R. & Osinga, N. (2010) Flow simulation along a seal: the impact of an external device. European Journal of Wildlife Research, 56, 131–140.
Hebblewhite, M. & Haydon, D. (2010) Distinguishing technology from biology: a critical review of the use of GPS telemetry data in ecology. Philosophical Transactions of the Royal Society of London Series B, 365, 2303–2312.
Holland, P. & Rubin, D. (1983) On Lord's Paradox. Principals of Modern Psychological Measurement: A Festschrift for Frederic M. Lord, pp. 3–26. Lawrence Erlbaum Associates Inc., Hillsdale, New Jersey.
Igual, J., Forero, M., Tavecchia, G., González-Solís, J., Martínez Abraín, A., Hobson, K., Ruiz, A. & Oro, D. (2005) Short-term effects of data-loggers on Cory's Shearwater (Calonectris diomedea). Marine Biology, 146, 619–624.
King, G. & Zeng, L. (2007) When can history be our guide? The pitfalls of counterfactual inference. International Studies Quarterly, 51, 183–210.
Liu, C. (2004) Robit regression: a simple robust alternative to logistic and probit regression. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, chapter 21, pp. 227–238. John Wiley and Sons Ltd, New York.
Lord, F. (1967) A paradox in the interpretation of group comparisons. Psychological Bulletin, 68, 304–305.
Louis, T. & Zeger, S. (2009) Effective communication of standard error and confidence interval. Biostatistics, 10, 1–2.
Lunn, D., Thomas, A., Best, N. & Spiegelhalter, D. (2000) WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
McMahon, C., Field, I., Bradshaw, C., White, G. & Hindell, M. (2008) Tracking and data-logging devices attached to elephant seals do not affect individual mass gain or survival. Journal of Experimental Marine Biology and Ecology, 360, 71–77.
Phillips, R., Xavier, J. & Croxall, J. (2003) Effects of satellite transmitters on albatrosses and petrels. Auk, 120, 1082–1090.
R Development Core Team (2012) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Ropert-Coudert, Y., Kato, A., Grémillet, D. & Crenner, F. (2012) Biologging: recording the ecophysiology and behaviour of animals moving freely in their environment. Sensors for Ecology: Towards Integrated Knowledge of Ecosystems, chapter 1, pp. 17–42. CNRS. ISBN 978-2-9541683-0-2.
Rubin, D. (1978) Bayesian inference for causal effects: the role of randomization. The Annals of Statistics, 6, 34–58.
Rubin, D. (2006) Matched Sampling for Causal Effects, 1st edn. Cambridge University Press, New York, NY, USA.
Rubin, D. (2007) The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in Medicine, 26, 20–36.
Rubin, D. (2008) For objective causal inference, design trumps analysis. Annals of Applied Statistics, 2, 808–840.
Rubin, D., Stuart, E. & Zanutto, E. (2004) A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29, 103–116.
Rutz, C. & Hays, G. (2009) New frontiers in biologging science. Biology Letters, 5, 289–292.
Sagarin, R. & Pauchard, A. (2010) Observational approaches in ecology open new ground in a changing world. Frontiers in Ecology and the Environment, 8, 379–386.
Seaman III, J., Seaman Jr, J. & Stamey, J. (2012) Hidden dangers of specifying noninformative priors. The American Statistician, 66, 77–84.
Sekhon, J. (2009) Opiates for the matches: matching methods for causal inference. Annual Review of Political Science, 12, 487–508.
Sekhon, J. (2011) Multivariate and propensity score matching software with automated balance optimization: the Matching package for R. Journal of Statistical Software, 42, 1–52.
Swatschek, I., Ristow, D. & Wink, M. (1994) Mate fidelity and parentage in Cory's Shearwater Calonectris diomedea – field studies and DNA fingerprinting. Molecular Ecology, 3, 259–262.
Thibault, J. (1994) Nest-site tenacity and mate fidelity in relation to breeding success in Cory's Shearwater Calonectris diomedea. Bird Study, 41, 25–28.
Tversky, A. & Kahneman, D. (1971) Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
Vandenabeele, S., Wilson, R. & Grogan, A. (2011) Tags on seabirds: how seriously are instrument-induced behaviours considered? Animal Welfare, 20, 559–571.
Villard, P., Bonenfant, C. & Bretagnolle, V. (2011) Effects of satellite transmitters fitted to breeding Cory's Shearwaters. The Journal of Wildlife Management, 75, 709–714.
Wilson, R. & McMahon, C. (2006) Measuring devices on wild animals: what constitutes acceptable practice? Frontiers in Ecology and the Environment, 4, 147–154.
Wilson, R., Grant, W. & Duffy, D. (1986) Recording devices on free-ranging marine animals: does measurement affect foraging performance? Ecology, 67, 1091–1093.
Wilson, R., Grémillet, D., Syder, J., Kierspel, M., Garthe, S., Weimerskirch, H., Schäfer-Neth, C., Scolaro, J., Bost, C.A., Plötz, J. & Nel, D. (2002) Remote-sensing systems and seabirds: their use, abuse and potential for measuring marine environmental variables. Marine Ecology Progress Series, 228, 241–261.
Winkler, R., Smith, J. & Fryback, D. (2002) The role of informative priors in zero-numerator problems: being conservative versus being candid. The American Statistician, 56, 1–4.
Received 29 April 2013; accepted 22 May 2013
Handling Editor: Dr Jarrod Hadfield
Supporting Information
Additional Supporting Information may be found in the online version
of this article.
Table S1. Estimated regression coefficients for the propensity score
model.
Table S2. Estimated transition and detection probabilities from the multistate capture–recapture models (propensity scores estimated with shrinkage).
Table S3. Estimated transition and detection probabilities from the multistate capture–recapture models (propensity scores estimated without shrinkage).
Fig. S1. Violations of the Stable Unit Treatment Value Assumption (SUTVA).
Fig. S2. Cumulative distribution function (CDF) of a standard logistic distribution and a Student-t distribution with 7 degrees of freedom and scale set to 1.5484.
Fig. S3. Graphical display of covariate balance after matching on propensity scores (estimated without shrinkage).
Fig. S4. Graphical representation of the multistate capture–recapture model used to estimate counterfactual outcomes for equipped birds.
Fig. S5. Graphical representation of the Student-t priors with location 0, scale 10 and 7 df on a logit scale used for detection probabilities p.
Fig. S6. Comparison between predicted and observed breeding performance in 2012 for birds equipped with tags in 2011.
Appendix S1. Simpson's and Lord's Paradoxes.
Data S1. Data and R code to reproduce the analyses.