Effects of differing statistical methodologies on inferences about
Earth-like exoplanet populations
John Owens
Abstract
Sophisticated statistical methodologies have become vital to the study of exoplanet formation and detection.
As this field continues to grow and a wider range of methodologies is utilized, the question of the
appropriateness of different techniques becomes more pressing. The following comparisons of different
methods of searching planet-transit catalogs to determine the planet occurrence rate, and of their
corresponding results, illustrate when each statistical method is and is not appropriate.
I give explanations of each method, the assumptions each author made in their studies, and the differences
in the resulting values for the occurrence rate. It appears that using the inverse-detection-efficiency method
results in consistently higher occurrence rates than using the likelihood method.
1 Introduction and background
Exoplanets have been the subject of great study in the twentieth and twenty-first centuries. Several
studies have published various data about different features of exoplanetary systems, the most prevalent
being the occurrence rate of planets in these systems. The Kepler mission, launched in 2009 by the National
Aeronautics and Space Administration, provided astronomers with an extensive catalog of planets that
transit (cross in front of) their host stars (Petigura et al. 2013).
When planets transit, they dim the brightness of their star (the signal of the star is reduced when
observed). The magnitude of the change in visual brightness depends on the relative sizes of the star and
the planet. Since the size of the star should already be known, it is straightforward to determine the size
of the planet from this relation. Planets are most easily detectable when they are relatively large and have
small orbital periods (they orbit close to their host stars). In order to detect a planet using this method, the
orbit of the planet must be “edge-on” to an observer on Earth. In other words, the orbit must be positioned
such that an observer could view the planet crossing in front of the star. The probability of an edge-on orbit
is the ratio of the diameter of the star to the diameter of the orbit. This value is equal to about 0.5% for
an Earth-size planet orbiting a Sun-size star at Earth’s orbital distance. When creating pipelines to detect
planet transits from the Kepler data, astronomers inject synthetic (false) planet signals into the data and
measure the fraction that the pipeline recovers. Using this recovery fraction and the probability of planet
transit, one can determine the efficiency of a pipeline to detect planet transits.
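As a rough check of the 0.5% figure, the following worked estimate assumes a circular orbit at 1 AU and uses the standard values R_⊙ ≈ 6.96 × 10^5 km and 1 AU ≈ 1.496 × 10^8 km, which are not stated in the text:

```latex
\[
  p_{\mathrm{transit}} \approx \frac{2R_\star}{2a} = \frac{R_\odot}{1\,\mathrm{AU}}
  \approx \frac{6.96\times10^{5}\,\mathrm{km}}{1.496\times10^{8}\,\mathrm{km}}
  \approx 4.7\times10^{-3} \approx 0.5\%.
\]
```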
The occurrence rate is defined as the fraction of stars having a planet within a specified range of
parameters (Petigura et al. 2013). This value is significant because it allows us to predict the probability
of a system having a planet given certain parameters. One of the most intriguing sets of parameters that
can be used with occurrence rate data is those of our Solar System. When applied to these calculations, we
can predict the occurrence rate of Earth-like planets orbiting Sun-like stars which allows for further research
into potential habitability.
The goal of the Kepler mission is to detect transiting planets in a large domain of continuously-
observed stars. The Kepler telescope maintains focus on a very large field of view (105 square degrees).
Kepler views the same field for at least 3.5 years, a time that can be extended to allow for the detection
of smaller and more distant planets. The transit data gathered by Kepler is used to determine planet
occurrence rates and other statistics. This information was found on NASA’s Kepler homepage.
Previous studies that use the Kepler data claim that the most common small planets are those
approaching Earth-size but that orbit close to their host stars. The studies I examine here extend the planet
survey to those that are Earth-size and orbit at a distance such that they receive a similar intensity of light
energy as Earth.
Many studies have been done to calculate the occurrence rate of planets both Earth-like and
otherwise, some of which will be the focus of this paper. I will compare several different methods used to
calculate the occurrence rate and the effects of different methods on resulting values. My primary points
of concern will be on key assumptions made during calculations that differ between authors. The term
statistical methodologies refers to the methods used to analyze or represent data. The usage of multiple
statistical methodologies to determine the planet occurrence rate poses a unique issue. In theory, many
methods may be used to arrive at the same result. The challenge, however, comes from the fact that no
two methodologies will produce the exact same result in practice for distributions of this complexity. In the
presence of noisy data produced by a complicated instrument and analysis, statistical inference offers more
choices of methods to determine the same statistics. The purpose of this paper is not to determine the “best”
methodology for determining the occurrence rate, but rather comment on the effects of the differences in
existing methodologies.
(Petigura et al. 2013) use their own TERRA software package to search for transiting planets in
the National Aeronautics and Space Administration’s (NASA’s) Kepler mission data. TERRA first accounts
for systematic error common to many stars, outliers in the data, and variability on long timescales.
The software then searches for planet transit signals by evaluating the signal-to-noise ratio at locations of
prospective transits in time. The signal-to-noise ratio compares the power of a desired signal to the power
of the background noise. It is significant to note that they correct for candidates missed by TERRA by
including the probability that the orbital plane of a planet would not be conducive to transit detection, and
by injecting “transit-like synthetic dimmings” into real Kepler photometry. Petigura et al. calculate their
occurrence rate by first creating a grid of planets sorted by orbital period, P, and planet radius, Rp. Both of
these values can be measured from the Kepler photometry. They then count the number of detected planets
in each cell and compute the occurrence rate for that (P, Rp) cell, as well as the distribution of planet sizes and orbital periods.
(Foreman-Mackey et al. 2014) use a hierarchical Bayesian probabilistic inference framework to calculate
the occurrence rate density (see subsection 3.1 on Bayesian inference). They apply this method to the catalog
of Earth-like planet candidates determined by the TERRA pipeline. The steps they take begin with creating
a likelihood function and then calculating the detection efficiency in the same bins used by Petigura et al.
The likelihood function uses a set of model parameters to describe the probability of observing a specific
data set. Finally, they constrain the occurrence rate density of Earth-like planets by evaluating occurrence
rate at the location of Earth on the (P,Rp) grid.
(Dressing & Charbonneau 2015) perform similar calculations to Petigura et al. and Foreman-Mackey
et al. but limit their scope to small planets orbiting small stars. It could be argued that focusing on smaller
systems results in more accurate constraints for solar-like systems. However, computing the occurrence rate
in this setting is complicated by the fact that it is more difficult to measure the parameters of low-mass stars
than the parameters of Sun-like stars. Dressing et al.’s 2015 paper is written to combine improved methods
of calculating the occurrence rate that were published after their 2013 paper (Dressing & Charbonneau 2013).
Thus, they make different assumptions than Petigura et al., Foreman-Mackey et al., and their own
previous paper. Their calculations of the occurrence rate generally resemble those used by (Foreman-Mackey
et al. 2014).
I will examine the calculations of the occurrence rate performed by Petigura et al., Foreman-Mackey
et al., and Dressing et al. through several lenses:
1. The dependence on orbital period, planet radius, and habitable zone.
2. The dependence of the results on the inverse-detection efficiency method versus the likelihood function
method.
3. Differences in survey completeness.
4. Effects of differing assumptions made by the authors.
5. Possible bias from differing intents of the papers.
6. Effects on other statistics in these studies.
7. Possible ramifications for future studies.
It is first necessary to explain the differences in the calculated occurrence rates and their dependencies
on certain physical parameters. These provide context for the possible explanations of those
differences presented later.
2 Calculated occurrence rates and dependencies
Each author provides their calculated occurrence rate in a different form. Petigura et al. give several
occurrence rates for different domains of Rp and P, focusing primarily on Earth-size planets.
Foreman-Mackey et al. give a single value for “Earth analogs”. The data they gather comes from directly applying
their inference method to the catalog of small exoplanet candidates orbiting Sun-like stars published by
Petigura et al. They define Earth analogs as planet candidates that have orbital periods and radii very similar
to those of Earth (Foreman-Mackey et al. 2014). The value that they give is for what they term “occurrence
rate density”, not simply the occurrence rate. Dressing et al. present detailed tables of occurrence rates for
various ranges of orbital period, planet radius, and insolation, as well as values in the habitable zone (HZ).
Overall, Petigura et al. find that 26±3% of Sun-like stars are orbited by an Earth-size planet (with
a planet radius 1 − 2 R⊕) with P = 5 − 100 days (Petigura et al. 2013). Figure 1 shows the distribution
of planets in bins of log period and log radius. The respective occurrence rates for each bin are indicated
as well. Graphically, it appears that the binned occurrence rate depends significantly on log radius. Data
about multiple-planet systems is not represented in their study. In their TERRA pipeline, no Earth-size
planets were detected that had Earth-like periods (P = 200− 400 days). However, the radii of three planets
did exhibit 1σ confidence intervals that extend into the domain (P = 200 − 400 d, Rp = 1 − 2 R⊕). This fact
leads to the assumption that the existence of an Earth-size planet with an Earth-like period is still plausible,
although one has yet to be detected. Petigura et al. estimate the occurrence rate of these 1 − 2 R⊕ planets
with periods of 200 − 400 d by extrapolating the overall planet occurrence with P. This yields a $5.7^{+1.7}_{-2.2}$%
occurrence of Earth-size (1 − 2 R⊕) planets with periods of 200 − 400 d.
The term “rate density” is used by Foreman-Mackey et al. to indicate the integrand over a finite
bin in period and radius that results in a rate. A rate density attempts to correct for differences in surveys
that go to different depths. Foreman-Mackey et al. find that the occurrence rate density for Earth analogs
is $0.019^{+0.019}_{-0.010}$ nat$^{-2}$, that is, per natural logarithmic period per natural logarithmic radius
(Foreman-Mackey et al. 2014). For reference, the authors also converted Petigura et al.’s results to these
units, yielding a value of $0.119^{+0.046}_{-0.035}$ nat$^{-2}$. They note several features of the data.
The period distribution is not consistent between
large (R > 8 R⊕) planets and small planets. Foreman-Mackey et al. do not graphically show their data
in binned form, but Figure 2 does show the log radius-log period relationship in their data. As is the case
with the data presented by Petigura et al., the occurrence rate appears to depend more on planet size than
orbital period. The radius distribution, however, is qualitatively consistent between large and small planets.
Potential features near R ∼ 3 R⊕ and R ∼ 10 R⊕ are noted. Foreman-Mackey et al. claim that their results
are “completely inconsistent” with the results of Petigura et al., despite being based on the same data set.
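The unit conversion mentioned above can be sketched as follows. This is a minimal illustration, assuming the rate density is constant across the bin so that the rate is simply the density times the bin area in ln P and ln R; the function name `rate_to_density` is hypothetical and not taken from either paper.

```python
import math

def rate_to_density(rate, p_lo, p_hi, r_lo, r_hi):
    """Convert an occurrence rate in a period-radius bin to a rate density
    per ln(period) per ln(radius), assuming a constant density in the bin."""
    area = math.log(p_hi / p_lo) * math.log(r_hi / r_lo)  # bin area in nat^2
    return rate / area

# Central value from Petigura et al.: 5.7% for P = 200-400 d, R = 1-2 Earth radii.
density = rate_to_density(0.057, 200.0, 400.0, 1.0, 2.0)
print(f"{density:.3f} nat^-2")  # prints "0.119 nat^-2"
```

Under this assumption the central value reproduces the 0.119 nat$^{-2}$ quoted for Petigura et al.; the sketch does not attempt to propagate the asymmetric uncertainties.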
Comparing the results calculated by Foreman-Mackey et al. with those of Petigura et al. (converted
to values in occurrence rate density), it is clear that the value determined by Foreman-Mackey et al.
is considerably lower than the value determined by Petigura et al., by a factor of roughly six. It is also
interesting to note that the error bars of the two values do not overlap. The value published by
Foreman-Mackey et al. exhibits much larger margins of error than that by Petigura et al. (almost twice
as large in a log scale). This may be attributed to the fact that Foreman-Mackey et al. consider the
observational uncertainties on the physical parameters non-negligible, thus increasing the propagation of
error for the final value.
Dressing et al. find a cumulative occurrence rate of 2.5± 0.2 planets (R = 1− 4 R⊕, P < 200 d)
per M dwarf (Dressing & Charbonneau 2015). They also give occurrence rates for various ranges of planet size
and period. For planets with periods such that P < 50 days, the planet occurrence rate decreases as planet
radius increases from 1 R⊕ to 4 R⊕. Overall, the occurrence rate increases with orbital period between
0 − 200 days. Dressing et al. also perform an in-depth analysis of how the occurrence rate varies for planets in the
habitable zone. Under conservative assumptions for the range of the habitable zone,
the occurrence rate is estimated as $0.16^{+0.17}_{-0.07}$ (potentially) habitable Earth-size planets (1 − 1.5 R⊕) per M dwarf.
According to their estimates, it could be suggested that habitable zone planets are more common around
lower-mass stars. Figure 3 shows Dressing et al.’s distribution of planets in binned form. It is prudent to
note that the binned occurrence values given by Dressing et al. appear to be consistently greater than the
same values given by Petigura et al.
Table 1. Conservative occurrence rates from different studies
Study                        Occurrence rate                              Occurrence rate density
Petigura et al. 2013         $5.7^{+1.7}_{-2.2}$%                         $0.119^{+0.046}_{-0.035}$ nat$^{-2}$
Foreman-Mackey et al. 2014   -                                            $0.019^{+0.019}_{-0.010}$ nat$^{-2}$
Dressing et al. 2015         $0.16^{+0.17}_{-0.07}$ planets per M dwarf   -
Figure 1: Planet occurrence as a function of planet radius and orbital period. The distribution is organized into bins of log radius and log period. Red dots represent detected planets. Each bin displays the occurrence rate in that bin and is colored according to that occurrence rate. The bulk of the planets are distributed between 1 − 4 R⊕ and 10 − 200 days. Figure taken from (Petigura et al. 2013).
Figure 2: Center: The points represent detected planets. The contours represent the completeness function. The grayscale represents the occurrence rate density. The points are organized in bins of log radius and log period. The density of the planet distribution is high between 0.0 − 1.5 ln R/R⊕ and 2.0 − 4.0 ln P/day. Top and right: Histograms of the inferred rate density using the likelihood method. The points with error bars are the results of the inverse-detection-efficiency method. Figure taken from (Foreman-Mackey et al. 2014).
3 Inverse-detection efficiency vs. likelihood methods
Petigura et al. calculate the distribution of exoplanets in their survey by a process termed “inverse-
detection efficiency” (Petigura et al. 2013). In this process, they inject fake planet transit signals into the
light curves of Sun-like stars and recover them using TERRA. They divide the recovered signals into bins
by radius and period and weight the population of each bin by the inverse of their detection efficiency.
Foreman-Mackey et al. argue that the “likelihood method” is superior to the inverse-detection
efficiency method used by Petigura et al. They use the same catalog of measurements published by Petigura
et al. but treat it as a draw from an inhomogeneous Poisson process set by the observable rate density. A
Poisson process is a model for distributions of points that are randomly located in space. An inhomogeneous
Figure 3: Left: Planet occurrence rate binned by orbital period and planet radius. The occurrence rate is listed as a percentage at the top of each bin. The percentage of injected planets recovered by the pipeline is listed at the bottom of each bin. The occurrence rate is highest between 0.5 − 2.5 R⊕ and 4 − 100 days. Right: Planet occurrence rate binned by planet insolation and radius. The occurrence rate and injection recovery rate are listed the same as left. The occurrence rate is highest between 0.5 − 2.5 R⊕ and 12 − 1 F⊕. Figure taken from (Dressing & Charbonneau 2015).
Poisson process has some underlying function as a parameter that is dependent on spatial location. This
yields an exponential function containing the integral of the observable rate density over the set of physical
parameters contained in the catalog data. Since the observable rate density is a product of the detection
efficiency and true occurrence rate density functions (both dependent on the same physical parameters), they
infer the true occurrence rate density. They then model the rate density as a piecewise constant step function
and derive the analytic maximum likelihood solution for the step heights. According to (Foreman-Mackey
et al. 2014), this method is “guaranteed to provide a lower variance estimate of the rate density than the
standard procedure”.
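To make the contrast between the two estimators concrete, the sketch below compares them in a single bin of (ln P, ln R). It assumes one common way of writing them: the inverse-detection-efficiency estimate weights each detected planet by 1/Q at its own location and divides by the bin area, while the maximum-likelihood step height is the number of planets divided by the integral of Q over the bin. The efficiency function, bin edges, and catalog entries are invented toy values, and the function names are hypothetical.

```python
import math

def efficiency(ln_p, ln_r):
    """Hypothetical detection efficiency: falls with period, rises with radius."""
    return min(1.0, max(1e-4, 0.02 * (1.0 + ln_r) * (7.0 - ln_p)))

def inverse_efficiency_estimate(planets, q, area):
    """Weight each detected planet by 1/Q at its location, then divide by bin area."""
    return sum(1.0 / q(lp, lr) for lp, lr in planets) / area

def max_likelihood_estimate(planets, q, p_edges, r_edges, n=200):
    """Step height N / (integral of Q over the bin), using a midpoint-rule integral."""
    dp = (p_edges[1] - p_edges[0]) / n
    dr = (r_edges[1] - r_edges[0]) / n
    integral = sum(
        q(p_edges[0] + (i + 0.5) * dp, r_edges[0] + (j + 0.5) * dr)
        for i in range(n) for j in range(n)
    ) * dp * dr
    return len(planets) / integral

# Toy bin: P = 50-100 d, R = 1-2 Earth radii, with three made-up detections (ln P, ln R).
p_edges = (math.log(50.0), math.log(100.0))
r_edges = (math.log(1.0), math.log(2.0))
planets = [(4.0, 0.1), (4.2, 0.5), (4.5, 0.3)]
area = (p_edges[1] - p_edges[0]) * (r_edges[1] - r_edges[0])

gamma_ide = inverse_efficiency_estimate(planets, efficiency, area)
gamma_ml = max_likelihood_estimate(planets, efficiency, p_edges, r_edges)
print(f"inverse-detection-efficiency: {gamma_ide:.1f} nat^-2")
print(f"maximum likelihood:           {gamma_ml:.1f} nat^-2")
```

When the efficiency is constant across the bin the two estimates coincide; they differ only when Q varies within the bin, which is where the choice of method starts to matter.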
Dressing et al. use a method that is very similar to the inverse-detection efficiency method used by Petigura
et al., but with their own custom pipeline. They use the Kepler catalog as their source and search
the light curves of the stars in that catalog for exoplanets (Dressing & Charbonneau 2015). Their pipeline
searches each light curve sequentially so that it can pick up multiple planets per star, if warranted by the
data. This is an improvement over the methods used by Petigura et al. and Foreman-Mackey et al. because
their pipelines can only detect whether or not any planet transits a given star, not whether multiple do. If a star in
the catalog is found to have a candidate transit signal, the signal is vetted against several criteria to decide
whether it is a true exoplanet or a false positive. These include whether the signal is associated
with known spacecraft or stellar activity and whether the signal is a harmonic of a different signal.
Dressing et al. then conduct a Bayesian Markov Chain Monte Carlo analysis to fit a curve to the exoplanet
candidates determined by their pipeline.
3.1 Bayesian inference
Bayesian inference is the derivation of a posterior probability resulting from a prior probability and
a likelihood function. The prior probability (or prior) is the probability distribution of an uncertain quantity
that expresses one’s beliefs about a quantity before relevant evidence is taken into account. The prior can be
created using past information, subjective assessment of data, or relevant principles. The likelihood function
represents the probability of an observed outcome given a certain set of parameter values. Thus, the posterior
probability is the conditional probability of a random event after relevant evidence is taken into account.
Bayesian inference follows Bayes’ theorem to compute the posterior probability:
\[ P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}, \tag{1} \]
where H is the hypothesis, E is new data (evidence) that were not used to compute the prior probability,
P(H) is the prior probability, P(H|E) is the posterior probability, and P(E|H) is the likelihood function. In
practice, Bayes’ theorem can be applied iteratively: after applying the theorem to some observed evidence,
the resulting posterior probability can be treated as a prior used in the computation of a new posterior
probability from new evidence.
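The iterative use of Bayes’ theorem can be sketched with a small discrete example. The hypotheses and all probabilities below are invented for illustration and are not taken from any of the studies discussed; `bayes_update` is a hypothetical helper.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(H|E) for each hypothesis H.

    prior: dict mapping H -> P(H); likelihood: dict mapping H -> P(E|H).
    """
    evidence = sum(prior[h] * likelihood[h] for h in prior)  # P(E)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Hypothetical question: is a dimming event a planet or a false positive?
prior = {"planet": 0.5, "false_positive": 0.5}
# Assumed probabilities of seeing a transit-shaped dip under each hypothesis.
dip = {"planet": 0.9, "false_positive": 0.3}

posterior = bayes_update(prior, dip)      # after one dip, P(planet) is about 0.75
posterior = bayes_update(posterior, dip)  # after a second dip, about 0.90
print(posterior)
```

The posterior from the first update is passed in as the prior for the second, which is exactly the iteration described above.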
Monte Carlo methods are computational algorithms that use repeated random sampling to obtain
numerical results. In principle, they can be used to solve any problem that has a probabilistic interpretation.
Markov Chain Monte Carlo (MCMC) is particularly useful when the probability distribution of a variable is
parameterized. An MCMC method uses a random walk, evaluating the target distribution at each step and
counting the visited values towards an integral. The Markov chain is constructed so that this target
distribution is its equilibrium distribution.
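A minimal random-walk Metropolis sampler illustrates the idea. It is a generic textbook sketch, not the sampler used by Dressing et al.; the target (a standard normal), the step size, and the burn-in length are arbitrary assumptions chosen for illustration.

```python
import math
import random

def metropolis(log_target, x0, n_steps, step_size=2.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p_new / p_old); otherwise stay put.
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples

def log_standard_normal(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x

samples = metropolis(log_standard_normal, x0=0.0, n_steps=20000)
kept = samples[2000:]  # discard an initial burn-in segment
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"mean = {mean:.2f}, variance = {var:.2f}")
```

Averages over the retained samples approximate integrals over the target distribution; here the sample mean and variance should land near 0 and 1.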
4 Survey completeness
Survey completeness is, in essence, a measure of how accurately collected data represents the actual
distribution of sources. For instance, if one were to create a pipeline to detect planets transiting stars over
a given area of the sky, the completeness would be a measure of how many transiting planets the pipeline
detected versus how many transiting planets actually exist in that area. This value is significant because it
represents the reliability of a pipeline to recover signals from a catalog. It is also used in the calculation of
final statistics of the data: in this case, the occurrence rate. In the surveys covered in this paper, the survey
completeness is measured in small bins of orbital period P and planet radius R. Several trends were noted
by each author.
(Petigura et al. 2013) note that survey completeness is a complicated function of period and radius.
In general, as P increases and R decreases, the completeness decreases. This makes logical
sense, as planets with small radii would be eliminated as noise in the pipeline (and possibly undetected by
the Kepler survey), and planets with sufficiently long periods may not have time to transit during the course
of the survey. Petigura et al. also state that using noise models to determine the completeness function
instead of the injection and recovery method would not be recommended because these models would only
be able to determine relative completeness, when absolute completeness is the significant value.
(Foreman-Mackey et al. 2014) use the same catalog generated by Petigura et al. to recalculate
the completeness function. They use the injected signal samples from this catalog to calculate the detection
efficiency in bins of log period and log radius. This detection efficiency is just the fraction of recovered
injection signals per bin. The authors state that, in a more certain survey, the best way to calculate the
survey completeness would be in terms of radius ratio or signal-to-noise. Since the radius uncertainties
are dominated by uncertainties in the stellar parameters, however, it is impossible to compute constraints
on radius ratios and the best method available is to determine the completeness as a function of period and radius. The next step
is to determine the geometric transit probability in the period-radius plane. This distribution scales only
with the period. Foreman-Mackey et al. assume that all planets in the catalog are on circular orbits. For
simplicity, this assumption is necessary in the calculations of both Petigura et al. and Foreman-Mackey et
al. However, a study by (Kipping 2014) showed that, when eccentric orbits are included, Foreman-Mackey
et al. underestimate the transit probability by 10%. This effect propagates to the occurrence rate density
as well.
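The binned detection efficiency described above, the fraction of injected signals recovered in each bin, can be sketched as follows. The bin edges and the injection list are invented toy values, and `binned_efficiency` is a hypothetical helper; in practice the edges would be spaced evenly in log period and log radius.

```python
import bisect

def binned_efficiency(injections, p_edges, r_edges):
    """Fraction of injected signals recovered in each (period, radius) bin.

    injections: iterable of (period, radius, recovered) tuples.
    Returns a nested list indexed [period_bin][radius_bin]; bins with no
    injections are reported as None.
    """
    n_p, n_r = len(p_edges) - 1, len(r_edges) - 1
    injected = [[0] * n_r for _ in range(n_p)]
    recovered = [[0] * n_r for _ in range(n_p)]
    for period, radius, was_recovered in injections:
        i = bisect.bisect_right(p_edges, period) - 1
        j = bisect.bisect_right(r_edges, radius) - 1
        if 0 <= i < n_p and 0 <= j < n_r:  # skip injections outside the grid
            injected[i][j] += 1
            recovered[i][j] += bool(was_recovered)
    return [[recovered[i][j] / injected[i][j] if injected[i][j] else None
             for j in range(n_r)] for i in range(n_p)]

# Toy grid: periods in days, radii in Earth radii.
p_edges = [5.0, 50.0, 400.0]
r_edges = [1.0, 2.0, 4.0]
injections = [
    (10.0, 1.5, True), (20.0, 1.2, True), (30.0, 1.8, False),
    (100.0, 3.0, True), (300.0, 1.1, False), (200.0, 1.5, False),
]
eff = binned_efficiency(injections, p_edges, r_edges)
print(eff)
```

In this toy grid the long-period, small-radius bin recovers none of its injections, mirroring the trend that completeness falls as P increases and R decreases.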
(Dressing & Charbonneau 2015) present a much less detailed explanation of their methods for determining
survey completeness. Their method requires using “the detectability of a particular transiting planet
and the likelihood that a particular planet will be observed to transit”. They calculate the geometric probability
of planet transit to determine the dependence on the second factor above. They do this for planets
at particular periods or insolation levels. Examining the dependence of the geometric transit probability on
insolation is beyond the scope of this paper. It is unclear why Dressing et al. do not include planet radius in
their calculations here, but I speculate that it is for similar reasons as to why (Foreman-Mackey et al. 2014)
find that the distribution of the geometric transit probability in the period-radius plane scales only with the
period. They then divide the stellar radii by semimajor axes to determine the transit probability for a planet
on a circular orbit. Dressing et al. take one additional step that is not taken by Foreman-Mackey et al. or
Petigura et al. They incorporate the eccentricity correction factor presented by (Kipping 2014) to present a
more realistic distribution of planets.
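The geometric transit probability for a circular orbit, and its dependence on period alone, can be sketched as below. The semimajor axis comes from Kepler’s third law with the planet’s mass neglected. The `eccentricity_factor` argument is a hypothetical multiplicative correction; the value 1.1 used in the example is only a placeholder suggested by the roughly 10% effect mentioned above, not a number taken from (Kipping 2014).

```python
import math

G = 6.674e-11     # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30  # solar mass, kg
R_SUN = 6.957e8   # solar radius, m
DAY = 86400.0     # seconds per day

def semimajor_axis(period_days, stellar_mass_msun):
    """Semimajor axis in metres from Kepler's third law (planet mass neglected)."""
    p = period_days * DAY
    return (G * stellar_mass_msun * M_SUN * p ** 2 / (4.0 * math.pi ** 2)) ** (1.0 / 3.0)

def transit_probability(period_days, stellar_radius_rsun, stellar_mass_msun,
                        eccentricity_factor=1.0):
    """Geometric transit probability R_star / a for a circular orbit, times an
    optional multiplicative eccentricity correction (1.0 means circular)."""
    a = semimajor_axis(period_days, stellar_mass_msun)
    return eccentricity_factor * stellar_radius_rsun * R_SUN / a

p_circ = transit_probability(365.25, 1.0, 1.0)
p_ecc = transit_probability(365.25, 1.0, 1.0, eccentricity_factor=1.1)  # placeholder factor
print(f"circular: {p_circ:.4%}, with placeholder correction: {p_ecc:.4%}")
```

Planet radius does not enter this expression, and for a fixed star the probability scales as P^(−2/3), which is consistent with the statement that the distribution scales only with the period.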
5 Effects of differing assumptions
Each author selects a slightly different set of assumptions when deriving the planet occurrence
rate. The differences between these can have both drastic and insignificant effects on the results of their
calculations. I will present the assumptions used by each author, and then compare their effects.
(Petigura et al. 2013) make the following assumptions. They assume that the planet occurrence
rate is flat in log period, and that the occurrence is uniform per log period interval. The assumptions
regarding the flatness and uniformity of the occurrence rate with log period are necessary to be able to
fit a reasonable relation between the occurrence rate and bins of period and planet radius. One significant
assumption made by the authors is that the orbits of transiting planets are circular. Planets on eccentric orbits
have, on average, a higher probability of being observed to transit than those on circular orbits (Kipping 2014).
Thus, assuming circular orbits underestimates the transit probability, and the occurrence rate inferred from the
TERRA detections is higher than would be the case if orbital eccentricities were taken into account.
Additionally, it should be noted that Petigura et al.’s final results are based on extrapolations from the data
(in other words, they use a fitted model to derive some of the data that go into their final calculations).
This would likely be less accurate than a direct calculation on data that already lie
in the domain of concern. However, since further observation would be needed to obtain data in that
domain, their method is the most practical one given the available data.
Petigura et al. state that alternative definitions of the properties of Earth analogs and the domain
of the HZ may be adopted. They provide several estimates of the occurrence rate they calculated based on
different published definitions of the HZ.
(Foreman-Mackey et al. 2014) make several “strong” assumptions throughout their paper, but
they argue that these are weaker than the implicit assumptions made in previous studies. The assumptions
as stated by Foreman-Mackey et al. are:
1. “Candidates in the catalog are independent draws from an inhomogeneous Poisson process set by the
censored occurrence rate density.”
2. “Every candidate is a real exoplanet (there are no false positives).”
3. “The observational uncertainties on the physical parameters are non-negligible but known (the catalog
provides probabilistic constraints on the parameters).”
4. “The detection efficiency of the pipeline is known.”
5. “The true occurrence rate density is smooth.”
Foreman-Mackey et al. give a summary of the effects of their assumptions, as well. Their first
assumption states that each planet drawn from the catalog is independent of all the other planets in the
catalog. The reasoning behind this assumption is that since the data set they consider only includes systems
with single planets, no planet in the system will affect the parameters of another. The second assumption
states that all candidates in the catalog are real. However, other studies (Morton & Johnson 2011; Fressin
et al. 2013) demonstrate that the false positive rate in the Kepler catalog is non-negligible. Running
their calculations again while taking into account possible false positives would likely decrease the planet
occurrence rate found by Foreman-Mackey et al. Similar to Petigura et al., the authors here neglect orbital
eccentricities (although they do comment on why this skews their results).
(Dressing & Charbonneau 2015) make very few explicit assumptions in their calculations. Like
Petigura et al., they assume that the planet occurrence rate is flat in log period. Like Foreman-Mackey et
al., they assume that there are no false positives in the catalog in their original 2013 paper. In their 2015
paper, however, Dressing et al. apply a correction for false positives. It appears that of the studies I am
examining, this is the only one to make this correction. Unlike Petigura et al. and Foreman-Mackey et al.,
Dressing et al. factor in orbital eccentricities, using the correction presented by (Kipping 2014).
Petigura et al. and Dressing et al. assume that the distribution of planets (and thus the planet
occurrence rate) is flat in a bin of log radius. (Foreman-Mackey et al. 2014) relax this assumption and only
assume that the occurrence rate density is a smooth function of period and radius. This allows them more
freedom to extrapolate their data.
6 Bias from paper intent
Each previous study has a slightly different overall intent, which could cause some bias in their
treatment of their calculations. For instance, if the primary objective of a study is to publish a catalog of
data, the author(s) may make stronger assumptions or make much more general claims about the trends in
their data than a study that was focused on analyzing such a catalog.
The intent of the study by Petigura et al. is two-fold, but is primarily focused on one goal. The
bulk of the work done by the authors goes into creating their TERRA software package to identify Earth-size
planets in the Kepler data and create a catalog of these planets. Additionally, they present an analysis of
their data (they present occurrence rates for different sets of parameters). Since the primary focus of the
study is on creating the catalog of planets, the authors may make some of their more substantial assumptions
when calculating the occurrence rate.
Foreman-Mackey et al. focus on calculating the occurrence rate density using the catalog of planet
candidates published by (Petigura et al. 2013). Because their calculations only depend on the catalog data
and not the inferred statistics published by Petigura et al., it is likely that there is little bias present in
the calculations of Foreman-Mackey et al. The intent of the study by Foreman-Mackey et al. is to find the
occurrence rate density of planets in the catalog published by Petigura et al. using methods that require
fewer assumptions than previous studies.
The intent of the paper by Dressing et al. is more complicated than that of the papers by Petigura
et al. and Foreman-Mackey et al. because it is, in essence, a revisitation of their 2013 study. Their intent,
explicitly stated (Dressing & Charbonneau 2015), is to “implement the following improvements to refine our
2013 estimate of the frequency of small planets around small stars.
1. We use the full Q0-Q17 Kepler data set.
2. We utilize archival spectroscopic and photometric observations to refine the stellar sample.
3. We explicitly measure the pipeline completeness.
4. We inspect follow-up observations of planet host stars to properly account for transit depth dilution
due to light from nearby stars.
5. We apply a correction for false positives in the planet candidate sample.
6. We incorporate a more sophisticated treatment of the HZ.”
Since the intent of this paper is to refine the methods utilized in their 2013 paper, it seems prudent to
examine the intent of the original paper. It appears that the goal of the 2013 study is to create a catalog of
transiting exoplanets from the Kepler data set using their own pipeline, as well as estimate the occurrence
rate of planets with short orbits (P < 50 days) around cool stars (Dressing & Charbonneau 2013). This intent
is similar to that of Petigura et al., so it is likely that any bias present in that paper would be present in
Dressing et al.’s 2013 paper.
7 Effects on other statistics
While the most significant statistic treated in these studies is the occurrence rate of small planets,
several others are presented as products of the varying methodologies used by the authors.
In some cases, these other statistics are used in the calculation of the value of the occurrence rate. In others,
they are used as parameters to constrain the domain of the occurrence rate. Petigura et al. use their
TERRA software to determine the intensity of stellar light energy received by a planet from its parent star.
To determine this value, they use the standard formula for stellar light flux, $F_p = L_\star / (4\pi a^2)$, where $F_p$ is the
flux received by the planet, $L_\star$ is the luminosity of the star, and $a$ is the planet-star separation (Petigura
et al. 2013). The luminosity of the star is computed as $L_\star = 4\pi R_\star^2 \sigma T_{\mathrm{eff}}^4$, where $\sigma$ is the Stefan-Boltzmann
constant. In this case, Petigura et al. use the fluxes determined for each star to constrain the domain of
their planet occurrence rate. They determine the occurrence rate for planets in the domain 1 − 2 R⊕ and
1 − 4 F⊕, where F⊕ is the flux received by Earth from the Sun. The flux and luminosity data obtained
by Petigura et al. are likely not tainted by strong assumptions because they could come from the Kepler light
curves before TERRA was run to extract exoplanet signals.
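As a minimal sketch of the flux calculation described above, the two formulas can be combined and expressed in units of F⊕. The physical constants, the function names, and the helper `in_flux_domain` are my own illustrative choices and are not taken from the TERRA software; the solar effective temperature of 5772 K is an assumed nominal value.

```python
import math

# Physical constants in SI units (assumed nominal values, not from the paper).
SIGMA_SB = 5.670374419e-8   # Stefan-Boltzmann constant, W m^-2 K^-4
R_SUN = 6.957e8             # solar radius, m
AU = 1.495978707e11         # astronomical unit, m
T_SUN = 5772.0              # solar effective temperature, K


def stellar_luminosity(radius_m, t_eff_k):
    """L_star = 4 pi R_star^2 sigma T_eff^4."""
    return 4.0 * math.pi * radius_m ** 2 * SIGMA_SB * t_eff_k ** 4


def flux_at_planet(luminosity_w, separation_m):
    """F_p = L_star / (4 pi a^2)."""
    return luminosity_w / (4.0 * math.pi * separation_m ** 2)


# Flux received by Earth from the Sun, used as the unit F_earth.
F_EARTH = flux_at_planet(stellar_luminosity(R_SUN, T_SUN), AU)


def flux_in_earth_units(radius_rsun, t_eff_k, separation_au):
    """Flux received by a planet, in units of F_earth."""
    lum = stellar_luminosity(radius_rsun * R_SUN, t_eff_k)
    return flux_at_planet(lum, separation_au * AU) / F_EARTH


def in_flux_domain(flux_earth_units, lo=1.0, hi=4.0):
    """True if the flux lies in the 1 - 4 F_earth domain quoted in the text."""
    return lo <= flux_earth_units <= hi
```

For example, `flux_in_earth_units(1.0, 5772.0, 0.5)` evaluates to approximately 4, since halving the separation quadruples the flux.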
Since the study by Foreman-Mackey et al. applies the likelihood method, the nature of all of
the statistics presented by those authors is probabilistic. In other words, statistics that would have been
considered certain in the work of Petigura et al. are treated as uncertain by Foreman-Mackey et al. For
instance, Foreman-Mackey et al. use the probability of a transiting planet from the Kepler data set and
the inferred rate density to find the number of Earth-like planets transiting Sun-like stars in the catalog
published by Petigura et al. The number that they found was $10.6^{+5.9}_{-4.5}$ (Foreman-Mackey et al. 2014).
The uncertainties on this value are only on the expectation value and do not include the Poisson sampling
variance, and thus leave much room for improving the precision of this value. This can be accomplished by
extending the search pipelines to small planets with long periods.
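To illustrate what it means for the quoted uncertainty to exclude the Poisson sampling variance, the rough sketch below combines the two sources of scatter using the law of total variance. It assumes that the asymmetric error bars can be symmetrized by averaging them and that the quoted central value approximates the mean of the expectation; both are simplifications of my own and are not part of the analysis by Foreman-Mackey et al.

```python
import math

# Values quoted in the text from Foreman-Mackey et al. (2014).
expected_count = 10.6
err_plus, err_minus = 5.9, 4.5

# Assumption: treat the asymmetric errors as one symmetric standard deviation.
sigma_expectation = 0.5 * (err_plus + err_minus)

# Poisson sampling scatter for a fixed expectation value.
poisson_sd = math.sqrt(expected_count)

# Law of total variance for a Poisson count with an uncertain mean:
# Var(N) = E[mean] + Var(mean).
total_sd = math.sqrt(expected_count + sigma_expectation ** 2)

print(f"uncertainty on the expectation only: {sigma_expectation:.1f}")
print(f"Poisson sampling scatter only:       {poisson_sd:.1f}")
print(f"combined scatter on the count:       {total_sd:.1f}")
```

Under these assumptions, the scatter on the number of planets actually observed is somewhat larger than the quoted uncertainty on the expectation value alone.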
The most significant statistic presented by Dressing et al. that I have not discussed yet is the
insolation of planets. The insolation is, essentially, a measurement of the flux received by a planet from its
host star. Dressing et al. present insolation as a constraint on the domain of the occurrence rate. In other
words, they use insolation as a definition of the habitable zone (the range around a star in which a planetary
surface can support liquid water under sufficient atmospheric pressure). According to the authors, the errors
on the insolation are large enough to produce a smooth distribution. While this is helpful in creating a
distribution which can be used to more precisely constrain the occurrence rate, large errors generally mean
that any given measurement inferred from the distribution would not be accurate.
8 Ramifications for future studies
Astronomy is a field built on collaboration. The nature of these types of studies is to build on
previous work and either develop new methods or sharpen methods already proposed. This is apparent in
the studies discussed in this paper. Petigura et al. lay much of the groundwork for the material discussed,
while Foreman-Mackey et al. propose an alternate method to discern the same statistics. Dressing et al.,
however, propose improvements on the technique published in their 2013 paper.
Future studies will almost certainly follow this same pattern. It is likely that, after much further
study, a single preferred method of determining occurrence rates will emerge. In the meantime,
the best course of action is for astronomers to further develop the likelihood and inverse-detection-efficiency
methods until one produces statistically superior results. Ideally, observational techniques would eliminate
the need for assumptions to be made when discerning trends in data.
According to Foreman-Mackey et al. (2014), a fully detailed analysis of planet occurrence rates
should apply the likelihood method instead of the inverse-detection-efficiency method when measurement
uncertainties are significant. In a case where catalog data are known to a much greater degree of precision and the
completeness function varies much more smoothly, uncertainties become less significant and
the inverse-detection-efficiency method would be prudent to use.
9 Conclusion
In this paper, I presented several comparisons between the statistical methodologies implemented
and assumptions made by different studies and their effects on the results of each study. The values in
question are the occurrence rate of exoplanets orbiting stars in the Kepler mission catalog. This is, essentially,
the probability of finding a planet orbiting any given star. The primary motivation for the studies was to
determine occurrence rates for Earth-like planets orbiting Sun-like stars. I focused on the methods of three
studies:
1. Prevalence of Earth-size planets orbiting Sun-like stars, (Petigura et al. 2013)
2. Exoplanet population inference and the abundance of Earth analogs from noisy, incomplete catalogs,
(Foreman-Mackey et al. 2014)
3. The occurrence of potentially habitable planets orbiting M dwarfs estimated from the full Kepler
dataset and an empirical measurement of the detection sensitivity, (Dressing & Charbonneau 2015)
Each study uses a different method for determining the occurrence rate. Petigura et al. (2013)
use the inverse-detection-efficiency method, wherein they inject fake planet-transit signals into Kepler light
curves, recover the signals, and sort them into bins of log planet radius and log orbital period to determine the
planet occurrence rate. Dressing & Charbonneau (2015) apply an inverse-detection-efficiency method similar
to that of Petigura et al. Their pipeline, however, can also detect multiple planets transiting a single star, whereas
the pipeline used by Petigura et al. can only detect whether any planet transits the star. Foreman-Mackey et al.
(2014) use the likelihood method in place of the inverse-detection-efficiency method. In this method,
each planet candidate is treated as an independent draw from an inhomogeneous Poisson process. The method
yields an integral over the product of the detection efficiency and true occurrence rate density functions, which is used to
calculate the occurrence rate density.
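The two estimators summarized above can be sketched for a single bin as follows. This is a simplified illustration under my own assumptions (a rate density that is constant across the bin, a detection efficiency that already includes the geometric transit probability, and made-up toy numbers); it is not the pipeline of any of the three studies, and all function names are hypothetical.

```python
import math


def inverse_detection_efficiency_rate(efficiencies, n_stars):
    """Occurrence rate in one bin: each detected planet with detection
    efficiency Q counts as 1/Q planets, averaged over all surveyed stars."""
    return sum(1.0 / q for q in efficiencies) / n_stars


def poisson_log_likelihood(rate_density, efficiencies, n_stars, integrated_efficiency):
    """Log-likelihood of an inhomogeneous Poisson process for a constant
    true rate density in one bin.

    The observable rate density at candidate k is n_stars * Q_k * rate_density,
    and the expected number of detections is
    n_stars * rate_density * (integral of Q over the bin).
    """
    expected_detections = n_stars * rate_density * integrated_efficiency
    log_terms = sum(math.log(n_stars * q * rate_density) for q in efficiencies)
    return log_terms - expected_detections


def max_likelihood_rate_density(n_detected, n_stars, integrated_efficiency):
    """Rate density that maximizes the log-likelihood above."""
    return n_detected / (n_stars * integrated_efficiency)


# Toy numbers for illustration only (not taken from any catalog).
n_stars = 10_000
bin_area = math.log(2.0) * math.log(2.0)   # factor of 2 in radius and in period
efficiencies = [0.02, 0.02, 0.02, 0.02]    # Q at each of four detections
integrated_efficiency = 0.02 * bin_area    # Q is constant over the bin here

ide_density = inverse_detection_efficiency_rate(efficiencies, n_stars) / bin_area
mle_density = max_likelihood_rate_density(
    len(efficiencies), n_stars, integrated_efficiency
)
```

With a detection efficiency that is constant across the bin and no measurement uncertainties, the two estimates coincide. The methods diverge when the efficiency varies within a bin or when the measured radii and periods are uncertain, which is the regime Foreman-Mackey et al. address.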
Each method yields a different value for the occurrence rate or occurrence rate density. The
inverse-detection-efficiency method used by Petigura et al. yields an occurrence rate of $5.7^{+1.7}_{-2.2}\%$ for Earth-size
(1 − 2 R⊕) planets with periods of 200 − 400 d. Converted to an occurrence rate density, this value
is $0.119^{+0.046}_{-0.035}$ nat$^{-2}$. Foreman-Mackey et al.’s likelihood method yields an occurrence rate density value
of $0.019^{+0.019}_{-0.010}$ nat$^{-2}$ for Earth analogs, which they define along the same lines as Petigura et al. Dressing
& Charbonneau find an occurrence rate of 2.5 ± 0.2 planets per star for planets where R = 1 − 4 R⊕ and
P < 200 days.
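The conversion from an occurrence rate to an occurrence rate density can be checked with a short calculation. The sketch below assumes the rate density is constant across the bin, so the rate is simply divided by the bin area in natural-log radius and natural-log period; the function name is my own.

```python
import math


def rate_to_rate_density(rate, radius_lo, radius_hi, period_lo, period_hi):
    """Divide an occurrence rate by the bin area in ln(radius) x ln(period),
    assuming the rate density is flat across the bin."""
    bin_area = math.log(radius_hi / radius_lo) * math.log(period_hi / period_lo)
    return rate / bin_area


# Central value from Petigura et al.: 5.7% for 1 - 2 R_earth, 200 - 400 d.
petigura_density = rate_to_rate_density(0.057, 1.0, 2.0, 200.0, 400.0)
print(round(petigura_density, 3))  # 0.119
```

Only the central value is converted here; the quoted error bars are not rederived.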
The methods of Petigura et al. and Dressing & Charbonneau seem to yield numerically similar
results. While their margins of error do not intersect, both produce relatively low occurrence rates. This
should be expected as they use similar statistical methodologies. Both authors assume that the distribution
of planets in the survey is flat in bins of log orbital period and log planet radius. This is a necessary
assumption in order to correctly apply the inverse-detection-efficiency method and group into bins of log
period and log radius.
Because Dressing & Charbonneau do not present their data in terms of occurrence rate density,
I can only compare the values given by Foreman-Mackey et al. to those of Petigura et al. Similar to the
results of Dressing & Charbonneau and Petigura et al., the error bars of the occurrence rate density values
of Foreman-Mackey et al. and Petigura et al. do not overlap. The occurrence rate density calculated by
Foreman-Mackey et al. is, in fact, roughly six times smaller than that of Petigura et al. This leads to the
suggestion that the likelihood method and the inverse-detection-efficiency method cannot produce similar
results if applied to the same set of data under similar conditions. This is significant because it means that
each method should be refined further to determine whether the two are reconcilable. If not, the
circumstances in which each method is appropriate should be clearly delineated.
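The comparison above can be reproduced directly from the quoted values. The sketch below treats each quoted uncertainty as a simple interval around the central value, which is a simplification of my own, and the helper names are hypothetical.

```python
def interval(value, err_plus, err_minus):
    """Return (lower, upper) bounds from a value with asymmetric errors."""
    return (value - err_minus, value + err_plus)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Occurrence rate densities in nat^-2, as quoted in the text.
petigura = interval(0.119, 0.046, 0.035)
foreman_mackey = interval(0.019, 0.019, 0.010)

ratio = 0.119 / 0.019

print(overlaps(petigura, foreman_mackey))  # False
print(round(ratio, 1))                     # 6.3
```

The upper bound for Foreman-Mackey et al. (0.038) lies well below the lower bound for Petigura et al. (0.084), consistent with the statement that the error bars do not overlap.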
One of the most significant differences between the studies is the assumption that planet candidates
picked up by the pipelines have circular orbits. Petigura et al. and Foreman-Mackey et al. hold this
assumption in their calculations because their methodologies are not robust enough to handle eccentricity
variation. Kipping (2014) published a study that found a sizeable difference between survey completeness
calculations including elliptical orbits and those including only circular orbits. In their study, Dressing &
Charbonneau incorporate the factor presented by Kipping to include eccentric orbits in their distribution.
Future studies should more extensively examine the effects of orbital eccentricity on the occurrence rate.
It is worth comparing the citation metrics of each of these studies. The study by Petigura et
al. (2013) has been cited 274 times since its publication. Those of Foreman-Mackey et al. (2014) and Dressing &
Charbonneau (2015) have been cited 78 and 107 times, respectively. It is likely that Petigura et
al.’s paper is having a greater impact because they published a reliable and extensively documented catalog.
Many papers that cite Petigura et al. attempt to improve on the statistical methodology of the original
study, so any chain of improvements from study to study will lead back to Petigura et al. This trend
will almost assuredly continue in the future.
References
Petigura, E. A., Howard, A. W., Marcy, G. W. 2013, Proceedings of the National Academy of Sciences, 110 (Proceedings of the National Academy of Sciences), 19273, http://dx.doi.org/10.1073/pnas.1319909110
Foreman-Mackey, D., Hogg, D. W., Morton, T. D. 2014, The Astrophysical Journal, 795 (IOP Publishing), 64, http://dx.doi.org/10.1088/0004-637x/795/1/64
Dressing, C. D., Charbonneau, D. 2015, The Astrophysical Journal, 807 (IOP Publishing), 45, http://dx.doi.org/10.1088/0004-637x/807/1/45
Dressing, C. D., Charbonneau, D. 2013, The Astrophysical Journal, 767 (IOP Publishing), 95, http://dx.doi.org/10.1088/0004-637x/767/1/95
Kipping, D. M. 2014, Monthly Notices of the Royal Astronomical Society, 444 (Oxford University Press (OUP)), 2263, http://dx.doi.org/10.1093/mnras/stu1561
Morton, T. D., Johnson, J. A. 2011, The Astrophysical Journal, 738 (IOP Publishing), 170, http://dx.doi.org/10.1088/0004-637x/738/2/170
Fressin, F., Torres, G., Charbonneau, D., et al. 2013, The Astrophysical Journal, 766 (IOP Publishing), 81, http://dx.doi.org/10.1088/0004-637x/766/2/81