www.sciencemag.org/content/352/6288/1009/suppl/DC1 Supplementary Materials for Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals Xun Lan* and Jonathan K. Pritchard* *Corresponding author. Email: [email protected] (X.L.); [email protected] (J.K.P.) Published 20 May 2016, Science 352, 1009 (2016) DOI: 10.1126/science.aad8411 This PDF file includes: Materials and Methods Supplemental Text Figs. S1 to S27 Tables S1 to S5 References Other Supplementary Material for this manuscript includes the following: (available at www.sciencemag.org/cgi/content/full/352/6288/1009/DC1) Data Files S1 to S3 as Excel Files


Contents

1 Materials and Methods
1.1 Data used in this study
1.2 Identification of duplicated genes
1.3 Estimating expression levels of duplicated genes
1.4 Phylogenetically-based dating of duplicated genes
2 Supplemental Text
2.1 Basic characteristics of duplicated genes
2.2 Patterns of gene expression in human and mouse tissues
2.3 Expression differences between humans and other species
2.4 Rapid downregulation of expression in duplicates.
2.5 Effect of expression patterns on disease burden.
2.6 Differential splicing in duplicates
2.7 Long-term selective constraint on duplicated genes (dN/dS)
2.8 Current selective constraint on duplicated genes (SFS in humans)
2.9 Effect of translocation on regulatory divergence of duplicates
3 Supplemental Figures
4 Supplemental Tables
5 List of Supplementary Files
6 References

1 Materials and Methods

1.1 Data used in this study

The main RNA-seq data used in this study were from the GTEx Pilot Phase (11). The 46

GTEx tissues included were: adipose subcutaneous, adipose tissue, adrenal gland, artery tibial,

blood, blood vessel, brain amygdala, brain anterior cingulate cortex BA24, brain caudate basal

ganglia, brain cerebellar hemisphere, brain cerebellum, brain cortex, brain frontal cortex BA9,

brain hippocampus, brain hypothalamus, brain nucleus accumbens basal ganglia, brain putamen

basal ganglia, brain spinal cord cervical c-1, brain substantia nigra, breast, breast mammary

tissue, colon, esophagus, esophagus mucosa, esophagus muscularis, heart, heart left ventricle,

kidney, liver, lung, muscle, muscle (skeletal), nerve tibial, ovary, pancreas, pituitary, prostate,

skin, skin (sun-exposed lower leg), spleen, stomach, testis, thyroid, uterus, vagina, and whole

blood (Supplementary File 1).

We processed data from 10 different individuals for most tissues, as this allowed us to ob-

tain nearly uniform sample size across all 46 tissues, while providing good power to detect

differential expression between tissues. We used a smaller sample size for kidney (8

individuals) and a larger one for uterus and esophagus muscularis (11 individuals).

Additionally, for supplementary analyses, we used mouse tissue expression data produced

by Babak et al. (12). Supplementary File 2 summarizes the data we used from Babak et al. We

thank Hunter Fraser for prepublication access to these data. For comparing gene expression

between species, we used RNA-seq data from six tissues (brain, cerebellum, heart, kidney,

liver, and testis) in human, macaque and mouse (24). We also used histone ChIP-seq data from

24 tissues from a variety of human developmental and adult tissues collected by the RoadMap

Epigenomics Project (26, 27). Table S1 is a list of genome assemblies and gene annotations

used in this study.

1.2 Identification of duplicated genes

We first constructed a database of duplicate protein-coding gene pairs in the human genome.

In brief, we identified 1,444 duplicate pairs that are reciprocal best matches, have at least 80%

aligned coding sequence, and for which neither gene is annotated as a pseudogene. Additional

filters are described below. Our analysis uses coding sequences only, not UTRs, because coding

sequences have much higher average conservation and are easier to align.

Prescreening for candidate duplicate gene pairs. Our strategy for identifying high quality

duplicate genes started with an initial filtering step comparing all genes against all others to

identify candidate gene pairs with substantial sequence similarity at the nucleotide or amino acid

level. The human genome assembly GRCh37 (hg19) and the matching transcript annotation

from the Ensembl project (28) were used to build a matrix of coding sequence distances between

all Ensembl genes (56,486 coding sequences). The complete set of protein sequences encoded

by the human genome from the Universal Protein Resource (UniProt) database (29) was used

to build a pairwise protein sequence distance matrix (88,277 protein sequences).

Pairwise sequence distance matrices were calculated using the software Clustal Omega 1.2.0

(30). The K-tuple measure (31) is used by the Clustal Omega software for fast computation of

pairwise distances of thousands of sequences. The pairwise similarity scores generated by the

K-tuple algorithm are then converted to distance scores with an added penalty on gaps. Gene

pairs with either a nucleotide or amino acid sequence distance score lower than 0.6 were kept

for further analysis. The cutoff of 0.6 was chosen to balance sensitivity of detecting duplicate

genes against the computational burden of subsequent analyses. The command line used to

generate the distance matrices was:

$ clustalo.modified -i [the input fasta file] --distmat-out=prefix.distanceMatrix --guidetree-out=prefix.dnd --force --full -o prefix.clustaloOut

These filters resulted in 242,575 candidate gene pairs.
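
The prescreening rule can be illustrated with a minimal Python sketch: a pair is kept if either its nucleotide or its amino acid distance is below 0.6. The dictionary input format and the function name `candidate_pairs` are assumptions made for this example, and it presumes that protein-level distances have already been mapped back to gene identifiers. It is not the pipeline used in the study.

```python
# Illustrative sketch only: keep gene pairs whose nucleotide OR amino acid
# distance score is below the 0.6 cutoff stated in the text.
DISTANCE_CUTOFF = 0.6


def candidate_pairs(nt_dist, aa_dist, cutoff=DISTANCE_CUTOFF):
    """Return the set of unordered gene pairs passing the prescreen.

    nt_dist / aa_dist: dicts mapping (gene_a, gene_b) -> distance score
    (hypothetical input format chosen for this example).
    """
    kept = set()
    for dist in (nt_dist, aa_dist):
        for (a, b), d in dist.items():
            if a != b and d < cutoff:
                # Sort the identifiers so (a, b) and (b, a) are the same pair.
                kept.add(tuple(sorted((a, b))))
    return kept


nt = {("G1", "G2"): 0.35, ("G1", "G3"): 0.90}
aa = {("G3", "G1"): 0.55, ("G2", "G3"): 0.75}
pairs = candidate_pairs(nt, aa)
```

Here G1/G3 is retained on the strength of its amino acid distance alone, which mirrors the "either ... or" wording of the filter.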

Pairwise sequence alignment. In the next filtering stage, we performed multiple sequence

alignments of each candidate pair without codon awareness across all transcripts of each gene

pair using Clustal Omega 1.2.0. We kept the two transcripts that generated the largest number

of aligned nucleotides for each gene pair. Pairs with <100 aligned nucleotides or with >50%

uncorrected sequence divergence were removed. The command line used to perform multiple

sequence alignment without codon awareness was

$ clustalo-1.2.0-Ubuntu-x86_64 -i [the input fasta file] --distmat-out=prefix.mat --guidetree-out=prefix.dnd --force --full -o prefix.clustaloOut
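
As a hedged illustration of the two thresholds in this stage (at least 100 aligned nucleotides and at most 50% uncorrected divergence), the sketch below computes both quantities from a pair of aligned rows. It assumes that "aligned nucleotides" means alignment columns without a gap in either sequence and that divergence is the fraction of such columns that differ. The function names are hypothetical.

```python
def pairwise_stats(seq_a, seq_b, gap="-"):
    """Return (aligned_nt, divergence) for two equal-length aligned rows."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = mismatches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue  # columns with a gap in either row are not counted
        aligned += 1
        if x != y:
            mismatches += 1
    divergence = mismatches / aligned if aligned else float("nan")
    return aligned, divergence


def passes_filter(seq_a, seq_b, min_aligned=100, max_div=0.5):
    """Keep pairs with >=100 aligned nt and <=50% uncorrected divergence."""
    aligned, div = pairwise_stats(seq_a, seq_b)
    return aligned >= min_aligned and div <= max_div
```

For example, two 120-column gapless rows that differ at every fourth position have a divergence of 0.25 and pass, whereas a 3-column alignment fails on length.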

For the 20,191 gene pairs that remained after these filters, we performed a more computa-

tionally intensive, codon-aware alignment for each pair of transcripts using the software Mul-

tiple Alignment of Coding SEquences (MACSE) v0.9b1 (32). MACSE is a coding sequence

aligner accounting for frameshifts and stop codons. Briefly, MACSE optimizes the alignment

of two sequences by minimizing a weighted sum of costs for frameshift, deletions, stop codons,

and AA substitutions. The command line for pairwise alignment with codon awareness was

$ java -Xms8m -Xmx10g -Xss4m -jar macse_v0.9b1.jar -i [the input fasta file] -o outputDirectory

Filtering of high-confidence reciprocal best-hit duplicate pairs. Finally, we used a series

of additional criteria to identify a set of high-confidence duplicate pairs for analysis as follows.

(A) At least 80% of the coding sequences of both genes are aligned to the other (median =

1,122bp). (B) The sister duplicates are reciprocal-best hits, in the sense that both genes have

the lowest synonymous substitution rate (dS) to each other relative to all other human genes.

(C) Neither gene is classified as a pseudogene in a comprehensive pseudogene database (33).

(D) At least 50% of aligned nucleotides are identical between the two genes. Additionally we

excluded very short alignments, requiring at least one continuous aligned region of >100bp,

ignoring introns, and >200bp total aligned sequence.

These criteria produced 1,444 high quality reciprocal best-hit duplicate pairs. Our expres-

sion analysis further excludes 190 pairs with no uniquely-mappable positions and 60 pairs for

which one or both genes were unexpressed in all tissues and may thus represent unannotated

pseudogenes.
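
Criterion (B) can be sketched as follows. This is a simplified illustration, assuming a precomputed table of pairwise dS estimates keyed by unordered gene pairs; the data structure and the function name `reciprocal_best_hits` are hypothetical, and ties are resolved by first occurrence.

```python
def reciprocal_best_hits(ds):
    """ds: dict mapping frozenset({gene_a, gene_b}) -> dS estimate.

    Returns the set of pairs in which each gene is the other's lowest-dS partner.
    """
    best = {}  # gene -> (lowest dS seen so far, partner achieving it)
    for pair, value in ds.items():
        a, b = tuple(pair)
        for gene, partner in ((a, b), (b, a)):
            if gene not in best or value < best[gene][0]:
                best[gene] = (value, partner)
    # A pair is reciprocal if each gene's best partner points back to it.
    return {
        frozenset((gene, partner))
        for gene, (_, partner) in best.items()
        if best[partner][1] == gene
    }


ds_table = {
    frozenset(("A", "B")): 0.10,
    frozenset(("A", "C")): 0.30,
    frozenset(("B", "C")): 0.20,
    frozenset(("C", "D")): 0.25,
}
rbh = reciprocal_best_hits(ds_table)
```

In this toy table only A and B are reciprocal best hits: C's best partner is B, but B's best partner is A.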

1.3 Estimating expression levels of duplicated genes

Measuring the expression levels of duplicate genes using RNA-seq data can be difficult for

recently duplicated pairs. RNA-seq reads from these genes may map equally well to both copies

(and potentially to other paralogs), making it difficult to get accurate estimation of absolute

expression levels and relative expression levels of the copies.

One standard approach to this problem is via Cufflinks (34,35), which performs probabilistic

assignment of ambiguous reads (through option -u). However, in simulations, we found that the

expression ratios of duplicate pairs estimated by Cufflinks (v2.2.1) often diverged substantially

from the correct values (Figure S1).

To overcome this issue, we developed a bespoke mapping pipeline based on methods for

measuring allele-specific expression (14, 36). The essential idea is that we only consider reads

that map uniquely within the aligned region of one duplicate, and for which reads derived from

the corresponding location in the sister duplicate would also map uniquely to the sister dupli-

cate (Figure S2). We define sites for which all overlapping reads are reciprocally unambiguous

as reciprocally unambiguous sites. The normalized average coverage of all reciprocally unam-

biguous sites of a duplicate gene was then used as a measure of expression level of that gene.

This measure allows us very high confidence in the mapped reads that we use (and hence in

significant expression differences between duplicates), albeit at the cost of not attempting to

make use of ambiguous reads.

Performance assessment. To assess the accuracy of our estimation pipeline, we performed

simulations comparing our new method to Cufflinks. To identify unambiguous sites in duplicate

genes, we simulated 76bp (same read length as the GTEx data) single-end reads starting from

every position of the coding sequences. We then mapped these reads to the human GRCh37

transcriptome using Tophat v2.0.12 with flags “--prefilter-multihits” and “--report-secondary-alignments”.

The maximum number of mismatches was set to 4 and up to 40 alignments for

each read were reported. We defined a pair of aligned nucleotides as reciprocally unambiguous

sites if all 76bp reads covering the nucleotide in both duplicates were uniquely mappable to the

correct gene.
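
The definition above can be sketched in Python. The example assumes that the read simulation has already been summarized as one boolean per read start position (True if the read starting there mapped uniquely to the correct gene), and that aligned positions between the two duplicates are available as index pairs. Both input formats and the function names are assumptions for illustration.

```python
def unambiguous_sites(unique_start, seq_len, read_len=76):
    """Flag each site covered only by uniquely mapped simulated reads.

    unique_start[i] is True if the read starting at position i mapped
    uniquely to the correct gene (hypothetical precomputed input).
    """
    n_starts = seq_len - read_len + 1  # number of full-length reads
    sites = []
    for j in range(seq_len):
        lo = max(0, j - read_len + 1)  # first read start covering site j
        hi = min(j, n_starts - 1)      # last read start covering site j
        sites.append(hi >= lo and all(unique_start[lo:hi + 1]))
    return sites


def reciprocal_sites(aligned_positions, ok_a, ok_b):
    """Keep aligned (i, j) position pairs unambiguous in both duplicates."""
    return [(i, j) for i, j in aligned_positions if ok_a[i] and ok_b[j]]


# Toy example with 3 bp reads on a 6 bp sequence (4 possible read starts).
ok = unambiguous_sites([True, True, False, True], seq_len=6, read_len=3)
```

A single ambiguous read start (position 2 in the toy example) disqualifies every site that read covers, which is why the criterion is conservative.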

For read alignment of the RNA-seq data, we used Tophat2 with the same parameters as in

the simulation except that the maximum number of alignments reported for each read was set to

20. For each RNA-seq experiment, we counted the coverage of all unambiguous sites as defined

above. A gene’s expression level was estimated using the average coverage of all unambiguous

sites in that gene. The expression level was then normalized to the average read depth per

billion reads for each sample based on the effective library size estimated by edgeR (37) and

read length of that sample. Thus, average read depth per billion reads is comparable to FPKM.
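
One plausible reading of this normalization, offered only as a sketch, is to divide the mean coverage by the read length and by the effective library size and then scale to one billion reads. Under that assumption the quantity coincides with FPKM for single-end reads, as the toy numbers show. The function name and argument names are hypothetical, and the effective library size is taken as a given input rather than computed with edgeR.

```python
def depth_per_billion(mean_coverage, effective_library_size, read_length):
    """Average read depth per billion reads (assumed form of the normalization)."""
    return mean_coverage * 1e9 / (effective_library_size * read_length)


# Toy check: 200 single-end 76 bp reads on a 1,000 bp gene, 20 million reads.
reads, read_len, gene_len, lib_size = 200, 76, 1000, 20e6
mean_cov = reads * read_len / gene_len                  # 15.2x mean coverage
fpkm = reads / ((gene_len / 1e3) * (lib_size / 1e6))    # standard FPKM
value = depth_per_billion(mean_cov, lib_size, read_len)
```

Both `value` and `fpkm` come out at approximately 10 here, consistent with the statement that the measure is comparable to FPKM.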

In summary, we found that for all expression levels (FPKM=10, 1 and 0.1), our method

is less noisy than Cufflinks (Figure S1). We suspect that the point estimates from maximum

likelihood estimation may be unstable in this kind of setting when a large fraction of the reads

have ambiguous mappings. Thus, our more conservative approach appears to be much less

noisy.

Classification of expression patterns of sister duplicates. To classify the expression patterns

of duplicate genes we used the following criteria. Gene pairs with no reciprocally unambigu-

ous sites were classified as unmappable. Otherwise, we compared expression between the two

duplicates for each tissue. A gene’s expression was considered to be significantly higher than

its sister gene in the same tissue if the median expression ratio was at least two-fold and the

p-value from a paired t-test on log transformed expression across samples was less than 0.001.

We divided the duplicate gene pairs into three classes according to their expression patterns:

i) Sub-/neofunctionalized pairs, in which each of the two duplicates was significantly more expressed in at least one tissue (Figure 2A, Figure S10A);

ii) Asymmetrically Expressed Duplicates (AEDs), in which one duplicate was expressed higher than its sister gene in at least one third of the tissues where one or both gene(s) were expressed and its expression was not significantly different from its sister gene in other tested tissues (Figure 2B, Figure S10B);

iii) No difference pairs, comprising all other duplicates, i.e., one gene was expressed higher in less than one third of tissues where one or both gene(s) were expressed and the two genes were not expressed significantly differently in other tissues.

In the main text, we also refer to AEDs and No difference pairs as not diverged pairs.
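
The classification rules can be summarized in a short sketch. It assumes that the paired t-test has already been run, so each tissue contributes a median expression ratio (gene A over gene B) and a p-value, and that only tissues in which one or both genes are expressed are passed in. The function names and the +1/-1/0 encoding are hypothetical.

```python
def tissue_call(median_ratio, p_value, fold=2.0, alpha=0.001):
    """Return +1 if gene A is significantly higher, -1 if gene B is, else 0."""
    if p_value >= alpha:
        return 0
    if median_ratio >= fold:
        return 1
    if median_ratio <= 1.0 / fold:
        return -1
    return 0


def classify_pair(calls, n_unambiguous_sites):
    """calls: per-tissue values from tissue_call, expressed tissues only."""
    if n_unambiguous_sites == 0:
        return "unmappable"
    up, down = calls.count(1), calls.count(-1)
    if up and down:
        return "sub-/neofunctionalized"  # each gene is higher somewhere
    if calls and max(up, down) * 3 >= len(calls):
        return "AED"  # one gene is higher in at least one third of tissues
    return "no difference"
```

For instance, a pair with calls `[1, 1, 0, 0, 0, 0]` is an AED (two of six tissues), while `[1, 0, 0, 0]` falls into the No difference class.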

1.4 Phylogenetically-based dating of duplicated genes

Estimating precise ages of duplications is challenging. One approach is to use the sharing of

duplicates across species to provide anchor points for the ages of duplications; however this

approach may be misleading if a duplication has been lost in other lineages or if a gene has

duplicated independently more than once. A second approach is to use synonymous divergence

dS between human copies as a molecular clock; however the dS clock may be downwardly

biased due to nonallelic homologous gene conversion between paralogs (38), especially for

very young proximal duplications. In the main paper we use dS as a measure of duplication

age; we show here that, while noisy, it does provide a useful proxy for time.

To provide a measuring stick for interpreting dS values, we built a tree of synonymous

divergence among singleton protein coding genes (genes with no duplicates) in 9 species, in-

cluding human (Figure S3). We identified orthologs and aligned them using Clustal Omega

and estimated dS using Synonymous Non-synonymous Analysis Program (SNAP) v2.1.1 (39)

(www.hiv.lanl.gov), which applies methods from (40, 41). Duplication and loss of duplicates

after the divergence of species can result in ortholog pairs with dS significantly higher than the

average divergence between the two species. To avoid these cases, we removed genes with

discordant distances. The remaining genes were used to construct a species tree (Figure S3).

The figure indicates twice the human-lineage divergence from key points on the tree; these

values approximate the expected divergence of duplicates that arose at the corresponding

times in the absence of gene conversion.

In summary, as a rough rule of thumb, we can expect that duplicates with dS ∼0.4 are likely

to have occurred around the time of the human-mouse split, while duplicates with dS ∼ 1.0

likely predate the origin of placental mammals.

As a second approach to this problem, we used MrBayes (42) to build gene trees of du-

plicates and their orthologs in different species to identify cases in which both members of a

duplicate pair are shared with outgroups (Figure S4). For human duplicates that arose after the

last common ancestor of humans and species X, there should be no cases in which both dupli-

cates are shared with X (aside from occasional classification errors). For human duplicates that

arose before the split, most should be shared, aside from any that were subsequently lost on

the lineage leading to X. As may be seen in the figure, for duplicates with dS <∼0.35, only a

few are shared with mouse, while for duplicates with dS >∼ 0.45 most are shared with mouse.

This is broadly consistent with the expectations from the singleton gene tree, in which twice

the human-lineage dS since the human-mouse split is 0.45 at singleton genes (and bearing in

mind that gene conversion is likely to reduce observed dS). Sharing of duplicates with opossum

(average dS at singletons= 1.0) is also roughly consistent.

We next extended this phylogenetic approach to infer the most likely internal branches on

which the duplications occurred. We divided the duplicate pairs into 5 groups with 4 break

time points, i.e. the split time of chicken, opossum, mouse, and macaque from the human

lineage. For example, if we found an orthologous duplication in human and macaque but not

in the other species, then we hypothesized that this duplication likely occurred on the branch

between the human-macaque Last Common Ancestor (LCA) and human-mouse LCA. Out of

the 1,194 mappable duplicate pairs, we obtained sensible inferences for 732 pairs (discrepancies

may arise due to parallel losses of duplications, or failing our quality controls for alignment

accuracy, etc.). We then filtered out pairs with dS values that were particularly high or low for

the inferred branch. Such events may indicate parallel gains or losses of duplications. This

filtering step left us with 480 high-confidence duplicate pairs. We then repeated most of the

analyses that supported our main conclusions using the new categorization instead of groups

defined by dS ranges.
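
The branch assignment logic can be sketched as below. This is an illustration under stated assumptions: the input is the set of outgroup species in which both duplicates were found, and any sharing pattern that is not nested along the species tree is treated as "no sensible inference" (returned as `None`). The handling of non-nested patterns and the label strings are assumptions of this example, not a description of the quality controls actually applied.

```python
# Outgroups ordered from the most recent to the most ancient split from human.
OUTGROUPS = ["macaque", "mouse", "opossum", "chicken"]


def duplication_branch(shared_with):
    """Assign a duplication to one of 5 groups from its sharing pattern."""
    shared = [species in shared_with for species in OUTGROUPS]
    k = sum(shared)
    # A nested pattern shares the duplication with the k nearest outgroups only.
    if shared != [True] * k + [False] * (len(OUTGROUPS) - k):
        return None
    if k == 0:
        return "after human-macaque split"
    if k == len(OUTGROUPS):
        return "before human-chicken split"
    return f"between human-{OUTGROUPS[k - 1]} LCA and human-{OUTGROUPS[k]} LCA"
```

With this encoding, a duplication shared only with macaque is placed between the human-macaque LCA and the human-mouse LCA, matching the example given in the text.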

Overall we found that the results from phylogeny-based analyses were consistent with dS-

based analyses (Figures S5–S7). Because our phylogenetic analysis removed a large number of

genes, we reported the results of dS-based analyses in the main text.

2 Supplemental Text

2.1 Basic characteristics of duplicated genes

Three important mechanisms of duplication are whole genome duplication, segmental dupli-

cation and retrotransposition (43–45). It is believed that two rounds of whole genome dupli-

cation occurred in the early evolution of the vertebrates (46, 47); however these events likely

preceded the origin of most duplications considered here. Retrotransposition happens when a

cellular mRNA is reverse transcribed by viral reverse transcriptase, followed by reintegration

of the cDNA into the host genome (48–50). A hallmark of retrotransposition is an intronless

coding region with a poly-A tail. Segmental duplication refers to duplication of large chunks

of genomic DNA. Segmental duplications may arise through unequal crossing-over between

homologous chromosomes (18, 51–53) or by replication slippage, which may occur through a

mechanism called replication Fork Stalling and Template Switching (FoSTeS) (54).

To identify duplicates that likely arose through segmental duplication or retrotransposition,

respectively, we classified duplicate genes as follows. (Our criteria were chosen to provide

confident support for either duplication of the entire genic structure or elimination of multiple

introns.) Gene pairs were inferred to be likely segmental duplications if the following three

criteria were met: 1. Both genes have at least 3 exons; 2. The two genes have less than 20% dif-

ference in the exon numbers; 3. More than 80% of the exon junctions are consistent between the

two genes (within 10bp distance). Pairs were inferred to be likely retrotransposed duplications

if one gene contains only one exon and the other gene contains at least 3 exons. We found that

segmental duplications are much more prevalent (nearly 8 fold) in the human genome compared

to retrotranspositions. Pairs meeting neither set of criteria were left as unclassified, though we

speculate that most of these likely derive from segmental duplications (for example, notice that

most young unclassified duplicates are found in tandem).
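
A minimal sketch of these rules follows. It assumes that exon junction positions for both genes are expressed in shared alignment coordinates, that the 20% difference in exon number is measured relative to the larger exon count, and that the fraction of consistent junctions is measured relative to the gene with more junctions. These details are not specified in the text, and the function name is hypothetical.

```python
def classify_duplication(junctions_a, junctions_b, tol=10):
    """Classify a pair as segmental, retrotransposed, or unclassified.

    junctions_*: exon junction positions in shared alignment coordinates
    (a gene with n junctions has n + 1 exons).
    """
    exons_a, exons_b = len(junctions_a) + 1, len(junctions_b) + 1
    if min(exons_a, exons_b) == 1 and max(exons_a, exons_b) >= 3:
        return "retrotransposed"
    if min(exons_a, exons_b) >= 3:
        diff = abs(exons_a - exons_b) / max(exons_a, exons_b)
        # Count junctions of gene A with a junction of gene B within tol bp.
        matched = sum(
            any(abs(x - y) <= tol for y in junctions_b) for x in junctions_a
        )
        consistent = matched / max(len(junctions_a), len(junctions_b))
        if diff < 0.2 and consistent > 0.8:
            return "segmental"
    return "unclassified"
```

A single-exon gene paired with a two-exon gene meets neither set of criteria and is left unclassified, as described above.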

For reciprocal best-hit pairs, 963 are segmental and 86 are retrotransposed duplicates (Fig-

ure S8). We used dS between sister duplicates as a molecular clock and divided the pairs into

different age of duplication groups. As observed previously, there is a marked peak of very

young duplications (low dS), likely because most duplications are relatively short-lived over

evolutionary time (55–57).

As expected, most retrotransposed genes are found on different chromosomes. In contrast,

young segmentally duplicated pairs tend to be found close together in the genome; these seem

to be gradually translocated to different chromosomes as genome rearrangements occur (Fig-

ure S8A). The overall patterns are similar in mouse (Figure S9).

2.2 Patterns of gene expression in human and mouse tissues

In this section we expand on the expression analyses presented in the main text. Figure S10

presents an expanded version of Main Figure 1, showing the same genes in more tissues. Fig-

ure S11 provides an expanded version of Main Figure 2A, illustrating expression patterns across

the full set of GTEx tissues. To minimize the impact of outliers, we used the median expression

level of all individuals in a tissue for calculation of the log ratio in Figure 2, Figure S11 and

Figure S12. To avoid zero counts in the log ratios, we added a pseudocount of 0.5 to all

expression counts. For plotting purposes, the log ratio was set to 0 if the Student’s

t-test showed no significant difference (p > 0.001) in expression level between sister duplicates.

A gene was defined to be “major” if the gene was significantly more highly expressed more of-

ten than its sister gene. If the two genes were expressed significantly higher in an equal number

of tissues, the one with higher mean expression in tissues was defined as the major gene.
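
The plotted quantity can be sketched as follows, assuming a base-2 logarithm and a major-over-minor orientation of the ratio (neither is stated explicitly in the text). The inputs are the median expression levels of the two genes in one tissue and the p-value of the t-test for that tissue; the function name is hypothetical.

```python
import math


def plotted_log_ratio(expr_major, expr_minor, p_value, pseudo=0.5, alpha=0.001):
    """Log ratio with a 0.5 pseudocount, set to 0 when not significant."""
    if p_value > alpha:
        return 0.0
    return math.log2((expr_major + pseudo) / (expr_minor + pseudo))
```

For example, median levels of 7.5 and 1.5 with a significant test give log2(8 / 2) = 2, and the pseudocount keeps the ratio finite when one gene has zero counts.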

A notable pattern in Figure 2A and Figure S11B is that regardless of functional category,

most duplicate pairs show asymmetric expression: for dS < 0.1, minor genes are expressed at

a median level of 40.5% of major genes across all tissues, and for 0.1 < dS < 0.8 minor genes

are expressed at just 33.5% of major genes. Here “minor” genes are defined as the genes with

lower median expression, so even with no systematic asymmetry we would expect this ratio to

be < 1. We therefore conducted permutations in which we randomly flipped the expression of

major and minor genes in each tissue. We defined major and minor genes and plotted the ratios

in the same way as for the real data. The average median expression ratio in the permuted data

was 83.3%, thus indicating that the duplicates, as a group, show significant levels of asymmetric

expression.
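
The permutation procedure can be sketched in Python. This is a simplified illustration: each pair is represented by two lists of per-tissue expression values, labels are swapped independently in each tissue with probability 0.5, and the minor gene is redefined after each permutation as the one with the lower median expression. The function names, the number of permutations, and the seed are assumptions of this example.

```python
import random
from statistics import median


def minor_major_ratio(expr_a, expr_b):
    """Ratio of the lower to the higher median expression across tissues."""
    lo, hi = sorted((median(expr_a), median(expr_b)))
    return lo / hi if hi > 0 else float("nan")


def permuted_ratio(expr_a, expr_b, rng):
    """Swap the two genes' values independently in each tissue, then recompute."""
    perm_a, perm_b = [], []
    for a, b in zip(expr_a, expr_b):
        if rng.random() < 0.5:
            a, b = b, a
        perm_a.append(a)
        perm_b.append(b)
    return minor_major_ratio(perm_a, perm_b)


def mean_permuted_median(pairs, n_perm=200, seed=0):
    """Average, over permutations, of the median ratio across pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_perm):
        ratios = [permuted_ratio(a, b, rng) for a, b in pairs]
        total += median(ratios)
    return total / n_perm
```

Comparing the observed median ratio with the value returned by `mean_permuted_median` gives the kind of contrast reported above (observed ratios well below the permuted expectation).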

We performed additional analyses to explore further aspects of our results, as well as their

robustness. We found that retrotransposed genes are much more likely to be asymmetrically

expressed than segmental duplications (Figure S12). This probably reflects the fact that retro-

transposed genes are likely to land in genomic locations without suitable regulatory elements.

We were also curious about whether expression breadth might affect the probability of sub-

functionalization. Specifically, we hypothesized that genes with narrow expression might have

higher regulatory complexity, and thus subfunctionalize at higher rates. In fact, if anything it

seems that the reverse is true, as narrowly expressed genes have relatively lower rates of sub-

functionalization (Figure S13).

We also investigated whether sampling more developmental stages might increase the evi-

dence for subfunctionalization. We applied the same procedures to survey the divergence pat-

tern of duplicate genes in mouse using RNA-seq data from 26 tissues, including 3 fetal tissues:

embryonic brain, placenta and yolk sac (12) (Figure S14). The result in mice is consistent with

our observations in humans, showing slow rates of sub-/neofunctionalization.

It is also worth noting that our results on selective constraint and disease associations

strongly support the inference that the class of genes that we have classified here as minor

AED genes are functionally less important and less constrained. This point would remain true

even if it were shown in future that they had increased expression in some unsampled tissue.

2.3 Expression differences between humans and other species

In the main paper we argue that the data do not support a model of duplicate preservation

through expression sub- or neofunctionalization. Instead, we suggest that the data are more

consistent with a model of preservation through dosage sharing. In this model, the two dupli-

cates combine to achieve the required expression level. One simple prediction of this model is

that, in the absence of changes in optimal expression level, we might expect the summed expres-

sion level of a pair of duplicate genes to be similar to the expression of singleton orthologs in

outgroup species (and hence the expression levels of the individual duplicates should generally

be lower than the corresponding singletons). Of course this prediction should not hold precisely

for all genes due to changes in optimal expression levels (and indeed such changes may help to

enable duplicate fixation in the first place).

To test this, we analyzed RNA-seq data of 6 tissues in three species, namely human, macaque

and mouse (24). Specifically, we first searched for duplicates in human with only one ortholog

in macaque (or separately, in mouse, for the mouse comparison). To increase the likelihood that

the duplication event is human lineage specific, we filtered out human duplicates with dS larger

than 0.1 and 0.5 respectively before comparing to singleton orthologs of macaque and mouse.

The expression values were normalized in the same way as the GTEx RNA-seq data. Next,

to compare the expression of human genes to macaque/mouse genes, we adjusted the expres-

sion values based on a linear regression model using genes that are singletons in both species.

The adjustment results in a mean expression ratio of 1:1 between orthologous genes that are

singletons in both species (Figure 4A). Using this analysis, we find that the expression of indi-

vidual duplicates in human is significantly lower than that of their singleton orthologs in both

macaque (p=1.5 × 10^-7, t-test) (Figure 4D) and mouse (p=1.2 × 10^-7, t-test) (Figure S15B).

The median summed expression of duplicates is very close to the expression of the singleton

orthologs in macaque and mouse (median expression ratio is 1.11 for both macaque and mouse

orthologs; these are significantly less than a 2:1 expression ratio, p=7.6 × 10^-6 for macaque and

p=5.5 × 10^-10 for mouse, t-test) (Figure S15).

The median dS of the duplicates that arose on the human branch since the human-macaque

split is ∼0.05. Down-regulation of these duplicates indicates that dosage-sharing evolved

quickly compared to sub-/neofunctionalization in expression. We next examined whether some

duplicates might be sub-/neofunctionalized at the protein level.

Three lines of evidence imply that sub-/neofunctionalization at the protein level also evolves

slowly compared to dosage-sharing, as follows. (All of these data below refer to a set of 27

duplicate pairs that arose on the human lineage since the human-macaque split and pass other

filters including for read mappability.)

(1) Figure S16: We see no evidence for adaptive protein evolution within this set of young

duplicates. The nonsynonymous divergence (dN) between the two human copies is less than or

equal to the synonymous divergence (dS) for all 27 pairs. Moreover, in absolute terms, the

amount of nonsynonymous divergence in this set is very low: the median divergence is just ∼2%.

(2) Figure S17: Dosage sharing appears to evolve very rapidly, and shows no

relationship with the amount of nonsynonymous divergence between the copies. In this analysis,

we expanded the number of genes that we could analyze. Since we are interested in the summed

expression, we relaxed our read mapping criteria to allow reads that cannot be uniquely assigned

between the two copies, but that are unique with respect to the rest of the genome.

In this analysis the average ratio of summed expression of duplicates to their ortholog is

∼1.2, far lower than the 2-fold that would be expected from doubling copy number (Figure S17).

Even pairs with identical protein sequences (dN=0) are downregulated compared to their parent

genes suggesting the sharing of expression evolves quickly after duplication and likely precedes

the divergence of protein function.

(3) Figure S18: If these genes were already significantly sub-/neofunctionalized at the pro-

tein level, then we would expect them to show significant levels of selective constraint within

humans. However the average conservation is very low. In this set, we observed more than twice

as many common missense variants in human polymorphism data as the average for singleton

genes (29.6% vs 14.4%). (Note that the gene-level estimates in this analysis are noisy due to

small numbers of segregating sites.) This indicates that these young duplicates are function-

ally redundant at the protein level, and thus within-species selective constraint against missense

mutations is much weaker than for typical genes.

2.4 Rapid downregulation of expression in duplicates.

If expression downregulation plays an important role in preserving duplicates, then we would

expect the expression reduction to occur relatively quickly, at least on a similar timescale to the

rate of gene loss by nonsense mutations. To explore the speed of downregulation, we took the set

of duplicates that occurred on the human lineage since the human-macaque split, and examined

their expression, relative to macaque, as a function of age (measured by dS) (Figure S19A). As

in Figure S17, we included duplicates that are not separately mappable but are distinct from

other genes.

This analysis shows downregulation among even the youngest fixed duplicates to close to the

level of macaque orthologs. This suggests that expression reduction occurs rapidly, as opposed

to a gradual decrease in expression over time (Figure S19A). Downregulation could occur

through substitutions that affect regulation (e.g., by weakening promoters), but it could also

occur through nongenetic processes such as expression buffering via feedback mechanisms on

transcription or mRNA turnover. To test whether there may be an effect of expression buffering

on new duplicates, we examined the expression of genes with copy number variation in the

human population. This analysis was restricted to polymorphic duplications with two entire

copies of the relevant genes, or to whole-gene deletions. Using genotype and gene expression

data for a subset of 1000 Genomes individuals (21, 58), we showed that gene expression in

individuals with atypical copy numbers (1, 3 or 4 copies) was closer to diploid expression than

predicted by an additive effect of copy number (Figure S19B). This suggests either widespread

partial buffering of duplicates, or that duplications with reduced expression of one or both copies

are more likely to be polymorphic. We speculate that this moderate buffering may help enable

the fixation of duplicates by alleviating dosage imbalance caused by duplication. It is also likely

that duplicates with relatively stronger buffering may be more likely to fix. Following fixation,

there may be substitution of additional expression-reducing mutations to further decrease the

expression of duplicates and thus enable their survival.
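
The comparison against an additive copy-number model can be illustrated with a small sketch. This is not the authors' pipeline: the function names and the example value are hypothetical, and expression is assumed to be expressed relative to the mean of diploid (2-copy) individuals.

```python
def additive_expectation(copy_number, diploid_copies=2):
    """Expected expression relative to diploid if expression scales with copy number."""
    return copy_number / diploid_copies


def buffering_fraction(observed_ratio, copy_number):
    """Fraction of the additive deviation from diploid expression that is missing.

    0 means fully additive; 1 means expression equals the diploid level.
    """
    expected = additive_expectation(copy_number)
    if expected == 1.0:
        raise ValueError("copy_number must differ from the diploid state")
    return 1.0 - (observed_ratio - 1.0) / (expected - 1.0)


# Hypothetical individual with 3 copies and expression 1.3x the diploid mean.
print(additive_expectation(3))                # additive prediction: 1.5
print(round(buffering_fraction(1.3, 3), 2))   # 0.4, i.e. partially buffered
```

Under this convention, values between 0 and 1 correspond to the partial buffering described above.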

2.5 Effect of expression patterns on disease burden.

To assess the functional significance of the observed expression patterns, we obtained gene-

disease associations from the Disease and Gene Annotations database (DGA,

http://dga.nubic.northwestern.edu) (16). DGA provides annotations of the human genes in the

context of diseases by integrating data from diverse sources including Disease Ontology (DO),

NCBI Gene Reference Into Function (GeneRIF) and Molecular Interaction Network (MIN).

It includes a diverse set of disease information, including Mendelian disease, cancer data and

GWAS. 671 out of the 1,194 pairs of duplicates have at least one disease annotation in the

database. At the present time, disease annotations are clearly incomplete and may be inaccurate.

However, our key analyses involve comparisons of different types of genes based on expression

profiles, and we expect that classification errors should be uncorrelated with our expression-

based classifications.

We considered two generalized linear models (Poisson model with log link function). In the

first model, the response variable was the number of diseases associated with the minor gene

and the predictors and results are shown in Table S2. In the second model, the response variable

was the number of minor gene-specific diseases (Table S3).
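
A Poisson model with log link can be sketched as follows. This is a minimal single-predictor illustration fitted by Newton-Raphson, not the multi-predictor models of Tables S2 and S3; the function names and the toy data are hypothetical.

```python
import math


def _information(x, mu):
    """Entries of the 2 x 2 Fisher information matrix for (intercept, slope)."""
    i00 = sum(mu)
    i01 = sum(m * xi for xi, m in zip(x, mu))
    i11 = sum(m * xi * xi for xi, m in zip(x, mu))
    return i00, i01, i11


def fit_poisson_glm(x, y, n_iter=50, tol=1e-10):
    """Fit log E[y] = b0 + b1 * x; return (b0, b1, Wald z statistic for b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for xi, yi, m in zip(x, y, mu))
        i00, i01, i11 = _information(x, mu)
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} g.
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard error of b1 from the inverse information at the estimate.
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    i00, i01, i11 = _information(x, mu)
    se_b1 = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    return b0, b1, b1 / se_b1


# Toy data: x = minor/major expression ratio, y = disease count (hypothetical).
x = [0.1, 0.2, 0.4, 0.5, 0.8, 0.9, 1.0]
y = [0, 1, 1, 2, 3, 2, 4]
b0, b1, z = fit_poisson_glm(x, y)
print(f"b0={b0:.3f}  b1={b1:.3f}  Wald z={z:.2f}")
```

The Wald statistic is the coefficient divided by its standard error; a p value would follow from comparing it with a standard normal distribution.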

The results show several interesting features:

• Mean expression ratio of minor/major gene. Lower relative expression of the minor

gene is associated with lower minor gene-total disease counts ($p = 8 \times 10^{-7}$, Wald test) and

(to a lesser extent) lower minor-specific counts ($p = 2 \times 10^{-3}$, Wald test). This is consistent

with our expectation that expression asymmetry reduces functional importance of minor

genes.

• Proportion of tissues where the minor gene is expressed high. We use this as a measure

of the extent of sub-/neofunctionalization; it is positively correlated with both minor

gene-total and minor gene-specific disease counts ($p = 4 \times 10^{-13}$ and $p = 5 \times 10^{-12}$,

respectively, Wald test). These results support the prediction that subfunctionalization of

expression promotes nonredundancy. It is interesting that the minor-specific count is only

slightly more significant than the minor-total count, perhaps suggesting that this measure

of expression divergence does not reflect much true neofunctionalization.

• Synonymous divergence between duplicates. dS is positively associated with both

measures: minor gene-total ($p = 9 \times 10^{-4}$, Wald test) and minor gene-specific ($p = 9 \times 10^{-3}$,

Wald test). This is consistent with our observations that older duplicates are under in-

creased evolutionary constraint, even if they are asymmetrically expressed.

• Nonsynonymous divergence between duplicates. dN is not associated with either

measure. This may reflect that the divergence of duplicates in protein space is due to relaxed

selective constraint caused by functional redundancy.

2.6 Differential splicing in duplicates

Our main paper focuses on the possibility of sub-/neofunctionalization of whole-gene

expression. Another possibility, however, would be subfunctionalization by differential splicing or

isoform usage between the duplicates (59–61). In fact, differential isoform usage has been

previously reported between duplicated genes (62–65).

To examine the prevalence of differential splicing between duplicates, we compared the

expression of homologous exons using the same procedure as we have applied when comparing

gene level expression discussed in section 3. We used a p-value cutoff of 0.0001 for the t-

test instead of 0.001 because the number of tests at the exon level was about 10 times that at

the gene level. Of 1,194 mappable pairs of duplicates, 359 (30%) have at least one pair of

homologous exons that are differentially spliced in at least one of the 46 tissues. Of these 359

pairs, 195 (54%) were already classified as potentially subfunctionalized on the basis of whole

gene expression data. Figure S20 shows the distribution of potential subfunctionalization of

duplicates by differential splicing (orange) in addition to subfunctionalization at the gene level

(blue). Among pairs with dS < 0.7, 15.2% were classified as potentially subfunctionalized

based on whole-gene expression. An additional 10.7% show evidence for differential splicing,

suggesting at most a modest contribution from differential splicing.

We next wanted to test whether differential splicing had any effect on disease risk. It has

been shown that most alternative splicing differences are not conserved between species and

are therefore likely to be neutral (66–70), and thus we conjectured that the differential splicing

events observed here may not be functionally important.

To address this question, we revisited the regression model for disease burden developed

in the previous section. Recall that duplicates with stronger evidence for subfunctionalization

were associated with higher numbers of diseases (Table S2, Table S3). Using the same basic

model, we added an indicator variable for presence/absence of differential splicing as an addi-

tional predictor. If these pairs were truly subfunctionalized, we would expect disease burden

to be positively correlated with evidence for differential splicing. Instead, however, differential

splicing is negatively correlated with disease risk.

For the multiple regression on the total number of diseases associated with the minor gene, the

coefficient of the new predictor is -0.35 ($p = 4 \times 10^{-11}$, Wald test). For the multiple regression on

the number of diseases associated with the minor gene only, the coefficient of the new predictor

is -0.18 ($p = 4 \times 10^{-3}$, Wald test). These findings suggest that differential splicing is more often an

indication of low selective constraint on the duplicates than truly divergent function. Together

these results argue that changes in isoform usage are not a primary driver of subfunctionalization

of mammalian duplicates.
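
To make the size of these coefficients concrete, the sketch below converts them to multiplicative effects. It assumes the coefficients are reported on the log scale of the Poisson model with log link described in Section 2.5; the helper name `rate_ratio` is ours.

```python
import math


def rate_ratio(beta):
    """Multiplicative change in the expected count per unit change of a predictor."""
    return math.exp(beta)


for label, beta in [("total diseases", -0.35), ("minor-specific diseases", -0.18)]:
    print(f"{label}: exp({beta}) = {rate_ratio(beta):.2f}")
# total diseases: exp(-0.35) = 0.70
# minor-specific diseases: exp(-0.18) = 0.84
```

Under that assumption, pairs with differential splicing would have roughly 30% and 16% fewer expected disease annotations, respectively, with the other predictors held fixed.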

2.7 Long-term selective constraint on duplicated genes (dN/dS)

The ratio of the nonsynonymous substitution rate (dN) to the synonymous substitution rate (dS),

dN/dS, is a classic measure of selection and constraint on protein sequences. dN/dS < 1 is

a hallmark of protein coding constraint. In the absence of advantageous mutations and if all

synonymous mutations are neutral, then 1 − dN/dS estimates the fraction of nonsynonymous

mutations that are deleterious. Meanwhile, if some substitutions are advantageous, this will

increase the value of dN/dS, in rare cases pushing it above 1. Note that dN/dS estimates are

noisy for individual genes when dS is small.
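
As a worked example of the 1 − dN/dS interpretation, consider the sketch below. The function name and the numbers are hypothetical, and the calculation is only meaningful under the stated assumptions (no advantageous mutations, neutral synonymous sites).

```python
def fraction_deleterious(dn, ds):
    """Estimate the fraction of nonsynonymous mutations that are deleterious."""
    if ds <= 0:
        raise ValueError("dS must be positive; the ratio is undefined otherwise")
    return 1.0 - dn / ds


# Hypothetical duplicate with dN = 0.02 and dS = 0.10.
print(round(fraction_deleterious(0.02, 0.10), 2))  # 0.8
```

Because dS is in the denominator, small absolute errors in dN or dS produce large swings in this estimate when dS is small, which is the source of the noise noted above.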

To estimate dN/dS on each duplicate separately, we used MrBayes to build trees containing

the human duplicates plus orthologs from 8 other species to provide information about the

ancestral sequence at the point of duplication (see Figure S3). We then used PAML to estimate

branch lengths separately for each duplicate back to the duplication point on the inferred tree.

In more detail, we performed multiple sequence alignment among the pair and its orthologs

using MACSE. These alignments were then used as input for MrBayes v3.2.2 (42, 71, 72) to

build gene trees. For input to MrBayes, we predefined an outgroup gene for the tree, namely the

ortholog with the greatest dS to the two duplicates. Duplicates with fewer than three orthologs across

all 8 species were excluded from the analysis due to uncertainty in the tree. We also excluded

pairs with dS > 1.57 between the two human duplicates, as they are likely to have diverged

prior to the divergence of the human and chicken lineages, meaning that we would not have a true

outgroup within this set of species. Parameters for MrBayes were set as follows:

nst=6 Nucmodel=Codon omegavar=ny98

mcmc ngen=100000 samplefreq=1000 printfreq=1000 diagnfreq=1000

The software PAML v4.8 (73, 74) was used to estimate nonsynonymous and synonymous

divergence for each branch in the gene tree generated by MrBayes. dN and dS for each human

duplicate were calculated by summing branch lengths from the present day to the node ancestral

to the two duplicates. Parameters for PAML were set as follows:

seqtype=1 model=1 NSsites=0
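
The branch-length summation can be sketched as follows. This is an illustration only: it assumes per-branch dN or dS estimates have already been extracted from the PAML output into plain dictionaries, and the node names, tree shape and values are hypothetical.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def divergence_since_duplication(dup_a, dup_b, parent, branch_len):
    """Sum branch lengths from each duplicate up to their common ancestor."""
    ancestors_b = set(path_to_root(dup_b, parent))
    # First node on dup_a's path that is also an ancestor of dup_b.
    mrca = next(n for n in path_to_root(dup_a, parent) if n in ancestors_b)

    def summed(tip):
        total, node = 0.0, tip
        while node != mrca:
            total += branch_len[node]  # length of the branch leading to `node`
            node = parent[node]
        return total

    return summed(dup_a), summed(dup_b)


# Hypothetical gene tree: the duplication node "dup" gives rise to humanA
# (via an internal node shared with a chimpanzee ortholog) and humanB.
parent = {"humanA": "n1", "chimpA": "n1", "n1": "dup",
          "humanB": "dup", "dup": "root", "outgroup": "root"}
ds_branch = {"humanA": 0.01, "chimpA": 0.012, "n1": 0.05,
             "humanB": 0.07, "dup": 0.30, "outgroup": 0.90}
print(divergence_since_duplication("humanA", "humanB", parent, ds_branch))
```

The same function can be called with a dictionary of per-branch dN values to obtain dN for each copy.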

2.8 Current selective constraint on duplicated genes (SFS in humans)

A second tool for studying selective constraint comes from examining the site frequency spec-

trum (SFS) in polymorphism data. Classes of sites that are under selective constraint tend to

be enriched for lower frequency variants (e.g., nonsynonymous variants generally have lower

mean frequency than synonymous variants). Unlike dN/dS, which effectively averages selection

pressures since the duplicate genes diverged, the SFS reflects current selective pressures (within

the past $10^4$–$10^6$ years). Compared with dN/dS, the SFS is also less confounded by

advantageous mutations, since it is generally assumed that these contribute little to patterns of diversity

within species, for most genes. However, since there may be modest numbers of SNVs per gene,

this measure is noisy at the level of individual genes, and hence we will only report SFS results

for classes of genes.

We used data from 6,515 individuals in the Exome Sequencing Project (15) (Exome Variant

Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/)

[data downloaded September 2014]). To polarize alleles into ancestral and derived, we aligned

each SNP to the corresponding positions in the chimpanzee and gorilla genomes using the

liftOver tool (75). An allele was then defined as ancestral if it was observed in either

chimpanzee or gorilla, while the second allele matched neither chimpanzee nor gorilla. Polymorphic

sites not meeting these criteria were removed. For categorizing mutations as synonymous, non-

synonymous, nonsense (and other categories not used here), we used annotations provided by

ESP. In principle, we might worry about the ability to identify SNVs in the very youngest duplicates;

however, any biases should be shared across major and minor genes, and thus should not confound our

major conclusions.
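
The polarization rule described above can be written as a short function. This is a sketch under our reading of the rule; the function name and the use of `None` for a missing outgroup base are assumptions.

```python
def polarize(allele_1, allele_2, chimp_base, gorilla_base):
    """Return (ancestral, derived), or None if the site cannot be polarized.

    An allele is called ancestral when it matches chimpanzee or gorilla and
    the other allele matches neither outgroup.
    """
    outgroup = {b for b in (chimp_base, gorilla_base) if b is not None}
    in_1 = allele_1 in outgroup
    in_2 = allele_2 in outgroup
    if in_1 and not in_2:
        return allele_1, allele_2
    if in_2 and not in_1:
        return allele_2, allele_1
    return None  # both or neither allele seen in the outgroups: site removed


print(polarize("A", "G", "A", "A"))   # ('A', 'G')
print(polarize("A", "G", "A", "G"))   # None (ambiguous, site removed)
```

Sites returning `None` correspond to the polymorphic sites that were removed.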

Figure S22 illustrates the SFS for duplicate genes as a function of age. Notice that the

fraction of rare SNVs increases steadily with the age of the duplicates, indicating that older

duplicates tend to be under much stronger evolutionary constraint. The oldest duplicates are

in fact more constrained, on average, than singleton genes. This may reflect a stabilization of

nonredundant functions in the oldest duplicates and, in some cases, enrichment of conserved

developmental genes among the oldest duplicates (76).

Figure S23 shows a comparison for AED genes of synonymous vs nonsynonymous (dashed

vs solid) variants in minor vs major genes (red vs blue). The youngest AEDs seem to be under

relatively low selective constraint overall (i.e., little difference between synonymous and nonsynonymous

spectra), while older AEDs show strong differences between synonymous and nonsynonymous

sites. For all ages the nonsynonymous sites in major genes have more rare variants than non-

synonymous sites in minor genes, although the magnitude of the effect varies across ages. (For

some of the categories, we see the same effect at synonymous sites; selection at synonymous

sites may be due to functional elements such as splice enhancers.) We also observed higher

densities of nonsense polymorphic sites in minor genes. In summary, these spectra suggest that

most AEDs take relatively long times to become strongly constrained (i.e., dS > 0.5), and that

minor genes tend to enjoy reduced constraint at all ages.

2.9 Effect of translocation on regulatory divergence of duplicates

Figure S8A suggests that most duplicate pairs arise in tandem, and that they are gradually

separated in the genome by translocation. In the main text we reported that, controlling for dS ,

the pairs on different chromosomes are more likely to have divergent expression patterns and

to be classified as potentially sub-/neofunctionalized (Figure 3, Figure 5A, Figure S24). This

result is robust to alternative criteria for sub-/neofunctionalization, for example, requiring the p

value of a t-test to be less than 0.05, with no fold change cutoff (Figure S24D).

We also observed that very similar results hold at the level of histone marks (Figure S25).

To produce this plot, we used histone ChIP-seq data from 25 human developmental and adult

tissues collected by the Roadmap Epigenomics Project (26, 27). We grouped

histone modifications into two categories: Transcription Start Site (TSS) histone marks,

such as H3K4me1, H3K4me3, H3K9ac, and H3K27ac, and gene body histone marks, such as

H3K36me3, H3K27me3, and H3K9me3. TSS histone modification levels were measured by

dividing the read density (number of reads per base pair) at the promoter region (1 kb upstream and

downstream of the TSS) by the read density in the background genomic DNA sequencing data

(input). The whole gene body region was used instead of the promoter region for calculating

gene body histone modification levels. Read densities were normalized by the total number of

reads for each sample and standardized across all samples of each mark before we calculated

the correlation between sister duplicates.
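
The histone signal calculation can be sketched as follows. The per-million library scaling, the use of the population standard deviation, and all names and numbers are assumptions made for illustration; the text above only specifies normalization by total reads and standardization across samples.

```python
from statistics import mean, pstdev


def read_density(read_count, region_length_bp, total_reads):
    """Reads per base pair, scaled by library size (per million reads)."""
    return (read_count / region_length_bp) / (total_reads / 1e6)


def enrichment(chip_count, chip_total, input_count, input_total, region_length_bp):
    """ChIP read density divided by input (background) read density."""
    chip = read_density(chip_count, region_length_bp, chip_total)
    background = read_density(input_count, region_length_bp, input_total)
    return chip / background


def standardize(values):
    """Z-score a list of enrichment values across samples of one mark."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical promoter: TSS +/- 1 kb = 2,000 bp.
print(round(enrichment(400, 20_000_000, 50, 25_000_000, 2000), 2))  # 10.0
print([round(z, 2) for z in standardize([1.0, 2.0, 3.0])])
```

For gene body marks, `region_length_bp` would be the length of the whole gene body rather than the 2 kb promoter window.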

Co-expression of neighbor duplicates. It is known that nearby genes tend to have correlated

expression patterns (19); however, the mechanistic causes are not well understood. These may

include sharing of regulatory enhancers and the fact that nearby genes may lie in the same

co-regulated chromatin domains.

We tested the effect of genomic separation on expression correlation using a multiple regres-

sion model which includes age (dS) as a covariate (Table S4). This model shows that genomic

separation has a significant effect on expression correlation: mean effect on correlation = -0.36;

$p = 3 \times 10^{-30}$ (Wald test; Table S4, Figure S26A,B). It is also noteworthy that we see a

significant (albeit weaker) effect of dS on expression correlation for tandem duplicates, but no effect

at all for separated duplicates. This is consistent with the idea that breaking synteny between

duplicates radically alters their gene regulation.

We also observe a concordant, though much weaker, effect for physical distance between

duplicates that are on the same chromosome (p = 0.003, Wald test; Figure S26C,D), suggesting

that co-regulation may be weaker for tandem duplicates that are far apart than for those that are

close together.

Figure 3C illustrates the distribution of expression correlations for both singleton neighbors

and duplicate neighbors. Both of these tend to be more correlated than unlinked singletons (data

not shown for figure simplicity), and unlinked duplicates, respectively, highlighting the role of

genomic proximity in co-regulation. Both linked and unlinked duplicates are more correlated

than linked and unlinked singletons, respectively, presumably due to similarity of regulatory

sequences of duplicates.

Effect of promoter divergence on expression correlation. As an alternative to using

synonymous divergence as a molecular clock, we also experimented with using promoter

divergence.

To quantify promoter divergence, we first attempted to align the promoter regions of du-

plicates using a fixed region around the annotated transcript start site (TSS) of each gene

as a putative promoter. However, we found that many promoter pairs from this simple

approach were unalignable (by Clustal or BLASTN). We reasoned this might be because the

position of a promoter can vary among duplicates and the actual boundaries of promoters are not

well defined. To better align the promoters of duplicates, we used promoter annotations from

chromHMM (77). To maximize the promoters we could find, we pooled together chromHMM

annotations of 9 cell lines available in the ENCODE repository. We then considered the closest

promoter to the TSS of each duplicate gene as its promoter, provided that it is within 5 kb of the

gene. We performed sequence alignment using Clustal and defined the divergence of promoters

as the number of mismatches divided by the total length of aligned sequences. To reduce the

effect of spurious alignment by Clustal, which tends to maximize the number of matches, we

only took into account regions with more than 100 bp of continuously aligned sequence.
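
The divergence measure can be sketched as below. This assumes that "continuous aligned sequence" means a gap-free run of alignment columns and that the two aligned promoters are given as equal-length strings with `-` for gaps; the function name and the test sequences are hypothetical.

```python
def promoter_divergence(aln_a, aln_b, min_block=100):
    """Mismatches / aligned length over gap-free blocks longer than min_block."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = aligned = 0
    block_len = block_mm = 0
    # A sentinel gap column is appended so the final block is also evaluated.
    for a, b in zip(aln_a + "-", aln_b + "-"):
        if a == "-" or b == "-":
            if block_len > min_block:
                aligned += block_len
                mismatches += block_mm
            block_len = block_mm = 0
        else:
            block_len += 1
            block_mm += int(a != b)
    return mismatches / aligned if aligned else None


# 150 gap-free columns with 10 mismatches, a 5-column gap, then a 50-column
# block that is too short to be counted.
seq_a = "A" * 150 + "-" * 5 + "C" * 50
seq_b = "A" * 140 + "G" * 10 + "A" * 5 + "C" * 50
print(round(promoter_divergence(seq_a, seq_b), 4))  # 0.0667
```

Returning `None` when no block passes the length filter mirrors the treatment of unalignable promoter pairs.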

From a starting set of 1,194 mappable pairs, we identified promoters for both duplicates of

791 pairs. Of these, 206 were not alignable, leaving 585 duplicate pairs with aligned promoters.

As expected, there is a strong correlation between synonymous divergence in the coding region

and promoter divergence ($\rho = 0.83$, $p < 2.2 \times 10^{-16}$, Figure S27). We then tested the effect of

genomic proximity of duplicates on their expression correlation, controlling for promoter divergence,

using a generalized linear model. The categorical variable for duplicates in tandem vs separated

in the genome again shows a highly significant effect on the expression correlation of duplicates

($p = 4 \times 10^{-9}$, Wald test; Table S5).

Expression correlation of duplicates with discordant genomic proximity in human and

mouse. To better control for confounding factors, we searched for duplicate pairs that are in

tandem in human and on different chromosomes in mouse, or vice versa (Figure 3C). We then

compared the expression correlation between the species where the pair is in tandem, vs the

species where the pair is separated using matched mouse and human expression data in 6 tis-

sues (24).

We identified 12 pairs of duplicates that are in tandem in human but separated in mouse and

15 pairs that are in tandem in mouse but separated in human. Although the sample size is

small and the data are noisier due to the smaller number of tissues (6 vs 46), we observe a

significant signal in the expected direction: the separated duplicates are less correlated than

the tandem duplicates (p = 0.03, one-sided paired t test; Figure 3C). Moreover, the magnitudes

of the correlations are similar to those seen in GTEx data for duplicates of similar age.

Shared regulation of neighbor duplicates. One intuitive explanation for the increased co-

expression of duplicate neighbors compared to singletons is that duplicates may be more likely

to share regulatory elements. As one test of this, we examined the rate of shared eQTLs between

singletons and duplicates in three different studies. As a proxy for mappability, we required that

both genes in a pair have at least one eQTL and that dS between the genes was larger than 0.1.

For the set of eQTLs identified using the Geuvadis data (21), 19 out of 25 pairs of duplicate

neighbors have common eQTLs, while 158 out of 394 pairs of singleton neighbors share eQTLs.

The odds ratio is 4.71. Fisher’s exact test yields a p value of $5.9 \times 10^{-4}$. Similarly, for the set

of eQTLs identified by Battle et al. (20), 14 out of 91 pairs of duplicate neighbors share their

best eQTLs, while 195 out of 3,544 pairs of singleton neighbors share their best eQTLs. The

odds ratio is 3.12, and Fisher’s exact test yields a p value of $5.6 \times 10^{-4}$. For the eQTLs identified

by the GTEx consortium (78), the odds ratio is 2.82 and the p value is 0.02. For this analysis

and those that follow, we confirmed that the distributions of genomic distances between pairs

of duplicates and singletons are similar.
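
The enrichment calculation for shared eQTLs can be reproduced approximately with the sketch below, which computes the sample odds ratio and a two-sided Fisher's exact p value from a 2 x 2 table. The function names are ours, and the two-sided convention (summing all tables no more probable than the observed one) is an assumption about how the reported p values were obtained. The sample odds ratio can differ slightly from a conditional maximum-likelihood estimate, which may explain small differences from the reported values.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at most as probable as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))


# Geuvadis counts from the text: 19 of 25 duplicate pairs share eQTLs,
# 158 of 394 singleton pairs share eQTLs.
table = (19, 25 - 19, 158, 394 - 158)
print(round(odds_ratio(*table), 2))
print(fisher_exact_two_sided(*table))
```

The same functions apply to the Battle et al. counts (14 of 91 vs 195 of 3,544).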

To test if the sharing of regulatory elements between neighbors is mediated through chro-

matin interactions, we examined genome-wide chromatin conformation capture data (Hi-C)

recently generated in the GM12878 cell line (22). We excluded read pairs separated by < 20 kb, as this

lies close to the resolution of the assay. On average, we found 62 Hi-C reads supporting a

promoter-promoter interaction between a pair of neighbor duplicates. No reads supported a

promoter-promoter interaction between duplicates on different chromosomes (Figure 2E). In-

terestingly, consistent with the eQTL analysis, we found promoter-promoter interaction is more

intense in duplicate neighbors than in singleton neighbors. Of 69 pairs of expressed

duplicate neighbors, 66 are linked by > 0 reads, with a total of 4,254 reads linking the

pairs of promoters; of 4,959 pairs of expressed singleton neighbors, 4,406 had > 0 reads,

with a total of 114,165 reads linking the promoters. To test

significance of these observations we constructed a generalized linear model with the response

variable being the number of reads linking a pair of promoters. The predictors are: 1) sum

of the expression of the pair; 2) distance between the two promoters; and 3) categorical

indicator (duplicates or singletons). The GLM shows that, controlling for expression level and

distance, duplicate neighbors have significantly more Hi-C linkages than singleton neighbors

($p = 3.1 \times 10^{-6}$, Wald test).

We also examined enhancer-promoter links, using enhancer annotations from chromHMM

in the GM12878 cell line (77). To reduce noise, we went through a first step of identifying

statistically significant enhancer-promoter interactions controlling for distance between the in-

teracting loci, GC content and enzyme digestion/ligation efficiency, and requiring that the events

be within the same TAD (79). For all genes, we identified a total number of 525,342 significant

long-range enhancer-promoter interactions (distance > 20 kb, FDR = 0.2). Among these, we

found 8 out of 39 pairs of expressed duplicate neighbors where a single enhancer was linked to

both promoters. Similarly we found 246 out of the 2,507 pairs of expressed singleton neighbors

in the same TADs showed evidence of enhancer sharing. This is a weakly significant

enrichment for duplicates vs singletons: odds ratio = 2.4, one-sided Fisher’s exact test p = 0.035.

3 Supplemental Figures

[Figure S1 image. Panels: A. FPKM = 10; B. FPKM = 1; C. FPKM = 0.1. X axis: dS (bins 0−0.1, 0.1−0.5, 0.5−1.0, 1.0−1.5, 1.5−2.0). Y axis: estimated expression ratio (log2). Legend: Cufflinks; Unambiguous Reads Only.]

Figure S1: Estimation of expression ratios of duplicated genes for simulated data. For each pair of duplicates, we simulated RNA-seq reads derived from the transcripts of these two genes with an expression ratio of 1:1, and FPKM at 10 (A), 1 (B) and 0.1 (C). We mapped the reads to the human genome hg19 using TopHat2. We then estimated the expression ratios of the genes using Cufflinks and using our own pipeline based on reciprocally unambiguous sites only. The Y axis shows the estimated log2 expression ratio of the duplicate pairs. Note that FPKM = 0.1 is presented only as a worst-case scenario, as such genes are considered unexpressed in our analyses.

[Figure S2 image. Schematic of two duplicates (Duplicate 1, Duplicate 2) differing at sites 1 to 4, showing reciprocally unambiguous reads, multi-hit reads, unambiguous sites and ambiguous sites.]

Figure S2: Schematic of reciprocal reads mapping. A reciprocally unambiguous site is defined as a position for which all reads overlapping that site are uniquely mappable in both sister duplicates. Gene expression of gene duplicates is calculated using only reciprocally unambiguous sites. To estimate average expression level, the number of mapped reads is appropriately normalized for the number of allowed positions.
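
The normalization described in the Figure S2 caption can be sketched as follows. This is a simplified illustration, not the authors' pipeline: per-site read depths and boolean masks of reciprocally unambiguous sites are assumed to be available, and all names and numbers are hypothetical.

```python
import math


def mean_depth_unambiguous(depth, unambiguous):
    """Average read depth over reciprocally unambiguous sites only."""
    kept = [d for d, ok in zip(depth, unambiguous) if ok]
    if not kept:
        return None  # the gene has no usable positions
    return sum(kept) / len(kept)


def log2_expression_ratio(depth_1, mask_1, depth_2, mask_2):
    """log2 ratio of duplicate 1 to duplicate 2, or None if undefined."""
    e1 = mean_depth_unambiguous(depth_1, mask_1)
    e2 = mean_depth_unambiguous(depth_2, mask_2)
    if not e1 or not e2:
        return None
    return math.log2(e1 / e2)


# The third site of duplicate 1 and the second site of duplicate 2 are
# ambiguous (inflated by multi-hit reads) and are therefore excluded.
ratio = log2_expression_ratio(
    [10, 12, 50, 8], [True, True, False, True],
    [5, 40, 5, 5], [True, False, True, True],
)
print(ratio)  # 1.0
```

Dividing by the number of retained sites is what keeps genes with few unambiguous positions comparable to genes with many.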

[Figure S3 image. Tree of nine species: Chicken, Platypus, Opossum, Mouse, Macaque, Orangutan, Gorilla, Chimpanzee, Human. Scale bar: 0.05 synonymous divergence. Labels on the right (dS from human): 0.015, 0.022, 0.043, 0.081, 0.57, 1.0, 1.2, 1.6. Green labels: 0.014, 0.060, 0.45, 1.1.]

Figure S3: Synonymous divergence tree of nine species at singleton genes. The labels on the right show estimated synonymous distances, dS, between human and each of the other species, while the green labels show twice the dS along the human lineage from each branch point. Synonymous divergence between species was calculated as a weighted average synonymous distance for all singleton (genes with no duplicates) ortholog pairs of these two species.

[Figure S4 image. X axis: dS (0 to 1.5). Y axis: proportion of pairs (0.0 to 1.0). Legend: chimpanzee, gorilla, orangutan, macaque, mouse, opossum, chicken.]

Figure S4: dS as a molecular clock to date duplications: Proportion of human duplications shared with other species as a function of dS. The X axis shows dS between human duplicates. The Y axis shows the estimated fraction of duplicates in a dS bin that are shared with an outgroup species. Notice for example that most duplicates with dS ≤ 0.35 are not shared with mouse, but most duplicates with dS ≥ 0.45 are shared with mouse, indicating that these duplicates arose prior to the human-mouse split.

[Figure S5 image. X axis: age (Human to macaque; Macaque to mouse; Mouse to opossum; Opossum to chicken; Precede chicken). Y axis: number of pairs (0 to 200). Legend: No difference; Asymmetrically expressed; Sub−/neofunctionalized.]

Figure S5: Classification of gene pairs by expression patterns. Ages estimated using phylogenetic information: for example, “macaque to mouse” includes duplicates shared with macaque but not with mouse.

Figure S6: Heat maps of expression ratios for duplicate pairs in different age groups, e.g., “macaque to mouse” indicates human duplicates inferred to have arisen on the ancestral branch between the human-macaque split and the human-mouse split. As in the dS-based version of this figure (Figure 2B), for each duplicate pair (plotted in columns) the ratios show the tissue-specific expression level of the minor gene relative to its duplicate. Green indicates evidence for subfunctionalization; consistently blue columns indicate AEDs. Black indicates tissue ratios not significantly different from 1 (p > .001). Relatively few of the duplicates that arose within the mammals show evidence of subfunctionalization.

[Figure S7 image. Panel A. Proximity of duplicates. X axis: age (Human to macaque; Macaque to mouse; Mouse to opossum; Opossum to chicken; Precede chicken). Y axis: number of pairs (0 to 200). Legend: Retrotransposition; Different chromosomes; Same chr. > 1 MB; Same chr. < 1 MB. Panel B. Expression correlation. X axis: age (Human to macaque; Macaque to mouse; Mouse to opossum; Opossum to chicken). Y axis: expression correlation (−0.5 to 1.0). Legend: Within 1 MB; Diff chr; Group mean. Asterisks mark significance.]

Figure S7: Phylogeny-based analysis: Duplicate pairs located on different chromosomes show consistently lower expression correlations than those located on the same chromosome. A. Proximity of duplicates at different ages. B. Expression divergence of nearby pairs compared to pairs on different chromosomes, divided up according to the phylogenetic branch on which the duplications occurred.

[Figure S8 image. Panel A. Segmental duplications (Y axis: number of pairs, 0 to 200). Panel B. Retrotransposed duplications (Y axis: number of pairs, 0 to 20). X axis in both panels: dS (0 to 2). Legend: Different chromosomes; Same chr. > 1 Mb; Same chr. < 1 Mb.]

Figure S8: Numbers of likely segmental (A) and retrotransposed (B) duplications in human, for different values of dS. Most young segmental duplicate pairs are nearby in the genome. Note that the number of segmental duplicate pairs is much greater than the number of retrotransposed pairs.

[Figure S9 image. Panel A. Segmental duplications (Y axis: number of pairs, 0 to 250). Panel B. Retrotransposed duplications (Y axis: number of pairs, 0 to 100). X axis in both panels: dS (0 to 2). Legend: Different chromosomes; Same chr. > 1 Mb; Same chr. < 1 Mb.]

Figure S9: Numbers of likely segmental (A) and retrotransposed (B) duplications in mouse, for different values of dS. As seen in human, most young segmental duplicate pairs are nearby in the genome and the number of segmental duplicate pairs is much greater than the number of retrotransposed pairs.

Figure S10: Expression of duplicate genes in 27 tissues (expanded version of Main Figure 1). A. A gene pair with an expression profile consistent with sub- or neofunctionalization: i.e., each gene is significantly more highly expressed than the other in at least one tissue. B. An asymmetrically expressed gene pair. Introns have been shortened for display purposes. The Y-axis shows read depth per billion mapped reads. Green regions in the gene models are unmappable.
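
The distinction drawn in the Figure S10 caption can be sketched as a simple classification rule. This is a simplification under stated assumptions: it uses only the per-tissue significance threshold (p < 0.001, as in the heat map captions), omits any fold change cutoff, and treats any pair with significant differences in one direction only as asymmetrically expressed. The function name and the inputs are hypothetical.

```python
def classify_pair(log2_ratios, p_values, alpha=0.001):
    """Classify a duplicate pair from per-tissue comparisons.

    log2_ratios[i] is log2(gene A / gene B) in tissue i, and p_values[i] is
    the p value of the per-tissue test of a difference between the genes.
    """
    tests = list(zip(log2_ratios, p_values))
    a_higher = any(r > 0 and p < alpha for r, p in tests)
    b_higher = any(r < 0 and p < alpha for r, p in tests)
    if a_higher and b_higher:
        # Each gene is significantly higher than the other in some tissue.
        return "sub-/neofunctionalized"
    if a_higher or b_higher:
        return "asymmetrically expressed"
    return "no difference"


print(classify_pair([2.1, -1.8, 0.1], [1e-5, 1e-4, 0.6]))  # sub-/neofunctionalized
print(classify_pair([2.1, 1.5, 0.1], [1e-5, 1e-4, 0.6]))   # asymmetrically expressed
```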

Figure S11: Gene expression ratios for duplicate gene pairs in 46 tissues. A. Heat maps of expression ratios for all duplicate gene pairs, at 3 levels of synonymous divergence, dS. For each duplicate pair (plotted as a column) the ratios show the tissue-specific expression level of the gene with higher median expression relative to its duplicate. Blue indicates significantly lower expression of the minor gene in a particular tissue; red indicates significantly higher expression of the minor gene (p < 0.001 for both cases). Black indicates no significant difference. B. Distributions of expression ratios for duplicate gene pairs in 46 tissues. Labeling same as in A. Notice that for most gene pairs, the minor gene has consistently lower expression than the major gene, with few clear cases of subfunctionalization (i.e., mix of red/blue) except for the most diverged gene pairs.

[Figure S12 image. Panels: A. Segmental duplicates colored by major-minor classification; B. Retrotransposed pairs colored by major-minor classification; C. Retrotransposed pairs colored by parent-daughter classification.]

Figure S12: Gene expression ratios for segmental and retrotransposed pairs. Heat maps of expression ratios for segmental duplicated (A) and retrotransposed gene pairs (B, C), at 3 levels of synonymous divergence, dS. Within categories, columns are sorted by the amount of blue/red. For each duplicate pair (plotted as a column) in panels A and B, the ratios show the tissue-specific expression level of the gene with higher median expression relative to its duplicate. The ratios in panel C show the tissue-specific expression level of the parent gene (gene with multiple exons) relative to the daughter gene (gene with one exon). Blue indicates significantly lower expression of the minor or daughter gene in a particular tissue; red indicates significantly higher expression of the minor or daughter gene (p < .001 for both cases). Black indicates no significant difference. Note that in 84% of the retrotransposed gene pairs, the daughter genes are also the minor genes.

[Figure S13 image. X axis: dS (0 to 2). Y axis: proportion of sub−/neofunctionalization (0.0 to 0.6). Legend: Expressed in <10 tissues; >=10 tissues.]

Figure S13: Rates of subfunctionalization in pairs expressed in many tissues (orange) or few tissues (purple). The X axis indicates dS boundaries. The Y axis is the proportion of duplicates that are sub-/neofunctionalized. Note that the rates of sub-/neofunctionalization in broadly expressed duplicates are generally higher than for narrowly expressed duplicates.

[Figure S14 plot. X axis: dS (0 to 2). Y axis: number of pairs (0 to 500). Legend: Unmappable; No difference; Asymmetrically expressed; Sub-/neofunctionalized.]

Figure S14: Classification of mouse duplicate gene pairs by expression patterns in 26 tissues (12). The overall patterns are qualitatively similar to the human results.

[Figure S15 plots. Panels: A. Macaque; B. Mouse. Y axis: log2 expression ratio (−4 to 4). Categories: Sum, Major, Minor, Singletons. Reference lines: ratio = 2:1 and ratio = 1:1.]

Figure S15: Expression levels of duplicates compared to their singleton orthologs in macaque (A) and mouse (B), for human duplicates that are single-copy genes in macaque/mouse. “Sum” shows the summed expression of both duplicates, relative to expression of the macaque/mouse orthologs in the same tissues; the “major” and “minor” data show equivalent ratios for the higher and lower expressed genes in each duplicate pair. Each tissue-gene expression ratio is plotted as a separate data point. The green data show results for a random set of singleton orthologs.

[Figure S16 plot. X axis: dS (0.00 to 0.10). Y axis: dN (0.00 to 0.10). Color scale: log2 ratio (−1.0 to 1.0).]

Figure S16: Scatter plot of dN vs. dS for very young duplicate pairs shows dN ≤ dS for all pairs. Dots are colored by the mean log ratio of summed expression of duplicates to expression of their single copy ortholog in macaque. Notice that this is centered close to a log ratio of 0, and that there is no obvious relationship between expression and dN.

[Figure S17 plot. Y axis: log2 expression ratio (−4 to 4). Groups: duplicates with dN = 0, dN < 0.02, and dN >= 0.02; singletons. Reference lines: ratio 2:1 and ratio 1:1.]

Figure S17: Ratio of summed expression of duplicates to their single copy ortholog in macaque, stratified by dN between the two human copies. Each data point shows a single gene × tissue combination. These results show that dosage reduction evolves quickly in these young duplicates, while they still have very low protein divergence (on average ∼2%, Figure S16).

[Figure S18 plot. X axis: dS (0.0 to 1.5). Y axis: fraction of rare variants (<0.1%). Legend: mean of young duplicates; mean of singletons; No-difference pairs; AEDs; Sub-/Neofunctionalized.]

Figure S18: Fraction of rare missense variants for very young duplicates in a large data set of human exomes (15), compared to duplicates in general, and singleton genes. Notice that young duplicates have a low fraction of rare variants, indicating that they are under relatively weak selective constraint compared to much older duplicates and to non-duplicated genes.

[Figure S19 plots. A. Expression of duplicates vs. single copy ortholog: log2 expression ratio (−4 to 4) vs. dS (0 to 0.1), with reference lines at ratios 2:1 and 1:1, a spline fit, and the overall median. B. Fold change in expression of CNVs: log2 expression ratio (−4 to 4) vs. number of copies (1 to 4), with group means and reference ratios of 2, 1.5, 1, and 0.5.]

Figure S19: Expression reduction in duplicated genes. A. Ratios of summed expression of duplicates to their single copy orthologs in macaque, as a function of duplicate age (dS). Note that the fold-reduction of expression in the newest duplicates is similar to that in older pairs. The average expression ratio of duplicates to their orthologs is ∼1.2. B. Ratios of gene expression in individuals with atypical copy numbers to individuals with 2 copies. Each blue dot represents the ratio of median expression of individuals with copy number indicated by the X axis to the median expression of individuals with 2 copies, for a different CNV. Note that the effect of copy number on gene expression is smaller than expected from a simple additive model (dotted lines). For example, the average expression ratio of individuals with 4 copies to individuals with 2 copies is ∼1.5, compared to the 2-fold difference expected from copy number alone. However, the 1.5-fold ratio is higher than the 1.2-fold average difference seen in panel A.
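The comparison to a simple additive model in panel B can be made concrete with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the observed values are the approximate averages quoted in the caption (∼1.5 for 4 vs. 2 copies, ∼1.2 for duplicates vs. single copy orthologs), not values recomputed from the data.

```python
import math


def additive_expected_ratio(copies, reference_copies=2):
    """Expression ratio expected if expression scaled linearly with copy number."""
    return copies / reference_copies


def log2_ratio(ratio):
    """Convert a fold ratio to the log2 scale used on the figure's Y axis."""
    return math.log2(ratio)


# Approximate averages quoted in the caption (not recomputed here).
observed_cnv_ratio = 1.5        # 4 copies vs. 2 copies, panel B
observed_duplicate_ratio = 1.2  # duplicates vs. single copy ortholog, panel A

expected = additive_expected_ratio(4)     # 2-fold under the additive model
shortfall = observed_cnv_ratio / expected  # fraction of the additive expectation
```

Under these quoted averages, individuals with 4 copies reach about three quarters of the additive expectation, which is still above the ∼1.2-fold ratio reported for fixed duplicates in panel A.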

[Figure S20 plot. X axis: dS (0 to 2). Y axis: number of pairs (0 to 300). Legend: Unmappable; No difference; Asymmetrically expressed; Potential subfunctionalization by differential splicing; Sub-/neofunctionalized.]

Figure S20: Distribution of potential subfunctionalization by differential splicing. Differentially spliced pairs are shown as a separate group in addition to the patterns identified at the gene level (Figure 1B). Note that the overall subfunctionalization rate would be low even if all differentially spliced pairs were actually subfunctionalized.

[Figure S21 plots. A. Long-term selective constraint: dN/dS (0.0 to 0.8) vs. dS (0.0 to 1.5); legend: No difference, AEDs, Sub-/Neofunctionalized; major genes vs. minor genes. B. Major vs. minor genes in AED pairs: dN/dS of minor gene (0.0 to 1.0) vs. dN/dS of major gene (0.0 to 1.0); p = 0.002.]

Figure S21: Evolutionary constraint on duplicates. A. Regression of dN/dS on dS for major and minor genes of the three different categories. B. Scatter plot of dN/dS for major and minor genes of AEDs. A paired t-test showed a significant difference between major and minor genes of AEDs (p = 0.002).
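For readers unfamiliar with the paired t-test mentioned in panel B, the following minimal sketch computes the test statistic for paired major/minor dN/dS values. The function name and the toy numbers are hypothetical and are not taken from the study; the p-value would then be obtained from a t distribution with the returned degrees of freedom (for example with a statistics package), which is not done here.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(major_values, minor_values):
    """Return (t, degrees_of_freedom) for paired observations.

    Each position holds the dN/dS of the major and minor gene of one pair.
    """
    if len(major_values) != len(minor_values) or len(major_values) < 2:
        raise ValueError("need at least two paired observations")
    # Per-pair differences (minor minus major).
    diffs = [mn - mj for mj, mn in zip(major_values, minor_values)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Toy data (made up): dN/dS for four asymmetrically expressed pairs.
toy_major = [0.10, 0.20, 0.15, 0.30]
toy_minor = [0.15, 0.30, 0.20, 0.50]
t_stat, dof = paired_t_statistic(toy_major, toy_minor)
```

A positive statistic in this toy example means the minor genes have higher dN/dS than their major partners, i.e. weaker apparent constraint.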

[Figure S22 plot. X axis: derived allele frequency (1e−04 to 1e+00). Y axis: cumulative proportion (0.0 to 1.0). Legend, dS range (number of sites): 0−0.1 (9,386); 0.1−0.5 (19,154); 0.5−1.0 (26,746); 1.0−1.5 (29,139); >1.5 (30,712); singleton (356,806).]

Figure S22: Cumulative derived allele frequency spectrum for nonsynonymous sites of duplicate genes of different age. Genes under higher selective constraint tend to have a higher fraction of rare variants, and hence the cumulative curve rises faster (i.e., appears higher on the plot). Note that the youngest duplicates are under lowest constraint, and that the oldest duplicates are under higher constraint than typical singleton genes. The data are for 6,515 individuals in the Exome Sequencing Project (15). The numbers in the legend show the total numbers of nonsynonymous sites in each age group.
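The cumulative spectrum described above can be sketched as follows. This is a minimal illustration with made-up allele frequencies; the function name, thresholds, and toy data are assumptions, not the authors' code or data.

```python
from bisect import bisect_right


def cumulative_daf(frequencies, thresholds):
    """Fraction of sites with derived allele frequency <= each threshold."""
    ordered = sorted(frequencies)
    n = len(ordered)
    if n == 0:
        raise ValueError("no sites given")
    # bisect_right counts how many sorted values are <= the threshold.
    return [bisect_right(ordered, t) / n for t in thresholds]


# Toy derived allele frequencies (made up) for two hypothetical gene sets.
toy_constrained = [0.0001, 0.0002, 0.0005, 0.001, 0.2]  # mostly rare variants
toy_relaxed = [0.0005, 0.01, 0.05, 0.2, 0.6]            # more common variants
thresholds = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]

curve_constrained = cumulative_daf(toy_constrained, thresholds)
curve_relaxed = cumulative_daf(toy_relaxed, thresholds)
```

In this toy case the constrained set's curve lies above the relaxed set's curve at every threshold below 1, which is the pattern the caption describes for genes under stronger selective constraint.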

[Figure S23 plots. Panels: A. dS 0−0.5; B. dS 0.5−1.0; C. dS 1.0−1.5; D. dS 1.5−2.0. X axis: derived allele frequency (1e−04 to 1e+00). Y axis: cumulative proportion. Legend: non-synonymous vs. synonymous sites; major genes vs. minor genes.]

Figure S23: Cumulative derived allele frequency of major and minor genes for AED genes of different ages.

[Figure S24 plots. A. Pairs within 1 Mbp and B. Pairs on different chromosomes: proportion (0 to 1) of pairs in each category (No diff, AED, Sub-/neo) by dS bin (0−0.5, 0.5−1.0, 1.0−1.5, 1.5−2.0). C. Proportion of tissues where the minor gene is expressed higher (p < 0.001) and D. the same with p < 0.05: by dS bin, for pairs at distance < 1 Mbp vs. pairs on different chromosomes.]

Figure S24: Sub-/neofunctionalization is more likely to occur in pairs on different chromosomes compared to pairs within 1 MB. A. Proportion of each category of expression patterns of duplicate gene pairs within 1 MB across different values of dS. B. Proportion of each category of expression patterns of duplicate gene pairs on different chromosomes. C. and D. Proportions of tissues in which the gene with lower overall expression (minor gene) is more highly expressed than the major gene according to a paired t-test. The p-value cutoff for the t-test is 0.001 (C) and 0.05 (D). (n.s.: not significant, *: p < 0.05, **: p < 0.01, ***: p < 0.001).

[Figure S25 plots. A. TSS histone correlation; B. Gene body histone correlation. X axis: dS bin (0−0.5, 0.5−1.0, 1.0−1.5, 1.5−2.0). Y axis: histone mark correlation (−1.0 to 1.0).]

Figure S25: Histone modification correlations are higher for pairs within 1 MB compared to pairs on different chromosomes. Distributions of the TSS (A) and gene body (B) histone modification correlations of duplicate pairs, across tissues for different values of dS. Pairs within 1 MB (orange) tend to be more correlated than pairs on different chromosomes (purple). (n.s.: not significant, *: p < 0.05, **: p < 0.01, ***: p < 0.001)

[Figure S26 plots. Y axis in all panels: expression correlation (partial residual), −0.5 to 1.0. A. Exp. corr. vs. dS (0.0 to 2.0) given tandem/separated status; legend: within 1 MB, diff chr. B. Exp. corr. vs. tandem/separated status given dS; genomic proximity (p = 3e−30). C. Tandem only: exp. corr. vs. dS (0.0 to 1.0) given distance (p = 2e−09). D. Tandem only: exp. corr. vs. distance given dS; genomic distance in bp, 1e+04 to 1e+08 (p = 0.003).]

Figure S26: Multiple regression of expression correlation for duplicates: top row shows tandem duplicates vs. duplicates on different chromosomes; bottom row shows the effect of physical distance in tandem duplicates.

[Figure S27 plot. X axis: ORF divergence (0.0 to 0.6). Y axis: promoter divergence (0 to 0.6). Legend: unaligned pairs; aligned pairs. ρ = 0.83, p < 2.2e−16.]

Figure S27: Scatter plot of synonymous divergence (X axis) vs. promoter divergence (Y axis) of duplicate pairs. Purple dots denote duplicates with unaligned promoters.


4 Supplemental Tables

Table S1: A list of genome assemblies and gene annotations used.

Species       Genome assembly       Gene annotation
Human         Ensembl GRCh37        release 73
Chimpanzee    Ensembl CHIMP2.1.4    release 70
Gorilla       Ensembl gorGor3       release 73
Orangutan     Ensembl PPYG2         release 73
Macaque       Ensembl Mmul_1        release 70
Mouse         Ensembl GRCm38        release 70
Opossum       Ensembl BROADO5       release 73
Platypus      Ensembl OANA5         release 73
Chicken       Ensembl WASHUC2       release 70

Table S2: Multiple regression of total number of diseases associated with minor gene.

Predictor                                                Coefficient      P value
Number of diseases associated with major gene            5.21 · 10^−2     0
Synonymous divergence between duplicates                 4.81 · 10^−2     9 · 10^−4
Proportion of tissues where minor gene expressed high    1.99             4 · 10^−13
Mean expression ratio of minor/major gene                9.78 · 10^−2     8 · 10^−7

Table S3: Multiple regression of number of diseases associated with the minor gene only

Predictor                                                Coefficient      P value
Number of diseases associated with major gene            −7.47 · 10^−3    7 · 10^−2
Number of diseases shared by the duplicates              0.25             6 · 10^−199
Synonymous divergence between duplicates                 4.58 · 10^−2     9 · 10^−3
Nonsynonymous divergence between duplicates              0.45             9 · 10^−2
Proportion of tissues where minor gene expressed high    2.31             5 · 10^−12
Mean expression ratio of minor/major gene                7.29 · 10^−2     2 · 10^−3


Table S4: Multiple regression of expression correlation between duplicates across tissues.

Effect of whether duplicates are on same or different chromosomes on expression correlation.

Predictor                                             Coefficient    P value
a. Synonymous divergence between duplicates           −0.14          2 × 10^−6
b. Duplicates on same or different chromosomes        −0.36          3 × 10^−30
Interaction term (a×b)                                0.14           4 × 10^−6

Effect of distance on expression correlation for tandem duplicates.

Predictor                                             Coefficient    P value
Synonymous divergence between duplicates              −0.13          2 × 10^−9
Log 10 genomic distance between duplicates            −0.047         0.003
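To illustrate how the interaction term in the first regression of Table S4 can be read, the sketch below plugs in the reported coefficients. It rests on an assumption that is not stated in the table: that predictor b is coded 0 for pairs on the same chromosome and 1 for pairs on different chromosomes. The intercept is not reported, so only slopes and differences are computed; the function names are hypothetical.

```python
# Coefficients as reported in Table S4 (first regression).
COEF_DS = -0.14           # a. synonymous divergence between duplicates
COEF_SEPARATED = -0.36    # b. same vs. different chromosomes
COEF_INTERACTION = 0.14   # a x b

# ASSUMPTION: separated = 0 for same chromosome, 1 for different chromosomes.


def ds_slope(separated):
    """Change in expression correlation per unit dS for a given status."""
    return COEF_DS + COEF_INTERACTION * separated


def separation_effect(ds):
    """Predicted difference (separated minus same chromosome) at a given dS."""
    return COEF_SEPARATED + COEF_INTERACTION * ds


slope_same_chromosome = ds_slope(0)
slope_different_chromosomes = ds_slope(1)
effect_at_ds_zero = separation_effect(0.0)
effect_at_ds_one = separation_effect(1.0)
```

Under this coding, the fitted correlation declines with dS for same-chromosome pairs but is roughly flat for separated pairs, and the gap between the two groups narrows as dS increases.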

Table S5: Multiple regression of expression correlation between duplicates controlling for promoter divergence.

Predictor                                                                Coefficient    P value
Promoter identity between duplicates                                     0.14           0.21
Tandem vs Separated (i.e., within 1 MB vs on different chromosomes)      −0.25          4 · 10^−9

5 List of Supplementary Files

Supplementary File 1 - A list of human tissue RNA-seq samples used from GTEx.

Supplementary File 2 - A list of mouse tissue RNA-seq data used from Babak et al.

Supplementary File 3 - A list of ChIP-seq data used from Roadmap Epigenomics Project.


6 References

1. G. C. Conant, K. H. Wolfe, Turning a hobby into a job: How duplicated genes find new functions. Nat. Rev. Genet. 9, 938–950 (2008). Medline doi:10.1038/nrg2482

2. M. Lynch, J. S. Conery, The evolutionary demography of duplicate genes. J. Struct. Funct. Genomics 3, 35–44 (2003). Medline doi:10.1023/A:1022696612931

3. H. Innan, F. Kondrashov, The evolution of gene duplications: Classifying and distinguishing between models. Nat. Rev. Genet. 11, 97–108 (2010). Medline doi:10.1038/nrg2689

4. A. Stoltzfus, On the possibility of constructive neutral evolution. J. Mol. Evol. 49, 169–181 (1999). Medline doi:10.1007/PL00006540

5. W. Qian, B. Y. Liao, A. Y. Chang, J. Zhang, Maintenance of duplicate genes and their functional redundancy by reduced expression. Trends Genet. 26, 425–430 (2010). Medline doi:10.1016/j.tig.2010.07.002

6. G. C. Conant, J. A. Birchler, J. C. Pires, Dosage, duplication, and diploidization: Clarifying the interplay of multiple models for duplicate gene evolution over time. Curr. Opin. Plant Biol. 19, 91–98 (2014). Medline doi:10.1016/j.pbi.2014.05.008

7. J. F. Gout, M. Lynch, Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol. Biol. Evol. 32, 2141–2148 (2015). Medline doi:10.1093/molbev/msv095

8. A. Force, M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, J. Postlethwait, Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545 (1999). Medline

9. C. R. Baker, V. Hanson-Smith, A. D. Johnson, Following gene duplication, paralog interference constrains transcriptional circuit evolution. Science 342, 104–108 (2013). Medline doi:10.1126/science.1240810

10. I. Wapinski, A. Pfeffer, N. Friedman, A. Regev, Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61 (2007). Medline doi:10.1038/nature06107

11. GTEx Consortium, The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 348, 648–660 (2015). Medline doi:10.1126/science.1262110

12. T. Babak, B. DeVeale, E. K. Tsang, Y. Zhou, X. Li, K. S. Smith, K. R. Kukurba, R. Zhang, J. B. Li, D. van der Kooy, S. B. Montgomery, H. B. Fraser, Genetic conflict reflected in tissue-specific maps of genomic imprinting in human and mouse. Nat. Genet. 47, 544–549 (2015). Medline doi:10.1038/ng.3274

13. See supplementary materials and methods on Science Online.

14. B. van de Geijn, G. McVicker, Y. Gilad, J. K. Pritchard, WASP: Allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015). Medline doi:10.1038/nmeth.3582

15. W. Fu, T. D. O’Connor, G. Jun, H. M. Kang, G. Abecasis, S. M. Leal, S. Gabriel, M. J. Rieder, D. Altshuler, J. Shendure, D. A. Nickerson, M. J. Bamshad, J. M. Akey; NHLBI Exome Sequencing Project, Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013). Medline doi:10.1038/nature11690


16. K. Peng, W. Xu, J. Zheng, K. Huang, H. Wang, J. Tong, Z. Lin, J. Liu, W. Cheng, D. Fu, P. Du, W. A. Kibbe, S. M. Lin, T. Xia, The Disease and Gene Annotations (DGA): An annotation resource for human disease. Nucleic Acids Res. 41, D553–D560 (2013). Medline doi:10.1093/nar/gks1244

17. E. W. Ganko, B. C. Meyers, T. J. Vision, Divergence in expression between duplicated genes in Arabidopsis. Mol. Biol. Evol. 24, 2298–2309 (2007). Medline doi:10.1093/molbev/msm158

18. J. A. Bailey, Z. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, E. E. Eichler, Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002). Medline doi:10.1126/science.1072047

19. A. T. Ghanbarian, L. D. Hurst, Neighboring genes show correlated evolution in gene expression. Mol. Biol. Evol. 32, 1748–1766 (2015). Medline doi:10.1093/molbev/msv053

20. A. Battle, S. Mostafavi, X. Zhu, J. B. Potash, M. M. Weissman, C. McCormick, C. D. Haudenschild, K. B. Beckman, J. Shi, R. Mei, A. E. Urban, S. B. Montgomery, D. F. Levinson, D. Koller, Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 24, 14–24 (2014). Medline doi:10.1101/gr.155192.113

21. T. Lappalainen, M. Sammeth, M. R. Friedländer, P. A. ’t Hoen, J. Monlong, M. A. Rivas, M. Gonzàlez-Porta, N. Kurbatova, T. Griebel, P. G. Ferreira, M. Barann, T. Wieland, L. Greger, M. van Iterson, J. Almlöf, P. Ribeca, I. Pulyakhina, D. Esser, T. Giger, A. Tikhonov, M. Sultan, G. Bertier, D. G. MacArthur, M. Lek, E. Lizano, H. P. Buermans, I. Padioleau, T. Schwarzmayr, O. Karlberg, H. Ongen, H. Kilpinen, S. Beltran, M. Gut, K. Kahlem, V. Amstislavskiy, O. Stegle, M. Pirinen, S. B. Montgomery, P. Donnelly, M. I. McCarthy, P. Flicek, T. M. Strom, H. Lehrach, S. Schreiber, R. Sudbrak, A. Carracedo, S. E. Antonarakis, R. Häsler, A. C. Syvänen, G. J. van Ommen, A. Brazma, T. Meitinger, P. Rosenstiel, R. Guigó, I. G. Gut, X. Estivill, E. T. Dermitzakis; Geuvadis Consortium, Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013). Medline doi:10.1038/nature12531

22. S. S. Rao, M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander, E. L. Aiden, A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014). Medline doi:10.1016/j.cell.2014.11.021

23. A. Feuerborn, P. R. Cook, Why the activity of a gene depends on its neighbors. Trends Genet. 31, 483–490 (2015). Medline doi:10.1016/j.tig.2015.07.001

24. D. Brawand, M. Soumillon, A. Necsulea, P. Julien, G. Csárdi, P. Harrigan, M. Weier, A. Liechti, A. Aximu-Petri, M. Kircher, F. W. Albert, U. Zeller, P. Khaitovich, F. Grützner, S. Bergmann, R. Nielsen, S. Pääbo, H. Kaessmann, The evolution of gene expression levels in mammalian organs. Nature 478, 343–348 (2011). Medline doi:10.1038/nature10532

25. K. Y. Popadin, M. Gutierrez-Arcelus, T. Lappalainen, A. Buil, J. Steinberg, S. I. Nikolaev, S. W. Lukowski, G. A. Bazykin, V. B. Seplyarskiy, P. Ioannidis, E. M. Zdobnov, E. T. Dermitzakis, S. E. Antonarakis, Gene age predicts the strength of purifying selection acting on gene expression variation in humans. Am. J. Hum. Genet. 95, 660–674 (2014). Medline doi:10.1016/j.ajhg.2014.11.003


26. B. E. Bernstein, J. A. Stamatoyannopoulos, J. F. Costello, B. Ren, A. Milosavljevic, A. Meissner, M. Kellis, M. A. Marra, A. L. Beaudet, J. R. Ecker, P. J. Farnham, M. Hirst, E. S. Lander, T. S. Mikkelsen, J. A. Thomson, The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol. 28, 1045–1048 (2010). Medline doi:10.1038/nbt1010-1045

27. A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, V. Amin, J. W. Whitaker, M. D. Schultz, L. D. Ward, A. Sarkar, G. Quon, R. S. Sandstrom, M. L. Eaton, Y. C. Wu, A. R. Pfenning, X. Wang, M. Claussnitzer, Y. Liu, C. Coarfa, R. A. Harris, N. Shoresh, C. B. Epstein, E. Gjoneska, D. Leung, W. Xie, R. D. Hawkins, R. Lister, C. Hong, P. Gascard, A. J. Mungall, R. Moore, E. Chuah, A. Tam, T. K. Canfield, R. S. Hansen, R. Kaul, P. J. Sabo, M. S. Bansal, A. Carles, J. R. Dixon, K. H. Farh, S. Feizi, R. Karlic, A. R. Kim, A. Kulkarni, D. Li, R. Lowdon, G. Elliott, T. R. Mercer, S. J. Neph, V. Onuchic, P. Polak, N. Rajagopal, P. Ray, R. C. Sallari, K. T. Siebenthall, N. A. Sinnott-Armstrong, M. Stevens, R. E. Thurman, J. Wu, B. Zhang, X. Zhou, A. E. Beaudet, L. A. Boyer, P. L. De Jager, P. J. Farnham, S. J. Fisher, D. Haussler, S. J. Jones, W. Li, M. A. Marra, M. T. McManus, S. Sunyaev, J. A. Thomson, T. D. Tlsty, L. H. Tsai, W. Wang, R. A. Waterland, M. Q. Zhang, L. H. Chadwick, B. E. Bernstein, J. F. Costello, J. R. Ecker, M. Hirst, A. Meissner, A. Milosavljevic, B. Ren, J. A. Stamatoyannopoulos, T. Wang, M. Kellis; Roadmap Epigenomics Consortium, Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). Medline

28. P. Flicek, M. R. Amode, D. Barrell, K. Beal, K. Billis, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fitzgerald, L. Gil, C. G. Girón, L. Gordon, T. Hourlier, S. Hunt, N. Johnson, T. Juettemann, A. K. Kähäri, S. Keenan, E. Kulesha, F. J. Martin, T. Maurel, W. M. McLaren, D. N. Murphy, R. Nag, B. Overduin, M. Pignatelli, B. Pritchard, E. Pritchard, H. S. Riat, M. Ruffier, D. Sheppard, K. Taylor, A. Thormann, S. J. Trevanion, A. Vullo, S. P. Wilder, M. Wilson, A. Zadissa, B. L. Aken, E. Birney, F. Cunningham, J. Harrow, J. Herrero, T. J. Hubbard, R. Kinsella, M. Muffato, A. Parker, G. Spudich, A. Yates, D. R. Zerbino, S. M. Searle, Ensembl 2014. Nucleic Acids Res. 42 (D1), D749–D755 (2014). Medline doi:10.1093/nar/gkt1196

29. R. Apweiler et al.; UniProt Consortium, Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 42 (D1), D191–D198 (2014). Medline doi:10.1093/nar/gkt1140

30. F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Söding, J. D. Thompson, D. G. Higgins, Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011). Medline doi:10.1038/msb.2011.75

31. W. J. Wilbur, D. J. Lipman, Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80, 726–730 (1983). Medline doi:10.1073/pnas.80.3.726

32. V. Ranwez, S. Harispe, F. Delsuc, E. J. Douzery, MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLOS ONE 6, e22594 (2011). Medline doi:10.1371/journal.pone.0022594

33. J. E. Karro, Y. Yan, D. Zheng, Z. Zhang, N. Carriero, P. Cayting, P. Harrison, M. Gerstein, Pseudogene.org: A comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res. 35 (suppl 1), D55–D60 (2007). Medline doi:10.1093/nar/gkl851


34. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). Medline doi:10.1038/nbt.1621

35. C. Trapnell, D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn, L. Pachter, Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013). Medline doi:10.1038/nbt.2450

36. J. F. Degner, J. C. Marioni, A. A. Pai, J. K. Pickrell, E. Nkadori, Y. Gilad, J. K. Pritchard, Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25, 3207–3212 (2009). Medline doi:10.1093/bioinformatics/btp579

37. M. D. Robinson, D. J. McCarthy, G. K. Smyth, edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010). Medline doi:10.1093/bioinformatics/btp616

38. J. R. Lupski, Hotspots of homologous recombination in the human genome: Not all homologous sequences are equal. Genome Biol. 5, 2004–2005 (2004). doi:10.1186/gb-2004-5-10-242

39. B. Korber, Computational Analysis of HIV Molecular Sequences (Kluwer Academic Publishers, Dordrecht, 2000).

40. M. Nei, T. Gojobori, Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3, 418–426 (1986). Medline

41. S. Ganeshan, R. E. Dickover, B. T. Korber, Y. J. Bryson, S. M. Wolinsky, Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virol. 71, 663–677 (1997). Medline

42. F. Ronquist, J. P. Huelsenbeck, MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574 (2003). Medline doi:10.1093/bioinformatics/btg180

43. J. Zhang, Evolution by gene duplication: An update. Trends Ecol. Evol. 18, 292–298 (2003). doi:10.1016/S0169-5347(03)00033-8

44. M. Lynch, B. Walsh, The Origins of Genome Architecture, volume 98 (Sinauer Associates, Sunderland, MA, 2007).

45. H. Kaessmann, Origins, evolution, and phenotypic impact of new genes. Genome Res. 20, 1313–1326 (2010). Medline doi:10.1101/gr.101386.109

46. S. Ohno et al., Evolution by Gene Duplication (George Allen & Unwin Ltd., London; Springer-Verlag, Berlin, Heidelberg, New York, 1970).

47. P. W. H. Holland, J. Garcia-Fernandez, N. A. Williams, A. Sidow, Gene duplications and the origins of vertebrate development. Development 1994 (Supplement), 125–133 (1994).

48. H. H. Kazazian Jr., J. V. Moran, The impact of L1 retrotransposons on the human genome. Nat. Genet. 19, 19–24 (1998). Medline doi:10.1038/ng0598-19

49. C. Esnault, J. Maestre, T. Heidmann, Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24, 363–367 (2000). Medline doi:10.1038/74184

50. H. Kaessmann, N. Vinckenbosch, M. Long, RNA-based gene duplication: Mechanistic and evolutionary insights. Nat. Rev. Genet. 10, 19–31 (2009). Medline doi:10.1038/nrg2487


51. J. Nathans, D. Thomas, D. S. Hogness, Molecular genetics of human color vision: The genes encoding blue, green, and red pigments. Science 232, 193–202 (1986). Medline doi:10.1126/science.2937147

52. P. F. Chance, N. Abbas, M. W. Lensch, L. Pentao, B. B. Roa, P. I. Patel, J. R. Lupski, Two autosomal dominant neuropathies result from reciprocal DNA duplication/deletion of a region on chromosome 17. Hum. Mol. Genet. 3, 223–228 (1994). Medline doi:10.1093/hmg/3.2.223

53. E. E. Eichler, Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17, 661–669 (2001). Medline doi:10.1016/S0168-9525(01)02492-1

54. J. A. Lee, C. M. Carvalho, J. R. Lupski, A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 1235–1247 (2007). Medline doi:10.1016/j.cell.2007.11.037

55. M. Lynch, J. S. Conery, The origins of genome complexity. Science 302, 1401–1404 (2003). Medline doi:10.1126/science.1089370

56. A. E. Vinogradov, Large scale of human duplicate genes divergence. J. Mol. Evol. 75, 25–33 (2012). Medline doi:10.1007/s00239-012-9516-1

57. C. Roth, S. Rastogi, L. Arvestad, K. Dittmar, S. Light, D. Ekman, D. A. Liberles, Evolution after gene duplication: Models, mechanisms, sequences, systems, and organisms. J. Exp. Zool. B Mol. Dev. Evol. 308, 58–73 (2007). Medline doi:10.1002/jez.b.21124

58. P. H. Sudmant, T. Rausch, E. J. Gardner, R. E. Handsaker, A. Abyzov, J. Huddleston, Y. Zhang, K. Ye, G. Jun, M. Hsi-Yang Fritz, M. K. Konkel, A. Malhotra, A. M. Stütz, X. Shi, F. Paolo Casale, J. Chen, F. Hormozdiari, G. Dayama, K. Chen, M. Malig, M. J. Chaisson, K. Walter, S. Meiers, S. Kashin, E. Garrison, A. Auton, H. Y. Lam, X. Jasmine Mu, C. Alkan, D. Antaki, T. Bae, E. Cerveira, P. Chines, Z. Chong, L. Clarke, E. Dal, L. Ding, S. Emery, X. Fan, M. Gujral, F. Kahveci, J. M. Kidd, Y. Kong, E. W. Lameijer, S. McCarthy, P. Flicek, R. A. Gibbs, G. Marth, C. E. Mason, A. Menelaou, D. M. Muzny, B. J. Nelson, A. Noor, N. F. Parrish, M. Pendleton, A. Quitadamo, B. Raeder, E. E. Schadt, M. Romanovitch, A. Schlattl, R. Sebra, A. A. Shabalin, A. Untergasser, J. A. Walker, M. Wang, F. Yu, C. Zhang, J. Zhang, X. Zheng-Bradley, W. Zhou, T. Zichner, J. Sebat, M. A. Batzer, S. A. McCarroll, R. E. Mills, M. B. Gerstein, A. Bashir, O. Stegle, S. E. Devine, C. Lee, E. E. Eichler, J. O. Korbel; 1000 Genomes Project Consortium, An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015). Medline

59. A. L. Hughes, The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. London Ser. B 256, 119–124 (1994). doi:10.1098/rspb.1994.0058

60. M. W. Hahn, Distinguishing among evolutionary models for the maintenance of gene duplicates. J. Hered. 100, 605–617 (2009). Medline doi:10.1093/jhered/esp047

61. J. M. Lambert, W. O. Cochran, B. M. Wilde, K. G. Olsen, C. D. Cooper, Evidence for widespread subfunctionalization of splice forms in vertebrate genomes. Genome Res. 2015, 184473 (2015). doi:10.1101/gr.184473.114

62. J. Altschmied, J. Delfgaauw, B. Wilde, J. Duschl, L. Bouneau, J.-N. Volff, M. Schartl, Subfunctionalization of duplicate MITF genes associated with differential degeneration of alternative exons in fish. Genetics 161, 259–267 (2002). Medline

63. W. P. Yu, S. Brenner, B. Venkatesh, Duplication, degeneration and subfunctionalization of the nested synapsin-Timp genes in Fugu. Trends Genet. 19, 180–183 (2003). Medline doi:10.1016/S0168-9525(03)00048-9

64. K. A. Hultman, N. Bahary, L. I. Zon, S. L. Johnson, Gene duplication of the zebrafish kit ligand and partitioning of melanocyte development functions to kit ligand a. PLOS Genet. 3, e17 (2007). Medline

65. A. N. Marshall, M. C. Montealegre, C. Jiménez-López, M. C. Lorenz, A. van Hoof, Alternative splicing and subfunctionalization generates functional diversity in fungal proteomes. PLOS Genet. 9, e1003376 (2013). Medline doi:10.1371/journal.pgen.1003376

66. R. N. Nurtdinov, I. I. Artamonova, A. A. Mironov, M. S. Gelfand, Low conservation of alternative splicing patterns in the human and mouse genomes. Hum. Mol. Genet. 12, 1313–1320 (2003). Medline doi:10.1093/hmg/ddg137

67. B. Modrek, C. J. Lee, Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat. Genet. 34, 177–180 (2003). Medline doi:10.1038/ng1159

68. R. Sorek, R. Shamir, G. Ast, How prevalent is functional alternative splicing in the human genome? Trends Genet. 20, 68–71 (2004). Medline doi:10.1016/j.tig.2003.12.004

69. N. L. Barbosa-Morais, M. Irimia, Q. Pan, H. Y. Xiong, S. Gueroussov, L. J. Lee, V. Slobodeniuc, C. Kutter, S. Watt, R. Colak, T. Kim, C. M. Misquitta-Ali, M. D. Wilson, P. M. Kim, D. T. Odom, B. J. Frey, B. J. Blencowe, The evolutionary landscape of alternative splicing in vertebrate species. Science 338, 1587–1593 (2012). Medline doi:10.1126/science.1230612

70. J. Merkin, C. Russell, P. Chen, C. B. Burge, Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science 338, 1593–1599 (2012). Medline doi:10.1126/science.1228186

71. J. P. Huelsenbeck, F. Ronquist, MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001). Medline doi:10.1093/bioinformatics/17.8.754

72. F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Hohna, B. Larget, L. Liu, M. A. Suchard, J. P. Huelsenbeck, MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61, 539–542 (2012). Medline doi:10.1093/sysbio/sys029

73. Z. Yang, PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). Medline

74. Z. Yang, PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007). Medline doi:10.1093/molbev/msm088

75. R. M. Kuhn, D. Karolchik, A. S. Zweig, H. Trumbower, D. J. Thomas, A. Thakkapallayil, C. W. Sugnet, M. Stanke, K. E. Smith, A. Siepel, K. R. Rosenbloom, B. Rhead, B. J. Raney, A. Pohl, J. S. Pedersen, F. Hsu, A. S. Hinrichs, R. A. Harte, M. Diekhans, H. Clawson, G. Bejerano, G. P. Barber, R. Baertsch, D. Haussler, W. J. Kent, The UCSC genome browser database: Update 2007. Nucleic Acids Res. 35 (suppl 1), D668–D673 (2007). Medline doi:10.1093/nar/gkl928

76. T. Makino, K. Hokamp, A. McLysaght, The complex relationship of gene duplication and essentiality. Trends Genet. 25, 152–155 (2009). Medline doi:10.1016/j.tig.2009.03.001

77. J. Ernst, M. Kellis, ChromHMM: Automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012). Medline doi:10.1038/nmeth.1906

78. J. Lonsdale, J. Thomas, M. Salvatore, R. Phillips, E. Lo, S. Shad, R. Hasz, G. Walters, F. Garcia, N. Young, B. Foster, M. Moser, E. Karasik, B. Gillard, K. Ramsey, S. Sullivan, J. Bridge, H. Magazine, J. Syron, J. Fleming, L. Siminoff, H. Traino, M. Mosavel, L. Barker, S. Jewell, D. Rohrer, D. Maxim, D. Filkins, P. Harbach, E. Cortadillo, B. Berghuis, L. Turner, E. Hudson, K. Feenstra, L. Sobin, J. Robb, P. Branton, G. Korzeniewski, C. Shive, D. Tabor, L. Qi, K. Groch, S. Nampally, S. Buia, A. Zimmerman, A. Smith, R. Burges, K. Robinson, K. Valentino, D. Bradbury, M. Cosentino, N. Diaz-Mayoral, M. Kennedy, T. Engel, P. Williams, K. Erickson, K. Ardlie, W. Winckler, G. Getz, D. DeLuca, D. MacArthur, M. Kellis, A. Thomson, T. Young, E. Gelfand, M. Donovan, Y. Meng, G. Grant, D. Mash, Y. Marcus, M. Basile, J. Liu, J. Zhu, Z. Tu, N. J. Cox, D. L. Nicolae, E. R. Gamazon, H. K. Im, A. Konkashbaev, J. Pritchard, M. Stevens, T. Flutre, X. Wen, E. T. Dermitzakis, T. Lappalainen, R. Guigo, J. Monlong, M. Sammeth, D. Koller, A. Battle, S. Mostafavi, M. McCarthy, M. Rivas, J. Maller, I. Rusyn, A. Nobel, F. Wright, A. Shabalin, M. Feolo, N. Sharopova, A. Sturcke, J. Paschal, J. M. Anderson, E. L. Wilder, L. K. Derr, E. D. Green, J. P. Struewing, G. Temple, S. Volpi, J. T. Boyer, E. J. Thomson, M. S. Guyer, C. Ng, A. Abdallah, D. Colantuoni, T. R. Insel, S. E. Koester, A. R. Little, P. K. Bender, T. Lehner, Y. Yao, C. C. Compton, J. B. Vaught, S. Sawyer, N. C. Lockhart, J. Demchok, H. F. Moore; GTEx Consortium, The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013). Medline doi:10.1038/ng.2653

79. J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, B. Ren, Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012). Medline doi:10.1038/nature11082
