www.sciencemag.org/content/352/6288/1009/suppl/DC1
Supplementary Materials for
Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals
Xun Lan* and Jonathan K. Pritchard*
*Corresponding author. Email: [email protected] (X.L.); [email protected] (J.K.P.)
Published 20 May 2016, Science 352, 1009 (2016) DOI: 10.1126/science.aad8411
This PDF file includes:
Materials and Methods Supplemental Text Figs. S1 to S27 Tables S1 to S5 References
Other Supplementary Material for this manuscript includes the following: (available at www.sciencemag.org/cgi/content/full/352/6288/1009/DC1)
Data Files S1 to S3 as Excel Files
Contents
1 Materials and Methods
1.1 Data used in this study
1.2 Identification of duplicated genes
1.3 Estimating expression levels of duplicated genes
1.4 Phylogenetically-based dating of duplicated genes
2 Supplemental Text
2.1 Basic characteristics of duplicated genes
2.2 Patterns of gene expression in human and mouse tissues
2.3 Expression differences between humans and other species
2.4 Rapid downregulation of expression in duplicates
2.5 Effect of expression patterns on disease burden
2.6 Differential splicing in duplicates
2.7 Long-term selective constraint on duplicated genes (dN/dS)
2.8 Current selective constraint on duplicated genes (SFS in humans)
2.9 Effect of translocation on regulatory divergence of duplicates
3 Supplemental Figures
4 Supplemental Tables
5 List of Supplementary Files
6 References
1 Materials and Methods
1.1 Data used in this study
The main RNA-seq data used in this study were from the GTEx Pilot Phase (11). The 46
GTEx tissues included were: adipose subcutaneous, adipose tissue, adrenal gland, artery tibial,
blood, blood vessel, brain amygdala, brain anterior cingulate cortex BA24, brain caudate basal
ganglia, brain cerebellar hemisphere, brain cerebellum, brain cortex, brain frontal cortex BA9,
brain hippocampus, brain hypothalamus, brain nucleus accumbens basal ganglia, brain putamen
basal ganglia, brain spinal cord cervical c-1, brain substantia nigra, breast, breast mammary
tissue, colon, esophagus, esophagus mucosa, esophagus muscularis, heart, heart left ventricle,
kidney, liver, lung, muscle, muscle (skeletal), nerve tibial, ovary, pancreas, pituitary, prostate,
skin, skin (sun-exposed lower leg), spleen, stomach, testis, thyroid, uterus, vagina, and whole
blood (Supplementary File 1).
We processed data from 10 different individuals for most tissues, as this allowed us to ob-
tain nearly uniform sample size across all 46 tissues, while providing good power to detect
differential expression between tissues. We used a smaller sample size for kidney (8
individuals) and a larger one for uterus and esophagus muscularis (11 individuals).
Additionally, for supplementary analyses, we used mouse tissue expression data produced
by Babak et al. (12). Supplementary File 2 summarizes the data we used from Babak et al. We
thank Hunter Fraser for prepublication access to these data. To compare gene expression
between species, we used RNA-seq data from six tissues (brain, cerebellum, heart, kidney,
liver, and testis) in human, macaque and mouse (24). We also used histone ChIP-seq data from
24 human developmental and adult tissues collected by the Roadmap
Epigenomics Project (26, 27). Table S1 lists the genome assemblies and gene annotations
used in this study.
1.2 Identification of duplicated genes
We first constructed a database of duplicate protein-coding gene pairs in the human genome.
In brief, we identified 1,444 duplicate pairs that are reciprocal best matches, have at least 80%
aligned coding sequence, and for which neither gene is annotated as a pseudogene. Additional
filters are described below. Our analysis uses coding sequences only, not UTRs, because coding
sequences are on average much more conserved and easier to align.
Prescreening for candidate duplicate gene pairs. Our strategy for identifying high quality
duplicate genes started with an initial filtering step comparing all genes against all others to
identify candidate gene pairs with substantial sequence similarity at the nucleotide or amino acid
level. The human genome assembly GRCh37 (hg19) and the matching transcript annotation
from the Ensembl project (28) were used to build a matrix of coding sequence distances between
all Ensembl genes (56,486 coding sequences). The complete set of protein sequences encoded
by the human genome from the Universal Protein Resource (UniProt) database (29) was used
to build a pairwise protein sequence distance matrix (88,277 protein sequences).
Pairwise sequence distance matrices were calculated using the software Clustal Omega 1.2.0
(30). The K-tuple measure (31) is used by the Clustal Omega software for fast computation of
pairwise distances of thousands of sequences. The pairwise similarity scores generated by the
K-tuple algorithm are then converted to distance scores with an added penalty on gaps. Gene
pairs with either a nucleotide or amino acid sequence distance score lower than 0.6 were kept
for further analysis. The cutoff of 0.6 was chosen to balance sensitivity of detecting duplicate
genes against the computational burden of subsequent analyses. The command line used to
generate the distance matrices was:
$ clustalo.modified -i [the input fasta file] --distmat-out=prefix.distanceMatrix --guidetree-out=prefix.dnd --force --full -o prefix.clustaloOut
These filters resulted in 242,575 candidate gene pairs.
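The prescreening step can be illustrated with a short sketch. This is not the pipeline used in the study: it assumes the distance matrix is stored in a square, PHYLIP-like layout (a count line followed by one row per sequence), and the function names `read_distance_matrix` and `candidate_pairs` are hypothetical. Only the 0.6 cutoff comes from the text.

```python
from itertools import combinations

CUTOFF = 0.6  # distance cutoff stated in the text


def read_distance_matrix(text):
    """Parse a square matrix: first line is N, then N rows of 'name d1 ... dN'."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    n = int(lines[0])
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return names, rows


def candidate_pairs(names, rows, cutoff=CUTOFF):
    """Return sorted name pairs whose distance is strictly below the cutoff."""
    pairs = set()
    for i, j in combinations(range(len(names)), 2):
        if rows[i][j] < cutoff:
            pairs.add(tuple(sorted((names[i], names[j]))))
    return sorted(pairs)


# Toy matrix with made-up gene names and distances.
example = """3
geneA 0.00 0.25 0.90
geneB 0.25 0.00 0.70
geneC 0.90 0.70 0.00
"""
names, rows = read_distance_matrix(example)
print(candidate_pairs(names, rows))
```

Because a pair was kept if either its nucleotide or its amino acid distance was below the cutoff, the function would be run on each matrix and the two result sets combined after mapping protein identifiers to genes.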
Pairwise sequence alignment. In the next filtering stage, we performed multiple sequence
alignments of each candidate pair without codon awareness across all transcripts of each gene
pair using Clustal Omega 1.2.0. We kept the two transcripts that generated the largest number
of aligned nucleotides for each gene pair. Pairs with <100 aligned nucleotides or with >50%
uncorrected sequence divergence were removed. The command line used to perform multiple
sequence alignment without codon awareness was:
$ clustalo-1.2.0-Ubuntu-x86_64 -i [the input fasta file] --distmat-out=prefix.mat --guidetree-out=prefix.dnd --force --full -o prefix.clustaloOut
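The two removal thresholds in this stage (<100 aligned nucleotides, >50% uncorrected divergence) can be sketched as below. This is a minimal illustration that assumes the two transcripts are given as equal-length gapped strings from a pairwise alignment; the function names are hypothetical, and uncorrected divergence is taken here to be the fraction of gap-free columns that differ.

```python
def alignment_stats(seq_a, seq_b, gap="-"):
    """Count gap-free aligned columns and the fraction of them that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # columns with a gap in either sequence are not counted
        aligned += 1
        if a != b:
            mismatches += 1
    divergence = mismatches / aligned if aligned else 1.0
    return aligned, divergence


def keep_pair(seq_a, seq_b, min_aligned=100, max_divergence=0.5):
    """Keep a pair unless it has <100 aligned nucleotides or >50% divergence."""
    aligned, divergence = alignment_stats(seq_a, seq_b)
    return aligned >= min_aligned and divergence <= max_divergence


print(keep_pair("ACGT" * 30, "ACGT" * 30))
```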
For the 20,191 gene pairs that remained after these filters, we performed a more computa-
tionally intensive, codon-aware alignment for each pair of transcripts using the software Mul-
tiple Alignment of Coding SEquences (MACSE) v0.9b1 (32). MACSE is a coding sequence
aligner that accounts for frameshifts and stop codons. Briefly, MACSE optimizes the alignment
of two sequences by minimizing a weighted sum of costs for frameshifts, deletions, stop codons,
and amino acid substitutions. The command line for pairwise alignment with codon awareness was:
$ java -Xms8m -Xmx10g -Xss4m -jar macse_v0.9b1.jar -i [the input fasta file] -o outputDirectory
Filtering of high-confidence reciprocal best-hit duplicate pairs. Finally, we used a series
of additional criteria to identify a set of high-confidence duplicate pairs for analysis as follows.
(A) At least 80% of the coding sequences of both genes are aligned to the other (median =
1,122bp). (B) The sister duplicates are reciprocal-best hits, in the sense that both genes have
the lowest synonymous substitution rate (dS) to each other relative to all other human genes.
(C) Neither gene is classified as a pseudogene in a comprehensive pseudogene database (33).
(D) At least 50% of aligned nucleotides are identical between the two genes. Additionally we
excluded very short alignments, requiring at least one continuous aligned region of >100bp,
ignoring introns, and >200bp total aligned sequence.
These criteria produced 1,444 high quality reciprocal best-hit duplicate pairs. Our expression
analysis further excludes 190 pairs with no uniquely-mappable positions and 60 pairs for
which one or both genes were unexpressed in all tissues and may thus represent unannotated
pseudogenes, leaving 1,194 pairs.
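Criterion (B) above, reciprocal best hits by dS, can be sketched as follows. This is an illustration under simple assumptions: pairwise dS values are supplied as a dictionary keyed by gene-name tuples, ties are broken by whichever pair is seen first, and the name `reciprocal_best_hits` is hypothetical.

```python
def reciprocal_best_hits(ds):
    """ds maps (gene1, gene2) -> dS. Return pairs that are mutual lowest-dS partners."""
    best = {}  # gene -> (lowest dS seen so far, partner with that dS)
    for (g1, g2), value in ds.items():
        for gene, partner in ((g1, g2), (g2, g1)):
            if gene not in best or value < best[gene][0]:
                best[gene] = (value, partner)
    hits = set()
    for gene, (_, partner) in best.items():
        if best[partner][1] == gene:  # the partner's best hit points back
            hits.add(tuple(sorted((gene, partner))))
    return sorted(hits)


# Toy dS values with made-up gene names.
toy = {("A", "B"): 0.1, ("A", "C"): 0.5, ("C", "D"): 0.3, ("B", "D"): 0.9}
print(reciprocal_best_hits(toy))
```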
1.3 Estimating expression levels of duplicated genes
Measuring the expression levels of duplicate genes using RNA-seq data can be difficult for
recently duplicated pairs. RNA-seq reads from these genes may map equally well to both copies
(and potentially to other paralogs), making it difficult to estimate accurately either the absolute
expression levels or the relative expression levels of the copies.
One standard approach to this problem is via Cufflinks (34, 35), which performs probabilistic
assignment of ambiguous reads (through option -u). However, in simulations, we found that the
expression ratios of duplicate pairs estimated by Cufflinks (v2.2.1) often diverged substantially
from the correct values (Figure S1).
To overcome this issue, we developed a bespoke mapping pipeline based on methods for
measuring allele-specific expression (14, 36). The essential idea is that we only consider reads
that map uniquely within the aligned region of one duplicate, and for which reads derived from
the corresponding location in the sister duplicate would also map uniquely to the sister dupli-
cate (Figure S2). We define sites for which all overlapping reads are reciprocally unambiguous
as reciprocally unambiguous sites. The normalized average coverage of all reciprocally unam-
biguous sites of a duplicate gene was then used as a measure of expression level of that gene.
This measure gives us very high confidence in the mapped reads that we use, and hence in
significant expression differences between duplicates, albeit at the cost of not attempting to make
use of ambiguous reads.
Performance assessment. To assess the accuracy of our estimation pipeline, we performed
simulations comparing our new method to Cufflinks. To identify unambiguous sites in duplicate
genes, we simulated 76bp (same read length as the GTEx data) single-end reads starting from
every position of the coding sequences. We then mapped these reads to the human GRCh37
transcriptome using Tophat v2.0.12 with flags “--prefilter-multihits” and “--report-secondary-alignments”.
The maximum number of mismatches was set to 4 and up to 40 alignments for
each read were reported. We defined a pair of aligned nucleotides as a reciprocally unambiguous
site if all 76bp reads covering the nucleotide in both duplicates were uniquely mappable to the
correct gene.
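The definition of reciprocally unambiguous sites can be made concrete with a small sketch. It assumes that, for each gene, we already have one boolean per simulated read start saying whether that read mapped uniquely to the correct gene, that reads start at positions 0 to (sequence length minus read length), and that aligned nucleotides are given as index pairs. All names are hypothetical.

```python
READ_LEN = 76  # read length stated in the text


def unambiguous_sites(read_ok, read_len=READ_LEN):
    """read_ok[i] is True if the read starting at position i maps uniquely to the
    correct gene. A site is unambiguous if every read covering it is ok."""
    n_sites = len(read_ok) + read_len - 1  # length of the underlying sequence
    sites = []
    for p in range(n_sites):
        first = max(0, p - read_len + 1)   # earliest read start covering p
        last = min(p, len(read_ok) - 1)    # latest read start covering p
        sites.append(all(read_ok[first:last + 1]))
    return sites


def reciprocally_unambiguous(sites_a, sites_b, aligned_columns):
    """Keep aligned (pos_a, pos_b) pairs that are unambiguous in both duplicates."""
    return [(i, j) for i, j in aligned_columns if sites_a[i] and sites_b[j]]


# Toy example with 3 bp reads on a 6 bp sequence.
sites_a = unambiguous_sites([True, True, False, True], read_len=3)
sites_b = [True] * 6
print(reciprocally_unambiguous(sites_a, sites_b, [(0, 0), (2, 2), (5, 4)]))
```

A single ambiguous read removes every site it covers, which is why this definition is conservative for recently duplicated genes.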
For read alignment of the RNA-seq data, we used Tophat2 with the same parameters as in
the simulation except that the maximum number of alignments reported for each read was set to
20. For each RNA-seq experiment, we counted the coverage of all unambiguous sites as defined
above. A gene’s expression level was estimated using the average coverage of all unambiguous
sites in that gene. The expression level was then normalized to the average read depth per
billion reads for each sample based on the effective library size estimated by edgeR (37) and
read length of that sample. Thus, average read depth per billion reads is comparable to FPKM.
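One plausible reading of this normalization, offered as an assumption rather than the exact formula used, is to scale the mean coverage of unambiguous sites by the effective library size and the read length. Under that reading the value matches FPKM for a uniformly covered gene, as the toy calculation below shows. The function name is hypothetical.

```python
def depth_per_billion_reads(site_coverage, effective_library_size, read_length):
    """Mean coverage over unambiguous sites, scaled per billion reads and per read base.

    Assumed formula: mean_coverage * 1e9 / (effective_library_size * read_length).
    """
    mean_coverage = sum(site_coverage) / len(site_coverage)
    return mean_coverage * 1e9 / (effective_library_size * read_length)


# Toy check: 100 reads of 76 bp spread evenly over a 1,000 bp gene in a library of
# one million reads gives a mean depth of 7.6 and an FPKM of 100.
value = depth_per_billion_reads([7.6] * 50, effective_library_size=1e6, read_length=76)
print(value)
```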
In summary, we found that for all expression levels (FPKM=10, 1 and 0.1), our method
is less noisy than Cufflinks (Figure S1). We suspect that the point estimates from maximum
likelihood estimation may be unstable in this kind of setting when a large fraction of the reads
have ambiguous mappings. Thus, our more conservative approach appears to be much less
noisy.
Classification of expression patterns of sister duplicates. To classify the expression patterns
of duplicate genes we used the following criteria. Gene pairs with no reciprocally unambigu-
ous sites were classified as unmappable. Otherwise, we compared expression between the two
duplicates for each tissue. A gene’s expression was considered to be significantly higher than
its sister gene in the same tissue if the median expression ratio was at least two-fold and the
p-value from a paired t-test on log transformed expression across samples was less than 0.001.
We divided the duplicate gene pairs into three classes according to their expression patterns:
i) Sub-/neofunctionalized pairs, in which each of the two duplicates was significantly more
expressed in at least one tissue (Figure 2A, Figure S10A); ii) Asymmetrically Expressed Du-
plicates (AEDs), in which one duplicate was expressed higher than its sister gene in at least
one third of the tissues where one or both gene(s) were expressed and its expression was not
significantly different from its sister gene in other tested tissues (Figure 2B, Figure S10B); iii)
all other duplicates were classified as No difference pairs, i.e., one gene was expressed higher in
less than one third of tissues where one or both gene(s) were expressed and the two genes were
not expressed significantly differently in other tissues. In the main text, we also refer to AEDs and
no difference pairs as not diverged pairs.
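The three-way classification can be sketched as a small decision rule. The sketch assumes the per-tissue median expression ratio and paired t-test p-value have already been computed, and that only tissues in which one or both genes are expressed are passed in. Function names and return labels are hypothetical.

```python
def tissue_call(median_ratio, p_value, fold=2.0, alpha=0.001):
    """+1 if gene A is significantly higher, -1 if gene B is, 0 otherwise.

    median_ratio is expression of A over B; thresholds follow the text."""
    if p_value < alpha:
        if median_ratio >= fold:
            return 1
        if median_ratio <= 1 / fold:
            return -1
    return 0


def classify_pair(calls):
    """calls: one tissue_call result per tissue where one or both genes are expressed."""
    if not calls:
        raise ValueError("no tested tissues for this pair")
    a_higher = sum(1 for c in calls if c > 0)
    b_higher = sum(1 for c in calls if c < 0)
    if a_higher > 0 and b_higher > 0:
        return "sub-/neofunctionalized"
    if max(a_higher, b_higher) * 3 >= len(calls):  # at least one third of tissues
        return "AED"
    return "no difference"


print(classify_pair([1, 1, 0, 0, 0, 0]))
```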
1.4 Phylogenetically-based dating of duplicated genes
Estimating precise ages of duplications is challenging. One approach is to use the sharing of
duplicates across species to provide anchor points for the ages of duplications; however this
approach may be misleading if a duplication has been lost in other lineages or if a gene has
duplicated independently more than once. A second approach is to use synonymous divergence
dS between human copies as a molecular clock; however the dS clock may be downwardly
biased due to nonallelic homologous gene conversion between paralogs (38), especially for
very young proximal duplications. In the main paper we use dS as a measure of duplication
age; we show here that, while noisy, it does provide a useful proxy for time.
To provide a measuring stick for interpreting dS values, we built a tree of synonymous
divergence among singleton protein coding genes (genes with no duplicates) in 9 species, in-
cluding human (Figure S3). We identified orthologs and aligned them using Clustal Omega
and estimated dS using Synonymous Non-synonymous Analysis Program (SNAP) v2.1.1 (39)
(www.hiv.lanl.gov), which applies methods from (40, 41). Duplication and loss of duplicates
after the divergence of species can result in ortholog pairs with dS significantly higher than the
average divergence between the two species. To avoid these cases, we removed genes with discordant
distances. The remaining genes were used to construct a species tree (Figure S3). The figure indicates
twice the human-lineage divergence from key points on the tree; these values approximate the expected
divergence of duplicates that arose at the corresponding times in the absence of gene conversion.
In summary, as a rough rule of thumb, we can expect that duplicates with dS ∼0.4 are likely
to have occurred around the time of the human-mouse split, while duplicates with dS ∼ 1.0
likely predate the origin of placental mammals.
As a second approach to this problem, we used MrBayes (42) to build gene trees of du-
plicates and their orthologs in different species to identify cases in which both members of a
duplicate pair are shared with outgroups (Figure S4). For human duplicates that arose after the
last common ancestor of humans and species X, there should be no cases in which both dupli-
cates are shared with X (aside from occasional classification errors). For human duplicates that
arose before the split, most should be shared, aside from any that were subsequently lost on
the lineage leading to X. As may be seen in the figure, for duplicates with dS <∼0.35, only a
few are shared with mouse, while for duplicates with dS >∼ 0.45 most are shared with mouse.
This is broadly consistent with the expectations from the singleton gene tree, in which twice
the human-lineage dS since the human-mouse split is 0.45 at singleton genes (and bearing in
mind that gene conversion is likely to reduce observed dS). Sharing of duplicates with opossum
(average dS at singletons= 1.0) is also roughly consistent.
We next extended this phylogenetic approach to infer the most likely internal branches on
which the duplications occurred. We divided the duplicate pairs into 5 groups with 4 break
time points, i.e. the split time of chicken, opossum, mouse, and macaque from the human
lineage. For example, if we found an orthologous duplication in human and macaque but not
in the other species, then we hypothesized that this duplication likely occurred on the branch
between the human-macaque Last Common Ancestor (LCA) and human-mouse LCA. Out of
the 1,194 mappable duplicate pairs, we obtained sensible inferences for 732 pairs (discrepancies
may arise due to parallel losses of duplications, or failing our quality controls for alignment
accuracy, etc.). We then filtered out pairs with dS values that were particularly high or low for
the inferred branch. Such events may indicate parallel gains or losses of duplications. This
filtering step left us with 480 high-confidence duplicate pairs. We then repeated most of the
analyses that supported our main conclusions using the new categorization instead of groups
defined using dS ranges.
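The branch assignment logic can be illustrated with a simplified sketch. It replaces the gene-tree analysis with a plain presence/absence rule: a pattern counts as sensible only if sharing is nested, so that a duplication shared with a distant outgroup is also shared with every closer one. The function name and the returned labels are hypothetical.

```python
# Outgroups ordered from the closest to the most distant split from the human lineage.
OUTGROUPS = ["macaque", "mouse", "opossum", "chicken"]

BRANCHES = [
    "after human-macaque LCA",
    "between human-macaque LCA and human-mouse LCA",
    "between human-mouse LCA and human-opossum LCA",
    "between human-opossum LCA and human-chicken LCA",
    "before human-chicken LCA",
]


def infer_branch(shared_with):
    """shared_with: set of outgroups in which both duplicates have orthologs.

    Returns one of the five branch labels, or None for a non-nested pattern."""
    flags = [species in shared_with for species in OUTGROUPS]
    k = sum(flags)
    if flags != [True] * k + [False] * (len(OUTGROUPS) - k):
        return None  # e.g. shared with mouse but not macaque
    return BRANCHES[k]


print(infer_branch({"macaque"}))
```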
Overall we found that the results from phylogeny-based analyses were consistent with dS-
based analyses (Figures S5–S7). Because our phylogenetic analysis removed a large number of
genes, we reported the results of dS-based analyses in the main text.
2 Supplemental Text
2.1 Basic characteristics of duplicated genes
Three important mechanisms of duplication are whole genome duplication, segmental dupli-
cation and retrotransposition (43–45). It is believed that two rounds of whole genome dupli-
cation occurred in the early evolution of the vertebrates (46, 47); however these events likely
preceded the origin of most duplications considered here. Retrotransposition happens when a
cellular mRNA is reverse transcribed by viral reverse transcriptase, followed by reintegration
of the cDNA into the host genome (48–50). A hallmark of retrotransposition is an intronless
coding region with a poly-A tail. Segmental duplication refers to duplication of large chunks
of genomic DNA. Segmental duplications may arise through unequal crossing-over between
homologous chromosomes (18, 51–53) or by replication slippage, which may occur through a
mechanism called replication Fork Stalling and Template Switching (FoSTeS) (54).
To identify duplicates that likely arose through segmental duplication or retrotransposition,
respectively, we classified duplicate genes as follows. (Our criteria were chosen to provide
confident support for either duplication of the entire genic structure or elimination of multiple
introns.) Gene pairs were inferred to be likely segmental duplications if the following three
criteria were met: 1. Both genes have at least 3 exons; 2. The two genes have less than 20% dif-
ference in the exon numbers; 3. More than 80% of the exon junctions are consistent between the
two genes (within 10bp distance). Pairs were inferred to be likely retrotransposed duplications
if one gene contains only one exon and the other gene contains at least 3 exons. We found that
segmental duplications are much more prevalent (nearly 8 fold) in the human genome compared
to retrotranspositions. Pairs meeting neither set of criteria were left as unclassified, though we
speculate that most of these likely derive from segmental duplications (for example, notice that
most young unclassified duplicates are found in tandem).
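The classification criteria above can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: exon junctions are given as positions in the coordinates of the pairwise coding alignment, the exon-number difference is measured relative to the larger exon count, and junction consistency is the fraction of all junctions (from both genes) that have a counterpart within 10bp in the other gene. The function name is hypothetical.

```python
def classify_mechanism(junctions_a, junctions_b, tolerance=10):
    """junctions_*: exon-junction positions in alignment coordinates.

    A gene with k junctions has k + 1 exons."""
    exons_a, exons_b = len(junctions_a) + 1, len(junctions_b) + 1
    few, many = min(exons_a, exons_b), max(exons_a, exons_b)
    if few == 1 and many >= 3:
        return "retrotransposed"
    if few >= 3:
        exon_diff = (many - few) / many  # assumed: relative to the larger count

        def matched(src, dst):
            return sum(1 for x in src if any(abs(x - y) <= tolerance for y in dst))

        total = len(junctions_a) + len(junctions_b)
        consistent = (matched(junctions_a, junctions_b)
                      + matched(junctions_b, junctions_a)) / total
        if exon_diff < 0.2 and consistent > 0.8:
            return "segmental"
    return "unclassified"


print(classify_mechanism([100, 250, 400], [102, 251, 395]))
```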
Among the reciprocal best-hit pairs, 963 are segmental and 86 are retrotransposed duplicates (Figure S8).
We used dS between sister duplicates as a molecular clock and divided the pairs into
groups by age of duplication. As observed previously, there is a marked peak of very
young duplications (low dS), likely because most duplications are relatively short-lived over
evolutionary time (55–57).
As expected, most retrotransposed genes are found on different chromosomes. In contrast,
young segmentally duplicated pairs tend to be found close together in the genome; these seem
to be gradually translocated to different chromosomes as genome rearrangements occur (Fig-
ure S8A). The overall patterns are similar in mouse (Figure S9).
2.2 Patterns of gene expression in human and mouse tissues
In this section we expand on the expression analyses presented in the main text. Figure S10
presents an expanded version of Main Figure 1, showing the same genes in more tissues. Fig-
ure S11 provides an expanded version of Main Figure 2A, illustrating expression patterns across
the full set of GTEx tissues. To minimize the impact of outliers, we used the median expression
level of all individuals in a tissue for calculation of the log ratio in Figure 2, Figure S11 and
Figure S12. To avoid zero counts in the log ratios, a pseudocount of 0.5 was added to all
expression counts. For plotting purposes, the log ratio was set to 0 if the Student’s
t-test showed no significant difference (p > 0.001) in expression level between sister duplicates.
A gene was defined to be “major” if it was significantly more highly expressed than its sister
gene in more tissues than the reverse. If the two genes were expressed significantly higher in an equal number
of tissues, the one with higher mean expression across tissues was defined as the major gene.
A notable pattern in Figure 2A and Figure S11B is that regardless of functional category,
most duplicate pairs show asymmetric expression: for dS < 0.1, minor genes are expressed at
a median level of 40.5% of major genes across all tissues, and for 0.1 < dS < 0.8 minor genes
are expressed at just 33.5% of major genes. Here “minor” genes are defined as the genes with
lower median expression, so even with no systematic asymmetry we would expect this ratio to
be < 1. We therefore conducted permutations in which we randomly flipped the expression of
major and minor genes in each tissue. We defined major and minor genes and plotted the ratios
in the same way as for the real data. The average median expression ratio in the permuted data
was 83.3%, thus indicating that the duplicates, as a group, show significant levels of asymmetric
expression.
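The permutation logic can be sketched in a few lines. This is an illustration under simplifying assumptions, not the analysis code: each pair is represented by two lists of per-tissue expression values, the minor gene is the one with the lower median across tissues, and the ratio is the minor median divided by the major median. All names and the toy data are hypothetical.

```python
import random
from statistics import median


def minor_major_ratio(expr_a, expr_b):
    """Ratio of the lower to the higher median expression across tissues."""
    low, high = sorted((median(expr_a), median(expr_b)))
    return low / high if high > 0 else float("nan")


def permuted_ratio(expr_a, expr_b, rng):
    """Randomly swap the two genes' values in each tissue, then recompute the ratio."""
    perm_a, perm_b = [], []
    for a, b in zip(expr_a, expr_b):
        if rng.random() < 0.5:
            a, b = b, a
        perm_a.append(a)
        perm_b.append(b)
    return minor_major_ratio(perm_a, perm_b)


def permutation_summary(pairs, n_perm=200, seed=0):
    """Return the observed median ratio and the mean of the permuted median ratios."""
    rng = random.Random(seed)
    observed = median(minor_major_ratio(a, b) for a, b in pairs)
    permuted = [median(permuted_ratio(a, b, rng) for a, b in pairs)
                for _ in range(n_perm)]
    return observed, sum(permuted) / n_perm


# Toy data: three pairs, eight tissues, one gene consistently five-fold higher.
toy_pairs = [([10.0] * 8, [2.0] * 8) for _ in range(3)]
observed, permuted_mean = permutation_summary(toy_pairs)
print(observed, permuted_mean)
```

An observed ratio well below the permuted average is what indicates asymmetry beyond that expected from labelling the lower-expressed gene as minor.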
We performed additional analyses to explore further aspects of our results, as well as their
robustness. We found that retrotransposed genes are much more likely to be asymmetrically
expressed than segmental duplications (Figure S12). This probably reflects the fact that retro-
transposed genes are likely to land in genomic locations without suitable regulatory elements.
We were also curious about whether expression breadth might affect the probability of sub-
functionalization. Specifically, we hypothesized that genes with narrow expression might have
higher regulatory complexity, and thus subfunctionalize at higher rates. In fact, if anything it
seems that the reverse is true, as narrowly expressed genes have relatively lower rates of sub-
functionalization (Figure S13).
We also investigated whether sampling more developmental stages might increase the evi-
dence for subfunctionalization. We applied the same procedures to survey the divergence pat-
tern of duplicate genes in mouse using RNA-seq data from 26 tissues, including 3 fetal tissues:
embryonic brain, placenta and yolk sac (12) (Figure S14). The result in mice is consistent with
our observations in humans, showing slow rates of sub-/neofunctionalization.
It is also worth noting that our results on selective constraint and disease associations
strongly support the inference that the class of genes that we have classified here as minor
AED genes are functionally less important and less constrained. This point would remain true
even if it were shown in future that they had increased expression in some unsampled tissue.
2.3 Expression differences between humans and other species
In the main paper we argue that the data do not support a model of duplicate preservation
through expression sub- or neofunctionalization. Instead, we suggest that the data are more
consistent with a model of preservation through dosage sharing. In this model, the two dupli-
cates combine to achieve the required expression level. One simple prediction of this model is
that, in the absence of changes in optimal expression level, we might expect the summed expres-
sion level of a pair of duplicate genes to be similar to the expression of singleton orthologs in
outgroup species (and hence the expression levels of the individual duplicates should generally
be lower than the corresponding singletons). Of course this prediction should not hold precisely
for all genes due to changes in optimal expression levels (and indeed such changes may help to
enable duplicate fixation in the first place).
To test this, we analyzed RNA-seq data of 6 tissues in three species, namely human, macaque
and mouse (24). Specifically, we first searched for duplicates in human with only one ortholog
in macaque (or separately, in mouse, for the mouse comparison). To increase the likelihood that
the duplication event is human lineage specific, we filtered out human duplicates with dS larger
than 0.1 and 0.5 respectively before comparing to singleton orthologs of macaque and mouse.
The expression values were normalized in the same way as the GTEx RNA-seq data. Next,
to compare the expression of human genes to macaque/mouse genes, we adjusted the expres-
sion values based on a linear regression model using genes that are singletons in both species.
The adjustment results in a mean expression ratio of 1:1 between orthologous genes that are
singletons in both species (Figure 4A). Using this analysis, we find that the expression of indi-
vidual duplicates in human is significantly lower than that of their singleton orthologs in both
macaque (p=1.5 × 10^-7, t-test) (Figure 4D) and mouse (p=1.2 × 10^-7, t-test) (Figure S15B).
The median summed expression of duplicates is very close to the expression of the singleton
orthologs in macaque and mouse (median expression ratio is 1.11 for both macaque and mouse
orthologs; these are significantly less than a 2:1 expression ratio, p=7.6 × 10^-6 for macaque and
p=5.5 × 10^-10 for mouse, t-test) (Figure S15).
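The cross-species adjustment and the summed-expression ratio can be sketched as follows. The exact regression model is not specified in the text, so this sketch assumes an ordinary least-squares fit on the log2 scale using genes that are singletons in both species, and assumes all expression values are positive. The function names and toy values are hypothetical.

```python
import math


def fit_log_linear(other, human):
    """Least-squares fit of log2(human) = a + b * log2(other) on singleton orthologs."""
    x = [math.log2(v) for v in other]
    y = [math.log2(v) for v in human]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def adjust(value, a, b):
    """Map an outgroup expression value onto the human scale."""
    return 2 ** (a + b * math.log2(value))


def summed_ratio(dup1, dup2, ortholog, a, b):
    """Summed expression of the two human duplicates over the adjusted ortholog."""
    return (dup1 + dup2) / adjust(ortholog, a, b)


# Toy singletons in which human values are exactly twice the outgroup values.
a, b = fit_log_linear([2, 4, 8, 16], [4, 8, 16, 32])
print(summed_ratio(9.0, 9.0, 8.0, a, b))
```

Under dosage sharing the summed ratio is expected to sit near 1, whereas simple doubling of copy number with no change in regulation would give a ratio near 2.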
The median dS of the duplicates that arose on the human branch since the human-macaque
split is ∼0.05. Down-regulation of these duplicates indicates that dosage-sharing evolved
quickly compared to sub-/neofunctionalization in expression. We next examined whether some
duplicates might be sub-/neofunctionalized at the protein level.
Three lines of evidence imply that sub-/neofunctionalization at the protein level also evolves
slowly compared to dosage-sharing, as follows. (All of these data below refer to a set of 27
duplicate pairs that arose on the human lineage since the human-macaque split and pass other
filters including for read mappability.)
(1) Figure S16: We see no evidence for adaptive protein evolution within this set of young
duplicates. The nonsynonymous divergence (dN) between the two human copies is less than or equal to the
synonymous divergence (dS) for all 27 pairs. Moreover, in absolute terms, the amount of
nonsynonymous divergence in this set is very low: the median divergence is just ∼2%.
(2) Figure S17: Dosage sharing appears to evolve very rapidly, and shows no
relationship with the amount of nonsynonymous divergence between the copies. In this analysis,
we expanded the number of genes that we could analyze. Since we are interested in the summed
expression, we relaxed our read mapping criteria to allow reads that cannot be uniquely assigned
between the two copies, but that are unique with respect to the rest of the genome.
In this analysis the average ratio of summed expression of duplicates to their ortholog is
∼1.2, far lower than the 2-fold that would be expected from doubling copy number (Figure S17).
Even pairs with identical protein sequences (dN=0) are downregulated compared to their parent
genes, suggesting that the sharing of expression evolves quickly after duplication and likely precedes
the divergence of protein function.
(3) Figure S18: If these genes were already significantly sub-/neofunctionalized at the pro-
tein level, then we would expect them to show significant levels of selective constraint within
humans. However the average conservation is very low. In this set, we observed more than twice
as many common missense variants in human polymorphism data as the average for singleton
genes (29.6% vs 14.4%). (Note that the gene-level estimates in this analysis are noisy due to
small numbers of segregating sites.) This indicates that these young duplicates are function-
ally redundant at the protein level, and thus within-species selective constraint against missense
mutations is much weaker than for typical genes.
2.4 Rapid downregulation of expression in duplicates
If expression downregulation plays an important role in preserving duplicates, then we would
expect the expression reduction to occur relatively quickly, at least on a timescale similar to the
rate of gene loss by nonsense mutations. To explore the speed of downregulation, we took the set
of duplicates that occurred on the human lineage since the human-macaque split, and examined
their expression, relative to macaque, as a function of age (measured by dS) (Figure S19A). As
in Figure S17, we included duplicates that are not separately mappable but are distinct from
other genes.
This analysis shows downregulation among even the youngest fixed duplicates to close to the
level of macaque orthologs. This suggests that expression reduction occurs rapidly, as opposed
to a gradual decrease in expression over time (Figure S19A). Downregulation could occur
through substitutions that affect regulation (e.g., by weakening promoters), but it could also
occur through nongenetic processes such as expression buffering via feedback mechanisms on
transcription or mRNA turnover. To test whether there may be an effect of expression buffering
on new duplicates, we examined the expression of genes with copy number variation in the
human population. This analysis was restricted to polymorphic duplications with two entire
copies of the relevant genes, or to whole-gene deletions. Using genotype and gene expression
data for a subset of 1000 Genomes individuals (21, 58), we showed that gene expression in
individuals with atypical copy numbers (1, 3 or 4 copies) was closer to diploid expression than
predicted by an additive effect of copy number (Figure S19B). This suggests either widespread
partial buffering of duplicates, or that duplications with reduced expression of one or both copies
are more likely to be polymorphic. We speculate that this moderate buffering may help enable
the fixation of duplicates by alleviating dosage imbalance caused by duplication. It is also likely
that duplicates with relatively stronger buffering may be more likely to fix. Following fixation,
there may be substitution of additional expression-reducing mutations to further decrease the
expression of duplicates and thus enable their survival.
2.5 Effect of expression patterns on disease burden
To assess the functional significance of the observed expression patterns, we obtained gene-
disease associations from the Disease and Gene Annotations database (DGA,
http://dga.nubic.northwestern.edu) (16). DGA provides annotations of the human genes in the
context of diseases by integrating data from diverse sources including Disease Ontology (DO),
NCBI Gene Reference Into Function (GeneRIF) and Molecular Interaction Network (MIN).
It includes a diverse set of disease information, including Mendelian disease, cancer data and
GWAS. Of the 1,194 pairs of duplicates, 671 have at least one disease annotation in the
database. At the present time, disease annotations are clearly incomplete and may be inaccurate.
However, our key analyses involve comparisons of different types of genes based on expression
profiles, and we expect that classification errors should be uncorrelated with our expression-
based classifications.
We considered two generalized linear models (Poisson model with log link function). In the
first model, the response variable was the number of diseases associated with the minor gene;
the predictors and results are shown in Table S2. In the second model, the response variable
was the number of minor gene-specific diseases (Table S3).
The results show several interesting features:
• Mean expression ratio of minor/major gene. Lower relative expression of the minor
gene is associated with lower minor gene-total disease counts (p=8×10^−7, Wald test) and
(to a lesser extent) lower minor-specific counts (p=2×10^−3, Wald test). This is consistent
with our expectation that expression asymmetry reduces functional importance of minor
genes.
• Proportion of tissues where the minor gene is highly expressed. We use this as a measure
of the extent of sub-/neofunctionalization; it is positively correlated with both minor
gene-total and minor gene-specific disease counts (p=4×10^−13 vs p=5×10^−12, respectively,
Wald test). These results support the prediction that subfunctionalization of expression
promotes nonredundancy. It is interesting that the minor-specific count is only
slightly more significant than the minor-total count, perhaps suggesting that this measure
of expression divergence does not reflect much true neofunctionalization.
• Synonymous divergence between duplicates. dS is positively associated with both measures:
minor gene-total (p=9×10^−4, Wald test) and minor gene-specific (p=9×10^−3,
Wald test). This is consistent with our observations that older duplicates are under
increased evolutionary constraint, even if they are asymmetrically expressed.
• Nonsynonymous divergence between duplicates. dN is not associated with either measure.
This may reflect that the divergence of duplicates in protein space is due to relaxed
selective constraint caused by functional redundancy.
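To make the model form concrete, the following is a minimal sketch of a Poisson regression with a log link and a Wald test, fitted by Newton-Raphson. It assumes a single hypothetical predictor plus an intercept, whereas the models in Tables S2 and S3 use several predictors. The function names and the toy data are illustrative and are not the software or data used for the reported results.

```python
import math


def _information(x, mu):
    """Entries of the 2 x 2 Fisher information matrix and its determinant."""
    i00 = sum(mu)
    i01 = sum(m * xi for xi, m in zip(x, mu))
    i11 = sum(m * xi * xi for xi, m in zip(x, mu))
    return i00, i01, i11, i00 * i11 - i01 * i01


def fit_poisson_glm(x, y, max_iter=100, tol=1e-12):
    """Fit log(E[y]) = b0 + b1 * x; return (b0, b1, se_b1, wald_p)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for xi, yi, m in zip(x, y, mu))
        i00, i01, i11, det = _information(x, mu)
        # Newton step: inverse information matrix times the score.
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    i00, _, _, det = _information(x, mu)
    se_b1 = math.sqrt(i00 / det)                  # from the inverse information matrix
    z = b1 / se_b1                                # Wald statistic
    wald_p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal p value
    return b0, b1, se_b1, wald_p


# Hypothetical data: minor/major expression ratio and disease counts for six pairs.
ratio = [0.1, 0.2, 0.4, 0.5, 0.8, 1.0]
diseases = [0, 1, 1, 2, 3, 4]
print(fit_poisson_glm(ratio, diseases))
```

A positive slope in such a fit corresponds to the pattern described in the first bullet: lower relative expression of the minor gene going with fewer associated diseases.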
2.6 Differential splicing in duplicates
Our main paper focuses on the possibility of sub-/neofunctionalization of whole-gene expression.
Another possibility, however, would be subfunctionalization by differential splicing or
isoform usage between the duplicates (59–61). In fact, differential isoform usage has been
previously reported between duplicated genes (62–65).
To examine the prevalence of differential splicing between duplicates, we compared the
expression of homologous exons using the same procedure as we have applied when comparing
gene level expression discussed in section 3. We used a p-value cutoff of 0.0001 for the t-
test instead of 0.001 because the number of tests at the exon level was about 10 times that at
the gene level. Of 1,194 mappable pairs of duplicates, 359 (30%) have at least one pair of
homologous exons that are differentially spliced in at least one of the 46 tissues. Of these 359
pairs, 195 (54%) were already classified as potentially subfunctionalized on the basis of whole
gene expression data. Figure S20 shows the distribution of potential subfunctionalization of
duplicates by differential splicing (orange) in addition to subfunctionalization at the gene level
(blue). Among pairs with dS < 0.7, 15.2% were classified as potentially subfunctionalized
based on whole-gene expression. An additional 10.7% show evidence for differential splicing,
suggesting at most a modest contribution from differential splicing.
We next wanted to test whether differential splicing had any effect on disease risk. It has
been shown that most alternative splicing differences are not conserved between species and
are therefore likely to be neutral (66–70), and thus we conjectured that the differential splicing
events observed here may not be functionally important.
To address this question, we revisited the regression model for disease burden developed
in the previous section. Recall that duplicates with stronger evidence for subfunctionalization
were associated with higher numbers of diseases (Table S2, Table S3). Using the same basic
model, we added an indicator variable for presence/absence of differential splicing as an addi-
tional predictor. If these pairs were truly subfunctionalized, we would expect disease burden
to be positively correlated with evidence for differential splicing. Instead, however, differential
splicing is negatively correlated with disease risk.
For the multiple regression on the total number of diseases associated with the minor gene, the
coefficient of the new predictor is -0.35 (p=4×10^−11, Wald test). For the multiple regression on
the number of diseases associated with the minor gene only, the coefficient of the new predictor
is -0.18 (p=4×10^−3, Wald test). These findings suggest that differential splicing is more often an
indication of low selective constraint on the duplicates than truly divergent function. Together
these results argue that changes in isoform usage are not a primary driver of subfunctionalization
of mammalian duplicates.
2.7 Long-term selective constraint on duplicated genes (dN/dS)
The ratio of the nonsynonymous substitution rate (dN) to the synonymous substitution rate (dS),
dN/dS, is a classic measure of selection and constraint on protein sequences. dN/dS < 1 is
a hallmark of protein coding constraint. In the absence of advantageous mutations and if all
synonymous mutations are neutral, then 1 − dN/dS estimates the fraction of nonsynonymous
mutations that are deleterious. Meanwhile, if some substitutions are advantageous this will
increase the value of dN/dS, in rare cases pushing it above 1. Note that dN/dS estimates are
noisy for individual genes when dS is small.
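As a small worked illustration of this interpretation, the sketch below computes 1 − dN/dS under the stated assumptions (no advantageous mutations, neutral synonymous sites). The function name and the input values are hypothetical. Returning no estimate when dN/dS exceeds 1 is a choice made here, since that case violates the assumptions.

```python
def fraction_deleterious(dn, ds):
    """Estimate the fraction of nonsynonymous mutations that are deleterious."""
    if ds <= 0:
        raise ValueError("dS must be positive for dN/dS to be defined")
    omega = dn / ds
    if omega > 1:
        # The no-advantageous-mutation assumption does not hold; give no estimate.
        return None
    return 1.0 - omega


print(fraction_deleterious(0.02, 0.10))  # hypothetical dN and dS values
```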
To estimate dN/dS on each duplicate separately, we used MrBayes to build trees containing
the human duplicates plus orthologs from 8 other species to provide information about the
ancestral sequence at the point of duplication (see Figure S3). We then used PAML to estimate
branch lengths separately for each duplicate back to the duplication point on the inferred tree.
In more detail, we performed multiple sequence alignment among the pair and its orthologs
using MACSE. These alignments were then used as input for MrBayes v3.2.2 (42, 71, 72) to
build gene trees. For input to MrBayes, we predefined an outgroup gene for the tree, namely the
ortholog with greatest dS to the two duplicates. Duplicates with fewer than three orthologs across
all 8 species were excluded from the analysis due to uncertainty in the tree. We also excluded
pairs with dS > 1.57 between the two human duplicates, as they are likely to have diverged
prior to the divergence of the human and chicken lineages, meaning that we would not have a true
outgroup within this set of species. Parameters for MrBayes were set as follows:
nst=6 Nucmodel=Codon omegavar=ny98
mcmc ngen=100000 samplefreq=1000 printfreq=1000 diagnfreq=1000
The software PAML v4.8 (73, 74) was used to estimate nonsynonymous and synonymous
divergence for each branch in the gene tree generated by MrBayes. dN and dS for each human
duplicate were calculated by summing branch lengths from the present day to the node ancestral
to the two duplicates. Parameters for PAML were set as follows:
seqtype=1 model=1 NSsites=0
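The summation of branch lengths back to the duplication node can be sketched as follows. This is an illustration only: the tree encoding (a child-to-parent map plus the length of the branch above each node), the node names, and the branch lengths are all hypothetical, and the sketch does not parse MrBayes or PAML output.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def divergence_since_duplication(dup_a, dup_b, parent, branch_length):
    """Sum branch lengths from each duplicate tip back to their common ancestor."""
    ancestors_b = set(path_to_root(dup_b, parent))
    # The duplication node is the first ancestor of dup_a that also lies above dup_b.
    dup_node = next(n for n in path_to_root(dup_a, parent) if n in ancestors_b)

    def tip_to_node(tip):
        total, node = 0.0, tip
        while node != dup_node:
            total += branch_length[node]   # length of the branch above `node`
            node = parent[node]
        return total

    return tip_to_node(dup_a), tip_to_node(dup_b)


# Hypothetical gene tree: the duplication predates the human-chimpanzee split.
parent = {"humanA": "n1", "chimpA": "n1", "n1": "dup",
          "humanB": "dup", "dup": "root", "outgroup": "root"}
ds_branch = {"humanA": 0.01, "chimpA": 0.012, "n1": 0.05,
             "humanB": 0.08, "dup": 0.30, "outgroup": 0.90}
print(divergence_since_duplication("humanA", "humanB", parent, ds_branch))
```

The same function would be called once with per-branch dN values and once with per-branch dS values to obtain both quantities for each duplicate.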
2.8 Current selective constraint on duplicated genes (SFS in humans)
A second tool for studying selective constraint comes from examining the site frequency spec-
trum (SFS) in polymorphism data. Classes of sites that are under selective constraint tend to
be enriched for lower frequency variants (e.g., nonsynonymous variants generally have lower
mean frequency than synonymous variants). Unlike dN/dS, which effectively averages selection
pressures since the duplicate genes diverged, the SFS reflects current selective pressures (within
the past 10^4–10^6 years). Compared with dN/dS, the SFS is also less confounded by advantageous
mutations, since it is generally assumed that these contribute little to patterns of diversity
within species, for most genes. However, since there may be modest numbers of SNVs per gene,
this measure is noisy at the level of individual genes, and hence we will only report SFS results
for classes of genes.
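A minimal sketch of how an SFS summary might be compared between classes of sites is given below. The rare-variant threshold, the function name, and the allele counts are hypothetical and are not the values or code used for Figures S22 and S23; the only quantity taken from the text is that 6,515 diploid individuals correspond to 13,030 sampled chromosomes.

```python
def fraction_rare(derived_counts, n_chromosomes, max_freq=0.001):
    """Fraction of segregating sites whose derived allele frequency is <= max_freq."""
    freqs = [c / n_chromosomes for c in derived_counts if 0 < c < n_chromosomes]
    if not freqs:
        raise ValueError("no segregating sites in this class")
    return sum(f <= max_freq for f in freqs) / len(freqs)


N_CHROMOSOMES = 2 * 6515   # diploid individuals

# Hypothetical derived-allele counts for two classes of sites.
nonsynonymous = [1, 1, 2, 3, 40, 900]
synonymous = [1, 5, 60, 300, 2500, 7000]
print(fraction_rare(nonsynonymous, N_CHROMOSOMES))
print(fraction_rare(synonymous, N_CHROMOSOMES))
```

A class under stronger constraint is expected to show a larger fraction of rare variants.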
We used data from 6,515 individuals in the Exome Sequencing Project (15) (Exome Variant
Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/)
[data downloaded September 2014]). To polarize alleles into ancestral and derived, we aligned
each SNP to the corresponding positions in the chimpanzee and gorilla genomes using the
Liftover tool (75). An allele was then defined as ancestral if it was observed in either
chimpanzee or gorilla, while the second allele matched neither chimpanzee nor gorilla. Polymorphic
sites not meeting these criteria were removed. For categorizing mutations as synonymous,
nonsynonymous, nonsense (and other categories not used here), we used annotations provided by
ESP. In principle we might worry about the ability to identify SNVs in the very youngest duplicates;
however, any biases should be shared across major and minor genes, and thus should not confound our
major conclusions.
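The polarization rule described above can be written as a short function. This is a sketch under the assumption that each site is summarized by its two human alleles and the aligned chimpanzee and gorilla bases (with None for a missing alignment). The function name and this input representation are hypothetical.

```python
def polarize(allele1, allele2, chimp, gorilla):
    """Return (ancestral, derived), or None when the site cannot be polarized."""
    outgroup = {base for base in (chimp, gorilla) if base is not None}
    in1, in2 = allele1 in outgroup, allele2 in outgroup
    if in1 and not in2:
        return allele1, allele2
    if in2 and not in1:
        return allele2, allele1
    # Both alleles or neither allele seen in the outgroups: the site is removed.
    return None


print(polarize("A", "G", "A", "A"))   # one allele matches both outgroups
print(polarize("A", "G", "A", "G"))   # outgroups disagree, each matching one allele
```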
Figure S22 illustrates the SFS for duplicate genes as a function of age. Notice that the
fraction of rare SNVs increases steadily with the age of the duplicates, indicating that older
duplicates tend to be under much stronger evolutionary constraint. The oldest duplicates are
in fact more constrained, on average, than singleton genes. This may reflect a stabilization of
nonredundant functions in the oldest duplicates and, in some cases, enrichment of conserved
developmental genes among the oldest duplicates (76).
Figure S23 shows a comparison for AED genes of synonymous vs nonsynonymous (dashed
vs solid) variants in minor vs major genes (red vs blue). The youngest AEDs seem to be under
relatively low selective constraint overall (i.e., little difference between synonymous and nonsynonymous
spectra), while older AEDs show strong differences between synonymous and nonsynonymous
sites. For all ages the nonsynonymous sites in major genes have more rare variants than
nonsynonymous sites in minor genes, although the magnitude of the effect varies across ages. (For
some of the categories, we see the same effect at synonymous sites; selection at synonymous
sites may be due to functional elements such as splice enhancers.) We also observed higher
densities of nonsense polymorphic sites in minor genes. In summary, these spectra suggest that
most AEDs take relatively long times to become strongly constrained (i.e., dS > 0.5), and that
minor genes tend to enjoy reduced constraint at all ages.
2.9 Effect of translocation on regulatory divergence of duplicates
Figure S8A suggests that most duplicate pairs arise in tandem, and that they are gradually
separated in the genome by translocation. In the main text we reported that, controlling for dS,
the pairs on different chromosomes are more likely to have divergent expression patterns and
to be classified as potentially sub-/neofunctionalized (Figure 3, Figure 5A, Figure S24). This
result is robust to alternative criteria for sub-/neofunctionalization, for example, requiring the p
value of a t-test to be less than 0.05 with no fold change cutoff (Figure S24D).
We also observed that very similar results hold at the level of histone marks (Figure S25).
To produce this plot we used histone ChIP-seq data from 25 tissues, covering a variety of human
developmental and adult tissues, collected by the RoadMap Epigenomics Project (26, 27). We clustered
histone modifications into two categories: Transcription Start Site (TSS) histone marks,
such as H3K4me1, H3K4me3, H3K9ac, and H3K27ac, and gene body histone marks, such as
H3K36me3, H3K27me3, and H3K9me3. TSS histone modification levels were measured by
dividing the read density (number of reads per base pair) at the promoter region (1 kb upstream
and downstream of the TSS) by the read density in the background genomic DNA sequencing data
(input). For gene body histone modification levels, the whole gene body region was used
instead of the promoter region. Read densities were normalized by the total number of
reads for each sample and standardized across all samples of each mark before we calculated
the correlation between sister duplicates.
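The normalization steps above can be sketched as follows. This is one possible reading, not the pipeline used for Figure S25: the pseudocount is an assumption added to avoid division by zero, the standardization is implemented as a plain z-score, and all names and numbers are hypothetical.

```python
import math


def enrichment(chip_reads, chip_total, input_reads, input_total, pseudocount=1.0):
    """ChIP read density over input read density for one region in one sample.

    Region length cancels because both densities cover the same region.
    Each count is divided by its library size (total mapped reads).
    """
    chip_density = (chip_reads + pseudocount) / chip_total
    input_density = (input_reads + pseudocount) / input_total
    return chip_density / input_density


def standardize(values):
    """Z-score a list of values (mean 0, standard deviation 1)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical promoter signal for two sister duplicates across four tissues.
dup1 = [enrichment(c, 1_000_000, 20, 1_000_000) for c in (120, 300, 80, 45)]
dup2 = [enrichment(c, 1_000_000, 25, 1_000_000) for c in (100, 260, 90, 30)]
print(pearson(dup1, dup2))
```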
Co-expression of neighbor duplicates. It is known that nearby genes tend to have correlated
expression patterns (19); however, the mechanistic causes are not well understood. These may
include sharing of regulatory enhancers and the fact that nearby genes may lie in the same
co-regulated chromatin domains.
We tested the effect of genomic separation on expression correlation using a multiple regres-
sion model which includes age (dS) as a covariate (Table S4). This model shows that genomic
separation has a significant effect on expression correlation: mean effect on correlation = -0.36;
p = 3×10^−30 (Wald test; Table S4, Figure S26A,B). It is also noteworthy that we see a significant
(albeit weaker) effect of dS on expression correlation for tandem duplicates, but no effect
at all for separated duplicates. This is consistent with the idea that breaking synteny between
duplicates radically alters their gene regulation.
We also observe a concordant, though much weaker, effect for physical distance between
duplicates that are on the same chromosome (p=0.003, Wald test; Figure S26C,D), suggesting
that co-regulation may be weaker for tandem duplicates that are far apart than for those that are
close together.
Figure 3C illustrates the distribution of expression correlations for both singleton neighbors
and duplicate neighbors. Both of these tend to be more correlated than unlinked singletons (data
not shown for figure simplicity), and unlinked duplicates, respectively, highlighting the role of
genomic proximity in co-regulation. Both linked and unlinked duplicates are more correlated
than linked and unlinked singletons, respectively, presumably due to similarity of regulatory
sequences of duplicates.
Effect of promoter divergence on expression correlation. As an alternative to using synonymous
divergence as a molecular clock, we also experimented with using promoter divergence.
To quantify promoter divergence, we first attempted to align the promoter regions of du-
plicates using a fixed region around the annotated transcript start site (TSS) of each gene
as a putative promoter. However, we found that many promoter pairs from this simple approach
were unalignable (by Clustal or BLASTN). We reasoned this might be because the position
of a promoter can vary among duplicates and the actual boundaries of promoters are not
well defined. To better align the promoters of duplicates, we used promoter annotations from
chromHMM (77). To maximize the promoters we could find, we pooled together chromHMM
annotations of 9 cell lines available in the ENCODE repository. We then considered the closest
promoter to the TSS of each duplicate gene as its promoter, provided that it is within 5 kb of the
gene. We performed sequence alignment using Clustal and defined the divergence of promoters
as the number of mismatches divided by the total length of aligned sequences. To reduce the
effect of spurious alignment by Clustal, which tends to maximize the number of matches, we
only took into account regions with more than 100 bp of continuously aligned sequence.
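The divergence measure defined above can be sketched for a single pairwise alignment. This is an illustration under stated assumptions: the two sequences are given as equal-length aligned strings with "-" for gaps, a "continuous aligned" region is taken to mean a gap-free run, and the function name is hypothetical. It does not call Clustal.

```python
def promoter_divergence(aln_a, aln_b, min_run=100):
    """Mismatches divided by aligned length, over gap-free runs longer than min_run.

    Returns None when no run passes the length filter.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = aligned = 0
    run_len = run_mis = 0
    # A trailing gap column acts as a sentinel that closes the final run.
    for a, b in zip(aln_a.upper() + "-", aln_b.upper() + "-"):
        if a == "-" or b == "-":
            if run_len > min_run:
                aligned += run_len
                mismatches += run_mis
            run_len = run_mis = 0
        else:
            run_len += 1
            run_mis += a != b
    return mismatches / aligned if aligned else None


# Toy alignment with a lowered length filter so the short example passes it.
print(promoter_divergence("ACGTAC--GT", "ACCTAC--GA", min_run=3))
```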
From a starting set of 1,194 mappable pairs, we identified promoters for both duplicates of
791 pairs. Of these, 206 were not alignable, leaving 585 duplicate pairs with aligned promoters.
As expected, there is a strong correlation between synonymous divergence in the coding region
and promoter divergence (ρ=0.83, p < 2.2×10^−16, Figure S27). We then tested the effect of genomic
proximity of duplicates on their expression correlation, controlling for promoter divergence,
using a generalized linear model. The categorical variable for duplicates in tandem vs separated
in the genome again shows a highly significant effect on the expression correlation of duplicates
controlling for promoter divergence (p=4×10^−9, Wald test; Table S5).
Expression correlation of duplicates with discordant genomic proximity in human and
mouse. To better control for confounding factors, we searched for duplicate pairs that are in tandem
in human and on different chromosomes in mouse, or vice versa (Figure 3C). We then
compared the expression correlation between the species where the pair is in tandem, vs the
species where the pair is separated using matched mouse and human expression data in 6 tis-
sues (24).
We identified 12 pairs of duplicates that are in tandem in human but separated in mouse and
15 pairs of duplicates that are in tandem in mouse but separated in human. Although the sample size is
small, and the data are noisier due to the smaller number of tissues (6 vs 46), we observe a
significant signal of the expected result: i.e., that the separated duplicates are less correlated than
the tandem duplicates (p = 0.03, one sided paired t test; Figure 3C). Moreover, the magnitudes
of the correlations are similar to those seen in GTEx data for duplicates of similar age.
Shared regulation of neighbor duplicates. One intuitive explanation for the increased co-
expression of duplicate neighbors compared to singletons is that duplicates may be more likely
to share regulatory elements. As one test of this, we examined the rate of shared eQTLs between
singletons and duplicates in three different studies. As a proxy for mappability, we required that
both genes in a pair have at least one eQTL and that dS between the genes was larger than 0.1.
For the set of eQTLs identified using the Geuvadis data (21), 19 out of 25 pairs of duplicate
neighbors have common eQTLs, while 158 out of 394 pairs of singleton neighbors share eQTLs.
The odds ratio is 4.71. Fisher’s exact test yields a p value of 5.9×10^−4. Similarly, for the set
of eQTLs identified by Battle et al. (20), 14 out of 91 pairs of duplicate neighbors share their
best eQTLs, while 195 out of 3,544 pairs of singleton neighbors share their best eQTLs. The
odds ratio is 3.12, and Fisher’s exact test yields a p value of 5.6×10^−4. For the eQTLs identified
by the GTEx consortium (78), the odds ratio is 2.82 and the p value is 0.02. For this analysis
and those that follow, we confirmed that the distributions of genomic distances between pairs
of duplicates and singletons are similar.
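For readers who wish to see how such 2 × 2 comparisons are set up, the sketch below computes a sample odds ratio and a one-sided Fisher's exact p value from the hypergeometric distribution, using the counts quoted above. The function names are hypothetical. The sample odds ratio can differ slightly from the conditional maximum-likelihood estimate that some statistical packages report, and the one-sided tail computed here is not necessarily the test variant used for the reported p values.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(n, row1)


# Geuvadis counts from the text: 19 of 25 duplicate pairs and 158 of 394
# singleton pairs share eQTLs.
shared_dup, unshared_dup = 19, 25 - 19
shared_single, unshared_single = 158, 394 - 158
print(odds_ratio(shared_dup, unshared_dup, shared_single, unshared_single))
print(fisher_one_sided(shared_dup, unshared_dup, shared_single, unshared_single))
```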
To test if the sharing of regulatory elements between neighbors is mediated through chro-
matin interactions, we examined genome-wide chromatin conformation capture data (Hi-C)
recently generated in the GM12878 cell line (22). We excluded read pairs of < 20kb as this
lies close to the resolution of the assay. On average, we found 62 Hi-C reads supporting a
promoter-promoter interaction between a pair of neighbor duplicates. No reads supported a
promoter-promoter interaction between duplicates on different chromosomes (Figure 2E). Interestingly,
consistent with the eQTL analysis, we found that promoter-promoter interactions are more
intense in duplicate neighbors than in singleton neighbors. 66 out of 69 pairs of expressed
duplicate neighbors are linked by > 0 reads, with a total of 4,254 reads linking the
pairs of promoters; 4,406 out of 4,959 pairs of expressed singleton
neighbors had > 0 reads, with a total of 114,165 reads linking the promoters. To test
significance of these observations we constructed a generalized linear model with the response
variable being the number of reads linking a pair of promoters. The predictors are: 1) sum
of the expression of the pair; 2) distance between the two promoters; and 3) categorical indicator
(duplicates or singletons). The GLM shows that, controlling for expression level and
distance, duplicate neighbors have significantly more Hi-C linkages than singleton neighbors
(p=3.1×10^−6, Wald test).
We also examined enhancer-promoter links, using enhancer annotations from chromHMM
in the GM12878 cell line (77). To reduce noise, we went through a first step of identifying
statistically significant enhancer-promoter interactions controlling for distance between the in-
teracting loci, GC content and enzyme digestion/ligation efficiency, and requiring that the events
be within the same TAD (79). For all genes, we identified a total of 525,342 significant
long-range enhancer-promoter interactions (distance > 20 kb, FDR=0.2). Among these, we
found 8 out of 39 pairs of expressed duplicate neighbors where a single enhancer was linked to
both promoters. Similarly, we found that 246 out of the 2,507 pairs of expressed singleton neighbors
in the same TADs showed evidence of enhancer sharing. This is a weakly significant enrichment
for duplicates vs singletons: odds ratio = 2.4, one-sided Fisher’s exact test p = 0.035.
3 Supplemental Figures
[Figure S1 plots not reproduced. Panels: A. FPKM = 10; B. FPKM = 1; C. FPKM = 0.1. X axis: dS (bins 0−0.1, 0.1−0.5, 0.5−1.0, 1.0−1.5, 1.5−2.0); Y axis: estimated expression ratio (log2); series: Cufflinks, Unambiguous Reads Only.]
Figure S1: Estimation of expression ratios of duplicated genes for simulated data. For each pair of duplicates, we simulated RNA-seq reads derived from the transcripts of these two genes with an expression ratio of 1:1, and FPKM at 10 (A), 1 (B) and 0.1 (C). We mapped the reads to the human genome hg19 using Tophat2. We then estimated the expression ratios of the genes using Cufflinks and using our own pipeline based on reciprocally unambiguous sites only. The Y axis shows the estimated log2 expression ratio of the duplicate pairs. Note that FPKM = 0.1 is presented only as a worst-case scenario, as such genes are considered unexpressed in our analyses.
[Figure S2 schematic not reproduced. Labels: Duplicate 1, Duplicate 2; site 1 to site 4; reciprocally unambiguous reads, multi-hits reads; unambiguous sites, ambiguous sites.]
Figure S2: Schematic of reciprocal reads mapping. A reciprocally unambiguous site is defined as a position for which all reads overlapping that site are uniquely mappable in both sister duplicates. Gene expression of gene duplicates is calculated using only reciprocally unambiguous sites. To estimate average expression level, the number of mapped reads is appropriately normalized for the number of allowed positions.
[Figure S3 tree not reproduced. Tips, top to bottom: Chicken, Platypus, Opossum, Mouse, Macaque, Orangutan, Gorilla, Chimpanzee, Human. Scale bar: 0.05 synonymous divergence. Labels under "dS from human", in extracted order: 0.015, 0.022, 0.043, 0.081, 0.57, 1.0, 1.2, 1.6. Remaining labels, in extracted order: 0.014, 0.060, 0.45, 1.1.]
Figure S3: Synonymous divergence tree of nine species at singleton genes. The labels on the right show estimated synonymous distances, dS, between human and each of the other species, while the green labels show twice the dS along the human lineage from each branch point. Synonymous divergence between species was calculated as a weighted average synonymous distance for all singleton (genes with no duplicates) ortholog pairs of these two species.
[Figure S4 plot not reproduced. X axis: dS (0 to 1.5); Y axis: proportion of pairs (0.0 to 1.0); series: chimpanzee, gorilla, orangutan, macaque, mouse, opossum, chicken.]
Figure S4: dS as a molecular clock to date duplications: Proportion of human duplications shared with other species as a function of dS. The X axis shows dS between human duplicates. The Y axis shows the estimated fraction of duplicates in a dS bin that are shared with an outgroup species. Notice for example that most duplicates with dS ≤ 0.35 are not shared with mouse, but most duplicates with dS ≥ 0.45 are shared with mouse, indicating that these duplicates arose prior to the human-mouse split.
[Figure S5 plot not reproduced. X axis: age (Human to macaque, Macaque to mouse, Mouse to opossum, Opossum to chicken, Precede chicken); Y axis: number of pairs (0 to 200); categories: No difference, Asymmetrically expressed, Sub−/neofunctionalized.]
Figure S5: Classification of gene pairs by expression patterns. Ages estimated using phylogenetic information: for example, “macaque to mouse” includes duplicates shared with macaque but not with mouse.
Figure S6: Heat maps of expression ratios for duplicate pairs in different age groups, e.g., “macaque to mouse” indicates human duplicates inferred to have arisen on the ancestral branch between the human-macaque split and the human-mouse split. As in the dS-based version of this figure (Figure 2B), for each duplicate pair (plotted in columns) the ratios show the tissue-specific expression level of the minor gene relative to its duplicate. Green indicates evidence for subfunctionalization; consistently blue columns indicate AEDs. Black indicates tissue ratios not significantly different from 1 (p > .001). Relatively few of the duplicates that arose within the mammals show evidence of subfunctionalization.
[Figure S7 plots not reproduced. Panel A, Proximity of duplicates: X axis: age (Human to macaque, Macaque to mouse, Mouse to opossum, Opossum to chicken, Precede chicken); Y axis: number of pairs (0 to 200); categories: Retrotransposition, Different chromosomes, Same chr. > 1 MB, Same chr. < 1 MB. Panel B, Expression correlation: X axis: age (Human to macaque, Macaque to mouse, Mouse to opossum, Opossum to chicken); Y axis: expression correlation (−0.5 to 1.0); groups: Within 1 MB, Diff chr; group means marked; significance marks as extracted: *, ***, **, **.]
Figure S7: Phylogeny-based analysis: Duplicate pairs located on different chromosomes show consistently lower expression correlations than those located on the same chromosome. A. Proximity of duplicates at different ages. B. Expression divergence of nearby pairs compared to pairs on different chromosomes, divided up according to the phylogenetic branch on which the duplications occurred.
[Figure S8 plots not reproduced. Panel A, Segmental duplications: X axis: dS (0 to 2); Y axis: number of pairs (0 to 200). Panel B, Retrotransposed duplications: X axis: dS (0 to 2); Y axis: number of pairs (0 to 20). Categories in both panels: Different chromosomes, Same chr. > 1mb, Same chr. < 1mb.]
Figure S8: Numbers of likely segmental (A) and retrotransposed (B) duplications in human, for different values of dS. Most young segmental duplicate pairs are nearby in the genome. Note that the number of segmental duplicate pairs is much greater than the number of retrotransposed pairs.
[Figure S9 plots not reproduced. Panel A, Segmental duplications: X axis: dS (0 to 2); Y axis: number of pairs (0 to 250). Panel B, Retrotransposed duplications: X axis: dS (0 to 2); Y axis: number of pairs (0 to 100). Categories in both panels: Different chromosomes, Same chr. > 1mb, Same chr. < 1mb.]
Figure S9: Numbers of likely segmental (A) and retrotransposed (B) duplications in mouse, for different values of dS. As seen in human, most young segmental duplicate pairs are nearby in the genome and the number of segmental duplicate pairs is much greater than the number of retrotransposed pairs.
Figure S10: Expression of duplicate genes in 27 tissues (expanded version of Main Figure 1). A. A gene pair with an expression profile consistent with sub- or neofunctionalization: i.e., each gene is significantly more highly expressed than the other in at least one tissue. B. An asymmetrically expressed gene pair. Introns have been shortened for display purposes. The Y-axis shows read depth per billion mapped reads. Green regions in the gene models are unmappable.
Figure S11: Gene expression ratios for duplicate gene pairs in 46 tissues. A. Heat maps of expression ratios for all duplicate gene pairs, at 3 levels of synonymous divergence, dS. For each duplicate pair (plotted as a column) the ratios show the tissue-specific expression level of the gene with higher median expression relative to its duplicate. Blue indicates significantly lower expression of the minor gene in a particular tissue; red indicates significantly higher expression of the minor gene (p < 0.001 for both cases). Black indicates no significant difference. B. Distributions of expression ratios for duplicate gene pairs in 46 tissues. Labeling same as in A. Notice that for most gene pairs, the minor gene has consistently lower expression than the major gene, with few clear cases of subfunctionalization (i.e., mix of red/blue) except for the most diverged gene pairs.
A. Segmental duplicates colored by major-minor classification
B. Retrotransposed pairs colored by major-minor classification
C. Retrotransposed pairs colored by parent-daughter classification
Figure S12: Gene expression ratios for segmental and retrotransposed pairs. Heat maps of expression ratios for segmental duplicated (A) and retrotransposed gene pairs (B, C), at 3 levels of synonymous divergence, dS. Within categories, columns are sorted by the amount of blue/red. For each duplicate pair (plotted as a column) in panels A and B, the ratios show the tissue-specific expression level of the gene with higher median expression relative to its duplicate. The ratios in panel C show the tissue-specific expression level of the parent gene (gene with multiple exons) relative to the daughter gene (gene with one exon). Blue indicates significantly lower expression of the minor or daughter gene in a particular tissue; red indicates significantly higher expression of the minor or daughter gene (p < .001 for both cases). Black indicates no significant difference. Note that in 84% of the retrotransposed gene pairs, the daughter genes are also the minor genes.
[Figure S13 plot not reproduced. X axis: dS (0 to 2); Y axis: proportion of sub−/neofunctionalization (0.0 to 0.6); series: expressed in <10 tissues, expressed in >=10 tissues.]
Figure S13: Rates of subfunctionalization in pairs expressed in many tissues (orange) or few tissues (purple). The X axis indicates dS boundaries. The Y axis is the proportion of duplicates that are sub-/neofunctionalized. Note that the rates of sub-/neofunctionalization in broadly expressed duplicates are generally higher than for narrowly expressed duplicates.
[Figure S14 plot not reproduced. X axis: dS (0 to 2); Y axis range: 0 to 500; categories: Unmappable, No difference, Asymmetrically expressed, Sub−/neofunctionalized.]
Figure S14: Classification of mouse duplicate gene pairs by expression patterns in 26 tissues (12). The overall patterns are qualitatively similar to the human results.
A. Macaque
−4
−2
0
2
4
Log2
exp
ress
ion
ratio
Ratio=2:1Ratio=1:1
Sum Major Minor Singletons
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
B. Mouse
−4
−2
0
2
4
Log2
exp
ress
ion
ratio
Sum Major Minor Singletons
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●
●●
●●
●
●
●
●
●
●
●●●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Figure S15: Expression levels of duplicates compared to their singleton orthologs in macaque(A) and mouse (B), for human duplicates that are single-copy genes in macaque/mouse. “Sum”shows the summed expression of both duplicates, relative to expression of the macaque/mouseorthologs in the same tissues; the “major” and “minor” data show equivalent ratios for the higherand lower expressed genes in each duplicate pair. Each tissue-gene expression ratio is plottedas a separate data point. The green data show results for a random set of singleton orthologs.
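The three ratios described in the Figure S15 caption can be illustrated with a minimal sketch. The function name, the input values, and the choice to reject non-positive expression values are assumptions made for illustration; the caption does not say how zero expression was handled.

```python
import math


def log2_expression_ratios(major, minor, ortholog):
    """Log2 ratios of duplicate expression to a single-copy ortholog.

    Inputs are expression levels for one gene in one tissue (hypothetical
    units). "sum" uses the summed expression of both duplicates.
    """
    if min(major, minor, ortholog) <= 0:
        # Assumption: log ratios are only defined for positive expression.
        raise ValueError("expression values must be positive")
    return {
        "sum": math.log2((major + minor) / ortholog),
        "major": math.log2(major / ortholog),
        "minor": math.log2(minor / ortholog),
    }


# Hypothetical values for one gene-tissue combination.
ratios = log2_expression_ratios(major=6.0, minor=2.0, ortholog=4.0)
```

A log2 ratio of 0 corresponds to the 1:1 reference line in the figure, and a log2 ratio of 1 to the 2:1 line.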
[Figure S16 plot. X axis: dS (0.00 to 0.10). Y axis: dN (0.00 to 0.10). Color scale: log2 ratio (−1.0 to 1.0).]
Figure S16: Scatter plot of dN vs. dS for very young duplicate pairs shows dN ≤ dS for all pairs. Dots are colored by the mean log ratio of summed expression of duplicates to expression of their single copy ortholog in macaque. Notice that this is centered close to a log ratio of 0, and that there is no obvious relationship between expression and dN.
[Figure S17 plot. Y axis: log2 expression ratio (−4 to 4). Groups: duplicates with dN = 0, dN < 0.02, dN >= 0.02; singletons. Reference lines: ratio 2:1; ratio 1:1.]
Figure S17: Ratio of summed expression of duplicates to their single copy ortholog in macaque, stratified by dN between the two human copies. Each data point shows a single gene x tissue combination. These results show that dosage reduction evolves quickly in these young duplicates, while they still have very low protein divergence (on average ∼2%, Figure S16).
[Figure S18 plot. X axis: dS (0.0 to 1.5). Y axis: fraction of rare variants (<0.1%) (0.7 to 1). Legend: Mean of young duplicates; Mean of singletons; No-difference pairs; AEDs; Sub-/Neofunctionalized.]
Figure S18: Fraction of rare missense variants for very young duplicates in a large data set of human exomes (15), compared to duplicates in general, and singleton genes. Notice that young duplicates have a low fraction of rare variants, indicating that they are under relatively weak selective constraint compared to much older duplicates and to non-duplicated genes.
[Figure S19 plots. Panel A: Expression of dups. vs. single copy ortholog; X axis: dS (0 to 0.1); Y axis: log2 expression ratio (−4 to 4); reference lines: ratio 2:1, ratio 1:1; legend: Spline; Overall median. Panel B: Fold change in expression of CNVs; X axis: number of copies (1 to 4); Y axis: log2 expression ratio (−4 to 4); legend: Group mean; reference ratios: 2, 1.5, 1, 0.5.]
Figure S19: Expression reduction in duplicated genes. A. Ratios of summed expression of duplicates to their single copy orthologs in macaque, as a function of duplicate age (dS). Note that the fold-reduction of expression in the newest duplicates is similar to that in older pairs. The average expression ratio of duplicates to their orthologs is ∼1.2. B. Ratios of gene expression in individuals with atypical copy numbers to individuals with 2 copies. Each blue dot represents the ratio of median expression of individuals with copy number indicated by the X axis to the median expression of individuals with 2 copies, for a different CNV. Note that the effect of copy number on gene expression is smaller than expected from a simple additive model (dotted lines). For example, the average expression ratio of individuals with 4 copies to individuals with 2 copies is ∼1.5, compared to the 2-fold difference expected from copy number alone. However, the 1.5-fold ratio is higher than the 1.2-fold average difference seen in panel A.
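The comparison with the simple additive model in panel B can be made concrete with a short sketch. The function names are hypothetical, and the only numbers used are the ∼1.5-fold observed ratio and the 2-copy reference quoted in the caption.

```python
import math


def additive_expected_ratio(copies, reference_copies=2):
    """Expression ratio expected if expression scales linearly with copy number."""
    return copies / reference_copies


def log2_shortfall(observed_ratio, copies, reference_copies=2):
    """How far (in log2 units) the observed ratio falls below the additive expectation."""
    expected = additive_expected_ratio(copies, reference_copies)
    return math.log2(expected) - math.log2(observed_ratio)


# Individuals with 4 copies versus individuals with 2 copies.
expected_4 = additive_expected_ratio(4)
# Observed average ratio of about 1.5 (value taken from the caption).
shortfall_4 = log2_shortfall(1.5, 4)
```

A positive shortfall means the observed effect of copy number is smaller than the additive model predicts, which is the pattern the caption describes.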
[Figure S20 plot. X axis: dS (0 to 2). Y axis ticks: 0 to 300. Legend: Unmappable; No difference; Asymmetrically expressed; Potential subfunctionalization by differential splicing; Sub-/neofunctionalized.]
Figure S20: Distribution of potential subfunctionalization by differential splicing. Differentially spliced pairs are shown as a separate group in addition to the patterns identified at gene level (Figure 1B). Note that the overall subfunctionalization rate is low even if all differentially spliced pairs were actually subfunctionalized.
[Figure S21 plots. Panel A: Long-term selective constraint; X axis: dS (0.0 to 1.5); Y axis: dN/dS (0.0 to 0.8); legend: No difference; AEDs; Sub-/Neofunctionalized; Major genes; Minor genes. Panel B: Major vs. minor genes in AED pairs; X axis: dN/dS of major gene (0.0 to 1.0); Y axis: dN/dS of minor gene (0.0 to 1.0); p = 0.002.]
Figure S21: Evolutionary constraint on duplicates. A. Regression of dN/dS on dS for major and minor genes of the three different categories. B. Scatter plot of dN/dS for major and minor genes of AEDs. A paired t-test showed a significant difference between major and minor genes of AEDs (p = 0.002).
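For readers unfamiliar with the paired t-test mentioned in panel B, the statistic can be sketched as follows. This is a generic illustration using only the Python standard library, not the authors' code; the dN/dS values are made up, and converting the statistic to a p-value (which needs the t distribution, for example from a statistics package) is left out.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(major, minor):
    """Paired t statistic and degrees of freedom for minor minus major values."""
    if len(major) != len(minor) or len(major) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on within-pair differences, which is what makes the test "paired".
    diffs = [m - M for M, m in zip(major, minor)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    if standard_error == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / standard_error, n - 1


# Hypothetical dN/dS values for the major and minor gene of four AED pairs.
major_dnds = [0.10, 0.20, 0.15, 0.30]
minor_dnds = [0.20, 0.25, 0.30, 0.35]
t_stat, df = paired_t_statistic(major_dnds, minor_dnds)
```

A positive statistic here would indicate higher dN/dS in the minor genes than in their major partners.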
[Figure S22 plot. X axis: derived allele frequency (1e−04 to 1e+00). Y axis: cumulative proportion (0.0 to 1.0). Legend, dS range (number of sites): 0−0.1 (9,386); 0.1−0.5 (19,154); 0.5−1.0 (26,746); 1.0−1.5 (29,139); >1.5 (30,712); singleton (356,806).]
Figure S22: Cumulative derived allele frequency spectrum for nonsynonymous sites of duplicate genes of different age. Genes under higher selective constraint tend to have a higher fraction of rare variants, and hence the cumulative curve rises faster (i.e., appears higher on the plot). Note that the youngest duplicates are under lowest constraint, and that the oldest duplicates are under higher constraint than typical singleton genes. The data are for 6,515 individuals in the Exome Sequencing Project (15). The numbers in the legend show the total numbers of nonsynonymous sites in each age group.
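A cumulative derived allele frequency curve of the kind plotted in Figure S22 can be computed along these lines. This is a minimal sketch with hypothetical frequencies and thresholds; the function name and inputs are assumptions, not taken from the study.

```python
from bisect import bisect_right


def cumulative_daf(frequencies, thresholds):
    """Fraction of variant sites with derived allele frequency <= each threshold."""
    if not frequencies:
        raise ValueError("need at least one variant site")
    ordered = sorted(frequencies)
    n = len(ordered)
    # bisect_right counts how many sorted values are <= the threshold.
    return [bisect_right(ordered, t) / n for t in thresholds]


# Hypothetical derived allele frequencies for five nonsynonymous sites.
site_frequencies = [0.0001, 0.0002, 0.001, 0.02, 0.3]
# Thresholds matching the tick marks on the figure's X axis.
curve = cumulative_daf(site_frequencies, [1e-4, 1e-3, 1e-2, 1e-1, 1.0])
```

A gene set with a larger share of rare variants reaches high cumulative proportions at lower frequencies, which is why curves for more constrained genes sit higher on the plot.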
[Figure S23 plots. Panels: A. dS 0−0.5; B. dS 0.5−1.0; C. dS 1.0−1.5; D. dS 1.5−2.0. X axis: derived allele frequency (1e−04 to 1e+00). Y axis: cumulative proportion (0 to 1). Legend: Non-synonymous; Synonymous; Major genes; Minor genes.]
Figure S23: Cumulative derived allele frequency of major and minor genes for AED genes of different ages.
[Figure S24 plots. Panel A: Pairs within 1 Mbp; Panel B: Pairs on different chromosomes; X axis: dS bins (0−0.5, 0.5−1.0, 1.0−1.5, 1.5−2.0); Y axis: proportion (0 to 1); legend: No diff; AED; Sub-/neo. Panel C: Proportion of tissues where minor gene expressed higher (p < 0.001); Panel D: Proportion of tissues where minor gene expressed higher (p < 0.05); X axis: dS bins; Y axis: proportion of tissues where the minor gene is expressed higher; legend: Distance < 1 Mbp; Different chromosome.]
Figure S24: Sub-/neofunctionalization is more likely to occur in pairs on different chromosomes compared to pairs within 1 MB. A. Proportion of each category of expression patterns of duplicate gene pairs within 1 MB across different values of dS. B. Proportion of each category of expression patterns of duplicate gene pairs on different chromosomes. C. and D. Proportions of tissues in which the gene with lower overall expression (minor gene) is more highly expressed than the major gene according to a paired t-test. The p-value cutoff for the t-test is 0.001 (C) and 0.05 (D). (n.s.: not significant, *: p < 0.05, **: p < 0.01, ***: p < 0.001)
[Figure S25 plots. Panel A: TSS histone correlation; Panel B: Gene body histone correlation. X axis: dS bins (0−0.5, 0.5−1.0, 1.0−1.5, 1.5−2.0). Y axis: histone mark correlation (−1.0 to 1.0).]
Figure S25: Histone modification correlations are higher for pairs within 1 MB compared to pairs on different chromosomes. Distributions of the TSS (A) and gene body (B) histone modification correlations of duplicate pairs, across tissues for different values of dS. Pairs within 1 MB (orange) tend to be more correlated than pairs on different chromosomes (purple). (n.s.: not significant, *: p < 0.05, **: p < 0.01, ***: p < 0.001)
[Figure S26 plots. Y axis in all panels: expression correlation (partial residual) (−0.5 to 1.0). Panel A: Exp. Corr. vs dS given tandem/separated status; X axis: dS (0.0 to 2.0); legend: Within 1 MB; Diff chr. Panel B: Exp. Corr. vs tandem/separated status given dS; X axis: genomic proximity (Within 1 MB; Diff chr) (p = 3e−30). Panel C: Tandem only: Exp. Corr. vs dS given distance; X axis: dS (0.0 to 1.0) (p = 2e−09). Panel D: Tandem only: Exp. Corr. vs distance given dS; X axis: genomic distance (bp) (1e+04 to 1e+08) (p = 0.003).]
Figure S26: Multiple regression of expression correlation for duplicates: top row shows tandem duplicates vs duplicates on different chromosomes; bottom row shows the effect of physical distance in tandem duplicates.
[Figure S27 plot. X axis: ORF divergence (0.0 to 0.6). Y axis: promoter divergence (0 to 0.6). Legend: Unaligned pairs; Aligned pairs. ρ = 0.83, p < 2.2e−16.]
Figure S27: Scatter plot of synonymous divergence (X axis) vs. promoter divergence (Y axis) of duplicate pairs. Purple dots denote duplicates with unaligned promoters.
4 Supplemental Tables
Table S1: A list of genome assemblies and gene annotations used.
Species       Genome assembly        Gene annotation
Human         Ensembl GRCh37         release 73
Chimpanzee    Ensembl CHIMP2.1.4     release 70
Gorilla       Ensembl gorGor3        release 73
Orangutan     Ensembl PPYG2          release 73
Macaque       Ensembl Mmul_1         release 70
Mouse         Ensembl GRCm38         release 70
Opossum       Ensembl BROADO5        release 73
Platypus      Ensembl OANA5          release 73
Chicken       Ensembl WASHUC2        release 70
Table S2: Multiple regression of total number of diseases associated with minor gene.
Predictor                                                Coefficient     P value
Number of diseases associated with major gene            5.21 · 10^−2    0
Synonymous divergence between duplicates                 4.81 · 10^−2    9 · 10^−4
Proportion of tissues where minor gene expressed high    1.99            4 · 10^−13
Mean expression ratio of minor/major gene                9.78 · 10^−2    8 · 10^−7
Table S3: Multiple regression of number of diseases associated with the minor gene only.
Predictor                                                Coefficient      P value
Number of diseases associated with major gene            −7.47 · 10^−3    7 · 10^−2
Number of diseases shared by the duplicates              0.25             6 · 10^−199
Synonymous divergence between duplicates                 4.58 · 10^−2     9 · 10^−3
Nonsynonymous divergence between duplicates              0.45             9 · 10^−2
Proportion of tissues where minor gene expressed high    2.31             5 · 10^−12
Mean expression ratio of minor/major gene                7.29 · 10^−2     2 · 10^−3
Table S4: Multiple regression of expression correlation between duplicates across tissues.
Effect of whether duplicates are on same or different chromosomes on expression correlation.
Predictor                                            Coefficient    P value
a. Synonymous divergence between duplicates          -0.14          2 × 10^−6
b. Duplicates on same or different chromosomes       -0.36          3 × 10^−30
Interaction term (a×b)                               0.14           4 × 10^−6
Effect of distance on expression correlation for tandem duplicates.
Predictor                                            Coefficient    P value
Synonymous divergence between duplicates             -0.13          2 × 10^−9
Log 10 genomic distance between duplicates           -0.047         0.003
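Read as a fitted linear model, the first regression in Table S4 can be written out as below. This is a sketch under stated assumptions: the intercept \beta_0 is not reported in the table, and the coding of predictor b (taken here as 0 for pairs within 1 MB and 1 for pairs on different chromosomes) is an assumption rather than something the table states.

```latex
\begin{aligned}
\widehat{r} &= \beta_0 - 0.14\, d_S - 0.36\, b + 0.14\,(d_S \times b), \\
\left.\frac{\partial \widehat{r}}{\partial d_S}\right|_{b=0} &= -0.14, \qquad
\left.\frac{\partial \widehat{r}}{\partial d_S}\right|_{b=1} = -0.14 + 0.14 = 0 .
\end{aligned}
```

Under that coding, the fitted expression correlation declines with dS for nearby pairs, while separated pairs start 0.36 lower at dS = 0 and show essentially no further decline with dS.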
Table S5: Multiple regression of expression correlation between duplicates controlling for promoter divergence.
Predictor                                                               Coefficient    P value
Promoter identity between duplicates                                    0.14           0.21
Tandem vs Separated (i.e., within 1MB vs on different chromosomes)      −0.25          4 · 10^−9
5 List of Supplementary Files
Supplementary File 1 - A list of human tissue RNA-seq samples used from GTEx.
Supplementary File 2 - A list of mouse tissue RNA-seq data used from Babak et al.
Supplementary File 3 - A list of ChIP-seq data used from Roadmap Epigenomics Project.
6 References
1. G. C. Conant, K. H. Wolfe, Turning a hobby into a job: How duplicated genes find new functions. Nat. Rev. Genet. 9, 938–950 (2008). Medline doi:10.1038/nrg2482
2. M. Lynch, J. S. Conery, The evolutionary demography of duplicate genes. J. Struct. Funct. Genomics 3, 35–44 (2003). Medline doi:10.1023/A:1022696612931
3. H. Innan, F. Kondrashov, The evolution of gene duplications: Classifying and distinguishing between models. Nat. Rev. Genet. 11, 97–108 (2010). Medline doi:10.1038/nrg2689
4. A. Stoltzfus, On the possibility of constructive neutral evolution. J. Mol. Evol. 49, 169–181 (1999). Medline doi:10.1007/PL00006540
5. W. Qian, B. Y. Liao, A. Y. Chang, J. Zhang, Maintenance of duplicate genes and their functional redundancy by reduced expression. Trends Genet. 26, 425–430 (2010). Medline doi:10.1016/j.tig.2010.07.002
6. G. C. Conant, J. A. Birchler, J. C. Pires, Dosage, duplication, and diploidization: Clarifying the interplay of multiple models for duplicate gene evolution over time. Curr. Opin. Plant Biol. 19, 91–98 (2014). Medline doi:10.1016/j.pbi.2014.05.008
7. J. F. Gout, M. Lynch, Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol. Biol. Evol. 32, 2141–2148 (2015). Medline doi:10.1093/molbev/msv095
8. A. Force, M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, J. Postlethwait, Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545 (1999). Medline
9. C. R. Baker, V. Hanson-Smith, A. D. Johnson, Following gene duplication, paralog interference constrains transcriptional circuit evolution. Science 342, 104–108 (2013). Medline doi:10.1126/science.1240810
10. I. Wapinski, A. Pfeffer, N. Friedman, A. Regev, Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61 (2007). Medline doi:10.1038/nature06107
11. GTEx Consortium, The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 348, 648–660 (2015). Medline doi:10.1126/science.1262110
12. T. Babak, B. DeVeale, E. K. Tsang, Y. Zhou, X. Li, K. S. Smith, K. R. Kukurba, R. Zhang, J. B. Li, D. van der Kooy, S. B. Montgomery, H. B. Fraser, Genetic conflict reflected in tissue-specific maps of genomic imprinting in human and mouse. Nat. Genet. 47, 544–549 (2015). Medline doi:10.1038/ng.3274
13. See supplementary materials and methods on Science Online.
14. B. van de Geijn, G. McVicker, Y. Gilad, J. K. Pritchard, WASP: Allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015). Medline doi:10.1038/nmeth.3582
15. W. Fu, T. D. O’Connor, G. Jun, H. M. Kang, G. Abecasis, S. M. Leal, S. Gabriel, M. J. Rieder, D. Altshuler, J. Shendure, D. A. Nickerson, M. J. Bamshad, J. M. Akey; NHLBI Exome Sequencing Project, Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013). Medline doi:10.1038/nature11690
16. K. Peng, W. Xu, J. Zheng, K. Huang, H. Wang, J. Tong, Z. Lin, J. Liu, W. Cheng, D. Fu, P. Du, W. A. Kibbe, S. M. Lin, T. Xia, The Disease and Gene Annotations (DGA): An annotation resource for human disease. Nucleic Acids Res. 41, D553–D560 (2013). Medline doi:10.1093/nar/gks1244
17. E. W. Ganko, B. C. Meyers, T. J. Vision, Divergence in expression between duplicated genes in Arabidopsis. Mol. Biol. Evol. 24, 2298–2309 (2007). Medline doi:10.1093/molbev/msm158
18. J. A. Bailey, Z. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, E. E. Eichler, Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002). Medline doi:10.1126/science.1072047
19. A. T. Ghanbarian, L. D. Hurst, Neighboring genes show correlated evolution in gene expression. Mol. Biol. Evol. 32, 1748–1766 (2015). Medline doi:10.1093/molbev/msv053
20. A. Battle, S. Mostafavi, X. Zhu, J. B. Potash, M. M. Weissman, C. McCormick, C. D. Haudenschild, K. B. Beckman, J. Shi, R. Mei, A. E. Urban, S. B. Montgomery, D. F. Levinson, D. Koller, Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 24, 14–24 (2014). Medline doi:10.1101/gr.155192.113
21. T. Lappalainen, M. Sammeth, M. R. Friedländer, P. A. ’t Hoen, J. Monlong, M. A. Rivas, M. Gonzàlez-Porta, N. Kurbatova, T. Griebel, P. G. Ferreira, M. Barann, T. Wieland, L. Greger, M. van Iterson, J. Almlöf, P. Ribeca, I. Pulyakhina, D. Esser, T. Giger, A. Tikhonov, M. Sultan, G. Bertier, D. G. MacArthur, M. Lek, E. Lizano, H. P. Buermans, I. Padioleau, T. Schwarzmayr, O. Karlberg, H. Ongen, H. Kilpinen, S. Beltran, M. Gut, K. Kahlem, V. Amstislavskiy, O. Stegle, M. Pirinen, S. B. Montgomery, P. Donnelly, M. I. McCarthy, P. Flicek, T. M. Strom, H. Lehrach, S. Schreiber, R. Sudbrak, A. Carracedo, S. E. Antonarakis, R. Häsler, A. C. Syvänen, G. J. van Ommen, A. Brazma, T. Meitinger, P. Rosenstiel, R. Guigó, I. G. Gut, X. Estivill, E. T. Dermitzakis; Geuvadis Consortium, Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013). Medline doi:10.1038/nature12531
22. S. S. Rao, M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander, E. L. Aiden, A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014). Medline doi:10.1016/j.cell.2014.11.021
23. A. Feuerborn, P. R. Cook, Why the activity of a gene depends on its neighbors. Trends Genet. 31, 483–490 (2015). Medline doi:10.1016/j.tig.2015.07.001
24. D. Brawand, M. Soumillon, A. Necsulea, P. Julien, G. Csárdi, P. Harrigan, M. Weier, A. Liechti, A. Aximu-Petri, M. Kircher, F. W. Albert, U. Zeller, P. Khaitovich, F. Grützner, S. Bergmann, R. Nielsen, S. Pääbo, H. Kaessmann, The evolution of gene expression levels in mammalian organs. Nature 478, 343–348 (2011). Medline doi:10.1038/nature10532
25. K. Y. Popadin, M. Gutierrez-Arcelus, T. Lappalainen, A. Buil, J. Steinberg, S. I. Nikolaev, S. W. Lukowski, G. A. Bazykin, V. B. Seplyarskiy, P. Ioannidis, E. M. Zdobnov, E. T. Dermitzakis, S. E. Antonarakis, Gene age predicts the strength of purifying selection acting on gene expression variation in humans. Am. J. Hum. Genet. 95, 660–674 (2014). Medline doi:10.1016/j.ajhg.2014.11.003
26. B. E. Bernstein, J. A. Stamatoyannopoulos, J. F. Costello, B. Ren, A. Milosavljevic, A. Meissner, M. Kellis, M. A. Marra, A. L. Beaudet, J. R. Ecker, P. J. Farnham, M. Hirst, E. S. Lander, T. S. Mikkelsen, J. A. Thomson, The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol. 28, 1045–1048 (2010). Medline doi:10.1038/nbt1010-1045
27. A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, V. Amin, J. W. Whitaker, M. D. Schultz, L. D. Ward, A. Sarkar, G. Quon, R. S. Sandstrom, M. L. Eaton, Y. C. Wu, A. R. Pfenning, X. Wang, M. Claussnitzer, Y. Liu, C. Coarfa, R. A. Harris, N. Shoresh, C. B. Epstein, E. Gjoneska, D. Leung, W. Xie, R. D. Hawkins, R. Lister, C. Hong, P. Gascard, A. J. Mungall, R. Moore, E. Chuah, A. Tam, T. K. Canfield, R. S. Hansen, R. Kaul, P. J. Sabo, M. S. Bansal, A. Carles, J. R. Dixon, K. H. Farh, S. Feizi, R. Karlic, A. R. Kim, A. Kulkarni, D. Li, R. Lowdon, G. Elliott, T. R. Mercer, S. J. Neph, V. Onuchic, P. Polak, N. Rajagopal, P. Ray, R. C. Sallari, K. T. Siebenthall, N. A. Sinnott-Armstrong, M. Stevens, R. E. Thurman, J. Wu, B. Zhang, X. Zhou, A. E. Beaudet, L. A. Boyer, P. L. De Jager, P. J. Farnham, S. J. Fisher, D. Haussler, S. J. Jones, W. Li, M. A. Marra, M. T. McManus, S. Sunyaev, J. A. Thomson, T. D. Tlsty, L. H. Tsai, W. Wang, R. A. Waterland, M. Q. Zhang, L. H. Chadwick, B. E. Bernstein, J. F. Costello, J. R. Ecker, M. Hirst, A. Meissner, A. Milosavljevic, B. Ren, J. A. Stamatoyannopoulos, T. Wang, M. Kellis; Roadmap Epigenomics Consortium, Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). Medline
28. P. Flicek, M. R. Amode, D. Barrell, K. Beal, K. Billis, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fitzgerald, L. Gil, C. G. Girón, L. Gordon, T. Hourlier, S. Hunt, N. Johnson, T. Juettemann, A. K. Kähäri, S. Keenan, E. Kulesha, F. J. Martin, T. Maurel, W. M. McLaren, D. N. Murphy, R. Nag, B. Overduin, M. Pignatelli, B. Pritchard, E. Pritchard, H. S. Riat, M. Ruffier, D. Sheppard, K. Taylor, A. Thormann, S. J. Trevanion, A. Vullo, S. P. Wilder, M. Wilson, A. Zadissa, B. L. Aken, E. Birney, F. Cunningham, J. Harrow, J. Herrero, T. J. Hubbard, R. Kinsella, M. Muffato, A. Parker, G. Spudich, A. Yates, D. R. Zerbino, S. M. Searle, Ensembl 2014. Nucleic Acids Res. 42 (D1), D749–D755 (2014). Medline doi:10.1093/nar/gkt1196
29. R. Apweiler et al.; UniProt Consortium, Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 42 (D1), D191–D198 (2014). Medline doi:10.1093/nar/gkt1140
30. F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Söding, J. D. Thompson, D. G. Higgins, Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011). Medline doi:10.1038/msb.2011.75
31. W. J. Wilbur, D. J. Lipman, Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80, 726–730 (1983). Medline doi:10.1073/pnas.80.3.726
32. V. Ranwez, S. Harispe, F. Delsuc, E. J. Douzery, MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLOS ONE 6, e22594 (2011). Medline doi:10.1371/journal.pone.0022594
33. J. E. Karro, Y. Yan, D. Zheng, Z. Zhang, N. Carriero, P. Cayting, P. Harrison, M. Gerstein, Pseudogene.org: A comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res. 35 (suppl 1), D55–D60 (2007). Medline doi:10.1093/nar/gkl851
34. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). Medline doi:10.1038/nbt.1621
35. C. Trapnell, D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn, L. Pachter, Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013). Medline doi:10.1038/nbt.2450
36. J. F. Degner, J. C. Marioni, A. A. Pai, J. K. Pickrell, E. Nkadori, Y. Gilad, J. K. Pritchard, Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25, 3207–3212 (2009). Medline doi:10.1093/bioinformatics/btp579
37. M. D. Robinson, D. J. McCarthy, G. K. Smyth, edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010). Medline doi:10.1093/bioinformatics/btp616
38. J. R. Lupski, Hotspots of homologous recombination in the human genome: Not all homologous sequences are equal. Genome Biol. 5, 2004–2005 (2004). doi:10.1186/gb-2004-5-10-242
39. B. Korber, Computational Analysis of HIV Molecular Sequences (Kluwer Academic Publishers, Dordrecht, 2000).
40. M. Nei, T. Gojobori, Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3, 418–426 (1986). Medline
41. S. Ganeshan, R. E. Dickover, B. T. Korber, Y. J. Bryson, S. M. Wolinsky, Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virol. 71, 663–677 (1997). Medline
42. F. Ronquist, J. P. Huelsenbeck, MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574 (2003). Medline doi:10.1093/bioinformatics/btg180
43. J. Zhang, Evolution by gene duplication: An update. Trends Ecol. Evol. 18, 292–298 (2003). doi:10.1016/S0169-5347(03)00033-8
44. M. Lynch, B. Walsh, The Origins of Genome Architecture, volume 98 (Sinauer Associates, Sunderland, MA, 2007).
45. H. Kaessmann, Origins, evolution, and phenotypic impact of new genes. Genome Res. 20, 1313–1326 (2010). Medline doi:10.1101/gr.101386.109
46. S. Ohno et al., Evolution by Gene Duplication (George Allen & Unwin Ltd., London; Springer-Verlag, Berlin, Heidelberg, New York, 1970).
47. P. W. H. Holland, J. Garcia-Fernandez, N. A. Williams, A. Sidow, Gene duplications and the origins of vertebrate development. Development 1994 (Supplement), 125–133 (1994).
48. H. H. Kazazian Jr., J. V. Moran, The impact of L1 retrotransposons on the human genome. Nat. Genet. 19, 19–24 (1998). Medline doi:10.1038/ng0598-19
49. C. Esnault, J. Maestre, T. Heidmann, Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24, 363–367 (2000). Medline doi:10.1038/74184
50. H. Kaessmann, N. Vinckenbosch, M. Long, RNA-based gene duplication: Mechanistic and evolutionary insights. Nat. Rev. Genet. 10, 19–31 (2009). Medline doi:10.1038/nrg2487
51. J. Nathans, D. Thomas, D. S. Hogness, Molecular genetics of human color vision: The genes encoding blue, green, and red pigments. Science 232, 193–202 (1986). Medline doi:10.1126/science.2937147
52. P. F. Chance, N. Abbas, M. W. Lensch, L. Pentao, B. B. Roa, P. I. Patel, J. R. Lupski, Two autosomal dominant neuropathies result from reciprocal DNA duplication/deletion of a region on chromosome 17. Hum. Mol. Genet. 3, 223–228 (1994). Medline doi:10.1093/hmg/3.2.223
53. E. E. Eichler, Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17, 661–669 (2001). Medline doi:10.1016/S0168-9525(01)02492-1
54. J. A. Lee, C. M. Carvalho, J. R. Lupski, A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 1235–1247 (2007). Medline doi:10.1016/j.cell.2007.11.037
55. M. Lynch, J. S. Conery, The origins of genome complexity. Science 302, 1401–1404 (2003). Medline doi:10.1126/science.1089370
56. A. E. Vinogradov, Large scale of human duplicate genes divergence. J. Mol. Evol. 75, 25–33 (2012). Medline doi:10.1007/s00239-012-9516-1
57. C. Roth, S. Rastogi, L. Arvestad, K. Dittmar, S. Light, D. Ekman, D. A. Liberles, Evolution after gene duplication: Models, mechanisms, sequences, systems, and organisms. J. Exp. Zool. B Mol. Dev. Evol. 308, 58–73 (2007). Medline doi:10.1002/jez.b.21124
58. P. H. Sudmant, T. Rausch, E. J. Gardner, R. E. Handsaker, A. Abyzov, J. Huddleston, Y. Zhang, K. Ye, G. Jun, M. Hsi-Yang Fritz, M. K. Konkel, A. Malhotra, A. M. Stütz, X. Shi, F. Paolo Casale, J. Chen, F. Hormozdiari, G. Dayama, K. Chen, M. Malig, M. J. Chaisson, K. Walter, S. Meiers, S. Kashin, E. Garrison, A. Auton, H. Y. Lam, X. Jasmine Mu, C. Alkan, D. Antaki, T. Bae, E. Cerveira, P. Chines, Z. Chong, L. Clarke, E. Dal, L. Ding, S. Emery, X. Fan, M. Gujral, F. Kahveci, J. M. Kidd, Y. Kong, E. W. Lameijer, S. McCarthy, P. Flicek, R. A. Gibbs, G. Marth, C. E. Mason, A. Menelaou, D. M. Muzny, B. J. Nelson, A. Noor, N. F. Parrish, M. Pendleton, A. Quitadamo, B. Raeder, E. E. Schadt, M. Romanovitch, A. Schlattl, R. Sebra, A. A. Shabalin, A. Untergasser, J. A. Walker, M. Wang, F. Yu, C. Zhang, J. Zhang, X. Zheng-Bradley, W. Zhou, T. Zichner, J. Sebat, M. A. Batzer, S. A. McCarroll, R. E. Mills, M. B. Gerstein, A. Bashir, O. Stegle, S. E. Devine, C. Lee, E. E. Eichler, J. O. Korbel; 1000 Genomes Project Consortium, An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015). Medline
59. A. L. Hughes, The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. London Ser. B 256, 119–124 (1994). doi:10.1098/rspb.1994.0058
60. M. W. Hahn, Distinguishing among evolutionary models for the maintenance of gene duplicates. J. Hered. 100, 605–617 (2009). Medline doi:10.1093/jhered/esp047
61. J. M. Lambert, W. O. Cochran, B. M. Wilde, K. G. Olsen, C. D. Cooper, Evidence for widespread subfunctionalization of splice forms in vertebrate genomes. Genome Res. 2015, 184473 (2015). doi:10.1101/gr.184473.114
62. J. Altschmied, J. Delfgaauw, B. Wilde, J. Duschl, L. Bouneau, J.-N. Volff, M. Schartl, Subfunctionalization of duplicate MITF genes associated with differential degeneration of alternative exons in fish. Genetics 161, 259–267 (2002). Medline
63. W. P. Yu, S. Brenner, B. Venkatesh, Duplication, degeneration and subfunctionalization of the nested synapsin-Timp genes in Fugu. Trends Genet. 19, 180–183 (2003). Medline doi:10.1016/S0168-9525(03)00048-9
64. K. A. Hultman, N. Bahary, L. I. Zon, S. L. Johnson, Gene duplication of the zebrafish kit ligand and partitioning of melanocyte development functions to kit ligand a. PLOS Genet. 3, e17 (2007). Medline
65. A. N. Marshall, M. C. Montealegre, C. Jiménez-López, M. C. Lorenz, A. van Hoof, Alternative splicing and subfunctionalization generates functional diversity in fungal proteomes. PLOS Genet. 9, e1003376 (2013). Medline doi:10.1371/journal.pgen.1003376
66. R. N. Nurtdinov, I. I. Artamonova, A. A. Mironov, M. S. Gelfand, Low conservation of alternative splicing patterns in the human and mouse genomes. Hum. Mol. Genet. 12, 1313–1320 (2003). Medline doi:10.1093/hmg/ddg137
67. B. Modrek, C. J. Lee, Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat. Genet. 34, 177–180 (2003). Medline doi:10.1038/ng1159
68. R. Sorek, R. Shamir, G. Ast, How prevalent is functional alternative splicing in the human genome? Trends Genet. 20, 68–71 (2004). Medline doi:10.1016/j.tig.2003.12.004
69. N. L. Barbosa-Morais, M. Irimia, Q. Pan, H. Y. Xiong, S. Gueroussov, L. J. Lee, V. Slobodeniuc, C. Kutter, S. Watt, R. Colak, T. Kim, C. M. Misquitta-Ali, M. D. Wilson, P. M. Kim, D. T. Odom, B. J. Frey, B. J. Blencowe, The evolutionary landscape of alternative splicing in vertebrate species. Science 338, 1587–1593 (2012). Medline doi:10.1126/science.1230612
70. J. Merkin, C. Russell, P. Chen, C. B. Burge, Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science 338, 1593–1599 (2012). Medline doi:10.1126/science.1228186
71. J. P. Huelsenbeck, F. Ronquist, MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001). Medline doi:10.1093/bioinformatics/17.8.754
72. F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Hohna, B. Larget, L. Liu, M. A. Suchard, J. P. Huelsenbeck, MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61, 539–542 (2012). Medline doi:10.1093/sysbio/sys029
73. Z. Yang, PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). Medline
74. Z. Yang, PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007). Medline doi:10.1093/molbev/msm088
75. R. M. Kuhn, D. Karolchik, A. S. Zweig, H. Trumbower, D. J. Thomas, A. Thakkapallayil, C. W. Sugnet, M. Stanke, K. E. Smith, A. Siepel, K. R. Rosenbloom, B. Rhead, B. J. Raney, A. Pohl, J. S. Pedersen, F. Hsu, A. S. Hinrichs, R. A. Harte, M. Diekhans, H. Clawson, G. Bejerano, G. P. Barber, R. Baertsch, D. Haussler, W. J. Kent, The UCSC genome browser database: Update 2007. Nucleic Acids Res. 35 (suppl 1), D668–D673 (2007). Medline doi:10.1093/nar/gkl928
76. T. Makino, K. Hokamp, A. McLysaght, The complex relationship of gene duplication and essentiality. Trends Genet. 25, 152–155 (2009). Medline doi:10.1016/j.tig.2009.03.001
77. J. Ernst, M. Kellis, ChromHMM: Automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012). Medline doi:10.1038/nmeth.1906
78. J. Lonsdale, J. Thomas, M. Salvatore, R. Phillips, E. Lo, S. Shad, R. Hasz, G. Walters, F. Garcia, N. Young, B. Foster, M. Moser, E. Karasik, B. Gillard, K. Ramsey, S. Sullivan, J. Bridge, H. Magazine, J. Syron, J. Fleming, L. Siminoff, H. Traino, M. Mosavel, L. Barker, S. Jewell, D. Rohrer, D. Maxim, D. Filkins, P. Harbach, E. Cortadillo, B. Berghuis, L. Turner, E. Hudson, K. Feenstra, L. Sobin, J. Robb, P. Branton, G. Korzeniewski, C. Shive, D. Tabor, L. Qi, K. Groch, S. Nampally, S. Buia, A. Zimmerman, A. Smith, R. Burges, K. Robinson, K. Valentino, D. Bradbury, M. Cosentino, N. Diaz-Mayoral, M. Kennedy, T. Engel, P. Williams, K. Erickson, K. Ardlie, W. Winckler, G. Getz, D. DeLuca, D. MacArthur, M. Kellis, A. Thomson, T. Young, E. Gelfand, M. Donovan, Y. Meng, G. Grant, D. Mash, Y. Marcus, M. Basile, J. Liu, J. Zhu, Z. Tu, N. J. Cox, D. L. Nicolae, E. R. Gamazon, H. K. Im, A. Konkashbaev, J. Pritchard, M. Stevens, T. Flutre, X. Wen, E. T. Dermitzakis, T. Lappalainen, R. Guigo, J. Monlong, M. Sammeth, D. Koller, A. Battle, S. Mostafavi, M. McCarthy, M. Rivas, J. Maller, I. Rusyn, A. Nobel, F. Wright, A. Shabalin, M. Feolo, N. Sharopova, A. Sturcke, J. Paschal, J. M. Anderson, E. L. Wilder, L. K. Derr, E. D. Green, J. P. Struewing, G. Temple, S. Volpi, J. T. Boyer, E. J. Thomson, M. S. Guyer, C. Ng, A. Abdallah, D. Colantuoni, T. R. Insel, S. E. Koester, A. R. Little, P. K. Bender, T. Lehner, Y. Yao, C. C. Compton, J. B. Vaught, S. Sawyer, N. C. Lockhart, J. Demchok, H. F. Moore; GTEx Consortium, The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013). Medline doi:10.1038/ng.2653
79. J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, B. Ren, Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012). Medline doi:10.1038/nature11082