UNIVERSITY OF COPENHAGEN FACULTY OR DEPARTMENT PhD Thesis Ye Yin Evolution and Adaptation of Baboon and Mandrill Revealed by Genome Sequencing Academic advisor: Karsten Kristiansen, University of Copenhagen, Denmark This thesis has been submitted to the PhD School of The Faculty of Science, University of Copenhagen Submitted: March 2018

UNIVERSITY OF COPENHAGEN

FACULTY OR DEPARTMENT

PhD Thesis

Ye Yin

Evolution and Adaptation of Baboon and Mandrill Revealed by

Genome Sequencing

Academic advisor: Karsten Kristiansen, University of Copenhagen,

Denmark

This thesis has been submitted to the PhD School of The Faculty of Science, University of

Copenhagen

Submitted: March 2018

Dissertation for the degree of philosophiae doctor (PhD)

Department of Biology, University of Copenhagen

Copenhagen, Denmark

and

BGI-Research, BGI-Shenzhen

Shenzhen, China

March 2018

Author: Ye Yin

Title: Evolution and Adaptation of Baboon and Mandrill Revealed by Genome Sequencing

Academic advisors: Karsten Kristiansen, Department of Biology, University of Copenhagen,

Denmark

Submitted March 2018

Preface

This PhD project started in 2015 as a collaboration between the Department of Biology, University of Copenhagen, and BGI-Shenzhen. The work presented here was performed at both institutions under the supervision of Professor Karsten Kristiansen.

Acknowledgements

I would like to thank my supervisor, Professor Karsten Kristiansen, for introducing me to the cutting-edge research area of genomics and bioinformatics and for his kind guidance on conducting academic research.

I would also like to thank Chenglin Zhang, Deputy Director of Beijing Zoo, and Professor Rasmus from the University of Copenhagen for kindly providing the baboon and mandrill samples used in this study.

Additionally, I would like to thank those who participated in the crowdfunding for the baboon and mandrill sequencing projects.

Abstract

Baboons (genus Papio) and the mandrill (Mandrillus sphinx) are phylogenetically close to humans and can serve as unique models for studies of primate evolution and human disease. However, genetic studies and genomic resources for baboon and mandrill are limited, especially compared with those for chimpanzee and gorilla. Genome sequencing of baboon and mandrill was therefore carried out to construct reference genomes for these remarkable Old World monkeys.

After sampling, DNA extraction and sequencing, 414 Gb and 426 Gb of raw sequencing data from different libraries were generated for baboon and mandrill, respectively, using a second-generation sequencing platform. The genomes of both species were then assembled from these data. The baboon assembly was 3.11 Gb with a contig N50 of 21,659 bp and a scaffold N50 of 1,070,645 bp, and the mandrill assembly was 2.88 Gb with a contig N50 of 20,483 bp and a scaffold N50 of 3,564,730 bp. Repeat content was annotated first and accounted for 42.3% and 40.4% of the baboon and mandrill assemblies, respectively. After masking the repeats, evidence-based and ab initio gene annotation were combined to predict 23,867 genes in baboon and 21,906 genes in mandrill. Searching 3,023 BUSCO (Benchmarking Universal Single-Copy Orthologs) genes against the predicted genes gave an estimated gene-set completeness of 97% (baboon) and 98% (mandrill). These comprehensive assemblies and complete gene sets provide new biological insight into genetic diversity, structural variation and behavioral characteristics. Comparative genomic analysis among primates was conducted to reveal synteny between primates and gene family expansion and contraction, especially in baboon and mandrill. There were 9,930, 11,418 and 14,318 gene pairs between baboon and mandrill, macaque and baboon, and human and macaque, respectively. In the baboon and mandrill lineage, 545 gene families were expanded and 618 were contracted; the expanded genes were significantly enriched in biosynthetic process, structural constituent of ribosome, nucleosomal DNA binding, G-protein coupled receptor activity, olfactory receptor activity, glucose catabolic process and peptidyl-prolyl isomerization, as well as in the carbon fixation in photosynthetic organisms and electron transport chain pathways. Molecular mechanisms of adaptation in baboon and mandrill, including immunity, language competence and olfaction, were also investigated through comparative genomics. This study provides genomic resources for primate species and comprehensive insights into the adaptation and evolution of baboon and mandrill.
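The contig and scaffold N50 values reported in the abstract summarize assembly contiguity: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The following is a minimal sketch of that standard definition, not the pipeline used in this thesis; the function name `n50` and the toy contig lengths are assumptions made for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    Illustrative helper (hypothetical name, not from the thesis):
    sort lengths from longest to shortest and return the length at
    which the running total first reaches half of the assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (made up for illustration): total 300 bp, half 150 bp.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half the total, so N50 is 70
```

Applied to real data, the same function would take the lengths of all contigs (or scaffolds) in an assembly, so a larger scaffold N50 than contig N50, as reported for both species, reflects contigs being joined into longer scaffolds.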

Table of Contents

Preface
Acknowledgements
Abstract
Table of Contents
Abbreviations
List of Tables
List of Figures
1. Introduction
1.1 Baboon and its biology
1.2 Mandrill and its biology
1.3 Genomic studies on baboon and mandrill
1.3.1 Genomics of primates
1.3.2 Genomics of baboon
1.3.3 Genomics of mandrill
1.3.4 Comparative genomics in primates
1.4 Objectives
2. Materials and Methods
2.1 Sampling and sample preparation
2.2 Genome sequencing
2.2.1 Library construction and sequencing
2.2.2 Data filtering
2.2.3 Overlapping library data merging
2.2.4 K-mer analysis
2.3 Genome assembly and annotation
2.3.1 Genome assembly
2.3.2 Genome annotation
2.4 Evolutionary analysis
2.4.1 Gene family cluster
2.4.2 Phylogenetic analysis
2.4.3 Positively gene selection analysis
2.5 Comparative genomics
2.5.1 Synteny analysis of human, macaque, baboon and mandrill
2.5.2 Gene family contraction and expansion
2.5.3 Segmental duplications
2.6 Investigating molecular mechanisms of adaptation/phenotype
2.6.1 Immune character
2.6.2 Language competence
2.6.3 Olfactory character
2.6.4 Predicting binding sites of transcription factors
3. Results
3.1 Landscapes of baboon and mandrill genomes
3.1.1 Sequencing data
3.1.2 K-mer analysis
3.1.3 Genome assembly
3.1.4 Annotation results
3.2 Evolution of baboon and mandrill
3.2.1 Gene families
3.2.2 Phylogenetic analysis
3.3 Synteny among primates
3.3.1 Synteny analysis of human, macaque, baboon and mandrill
3.3.2 Gene family contraction and expansion
3.3.2 Segmental duplications
3.4 MHC comparison between human and baboon/mandrill
3.5 Language related genomic features
3.6 Olfactory receptor genes analysis
3.7 Positively selected genes
4. Discussion
5. Conclusions
6. Future perspectives
7. References
8. Appendix

Abbreviations

4D Fourfold Degenerate

AIDS Acquired Immune Deficiency Syndrome

BP Biological Process

BUSCO Benchmarking Universal Single-Copy Orthologs

CC Cellular Component

CEGMA Core Eukaryotic Genes Mapping Approach

ChIP-seq Chromatin Immunoprecipitation Sequencing

CMV Cytomegalovirus

DBG De Bruijn Graph

EBV Epstein-Barr Virus

EC Enzyme Commission

EST Expressed Sequence Tag

GO Gene Ontology

HAV Hepatitis A Virus

HGNC HUGO Gene Nomenclature Committee

HIV Human Immunodeficiency Virus

HLA Human Leukocyte Antigen

IPEX Immunodysregulation Polyendocrinopathy Enteropathy X-Linked

KEGG Kyoto Encyclopedia of Genes and Genomes

LINE Long Interspersed Nuclear Element

LTR Long Terminal Repeat

MF Molecular Function

MHC Major Histocompatibility Complex

MYA Million Years Ago

NGS Next Generation Sequencing

OR Olfactory Receptor

PCR Polymerase Chain Reaction

PGC Primordial Germ Cells

PPIA Peptidylprolyl Isomerase A

PSMC Pairwise Sequentially Markovian Coalescent

QTL Quantitative Trait Loci

ROS Reactive Oxygen Species

SD Segmental Duplication

SINE Short Interspersed Nuclear Element

SIV Simian Immunodeficiency Virus

SMRT Single-Molecule Real-Time Sequencing

SNPRC Southwest National Primate Research Center

TE Transposable Element

WSSD Whole-Genome Shotgun Sequence Detection

List of Tables

Table 1.1 Baboons as animal models for studies of human diseases and vaccines.
Table 1.2 Summary of mandrill as models in human diseases and vaccines studies/tests.
Table 1.3 Published primate genome sequences.
Table 3.1 Statistics of baboon and mandrill raw sequencing data.
Table 3.2 The information of 17-mer statistics.
Table 3.3 Statistics of the genome assemblies.
Table 3.4 Repeat contents of baboon, mandrill, human and mouse.
Table 3.5 Summary of gene annotation in baboon genome.
Table 3.6 Summary of gene annotation in mandrill genome.
Table 3.7 Assessment of gene sets using BUSCO.
Table 3.8 Function annotation of the final gene sets.
Table 3.9 Gene family clustering in the seven species.
Table 3.10 Olfactory receptor gene copy number in five species.
Table 8.1 Statistics of baboon and mandrill clean/filtered sequencing data.
Table 8.2 Prediction of the repeats in baboon genome.
Table 8.3 General statistics of repeats in mandrill genome.
Table 8.4 Categories of TEs in baboon genome.
Table 8.5 Categories of TEs in mandrill genome.
Table 8.6 Non-coding RNA genes in baboon genome.
Table 8.7 Non-coding RNA genes in mandrill genome.
Table 8.8 GO enrichment of unique gene families in baboon.
Table 8.9 GO enrichment of unique gene families in mandrill.
Table 8.10 GO enrichment result of unique gene families for mandrill.
Table 8.11 Repeat content of MHC class I region for mandrill and human.
Table 8.12 GO and KEGG enrichment of the positively selected genes (PSGs).

List of Figures

Figure 1.1 Geographical distribution of baboons.
Figure 2.1 Photos of the samples selected for sequencing.
Figure 2.2 Overall process of genome annotation.
Figure 3.1 Orthologous gene clusters in the five related species.
Figure 3.2 Comparison of orthologous genes among 12 primates and mouse.
Figure 3.3 Phylogenetic tree based on single copy gene families in the 13 species.
Figure 3.5 Synteny relationship of human, macaque, baboon and mandrill.
Figure 3.6 Gene family contraction and expansion for 12 primates and mouse.
Figure 3.7 Segmental duplications in seven primate species.
Figure 3.8 Synteny between human and mandrill MHC regions.
Figure 3.9 Alignment of HLA genes with amino acid sequence for human, baboon and mandrill.
Figure 3.10 Structure of MICA and MICB gene for human and mandrill.
Figure 3.11 Amino acid sequence alignment of FOXP2 gene from human, chimpanzees, mouse, baboon and mandrill.
Figure 3.12 Expansion of the olfactory receptor gene family in baboon and mandrill.
Figure 3.13 Interaction between innate immunity for positively selected genes in mandrill.
Figure 8.1 The distribution of 17-mer frequency of baboon and mandrill.
Figure 8.2 Colinearity analysis of chr 3 for mandrill.
Figure 8.3 Sequencing depth and the location relationships of pair-end reads on MHC class I region for mandrill.
Figure 8.4 OR7E24 genes on chromosome 19 in mandrill.

1. Introduction

There are two suborders of primates, the Strepsirrhini and the Haplorhini. Haplorhines are further split into tarsiers and simians. Simians comprise two groups, one of which is the catarrhines; catarrhines are further split into the Old World monkeys (Cercopithecoidea) and the apes (Hominoidea). Baboons (genus Papio) and the mandrill (Mandrillus sphinx) belong to the tribe Papionini; they are Old World monkeys that are widely distributed in Africa. Like chimpanzees and gorillas, which belong to Hominidae, baboon and mandrill are also closely related to human beings [1]. Papio, Mandrillus and Macaca used to be clustered in the tribe Cercocebini of the subfamily Cercopithecinae [2], but Papio is currently grouped with Lophocebus, while Mandrillus is grouped with Cercocebus, on the basis of the postcranial skeleton and the dentition [3]. The tribe Papionini diverged from Cercopithecini around 11.5 million years ago (Mya) and comprises the subtribe Papionina, with the genera Papio and Mandrillus, and the subtribe Macacina, with the genus Macaca [4]. Complete mtDNA genome sequences have also supported similar phylogenetic relationships among Macaca and Mandrillus [5].

1.1 Baboon and its biology

Baboons are Old World monkeys belonging to the genus Papio. They have close-set eyes, powerful jaws, short tails, long muzzles, thick fur, and rough spots on their protruding buttocks. Baboon species also show sexual dimorphism, usually in size but sometimes also in color or canine development [6]. Baboons can live for more than 40 years: individuals in captivity are known to reach 45 years, whereas life expectancy in the wild is about 30 years. They live in open savannah, woodland and hills across Africa, and they occasionally eat insects or fish. There are five species in Papio: P. ursinus (chacma baboon), P. papio (Guinea baboon), P. hamadryas (hamadryas baboon), P. anubis (olive baboon) and P. cynocephalus (yellow baboon), which are predominantly found in southern Africa, western Africa, southwestern Arabia, north-central Africa and eastern Africa, respectively [7] (Figure 1.1).

Figure 1.1 Geographical distribution of baboons. Distribution based on the map in Kingdon [8] (modified from a figure in a previous study [9]).

Baboons live in hierarchical troops of 5 to more than 200 individuals, considerably larger than most chimpanzee groups. Troop size varies widely among baboon species and across the year. The social structure of hamadryas baboons is remarkably different from that of the other baboon species, which are collectively termed savanna baboons. For example, hamadryas baboons form very large troops composed of many small harems, whereas other baboons often have a more promiscuous structure in which the hierarchy is determined by the matriline. In hamadryas harems, the males jealously guard their females, and some also raid other harems for females, which causes fights between males. Visual threats such as quick flashing of the eyelids and showing the teeth are usually used during the fights, and in some species infants are taken as hostages.

Baboons can determine the dominance relations between individuals from vocal exchanges. In savanna baboons, each male can mate with any female, and the mating order among males depends partly on their rank. Individuals of higher rank have benefits in health and reproduction. High-ranking males have higher levels of testosterone and lower levels of glucocorticoid than other males, although the top-ranking males have higher levels of both testosterone and glucocorticoid than the second-ranking males [10]. Females also prefer friendly males as mates. A male baboon may therefore gain the chance to mate with a female by exhibiting friendly behaviors such as grooming her or supplying her with food. Gestation lasts six months and usually produces a single infant. The mother is the primary caretaker, but other females share the duties of caring for all the offspring. Young baboons are weaned after about one year, and males leave their group before they reach sexual maturity, at about five or six years old, whereas females stay in the same group.

Studies of complex human diseases are difficult because it is very challenging to control pedigree structure and environmental conditions in humans. Limited access to tissues also greatly hampers such studies. To overcome these limitations, nonhuman primates are often used as valuable resources. Baboons share many genetic, biochemical, physiologic and anatomic characteristics with human beings [11] and are naturally infected with numerous human pathogens; they therefore have the potential to serve as animal models for research in physiology and pathophysiology [12], including cardiovascular disease, obesity, hypertension, age-related skeletal disease, epilepsy, infectious disease and intrauterine research [13, 14] (Table 1.1). Transplantation and drug therapy studies have also been conducted in baboons [15-17].

Table 1.1 Baboons as animal models for studies of human diseases and vaccines.

Disease or pathogen | Experimental objective | Reference
Viral diseases
Ebola | Studies in pathogenesis | [18]
Encephalomyocarditis virus | Vaccine | [19]
HIV-1 | HIV-1 vaccine candidates | [20]
Hepatitis A virus (HAV), cytomegalovirus (CMV), Epstein-Barr virus (EBV) | Infections transmissible between baboons and human beings | [21]
Bacterial infections
Bacillus anthracis | Infections in nonhuman primate model | [22]
Francisella tularensis | Outer membrane live, attenuated LPS-17 | [23]
Angioinvasive aspergillus | Baboon-to-human liver transplantation infection | [15]
Parasite infections
Leishmania major | Infection model | [24]
Schistosoma mansoni | Live, irradiated cercariae vaccine | [25, 26]
Zoonotic gastrointestinal | Baboon as zoonotic reservoirs | [27]

At the same time, some characteristics of baboons differ obviously from those of humans, including language ability and sensory capabilities [28]. On the basis of comparative articulatory anatomy, many studies [29] have claimed that nonhuman primates are incapable of producing systems of vowel-like sounds because of their high larynx position, but recent discoveries have begun to challenge this view for three reasons. First, some animal species with no documented ability to produce systems of vowel-like sounds nevertheless have a low larynx [30]. Second, human infants, whose larynx is still high, produce the same range of vowel qualities as adults [31]. Third, modeling suggests that the production of vocalic sounds does not depend on the position of the larynx, but rather on the control of the tongue muscles and lips [32].

1.2 Mandrill and its biology

The mandrill (here referring specifically to Mandrillus sphinx) is a primate of the Old World monkey family (Cercopithecidae). Along with the drill, the mandrill was once classified as a baboon (Papio) because the two are superficially similar [33]. Mandrills live in tropical rainforests and in rocky, riparian, flooded or gallery forests, as well as in cultivated areas and stream beds in Africa, and are usually found in southern Cameroon, Gabon, Equatorial Guinea and Congo [34]. The distribution is generally separated by the Sanaga River and the Ogooué and White Rivers. There are remarkable genetic differences between the two populations, which have consequently been classified into different subspecies.

Mandrills are omnivorous, with a diet ranging from fruits to insects. The diet is generally composed of fruits (50.7%), seeds (26.0%), leaves (8.2%), pith (6.8%), flowers (2.7%) and animal foods (4.1%), with other foods making up the remainder (1.4%). They consume plants of more than a hundred species, with a preference for fruits, and they also eat mushrooms and soil. The animals they eat are mostly invertebrates, such as ants, beetles, termites, crickets, snails and scorpions. The diet also includes eggs and small vertebrates such as birds, frogs, rats and shrews, as well as juveniles of larger vertebrates such as bay duikers and other antelopes [35]. The life expectancy of mandrills in captivity can be up to 31 years, shorter than that of baboons.

Mandrills also exhibit strong sexual dimorphism. The species has experienced long and strong sexual selection and, as a result, males are larger and more brightly colored than females. The mandrill's face is hairless with an elongated muzzle, and it bears distinctive features such as protruding blue ridges on the sides and red nostrils and lips. The areas around the genitals are multi-colored, and dominant males have particularly pronounced coloration. The mandrill is the largest and heaviest monkey in the world. Typically, males weigh 19–37 kg, with an average of 32.3 kg, while females weigh roughly half as much, at 10–15 kg with an average of 12.4 kg. Males are on average 75–95 cm long and females 55–66 cm. Shoulder height ranges from 45–50 cm in females to 55–65 cm in males. These sizes and weights surpass even those of the largest baboons. Furthermore, the mandrill is more ape-like than the baboons in body structure, with a muscular and compact build, shorter, thicker limbs that are longer in the front, and almost no tail.

Mandrills are mostly terrestrial, but they are more arboreal than baboons [36]. They live in large, stable groups that can number hundreds of individuals [34]. The largest horde that has been verifiably observed contained more than 1,300 mandrills, the largest aggregation of nonhuman primates ever documented. Mandrills are diurnal and sleep in trees at night. They use tools and have been observed using sticks in captivity.

Mandrills breed every two years, and the mating season extends from June to October. Male mandrills sometimes fight for mating rights. Testicular volume increases when a male gains dominance (alpha male) and decreases if dominance is lost. Similar changes are observed in the color of the sexual skin on the face and genitalia, which becomes red in alpha males. Physiologically, the secretion of the sternal cutaneous gland also increases accordingly [37, 38]. A dominance hierarchy also exists among females [38].

Mate choice in mandrills can be attributed to smell rather than color, and is mainly mediated by genes of the major histocompatibility complex (MHC) [39]. The MHC is a cluster of genes that determine a mandrill's individual scent: they help build proteins involved in the body's immune system and affect body odour by interacting with bacteria on the skin. A series of experiments found that particular odour types were consistent with particular MHC gene patterns, suggesting that mandrills use odour as an indication of genetic compatibility [40]. The mandrill is one of two species in Papionini that possess a sternal gland, a triangular area in the middle of the chest that is the structural basis for scent [41]. The mandrill is also widely used in immune system research and is a natural host for SIV strains (Table 1.2).

Table 1.2 Summary of mandrill as models in human diseases and vaccines studies/tests.

Disease / pathogen                              Objectives                        Reference
Viral diseases
  Simian Immunodeficiency virus (SIV)           Studies in pathogenesis           [42]
  Simian T-lymphotropic Virus Type 1 (STLV-1)   Transmission modes                [43, 44]
Bacterial infections
  Paratuberculosis                              Infections in mandrill            [45]
  Helicobacter heilmannii                       Bacterial pathogen model          [46]
Parasite infections
  Amebic, ciliate, nematodes                    Parasite prevalence in mandrill   [47]
  Loa loa                                       Irradiated vaccine                [48]

1.3 Genomic studies on baboon and mandrill

1.3.1 Genomics of primates

Since the completion of the human genome assembly [49], and with the reduced cost of genome sequencing and the greatly increased throughput of new sequencers, more and more primate genomic data are becoming available, including both Old World primates, such as chimpanzee [50], and New World monkeys, such as marmoset (Callithrix jacchus). The genomics of non-human primates has received wide interest for two reasons: its application in models for the analysis of human disease, and the study of genetic conservation and divergence over evolutionary history through comparative genomics [51]. Generally, the species selected for genome sequencing meet the following criteria: 1) an important evolutionary position within the phylogeny (e.g. chimpanzee, gibbon and orangutan); 2) biomedical relevance to human. For example, macaque and baboon, although the genome sequencing of the latter has not been completed yet, were selected because they are often used to study the genetic basis of numerous human diseases [13], and squirrel monkey is used for studies of neurobiology and infectious disease. The size of primate genomes varies little, ranging from 2.7 Gb in bonobo (Pan paniscus) [52] to 3.4 Gb in tarsier (Tarsius syrichta) [53]. Repetitive regions occupy about 50% of human, ape and monkey genomes, but the number of species-specific insertions varies substantially, from ~5,000 in human and ~2,300 in chimpanzee to only 250 in orang-utan [54]. Genomic studies have also been conducted on extinct hominins, the Neanderthals (Green et al., 2010) and the Denisovans [55]. Together, these genomic resources enable comparisons between human and other primates, or between primates and other mammals.

The sequencing and assembly of non-human primate genomes went through different stages, in pace with the development of sequencing technologies. The genomes of chimpanzee (Pan troglodytes) and rhesus macaque (Macaca mulatta) were sequenced by shotgun sequencing using exclusively Sanger methods, at considerable cost and effort [56, 57]. Next-generation sequencing (NGS) was then widely adopted and accelerated progress in genomics: many primates were sequenced and assembled (Table 1.3), improving our understanding of genome content, evolution and diversity [51]. Since 2013, the development and application of single-molecule, real-time (SMRT) sequencing technology has considerably improved assemblies of human and other genomes. Compared with the NGS assembly gorGor3, the gorilla (Gorilla gorilla) assembly built from SMRT sequences shows a significant decrease in fragmentation: the contig N50 increased >819-fold (from 11.8 kb to 9.6 Mb, Table 1.3), and 94% of gorGor3 gaps were closed [58, 59].
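Contig and scaffold N50 values such as those quoted here and in Table 1.3 can be computed directly from a list of sequence lengths. The following is a minimal Python sketch, assuming the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length). The function name and the example lengths are hypothetical and are not taken from any of the cited assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths (in kb), for illustration only.
example_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_contigs))
```

A more contiguous assembly concentrates the same total length in fewer, longer sequences, which is why closing gaps raises the N50.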


Understanding the origin of the human genome is one of the most important purposes of sequencing primates closely related to human. Inter- and intra-species comparisons have provided insights into gene exchange among the early human and chimpanzee ancestors, and have allowed the identification of genes or regions that were positively selected during the evolution of human or other primates. These genes often indicate genetic or phenotypic changes that are critical for the adaptation of human and non-human primates, as well as hominins. It has been clearly shown that genes involved in the immune system and resistance to pathogens, as well as those involved in reproductive biology, were commonly positively selected in many non-human primates [58, 60, 61]. This might result from long-term exposure to various pathogens in the wild, and from the benefit that genes related to gametogenesis confer in competition within species. On the other hand, in individual lineages, signals of positive selection have been found in genes related to a wide range of phenotypes. Positively selected genes shared among human, chimpanzee and gorilla are related to neural and brain development; genes related to glycolipid metabolism and hearing are positively selected in orangutans [54]; and in marmosets and other callitrichine primates, genes involved in phyletic reduction

of body size were positively selected [62]. The common ancestor of dated back to 12-5 million

years ago [63]. The reciprocal gene flow has lasted for ~3 million years between those lineages [63],

suggesting the divergence process is a long period with extensive gene flow, instead of a short event.

Similar evidence of gene exchanges was also detected in Bornean and Sumatran orang-utans

genomes [54, 63].

Long-read sequencing did improve the completeness and accuracy of assemblies, but the scaffolds remained at the Mb level. Recently, a technology named Hi-C has helped assemble contigs into chromosome-scale scaffolds [64-66]. Hi-C and related technologies were developed to detect the three-dimensional folding of chromosomes within the nucleus [67], and this information was then used to assist assembly. The results indicated that combining shotgun fragments and mate-pair sequences with Hi-C data could generate chromosome-scale assemblies with 98% accuracy in assigning scaffolds to chromosome groups for human [64]. As far as we know, no Hi-C-assisted assembly has been reported for non-human primates.

Table 1.3 Published primate genome sequences (modified based on [51]).

Common name                      Species name            Bases in contigs  Contig N50  Scaffold N50  Reference
Chimpanzee                       Pan troglodytes         2.7 Gb            15.7 kb     8.6 Mb        [56]
Chimpanzee (updated)             P. troglodytes          2.9 Gb            50.7 kb     8.9 Mb        [68]
Bonobo                           Pan paniscus            2.7 Gb            67 kb       9.6 Mb        [69]
Gorilla                          Gorilla gorilla         2.7 Gb            11.8 kb     914 kb        [58]
Gorilla (updated)                Gorilla gorilla         2.8 Gb            9.6 Mb      23.1 Mb       [59]
Orang-utan                       Pongo abelii            3.1 Gb            15.5 kb     739 kb        [54]
Indian rhesus macaque            Macaca mulatta          2.9 Gb            25.7 kb     24.3 Mb       [57]
Indian rhesus macaque (updated)  M. mulatta              3.1 Gb            107.2 kb    4.2 Mb        [70] https://www.ncbi.nlm.nih.gov/assembly/GCA_000772875.3
Chinese rhesus macaque           M. mulatta              2.8 Gb            11.9 kb     891 kb        [71]
Vietnamese cynomolgus macaque    M. fascicularis         2.9 Gb            12.5 kb     652 kb        [71]
Aye-aye                          D. madagascarensis      3.0 Gb            NA          13.6 kb       [72]
Vervet                           C. aethiops             2.8 Gb            90.4 kb     81.8 Mb       [73]
Olive baboon                     P. anubis               2.9 Gb            149.8 kb    585.7 kb      https://www.ncbi.nlm.nih.gov/assembly/GCF_000264685.3/
Gibbon                           Nomascus leucogenys     2.8 Gb            35.1 kb     22.7 Mb       [74]
Marmoset                         Callithrix jacchus      2.3 Gb            29 kb       6.7 Mb        [75]
Mouse lemur                      Microcebus murinus      2.4 Gb            210.7 kb    108.2 Mb      https://www.ncbi.nlm.nih.gov/assembly/GCF_000165445.2
Pig-tailed macaque               Macaca nemestrina       2.8 Gb            106.9 kb    15.2 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000956065.1/#/st
Sifaka                           Propithecus coquereli   2.1 Gb            28.1 kb     5.6 Mb        https://www.ncbi.nlm.nih.gov/assembly/GCF_000956105.1/#/st
Sooty mangabey                   Cercocebus atys         2.8 Gb            112.9 kb    12.8 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000955945.1/
Squirrel monkey                  Saimiri boliviensis     2.5 Gb            38.8 kb     18.7 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000235385.1/#/def
Bushbaby                         Otolemur garnettii      2.4 Gb            27.1 kb     13.9 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000181295.1/
Mouse lemur                      Microcebus murinus      2.4 Gb            182.9 kb    3.7 Mb        https://www.ncbi.nlm.nih.gov/assembly/GCF_000165445.1/
Tarsier                          Tarsius syrichta        3.4 Gb            38.2 kb     401 Mb        [76]

1.3.2 Genomics of baboon

Baboons (Papio) shared a common ancestor with humans ~30 million years ago; they are genetically closer to human than New World monkeys are, but less closely related than the African apes. Although no baboon genome assembly was available before our study, a comparison between a short region (~1.5 Mb) of the baboon and human genomes showed very limited substitutions, most of which are relatively enriched in exons [77].

The baboon has been commonly used as a primate model for genetic studies of complex traits and human diseases because of the high similarity between human and baboon in transcriptome, physiology and genetics [78]. The Southwest National Primate Research Center (SNPRC) at the Texas Biomedical Research Institute maintains ~2,000 baboons for biomedical research. The pedigree contains over 16,000 individuals across seven generations, with 384 founders of P. h. anubis, P. h. cynocephalus and their hybrid progeny. Tissues and blood clots from over 8,000 individuals have been stored, and DNA, serum and buffy coats from ~4,000 members have been banked [13]. Among these pedigreed baboons, more than 2,000 individuals have been genotyped using microsatellite markers, followed by the construction of a whole-genome linkage map that contains 294 ordered loci with an average interval between markers of 7.2 cM [79, 80]. Together with the genotypes, several hundred quantitative traits have been phenotyped in these baboons and used to localize genomic regions harboring genes that control these traits. These data have been applied in studies of atherosclerosis, hypertension, obesity, the craniofacial complex, etc. Taking advantage of this genetic map, scientists have scanned the genome for quantitative trait loci (QTL) associated with over 200 traits related to cardiovascular diseases, and several important QTL were found. For example, Kammerer et al. (2002) found a QTL on chromosome 6 influencing the response of low-density lipoprotein cholesterol to dietary cholesterol; the following year, Rainwater et al. (2003) found several lipid- and lipoprotein-related QTL, including three for low-density lipoprotein cholesterol size fractions located on chromosomes 5, 10q and 17, respectively. A region on chromosome 17 was also found to be associated with cholecystokinin. This region is known to harbor the genes for a glucose transporter, glucagon-like peptide 2 receptor and sterol regulatory element binding transcription factor 1, which are related to adiposity [81].

Transcriptomic study is also an efficient approach to advance our understanding of many human diseases and traits. Northcott et al. (2012) developed a cross-species array (rat and baboon) targeting 328 genes possibly related to blood pressure. Among these genes, 74 were expressed in both rat and baboon kidney, while 41 were specifically expressed in rat and 34 were specific to baboon. This study provided evidence of similarities and differences in gene expression profiles between primate and rodent, and therefore highlighted the importance of an appropriate primate model in studies of human complex diseases as well as other traits, such as neurology and social behaviors.

Several investigators have also combined the transcriptome and the linkage map in their studies and found that, for example, the mRNA level of adiponectin, which is correlated with body weight, serum triglycerides, adipocyte volume and glucose levels, is significantly heritable, and that this heritability is associated with a region on chromosome 4p [82]. Similarly, the abundance of resistin mRNA is also heritable, and the QTL is located on chromosome 19p [83].

1.3.3 Genomics of mandrill

As a natural host of SIV, a close relative of HIV, the mandrill (M. sphinx) is on a list of species whose genomes are to be sequenced. In some cases, mandrills are able to tolerate SIV infection for long periods of time, and their responses to viral infection are sometimes quite different from those of other hosts, such as mangabey and African green monkey. Additionally, mandrills are adapted to two different SIV strains: SIVmnd1, which possibly originated from a virus in Cercopithecus lhoesti, and SIVmnd2, from a virus in M. leucophaeus. Furthermore, a correlation between the low rate of vertical transmission and the expression of CCR5 has been found in mandrill [84]. Heterozygous individuals have greater reproductive success regardless of sex and tend to have more offspring. However, this advantage has only been observed in alpha males, not in beta males. This correlation between heterozygosity and reproductive success and tenure has been fairly well explained by multi-locus effects [85].

1.3.4 Comparative genomics in primates

With genomic data available across different primate species, genome regions or elements common to several species or specific to one (e.g. human) can be identified and analyzed in detail by systematic comparisons between these genomes. Sequence alignment and comparison show a strong correlation between pairwise differences and the time of divergence inferred from other information, such as fossils. The divergence between human and chimpanzee sequences is 1.1-1.4% [58]. The difference between human and rhesus macaque is larger, ~6.5% [61], which is consistent with the earlier divergence of these two species (28-35 million years ago). The alignments also reveal indels among species. Indels are preferentially located in intronic and intergenic regions, which are more tolerant of small indels than protein-coding regions.
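To make the notion of pairwise sequence divergence concrete, the following Python sketch computes the percentage of mismatched positions between two already-aligned sequences, skipping alignment columns that contain a gap (indel) in either sequence. It is an illustration under simplifying assumptions and not the procedure used in the cited studies, which rely on whole-genome alignments and additional filtering. The function name and the toy sequences are hypothetical.

```python
def percent_divergence(aligned_a, aligned_b):
    """Percentage of differing bases between two aligned sequences.

    Both inputs must have the same length; '-' marks an alignment gap.
    Columns with a gap in either sequence are excluded, so indels do not
    contribute to the substitution-based divergence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # indel column, not a substitution
        compared += 1
        if base_a != base_b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


# Toy alignment: one substitution and one single-base indel.
print(round(percent_divergence("ACGTACGT-A", "ACGTTCGTAA"), 2))
```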

Besides nucleotide substitutions and small indels, insertions of fragments and larger segmental rearrangements have also been detected in primate genomes. The most extensively investigated process is the insertion of retrotransposons, such as Alu, which is ongoing in primate genomes and has played a very important role in shaping genome structure. For example, Alu insertions are a major driver of genome change [86]. Retroposition has broader effects on genome evolution because of its potential to induce segmental duplication or deletion [87].

Segmental duplication is vital for genome evolution. Segmental duplications collectively make up ~5% of the human and chimpanzee genomes and ~3.8% of the orang-utan genome [54, 88]. They are not randomly located on the chromosomes: they are enriched on human chromosome 22 (11.9%) and the non-recombining chromosome Y (50.4%) but depleted on chromosome 3 (1.7%) [89]. Segmental duplications in primate genomes are categorized into three groups according to their locations: pericentromeric, subtelomeric and interstitial. Duplications in these three classes differ in type and frequency [90]. Pericentromeric duplicates make up about 47.6 Mb, a third of all duplicates in the human genome [89]. The ratio of inter- to intra-chromosomal duplication in this class is about 6:1 [89]. Furthermore, more than 30% of pericentromeric sequence is occupied by duplicons from other chromosomes. A two-step model has been proposed to explain the process of segmental duplication in pericentromeric regions [91]. Subtelomeric regions similarly contain many duplicates from other chromosomes, although the total amount of duplicated sequence is much smaller (2.6 Mb). Inter-chromosomal segmental duplicates are present in 30 of 42 subtelomeric regions [92]. Subtelomeric segmental duplicates are typically 50 to 100 kb long and, in contrast to the origin of pericentromeric duplicates, arise through exchanges between subtelomeric regions, with much of the relative orientation between non-homologous chromosomes retained [89, 93]. Interstitial segmental duplicates are located in euchromatin. In contrast to the predominance of tandem duplicate clusters found in most genomes, primate genomes contain a large number of interstitial duplicates. Although interspersed duplicates are located along the euchromatin, their locations are not randomly distributed either [89]. Comparative analysis of genomes across primates shows that many interspersed intra-chromosomal duplicates can be dated to the evolution of the great apes. Their origin is often associated with chromosome rearrangements [94].


1.4 Objectives

Given the background described above, whole-genome sequences will facilitate current biological research on baboon and mandrill. I therefore proposed to use second-generation sequencing to construct reference genomes for both baboon and mandrill, and to conduct repeat and gene annotation, gene family clustering, evolutionary analyses and comparative genomic analyses for both species. Study objectives include:

i) Genomic resources for future studies on baboon and mandrill;

ii) Detailed genomic features of baboon and mandrill in repeat content, protein coding genes,

gene families, etc.;

iii) Comparing the genome/genetic features of baboon and mandrill to human and other

primates to provide further insights for primate integration;

iv) Identify possible genetic mechanisms for Old World monkey adaptation;

v) Provide additional insights for human diseases/health.


2. Materials and Methods

2.1 Sampling and sample preparation

One baboon (olive baboon, Papio anubis) and one mandrill (Mandrillus sphinx) were selected for sampling (Figure 2.1). With the assistance of a zoologist, veterinarians drew 5 mL of whole blood from the left jugular vein of the twenty-year-old male baboon and of the eighteen-year-old male mandrill, and collected it in a plastic collection tube containing 4% (w/v) sodium citrate. The blood samples were then snap frozen in liquid nitrogen and stored at -80˚C until further processing. Genomic DNA was extracted from the whole blood samples with the AXYGEN Blood and Tissue Extraction Kit (Corning, USA) according to the manufacturer's instructions. The extracted DNA was subjected to electrophoresis in a 2% agarose gel and stained with ethidium bromide to assess its overall quality. DNA concentration was determined with the Quant-iT™ PicoGreen® dsDNA Reagent and Kits (Thermo Fisher Scientific, USA) according to the manufacturer's instructions.

Figure 2.1 Photos of the samples selected for sequencing. The baboon (a) and the mandrill (b)

are both from Beijing Zoo.


2.2 Genome sequencing

2.2.1 Library construction and sequencing

DNA of the baboon and the mandrill was used for library construction, following protocols described in previous publications [95]. A total of 12 libraries were constructed for each of the two species, and sequencing was carried out on an Illumina HiSeq 2000 sequencer. For each species, 6 libraries were designed in paired-end configuration, comprising 2 libraries with 100 bp reads and a mean target insert size of 250 bp, and 2 libraries of 100 bp reads with insert sizes of 500 bp and 800 bp, respectively. The other 6 libraries were designed and processed in mate-pair configuration, all with 100 bp reads: 1 library with insert size 2 kbp, 4 kbp and 5 kbp, and 1 library each of 10 kbp and 20 kbp insert sizes.

2.2.2 Data filtering

De novo assembly requires high-quality input, so data filtering was carried out to obtain high-quality reads. During sample preparation, adapters were ligated and amplification was conducted. Adapter-contaminated reads, duplicated reads introduced during amplification and reads with high sequencing error (low sequencing quality) therefore need to be filtered, as in a previous study [96]. Here, raw reads from the sequencer were filtered using SOAPnuke (v. 1.5.6; https://github.com/BGI-flexlab/SOAPnuke). The filtering criteria were as follows: i) reads in which >10 percent of bases were Ns (uncertain/ambiguous bases) were removed; ii) reads in which >40 percent of bases were of low quality (quality score <=10) were removed; iii) reads contaminated by adaptor (adaptor matched 50%, allowing one base mismatch) or produced by PCR duplication (identical reads at both ends) were removed.

Parameters:

SOAPnuke filter -f adapter1.list -r adapter2.list -1 reads1.fq.gz -2 reads2.fq.gz -l 10 -q 0.4 -n 0.1 -M 1 -o ./
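Criteria i) and ii) can be illustrated with a short Python sketch. This is not SOAPnuke's implementation: it is a simplified stand-in that assumes each read is given as a base string plus a list of integer Phred quality scores, and it leaves out adaptor and duplicate detection (criterion iii). The function name and its keyword arguments are hypothetical; their default values mirror the thresholds stated in the text.

```python
def passes_filter(seq, quals, max_n_frac=0.1, max_lowq_frac=0.4,
                  lowq_threshold=10):
    """Return True if a read survives the N-content and quality criteria.

    seq   : read bases as a string
    quals : per-base Phred scores (integers), same length as seq

    A read is rejected when more than max_n_frac of its bases are N, or
    when more than max_lowq_frac of its bases have a quality score at or
    below lowq_threshold.
    """
    length = len(seq)
    if length == 0 or length != len(quals):
        return False
    n_frac = seq.upper().count("N") / length
    lowq_frac = sum(q <= lowq_threshold for q in quals) / length
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


# Hypothetical 10 bp reads, for illustration only.
print(passes_filter("ACGTACGTAC", [30] * 10))            # clean read
print(passes_filter("NNGTACGTAC", [30] * 10))            # 20% N
print(passes_filter("ACGTACGTAC", [5] * 5 + [30] * 5))   # 50% low quality
```

For paired-end data, a pair would typically be discarded if either mate fails, which is an assumption about usage rather than something stated in the text.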

2.2.3 Overlapping library data merging

Overlapping libraries are designed so that the ends of the paired reads overlap with each other, and the fragments are therefore sequenced through. For such libraries, the insert size (Si) should be shorter than the combined length of the two reads (2 Lr, where Lr is the length of a single read), and the expected overlap length can be calculated as (2 Lr - Si). The overlap information can be used to merge the paired reads into one longer sequence. Merging reads benefits downstream analysis by providing longer sequences with lower sequencing error. Here, overlapping reads were merged using FLASh [97] v1.2.10 with default parameters.
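The overlap arithmetic and the idea of merging can be sketched in Python as follows. This is a toy illustration, not the algorithm of the merging tool cited above: it requires an exact match in the overlap, ignores base qualities, and takes the read length as that of a single read. All function names and the example fragment are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def expected_overlap(read_length, insert_size):
    """Expected overlap of a read pair: 2 * read length - insert size."""
    return 2 * read_length - insert_size


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair whose ends overlap exactly.

    read2 is reverse-complemented so both reads are on the same strand,
    then the longest suffix of read1 that equals a prefix of read2 is
    used to join them. Returns None when no overlap of at least
    min_overlap bases is found.
    """
    mate = reverse_complement(read2)
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-olap:] == mate[:olap]:
            return read1 + mate[olap:]
    return None


# Hypothetical 26 bp fragment sequenced with 16 bp reads from both ends.
fragment = "ACGTTGCAAGGCTTAACCGGTATCGA"
read1 = fragment[:16]
read2 = reverse_complement(fragment[-16:])
print(expected_overlap(16, len(fragment)))
print(merge_pair(read1, read2, min_overlap=5) == fragment)
```

A non-positive value from expected_overlap indicates that the two reads of a pair are not expected to overlap, in which case merging does not apply.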

2.2.4 K-mer analysis

To estimate genome features including genome size, repeat content and heterozygosity, K-mer analysis was performed first. A K-mer is a sub-sequence of a read with length k. Formula 2.1 was used to estimate the genome size G. In this formula, k_num is the total number of K-mers, k_depth is the expected depth of K-mers, b_num is the total number of bases and b_depth is the expected depth of bases. According to Formula 2.2, K-mer depth follows a Poisson distribution with mean λ. Thus, the peak of the K-mer depth distribution was used as the expected K-mer depth λ.

In this analysis, k was 17, and K-mers were counted with the command: kmerfreq -k 17 -m 1 -o 1 -l fq.list [98].

$$G = \frac{k_{num}}{k_{depth}} = \frac{b_{num}}{b_{depth}} \qquad (2.1)$$


$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \qquad (2.2)$$
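As a worked illustration of the genome size estimate, the following Python sketch applies G = k_num / k_depth to a K-mer depth histogram, taking the peak depth as the expected K-mer depth. It is a sketch under stated assumptions, not the pipeline used in this study: the histogram is assumed to be a mapping from depth to the number of distinct K-mers observed at that depth, and K-mers below a hypothetical min_depth cut-off are treated as sequencing errors and ignored. The function name and the example counts are invented for illustration.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a K-mer depth histogram.

    histogram : dict mapping depth -> number of distinct K-mers seen
                at that depth
    min_depth : depths below this are treated as error K-mers and
                excluded (an assumption of this sketch)

    Returns (estimated size, peak depth), with the size computed as
    total K-mer count divided by the peak depth.
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no K-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)           # mode of the distribution
    k_num = sum(d * c for d, c in solid.items())     # total K-mer occurrences
    return k_num / peak_depth, peak_depth


# Invented histogram: an error peak at low depth and a main peak at 20x.
example = {1: 5000, 2: 800, 18: 200, 19: 300, 20: 400, 21: 300, 22: 200}
size, peak = estimate_genome_size(example)
print(peak, size)
```

Using the mode of the distribution as the estimate of λ follows the text; other estimators of the expected depth are possible.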

2.3 Genome assembly and annotation

2.3.1 Genome assembly

The baboon and mandrill genomes were assembled from the filtered data with the short-read assembler SOAPdenovo2 [98]. SOAPdenovo was developed for short-read assembly based on a de Bruijn graph algorithm and has been widely applied in genome assembly.

Four major steps were conducted to complete the preliminary assembly:

i) Building the de Bruijn graph

All reads from the small insert size (<1000 bp) libraries were used to build the de Bruijn graph (DBG). The initial DBG was composed of 57-mers as nodes, with edges between nodes defined by read paths. To simplify the DBG, erroneous connections were removed and repeats were resolved in the following four steps.

a) Clipping the short tips

The short tips that were shorter than 114 bp (the length of 2-fold 57mer) in the DBG were

clipped.

b) Removing low-coverage links

c) Solving tiny repeats by read path

d) Merging the bubbles. The bubbles were generally caused by repeats or heterozygosity.

ii) Contig construction

On the simplified DBG, the connections were broken at repeat boundaries and the unambiguous sequence fragments were output as contigs.

iii) Scaffold construction

The reads were realigned onto the contigs, and the paired-end information was used to join the unique contigs into scaffolds.

iv) Gap closure

The intra-scaffold gaps were filled using the mapped reads. Most of the remaining gaps probably occur in repetitive regions. Paired-end reads with one end mapped on a unique contig and the other end located in a gap region were extracted for local assembly, so that the unmapped ends could be used to fill the gaps within the scaffolds. Gap filling was performed with GapCloser [98].

Genome assembly commands:

SOAPdenovo all -s config -K 49

GapCloser_v1.10_gz -a scaff.fa -b lib.cfg -o baboon_gapClosed.fill -t 16
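The first two assembly steps (building the de Bruijn graph and outputting unambiguous paths as contigs) can be illustrated with a toy Python sketch. It is not SOAPdenovo's implementation: it assumes error-free reads on a single strand, performs no tip clipping, low-coverage filtering or bubble merging, and ignores isolated cycles. As in the description above, nodes are K-mers and edges come from K-mers that are adjacent in a read. All function names and the example reads are hypothetical.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Build a de Bruijn graph whose nodes are K-mers.

    An edge joins two K-mers that occur consecutively in some read,
    i.e. that overlap by k - 1 bases along a read path.
    """
    nodes = set()
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)
            predecessors[right].add(left)
    return nodes, successors, predecessors


def unambiguous_contigs(nodes, successors, predecessors):
    """Spell out maximal non-branching paths of the graph as contigs."""

    def starts_path(node):
        preds = predecessors.get(node, set())
        if len(preds) != 1:
            return True  # no predecessor, or several paths converge here
        only_pred = next(iter(preds))
        return len(successors.get(only_pred, set())) != 1  # predecessor branches

    contigs = []
    for node in sorted(nodes):
        if not starts_path(node):
            continue
        contig, current = node, node
        while True:
            succs = successors.get(current, set())
            if len(succs) != 1:
                break  # dead end or branch: stop at the ambiguity
            nxt = next(iter(succs))
            if len(predecessors.get(nxt, set())) != 1:
                break  # several paths converge on the next node
            contig += nxt[-1]
            current = nxt
        contigs.append(contig)
    return sorted(contigs)


# Two hypothetical overlapping reads from the sequence ATGGCGTGCA.
graph = build_dbg(["ATGGCGT", "GGCGTGCA"], k=4)
print(unambiguous_contigs(*graph))
```

When two reads disagree after a shared prefix, the graph branches and the sketch emits the shared part and each branch as separate contigs, which mirrors how contigs end at repeat boundaries.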

2.3.2 Genome annotation

Repeat sequences can be classified into tandem repeats, including microsatellite and minisatellite sequences, and interspersed repeats, including DNA transposons and retrotransposons (LTRs, LINEs and SINEs). Repeat elements were first annotated using both homology searching and de novo prediction; similarly, genes were annotated by combining homology searching with prediction based on gene structure (Figure 2.2).

[Figure 2.2: flowchart. From the genome sequence, three branches: repeat annotation (homolog, de novo); gene annotation (de novo, homolog, cDNA/EST and RNA-seq data combined into a GLEAN set, giving the gene set, followed by function annotation with UniProt, KEGG and InterPro); and ncRNA annotation (miRNA/snRNA, rRNA, tRNA). Each branch ends in statistics results.]

Figure 2.2. Overall process of genome annotation. The genome annotation includes three major parts: repeat annotation, gene annotation and ncRNA annotation.

2.3.2.1 Repeat annotation

To predict transposable elements (TEs) in the genome, RepeatMasker [99] (version 4.0.5) and

Repeat-ProteinMask were used to scan the whole genome against the RepBase library [100]

(Version 20.04) for known repeats. RepeatMasker was then used again to identify de novo repeats


based on the custom TE library constructed by combining results of RepeatModeler [101] (Version

1.0.8) and LTR_FINDER [102] (Version 1.0.6). Tandem repeats were also predicted using Tandem
Repeats Finder [103] (Version 4.0.7). Finally, all repeat predictions were combined into the
final repeat annotation.

LTR parameter

LTR_FINDER.x86_64-1.0.5/ltr_finder -w 2 -s tRNAdb/dm3-tRNAs.fa

RepeatMasker parameter

RepeatMasker -nolow -no_is -norna -parallel 1 -lib RepBase16.10/RepeatMaskerLib.embl.lib

ProteinMask parameter

RepeatProteinMask -noLowSimple -pvalue 0.0001

2.3.2.2 RNA annotation

To identify transfer RNAs (tRNAs), tRNAscan [104] was used. For ribosomal RNA (rRNA)
identification, 757,441 rRNAs from the public domain were searched against the genome using
blastn with an e-value cut-off of 1e-5. To identify other RNA genes and non-coding RNAs (ncRNAs),
the Rfam database [105] was searched against the genome with the Rfam program rfam_scan.pl
(ftp://ftp.hgc.jp/pub/mirror/sanger/Rfam/tools/rfam_scan.pl).

rRNA parameter

blastall -p blastn -e 1e-5 -i Human_rRNA.fa

ncRNA parameter

rfam_scan.pl -d Rfam.fasta

2.3.2.3 Gene annotation


Genes were predicted using three categories of methods: homology-based, evidence-based and ab
initio prediction. For homology-based annotation, protein sequences of Macaca mulatta, Pan
troglodytes, Nomascus leucogenys, Pongo abelii, Gorilla gorilla and Homo sapiens were downloaded
from the Ensembl database (Release 73) and aligned to the genome using BLAT [106]. GeneWise [107]
(Version 2.2.0) was then used for more precise alignment and gene structure prediction. For
evidence-based prediction, EST sequences were downloaded from NCBI and aligned to the genome
using PASA [108], which performs spliced alignment and assembly to detect gene structures. For ab
initio prediction, AUGUSTUS [109] (Version 3.1) was used to predict gene models in the
repeat-masked genome. Finally, these gene predictions were combined using GLEAN [110] to obtain
the final non-redundant gene set.

Homology parameter

blat -q=prot -t=dnax

genewise -sum -genesf

ab initio prediction parameter

denovo-predict.pl --augustus human

GLEAN parameter

run.Glean.pl --YAML parameter.yaml --genome **.fa --maxintron 100000 --cds 150 --homolog

**.gff --EST **.pasa.gff

2.3.2.4 Gene function annotation

In order to provide possible gene function information, predicted genes were compared against

protein databases with protein function information. Blast2GO program [111] was used to assign

gene ontology (GO) terms and enzyme commission (EC) numbers. InterProScan [112], which

searches Pfam domains [112] and several other protein signature databases, was used to predict


protein domains. InterProScan results were finally passed to Blast2GO for further GO term
assignment.

Function parameter

run_iprscan51-55.pl --cpu 100 --cuts 100 --appl ProDom --appl ProSiteProfiles --appl SMART --

appl PANTHER --appl PRINTS --appl Pfam --appl PIRSF --appl ProSitePatterns **.pep

blast -b 100 -v 100 -p blastp -e 1e-5 -F F -d database (databases include KEGG, Swiss-Prot and TrEMBL)

2.3.2.5 Completeness of gene content with CEGMA and BUSCO

CEGMA [113] and BUSCO [114] were used to assess the completeness of the genome and the quality of
the gene predictions. Both programs search for universal, conserved single-copy genes that are
expected to be present in the genome, and thereby estimate the completeness of the genome and
gene annotation. Completeness of the gene sets was assessed with default settings for both
programs and, in the case of BUSCO, with vertebrate-specific reference profiles.

BUSCO parameter

BUSCO_v1.2.py -o run_glean -m OGS -l vertebrata database -in **.pep -c 16

2.4 Evolutionary analysis

2.4.1 Gene family cluster

Protein sequences of 11 species, namely Callithrix jacchus, Gorilla gorilla, Homo sapiens, Macaca
mulatta, Microcebus murinus, Nomascus leucogenys, Otolemur garnettii, Pan troglodytes, Pongo
abelii, Tarsius syrichta and Mus musculus, were used together with the predicted genes of the two
species for gene family clustering. Genes were filtered out if i) the coding sequence was shorter
than 90 bp, or ii) the first or last amino acid was marked as “X”, which indicates an ambiguous
amino acid caused by “N” in the gene sequence; iii) where multiple transcripts existed, only one
transcript was retained. TreeFam (http://www.treefam.org/) was then used to define gene families
in Mandrillus sphinx and Papio anubis. First, all-vs.-all blastp with an e-value cut-off of 1e-7
was conducted for the protein sequences of the 13 species. Second, the blast matches were joined
together by an in-house program. Third, genes with an aligned proportion of less than 0.33 were
removed and bit scores were converted to percent scores. Finally, hcluster_sg (Version 0.5.0,
https://pypi.python.org/pypi/hcluster) was used to cluster genes into gene families.
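The three filtering rules applied before clustering can be sketched as follows. This is a minimal illustration and not the in-house pipeline: the record layout (`Transcript`) and the function name are hypothetical, and keeping the transcript with the longest coding sequence is an assumption, since the text only states that one transcript per gene was retained.

```python
from typing import NamedTuple

class Transcript(NamedTuple):
    # Hypothetical record layout, introduced only for this sketch.
    gene_id: str
    transcript_id: str
    cds: str      # coding sequence (nucleotides)
    protein: str  # translated protein sequence

def filter_transcripts(records, min_cds_len=90):
    """Apply the three filters and keep one transcript per gene."""
    best = {}
    for rec in records:
        # i) drop coding sequences shorter than 90 bp
        if len(rec.cds) < min_cds_len:
            continue
        # ii) drop proteins whose first or last residue is the ambiguous "X"
        if not rec.protein or rec.protein[0] == "X" or rec.protein[-1] == "X":
            continue
        # iii) keep one transcript per gene (assumption: the longest CDS)
        current = best.get(rec.gene_id)
        if current is None or len(rec.cds) > len(current.cds):
            best[rec.gene_id] = rec
    return list(best.values())
```

Choosing the longest transcript is a common convention for picking a representative isoform, but any deterministic rule could be substituted at step iii.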

2.4.2 Phylogenetic analysis

With the gene family clusters defined, the fourfold degenerate (4D) sites of 5,133 single-copy
orthologs among the 13 species were extracted for phylogenetic tree construction. The PhyML
package [115] was used to build the phylogenetic tree with the maximum-likelihood method and
GTR+gamma as the nucleotide substitution model (1,000 rapid bootstrap replicates conducted).
Based on the phylogenetic tree, divergence times of these species were estimated using MCMCTree
(http://abacus.gene.ucl.ac.uk/software/paml.html) with default parameters. To calibrate the time
scale of the tree, six fossil dates collected from the TimeTree database
(http://www.timetree.org/) were used, including the divergence between Mus musculus and human at
85–93 million years ago (MYA) [116], between human and chimpanzee at 6 MYA (range, 5–7) [117] and
between human and gorilla at 9 MYA (range, 8–10) [118].

2.4.3 Positive selection analysis

The selection pressure on protein-coding genes in mandrill and baboon was measured by comparing
nonsynonymous (dN) and synonymous (dS) substitution rates. The dN/dS ratio (ω) equals 1 if the
whole coding sequence evolves neutrally. When dN/dS < 1, the sequence is under constraint
(purifying selection), and when dN/dS > 1, it is considered to be under positive selection. I
calculated the dN/dS ratio using models in the program package PAML version 3.14. From the gene
family clusters, I obtained the genes that are single-copy in every species. Subsequently, I used
neutral (M1 and M7) and selection (M2 and M8) models to identify codons under positive selection.
Models M1 and M7 assume different distributions of ω that do not allow values larger than 1,
whereas models M2 and M8 add a class of sites with ω larger than 1 (ω2), thereby distinguishing
positive selection (ω > 1) from purifying selection (ω < 1) and neutral evolution (ω = 1). The
fit of the model pairs M1-M2 and M7-M8 can be compared with a likelihood ratio test, in which
twice the difference in log-likelihood is compared against a χ2 distribution with 2 degrees of
freedom.
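The model comparison described above can be illustrated with a short calculation. The sketch below assumes the two log-likelihoods have already been obtained from the nested models (for example M1 versus M2); the function name is hypothetical. For 2 degrees of freedom the χ2 survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

def lrt_pvalue_df2(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test with 2 degrees of freedom.

    lnl_null: log-likelihood of the neutral model (e.g. M1 or M7)
    lnl_alt:  log-likelihood of the selection model (e.g. M2 or M8)
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0:
        # The alternative model fits no better than the null.
        return 1.0
    # Chi-square survival function for 2 d.f.: P(X > x) = exp(-x / 2)
    return math.exp(-stat / 2.0)
```

A gene would then be reported as a candidate for positive selection when this p-value falls below the chosen significance threshold, usually after correction for multiple testing.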

2.5 Comparative genomics

2.5.1 Synteny analysis of human, macaque, baboon and mandrill

For the comparative genomic analysis, syntenic blocks among primate species were first identified,

for further identification of genomic rearrangement events such as inversions, insertions and

deletions among these species. Proteins of human, macaque, baboon and mandrill were aligned
against each other using blastp (Version 2.2.26), and the blast results were then filtered using
the criteria of coverage greater than 85% and identity greater than 85%. Finally, the best match
of every gene was taken as the gene pair in synteny.
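The filtering and best-match step can be sketched as below. This is an illustration under assumptions: each hit is given as a tuple of query, subject, percent identity, percent coverage and bit score, and the best match is taken to be the surviving hit with the highest bit score. How coverage was computed and how ties were broken are not stated in the text.

```python
def best_filtered_hits(hits, min_identity=85.0, min_coverage=85.0):
    """Return {query: best subject} for hits passing both thresholds.

    hits: iterable of (query, subject, identity_pct, coverage_pct, bitscore)
    """
    best = {}
    for query, subject, identity, coverage, bitscore in hits:
        # Keep only hits strictly above both thresholds.
        if identity <= min_identity or coverage <= min_coverage:
            continue
        # Assumption: the best match is the hit with the highest bit score.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}
```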

2.5.2 Gene family contraction and expansion

With the gene family clustering result, gene family contraction and expansion can be detected to
reveal the dynamic evolutionary changes of gene families along the phylogenetic tree. Based on
the phylogenetic tree and divergence times, CAFE [119] was used for the gene family contraction
and expansion analysis. First, a global parameter λ was estimated by maximum likelihood under a
random birth and death model. Then a conditional p-value was calculated for each family, and
families with a p-value less than 0.05 were marked as significantly changed, meaning that they
underwent contraction or expansion in the course of evolution.

2.5.3 Segmental duplications

Segmental duplications are duplicated blocks of genomic DNA, typically ranging in size from 1 to
200 kb. They often contain high-copy repeats or intron-exon structure. The whole-genome shotgun
sequence detection (WSSD) method was used to identify segmental duplications [120]. Whether a
sequence is duplicated was determined according to its overrepresentation and average sequence
identity. After excluding TEs from the genome, clean reads were mapped to the genome using BWA
with parameters “-m 200000 -l 20 -k 2 -t 30”, and samtools was then used to obtain coverage and
depth.

2.6 Investigating molecular mechanisms of adaptation/phenotype

2.6.1 Immune character

The major histocompatibility complex (MHC) is a set of genes encoding cell-surface proteins that
help cells recognize foreign substances. It is central to the immune system and has been shown to
be associated with many diseases. The main function of MHC molecules is to bind peptides derived
from pathogens and present them on the cell surface to facilitate T-cell recognition and trigger
a series of immune functions. The MHC has been shown to be highly polymorphic in most primate
species, including macaque. The MHC class I region was therefore identified in the mandrill
genome by searching the human sequence against it with RepeatMasker (Version 4.0.5,


with parameter “-nolow -no_is -norna -engine ncbi”).

2.6.2 Language competence

Language is a special ability for communication within species, particularly in human. Several
genes have previously been found to be involved in language, and exploring the status of these
genes in animals can help to understand the origin of language. FOXP2 was the first gene found to
be related to human language development; a heterozygous missense mutation was thought to cause
an inherited language disorder, based on a case study of a family known as the KE family. FOXP2
is expressed in many tissues, including the basal ganglia and inferior frontal cortex [121],
where it is essential for brain maturation and for speech and language development. Here, protein
sequences of the FOXP2 genes of human, chimp and mouse were downloaded from NCBI. These FOXP2
protein sequences were mapped to the baboon and mandrill genomes using blat with the parameters
“-q=prot -t=dnax”. Blat hits were removed if i) they were not among the best five hits, ii) less
than 30% of the query protein was covered, or iii) the difference was greater than 20%. After the
blat alignment, GeneWise was used for fine mapping with default parameters.

2.6.3 Olfactory character

Olfaction, the sense of smell, is one of the important senses for animals. Chemical communication
is least well understood in Old World species, and the olfactory sense is underappreciated [122].
Protein sequences of human olfactory receptor (OR) genes from HGNC
(http://www.genenames.org/genefamilies/OR) were searched against the genomes of mandrill and
baboon to identify OR genes.


2.6.4 Predicting binding sites of transcription factors

Transcription factors (TFs) are key regulators that bind to specific DNA sequences to activate or
repress gene expression. Each TF has at least one DNA-binding domain (DBD), which is generally
conserved. Based on their DBDs, TFs can be classified into 70 families in the AnimalTFDB 2.0
database [123]. To identify TFs and explore their functions, protein sequences were searched with
BLAST against the TFs in the database. The 1,691 human protein sequences in AnimalTFDB 2.0 were
used as the BLAST database, with the thresholds e-value<=1e-5, coverage>=30% and identity>=20%.
In total, 68 TF families were predicted, comprising 3,438, 3,714 and 4,272 genes in Has, Msp and
Oba, respectively.

A sequence motif, such as a transcription factor binding site (TFBS), may correspond to the
active site of an enzyme or to a structural unit necessary for proper expression of genes.
Sequence motifs are thus one of the basic functional units of molecular evolution, and
identifying and understanding them is fundamental to building models of cellular processes at the
molecular scale and to understanding the mechanisms of human disease. In this study, we used the
MEME Suite, which comprises an integrated set of tools and databases, to perform motif-based
sequence analysis. We used built-in motifs with the DREME algorithm (e-value<=1e-10) to identify
human genomic sequences that may contain the discovered motifs, or to determine whether the
motifs are similar to previously studied motifs. In total, 19,024 TFBS were found in Has, and we
examined whether variations occurred within 50 bp of the binding sites.


3. Results

3.1 Landscapes of baboon and mandrill genomes

3.1.1 Sequencing data

For de novo genome assembly, 12 libraries were constructed and sequenced for each of the two
species, and the sequencing data are summarized in Table 3.1. In total, 512 Gb (~170× considering
a genome size of 3 Gb) and 328 Gb (~109× considering a genome size of 3 Gb) of raw paired-end
sequencing data were obtained.

Table 3.1 Statistics of baboon and mandrill raw sequencing data.

Species    Paired-end library   Average read   Raw data   Sequencing
           insert size (bp)     length (bp)    (Gb)       depth (×)
Baboon     250                  150            109.48     36.49
           500                  100            80.91      26.97
           800                  100            60.11      20.37
           4,000                90             68.24      22.75
           10,000               90             95.72      31.91
           Total                -              414.46     138.15
Mandrill   250                  150            113.296    37.77
           500                  100            83.054     27.68
           800                  100            65.328     21.78
           2,000                90             34.561     11.52
           5,000                90             32.967     10.99
           10,000               90             65.377     21.79
           20,000               90             32.141     10.71
           Total                -              426.724    142.2

After data filtering, 284 Gb and 289 Gb clean data were obtained (Table 8.1).


3.1.2 K-mer analysis

To assess the genome features, 17-mers (17 bp sub-sequences) were extracted and subjected to
K-mer analysis. Reads from the short-insert libraries were used for this analysis (baboon:
libraries with insert sizes of 250 bp, 500 bp and 800 bp, ~202 Gb of data in total; mandrill:
libraries with insert sizes of 250 bp, 500 bp and 800 bp, 212 Gb of data in total). In the
depth-frequency distribution (Figure 8.1), the peaks were at ~28× and ~31×, respectively.
Dividing the number of K-mers by the peak depth, the genome sizes of olive baboon and mandrill
were estimated to be 2.93 Gb and 2.90 Gb, respectively (Table 3.2).

Table 3.2 The information of 17-mer statistics.

Species    K    Number of K-mers   Depth peak   Genome size (bp)   Sequencing depth (×)
Baboon     17   82,117,298,803     28           2,932,760,671      33
Mandrill   17   89,967,169,490     31           2,902,166,757      37
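The genome sizes in Table 3.2 are consistent with dividing the total number of K-mers by the depth at the main peak. A minimal sketch of that arithmetic follows; the function name is hypothetical, and integer (floor) division is an assumption that happens to reproduce the tabulated values.

```python
def estimate_genome_size(kmer_count, depth_peak):
    """Estimate genome size (bp) as total K-mers divided by the peak K-mer depth."""
    if depth_peak <= 0:
        raise ValueError("depth_peak must be positive")
    return kmer_count // depth_peak

baboon_size = estimate_genome_size(82_117_298_803, 28)
mandrill_size = estimate_genome_size(89_967_169_490, 31)
print(baboon_size, mandrill_size)
```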

The distribution of K-mer frequencies of reads from a second-generation sequencing dataset can
also reflect the heterozygosity of the genome [124]. For a genome without heterozygosity or
repeats, sequenced without errors, the K-mer frequency distribution should follow a Poisson
distribution. In a real dataset, sequencing errors produce an excess of low-frequency K-mers.
Heterozygous regions yield two sets of K-mers at half of the major K-mer frequency, so the higher
the heterozygosity, the more pronounced the secondary peak at half the frequency. Repeat
sequences yield multiple copies of identical K-mers, so secondary peaks can be found at twice the
major K-mer frequency or at higher multiples. As indicated in Figure 8.1, for baboon there was no
obvious secondary peak at half of the major K-mer frequency (~28), indicating low heterozygosity
in the sequenced baboon individual. For mandrill, however, an obvious secondary peak was found at
a K-mer frequency of ~16, half of the major K-mer frequency (~31), so the mandrill individual
should have relatively high heterozygosity. For both genomes, there were also noticeable peaks at
twice the major K-mer frequency, indicating high repeat content. Thus, both sequenced genomes are
highly repetitive, and the mandrill genome is noticeably heterozygous.

3.1.3 Genome assembly

With the genome features estimated, genome assembly was conducted for both species. The final
baboon genome assembly was 3.12 Gb with ~80 Mb of gaps, similar to the genome size estimated in
the K-mer analysis (Table 3.3). The contig N50 was 21.7 kb and the longest contig was 238.9 kb,
indicating good continuity and suitability for gene annotation. The scaffold N50 was 1.1 Mb and
the longest scaffold was 8.8 Mb; the 2,308 longest scaffolds comprised more than 80% of the whole
genome. Similarly, for mandrill, the total assembled length was 2.88 Gb, with ~80 Mb of gaps. The
contig N50 was 20.5 kb and the longest contig was 211 kb. The scaffold N50 was 3.6 Mb and the
longest scaffold was 19.1 Mb; the 634 longest scaffolds comprised more than 80% of the whole
genome. Both genome assemblies are of good quality for downstream analysis, with good coverage
and continuity.

Table 3.3 Statistics of the genome assemblies.

Species    Statistic                  Contig*1                      Scaffold
                                      Size (bp)        Number       Size (bp)        Number
Baboon     N90*2                      2,315            171,662      52,973           4,209
           N80                        7,938            108,096      332,809          2,308
           N70                        12,413           77,767       559,903          1,593
           N60                        16,868           56,789       798,728          1,128
           N50                        21,659           40,868       1,070,645        792
           Longest                    238,945          ----         8,793,459        ----
           Total size                 3,044,016,568    ----         3,116,777,842    ----
           Total number (>=100 bp)    ----             1,831,592    ----             1,610,583
           Total number (>=2 kb)      ----             177,569      ----             11,097
Mandrill   N90                        5,266            141,475      638,217          936
           N80                        9,025            101,618      1,303,160        634
           N70                        12,638           75,505       1,962,294        457
           N60                        16,336           56,061       2,730,696        332
           N50                        20,483           40,751       3,564,730        241
           Longest                    211,017          ----         19,105,867       ----
           Total size                 2,798,997,503    ----         2,882,689,325    ----
           Total number (>=100 bp)    ----             455,069      ----             215,140
           Total number (>=2 kb)      ----             194,923      ----             4,742

*1. Contigs are the first assembled sequences without gaps, while scaffolds are the sequences generated by linking

contigs with gaps filled in.

*2. N90 means the length of the contig/scaffold for which all the contigs/scaffolds longer than it accumulate to 90%

of the total length. Similarly, N(P) in which P ranged from 50 to 90 in this table, indicates the length of

contig/scaffold for which all the contigs/scaffolds longer than it accumulated to P% of the total length.
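The N(P) definition in the footnote can be made concrete with a short sketch. It assumes only a list of contig or scaffold lengths; the function name is hypothetical. It returns both the N(P) length and the number of sequences needed to reach P% of the total, which correspond to the "Size" and "Number" columns of Table 3.3.

```python
def n_statistic(lengths, p=50):
    """Return (N(P) length, number of sequences) for a list of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches P% of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * p / 100.0
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must not be empty")
```

For example, with lengths 10, 8, 6, 4 and 2 (total 30), the two longest sequences already cover half of the total, so N50 is 8 with a count of 2.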

3.1.4 Annotation results

3.1.4.1 Repeat annotation

Repeats are widespread in the genome and may have important functions. Repeats were annotated and
categorized in both genomes (Table 8.2 - Table 8.5). For baboon, repeats made up ~50% of the
whole genome, with 47% being transposable elements (TEs). Compared with the repeat content in
human (Table 3.4), Long Interspersed Nuclear Elements (LINEs) were less abundant in the baboon
and mandrill genomes (~17%) than in the human genome (~21%), while Short Interspersed Nuclear
Elements (SINEs) were similar across these genomes (~12%). Alu elements in particular had very
similar proportions (10%~11%), reflecting that Alu elements are conserved within primate genomes,
as previously described [125].

Table 3.4 Repeat contents of baboon, mandrill, human and mouse.

Group               Percentage coverage of genome
                    Baboon    Mandrill   Human    Mouse
LINE                16.76     16.61      20.99    19.2
  L1                15.6      15.05      17.37    18.78
  L2                1.05      1.39       3.3      0.38
  LINE/other        0.11      0.17       0.32     0.04
SINE                11.26     12.10      13.64    8.22
  Alu               10.14     10.47      10.74    2.66
  MIR               0.92      1.37       2.9      0.57
  B4                0.14      0.20       --       2.36
  SINE/other        0.07      0.06       --       2.64
LTR                 7.88      8.36       8.55     9.87
  MaLRs             3.12      3.40       3.78     4.82
  Other ERVs        4.68      4.85       4.77     4.4
  LTR/other         0.08      0.11       --       0.65
DNA transposons     2.7       3.27       3.03     0.88
Other               3.70      0.06       0.53     0.74
Total               42.3      40.40      46.74    38.91

3.1.4.2 RNA annotation

Non-coding RNAs (ncRNAs) are RNA molecules that are not translated into protein. Four types of
ncRNAs were annotated in the baboon and mandrill genomes: transfer RNAs (tRNAs), ribosomal RNAs
(rRNAs), microRNAs (miRNAs) and small nuclear RNAs (snRNAs) (Table 8.6 and Table 8.7).

3.1.4.3 Gene annotation

After masking repeats, protein-coding genes were predicted in the genome using ESTs, homologous
proteins and ab initio prediction, yielding 23,867 (baboon) and 21,906 (mandrill) protein-coding
genes in the final sets (Table 3.5 and 3.6). In the mandrill genome, the average number of exons
per gene is slightly lower than in the baboon genome, while the average exon length is longer. In
addition, the average intron length is about 800 bp longer than in baboon (5,785 bp versus 4,972
bp).

Table 3.5 Summary of gene annotation in baboon genome.

Gene set    Source                Number    Average      Average   Average     Average   Average
                                            transcript   CDS       exons per   exon      intron
                                            length       length    gene        length    length
                                            (bp)         (bp)                  (bp)      (bp)
De novo     AUGUSTUS              22,528    45,907       1,371     8.10        169       6,272
Homolog     Nomascus leucogenys   21,278    36,106       1,467     8.31        176       4,741
            Pongo abelii          23,806    32,996       1,341     7.59        176       4,801
            Pan troglodytes       21,245    35,899       1,468     8.22        178       4,771
            Macaca mulatta        25,930    32,844       1,267     7.14        177       5,145
            Gorilla gorilla       24,402    30,755       1,377     7.65        179       4,415
            Homo sapiens          25,350    35,308       1,481     8.18        181       4,710
EST                               39,294    6,538        775       2.21        350       5,763
Final set                         23,867    37,246       1,459     8.20        178       4,972

Table 3.6 Summary of gene annotation in mandrill genome.

Gene set    Source                Number    Average      Average   Average     Average   Average
                                            transcript   CDS       exons per   exon      intron
                                            length       length    gene        length    length
                                            (bp)         (bp)                  (bp)      (bp)
De novo     AUGUSTUS              18,460    54,148       1,429     8.68        164.65    6,863
Homolog     Nomascus leucogenys   20,874    39,863       1,499     8.56        175.07    5,072
            Pongo abelii          23,330    37,371       1,373     7.82        175.53    52,757
            Pan troglodytes       20,866    40,317       1,502     8.46        177.62    5,204
            Macaca mulatta        25,460    38,089       1,294     7.36        175.96    5,787
            Gorilla gorilla       23,791    34,748       1,413     7.92        178.47    4,816
            Homo sapiens          25,161    39,338       1,513     8.42        179.77    5,098
EST                               38,021    7,365        781       2.33        335.00    4,935
Final set                         21,906    39,087       1,390     7.52        184.95    5,785

3.1.4.4 Gene evaluation and function annotation

To evaluate the quality of the annotated protein-coding genes, 3,023 BUSCO (Benchmarking
Universal Single-Copy Orthologs) groups were searched against the predicted gene sets; 97%
(baboon) and 98% (mandrill) of the groups were found as complete in the final gene sets (Table
3.7). In addition, 99.24% (baboon) and 98.70% (mandrill) of the predicted genes had a
corresponding biological function supported by at least one of the functional databases (Table
3.8).

Table 3.7 Assessment of gene sets using BUSCO.

Baboon Mandrill

Total BUSCO groups 3,023 3,023

Complete BUSCOs 2,936 2,981

Complete and single-copy BUSCOs 2,772 2,811

Complete and duplicated BUSCOs 164 170

Fragmented BUSCOs 63 28

Missing BUSCOs 24 14

Table 3.8 Function annotation of the final gene sets.

                              Baboon                  Mandrill
                              Gene number   %         Gene number   %
Total                         23,867        100.00    21,906        100.00
Annotated     InterPro        20,310        85.10     18,139        82.80
              GO              15,818        66.27     14,160        64.64
              KEGG            19,733        82.68     18,022        82.27
              Swissprot       22,547        94.47     20,547        93.80
              TrEMBL          23,661        99.14     21,529        98.28
              All databases   23,686        99.24     21,622        98.70
Unannotated                   181           0.74      284           1.30

3.2 Evolution of baboon and mandrill

3.2.1 Gene families

To analyze gene family evolution in baboon and mandrill, gene family clustering was conducted,
identifying 17,947 and 15,368 gene families, respectively, with 668 and 1,387 genes not clustered
(Table 3.9). Compared with human (Homo sapiens), macaque (Macaca mulatta) and chimpanzee (Pan
troglodytes), 489 and 342 gene families, containing 598 and 515 genes, were found to be unique to
the two species (Figure 3.1). These unique gene families were significantly enriched in the gene
ontology (GO) terms GO:0042773, ATP synthesis coupled electron transport (biological process, BP;
P=1.28e-12), and GO:0016651, oxidoreductase activity, acting on NADH or NADPH (molecular
function, MF; P=1.68e-09), for baboon (Table 8.8), and GO:0006412, translation (BP; P=6.29e-33),
and GO:0003735, structural constituent of ribosome (MF; P=6.29e-33), for mandrill (Table 8.9). In
addition, 5,133 single-copy orthologous genes were found to be shared among all 13 species
(Figure 3.2).

Table 3.9 Gene family clustering in the seven species.

Species               Gene      Genes in   Un-clustered   Family    Unique     Average genes
                      number    families   genes          number    families   per family
Callithrix jacchus    20,585    --         445            16,858    12         1.19
Gorilla gorilla       20,478    --         313            17,495    8          1.15
Homo sapiens          19,513    --         105            17,367    2          1.12
Macaca mulatta        20,627    --         912            16,391    38         1.2
Mandrillus sphinx     21,906    --         1,387          15,368    87         1.34
Microcebus murinus    17,853    --         310            15,414    9          1.14
Mus musculus          22,190    --         864            17,778    209        1.2


Note: Un-clustered genes refer to unique genes in the species; Unique families refer to unique gene families of the species.

Figure 3.1 Orthologous gene clusters in the five related species. The Venn diagram of unique

and shared gene families in the human, mandrill, gorilla, macaque and baboon genomes.


Figure 3.2 Comparison of orthologous genes among 13 primates and mouse.

3.2.2 Phylogenetic analysis

To analyze species evolution, a phylogenetic tree of baboon, mandrill and the other sequenced
animal genomes was constructed based on single-copy orthologous genes. The molecular clock
(neutral substitution rate per year) was estimated from the 4-fold degenerate sites of the
single-copy orthologous genes, and the divergence times were estimated accordingly. The
maximum-likelihood phylogenetic tree (Figure 3.3) indicates that baboon and mandrill are located
in the same clade as macaque and that they diverged from the human clade about 28.5 (27.5–30.4)
million years ago (MYA), whereas the divergence time between Cercopithecoidea and Hominoidea was
previously estimated to be 26.66 (24.29–28.95) MYA using mitochondrial genome sequences [126].
Baboon and mandrill were estimated to have split from macaque about 7.9 (6.9–9.2) MYA, which
differs from the previous estimate of 6.6 (6.0–8.0) MYA [127]. Baboon and mandrill split from
each other ~5.8 (5.0–6.8) MYA, reflecting their close evolutionary relationship.

Figure 3.3 Phylogenetic tree based on single copy gene families in the 13 species. The

calibration time marked as red dot is derived from previous publications [89-91].

The demographic history of a species reflects historical population changes and can be inferred
from the genome. We inferred a noticeable population bottleneck in the demographic history of
baboon and mandrill using the pairwise sequentially Markovian coalescent (PSMC) model (Figure
3.4). The two species went through similar population size changes between 100 and 10,000
thousand years (kyr) ago. Around 28 kyr ago there was a sharp increase, followed by a noticeable
bottleneck around 17 kyr ago, from peaks of 61,000 and 47,000 to ~6,500, in both the baboon and
mandrill populations. The increase in population size coincided with the increase of the human
population, probably indicating a climate change favorable to mammals, whereas the recent
bottleneck of the baboon and mandrill populations contrasts with the recent increase of the human
population.

Figure 3.4 The demographic change of baboon and mandrill. The population size change over

time was estimated by PSMC model. The x-axis indicates the time, from left to right to be from

recent to ancient, while the y-axis indicates the effective population size.

3.3 Synteny among primates

3.3.1 Synteny analysis of human, macaque, baboon and mandrill

By comparing the genomes of human, macaque, baboon and mandrill, syntenic regions can be identified, and with them historical genome rearrangement events such as inversions, insertions and deletions. These events may result in the loss, duplication or functional change of genes. In total, 9,930, 11,418 and 14,318 gene pairs were identified between baboon and mandrill, macaque and baboon, and human and macaque, respectively. Human retained 24 chromosomes (22+X+Y), while the baboon clade had only 22 chromosomes (20+X+Y) (Figure 3.5) after ~27.5–30.4 million years of evolution. I found that chromosomes 13 and 14 underwent a chromosome fusion in the baboon branch and that chromosomes 7 and 10 experienced chromosome breaks after the new clade formed. Moreover, several inversion events occurred relative to human, including paracentric inversions (on chromosomes 1, 6 and 9, among others) and a pericentric inversion (on chromosome 2). In more detail, I analyzed the genes located in five inversions longer than 37 Mb on chromosomes 1, 2, 3, 4 and 9 and found them to be significantly enriched in terms including GO:0006412: translation (GO level: BP, P=4.60E-03) and GO:0004950: chemokine receptor activity (GO level: MF, P=8.96E-06) (Table 8.10). We also found that the FOXP2 gene, which is vital for the formation of voice and language, is located on chromosome 3: 135,344,867–135,605,165 in mandrill (Figure 8.3).
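
The distinction between paracentric and pericentric inversions used above depends only on whether the inverted segment includes the centromere. A minimal sketch of that rule follows; the function name and the coordinates in the usage example are hypothetical and are not taken from the thesis data.

```python
def classify_inversion(start, end, centromere):
    """Classify an inversion by its position relative to the centromere.

    A pericentric inversion spans the centromere; a paracentric inversion
    lies entirely within one chromosome arm. Coordinates are in base pairs.
    """
    if start > end:
        start, end = end, start  # accept breakpoints in either order
    if start <= centromere <= end:
        return "pericentric"
    return "paracentric"


if __name__ == "__main__":
    # Hypothetical breakpoints and centromere position, for illustration only.
    print(classify_inversion(20_000_000, 60_000_000, centromere=93_000_000))
    print(classify_inversion(80_000_000, 120_000_000, centromere=93_000_000))
```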

Figure 3.5 Synteny relationship of human, macaque, baboon and mandrill.

3.3.2 Gene family contraction and expansion

In the baboon and mandrill lineage, there were 545 expanded and 618 contracted gene families (Figure 3.6). Expanded gene families were significantly enriched in the functions of biosynthetic process, structural constituent of ribosome, nucleosomal DNA binding, G-protein coupled receptor activity, olfactory receptor activity, glucose catabolic process and peptidyl-prolyl isomerization, as well as in the pathways of carbon fixation in photosynthetic organisms and the electron transport chain. In baboon and mandrill, the peptidylprolyl isomerase A (PPIA) family was significantly expanded (GO:0003755, P = 3.60E-89, Fisher’s exact test, 40 baboon genes and 53 mandrill genes). PPIA belongs to the peptidyl-prolyl cis-trans isomerase (PPIase) family, whose members catalyze cis-trans isomerization, assist the folding of newly synthesized proteins, interact with several transcription factors and regulate many biological processes, including inflammation and apoptosis, and even act in cerebral hypoxia-ischemia. Under stress conditions in the presence of reactive oxygen species (ROS), cells secrete PPIA to induce an inflammatory response and mitigate tissue injury. Baboons have been used in studies of embryo infections and various bacterial infections, in which infection was found to progress rapidly during the early innate immune response, which may be related to PPIA functions.

The peroxiredoxin-6 (PRDX6) family, which can reduce peroxides and protect against oxidative injury during metabolism, was also significantly expanded (GO:0051920, P = 0.000641, Fisher’s exact test, 4 baboon genes and 5 mandrill genes).
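
The enrichment P values quoted above come from Fisher’s exact test. As a minimal sketch of the idea, the one-sided test can be written as a hypergeometric tail probability. The function name and the counts in the usage example are hypothetical; the actual background sets and counts used in the thesis are not reproduced here.

```python
from math import comb


def fisher_enrichment(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N : number of genes in the background set
    K : number of background genes annotated with the term
    n : number of genes in the list of interest (e.g. expanded families)
    k : number of genes in that list annotated with the term
    Returns P(X >= k) for X following a hypergeometric distribution.
    """
    upper = min(n, K)
    # Sum the probabilities of all outcomes at least as extreme as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = fisher_enrichment(k=12, n=200, K=150, N=20000)
    print(f"P = {p:.3e}")
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow for very small P values until the final division.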

Figure 3.6 Gene family contraction and expansion for 12 primates and mouse.

3.3.3 Segmental duplications

Segmental duplications (SDs) are widespread in mammalian genomes and may be functionally important; SDs were therefore identified in seven species, including baboon and mandrill (Figure 3.7). Long segmental duplications were similar in baboon and mandrill and less abundant than in human.

Figure 3.7 Segmental duplications in seven primate species.

3.4 MHC comparison between human and baboon/mandrill

The major histocompatibility complex (MHC) contains a series of genes encoding cell-surface proteins that help cells recognize foreign substances; the MHC is therefore important for the immune system and has been found to be associated with many diseases. The proteins encoded by genes in the MHC region mainly bind peptides derived from pathogens and present them on the cell surface to facilitate T-cell recognition and the subsequent series of immune functions. The MHC region is highly polymorphic in most primate species that have been studied. A previous study of the macaque MHC region revealed the diversity of this region, but for Old World monkeys other than macaque the MHC regions remain largely unknown. In the assembled genomes of baboon and mandrill, a relatively complete assembly of the MHC region was found only in mandrill and not in baboon, because the high repeat content makes this region difficult to assemble. The MHC region of mandrill was found on chromosome 4. To make sure that the assembled mandrill MHC region was of high quality, reads were mapped back to the assembled genome; they showed good coverage and consistent paired-end/mate-pair relationships (Figure 8.3), supporting the assembly of the MHC region in mandrill. Since the MHC region is highly repetitive, a detailed repeat annotation was carried out for both the mandrill and the human MHC class I regions (from gene GABBR1 to gene MICB, in the direction from the telomere to the centromere) with the same parameters, revealing similar repeat content in the two species (48.27% in mandrill compared with 51.03% in human) (Table 8.11). In addition to the similar repeat content, the genes of the two species in this region were in good synteny (Figure 3.8). Only 54 insertions and deletions (indels) longer than 100 bp were found between the MHC class I regions of the two species, and most of them overlapped with repetitive elements such as SINEs, LINEs and LTRs, indicating the influence of repeat content on MHC diversity.
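
A repeat content such as the 48.27% and 51.03% quoted above is the fraction of a region covered by annotated repeats, counting overlapping annotations only once. The following minimal sketch illustrates that calculation; the function name, the half-open coordinate convention and the toy intervals are assumptions for illustration and do not reproduce the annotation pipeline of this thesis.

```python
def repeat_fraction(intervals, region_start, region_end):
    """Fraction of [region_start, region_end) covered by repeat intervals.

    intervals: iterable of (start, end) pairs in half-open coordinates.
    Overlapping or nested intervals are merged so no base is counted twice.
    """
    # Keep only intervals that overlap the region and clip them to it.
    clipped = sorted(
        (max(s, region_start), min(e, region_end))
        for s, e in intervals
        if e > region_start and s < region_end
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Start of a new, non-overlapping block: flush the previous one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlap with the current block: extend it.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (region_end - region_start)


if __name__ == "__main__":
    # Toy repeat annotations, for illustration only.
    toy = [(0, 10), (5, 20), (30, 40)]
    print(f"{100 * repeat_fraction(toy, 0, 100):.2f}% repeats")
```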

HLA genes are important for immune recognition, so they were further examined and compared with those of human. The human MHC class I region contains 50 genes in total, including 6 HLA genes, whereas only 4 HLA genes were identified in the mandrill MHC class I region. A search of the rest of the genome outside the MHC region identified another 4 HLA genes, bringing the total number of HLA genes in mandrill to 8. However, further inspection showed that 5 of the 8 mandrill HLA genes harbored start or stop codon changes, premature termination or frameshift mutations (Figure 3.9), which may reflect genetic mechanisms underlying the differences in immune response between mandrill and human.
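
The kinds of defect described above (start or stop codon changes, premature termination and frameshifts) can be screened for directly on a coding sequence. The sketch below is a simplified illustration under the assumption that a spliced coding sequence is available as a string; the function name, the defect labels and the toy sequences are hypothetical and this is not the procedure used to produce Figure 3.9.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_defects(cds):
    """Return a list of simple integrity problems found in a coding sequence."""
    cds = cds.upper()
    defects = []
    if len(cds) % 3 != 0:
        # A length that is not a multiple of three suggests a frameshift.
        defects.append("frameshift")
    if not cds.startswith("ATG"):
        defects.append("no_start")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        defects.append("no_stop")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        defects.append("premature_stop")
    return defects


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    for name, seq in [("intact", "ATGGCCTAA"), ("truncated", "ATGTAAGCCTAA")]:
        print(name, cds_defects(seq) or "no defects found")
```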

Only one MIC gene has been found in chimpanzee [128], compared with the two genes MICA and MICB in human, which resulted from a genomic duplication that occurred ~33-44 million years ago [129, 130]. MIC genes were therefore also identified in mandrill (Figure 3.10). Both MICA and MICB genes or gene fragments were found in mandrill, but the gene structure of MICA in mandrill was incomplete because the first exon was lost. Again, this may reflect genetic mechanisms underlying the differences between human and mandrill immune responses.

Figure 3.8 Synteny between human and mandrill MHC regions.

Figure 3.9 Alignment of HLA genes with amino acid sequence for human, baboon and

mandrill.

Figure 3.10 Structure of the MICA and MICB genes in human and mandrill. The red, orange, yellow and purple boxes represent exons, LINEs, SINEs and LTRs, respectively.

3.5 Language related genomic features

Language is a special ability for communication that is used particularly by humans. Several genes have been identified as being involved in language formation in humans. FOXP2 was the first gene found to be relevant to human language development: a heterozygous missense mutation in this gene was shown to cause an inherited language disorder in a case study of a family known as the KE family. FOXP2 is expressed in many tissues, including the basal ganglia and inferior frontal cortex [121], which are essential for brain maturation and speech/language development. Human FOXP2 differs from the chimpanzee protein by two human-specific amino acid substitutions, which affect the neural functions of FOXP2 and lead to differential transcriptional regulation in vivo; 111 genes were found to be significantly differentially expressed as a result of these two substitutions [131]. Similarly, in a ChIP-seq study using an antibody designed against a FOXP2 peptide, researchers found 175 target genes [132]. With the baboon and mandrill genomes available, FOXP2 gene evolution in primates was further investigated here to shed light on language-related genomic features.

The FOXP2 genes of baboon and mandrill were identified and compared with those of human, chimpanzee and mouse (Figure 3.11). In baboon, the FOXP2 gene (which aligns well to human ENSP00000386200) was found on scaffold1015 from 492,895 to 753,085 bp, and in mandrill it was found on scaffold103 from 4,983,301 to 5,243,599 bp. Both genes had 18 exons. In human, the FOXP2 gene has 22 different transcripts and contains many motifs, including the FOXP coiled-coil domain and the Fork head domain. The FOXP coiled-coil domain modulates the dimeric associations of FOXP transcription factors, and mutations in this domain may cause diseases such as IPEX (immunodysregulation polyendocrinopathy enteropathy X-linked) syndrome. The Fork head domain is found in several different transcription factors and is involved in a variety of biological processes, including early embryogenesis, organogenesis, tumorigenesis and signal transduction.

Comparing the human and baboon FOXP2 proteins (715 amino acids in full length), only two amino acid differences were found, both encoded in the seventh exon. Moreover, no mutation was found in the coiled-coil or Fork head domains of FOXP2, which is concordant with previous studies. The two amino acid substitutions may affect the functions of FOXP2.
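
Amino acid differences of the kind counted above can be read off a pairwise alignment in which, as in Figure 3.11, a dot marks a residue identical to the reference. The following minimal sketch assumes two aligned strings of equal length with `-` for gaps; the function name and the toy sequences are hypothetical and are not FOXP2 sequences.

```python
def substitutions(ref_aln, query_aln):
    """List amino acid substitutions between two aligned protein sequences.

    Returns (reference_position, reference_residue, query_residue) tuples.
    Positions are 1-based and counted on the ungapped reference.
    In the query, '.' means "same residue as the reference" and '-' is a gap.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    diffs = []
    pos = 0
    for r, q in zip(ref_aln, query_aln):
        if r != "-":
            pos += 1  # advance only on real reference residues
        if q == ".":
            q = r  # dot notation: identical to the reference
        if r != "-" and q != "-" and r != q:
            diffs.append((pos, r, q))
    return diffs


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    for pos, ref_aa, query_aa in substitutions("MKT-AQ", ".R.G.N"):
        print(f"{ref_aa}{pos}{query_aa}")
```

Insertions and deletions are deliberately ignored in this sketch, so only substitutions at positions present in both sequences are reported.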

Figure 3.11 Amino acid sequence alignment of the FOXP2 gene from human, chimpanzee, mouse, baboon and mandrill. Dots represent residues identical to the human sequence.

3.6 Olfactory receptor genes analysis

Olfaction, or the sense of smell, is one of the important senses of animals. However, chemical communication such as olfaction is not well understood in Old World species [122]. Most mammals possess two distinct sets of chemosensory neurons, located in the main olfactory epithelium (MOE) and in the vomeronasal organ (VNO), whereas Old World primates are generally considered to lack a functional VNO [133]. Previous studies indicated that olfactory communication plays a vital role in information acquisition during social foraging in both mandrill and baboon [134].

Compared with chimpanzee and macaque, almost all olfactory receptor (OR) gene families were substantially expanded in baboon and mandrill (Table 3.10), and several families, including Family 52, were also expanded compared with human (Figure 3.12). In particular, Family 7, subfamily E, member 24 ORs (OR7E24) are notably overrepresented in the mandrill and baboon genomes (6 copies in mandrill, distributed on chromosome 19 (Figure 8.4); 5 in baboon, distributed on chromosomes 14 and 19; 1 in human; 1 in macaque; and 0 in chimpanzee). Intriguingly, OR7E24 has been confirmed to be preferentially and specifically expressed in human testis cells and is thought to play an important role in the migratory phase of the germ cell life cycle [135]. These ORs may be functionally important during the life cycle of mandrill and baboon, and further research can be conducted to explore the mechanisms.

Table 3.10 Olfactory receptor gene copy number in five species.

Families human macaque mandrill baboon chimpanzee

Family 1 26 12 23 25 18

Family 2 67 47 94 95 49

Family 3 3 1 3 3 3

Family 4 56 35 68 62 33

Family 5 56 23 56 60 39

Family 6 31 24 38 36 23

Family 7 11 2 14 13 9

Family 8 23 11 22 23 15

Family 9 8 4 12 11 7

Family 10 37 23 46 42 23

Family 11 9 4 11 12 4

Family 12 3 1 3 3 0

Family 13 12 5 13 13 10

Family 14 1 0 1 1 1

Family 51 24 18 35 27 21

Family 52 26 16 48 39 22

Family 56 6 3 6 5 5
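
The comparison summarized in Table 3.10 can be made explicit with a few lines of code. The sketch below transcribes the copy numbers from the table and lists the families in which both mandrill and baboon exceed a chosen set of reference species. The names `OR_COUNTS`, `totals` and `families_expanded`, and the strict "greater than" criterion, are illustrative choices and not part of the original analysis.

```python
SPECIES = ("human", "macaque", "mandrill", "baboon", "chimpanzee")

# Copy numbers per OR family, transcribed from Table 3.10 (columns as in SPECIES).
OR_COUNTS = {
    1: (26, 12, 23, 25, 18),
    2: (67, 47, 94, 95, 49),
    3: (3, 1, 3, 3, 3),
    4: (56, 35, 68, 62, 33),
    5: (56, 23, 56, 60, 39),
    6: (31, 24, 38, 36, 23),
    7: (11, 2, 14, 13, 9),
    8: (23, 11, 22, 23, 15),
    9: (8, 4, 12, 11, 7),
    10: (37, 23, 46, 42, 23),
    11: (9, 4, 11, 12, 4),
    12: (3, 1, 3, 3, 0),
    13: (12, 5, 13, 13, 10),
    14: (1, 0, 1, 1, 1),
    51: (24, 18, 35, 27, 21),
    52: (26, 16, 48, 39, 22),
    56: (6, 3, 6, 5, 5),
}


def totals():
    """Total OR gene count per species across all listed families."""
    return {sp: sum(row[i] for row in OR_COUNTS.values())
            for i, sp in enumerate(SPECIES)}


def families_expanded(targets, references):
    """Families in which every target species exceeds every reference species."""
    t_idx = [SPECIES.index(s) for s in targets]
    r_idx = [SPECIES.index(s) for s in references]
    return [fam for fam, row in OR_COUNTS.items()
            if min(row[i] for i in t_idx) > max(row[i] for i in r_idx)]


if __name__ == "__main__":
    print(totals())
    print(families_expanded(("mandrill", "baboon"), ("macaque", "chimpanzee")))
    print(families_expanded(("mandrill", "baboon"), ("human",)))
```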

Figure 3.12 Expansion of the olfactory receptor gene family in baboon and mandrill. Red, blue, green and yellow indicate olfactory receptor genes in baboon, mandrill, human and chimpanzee, respectively.

3.7 Positively selected genes

In addition to gene family expansion and contraction, genes under selection during evolution are also functionally important. I therefore identified positively selected genes (PSGs) in order to reveal the evolution of baboon and mandrill and to depict possible functional changes in these species. The 5,133 single-copy orthologous genes shared among the 13 species, obtained from the gene family clustering, were used for detecting PSGs. In total, 657 PSGs were identified, with significant enrichment in molecular functions such as kinase activity, transferase activity and phosphotransferase activity (Table 8.12). On further investigation of the functions of these PSGs, 34 genes were found to be innate immune response genes by searching InnateDB. Interactions among these genes were predicted with STRING: functional protein association networks (http://string-db.org/cgi). As shown in Figure 3.13, the STAT1, IL5, IL1R1, ATG5, CREB1, DICER1 and PIK3R1 genes may have important roles in the immune system, and they are strongly associated with stress resistance and wound healing. Finally, GO and KEGG pathway enrichment analysis showed that the PSGs were enriched in the terms GO:0080134: regulation of response to stress (GO level: BP, P=1.59e-05), GO:0006955: immune response (GO level: BP, P=4.11e-05) and KEGG:4640: Hematopoietic cell lineage (P=6.83e-05).

Figure 3.13 Interactions among positively selected innate immunity genes in mandrill.

3.8 Disease related genomic features

To gain insight into disease-related mutations in baboon and mandrill, we collected the mutations in the HGMD database and checked the corresponding sites of these genes in baboon and mandrill. With this method, we found 17 genes in which the two species carry the amino acid change of a human disease mutation (Table 3.11). Moreover, some of the mutations are located in functional domains, where they would be expected to strongly affect the function of these genes (Table 3.12). In human, these mutations are associated with disease phenotypes such as lung cancer, altered cranial volume and atopic asthma.

Beyond this, we looked for disease-related genes with mutations unique to baboon and mandrill. To our surprise, we found only one gene (LRRK2) with a unique amino acid change in baboon and mandrill, at position 1210 (Figure 3.14). At this site, all other species have a tyrosine, whereas baboon and mandrill have a cysteine; this change reduces the hydrophobicity, which could affect the gene’s function.
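
A site such as LRRK2 position 1210, where baboon and mandrill share a residue that no other species carries, can be found by scanning the columns of a multiple sequence alignment. The following is a minimal sketch of such a scan; the function name and the toy alignment are hypothetical, and the reported positions are alignment columns rather than protein coordinates.

```python
def lineage_specific_sites(alignment, targets):
    """Find alignment columns where the target species share a unique residue.

    alignment: dict mapping species name to an aligned sequence (equal lengths).
    targets  : species that must share one residue absent from all others.
    Returns (column, residues_in_other_species, target_residue) tuples,
    with 1-based alignment columns.
    """
    targets = set(targets)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for col in range(lengths.pop()):
        target_res = {alignment[sp][col] for sp in targets}
        other_res = {seq[col] for sp, seq in alignment.items() if sp not in targets}
        # The targets must agree, must not be a gap, and must differ from all others.
        if len(target_res) == 1 and "-" not in target_res and not (target_res & other_res):
            sites.append((col + 1, "".join(sorted(other_res)), next(iter(target_res))))
    return sites


if __name__ == "__main__":
    # Toy alignment, for illustration only (not LRRK2 sequence).
    toy = {
        "human": "MKYLA",
        "chimpanzee": "MKYLA",
        "macaque": "MKYLA",
        "baboon": "MKCLA",
        "mandrill": "MKCLA",
    }
    print(lineage_specific_sites(toy, targets=("baboon", "mandrill")))
```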

Table 3.11 Disease-related genes and their mutations in baboon and mandrill.

Gene name Position Wild type AA Mutation AA Disease Description

ALAD 59 K N Amyotrophic lateral sclerosis

CIITA 500 G A Multiple sclerosis

CRB1 959 G S Retinitis pigmentosa

IL4R 75 I V Asthma, atopic

MCPH1 761 A V Cranial volume

NPHS2 192 I V Nephrotic syndrome

TP53BP1 353 D E Lung cancer

Table 3.12 Mutation effect of some disease related genes.

PROTEIN UNIPROT_ID REF ALT POS VAR SIFT Domain

ALAD P13716 K N 59 K59N 1 ALAD

CIITA P33076 G A 500 G500A 1 NACHT domain

CRB1 P82279 G S 959 G959S 0.66 PFAM NO/PROSITE(EGF-like 14)

IL4R J9JII2 I V 75 I75V 0.82 Interleukin-4 receptor

MCPH1 Q8NEM0 A V 761 A761V 0.52 BRCT domain

NPHS2 Q9NP85 I V 192 I192V 1 SPFH domain / Band 7 family

TP53BP1 Q12888 D E 353 D353E 1 not included

Figure 3.14 The unique Y1210C mutation of LRRK2 in baboon and mandrill.

4. Discussion

Primates are well-studied mammals because of their evolutionary importance and their close relationship to humans. In genomic research, many primate genomes are available and the genomic features of primates have already been studied comprehensively. Despite this progress in primate genomics, more genomic data for primate species are needed to improve our understanding of primates in evolutionary studies and applications.

Here, applying second-generation sequencing technologies, I established draft genomes for baboon and mandrill, which are valuable resources for primate and disease studies. The contig N50 of both genomes was longer than 20 kb and the scaffold N50 exceeded 1 Mb (3.56 Mb for mandrill), indicating good quality of the assembled genomes. To further improve the genome assemblies, long reads may be applied to fill the gaps in the assemblies and improve contig continuity, while genetic maps are necessary for anchoring the scaffolds onto chromosomes. However, the lack of genetic maps usually impedes the construction of chromosome-level genome assemblies of primates. With the further development of technologies such as Hi-C sequencing (formaldehyde cross-linking and sequencing), the assembled scaffolds may be anchored to chromosomes even without genetic maps.
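
The contig and scaffold N50 values used above as continuity measures follow a simple definition: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the calculation follows; the function name and the toy lengths are for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


if __name__ == "__main__":
    # Toy scaffold lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 weights sequences by length, a few long scaffolds can raise it substantially even when many short scaffolds remain in the assembly.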

Secondly, the genomic features of baboon and mandrill were comprehensively explored using the draft genomes. Their repeat content and gene content were similar to those of other primate species. According to the phylogenetic tree constructed from single-copy gene families, baboon and mandrill fall in the same clade, which diverged from the human clade about 28.5 million years ago (MYA), and the two species split from each other about 5.8 MYA. Evolutionary changes, including chromosome-level changes, gene family changes (expanded, contracted and species-specific gene families) and positively selected genes, were identified here and reflect genetic differences between these two primate species and others. For example, chromosome fusion events (fusion of human chromosomes 13 and 14) could be identified even with the scaffold-level genome assembly used here. With further improvement of the genome assembly, especially toward a chromosome-level assembly, the genomic changes can be investigated further to reveal evolutionary changes comprehensively.

Thirdly, since the baboon is often used as a model for human disease research and both species have some specific features, the genetic mechanisms underpinning immunity, language ability and olfaction were investigated. For immunity, the MHC region was analyzed specifically in the mandrill genome, because only the mandrill assembly was relatively complete in this region. Very good synteny was found between the mandrill and human MHC regions, with only 54 insertions and deletions longer than 100 bp. For homologs of human leukocyte antigen (HLA), I found fewer HLA genes in both baboon and mandrill than in human (8 genes in total, five of them harboring deleterious mutations). Unlike in chimpanzee, two MIC genes can be found in baboon and mandrill, although one of them has probably become a pseudogene. The similarity of the MHC region and the lack of some HLA genes are probably related to the success of cross-species transplantation cases. For baboon, an improved assembly of the MHC region would be valuable for future studies.

For language ability, I explored the FOXP2 genes of the two species and found two amino acid changes compared with the human FOXP2 gene; further validation should be carried out to clarify the influence of these mutations. Substantially expanded olfactory receptor (OR) gene families were found in baboon and mandrill compared with other species, indicating specific olfactory systems in these two species, which also await further study.

We also found mutations in genes that would cause disease in human but do not show a diseased phenotype in baboon and mandrill. For example, MCPH1 is a gene identified as being responsible for the neurodevelopmental disorder primary microcephaly type 1, which is characterized by a smaller-than-normal brain size and mental retardation (Liu, 2016). We found the same mutation in all primates except Homo sapiens (Figure 4.1). Compared with all other primates, Homo sapiens has the largest brain volume. We infer that this site is under positive selection in human and that it may have accelerated the development of intelligence during human evolution. Moreover, some disease mutations in baboon and mandrill also make them better medical models. CRISPR technology could be used to edit the baboon genome and observe the resulting phenotype, and new medicines could then be tested on these animals to select the best treatment, which may facilitate drug development.

Figure 4.1 The volume of cranial capacity in primates.

Finally, the assembly and analysis of the two draft genomes of baboon and mandrill also demonstrate the possibility of establishing genomes for more primate species. Primates are an order of mammals with ~16 families and ~500 species, all of which are highly evolved animals with special physiological and behavioral characteristics. Despite their evolutionary importance and relatively simple genome content, reference genomes have been established for only ~20 species. In addition, the existing genome assemblies differ considerably in quality and continuity, which makes further analysis and application more difficult. Establishing draft genomes for all primate species using second-generation sequencing would therefore be invaluable for evolutionary research, conservation/preservation, and human genetic/disease research and applications. A plan to sequence all primate species in the near future, using either second-generation sequencing technologies combined with 10X or Hi-C library construction methods or third-generation long-read sequencing, should be feasible. With the genome sequences available, repeat and gene annotation as well as comparative genomics among the primate species can then be conducted.

5. Conclusions

Firstly, draft genomes of baboon and mandrill were established in this study and can serve as reference datasets for future genome sequencing and comparative genomic studies. With more than 100× second-generation sequencing data from different sequencing libraries, whole genome shotgun (WGS) assemblies of both species were completed, with genome sizes of 3.12 Gb and 2.88 Gb, respectively. The genome assemblies reached high continuity, as reflected by contig N50 values of more than 20 kb and scaffold N50 values longer than 1 Mb. The longest scaffold was longer than 8.8 Mb in baboon and 19.1 Mb in mandrill. About 40% of the genome was annotated as repeat sequence, and 23,867 and 21,906 protein-coding genes were annotated, respectively. BUSCO assessment indicated high quality of both the genome assemblies and the gene annotations, with high coverage (98% and 99%) of the conserved genes.

Secondly, with the draft genome sequences available, basic genomic features were investigated and compared with those of related species; the repeat content, protein-coding gene numbers and gene families of baboon and mandrill were similar to those of other primates. Only 489/342 gene families with 598/515 genes were found to be specific to baboon/mandrill, and fewer segmental duplications (SDs) were found in baboon and mandrill than in human.

Thirdly, the evolution of the two species was comprehensively analyzed to identify demographic changes, chromosome-level changes, gene family expansion and contraction, and positively selected genes. Baboon and mandrill were located in the same clade as macaque and diverged from the human clade about 28.5 (27.5–30.4) million years ago (MYA), while the divergence time between Cercopithecoidea and Hominoidea was estimated to be 26.66 (24.29–28.95) MYA. Baboon and mandrill split from each other ~5.8 (5.0–6.8) MYA. For both the baboon and the mandrill, the demographic history showed a sharp population increase ~28 thousand years ago followed by a noticeable bottleneck. Synteny among baboon, mandrill and human was established to identify chromosomal rearrangements (fusion of chromosomes 13 and 14, and breaks of chromosomes 7 and 10). For gene family evolution, the baboon and mandrill lineage had 545 expanded and 618 contracted gene families; the expanded families include functionally important ones such as PPIA, which can induce an inflammatory response and mitigate tissue injury, and the PRDX6 family, which can reduce peroxides and protect against oxidative injury during metabolism. In total, 657 positively selected genes were identified for the baboon and mandrill lineage, and some of them were also related to immune responses.

Finally, the genetic mechanisms underlying the immune system, language and olfaction were investigated, revealing highly consistent MHC regions with fewer HLA genes, two amino acid mutations in the FOXP2 genes, and notably expanded olfactory receptor gene families in baboon and mandrill. Good synteny was found between the mandrill and human MHC regions, with only 54 insertions and deletions (indels) longer than 100 bp in the MHC class I region. Fewer HLA genes were found in baboon and mandrill than in human (8 in total, 5 of which have become pseudogenes).

6. Future perspectives

i) Further improving the genome assemblies. In particular, by applying third-generation sequencing and Hi-C sequencing, chromosome-level genome assemblies with fewer gaps can be achieved. For highly repetitive regions, including the MHC regions, better assemblies would benefit future functional and comparative genomic studies.

ii) Constructing a genome database for these species. A database can be established in order to share the genome data effectively.

iii) Further functional/molecular studies of some genomic features. Genomic features including specific gene families, mutations in functionally important genes (for example, the FOXP2 gene) and expanded gene families (for example, olfactory receptor genes) were identified in this study, but further validation through functional studies is required to elucidate the mechanisms related to these functions.

iv) Large-scale genome sequencing of primates. With the experience obtained in this study, large-scale genome sequencing aimed at establishing draft genomes for all primate species can be considered.

7. References

1. Wilson DE and Reeder DM. Mammal species of the world: a taxonomic and geographic

reference. JHU Press; 2005.

2. Jolly C. Introduction to the Cercopithecoidea, with notes on their use as laboratory animals.

In: Symp Zool Soc Lond 1966, pp.427-57.

3. Fleagle JG and McGraw WS. Skeletal and dental morphology supports diphyletic origin of

baboons and mandrills. Proceedings of the National Academy of Sciences. 1999;96 3:1157-

61.

4. Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MA, et al. A

molecular phylogeny of living primates. PLoS genetics. 2011;7 3:e1001342.

5. Liedigk R, Roos C, Brameier M and Zinner D. Mitogenomics of the Old World monkey

tribe Papionini. BMC evolutionary biology. 2014;14 1:176.

6. Sigg H, Stolba A, Abegglen J-J and Dasser V. Life history of hamadryas baboons: physical

development, infant mortality, reproductive parameters and family relationships. Primates.

1982;23 4:473-87.

7. Groves CP. Primate taxonomy. 2001.

8. Kingdon J. The Kingdon field guide to African mammals. Bloomsbury Publishing; 2015.

9. Zinner D, Groeneveld LF, Keller C and Roos C. Mitochondrial phylogeography of baboons

(Papio spp.)–Indication for introgressive hybridization? BMC evolutionary biology. 2009;9

1:83.

10. Gesquiere LR, Learn NH, Simao MCM, Onyango PO, Alberts SC and Altmann J. Life at

the top: rank and stress in wild male baboons. Science. 2011;333 6040:357-60.

11. Rogers J and Hixson JE. Baboons as an animal model for genetic studies of common human

disease. The American Journal of Human Genetics. 1997;61 3:489-93.

12. Chai D, Cuneo S, Falconer H, Mwenda J and D'Hooghe T. Olive baboon (Papio anubis

anubis) as a model for intrauterine research. Journal of medical primatology. 2007;36 6:365-

9.

13. Cox LA, Comuzzie AG, Havill LM, Karere GM, Spradling KD, Mahaney MC, et al.

Baboons as a model to study genetics and epigenetics of human disease. ILAR journal.

2013;54 2:106-21.

14. Locher CP, Witt SA, Herndier BG, Tenner‐Racz K, Racz P and Levy JA. Baboons as an

animal model for human immunodeficiency virus pathogenesis and vaccine development.

Immunological reviews. 2001;183 1:127-40.

15. Starzl TE, Fung J, Tzakis A, Todo S, Demetris A, Marino I, et al. Baboon-to-human liver

transplantation. The lancet. 1993;341 8837:65-71.

16. Taylor Jr F, Chang A, Esmon C, D'Angelo A, Vigano-D'Angelo S and Blick K. Protein C

prevents the coagulopathic and lethal effects of Escherichia coli infusion in the baboon.

Journal of Clinical Investigation. 1987;79 3:918.

17. Hanson SR, Powell JS, Dodson T, Lumsden A, Kelly AB, Anderson JS, et al. Effects of

angiotensin converting enzyme inhibition with cilazapril on intimal hyperplasia in injured

arteries and vascular grafts in the baboon. Hypertension. 1991;18 4 Suppl:II70.

18. Ryabchikova EI, Kolesnikova LV and Luchko SV. An analysis of features of pathogenesis

in two animal models of Ebola virus infection. The Journal of infectious diseases. 1999;179

Supplement_1:S199-S202.

19. Huneke RB, Michaels MG, Kaufman CL and Ildstad ST. Antibody response in baboons

(Papio cynocephalus anubis) to a commercially available encephalomyocarditis virus

vaccine. Comparative Medicine. 1998;48 5:526-8.

20. VanCott TC, Mascola JR, Loomis-Price LD, Sinangil F, Zitomersky N, McNeil J, et al.

Cross-subtype neutralizing antibodies induced in baboons by a subtype E gp120 immunogen

based on an R5 primary human immunodeficiency virus type 1 envelope. Journal of

virology. 1999;73 6:4640-50.

21. Drewe JA, O’Riain MJ, Beamish E, Currie H and Parsons S. Survey of infections

transmissible between baboons and humans, Cape Town, South Africa. Emerging infectious

diseases. 2012;18 2:298.

22. Stearns-Kurosawa DJ, Lupu F, Taylor FB, Kinasewitz G and Kurosawa S. Sepsis and

pathophysiology of anthrax in a nonhuman primate model. The American journal of

pathology. 2006;169 2:433-44.

23. Khlebnikov V, Golovlev I, Zhemchugov V, Chugunov A, Averin S, Afanas'ev S, et al. The

immunological efficacy of Francisella tularensis outer membranes for hamadryas baboons.

Zhurnal mikrobiologii, epidemiologii, i immunobiologii. 1993; 3:61-4.

24. Githure JI, Reid GD, Binhazim AA, Anjili CO, Shatry AM and Hendricks LD. Leishmania

major: the suitability of East African nonhuman primates as animal models for cutaneous

leishmaniasis. Experimental parasitology. 1987;64 3:438-47.

25. Yole D, Pemberton R, Reid G and Wilson R. Protective immunity to Schistosoma mansoni

induced in the olive baboon Papio anubis by the irradiated cercaria vaccine. Parasitology.

1996;112 1:37-46.

26. Nyindo M and Farah I. The baboon as a non-human primate model of human schistosome

infection. Parasitology Today. 1999;15 12:478-82.

27. Mafuyai H, Barshep Y, Audu B, Kumbak D and Ojobe T. Baboons as potential reservoirs of

zoonotic gastrointestinal parasite infections at Yankari National Park, Nigeria. African

health sciences. 2013;13 2:252-4.

28. Prescott M. Primate sensory capabilities and communication signals: implications for care

and use in the laboratory. National Centre for the Replacement, Refinement and Reduction

of Animals in Research; 2006.

29. Boë L-J, Berthommier F, Legou T, Captier G, Kemp C, Sawallis TR, et al. Evidence of a

Vocalic Proto-System in the Baboon (Papio papio) Suggests Pre-Hominin Speech

Precursors. PloS one. 2017;12 1:e0169321.

30. Nishimura T, Mikami A, Suzuki J and Matsuzawa T. Descent of the hyoid in chimpanzees:

evolution of face flattening and speech. Journal of Human Evolution. 2006;51 3:244-54.

31. Kuhl PK and Meltzoff AN. Infant vocalizations in response to speech: Vocal imitation and

developmental change. The journal of the Acoustical Society of America. 1996;100 4:2425-

38.

32. Boë L-J, Badin P, Ménard L, Captier G, Davis B, MacNeilage P, et al. Anatomy and control

of the developing human vocal tract: A response to Lieberman. Journal of Phonetics.

2013;41 5:379-92.

33. Nowak RM. Walker's mammals of the world. JHU Press; 1999.

34. Harrison MJ. The mandrill in Gabon's rain forest—ecology, distribution and status. Oryx.

1988;22 4:218-28.

35. Hoshino J. Feeding ecology of mandrills (Mandrillus sphinx) in Campo animal reserve,

Cameroon. Primates. 1985;26 3:248-73.

36. Leigh SR, Setchell JM, Charpentier M, Knapp LA and Wickings EJ. Canine tooth size and

fitness in male mandrills (Mandrillus sphinx). Journal of Human Evolution. 2008;55 1:75-85.

37. Setchell JM and Dixson AF. Changes in the secondary sexual adornments of male mandrills

(Mandrillus sphinx) are associated with gain and loss of alpha status. Hormones and

Behavior. 2001;39 3:177-84.

38. Setchell JM and Dixson AF. Developmental variables and dominance rank in adolescent

male mandrills (Mandrillus sphinx). American journal of primatology. 2002;56 1:9-25.

39. Setchell JM, Vaglio S, Abbott KM, Moggi-Cecchi J, Boscaro F, Pieraccini G, et al. Odour

signals major histocompatibility complex genotype in an Old World monkey. Proceedings

of the Royal Society of London B: Biological Sciences. 2010:rspb20100571.

40. Setchell JM, Richards SA, Abbott KM and Knapp LA. Mate-guarding by male mandrills

(Mandrillus sphinx) is associated with female MHC genotype. Behavioral Ecology.

2016:arw106.

41. Feistner AT. Scent marking in mandrills, Mandrillus sphinx. Folia Primatologica. 1991;57

1:42-7.

42. Pandrea I, Apetrei C, Dufour J, Dillon N, Barbercheck J, Metzger M, et al. Simian

immunodeficiency virus SIVagm.sab infection of Caribbean African green monkeys: a new

model for the study of SIV pathogenesis in natural hosts. Journal of virology. 2006;80

10:4858-67.

43. Roussel M, Pontier D, Ngoubangoye B, Kazanji M, Verrier D and Fouchet D. Modes of

transmission of Simian T-lymphotropic Virus Type 1 in semi-captive mandrills (Mandrillus

sphinx). Veterinary microbiology. 2015;179 3:155-61.

44. Nerrienet E, Amouretti X, Müller-Trutwin M, Poaty-Mavoungou V, Bedjebaga I, Nguyen

HT, et al. Phylogenetic analysis of SIV and STLV type I in mandrills (Mandrillus sphinx):

indications that intracolony transmissions are predominantly the result of male-to-male

aggressive contacts. AIDS research and human retroviruses. 1998;14 9:785-96.

45. Zwick LS, Walsh TF, Barbiers R, Collins MT, Kinsel MJ and Murnane RD.

Paratuberculosis in a mandrill (Papio sphinx). Journal of Veterinary Diagnostic

Investigation. 2002;14 4:326-8.

46. O'Rourke J, Dixon M, Jack A, Enno A and Lee A. Gastric B‐cell mucosa‐associated

lymphoid tissue (MALT) lymphoma in an animal model of ‘Helicobacter

heilmannii’ infection. The Journal of pathology. 2004;203 4:896-903.

47. Setchell JM, Bedjabaga I-B, Goossens B, Reed P, Wickings EJ and Knapp LA. Parasite

prevalence, abundance, and diversity in a semi-free-ranging colony of Mandrillus sphinx.

International Journal of Primatology. 2007;28 6:1345-62.

48. Ungeheuer M, Elissa N, Morelli A, Georges A, Deloron P, Debre P, et al. Cellular responses

to Loa loa experimental infection in mandrills (Mandrillus sphinx) vaccinated with

irradiated infective larvae. Parasite immunology. 2000;22 4:173-84.

49. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human

genome. Nature. 2001;409:860. doi:10.1038/35057062.

50. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature.

2005;437 7055:69-87. doi:10.1038/nature04072.

51. Rogers J and Gibbs RA. Comparative primate genomics: emerging patterns of genome

content and dynamics. Nat Rev Genet. 2014;15 5:347-59. doi:10.1038/nrg3707.

52. Prüfer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, et al. The bonobo genome

compared with the chimpanzee and human genomes. Nature. 2012;486 7404:527-31.

doi:10.1038/nature11128.

53. Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. A high-resolution

map of human evolutionary constraint using 29 mammals. Nature. 2011;478:476.

doi:10.1038/nature10530.

54. Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, et al.

Comparative and demographic analysis of orang-utan genomes. Nature. 2011;469 7331:529-33.

doi:10.1038/nature09687.

55. Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, et al. A high-coverage

genome sequence from an archaic Denisovan individual. Science. 2012;338 6104:222-6.

doi:10.1126/science.1224344.

56. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome

and comparison with the human genome. Nature. 2005;437 7055:69-87.

doi:10.1038/nature04072.

57. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, et al.

Evolutionary and Biomedical Insights from the Rhesus Macaque Genome. Science.

2007;316 5822:222-34. doi:10.1126/science.1139247.

58. Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, et al. Insights into

hominid evolution from the gorilla genome sequence. Nature. 2012;483 7388:169-75.

doi:10.1038/nature10842.

59. Gordon D, Huddleston J, Chaisson MJP, Hill CM, Kronenberg ZN, Munson KM, et al.

Long-read sequence assembly of the gorilla genome. Science. 2016;352 6281:aae0344.

doi:10.1126/science.aae0344.

60. Johnson ME, Viggiano L, Bailey JA, Abdul-Rauf M, Goodwin G, Rocchi M, et al. Positive

selection of a gene family during the emergence of humans and African apes. Nature.

2001;413 6855:514-9. doi:10.1038/35097067.

61. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, et al.

Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316

5822:222-34.

62. Harris RA, Tardif SD, Vinar T, Wildman DE, Rutherford JN, Rogers J, et al. Evolutionary

genetics and implications of small size and twinning in callitrichine primates. Proceedings

of the National Academy of Sciences of the United States of America. 2014;111 4:1467-72.

doi:10.1073/pnas.1316037111.

63. Mailund T, Halager AE, Westergaard M, Dutheil JY, Munch K, Andersen LN, et al. A New

Isolation with Migration Model along Complete Genomes Infers Very Different Divergence

Processes among Closely Related Great Ape Species. PLoS Genetics. 2012;8 12:e1003125.

doi:10.1371/journal.pgen.1003125.

64. Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO and Shendure J. Chromosome-

scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat

Biotechnol. 2013;31 12:1119-25. doi:10.1038/nbt.2727.

65. Kaplan N and Dekker J. High-throughput genome scaffolding from in vivo DNA interaction

frequency. Nat Biotechnol. 2013;31 12:1143-7. doi:10.1038/nbt.2768.

66. Chaisson MJP, Wilson RK and Eichler EE. Genetic variation and the de novo assembly of

human genomes. Nat Rev Genet. 2015;16 11:627-40. doi:10.1038/nrg3933.

67. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al.

Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the

Human Genome. Science. 2009;326 5950:289-93. doi:10.1126/science.1181369.

68. Pan_troglodytes-2.1.4 assembly. National Center for Biotechnology Information [online],

2011.

69. Prüfer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, et al. The bonobo genome

compared with the chimpanzee and human genomes. Nature. 2012;486 7404:527-31.

doi:10.1038/nature11128.

70. Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, Pandey S, et al. A new rhesus

macaque assembly and annotation for next-generation sequencing analyses. Biol Direct.

2014;9 1:20. doi:10.1186/1745-6150-9-20.

71. Yan G, Zhang G, Fang X, Zhang Y, Li C, Ling F, et al. Genome sequencing and comparison

of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques.

Nat Biotechnol. 2011;29 11:1019-23. doi:10.1038/nbt.1992.

72. Perry GH, Reeves D, Melsted P, Ratan A, Miller W, Michelini K, et al. A Genome

Sequence Resource for the Aye-Aye (Daubentonia madagascariensis), a Nocturnal Lemur

from Madagascar. Genome Biol Evol. 2012;4 2:126-35. doi:10.1093/gbe/evr132.

73. Warren WC, Jasinska AJ, Garcia-Perez R, Svardal H, Tomlinson C, Rocchi M, et al. The

genome of the vervet (Chlorocebus aethiops sabaeus). Genome Res. 2015;

doi:10.1101/gr.192922.115.

74. Carbone L, Alan Harris R, Gnerre S, Veeramah KR, Lorente-Galdos B, Huddleston J, et al.

Gibbon genome and the fast karyotype evolution of small apes. Nature. 2014;513 7517:195-201.

doi:10.1038/nature13679.

75. The Marmoset Genome Sequencing and Analysis Consortium. The common marmoset genome provides insight

into primate biology and evolution. Nat Genet. 2014;46 8:850-7. doi:10.1038/ng.3042.

76. Schmitz J, Noll A, Raabe CA, Churakov G, Voss R, Kiefmann M, et al. Genome sequence

of the basal haplorrhine primate Tarsius syrichta reveals unusual insertions. Nature

Communications. 2016;7:12997. doi:10.1038/ncomms12997.

77. Silva JC and Kondrashov AS. Patterns in spontaneous mutation revealed by human–baboon

sequence comparison. Trends in Genetics. 2002;18 11:544-7.

78. VandeBerg JL, Williams-Blangero S and Tardif SD. The baboon in biomedical research.

New York: Springer; 2009.

79. Cox LA, Mahaney MC, VandeBerg JL and Rogers J. A second-generation genetic linkage

map of the baboon (Papio hamadryas) genome. Genomics. 2006;88 3:274-81.

doi:10.1016/j.ygeno.2006.03.020.

80. Rogers J, Mahaney MC, Witte SM, Nair S, Newman D, Wedel S, et al. A genetic linkage

map of the baboon (Papio hamadryas) genome based on human microsatellite

polymorphisms. Genomics. 2000;67 3:237-47.

81. Voruganti VS, Tejero ME, Proffitt JM, Cole SA, Freeland-Graves JH and Comuzzie AG.

Genome-wide Scan of Plasma Cholecystokinin in Baboons Shows Linkage to Human

Chromosome 17. Obesity. 2007;15 8:2043-50. doi:10.1038/oby.2007.243.

82. Tejero ME, Voruganti VS, Proffitt JM, Curran JE, Goring HH, Johnson MP, et al. Cross-

species replication of a resistin mRNA QTL, but not QTLs for circulating levels of resistin,

in human and baboon. Heredity. 2008;101 1:60-6. doi:10.1038/hdy.2008.28.

83. Tejero ME, Cole SA, Cai G, Peebles KW, Freeland-Graves JH, Cox LA, et al. Genome-

wide scan of resistin mRNA expression in omental adipose tissue of baboons. International

Journal of Obesity. 2004;29:406. doi:10.1038/sj.ijo.0802699.

84. Pandrea I, Onanga R, Souquiere S, Mouinga-Ondéme A, Bourry O, Makuwa M, et al.

Paucity of CD4(+) CCR5(+) T Cells May Prevent Transmission of Simian

Immunodeficiency Virus in Natural Nonhuman Primate Hosts by Breast-Feeding. Journal of

Virology. 2008;82 11:5501-9. doi:10.1128/JVI.02555-07.

85. Charpentier M, Setchell J, Prugnolle F, Knapp L, Wickings E, Peignot P, et al. Genetic

diversity and reproductive success in mandrills (Mandrillus sphinx). Proceedings of the

National Academy of Sciences of the United States of America. 2005;102 46:16723-8.

86. Gokcumen O, Tischler V, Tica J, Zhu Q, Iskow RC, Lee E, et al. Primate genome

architecture influences structural variation mechanisms and functional consequences.

Proceedings of the National Academy of Sciences. 2013;110 39:15764.

87. Cordaux R and Batzer MA. The impact of retrotransposons on human genome evolution.

Nat Rev Genet. 2009;10 10:691-703. doi:10.1038/nrg2640.

88. Marques-Bonet T, Ryder OA and Eichler EE. Sequencing primate genomes: what have we

learned? Annual review of genomics and human genetics. 2009;10:355-86.

doi:10.1146/annurev.genom.9.081307.164420.

89. She X, Horvath JE, Jiang Z, Liu G, Furey TS, Christ L, et al. The structure and evolution of

centromeric transition regions within the human genome. Nature. 2004;430:857.

doi:10.1038/nature02806.

90. Bailey JA and Eichler EE. Primate segmental duplications: crucibles of evolution, diversity

and disease. Nat Rev Genet. 2006;7 7:552-64. doi:10.1038/nrg1895.

91. Eichler EE, Budarf ML, Rocchi M, Deaven LL, Doggett NA, Baldini A, et al.

Interchromosomal duplications of the adrenoleukodystrophy locus: a phenomenon of

pericentromeric plasticity. Human molecular genetics. 1997;6 7:991-1002.

92. Riethman HC, Xiang Z, Paul S, Morse E, Hu XL, Flint J, et al. Integration of telomere

sequences with the draft human genome sequence. Nature. 2001;409 6822:948-51.

93. Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM and Trask BJ. Human

subtelomeres are hot spots of interchromosomal recombination and segmental duplication.

Nature. 2005;437 7055:94-100.

94. Antonell A, de Luis O, Domingo-Roura X and Pérez-Jurado LA. Evolutionary mechanisms shaping the

genomic structure of the Williams-Beuren syndrome chromosomal region at human 7q11.23.

Genome Research. 2005;15 9:1179.

95. Li R, Fan W, Tian G, Zhu H, He L, Cai J, et al. The sequence and de novo assembly of the

giant panda genome. Nature. 2010;463 7279:311.

96. Minoche AE, Dohm JC and Himmelbauer H. Evaluation of genomic high-throughput

sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome

biology. 2011;12 11:R112.

97. Magoč T and Salzberg SL. FLASH: fast length adjustment of short reads to improve

genome assemblies. Bioinformatics. 2011;27 21:2957-63.

98. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically

improved memory-efficient short-read de novo assembler. Gigascience. 2012;1 1:18.

99. Tarailo‐Graovac M and Chen N. Using RepeatMasker to identify repetitive elements in

genomic sequences. Current protocols in bioinformatics. 2009:4.10.1-4.10.14.

100. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O and Walichiewicz J. Repbase

Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110 1-

4:462-7.

101. Smit A and Hubley R. RepeatModeler Open-1.0. RepeatMasker website. 2010.

102. Xu Z and Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR

retrotransposons. Nucleic Acids Res. 2007;35 suppl 2:W265-W8.

103. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids

research. 1999;27 2:573.

104. Lowe TM and Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA

genes in genomic sequence. Nucleic acids research. 1997;25 5:955-64.

105. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, et al. Rfam: updates to

the RNA families database. Nucleic acids research. 2008;37 suppl_1:D136-D40.

106. Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12 4:656-64.

107. Birney E, Clamp M and Durbin R. GeneWise and genomewise. Genome Res. 2004;14

5:988-95.

108. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith Jr RK, Hannick LI, et al. Improving

the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic

Acids Res. 2003;31 19:5654-66.

109. Stanke M, Keller O, Gunduz I, Hayes A, Waack S and Morgenstern B. AUGUSTUS: ab

initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34 suppl 2:W435-W9.

110. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS and Weinstock GM. Creating a

honey bee consensus gene set. Genome biology. 2007;8 1:R13.

111. Götz S, García-Gómez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, et al. High-

throughput functional annotation and data mining with the Blast2GO suite. Nucleic acids

research. 2008;36 10:3420-35.

112. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths‐Jones S, et al. The Pfam

protein families database. Nucleic acids research. 2004;32 suppl_1:D138-D41.

113. Parra G, Bradnam K and Korf I. CEGMA: a pipeline to accurately annotate core genes in

eukaryotic genomes. Bioinformatics. 2007;23 9:1061-7.

114. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV and Zdobnov EM. BUSCO:

assessing genome assembly and annotation completeness with single-copy orthologs.

Bioinformatics. 2015;31 19:3210-2.

115. Guindon S, Delsuc F, Dufayard J-F and Gascuel O. Estimating maximum likelihood

phylogenies with PhyML. Bioinformatics for DNA sequence analysis. 2009:113-37.

116. Huchon D, Chevret P, Jordan U, Kilpatrick CW, Ranwez V, Jenkins PD, et al. Multiple

molecular evidences for a living mammalian fossil. Proceedings of the National Academy of

Sciences. 2007;104 18:7495-9.

117. Glazko GV and Nei M. Estimation of divergence times for major lineages of primate species.

Molecular biology and evolution. 2003;20 3:424-34.

118. Schrago C and Voloch C. The precision of the hominid timescale estimated by relaxed clock

methods. Journal of evolutionary biology. 2013;26 4:746-55.

119. De Bie T, Cristianini N, Demuth JP and Hahn MW. CAFE: a computational tool for the

study of gene family evolution. Bioinformatics. 2006;22 10:1269-71.

120. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, et al. Recent segmental

duplications in the human genome. Science. 2002;297 5583:1003-7.

121. Takahashi H, Takahashi K and Liu F-C. FOXP genes, neural development, speech and

language disorders. Forkhead Transcription Factors. Springer; 2009. p. 117-29.

122. Heymann EW. The neglected sense–olfaction in primate behavior, ecology, and evolution.

American journal of primatology. 2006;68 6:519-24.

123. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME SUITE: tools

for motif discovery and searching. Nucleic Acids Res. 2009;37 Web Server issue:W202-8.

124. Pettersson E, Lundeberg J and Ahmadian A. Generations of sequencing technologies.

Genomics. 2009;93 2:105-11.

125. Kriegs JO, Churakov G, Jurka J, Brosius J and Schmitz J. Evolutionary history of 7SL

RNA-derived SINEs in Supraprimates. Trends Genet. 2007;23 4:158-61.

126. Raaum RL, Sterner KN, Noviello CM, Stewart C-B and Disotell TR. Catarrhine primate

divergence dates estimated from complete mitochondrial genomes: concordance with fossil

and nuclear DNA evidence. Journal of Human Evolution. 2005;48 3:237-57.

127. Steiper ME and Young NM. Primate molecular divergence dates. Molecular phylogenetics

and evolution. 2006;41 2:384-94.

128. Anzai T, Shiina T, Kimura N, Yanagiya K, Kohara S, Shigenari A, et al. Comparative

sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as

the major path to genomic divergence. Proceedings of the National Academy of Sciences.

2003;100 13:7708-13.

129. Gaudieri S, Giles KM, Kulski JK and Dawkins RL. Duplication and polymorphism in the

MHC: Alu generated diversity and polymorphism within the PERB11 gene family.

Hereditas. 1997;127 1‐2:37-46.

130. Yamazaki M, Tateno Y and Inoko H. Genomic organization around the centromeric end of

the HLA class I region: Large-scale sequence analysis. Journal of molecular evolution.

1999;48 3:317-27.

131. Konopka G, Bomar JM, Winden K, Coppola G, Jonsson ZO, Gao F, et al. Human-specific

transcriptional regulation of CNS development genes by FOXP2. Nature. 2009;462

7270:213-7.

132. Spiteri E, Konopka G, Coppola G, Bomar J, Oldham M, Ou J, et al. Identification of the

transcriptional targets of FOXP2, a gene linked to speech and language, in developing

human brain. The American Journal of Human Genetics. 2007;81 6:1144-57.

133. Burrows AM. Primate Anatomy: An Introduction. JSTOR, 2001.

134. Laidre ME. Informative breath: olfactory cues sought during social foraging among Old

World monkeys (Mandrillus sphinx, M. leucophaeus, and Papio anubis). Journal of

Comparative Psychology. 2009;123 1:34.

135. Goto T, Salpekar A and Monk M. Expression of a testis-specific member of the olfactory

receptor gene family in human primordial germ cells. Molecular human reproduction.

2001;7 6:553-8.

136. Price AL, Jones NC and Pevzner PA. De novo identification of repeat families in large

genomes. Bioinformatics. 2005;21 suppl_1:i351-i8.

8. Appendix

Table 8.1 Statistics of baboon and mandrill clean/filtered sequencing data.

Species   Paired-end library  Average read  Clean data  Sequencing
          insert size (bp)    length (bp)   (Gb)        depth (×)
Baboon    250                 150           88.37       29.46
          500                 100           62.69       20.9
          800                 100           52.74       17.58
          4000                90            36.79       12.26
          10000               90            43.82       14.61
          Total               -             284.41      94.8
Mandrill  250                 150           91.18       30.39
          500                 100           67.38       22.46
          800                 100           54.28       18.09
          2000                90            18.71       6.24
          5000                90            16.3        5.43
          10000               90            31.35       10.45
          20000               90            10.34       3.45
          Total               -             289.55      96.52
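The sequencing depth column of Table 8.1 appears to be the clean data volume divided by the genome size. The appendix does not state the genome size used; the ratio of the two columns suggests about 3.0 Gb, which is treated here as an assumption. A minimal Python sketch of that arithmetic, with illustrative names that are not taken from the thesis:

```python
# Clean data (Gb) per library, copied from Table 8.1.
CLEAN_DATA_GB = {
    "Baboon": [88.37, 62.69, 52.74, 36.79, 43.82],
    "Mandrill": [91.18, 67.38, 54.28, 18.71, 16.30, 31.35, 10.34],
}

# Assumption: a genome size of 3.0 Gb, inferred from the ratio of the
# "Clean data" and "Sequencing depth" columns; not stated in the appendix.
ASSUMED_GENOME_SIZE_GB = 3.0


def sequencing_depth(clean_data_gb, genome_size_gb=ASSUMED_GENOME_SIZE_GB):
    """Return fold coverage: sequenced bases divided by genome size."""
    return clean_data_gb / genome_size_gb


# Total depth per species, summed over all libraries.
total_depth = {
    species: sequencing_depth(sum(values))
    for species, values in CLEAN_DATA_GB.items()
}

for species, depth in total_depth.items():
    print(f"{species}: {depth:.2f}x")
```

Under this assumption the computed totals agree with the "Total" rows of the table to within rounding.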

Table 8.2 Prediction of the repeats in baboon genome.

Prediction method        Repeat size (bp)   % of genome
TRF [103]                88,638,882         2.84
RepeatMasker [99]        1,317,851,115      42.28
RepeatProteinMask [72]   325,375,043        10.43
De novo [136]            1,353,584,808      43.42
Total                    1,558,442,757      50.00

Table 8.3 General statistics of repeats in mandrill genome.

Type                     Repeat size (bp)   % of genome
TRF                      87,221,621         3.03
RepeatMasker             936,130,281        32.47
RepeatProteinMask        281,888,845        9.77
De novo                  1,139,310,255      39.52
Total                    1,263,424,029      43.83
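The percentage columns of Tables 8.2 and 8.3 can be cross-checked by back-calculating the genome size each row implies (repeat size divided by its fraction of the genome). The sketch below is only a consistency check under the assumption that every percentage is relative to the same assembly length; the function and variable names are illustrative.

```python
# (repeat size in bp, percentage of genome) copied from Tables 8.2 and 8.3.
REPEAT_ROWS = {
    "Baboon": {
        "TRF": (88_638_882, 2.84),
        "RepeatMasker": (1_317_851_115, 42.28),
        "RepeatProteinMask": (325_375_043, 10.43),
        "De novo": (1_353_584_808, 43.42),
        "Total": (1_558_442_757, 50.00),
    },
    "Mandrill": {
        "TRF": (87_221_621, 3.03),
        "RepeatMasker": (936_130_281, 32.47),
        "RepeatProteinMask": (281_888_845, 9.77),
        "De novo": (1_139_310_255, 39.52),
        "Total": (1_263_424_029, 43.83),
    },
}


def implied_genome_size(repeat_bp, percent):
    """Back-calculate the genome size (bp) implied by one table row."""
    return repeat_bp / (percent / 100.0)


implied = {
    species: {m: implied_genome_size(bp, pct) for m, (bp, pct) in rows.items()}
    for species, rows in REPEAT_ROWS.items()
}

for species, sizes in implied.items():
    for method, size in sizes.items():
        print(f"{species:8s} {method:18s} {size / 1e9:.3f} Gb")
```

All rows imply roughly 3.12 Gb for baboon and 2.88 Gb for mandrill, so the percentages look internally consistent; the small spread between rows is what rounding the percentages to two decimals would produce.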

Table 8.4 Categories of TEs in baboon genome.

Type     RepBase TEs             TE Proteins             De novo                 Combined TEs
         Length (bp)     %       Length (bp)     %       Length (bp)     %       Length (bp)     %
DNA      85,821,216      2.75    13,073,154      0.42    23,596,137      0.76    102,653,655     3.29
LINE     524,428,336     16.83   267,916,887     8.60    728,970,010     23.39   907,729,496     29.12
SINE     365,541,149     11.73   --              --      488,745,536     15.68   629,613,402     20.20
LTR      246,841,510     7.92    44,428,216      1.42    358,098,539     11.49   522,321,826     16.76
Other    979             --      --              --      --              0       979             0
Unknown  1,296,802       0.04    --              --      495,265         0.02    1,791,826       0.06
Total    1,317,851,115   42.28   325,375,043     10.44   1,281,034,875   41.10   1,465,054,716   47.01

Note: RepBase TEs, the result of RepeatMasker based on Repbase; TE Proteins, the result of RepeatProteinMask

based on Repbase; De novo, the result of RepeatMasker using a library built by de novo prediction;

Combined TEs, the combined results of RepBase TEs, TE Proteins and De novo.

Table 8.5 Categories of TEs in mandrill genome.

Type     RepBase TEs             TE Proteins             De novo                 Combined TEs
         Length (bp)     %       Length (bp)     %       Length (bp)     %       Length (bp)     %
DNA      47,923,460      1.66    13,264,158      0.46    27,821,997      0.96    68,516,869      2.37
LINE     401,922,498     13.9    229,014,482     7.94    725,287,701     25.1    815,296,990     28.28
SINE     319,811,862     11.0    --              --      481,314,186     16.6    576,217,301     19.99
LTR      169,184,719     5.87    39,705,383      1.38    80,223,826      2.78    200,629,837     6.96
Other    81              --      --              --      3,210           0       3,291           0
Unknown  --              --      --              --      2,897,396       0.1     2,897,396       0.1
Total    936,130,281     32.4    281,888,845     9.78    1,117,858,141   38.78   121,695,029     42.22

Note: RepBase TEs, the result of RepeatMasker based on Repbase; TE Proteins, the result of RepeatProteinMask

based on Repbase; De novo, the result of RepeatMasker using a library built by de novo prediction;

Combined TEs, the combined results of RepBase TEs, TE Proteins and De novo.

Table 8.6 Non-coding RNA genes in baboon genome.

Type     Subtype     Copy     Average length (bp)   Total length (bp)   % of genome
tRNA                 510      75.26                 38,384              0.12
rRNA                 1,200    101.38                121,666             0.39
         18S         136      136.05                18,503              0.06
         28S         288      155.67                44,833              0.14
         5.8S        17       89.94                 1,529               0.005
         5S          759      74.84                 56,801              0.18
snRNA                2,812    110.58                310,963             0.99
         CD-box      900      102.03                91,824              0.29
         HACA-box    324      135.44                43,881              0.14
         splicing    1,322    118.04                156,045             0.50

Table 8.7 Non-coding RNA genes in mandrill genome.

Type     Subtype     Copy     Average length (bp)   Total length (bp)   % of genome
tRNA                 466      75.36                 35,118              0.12
rRNA                 982      97.05                 95,301              0.33
         18S         20       252.6                 5,052               0.02
         28S         205      160.49                32,902              0.11
         5.8S        8        103.87                831                 0.00
         5S          749      75.45                 56,516              0.19
snRNA                2,716    110.76                300,830             1.04
         CD-box      880      101.76                89,547              0.31
         HACA-box    314      136.82                42,963              0.15
         splicing    1,261    118.27                149,146             0.52

Table 8.8 GO enrichment of unique gene families in baboon.

GO ID        GO term                                                        GO class   P value
GO:0090266   regulation of mitotic cell cycle spindle assembly checkpoint   BP         1.60E-03
GO:0048478   replication fork protection                                    BP         7.78E-03
GO:0007416   synapse assembly                                               BP         1.38E-02
GO:0006749   glutathione metabolic process                                  BP         2.34E-02
GO:0007018   microtubule-based movement                                     BP         2.37E-02
GO:0006694   steroid biosynthetic process                                   BP         3.92E-02
GO:0042773   ATP synthesis coupled electron transport                       BP         1.28E-12
GO:0055114   oxidation-reduction process                                    BP         7.42E-06
GO:0005680   anaphase-promoting complex                                     CC         1.05E-03
GO:0004957   prostaglandin E receptor activity                              MF         1.15E-02
GO:0003840   gamma-glutamyltransferase activity                             MF         1.55E-02
GO:0003854   3-beta-hydroxy-delta5-steroid dehydrogenase activity           MF         2.06E-02
GO:0003777   microtubule motor activity                                     MF         2.09E-02
GO:0016491   oxidoreductase activity                                        MF         1.57E-06
GO:0016651   oxidoreductase activity, acting on NADH or NADPH               MF         1.68E-09

Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.

Table 8.9 Go enrichment of unique gene families in mandrill.

GO ID GO term GO class P value

GO:0044260 cellular macromolecule metabolic

process

BP 2.02E-04

GO:0043170 macromolecule metabolic process BP 6.18E-04

GO:0009987 cellular process BP 1.32E-02

GO:0008152 metabolic process BP 1.88E-02

GO:0044238 primary metabolic process BP 2.69E-02

GO:0044237 cellular metabolic process BP 3.97E-02

GO:0034645 cellular macromolecule biosynthetic process BP 1.02E-10

GO:0019538 protein metabolic process BP 1.86E-08

GO:0010467 gene expression BP 3.62E-11

GO:0044267 cellular protein metabolic process BP 4.79E-10

GO:0006412 translation BP 6.29E-33

GO:0007186 G-protein coupled receptor signaling pathway BP 9.02E-06

GO:0043229 intracellular organelle CC 1.49E-04

GO:0005622 intracellular CC 3.44E-04

GO:0044391 ribosomal subunit CC 4.00E-04

GO:0005912 adherens junction CC 2.86E-03

GO:0044464 cell part CC 2.92E-03

GO:0044424 intracellular part CC 5.80E-03

GO:0015934 large ribosomal subunit CC 1.70E-02

GO:0015935 small ribosomal subunit CC 4.19E-02

GO:0005840 ribosome CC 1.57E-35

GO:0005737 cytoplasm CC 1.73E-11

GO:0044444 cytoplasmic part CC 2.34E-15

GO:0032991 macromolecular complex CC 3.29E-10

GO:0043232 intracellular non-membrane-bounded organelle CC 6.11E-19

GO:0004888 transmembrane signaling receptor activity MF 1.11E-04

GO:0004871 signal transducer activity MF 1.49E-04

GO:0004930 G-protein coupled receptor activity MF 1.49E-04

GO:0045296 cadherin binding MF 9.17E-04

GO:0004807 triose-phosphate isomerase activity MF 1.55E-02

GO:0003735 structural constituent of ribosome MF 1.57E-35

GO:0005198 structural molecule activity MF 4.23E-29

GO:0004984 olfactory receptor activity MF 9.21E-08

Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.

Table 8.10 GO enrichment result of unique gene families for mandrill.

GO ID GO term GO class P value

GO:0006412 translation BP 4.60E-03

GO:0006935 chemotaxis BP 1.84E-02

GO:0040011 locomotion BP 1.84E-02

GO:0009605 response to external stimulus BP 4.71E-02

GO:0005840 ribosome CC 4.60E-03

GO:0005737 cytoplasm CC 1.12E-02

GO:0044444 cytoplasmic part CC 1.84E-02

GO:0030529 ribonucleoprotein complex CC 2.36E-02

GO:0001594 trace-amine receptor activity MF 2.75E-04

GO:0016493 C-C chemokine receptor activity MF 7.93E-04

GO:0003735 structural constituent of ribosome MF 4.60E-03

GO:0004896 cytokine receptor activity MF 4.60E-03

GO:0008528 G-protein coupled peptide receptor activity MF 4.60E-03

GO:0005198 structural molecule activity MF 1.84E-02

GO:0004950 chemokine receptor activity MF 8.96E-06

Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.

Table 8.11 Repeat content of MHC class I region for mandrill and human.

Type | Mandrill: copy number, length (bp), percent (%) | Human: copy number, length (bp), percent (%)

DNA/Crypton-V 1 65 0.00 0 0 0.00

DNA/DNA 3 179 0.01 2 126 0.01

DNA/Helitron 1 363 0.02 1 322 0.02

DNA/Maverick 0 0 0.00 1 44 0.00

DNA/Sola 0 0 0.00 1 69 0.00

DNA/MULE-MuDR 2 141 0.01 0 0 0.00

DNA/TcMar-Tc1 1 187 0.01 1 183 0.01

DNA/TcMar-Tigge 12 3,673 0.20 0 0 0.00

DNA/TcMar-Tigger 26 10,177 0.56 27 11,980 0.63

DNA/hAT 2 184 0.01 1 174 0.01

DNA/hAT-Charlie 38 9,586 0.52 46 9,941 0.52

DNA/hAT-Tip100 9 2,016 0.11 6 839 0.04

LINE/CR1 4 772 0.04 4 771 0.04

LINE/Jockey 0 0 0.00 1 57 0.00

LINE/L1 754 340,504 18.62 759 406,157 21.26

LINE/L1-Tx1 1 142 0.01 0 0 0.00

LINE/L2 38 10,351 0.57 32 8,906 0.47

LINE/RTE-X 2 290 0.02 2 302 0.02

LTR/Copia 1 92 0.01 0 0 0.00

LTR/ERV1 146 80,428 4.40 126 77,703 4.07

LTR/ERVK 22 10,368 0.57 34 29,679 1.55

LTR/ERVL 171 91,775 5.02 207 123,209 6.45

LTR/ERVL-MaLR 109 35,671 1.95 81 27,654 1.45

LTR/Gypsy 2 170 0.01 1 67 0.00

LTR/LTR 3 728 0.04 1 170 0.01

SINE/7SL 5 338 0.02 9 399 0.02

SINE/Alu 947 276,532 15.12 806 267,594 14.01

SINE/B4 27 1,505 0.08 24 895 0.05

SINE/MIR 43 5,880 0.32 44 6,508 0.34

SINE/tRNA-7SL 10 637 0.03 8 836 0.04

SINE/tRNA-RTE 1 121 0.01 1 121 0.01

All 2,381 882,875 48.27 2,226 974,706 51.03

Table 8.12 GO and KEGG enrichment of the positively selected genes (PSGs).

GO ID GO Term GO Class Adjusted P-value

GO:0016301 kinase activity MF 6.62E-10

GO:0016772 transferase activity, transferring phosphorus-containing groups MF 1.18E-09

GO:0016773 phosphotransferase activity, alcohol group as acceptor MF 1.18E-09

GO:0003824 catalytic activity MF 2.03E-09

GO:0005524 ATP binding MF 3.42E-09

GO:0004672 protein kinase activity MF 3.42E-09

GO:0032559 adenyl ribonucleotide binding MF 3.80E-09

GO:0030554 adenyl nucleotide binding MF 4.73E-09

GO:0016740 transferase activity MF 1.76E-08

GO:0005515 protein binding MF 3.48E-08

GO:0004713 protein tyrosine kinase activity MF 3.64E-08

GO:0016310 phosphorylation BP 3.85E-08

GO:0006468 protein phosphorylation BP 7.96E-08

GO:0035639 purine ribonucleoside triphosphate binding MF 8.16E-07

GO:0036094 small molecule binding MF 8.22E-07

GO:0032553 ribonucleotide binding MF 9.04E-07

GO:0032555 purine ribonucleotide binding MF 9.04E-07

GO:0017076 purine nucleotide binding MF 1.25E-06

GO:0000166 nucleotide binding MF 1.40E-06

GO:0006793 phosphorus metabolic process BP 1.24E-05

GO:0006796 phosphate-containing compound metabolic process BP 1.24E-05

GO:0009452 RNA capping BP 2.66E-05

GO:0007626 locomotory behavior BP 4.05E-05

GO:0005488 binding MF 5.19E-05

GO:0007155 cell adhesion BP 5.68E-05

GO:0022610 biological adhesion BP 5.68E-05

GO:0008374 O-acyltransferase activity MF 8.68E-05

GO:0043412 macromolecule modification BP 8.73E-05

GO:0006464 protein modification process BP 0.000113

GO:0004525 ribonuclease III activity MF 0.00015

GO:0000123 histone acetyltransferase complex CC 0.000275

GO:0004252 serine-type endopeptidase activity MF 0.000396

GO:0030507 spectrin binding MF 0.000399

GO:0006508 proteolysis BP 0.000485

GO:0070011 peptidase activity, acting on L-amino acid peptides MF 0.000639

GO:0004177 aminopeptidase activity MF 0.000665

GO:0008233 peptidase activity MF 0.000665

GO:0046777 protein autophosphorylation BP 0.000767

GO:0005802 trans-Golgi network CC 0.000848

GO:0005768 endosome CC 0.000848

GO:0005516 calmodulin binding MF 0.001068

GO:0004842 ubiquitin-protein ligase activity MF 0.001317

GO:0017016 Ras GTPase binding MF 0.001317

GO:0031267 small GTPase binding MF 0.001433

GO:0051020 GTPase binding MF 0.001433

GO:0016881 acid-amino acid ligase activity MF 0.002103

GO:0016747 transferase activity, transferring acyl groups other than amino-acyl groups MF 0.003245

GO:0004175 endopeptidase activity MF 0.003245

GO:0008236 serine-type peptidase activity MF 0.003313

GO:0017171 serine hydrolase activity MF 0.003313

GO:0019787 small conjugating protein ligase activity MF 0.003335

GO:0016787 hydrolase activity MF 0.003374

GO:0008238 exopeptidase activity MF 0.003374

GO:0070461 SAGA-type complex CC 0.003374

GO:0070566 adenylyltransferase activity MF 0.003374

GO:0042558 pteridine-containing compound metabolic process BP 0.004351

GO:0050660 flavin adenine dinucleotide binding MF 0.004532

GO:0007610 behavior BP 0.004535

GO:0004402 histone acetyltransferase activity MF 0.00476

GO:0006370 mRNA capping BP 0.005009

GO:0008174 mRNA methyltransferase activity MF 0.005009

GO:0009057 macromolecule catabolic process BP 0.005217

GO:0019199 transmembrane receptor protein kinase activity MF 0.005929

GO:0015291 secondary active transmembrane transporter activity MF 0.006141

GO:0008217 regulation of blood pressure BP 0.006384

GO:0014706 striated muscle tissue development BP 0.006384

GO:0060537 muscle tissue development BP 0.006384

GO:0005887 integral to plasma membrane CC 0.006872

GO:0031226 intrinsic to plasma membrane CC 0.006872

GO:0051345 positive regulation of hydrolase activity BP 0.00701

GO:0000910 cytokinesis BP 0.008254

GO:0004568 chitinase activity MF 0.009118

GO:0006032 chitin catabolic process BP 0.009118

GO:0045335 phagocytic vesicle CC 0.009118

GO:0055037 recycling endosome CC 0.009118

GO:0030318 melanocyte differentiation BP 0.009118

GO:0017049 GTP-Rho binding MF 0.009118

GO:2000114 regulation of establishment of cell polarity BP 0.009118

GO:0008344 adult locomotory behavior BP 0.009118

GO:0043966 histone H3 acetylation BP 0.009118

GO:0017034 Rap guanyl-nucleotide exchange factor activity MF 0.009118

GO:0004534 5'-3' exoribonuclease activity MF 0.009118

GO:0030914 STAGA complex CC 0.009118

GO:0008460 dTDP-glucose 4,6-dehydratase activity MF 0.009118

GO:0004909 interleukin-1, Type I, activating receptor activity MF 0.009118

GO:0004334 fumarylacetoacetase activity MF 0.009118

GO:0004349 glutamate 5-kinase activity MF 0.009118

GO:0004350 glutamate-5-semialdehyde dehydrogenase activity MF 0.009118

GO:0043550 regulation of lipid kinase activity BP 0.009118

GO:0070772 PAS complex CC 0.009118

GO:0003919 FMN adenylyltransferase activity MF 0.009118

GO:0006747 FAD biosynthetic process BP 0.009118

GO:0008609 alkylglycerone-phosphate synthase activity MF 0.009118

GO:0004336 galactosylceramidase activity MF 0.009118

GO:0006683 galactosylceramide catabolic process BP 0.009118

GO:0008611 ether lipid biosynthetic process BP 0.009118

GO:0016287 glycerone-phosphate O-acyltransferase activity MF 0.009118

GO:0006516 glycoprotein catabolic process BP 0.009118

GO:0008705 methionine synthase activity MF 0.009118

GO:0008898 homocysteine S-methyltransferase activity MF 0.009118

GO:0010739 positive regulation of protein kinase A signaling cascade BP 0.009118

GO:0090036 regulation of protein kinase C signaling cascade BP 0.009118

GO:0005137 interleukin-5 receptor binding MF 0.009118

GO:0048280 vesicle fusion with Golgi apparatus BP 0.009118

GO:0008488 gamma-glutamyl carboxylase activity MF 0.009118

GO:0017187 peptidyl-glutamic acid carboxylation BP 0.009118

GO:0006348 chromatin silencing at telomere BP 0.009118

GO:0004375 glycine dehydrogenase (decarboxylating) activity MF 0.009118

GO:0006546 glycine catabolic process BP 0.009118

GO:0004483 mRNA (nucleoside-2'-O-)-methyltransferase activity MF 0.009118

GO:0080009 mRNA methylation BP 0.009118

GO:0050902 leukocyte adhesive activation BP 0.009118

GO:0048066 developmental pigmentation BP 0.009118

GO:0050931 pigment cell differentiation BP 0.009118

GO:0032878 regulation of establishment or maintenance of cell polarity BP 0.009118

GO:0019202 amino acid kinase activity MF 0.009118

GO:0046443 FAD metabolic process BP 0.009118

GO:0072387 flavin adenine dinucleotide metabolic process BP 0.009118

GO:0072388 flavin adenine dinucleotide biosynthetic process BP 0.009118

GO:0006681 galactosylceramide metabolic process BP 0.009118

GO:0019374 galactolipid metabolic process BP 0.009118

GO:0019376 galactolipid catabolic process BP 0.009118

GO:0046485 ether lipid metabolic process BP 0.009118

GO:0016413 O-acetyltransferase activity MF 0.009118

GO:0042084 5-methyltetrahydrofolate-dependent methyltransferase activity MF 0.009118

GO:0070528 protein kinase C signaling cascade BP 0.009118

GO:0018214 protein carboxylation BP 0.009118

GO:0016642 oxidoreductase activity, acting on the CH-NH2 group of donors, disulfide as acceptor MF 0.009118

GO:0009071 serine family amino acid catabolic process BP 0.009118

GO:0016556 mRNA modification BP 0.009118

GO:0045123 cellular extravasation BP 0.009118

GO:0017137 Rab GTPase binding MF 0.009356

GO:0006030 chitin metabolic process BP 0.009521

GO:0016891 endoribonuclease activity, producing 5'-phosphomonoesters MF 0.009521

GO:0015103 inorganic anion transmembrane transporter activity MF 0.010851

GO:0007605 sensory perception of sound BP 0.012081

GO:0003714 transcription corepressor activity MF 0.012081

GO:0050954 sensory perception of mechanical stimulus BP 0.012081

GO:0007067 mitosis BP 0.016642

GO:0000280 nuclear division BP 0.016642

GO:0044431 Golgi apparatus part CC 0.017318

GO:0006725 cellular aromatic compound metabolic process BP 0.017468

GO:0005452 inorganic anion exchanger activity MF 0.018637

GO:0016055 Wnt receptor signaling pathway BP 0.020776

GO:0070588 calcium ion transmembrane transport BP 0.020776

GO:0004540 ribonuclease activity MF 0.020776

GO:0000226 microtubule cytoskeleton organization BP 0.020776

GO:0008271 secondary active sulfate transmembrane transporter activity MF 0.020776

GO:0008272 sulfate transport BP 0.020776

GO:0015116 sulfate transmembrane transporter activity MF 0.020776

GO:0042813 Wnt-activated receptor activity MF 0.020776

GO:0016573 histone acetylation BP 0.020776

GO:0048193 Golgi vesicle transport BP 0.020776

GO:0030574 collagen catabolic process BP 0.020776

GO:0090382 phagosome maturation BP 0.020776

GO:0045670 regulation of osteoclast differentiation BP 0.020776

GO:0046920 alpha-(1->3)-fucosyltransferase activity MF 0.020776

GO:0034450 ubiquitin-ubiquitin ligase activity MF 0.020776

GO:0008124 4-alpha-hydroxytetrahydrobiopterin dehydratase activity MF 0.020776

GO:0034435 cholesterol esterification BP 0.020776

GO:0034736 cholesterol O-acyltransferase activity MF 0.020776

GO:0006919 activation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.020776

GO:0032963 collagen metabolic process BP 0.020776

GO:0044236 multicellular organismal metabolic process BP 0.020776

GO:0044243 multicellular organismal catabolic process BP 0.020776

GO:0044259 multicellular organismal macromolecule metabolic process BP 0.020776

GO:0002761 regulation of myeloid leukocyte differentiation BP 0.020776

GO:0030316 osteoclast differentiation BP 0.020776

GO:0045637 regulation of myeloid cell differentiation BP 0.020776

GO:0034433 steroid esterification BP 0.020776

GO:0034434 sterol esterification BP 0.020776

GO:0004772 sterol O-acyltransferase activity MF 0.020776

GO:0010950 positive regulation of endopeptidase activity BP 0.020776

GO:0010952 positive regulation of peptidase activity BP 0.020776

GO:0043280 positive regulation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.020776

GO:0097202 activation of cysteine-type endopeptidase activity BP 0.020776

GO:2001056 positive regulation of cysteine-type endopeptidase activity BP 0.020776

GO:0008305 integrin complex CC 0.020935

GO:0007167 enzyme linked receptor protein signaling pathway BP 0.021323

GO:0004675 transmembrane receptor protein serine/threonine kinase activity MF 0.022843

GO:0016050 vesicle organization BP 0.022843

GO:0016337 cell-cell adhesion BP 0.023783

GO:0000087 M phase of mitotic cell cycle BP 0.023909

GO:0051301 cell division BP 0.025184

GO:0004553 hydrolase activity, hydrolyzing O-glycosyl compounds MF 0.025409

GO:0048037 cofactor binding MF 0.026148

GO:0048856 anatomical structure development BP 0.026148

GO:0030097 hemopoiesis BP 0.026288

GO:0006475 internal protein amino acid acetylation BP 0.026288

GO:0018393 internal peptidyl-lysine acetylation BP 0.026288

GO:0018394 peptidyl-lysine acetylation BP 0.026288

GO:0008237 metallopeptidase activity MF 0.028813

GO:0048285 organelle fission BP 0.028832

GO:0015301 anion:anion antiporter activity MF 0.030496

GO:0043085 positive regulation of catalytic activity BP 0.030496

GO:0001510 RNA methylation BP 0.030496

GO:0048534 hemopoietic or lymphoid organ development BP 0.030496

GO:0006473 protein acetylation BP 0.030496

GO:0004521 endoribonuclease activity MF 0.030496

GO:0004712 protein serine/threonine/tyrosine kinase activity MF 0.030496

GO:0043473 pigmentation BP 0.030496

GO:0017080 sodium channel regulator activity MF 0.030496

GO:0004948 calcitonin receptor activity MF 0.030496

GO:0046373 L-arabinose metabolic process BP 0.030496

GO:0046556 alpha-N-arabinofuranosidase activity MF 0.030496

GO:0004962 endothelin receptor activity MF 0.030496

GO:0048484 enteric nervous system development BP 0.030496

GO:0070776 MOZ/MORF histone acetyltransferase complex CC 0.030496

GO:0042577 lipid phosphatase activity MF 0.030496

GO:0004822 isoleucine-tRNA ligase activity MF 0.030496

GO:0006428 isoleucyl-tRNA aminoacylation BP 0.030496

GO:0019236 response to pheromone BP 0.030496

GO:0080025 phosphatidylinositol-3,5-bisphosphate binding MF 0.030496

GO:0032777 Piccolo NuA4 histone acetyltransferase complex CC 0.030496

GO:0000103 sulfate assimilation BP 0.030496

GO:0004020 adenylylsulfate kinase activity MF 0.030496

GO:0004781 sulfate adenylyltransferase (ATP) activity MF 0.030496

GO:0051018 protein kinase A binding MF 0.030496

GO:0017025 TBP-class protein binding MF 0.030496

GO:0034454 microtubule anchoring at centrosome BP 0.030496

GO:0008250 oligosaccharyltransferase complex CC 0.030496

GO:0005315 inorganic phosphate transmembrane transporter activity MF 0.030496

GO:0034599 cellular response to oxidative stress BP 0.030496

GO:0090307 spindle assembly involved in mitosis BP 0.030496

GO:0004666 prostaglandin-endoperoxide synthase activity MF 0.030496

GO:0019371 cyclooxygenase pathway BP 0.030496

GO:0043141 ATP-dependent 5'-3' DNA helicase activity MF 0.030496

GO:0030139 endocytic vesicle CC 0.030496

GO:0002573 myeloid leukocyte differentiation BP 0.030496

GO:0030010 establishment of cell polarity BP 0.030496

GO:0030534 adult behavior BP 0.030496

GO:0019566 arabinose metabolic process BP 0.030496

GO:0048483 autonomic nervous system development BP 0.030496

GO:0070775 H3 histone acetyltransferase complex CC 0.030496

GO:0004779 sulfate adenylyltransferase activity MF 0.030496

GO:0006677 glycosylceramide metabolic process BP 0.030496

GO:0046477 glycosylceramide catabolic process BP 0.030496

GO:0046514 ceramide catabolic process BP 0.030496

GO:0046521 sphingoid catabolic process BP 0.030496

GO:0010737 protein kinase A signaling cascade BP 0.030496

GO:0010738 regulation of protein kinase A signaling cascade BP 0.030496

GO:0072393 microtubule anchoring at microtubule organizing center BP 0.030496

GO:0019369 arachidonic acid metabolic process BP 0.030496

GO:0006342 chromatin silencing BP 0.030496

GO:0045814 negative regulation of gene expression, epigenetic BP 0.030496

GO:0003684 damaged DNA binding MF 0.032579

GO:0030163 protein catabolic process BP 0.0328

GO:0007596 blood coagulation BP 0.03529

GO:0007599 hemostasis BP 0.03529

GO:0006629 lipid metabolic process BP 0.036449

GO:0016569 covalent chromatin modification BP 0.037639

GO:0016570 histone modification BP 0.037639

GO:0043547 positive regulation of GTPase activity BP 0.039103

GO:0050817 coagulation BP 0.041062

GO:0002520 immune system development BP 0.043864

GO:0006026 aminoglycan catabolic process BP 0.043864

GO:0004702 receptor signaling protein serine/threonine kinase activity MF 0.043864

GO:0008235 metalloexopeptidase activity MF 0.043864

GO:0043235 receptor complex CC 0.043864

GO:0016407 acetyltransferase activity MF 0.043864

GO:0043414 macromolecule methylation BP 0.043937

GO:0048731 system development BP 0.044356

GO:0006895 Golgi to endosome transport BP 0.044356

GO:0000186 activation of MAPKK activity BP 0.044356

GO:0006729 tetrahydrobiopterin biosynthetic process BP 0.044356

GO:0030099 myeloid cell differentiation BP 0.044356

GO:0016822 hydrolase activity, acting on acid carbon-carbon bonds MF 0.044356

GO:0016823 hydrolase activity, acting on acid carbon-carbon bonds, in ketonic substances MF 0.044356

GO:0046146 tetrahydrobiopterin metabolic process BP 0.044356

GO:0006687 glycosphingolipid metabolic process BP 0.044356

GO:0019377 glycolipid catabolic process BP 0.044356

GO:0046479 glycosphingolipid catabolic process BP 0.044356

GO:0046504 glycerol ether biosynthetic process BP 0.044356

GO:0006906 vesicle fusion BP 0.044356

GO:0043281 regulation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.044356

GO:2000116 regulation of cysteine-type endopeptidase activity BP 0.044356

GO:0005975 carbohydrate metabolic process BP 0.046346

Map ID Map Title Adjusted P-value Gene IDs

map00630 Glyoxylate and dicarboxylate metabolism 0.025008 Masph06057 Paanu10824 Masph03104 Paanu04268 Paanu04371 Masph01783 Paanu09927 Masph17981 Paanu10721 Masph08262

map00525 NA 0.025008 Masph19208 Paanu12892

map01055 Biosynthesis of vancomycin group antibiotics 0.025008 Masph19208 Paanu12892

map00523 Polyketide sugar unit biosynthesis 0.025008 Masph19208 Paanu12892

map04113 Meiosis - yeast 0.025008 Masph15120 Paanu15224

map04141 NA 0.025008 Paanu10424 Masph02116 Paanu18656 Masph17762 Masph17390 Paanu15832
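Many terms in Table 8.12 share exactly the same adjusted P-value (for example 0.009118 or 0.020776). Ties of this kind are a typical by-product of step-up multiple-testing corrections. The sketch below illustrates this with the Benjamini-Hochberg procedure. This appendix does not state which correction was applied, so the choice of Benjamini-Hochberg is an assumption made for illustration only, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative only: the correction actually used for Table 8.12
    is not stated in this appendix.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

Because each adjusted value is capped by the adjusted value of the next larger raw P-value, neighbouring terms often collapse onto one shared value, which matches the pattern of repeated entries in the table.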

Figure 8.1 The distribution of 17-mer frequency for baboon and mandrill. The major and secondary peaks of the frequency distribution are indicated by arrows for baboon (a) and mandrill (b).

[Figure 8.1, panels a and b. x-axis: K-mer frequency (0 to 100); y-axis: Percentage (%) (0 to 4).]
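A k-mer frequency distribution of the kind plotted in Figure 8.1 can be outlined as follows. This is a minimal sketch and not the pipeline used in this thesis: the function names `kmer_counts` and `frequency_distribution` are hypothetical, reverse-complement (canonical) k-mers are not merged, and expressing the y-axis as the percentage of distinct k-mers at each frequency is an assumption about how the plot was normalised.

```python
from collections import Counter


def kmer_counts(reads, k=17):
    """Count every k-mer (length-k substring) across a list of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def frequency_distribution(counts):
    """Map each k-mer frequency to the percentage of distinct k-mers
    that occur exactly that many times."""
    histogram = Counter(counts.values())
    total = sum(histogram.values())
    return {freq: 100.0 * n / total for freq, n in sorted(histogram.items())}


if __name__ == "__main__":
    toy_reads = ["ACGTACGT"]  # toy input; real data would be sequencing reads
    print(frequency_distribution(kmer_counts(toy_reads, k=4)))
```

On real sequencing data the main peak of this distribution sits near the k-mer sequencing depth, and a secondary peak at roughly half or double that depth usually reflects heterozygous or repeated sequence.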

Figure 8.2 Colinearity analysis of chr 3 for mandrill. The orange lines represent gene pairs.

Figure 8.3 Sequencing depth and positional relationships of paired-end reads across the MHC class I region of mandrill.

Figure 8.4 OR7E24 genes on chromosome 19 in mandrill.