ALGORITHMS FOR DECODING CANCER GENOMES:
PHYLOGENETIC INFERENCE AND HAPLOTYPE
ASSEMBLY
a dissertation
submitted to the department of computer science
and the committee on graduate studies
of stanford university
in partial fulfillment of the requirements
for the degree of
doctor of philosophy
By
Dorna Kashef-Haghighi
June 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/yn394gh2333
© 2015 by Dorna Kashef-Haghighi. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Serafim Batzoglou, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
David Dill
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Arend Sidow
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
The field of cancer genomics is expanding rapidly due to major advances in
sequencing technologies. Only a decade ago, the cost and limited throughput of
DNA sequencing made the study of cancer genome alterations at base-pair resolution
infeasible. Today, whole-genome sequencing of tumor populations is commonplace.
Cancer evolves through cycles of cell damage and a series of clonal expansions, mark-
ing the genome with new alterations along the way. To unravel the life history of
cancer genomes, new computational methods are needed to take advantage of the
wealth of whole genome sequence data now available.
In this dissertation, I first describe my work developing computational methods
for studying the role of early neoplasias in breast cancer evolution and show how
these methods can reveal robust clonal lineages and identify cancer progenitor mu-
tations. Next, I describe a probabilistic approach for haplotype reconstruction of an
invasive breast carcinoma genome using long DNA fragments from Moleculo sequenc-
ing technology. I show how cancer-specific aneuploidies can be leveraged to achieve
megabase-length haplotypes with high accuracy. Finally, I demonstrate applications
of phase information for detecting false somatic variant calls, and for identifying and
phasing segmental duplications.
Acknowledgement
The work presented in this doctoral dissertation is the result of the support and help
of many amazing mentors, collaborators, and friends throughout my graduate career
at Stanford University. I would like to take this opportunity to extend my sincere
gratitude and appreciation to the following people.
I am deeply grateful to my advisor, Serafim Batzoglou, for his mentorship, super-
vision, and all the novel technical ideas he brought to my thesis work. I am indebted
to him for offering me the freedom to explore, and the opportunity to learn. His endless
encouragement always inspired me to conquer the most challenging obstacles in
my research, and his wealth of knowledge guided me to the correct path. Thank you
Serafim for making my journey at Stanford a fascinating and stimulating experience.
Your guidance and support were critical to my growth as a researcher, and I feel
privileged for being one of your students.
I am indebted to Arend Sidow for his mentorship and his integral role in the
inception and development of this work. Arend, I can never thank you enough for
developing my appreciation of this field and for sharing your honest opinion and
generous feedback at all times. Thank you for always finding the time in your busy
schedule to meet with me and for patiently teaching me various subjects and skills.
Having the opportunity to work with you has been a tremendous honor for me.
I would like to express my gratitude to David Dill for serving on my qualification,
oral, and reading committees. His insightful questions and generous suggestions were
critical to the improvement of the work presented here. I thank Anshul Kundaje for
serving on my oral committee and offering invaluable suggestions. I am indebted to
Jonathan Pritchard for chairing my oral session. I would also like to acknowledge my
funding source, the Stanford Graduate Fellowship, for making this work possible. My
undergraduate advisors, Doina Precup and Prakash Panangaden, were instrumental
in my decision to pursue a PhD.
During my time at Stanford I was very fortunate to have the chance to collaborate
closely with many outstanding researchers. In particular, I would like to thank Robert
West, Daniel Newburger, Sivan Bercovici, and Ziming Weng. The cancer study pre-
sented in this dissertation was shaped by remarkable contributions of Robert West,
who not only supervised the research effort but also enriched it by providing the
necessary clinical perspective. The high quality of the sequence data, and the validation
analysis described in Chapter 3, were made possible only by the laborious efforts of
Ziming, who managed all the wet lab work. I am indebted to Sivan for contributing many
brilliant ideas to this work, and for inspiring me and helping me through numerous
technical conversations. Dan, you are an amazing collaborator and a wonderful friend.
Your diligence and unique skill set were the driving force behind the completion of
our joint project. Your humor and vast knowledge made our co-teaching CS374 the
most fun teaching experience I have ever had. Your friendship, positive spirit, and
generous advice helped me through many stressful moments of graduate school. For
all these, I am truly thankful. I would also like to thank my other collaborators
Raheleh Salari, Noah Spies, Alayne Brunner, and Robert Sweeney for their valuable
contributions and input.
I am thankful to all current, former, and affiliated members of the Batzoglou lab
for a supportive lab culture. I would like to specially thank Alex, Daniel, Iman, Irene,
Jesse, Lin, Marc, Marina, Raheleh, Sarah, Sivan, Sofia, Tiffany, Victoria, Volodymyr,
and Yuling for many fruitful discussions and fun times. I am grateful to Sarah, Marc,
and Tiffany for welcoming me to the lab and offering me generous advice as I was
starting this journey. Victoria, thank you for all the scientific and non-scientific chats
we had together, and for always surprising me with your kindness.
My life at Stanford would not have been as enjoyable and fulfilling without the help
and support of many amazing friends. I would like to particularly thank Maryam,
Rasoul, Parisa, Ali, Farzaneh, Alireza Marandi, Marjan, Milad, Arezou, Pedram,
Leila, Bernd, Parnian, Nastaran, Hooman, Shirin, Reza, Ehsan, Solene, and Alireza
Sharafat for creating many memorable moments in my life. I would also like to
thank Faezeh, Morteza, Mojdeh, Vahid, Arefeh, Maryam Khezr, Golnoosh, Negin,
and Bahareh for being my true friends even though we live many miles away.
I could not have gotten this far if it was not for the selfless support and help of
my family: my parents, and my sisters Semira and Sormeh. I am forever grateful to
my parents for having raised me to appreciate the value of learning science, and for
constantly reminding me to believe in myself. Thank you, mom, for always listening
to me and giving me your sincere advice. Thank you for teaching me how to love
wholeheartedly and how to cherish little things in life. And thank you for all the time
you spent with me during school years to help me excel. Thank you, dad, for instilling
in me the value of work ethic, for your unconditional love, and for the example that
you have set for me. I feel incredibly fortunate to have you all as my family.
Lastly, and on a more personal note, I would like to sincerely thank my dearest
Fardad. For it was he who exquisitely transformed my fears and despairs into strength
and persistence, and artistically turned my uncertainties into perspective. He stood
by me during the most intense periods of my studies and brilliantly guided me to
overcome each and every one of the hurdles I encountered along the way. Completing
this dissertation without having him by my side is simply unimaginable.
Joint Work
The work in Chapter 3 was published in Genome Research [66]. I would like to thank
my co-first authors Daniel Newburger and Ziming Weng for their unique contributions
to the work. I am grateful to Arend Sidow, Serafim Batzoglou, and Robert West for
their continuous guidance and supervision of the project. I would also like to thank
my other co-authors Raheleh Salari, Robert Sweeney, Alayne Brunner, Shirley Zhu,
Xiangqian Guo, Sushama Varma, and Megan Troxell.
Chapter 4 would not have been possible without the supervision and technical leadership of
Serafim Batzoglou, Arend Sidow, and Robert West. I sincerely thank Sivan Bercovici
for contributing many indispensable ideas. I acknowledge Ziming Weng for generating
the sequencing data and for her contributions to our discussions. I also thank Alex
Bishara, Noah Spies, and Daniel Newburger for discussions and their contributions
to this work.
Contents
Abstract
Acknowledgement
1 Introduction
2 Background
2.1 Genome and terminology
2.1.1 Types of genomic variations
2.1.2 Cancer Genomics
2.2 Variant calling in cancer samples
2.3 Haplotype phasing
3 Genome evolution of Breast Cancer
3.1 Abstract
3.2 Introduction
3.3 Results
3.3.1 Whole-genome sequencing of early neoplasias and related carcinomas from archival material
3.3.2 Somatic SNVs fall into a limited and highly structured set of classes
3.3.3 Allele frequencies of somatic SNVs support common ancestral relationships
3.3.4 Mutated neoplasias are evolutionarily related to carcinomas
3.3.5 Point-mutational mechanisms are evolutionarily stable and reproducible among cases
3.3.6 Aneuploidies are the dominant evolutionary feature of progression
3.4 Methods
3.4.1 Identification and processing of neoplasias
3.4.2 Library construction and sequencing
3.4.3 Read mapping and BAM file processing
3.4.4 Multisample SNV calling
3.4.5 Determination of somatic SNV class patterns and of robust sharing classes
3.4.6 PCR-based validation of SNVs and accuracy assessment of whole-genome calls
3.4.7 Aneuploidy and tumor purity
3.4.8 SNV mutation spectra
3.4.9 Tree inference
3.4.10 Ordering SNVs vs. chromosome 1q ploidy gain in ancestral branches
3.5 Discussion
4 Haplotype reconstruction of somatic genomes
4.1 Introduction
4.2 Results
4.2.1 Dataset and SNV detection pipeline
4.2.2 Overview of the framework
4.2.3 Local phasing
4.2.4 LD-based validation of local phasing
4.2.5 Statistical phasing
4.2.6 Leveraging aneuploidy information in phasing
4.2.7 Final validation test
4.3 Methods
4.3.1 Processing of samples and sequencing
4.3.2 Genotype calling
4.3.3 Constructing read clouds from sequence reads
4.3.4 Building variant blocks
4.3.5 Local phase
4.3.6 Constructing somatic haplotypes
4.3.7 LD-based validation
4.3.8 Statistical phase
4.3.9 Detecting somatic CNV regions
4.3.10 Leveraging somatic CNVs for detecting switch errors and connecting haplotypes
4.4 Discussion
5 Applications of haplotype phasing
5.1 Enhancing the accuracy of variant calls
5.2 Identifying and phasing cryptic segmental duplications
5.3 Increasing the resolution of phylogenetic inference methods
6 Conclusion
Bibliography
List of Tables
3.1 Variant call statistics
4.1 Distances between variants in complete LD pairs
4.2 Estimated number of switch errors in CNV regions
List of Figures
3.1 Overall workflow of the project
3.2 Lineage tree and alternate allele frequencies
3.3 Mutation spectra and rates of somatic SNVs
3.4 Dinucleotide mutation rates for each patient
3.5 Lesser allele fraction plot of Patient 6
3.6 Aneuploidy summary
3.7 Genome evolutions of all patients
3.8 Alternate allele frequencies in each tested private or phylogenetically informative classes of somatic SNVs of Patient 6
4.1 Moleculo read clouds
4.2 Reconstructing parental and somatic haplotypes in the local phase step
4.3 Probabilistic inference model for phasing germline variants
4.4 A haplotype block from real read cloud data
4.5 Haplotype allelic fractions
5.1 Two examples of somatic variants called by GATK in the IDC sample
5.2 Identifying a segmental duplication event at KCNJ12
5.3 Inferring the haplotype sequence of KCNJ12 paralogs
Chapter 1
Introduction
The term ‘cancer’ describes a wide spectrum of diseases that share one common
attribute: unregulated proliferation of cells. Cancer is recognized worldwide as a
disease of high prevalence and mortality rate. According to American Cancer Society
statistics [80], about 589,430 Americans are expected to die of cancer in 2015. Breast
cancer is one of the three most frequent types of cancer among women in the US,
and is considered the leading cause of cancer death in American women aged 20
to 59 years [77].
Cancer results from an accumulation of genetic and epigenetic changes inherited
at birth and also acquired over the course of an individual’s lifetime; hence, genomics
is an essential component of cancer research. The release of the human reference
genome in 2001 [49, 88], and the rapid development of next-generation sequencing
(NGS) technologies in the ensuing years, have revolutionized the field of cancer
genomics by offering a detailed and multidimensional view of cancer genomes. These
molecular techniques are refining our understanding of cancer biology and prompting
new diagnostic and therapeutic approaches to the treatment of cancer patients.
The past few years have witnessed increasingly systematic and organized
efforts, led by researchers across the world, to investigate the underlying somatic
alterations of human cancer [53, 11, 68]. However, a comprehensive understanding
of the mechanisms involved in the formation and progression of cancers still remains
elusive. As part of this dissertation, I contribute to the combined effort of researchers
to characterize somatic mutations involved in the early stages of cancer formation,
by studying the role of early neoplasias in breast cancer evolution.
Although the availability of high-throughput and low-cost sequencing platforms,
offered by NGS technologies, has expedited the study of many different cancer types
at an unparalleled scale, NGS data suffer from short read lengths. Obtaining a global
view of genomic contributions to tumor development is impeded by the resulting
fragmentation of a genome into segments of a few hundred base pairs. To ameliorate the
challenges faced by short-read sequencing, third-generation sequencing technologies
are now emerging. Single-molecule sequencing (e.g. [87, 29]) and synthetic long-read
sequencing (e.g. [48, 70, 3]) platforms have been developed in recent years that produce
fragments with lengths ranging from tens to thousands of kilobase pairs. The
application of these technologies to cancer genomes can provide insight into the underlying
molecular mechanisms at an unprecedented resolution. As part of my dissertation
work, I conducted the first application of synthetic long-read sequencing technologies
to the haplotype analysis of somatic alterations in an invasive breast cancer sample.
This dissertation manuscript is organized as follows.
• Chapter 2 introduces some basic biology concepts and key terms that are used
in the subsequent chapters.
• Chapter 3 presents a study of genome evolution during the progression from
premalignant cell populations to invasive breast cancer. This chapter also de-
scribes the mutation discovery and phylogeny inference pipelines developed as
part of this study.
• Chapter 4 presents a novel toolset that can leverage information from long read
sequencing technologies to do haplotype assembly of a somatic genome.
• Chapter 5 showcases some promising applications of read-based haplotype in-
ference.
• Chapter 6 concludes the dissertation with a summary of its contributions.
Chapter 2
Background
2.1 Genome and terminology
The human genome is the complete set of genetic information stored in the cells,
and is encoded in 23 pairs of chromosomes. Humans are diploid organisms, meaning
that they carry two homologous copies of each chromosome, one contributed by each
parent. Each chromosome is a long DNA molecule, and is represented as a
string over an alphabet of four letters A, C, G, and T known as nucleotides or bases.
Although humans are 99.9% identical in their genetic makeup, they still differ from
each other at millions of nucleotide sites among the 3.2 billion sites of the genome.
These differences contribute to heritable variations between individuals, including but
not limited to their susceptibility to disease.
The release of the first draft of the human genome in 2001 [49, 88], and of a more
complete draft in 2003, enabled remarkable advances in our understanding of the genetic
variation in the human genome and its impact on complex traits and disease. During
the last decade we have also witnessed extensive progress in the field of genomics
fueled by the rapid advances in sequencing technologies.
2.1.1 Types of genomic variations
Genomic variations are differences in the DNA sequence of individuals in a population.
These differences can be classified into two major categories according to their size.
Single Nucleotide Variants (SNVs)
The simplest and most abundant form of genomic variation among individuals is
a single nucleotide variation (SNV), which is a single base change in the DNA se-
quence. The term “single nucleotide polymorphism” (SNP) refers to an SNV with a
population frequency of at least 1%. SNPs occur throughout a genome at a rate of
approximately one in one thousand base pairs. To date, over 53 million SNPs have
been identified and reported in public SNP databases such as NCBI dbSNP and
the International HapMap Project.
Structural and Copy-Number Variants
Human genetic variations are not limited to single nucleotide changes. Other
differences include insertions or deletions of short stretches of DNA, called
indels for short. Moreover, large segments of DNA ranging in size from kilobases to
megabases can be inserted into, deleted from, or rearranged in the genome of different
individuals. These alterations change the structure of chromosomes and are called
structural variants (SVs). A copy number variation (CNV), which is one form of
structural variation, indicates that a particular stretch of DNA has a different number
of copies among individuals. CNVs are caused by genomic rearrangements that lead
to the loss or duplication of DNA fragments.
Since a diploid genome has two copies of every chromosome, an individual has
two copies of the same locus. As a result, a genetic variant might be present in one
or both copies. If the same genetic change affects both copies, the variant is called
homozygous. If it only occurs in one copy, it is called heterozygous. The two versions
of a gene at a given location are called alleles. Typically one allele of a heterozygous
mutation is the same as in the reference genome. This allele is referred to as the
reference allele. The second version is referred to as an alternate or variant allele.
2.1.2 Cancer Genomics
In recent years, a remarkable advance in our knowledge of the mutational profile of
cancer and its application to the clinical setting has taken place. A growing body
of research is forming to study and characterize the heterogeneity of cancer cells.
These studies allow for a better understanding of the disease progression and hope-
fully bring us closer to the development of personalized medicine. The field of cancer
genomics has benefited substantially from the application of next generation sequenc-
ing technologies. These massively parallel sequencing platforms have increased the
throughput of genome sequencing while reducing the cost of data production. As
a result, sequencing many patients with the same cancer type, and analyzing multiple
samples from the same patient, are now possible at an affordable cost.
Somatic Variants
The genome of a cancer cell possesses two types of genomic variants: germline and
somatic. Germline variants are mutations inherited from a parent. These variants
are present in all body cells including tumor cells. Most of these variants are not
disease causing and are prevalent in the general population. During the lifetime of
an individual, his or her DNA is continuously mutated as a result of intrinsic DNA
defects or environmental mutagens. While most of this damage is repaired, a small
fraction survives and accumulates in the DNA. These genetic alterations, which are
present in only a subset of body cells and are not found in germline cells, are called
somatic variants. Most cancer genomics studies focus on identification and analysis
of somatic mutations of a genome as these alterations provide an insight into the
underlying genetics of cancer.
Although normal body cells also carry somatic variants, in the context of cancer
genomics, we are mainly interested in somatic alterations not found in normal cells.
Therefore, throughout this dissertation, the term somatic mutation strictly refers to
a somatic change that is harbored only by cancer cells.
To distinguish between germline and somatic variants, the matched tumor and
normal samples of a patient should be sequenced and analyzed. Genetic alterations
that are shared between the two samples are marked as germline variants, while those
found only in the tumor sample are candidate somatic variants.
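The comparison of matched samples can be illustrated with a minimal sketch. It assumes that each variant call is represented as a hypothetical (chromosome, position, reference allele, alternate allele) tuple and that calls are compared by exact set membership. The function name and the example calls are illustrative only; practical callers evaluate read evidence from both samples jointly rather than intersecting finished call sets.

```python
def classify_variants(tumor_calls, normal_calls):
    """Split tumor variant calls into germline and candidate somatic sets.

    Each call is a hypothetical (chrom, pos, ref, alt) tuple.
    """
    tumor = set(tumor_calls)
    normal = set(normal_calls)
    germline = tumor & normal   # observed in both tumor and normal
    somatic = tumor - normal    # observed only in the tumor
    return germline, somatic


# Made-up example calls, for illustration only.
tumor_calls = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr17", 410, "G", "A"),
]
normal_calls = [("chr1", 1000, "A", "G")]

germline, somatic = classify_variants(tumor_calls, normal_calls)
```

Set difference treats any call missing from the normal sample as somatic, which is why insufficient coverage of the normal sample can cause germline variants to be misclassified.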
Aneuploidy
An aberrant chromosomal copy number, also referred to as aneuploidy, has been
recognized as a common characteristic of cancer genomes for over a century. The high
prevalence of chromosome-arm level somatic copy number alterations is reported in
several studies [37, 10, 9, 63]; however, understanding their role in tumorigenesis and
the progression of disease remains an active field of research [37].
Cancer heterogeneity
Although all cancer cells originate from a common progenitor, they evolve through
different clonal expansions and accumulate different somatic mutations along the way
[15]. These clones dynamically compete for resources within the ever-changing cellular
environment of the tumor, and are subject to selection mechanisms. The term tumor
heterogeneity refers to the coexistence of subpopulations of cells in a tumor
with diverse genotypes and methylation patterns.
Understanding cancer heterogeneity can lead to more accurate diagnosis and
prognosis of the disease, and is crucial for the development of effective and personalized
therapies [33].
2.2 Variant calling in cancer samples
The process of variant discovery is a crucial first step in most cancer genomics
studies. In next-generation sequencing (NGS) methods, the DNA from a cancer sample is
amplified, sheared into small fragments (several hundred base pairs), and sequenced,
producing millions of short sequence reads. These reads are then aligned to a
reference genome, and sequence differences from the reference are marked as potential
mutations. To use this set of genetic alterations in the downstream analysis, it is vital
to distinguish true variants from noise. However, this procedure is made challenging
by multiple factors, some of which are discussed in this section.
NGS data can suffer from errors introduced in different stages of sequencing such
as early amplification cycles or base calling. Moreover, read-alignment tools are
not error-free. Mapping errors are particularly enriched in repetitive regions of the
genome. If enough reads are misaligned to a region, their variation from the reference
genome can resemble the signature of a true variant.
Differentiating true genetic variants from errors is especially hindered by the
frequently high level of normal-cell contamination in tumor samples and the heterogeneity of
cancer. The low proportion of cells in a sample containing a somatic variant results
in a low percentage of sequence reads harboring the variant allele, which obscures
the distinction between true somatic SNVs and sequencing errors.
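The effect of purity and subclonality on read evidence can be made concrete with a simplified calculation. The sketch below assumes a heterozygous somatic SNV in a region that is diploid in both tumor and normal cells, so that the expected variant allele fraction is the purity multiplied by the fraction of tumor cells carrying the variant, divided by two. The function names, the binomial read-sampling model, and the 1% error rate are assumptions made for illustration and are not the models used later in this dissertation.

```python
from math import comb


def expected_vaf(purity, clonal_fraction):
    """Expected variant allele fraction of a heterozygous somatic SNV.

    Assumes the locus is diploid in both tumor and normal cells.
    purity: fraction of cells in the sample that are tumor cells.
    clonal_fraction: fraction of tumor cells that carry the variant.
    """
    return purity * clonal_fraction / 2.0


def prob_at_least(k, depth, p):
    """P(X >= k) for X ~ Binomial(depth, p)."""
    return sum(
        comb(depth, i) * p**i * (1 - p) ** (depth - i)
        for i in range(k, depth + 1)
    )


# A sample with 40% tumor cells, half of which carry the variant.
vaf = expected_vaf(purity=0.4, clonal_fraction=0.5)  # 0.1

depth = 30
# Probability of seeing at least 3 variant reads from the true variant,
# compared with an assumed 1% per-read error rate at the same site.
p_true = prob_at_least(3, depth, vaf)
p_error = prob_at_least(3, depth, 0.01)
```

Under these assumptions only about one read in ten supports the variant, so at modest depth the evidence for a true subclonal SNV is only a few reads, which is the regime in which it becomes hard to separate from sequencing error.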
Understanding the complex nature of cancer genomics necessitates a comprehen-
sive analysis of the complete mutational spectrum of the tumor sample including
germline and somatic variants. Classifying genetic alterations as somatic or germline
requires the joint analysis of matched normal and cancer samples from the same
patient. The presence of the variant allele in sequence reads from both samples suggests
that the sequence variant is a germline mutation. However, incorrect classifications
of germline mutations as somatic can stem from sequence sampling bias, where only
one copy of the diploid genome is sampled at a specific site.
To address these challenges efficiently, sophisticated algorithms must be
developed. In recent years, several general and cancer-specific SNV callers have been
published (e.g. [26, 20, 50, 38, 46]); however, this subject is still an active area of
research.
2.3 Haplotype phasing
The term haplotype refers to a set of alleles at adjacent loci that are carried together
on the same copy of a chromosome. At any segment of a diploid genome, there are two
haplotypes, one inherited from each parent. At a heterozygous site, each allele belongs
to one haplotype. More than two haplotypes can be present in a heterogeneous tumor
sample, or in the genome of aneuploid cells.
Variant calling methods report the alleles present at a variant site. These alleles
are referred to as the variant’s genotype; however, the assignment of these alleles to
the two chromosome copies is not directly observed. For example, if at three adjacent variant sites
{x1, x2, x3}, an individual has two haplotypes (ACG, CTA), a variant caller produces
A/C, C/T, and A/G as the genotypes for x1, x2, and x3 respectively. However, it is
not immediately known which of the four possible pairs of haplotypes (ACA, CTG),
(ATA, CCG), (ACG, CTA), or (ATG, CCA) is the correct configuration of these
alleles. The process of inferring haplotypes from genotype information is referred to
as phasing or haplotype inference.
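The ambiguity in this example can be enumerated directly. The following sketch lists every haplotype pair consistent with a set of unphased heterozygous genotypes; the function name and the input encoding (one pair of alleles per site) are assumptions chosen for illustration. The orientation of the first site is held fixed so that each unordered pair is listed once, which gives 2^(n-1) configurations for n heterozygous sites.

```python
from itertools import product


def haplotype_pairs(genotypes):
    """List all haplotype pairs consistent with unphased heterozygous sites.

    genotypes: list of (allele_a, allele_b) tuples, one per site.
    """
    first = genotypes[0]
    pairs = []
    # Fix the first site's orientation; try both orientations elsewhere.
    for flips in product([False, True], repeat=len(genotypes) - 1):
        hap1 = [first[0]]
        hap2 = [first[1]]
        for (a, b), flip in zip(genotypes[1:], flips):
            if flip:
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        pairs.append(("".join(hap1), "".join(hap2)))
    return pairs


# Genotypes A/C, C/T, and A/G at sites x1, x2, and x3.
genotypes = [("A", "C"), ("C", "T"), ("A", "G")]
pairs = haplotype_pairs(genotypes)
```

For the three sites above this yields the four candidate pairs given in the text, of which only (ACG, CTA) is the true configuration. The number of candidates doubles with every additional heterozygous site, so exhaustive enumeration is useful only as an illustration.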
The importance of phase information is continuously increasing, culminating in
a broad range of applications. Haplotype data is crucial in many disciplines includ-
ing but not limited to population genetics, functional genomics, pharmacogenomics,
and personalized medicine. Genotype imputation, local-ancestry inference, and de-
termining human migration patterns are only a few of the aforementioned applica-
tions. Haplotype information can also facilitate the identification of candidate genes
associated with complex traits. Moreover, several studies have discovered strong as-
sociations between specific haplotypes and drug resistance or disease susceptibility.
The growing recognition of the importance of haplotype information has resulted
in a collective effort of researchers to develop computational methods for haplotype
inference suitable for large-scale or genome-wide sequencing projects.
The simplest approach to phasing is the use of relatedness information in indi-
viduals of a single family. In the simple case of a trio, in which a child and both
parents are either sequenced or genotyped, basic principles of Mendelian inheritance
dictate which alleles were inherited from each parent. The only variants that remain
unphased in trio studies are those at which both parents and the progeny are het-
erozygous, and the ones that were not genotyped in at least one individual. Thus,
these studies result in very long and accurate haplotype blocks. Since it is not always
feasible, or even possible, to sequence all members of a family, this approach has very
limited applicability. At different levels of genealogical relatedness, haplotype
phasing of individuals is performed by identifying segments of the genome that they share
identical by descent (IBD). These are segments of the genome that individuals have
inherited from the same ancestor. As the genealogical relationship between pairs of
individuals becomes more distant, the length of such shared IBD regions decreases
exponentially resulting in smaller haplotype blocks.
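For the trio case described above, the Mendelian rule at a single site can be sketched as follows. The function name and the genotype encoding (a pair of alleles per individual) are assumptions for illustration. The function returns the child's alleles as a (maternal, paternal) pair when exactly one assignment is consistent with the parents' genotypes, and None otherwise.

```python
def phase_trio_site(child, mother, father):
    """Phase one site of a child using the parents' genotypes.

    Each genotype is a pair of alleles, e.g. ("A", "G").
    Returns (maternal_allele, paternal_allele), or None when the site is
    ambiguous or inconsistent with Mendelian inheritance.
    """
    c1, c2 = child
    if c1 == c2:
        # A homozygous child site is trivially phased.
        return (c1, c2)
    options = set()
    if c1 in mother and c2 in father:
        options.add((c1, c2))
    if c2 in mother and c1 in father:
        options.add((c2, c1))
    if len(options) == 1:
        return options.pop()
    return None


# The mother can only have transmitted A, so G came from the father.
resolved = phase_trio_site(("A", "G"), ("A", "A"), ("A", "G"))
# All three individuals are heterozygous: the site stays unphased.
ambiguous = phase_trio_site(("A", "G"), ("A", "G"), ("A", "G"))
```

The second call shows the case named in the text: when both parents and the child are heterozygous for the same alleles, both assignments are consistent and the site cannot be phased from the trio alone.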
Unrelated individuals can be phased by a different set of methods which are
commonly referred to as statistical phasing. These methods are based on modeling the
haplotype frequencies of individuals in a population, and often leverage the linkage
disequilibrium (LD) patterns between genetic markers. LD refers to an allelic
correlation between markers in a population. Several EM-based or HMM-based methods
have been developed for identifying, in the sequenced (or genotyped) individuals, a
set of possible haplotypes that can explain the observed genotypes. This type of
phasing is commonly used in population-scale studies such as in the International
HapMap Consortium and the 1000 Genomes project to impute genotypes at untyped
markers. A review paper by Browning et al. provides a comprehensive overview of
computational methods developed for statistical phasing [14].
Statistical phasing methods are error-prone and offer haplotype blocks with lengths
limited by the extent of the linkage disequilibrium across the genome. These methods
can only infer the phase between variants that are frequent in a population or in a
given sample of individuals, and are not applicable to rare and de novo mutations,
which are most clinically significant. Neither genetic phasing of human families, nor
statistical phasing of unrelated individuals can phase the somatic alterations in can-
cer cells. Therefore, direct haplotyping methods, through experimental analysis of a
single individual sample, are desired to phase de novo and somatic mutations.
Recent technological advances have enabled molecular-based haplotyping of per-
sonal genomes, referred to as Single Individual Haplotyping (SIH). These methods ex-
ploit the single-molecule nature of sequenced fragments. If a fragment encompasses
more than one heterozygous variant, it determines the phase of covered variants.
Therefore, partial haplotypes can be obtained by combining phase information across
overlapping fragments. Several factors such as sequencing errors, alignment errors,
and potential gaps in the sequenced fragments contribute to making this problem
computationally challenging [21]. Various haplotype assembly algorithms have been
developed recently using different approaches including greedy algorithms (e.g. [52]),
stochastic approaches (e.g. [7, 8]), and dynamic programming algorithms (e.g. [41]).
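The idea of combining phase information across overlapping fragments can be sketched as follows. This is a deliberately naive greedy illustration, not any of the cited algorithms: it assumes each fragment is a mapping from heterozygous-variant index to the observed allele (0 for reference, 1 for alternate), resolves orientation by simple majority vote, and lets the newest fragment win on conflicts. The name `assemble_blocks` is hypothetical.

```python
from typing import Dict, List

Fragment = Dict[int, int]  # variant index -> allele seen on this fragment


def assemble_blocks(fragments: List[Fragment]) -> List[Dict[int, int]]:
    """Chain overlapping fragments into partial haplotype blocks.

    Each returned block maps variant index -> allele on one haplotype;
    the other haplotype of the block is its complement.
    """
    blocks: List[Dict[int, int]] = []
    for fragment in fragments:
        if len(fragment) < 2:
            continue  # a single variant carries no phase information
        merged = dict(fragment)
        untouched = []
        for block in blocks:
            shared = [v for v in block if v in merged]
            if not shared:
                untouched.append(block)
                continue
            # Decide whether the block sits on the same haplotype as the
            # fragment or on the complementary one.
            agreements = sum(block[v] == merged[v] for v in shared)
            flip = 2 * agreements < len(shared)
            for variant, allele in block.items():
                merged.setdefault(variant, allele ^ 1 if flip else allele)
        blocks = untouched + [merged]
    # Report each block with allele 0 at its leftmost variant.
    normalized = []
    for block in blocks:
        flip = block[min(block)]
        normalized.append({v: a ^ flip for v, a in block.items()})
    return sorted(normalized, key=min)
```

Variants that are never linked by a fragment end up in separate blocks, which is why the result is a set of partial haplotypes rather than a single chromosome-length phasing.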
Next-generation sequencing technologies are the platforms of choice in most current
genomic studies. However, these technologies produce reads that are relatively
short (a few hundred base pairs) compared to the average distance between heterozygous
variants (roughly one thousand base pairs). As a result, most sequence reads cover at most
one heterozygous variant at a time. Recently, however, new technologies have been
developed that can produce longer sequences. These technologies, which employ different
compartmentalization approaches and amplification techniques, can produce
sequenced fragments ranging in size from tens to hundreds of kilobases. Snyder et
al. provide a comprehensive overview of these technologies and their application to
single individual haplotyping [79].
Chapter 3
Genome evolution during
progression to breast cancer
3.1 Abstract
Cancer evolution involves cycles of genomic damage, epigenetic deregulation, and
increased cellular proliferation that eventually culminate in the carcinoma pheno-
type. Early neoplasias, which are often found concurrently with carcinomas and are
histologically distinguishable from normal breast tissue, are less advanced in pheno-
type and are thought to represent precursor stages. To elucidate their role in cancer
evolution we performed comparative whole genome sequencing of early neoplasias,
matched normal tissue, and carcinomas from six patients. By using somatic mu-
tations as lineage markers we built trees that relate the tissue samples within each
patient. On the basis of these lineage trees we inferred the order, timing, and rates
of genomic events. In four out of six cases, an early neoplasia and the carcinoma
share a hypermutated common ancestor with recurring aneuploidies, and in all six
cases evolution accelerated in the carcinoma lineage. Point mutational mechanisms
are stable and consistent across cases, suggesting that hypermutation is a result of
increased cell division. In contrast to highly advanced tumors that are the focus
of much of current cancer genome sequencing, neither the early neoplasia genomes
nor the carcinomas are enriched with potentially functional somatic point mutations.
Aneuploidies that occur in common ancestors of neoplastic and tumor cells are the
earliest events that affect a large number of genes, and may predispose breast tissue
to eventual development of invasive carcinoma.
3.2 Introduction
The cells of a multicellular organism are related to one another by a bifurcating
lineage tree whose root is the zygote. DNA replication, chromosome segregation,
and cell division during development from the zygote to the adult introduce point
mutations and other DNA changes into the genome, which persist in the descen-
dants of the cells in which they occurred. Germ-line point mutations occur at a
rate of approximately one per diploid genome per cell division [47], but the rate of
somatic changes is less well understood, and is likely to vary by tissue type. Large-scale
genomic changes such as aneuploidies are generally thought to be extremely
rare in normal tissue. Cancers, in contrast to normal tissue, accumulate much larger
numbers of genomic changes, as illustrated by genome sequencing of late-stage tu-
mors [53, 83, 11, 71, 18, 82, 6, 68, 67]. Solid tumors are highly mutated by several
mechanisms, such as point mutations, copy-number variations, and chromothripsis
[40, 51, 10, 55, 61, 81, 23, 58]; relapses or metastases exhibit further mutational
evolution [27, 28, 92, 64, 59, 86, 90, 91]. The state of an individual advanced cancer
genome sheds little light on the order of genomic changes, however, except in analyses
of subclone evolution [67, 75]. In an advanced tumor, the earliest driver changes that
had predisposed ancestral cells to eventual carcinoma development are confounded
with later changes. As a consequence, our understanding of early tumor evolution
is still in its infancy. The historically proven approach to understanding evolution is
comparative analysis of extant species, whose power was greatly increased by whole-
genome sequencing in recent years. Analogous to species comparisons, which are
based on evolutionary (bifurcating) lineage trees, comparisons of somatic genomes
from a single individual could, in principle, shed light on somatic evolution, but in
normal tissue the number of mutations is low. However, given the large number of
genomic changes during tumor evolution, it may be possible to dissect the evolution-
ary history of a cancer by comparing its genome to clinically recognized precursor
lesions. In this context, breast cancers provide a proof-of-principle opportunity, due
to their frequent association with early neoplastic lesions that are readily identified
by morphology [78, 1, 56, 13], and whose genomes may provide windows into the
earliest stages of tumor evolution. Using whole-genome sequencing of histologically
characterized archival (formalin-fixed, paraffin-embedded) samples, we determine lineage
relationships of early neoplasias with carcinomas, quantify mutational load and
mutation spectra during progression from normal tissue to neoplasia to carcinoma, and
find the earliest detectable mutations and aneuploidies in cell lineages ancestral to the
lesions. A subset of these early events may have provided the initial oncogenic potential
and helped trigger the first clonal expansion. Our analyses reveal variation among
the six cases in the specific evolution of neoplasia and tumor, as would be expected for
an evolutionary process dominated by stochasticity. The mechanistic commonalities
among the cases, however, bear significant implications for our conceptualization of
tumor origins and progression.
3.3 Results
3.3.1 Whole-genome sequencing of early neoplasias and re-
lated carcinomas from archival material
Our workflow (Figure 3.1) began with the screening of histopathological sections of
archival estrogen receptor positive invasive ductal carcinoma (IDC) resection speci-
mens for presence of concurrent early neoplasias, which are microscopic in size (typ-
ically 1-3 mm). We selected cases in which early neoplasia with or without atypia
(EN or ENA; a spectrum of usual ductal hyperplasia, columnar cell lesions, and flat
epithelial atypia), and in some cases ductal carcinoma in situ (DCIS) were present
in addition to the IDC. Areas of high neoplasia or carcinoma content were cored,
and histologically re-evaluated for lesion purity. Six cases were chosen in which each
sample met criteria for purity and had enough DNA for whole genome sequencing.
Each case had at least one early neoplasia sample from the same side in which the
carcinoma was found, and five also had an early neoplasia sample from the contralat-
eral mastectomy/lumpectomy. Each had at least one control sample (lymph, normal
breast tissue or both), and three cases also had a DCIS in addition to the IDC,
yielding a total of 31 samples (Figure 3.1A).
We optimized DNA isolation from archival samples to obtain sufficient quantities
of preparative material, and honed the generation of robust libraries. For each sample,
a single library was built and sequenced with paired-end reads (2 × 101 bp) on the
Illumina HiSeq platform. Library complexity was sufficient to support deep whole
genome sequencing, with the vast majority of sequence data coming from independent
DNA fragments as opposed to PCR duplicates. The samples from the first patient
were sequenced to higher coverage (average of 84.6x) to calibrate the tradeoff between
cost and sensitivity in variation calling. Coverage of each sample by confidently
mapped reads ranged from 46.7x to 105.7x, with a median of 53.4x.
3.3.2 Somatic SNVs fall into a limited and highly structured
set of classes
Detection of somatic single nucleotide variants (SNVs), such as those occurring dur-
ing cancer evolution, requires a methodology with high specificity because inherited
(germline) variants are orders of magnitude more numerous and even a small false
positive rate of calling inherited variants somatic results in low positive predictive
value. Our high sequence coverage and purity of samples allowed us to pursue highly
sensitive and specific somatic SNV identification. Because we sequenced several sam-
ples from each patient, we identified the total set of SNVs in each patient with a
multisample strategy using GATK. For each patient, we called variants using reads
from all samples simultaneously, and then assigned genotypes to each sample. The
vast majority of SNVs were present in all samples, as expected from germline variants.
Standard quality control metrics confirmed the high quality of our variant calls. The
total number of high-confidence germline variants ranged from 2,650,714 (Patient 5)
to 2,973,005 (Patient 1). Between 97.91% and 98.06% of these were present in dbSNP.
Figure 3.1: Overall workflow of the project from clinical sample to genome evolution inference. Stage labels recoverable from the figure: pathological evaluation as part of clinical care; clinically informed evaluation of research specimens; molecular biology and sequencing; computational sequence analysis; determination of each case’s evolutionary history.
Figure 3.2: Lineage tree and alternate allele frequencies. (A) The samples in this study by type (rows) and patient (columns). (B) Model of neoplastic progression on the basis of organismal tissue and cell lineage. For simplicity, only one possible scenario of the progression from normal to neoplasia to carcinoma is shown. Mutations that arise in ancestors are propagated through subsequent divisions to all descendants. Depending on the ancestors in which they arise, they will be found in one or more samples of the patient, with varying prevalence. For example, mutations that arise in the B branches will be found in all cells of the neoplasia and of the carcinoma; in contrast, mutations that arise on the C branch will be present only in a subset of the neoplasia cells and mark the neoplastic subpopulation from which the carcinoma arose. Mutations that arise on the F branch mark a clonal expansion within the neoplasia, after the last common ancestor with the carcinoma. Note that if there are no mutations found that define branches B and C, it is not possible to infer a specific relationship of the carcinoma with the neoplasia. (NS) Not sampled. In the expanded box are alternate allele frequency comparisons relevant to neoplasias and carcinomas. The two starred comparisons require independent estimates of the proportion of normal cells in each sample, as they compare AAFs across different samples. All other comparisons are either within samples, or the AAF is zero, thus requiring no independent estimate of the proportion of normal cells in the sample. (C-F) Alternate allele frequencies as a function of the class and sample for each patient with phylogenetically informative SNV-sharing classes. The number of SNVs in each class and the branch in the lineage tree of A are listed below each plot. For Patient 1, the only phylogenetically informative class was where the IDC shared SNVs with ENA. For the other patients, the AAFs of informative classes are grouped together and the mutation pattern for each class is represented by a series of zeros and ones directly above the sample labels (a “1” indicates that the SNVs were present in the corresponding sample and a “0” indicates that they were not). (EN) Early neoplasia; (EN cl) early neoplasia contralateral; (ENA) early neoplasia with atypia. Subscript in lineage-tree branch of patient 6 denotes whether the neoplasia in the lineage tree is this patient’s EN or ENA, and whether the carcinoma is DCIS or IDC.
On average, 59,697 SNVs per patient were present in all samples but not in dbSNP,
and therefore represent novel SNPs of low population-allele frequency (Table 3.1).
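The specificity argument made at the start of this section can be made concrete with a little arithmetic. The sketch below is illustrative only: the round counts (about 2.9 million germline variants and about 2,000 true somatic SNVs) are of the same order as the values reported in Table 3.1, while the sensitivity and the per-variant false positive rates are hypothetical values chosen for the example.

```python
def positive_predictive_value(n_somatic: int, n_germline: int,
                              sensitivity: float,
                              germline_false_positive_rate: float) -> float:
    """Fraction of 'somatic' calls that are truly somatic."""
    true_calls = n_somatic * sensitivity
    # Germline variants wrongly labeled somatic, e.g. because the
    # variant allele was missed in one sample.
    false_calls = n_germline * germline_false_positive_rate
    return true_calls / (true_calls + false_calls)


# Hypothetical operating points for illustration.
loose = positive_predictive_value(2_000, 2_900_000, 0.9, 1e-3)
strict = positive_predictive_value(2_000, 2_900_000, 0.9, 1e-5)
print(f"PPV at 0.1% germline error:   {loose:.2f}")
print(f"PPV at 0.001% germline error: {strict:.2f}")
```

Under these assumed numbers, mislabeling only one germline variant in a thousand already produces more false somatic calls than true ones, which is why the per-variant error rate has to be several orders of magnitude lower.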
Between 1465 (Patient 1) and 3416 (Patient 6) SNVs were candidate somatic
variants, as they were not detected in at least one sample of that patient (Table
3.1). If the samples are related by a tree, then only some sharing classes are possible
and the total number of observed classes is much lower than the number of possible
                     P1         P2         P3         P4         P5         P6
Total                2,973,005  2,771,413  2,912,758  2,915,727  2,650,714  2,937,816
  Homozygous         1,168,671  1,078,021  1,149,006  1,160,421  1,017,760  1,146,679
  Ts/Tv ratio        2.13       2.09       2.09       2.09       2.15       2.10
In dbSNP             2,910,863  2,717,531  2,856,582  2,857,498  2,596,421  2,864,359
  Percent            97.91      98.06      98.07      98.00      97.95      97.50
Novel                62,142     53,882     56,176     58,229     54,293     73,457
  Homozygous         2,514      1,734      1,715      1,681      1,295      2,372
Candidate somatic    1,465      1,546      2,567      2,775      1,924      3,416
  After filtering    1,279      1,479      2,104      2,582      1,728      3,211

Table 3.1: Variant call statistics
classes. For example, in Patient 1, from whom we sequenced six samples, there are
2^6 - 1 = 63 possible classes to which an SNV can belong. In this patient, 1766 SNVs
were absent from at least one sample, and excluding those present in lymph we retain
1465 candidate somatic SNVs. Only six of the classes, containing 1279 out of the
initial 1465 candidate SNVs (87%), survived filtering. Those SNVs removed during
filtering were either germline SNVs where one allele was poorly covered, or somatic
SNVs whose class membership we could not confidently establish.
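The notion of sharing classes and their compatibility with a tree can be sketched in a few lines. The code below is an illustration under stated assumptions, not the filtering procedure used in this study: it represents each SNV by a presence/absence tuple over the samples, and it treats a set of classes as tree-compatible when every pair of sample sets is either nested or disjoint (the perfect-phylogeny condition, which assumes each mutation arose once and was never lost). All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Tuple

Pattern = Tuple[int, ...]  # 1 = SNV present in that sample, 0 = absent


def count_possible_classes(n_samples: int) -> int:
    """Number of non-empty presence/absence patterns over n samples."""
    return 2 ** n_samples - 1


def sharing_classes(presence: Dict[str, Pattern]) -> Counter:
    """Group SNVs by their presence/absence pattern."""
    return Counter(presence.values())


def tree_compatible(classes: Iterable[Pattern]) -> bool:
    """True if every pair of classes is nested or disjoint."""
    sample_sets = [frozenset(i for i, present in enumerate(c) if present)
                   for c in classes]
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(sample_sets, 2))
```

For six samples `count_possible_classes(6)` gives 63, as in the text. Two classes such as `(1, 1, 0)` and `(0, 1, 1)` overlap without nesting, so they cannot both be explained by a single tree, which is why only a small subset of the possible classes is expected in real data.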
Across the six cases, we retained 82%-96% (median = 91%) of SNVs and 19%-43% (median = 27%) of classes, revealing substantial structure in the data. The final
number of confident somatic SNVs ranges from 1279 in Patient 1 to 3211 in Patient
6, for a total of 12,392 in all six patients. 8950 (72%) of these are private to only
one sample in only one patient, and the number of such private SNVs increases as a
function of the severity of the cancer phenotype: the IDCs harbor the most private
mutations (average of 601 per sample, N = 7, range 46-1809), the DCISs have an
average of 470 SNVs per sample (N = 3, range 70-978), early lesions 229 per sample
(N = 14, range 123-387), and normal have the fewest (N = 2, range 39-89).
On average, the IDCs accumulated 2.6-fold more private mutations than the early
neoplasias, and almost 10-fold more than normal breast tissue. This may be due to
a larger number of cell divisions or an increased mutation rate in the ancestral cell
lineage of the IDC.
3.3.3 Allele frequencies of somatic SNVs support common
ancestral relationships
Somatic SNVs that are not private to individual samples define phylogenetically in-
formative classes. A total of 3442 SNVs define such classes, ranging from 0 SNVs
in Patient 4 to 1054 SNVs in Patient 3, with a per-case average of 574 and a per-
class (N = 7) average of 492. To illustrate the logic of phylogenetic inference using
informative classes, we consider a hypothetical lineage tree that relates non-breast
somatic, normal breast, neoplastic, and carcinoma cell lineages (Figure 3.2B). Muta-
tions that occurred in ancestral cells are present in specific subsets of samples, with
the lineage tree constraining the set of possible classes.
As demonstrated in recent studies of subclone evolution in IDC [68, 67, 75], alter-
nate allele frequency (AAF) is a powerful metric for understanding tumor evolution.
The “alternate allele” is the allele that does not match the reference base, and which
in the vast majority of cases is the somatic mutation. Its frequency is estimated from
its sequence coverage divided by the coverage of the alternate base plus that of the
reference base. Depending on the ancestral lineage in which a collection of mutations
arose, their AAF distributions in each sample vary. For example, if a variant arose in
a common ancestor of a subset of lesional cells in the sample, its AAF is lower than
that of an earlier mutation that is present in all lesional cells of the sample (Figure
3.2B).
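The AAF definition above, and the reason a later mutation has a lower AAF than an earlier one, can be written out as a short sketch. The expected-value formula is a simplification introduced here for illustration: it assumes a heterozygous somatic SNV in a diploid region with no copy-number change, and the parameter names `lesional_fraction` and `carrier_fraction` are hypothetical.

```python
def alternate_allele_frequency(alt_reads: int, ref_reads: int) -> float:
    """AAF estimated from read counts: alt / (alt + ref)."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("site has no coverage")
    return alt_reads / total


def expected_aaf(lesional_fraction: float, carrier_fraction: float) -> float:
    """Expected AAF of a heterozygous somatic SNV in a diploid region.

    lesional_fraction: fraction of cells in the sample that are lesional.
    carrier_fraction:  fraction of lesional cells carrying the mutation.
    Each carrier cell contributes one mutant allele out of two.
    """
    return 0.5 * lesional_fraction * carrier_fraction


# An early mutation present in all lesional cells versus a later one
# present in half of them, in a sample that is 80% lesional.
print(expected_aaf(0.8, 1.0), expected_aaf(0.8, 0.5))
```

Comparing the two printed values shows the ordering the text relies on: for a fixed sample, the mutation carried by a subset of lesional cells has a proportionally lower expected AAF.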
For each SNV class of each patient, we obtained estimates of AAF distributions
with highly consistent class patterns (Figure 3.2 C-F). For example, in Patient 1 the
AAFs of the SNVs that are present in ENA and IDC and absent everywhere else are
higher than the AAFs of the ENA-only or the IDC-only classes. The same patterns
hold for Patients 2 and 6. The patterns in Patient 5 are complicated by the presence
of two IDCs and by low numbers of SNVs in relevant classes. Note that the mean
AAFs are always < 50% due to unavoidable contamination of the lesional tissue with
normal cells that derive from lineages that branched off before the lesional ancestors
accumulated their somatic mutations.
3.3.4 Mutated neoplasias are evolutionarily related to carci-
nomas
Each case represents an independent evolution; therefore, common patterns across the
cases may be of general significance. We first asked to what extent the early neoplasias
and the carcinomas share mutations that are not present in other samples, pointing
to shared ancestral cell lineages. In four cases (Patients 1, 2, 5, and 6) (Figure 3.2C-
F), the phylogenetically informative SNV classes indicate that a neoplasia shares a
common ancestor with the carcinoma. In each of these cases, a neoplasia and the
carcinoma share a significant number of SNVs. For example, in Patient 1, 775 SNVs
are shared between ENA and IDC, and in Patient 2, 681 SNVs are shared among the
EN, DCIS, and IDC, with additional SNVs shared between the EN and IDC. There
are no well-supported classes (in terms of number of SNVs and their AAFs) that
are in conflict with each other, and none in which normal tissue or contralateral EN
share SNVs with the carcinomas. The aforementioned PCR-based targeted validation
showed 94% and 98% accuracy in assigning SNVs to the correct phylogenetic class.
In three of these four cases (Patients 1, 2, and 6) the number of SNVs in common
between a neoplasia and carcinoma suggests the existence of a common ancestor that
had already accumulated many somatic SNVs. Strikingly, in two cases (Patients 1 and
2) the number of mutations in the ancestor is greater than the number of mutations
that subsequently occurred in the ancestral lineage private to the carcinoma.
In three cases (Patients 2, 3, and 6) DCIS was concurrent with IDC, and in one
case (Patient 5) two independent IDC lesions were present. These four cases provided
us the opportunity to ask whether the carcinoma phenotype arose once or multiple
times independently. In Patient 3, the DCIS and IDC share a mutated common
ancestor, suggesting that the carcinoma phenotype arose in the ancestral lineage,
and that the IDC subsequently acquired the invasive phenotype. In Patients 2 and
6, there is no well-supported class of SNVs that unites the two carcinomas to the
exclusion of a neoplasia. Instead, in both patients, the DCIS and the IDC each
share separate classes of SNVs with a neoplasia, suggesting independent origins of
the carcinoma phenotype from neoplastic ancestors.
These results suggest that some early neoplasias harbor a predisposition to spawn-
ing a carcinoma that later acquires an invasive phenotype (Patients 1, 2, 6). The
chance of acquiring a carcinoma phenotype, given the predisposition provided by the
neoplasia, is sufficiently high to allow for concurrent and independent development of
carcinomas (DCIS and IDC in Patients 2 and 6).
3.3.5 Point-mutational mechanisms are evolutionarily stable
and reproducible among cases
SNVs result from mutations that occurred in ancestral cells, and if a specific molec-
ular mechanism were primarily responsible for the mutations, the distribution of the
SNVs among the various types of change (the “mutation spectrum”) would carry
that mechanism’s signature [72]. To investigate the cause of the ancestral accumula-
tion of mutations, we analyzed the mutational spectrum as a function of the samples
in which SNVs were found. The mutational spectrum in our cases is remarkably
consistent from patient to patient (Figure 3.3A) and is also stable across SNVs in
different types of samples and in different patterns (Figure 3.3B). Transitions outnumber
transversions about 1.5-fold in a pattern that is typical for replication errors
and not indicative of any specific type of DNA damage or failed repair mechanism.
C-to-T changes (or G-to-A, which are the same due to base pairing) are most nu-
merous. Converted to substitution rates, this bias is even more pronounced because
there are only roughly two C’s for every three T’s in the human genome. The consis-
tency across patients implies a common mechanism, and the consistency among the
three SNV groups (SNVs in early lesions only, in carcinoma only, and shared between
early lesions and carcinoma) implies that the common mechanism acts throughout
neoplastic and tumor evolution.
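The mononucleotide spectrum described above can be tabulated with a short sketch. It collapses each substitution with its strand complement (so that G-to-A is counted as C-to-T) and computes the transition-to-transversion ratio. The input format, a list of (reference base, alternate base) pairs, and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# After collapsing onto a pyrimidine reference base, the two
# transitions are C>T and T>C; everything else is a transversion.
TRANSITIONS = {("C", "T"), ("T", "C")}


def canonical_change(ref: str, alt: str) -> Tuple[str, str]:
    """Collapse a substitution with its strand complement."""
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return ref, alt


def mutation_spectrum(snvs: Iterable[Tuple[str, str]]) -> Counter:
    """Count SNVs in each of the six strand-collapsed classes."""
    return Counter(canonical_change(ref, alt) for ref, alt in snvs)


def ti_tv_ratio(spectrum: Counter) -> float:
    """Transitions divided by transversions (needs >= 1 transversion)."""
    transitions = sum(n for change, n in spectrum.items()
                      if change in TRANSITIONS)
    transversions = sum(spectrum.values()) - transitions
    return transitions / transversions
```

These are raw counts; converting them to rates, as the text notes, additionally requires dividing by how often each reference base occurs in the genome.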
To further shed light on the mutational mechanism we turned to analysis of din-
ucleotide substitution patterns. Because dinucleotide frequencies vary by an order of
magnitude in the human genome, with AA/TT being most common and CG least
common, we converted mutation counts to rates. Truly random substitutions would
have the same rates for each of the 60 possible mutations (10 dinucleotides with six
Figure 3.3: Mutation spectra and rates of somatic SNVs. (A) Mononucleotide substitution frequencies by patient. (B) Mononucleotide substitution frequencies by SNV class. (C) Dinucleotide substitution rates of SNVs private to early neoplasias. (D) Dinucleotide substitution rates of SNVs private to carcinomas. (E) Dinucleotide substitution rates of SNVs shared among neoplasias and carcinomas. For C-E, SNVs are pooled across patients. The mutated dinucleotide is indicated in the inner circle, and the substitution occurring within it is color coded. Rate is defined as mutations per dinucleotide of that class.
possible changes each, not counting changes in both bases because they are exceed-
ingly rare). A dinucleotide-unaware process would recapitulate the mononucleotide
rates, with the average transition having an about fourfold higher rate than the av-
erage transversion. In contrast, we detect an approximately eightfold higher rate of
C-to-T transitions in the CpG context. This higher mutation rate is due to methyla-
tion of the C in a CpG dinucleotide, which upon deamination becomes a TpG. If the
repair machinery catches this event it is reversed, but if the replication fork passes
first it leads to a C-to-T transition in one of the daughter strands. The relative rate
of C-to-T transitions in CpGs versus C-to-T transitions in the other dinucleotide con-
texts and versus all other changes provides an internal calibration as to whether DNA
damage processes or defective repair mechanisms have disproportionally affected the
genome.
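Converting dinucleotide mutation counts to rates, as described above, amounts to dividing each count by the number of times its dinucleotide context occurs. The sketch below illustrates this under assumptions of our own: dinucleotides are collapsed with their reverse complements into ten classes, mutation counts are assumed to be already expressed relative to the collapsed context, and the keys of `mutation_counts` (a context and a change label such as `"C>T"`) are a hypothetical encoding.

```python
from collections import Counter
from typing import Dict, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_dinucleotide(dinuc: str) -> str:
    """Collapse a dinucleotide with its reverse complement."""
    reverse_complement = dinuc.translate(_COMPLEMENT)[::-1]
    return min(dinuc, reverse_complement)


def dinucleotide_counts(sequence: str) -> Counter:
    """Count strand-collapsed dinucleotides, skipping ambiguous bases."""
    counts: Counter = Counter()
    for i in range(len(sequence) - 1):
        dinuc = sequence[i:i + 2]
        if set(dinuc) <= set("ACGT"):
            counts[canonical_dinucleotide(dinuc)] += 1
    return counts


def substitution_rates(mutation_counts: Dict[Tuple[str, str], int],
                       context_counts: Dict[str, int]
                       ) -> Dict[Tuple[str, str], float]:
    """Mutations per occurrence of the dinucleotide context."""
    return {(context, change): n / context_counts[context]
            for (context, change), n in mutation_counts.items()}
```

Because CG is the rarest context in the human genome, even a modest raw count of C-to-T changes in CpG translates into a high rate once normalized this way.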
In our patients, the rate increase of C-to-T transitions in the CpG context and in
the dinucleotide mutation spectrum in general is similar to germline evolution [84, 44],
and is consistent across patients (Figure 3.4) as well as among classes of SNVs (private
to neoplasias, private to IDCs, and shared among neoplasias and carcinomas) (Figure
3.3 C-E). This implies that the sources of the somatic SNVs are mutations that
accumulated during many rounds of DNA replication (many ancestral cell divisions),
and that cancer- or neoplasia-specific point mutational mechanisms, if present at all,
did not substantially affect the mutation spectrum. Taken together, these lines of
evidence support a model of mutation accumulation that is gradual and largely a
function of the number of cell divisions, as opposed to recurring DNA damage events
or mutational storms.
The somatic SNVs are randomly distributed in each patient with no enrichment of
exonic or nonsynonymous changes, regardless of the phylogenetic class to which they
belong. We also detect very little clustering of mutations that might be indicative of
localized mutagenic events [68]. Across all cases, 159 out of the 12,392 high-confidence
somatic SNVs fall into coding regions, with 2/3 (106) being nonsynonymous, which is
what is expected by chance. This holds true for any biological subdivision of the data
(e.g., neoplasias vs. IDC). The affected genes exhibit no enrichment for pathways
by GO analysis [4, 42]. One point mutation, H1047R in PIK3CA, which has been
previously implicated in cancer [73, 30] and early neoplasias [85], was recurrent in
our cases (Patients 1, 3, 4, and 5, in various samples) at varying allele frequencies.
Common cancer loci such as TP53 and BRCA1 were not mutated.
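The statement that 106 nonsynonymous changes among 159 coding SNVs is what chance predicts can be checked with an exact binomial calculation. The sketch below takes the chance proportion of 2/3 directly from the text; the choice of a two-sided exact test, and the helper names, are ours and not necessarily the analysis performed in this study.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_sided_binomial_p(k: int, n: int, p: float) -> float:
    """Exact two-sided p-value: total probability of outcomes that are
    no more likely than the observed one."""
    observed = binomial_pmf(k, n, p)
    total = sum(pmf for pmf in (binomial_pmf(i, n, p) for i in range(n + 1))
                if pmf <= observed * (1 + 1e-9))
    return min(1.0, total)


coding, nonsynonymous, all_somatic = 159, 106, 12_392
print(f"coding fraction: {coding / all_somatic:.4f}")
print(f"expected nonsynonymous at 2/3: {coding * 2 / 3:.0f}")
print(f"two-sided p-value: {two_sided_binomial_p(nonsynonymous, coding, 2 / 3):.2f}")
```

The observed count coincides with the expected count under the assumed null, so the test gives no indication of enrichment or depletion of nonsynonymous changes.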
Figure 3.4: Dinucleotide mutation rates for each patient. Plots are the same as Figure 3.3 C-E, except that here SNV classes are pooled for each patient. Rates are in units of “substitution per dinucleotide type”, and vary overall between patients because the number of mutations varies from case to case. In all cases, transitions in CpG dinucleotides have a much higher rate than all other mutations. (Panels: Patients 1-6. Radial axis ticks: 2x10^-6, 4x10^-6, 6x10^-6, 8x10^-6. Color key: C to T / G to A; A to G / T to C; A to T / T to A; C to A / G to T; C to G / G to C; A to C / T to G.)
3.3.6 Aneuploidies are the dominant evolutionary feature of
progression
The paucity of candidate driver mutations and overall random distribution of point
mutations in our cases suggest that other genomic events may be contributing to the
initial neoplastic phenotype and its progression to carcinoma. We therefore devised
a multistep strategy to identify chromosome arm-scale losses and gains in each pa-
tient, utilizing those germline variants for which the patients were heterozygous. Each
patient was heterozygous for between 1.56 and 1.74 million SNPs, ensuring substan-
tial statistical power to detect subchromosomal-sized aneuploidies and copy-number
variations.
We quantified, in each somatic sample separately, the fraction of reads that support
the allele with fewer reads (the lesser allele fraction, or LAF).
We then ordered the SNVs according to their position in the genome and identified
transition points where the LAF abruptly changes. In one case (Patient 5), the 20
large-scale copy-number variations, which are confined to this patient’s two IDC samples,
are suggestive of chromothripsis [55, 61, 81, 23, 58]. In the other five patients, we
identified a total of 46 large-scale copy-number variations, 43 of which involve whole
chromosomes or whole chromosome arms.
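The LAF computation and the search for abrupt changes can be sketched as follows. This is an illustration rather than the multistep strategy used in the study: the default window of 1000 SNVs with a step of 500 mirrors the plotting windows mentioned in the caption of Figure 3.5, while pooling reads within a window, the simple adjacent-window comparison, and the `min_jump` threshold are assumptions made here.

```python
from typing import List, Sequence, Tuple

ReadCounts = Tuple[int, int]  # (reference reads, alternate reads) at one SNP


def lesser_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the allele with fewer reads."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("site has no coverage")
    return min(ref_reads, alt_reads) / total


def windowed_laf(counts: Sequence[ReadCounts], window: int = 1000,
                 step: int = 500) -> List[float]:
    """Pooled LAF in overlapping windows of SNPs ordered by position."""
    if not counts:
        return []
    lafs = []
    for start in range(0, max(len(counts) - window, 0) + 1, step):
        chunk = counts[start:start + window]
        lesser = sum(min(ref, alt) for ref, alt in chunk)
        total = sum(ref + alt for ref, alt in chunk)
        lafs.append(lesser / total)
    return lafs


def transition_points(lafs: Sequence[float],
                      min_jump: float = 0.05) -> List[int]:
    """Indices of windows whose LAF differs sharply from the previous one."""
    return [i for i in range(1, len(lafs))
            if abs(lafs[i] - lafs[i - 1]) >= min_jump]
```

Aggregating over many heterozygous SNPs is what gives the metric its power: the per-site LAF is noisy at typical coverage, but a window of hundreds of sites averages that noise away so that a chromosome-arm-scale shift stands out.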
None of the normal breast and contralateral neoplastic samples, some of the ip-
silateral neoplasias, and all of the carcinomas exhibit aneuploidy. Four of the seven
IDCs exhibit evidence for the presence of a subclone population in which additional
chromosomes have undergone aneuploidy events.
In Patients 1, 2, and 6, aneuploidy events are shared among early neoplasias and
carcinomas. All aneuploidies that are present in the neoplasias are also present in the
carcinomas. Plotting the LAFs of all samples from a patient powerfully illustrates
both the chromosome scale of these events as well as the sharing of the same aneu-
ploidies among certain samples. In Patient 6, for example, the aneuploidies involving
chromosomes 1q, 6q, 8p, 17 and 22 are shared among both carcinomas and the EN
(Figure 3.5). The plot also reveals the aneuploidies of many other chromosomes that
are present in a subclone population that makes up about 30% of the IDC sam-
ple. Examination of the corresponding plots of all patients reveals the extraordinary
prevalence of aneuploidies in these cases.
Graphing the distribution of LAFs for each LAF-derived section of the genome
separately (usually a whole chromosome or arm) further supports the robustness of
LAF as a metric to identify aneuploidies (Figure 3.6A). However, a reduction of LAF
can be a result of ploidy gains as well as losses. Therefore, we calculated the actual
ploidy changes in a two-step process: first, we estimated the contribution of normal
cells to the sample using chromosome losses, and then we calculated the additional
CHAPTER 3. GENOME EVOLUTION OF BREAST CANCER 26
Figure 3.5: Lesser allele fraction plot of Patient 6. SNVs are arranged by their order in the genome, and LAF is plotted for each sample in windows of 1000 SNVs with 500 SNV overlap. Aneuploidies are visible as precipitous drops in the LAF, which are often shared between samples. Chromosome boundaries are indicated by short vertical lines. All samples are plotted and give highly consistent LAFs for chromosomes that are euploid.
number of chromosome copies for those chromosomes that exhibited increased ploidy.
We validated a subset of these calls using FISH (Figure 3.6B) and found all LAF-based
calls that we tested to be correct.
The distribution of aneuploidies across chromosomes among the six patients is
highly nonrandom (Figure 3.6C). Gain of chromosome 1q is by far the most common
event, with a total of 13 extra copies accumulated in these patients, not considering
the IDC subclones. All cases exhibit 1q gain, and it is the only event that is shared by
Figure 3.6: Aneuploidy summary. (A) LAF distributions for each chromosome across all patients and samples. In each sample-by-patient panel, the LAF distributions of all chromosomes are superimposed. In the absence of aneuploidy, the plot lines of all chromosomes are well-aligned, as is evident in the control plots and some EN plots. Control panels often contain plots from two samples (indicated) and so there are sometimes 46 lines superimposed, revealing the robustness of the LAF metric across samples and chromosomes. A chromosome’s plot line is gray when it does not deviate from the typical distribution. The line is colored when the chromosome’s LAF is skewed. Distinct colors are assigned to represent aneuploid regions that recur in different samples and patients. Colors are labeled in the panel in which they first appear. For Patient 6 please see Figure 3. (B) FISH of chromosome 1 in ENA of Patient 6. (C) Distribution of aneuploidies by patient, excluding those in IDC subclones. Each square denotes a unit gain (orange) or loss (blue). In Patients 2, 3, and 6, two phases of aneuploidies occurred, with those of the second phase not surrounded by a border. (Total) The total number of chromosomes lost (−) or gained (+) across all patients; (1st) the number during the first detected phase. Only recurrent events are listed. In Patient 5 (which exhibits hallmarks of chromothripsis), different pieces of chromosomes 1p and 19 underwent simultaneous losses or gains.
all three early neoplasias in which we could detect aneuploidy. In three cases (Patients
2, 3, and 6), the IDC underwent gains of 1q in addition to previous ones, increasing
1q ploidy to 6, 4, and 4, respectively. This suggests that the selective advantage
conferred by 1q gain increases with further gains of 1q during tumor evolution.
Like the shared SNVs, the shared aneuploidies support specific lineage relation-
ships among the samples of each patient. We therefore built lineage trees using the
somatic SNVs as phylogenetic markers, and then asked whether the shared aneuploi-
dies are consistent with these trees (Figure 3.7). All aneuploidies are unambiguously
and parsimoniously assigned to specific branches in the SNV-based lineage trees.
The order of aneuploidies during the evolution of each case is also unambiguous
and highly suggestive of a small number of aneuploidies being first drivers of the
neoplastic phenotype. In all cases, gain of 1q was among the events that occurred
first, including in the three cases in which genomic crises occurred in a common
ancestor of neoplasias and carcinomas (Patients 1, 2, and 6). Loss of 16q occurred
four times, and loss of 17 three times, as part of the first set of aneuploidies. Gain of
Figure 3.7: Genome evolutions of all patients (P1-P6). Vertical black lines are ancestral lineages whose lengths are proportional to the number of SNVs that occurred in each (except Patient 4, which is 50% shorter for fit). Cones represent tissue samples; cone width represents approximate amount of tissue; cone height is constrained at the top by the position of the last common ancestral cell of the sample, which is determined by the ancestral branch lengths, and on the bottom by the time of surgery, which is the same for all samples. The ratio of cone width to height is an approximation of the rate of cell division in each sample since the last common ancestral cell. Chromosome ploidy changes are indicated with the chromosome number; stand-alone numbers in italics indicate the number of chromosomes affected by subclone evolution (or putative chromothripsis in Patient 5). Thick branches are the earliest branches for which we are able to infer genomic events. Circles at the end of thick branches are ancestors with the colors denoting their inferred neoplasia-like, DCIS-like, or IDC-like phenotypes.
16p occurred three times. The remaining aneuploidies occurred once or twice in all
trees, and none were recurrent in the earliest stages of evolution.
In order to time the occurrence of aneuploidies relative to SNVs, we identified the
branch in the lineage tree of each patient where the first ploidy gains of chromosome
1q occurred and considered SNVs that occurred on this branch. AAF spectra of
SNVs that occurred before the ploidy gains and located on the chromatid that was
duplicated should be enriched for higher AAF in the progeny samples. In each of
the six patients, statistical tests rejected the null hypothesis that there are no such
SNVs (Fisher’s exact test, P-values ranging from $0.5 \times 10^{-2}$ to $0.8 \times 10^{-36}$). This
pattern is reproducible between different samples of the same case, and the SNVs
that exhibit high AAF largely overlap. The same pattern holds for the ploidy gain in
chromosome 16p, but due to fewer SNVs the statistical signal is less strong. Overall,
the AAF distributions of 1q SNVs are consistent, with some mutations occurring
before the ploidy gain, and some mutations occurring after the ploidy gain. This
suggests gradual accumulation of point mutations as a function of the number of cell
divisions, as opposed to mutational bursts.
Because the aneuploidies and SNVs independently support the lineage tree topolo-
gies, the genotypes and phenotypes of the common ancestors can be confidently in-
ferred in each case. The aforementioned mutated common ancestors of neoplasias and
carcinomas in Patients 1, 2, and 6 bore extensive aneuploidy, as did the mutated com-
mon ancestor of the DCIS and IDC in Patient 3. In all four cases, therefore, genomic
crises occurred in an ancestral cell or in consecutive daughter cells of the ancestral
cell lineage. The phenotypes of these ancestors likely included nuclear atypia and
increased rate of cell division, but no invasive capabilities. Their genomes were pre-
disposed to further genomic change, and as a result the subsequent lineages leading
to IDC accumulated numerous additional SNVs and aneuploidies.
3.4 Methods
3.4.1 Identification and processing of neoplasias
All patients except one had opted for mastectomies, and all of the available breast
tissue had been formalin-fixed, which allowed for the discovery of multiple sites of
neoplastic lesions in each case by examination of large sets of tissue sections. Neo-
plastic lesions were classified according to a standard set of criteria that included
nuclear morphology, cell shape, and tissue organization. Once a lesion was identified
and characterized, we estimated the extent of the neoplastic tissue by taking cores
and performing further sectioning and histology. We then dissected the material to
minimize the proportion of normal breast tissue in the final sample. Our goal was
to achieve 50% or more neoplastic or tumor content, but we could not rigorously
quantify this number until after sequencing had been performed.
3.4.2 Library construction and sequencing
DNA extraction from each dissected sample was performed using procedures optimized
for archival material. FFPE cores were cut into 20-µm slices. Paraffin was
dissolved in Xylene and removed (four repeats of 5 min incubation with rotation in
1 mL of Xylene and microcentrifugation for 3 min) and followed by washing with
ethanol (four repeats of 5 min incubation with rotation in 1 mL of ethanol and mi-
crocentrifugation for 3 min). Tissue was then lysed with Proteinase K and crosslinks
reversed by overnight incubation at 56 °C. After brief digestion with RNase A (Qiagen),
DNA was purified with a column-based method (Qiagen QIAamp DNA Mini
Kit). For each sample, one Illumina library was built with an average insert size of
between 300 and 400 bases, depending on the quality of the DNA. Half to 1 µg of
genomic DNA (depending on the availability of the material) was sheared to 400 bp
with Covaris S2, end-repaired, ligated to Illumina adapter, size selected, and amplified
with eight cycles of PCR to generate the final library. Standard Illumina 2 × 101
paired-end sequencing on the HiSeq2000 platform was performed such that the fi-
nal sequence coverage of confidently aligned reads was nearly 100 for each sample
in the first patient, and 50 for the samples of Patients 2-6. Analysis of the mapped
reads confirmed high library quality (very low duplicate read-pair fraction, almost
normally distributed fragment size, and highly uniform genome coverage) that was
indistinguishable from that of comparable libraries constructed from fresh DNA.
3.4.3 Read mapping and BAM file processing
Raw Illumina reads were uploaded to DNAnexus (https://dnanexus.com/) and aligned
to the human reference genome (UCSC build hg19) using the DNAnexus read map-
per, a hash-based probabilistic aligner that incorporates paired read information. We
used standard quality-control metrics, such as percent confidently mapped reads and
insert size distribution, to discard problematic Illumina lanes prior to subsequent
analysis. Successfully aligned reads from high-quality lanes were labeled using read
group tags and then merged into sample-level BAM files. Lane-level read group tags
improve the performance of downstream BAM processing and variant calling with
the Genome Analysis Toolkit (GATK) [60, 26].
We followed GATK’s best practices guidelines (v3) to perform sample-level BAM
processing using the Picard java utilities (http://picard.sourceforge.net/) and GATK
tools [60]. This protocol has three steps that are executed in the following order:
duplicate read marking, local realignment, and base quality score recalibration. We
used the Picard MarkDuplicates utility to mark duplicate reads based upon the read
position and orientation of read pairs. Marked duplicates were ignored in subsequent
processing and variant calling steps. GATK local realignment was performed with
standard parameters and the recommended known indel sets (Mills et al. and 1000
Genomes indels from the GATK v1.2 bundle) [62]. GATK base quality score recali-
bration was performed with the standard set of covariates. The realigned, recalibrated
BAM files produced by these processing steps were used for multisample SNV calling
and for all alignment-related statistics such as allele counts.
3.4.4 Multisample SNV calling
Multisample SNV calling was performed on processed, sample-level BAM files with
the GATK Unified Genotyper (DePristo et al. 2011). Multisample runs were grouped
by patient such that BAM files from di↵erent patients were run separately. Notable
parameters for the Unified Genotyper include standard call confidence of 50.0
(-stand_call_conf 50.0) and minimum base quality score of 20 (-mbq 20). To reduce
SNV false discovery rate, raw variant calls were filtered using GATK variant quality
score recalibration tools (VQSR) with the recommended training sets. The following
annotations were used for training: FS (strand bias), MQ (mapping quality), DP
(depth), HaplotypeScore, MQRankSum, and ReadPosRankSum. Replacing the rec-
ommended QD annotation (call quality divided by depth) with DP greatly improves
sensitivity for low-frequency somatic variants.
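As a rough illustration of the calling parameters named above, the following Python sketch assembles one per-patient Unified Genotyper command line. The jar path, reference, BAM, and output file names are hypothetical placeholders. The flags shown (-T, -R, -I, -o, -stand_call_conf, -mbq) follow the usual GATK invocation of that era. This is not the exact command used in this study, and the VQSR step is not shown.

```python
from typing import List


def unified_genotyper_cmd(reference: str, bams: List[str], out_vcf: str,
                          gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build an argument list for one multisample (per-patient) run.

    All file names are placeholders chosen for illustration only.
    """
    cmd = ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper", "-R", reference]
    for bam in bams:
        # One -I per processed, sample-level BAM of the same patient.
        cmd += ["-I", bam]
    # Parameters quoted in the text: call confidence 50.0, min base quality 20.
    cmd += ["-stand_call_conf", "50.0", "-mbq", "20", "-o", out_vcf]
    return cmd


cmd = unified_genotyper_cmd("hg19.fa", ["normal.bam", "en.bam", "idc.bam"],
                            "patient.vcf")
print(" ".join(cmd))
```

Grouping all BAM files of one patient into a single run is what makes the call multisample; BAM files from different patients would go into separate invocations.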
We used pass-filter SNVs to create a set of high-confidence germline calls and a
set of high-confidence somatic calls for each patient. For a given patient, we defined
germline SNVs as calls meeting the following multisample criteria: (1) depth 20 or
greater in every sample, where depth is defined as the sum of alternate and reference
base counts, and (2) non-reference GATK genotype (GT) in every sample. These
high-confidence germline calls were used for aneuploidy analyses (below). Somatic
SNVs were defined using a similar set of criteria: (1) depth 20 or greater in every
sample, (2) fewer than two reads supporting the alternate allele in at least one sample,
and (3) absence in dbSNP 132. We excluded SNVs in dbSNP 132 in order to reduce
the number of false-negative germline calls in our somatic SNV call set.
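The germline and somatic criteria above can be sketched as a small classifier. This is a minimal sketch and not our production code: the `SampleCall` record and its field names are hypothetical, the dbSNP membership flag is assumed to be computed elsewhere, and the order in which the two definitions are tested is an assumption for the rare SNV that could satisfy both.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampleCall:
    ref_count: int    # reads supporting the reference base
    alt_count: int    # reads supporting the alternate base
    non_ref_gt: bool  # True if the GATK genotype (GT) is non-reference


def classify_snv(calls: List[SampleCall], in_dbsnp: bool,
                 min_depth: int = 20) -> Optional[str]:
    """Return 'germline', 'somatic', or None for one pass-filter SNV."""
    # Criterion (1) for both sets: depth (ref + alt) of 20 or more everywhere.
    if any(c.ref_count + c.alt_count < min_depth for c in calls):
        return None
    # Germline: non-reference genotype in every sample.
    if all(c.non_ref_gt for c in calls):
        return "germline"
    # Somatic: fewer than two alternate reads in at least one sample,
    # and the position is absent from dbSNP.
    if not in_dbsnp and any(c.alt_count < 2 for c in calls):
        return "somatic"
    return None
```

For example, an SNV with counts (30 ref, 0 alt) in the control and (20 ref, 10 alt) in the IDC is labeled somatic only if it is absent from dbSNP.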
Three out of four Patient 2 genomic libraries were contaminated with mouse DNA,
with 15% of DCIS reads aligning to the mouse genome. Approximately 1% of reads
from Normal and 0.65% of reads from EN aligned to mouse; these fractions were significantly
above background levels for unaffected libraries. To remove contamination-related
mapping artifacts from our SNV data, we added additional filtering steps
to the SNV calling protocol for Patient 2. Prior to variant calling with the Unified
Genotyper, we eliminated all reads lacking confidently mapped mates. After variant
calling and VQSR, we removed all novel pass-filter SNVs positioned in areas of the
genome with significant homology with the mouse genome. Homology was assessed
by mapping tiled 75-mer reference sequences, surrounding each position of interest,
to the mouse genome (mm9). This second step dramatically reduced spurious calls
in DCIS while eliminating only 1% of the germline dbSNP positions used as controls.
3.4.5 Determination of somatic SNV class patterns and of
robust sharing classes
Multisample somatic SNV calls were further analyzed to determine patterns of SNV-
sharing across samples within the same patient. Although GATK provides sample
genotype calls based on genotype likelihood calculations, these calls lack sensitivity
when applied to cancer samples with substantial normal contamination or subclonal
tumor populations. To further enhance sensitivity of SNV detection beyond GATK
multisample calls, we applied a simple but sensitive metric to determine each sample’s
mutation status. At each somatic SNV position predicted by GATK in at least one
sample, we considered any sample with two or more reads supporting the alternate
allele to harbor the mutation (i.e., mutation present). Samples with fewer than two
reads supporting the alternate allele were labeled as reference (i.e., mutation absent).
Our rationale was that given that a specific SNV is detected in some samples, reads
supporting this SNV in other samples have a significant prior to be true rather than
sequencing errors. We call this criterion “evidence of presence” of an SNV in a given
sample. These patterns of mutation presence and absence define mutation classes
for lineage construction and other somatic SNV analyses. We note that a small but
important number of SNVs were reallocated by this method from candidate somatic
SNVs with inconsistent patterns of sharing among samples to germline events, and
that very few single-sample (“private”) SNVs were reallocated to sharing classes,
underscoring the high sequence and alignment quality of our datasets.
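The evidence-of-presence rule reduces to a threshold on alternate-allele read counts. A minimal sketch, assuming the per-sample alternate counts at a GATK-predicted somatic position are already available in a fixed sample order (the function and parameter names are illustrative):

```python
from typing import Sequence


def class_pattern(alt_counts: Sequence[int], min_alt_reads: int = 2) -> str:
    """Encode mutation presence (1) or absence (0) per sample as a binary string.

    A sample is "present" when it has two or more alternate-allele reads.
    """
    return "".join("1" if count >= min_alt_reads else "0" for count in alt_counts)


pattern = class_pattern([0, 14, 1, 2, 9])
print(pattern)  # 01011
```

In this example the third sample, with a single supporting read, is labeled as reference, while the fourth sample, with two reads, is labeled as carrying the mutation.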
A case with n samples has $2^n$ possible class patterns. For example, for a case with
five samples, the patterns are 00000 to 11111. No case has the 00000 class, because an
SNV has to be present in at least one sample, and the 11111 class is that of germline
variants. Classes that are private to one sample are 10000, 01000, 00100, 00010, and
00001. Candidate classes that are possibly phylogenetically informative are defined
by SNVs that are present in two or more, but not all, samples. To identify the subset
of robust phylogenetically informative classes, we applied the following steps:
(1) Eliminate classes with the SNV present in the lymph sample (applicable to Patients
1, 4, 5, and 6). These classes consisted of lymph-only SNVs (presumably
somatic mutations in the lymph sample) and germline SNVs, where one or very few
samples were missing the alternate allele presumably due to sampling variance.
(2) Retain the classes that, when ranked in decreasing order of the number of SNVs
present within them, together contain 95% of all candidate somatic SNVs. This
eliminated all spurious classes that were not supported by an overall substantial
number of SNVs, most of which were missing from just one sample, presumably due
to sampling variance.
(3) Eliminate classes with a large fraction of SNVs whose mutation-absent samples
exhibit one alternate-allele supporting read, suggestive of systematic false-negative
calls. This also constituted a small number of classes with SNVs whose alternate
alleles were missing from just one sample presumably due to sampling variance.
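The first two filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure: the 95% is taken over the candidate SNVs that remain after the lymph filter, ties between equally sized classes are broken by pattern string, and the third step is omitted because the text gives no numeric threshold for "a large fraction".

```python
from collections import Counter
from typing import Dict, List, Optional


def robust_classes(patterns: List[str], lymph_index: Optional[int] = None,
                   retain: float = 0.95) -> List[str]:
    """Return the class patterns kept by the lymph filter and the 95% rule.

    `patterns` holds one binary presence/absence string per somatic SNV.
    """
    candidates: Dict[str, int] = {}
    for pattern, n_snvs in Counter(patterns).items():
        ones = pattern.count("1")
        if ones < 2 or ones == len(pattern):
            continue  # private or all-sample classes are not candidates
        if lymph_index is not None and pattern[lymph_index] == "1":
            continue  # step 1: SNV present in the lymph sample
        candidates[pattern] = n_snvs

    total = sum(candidates.values())
    kept: List[str] = []
    running = 0
    # Step 2: add classes from largest to smallest until the kept classes
    # together contain the requested fraction of candidate SNVs.
    for pattern, n_snvs in sorted(candidates.items(),
                                  key=lambda kv: (-kv[1], kv[0])):
        if running >= retain * total:
            break
        kept.append(pattern)
        running += n_snvs
    return kept
```

With 60 SNVs in class 00110, 36 in 00011, and 2 each in 01010 and 00101, the first two classes already hold 96% of the candidates, so the two small classes are dropped.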
3.4.6 PCR-based validation of SNVs and accuracy assess-
ment of whole-genome calls
Validation Design
We designed primers to target a random subset of SNVs within each sample-specific
and phylogenetic class for validation, using target-specific PCR amplification followed
by sequencing. We focused on Patients 2 and 6 because their lesions have the great-
est phylogenetic complexity (Figure 3.7) and therefore constitute the most stringent
test of the main results of our study. 192 and 196 primer sets were designed for
Patients 2 and 6, respectively, such that each SNV to be validated was within ap-
proximately 40 bases of the sequence start site. Primer design was optimized for
multiplexing. Primers contained Illumina linker sequences to facilitate sequencing.
The initial target-specific multiplex PCR was performed with slow-annealing. A sec-
ond PCR using Illumina-compatible primers added barcodes and yielded preparative
amounts of material. All barcoded samples from a single patient were combined into
a single lane of HiSeq2000 for sequencing. For Patient 2, 192 of 192 targets success-
fully generated enough reads to support validation, with a mean coverage (number of
reads per target per sample) of almost 190,000. For Patient 6, 195 of 196 targets were
successful, with a mean coverage of just over 43,000. Amplification and sequencing
were performed with all targets (each pool containing all 192 or 196 patient-specific PCR
primer pairs) on all samples (which were amplified separately) of each patient. This
design supported two levels of validation for both patients, which we denote A and
B. Two more types of validation, C and D, were possible in Patient 6. For a visual
representation of the results from Patient 6 please see Figure 3.8.
Validation A
Validation A is the simplest of the four approaches. It asks whether the validation
PCR/sequencing supports the initial SNV call at all, i.e., whether the alternate allele
is detectable well above background in at least one sample.
A for Patient 2 is 192/192 = 100%.
A for Patient 6 is 180/195 = 92%.
12 of the false positives of Patient 6 are SNVs that had initially been called as private
to the ENA sample. Excluding the ENA-only calls, the validation rate improves to
172/175 = 98%. In this context, we note that SNVs that are present in the ENA and
also in another sample have a much better validation rate than those present in the
ENA alone, due to the additional signal provided by the other samples. We conclude
that our initial SNV calls had a high degree of specificity. SNVs present in more than
one sample, which comprise the classes that are most important for our study, have
an almost perfect validation rate.
Validation B
Validation B addresses sample-specificity and whether the assignment of an SNV to
a specific class, especially to a phylogenetically informative class, was correct. The
most stringent metric is to ask what fraction of SNVs are validated to be present in
precisely the same set of samples as the initial assignment based on the whole genome
sequence, and to count each SNV with a misassignment as incorrect. It uses those SNVs
that were validated to be present (validation A).
B for Patient 2 is 180/192 = 94%.
11 of the 12 miscalls involve an SNV that was initially called as IDC-only, but is
in fact an SNV shared between the EN and the IDC. Recall that this class had low
alternate allele frequencies in the EN, so these were simply missed in the genome-wide
data due to their very low frequency.
B for Patient 6 is 176/180 = 98%.
In summary, our class assignments are highly accurate and the small amount of error does not
affect the study’s results or conclusions in any way.
Figure 3.8: Alternate allele frequencies in each tested private class (green binary code) or phylogenetically informative class (magenta binary code) of somatic SNVs of Patient 6. Code denotes the class of SNV as determined by the whole genome sequence analysis. Starred samples were not present in the whole genome analysis. Each SNV corresponds to a bar whose position is repeated stereotypically for each sample. SNVs are sorted by position in the genome. N denotes the number of SNVs (and therefore the number of bars per sample) in each class. Presence/absence patterns from the validation experiment are visible as clustered bars, denoting consistently high alternate allele frequencies of SNVs in the sample. High validation rate and therefore concordance in class assignments is visible as correspondence between the whole-genome-derived presence/absence code and the blocks of above-background alternate allele frequencies. Classes shown (binary code, N): 010000 (20), 001000 (20), 000100 (25), 000010 (21), 000001 (19), 001001 (30), 001100 (34), 001101 (20). Axes: alternate allele frequency (0.1 to 0.5) by sample.
Validation C
In Patient 6, we were able to go back to the archival tissue and recover additional
(separate) IDC material as well as a sample of normal tissue. In what might be called
a “biological” validation, we can therefore ask what fraction of SNVs present in the
original IDC are also present in the two new IDC samples. Class IDC-only: 18 of the
19 IDC-only SNVs tested also appear in both new IDC samples. The one SNV that
is not present in the new IDC samples has the lowest alternate allele frequency in the
original IDC sample, indicating that it marks a subclone not present outside of this
sample. Phylogenetically informative classes that include IDC: 50 out of 50 SNVs
were present in the new IDC samples. Thus, this validation shows that mutations we
find in a single IDC isolate are fully supported by their presence in independent IDC
isolates, and that our false-positive rate for this class is e↵ectively zero.
Validation D
The addition of a sample of normal tissue from the ipsilateral breast in close proximity
to the other lesions allowed us to ask whether any SNVs we targeted would give a
false-positive signal in the validation. The seven SNVs we tested that were shared
among all ipsilateral samples were also positive in this normal sample, as expected
for SNVs that arose early in breast development; none of the remaining SNVs that
were private to one sample or comprised the phylogenetically informative classes
(N=188) had signal above background in the normal sample, again underscoring
superior specificity of our somatic SNV calls.
Validation of the “Evidence-of-Presence” criterion
The validation data also allowed us to examine whether the reassignment of SNVs
according to our evidence-of-presence criterion improved accuracy over GATK multi-
sample calling. As we describe in the manuscript, we first perform the standard
GATK multi-sample SNV calling to identify the set of somatic SNVs in a patient.
GATK results include class membership, i.e., in which sample the alternate allele
of the SNV is present. But we adjust this class membership using our evidence-of-
presence criterion, which asks whether there is evidence for the alternate allele of an
SNV in a sample where GATK did not call it. The logic is as follows: Assume that an
SNV is called by GATK in sample A of a given patient, but not in sample B. Assume
that in sample B, there are two (or more) reads supporting that SNV. (This situa-
tion is common with GATK.) Due to its presence in sample A, the SNV has a high
prior probability of being a true somatic SNV in sample B, rather than resulting from
coincidence of sequencing errors in the two or more reads supporting it. Recall that
typically fewer than 1000 somatic SNVs are called per sample; this is several orders of
magnitude fewer positions than the entire genome, and therefore it is possible to use a
more sensitive criterion for detection of SNVs in these positions than for de novo dis-
covery in the entire genome, without increasing the false positive rate substantially.
The validation data show that application of the evidence-of-presence criterion in-
deed improves call accuracy over the GATK class assignments: In Patient 2, 17 SNVs
within our validation set had been reassigned according to evidence-of-presence. In
14 out of these 17 cases, the reassignment detected the mutations in samples that
were validated, thus improving over GATK calls; in 3 cases the reassignment cre-
ated a false positive, i.e., detecting an SNV in a sample which was not supported
by our validation. Similarly, in Patient 6, of the 14 SNVs within our validation set
that were reassigned according to evidence-of-presence, 11 were correctly reassigned;
in 3 cases evidence-of-presence called an SNV in one additional, false-positive
sample. In summary, we concluded that evidence-of-presence significantly improved class
assignments over GATK.
Conclusion
The results from the four approaches to validation reveal the robustness of the
genome-wide data, particularly of the phylogenetically informative classes, which
form a cornerstone of our study. The results from the assessment of the evidence-of-
presence criterion versus original GATK calls underscore the power of multisample
calling and the technical robustness of our analytic approaches.
3.4.7 Aneuploidy and tumor purity
To identify aneuploidies we selected a subset of the germline SNVs identified by
GATK. These “sgSNVs” were defined, separately for each patient, as a patient’s
multisample germline SNVs that had dbSNP132 entries, were heterozygous, and had
minor allele frequencies in the control sample of at least 0.25. We define the “lesser
allele” as the one supported by fewer reads than the other allele (which is the “preva-
lent allele”). Three metrics were calculated for each SNV: the lesser allele coverage,
the prevalent allele coverage, and the lesser allele fraction (LAF). The LAF was used
to identify aneuploidies, whose “sign” (loss or gain) was then set by the two coverage
metrics.
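The three per-SNV metrics can be computed directly from the two allele read counts. A minimal sketch; the type and function names are illustrative and not taken from our pipeline:

```python
from typing import NamedTuple


class AlleleMetrics(NamedTuple):
    lesser_coverage: int     # reads supporting the allele with fewer reads
    prevalent_coverage: int  # reads supporting the other allele
    laf: float               # lesser allele fraction


def allele_metrics(ref_count: int, alt_count: int) -> AlleleMetrics:
    """Compute lesser/prevalent allele coverage and the LAF for one sgSNV."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("SNV has no covering reads")
    lesser = min(ref_count, alt_count)
    prevalent = max(ref_count, alt_count)
    return AlleleMetrics(lesser, prevalent, lesser / total)
```

By construction the LAF lies between 0 and 0.5: a heterozygous SNV on a euploid chromosome gives a value somewhat below 0.5 (because the lesser allele is chosen after sampling), and both losses and gains push it toward 0.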
In all patients except 5, the vast majority of chromosomal copy-number transitions
coincided with the centromere, or the whole chromosome was involved. Fine mapping
of the transition points was therefore not usually necessary. In the handful of cases
where a transition point did not coincide with a centromere, we found the window of
the plot at which the event either started or ended (window i). As discussed in Figure
3.5, each window spans 1000 SNVs, with an overlap of 500 SNVs between adjacent
windows. We then plotted the frequency of the heterozygous variants in the three
relevant windows (i − 1, i, i + 1, totaling 2000 variants) in that sample. The variant
at which the frequency shifted was easily detected by eye, and it was not necessary
to deploy segmentation methods. The resolution of this analysis is low (determined
by what can be seen by eye on the plots) and we did not attempt to identify events
that involved regions smaller than about a third of a chromosome arm. We also note
that we did not attempt to identify structural rearrangements that do not result in
copy-number changes, such as inversions.
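The windowing used for the LAF plots (1000 SNVs per window, 500 SNV overlap) can be sketched as below. The use of the window mean as the plotted summary is an assumption, since the text does not state which statistic was drawn; the function name is illustrative.

```python
from typing import List, Sequence


def windowed_laf(lafs: Sequence[float], size: int = 1000,
                 step: int = 500) -> List[float]:
    """Mean LAF in overlapping windows over SNVs sorted by genomic position."""
    if not lafs:
        return []
    means = []
    # Window i covers SNVs [i * step, i * step + size); adjacent windows
    # therefore share size - step SNVs.
    for start in range(0, max(len(lafs) - size, 0) + 1, step):
        window = lafs[start:start + size]
        means.append(sum(window) / len(window))
    return means


# Toy example: 2000 euploid SNVs followed by 1000 SNVs with reduced LAF.
profile = windowed_laf([0.45] * 2000 + [0.30] * 1000)
```

In the toy example the fourth window straddles the transition and takes an intermediate value, which is why three adjacent windows (2000 variants in total) need to be inspected to locate the variant at which the frequency shifts.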
The identified loss of heterozygosity (LOH) chromosomes were then used to esti-
mate the fraction of the sample that is due to normal cells (lymphocytes, myocytes,
etc.), as follows: All cancer cells contribute zero copies of an allele that was lost due
to LOH, and the normal cells contribute one copy of the LOH allele times the contam-
ination fraction n. Note that in all of our patients, the control samples were free of
LOH chromosomes (Figure 3.6A). The LOH allele is almost always the one with fewer
reads. Therefore, the LAF l should, on average, be equal to the lost-chromosome frac-
tion that is contributed by the normal contamination. Some arithmetic shows that
$n = \frac{l}{1-l}$. Once n was estimated from l, the exact ploidy p for those chromosomes that
had gains was calculated according to the formula $p = \frac{1 - 2nl}{l(1-n)}$.
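The two formulas can be checked with a short worked example. The sketch below assumes that tumor cells retain exactly one copy of a chromosome that underwent loss, and that a gain adds copies of only one allele, so the lesser allele stays at one copy per cell. Under these assumptions the LAF on a lost chromosome is l = n / (1 + n), which gives n = l / (1 − l), and the LAF on a gained chromosome is l = 1 / (2n + p(1 − n)), which gives p = (1 − 2nl) / (l(1 − n)). Function names are illustrative.

```python
def normal_fraction(l_loh: float) -> float:
    """n = l / (1 - l), from the LAF l of a chromosome that lost one copy."""
    return l_loh / (1.0 - l_loh)


def gained_ploidy(l_gain: float, n: float) -> float:
    """p = (1 - 2 n l) / (l (1 - n)) for a chromosome with a copy gain."""
    return (1.0 - 2.0 * n * l_gain) / (l_gain * (1.0 - n))


# Round trip: a sample with 25% normal cells and a chromosome at ploidy 3.
n_true, p_true = 0.25, 3.0
l_loh = n_true / (1.0 + n_true)                          # expected LAF on an LOH chromosome
l_gain = 1.0 / (2.0 * n_true + p_true * (1.0 - n_true))  # expected LAF on the gained chromosome

n_est = normal_fraction(l_loh)
p_est = gained_ploidy(l_gain, n_est)
print(round(n_est, 3), round(p_est, 3))
```

In practice l would be the average LAF over all sgSNVs of the chromosome, so the estimates carry sampling noise that this sketch ignores.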
Sequence-based n’s roughly matched estimates of n by histology. The histology-
based estimates are necessarily an approximation because they are based on limited
sampling, by sectioning of the tissue core mass from which DNA is obtained.
3.4.8 SNV mutation spectra
Mutation spectra for patient samples were aggregated in two ways: (1) combined
across patients to form three “superclasses” of SNVs based on lesion class (private
in early neoplasias, private in carcinomas, and shared between neoplasias and carci-
nomas); (2) combined within each patient, ignoring lesion class, to form six groups.
Complementary mutations were pooled, reducing the number of possible mononu-
cleotide mutations from 12 to 6, and the number of single-base substitution classes
in dinucleotides from 16 × 6 = 96 to 10 × 6 = 60.
Mononucleotide mutation spectra were simply estimated from the frequency of
the mutation type (Figure 3.3, cf. A and B, where the bars of each color add up to
1). For dinucleotides, we calculated rates by dividing the number of events of each of
the 60 changes by the genome-wide count of the dinucleotide that was mutated.
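The pooling of complementary mutations can be illustrated with a few lines of Python. Representing each pooled class by the substitution whose reference base is a pyrimidine, and each dinucleotide by the alphabetically smaller of itself and its reverse complement, are conventions assumed here for illustration; the text does not state which representative was used.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pooled_substitution(ref: str, alt: str) -> str:
    """Pool a substitution with its complement so the reference base is C or T."""
    if ref in "AG":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def canonical_dinucleotide(dinuc: str) -> str:
    """Represent a dinucleotide and its reverse complement by one label."""
    revcomp = "".join(COMPLEMENT[base] for base in reversed(dinuc))
    return min(dinuc, revcomp)


mono_classes = {pooled_substitution(r, a)
                for r, a in product("ACGT", repeat=2) if r != a}
dinucleotides = {canonical_dinucleotide(x + y)
                 for x, y in product("ACGT", repeat=2)}
print(len(mono_classes), len(dinucleotides))  # 6 10
```

The 16 dinucleotides collapse to 10 because four of them (AT, TA, CG, GC) are their own reverse complement and the remaining 12 form 6 pairs.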
3.4.9 Tree inference
Tree topology was defined by the phylogenetically informative SNV classes. The data
are unambiguous and we therefore used parsimony to establish which samples shared
common ancestors in which configuration. Once the SNV-based trees were built,
aneuploidy events could be mapped onto them, and again the data were unambiguous.
Even successive gains of ploidy of the same chromosome, most prominently among
them 1q (e.g., Fig. 3.7F), could be ordered without conflicts.
3.4.10 Ordering SNVs vs. chromosome 1q ploidy gain in
ancestral branches
We devised a statistical test to ask whether some SNVs occurred before copy gain
in aneuploidy regions. For each patient, we identified the branch in the lineage tree
responsible for the first copy-number changes in chromosome 1q, which consistently
represents the earliest aneuploidy event in our patients. We then analyzed the AAF
spectra of SNVs occurring in that branch. The test below is based on the idea that
SNVs that occur on a 1q chromatid prior to gain of a copy of that chromatid should
have higher AAF than SNVs occurring on a 1q chromosome after copy gain.
We used SNVs on all diploid chromosomes on the same branch as our control
set. Sequence coverage is scaled with respect to the aneuploidy and controls for
contamination of the sample by normal cells (lymphocytes, etc.):
\[ \text{scaled coverage} = \text{coverage} \times \left( \frac{p \times (1 - n)}{2} + n \right) \]
where p is the estimated ploidy and n is the estimated normal contamination. In
order to find outliers indicative of events prior to copy gain, we calculated a Z-score.
SNVs whose AAFs had a Z-score > 3 were labeled as “high” and SNVs falling below
the threshold were labeled as “low”. For each patient, we used Fisher’s exact test to
compare the distribution of SNV labels in the control chromosomes vs. 1q. In each
of the patients, we reject the null hypothesis that the 1q distribution is equal to or
less extreme than the control distribution.
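The coverage scaling and the final comparison can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, the Z-score step is taken as already done (each SNV arrives labeled "high" or "low"), and a one-sided test is assumed because the null hypothesis above is phrased as "equal to or less extreme".

```python
from math import comb

def scaled_coverage(coverage, ploidy, normal_fraction):
    """Scale coverage by ploidy p and normal contamination n:
    coverage * (p * (1 - n) / 2 + n)."""
    return coverage * (ploidy * (1.0 - normal_fraction) / 2.0 + normal_fraction)

def fisher_exact_greater(high_1q, low_1q, high_ctrl, low_ctrl):
    """One-sided Fisher's exact test on a 2x2 table of SNV labels.

    Returns the probability, under the hypergeometric null, of seeing at
    least `high_1q` "high" SNVs on 1q given the table margins.
    """
    n_1q = high_1q + low_1q
    n_high = high_1q + high_ctrl
    total = n_1q + high_ctrl + low_ctrl
    denominator = comb(total, n_high)
    p_value = 0.0
    for k in range(high_1q, min(n_1q, n_high) + 1):
        # k "high" SNVs fall on 1q, the remaining n_high - k on controls
        p_value += comb(n_1q, k) * comb(total - n_1q, n_high - k) / denominator
    return p_value

# Hypothetical counts, for illustration only
p = fisher_exact_greater(high_1q=3, low_1q=1, high_ctrl=1, low_ctrl=3)
```

A diploid region leaves the coverage unchanged for any contamination level, which is a quick sanity check on the scaling term.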
3.5 Discussion
Evolutionary studies of cancer have so far focused on the inference of clonal evolution
within the cancer (e.g., [67]) or analyses of the relationship of metastases with the
primary tumor (e.g., [64]). Here we addressed a different perspective, namely that
of the early origins of the cancer phenotype. These three approaches can be thought
of as mimicking progression, at least as far as solid tumors are concerned: Studies of
metastatic evolution are about the terminal stages of the cancer; studies of within-
cancer subclone diversity are about the Darwinian process of faster versus slower
growing cell populations and the evolution of the primary tumor mass; and studies
of early neoplasias and their relationships to the diagnostic tumors are about early
origins of cancer.
Our understanding of these early origins will be greatly enhanced by molecular
evolutionary analyses similar to those that have advanced our understanding of organ-
ismal evolution. Cells within concurrent lesions are analogous to extant organisms:
they are related to one another by bifurcating lineage trees and have accumulated
genomic changes over the course of evolution. In our study of multiple lesions in six
cases of ductal breast carcinoma, we found that the genomes of ancestors of some early
neoplasias and carcinomas were already aneuploid and harbored a modest number of
point mutations. By comparing mutational spectra of somatic SNVs across patients
and samples we inferred that somatic SNVs accumulated gradually as a result of a
large number of ancestral cell divisions and not during saltatory mutational crises.
In two cases, the carcinoma phenotype originated twice independently from an an-
cestral neoplastic phenotype, suggesting a substantial predisposition of the ancestor
to generate cancerous progeny.
All of the neoplasias with aneuploidies shared common cellular ancestors with the
carcinomas; in all of these cases the neoplasia and carcinoma shared these aneuploidies
as well as somatic SNVs. In contrast, none of the neoplasias that were devoid of
aneuploidies (all contralateral ENs and five ipsilateral ENs) were closely related to a
carcinoma. Among the aneuploidies, gain of chromosome 1q was most dramatically
recurrent, which is consistent with its prevalence among late-stage breast cancers (cf.
Fig. 4 in [24]). 1q harbors more than a thousand genes, and while the increased dosage
alone is not sufficient for a carcinoma phenotype (some of our neoplastic samples carry
the increased 1q ploidy), it is likely to be predisposing to further genomic change.
Initially, such change may be catalyzed primarily by an increased rate of cell division,
as the mutation spectrum of the early neoplasias is indistinguishable from that of the
IDCs in every patient examined. Additional aneuploidies accumulate, however, and
at some point a combination of dosage imbalances and mutational load, and perhaps
epigenetic or stromal changes as well, results in an invasive carcinoma phenotype.
We anticipate that the evolution of a diverse set of breast and other cancers
will soon be studied similarly and with complementary approaches [74, 64, 36, 67,
75]. Current practice in clinical diagnosis of cancer facilitates studies on archival
material because of the low cost and superior quality of histopathological examination
of formalin-fixed, paraffin-embedded samples. We show that high-quality, large-scale
genome sequence can be obtained from archival material, and show by validation that
the data from such material can be highly robust. Evolutionary inference based on
many samples of such material opens a new dimension for analysis of cancer origins
and progression. In the future, phylogenetic analysis of carcinomas and concurrent
lesions will suggest drugs that attack both carcinoma and early lesions by targeting
genomic changes common to all lesions, removing not only the carcinoma, but also
the reservoir of related cells from which a carcinoma might recur.
Chapter 4
Haplotype reconstruction of
somatic genomes using long reads
4.1 Introduction
As discussed in section 2.3, haplotype inference has a broad spectrum of applications.
Despite the recognition of the central role of haplotype information in diagnostic and
prognostic studies, haplotype assembly of cancer genomes was not feasible until very
recently. This impediment was mainly due to the limitations of genetic and statistical
phasing methods in phasing de novo and somatic variants. Currently, single-molecule
sequencing techniques do not provide a viable solution either, due to their high costs
and error rates. Fortunately, recent advances in experimental technologies are opening
a new avenue for haplotype reconstruction of somatic genomes. In 2013, the phased
genome and epigenome of the HeLa cancer cell line were published [2], made possible
only by sequencing pools of fosmid clones, in addition to shotgun and mate-pair
sequencing in an exhaustive analysis. The findings highlighted the importance of
haplotype information in providing a complete profile for cancer genomes. How-
ever, the laborious and time-consuming protocols for preparation of fosmid libraries
restricted the wide use of this technology in cancer genomics studies. Recent
technologies such as CPT-seq [3] and Moleculo [48] reconstruct long DNA fragments
by utilizing well-established and highly accurate short read sequencing. To achieve
more accurate reconstructed long fragments, they use specialized DNA partitioning
techniques, along with PCR amplification protocols. The Moleculo method conducts
sequencing of sub-haploid (~10 Kbp) DNA fragments by performing in vitro dilution
of fragments into several hundred wells. These molecules are then sheared into
smaller fragments, and assigned barcodes that are unique per well. Small fragments
from all wells are then pooled and sequenced with Illumina sequencing technology.
The number of fragments within a well is set such that each well covers only a small
fraction (1-2%) of the haploid genome. As a result, when demultiplexed reads from
individual wells are mapped to the reference genome, shotgun reads originating from
single long DNA fragments form islands of reads such that each island represents one
long molecule. We denote these islands by read clouds. In the Moleculo protocol,
the sequencing coverage of a genome is determined by two parameters: the coverage
of DNA fragments with short reads, which we denote by CR, and the coverage of a
genome with long fragments (or read clouds), which we denote by CF (Figure 4.1).
Figure 4.1: Moleculo read clouds. Sequence reads from each well are mapped to the reference genome separately. Clusters of sequence reads originated from single DNA molecules are identified and labeled as read clouds. Read clouds are circled in this figure. CR represents the coverage of DNA fragments with short reads. CF represents the coverage of genome with read clouds.
In this chapter, I describe a multifaceted approach for haplotype reconstruction
of a solid tumor that is sequenced using the Moleculo protocol. Our approach ex-
ploits long-range information from resulting read clouds, linkage information across
variants, and cancer-specific aneuploidies, all in one package, to produce long and
highly accurate haplotype blocks. We successfully applied our methodology to phase
the genome of an invasive ductal carcinoma for the first time, and achieved haplotype
blocks with N50 scaffold size of 27.7 Mbp. We confirmed the accuracy of inferred
haplotypes to be over 99.9%, by computationally validating the inferred haplotypes
using existing linkage disequilibrium patterns between variants and also aneuploidy
information.
4.2 Results
4.2.1 Dataset and SNV detection pipeline
In this study, we selected an invasive ductal carcinoma (IDC) lesion from a grade 3,
ERBB2 amplified, estrogen receptor (ER) negative breast cancer patient. A sample
from normal breast tissue of the same patient was also obtained and sequenced to
serve as a control. We sequenced eight Moleculo libraries of the IDC sample, with an
average per base CR = 1.8x, CF = 44x, and an average fragment length of 9Kbp. To
obtain a more robust set of variants, we also performed four lanes of whole-genome
shotgun sequencing on both normal and IDC samples.
Single nucleotide variants (SNVs) were called by running the GATK multi-sample
genotyper on shotgun reads from normal and IDC samples as well as on
pooled sequence reads from Moleculo data. SNVs that were called in both sets of
IDC libraries (shotgun and Moleculo) and the control sample were considered as
germline variants. Those that were absent in the control sample were considered
as somatic variants. We identified 3,107,350 germline SNVs, of which 1,869,281 were
heterozygous and therefore informative for phasing. About 91% of candidate germline
SNVs were reported in the dbSNP database (dbSNP 138). On average, heterozygous
SNVs were spanned by 41 read clouds, 30 of which covered the variant with short reads.
In total, 3530 somatic variants were called in both sets of IDC libraries (shotgun and
Moleculo) and were absent in the normal sample.
4.2.2 Overview of the framework
In this work, we present a comprehensive framework for haplotype reconstruction of
a somatic genome. Our framework consists of three major steps each leveraging a
unique aspect of the data. It exploits the haploid nature of Moleculo read clouds,
linkage disequilibrium (LD) information between heterozygous variants, and cancer-
specific aneuploidies to infer an accurate and long-range phase between germline and
somatic SNVs in a cancer sample. At the end of each step, we computationally
validate the inferred phase by utilizing features of the data that are not yet employed
in the inference procedure up to that step.
1. Local phase
First, we determine the phase between germline and somatic SNVs by opti-
mizing the assignment of overlapping read clouds to two parental haplotypes
and up to two somatic haplotypes. Parental haplotypes correspond to the two
copies of the genome inherited from parents and provide the phase informa-
tion between germline SNVs. Each somatic haplotype is linked to one of the
parental haplotypes, and indicates which somatic mutations occurred on that
copy of the chromosome in the cancer cells. In our approach, we first infer the
phase between germline SNVs by finding the best assignment of all read clouds
to the two parental haplotypes (Figure 4.2B). We then proceed and reassign
read clouds to somatic haplotypes at the sites of somatic alterations to infer
which parental chromosome was altered by a somatic variant (Figure 4.2C).
To best model the specific properties of Moleculo data, we developed a cus-
tomized probabilistic inference model to phase the variants. The model uses a
Monte Carlo simulated annealing procedure to find the optimal assignment of
read clouds to parental haplotypes.

Figure 4.2: Reconstructing parental and somatic haplotypes in the local phase step. A. Read clouds are depicted as linear segments with reference and alternate alleles shown whenever they are covered by at least one shotgun read within the cloud. Black circles represent positions where a cloud has reads covering both reference and alternate alleles, most likely due to sequencing errors. We use individual reads in our analysis; this simplification is only for visualization purposes. B. Overlapped read clouds at heterozygous variants are assigned to two parental haplotypes (h1, h2). C. Two somatic haplotypes are shown at the top and the bottom. Read clouds covering the alternate allele at a somatic variant are moved to somatic haplotypes.
Legend: germline SNV, somatic SNV, reference allele, variant allele, mixed alleles.

The states of this model are all possible haplotype assignments of read clouds.
The method starts with a random initialization of all read clouds to two parental
haplotypes, and transitions within
its state space by choosing a move from the following three moves at random:
cloud reassignment, cloud split, and switch unwinding (Figure 4.3). Each move
is introduced for a specific purpose detailed in the methods section. The cloud
split move, in particular, is introduced to identify and address chimeric read
clouds, which result from fragment collisions within a Moleculo well. To reduce
the effect of sequencing errors in resulting haplotypes, we incorporated mapping
and base-call qualities of individual sequence read members of a read cloud into
the calculation of the likelihood score at each state of the model. The proposed
state after each move is immediately accepted if it achieves a higher likelihood
score. However, if the new likelihood score is lower than the previous score, it
is accepted probabilistically.
Since the density of heterozygous SNPs varies considerably along the genome,
the genomic distance between consecutive variants can be larger than the length
of input DNA fragments. As a result, the local phasing step produces thousands
of haplotype blocks of various lengths with their relative phase remaining
unknown. To consolidate these local haplotype blocks into longer chromosome-
spanning haplotypes, we utilized two sources of information: linkage disequilib-
rium between SNPs, and unbalanced allelic ratios in large somatic CNV regions.
2. Statistical phase
Linkage disequilibrium (LD) information between genomic variants can provide
valuable information about their phase. In the extreme case of complete LD
between two markers, where two alleles are always found together in a pop-
ulation, the phase of the two variants can be immediately inferred. Linkage
disequilibrium information between markers is widely used in statistical phas-
ing methods. However, statistical haplotype phasing approaches are limited to
phasing variants that are prevalent in a population. These methods cannot be
applied to de novo (or rare) germline variants and somatic mutations. In this
step, we overcome this limitation by applying statistical phasing on already
phased local blocks. In this approach, the phase between somatic and de novo
variants and their neighboring heterozygous SNPs is first inferred in the local
phasing step. LD information between heterozygous SNPs is only then leveraged
to join haplotype blocks wherever possible to produce larger haplotype blocks.

Figure 4.3: Probabilistic inference model for phasing germline variants. Read clouds are assigned randomly to two haplotypes. The inference model learns the true haplotypes by reassigning read clouds to the two haplotypes using three different moves: A. Cloud reassignment. A read cloud is randomly selected and reassigned to the other haplotype. B. Switch unwinding. A genomic position is randomly selected and read clouds on one side are reassigned to opposite haplotypes. C. Cloud split. A read cloud is broken into two parts, and the two parts are assigned to opposite haplotypes.
3. Leveraging cancer aneuploidy
Unlike diploid genomes, genomes of aneuploid cancer cells have additional properties
that we can exploit to further connect haplotype blocks. As shown in the
previous chapter, invasive carcinomas often exhibit large copy number variants
(CNVs) including chromosome-wide aneuploidies. If we limit our attention to
CNV regions with an uneven haplotype ratio, haplotype blocks that are spanned
by a CNV can then be connected if we exploit the resulting imbalance of al-
lelic ratios at heterozygous variants. More specifically, we can use allelic ratios
at heterozygous variants to infer which haplotype in a haplotype block has a
higher copy number. If a single CNV event spans more than one haplotype
block, more prevalent and less prevalent haplotypes of those blocks can be con-
nected respectively.
In our approach, we first calculate the lesser allele fraction (LAF) value, which
is the fraction of reads supporting the less prevalent allele, at each heterozygous
variant in both IDC and normal samples. To have more accurate LAF values,
which are also comparable between normal and IDC data, we use sequence
reads from our shotgun libraries. We then employ an HMM model to break the
sequence of IDC LAF values into segments that exhibit the same allele fraction,
suggesting that variants in each segment are spanned by the same CNV event.
Next, we label parental haplotypes at each haplotype block in these segments
as “less frequent” and “more frequent” by developing an HMM method that
employs haplotype allelic ratios at each germline SNV. We also use this HMM
method to identify switch errors in the haplotype blocks that were produced
by either local or statistical phasing. Finally, we utilize these labels to connect
haplotypes of consecutive haplotype blocks within each segment.
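The lesser allele fraction used in the third step, and the value one would expect for a given copy-number state, can be sketched as below. The mixture model is an assumption introduced here for illustration (a fraction n of normal diploid cells, and tumor cells that all carry the same number of copies of each haplotype); the function names are hypothetical.

```python
def lesser_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting the less prevalent allele at one site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return min(ref_reads, alt_reads) / total

def expected_laf(copies_a, copies_b, normal_fraction):
    """Expected LAF at a heterozygous site in a tumor/normal mixture.

    copies_a, copies_b -- tumor copy numbers of the two parental haplotypes
    normal_fraction    -- fraction n of normal (diploid) cells in the sample
    """
    n = normal_fraction
    minor, major = sorted((copies_a, copies_b))
    # Normal cells contribute one copy of each haplotype
    return (n + (1 - n) * minor) / (2 * n + (1 - n) * (minor + major))
```

With balanced copy numbers the expected LAF is 0.5 regardless of contamination, so only regions with an uneven haplotype ratio carry a usable signal, as the text notes.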
4.2.3 Local phasing
In this patient, germline heterozygous SNVs are on average 1.6Kbp apart (97.5%
of neighboring heterozygous SNVs are less than 9Kbp apart). Therefore, most read
clouds span multiple heterozygous SNVs. Since each read cloud originates from a
single molecule, the phase of variants it covers can be determined. Long-range phase
information between SNVs can then be obtained by leveraging information from over-
lapping read clouds.
Haplotype phasing in the presence of sequencing and alignment errors is computationally
expensive. Although exact algorithms for phasing have been developed (e.g.
[41, 19]), their running times scale exponentially in either the average size of reads
or the sequencing coverage of the genome. Cancer samples are usually sequenced
at a high coverage for improving the sensitivity of mutation-detection in heteroge-
neous cancer specimens with high levels of normal-contamination. Therefore, a more
scalable approach is needed for analyzing tumor samples. As part of this work, we
developed a probabilistic inference model that can be applied to cancer data to phase
not only inherited but also de novo and somatic SNVs. Existing phasing methods for
long reads model each fragment as a long string of alleles; therefore, synthetic long reads
must first be created from short sequence reads. In contrast, our method works by the
direct analysis of shotgun reads in each read cloud. As a result, we can incorporate
mapping and base-call qualities of individual reads into our model, which in turn
results in highly accurate haplotype blocks.
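The simulated annealing search over read cloud assignments can be illustrated with a toy version. This sketch makes several assumptions that are not taken from the text: it implements only the cloud reassignment move, it scores a state with a fixed per-allele error rate instead of mapping and base-call qualities, and it uses the standard Metropolis rule and a geometric cooling schedule for the probabilistic acceptance step. All names and parameter values are hypothetical.

```python
import math
import random

def log_likelihood(clouds, assignment, error_rate=0.01):
    """Score an assignment of clouds (dicts variant -> allele) to haplotypes.

    Alleles seen on haplotype 1 are flipped so that every observation votes
    for the allele carried by haplotype 0; minority votes count as errors.
    """
    votes = {}
    for cloud, hap in zip(clouds, assignment):
        for variant, allele in cloud.items():
            implied = allele if hap == 0 else 1 - allele
            votes.setdefault(variant, [0, 0])[implied] += 1
    score = 0.0
    for zero, one in votes.values():
        score += max(zero, one) * math.log(1 - error_rate)
        score += min(zero, one) * math.log(error_rate)
    return score

def accept(old_score, new_score, temperature, rng):
    """Always accept improvements; accept worse states with Metropolis odds."""
    if new_score >= old_score:
        return True
    return rng.random() < math.exp((new_score - old_score) / temperature)

def anneal(clouds, steps=2000, t_start=5.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    assignment = [rng.randint(0, 1) for _ in clouds]  # random initialization
    score = log_likelihood(clouds, assignment)
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(len(clouds))
        assignment[i] ^= 1  # cloud reassignment move
        new_score = log_likelihood(clouds, assignment)
        if accept(score, new_score, temperature, rng):
            score = new_score
        else:
            assignment[i] ^= 1  # reject: undo the move
    return assignment, score

# Four toy clouds over three heterozygous variants
clouds = [{0: 0, 1: 0}, {0: 1, 1: 1}, {1: 0, 2: 1}, {1: 1, 2: 0}]
assignment, score = anneal(clouds)
```

Recomputing the full score after every move keeps the sketch short; a practical implementation would update only the variants touched by the moved cloud.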
In this step, we phased 99% of the germline and 77% of the somatic SNVs called
by GATK, producing 43,719 phased contigs with N50 of 81.02 Kbp. Half of germline
variants were in haplotype blocks with 115 or more SNVs (V50=115). The remaining
somatic variants stayed unphased due to two main reasons. About 13% of them were
more than 10Kbp distant from any germline heterozygous variants, and thus were not
connected to neighboring heterozygous sites by read clouds. For the remaining 10%,
no read cloud connected the variant allele to another heterozygous SNV, therefore the
phase could not be inferred. This was mainly due to sampling bias and
the high normal contamination of the studied IDC lesion (computationally estimated
to be about 70%). Figure 4.4 shows a phased region in chromosome 7.
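For reference, the N50 and V50 statistics reported above can be computed as follows. This is a generic sketch with a hypothetical function name; V50 is obtained by passing the number of SNVs per block instead of block lengths.

```python
def n50(sizes):
    """Largest size S such that blocks of size >= S hold at least half the total.

    Pass block lengths in bp to get N50, or SNV counts per block to get V50.
    """
    ordered = sorted(sizes, reverse=True)
    if not ordered:
        raise ValueError("no blocks given")
    half = sum(ordered) / 2.0
    running = 0
    for size in ordered:
        running += size
        if running >= half:
            return size

# Hypothetical block lengths, for illustration only
example = n50([2, 3, 4, 5, 6])
```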
Figure 4.4: A haplotype block from real read cloud data. A 100 Kbp phased region in chr7 with over 170 variants is shown. The two parental haplotypes are shown in the middle, as well as the two somatic haplotypes on top and bottom. The plot at the top represents the genomic distance in base pairs between neighboring SNVs. Of the 4 somatic SNVs that are phased in this region and are highlighted by dashed lines, two belong to one somatic haplotype and two to the other.
Legend: germline SNV, somatic SNV, reference allele, alternate allele, mixed alleles.
4.2.4 LD-based validation of local phasing
The standard approach for measuring the accuracy of inferred phase is by counting
the number of switch errors. The number of switch errors indicates the minimum
number of switches in the inferred phase necessary to make it compatible with the
true phase.

Distance (Kbp)   Pairs in complete LD   Inconsistent pairs
0 - 10           1080500                17
10 - 20          169311                 20
20 - 30          65060                  3
30 - 40          31159                  2
40 - 50          18466                  0
50 - 60          11836                  0
60 - 70          7496                   0
70 - 80          5065                   1
80 - 90          3544                   0
90+              8660                   0

Table 4.1: Distances between variants in complete LD pairs

To correctly calculate this number, the underlying haplotypes must be
known. However, in practice, this information is not available. In this study, we use
linkage disequilibrium (LD) information, which is obtained from population data, to
computationally validate our inferred haplotypes at common SNPs. For this purpose,
we identified pairs of heterozygous SNPs that are in complete linkage disequilibrium
(D’=1) in the 1000 Genomes Project and appear in the same haplotype block in our sample.
We hypothesized that at each pair, a disagreement between population phase and
inferred phase is an indication of a potential switch error. Out of the total 1,401,097
analyzed SNP pairs, we identified 43 pairs whose inferred phases were in conflict
with their respective population phase. Table 4.1 shows the binned distribution of
pairwise distances between variants that are in complete LD.
A single switch error can disrupt the inferred phase between several pairs of SNVs,
so we examined the haplotype blocks and calculated how many switches in each
haplotype block are required to correct the identified discrepancies between inferred
and population phases. In total, we identified 11 switch errors, suggesting a phase
accuracy of over 99.9% in the validated regions. We acknowledge the limited scope
of this test due to an inverse correlation between pairwise LD levels and genomic
distance between variants; therefore, we additionally validated the inferred haplotype
blocks using other features of the data as explained in the following sections.
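The switch error count defined at the start of this section can be sketched for the idealized case in which the true phase is known. The function name and input encoding are hypothetical: each haplotype is a list giving the allele (0 or 1) carried by the first haplotype at every heterozygous site of a block.

```python
def count_switch_errors(inferred, truth):
    """Minimum number of phase switches that make `inferred` match `truth`."""
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    # True where the inferred first haplotype carries the true allele
    agree = [a == b for a, b in zip(inferred, truth)]
    # One switch is needed wherever agreement flips between adjacent sites
    return sum(1 for prev, cur in zip(agree, agree[1:]) if prev != cur)

# Hypothetical block with one mis-phased stretch in the middle
errors = count_switch_errors([0, 0, 1, 1, 0], [0, 0, 0, 0, 0])
```

A block whose two haplotype labels are simply swapped counts zero switches, because the labels are arbitrary. This also shows why one switch error can make many LD pairs inconsistent: every pair that straddles the switch point disagrees with the population phase.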
4.2.5 Statistical phasing
Linkage disequilibrium across informative sites can also be utilized to connect haplo-
type blocks. LD information about pairs of SNPs that span across haplotype blocks
can help us connect these blocks such that, in the resulting haplotypes, the order of
alleles at these pairs agrees with the population phase. Therefore, our goal in this
step is to statistically join haplotype blocks by exploiting patterns of LD. Several
tools have been developed for statistical phasing; a few recently published ones
(such as SHAPEIT2 [25] and Prism [48]) can take partial phase information from
local haplotypes as their input. We performed Prism’s global phasing stage, which
is an HMM-based model, on the locally phased blocks using the prephased reference
panel of the 1000 Genomes project. Prism suggests a phase between adjacent local
blocks along with a confidence score, which is a measure of the likelihood of a switch
error associated with the suggested phase. We assembled local blocks into longer
haplotypes if the confidence score was above 0.98, which is empirically estimated to
produce about 0.3 to 0.6 long switches per mega-base [48]. With this approach, we
achieved N50 size of 533 Kbp, and V50 of 452 SNPs.
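The thresholded assembly of local blocks can be sketched as a single pass over adjacent block junctions. The input format below is hypothetical and is not Prism's actual output: each junction is represented as a (flip, confidence) pair, where flip states whether the next block must be inverted relative to the previous one.

```python
def join_blocks(n_blocks, suggestions, threshold=0.98):
    """Chain adjacent local blocks whose suggested phase is confident enough.

    suggestions[i] = (flip, confidence) for the junction between block i and
    block i + 1. Returns a list of chains; each chain is a list of
    (block_index, flipped) pairs giving the orientation of every block.
    """
    if n_blocks == 0:
        return []
    if len(suggestions) != n_blocks - 1:
        raise ValueError("need one suggestion per junction")
    chains = [[(0, False)]]
    for i, (flip, confidence) in enumerate(suggestions):
        if confidence > threshold:
            prev_flipped = chains[-1][-1][1]
            # Orientation is relative, so flips compose along the chain
            chains[-1].append((i + 1, prev_flipped != flip))
        else:
            chains.append([(i + 1, False)])  # start a new haplotype block
    return chains

# Hypothetical junction scores, for illustration only
chains = join_blocks(4, [(True, 0.99), (True, 0.995), (False, 0.5)])
```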
4.2.6 Leveraging aneuploidy information in phasing
In total, we phased regions covering 2.76 Gbp of the IDC genome. Within these regions, we
identified segments totaling 1.94 Gbp in which cancer cells (or a subset of them) display an
uneven number of copies between haplotypes. We exploited the resulting unbalanced
allelic fractions of heterozygous variants in these segments to first detect switch errors
in our inferred phase and then further connect haplotype blocks.
If a haplotype block is within a region with an uneven haplotype copy ratio, the
fraction of shotgun reads supporting the alleles of the less prevalent haplotype should
be equal to the copy number ratio of that haplotype. As a consequence of a phase
switch error, while on one side of the switch point, these fractions of shotgun reads
agree with the copy number ratio of the less prevalent haplotype, on the other side,
they agree with the copy number ratio of the more prevalent haplotype. We therefore
developed an HMM to detect such precipitous shifts in haplotype allelic ratios
and thereby identify switch errors.

Category   Total length of   Total number of phased   Number of potential
           regions (Gbp)     germline variants        switch errors
R1         1.0915            723082                   703 (0.10%)
R2         0.0431            24818                    8 (0.03%)
R3         0.5250            329800                   104 (0.03%)
R4         0.0941            49428                    9 (0.02%)
R5         0.0931            59339                    16 (0.03%)
R6         0.0033            656                      0 (0%)
R7         0.0057            2441                     2 (0.08%)
R8         0.0089            2741                     2 (0.07%)

Table 4.2: Estimated number of switch errors in CNV regions
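The idea behind this switch detection can be illustrated with a two-state HMM decoded by the Viterbi algorithm. This is a sketch under assumptions, not the model used in this work: the binomial emission model, the fixed switch probability, and all names are introduced here for illustration, and the lesser haplotype fraction of the CNV segment (strictly between 0 and 1) is assumed to be known.

```python
import math

def detect_switches(counts, minor_fraction, switch_prob=1e-4):
    """Locate candidate switch errors inside one haplotype block.

    counts[i] = (reads supporting the h1 allele, reads supporting the h2
    allele) at SNV i. State 0 means h1 is the less prevalent haplotype,
    state 1 the more prevalent one. Returns the indices i at which the best
    state path changes between SNV i - 1 and SNV i.
    """
    if not counts:
        return []
    p_h1 = (minor_fraction, 1.0 - minor_fraction)
    stay, move = math.log(1.0 - switch_prob), math.log(switch_prob)

    def emit(state, h1, h2):
        # Binomial log-likelihood without the state-independent coefficient
        return h1 * math.log(p_h1[state]) + h2 * math.log(1.0 - p_h1[state])

    score = [math.log(0.5) + emit(s, *counts[0]) for s in (0, 1)]
    back = []
    for h1, h2 in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cand = [score[0] + (stay if s == 0 else move),
                    score[1] + (stay if s == 1 else move)]
            best = 0 if cand[0] >= cand[1] else 1
            pointers.append(best)
            new_score.append(cand[best] + emit(s, h1, h2))
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the better final state
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]

# Hypothetical read counts with a phase switch after the 20th SNV
switches = detect_switches([(12, 28)] * 20 + [(28, 12)] * 20, minor_fraction=0.3)
```

The small switch probability makes the path change state only when many consecutive SNVs support the opposite labeling, so isolated noisy sites are not reported as switch errors.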
We limited our analysis to CNV regions that were larger than 1Mbp in size, and
also excluded two regions in chr19 and chr3, which showed signs of chromothripsis,
to ensure that identified CNV regions cover only one single event and thus allelic
ratios could be used reliably to join haplotype blocks. We used information from the
remaining 1.86 Gbp regions of the genome to detect switch errors and extend blocks
(Figure 4.5).
Within each highlighted region in Figure 4.5, we identified switch errors in haplo-
type blocks. Table 4.2 reports the number of potential switch errors that we identified
in different regions of the genome. A large fraction of the genome is present at the
average lesser haplotype allelic fraction of 0.42. In these regions, we estimated 0.1%
switch errors (0.64 errors per 1Mbp). In other regions, the rate of errors ranged from
0% to 0.08% (0 to 0.35 errors per 1Mbp). Our estimate of switch errors is very close
to Prism’s estimation of long switch errors for the chosen confidence score (0.98) (Fig-
ure 2 in [48]), which suggests most switch errors were introduced in the statistical
phasing step.
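The two rates quoted in this paragraph follow directly from Table 4.2. The short script below recomputes them; the values are copied from the table, while the function name and the choice of region total length as the denominator for the per-Mbp rate are assumptions.

```python
# (region, total length in Gbp, phased germline variants, potential switch errors)
TABLE_4_2 = [
    ("R1", 1.0915, 723082, 703),
    ("R2", 0.0431, 24818, 8),
    ("R3", 0.5250, 329800, 104),
    ("R4", 0.0941, 49428, 9),
    ("R5", 0.0931, 59339, 16),
    ("R6", 0.0033, 656, 0),
    ("R7", 0.0057, 2441, 2),
    ("R8", 0.0089, 2741, 2),
]

def switch_error_rates(length_gbp, n_variants, n_errors):
    """Return (errors as a percentage of variants, errors per Mbp)."""
    percent = 100.0 * n_errors / n_variants
    per_mbp = n_errors / (length_gbp * 1000.0)
    return percent, per_mbp

for region, length_gbp, n_variants, n_errors in TABLE_4_2:
    percent, per_mbp = switch_error_rates(length_gbp, n_variants, n_errors)
    print(f"{region}: {percent:.2f}% of variants, {per_mbp:.2f} errors per Mbp")
```

For R1 this gives about 0.10% and 0.64 errors per Mbp, and for R7 about 0.08% and 0.35 errors per Mbp, in agreement with the figures in the text.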
After correcting switch errors, we connected haplotype blocks within each CNV
region by joining haplotypes of the same prevalence. With this approach, we achieved
haplotype blocks with an N50 size of 27.67 Mbp. Half of germline variants were in
haplotype blocks with 17,787 or more variants.
[Plot: allelic fraction (vertical axis, ticks at 0.2, 0.4, 0.6) along chromosome arms 1p through Xq (horizontal axis); series: Normal, IDC; highlighted CNV regions R1 to R8.]
Figure 4.5: Haplotype allelic fractions. Germline SNVs are arranged by their genomic order, and the fraction of shotgun reads displaying IDC’s less prevalent haplotype allele is plotted for both normal and IDC samples, averaged in windows of 20 SNVs. As expected, allele ratios in the normal sample are close to 0.5. Large CNVs in IDC sample are visible as sharp drops in the plot. Black lines display CNV borders detected by the HMM. Highlighted regions are CNV regions in which we joined haplotype blocks. CNV regions are colored based on their average allelic ratios. Allelic fractions displayed in this plot were calculated after correcting the identified switch errors.
4.2.7 Final validation test
As a final validation test of the inferred phase between variants, we considered somatic
SNVs at the regions of LOH. These are regions in which cancer cells have lost one
of the two parental chromosomes; the other parental chromosome, however, can be
present in different copy numbers. We limited our analysis to LOH regions in which
all cancer cells share the chromosome loss event. We used haplotype-specific read
counts to identify such regions. In these regions, which constituted about 20% of
the genome, all somatic SNVs should belong to the retained chromosome and not
the deleted copy. Since we did not use this information in previous phasing steps, it
serves as an independent validation test whereby we can estimate the phasing error
of somatic SNVs. We observed that out of the 308 somatic SNVs that were phased
in these regions, 304 of them were assigned to the correct haplotype.
4.3 Methods
4.3.1 Processing of samples and sequencing
Our workflow began with selecting a core of grade 3, ERBB2 amplified, estrogen
receptor (ER) negative IDC isolated from fresh frozen tissue of a patient and a spec-
imen of the patient’s normal breast tissue as a control sample. We built one Illumina
library for each sample with an average insert size ranging between 300 and 400
bases. We also prepared eight Moleculo long fragment libraries for the IDC sample.
Standard paired-end sequencing (2 x 101) of all libraries was then performed on the
Illumina platform. Sequenced paired-end reads were subsequently mapped to UCSC
hg19 reference genome using BWA-mem [54] with default parameters, and the ‘-M’
option. Only primary alignments were considered in the downstream analysis.
4.3.2 Genotype calling
We used GATK’s Unified Genotyper toolkit (GenomeAnalysisTK-2.8.1) to call single
nucleotide variants in IDC and control samples simultaneously. To minimize the
impact of PCR amplification errors, and other technology-specific errors on genotype
calls, we processed sequence reads from both sets of shotgun and Moleculo libraries
and performed multisample SNV calling on the processed BAM files. We followed
the best practice guideline provided by GATK.
Raw variant calls were then filtered by GATK’s Variant Quality Score Recalibra-
tion (VQSR) tool, and those that passed the filters were used to create high-confidence
sets of germline and somatic calls. We only considered variants that are covered with
at least 10 sequence reads in every sample. We defined germline variants as ones that
were called in the two sets of IDC libraries and the control sample. Somatic SNVs
were defined as ones that were not called in the normal sample and for which no
sequence reads in the normal sample harbored the alternate allele. We only considered somatic variants
that were called by GATK in at least one set of IDC libraries (shotgun or Moleculo),
and their variant alleles were supported by at least one sequence read in the other
set.
We selected an SNV as an informative heterozygous marker for the downstream
phasing steps when it passed three conditions: 1) it had a lesser allele fraction of at
least 0.25 in the control sample, 2) at least one read cloud supported each allele of
the variant, and 3) not more than 25% of overlapping read clouds had a mixture of reads
showing evidence for both reference and alternate alleles.
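The three conditions above can be sketched as a simple filter. This is a minimal illustration, not the original implementation: the function name, the input layout (one set of observed alleles per overlapping read cloud), and the choice to let a mixed cloud count as support for both alleles in condition 2 are all assumptions.

```python
def is_informative_marker(normal_ref, normal_alt, cloud_alleles,
                          min_laf=0.25, max_mixed_frac=0.25):
    """Hypothetical filter for informative heterozygous markers.

    normal_ref, normal_alt: read counts in the control sample.
    cloud_alleles: one set per overlapping read cloud, containing
                   "ref" and/or "alt" depending on what its reads show.
    """
    depth = normal_ref + normal_alt
    if depth == 0:
        return False
    # Condition 1: lesser allele fraction of at least 0.25 in the control.
    if min(normal_ref, normal_alt) / depth < min_laf:
        return False
    # Condition 2: at least one read cloud supports each allele
    # (assumption: a mixed cloud counts as support for both).
    has_ref = any("ref" in alleles for alleles in cloud_alleles)
    has_alt = any("alt" in alleles for alleles in cloud_alleles)
    if not (has_ref and has_alt):
        return False
    # Condition 3: no more than 25% of overlapping clouds are mixed.
    mixed = sum(1 for alleles in cloud_alleles if alleles == {"ref", "alt"})
    return mixed / len(cloud_alleles) <= max_mixed_frac
```

For example, a site with balanced control counts and four clean clouds (two per allele) passes, whereas a site where one of three clouds is mixed fails condition 3.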
4.3.3 Constructing read clouds from sequence reads
Within each Moleculo well, long fragments are sequenced with short read sequenc-
ing technology. Sequence reads from a long fragment form an island of reads when
mapped to the reference genome. We reconstructed long fragments by identifying
such islands of shotgun reads in each well and grouping reads in each island into a
read cloud. Since the fraction of a haploid genome covered in a well is very low, the
chance for two fragments of opposite haplotypes to collide in the same well is very
low. Therefore, short reads in nearly all read clouds have originated from only one
fragment.
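One way to picture the island-finding step is a single sweep over the reads of each well, sorted by start coordinate. The sketch below is illustrative only: the tuple layout of a read and the `max_gap` threshold (the largest gap tolerated inside one island) are hypothetical, since the text does not state how islands were delimited.

```python
from collections import defaultdict

def build_read_clouds(reads, max_gap=5000):
    """Group mapped reads into read clouds (islands) per well and chromosome.

    reads: iterable of (well, chrom, start, end) tuples.
    max_gap: hypothetical threshold; a larger gap starts a new island.
    Returns a list of clouds, each a list of reads sorted by start.
    """
    by_key = defaultdict(list)
    for read in reads:
        well, chrom, _start, _end = read
        by_key[(well, chrom)].append(read)

    clouds = []
    for key in sorted(by_key):
        group = sorted(by_key[key], key=lambda r: r[2])
        current = [group[0]]
        current_end = group[0][3]
        for read in group[1:]:
            if read[2] - current_end > max_gap:
                # Gap too large: close the island and start a new one.
                clouds.append(current)
                current = [read]
                current_end = read[3]
            else:
                current.append(read)
                current_end = max(current_end, read[3])
        clouds.append(current)
    return clouds
```

Reads from different wells are never merged, which reflects the premise that each well is processed independently.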
We excluded PCR duplicate reads, non-primary alignments and reads with map-
ping quality less than 10 from read clouds. Read clouds were then passed through a
quality control process in which the internal coverage of read clouds and their total
lengths were assessed. Read clouds with evidence of both reference and alternate al-
leles in at least two consecutive heterozygous variants were also excluded from further
analysis.
4.3.4 Building variant blocks
Since heterozygous variants are not uniformly distributed throughout the genome,
occasionally the distance between two neighboring variants is larger than the size
of Moleculo fragments. As a result, phasing the genome by assembling read clouds
produces a set of disjoint haplotype blocks. The variants in these blocks can be
phased separately and in parallel because their phase is independent of each other.
We denote these groups of variants that can be phased together as variant blocks.
To obtain the set of variant blocks, we created a weighted undirected graph. In
this graph, nodes represented SNVs and edges captured their connectivity in the read
clouds. More specifically, two nodes were connected with an edge if they met two
conditions: 1) at least one read cloud covered both variants with sequence reads, 2)
the two variants were adjacent in that read cloud. The weight of an edge was set to
be the number of read clouds that met these two conditions. In such a graph model,
there is a path between two nodes only if a chain of overlapping read clouds links the
corresponding variants. Therefore, each connected component of the graph defines one variant block.
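The graph construction and the extraction of variant blocks can be sketched as follows. This is a minimal illustration under assumed inputs: each read cloud is given as the list of variant positions it covers, and connected components are found with a small union-find. The function names are hypothetical.

```python
from collections import defaultdict

def build_variant_graph(clouds):
    """clouds: list of lists of variant positions covered by each read cloud.
    Returns {(u, v): weight} with u < v, where the weight counts the read
    clouds in which u and v are adjacent covered variants."""
    weights = defaultdict(int)
    for variants in clouds:
        ordered = sorted(set(variants))
        for u, v in zip(ordered, ordered[1:]):
            weights[(u, v)] += 1
    return dict(weights)

def variant_blocks(weights):
    """Return the connected components (variant blocks) as sorted lists."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in weights:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    blocks = defaultdict(list)
    for node in parent:
        blocks[find(node)].append(node)
    return sorted(sorted(block) for block in blocks.values())
```

Variants covered by only a single-variant cloud never appear as edge endpoints, so in this sketch they do not form a block on their own.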
We also utilized the edge weights in this graph to assess the connectivity level of
each connected component. The value of a minimum cut in a component indicates
the minimum number of read clouds that can be removed to break the connectivity of
the component. This number indicates a confidence level for the inferred haplotype
because a higher connectivity level provides more information for phasing. If two
variants are connected by very few read clouds, sequencing or alignment errors can
cause an error in their inferred phase. A connected component can be broken into
smaller subparts by choice if its minimum cut value is less than a specified threshold.
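The minimum cut value of a component can be computed with any global min-cut algorithm. The text does not say which one was used; the sketch below uses the Stoer-Wagner algorithm as one standard choice, with a hypothetical function name and the edge-weight dictionary layout assumed above (keys are node pairs, values are read cloud counts).

```python
def stoer_wagner_min_cut(nodes, weights):
    """Global minimum cut value of a connected, weighted, undirected graph.

    nodes: list of node ids; weights: {(u, v): weight} without self loops.
    """
    adj = {u: {} for u in nodes}
    for (u, v), w in weights.items():
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w

    best = float("inf")
    active = list(nodes)
    while len(active) > 1:
        # One minimum-cut phase: grow a set from an arbitrary start node,
        # always adding the most tightly connected remaining node.
        start = active[0]
        added = [start]
        conn = {v: adj[start].get(v, 0) for v in active if v != start}
        cut_of_phase = 0
        while conn:
            nxt = max(conn, key=conn.get)
            cut_of_phase = conn.pop(nxt)
            added.append(nxt)
            for v in conn:
                conn[v] += adj[nxt].get(v, 0)
        s, t = added[-2], added[-1]
        best = min(best, cut_of_phase)
        # Merge the last-added node t into s.
        for v, w in adj[t].items():
            if v == s:
                continue
            adj[s][v] = adj[s].get(v, 0) + w
            adj[v][s] = adj[v].get(s, 0) + w
            del adj[v][t]
        adj[s].pop(t, None)
        del adj[t]
        active.remove(t)
    return best
```

A component whose returned value falls below a chosen threshold would then be a candidate for splitting at the corresponding weak link.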
4.3.5 Local phase
We represent input shotgun sequence reads $\{r_1, \ldots, r_n\}$ from Moleculo data by two
$n \times m$ matrices $R$ and $Q$, whose columns correspond to positions in the genome
and whose rows correspond to shotgun sequence reads. The base-call of a read $r_i$ at
the genomic position $j$ is stored in $R_i^j$, which can take one of four values $\{0, 1, 2, -\}$,
representing the reference allele, the alternate allele, a different base-call, or a gap,
respectively. At positions of the genome where GATK did not call an SNV, any
base-call other than the reference allele is denoted by 2. Positions that lie outside
the boundaries of the aligned reads, as well as deletions, are marked as ‘-’. Insertions
are not captured in the matrix. $Q_i^j$ is defined to be the maximum of $r_i$’s mapping
error probability and its probability of an incorrect base-call at $j$. We let $L$ be the
set of genomic positions of heterozygous variants in a variant block defined in the previous
section. We also let $C$ be the set of $K$ read clouds $\{c_1, \ldots, c_K\}$ covering at least two
heterozygous variants in $L$. We define $c_i = \{j \mid r_j \text{ is in the } i\text{th read cloud}\}$ to be a
collection of shotgun reads. We first infer the phase between heterozygous variants
in $L$ by optimizing the assignment of read clouds in $C$ to two parental haplotypes
$H = \{h_1, h_2\}$. The state of $h_1$ and $h_2$ is assumed to be hidden. We use the variable
$\alpha(c_i)$ to denote the assignment of $c_i$ to either $h_1$ or $h_2$. We model the probability
$P(h_q^l = \mathrm{alt})$ that haplotype $q$, namely $h_q$, presents the alternate allele at location
$l$ using the binomial distribution. The distribution is parameterized by $\theta_q^l$, which
corresponds to haplotype $q$ at position $l$.
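Given the above assumptions, the likelihood of our reads $R$ conditioned on $\theta$ is
given by:

$$P(R \mid \theta) = \prod_{l \in L} \prod_{c_k \in C} \prod_{i \in c_k} P\left(R_i^l \mid \theta_{\alpha(c_k)}^l\right) \tag{4.1}$$

where the probability at a single polymorphic location, $P(R_i^l \mid \theta_{\alpha(c_k)}^l)$, is given by:

$$P\left(R_i^l \mid \theta_{\alpha(c_k)}^l\right) =
\begin{cases}
\theta_{\alpha(c_k)}^l \, Q_i^l / 3 + \left(1 - \theta_{\alpha(c_k)}^l\right)\left(1 - Q_i^l\right) & R_i^l = 0 \\
\left(1 - \theta_{\alpha(c_k)}^l\right) Q_i^l / 3 + \theta_{\alpha(c_k)}^l \left(1 - Q_i^l\right) & R_i^l = 1 \\
2 Q_i^l / 3 & R_i^l = 2
\end{cases} \tag{4.2}$$

We set $P(- \mid \theta)$ to 1.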
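Equations 4.1 and 4.2 translate directly into code. The sketch below is illustrative: the data layout (nested lists indexed as `R[i][l]` and `theta[l][q]`, read clouds as lists of read indices) and the function names are assumptions, and the product in Equation 4.1 is accumulated as a sum of logarithms to avoid numerical underflow.

```python
import math

GAP = "-"

def base_likelihood(r, theta, q):
    """Equation 4.2 for one read at one variant.

    r: observed value in {0, 1, 2, GAP}; theta: probability that the
    assigned haplotype carries the alternate allele; q: error probability.
    """
    if r == GAP:
        return 1.0
    if r == 0:
        return theta * q / 3 + (1 - theta) * (1 - q)
    if r == 1:
        return (1 - theta) * q / 3 + theta * (1 - q)
    if r == 2:
        return 2 * q / 3
    raise ValueError(f"unexpected base-call code: {r!r}")

def log_likelihood(R, Q, clouds, alpha, theta, positions):
    """Logarithm of Equation 4.1.

    clouds[k]: list of read indices in read cloud k.
    alpha[k]:  haplotype index (0 or 1) assigned to read cloud k.
    Assumes every per-site probability is positive (for example q > 0).
    """
    total = 0.0
    for k, cloud in enumerate(clouds):
        for i in cloud:
            for l in positions:
                p = base_likelihood(R[i][l], theta[l][alpha[k]], Q[i][l])
                total += math.log(p)
    return total
```

With a reference call, `theta = 0`, and an error probability of 0.03, the per-site probability is 0.97, as expected for a read that matches its haplotype.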
We wish to optimize $\theta$ such that it maximizes $P(R \mid \theta)$, which we assume to be
proportional to $P(\theta \mid R)$ under the assumption of a uniform prior $P(\theta)$. Our inference
procedure begins with a random assignment of all read clouds to the two haplotypes.
The model transitions within its state space, which is the set of all possible haplotype
assignments of read clouds, in a Monte Carlo simulated annealing approach. Namely,
for a given assignment $\alpha$, once reads are assigned to a haplotype, the parameter $\theta$
for each haplotype is calculated based on the fraction of read clouds assigned to the
haplotype that exhibit the alternate allele. A proposal for a new assignment $\alpha'$ and
the corresponding $\theta'$ is derived from a category of three moves: (a) cloud reassignment,
(b) cloud split, and (c) phase switching. The corresponding moves are outlined in the
sections below.
After each move, the acceptance probability of the visited state is calculated using
$P(s, s', T)$, which depends on $s = P(R \mid \theta)$, $s' = P(R \mid \theta')$, and a time-varying
temperature parameter $T$:

$$P(s, s', T) =
\begin{cases}
1 & s' > s \\
\exp\left((s' - s)/T\right) & s' \le s
\end{cases} \tag{4.3}$$
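The acceptance rule in Equation 4.3 is the usual Metropolis criterion of simulated annealing and can be sketched in a few lines. The function names are hypothetical, and passing log-likelihoods as the scores `s` and `s_new` is an assumption made here for numerical convenience; the text defines the scores as likelihoods.

```python
import math
import random

def acceptance_probability(s, s_new, temperature):
    """Equation 4.3: always accept improvements, otherwise accept with a
    probability that shrinks as the score drop grows or the temperature falls."""
    if s_new > s:
        return 1.0
    return math.exp((s_new - s) / temperature)

def accept_move(s, s_new, temperature, rng=random):
    """Draw a uniform number in [0, 1) and compare it to the acceptance probability."""
    return rng.random() < acceptance_probability(s, s_new, temperature)
```

As the temperature parameter decreases over iterations, moves that lower the score are accepted less and less often, so the search gradually settles into a high-scoring assignment.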
1. Cloud reassignment.
The underlying idea behind this move is that read clouds from each haplotype
should agree with each other on their overlapping germline variants and that
over several iterations similar read clouds cluster in the correct haplotype and
attract other matching clouds. When this move is selected, we select a read
cloud $c_i$ randomly and swap its haplotype assignment $\alpha(c_i)$. We accept the move
probabilistically according to $P(s, s', T)$. If the state transition is accepted, we
update $\alpha$ and $\theta$ accordingly.
2. Cloud split.
The second move is introduced to address chimeric read clouds, which result
from fragment collisions in the same well. Chimeric read clouds can disturb
the phase between variants, and lower the likelihood score. Regardless of the
haplotype assignment, such read clouds cannot match the haplotype alleles at
a subset of covered variants. If two DNA fragments from opposite haplotypes
overlapped at multiple heterozygous SNVs within a well, the resulting read cloud
would have evidence for both reference and alternate alleles at those variants.
These read clouds are not informative for phasing because the true order of
alleles at the overlapped sites cannot be inferred, and thus these read clouds
are filtered at the read cloud construction step. However, if the two fragments
overlapped at only a few variants, or if they overlapped at no variant but were
mistakenly combined into one read cloud due to their close proximity, the order
of alleles can be recovered from the chimeric read cloud. The order of alleles on
the underlying true fragments can be recovered by breaking the read cloud at
the fusion point.
In our model, each read cloud has a binary flag which indicates whether it is
chimeric or not. This flag is updated dynamically during the execution of the
inference procedure. In this move, we randomly select a read cloud. If the read
cloud is already flagged as chimeric, we reunite its components and reset its flag.
If it is not flagged as such, we break it into two parts, assign the two resulting
read clouds to opposite haplotypes, and flag them as chimeric. We calculate
the likelihood score under the new assignment $\alpha'$, and accept the proposed
configuration according to $P(s, s', T)$.
Breaking a read cloud $c_i$ at break point $\beta$ means that we assign all sequence
reads in $c_i$ covering germline variants up to $\beta$ to a read cloud $c_i^{\beta}(1)$, and all
sequence reads starting after $\beta$ to a read cloud $c_i^{\beta}(2)$. We choose the break
point by finding a genomic position that maximizes the following score:

$$\beta = \arg\max_{\beta} \max\left( P(c_i^{\beta}(1) \mid \theta_1)\, P(c_i^{\beta}(2) \mid \theta_2),\; P(c_i^{\beta}(1) \mid \theta_2)\, P(c_i^{\beta}(2) \mid \theta_1) \right) \tag{4.4}$$

After the break point $\beta$ is selected, $\alpha(c_i^{\beta}(1))$ and $\alpha(c_i^{\beta}(2))$ are determined based
on which assignment maximizes the inner max term in the equation above.
3. Switch unwinding.
As the inference method proceeds, clusters of matching read clouds in each
haplotype form and grow. However, it is possible for a cluster to start forming in
the wrong haplotype and introduce a switch error in the resulting phase. If there
is a switch error between two variants, read clouds spanning the switch point
cannot be assigned to either haplotype confidently. This is because on one side
of the switch point, these read clouds match the alleles of one haplotype, while
on the other side they match the alleles of the opposite haplotype. Therefore,
read clouds originated from opposite haplotypes do not segregate correctly in
that region. We rely on two signatures to identify the switch points. The first
signature is that $\theta_q^l$ is less than 1 and close to 0.5 in both haplotypes at the
variants around the switch point. The second signature is that a large fraction
of read clouds are flagged as chimeric in that region. In this move, we first choose
a heterozygous variant $l$ in a weighted random manner such that the weight is
inversely correlated with $\max(\theta_1^l, \theta_2^l)$ and positively correlated with the fraction of
read clouds spanning $l$ that are flagged as chimeric. Second, we reassign all read
clouds starting after $x_l$ to the opposite haplotype. Third, we reassign all read
clouds spanning $x_l$ to the most likely haplotype, and reunite the components of
read clouds flagged as chimeric. Finally, we calculate the likelihood score and
accept the move according to the update rule.
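As an illustration of the cloud split move, the break point search of Equation 4.4 can be sketched by scanning every split of a read cloud's variant observations and scoring both opposite-haplotype assignments. Everything about the data layout is an assumption: the cloud is reduced to a list of `(variant_index, allele, error_probability)` observations sorted by position, `theta[l]` is a pair of per-haplotype parameters, and scores are compared in log space.

```python
import math

def site_prob(allele, theta, q):
    """Per-site probability for allele 0 (reference) or 1 (alternate)."""
    if allele == 0:
        return theta * q / 3 + (1 - theta) * (1 - q)
    return (1 - theta) * q / 3 + theta * (1 - q)

def log_prob(observations, theta, hap):
    """Log-probability of a set of observations under haplotype `hap`."""
    return sum(math.log(site_prob(a, theta[l][hap], q))
               for l, a, q in observations)

def best_break_point(observations, theta):
    """Return (last variant index of part 1, (hap of part 1, hap of part 2),
    log score) for the best split of a read cloud into two parts.

    theta values are assumed to lie strictly between 0 and 1 so that
    every logarithm is defined. Returns None for fewer than two sites.
    """
    best = None
    for b in range(1, len(observations)):
        left, right = observations[:b], observations[b:]
        for hap_left, hap_right in ((0, 1), (1, 0)):
            score = (log_prob(left, theta, hap_left)
                     + log_prob(right, theta, hap_right))
            if best is None or score > best[2]:
                best = (observations[b - 1][0], (hap_left, hap_right), score)
    return best
```

For a chimeric cloud whose first two variants match one haplotype and whose last two match the other, the best split falls between the second and third variants.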
4.3.6 Constructing somatic haplotypes
After phasing germline heterozygous variants, we proceeded by inferring the phase of
somatic SNVs. The goal here was to identify which parental haplotype was mutated
at each site in cancer cells. At each somatic variant, covering read clouds can harbor
either the reference allele or the variant allele depending on which copy of the chro-
mosome they originated from, and whether they were from cancer or normal cells.
Only read clouds that originated from the mutated chromosome in cancer cells
should harbor the variant allele. Since read clouds are already assigned to parental
chromosomes in the previous step, we can infer which chromosome copy was mutated
at each somatic variant.
In this step, we created two somatic haplotypes, each linked to a parental hap-
lotype. At each somatic SNV, we assigned read clouds that harbored the alternate
allele to a somatic haplotype. The somatic haplotype that a read cloud was assigned
to was determined based on which parental haplotype it was in previously.
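The assignment of somatic SNVs to parental haplotypes can be sketched as a lookup of the haplotype already assigned to each read cloud that carries the alternate allele. The dictionary layout and function name below are hypothetical, and resolving disagreements by majority vote (leaving ties unphased) is an added convention for illustration that the text does not describe.

```python
from collections import Counter

def phase_somatic_snvs(alt_clouds_by_snv, cloud_haplotype):
    """Assign each somatic SNV to the parental haplotype of its alt-carrying clouds.

    alt_clouds_by_snv: {snv_id: [ids of read clouds with the alternate allele]}
    cloud_haplotype:   {cloud_id: "h1" or "h2"} from the local phasing step
    Returns {snv_id: (haplotype or None, Counter of supporting clouds)}.
    """
    result = {}
    for snv, cloud_ids in alt_clouds_by_snv.items():
        counts = Counter(cloud_haplotype[c] for c in cloud_ids
                         if c in cloud_haplotype)
        ranked = counts.most_common()
        if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
            # No phased support, or a tie between haplotypes: leave unphased.
            result[snv] = (None, counts)
        else:
            result[snv] = (ranked[0][0], counts)
    return result
```

Keeping the per-haplotype counts alongside the call makes it easy to inspect sites where alt-carrying clouds appear on both haplotypes.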
4.3.7 LD-based validation
Since we did not leverage population information at the local phase step, we can use
linkage disequilibrium information as an independent test for the accuracy of inferred
germline haplotypes. For this purpose, we used haplotype information from the 1000
Genomes Project (Phase 1 integrated data set, version 3). In each phase block, we
examined pairwise LD between all SNPs by calculating the Pearson correlation. We
counted the total number of SNP pairs in all blocks that are in complete LD. We
then counted the number of SNP pairs at which the inferred phase disagrees with the
population phase. If more than one SNP pair with discordant phases were present in
a variant block, we calculated the minimum number of switches required to make the
inferred phase compatible with the LD phase.
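The pairwise check against population data can be sketched as follows. This is an illustration under assumed inputs: each SNP is represented by a 0/1 vector over the panel haplotypes (1 for the alternate allele), complete LD is taken to mean a Pearson correlation of +1 or -1 within a small tolerance, and the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic site carries no LD information
    return sxy / math.sqrt(sxx * syy)

def phase_agrees_with_ld(panel_a, panel_b, h1_allele_a, h1_allele_b, tol=1e-9):
    """Compare the inferred phase of two SNPs with the population phase.

    panel_a, panel_b: 0/1 alleles of the two SNPs across panel haplotypes.
    h1_allele_a, h1_allele_b: alleles placed on haplotype h1 by phasing.
    Returns True or False for pairs in complete LD, and None otherwise.
    """
    r = pearson_r(panel_a, panel_b)
    if abs(abs(r) - 1.0) > tol:
        return None
    population_cis = r > 0          # alternate alleles travel together
    inferred_cis = (h1_allele_a == h1_allele_b)
    return population_cis == inferred_cis
```

Counting the `False` results over all pairs in a block gives the number of discordant pairs referred to above.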
4.3.8 Statistical phase
To join haplotype blocks using LD information, we first identified haplotype blocks
covering at least one germline SNP present in the 1000 Genomes Project database. We
then ran the global stage of Prism [48] with its default parameters (-K 75) on these
blocks by using population data from the 1000 Genomes Project (Phase 1 integrated
data set, version 3). Prism outputs the relative phase between consecutive blocks
and assigns a confidence score to the suggested phase. We only joined blocks if the
reported confidence score was above 0.98.
4.3.9 Detecting somatic CNV regions
To detect large CNV regions, we utilized lesser allele fraction (LAF) values at germline
heterozygous SNPs, segmented by an HMM method. More specifically, we first cal-
culated LAF values at germline heterozygous SNPs in both normal and IDC samples,
by using sequence reads from shotgun libraries. We then averaged the LAF values
in non-overlapping windows of 20 SNPs and plotted the averaged values. We then
identified the steep drops of LAF values by developing an HMM that is embedded
in an EM procedure. These drops correspond to CNV events with uneven haplotype
copy numbers. Different copy number variants may produce different LAF values
depending on how many copies are gained or lost by each parental chromosome, and
what portion of cancer cells share this event. We expect to observe the same LAF
values at regions of a genome that are present in equal copy numbers if, and only if,
the same fraction of cancer cells share the CNV at these regions.
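The windowed LAF computation can be sketched as below. The input layout (a list of reference and alternate read counts at germline heterozygous SNPs in genomic order) and the function names are assumptions, as is the choice to drop a trailing window with fewer than 20 SNPs.

```python
def lesser_allele_fraction(ref_count, alt_count):
    """Fraction of reads supporting the less frequent allele at one SNP."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("SNP has no read coverage")
    return min(ref_count, alt_count) / total

def windowed_laf(counts, window=20):
    """Average LAF in non-overlapping windows of `window` SNPs.

    counts: list of (ref_count, alt_count) in genomic order.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    lafs = [lesser_allele_fraction(ref, alt) for ref, alt in counts]
    return [sum(lafs[i:i + window]) / window
            for i in range(0, len(lafs) - window + 1, window)]
```

A balanced region yields window averages near 0.5, while a region where one haplotype has three times the coverage of the other yields averages near 0.25.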
We denote the averaged LAF values of windows by $L = \{l_1, \ldots, l_n\}$. We also denote
the states of our HMM by $S = \{s_1, \ldots, s_k\}$. Our HMM had 11 states,
and each state corresponded to a distinct expected LAF value. The number of states
was determined by plotting the LAF values. Transition probabilities $P(s_{i+1} \mid s_i)$ were
set to 0.5 for staying in the same state and to 0.5/10 = 0.05 for transitioning to a
different state. The first window in each chromosome arm was equally likely to be
in any of the states. Emission probabilities $P(l_i \mid s_j)$ were modeled using a Gaussian
distribution with the mean equal to $\mu_j$ and a standard deviation of 0.1.
We used an EM approach to learn the $\mu$ values. We started with an initial set of values
$\mu_{1 \ldots 11} = \{0.45, 0.42, 0.39, 0.37, 0.34, 0.31, 0.27, 0.24, 0.21, 0.18, 0.15\}$. At the E step, we
performed the HMM method using the current $\mu$ values to find the optimal assignment of
LAF values to the states. At the M step, we set $\mu_i$ to the average LAF of all windows
that were assigned to $s_i$. We repeated this procedure until convergence.
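The procedure above resembles a hard-assignment EM (sometimes called Viterbi training). The sketch below assumes that the "optimal assignment" at the E step is the Viterbi path, which the text does not state explicitly; the function names and the stopping rule (stop when the path no longer changes) are also assumptions. It requires at least two states.

```python
import math

def viterbi(laf, means, sd=0.1, stay=0.5):
    """Most likely state path for Gaussian emissions with a shared sd,
    a uniform start distribution, and uniform off-diagonal transitions."""
    k = len(means)
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (k - 1))

    def log_emit(x, mu):
        # Gaussian log-density up to a constant shared by all states.
        return -0.5 * ((x - mu) / sd) ** 2

    score = [log_emit(laf[0], mu) for mu in means]
    back = []
    for x in laf[1:]:
        new_score, pointers = [], []
        for j in range(k):
            best_i = max(range(k),
                         key=lambda i: score[i] + (log_stay if i == j else log_move))
            step = log_stay if best_i == j else log_move
            pointers.append(best_i)
            new_score.append(score[best_i] + step + log_emit(x, means[j]))
        score = new_score
        back.append(pointers)

    state = max(range(k), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def hard_em(laf, initial_means, max_iter=100):
    """Alternate Viterbi decoding (E step) and mean updates (M step)."""
    means = list(initial_means)
    path = None
    for _ in range(max_iter):
        new_path = viterbi(laf, means)
        if new_path == path:
            break
        path = new_path
        for j in range(len(means)):
            assigned = [x for x, s in zip(laf, path) if s == j]
            if assigned:  # states with no windows keep their current mean
                means[j] = sum(assigned) / len(assigned)
    return means, path
```

With the 11 initial means listed above, `hard_em(window_lafs, initial_means)` would return the refined means together with one state index per window.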
After the convergence of the EM algorithm, we grouped consecutive windows in
each chromosome arm that were in the same state. Groups of windows that were
assigned to $s_1$ were considered as diploid regions of the genome, or as regions with
even haplotype-specific copy numbers. This choice was made because LAF windows
of the normal sample were also assigned to this state. Other groups were considered
as CNVs. Since we were mainly interested in large CNV regions, we only considered
regions longer than 1 Mbp for the downstream analysis. None of the CNV
regions assigned to $s_{11}$ (with $\mu_{11} = 0.13$) were larger than 1 Mbp. Therefore, this
state was not included in the downstream analysis. We also did not include chr19 and
a region in chr3q in the downstream analysis, as they showed signs of chromothripsis.
The only windows assigned to state $s_7$ were in chr19; therefore, this state was also
excluded from the downstream analysis.
4.3.10 Leveraging somatic CNVs for detecting switch errors and connecting haplotypes
We exploited LAF values at large (greater than or equal to 1 Mbp) CNV regions to
detect and estimate the number of switch errors in the haplotype blocks, which were
introduced by either local or statistical phasing. For this purpose, we used the groups
of windows, $C$, that were produced in the previous section. We denote the HMM
state that windows in $c_i$ are assigned to by $\sigma(c_i)$. We assume that if a haplotype
block is fully contained within a genomic region covered by $c_i$, the expected fraction
of reads supporting the allele in the less prevalent haplotype is $\mu_{\sigma(c_i)}$. In the presence
of a switch error in the haplotype block at the genomic position $l_i$, the less prevalent
haplotype on one side of $l_i$ is $h_1$, while $h_2$ is the less prevalent haplotype on the other
side. Thus, the expected fraction of reads supporting the $h_1$ allele at SNPs on one side of
$l_i$ is $\mu_{\sigma(c_i)}$, while on the other side, the expected fraction of reads supporting the $h_2$ allele
is $\mu_{\sigma(c_i)}$.
To detect potential switch errors in each haplotype block and to further connect
blocks, we devised a hidden Markov model with two states S = {S1, S2}. In this
model, observations are the sequence of haplotype read counts at heterozygous posi-
tions covered by a haplotype block. Haplotypes h1 and h2 were assigned as the less
prevalent haplotype in states S1 and S2 respectively. In the case of no switch error,
all heterozygous variants in a haplotype block should be assigned to the same state.
In this model, emission probabilities specified the likelihood of observing the fraction
of reads supporting the $h_1$ (or $h_2$) allele in state $S_1$ (or $S_2$) at each heterozygous variant
$l_j$. Emission probabilities were modeled with a binomial distribution with parameters
$p = \mu_{\sigma(c_i)}$ and $n$, the total shotgun read coverage at $l_j$. Transition probabilities were
set to 0.999 for staying in the same state and 0.001 for transitioning to the other state.
A state transition within a haplotype block was considered a switch error.
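The two-state model can be sketched with a small Viterbi decoder over per-variant read counts. The sketch assumes that the state path is obtained by Viterbi decoding and that both states start with equal probability; neither detail is stated in the text, and the function names and input layout are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

def detect_switch_errors(h1_counts, totals, mu, stay=0.999):
    """Decode which haplotype is less prevalent at each heterozygous variant.

    h1_counts[j]: shotgun reads supporting the h1 allele at variant j.
    totals[j]:    total shotgun coverage at variant j.
    mu:           expected read fraction of the less prevalent haplotype
                  (0 < mu < 1).
    Returns (path, switches): path[j] is 0 if h1 is less prevalent and 1 if
    h2 is; switches lists the indices where the state changes.
    """
    log_stay, log_move = math.log(stay), math.log(1.0 - stay)

    def log_emit(j, state):
        k = h1_counts[j] if state == 0 else totals[j] - h1_counts[j]
        return log_binom_pmf(k, totals[j], mu)

    score = [log_emit(0, 0), log_emit(0, 1)]
    back = []
    for j in range(1, len(totals)):
        new_score, pointers = [], []
        for state in (0, 1):
            candidates = [score[prev] + (log_stay if prev == state else log_move)
                          for prev in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + log_emit(j, state))
        score = new_score
        back.append(pointers)

    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    switches = [j for j in range(1, len(path)) if path[j] != path[j - 1]]
    return path, switches
```

Because leaving a state costs a factor of 0.001, a single noisy variant is unlikely to trigger a switch, whereas a sustained reversal of the read fractions does.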
We first corrected the switch errors identified by this model, and then labeled
the haplotypes within a haplotype block as “less prevalent” and “more prevalent”
depending on which HMM state the variants of the haplotype block were assigned
to. We then proceeded to connect haplotype blocks in the genomic region covered
by each $c_i$ (with $\sigma(c_i) > 1$), such that haplotypes of the same frequency were joined
together.
4.4 Discussion
The study of cancer genomes at the haplotype resolution is still an uncharted area.
In this work we demonstrated that recent synthetic long sequencing technologies can
be effectively utilized to produce highly accurate and long-range phase information
between germline and somatic SNVs. As a proof of concept, we applied our frame-
work to reconstruct the haplotypes of an invasive breast cancer sample, which was
sequenced through the Moleculo protocol.
The study of inherited susceptibility to cancer has been an active area of research
in the past few decades. As part of these efforts, a growing list of germline mutations
associated with developing cancer has been revealed (e.g. [17, 89, 32, 57]). On the
basis of Knudson’s two-hit model of cancer, for a tumor to develop, an individual who
has inherited a mutated copy of a tumor suppressor gene should also develop a disruptive
somatic mutation on the other copy of the gene. By producing megabase-length
haplotypes, we provide the opportunity to study such interplays between somatic and
germline mutations in coding regions of a cancer genome.
Although we were able to correctly infer the parental chromosome-of-origin of somatic
SNVs, our capability to reconstruct somatic haplotypes in heterogeneous cancer
samples is inherently limited by the length of input DNA fragments. In a heterogeneous
tumor, depending on which mutations are harbored by each subpopulation of
tumor cells, somatic SNVs that occurred on the same parental chromosome can in
fact be distributed among different haplotypes. The correct combination of somatic
SNVs on these haplotypes can only be inferred if DNA fragments are long enough to
encompass at least two somatic SNVs. However, the infrequent occurrence of somatic
mutations, combined with the relatively short length of synthetic long reads compared
to the distance between somatic mutations, makes this task infeasible. Future advances
in third-generation sequencing technologies will potentially help overcome this
limitation by producing longer fragments. Upon the availability of such data, our
local phase method can be extended to build more than two somatic haplotypes by
leveraging connected somatic SNVs.
Chapter 5
Applications of haplotype phasing
In this chapter, I discuss some promising applications of haplotype inference. First,
I demonstrate how phase information, coupled with Moleculo read cloud data, can
be utilized to improve the specificity of variant calls. Next, I describe how the local
phase method described in the previous chapter can be employed to identify cryptic
segmental duplications of a genome, and how it can be extended to reconstruct the
underlying haplotypes. Finally, I discuss the key role of haplotype information in
reconstructing phylogenetic trees of tumor samples.
5.1 Enhancing the accuracy of variant calls
As discussed in Section 2.2, detection of germline and somatic alterations of cancer genomes
has a central role in characterizing cancer samples. However, the accuracy of current
variant calling procedures varies considerably along the genome depending on several
factors such as sequencing errors, alignment errors, and the allele fraction of somatic
variants. The depth of sequencing coverage in a tumor sample and its matched
normal sample further confounds the identification of somatic mutations. Having
an accurate set of variants is a crucial first step in many cancer studies, including
the work described in Chapter 3. Therefore, variant calling is typically followed by
rounds of filtering steps and experimental validation. As shown in previous studies,
haplotype information can provide an additional and valuable layer of information to
enhance the accuracy of variant calls [69, 70].
In most cases, sequencing and mapping errors have an equal chance of occurring on
maternal or paternal chromosomes. The occurrence of homozygous somatic mutations,
on the other hand, is a particularly rare event. Therefore, one can hypothesize
that the presence of read clouds harboring the variant allele in both haplotypes is
evidence of a false positive somatic call. Another common signature of a false variant
call is a strong presence of mixed read clouds at the variant site. Since a read cloud
originates from a single molecule, it should only support one allele at each genomic
position. We refer to read clouds in which some shotgun reads support the reference
allele at a variant site, while the rest support the variant allele, as mixed read clouds.
The presence of several mixed read clouds at a variant site suggests
systematic sequencing or alignment errors in that region. Figure 5.1 displays read
clouds covering two variants that were called as somatic by GATK in the study de-
scribed in the previous chapter. The first example exhibits the signature of a true
somatic variant: all read clouds in the somatic haplotype have matching alleles with
only one parental haplotype at their encompassing germline variants. The second
example is most likely a false call. There is a high prevalence of mixed read clouds
at this position, and read clouds harboring the variant allele are distributed between
both parental haplotypes. About 5% of somatic variants called in that study had
such signatures of a false call.
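The two signatures described above can be combined into a simple screening rule. The sketch below is only an illustration: the function name, the input layout (one `(haplotype, allele)` pair per read cloud covering the site), the returned labels, and in particular the `max_mixed_frac` threshold are hypothetical and are not taken from the study.

```python
def assess_somatic_call(clouds, max_mixed_frac=0.2):
    """Screen a somatic call using read cloud evidence.

    clouds: list of (haplotype, allele) pairs, with haplotype in {"h1", "h2"}
            and allele in {"ref", "alt", "mixed"}.
    max_mixed_frac: hypothetical cutoff on the fraction of mixed read clouds.
    Returns "consistent", "likely_false", or "uninformative".
    """
    if not clouds:
        return "uninformative"
    mixed = sum(1 for _, allele in clouds if allele == "mixed")
    if mixed / len(clouds) > max_mixed_frac:
        # Many mixed clouds point to systematic sequencing or alignment errors.
        return "likely_false"
    alt_haplotypes = {hap for hap, allele in clouds if allele == "alt"}
    if len(alt_haplotypes) > 1:
        # Variant allele seen on both parental haplotypes.
        return "likely_false"
    if len(alt_haplotypes) == 1:
        return "consistent"
    return "uninformative"
```

A call labeled "consistent" here has only passed these two checks; it would still be subject to the other filters described in this chapter.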
The differentiation of somatic from germline alterations can also benefit from
haplotype information in samples with high levels of normal contamination. To dis-
tinguish a somatic variant from a germline substitution, studies leverage sequence
reads from a matched normal sample. A variant is designated as a somatic candidate
if it is not supported by sequence reads from the normal sample. However, minor
sampling biases in regions of low sequencing coverage in the normal sample can cause
misclassification of germline variations as somatic candidates. In samples with high
levels of normal contamination, such misclassifications can be identified based on the
distribution of read clouds harboring the reference allele. Since all read clouds origi-
nating from normal cells carry the reference allele at a true somatic variant site, these
read clouds should be distributed evenly across both parental haplotypes. The dearth
of read clouds carrying the reference allele in at least one parental haplotype suggests
that the mutation is in reality a germline alteration.
Figure 5.1: Two examples of somatic variants called by GATK in the IDC sample. A. A true somatic variant. The somatic haplotype is shown on top. Read clouds in the somatic haplotype agree with h1 on germline alleles. There are no mixed read clouds at this position. B. A likely false variant call. Read clouds with the variant allele are distributed within both copies of the chromosome. A high fraction of read clouds show evidence for both alleles at this position. (Legend: germline SNV, somatic SNV, reference allele, variant allele, mixed alleles.)
Haplotype information can also be exploited to correct genotype errors at ho-
mozygous germline point mutations. A homozygous genotype can be miscalled as a
heterozygous genotype depending on the fraction of reads supporting the true allele.
At a true heterozygous variant site, read clouds with reference and alternate alleles
should be segregated into the two parental haplotypes. Therefore, the support of
the same allele in both haplotypes by the majority of read clouds indicates that the
genotype is in fact homozygous for that allele.
5.2 Identifying and phasing cryptic segmental duplications
Segmental duplications (SDs) are large duplicated DNA segments within a haploid
genome with near-identical sequences. The role of segmental duplications in evolution
and their associations with genetic diseases have been well documented in many
studies [22, 5, 45, 31, 76]. In this section, I show how Moleculo read clouds can be
utilized to infer the haplotype of SD copies that are present in a target sample but
are absent in the human reference genome.
If a segmental duplication has more than two copies but only one copy is present in
the reference genome, most sequence reads arising from these copies will be mapped
to the one copy present in the reference. Consequently, single nucleotides that are
unique to each copy can easily be mistaken for heterozygous variants. These incorrect
variant calls, in turn, result in phasing errors because two haplotypes may not be
sufficient to explain the differences between the associated copies. Figure 5.2 depicts
an example of a segmental duplication in chr17 (KCNJ12) that we had identified in
the sequenced IDC sample, as part of the study described in the previous chapter.
As the figure shows, read clouds in h2 have different signatures, which suggests that
they originated from di↵erent copies. Read clouds assigned to h1, however, exhibit
matching alleles.
We assigned the read clouds in h2 to two haplotypes using the method described
in 4.3.5. The result, which is depicted in Figure 5.3, suggests that three haplotypes
were sufficient to capture the unique signatures of read clouds in this region. These
three inferred haplotypes were also reported in [34]. In that study, Genovese et al.
identified three paralogs in this region (KCNJ12, KCNJ17 and KCNJ18), and listed
the paralogous variants on each copy. Our inferred haplotypes were in complete
agreement with the haplotypes reported in that study on the reported variants.
This example shows the capability of our phasing procedure to phase segmental
duplication regions by effectively utilizing Moleculo read clouds, without the need for
any additional information. In regions where read clouds cannot be explained with
two haplotypes, we can dynamically increase the number of haplotypes and reassign
read clouds to the new set of haplotypes using the same set of moves as described
in 4.3.5. This approach offers a cost-effective method for the identification and
haplotype inference of segmental duplications. In recent years, multiple methods have
been developed for leveraging long-range sequence information from long-read
single-molecule or synthetic long-read sequencing technologies to uniquely map sequence
reads to repeated regions of the genome and enable variant calling in these regions [12,
43]. Our approach, along with these methods, can resolve complex repeated regions of
the genome by phasing the newly identified variants and reconstructing the underlying
copies. Together, these efforts provide a new platform to study the association of
specific haplotypes in these regions with human disease.
5.3 Increasing the resolution of phylogenetic inference methods
As demonstrated in chapter 3, phylogenetic trees provide an invaluable framework for
studying the evolution of a genome and for characterizing the heterogeneity within
a tumor. In recent years, several studies have been conducted to infer phylogenetic
relationships between multiple samples extracted from various regions of a single
tumor [36, 65, 93], or from different metastases or tumor regions within a patient
[16, 92, 35, 39]. These studies utilize patterns of variant sharing among samples,
variant allele fractions, and also cellular prevalence of somatic point mutations to
reconstruct tumor lineage trees and to infer the chronological order of somatic events
during cancer progression.
Aneuploidies and copy number variants present a significant challenge in the cor-
rect estimation of cellular prevalences. These types of variants alter prevalent allele
fractions, which are utilized for estimating what portion of cancer cells carry a somatic
mutation. There are existing tools for identifying CNV regions within a genome and
for estimating total copy numbers within a region. However, the cellular prevalence of
a somatic mutation can only be correctly inferred if the copy number of its harboring
parental chromosome is known, which requires haplotype information.
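A small worked example may help show why the haplotype-specific copy number matters. The sketch below uses a commonly used mixture model that is not taken from this dissertation: the expected variant allele fraction is assumed to be p * f * m / (p * N_T + (1 - p) * N_N), where p is tumor purity, f the cellular prevalence, m the number of copies of the mutated allele per mutated cell, N_T the total tumor copy number, and N_N the normal copy number. The model also assumes that all cancer cells share the same copy number state at the locus. All names are hypothetical.

```python
def cellular_prevalence(vaf, purity, total_cn, mutated_copies, normal_cn=2):
    """Invert the assumed mixture model to estimate cellular prevalence.

    vaf:            observed variant allele fraction
    purity:         fraction of cancer cells in the sample (0 < purity <= 1)
    total_cn:       total copy number of the locus in cancer cells
    mutated_copies: copies of the chromosome that carry the mutation
    normal_cn:      copy number in normal cells (2 for autosomes)
    """
    if purity <= 0 or mutated_copies <= 0:
        raise ValueError("purity and mutated_copies must be positive")
    average_cn = purity * total_cn + (1 - purity) * normal_cn
    return vaf * average_cn / (purity * mutated_copies)

# Same observation, two haplotype hypotheses, in a pure tumor with three
# copies of the locus (one parental chromosome duplicated):
on_single_copy = cellular_prevalence(0.25, 1.0, 3, mutated_copies=1)
on_duplicated = cellular_prevalence(0.25, 1.0, 3, mutated_copies=2)
```

Under this model the same allele fraction of 0.25 corresponds to a prevalence of 0.75 if the mutation sits on the single-copy chromosome and 0.375 if it sits on the duplicated one, so the total copy number alone cannot settle the estimate.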
Furthermore, a chromosome loss that occurs later in a cancer’s lifetime buries
important information about somatic mutations that the deleted chromosome once
harbored. As a result of this information loss, the position of some mutation groups
on the lineage tree cannot be confidently inferred. As an example, consider a case in
which two samples share a group of somatic mutations within a genomic region. If a
third sample from the same patient does not share this mutation group but has lost
one copy of the chromosome in that region, then there are two distinct possibilities.
This mutation group may have been shared among all three samples and was lost in
the third sample due to the chromosome deletion. Alternatively, the third sample may
have never carried this mutation group and these mutations occurred after the lineage
path of the third sample diverged from the other two. We can, however, deduce that
the latter hypothesis is correct, if we infer that these somatic point mutations belong
to the retained chromosome in the third sample. Therefore, haplotype assignment of
somatic mutations can resolve some phylogenetic ambiguities.
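The three-sample example above can be written as a small decision rule. This is only an illustrative sketch: the haplotype labels, the `Placement` enum, and the function name are hypothetical, and the rule assumes that the phasing of the mutation group and the identity of the retained chromosome are already known.

```python
from enum import Enum


class Placement(Enum):
    AFTER_DIVERGENCE = "mutations arose after the third sample diverged"
    AMBIGUOUS = "loss by deletion and later origin cannot be distinguished"


def place_mutation_group(mutation_haplotype, retained_haplotype):
    """Place a mutation group shared by two samples relative to a third
    sample that lacks it and has lost one chromosome copy in the region.

    mutation_haplotype -- haplotype the mutation group is phased to in the
                          samples that carry it, or None if unphased
    retained_haplotype -- haplotype the third sample still carries
    """
    if mutation_haplotype is None:
        # Without phasing, both scenarios remain possible.
        return Placement.AMBIGUOUS
    if mutation_haplotype == retained_haplotype:
        # The third sample still has this chromosome but not the mutations,
        # so the deletion cannot explain their absence.
        return Placement.AFTER_DIVERGENCE
    # The mutations sit on the chromosome the third sample lost.
    return Placement.AMBIGUOUS


print(place_mutation_group("h1", "h1").value)
print(place_mutation_group("h2", "h1").value)
```

Only the case in which the mutations are phased to the retained chromosome resolves the ambiguity; mutations phased to the deleted chromosome remain compatible with both histories.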
Figure 5.2: Identifying a segmental duplication event at KCNJ12. Read clouds overlapping the KCNJ12 gene are assigned to two parental haplotypes (h1 and h2) using the local phase method described in 4.3.5. It can be seen from the figure that read clouds in h2 arise from more than one genomic segment, suggesting that more than two haplotypes are necessary to separate the read clouds.
Figure 5.3: Inferring the haplotype sequence of KCNJ12 paralogs. Read clouds in h2 in Figure 5.2 are assigned to h2 and h3. h1, h2, and h3 provide haplotype sequences of KCNJ18, KCNJ17, and KCNJ12, respectively. KCNJ17 is the copy present in the reference genome assembly.
Chapter 6
Conclusion
The contribution of cancer genomics to almost all aspects of cancer research is indis-
putable. Ongoing technological advances are not only increasing the rate at which
cancer sequencing studies can be completed, but are also enabling a continually
higher-resolution view of the somatic changes to cancer genomes. The study of heterogeneity
in cancer genome evolution is now within reach in light of next-generation
sequencing technologies. New advances in long-read sequencing technologies are providing
the opportunity to characterize the elaborate interplay between somatically
acquired mutations and inherited alterations. These efforts are transforming our understanding
of tumor biology and are establishing a more detailed insight into the
dynamic evolution of cancer cells.
Chapter 3 of this dissertation provided a new view of breast cancer evolution
by studying the role of early neoplasias in the progression of cancer. This study
demonstrated that phylogenetic trees could be successfully inferred from multiple
lesions within a patient. It also illustrated how somatic point mutations in conjunction
with chromosomal abnormalities could serve as lineage markers to obtain such trees.
The inferred phylogenetic trees showed that, in some patients, early neoplasia cells
have a common ancestor with ductal carcinoma cells and, as such, they can reveal
early events in a cancer’s lifetime. The perspective achieved in this study was shown
to be more comprehensive than the current histological model of cancer progression.
Perhaps the most noteworthy finding in this study was the discovery of elevated
mutation load as well as widespread aneuploidies in all early neoplasia samples that
were genetically related to a ductal carcinoma lesion. This was a consistent pattern
across all early neoplasias and as such may have future diagnostic applications.
Chapter 4 exhibited the use of long-range sequence information from emerging
synthetic long read sequencing technologies to phase de novo and somatic mutations
in cancer genomes. A new toolset was proposed for phasing somatic and germline
mutations by leveraging sequence data from the Moleculo platform. This work demonstrated
how the unique characteristics of Moleculo data, coupled with somatic aneuploidies,
could be effectively exploited to produce highly accurate and long haplotype
blocks. The accuracy of the obtained haplotype blocks was also measured through
step-by-step independent validations. The resulting long haplotype blocks provide a
unique opportunity to study the interactions between somatic and germline variants
and their combined effects on cancer initiation and progression. Moreover, as discussed
in chapter 5, this information empowers a wide range of applications in cancer
research such as enabling more accurate and comprehensive SNV calling, detecting
complex variant types such as structural variations, identifying somatic variants in
repetitive regions of the genome, and obtaining more robust phylogenetic trees.
Taken together, these studies allow for a better understanding of the mutational
landscape of cancer genomes. Haplotype information can result in more robust phylogenetic
trees by providing better-refined sets of variant calls and copy numbers. At
the same time, sequencing multiple lesions from the same patient can also produce
even longer haplotype blocks by leveraging the union of aneuploidies across samples.
Longer haplotype blocks will fully encompass a higher number of genes, which in turn
can give further insights into the biological function of variants and the mechanism
of the disease.
Bibliography
[1] Tarek MA Abdel-Fatah, Desmond G Powe, Zsolt Hodi, Andrew HS Lee, Jorge S
Reis-Filho, and Ian O Ellis. High frequency of coexistence of columnar cell
lesions, lobular neoplasia, and low grade ductal carcinoma in situ with invasive
tubular carcinoma and invasive lobular carcinoma. The American journal of
surgical pathology, 31(3):417–426, 2007.
[2] Andrew Adey, Joshua N Burton, Jacob O Kitzman, Joseph B Hiatt, Alexandra P
Lewis, Beth K Martin, Ruolan Qiu, Choli Lee, and Jay Shendure. The haplotype-
resolved genome and epigenome of the aneuploid hela cancer cell line. Nature,
500(7461):207–211, 2013.
[3] Sasan Amini, Dmitry Pushkarev, Lena Christiansen, Emrah Kostem, Tom
Royce, Casey Turk, Natasha Pignatelli, Andrew Adey, Jacob O Kitzman,
Kandaswamy Vijayan, et al. Haplotype-resolved whole-genome sequencing by
contiguity-preserving transposition and combinatorial indexing. Nature genet-
ics, 2014.
[4] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather
Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight,
Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature
genetics, 25(1):25–29, 2000.
[5] Jorune Balciuniene, Ningping Feng, Kelly Iyadurai, Betsy Hirsch, Lawrence
Charnas, Brent R Bill, Mathew C Easterday, Johan Staaf, LeAnn Oseth, Desiree
Czapansky-Beilman, et al. Recurrent 10q22-q23 deletions: a genomic disorder
on 10q associated with cognitive and behavioral abnormalities. The American
Journal of Human Genetics, 80(5):938–947, 2007.
[6] Shantanu Banerji, Kristian Cibulskis, Claudia Rangel-Escareno, Kristin K
Brown, Scott L Carter, Abbie M Frederick, Michael S Lawrence, Andrey Y
Sivachenko, Carrie Sougnez, Lihua Zou, et al. Sequence analysis of mutations
and translocations across breast cancer subtypes. Nature, 486(7403):405–409,
2012.
[7] Vikas Bansal and Vineet Bafna. Hapcut: an efficient and accurate algorithm for
the haplotype assembly problem. Bioinformatics, 24(16):i153–i159, 2008.
[8] Vikas Bansal, Aaron L Halpern, Nelson Axelrod, and Vineet Bafna. An mcmc
algorithm for haplotype assembly from whole-genome sequence data. Genome
research, 18(8):1336–1346, 2008.
[9] Michael Baudis. Genomic imbalances in 5918 malignant epithelial tumors: an
explorative meta-analysis of chromosomal cgh data. BMC cancer, 7(1):226, 2007.
[10] Rameen Beroukhim, Craig H Mermel, Dale Porter, Guo Wei, Soumya Ray-
chaudhuri, Jerry Donovan, Jordi Barretina, Jesse S Boehm, Jennifer Dobson,
Mitsuyoshi Urashima, et al. The landscape of somatic copy-number alteration
across human cancers. Nature, 463(7283):899–905, 2010.
[11] Graham R Bignell, Chris D Greenman, Helen Davies, Adam P Butler, Sarah
Edkins, Jenny M Andrews, Gemma Buck, Lina Chen, David Beare, Calli La-
timer, et al. Signatures of mutation and selection in the cancer genome. Nature,
463(7283):893–898, 2010.
[12] Alex Bishara, Yuling Liu, Dorna Kashef-Haghighi, Ziming Weng, Daniel E New-
burger, Robert West, Arend Sidow, and Serafim Batzoglou. Read clouds uncover
variation in complex regions of the human genome. In Research in Computational
Molecular Biology, pages 30–31. Springer, 2015.
[13] Alessandro Bombonati and Dennis C Sgroi. The molecular pathology of breast
cancer progression. The Journal of pathology, 223(2):308–318, 2011.
[14] Sharon R Browning and Brian L Browning. Haplotype phasing: existing methods
and new developments. Nature Reviews Genetics, 12(10):703–714, 2011.
[15] Rebecca A Burrell, Nicholas McGranahan, Jiri Bartek, and Charles Swanton.
The causes and consequences of genetic heterogeneity in cancer evolution. Nature,
501(7467):338–345, 2013.
[16] Peter J Campbell, Shinichi Yachida, Laura J Mudie, Philip J Stephens, Erin D
Pleasance, Lucy A Stebbings, Laura A Morsberger, Calli Latimer, Stuart
McLaren, Meng-Lay Lin, et al. The patterns and dynamics of genomic instability
in metastatic pancreatic cancer. Nature, 467(7319):1109–1113, 2010.
[17] Silvia Casadei, Barbara M Norquist, Tom Walsh, Sunday Stray, Jessica B Man-
dell, Ming K Lee, John A Stamatoyannopoulos, and Mary-Claire King. Contri-
bution of inherited mutations in the brca2-interacting protein palb2 to familial
breast cancer. Cancer research, 71(6):2222–2229, 2011.
[18] Michael A Chapman, Michael S Lawrence, Jonathan J Keats, Kristian Cibulskis,
Carrie Sougnez, Anna C Schinzel, Christina L Harview, Jean-Philippe Brunet,
Gregory J Ahmann, Mazhar Adli, et al. Initial genome sequencing and analysis
of multiple myeloma. Nature, 471(7339):467–472, 2011.
[19] Zhi-Zhong Chen, Fei Deng, and Lusheng Wang. Exact algorithms for haplotype
assembly from whole-genome sequence data. Bioinformatics, page btt349, 2013.
[20] Kristian Cibulskis, Michael S Lawrence, Scott L Carter, Andrey Sivachenko,
David Jaffe, Carrie Sougnez, Stacey Gabriel, Matthew Meyerson, Eric S Lander,
and Gad Getz. Sensitive detection of somatic point mutations in impure and
heterogeneous cancer samples. Nature biotechnology, 31(3):213–219, 2013.
[21] Rudi Cilibrasi, Leo Van Iersel, Steven Kelk, and John Tromp. The complexity of
the single individual snp haplotyping problem. Algorithmica, 49(1):13–36, 2007.
[22] Gregory M Cooper, Bradley P Coe, Santhosh Girirajan, Jill A Rosenfeld,
Tiffany H Vu, Carl Baker, Charles Williams, Heather Stalker, Rizwan Hamid,
Vickie Hannig, et al. A copy number variation morbidity map of developmental
delay. Nature genetics, 43(9):838–846, 2011.
[23] Karen Crasta, Neil J Ganem, Regina Dagher, Alexandra B Lantermann, Elena V
Ivanova, Yunfeng Pan, Luigi Nezi, Alexei Protopopov, Dipanjan Chowdhury, and
David Pellman. Dna breaks and chromosome pulverization from errors in mitosis.
Nature, 482(7383):53–58, 2012.
[24] Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M
Rueda, Mark J Dunning, Doug Speed, Andy G Lynch, Shamith Samarajiwa,
Yinyin Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast
tumours reveals novel subgroups. Nature, 486(7403):346–352, 2012.
[25] Olivier Delaneau, Bryan Howie, Anthony J Cox, Jean-Francois Zagury, and
Jonathan Marchini. Haplotype estimation using sequencing reads. The American
Journal of Human Genetics, 93(4):687–696, 2013.
[26] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R
Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel,
Manuel A Rivas, Matt Hanna, et al. A framework for variation discovery
and genotyping using next-generation dna sequencing data. Nature genetics,
43(5):491–498, 2011.
[27] Li Ding, Matthew J Ellis, Shunqiang Li, David E Larson, Ken Chen, John W
Wallis, Christopher C Harris, Michael D McLellan, Robert S Fulton, Lucinda L
Fulton, et al. Genome remodelling in a basal-like breast cancer metastasis and
xenograft. Nature, 464(7291):999–1005, 2010.
[28] Li Ding, Timothy J Ley, David E Larson, Christopher A Miller, Daniel C
Koboldt, John S Welch, Julie K Ritchey, Margaret A Young, Tamara Lamprecht,
Michael D McLellan, et al. Clonal evolution in relapsed acute myeloid leukaemia
revealed by whole-genome sequencing. Nature, 481(7382):506–510, 2012.
[29] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul
Peluso, David Rank, Primo Baybayan, Brad Bettman, et al. Real-time dna
sequencing from single polymerase molecules. Science, 323(5910):133–138, 2009.
[30] Matthew J Ellis, Li Ding, Dong Shen, Jingqin Luo, Vera J Suman, John W
Wallis, Brian A Van Tine, Jeremy Hoog, Reece J Goiffon, Theodore C Goldstein,
et al. Whole-genome analysis informs breast cancer response to aromatase
inhibition. Nature, 486(7403):353–360, 2012.
[31] Beverly S Emanuel and Tamim H Shaikh. Segmental duplications:
an ‘expanding’ role in genomic instability and disease. Nature Reviews Genetics,
2(10):791–800, 2001.
[32] Megan N Farley, Laura S Schmidt, Jessica L Mester, Samuel Pena-Llopis, Andrea
Pavia-Jimenez, Alana Christie, Cathy D Vocke, Christopher J Ricketts, James
Peterson, Lindsay Middelton, et al. A novel germline mutation in bap1 pre-
disposes to familial clear-cell renal cell carcinoma. Molecular Cancer Research,
11(9):1061–1071, 2013.
[33] R Fisher, L Pusztai, and C Swanton. Cancer heterogeneity: implications for
targeted therapeutics. British journal of cancer, 108(3):479–485, 2013.
[34] Giulio Genovese, Robert E Handsaker, Heng Li, Nicolas Altemose, Amelia M
Lindgren, Kimberly Chambert, Bogdan Pasaniuc, Alkes L Price, David Reich,
Cynthia C Morton, et al. Using population admixture to help complete maps of
the human genome. Nature genetics, 45(4):406–414, 2013.
[35] Marco Gerlinger, Stuart Horswell, James Larkin, Andrew J Rowan, Max P Salm,
Ignacio Varela, Rosalie Fisher, Nicholas McGranahan, Nicholas Matthews, Clau-
dio R Santos, et al. Genomic architecture and evolution of clear cell renal cell
carcinomas defined by multiregion sequencing. Nature genetics, 46(3):225–233,
2014.
[36] Marco Gerlinger, Andrew J Rowan, Stuart Horswell, James Larkin, David En-
desfelder, Eva Gronroos, Pierre Martinez, Nicholas Matthews, Aengus Stewart,
Patrick Tarpey, et al. Intratumor heterogeneity and branched evolution revealed
by multiregion sequencing. New England Journal of Medicine, 366(10):883–892,
2012.
[37] David J Gordon, Benjamin Resio, and David Pellman. Causes and consequences
of aneuploidy in cancer. Nature Reviews Genetics, 13(3):189–203, 2012.
[38] Rodrigo Goya, Mark GF Sun, Ryan D Morin, Gillian Leung, Gavin Ha, Kimber-
ley C Wiegand, Janine Senz, Anamaria Crisan, Marco A Marra, Martin Hirst,
et al. Snvmix: predicting single nucleotide variants from next-generation se-
quencing of tumors. Bioinformatics, 26(6):730–736, 2010.
[39] Michael R Green, Andrew J Gentles, Ramesh V Nair, Jonathan M Irish, Shingo
Kihira, Chih Long Liu, Itai Kela, Erik S Hopmans, June H Myklebust, Hanlee
Ji, et al. Hierarchy in somatic mutations arising during genomic evolution and
progression of follicular lymphoma. Blood, 121(9):1604–1611, 2013.
[40] Christopher Greenman, Philip Stephens, Raffaella Smith, Gillian L Dalgliesh,
Christopher Hunter, Graham Bignell, Helen Davies, Jon Teague, Adam Butler,
Claire Stevens, et al. Patterns of somatic mutation in human cancer genomes.
Nature, 446(7132):153–158, 2007.
[41] Dan He, Arthur Choi, Knot Pipatsrisawat, Adnan Darwiche, and Eleazar Eskin.
Optimal algorithms for haplotype assembly from whole-genome sequence data.
Bioinformatics, 26(12):i183–i190, 2010.
[42] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Bioinformatics
enrichment tools: paths toward the comprehensive functional analysis of large
gene lists. Nucleic acids research, 37(1):1–13, 2009.
[43] John Huddleston, Swati Ranade, Maika Malig, Francesca Antonacci, Mark
Chaisson, Lawrence Hon, Peter H Sudmant, Tina A Graves, Can Alkan, Megan Y
Dennis, et al. Reconstructing complex regions of genomes using long-read se-
quencing technology. Genome research, 24(4):688–696, 2014.
[44] Dick G Hwang and Phil Green. Bayesian markov chain monte carlo sequence
analysis reveals varying neutral substitution patterns in mammalian evolution.
Proceedings of the National Academy of Sciences of the United States of America,
101(39):13994–14001, 2004.
[45] Yonggang Ji, Evan E Eichler, Stuart Schwartz, and Robert D Nicholls. Structure
of chromosomal duplicons and their role in mediating human genomic disorders.
Genome research, 10(5):597–610, 2000.
[46] Daniel C Koboldt, Qunyuan Zhang, David E Larson, Dong Shen, Michael D
McLellan, Ling Lin, Christopher A Miller, Elaine R Mardis, Li Ding, and
Richard K Wilson. Varscan 2: somatic mutation and copy number alteration
discovery in cancer by exome sequencing. Genome research, 22(3):568–576, 2012.
[47] Augustine Kong, Michael L Frigge, Gisli Masson, Soren Besenbacher, Patrick
Sulem, Gisli Magnusson, Sigurjon A Gudjonsson, Asgeir Sigurdsson, Aslaug
Jonasdottir, Adalbjorg Jonasdottir, et al. Rate of de novo mutations and the
importance of father’s age to disease risk. Nature, 488(7412):471–475, 2012.
[48] Volodymyr Kuleshov, Dan Xie, Rui Chen, Dmitry Pushkarev, Zhihai Ma, Tim
Blauwkamp, Michael Kertesz, and Michael Snyder. Whole-genome haplotyping
using long reads and statistical methods. Nature biotechnology, 32(3):261–266,
2014.
[49] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C
Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William
FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature,
409(6822):860–921, 2001.
[50] David E Larson, Christopher C Harris, Ken Chen, Daniel C Koboldt, Travis E
Abbott, David J Dooling, Timothy J Ley, Elaine R Mardis, Richard K Wilson,
and Li Ding. Somaticsniper: identification of somatic point mutations in whole
genome sequencing data. Bioinformatics, 28(3):311–317, 2012.
[51] Rebecca J Leary, Jimmy C Lin, Jordan Cummins, Simina Boca, Laura D Wood,
D Williams Parsons, Sian Jones, Tobias Sjoblom, Ben-Ho Park, Ramon Parsons,
et al. Integrated analysis of homozygous deletions, focal amplifications, and
sequence alterations in breast and colorectal cancers. Proceedings of the National
Academy of Sciences, 105(42):16224–16229, 2008.
[52] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern,
Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady
Denisov, et al. The diploid genome sequence of an individual human. PLoS
biology, 5(10):e254, 2007.
[53] Timothy J Ley, Elaine R Mardis, Li Ding, Bob Fulton, Michael D McLellan,
Ken Chen, David Dooling, Brian H Dunford-Shore, Sean McGrath, Matthew
Hickenbotham, et al. Dna sequencing of a cytogenetically normal acute myeloid
leukaemia genome. Nature, 456(7218):66–72, 2008.
[54] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with
bwa-mem. arXiv preprint arXiv:1303.3997, 2013.
[55] Pengfei Liu, Ayelet Erez, Sandesh C Sreenath Nagamani, Shweta U Dhar,
Katarzyna E Kołodziejska, Avinash V Dharmadhikari, M Lance Cooper, Joanna
Wiszniewska, Feng Zhang, Marjorie A Withers, et al. Chromosome catastrophes
involve replication mechanisms generating complex genomic rearrangements.
Cell, 146(6):889–903, 2011.
[56] Maria A Lopez-Garcia, Felipe C Geyer, Magali Lacroix-Triki, Caterina Marchio,
and Jorge S Reis-Filho. Breast cancer precursors revisited: molecular features
and progression pathways. Histopathology, 57(2):171–192, 2010.
[57] Chey Loveday, Clare Turnbull, Elise Ruark, Rosa Maria Munoz Xicola, Emma
Ramsay, Deborah Hughes, Margaret Warren-Perry, Katie Snape, Diana Eccles,
D Gareth Evans, et al. Germline rad51c mutations confer susceptibility to ovarian
cancer. Nature genetics, 44(5):475–476, 2012.
[58] Christopher A Maher and Richard K Wilson. Chromothripsis and human disease:
piecing together the shattering process. Cell, 148(1):29–32, 2012.
[59] Elaine R Mardis. Genome sequencing and cancer. Current opinion in genetics
& development, 22(3):245–250, 2012.
[60] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian
Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel,
Mark Daly, et al. The genome analysis toolkit: a mapreduce framework for
analyzing next-generation dna sequencing data. Genome research, 20(9):1297–
1303, 2010.
[61] Matthew Meyerson and David Pellman. Cancer genomes evolve by pulverizing
single chromosomes. Cell, 144(1):9–10, 2011.
[62] Ryan E Mills, Christopher T Luttig, Christine E Larkins, Adam Beauchamp,
Circe Tsui, W Stephen Pittard, and Scott E Devine. An initial map of inser-
tion and deletion (indel) variation in the human genome. Genome research,
16(9):1182–1190, 2006.
[63] Felix Mitelman, Bertil Johansson, and Fredrik Mertens. Mitelman database of
chromosome aberrations in cancer. Cancer Genome Anatomy Project., 2007.
[64] Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers,
Jeanne McIndoo, Kerry Cook, Asya Stepansky, Dan Levy, Diane Esposito, et al.
Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90–94,
2011.
[65] Nicholas Navin, Alexander Krasnitz, Linda Rodgers, Kerry Cook, Jennifer Meth,
Jude Kendall, Michael Riggs, Yvonne Eberling, Jennifer Troge, Vladimir Grubor,
et al. Inferring tumor progression from genomic heterogeneity. Genome research,
20(1):68–80, 2010.
[66] Daniel E Newburger, Dorna Kashef-Haghighi, Ziming Weng, Raheleh Salari,
Robert T Sweeney, Alayne L Brunner, Shirley X Zhu, Xiangqian Guo, Sushama
Varma, Megan L Troxell, et al. Genome evolution during progression to breast
cancer. Genome research, 23(7):1097–1108, 2013.
[67] Serena Nik-Zainal, Ludmil B Alexandrov, David C Wedge, Peter Van Loo,
Christopher D Greenman, Keiran Raine, David Jones, Jonathan Hinton, John
Marshall, Lucy A Stebbings, et al. Mutational processes molding the genomes
of 21 breast cancers. Cell, 149(5):979–993, 2012.
[68] Serena Nik-Zainal, Peter Van Loo, David C Wedge, Ludmil B Alexandrov,
Christopher D Greenman, King Wai Lau, Keiran Raine, David Jones, John Mar-
shall, Manasa Ramakrishna, et al. The life history of 21 breast cancers. Cell,
149(5):994–1007, 2012.
[69] Brock A Peters, Bahram G Kermani, Oleg Alferov, Misha R Agarwal, Mark A
McElwain, Natali Gulbahce, Daniel M Hayden, Y Tom Tang, Rebecca Yu Zhang,
Rick Tearle, et al. Detection and phasing of single base de novo mutations
in biopsies from human in vitro fertilized embryos by advanced whole-genome
sequencing. Genome research, 25(3):426–434, 2015.
[70] Brock A Peters, Bahram G Kermani, Andrew B Sparks, Oleg Alferov, Peter
Hong, Andrei Alexeev, Yuan Jiang, Fredrik Dahl, Y Tom Tang, Juergen Haas,
et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human
cells. Nature, 487(7406):190–195, 2012.
[71] Erin D Pleasance, R Keira Cheetham, Philip J Stephens, David J McBride,
Sean J Humphray, Chris D Greenman, Ignacio Varela, Meng-Lay Lin, Gon-
zalo R Ordonez, Graham R Bignell, et al. A comprehensive catalogue of somatic
mutations from a human cancer genome. Nature, 463(7278):191–196, 2010.
[72] Erin D Pleasance, Philip J Stephens, Sarah O’Meara, David J McBride, Alison
Meynert, David Jones, Meng-Lay Lin, David Beare, King Wai Lau, Chris Green-
man, et al. A small-cell lung cancer genome with complex signatures of tobacco
exposure. Nature, 463(7278):184–190, 2010.
[73] Yardena Samuels, Zhenghe Wang, Alberto Bardelli, Natalie Silliman, Janine
Ptak, Steve Szabo, Hai Yan, Adi Gazdar, Steven M Powell, Gregory J Riggins,
et al. High frequency of mutations of the pik3ca gene in human cancers. Science,
304(5670):554–554, 2004.
[74] Sohrab P Shah, Ryan D Morin, Jaswinder Khattra, Leah Prentice, Trevor Pugh,
Angela Burleigh, Allen Delaney, Karen Gelmon, Ryan Guliany, Janine Senz,
et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide
resolution. Nature, 461(7265):809–813, 2009.
[75] Sohrab P Shah, Andrew Roth, Rodrigo Goya, Arusha Oloumi, Gavin Ha,
Yongjun Zhao, Gulisa Turashvili, Jiarui Ding, Kane Tse, Gholamreza Haffari,
et al. The clonal and mutational evolution spectrum of primary triple-negative
breast cancers. Nature, 486(7403):395–399, 2012.
[76] Christine J Shaw and James R Lupski. Implications of human genome archi-
tecture for rearrangement-based disorders: the genomic basis of disease. Human
molecular genetics, 13(suppl 1):R57–R64, 2004.
[77] Rebecca Siegel, Jiemin Ma, Zhaohui Zou, and Ahmedin Jemal. Cancer statistics,
2014. CA: a cancer journal for clinicians, 64(1):9–29, 2014.
[78] Peter T Simpson, Theo Gale, Jorge S Reis-Filho, Chris Jones, Suzanne Parry,
John P Sloane, Andrew Hanby, Sarah E Pinder, Andrew HS Lee, Steve
Humphreys, et al. Columnar cell lesions of the breast: the missing link in breast
cancer progression?: a morphological and molecular analysis. The American
journal of surgical pathology, 29(6):734–746, 2005.
[79] Matthew W Snyder, Andrew Adey, Jacob O Kitzman, and Jay Shendure.
Haplotype-resolved genome sequencing: experimental methods and applications.
Nature Reviews Genetics, 2015.
[80] American Cancer Society. Cancer facts and figures 2015. Atlanta: American
Cancer Society, 2015.
[81] Philip J Stephens, Chris D Greenman, Beiyuan Fu, Fengtang Yang, Graham R
Bignell, Laura J Mudie, Erin D Pleasance, King Wai Lau, David Beare, Lucy A
Stebbings, et al. Massive genomic rearrangement acquired in a single catastrophic
event during cancer development. Cell, 144(1):27–40, 2011.
[82] Michael R Stratton. Exploring the genomes of cancer cells: progress and promise.
Science, 331(6024):1553–1558, 2011.
[83] Michael R Stratton, Peter J Campbell, and P Andrew Futreal. The cancer
genome. Nature, 458(7239):719–724, 2009.
[84] John Sved and Adrian Bird. The expected equilibrium of the cpg dinucleotide
in vertebrate genomes under a mutation model. Proceedings of the National
Academy of Sciences, 87(12):4692–4696, 1990.
[85] Megan L Troxell, Alayne L Brunner, Tanaya Neff, Andrea Warrick, Carol Beadling,
Kelli Montgomery, Shirley Zhu, Christopher L Corless, and Robert B West.
Phosphatidylinositol-3-kinase pathway mutations are common in breast colum-
nar cell lesions. Modern Pathology, 25(7):930–937, 2012.
[86] Samra Turajlic, Simon J Furney, Maryou B Lambros, Costas Mitsopoulos,
Iwanka Kozarewa, Felipe C Geyer, Alan MacKay, Jarle Hakas, Marketa Zvelebil,
Christopher J Lord, et al. Whole genome sequencing of matched primary and
metastatic acral melanomas. Genome research, 22(2):196–207, 2012.
[87] Bala Murali Venkatesan and Rashid Bashir. Nanopore sensors for nucleic acid
analysis. Nature nanotechnology, 6(10):615–624, 2011.
[88] J Craig Venter, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural,
Granger G Sutton, Hamilton O Smith, Mark Yandell, Cheryl A Evans, Robert A
Holt, et al. The sequence of the human genome. Science, 291(5507):1304–1351,
2001.
[89] Tom Walsh, Silvia Casadei, Ming K Lee, Christopher C Pennil, Alex S Nord,
Anne M Thornton, Wendy Roeb, Kathy J Agnew, Sunday M Stray, Anneka
Wickramanayake, et al. Mutations in 12 genes for inherited ovarian, fallopian
tube, and peritoneal carcinoma identified by massively parallel sequencing. Pro-
ceedings of the National Academy of Sciences, 108(44):18032–18037, 2011.
[90] Matthew J Walter, Dong Shen, Li Ding, Jin Shao, Daniel C Koboldt, Ken Chen,
David E Larson, Michael D McLellan, David Dooling, Rachel Abbott, et al.
Clonal architecture of secondary acute myeloid leukemia. New England Journal
of Medicine, 366(12):1090–1098, 2012.
[91] Xiaochong Wu, Paul A Northcott, Adrian Dubuc, Adam J Dupuy, David JH
Shih, Hendrik Witt, Sidney Croul, Eric Bouffet, Daniel W Fults, Charles G
Eberhart, et al. Clonal selection drives genetic divergence of metastatic medul-
loblastoma. Nature, 482(7386):529–533, 2012.
[92] Shinichi Yachida, Sian Jones, Ivana Bozic, Tibor Antal, Rebecca Leary, Baojin
Fu, Mihoko Kamiyama, Ralph H Hruban, James R Eshleman, Martin A Nowak,
et al. Distant metastasis occurs late during the genetic evolution of pancreatic
cancer. Nature, 467(7319):1114–1117, 2010.
[93] Jianjun Zhang, Junya Fujimoto, Jianhua Zhang, David C Wedge, Xingzhi Song,
Jiexin Zhang, Sahil Seth, Chi-Wan Chow, Yu Cao, Curtis Gumbs, et al. Intratu-
mor heterogeneity in localized lung adenocarcinomas delineated by multiregion
sequencing. Science, 346(6206):256–259, 2014.