ALGORITHMS FOR DECODING CANCER GENOMES:

PHYLOGENETIC INFERENCE AND HAPLOTYPE

ASSEMBLY

a dissertation

submitted to the department of computer science

and the committee on graduate studies

of stanford university

in partial fulfillment of the requirements

for the degree of

doctor of philosophy

By

Dorna Kashef-Haghighi

June 2015

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/yn394gh2333

© 2015 by Dorna Kashef-Haghighi. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou, Primary Adviser


David Dill


Arend Sidow

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

The field of cancer genomics is expanding rapidly due to major advancements in

the sequencing technologies. Only a decade ago, the cost and limited throughput of

DNA sequencing made the study of cancer genome alterations at base-pair resolution

infeasible. Today, whole-genome sequencing of tumor populations is commonplace.

Cancer evolves through cycles of cell damage and a series of clonal expansions, mark-

ing the genome with new alterations along the way. To unravel the life history of

cancer genomes, new computational methods are needed to take advantage of the

wealth of whole genome sequence data now available.

In this dissertation, I first describe my work developing computational methods

for studying the role of early neoplasias in breast cancer evolution and show how

these methods can reveal robust clonal lineages and identify cancer progenitor mu-

tations. Next, I describe a probabilistic approach for haplotype reconstruction of an

invasive breast carcinoma genome using long DNA fragments from Moleculo sequenc-

ing technology. I show how cancer-specific aneuploidies can be leveraged to achieve

megabase-length haplotypes with high accuracy. Finally, I demonstrate applications

of phase information for detecting false somatic variant calls, and for identifying and

phasing segmental duplications.

Acknowledgement

The work presented in this doctoral dissertation is the result of the support and help

of many amazing mentors, collaborators, and friends throughout my graduate career

at Stanford University. I would like to take this opportunity to extend my sincere

gratitude and appreciation to the following people.

I am deeply grateful to my advisor, Serafim Batzoglou, for his mentorship, super-

vision, and all the novel technical ideas he brought to my thesis work. I am indebted

to him for offering me the freedom to explore, and the opportunity to learn. His

endless encouragement always inspired me to conquer the most challenging obstacles in

my research, and his wealth of knowledge guided me to the correct path. Thank you

Serafim for making my journey at Stanford a fascinating and stimulating experience.

Your guidance and support were critical to my growth as a researcher, and I feel

privileged for being one of your students.

I am indebted to Arend Sidow for his mentorship and his integral role in the

inception and development of this work. Arend, I can never thank you enough for

developing my appreciation of this field and for sharing your honest opinion and

generous feedback at all times. Thank you for always finding the time in your busy

schedule to meet with me and for patiently teaching me various subjects and skills.

Having the opportunity to work with you has been a tremendous honor for me.

I would like to express my gratitude to David Dill for serving on my qualification,

oral, and reading committees. His insightful questions and generous suggestions were

critical to the improvement of the work presented here. I thank Anshul Kundaje for

serving on my oral committee and offering invaluable suggestions. I am indebted to

Jonathan Pritchard for chairing my oral session. I would also like to acknowledge my

funding source, the Stanford Graduate Fellowship, for making this work possible. My

undergraduate advisors, Doina Precup and Prakash Panangaden, were instrumental

in my decision to pursue a PhD.

During my time at Stanford I was very fortunate to have the chance to collaborate

closely with many outstanding researchers. In particular, I would like to thank Robert

West, Daniel Newburger, Sivan Bercovici, and Ziming Weng. The cancer study pre-

sented in this dissertation was shaped by remarkable contributions of Robert West,

who not only supervised the research effort but also enriched it by providing the

necessary clinical perspective. The high quality of the sequence data, and the validation

analysis described in Chapter 3, were made possible only by the laborious efforts of

Ziming, who managed all the wet lab work. I am indebted to Sivan for contributing many

brilliant ideas to this work, and for inspiring me and helping me through numerous

technical conversations. Dan, you are an amazing collaborator and a wonderful friend.

Your diligence and unique skill set were the driving force behind the completion of

our joint project. Your humor and vast knowledge made our co-teaching CS374 the

most fun teaching experience I have ever had. Your friendship, positive spirit, and

generous advice helped me through many stressful moments of graduate school. For

all these, I am truly thankful. I would also like to thank my other collaborators

Raheleh Salari, Noah Spies, Alayne Brunner, and Robert Sweeney for their valuable

contributions and input.

I am thankful to all current, former, and affiliated members of the Batzoglou lab

for a supportive lab culture. I would like to specially thank Alex, Daniel, Iman, Irene,

Jesse, Lin, Marc, Marina, Raheleh, Sarah, Sivan, Sofia, Tiffany, Victoria, Volodymyr,

and Yuling for many fruitful discussions and fun times. I am grateful to Sarah, Marc,

and Tiffany for welcoming me to the lab and offering me generous advice as I was

starting this journey. Victoria, thank you for all the scientific and non-scientific chats

we had together, and for always surprising me with your kindness.

My life at Stanford would not have been as enjoyable and fulfilling without the help

and support of many amazing friends. I would like to particularly thank Maryam,

Rasoul, Parisa, Ali, Farzaneh, Alireza Marandi, Marjan, Milad, Arezou, Pedram,

Leila, Bernd, Parnian, Nastaran, Hooman, Shirin, Reza, Ehsan, Solene, and Alireza

Sharafat for creating many memorable moments in my life. I would also like to

thank Faezeh, Morteza, Mojdeh, Vahid, Arefeh, Maryam Khezr, Golnoosh, Negin,

and Bahareh for being my true friends even though we live many miles away.

I could not have gotten this far if it was not for the selfless support and help of

my family: my parents, and my sisters Semira and Sormeh. I am forever grateful to

my parents for having raised me to appreciate the value of learning science, and for

constantly reminding me to believe in myself. Thank you, mom, for always listening

to me and giving me your sincere advice. Thank you for teaching me how to love

wholeheartedly and how to cherish little things in life. And thank you for all the time

you spent with me during school years to help me excel. Thank you, dad, for instilling

in me the value of work ethic, for your unconditional love, and for the example that

you have set for me. I feel incredibly fortunate to have you all as my family.

Lastly, and on a more personal note, I would like to sincerely thank my dearest

Fardad. For it was he who exquisitely transformed my fears and despairs into strength

and persistence, and artistically turned my uncertainties into perspective. He stood

by me during the most intense periods of my studies and brilliantly guided me to

overcome each and every one of the hurdles I encountered along the way. Completing

this dissertation without having him by my side is simply unimaginable.

Joint Work

The work in Chapter 3 was published in Genome Research [66]. I would like to thank

my co-first authors Daniel Newburger and Ziming Weng for their unique contributions

to the work. I am grateful to Arend Sidow, Serafim Batzoglou, and Robert West for

their continuous guidance and supervision of the project. I would also like to thank

my other co-authors Raheleh Salari, Robert Sweeney, Alayne Brunner, Shirley Zhu,

Xiangqian Guo, Sushama Varma, and Megan Troxell.

Chapter 4 would not be possible without the supervision and technical lead of

Serafim Batzoglou, Arend Sidow, and Robert West. I sincerely thank Sivan Bercovici

for contributing many indispensable ideas. I acknowledge Ziming Weng for generating

the sequencing data and also her contributions to our discussion. I also thank Alex

Bishara, Noah Spies, and Daniel Newburger for discussions and their contributions

to this work.

Contents

Abstract

Acknowledgement

1 Introduction

2 Background

2.1 Genome and terminology

2.1.1 Types of genomic variations

2.1.2 Cancer Genomics

2.2 Variant calling in cancer samples

2.3 Haplotype phasing

3 Genome evolution of Breast Cancer

3.1 Abstract

3.2 Introduction

3.3 Results

3.3.1 Whole-genome sequencing of early neoplasias and related carcinomas from archival material

3.3.2 Somatic SNVs fall into a limited and highly structured set of classes

3.3.3 Allele frequencies of somatic SNVs support common ancestral relationships

3.3.4 Mutated neoplasias are evolutionarily related to carcinomas

3.3.5 Point-mutational mechanisms are evolutionarily stable and reproducible among cases

3.3.6 Aneuploidies are the dominant evolutionary feature of progression

3.4 Methods

3.4.1 Identification and processing of neoplasias

3.4.2 Library construction and sequencing

3.4.3 Read mapping and BAM file processing

3.4.4 Multisample SNV calling

3.4.5 Determination of somatic SNV class patterns and of robust sharing classes

3.4.6 PCR-based validation of SNVs and accuracy assessment of whole-genome calls

3.4.7 Aneuploidy and tumor purity

3.4.8 SNV mutation spectra

3.4.9 Tree inference

3.4.10 Ordering SNVs vs. chromosome 1q ploidy gain in ancestral branches

3.5 Discussion

4 Haplotype reconstruction of somatic genomes

4.1 Introduction

4.2 Results

4.2.1 Dataset and SNV detection pipeline

4.2.2 Overview of the framework

4.2.3 Local phasing

4.2.4 LD-based validation of local phasing

4.2.5 Statistical phasing

4.2.6 Leveraging aneuploidy information in phasing

4.2.7 Final validation test

4.3 Methods

4.3.1 Processing of samples and sequencing

4.3.2 Genotype calling

4.3.3 Constructing read clouds from sequence reads

4.3.4 Building variant blocks

4.3.5 Local phase

4.3.6 Constructing somatic haplotypes

4.3.7 LD-based validation

4.3.8 Statistical phase

4.3.9 Detecting somatic CNV regions

4.3.10 Leveraging somatic CNVs for detecting switch errors and connecting haplotypes

4.4 Discussion

5 Applications of haplotype phasing

5.1 Enhancing the accuracy of variant calls

5.2 Identifying and phasing cryptic segmental duplications

5.3 Increasing the resolution of phylogenetic inference methods

6 Conclusion

Bibliography

List of Tables

3.1 Variant call statistics

4.1 Distances between variants in complete LD pairs

4.2 Estimated number of switch errors in CNV regions

List of Figures

3.1 Overall workflow of the project

3.2 Lineage tree and alternate allele frequencies

3.3 Mutation spectra and rates of somatic SNVs

3.4 Dinucleotide mutation rates for each patient

3.5 Lesser allele fraction plot of Patient 6

3.6 Aneuploidy summary

3.7 Genome evolutions of all patients

3.8 Alternate allele frequencies in each tested private or phylogenetically informative classes of somatic SNVs of Patient 6

4.1 Moleculo read clouds

4.2 Reconstructing parental and somatic haplotypes in the local phase step

4.3 Probabilistic inference model for phasing germline variants

4.4 A haplotype block from real read cloud data

4.5 Haplotype allelic fractions

5.1 Two examples of somatic variants called by GATK in the IDC sample

5.2 Identifying a segmental duplication event at KCNJ12

5.3 Inferring the haplotype sequence of KCNJ12 paralogs

Chapter 1

Introduction

The term ‘cancer’ describes a wide spectrum of diseases that share one common

attribute: unregulated proliferation of cells. Cancer is recognized worldwide as a

disease of high prevalence and mortality rate. According to American Cancer Society

statistics [80], about 589,430 Americans are expected to die of cancer in 2015. Breast

cancer is one of the top three most frequent types of cancer among women in the US,

and is the leading cause of cancer death in American women aged 20

to 59 years [77].

Cancer results from an accumulation of genetic and epigenetic changes inherited

at birth and also acquired over the course of an individual’s lifetime; hence, genomics

is an essential component of cancer research. The release of the human reference

genome in 2001 [49, 88], and the rapid development of next-generation sequencing

(NGS) technologies in the ensuing years, have revolutionized the field of cancer

genomics by offering a detailed and multidimensional view of cancer genomes. These

molecular techniques are refining our understanding of cancer biology and prompting

new diagnostic and therapeutic approaches to the treatment of cancer patients.

The past few years have witnessed increasingly more systematic and organized

efforts, led by researchers across the world, to investigate the underlying somatic

alterations of human cancer [53, 11, 68]. However, a comprehensive understanding

of the mechanisms involved in the formation and progression of cancers still remains

elusive. As part of this dissertation, I contribute to the combined effort of researchers

to characterize somatic mutations involved in the early stages of cancer formation,

by studying the role of early neoplasias in breast cancer evolution.

Although the availability of high-throughput and low-cost sequencing platforms,

offered by NGS technologies, has expedited the study of many different cancer types

at an unparalleled scale, NGS data suffer from short read lengths. Obtaining a global

view of genomic contributions to tumor development is impeded by the resulting

fragmentation of a genome into a few hundred base-pair segments. To ameliorate the

challenges faced by short read sequencing, third generation sequencing technologies

are now emerging. Single-molecule sequencing (e.g. [87, 29]) and synthetic long read

sequencing (e.g. [48, 70, 3]) platforms have been developed in recent years that produce

fragments with lengths ranging from tens to thousands of kilobase pairs. The applica-

tion of these technologies to cancer genomes can provide insight into the underlying

molecular mechanisms at an unprecedented resolution. As part of my dissertation

work, I conducted the first application of synthetic long read sequencing technologies

to do haplotype analysis of somatic alterations in an invasive breast cancer sample.

This dissertation manuscript is organized as follows.

• Chapter 2 introduces some basic biology concepts and key terms that are used

in the subsequent chapters.

• Chapter 3 presents a study of genome evolution during the progression from

premalignant cell populations to invasive breast cancer. This chapter also de-

scribes the mutation discovery and phylogeny inference pipelines developed as

part of this study.

• Chapter 4 presents a novel toolset that can leverage information from long read

sequencing technologies to do haplotype assembly of a somatic genome.

• Chapter 5 showcases some promising applications of read-based haplotype in-

ference.

• Chapter 6 concludes the dissertation with the contributions of the work.

Chapter 2

Background

2.1 Genome and terminology

The human genome is the complete set of genetic information stored in the cells,

and is encoded in 23 pairs of chromosomes. Humans are diploid organisms, meaning

that they carry two homologous copies of each chromosome, one contributed by each

parent. Each chromosome is a long DNA molecule, and is represented as a

string over an alphabet of four letters A, C, G, and T known as nucleotides or bases.

Although humans are 99.9% identical in their genetic makeup, they still differ from

each other at millions of nucleotide sites among the 3.2 billion sites of the genome.

These differences contribute to heritable variations between individuals, including but

not limited to their susceptibility to disease.

The release of the first draft of the human genome in 2001 [49, 88], and a more

complete draft in 2003, enabled remarkable advances in our understanding of the genetic

variation in the human genome and its impact on complex traits and disease. During

the last decade we have also witnessed extensive progress in the field of genomics

fueled by the rapid advances in sequencing technologies.

2.1.1 Types of genomic variations

Genomic variations are differences in the DNA sequence of individuals in a population.

These differences can be classified into two major categories according to their size.

Single Nucleotide Variants (SNVs)

The simplest and most abundant form of genomic variation among individuals is

a single nucleotide variation (SNV), which is a single base change in the DNA se-

quence. The term “single nucleotide polymorphism” (SNP) refers to an SNV with a

population frequency of at least 1%. SNPs occur throughout a genome at a rate of

approximately one in one thousand base pairs. To date, over 53 million SNPs have

been identified and reported in public SNP databases such as NCBI dbSNP and

the international HapMap project.

Structural and Copy-Number Variants

Human genetic variations are not limited to single nucleotide changes. Other dif-

ferences include insertions or deletions of short stretches of DNA. These are called

indels for short. Moreover, large segments of DNA ranging in size from kilobases to

megabases can be inserted into, deleted from, or rearranged in the genomes of different

individuals. These alterations change the structure of chromosomes and are called

structural variants (SVs). A copy number variation (CNV), which is one form of

structural variation, indicates that a particular stretch of DNA has a different number

of copies among individuals. CNVs are caused by genomic rearrangements that lead

to the loss or duplication of DNA fragments.

Since a diploid genome has two copies of every chromosome, an individual has

two copies of the same locus. As a result, a genetic variant might be present in one

or both copies. If the same genetic change affects both copies, the variant is called

homozygous. If it only occurs in one copy, it is called heterozygous. The two versions

of a gene at a given location are called alleles. Typically one allele of a heterozygous


mutation is the same as in the reference genome. This allele is referred to as the

reference allele. The second version is referred to as an alternate or variant allele.

2.1.2 Cancer Genomics

In recent years, a remarkable advance in our knowledge of the mutational profile of

cancer and its application to the clinical setting has taken place. A growing body

of research is forming to study and characterize the heterogeneity of cancer cells.

These studies allow for a better understanding of the disease progression and hope-

fully bring us closer to the development of personalized medicine. The field of cancer

genomics has benefited substantially from the application of next generation sequenc-

ing technologies. These massively parallel sequencing platforms have increased the

throughput of genome sequencing while reducing the cost of data production. As

a result, sequencing many patients of the same cancer type, and analyzing multiple

samples from the same patient are now possible at an affordable cost.

Somatic Variants

The genome of a cancer cell possesses two types of genomic variants: germline and

somatic. Germline variants are mutations inherited from a parent. These variants

are present in all body cells including tumor cells. Most of these variants are not

disease causing and are prevalent in the general population. During the lifetime of

an individual, his or her DNA is continuously mutated as a result of intrinsic DNA

defects or environmental mutagens. While most of this damage is repaired, a small

fraction survives and accumulates in the DNA. These genetic alterations, which are

present in only a subset of body cells and are not found in germline cells, are called

somatic variants. Most cancer genomics studies focus on identification and analysis

of somatic mutations of a genome as these alterations provide an insight into the

underlying genetics of cancer.

Although normal body cells also carry somatic variants, in the context of cancer

genomics, we are mainly interested in somatic alterations not found in normal cells.

Therefore, throughout this dissertation, the term somatic mutation strictly refers to


a somatic change that is harbored only by cancer cells.

To distinguish between germline and somatic variants, the matched tumor and

normal samples of a patient should be sequenced and analyzed. Genetic alterations

that are shared between two samples are marked as germline variants.
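
The matched tumor-normal comparison described above can be sketched as a set operation. This is a minimal illustration, not the pipeline used in this dissertation: the function name `classify_variants` and the example calls are hypothetical, and real callers must also account for sequencing noise and sampling bias, as discussed in Section 2.2.

```python
def classify_variants(tumor_calls, normal_calls):
    """Split tumor variant calls into germline and candidate somatic sets.

    Both arguments are sets of hashable variant keys. A variant seen in
    both samples is labeled germline; a variant seen only in the tumor
    is a candidate somatic variant.
    """
    germline = tumor_calls & normal_calls
    somatic = tumor_calls - normal_calls
    return germline, somatic


# Hypothetical calls, keyed as (chromosome, position, reference, alternate).
tumor = {("chr1", 1000, "A", "G"), ("chr1", 2500, "C", "T"), ("chr2", 300, "G", "A")}
normal = {("chr1", 1000, "A", "G"), ("chr2", 300, "G", "A")}

germline, somatic = classify_variants(tumor, normal)
print(sorted(somatic))
```

In this toy input, only the variant at chr1:2500 is absent from the normal sample, so it is the only candidate somatic call.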

Aneuploidy

An aberrant chromosomal copy number, also referred to as aneuploidy, has been

recognized as a common characteristic of cancer genomes for over a century. The high

prevalence of chromosome-arm level somatic copy number alterations is reported in

several studies [37, 10, 9, 63]; however, understanding their role in tumorigenesis and

the progression of disease remains an active field of research [37].

Cancer heterogeneity

Although all cancer cells originate from a common progenitor, they evolve through

different clonal expansions and accumulate different somatic mutations along the way

[15]. These clones dynamically compete for resources within the ever-changing cellular

environment of the tumor, and are subject to selection mechanisms. The term tumor

heterogeneity refers to the coexistence of subpopulations of cells in a tumor

with diverse genotypes and methylation patterns.

Understanding cancer heterogeneity leads to more accurate diagnosis and

prognosis of the disease, and is crucial for the development of effective and personalized

therapies [33].

2.2 Variant calling in cancer samples

The process of variant discovery is a crucial first step in most cancer genomics stud-

ies. In next generation sequencing (NGS) methods, the DNA from a cancer sample is

amplified, sheared into small fragments (several hundred base pairs), and sequenced

producing millions of short sequence reads. These reads are then aligned to a ref-

erence genome, and sequence alterations from the reference are marked as potential


mutations. To use this set of genetic alterations in the downstream analysis, it is vital

to distinguish true variants from noise. However, this procedure is made challenging

by multiple factors, some of which are discussed in this section.

NGS data can suffer from errors introduced at different stages of sequencing, such

as early amplification cycles or base calling. Moreover, read-alignment tools are

not error-free. Mapping errors are particularly enriched in repetitive regions of the

genome. If enough reads are misaligned to a region, their variation from the reference

genome can resemble the signature of a true variant.

Differentiating true genetic variants from errors is especially hindered by the frequently

high level of normal-cell contamination in tumor samples and the heterogeneity of

cancer. The low proportion of cells in a sample containing a somatic variant results

in a low percentage of sequence reads harboring the variant allele, which obstructs

the distinction between true somatic SNVs and sequencing errors.
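
As a rough illustration of this dilution effect, the sketch below computes the fraction of reads expected to carry a somatic variant allele under a simple mixture of tumor and normal cells. The formula is a common simplification and is an assumption here, not the model used in this dissertation; the function and parameter names (`expected_vaf`, `purity`, `clonal_fraction`, `mutated_copies`) are hypothetical.

```python
def expected_vaf(purity, clonal_fraction=1.0, mutated_copies=1,
                 tumor_copy_number=2, normal_copy_number=2):
    """Expected variant allele fraction of a somatic SNV in a mixed sample.

    purity            fraction of cells in the sample that are tumor cells
    clonal_fraction   fraction of tumor cells that carry the variant
    mutated_copies    copies of the locus carrying the variant in those cells
    tumor_copy_number total copies of the locus in tumor cells
    normal_copy_number total copies of the locus in normal cells
    """
    # Variant-bearing copies contributed per cell, averaged over the sample.
    mutant = purity * clonal_fraction * mutated_copies
    # Total copies of the locus per cell, averaged over the sample.
    total = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return mutant / total


# A heterozygous clonal variant in a pure diploid tumor.
print(expected_vaf(purity=1.0))
# The same variant present in half of the tumor cells of a 40% pure sample.
print(expected_vaf(purity=0.4, clonal_fraction=0.5))
```

Under these assumptions the second case yields an expected fraction of about 0.1, so at 30x coverage only about three reads would carry the variant allele, a count that is hard to separate from sequencing errors.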

Understanding the complex nature of cancer genomics necessitates a comprehen-

sive analysis of the complete mutational spectrum of the tumor sample including

germline and somatic variants. Classifying genetic alterations as somatic or germline

requires the joint analysis of matched normal and cancer samples from the same

patient. The presence of the variant allele in sequence reads from both samples suggests

that the sequence variant is a germline mutation. However, incorrect classifications

of germline mutations as somatic can stem from sequence sampling bias, where only

one copy of the diploid genome is sampled at a specific site.

To address these challenges efficiently, sophisticated algorithms should be

developed. In recent years, several general and cancer-specific SNV callers have been

published (e.g. [26, 20, 50, 38, 46]); however, this subject is still an active area of

research.

2.3 Haplotype phasing

The term haplotype refers to a set of alleles at adjacent loci that are carried together

on the same copy of a chromosome. At any segment of a diploid genome, there are two

haplotypes, one inherited from each parent. At a heterozygous site, each allele belongs


to one haplotype. More than two haplotypes can be present in a heterogeneous tumor

sample, or in the genome of aneuploid cells.

Variant calling methods report the alleles present at a variant site. These alleles

are referred to as the variant’s genotype; however, the order of these alleles on each

chromosome is not directly observed. For example, if at three adjacent variant sites

{x1, x2, x3}, an individual has two haplotypes (ACG, CTA), a variant caller produces

A/C, C/T, and A/G as the genotypes for x1, x2, and x3 respectively. However, it is

not immediately known which of the four possible pairs of haplotypes (ACA, CTG),

(ATA, CCG), (ACG, CTA), or (ATG, CCA) is the correct configuration of these

alleles. The process of inferring haplotypes from genotype information is referred to

as phasing or haplotype inference.
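
The ambiguity in the example above can be made concrete by enumerating every haplotype pair that is consistent with the unphased genotypes. The following is a minimal sketch; the function name `haplotype_pairs` is hypothetical, and it assumes that every site is heterozygous and biallelic.

```python
from itertools import product


def haplotype_pairs(genotypes):
    """Enumerate the unordered haplotype pairs consistent with genotypes.

    genotypes is a list of 2-tuples of alleles, one per heterozygous site.
    The orientation of the first site is fixed so that (h1, h2) and
    (h2, h1) are not counted twice, which leaves 2 ** (n - 1) pairs.
    """
    first, rest = genotypes[0], genotypes[1:]
    pairs = []
    for flips in product((False, True), repeat=len(rest)):
        h1, h2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


# Genotypes A/C, C/T, and A/G at sites x1, x2, and x3.
genotypes = [("A", "C"), ("C", "T"), ("A", "G")]
pairs = haplotype_pairs(genotypes)
for h1, h2 in pairs:
    print(h1, h2)
```

For the three sites in the text this produces the four configurations listed above. The count doubles with every additional heterozygous site, which is why genotype data alone cannot resolve phase.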

The importance of phase information is continuously increasing, culminating in

a broad range of applications. Haplotype data is crucial in many disciplines includ-

ing but not limited to population genetics, functional genomics, pharmacogenomics,

and personalized medicine. Genotype imputation, local-ancestry inference, and de-

termining human migration patterns are only a few of the aforementioned applica-

tions. Haplotype information can also facilitate the identification of candidate genes

associated with complex traits. Moreover, several studies have discovered strong as-

sociations between specific haplotypes and drug resistance or disease susceptibility.

The growing recognition of the importance of haplotype information has resulted

in a collective effort by researchers to develop computational methods for haplotype

inference suitable for large-scale or genome-wide sequencing projects.

The simplest approach to phasing is the use of relatedness information in indi-

viduals of a single family. In the simple case of a trio, in which a child and both

parents are either sequenced or genotyped, basic principles of Mendelian inheritance

dictate which alleles were inherited from each parent. The only variants that remain

unphased in trio studies are those at which both parents and the progeny are het-

erozygous, and the ones that were not genotyped in at least one individual. Thus,

these studies result in very long and accurate haplotype blocks. Since it is not always

feasible, or even possible, to sequence all members of a family, this approach has very


limited applicability. At different levels of genealogical relatedness, haplotype

phasing of individuals is performed by identifying segments of the genome that they share

identical by descent (IBD). These are segments of the genome that individuals have

inherited from the same ancestor. As the genealogical relationship between pairs of

individuals becomes more distant, the length of such shared IBD regions decreases

exponentially resulting in smaller haplotype blocks.
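
The trio rule described above can be sketched for a single biallelic site as follows. This is an illustrative assumption of how the Mendelian logic might be coded, not an implementation from this dissertation; the function name `phase_trio_site` and the genotype encoding as two-character strings are hypothetical.

```python
def phase_trio_site(child, mother, father):
    """Phase one site of a child using the genotypes of both parents.

    Each genotype is a 2-character string such as "AG". Returns a tuple
    (maternal_allele, paternal_allele), or None when the site cannot be
    phased, for example when all three individuals are heterozygous.
    """
    a, b = child
    if a == b:
        # A homozygous site carries the same allele on both haplotypes.
        return (a, b)
    options = set()
    # Try both assignments of the child's alleles to the two parents.
    for m_allele, f_allele in ((a, b), (b, a)):
        if m_allele in mother and f_allele in father:
            options.add((m_allele, f_allele))
    return options.pop() if len(options) == 1 else None


print(phase_trio_site("AG", "AA", "AG"))  # mother can only transmit A
print(phase_trio_site("AG", "AG", "AG"))  # all heterozygous: ambiguous
```

The first call resolves the phase because the mother is homozygous, while the second returns None, matching the case that the text identifies as unphasable in trio studies.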

Unrelated individuals can be phased by a different set of methods, which are

commonly referred to as statistical phasing. These methods are based on modeling the

haplotype frequencies of individuals in a population, and often leverage the linkage

disequilibrium (LD) patterns between genetic markers. LD refers to an allelic correlation

between markers in a population. Several EM-based or HMM-based methods

have been developed for identifying, in the sequenced (or genotyped) individuals, a

set of possible haplotypes that can explain the observed genotypes. This type of

phasing is commonly used in population-scale studies such as in the International

HapMap Consortium and the 1000 Genomes project to impute genotypes at untyped

markers. A review paper by Browning et al. provides a comprehensive overview of

computational methods developed for statistical phasing [14].
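
To make the notion of allelic correlation concrete, the sketch below computes two standard LD summaries, D and r², from phased haplotypes at two biallelic markers. It is a minimal illustration under the assumption that haplotypes are already known; the function name and the 0/1 allele encoding are choices made here, and the EM-based and HMM-based phasing methods themselves are not implemented.

```python
from typing import Sequence, Tuple


def linkage_disequilibrium(haplotypes: Sequence[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic markers.

    Each haplotype is a pair (x, y) with 0 = reference and 1 = alternate allele.
    Assumes the haplotypes are already phased (illustrative input format).
    """
    n = len(haplotypes)
    p_a = sum(x for x, _ in haplotypes) / n           # frequency of allele 1 at marker A
    p_b = sum(y for _, y in haplotypes) / n           # frequency of allele 1 at marker B
    p_ab = sum(1 for x, y in haplotypes if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b                              # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0          # set to 0 for monomorphic markers
    return d, r2
```

Markers whose alleles always co-occur give r² = 1, and markers whose alleles are independent give r² = 0.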

Statistical phasing methods are error-prone and offer haplotype blocks with lengths

limited by the extent of the linkage disequilibrium across the genome. These methods

can only infer the phase between variants that are frequent in a population or in a

given sample of individuals, and are not applicable to rare and de novo mutations,

which are most clinically significant. Neither genetic phasing of human families, nor

statistical phasing of unrelated individuals can phase the somatic alterations in can-

cer cells. Therefore, direct haplotyping methods, through experimental analysis of a

single individual sample, are desired to phase de novo and somatic mutations.

Recent technological advances have enabled molecular-based haplotyping of per-

sonal genomes, referred to as Single Individual Haplotyping (SIH). These methods ex-

ploit the single-molecule nature of sequenced fragments. If a fragment encompasses

more than one heterozygous variant, it determines the phase of covered variants.

Therefore, partial haplotypes can be obtained by combining phase information across

overlapping fragments. Several factors such as sequencing errors, alignment errors,


and potential gaps in the sequenced fragments contribute to making this problem

computationally challenging [21]. Various haplotype assembly algorithms have been

developed recently using different approaches including greedy algorithms (e.g. [52]),

stochastic approaches (e.g. [7, 8]), and dynamic programming algorithms (e.g. [41]).
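
The way overlapping fragments delimit partial haplotypes can be illustrated with a small sketch. It only determines which heterozygous variants are connected by fragments and can therefore be phased relative to one another; it does not resolve the phase itself or handle sequencing errors, and it is not one of the cited assembly algorithms. The function name and the input format (each fragment given as the set of variant indices it covers) are assumptions.

```python
from typing import Dict, Iterable, List, Set


def phase_blocks(fragments: Iterable[Set[int]]) -> List[List[int]]:
    """Group heterozygous variant indices into blocks linked by fragments.

    Illustrative union-find sketch; error-free fragments are assumed.
    """
    parent: Dict[int, int] = {}

    def find(v: int) -> int:
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]    # path halving
            v = parent[v]
        return v

    for frag in fragments:
        variants = sorted(frag)
        if len(variants) < 2:
            continue                          # a single variant carries no phase information
        root = find(variants[0])
        for v in variants[1:]:
            parent[find(v)] = root            # merge into one block

    blocks: Dict[int, List[int]] = {}
    for v in parent:
        blocks.setdefault(find(v), []).append(v)
    return sorted(sorted(b) for b in blocks.values())
```

For example, `phase_blocks([{1, 2}, {2, 3}, {5, 6}, {8}])` returns `[[1, 2, 3], [5, 6]]`: variants 1 to 3 form one block through the shared variant 2, and variant 8 is left unphased because its fragment covers no other variant.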

Next generation sequencing technologies are the platforms of choice in most cur-

rent genomic studies. However, these technologies produce reads that are relatively

short (a few hundred base pairs) compared to the average distance between heterozygous

variants (about one thousand base pairs). As a result, most sequence reads cover at most

one heterozygous variant at a time. Recently, however, new technologies have been

developed that can produce longer sequences. These technologies which employ dif-

ferent compartmentalization approaches and amplification techniques can produce

sequenced fragments ranging in size from tens to hundreds of kilobases. Snyder et

al. provide a comprehensive overview of these technologies and their application to

single individual haplotyping [79].

Chapter 3

Genome evolution during

progression to breast cancer

3.1 Abstract

Cancer evolution involves cycles of genomic damage, epigenetic deregulation, and

increased cellular proliferation that eventually culminate in the carcinoma pheno-

type. Early neoplasias, which are often found concurrently with carcinomas and are

histologically distinguishable from normal breast tissue, are less advanced in pheno-

type and are thought to represent precursor stages. To elucidate their role in cancer

evolution we performed comparative whole genome sequencing of early neoplasias,

matched normal tissue, and carcinomas from six patients. By using somatic mu-

tations as lineage markers we built trees that relate the tissue samples within each

patient. On the basis of these lineage trees we inferred the order, timing, and rates

of genomic events. In four out of six cases, an early neoplasia and the carcinoma

share a hypermutated common ancestor with recurring aneuploidies, and in all six

cases evolution accelerated in the carcinoma lineage. Point mutational mechanisms

are stable and consistent across cases, suggesting that hypermutation is a result of

increased cell division. In contrast to highly advanced tumors that are the focus

of much of current cancer genome sequencing, neither the early neoplasia genomes

nor the carcinomas are enriched with potentially functional somatic point mutations.

Aneuploidies that occur in common ancestors of neoplastic and tumor cells are the

earliest events that affect a large number of genes, and may predispose breast tissue

to eventual development of invasive carcinoma.

3.2 Introduction

The cells of a multicellular organism are related to one another by a bifurcating

lineage tree whose root is the zygote. DNA replication, chromosome segregation,

and cell division during development from the zygote to the adult introduce point

mutations and other DNA changes into the genome, which persist in the descen-

dants of the cells in which they occurred. Germ-line point mutations occur at a

rate of approximately one per diploid genome per cell division [47], but the rate of

somatic changes is less well-understood, and is likely to vary by tissue type. Large-

scale genomic changes such as aneuploidies are generally thought to be extremely

rare in normal tissue. Cancers, in contrast to normal tissue, accumulate much larger

numbers of genomic changes, as illustrated by genome sequencing of late-stage tu-

mors [53, 83, 11, 71, 18, 82, 6, 68, 67]. Solid tumors are highly mutated by several

mechanisms, such as point mutations, copy-number variations, and chromothripsis

[40, 51, 10, 55, 61, 81, 23, 58]; relapses or metastases exhibit further mutational

evolution [27, 28, 92, 64, 59, 86, 90, 91]. The state of an individual advanced cancer

genome sheds little light on the order of genomic changes, however, except in analyses

of subclone evolution [67, 75]. In an advanced tumor, the earliest driver changes that

had predisposed ancestral cells to eventual carcinoma development are confounded

with later changes. As a consequence, our understanding of early tumor evolution

is still in its infancy. The historically proven approach to understanding evolution is

comparative analysis of extant species, whose power was greatly increased by whole-

genome sequencing in recent years. Analogous to species comparisons, which are

based on evolutionary (bifurcating) lineage trees, comparisons of somatic genomes

from a single individual could, in principle, shed light on somatic evolution, but in

normal tissue the number of mutations is low. However, given the large number of


genomic changes during tumor evolution, it may be possible to dissect the evolution-

ary history of a cancer by comparing its genome to clinically recognized precursor

lesions. In this context, breast cancers provide a proof-of-principle opportunity, due

to their frequent association with early neoplastic lesions that are readily identified

by morphology [78, 1, 56, 13], and whose genomes may provide windows into the

earliest stages of tumor evolution. Using whole-genome sequencing of histologically

characterized archival (formalin-fixed, paraffin-embedded) samples, we determine lineage

relationships of early neoplasias with carcinomas, quantify mutational load and

mutation spectra during progression from normal tissue to neoplasia to carcinoma, and

find the earliest detectable mutations and aneuploidies in cell lineages ancestral to the

lesions. A subset of these early events may have provided the initial oncogenic potential

and helped trigger the first clonal expansion. Our analyses reveal variation among

the six cases in the specific evolution of neoplasia and tumor, as would be expected for

an evolutionary process dominated by stochasticity. The mechanistic commonalities

among the cases, however, bear significant implications for our conceptualization of

tumor origins and progression.

3.3 Results

3.3.1 Whole-genome sequencing of early neoplasias and re-

lated carcinomas from archival material

Our workflow (Figure 3.1) began with the screening of histopathological sections of

archival estrogen receptor positive invasive ductal carcinoma (IDC) resection speci-

mens for presence of concurrent early neoplasias, which are microscopic in size (typ-

ically 1-3 mm). We selected cases in which early neoplasia with or without atypia

(EN or ENA; a spectrum of usual ductal hyperplasia, columnar cell lesions, and flat

epithelial atypia), and in some cases ductal carcinoma in situ (DCIS) were present

in addition to the IDC. Areas of high neoplasia or carcinoma content were cored,

and histologically re-evaluated for lesion purity. Six cases were chosen in which each

sample met criteria for purity and had enough DNA for whole genome sequencing.


Each case had at least one early neoplasia sample from the same side in which the

carcinoma was found, and five also had an early neoplasia sample from the contralat-

eral mastectomy/lumpectomy. Each had at least one control sample (lymph, normal

breast tissue or both), and three cases also had a DCIS in addition to the IDC,

yielding a total of 31 samples (Figure 3.1A).

We optimized DNA isolation from archival samples to obtain sufficient quantities

of preparative material, and honed the generation of robust libraries. For each sample,

a single library was built and sequenced with paired-end reads (2 × 101 bp) on the

Illumina HiSeq platform. Library complexity was sufficient to support deep whole

genome sequencing, with the vast majority of sequence data coming from independent

DNA fragments as opposed to PCR duplicates. The samples from the first patient

were sequenced to higher coverage (average of 84.6x) to calibrate the tradeoff between

cost and sensitivity in variation calling. Coverage of each sample by confidently

mapped reads ranged from 46.7x to 105.7x, with a median of 53.4x.

3.3.2 Somatic SNVs fall into a limited and highly structured

set of classes

Detection of somatic single nucleotide variants (SNVs), such as those occurring dur-

ing cancer evolution, requires a methodology with high specificity because inherited

(germline) variants are orders of magnitude more numerous and even a small false

positive rate of calling inherited variants somatic results in low positive predictive

value. Our high sequence coverage and purity of samples allowed us to pursue highly

sensitive and specific somatic SNV identification. Because we sequenced several sam-

ples from each patient, we identified the total set of SNVs in each patient with a

multisample strategy using GATK. For each patient, we called variants using reads

from all samples simultaneously, and then assigned genotypes to each sample. The

vast majority of SNVs were present in all samples, as expected from germline variants.

Standard quality control metrics confirmed the high quality of our variant calls. The

total number of high-confidence germline variants ranged from 2,650,714 (Patient 5)

to 2,973,005 (Patient 1). Between 97.50% and 98.07% of these were present in dbSNP (Table 3.1).


Figure 3.1: Overall workflow of the project from clinical sample to genome evolution inference. Workflow stages shown: clinically informed evaluation of research specimens; molecular biology and sequencing; computational sequence analysis; determination of each case’s evolutionary history.


Figure 3.2: Lineage tree and alternate allele frequencies. (A) The samples in this study by type (rows) and patient (columns). (B) Model of neoplastic progression on the basis of organismal tissue and cell lineage. For simplicity, only one possible scenario of the progression from normal to neoplasia to carcinoma is shown. Mutations that arise in ancestors are propagated through subsequent divisions to all descendants. Depending on the ancestors in which they arise, they will be found in one or more samples of the patient, with varying prevalence. For example, mutations that arise in the B branches will be found in all cells of the neoplasia and of the carcinoma; in contrast, mutations that arise on the C branch will be present only in a subset of the neoplasia cells and mark the neoplastic subpopulation from which the carcinoma arose. Mutations that arise on the F branch mark a clonal expansion within the neoplasia, after the last common ancestor with the carcinoma. Note that if there are no mutations found that define branches B and C, it is not possible to infer a specific relationship of the carcinoma with the neoplasia. (NS) Not sampled. In the expanded box are alternate allele frequency comparisons relevant to neoplasias and carcinomas. The two starred comparisons require independent estimates of the proportion of normal cells in each sample, as they compare AAFs across different samples. All other comparisons are either within samples, or the AAF is zero, thus requiring no independent estimate of the proportion of normal cells in the sample. (C-F) Alternate allele frequencies as a function of the class and sample for each patient with phylogenetically informative SNV-sharing classes. The number of SNVs in each class and the branch in the lineage tree of A are listed below each plot. For Patient 1, the only phylogenetically informative class was where the IDC shared SNVs with ENA. For the other patients, the AAFs of informative classes are grouped together and the mutation pattern for each class is represented by a series of zeros and ones directly above the sample labels (a “1” indicates that the SNVs were present in the corresponding sample and a “0” indicates that they were not). (EN) Early neoplasia; (EN cl) early neoplasia contralateral; (ENA) early neoplasia with atypia. Subscript in lineage-tree branch of patient 6 denotes whether the neoplasia in the lineage tree is this patient’s EN or ENA, and whether the carcinoma is DCIS or IDC.

On average, 59,697 SNVs per patient were present in all samples but not in dbSNP,

and therefore represent novel SNPs of low population-allele frequency (Table 3.1).

Between 1465 (Patient 1) and 3416 (Patient 6) SNVs were candidate somatic

variants, as they were not detected in at least one sample of that patient (Table

3.1). If the samples are related by a tree, then only some sharing classes are possible

and the total number of observed classes is much lower than the number of possible


                    P1          P2          P3          P4          P5          P6
Total               2,973,005   2,771,413   2,912,758   2,915,727   2,650,714   2,937,816
Homozygous          1,168,671   1,078,021   1,149,006   1,160,421   1,017,760   1,146,679
Ts/Tv ratio         2.13        2.09        2.09        2.09        2.15        2.10

In dbSNP            2,910,863   2,717,531   2,856,582   2,857,498   2,596,421   2,864,359
Percent             97.91       98.06       98.07       98.00       97.95       97.50

Novel               62,142      53,882      56,176      58,229      54,293      73,457
Homozygous          2,514       1,734       1,715       1,681       1,295       2,372

Candidate somatic   1,465       1,546       2,567       2,775       1,924       3,416
After filtering     1,279       1,479       2,104       2,582       1,728       3,211

Table 3.1: Variant call statistics

classes. For example, in Patient 1, from whom we sequenced six samples, there are

2^6 − 1 = 63 possible classes to which an SNV can belong. In this patient, 1766 SNVs

were absent from at least one sample, and excluding those present in lymph we retain

1465 candidate somatic SNVs. Only six of the classes, containing 1279 out of the

initial 1465 candidate SNVs (87%), survived filtering. Those SNVs removed during

filtering were either germline SNVs where one allele was poorly covered, or somatic

SNVs whose class membership we could not confidently establish.
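
The class bookkeeping described above can be sketched as follows. This is a simplified illustration, not the filtering pipeline used in this study: the function names, the 0/1 presence tuples, and the rule of discarding any SNV seen in the control sample are assumptions, and the read-count and allele-frequency filters are omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_possible_classes(n_samples: int) -> int:
    """Number of non-empty presence/absence patterns over n samples."""
    return 2 ** n_samples - 1


def candidate_somatic_classes(
    presence: Dict[str, Tuple[int, ...]],
    samples: List[str],
    control: str,
) -> Counter:
    """Count candidate somatic SNVs per sharing class.

    `presence` maps an SNV identifier to a 0/1 tuple aligned with `samples`
    (hypothetical input format). SNVs seen in every sample or in the control
    sample are skipped.
    """
    control_idx = samples.index(control)
    classes: Counter = Counter()
    for pattern in presence.values():
        if all(pattern):
            continue          # present everywhere: treated as germline
        if pattern[control_idx]:
            continue          # present in the control: not a somatic candidate
        if any(pattern):
            classes[pattern] += 1
    return classes
```

`count_possible_classes(6)` returns 63, the figure quoted for Patient 1. When the samples are related by a tree, only a small subset of these patterns should be populated by more than a handful of SNVs.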

Across the six cases, we retained 82%–96% (median = 91%) of SNVs and 19%–43% (median = 27%) of classes, revealing substantial structure in the data. The final

number of confident somatic SNVs ranges from 1279 in Patient 1 to 3211 in Patient

6, for a total of 12,392 in all six patients. 8950 (72%) of these are private to only

one sample in only one patient, and the number of such private SNVs increases as a

function of the severity of the cancer phenotype: the IDCs harbor the most private

mutations (average of 601 per sample, N = 7, range 46–1809), the DCISs have an

average of 470 SNVs per sample (N = 3, range 70–978), early lesions 229 per sample

(N = 14, range 123–387), and normal samples have the fewest (N = 2, range 39–89).

On average, the IDCs accumulated 2.6-fold more private mutations than the early

neoplasias, and almost 10-fold more than normal breast tissue. This may be due to

a larger number of cell divisions or an increased mutation rate in the ancestral cell

lineage of the IDC.


3.3.3 Allele frequencies of somatic SNVs support common

ancestral relationships

Somatic SNVs that are not private to individual samples define phylogenetically in-

formative classes. A total of 3442 SNVs define such classes, ranging from 0 SNVs

in Patient 4 to 1054 SNVs in Patient 3, with a per-case average of 574 and a per-

class (N = 7) average of 492. To illustrate the logic of phylogenetic inference using

informative classes, we consider a hypothetical lineage tree that relates non-breast

somatic, normal breast, neoplastic, and carcinoma cell lineages (Figure 3.2B). Muta-

tions that occurred in ancestral cells are present in specific subsets of samples, with

the lineage tree constraining the set of possible classes.

As demonstrated in recent studies of subclone evolution in IDC [68, 67, 75], alter-

nate allele frequency (AAF) is a powerful metric for understanding tumor evolution.

The “alternate allele” is the allele that does not match the reference base, and which

in the vast majority of cases is the somatic mutation. Its frequency is estimated from

its sequence coverage divided by the coverage of the alternate base plus that of the

reference base. Depending on the ancestral lineage in which a collection of mutations

arose, their AAF distributions in each sample vary. For example, if a variant arose in

a common ancestor of a subset of lesional cells in the sample, its AAF is lower than

that of an earlier mutation that is present in all lesional cells of the sample (Figure

3.2B).
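
The AAF estimate defined above amounts to a single ratio per SNV, which can then be summarized per class. The sketch below is illustrative; the function names and the (alt, ref) read-count tuples are assumptions, and the mean is used as the class summary only for simplicity.

```python
from statistics import mean
from typing import Iterable, Tuple


def alternate_allele_frequency(alt_reads: int, ref_reads: int) -> float:
    """AAF = alt / (alt + ref); raises if the site has no coverage."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("site has no reads supporting either allele")
    return alt_reads / total


def mean_class_aaf(read_counts: Iterable[Tuple[int, int]]) -> float:
    """Mean AAF over the (alt, ref) read counts of the SNVs in one class.

    Hypothetical helper: the input format is an assumption of this sketch.
    """
    return mean(alternate_allele_frequency(a, r) for a, r in read_counts)
```

A site with 12 alternate and 48 reference reads has an AAF of 0.2. Comparing the class summaries within one sample is what distinguishes mutations carried by all lesional cells from those carried by a subpopulation.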

For each SNV class of each patient, we obtained estimates of AAF distributions

with highly consistent class patterns (Figure 3.2 C-F). For example, in Patient 1 the

AAFs of the SNVs that are present in ENA and IDC and absent everywhere else are

higher than the AAFs of the ENA-only or the IDC-only classes. The same patterns

hold for Patients 2 and 6. The patterns in Patient 5 are complicated by the presence

of two IDCs and by low numbers of SNVs in relevant classes. Note that the mean

AAFs are always < 50% due to unavoidable contamination of the lesional tissue with

normal cells that derive from lineages that branched off before the lesional ancestors

accumulated their somatic mutations.
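
The effect of normal-cell contamination on the AAF can be made explicit with a simple model. The sketch assumes a heterozygous somatic SNV in a copy-neutral diploid region and unbiased read sampling; these assumptions, the function name, and the parameter names are introduced here for illustration and are not taken from the analysis itself.

```python
def expected_aaf(lesional_fraction: float, carrier_fraction: float = 1.0) -> float:
    """Expected AAF of a heterozygous somatic SNV in a copy-neutral diploid region.

    lesional_fraction: share of cells in the sample that are lesional (0..1).
    carrier_fraction: share of lesional cells that carry the mutation (0..1).
    ASSUMPTION: no copy-number change at the site and no allelic mapping bias.
    """
    if not (0.0 <= lesional_fraction <= 1.0 and 0.0 <= carrier_fraction <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    # Each carrier cell contributes one mutant and one reference copy;
    # every other cell contributes two reference copies.
    return lesional_fraction * carrier_fraction / 2.0
```

Under this model a sample with 80% lesional cells yields an expected AAF of 0.4 for a mutation present in all lesional cells, and 0.2 for a mutation present in half of them, so any contamination keeps the mean AAF below 50%.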


3.3.4 Mutated neoplasias are evolutionarily related to carci-

nomas

Each case represents an independent evolution; therefore, common patterns across the

cases may be of general significance. We first asked to what extent the early neoplasias

and the carcinomas share mutations that are not present in other samples, pointing

to shared ancestral cell lineages. In four cases (Patients 1, 2, 5, and 6) (Figure 3.2C-

F), the phylogenetically informative SNV classes indicate that a neoplasia shares a

common ancestor with the carcinoma. In each of these cases, a neoplasia and the

carcinoma share a significant number of SNVs. For example, in Patient 1, 775 SNVs

are shared between ENA and IDC, and in Patient 2, 681 SNVs are shared among the

EN, DCIS, and IDC, with additional SNVs shared between the EN and IDC. There

are no well-supported classes (in terms of number of SNVs and their AAFs) that

are in conflict with each other, and none in which normal tissue or contralateral EN

share SNVs with the carcinomas. The aforementioned PCR-based targeted validation

showed 94% and 98% accuracy in assigning SNVs to the correct phylogenetic class.

In three of these four cases (Patients 1, 2, and 6) the number of SNVs in common

between a neoplasia and carcinoma suggests the existence of a common ancestor that

had already accumulated many somatic SNVs. Strikingly, in two cases (Patients 1 and

2) the number of mutations in the ancestor is greater than the number of mutations

that subsequently occurred in the ancestral lineage private to the carcinoma.

In three cases (Patients 2, 3, and 6) DCIS was concurrent with IDC, and in one

case (Patient 5) two independent IDC lesions were present. These four cases provided

us the opportunity to ask whether the carcinoma phenotype arose once or multiple

times independently. In Patient 3, the DCIS and IDC share a mutated common

ancestor, suggesting that the carcinoma phenotype arose in the ancestral lineage,

and that the IDC subsequently acquired the invasive phenotype. In Patients 2 and

6, there is no well-supported class of SNVs that unites the two carcinomas to the

exclusion of a neoplasia. Instead, in both patients, the DCIS and the IDC each

share separate classes of SNVs with a neoplasia, suggesting independent origins of

the carcinoma phenotype from neoplastic ancestors.


These results suggest that some early neoplasias harbor a predisposition to spawn-

ing a carcinoma that later acquires an invasive phenotype (Patients 1, 2, 6). The

chance of acquiring a carcinoma phenotype, given the predisposition provided by the

neoplasia, is sufficiently high to allow for concurrent and independent development of

carcinomas (DCIS and IDC in Patients 2 and 6).

3.3.5 Point-mutational mechanisms are evolutionarily stable

and reproducible among cases

SNVs result from mutations that occurred in ancestral cells, and if a specific molec-

ular mechanism were primarily responsible for the mutations, the distribution of the

SNVs among the various types of change (the “mutation spectrum”) would carry

that mechanism’s signature [72]. To investigate the cause of the ancestral accumula-

tion of mutations, we analyzed the mutational spectrum as a function of the samples

in which SNVs were found. The mutational spectrum in our cases is remarkably

consistent from patient to patient (Figure 3.3A) and is also stable across SNVs in

different types of samples and in different patterns (Figure 3.3B). Transitions outnumber

transversions about 1.5-fold in a pattern that is typical for replication errors

and not indicative of any specific type of DNA damage or failed repair mechanism.

C-to-T changes (or G-to-A, which are the same due to base pairing) are most nu-

merous. Converted to substitution rates, this bias is even more pronounced because

there are only roughly two C’s for every three T’s in the human genome. The consis-

tency across patients implies a common mechanism, and the consistency among the

three SNV groups (SNVs in early lesions only, in carcinoma only, and shared between

early lesions and carcinoma) implies that the common mechanism acts throughout

neoplastic and tumor evolution.
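
A mononucleotide spectrum of the kind discussed above can be tabulated with a short sketch. Strand-equivalent changes are collapsed (for example, G-to-A is counted with C-to-T) and the transition-to-transversion ratio is reported. The function names and the (ref, alt) input format are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}


def canonical_change(ref: str, alt: str) -> Tuple[str, str]:
    """Collapse a substitution with its reverse-strand equivalent (G>A becomes C>T)."""
    if ref in PURINES:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return ref, alt


def is_transition(ref: str, alt: str) -> bool:
    """Purine-to-purine and pyrimidine-to-pyrimidine changes are transitions."""
    return (ref in PURINES) == (alt in PURINES)


def mutation_spectrum(snvs: Iterable[Tuple[str, str]]) -> Tuple[Counter, float]:
    """Return counts per canonical change and the transition/transversion ratio.

    `snvs` is an iterable of (ref, alt) base pairs (hypothetical input format).
    """
    spectrum: Counter = Counter()
    ts = tv = 0
    for ref, alt in snvs:
        spectrum[canonical_change(ref, alt)] += 1
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    ratio = ts / tv if tv else float("inf")
    return spectrum, ratio
```

Comparing the resulting spectra between SNV groups (early lesions only, carcinoma only, shared) is one way to check the stability described in the text.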

To further shed light on the mutational mechanism we turned to analysis of din-

ucleotide substitution patterns. Because dinucleotide frequencies vary by an order of

magnitude in the human genome, with AA/TT being most common and CG least

common, we converted mutation counts to rates. Truly random substitutions would

have the same rates for each of the 60 possible mutations (10 dinucleotides with six


Figure 3.3: Mutation spectra and rates of somatic SNVs. (A) Mononucleotide substitution frequencies by patient. (B) Mononucleotide substitution frequencies by SNV class. (C) Dinucleotide substitution rates of SNVs private to early neoplasias. (D) Dinucleotide substitution rates of SNVs private to carcinomas. (E) Dinucleotide substitution rates of SNVs shared among neoplasias and carcinomas. For C-E, SNVs are pooled across patients. The mutated dinucleotide is indicated in the inner circle, and the substitution occurring within it is color coded. Rate is defined as mutations per dinucleotide of that class.

possible changes each, not counting changes in both bases because they are exceed-

ingly rare). A dinucleotide-unaware process would recapitulate the mononucleotide

rates, with the average transition having an about fourfold higher rate than the av-

erage transversion. In contrast, we detect an approximately eightfold higher rate of

C-to-T transitions in the CpG context. This higher mutation rate is due to methyla-

tion of the C in a CpG dinucleotide, which upon deamination becomes a TpG. If the

repair machinery catches this event it is reversed, but if the replication fork passes


first it leads to a C-to-T transition in one of the daughter strands. The relative rate

of C-to-T transitions in CpGs versus C-to-T transitions in the other dinucleotide con-

texts and versus all other changes provides an internal calibration as to whether DNA

damage processes or defective repair mechanisms have disproportionally affected the

genome.
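
The conversion of mutation counts to rates can be sketched as follows: dinucleotides are merged with their reverse complements, giving the 10 classes mentioned above, and each mutation count is divided by the number of occurrences of its class. The function names are assumptions, and a real analysis would count occurrences over the callable portion of the reference genome rather than over a short string.

```python
from collections import Counter
from typing import Dict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical_dinucleotide(dinuc: str) -> str:
    """Merge a dinucleotide with its reverse complement (e.g. TT -> AA)."""
    revcomp = COMPLEMENT[dinuc[1]] + COMPLEMENT[dinuc[0]]
    return min(dinuc, revcomp)


def dinucleotide_counts(sequence: str) -> Counter:
    """Count canonical dinucleotides over all overlapping positions."""
    counts: Counter = Counter()
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if set(pair) <= set("ACGT"):        # skip N and other ambiguity codes
            counts[canonical_dinucleotide(pair)] += 1
    return counts


def substitution_rates(mutations: Dict[str, int], background: Counter) -> Dict[str, float]:
    """Rate = observed mutations in a dinucleotide class / occurrences of that class.

    `mutations` maps a canonical dinucleotide to a mutation count
    (hypothetical input format).
    """
    return {d: n / background[d] for d, n in mutations.items() if background[d] > 0}
```

Normalizing in this way prevents rare contexts such as CG from appearing to mutate infrequently merely because they are scarce in the genome.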

In our patients, the rate increase of C-to-T transitions in the CpG context and in

the dinucleotide mutation spectrum in general is similar to germline evolution [84, 44],

and is consistent across patients (Figure 3.4) as well as among classes of SNVs (private

to neoplasias, private to IDCs, and shared among neoplasias and carcinomas) (Figure

3.3 C-E). This implies that the sources of the somatic SNVs are mutations that

accumulated during many rounds of DNA replication (many ancestral cell divisions),

and that cancer- or neoplasia-specific point mutational mechanisms, if present at all,

did not substantially affect the mutation spectrum. Taken together, these lines of

evidence support a model of mutation accumulation that is gradual and largely a

function of the number of cell divisions, as opposed to recurring DNA damage events

or mutational storms.

The somatic SNVs are randomly distributed in each patient with no enrichment of

exonic or nonsynonymous changes, regardless of the phylogenetic class to which they

belong. We also detect very little clustering of mutations that might be indicative of

localized mutagenic events [68]. Across all cases, 159 out of the 12,392 high-confidence

somatic SNVs fall into coding regions, with 2/3 (106) being nonsynonymous, which is

what is expected by chance. This holds true for any biological subdivision of the data

(e.g., neoplasias vs. IDC). The affected genes exhibit no enrichment for pathways

by GO analysis [4, 42]. One point mutation, H1047R in PIK3CA, which has been

previously implicated in cancer [73, 30] and early neoplasias [85], was recurrent in

our cases (Patients 1, 3, 4, and 5, in various samples) at varying allele frequencies.

Common cancer loci such as TP53 and BRCA1 were not mutated.


Figure 3.4: Dinucleotide mutation rates for each patient. Plots are the same as Figure 3.3 C-E, except that here SNV classes are pooled for each patient. Rates are in units of “substitution per dinucleotide type”, and vary overall between patients because the number of mutations varies from case to case. In all cases, transitions in CpG dinucleotides have a much higher rate than all other mutations. Panels: Patients 1 to 6; rate axis marked from 2×10^-6 to 8×10^-6; substitutions shown: C to T, G to A, A to G, T to C, A to T, T to A, C to A, G to T, C to G, G to C, A to C, T to G.

3.3.6 Aneuploidies are the dominant evolutionary feature of

progression

The paucity of candidate driver mutations and overall random distribution of point

mutations in our cases suggest that other genomic events may be contributing to the

initial neoplastic phenotype and its progression to carcinoma. We therefore devised

a multistep strategy to identify chromosome arm-scale losses and gains in each pa-

tient, utilizing those germline variants for which the patients were heterozygous. Each


patient was heterozygous for between 1.56 and 1.74 million SNPs, ensuring substan-

tial statistical power to detect subchromosomal-sized aneuploidies and copy-number

variations.

We quantified, in each somatic sample separately, the fraction of reads that support

the allele with fewer reads (the lesser allele fraction, or LAF).

We then ordered the SNVs according to their position in the genome and identified

transition points where the LAF abruptly changes. In one case (Patient 5), the 20

large-scale copy-number variations which are confined to this patient’s two IDC sam-

ples are suggestive of chromothripsis [55, 61, 81, 23, 58]. In the other five patients, we

identified a total of 46 large-scale copy-number variations, 43 of which involve whole

chromosomes or whole chromosome arms.
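
The LAF computation and the search for abrupt changes can be illustrated with the sketch below. The window size of 1000 SNPs with a step of 500 follows the plotting windows given in the legend of Figure 3.5; the use of the window median, the `min_jump` threshold, and all function names are assumptions of this sketch and not the segmentation procedure actually applied.

```python
from statistics import median
from typing import List, Sequence


def lesser_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the less-covered allele at a heterozygous SNP."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return min(ref_reads, alt_reads) / total


def windowed_laf(lafs: Sequence[float], window: int = 1000, step: int = 500) -> List[float]:
    """Median LAF in overlapping windows of genome-ordered SNPs.

    ASSUMPTION: the median is used as the window summary for illustration.
    """
    if len(lafs) < window:
        return [median(lafs)] if lafs else []
    return [median(lafs[i:i + window]) for i in range(0, len(lafs) - window + 1, step)]


def transition_points(window_lafs: Sequence[float], min_jump: float = 0.05) -> List[int]:
    """Indices i where the LAF changes by at least min_jump between windows i-1 and i.

    ASSUMPTION: min_jump is an arbitrary illustrative threshold.
    """
    return [i for i in range(1, len(window_lafs))
            if abs(window_lafs[i] - window_lafs[i - 1]) >= min_jump]
```

Because consecutive windows overlap, a single breakpoint typically appears as two adjacent steps, one entering and one leaving the window that straddles it.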

None of the normal breast and contralateral neoplastic samples, some of the ip-

silateral neoplasias, and all of the carcinomas exhibit aneuploidy. Four of the seven

IDCs exhibit evidence for the presence of a subclone population in which additional

chromosomes have undergone aneuploidy events.

In Patients 1, 2, and 6, aneuploidy events are shared among early neoplasias and

carcinomas. All aneuploidies that are present in the neoplasias are also present in the

carcinomas. Plotting the LAFs of all samples from a patient powerfully illustrates

both the chromosome scale of these events as well as the sharing of the same aneu-

ploidies among certain samples. In Patient 6, for example, the aneuploidies involving

chromosomes 1q, 6q, 8p, 17 and 22 are shared among both carcinomas and the EN

(Figure 3.5). The plot also reveals the aneuploidies of many other chromosomes that

are present in a subclone population that makes up about 30% of the IDC sam-

ple. Examination of the corresponding plots of all patients reveals the extraordinary

prevalence of aneuploidies in these cases.

Graphing the distribution of LAFs for each LAF-derived section of the genome

separately (usually a whole chromosome or arm) further supports the robustness of

LAF as a metric to identify aneuploidies (Figure 3.6A). However, a reduction of LAF

can be a result of ploidy gains as well as losses. Therefore, we calculated the actual

ploidy changes in a two-step process: first, we estimated the contribution of normal

cells to the sample using chromosome losses, and then we calculated the additional


Figure 3.5: Lesser allele fraction plot of Patient 6. SNVs are arranged by their order in the genome, and LAF is plotted for each sample in windows of 1000 SNVs with 500 SNV overlap. Aneuploidies are visible as precipitous drops in the LAF, which are often shared between samples. Chromosome boundaries are indicated by short vertical lines. All samples are plotted and give highly consistent LAFs for chromosomes that are euploid.

number of chromosome copies for those chromosomes that exhibited increased ploidy.

We validated a subset of these calls using FISH (Figure 3.6B) and found all LAF-based

calls that we tested to be correct.

The distribution of aneuploidies across chromosomes among the six patients is

highly nonrandom (Figure 3.6C). Gain of chromosome 1q is by far the most common

event, with a total of 13 extra copies accumulated in these patients, not considering

the IDC subclones. All cases exhibit 1q gain, and it is the only event that is shared by




Figure 3.6: Aneuploidy summary. (A) LAF distributions for each chromosome across all patients and samples. In each sample-by-patient panel, the LAF distributions of all chromosomes are superimposed. In the absence of aneuploidy, the plot lines of all chromosomes are well-aligned, as is evident in the control plots and some EN plots. Control panels often contain plots from two samples (indicated) and so there are sometimes 46 lines superimposed, revealing the robustness of the LAF metric across samples and chromosomes. A chromosome’s plot line is gray when it does not deviate from the typical distribution. The line is colored when the chromosome’s LAF is skewed. Distinct colors are assigned to represent aneuploid regions that recur in different samples and patients. Colors are labeled in the panel in which they first appear. For Patient 6 please see Figure 3. (B) FISH of chromosome 1 in ENA of Patient 6. (C) Distribution of aneuploidies by patient, excluding those in IDC subclones. Each square denotes a unit gain (orange) or loss (blue). In Patients 2, 3, and 6, two phases of aneuploidies occurred, with those of the second phase not surrounded by a border. (Total) The total number of chromosomes lost (−) or gained (+) across all patients; (1st) the number during the first detected phase. Only recurrent events are listed. In Patient 5 (which exhibits hallmarks of chromothripsis), different pieces of chromosomes 1p and 19 underwent simultaneous losses or gains.

all three early neoplasias in which we could detect aneuploidy. In three cases (Patients

2, 3, and 6), the IDC underwent gains of 1q in addition to previous ones, increasing

1q ploidy to 6, 4, and 4, respectively. This suggests that the selective advantage

conferred by 1q gain increases with further gains of 1q during tumor evolution.

Like the shared SNVs, the shared aneuploidies support specific lineage relation-

ships among the samples of each patient. We therefore built lineage trees using the

somatic SNVs as phylogenetic markers, and then asked whether the shared aneuploi-

dies are consistent with these trees (Figure 3.7). All aneuploidies are unambiguously

and parsimoniously assigned to specific branches in the SNV-based lineage trees.

The order of aneuploidies during the evolution of each case is also unambiguous

and highly suggestive of a small number of aneuploidies being first drivers of the

neoplastic phenotype. In all cases, gain of 1q was among the events that occurred

first, including in the three cases in which genomic crises occurred in a common

ancestor of neoplasias and carcinomas (Patients 1, 2, and 6). Loss of 16q occurred

four times, and loss of 17 three times, as part of the first set of aneuploidies. Gain of


Figure 3.7: Genome evolutions of all patients (P1-P6). Vertical black lines are ancestral lineages whose lengths are proportional to the number of SNVs that occurred in each (except Patient 4, which is 50% shorter for fit). Cones represent tissue samples; cone width represents approximate amount of tissue; cone height is constrained at the top by the position of the last common ancestral cell of the sample, which is determined by the ancestral branch lengths, and on the bottom by the time of surgery, which is the same for all samples. The ratio of cone width to height is an approximation of the rate of cell division in each sample since the last common ancestral cell. Chromosome ploidy changes are indicated with the chromosome number; stand-alone numbers in italics indicate the number of chromosomes affected by subclone evolution (or putative chromothripsis in Patient 5). Thick branches are the earliest branches for which we are able to infer genomic events. Circles at the end of thick branches are ancestors with the colors denoting their inferred neoplasia-like, DCIS-like, or IDC-like phenotypes.

16p occurred three times. The remaining aneuploidies occurred once or twice in all

trees, and none were recurrent in the earliest stages of evolution.

In order to time the occurrence of aneuploidies relative to SNVs, we identified the

branch in the lineage tree of each patient where the first ploidy gains of chromosome

1q occurred and considered SNVs that occurred on this branch. AAF spectra of


SNVs that occurred before the ploidy gains and are located on the chromatid that was

duplicated should be enriched for higher AAF in the progeny samples. In each of

the six patients, statistical tests rejected the null hypothesis that there are no such

SNVs (Fisher’s exact test, P-values ranging from $0.5 \times 10^{-2}$ to $0.8 \times 10^{-36}$). This
pattern is reproducible between different samples of the same case, and the SNVs

that exhibit high AAF largely overlap. The same pattern holds for the ploidy gain in

chromosome 16p, but due to fewer SNVs the statistical signal is less strong. Overall,

the AAF distributions of 1q SNVs are consistent with some mutations occurring
before the ploidy gain and others occurring after it. This

suggests gradual accumulation of point mutations as a function of the number of cell

divisions, as opposed to mutational bursts.

Because the aneuploidies and SNVs independently support the lineage tree topolo-

gies, the genotypes and phenotypes of the common ancestors can be confidently in-

ferred in each case. The aforementioned mutated common ancestors of neoplasias and

carcinomas in Patients 1, 2, and 6 bore extensive aneuploidy, as did the mutated com-

mon ancestor of the DCIS and IDC in Patient 3. In all four cases, therefore, genomic

crises occurred in an ancestral cell or in consecutive daughter cells of the ancestral

cell lineage. The phenotypes of these ancestors likely included nuclear atypia and

increased rate of cell division, but no invasive capabilities. Their genomes were pre-

disposed to further genomic change, and as a result the subsequent lineages leading

to IDC accumulated numerous additional SNVs and aneuploidies.

3.4 Methods

3.4.1 Identification and processing of neoplasias

All patients except one had opted for mastectomies, and all of the available breast

tissue had been formalin-fixed, which allowed for the discovery of multiple sites of

neoplastic lesions in each case by examination of large sets of tissue sections. Neo-

plastic lesions were classified according to a standard set of criteria that included

nuclear morphology, cell shape, and tissue organization. Once a lesion was identified


and characterized, we estimated the extent of the neoplastic tissue by taking cores

and performing further sectioning and histology. We then dissected the material to

minimize the proportion of normal breast tissue in the final sample. Our goal was

to achieve 50% or more neoplastic or tumor content, but we could not rigorously

quantify this number until after sequencing had been performed.

3.4.2 Library construction and sequencing

DNA extraction from each dissected sample was performed using procedures optimized
for archival material. FFPE cores were cut into 20-µm slices. Paraffin was
dissolved in xylene and removed (four repeats of 5 min incubation with rotation in
1 mL of xylene and microcentrifugation for 3 min), followed by washing with
ethanol (four repeats of 5 min incubation with rotation in 1 mL of ethanol and
microcentrifugation for 3 min). Tissue was then lysed with Proteinase K and crosslinks
reversed by overnight incubation at 56 °C. After brief digestion with RNase A
(Qiagen), DNA was purified with a column-based method (Qiagen QIAamp DNA Mini

Kit). For each sample, one Illumina library was built with an average insert size of

between 300 and 400 bases, depending on the quality of the DNA. Half to 1 µg of

genomic DNA (depending on the availability of the material) was sheared to 400 bp

with Covaris S2, end-repaired, ligated to Illumina adapters, size selected, and
amplified with eight cycles of PCR to generate the final library. Standard Illumina 2 × 101
paired-end sequencing on the HiSeq2000 platform was performed such that the
final sequence coverage of confidently aligned reads was nearly 100× for each sample
in the first patient, and 50× for the samples of Patients 2-6. Analysis of the mapped

reads confirmed high library quality (very low duplicate read-pair fraction, almost

normally distributed fragment size, and highly uniform genome coverage) that was

indistinguishable from that of comparable libraries constructed from fresh DNA.


3.4.3 Read mapping and BAM file processing

Raw Illumina reads were uploaded to DNAnexus (https://dnanexus.com/) and aligned

to the human reference genome (UCSC build hg19) using the DNAnexus read map-

per, a hash-based probabilistic aligner that incorporates paired read information. We

used standard quality-control metrics, such as percent confidently mapped reads and

insert size distribution, to discard problematic Illumina lanes prior to subsequent

analysis. Successfully aligned reads from high-quality lanes were labeled using read

group tags and then merged into sample-level BAM files. Lane-level read group tags

improve the performance of downstream BAM processing and variant calling with

the Genome Analysis Toolkit (GATK) [60, 26].

We followed GATK’s best practices guidelines (v3) to perform sample-level BAM

processing using the Picard java utilities (http://picard.sourceforge.net/) and GATK

tools [60]. This protocol has three steps that are executed in the following order:

duplicate read marking, local realignment, and base quality score recalibration. We

used the Picard MarkDuplicates utility to mark duplicate reads based upon the read

position and orientation of read pairs. Marked duplicates were ignored in subsequent

processing and variant calling steps. GATK local realignment was performed with

standard parameters and the recommended known indel sets (Mills et al. and 1000

Genomes indels from the GATK v1.2 bundle) [62]. GATK base quality score recali-

bration was performed with the standard set of covariates. The realigned, recalibrated

BAM files produced by these processing steps were used for multisample SNV calling

and for all alignment-related statistics such as allele counts.

3.4.4 Multisample SNV calling

Multisample SNV calling was performed on processed, sample-level BAM files with

the GATK Unified Genotyper (DePristo et al. 2011). Multisample runs were grouped

by patient such that BAM files from different patients were run separately. Notable
parameters for the Unified Genotyper include standard call confidence of 50.0
(-stand_call_conf 50.0) and minimum base quality score of 20 (-mbq 20). To reduce

SNV false discovery rate, raw variant calls were filtered using GATK variant quality


score recalibration tools (VQSR) with the recommended training sets. The following

annotations were used for training: FS (strand bias), MQ (mapping quality), DP

(depth), HaplotypeScore, MQRankSum, and ReadPosRankSum. Replacing the rec-

ommended QD annotation (call quality divided by depth) with DP greatly improves

sensitivity for low-frequency somatic variants.

We used pass-filter SNVs to create a set of high-confidence germline calls and a

set of high-confidence somatic calls for each patient. For a given patient, we defined

germline SNVs as calls meeting the following multisample criteria: (1) depth 20 or

greater in every sample, where depth is defined as the sum of alternate and reference

base counts, and (2) non-reference GATK genotype (GT) in every sample. These

high-confidence germline calls were used for aneuploidy analyses (below). Somatic

SNVs were defined using a similar set of criteria: (1) depth 20 or greater in every

sample, (2) fewer than two reads supporting the alternate allele in at least one sample,

and (3) absence in dbSNP 132. We excluded SNVs in dbSNP 132 in order to reduce

the number of false-negative germline calls in our somatic SNV call set.
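The germline and somatic criteria above can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `SampleCounts` and `classify_snv` are hypothetical, the input is assumed to be pass-filter calls with per-sample reference and alternate read counts already extracted, and the order in which the rules are checked is our choice rather than something the text specifies.

```python
from typing import Dict, NamedTuple, Optional

MIN_DEPTH = 20           # "depth 20 or greater in every sample"
MAX_ALT_WHEN_ABSENT = 1  # "fewer than two reads supporting the alternate allele"


class SampleCounts(NamedTuple):
    """Per-sample evidence at one SNV position (hypothetical container)."""
    ref: int                # reads supporting the reference base
    alt: int                # reads supporting the alternate base
    non_ref_genotype: bool  # True if the caller's genotype (GT) is non-reference


def classify_snv(samples: Dict[str, SampleCounts], in_dbsnp: bool) -> Optional[str]:
    """Return 'germline', 'somatic', or None for a pass-filter SNV."""
    counts = list(samples.values())
    # Depth is defined as the sum of alternate and reference base counts.
    if any(s.ref + s.alt < MIN_DEPTH for s in counts):
        return None
    # Germline: non-reference genotype in every sample.
    if all(s.non_ref_genotype for s in counts):
        return "germline"
    # Somatic: alternate allele essentially absent in at least one sample,
    # and the position is not a known dbSNP entry.
    if not in_dbsnp and any(s.alt <= MAX_ALT_WHEN_ABSENT for s in counts):
        return "somatic"
    return None
```

For example, a position with 0 alternate reads out of 30 in the normal sample and 8 of 33 in the IDC, not present in dbSNP, would be labeled somatic, while any position with a sample below depth 20 is left unclassified.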

Three out of four Patient 2 genomic libraries were contaminated with mouse DNA,

with 15% of DCIS reads aligning to the mouse genome. Approximately 1% of reads

from Normal and 0.65% of reads from EN aligned to mouse; these fractions were
significantly above background levels for unaffected libraries. To remove
contamination-related mapping artifacts from our SNV data, we added additional filtering steps

to the SNV calling protocol for Patient 2. Prior to variant calling with the Unified

Genotyper, we eliminated all reads lacking confidently mapped mates. After variant

calling and VQSR, we removed all novel pass-filter SNVs positioned in areas of the

genome with significant homology with the mouse genome. Homology was assessed

by mapping tiled 75-mer reference sequences, surrounding each position of interest,

to the mouse genome (mm9). This second step dramatically reduced spurious calls

in DCIS while eliminating only 1% of the germline dbSNP positions used as controls.


3.4.5 Determination of somatic SNV class patterns and of robust sharing classes

Multisample somatic SNV calls were further analyzed to determine patterns of SNV-

sharing across samples within the same patient. Although GATK provides sample

genotype calls based on genotype likelihood calculations, these calls lack sensitivity

when applied to cancer samples with substantial normal contamination or subclonal

tumor populations. To further enhance sensitivity of SNV detection beyond GATK

multisample calls, we applied a simple but sensitive metric to determine each sample’s

mutation status. At each somatic SNV position predicted by GATK in at least one

sample, we considered any sample with two or more reads supporting the alternate

allele to harbor the mutation (i.e., mutation present). Samples with fewer than two

reads supporting the alternate allele were labeled as reference (i.e., mutation absent).

Our rationale was that, given that a specific SNV is detected in some samples, reads
supporting this SNV in other samples have a high prior probability of reflecting true
variants rather than sequencing errors. We call this criterion “evidence of presence” of an SNV in a given

sample. These patterns of mutation presence and absence define mutation classes

for lineage construction and other somatic SNV analyses. We note that a small but

important number of SNVs were reallocated by this method from candidate somatic

SNVs with inconsistent patterns of sharing among samples to germline events, and

that very few single-sample (“private”) SNVs were reallocated to sharing classes,

underscoring the high sequence and alignment quality of our datasets.
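The evidence-of-presence rule is simple enough to state in a few lines. The sketch below assumes that alternate-allele read counts per sample are available in a fixed sample order; the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Sequence

MIN_ALT_READS = 2  # "two or more reads supporting the alternate allele"


def presence_pattern(alt_counts: Sequence[int],
                     min_alt_reads: int = MIN_ALT_READS) -> str:
    """Encode mutation presence (1) or absence (0) per sample as a string."""
    return "".join("1" if c >= min_alt_reads else "0" for c in alt_counts)


def tally_classes(snvs: Dict[str, Sequence[int]]) -> Counter:
    """Count how many SNVs fall into each presence/absence class."""
    return Counter(presence_pattern(counts) for counts in snvs.values())
```

With four samples, alternate read counts of `[0, 5, 1, 2]` give the pattern `0101`: the third sample has a single supporting read, which falls below the threshold and is therefore scored as absent.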

A case with n samples has $2^n$ possible class patterns. For example, for a case with

five samples, the patterns are 00000 to 11111. No case has the 00000 class, because an

SNV has to be present in at least one sample, and the 11111 class is that of germline

variants. Classes that are private to one sample are 10000, 01000, 00100, 00010, and

00001. Candidate classes that are possibly phylogenetically informative are defined

by SNVs that are present in two or more, but not all, samples. To identify the subset

of robust phylogenetically informative classes, we applied the following steps:

(1) Eliminate classes with the SNV present in the lymph sample (applicable to Patients 1, 4, 5, and 6). These classes consisted of lymph-only SNVs (presumably somatic mutations in the lymph sample) and germline SNVs, where one or very few samples were missing the alternate allele, presumably due to sampling variance.

(2) Retain the classes that, when ranked in decreasing order of the number of SNVs present within them, together contain 95% of all candidate somatic SNVs. This eliminated all spurious classes that were not supported by an overall substantial number of SNVs, most of which were missing from just one sample, presumably due to sampling variance.

(3) Eliminate classes with a large fraction of SNVs whose mutation-absent samples exhibit one alternate-allele supporting read, suggestive of systematic false-negative calls. This also constituted a small number of classes with SNVs whose alternate alleles were missing from just one sample, presumably due to sampling variance.
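The first two filtering steps above can be sketched as follows. This is an illustration, not the code used in the study: the function names and the `lymph_index` parameter are hypothetical, ties between equally sized classes are broken arbitrarily, and the third step is left out because the text does not state the threshold for "a large fraction".

```python
from typing import Dict, Iterable, List


def is_candidate_informative(pattern: str) -> bool:
    """Present in two or more, but not all, samples."""
    present = pattern.count("1")
    return 2 <= present < len(pattern)


def drop_lymph_classes(patterns: Iterable[str], lymph_index: int) -> List[str]:
    """Step 1: remove classes in which the lymph sample carries the SNV."""
    return [p for p in patterns if p[lymph_index] == "0"]


def retain_top_classes(class_sizes: Dict[str, int],
                       fraction: float = 0.95) -> List[str]:
    """Step 2: keep the largest classes until they cover `fraction` of SNVs."""
    total = sum(class_sizes.values())
    kept: List[str] = []
    covered = 0
    ranked = sorted(class_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for pattern, size in ranked:
        if covered >= fraction * total:
            break
        kept.append(pattern)
        covered += size
    return kept
```

For instance, with class sizes of 900, 60, 30, and 10 SNVs, the two largest classes already cover 96% of the SNVs, so the two smallest classes are discarded.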

3.4.6 PCR-based validation of SNVs and accuracy assessment of whole-genome calls

Validation Design

We designed primers to target a random subset of SNVs within each sample-specific

and phylogenetic class for validation, using target-specific PCR amplification followed

by sequencing. We focused on Patients 2 and 6 because their lesions have the great-

est phylogenetic complexity (Figure 3.7) and therefore constitute the most stringent

test of the main results of our study. 192 and 196 primer sets were designed for

Patients 2 and 6, respectively, such that each SNV to be validated was within ap-

proximately 40 bases of the sequence start site. Primer design was optimized for

multiplexing. Primers contained Illumina linker sequences to facilitate sequencing.

The initial target-specific multiplex PCR was performed with slow-annealing. A sec-

ond PCR using Illumina-compatible primers added barcodes and yielded preparative

amounts of material. All barcoded samples from a single patient were combined into

a single lane of HiSeq2000 for sequencing. For Patient 2, 192 of 192 targets success-

fully generated enough reads to support validation, with a mean coverage (number of

reads per target per sample) of almost 190,000. For Patient 6, 195 of 196 targets were

successful, with a mean coverage of just over 43,000. Amplification and sequencing


were performed with all targets (each pool containing all 192 or 196 patient-specific PCR

primer pairs) on all samples (which were amplified separately) of each patient. This

design supported two levels of validation for both patients, which we denote A and

B. Two more types of validation, C and D, were possible in Patient 6. For a visual

representation of the results from Patient 6 please see Figure 3.8.

Validation A

Validation A is the simplest of the four approaches. It asks whether the validation

PCR/sequencing supports the initial SNV call at all, i.e., whether the alternate allele

is detectable well above background in at least one sample.

A for Patient 2 is 192/192 = 100%.

A for Patient 6 is 180/195 = 92%.

12 of the false positives of Patient 6 are SNVs that had initially been called as private

to the ENA sample. Excluding the ENA-only calls, the validation rate improves to

172/175 = 98%. In this context, we note that SNVs that are present in the ENA and

also in another sample have a much better validation rate than those present in the

ENA alone, due to the additional signal provided by the other samples. We conclude

that our initial SNV calls had a high degree of specificity. SNVs present in more than

one sample, which comprise the classes that are most important for our study, have

an almost perfect validation rate.

Validation B

Validation B addresses sample-specificity and whether the assignment of an SNV to

a specific class, especially to a phylogenetically informative class, was correct. The

most stringent metric is to ask what fraction of SNVs are validated to be present in

precisely the same set of samples as the initial assignment based on the whole genome

sequence, and to count each SNV with a misassignment as incorrect. It uses those SNVs

that were validated to be present (validation A).

B for Patient 2 is 180/192 = 94%.

[Figure 3.8 plot not reproduced. Panels show alternate allele frequency (y-axis, 0.1 to 0.5) per sample (x-axis). Sample labels recovered from the plot: Lymph, ENA, EN, DCIS, EN_cl, IDC, IDC (new1), IDC (new2), normal (new). Classes and SNV counts, in the order extracted: 0 1 0 0 0 0 * * * (N=20); 0 0 1 0 0 0 * * * (N=20); 0 0 0 1 0 0 * * * (N=25); 0 0 0 0 1 0 * * * (N=21); 0 0 0 0 0 1 * * * (N=19); 0 0 1 0 0 1 * * * (N=30); 0 0 1 1 0 0 * * * (N=34); 0 0 1 1 0 1 * * * (N=20).]

Figure 3.8: Alternate allele frequencies in each tested private class (green binary code) or phylogenetically informative class (magenta binary code) of somatic SNVs of Patient 6. Code denotes the class of SNV as determined by the whole genome sequence analysis. Starred samples were not present in the whole genome analysis. Each SNV corresponds to a bar whose position is repeated stereotypically for each sample. SNVs are sorted by position in the genome. N denotes the number of SNVs (and therefore the number of bars per sample) in each class. Presence/absence patterns from the validation experiment are visible as clustered bars, denoting consistently high alternate allele frequencies of SNVs in the sample. High validation rate and therefore concordance in class assignments is visible as correspondence between the whole-genome-derived presence/absence code and the blocks of above-background alternate allele frequencies.

11 of the 12 miscalls involve an SNV that was initially called as IDC-only, but is


in fact an SNV shared between the EN and the IDC. Recall that this class had low

alternate allele frequencies in the EN, so these were simply missed in the genome-wide

data due to their very low frequency. B for Patient 6 is 176/180 = 98%. In summary,

our class assignments are highly accurate and the small amount of error does not
affect the study’s results or conclusions in any way.

Validation C

In Patient 6, we were able to go back to the archival tissue and recover additional

(separate) IDC material as well as a sample of normal tissue. In what might be called

a “biological” validation, we can therefore ask what fraction of SNVs present in the

original IDC are also present in the two new IDC samples. Class IDC-only: 18 of the

19 IDC-only SNVs tested also appear in both new IDC samples. The one SNV that

is not present in the new IDC samples has the lowest alternate allele frequency in the

original IDC sample, indicating that it marks a subclone not present outside of this

sample. Phylogenetically informative classes that include IDC: 50 out of 50 SNVs

were present in the new IDC samples. Thus, this validation shows that mutations we

find in a single IDC isolate are fully supported by their presence in independent IDC

isolates, and that our false-positive rate for this class is effectively zero.

Validation D

The addition of a sample of normal tissue from the ipsilateral breast in close proximity

to the other lesions allowed us to ask whether any SNVs we targeted would give a

false-positive signal in the validation. The seven SNVs we tested that were shared

among all ipsilateral samples were also positive in this normal sample, as expected

for SNVs that arose early in breast development; none of the remaining SNVs that

were private to one sample or comprised the phylogenetically informative classes

(N=188) had signal above background in the normal sample, again underscoring

superior specificity of our somatic SNV calls.


Validation of the “Evidence-of-Presence” criterion

The validation data also allowed us to examine whether the reassignment of SNVs

according to our evidence-of-presence criterion improved accuracy over GATK
multisample calling. As described in Section 3.4.5, we first perform the standard

GATK multi-sample SNV calling to identify the set of somatic SNVs in a patient.

GATK results include class membership, i.e., in which sample the alternate allele

of the SNV is present. But we adjust this class membership using our evidence-of-

presence criterion, which asks whether there is evidence for the alternate allele of an

SNV in a sample where GATK did not call it. The logic is as follows: Assume that an

SNV is called by GATK in sample A of a given patient, but not in sample B. Assume

that in sample B, there are two (or more) reads supporting that SNV. (This situa-

tion is common with GATK.) Due to its presence in sample A, the SNV has a high

prior probability of being a true somatic SNV in sample B, rather than resulting from

coincidence of sequencing errors in the two or more reads supporting it. Recall that

typically fewer than 1000 somatic SNVs are called per sample; this is several orders of

magnitude fewer positions than the entire genome, and therefore it is possible to use a

more sensitive criterion for detection of SNVs in these positions than for de novo dis-

covery in the entire genome, without increasing the false positive rate substantially.

The validation data show that application of the evidence-of-presence criterion in-

deed improves call accuracy over the GATK class assignments: In Patient 2, 17 SNVs

within our validation set had been reassigned according to evidence-of-presence. In

14 out of these 17 cases, the reassignment detected the mutations in samples that

were validated, thus improving over GATK calls; in 3 cases the reassignment cre-

ated a false positive, i.e., detecting an SNV in a sample which was not supported

by our validation. Similarly, in Patient 6, of the 14 SNVs within our validation set

that were reassigned according to evidence-of-presence, 11 were correctly reassigned;

in 3 cases evidence-of-presence called an SNV in one additional false positive sam-

ple. In summary, we concluded that evidence-of-presence significantly improved class

assignments over GATK.


Conclusion

The results from the four approaches to validation reveal the robustness of the

genome-wide data, particularly of the phylogenetically informative classes, which

form a cornerstone of our study. The results from the assessment of the evidence-of-

presence criterion versus original GATK calls underscore the power of multisample

calling and the technical robustness of our analytic approaches.

3.4.7 Aneuploidy and tumor purity

To identify aneuploidies we selected a subset of the germline SNVs identified by

GATK. These “sgSNVs” were defined, separately for each patient, as a patient’s

multisample germline SNVs that had dbSNP132 entries, were heterozygous, and had

minor allele frequencies in the control sample of at least 0.25. We define the “lesser

allele” as the one supported by fewer reads than the other allele (which is the “preva-

lent allele”). Three metrics were calculated for each SNV: the lesser allele coverage,

the prevalent allele coverage, and the lesser allele fraction (LAF). The LAF was used

to identify aneuploidies, whose “sign” (loss or gain) was then set by the two coverage

metrics.
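As a concrete illustration of the three per-SNV metrics and the sgSNV selection, the sketch below works from reference and alternate read counts. The function names are hypothetical, and reading "minor allele frequency in the control sample" as the read fraction of the less-supported allele in the control is our assumption.

```python
from typing import NamedTuple


class AlleleMetrics(NamedTuple):
    lesser_coverage: int     # reads supporting the lesser allele
    prevalent_coverage: int  # reads supporting the prevalent allele
    laf: float               # lesser allele fraction


def allele_metrics(ref_reads: int, alt_reads: int) -> AlleleMetrics:
    """Compute lesser/prevalent allele coverage and the LAF at one SNV."""
    lesser = min(ref_reads, alt_reads)
    prevalent = max(ref_reads, alt_reads)
    total = lesser + prevalent
    if total == 0:
        raise ValueError("no reads cover this position")
    return AlleleMetrics(lesser, prevalent, lesser / total)


def is_sgsnv(in_dbsnp: bool, heterozygous: bool,
             control_ref_reads: int, control_alt_reads: int,
             min_control_fraction: float = 0.25) -> bool:
    """Select germline SNVs suitable for aneuploidy analysis."""
    if not (in_dbsnp and heterozygous):
        return False
    control = allele_metrics(control_ref_reads, control_alt_reads)
    return control.laf >= min_control_fraction
```

By construction the LAF never exceeds 0.5; a heterozygous SNV on a euploid chromosome is expected near 0.5, and both gains and losses pull it downward, which is why the coverage metrics are needed to set the sign.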

In all patients except 5, the vast majority of chromosomal copy-number transitions

coincided with the centromere, or the whole chromosome was involved. Fine mapping

of the transition points was therefore not usually necessary. In the handful of cases

where a transition point did not coincide with a centromere, we found the window of

the plot at which the event either started or ended (window i). As discussed in Figure

3.5, each window spans 1000 SNVs, with an overlap of 500 SNVs between adjacent

windows. We then plotted the frequency of the heterozygous variants in the three

relevant windows (i − 1, i, i + 1, totaling 2000 variants) in that sample. The variant

at which the frequency shifted was easily detected by eye, and it was not necessary

to deploy segmentation methods. The resolution of this analysis is low (determined

by what can be seen by eye on the plots) and we did not attempt to identify events

that involved regions smaller than about a third of a chromosome arm. We also note

that we did not attempt to identify structural rearrangements that do not result in


copy-number changes, such as inversions.
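The windowing scheme (1000 SNVs per window, 500 SNV overlap) can be sketched as below. The text does not say which summary statistic is plotted per window; the mean is used here as an assumption, and `windowed_laf` is a hypothetical name.

```python
from statistics import mean
from typing import List, Sequence


def windowed_laf(lafs: Sequence[float], window: int = 1000,
                 step: int = 500) -> List[float]:
    """Summarize genome-ordered LAFs in overlapping windows.

    A step of 500 with a window of 1000 gives the 500-SNV overlap
    between adjacent windows.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not lafs:
        return []
    last_start = max(len(lafs) - window, 0)
    return [mean(lafs[start:start + window])
            for start in range(0, last_start + 1, step)]
```

Three adjacent windows i − 1, i, and i + 1 start 500 SNVs apart and therefore span 2000 distinct SNVs, which matches the count given in the text for fine mapping of a transition point.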

The identified loss of heterozygosity (LOH) chromosomes were then used to esti-

mate the fraction of the sample that is due to normal cells (lymphocytes, myocytes,

etc.), as follows: All cancer cells contribute zero copies of an allele that was lost due

to LOH, and the normal cells contribute one copy of the LOH allele times the contam-

ination fraction n. Note that in all of our patients, the control samples were free of

LOH chromosomes (Figure 3.6A). The LOH allele is almost always the one with fewer

reads. Therefore, the LAF $l$ should, on average, be equal to the lost-chromosome
fraction that is contributed by the normal contamination. Some arithmetic shows that
$n = \frac{l}{1 - l}$. Once $n$ was estimated from $l$, the exact ploidy $p$ for those chromosomes that
had gains was calculated according to the formula $p = \frac{1 - 2nl}{l(1 - n)}$.
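The two formulas can be evaluated directly, as in the sketch below (function names are hypothetical). They are consistent with a simple mixture reading in which normal cells carry one copy of each allele and tumor cells carry one copy of the lesser allele: a loss then gives l = n / (1 + n), and a gain to tumor ploidy p gives l = 1 / (2n + p(1 − n)). That reading is our interpretation of "some arithmetic", not a derivation given in the text.

```python
def normal_fraction_from_loh(laf: float) -> float:
    """Estimate the normal-cell fraction n from the LAF l of a lost chromosome.

    Implements n = l / (1 - l).
    """
    if not 0.0 <= laf < 0.5:
        raise ValueError("LAF on a lost chromosome must lie in [0, 0.5)")
    return laf / (1.0 - laf)


def ploidy_from_gain(laf: float, n: float) -> float:
    """Estimate the ploidy p of a gained chromosome.

    Implements p = (1 - 2 n l) / (l (1 - n)).
    """
    if not 0.0 < laf <= 0.5:
        raise ValueError("LAF must lie in (0, 0.5]")
    if not 0.0 <= n < 1.0:
        raise ValueError("normal fraction must lie in [0, 1)")
    return (1.0 - 2.0 * n * laf) / (laf * (1.0 - n))
```

As a sanity check under that reading, a LAF of 0.2 on a lost chromosome corresponds to n = 0.25, and with n = 0.25 a gained chromosome whose LAF is 1/2.75 ≈ 0.364 corresponds to a ploidy of 3. A pure, euploid sample (n = 0, LAF 0.5) returns a ploidy of 2.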

Sequence-based n’s roughly matched estimates of n by histology. The histology-

based estimates are necessarily an approximation because they are based on limited

sampling, by sectioning of the tissue core mass from which DNA is obtained.

3.4.8 SNV mutation spectra

Mutation spectra for patient samples were aggregated in two ways: (1) combined

across patients to form three “superclasses” of SNVs based on lesion class (private

in early neoplasias, private in carcinomas, and shared between neoplasias and carci-

nomas); (2) combined within each patient, ignoring lesion class, to form six groups.

Complementary mutations were pooled, reducing the number of possible mononu-

cleotide mutations from 12 to 6, and the number of single-base substitution classes

in dinucleotides from 16 × 6 = 96 to 10 × 6 = 60.

Mononucleotide mutation spectra were simply estimated from the frequency of

the mutation type (Figure 3.3, cf. A and B, where the bars of each color add up to

1). For dinucleotides, we calculated rates by dividing the number of events of each of

the 60 changes by the genome-wide count of the dinucleotide that was mutated.
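The pooling of complementary mutations can be illustrated with a short sketch that reproduces the class counts quoted above (12 to 6 substitution types, 16 to 10 dinucleotide contexts). Representing each pooled substitution with a pyrimidine reference base, and each pooled dinucleotide by the alphabetically smaller of the sequence and its reverse complement, are conventions assumed here; the text does not state which representative was used.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def pooled_substitution(ref: str, alt: str) -> str:
    """Pool a substitution with its complement (e.g. G>A becomes C>T)."""
    if ref in "AG":  # purine reference: report the complementary strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def pooled_dinucleotide(dinucleotide: str) -> str:
    """Pool a dinucleotide with its reverse complement (e.g. TG becomes CA)."""
    return min(dinucleotide, reverse_complement(dinucleotide))


mono_classes = {pooled_substitution(r, a)
                for r, a in product(BASES, repeat=2) if r != a}
dinucleotide_classes = {pooled_dinucleotide(a + b)
                        for a, b in product(BASES, repeat=2)}
```

Here `mono_classes` contains six entries and `dinucleotide_classes` contains ten, because four dinucleotides (AT, TA, CG, GC) are their own reverse complements and the remaining twelve collapse into six pairs.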


3.4.9 Tree inference

Tree topology was defined by the phylogenetically informative SNV classes. The data

are unambiguous and we therefore used parsimony to establish which samples shared

common ancestors in which configuration. Once the SNV-based trees were built,

aneuploidy events could be mapped onto them, and again the data were unambiguous.

Even successive gains of ploidy of the same chromosome, most prominently among

them 1q (e.g., Fig. 3.7F), could be ordered without conflicts.
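One way to check that a set of sharing classes admits a conflict-free tree is the standard pairwise compatibility test for a rooted perfect phylogeny: any two classes must be either nested or disjoint in the samples they cover. The sketch below is an illustration of that general test, not the procedure used in this work, and it treats each sample as a single leaf (it does not model samples that are mixtures of subclones). The function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Tuple


def samples_in_class(pattern: str) -> FrozenSet[int]:
    """Indices of the samples in which the class's SNVs are present."""
    return frozenset(i for i, bit in enumerate(pattern) if bit == "1")


def incompatible_pairs(patterns: Iterable[str]) -> List[Tuple[str, str]]:
    """Return pairs of classes that overlap without one containing the other."""
    conflicts = []
    for a, b in combinations(sorted(set(patterns)), 2):
        sa, sb = samples_in_class(a), samples_in_class(b)
        overlapping = bool(sa & sb)
        nested = sa <= sb or sb <= sa
        if overlapping and not nested:
            conflicts.append((a, b))
    return conflicts
```

For example, the hypothetical classes `0011` and `0111` are nested and can be placed on successive branches, whereas `0110` and `0011` share one sample but neither contains the other, so no single tree explains both without a repeated or lost mutation.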

3.4.10 Ordering SNVs vs. chromosome 1q ploidy gain in ancestral branches

We devised a statistical test to ask whether some SNVs occurred before copy gain

in aneuploidy regions. For each patient, we identified the branch in the lineage tree

responsible for the first copy-number changes in chromosome 1q, which consistently

represents the earliest aneuploidy event in our patients. We then analyzed the AAF

spectra of SNVs occurring in that branch. The test below is based on the idea that

SNVs that occur on a 1q chromatid prior to gain of a copy of that chromatid should

have higher AAF than SNVs occurring on a 1q chromosome after copy gain.

We used SNVs on all diploid chromosomes on the same branch as our control

set. Sequence coverage is scaled with respect to the aneuploidy and controls for

contamination of the sample by normal cells (lymphocytes, etc.):

$$\text{scaled coverage} = \text{coverage} \times \left( \frac{p \times (1 - n)}{2} + n \right)$$

where p is the estimated ploidy and n is the estimated normal contamination. In

order to find outliers indicative of events prior to copy gain, we calculated a Z-score.

SNVs with AAFs with Z-score > 3 were labeled as “high” and SNVs falling below

threshold were labeled as “low”. For each patient, we used Fisher’s exact test to

compare the distribution of SNV labels in the control chromosomes vs. 1q. In each

of the patients, we reject the null hypothesis that the 1q distribution is equal to or

less extreme than the control distribution.
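The two computational ingredients of this test can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function names are hypothetical, the one-sided form of Fisher's exact test is assumed from the wording of the null hypothesis, and the Z-score labeling step is omitted because its details are not specified here.

```python
from math import comb

def scaled_coverage(coverage, ploidy, normal_fraction):
    """coverage * (p * (1 - n) / 2 + n), following the scaling formula above."""
    return coverage * (ploidy * (1.0 - normal_fraction) / 2.0 + normal_fraction)

def fisher_one_sided(high_1q, low_1q, high_ctrl, low_ctrl):
    """One-sided Fisher's exact test on a 2x2 table of SNV labels.

    Returns the probability, under the hypergeometric null, of observing
    at least `high_1q` "high" SNVs among the 1q SNVs.
    """
    n_1q = high_1q + low_1q
    n_high = high_1q + high_ctrl
    total = n_1q + high_ctrl + low_ctrl
    upper = min(n_1q, n_high)
    favourable = sum(
        comb(n_high, k) * comb(total - n_high, n_1q - k)
        for k in range(high_1q, upper + 1)
    )
    return favourable / comb(total, n_1q)
```

For example, a purely diploid sample (p = 2) keeps its coverage unchanged for any contamination level n, whereas a triploid region with n = 0 is scaled by 1.5.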


3.5 Discussion

Evolutionary studies of cancer have so far focused on the inference of clonal evolution

within the cancer (e.g., [67]) or analyses of the relationship of metastases with the

primary tumor (e.g., [64]). Here we addressed a different perspective, namely that

of the early origins of the cancer phenotype. These three approaches can be thought

of as mimicking progression, at least as far as solid tumors are concerned: Studies of

metastatic evolution are about the terminal stages of the cancer; studies of within-

cancer subclone diversity are about the Darwinian process of faster versus slower

growing cell populations and the evolution of the primary tumor mass; and studies

of early neoplasias and their relationships to the diagnostic tumors are about early

origins of cancer.

Our understanding of these early origins will be greatly enhanced by molecular

evolutionary analyses similar to those that have advanced our understanding of organ-

ismal evolution. Cells within concurrent lesions are analogous to extant organisms:

they are related to one another by bifurcating lineage trees and have accumulated

genomic changes over the course of evolution. In our study of multiple lesions in six

cases of ductal breast carcinoma, we found that the genomes of ancestors of some early

neoplasias and carcinomas were already aneuploid and harbored a modest number of

point mutations. By comparing mutational spectra of somatic SNVs across patients

and samples we inferred that somatic SNVs accumulated gradually as a result of a

large number of ancestral cell divisions and not during saltatory mutational crises.

In two cases, the carcinoma phenotype originated twice independently from an an-

cestral neoplastic phenotype, suggesting a substantial predisposition of the ancestor

to generate cancerous progeny.

All of the neoplasias with aneuploidies shared common cellular ancestors with the

carcinomas; in all of these cases the neoplasia and carcinoma shared these aneuploidies

as well as somatic SNVs. In contrast, none of the neoplasias that were devoid of

aneuploidies (all contralateral ENs and five ipsilateral ENs) were closely related to a

carcinoma. Among the aneuploidies, gain of chromosome 1q was most dramatically

recurrent, which is consistent with its prevalence among late-stage breast cancers (cf.


Fig. 4 in [24]). 1q harbors more than a thousand genes, and while the increased dosage

alone is not sufficient for a carcinoma phenotype (some of our neoplastic samples carry

the increased 1q ploidy), it is likely to be predisposing to further genomic change.

Initially, such change may be catalyzed primarily by an increased rate of cell division,

as the mutation spectrum of the early neoplasias is indistinguishable from that of the

IDCs in every patient examined. Additional aneuploidies accumulate, however, and

at some point a combination of dosage imbalances and mutational load, and perhaps

epigenetic or stromal changes as well, results in an invasive carcinoma phenotype.

We anticipate that the evolution of a diverse set of breast and other cancers

will soon be studied similarly and with complementary approaches [74, 64, 36, 67,

75]. Current practice in clinical diagnosis of cancer facilitates studies on archival

material because of the low cost and superior quality of histopathological examination

of formalin-fixed, paraffin-embedded samples. We show that high-quality, large-scale

genome sequence can be obtained from archival material, and show by validation that

the data from such material can be highly robust. Evolutionary inference based on

many samples of such material opens a new dimension for analysis of cancer origins

and progression. In the future, phylogenetic analysis of carcinomas and concurrent

lesions will suggest drugs that attack both carcinoma and early lesions by targeting

genomic changes common to all lesions, removing not only the carcinoma, but also

the reservoir of related cells from which a carcinoma might recur.

Chapter 4

Haplotype reconstruction of

somatic genomes using long reads

4.1 Introduction

As discussed in section 2.3, haplotype inference has a broad spectrum of applications.

Despite the recognition of the central role of haplotype information in diagnostic and

prognostic studies, haplotype assembly of cancer genomes was not feasible until very

recently. This impediment was mainly due to the limitations of genetic and statistical

phasing methods in phasing de novo and somatic variants. Currently, single-molecule

sequencing techniques do not provide a viable solution either, due to their high costs

and error rates. Fortunately, recent advances in experimental technologies are opening

a new avenue for haplotype reconstruction of somatic genomes. In 2013, the phased

genome and epigenome of the HeLa cancer cell line were published [2], only made

possible by sequencing pools of fosmid clones, in addition to shotgun and mate-pair

sequencing in an exhaustive analysis. The findings highlighted the importance of

haplotype information in providing a complete profile for cancer genomes. How-

ever, the laborious and time-consuming protocols for preparation of fosmid libraries

restricted the wide use of this technology in cancer genomics studies. Recent tech-

nologies such as CPT-seq [3], and Moleculo [48] reconstruct long DNA fragments

by utilizing well-established and highly accurate short read sequencing. To achieve


more accurate reconstructed long fragments, they use specialized DNA partitioning

techniques, along with PCR amplification protocols. The Moleculo method conducts

sequencing of sub-haploid (~10Kbp) DNA fragments by performing in vitro dilution

of fragments into several hundred wells. These molecules are then sheared into

smaller fragments, and assigned barcodes that are unique per well. Small fragments

from all wells are then pooled and sequenced with Illumina sequencing technology.

The number of fragments within a well is set such that each well covers only a small

fraction (1-2%) of the haploid genome. As a result, when demultiplexed reads from

individual wells are mapped to the reference genome, shotgun reads originating from

single long DNA fragments form islands of reads such that each island represents one

long molecule. We refer to these islands as read clouds. In the Moleculo protocol,

the sequencing coverage of a genome is determined by two parameters: the coverage

of DNA fragments with short reads, which we denote by CR, and the coverage of a

genome with long fragments (or read clouds), which we denote by CF (Figure 4.1).

Figure 4.1: Moleculo read clouds. Sequence reads from each well are mapped to the reference genome separately. Clusters of sequence reads originated from single DNA molecules are identified and labeled as read clouds. Read clouds are circled in this figure. CR represents the coverage of DNA fragments with short reads. CF represents the coverage of genome with read clouds.


In this chapter, I describe a multifaceted approach for haplotype reconstruction

of a solid tumor that is sequenced using the Moleculo protocol. Our approach ex-

ploits long-range information from resulting read clouds, linkage information across

variants, and cancer-specific aneuploidies, all in one package, to produce long and

highly accurate haplotype blocks. We successfully applied our methodology to phase

the genome of an invasive ductal carcinoma for the first time, and achieved haplotype

blocks with N50 scaffold size of 27.7 Mbp. We confirmed the accuracy of inferred

haplotypes to be over 99.9%, by computationally validating the inferred haplotypes

using existing linkage disequilibrium patterns between variants and also aneuploidy

information.

4.2 Results

4.2.1 Dataset and SNV detection pipeline

In this study, we selected an invasive ductal carcinoma (IDC) lesion from a grade 3,

ERBB2 amplified, estrogen receptor (ER) negative breast cancer patient. A sample

from normal breast tissue of the same patient was also obtained and sequenced to

serve as a control. We sequenced eight Moleculo libraries of the IDC sample, with an

average per base CR = 1.8x, CF = 44x, and an average fragment length of 9Kbp. To

obtain a more robust set of variants, we also performed four lanes of whole-genome

shotgun sequencing on both normal and IDC samples.

Single nucleotide variant calls (SNVs) were made by performing GATK multi-

sample genotyper on shotgun reads from normal and IDC samples as well as on

pooled sequence reads from Moleculo data. SNVs that were called in both sets of

IDC libraries (shotgun and Moleculo) and the control sample were considered as

germline variants. Those that were absent in the control sample were considered

as somatic variants. We identified 3,107,350 germline SNVs, of which 1,869,281 were

heterozygous and therefore informative for phasing. About 91% of candidate germline

SNVs were reported in the dbSNP database (dbSNP 138). On average, heterozygous

SNVs were spanned by 41 read clouds, 30 of which covered the variant with short reads.


In total, 3530 somatic variants were called in both sets of IDC libraries (shotgun and

Moleculo) and were absent in the normal sample.

4.2.2 Overview of the framework

In this work, we present a comprehensive framework for haplotype reconstruction of

a somatic genome. Our framework consists of three major steps each leveraging a

unique aspect of the data. It exploits the haploid nature of Moleculo read clouds,

linkage disequilibrium (LD) information between heterozygous variants, and cancer-

specific aneuploidies to infer an accurate and long-range phase between germline and

somatic SNVs in a cancer sample. At the end of each step, we computationally

validate the inferred phase by utilizing features of the data that are not yet employed

in the inference procedure up to that step.

1. Local phase

First, we determine the phase between germline and somatic SNVs by opti-

mizing the assignment of overlapping read clouds to two parental haplotypes

and up to two somatic haplotypes. Parental haplotypes correspond to the two

copies of the genome inherited from parents and provide the phase informa-

tion between germline SNVs. Each somatic haplotype is linked to one of the

parental haplotypes, and indicates which somatic mutations occurred on that

copy of the chromosome in the cancer cells. In our approach, we first infer the

phase between germline SNVs by finding the best assignment of all read clouds

to the two parental haplotypes (Figure 4.2B). We then proceed and reassign

read clouds to somatic haplotypes at the sites of somatic alterations to infer

which parental chromosome was altered by a somatic variant (Figure 4.2C).

To best model the specific properties of Moleculo data, we developed a cus-

tomized probabilistic inference model to phase the variants. The model uses a

Monte Carlo simulated annealing procedure to find the optimal assignment of

read clouds to parental haplotypes. The states of this model are all possible


Figure 4.2: Reconstructing parental and somatic haplotypes in the local phase step. A. Read clouds are depicted as linear segments with reference and alternate alleles shown whenever they are covered by at least one shotgun read within the cloud. Black circles represent positions where a cloud has reads covering both reference and alternate alleles, most likely due to sequencing errors. We use individual reads in our analysis, this simplification is only for visualization purposes. B. Overlapped read clouds at heterozygous variants are assigned to two parental haplotypes (h1, h2). C. Two somatic haplotypes are shown at the top and the bottom. Read clouds covering the alternate allele at a somatic variant are moved to somatic haplotypes. (Legend: germline SNV, somatic SNV, reference allele, variant allele, mixed alleles.)

haplotype assignments of read clouds. The method starts with a random ini-

tialization of all read clouds to two parental haplotypes, and transitions within

its state space by choosing a move from the following three moves at random:

cloud reassignment, cloud split, and switch unwinding (Figure 4.3). Each move

is introduced for a specific purpose detailed in the methods section. The cloud


split move, in particular, is introduced to identify and address chimeric read

clouds, which result from fragment collisions within a Moleculo well. To reduce

the effect of sequencing errors in resulting haplotypes, we incorporated mapping

and base-call qualities of individual sequence read members of a read cloud into

the calculation of the likelihood score at each state of the model. The proposed

state after each move is immediately accepted if it achieves a higher likelihood

score. However, if the new likelihood score is lower than the previous score, it

is accepted probabilistically.

Since the density of heterozygous SNPs varies considerably along the genome,

the genomic distance between consecutive variants can be larger than the length

of input DNA fragments. As a result, the local phasing step produces thousands

of haplotype blocks of various lengths with their relative phase remaining

unknown. To consolidate these local haplotype blocks into longer chromosome-

spanning haplotypes, we utilized two sources of information: linkage disequilib-

rium between SNPs, and unbalanced allelic ratios in large somatic CNV regions.

2. Statistical phase

Linkage disequilibrium (LD) information between genomic variants can provide

valuable information about their phase. In the extreme case of complete LD

between two markers, where two alleles are always found together in a pop-

ulation, the phase of the two variants can be immediately inferred. Linkage

disequilibrium information between markers is widely used in statistical phas-

ing methods. However, statistical haplotype phasing approaches are limited to

phasing variants that are prevalent in a population. These methods cannot be

applied to de novo (or rare) germline variants and somatic mutations. In this

step, we overcome this limitation by applying statistical phasing on already

phased local blocks. In this approach, the phase between somatic and de novo

variants and their neighboring heterozygous SNPs is first inferred in the local

phasing step. LD information between heterozygous SNPs is only then lever-

aged to join haplotype blocks wherever possible to produce larger haplotype


Figure 4.3: Probabilistic inference model for phasing germline variants. Read clouds are assigned randomly to two haplotypes. The inference model learns the true haplotypes by reassigning read clouds to the two haplotypes using three different moves: A. Cloud reassignment. A read cloud is randomly selected and reassigned to the other haplotype. B. A genomic position is randomly selected and read clouds on one side are reassigned to opposite haplotypes. C. Cloud split. A read cloud is broken into two parts, and the two parts are assigned to opposite haplotypes.

blocks.

3. Leveraging cancer aneuploidy


Unlike diploid genomes, genomes of aneuploid cancer cells have additional prop-

erties that we can exploit to further connect haplotype blocks. As shown in

the previous chapter, invasive carcinomas often exhibit large copy number variants

(CNVs) including chromosome-wide aneuploidies. If we limit our attention to

CNV regions with an uneven haplotype ratio, haplotype blocks that are spanned

by a CNV can then be connected if we exploit the resulting imbalance of al-

lelic ratios at heterozygous variants. More specifically, we can use allelic ratios

at heterozygous variants to infer which haplotype in a haplotype block has a

higher copy number. If a single CNV event spans more than one haplotype

block, more prevalent and less prevalent haplotypes of those blocks can be con-

nected respectively.

In our approach, we first calculate the lesser allele fraction (LAF) value, which

is the fraction of reads supporting the less prevalent allele, at each heterozygous

variant in both IDC and normal samples. To have more accurate LAF values,

which are also comparable between normal and IDC data, we use sequence

reads from our shotgun libraries. We then employ an HMM model to break the

sequence of IDC LAF values into segments that exhibit the same allele fraction,

suggesting that variants in each segment are spanned by the same CNV event.

Next, we label parental haplotypes at each haplotype block in these segments

as “less frequent” and “more frequent” by developing an HMM method that

employs haplotype allelic ratios at each germline SNV. We also use this HMM

method to identify switch errors in the haplotype blocks that were produced

by either local or statistical phasing. Finally, we utilize these labels to connect

haplotypes of consecutive haplotype blocks within each segment.
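The quantities used in the aneuploidy step can be illustrated with a deliberately simplified sketch. It replaces the HMM described above with a plain mean-and-threshold rule, so it only conveys the idea of labeling and orienting blocks; the function names and the 0.5 threshold are assumptions for illustration.

```python
def lesser_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting the less prevalent allele at one variant."""
    total = ref_reads + alt_reads
    if total == 0:
        return None  # no coverage, LAF undefined
    return min(ref_reads, alt_reads) / total

def flips_to_align(block_h1_fractions):
    """Decide, per block of one CNV segment, whether to swap h1 and h2.

    `block_h1_fractions` holds, for each block, the fractions of shotgun
    reads supporting the h1 allele at its germline SNVs. A block is flipped
    when h1 looks like the more prevalent haplotype (mean above 0.5), so
    that after flipping h1 is the less prevalent haplotype in every block.
    This threshold rule stands in for the HMM used in the actual method.
    """
    return [sum(f) / len(f) > 0.5 for f in block_h1_fractions]
```

Once every block in a segment is oriented the same way, the h1 haplotypes of consecutive blocks can be concatenated, and likewise the h2 haplotypes.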

4.2.3 Local phasing

In this patient, germline heterozygous SNVs are on average 1.6Kbp apart (97.5%

of neighboring heterozygous SNVs are less than 9Kbp apart). Therefore, most read

clouds span multiple heterozygous SNVs. Since each read cloud originates from a


single molecule, the phase of variants it covers can be determined. Long-range phase

information between SNVs can then be obtained by leveraging information from over-

lapping read clouds.

Haplotype phasing in the presence of sequencing and alignment errors is computationally

expensive. Although exact algorithms for phasing have been developed (e.g.,

[41, 19]), their running times scale exponentially in either the average size of reads

or the sequencing coverage of the genome. Cancer samples are usually sequenced

at a high coverage for improving the sensitivity of mutation-detection in heteroge-

neous cancer specimens with high levels of normal-contamination. Therefore, a more

scalable approach is needed for analyzing tumor samples. As part of this work, we

developed a probabilistic inference model that can be applied to cancer data to phase

not only inherited but also de novo and somatic SNVs. Existing phasing methods for

long reads model each fragment as a long string of alleles; therefore, synthetic long reads

must first be created from short sequence reads. In contrast, our method works by the

direct analysis of shotgun reads in each read cloud. As a result, we can incorporate

mapping and base-call qualities of individual reads into our model, which in turn

results in highly accurate haplotype blocks.
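The accept/reject rule of the simulated annealing procedure outlined in section 4.2.2 can be sketched as follows. The text states only that a worse state is accepted probabilistically; the Metropolis criterion used here, the log-likelihood scale, and the `temperature` parameter are assumptions for illustration rather than the exact rule of our model.

```python
import math
import random

def accept_move(current_loglik, proposed_loglik, temperature, rng=random):
    """Decide whether to accept a proposed reassignment of read clouds.

    A state with a higher (or equal) likelihood is always accepted. A worse
    state is accepted with probability exp(delta / temperature), where delta
    is the (negative) change in log-likelihood (assumed Metropolis rule).
    """
    if proposed_loglik >= current_loglik:
        return True
    delta = proposed_loglik - current_loglik
    return rng.random() < math.exp(delta / temperature)
```

Lowering `temperature` over the course of the run makes the search increasingly greedy, which is the usual motivation for an annealing schedule.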

In this step, we phased 99% of the germline and 77% of the somatic SNVs called

by GATK, producing 43,719 phased contigs with N50 of 81.02 Kbp. Half of germline

variants were in haplotype blocks with 115 or more SNVs (V50=115). The remaining

somatic variants stayed unphased due to two main reasons. About 13% of them were

more than 10Kbp distant from any germline heterozygous variants, and thus were not

connected to neighboring heterozygous sites by read clouds. For the remaining 10%,

no read cloud connected the variant allele to another heterozygous SNV, therefore the

phase could not be inferred. This was mainly due to sampling bias and also

the high normal contamination of the studied IDC lesion (computationally estimated

to be about 70%). Figure 4.4 shows a phased region in chromosome 7.
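For reference, the N50 statistic reported above can be computed as in the following sketch (the function name is ours). Applying the same function to the number of variants per block, instead of block lengths, gives V50.

```python
def n50(sizes):
    """Largest size S such that blocks of size >= S hold at least half the total."""
    ordered = sorted(sizes, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for size in ordered:
        running += size
        if running >= half:
            return size
    raise ValueError("sizes must be non-empty")
```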


Figure 4.4: A haplotype block from real read cloud data. A 100Kbp phased region in chr7 with over 170 variants is shown. The two parental haplotypes are shown in the middle, and the two somatic haplotypes at the top and bottom. The plot at the top represents the genomic distance in base pairs between neighboring SNVs. Of the 4 somatic SNVs that are phased in this region and highlighted by dashed lines, two belong to one somatic haplotype and two to the other. (Legend: germline SNV, somatic SNV, reference allele, alternate allele, mixed alleles.)

4.2.4 LD-based validation of local phasing

The standard approach for measuring the accuracy of inferred phase is by counting

the number of switch errors. The number of switch errors indicates the minimum

number of switches in the inferred phase necessary to make it compatible with the

true phase. To correctly calculate this number, the underlying haplotypes must be


Distance (Kbp)    Pairs in complete LD    Inconsistent pairs
0 - 10            1080500                 17
10 - 20           169311                  20
20 - 30           65060                   3
30 - 40           31159                   2
40 - 50           18466                   0
50 - 60           11836                   0
60 - 70           7496                    0
70 - 80           5065                    1
80 - 90           3544                    0
90+               8660                    0

Table 4.1: Distances between variants in complete LD pairs

known. However, in practice, this information is not available. In this study, we use

linkage disequilibrium (LD) information, which is obtained from population data, to

computationally validate our inferred haplotypes at common SNPs. For this purpose,

we identified pairs of heterozygous SNPs that are in complete linkage disequilibrium

(D’ = 1) in the 1000 Genomes Project and appear in the same haplotype block in our sample.

We hypothesized that at each pair, a disagreement between population phase and

inferred phase is an indication of a potential switch error. Out of the total 1,401,097

analyzed SNP pairs, we identified 43 pairs whose inferred phases were in conflict

with their respective population phase. Table 4.1 shows the binned distribution of

pairwise distances between variants that are in complete LD.

A single switch error can disrupt the inferred phase between several pairs of SNVs,

so we examined the haplotype blocks and calculated how many switches in each

haplotype block are required to correct the identified discrepancies between inferred

and population phases. In total, we identified 11 switch errors, suggesting a phase

accuracy of over 99.9% in the validated regions. We acknowledge the limited scope

of this test due to an inverse correlation between pairwise LD levels and genomic

distance between variants; therefore, we additionally validated the inferred haplotype

blocks using other features of the data as explained in the following sections.
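When a reference phase is available, the number of switch errors defined at the start of this section can be counted as in the sketch below. This illustrates the definition only; in our validation the true haplotypes are unknown and population phase at complete-LD pairs is used instead. The 0/1 encoding of one haplotype and the function name are assumptions.

```python
def count_switch_errors(inferred, truth):
    """Minimum number of switches needed to make `inferred` match `truth`.

    Both arguments list, for each heterozygous site in genomic order, the
    allele (0 or 1) carried by one haplotype. A switch is counted wherever
    agreement with the truth changes between neighboring sites, so a fully
    complemented haplotype (the same phasing, other strand) has no errors.
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    agree = [i == t for i, t in zip(inferred, truth)]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)
```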


4.2.5 Statistical phasing

Linkage disequilibrium across informative sites can also be utilized to connect haplo-

type blocks. LD information about pairs of SNPs that span across haplotype blocks

can help us connect these blocks such that, in the resulting haplotypes, the order of

alleles at these pairs agrees with the population phase. Therefore, our goal in this

step is to statistically join haplotype blocks by exploiting patterns of LD. Several

tools have been developed for statistical phasing; a few recently published ones

(such as SHAPEIT2 [25] and Prism [48]) can take partial phase information from

local haplotypes as their input. We performed Prism’s global phasing stage, which

is an HMM-based model, on the locally phased blocks using the prephased reference

panel of the 1000 Genomes project. Prism suggests a phase between adjacent local

blocks along with a confidence score, which is a measure of the likelihood of a switch

error associated with the suggested phase. We assembled local blocks into longer

haplotypes if the confidence score was above 0.98, which is empirically estimated to

produce about 0.3 to 0.6 long switches per mega-base [48]. With this approach, we

achieved N50 size of 533 Kbp, and V50 of 452 SNPs.

4.2.6 Leveraging aneuploidy information in phasing

In total, we phased 2.76 Gbp of the IDC genome. Within these regions, we

identified 1.94 Gbp of segments in which cancer cells (or a subset of them) display an

uneven number of copies between haplotypes. We exploited the resulting unbalanced

allelic fractions of heterozygous variants in these segments to first detect switch errors

in our inferred phase and then further connect haplotype blocks.

If a haplotype block is within a region with an uneven haplotype copy ratio, the

fraction of shotgun reads supporting the alleles of the less prevalent haplotype should

be equal to the copy number ratio of that haplotype. As a consequence of a phase

switch error, while on one side of the switch point, these fractions of shotgun reads

agree with the copy number ratio of the less prevalent haplotype, on the other side,

they agree with the copy number ratio of the more prevalent haplotype. We therefore

developed an HMM model to detect such precipitous shifts in haplotype allelic ratios


Category    Total length of    Total number of phased    Number of potential
            regions (Gbp)      germline variants         switch errors
R1          1.0915             723082                    703 (0.10%)
R2          0.0431             24818                     8 (0.03%)
R3          0.5250             329800                    104 (0.03%)
R4          0.0941             49428                     9 (0.02%)
R5          0.0931             59339                     16 (0.03%)
R6          0.0033             656                       0 (0%)
R7          0.0057             2441                      2 (0.08%)
R8          0.0089             2741                      2 (0.07%)

Table 4.2: Estimated number of switch errors in CNV regions

to detect switch errors.

We limited our analysis to CNV regions that were larger than 1Mbp in size, and

also excluded two regions in chr19 and chr3, which showed signs of chromothripsis,

to ensure that identified CNV regions cover only one single event and thus allelic

ratios could be used reliably to join haplotype blocks. We used information from the

remaining 1.86 Gbp of the genome to detect switch errors and extend blocks

(Figure 4.5).

Within each highlighted region in Figure 4.5, we identified switch errors in haplo-

type blocks. Table 4.2 reports the number of potential switch errors that we identified

in different regions of the genome. A large fraction of the genome is present at the

average lesser haplotype allelic fraction of 0.42. In these regions, we estimated 0.1%

switch errors (0.64 errors per 1Mbp). In other regions, the rate of errors ranged from

0% to 0.08% (0 to 0.35 errors per 1Mbp). Our estimate of switch errors is very close

to Prism’s estimation of long switch errors for the chosen confidence score (0.98) (Fig-

ure 2 in [48]), which suggests most switch errors were introduced in the statistical

phasing step.
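The rates quoted in this paragraph follow directly from the counts in Table 4.2. The short sketch below reproduces them; the variable and function names are ours, and the numbers are copied from the table.

```python
# (region length in Gbp, phased germline variants, potential switch errors)
table_4_2 = {
    "R1": (1.0915, 723082, 703),
    "R2": (0.0431, 24818, 8),
    "R3": (0.5250, 329800, 104),
    "R4": (0.0941, 49428, 9),
    "R5": (0.0931, 59339, 16),
    "R6": (0.0033, 656, 0),
    "R7": (0.0057, 2441, 2),
    "R8": (0.0089, 2741, 2),
}

def switch_error_rates(length_gbp, n_variants, n_errors):
    """Return (errors as a percentage of variants, errors per Mbp)."""
    percent_of_variants = 100.0 * n_errors / n_variants
    errors_per_mbp = n_errors / (length_gbp * 1000.0)
    return percent_of_variants, errors_per_mbp

rates = {region: switch_error_rates(*row) for region, row in table_4_2.items()}
```

For R1 this gives roughly 0.10% and 0.64 errors per Mbp; among the other regions the highest rate is that of R7, at about 0.08% and 0.35 errors per Mbp.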

After correcting switch errors, we connected haplotype blocks within each CNV

region by joining haplotypes of the same prevalence. With this approach, we achieved

haplotype blocks with an N50 size of 27.67 Mbp. Half of germline variants were in

haplotype blocks with 17,787 or more variants.


Figure 4.5: Haplotype allelic fractions. Germline SNVs are arranged by their genomic order, and the fraction of shotgun reads displaying IDC’s less prevalent haplotype allele is plotted for both normal and IDC samples, averaged in windows of 20 SNVs. As expected, allele ratios in the normal sample are close to 0.5. Large CNVs in IDC sample are visible as sharp drops in the plot. Black lines display CNV borders detected by the HMM. Highlighted regions are CNV regions in which we joined haplotype blocks. CNV regions are colored based on their average allelic ratios. Allelic fractions displayed in this plot were calculated after correcting the identified switch errors. (Panels: chromosome arms 1p through Xq, y-axis from 0.2 to 0.6; legend: Normal, IDC; highlighted regions R1 to R8.)

4.2.7 Final validation test

As a final validation test of the inferred phase between variants, we considered somatic

SNVs at the regions of LOH. These are regions in which cancer cells have lost one


of the two parental chromosomes; the other parental chromosome, however, can be

present in different copy numbers. We limited our analysis to LOH regions in which

all cancer cells share the chromosome loss event. We used haplotype-specific read

counts to identify such regions. In these regions, which constituted about 20% of

the genome, all somatic SNVs should belong to the retained chromosome and not

the deleted copy. Since we did not use this information in previous phasing steps, it

serves as an independent validation test whereby we can estimate the phasing error

of somatic SNVs. We observed that out of the 308 somatic SNVs that were phased

in these regions, 304 of them were assigned to the correct haplotype.

4.3 Methods

4.3.1 Processing of samples and sequencing

Our workflow began with selecting a core of grade 3, ERBB2 amplified, estrogen

receptor (ER) negative IDC isolated from fresh frozen tissue of a patient and a spec-

imen of the patient’s normal breast tissue as a control sample. We built one Illumina

library for each sample with an average insert size ranging between 300 and 400

bases. We also prepared eight Moleculo long fragment libraries for the IDC sample.

Standard paired-end sequencing (2 x 101) of all libraries was then performed on the

Illumina platform. Sequenced paired-end reads were subsequently mapped to UCSC

hg19 reference genome using BWA-mem [54] with default parameters, and the ‘-M’

option. Only primary alignments were considered in the downstream analysis.

4.3.2 Genotype calling

We used GATK’s Unified Genotyper toolkit (GenomeAnalysisTK-2.8.1) to call single

nucleotide variants in IDC and control samples simultaneously. To minimize the

impact of PCR amplification errors, and other technology-specific errors on genotype

calls, we processed sequence reads from both sets of shotgun and Moleculo libraries

and performed multisample SNV calling on the processed BAM files. We followed

the best practice guideline provided by GATK.


Raw variant calls were then filtered by GATK’s Variant Quality Score Recalibra-

tion (VQSR) tool, and those that passed the filters were used to create high-confidence

sets of germline and somatic calls. We only considered variants that are covered with

at least 10 sequence reads in every sample. We defined germline variants as ones that

were called in the two sets of IDC libraries and the control sample. Somatic SNVs

were defined as ones that were not called in normal sample, and no sequence reads

in normal sample harbored the alternate allele. We only considered somatic variants

that were called by GATK in at least one set of IDC libraries (shotgun or Moleculo),

and their variant alleles were supported by at least one sequence read in the other

set.

We selected an SNV as an informative heterozygous marker for the downstream

phasing steps when it passed three conditions: 1. It had a lesser allele fraction of at

least 0.25 in the control sample, 2. At least one read cloud supported each allele of

the variant, 3. Not more than 25% of overlapping read clouds had a mixture of reads

showing evidence for both reference and alternate alleles.
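The three conditions can be expressed as a single predicate, sketched below. The argument names are hypothetical, and we assume here that "overlapping read clouds" means the clouds with at least one read at the site, counted as reference-only, alternate-only, or mixed.

```python
def is_informative_marker(control_laf, clouds_ref, clouds_alt, clouds_mixed):
    """Apply the three marker-selection conditions to one SNV.

    control_laf  -- lesser allele fraction in the control sample
    clouds_ref   -- read clouds supporting only the reference allele
    clouds_alt   -- read clouds supporting only the alternate allele
    clouds_mixed -- read clouds with reads for both alleles
    """
    total = clouds_ref + clouds_alt + clouds_mixed
    if total == 0:
        return False
    return (
        control_laf >= 0.25                 # condition 1
        and clouds_ref >= 1                 # condition 2: each allele is
        and clouds_alt >= 1                 # supported by at least one cloud
        and clouds_mixed / total <= 0.25    # condition 3
    )
```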

4.3.3 Constructing read clouds from sequence reads

Within each Moleculo well, long fragments are sequenced with short read sequenc-

ing technology. Sequence reads from a long fragment form an island of reads when

mapped to the reference genome. We reconstructed long fragments by identifying

such islands of shotgun reads in each well and grouping reads in each island into a

read cloud. Since the fraction of a haploid genome covered in a well is very low, the

chance for two fragments of opposite haplotypes to collide in the same well is very

low. Therefore, short reads in nearly all read clouds have originated from only one

fragment.

We excluded PCR duplicate reads, non-primary alignments and reads with map-

ping quality less than 10 from read clouds. Read clouds were then passed through a

quality control process in which the internal coverage of read clouds and their total

lengths were assessed. Read clouds with evidence of both reference and alternate al-

leles in at least two consecutive heterozygous variants were also excluded from further


analysis.
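As a rough sketch of the island-based grouping described in this section, the following groups the filtered reads of each well into read clouds. The `Read` record and the `max_gap` distance used to decide when one island ends are assumptions for illustration; the text does not state the distance used, and the coverage and length quality control is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """Hypothetical aligned-read record; field names are illustrative only."""
    well: str
    chrom: str
    start: int
    end: int
    mapq: int
    is_duplicate: bool = False
    is_primary: bool = True


def build_read_clouds(reads: List[Read],
                      max_gap: int = 5000,   # assumed island gap, not from the text
                      min_mapq: int = 10) -> List[List[Read]]:
    # Drop PCR duplicates, non-primary alignments, and low mapping quality reads.
    kept = [r for r in reads
            if not r.is_duplicate and r.is_primary and r.mapq >= min_mapq]
    kept.sort(key=lambda r: (r.well, r.chrom, r.start))

    clouds: List[List[Read]] = []
    cloud_end = 0
    for r in kept:
        same_island = (bool(clouds)
                       and clouds[-1][-1].well == r.well
                       and clouds[-1][-1].chrom == r.chrom
                       and r.start - cloud_end <= max_gap)
        if same_island:
            clouds[-1].append(r)
            cloud_end = max(cloud_end, r.end)
        else:
            # A new well, chromosome, or a large gap starts a new read cloud.
            clouds.append([r])
            cloud_end = r.end
    return clouds
```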

4.3.4 Building variant blocks

Since heterozygous variants are not uniformly distributed throughout the genome,

occasionally the distance between two neighboring variants is larger than the size

of Moleculo fragments. As a result, phasing the genome by assembling read clouds

produces a set of disjoint haplotype blocks. The variants in these blocks can be

phased separately and in parallel because their phase is independent of each other.

We denote these groups of variants that can be phased together as variant blocks.

To obtain the set of variant blocks, we created a weighted undirected graph. In

this graph, nodes represented SNVs and edges captured their connectivity in the read

clouds. More specifically, two nodes were connected with an edge if they met two

conditions: 1) at least one read cloud covered both variants with sequence reads, 2)

the two variants were adjacent in that read cloud. The weight of an edge was set to

be the number of read clouds that met these two conditions. In such a graph model,

there will be no path between two nodes if no read cloud covers both corresponding

variants. Therefore, each connected component of the graph defines one variant block.
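The graph construction above can be sketched with a union-find structure. The input format (one list of covered variant positions per read cloud) and the function names are assumptions for this example, and "adjacent" is taken to mean consecutive among the variants covered by a read cloud.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_variant_graph(cloud_variants: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Edge weights: number of read clouds in which two variants are adjacent."""
    weights: Counter = Counter()
    for positions in cloud_variants:
        ordered = sorted(set(positions))
        for a, b in zip(ordered, ordered[1:]):
            weights[(a, b)] += 1
    return dict(weights)


def variant_blocks(weights: Dict[Tuple[int, int], int]) -> List[List[int]]:
    """Connected components of the variant graph, one per variant block."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in weights:
        parent[find(a)] = find(b)

    blocks = defaultdict(list)
    for v in list(parent):
        blocks[find(v)].append(v)
    return sorted(sorted(b) for b in blocks.values())
```

Each returned block can then be phased independently of the others.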

We also utilized the edge weights in this graph to assess the connectivity level of

each connected component. The value of a minimum cut in a component indicates

the minimum number of read clouds that can be removed to break the connectivity of

the component. This number indicates a confidence level for the inferred haplotype

because a higher connectivity level provides more information for phasing. If two

variants are connected by very few read clouds, sequencing or alignment errors can

cause an error in their inferred phase. A connected component can be broken into

smaller subparts by choice if its minimum cut value is less than a specified threshold.
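One way to obtain the minimum cut value of a connected component is the Stoer-Wagner algorithm. The text does not say which algorithm was used, so the sketch below is an assumption; it takes a node list and the edge-weight dictionary of a single component.

```python
from typing import Dict, Hashable, Iterable, Tuple


def stoer_wagner_min_cut(nodes: Iterable[Hashable],
                         weights: Dict[Tuple[Hashable, Hashable], float]) -> float:
    """Value of a global minimum cut of a weighted undirected graph."""
    adj = {v: {} for v in nodes}
    for (a, b), w in weights.items():
        adj[a][b] = adj[a].get(b, 0) + w
        adj[b][a] = adj[b].get(a, 0) + w

    best = float("inf")
    while len(adj) > 1:
        # Minimum cut phase: repeatedly add the most tightly connected node.
        start = next(iter(adj))
        added = [start]
        conn = {v: adj[start].get(v, 0) for v in adj if v != start}
        while conn:
            nxt = max(conn, key=conn.get)
            cut_of_phase = conn.pop(nxt)
            added.append(nxt)
            for v, w in adj[nxt].items():
                if v in conn:
                    conn[v] += w
        s, t = added[-2], added[-1]
        best = min(best, cut_of_phase)

        # Merge the last node t into s before the next phase.
        for v, w in adj[t].items():
            if v != s:
                adj[s][v] = adj[s].get(v, 0) + w
                adj[v][s] = adj[v].get(s, 0) + w
            del adj[v][t]
        del adj[t]
    return best
```

A component whose cut value falls below a chosen threshold could then be split along that cut.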

4.3.5 Local phase

We represent the input shotgun sequence reads $\{r_1, \ldots, r_n\}$ from Moleculo data by two $n \times m$ matrices $R$ and $Q$, whose columns correspond to positions in the genome and whose rows correspond to shotgun sequence reads. The base-call of a read $r_i$ at the genomic position $j$ is stored in $R_i^j$, which can take one of four values $\{0, 1, 2, -\}$, representing the reference allele, the alternate allele, a different base-call, or a gap, respectively. At positions of the genome where GATK did not call an SNV, any base-call other than the reference allele is denoted by 2. Positions that lie outside the boundaries of the aligned reads, as well as deletions, are marked as '-'. Insertions are not captured in the matrix. $Q_i^j$ is defined to be the maximum of $r_i$'s mapping error probability and its probability of an incorrect base-call at $j$. We let $L$ be the genomic positions of heterozygous variants in a variant block defined in the previous section. We also let $C$ be the set of $K$ read clouds $\{c_1, \ldots, c_K\}$ covering at least two heterozygous variants in $L$. We define $c_i = \{j \mid r_j \text{ is in the } i\text{'th read cloud}\}$ to be a collection of shotgun reads. We first infer the phase between heterozygous variants in $L$ by optimizing the assignment of read clouds in $C$ to two parental haplotypes $H = \{h_1, h_2\}$. The state of $h_1$ and $h_2$ is assumed to be hidden. We use the variable $\alpha(c_i)$ to denote the assignment of $c_i$ to either $h_1$ or $h_2$. We model the probability $P(h_q^l = \text{alt})$ that haplotype $q$, namely $h_q$, presents the alternate allele at location $l$ using the binomial distribution. The distribution is parameterized by $\theta_q^l$, which corresponds to haplotype $q$ at position $l$.

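Given the above assumptions, the likelihood of our reads $R$ conditioned on $\theta$ is given by:

\[
P(R \mid \theta) = \prod_{l \in L} \; \prod_{c_k \in C} \; \prod_{i \in c_k} P\left(R_i^l \mid \theta_{\alpha(c_k)}^l\right) \tag{4.1}
\]

where the probability at a single polymorphic location, $P(R_i^l \mid \theta_{\alpha(c_k)}^l)$, is given by:

\[
P\left(R_i^l \mid \theta_{\alpha(c_k)}^l\right) =
\begin{cases}
\theta_{\alpha(c_k)}^l \, Q_i^l / 3 + \left(1 - \theta_{\alpha(c_k)}^l\right)\left(1 - Q_i^l\right) & R_i^l = 0 \\
\left(1 - \theta_{\alpha(c_k)}^l\right) Q_i^l / 3 + \theta_{\alpha(c_k)}^l \left(1 - Q_i^l\right) & R_i^l = 1 \\
2 Q_i^l / 3 & R_i^l = 2
\end{cases}
\tag{4.2}
\]

We set $P(- \mid \theta)$ to 1.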
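Equations 4.1 and 4.2 can be evaluated as in the following sketch. The data layout (per-read dictionaries keyed by position, a dictionary of read clouds, and a 0/1 haplotype index) is an assumption for illustration, as is working in log space with a small floor to avoid taking the logarithm of zero.

```python
import math
from typing import Dict, List, Sequence, Tuple

GAP = "-"


def read_allele_probability(base, theta: float, q: float) -> float:
    """Equation 4.2: probability of one base-call given theta and error rate q."""
    if base == GAP:
        return 1.0                       # gaps contribute a factor of 1
    if base == 0:                        # reference allele observed
        return theta * q / 3 + (1 - theta) * (1 - q)
    if base == 1:                        # alternate allele observed
        return (1 - theta) * q / 3 + theta * (1 - q)
    if base == 2:                        # some other base observed
        return 2 * q / 3
    raise ValueError(f"unexpected base-call: {base!r}")


def log_likelihood(R: Sequence[Dict[int, object]],
                   Q: Sequence[Dict[int, float]],
                   clouds: Dict[str, List[int]],
                   assignment: Dict[str, int],
                   theta: Dict[int, Tuple[float, float]],
                   positions: Sequence[int]) -> float:
    """Logarithm of Equation 4.1. Reads that do not cover a position count as gaps."""
    total = 0.0
    for l in positions:
        for k, read_ids in clouds.items():
            th = theta[l][assignment[k]]
            for i in read_ids:
                p = read_allele_probability(R[i].get(l, GAP), th, Q[i].get(l, 0.0))
                total += math.log(max(p, 1e-300))
    return total
```

For any fixed $\theta$ and $Q$, the three non-gap cases sum to 1, which is a quick sanity check on the reconstruction of Equation 4.2.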

We wish to optimize $\theta$ such that it maximizes $P(R \mid \theta)$, which we assume to be proportional to $P(\theta \mid R)$ under the assumption of a uniform prior $P(\theta)$. Our inference procedure begins with a random assignment of all read clouds to the two haplotypes. The model transitions within its state space, which is the set of all possible haplotype assignments of read clouds, in a Monte Carlo simulated annealing approach. Namely, for a given assignment $\alpha$, once reads are assigned to a haplotype, the parameter $\theta$ for each haplotype is calculated based on the fraction of read clouds assigned to the haplotype exhibiting the alternate allele. A proposal for a new assignment $\alpha'$ and corresponding $\theta'$ is derived from one of three moves: (a) cloud reassignment, (b) cloud split, and (c) phase switching. The corresponding moves are outlined in the sections below.
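The per-haplotype parameter update described above can be sketched as follows. The input format (one consensus allele per read cloud and position) and the fallback value of 0.5 for a haplotype with no covering read cloud are assumptions made for this example.

```python
from typing import Dict, List, Sequence


def estimate_theta(cloud_alleles: Dict[str, Dict[int, int]],
                   assignment: Dict[str, int],
                   positions: Sequence[int]) -> Dict[int, List[float]]:
    """For each position, the fraction of read clouds on each haplotype
    (index 0 for h1, 1 for h2) that exhibit the alternate allele (coded 1)."""
    theta: Dict[int, List[float]] = {}
    for l in positions:
        alt = [0, 0]
        tot = [0, 0]
        for cloud, alleles in cloud_alleles.items():
            if l in alleles:
                h = assignment[cloud]
                tot[h] += 1
                alt[h] += alleles[l]
        # Assumption: an uncovered haplotype gets the uninformative value 0.5.
        theta[l] = [alt[h] / tot[h] if tot[h] else 0.5 for h in (0, 1)]
    return theta
```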

After each move, the acceptance probability of the visited state is calculated using $P(s, s', T)$, which depends on $s = P(R \mid \theta)$, $s' = P(R \mid \theta')$, and a time-varying temperature parameter $T$:

\[
P(s, s', T) =
\begin{cases}
1 & s' > s \\
\exp\left((s' - s)/T\right) & s' \le s
\end{cases}
\tag{4.3}
\]
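The acceptance rule of Equation 4.3 is short enough to state directly in code. In this sketch the scores are passed in as plain numbers; using log-likelihoods for `s` and `s_new` is an assumption made here for numerical convenience, and the cooling schedule for the temperature is not specified in the text and is left to the caller.

```python
import math
import random


def acceptance_probability(s: float, s_new: float, temperature: float) -> float:
    """Equation 4.3: always accept improvements, otherwise accept with
    probability exp((s_new - s) / T)."""
    if s_new > s:
        return 1.0
    return math.exp((s_new - s) / temperature)


def accept_move(s: float, s_new: float, temperature: float, rng=random) -> bool:
    """Draw a uniform random number and compare it with the acceptance probability."""
    return rng.random() < acceptance_probability(s, s_new, temperature)
```

As the temperature decreases, moves that lower the score are accepted less often, so the search gradually becomes greedy.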

1. Cloud reassignment.

The underlying idea behind this move is that read clouds from each haplotype

should agree with each other on their overlapping germline variants and that

over several iterations similar read clouds cluster in the correct haplotype and

attract other matching clouds. When this move is selected, we select a read

cloud $c_i$ randomly and swap its haplotype assignment $\alpha(c_i)$. We accept the move

probabilistically according to $P(s, s', T)$. If the state transition is accepted, we

update $\alpha$ and $\theta$ accordingly.

2. Cloud split.

The second move is introduced to address chimeric read clouds, which result

from fragment collisions in the same well. Chimeric read clouds can disturb

the phase between variants, and lower the likelihood score. Regardless of the

haplotype assignment, such read clouds cannot match the haplotype alleles at

a subset of covered variants. If two DNA fragments from opposite haplotypes

overlapped at multiple heterozygous SNVs within a well, the resulting read cloud


would have evidence for both reference and alternate alleles at those variants.

These read clouds are not informative for phasing because the true order of

alleles at the overlapped sites cannot be inferred, and thus these read clouds

are filtered at the read cloud construction step. However, if the two fragments

overlapped at only a few variants, or if they overlapped at no variant but were

mistakenly combined into one read cloud due to their close proximity, the order

of alleles can be recovered from the chimeric read cloud. The order of alleles on

the underlying true fragments can be recovered by breaking the read cloud at

the fusion point.

In our model, each read cloud has a binary flag which indicates whether it is

chimeric or not. This flag is updated dynamically during the execution of the

inference procedure. In this move, we randomly select a read cloud. If the read

cloud is already flagged as chimeric, we reunite its components and reset its flag.

If it is not flagged as such, we break it into two parts, assign the two resulting

read clouds to opposite haplotypes, and flag them as chimeric. We calculate the likelihood score under the new $\alpha'$ assignments, and accept the proposed configuration according to $P(s, s', T)$.

Breaking a read cloud $c_i$ at break point $\beta$ means that we assign all sequence reads in $c_i$ covering germline variants up to $\beta$ to a read cloud $c_i^{\beta}(1)$, and all sequence reads starting after $\beta$ to a read cloud $c_i^{\beta}(2)$. We choose the break point by finding a genomic position that maximizes the following score:

\[
\beta = \arg\max_{\beta} \, \max\left( P\left(c_i^{\beta}(1) \mid \theta_1\right) P\left(c_i^{\beta}(2) \mid \theta_2\right), \; P\left(c_i^{\beta}(1) \mid \theta_2\right) P\left(c_i^{\beta}(2) \mid \theta_1\right) \right) \tag{4.4}
\]

After the break point $\beta$ is selected, $\alpha(c_i^{\beta}(1))$ and $\alpha(c_i^{\beta}(2))$ are determined based on which assignment maximizes the inner max term in the equation above.

3. Switch unwinding.

As the inference method proceeds, clusters of matching read clouds in each

haplotype form and grow. However, it is possible for a cluster to start forming in

the wrong haplotype and introduce a switch error in the resulting phase. If there


is a switch error between two variants, read clouds spanning the switch point

cannot be assigned to either haplotype confidently. This is because on one side

of the switch point, these read clouds match the alleles of one haplotype, while

on the other side they match the alleles of the opposite haplotype. Therefore,

read clouds originated from opposite haplotypes do not segregate correctly in

that region. We rely on two signatures to identify the switch points. The first

signature is that $\theta_k^l$ is less than 1 and close to 0.5 in both haplotypes at the

variants around the switch point. The second signature is that a large fraction

of read clouds are flagged as chimeric in that region. In this move, we first choose

a heterozygous variant $l$ in a weighted random manner such that the weight is

inversely correlated with $\max(\theta_1^l, \theta_2^l)$ and is also correlated with the fraction of

read clouds spanning $l$ that are flagged as chimeric. Second, we reassign all read

clouds starting after $x_l$ to the opposite haplotype. Third, we reassign all read

clouds spanning $x_l$ to the most likely haplotype, and reunite the components of

read clouds flagged as chimeric. Finally, we calculate the likelihood score and

accept the move according to the update rule.
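The break-point search used in the cloud split move (item 2 above) can be sketched as an exhaustive scan over candidate positions. The representation of a read cloud as a sorted list of `(position, allele, error)` observations is an assumption for this example, and only reference and alternate base-calls are scored.

```python
import math
from typing import Dict, List, Optional, Tuple

Observation = Tuple[int, int, float]  # (variant position, allele 0/1, error probability)


def allele_prob(allele: int, theta: float, q: float) -> float:
    """Reference and alternate cases of the per-read probability."""
    if allele == 1:
        return (1 - theta) * q / 3 + theta * (1 - q)
    return theta * q / 3 + (1 - theta) * (1 - q)


def part_log_prob(obs: List[Observation],
                  theta: Dict[int, Tuple[float, float]], hap: int) -> float:
    """Log-probability of one part of a read cloud under haplotype hap."""
    return sum(math.log(max(allele_prob(a, theta[pos][hap], q), 1e-300))
               for pos, a, q in obs)


def best_break_point(obs: List[Observation],
                     theta: Dict[int, Tuple[float, float]]
                     ) -> Optional[Tuple[int, int, int]]:
    """Return (break position, haplotype of left part, haplotype of right part)
    maximizing the split score, or None if the cloud covers a single variant."""
    best = None
    positions = sorted({pos for pos, _, _ in obs})
    for beta in positions[:-1]:
        left = [o for o in obs if o[0] <= beta]
        right = [o for o in obs if o[0] > beta]
        for h_left, h_right in ((0, 1), (1, 0)):
            score = part_log_prob(left, theta, h_left) + part_log_prob(right, theta, h_right)
            if best is None or score > best[0]:
                best = (score, beta, h_left, h_right)
    return best[1:] if best else None
```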

4.3.6 Constructing somatic haplotypes

After phasing germline heterozygous variants, we proceeded by inferring the phase of

somatic SNVs. The goal here was to identify which parental haplotype was mutated

at each site in cancer cells. At each somatic variant, covering read clouds can harbor

either the reference allele or the variant allele depending on which copy of the chro-

mosome they originated from, and whether they were from cancer or normal cells.

Only read clouds that emanated from cancer cells and came from the mutated chromosome

should harbor the variant allele. Since read clouds are already assigned to parental

chromosomes in the previous step, we can infer which chromosome copy was mutated

at each somatic variant.

In this step, we created two somatic haplotypes, each linked to a parental hap-

lotype. At each somatic SNV, we assigned read clouds that harbored the alternate

allele to a somatic haplotype. The somatic haplotype that a read cloud was assigned


to was determined based on which parental haplotype it was in previously.
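A minimal sketch of this assignment step follows. The input dictionaries and the majority vote used when read clouds with the alternate allele fall on both parental haplotypes are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional


def assign_somatic_haplotypes(alt_clouds_by_snv: Dict[str, List[str]],
                              cloud_haplotype: Dict[str, str]
                              ) -> Dict[str, Optional[str]]:
    """For each somatic SNV, report the parental haplotype that holds most of the
    read clouds carrying the alternate allele (None if no phased cloud carries it)."""
    result: Dict[str, Optional[str]] = {}
    for snv, clouds in alt_clouds_by_snv.items():
        counts = Counter(cloud_haplotype[c] for c in clouds if c in cloud_haplotype)
        result[snv] = counts.most_common(1)[0][0] if counts else None
    return result
```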

4.3.7 LD-based validation

Since we did not leverage population information at the local phase step, we can use

linkage disequilibrium information as an independent test for the accuracy of inferred

germline haplotypes. For this purpose, we used haplotype information from 1000

Genomes project (Phase 1 integrated data set, version 3). In each phase block, we

examined pairwise LD between all SNPs by calculating the Pearson correlation. We

counted the total number of SNP pairs in all blocks that are in complete LD. We

then counted the number of SNP pairs at which the inferred phase disagrees with the

population phase. If more than one SNP pair with discordant phases were present in

a variant block, we calculated the minimum number of switches required to make the

inferred phase compatible with the LD phase.
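The pairwise check can be sketched as below: the Pearson correlation is computed between two SNP columns of a population haplotype panel, and the sign of the correlation is compared with the inferred phase. The 0/1 allele coding and the function names are assumptions for this example; counting the minimum number of switches per block is not shown.

```python
import math
from typing import Sequence, Tuple


def pearson_r(x: Sequence[int], y: Sequence[int]) -> float:
    """Pearson correlation between two SNPs across population haplotypes (0/1 coded)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov / math.sqrt(vx * vy)


def phase_agrees_with_ld(h1_alleles: Tuple[int, int], r: float) -> bool:
    """h1_alleles holds the inferred alleles of the two SNPs on haplotype h1.
    Positive correlation implies the alternate alleles travel together (cis)."""
    cis = h1_alleles[0] == h1_alleles[1]
    return cis == (r > 0)
```

SNP pairs in complete LD correspond to $|r| = 1$ in this coding.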

4.3.8 Statistical phase

To join haplotype blocks using LD information, we first identified haplotype blocks

covering at least one germline SNP present in 1000 Genomes project database. We

then ran the global stage of Prism [48] with its default parameters (-K 75) on these

blocks by using population data from 1000 Genomes project (Phase 1 integrated

data set, version 3). Prism outputs the relative phase between consecutive blocks

and assigns a confidence score to the suggested phase. We only joined blocks if the

reported confidence score was above 0.98.

4.3.9 Detecting somatic CNV regions

To detect large CNV regions, we utilized lesser allele fraction (LAF) values at germline

heterozygous SNPs, segmented by an HMM method. More specifically, we first cal-

culated LAF values at germline heterozygous SNPs in both normal and IDC samples,

by using sequence reads from shotgun libraries. We then averaged the LAF values

in non-overlapping windows of 20 SNPs and plotted the averaged values. We then


identified the steep drops of LAF values by developing an HMM that is embedded

in an EM procedure. These drops correspond to CNV events with uneven haplotype

copy numbers. Different copy number variants may produce different LAF values

depending on how many copies are gained or lost by each parental chromosome, and

what portion of cancer cells share this event. We expect to observe the same LAF

values at regions of a genome that are present in equal copy numbers if, and only if,

the same fraction of cancer cells share the CNV at these regions.

We denote the averaged LAF values of windows by $L = \{l_1, \ldots, l_n\}$. We also denote the states of our HMM model by $S = \{s_1, \ldots, s_k\}$. Our HMM model had 11 states, and each state corresponded to a distinct expected LAF value. The number of states was determined by plotting the LAF values. Transition probabilities $P(s_{i+1} \mid s_i)$ were set to 0.5 for staying in the same state and to 0.5/10 = 0.05 for transitioning to a different state. The first window in each chromosome arm was equally likely to be in any of the states. Emission probabilities $P(l_i \mid s_j)$ were modeled using a Gaussian distribution with mean $\mu_j$ and a standard deviation of 0.1.

We used an EM approach to learn $\mu$ values. We started with an initial set of values $\mu_1, \ldots, \mu_{11} = \{0.45, 0.42, 0.39, 0.37, 0.34, 0.31, 0.27, 0.24, 0.21, 0.18, 0.15\}$. At the E step, we

performed the HMM method using current µ values to find the optimal assignment of

LAF values to the states. At the M step, we set µi to the average LAF of all windows

that were assigned to si. We repeated this procedure until convergence.
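The procedure above can be sketched as Viterbi decoding alternated with mean updates. Interpreting "the optimal assignment" as the Viterbi path is an assumption, as are the function names and the convergence tolerance; the transition, emission, and initial-state settings follow the text.

```python
import math
from typing import List, Sequence, Tuple


def viterbi_gaussian(values: Sequence[float], means: Sequence[float],
                     sd: float = 0.1, stay: float = 0.5) -> List[int]:
    """Most likely state path for Gaussian emissions and uniform start probabilities."""
    k = len(means)
    log_stay = math.log(stay)
    log_move = math.log((1 - stay) / (k - 1))

    def log_emit(x: float, mu: float) -> float:
        return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

    score = [math.log(1.0 / k) + log_emit(values[0], mu) for mu in means]
    back: List[List[int]] = []
    for x in values[1:]:
        new_score, ptr = [], []
        for j in range(k):
            cands = [score[i] + (log_stay if i == j else log_move) for i in range(k)]
            i_best = max(range(k), key=cands.__getitem__)
            ptr.append(i_best)
            new_score.append(cands[i_best] + log_emit(x, means[j]))
        score = new_score
        back.append(ptr)

    state = max(range(k), key=score.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def fit_state_means(values: Sequence[float], means: Sequence[float],
                    max_iter: int = 50, tol: float = 1e-6) -> Tuple[List[float], List[int]]:
    """Alternate decoding (E step) and mean updates (M step) until the means settle."""
    means = list(means)
    path: List[int] = []
    for _ in range(max_iter):
        path = viterbi_gaussian(values, means)
        new_means = []
        for j, mu in enumerate(means):
            assigned = [x for x, s in zip(values, path) if s == j]
            new_means.append(sum(assigned) / len(assigned) if assigned else mu)
        converged = max(abs(a - b) for a, b in zip(means, new_means)) < tol
        means = new_means
        if converged:
            break
    return means, path
```

States that receive no windows keep their previous mean in this sketch, which is one possible way to handle empty states.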

After the convergence of the EM algorithm, we grouped consecutive windows in

each chromosome arm that were in the same state. Groups of windows that were

assigned to s1 were considered as diploid regions of the genome, or as regions with

even haplotype-specific copy numbers. This choice was made because LAF windows

of the normal sample were also assigned to this state. Other groups were considered

as CNVs. Since we were mainly interested in large CNV regions, we only considered

regions with lengths more than 1Mbp for the downstream analysis. None of the CNV

regions assigned to s11 (with µ11 = 0.13) were larger than 1 Mbp. Therefore, this

state was not included in the downstream analysis. We also did not include chr19, and

a region in chr3q in the downstream analysis as they show signs of chromothripsis.

The only windows assigned to state s7 were in chr19; therefore, this state was also


excluded from the downstream analysis.

4.3.10 Leveraging somatic CNVs for detecting switch errors

and connecting haplotypes

We exploited LAF values at large (greater than or equal to 1Mbp) CNV regions to

detect and estimate the number of switch errors in the haplotype blocks, which were

introduced by either local or statistical phasing. For this purpose, we used groups

of windows, $C$, that were produced in the previous section. We denote the HMM state that windows in $c_i$ are assigned to by $\sigma(c_i)$. We assume that if a haplotype block is fully contained within a genomic region covered by $c_i$, the expected fraction of reads supporting the allele in the less prevalent haplotype is $\mu_{\sigma(c_i)}$. In the presence of a switch error in the haplotype block at the genomic position $l_i$, the less prevalent haplotype on one side of $l_i$ is $h_1$, while $h_2$ is the less prevalent haplotype on the other side. Thus, the expected fraction of reads supporting the $h_1$ allele at SNPs on one side of $l_i$ is $\mu_{\sigma(c_i)}$, while on the other side, the expected fraction of reads supporting the $h_2$ allele is $\mu_{\sigma(c_i)}$.

To detect potential switch errors in each haplotype block and to further connect blocks, we devised a hidden Markov model with two states $S = \{S_1, S_2\}$. In this model, observations are the sequence of haplotype read counts at heterozygous positions covered by a haplotype block. Haplotypes $h_1$ and $h_2$ were assigned as the less prevalent haplotype in states $S_1$ and $S_2$, respectively. In the case of no switch error, all heterozygous variants in a haplotype block should be assigned to the same state. In this model, emission probabilities specified the likelihood of observing the fraction of reads supporting the $h_1$ (or $h_2$) allele in state $S_1$ (or $S_2$) at each heterozygous variant $l_j$. Emission probabilities were modeled with a binomial distribution with parameters $p = \mu_{\sigma(c_i)}$ and $n$, the total shotgun read coverage at $l_j$. Transition probabilities were set to 0.999 for staying in the same state and 0.001 for transitioning to the other state. A state transition within a haplotype block was considered as a switch error.

We first corrected the switch errors identified by this model, and then labeled the haplotypes within a haplotype block as “less prevalent” and “more prevalent” depending on which HMM state the variants of the haplotype block were assigned to. We then proceeded to connect haplotype blocks in the genomic region covered by each $c_i$ (with $\sigma(c_i) > 1$), such that haplotypes of the same frequency were joined together.
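The two-state model of this section can be sketched with binomial emissions and Viterbi decoding. The equal initial state probabilities, the use of Viterbi decoding, and the input format (reads supporting the h1 allele and total coverage per heterozygous variant) are assumptions for illustration; the transition probabilities follow the text.

```python
import math
from typing import List, Sequence, Tuple


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Logarithm of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def detect_switches(h1_counts: Sequence[int], totals: Sequence[int],
                    mu: float, stay: float = 0.999) -> Tuple[List[int], List[int]]:
    """State 0: h1 is the less prevalent haplotype; state 1: h2 is.
    Returns the decoded state path and the indices where the state changes."""
    log_stay, log_move = math.log(stay), math.log(1 - stay)

    def emit(k: int, n: int) -> Tuple[float, float]:
        # In state 0 the h1 read count follows Binomial(n, mu);
        # in state 1 the h2 read count (n - k) does.
        return log_binom_pmf(k, n, mu), log_binom_pmf(n - k, n, mu)

    score = [math.log(0.5) + e for e in emit(h1_counts[0], totals[0])]
    back: List[List[int]] = []
    for k, n in zip(h1_counts[1:], totals[1:]):
        e = emit(k, n)
        new_score, ptr = [], []
        for j in (0, 1):
            cands = [score[i] + (log_stay if i == j else log_move) for i in (0, 1)]
            i_best = 0 if cands[0] >= cands[1] else 1
            ptr.append(i_best)
            new_score.append(cands[i_best] + e[j])
        score = new_score
        back.append(ptr)

    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    switches = [i for i in range(1, len(path)) if path[i] != path[i - 1]]
    return path, switches
```

Each reported index marks the first variant after a putative switch error, so the phase of all variants from that index onward would be flipped to correct it.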

4.4 Discussion

The study of cancer genomes at the haplotype resolution is still an uncharted area.

In this work we demonstrated that recent synthetic long sequencing technologies can

be effectively utilized to produce highly accurate and long-range phase information

between germline and somatic SNVs. As a proof of concept, we applied our frame-

work to reconstruct the haplotypes of an invasive breast cancer sample, which was

sequenced through the Moleculo protocol.

The study of inherited susceptibility to cancer has been an active area of research

in the past few decades. As part of these efforts, a growing list of germline mutations

associated with developing cancer has been revealed (e.g. [17, 89, 32, 57]). On the

basis of Knudson’s two-hit model of cancer, for a tumor to develop, an individual who

has inherited a mutated copy of a tumor suppressor gene should also develop a disrup-

tive somatic mutation on the other copy of the gene. By producing megabase-length

haplotypes, we provide the opportunity to study such interplays between somatic and

germline mutations in coding regions of a cancer genome.

Although we were able to correctly infer the parental chromosome-of-origin of so-

matic SNVs, our capability to reconstruct somatic haplotypes in heterogeneous cancer

samples is inherently limited by the length of input DNA fragments. In a heteroge-

neous tumor, depending on which mutations are harbored by each subpopulation of

tumor cells, somatic SNVs that mutated on the same parental chromosome can in

fact be distributed among different haplotypes. The correct combination of somatic

SNVs on these haplotypes can only be inferred if DNA fragments are long enough to

encompass at least two somatic SNVs. However, the infrequent occurrence of somatic

mutations, combined with the relatively short length of synthetic long reads compared


to the distance between somatic mutations, makes this task infeasible. Future ad-

vances in third generation sequencing technologies will potentially help overcome this

limitation by producing longer fragments. Upon the availability of such data, our

local phase method can be extended to build more than two somatic haplotypes by

leveraging connected somatic SNVs.

Chapter 5

Applications of haplotype phasing

In this chapter, I discuss some promising applications of haplotype inference. First,

I demonstrate how phase information, coupled with Moleculo read cloud data, can

be utilized to improve the specificity of variant calls. Next, I describe how the local

phase method described in the previous chapter can be employed to identify cryptic

segmental duplications of a genome, and how it can be extended to reconstruct the

underlying haplotypes. Finally, I discuss the key role of haplotype information in

reconstructing phylogenetic trees of tumor samples.

5.1 Enhancing the accuracy of variant calls

As discussed in 2.2, detection of germline and somatic alterations of cancer genomes

has a central role in characterizing cancer samples. However, the accuracy of current

variant calling procedures varies considerably along the genome depending on several

factors such as sequencing errors, alignment errors, and the allele fraction of somatic

variants. The depth of sequencing coverage in a tumor sample and its matched

normal sample further confounds the identification of somatic mutations. Having

an accurate set of variants is a crucial first step in many cancer studies including

the work described in Chapter 3. Therefore, variant calling is typically followed by

rounds of filtering steps and experimental validations. As shown in previous studies,

haplotype information can provide an additional and valuable layer of information to


enhance the accuracy of variant calls [69, 70].

In most cases, sequencing and mapping errors have equal chance of occurrence on

maternal or paternal chromosomes. The occurrence of homozygous somatic muta-

tions, on the other hand, is a particularly rare event. Therefore, one can hypothesize

that the presence of read clouds harboring the variant allele in both haplotypes is

evidence of a false positive somatic call. Another common signature of a false variant

call is a strong presence of mixed read clouds at the variant site. Since a read cloud

originates from a single molecule, it should only support one allele at each genomic

position. We refer to read clouds in which some shotgun reads support the reference

allele at a variant site, while the rest support the variant allele, as mixed read clouds.

The presence of several mixed read clouds at a variant site suggests

systematic sequencing or alignment errors in that region. Figure 5.1 displays read

clouds covering two variants that were called as somatic by GATK in the study de-

scribed in the previous chapter. The first example exhibits the signature of a true

somatic variant: all read clouds in the somatic haplotype have matching alleles with

only one parental haplotype at their encompassing germline variants. The second

example is most likely a false call. There is a high prevalence of mixed read clouds

at this position, and read clouds harboring the variant allele are distributed between

both parental haplotypes. About 5% of somatic variants called in that study had

such signatures of a false call.
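The two false-call signatures discussed above can be combined into a simple check. This is an illustrative sketch only: the function name, the count-based inputs, and the 0.25 cutoff on the mixed read cloud fraction are assumptions rather than values taken from this study.

```python
from typing import List


def false_call_signatures(alt_clouds_h1: int, alt_clouds_h2: int,
                          mixed_clouds: int, total_clouds: int,
                          max_mixed_fraction: float = 0.25) -> List[str]:
    """Return the false-call signatures shown by a somatic variant call.
    An empty list means neither signature was observed."""
    reasons: List[str] = []
    # Signature 1: the variant allele appears on both parental haplotypes.
    if alt_clouds_h1 > 0 and alt_clouds_h2 > 0:
        reasons.append("variant allele present on both haplotypes")
    # Signature 2: many read clouds support both alleles at the site.
    if total_clouds > 0 and mixed_clouds / total_clouds > max_mixed_fraction:
        reasons.append("high fraction of mixed read clouds")
    return reasons
```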

The differentiation of somatic from germline alterations can also benefit from

haplotype information in samples with high levels of normal contamination. To dis-

tinguish a somatic variant from a germline substitution, studies leverage sequence

reads from a matched normal sample. A variant is designated as a somatic candidate

if it is not supported by sequence reads from the normal sample. However, minor

sampling biases in regions of low sequencing coverage in the normal sample can cause

misclassification of germline variations as somatic candidates. In samples with high

levels of normal contamination, such misclassifications can be identified based on the

distribution of read clouds harboring the reference allele. Since all read clouds origi-

nating from normal cells carry the reference allele at a true somatic variant site, these

read clouds should be distributed evenly across both parental haplotypes. The dearth

of read clouds carrying the reference allele in at least one parental haplotype suggests

that the mutation is in reality a germline alteration.

[Figure 5.1 image: panels A and B, each showing read clouds assigned to h1 and h2. Legend: germline SNV, somatic SNV, reference allele, variant allele, mixed alleles.]

Figure 5.1: Two examples of somatic variants called by GATK in the IDC sample. A. A true somatic variant. The somatic haplotype is shown on top. Read clouds in the somatic haplotype agree with h1 on germline alleles. There are no mixed read clouds at this position. B. A likely false variant call. Read clouds with the variant allele are distributed within both copies of the chromosome. A high fraction of read clouds show evidence for both alleles at this position.

Haplotype information can also be exploited to correct genotype errors at ho-

mozygous germline point mutations. A homozygous genotype can be miscalled as a

heterozygous genotype depending on the fraction of reads supporting the true allele.

At a true heterozygous variant site, read clouds with reference and alternate alleles


should be segregated into the two parental haplotypes. Therefore, the support of

the same allele in both haplotypes by the majority of read clouds indicates that the

genotype is in fact homozygous for that allele.

5.2 Identifying and phasing cryptic segmental du-

plications

Segmental duplications (SDs) are large duplicated DNA segments within a haploid

genome with near-identical sequences. The role of segmental duplications in evolution

and their associations with genetic diseases have been well documented in many

studies [22, 5, 45, 31, 76]. In this section, I show how Moleculo read clouds can be

utilized to infer the haplotype of SD copies that are present in a target sample but

are absent in the human reference genome.

If a segmental duplication has more than two copies but only one copy is present in

the reference genome, most sequence reads arising from these copies will be mapped

to the one copy present in the reference. Consequently, single nucleotides that are

unique to each copy can easily be mistaken for heterozygous variants. These incorrect

variant calls, in turn, result in phasing errors because two haplotypes may not be

sufficient to explain the differences between the associated copies. Figure 5.2 depicts

an example of a segmental duplication in chr17 (KCNJ12) that we identified in

the sequenced IDC sample, as part of the study described in the previous chapter.

As the figure shows, read clouds in h2 have different signatures, which suggests that

they originated from different copies. Read clouds assigned to h1, however, exhibit

matching alleles.

We assigned the read clouds in h2 to two haplotypes using the method described

in 4.3.5. The result, which is depicted in Figure 5.3, suggests that three haplotypes

were sufficient to capture unique signatures of read clouds in this region. These

three inferred haplotypes were also reported in [34]. In this study, Genovese et al.

identified three paralogs in this region (KCNJ12, KCNJ17 and KCNJ18), and listed

the paralogous variants on each copy. Our inferred haplotypes were in complete


agreement with the haplotypes reported in that study on the reported variants.

This example shows the capability of our phasing procedure to phase segmental

duplication regions by effectively utilizing Moleculo read clouds, without need for

any additional information. In regions where read clouds cannot be explained with

two haplotypes, we can dynamically increase the number of haplotypes and reassign

read clouds to the new set of haplotypes using the same set of moves as described

in 4.3.5. This approach offers a cost-effective method for the identification and

haplotype inference of segmental duplications. In recent years, multiple methods have

been developed for leveraging long-range sequence information from long-read single

molecule or synthetic long read sequencing technologies to uniquely map sequence

reads to repeated regions of the genome and enable variant calling in these regions [12,

43]. Our approach along with these methods can resolve complex repeated regions of

the genome by phasing the newly identified variants and reconstructing the underlying

copies. Together, these efforts provide a new platform to study the association of

specific haplotypes in these regions with human disease.

5.3 Increasing the resolution of phylogenetic infer-

ence methods

As demonstrated in Chapter 3, phylogenetic trees provide an invaluable framework for

studying the evolution of a genome and for characterizing the heterogeneity within

a tumor. In recent years, several studies have been conducted to infer phylogenetic

relationships between multiple samples extracted from various regions of a single

tumor [36, 65, 93], or from different metastases or tumor regions within a patient

[16, 92, 35, 39]. These studies utilize patterns of variant sharing among samples,

variant allele fractions, and also cellular prevalence of somatic point mutations to

reconstruct tumor lineage trees and to infer the chronological order of somatic events

during cancer progression.

Aneuploidies and copy number variants present a significant challenge in the cor-

rect estimation of cellular prevalences. These types of variants alter prevalent allele


fractions, which are utilized for estimating what portion of cancer cells carry a somatic

mutation. There are existing tools for identifying CNV regions within a genome and

for estimating total copy numbers within a region. However, the cellular prevalence of

a somatic mutation can only be correctly inferred if the copy number of its harboring

parental chromosome is known, which requires haplotype information.

Furthermore, a chromosome loss that occurs later in a cancer’s lifetime buries

important information about somatic mutations that the deleted chromosome once

harbored. As a result of this information loss, the position of some mutation groups

on the lineage tree cannot be confidently inferred. As an example, consider a case in

which two samples share a group of somatic mutations within a genomic region. If a

third sample from the same patient does not share this mutation group but has lost

one copy of the chromosome in that region, then there are two distinct possibilities.

This mutation group may have been shared among all three samples and was lost in

the third sample due to the chromosome deletion. Alternatively, the third sample may

have never carried this mutation group and these mutations occurred after the lineage

path of the third sample diverged from the other two. We can, however, deduce that

the latter hypothesis is correct, if we infer that these somatic point mutations belong

to the retained chromosome in the third sample. Therefore, haplotype assignment of

somatic mutations can resolve some phylogenetic ambiguities.


[Figure 5.2 image: read clouds assigned to h1 and h2.]

Figure 5.2: Identifying a segmental duplication event at KCNJ12. Read clouds overlapping the KCNJ12 gene are assigned to two parental haplotypes using the local phase method described in 4.3.5. It can be seen from the figure that read clouds in h2 arise from more than one genomic segment, suggesting that more than two haplotypes are necessary to separate the read clouds.

[Figure 5.3 image: read clouds assigned to h3, h1, and h2.]

Figure 5.3: Inferring the haplotype sequence of KCNJ12 paralogs. Read clouds in h2 in Figure 5.2 are assigned to h2 and h3. h1, h2, and h3 provide haplotype sequences of KCNJ18, KCNJ17, and KCNJ12, respectively. KCNJ17 is the copy present in the reference genome assembly.

Chapter 6

Conclusion

The contribution of cancer genomics to almost all aspects of cancer research is indis-

putable. Ongoing technological advances are not only increasing the rate at which

cancer sequencing studies can be completed, but are also enabling a continually

higher-resolution view of the somatic changes to cancer genomes. The study of het-

erogeneity in cancer genome evolution is now within reach in light of next generation

sequencing technologies. New advances in long read sequencing technologies are pro-

viding the opportunity to characterize the elaborate interplay between somatically

acquired mutations and inherited alterations. These efforts are transforming our

understanding of tumor biology and are establishing a more detailed insight into the

dynamic evolution of cancer cells.

Chapter 3 of this dissertation provided a new view of breast cancer evolution

by studying the role of early neoplasias in the progression of cancer. This study

demonstrated that phylogenetic trees could be successfully inferred from multiple

lesions within a patient. It also illustrated how somatic point mutations in conjunction

with chromosomal abnormalities could serve as lineage markers to obtain such trees.

The resulting phylogenetic trees showed that, in some patients, early neoplasia cells

have a common ancestor with ductal carcinoma cells and, as such, they can reveal

early events in a cancer’s lifetime. The perspective achieved in this study was shown

to be more comprehensive than the current histological model of cancer progression.

Perhaps the most noteworthy finding in this study was the discovery of elevated

mutation load as well as widespread aneuploidies in all early neoplasia samples that

were genetically related to a ductal carcinoma lesion. This was a consistent pattern

across all early neoplasias and as such may have future diagnostic applications.

Chapter 4 demonstrated the use of long-range sequence information from emerging

synthetic long read sequencing technologies to phase de novo and somatic mutations

in cancer genomes. A new toolset was proposed for phasing somatic and germline

mutations by leveraging sequence data from the Moleculo platform. This work

demonstrated how the unique characteristics of Moleculo data, coupled with somatic

aneuploidies, could be effectively exploited to produce highly accurate and long haplotype

blocks. The accuracy of the obtained haplotype blocks was also measured through

step-by-step independent validations. The resulting long haplotype blocks provide a

unique opportunity to study the interactions between somatic and germline variants

and their combined effects on cancer initiation and progression. Moreover, and as

discussed in Chapter 5, this information empowers a wide range of applications in cancer

research such as enabling more accurate and comprehensive SNV calling, detecting

complex variant types such as structural variations, identifying somatic variants in

repetitive regions of the genome, and obtaining more robust phylogenetic trees.

Taken together, these studies allow for a better understanding of the mutational

landscape of cancer genomes. Haplotype information can result in more robust

phylogenetic trees by providing better-refined sets of variant calls and copy numbers. At

the same time, sequencing multiple lesions from the same patient can also produce

even longer haplotype blocks by leveraging the union of aneuploidies across samples.

Longer haplotype blocks will fully encompass a higher number of genes, which in turn

can give further insights into the biological function of variants and the mechanism

of the disease.

Bibliography

[1] Tarek MA Abdel-Fatah, Desmond G Powe, Zsolt Hodi, Andrew HS Lee, Jorge S

Reis-Filho, and Ian O Ellis. High frequency of coexistence of columnar cell

lesions, lobular neoplasia, and low grade ductal carcinoma in situ with invasive

tubular carcinoma and invasive lobular carcinoma. The American journal of

surgical pathology, 31(3):417–426, 2007.

[2] Andrew Adey, Joshua N Burton, Jacob O Kitzman, Joseph B Hiatt, Alexandra P

Lewis, Beth K Martin, Ruolan Qiu, Choli Lee, and Jay Shendure. The haplotype-

resolved genome and epigenome of the aneuploid hela cancer cell line. Nature,

500(7461):207–211, 2013.

[3] Sasan Amini, Dmitry Pushkarev, Lena Christiansen, Emrah Kostem, Tom

Royce, Casey Turk, Natasha Pignatelli, Andrew Adey, Jacob O Kitzman,

Kandaswamy Vijayan, et al. Haplotype-resolved whole-genome sequencing by

contiguity-preserving transposition and combinatorial indexing. Nature genet-

ics, 2014.

[4] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather

Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight,

Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature

genetics, 25(1):25–29, 2000.

[5] Jorune Balciuniene, Ningping Feng, Kelly Iyadurai, Betsy Hirsch, Lawrence

Charnas, Brent R Bill, Mathew C Easterday, Johan Staaf, LeAnn Oseth, Desiree

Czapansky-Beilman, et al. Recurrent 10q22-q23 deletions: a genomic disorder

on 10q associated with cognitive and behavioral abnormalities. The American

Journal of Human Genetics, 80(5):938–947, 2007.

[6] Shantanu Banerji, Kristian Cibulskis, Claudia Rangel-Escareno, Kristin K

Brown, Scott L Carter, Abbie M Frederick, Michael S Lawrence, Andrey Y

Sivachenko, Carrie Sougnez, Lihua Zou, et al. Sequence analysis of mutations

and translocations across breast cancer subtypes. Nature, 486(7403):405–409,

2012.

[7] Vikas Bansal and Vineet Bafna. Hapcut: an efficient and accurate algorithm for

the haplotype assembly problem. Bioinformatics, 24(16):i153–i159, 2008.

[8] Vikas Bansal, Aaron L Halpern, Nelson Axelrod, and Vineet Bafna. An mcmc

algorithm for haplotype assembly from whole-genome sequence data. Genome

research, 18(8):1336–1346, 2008.

[9] Michael Baudis. Genomic imbalances in 5918 malignant epithelial tumors: an

explorative meta-analysis of chromosomal cgh data. BMC cancer, 7(1):226, 2007.

[10] Rameen Beroukhim, Craig H Mermel, Dale Porter, Guo Wei, Soumya Ray-

chaudhuri, Jerry Donovan, Jordi Barretina, Jesse S Boehm, Jennifer Dobson,

Mitsuyoshi Urashima, et al. The landscape of somatic copy-number alteration

across human cancers. Nature, 463(7283):899–905, 2010.

[11] Graham R Bignell, Chris D Greenman, Helen Davies, Adam P Butler, Sarah

Edkins, Jenny M Andrews, Gemma Buck, Lina Chen, David Beare, Calli La-

timer, et al. Signatures of mutation and selection in the cancer genome. Nature,

463(7283):893–898, 2010.

[12] Alex Bishara, Yuling Liu, Dorna Kashef-Haghighi, Ziming Weng, Daniel E New-

burger, Robert West, Arend Sidow, and Serafim Batzoglou. Read clouds uncover

variation in complex regions of the human genome. In Research in Computational

Molecular Biology, pages 30–31. Springer, 2015.

[13] Alessandro Bombonati and Dennis C Sgroi. The molecular pathology of breast

cancer progression. The Journal of pathology, 223(2):308–318, 2011.

[14] Sharon R Browning and Brian L Browning. Haplotype phasing: existing methods

and new developments. Nature Reviews Genetics, 12(10):703–714, 2011.

[15] Rebecca A Burrell, Nicholas McGranahan, Jiri Bartek, and Charles Swanton.

The causes and consequences of genetic heterogeneity in cancer evolution. Nature,

501(7467):338–345, 2013.

[16] Peter J Campbell, Shinichi Yachida, Laura J Mudie, Philip J Stephens, Erin D

Pleasance, Lucy A Stebbings, Laura A Morsberger, Calli Latimer, Stuart

McLaren, Meng-Lay Lin, et al. The patterns and dynamics of genomic instability

in metastatic pancreatic cancer. Nature, 467(7319):1109–1113, 2010.

[17] Silvia Casadei, Barbara M Norquist, Tom Walsh, Sunday Stray, Jessica B Man-

dell, Ming K Lee, John A Stamatoyannopoulos, and Mary-Claire King. Contri-

bution of inherited mutations in the brca2-interacting protein palb2 to familial

breast cancer. Cancer research, 71(6):2222–2229, 2011.

[18] Michael A Chapman, Michael S Lawrence, Jonathan J Keats, Kristian Cibulskis,

Carrie Sougnez, Anna C Schinzel, Christina L Harview, Jean-Philippe Brunet,

Gregory J Ahmann, Mazhar Adli, et al. Initial genome sequencing and analysis

of multiple myeloma. Nature, 471(7339):467–472, 2011.

[19] Zhi-Zhong Chen, Fei Deng, and Lusheng Wang. Exact algorithms for haplotype

assembly from whole-genome sequence data. Bioinformatics, page btt349, 2013.

[20] Kristian Cibulskis, Michael S Lawrence, Scott L Carter, Andrey Sivachenko,

David Jaffe, Carrie Sougnez, Stacey Gabriel, Matthew Meyerson, Eric S Lander,

and Gad Getz. Sensitive detection of somatic point mutations in impure and

heterogeneous cancer samples. Nature biotechnology, 31(3):213–219, 2013.

[21] Rudi Cilibrasi, Leo Van Iersel, Steven Kelk, and John Tromp. The complexity of

the single individual snp haplotyping problem. Algorithmica, 49(1):13–36, 2007.

[22] Gregory M Cooper, Bradley P Coe, Santhosh Girirajan, Jill A Rosenfeld,

Tiffany H Vu, Carl Baker, Charles Williams, Heather Stalker, Rizwan Hamid,

Vickie Hannig, et al. A copy number variation morbidity map of developmental

delay. Nature genetics, 43(9):838–846, 2011.

[23] Karen Crasta, Neil J Ganem, Regina Dagher, Alexandra B Lantermann, Elena V

Ivanova, Yunfeng Pan, Luigi Nezi, Alexei Protopopov, Dipanjan Chowdhury, and

David Pellman. Dna breaks and chromosome pulverization from errors in mitosis.

Nature, 482(7383):53–58, 2012.

[24] Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M

Rueda, Mark J Dunning, Doug Speed, Andy G Lynch, Shamith Samarajiwa,

Yinyin Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast

tumours reveals novel subgroups. Nature, 486(7403):346–352, 2012.

[25] Olivier Delaneau, Bryan Howie, Anthony J Cox, Jean-Francois Zagury, and

Jonathan Marchini. Haplotype estimation using sequencing reads. The American

Journal of Human Genetics, 93(4):687–696, 2013.

[26] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R

Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel,

Manuel A Rivas, Matt Hanna, et al. A framework for variation discovery

and genotyping using next-generation dna sequencing data. Nature genetics,

43(5):491–498, 2011.

[27] Li Ding, Matthew J Ellis, Shunqiang Li, David E Larson, Ken Chen, John W

Wallis, Christopher C Harris, Michael D McLellan, Robert S Fulton, Lucinda L

Fulton, et al. Genome remodelling in a basal-like breast cancer metastasis and

xenograft. Nature, 464(7291):999–1005, 2010.

[28] Li Ding, Timothy J Ley, David E Larson, Christopher A Miller, Daniel C

Koboldt, John S Welch, Julie K Ritchey, Margaret A Young, Tamara Lamprecht,

Michael D McLellan, et al. Clonal evolution in relapsed acute myeloid leukaemia

revealed by whole-genome sequencing. Nature, 481(7382):506–510, 2012.

[29] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul

Peluso, David Rank, Primo Baybayan, Brad Bettman, et al. Real-time dna

sequencing from single polymerase molecules. Science, 323(5910):133–138, 2009.

[30] Matthew J Ellis, Li Ding, Dong Shen, Jingqin Luo, Vera J Suman, John W

Wallis, Brian A Van Tine, Jeremy Hoog, Reece J Goiffon, Theodore C

Goldstein, et al. Whole-genome analysis informs breast cancer response to aromatase

inhibition. Nature, 486(7403):353–360, 2012.

[31] Beverly S Emanuel and Tamim H Shaikh. Segmental duplications:

an 'expanding' role in genomic instability and disease. Nature Reviews Genetics,

2(10):791–800, 2001.

[32] Megan N Farley, Laura S Schmidt, Jessica L Mester, Samuel Pena-Llopis, Andrea

Pavia-Jimenez, Alana Christie, Cathy D Vocke, Christopher J Ricketts, James

Peterson, Lindsay Middelton, et al. A novel germline mutation in bap1 pre-

disposes to familial clear-cell renal cell carcinoma. Molecular Cancer Research,

11(9):1061–1071, 2013.

[33] R Fisher, L Pusztai, and C Swanton. Cancer heterogeneity: implications for

targeted therapeutics. British journal of cancer, 108(3):479–485, 2013.

[34] Giulio Genovese, Robert E Handsaker, Heng Li, Nicolas Altemose, Amelia M

Lindgren, Kimberly Chambert, Bogdan Pasaniuc, Alkes L Price, David Reich,

Cynthia C Morton, et al. Using population admixture to help complete maps of

the human genome. Nature genetics, 45(4):406–414, 2013.

[35] Marco Gerlinger, Stuart Horswell, James Larkin, Andrew J Rowan, Max P Salm,

Ignacio Varela, Rosalie Fisher, Nicholas McGranahan, Nicholas Matthews, Clau-

dio R Santos, et al. Genomic architecture and evolution of clear cell renal cell

carcinomas defined by multiregion sequencing. Nature genetics, 46(3):225–233,

2014.

[36] Marco Gerlinger, Andrew J Rowan, Stuart Horswell, James Larkin, David En-

desfelder, Eva Gronroos, Pierre Martinez, Nicholas Matthews, Aengus Stewart,

Patrick Tarpey, et al. Intratumor heterogeneity and branched evolution revealed

by multiregion sequencing. New England Journal of Medicine, 366(10):883–892,

2012.

[37] David J Gordon, Benjamin Resio, and David Pellman. Causes and consequences

of aneuploidy in cancer. Nature Reviews Genetics, 13(3):189–203, 2012.

[38] Rodrigo Goya, Mark GF Sun, Ryan D Morin, Gillian Leung, Gavin Ha, Kimber-

ley C Wiegand, Janine Senz, Anamaria Crisan, Marco A Marra, Martin Hirst,

et al. Snvmix: predicting single nucleotide variants from next-generation se-

quencing of tumors. Bioinformatics, 26(6):730–736, 2010.

[39] Michael R Green, Andrew J Gentles, Ramesh V Nair, Jonathan M Irish, Shingo

Kihira, Chih Long Liu, Itai Kela, Erik S Hopmans, June H Myklebust, Hanlee

Ji, et al. Hierarchy in somatic mutations arising during genomic evolution and

progression of follicular lymphoma. Blood, 121(9):1604–1611, 2013.

[40] Christopher Greenman, Philip Stephens, Raffaella Smith, Gillian L Dalgliesh,

Christopher Hunter, Graham Bignell, Helen Davies, Jon Teague, Adam Butler,

Claire Stevens, et al. Patterns of somatic mutation in human cancer genomes.

Nature, 446(7132):153–158, 2007.

[41] Dan He, Arthur Choi, Knot Pipatsrisawat, Adnan Darwiche, and Eleazar Eskin.

Optimal algorithms for haplotype assembly from whole-genome sequence data.

Bioinformatics, 26(12):i183–i190, 2010.

[42] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Bioinformatics

enrichment tools: paths toward the comprehensive functional analysis of large

gene lists. Nucleic acids research, 37(1):1–13, 2009.

[43] John Huddleston, Swati Ranade, Maika Malig, Francesca Antonacci, Mark

Chaisson, Lawrence Hon, Peter H Sudmant, Tina A Graves, Can Alkan, Megan Y

Dennis, et al. Reconstructing complex regions of genomes using long-read se-

quencing technology. Genome research, 24(4):688–696, 2014.

[44] Dick G Hwang and Phil Green. Bayesian markov chain monte carlo sequence

analysis reveals varying neutral substitution patterns in mammalian evolution.

Proceedings of the National Academy of Sciences of the United States of America,

101(39):13994–14001, 2004.

[45] Yonggang Ji, Evan E Eichler, Stuart Schwartz, and Robert D Nicholls. Structure

of chromosomal duplicons and their role in mediating human genomic disorders.

Genome research, 10(5):597–610, 2000.

[46] Daniel C Koboldt, Qunyuan Zhang, David E Larson, Dong Shen, Michael D

McLellan, Ling Lin, Christopher A Miller, Elaine R Mardis, Li Ding, and

Richard K Wilson. Varscan 2: somatic mutation and copy number alteration

discovery in cancer by exome sequencing. Genome research, 22(3):568–576, 2012.

[47] Augustine Kong, Michael L Frigge, Gisli Masson, Soren Besenbacher, Patrick

Sulem, Gisli Magnusson, Sigurjon A Gudjonsson, Asgeir Sigurdsson, Aslaug

Jonasdottir, Adalbjorg Jonasdottir, et al. Rate of de novo mutations and the

importance of father's age to disease risk. Nature, 488(7412):471–475, 2012.

[48] Volodymyr Kuleshov, Dan Xie, Rui Chen, Dmitry Pushkarev, Zhihai Ma, Tim

Blauwkamp, Michael Kertesz, and Michael Snyder. Whole-genome haplotyping

using long reads and statistical methods. Nature biotechnology, 32(3):261–266,

2014.

[49] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C

Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William

FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature,

409(6822):860–921, 2001.

[50] David E Larson, Christopher C Harris, Ken Chen, Daniel C Koboldt, Travis E

Abbott, David J Dooling, Timothy J Ley, Elaine R Mardis, Richard K Wilson,

and Li Ding. Somaticsniper: identification of somatic point mutations in whole

genome sequencing data. Bioinformatics, 28(3):311–317, 2012.

[51] Rebecca J Leary, Jimmy C Lin, Jordan Cummins, Simina Boca, Laura D Wood,

D Williams Parsons, Sian Jones, Tobias Sjoblom, Ben-Ho Park, Ramon Parsons,

et al. Integrated analysis of homozygous deletions, focal amplifications, and

sequence alterations in breast and colorectal cancers. Proceedings of the National

Academy of Sciences, 105(42):16224–16229, 2008.

[52] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern,

Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady

Denisov, et al. The diploid genome sequence of an individual human. PLoS

biology, 5(10):e254, 2007.

[53] Timothy J Ley, Elaine R Mardis, Li Ding, Bob Fulton, Michael D McLellan,

Ken Chen, David Dooling, Brian H Dunford-Shore, Sean McGrath, Matthew

Hickenbotham, et al. Dna sequencing of a cytogenetically normal acute myeloid

leukaemia genome. Nature, 456(7218):66–72, 2008.

[54] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with

bwa-mem. arXiv preprint arXiv:1303.3997, 2013.

[55] Pengfei Liu, Ayelet Erez, Sandesh C Sreenath Nagamani, Shweta U Dhar,

Katarzyna E Kołodziejska, Avinash V Dharmadhikari, M Lance Cooper, Joanna

Wiszniewska, Feng Zhang, Marjorie A Withers, et al. Chromosome catastrophes

involve replication mechanisms generating complex genomic rearrangements.

Cell, 146(6):889–903, 2011.

[56] Maria A Lopez-Garcia, Felipe C Geyer, Magali Lacroix-Triki, Caterina Marchio,

and Jorge S Reis-Filho. Breast cancer precursors revisited: molecular features

and progression pathways. Histopathology, 57(2):171–192, 2010.

[57] Chey Loveday, Clare Turnbull, Elise Ruark, Rosa Maria Munoz Xicola, Emma

Ramsay, Deborah Hughes, Margaret Warren-Perry, Katie Snape, Diana Eccles,

D Gareth Evans, et al. Germline rad51c mutations confer susceptibility to ovarian

cancer. Nature genetics, 44(5):475–476, 2012.

[58] Christopher A Maher and Richard K Wilson. Chromothripsis and human disease:

piecing together the shattering process. Cell, 148(1):29–32, 2012.

[59] Elaine R Mardis. Genome sequencing and cancer. Current opinion in genetics

& development, 22(3):245–250, 2012.

[60] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian

Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel,

Mark Daly, et al. The genome analysis toolkit: a mapreduce framework for

analyzing next-generation dna sequencing data. Genome research, 20(9):1297–

1303, 2010.

[61] Matthew Meyerson and David Pellman. Cancer genomes evolve by pulverizing

single chromosomes. Cell, 144(1):9–10, 2011.

[62] Ryan E Mills, Christopher T Luttig, Christine E Larkins, Adam Beauchamp,

Circe Tsui, W Stephen Pittard, and Scott E Devine. An initial map of inser-

tion and deletion (indel) variation in the human genome. Genome research,

16(9):1182–1190, 2006.

[63] Felix Mitelman, Bertil Johansson, and Fredrik Mertens. Mitelman database of

chromosome aberrations in cancer. Cancer Genome Anatomy Project., 2007.

[64] Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers,

Jeanne McIndoo, Kerry Cook, Asya Stepansky, Dan Levy, Diane Esposito, et al.

Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90–94,

2011.

[65] Nicholas Navin, Alexander Krasnitz, Linda Rodgers, Kerry Cook, Jennifer Meth,

Jude Kendall, Michael Riggs, Yvonne Eberling, Jennifer Troge, Vladimir Grubor,

et al. Inferring tumor progression from genomic heterogeneity. Genome research,

20(1):68–80, 2010.

[66] Daniel E Newburger, Dorna Kashef-Haghighi, Ziming Weng, Raheleh Salari,

Robert T Sweeney, Alayne L Brunner, Shirley X Zhu, Xiangqian Guo, Sushama

Varma, Megan L Troxell, et al. Genome evolution during progression to breast

cancer. Genome research, 23(7):1097–1108, 2013.

[67] Serena Nik-Zainal, Ludmil B Alexandrov, David C Wedge, Peter Van Loo,

Christopher D Greenman, Keiran Raine, David Jones, Jonathan Hinton, John

Marshall, Lucy A Stebbings, et al. Mutational processes molding the genomes

of 21 breast cancers. Cell, 149(5):979–993, 2012.

[68] Serena Nik-Zainal, Peter Van Loo, David C Wedge, Ludmil B Alexandrov,

Christopher D Greenman, King Wai Lau, Keiran Raine, David Jones, John Mar-

shall, Manasa Ramakrishna, et al. The life history of 21 breast cancers. Cell,

149(5):994–1007, 2012.

[69] Brock A Peters, Bahram G Kermani, Oleg Alferov, Misha R Agarwal, Mark A

McElwain, Natali Gulbahce, Daniel M Hayden, Y Tom Tang, Rebecca Yu Zhang,

Rick Tearle, et al. Detection and phasing of single base de novo mutations

in biopsies from human in vitro fertilized embryos by advanced whole-genome

sequencing. Genome research, 25(3):426–434, 2015.

[70] Brock A Peters, Bahram G Kermani, Andrew B Sparks, Oleg Alferov, Peter

Hong, Andrei Alexeev, Yuan Jiang, Fredrik Dahl, Y Tom Tang, Juergen Haas,

et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human

cells. Nature, 487(7406):190–195, 2012.

[71] Erin D Pleasance, R Keira Cheetham, Philip J Stephens, David J McBride,

Sean J Humphray, Chris D Greenman, Ignacio Varela, Meng-Lay Lin, Gon-

zalo R Ordonez, Graham R Bignell, et al. A comprehensive catalogue of somatic

mutations from a human cancer genome. Nature, 463(7278):191–196, 2010.

[72] Erin D Pleasance, Philip J Stephens, Sarah O’Meara, David J McBride, Alison

Meynert, David Jones, Meng-Lay Lin, David Beare, King Wai Lau, Chris Green-

man, et al. A small-cell lung cancer genome with complex signatures of tobacco

exposure. Nature, 463(7278):184–190, 2010.

[73] Yardena Samuels, Zhenghe Wang, Alberto Bardelli, Natalie Silliman, Janine

Ptak, Steve Szabo, Hai Yan, Adi Gazdar, Steven M Powell, Gregory J Riggins,

et al. High frequency of mutations of the pik3ca gene in human cancers. Science,

304(5670):554–554, 2004.

[74] Sohrab P Shah, Ryan D Morin, Jaswinder Khattra, Leah Prentice, Trevor Pugh,

Angela Burleigh, Allen Delaney, Karen Gelmon, Ryan Guliany, Janine Senz,

et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide

resolution. Nature, 461(7265):809–813, 2009.

[75] Sohrab P Shah, Andrew Roth, Rodrigo Goya, Arusha Oloumi, Gavin Ha,

Yongjun Zhao, Gulisa Turashvili, Jiarui Ding, Kane Tse, Gholamreza Haffari,

et al. The clonal and mutational evolution spectrum of primary triple-negative

breast cancers. Nature, 486(7403):395–399, 2012.

[76] Christine J Shaw and James R Lupski. Implications of human genome archi-

tecture for rearrangement-based disorders: the genomic basis of disease. Human

molecular genetics, 13(suppl 1):R57–R64, 2004.

[77] Rebecca Siegel, Jiemin Ma, Zhaohui Zou, and Ahmedin Jemal. Cancer statistics,

2014. CA: a cancer journal for clinicians, 64(1):9–29, 2014.

[78] Peter T Simpson, Theo Gale, Jorge S Reis-Filho, Chris Jones, Suzanne Parry,

John P Sloane, Andrew Hanby, Sarah E Pinder, Andrew HS Lee, Steve

Humphreys, et al. Columnar cell lesions of the breast: the missing link in breast

cancer progression?: a morphological and molecular analysis. The American

journal of surgical pathology, 29(6):734–746, 2005.

[79] Matthew W Snyder, Andrew Adey, Jacob O Kitzman, and Jay Shendure.

Haplotype-resolved genome sequencing: experimental methods and applications.

Nature Reviews Genetics, 2015.

[80] American Cancer Society. Cancer facts and figures 2015. Atlanta: American

Cancer Society, 2015.

[81] Philip J Stephens, Chris D Greenman, Beiyuan Fu, Fengtang Yang, Graham R

Bignell, Laura J Mudie, Erin D Pleasance, King Wai Lau, David Beare, Lucy A

Stebbings, et al. Massive genomic rearrangement acquired in a single catastrophic

event during cancer development. cell, 144(1):27–40, 2011.

[82] Michael R Stratton. Exploring the genomes of cancer cells: progress and promise.

science, 331(6024):1553–1558, 2011.

[83] Michael R Stratton, Peter J Campbell, and P Andrew Futreal. The cancer

genome. Nature, 458(7239):719–724, 2009.

[84] John Sved and Adrian Bird. The expected equilibrium of the cpg dinucleotide

in vertebrate genomes under a mutation model. Proceedings of the National

Academy of Sciences, 87(12):4692–4696, 1990.

[85] Megan L Troxell, Alayne L Brunner, Tanaya Neff, Andrea Warrick, Carol

Beadling, Kelli Montgomery, Shirley Zhu, Christopher L Corless, and Robert B West.

Phosphatidylinositol-3-kinase pathway mutations are common in breast colum-

nar cell lesions. Modern Pathology, 25(7):930–937, 2012.

[86] Samra Turajlic, Simon J Furney, Maryou B Lambros, Costas Mitsopoulos,

Iwanka Kozarewa, Felipe C Geyer, Alan MacKay, Jarle Hakas, Marketa Zvelebil,

Christopher J Lord, et al. Whole genome sequencing of matched primary and

metastatic acral melanomas. Genome research, 22(2):196–207, 2012.

[87] Bala Murali Venkatesan and Rashid Bashir. Nanopore sensors for nucleic acid

analysis. Nature nanotechnology, 6(10):615–624, 2011.

[88] J Craig Venter, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural,

Granger G Sutton, Hamilton O Smith, Mark Yandell, Cheryl A Evans, Robert A

Holt, et al. The sequence of the human genome. science, 291(5507):1304–1351,

2001.

[89] Tom Walsh, Silvia Casadei, Ming K Lee, Christopher C Pennil, Alex S Nord,

Anne M Thornton, Wendy Roeb, Kathy J Agnew, Sunday M Stray, Anneka

Wickramanayake, et al. Mutations in 12 genes for inherited ovarian, fallopian

tube, and peritoneal carcinoma identified by massively parallel sequencing. Pro-

ceedings of the National Academy of Sciences, 108(44):18032–18037, 2011.

[90] Matthew J Walter, Dong Shen, Li Ding, Jin Shao, Daniel C Koboldt, Ken Chen,

David E Larson, Michael D McLellan, David Dooling, Rachel Abbott, et al.

Clonal architecture of secondary acute myeloid leukemia. New England Journal

of Medicine, 366(12):1090–1098, 2012.

[91] Xiaochong Wu, Paul A Northcott, Adrian Dubuc, Adam J Dupuy, David JH

Shih, Hendrik Witt, Sidney Croul, Eric Bouffet, Daniel W Fults, Charles G

Eberhart, et al. Clonal selection drives genetic divergence of metastatic medul-

loblastoma. Nature, 482(7386):529–533, 2012.

[92] Shinichi Yachida, Sian Jones, Ivana Bozic, Tibor Antal, Rebecca Leary, Baojin

Fu, Mihoko Kamiyama, Ralph H Hruban, James R Eshleman, Martin A Nowak,

et al. Distant metastasis occurs late during the genetic evolution of pancreatic

cancer. Nature, 467(7319):1114–1117, 2010.

[93] Jianjun Zhang, Junya Fujimoto, Jianhua Zhang, David C Wedge, Xingzhi Song,

Jiexin Zhang, Sahil Seth, Chi-Wan Chow, Yu Cao, Curtis Gumbs, et al. Intratu-

mor heterogeneity in localized lung adenocarcinomas delineated by multiregion

sequencing. Science, 346(6206):256–259, 2014.
