Download - Genome Sequence Analysis and Methods - Amborella · 2010. 12. 11. · BioAcknowledgement Scientists who shaped the world as we know by being factual, rational, honest, and free-thinking

Genome Sequence Analysis and Methods

IT Revolution & its role in personalized medicine

Jong Bhak

Genome Research Foundation & Theragen BiO Institute

TheragenEtex

BioAcknowledgement
Scientists who shaped the world as we know by being factual, rational, honest, and free-thinking.
Tax payers who support researchers.
My former and present colleagues in MRC, Harvard Med., EBI, KAIST, KOBIC, and TBI, from whom I learned so much.
TheragenEtex for generous financial support.
Dr. Kim SangTae and Dr. Park, Jongsun.

BioDisclaimer
The content of these slides is produced by Jong after shamelessly stealing other people’s copyrighted ideas and knowledge.
Everyone is welcome to take the slides in part or in whole without any permission whatsoever.

Everything here is under (♡) BioLicense
− It means it is free for all.

Brief history of Computing

• Charles Babbage: difference engine, 1822
• Alan Turing: formalization of the concept of the algorithm and computation with the Turing machine: modern computer
• Claude Shannon: Information Theory; An Algebra for Theoretical Genetics, 1940
• IBM, 1980s: Personal Computer
• Tim Berners-Lee, 1990: WWW, Personal Net
• Google: Interaction Network
• FaceBook: Interaction Network
• GRID computing: Networking computing
• Cloud computing: Personalization of Super computer

Brief History of Bioinformatics
Darwin: A theoretical biologist
Mendel: A theoretical prediction and validation
Perutz and Kendrew: structural biologists
Crick and Watson: DNA modellers
Crick, Brenner: Codon
Sanger: DNA sequencing: Genomics
Sanger: Protein sequencing
Sanger: Proteomics
Lesk: Visualization of proteins
Dayhoff: Atlas of proteins (well known DB)
Needleman & Wunsch: Computer algorithms
Roger Staden: DNA analysis tools
Southern: Hybridization, Functional Genomics
H. influenzae, Cyanobacterial genomes 1995, 1996
Human Genome Project 2000
Personal Genomics 2006

IT and Bio Revolution?
What is a revolution?

What is IT in Bio?

Briefly, it is called BIT or Bioinformatics

Who is a bioinformatist?

Bioinformatics is about mapping
X is IEP, Y is Size
An old map is not accurate. However, it helps people to explore.
Data, Information, Knowledge
[Figure: Gangnido, 1402]

What is Omics?
A term that arose as biology became process-driven and industrialized.
Bio 'engineering' and bio 'industry' have emerged since about 2008.
− http://omics.org

Genome, Transcript, Proteome, Interactome, Functome, Textome, Ome

BiOMatrix
[Figure: BiOMatrix. Info types: Sequence, Structure, Expression, Pathway, Regulation, Network, BioDiversity. Columns: Resource (files), analysis Pipeline, Materials (materials bank), BioEngine DB, secondary info, secondary DB.]

Personal Genomics?

Core Tech: popularized personal genomics

What is it?
Doctors diagnose, prescribe, and advise according to each individual's genotype.
The general public can use genome sequencers and analyzers in one way or another.

http://PersonalGenomics.org

What is Genomics?

http://genomics.org

Genomic T: the genome "T" shape
[Figure: horizontal axis, 6 billion persons; vertical axis, 6 billion bases.]

Jong Bhak, under BioLicense

The two major axes of genomics: the diversity of ethnic groups and the genome of each individual

6 billion people

50,000 bases for PASNP

Dr. Kim Seong-jin

DNA sequencing and DNA (gene) typing

Sequencing decodes all of the DNA/RNA in a cell.
Sequencer: HellogenomeTM

Genotyping determines the type of part of the DNA/RNA in a cell.
Biochip: HellogeneTM

History of personal genome sequencing
NCBI Reference genome, pooled DNAs, Caucasian
Craig Venter, Caucasian (publicly available)
James Watson, Caucasian (publicly available)
Nigerian (anonymous), African (HapMap)
YH, Han Chinese, publicly available (BGI)
Seong Jin Kim, publicly available (Theragen team)
AK1, Korean, publicly available
Rosalynn Gill, Caucasian (publicly available)

The first publicly released female genome: PGP9 (Theragen)

Past, present, and future of genome sequencing

HGP: 13 years, $2.7 billion (3.5 trillion won, 14X), 2004

Craig Venter: 4 years, $100 million (130 billion won), 2007

James Watson: 2 years, $2 million (2.6 billion won, 7.5X), 2008

Dr. Seong-Jin Kim: 6 months, $0.17 million (220 million won, 29X), 2009

2010: 1 month, $20,000 (26 million won, 30X)

2015(?):
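
The drop between milestones is easier to see as fold changes. The short sketch below only re-does the arithmetic on the dollar figures and durations listed above; the variable and function names are illustrative and not part of the original slides.

```python
# Figures copied from the slide: (label, year, duration in months, cost in USD).
MILESTONES = [
    ("HGP", 2004, 13 * 12, 2_700_000_000),
    ("Craig Venter", 2007, 4 * 12, 100_000_000),
    ("James Watson", 2008, 2 * 12, 2_000_000),
    ("Dr. Seong-Jin Kim", 2009, 6, 170_000),
    ("2010 genome", 2010, 1, 20_000),
]


def fold_changes(milestones):
    """Return (label, cost fold drop, time fold drop) relative to the previous entry."""
    rows = []
    for prev, cur in zip(milestones, milestones[1:]):
        rows.append((cur[0], prev[3] / cur[3], prev[2] / cur[2]))
    return rows


for label, cost_fold, time_fold in fold_changes(MILESTONES):
    print(f"{label}: cost down {cost_fold:,.1f}x, time down {time_fold:.2f}x")

# Overall drop from the HGP to the 2010 figure.
overall = MILESTONES[0][3] / MILESTONES[-1][3]
print(f"HGP to 2010: cost down {overall:,.0f}x")
```

The won amounts and the "X" values on the slide are left out because the slide does not say how they were derived.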

The core of the Omics revolution?

Large-volume, cheap, and accurate data:
− Sequencing: sequencing tech/cost

Ultra-fast automated analysis:
− Sequence analysis: computing tech/cost

Ome and Omics graph (the relationship between Ome and Omics)
[Figure: Cost (from $3,000,000,000 down to $0) against Year (2003 to 2016); $50,000 per person; Ome and Omics balance point at about 2010.]


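Real example: the Y axis of genomics

The first Korean genome sequencing paper: the genome of Dr. Seong-Jin Kim
Data release: December 2008: ftp://bioftp.org

The second Korean genome: AK1 (Seoul National University College of Medicine; anonymous donor)
Sequenced in 2009 by Seoul National University College of Medicine and Macrogen
Data release: December 2009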

The first Korean Genome (SJK)
First analyzed by Gacheon medical school LCDI and KOBIC, KRIBB in 2008 (joint effort among LCDI, KOBIC, and the National Center for Standard Reference Data)
First annotated and made public on 4th Dec. 2008 (through web and ftp)
SNPs, CNVs, and indels were analysed
Automated phenotypic association study was done
Non-syn. analysis
Phylogenetic study of mtDNA, Y chr., and autosomes showed Korean relationship to Chinese and Japanese.
First intra-Asian genome comparison (Chinese and Korean)
Analyzed at: 7.8, 17.3, 23.5, and 28x fold
By Jan., 23.5 fold sequenced and analyzed
Open and freely available from: http://koreagenome.org

The Karyogram of the donor DNA
No obvious chromosomal abnormalities!

Classification and number of intra-genic SNPs

Not represented in dbSNP

Comparison of individual SNPs

SJK shared 56% with Yoruba (Korean vs African: 56%)
SJK shared 60% with Chinese (Korean vs Chinese: 60%)
SJK shared 50% with Venter
SJK shared 53% with Watson
Korean vs Caucasians: 52%
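
A sharing percentage of this kind can be computed as a set overlap between two SNP call sets. The sketch below assumes the denominator is the number of SNPs in the query genome (the slide does not state it) and uses a made-up toy data set; the function name and the tuple layout are illustrative.

```python
def percent_shared(query_snps, other_snps):
    """Percentage of SNPs in query_snps that also occur in other_snps."""
    if not query_snps:
        raise ValueError("query_snps is empty")
    return 100.0 * len(query_snps & other_snps) / len(query_snps)


# Hypothetical toy call sets: each SNP is (chromosome, position, alternate allele).
sjk = {
    ("chr1", 1001, "A"),
    ("chr1", 2050, "T"),
    ("chr2", 300, "G"),
    ("chr3", 77, "C"),
    ("chr4", 9, "T"),
}
other = {
    ("chr1", 1001, "A"),
    ("chr2", 300, "G"),
    ("chr3", 77, "C"),
    ("chr5", 5, "G"),
}

# Three of the five toy SJK SNPs are also in the other genome.
print(percent_shared(sjk, other))
```

The measure is not symmetric: swapping the two arguments changes the denominator, so which genome is the query should always be stated.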

Korean Genome Variation Browser

http://koreagenome.org/cgi-bin/gbrowse/kgenome/
[Figure: browser view of the “NOC2L” gene with tracks for SJK’s SNPs, Watson’s SNPs, Hapmap, YH’s SNPs, and Venter’s SNPs.]

SJK’s genetic lineage

[Figure panels: autosomal phylogenetic tree; chromosome Y haplogroup lineage; mtDNA ethno-geographic lineage. SJK is marked.]

Size distribution and classification of short indels found in SJK

Using MAQ, we identified 342,965 short indels. We found that only 247 (0.1%) were validated, 113,287 (33.0%) were non-validated, and 229,431 (66.9%) indels were not found in dbSNP.

Indels in SJK genic regions

Index    Indel number  Homozygous  Heterozygous  Gene number
5'UTR              27           9            18           26
CDS                49          16            33           40
3'UTR             319         114           205          247
Intron        127,516      45,430        82,086       12,421
Total         127,911      45,569        82,342       12,734

Comparison of individual Indels

Comparison of the SJK indels (< 4bp) overlapped with those of YH, HuRef (Venter), Watson, and NA18507 (Yoruba) genomes

Source                     SJK genome              Indel loci (a)  Indel size (b)  Indel type (c)  Indel type/all  Indel type/indel loci
YH genome (135,199)        All (289,257)                   22,605          22,522          22,495            7.8%                  99.5%
                           Homozygous (112,843)            12,940          12,915          12,902           11.4%                  99.7%
                           Heterozygous (176,414)           9,665           9,607           9,593            5.5%                  99.3%
HuRef genome (577,661)     All                             34,142          33,254          29,422           10.2%                  86.2%
                           Homozygous                      17,325          16,956          15,656           13.9%                  90.4%
                           Heterozygous                    16,817          16,298          13,766            7.8%                  81.9%
Watson genome (118,887)    All                              6,533           5,749           5,738            2.0%                  87.8%
                           Homozygous                       3,363           3,090           3,090            2.7%                  91.9%
                           Heterozygous                     3,170           2,659           2,648            1.5%                  83.5%
NA18507 genome (438,566)   All                            152,847         146,266         143,023           49.4%                  93.6%
                           Homozygous                      76,314          73,231          72,287           64.1%                  94.7%
                           Heterozygous                    76,533          73,035          70,736           40.1%                  92.4%

This discrepancy seems to result from the method used rather than from ethnic similarity between SJK and NA18507 (i.e., paired-end sequencing was used for both SJK and NA18507).

This may partially explain why HuRef and Watson, which are Caucasian like the NCBI reference, have lower levels (86.2% and 87.8%) of common indels against SJK.
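
The two ratio columns of the table can be recomputed from its counts. The sketch below uses only the "All" rows and assumes that "indel type/all" divides by the 289,257 SJK indels and "indel type/indel loci" divides by the shared loci; that reading reproduces the printed percentages, but the names are illustrative.

```python
# "All" rows copied from the table: genome -> (shared indel loci, shared indel type).
SJK_INDELS_LT_4BP = 289_257
SHARED = {
    "YH": (22_605, 22_495),
    "HuRef": (34_142, 29_422),
    "Watson": (6_533, 5_738),
    "NA18507": (152_847, 143_023),
}


def overlap_ratios(loci, same_type, total=SJK_INDELS_LT_4BP):
    """Return (indel type / all, indel type / indel loci) as percentages, 1 decimal."""
    return round(100 * same_type / total, 1), round(100 * same_type / loci, 1)


for genome, (loci, same_type) in SHARED.items():
    print(genome, overlap_ratios(loci, same_type))
```

Seen this way, the first ratio mostly reflects how many indels each genome has in common with SJK, while the second reflects how often a shared locus carries the same indel type.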

Homo- and heterozygous deletions in SJK genome

(A) Homozygous 2.3 kb genomic deletion and (B) Heterozygous 5 kb genomic deletion.

Detection and identification of structural variants

• We found structural variants by using paired-end reads.
1. 2,920 deletions (100 bp ~ 100 kb)
2. 415 inversions (100 bp ~ 100 kb)
3. 963 insertions (175 bp ~ 250 bp)

• We found deletion SVs in 21 coding genes. All heterozygous deletions.
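
The slides do not describe how the paired-end reads were turned into variant calls. As a rough illustration of the general idea only, the sketch below labels a single mapped read pair by comparing its span and strand orientation with the expected library insert size; the class, the thresholds, and the example coordinates are all assumptions, and real callers cluster many supporting pairs before reporting a variant.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    left_pos: int      # leftmost mapped coordinate of the first mate
    right_pos: int     # rightmost mapped coordinate of the second mate
    left_strand: str   # "+" or "-"
    right_strand: str  # "+" or "-"


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label one mapped read pair by comparing it with the library insert size."""
    if pair.left_strand == pair.right_strand:
        return "inversion"   # mates of a normal pair map to opposite strands
    span = pair.right_pos - pair.left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"    # donor lacks sequence that the reference has
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"   # donor carries sequence that the reference lacks
    return "concordant"


# Hypothetical pairs for a library with a 300 bp mean insert and 20 bp SD.
pairs = [
    ReadPair(1000, 1300, "+", "-"),
    ReadPair(5000, 8000, "+", "-"),
    ReadPair(9000, 9060, "+", "-"),
    ReadPair(12000, 12300, "+", "+"),
]
labels = [classify_pair(p, mean_insert=300, sd_insert=20) for p in pairs]
print(labels)
```

With this approach an insertion can only be seen if it is shorter than the insert itself, which is one plausible reason the insertion size range on the slide is so much narrower than the range for deletions and inversions.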

Repeat composition in SJK deletion variants

Long Interspersed Nuclear Elements (LINE)

Short Interspersed Nuclear Elements (SINE)

Real example: the X axis of genomics

6 billion people

The population diversity in Pan Asia

PASNP consortium

http://pasnpi.org

The PASNP project is an international consortium for Pan-Asian SNP study.

• Research with Asian populations:
1. Population diversity mapping (phylogenetic study)
2. Functional annotation study
3. CNV study

• Basic services:
1. SNP data repository and basic information services via web for Pan Asians

Sampling from 11 Pan-Asian countries

1. Sample number: 1,833
2. Ethnic groups: 76
3. Countries: 11
4. SNP marker number: 58,960 (Affymetrix 56K Xba SNP genotyping chip)

Genotyped 76 ethnic groups over 11 countries

How many recognizable human groups in the world?

Just in the right fig., there are simply six recognizable population groups.

When we consider human migration, isolation, admixture, and more ethnic groups, this is not a simple question.

[Figure: Maximum likelihood tree of 29 populations. The tree is based on data from 19,934 SNPs. Bootstrap values based on 100 replicates.]

Admixture, migration, and isolation events make population grouping more complicated

In this study, we found

1. Genetic ancestry is strongly correlated with linguistic affiliations as well as geography.
2. Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations.

PCA results

Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography

Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations

Phylogenetic and population structure analysis results

Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography

Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations

[Figure panels: Phylogenetic tree: Da distance based NJ tree; Population stratification: STRUCTURE]

ASD based NJ individual tree

Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography

Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations
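
An allele sharing distance (ASD) matrix is the input to a neighbor-joining (NJ) tree of individuals. The sketch below uses one common formulation, assumed here because the slides do not give the exact definition: genotypes are coded as 0, 1, or 2 copies of an allele, and the per-SNP distance is the absolute difference. The toy genotypes and names are hypothetical, and the tree-building step itself is not shown.

```python
def allele_sharing_distance(g1, g2):
    """Mean per-SNP distance between two genotype vectors coded 0/1/2.

    None marks a missing call; SNPs missing in either individual are skipped.
    """
    total = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        total += abs(a - b)  # 0: both alleles shared, 1: one shared, 2: none shared
        compared += 1
    if compared == 0:
        raise ValueError("no SNP typed in both individuals")
    return total / compared


def distance_matrix(genotypes):
    """All pairwise distances as a dict keyed by (name, name)."""
    names = sorted(genotypes)
    return {
        (p, q): allele_sharing_distance(genotypes[p], genotypes[q])
        for p in names
        for q in names
    }


# Hypothetical genotypes for three individuals at five SNPs.
toy = {
    "ind1": [0, 1, 2, 2, None],
    "ind2": [0, 1, 2, 1, 0],
    "ind3": [2, 1, 0, 0, 2],
}
dm = distance_matrix(toy)
print(dm[("ind1", "ind2")], dm[("ind1", "ind3")], dm[("ind2", "ind3")])
```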

Eight population outliers whose linguistic and genetic affinities are inconsistent

• AX-ME/Melanesian, MY-JH/Jehai (Negrito), MY-KS/Kensiu (Negrito), TH-MO/Mon, TH-KA/Karen, CN-JN/Jinuo, IN-TB/Spiti, and CN-UG/Uyghur
• These linguistic outliers tend to cluster with their geographic neighbors or to occupy an intermediate position between their geographic neighbors and the more distant members of their linguistic group
• These patterns are consistent with substantial recent admixture among the populations, a history of language replacement, or uncertainties in the linguistic classifications themselves

Considerable gene flow among Asian populations was observed

• Considerable gene flow was observed amongst sub-populations in clusters, including those groups believed to practice endogamy based on linguistic, cultural and ethnic information
• STRUCTURE reveals that the six Han Chinese population samples show varying degrees of admixture between a northern ‘Altaic’ cluster and a ‘Sino-Tibetan/Tai-Kadai’ cluster which is most frequent in the ethnic groups sampled from southern China and northern Thailand

Peopling of Asia: one-wave vs. two-wave hypothesis

Two-wave hypothesis

Cavalli-Sforza et al. (Nat Genet, 2003) suggested a hypothesis of the peopling of Asia in which anatomically modern humans (also called Homo sapiens sapiens) spread into Asia through two routes.

The first was a southern route, perhaps along the coast to South and Southeast Asia, from where it bifurcated north and south. In the south, these modern humans reached Oceania between 60 and 40 kya, whereas the northern expansion later reached China, Japan and eventually America.

The second was a central route through the Middle East, Arabia or Persia to central Asia, from where migration occurred in all directions, reaching Europe and east and northeast Asia about 40 kya, after which the first and principal migration to America suggested by Greenberg occurred not later than 15 kya.

One-wave hypothesis

All modern East Asian and Southeast Asian populations were derived from a single initial entry of modern humans into the subcontinent.

Which model better explains our observation?

Our current observation is expressed in the right figures.

Forward time simulation
Peopling of Asia: one-wave versus two-wave hypothesis

Based on the hypothesis of Cavalli-Sforza et al. and our observations on Asian Negritos, we constructed three testable models.

Our current observation is like Model 3. If Model 1 or 2 can change into our current observation under some known conditions, “two-wave models” will be acceptable.

*Conditions
- Allele frequency spectrum of MRCA was from YRI. 10,000 SNPs were simulated. One generation was calculated as 20 years. Gene flow proportion was set to different levels (M=0.005~0.95)

[Figure: Hypothetical models of the peopling of Asia. Model 1 and Model 2 represent the “two waves” hypothesis (Cavalli’s hypothesis), and Model 3 represents the “one wave” hypothesis.]
- AF: African; NG: Negrito; AS: Asian; EU: European.
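
The slides give the simulation conditions but not the simulator. As a minimal sketch of what "forward time" means, the code below runs a Wright-Fisher model for one SNP in two populations with one-way gene flow, using the 20-years-per-generation convention from the conditions above. The population size, starting frequencies, function names, and the way migration is applied are all assumptions, not the authors' setup.

```python
import random


def binomial(n, p, rng):
    """Number of successes in n independent Bernoulli(p) draws."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate(p_donor, p_recipient, n_diploid, m, years, rng, years_per_generation=20):
    """Forward-time Wright-Fisher drift in two populations with one-way gene flow.

    Each generation a fraction m of the recipient gene pool is replaced by donor
    alleles, then both populations are resampled (2 * n_diploid chromosomes each).
    Returns the allele frequency trajectories (donor, recipient).
    """
    generations = years // years_per_generation
    donor, recipient = [p_donor], [p_recipient]
    chromosomes = 2 * n_diploid
    for _ in range(generations):
        mixed = (1 - m) * recipient[-1] + m * donor[-1]
        donor.append(binomial(chromosomes, donor[-1], rng) / chromosomes)
        recipient.append(binomial(chromosomes, mixed, rng) / chromosomes)
    return donor, recipient


# Hypothetical run: 2,000 years (100 generations), 5% gene flow per generation.
rng = random.Random(1)
donor_traj, recipient_traj = simulate(0.5, 0.1, 100, 0.05, 2000, rng)
print(len(recipient_traj), recipient_traj[-1])
```

Repeating such a run over many SNPs and a grid of M values, then comparing the simulated frequency differences with the observed ones, is the kind of test the slide describes.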

Results and Conclusion: Forward time simulation

Peopling of Asia: one-wave versus two-wave hypothesis

Our simulation results indicate that Model 1 is not compatible with the empirical data.

No extreme gene flow!

Model 2 is only compatible if gene flow from other Asian populations to the Negritos has been fairly extreme, with more than 50% of Negrito chromosomes coming from other Asian populations, without dramatically affecting the Negrito phenotype.

Thus Models 1 and 2 do not explain the current observation.

[Photo captions: Negrito: the Semang people of the Malay Peninsula; Thailand people]

Finding 3: Haplotype diversity was strongly correlated with latitude, with diversity decreasing from South to North

Haplotype diversity versus latitudes

• Haplotype diversity was strongly correlated with latitude (R2 = 0.91, P < 0.0001), with diversity decreasing from South to North.
• This is consistent with a loss of diversity as populations moved to higher latitudes.

① Indonesian; ② Malay; ③ Philippine; ④ Thai; ⑤ South Chinese minorities; ⑥ Southern Han Chinese; ⑦ Japanese & Korean; ⑧ Northern Han Chinese; ⑨ Northern Chinese Minorities; ⑩ Yakut.
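
Both quantities on this slide are simple to compute once haplotypes are called. The sketch below assumes Nei's estimator for haplotype diversity (the slides do not say which estimator was used) and the squared Pearson correlation for R2; the example haplotypes are made up.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's haplotype diversity: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical sample of four haplotypes from one population.
print(haplotype_diversity(["AAG", "AAG", "ACG", "TCG"]))
```

In an analysis like the one on the slide, `haplotype_diversity` would be evaluated once per population group and `r_squared` would then be applied to the list of group latitudes and the list of diversities.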

Finding 4: Southeast Asia has most of the Asian gene pool

Group private haplotype sharing analysis (frequency was not considered (type only))

• 90% of haplotypes in East Asian (EA) populations were found in SEA and Central-South Asian (CSA) populations.
• Of which about 50% were found in SEA and EA only and 5% found in CSA only.
• These observations suggest that the geographic source(s) contributing to EA populations were mainly from SEA populations.

* Proportion of haplotypes in population A that can be also found in population B (HSa)
• HSa: CSA private
• HSb: EA private
• HSc: sharing by all groups
• HSd: SE private

* YKT: Yakut; N-CM: Northern Chinese minorities; N-HAN: Northern Han Chinese; JP-KR: Japanese and Korean; S-HAN: Southern Han Chinese; S-CM: Southern Chinese minorities; EA: East Asian groups
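
Because the analysis counts haplotype types and ignores their frequencies, it reduces to set operations. The sketch below shows the sharing proportion and a split of East Asian haplotype types by where else they occur; the category labels, function names, and toy haplotype identifiers are illustrative and do not reproduce the HSa to HSd definitions exactly.

```python
def sharing_proportion(pop_a, pop_b):
    """Proportion of distinct haplotype types in pop_a that also occur in pop_b."""
    types_a, types_b = set(pop_a), set(pop_b)
    if not types_a:
        raise ValueError("pop_a has no haplotypes")
    return len(types_a & types_b) / len(types_a)


def partition_types(ea, sea, csa):
    """Split the distinct East Asian haplotype types by where else they are found."""
    ea, sea, csa = set(ea), set(sea), set(csa)
    return {
        "EA private": ea - sea - csa,
        "SEA and EA only": (ea & sea) - csa,
        "CSA and EA only": (ea & csa) - sea,
        "all groups": ea & sea & csa,
    }


# Hypothetical haplotype labels; repeated labels are deliberately collapsed.
ea = ["h1", "h1", "h2", "h3", "h4", "h5"]
sea = ["h1", "h2", "h3", "h9"]
csa = ["h3", "h4", "h8"]

print(sharing_proportion(ea, sea))
print(partition_types(ea, sea, csa))
```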

Koreans and their neighbors

- Northeastern Asians, including Japanese, Northern Chinese, and Koreans, have high autosomal genetic similarity compared to others.

[Tree labels: American Indians; Northern Asians; Northeastern Asians; Chinese and SEAs]

Figure S28. Maximum likelihood tree of 126 population samples (PASNP + HGDP).

* Bootstrap values based on 100 replicates are shown. Language families are indicated with colors as shown in the legend. AX: Affymetrix; CN: China; ID: Indonesia; IN: India; JP: Japan; KR: Korea; MY: Malaysia; PI: the Philippines; SG: Singapore; TH: Thailand; TW: Taiwan.

CPU used for PASNP:

600 CPUs for about one year running the “Structure” program.

Only 56K SNP chip data (2000 samples)
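
Expressed in CPU-hours, the figure above is easier to compare with other jobs. This is a back-of-the-envelope sketch only: it assumes all 600 CPUs were busy for a full 365-day year and, for the per-sample number, that the load divides evenly over 2,000 samples.

```python
HOURS_PER_YEAR = 365 * 24


def cpu_hours(n_cpus, years):
    """Total CPU-hours if every CPU runs for the whole period."""
    return n_cpus * years * HOURS_PER_YEAR


total = cpu_hours(600, 1)
per_sample = total / 2000
print(f"{total:,} CPU-hours in total")
print(f"{per_sample:,.0f} CPU-hours per sample")
```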

Lesson and Challenge


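1. Experimental design: becoming ever more important
The key to successful life science in the omics era

An experimental design that yields data that can be analyzed properly

− Experimental design (genome, bioinformatics): 8 months
− Experiment: 2 months
− Analysis: 2 months
In most cases where the experimental design is missing or wrong, a good paper is impossible and the money is wasted.

Experimental design has to be done together with a bioinformatician from the start (otherwise, the researcher has to invest that much effort and planning time).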


2. Bioinformatics Challenge: still massively mapping

How to map/compute 6 billion X 6 billion matrix?

[Figure: 6 billion persons (width axis: genotyping) by 6 billion bases (depth axis: full genome sequencing).]
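
To get a feel for the scale of that matrix, the sketch below counts its cells and estimates raw storage. The 2 bits per base (A, C, G, T with no quality scores, metadata, or compression) is an assumption; the PASNP dimensions are the sample and marker numbers given earlier in the slides.

```python
PERSONS = 6_000_000_000
BASES = 6_000_000_000   # bases per person, as on the slide
BITS_PER_BASE = 2       # assumption: A, C, G, T packed into 2 bits

cells = PERSONS * BASES
bytes_total = cells * BITS_PER_BASE // 8
print(f"{cells:.1e} cells, about {bytes_total / 1e18:.0f} exabytes uncompressed")

# The PASNP genotype matrix for comparison: 1,833 samples x 58,960 SNP markers.
pasnp_cells = 1_833 * 58_960
print(f"PASNP: {pasnp_cells:,} genotypes, {pasnp_cells / cells:.1e} of the full matrix")
```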

3. The era of made-to-order research: do not try to do everything alone
Apple iPhone

The problems of omics

Adding one more dimension?

How to map/compute RNA expressions in relation with bio-function?

[Figure: 6 billion persons by 6 billion bases.]

Transcriptome sequencing

Adding even more dimensions?

How to map/compute Phenome?

[Figure: 6 billion persons by 6 billion bases.]

Cohort studies (large-scale population studies)

How to map/compute epigenome?

[Figure: 6 billion persons by 6 billion bases.]

How to map/compute Microbiome?

[Figure: 6 billion persons by 6 billion bases.]

DIY Biology
DIY genomics
DIY bioinformatics
Consumer genomics
Personal genomics
Personal bioinformatics

Conclusion
Researchers should focus on the problems that really matter:

Genomics and bioinformatics have moved past the era when only experts did them;

now the general public can buy and do them too.
