Genome Sequence Analysis and Methods - amborella · 2010. 12. 11.


  • Genome Sequence Analysis and Methods

    IT Revolution & its role in personalized medicine

    Jong Bhak

    Genome Research Foundation & Theragen BiO Institute

    TheragenEtex

  • BioAcknowledgement
    • Scientists who shaped the world as we know it by being factual, rational, honest, and free-thinking.
    • Taxpayers who support researchers.
    • My former and present colleagues at MRC, Harvard Med., EBI, KAIST, KOBIC, and TBI, from whom I learned so much.
    • TheragenEtex, for generous financial support.
    • Dr. Kim SangTae and Dr. Park Jongsun.

  • BioDisclaimer
    • The content of these slides was produced by Jong after shamelessly stealing other people's copyrighted ideas and knowledge.
    • Everyone is welcome to take the slides in part or in whole without any permission whatsoever.
    • Everything here is under (♡) BioLicense.
      − It means it is free for all.

  • Brief history of Computing

    • Charles Babbage: difference engine, 1822
    • Alan Turing: formalization of the concept of the algorithm and computation with the Turing machine; the modern computer
    • Claude Shannon: information theory; An Algebra for Theoretical Genetics, 1940
    • IBM, 1980s: personal computer
    • Tim Berners-Lee, 1990: WWW, the personal net
    • Google: interaction network
    • FaceBook: interaction network
    • GRID computing: networking computing
    • Cloud computing: personalization of the supercomputer

  • Brief History of Bioinformatics
    • Darwin: a theoretical biologist
    • Mendel: a theoretical prediction and validation
    • Perutz and Kendrew: structural biologists
    • Crick and Watson: DNA modellers
    • Crick, Brenner: codon
    • Sanger: DNA sequencing (genomics)
    • Sanger: protein sequencing (proteomics)
    • Lesk: visualization of proteins
    • Dayhoff: Atlas of proteins (well-known DB)
    • Needleman & Wunsch: computer algorithms
    • Roger Staden: DNA analysis tools
    • Southern: hybridization, functional genomics
    • H. influenzae and cyanobacterial genomes, 1995, 1996
    • Human Genome Project, 2000
    • Personal genomics, 2006

  • IT and Bio Revolution? What is a revolution?

  • What is IT in Bio?

    Briefly, it is called BIT or Bioinformatics.

    Who is a bioinformatist?

  • Bioinformatics is about mapping

    X is IEP, Y is size.

    An old map is not accurate. However, it helps people to explore.

    Data, information, knowledge

    Figure: Gangnido, 1402

  • What is Omics?
    • A term that arose as biology became process-driven and industrialized.
    • Bio "engineering" and bio "industry" emerged from around 2008.
      − http://omics.org

  • Genome, Transcriptome, Proteome, Interactome, Functome, Textome, ... Ome

    [Figure: BiOMatrix. Resources (sequence, structure, expression, pathway, regulation, network, materials bank, biodiversity) are arranged by information type, together with analysis pipelines, a BioEngine DB, secondary information, and secondary DBs.]

  • Personal Genomics?

    Core tech: popularized personal genomics

    What is it?
    • Doctors diagnose, prescribe, and advise according to each individual's genotype.
    • The general public can use genome sequencers and analyzers in one way or another.

    http://PersonalGenomics.org

  • What is Genomics?

    http://genomics.org

  • Genomic T: the "T" of the genome

    [Figure: a T-shaped diagram with 6 billion persons on the horizontal axis and 6 billion bases on the vertical axis.]

    Jong Bhak, under BioLicense

  • The two axes of genomics: the diversity of peoples and the genome of each individual

    6 billion people

    50,000 bases for PASNP

    Dr. Kim Seong-jin

    Jong Bhak, under BioLicense

  • DNA sequencing and DNA (gene) typing

    • Sequencing reads all of the DNA/RNA of a cell. Sequencer: Hellogenome™
    • Gene typing determines the type of a part of the cell's DNA/RNA. Biochip: Hellogene™

  • History of personal genome sequencing
    • NCBI reference genome, pooled DNAs, Caucasian
    • Craig Venter, Caucasian (publicly available)
    • James Watson, Caucasian (publicly available)
    • Nigerian (anonymous), African (HapMap)
    • YH, Han Chinese, publicly available (BGI)
    • Seong Jin Kim, publicly available (Theragen team)
    • AK1, Korean, publicly available
    • Rosalynn Gill, Caucasian (publicly available)

    The first publicly released female genome: PGP9 (Theragen)

  • Past, present, and future of genome sequencing

    HGP: 13 years, $2.7 billion (3.5 trillion won, 14X), 2004

    Craig Venter: 4 years, $100 million (130 billion won), 2007

    James Watson: 2 years, $2 million (2.6 billion won, 7.5X), 2008

    Dr. Kim Seong-jin: 6 months, $0.17 million (220 million won, 29X), 2009

    2010: 1 month, $20,000 (26 million won, 30X)

    2015(?):
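
    The drop in cost and time listed on this slide can be put into numbers with a short calculation. The sketch below only restates the slide's figures (durations converted to months) and divides the HGP baseline by each later entry; the names `GENOMES` and `fold_reduction` are illustrative and not from the talk.

```python
# Figures copied from the slide: (label, duration in months, cost in USD).
GENOMES = [
    ("HGP", 13 * 12, 2_700_000_000),
    ("Craig Venter", 4 * 12, 100_000_000),
    ("James Watson", 2 * 12, 2_000_000),
    ("Dr. Kim Seong-jin", 6, 170_000),
    ("2010", 1, 20_000),
]


def fold_reduction(baseline, value):
    """Return how many times smaller `value` is than `baseline`."""
    return baseline / value


_, base_months, base_cost = GENOMES[0]
cost_fold = {label: fold_reduction(base_cost, cost) for label, _, cost in GENOMES}
time_fold = {label: fold_reduction(base_months, months) for label, months, _ in GENOMES}

for label, _, _ in GENOMES:
    print(f"{label:18s} cost 1/{cost_fold[label]:,.0f}   time 1/{time_fold[label]:,.0f}")
```

    On these figures, the 2010 entry costs 135,000 times less than the HGP and takes 156 times less time.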

  • The core of the omics revolution?

    Large volumes of cheap, accurate data:
      − Sequencing: sequencing tech/cost

    Ultra-fast automated analysis:
      − Sequence analysis: computing tech/cost

  • Ome and Omics graph (the relationship between ome and omics)

    [Figure: cost versus year. Cost axis labels: $3,000,000,000; $50,000 per person; $0. Year axis: 2003 to 2016. The "Ome and Omics" balance point is marked at ~2010.]

    Jong Bhak, under BioLicense

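  • A real example: the Y axis of genomics

  • The first Korean genome sequencing paper: Dr. Kim Seong-jin's genome

    Data released December 2008: ftp://bioftp.org

  • The second Korean genome: AK1 (Seoul National University College of Medicine; anonymous donor)
    • Sequenced in 2009 by Seoul National University College of Medicine and Macrogen
    • Data released December 2009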

  • The first Korean Genome (SJK)
    • First analyzed by Gacheon medical school LCDI and KOBIC, KRIBB in 2008 (joint effort among LCDI, KOBIC, and the National Center for Standard Reference Data)
    • First annotated and made public on 4 Dec. 2008 (through web and ftp)
    • SNPs, CNVs, and indels were analysed
    • An automated phenotypic association study was done
    • Non-synonymous analysis
    • Phylogenetic study of mtDNA, Y chromosome, and autosomes showed the Korean relationship to Chinese and Japanese
    • First intra-Asian genome comparison (Chinese and Korean)
    • Analyzed at 7.8, 17.3, 23.5, and 28-fold
    • By January, 23.5-fold had been sequenced and analyzed
    • Openly and freely available from: http://koreagenome.org

  • The Karyogram of the donor DNA

    No obvious chromosomal abnormalities!

  • Classification and number of intra-genic SNPs

    Not represented in dbSNP

  • Comparison of individual SNPs

    • SJK shared 56% with Yoruba (Korean vs. African: 56%)
    • SJK shared 60% with Chinese (Korean vs. Chinese: 60%)
    • SJK shared 50% with Venter and 53% with Watson (Korean vs. Caucasians: 52%)
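
    A sharing percentage of this kind can be computed as a set overlap between two SNP call sets. The sketch below is a minimal illustration and makes two assumptions: it takes the first genome's SNP count as the denominator (the slide does not state which denominator was used), and the call sets are toy data. The name `shared_fraction` is hypothetical.

```python
def shared_fraction(snps_a, snps_b):
    """Fraction of the SNPs in `snps_a` that also occur in `snps_b`."""
    if not snps_a:
        raise ValueError("snps_a is empty")
    return len(snps_a & snps_b) / len(snps_a)


# Hypothetical toy call sets: (chromosome, position, alternate allele).
sjk = {
    ("chr1", 100, "A"),
    ("chr1", 250, "G"),
    ("chr2", 75, "T"),
    ("chr3", 10, "C"),
    ("chr3", 42, "G"),
}
other = {
    ("chr1", 100, "A"),
    ("chr2", 75, "T"),
    ("chr3", 10, "C"),
    ("chr4", 5, "A"),
}

print(f"{shared_fraction(sjk, other):.0%} of the toy SJK SNPs are shared")
```

    Keying each SNP on chromosome, position, and allele means that two genomes differing from the reference at the same site, but with different alleles, are not counted as sharing a SNP.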

  • Korean Genome Variation Browser

    [Figure: genome browser view of the "NOC2L" gene with tracks for SJK's SNPs, Watson's SNPs, HapMap, YH's SNPs, and Venter's SNPs.]

    http://koreagenome.org/cgi-bin/gbrowse/kgenome/

  • SJK’s genetic lineage

    [Figure panels: autosomal phylogenetic tree; chromosome Y haplogroup lineage; mtDNA ethno-geographic lineage. Label: SJK.]

  • Size distribution and classification of short indels found in SJK

    Using MAQ, we identified 342,965 short indels. We found that only 247 (0.1%) were validated, 113,287 (33.0%) were non-validated, and 229,431 (66.9%) were not found in dbSNP.
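
    The three classes on this slide should add up to the total, and the percentages should follow from the counts. The check below uses only the slide's numbers; the variable names are illustrative.

```python
TOTAL_INDELS = 342_965
COUNTS = {
    "validated": 247,
    "non-validated": 113_287,
    "not in dbSNP": 229_431,
}

# The three classes partition the full set of short indels.
assert sum(COUNTS.values()) == TOTAL_INDELS

percentages = {name: 100 * n / TOTAL_INDELS for name, n in COUNTS.items()}
for name, pct in percentages.items():
    print(f"{name:14s} {pct:5.1f}%")
```

    Rounded to one decimal place, the counts give the slide's 0.1%, 33.0%, and 66.9%.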

  • Indels in SJK genic regions

    Index     Indel number   Homozygous   Heterozygous   Gene number
    5'UTR               27            9             18            26
    CDS                 49           16             33            40
    3'UTR              319          114            205           247
    Intron         127,516       45,430         82,086        12,421
    Total          127,911       45,569         82,342        12,734

  • Comparison of individual Indels

    Comparison of the SJK indels (< 4bp) overlapped with those of YH, HuRef (Venter), Watson, and NA18507 (Yoruba) genomes

    Source                      SJK genome               Indel loci (a)   Indel size (b)   Indel type (c)   Indel type/all   Indel type/indel loci
    YH genome (135,199)         All (289,257)                    22,605           22,522           22,495             7.8%                   99.5%
                                Homozygous (112,843)             12,940           12,915           12,902            11.4%                   99.7%
                                Heterozygous (176,414)            9,665            9,607            9,593             5.5%                   99.3%
    HuRef genome (577,661)      All                              34,142           33,254           29,422            10.2%                   86.2%
                                Homozygous                       17,325           16,956           15,656            13.9%                   90.4%
                                Heterozygous                     16,817           16,298           13,766             7.8%                   81.9%
    Watson genome (118,887)     All                               6,533            5,749            5,738             2.0%                   87.8%
                                Homozygous                        3,363            3,090            3,090             2.7%                   91.9%
                                Heterozygous                      3,170            2,659            2,648             1.5%                   83.5%
    NA18507 genome (438,566)    All                             152,847          146,266          143,023            49.4%                   93.6%
                                Homozygous                       76,314           73,231           72,287            64.1%                   94.7%
                                Heterozygous                     76,533           73,035           70,736            40.1%                   92.4%

    This discrepancy seems to result from the method used rather than from ethnic similarity between SJK and NA18507 (i.e., paired-end sequencing was used for both SJK and NA18507).

    This may partially explain why HuRef and Watson, which are Caucasian like the NCBI reference, have lower levels (86.2% and 87.8%) of common indels against SJK.
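
    The two percentage columns in the comparison above can be reproduced from the counts in the "All" rows. The sketch assumes that "indel type/all" is column (c) divided by the 289,257 SJK indels and that "indel type/indel loci" is column (c) divided by column (a); that reading matches the printed percentages but is an inference, and the function name `ratios` is illustrative.

```python
SJK_ALL = 289_257  # SJK indels in the "All" rows

# "All" rows: matches counted in columns (a), (b), and (c).
ROWS = {
    "YH": (22_605, 22_522, 22_495),
    "HuRef": (34_142, 33_254, 29_422),
    "Watson": (6_533, 5_749, 5_738),
    "NA18507": (152_847, 146_266, 143_023),
}


def ratios(loci, size, indel_type, total=SJK_ALL):
    """Return (indel type / all, indel type / indel loci) as percentages."""
    return (
        round(100 * indel_type / total, 1),
        round(100 * indel_type / loci, 1),
    )


for genome, row in ROWS.items():
    of_all, of_loci = ratios(*row)
    print(f"{genome:8s} {of_all:5.1f}% of SJK indels, {of_loci:5.1f}% of shared loci")
```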

  • Homo- and heterozygous deletions in SJK genome

    (A) Homozygous 2.3 kb genomic deletion and (B) heterozygous 5 kb genomic deletion.

  • Detection and identification of structural variants

    • We found structural variants by using paired-end reads.
      1. 2,920 deletions (100 bp ~ 100 kb)
      2. 415 inversions (100 bp ~ 100 kb)
      3. 963 insertions (175 bp ~ 250 bp)
    • We found deletion SVs in 21 coding genes. All were heterozygous deletions.
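
    The slide does not say how the paired-end reads were turned into variant calls. A common general idea is to flag read pairs whose mapped distance or orientation disagrees with the sequencing library. The sketch below shows that idea only, not the pipeline used for SJK; `ReadPair`, `classify_pair`, the expected span of 500 bp, and the standard deviation of 30 bp are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom: str
    left_start: int    # leftmost mapped coordinate of the pair
    right_end: int     # rightmost mapped coordinate of the pair
    same_strand: bool  # True when both mates map to the same strand


def classify_pair(pair, expected_span, sd, z=3.0):
    """Label one read pair by comparing its mapped span with the library's expected span."""
    if pair.same_strand:
        # Mates of a standard paired-end library map to opposite strands.
        return "inversion"
    span = pair.right_end - pair.left_start
    if span > expected_span + z * sd:
        # Mates map too far apart: the sample lacks sequence present in the reference.
        return "deletion"
    if span < expected_span - z * sd:
        # Mates map too close together: the sample carries extra sequence.
        return "insertion"
    return "concordant"


pairs = [
    ReadPair("chr1", 1_000, 1_510, False),
    ReadPair("chr1", 5_000, 7_800, False),
    ReadPair("chr1", 9_000, 9_300, False),
    ReadPair("chr1", 12_000, 12_490, True),
]
labels = [classify_pair(p, expected_span=500, sd=30) for p in pairs]
print(labels)
```

    A real caller would require several discordant pairs clustered at the same location before reporting a variant; a single pair is only a candidate.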

  • Repeat composition in SJK deletion variants

    Long Interspersed Nuclear Elements (LINE)

    Short Interspersed Nuclear Elements (SINE)

  • A real example: the X axis of genomics

  • 6 billion people

    The population diversity in Pan Asia

    PASNP consortium

    http://pasnpi.org

  • PASNP project: an international consortium for the study of Pan-Asian SNPs.

    • Research with Asian populations:
      1. Population diversity mapping (phylogenetic study)
      2. Functional annotation study
      3. CNV study
    • Basic services:
      1. SNP data repository and basic information services via the web for Pan-Asians

  • Sampling from 11 countries across Pan Asia

    1. Sample number: 1,833
    2. Ethnic groups: 76
    3. Countries: 11
    4. SNP marker number: 58,960 (Affymetrix 56K Xba SNP genotyping chip)

  • Genotyped 76 ethnic groups over 11 countries

  • How many recognizable human groups are there in the world?

    Judging just from the figure on the right, there are simply six recognizable population groups.

    When we consider human migration, isolation, admixture, and more ethnic groups, this is not a simple question.

    Figure: Maximum likelihood tree of 29 populations. The tree is based on data from 19,934 SNPs. Bootstrap values are based on 100 replicates.

  • Admixture, migration, and isolation events make population grouping more complicated

    In this study, we found:

    1. Genetic ancestry is strongly correlated with linguistic affiliations as well as geography.
    2. Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations.

  • PCA results

    Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography.

    Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations.

  • Phylogenetic and population structure analysis results

    Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography.

    Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations.

    Phylogenetic tree: Da distance based NJ tree

    Population stratification: STRUCTURE

  • ASD based NJ individual tree

    Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography.

    Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations.
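
    An ASD (allele sharing distance) matrix is the input to a neighbor-joining tree of individuals. The sketch below uses one common definition, assumed here because the slide does not give one: genotypes are coded as 0, 1, or 2 copies of an allele, and the per-SNP distance is the absolute difference, averaged over SNPs typed in both individuals. Some tools divide by 2 to scale the distance to [0, 1]. The individuals and genotypes are toy data.

```python
def allele_sharing_distance(geno_a, geno_b):
    """Mean per-SNP distance between two individuals; missing calls (None) are skipped."""
    diffs = [
        abs(a - b)
        for a, b in zip(geno_a, geno_b)
        if a is not None and b is not None
    ]
    if not diffs:
        raise ValueError("no SNPs genotyped in both individuals")
    return sum(diffs) / len(diffs)


def distance_matrix(genotypes):
    """Pairwise ASD for every pair of individuals, keyed by (name, name)."""
    names = sorted(genotypes)
    return {
        (x, y): allele_sharing_distance(genotypes[x], genotypes[y])
        for x in names
        for y in names
    }


# Hypothetical genotypes at five SNPs (0, 1, or 2 copies of one allele; None = missing).
genotypes = {
    "ind1": [0, 1, 2, 0, None],
    "ind2": [0, 1, 2, 1, 2],
    "ind3": [2, 1, 0, 2, 0],
}
asd = distance_matrix(genotypes)
print(asd[("ind1", "ind2")], asd[("ind1", "ind3")])
```

    In this toy set, ind1 and ind2 are much closer to each other than either is to ind3, so a tree built from the matrix would join them first.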

  • Eight population outliers whose linguistic and genetic affinities are inconsistent

    • AX-ME/Melanesian, MY-JH/Jehai (Negrito), MY-KS/Kensiu (Negrito), TH-MO/Mon, TH-KA/Karen, CN-JN/Jinuo, IN-TB/Spiti, and CN-UG/Uyghur
    • These linguistic outliers tend to cluster with their geographic neighbors or to occupy an intermediate position between their geographic neighbors and the more distant members of their linguistic group.
    • These patterns are consistent with substantial recent admixture among the populations, a history of language replacement, or uncertainties in the linguistic classifications themselves.

  • Considerable gene flow among Asian populations was observed

    • Considerable gene flow was observed amongst sub-populations in clusters, including those groups believed to practice endogamy based on linguistic, cultural, and ethnic information.
    • STRUCTURE reveals that the six Han Chinese population samples show varying degrees of admixture between a northern 'Altaic' cluster and a 'Sino-Tibetan/Tai-Kadai' cluster, which is most frequent in the ethnic groups sampled from southern China and northern Thailand.

  • Peopling of Asia: one-wave vs. two-wave hypothesis

    Two-wave hypothesis

    Cavalli-Sforza et al. (Nat Genet, 2003) suggested a hypothesis for the peopling of Asia in which anatomically modern humans (also called Homo sapiens sapiens) spread into Asia through two routes.

    The first was a southern route, perhaps along the coast to South and Southeast Asia, from where it bifurcated north and south. In the south, these modern humans reached Oceania between 60 and 40 kya, whereas the northern expansion later reached China, Japan, and eventually America.

    The second was a central route through the Middle East, Arabia, or Persia to central Asia, from where migration occurred in all directions, reaching Europe and east and northeast Asia about 40 kya, after which the first and principal migration to America suggested by Greenberg occurred not later than 15 kya.

    One-wave hypothesis

    All modern East Asian and Southeast Asian populations were derived from a single initial entry of modern humans into the subcontinent.

    Which model better explains our observation?

    Our current observation is shown in the figures on the right.

  • Forward time simulation

    Peopling of Asia: one-wave versus two-wave hypothesis

    Based on the hypothesis of Cavalli-Sforza et al. and our observations on Asian Negritos, we constructed three testable models.

    Our current observation resembles Model 3. If Model 1 or 2 can change into our current observation under some known conditions, the "two-wave" models will be acceptable.

    * Conditions
    - The allele frequency spectrum of the MRCA was taken from YRI. 10,000 SNPs were simulated. One generation was calculated as 20 years. The gene flow proportion was set to different levels (M = 0.005 ~ 0.95).

    Figure: Hypothetical models of the peopling of Asia. Model 1 and Model 2 represent the "two waves" hypothesis (Cavalli's hypothesis), and Model 3 represents the "one wave" hypothesis.

    - AF: African; NG: Negrito; AS: Asian; EU: European.
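
    A forward-time simulation advances allele frequencies one generation at a time. The sketch below is a simplified illustration, not the model used in the study: it follows a single SNP in two populations with one-way gene flow and Wright-Fisher drift. The 20-year generation and the range of M come from the slide; the population size, time span, starting frequencies, and function names are hypothetical.

```python
import random


def migrate(p_recipient, p_donor, m):
    """Allele frequency in the recipient after a fraction m of its gene pool comes from the donor."""
    return (1 - m) * p_recipient + m * p_donor


def drift(p, n_chromosomes, rng):
    """One Wright-Fisher generation: resample n_chromosomes alleles at frequency p."""
    count = sum(1 for _ in range(n_chromosomes) if rng.random() < p)
    return count / n_chromosomes


def simulate(p_recipient, p_donor, m, years,
             n_chromosomes=200, years_per_generation=20, seed=0):
    """Run gene flow followed by drift for years // years_per_generation generations."""
    rng = random.Random(seed)
    for _ in range(years // years_per_generation):
        p_recipient = drift(migrate(p_recipient, p_donor, m), n_chromosomes, rng)
        p_donor = drift(p_donor, n_chromosomes, rng)
    return p_recipient, p_donor


# Hypothetical run: 2,000 years (100 generations) at three gene flow levels.
for m in (0.005, 0.5, 0.95):
    recipient, donor = simulate(0.1, 0.9, m, years=2_000)
    print(f"M={m}: recipient={recipient:.2f}, donor={donor:.2f}")
```

    Repeating this over many SNPs and comparing the simulated populations with the observed data is the kind of test the slide describes for Models 1 to 3.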

  • Results and conclusion: forward time simulation

    Peopling of Asia: one-wave versus two-wave hypothesis

    Our simulation results indicate that Model 1 is not compatible with the empirical data.

    Model 2 is only compatible if gene flow from other Asian populations to the Negritos has been fairly extreme, with more than 50% of Negrito chromosomes coming from other Asian populations, without dramatically affecting the Negrito phenotype.

    No extreme gene flow!

    Thus Models 1 and 2 are not pertinent to the explanation of the current observation.

    Negrito: the Semang people of the Malay Peninsula; Thailand people

  • Finding 3: Haplotype diversity was strongly correlated with latitude, with diversity decreasing from South to North

    Figure: Haplotype diversity versus latitude

    • Haplotype diversity was strongly correlated with latitude (R2 = 0.91, P < 0.0001), with diversity decreasing from South to North.
    • This is consistent with a loss of diversity as populations moved to higher latitudes.

    ① Indonesian; ② Malay; ③ Philippine; ④ Thai; ⑤ South Chinese minorities; ⑥ Southern Han Chinese; ⑦ Japanese & Korean; ⑧ Northern Han Chinese; ⑨ Northern Chinese minorities; ⑩ Yakut.
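
    The R2 quoted on this slide is the squared correlation between latitude and haplotype diversity. The sketch below shows how such a value and the sign of the trend can be computed with an ordinary least-squares fit. The data points are a made-up, perfectly linear example and are not the study's measurements; `linear_fit` is an illustrative name.

```python
def linear_fit(xs, ys):
    """Return (slope, r_squared) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Hypothetical example: diversity falls steadily as latitude rises.
latitudes = [0, 10, 20, 30]
diversity = [4, 3, 2, 1]

slope, r2 = linear_fit(latitudes, diversity)
print(f"slope = {slope:.2f} per degree, R2 = {r2:.2f}")
```

    A negative slope corresponds to diversity decreasing from South to North, and R2 close to 1 means that latitude accounts for most of the variation.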

  • Finding 4: Southeast Asia holds most of the Asian gene pool

    Group private haplotype sharing analysis (frequency was not considered; type only)

    • 90% of haplotypes in East Asian (EA) populations were found in Southeast Asian (SEA) and Central-South Asian (CSA) populations.
    • Of these, about 50% were found in SEA and EA only, and 5% were found in CSA only.
    • These observations suggest that the geographic source(s) contributing to EA populations were mainly SEA populations.

    * Proportion of haplotypes in population A that can also be found in population B (HSa)
    • HSa: CSA private
    • HSb: EA private
    • HSc: shared by all groups
    • HSd: SE private

    * YKT: Yakut; N-CM: Northern Chinese minorities; N-HAN: Northern Han Chinese; JP-KR: Japanese and Korean; S-HAN: Southern Han Chinese; S-CM: Southern Chinese minorities; EA: East Asian
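
    Because the analysis considers haplotype types only and ignores frequencies, it reduces to set operations. The sketch below splits the haplotypes seen in EA by where else they occur. The haplotype labels are toy data, and the category names are descriptive; they are not claimed to be the slide's HSa to HSd.

```python
def partition_haplotypes(ea, sea, csa):
    """Split EA haplotype types by the other groups in which they are also seen."""
    return {
        "shared with SEA only": (ea & sea) - csa,
        "shared with CSA only": (ea & csa) - sea,
        "shared with both": ea & sea & csa,
        "EA private": ea - sea - csa,
    }


# Hypothetical haplotype types observed in each group.
ea = {f"h{i}" for i in range(1, 11)}              # h1 .. h10
sea = {f"h{i}" for i in range(1, 8)} | {"x1"}     # h1 .. h7 plus one SEA-only type
csa = {"h6", "h7", "h8"}

parts = partition_haplotypes(ea, sea, csa)
fractions = {name: len(types) / len(ea) for name, types in parts.items()}
for name, frac in fractions.items():
    print(f"{name:22s} {frac:.0%}")
```

    The four categories are disjoint and together cover every EA haplotype, so their fractions sum to 1.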

  • Koreans and their neighbors

    - Northeastern Asians, including Japanese, Northern Chinese, and Koreans, have high autosomal genetic similarity compared to others.

    Tree labels: American Indians; Northern Asians; Northeastern Asians; Chinese and SEAs

    Figure S28: Maximum likelihood tree of 126 population samples (PASNP + HGDP).

    * Bootstrap values based on 100 replicates are shown. Language families are indicated with colors as shown in the legend. AX: Affymetrix; CN: China; ID: Indonesia; IN: India; JP: Japan; KR: Korea; MY: Malaysia; PI: the Philippines; SG: Singapore; TH: Thailand; TW: Taiwan.

  • CPU used for PASNP:

    600 CPUs for about one year running the "Structure" program.

    Only 56K SNP chip data (2,000 samples)
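
    For scale, the compute quoted on this slide can be converted into CPU-hours. The calculation assumes all 600 CPUs ran continuously for a 365-day year, which the slide's "about one year" only approximates.

```python
CPUS = 600
DAYS = 365          # assumption: "about one year" taken as 365 days of continuous use
HOURS_PER_DAY = 24

cpu_hours = CPUS * DAYS * HOURS_PER_DAY
print(f"{cpu_hours:,} CPU-hours")
```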

  • Lesson and Challenge


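  • 1. Experimental design: ever more important
    • The core of successful life science in the omics era
    • An experimental design that yields proper data that can be analyzed
      − Experimental design (genome, bioinformatics): 8 months
      − Experiment: 2 months
      − Analysis: 2 months
    • In most cases, when the experimental design is missing or wrong, a good paper is impossible and money is wasted.
    • Experimental design should be done together with a bioinformatician from the start (otherwise the researcher has to invest that much effort and time planning).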

  • 2. Bioinformatics challenge: still massively mapping

    How to map/compute a 6 billion × 6 billion matrix?

    [Figure: full-genome sequencing is the depth axis (6 billion bases); genotyping is the width axis (6 billion persons).]
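
    A rough size estimate shows why this matrix is a challenge. The calculation below assumes the most compact plausible encoding, 2 bits per base with no quality scores or metadata; this is an illustrative lower bound and not a figure from the talk.

```python
PERSONS = 6_000_000_000
BASES = 6_000_000_000
BITS_PER_BASE = 2  # assumption: A, C, G, T packed into 2 bits each

cells = PERSONS * BASES
total_bytes = cells * BITS_PER_BASE // 8
exabytes = total_bytes / 10**18

print(f"{cells:.1e} cells, about {exabytes:.0f} exabytes at {BITS_PER_BASE} bits per base")
```

    Under this assumption the full matrix has 3.6 × 10^19 cells and needs about 9 exabytes before any analysis begins.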

  • 3. The era of research on order: do not try to do everything alone

    Apple iPhone

  • Problems of omics

  • Adding one more dimension?

    How to map/compute RNA expression in relation to bio-function?

    [Figure: 6 billion persons by 6 billion bases.]

  • Transcriptome sequencing

  • Adding even more dimensions?

    How to map/compute the phenome?

    [Figure: 6 billion persons by 6 billion bases.]

  • Cohort studies (large-scale population studies)

  • How to map/compute the epigenome?

    [Figure: 6 billion persons by 6 billion bases.]

  • How to map/compute the microbiome?

    [Figure: 6 billion persons by 6 billion bases.]

  • DIY Biology
    • DIY genomics
    • DIY bioinformatics
    • Consumer genomics
    • Personal genomics
    • Personal bioinformatics

  • Conclusion

    Researchers should focus on the problems that really matter.

    Genomics and bioinformatics have moved past the era when only experts did them; we are now in an era when ordinary people can buy them too.