-
Genome Sequence Analysis and Methods
IT Revolution & its role in personalized medicine
Jong Bhak
Genome Research Foundation & Theragen BiO Institute
TheragenEtex
-
BioAcknowledgement
Scientists who shaped the world as we know it by being factual, rational, honest, and free-thinking.
Taxpayers who support researchers.
My former and present colleagues at MRC, Harvard Med., EBI, KAIST, KOBIC, and TBI, from whom I learned so much.
TheragenEtex for generous financial support.
Dr. Kim SangTae and Dr. Park, Jongsun.
-
BioDisclaimer
The content of these slides is produced by Jong after shamelessly stealing other people’s copyrighted ideas and knowledge.
Everyone is welcome to take the slides in part or in whole without any permission whatsoever.
Everything here is under (♡) BioLicense
− It means it is free for all.
-
Brief history of Computing
• Charles Babbage: difference engine, 1822
• Alan Turing: formalization of the concept of the algorithm and computation with the Turing machine; the modern computer
• Claude Shannon: Information Theory; An Algebra for Theoretical Genetics, 1940
• IBM, 1980s: Personal Computer
• Tim Berners-Lee, 1990: WWW. Personal Net
• Google: Interaction Network
• FaceBook: Interaction Network
• GRID computing: Networking computing
• Cloud computing: Personalization of Super computer
-
Brief History of Bioinformatics
Darwin: A theoretical biologist
Mendel: A theoretical prediction and validation
Perutz and Kendrew: structural biologists
Crick and Watson: DNA modellers
Crick, Brenner: Codon
Sanger: DNA sequencing: Genomics
Sanger: Protein sequencing
Sanger: Proteomics
Lesk: Visualization of proteins
Dayhoff: Atlas of proteins (well known DB)
Needleman & Wunsch: Computer algorithms
Roger Staden: DNA analysis tools
Southern: Hybridization, Functional Genomics
H. influenzae, Cyanobacterial genomes: 1995, 1996
Human Genome Project: 2000
Personal Genomics: 2006
-
IT and Bio Revolution?
What is a revolution?
-
What is IT in Bio?
Briefly, it is called BIT or Bioinformatics
Who is a bioinformatist?
-
Bioinformatics is about Mapping
X is IEP; Y is Size
An old map is not accurate. However, it helps people to explore.
Data, Information, Knowledge
Gangnido 1402
-
What is Omics?
A term that arose as biology became process-driven and industrialized.
Bio-'engineering' and bio-'industry' have emerged since ~2008
− http://omics.org
-
[Figure: BiOMatrix. Omes: Genome, Transcript, Proteome, Interactome, Functome, Textome, Ome. Info types: Sequence, Structure, Expression, Pathway, Regulation, Network. Components: Resource (file), Analysis Pipeline, Materials (materials bank), BioEngine DB, BioDiversity, 2ndary Info, 2ndary DB.]
-
Personal Genomics?
Core Tech: popularized personal genomics
What is it?
Doctors diagnose, prescribe, and advise according to each individual's genotype.
The general public can use genome sequencers and analyzers in one way or another.
http://PersonalGenomics.org
-
What is Genomics?
http://genomics.org
-
Genomic T: the genome "T"
[Figure: a "T" shape. Horizontal axis: 6 billion persons. Vertical axis: 6 billion bases.]
Jong Bhak, under BioLicense
-
The two axes of genomics: the diversity of ethnic groups and the genome of each individual
6 billion people
50,000 bases for PASNP
Dr. Kim Seong-jin
-
DNA sequencing and DNA (gene) typing
Sequencing reads all of the DNA/RNA in a cell. Sequencer: Hellogenome™
Genotyping determines the type of a portion of the DNA/RNA in a cell. Biochip: Hellogene™
-
History of personal genome sequencing
NCBI Reference genome, pooled DNAs, Caucasian
Craig Venter, Caucasian (publicly available)
James Watson, Caucasian (publicly available)
Nigerian (anonymous), African (HapMap)
YH, Han Chinese, publicly available (BGI)
Seong Jin Kim, publicly available (Theragen team)
AK1, Korean, publicly available
Rosalynn Gill, Caucasian (publicly available)
The first publicly released female genome: PGP9 (Theragen)
-
Past, present, and future of genome sequencing
HGP: 13 years, $2.7 billion (3.5 trillion won, 14X), 2004
Craig Venter: 4 years, $100 million (130 billion won), 2007
James Watson: 2 years, $2 million (2.6 billion won, 7.5X), 2008
Dr. Kim Seong-jin: 6 months, $0.17 million (220 million won, 29X), 2009
2010: 1 month, $20,000 (26 million won, 30X)
2015(?):
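To make the pace of the cost drop concrete, here is a minimal sketch that computes the fold reduction between consecutive milestones. The dollar figures are copied from the slide; the function name and data layout are illustrative assumptions only.

```python
# Milestones copied from the slide: (label, year, cost in US dollars).
MILESTONES = [
    ("HGP", 2004, 2_700_000_000),
    ("Craig Venter", 2007, 100_000_000),
    ("James Watson", 2008, 2_000_000),
    ("Dr. Kim Seong-jin", 2009, 170_000),
    ("2010", 2010, 20_000),
]


def fold_reductions(milestones):
    """Return (earlier, later, fold) for each consecutive pair of milestones."""
    result = []
    for (name_a, _, cost_a), (name_b, _, cost_b) in zip(milestones, milestones[1:]):
        result.append((name_a, name_b, cost_a / cost_b))
    return result


for earlier, later, fold in fold_reductions(MILESTONES):
    print(f"{earlier} -> {later}: about {fold:.1f} times cheaper")

# Overall drop from the first milestone to the last one.
overall = MILESTONES[0][2] / MILESTONES[-1][2]
print(f"Overall: about {overall:,.0f} times cheaper")
```

The overall ratio (first milestone to last) is what makes "personal" genomics plausible: the same kind of data becomes five orders of magnitude cheaper within the period listed.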
-
The core of the Omics revolution?
Large-volume, cheap, and accurate data:
− Sequencing: sequencing tech/cost
Ultra-fast, automated analysis:
− Sequence analysis: computing tech/cost
-
Ome and Omics graph (the relationship between Ome and Omics)
[Figure: Cost versus Year, 2003 to 2016. Cost axis labels: $3,000,000,000; $50,000 per person; $0. Ome and Omics balance point: ~2010.]
-
Real example: the Y axis of genomics
-
The first Korean genome sequencing paper: Dr. Kim Seong-jin's genome
Data release: December 2008: ftp://bioftp.org
-
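The second Korean genome: AK1 (Seoul National University College of Medicine: anonymous)
Sequenced in 2009 by Seoul National University College of Medicine and Macrogen
Data release: December 2009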
-
The first Korean Genome (SJK)
First analyzed by Gacheon medical school LCDI and KOBIC, KRIBB in 2008 (joint effort among LCDI, KOBIC, and the National Center for Standard Reference Data)
First annotated and made public on 4th Dec. 2008 (through web and ftp)
SNPs, CNVs, and indels were analysed
Automated phenotypic association study was done
Non-syn. analysis
Phylogenetic study of mtDNA, Y chr., and autosomes showed Korean relationship to Chinese and Japanese.
First intra-Asian genome comparison (Chinese and Korean)
Analyzed at 7.8, 17.3, 23.5, and 28-fold coverage
By Jan., 23.5-fold sequenced and analyzed
Openly and freely available from: http://koreagenome.org
-
The Karyogram of the donor DNA
No obvious chromosomal abnormalities!
-
Classification and number of intra-genic SNPs
Not represented in dbSNP
-
Comparison of individual SNPs
SJK shared 56% with Yoruba (Korean vs African: 56%)
SJK shared 60% with Chinese (Korean vs Chinese: 60%)
SJK shared 50% with Venter and 53% with Watson (Korean vs Caucasians: 52%)
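A rough sketch of how a "shared %" figure like these could be computed from two SNP call sets. The slide does not state the exact definition used, so treating a SNP as a (chromosome, position, allele) tuple and dividing by the first genome's SNP count are assumptions, and the toy data below are made up.

```python
def percent_shared(snps_a, snps_b):
    """Percentage of the SNPs in snps_a that also occur in snps_b."""
    if not snps_a:
        raise ValueError("snps_a must contain at least one SNP")
    return 100.0 * len(snps_a & snps_b) / len(snps_a)


# Hypothetical toy call sets: (chromosome, position, alternate allele).
genome_a = {
    ("chr1", 100, "A"),
    ("chr1", 250, "T"),
    ("chr2", 40, "G"),
    ("chr3", 7, "C"),
    ("chr5", 900, "T"),
}
genome_b = {
    ("chr1", 100, "A"),
    ("chr2", 40, "G"),
    ("chr3", 7, "C"),
    ("chr9", 12, "A"),
}

print(percent_shared(genome_a, genome_b))
```

Note that the measure is not symmetric: swapping the two genomes changes the denominator, so the direction of the comparison should always be stated.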
-
Korean Genome Variation Browser
[Figure: genome browser view of the “NOC2L” gene with tracks for SJK’s SNPs, Watson’s SNPs, HapMap, YH’s SNPs, and Venter’s SNPs.]
http://koreagenome.org/cgi-bin/gbrowse/kgenome/
-
SJK’s genetic lineage
[Figure panels: Autosomal phylogenetic tree; Chromosome Y haplogroup lineage; mtDNA ethno-geographic lineage. Label: SJK.]
-
Size distribution and classification of short indels found in SJK
Using MAQ, we identified 342,965 short indels. We found that only 247 (0.1%) were validated, 113,287 (33.0%) were non-validated, and 229,431 (66.9%) were not found in dbSNP.
-
Indels in SJK genic regions
Region    Indel number   Homozygous   Heterozygous   Gene number
5'UTR               27            9             18            26
CDS                 49           16             33            40
3'UTR              319          114            205           247
Intron         127,516       45,430         82,086        12,421
Total          127,911       45,569         82,342        12,734
-
Comparison of individual Indels
Comparison of the SJK indels (< 4bp) overlapped with those of YH, HuRef (Venter), Watson, and NA18507 (Yoruba) genomes
Source                     SJK genome               Indel loci (a)   Indel size (b)   Indel type (c)   Indel type/all   Indel type/indel loci
YH genome (135,199)        All (289,257)                    22,605           22,522           22,495             7.8%                   99.5%
                           Homozygous (112,843)             12,940           12,915           12,902            11.4%                   99.7%
                           Heterozygous (176,414)            9,665            9,607            9,593             5.5%                   99.3%
HuRef genome (577,661)     All                              34,142           33,254           29,422            10.2%                   86.2%
                           Homozygous                       17,325           16,956           15,656            13.9%                   90.4%
                           Heterozygous                     16,817           16,298           13,766             7.8%                   81.9%
Watson genome (118,887)    All                               6,533            5,749            5,738             2.0%                   87.8%
                           Homozygous                        3,363            3,090            3,090             2.7%                   91.9%
                           Heterozygous                      3,170            2,659            2,648             1.5%                   83.5%
NA18507 genome (438,566)   All                             152,847          146,266          143,023            49.4%                   93.6%
                           Homozygous                       76,314           73,231           72,287            64.1%                   94.7%
                           Heterozygous                     76,533           73,035           70,736            40.1%                   92.4%
This discrepancy seems to result from the method used rather than from ethnic similarity between SJK and NA18507 (i.e., because paired-end sequencing was used for both SJK and NA18507).
This may partially explain why HuRef and Watson, which are Caucasian like the NCBI reference, have lower levels (86.2% and 87.8%) of common indels against SJK.
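The two percentage columns in the table can be re-derived from the counts. A small sketch, assuming (from the arithmetic, not from any stated definition) that "indel type/all" divides column c by the SJK indel count and "indel type/indel loci" divides column c by column a. Only the "All" rows are used; the variable names are illustrative.

```python
SJK_ALL_INDELS = 289_257  # "All" count for SJK, from the table

# "All" rows from the table: genome -> (indel loci a, indel type c).
ALL_ROWS = {
    "YH": (22_605, 22_495),
    "HuRef": (34_142, 29_422),
    "Watson": (6_533, 5_738),
    "NA18507": (152_847, 143_023),
}


def overlap_ratios(loci, same_type, sjk_total=SJK_ALL_INDELS):
    """Return (percent of all SJK indels, percent of shared loci) for one genome."""
    return (100 * same_type / sjk_total, 100 * same_type / loci)


for genome, (loci, same_type) in ALL_ROWS.items():
    of_all, of_loci = overlap_ratios(loci, same_type)
    print(f"{genome}: {of_all:.1f}% of SJK indels, {of_loci:.1f}% of shared loci")
```

Separating the two ratios matters for the argument on the slide: the first reflects how many indels were called in common at all, while the second reflects agreement at loci both genomes report.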
-
Homo- and heterozygous deletions in SJK genome
(A) Homozygous 2.3 kb genomic deletion and (B) Heterozygous 5 kb genomic deletion.
-
Detection and identification of structural variants
• We found structural variants by using paired-end reads.
1. 2,920 deletions (100 bp ~ 100 kb)
2. 415 inversions (100 bp ~ 100 kb)
3. 963 insertions (175 bp ~ 250 bp)
• We found deletion SVs in 21 coding genes. All were heterozygous deletions.
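The general idea behind paired-end SV detection can be sketched as follows: a read pair that maps much farther apart than the library insert size hints at a deletion, one that maps much closer hints at an insertion, and one with unexpected orientation hints at an inversion. This is a generic illustration, not the pipeline used for SJK; the insert size, standard deviation, cutoff, and function name are all hypothetical.

```python
def classify_pair(left_start, right_end, proper_orientation,
                  expected_insert, insert_sd, n_sd=3.0):
    """Label one mapped read pair by comparing its span with the library insert size."""
    if not proper_orientation:
        return "inversion candidate"
    span = right_end - left_start
    if span > expected_insert + n_sd * insert_sd:
        return "deletion candidate"   # mates map too far apart on the reference
    if span < expected_insert - n_sd * insert_sd:
        return "insertion candidate"  # mates map too close together
    return "concordant"


# Hypothetical library: 200 bp inserts with a 15 bp standard deviation.
pairs = [
    (1_000, 1_205, True),
    (5_000, 7_500, True),
    (9_000, 9_100, True),
    (12_000, 12_210, False),
]
labels = [
    classify_pair(left, right, ok, expected_insert=200, insert_sd=15)
    for left, right, ok in pairs
]
print(labels)
```

A real caller would additionally require several independent pairs supporting the same breakpoints before reporting a variant; a single discordant pair is usually a mapping artifact.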
-
Repeat composition in SJK deletion variants
Long Interspersed Nuclear Elements (LINE)
Short Interspersed Nuclear Elements (SINE)
-
Real example: the X axis of genomics
-
6 billion people
The population diversity in Pan Asia
PASNP consortium
http://pasnpi.org
-
PASNP project
is an international consortium for Pan-Asian SNP study.
• Research with Asian populations:
1. Population diversity mapping (phylogenetic study)
2. Functional annotation study
3. CNV study
• Basic services:
1. SNP data repository and basic information services via web for Pan-Asians
-
Sampling from Pan Asia 11 countries
1. Sample number: 1,833
2. Ethnic group: 76
3. Country: 11
4. SNP marker number: 58,960 (Affymetrix 56K Xba SNP genotyping chip)
-
Genotyped 76 ethnic groups over 11 countries
-
How many recognizable human groups are there in the world?
Just in the figure on the right, there are simply six recognizable population groups.
When we consider human migration, isolation, admixture, and more ethnic groups, this is not a simple question.
Maximum likelihood tree of 29 populations. The tree is based on data from 19,934 SNPs. Bootstrap values are based on 100 replicates.
-
Admixture, migration, and isolation events make population grouping more complicated
In this study, we found
1. Genetic ancestry is strongly correlated with linguistic affiliations as well as geography.
2. Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations.
-
PCA results
Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography
Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations
-
Phylogenetic and population structure analysis results
Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography
Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations
Phylogenetic tree: Da distance based NJ tree
Population stratification: STRUCTURE
-
ASD based NJ individual tree
Finding 1: Genetic ancestry is strongly correlated with linguistic affiliations as well as geography
Finding 2: Most populations show relatedness within ethnic/linguistic groups despite prevalent gene flow amongst populations
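For readers unfamiliar with ASD (allele sharing distance), here is a minimal sketch under one common definition: genotypes are coded as 0, 1, or 2 copies of an allele, and the distance is the mean absolute difference per SNP divided by 2. The slides do not give the exact formula used, so this definition, the handling of missing calls, and the toy genotypes are assumptions.

```python
from itertools import combinations


def allele_sharing_distance(g1, g2):
    """ASD between two individuals; genotypes are 0/1/2, None marks a missing call."""
    diffs = [abs(a - b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not diffs:
        raise ValueError("no SNP is called in both individuals")
    return sum(diffs) / (2 * len(diffs))


# Hypothetical genotypes at six SNPs.
individuals = {
    "ind_a": [0, 1, 2, 2, None, 1],
    "ind_b": [0, 2, 2, 0, 1, 1],
    "ind_c": [2, 2, 0, 0, 1, None],
}

for name_1, name_2 in combinations(individuals, 2):
    d = allele_sharing_distance(individuals[name_1], individuals[name_2])
    print(f"{name_1} vs {name_2}: ASD = {d:.3f}")
```

The resulting pairwise distances form the matrix that a neighbor-joining (NJ) program would take as input to build the individual tree; the tree construction itself is not shown here.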
-
Eight population outliers whose linguistic and genetic affinities are inconsistent
• AX-ME/Melanesian, MY-JH/Jehai (Negrito), MY-KS/Kensiu (Negrito), TH-MO/Mon, TH-KA/Karen, CN-JN/Jinuo, IN-TB/Spiti, and CN-UG/Uyghur
• These linguistic outliers tend to cluster with their geographic neighbors or to occupy an intermediate position between their geographic neighbors and the more distant members of their linguistic group
• These patterns are consistent with substantial recent admixture among the populations, a history of language replacement, or uncertainties in the linguistic classifications themselves
-
Considerable gene flow among Asian populations was observed
• Considerable gene flow was observed amongst sub-populations in clusters, including those groups believed to practice endogamy based on linguistic, cultural and ethnic information
• STRUCTURE reveals that the six Han Chinese population samples show varying degrees of admixture between a northern ‘Altaic’ cluster and a ‘Sino-Tibetan/Tai-Kadai’ cluster, which is most frequent in the ethnic groups sampled from southern China and northern Thailand
-
Peopling of Asia: one-wave vs. two-wave hypothesis
Two-wave hypothesis
Cavalli-Sforza et al. (Nat Genet, 2003) suggested a hypothesis of the peopling of Asia in which anatomically modern humans (also called Homo sapiens sapiens) spread into Asia through two routes.
The first was a southern route, perhaps along the coast to South and Southeast Asia, from where it bifurcated north and south. In the south, these modern humans reached Oceania between 60 and 40 kya, whereas the northern expansion later reached China, Japan and eventually America.
The second was a central route through the Middle East, Arabia or Persia to central Asia, from where migration occurred in all directions, reaching Europe and east and northeast Asia about 40 kya, after which the first and principal migration to America suggested by Greenberg occurred not later than 15 kya.
One-wave hypothesis
All modern East Asian and Southeast Asian populations were derived from a single initial entry of modern humans into the subcontinent.
Which model better explains our observation?
Our current observation is shown in the figures on the right.
-
Forward time simulation
Peopling of Asia: one-wave versus two-wave hypothesis
Based on the hypothesis of Cavalli-Sforza et al. and our observations on Asian Negritos, we constructed three testable models.
Our current observation resembles Model 3.
If Model 1 or 2 can change into our current observation under some known conditions, the “two-wave models” will be acceptable.
*Conditions
- The allele frequency spectrum of the MRCA was taken from YRI. 10,000 SNPs were simulated. One generation was calculated as 20 years. The gene flow proportion was set to different levels (M = 0.005~0.95).
Hypothetical models of the peopling of Asia. Model 1 and Model 2 represent the “two waves” hypothesis (Cavalli’s hypothesis), and Model 3 represents the “one wave” hypothesis.
- AF: African; NG: Negrito; AS: Asian; EU: European.
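To illustrate what a forward-time simulation with gene flow does, here is a toy Wright-Fisher sketch for a single SNP in one recipient population. It is not the simulation used in the study. Treating M as a per-generation migrant fraction, holding the source frequency fixed, and the population size, time window, and starting frequencies are all assumptions made for illustration; only the 20-year generation time and the range of M come from the slide.

```python
import random

YEARS_PER_GENERATION = 20  # stated on the slide


def drift_with_gene_flow(p_recipient, p_source, m, n_chromosomes, generations, rng):
    """Forward-time Wright-Fisher sketch for one SNP in a recipient population.

    Each generation a fraction m of the gene pool is replaced by migrants from
    a source population whose allele frequency stays fixed at p_source (a
    simplification), and then n_chromosomes are resampled at random.
    """
    p = p_recipient
    for _ in range(generations):
        p_mixed = (1 - m) * p + m * p_source
        copies = sum(rng.random() < p_mixed for _ in range(n_chromosomes))
        p = copies / n_chromosomes
    return p


rng = random.Random(2009)  # fixed seed so a run can be repeated
generations = 2_000 // YEARS_PER_GENERATION  # hypothetical 2,000-year window
for m in (0.005, 0.05, 0.5, 0.95):
    p_end = drift_with_gene_flow(0.1, 0.9, m, n_chromosomes=200,
                                 generations=generations, rng=rng)
    print(f"M={m}: recipient allele frequency after {generations} generations = {p_end:.2f}")
```

Repeating such runs over many SNPs and many values of M, and comparing the simulated populations with the observed data, is the kind of test that lets a model be judged compatible or incompatible with the observation.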
-
Results and Conclusion: Forward time simulation
Peopling of Asia: one-wave versus two-wave hypothesis
Our simulation results indicate that Model 1 is not compatible with the empirical data.
No extreme gene flow!
Model 2 is only compatible if gene flow from other Asian populations to the Negritos has been fairly extreme, with more than 50% of Negrito chromosomes coming from other Asian populations, without dramatically affecting the Negrito phenotype.
Thus Models 1 and 2 do not explain the current observation.
Negrito: The Semang people of the Malay Peninsula
Thailand people
-
Finding 3: Haplotype diversity was strongly correlated with latitude, with diversity decreasing from South to North
Haplotype diversity versus latitudes
• Haplotype diversity was strongly correlated with latitude (R² = 0.91, P < 0.0001), with diversity decreasing from South to North.
• This is consistent with a loss of diversity as populations moved to higher latitudes.
① Indonesian; ② Malay; ③ Philippine; ④ Thai; ⑤ South Chinese minorities; ⑥ Southern Han Chinese; ⑦ Japanese & Korean; ⑧ Northern Han Chinese; ⑨ Northern Chinese Minorities; ⑩ Yakut.
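As a reminder of what "haplotype diversity" measures, here is a short sketch using Nei's estimator, H = n/(n−1) · (1 − Σ pᵢ²), where pᵢ is the frequency of haplotype i among n sampled chromosomes. The slide does not say which estimator was used, so the choice of Nei's formula is an assumption, and the haplotypes below are made up.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(haplotypes)
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical samples: a more diverse and a less diverse population.
southern_sample = ["ACGT", "ACGA", "TCGA", "ACGT"]
northern_sample = ["ACGT", "ACGT", "ACGT", "ACGA"]

print(haplotype_diversity(southern_sample))
print(haplotype_diversity(northern_sample))
```

Computing this value per population and regressing it on latitude is the kind of analysis summarized by the R² on the slide.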
-
Finding 4: Southeast Asia holds most of the Asian gene pool
Group private haplotype sharing analysis (frequency was not considered (type only))
• 90% of haplotypes in East Asian (EA) populations were found in Southeast Asian (SEA) and Central-South Asian (CSA) populations.
• Of which about 50% were found in SEA and EA only and 5% found in CSA only.
• These observations suggest that the geographic source(s) contributing to EA populations were mainly from SEA populations.
* Proportion of haplotypes in population A that can also be found in population B (HSa)
• HSa: CSA private
• HSb: EA private
• HSc: sharing by all groups
• HSd: SE private
* YKT: Yakut; N-CM: Northern Chinese minorities; N-HAN: Northern Han Chinese; JP-KR: Japanese and Korean; S-HAN: Southern Han Chinese; S-CM: Southern Chinese minorities; EA: East Asian
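A type-only sharing analysis of this kind can be sketched by partitioning the distinct haplotypes seen in one group according to which other groups also carry them. The category names, function name, and toy haplotype labels below are hypothetical and do not reproduce the study's HSa to HSd definitions.

```python
def classify_ea_haplotypes(ea, sea, csa):
    """Partition distinct EA haplotype types by where else they occur.

    Frequencies are ignored: a haplotype counts once however often it is seen.
    Returns the fraction of EA haplotype types in each category.
    """
    ea_types, sea_types, csa_types = set(ea), set(sea), set(csa)
    if not ea_types:
        raise ValueError("no EA haplotypes given")
    counts = {"SEA only": 0, "CSA only": 0, "SEA and CSA": 0, "EA private": 0}
    for hap in ea_types:
        in_sea, in_csa = hap in sea_types, hap in csa_types
        if in_sea and in_csa:
            counts["SEA and CSA"] += 1
        elif in_sea:
            counts["SEA only"] += 1
        elif in_csa:
            counts["CSA only"] += 1
        else:
            counts["EA private"] += 1
    return {category: n / len(ea_types) for category, n in counts.items()}


# Hypothetical haplotype labels.
ea = ["h1", "h1", "h2", "h3", "h4", "h5"]
sea = ["h1", "h2", "h3", "h9"]
csa = ["h3", "h4"]
print(classify_ea_haplotypes(ea, sea, csa))
```

Because only types are counted, a rare haplotype weighs as much as a common one, which is why the slide notes that frequency was not considered.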
-
Koreans and their neighbors
- Northeastern Asians, including Japanese, Northern Chinese, and Koreans, have high autosomal genetic similarity compared to others.
[Tree labels: American Indians; Northern Asians; Northeastern Asians; Chinese and SEAs]
Figure S28. Maximum likelihood tree of 126 population samples (PASNP + HGDP).
* Bootstrap values based on 100 replicates are shown. Language families are indicated with colors as shown in the legend. AX: Affymetrix; CN: China; ID: Indonesia; IN: India; JP: Japan; KR: Korea; MY: Malaysia; PI: the Philippines; SG: Singapore; TH: Thailand; TW: Taiwan.
-
CPU used for PASNP:
600 CPUs for about one year running the “Structure” program.
Only 56K SNP chip data (2000 samples)
-
Lesson and Challenge
-
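1. Experimental design: ever more important
The key to successful life science in the omics era
An experimental design that produces data that can be analyzed properly
− Experimental design (genome, bioinformatics): 8 months
− Experiment: 2 months
− Analysis: 2 months
In most cases, a missing or wrong experimental design means no good paper and wasted money.
Experimental design should be done together with a bioinformatician from the start (otherwise, the researcher must invest that much effort and planning time).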
-
2. Bioinformatics Challenge: still massively mapping
How to map/compute a 6 billion X 6 billion matrix?
[Figure: Width axis (GenoTyping): 6 billion persons. Depth axis (Full Genome Sequencing): 6 billion bases.]
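A back-of-the-envelope sketch of why this matrix is a challenge: even at 2 bits per cell, storing 6 billion persons by 6 billion bases takes exabytes. The 2-bit packing (four states per cell) is an assumption; real data with quality scores and metadata would be far larger. The genotyping comparison uses the 58,960-marker chip mentioned for PASNP.

```python
PERSONS = 6_000_000_000
BASES = 6_000_000_000
CHIP_MARKERS = 58_960   # SNP markers on the genotyping chip used for PASNP
BITS_PER_CELL = 2       # assumption: four possible states packed into 2 bits


def matrix_bytes(persons, sites, bits_per_cell=BITS_PER_CELL):
    """Bytes needed to store a persons x sites matrix at a fixed bit width per cell."""
    return persons * sites * bits_per_cell // 8


full_sequencing = matrix_bytes(PERSONS, BASES)
genotyping_only = matrix_bytes(PERSONS, CHIP_MARKERS)

print(f"Full genomes:    {full_sequencing / 10**18:.1f} exabytes")
print(f"Genotyping chip: {genotyping_only / 10**12:.2f} terabytes")
```

The gap between the two numbers is the gap between the "width" axis (many people, few sites) and the "depth" axis (every base of every person).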
-
3. The era of made-to-order research: don't try to do everything yourself
Apple iPhone
-
Problems of omics
-
Adding one more dimension?
How to map/compute RNA expressions in relation to bio-function?
[Figure axes: 6 billion persons; 6 billion bases]
-
Transcriptome sequencing
-
Adding even more dimensions?
How to map/compute Phenome?
[Figure axes: 6 billion persons; 6 billion bases]
-
Cohort studies (large-scale population studies)
-
How to map/compute epigenome?
[Figure axes: 6 billion persons; 6 billion bases]
-
How to map/compute Microbiome?
[Figure axes: 6 billion persons; 6 billion bases]
-
DIY Biology
DIY genomics
DIY bioinformatics
Consumer genomics
Personal genomics
Personal bioinformatics
-
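Conclusion
Researchers should focus on the problems that really matter:
Genomics and bioinformatics have moved past the era when only experts did them; we are now in an era when the general public can buy and do them too.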