Downstream analysis with SNP markers
Part I: Introduction to computer software for data analysis
Sung-Chur Sim
The Ohio State University, OARDC
SolCAP workshop
Outline
Part I: Introduction to computer software
• MicroSatellite Analyzer (MSA)
• Graphical GenoType (GGT)
• STRUCTURE
What can you do using the software?
Where can you download the software?
How can you format input data?
Part II: The use of STRUCTURE for association mapping
• Detailed steps to generate a Q-matrix using STRUCTURE
MicroSatellite Analyzer: MSA
An independent analysis tool for large data sets (Dieringer and Schlötterer 2003)
• Descriptive statistics per population and locus (e.g. allelic richness, heterozygosity, and Shannon index of diversity)
• FST, FIS, and FIT based on the Weir and Cockerham method
• FST per locus and population pair; P-value for FST determined by permuting genotypes among groups
• Genetic distance, including Nei's standard genetic distance
• Converts your data into the formats of GENEPOP, STRUCTURE, ARLEQUIN, etc.
Version 4.05 available for Windows, Linux, and Mac: (http://i122server.vu-wien.ac.at/MSA/MSA_download.html)
MSA: MicroSatellite Analyzer
Input format
http://i122server.vu-wien.ac.at/MSA/MSA_download.html
Input format
One or two column format
• Specify one (1) or two (2) column format in cell A1
• Enter the name of the population in the first column (no empty cells)
• Specify inbred (h) or outbred (d) for your species in the second column (no empty cells)
• Enter the group number of the population (no empty cells)
• SNP data converted from letter codes to numerical coding
• Missing data can be indicated by -1, nd, dot (.), or an empty cell
• Save your data in the "TAB DELIMITED" format
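The letter-to-number conversion and the tab-delimited export listed above can be scripted rather than done by hand. The following is a minimal Python sketch, not the workshop's procedure. It assumes genotype calls are two-letter strings such as "AG" and uses the mapping A=1, C=2, G=3, T=4, which is an assumption because the slide does not specify one. The names `ALLELE_CODES`, `encode_genotype`, and `to_tab_delimited` are hypothetical, and the exact column layout MSA expects should be checked against its manual.

```python
import csv
import io

# Assumed mapping; the slide only says letters are converted to numbers.
ALLELE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}
MISSING = -1  # one of the missing-data codes listed on the slide


def encode_genotype(call):
    """Turn a two-letter SNP call such as 'AG' into a pair of integers."""
    call = (call or "").strip().upper()
    if len(call) != 2 or any(base not in ALLELE_CODES for base in call):
        return (MISSING, MISSING)  # anything unreadable is treated as missing
    return (ALLELE_CODES[call[0]], ALLELE_CODES[call[1]])


def to_tab_delimited(rows):
    """Serialise rows (lists of values) as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical calls for one line at four loci.
calls = ["AG", "cc", "", "NN"]
encoded = [code for call in calls for code in encode_genotype(call)]
text = to_tab_delimited([["line1"] + encoded])
```

Each locus occupies two columns here, which corresponds to the two-column option; the population, inbred/outbred, and group columns described above would still need to be added in front.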
Identify loci that distinguish populations
[Figure: three histograms of the number of loci (y-axis) against FST (x-axis, 0 to 1), one per pairwise comparison, each annotated with the FST of three loci]
• Processing vs. Fresh market: ovate: 0; fw2.2: 0; sp6: 0.14
• Processing vs. Heirloom: ovate: 0.26; fw2.2: 0; sp6: 0.73
• Fresh market vs. Heirloom: ovate: 0.31; fw2.2: 0; sp6: 0.47
FST = 0: an allele of a gene is fixed or the gene is under balancing selection
FST = 1: a gene under diversifying selection
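To make the two extremes of FST concrete, here is a small Python sketch for a single biallelic locus in two populations. It uses the simple heterozygosity-based definition FST = (HT - HS) / HT with equal population sizes, which is an assumption made for illustration only. It is not the Weir and Cockerham estimator that MSA reports, and the function name `fst_biallelic` is hypothetical.

```python
def fst_biallelic(p1, p2):
    """Heterozygosity-based FST for one biallelic locus in two populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Equal population sizes are assumed; this is a teaching sketch, not the
    Weir and Cockerham estimator.
    """
    # Mean expected heterozygosity within the two populations.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity at the average allele frequency.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # the same allele is fixed in both populations
    return (h_t - h_s) / h_t
```

With this definition, two populations fixed for different alleles (`fst_biallelic(1.0, 0.0)`) give 1, while identical allele frequencies (`fst_biallelic(0.5, 0.5)`) or a shared fixed allele give 0.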
Graphical GenoType: GGT
A tool for representing molecular marker data by graphical representation and color coding of chromosomes
• Useful for evaluation of plant material and selection of a desired genotype
Advanced genetic analyses
• Marker-trait association
• Genetic distance
• Linkage disequilibrium
Version 2.0 available for Windows (http://www.plantbreeding.wur.nl/UK/software_ggt.html)
http://www.plantbreeding.wur.nl/UK/software_ggt.html
Input format
Two data files derived from locus and map data
Locus file
• Contains data on marker alleles using the MapMaker or JoinMap type of coding
• A plain text file
Locus file
Input format
Map file
• Specifies marker positions on a linkage map
• A plain text file
Map file
Input format
Build a GGT file by merging the locus and map files using the 'Build GGT-file' option
The GGT file can also be prepared from an Excel spreadsheet
[Figure: GGT display for the Fresh Market, Heirloom, Processing, and Wild groups. Labels: Identify recombinants; Chromosome 5; Regions of low variation; Chromosome 6]
STRUCTURE
A model-based clustering method (Pritchard et al. 2000)
• Inferring population structure using multi-locus genotype data
• Generating a Q-matrix to correct for population subdivision during marker-trait association analysis in complex populations (e.g. breeding populations)
• Identifying migrants and admixed individuals
Version 2.3.3 available for Windows, Linux, and Mac: (http://pritch.bsd.uchicago.edu/structure.html)
http://pritch.bsd.uchicago.edu/structure.html
Input format
A matrix where the data for individuals are in rows and the loci are in columns
• n consecutive rows hold the data for each individual of an n-ploid species
• Integers should be used for coding genotypes
• Missing data should be indicated by a number that doesn't occur elsewhere in the data (e.g. -1)
• The data file should be a text file (.txt), not an Excel file (.xls), for running STRUCTURE
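The row layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the workshop's own tooling: the function name `structure_rows` is hypothetical, alleles are assumed to be already coded as integers, and placing an individual label in the first column is an assumption, since STRUCTURE's options determine which extra columns are present.

```python
MISSING = -1  # must not occur as a real allele code


def structure_rows(individuals, ploidy=2):
    """Return text with `ploidy` consecutive rows per individual.

    `individuals` maps a label to a list of genotypes, one per locus,
    each genotype being a tuple of `ploidy` integer allele codes.
    """
    lines = []
    for label, genotypes in individuals.items():
        for copy in range(ploidy):
            # Row `copy` holds that allele copy for every locus in order.
            alleles = [str(genotype[copy]) for genotype in genotypes]
            lines.append(" ".join([label] + alleles))
    return "\n".join(lines) + "\n"


# Two hypothetical diploid individuals scored at three loci.
data = {
    "ind1": [(1, 3), (2, 2), (MISSING, MISSING)],
    "ind2": [(3, 3), (2, 4), (1, 1)],
}
text = structure_rows(data)
```

Writing `text` to a file with a `.txt` extension gives a plain text input of the kind the slide asks for.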
Summary
Three computer programs, MSA, GGT, and STRUCTURE, were introduced for SNP data analysis by providing the following information:
• What can the programs do?
• Where can you download them?
• How can you format input data for each program?
Downstream analysis with SNP markers
Part II: The use of STRUCTURE software for association mapping of bacterial spot resistance in tomato
Sung-Chur Sim
The Ohio State University, OARDC
SolCAP workshop
Bacterial spot in tomato
A disease complex caused by species of Xanthomonas bacteria.
Five physiological races: T1-T5
Sources of resistance from close relatives of cultivated tomato (Solanum lycopersicum L.) or S. pimpinellifolium
• Hawaii 7998 (T1)
• Hawaii 7981 (T3)
• PI128216 (T3)
• PI114490 (T1, T2, T3, and T4)
Association analysis models incorporate a correction for population structure
Unified mixed model (Yu et al. 2006)
Y = μ + REPy + Qw + Markerα + Zv + Error
Adding a matrix, Qw, of population structure can correct for pseudo-linkage and can add insight into which crosses, pedigrees, and subpopulations have the highest breeding value
STRUCTURE analysis
Format marker data
The marker data file used in this example is available on the workshop URL: http://pbgworks.org/tomato-workshop (file name: STRUCTURE_InputData.txt)
STRUCTURE analysis
Decide how long to run STRUCTURE (burnin and MCMC)
Burnin length: how long to run the simulation before collecting data to minimize the effect of the starting configuration (Recommendation: 10,000~100,000)
MCMC length: how long to run the simulation after the burnin to get accurate parameter estimates (Recommendation: 500,000~1,000,000)
STRUCTURE analysis
Run simulations 20 times for each of several different Ks
Identify the best K based on the log likelihood values from the 20 simulations for each K using the non-parametric and/or ∆K methods
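Running 20 replicates for each K is easier to manage when the runs are generated programmatically. The Python sketch below only builds the argument lists and does not execute anything. It assumes a command-line build of STRUCTURE that is available as `structure` and accepts the `-K`, `-i`, and `-o` options; these should be checked against the documentation of the installed version. The burnin and run lengths are not set here and are assumed to come from the program's parameter file. The function name `structure_commands` and the `results/` output prefix are hypothetical.

```python
def structure_commands(k_values, runs_per_k=20,
                       input_file="STRUCTURE_InputData.txt"):
    """Build one argument list per replicate run; nothing is executed."""
    commands = []
    for k in k_values:
        for run in range(1, runs_per_k + 1):
            commands.append([
                "structure",                      # assumed executable name
                "-K", str(k),                     # number of populations
                "-i", input_file,                 # marker data file
                "-o", f"results/K{k}_run{run}",   # hypothetical output prefix
            ])
    return commands


# Example: K = 1 to 10 with 20 replicates each gives 200 runs.
all_runs = structure_commands(range(1, 11))
```

Each list could be passed to `subprocess.run` or written to a job script, with a distinct random seed per run if the installed version supports setting one.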
Inference of best K (number of populations)
The log likelihood for each K, Ln P(D) = L(K)
Two approaches to determine the best K
1. Use of L(K): When K is approaching a true value, L(K) plateaus (or continues increasing slightly) and has high variance between runs (Rosenberg et al. 2001, Evanno et al. 2005).
Nonparametric test (Wilcoxon test)
2. Use of an ad hoc quantity (∆K): Calculated based on the second-order rate of change of the likelihood (Evanno et al. 2005). The ∆K shows a clear peak at the true value of K.
∆K = m(|L''(K)|)/s[L(K)], where m is the mean and s the standard deviation over the repeated runs
Evanno et al. 2005. Molecular Ecology 14: 2611-2620
Log likelihood values
Inference of best K using the delta K method
The best K = 8
L(K) = an average of 20 values of Ln P(D)
L'(K) = L(K)_n - L(K)_(n-1)
L''(K) = L'(K)_n - L'(K)_(n-1)
Delta K = |L''(K)|/Stdev
The Excel file used in this example is available on the workshop URL: http://pbgworks.org/tomato-workshop (file name: The best K analysis.xls)
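The spreadsheet calculation can also be sketched in Python. The version below is an illustration, not a transcription of the workshop file: it takes the mean Ln P(D) for each K, forms the second difference as L(K+1) - 2L(K) + L(K-1) following Evanno et al. (2005), and divides its absolute value by the sample standard deviation of the runs at K. The row alignment may therefore differ from the spreadsheet's, and the function names `delta_k` and `best_k` are hypothetical.

```python
from statistics import mean, stdev


def delta_k(log_likelihoods):
    """Compute Delta K for every K that has both neighbours K-1 and K+1.

    `log_likelihoods` maps K to the list of Ln P(D) values from repeat runs.
    """
    mean_l = {k: mean(values) for k, values in log_likelihoods.items()}
    result = {}
    for k in sorted(mean_l):
        if k - 1 not in mean_l or k + 1 not in mean_l:
            continue  # the second difference is undefined at the ends
        second_diff = mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1]
        spread = stdev(log_likelihoods[k])  # sample SD, like Excel's STDEV
        if spread == 0:
            continue  # skip K values where every run gave the same value
        result[k] = abs(second_diff) / spread
    return result


def best_k(log_likelihoods):
    """Return the K with the largest Delta K."""
    scores = delta_k(log_likelihoods)
    return max(scores, key=scores.get)


# Invented Ln P(D) values (three runs per K), not workshop results.
example = {
    1: [-1001.0, -1000.0, -999.0],
    2: [-801.0, -800.0, -799.0],
    3: [-501.0, -500.0, -499.0],
    4: [-491.0, -490.0, -489.0],
    5: [-486.0, -485.0, -484.0],
}
scores = delta_k(example)
chosen = best_k(example)
```

Delta K cannot be computed for the smallest and largest K tested, so the range of Ks should extend at least one step beyond the values of interest.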
STRUCTURE analysis
Retrieve a Q-matrix of the best K
Bar plot showing clustering of individual genotype
Q-matrix
SAS code
%macro Mol(mark);
proc mixed data = three;
  class &mark gen rep;
  /* pop1-pop8 are the Q-matrix columns (Qw); &mark is the marker effect (Markerα) */
  model T1 = pop1 pop2 pop3 pop4 pop5 pop6 pop7 pop8 &mark / solution;
  random gen rep;
%mend;
%Mol(M1);
%Mol(M2);
%Mol(M3);
%Mol(M4);
run;
The SAS code used in this example is available on the workshop URL: http://pbgworks.org/tomato-workshop (file name: SAScode.txt)
Summary
STRUCTURE is a useful tool to detect population subdivision
The use of the Q-matrix can correct for subpopulations during association analysis in breeding populations; avoids detection of false positives
The SNP resources from SolCAP are a powerful survey tool; we should be thinking beyond bi-parental populations toward analysis of complex breeding populations