ISMU pipeline for NGS data analysis and facilitating
molecular breeding
http://hpc.icrisat.cgiar.org/NGS/
Challenges
• Short read length of sequences
• Availability of many tools
• Platform dependency and command line driven
• No direct ways for prediction of SNPs between genotypes
• Quality scores vary depending on version and technology
ISMU version 1
• SNP discovery from NGS data
– Pipeline for mapping / assembling
– Calling SNPs between genotypes
– Visualisation
ISMU version 2
• Application of identified SNPs to breeding
Objectives of NGS Pipeline
• Benchmark available open-source short-read assembly and downstream analysis programs.
• Assemble reads, detect polymorphisms between genotypes, and visualize the results.
• Design assays (Illumina GoldenGate Assay), call genotypes, and visualize and analyze SNP genotyping and haplotype data.
• Identify parental lines for use in MABC or MARS.
• Discover SNP markers for foreground and background selection in MABC or MARS.
• Document the pipeline and the integrated software.
Control Flowchart
[Flowchart. Box labels: ICRISAT crops (Yes / No); Upload reference & data; Input data & validation; Mapping (Maq, Novo); Mapped reads; Assembly visualization; Consensus calling; Report SNPs; Extract sequences with SNPs; Design primers; In silico validation by SNP2CAPS; Database; ADT score; GoldenGate (G.G) Assay; Bead Studio; Flapjack]
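The "Extract sequences with SNPs" box in the flowchart can be illustrated with a minimal Python sketch. It assumes a 1-based SNP position on a reference string and writes the SNP in the bracketed `[A/G]` form commonly used for assay design input. The function name `snp_flank` and the default flank length of 60 bases are illustrative assumptions, not details taken from the pipeline.

```python
def snp_flank(reference: str, pos: int, ref_allele: str, alt_allele: str,
              flank: int = 60) -> str:
    """Return the flanking sequence around a SNP as LEFT[ref/alt]RIGHT.

    pos is 1-based. flank is a hypothetical default, not a pipeline setting.
    """
    i = pos - 1  # convert to a 0-based index
    if not 0 <= i < len(reference):
        raise ValueError("SNP position lies outside the reference sequence")
    if reference[i].upper() != ref_allele.upper():
        raise ValueError("reference allele does not match the sequence")
    left = reference[max(0, i - flank):i]       # up to `flank` bases upstream
    right = reference[i + 1:i + 1 + flank]      # up to `flank` bases downstream
    return f"{left}[{ref_allele}/{alt_allele}]{right}"


if __name__ == "__main__":
    # Made-up 10-base reference with an A/G SNP at position 5.
    print(snp_flank("ACGTACGTAC", 5, "A", "G", flank=3))
```

Flanks are simply truncated near the ends of a contig; a real assay design step would probably reject such SNPs instead.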
Standard Methodology
[Diagram. Labels: Genotype 1, Genotype 2; Maq / Novo programme: Mapping, Assembly, SNP calling against reference; "SNPs between genotypes" programme: Predicted SNPs against reference, Base calling in 2nd genotype, Compare allele between genotypes, Check the inverse combination, Remove duplicates, ADT scoring, Reporting]
Example row:
Chrom1 Pos RefAllele Gtyp1 Gtyp2
5 303 A G ?
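One way to read the comparison labels on the Standard Methodology slide (compare alleles between genotypes, check the inverse combination, remove duplicates) is sketched below in Python. This is an interpretation under stated assumptions, not the pipeline's actual script: the dictionary layout, the function names and all data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Site = Tuple[str, int]               # (chromosome or contig, 1-based position)
Record = Tuple[str, int, str, str]   # (chrom, pos, allele in genotype 1, allele in genotype 2)


def compare_one_way(snps: Dict[Site, str], other_bases: Dict[Site, str],
                    swap: bool = False) -> Set[Record]:
    """Compare SNP alleles called in one genotype with the base called in the other."""
    records: Set[Record] = set()
    for (chrom, pos), allele in snps.items():
        other = other_bases.get((chrom, pos))
        if other is None or other == allele:
            continue  # no base call in the other genotype, or both genotypes agree
        g1, g2 = (other, allele) if swap else (allele, other)
        records.add((chrom, pos, g1, g2))
    return records


def snps_between_genotypes(snps1: Dict[Site, str], bases1: Dict[Site, str],
                           snps2: Dict[Site, str], bases2: Dict[Site, str]) -> List[Record]:
    """Run the comparison in both directions; the set union drops duplicates."""
    forward = compare_one_way(snps1, bases2)             # genotype 1 SNPs vs genotype 2 bases
    inverse = compare_one_way(snps2, bases1, swap=True)  # the inverse combination
    return sorted(forward | inverse)


if __name__ == "__main__":
    # Made-up calls: SNPs against the reference, plus the base called per site.
    snps1 = {("5", 303): "G", ("5", 410): "T", ("5", 600): "A"}
    snps2 = {("5", 410): "T", ("5", 520): "C", ("5", 600): "C"}
    bases1 = {("5", 303): "G", ("5", 410): "T", ("5", 520): "G", ("5", 600): "A"}
    bases2 = {("5", 303): "A", ("5", 410): "T", ("5", 520): "C", ("5", 600): "C"}
    for record in snps_between_genotypes(snps1, bases1, snps2, bases2):
        print(record)
```

In this toy data, position 600 is found in both directions and is reported once, while position 410 is dropped because both genotypes carry the same non-reference allele.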
Customized Methodology (Consensus Base Calling, cc)
[Diagram. Labels: Genotype 1, Genotype 2; Mapping (ccMaq / ccNovo); SNP calling (programme: in-house script); ADT scoring]
Genotype 1: fmaj = 38/40 = 0.95
Genotype 2: fmaj = 21/28 = 0.75
Consensus Base Calling Parameters (Default)
• Max number of mismatches <= 7
• Sum of mismatch scores <= 60
• Min mapping quality >= 0
• Read depth threshold >= 5
• Major base frequency threshold >= 0.75
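A minimal Python sketch of how the default thresholds above could be applied at a single position follows. The class and function names are hypothetical, and the way reads are filtered before base counting is an assumption. Only the threshold values and the two fmaj examples (38/40 and 21/28) come from the slides; the bases used in the demo are made up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ConsensusParams:
    """Default thresholds as listed on the slide."""
    max_mismatches: int = 7
    max_mismatch_score_sum: int = 60
    min_mapping_quality: int = 0
    min_read_depth: int = 5
    min_major_base_freq: float = 0.75


def read_passes(mismatches: int, mismatch_score_sum: int, mapping_quality: int,
                params: ConsensusParams = ConsensusParams()) -> bool:
    """Read-level filter (assumed to run before bases are counted)."""
    return (mismatches <= params.max_mismatches
            and mismatch_score_sum <= params.max_mismatch_score_sum
            and mapping_quality >= params.min_mapping_quality)


def call_consensus(bases: Sequence[str],
                   params: ConsensusParams = ConsensusParams()) -> Optional[Tuple[str, float]]:
    """Return (major base, fmaj) for one position, or None if a threshold fails."""
    depth = len(bases)
    if depth < params.min_read_depth:
        return None  # too few reads cover this position
    base, count = Counter(bases).most_common(1)[0]
    fmaj = count / depth  # frequency of the most common base
    if fmaj < params.min_major_base_freq:
        return None  # no base is dominant enough
    return base, fmaj


if __name__ == "__main__":
    print(call_consensus("G" * 38 + "A" * 2))  # 38 of 40 reads agree
    print(call_consensus("A" * 21 + "C" * 7))  # 21 of 28 reads agree
    print(call_consensus("ACGT"))              # depth 4, below the threshold
```

With these defaults both slide examples pass: Genotype 2 sits exactly on the 0.75 boundary, which is accepted because the threshold is inclusive.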
What if more than 2 genotypes?
Genotype1
Genotype2
Genotype3
Genotype4

     G1  G2  G3
G1    0   1   1
G2    0   0   1
G3    0   0   0

Number of genotype combinations = (n^2 - n)/2
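The pairwise combinations counted by this formula can be enumerated with a short Python sketch. The function names are illustrative; `itertools.combinations` is from the standard library.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def genotype_pairs(genotypes: Sequence[str]) -> List[Tuple[str, str]]:
    """List every unordered pair of genotypes (the 1-entries of the matrix)."""
    return list(combinations(genotypes, 2))


def pair_count(n: int) -> int:
    """Number of pairs among n genotypes: (n^2 - n) / 2."""
    return (n * n - n) // 2


if __name__ == "__main__":
    names = ["G1", "G2", "G3", "G4"]
    for pair in genotype_pairs(names):
        print(pair)
    print(pair_count(len(names)))
```

Three genotypes give the three pairs marked in the matrix; a fourth genotype raises the count to six.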
NGS pipeline input data
• Reads format
  – fna and qual
  – (Standard/Sanger) FASTQ
  – SCARF format
  – Solexa fastq, Solexa export
  – AB SOLiD read format
  – FASTA
• Reference sequence
  – Chickpea transcript assembly
  – Pearl millet transcript assembly
  – Pigeonpea transcript assembly
  – Medicago genome
  – Sorghum genome
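Because the accepted read formats encode base quality differently, some normalisation is presumably needed before mapping. The Python sketch below converts a Solexa-scaled quality string (ASCII offset 64) to Sanger/standard encoding (Phred scale, ASCII offset 33) using the usual relation Q_phred = 10 * log10(10^(Q_solexa/10) + 1). The function names are illustrative, and whether the pipeline performs the conversion this way is an assumption.

```python
import math


def solexa_to_phred(q_solexa: int) -> int:
    """Convert one Solexa-scaled quality score to the nearest Phred score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def solexa_qual_to_sanger(qual: str) -> str:
    """Re-encode a Solexa quality string (offset 64) as Sanger (offset 33)."""
    return "".join(chr(solexa_to_phred(ord(c) - 64) + 33) for c in qual)


if __name__ == "__main__":
    # 'h', '@' and ';' encode Solexa scores 40, 0 and -5.
    print(solexa_qual_to_sanger("h@;"))
```

The two scales agree closely for high scores and diverge only at the low end, which is why the rounding matters mostly for poor-quality bases.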
NGS pipeline (Input 1)
NGS pipeline (Input 2)
NGS pipeline (Help page)
NGS pipeline (Results)
NGS pipeline (Visualisation)
Pipeline Editions
Available in 2 editions:
1. Server Edition
2. Desktop Edition
Server Edition
• User-friendly web interface
  – Installation on the following Linux platforms
    • Fedora 13
    • CentOS 5
• Clients can be any OS with a web browser
• Communication resources
  – SMTP (email)
• Session-specific job processing
  – Avoids file overwriting
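The session-specific job processing noted for the Server Edition can be pictured with a small Python sketch in which every session gets its own working directory, so two users uploading a file with the same name cannot overwrite each other. The directory layout and the function names are hypothetical; the slides do not describe how the server actually isolates jobs.

```python
import tempfile
from pathlib import Path


def create_session_dir(base_dir: Path) -> Path:
    """Create a uniquely named working directory for one session."""
    base_dir.mkdir(parents=True, exist_ok=True)
    # mkdtemp guarantees a fresh directory name on every call.
    return Path(tempfile.mkdtemp(prefix="job_", dir=base_dir))


def save_upload(session_dir: Path, filename: str, data: bytes) -> Path:
    """Store an uploaded file inside the session's own directory."""
    target = session_dir / Path(filename).name  # drop any directory part of the name
    target.write_bytes(data)
    return target


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())  # stand-in for the server's job area
    first = save_upload(create_session_dir(base), "reads.fastq", b"session one")
    second = save_upload(create_session_dir(base), "reads.fastq", b"session two")
    print(first)
    print(second)
```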
Desktop Edition
• All functionalities of Server Edition on a Desktop
• Supported OS
• Fedora 13
• RHEL 5
• Single command installation
• Available as an installable CD
Future plans
• Consideration of new tools to integrate or update, e.g. BWA, Bowtie
• Implementation of the extension to the pipeline
• Evaluation of cloud computing and high-performance computing cluster options
• Initiatives such as iPlant (discovery environment – genotype to phenotype)
Future Plans: ISMU v 2
• Identification of appropriate modules for MARS, GWS and GBS
• Integration of MARS and GWS module
• Linking of ISMU pipeline with DMS of IBP
• Documentation & Training of ISMU pipeline
Architecture
NGS Data Analysis pipeline at ICRISAT
[Architecture diagram. Component labels: Internet; Apache server hosting web pages; SMTP server; CGI; Input data validation; Reference sequences; Maq; Novo; Velvet; Perl programs; SNP database; Dynamic querying; Assembly visualization; Files downloading]
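The SMTP server in the architecture suggests that the pipeline e-mails its users, presumably when a job finishes; the slides only list SMTP as a communication resource, so the notification scenario is an assumption. Under that assumption, a message could be built and sent with Python's standard `email` and `smtplib` modules. The addresses, job identifier, URL and host below are placeholders, not values from the pipeline.

```python
import smtplib
from email.message import EmailMessage


def build_job_notification(sender: str, recipient: str, job_id: str,
                           results_url: str) -> EmailMessage:
    """Compose a plain-text notification for a finished job."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"NGS pipeline job {job_id} finished"
    msg.set_content(f"Your job {job_id} has finished.\nResults: {results_url}\n")
    return msg


def send_notification(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Hand the message to an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


if __name__ == "__main__":
    message = build_job_notification(
        "pipeline@example.org", "user@example.org",
        "job_42", "http://example.org/results/job_42",
    )
    print(message["Subject"])  # sending is left out so the demo runs without a mail server
```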
Contributors
• Rajeev K. Varshney
• Abhishek Rathore
• Jayashree B
• Vivek Thakur
• R. Pradeep
• A. Bhanu Prakash
• Sarwar Azam
• G. Meenakshi
• David Marshall
• Iain Milne
• Jonathan Jones
• David Studholme
• Greg May
• Andrew Farmer
• Jimmy Woodward
• Dave Edwards