gwas.emes.comp
Introduction to NGS data, processing and mapping to detect polymorphisms.
![Page 1: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/1.jpg)
www.nottingham.ac.uk/adac
![Page 2: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/2.jpg)
Obtaining, QC, mapping and analysis of NGS data. Richard Emes Associate Professor & Reader in Bioinformatics. School of Veterinary Medicine and Science Director Advanced Data Analysis Centre [email protected] www.nottingham.ac.uk/adac @rdemes @ADAC_UoN
![Page 3: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/3.jpg)
What is ADAC?
The University of Nottingham Advanced Data Analysis Centre (ADAC).
• Bioinformatics and data analysis support.
Why is this important?
• Complex data underpins much current research.
• Innovative analysis can prompt new discoveries.
• Excellent research can often be stalled due to a lack of expertise in conducting data analysis, availability or cost of inclusion of diverse specialists.
Why ADAC?
• ADAC supports high-class research by providing analysts with expertise in a range of bioinformatics and computer science disciplines.
• Flexible support
– Consultancy, collaboration, bespoke analysis.
• Leadership from recognized experts in the fields of bioinformatics and computer science.
• Track record of funding
– Pivotal role in collaborations funded by, amongst others, Zoetis, BBSRC, NERC and Technology Strategy Board.
– ADAC is conducting transcriptome analysis for a multinational FP7 funded project (EU Prohealth).
![Page 4: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/4.jpg)
http://www.nottingham.ac.uk/adac/ Enquiries: [email protected] or [email protected] @ADAC_UoN or @rdemes
Current Areas of expertise relevant here:
• Transcriptomics (Microarray, NGS)
• Comparative genomics (eukaryotic, prokaryotic)
• Identification of biomarkers from genetic and epigenetic datasets
• Artificial intelligence for decision support
• Machine Learning
• Integration of complex datasets
• Data Management
![Page 5: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/5.jpg)
@HWI-_FC_20BTNAAXX:2:1:215:593
ACAGTGCATGACATGCATAGCAGCATAGACTAC
+HWI-_FC_20BTNAAXX:2:1:215:593
GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh
![Page 6: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/6.jpg)
Some common terms
• Library: collection of molecules. This is the “complexity” of what you sequence.
• Flowcell: slide where the molecules being sequenced are attached to a solid platform.
• Lane: unique sequencing unit of the flowcell.
• Reads: raw sequence of bases and imputed quality scores.
• Fragment: original molecule being sequenced (fragment of genome/gene), i.e. paired-end (PE) reads are reads from the same fragment.
• Cluster: DNA bound to the slide; local amplification of the product amplifies the signal so that fluorescence can be measured.
• Mapping: finds where your sequence matches a reference. Importantly, it gives a probability that this is the correct location.
![Page 7: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/7.jpg)
Illumina sequencing
![Page 8: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/8.jpg)
Illumina sequencing
![Page 9: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/9.jpg)
![Page 10: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/10.jpg)
![Page 11: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/11.jpg)
What coverage do I need?
![Page 12: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/12.jpg)
Obtaining NGS Data
• Sequence Read Archive (SRA; formerly the Short Read Archive)
• European Nucleotide Archive (ENA)
![Page 13: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/13.jpg)
Obtaining NGS Data
• Sequence Read Archive (SRA; formerly the Short Read Archive)
– Data will need to be converted to FASTQ using the SRA Toolkit (e.g. fastq-dump)
![Page 14: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/14.jpg)
Deciphering a fastq file
@HWI-_FC_20BTNAAXX:2:1:215:593#0/1
ACAGTGCATGACATGCATAGCAGCATAGACTAC
+HWI-_FC_20BTNAAXX:2:1:215:593#0/1
GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh
Header: @HWI-_FC_20BTNAAXX:2:1:215:593#0/1
HWI-_FC_20BTNAAXX = instrument identifier
2 = flowcell lane
1 = tile number in flowcell lane
215 = x-coordinate of cluster in the tile
593 = y-coordinate of cluster in the tile
#0 = index of multiplexed samples
/1 = member of pair (/1 or /2 if paired end)
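A minimal Python sketch of the header breakdown above. The names ReadHeader and parse_header are illustrative only (they do not come from any Illumina tool), and the sketch assumes the old-style header layout shown on this slide, with both the #index and /pair suffixes present.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: int


def parse_header(line: str) -> ReadHeader:
    """Split an old-style Illumina FASTQ header into its fields."""
    body = line.strip().lstrip("@")
    coords, _, pair = body.partition("/")      # '/1' or '/2' pair member
    coords, _, index = coords.partition("#")   # '#0' multiplex index
    instrument, lane, tile, x, y = coords.split(":")
    return ReadHeader(instrument, int(lane), int(tile), int(x), int(y), index, int(pair))


header = parse_header("@HWI-_FC_20BTNAAXX:2:1:215:593#0/1")
print(header)
```

Newer Illumina headers carry more colon-separated fields, so this split would need adapting for them.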
![Page 15: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/15.jpg)
Deciphering a fastq file
@HWI-_FC_20BTNAAXX:2:1:215:593#0/1
ACAGTGCATGACATGCATAGCAGCATAGACTAC
+HWI-_FC_20BTNAAXX:2:1:215:593#0/1
GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh
Sequence: ACAGTGCATGACATGCATAGCAGCATAGACTAC
Quality Header: +HWI-_FC_20BTNAAXX:2:1:215:593#0/1
Quality: GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh
![Page 16: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/16.jpg)
Deciphering a fastq file
Quality: GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh
Sanger encoding = ASCII table lookup – 33
Solexa (and early Illumina) encoding = ASCII table lookup – 64
G = 71 – 64 = 7
h = 104 – 64 = 40
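The lookup above can be sketched in a few lines of Python. The function name decode_quality is an assumption made for illustration; the offsets 33 and 64 are the ones given on the slide.

```python
def decode_quality(quality: str, offset: int) -> list[int]:
    """Convert each quality character to a score: ASCII code minus the offset."""
    return [ord(ch) - offset for ch in quality]


quality = "GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh"
scores = decode_quality(quality, offset=64)   # offset 64 as on this slide
print(scores)
```

Using the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses (FastQC reports this) before filtering.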
![Page 17: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/17.jpg)
Deciphering a fastq file Quality: GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh
![Page 18: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/18.jpg)
SNP Calling
• Genotyping: identifying variants in a single genome (i.e. from each parent)
• SNP Calling: identifying variants between individual genomes

Sample Genome(s):
ACGTGCAGCATAGCA?CGACATCGACATACGC
TGCACGTCGTATCGT?GCTGTAGCTGTATGCG
Reads:
****A*******
***A*****
**T******
*****T**********
******T********
Reference Genome:
ACGTGCAGCATAGCATCGACATCGACATACGC
TGCACGTCGTATCGTAGCTGTAGCTGTATGCG
![Page 19: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/19.jpg)
INDEL Calling
• INDEL: Insertion/deletion

Sample Genome(s):
ACGTGCAGCATAGCA???CGACATCGACATACGC
TGCACGTCGTATCGT???GCTGTAGCTGTATGCG
Reads:
****ACG*******
***ACG*****
**---********
*****---**********
******---********
Reference Genome:
ACGTGCAGCATAGCACGTCGACATCGACATACGC
TGCACGTCGTATCGTGCAGCTGTAGCTGTATGCG
![Page 20: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/20.jpg)
A pipeline for SNP identification
• Quality Control
– FastQC, Fastx toolkit
• Trimming
– Sickle, Trim Galore, Trimmomatic, Cutadapt
• Mapping
– BWA, Bowtie, Stampy
• Remove Duplicates
– Picard tools, Samtools rmdup
• Call SNPs / INDELs
– Samtools mpileup, VarScan, GATK, many others!
![Page 21: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/21.jpg)
Galaxy: https://usegalaxy.org/
Toolbox Workflows
![Page 22: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/22.jpg)
Visualize the data
• FASTQC
• Stand-alone or non-interactive
– Basic Statistics module, includes:
• Filename: the original filename of the file which was analyzed.
• Encoding: says which ASCII encoding of quality values was found in this file.
• Total Sequences: a count of the total number of sequences processed.
• Sequence Length: provides the length of the shortest and longest sequence in the set. If all sequences are the same length only one value is reported.
![Page 23: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/23.jpg)
Visualize the data
• FASTQC: Per Base Sequence Quality
– Red line = median quality
– Yellow box = inter-quartile range (IQR)
– Whiskers = 10% and 90% points
– Blue line = mean quality
– Warning: the lower quartile for any base is less than 10, or the median for any base is less than 25.
– Failure: the lower quartile for any base is less than 5, or the median for any base is less than 20.
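A rough Python sketch of how the per-base thresholds above could be applied to a set of equal-length reads. This is not FastQC's own code: the helper names are made up for illustration, and the simple percentile rule used here is an assumption, so values may differ slightly from what FastQC reports.

```python
def percentile(sorted_values, fraction):
    """Simple percentile on an already sorted list (illustrative rule only)."""
    return sorted_values[int(fraction * (len(sorted_values) - 1))]


def per_base_quality_status(reads_scores):
    """reads_scores: list of equal-length lists of quality scores, one per read."""
    status = "pass"
    for column in zip(*reads_scores):          # one column per base position
        ordered = sorted(column)
        lower_quartile = percentile(ordered, 0.25)
        median = percentile(ordered, 0.5)
        if lower_quartile < 5 or median < 20:
            return "fail"
        if lower_quartile < 10 or median < 25:
            status = "warn"
    return status


print(per_base_quality_status([[40, 22], [40, 22], [40, 22]]))
```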
![Page 24: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/24.jpg)
Visualize the data
• FASTQC: Per Sequence Quality Scores
– Warning: the most frequently observed mean quality is below 27 (this equates to a 0.2% error rate).
– Failure: the most frequently observed mean quality is below 20 (this equates to a 1% error rate).
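The error rates quoted above follow from the standard Phred relationship, error probability = 10^(-Q/10). A small Python sketch, with function names chosen here for illustration and mean qualities rounded to whole numbers before taking the most frequent value (an assumption, not necessarily what FastQC does):

```python
from collections import Counter


def error_probability(quality: float) -> float:
    """Phred score to probability that the base call is wrong."""
    return 10 ** (-quality / 10)


def per_sequence_quality_status(mean_qualities):
    """Apply the 27 / 20 thresholds to the most frequent mean read quality."""
    most_common = Counter(round(q) for q in mean_qualities).most_common(1)[0][0]
    if most_common < 20:
        return "fail"
    if most_common < 27:
        return "warn"
    return "pass"


print(error_probability(20), error_probability(27))
```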
![Page 25: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/25.jpg)
Visualize the data
• FASTQC: Per Base Sequence Content
– Proportion of each of the four DNA bases (A, T, C, G) called at each position in the file.
– Warning: the difference between A and T, or G and C, is greater than 10% in any position.
– Failure: the difference between A and T, or G and C, is greater than 20% in any position.
– Possible causes: adapters or an effect of trimming.
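As an illustration of the check described above, the following Python sketch computes the base percentages at each position and applies the 10% and 20% thresholds. The function name is hypothetical, reads are assumed to be of equal length, and this is not FastQC's implementation.

```python
from collections import Counter


def per_base_content_status(reads):
    """reads: list of equal-length sequence strings."""
    status = "pass"
    for column in zip(*reads):                 # bases at one read position
        counts = Counter(column)
        percent = {base: 100 * counts.get(base, 0) / len(column) for base in "ACGT"}
        difference = max(abs(percent["A"] - percent["T"]),
                         abs(percent["G"] - percent["C"]))
        if difference > 20:
            return "fail"
        if difference > 10:
            status = "warn"
    return status


print(per_base_content_status(["AC", "TG", "AC", "TG"]))
```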
![Page 26: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/26.jpg)
Visualize the data
• FASTQC: Sequence Length Distribution
– Distribution of fragment sizes in the file.
– Warning: not all sequences are the same length.
– Failure: any of the sequences have zero length.
![Page 27: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/27.jpg)
Visualize the data
• FASTQC: Duplicate Sequences
– Degree of duplication within the first 200,000 reads of the file.
– Distribution of duplication levels in the dataset.
– Warning: non-unique sequences make up more than 20% of the total.
– Failure: non-unique sequences make up more than 50% of the total.
![Page 28: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/28.jpg)
Visualize the data
• FASTQC: Overrepresented Sequences
– Lists all of the sequences which make up more than 0.1% of the total.
– Warning: any sequence is found to represent more than 0.1% of the total.
– Failure: any sequence is found to represent more than 1% of the total.
• FASTQC: Adapter Content
– Shows whether your library contains a significant amount of adapter, so that you can assess whether you need to adapter trim or not.
– Warning: any sequence is present in more than 5% of all reads.
– Failure: any sequence is present in more than 10% of all reads.
![Page 29: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/29.jpg)
Cut adapters
-f = the type of file (in this case fastq)
-q CUTOFF = trim low-quality ends from reads before adapter removal.
-a ADAPTER = sequence of an adapter that was ligated to the 3' end. The adapter itself and anything that follows is trimmed.
-m 100 = minimum length of reads following adapter removal. Reads shorter than 100 bases will be discarded.
--discard-untrimmed = any reads without an adapter will be discarded.
-o = output file, also in fastq format.
cutadapt -f fastq -q 20 -a AGATCGGAAGAG -m 100 --discard-untrimmed -o SNP.test.trimmed.fastq SNP.test.fastq
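To make the effect of -a, -m and --discard-untrimmed concrete, here is a deliberately simplified Python sketch. It is not how cutadapt works internally: it only looks for an exact, complete copy of the adapter, with no error tolerance, no partial matches at the read end and no quality trimming. The function name trim_adapter is made up for this example.

```python
def trim_adapter(read, adapter, min_length, discard_untrimmed=True):
    """Return the trimmed read, or None if the read should be discarded."""
    position = read.find(adapter)
    if position == -1:
        # No adapter found: drop the read when discarding untrimmed reads.
        if discard_untrimmed:
            return None
        trimmed = read
    else:
        # Remove the adapter and everything after it.
        trimmed = read[:position]
    return trimmed if len(trimmed) >= min_length else None


adapter = "AGATCGGAAGAG"
read = "ACGTACGTACGT" + adapter + "TTTT"
print(trim_adapter(read, adapter, min_length=10))
```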
![Page 30: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/30.jpg)
Cut adapters Galaxy: Fastx_clipper from fastx_toolkit
![Page 31: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/31.jpg)
Quality filters
![Page 32: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/32.jpg)
![Page 33: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/33.jpg)
Quality filters
Fastx_toolkit
-q = minimum quality score to keep.
-p = minimum percent of bases in a read that must have [-q] quality.
-i = input file (output of adapter trimming step)
-v = verbose
-Q = quality encoding offset (64 here)
-o = output file, also in fastq format.
fastq_quality_filter -q 20 -p 70 -i SNP.test.trimmed.fastq -v -Q64 -o SNP.test.trimmed.QC.fastq
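The -q / -p rule can be sketched in Python as below. This illustrates the filter's logic under the assumption that a base counts as good when its score is at least the -q value; it is not the Fastx toolkit source, and the function name is invented for the example.

```python
def passes_quality_filter(quality, min_quality=20, min_percent=70, offset=64):
    """Keep a read if enough of its bases reach the minimum quality score."""
    scores = [ord(ch) - offset for ch in quality]      # decode as on the FASTQ slide
    good_bases = sum(1 for score in scores if score >= min_quality)
    return 100 * good_bases / len(scores) >= min_percent


print(passes_quality_filter("GhhhhhhhhhhhUhhEGhhhGhhhhhhhhhhhhh"))
```

The sketch assumes a non-empty quality string; an empty read would need handling separately.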
![Page 34: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/34.jpg)
Quality filters • Galaxy: filter by quality
![Page 35: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/35.jpg)
Align to genome
The problem
• Generally a large genome (Human > 3Gb)
• Large number of short reads
The solution
• Index genome into hash of kmers or short sequences
• Use efficient aligners
– Large number of aligners available.
• Common aligners: Bowtie1/2, BWA, Stampy
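A toy Python illustration of the "hash of kmers" idea: index every k-mer of the reference once, then use the first k bases of a read as a seed to look up candidate positions. The function names are made up, only exact matches are handled, and production aligners such as Bowtie and BWA use compressed Burrows-Wheeler based indexes rather than a plain dictionary.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(genome) - k + 1):
        index[genome[start:start + k]].append(start)
    return index


def find_exact(read, genome, index, k):
    """Seed with the first k bases, then verify the whole read at each hit."""
    hits = []
    for start in index.get(read[:k], []):
        if genome[start:start + len(read)] == read:
            hits.append(start)
    return hits


genome = "ACGTGCAGCATAGCATCGACATCGACATACGC"   # reference strand from the SNP slide
index = build_kmer_index(genome, k=4)
print(find_exact("CGACATCG", genome, index, k=4))
```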
![Page 36: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/36.jpg)
Align to genome
Example Bowtie 2 alignment
Build index
-f = fasta formatted genome file
./bowtie.index.files/chr17 = output location for index files
Galaxy: select pre-built index when using bowtie or BWA
bowtie2-build -f ./genome/chr17.fa ./bowtie.index.files/chr17
![Page 37: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/37.jpg)
Align to genome
Align to reference
-p = number of processors
--end-to-end = align the whole read (alignment is not local)
-k 1 = search for and report at most one valid alignment per read. Note that a read which maps equally well to several positions is still reported once; it is not discarded.
-x = path to indexed genome file to align reads to.
-U = file of unpaired reads (as in this case)
-S = output in SAM format
bowtie2 -p 4 --end-to-end -k 1 -x ./bowtie.index.files/chr17 -U SNP.test.trimmed.QC.fastq -S SNP.test.trimmed.QC.fastq.sam
![Page 38: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/38.jpg)
![Page 39: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/39.jpg)
Alignment file formats
SAM - Sequence Alignment/Map format. (BAM is a binary compressed equivalent)
– TAB-delimited text format
– Header section (optional)
@HD = Header, VN[format version], SO[sorting order]
@SQ = Reference sequence line
@PG = Program, ID[program ID], CL[command line]

@HD VN:1.0 SO:unsorted
@SQ SN:chr17 LN:81195210
@PG ID:bowtie2 PN:bowtie2 VN:2.2.1 CL:"/home/rde/tools/bowtie2-2.2.1/bowtie2-align-s --wrapper basic-0 -p 4 --end-to-end -k 1 -x ./bowtie.index.files/chr17 -S SNP.test.trimmed.QC.fastq.sam -U SNP.test.trimmed.QC.fastq"
![Page 40: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/40.jpg)
Alignment file formats
• SAM - Sequence Alignment/Map format.
– TAB-delimited text format
– Alignment section: mandatory 11 columns
– Optional fields
– For details: http://samtools.github.io/hts-specs/SAMv1.pdf
– Samflags: http://picard.sourceforge.net/explain-flags.html
instrument_name:100:flowcellID:1:1:32575:625 0
chr17 72387574 255 141M * 0 0
GGAAGAGCTGGGACCAGGCCCAGCAATTACCTCACCATGGTTGGGTGCAACCAAGTGGGGCAACTCTTTGGCCAGAAAGCAAAAGTCTTTTTAGCTTCAATGTAGGCCATTCTGGGTCCCAGACCCACAGCTTTGGACATT
Dgga^h`hedac`hcc`c^_f`ChggeDfC`e`_h_ddgd^_`be^bcg^acchC`^^bCdagfbgf_`^dc^^cD`dbha^Dbc`fb^d`CdaD^``dDb_edffdDDh_chceabh`heecd_h`gb_b^Ch`Dd_^c` AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:141 YT:Z:UU
1 Query template
2 Bitwise flag
3 Reference Name
4 1-based leftmost position
5 Mapping Quality
6 CIGAR string
7 Reference Name of mate
8 Position of mate
9 Observed template length
10 Sequence
11 Phred-scaled Quality
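The 11 mandatory columns can be read with a few lines of Python. This is a minimal sketch rather than a replacement for a proper SAM/BAM library: the function names are illustrative, and the example record is a shortened, hypothetical version of the alignment shown above (same reference, position and mapping quality, but a 4-base read).

```python
import re

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]
    return record


def cigar_operations(cigar):
    """Break a CIGAR string such as '141M' into (length, operation) pairs."""
    return [(int(length), op) for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Hypothetical shortened record for illustration only.
line = "\t".join(["read1", "0", "chr17", "72387574", "255", "4M",
                  "*", "0", "0", "GGAA", "Dgga", "NM:i:0"])
record = parse_sam_line(line)
print(record["rname"], record["pos"], cigar_operations(record["cigar"]))
print("unmapped" if record["flag"] & 4 else "mapped")   # flag bit 0x4 = read unmapped
```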
![Page 41: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/41.jpg)
Remove duplicates
• Duplicate reads generated as an artifact of the library generation step give false confidence in variants.
• samtools [rmdup], Picard tools [MarkDuplicates]
• Note: samtools rmdup expects coordinate-sorted alignments, so sort the BAM file first if the aligner output is unsorted.
samtools view -bS SNP.test.trimmed.QC.fastq.sam -o SNP.test.trimmed.QC.fastq.bam
samtools rmdup -s SNP.test.trimmed.QC.fastq.bam SNP.test.trimmed.QC.fastq.rmdup.bam
![Page 42: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/42.jpg)
Remove duplicates
![Page 43: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/43.jpg)
Call Variants
• Identify regions in alignment where sequence differs

Reads:
****A*******
***A*****
**T******
*****T**********
******T********
Reference Genome:
ACGTGCAGCATAGCATCGACATCGACATACGC
![Page 44: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/44.jpg)
Call Variants
samtools sort SNP.test.trimmed.QC.fastq.rmdup.bam SNP.test.trimmed.QC.fastq.rmdup.sorted
samtools index SNP.test.trimmed.QC.fastq.rmdup.sorted.bam
samtools faidx ./genome/chr17.fa
samtools mpileup -f ./genome/chr17.fa SNP.test.trimmed.QC.fastq.rmdup.sorted.bam | java -jar ./tool/VarScan.v2.3.6.jar mpileup2snp --output-vcf --strand-filter 0

samtools mpileup
-f = faidx indexed reference sequence file
VarScan mpileup2snp or mpileup2indel
--min-coverage = minimum read depth at a position to make a call [8]
--min-reads2 = minimum supporting reads at a position to call variants [2]
--min-avg-qual = minimum base quality at a position to count a read [15]
--min-var-freq = minimum variant allele frequency threshold [0.01]
--strand-filter = ignore variants with >90% support on one strand [1]
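To show how the depth and frequency thresholds interact, here is a small Python sketch using the default values listed above. It is only an illustration: the function name is invented, and it leaves out the base-quality filter, the strand filter and the statistical test that VarScan also applies.

```python
def passes_thresholds(depth, variant_reads,
                      min_coverage=8, min_reads2=2, min_var_freq=0.01):
    """Apply coverage, supporting-read and allele-frequency cut-offs at one position."""
    if depth < min_coverage:
        return False                      # not enough reads covering the position
    if variant_reads < min_reads2:
        return False                      # too few reads support the variant
    return variant_reads / depth >= min_var_freq


print(passes_thresholds(depth=10, variant_reads=3))
```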
![Page 45: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/45.jpg)
Call Variants
![Page 46: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/46.jpg)
Call Variants
![Page 47: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/47.jpg)
VCF Variant Call Format file
• Header text marked with ##
• Column headings marked with #
• Mandatory columns
– CHROM Chromosome
– POS Position of variant start (1-based)
– ID Unique variant ID
– REF Reference allele
– ALT Alternate non-reference alleles (comma separated)
– QUAL Phred-scaled quality score
– FILTER Filtering information
– INFO Additional information (annotation)
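A minimal Python sketch for reading the mandatory columns, skipping the ## header text and the # column-heading line. The function name and the two example records are hypothetical and not output from the pipeline above; for real work a dedicated VCF library is a safer choice.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(lines):
    """Return one dict per variant line, keyed by the mandatory column names."""
    records = []
    for line in lines:
        if line.startswith("#"):          # '##' header text and '#CHROM' headings
            continue
        columns = line.rstrip("\n").split("\t")
        record = dict(zip(VCF_COLUMNS, columns[:8]))
        record["POS"] = int(record["POS"])
        record["ALT"] = record["ALT"].split(",")   # comma separated alleles
        records.append(record)
    return records


# Hypothetical example lines for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t1000\t.\tT\tA\t.\tPASS\tADP=20",
    "chr17\t2000\t.\tG\tA,C\t.\tPASS\tADP=15",
]
variants = parse_vcf(example)
print(variants)
```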
![Page 48: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/48.jpg)
Visualize: IGV
![Page 49: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/49.jpg)
![Page 50: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/50.jpg)
So Many SNPs – So What?
• Get gene
• Functional Analysis to identify key candidates
– Identify Homologues
– Locate variants
– Identify Ontologies
– Pathway and interaction analysis
– Locate SNPs on structure
– Compare to current data
![Page 51: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/51.jpg)
• Functional Analysis to identify key candidates
A step by step example (don’t do this with lots of variants!)
• Get gene
• Modify bases as shown in VCF file.
• BLASTx to identify reading frame.
• Produce mRNA and encoded peptide sequence fasta files (provided)
• Determine variant positions in mRNA & peptide sequences
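The last two steps can be sketched in Python for a single SNP. The sketch assumes the coding sequence is already in the correct reading frame on the forward strand and uses the standard genetic code; the function names and the 9-base example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence, ignoring any incomplete final codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def classify_snp(cds, position, alt):
    """position is 0-based within the coding sequence."""
    mutated = cds[:position] + alt + cds[position + 1:]
    codon_number = position // 3
    ref_aa = translate(cds)[codon_number]
    alt_aa = translate(mutated)[codon_number]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


cds = "ATGCGTTAA"                      # hypothetical sequence: Met, Arg, stop
print(classify_snp(cds, 3, "A"))       # CGT -> AGT
```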
![Page 52: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/52.jpg)
Compare to current data
• dbSNP – SNP already known?
• Repositories such as Ensembl, UCSC – In splice variant? – In known regions, domain etc?
• Variant effect predictor (more later)
![Page 53: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/53.jpg)
Locate variants
• Visualize the position in genome
– Ensembl/UCSC
– Add custom track using GFF/bed file
• Coding/non-coding
– Synonymous / non-synonymous
– Codon usage
• Locate in relation to known domains
– Pfam
– SMART
• Repeat regions
![Page 54: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/54.jpg)
Identify Ontologies
• For single genes
– Search for available information: PubMed, InterPro etc.
• For multiple genes
– BLAST2GO
– DAVID
![Page 55: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/55.jpg)
Pathway and interaction analysis
• Pathway analysis
– Understand the process of your gene. Does it make biological sense?
– IPA
– DAVID
– Webgestalt
• Interaction analysis
– BioGRID
– STRING
– PSICQUIC
![Page 56: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/56.jpg)
Locate SNPs on structure
• Structure prediction
– BLAST of PDB
– Predict structure
• PSIPRED – 2° structure
• ITASSER – 3° structure
• Phyre – 3° structure
– Locate in 3D
• Swiss PDB viewer
![Page 57: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/57.jpg)
Locate SNPs on structure
• Predict effect of SNPs
– SuSPect
– VEP
Arg Ser
![Page 58: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/58.jpg)
Identify Homologues
Emes R.D. (2008) Inferring function from homology. In: Methods in Molecular Biology 453. Humana Press.
![Page 59: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/59.jpg)
Links and websites 1
• Many at http://emeslab.wordpress.com/useful-links/
• SRA: http://www.ncbi.nlm.nih.gov/sra
• SRA toolkit: http://eutils.ncbi.nih.gov/Traces/sra/?view=software
• ENA: http://www.ebi.ac.uk/ena/
• Galaxy: https://usegalaxy.org/
• FASTQC: www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Fastx_toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
• Bowtie 1: http://bowtie-bio.sourceforge.net/index.shtml
• Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• BWA: http://bio-bwa.sourceforge.net/
• Stampy: http://www.well.ox.ac.uk/project-stampy
• SAMtools: http://samtools.sourceforge.net
• PicardTools: http://picard.sourceforge.net
• VarScan: http://varscan.sourceforge.net
• IGV: https://www.broadinstitute.org/software/igv/
![Page 60: Gwas.emes.comp](https://reader034.vdocument.in/reader034/viewer/2022042521/53fb5d368d7f72b82e8b540f/html5/thumbnails/60.jpg)
Links and websites 2
• dbSNP: http://www.ncbi.nlm.nih.gov/SNP/
• Ensembl: http://www.ensembl.org/index.html
• UCSC Genome Browser: https://genome.ucsc.edu/
• Pfam: http://pfam.xfam.org/
• SMART: http://smart.embl-heidelberg.de/
• GO: http://www.geneontology.org/
• BLAST2GO: http://www.blast2go.com/b2ghome
• IPA: http://www.ingenuity.com/products/login
• DAVID: http://david.abcc.ncifcrf.gov/
• Webgestalt: http://bioinfo.vanderbilt.edu/webgestalt/
• BioGRID: http://thebiogrid.org/
• PSICQUIC: http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml
• ITASSER: http://zhanglab.ccmb.med.umich.edu/I-TASSER/
• Phyre: http://www.sbg.bio.ic.ac.uk/phyre2/
• SuSPect: http://www.sbg.bio.ic.ac.uk/~suspect/
• PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/
• SwissPDB: http://spdbv.vital-it.ch/