Ricopili: Imputation Module
WCPG Education Day
Stephan Ripke / Raymond Walters
Toronto, October 2015
Ricopili Overview
Ricopili Overview
Outline for this session
• Ricopili’s approach to imputation
– Reference alignment
– Efficient imputation
– Post-processing
• Usage and output structure
Why Impute?
• Meta-analysis
  – Smooth out array differences
• Fine mapping
  – Many more markers to look at
• Fill in missing data
• Add non-SNP variation
  – e.g. small indels
Imputation Overview
Tasks in imputation module
1. Align genotype data to reference
2. Pre-phase haplotypes from genotypes
3. Impute using reference panel
4. Get imputation results
– Dosages
– Best guess genotypes
– Info scores
Imputation Details
Workflow of the full module on GitHub: https://github.com/Nealelab/ricopili/wiki
Aligning to reference
• Need to ensure genotypes are aligned to the
reference panel before phasing
– Same genome build
• LiftOver if needed
– Same mapping of genetic locations
– Resolve any strand flips, allele swaps
• Careful handling of strand ambiguous SNPs
• Consult population allele frequencies
• Ricopili automates this process
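The per-SNP alignment logic can be sketched as a small classifier, comparing dataset alleles to the reference while allowing for strand flips and allele swaps. This is an illustrative sketch with an assumed helper name (`classify_alignment`), not Ricopili's actual code:

```python
# Hypothetical sketch of per-SNP reference alignment; not Ricopili's code.
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}  # strand complements

def classify_alignment(a1, a2, ref1, ref2):
    """Classify how a dataset SNP (alleles a1/a2) relates to the reference."""
    if {a1, a2} in ({"A", "T"}, {"C", "G"}):
        return "ambiguous"       # strand unreadable; consult allele frequencies
    if (a1, a2) == (ref1, ref2):
        return "match"
    if (a1, a2) == (ref2, ref1):
        return "swap"            # same strand, alleles swapped
    if (COMP[a1], COMP[a2]) == (ref1, ref2):
        return "strand_flip"
    if (COMP[a1], COMP[a2]) == (ref2, ref1):
        return "strand_flip_swap"
    return "mismatch"            # e.g. C/T vs. C/G -> remove
```

Strand-ambiguous (A/T, C/G) SNPs look identical on both strands, which is why they need the separate frequency-based handling described above.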
How does imputation work?
• We can model our observed genotypes as a mosaic of the reference haplotypes
Marchini & Howie, 2010, Nature Reviews Genetics
How does imputation work?
• Different algorithms for modeling the genotypes as a mosaic of haplotypes:
  – MaCH: hidden Markov model
  – Impute2: HMM (phasing) + MCMC (uncertainty)
  – BEAGLE: graphical model
• Ricopili uses Impute2 (+Shapeit) by default
  – Formerly all three algorithms were integrated; if wished, they can be integrated again
Common Reference Panels
• 1000 Genomes
  – Phase 1
    • chrX as a separate reference panel
  – Phase 3
    • Additional populations vs. Phase 1 (South Asian and additional African populations)
• HLA with amino acids (Paul de Bakker)
• Easy to integrate another reference (at the administrator level)
Imputation, 11 steps
1) Guess genome build
2) Align positions
3) Align alleles
4) Cut into genomic chunks
5) Prephasing
6) Imputation
7) Data Reformatting
8) Postimputation QC / Best Guess
9) Genome wide best guess
10) Clean
11) Evaluate hard-disk usage (~40 MB per ID, i.e. 40 GB per 1000 IDs)
• 1000 individuals: ~4 hours
• 15,000 individuals: ~48 hours
Steps 5, 6, and 7 take > 90% of the computing resources of this module
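The disk-usage rule of thumb in step 11 is linear in sample size, as a one-line sketch (the function name is an illustrative assumption):

```python
def estimate_disk_gb(n_individuals, mb_per_id=40):
    """Rule of thumb from the slide: ~40 MB per individual, i.e. 40 GB per 1000 IDs."""
    return n_individuals * mb_per_id / 1000.0
```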
Divide datasets for Parallel Imputation
• 929 genomic chunks of 5 Mb each
  – Overlapping window of 1 Mb on each side toward the next chunk
  – So 3 Mb of each chunk are kept for downstream analyses
• 929 parallel jobs for each dataset (Nd)
  – In total, 929 x Nd parallel jobs get sent for each step
  – More if N > 1500
• Total time depends on how free the cluster is
– Prephasing, Imputation each up to several hours per job
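The chunk geometry implied here (3 Mb kept core, 1 Mb overlap on each side, 5 Mb imputed span) can be sketched as follows; `chunk_bounds` and `chunk_label` are illustrative names, not pipeline functions:

```python
CORE_MB = 3     # region kept for downstream analyses
OVERLAP_MB = 1  # buffer on each side, chopped off after imputation

def chunk_bounds(core_start_mb):
    """(imputed_start, imputed_end, kept_start, kept_end) in Mb for one chunk."""
    kept_start, kept_end = core_start_mb, core_start_mb + CORE_MB
    return (max(0, kept_start - OVERLAP_MB), kept_end + OVERLAP_MB,
            kept_start, kept_end)

def chunk_label(chrom, core_start_mb):
    """Chunk name in the style used later in this deck, e.g. chr22_048_051."""
    return f"chr{chrom}_{core_start_mb:03d}_{core_start_mb + CORE_MB:03d}"
```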
Ricopili Prephasing Jobs
• Phasing for each of the 929 genomic chunks
  – 929 parallel jobs sent for each dataset (Nd)
• Total time depends on how free the cluster is
  – Depends strongly on dataset size (no split of individuals)
  – Each job up to several hours
  – One of the two most time-consuming steps (fewer, but longer jobs than imputation)
• Failed jobs due to long runtime get re-sent with higher multithreading values*
* Not fully implemented on all infrastructures
Ricopili Imputation Jobs
• Individuals in each dataset get split into parts with max. 1500 individuals
• A minimum of 929 x Nd parallel jobs get sent
  – If datasets with > 1500 individuals are present, the number of parallel jobs rises significantly
• Total time depends on how free the cluster is
  – Each job up to several hours
  – One of the two most time-consuming steps (more, but shorter jobs than prephasing)
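The job count described above follows directly: one job per chunk, per dataset, per split of at most 1500 individuals. A sketch, with an assumed function name:

```python
import math

def n_imputation_jobs(dataset_sizes, n_chunks=929, max_split=1500):
    """929 x Nd jobs when every dataset has <= 1500 individuals; more otherwise."""
    return sum(n_chunks * math.ceil(n / max_split) for n in dataset_sizes)
```

For example, two datasets of 1200 and 3000 individuals yield 929 × (1 + 2) jobs per step.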
Ricopili Post-Imputation Processing
• Datasets with > 1500 individuals are re-merged
• 3 probabilities converted into 2 probabilities
  – Saves 1/3 of hard-disk space
• Split into qc1 and qc1f; probabilities of qc1f are deleted
• Outdated: best guess per chunk (combined into whole genome at a later point)
• 929 x Nd parallel jobs get sent
  – A lot of I/O, so restricted to 100 parallel jobs*
  – Total time within hours
* Not fully implemented on all infrastructures
What do we get from imputation?
• Dosages
  – Per SNP per individual: the probability of each genotype
    • E.g. 1.5% aa, 98% Aa, 0.5% AA
• Best guess genotypes
  – Genotype with the highest probability, subject to a minimum threshold (default 0.8)
    • E.g. for the above dosage: Aa
    • If the highest probability is below the threshold, the genotype is set to missing
  – Different levels of missing rates and frequencies for different purposes:
    a) No additional filter (~10M SNPs)
    b) Loose filter for SNP analyses (5-8M SNPs)
    c) Strict filter for PCA analysis (2-4M SNPs)
• Info scores
  – Ratio of variances (observed / expected)
  – Metric of imputation quality for each SNP
    • Scaled roughly 0-1
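Both outputs can be illustrated with a short sketch. `best_guess` applies the 0.8 probability threshold from the slide; `impute_info` computes an Impute2-style info score as one reading of the "ratio of variances" description. Both names and the exact formula are assumptions, not the pipeline's code:

```python
def best_guess(p_aa, p_het, p_AA, threshold=0.8):
    """Call the most probable genotype; below the threshold, set to missing."""
    probs = {"aa": p_aa, "Aa": p_het, "AA": p_AA}
    call = max(probs, key=probs.get)
    return call if probs[call] >= threshold else None

def impute_info(probs):
    """Impute2-style info score: 1 minus the average genotype uncertainty
    divided by the binomial variance at the estimated allele frequency.
    probs: list of (p_hom_ref, p_het, p_hom_alt) per individual."""
    n = len(probs)
    e = [p1 + 2 * p2 for (_, p1, p2) in probs]  # expected dosage per person
    f = [p1 + 4 * p2 for (_, p1, p2) in probs]  # expected squared dosage
    theta = sum(e) / (2 * n)                    # estimated allele frequency
    if theta in (0.0, 1.0):
        return 1.0
    return 1 - sum(fi - ei ** 2 for fi, ei in zip(f, e)) / (2 * n * theta * (1 - theta))
```

Perfectly certain genotypes give an info score of 1; uncertain probabilities pull it toward 0.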
Outline for this session
• Ricopili’s approach to imputation
– Reference alignment
– Efficient imputation
– Post-processing
• Usage and output structure
Output structure - Overview
Ricopili Output Structure
• QC1
  – Dosages
  – Very light QC
    • Info > 0.1
    • MAF > 0.005
• QC1f
  – Dosages failing light QC
  – Dosages are not kept, just SNP lists (meta-information found in subdir "info")
Ricopili Output Structure
• BG
  – Best guess genotypes
  – Light QC
    • Missing rate < 2%
• BGS
  – Best guess with stricter QC
    • Missing rate < 1%
    • MAF > 5%
• BGN
  – Best guess, no QC compared to qc1 dosages
Ricopili Output Structure
• Info
  – Info scores, original output files from the imputation algorithm (Impute2)
Ricopili Output Structure
Whole-genome best-guess genotypes:
Three units for each dataset (see above):
• bg
• bgn
• bgs
Ricopili Output Structure
pcaer_sub:
• Contains the collection of whole-genome best-guess genotypes with strict QC
• Contains a README with instructions on how to start the PCA pipeline to get:
  – Covariates over all datasets
  – Deduplicated ID collection over all datasets
Ricopili Output Structure
dasu_*:
• Used for intermediate dosages
• Important meta-files are kept and zipped; otherwise empty
errandout:
• Keeps output of jobs from mother scripts (not working scripts)
Blueprint_bak:
• Backup of job-starting commands; keeps the root directory clean without losing information
Ricopili Output Structure
pi_sub:
• Used for intermediate steps:
  – Aligning
  – Chunking
  – Prephasing
  – Imputation
• Important meta-files are kept and zipped
• Look for *job* files to get a list of scripts that were sent into the queue
• errandout:
  – Keeps output of jobs from working scripts
Output structure - Details
Detailed look at pi_sub (*job* files)
• buigue: guesses the genome build and lifts SNPs to hg19 if necessary; does not change the number of SNPs
  – *noma_comp: listing details comparing distinct builds
  – *noma: listing details for non-matched SNPs
  – *buigue: best matching build
  – *liftover_script: liftover script
  – *liftover: final liftover command (if it ran)
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.bim/bed/fam
Detailed look at pi_sub (*job* files)
• chepos ("checkpos6"):
  – Extracts rs-names from SNP names ("PsychChip_15048346_B|newrs111033171"): *.hg19.bim.ow.det
  – Extracts positions from SNP names ("PsychChip_15048346_B|chr12_57594552"): *.hg19.bim.ow.det
  – Translates a position into an SNP name if the SNP name is not found in the reference: *.hg19.bim.addpos.det
  – Translates an SNP name into a position: *.hg19.bim.xchr / xkb
  – Removes SNPs not found in the reference: *.hg19.bim.npos
  – All detailed files in a tarball: *.ch.tar.gz
  – Summary report: *.hg19.ch.report
  – Collection of commands: *.hg19.bim.chepos.cmd
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.ch.bim/bed/fam
Detailed look at pi_sub (*job* files)
• checkflip ("checkflip4"):
  – Flips unambiguous SNPs (non-AT, non-CG) (*.fli)
  – Ambiguous SNPs (AT, CG):
    • Very common ones (default MAF > 0.4) are removed (*.uif)
    • Others are flipped, aligning the minor allele (*.fli) ***
  – Removes non-matching alleles (*.xal): rs13172324 CT CG
  – Removes SNPs with a big frequency difference to the reference (default 15%) (*.bf) ***
  – All detailed files in a tarball: *.ch.fl.tar.gz
  – Summary report: *.hg19.ch.fl.report
  – Collection of commands: *.hg19.bim.chefli.cmd
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.ch.fl.bim/bed/fam
*** the reference population needs to fit
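The frequency-based part of this logic can be sketched for a single SNP, using the slide's default thresholds. `checkflip_decision` is an illustrative simplification, not checkflip4 itself:

```python
def checkflip_decision(freq_a1, ref_freq_a1, ambiguous,
                       maf_cut=0.4, freq_diff_cut=0.15):
    """Decide the fate of one SNP given its A1 frequency in the dataset
    and in the reference (thresholds are the slide's defaults)."""
    if ambiguous:
        if min(freq_a1, 1 - freq_a1) > maf_cut:
            return "remove_uninformative"   # *.uif: frequency too close to 0.5
        if (freq_a1 < 0.5) != (ref_freq_a1 < 0.5):
            return "flip"                   # *.fli: align the minor allele
    if abs(freq_a1 - ref_freq_a1) > freq_diff_cut:
        return "remove_freq_diff"           # *.bf: big frequency difference
    return "keep"
```

For ambiguous SNPs the strand cannot be read off the alleles, so near-0.5 frequencies leave the minor allele undecidable, which is why they are removed.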
Detailed look at pi_sub (*job* files)
• chuck ("my.chuck2"):
  – Extracts one chunk of the genome (based on the 929 chunks in the reference) and saves it in a subdir ("subfile_*")
  – Labels the chunk with genomic locations:
    • chr22_048_051: has SNPs on chromosome 22 from 47 Mb to 52 Mb; after imputation, 1 Mb on each side will be chopped off
  – If there are no SNPs, saves that information in a subdir ("empty_*")
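The label convention can be decoded with a small sketch (`parse_chunk_label` is a hypothetical helper): the label names the 3 Mb core in Mb, while imputation covers 1 Mb more on each side, which is chopped off afterwards.

```python
import re

def parse_chunk_label(label):
    """Decode a chunk label like 'chr22_048_051' into genomic regions (Mb)."""
    chrom, start, end = re.fullmatch(r"chr([^_]+)_(\d+)_(\d+)", label).groups()
    return {"chrom": chrom,
            "kept": (int(start), int(end)),          # 3 Mb core
            "imputed": (int(start) - 1, int(end) + 1)}  # core + 1 Mb overlap
```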
Detailed look at pi_sub (*job* files)
• prephasing ("my.preph"):
  – Prephases one chunk with Shapeit2 (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html)
  – Output directory: haps_*
  – All individuals together
  – Temporarily uses different individual IDs, since Shapeit does not accept long FIDs
  – After prephasing, splits into units of max. 1500 individuals (default)
  – Sets info about multithreading (so that it can increase on the next try)
  – Shapeit command: *.shapeit.cmd
  – Split command: *.split.cmd
Detailed look at pi_sub (*job* files)
• imputation ("my.imp2.3"):
  – Imputation of the prephased (and possibly split) chunk with the worldwide imputation reference
  – Impute2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html)
  – Output directory: pi_*
  – Original Impute2 info scores: *_info
  – All other Impute2 meta output files are kept as well
  – Impute2 command: *.imp2.cmd
Detailed look at pi_sub (*job* files)
• Dosage format ("haps2dos3"):
  – Converts 3 probabilities into two
  – Reintegrates original identifiers
  – Creates a file (ngt) that keeps information about imputed vs. genotyped SNPs
  – Output directory: ../dasu_*
  – Plink commands: *.dos.cmd
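The probability compression works because the three genotype probabilities sum to 1, so one of them is redundant. A minimal sketch with assumed helper names:

```python
def compress_probs(p_hom1, p_het, p_hom2):
    """Keep two of the three genotype probabilities; they sum to 1,
    so dropping one saves a third of the dosage disk space."""
    return (p_hom1, p_het)

def expand_probs(p_hom1, p_het):
    """Recover the dropped probability (rounded to absorb float noise)."""
    return (p_hom1, p_het, round(1 - p_hom1 - p_het, 10))
```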
Detailed look at pi_sub (*job* files)
• Postimputation QC and best guess ("daner_bg")
  – Output directory: ../dasuqc1_*
  – Subdir qc1
    • Dosages that passed very light QC (default: info > 0.1, MAF > 0.005)
  – Subdir qc1f
    • SNP lists of failed QC
  – BGN
    • Best guess (highest probability, default threshold 0.8), no further QC
    • For comparison to dosages
  – BG
    • Best guess genotypes, light QC, missing rate < 2%
    • For SNP analyses
  – BGS
    • Best guess with strict QC (missing rate < 1%, MAF > 5%)
    • For PCA
  – info
    • Info scores, original output files from the imputation algorithm (Impute2)
  – Plink commands: *.bg.cmd
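The qc1/qc1f split amounts to a simple per-SNP filter on the slide's default thresholds (the function name is an illustrative assumption):

```python
def passes_qc1(info, maf, info_th=0.1, freq_th=0.005):
    """Very light dosage QC (slide defaults); failing SNPs end up in qc1f
    as SNP lists only, their dosages are not kept."""
    return info > info_th and maf > freq_th
```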
Detailed look at cobg_dir_genome_wide (*job* files)
• Whole-genome best guess ("comb_bg_dir")
  – Combines all 929 chunks within each dataset for BGN, BG, and BGS
  – Brings all datasets into one subdir
  – Plink commands: *.wbg.cmd
Detailed look at pcaer_sub
• Whole-genome best guess with strict filtering
• README.pcaer with the command to start pcaer on postimputation best-guess genotypes with all datasets combined
Detailed look at clean.job_list
• Cleans temporary subdirs, removing unnecessary files and packing up meta-files
• du_out_*:
  – Lists all subdirs and their hard-disk usage
Detailed look at reference_info
• Lists version and location of the imputation reference; also lists all genomic chunks
• Important starting file for the --refiex option
imputation options (--help)
• --phase and --refdir: choose the imputation reference
• --popname: define the reference population for frequency checks
  – eur, asn, amr, afr, asw
  – --sfh: frequency threshold for excluding common ambiguous SNPs
  – --fth: frequency difference threshold to the reference
• --triset: imputation of trios
• --spliha: different max. number of individuals going into the imputation engine
imputation options, cont.
• Postimputation QC
  – --info_th: info score
  – --freq_th: MAF
  – --bg_th: minimum probability to call a best guess
• --noclean: keep all intermediate files (e.g. for debugging)
• --force1: if the pipeline stopped and the problem seems solved now, restart with this option
• --sjamem_incr: increase memory for working jobs (e.g. when the memory request doesn't seem to be enough for big single datasets, > 10K)
• --refiex: exclude chunks from imputation
  – Used with a copy of reference_info for single chunks (for debugging)
Wrap-up
• Questions?