COMPARISON OF SEQUENCING DATA WITH TARGETED GENOTYPING USING ILLUMINA EXOME CHIP V1.0 IN KORA
Internship Report II
Submitted by
Ashok Varadharajan
MSc Epidemiology
IBE – Institute for Medical Informatics, Biometry and Epidemiology
Ludwig-Maximilians-University (LMU) Munich
Supervised by
Dr. Martina Mueller-Nurasyid
Head of KORAgen
Head of research team “Cardiovascular Genetics and Population Structures”
Institute of Genetic Epidemiology
Helmholtz Zentrum München
Contents
1. INTRODUCTION
2. TASKS
3. DATA
3.1 Genotype from Exome Chip data (GE)
3.2 Genotype from sequencing data (GS)
4. METHODS
4.1 Descriptive statistics of Exome chip and Sequencing data
4.2 Extraction of overlapping individuals.
4.3 Extraction of overlapping SNPs.
4.4 Analyzing the consistency of data between the overlapped genotypes.
5. RESULTS
5.1 Descriptive statistics of genotypes from Exome chip data and Sequencing data
5.2 Extraction of overlapping individuals
5.3 Extraction of overlapping SNPs
5.4 Comparison of Exome and Sequence Data
6. DISCUSSION
7. REFERENCE
8. SUPPLEMENTARY
1. INTRODUCTION
High-throughput sequencing, also known as Next Generation Sequencing (NGS), has enabled researchers to
sequence the entire human genome efficiently. However, it is still financially impractical to use this
expensive technology for large-scale genome-wide association studies (GWAS). As a more economical
alternative to high-throughput sequencing, Illumina introduced the Human Exome BeadChips, referred to
here as the exome chip. It is an exome genotyping array that includes more than 240,000 markers
representing diverse populations (such as European, African, Chinese, and Hispanic individuals) and
common diseases (such as type 2 diabetes, cancer, metabolic and psychiatric disorders) (1, 2).
KORA (Cooperative Health Research in the Region of Augsburg) was initiated by the National Research
Center for Environment and Health and combines two characteristics: a population-based research design
on a regional basis and a cooperative research structure. The aim of the KORA platform is to use new as
well as existing studies, with their respective data and bio-samples, for future research projects in
epidemiology, health economics, and health care research (3).
KORA-gen is a resource for genetic epidemiological research based on KORA. Bio-samples, phenotypic
characteristics and environmental parameters of 18,000 adults from Augsburg and the surrounding
counties are available. The participants were 25 to 74 years old at recruitment and 30 to 90 years old
in 2005. The aim of KORA-gen is to combine knowledge of the genetic background with classical
epidemiological research and to investigate the interaction of genes and the environment in order to
understand complex diseases (4).
In the population-based KORA F4 study, various genotyping arrays have been typed. Recently, a subset
of the study has also been processed in a whole-genome sequencing effort. In parallel, the Illumina
Exome chip v1.0 has been used for targeted typing of rare variants in exonic regions. Calling genotypes
from this chip is critical, and there are currently two versions of the exome chip data, following two
guidelines from different consortia. The focus of this internship is to assess the reliability of the
genotypes called with the Illumina Exome chip v1.0 by comparing them with variants called from
sequencing data.
2. TASKS
Understanding of Exome chip data and Sequencing data.
Descriptive statistics of genotype data obtained from Illumina Exome chip v1.0 and sequencing data.
Identification of overlapping individuals.
Identification of overlapping genotypes.
Analyzing the consistency of the data between the overlapped genotypes.
3. DATA
3.1 Genotype from Exome Chip data (GE)
Study samples from the KORA study were processed on the Human Exome BeadChip (version 1.0), which
queries about 247,870 variable sites. Two-channel raw data files (.idat) were obtained for all samples.
Clustering, genotype calling and quality control were done in a series of steps combining tools such as
Illumina GenomeStudio (GenCall) and zCall, with strict quality criteria such as call rate (>98%),
heterozygosity (>1% SNPs), gender discordance, GWAS discordance, fingerprint discordance, PCA outliers
(using ancestry informative markers) etc. Finally, genotypes for all samples were obtained as a PLINK
binary ped file (*.bed), which is accompanied by a pedigree/phenotype information file (*.fam) and a
mapping information file (*.bim). The PLINK software (5) provides convenient features for analyzing
these files. Genotype data from the exome chip is hereafter referred to as “GE”.
3.2 Genotype from sequencing data (GS)
The whole genomes of the study samples were sequenced using Next Generation Sequencing (NGS)
technology. In this method, the genome is randomly fragmented and the fragments are sequenced as short
reads, which are stored in FASTQ files. Each FASTQ file contains a list of sequence reads together with
a quality score (a measure of uncertainty) for each base call. In the next step, SNPs are called from
these reads, which comprises several steps such as quality filtering of reads, alignment to the
reference genome, recalibration of per-base quality scores, genotype calling and SNP calling. In this
study, SNP calling from the NGS reads was done using MaCH/Thunder. Finally, the variations with respect
to the reference genome were stored in Variant Call Format (VCF). VCF is a text file format that
contains meta-information lines, a header line and then data lines, each containing information about
one variant with respect to the reference genome and the corresponding genotypes of the samples. Since
the whole genomes of about 2657 individuals were sequenced, the VCF file is extremely large. Therefore,
variants were stored separately for each chromosome and compressed in *.vcf.gz format. Care should be
taken while handling such large data files. Genotype data from sequencing is hereafter referred to as
“GS”.
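The VCF layout described above (meta-information lines, one header line, then one data line per variant) can be illustrated with a minimal sketch that prints the genotype (GT) of every sample for each variant. The helper name `vcf_genotypes` and the two-sample toy record are illustrative assumptions and not part of the study data; the sketch also assumes that GT is the first FORMAT sub-field, as the VCF specification requires when GT is present.

```shell
# Sketch: list per-sample genotypes from VCF text read on stdin.
# vcf_genotypes is a hypothetical helper name, not a vcftools command.
vcf_genotypes() {
  awk -F'\t' '
    /^##/ { next }                                # meta-information lines
    /^#CHROM/ {                                   # header line: sample names start at column 10
      for (i = 10; i <= NF; i++) sample[i] = $i
      next
    }
    {
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                         # GT is the first sub-field
        print $1 ":" $2, $4 ">" $5, sample[i], f[1]
      }
    }'
}

# Toy input (made-up record with two samples S1 and S2).
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT:DS\t0|1:1.0\t1|1:2.0\n' | vcf_genotypes
```

For the real per-chromosome files, the same function could be fed from `zcat file.vcf.gz`, although for files of this size a dedicated tool such as vcftools is the more practical choice.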
4. METHODS
4.1 Descriptive statistics of Exome chip and Sequencing data
Understanding the data and controlling the quality of both GE and GS data are important prerequisites
for any statistical analysis. Descriptive statistics such as missingness, minor allele frequency (MAF),
the number of monomorphic SNPs, mean call rate etc. provide comprehensive knowledge about the quality
of the data. Quality control criteria such as genotyping batch quality, gender checks and sex
chromosome aneuploidy, chromosomal aberrations, relatedness, population structure, genotyping
completeness and accuracy, Mendelian errors, Hardy-Weinberg equilibrium etc. are often considered when
analyzing GWAS data (7, 8). Per-individual quality control includes the identification of individuals
with discordant sex information, outlying missing genotype or heterozygosity rates, duplicated or
related individuals, and individuals of divergent ancestry (6). Per-marker quality control includes the
identification of SNPs with an excessive missing genotype rate, significant deviation from
Hardy-Weinberg equilibrium (HWE), significantly different missing genotype rates between cases and
controls, or a low minor allele frequency (MAF) (6). The PLINK software (5), vcftools (9), R and
ggplot2 were used to generate the descriptive statistics; the corresponding scripts can be found in the
supplementary section.
4.2 Extraction of overlapping individuals.
The whole genome had been sequenced for a subset of about 117 individuals from the KORA F4 study.
Samples have different identifiers in the GE and GS data. A link file was provided that lists each
sample ID from the GE data together with its corresponding sample ID in the GS data. Overlapping
individuals were extracted using PLINK (‘--keep’ option) for the GE data and vcftools (‘vcf-subset’
tool) for the GS data.
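The translation of the link file into the two sample lists could look roughly like the sketch below. It assumes a link file with two whitespace-separated columns (GE sample ID, GS sample ID) and no header, which is an assumption about the file layout; the helper name `make_keep_lists`, the file names and the IDs are hypothetical. PLINK's `--keep` expects family ID and individual ID, so these are taken from the .fam file.

```shell
# Sketch: derive a PLINK --keep list (FID IID) and a vcf-subset sample list
# from a two-column link file. File layout and names are assumptions.
make_keep_lists() {
  link=$1; fam=$2; keep_out=$3; samples_out=$4
  # GE side: keep .fam rows whose IID (column 2) appears in column 1 of the link file.
  awk 'NR == FNR { ge[$1]; next } ($2 in ge) { print $1, $2 }' "$link" "$fam" > "$keep_out"
  # GS side: one sequencing sample ID per line.
  awk '{ print $2 }' "$link" > "$samples_out"
}

# Toy example with made-up IDs.
tmp=$(mktemp -d)
printf 'GE001 SEQ_A\nGE003 SEQ_C\n' > "$tmp/link.txt"
printf 'F1 GE001 0 0 1 -9\nF2 GE002 0 0 2 -9\nF3 GE003 0 0 1 -9\n' > "$tmp/toy.fam"
make_keep_lists "$tmp/link.txt" "$tmp/toy.fam" "$tmp/keep.txt" "$tmp/samples.txt"
cat "$tmp/keep.txt" "$tmp/samples.txt"
```

The first output file would then be passed to `plink --keep` and the second to `vcf-subset -c`.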
4.3 Extraction of overlapping SNPs.
To evaluate the reliability of the GE data, the SNPs overlapping between GE and GS data must be
extracted. Chromosome positions were used as the matching key, so it was important to make sure that
both data sets were called against the same reference genome. GRCh37 was used as the reference genome
for calling genotypes from the sequencing data. For the GE data, there is a recent update of the
mapping file from the CHARGE consortium which uses hg19 (GRCh37) as the reference genome. Therefore,
the mapping file of the exome data was updated with the new annotation file obtained from the CHARGE
consortium. Once the reference genomes were in sync, the list of positions corresponding to the SNPs
overlapping between the two data sets was obtained using a simple awk script. Using this list, the GS
and GE data were filtered so that they contain only these matching SNPs. Additionally, INDELs were
removed from the GS data.
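The position-based matching can be sketched on toy data as follows. This is a simplified illustration and not the script used in the study (that one is in the supplementary section): the helper name `overlap_positions` and the toy records are hypothetical, and the sketch assumes that both files use the same chromosome coding.

```shell
# Sketch: print chr:pos keys present in both a PLINK .bim file and a VCF.
# VCF records whose REF or ALT is longer than one base are skipped, which
# drops INDELs (and, as a side effect, multi-allelic records such as "G,T").
overlap_positions() {
  bim=$1; vcf=$2
  awk '
    NR == FNR { chip[$1 ":" $4]; next }           # .bim: chromosome = column 1, position = column 4
    /^#/ { next }                                 # skip VCF meta-information and header lines
    length($4) == 1 && length($5) == 1 && (($1 ":" $2) in chip) { print $1 ":" $2 }
  ' "$bim" "$vcf"
}

# Toy example (made-up records).
tmp=$(mktemp -d)
printf '1 exm1 0 1000 A G\n1 exm2 0 2000 C T\n2 exm3 0 1000 A C\n' > "$tmp/toy.bim"
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t1000\t.\tA\tG\n1\t1500\t.\tC\tT\n1\t2000\t.\tC\tCT\n2\t1000\t.\tA\tC\n' > "$tmp/toy.vcf"
overlap_positions "$tmp/toy.bim" "$tmp/toy.vcf"
```

Using a separator between chromosome and position in the key avoids accidental collisions, for example between chromosome 1 at position 12345 and chromosome 11 at position 2345.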
4.4 Analyzing the consistency of data between the overlapped genotypes.
Once the overlapping individuals and the overlapping SNPs were extracted, GE and GS data were compared
using PLINK with the ‘--bmerge’ option. Since the GS data is in VCF format, it was converted to PLINK
format using the PLINK/SEQ (pseq) tool before the comparison. It was essential to check for strand
flipping before the consistency analysis. If a SNP has a homozygous ‘C’ genotype in the GE data and a
homozygous ‘G’ genotype in the GS data, or vice versa, the calls are not considered discordant; such
differences are more likely due to the two data sets reporting alleles on opposite strands. The same
flipping issue can occur for homozygous ‘A’ and ‘T’ genotypes. Flipping correction was done with PLINK
itself using the ‘--flip’ option. Some SNPs that could not be flip-corrected with PLINK were corrected
manually using R. Chromosomes X, Y, XY and MT were not considered in the comparison since they are
complex to interpret. The PLINK output was analyzed using R.
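The flipping rule described above can be written down as a small classification of genotype pairs. The following is a minimal sketch under the assumption that each genotype is given as two allele letters; the helper name `classify_pair` is hypothetical and the sketch is not the PLINK or R code used in the study.

```shell
# Sketch: classify a pair of unphased genotypes (two allele letters each),
# e.g. "CC" in GE versus "GG" in GS. classify_pair is a hypothetical helper.
classify_pair() {
  awk -v ge="$1" -v gs="$2" '
    function comp(b) {                            # complementary base
      if (b == "A") return "T"
      if (b == "T") return "A"
      if (b == "C") return "G"
      if (b == "G") return "C"
      return b
    }
    function norm(g,   a, b) {                    # order alleles so that "GA" equals "AG"
      a = substr(g, 1, 1); b = substr(g, 2, 1)
      return (a <= b) ? (a b) : (b a)
    }
    function flip(g) { return comp(substr(g, 1, 1)) comp(substr(g, 2, 1)) }
    BEGIN {
      if (norm(ge) == norm(gs))            print "concordant"
      else if (norm(flip(ge)) == norm(gs)) print "strand_flip"
      else                                 print "discordant"
    }'
}

classify_pair CC GG   # prints strand_flip
classify_pair AG GA   # prints concordant
classify_pair AA GG   # prints discordant
```

For A/T and C/G SNPs a single pair of calls cannot distinguish a strand flip from a true discordance, so for such SNPs the decision has to be made per marker rather than per call.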
5. RESULTS
5.1 Descriptive statistics of genotypes from Exome chip data and Sequencing data
In the GE (genotype from exome chip) data, 2921 individuals were genotyped using the Human Exome Chip
v1.0 at 267389 variants. In the GS (genotype from sequencing) data, the whole genomes of 2657
individuals were sequenced and 24778233 variants were called using GRCh37 as the reference genome.
Table 1 Summary of exome chip and sequencing data
Summary                           Exome chip data   Sequence data
Total number of persons           2921              2657
Total number of variants          267389            24778233
Mean call rate                    0.9999324         1
Mean MAF                          0.03017911        0.05269535
Number of monomorphic sites       125883            274931
Number of singletons (MAC = 1)    23192             6474420
Number of doubletons (MAC = 2)    11692             2381449
The missing call rate per individual is the fraction of SNPs for which an individual has no genotype
call. It is an informative indicator of sample quality and genotype quality. Generally, individuals
with a missing call rate greater than 5% are considered to have poor genotype quality. According to
figure 1, all individuals in the GE data have a missing call rate below 1%. The highest outlier in the
distribution of missing call rates is about 0.75%, which indicates good genotype quality.
Figure 1 Missing rate by Individual from GE data
The missing call rate per SNP is the fraction of individuals for which a SNP has no genotype call. The
removal of suboptimal markers is an important quality control procedure in a GWA study. Removing such
markers reduces the number of false-positive SNPs and increases the probability of identifying true
associations with disease risk (6). Generally, markers with a missing rate greater than 5% are
considered poorly genotyped. As figure 2 shows, all markers have a missing rate below 5%. Only a few
markers have a missing rate of about 3%, indicating that the markers in the GE data are of good
quality.
Figure 2 Missing rate by SNP from GE data
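The two missing call rates defined above differ only in the direction in which missing calls are counted. The following minimal sketch computes both from a toy genotype matrix; the matrix layout (one row per individual, an ID followed by one genotype per SNP, with "00" marking a missing call) and the helper name `missing_rates` are assumptions made for illustration. In the study itself these values come from `plink --missing`.

```shell
# Sketch: per-individual and per-SNP missing call rates from a toy genotype
# matrix read on stdin ("00" = missing call). The layout is an assumption.
missing_rates() {
  awk '
    {
      miss = 0
      for (j = 2; j <= NF; j++) if ($j == "00") { miss++; snp_miss[j]++ }
      printf "IND %s %.4f\n", $1, miss / (NF - 1)     # fraction of SNPs missing for this individual
      nsnp = NF - 1
    }
    END {
      for (j = 2; j <= nsnp + 1; j++)                 # fraction of individuals missing for each SNP
        printf "SNP %d %.4f\n", j - 1, snp_miss[j] / NR
    }'
}

# Toy example: 4 individuals, 4 SNPs (made-up genotypes).
printf 'i1 AA AG 00 GG\ni2 AA 00 00 GG\ni3 AG AG CC GG\ni4 AA AA CC 00\n' | missing_rates
```

The call rate discussed in the text is simply one minus the corresponding missing rate.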
The call rate, the complement of the missing rate, is also an indicator of sample quality. Individuals
with a low call rate (less than 99%) are considered to have poor genotyping efficiency and should be
eliminated from the analysis (8). This threshold may vary from study to study depending on the
genotyping platform, the quality of the DNA samples, and the variability in human and equipment error
in genotyping (8). As figure 3 shows, all individuals in the GE data have a call rate greater than 99%.
The mean call rate of the GE data is 0.9999324, indicating good genotyping efficiency. For the genotype
data from sequencing (GS), we assume that there are no missing genotypes since the whole genome was
sequenced, and the mean call rate for GS data is therefore taken to be 1.
Figure 3 Distribution of Call rate from GE data
Minor allele frequency (MAF) refers to the frequency at which the less common allele occurs in a given
population. SNPs with a MAF below 5% are usually not considered in a GWAS, as very high statistical
power is required to detect a true association for them (6). The distribution of MAF in GE and GS can
be seen in figure 4 and figure 5 respectively. Monomorphic SNPs are markers at which the genotype is
the same for all individuals, irrespective of cases and controls. The number of monomorphic SNPs was
125883 for the GE data and 274931 for the GS data. In the GS data, only SNPs whose genotype is the same
in all individuals but different from the reference genome are counted as monomorphic, because the VCF
file does not include sites at which all individuals carry the reference genotype. Monomorphic SNPs do
not contain any meaningful information and can therefore be excluded from the analysis. Similarly,
singletons (23192 in GE and 6474420 in GS) and doubletons (11692 in GE and 2381449 in GS) could also be
excluded from the analysis.
Figure 4 Distribution of Minor Allele Frequency plot from GE data
Figure 5 Distribution of Minor Allele Frequency plot from GS data
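Monomorphic sites, singletons and doubletons correspond to a minor allele count (MAC) of 0, 1 and 2. With 2921 individuals and no missing calls, a singleton has MAF = 1/(2*2921) ≈ 0.000171 and a doubleton has MAF = 2/(2*2921) ≈ 0.000342. The following minimal sketch derives the MAC from the MAF and NCHROBS columns of a PLINK .frq table instead of using fixed MAF thresholds; the helper name `mac_classes` and the toy rows are hypothetical, and this is an alternative to, not a copy of, the threshold-based script in the supplementary section.

```shell
# Sketch: count monomorphic sites, singletons and doubletons from a PLINK .frq
# table (columns: CHR SNP A1 A2 MAF NCHROBS) read on stdin.
mac_classes() {
  awk '
    NR == 1 { next }                              # skip the header line
    $5 == "NA" { next }                           # SNPs without any observed genotype
    {
      mac = int($5 * $6 + 0.5)                    # minor allele count = MAF * observed chromosomes, rounded
      if (mac == 0) mono++
      else if (mac == 1) single++
      else if (mac == 2) dbl++
    }
    END { printf "monomorphic=%d singletons=%d doubletons=%d\n", mono, single, dbl }'
}

# Toy example (made-up rows; 5842 = 2 * 2921 observed chromosomes).
printf 'CHR SNP A1 A2 MAF NCHROBS\n1 s1 0 G 0 5842\n1 s2 A G 0.0001712 5842\n1 s3 A G 0.0003423 5842\n1 s4 A G 0.25 5842\n1 s5 T C 0.0001713 5838\n' | mac_classes
```

Using NCHROBS keeps the classification correct for SNPs with a few missing calls, where the MAF of a singleton is slightly larger than 1/(2*2921).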
5.2 Extraction of overlapping individuals
In total, 117 individuals were identified as matching between GE and GS data using the link file
provided. The size of the GS data was greatly reduced by extracting 117 of the 2657 individuals. As
mentioned before, PLINK and vcftools were used for the extraction from GE and GS data respectively.
The script used for extracting the overlapping individuals can be found in the supplementary section.
5.3 Extraction of overlapping SNPs
Extracting the SNPs that match between GE and GS data involved two steps. In the first step, the
positions common to GE and GS data were extracted. In the second step, GE and GS data were filtered
using these positions. In total, 16957 SNPs were found to match between GE and GS data, excluding the X
chromosome. The largest numbers of matching SNPs were observed on chromosomes 6 and 8. A summary of the
number of matching SNPs per chromosome can be found in table 2.
5.4 Comparison of Exome and Sequence Data
Once the matching SNPs were extracted, GE and GS data were compared using PLINK with the ‘--bmerge’
option. The GS data, which is in VCF format, was converted to PLINK bed format using the PLINK/SEQ
(pseq) tool. Most of the error SNPs reported by PLINK were resolved by PLINK's own flipping
correction. The remaining error SNPs are due to PLINK being unable to handle triallelic markers; these
SNPs were excluded from the consistency analysis. The script used for the comparison can be found in
the supplementary section. Out of 1983969 overlapping SNP calls (number of matching SNPs multiplied by
the number of individuals), 244677 were reported as discordant by PLINK. Manual inspection of the data
showed that some SNPs were still discordant due to the flipping problem. After manual correction of
these, 1859 discordant SNP calls remained and the overall concordance rate was estimated as 0.999.
Chromosomes X, Y, XY and MT were not taken into consideration since they are complex to interpret. A
summary of discordant SNP calls and the concordance rate for each chromosome can be found in table 2.
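The concordance rate reported here is the share of SNP calls that are not discordant, that is (total SNP calls - discordant SNP calls) / total SNP calls. As a minimal sketch, the calculation can be reproduced from the totals in Table 2; the helper name `concordance_rate` is hypothetical.

```shell
# Sketch: concordance rate = (total SNP calls - discordant SNP calls) / total SNP calls.
concordance_rate() {
  awk -v total="$1" -v disc="$2" 'BEGIN { printf "%.6f\n", (total - disc) / total }'
}

concordance_rate 1983969 1859   # all autosomes after manual flip correction: 0.999063
concordance_rate 59319 51       # chromosome 1: 0.999140
```

The total number of SNP calls is the number of matching SNPs multiplied by the number of individuals, for example 507 * 117 = 59319 for chromosome 1.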
Of the 1859 discordant SNP calls, 111 were due to missing genotypes in the GE data (table 3). There are
28 homozygous discordant SNP calls, which are mismatches of the type AA-GG or CC-TT. Discordant calls
that are heterozygous in the GS data but homozygous in the GE data (627) are more frequent than those
that are heterozygous in the GE data but homozygous in the GS data (245). This suggests that genotype
calling from the exome chip is more conservative.
Table 2 Matching SNPs and concordance rate using PLINK
(Total SNP calls = No. of matching SNPs * No. of individuals; Discordant (flip-corrected) = discordant SNP calls after manual flip correction)
Chromosome   SNPs in GE  SNPs in GS  Matching SNPs  Total SNP calls  Discordant SNP calls  Discordant (flip-corrected)  Concordance rate
Chr1         27073       1847359     507            59319            8017                  51                           0.999140242
Chr2         19577       2055231     438            51246            4225                  95                           0.998146197
Chr3         15638       1736024     365            42705            3705                  100                          0.997658354
Chr4         10812       1734142     280            32760            3132                  28                           0.999145299
Chr5         16677       1533151     250            29250            3588                  56                           0.99808547
Chr6         16405       1765698     8161           954837           114580                588                          0.999384188
Chr7         11898       1361064     282            32994            3936                  37                           0.998878584
Chr8         9357        1550874     4114           481338           69894                 92                           0.999808866
Chr9         10981       1008432     205            23985            2564                  138                          0.994246404
Chr10        10212       1162142     225            26325            2436                  39                           0.998518519
Chr11        16826       1162933     342            40014            4358                  41                           0.998975359
Chr12        13226       1122975     249            29133            3356                  136                          0.995331754
Chr13        4739        872238      153            17901            1797                  17                           0.999050332
Chr14        8419        752302      139            16263            2098                  41                           0.99747894
Chr15        8959        677017      184            21528            1898                  29                           0.998652917
Chr16        11152       726483      237            27729            3524                  53                           0.998088644
Chr17        14155       615187      203            23751            2782                  219                          0.990779336
Chr18        4031        678085      128            14976            1385                  24                           0.998397436
Chr19        16047       456069      217            25389            3760                  16                           0.999369806
Chr20        6830        510900      107            12519            837                   27                           0.997843278
Chr21        3009        316194      65             7605             919                   16                           0.997896121
Chr22        5535        280901      106            12402            1886                  16                           0.998709886
Chr23/ChrX   5461        852832      -              -                -                     -                            -
Chr24/ChrY   142         -           -              -                -                     -                            -
Chr25/ChrXY  2           -           -              -                -                     -                            -
Chr26/chrMT  226         -           -              -                -                     -                            -
Total        267389      24778233    16957          1983969          244677                1859                         0.999063
Table 3 Discordant Genotypes of Exome chip Vs Sequence data after Flip correction
(rows: genotype in sequencing data; columns: genotype in exome chip data)
              Exome chip
Sequencing    Hom    Het    Missing
Hom           28     245    54
Het           627    0      57
Missing       0      0      0
6. DISCUSSION
The descriptive statistics of the genotype data from the exome chip (GE) and from sequencing (GS)
showed good genotype quality and efficiency. There are 267389 SNPs in the GE data and 24778233 SNPs in
the GS data. The mean call rate of the GE data is 0.9999324, and the mean call rate of the GS data is
taken to be 1. The number of monomorphic sites in the GE data is 125883, which is approximately 47% of
the total markers available. These markers do not contribute valuable information and can therefore be
excluded from the analysis.
Since the descriptive statistics indicate very good genotype quality, a very high concordance rate can
be expected between GE and GS. Genotype calls were compared for 117 individuals. The number of matching
SNPs between the two data sets was only 16957. This small number is due to the fact that the VCF
contains only those SNPs that differ from the reference genome. We have no information about the other
positions, as a position is excluded from the VCF if its genotype is identical to the reference genome
in all individuals or if it was not covered by sequencing.
The estimated concordance rate after flipping correction is 0.999, as expected. Only a small number of
homozygous discordant SNP calls (28) was found, suggesting very high genotyping quality of the exome
chip data. The SNPs from the GE data can therefore be considered reliable for GWAS. The number of
discordant SNP calls was 1859. These discordant calls are most likely due to errors made while calling
genotypes from the intensity data. The larger number of discordant calls that are heterozygous in the
GS data (627) suggests that genotype calling from the exome chip is conservative. It should also be
noted that the comparison was done on only 16957 SNPs, which is only about 6% of the markers in the GE
data.
As an internship student, I learned how to handle large data sets in a Linux environment, to read
sequencing data (VCF format) and genotyping data (PLINK/VCF format), to work with annotation, to
compare genotypes based on genomic position, to use R for descriptive statistics, and to document
genotyping results and consistency checks. Additionally, I learned many new things through the
challenges and issues faced during this period.
7. REFERENCE
1. Illumina. Human Exome BeadChips. http://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/datasheet_humanexome_beadchips.pdf.
2. Guo Y, He J, Zhao S, Wu H, Zhong X, Sheng Q. Illumina human exome genotyping array
clustering and quality control. Nature protocols. 2014;9(11):2643-62.
3. Holle R, Happich M, Lowel H, Wichmann HE, Group MKS. KORA--a research platform for
population based health research. Gesundheitswesen. 2005;67 Suppl 1:S19-25.
4. Wichmann HE, Gieger C, Illig T, Group MKS. KORA-gen--resource for population genetics,
controls and a broad spectrum of disease phenotypes. Gesundheitswesen. 2005;67 Suppl 1:S26-30.
5. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set
for whole-genome association and population-based linkage analyses. American journal of human
genetics. 2007;81(3):559-75.
6. Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT. Data quality control in genetic case-control association studies. Nature protocols. 2010;5(9):1564-73.
7. Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, et al. Quality control and
quality assurance in genotypic data for genome-wide association studies. Genetic epidemiology.
2010;34(6):591-602.
8. Turner S, Armstrong LL, Bradford Y, Carlson CS, Crawford DC, Crenshaw AT, et al. Quality
control procedures for genome-wide association studies. Current protocols in human genetics / editorial
board, Jonathan L Haines [et al]. 2011;Chapter 1:Unit1 19.
9. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call
format and VCFtools. Bioinformatics. 2011;27(15):2156-8.
8. SUPPLEMENTARY
8.1 Missingness
## Missingness
## Shell: per-individual (.imiss) and per-SNP (.lmiss) missing rates
plink --noweb --bfile ${EXOME_BED_FILE} --missing --out ${EXOME_PLINK_OUT_FILE}
## Shell: keep the header line and the SNPs with at least one missing call (N_MISS > 0)
awk 'NR==1 || $3>0' ${EXOME_PLINK_OUT_FILE}.lmiss > ${EXOME_PLINK_OUT_FILE}_only_missing.lmiss
## Shell: append the base-pair position from the .bim file as an extra column named BP
awk 'NR==FNR {h[$2] = $4; next} {print $0 "\t" (FNR==1 ? "BP" : h[$2])}' \
${EXOME_BED_FILE}.bim ${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108.lmiss \
> ${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108_manhattan.lmiss
## R: load plotting packages (manhattan() is assumed to come from the qqman package)
library(ggplot2)
library(qqman)
imiss<-read.table(file='exome_f4_charge_top_20130108.imiss',header=T)
lmiss<-read.table(file='exome_f4_charge_top_20130108.lmiss',header=T)
lmissonly<-read.table(file='exome_f4_charge_top_20130108_only_missing.lmiss',header=T)
lmissman<-read.table(file='exome_f4_charge_top_20130108_manhattan.lmiss',header=T)
#1. plot by individual
ggplot(imiss, aes(x=as.factor(IID),y=N_MISS)) + geom_bar(stat="identity") +
theme(axis.text.x = element_blank()) + theme(legend.position="none") +
labs(title="Individual Missing Genotypes.b") + ylab("Sum of Missing Genotypes") +
xlab("Individuals")
#2. plot by snp
manhattan(lmissman, p = "F_MISS", bp = "BP", logp = FALSE, ylab = "Missing Frequency",
ylim = c(0, max(lmissman$F_MISS)), genomewideline = FALSE,
suggestiveline = FALSE, main = "Missing Genotypes by SNP - Manhattan plot")
8.2 Mean call rate
## Mean Call rate
#1. mean call rate by individual
meanInd<-mean(1-imiss$F_MISS)
hist(1-imiss$F_MISS, xlab="Callrate",ylab="Number of individuals",
main="Callrate by individual", breaks=100)
#2. mean call rate by snp
meanSNP<-mean(1-lmiss$F_MISS, na.rm=T)
hist(1-lmiss$F_MISS, xlab="Callrate", ylab="Number of SNPs",main="Callrate by SNP", breaks=100)
8.3 Minor Allele Frequency (MAF)
## Minor Allele Frequency
plink --noweb --bfile ${EXOME_BED_FILE} --freq --out ${EXOME_PLINK_OUT_FILE}
## Shell: append the base-pair position from the .bim file as an extra column named BP
awk 'NR==FNR {h[$2] = $4; next} {print $0 "\t" (FNR==1 ? "BP" : h[$2])}' ${EXOME_BED_FILE}.bim \
${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108_freq.frq > \
${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108_freq_manhattan.frq
## R: read the frequency report and its position-annotated copy
freq<-read.table(file='exome_f4_charge_top_20130108.frq',header=T)
freqman<-read.table(file='exome_f4_charge_top_20130108_freq_manhattan.frq',header=T)
#1. MAF Histogram
hist(freq$MAF, xlab="Minor Allele Frequency",ylab="Number of SNPs",main="Minor Allele Frequency",
breaks=3000, xlim=c(0,0.01))
#2. MAF Manhattan
manhattan(freqman, p = "MAF", bp = "BP", logp = FALSE, ylab = "Minor Allele Frequency",
ylim = c(0,1), genomewideline = FALSE, suggestiveline = FALSE,
main = "Minor Allele Frequency - Manhattan plot")
8.4 Monomorphic and singleton SNPs
## Monomorphic SNPs (column 5 of the .frq file is the MAF)
awk '$5==0' ${EXOME_PLINK_OUT_FILE}_freq.frq > ${EXOME_PLINK_OUT_FILE}_monomorphic.frq
wc -l ${EXOME_PLINK_OUT_FILE}_monomorphic.frq
## Singleton SNPs
## 1/(2*2921) = 0.0001711743
awk '$5<=0.00018 && $5>0' ${EXOME_PLINK_OUT_FILE}_freq.frq > ${EXOME_PLINK_OUT_FILE}_singleton.frq
wc -l ${EXOME_PLINK_OUT_FILE}_singleton.frq
## Doubleton SNPs (the lower bound excludes the singletons counted above)
## 2/(2*2921) = 0.0003423485
awk '$5<=0.00035 && $5>0.00018' ${EXOME_PLINK_OUT_FILE}_freq.frq > ${EXOME_PLINK_OUT_FILE}_doubletons.frq
wc -l ${EXOME_PLINK_OUT_FILE}_doubletons.frq
8.5 Extraction of overlapping individuals.
## Exome Data
plink --noweb --bfile ${EXOME_BED_FILE} --keep ${EXOME_INDIVIDUAL_MATCHED_DIR}/exome_sample_list.txt \
--make-bed --out ${EXOME_INDIVIDUAL_MATCHED_DIR}/exome_f4_charge_filtered_individuals
## Sequence Data
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
do
vcf-subset -c ${SEQ_INDIVIDUAL_MATCHED_VCF_DIR}/samples.txt \
${SEQUENCE_DIR}/GoT2D.chr${i}.paper_integrated_snps_indels_sv_beagle_thunder.vcf.gz | \
bgzip -c > ${SEQ_INDIVIDUAL_MATCHED_VCF_DIR}/individual_matched_chr${i}.vcf.gz
done
8.6 Identification of overlapping SNPs.
## Exome Data
##1. Extracting matching reference positions list (key = chromosome:position)
grep -h -v '^#' ${SEQ_POSITION_MATCHED_VCF_AWK_DIR}/individual_snp_matched_chr*.vcf \
| cut -f1,2 | sort -k 1 -k 2 - | uniq | \
awk 'FNR==NR{a[$1":"$2]++;next} {if(($3":"$4) in a){print $1}}' - \
${EXOME_NEW_ANNOTATION_FILE_FILTERED} | \
sort | uniq > ${EXOME_SNP_MATCHED_DIR}/snpid_list.txt
##2. Filter Exome plink data using above list
plink --noweb --bfile \
${EXOME_INDIVIDUAL_MATCHED_DIR}/exome_f4_charge_filtered_individuals \
--extract ${EXOME_SNP_MATCHED_DIR}/snpid_only_list.txt --make-bed --out \
${EXOME_SNP_MATCHED_DIR}/exome_f4_charge_filtered_ind_snp
## Sequence Data
##1. Extract chromosome positions from exomechip data
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
do
awk -v i="${i}" '{if($1==i) print $4}' ${EXOME_BED_FILE}.bim > \
${SEQ_PLINK_POSITIONS_DIR}/exomechip_positions_chr${i}.txt
done
##2. Extracting snp from VCF which matches with the above extracted positions
#a.Using AWK command (header lines are kept, data lines only if the position is on the chip)
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
do
zcat ${SEQUENCE_DIR}/GoT2D.chr${i}.paper_integrated_snps_indels_sv_beagle_thunder.vcf.gz | \
awk '
FNR==NR{a[$0];next}
{
if($1 ~ /^#/)
{
print $0
}else if ($2 in a)
{
print $0
}
}
' ${SEQ_PLINK_POSITIONS_DIR}/exomechip_positions_chr${i}.txt - > \
${SEQ_POSITION_MATCHED_VCF_DIR}/position_matched_chr${i}.vcf
done
8.7 Analyzing the consistency of data between the overlapped genotypes.
## Consistency of Exome and Sequence Data
##1. Converting VCF to BED
pseq008 ${HOMEDIR}/plinkseq/vcftobed new-project
pseq008 ${HOMEDIR}/plinkseq/vcftobed load-vcf --vcf \
${SEQ_POSITION_MATCHED_VCF_RENAMED_DIR}/individual_snp_matched_renamed_chr*.vcf
pseq008 ${HOMEDIR}/plinkseq/vcftobed write-ped --name ${SEQ_VCF_TO_BED}/GoT2D
plink --noweb --tfile ${SEQ_VCF_TO_BED}/GoT2D --make-bed --out ${SEQ_VCF_TO_BED}/GoT2D
## Replace the SNP IDs in the .bim file with the exome chip IDs, matched by chromosome:position
awk 'BEGIN { FS = "\t" ;OFS="\t" }; { gsub("\\<X\\>", "23", $3) ; gsub("\\<Y\\>", "24", $3) ;
gsub("\\<XY\\>", "25", $3) ; gsub("\\<MT\\>", "26", $3) ; print $0}' \
${EXOME_NEW_ANNOTATION_FILE_FILTERED} | \
awk 'NR==FNR {h[$3":"$4] = $1; next} {print $1,h[$1":"$4],$3,$4,$5,$6}' - \
${SEQ_VCF_TO_BED}/GoT2D.bim_bkp > ${SEQ_VCF_TO_BED}/GoT2D.bim
##2. Flipping Correction
plink --noweb --bfile ${EXOME_SNP_MATCHED_UPDATED_DIR}/exome_f4_charge_filtered_ind_snp_map --flip \
${COMPARISON_RESULTS_PLINK_DIR}/exome_seq.missnp --make-bed --out \
${COMPARISON_RESULTS_PLINK_DIR}/1.flip/exome_f4_charge_flipped
plink --noweb --bfile ${COMPARISON_RESULTS_PLINK_DIR}/1.flip/exome_f4_charge_flipped --exclude \
${COMPARISON_RESULTS_PLINK_DIR}/1.flip/comp_exome_seq_flip.missnp --make-bed --out \
${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/exome_f4_charge_flipped_excluded
plink --noweb --bfile ${SEQ_VCF_TO_BED}/GoT2D --exclude \
${COMPARISON_RESULTS_PLINK_DIR}/1.flip/comp_exome_seq_flip.missnp --make-bed --out \
${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded
##3. Comparing Exome and Sequence data
plink --noweb --bfile \
${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/exome_f4_charge_flipped_excluded \
--bmerge ${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded.bed \
${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded.bim \
${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded.fam \
--merge-mode 6 --out ${COMPARISON_PLINK_OUTPUT}/comp_exome_seq_flip_excluded