internship report 2

COMPARISON OF SEQUENCING DATA WITH TARGETED GENOTYPING USING ILLUMINA EXOME CHIP V1.0 IN KORA

Internship Report II

Submitted by

Ashok Varadharajan

MSc Epidemiology

IBE – Institute for Medical Informatics, Biometry and Epidemiology

Ludwig-Maximilians-University (LMU) Munich

Supervised by

Dr. Martina Mueller-Nurasyid

Head of KORAgen

Head of research team “Cardiovascular Genetics and Population Structures”

Institute of Genetic Epidemiology

Helmholtz Zentrum München

Contents

1. INTRODUCTION
2. TASKS
3. DATA
   3.1 Genotype from Exome Chip data (GE)
   3.2 Genotype from sequencing data (GS)
4. METHODS
   4.1 Descriptive statistics of Exome chip and Sequencing data
   4.2 Extraction of overlapping individuals.
   4.3 Extraction of overlapping SNPs.
   4.4 Analyzing the consistency of data between the overlapped genotypes.
5. RESULTS
   5.1 Descriptive statistics of genotypes from Exome chip data and Sequencing data
   5.2 Extraction of overlapping individuals
   5.3 Extraction of overlapping SNPs
   5.4 Comparison of Exome and Sequence Data
6. DISCUSSION
7. REFERENCE
8. SUPPLEMENTARY

1. INTRODUCTION

High-throughput sequencing, also known as next-generation sequencing (NGS), has enabled researchers to sequence the entire human genome efficiently. However, it is still financially impractical to use this expensive technology for large-scale genome-wide association studies (GWAS). As a more economical alternative, Illumina introduced the HumanExome BeadChips, referred to here as the exome chip. It is an exome genotyping array that includes more than 240,000 markers representing diverse populations (such as European, African, Chinese, and Hispanic individuals) and common diseases (such as type 2 diabetes, cancer, and metabolic and psychiatric disorders) (1, 2).

KORA (Cooperative Health Research in the Region of Augsburg) was initiated by the National Research Center for Environment and Health to combine two characteristics: a population-based research design on a regional basis and a cooperative research structure. The aim of the KORA platform is to use new as well as existing studies, with their respective data and bio-samples, for future research projects in epidemiology, health economics, and health care research (3).

KORA-gen is a resource for genetic epidemiological research based on KORA. Bio-samples, phenotypic characteristics, and environmental parameters of 18,000 adults from Augsburg and the surrounding counties are available. The participants were 25 to 74 years old at recruitment and 30 to 90 years old in 2005. The aim of KORA-gen is to combine knowledge of the genetic background with classical epidemiological research and to investigate the interaction of genes and the environment in order to understand complex diseases (4).

In the population-based KORA F4 study, various genotyping arrays have been typed. Recently, a subset of the study has also been processed in a whole-genome sequencing effort. In parallel, the Illumina Exome chip v1.0 has been used for targeted typing of rare variants in exonic regions. The calling of genotypes from this chip is critical, and there are currently two versions of the exome chip calls, each following the guidelines of a different consortium. The focus of this internship is to assess the reliability of the genotypes called with the Illumina Exome chip v1.0 by comparing them with the variants called from the sequencing data.

2. TASKS

Understanding of Exome chip data and Sequencing data.

Descriptive statistics of genotype data obtained from Illumina Exome chip v1.0 and sequencing data.

Identification of overlapping individuals.

Identification of overlapping genotypes.

Analyzing the consistency of the data between the overlapped genotypes.

3. DATA

3.1 Genotype from Exome Chip data (GE)

Study samples from the KORA study were processed on the HumanExome BeadChip (version 1.0), which queries about 247,870 variable sites. Two-channel raw data files (.idat) were obtained for all samples. Clustering, genotype calling, and quality control were carried out in a series of steps that combined Illumina GenomeStudio (GenCall) and zCall, with strict exclusion criteria such as call rate (>98%), heterozygosity (>1% SNPs), gender discordance, GWAS discordance, fingerprint discordance, and PCA outliers (using ancestry-informative markers). Finally, genotypes for all samples were obtained as a binary PED file (*.bed), which is accompanied by a pedigree/phenotype information file (*.fam) and a mapping information file (*.bim). PLINK (5) provides many features for analysing these files. Genotype data from the exome chip are referred to as “GE” in the rest of this report.

3.2 Genotype from sequencing data (GS)

The whole genomes of the study samples were sequenced using next-generation sequencing (NGS) technology. In this method, the genome is randomly broken into small fragments, which are sequenced as short reads and stored in FASTQ files. Each FASTQ file contains a list of sequence reads together with a quality score (a measure of uncertainty) for each base call. In the next step, SNPs are called from these reads. This involves several steps: quality filtering of reads, alignment to the reference genome, recalibration of per-base quality scores, genotype calling, and SNP calling. In this study, SNP calling was done using MaCH/Thunder. Finally, the variations with respect to the reference genome were stored in Variant Call Format (VCF). A VCF file is a text file that contains meta-information lines, a header line, and data lines, each of which describes one variant relative to the reference genome together with the genotypes of the samples. Since the whole genomes of about 2657 individuals were sequenced, the VCF data are extremely large. Therefore, variants were stored separately for each chromosome and compressed in *.vcf.gz format. Care should be taken when handling such large files. Genotype data from sequencing are referred to as “GS” in the rest of this report.
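The VCF layout described above (meta-information lines, one header line, then one data line per variant) can be illustrated with a minimal Python sketch. It is not the code used in this internship; the VCF snippet, the sample names S1 to S3, and the function name read_vcf are made up for illustration.

```python
import io

# Made-up VCF snippet (not KORA data): two variants, three samples.
EXAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3
1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1
1\t2000\trs2\tC\tT\t.\tPASS\t.\tGT\t0|1\t./.\t0|0
"""


def read_vcf(handle):
    """Return (sample_ids, records) parsed from an open VCF text handle."""
    samples, records = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):   # skip blank and meta-information lines
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):           # header line: sample IDs start at column 10
            samples = fields[9:]
            continue
        gt_index = fields[8].split(":").index("GT")   # position of GT within FORMAT
        genotypes = [f.split(":")[gt_index] for f in fields[9:]]
        records.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4],
            "gt": dict(zip(samples, genotypes)),
        })
    return samples, records


samples, records = read_vcf(io.StringIO(EXAMPLE_VCF))
print(samples)
print(records[1]["gt"])
```

For a compressed per-chromosome file, the same function could be given a handle opened with gzip.open(path, "rt"), so that the file is read line by line rather than loaded into memory at once.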

4. METHODS

4.1 Descriptive statistics of Exome chip and Sequencing data

Understanding the data and controlling the quality of both the GE and GS data are important prerequisites for any statistical analysis. Descriptive statistics such as missingness, minor allele frequency (MAF), the number of monomorphic SNPs, and the mean call rate give a comprehensive picture of data quality. Quality control criteria such as genotyping batch quality, gender checks and sex chromosome aneuploidy, chromosomal aberrations, relatedness, population structure, genotyping completeness and accuracy, Mendelian errors, and Hardy-Weinberg equilibrium are often considered when analysing GWAS data (7, 8). Quality control by individual includes the identification of individuals with discordant sex information, outlying missing genotype or heterozygosity rates, duplicated or related samples, and divergent ancestry (6). Quality control by marker includes the identification of SNPs with an excessive missing genotype rate, a significant deviation from Hardy-Weinberg equilibrium (HWE), a significant difference in missing genotype rates between cases and controls, or a low MAF (6). PLINK (5), vcftools (9), R, and ggplot2 were used to generate the descriptive statistics; the corresponding scripts can be found in the supplementary section.
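To make the descriptive statistics named above concrete, the following Python sketch computes missing rates and MAF on a toy genotype matrix. It is only an illustration under stated assumptions: genotypes are coded as the number of copies of one allele (0, 1, or 2), None marks a missing call, and the data and function names are made up. The actual analysis used PLINK and R (see the supplementary section).

```python
# Toy data (made up): rows are individuals, columns are SNPs.
# Each entry is the number of copies of one allele (0, 1, 2); None = missing call.
GENOTYPES = [
    [0, 1, 2, None],
    [0, 0, 2, 1],
    [0, 1, None, 1],
    [0, 2, 2, 0],
]


def missing_rate_per_individual(genotypes):
    """Fraction of SNPs without a call, for each individual."""
    return [sum(g is None for g in row) / len(row) for row in genotypes]


def missing_rate_per_snp(genotypes):
    """Fraction of individuals without a call, for each SNP."""
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    return [sum(row[j] is None for row in genotypes) / n_individuals
            for j in range(n_snps)]


def minor_allele_frequency(genotypes):
    """MAF per SNP, computed over non-missing calls only."""
    mafs = []
    for j in range(len(genotypes[0])):
        called = [row[j] for row in genotypes if row[j] is not None]
        if not called:
            mafs.append(None)
            continue
        freq = sum(called) / (2 * len(called))   # frequency of the counted allele
        mafs.append(min(freq, 1 - freq))         # the minor allele is the rarer one
    return mafs


imiss = missing_rate_per_individual(GENOTYPES)
lmiss = missing_rate_per_snp(GENOTYPES)
maf = minor_allele_frequency(GENOTYPES)
mean_call_rate = 1 - sum(imiss) / len(imiss)
monomorphic = [j for j, f in enumerate(maf) if f == 0]
print(imiss, lmiss, maf, mean_call_rate, monomorphic)
```

In this toy matrix the first and third SNPs come out as monomorphic, because every called genotype carries the same allele.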

4.2 Extraction of overlapping individuals.

Whole-genome sequencing had been carried out for a subset of about 117 individuals from the KORA F4 study. Samples have different identifiers in the GE and GS data. A link file was provided that lists each sample ID from the GE data together with its corresponding sample ID in the GS data. Overlapping individuals were extracted with PLINK (‘--keep’ option) for the GE data and with vcftools (‘vcf-subset’ script) for the GS data.
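The role of the link file can be sketched in Python as follows. The two-column layout of the link file, the sample IDs, and the function name are assumptions made for illustration; the real link file may be organised differently.

```python
# Hypothetical link file content: one "GE_ID GS_ID" pair per line.
LINK_LINES = ["GE001 GS_A", "GE002 GS_B", "GE003 GS_C"]

# Hypothetical sample IDs present in each data set.
GE_SAMPLES = {"GE001", "GE002", "GE004"}
GS_SAMPLES = {"GS_A", "GS_B", "GS_D"}


def overlapping_pairs(link_lines, ge_samples, gs_samples):
    """Return (GE ID, GS ID) pairs for individuals present in both data sets."""
    pairs = []
    for line in link_lines:
        ge_id, gs_id = line.split()
        if ge_id in ge_samples and gs_id in gs_samples:
            pairs.append((ge_id, gs_id))
    return pairs


pairs = overlapping_pairs(LINK_LINES, GE_SAMPLES, GS_SAMPLES)
print(pairs)
```

The GE IDs of such pairs would go into the sample list passed to PLINK, and the GS IDs into the sample list passed to vcf-subset.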

4.3 Extraction of overlapping SNPs.

To evaluate the reliability of the GE data, the SNPs that overlap between the GE and GS data must be extracted. Because chromosome and position were used as the matching key, both data sets must refer to the same reference genome. GRCh37 was used as the reference genome for genotype calling of the sequencing data. For the GE data, the CHARGE consortium recently released an updated mapping file that uses hg19 (GRCh37) as the reference genome. The mapping file of the exome data was therefore updated with this new annotation file. Once both data sets were on the same reference genome, the list of positions shared between them was obtained with a simple awk script. Using this list, the GS and GE data were filtered so that they contain only the matching SNPs. In addition, INDELs were removed from the GS data.
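The position-based matching can be sketched in Python as follows. This is an illustration only, not the awk script that was used; the SNP IDs, positions, and function names are made up. The sketch keys each variant on the pair (chromosome, position) and skips INDELs by requiring single-base REF and ALT alleles.

```python
# Made-up exome chip map entries: (chromosome, SNP ID, position).
GE_BIM = [
    ("1", "exm1", 1000),
    ("1", "exm2", 2000),
    ("12", "exm3", 3456),
    ("2", "exm4", 500),
    ("3", "exm5", 700),
]

# Made-up sequencing variants: (chromosome, position, REF, ALT).
GS_VARIANTS = [
    ("1", 1000, "A", "G"),
    ("1", 23456, "C", "T"),
    ("2", 500, "AT", "A"),   # deletion, i.e. an INDEL
    ("3", 700, "G", "A"),
]


def is_snp(ref, alt):
    """True for a biallelic single-base substitution; INDELs and multi-allelic sites fail."""
    return len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"


def overlapping_snp_ids(ge_bim, gs_variants):
    """Exome chip SNP IDs whose (chromosome, position) also occurs as a SNP in GS."""
    gs_positions = {(chrom, pos) for chrom, pos, ref, alt in gs_variants if is_snp(ref, alt)}
    return [snp_id for chrom, snp_id, pos in ge_bim if (chrom, pos) in gs_positions]


matched = overlapping_snp_ids(GE_BIM, GS_VARIANTS)
print(matched)
```

Using the pair as the key, rather than the chromosome and position glued together as one string, avoids false matches: position 23456 on chromosome 1 and position 3456 on chromosome 12 would both become "123456" when concatenated.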

4.4 Analyzing the consistency of data between the overlapped genotypes.

Once the overlapping individuals and the overlapping SNPs had been extracted, the GE and GS data were compared using PLINK with the ‘--bmerge’ option. Since the GS data are in VCF format, they were first converted to PLINK format using the Pseq tool. Before the consistency analysis, it was essential to check for strand flips. If a SNP has the homozygous genotype ‘C’ in the GE data and the homozygous genotype ‘G’ in the GS data, or vice versa, the two calls most likely report the same allele on opposite strands and are not counted as discordant. The same applies to the complementary alleles ‘A’ and ‘T’. Flipping was corrected with PLINK itself using the ‘--flip’ option. SNPs that could not be corrected with PLINK were corrected manually in R. Chromosomes X, Y, XY, and MT were not included in the comparison because their results are complex to interpret. The PLINK output was analysed in R.
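The idea behind the flip check can be sketched in Python. This is a simplified illustration, not the procedure implemented in PLINK or in the R script: the decision rule in needs_flip, the genotype representation (a tuple of two alleles, None for a missing call), and the example calls are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(genotype):
    """Report a genotype on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in genotype)


def alleles(calls):
    """Set of alleles observed in a list of genotype calls (None = missing)."""
    return {a for g in calls if g is not None for a in g}


def needs_flip(ge_calls, gs_calls):
    """True if the GE alleles only agree with the GS alleles after complementing them."""
    ge_set, gs_set = alleles(ge_calls), alleles(gs_calls)
    if ge_set <= gs_set or gs_set <= ge_set:
        return False                      # already compatible (or strand-ambiguous A/T, C/G)
    flipped = {COMPLEMENT[a] for a in ge_set}
    return flipped <= gs_set or gs_set <= flipped


def concordance(ge_calls, gs_calls):
    """Return (matching calls, compared calls) for one SNP, flipping GE if needed."""
    if needs_flip(ge_calls, gs_calls):
        ge_calls = [flip(g) if g is not None else None for g in ge_calls]
    compared = [(a, b) for a, b in zip(ge_calls, gs_calls)
                if a is not None and b is not None]
    same = sum(sorted(a) == sorted(b) for a, b in compared)
    return same, len(compared)


# Made-up calls for one SNP in four individuals.
ge = [("C", "C"), ("C", "T"), ("T", "T"), None]
gs = [("G", "G"), ("G", "A"), ("A", "G"), ("A", "A")]
result = concordance(ge, gs)
print(needs_flip(ge, gs), result)
```

Note that A/T and C/G SNPs look identical on both strands, so a rule based on allele names alone cannot detect a flip for them.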

5. RESULTS

5.1 Descriptive statistics of genotypes from Exome chip data and Sequencing data

In the GE (genotype from exome chip) data, 2921 individuals were genotyped with the Human Exome Chip v1.0 at 267389 variants. In the GS (genotype from sequencing) data, the whole genomes of 2657 individuals were sequenced and 24778233 variants were called using GRCh37 as the reference genome.

Table 1 Summary of exome chip data

Summary                         Exome Chip data   Sequence data
Total number of persons         2921              2657
Total number of variants        267389            24778233
Mean call rate                  0.9999324         1
Mean MAF                        0.03017911        0.05269535
Number of monomorphic sites     125883            274931
Number of singletons (MAC = 1)  23192             6474420
Number of doubletons (MAC = 2)  11692             2381449

The missing call rate per individual is the fraction of SNPs for which that individual has no genotype call. It is an informative indicator of both sample quality and genotype quality. Generally, individuals with a missing call rate greater than 5% are considered to have poor genotype quality. According to Figure 1, all individuals in the GE data have a missing call rate below 1%. The highest outlier in the distribution is about 0.75%, which indicates good genotype quality.

Figure 1 Missing rate by Individual from GE data

The missing call rate by SNP is the fraction of individuals for whom that SNP has no genotype call. Removing suboptimal markers is an important quality control step in a GWA study: it reduces the number of false-positive SNPs and increases the probability of identifying true associations with disease risk (6). Generally, markers with a missing rate greater than 5% are considered poorly typed. As Figure 2 shows, all markers have a missing rate below 5%, and only a few markers reach about 3%, which indicates that the markers in the GE data are of good quality.

Figure 2 Missing rate by SNP from GE data

The call rate, the complement of the missing rate, is another indicator of sample quality. Individuals with a low call rate (less than 99%) are considered to have poor genotyping efficiency and should be removed from the analysis (8). This threshold may vary from study to study, depending on the genotyping platform, the quality of the DNA samples, and the variability in human and equipment error during genotyping (8). The distribution in Figure 3 shows that all individuals in the GE data have a call rate greater than 99%. The mean call rate of the GE data is 0.9999324, which indicates good genotyping efficiency. For the GS data, we assume that there are no missing genotypes because the whole genome was sequenced, so the mean call rate is taken to be 1.

Figure 3 Distribution of Call rate from GE data

Minor allele frequency (MAF) is the frequency at which the less common allele occurs in a given population. SNPs with a MAF below 5% are difficult to evaluate in an association analysis, because very high statistical power is required to detect a true association (6). The distributions of MAF in the GE and GS data are shown in Figure 4 and Figure 5, respectively. Monomorphic SNPs are markers at which all individuals, cases and controls alike, have the same genotype. There are 125883 monomorphic SNPs in the GE data and 274931 in the GS data. For the GS data, only SNPs at which all individuals share the same genotype and that genotype differs from the reference genome are counted as monomorphic, because the VCF file does not list positions at which all individuals match the reference genome. Monomorphic SNPs carry no information for the analysis and can therefore be excluded. Similarly, singletons (23192 in GE and 6474420 in GS) and doubletons (11692 in GE and 2381449 in GS) could also be excluded.
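The relation between minor allele count (MAC) and MAF that underlies the singleton and doubleton counts can be sketched in Python. The sample size of 2921 is taken from the GE data; the genotype coding (0, 1, or 2 copies of one allele, None for missing), the example SNPs, and the function names are assumptions made for illustration.

```python
N_INDIVIDUALS = 2921   # individuals in the GE data


def maf_for_count(minor_allele_count, n_individuals=N_INDIVIDUALS):
    """MAF of a SNP with the given minor allele count and no missing calls."""
    return minor_allele_count / (2 * n_individuals)


def minor_allele_count(genotypes):
    """Minor allele count over non-missing calls (0/1/2 coding, None = missing)."""
    called = [g for g in genotypes if g is not None]
    count = sum(called)
    return min(count, 2 * len(called) - count)


def classify(mac):
    """Label a SNP by its minor allele count."""
    return {0: "monomorphic", 1: "singleton", 2: "doubleton"}.get(mac, "other")


singleton_maf = maf_for_count(1)
doubleton_maf = maf_for_count(2)
print(singleton_maf, doubleton_maf)

# Made-up genotype vectors for four SNPs in four individuals.
EXAMPLES = {
    "snp_a": [0, 0, 0, 0],
    "snp_b": [0, 1, 0, None],
    "snp_c": [2, 2, 1, 1],
    "snp_d": [1, 1, 1, 0],
}
labels = {name: classify(minor_allele_count(g)) for name, g in EXAMPLES.items()}
print(labels)
```

With 2921 individuals and no missing calls, a singleton therefore has a MAF of about 0.000171 and a doubleton a MAF of about 0.000342.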

Figure 4 Distribution of Minor Allele Frequency plot from GE data

Figure 5 Distribution of Minor Allele Frequency plot from GS data

5.2 Extraction of overlapping individuals

In total, 117 individuals were matched between the GE and GS data using the link file provided. Extracting these 117 of the 2657 sequenced individuals greatly reduced the size of the GS data. As mentioned above, PLINK and vcftools were used to extract the GE and GS data, respectively. The script used for extracting the overlapping individuals can be found in the supplementary section.

5.3 Extraction of overlapping SNPs

The SNPs that match between the GE and GS data were extracted in two steps. In the first step, the positions common to the GE and GS data were extracted. In the second step, the GE and GS data were filtered using these positions. In total, 10840 SNPs were found to match between the GE and GS data, excluding the X chromosome. The largest numbers of matching SNPs were observed on chromosomes 6 and 8. The number of matching SNPs on each chromosome is summarised in Table 2.

5.4 Comparison of Exome and Sequence Data

Once the matching SNPs had been extracted, the GE and GS data were compared using PLINK with the ‘--bmerge’ option. The GS data, which are in VCF format, were converted to PLINK BED format with the Pseq tool. Most of the SNPs reported as errors were resolved by the flip correction in PLINK. The remaining error SNPs arise because PLINK cannot handle triallelic markers; these SNPs were excluded from the consistency analysis. The script used for the comparison can be found in the supplementary section. Out of 1983969 overlapping SNP calls (the number of matching SNPs multiplied by the number of individuals), PLINK reported 244677 as discordant. Manual inspection showed that some SNPs were still discordant only because of strand flips. After manual correction of these flips, 1859 discordant SNP calls remained, and the overall concordance rate is estimated as 0.999. Chromosomes X, Y, XY, and MT were not taken into consideration because they are complex to interpret. The discordant SNP calls and the concordance rate for each chromosome are summarised in Table 2.
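The arithmetic behind these figures can be checked with a few lines of Python. The numbers are taken from Table 2; the function and variable names are only illustrative.

```python
MATCHING_SNPS = 16957          # matching SNPs on chromosomes 1 to 22 (Table 2, Total)
INDIVIDUALS = 117              # overlapping individuals
DISCORDANT_AFTER_FLIP = 1859   # discordant SNP calls after manual flip correction


def concordance_rate(discordant_calls, total_calls):
    """Share of SNP calls that agree between the two data sets."""
    return 1 - discordant_calls / total_calls


total_calls = MATCHING_SNPS * INDIVIDUALS
overall = concordance_rate(DISCORDANT_AFTER_FLIP, total_calls)

# Same calculation for chromosome 1: 507 matching SNPs, 51 discordant calls.
chr1 = concordance_rate(51, 507 * INDIVIDUALS)

print(total_calls, round(overall, 6), round(chr1, 9))
```

The total of 1983969 SNP calls is the product 16957 × 117, and the overall rate corresponds to the value in the Total row of Table 2.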

Out of the 1859 discordant SNP calls, 111 were due to a missing genotype in the GE data (Table 3). There are 28 discordant calls in which both data sets are homozygous but for different alleles (AA versus GG, CC versus TT, or vice versa). Discordant calls that are heterozygous in the GS data but homozygous in the GE data (627) are more frequent than those that are heterozygous in the GE data but homozygous in the GS data (245). This suggests that genotype calling on the exome chip is more conservative.

Table 2 Matching SNPs and Concordant rate using Plink tool

Total SNP calls = No. of matching SNPs × No. of individuals. Discordant (flip-corrected) = discordant SNP calls after manual flip correction.

Chromosome   SNPs in GE  SNPs in GS  Matching SNPs  Total SNP calls  Discordant SNP calls  Discordant (flip-corrected)  Concordant rate
Chr1         27073       1847359     507            59319            8017                  51                           0.999140242
Chr2         19577       2055231     438            51246            4225                  95                           0.998146197
Chr3         15638       1736024     365            42705            3705                  100                          0.997658354
Chr4         10812       1734142     280            32760            3132                  28                           0.999145299
Chr5         16677       1533151     250            29250            3588                  56                           0.99808547
Chr6         16405       1765698     8161           954837           114580                588                          0.999384188
Chr7         11898       1361064     282            32994            3936                  37                           0.998878584
Chr8         9357        1550874     4114           481338           69894                 92                           0.999808866
Chr9         10981       1008432     205            23985            2564                  138                          0.994246404
Chr10        10212       1162142     225            26325            2436                  39                           0.998518519
Chr11        16826       1162933     342            40014            4358                  41                           0.998975359
Chr12        13226       1122975     249            29133            3356                  136                          0.995331754
Chr13        4739        872238      153            17901            1797                  17                           0.999050332
Chr14        8419        752302      139            16263            2098                  41                           0.99747894
Chr15        8959        677017      184            21528            1898                  29                           0.998652917
Chr16        11152       726483      237            27729            3524                  53                           0.998088644
Chr17        14155       615187      203            23751            2782                  219                          0.990779336
Chr18        4031        678085      128            14976            1385                  24                           0.998397436
Chr19        16047       456069      217            25389            3760                  16                           0.999369806
Chr20        6830        510900      107            12519            837                   27                           0.997843278
Chr21        3009        316194      65             7605             919                   16                           0.997896121
Chr22        5535        280901      106            12402            1886                  16                           0.998709886
Chr23/ChrX   5461        852832      -              -                -                     -                            -
Chr24/ChrY   142         -           -              -                -                     -                            -
Chr25/ChrXY  2           -           -              -                -                     -                            -
Chr26/chrMT  226         -           -              -                -                     -                            -
Total        267389      24778233    16957          1983969          244677                1859                         0.999063

Table 3 Discordant Genotypes of Exome chip Vs Sequence data after Flip correction

Rows: Sequencing. Columns: Exome chip.

Sequencing   Hom    Het    Missing
Hom          28     245    54
Het          627    0      57
Missing      0      0      0
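A cross-tabulation of this kind can be produced from paired genotype calls as in the following Python sketch. It is an illustration only: the genotype representation (a tuple of two alleles, None for a missing call), the example pairs, and the function names are assumptions, and the counts in Table 3 were obtained from the PLINK output in R.

```python
from collections import Counter


def zygosity(genotype):
    """Classify a genotype call as Hom, Het, or Missing."""
    if genotype is None:
        return "Missing"
    return "Hom" if genotype[0] == genotype[1] else "Het"


def is_discordant(ge, gs):
    """True if the two calls disagree; a call missing in only one data set counts as discordant."""
    if ge is None and gs is None:
        return False
    if ge is None or gs is None:
        return True
    return sorted(ge) != sorted(gs)


def discordance_table(pairs):
    """Count discordant pairs by (sequencing class, exome chip class)."""
    return Counter((zygosity(gs), zygosity(ge))
                   for ge, gs in pairs if is_discordant(ge, gs))


# Made-up (GE call, GS call) pairs after flip correction.
PAIRS = [
    (("A", "A"), ("A", "A")),   # concordant, not counted
    (("A", "A"), ("A", "G")),   # heterozygous in GS, homozygous in GE
    (("A", "G"), ("G", "G")),   # homozygous in GS, heterozygous in GE
    (None, ("A", "G")),         # missing in GE
    (("A", "A"), ("G", "G")),   # homozygous in both, different alleles
    (("G", "A"), ("A", "G")),   # same genotype, allele order differs: concordant
]
table = discordance_table(PAIRS)
print(table)
```

Each key of the resulting counter corresponds to one cell of a table laid out like Table 3, with the sequencing class as the row and the exome chip class as the column.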

6. DISCUSSION

Descriptive statistics of the genotype data from the exome chip (GE) and from sequencing (GS) showed good genotype quality and efficiency. There are 267389 SNPs in the GE data and 24778233 SNPs in the GS data. The mean call rate of the GE data is 0.9999324, and the mean call rate of the GS data is taken to be 1. The GE data contain 125883 monomorphic sites, approximately 47% of all markers available. These markers contribute no information and can therefore be excluded from the analysis.

Since the descriptive statistics indicate very good genotype quality, a very high concordance rate can be expected between GE and GS. Genotype calls were compared for 117 individuals. Only 16957 SNPs were found to match between the two data sets. This small number is due to the fact that a VCF file contains only those SNPs that differ from the reference genome. We have no information about the other positions: they are absent from the VCF either because all genotypes match the reference genome or because the position was not covered by sequencing.

The estimated concordance rate after flip correction is 0.999, as expected. Only a few discordant SNP calls (28) are homozygous in both data sets, which suggests very high genotyping quality for the exome chip. SNPs from the GE data can therefore be considered reliable for GWAS. In total, 1859 SNP calls are discordant. These are most likely due to errors made when calling genotypes from the intensity data. The larger number of discordant calls that are heterozygous in GS but homozygous in GE (627) suggests that genotype calling on the exome chip is conservative. It should also be noted that the comparison covered only 16957 SNPs, which is only about 6% of the markers in the GE data.

As an internship student, I learnt how to handle large data sets in a Linux environment, to read sequencing data (VCF format) and genotyping data (PLINK/VCF format), to work with annotation, to compare genotypes based on genomic position, to use R for descriptive statistics, and to document genotyping results and consistency checks. I also learnt many new things through the challenges and issues faced during this period.

7. REFERENCE

1. Illumina. Human Exome BeadChips. http://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/datasheet_humanexome_beadchips.pdf.

2. Guo Y, He J, Zhao S, Wu H, Zhong X, Sheng Q. Illumina human exome genotyping array clustering and quality control. Nature protocols. 2014;9(11):2643-62.

3. Holle R, Happich M, Lowel H, Wichmann HE, Group MKS. KORA--a research platform for population based health research. Gesundheitswesen. 2005;67 Suppl 1:S19-25.

4. Wichmann HE, Gieger C, Illig T, Group MKS. KORA-gen--resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen. 2005;67 Suppl 1:S26-30.

5. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics. 2007;81(3):559-75.

6. Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT. Data quality control in genetic case-control association studies. Nature protocols. 2010;5(9):1564-73.

7. Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic epidemiology. 2010;34(6):591-602.

8. Turner S, Armstrong LL, Bradford Y, Carlson CS, Crawford DC, Crenshaw AT, et al. Quality control procedures for genome-wide association studies. Current protocols in human genetics / editorial board, Jonathan L Haines [et al]. 2011;Chapter 1:Unit1 19.

9. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156-8.

8. SUPPLEMENTARY

8.1 Missingness

## Missingness (shell)
plink --noweb --bfile ${EXOME_BED_FILE} --missing --out ${EXOME_PLINK_OUT_FILE}
# keep only SNPs with at least one missing call ($3 = N_MISS; the header line is kept as well)
awk '$3>0' ${EXOME_PLINK_OUT_FILE}.lmiss > ${EXOME_PLINK_OUT_FILE}_only_missing.lmiss
# append the base-pair position (column 4 of the .bim file) to each .lmiss row
awk 'NR==FNR {h[$2] = $4; next} FNR==1 {print $0, "BP"; next} {print $0, h[$2]}' \
  ${EXOME_BED_FILE}.bim ${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108.lmiss \
  > ${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108_manhattan.lmiss

## Plots (R)
library(ggplot2)
library(qqman)   # assumption: manhattan() is taken from the qqman package
imiss<-read.table(file='exome_f4_charge_top_20130108.imiss',header=T)
lmiss<-read.table(file='exome_f4_charge_top_20130108.lmiss',header=T)
lmissonly<-read.table(file='exome_f4_charge_top_20130108_only_missing.lmiss',header=T)
lmissman<-read.table(file='exome_f4_charge_top_20130108_manhattan.lmiss',header=T)

#1. plot by individual
ggplot(imiss, aes(x=as.factor(IID),y=N_MISS)) + geom_bar(stat="identity") +
  theme(axis.text.x = element_blank()) + theme(legend.position="none") +
  labs(title="Individual Missing Genotypes.b") + ylab("Sum of Missing Genotypes") +
  xlab("Individuals")

#2. plot by snp
manhattan(lmissman, p = "F_MISS", bp = "BP", logp = FALSE, ylab = "Missing Frequency",
  ylim = c(0, max(lmissman$F_MISS, na.rm = TRUE)), genomewideline = FALSE,
  suggestiveline = FALSE, main = "Missing Genotypes by SNP - Manhattan plot")

8.2 Mean call rate

## Mean Call rate

#1. mean call rate by individual

meanInd<-mean(1-imiss$F_MISS)

hist(1-imiss$F_MISS, xlab="Callrate",ylab="Number of individuals",

main="Callrate by individual", breaks=100)

#2. mean call rate by snp

meanSNP<-mean(1-lmiss$F_MISS, na.rm=T)

hist(1-lmiss$F_MISS, xlab="Callrate", ylab="Number of SNPs",main="Callrate by SNP", breaks=100)

8.3 Minor Allele Frequency (MAF)

## Minor Allele Frequency (shell)
plink --noweb --bfile ${EXOME_BED_FILE} --freq --out ${EXOME_PLINK_OUT_FILE}_freq
# append the base-pair position (column 4 of the .bim file) to each .frq row
awk 'NR==FNR {h[$2] = $4; next} FNR==1 {print $0, "BP"; next} {print $0, h[$2]}' \
  ${EXOME_BED_FILE}.bim ${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108_freq.frq \
  > ${EXOME_PLINK_OUT_DIR}/exome_f4_charge_top_20130108_freq_manhattan.frq

## Plots (R)
freq<-read.table(file='exome_f4_charge_top_20130108_freq.frq',header=T)
freqman<-read.table(file='exome_f4_charge_top_20130108_freq_manhattan.frq',header=T)

#1. MAF Histogram
hist(freq$MAF, xlab="Minor Allele Frequency",ylab="Number of SNPs",main="Minor Allele Frequency",
  breaks=3000, xlim=c(0,0.01))

#2. MAF Manhattan
manhattan(freqman, p = "MAF", bp = "BP", logp = FALSE, ylab = "Minor Allele Frequency",
  ylim = c(0,1), genomewideline = FALSE, suggestiveline = FALSE,
  main = "Minor Allele Frequency - Manhattan plot")

8.4 Monomorphic and singleton SNPs

## Monomorphic SNPs

awk '$5==0' ${EXOME_PLINK_OUT_FILE}_freq.frq > ${EXOME_PLINK_OUT_FILE}_monomorphic.frq

wc -l ${EXOME_PLINK_OUT_FILE}_monomorphic.frq

## Singleton SNPs

## 1/(2*2921) = 0.0001711743

awk '$5<=0.00018 && $5>0' ${EXOME_PLINK_OUT_FILE}_freq.frq > ${EXOME_PLINK_OUT_FILE}_singleton.frq

wc -l ${EXOME_PLINK_OUT_FILE}_singleton.frq

## Doubleton SNPs
## 2/(2*2921) = 0.0003423485
## the lower bound excludes singletons (MAF <= 0.00018)
awk '$5>0.00018 && $5<=0.00035' ${EXOME_PLINK_OUT_FILE}_freq.frq > ${EXOME_PLINK_OUT_FILE}_doubletons.frq

wc -l ${EXOME_PLINK_OUT_FILE}_doubletons.frq

8.5 Extraction of overlapping individuals.

## Exome Data
plink --noweb --bfile ${EXOME_BED_FILE} --keep ${EXOME_INDIVIDUAL_MATCHED_DIR}/exome_sample_list.txt \
  --make-bed --out ${EXOME_INDIVIDUAL_MATCHED_DIR}/exome_f4_charge_filtered_individuals

## Sequence Data
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
do
  vcf-subset -c ${SEQ_INDIVIDUAL_MATCHED_VCF_DIR}/samples.txt \
    ${SEQUENCE_DIR}/GoT2D.chr${i}.paper_integrated_snps_indels_sv_beagle_thunder.vcf.gz | \
    bgzip -c > ${SEQ_INDIVIDUAL_MATCHED_VCF_DIR}/individual_matched_chr${i}.vcf.gz
done

8.6 Identification of overlapping SNPs.

## Exome Data
##1. Extracting matching reference positions list
# (chromosome, position) is used as a two-part key, so that e.g. chr1:23456 and chr12:3456 cannot be confused
grep -h -v '^#' ${SEQ_POSITION_MATCHED_VCF_AWK_DIR}/individual_snp_matched_chr*.vcf \
  | cut -f1,2 | sort -k 1 -k 2 - | uniq | \
  awk 'FNR==NR{a[$1,$2]++;next} {if(($3,$4) in a){print $1}}' - \
  ${EXOME_NEW_ANNOTATION_FILE_FILTERED} | \
  sort | uniq > ${EXOME_SNP_MATCHED_DIR}/snpid_list.txt

##2. Filter Exome plink data using above list
plink --noweb --bfile ${EXOME_INDIVIDUAL_MATCHED_DIR}/exome_f4_charge_filtered_individuals \
  --extract ${EXOME_SNP_MATCHED_DIR}/snpid_only_list.txt --make-bed --out \
  ${EXOME_SNP_MATCHED_DIR}/exome_f4_charge_filtered_ind_snp

## Sequence Data
##1. Extract chromosome positions from exomechip data
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
do
  awk -v i="${i}" '{if($1==i) print $4}' ${EXOME_BED_FILE}.bim > \
    ${SEQ_PLINK_POSITIONS_DIR}/exomechip_positions_chr${i}.txt
done

##2. Extracting snp from VCF which matches with the above extracted positions
#a.Using AWK command
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
do
  zcat ${SEQUENCE_DIR}/GoT2D.chr${i}.paper_integrated_snps_indels_sv_beagle_thunder.vcf.gz | \
  awk '
    FNR==NR {a[$0]; next}     # first file: exome chip positions
    $1 ~ /^#/ {print; next}   # keep VCF meta-information and header lines
    $2 in a   {print}         # keep variants at an exome chip position
  ' ${SEQ_PLINK_POSITIONS_DIR}/exomechip_positions_chr${i}.txt - > \
    ${SEQ_POSITION_MATCHED_VCF_DIR}/position_matched_chr${i}.vcf
done

8.7 Analyzing the consistency of data between the overlapped genotypes.

## Consistency of Exome and Sequence Data
##1. Converting VCF to BED
pseq008 ${HOMEDIR}/plinkseq/vcftobed new-project
pseq008 ${HOMEDIR}/plinkseq/vcftobed load-vcf --vcf \
  ${SEQ_POSITION_MATCHED_VCF_RENAMED_DIR}/individual_snp_matched_renamed_chr*.vcf
pseq008 ${HOMEDIR}/plinkseq/vcftobed write-ped --name ${SEQ_VCF_TO_BED}/GoT2D
plink --noweb --tfile ${SEQ_VCF_TO_BED}/GoT2D --make-bed --out ${SEQ_VCF_TO_BED}/GoT2D

# replace the SNP IDs in the .bim file with the exome chip IDs, matched on (chromosome, position);
# GoT2D.bim_bkp is a backup copy of the original GoT2D.bim
awk 'BEGIN { FS = "\t" ; OFS = "\t" } { gsub("\\<X\\>", "23", $3) ; gsub("\\<Y\\>", "24", $3) ;
  gsub("\\<XY\\>", "25", $3) ; gsub("\\<MT\\>", "26", $3) ; print $0 }' \
  ${EXOME_NEW_ANNOTATION_FILE_FILTERED} | \
  awk 'NR==FNR {h[$3,$4] = $1; next} {print $1,h[$1,$4],$3,$4,$5,$6}' - \
  ${SEQ_VCF_TO_BED}/GoT2D.bim_bkp > ${SEQ_VCF_TO_BED}/GoT2D.bim

##2. Flipping Correction
# flip the strand of the exome chip SNPs listed in the .missnp file
plink --noweb --bfile ${EXOME_SNP_MATCHED_UPDATED_DIR}/exome_f4_charge_filtered_ind_snp_map --flip \
  ${COMPARISON_RESULTS_PLINK_DIR}/exome_seq.missnp --make-bed --out \
  ${COMPARISON_RESULTS_PLINK_DIR}/1.flip/exome_f4_charge_flipped

# exclude the SNPs listed in comp_exome_seq_flip.missnp from both data sets
plink --noweb --bfile ${COMPARISON_RESULTS_PLINK_DIR}/1.flip/exome_f4_charge_flipped --exclude \
  ${COMPARISON_RESULTS_PLINK_DIR}/1.flip/comp_exome_seq_flip.missnp --make-bed --out \
  ${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/exome_f4_charge_flipped_excluded

plink --noweb --bfile ${SEQ_VCF_TO_BED}/GoT2D --exclude \
  ${COMPARISON_RESULTS_PLINK_DIR}/1.flip/comp_exome_seq_flip.missnp --make-bed --out \
  ${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded

##3. Comparing Exome and Sequence data
# --merge-mode 6 reports mismatching calls without merging the files
plink --noweb --bfile \
  ${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/exome_f4_charge_flipped_excluded \
  --bmerge ${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded.bed \
  ${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded.bim \
  ${COMPARISON_RESULTS_PLINK_DIR}/2.flip_excluded_error_snps/GoT2D_excluded.fam \
  --merge-mode 6 --out ${COMPARISON_PLINK_OUTPUT}/comp_exome_seq_flip_excluded
