high-throughput primer and probe design for … primer and probe design for genome-wide dna...

22
High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and Christopher S. Bond Life Sciences Center University of Missouri, USA Garima Kushwaha Informatics Institute and Christopher S. Bond Life Sciences Center University of Missouri, USA Huidong Shi MCG Cancer Center Medical College of Georgia, USA Dong Xu Computer Science Department, Informatics Institute, and Christopher S. Bond Life Sciences Center University of Missouri, USA 1 Introduction Research in the field of DNA sequencing has generated tremendous amounts of genomic sequences. These DNA sequence information have often been used to enhance understanding of genomic changes and genetic diseases. In addition to studying DNA sequences and genetic changes, the study of epigenetic changes that do not involve DNA sequence has also become one of the major focus areas in medical research. DNA methylation, primarily methylation of cytosine at CpG dinucleotides, is the most widely studied epigenetic modification and is known to have profound effects on gene expression. It is involved in embryogenesis, genomic imprinting [Razin & Cedar, 1994], X-chromosome inactivation [Jaenisch & Bird, 2003], and de- velopment of diseases [Li, 2002], particularly various types of cancers [Jones & Takai, 2001]. Methylation studies in cancer research have identified methylation of hundreds of CpG islands (methylation suscepti- ble sites) in a tumor cell. Early diagnostic techniques and therapeutics based on epigenetic strategies have also been considered for cancer treatment and diagnosis [Woonbok et al., 2008]. DNA methylation also plays a critical role in the development of many other human diseases, such as neuro-developmental disor- ders, cardiovascular diseases, type-2 diabetes, obesity and infertility [Van et al, 2007]. DNA methylation is also involved in normal cellular processes, including gene regulation, DNA-protein interaction, and cellular differentiation [Shiota, 2004, Eden & Cedar, 1994]. DNA methylation mainly refers to the addition of the methyl group to the carbon-5 position of cytosine (5MeC) at CpG dinucleotides present in DNA sequences. In human somatic cells, 5MeC accounts for 1%

Upload: trantu

Post on 11-Apr-2018

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

High-throughput Primer and Probe Design for Genome-Wide DNAMethylation Study Using PRIMEGENS

Gyan Prakash SrivastavaComputer Science Department and Christopher S. Bond Life Sciences

Center University of Missouri, USA

Garima KushwahaInformatics Institute and Christopher S. Bond Life Sciences Center

University of Missouri, USA

Huidong ShiMCG Cancer Center

Medical College of Georgia, USA

Dong XuComputer Science Department, Informatics Institute, and Christopher S. Bond Life Sciences Center

University of Missouri, USA

1 Introduction

Research in the field of DNA sequencing has generated tremendous amounts of genomic sequences. TheseDNA sequence information have often been used to enhance understanding of genomic changes and geneticdiseases. In addition to studying DNA sequences and genetic changes, the study of epigenetic changes thatdo not involve DNA sequence has also become one of the major focus areas in medical research. DNAmethylation, primarily methylation of cytosine at CpG dinucleotides, is the most widely studied epigeneticmodification and is known to have profound effects on gene expression. It is involved in embryogenesis,genomic imprinting [Razin & Cedar, 1994], X-chromosome inactivation [Jaenisch & Bird, 2003], and de-velopment of diseases [Li, 2002], particularly various types of cancers [Jones & Takai, 2001]. Methylationstudies in cancer research have identified methylation of hundreds of CpG islands (methylation suscepti-ble sites) in a tumor cell. Early diagnostic techniques and therapeutics based on epigenetic strategies havealso been considered for cancer treatment and diagnosis [Woonbok et al., 2008]. DNA methylation alsoplays a critical role in the development of many other human diseases, such as neuro-developmental disor-ders, cardiovascular diseases, type-2 diabetes, obesity and infertility [Van et al, 2007]. DNA methylation isalso involved in normal cellular processes, including gene regulation, DNA-protein interaction, and cellulardifferentiation [Shiota, 2004, Eden & Cedar, 1994].

DNA methylation mainly refers to the addition of the methyl group to the carbon-5 position of cytosine(5MeC) at CpG dinucleotides present in DNA sequences. In human somatic cells, 5MeC accounts for 1%

Page 2: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

152 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

of total DNA bases and affects 70 - 80% of all CpG dinucleotides in the genome [Ehrlich et al., 1982].Genomic regions of 300-3,000 base pairs that have very high CpG dinucleotide density are known as CpGislands. There are around 29,000 CpG islands identified in the human genome [Venter et al., 2001, Esteller&Herman, 2002]. Approximately 70% of gene promoters are associated with these CpG islands. CpGislands in promoter regions are mostly unmethylated in a normal cell at all stages of development and in alltissue types; these are often associated with active gene transcription [Antequera & Bird, 1993]. However,the methylation profile of a cell can be controlled during its development. The origin of genomic DNAmethylation and factors dictating methylation status of genes remain a mystery [Feng et al., 2007]. MostDNA methylations take place at CG-dinucleotides of CpG islands but some studies have also found non-CGmethylation on DNA sequences associated with highly expressed genes [Clark, 2007, Flanagan & Wild,2007].

DNA methylation patterns in cancer cells show major variations compared with normal cells and areusually associated with transcriptional inactivation of tumor suppressor genes [Das & Singal, 2004, Fujikaneet al., 2009, Kistensen et al., 2009]. Aberrant methylation patterns in cancer cells typically consist of hy-pomethylation involving repeated DNA sequences, such as long interspersed nuclear elements and hyperme-thylation at CpG islands [Ehrlich, 2002, Libault et al., 2009]. There are numerous tumor suppressor genesthat are known to be hypermethylated; these include genes involved in cell cycle regulation (p16INK4a,p15INK4a, Rb, and p14ARF), DNA repair (BRCA1 and MGMT), apoptosis (DAPK and TMS1), drug re-sistance, detoxification, differentiation, angiogenesis, and metastasis. Hypomethylation is common in solidtumors, such as metastatic hepatocellular [Lin et al., 2001, Libault et al., 2009] and cervical cancer [Kimet al, 1994, Libault et al., 2009], prostate tumors [Bedford and Helden, 1987, Libault et al., 2009], and inhematologic malignancies such as B-cell chronic lymphocytic leukemia [Ehrlich, 2002, Libault et al., 2009].Hypomethylation participates in oncogenesis by activating oncogenes such as c-MYC and H-RAS and ac-tivation of latent retrotransposons [Bedford and Helden, 1987, Feinberg and Vogelstein, 1983], therebycausing chromosome instability [Tuck et al., 2000]. Hypomethylation of long interspersed nuclear elementDNA causes transcriptional activation and is found in many types of cancers, such as urinary bladder cancer[Jurgens et al., 1996]. Both genome-wide and candidate gene approach for methylation studies predict thateach tumor or cancer type has its own set of cell-type specific genes associated with methylation. Differentpanel of genes can be used as genetic markers to detect aberrant methylation in certain cancers [Shivapurkar& Gazdar, 2010], such as lung and breast cancers [Brook et al., 2009], to diagnose their early stage of de-velopment [Woonbok et al., 2008]. DNA methylation, together with covalent modification of histones, isknown to alter chromatin packing known as “chromatin remodeling”, which changes the accessibility ofDNA to cellular machinery or proteins that help transcribe the underlying DNA sequence [Flanagan andWild, 2007].

Since the disruption of DNA methylation has become a major hallmark of human cancer, precise map-ping of methylation patterns in CpG islands [Shen et al., 2007] has become an important tool in under-standing cancers and other diseases. For the past several years, an international consensus has emergedin the epigenetics research community for the need of an organized Human Epigenome Project (HEP)aimed at generating a high-resolution DNA methylation map of the human genome in all major tissues[Eckhardt et al., 2006, Rakyan et al., 2004]. The HEP intends to provide a ”reference epigenome” by re-sequencing different normal tissues and adding 5-methylcytosine to the DNA sequencing datasets (http://nihroadmap.nih.gov/epigenomics ). The pilot HEP in Europe utilized direct sequencing of bisulfitepolymerase chain reaction (PCR) products to provide single methyl-cytosine resolution mapping of thou-sands of amplicons [Eckhardt et al., 2006]. Moreover, an innovative massively parallel sequencing-by-synthesis method (454-sequencing) has been applied in ultra-deep bisulfite sequencing analysis of multiple

Page 3: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G.P. Srivastava, K. Kushwaha, H. Shi & D. Xu 153

tumor methylome [Rakyan et al., 2004, Taylor et al., 2007].The successful implementation of any highly parallel sequencing approach depends on the use of auto-

matic primer design program [Liard, 2003] capable of performing genome-wide scans for optimal primersfor high-throughput genomic sequences [Zhang and Smith, 2010], such as in silico bisulfite-treated humangenome sequences. Several methods have been proposed to address this issue partially. There are varioustools available for probe and primer design [Hiller & Green, 1991, Hydman et al. 1996, Haas et al., 1998,Herman et al., 1996, Proutski & Holmes, 1996, Li et al., 1997, Pattyn et al. 2006, Yamada et al., 2006, Tsaiet al., 2007] with a few web servers [Aranyi & Tusnady, 2007], but most of them lack stringent checks forcross-hybridization of primers and probes. MethPrimer [Marshall, 2004] and PerlPrimer [Li et al., 2002]transform target sequences according to the bisulfite treatment for primer design. However, these methodsdo not provide a mechanism to detect non-specific amplification in bisulfite PCR. Bisearch [Tusnady et al.,2005, Aranyi & Tusnady, 2007] provides an important feature of similarity search for potential non-specificPCR products with selected primer pairs on a bisulfite-treated genome. It uses a simple string-matchingsearch method to detect potential cross-hybridization of a designed primer pair. The string-matching searchcan then find an exact match for the primer; however, it cannot detect highly similar sequences (e.g., with1 nt mismatch) in the genome to the primer, which could also be potential binding site for the primer. Inaddition, this method is not practical or suitable for analyzing primer pairs at mis-priming sites in high-throughput primer design [You et al., 2008, Arvidsson et al., 2006], which is required for highly parallelsequencing system to develop high-throughput, large-scale bisulfite genomic sequencing. To address thisissue, an efficient method has been integrated into the second version of PRIMEGENS software [Xu et al.,2002], i.e., PRIMEGENS-v2 [Srivastava et al., 2008]. PRIMEGENS is built on third-party, open-sourcesoftware tools, such as Primer3 [Rozen & Skaletsky, 2000] and BLAST, and includes various new featuresfor genome-scale primer design. It is widely used and cited in the research community [Liu et al., 2003,Hilson et al., 2004, David, 2005, Bertone et al, 2006, Leparc et al., 2009, Lemoine et al., 2009, Libualt etal., 2009], and is available as a standalone tool to run under both Linux and Windows or as a web-based toolnamed PRIMEGENS-w3 (http://primegens.org/w3/index.cgi).

2 PRIMEGENS-v2 Algorithms

PRIMEGENS-v2 can be used for large-scale primer and probe design for various applications, such as PCR,DNA synthesis, qRT-PCR (gene expression), and targeted next-generation sequencing (454, Solexa, AgilentSure-select technology, etc.) for normal and bisulfite-treated sequences. A unique feature of PRIMEGENS-v2 is that it allows users to select various types of input formats and primer design algorithms to designprimers and probes customized for their experimental requirements. There is currently no other softwaretool that allows flexibility to incorporate various types of design constraints and customized algorithmicflowcharts suitable for different applications for PCR, microarray, and targeted sequencing.

PRIMEGENS-v2 designs primers using three different algorithms: 1) Sequence-Specific Primer Design(SSPD), 2) Fragment-Specific Primer Design (FSPD), and 3) Probe-Specific Primer Design (PSPD). It alsohas probe design capacity, simply called Probes Design algorithm. Detailed descriptions of each of thesealgorithms are provided in the following subsections.

2.1 Sequence-Specific Primer Design

Sequence-Specific Primer Design (SSPD), as shown Figure 1, is the most basic algorithm of PRIMEGENS-v2. This algorithm is used to design primers for each sequence given by the user in the query input file

Page 4: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

154 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

Scattered primers pair design using Primer3

Megablast

Record Blast Hits for each olgio in sequence database

Apply hybridization rule for putative amplicons

Unique 3’-end 14mer oligos

Sequence Database

Multiplex mprimer pairs speci cto group of homolog genes

Primer pairs speci cto individual genes

Figure 1: Flowchart for the SSPD algorithm.

against any alternate potential hybridization with any of the sequences given in the database input file. Inits first step, SSPD designs different primers for the query sequence using a third-party program, Primer3.It uses the standalone executable of Primer3 to design hundreds of primer pairs scattered all over the inputquery sequence by Primer3 and then retains only unique ones from this set. It further selects a handfulof specific primers from the resulting multiple sets of primers. Based on the fact that stringency is morecritical at the 3’-end than at the 5’-end of the primer [Onodera & Melcher, 2004], it first selects 15meroligonucleotides from the 3’-end of each primer designed by Primer3 and then retains only the unique onesfrom this set. This step helps optimize the next step and allows PRIMEGENS-v2 to run faster by reducingthe BLAST runs against database. The second step of SSPD is to perform gapless alignments of these15mers against database file to search for all the potential non-specific PCR-products. For this, it usesMegablast (http://www.ncbi.nlm.nih.gov/blast/megablast.shtml) for alignment and is optimizedfor detection of genomic regions that are highly similar to query oligo (15mer) in the database sequencesprovided. Finally, SSPD selects primers that have no or minimum cross hybridizations.

As shown in Figure 2, Fragment Specific Primer Design (FSPD) is another primer-design algorithmused by PRIMEGENS-v2. This algorithm is useful in covering long sequences, such as CpG islands orother regions of interest, thereby allowing PCR to cover the whole region using multiple primers distributeduniformly across the entire region. It first breaks down the whole query sequence into small overlappingfragments and then designs primers for each of these fragments. For this, a user should provide the length ofeach fragment into which the query sequence will be divided (fragment length) and the overlap between twofragments to design associated primers (fragment overlap). For each fragment sequence, it functions in amanner similar to the SSPD algorithm in terms of designing a primer pair, before finally providing multipleprimer pairs for a long query sequence.

2.2 Probe-Specific Primer Design

Probe-Specific Primer Design (PSPD), as shown in Figure 3, is another PRIMEGENS-v2 algorithm usedto design primers. It selects the gene-specific fragments (probes) to design primer pairs. The two major

Page 5: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 155

Fragment Length

Fragment Overlap

Fragment Overlap

Last Fragment Length

Input Query Length

Fixed length fragments from query based on user input

Last fragment with remaining bases

Apply PRIMEGENSv2 basic SSPD algorithm for primer design cosidering each fragment as seperate input query sequence

Primer pairs speci cto individual gragments

5’

5’5’

5’

3’

3’3’

3’

Figure 2: Flowchart for the FSPD algorithm.

computational tasks in this algorithm work in a reverse order from that of SSPD. It first identifies a gene-specific fragment (probe) and then designs primers for the target fragment (or probe). To determine gene-specific fragments, it carries out a heuristic BLAST search for each of the query sequence against all thesequences in the database file to identify homolog sequences. It then performs multiple sequence alignment(star alignment) between the query and each of its homolog using Dynamic Programming. Based on thealignment, PSPD tries to find the sequence fragments that do not align with any other sequence and selectsthem as gene-specific fragments. Sequences that are unique themselves (no BLAST hit) are directly used asgene-specific fragments. In its second task, PRIMEGENS-v2 uses Primer3 to design PCR primers for gene-specific fragments (or probes) of the query sequence found in the first step of PSPD, instead of the wholequery sequence provided. Primer3 gives primers for PCR amplifications based on user-specified parameters(e.g. primer size, melting temperature, GC content, self-complementarity, and so on).

Once the primer-pair design using Primer3 is accomplished and BLAST hits for each 3-end uniqueoligos are recorded, the model checks for potential amplicons amplified by designed primers. Only a certainorder and orientation of primer binding can successfully amplify a sequence, and even if a primer pairbinds in correct order and orientation, the amplicon size should fall within a desired range. Therefore,PRIMEGENS-v2 checks each of these primer selection constraints for the potential cross-hybridization ofthe designed primers and probes. For example, a primer pair showing potential cross-hybridization withan amplification size of more than 10,000 bases other than target region can be safely ignored for PCRexperiment. Figure 4 shows an example of a primer oligo BLAST hit, which results in successful or failedamplification in PCR. Out of four different cases of primer pair hits, only a single case results in successfulamplification. Furthermore, it occurs only if the amplicon size is less than the maximum possible ampliconsize (the default: 10,000 bases in PRIMEGENS-v2) for PCR.

Finally, after the step of hybridization restriction, a set of generated primers is further reduced to allowfor only those primer pairs that have potentially good amplicon based on primer quality and user-specifiedproduct size range. Out of the final designed primer pairs, top primer pairs having least number of cross-hybridization are reported as the final output.

Page 6: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

156 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

Sequence pool

Primers

Query ORF

Outputdesign multiple primers

unique

unique

BLAST search homologous

Gapless alignment

primer3

Dynamicprogramming

Global Alignments

Gene-Speci c Fragments

BLAST Hits

CTAAGCTATCCTGATTTACGCAATGGCTAGCTG

......

Index/Ranking

Figure 3: Flowchart for the PSPD algorithm.

2.3 Probe Design Algorithm for Target Enrichment

The probe design algorithm is used to search for sequence-specific probes with no BLAST hits againstdatabase sequences. This probe design has been used for targeted sequencing, including Agilent sure-selecttechnology with next-generation sequencing. Target enrichment or targeted resequencing is useful whenresearchers are interested in sequencing only a certain portion of genomic region; for example only translatedregions of the human genome. This technique is also known as “DNA capture” or “genome partitioning”.One of the most common platforms for target resequencing is the Agilent SureSelect platform, which allowsusers to capture a subset of a genome and wash away the rest of genome that can be input to any resequencingtechnology. The Agilent target capturing technology replaces other labor-intensive methods, such as the PCRtechnique, which are major bottlenecks for next-generation sequencing output. To capture certain regions ofa genome rather than a whole genome, PRIMEGENS-v2 only allows users to design probes to capture thoseregions. This algorithm takes users inputs of the genomic region of interest in the FASTA format and designsprobes (either single or multiple uniformly distributed probes for each region). The algorithm uses BLASTto ensure that it outputs only those probes that are specific to the target region and hybridize nowhere else.

To design gene-specific probes, PRIMEGENS-v2 uses a sliding window protocol to find fragments ofuser-given length by the PRIMEGENS-v2 variable PROB LENGTH the window size. This sliding windowmoves over the length of PROBE PERIOD (“Probe shift” in Figure 5). It looks for a probe in every genomicregion given by the PRIMEGEN-v2 parameter called the PROBE REGION (“Inter-probe distance” in Figure5). For probe design, users can also provide their preferences in terms of maximum and minimum GCcontent in the probe using other PRIMEGENS-v2 variables, including MAX GC, Min GC, and maximumallowable BLAST hits for probes.

3 PRIMEGENS-v2 Usage

PRIMEGENS-v2 has a simple operation protocol consisting of two basic steps: (1) preparation of datafiles, such as list of target sequences for primer/probe design and relevant database for cross-hybridization

Page 7: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

Figure 4: Fundamental restriction for PCR amplification. The long strands are DNA sequences tobe amplified and the short ones are primers. Of the four scenarios, only (A) will produce ampliconif the inter-primer distance is less than the maximum amplicon size.

check in the FASTA format; and (2) selection of primer and probe design algorithm, as well as the relevantparameters along with primer specifications.

3.1 PRIMEGENS-v2: Input and Output

In order to use PRIMEGENS-v2 for primer and probe design, the user has to prepare a list of target se-quences in one of the formats supported by the software and provide a database to be used for cross-hybridization check. In addition, the software provides a sample configuration file containing various typesof PRIMEGENS-v2 variables as well as standard variables from Primer3 and BLAST or MegaBLAST. Theconfiguration file also provides a description for each variable, allowing the user to select appropriate valuesthat are customized for his or her application.

3.1.1 Input: Target Sequences

In order to use PRIMEGENS-v2 for primer and probe design, a user must provide a list of target sequencesin the FASTA format, that is, each gene name should start with a “> ” sign. The software also allows usersto only provide gene names without their actual nucleotide sequence as long as these sequences can beretrieved from the database. The user can then provide the chromosome position of genes from any genomefor primer design, for example > chr21:33031597-33041570. In this case, a user has to provide the wholegenome as the database. The genome must be in the chromosomal files, and there must be one file for eachsequence containing the full sequence of the corresponding genome.

PRIMEGENS-v2 provides users with four formats to input their target genes without explicit sequences,including gene name/identifier along with description, gene name/identifier, chromosome position with de-scription, and chromosome position (see Table 1(A)). The name/identifier could be used as gene name/identi-fier as long as the same name/identifier is used in the database, which must be in the FASTA format.

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 157

5’ 3’

3’ 5’

(A)5’ 3’3’ 5’

5’ 3’

3’ 5’

(B)5’ 3’

5’ 3’

3’ 5’

(C) 3’ 5’

5’ 3’

3’ 5’

(D)

3’ 5’

3’ 5’

5’ 3’

5’ 3’

Page 8: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

158 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

Select next probeusing probe shift

and BLAST againstgiven database

Hits? More genomic region left?

Select next probe afterinter-probe distanceand BLAST against

given database

ExitGenomic

region left?

Hit Found

seYdnuoF tiH oN

oNseY

Start

Figure 5: Flowchart for the probe design algorithm.

PRIME-GENS-v2 uses this database and searches for the gene name/identifier in the database to retrieveactual sequence for each query input provided by the user.

Table 1(B) shows the same format as Table 1(A), but with actual sequences for each query input. Inthis case, the user is not required to follow any constraint on gene name/identifier in terms of matching onein the database. If users want to separate annotations with gene name/identifier, they can do so by selectingthe format showing the description. When the chromosome position is given, it is still considered as genename/identifier unlike the previous case without sequence, where the actual sequence is retrieved using thislocation information.

3.1.2 Input: Database (Sequence Pool/Genome)

In the second input file(s), the user has to provide PRIMEGENS-v2 a database consisting of all the sequencesagainst which cross-hybridization has to be checked for each primer/probe designed. Based on number ofhybridization, it keeps only those primers with no or minimum hybridization to the input query sequence.

3.1.3 Input: Parameters (Configuration File in the Stand-alone Version)

In addition to input query sequence and database for cross-hybridization check, the user is required to entervalues for various PRIMEGENS-v2 parameters, such as primer design algorithm, query sequence format,database format, and various primer-design related parameters. These parameters are provided through theconfiguration file “config.txt” in PRIMEGENS-v2. To help users further, the software includes a sampleconfiguration file in its “include” directory. The file name for this configuration file is fixed and must benamed “config.txt”. The user can copy the sample configuration file and modify various parameters cus-tomized for their experimental conditions. Once a command is executed, PRIMEGENS-v2 searches for theconfiguration file in the user-provided output directory; if there is no configuration file in the output direc-tory, PRIMEGENS-v2 will search for the same in the current directory. If it fails to retrieve the configurationfile, PRIMEGENS-v2 will use the default sample configuration file provided by the software in the “include”

Page 9: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 159

Input target sequence ID format without DNA sequence

Input target sequence location without DNA sequence

A. Without actual sequences

Input target sequence ID format with DNA sequence

Input target sequence location with DNA sequence

...

...

...

...

...

...

...

GACGCGCAGACGTGATAAAGCTG ...

GACGCGCAGACGTGATAAAGCTG ...

GACGCGCAGACGTGATAAAGCTG ...

GACGCGCAGACGTGATAAAGCTG ...

GACGCGCAGACGT ...

...

...

> S11529022 description ... > S11417986

> S22334604 description ... > S11417988

> S11417968 description ... > S11528870

> S11431057 description ... > S11417995

> S11520763 description ... > S11418609

> chr21:170390-171698 description ... > chr21:170390-171698

> chrX:222396-222882 description ... > chrX:222396-222882

> chr14:36539-40265 description ... > chr14:36539-40265

> chr16:36539-38579 description ... > chr16:36539-38579

> S11529022 description ... > S11417986 GACGCGCAGACGTGATAAAGCTG

> S22334604 description ... > S11417988 GACGCGCAGACGTGATAAAGCTG

> S11417968 description ... > S11528870 GACGCGCAGACGTGATAAAGCTG

> S11431057 description ... > S11417995 GACGCGCAGACGTGATAAAGCTG

> S11520763 description ... > S11418609

> chr21:170390-171698 description ... > chr21:170390-171698 GACGCGCAGACGT

> chrX:222396-222882 description ... GACGCGCAGACGT > chrX:222396-222882 GACGCGCAGACGT

> chr14:36539-40265 description ... GACGCGCAGACGT > chr14:36539-40265 GACGCGCAGACGT

> chr16:36539-38579 description ... > chr16:36539-38579

B. With actual sequences

Table 1: List of all four formats without or with actual sequences supported by PRIMEGENS-v2.

directory. The detailed description of each parameter is provided in the configuration file in lines startingwith the “#” character as comments. Most of the parameters have been set to a default value as standardparameters for the best primer design. These parameters are divided into different sections described below.

1. Parameters for the primer-design algorithm type This section of parameters allows the user tochoose primer and probe design algorithms out of the three primer and one probe design algorithmssupported by PRIMEGENS-v2. The default algorithm for primer design is SSPD.

2. Parameters required forthe BLAST and Primer3 programs In this section, the user can set param-eters for MegaBLAST to search for cross-hybridization of primers in the database sequences and thedesired characteristics of the primer used by Primer3 to design primers, including melting temperatureand primer length, among others.

3. Parameters required forthe FSPD program This section specifies parameters required for the FSPDalgorithm if chosen. This comprises the length of each fragment into which the query sequence is di-vided (fragment length) and the overlap between two fragments to design associated primers (fragmentoverlap).

4. Parameters required forthe PSPD program This section specifies parameters required for the PSPDalgorithm if chosen. These parameters include gene-fragment length and the maximum allowed se-

Page 10: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

160 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

>chr1:203768370-20376857

1) CAGAAAAGCTGTACACGCCA [190] TGTACGCATGTGTGTGTGGT [359] psize 170 hbrdn 163

2) CAGAAAGGAGCGAGCAGTCT [166] TGTACGCATGTGTGTGTGGT [359] psize 194 hbrdn 171

3) AACGTCGTGTGACTCAGTGC [117] TGTACGCATGTGTGTGTGGT [359] psize 243 hbrdn 176

4) AACGTCGTGTGACTCAGTGC [117] TGTGTGTGGTGTGTGTGCAT [349] psize 233 hbrdn 189

5) AACGTCGTGTGACTCAGTGC [117] GTACGCATGTGTGTGTGGTG [358] psize 242 hbrdn 250

6) CAGAAAAGCTGTACACGCCA [190] GTACGCATGTGTGTGTGGTG [358] psize 169 hbrdn 251

7) CAGAAAGGAGCGAGCAGTCT [166] GTACGCATGTGTGTGTGGTG [358] psize 193 hbrdn 265

8) CAGAAAAGCTGTACACGCCA [190] GTGTACGCATGTGTGTGTGG [360] psize 171 hbrdn 310

9) CAGAAAGGAGCGAGCAGTCT [166] GTGTACGCATGTGTGTGTGG [360] psize 195 hbrdn 329

>chr1:235833063-23583326

1) TTGACTCTAGTCTGGCACGG [131] TAGCACCATTCCACAAAGGC [351] psize 221 hbrdn 10

2) CTTGACTCTAGTCTGGCACGG [130] TAGCACCATTCCACAAAGGC [351] psize 222 hbrdn 10

3) TCTAGTCTGGCACGGTGAAG [136] TAGCACCATTCCACAAAGGC [351] psize 216 hbrdn 12

4) CTCTAGTCTGGCACGGTGAAG [135] TAGCACCATTCCACAAAGGC [351] psize 217 hbrdn 12

5) ATGACGAGGCATTTGGCTAC [202] TAGCACCATTCCACAAAGGC [351] psize 150 hbrdn 14

Table 2: Sample Excel output file generated by PRIMEGENS-v2 for designed alternate primerpairs (with relatively high number of cross-hybridizations).

quence identity, among others.

5. Parameters for Probe Design This section specifies parameters required for probe design if chosen.These parameters include probe length and allowed number of probe BLAST hits, among others.

3.1.4 Output: Best Primer Pairs

The most important output file is the Excel sheet generated by PRIMEGENS-v2, which shows the bestprimer pairs for each of the input query sequence provided by the user. This file also contains various typesof primer pair-related information and product size from each primer pair. Table 2 shows the format of thisExcel sheet output file.

3.1.5 Output: Alternate Primer Pairs

In addition to the best primer pairs, PRIMEGENS-v2 produces another file containing alternate primer pairsfor each input query sequence. In case a user wants to select alternate primer pairs instead of the bestprimers, or if the best primer pairs fail to amplify or match other desired criteria, this file provides multiplechoices for additional primer pairs for each query sequence. These primer pairs are ranked based on numberof potential cross-hybridizations. Table 3 shows the format of this output file.

3.1.6 Output: Failed Sequences

PRIMEGENS-v2 also generates one file containing all FASTA formatted sequences, in which the primerdesign failed. There could be various reasons for the failure, including design failure from the Primer3program. In this case, the user can analyze these sequences further and re-run PRIMEGENS-v2 only forfailed sequences using modified parameters.

Page 11: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 161

In addition to three standard output files, PSPD generates an additional output file, i.e., gene-specific frag-ments (recorded with name of the query file followed by “uniseg.txt”). This file contains gene-specificfragments (probe) for each input query sequence that PSPD finds. These are the gene-specific fragmentsthat PSPD ultimately uses to design primers for their corresponding query sequences.

PRIMEGENS-w3 is the Web Server version of PRIMEGENS-v2. It is meant to automate high-throughputprimer and probe design. PRIMEGENS-w3 allows users to utilize our computing resources in conductingPCR primer and probe designs for various applications. Having this available online removes the necessityfor users to install the software, and they can instead use a web server and with stored database on the serverin performing cross-hybridization checks.

PRIMEGENS-w3 consists of two basic steps: 1) uploading data files (PCR template files for primerdesign and optional database for cross-hybridization check); and 2) identifying primer-design specificationsfor job submission and execution. PRIMEGENS-w3 allows user to select any of the algorithms for primerdesign similar to its stand-alone version. The task of designing primers and probes using the web server alsorequires the same two types of inputs. One is the query file having the sequence for which primers/probesneed to be designed, and the other is the database file having all the sequences that are present in the PCRreaction. Sequences in the database files are used for PRIMEGENS-w3 to check for any potential cross-hybridization and thereby select primer/probes that are specific to the query sequence. The user can uploadtheir own customized database file or use available genomes supported by the server. We do not recommenduser to upload large database like any genome. Any new genome can be made available on the web serverbased on user request. PRIMEGENS-w3 also provides different sample data for both query and databasesequences to enable users to test primer/probe design.

The next stage of PRIMEGENS-w3 is to provide all input parameters for primer design. Input pa-rameters on this page of the server are divided into same five sections as described in section 3.1.3. Allparameters have been set to the default values for best primer design. The default values are visible onthe browser and they can be changed. After running PRIMEGENS-w3, the server will show the link to allresult files generated by the tool. PRIMEGENS-w3 also generates files similar to those by the standaloneversion as explained in Sections 3.1.4 - 3.1.7. These files can be viewed within the browser or downloadedon the users local computer. In case the PRIMEGENS-w3 execution takes long time to complete due tolarge size of input files, the server provides user with a link to be used later to check for the updated resultfiles. Along with this, a user can also provide an email address (optional) to enable PRIMEGENS-w3 tosend them an email notification for job completion and the link for the result files. Figure 6 shows the firstpage of the PRIMEGENS-w3 server. As can be seen, this page shows various options for input files requiredby PRIMEGENS-w3. Figure 7 shows the second page of PRIMEGENS-w3 containing the forms whereinthe user can indicate the desired parameters.

3.1.7 Output: Probe for Target Sequencing

4 PRIMEGENS-w3: Web Server for PRIMEGENS-v2

5 Application of PRIMEGENS-v2 for DNA Methylation Study

PRIMEGENS-v2 has several useful features that can be utilized in designing high-quality primers and probesspecifically for use in DNA methylation study in association with any disease or other biological phenotype.

Page 12: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

162 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

Figure 6: The first page of the PRIMEGENS-w3 web-server.

5.1 Primer and Probe Design for Bisulfite-Converted Sequences

There are various techniques that can precisely map cytosine-methylation sites in a DNA sequence. Ofthese techniques, the bisulfite conversion-based method is the most widely used. In this method, DNAis first denatured to a single-stranded DNA and then treated with sodium bisulfite to convert cytosine touracil. Through this reaction, all unmethylated cytosines are converted to uracil by deamination, leaving themethylated cytosines unconverted. After this modification step, DNA is amplified using two different PCRmethods: methylation-specific PCR (MSP) and bisulfite-sequencing PCR (BSP). In MSP, the methylationstatus of the DNA is assessed by designing two sets of primer pairs with at least one CpG site in each pair.One pair is specifically designed to amplify methylated DNA (M pair), while the other pair amplifies theunmethylated DNA (U pair). Therefore, for each DNA sequence, two PCRs are carried out using eachprimer pair. If a sequence becomes amplified by primers with methylated DNA (M pair), it indicates thepresence of methylation of CpG site within the primer sequence. On the other hand, if it becomes amplifiedby the other pair for unmethylated DNA (U pair), this represents that no methylation of CpG site occurredinside the primer sequence. Bisulfite-specific PCR (BSP) is another method used to study DNA methylationafter the bisulfite treatment of the DNA sequence. Primers used for BSP contain no CpG sites; thus, bothmethylated and unmethylated DNA sequences can be amplified. In a PCR product, identifying cytosinesindicates methylation, while identifying thymine at the same position indicates the lack of methylation withinthe amplified region.

PRIMEGENS-v2 can design PCR primers for bisulfite conversion-based PCR reaction. It requires theinput of bisulfite-converted query sequences and four variants of the database against which query sequenceare checked for cross-hybridization. To generate four genome variants after bisulfite conversion, all thecytosine (C) sites in the original sequence are converted into thymine (T), except for places where cytosineprecedes guanine (G), known as methylation of CG. In other words, for each chromosome sequence, fourchromosomal files are generated as the model of the bisulfite-treated sequences: (1) bisulfite-methylated

Page 13: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 163

Figure 7: The second page of the PRIMEGENS-w3 web-server.

forward sequence, (2) bisulfite-methylated reverse sequence, (3) bisulfite-unmethylated forward sequence,and (4) bisulfite-unmethylated reverse sequence. As an example, suppose a human genome fragment has anucleotide sequence ”agctagccagtcga”, this fragment is then modified to generate four variants as follows:

• agctagccagtcga - original DNA sequence

• agttagttagttga - unmethylated forward DNA sequence

• agttagttagtcga - methylated forward DNA sequence

• ttgattggttagtt - unmethylated reverse complementary DNA sequence

• tcgattggttagtt - methylated reverse complementary DNA sequence

The PRIMEGENS-v2 configuration file provides the following tags dedicated for bisulfite conversion-based PCR and MSP:

5.2 Primer and Probe Design around the Transcription Start Site

“Primer design covering TSS” is a feature of PRIMEGENS-v2, which helps the designed primers to coverthe region around a transcription start site (TSS) of any gene.

5.2.1 Uniform Coverage for a Sequence Region

For probe and primer design around the TSS of a gene, the FSPD algorithm is used to divide the querysequence into different fragments, including the surroundings of the TSS position for the gene present in

BISULFITE_MODE = 1 / 0 (1 for bisulfite-conversion based PCR and 0 for normal PCR),PRIMER_CG_MAX = -1 / +N (-1 = ignore this rule, +N = allows max “N” CGs within primer oligo).

Page 14: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

164 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

Query Sequence From Above Fragmentation

5’5’

5’

3’3’

3’

1500

1500

1500 1500

057057

052052 TSS

+2000-2000

Figure 8: Uniform coverage of the TSS while selecting the query sequence, where the fragmentlength of 1500 and the overlap length of 250 are taken to design three different fragments on andaround the TSS.

that query sequence. PRIMEGENS-v2 fragments the sequence, such that one sequence covers TSS withinitself or may exactly be in its center position and others around the TSS position overlap with the sequence.This approach is shown in Figure 8. These fragments are then taken as query sequences to further designprobes or primers, one fragment at a time, using the normal PRIMEGENS-v2 algorithm. This approachproduces a uniform coverage of the sequence on and around TSS of any gene or query sequence.

5.2.2 Coverage of Specific Region around the TSS

For covering specific region around the TSS of a human gene, the user is only required to provide genesymbols for which the primer design is required. PRIMEGENS-v2 is capable of extracting their respectiveTSSs from the UCSC Genome Database (currently for March 2006 assembly). Users can also providedirections from the TSS or region of interest for primer design. Based on this information, PRIMEGENS-v2will extract the final query sequence to be used for the ultimate primer design. Query sequence is chosen insuch a manner that it covers the TSS and the region in either side. Regions around the TSS can be definedout of three options for the primer design, including ”First Exon Coverage”, ”Across TSS”, and ”PromoterCoverage”. The length of the query sequence to be used for the primer design is determined by anotherPRIMEGENS-v2 variable, i.e., QLENGTH. Another important PRIMEGENS-v2 variable is MINCOVER,which determines the extension of the query sequence in the direction opposite to the side of the coveragepreference. It tries to cover either the whole or partial exon/promoter region within the region of QLENGTH.If the sum of the first exon length and MINCOVER is smaller than the QLENGTH, then the QLENGTH ischanged to cover just the exon-1; thus avoiding extending the query sequence for the primer design into nextintronic sequence. On the other hand, if the QLENGTH is smaller than the total of the MINCOVER regionand the first exonic region, i.e., QLENGTH is unable to cover the whole exon-1, and then PRIMEGENS-v2 considers the partial coverage of exon-1. Figure 9 illustrates this feature of PRIMEGENS-v2 to designprimers around the TSS of candidate genes based on users selections.

5.3 Primer and Probe Design in or around CpG Island near the Transcription Start Site

“Primer design covering CpG Island” is one of the unique features of PRIMEGENS-v2, which can be used tostudy methylation patterns of various oncogenes and tumor suppressor genes. This feature designs primers

Page 15: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 165

First Exon CoverageExon-1 < (QLENGTH - MINCOVER)

First Exon CoverageExon-1 > (QLENGTH - MINCOVER)

Across TSS

Promoter Coverage

TSS

TSS

TSS

TSS

QLENGTH

QLENGTH

QLENGTH

QLENGTHMINCOVER

MINCOVER

MINCOVER

Exon-1

Exon-1

Exon-1

Exon-1

Intron-1

Intron-1

Intron-1

Intron-1

Exon-2

Exon-2

Exon-2

Exon-2

Promoter Region

Exonic Region

Intronic Region

Query Sequence for Primer Design

TSS - Transcription Start Site

PRIMEGENS variables- VICINITY- MINCOVER- QLENGTH

Figure 9: Coverage of the desired region around the TSS in different cases.

for genes that have CpG islands present in close proximity to their respective TSSs. Primers can be designedto amplify genes whose expressions are suspected to be influenced by nearby CpG islands. In this approach,locating CpG islands is determined by a PRIMEGENS-v2 variable, VICINITY. PRIMEGENS-v2 looks forthe presence of CpG islands in the region of VICINITY, in both sides of a genes TSS. It tries to cover eitherthe whole or partial CpG islands in the region of the VICINITY value. CpG islands can lie to the left,right, or across the TSS. In addition, users can also provide their direction or region of interest to enablePRIMEGENS-v2 to look for CpG islands. By default, CpG islands present in any of the three possiblepositions are considered, and the distance between a CpG island and TSS, if found at one or the other sideof TSS are also reported. Based on this information, PRIMEGENS-v2 extracts the final query sequencefor the primer design. The length of the query sequence to be used for primer design is also determinedby PRIMEGENS-v2 variables QLENGTH and MINCOVER (see Section 5.2.2). The FSPD algorithm isrecommended for this type of primer design due to the long length of CpG islands. Figure 10 shows thisfeature of PRIMEGENS-v2, illustrates a design primer for genes surrounded by CpG islands. Thus far, thisfeature has been used and supported for human genome and their TSS and CpG locations. Other supportedgenomes will be used and tested in the future.

5.4 Probe design for Targeted Sequencing to Screen DNA Methylation Patterns of All CpG Islandsin the Human Genome

We applied PRIMEGENS-v2 in ultra-long oligonucleotides for a hybridization-based capture of CGIs andtheir flanking sequences from a randomly fragmented shotgun genomic DNA library (Figure 11). The longoligos (160bp) were synthesized by Agilent Technologies to capture the DNA representing 28,226 CGIs

Page 16: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

166 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

A. CpG island on Left of TSS

B. CpG island Crossing TSS

C. CpG island on Right of TSS

dnEtratS

EndStart

Start

CpGIsland

CpGIsland

CpGIsland

MINCOVER

MINCOVER

TSS

TSS

TSS

YTINICIV+YTINICIV-

YTINICIV+YTINICIV-

YTINICIV+YTINICIV-

QLENGTH

QLENGTH

QLENGTH

CpG Island

Query Sequence for Primer Design

PRIMEGENS variables: VICINITY, MINCOVER QLENGTH

Figure 10: Amplified CpG islands within the vicinity of a genes TSS. (A) shows the presence ofCpG island at the left side of the TSS (gradient blue) lying inside the region of VICINITY. Here,the query sequence has been picked of the length of QLENGTH value covering the MINCOVERregion at the opposite side of the TSS (right side), the TSS itself, and portion of the CpG onthe left. (B) shows the presence of CpG island across the TSS (gradient blue). Here, the querysequence has been picked of the length of QLENGTH value equally covering both sides of theTSS. (C) shows the presence of CpG island at the right side of the TSS (gradient blue) with itsstart position lying inside the region of VICINITY at the right side of the TSS. Here, the querysequence has been picked of the length of QLENGTH value covering the MINCOVER region atthe opposite side of the TSS (left side), the TSS itself, and portion of CpG at the left.

(sizes ranging from 200bp to 5000bp) dispersed throughout the genome. The single-strand DNA oligoswere converted to biotin-labeled RNA baits based on the protocols established by Gnirke et al. [Gnirkeet al., 2009]. The RNA baits were used to capture the CpG island DNA from a shotgun-sequencing libraryconsisting of breast cancer cell line DNA. The captured DNAwas then treated with bisulfite and convertedinto DNA sequencing libraries using PCR in a single-tube format. The CGI-enriched, bisulfite-modifiedgenomic DNA library was then sequenced using next-generation sequencer Solexa. As can be seen fromthe results, the probe design captured most of the intended target sequences (Figure 12).

5.5 Primer and Probe Design around the Maximum Cut-site Region

PRIMEGENS-v2 can also be used to search for regions with the maximum enzyme digestion sites (cut-sites)within each query sequence and design primers around these cut-sites. This ensures the presence of cut-sitesin the PCR product and is very useful in MSP. PRIMEGENS-v2 searches for any of the cut-sites providedby the user in the query sequence. From each cut-site, it tries to search for the region with the maximumdensity of such sites in the regions that are determined by the user as another PRIMEGENS-v2 variableknown as CUT SITE REGION. The region with the maximum cut site density is then considered as theregion for TARGET REGION variable used by Primer3 to design primers around them. Figure 13 showshow PRIMEGENS-v2 locates the maximum cut-site region in the whole query sequence and chooses the

Page 17: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu. 167

Shearing,adapter ligation,PCR (optional)

Solution hybridization

Elution

PCR

Sequencing

Bead capture

dnoPtiaB

Catch

Elution

PCR

Biotin–UTP transcription

T7 T7

Microarray GenomicDNA

tnemtaerT etiflusiBnoitceleS dribyH noituloS

Allele 1 (methylated) Allele 1 (unmethylated)

m---ACTCCACGG---TCCATCGCT------TGAGGTGCC---AGGTAGCGA---

m

---ACTCCACGG---TCCATCGCT------TGAGGTGCC---AGGTAGCGA---

Bisulfite treatmentAlkylation

Spontaneous denaturation

---AUTUUAUGG---TUUATCGUT---

---TGAGGTGUU---AGGTAGCGA--- ---TGAGGTGUU---AGGTAGUGA---

---AUTUUAUGG---TUUATUGUT---

Figure 11: Solution-based hybridization capture of CGIs and targeted shotgun bisulfite-treatedsequencing. (A) Left panel illustrates the steps involved in the preparation of a complex pool ofbiotinylated RNA capture probes (top left) using Agilents array-based ultra-long oligonucleotidessynthesis method. The whole-genome fragment input library (pond; top right) was created usingIllumina methylated pair-end adaptors. (B) The hybrid-selected, enriched output library (catch;bottom) was then treated with bisulfite and amplified using Illumina PCR primers. Two sequenc-ing targets and their respective baits are shown in red and blue. The universal adaptor sequencesare gray. The excess of single-stranded non-self-complementary RNA (wavy lines) drives thehybridization. (The left panel came from Nature Publishing Group)

target region within CUT SITE REGION, as well as how Primer3 designs primers around the target region.

6 Conclusion

Highly specific and efficient primer/probe design has been in demand for genomic research. The presence ofsimilar sequences in a sequence pool has often caused difficulties in the identification of unique sequencesof interest using PCR or pull-down approaches. Therefore, designing primers that enable PCR to specif-ically amplify one sequence from a large database of sequences in a high-throughput fashion is a majorchallenge. PRIMEGENS-v2 offers a high-throughput approach in conducting primer and probe design forlarge-scale amplification and sequencing projects. It automates the design of a reliable set of primers witha minimum rate of failure due to cross-hybridization. PRIMEGENS-v2 features can also be efficiently usedin studying DNA methylation. Furthermore, while other existing primer design tools usually design primersfor one gene at a time, PRIMEGENS-v2 can design primers for thousands of genes (fragments) in one run.

Page 18: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

168 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

Figure 12: A genome browser view of designed probes and captured DNA sequences. Theblack squares show the designed probes targeting CGIs. The varying bars indicate the sequenceabundance obtained by the Solexa sequencer at the genomic coordinates.

PRIMEGENS-v2 is the only available tool that can automatically check primer specificity with customizedinputs of sequences on a large scale. Therefore, PRIMEGENS-v2 provides a unique software package forthe research community as compared with other available tools for primer design.

Acknowledgements

This work was supported by the National Institute of Health (Grant No. 1R01-DA025779) and the Congres-sionally Directed Medical Research Programs, U.S. Army Medical Research and Material Command (GrantBC062135). We would like to thank Juyuan Guo, Muneendra Ojha, Michael Wang, and Trupti Joshi forhelpful discussions or technical assistance. Major computations were performed using the UMBC comput-ing resource.

References[Antequera & Bird, 1993] Antequera, F., & Bird, A. (1993). Number of CpG islands and genes in human and mouse. Proc Natl

Acad Sci U S A, 90(24), 11995-11999.

[Aranyi and Tusnady, 2007] Aranyi, T., & Tusnady, G. E. (2007). BiSearch: ePCR tool for native or bisulfite-treated genomictemplate. Methods Mol Biol, 402, 385-402.

[Aryani et al., 2006] Aranyi, T., Varadi, A., Simon, I., & Tusnady, G. E. (2006). The BiSearch web server. BMC Bioinformatics, 7,431.

[Arvidsson et al., 2006] Arvidsson, S., Kwasniewski, M., Riano-Pachon, D. M., & Mueller-Roeber, B. (2008). QuantPrime--a flex-ible tool for reliable high-throughput primer design for quantitative PCR. BMC Bioinformatics, 9, 465.

[Bedford and Helden, 1987] Bedford, M. T., & van Helden, P. D. (1987). Hypomethylation of DNA in pathological conditions ofthe human prostate. Cancer Res, 47(20), 5274-5276.

[Bertone et al, 2006] Bertone, P., Trifonov, V., Rozowsky, J. S., Schubert, F., Emanuelsson, O., Karro, J., et al. (2006). Designoptimization methods for genomic DNA tiling arrays. Genome Res, 16(2), 271-281.

[Brook et al., 2009] Brooks, J., Cairns, P., & Zeleniuch-Jacquotte, A. (2009). Promoter methylation and the detection of breastcancer. Cancer Causes Control , 20(9), 1539-1550.

Page 19: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 169

Scanning CUT_SITE_REGION Across Whole Sequence

PRIMEGENSv2 variables - CUT_SITE_COUNT (3) - CUT_SITE (CCGG, CGCG, GCGC, GGCC) - CUT_SITE_REGION (100)

Target Region (Primer-3 Variable) - maximum

Query Sequence

Forward Primer

Reverse Primer

Aci I (CCGC)

Hin6 I (CGCG)

Hpa II (CCGG)

Cut Sites, like

Figure 13: The PRIMEGENS-v2 algorithm for choosing the maximum cut-site region in querysequence for the primer design. It systematically searches for segments including the cut-sitesand then identifies the regions with the maximum cut-sites for primer design.

[Chung et al., 2008] Chung, W., Kwabi-Addo, B., Ittmann, M., Jelinek, J., Shen, L., Yu, Y., et al. (2008). Identification of noveltumor markers in prostate, colon and breast cancer by unbiased methylation profiling. PLoS One, 3(4), e2079.

[Clark, 2007] Clark, S. J. (2007). Action at a distance: epigenetic silencing of large chromosomal regions in carcinogenesis. HumMol Genet, 16 Spec No 1, R88-95.

[Das & Singal, 2004] Das, P. M., & Singal, R. (2004). DNA methylation and cancer. J Clin Oncol, 22(22), 4632-4642.

[David, 2005] David, J. P., Strode, C., Vontas, J., Nikou, D., Vaughan, A., Pignatelli, P. M., et al. (2005). The Anopheles gambiaedetoxification chip: a highly specific microarray to study metabolic-based insecticide resistance in malaria vectors. Proc NatlAcad Sci U S A, 102(11), 4080-4084.

[Eckhardt et al.,2006] Eckhardt, F., Lewin, J., Cortese, R., Rakyan, V. K., Attwood, J., Burger, M., et al. (2006). DNA methylationprofiling of human chromosomes 6, 20 and 22. Nat Genet, 38(12), 1378-1385.

[Eden & Cedar, 1994] Eden, S., & Cedar, H. (1994). Role of DNA methylation in the regulation of transcription. Curr Opin GenetDev, 4(2), 255-259.

[Ehrlich, 2002] Ehrlich, M. (2002). DNA methylation in cancer: too much, but also too little. Oncogene, 21(35), 5400-5413.

[Ehrlich et al., 1982] Ehrlich, M., Gama-Sosa, M. A., Huang, L. H., Midgett, R. M., Kuo, K. C., McCune, R. A., et al. (1982).Amount and distribution of 5-methylcytosine in human DNA from different types of tissues of cells. Nucleic Acids Res, 10(8),2709-2721.

[Esteller &Herman, 2002] Esteller, M., & Herman, J. G. (2002). Cancer as an epigenetic disease: DNA methylation and chromatinalterations in human tumours. J Pathol , 196(1), 1-7.

[Feinberg and Vogelstein, 1983] Feinberg, A. P., & Vogelstein, B. (1983). Hypomethylation of ras oncogenes in primary humancancers. Biochem Biophys Res Commun, 111(1), 47-54.

Page 20: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

170 Chapter 11 – High-throughput Primer and Probe Design for Genome-Wide DNA Methylation Study

[Feng et al., 2007] Feng, W., Shen, L., Wen, S., Rosen, D. G., Jelinek, J., Hu, X., et al. (2007). Correlation between CpG methylationprofiles and hormone receptor status in breast cancers. Breast Cancer Res , 9(4), R57.

[Flanagan and Wild, 2007] Flanagan, J. M., & Wild, L. (2007). An epigenetic role for noncoding RNAs and intragenic DNA methy-lation. Genome Biol, 8(6), 307.

[Fujikane et al., 2009] Fujikane, T., Nishikawa, N., Toyota, M., Suzuki, H., Nojima, M., Maruyama, R., et al. (2009). Genomicscreening for genes upregulated by demethylation revealed novel targets of epigenetic silencing in breast cancer. Breast CancerRes Treat, aheadofprint.

[Gnirke et al., 2009] Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E. M., Brockman, W., et al. (2009). Solutionhybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol, 27(2), 182-189.

[Haas et al., 1998] Haas, S., Vingron, M., Poustka, A., & Wiemann, S. (1998). Primer design for large scale sequencing. NucleicAcids Res, 26(12), 3006-3012.

[Herman et al., 1996] Herman, J. G., Graff, J. R., Myohanen, S., Nelkin, B. D., & Baylin, S. B. (1996). Methylation-specific PCR:a novel PCR assay for methylation status of CpG islands. Proc Natl Acad Sci U S A, 93(18), 9821-9826.

[Hiller & Green, 1991] Hillier, L., & Green, P. (1991). OSP: a computer program for choosing PCR and DNA sequencing primers.PCR Methods Appl, 1(2), 124-128.

[Hilson et al., 2004] Hilson, P., Allemeersch, J., Altmann, T., Aubourg, S., Avon, A., Beynon, J., et al. (2004). Versatile gene-specific sequence tags for Arabidopsis functional genomics: transcript profiling and reverse genetics applications. Genome Res,14(10B), 2176-2189.

[Hydman et al. 1996] Hyndman, D., Cooper, A., Pruzinsky, S., Coad, D., & Mitsuhashi, M. (1996). Software to determine optimaloligonucleotide sequences based on hybridization simulation data. Biotechniques, 20(6), 1090-1094, 1096-1097.

[Jaenisch & Bird, 2003] Jaenisch, R., & Bird, A. (2003). Epigenetic regulation of gene expression: how the genome integratesintrinsic and environmental signals. Nat Genet, 33 Suppl, 245-254.

[Jones & Takai, 2001] Jones, P. A., & Takai, D. (2001). The role of DNA methylation in mammalian epigenetics. Science,293(5532), 1068-1070.

[Jurgens et al.,1996] Jurgens, B., Schmitz-Drager, B. J., & Schulz, W. A. (1996). Hypomethylation of L1 LINE sequences prevailingin human urothelial carcinoma. Cancer Res, 56(24), 5698-5703.

[Kim et al., 1994] Kim, Y. I., Giuliano, A., Hatch, K. D., Schneider, A., Nour, M. A., Dallal, G. E., et al. (1994). Global DNAhypomethylation increases progressively in cervical dysplasia and carcinoma. Cancer , 74(3), 893-899.

[Kistensen et al., 2009] Kristensen, L. S., Nielsen, H. M., & Hansen, L. L. (2009). Epigenetics and cancer treatment. Eur J Phar-macol, 625(1-3), 131-142.

[Liard, 2003] Laird, P. W. (2003). The power and the promise of DNA methylation markers. Nat Rev Cancer, 3(4), 253-266.

[Lemoine et al., 2009] Lemoine, S., Combes, F., & Le Crom, S. (2009). An evaluation of custom microarray applications: theoligonucleotide design challenge. Nucleic Acids Res, 37(6), 1726-1739.

[Leparc et al., 2009] Leparc, G. G., Tuchler, T., Striedner, G., Bayer, K., Sykacek, P., Hofacker, I. L., et al. (2009). Model-basedprobe set optimization for high-performance microarrays. Nucleic Acids Res, 37(3), e18.

[Li, 2002] Li, E. (2002). Chromatin modification and epigenetic reprogramming in mammalian development. Nat Rev Genet, 3(9),662-673.

[Li et al., 2002] Li, L. C., & Dahiya, R. (2002). MethPrimer: designing primers for methylation PCRs. Bioinformatics, 18(11),1427-1431.

[Li et al., 1997] Li, P., Kupfer, K. C., Davies, C. J., Burbee, D., Evans, G. A., & Garner, H. R. (1997). PRIMO: A primer designprogram that applies base quality statistics for automated large-scale DNA sequencing. Genomics, 40(3), 476-485.

[Libault et al., 2009] Libault, M., Joshi, T., Takahashi, K., Hurley-Sommer, A., Puricelli, K., Blake, S., et al. (2009). Large-scaleanalysis of putative soybean regulatory gene expression identifies a Myb gene involved in soybean nodule development. PlantPhysiol,151(3), 1207-1220.

[Lin et al., 2001] Lin, C. H., Hsieh, S. Y., Sheen, I. S., Lee, W. C., Chen, T. C., Shyu, W. C., et al. (2001). Genome-wide hy-pomethylation in hepatocellular carcinogenesis. Cancer Res, 61(10), 4238-4243.

Page 21: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and

G. P. Srivastava, G. Kushwaha, H. Shi & D. Xu 171

[Liu et al., 2003] Liu, Y., Zhou, J., Omelchenko, M. V., Beliaev, A. S., Venkateswaran, A., Stair, J., et al. (2003). Transcriptomedynamics of Deinococcus radiodurans recovering from ionizing radiation. Proc Natl Acad Sci U S A, 100(7), 4191-4196.

[Marshall, 2004] Marshall, O. J. (2004). PerlPrimer: cross-platform, graphical primer design for standard, bisulphite and real-timePCR. Bioinformatics, 20(15), 2471-2472.

[Onodera & Melcher, 2004] Onodera, K., & Melcher, U. (2004). Selection for 3’ end triplets for polymerase chain reaction primers.Mol Cell Probes, 18(6), 369-372.

[Pattyn et al. 2006] Pattyn, F., Hoebeeck, J., Robbrecht, P., Michels, E., De Paepe, A., Bottu, G., et al. (2006). methBLAST andmethPrimerDB: web-tools for PCR based methylation analysis. BMC Bioinformatics, 7, 496.

[Proutski & Holmes, 1996] Proutski, V., & Holmes, E. C. (1996). Primer Master: a new program for the design and analysis ofPCR primers. Comput Appl Biosci, 12(3), 253-255.

[Rakyan et al. ,2004] Rakyan, V. K., Hildmann, T., Novik, K. L., Lewin, J., Tost, J., Cox, A. V., et al. (2004). DNA methylationprofiling of the human major histocompatibility complex: a pilot study for the human epigenome project. PLoS Biol, 2(12), e405.

[Razin & Cedar, 1994] Razin, A., & Cedar, H. (1994). DNA methylation and genomic imprinting. Cell, 77(4), 473-476.

[Rozen & Skaletsky, 2000] Rozen, S., & Skaletsky, H. (2000). Primer3 on the WWW for general users and for biologist program-mers. Methods Mol Biol, 132, 365-386.

[Shen et al., 2007] Shen, L., Kondo, Y., Guo, Y., Zhang, J., Zhang, L., Ahmed, S., et al. (2007). Genome-wide profiling of DNAmethylation reveals a class of normally methylated CpG island promoters. PLoS Genet, 3(10), 2023-2036.

[Shiota, 2004] Shiota, K. (2004). DNA methylation profiles of CpG islands for cellular differentiation and development in mam-mals. Cytogenet Genome Res, 105(2-4), 325-334.

[Shivapurkar & Gazdar, 2010] Shivapurkar, N., & Gazdar, A. F. (2010). DNA methylation based biomarkers in non-invasive cancerscreening. Curr Mol Med, 10(2), 123-132.

[Srivastava et al., 2008] Srivastava, G. P., Guo, J., Shi, H., & Xu, D. (2008). PRIMEGENS-v2: genome-wide primer design foranalyzing DNA methylation patterns of CpG islands. Bioinformatics, 24(17), 1837-1842.

[Taylor et al., 2007] Taylor, K. H., Kramer, R. S., Davis, J. W., Guo, J., Duff, D. J., Xu, D., et al. (2007). Ultradeep bisulfitesequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res, 67(18), 8511-8518.

[Tsai et al., 2007] Tsai, M. F., Lin, Y. J., Cheng, Y. C., Lee, K. H., Huang, C. C., Chen, Y. T., et al. (2007). PrimerZ: streamlinedprimer design for promoters, exons and human SNPs. Nucleic Acids Res, 35(Web Server issue), W63-65.

[Tuck et al., 2000] Tuck-Muller, C. M., Narayan, A., Tsien, F., Smeets, D. F., Sawyer, J., Fiala, E. S., et al. (2000). DNA hy-pomethylation and unusual chromosome instability in cell lines from ICF syndrome patients. Cytogenet Cell Genet, 89(1-2),121-128.

[Tusnady et al., 2005] Tusnady, G. E., Simon, I., Varadi, A., & Aranyi, T. (2005). BiSearch: primer-design and search tool for PCRon bisulfite-treated genomes. Nucleic Acids Res, 33(1), e9.

[Van et al, 2007] van Vliet, J., Oates, N. A., & Whitelaw, E. (2007). Epigenetic mechanisms in the context of complex diseases.Cell Mol Life Sci, 64(12), 1531-1538.

[Venter et al., 2001] Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., et al. (2001). The sequence ofthe human genome. Science, 291(5507), 1304-1351.

[Woonbok et al., 2008] Woonbok Chung, Bernard Kwabi-Addo, Michael Ittmann, Jaroslav Jelinek, Lanlan Shen, Yinhua Yu, andJean-Pierre J. Issa (2008) Identification of Novel Tumor Markers in Prostate, Colon and Breast Cancer by Unbiased MethylationProfiling. PLoS ONE.3(4): e2079.

[Xu et al., 2002] Xu, D., Li, G., Wu, L., Zhou, J., & Xu, Y. (2002). PRIMEGENS: robust and efficient design of gene-specificprobes for microarray analysis. Bioinformatics, 18(11), 1432-1437.

[Yamada et al., 2006] Yamada, T., Soma, H., & Morishita, S. (2006). PrimerStation: a highly specific multiplex genomic PCRprimer design server for the human genome. Nucleic Acids Res, 34(Web Server issue), W665-669.

[You et al., 2008] You, F. M., Huo, N., Gu, Y. Q., Luo, M. C., Ma, Y., Hane, D., et al. (2008). BatchPrimer3: a high throughput webapplication for PCR and sequencing primer design. BMC Bioinformatics, 9, 253.

[Zhang and Smith, 2010] Zhang MQ, Smith AD. (2010) Challenges in understanding genome-wide DNA methylation. Journal ofComputer Science and technology. 25(1): 26–34.

Page 22: High-throughput Primer and Probe Design for … Primer and Probe Design for Genome-Wide DNA Methylation Study Using PRIMEGENS Gyan Prakash Srivastava Computer Science Department and