MICHAEL MORRA
CSE 4939W
Detection of Transcription Factor Binding Sites
Project Recap
Implement a method used to accurately and precisely discover the locations of transcription factor binding sites within a DNA sequence.
4 species (Human, Mouse, Fruit Fly & Yeast)
52 Transcription Factors, 524 binding sites
Image from: http://www.cs.uiuc.edu/homes/sinhas/work.html
Multiple Sequence Alignment
To be able to analyze the data effectively, each transcription factor’s binding sites need to be aligned
ClustalW2
>s1
GACTTTTCGCT
>s2
CGATTTTCTCG
>s3
GCATTTTCCCA
>s4
AGAGAAAACCC
>s5
GAATAACCCAAGAGAAA
>s6
ACAGAAAAATC
>s7
CGAGAAAATCG
>s8
TGGTTTTCCCG
>s9
GGGTTTCTCCC
Scoring
Berg and von Hippel method
l = length of the sequence to be scored
j = position in the sequence
nj = number of times a base occurs at position j in the alignment
tj = base at position j in the sequence to be scored
nj(0) = number of times the most common base occurs at position j
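A minimal C++ sketch of how a score of this kind could be computed from an alignment. It assumes one commonly cited log-ratio form of the Berg and von Hippel score: the sum over positions j of ln((nj(tj) + 0.5) / (nj(0) + 0.5)), where nj(tj) is the count of the scored base and nj(0) is the count of the most common base. The 0.5 pseudocount and the sign convention (0 for a consensus match, negative otherwise, so that higher is better) are assumptions, as are the names buildCounts and scoreSite and the requirement that the scored sequence has the same length as the alignment.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Counts of A, C, G, T in one alignment column (hypothetical layout).
using Column = std::array<int, 4>;

// Map a base to an index into Column; returns -1 for gaps or unknown symbols.
int baseIndex(char b) {
    switch (b) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Build the per-position counts n_j from equal-length aligned binding sites.
std::vector<Column> buildCounts(const std::vector<std::string>& alignment) {
    assert(!alignment.empty());
    std::vector<Column> counts(alignment.front().size(), Column{0, 0, 0, 0});
    for (const std::string& site : alignment) {
        assert(site.size() == counts.size());
        for (std::size_t j = 0; j < site.size(); ++j) {
            const int b = baseIndex(site[j]);
            if (b >= 0) ++counts[j][b];  // gaps and unknown symbols are not counted
        }
    }
    return counts;
}

// Assumed score: sum over j of ln((n_j(t_j) + 0.5) / (n_j(0) + 0.5)).
// A sequence that matches the most common base everywhere scores 0.
double scoreSite(const std::vector<Column>& counts, const std::string& seq) {
    assert(seq.size() == counts.size());
    double score = 0.0;
    for (std::size_t j = 0; j < seq.size(); ++j) {
        const int best = *std::max_element(counts[j].begin(), counts[j].end());
        const int b = baseIndex(seq[j]);
        const int observed = (b >= 0) ? counts[j][b] : 0;
        score += std::log((observed + 0.5) / (best + 0.5));
    }
    return score;
}
```

With the toy alignment ACGT, ACGT, ACGA, scoreSite returns 0 for ACGT and ln(1.5 / 2.5) for ACGA, since only the last position departs from the most common base.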
Implementation
Microsoft Visual Studio - C++
Input
  Multiple Sequence Alignment of a transcription factor’s binding sites (.txt file)
  All binding sites of a species (.txt file)
Output
  Scores
  Results of Leave One Out Cross Validation
  Testing and Efficiency purposes
Implementation
Scoring Algorithm
  Input: Alignment
  Function: Create the scoring matrix
Leave One Out Cross Validation
  Input: Alignment and Binding Sites
  Function: Test the effectiveness of the scoring matrix
Functionality
Sequence to be scored is shorter than the alignment
  Slide the sequence over the alignment and take the highest scoring portion
Sequence to be scored is longer than the alignment
  Slide the alignment over the sequence and take the highest scoring portion
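The two sliding cases above can be sketched with a single loop over offsets. This is an illustration only: the names scoreWindow and bestSlidingScore are hypothetical, and the per-position score is the assumed log-ratio form ln((nj(tj) + 0.5) / (nj(0) + 0.5)), in which higher is better.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Counts of A, C, G, T in one alignment column (hypothetical layout).
using Column = std::array<int, 4>;

// Map a base to an index into Column; returns -1 for gaps or unknown symbols.
int baseIndex(char b) {
    switch (b) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Score `len` bases of seq, starting at seqStart, against the columns
// starting at colStart (assumed log-ratio score, higher is better).
double scoreWindow(const std::vector<Column>& counts, std::size_t colStart,
                   const std::string& seq, std::size_t seqStart, std::size_t len) {
    double score = 0.0;
    for (std::size_t k = 0; k < len; ++k) {
        const Column& col = counts[colStart + k];
        const int best = *std::max_element(col.begin(), col.end());
        const int b = baseIndex(seq[seqStart + k]);
        const int observed = (b >= 0) ? col[b] : 0;
        score += std::log((observed + 0.5) / (best + 0.5));
    }
    return score;
}

// Try every offset of the shorter item along the longer one; keep the best score.
double bestSlidingScore(const std::vector<Column>& counts, const std::string& seq) {
    assert(!counts.empty() && !seq.empty());
    const std::size_t width = counts.size();
    const std::size_t len = std::min(width, seq.size());
    const std::size_t shifts = std::max(width, seq.size()) - len;
    double best = -std::numeric_limits<double>::infinity();
    for (std::size_t s = 0; s <= shifts; ++s) {
        const double score = (seq.size() <= width)
            ? scoreWindow(counts, s, seq, 0, len)   // sequence slides over the alignment
            : scoreWindow(counts, 0, seq, s, len);  // alignment slides over the sequence
        best = std::max(best, score);
    }
    return best;
}
```

One caveat of this sketch: a shorter sequence sums fewer terms than a full-length one, so scores for sequences of different lengths are not directly comparable without some normalization.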
Testing: Scoring Algorithm/LOOCV
Unit testing will be done on each function and critical portions of code as they are implemented
Once it is determined that the code is functioning correctly and all formulas are providing correct results, implementation can continue
Testing: Overall Performance
To determine the effectiveness of the algorithm, a cross validation technique is used
This technique involves leaving one binding site out when the multiple sequence alignment is performed, and then scoring that left-out sequence
If the algorithm is effective, the left-out sequence should score higher than most (more than 80-90%) of the other binding sites within that species.
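The cross validation loop can be sketched as follows. This is a hedged illustration, not the project's code: the Scorer callback and the leaveOneOut function are hypothetical names, and the callback stands in for whatever rebuilds the alignment and scoring matrix from the remaining sites and then scores one candidate. For each left-out site, the sketch reports the fraction of the species' binding sites that it outscores, which can then be compared against the 80-90% target.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback: build a model from `training` and score `candidate`.
using Scorer = std::function<double(const std::vector<std::string>& training,
                                    const std::string& candidate)>;

// For each site, train on all other sites, score the left-out site, and
// return the fraction of `background` sites that it scores strictly higher than.
std::vector<double> leaveOneOut(const std::vector<std::string>& sites,
                                const std::vector<std::string>& background,
                                const Scorer& score) {
    assert(sites.size() >= 2 && !background.empty());
    std::vector<double> fractions;
    for (std::size_t i = 0; i < sites.size(); ++i) {
        // Training set: every site except the one being left out.
        std::vector<std::string> training;
        for (std::size_t j = 0; j < sites.size(); ++j) {
            if (j != i) training.push_back(sites[j]);
        }
        const double held = score(training, sites[i]);
        std::size_t beaten = 0;
        for (const std::string& other : background) {
            if (held > score(training, other)) ++beaten;
        }
        fractions.push_back(static_cast<double>(beaten) / background.size());
    }
    return fractions;
}
```

Passing the scorer as a callback keeps the loop independent of how the alignment is redone for each fold. Ties count against the left-out site here, which is a conservative choice rather than something the slides specify.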
Progress
Alignments Complete
Scoring Algorithm Mostly Complete
Leave One Out Cross Validation Partially Complete
Remaining Schedule
Nov 15th – Nov 19th
  Finish implementation and testing of scoring algorithm
Nov 20th – Nov 29th
  Finish implementation of leave one out algorithm
  Begin testing of entire program’s effectiveness
Nov 30th – Dec 6th
  Complete testing
  Tweak program to run more effectively/accurately
Questions?