Leveraging Genetic Algorithm and Neural Networks in Automated Protein Crystal Recognition
Ming Jack Po and Andrew Laine
Department of Biomedical Engineering, Columbia University, New York, NY, USA
August 22nd, 2008
IEEE EMBS Annual Conference 2008, Vancouver, Canada
Agenda
• Introduction
• Current Algorithm
• Future Direction
Protein Structure Determination currently relies on X-ray crystallography
• The production of protein crystals is crucial to protein structure determination via x-ray crystallography.
• In 2000, the US National Institute of General Medical Sciences of the National Institutes of Health funded the Protein Structure Initiative (PSI), a ten-year project to uncover the three-dimensional shapes of a wide range of proteins.1
• Unfortunately, there is currently no reliable methodology for predicting the conditions that lead to protein crystallization.
– High-throughput (HTP) experiments with varying crystallization parameters are being performed in order to “brute force” the problem.
1) http://www.nature.com/nmeth/journal/v5/n2/full/nmeth0208-203.html
HTP Protein Crystallization Screening is currently the bottleneck in protein crystal discovery
• An extensive backlog of images has accumulated.
– 1536 wells/plate × 5K plates × 6 time points ≈ 46M images*
• Manual inspection of images from HTP experiments is not practical.
– Qualified and trained crystallographers are in short supply.
– Crystallographers cannot keep up with the speed of the robotic systems used in production experiments.
• Automated protein crystallization screening is needed to handle both the existing backlog of images and future images.
* Feb 2002 to October 2006 only
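The backlog figure is plain arithmetic on the numbers quoted on this slide. A quick check, as a sketch (the constant and function names are illustrative only):

```python
# Figures quoted on the slide (Feb 2002 to October 2006).
WELLS_PER_PLATE = 1536
PLATES = 5_000
TIME_POINTS = 6


def backlog_images(wells=WELLS_PER_PLATE, plates=PLATES, time_points=TIME_POINTS):
    """One image per well, per plate, per time point."""
    return wells * plates * time_points


total = backlog_images()
print(f"{total:,} images (~{total / 1e6:.0f}M)")
```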
Several key challenges have to be overcome for automated protein crystal recognition
• Arbitrary geometric orientation and structure of crystals
• Presence of organic matter
• Non-uniform lighting conditions
• Irregular droplet boundaries and size.
[Figure: Hits]
Our Solution to the problem – Neural Networks
• Advantages
– Allows for incremental learning
– Can deal with the seemingly arbitrary geometric orientation and structure of crystals
– Fast classification once the neural net has been trained
• Disadvantages
– Black-box methodology
– Requires identifying a good feature set for good performance
– Needs a sufficiently large training set to be robust
Training database has been compiled by HWI expert crystallographers
Image categories: Crystal; Precipitate; Precipitate & Skin; Clear; Skin; Phase Separation; Precipitate & Crystal; Unsure; Phase Separation & Precipitate; Phase Separation & Crystal
• Dr. George DeTitta et al. at HWI (Buffalo, NY) have compiled a data set of 73,632 manually classified images.
– 3 independent crystallographers each categorized 75,000 images into one of the above categories.
– 73,632 of these images have consensus between at least two crystallographers. Only these images were used for training and validation.
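The "at least two of three agree" rule can be sketched as a majority vote. This is a minimal illustration, not the procedure HWI used; the image ids and ratings are made up.

```python
from collections import Counter


def consensus_label(labels, min_agree=2):
    """Return the label chosen by at least `min_agree` raters, or None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


def filter_consensus(ratings, min_agree=2):
    """Keep only images whose raters reach consensus.

    `ratings` maps a (hypothetical) image id to the labels it received.
    """
    kept = {}
    for image_id, labels in ratings.items():
        label = consensus_label(labels, min_agree)
        if label is not None:
            kept[image_id] = label
    return kept


# Made-up example ratings, not real HWI data.
ratings = {
    "img_001": ["Crystal", "Crystal", "Precipitate"],
    "img_002": ["Clear", "Skin", "Unsure"],
    "img_003": ["Clear", "Clear", "Clear"],
}
print(filter_consensus(ratings))
```

Images such as `img_002`, where all three raters disagree, are dropped from the training and validation pool.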
Agenda
• Introduction
• Current Algorithm
• Future Direction
Pre-Processing Steps
• Images are converted to Sobel edge sets and single edge points are removed.
• A multi-population genetic algorithm (MPGA) is run on the image to find an elliptical region of interest (ROI); details follow on the next few slides.
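The first pre-processing step can be sketched as follows. This is a plain-Python illustration only: the threshold value and the 8-neighbour definition of a "single" edge point are assumptions, and the real pipeline is described later as C++ code.

```python
def sobel_edges(img, threshold):
    """Binary edge map of a grayscale image (list of rows) via the Sobel operator."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def remove_isolated(edges):
    """Drop edge pixels that have no edge pixel among their 8 neighbours."""
    h, w = len(edges), len(edges[0])
    out = [row[:] for row in edges]
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            has_neighbour = any(
                edges[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                out[y][x] = 0
    return out


# A vertical intensity step produces edge pixels along the step.
step = [[0, 0, 100, 100, 100] for _ in range(5)]
print(remove_isolated(sobel_edges(step, threshold=200)))
```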
Image Normalization
MPGA 1 – ROI Detection
MPGA 2 – Area of Crystals
Linearity Detector
Laplacian Pyramidal Decomposition
Feature Extraction
Multiple Population Genetic Algorithm
• Randomly select 100 “chromosomes” of 5 points each.
• Fitness is based on a similarity metric and a distance metric.
Similarity $= \frac{1}{\#\text{total}} \sum_{(x,y)} E(x+i,\, y+j)\, d_{(x,y)}$, Distance $d_{(x,y)} = e^{-\frac{|i|+|j|}{4}}$
• Evolution proceeds through selection and diversification.
– Optimize for a high fitness score based on a combination of similarity and distance scores.
– Selection eliminates low-fitness populations.
– Diversification is realized through crossover, mutation and clustering.
• Significant speed and accuracy improvements vs. Randomized Hough Transforms.
– Processing time dropped 50% to ~10 seconds for ROI detection.
Yao, J., Kharma, N., and Grogono, P, "A multi-population genetic algorithm for robust and fast ellipse detection", Pattern Analysis & Applications, Volume 8, Issue 1 - 2, Sep 2005, pp. 149-162
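One possible reading of the fitness terms, as a sketch rather than the authors' or Yao et al.'s implementation. It assumes that similarity is the weighted fraction of ideal perimeter points with an edge pixel nearby, that the weight decays as exp(-(|i|+|j|)/4) with the pixel offset (i, j), and that the search window is 5×5. The perimeter sampling density and all function names are illustrative.

```python
import math


def ellipse_perimeter(cx, cy, r_maj, r_min, theta, n=72):
    """Sample integer pixel positions on the ideal ellipse perimeter."""
    pts = set()
    for k in range(n):
        t = 2 * math.pi * k / n
        x = cx + r_maj * math.cos(t) * math.cos(theta) - r_min * math.sin(t) * math.sin(theta)
        y = cy + r_maj * math.cos(t) * math.sin(theta) + r_min * math.sin(t) * math.cos(theta)
        pts.add((round(x), round(y)))
    return sorted(pts)


def similarity(edge_points, perimeter, radius=2, scale=4.0):
    """Weighted fraction of ideal perimeter points matched by a nearby edge pixel."""
    edge_set = set(edge_points)
    total = 0.0
    for (x, y) in perimeter:
        best = 0.0
        # Search the (2*radius+1)^2 window for the best-weighted edge pixel.
        for i in range(-radius, radius + 1):
            for j in range(-radius, radius + 1):
                if (x + i, y + j) in edge_set:
                    best = max(best, math.exp(-(abs(i) + abs(j)) / scale))
        total += best
    return total / len(perimeter)


perimeter = ellipse_perimeter(20, 20, 10, 6, 0.3)
shifted = [(x + 1, y) for x, y in perimeter]
print(similarity(perimeter, perimeter), similarity(shifted, perimeter))
```

A candidate whose perimeter coincides with the edge pixels scores 1; the score falls as the edge pixels drift away from the ideal perimeter.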
Ellipsoidal Geometry
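• The equation of a conic through 5 points is
$$ax^2 + 2hxy + by^2 + 2gx + 2fy + 1 = 0$$
– This conic is an ellipse iff $ab - h^2 > 0$.
• With 5 (x, y) pairs, it is possible to solve for the parameters (a, h, b, g, f), and in turn for the physical ellipse parameters:
$$x_0 = \frac{hf - bg}{C}, \qquad y_0 = \frac{gh - af}{C}, \qquad \theta = \frac{1}{2}\arctan\frac{2h}{a - b}$$
$$r_{maj}^2 = \frac{-2\Delta}{C\,(a + b - R)}, \qquad r_{min}^2 = \frac{-2\Delta}{C\,(a + b + R)}$$
where
$$C = ab - h^2, \qquad R^2 = (a - b)^2 + 4h^2, \qquad \Delta = \begin{vmatrix} a & h & g \\ h & b & f \\ g & f & 1 \end{vmatrix}$$
(written for $a + b > 0$; the two radii swap roles otherwise)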
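A sketch of the five-point fit, assuming the conic is normalized so that its constant term is 1. The linear solver and all function names are illustrative, not the authors' code; `theta` is the orientation of one principal axis, with the quadrant resolved by `atan2`.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def conic_through(points):
    """Coefficients (a, h, b, g, f) of ax^2 + 2hxy + by^2 + 2gx + 2fy + 1 = 0."""
    A = [[x * x, 2 * x * y, y * y, 2 * x, 2 * y] for x, y in points]
    return solve_linear(A, [-1.0] * 5)


def ellipse_params(a, h, b, g, f):
    """Return (x0, y0, r_maj, r_min, theta), or None if the conic is not a real ellipse."""
    C = a * b - h * h
    if C <= 0:
        return None  # parabola or hyperbola
    x0 = (h * f - b * g) / C
    y0 = (g * h - a * f) / C
    R = math.hypot(a - b, 2 * h)
    # Determinant of [[a, h, g], [h, b, f], [g, f, 1]].
    delta = a * (b - f * f) - h * (h - f * g) + g * (h * f - b * g)
    r1 = -2 * delta / (C * (a + b - R))
    r2 = -2 * delta / (C * (a + b + R))
    if r1 <= 0 or r2 <= 0:
        return None  # imaginary ellipse
    theta = 0.5 * math.atan2(2 * h, a - b)
    # max/min makes the result independent of the sign of a + b.
    return x0, y0, math.sqrt(max(r1, r2)), math.sqrt(min(r1, r2)), theta


# Five points on an axis-aligned ellipse centred at (30, 40) with radii 10 and 5.
pts = [(30 + 10 * math.cos(t), 40 + 5 * math.sin(t)) for t in (0, 1, 2, 3, 4)]
print(ellipse_params(*conic_through(pts)))
```

In a GA setting each 5-point chromosome would be decoded this way, and candidates returning `None` would be rejected or given zero fitness.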
MPGA is run twice due to variations in fitness criteria
– Similarity $= \frac{1}{\#\text{total}} \sum_{(x,y)} E(x+i,\, y+j)\, d_{(x,y)}$, Distance $d_{(x,y)} = e^{-\frac{|i|+|j|}{4}}$
– The multiple population genetic algorithm gives significantly faster and more robust search results than the Randomized Hough Transform.
– MPGA 1 – ROI Detection
• Heavy distance penalties for points that do not line up exactly on the perimeter of the projected ellipse.
• Looks for r_maj close to r_min (more circular shapes: droplets, wells).
• r_maj and r_min are bounded at empirically determined values.
– MPGA 2 – Crystal Detection
• Only run inside the ROI.
• Heavy distance penalties only for far-away points, allowing the ellipse shape to be more “flexible”.
• Looks for r_maj far from r_min (more elongated ellipses, closer to crystals).
• r_maj and r_min are bounded by no more than ½ of the ROI’s r_maj and r_min.
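The difference between the two passes can be sketched as two configurations of one shape term. Everything here is hypothetical (the `PassConfig` type, the ratio-based score, and the numeric bounds), since the slide only states that the bounds were determined empirically; the distance-penalty differences are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassConfig:
    name: str
    r_maj_max: float       # upper bound on the semi-major axis
    r_min_max: float       # upper bound on the semi-minor axis
    prefer_circular: bool  # True: reward r_maj ~ r_min; False: reward elongation


def shape_score(cfg, r_maj, r_min):
    """Score in [0, 1]; 0 if the candidate violates the radius bounds."""
    if r_min <= 0 or r_min > r_maj or r_maj > cfg.r_maj_max or r_min > cfg.r_min_max:
        return 0.0
    ratio = r_min / r_maj  # 1.0 for a circle, near 0 for a needle
    return ratio if cfg.prefer_circular else 1.0 - ratio


def crystal_pass_from_roi(roi_r_maj, roi_r_min):
    """Second-pass bounds: at most half of the ROI's radii, as stated on the slide."""
    return PassConfig("MPGA 2", roi_r_maj / 2, roi_r_min / 2, prefer_circular=False)


# Hypothetical first-pass bounds (pixels).
roi_pass = PassConfig("MPGA 1", r_maj_max=300.0, r_min_max=300.0, prefer_circular=True)
crystal_pass = crystal_pass_from_roi(roi_r_maj=200.0, roi_r_min=180.0)

print(shape_score(roi_pass, 100, 100), shape_score(crystal_pass, 80, 20))
```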
Crystal Recognition Code Execution Speed
• Pre-Processing: 12 s
– Background Normalization
– GA ROI Detection
– Laplacian Pyramidal Decomposition
• Feature Extraction: 2 s
– Mean, Standard Deviation
– Skewness, Kurtosis
– Energy, Entropy
– Area*, Linearity*
• Network Classifier: 0.5 s
– Feed-forward Network (log-sig)
• Total: 14.5 s
* Not scale invariant; computed on the original scale
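The six statistical features can be sketched for one block of values, such as one level of the pyramid. The definitions are assumptions, since the slides do not give them: population moments, energy as the mean squared value, and entropy as the Shannon entropy of the value histogram in bits.

```python
import math
from collections import Counter


def texture_features(values):
    """First-order statistics of a flat list of pixel (or sub-band) values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    if std > 0:
        skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
        kurt = sum((v - mean) ** 4 for v in values) / (n * std ** 4)
    else:
        skew = kurt = 0.0  # constant block: higher moments are undefined
    energy = sum(v * v for v in values) / n
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "mean": mean, "std": std, "skewness": skew,
        "kurtosis": kurt, "energy": energy, "entropy": entropy,
    }


print(texture_features([0, 0, 255, 255]))
```

Computing the same six numbers at every pyramid level would give one fixed-length feature vector per image for the classifier.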
Performance for current algorithm
• Performance metrics derived using 10% randomized holdout averaged over 3 iterations.
• Current false negative rate ~ 10%.
– Working to reduce this to below 5% at minimum before putting the system into production.*
– Current false negatives are total misses, so they cannot be corrected through thresholding. There is also no intuitive visual correlation.
• Current true negative rate ~ 99%.
* Conversations with John Hunt
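The evaluation protocol (10% randomized holdout, averaged over 3 iterations) can be sketched as below. `train_fn` is a hypothetical callback that takes training samples and labels and returns a prediction function; returning 0.0 for a rate whose denominator is empty is a convention chosen here.

```python
import random


def rates(y_true, y_pred):
    """Return (false negative rate, true negative rate) for boolean labels."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fnr = fn / (fn + tp) if fn + tp else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return fnr, tnr


def repeated_holdout(samples, labels, train_fn, holdout=0.10, iterations=3, seed=0):
    """Average (FNR, TNR) over several random train/holdout splits."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    n_test = max(1, round(holdout * len(idx)))
    fnr_sum = tnr_sum = 0.0
    for _ in range(iterations):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        predict = train_fn([samples[i] for i in train], [labels[i] for i in train])
        preds = [predict(samples[i]) for i in test]
        fnr, tnr = rates([labels[i] for i in test], preds)
        fnr_sum += fnr
        tnr_sum += tnr
    return fnr_sum / iterations, tnr_sum / iterations


print(rates([True, True, False, False], [True, False, False, False]))
```

Here a false negative is a crystal image classified as a non-crystal, which is the costly error for screening.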
Agenda
• Introduction
• Current Algorithm
• Future Direction
Future Directions
Bishop, C. Neural Networks for Pattern Recognition.
• Incremental neural network training has been implemented in Matlab.
– Allows us to learn new crystal shapes and precipitate, with a negligible performance hit.
• Porting the simulation portion of the network classifier to C++.
– The current program consists of:
  – Preprocessing done in C++ inside the IT++ framework
  – Neural network toolbox in Matlab
• Currently working on making new training data sets.
– Selectively biasing the training data set in order to increase accuracy.
• Expansion of feature sets in order to improve false negative rates.
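To illustrate what incremental training of a log-sigmoid feed-forward network can look like, here is a minimal sketch. It is not the Matlab toolbox code: the single hidden layer, squared-error loss, learning rate and class name are all assumptions.

```python
import math
import random


def logsig(x):
    """Log-sigmoid activation, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyLogsigNet:
    """One hidden layer, one output unit, log-sigmoid activations throughout."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each row holds the input weights of one hidden unit plus a trailing bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _forward(self, x):
        # zip() stops at len(x), so the trailing bias is added separately.
        hidden = [logsig(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w1]
        out = logsig(sum(w * h for w, h in zip(self.w2, hidden)) + self.w2[-1])
        return hidden, out

    def predict(self, x):
        return self._forward(x)[1]

    def update(self, x, target, lr=0.5):
        """One online gradient step on squared error for a single new example."""
        hidden, out = self._forward(x)
        d_out = (out - target) * out * (1.0 - out)
        d_hidden = [d_out * self.w2[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w2[j] -= lr * d_out * h
        self.w2[-1] -= lr * d_out
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x):
                row[i] -= lr * d_hidden[j] * v
            row[-1] -= lr * d_hidden[j]


net = TinyLogsigNet(n_in=2, n_hidden=3)
for _ in range(200):
    net.update([1.0, 0.0], target=1.0)  # fold in a new labelled example
print(net.predict([1.0, 0.0]))
```

Because `update` touches only the existing weights, newly labelled images can be folded in without retraining from scratch.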
Acknowledgements
• This project is part of the Northeast Structural Genomics Consortium (NESG), sponsored by the NIH to evaluate the feasibility, costs, economies of scale, and value of structural genomics.
• Protein crystal images acquired from Hauptman-Woodward Medical Research Institute, Buffalo, NY.