
Leveraging Genetic Algorithm and Neural Networks in Automated Protein Crystal Recognition

Ming Jack Po and Andrew Laine

Department of Biomedical Engineering
Columbia University
New York, NY USA

August 22nd 2008
IEEE EMBS Annual Conference 2008, Vancouver, Canada


Agenda

• Introduction

• Current Algorithm

• Future Direction

Protein Structure Determination currently relies on X-ray crystallography

• The production of protein crystals is crucial to protein structure determination via X-ray crystallography.

• In 2000, the US National Institute of General Medical Sciences of the National Institutes of Health funded the Protein Structure Initiative (PSI), a ten-year project to uncover the three-dimensional shapes of a wide range of proteins (1).

• Unfortunately, there is currently no reliable methodology to predict environments that would lead to protein crystallization.

– High-throughput (HTP) experiments with varying crystallization parameters are being performed in order to “brute force” the problem.


1) http://www.nature.com/nmeth/journal/v5/n2/full/nmeth0208-203.html


HTP Protein Crystallization Screening is currently the bottleneck in protein crystal discovery

• An extensive backlog of images has accumulated.
– 1536 wells/plate × 5K plates × 6 time points ≈ 46M images*

• Manual inspection of images from HTP experiments is not practical.
– Qualified and trained crystallographers are in short supply.
– Crystallographers cannot keep up with the speed of the robotic systems used in production experiments.

• Automated protein crystallization screening is needed to tackle both the existing image backlog and future images.

* Feb 2002 to October 2006 only
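The backlog estimate above is a simple product of three numbers from the slide. A minimal sketch of that arithmetic follows; the constant and function names are illustrative and do not come from the original work.

```python
# Figures taken from the slide; the names are hypothetical.
WELLS_PER_PLATE = 1536
PLATES = 5_000
TIME_POINTS = 6


def backlog_images(wells=WELLS_PER_PLATE, plates=PLATES, time_points=TIME_POINTS):
    """Total images = wells per plate x plates x imaging time points."""
    return wells * plates * time_points


print(f"{backlog_images():,} images")  # 46,080,000 images
```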


Several key challenges have to be overcome for automated protein crystal recognition

• Arbitrary geometric orientation and structure of crystals

• Presence of organic matter

• Non-uniform lighting conditions

• Irregular droplet boundaries and size.

Hits

Our Solution to the problem – Neural Networks

• Advantages
– Allows for incremental learning
– Can deal with the seemingly arbitrary geometric orientation and structure of crystals
– Fast classification speed once the neural net has been trained

• Disadvantages
– Black-box methodology
– Identification of a good feature set is necessary for good performance
– Needs a sufficiently large training set to be robust

Training database has been compiled by HWI expert crystallographers

[Example images, one per category: Crystal; Precipitate; Precipitate & Skin; Clear; Skin; Phase Separation; Precipitate & Crystal; Unsure; Phase Separation & Precipitate; Phase Separation & Crystal]

• Dr. George DeTitta et al. at HWI (Buffalo, NY) have compiled a data set of 73,632 manually classified images.
– 3 independent crystallographers each categorized 75,000 images into one of the above categories.
– 73,632 of these images have consensus between at least two crystallographers. Only these images were used for validation and training.
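The "consensus between at least two crystallographers" rule can be sketched as a majority filter. This is an illustration only: the function names and the dictionary layout of the annotations are assumptions, not the format used by HWI.

```python
from collections import Counter


def consensus_label(labels, min_agree=2):
    """Return the label chosen by at least `min_agree` annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


def filter_consensus(annotations, min_agree=2):
    """Keep only images whose annotators reach consensus.

    `annotations` maps a (hypothetical) image id to the list of labels
    assigned by the independent crystallographers.
    """
    kept = {}
    for image_id, labels in annotations.items():
        label = consensus_label(labels, min_agree)
        if label is not None:
            kept[image_id] = label
    return kept
```

For example, an image labelled `["Crystal", "Crystal", "Clear"]` is kept as `"Crystal"`, while one labelled `["Crystal", "Skin", "Clear"]` is dropped.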


Agenda

• Introduction

• Current Algorithm

• Future Direction



Pre-Processing Steps

• Images are converted to Sobel edge maps, and isolated single edge points are removed.

• A multi-population genetic algorithm (MPGA) is run on the image to find an ellipsoidal region of interest (ROI), elaborated upon in the next few slides.
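The first pre-processing step can be sketched in plain Python as a Sobel gradient threshold followed by removal of edge pixels with no edge neighbours. The threshold value, the 8-neighbour definition of "single edge point", and the function names are assumptions; the slides do not specify them.

```python
def sobel_edges(img, threshold):
    """Binary edge map from the Sobel gradient magnitude.

    `img` is a list of rows of grey levels; border pixels are left as 0.
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def remove_isolated_points(edges):
    """Clear edge pixels that have no edge pixel among their 8 neighbours."""
    h, w = len(edges), len(edges[0])
    out = [row[:] for row in edges]
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            has_neighbour = any(
                edges[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                out[y][x] = 0
    return out
```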


Image Normalization

MPGA 1 – ROI Detection

MPGA 2 – Area of Crystals

Linearity Detector

Laplacian Pyramidal Decomposition

Feature Extraction

Multiple Population Genetic Algorithm


• Randomly select 100 “chromosomes” of 5 points.

• Fitness is based on a similarity metric and a distance metric.

[Similarity and Distance equations not recoverable; surviving symbols: E, d, (x, y), i, j, |i|, |j|, e, 4, #total]

• Evolution proceeds through selection and diversification.
– Optimize for a high fitness score based on a combination of similarity and distance scores.
– Selection eliminates low-fitness populations.
– Diversification is realized through crossover, mutation and clustering.

• Significant speed and accuracy improvements vs. Randomized Hough Transforms.
– Processing time dropped 50% to ~10 seconds for ROI detection.

Yao, J., Kharma, N., and Grogono, P, "A multi-population genetic algorithm for robust and fast ellipse detection", Pattern Analysis & Applications, Volume 8, Issue 1 - 2, Sep 2005, pp. 149-162
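The overall shape of a multi-population search over 5-point chromosomes can be sketched as follows. This is a toy illustration, not the algorithm of Yao et al.: the fitness function is supplied by the caller (the similarity and distance metrics are not reproduced here), a simple ring migration of each population's best chromosome stands in for the clustering step, and all names and default parameters (4 populations of 25, giving 100 chromosomes) are assumptions. It expects at least 5 points and `pop_size` of at least 2.

```python
import random


def evolve_populations(points, fitness, n_pops=4, pop_size=25,
                       generations=30, mutation_rate=0.1, seed=0):
    """Toy multi-population GA; a chromosome is 5 distinct indices into `points`.

    Returns (best_chromosome, best_fitness_per_generation).
    """
    rng = random.Random(seed)
    n = len(points)

    def random_chromosome():
        return tuple(rng.sample(range(n), 5))

    def crossover(a, b):
        # Draw 5 distinct genes from the union of both parents.
        pool = list(set(a) | set(b))
        return tuple(rng.sample(pool, 5))

    def mutate(chromosome):
        genes = list(chromosome)
        if rng.random() < mutation_rate:
            unused = [g for g in range(n) if g not in genes]
            if unused:
                genes[rng.randrange(5)] = rng.choice(unused)
        return tuple(genes)

    def score(chromosome):
        return fitness([points[i] for i in chromosome])

    pops = [[random_chromosome() for _ in range(pop_size)] for _ in range(n_pops)]
    history = []
    for _ in range(generations):
        new_pops = []
        for pop in pops:
            ranked = sorted(pop, key=score, reverse=True)
            survivors = ranked[:max(1, pop_size // 2)]  # selection keeps the best half
            children = [mutate(crossover(rng.choice(survivors), rng.choice(survivors)))
                        for _ in range(pop_size - len(survivors))]
            new_pops.append(survivors + children)
        # Ring migration: each population receives its neighbour's best chromosome.
        bests = [pop[0] for pop in new_pops]
        for k, pop in enumerate(new_pops):
            pop[-1] = bests[(k - 1) % n_pops]
        pops = new_pops
        history.append(max(score(c) for pop in pops for c in pop))
    best = max((c for pop in pops for c in pop), key=score)
    return best, history
```

In an ellipse-detection setting, `points` would be the edge pixels and `fitness` would score the ellipse through the 5 selected points against the edge map.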

Ellipsoidal Geometry

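• The equation of a conic through 5 points is

$$a x^2 + 2hxy + b y^2 + 2gx + 2fy + 1 = 0$$

– This conic is an ellipse iff

$$ab - h^2 > 0$$

• With 5 (x,y) pairs, it is possible to solve for parameters (a,h,b,g,f), and thus in turn solve for the physically related ellipsoidal parameters:

$$x_0 = \frac{hf - bg}{C}, \qquad y_0 = \frac{gh - af}{C}$$

$$r_{maj}^2,\; r_{min}^2 = \frac{-2\Delta}{C\,(a + b \pm R)} \quad \text{(the larger value is } r_{maj}^2\text{)}$$

$$\theta = \frac{1}{2}\arctan\frac{2h}{a - b}$$

where

$$C = ab - h^2, \qquad R^2 = (a - b)^2 + 4h^2, \qquad \Delta = \begin{vmatrix} a & h & g \\ h & b & f \\ g & f & 1 \end{vmatrix}$$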
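The five-point construction can be sketched in plain Python: each point gives one linear equation in (a, h, b, g, f), and the centre, axes and orientation follow from standard conic geometry. The function names are hypothetical, Δ denotes the determinant of the 3×3 matrix [[a, h, g], [h, b, f], [g, f, 1]], and the quadrant handling that makes the returned angle the major-axis direction is an addition of this sketch. Because the constant term is fixed at 1, the sketch assumes the conic does not pass through the origin.

```python
import math


def solve_linear(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ellipse_from_points(points):
    """Fit a x^2 + 2h xy + b y^2 + 2g x + 2f y + 1 = 0 through 5 points.

    Returns (x0, y0, r_maj, r_min, theta) or None if the conic is not a
    real ellipse. theta is the major-axis angle, defined modulo pi.
    """
    A = [[x * x, 2 * x * y, y * y, 2 * x, 2 * y] for x, y in points]
    a, h, b, g, f = solve_linear(A, [-1.0] * 5)
    C = a * b - h * h
    if C <= 0:
        return None  # parabola or hyperbola
    x0, y0 = (h * f - b * g) / C, (g * h - a * f) / C
    # Determinant of [[a, h, g], [h, b, f], [g, f, 1]].
    delta = a * (b - f * f) - h * (h - f * g) + g * (h * f - b * g)
    R = math.hypot(a - b, 2 * h)
    r_sq = [-2 * delta / (C * (a + b + sign * R)) for sign in (1, -1)]
    if min(r_sq) <= 0:
        return None  # no real points
    r_maj, r_min = math.sqrt(max(r_sq)), math.sqrt(min(r_sq))
    # 0.5*atan2 gives the axis of the algebraically larger eigenvalue of
    # [[a, h], [h, b]]; that is the minor axis when a + b > 0.
    theta = 0.5 * math.atan2(2 * h, a - b)
    if a + b > 0:
        theta += math.pi / 2
    return x0, y0, r_maj, r_min, theta
```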

MPGA is run twice with different fitness criteria

– Both runs score candidates with the Similarity and Distance metrics introduced on the Multiple Population Genetic Algorithm slide.

– The multiple population genetic algorithm gives significantly faster and more robust search results than the Randomized Hough Transform.

– MPGA 1 – ROI Detection
• Heavy distance penalties for points that do not line up exactly on the perimeter of the projected ellipse.
• Looks for r_maj close to r_min (more circular shapes – droplets, well).
• r_maj and r_min are bounded at empirically determined values.

– MPGA 2 – Crystal Detection
• Only run inside the ROI.
• Heavy distance penalties only for far-away points, allowing the ellipsoidal shape to be more “flexible”.
• Looks for r_maj far from r_min (more elongated ellipsoids – closer to crystals).
• r_maj and r_min are bounded by no more than ½ of the ROI’s r_maj and r_min.
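The difference between the two runs can be illustrated with a small shape-preference helper. The scoring formula here is purely hypothetical (the slides state only the qualitative preference for circular versus elongated candidates); the half-ROI bound for the second run is taken from the slide.

```python
def axis_ratio(r_maj, r_min):
    """Ratio in (0, 1]; 1.0 is a circle, small values are elongated ellipses."""
    return r_min / r_maj


def shape_score(r_maj, r_min, stage):
    """Hypothetical preference: 'roi' rewards circular candidates,
    'crystal' rewards elongated ones."""
    ratio = axis_ratio(r_maj, r_min)
    if stage == "roi":
        return ratio
    if stage == "crystal":
        return 1.0 - ratio
    raise ValueError(f"unknown stage: {stage!r}")


def crystal_axis_bounds(roi_r_maj, roi_r_min):
    """Upper bounds for second-run candidates: half of the ROI's axes."""
    return 0.5 * roi_r_maj, 0.5 * roi_r_min


def is_admissible(r_maj, r_min, max_r_maj, max_r_min):
    """True if the candidate's axes are valid and within the given bounds."""
    return 0 < r_min <= r_maj <= max_r_maj and r_min <= max_r_min
```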


Crystal Recognition Code Execution Speed

Pre-Processing: 12 s
• Background Normalization
• GA ROI Detection
• Laplacian Pyramidal Decomposition

Feature Extraction: 2 s
• Mean, Standard Deviation
• Skewness, Kurtosis
• Energy, Entropy
• Area*, Linearity*

Network Classifier: 0.5 s
• Feed-forward Network (log-sig)

Total: 14.5 s

* Not scale invariant, and done on original scale
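The six statistical features can be sketched for a flat list of grey levels. The slides do not give definitions, so this sketch assumes population moments for mean, standard deviation, skewness and kurtosis, and histogram-based energy (sum of squared bin probabilities) and Shannon entropy in bits; the function name and the 256-level default are also assumptions.

```python
import math


def texture_features(pixels, levels=256):
    """Statistics of a flat list of integer grey levels in [0, levels)."""
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    if std > 0:
        skewness = sum((p - mean) ** 3 for p in pixels) / (n * std ** 3)
        kurtosis = sum((p - mean) ** 4 for p in pixels) / (n * std ** 4)
    else:
        skewness = kurtosis = 0.0  # constant region: higher moments undefined
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    probs = [count / n for count in hist if count]
    energy = sum(q * q for q in probs)
    entropy = -sum(q * math.log2(q) for q in probs)
    return {"mean": mean, "std": std, "skewness": skewness,
            "kurtosis": kurtosis, "energy": energy, "entropy": entropy}
```

In a pyramid-based pipeline the same function would be applied to each decomposition level, which is one way to obtain scale-aware features.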

Performance for current algorithm

• Performance metrics derived using 10% randomized holdout averaged over 3 iterations.

• Current false negative rate ~ 10%.
– Working to reduce the number to below 5% at minimum before putting it into production.*
– Current false negatives are total misses, so they cannot be corrected through thresholding. There is also no intuitive visual correlation.

• Current true negative rate ~ 99%.

* Conversations with John Hunt
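The evaluation protocol (10% randomized holdout, averaged over repeated runs) and the two reported rates can be sketched as below. The function names are hypothetical, and labels are assumed to be booleans with True meaning "crystal".

```python
import random


def holdout_split(items, holdout_fraction=0.10, seed=0):
    """Shuffle and split into (train, test) with a randomized holdout."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def error_rates(y_true, y_pred):
    """Return (false negative rate, true negative rate).

    FNR = FN / (FN + TP); TNR = TN / (TN + FP).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return fnr, tnr


def mean_rates(runs):
    """Average (FNR, TNR) pairs over repeated holdout runs."""
    n = len(runs)
    return sum(r[0] for r in runs) / n, sum(r[1] for r in runs) / n
```

Using a different `seed` for each of the three iterations gives three independent holdout sets whose rates are then averaged with `mean_rates`.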


Agenda

• Introduction

• Current Algorithm

• Future Direction



Future Directions


Bishop, C. Neural Networks for Pattern Recognition.

• Incremental neural network training has been implemented in Matlab.
– Allows us to learn new crystal shapes and precipitate, with a negligible performance hit.

• Porting the simulation portion of the network classifier to C++.
– The current program consists of:
– Preprocessing done in C++ inside the IT++ framework
– Neural network toolbox in Matlab

• Currently working on making new training data sets.
– Selectively biasing the training data set in order to increase accuracy.

• Expansion of feature sets in order to improve false negative rates.
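Incremental training of a log-sigmoid feed-forward network can be illustrated with a single online gradient step per new example. This is a minimal sketch, not the Matlab implementation: the class name, the one-hidden-layer architecture, the squared-error loss and the learning rate are all assumptions.

```python
import math
import random


def logsig(z):
    """Log-sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """One hidden layer, log-sigmoid activations, single output in (0, 1)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # The last weight in each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _forward(self, x):
        xb = list(x) + [1.0]
        hb = [logsig(sum(w * v for w, v in zip(row, xb))) for row in self.w1] + [1.0]
        out = logsig(sum(w * v for w, v in zip(self.w2, hb)))
        return xb, hb, out

    def predict(self, x):
        return self._forward(x)[2]

    def update(self, x, target, lr=0.5):
        """One online gradient step on 0.5 * (out - target)^2 for a new example."""
        xb, hb, out = self._forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights from before this update.
        delta_hidden = [delta_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                        for j in range(len(self.w1))]
        for j in range(len(self.w2)):
            self.w2[j] -= lr * delta_out * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] -= lr * delta_hidden[j] * xb[i]
```

Calling `update` on each newly labelled feature vector adjusts the existing weights without retraining from scratch, which is the property the incremental approach relies on.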


Acknowledgements

• This project is part of the Northeast Structural Genomics Consortium (NESG), sponsored by the NIH for evaluating the feasibility, costs, economies of scale, and value of structural genomics.

• Protein crystal images acquired from Hauptman-Woodward Medical Research Institute, Buffalo, NY.
