IT Summit, June 4, 2015
TRANSCRIPT
Parallelization of Large-Scale Image Processing Workflows to Unravel Neuronal Networks
Research Computing
Kristina Holton
Lingsheng Dong, M.D., M.S.
Research Computing, Harvard Medical School
Orchestra HPC
• Wiki page: https://wiki.med.harvard.edu/Orchestra
• Tech specs:
 – 476 compute nodes
 – 5,128 total cores
 – 10 GigE interconnect
 – 37.7 TB RAM
• Debian Linux
• LSF scheduler
• 20 PB total storage
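Jobs on Orchestra are submitted through the LSF scheduler. A minimal sketch of a submission is below; the queue name and script name are hypothetical, and `bsub` is stubbed with a shell function that only echoes the submission so the sketch runs off-cluster (on Orchestra you would delete the stub and use the real `bsub`).

```shell
#!/bin/sh
# Sketch of submitting one job to an LSF scheduler such as Orchestra's.
# The queue "short" and the script name are assumptions, not real values.
bsub() { echo "submitted: $*"; }   # stub so this sketch runs anywhere

# -q selects a queue; -J names the job so later jobs can depend on it.
bsub -q short -J montage_s001 ./montage_one_section.sh
```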
Research Computing Consultants
• Meet with HMS user community to discuss projects and computational needs
• Consult on analyses from statistics and algorithms to implementation
• Develop scripts and pipelines to address users’ needs
• Provide outreach and education
Use Case
• Investigator: Wei-Chung Allen Lee, from R. Clay Reid’s lab in HMS Neurobiology
• Drosophila: map olfactory neurons
• Custom serial EM (electron microscopy) images: process, reconstruct, and trace a 3D image
• 154 TB of raw data stitched to 54 TB
• Develop a highly efficient, parallelized pipeline with error checking and logging
Key Question in Neuroscience
• How is information processed in neural circuits?
• A neuron’s function is fundamentally dependent on how it is connected within its network
• Reconstruction of neural networks enables analysis of network connectivity
Previous Work: Mouse Visual Cortex
• Mouse shown a bar in differing orientations
• Visual cortex neurons preferentially stimulated
• Map stimulated neurons to their neural network: structure and function (inhibitory vs. excitatory)
• First use of in vivo two-photon calcium imaging to identify neurons’ orientation preferences; EM to trace connectivity
[Figure caption excerpt: b, schematic representation of diverse input to inhibitory interneurons (colored white). c, in vivo two-photon fluorescence image of the 3-D anatomical volume (red: blood vessels or astrocytes; green: OGB somata or YFP apical dendrites), separated to expose the functionally imaged plane. Scale bar, 100 μm.]
Functional characterization of neurons: before
Nature. 2011 Mar 10; 471(7337): 177–182.
Original Workflow
• Highly serial: steps carried out in sequential order
• Manual: each step has to be submitted separately by hand (time-consuming)
• Prone to systemic failure (downstream jobs fail)
• Debugging failed steps is very difficult
• Record keeping is confusing (a shared Google Doc)
Original Image Processing Workflow
Given a list with hundreds of sections, do:
  linksections: create links for the raw data
  montage: connect neighboring sections together
  genmasks: generate a mask file for each section
  selectframes: find the edges of each section
  … (20+ additional steps)
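The original serial workflow above can be sketched as one loop that runs every step for every section strictly in order. The `echo` commands are placeholders for the real tools, and the three-section list is hypothetical.

```shell
#!/bin/sh
# Sketch of the fully serial original workflow: each step waits for the
# previous one, for every section, one after another.
run_serial() {
  for section in s001 s002 s003; do   # hypothetical section list
    echo "linksections $section"      # create links for the raw data
    echo "montage $section"           # connect neighboring sections
    echo "genmasks $section"          # generate a mask per section
    echo "selectframes $section"      # find section edges
    # ...20+ further steps, each blocking on the previous one
  done
}
run_serial
```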
Improved workflow
• Highly parallel: most steps run in parallel
• Automatic: all steps are queued together by a single command
• Fault tolerant: if any step fails, the script automatically kills its downstream jobs instead of letting them error out
• Checkpoints: the workflow can resume from any step
• Email notifications make troubleshooting friendlier
• A MySQL database makes record keeping and checkpointing easy
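The fault-tolerance idea can be sketched as follows: each step records its status, and a downstream step is skipped rather than run and error out when its upstream step failed. Shell variables stand in for the MySQL status records, and the step names and commands here are stubs, not the real pipeline.

```shell
#!/bin/sh
# Sketch of fault tolerance: a step only runs if its upstream step
# reported "ok"; otherwise it is skipped and marked as such.
run_step() {                 # run_step <name> <upstream_status> <command...>
  name=$1; upstream=$2; shift 2
  if [ "$upstream" != "ok" ]; then
    echo "$name: skipped (upstream failed)"
    eval "${name}_status=skipped"
    return 0
  fi
  if "$@"; then
    echo "$name: ok"
    eval "${name}_status=ok"
  else
    echo "$name: failed"
    eval "${name}_status=failed"
  fi
}

run_step montage ok false                                  # "false" simulates a failing step
run_step selectframes "$montage_status" echo "selecting"   # skipped, not errored
```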
[Workflow diagram: given a list with hundreds of sections, linksections runs once, then montage, selectframes, … run as one parallel job per section]
Parallelize as much as we can
• The linksections step only creates links to the raw data files, which is very fast
• The montage step can process sections independently, so we submit one job per section
• The selectframes step can likewise process sections independently, one job per section
• And so on.
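The per-section parallelism above can be sketched with background processes standing in for separate `bsub` submissions; on the cluster each iteration would be its own LSF job. Section names are hypothetical.

```shell
#!/bin/sh
# Sketch of one-job-per-section parallelism for an independent step
# like montage. "&" launches each section concurrently.
workdir=$(mktemp -d)
for section in s001 s002 s003; do   # hypothetical section list
  ( echo "montage $section" > "$workdir/montage_$section.log" ) &   # one job per section
done
wait   # barrier: continue only after every section's job has finished
```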
Automation?
• A list of steps in a Google Doc
• Every step performed by hand
• Manual labor required to run and QC each step
• Failures manually re-run
• Goal: an automatic workflow
We need a tool!
For each section in a section list, do:
  Run step 1 as a job and log it into the database
  Run step 2 as a job and log it into the database, if step 1 finished successfully
  Run step 3 as a job and log it into the database, if step 2 finished successfully
  Run step 4 as a job and log it into the database, if step 3 finished successfully
  …
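The tool's core loop above can be sketched as a helper that runs each step, logs its outcome, and only starts a step when the previous one succeeded. A flat log file stands in for the MySQL table used by the real pipeline, and the step commands are stubs.

```shell
#!/bin/sh
# Sketch of "run step as a job, log it, gate the next step on success".
LOG=$(mktemp)
log_and_run() {              # log_and_run <step-name> <command...>
  step=$1; shift
  if "$@"; then
    echo "$step done" >> "$LOG"
  else
    echo "$step failed" >> "$LOG"
    return 1
  fi
}

log_and_run step1 true &&
  log_and_run step2 true &&
  log_and_run step3 false &&
  log_and_run step4 true ||
  echo "pipeline stopped at a failed step"   # step4 never ran
```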
General idea
For each section in a section list, do:
  #@1,0,linksections    # step 1 (linksections) depends on nothing; submit its job to the cluster
  #@2,1,montage         # step 2 (montage) depends on step 1; submit its job to the cluster
  #@3,2,selectframes    # step 3 (selectframes) depends on step 2; submit its job to the cluster
  …
  #@10,6.7,patch_cmaps  # step 10 (patch_cmaps) depends on steps 6 and 7; submit its job to the cluster
  …
Automation!
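A minimal sketch of reading the `#@stepID,dependencyIDs,stepName` annotations is below. It only extracts the three fields; the real tool would turn each annotation into an LSF submission whose `-w` option encodes the dependencies. The function name is an illustration, not the pipeline's actual code.

```shell
#!/bin/sh
# Sketch of parsing "#@id,deps,name" annotation lines from a script.
parse_annotations() {
  grep '^#@' | while IFS=, read -r id deps name; do
    id=${id#"#@"}                       # strip the "#@" marker
    echo "step=$name id=$id depends_on=$deps"
  done
}

printf '%s\n' '#@1,0,linksections' '#@10,6.7,patch_cmaps' | parse_annotations
```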
Troubleshooting made easy
[Workflow diagrams: per-section parallel steps (linksections, then montage, selectframes, … per section); a failed step stops only its own downstream jobs]
Log everything in MySQL
Easy to re-run
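The checkpoint/re-run idea can be sketched as follows: a completed step leaves a marker, and a later run skips any step whose marker already exists, resuming where the previous run stopped. Marker files stand in for the MySQL status records; the step name and command are stubs.

```shell
#!/bin/sh
# Sketch of checkpointed re-runs: skip steps that already completed.
CKPT=$(mktemp -d)
checkpointed() {             # checkpointed <step-name> <command...>
  step=$1; shift
  if [ -e "$CKPT/$step.done" ]; then
    echo "$step: already done, skipping"
  else
    "$@" && touch "$CKPT/$step.done"   # mark success for future runs
  fi
}

checkpointed montage echo "running montage"   # first run: executes
checkpointed montage echo "running montage"   # second run: skipped
```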
#!/bin/sh
for i in `ls -d folder*`; do
  cd $i
  for j in `ls -d sample*`; do
    cd $j
    for l in `ls -f *.txt`; do
      #step1
      cp $l $l.copy1
    done
    #step2
    cat *.copy1 > $j.copy1.copy2
    cd ..
  done
  #step3
  cat */*copy2 > $i.copy3
  cd ..
done
Input files:
folder1/sample1: c1.s1.f1.txt, c1.s1.f2.txt
folder1/sample2: c1.s2.f1.txt, c1.s2.f2.txt
folder2/sample1: c2.s1.f1.txt, c2.s1.f2.txt
folder2/sample2: c2.s2.f1.txt, c2.s2.f2.txt
Generalized Applications
#!/bin/sh
#loop,i
for i in `ls -d folder*`; do
  cd $i
  #loop,j
  for j in `ls -d sample*`; do
    cd $j
    #loop,l
    for l in `ls -f *.txt`; do
      #@1,0,copy1
      cp $l $l.copy1
    done
    #@2,1,copy2
    cat *.copy1 > $j.copy1.copy2
    cd ..
  done
  #@3,2,copy3
  cat */*copy2 > $i.copy3
  cd ..
done
#!/bin/sh
for i in `ls -d folder*`; do
  cd $i
  for j in `ls -d sample*`; do
    cd $j
    #loop,l
    for l in `ls -f *.txt`; do
      #@1,0,copy1
      bsub -q mini cp $l $l.copy1
    done
    #@2,1,copy2
    bsub -q mini -w 'done("copy1")' cat *.copy1 > $j.copy1.copy2
    cd ..
  done
  #@3,2,copy3
  bsub -q mini -w 'done("copy2")' cat */*copy2 > $i.copy3
  cd ..
done
Movie Time
1) Rotating rendering of reconstructed cortical neurons and postsynaptic partners (Movie 1)
2) Manual reconstruction demonstration, generating the rendering in (1) (Movie 2)