20150522_example_pydata_use-cases_in_astronomy_research
A research group’s use-cases for PyData tools
Samuel Harrold, Astrophysics PhD Student, UT Austin
2015-05-22 @ Continuum Analytics, Austin, TX
Motivation
● In 2011:
  ○ Research group mostly used bash scripts, awk, Fortran, IDL, IRAF.
  ○ Pipeline was tightly coupled with old computers, cameras, camera software.
● Goals for new computers and camera:
  ○ Make pipeline loosely coupled, cross-platform.
  ○ Develop skills for non-academic job market.
● Led research group in adopting Python tools.
Summary
● Conflict of interest: Engineering vs publishing papers.
● To adopt best practices from industry, science needs more tools that lower the entry barrier.
  ○ Example: It’s hard to mine your data if you don’t know how to create a database.
Outline
● Motivation
● Use-cases
● FAQ from researchers
Use of some PyData tools
● Anaconda: Environment management.
● IPython Notebooks: Copy-paste code share.
● scikit-image: Detecting stars.
● pandas: Data organization.
● statsmodels, emcee: Robust statistics.
● astropy, astroML: Astronomy-specific.
Use-case: Star brightness vs time
● “Time-series photometry.”
● Objective:
  ○ Extract relative brightness of stars from images during acquisition.
https://github.com/ccd-utexas/tsphot
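As a rough illustration of what “relative brightness” means here, the sketch below divides a target star's counts by a comparison star's counts in each frame and normalizes the result to a mean of 1, so that changes common to both stars (for example, a passing cloud) cancel out. This is a generic differential-photometry sketch and not the tsphot implementation; the function name and the count values are hypothetical.

```python
from statistics import mean


def relative_brightness(target_counts, comparison_counts):
    """Return the target/comparison ratio per frame, normalized to mean 1.

    Both arguments are sequences of sky-subtracted counts, one value per image.
    Hypothetical helper for illustration; not taken from tsphot.
    """
    if len(target_counts) != len(comparison_counts):
        raise ValueError("need one comparison measurement per target measurement")
    # Dividing by the comparison star removes brightness changes shared by both stars.
    ratios = [t / c for t, c in zip(target_counts, comparison_counts)]
    average = mean(ratios)
    return [r / average for r in ratios]


# Made-up counts for four frames; a cloud dims both stars in the third frame.
target = [1000.0, 1100.0, 500.0, 900.0]
comparison = [2000.0, 2000.0, 1000.0, 2000.0]
light_curve = relative_brightness(target, comparison)
```

In the third frame both stars lose half their counts, yet the ratio is unchanged, which is why a comparison star is useful during acquisition.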
Use-case: Star brightness vs time
● Status:
  ○ Developed to be good enough for internal use, but not made robust for distribution.
  ○ Conflict of interest: engineering vs publishing papers
Use-case: Data mining platform
● Objective:
  ○ Predict which unobserved white dwarf stars pulsate.
    ■ What stars are there? From catalogs.
    ■ Which stars are published (non)pulsators? From papers.
    ■ Which stars are unpublished (non)pulsators? From our data.
http://www.slideshare.net/SamuelHarrold/20140409-harrold-dataminingdemostellarseminar
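The three sources above map naturally onto database tables. A minimal sketch, assuming an in-memory SQLite database from Python's standard library in place of the MySQL database the group used; the table names, column names, and star identifiers are all hypothetical placeholders, not the group's schema or data.

```python
import sqlite3

# In-memory database: nothing to install or maintain (assumption; the group used MySQL).
conn = sqlite3.connect(":memory:")

# One table per source named on the slide (hypothetical schema).
conn.execute("CREATE TABLE catalog (star_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE published (star_id TEXT PRIMARY KEY, pulsates INTEGER)")
conn.execute("CREATE TABLE unpublished (star_id TEXT PRIMARY KEY, pulsates INTEGER)")

# Placeholder rows; real entries would come from catalogs, papers, and observations.
conn.executemany("INSERT INTO catalog VALUES (?)",
                 [("WD-A",), ("WD-B",), ("WD-C",), ("WD-D",)])
conn.executemany("INSERT INTO published VALUES (?, ?)", [("WD-A", 1)])
conn.executemany("INSERT INTO unpublished VALUES (?, ?)", [("WD-B", 0)])


def unobserved_stars(connection):
    """Return catalog stars with no published or unpublished pulsation status."""
    rows = connection.execute("""
        SELECT c.star_id
        FROM catalog AS c
        LEFT JOIN published AS p ON p.star_id = c.star_id
        LEFT JOIN unpublished AS u ON u.star_id = c.star_id
        WHERE p.star_id IS NULL AND u.star_id IS NULL
        ORDER BY c.star_id
    """)
    return [row[0] for row in rows]


candidates = unobserved_stars(conn)
```

The resulting list holds the stars for which a pulsation prediction would still be needed; the prediction step itself is not sketched here.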
Use-case: Data mining platform
● Status:
  ○ Shut down due to under-use.
    ■ Users preferred grep + Excel rather than pandas.
    ■ Users didn’t want to maintain MySQL database.
  ○ Conflict of interest: engineering vs publishing papers
Use-case: Reproducible research
● Objective:
  ○ Compute the physical quantities of a binary star system from time-series photometry.
https://github.com/stharrold/Harrold_2015_SDSSJ1600; https://pypi.python.org/pypi/binstarsolver
Use-case: Reproducible research
● Status:
  ○ Citable code on GitHub with DOI from zenodo.org.
  ○ Distributable code published to PyPI.
  ○ Conflict of interest: engineering vs publishing papers
FAQ from researchers
● Questions:
  ○ “Why don’t you use ___?”
  ○ “How does this help publish more papers?”
  ○ “Why should I learn another language?”
● Answers:
  ○ “Look how quickly I can do ___.”
  ○ Examples justify taking time to learn new skills.
  ○ NSF Data Management and Sharing requirements: https://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp
  ○ TIOBE code popularity index: http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
  ○ Jake VanderPlas’s blog post on data science and academia: https://jakevdp.github.io/blog/2014/08/22/hacking-academia/