copyright © software carpentry 2011 this work is licensed under the creative commons attribution...

Copyright Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. Data Management Slide 2 1 How do we manage multiple versions of data files? Slide 3 Data Management 2 data samples.mat So far, so good Slide 4 Data Management 3 data samples.mat new version Now what? Slide 5 Data Management 4 data samples.old.mat samples.mat I guess this is alright? Slide 6 Data Management 5 data samples.old.mat samples.old2.mat samples.mat Okay Slide 7 Data Management 6 One problem: #!/usr/bin/env bash cat data/samples.mat \ | tr ^M \n \ | sort -k 2,2nr \ |... > results/samples.txt generate_results.sh Which version of samples.mat was used to generate our results? Slide 8 Data Management 7 In general, dont move or rename files unless absolutely necessary. -makes it harder to reproduce results -makes it harder to find the data later -breaks scripts -breaks symbolic links (ln -s) Slide 9 Data Management 8 But adding new filenames without structure isnt necessarily any better Slide 10 Data Management 9 data samples.mat samplesV2.mat samplesFinal.mat samplesFinalV2.mat samplesUSE_THIS_ONE.mat Which one is the most recent? Slide 11 Data Management 10 You could 1)look at the data and guess 2)check the last-modified date And dont ever move or copy the data! Slide 12 Data Management 11 sounds like a job for Slide 13 Data Management 12 sounds like a job for Slide 14 Data Management 13 sounds like a job for or is it? Slide 15 Data Management 14 Subversion is good for: -simultaneous modification -archive -merge -branch -diff Why shouldnt we just use Subversion? Slide 16 Data Management 15 Subversion is good for: -simultaneous modification -archive -merge -branch -diff Why shouldnt we just use Subversion? what makes sense for data? Slide 17 Data Management 16 Why shouldnt we just use Subversion? Subversion is good for: -simultaneous modification -archive -merge -branch -diff what makes sense for data? Slide 18 Data Management 17 What we need a clear, stable, accessible way to archive data Slide 19 Data Management 18 Performance reasons for not using version control on data -We can no longer remove or compress old and obsolete data files. -Checking out the repository means copying all your data files. -Many version control tools store text-based differences between versions of a file. For binary files, they essentially store a new copy of the file for each version. Slide 20 Data Management 19 Common approach: data samples.mat old old2 Archive old data into subdirectories: -maintains file names -keeps all data files in same version together -still moves files, so scripts and history now refer to different data Slide 21 Data Management 20 Another common approach: data samples.mat final final2 samples.mat really-final Slide 22 Data Management 21 Two big problems still: 1)What order do they go in? The naming can easily become ambiguous. 2)Hard to distinguish top-level data directories from archived or revised data directories. Is data/saved/ an old version of all the data or a current version of saved data? Both are easily solved with a little thinking ahead Slide 23 Data Management 22 From the start: data samples.mat 2010-09-28 2011-03-31 yyyy-mm-dd is a nice date format since: -is easy to read (easier than yyyymmdd) -the alphabetic ordering is also chronological Slide 24 Data Management 23 Were still missing something big -We give a collaborator access to the data folder -A new student joins the project -Its three years later and weve forgotten the details of the project Slide 25 Data Management 24 Were still missing something big -We give a collaborator access to the data folder -A new student joins the project -Its three years later and weve forgotten the details of the project We need context. We need metadata. Slide 26 Data Management 25 Metadata Header inside the file? (see Provenance essay) -Doesnt work well with binary files Separate metadata file for each data file? -Can quickly get out of sync or out of hand -who is the data from? -when was it generated? -what were the experiment conditions? Slide 27 Data Management 26 How I view metadata A list of attributes that I might want to search by later Slide 28 Data Management 27 data samples.mat 2010-09-28 2011-03-31 README 2010-09-28 - batch1 - 25*C - 5 hrs Initial QC: samples 1328, 1422 poor quality 2011-03-31 - batch1 - 28*C - 10 hrs Initial QC: okay, but 1 min power failure @ t=2.5 README Slide 29 Data Management 28 data README 2010-09-28 2011-03-31 README Additional READMEs can be necessary when you have many data files per directory Many data files Slide 30 Data Management 29 data README 2010-09-28 2011-03-31 README Under version control Many data files * * * * Slide 31 Data Management 30 data README 2010-09-28 2011-03-31 README Under version control Many data files * * * * delete for space Slide 32 Data Management 31 data README 2010-09-28 2011-03-31 README Under version control Many data files * * * * README should describe how to exactly reacquire data, if possible. Slide 33 Data Management 32 data README * yyyy-mm- dd Many data files * Under version control * The big picture project Slide 34 Data Management 33 data README * yyyy-mm- dd Many data files * Under version control * The big picture src * scripts, source code, etc * project Slide 35 Data Management 34 data README * yyyy-mm- dd Many data files * Under version control * src * scripts, source code, etc * experiments results from running scripts on data yyyy-mm- dd project The big picture Slide 36 Data Management 35 Why all the fuss? Sensible archiving: -make it clear and obvious for someone else -someone else will likely be you, several months or years from now Slide 37 Data Management 36 Why all the fuss? targets filenameExactlyAsDownloaded.tkq raw targets.gff Example: How exactly did I create targets.gff? This would be very difficult without additional information. Slide 38 Data Management 37 Why all the fuss? The directory structure: targets filenameExactlyAsDownloaded.tkq otherNecessaryFile.dat raw README process_targets.sh targets.gff README - when and where from files were downloaded process_targets.sh - exact commands used to generate targets.gff from downloaded files Slide 39 Data Management 38 Why all the fuss? The directory structure: targets filenameExactlyAsDownloaded.tkq otherNecessaryFile.dat raw README process_targets.sh targets.gff This made it both 1) possible and 2) easy to know exactly where the data came from and how it was processed. Slide 40 Data Management 39 The right layout for the right project -No hierarchy is perfect for all projects. -Thinking hard at the beginning can save pain and suffering later on -For example Slide 41 Data Management 40 The right layout for the right project What weve talked about so far is great for projects with a strong separation between data and analysis: experiment0001 experiment0002 experiment0003 someOtherData analysis0001 analysis0002 analysis0003 analysis0004 analysis0005 Slide 42 Data Management 41 The right layout for the right project But it doesnt work as well for pipelines: parse align filter cluster Slide 43 Data Management 42 The right layout for the right project One representation I have seen is: data align cluster 2010-09-28 filter parse align cluster 2011-03-31 filter parse Slide 44 Data Management 43 The right layout for the right project But: 1)I doesnt make the pipeline ordering clear 2)It doesnt scale well with alternatives: parse align01 filter01 cluster filter02 align02 filter01 cluster Slide 45 Data Management 44 The right layout for the right project A better approach captures the dependency structure of the pipeline: parse2010-09-28 samples.dat align01 samples.dat align02 samples.dat filter01 samples.dat Slide 46 Data Management 45 -Think hard at the beginning -Version control metadata, not data files -An intelligent structure not only makes your data easier to archive, track, and manage, but also reduces the chance that the paths in the pipeline get crossed and the data out the end isn't what you think it is. Summary Slide 47 Copyright Software Carpentry 2011 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. Orion Buske May 2011 created by

copyright © software carpentry 2011 this work is licensed under the creative commons attribution...

Documents