hg version control bioinformaticians
A short talk I gave to my group to explain the basics of hg and version control
Giovanni Dall'Olio,IBE (UPF-CEXS)
Introduction to version control and hg for our bioinformatics group
What is hg?
● Programmers use software to keep track of every version of the code they write. Such programs are called Version Control Systems (VCS)
● There are many VCS programs; the best known are CVS, Subversion, Git, hg (Mercurial) and Bazaar
● Git, hg and Bazaar are newer and based on an improved paradigm called Distributed Version Control System (DVCS)
How will hg be useful for us?
● Keep versions of the scripts we create
  ● and also of the datasets, results, etc.
● Have a common, official version of the pipeline and the scripts on bitbucket.org
● Everybody works on their own computer, on their own copy of the scripts; every once in a while, they merge it with the official version
Installing hg
● Hg can run on any operating system
● On Linux, install it through your software center
  ● sudo apt-get install mercurial
● On other OS, go to http://mercurial.selenic.com/ and download the installer
Initial hg configuration
● Hg stores its configuration in a file called:
  ● ~/.hgrc on Unix
  ● C:\Documents and Settings\your_name\.hgrc on Windows
● Open it and write your username:
[ui]
username = Giovanni Dall'Olio <[email protected]>
The basic operations of a VCS
● Creating a repository
  ● Equivalent to 'start keeping track of the versions of the files in this project'
● Adding files to the repository
  ● Files are not tracked unless you say so
● Committing changes
  ● Saving a version of the current state of the files
● Pushing the changes and merging them with the standard version
Creating a repository
● Create a new directory and create the repo with:
  ● hg init
Effect of creating a new repo
● A hidden directory (.hg) will be created
● From now on, it will be possible to run other hg commands in this directory
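Creating a repository and looking at its effect can be sketched as follows. The directory name my_project is arbitrary, and a temporary location is used here only so that the example does not clutter your home directory.

```shell
# Create a scratch project directory (the name is just an example).
repo="$(mktemp -d)/my_project"
mkdir -p "$repo"
cd "$repo"

# Turn the directory into a repository.
hg init

# The hidden .hg directory is where hg keeps the history.
ls -a

# Other hg commands now work here; a fresh repository has nothing to report.
hg status
```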
Adding files to the repo
● By default, no files are added to the repository
  ● This means that if you create a new file in the directory, hg will not track it
Creating a file
Files are not added automatically to the repo
● The command:
  ● hg log file.txt
● should return the history of changes to file.txt. Since the file is not in the repo yet, nothing is shown
hg add
● To add a file to the repository, use hg add
  ● This tells hg to record all the changes made to that file
Committing changes
● The most important operation in VCS is the commit
● This operation saves the state of the tracked files and associates it with a version
● One commit → one version
Committing a change
● We have added the file file.txt to the repo
  ● This is a change compared to the previous version (where this file was not present)
  ● So we have to record it with a commit
Our first commit
Effects of adding a file and committing
● From now on, all the changes made to the file will be tracked
What is being 'committed'?
● Every time you commit a new version, hg stores the set of changes since the previous version
● Some older VCS stored a copy of all the files for each version
  ● => this takes up a lot of disk space
● By storing only the changes, hg uses less space and makes it easier to compare versions
Hg diff
● The hg diff command will show the differences between the file and its last saved version
Hg log
● Hg log will show the history of the changes in the repository
Hg log
The story continues..
● The basic operations in a VCS are adding files to be tracked and committing changes
● Next week we will see how to keep a copy of our repository on a remote server, and how to collaborate with other people
● Now I will show you some examples of using a version control system
Example: backup
● Imagine that, by mistake, you remove a file or a directory from your project
● With a VCS, you can revert to the previous version and get the files back
Example: tracking code
● VCS were developed to track changes in code
  ● Return to the point where you made a mistake or a typo
  ● Implement a parallel version of the code, e.g. to try a different library or approach (branching)
  ● Remember what you were doing when you have to change code written months ago
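Going back to an earlier version and starting a parallel line of development could look like this. It is only a sketch: the branch name try-other-approach, the username and the file contents are invented for the example, and named branches are just one of the ways hg offers to work in parallel.

```shell
# Placeholder identity for this example only.
export HGUSER="Example User <user@example.org>"

# Scratch repository with two versions of a file.
cd "$(mktemp -d)" && hg init
echo "version 1" > file.txt
hg add file.txt
hg commit -m "Add: file.txt"
echo "version 2, with a typo" > file.txt
hg commit -m "Add: second version"

# Go back to the first committed version (revision 0).
hg update -r 0
cat file.txt

# Start a parallel line of development from here, on a named branch.
hg branch try-other-approach
echo "version 1, different approach" > file.txt
hg commit -m "Add: experiment with a different approach"

# Both lines of development are now listed.
hg branches
```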
Example: releasing a software
● Mr. Werewolf publishes a program to predict when the moon will be full
● The code gets adopted by the werewolf community. Papers get published using it
● At some point, another werewolf discovers a bug in the code. With version control, it is possible to find the version where the error was introduced and identify all the versions affected
Example: tracking data
● Version control can be applied to a dataset
  ● Example: Mr Dracula wants to write a paper on the quality of the blood in his neighborhood. Every time he gets new data, he commits a change
Tracking everything else
● VCS can be applied to many kinds of files
  ● Usually they do not handle binary files well
  ● OpenOffice documents can be tracked (they are based on XML)
Tracking huge files
● Hg stores the differences between two versions
● Storing all the 1000g will take:
  ● Some gigabytes to store a compressed version of the files
  ● Less space to store the following commits (but these commits will take time)
● Maybe it is not worth putting gigabytes of data under version control
  ● No solution to date
  ● There are some hg extensions for big files
How frequently should I commit?
● Everybody has their own philosophy
  ● Some people prefer to commit every small change
  ● Others prefer to make a single big commit every day
● As a general rule:
  ● The bigger the commit, the harder it is to integrate if there are conflicts
  ● It's up to you to decide
How to write the perfect commit messages
● One or two sentences
● Avoid generic messages
  ● “new changes”, “fixed bugs”
● Use tags like 'Fix', 'Add', 'Config', etc.:
  ● “Fix: error when reading file”
  ● “Add: new function for plotting results”
● Cite the files changed if you think it may be useful:
  ● Implemented new sorting algorithm for sorting.py