Research reproducibility - Code etc

Theory and Practice of Reproducible Research. OGRS, Perugia, October 12, 2016. Riccardo Rigon, Francesco Serafin, Marialaura Bancheri. Antonio Canova, Le tre grazie.


Page 1: Research reproducibility - Code etc

Theory and Practice of Reproducible Research

OGRS, Perugia, October 12, 2016

Riccardo Rigon, Francesco Serafin, Marialaura Bancheri

Antonio Canova, Le tre grazie

Page 2: Research reproducibility - Code etc

Antonio Canova's plaster statues bear a series of small marks. The stonemasons used them to reproduce the work “industrially”. Art became “reproducible” for the first time.

Rigon & Al.

Canova?

Page 3: Research reproducibility - Code etc

http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/

I have been frustrated often with statisticians and computer scientists who write papers where they develop new methods and seem to demonstrate that those methods blow away all their competitors. But then no software is available to actually test and see if that is true. … In my mind, new methods/analyses without software are just vaporware … If there is no code, there is no paper. By Jeff Leek*

Page 4: Research reproducibility - Code etc

Science must be reproducible (i.e. repeatable)

This is fundamental. It means that everyone (in principle) should be able to take what you wrote, the experiments you did, the mathematics you derived, and do it all again with their own resources.

“In principle” means that science is often not shared … for various reasons …

Why reproducibility?

Page 5: Research reproducibility - Code etc

Not everyone can reproduce scientific achievements

“In principle” means that the person must be trained to do it (there are problems of transmission of information here). In fact, more advanced results can be difficult to grasp, even for the very same authors.

Introduction

Page 6: Research reproducibility - Code etc

Getting more: Replicability

Reproducibility vs Replicability

Page 7: Research reproducibility - Code etc

Analysing a paper for reproducibility (the case of Formetta et al., 2011)

Geosci. Model Dev., 4, 943–955, 2011
www.geosci-model-dev.net/4/943/2011/
doi:10.5194/gmd-4-943-2011
© Author(s) 2011. CC Attribution 3.0 License.

Geoscientific Model Development

The JGrass-NewAge system for forecasting and managing the hydrological budgets at the basin scale: models of flow generation and propagation/routing

G. Formetta (1), R. Mantilla (2), S. Franceschi (3), A. Antonello (3), and R. Rigon (1)

(1) University of Trento, 77 Mesiano St., Trento, 38123, Italy
(2) The University of Iowa, C. Maxwell Stanley Hydraulics Laboratory, Iowa 52242-1585, USA
(3) Hydrologis S.r.l., Bolzano, BZ, Italy

Received: 16 April 2011 – Published in Geosci. Model Dev. Discuss.: 29 April 2011
Revised: 20 September 2011 – Accepted: 31 October 2011 – Published: 4 November 2011

Abstract. This paper presents a discussion of the predictive capacity of the implementation of the semi-distributed hydrological modeling system JGrass-NewAge. This model focuses on the hydrological budgets of medium scale to large scale basins as the product of the processes at the hillslope scale with the interplay of the river network. The part of the modeling system presented here deals with the: (i) estimation of the space-time structure of precipitation, (ii) estimation of runoff production; (iii) aggregation and propagation of flows in channel; (v) estimation of evapotranspiration; (vi) automatic calibration of the discharge with the method of particle swarming.

The system is based on a hillslope-link geometrical partition of the landscape, combining raster and vectorial treatment of hillslope data with vector based tracking of flow in channels. Measured precipitation are spatially interpolated with the use of kriging. Runoff production at each channel link is estimated through a peculiar application of the Hymod model. Routing in channels uses an integrated flow equation and produces discharges at any link end, for any link in the river network. Evapotranspiration is estimated with an implementation of the Priestley-Taylor equation. The model system assembly is calibrated using the particle swarming algorithm. A two year simulation of hourly discharge of the Little Washita (OK, USA) basin is presented and discussed with the support of some classical indices of goodness of fit, and analysis of the residuals. A novelty with respect to traditional hydrological modeling is that each of the elements above, including the preprocessing and the analysis tools, is implemented as a software component, built upon Object Modelling System v3 and jgrasstools prescriptions, that can be cleanly switched in and out at run-time, rather than at compiling time. The possibility of creating different modeling products by the connection of modules with or without the calibration tool, as for instance the case of the present modeling chain, reduces redundancy in programming, promotes collaborative work, enhances the productivity of researchers, and facilitates the search for the optimal modeling solution.

Correspondence to: G. Formetta ([email protected])

1 Introduction

Hydrological forecasting over time has focused on different issues. Determining the discharge of rivers during flood events has been a central topic for more than a century; firstly through the rational model of Mulvaney (1851), later through the use of instantaneous unit hydrograph models (Sherman, 1932; Dooge, 1959), and more recently including the geomorphological approach (i.e. GIUH; Rodríguez-Iturbe and Valdés, 1979; Gupta and Waymire, 1980; Rosso, 1984; D’Odorico and Rigon, 2003). Even models of runoff generation such as Topmodel (Beven and Kirkby, 1979; Beven, 2001; Franchini et al., 1996) have been used mainly for this purpose.

Subsequently, however, the water resource and river management required the need to estimate a whole set of hydrological quantities such as discharge, evapotranspiration, and soil moisture, bringing very soon to the implementation of more comprehensive modeling systems, like the pioneering Stanford watershed model (Crawford and Linsley, 1966), the Sacramento model (e.g. Burnash et al., 1973), and the PRMS model (Leavesley et al., 1983). They were usually based on the idea of intercommunicating compartments (reservoirs), each representing a process domain, each one with its residence time.

Published by Copernicus Publications on behalf of the European Geosciences Union.

Formetta et al. 2011

Page 8: Research reproducibility - Code etc

This is a paper, which I co-authored, dealing with a rainfall-runoff model.

It mainly presents a hydrological model, with an application to a case study.

Formetta et al. 2011

Page 9: Research reproducibility - Code etc

Reproducibility, in this case, first requires

Consistency of notation

In this respect, the paper is certainly consistent (guaranteeing this is part of the peer-review process).

A stronger statement would require consistency of notation across a series of companion papers.

But this paper, in particular, is not a heavy theoretical treatment of some topic, and notation is not really crucial here.

Notation helps

Page 10: Research reproducibility - Code etc

A different story for this paper (the case of Botter et al., 2010)

Transport in the hydrologic response: Travel time distributions, soil moisture dynamics, and the old water paradox

Gianluca Botter (1), Enrico Bertuzzo (2), and Andrea Rinaldo (1, 2)

(1) Dipartimento di Ingegneria Idraulica Marittima Ambientale e Geotecnica, Università degli Studi di Padova, Padua, Italy.
(2) Laboratory of Ecohydrology, Faculte ENAC, Ecole Polytechnique Federale, Lausanne, Switzerland.

Received 8 July 2009; revised 23 October 2009; accepted 29 October 2009; published 12 March 2010.

[1] We propose a mathematical framework for the general definition and computation of travel time distributions defined by the closure of a catchment control volume, where the input flux is an arbitrary rainfall pattern and the output fluxes are green and blue water flows (namely, evapotranspiration and the hydrologic response embedding runoff production through soil water dynamics). The relevance of the problem is both practical, owing to implications in hydrologic watershed modeling, and conceptual for the linkages and the explanations the theory provides, chiefly concerning the role of geomorphology, climate, soils, and vegetation through soil water dynamics and the treatment of the so‐called old water paradox. The work focuses in particular on the origins of the conditional and time‐variant nature of travel time distributions and on the differences between unit hydrographs and travel time distributions. Both carrier flow and solute matter transport in the control volume are accounted for coherently. The key effect of mixing processes occurring within runoff production is also investigated, in particular by a model that assumes that mobilization of soil water involves randomly sampled particles from the available storage. Travel time distributions are analytically expressed in terms of the major water fluxes driving soil moisture dynamics, irrespectively of the specific model used to compute them. Relevant numerical examples and a set of generalized applications are provided and discussed.

Citation: Botter, G., E. Bertuzzo, and A. Rinaldo (2010), Transport in the hydrologic response: Travel time distributions, soil moisture dynamics, and the old water paradox, Water Resour. Res., 46, W03514, doi:10.1029/2009WR008371.

1. Introduction

[2] The age of water (or residence time) represents the time spent by water molecules ideally sampled from a given hydrologic system within the reference control volume (measured since the entry through rainfall). Thus, the age of water blends in a single quantitative attribute information about hydrological and chemical storages, flow pathways, and water sources [e.g., McGuire and McDonnell, 2006]. Several field observations (especially built through extensive rainfall/runoff dating by isotope hydrology) and a few theoretical results have established the so‐called “old water paradox,” according to which a sizable part of the runoff within the hydrologic response of catchment transport volumes is constituted by aged water particles (i.e., by water particles injected at times preceding the event causally related to the observed runoff) [e.g., Maloszewski and Zuber, 1982; McDonnell, 1990; McDonnell et al., 1991; Stewart and McDonnell, 1991; Wilson et al., 1991a, 1991b; Leaney et al., 1993; Rodhe et al., 1996; Cirmo and McDonnell, 1998; Nyberg et al., 1999; Peters and Ratcliffe, 1998; Burns et al., 1998; Weiler et al., 2003; McGuire et al., 2007; Botter et al., 2007, 2008a, 2009]. The release of old water has been explained by the propagation of pressure waves induced by precipitation inputs with a celerity exceeding the pore water velocity [e.g., Beven, 1981, 1989b], including displacement of water previously immobilized within the soil matrix into preferential flow pathways [e.g., Beven and Germann, 1982]. However, some of the physical processes controlling the release of preevent water from catchments are still poorly understood or roughly modeled, and the observational data do not suggest either universal behaviors, nor do they support linear and time‐invariant behaviors as assumed by unit hydrograph schemes [e.g., Weiler and McDonnell, 2006]. The complexity of the mixing patterns involving event and preevent waters in hillslopes is partly a byproduct of the structural complexity of subsurface environments, which are typically characterized by pronounced heterogeneity and time variable connectivity of flow pathways. For this reason, it is inappropriate to use the point‐scale physical laws determining the movement of water and solutes within hillslopes to make predictions at larger scales because of the nonlinearity of flow processes and the uncertain distribution of hydrologic, geological and morphological properties of control volumes [e.g., Beven, 1989a, 2006; Kirchner, 2009]. Hence, lumped approaches are frequently employed to describe in an effective manner the overall behavior of hillslopes/catchments. In particular, the water travel time

Copyright 2010 by the American Geophysical Union. 0043‐1397/10/2009WR008371

WATER RESOURCES RESEARCH, VOL. 46, W03514, doi:10.1029/2009WR008371, 2010

W03514 1 of 18

Botter et al., 2010

Page 11: Research reproducibility - Code etc

This is an outstanding paper on transport and residence times, which I have read several times over the last few months in order to reproduce its research (with my own tools).


It is mostly a theoretical paper, with an application to an idealised case study

Botter et al., 2010


Page 12: Research reproducibility - Code etc

JGrass-NewAGE 1.0

7. NEWAGE-JGRASS RAINFALL RUNOFF MODEL

7.5 Experimenting different modeling solutions.

The Hymod component is applied for each HRU and the runoff production is then propagated in the channel network. A new runoff propagation components is implemented and presented in the next subsection. To study the role and the importance of the channel routing component a test is performed. Two river basins are used for the test and modeled in a three different delineations by using one (DL1), three (DL3) and twenty (DL20) HRU’s. Two modeling solutions were set up: Hymod and RHymod in fig.(7.9).

Figure 7.9: Modelling solutions: Hymod (in red dashed line) and RHymod (in blue dashed line).

The modeling solution RHymod includes: the Priestley-Taylor component for the evapotranspiration estimate, the ordinary kriging algorithm for the rainfall spatialization, the hymod model for the runoff production of the hillslope, and finally the new channel routing component presented in the next section. The modeling solution Hymod differs from the model solution RHymod by only turning off the channel routing component and the discharge for each HRU are just added downstream. LUCA (66) was selected as calibration component for both the modeling solutions. The objective function is the Kling-Gupta efficiency (KGE) function as presented in (63).

The test is performed on two different river basin: Fort Cobb and Little Washita. The simulation period covered 2006-2007 in the case Fort Cobb and 2002-2003 in the case of Little Washita river basin; one year was used for calibration and one year for verification. The simulations time step was hourly.
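The Kling-Gupta efficiency named above as the objective function can be sketched in a few lines. This is only an illustration, assuming the commonly cited three-term form KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), with r the linear correlation, alpha the ratio of standard deviations and beta the ratio of means between simulated and observed discharge. Whether the excerpt uses exactly this variant is not stated in the slide, and the function name and the sample numbers are hypothetical; this is not the JGrass-NewAGE component.

```python
import math
from statistics import mean, pstdev


def kling_gupta_efficiency(simulated, observed):
    """Kling-Gupta efficiency of a simulated series against observations.

    Assumed formulation (not confirmed by the slides):
    KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
    """
    if len(simulated) != len(observed) or len(observed) < 2:
        raise ValueError("series must have the same length (at least 2)")

    mu_sim, mu_obs = mean(simulated), mean(observed)
    sd_sim, sd_obs = pstdev(simulated), pstdev(observed)
    if sd_sim == 0 or sd_obs == 0 or mu_obs == 0:
        raise ValueError("KGE is undefined for constant series or zero observed mean")

    # Population covariance, consistent with pstdev above.
    cov = sum((s - mu_sim) * (o - mu_obs)
              for s, o in zip(simulated, observed)) / len(observed)

    r = cov / (sd_sim * sd_obs)   # linear correlation
    alpha = sd_sim / sd_obs       # variability ratio
    beta = mu_sim / mu_obs        # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


if __name__ == "__main__":
    # Hypothetical hourly discharges, for illustration only.
    observed = [1.0, 2.0, 3.0, 4.0]
    simulated = [1.1, 1.9, 3.2, 3.8]
    print(kling_gupta_efficiency(simulated, observed))
```

A value of 1 means a perfect match; a calibrator would search for the parameter set that brings the value as close to 1 as possible.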

Back to Formetta et al., 2011

Page 13: Research reproducibility - Code etc

JGrass-NewAGE 1.0: more

Therefore, to reproduce the JGrass-NewAGE 1.0 results, one has to know the theory behind each of the above components. Unfortunately, this is only a first impression: you actually have to know more.

6. NEWAGE-JGRASS SHORTWAVE RADIATION MODEL

Figure 6.1: OMS3 SWRB components of NewAge-JGrass and the flowchart to model shortwave radiation at the terrain surface with generic sky conditions. Where not specified, quantity for input or output must be a spatial field for any instant of simulation time. ”Measured” refers to a quantity that is measured at a meteorological station. Geomorphic features refer to the hillslope and channel delineation, slope and aspect. The components, besides the specified files received in input, include an appropriate set of parameter values.

Back to Formetta et al., 2011

Page 14: Research reproducibility - Code etc

JGrass-NewAGE 1.0: even more

5.3 Motivation for Semivariogram modelling and providing krigings tools in NewAge-JGrass.

5.3.3 The krigings tools in the NewAge-JGrass system

After the variogram assessment, we are able to apply it for kriging interpolation of a dataset. The flow chart of the kriging algorithm is presented in fig.(5.3). The input data are: i) the shape file of the measurement stations, ii) the .csv file of the measured data, iii) the shape file or the raster map of the interpolations points, iv) the semivariogram model to use for the interpolation. The model parameters are: a flag to specify the working mode (raster or vector), the semivariogram model parameter, a flag to specify the kriging type (ordinary, local, or detrended) and some control parameters related to the selected kriging algorithm (maximum distance for local kriging, threshold of the correlation between elevation and measurements for detrended kriging). Within kriging model configuration, different variogram models can be used for different time steps. The outputs could be or a .csv file or a raster map with the interpolated values.

Comparisons with the R-package Gstat (115) are presented in Appendix 1 in order to test the implemented algorithms (ordinary and local kriging).

Figure 5.3: The Kriging flowchart.
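To give an idea of what sits behind the kriging step described in this excerpt, here is a minimal sketch of ordinary kriging at a single interpolation point. It is not the NewAge-JGrass component: the exponential semivariogram without nugget, the default sill and range, the function names and the station coordinates are all assumptions made for illustration, and the local and detrended variants mentioned in the excerpt are not covered.

```python
import math


def exponential_semivariogram(h, sill, range_):
    """Assumed model: gamma(h) = sill * (1 - exp(-h / range_)), no nugget."""
    return sill * (1.0 - math.exp(-h / range_))


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular kriging system (duplicate stations?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def ordinary_kriging(stations, values, target, sill=1.0, range_=10.0):
    """Return (estimate, weights) at `target` from measurements at `stations`.

    Solves: sum_j w_j * gamma(x_i, x_j) + mu = gamma(x_i, x_0), sum_j w_j = 1.
    """
    n = len(stations)

    def gamma(p, q):
        return exponential_semivariogram(math.dist(p, q), sill, range_)

    # Kriging matrix with the unbiasedness constraint as last row and column.
    a = [[gamma(stations[i], stations[j]) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(s, target) for s in stations] + [1.0]

    weights = solve_linear_system(a, b)[:n]  # drop the Lagrange multiplier
    estimate = sum(w * v for w, v in zip(weights, values))
    return estimate, weights


if __name__ == "__main__":
    # Hypothetical rain gauges (x, y) and hourly rainfall values.
    stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    rainfall = [1.0, 2.0, 3.0]
    estimate, weights = ordinary_kriging(stations, rainfall, (3.0, 4.0))
    print(estimate, weights)
```

Even this toy version needs a semivariogram model and its parameters, which is the point of the slide: the choices listed in the excerpt (model, kriging type, control parameters) must all be known before the interpolated rainfall can be reproduced.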

Back to Formetta et al., 2011

Page 15: Research reproducibility - Code etc

JGrass-NewAGE 1.0: even more than more

4.2 Catchment analysis

Whatever the conceptualisation, the challenge, is to deploy the ideas in robust and correct code. This is accomplished in NewAge-JGrass by using the GEOtools libraries and their implementation of the geographic features which seamlessly integrate with OMS3 programming and uDig.

The Horton Machine (127) and (128) is built on top of these libraries which are the modelling components that are actually being used.

To obtain this hierarchical structure it is necessary to first process the raster data from a digital elevation model which is summarised below.

4.2 Catchment analysis

The analysis of the catchment, starts with the acquisition of a Digital Terrain Model (DTM) of the catchment, e.g. (159). It is performed as illustrated in fig.(4.1) and summarized for the reader below.

Figure 4.1: The workflow for the basin delineation in NewAge-JGrass

4.2.1 Geomorphological analysis

Starting from the digital terrain model (DTM), the ”Horton Machines” (128) components as provided by the GIS uDig-JGrass are used. In sequence, those are:

Back to Formetta et al., 2011

Page 16: Research reproducibility - Code etc

Scared enough?

Help me!

Page 17: Research reproducibility - Code etc

JGrass-NewAGE 1.0: Sorry, I forgot a piece

G. Formetta et al.: The JGrass-NewAge System for forecasting and managing hydrological budgets 953

Fig. 9. Application of the JGrass-NewAge model for the period 01/01/2002 to 31/12/2003.

[…] case of two submodels for runoff production, one of which, whilst appealing from a theoretical point of view, revealed unfeasible during calibration. This models was, in fact, easily substituted by another without the need to rebuild the whole model system.

The versatility of the modeling approach was also tested by implementing two different modeling chains, one substantially performing simulation with a very lumped application of the model, just using Hymod for the whole catchment, the other representing a more distributed “version” of the same Hymod runoff generating mechanism, connected with a routing scheme. The forecasts were tested by analysis of the residuals and through the estimation of some objective indices, which were also implemented as software components. These allowed us to objectively state that, at least for the case in study, the performances of the distributed version of the modeling chain was significantly better than the lumped version, thus supporting the idea that the increase in model complexity was worthwhile. It is noteworthy that this comparison was made between systems where most of the code was the same, thus guaranteeing, in our opinion, the most fair comparison possible.

The modeling chain, although seemingly very traditional, was actually implemented using advanced specifications of the geographical objects, as required by OGC, and uses a particular specification of the river network hierarchy and the related hillslopes that was built upon the Pfafstetter ordering scheme.

Even though the overall performances of the forecasting can be considered very good, in the future some new components could substitute the older ones and be compared consistently along the same lines, even if further improvements in the ability to forecast measured discharge could not be considered significant without a proper assessment of the uncertainties inherent to the description of the processes.

These comparisons could be made by the same authors or independently by other researchers, since the JGrass-NewAge modeling system is freely available, with just the new component requiring coding. In this sense the infrastructure promotes independent testing and verification of research results with unprecedented easiness. In this perspective a component by component and interoperability comparison of the JGrass-NewAge system with others, such as PRSM (Leavesley et al., 1983) or J2000 (Krause, 2001) that embraced the OMS3 frameworks can be envisaged.

You need the same data!

In this case, you are lucky. We used open data … but this is not always the case.

Back to Formetta et al., 2011

Page 18: Research reproducibility - Code etc

Assuming you are bold and smart

It will take you at least a couple of years to put all the parts together on your own, just by following verbatim the indications you can get from the paper. (We think we put all the necessary information in the paper but, you know, this is practically unverifiable.)

Mumbling

Page 19: Research reproducibility - Code etc

Our paper is theoretically reproducible … but practically not: it requires a trained person to do it, with all the right tools in hand (including programming skills)…

If you are a Ph.D. student starting from scratch, you cannot afford it! Almost nobody goes back and repeats something that's already been published, though.*

*http://arstechnica.com/science/2012/08/scientific-reproducibility-for-fun-and-profit/

Mumbling Mumbling

Page 20: Research reproducibility - Code etc

So are we doing science, or just pretending to do science?

“Theoretically reproducible … but practically not”: does it mean that theoretically we are doing science, but practically not?

Mumbling Mumbling Mumbling

Page 21: Research reproducibility - Code etc

This is even worse than believed in today's science

Because of the massive use of computation.

Computation is now central to the scientific enterprise and it adds a further layer of complexity to the science visible in papers.

Some papers that come out of computation are out of any control

Not just one single case

Page 22: Research reproducibility - Code etc

Not just one single case


“Computation is now central to the scientific enterprise, and the emergence of powerful computational hardware, combined with a vast array of computational software, presents novel opportunities for researchers. Unfortunately, the scientific culture surrounding computational work has evolved in ways that make it difficult to verify findings, efficiently build on past research, or even apply the basic tenets of the scientific method to computational procedures.”

By Victoria Stodden, Jonathan M. Borwein, David H. Bailey, SIAM News

http://sinews.siam.org/DetailsPage/tabid/607/ArticleID/351/%E2%80%9CSetting-the-Default-to-Reproducible%E2%80%9D-in-Computational-Science-Research.aspx


Page 23: Research reproducibility - Code etc

To remove any doubt

I decided to make all our code (the source code, actually) public under a copyleft license (GPL v3.0). See:

http://abouthydrology.blogspot.it/2015/03/jgrass-newage-essentials.html

So we reduced a couple of years of work to three months (with instructions)

No fake science

Page 24: Research reproducibility - Code etc

And we plan to make our work

Replicable in any paper, not only Reproducible

but we are not alone

No fake science

Page 25: Research reproducibility - Code etc


Editorial: The publication of geoscientific model developments v1.0

one of the EGU’s Open Access journals, i.f. 3.6

Journals


Page 26: Research reproducibility - Code etc


Editorial: Vadose Zone Journal

Vadose Zone Journal | Advancing Critical Zone Science

Reproducible Research in Vadose Zone Sciences

T.H. Skaggs,* M.H. Young, and J.A. Vrugt

A significant portion of present-day soil and Earth science research is computational, involving complex data analysis pipelines, advanced mathematical and statistical models, and sophisticated computer codes. Opportunities for scientific progress are greatly diminished if reproducing and building on published research is difficult or impossible due to the complexity of these computational systems. Vadose Zone Journal (VZJ) is launching a Reproducible Research (RR) program in which code and data underlying a research article will be published alongside the article, thereby enabling readers to analyze data in a manner similar to that presented in the article and build on results in future research and applications. In this article, we discuss reproducible research, its background and use across other disciplines, its value to the scientific community, and its implementation in VZJ.

Abbreviations: NIH, National Institutes of Health; RR, Reproducible Research; VZJ, Vadose Zone Journal.

A hallmark of the scientific method is that research results must be reproducible. Although the reproducibility requirement has always existed, technological advances over the last few decades have changed the way science is practiced and communicated, creating for researchers and publishers new opportunities and challenges with respect to openness and reproducibility.

One set of opportunities involves increased reuse of experimental data. The internet and related information technologies have allowed greater archiving and sharing of environmental and geoscience data. Data sharing makes the validation of scientific findings possible, lessens the need for wasteful duplication of research efforts, and facilitates new data synthesis and aggregation activities. A number of environmental and geoscience publishers have promoted data sharing through the introduction of “dataset” articles and journals that focus on digital data archives (e.g., Hornberger, 1994; Pfeiffenberger and Carlson, 2011; Nature Publishing Group, 2014; Gregorich, 2015). Novel data sharing opportunities also arise from long-term observational networks such as LTER (http://www.lternet.edu, accessed 4 Sept. 2015), NEON (http://www.neoninc.org, accessed 4 Sept. 2015), FLUXNET (http://fluxnet.ornl.gov, accessed 4 Sept. 2015), LTAR (http://www.ars.usda.gov/ltar, accessed 4 Sept. 2015), CZO (http://criticalzone.org/national, accessed 4 Sept. 2015), TERENO (http://teodoor.icg.kfa-juelich.de/overview-en, accessed 4 Sept. 2015), and various monitored watersheds (e.g., Reynolds Creek Experimental Watershed, Idaho [Marks, 2001]). These networks are creating new possibilities for evaluating agroecosystem data, including assessments of reproducibility across geographical locations and time.

Yet, beyond these considerations of experimental data and field observations, we recognize that modern computing technologies have created entirely new dimensions to the issue of research reproducibility (Yale Law School Round Table on Data and Code Sharing, 2010; Peng, 2011; Stodden et al., 2014). A significant portion of present-day scientific research is computational, involving complex data analysis pipelines, elaborate mathematical and statistical models, and sophisticated computer codes or scripts. In many published papers, the computational methods are integral to the research results being reported,

Core Ideas

• A significant portion of present-day geoscience research is computational.

• Science would benefit from greater transparency in computational research.

• Vadose Zone Journal is launching a Reproducible Research program.

• Code and data underlying a research article will be published alongside articles.

T.H. Skaggs, U.S. Salinity Laboratory, 450 W. Big Springs Rd., Riverside, CA 92507, USA. M.H. Young, Bureau of Economic Geology, Jackson School of Geosciences, University of Texas at Austin, Austin, TX, USA. J.A. Vrugt, Dep. of Civil and Environmental Engineering, University of California, Irvine, CA, USA. *Corresponding author ([email protected]).

Vadose Zone J. doi:10.2136/vzj2015.06.0088. Received 12 June 2015. Accepted 15 Aug. 2015. Open access article

Opinion and Policy

© Soil Science Society of America 5585 Guilford Rd., Madison, WI 53711 USA. All rights reserved.

Published October 12, 2015

Journals


Page 27: Research reproducibility - Code etc

1. Make our source code open source (actually not strictly necessary: the executable alone could serve the purpose) and available through

https://github.com/

Counterattack: a strategy to make our work replicable

Page 28: Research reproducibility - Code etc

2. a) Documenting our code as well as possible, according to a standard format (still to be defined … but we are working on it).

b) Documenting our algorithms.

c) Using the Object Modeling System v3 (David et al., 2013; Formetta et al., 2014)

Counterattack: a strategy to make our work replicable

Page 29: Research reproducibility - Code etc

Counterattack: a strategy to make our work replicable

3. Using the appropriate build tools

https://gradle.org/

Page 30: Research reproducibility - Code etc

4. Use standard names for hydrological variables. For instance, use the Basic Model Interface (BMI) standards

http://csdms.colorado.edu/wiki/BMI_Description

Counterattack: a strategy to make our work replicable
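To make the idea of standard names concrete, here is a minimal sketch of a component that exposes its variables through two BMI-style getters, `get_output_var_names` and `get_var_units`. Everything else is an assumption made for illustration: the class `ToyRunoffComponent` is hypothetical, it is not a complete BMI implementation, and the long names only imitate the object__quantity pattern of the CSDMS Standard Names, so they should be checked against the official list before use.

```python
# Hypothetical internal variable names mapped to (standard-style name, units).
# The long names imitate the CSDMS object__quantity pattern and are
# illustrative only; verify them against the official registry.
STANDARD_NAMES = {
    "rain": ("atmosphere_water__precipitation_leq-volume_flux", "mm h-1"),
    "discharge": ("channel_exit_water_x-section__volume_flow_rate", "m3 s-1"),
}


class ToyRunoffComponent:
    """Toy component exposing two BMI-style getters (not a full BMI)."""

    def get_output_var_names(self):
        """Return the standard-style names of all exposed variables."""
        return tuple(name for name, _ in STANDARD_NAMES.values())

    def get_var_units(self, name):
        """Return the units string registered for a standard-style name."""
        for standard_name, units in STANDARD_NAMES.values():
            if standard_name == name:
                return units
        raise KeyError(f"unknown variable: {name}")
```

The point of the design is that another researcher's component can ask for a variable by its shared name and units without knowing that this code calls it `rain` internally.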

Page 31: Research reproducibility - Code etc

Using Authorea for uploading complementary material and documentation.

https://www.authorea.com/

Counterattack: a strategy to make our work replicable

You can also use Jupyter or Beaker

Page 32: Research reproducibility - Code etc

A strategy to make our paper replicable

Using Open Data as much as possible in our research and making our data openly available*.

*Scientific Data is a Nature journal

http://www.nature.com/sdata/

https://en.wikipedia.org/wiki/Open_data

Counterattack: a strategy to make our work replicable

Page 33: Research reproducibility - Code etc

Other (more valuable) experiences

The R community

(https://cran.r-project.org/web/views/ReproducibleResearch.html)

Communities

Page 34: Research reproducibility - Code etc


Communities

Python

http://software-carpentry.org/


Page 35: Research reproducibility - Code etc


https://www.coursera.org/course/repdata?from_restricted_preview=1&course_id=973513&r=https%3A%2F%2Fclass.coursera.org%2Frepdata-012%2Fclass#

R based reproducible research on Coursera

Communities


Page 36: Research reproducibility - Code etc

XXXV Convegno Nazionale di Idraulica e Costruzioni Idrauliche, Bologna, 14-16 September 2016

Bancheri M. et al., Research reproducibility and replicability: the case of JGrass-NewAge

Source code Project examples

Community blog Documentation

http://geoframe.blogspot.com & https://github.com/geoframecomponents

R.Rigon, M.Bancheri, F. Serafin, W.Abera, G.Formetta

Page 37: Research reproducibility - Code etc


Become a Reproducible Research Warrior !

Do not wait! Make your stuff available on the Web (whatever format) under an open license*.

*Same as Tim Berners-Lee. Waiting to have it in better shape will delay publication forever, and your contribution will be lost (like tears in rain): http://5stardata.info/

R2 The stairs

For yourself

Page 38: Research reproducibility - Code etc

Make it available with documentation (e.g. a README file for any data set and for any model)

R2 The stairs

For yourself
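As a minimal sketch of this step, the snippet below writes a small README next to a data set or model. The layout of `README_TEMPLATE` and the function `write_readme` are hypothetical: the slides ask for a README but do not prescribe its sections, so the fields chosen here are only an assumption about what a reader would need first.

```python
from pathlib import Path

# Hypothetical minimal README layout; the section names are an assumption,
# not a standard prescribed in these slides.
README_TEMPLATE = """\
{title}
{underline}

What it is: {description}
How to use it: {usage}
License: {license_name}
Contact: {contact}
"""


def write_readme(folder, title, description, usage, license_name, contact):
    """Write README.txt into folder (created if missing) and return its path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    text = README_TEMPLATE.format(
        title=title,
        underline="=" * len(title),
        description=description,
        usage=usage,
        license_name=license_name,
        contact=contact,
    )
    target = folder / "README.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

Generating the file from a template keeps the README of every data set and model in the same shape, which makes a collection easier to browse.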

Page 39: Research reproducibility - Code etc

Provide examples of runs, and give some references. Structure your documentation. Include figures and how they were made.

R2 The stairs

For yourself

Page 40: Research reproducibility - Code etc

Use URLs and providers like GitHub to store code and data, so people can point at your stuff, and browse it freely*

R2 The stairs

For yourself

Page 41: Research reproducibility - Code etc

Maintain a user group (and answer questions when asked). Provide any run you do on the web with the appropriate metadata.** ***

**: https://earthsystemcog.org/projects/es-doc-models/

***http://abouthydrology.blogspot.it/2014/10/naming-things-in-hydrological-models.html

R2 The stairs

For yourself
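Publishing a run "with the appropriate metadata" can start with something as simple as a JSON file saved beside the output. The sketch below shows one possible shape; the field names and the functions `run_metadata` and `save_metadata` are assumptions made for illustration and do not follow the ES-DOC schema referenced above.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def run_metadata(model, version, parameters, inputs):
    """Collect a small, JSON-serialisable description of one model run."""
    return {
        "model": model,
        "version": version,
        "parameters": parameters,
        "inputs": [str(item) for item in inputs],
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def save_metadata(metadata, output_file):
    """Write the metadata next to output_file as <name>.meta.json."""
    output_file = Path(output_file)
    sidecar = output_file.with_name(output_file.name + ".meta.json")
    sidecar.write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )
    return sidecar
```

Keeping the metadata in a sidecar file means the run output itself stays untouched, while anyone who downloads it can see which parameters and inputs produced it.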

Page 42: Research reproducibility - Code etc

Maintain a user group (and answer questions when asked). Provide any run you do on the web with the appropriate metadata.** ***

**: https://earthsystemcog.org/projects/es-doc-models/

***http://abouthydrology.blogspot.it/2014/10/naming-things-in-hydrological-models.html

Use URLs and providers like GitHub to store code and data, so people can point at your stuff, and browse it freely*

Make it available with documentation (e.g. a README file for any data set and for any model)

Provide examples of runs, and give some references. Structure your documentation. Include figures and how they were made.

Do not wait! Make your stuff available on the Web (whatever format) under an open license*.

*http://5stardata.info/

R2 The stairs

For yourself

Page 43: Research reproducibility - Code etc

See Also

Journals

http://sciencecodemanifesto.org/

Page 44: Research reproducibility - Code etc

In conclusion

Conclusions

• Research must be reproducible

• In many cases it would be better if it were replicable

• Making our research replicable can be an advantage

• It can favour the progress of science

• Do not be shy: share your research

• Nobody is going to hurt you

Find your own way to Reproducible Research

Page 45: Research reproducibility - Code etc

Find this presentation at

http://abouthydrology.blogspot.com

Ulrici, 2000 ?

Other material at

http://abouthydrology.blogspot.it/2015/07/theory-and-practice-of-reproducible.html

Questions ?

Page 46: Research reproducibility - Code etc

References

For the web references, see the slides.

Formetta, G., Mantilla, R., Franceschi, S., Antonello, A., Rigon, R., The JGrass-NewAge system for forecasting and managing the hydrological budgets at the basin scale: models of flow generation and propagation/routing, Geoscientific Model Development, 4(4), 943-955, DOI: 10.5194/gmd-4-943-2011, 2011

Botter, G., E. Bertuzzo, and A. Rinaldo (2010), Transport in the hydrologic response: Travel time distributions, soil moisture dynamics, and the old water paradox, Water Resour. Res., 46, W03514, doi:10.1029/2009WR008371.

Formetta, G., Antonello, A., Franceschi, S., David, O., and Rigon, R., Hydrological modelling with components: A GIS-based open-source framework, Environmental Modelling & Software, 55 (2014), 190-200

David, O., Ascough II, J.C., Lloyd, W., Green, T.R., Rojas, K.W., Leavesley, G.H., Ahuja, L.R., 2013. A software engineering perspective on environmental modeling framework design: the Object Modeling System. Environ. Model. Softw. 39, 201–213.

Page 47: Research reproducibility - Code etc

References right to the point

Hutton, C., Wagener, T., Freer, J., Han, D., Duffy, C., & Arheimer, B. (2016). Most computational hydrology is not reproducible, so is it really science? Water Resources Research, 1–14. http://doi.org/10.1002/2016WR019285

Ince, D. C., Hatton, L., & Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482(7386), 485–488. http://doi.org/10.1038/nature10836

Skaggs, T. H., Young, M. H., & Vrugt, J. A. (2015). Reproducible Research in Vadose Zone Sciences. Vadose Zone Journal, 1–5. http://doi.org/10.2136/vzj2015.06.0088