
Geosci. Model Dev., 10, 413–423, 2017
www.geosci-model-dev.net/10/413/2017/
doi:10.5194/gmd-10-413-2017
© Author(s) 2017. CC Attribution 3.0 License.

The compression–error trade-off for large gridded data sets

Jeremy D. Silver¹ and Charles S. Zender²

¹ School of Earth Sciences, University of Melbourne, Melbourne, Australia
² Departments of Earth System Science and of Computer Science, University of California, Irvine, CA, USA

Correspondence to: Jeremy D. Silver ([email protected])

Received: 7 July 2016 – Published in Geosci. Model Dev. Discuss.: 29 July 2016
Revised: 28 October 2016 – Accepted: 19 November 2016 – Published: 27 January 2017

Abstract. The netCDF-4 format is widely used for large gridded scientific data sets and includes several compression methods: lossy linear scaling and the non-lossy deflate and shuffle algorithms. Many multidimensional geoscientific data sets exhibit considerable variation over one or several spatial dimensions (e.g., vertically) with less variation in the remaining dimensions (e.g., horizontally). On such data sets, linear scaling with a single pair of scale and offset parameters often entails considerable loss of precision. We introduce an alternative compression method called “layer-packing” that simultaneously exploits lossy linear scaling and lossless compression. Layer-packing stores arrays (instead of a scalar pair) of scale and offset parameters. An implementation of this method is compared with lossless compression, storing data at fixed relative precision (bit-grooming) and scalar linear packing in terms of compression ratio, accuracy and speed.

When viewed as a trade-off between compression and error, layer-packing yields similar results to bit-grooming (storing between 3 and 4 significant figures). Bit-grooming and layer-packing offer significantly better control of precision than scalar linear packing. Relative performance, in terms of compression and errors, of bit-groomed and layer-packed data was strongly predicted by the entropy of the exponent array, and lossless compression was well predicted by the entropy of the original data array. Layer-packed data files must be “unpacked” to be readily usable. The compression and precision characteristics make layer-packing a competitive archive format for many scientific data sets.

1 Introduction

The volume of both computational and observational geophysical data has grown dramatically in recent decades, and this trend is likely to continue. While disk storage costs have fallen, constraints remain on data storage. Hence, in practice, compression of large data sets continues to be important for efficient use of storage resources.

Two important sources of large volumes of data are computational modeling and remote sensing (principally from satellites). These data often have a “hypercube” structure and are stored in formats such as HDF5 (http://www.hdfgroup.org), netCDF-4 (http://www.unidata.ucar.edu/netcdf) and GRIB2 (http://rda.ucar.edu/docs/formats/grib2/grib2doc/). Each of these has its own built-in compression techniques, allowing data to be stored in a compressed format while simultaneously allowing access to the data (i.e., incorporating the compression/decompression algorithms into the format’s API). These compression methods are either “lossless” (i.e., no precision is lost) or “lossy” (i.e., some accuracy is lost).

In this study, we examine the advantages and trade-offs in allowing for different treatment of dimensions in the compression process. One motivation for this work is improving the lossy compression ratios typically achieved with HDF5 and netCDF-4, so that they are more comparable with the impressive compression achieved by GRIB2. Records in GRIB2 are strictly two-dimensional, and this format allows for only a limited set of predefined metadata. However, GRIB2 offers excellent compression efficiency (Caron, 2014), built upon the JPEG image compression methods; it is therefore well-suited for operational environments. By contrast, HDF5 and netCDF-4 provide a highly flexible framework, allowing for attributes, groups and user-defined data types. This format is well-suited for more experimental environments and allows introduction of user-defined parameters to guide and improve compression. We use this flexibility to develop a lossy compression algorithm, termed “layer-packing”, which combines desirable features from both GRIB2 and HDF5–netCDF-4.

Layer-packing is a variant of compression via linear scaling that exploits the clustering of data values along dimensional axes. We contrast this with other compression methods (rounding to fixed precision, and simple linear scaling) that are readily available within the HDF5–netCDF-4 framework. The performance is quantified in terms of the resultant loss of precision and compression ratio. We also examine various statistical properties of data sets that are predictive of their overall compressibility and of the relative performance of layer-packing or rounding to fixed precision.

2 Methods

This section outlines the implementation of the layer-packing and layer-unpacking methods, the test data sets, the evaluation metrics and the performance of the compression methods on the test cases considered.

2.1 Compression algorithms

2.1.1 Deflate and shuffle

The “deflate” algorithm (Deutsch, 2008) is a widely used lossless compression method, available within most modern installations of the netCDF-4 or HDF5 API. The deflate algorithm breaks the data into blocks, which are compressed via the LZ77 algorithm (Ziv and Lempel, 1977, 1978) combined with Huffman encoding (Huffman, 1952). The LZ77 algorithm searches for duplicated patterns of bytes within the block. The Huffman coding step involves substituting frequently used “symbols” with shorter bit sequences and rarer symbols with longer bit sequences.

When used in the HDF5–netCDF-4 framework, the deflate algorithm is often applied together with the shuffle filter (HDF5 Group, 2002). The shuffle filter does not compress data as such, but instead changes the byte ordering of the data; the first byte of each value is stored in the first chunk of the data stream, the second byte in the second chunk, and so on. This tends to improve the compressibility of the data, particularly if there is autocorrelation in the data stream (i.e., data that are close in value tend to appear close together).

In all of the following work, we have used the deflate algorithm and the shuffle filter together, henceforth referred to collectively as DEFLATE for brevity. In the results presented below, compression via DEFLATE was performed with the ncks tool of the NCO bundle (Zender, 2008). The compression level (an integer taking values between 1 and 9, where 1 means fastest and 9 means greatest compression) dictates the amount of searching for duplicated patterns that is performed within the LZ77 step. For all applications (either as an independent method or in combination with the methods outlined below), the same compression level was used within DEFLATE (namely, level 4).

When using the netCDF-4 framework to compress a variable, the user must define the size of the “chunk” within which compression occurs. In all the results presented here (both for DEFLATE and for the other compression methods), the chunk size was equal to the layers packed using layer-packing (see Sect. 2.1.3). The same analyses were performed setting the chunk sizes to encompass the whole variable, and the conclusions reached in this article were essentially the same.
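For illustration, the same deflate, shuffle and chunking settings can also be requested directly through the netCDF4 Python library when a variable is created (in the study itself this step was performed with ncks; the file, variable and dimension names below are hypothetical):

    import netCDF4
    import numpy as np

    # Hypothetical example: write a (time, level, lat, lon) field compressed
    # with deflate level 4 plus the shuffle filter, chunked one thin slice at a time.
    nt, nz, ny, nx = 8, 32, 180, 360
    with netCDF4.Dataset("example_deflate.nc", "w", format="NETCDF4") as nc:
        for name, size in zip(("time", "level", "lat", "lon"), (nt, nz, ny, nx)):
            nc.createDimension(name, size)
        var = nc.createVariable(
            "temperature", "f4", ("time", "level", "lat", "lon"),
            zlib=True, shuffle=True, complevel=4,
            chunksizes=(1, 1, ny, nx))  # one horizontal slice per chunk
        var[:] = np.random.standard_normal((nt, nz, ny, nx)).astype("f4")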

2.1.2 Scalar linear packing

One can compress a field using a lower-precision representation (e.g., 2-byte integers rather than 4-byte floating point numbers) and a linear transformation between the original and compressed representations; this process is termed “packing”. In the common case of representing 4-byte floats as 2-byte integers, there is already a saving of 50 % relative to the original representation (ignoring the small overhead due to storing the parameters involved in the transformation). We call this “scalar linear packing” (or LIN, for short), and it is the standard method of packing in the geoscience community. Its attributes are defined in the netCDF User Guide (Unidata, 2016) and it has a long and wide tradition of support, including automatic interpretation in a range of netCDF readers.

Scalar linear packing to convert floating-point data to an unsigned 2-byte integer representation is outlined below:

    maxPackedValue = 2^16 − 1 = 65 535
    vMin = minimum value of data
    vMax = maximum value of data
    add_offset = vMin
    scale_factor = (vMax − vMin) / maxPackedValue
    packed = uint16((data − add_offset) / scale_factor),

where vMin, vMax, add_offset and scale_factor are scalar floating point values, data is a floating point array, packed is an unsigned 2-byte integer array of the same dimension as data, and the function uint16(·) converts from floating point to unsigned 2-byte integer. Care must be taken in handling infinite, not-a-number or undefined values (such details are omitted here). The value of maxPackedValue listed uses all values in the 2-byte integer range to represent floats; one can choose a different value of maxPackedValue, leaving some 2-byte values for the special floating-point values. The values of add_offset and scale_factor must be stored along with packed to enable the reverse transformation:

    unpacked = float(packed) × scale_factor + add_offset.

Data packed with this method often can be compressed substantially more than the 50 % noted above. This is done by applying DEFLATE to the packed data; this was done for all data sets compressed with LIN here. In this study, compression via DEFLATE was performed with the ncpdq tool of the NCO bundle (Zender, 2008).
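The transformation above amounts to only a few lines of numpy. The following sketch (an illustration, not the ncpdq implementation) packs a floating-point array to unsigned 2-byte integers and reverses the transformation; special values such as NaN are ignored for brevity:

    import numpy as np

    MAX_PACKED_VALUE = 2**16 - 1  # 65 535

    def linear_pack(data):
        # Scalar linear packing: one scale/offset pair for the whole array.
        vmin, vmax = float(data.min()), float(data.max())
        add_offset = vmin
        scale_factor = (vmax - vmin) / MAX_PACKED_VALUE if vmax > vmin else 1.0
        packed = np.round((data - add_offset) / scale_factor).astype(np.uint16)
        return packed, scale_factor, add_offset

    def linear_unpack(packed, scale_factor, add_offset):
        # Reverse transformation: unpacked = float(packed) * scale_factor + add_offset
        return packed.astype(np.float32) * scale_factor + add_offset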

2.1.3 Layer-packing

In many applications in the geophysical sciences, a multi-dimensional gridded variable varies dramatically across one dimension while exhibiting a limited range of variation within slices of this variable. Examples include variation in atmospheric density and water vapor mixing ratio with respect to height, variation in ocean temperature and current velocity with respect to depth, and variation in atmospheric concentrations of nitrogen dioxide with respect to height and time.

We will use the term “thick dimensions” to denote those dimensions that account for the majority of the variation in such variables, and “thin dimensions” to denote the remaining dimensions; in the case of the first example above (assuming a global grid and a geographic coordinate system), the vertical dimension (pressure or height) is thick, and the horizontal dimensions (latitude, longitude) and time are thin. We will use the term “thin slice” to describe a slice through the hypercube for fixed values of the thick dimensions. Note that there are cases with multiple thick dimensions, such as the third example noted above.

Scalar linear packing applied to such gridded variables will result in a considerable loss of accuracy. This is because, in order to cover the scale of variation spanned by the thick dimensions, the range of the individual thin slices will be limited to a subset of the available discrete values. To reduce loss of precision within the thin slice, one can store for each variable arrays of scale-offset pairs, with size corresponding to the thick dimensions only. This is the key innovation of the layer-packing technique (or LAY for brevity). Layer-packing, as discussed here, was implemented via the following algorithm (described here in pseudocode):

    for var in vars do
        if (var in splitVars) and intersection(dimensions[var], splitDims) is not None then
            theseSplitDims = intersection(dimensions[var], splitDims)
            theseDimLens = lengths of theseSplitDims
            iSplitDim = indices of theseSplitDims in dimensions[var]
            splitDimIdxs = all possible index combinations for theseSplitDims
            nSplits = number of combinations given in splitDimIdxs
            nDim = length of dimensions[var]
            indices = list of length nDim, with each element equal to Ellipsis
            data = the data array
            packed = array of zeros, type is uint16, with the same shape as data
            add_offset = array of zeros, type is float, with dimensions given by theseDimLens
            scale_factor = same as add_offset
            for iSplit in range(0, nSplits) do
                indices[iSplitDim] = splitDimIdxs[iSplit]
                thinSlice = data[indices]
                vMin = minimum value of thinSlice
                vMax = maximum value of thinSlice
                add_offset[splitDimIdxs[iSplit]] = vMin
                scale_factor[splitDimIdxs[iSplit]] = (vMax − vMin) / maxPackedValue
                packed[indices] = uint16((thinSlice − vMin) / scale_factor[splitDimIdxs[iSplit]])
            end for
            write packed, add_offset, scale_factor to file
        else
            write the data array to file
        end if
    end for

where vars are the variables in the file, splitVars are those variables that should be layer-packed, and splitDims are the thick dimensions for layer-packing.
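A minimal numpy sketch of the same idea, for the simple case of a single thick dimension (axis 0), is given below. It illustrates the per-slice scale and offset arrays only; it is not the released tool, which handles arbitrary thick dimensions, missing values and netCDF I/O:

    import numpy as np

    MAX_PACKED_VALUE = 2**16 - 1

    def layer_pack(data, axis=0):
        # Pack each slice along `axis` to uint16 with its own scale/offset pair.
        work = np.moveaxis(np.asarray(data, dtype=np.float32), axis, 0)
        packed = np.zeros(work.shape, dtype=np.uint16)
        add_offset = np.zeros(work.shape[0], dtype=np.float32)
        scale_factor = np.ones(work.shape[0], dtype=np.float32)
        for k in range(work.shape[0]):
            vmin, vmax = float(work[k].min()), float(work[k].max())
            add_offset[k] = vmin
            if vmax > vmin:
                scale_factor[k] = (vmax - vmin) / MAX_PACKED_VALUE
            packed[k] = np.round((work[k] - vmin) / scale_factor[k]).astype(np.uint16)
        return np.moveaxis(packed, 0, axis), scale_factor, add_offset

    def layer_unpack(packed, scale_factor, add_offset, axis=0):
        # Reverse transformation, broadcasting the per-layer parameters.
        work = np.moveaxis(packed, axis, 0).astype(np.float32)
        shape = (-1,) + (1,) * (work.ndim - 1)
        out = work * scale_factor.reshape(shape) + add_offset.reshape(shape)
        return np.moveaxis(out, 0, axis)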

The 2-byte representation halves the storage cost of the array itself. However, arrays of scale factors and linear offsets must also be stored, and this adds to the total space required (NB these are often negligible relative to the size of the packed data array). Compression is generally significantly improved by applying DEFLATE, and this was done for all data sets presented here.

Further details about the implementation and the storage format are given in the Supplement. The code to perform the layer-packing described in this article was written as stand-alone command-line tools in Python (v. 2.7.6). These are freely available at https://github.com/JeremySilver/layerpack. Beyond the standard Python installation, it requires that the numpy and netCDF4 modules are installed.

2.1.4 Bit-grooming

One can store the data at a fixed precision (i.e., a chosen number of significant digits, or NSD). This method is known as “bit-grooming” and is detailed by Zender (2016) and implemented in the NCO package (Zender, 2008). If bit-groomed data remain uncompressed in floating-point format, this coarsening will not affect the file size. However, the application of the deflate/shuffle algorithms will in general improve the compressibility of the coarsened data, as they will be represented by a smaller number of discrete values. For further explanation, we must briefly summarize how floating-point numbers are represented by computers.

Single-precision floating-point numbers occupy 32 bits within memory. Floating-point numbers are represented as the product of a sign, significand and an exponential term:

    datum = sign · significand · base^exponent.

The IEEE standard specifies that the sign accounts for 1 bit, the significand (also known as the mantissa) 23 bits and the exponent 8 bits. The base is 2 by convention. The sign is an integer from the set {−1, 1}, the significand is a real number in the range [0.0, 1.0) and the exponent is an integer between −128 and +127; see, for example, Goldberg (1991) for further details.

Bit-grooming quantizes¹ data to a fixed number of significant digits (NSD) using bitmasks, not floating-point arithmetic. The NSD bitmasks alter the IEEE floating-point mantissa by changing to 1 (bit setting) or 0 (bit shaving) the least significant bits that are superfluous to the specified precision. Sequential values of the data are alternately shaved and set, which nearly eliminates any mean bias due to quantization (Zender, 2016). To guarantee preserving 1–6 digits of precision, bit-grooming retains 5, 8, 11, 15, 18 and 21 explicit mantissa bits, respectively, and retains all exponent bits.

In the following we compared storing 2, 3, 4 and 5 significant digits; these are denoted NSD2, NSD3, NSD4 and NSD5, respectively. Similar to LIN and LAY, DEFLATE was applied together with the rounding. In this study, compression via bit-grooming was performed using the ncks tool within the NCO package (Zender, 2008).
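For intuition, a simplified numpy sketch of the bitmask idea is shown below. It is an illustration only; the NCO implementation (Zender, 2016) differs in detail, for example in its treatment of special values and of values that are already exact:

    import numpy as np

    # Explicit mantissa bits needed to guarantee 1-6 significant digits
    # (the values quoted in the text).
    NSD_TO_MANTISSA_BITS = {1: 5, 2: 8, 3: 11, 4: 15, 5: 18, 6: 21}

    def bit_groom(data, nsd):
        # Quantize a float32 array to `nsd` significant digits by alternately
        # shaving (zeroing) and setting (to 1) the superfluous mantissa bits.
        drop = 23 - NSD_TO_MANTISSA_BITS[nsd]            # mantissa bits to discard
        bits = np.ascontiguousarray(data, dtype=np.float32).view(np.uint32).copy()
        shave_mask = np.uint32(0xFFFFFFFF ^ ((1 << drop) - 1))  # keep the high bits
        set_mask = np.uint32((1 << drop) - 1)                   # turn on the low bits
        flat = bits.ravel()
        flat[0::2] &= shave_mask   # even-indexed values: bit shaving
        flat[1::2] |= set_mask     # odd-indexed values: bit setting
        return bits.view(np.float32)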

2.2 Data sets

In the following tests, we compared a total of 255 variables from six data sets. Each variable was extracted individually to file as uncompressed netCDF, and the file was then compressed using the methods described, allowing for computation of the compression and error metrics described below. The data sets are summarized in Table 1. Further details are provided in the Supplement.

The variables chosen from these data sets were those with the largest number of data points overall. For example, in data sets 2–5, variables without a vertical coordinate were not considered in the analysis, since these account for only a small fraction of the total data. A small number of the variables that would otherwise have been included (based on the dimensions alone) were excluded due to the occurrence of seemingly random data (i.e., of all magnitudes) in regions of the array where values were not defined (in the sense of sea-surface temperatures over land points), which we believe should have been masked with a fill value. The rationale for excluding these variables is, first, that these regions did not appear to contain meaningful data and, second, that the extreme range of the seemingly random data led to gross outliers in the distribution of error statistics for LIN and LAY in particular.

¹ The process of quantization means mapping, in this case via a process similar to rounding, from a large set of possible inputs (in this case the full set of real numbers representable as floating-point values) to a smaller set (in this case those floating-point values defined to a chosen NSD).

2.3 Error and compression metrics

The methods are compared with two metrics. The first relates to the compression efficiency. Compression ratios are defined as (uncompressed size) / (compressed size); as such, larger values indicate greater compression. The second metric relates to the accuracy (or, seen another way, the error) of the compressed data relative to the original data. The error of DEFLATE is zero since it ensures lossless compression. The remaining methods cause some loss of precision.

Error was quantified by the root-mean-square difference between the original and the compressed variables. However, in order to compare results across variables with different scales and units, the errors must be normalized somehow. We considered four different normalization methods, which emphasized different aspects of the error profile. The errors were normalized either by the standard deviation or the mean of the original data, and these were calculated either separately per thin slice or across the entire variable – the rationale is as follows.

When normalizing by the mean (of the entire variable, or of the thin slice), variables with a low mean-to-standard-deviation ratio (e.g., potential vorticity) will show larger errors using the layer-packing compression. However, when normalizing by the standard deviation, variables with a high mean-to-standard-deviation ratio will show larger errors using the bit-grooming compression (e.g., atmospheric temperatures stored with units of kelvin, concentrations of well-mixed atmospheric trace gases such as CO2 or CH4).

If we calculate the ratio of the RMSE to the normalization factor (i.e., the mean or standard deviation) per thin slice and then average the normalized errors across slices, the resulting metric will be more sensitive to large relative errors within subsections of the data array. The alternative is to calculate the normalization factor across the whole variable, and the resulting metric will be more reflective of relative errors across the entire data array. This may be understood in the context of a hypothetical three-dimensional array, with values ranging from O(10^4) to O(10^0) across the vertical dimension, a mean value of O(10^3), and errors roughly uniform of O(10^−1). If relative errors (normalizing by the mean) are calculated for the whole array, then they will be O(10^−4), whereas if calculated across each thin slice separately they will range from O(10^−1) to O(10^−5) and may have a mean of O(10^−2). The case of uniform errors is most likely to arise for LIN, whereas bit-grooming guarantees precision for each individual datum and layer-packing for each thin slice.
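The two normalization strategies can be written compactly. The sketch below is our illustration of the metrics just described (the thick dimension is assumed to be axis 0; the function name and arguments are ours, not the paper's code): it computes the RMSE normalized either per thin slice and then averaged, or over the whole variable.

    import numpy as np

    def normalized_rmse(original, compressed, per_slice=True, norm="sd", axis=0):
        # RMSE between original and compressed values, normalized by the standard
        # deviation ("sd") or the mean ("mean"), either per thin slice (and then
        # averaged over slices) or across the entire variable.
        orig = np.asarray(original, dtype=np.float64)
        err = orig - np.asarray(compressed, dtype=np.float64)
        if per_slice:
            other = tuple(i for i in range(orig.ndim) if i != axis)
            rmse = np.sqrt(np.mean(err**2, axis=other))
            base = orig.std(axis=other) if norm == "sd" else orig.mean(axis=other)
            return float(np.mean(rmse / base))
        rmse = np.sqrt(np.mean(err**2))
        base = orig.std() if norm == "sd" else orig.mean()
        return float(rmse / base)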

2.4 Complexity statistics

In order to make sense of which variables compress well or poorly with different methods, a range of statistics were calculated for each variable. Most of these statistics were calculated over two-dimensional hyperslices of the original data arrays, and then the value for an individual variable was taken as the average over these hyperslices. A full list of the statistics calculated is given in the Supplement.

Table 1. Summary of the data sets used in this study. Abbreviations: ID – index, NWP – numerical weather prediction, CTM – chemistry-transport model, Dims – dimensions of the variable, TS Dims – dimensions of the thin slice, # vars – number of variables per data set. The dimension sizes are indicated as nx – length of the east–west dimension, ny – length of the north–south dimension, nz – length of the vertical dimension, nt – length of the time dimension and nx′ – length of the generalized horizontal coordinate dimension (used for the unstructured grid in the last data set only).

ID | Description | Grid type | Dims | TS Dims | # vars
1 | Global 3-D NWP reanalyses | Rectangular | (nx, ny, nz, nt) = 240 × 121 × 37 × 16 | (nx, ny) = 240 × 121 | 14
2 | 3-D CTM output | Rectangular | (nx, ny, nz, nt) = 9 × 10 × 56 × 172 | (nx, ny) = 9 × 10 | 77
3 | 3-D NWP model output | Rectangular | (nx, ny, nz) = 165 × 140 × 32 | (nx, ny) = 165 × 140 | 20
4 | Global 3-D NWP reanalyses | Rectangular | (nx, ny, nz, nt) = 288 × 144 × 42 × 8 | (nx, ny) = 288 × 144 | 11
5 | Dust transport–dispersion model | Rectangular | (nx, ny, nz) = 192 × 94 × 28 | (nx, ny) = 192 × 94 | 15
6 | 3-D coupled NWP-CTM output | Irregular | (nx′, nz) = 48 602 × 30 | nx′ = 48 602 | 118

The two most informative statistics that arose from this analysis were based on the entropy of either the original data field or the corresponding exponent field (i.e., decomposing the data array into significand and exponent, and then calculating the entropy of the exponent array). The entropy is a measure of statistical dispersion, based on the frequency with which each value appears in each data set. Let us denote by P(x_i) the proportion of the array occupied by each unique value x_i. For an array containing the k discrete values X = {x_i, i = 1, ..., k}, the entropy was defined as

    H(X) = E[−log_2(P(X))] = −∑_{i=1}^{k} P(x_i) log_2(P(x_i)).   (1)

For an array of K elements, the entropy has a maximum value of log_2(K), which will arise if all values are unique. For single-precision arrays of size 2^32 ≈ 4.29 × 10^9 or larger, the maximum entropy is equal to 32.

In order to normalize for these limitations to the entropy of a finite data set, in each case the entropy was normalized by the maximum theoretical value attainable for that data set, which was taken to be log_2(K′), where K′ is the number of elements in the thin slice (in each case K′ ≪ 2^32). In the case of the entropy of the exponent array, the normalization was based on min(log_2(K′), 8), since the maximum entropy of an 8-bit field (recall that 8 bits are used for the exponent of a floating point number) is 8.
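As an illustration, the normalized entropies used here can be estimated per hyperslice along the lines of the following sketch (np.frexp is used as one convenient way to obtain an exponent field; it differs from the raw IEEE exponent by a constant offset, which does not change the entropy of normal values):

    import numpy as np

    def normalized_entropy(values, max_bits=None):
        # Shannon entropy (base 2) of the discrete values in `values`, normalized
        # by the maximum attainable for this sample, log2(K'), optionally capped
        # (e.g. at 8 bits for the exponent field).
        vals = np.asarray(values).ravel()
        _, counts = np.unique(vals, return_counts=True)
        p = counts / counts.sum()
        entropy = -np.sum(p * np.log2(p))
        max_entropy = np.log2(vals.size)
        if max_bits is not None:
            max_entropy = min(max_entropy, max_bits)
        return float(entropy / max_entropy)

    def entropy_statistics(thin_slice):
        # Normalized entropy of a thin slice and of its exponent field.
        _, exponent = np.frexp(np.asarray(thin_slice, dtype=np.float32))
        return normalized_entropy(thin_slice), normalized_entropy(exponent, max_bits=8)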

It was found that some of the data sets compressed significantly using DEFLATE only. This was often due to a high proportion of zero or “missing” values. Variables were classified as either “sparse” (highly compressible or otherwise relatively simple) or “dense” (all other variables). Sparse variables were chosen to be those satisfying any one of the following conditions: the compression ratio is greater than 5.0 using DEFLATE, the fraction of values equal to the most common value in the entire variable is greater than 0.2, or the fraction of hyperslices where all values were identical is greater than 0.2. These definitions were somewhat arbitrary and other classifications may be preferable, but it is seen (e.g., in Figs. 1c and 3a, b) that sparse variables do not always follow the same pattern as dense variables. Of the 255 variables in total, 181 were classified as dense. The breakdown among the different categories is given in Table S1 in the Supplement.

2.5 Compression and error results

Figure 1 shows the distribution of compression ratios, normalized errors and timing statistics for the different methods. In the case of the compression ratios and normalized errors, results are presented separately for the dense and sparse variables. For dense variables, the median compression ratios were 1.3 (DEFLATE), 3.2 (NSD2), 2.4 (NSD3), 2.0 (NSD4), 1.6 (NSD5), 4.2 (LIN) and 2.6 (LAY). For sparse variables, the median compression ratios were 2.0 (DEFLATE), 4.3 (NSD2), 3.3 (NSD3), 2.8 (NSD4), 2.3 (NSD5), 7.4 (LIN) and 5.2 (LAY). It can be seen that LIN gave the greatest compression, and the LAY compression ratios were comparable with those of NSD2 or NSD3.

The median compression times (Fig. 1b), normalized relative to the DEFLATE compression time, were 0.82 (NSD2), 0.91 (NSD3), 0.79 (NSD4), 0.91 (NSD5), 0.57 (LIN), 3.45 (LAY compression) and 1.84 (LAY extraction). Differences between the bit-grooming methods were relatively small and slightly faster than DEFLATE alone, whereas the LIN compression was nearly twice as fast as DEFLATE. These values are consistent with DEFLATE compressing twice as much data for bit-grooming as for LIN, which store 4 and 2 bytes per value, respectively. The LAY times (both compression and decompression) were significantly slower than for the other methods, particularly for compression; this is most likely due to differences in implementation, as this implementation of LAY was programmed in Python while the other methods used compiled C/C++ utilities. We believe that the overhead from loading some of the Python libraries used in the implementation of LAY may cause this method to be relatively slower for smaller files; we note that most of the variables considered are relatively small, with uncompressed file sizes ranging from 1.9 to 65.6 MB. This hypothesis was supported by a test case in which the suite of compression methods was applied to a much larger array² of size 1.5 GB: the compression times were 103 s (DEFLATE), 89 s (NSD2), 107 s (NSD3), 97 s (NSD4), 109 s (NSD5), 69 s (LIN) and 70 s (LAY), while the unpacking time for LAY was 133 s.

² The ERA-Interim (Dee et al., 2011) east–west wind component on a grid with 241 grid points in the latitude dimension, 480 grid points in the longitude dimension, 60 vertical levels and 124 times, spanning from 1 July 2015 00:00 UTC to 31 July 2015 18:00 UTC at 6-hourly intervals, converted from its original GRIB format. This data set can be accessed through the ECMWF’s public data set portal (http://apps.ecmwf.int/datasets/).

Figure 1. (a) Distribution of compression ratios (original file size divided by compressed file size) measured for seven methods, applied to six test data sets (higher is better), plotted separately for dense variables and sparse variables (white and grey boxes, respectively). The box plots (in panels a, b and c) are defined as follows: the thick line at the center of the box denotes the median; the bottom and top of the box show q0.25 and q0.75, i.e., the 0.25 and 0.75 quantiles of the distribution; the whiskers extend from q0.25 − 1.5 · IQR to q0.75 + 1.5 · IQR, where IQR is the inter-quartile range (q0.75 − q0.25); and the points shown are all outliers beyond this range. (b) Distribution of scaled compression/decompression times for each method (lower is better), with LAY represented twice (for compression and extraction). These times are normalized by the compression time from DEFLATE. (c) Distribution of errors, normalized by the per-layer standard deviation. (d) The achieved lossless compression ratios (i.e., from DEFLATE) as a function of the normalized entropy (1.0 corresponding to the theoretical maximum for the data).

For all methods considered except DEFLATE, the compression comes at the expense of precision; the distribution of the resultant errors is shown in Fig. 1c (and Fig. S1 of the Supplement). For dense variables, the median relative errors shown in Fig. 1c are 1.8 × 10^−3 (NSD2), 2.3 × 10^−4 (NSD3), 1.4 × 10^−5 (NSD4), 1.8 × 10^−6 (NSD5), 5.3 × 10^−2 (LIN) and 1.2 × 10^−4 (LAY). Unsurprisingly, with each additional significant digit of precision requested of bit-grooming, the normalized errors fall by a factor of 10. The errors of LAY are comparable to those of NSD3 or NSD4, while for the metric shown in Fig. 1c (RMSE/SD, calculated separately per thin slice, then averaging the ratios), LIN displays much larger errors than the other methods – the errors for LIN are, more than for the other lossy compression methods in this comparison, sensitive to the error metric used, and this is discussed below. Dense variables show some differences in the distribution of errors compared to sparse variables: errors normalized by the standard deviation appear smaller, and the errors normalized by the mean appear larger. This is because many of the sparse variables are zero at most points, and in many cases this tends to reduce the mean and increase the standard deviation (compared to examining only the non-zero values).

The choice of error metric is ambiguous and leads to slightly different results. Four different normalization factors for the RMSE were considered (shown in Fig. S1). It can be seen that the comments about the bit-grooming and LAY methods in the preceding paragraph hold regardless of the normalization method, whereas LIN shows much higher standardized errors if errors are normalized within thin slices and then averaged (for the reasons explained in the last part of Sect. 2.3). Bit-grooming keeps precision loss to known bounds for each individual datum, LAY leads to roughly constant errors within thin slices, and LIN results in roughly constant errors for the whole variable. This last point is illustrated in Fig. S4 (especially panels a, b, c and e) of the Supplement, which shows horizontal profiles of the RMSE, mean and standard deviation for a sample of six variables from among the 255 considered; Fig. S5 shows the corresponding relative errors.

The pairwise relationship between compression and error is shown in Fig. 2, with the distribution shown based only on dense variables; the error metric used in this figure is the average (across layers) of the RMSE normalized by the per-layer standard deviation, which tends to highlight large relative errors in certain sections of the data array (most notably for LIN). We see that the bit-grooming and LAY methods form something of a continuum, with LAY falling between NSD3 and NSD4. The linear slope on a log–log plot is suggestive of a power-law relationship, which would be consistent with fundamental constraints on compression potential from rate–distortion theory (Berger, 2003).

The question of when LAY is preferable to NSD3 or NSD4 can be addressed with reference to the complexity statistics. Among the complexity metrics considered, the normalized entropy of the data field proved to be the best predictor of compression in the lossless case (DEFLATE). By contrast, the best predictor of the performance of LAY relative to the two bit-grooming methods was the normalized entropy of the exponent field (NEEF). By “best predictor”, we mean that these were, respectively, the most strongly correlated, among the metrics considered, with the DEFLATE compression ratios and with the relative error or compressed file size. In the case of lossless compression, the correlation of the log of the compression ratio with the normalized entropy was over 0.9 (the next highest correlation was below 0.8; all variables were included), while for differentiating bit-grooming and LAY, the absolute correlations between the NEEF and the log of the file size ratio or the log of the RMSE ratio shown in Fig. 3 were both over 0.8 (cf. the next highest correlations were around 0.6; only dense variables were included).

Figure 2. The relationship between normalized errors and compression ratio for the lossy compression methods considered. The three contours for each method show the bounds within which the two-dimensional kernel-smoothed distribution integrates to 0.25, 0.5 and 0.75, respectively. Only dense variables were used to produce this plot. The normalized errors (the y-axis coordinate) were measured per layer and then averaged.

The trade-off between error and compression is evident: as the NEEF increases, the bit-groomed file sizes become larger than the corresponding LAY file sizes, while the errors resulting from LAY grow relative to those of bit-grooming (Fig. 3). The LAY file sizes were larger than those of NSD3 or NSD4 when the NEEF was greater than 0.25 or 0.1, respectively (Fig. 3, upper row). Errors of LAY were generally less than those of NSD3, while the errors for NSD4 were smaller than those of LAY for values of the NEEF greater than 0.15 (Fig. 3, lower row). In most cases the choice between LAY and NSD3 or NSD4 implies a choice between smaller file sizes or smaller errors (an exception being when the NEEF is greater than 0.25, and LAY yields both smaller files and lower errors than NSD3).

The reason that the NEEF differentiates the relative performance between bit-grooming and LAY can be understood by the nature of the errors induced by the two techniques. Bit-grooming guarantees constant relative errors for each individual datum, whereas LAY results in errors that are roughly constant in absolute magnitude across a thin slice. Assuming that the variable is dense, if the data within each thin slice vary little (corresponding to a low NEEF), then LAY will achieve relatively small errors; since the compressed field contains more information, the resultant file size will be larger.

Figure 3. Relative performance of LAY compared with NSD3 (left column) or NSD4 (right column), in terms of compression (top row) and errors (bottom row). The grey dashed line indicates values of 1.0 on the y axis. The LOWESS fit was based on the dense variables alone.

One interesting aspect of these results is that the entropy is defined independently of the values in the distribution (Fig. 1), based solely on the frequencies of values in the array, yet the NEEF proved to be more informative regarding relative performance than other metrics that accounted for range in magnitude (e.g., the range or standard deviation of the exponent field, the logarithm of the largest non-zero value divided by the smallest non-zero value). This is likely due to the fact that the RMSE summarizes the error distribution, weighting both by size and frequency, and the entropy is a measure of statistical dispersion.

The clear relationship between the normalized entropy of the data field (NEDF) and the compression ratios achieved by DEFLATE alone (Fig. 1d) leads to the question of whether the NEDF is predictive of the compression ratios achieved by the lossy methods. The NEDF is indeed highly predictive of the compression ratio of the reduced-precision fields as well (Fig. S2), with the absolute correlation exceeding 0.85 in each case. This may be expected, given the already clear relationship between these two parameters. Furthermore, the reduction in the NEDF is seen to be predictive of the “compression improvement”, which we define as the DEFLATE-compressed file size divided by the file size achieved by the lossy compression method; this relationship is most apparent for those methods with the highest compression ratios, namely LIN, LAY and NSD2 (see Fig. S3). The linear relationship, however, offers only a partial explanation (the correlation is over 0.8 for LIN and LAY, and over 0.6 for NSD2 and NSD3). However, it appears to explain most of the variation when the lossy method achieves 2-fold or greater compression relative to that obtained by DEFLATE.


3 Discussion

Layer-packing was compared with scalar linear packing, bit-grooming and lossless compression via the DEFLATE algorithm. The lossy methods form a continuum when one compares the resultant compression ratios and normalized errors. The trade-off between error and compression has been shown elsewhere (e.g., Berger, 2003). Despite the fact that LAY and LIN represented the data as 2-byte integer arrays, which occupy half the storage of 4-byte floating point numbers (ignoring the relatively minor contribution to LAY from the much smaller accompanying scale and offset arrays), it can be seen that both methods effectively fit into the continuum spanned by bit-grooming, which represents data as 4-byte floats (Fig. 2).

In this study, we have effectively separated the precision reduction from the compression itself. This is because the bit-grooming, LAY and LIN methods can be thought of as preconditioners for the same lossless compression algorithm (namely DEFLATE). The reduction in entropy due to these preconditioners explains a large part of the improved compression above lossless compression. This concept could be extended to develop methods for automatically determining the right precision to retain in a data set.

This study neither extended to the comparison with other lossless filters for compression of the precision-reduced fields, nor did it compare the results with other lossy compression techniques. While it is possible that the findings presented here extend beyond the deflate and shuffle technique, other lossy and non-lossy compression algorithms operate in fundamentally different ways. Such an extension to this study may be considered in future, for example, by taking advantage of the fact that the HDF5 API allows a range of alternative compression filters to be loaded dynamically (HDF5 Group, 2016) to provide faster or more efficient compression than the default algorithms.

A number of such compression filters (both lossy and lossless) have been developed specifically for geophysical data sets, which have different properties compared to plain text, for example. They tend to be multi-dimensional, stored as floating-point numbers and in many cases are relatively smooth (Hübbe et al., 2013). Specialized compression algorithms exist for such data sets that rely on the smoothness properties to confer a certain degree of predictability, which reduces the number of effective degrees of freedom in the data set. For example, Hübbe et al. (2013) present a lossless algorithm termed MAFISC (Multidimensional Adaptive Filtering Improved Scientific data Compression) that incorporates a series of filters (some of which are adaptive), which in the cases presented gave stronger compression than the other lossless algorithms under consideration. In a similar spirit, Di and Cappello (2016) use a lossy, adaptive, curve-fitting technique that gives highly competitive compression performance while simultaneously constraining relative and/or absolute errors (as defined by the user).

Other authors have shown impressive data-compression rates using methods originally developed for image processing. For example, the GRIB2 format allows for compression using the JPEG-2000 algorithm and format (based on wavelet transforms) to store numerical fields (Skodras et al., 2001). Woodring et al. (2011) describe a workflow based around JPEG-2000 compression in order to overcome bandwidth-limited connections while quantifying the ensuing reduction in precision. They demonstrate the same compression–error trade-off as illustrated in this work. Robinson et al. (2016) compare lossy compression via JPEG-2000, PNG (another graphics standard) and three video codecs; the video codecs showed very high relative errors (in the order of 0.1–1.0), but also very high compression rates (around 300-fold compression).

A number of studies have assessed a range of compression methods on a variety of data sets (e.g., Baker et al., 2014; Di and Cappello, 2016). They show, amongst other things, that no one method provides the most effective compression for all data sets considered. Also apparent is that lossy methods tend to result in higher compression ratios than lossless techniques, and that methods designed specifically for scientific data (e.g., exploiting smoothness when it is present) are often highly competitive.

Layer-packing achieves compression and error results roughly in between bit-grooming storing three or four significant figures. Layer-packing and bit-grooming control error in different ways, with bit-grooming guaranteeing fixed relative errors for every individual datum, while layer-packing results in roughly constant relative errors within the hyperslice across which the packing is applied.

The idea behind layer-packing is not itself new and forms the basis of compression within the GRIB format, in which each field is a two-dimensional array and compression is performed on fields individually. In one sense layer-packing is more general than that used in GRIB, in that thin slices are not restricted to being two-dimensional. Our preliminary results (not shown) indicated that the JPEG2000 algorithm yields greater compression than the methods presented here for the same level of error; this echoes the findings of Caron (2014), which describe the efficient compression achieved. However, like scalar linear packing, JPEG2000 does not offer clear control over the resultant errors, and thus some experimentation (in setting the number of bits per value) is needed to avoid excessive loss of precision (Woodring et al., 2011). More limiting, perhaps, are the technical procedures required to convert generic netCDF data into GRIB2 format, including meeting constraints on variable and dimension names. The GRIB format only allows storing metadata that match predefined tables, and it is thus nowhere near as general or self-describing as HDF5 (or its derivatives such as netCDF-4). This, combined with the software requirements beyond netCDF (e.g., the JPEG library) for decompression, renders GRIB compression unattractive for general purpose usage. Outside of organizations with strong technical support for modeling operations (e.g., operational weather prediction centers), its suitability may be limited to special purposes and expert users.

Although the methods described here are used only on netCDF data files, they apply equally to other data formats (e.g., HDF4, HDF5). These results are presented as a proof of concept only.

The timing information is presented mainly for completeness, and a caveat should be raised. As noted above, the compression via DEFLATE, bit-grooming and LIN was performed using tools from the NCO bundle (written in C and C++), whereas the LAY compression was implemented in Python. The code is presented as a demonstration of layer-packing rather than production code.

Both scalar linear packing and layer-packing use the same representation of the data (i.e., 2-byte integers). However, large differences in the compression and relative errors were found. This is because the compressibility of a packed field is related to the distribution of values within the scaling range. Across a given thin slice, layer-packing will represent values using the full range of 2-byte integers, whereas scalar linear packing will typically use a smaller portion of that range. This increases the loss of precision resulting from scalar linear packing but also entails greater compression.

When considering which compression method to use for an individual data set, one needs to consider several factors. First, space constraints and the size of the data sets in question will vary considerably for different applications. Second, the degree of precision required will also depend on the application and may differ between variables within a data set. The bit-grooming (storing at least three significant digits) and layer-packing techniques achieved average normalized errors of 0.05 % or better, which in many geoscientific applications is much less than the model or measurement errors. Third, data sets vary considerably in their inherent compressibility, which in the cases considered appeared strongly related to the normalized entropy of the data array. Fourth, how data are stored should also reflect how they will be used (e.g., active use versus archiving). A major disadvantage of layer-packing as described here is that it is essentially an archive format and needs to be unpacked by a custom application before it can be easily interpreted. Scalar linear packing is similarly dependent on unpacking (although many netCDF readers will automatically unpack such data, from 2-byte integers to 4-byte floating point, by default), whereas bit-grooming requires no additional software. Finally, other issues relating to portability, the availability of libraries and consistency within a community also play a role in determining the most appropriate storage format.

4 Conclusions

This paper considers layer-packing, scalar linear packing and bit-grooming as a basis for compressing large gridded data sets. When viewed in terms of the compression–error trade-off, layer-packing was found to fit within the continuum of bit-grooming (i.e., when varying the number of significant digits to store), roughly in between storing three and four significant digits. The relative performance of layer-packing and bit-grooming was strongly related to the normalized entropy of the exponent array and again highlighted the trade-off between compression and errors. Given the variation in compression and accuracy achieved for the different data sets considered, the results highlight the importance of testing compression methods on a realistic sample of the data.

If space is limited and a large data set must be stored, then we recommend that the standard deflate and shuffle methods be applied. If this does not save enough space, then careful thought should be given to precisely which variables and which subsets of individual variables will be required in the future; it often arises that, despite a wealth of model output, only a limited portion will ever be examined. Many tools exist for subsetting such data sets. Beyond this, if further savings are required and if the data need not be stored in full precision, then the appropriate relative precision for each variable should be selected and applied via bit-grooming. Layer-packing should be considered when choosing a compression technique for specialist archive applications.

5 Code availability

The Python-based command line utilities used for the layer-packing, unpacking and relative-error analysis are freely available at https://github.com/JeremySilver/layerpack.

6 Data availability

Of the data sets used in the tests (listed in Sect. 2.2), data sets 2–6 are available online at https://figshare.com/projects/Layer_Packing_Tests/14480 along with a brief description of each file. Data set 1 could not be distributed with the other files due to licensing restrictions but can be accessed through the ECMWF’s public data set portal (http://apps.ecmwf.int/datasets/), using the following set of inputs: stream = synoptic monthly means, vertical levels = pressure levels (all 37 layers), parameters = all 14 variables, data set = interim_mnth, step = 0, version = 1, time = 00:00:00, 06:00:00, 12:00:00, 18:00:00, date = 20080901 to 20081201, grid = 1.5° × 1.5°, type = analysis, class = ERA Interim.

The Supplement related to this article is available online at doi:10.5194/gmd-10-413-2017-supplement.


Author contributions. Jeremy D. Silver wrote the layer-packing Python software, performed the compression experiments and wrote most of the manuscript. Charles S. Zender contributed to the design of the study, provided some of the test data sets used in the experiments and contributed to the text.

Competing interests. The authors declare that they have no conflict of interest.

Acknowledgements. The work of Jeremy D. Silver was funded by the University of Melbourne’s McKenzie Postdoctoral Fellowship program. The work of Charles S. Zender was funded by NASA ACCESS NNX12AF48A and NNX14AH55A and by DOE ACME DE-SC0012998. We thank Peter J. Rayner (University of Melbourne) for useful discussions. Three anonymous reviewers provided constructive comments and suggestions on the manuscript.

Edited by: P. Ullrich
Reviewed by: three anonymous referees

References

Baker, A. H., Xu, H., Dennis, J. M., Levy, M. N., Nychka, D., Mickelson, S. A., Edwards, J., Vertenstein, M., and Wegener, A.: A methodology for evaluating the impact of data compression on climate simulation data, in: Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, Vancouver, BC, Canada, 23–27 June 2014, ACM, 203–214, 2014.

Berger, T.: Rate-Distortion Theory, in: Wiley Encyclopedia of Telecommunications, John Wiley & Sons, Inc., doi:10.1002/0471219282.eot142, 2003.

Caron, J.: Converting GRIB to netCDF-4: Compression studies, presentation to the workshop “Closing the GRIB/netCDF gap”, European Centre for Medium-Range Weather Forecasts (ECMWF), 24–25 September 2014, Reading, UK, available at: http://www.ecmwf.int/sites/default/files/elibrary/2014/13711-converting-grib-netcdf-4.pdf, last access: 17 June 2016, 2014.

Dee, D., Uppala, S., Simmons, A., Berrisford, P., Poli, P., Kobayashi, S., Andrae, U., Balmaseda, M., Balsamo, G., Bauer, P., and Bechtold, P.: The ERA-Interim reanalysis: Configuration and performance of the data assimilation system, Q. J. Roy. Meteor. Soc., 137, 553–597, 2011.

Deutsch, L. P.: DEFLATE compressed data format specification version 1.3, Tech. Rep. IETF RFC1951, Internet Engineering Task Force, 2008.

Di, S. and Cappello, F.: Fast error-bounded lossy HPC data compression with SZ, in: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 23–27 May 2016, Chicago, IL, USA, doi:10.1109/IPDPS.2016.11, 2016.

Goldberg, D.: What every computer scientist should know about floating-point arithmetic, ACM Comput. Surv., 23, 5–48, 1991.

HDF5 Group: Performance evaluation report: gzip, bzip2 compression with and without shuffling algorithm, available at: https://support.hdfgroup.org/HDF5/doc_resource/H5Shuffle_Perf.pdf, last access: 17 June 2016, 2002.

HDF5 Group: Filters, available at: www.hdfgroup.org/services/filters.html, last access: 17 June 2016.

Hübbe, N., Wegener, A., Kunkel, J. M., Ling, Y., and Ludwig, T.: Evaluating lossy compression on climate data, in: International Supercomputing Conference, Springer, 343–356, 2013.

Huffman, D.: A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the IRE, 40, 1098–1101, doi:10.1109/JRPROC.1952.273898, 1952.

Robinson, N. H., Prudden, R., and Arribas, A. A.: A practical approach to spatiotemporal data compression, arXiv preprint, arXiv:1604.03688, 2016.

Skodras, A., Christopoulos, C., and Ebrahimi, T.: The JPEG 2000 still image compression standard, IEEE Signal Proc. Mag., 18, 36–58, 2001.

Unidata: The netCDF User’s Guide, Tech. rep., Unidata, UCAR, Boulder, CO, USA, available at: www.unidata.ucar.edu/software/netcdf/docs/user_guide.html, last access: 17 June 2016.

Woodring, J., Mniszewski, S., Brislawn, C., DeMarle, D., and Ahrens, J.: Revisiting wavelet compression for large-scale climate data using JPEG 2000 and ensuring data precision, in: IEEE Symposium on Large Data Analysis and Visualization, IEEE, 31–38, 2011.

Zender, C. S.: Analysis of self-describing gridded geoscience data with netCDF Operators (NCO), Environ. Modell. Softw., 23, 1338–1342, 2008.

Zender, C. S.: Bit Grooming: statistically accurate precision-preserving quantization with compression, evaluated in the netCDF Operators (NCO, v4.4.8+), Geosci. Model Dev., 9, 3199–3211, doi:10.5194/gmd-9-3199-2016, 2016.

Ziv, J. and Lempel, A.: A universal algorithm for sequential data compression, IEEE T. Inform. Theory, 23, 337–343, 1977.

Ziv, J. and Lempel, A.: Compression of individual sequences via variable-rate coding, IEEE T. Inform. Theory, 24, 530–536, doi:10.1109/TIT.1978.1055934, 1978.
