Data Compression for PDS4 Lisa Gaddis, Sue LaVoie, Jeff Anderson, Elizabeth Rye PDS Imaging Node March 26, 2010


Data Compression for PDS4

Lisa Gaddis, Sue LaVoie, Jeff Anderson, Elizabeth Rye

PDS Imaging Node

March 26, 2010


Syntax

• Data Compression
  – Encodes information using fewer bits
  – Reduces consumption of expensive resources
    • Data storage and/or transmission bandwidth
  – Requires decompression
  – Trade-offs
    • degree of compression
    • amount of ‘distortion’ introduced
    • computational resources required for decompression
• Image Compression
  – Application of data compression to digital images
  – Reduces redundancy in images to improve efficiency of storage and transmission
  – Lossless and lossy methods
  – Preserve image quality at a given bit- or compression-rate
• File Compression
  – Reduces redundancy at the file level
  – Many available tools
    • ZIP
    • GZIP
    • BZIP2
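As a small illustration of file-level lossless compression, the sketch below runs one byte string through Python's standard `zlib` (the DEFLATE method used by ZIP and GZIP), `gzip`, and `bz2` modules and checks that decompression restores the input exactly. The sample data and the helper name `compare_tools` are made up for illustration; real ratios depend entirely on the data.

```python
import bz2
import gzip
import zlib


def compare_tools(data: bytes) -> dict:
    """Compress `data` with several stdlib codecs; return {name: compressed size}."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
    }
    sizes = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        # Lossless: decompressing must give back the exact original bytes.
        assert decompress(packed) == data
        sizes[name] = len(packed)
    return sizes


# Hypothetical sample: a highly redundant byte string, not real PDS data.
sample = b"PDS4 image line " * 4096
sizes = compare_tools(sample)
for name, size in sizes.items():
    print(f"{name}: {len(sample)} -> {size} bytes ({len(sample) / size:.0f}x)")
```

The trade-offs listed above show up directly: each tool reaches a different size, and each needs its own decompressor to get the data back.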


Why image compression?

• Image compression for data providers and archivists
  – NASA missions deliver significant numbers of large image files
  – Need to support and/or reduce storage costs and data transmission times of images
  – Promotes exchange between different users and systems
  – Although falling in cost, storage is expensive for many TB of data and multiple copies
    • FY10: ~$750/TB for RAID storage with network infrastructure


Image Compression

• Lossless compression
  – Exploits data redundancy
  – Image can be recovered exactly
    • ‘Run-length encoding’ makes use of redundant patterns or ‘runs’
    • ‘LZW (Lempel-Ziv-Welch) encoding’ also addresses strings of characters; builds up a table of strings and their corresponding codes
    • ‘Huffman coding’ uses a binary encoding tree to represent commonly occurring values in few bits and less frequently occurring values in more bits
  – Best for documents, computer programs, line drawings, etc.
  – JPEG2000 has a lossless option, approved for use by PDS
• Lossy compression
  – Exploits data redundancy and ‘irrelevant’ data
  – Image data are not recovered exactly
    • JPEG
    • JPEG2000 (lossy)
  – Best for digital images, audio, video
  – Not approved for PDS archive data
    • Exceptions: Browse and some EDR images (e.g., Clementine UVVIS and NIR) are lossy JPEG images (average compression rate of 5.5)
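Two of the lossless ideas named above can be sketched in a few lines of Python. This is a generic textbook illustration, not any PDS decompression routine: `rle_encode`, `rle_decode`, and `huffman_code_lengths` are hypothetical helper names, and the `scanline` values are invented.

```python
import heapq
from collections import Counter
from itertools import groupby


def rle_encode(pixels):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def huffman_code_lengths(pixels):
    """Return {value: code length in bits} for a Huffman tree built from counts."""
    counts = Counter(pixels)
    if len(counts) == 1:
        return {value: 1 for value in counts}
    # Heap entries: (total count, unique tie-breaker, {value: depth so far}).
    heap = [(count, i, {value: 0}) for i, (value, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every value in them one level deeper.
        merged = {v: d + 1 for v, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, tie, merged))
        tie += 1
    return heap[0][2]


# Invented scanline: mostly background (0) with a few brighter pixels.
scanline = [0] * 12 + [7, 7, 7] + [9] + [0] * 4
runs = rle_encode(scanline)
lengths = huffman_code_lengths(scanline)
print(runs)     # four runs instead of twenty pixel values
print(lengths)  # the common value 0 gets the shortest code
```

Decoding the runs returns the scanline exactly, which is what "recovered exactly" means for a lossless method.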


MRO and LRO images

• Not your typical images
  – MESSENGER MDIS, Viking Orbiter, Galileo SSI, etc.
    • Framing cameras
    • 800 samples x 800 lines to 1024 samples x 1024 lines
    • Roughly one megabyte (MB) per observation
    • PDS Imaging Node combined archive requirement for all missions other than LRO and MRO is <25 TB
  – MRO/HiRISE, LRO/LROC
    • Line-scan cameras
    • 10,000-20,000 samples x 50,000-100,000 lines
    • Roughly 500 to 2,000 MB per observation
    • Combined expected archive total for MRO and LRO is 500 TB
    • 20X larger than sum total of all other Imaging Node holdings
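The per-observation sizes above follow from samples x lines x bytes per pixel. The sketch below reproduces them under two assumptions that the slide does not state: 8 bits per pixel and decimal megabytes. `raw_image_bytes` is a hypothetical helper, and label or header overhead is ignored.

```python
def raw_image_bytes(samples: int, lines: int, bits_per_pixel: int = 8) -> int:
    """Uncompressed size of a single-band image, ignoring labels and headers."""
    return samples * lines * bits_per_pixel // 8


MB = 10**6  # decimal megabyte (assumption)

framing = raw_image_bytes(1024, 1024)               # largest framing-camera case
line_scan_small = raw_image_bytes(10_000, 50_000)   # low end of the line-scan range
line_scan_large = raw_image_bytes(20_000, 100_000)  # high end of the line-scan range

print(f"framing camera:    {framing / MB:.2f} MB")
print(f"line scan (small): {line_scan_small / MB:.0f} MB")
print(f"line scan (large): {line_scan_large / MB:.0f} MB")
```

With these assumptions the line-scan results land on the 500 to 2,000 MB range quoted above.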


Image Compression for HiRISE RDRs

• Why image compression was needed
  – Enormous volume of HiRISE archive, 1 yr
    • EDR – 12,100 Gb (~1.5 TB)
    • RDR – 92,500 Gb (11.3 TB)
  – Very large Standard Data Products
    • EDR (2048 x 64,000, 16-bit) = 262 MB
    • RDR (40,000 x 64,000, after reprojection, 16-bit) = ~500 to 1000 MB
  – Advantages for delivery of RDR data in JPEG2000 format
    • Losslessly recompressed format
    • Wavelet compression greatly improves speed of web access
    • Fast browse, zoom, pan capabilities for handling large files
  – Volume projections
    • EDR DVD volumes: 321 (losslessly recompressed) vs 482 (uncompressed) (1.5 compression ratio)
    • RDR DVD volumes: 2400 (losslessly compressed) vs 7300 (uncompressed) (assuming 3.0 compression ratio)
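Several of the figures above can be checked with simple arithmetic. The sketch assumes that "Gb" means gigabits and that 1 TB = 1024 GB; both are guesses, chosen because they reproduce the slide's rounded values. `gigabits_to_tb` is a hypothetical helper.

```python
GB_PER_TB = 1024  # assumption: binary units match the slide's rounding


def gigabits_to_tb(gigabits: float) -> float:
    """Convert gigabits to terabytes (8 bits per byte, 1024 GB per TB)."""
    return gigabits / 8 / GB_PER_TB


edr_tb = gigabits_to_tb(12_100)   # one year of EDRs
rdr_tb = gigabits_to_tb(92_500)   # one year of RDRs

# EDR product: 2048 samples x 64,000 lines x 2 bytes (16-bit pixels).
edr_bytes = 2048 * 64_000 * 2

# Compression ratios implied by the DVD volume counts.
edr_dvd_ratio = 482 / 321
rdr_dvd_ratio = 7300 / 2400

print(f"EDR archive: {edr_tb:.1f} TB, RDR archive: {rdr_tb:.1f} TB")
print(f"EDR product: {edr_bytes / 10**6:.0f} MB")
print(f"DVD ratios:  EDR {edr_dvd_ratio:.1f}, RDR {rdr_dvd_ratio:.1f}")
```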


HiRISE Example

• JPEG2000 image compression applied to map-projected RDR images only
• Lots of null pixels
• Nulls are highly compressed as a result of the lossless compression using JPEG2000
• Projected ~3:1 compression ratios
• Achieved 15:1 in recent tests
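The effect of null padding on the achieved ratio can be illustrated with any lossless coder. The sketch below uses Python's `zlib` as a stand-in (it is not JPEG2000), random bytes as a hypothetical worst-case "image", and an invented amount of zero padding, so the numbers it prints are only indicative.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


rng = random.Random(0)
# Hypothetical image content: random non-zero bytes, nearly incompressible.
valid_pixels = bytes(rng.randrange(1, 256) for _ in range(50_000))
# Map-projected product: the same pixels surrounded by null (zero) padding.
padded = bytes(200_000) + valid_pixels + bytes(200_000)

ratio_valid = compression_ratio(valid_pixels)
ratio_padded = compression_ratio(padded)

print(f"valid pixels only: {ratio_valid:.2f}:1")
print(f"with null padding: {ratio_padded:.2f}:1")
```

The valid pixels compress no better than before; the higher overall ratio comes almost entirely from the nulls.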


Past Experience

• Problems with compression
  – Voyager, Viking, and MGS-MOC PDS archives contain losslessly compressed data
  – Decompression algorithms (e.g., in ISIS) break due to
    • New compilers
    • New operating systems
    • Changes in hardware architecture (32-bit vs 64-bit)
  – JPEG2000 compressed HiRISE RDR images are supported by ISIS3
    • But, when JPEG2000 format reaches end-of-life, software maintenance to read this format will be much more difficult than for the existing Voyager/Viking/MGS-MOC algorithms
• A proliferation of image compression formats in PDS would be a problem for long-term archiving and usability of the images


Data Storage Costs: MRO & LRO

• Expected PDS storage requirements for the MRO nominal mission are 75 TB
  – High capacity RAID storage & network infrastructure costs ~$750 per TB
  – The hardware cost to store a single copy of the MRO data is ~$56K
    • Only one copy of the three required by PDS
    • Does not include data from an extended mission
    • Archive includes JPEG2000 compressed images
• LRO archive volume is projected to be ~400 TB
  – Hardware cost for one copy is ~$300K
  – Same caveats as above apply
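The cost figures above are volume times unit cost. The sketch below reproduces the single-copy numbers and extends them to the three copies PDS requires; the three-copy totals are derived here, not quoted from the slides, and `storage_cost_usd` is a hypothetical helper.

```python
COST_PER_TB_USD = 750  # FY10 RAID storage with network infrastructure
REQUIRED_COPIES = 3    # number of copies required by PDS


def storage_cost_usd(volume_tb: float, copies: int = 1) -> float:
    """Hardware cost of storing `copies` copies of an archive of `volume_tb` TB."""
    return volume_tb * COST_PER_TB_USD * copies


mro_one_copy = storage_cost_usd(75)
lro_one_copy = storage_cost_usd(400)
mro_all_copies = storage_cost_usd(75, REQUIRED_COPIES)
lro_all_copies = storage_cost_usd(400, REQUIRED_COPIES)

print(f"MRO: ${mro_one_copy:,.0f} per copy, ${mro_all_copies:,.0f} for three")
print(f"LRO: ${lro_one_copy:,.0f} per copy, ${lro_all_copies:,.0f} for three")
```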


PDS3 Compressed Image Formats

• Clem-JPEG (not in PDS Standards Reference)
• Huffman First Difference (not in PDS Standards Reference)
• JPEG2000
  – Improved compression efficiency (vs. JPEG)
  – Highly scalable embedded data streams
  – Progressive lossy to lossless compression within a single data stream
  – Arbitrarily crop images in the compressed domain
  – Selectively enhance quality of spatial “regions of interest”
  – Support for very large images
    • Used for HiRISE & LROC RDRs
• Previous Pixel (not in PDS Standards Reference)
• Run Length (not in PDS Standards Reference)
• Zip, gzip = GNU zip
  – Widely used open-source tool
  – Runs on a variety of common computer platforms
  – Available since 1992


Possible Solution for PDS4

• Allow File Compression
  – Use standard, non-patented algorithms (e.g., Lempel-Ziv 77, Huffman coding)
  – Use stable, open-source, well-maintained software (e.g., gzip)
• Tests using gzip, HiRISE data
  – RDRs
    • HiRISE RDR, JPEG2000 = 454 MB
    • Uncompressed, converted to raw format = 6.6 GB (15x larger)
    • Compressed using gzip = 1.1 GB (2.5x larger than the JPEG2000 file)
  – EDRs
    • Not compressed, typical file size = 250 MB
    • gzipped versions = 100 MB (2.5x smaller)
  – Overall the HiRISE archive would be 5% smaller
    • gzip EDRs
    • Convert RDRs to raw, then gzip
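A minimal sketch of the file-based approach, using only Python's standard `gzip`, `shutil`, and `hashlib` modules: compress a file, restore it, and confirm by checksum that the round trip is exact. The file names and the stand-in data are hypothetical, and nothing here reproduces the HiRISE test itself.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in 1 MiB chunks so large products need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_file(source: Path) -> Path:
    """Write `source` + '.gz' next to the original and return its path."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return target


def gunzip_file(packed_path: Path, restored: Path) -> Path:
    """Decompress `packed_path` into `restored` and return that path."""
    with gzip.open(packed_path, "rb") as packed, open(restored, "wb") as raw:
        shutil.copyfileobj(packed, raw)
    return restored


with tempfile.TemporaryDirectory() as workdir:
    product = Path(workdir) / "example_product.img"  # hypothetical file name
    product.write_bytes(bytes(range(256)) * 4096)    # stand-in data, not a real EDR
    packed = gzip_file(product)
    restored = gunzip_file(packed, Path(workdir) / "restored.img")
    original_size = product.stat().st_size
    packed_size = packed.stat().st_size
    round_trip_ok = sha256_of(product) == sha256_of(restored)

print(f"{original_size} -> {packed_size} bytes, round trip exact: {round_trip_ok}")
```

Comparing checksums before and after is one way an archive could confirm that a compressed copy still holds the original data.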


Recommendation

• Allow file-based compression (such as gzip, bzip2) in PDS4
  – Stable, free, widely used open-source software tools
  – Work on a variety of common computer platforms
    • Macs, PCs, Solaris, MS-DOS, VAX, etc.
  – Maintained by open-source community
• Consistent with PDS3 history, PDS4 plans for simplification
• Reduces storage costs
• Improves data transfer rates over internet
• Supports management and delivery of high-volume data sets for providers and users


Policy Questions

• Do we permit compression at all in the PDS4 archive?
• If so:
  – Do we want a mixture of compressed and uncompressed data?
    • One copy is uncompressed, two are compressed
  – Do we distinguish between EDRs and RDRs and other derived products?
  – Do we distinguish between frequently accessed data and those offline and/or in ‘deep archive’ storage?
    • Store deep archive data in uncompressed form or use one approved compression format (e.g., gzip)
    • Permit nodes to use and maintain other compression methods as needed for one or more copies
• Whatever we decide, do we require older, compressed data to be ‘restored’ to meet requirements of the new compression policy?