29 March 2004 Steven Worley, NSF/NCAR/SCD 1
Research Data Stewardship and Access
Steven Worley, CISL/SCD
Cyberinfrastructure meeting with Priscilla Nelson and NSF colleagues
How is cyberinfrastructure used in this domain?
• Harvest data to build RDA content
  – World-wide
• Create standard metadata
  – Enable discovery and metadata sharing
• Provide data access
  – Internally to NCAR/UCAR
  – Externally to global research community
Definition of the RDA
• 500 plus distinct archived datasets
• Continual growth for about 40 years
• Each has metadata displayed on a web page
• All data on the MSS (primary + backups)
  – 548K files
  – 100.5 TB
Harvest data to build RDA content
[Chart: Dataset Update Frequency. Number of datasets (axis 0 to 35) for each category: Annual, Several/yr, Weekly, Monthly, Irregular, Daily. Total Active Datasets = 79]
[Chart: Dataset Update Method. Number of datasets (axis 0 to 60) for each category: Other Dataset, Network, Tape, CDROM]
Harvest data to build RDA content
• Current network methods (commodity internet)
  – Manual web download
  – Automatic scripted FTP
  – Subscription upload
• Limitations
  – Slow for large volumes
  – Success/failure checks are responsibility of staff
• Future
  – Exploit larger bandwidth networks
  – Larger bandwidth tools, ESG… etc
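The limitation noted on this slide, that success/failure checks fall to staff, is the kind of step a harvest script can take over. The following is a minimal sketch in Python, assuming the data provider publishes an expected size and SHA-256 checksum for each file. The slides do not say that it does, and the function names `sha256_of` and `verify_transfer` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path: Path, expected_size: int, expected_sha256: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passed.

    Assumption: the expected size and checksum come from the data provider.
    """
    if not path.is_file():
        return [f"{path}: missing"]
    problems = []
    actual_size = path.stat().st_size
    if actual_size != expected_size:
        problems.append(f"{path}: size {actual_size} != expected {expected_size}")
    if sha256_of(path) != expected_sha256.lower():
        problems.append(f"{path}: checksum mismatch")
    return problems
```

A scripted FTP job could call `verify_transfer` after each download and alert staff only when the returned list is non-empty, so that routine successes need no manual review.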
Create standard metadata
• Legacy metadata
  – Hardcopy and images
  – Digitally online since about 1980
  – Local standardized format
• Currently
  – Legacy metadata remains available
    • Used to derive web pages
  – Transformed to standards used in CDP
  – Incorporated into THREDDS catalogues
    • Enable searches across UCAR
• Future
  – More detailed metadata for accurate discovery (e.g. file level metadata)
  – Continue to be exported through CDP and data server systems
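To illustrate why file-level metadata supports more accurate discovery, here is a minimal sketch in Python. The record layout, the field names, and the `find_files` helper are all hypothetical; they are not the RDA, CDP, or THREDDS schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical file-level metadata record; field names are illustrative."""
    dataset_id: str
    path: str
    variables: tuple[str, ...]
    start: date
    end: date


def find_files(records, variable, start, end):
    """Return records that hold `variable` and overlap the period [start, end]."""
    return [
        r for r in records
        if variable in r.variables and r.start <= end and r.end >= start
    ]


# Made-up example records, for illustration only.
records = [
    FileRecord("example-a", "a/1990.dat", ("temperature",), date(1990, 1, 1), date(1990, 12, 31)),
    FileRecord("example-a", "a/1991.dat", ("temperature",), date(1991, 1, 1), date(1991, 12, 31)),
    FileRecord("example-b", "b/1990.dat", ("pressure",), date(1990, 1, 1), date(1990, 12, 31)),
]

matches = find_files(records, "temperature", date(1990, 6, 1), date(1990, 6, 30))
for record in matches:
    print(record.path)
```

With dataset-level metadata alone, a search could only return the whole dataset. With per-file records, it can narrow the result to the files that cover the requested variable and period.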
Provide data access (delivery)
• Internally – to NCAR computing systems
• Currently, from the NCAR MSS
  – Supercomputer
  – Data analysis systems
  – Divisional computer systems
MSS is a tape-based archive system not designed to be a scalable file server
• Future
  • SANs between computer systems and MSS
  • Enable rapid file service and unburden the archive system
Internal (MSS) access metrics
[Chart: Unique Users for MSS, by year, 1990 to 2004; vertical axis 0 to 500]
[Chart: Data Delivery from MSS in terabytes, by year, 1990 to 2004; vertical axis 0 to 25]
Files read for 2004
• 25K
Provide data access (delivery)
• Externally – to the internet
  • Caveat: some NCAR user
• Currently, traditional data server
  – Web and FTP downloads
    • Most popular data only (166 K files, 10.7 TB)
  – Subsetting
    • By request and delayed mode processing
• Future
  – More traditional services
  – Key datasets available through portals (CDP/ESG)
Provide data access (delivery)
• Data server (Web and FTP) metrics
• Jan. – Feb. 2005 Only
  – New system to accurately track users
  – Old system provided “fuzzy” metrics

                January 2005   February 2005
Unique Users    517            523
Amount (TB)     1.2            1.8
No. Files       6151           12403
Future
• Fact
  – Dataset size and complexity are growing – need to handle more data
• How?
  – Use advanced networks to harvest rapidly
  – More complete metadata, in a standard
    • Improved data discovery and access
    • Improved (more efficient) data management
  – Provide critical collections through portals
    • Interoperable access through servers (e.g. GDS, etc)
  – Distributed archives
    • Share metadata with other portals (global discovery)
Key Case – ERA-40
• 35 TB collection, 30 distinct product lines
• Added about 10 products (computed in SCD)
  – Support Climate Modeling
• Metrics for 2004

                        Web & FTP   NCAR MSS   Total
Unique Users            68          70         129
Number of Data Files    28426       12898      41324
Data Amount (GB)        10778       9500       20278

• Web & FTP = MSS in Data Amount
• Over 20 TB delivered
• 13K files from non-file server MSS
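The 2004 ERA-40 figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch in Python using only the numbers on the slide. Two points are assumptions rather than statements from the slides: that 1 TB = 1000 GB, and that the unique-user total counts each person once, so the gap between the column sum and the total is read as users who appear on both routes.

```python
# 2004 ERA-40 delivery metrics as given on the slide.
web_ftp = {"users": 68, "files": 28426, "gigabytes": 10778}
mss = {"users": 70, "files": 12898, "gigabytes": 9500}
reported_total_users = 129

total_files = web_ftp["files"] + mss["files"]
total_gb = web_ftp["gigabytes"] + mss["gigabytes"]

# Assumption: 1 TB = 1000 GB (the slides do not state the convention).
total_tb = total_gb / 1000

# Share of delivered volume that went out through Web & FTP.
web_share = web_ftp["gigabytes"] / total_gb

# Assumption: the reported total counts distinct people, so the difference
# from the simple sum is the number of users seen on both routes.
users_on_both = web_ftp["users"] + mss["users"] - reported_total_users

print(f"total files: {total_files}")
print(f"total data: {total_gb} GB ({total_tb:.1f} TB)")
print(f"Web & FTP share of volume: {web_share:.1%}")
print(f"users on both routes (inferred): {users_on_both}")
```

The file and volume totals reproduce the slide's Total column, and the near-even volume split is what the bullet "Web & FTP = MSS in Data Amount" summarizes.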
Conclusions
Are using basic cyberinfrastructure now
Will use new proven components in our operations
With cyberinfrastructure we plan to:
• improve data acquisition, discovery, and access
• improve our management efficiency
In the process we will:
• seamlessly integrate new and traditional systems
• not lose track of critical legacy data and metadata
Questions/Discussion