
29 March 2004 Steven Worley, NSF/NCAR/SCD 1

Research Data Stewardship and Access Steven Worley, CISL/SCD

Cyberinfrastructure meeting with Priscilla Nelson and NSF colleagues


How is cyberinfrastructure used in this domain?

• Harvest data to build RDA content
  – World-wide
• Create standard metadata
  – Enable discovery and metadata sharing
• Provide data access
  – Internally to NCAR/UCAR
  – Externally to global research community


Definition of the RDA

• 500 plus distinct archived datasets
• Continual growth for about 40 years
• Each has metadata displayed on a web page
• All data on the MSS (primary + backups)
  – 548K files
  – 100.5 TB


Harvest data to build RDA content

[Figure: Dataset Update Frequency. Bar chart of number of datasets by update frequency: Annual, Several/yr, Weekly, Monthly, Irregular, Daily. Total active datasets = 79.]

[Figure: Dataset Update Method. Bar chart of number of datasets by update method: Other Dataset, Network, Tape, CDROM.]


Harvest data to build RDA content

• Current network methods (commodity internet)
  – Manual web download
  – Automatic scripted FTP
  – Subscription upload
• Limitations
  – Slow for large volumes
  – Success/failure checks are responsibility of staff
• Future
  – Exploit larger bandwidth networks
  – Larger bandwidth tools, ESG, etc.
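The limitation noted on this slide, that success/failure checks fall to staff, is the kind of step that is commonly automated. The following is a minimal sketch only, not a description of the RDA's actual tooling. It assumes the data provider publishes an expected size and SHA-256 digest for each file, which the slides do not state; `sha256_of` and `verify_transfer` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path: Path, expected_size: int, expected_sha256: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passed.

    Hypothetical check: the expected size and digest are assumed to come
    from the data provider alongside the file.
    """
    if not path.is_file():
        return [f"{path}: missing"]
    problems = []
    actual_size = path.stat().st_size
    if actual_size != expected_size:
        problems.append(f"{path}: size {actual_size} != expected {expected_size}")
    if sha256_of(path) != expected_sha256.lower():
        problems.append(f"{path}: checksum mismatch")
    return problems
```

A scripted harvest could call `verify_transfer` after each download and flag only the files that return a non-empty list, so staff review exceptions instead of every transfer.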


Create standard metadata

• Legacy metadata
  – Hardcopy and images
  – Digitally online since about 1980
  – Local standardized format
• Currently
  – Legacy metadata remains available
    • Used to derive web pages
  – Transformed to standards used in CDP
  – Incorporated into THREDDS catalogues
    • Enable searches across UCAR
• Future
  – More detailed metadata for accurate discovery (e.g. file-level metadata)
  – Continue to be exported through CDP and data server systems


Provide data access (delivery)

• Internally – to NCAR computing systems
• Currently, from the NCAR MSS
  – Supercomputer
  – Data analysis systems
  – Divisional computer systems
  MSS is a tape-based archive system, not designed to be a scalable file server
• Future
  • SANs between computer systems and MSS
  • Enable rapid file service and unburden the archive system


Internal (MSS) access metrics

[Figure: Unique Users for MSS. Yearly counts, 1990 to 2004; vertical axis 0 to 500.]

[Figure: Data Delivery from MSS. Terabytes per year, 1990 to 2004; vertical axis 0 to 25.]

Files read for 2004

• 25K


Provide data access (delivery)

• Externally – to the internet
  • Caveat: some NCAR users
• Currently, traditional data server
  – Web and FTP downloads
    • Most popular data only (166 K files, 10.7 TB)
  – Subsetting
    • By request and delayed mode processing
• Future
  – More traditional services
  – Key datasets available through portals (CDP/ESG)


Provide data access (delivery)

• Data server (Web and FTP) metrics
• Jan. – Feb. 2005 only
  – New system to accurately track users
  – Old system provided “fuzzy” metrics

                 January 2005   February 2005
Unique Users     517            523
Amount (TB)      1.2            1.8
No. Files        6151           12403


Future

• Fact
  – Dataset size and complexity are growing – need to handle more data
• How?
  – Use advanced networks to harvest rapidly
  – More complete metadata, in a standard
    • Improved data discovery and access
    • Improved (more efficient) data management
  – Provide critical collections through portals
    • Interoperable access through servers (e.g. GDS, etc.)
  – Distributed archives
    • Share metadata with other portals (global discovery)


Key Case – ERA-40

• 35 TB collection, 30 distinct product lines
• Added about 10 products (computed in SCD)
  – Support Climate Modeling
• Metrics for 2004

                       Web & FTP   NCAR MSS   Total
Unique Users           68          70         129
Number of Data Files   28426       12898      41324
Data Amount (GB)       10778       9500       20278

• Web & FTP = MSS in Data Amount
• Over 20 TB delivered
• 13K files from non-file server MSS
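The 2004 ERA-40 figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-derives numbers from the slide. The one interpretive step, treating the Total for unique users as de-duplicated across both channels, is an assumption, and the variable names are illustrative.

```python
# Figures transcribed from the slide, ordered (Web & FTP, NCAR MSS, Total).
unique_users = (68, 70, 129)
data_files = (28426, 12898, 41324)
data_gb = (10778, 9500, 20278)

# File counts and data volumes add up across the two channels.
assert data_files[0] + data_files[1] == data_files[2]
assert data_gb[0] + data_gb[1] == data_gb[2]

# Unique users do not add up: 68 + 70 exceeds 129. Assuming the Total is
# de-duplicated, the difference is the number of users seen on both channels.
users_in_both = unique_users[0] + unique_users[1] - unique_users[2]

# Share of delivered volume served by Web & FTP, and mean file size per channel.
web_share = data_gb[0] / data_gb[2]
mean_gb_per_file_web = data_gb[0] / data_files[0]
mean_gb_per_file_mss = data_gb[1] / data_files[1]

print(f"users on both channels: {users_in_both}")
print(f"Web & FTP share of volume: {web_share:.0%}")
print(f"mean file size, Web & FTP: {mean_gb_per_file_web:.2f} GB")
print(f"mean file size, MSS: {mean_gb_per_file_mss:.2f} GB")
```

The Web & FTP share comes out slightly above one half, which is consistent with the slide's note that the two channels delivered comparable data amounts.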


Conclusions

Are using basic cyberinfrastructure now

Will use new proven components in our operations

With cyberinfrastructure we plan to:

• improve data acquisition, discovery, and access

• improve our management efficiency

In the process we will:

• seamlessly integrate new and traditional systems

• not lose track of critical legacy data and metadata


Questions/Discussion