

29 March 2004 Steven Worley, NSF/NCAR/SCD 1

Research Data Stewardship and Access

Steven Worley, CISL/SCD

Cyberinfrastructure meeting with Priscilla Nelson and NSF colleagues


How is cyberinfrastructure used in this domain?

• Harvest data to build RDA content
  – World-wide
• Create standard metadata
  – Enable discovery and metadata sharing
• Provide data access
  – Internally to NCAR/UCAR
  – Externally to global research community


Definition of the RDA

• 500 plus distinct archived datasets
• Continual growth for about 40 years
• Each has metadata displayed on a web page
• All data on the MSS (primary + backups)
  – 548K files
  – 100.5 TB


Harvest data to build RDA content

[Figure: Dataset Update Frequency. Bar chart of number of datasets by update frequency (Annual, Several/yr, Weekly, Monthly, Irregular, Daily). Total Active Datasets = 79]

[Figure: Dataset Update Method. Bar chart of number of datasets by update method (Other Dataset, Network, Tape, CDROM)]


Harvest data to build RDA content

• Current network methods
  – Manual web download
  – Automatic scripted FTP
  – Subscription upload
  (Commodity internet)
• Limitations
  – Slow for large volumes
  – Success/failure checks are the responsibility of staff
• Future
  – Exploit larger bandwidth networks
  – Larger bandwidth tools, ESG… etc
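The slide lists "automatic scripted FTP" as a current method and notes that success/failure checks fall to staff. As a rough illustration of how such a check could be automated, here is a minimal Python sketch. It is not the RDA's actual harvesting code: the function names (`file_md5`, `verify_download`, `fetch_and_verify`), the anonymous login, and the use of a size and optional MD5 comparison are all assumptions made for the example.

```python
import hashlib
import os
from ftplib import FTP


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_size=None, expected_md5=None):
    """Return (ok, reason) for a downloaded file.

    Checks are skipped when the expected value is None, for example
    when the remote server does not report a size.
    """
    if not os.path.exists(path):
        return False, "missing"
    if expected_size is not None:
        actual = os.path.getsize(path)
        if actual != expected_size:
            return False, f"size mismatch: {actual} != {expected_size}"
    if expected_md5 is not None and file_md5(path) != expected_md5:
        return False, "checksum mismatch"
    return True, "ok"


def fetch_and_verify(host, remote_path, local_path, expected_md5=None):
    """Download one file by anonymous FTP, then verify it.

    Hypothetical workflow: host and paths are supplied by the caller.
    """
    with FTP(host) as ftp:
        ftp.login()                # anonymous login (assumption)
        ftp.voidcmd("TYPE I")      # binary mode so SIZE reports bytes
        expected_size = ftp.size(remote_path)  # may be None
        with open(local_path, "wb") as handle:
            ftp.retrbinary(f"RETR {remote_path}", handle.write)
    return verify_download(local_path, expected_size, expected_md5)
```

A harvesting script could log the returned `(ok, reason)` pair for each file and retry or flag failures, so that staff only review the exceptions rather than every transfer.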


Create standard metadata

• Legacy metadata
  – Hardcopy and images
  – Digitally online since about 1980
  – Locally standardized format
• Currently
  – Legacy metadata remains available
    • Used to derive web pages
  – Transformed to standards used in CDP
  – Incorporated into THREDDS catalogues
    • Enable searches across UCAR
• Future
  – More detailed metadata for accurate discovery (e.g. file level metadata)
  – Continue to be exported through CDP and data server systems


Provide data access (delivery)

• Internally – to NCAR computing systems
• Currently, from the NCAR MSS
  – Supercomputer
  – Data analysis systems
  – Divisional computer systems
  MSS is a tape-based archive system not designed to be a scalable file server
• Future
  • SANs between computer systems and MSS
  • Enable rapid file service and unburden the archive system


Internal (MSS) access metrics

[Figure: Unique Users for MSS, by year, 1990 to 2004]

[Figure: Data Delivery from MSS, in terabytes, by year, 1990 to 2004]

Files read for 2004

• 25K


Provide data access (delivery)

• Externally – to the internet
  • Caveat: some NCAR users
• Currently, traditional data server
  – Web and FTP downloads
    • Most popular data only (166 K files, 10.7 TB)
  – Subsetting
    • By request and delayed mode processing
• Future
  – More traditional services
  – Key datasets available through portals (CDP/ESG)


Provide data access (delivery)

• Data server (Web and FTP) metrics
  • Jan. – Feb. 2005 Only
    – New system to accurately track users
    – Old system provided “fuzzy” metrics

                January 2005   February 2005
Unique Users    517            523
Amount (TB)     1.2            1.8
No. Files       6151           12403


Future

• Fact
  – Dataset size and complexity are growing – need to handle more data
• How?
  – Use advanced networks to harvest rapidly
  – More complete metadata, in a standard
    • Improved data discovery and access
    • Improved (more efficient) data management
  – Provide critical collections through portals
    • Interoperable access through servers (e.g. GDS, etc)
  – Distributed archives
    • Share metadata with other portals (global discovery)


Key Case – ERA-40

• 35 TB collection, 30 distinct product lines

• Added about 10 products (computed in SCD)
  – Support Climate Modeling

• Metrics for 2004

                        Web & FTP   NCAR MSS   Total
Unique Users            68          70         129
Number of Data Files    28426       12898      41324
Data Amount (GB)        10778       9500       20278

• Web & FTP = MSS in Data Amount

• Over 20 TB delivered

• 13K files from non-file server MSS
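The 2004 ERA-40 metrics can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures from the slide; the variable names are illustrative, and the reading of the user total as a count of distinct people (so that the shortfall against the column sum reflects users who appear under both access paths) is an assumption, not something the slide states.

```python
# Values copied from the "Metrics for 2004" table on the ERA-40 slide.
web_ftp = {"users": 68, "files": 28426, "gb": 10778}
mss = {"users": 70, "files": 12898, "gb": 9500}
total = {"users": 129, "files": 41324, "gb": 20278}

# File counts and data amounts are additive across the two access paths.
files_sum = web_ftp["files"] + mss["files"]
gb_sum = web_ftp["gb"] + mss["gb"]

# Assumption: the user total counts distinct users, so by
# inclusion-exclusion the difference is the number seen on both paths.
users_in_both = web_ftp["users"] + mss["users"] - total["users"]

# Fraction of the delivered volume that went through Web & FTP.
web_share = web_ftp["gb"] / total["gb"]

print(files_sum == total["files"], gb_sum == total["gb"])
print(users_in_both, round(web_share, 2))
```

Under that assumption, the file and volume totals are exact sums, while the user total is smaller than the column sum because a handful of users drew on both the data server and the MSS. The two paths carried comparable volumes, which matches the slide's "Web & FTP = MSS in Data Amount" remark.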


Conclusions

We are using basic cyberinfrastructure now

We will use new, proven components in our operations

With cyberinfrastructure we plan to:

• improve data acquisition, discovery, and access

• improve our management efficiency

In the process we will:

• seamlessly integrate new and traditional systems

• not lose track of critical legacy data and metadata


Questions/Discussion
