Introduction to Data Management and Sharing
Post on 06-May-2015
Introduction to Data Management and Sharing
University Libraries/Information Services
Office of Research Compliance and Training
Why is there a new focus on data management and sharing?
Data sharing is not widely practiced…
• Lack of time: data clean-up, user questions
• Lack of recognition: not valued in promotion/tenure
• Lack of control: worries about scooping, misinterpretation
• Legal concerns: copyright, patents
• Inadequate infrastructure
…yet its value is recognized
Data sharing was a key element of:
• Human Genome Project
• NIH-funded Alzheimer’s study published in April 2011
• Sloan Digital Sky Survey
There are new possibilities…
Networked digital technology creates new potential for:
• data collection
• data analysis
• data “mash ups”
• collaboration
• citizen science
…and science is in the spotlight
“The impact of science on people’s lives, and the implications of scientific assessments for society and the economy are now so great that people won’t just believe scientists when they say ‘trust me, I’m an expert.’ … Science has to adapt.”
- Geoffrey Boulton, chair of working group for study: Science as a public enterprise: opening up scientific information, 5.13.11
These factors have changed the conversation, resulting in…
Calls for data accessibility…
“It is obvious that making data widely available is an essential element of scientific research.”
- Science editorial “Making Data Maximally Available,” 2.11.11
…and new data management policies
NSF and other research sponsors are strengthening their data management and sharing policies to help:
• increase the accessibility of data
• create standards and protocols
• develop interoperable data repositories
• encourage transparency of research
Submitting a proposal to the NSF?
You must:
• Submit a two-page data management plan with your proposal.
• Share your research data (or justify why you should not share it).
Publishing in a Nature journal?
“…authors are required to make materials, data and associated protocols promptly available to readers.”
More than ever, researchers are expected to make their data accessible to, and usable by, others.
This means…
Having a data management plan is more important than ever.
Data management plan (DMP)
A data management plan outlines how you will collect, organize, manage, store, secure, back up, preserve, and share your data.
Other DMP elements
• Designating who is responsible for data management
• Tools or software needed to create/process/visualize the data
• Compliance with policies and regulations
Columbia DMP Template
• Columbia provides a DMP template.
• Though created in response to NSF requirements, you can use it as a guide for creating any DMP.
• You can find the template on the NSF Data Management Requirements page of this website.
Some points to consider when creating your DMP
Your data storage needs
• Data formats and size
• Retention period
• Privacy or security requirements
• Backup plan
• Access requirements
Data storage planning
• Plan for the entire life-cycle.
• Establish a baseline and project the rate of growth for the duration of the project.
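One way to turn a baseline and a growth rate into a storage estimate is sketched below. The function name `project_storage`, the compounding monthly growth model, and all of the figures are assumptions made for illustration; they do not come from the slides, and a real project may grow in steps rather than smoothly.

```python
def project_storage(baseline_gb, monthly_growth_rate, months):
    """Return the projected storage need (in GB) at the end of each month.

    Assumes growth compounds monthly at a constant rate, which is a
    simplification chosen for this sketch.
    """
    projections = []
    size = baseline_gb
    for _ in range(months):
        size *= 1 + monthly_growth_rate
        projections.append(size)
    return projections


# Hypothetical figures: 500 GB baseline, 5% growth per month, 36-month project.
projection = project_storage(500, 0.05, 36)
print(f"Projected size at project end: {projection[-1]:.0f} GB")
```

Running the projection with a few different growth rates gives a range to quote in the DMP rather than a single number.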
Two types of storage
• Active: frequent additions and updates
• Archival: in fixed form; only need periodic access
Active storage at Columbia
• School/department/division servers: many researchers use servers managed by “local” IT groups.
• CUIT: 20-80 MB personal storage; central LAN service
• Center for Digital Research & Scholarship: consultation available
Archival storage at Columbia
• Digital: Academic Commons is Columbia’s online research repository.
• Physical: consult the appropriate Columbia University Libraries archive.
Best archival file formats
• Nonproprietary file formats
• Uncompressed and unencrypted files
• Consider ease of migration going forward
• May need to archive software as well as data
Data retention requirements
Other important retention policies
• NIH: 3 years
• NSF: check with individual NSF directorates
• Health Insurance Portability and Accountability Act (HIPAA): at least 6 years
Data security and integrity
• Security: protect data from unauthorized access or accidental disclosure.
• Integrity: ensure that data remains unaltered before, during, and after analysis and presentation.
Data security requirements
Your data may be subject to laws and policies such as:
• HIPAA (Health Insurance Portability and Accountability Act)
• IRB (Institutional Review Board)
• Columbia computing policies: see the Computing and Technology section of the Columbia Administrative Policy Library
Physical security best practices
• Restricted access to research facilities, computers, data
• Only trusted individuals troubleshoot computer problems
• Lab notebooks, samples in locked cabinets
Digital security best practices
• Sensitive data on computers not connected to Internet
• Virus protection up to date
• No confidential data via e-mail or FTP
• Passwords to access files and computers
• Proper data disposal at end of retention period
Data backup best practices
• Make 3 copies
  Original
  External/local
  External/remote – different geographic area
• Verify recovery is possible
  Checksum validation
  Test file restore after initial setup and periodically thereafter
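Checksum validation can be sketched with Python's standard `hashlib` module: compute a digest of the original file and compare it with the digest of each backup copy. The helper names `sha256_of` and `find_mismatched_copies`, the choice of SHA-256, and the file names are assumptions made for this illustration; the throwaway files only stand in for real backups.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_copies(original, copies):
    """Return the copies whose checksum differs from the original's."""
    expected = sha256_of(original)
    return [str(copy) for copy in copies if sha256_of(copy) != expected]


# Demonstration with throwaway files standing in for real backups.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "results.csv"
    original.write_text("sample,value\nA,1\n")
    local_copy = Path(workdir) / "local_backup.csv"
    remote_copy = Path(workdir) / "remote_backup.csv"
    shutil.copyfile(original, local_copy)
    remote_copy.write_text("sample,value\nA,2\n")  # simulated corruption
    mismatched = find_mismatched_copies(original, [local_copy, remote_copy])
    print("Copies failing validation:", mismatched)
```

A matching checksum shows only that the copy is byte-identical to the original; it does not replace an actual restore test.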
Data backup options
• Hard drive
• Tape back-up
• Server
• Cloud storage (Amazon S3)
• Subject repository/data centers (examples: PubChem, Dryad, IRI/LDEO)
Sharing requirements
How, when, and what you share depends on:
• Data format
• Restrictions on data
• Funder and publisher guidelines
• Customary embargo periods
• Availability of appropriate repositories or other vehicles for sharing
Sample data sharing guidelines
Sharing restrictions
Under HIPAA (Health Insurance Portability and Accountability Act), you cannot share information that compromises the confidentiality or privacy of human subjects. Any data resulting from studies using human subjects must be scrubbed of identifying information.
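As a rough illustration of scrubbing, the sketch below drops fields that directly identify a person before records are shared. The field list `DIRECT_IDENTIFIERS`, the helper `scrub_record`, and the sample record are all hypothetical. Removing named columns is only a first step and is not, on its own, a claim of HIPAA compliance; indirect identifiers such as dates or locations may also need treatment, and the IRB determines what is required.

```python
# Hypothetical field names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "ssn", "address"}


def scrub_record(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifying fields."""
    return {
        key: value
        for key, value in record.items()
        if key.lower() not in identifiers
    }


# Made-up example record; no real subject data.
records = [
    {"name": "Jane Doe", "email": "jane@example.org", "age": 42, "result": 7.1},
]
shared = [scrub_record(record) for record in records]
print(shared)
```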
You may have other reasons that justify not sharing your data, and you can detail these in your data management plan. Funders may allow exceptions to data sharing policies.
Don’t forget metadata
Metadata is structured information that describes, explains, locates, and otherwise makes it easier to retrieve and use an information resource.
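A minimal sketch of what structured metadata might look like in practice: a small record stored alongside the data file, plus a check that the fields a future user would need are filled in. The field names, the `REQUIRED_FIELDS` list, and the dataset described are invented for illustration and do not follow any particular metadata standard.

```python
import json

# Hypothetical minimum set of fields for this sketch.
REQUIRED_FIELDS = ["title", "creator", "date_collected", "methods", "units", "contact"]

# Made-up dataset description.
metadata = {
    "title": "Example stream temperature measurements",
    "creator": "Example Lab",
    "date_collected": "2011-04-01/2011-09-30",
    "methods": "Hourly readings from submerged loggers",
    "units": {"temperature": "degrees Celsius"},
    "contact": "data-steward@example.org",
}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in required if not record.get(field)]


print(json.dumps(metadata, indent=2))
print("Missing fields:", missing_fields(metadata))
```

Where a discipline has an established standard, using its schema is preferable to an ad hoc record like this one.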
Metadata facilitates use of your data
“The metadata accompanying your data should be written for a user 20 years into the future -- what does that person need to know to use your data properly? Prepare the metadata for a user who is unfamiliar with your project, methods, or observations.”
- Oak Ridge National Laboratory Distributed Active Archive Center
Major metadata standards
• Darwin Core (Biology)
• DDI (Data Documentation Initiative, for social and behavioral sciences data)
• DIF (Directory Interchange Format for scientific data)
• EML (Ecological Metadata Language)
• FGDC/CSDGM (geographic data)
• NBII (National Biological Information Infrastructure)
Repositories for data sharing
Online data repositories:
• organized around institutions or subjects
• often open access
• archival, not active
• may offer: long-term preservation and access, search engine optimization, permanent URL or DOI
Columbia’s repository
Academic Commons accepts materials from faculty, students, and staff.
• secure replicated storage
• accurate metadata
• globally accessible repository
• contextual linking between data and publications
• a permanent URL
Some subject-based repositories
• Space science mission repository
• Cryospheric data repository
• Macromolecular structural data repository
• Marine data repository
• Biological activities of small molecules data repository
More subject-based repositories
• Deep-sea core samples repository housed at LDEO
• Data repository for archeology and related disciplines
• Basic and applied biosciences data repository
• Geodesy data repository
• Social science data repository
Licensing your data
• Copyright issues around data can be complex
• These groups offer “ready-made” licenses for data that help clarify any restrictions on reuse
For more information
• Data Management section of Scholarly Communication Program website
• Sponsored Projects Administration
• Office of Research Compliance and Training
• Center for Digital Research and Scholarship
• CUIT
• Computing and Technology section of Columbia Administrative Policy Library