introduction to data management, terminologies and use of data management platforms
TRANSCRIPT
www.iita.org - A member of CGIAR consortium
Introduction to data management, terminologies and use of data management platforms
Workshop on Management and Analyses of ISFM Data
Monday, May 25, 2015
Data management
"Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."
(DAMA, Data Management Association International)
Data management
Objective:
• to maximize the potential of data while integrating them into business processes
Topics:
• Data quality
• Data security
• Data organization
Data management principles: Data quality
• Data are correct
• Data are consistent (uniform in content, content structure, notation, units, methods used, meaning, language)
• Data are complete
• Data are up to date
• Data are relevant
• Data are precise enough
• Datasets are free of redundancies
• Data are reliable and comprehensible
• Data are understandable by all involved users and processible by machines
• Data are unambiguous/explicit
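Some of these principles (completeness, consistent units, no redundancies) can be checked automatically. A minimal sketch, assuming a hypothetical dataset of plot records; the field names (`plot_id`, `yield`, `unit`), the agreed unit and the sample values are invented for illustration and are not from the workshop:

```python
from collections import Counter

# Hypothetical field names and agreed unit, invented for illustration only.
REQUIRED_FIELDS = ("plot_id", "yield", "unit")
ALLOWED_UNITS = {"kg/ha"}


def quality_report(records):
    """Return a list of (row_index, problem) tuples for a list of dict records."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for i, rec in enumerate(records):
        for field in REQUIRED_FIELDS:
            if rec.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
    # Consistency: all values must use the one agreed unit.
    for i, rec in enumerate(records):
        unit = rec.get("unit")
        if unit not in (None, "") and unit not in ALLOWED_UNITS:
            problems.append((i, f"unexpected unit {unit}"))
    # Redundancy: the same plot_id must not appear twice.
    counts = Counter(rec.get("plot_id") for rec in records)
    for i, rec in enumerate(records):
        pid = rec.get("plot_id")
        if pid not in (None, "") and counts[pid] > 1:
            problems.append((i, f"duplicate plot_id {pid}"))
    return problems


# Made-up sample records.
records = [
    {"plot_id": "P1", "yield": 1200, "unit": "kg/ha"},
    {"plot_id": "P2", "yield": None, "unit": "kg/ha"},
    {"plot_id": "P1", "yield": 1.3, "unit": "t/ha"},
]
report = quality_report(records)
```

Principles such as relevance or comprehensibility cannot be tested this way and still need review by the people responsible for the data.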
Data management principles: Data security
• All data need frequent backups
• No data without access permission control
• Treatment of data of different ownership (private) is clarified
Data management principles: Data organization
• There is no data without a person responsible for it (clear roles & responsibilities)
• There is no data without one clearly defined, easy-to-find and communicated location for it
Main roles in data management
• Data Editor: The person who validates, creates and edits the data
• Data Steward: The person who holds the data; usually they take care of the data, ensuring that data consumers obtain exactly the data approved by the data owner
• Data Owner: The person who approves data before it is published for the eventual audience
• Data Consumer: A person who uses the data without editing, correcting or modifying it
Operational levels
• Individual (execution of data activities, self-organizing)
• Project/working group (plans & deliveries, rules & responsibilities, workflow & steering, communication, access/permission control, data organizing (content mgt./file order, file naming strategies, templates, project data, …))
• Organization (policies, infrastructure & repositories, resources, …)
• Global (metadata standards, data exchange protocols, vocabularies/ontologies, legal issues, Open Access, …)
Data lifecycle
[Figure: data lifecycle stages (collect, assure, describe, preserve, discover, integrate, analyze, present, plan) shown with the operational levels (Global, Organization, Project, Individual)]
Data lifecycle activities:
• interpret data
• derive data (apply statistical and analytical methods)
• produce research outputs
• author publications
• create metadata and documentation
• identify (tracking)
• categorize
• migrate data to suitable medium
• back up and store data
• archive data
• collect data (experiment, observe, measure, simulate)
• design research
• plan data management (formats, storage etc.)
• plan consent for sharing
• locate existing data
• enter data, digitize, transcribe, translate
• check, validate, clean data
• anonymize data where necessary
• describe data
• migrate data to best format
• locate, explore and understand data
• scrutinize findings
• distribute data
• share data
• control access
• establish copyright
• promote data
• follow-up research
• undertake research reviews
• teach and learn
• expose metadata through a searchable interface
Source: Boston University Libraries
Data intervention areas
• Data capturing and preprocessing
• Data transfer
• Data flow/content mgt.
• Data storage
• Data analytics
• Data delivery
From capture to delivery
Plan data management
Find answers to
• ensure all data mgt. principles are respected
• in and across all intervention areas
• at all operational levels
Start planning from the desired outcomes!
[Figure: planning dimensions: Data management principles, Operational levels, Data lifecycle / intervention areas]
Data presentation/publication
• Who are the end users of which data?
• Mode of presentation per information product
• Ease of extraction of the right data in the right format for the right (authorized) people
• Automated? Real-time data? Personalized data?
• Consumers' conditions (file formats? communication tools?)
• Ability to search & browse (metadata, tags)
• Presentation mode and conditions (including visualization)
• Licensing
Data transfer
• Transfer format and requirements (data transformation needed?)
• Transfer initiative (receiver or sender?)
• Transfer mode and instructions
• Transfer compression needs (zip, tar, …), limited internet availability?
• Transfer channels (email, phone, Skype, RSS etc.)
• Transfer check (e.g. email)
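Compression and a transfer check can be combined: the sender zips the file and sends a checksum alongside it, and the receiver recomputes the checksum after unpacking. A minimal sketch using only the Python standard library; the file name and contents are hypothetical, and the actual transfer channel is left out:

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pack_for_transfer(source, archive):
    """Zip one file and return the checksum of the original for the receiver."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(source, arcname=Path(source).name)
    return sha256_of(source)


def unpack_and_check(archive, target_dir, expected_checksum):
    """Extract the archive and confirm the file arrived unchanged."""
    with zipfile.ZipFile(archive) as zf:
        name = zf.namelist()[0]
        zf.extract(name, target_dir)
    return sha256_of(Path(target_dir) / name) == expected_checksum


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source = tmp / "survey.csv"  # hypothetical file name and content
    source.write_text("plot_id,yield\nP1,1200\n" * 100)
    checksum = pack_for_transfer(source, tmp / "survey.zip")
    received = tmp / "received"
    received.mkdir()
    transfer_ok = unpack_and_check(tmp / "survey.zip", received, checksum)
    smaller = (tmp / "survey.zip").stat().st_size < source.stat().st_size
```

The checksum would typically travel through a second channel (for example the confirmation email), so that a damaged archive is not accompanied by an equally damaged checksum.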
Data transfer
• Transfer security
• Platform openness
• Authorization controls (user credentials)
• Encryption standards (SSL, S/MIME etc.)
• Transfer scheduling
• Use of APIs?
Data storage
• Suitable end repository (server folder, SharePoint, MySQL database, cloud-based solution, PC, external repository)
• Suitable data infrastructure hardware (servers, network(s), bandwidth, databases, security facilities, PCs, external hard drive, USB stick, smartphones/tablets, scanners, field or laboratory sensors with digital data capturing, etc.)
• Data categorization, file order, filing order criteria
• Data deletion policy and archiving for evidence/documentation purposes
• Data disposal/sharing/access control + administration
Data analytics and data search
• Goal and mode of analysis
• Frequency of data analysis
• Participating units and data integration (business intelligence)
• Storage and backup of analysis results
• Speed of search
• Eventual transition or termination of the data?
Data backup
• Risk assessment: loss/theft/damage/overload/hacker attack, …
• Backup mode and regulations
• Backup frequency/scheduling and discipline
• Suitable backup repository (server folder, SharePoint, MySQL database, cloud, PC, external repository, external hard drive, USB stick etc.)
• Backup tool/software/opportunities to automate
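Backup frequency and discipline are easier to keep when the backup is scripted. A minimal sketch of a dated backup with a simple retention rule, using only the Python standard library; the function name `backup_file`, the file name and the `keep=3` retention value are assumptions made for illustration:

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def backup_file(source, backup_dir, today, keep=3):
    """Copy source into backup_dir with a date prefix; keep only the newest copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{today.isoformat()}_{source.name}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    # ISO dates sort chronologically as plain text, so sorted() gives oldest first.
    copies = sorted(backup_dir.glob(f"*_{source.name}"))
    for old in copies[:-keep]:
        old.unlink()
    return target


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "trial_data.csv"  # hypothetical file
    data.write_text("plot_id,yield\nP1,1200\n")
    backups = Path(tmp) / "backups"
    for day in (1, 2, 3, 4):
        backup_file(data, backups, date(2015, 5, day), keep=3)
    kept = sorted(p.name for p in backups.iterdir())
```

In practice the backup directory would sit on a different device or service than the original, in line with the risk assessment above; a temporary directory is used here only to keep the example self-contained.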
Data capturing and preprocessing
• Capturing location and its conditions
• Capturing mode (manual typing, crowdsourcing, data mining, etc.)
• Capturing tools/hardware (PC, smartphones, tablets, GPS, mobile phone, scanners etc.)
• Capturing software and requirements (field data capturing tools, scanning & OCR software, etc.)
• Capturing instructions (metadata, data protocols, additional data descriptions, methodological correctness)
• Data validation rules + data checks: ensuring data quality
• Referencing captured data in time & space
• Data structure at capturing
• Intermediate storage of captured data
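Validation rules can be applied at the moment of capture, before a record reaches intermediate storage. A minimal sketch, assuming a hypothetical observation with `latitude`, `longitude`, `recorded_at` and `soil_ph` fields; the field names and the sample values are invented for illustration:

```python
from datetime import datetime


def validate_observation(obs):
    """Return a list of rule violations for one captured observation (a dict)."""
    errors = []
    # Referencing in space: coordinates must be valid decimal degrees.
    lat, lon = obs.get("latitude"), obs.get("longitude")
    if not isinstance(lat, (int, float)) or not -90 <= lat <= 90:
        errors.append("latitude out of range")
    if not isinstance(lon, (int, float)) or not -180 <= lon <= 180:
        errors.append("longitude out of range")
    # Referencing in time: the timestamp must parse as an ISO 8601 string.
    try:
        datetime.fromisoformat(obs.get("recorded_at", ""))
    except (TypeError, ValueError):
        errors.append("recorded_at is not an ISO 8601 timestamp")
    # Plausibility: a hypothetical range rule for the measured value.
    value = obs.get("soil_ph")
    if not isinstance(value, (int, float)) or not 0 <= value <= 14:
        errors.append("soil_ph outside 0-14")
    return errors


# Made-up observations: one valid, one with a bad latitude and date format.
good = {"latitude": 7.5, "longitude": 3.9,
        "recorded_at": "2015-05-25T10:30:00", "soil_ph": 6.2}
bad = {"latitude": 97.5, "longitude": 3.9,
       "recorded_at": "25/05/2015", "soil_ph": 6.2}
good_errors = validate_observation(good)
bad_errors = validate_observation(bad)
```

Returning all violations at once, rather than stopping at the first, lets the person capturing the data correct a record in a single pass.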
Platforms
• MS SharePoint
• CKAN
• aWhere
• Collaboration tools
• File sharing services (Google Drive, Dropbox, FTP server, etc.)
Data mgt. platforms (1)
MS SharePoint
• Fits into an existing Microsoft environment (MS Office (especially Outlook, Excel, Access, Visio, Project), MS Server databases, Exchange server, Skype)
• With proper permission settings, allows creating as many pages, apps or subsites as necessary
• Useful features for data mgt. (metadata tagging, version control, templates (MS Office only), validation rules, linking data lists, workflows (approvals etc.), many predefined apps come with customizable metadata sets)
• Weak: issues linking open repositories
Data mgt. platforms (2)
CKAN – “Meta-repository”
• Functional emphasis: de facto standard software for publishing open data (started as a catalogue for harvesting published data; spread of knowledge)
• Python-based (DKAN in PHP)
• Strength: customizable, data organization, harvesting multiple repositories
• Weak: no workflow or bulk operations (processing needs to be done before cataloguing); no collaboration tools; no upload of multiple resources at a time or batch editing of metadata
• Example: http://data.ilri.org/portal/
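CKAN catalogues can also be queried by programs through the CKAN Action API, whose `package_search` action accepts a query `q` and a result limit `rows`. A minimal sketch that only builds the request URL and reads a decoded response; that the example portal serves the API under this path is an assumption, and the sample response below is made up to show the expected shape:

```python
from urllib.parse import urlencode


def package_search_url(portal_url, query, rows=10):
    """Build a CKAN Action API URL for a dataset search (no request is sent)."""
    base = portal_url.rstrip("/")
    params = urlencode({"q": query, "rows": rows})
    return f"{base}/api/3/action/package_search?{params}"


def dataset_titles(response):
    """Extract dataset titles from a decoded package_search JSON response."""
    if not response.get("success"):
        return []
    return [ds.get("title", ds.get("name")) for ds in response["result"]["results"]]


# Assumption: the portal exposes the standard API path under its base URL.
url = package_search_url("http://data.ilri.org/portal/", "soil fertility", rows=5)

# Made-up response for illustration; real results will differ.
sample_response = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "example-dataset-a", "title": "Example dataset A"},
            {"name": "example-dataset-b"},
        ],
    },
}
titles = dataset_titles(sample_response)
```

Sending the request (for example with `urllib.request.urlopen`) and decoding the JSON body is left out so that the sketch runs without network access.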
Data mgt. platforms (3a)
ILRI dataset portal based on CKAN
Data mgt. platforms (3b)
ILRI dataset portal based on CKAN
Data mgt. platforms (4)
aWhere
• Functional emphasis: (geo)data exploration
• Strength: easy-to-use platform to explore data from xls or ODK as tables, diagrams or maps and in connection with data from other users, the library and the weather module
• Weak: xls only; collaboration functionality
• More by Hannah and Courtney
Data mgt. platforms (5)
Collaboration tools - Basecamp
• Functional emphasis: collaboration with many different partners in projects
• Strength: easy-to-use platform with typical collaboration tools (file sharing + tagging, calendar, wiki, task tracking)
• Weak: not customizable, no data linkage to databases
Data mgt. platforms (6)
File sharing services – Google Drive
• Functional emphasis: synchronized working on office apps in the cloud
• Strength: data sharing and synchronizing, widely known, easy to use
• Weak: not customizable, no data linkage to databases, Google account necessary; adverts
Thank you!
File naming strategies
Order by date:
2013-04-12_interview-recording_THD.mp3
2013-04-12_interview-transcript_THD.docx
2012-12-15_interview-recording_MBD.mp3
2012-12-15_interview-transcript_MBD.docx
Order by subject:
MBD_interview-recording_2012-12-15.mp3
MBD_interview-transcript_2012-12-15.docx
THD_interview-recording_2013-04-12.mp3
THD_interview-transcript_2013-04-12.docx
Order by type:
Interview-recording_MBD_2012-12-15.mp3
Interview-recording_THD_2013-04-12.mp3
Interview-transcript_MBD_2012-12-15.docx
Interview-transcript_THD_2013-04-12.docx
Forced order with numbering:
01_THD_interview-recording_2013-04-12.mp3
02_THD_interview-transcript_2013-04-12.docx
03_MBD_interview-recording_2012-12-15.mp3
04_MBD_interview-transcript_2012-12-15.docx
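A naming strategy is easier to apply consistently when names are generated rather than typed. A minimal sketch that reproduces the first three patterns above; the function `build_name` and its parameters are hypothetical helpers, not an existing tool:

```python
from datetime import date


def build_name(when, kind, initials, extension, order="date"):
    """Build a file name whose leading element determines the sort order."""
    parts = {"date": when.isoformat(), "subject": initials, "type": kind}
    # Element sequence per strategy, matching the examples above.
    sequences = {
        "date": ("date", "type", "subject"),
        "subject": ("subject", "type", "date"),
        "type": ("type", "subject", "date"),
    }
    return "_".join(parts[key] for key in sequences[order]) + "." + extension


files = [
    (date(2013, 4, 12), "interview-recording", "THD", "mp3"),
    (date(2012, 12, 15), "interview-transcript", "MBD", "docx"),
]
by_date = sorted(build_name(*f, order="date") for f in files)
by_subject = sorted(build_name(*f, order="subject") for f in files)
```

Writing dates as YYYY-MM-DD means that plain alphabetical sorting, as done by most file browsers, is also chronological sorting.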
Supporting documentation (1)
Supporting documentation is information in separate files that accompanies data in order to provide context, explanation, or instructions on confidentiality and data use or reuse.
Source: Dublin UCD Library
Supporting documentation (2)
Examples of supporting documentation include:
• Information about the project and data creators
• Working papers or laboratory notebooks
• Questionnaires or interview guides
• Codebooks
• Details on how the data were created, analysed, anonymised etc.
• Final project reports and publications
Source: Dublin UCD Library
Metadata
There are three broad categories of metadata:
• Descriptive - common fields such as title, author, abstract, keywords which help users to discover online sources through searching and browsing.
• Administrative - preservation, rights management, and technical metadata about formats.
• Structural - how different components of a set of associated data relate to one another, such as a schema describing relations between tables in a database.
Source: Dublin UCD Library
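The three categories can sit side by side in one metadata record. A minimal sketch with an invented record (all titles, names and values are made up for illustration), plus a small search over the descriptive fields to show how they support discovery:

```python
# Hypothetical metadata record; every value is invented for illustration.
record = {
    "descriptive": {
        "title": "Maize fertilizer response trial",
        "author": "Example Author",
        "abstract": "Plot-level yields from an on-farm fertilizer trial.",
        "keywords": ["maize", "fertilizer", "ISFM"],
    },
    "administrative": {
        "license": "CC BY 4.0",
        "file_format": "CSV",
        "retention": "archive after project end",
    },
    "structural": {
        # How the tables of the dataset relate to one another.
        "tables": ["plots", "harvests"],
        "relations": [{"from": "harvests.plot_id", "to": "plots.plot_id"}],
    },
}


def find_datasets(records, keyword):
    """Return titles of records whose descriptive metadata mention the keyword."""
    keyword = keyword.lower()
    hits = []
    for rec in records:
        desc = rec["descriptive"]
        text = " ".join([desc["title"], desc["abstract"], *desc["keywords"]]).lower()
        if keyword in text:
            hits.append(desc["title"])
    return hits


maize_hits = find_datasets([record], "maize")
cassava_hits = find_datasets([record], "cassava")
```

Only the descriptive block is searched here; the administrative and structural blocks serve the people and systems that store, license and join the data rather than those looking for it.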