ALIGNED Data Curation Methods and Tools
Rob Brennan, ALIGNED Coordinator
SWIMing VoCamp Workshop,
Dublin, 22 March 2016
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644055.
This communication reflects only the author’s view and the Commission is not responsible for any use that may be made of the information it contains.
[Diagram: two lifecycles and the roles involved in them]
Software Engineering Lifecycle:
• Planning
• Analysis
• Design
• Implementation
• Maintenance
Data Engineering Lifecycle:
• Extract
• Store/Query
• Manual Revision/Author
• Inter-linking/Fusing
• Classify/Enrich
• Quality Analysis
• Evolve/Repair
• Search/Browse/Explore
Roles:
• Application Users
• Data Harvesters
• Dataset Domain Experts
• Software Developers
• System Admins
• Data Architects
• Dev. Managers
• Software Testers
• Data Consumers
• Software Analysts
• System Analysts
Overall Goal: How can we get these guys to talk?
To improve: Productivity, Agility, Quality?
Data Quality and Data Curation in ALIGNED
• Building high quality data-intensive systems requires high quality datasets
• But
– Datasets are now first-class citizens with lifecycles that are independent of the consuming apps
– Quality is still problematic
• We observe:
– Rich data models support quality engineering
– Linked Data is entering the enterprise
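To illustrate the observation that rich data models support quality engineering, here is a minimal sketch of schema-driven validation. It is not taken from any ALIGNED tool: the `PropertySpec` class, the `POLITY_SCHEMA` dictionary and its property names are hypothetical, chosen only to show how typed properties and a cross-property constraint let quality checks be derived from the model rather than hand-written per application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertySpec:
    """Hypothetical description of one property in a data model."""
    datatype: type
    required: bool = True


# Illustrative schema only; the property names are assumptions.
POLITY_SCHEMA = {
    "name": PropertySpec(str),
    "start_year": PropertySpec(int),
    "end_year": PropertySpec(int),
    "capital": PropertySpec(str, required=False),
}


def validate(record: dict, schema: dict) -> list:
    """Return a list of human-readable quality problems (empty if none)."""
    problems = []
    # Check each declared property for presence and datatype.
    for prop, spec in schema.items():
        if prop not in record:
            if spec.required:
                problems.append(f"missing required property: {prop}")
            continue
        value = record[prop]
        if not isinstance(value, spec.datatype):
            problems.append(
                f"{prop}: expected {spec.datatype.__name__}, "
                f"got {type(value).__name__}"
            )
    # Flag properties the model does not know about.
    for prop in record:
        if prop not in schema:
            problems.append(f"unknown property: {prop}")
    # A cross-property constraint that relies on both years being typed.
    start, end = record.get("start_year"), record.get("end_year")
    if isinstance(start, int) and isinstance(end, int) and start > end:
        problems.append("start_year is after end_year")
    return problems
```

Calling `validate(record, POLITY_SCHEMA)` on each incoming record yields a list of problems that can be logged or shown to a curator. In a Linked Data setting the same idea would normally be expressed in a schema or constraint language rather than in application code.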
ALIGNED Tools for Data Curation: Productivity, Agility, Quality
Data Engineering:
• Data Quality Validation
• Unified Process Governance
• Data Integrity Assurance
• Data Integration Assurance
• Semi-Supervised Data Curation
• Linked Data Extract, Transform, Load
• Taxonomy Management
• Dataset Release Automation
See: http://aligned-project.eu/open-source-tools/ and https://www.poolparty.biz/
ALIGNED Validates in Real-World, Data-Intensive Systems
• Global History Databank
• Legal Information System
• Nucleus for the Web of Data
• Semantic Middleware
Example: Seshat Target System
“improve the extraction of collective intelligence from electronic archives, research communities and data consumers to improve the quality of published data”
[Diagram labels: Electronic Archives; databases; Community of experts & Volunteers; Data Consumers; Collective Intelligence; Seshat Databank; High Quality Open Data; Feedback]
Seshat Data Web
[Architecture diagram; labels recovered from the slide]
• Users: Seshat Editor, Seshat Administrator, Seshat Contributors, Seshat Analyst, Seshat Reader
• Tools and services: Wiki; RDF Triple Store; Linked Data Publication; User Management; Schema Management tool; Wiki Data Entry/Validation Tool; Wiki Generation tool; Workflow Management; Candidate Generation/Filtering tools; Data Quality Controls; Data Transformations; Data Visualisations; Time Series Analysis; Data Export Tool
• Data: Seshat Schema Knowledge Model; Seshat Data Knowledge Model; Copy of Seshat Data; Seshat Data Web Pages; Data Dump File (TSV); Links to other Datasets; DBpedia; External candidate source
• Interactions: Read/query; Enter Data; Validate Candidate; Errors; Feedback; View Data; Read Data; generate
Global History Databank Pilot Data Curation System
The goal is to minimise the work required from expert users (domain experts, architects) and to ensure data quality in different dimensions at different steps in the process.
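One way to picture this goal is a triage step in which automated checks, one per quality dimension, handle the routine cases and only candidates that fail a check reach an expert. The sketch below is an assumption-laden illustration, not the Dacura implementation: the dimension names, the candidate fields (`entity`, `property`, `value`, `source`) and the `triage` function are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[dict], bool]

# Hypothetical automated checks, one per quality dimension.
CHECKS: Dict[str, Check] = {
    # Completeness: the core fields must be present and non-empty.
    "completeness": lambda c: all(
        c.get(key) not in (None, "") for key in ("entity", "property", "value")
    ),
    # Provenance: every candidate must say where it came from.
    "provenance": lambda c: bool(c.get("source")),
}


def triage(
    candidates: List[dict], checks: Dict[str, Check] = CHECKS
) -> Tuple[List[dict], List[Tuple[dict, List[str]]]]:
    """Split candidates into auto-accepted ones and ones needing expert review.

    Each entry in the review list carries the names of the failed
    dimensions, so the expert sees why the candidate was flagged.
    """
    accepted = []
    needs_review = []
    for candidate in candidates:
        failed = [name for name, check in checks.items() if not check(candidate)]
        if failed:
            needs_review.append((candidate, failed))
        else:
            accepted.append(candidate)
    return accepted, needs_review
```

With this shape, adding a new quality dimension means adding one entry to `CHECKS`, and expert effort scales with the number of flagged candidates rather than with the size of the dataset.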
Dacura: Generic, Quality-Oriented Data Curation Process
Dacura Data Harvesting Interfaces
Partners
• Knowledge and Data Engineering Group/ADAPT Centre, Trinity College Dublin
• Software Engineering Group, University of Oxford
• Institute of Cognitive and Evolutionary Anthropology, University of Oxford
• Agile Knowledge Engineering and Semantic Web Group, Universität Leipzig
• Semantic Web Company GmbH
• Content Strategy and Architecture Department, Wolters Kluwer Germany, Wolters Kluwer Poland
• Institute of Prehistory, Adam Mickiewicz University at Poznan
We want to help you! The ALIGNED Consultancy Program
• Are you a business?
• Do any of these apply:
– Are you building data-intensive applications?
– Do you want to curate high quality data?
– Need help integrating Linked Data + apps?
– Want to integrate your software and data engineering teams?
Call on the ALIGNED consultancy program!
http://aligned-project.eu/aligned-consultancy-program-opportunities/