Post on 11-Jan-2016
EDDI: Introduction to SDMX
Arofan Gregory
Open Data Foundation
What is SDMX?
• The problem space:
  – Statistical collection, processing, and exchange is time-consuming and resource-intensive
  – Various international and national organisations have individual approaches for their constituencies
  – Uncertainties about how to proceed with new technologies (XML, web services …)
[Figure: flows of transactions, accounts and statistics between banks, corporates and individual households, national statistical organisations, and international and regional organisations across 180+ countries; data are located via Internet search and navigation on sites such as www.x.org, www.y.org, www.z.org and www.hub.org]
What is SDMX?
The Statistical Data and Metadata Exchange (SDMX) initiative is taking steps to address the challenges and opportunities described above:
  – By focusing on business practices in the field of statistical information
  – By identifying more efficient processes for exchange and sharing of data and metadata using modern technology
Historical Note
• SDMX uses an approach based on the 10-year-long success of an earlier standard, GESMES/TS
• GESMES/TS is used today in many countries for collecting, exchanging, and updating statistical databases
  – GESMES/TS is now SDMX-EDI
• Its focus is on time series, and it is mostly used by central banks
Who is SDMX?
• SDMX is an initiative made up of seven international organizations:
  – Bank for International Settlements
  – European Central Bank
  – Eurostat
  – International Monetary Fund
  – Organisation for Economic Cooperation and Development
  – United Nations
  – World Bank
• The initiative was launched in 2002
SDMX Products
• Technical standards for the formatting and exchange of aggregate statistics:
  – SDMX Technical Specifications version 1.0 (now ISO/TS 17369 SDMX)
  – SDMX Technical Specifications version 2.0 (submitted to ISO)
  – SDMX Technical Specifications version 2.1 under review (will be forwarded to ISO)
• Content-Oriented Guidelines
  – Common Metadata Vocabulary
  – Cross-Domain Statistical Concepts
  – Statistical Subject-Matter Domains
Detailed SDMX Goals
• Reducing the national reporting burden to international institutions
• Fostering consistency, accuracy, and timeliness between data and metadata disseminated by national and international institutions, relying on what is decentrally released via national websites
• Enhancing national statistical processing efficiency, especially through internationally-recognised standard formats for exchanges between statistical silos within institutions and with other national statistical agencies
• Providing standards for web-based dissemination formats that are computer readable and facilitate updating of databases
• Enhancing comparison of data and metadata analysis through standard formats and content-oriented guidelines
Official Recommendations
• SDMX has been officially recommended:
  – February 2007: SDMX endorsed by the European Union’s Statistical Programme Committee
  – March 2008: UN Statistical Commission declares SDMX to be the preferred standard for data and metadata
Exchange Patterns
• Bilateral: Institutions exchange data according to bilateral agreements regarding format, timing, protocols, etc.
• Gateway: Institutions share the data they collect with their peers, in agreed formats among counterparty communities
• Data-sharing: standard exchange of data using standard formats and protocols
Bilateral Exchange
Gateway Exchange
Data-Sharing Exchange
Notes About Data-Sharing
• Data-sharing only works if there are standard formats
• Data-sharing works only if the data themselves are decentralized
  – One big database doesn’t work!
• Like the Web itself, a data-sharing model relies on pull exchanges, not push exchanges
  – Data consumers discover the data they need, and its location, and then go and get it
  – Data producers don’t have to send data
SDMX View
• SDMX products support all types of exchange
• One major requirement is to work well with existing systems, to protect technology investments
• SDMX promotes an incremental movement toward the data-sharing model
Exchange with Peer Organizations
• SDMX-EDI and SDMX-ML are both able to exchange databases between peer organizations
• Structural metadata is also exchanged and can be read by counterparty systems
• Incremental updating is possible
• Increases degree of automation for exchange – lowers degree of bilateral, verbal agreement
• Can use “pull” instead of “push” if registry is deployed
Integration within an Organization
• SDMX standard formats are also useful within an organization
  – Many organizations have several disparate databases
  – Differences in database structure and content can make it difficult to use another system’s data
  – SDMX-ML provides a way to loosely couple such databases, while facilitating exchange
  – An SDMX registry can allow visibility into other databases, while not affecting control or ownership of data
Data Collection and Warehousing
• When data is collected from many different sources, it can be in a wide variety of formats– Typically metadata-poor
• SDMX allows for a single, metadata-rich reporting format for each type of data
• Existing counterparty systems can be “wrappered” to support SDMX for exchange only
Adoption of SDMX
• SDMX has been aggressively adopted, as compared to other international technology standards
  – Many important data sets are available in SDMX-ML today
  – There are many prototypes and planned projects at the national and international level
  – Increasing numbers of tools are available which support SDMX
Adopters/Interest
• The following are known adopters (or planning to adopt):
  – US Federal Reserve Board and Bank of New York
  – European Central Bank
  – Joint External Debt Hub (WB, IMF, OECD, BIS)
  – UN/TRADECOM at UN Statistical Division
  – NAAWE (National Accounts from OECD/Eurostat)
  – European Statistical System (Eurostat and National Statistical Institutes)
  – Mexican Federal System
  – Vietnamese Ministry of Planning and Investment
  – Qatar Information Exchange
  – IMF (BOP, SNA, SDDS/GDDS)
  – Food and Agriculture Organization
  – Millennium Development Goals (UN System, others)
  – International Labor Organization
  – Bank for International Settlements
  – OECD
  – World Bank World Development Indicators (WDI)
  – Marchioness Islands (Spanish/Portuguese Statistical Region)
  – UNESCO (Education)
  – Australian Bureau of Statistics
  – WHO (SDMX-HD)
  – Statistics Canada
  – There are many others!
SDMX and Domains
• SDMX is organized as a central standard, created and supported by the SDMX Initiative
  – Each statistical domain creates its own domain standard
  – Example: WHO has created SDMX-HD (“Health Domain”) for monitoring disease outbreaks/epidemiology
  – Example: UNESCO and Eurostat have developed standard SDMX applications for Education Statistics
• You should look at the work in the different domains when applying SDMX to different national-level statistics collection
US Federal Reserve Board
• Several important data sets are available – and searchable at a granular level – using SDMX
• SDMX-ML is both a web-delivery format and an internal exchange format for production of data
http://www.federalreserve.gov/datadownload/default.htm
Federal Reserve Bank of New York
• Historical data – once stored in huge CSV files – is now available as SDMX-ML
• Increased the use of the site
• The “typical user” is now a machine
http://www.newyorkfed.org/xml/index.html
European Central Bank
• ECB uses SDMX-EDI to exchange data with European Central Banks
• SDMX-ML is used for web dissemination
  – Simultaneous release on many CB sites
  – Each site can use its own language and look & feel
  – Data warehouse now available in SDMX-ML
• Built and maintained using SDMX standards
  http://www.ecb.int/stats/exchange/eurofxref/html/index.en.html
  http://stats.ecb.europa.eu/stats/sdmx/visualisation/icp/dashboard/rc1/
• ECB’s Statistical Data Warehouse/web service
OECD
• Data structures are specified using SDMX standards
• Data sets are held in SDMX-ML format and navigated “on the fly”
  – OECD.Stat: http://stats.oecd.org/WBOS/index.aspx
• Experimenting with graphical presentation of data
• Serves all OECD data as SDMX through OECD.Stat web service
Eurostat
• Builds on long experience of using GESMES for data transmission (GESMES is main format for transmission of data in several important domains e.g. national accounts, balance of payments, short-term statistics)
• More than 50 Data Structure Definitions for GESMES developed and maintained (in partnership with ECB)
• Software components developed and made available as open-source software (see Tools page of SDMX website)
• Now creating a portal for all European Census data, collected as SDMX
SDMX Specifications and Products
SDMX Information Model: High-Level Schematic
[Diagram. Objects shown: Data Provider, Provision Agreement, Data or Metadata Flow, Data or Metadata Structure Definition, Data or Metadata Set, Registered Data or Metadata Set, Category, Category Scheme.]
Relationship labels in the diagram:
• can provide data/metadata for many data/metadata flows using agreed data/metadata structure
• can get data/metadata from multiple data/metadata providers
• conforms to business rules of the data/metadata flow
• publishes/reports data/metadata sets
• uses specific data/metadata structure
• registers existence of data and metadata
• is registered for
• comprises subject or reporting categories
• can have child categories
• can be linked to categories in multiple category schemes
SDMX Technical Specs v 1.0
• Information Model (data structure definitions and data formats)
• SDMX-ML: XML formats for data structure definitions and data
• SDMX-EDI: EDI formats for data structure definitions and data
• Web-Services Guidelines
• User Guide
Technical Notes on Version 1.0
• Only numeric observations were supported
• Only coded key values were supported
• Intended to provide an XML version of the existing GESMES/TS data model
  – GESMES/TS became SDMX-EDI
  – XML extended the data model to provide for more types of groups and cross-sectional data
• Hierarchical codelists not supported
SDMX Technical Spec v. 2.0
• Expanded data model includes
  – Registry interfaces
  – Metadata structures and formats
  – Data and metadata provisioning
  – Other advanced features (process flow, reporting taxonomy, structure mapping, etc.)
• Data formats now include uncoded dimensions, hierarchical codelists, and non-numeric observations
Technical Notes on Version 2.0
• A very large expansion of scope
  – Model covers the process of statistical exchange, not just the data formats
  – Many cases which version 1.0 could not support were included in version 2.0 as a result of implementations
• Full support for the “data sharing” pattern of exchange
  – Resulting from the inclusion of the registry
Changes for Version 2.1
• Expanded Web Services Guidelines
  – Standard WSDL Functions
  – Standard RESTful syntax (URL-based API)
  – Standard Error Codes
  – Will allow for interoperable web services for SDMX, so generic clients can use multiple sources
• Simplified Data Formats
  – All data formats will be more consistent
  – Cross-sectional and time-series formats are more similar
• SDMX Query has been improved
• Note: SDMX 2.1 is available for public review now!
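To make the idea of a URL-based API concrete, here is a minimal Python sketch of how a generic client might assemble a data query. It assumes the path pattern `<base>/data/<flow>/<key>` with dot-separated dimension values and `startPeriod`/`endPeriod` parameters, as in the published SDMX 2.1 REST guidelines; the base URL, the `DEBT` dataflow id and the function name are hypothetical and only for illustration.

```python
from urllib.parse import urlencode


def build_data_url(base_url, flow_ref, key_parts, start_period=None, end_period=None):
    """Assemble a URL-based data query: <base>/data/<flow>/<key>?<params>."""
    # Dimension values are joined with dots; an empty string leaves
    # that dimension unconstrained (assumed wildcard convention).
    key = ".".join(key_parts)
    url = f"{base_url.rstrip('/')}/data/{flow_ref}/{key}"
    params = {}
    if start_period:
        params["startPeriod"] = start_period
    if end_period:
        params["endPeriod"] = end_period
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical service and dataflow; the key follows the slides' example
# order Frequency, Country, Topic, Stock/Flow.
example_url = build_data_url(
    "https://stats.example.org/ws", "DEBT", ["Q", "ZA", "B", "1"],
    start_period="1999-Q1",
)
print(example_url)
```

Because every source would answer the same URL shape, the same function could be pointed at several services by changing only `base_url`.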
SDMX Content-Oriented Guidelines
• Four documents:
  – Overview
  – Metadata Common Vocabulary
  – Cross-Domain Concepts
  – Statistical Subject-Matter Domains
• These will not become ISO specifications, but will evolve as publications of the SDMX Initiative
Metadata Common Vocabulary
• A set of terms and definitions for the different parts of the SDMX technical standards, and many common concepts used in data and metadata structures
• Does not replace other major vocabularies in this space (such as the OECD glossary) but references these other works
Cross-Domain Concepts
• Includes concepts which are common across many statistical domains
  – Names & Definitions
  – Representations
• These are concepts which support both data and metadata structures
Statistical Subject-Matter Domains
• Based on the UN/ECE classification of statistical activities
• Provides a classification system for use in exchanging statistics across domain boundaries
• Provides a breakdown of the various domains within official statistics
SDMX and Data Formats
Data Set
We have a dataset, what do we need to know?
• Version 1.0
  – What it is and how it is structured
• Version 2.0
  – Who reports/disseminates it
  – How a specific data set fits into the overall collection framework and which organisation is responsible for reporting which parts
  – The reporting/publication schedule
  – That it has been reported/published
Data Set: Structure
First: Identify the Concepts
• A concept is a unit of knowledge created by a unique combination of characteristics (SDMX Information Model)
Data Set Structure: Concepts
Computers need structure of data:
• Concepts
• Code lists
• Data values
• How these fit together
[Figure: concepts in the example data set: Unit Multiplier, Unit, Topic, Time/Frequency, Country, Stock/Flow]
Data Set Structure: Code Lists
Code Lists
TOPIC
A Brady Bonds
B Bank Loans
C Debt Securities
COUNTRY
AR Argentina
MX Mexico
ZA South Africa
STOCK/FLOW
1 Stock
2 Flow
CONCEPTS
Topic
Country
Flow
Concepts
16457
Q,ZA,B,1,1999-06-30=16547
Data Makes Sense
Data Set Structure: Defining Multi-dimensional Structures
• Comprises
  – Dimensions: concepts that identify the observation value
  – Attributes: concepts that add additional metadata about the observation value
  – Measure: the concept that is the observation value
  – Representation: any of these may be
      • coded
      • text
      • date/time
      • number
      • etc.
Data Set Structure: Concept Usage
Concept          Role
Unit Multiplier  Attribute
Unit             Attribute
Topic            Dimension
Time             Dimension
Frequency        Dimension
Country          Dimension
Stock/Flow       Dimension
Observation      Measure
Data Structure Definition
[Diagram: a Data Structure Definition comprises Dimensions, Attributes and Measures. Dimensions are the concepts that identify the observation (the Key) and the concepts that identify groups of keys (the Group Key); Attributes are concepts that add metadata; Measures are concepts that are the observed phenomenon. Each takes its semantic from a Concept (e.g. Topic, Country, Flow) and has a format given by its Representation, which is either coded, in which case it has a Code List (e.g. TOPIC: A Brady Bonds, B Bank Loans, C Debt Securities), or non-coded.]
16457
Q,ZA,B,1,1999-06-30=16547
Data Makes Sense
Frequency,Country,Topic,Stock/Flow,Time=Observation
Quarterly, South Africa, Bank Loans, Stocks, 2nd quarter 1999
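The decoding step above can be sketched in a few lines of Python. This is only an illustration: the code lists are the ones shown on the slides, except the frequency list, which is assumed from the decoded label "Quarterly"; the identifiers `CODE_LISTS`, `KEY_ORDER` and `decode` are made up for this example.

```python
# Code lists from the slides; FREQ is an assumption (only "Q" is ever shown).
CODE_LISTS = {
    "FREQ": {"Q": "Quarterly"},
    "COUNTRY": {"AR": "Argentina", "MX": "Mexico", "ZA": "South Africa"},
    "TOPIC": {"A": "Brady Bonds", "B": "Bank Loans", "C": "Debt Securities"},
    "STOCK_FLOW": {"1": "Stock", "2": "Flow"},
}

# Order of the concepts in the key: Frequency,Country,Topic,Stock/Flow,Time
KEY_ORDER = ["FREQ", "COUNTRY", "TOPIC", "STOCK_FLOW", "TIME"]


def decode(record):
    """Turn 'Q,ZA,B,1,1999-06-30=16547' into labelled values."""
    key_text, value_text = record.split("=")
    parts = key_text.split(",")
    if len(parts) != len(KEY_ORDER):
        raise ValueError(f"expected {len(KEY_ORDER)} key values, got {len(parts)}")
    decoded = {}
    for concept, code in zip(KEY_ORDER, parts):
        code_list = CODE_LISTS.get(concept)
        # Coded concepts are looked up; non-coded ones (TIME) pass through.
        decoded[concept] = code_list[code] if code_list else code
    decoded["OBS_VALUE"] = int(value_text)
    return decoded


result = decode("Q,ZA,B,1,1999-06-30=16547")
print(result)
```

Without the structure (concept order plus code lists), the bare number carries no meaning; with it, each position in the key resolves to a label.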
Identifying Concepts
• Identifying Concepts - Sources
  – Existing data set tables
      • From website
      • From applications
  – Data Collection Instruments
      • Questionnaires
      • Excel spreadsheets
  – Regulations, Handbooks, User Guides
      • Labour Statistics Convention, 1985 (No. 160), Recommendation, 1985 (No. 170)
      • Council Regulation No: 311/76/EEC of 09/02/1976; OJ: L039 of 14/02/1976; Compilation of statistics on foreign workers
  – Database Tables
  – Existing Data Structure Definitions
      • From other organisations
Identify Concepts – from website
Source: FAO proof of concept project
Measurement = 1,000 Kg
Concepts
Reference Region
Commodity
Frequency and Time
Observation Value
Measure Type
Unit and Unit Multiplier
Measurement = 1,000 Kg
Concept Role: Reminder
• Dimensions
  – Are the concepts that identify the observation value
• Attributes
  – Are the concepts that add additional metadata about the observation value
• Measure
  – Is the concept that is the observation value
Exercise: Concept Role
Concept                   Role
Reference Region          Dimension
Commodity                 Dimension
Frequency and Time        Dimensions
Measure Type              Dimension
Observation Value         Measure
Unit and Unit Multiplier  Attributes
(Measurement = 1,000 Kg)
Data Set and Structure
Dimension Concepts: FREQ, REF_AREA_REG, COMMODITY, MEASURE_TYPE, TIME
Measure Concept: OBS_VALUE
Attribute Concepts: OBS_STATUS, OBS_CONF, UNIT, UNIT_MULTIPLIER
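One way to picture how the three concept roles fit together is a small Python sketch. The concept ids come from the slide; the class, its method and the sample record values (`AFRICA`, `WHEAT`, and so on) are invented for illustration, and attributes are treated as optional, which is an assumption rather than something the slides state.

```python
from dataclasses import dataclass, field


@dataclass
class DataStructureDefinition:
    dimensions: list   # concepts that identify the observation value
    measure: str       # the concept that is the observation value
    attributes: list = field(default_factory=list)  # concepts that add metadata

    def split_record(self, record):
        """Separate a flat record into (key, observation value, attributes)."""
        missing = [c for c in self.dimensions + [self.measure] if c not in record]
        if missing:
            raise ValueError(f"missing concepts: {missing}")
        known = set(self.dimensions) | {self.measure} | set(self.attributes)
        unknown = sorted(set(record) - known)
        if unknown:
            raise ValueError(f"concepts not in the structure: {unknown}")
        key = tuple(record[d] for d in self.dimensions)
        attrs = {a: record[a] for a in self.attributes if a in record}
        return key, record[self.measure], attrs


FAO_DSD = DataStructureDefinition(
    dimensions=["FREQ", "REF_AREA_REG", "COMMODITY", "MEASURE_TYPE", "TIME"],
    measure="OBS_VALUE",
    attributes=["OBS_STATUS", "OBS_CONF", "UNIT", "UNIT_MULTIPLIER"],
)

# Hypothetical record; none of these values come from the FAO project.
sample = {
    "FREQ": "A", "REF_AREA_REG": "AFRICA", "COMMODITY": "WHEAT",
    "MEASURE_TYPE": "PRODUCTION", "TIME": "2005",
    "OBS_VALUE": 1234, "UNIT": "KG", "UNIT_MULTIPLIER": "3",
}
key, value, attrs = FAO_DSD.split_record(sample)
print(key, value, attrs)
```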
Identify/Define Code Lists
• Purpose of a Code List
  – Constrains the value domain of concepts when used in a structure like a data structure definition
  – Defines a shortened language independent representation of the values
  – Gives semantic meaning to the values, possibly in multiple languages
• Agreeing on harmonised code lists is the most difficult aspect of defining a data structure definition
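The three purposes of a code list can be illustrated with a short Python sketch. The country codes and English names are from the earlier slides; the `CodeList` class, the list id `CL_COUNTRY` and the French labels are assumptions added for the example.

```python
class CodeList:
    """A code list: short codes with labels in one or more languages."""

    def __init__(self, list_id, labels):
        self.list_id = list_id
        self._labels = labels  # code -> {language: label}

    def is_valid(self, code):
        # Constrains the value domain: only listed codes are allowed.
        return code in self._labels

    def label(self, code, language="en"):
        # Gives semantic meaning; falls back to English if no translation.
        names = self._labels[code]
        return names.get(language, names["en"])


# Codes and English names from the slides; French labels are illustrative.
CL_COUNTRY = CodeList("CL_COUNTRY", {
    "AR": {"en": "Argentina", "fr": "Argentine"},
    "MX": {"en": "Mexico", "fr": "Mexique"},
    "ZA": {"en": "South Africa", "fr": "Afrique du Sud"},
})

print(CL_COUNTRY.is_valid("ZA"), CL_COUNTRY.label("ZA", "fr"))
```

The code `ZA` itself is what gets exchanged; the label is looked up locally in whatever language the receiving application needs.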
Data Structure Definition - Reminder
[Diagram: Dimensions (concepts that identify the observation, and concepts that identify groups of keys), Attributes (concepts that add metadata) and Measures (concepts that are the observed phenomenon) each take their semantic from a Concept and have a format given by a Representation, which is coded (has a Code List) or non-coded.]
SDMX and Data Formats
Session: SDMX Syntax Implementations for Data
SDMX Data Syntax Implementations
• SDMX provides for two main syntaxes:
  – UN/EDIFACT (for SDMX-EDI)
  – XML (for SDMX-ML)
• Each syntax provides a format for describing data structure definitions
• Each syntax provides at least one format for data
  – There are 4 different XML syntaxes for data
SDMX-EDI
• EDI – “electronic data interchange” – is an older, flat-file syntax used primarily to conduct e-commerce
  – There have been a few statistical messages
  – GESMES is the “generic statistical message”
• EDI messages are difficult to read unless you know EDI very well…
Benefits of SDMX-EDI
• As a data format, it is very compact
  – Good for very large data sets
• Permits incremental updating of data sets
• Permits attributes and observations to be sent separately
• Has a very large installed base within the European community and the central banks (used by 180 countries)
• It is not very Web-oriented, however
SDMX-ML Document Types (Data)
• Structure Message: Holds the agencies, concepts, codelists, and data structure definitions (DSDs)
• Generic Format: A single XML schema for all different types of data, regardless of data structure definition
• Utility Format: Specific to DSD, provides strongest validation
• Compact Format: Like the EDI message, compact, but not as much validation as Utility
• Cross-Sectional Format: Similar to Compact, but holds cross-sectional data
• Data Query Message: Allows for querying of online databases and similar applications which are SDMX-aware. Supports web services.
The SDMX-ML Data Formats
• In designing the XML formats for SDMX, several different needs were identified
  – Needed an XML format for describing data structure definitions
  – Needed an XML version of the EDIFACT messages for transmitting large databases
  – Needed an XML which would help validate statistical data sets
  – Needed an XML which could be used generically for any statistical data set
  – Needed an XML for transmitting cross-sectional data
  – Needed a message to query for data
• Because SDMX-ML is based on the SDMX Information Model, it was decided to create several equivalent XML data formats, to satisfy each of these cases
  – Requirements were mutually exclusive for these cases
Generic Data Message
• No validation
• Carries data for any data structure definition
• Verbose – files are very large
• Can perform incremental updates and carry partial data sets
• Useful for applications which need to carry potentially incorrect data for processing and cleaning
• Useful for generic applications which handle data for more than one DSD
• Serves as a “pivot format” between other SDMX-ML format types
Utility Data Message
• Provides strongest validation – all business rules in DSD are enforced by a generic XML parser (schemas are specific to particular DSDs)
• Less verbose than Generic; more verbose than Compact & Cross-Sectional
• Incremental updates not supported
• For XML tools, this is the most “normal” type of XML schema – performs best
Compact Data Message
• Equivalent of SDMX-EDI data format, but schemas are specific to a particular DSD
• Good for exchanging partial data sets and incremental updates
• Very compact (for XML) in terms of file sizes
• Very simple, but performs limited validation
  – Will validate codelists, but not some other things
Cross-Sectional Data Message
• Similar to Compact format, but allows for lots of observations for a single point in time (not time-series oriented like other formats)
• Very compact
• Supports incremental updates
• Provides limited validation – schemas are specific to a particular DSD
Selecting the Right SDMX-ML Format
• Free tools allow transformation between data formats without any loss – each application can use one or more formats for specific tasks
• Depending on the application, one format may be preferable to another
  – How large are the data files?
  – How much validation needs to be performed?
  – How many DSDs are supported by the application?
  – Will all data be correct when received (according to the DSD)?
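The selection questions above can be turned into a rough filter over the format properties listed on the preceding slides. The Python sketch below is a simplification, not an official decision procedure: the numeric validation scale, the class and the function are invented here, and cross-sectional data is treated as served only by the Cross-Sectional format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatProfile:
    name: str
    validation: int            # 0 = none, 1 = limited, 2 = strongest
    dsd_specific: bool         # schema is specific to one DSD
    incremental_updates: bool
    cross_sectional: bool


# Properties as summarised on the slides for each data message type.
PROFILES = [
    FormatProfile("Generic", 0, False, True, False),
    FormatProfile("Utility", 2, True, False, False),
    FormatProfile("Compact", 1, True, True, False),
    FormatProfile("Cross-Sectional", 1, True, True, True),
]


def candidate_formats(min_validation=0, many_dsds=False,
                      need_incremental=False, cross_sectional=False):
    """Return the names of formats that satisfy the stated needs."""
    result = []
    for p in PROFILES:
        if p.validation < min_validation:
            continue
        if many_dsds and p.dsd_specific:
            continue
        if need_incremental and not p.incremental_updates:
            continue
        if p.cross_sectional != cross_sectional:
            continue
        result.append(p.name)
    return result


print(candidate_formats(min_validation=1, need_incremental=True))
```

Since the formats can be transformed into one another without loss, an application can apply such a filter per task rather than once for the whole system.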
SDMX-ML “Model-Driven” XML Approach
DSD
Additional SDMX Features
• Hierarchical Code List
• Structure Set (mappings)
• Reporting Taxonomy
Hierarchical Code Lists – Example Scenario
• France is a country
• France is part of the continent of Europe
• France is a member of NATO
• France is a member of the EU
• France is a member of the G10
• When I analyse statistics I might want to see totals by
  – continent
  – trading block
  – military alliance
  – financial grouping
• France will be grouped with different sets of countries depending on the “view” required
• How do we express these groupings?
Reference Area (code list)
6B NATO
B0 EU
B1 NAFTA
BE Belgium
BG Bulgaria
CA Canada
CH Switzerland
CZ Czech Republic
DE Germany
DK Denmark
E1 Europe
E8 North America
EE Estonia
ES Spain
FI Finland
FR France
GB United Kingdom
GR Greece
HU Hungary
JP Japan
I2 Euro 12
IT Italy
NL Netherlands
US United States

Code Composition: each hierarchy assigns a parent code to its member codes
Europe (parent E1): BE, BG, CH, CZ, DE, DK, EE, ES, FI, FR, GB, etc.
EU countries (parent E0): BE, CZ, DE, DK, EE, ES, FI, FR, GB, etc.
NATO countries (parent 6B): BE, BG, CA, CZ, DE, DK, EE, ES, FR, GB, etc.
NAFTA countries (parent B1): CA, US, MX
North America (parent E8): CA, US
G10 countries (parent G0): BE, CA, CH, DE, FR, GB, JP, IT, NL, SE, US
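The effect of these groupings can be sketched in Python: the same country codes roll up to different totals depending on which hierarchy ("view") is applied. The parent codes follow the tables above, restricted to a handful of countries; the observation values and the function name are invented purely for illustration.

```python
from collections import defaultdict

# Each view maps a country code to its parent code in that hierarchy.
# Memberships follow the slides, limited to five countries for brevity.
HIERARCHIES = {
    "Europe": {"FR": "E1", "DE": "E1", "CH": "E1"},
    "NATO": {"FR": "6B", "DE": "6B", "CA": "6B"},
    "NAFTA": {"CA": "B1", "US": "B1"},
    "G10": {"FR": "G0", "DE": "G0", "CH": "G0", "CA": "G0", "US": "G0"},
}

# Made-up observation values per country.
OBSERVATIONS = {"FR": 10, "DE": 20, "CH": 5, "CA": 7, "US": 30}


def totals_by_view(observations, hierarchy):
    """Sum observations under the parent code given by one hierarchy."""
    totals = defaultdict(int)
    for code, value in observations.items():
        parent = hierarchy.get(code)
        if parent is not None:  # codes outside this view are ignored
            totals[parent] += value
    return dict(totals)


for view, hierarchy in HIERARCHIES.items():
    print(view, totals_by_view(OBSERVATIONS, hierarchy))
```

The base code list stays flat and unchanged; only the parent assignments differ between views, which is the point of keeping hierarchies separate from the codes themselves.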
Schematic of the Hierarchical Code Scheme
[Diagram. Objects shown: Hierarchical Code Scheme, Hierarchy (Hierarchy-1 to Hierarchy-4 in the example), Level, Code Association, Code Composition, Code, Code List, Property.]
• A Hierarchical Code Scheme comprises hierarchies and comprises code groups
• Code Association: relates a code to a parent code
• Code Composition: groups codes with the same parent
• A value based hierarchy has code groups
• A level based hierarchy has formal levels
• Property: properties of the association
• The codes may be in a variety of code lists
Item Scheme Maps
• Many types of “item scheme” use the same fundamental structure
  – Code list
  – Category scheme
  – Concept scheme
• Two Item Schemes can be mapped
Schematic of the “Code” Mapping
[Diagram: an Item Scheme Association links a source item scheme to a target item scheme and has Item Associations; each Item Association links a source item to a target item and has an Association Role. Specialisations shown: Code List Map (Code List, Code), Category Scheme Map (Category Scheme, Category), Concept Scheme Map (Concept Scheme, Concept).]
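A code list map is the simplest of these mappings to sketch. The Python below assumes a source list of two-letter country codes (as on the earlier slides) mapped to a hypothetical target list of three-letter codes; the list ids, the dictionary layout and the function are illustrative only.

```python
# One map = one source scheme, one target scheme, and item associations.
CODE_LIST_MAP = {
    "source": "CL_COUNTRY_2",   # hypothetical id of the source code list
    "target": "CL_COUNTRY_3",   # hypothetical id of the target code list
    "associations": {"AR": "ARG", "MX": "MEX", "ZA": "ZAF"},
}


def translate(code, code_map):
    """Map a source item to its target item, or fail loudly."""
    try:
        return code_map["associations"][code]
    except KeyError:
        raise ValueError(
            f"{code!r} has no association from {code_map['source']} "
            f"to {code_map['target']}"
        ) from None


mapped = [translate(c, CODE_LIST_MAP) for c in ["ZA", "MX"]]
print(mapped)
```

The same shape would serve for category schemes and concept schemes, since all three are item schemes with the same fundamental structure.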
Structure Maps
• Structures can also be mapped
  – Data structures
  – Metadata structures
Data/Metadata Reporting, Query, Analysis, Mapping
[Diagram: the high-level information model (Data Provider, Provision Agreement, Data or Metadata Flow, Data or Metadata Structure Definition, Data or Metadata Set, Registered Data Set or Metadata Set, Category, Category Scheme) extended with Structure & Item Scheme Maps, Content Constraint and Attachment Constraint.]
Reporting Taxonomy
• An SDMX Reporting Taxonomy is a group of data flows and/or metadata flows which form the basis of a single real-world document or report
• They can be organized into groups and sub-groups as needed
• They can be named and identified
• Useful for managing various types of reports over time
Processes
• SDMX 2.0 provides the ability to document the steps and logic of a process flow
• This is not executable, but serves as documentation to describe the processes which produce data and metadata
• It is useful as a target for the attachment of reference metadata describing processing
SDMX and Metadata Formats
Reference Metadata
• We have seen how data values are limited to where they belong
  – Series key (usually qualified by time)
• Data attribute values are limited in where they belong
  – Observation value
  – Series key
  – Group key
  – Data set
• Metadata is everywhere, but
  – it must be metadata about “something”
      • what is the “something”?
      • how is it identified?
  – it comprises concepts, which must be given a structure
• The Metadata Structure Definition answers these questions
• Advance release calendar is only one possible example
Metadata Example: Advance Release Calendar (ARC)
• What is the release calendar for?
  – Informs when data will be published/made available
• Who publishes the data set?
• What type of data is it (data flow)?
• What metadata is in the release calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
[Figure: release calendar for Labour Force Statistics]
Structure
• Concepts
• Hierarchies
• Representation (e.g. code list)
Metadata Structure Definition (MSD): Report Structure
[Diagram: a Metadata Structure Definition can comprise the specification of one or more Metadata Reports. A Metadata Report has Metadata Attributes, which can have hierarchy. Each Metadata Attribute takes its semantic and context from a Concept defined in a Concept Scheme, and has a Format and Permitted Value List giving the definition of format and permitted values.]
Example ARC Metadata
Day         Ref Area    Indicator  Ref Period     Time   Tolerance  Status
30-04-2007  INE, Spain  LF-H       Q: 31-03-2007  09:00  +24 Hr.    Final
30-04-2007  INE, Spain  LF-E       Q: 31-03-2007  09:00  +24 Hr.    Final
30-04-2007  ONS, UK     LF-H       Q: 31-03-2007  09:00  +48 Hr.    Final
30-04-2007  ONS, UK     LF-E       Q: 31-03-2007  09:00  +48 Hr.    Draft
Identifiers: Ref Area and Indicator
MSD Metadata Concepts: Advance Release Calendar
Concept Id: Description
REFERENCE_PERIOD: The time period to which a variable refers
RELEASE_DATE_TIME: The specific point in time that data or metadata are made available
DATE_TOLERANCE: The possible or permissible variance of a time period relative to a known point in time
RELEASE_STATUS: The state of preparedness of a statement on the availability of data or metadata
ANNOTATION: Additional metadata
MSD: Report Structure for ARC
[Diagram: Metadata Structure Definition ARC_METADATA specifies the Metadata Report ARC, whose Metadata Attributes take their concepts from the Concept Scheme MY_AGENCY:METADATA_CONCEPTS.]
Metadata attributes of report ARC: REFERENCE_PERIOD, RELEASE_DATE_TIME, RELEASE_STATUS, DATE_TOLERANCE, ANNOTATION

MSD: Metadata Report Structure
Metadata Report = ARC
Metadata Attribute  Representation
Reference_Period    Date/Time
Release_Date_Time   Date/Time
Release_Status      CL_Status (F Final, P Provisional)
Date_Tolerance      Time Value
Annotation          Text

Metadata Set: ARC Report Example
Metadata Structure = ARC_METADATA
Metadata Report = ARC
Metadata Attributes:
  Concept = Reference_Period, Value = 2007-03-31
  Concept = Release_Date_Time, Value = 2007-04-30T09:00
  Concept = Release_Status, Value = F
  Concept = Date_Tolerance, Value = +24Hr
  Concept = Annotation, Value = simultaneous release by ECB
Metadata Example: Advance Release Calendar (ARC)
• What is the release calendar for?
  – Informs when data will be published/made available
• Who publishes the data set?
• What type of data is it (data flow)?
• What metadata is in the release calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
Metadata Structure Definition (MSD): To which object is the metadata attached?
[Diagram: as in the report structure diagram, a Metadata Structure Definition can comprise the specification of one or more Metadata Reports, whose Metadata Attributes can have hierarchy, take their semantic and context from a Concept defined in a Concept Scheme, and have a Format and Permitted Value List. In addition, each Metadata Report links to a Target Identifier.]
Data Flows: Controlling Reporting and Publishing
[Diagram for the release calendar example. Objects shown: Data Provider, Provision Agreement, Data Flow, Structure Definition, Data Set.]
Relationship labels in the diagram:
• can provide data for many data flows using agreed data structure
• can get data from multiple data providers
• conforms to business rules of the dataflow
• publishes/reports data sets
• uses specific data structure
Controlling Data Reporting
[Same diagram, annotated with the example values 1A – INE Spain (data provider) and LF-H = labor force hours (data flow), linked by a Provision Agreement.]
Identify Structure
• Concepts
• Hierarchies
• Representation (e.g. code list)
MSD: Identifying the “Target”
[Diagram: in the Metadata Structure Definition for the release calendar, a Full Target Identifier and Partial Target Identifiers are made up of Identifier Components; each Identifier Component has a Target Object Type and an Item Scheme.]
• Target identifier: defines “keys” of object types to which metadata can be “attached”
• Identifier components: specify the identifier components (“key”) of the target object
• Target object type: identifies the target object type of the component
• Item scheme: identifies the code list or other type of list (e.g. Category Scheme) which defines the valid values that can be used when metadata are reported in a metadata set
MSD: Object Identification for ARC
[Diagram: Metadata Structure Definition ARC_METADATA with Metadata Report ARC and Full Target Identifier Data_Flow_Provider, whose Identifier Components have the Target Object Types Data Flow and Data Provider.]
MSD: Identifiers for ARC
Metadata Structure Definition = ARC_METADATA
Target = Data_Flow_Provider
Identifier Component: Target Object Type = Data Provider, Item Scheme = OS_DATA_PROVIDER
  1A INE, Spain
  2A ONS, UK
Identifier Component: Target Object Type = Data Flow, Item Scheme = CL_DATA_FLOW
  LF-H Labour Force, Hours Worked
  LF-E Labour Force, Employment
Annotation
Metadata Attribute
Concept = Representation =Text
Release_Date_Time
Release_Status
ARCMetadata Report =
CL_Status
F Final
P Provisional
Target Id =
MSD: Metadata Report Structure
Data_Flow_Provider
Reference_Period
Metadata Attribute
Concept = Representation = Date/Time
Date_Tolerance
Date/Time
Time Value
Metadata Attribute
Metadata Attribute
Metadata Attribute
Concept = Representation =
Concept = Representation =
Concept = Representation =
AnnotationConcept = Value = simultaneous release by ECB
Date_Tolerance
Release_Date_Time
Reference_Period
Release_Status
ARCMetadata Report =
Metadata Attributes
Concept =
Concept =
Identifiers
Metadata Set
Data Provider =
Data Flow =
1A
LF-H
Value = 2007-31-03
Concept = Value = 2007-04-30T09:00
Value = F
Concept = Value = +24Hr
Metadata Structure = ARC_METADATA
Metadata Set: ARC Report Example
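The ARC example can be sketched in code to show how a metadata set is checked against its Metadata Structure Definition. This is a minimal sketch under stated assumptions, not the SDMX-ML format: the class `MetadataStructure` and the function `validate_report` are hypothetical names, uncoded attributes are treated as free text, and the pairing of each value with its concept is one reading of the slide. The codes, concept names and values themselves are taken from the slides.

```python
from dataclasses import dataclass

# Item schemes and code list as shown on the slides.
OS_DATA_PROVIDER = {"1A": "INE, Spain", "2A": "ONS, UK"}
CL_DATA_FLOW = {
    "LF-H": "Labour Force, Hours Worked",
    "LF-E": "Labour Force, Employment",
}
CL_STATUS = {"F": "Final", "P": "Provisional"}


@dataclass
class MetadataStructure:
    """Hypothetical, simplified stand-in for a Metadata Structure Definition."""
    id: str
    target: str
    identifier_components: dict  # component name -> item scheme of valid keys
    attributes: dict             # concept -> code list, or None for uncoded text


ARC_METADATA = MetadataStructure(
    id="ARC_METADATA",
    target="Data_Flow_Provider",
    identifier_components={
        "Data Provider": OS_DATA_PROVIDER,
        "Data Flow": CL_DATA_FLOW,
    },
    attributes={
        "Reference_Period": None,
        "Release_Date_Time": None,
        "Release_Status": CL_STATUS,
        "Date_Tolerance": None,
    },
)


def validate_report(msd, target_id, values):
    """Return a list of problems found when checking a report against the MSD."""
    problems = []
    # Every identifier component must carry a key from its item scheme.
    for component, scheme in msd.identifier_components.items():
        if target_id.get(component) not in scheme:
            problems.append(
                f"{component}: {target_id.get(component)!r} not in item scheme"
            )
    # Every reported concept must be declared; coded concepts must use a valid code.
    for concept, value in values.items():
        if concept not in msd.attributes:
            problems.append(f"{concept}: not a metadata attribute of {msd.id}")
        elif msd.attributes[concept] is not None and value not in msd.attributes[concept]:
            problems.append(f"{concept}: {value!r} not in code list")
    return problems


# The metadata set from the example slide (values copied verbatim).
arc_target = {"Data Provider": "1A", "Data Flow": "LF-H"}
arc_values = {
    "Reference_Period": "2007-31-03",
    "Release_Date_Time": "2007-04-30T09:00",
    "Release_Status": "F",
    "Date_Tolerance": "+24Hr",
}
problems = validate_report(ARC_METADATA, arc_target, arc_values)
```

In this sketch only the identifier components and the coded attribute are validated, which mirrors the point that the target "key" and the permitted value lists are what the MSD constrains.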
Data Flow
Data Provider
Provision Agreement
Metadata: Advance Release Calendar (ARC)
• What is the release calendar for?
– Informs when data will be published/made available
• Who publishes the data?
• What type of data is it (data flow)?
• What metadata is in the release calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
RELEASE CALENDAR
ARC_METADATA
ARC
1A
conforms to business rules of the metadata flow
can provide metadata for many metadata flows using agreed metadata structure
Controlling Metadata Reporting
Metadata collectors can set up control metadata for the collection process
Metadata Flow
(Meta)Data Provider
Provision Agreement
can get metadata from multiple metadata providers
Metadata Set
publishes/reports metadata sets
uses specific data structure
Metadata Structure Definition
Reference Metadata
• Metadata is everywhere, but
– it must be metadata about “something”
   • what is the “something”
   • how is it identified
– it comprises concepts and how they are structured
• The Metadata Structure Definition answers these questions
• Advance release calendar is only one possible example
– attached to the Provision Agreement
To which (other) things can metadata be attached?
MSD: Some Object Types
Category Scheme
Category
Data or Metadata Flow
Data Provider
Provision Agreement
Structure Definition
Data Set or Metadata Set
Content Constraint
Structure and Item Scheme Maps
Registered Data Set or Metadata Set
Attachment Constraint
MSD: List of Object Types to Which Metadata can be Attached
Agency, ConceptScheme, Concept, Codelist, Code, KeyFamily, Component, KeyDescriptor, MeasureDescriptor, AttributeDescriptor, GroupKeyDescriptor, Dimension, Measure, Attribute, CategoryScheme, ReportingTaxonomy, Category, OrganisationScheme,
DataProvider, MetadataStructure, FullTargetIdentifier, PartialTargetIdentifier, MetadataAttribute, DataFlow, ProvisionAgreement, MetadataFlow, ContentConstraint, AttachmentConstraint, DataSet, XSDataSet, MetadataSet, HierarchicalCodelist, Hierarchy, StructureSet, StructureMap, ComponentMap,
CodelistMap, CodeMap, CategorySchemeMap, CategoryMap, OrganisationSchemeMap, OrganisationRoleMap, ConceptSchemeMap, ConceptMap, Process, ProcessStep
definition of format and permitted values
Metadata Structure Definition (MSD) Report Structure
Metadata Attributes
Format and Permitted Value List
Metadata Report
Concept Scheme
concept defined in
Concept
takes semantic and context from
Metadata Structure Definition
can comprise the specification of one or more report
can have hierarchy
can have hierarchy
Target Identifier
can comprise the specification of one or more report
Links to
SDMX and Metadata Formats
Session: SDMX-ML Formats for Metadata Sets
Metadata Formats Syntax Implementation
• There are three relevant constructs in SDMX-ML for handling metadata sets
– Metadata Structure Definitions
– Metadata Reports (specific to an MSD)
– Generic Metadata Sets (for any MSD)
• This is similar to data formats in SDMX-ML, except that there are fewer different use cases
• There is no corresponding format implementation in SDMX-EDI for Reference Metadata
Comparing Formats for Metadata Sets
• Generic Metadata performs no validation, but can hold any type of metadata report
• MSD-specific Metadata Reports can perform more validation, and are less verbose
– Because there tend to be few codelists or numeric types in metadata reports, the validation may not be very useful
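The trade-off between the two forms can be illustrated with a small sketch. It is an analogy in plain Python, not SDMX-ML: the generic form is modelled as a list of concept/value pairs, and the MSD-specific form as a hypothetical class `ArcReport` with fixed fields. The field names reuse the ARC concepts from the earlier slides; `try_build` is a helper invented for this illustration.

```python
from dataclasses import dataclass

# Generic form: any report is a list of (concept, value) pairs; nothing is checked,
# so the same container can hold a report for any structure.
generic_report = [
    ("Release_Status", "F"),
    ("Relase_Date_Time", "2007-04-30T09:00"),  # misspelled concept goes unnoticed
]


@dataclass
class ArcReport:
    """Hypothetical structure-specific form for one MSD: fixed fields, one coded."""
    release_date_time: str
    release_status: str

    def __post_init__(self):
        # The coded attribute is checked against its code list (F = Final, P = Provisional).
        if self.release_status not in {"F", "P"}:
            raise ValueError(f"unknown status code: {self.release_status!r}")


def try_build(**fields):
    """Return the name of the error raised by ArcReport, or None when it is valid."""
    try:
        ArcReport(**fields)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


valid = try_build(release_date_time="2007-04-30T09:00", release_status="F")
misspelled = try_build(relase_date_time="2007-04-30T09:00", release_status="F")
bad_code = try_build(release_date_time="2007-04-30T09:00", release_status="X")
```

The specific form rejects the misspelled concept and the unknown code, while the free-text date passes unchecked in both forms, which is why the extra validation may add little for mostly textual reports.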
Metadata: Quality Frameworks
• The SDMX cross domain concepts for reference metadata are concerned with data quality framework (DQAF) metadata
• These DQAFs are used to improve the quality, comparability, transparency etc. of published data
Metadata – Reported according to a Quality Framework
Metadata Attributes
Format and Permitted Value List
Metadata Report
Concept Scheme
Concept
Metadata Structure Definition
CATEGORY_CONTENT_REPORT
QUALITY_METADATA
COVERAGE
REF_AREA
ACCOUNTING_CONV
MY_CONCEPTS
COVERAGE
REF_PERIOD
BASE_PER
BASE_PER
Example Metadata: Content
REF_PERIOD
ACCOUNTING_CONV
COVERAGE_SECTOR
COVERAGE_SECTOR
REF_AREA
BASE_PER
SDMX Registry Overview
REPOSITORY Provisioning Metadata
REGISTRY Data Set/Metadata Set
REPOSITORY Structural Metadata
Register
Query
Submit
Query
Submit
Query
SDMX Registry/Repository
Describes data and metadata structures
Describes data and metadata sources and reporting processes
Indexes data and metadata
SDMX Registry Interfaces
REPOSITORY Provisioning Metadata
REGISTRY Data Set/Metadata Set
REPOSITORY Structural Metadata
Subscription/Notification
Applications can subscribe to notification of new or changed objects
Register
Query
Submit
Query
Submit
Query
SDMX Registry/Repository
Describes data and metadata structures
Indexes data and metadata
SDMX Registry Interfaces
URL, registration date etc.
can provide data/metadata for many data/metadata flows using agreed data/metadata structure
conforms to business rules of the data/metadata flow
Data or Metadata Flow
Data Provider
Provision Agreement
can get data/metadata from multiple data/metadata providers
Data or Metadata Set
publishes/reports data/metadata sets
uses specific data/metadata structure
Data or Metadata Structure Definition
can have child categories
comprises subject or reporting categories
can be linked to categories in multiple category schemes
Category
Category Scheme
Data or Metadata Set
Information Model: High level Schematic
registers existence of data and metadata
Structure Maps
structure and code list maps
REPOSITORY Provisioning Metadata
REGISTRY Data Set/Metadata Set
REPOSITORY Structural Metadata
Subscription/Notification
Applications can subscribe to notification of new or changed objects
Register
Query
Submit
Query
Submit
Query
SDMX Registry/Repository
Describes data and metadata structures
Indexes data and metadata
SDMX Registry Interfaces
URL, registration date etc.
can provide data/metadata for many data/metadata flows using agreed data/metadata structure
Data Flow
Data Provider
Provision Agreement
can get data/metadata from multiple data/metadata providers
uses specific data/metadata structure
Structure Definition
can have child categories
comprises subject or reporting categories
can be linked to categories in multiple category schemes
Category
Category Scheme
Data Set
SDMX Artefacts: Registry Contents
registers existence of data and metadata sets
Structural Metadata
Provisioning Metadata
Registered Data and Metadata
Structure Maps
structure and code list maps
The Old JEDH (Joint External Debt Hub) Site
BIS
IMF
OECD
World Bank
WEBSITE
(Various Formats) (3-month production cycle)
JEDH with SDMX
BIS
IMF
OECD
World Bank
SDMX-ML
SDMX-ML
SDMX-ML
SDMX-ML
SDMX-ML (Debtor database)
[Info about data is registered]
SDMX “Agent”
SDMX Registry
Discover data and URLs
Retrieves data from sites
JEDH Site
Data provided in real time to site
SDMX-ML loaded into JEDH DB
CountrySTAT
RegionSTAT
National Publication Server(s)
Regional Publication Server
FAO SDMX Registry
Flow of FAO CountrySTAT-RegionSTAT Implementation
1
2
3a
4
3b
SDMX in Action: Prototype System
FOOD AND AGRICULTURE ORGANIZATION OF THE UNITED NATIONS
Slide courtesy of the FAO
1 CountryStat National Publication Server
• The web site is published from the files in CountryStat
SDMX Publication
• The new CountryStat files are converted to SDMX-ML data sets and made web accessible on the CountryStat web site
• These files are registered in the FAO SDMX Registry
RegionStat Regional Publication Server
• Queries the registry for new registrations; the registry responds with registration details, including the URL of the new data sets
• Retrieves the new data sets from the CountryStat web site
• Converts the SDMX-ML files to an internal format and integrates the new data sets with existing RegionStat data sets
• Re-publishes the RegionStat web site
2
3a
4
Prototype System: Explanation
Slide courtesy of the FAO
3b
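The register, query and retrieve steps of the prototype can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not a real SDMX registry interface: the classes `Registration` and `Registry`, the `example.org` URL, the flow and provider identifiers, and the placeholder file content are all hypothetical, and no network access is involved. It also includes the subscription/notification idea mentioned on the registry slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    data_flow: str
    provider: str
    url: str       # where the data set can be retrieved; the registry holds no data
    sequence: int  # order in which the registry received the registration


class Registry:
    """Hypothetical registry: it indexes data sets, it does not store them."""

    def __init__(self):
        self._registrations = []
        self._subscribers = []

    def subscribe(self, callback):
        """Ask to be notified of every new registration."""
        self._subscribers.append(callback)

    def register(self, data_flow, provider, url):
        registration = Registration(
            data_flow, provider, url, len(self._registrations) + 1
        )
        self._registrations.append(registration)
        for callback in self._subscribers:
            callback(registration)
        return registration

    def query(self, since=0):
        """Return registrations received after the given sequence number."""
        return [r for r in self._registrations if r.sequence > since]


# Stand-in for the national web site: URL -> published file content (placeholder).
national_site = {"https://countrystat.example.org/crops.xml": "<placeholder/>"}

registry = Registry()
notified = []
registry.subscribe(notified.append)

# National server: publish the file, then register its location.
for url in national_site:
    registry.register("CROPS", "COUNTRY_A", url)

# Regional server: ask for new registrations, then pull each data set from its URL.
last_seen = 0
new = registry.query(since=last_seen)
retrieved = {r.url: national_site[r.url] for r in new}
last_seen = max((r.sequence for r in new), default=last_seen)
```

Keeping a `last_seen` marker lets the regional server poll repeatedly without retrieving the same data set twice; a subscriber gets the same information pushed to it instead.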
SDMX Implementation
Developing SDMX Applications
• General Design Approaches
• Publications and Dissemination
• Data Warehousing/Integration of Data Sources
• Other Topics
SDMX Publication and Dissemination
• SDMX can be used to drive Web dissemination and print publication
– It is a useful format for distribution from websites
– It can be used by websites to improve delivery of content
– It can be used to provide content to print applications, for tabular data
• These outputs can all be produced by a single system
Data Storage (SDMX)
Print Publication Engine
SDMX-ML
XSL-FO
SDMX Query Engine
PDF, etc.
Templates, boilerplate text, analysis
Website
Canned Queries
ASP/JSP
HTML
SDMX-ML
CSV
On-the-Fly Queries
Note: Can be a virtual data store fed by the SDMX registry
XSLT
SDMX-ML
SDMX Registry
Notes on Publication/Dissemination
• Current practice is often to focus on the delivery of tables
– This is often not what users ideally want
– Tables can be viewed as “canned queries”
• Better web-sites can be created which support granular user queries supported by rich metadata
– See the ECB data warehouse, Federal Reserve Board site as examples
– See “Data on the Web” presentation for more details
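The idea that a table is a “canned query” can be made concrete with a short sketch. It assumes a toy in-memory data store; the observations are made-up illustrative values, not real statistics, and the function names `query` and `canned` and the dimension names other than REF_AREA are assumptions for this example.

```python
# Made-up observations, each identified by its dimension values (illustrative only).
observations = [
    {"REF_AREA": "ES", "FREQ": "A", "TIME_PERIOD": "2006", "OBS_VALUE": 1.0},
    {"REF_AREA": "ES", "FREQ": "A", "TIME_PERIOD": "2007", "OBS_VALUE": 2.0},
    {"REF_AREA": "UK", "FREQ": "A", "TIME_PERIOD": "2007", "OBS_VALUE": 3.0},
    {"REF_AREA": "UK", "FREQ": "Q", "TIME_PERIOD": "2007-Q1", "OBS_VALUE": 4.0},
]


def query(data, **criteria):
    """Return the observations whose dimensions match every criterion given."""
    return [
        obs for obs in data
        if all(obs.get(name) == value for name, value in criteria.items())
    ]


# A published table is a canned query: its selection criteria are fixed in advance.
CANNED_QUERIES = {"spain_annual": {"REF_AREA": "ES", "FREQ": "A"}}


def canned(name):
    """Run one of the predefined queries by name."""
    return query(observations, **CANNED_QUERIES[name])


table = canned("spain_annual")                           # what a fixed table delivers
on_the_fly = query(observations, TIME_PERIOD="2007")     # what a user might ask instead
```

Both paths go through the same query function, so a site that already serves fixed tables from structured data can expose granular queries with the same machinery.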
Data Warehousing/Integration of Data Sources
• SDMX is also designed to support the collection and processing of data
– In most organizations, this is seen as a data warehousing activity
• SDMX provides tools for integrating data from a variety of sources
– Can be among a set of organizations or within an organization
Data Sources (static files, databases, etc.)
SDMX Registry
Data Registration
Data Loading
Data Harmonization/Processing
Data Dissemination
Notification
Data Pulled
Metadata
Internal Applications
Print Publication
Website
Registration
Data Warehouse
Note: All types of dissemination applications may use the registry for various purposes. The registry may even be made publicly available to users who want SDMX-ML data and metadata.
Notes on Data Warehousing
• Each stage is loosely coupled with associated applications, using XML interfaces:
– Data sources
– Data processing
– Data dissemination applications
• The SDMX Registry functions throughout as a metadata repository, to provide structural and provisioning information as well as location of data as needed
• Internal database structures are based on the SDMX information model
– They are predictable and regular
– They can be auto-generated
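To illustrate how database structures can be auto-generated from a structure definition, here is a minimal sketch. It assumes a simplified structure with a name, an ordered list of dimensions and a list of attributes; the function `create_table_sql`, the table name, and the dimension and attribute names are illustrative assumptions, and the identifiers are assumed to come from a trusted structure definition because they are interpolated directly into the SQL.

```python
import sqlite3


def create_table_sql(name, dimensions, attributes, measure="OBS_VALUE"):
    """Build a CREATE TABLE statement: dimensions form the key, one measure column."""
    columns = [f"{dimension} TEXT NOT NULL" for dimension in dimensions]
    columns.append(f"{measure} REAL")
    columns += [f"{attribute} TEXT" for attribute in attributes]
    columns.append(f"PRIMARY KEY ({', '.join(dimensions)})")
    body = ",\n  ".join(columns)
    return f"CREATE TABLE {name} (\n  {body}\n);"


# Hypothetical, simplified structure definition.
structure = {
    "name": "LABOUR_FORCE",
    "dimensions": ["FREQ", "REF_AREA", "TIME_PERIOD"],
    "attributes": ["OBS_STATUS"],
}

ddl = create_table_sql(
    structure["name"], structure["dimensions"], structure["attributes"]
)

# Check that the generated statement is accepted by a real database engine.
connection = sqlite3.connect(":memory:")
connection.execute(ddl)
table_info = list(connection.execute("PRAGMA table_info(LABOUR_FORCE)"))
columns = [row[1] for row in table_info]
key_columns = [row[1] for row in table_info if row[5]]
```

Because every structure definition has the same shape (dimensions, measure, attributes), the same generator works for any data flow, which is what makes the resulting tables predictable and regular.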
SDMX Tools and Resources
SDMX Tools (Partial List)
• Metadata Technology has a set of free tools for working with data and metadata, and a free registry implementation
– Mostly Java and XSLT
• Eurostat has a set of free tools for working with data and metadata, and has a registry implementation
• OECD and IMF have a web-services based package for dissemination: .STAT (available through MOU)
• ECB visualization tools written in Flex on Google Code
• Some other tools, including commercial vendors (STR Supercross 2, etc.)
Other Resources
• www.sdmx.org has a blog and makes many different presentations and papers available, as well as distributing copies of the standards
– An SDMX User’s Guide is currently being developed (beyond the material contained in the SDMX v 2.0 specification)
• The Open Data Foundation promotes SDMX (among other standards)
– Check www.opendatafoundation.org
– They host the SDMX Users Forum www.sdmxusers.org
SDMX and Other Standards
Other Important Standards
• Data Documentation Initiative (DDI) – describes the micro-data inputs to aggregate (SDMX) data
• ISO/IEC 11179 Metadata Registries – describes terminological/semantic and conceptual models, and the metadata lifecycle
• eXtensible Business Reporting Language (XBRL) – describes financial microdata for economic statistics
SDMX and XBRL
• These standards can be mapped to each other successfully
• However, the mapping depends on the specific SDMX Data Structure Definition, and the specific XBRL “Taxonomy”
– There is no single, standard mapping
DDI and SDMX Combined Data Model
• DDI 3 focuses on:
– collection and production of microdata
– reuse and sharing of common data structures
– conversion to statistical tables (matrices)
– preservation and multiple storage options
• SDMX focuses on:
– statistical tables
– reuse and sharing of common data structures
– consistent data transfer structure
• Together they form a coherent data management model for data capture, storage and interchange with a wide area of overlap
Generic Process Example
Survey/Register
Raw Data Set
Anonymization, cleaning, recoding, etc.
Micro-Data Set/Public Use Files
Tabulation, processing, case selection, etc.
Aggregation, harmonization
Aggregate Data Set (Lower level)
Aggregate Data Set (Higher Level)
Aggregate Data Set (Highest-Level)
DDI
SDMX
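The step from micro-data to successively higher-level aggregates can be sketched as repeated tabulation over fewer dimensions. This is a minimal illustration with made-up records, not real survey data; the function `aggregate` and the variable names `region`, `sex` and `hours` are assumptions for the example.

```python
from collections import defaultdict

# Made-up micro-data: one record per respondent (illustrative values only).
micro_data = [
    {"region": "North", "sex": "F", "hours": 38},
    {"region": "North", "sex": "M", "hours": 40},
    {"region": "North", "sex": "F", "hours": 20},
    {"region": "South", "sex": "M", "hours": 35},
]


def aggregate(records, dimensions, measure):
    """Sum the measure over all records that share the same dimension values."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[dimension] for dimension in dimensions)
        totals[key] += record[measure]
    return [
        dict(zip(dimensions, key), **{measure: total})
        for key, total in sorted(totals.items())
    ]


# Lower-level aggregate: tabulated from the micro-data by region and sex.
lower_level = aggregate(micro_data, ["region", "sex"], "hours")
# Higher-level aggregate: derived from the lower-level table by dropping a dimension.
higher_level = aggregate(lower_level, ["region"], "hours")
```

The first call corresponds to the tabulation of micro-data, the domain the slides assign to DDI; the second works purely on aggregate tables, the domain they assign to SDMX.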
The Generic Statistical Business Process Model (GSBPM)
• The METIS group is a part of UN/ECE which addresses metadata issues for national statistical agencies (and other producers of official statistics)
– This community uses both SDMX and DDI
• They have produced a reference model of the statistical production process
– The DDI 3 Lifecycle Model was a major input
– GSBPM has a much greater level of detail
The Generic Statistical Information Model (GSIM)
• Early work on an information model to accompany the GSBPM is starting
– Still informal, very early
– Involves some of the statistical agencies which lead the work on GSBPM
• GSIM will take as a major input both the DDI and SDMX information models
– Will also cover other metadata
– Will also draw on other standards (Neuchatel Model for Classifications, etc.)
• Goal is to publish GSIM through METIS alongside the GSBPM
Questions?