Internet dissemination. Part I. Storing and retrieving data.

Disseminating statistics: Internet and Publications

Madrid, 3-5 March 2008

Storing data. Structure for dissemination

• Different data storage formats

• Data retrieval and presentation

European Statistical Training Programme

Different data storage formats

Different designs for cities

Chaotic urbanization (old towns)

Madrid (Spain), city centre; Toledo (Spain), city centre

Different designs for cities

Organized development (new districts and towns)

Madrid, a modern district; Manhattan, NY

Different designs for cities

Madrid: ... most often, old and new urban districts must coexist

New cities can be designed in a ‘structured’ way, but ...

Tres Cantos, Spain, 1970; Brasilia, Brazil, 1956

And what about storing (for disseminating) statistical data?

Is there a best solution?

Introduction

Questions to answer

When the dissemination stage begins

Different storage formats

SDMX-ML Standard

Issues to address

The role of metadata

Document structure normalisation

Example of an application with unstructured data

Example of a tool for structuring data

• The role of metadata

• Experiences of document structure normalisation applied to statistical dissemination. Ordinary files to multidimensional databases

• Example of tools for structuring information to be disseminated: PX-Make

Introduction

• How to structure statistical information in order to disseminate it better
– and therefore metadata,
– different structuring and storage formats,
– and some information technologies for dissemination

• and try to answer some questions:
– Must we always think big?
– Should we use the latest and most powerful dissemination technology?
– Must we try to use one single technology?
– Are the DW systems the best, or the only, dissemination technology?

Questions to answer

• Anticipating the content of the presentation…

• The INE’s answer to those questions is based on our own experience, and on our own restrictions (resources, time, etc.)

• and the answer to all of them is NO, because:

– Some systems may demand a large previous investment of time and resources, and then not be sufficiently dynamic

– Each type of statistical information may require a different dissemination technology

– And because it is possible, under a single “brand and look” (INEbase in our case), to group data from very different statistical operations and apply different dissemination technologies, while still offering end users very similar interfaces

Questions to answer

• The day that we finish tabulating a survey or statistics, it seems as though the work is done, but…

• There is still time and work to do before the statistics are disseminated. This is our situation:

– 1 day for the press release (on paper, fax and the Internet)

– 1 day to post all the content from the tables database and the time series database on the Internet

– 10 days to edit a diskette or CD-ROM, including replication

– 1 month for the book

When the dissemination stage begins

• In order to comply with these deadlines and shorten the time taken by editorial operations ...

• It is necessary to begin the dissemination project a long time in advance of the tabulation process

• Therefore,

– it is useful for us to know as much as possible about data, metadata, formats, methods, dissemination techniques and standards

– part of this presentation is devoted to these topics

When the dissemination stage begins

• Unstructured formats, not enriched with metadata, not particularly focused on computer processing or statistical dissemination
– easy, quick and cheap to produce
– poor informative content
– very limited computer processing

• Structured formats, programs and standard methods
– less easy or cheap
– quick to produce (…it can be obtained)
– rich dissemination media, secure and stable
– with a guarantee of being able to address new requirements
– easy to automate dissemination processes

Document structure normalisation

Document structure normalisation

Example of a non-structured document or file: a press release (.doc or .pdf)

Possible ‘processing’ of this document or file:

• Reading
• Printing
• Or ‘photocopying’

• Basically, what this presentation shows is that, as the chosen degree of structuring increases:

1. Visual and presentation performance of the format we are using will increase

2. Complexity and the human and economic cost of implementing that solution will increase

• We will also see that, on a website (ours in this case) different formats can share the same storage system with no problems and are used for different purposes or types of statistical products

Document structure normalisation

• Unstructured formats:

– Not enriched with metadata (there will be a different session dealing exclusively with metadata)

– Not particularly focused on statistical dissemination

• Adobe Acrobat PDF
• Text and spreadsheets
• Static HTML pages

– We can certainly “get by” with them

Document structure normalisation

• PDF, XLS … (no comment will be made about them)

Use of static HTML pages online and in offline publications

• Advantages

– Documents that look like statistical tables can be “shown” using tags such as <table>, <tr>, <td>...

– There are many functions available for formatting text, although not so many for organising tables

– Both static and dynamic HTML pages can be created (dynamic pages are usually generated with the help of “CGI” type programs which send logical queries to databases and file servers, or with other online data access technologies such as ASP or PSP)

– And all Office tools offer the possibility of editing static HTML pages.

Document structure normalisation. Unstructured formats
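
As a rough illustration of how a dynamic page might be produced, the following Python sketch builds an HTML table from rows that a CGI-type program could have fetched from a database. The function name `render_table`, the labels and the figures are hypothetical and are not taken from the INE system.

```python
from html import escape


def render_table(title, headers, rows):
    """Return an HTML table as a string; every label is escaped as plain text."""
    lines = ["<table>", f"<caption>{escape(title)}</caption>", "<tr>"]
    lines += [f"<th>{escape(h)}</th>" for h in headers]
    lines.append("</tr>")
    for label, values in rows:
        lines.append("<tr>")
        lines.append(f"<th>{escape(label)}</th>")
        lines += [f"<td>{v:.1f}</td>" for v in values]
        lines.append("</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical rows, standing in for the result of a database query.
page = render_table(
    "Population aged 16 and over",
    ["", "Active", "Inactive"],
    [("Total", [9.0, 9.0]), ("Aged 16 to 19", [9.0, 9.0])],
)
print(page)
```

The same function could serve a static page (run once, save the output) or a dynamic one (run on every request), which is the distinction drawn in the slide.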

Use of static HTML pages online and in offline publications

• Disadvantage

– HTML enables us to “show” metadata,

– but it does not manage them as metadata (that is, as conventions regarding the meaning of different parts of the information, which would be useful for computer presentation processes); it treats them as just another piece of text

Document structure normalisation. Unstructured formats

• Example of HTML source code

<HTML>

<HEAD>

<TITLE>Población de 16 y más años por sexo, grupos de edad (4) y relación con<BR>

la actividad económica.</TITLE>

</HEAD>

<BODY>

<TABLE>

<TR ALIGN=LEFT>

<TH COLSPAN=8>Población de 16 y más años por sexo, grupos de edad (4) y relación con<BR>

la actividad económica.</TH>

</TR>

<TR ALIGN=LEFT>

<TH> </TH>

<TH VALIGN COLSPAN=1>Población > 16 años</TH>

<TH VALIGN COLSPAN=1>Activos</TH>

<TH VALIGN COLSPAN=1>Ocupados</TH>

<TH VALIGN COLSPAN=1>Parados</TH>

<TH VALIGN COLSPAN=1>Parados que buscan primer empleo</TH>

<TH VALIGN COLSPAN=1>Inactivos</TH>

<TH VALIGN COLSPAN=1>Población contada aparte</TH>

</TR>

<TR ALIGN=RIGHT>

<TH ALIGN=LEFT VALIGN=TOP>Ambos sexos</TH>

</TR>

<TR ALIGN=RIGHT>

<TH ALIGN=LEFT VALIGN=TOP>Total</TH>

<TD>9.0</TD>

<TD>9.0</TD>

<TD>9.0</TD>

</TR>

....

<TR ALIGN=RIGHT>

<TH ALIGN=LEFT VALIGN=TOP>De 16 a 19 años</TH>

<TD>9.0</TD>

<TD>9.0</TD>

</TR>

</TABLE>

</BODY>

</HTML>

Document structure normalisation. Unstructured formats
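
The limitation of HTML noted earlier (metadata are shown, but only as text) can be illustrated with a small Python sketch. It reads a shortened fragment in the style of the example above using the standard `html.parser` module. The class name `CellCollector` and the fragment itself are illustrative assumptions.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <th> and <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


# A shortened, hypothetical fragment modelled on the example above.
SAMPLE = """
<TABLE>
<TR><TH> </TH><TH>Activos</TH><TH>Ocupados</TH></TR>
<TR><TH>Total</TH><TD>9.0</TD><TD>9.0</TD></TR>
</TABLE>
"""

collector = CellCollector()
collector.feed(SAMPLE)
cells = [c.strip() for c in collector.cells]
print(cells)
```

All that a program recovers is a flat list of strings. Nothing in the markup says that "Activos" is a category of a classification variable or which variable "Total" belongs to, so any such meaning has to be supplied from outside the file.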

Structured formats (based on metadata):

– Standards promoted by official statistical institutions: EDIFACT/GESMES, SDMX ...

– ‘De facto’ standards for disseminating statistical data (“pseudo OLAP” readers: PC-Axis, SuperTABLE, EVA, Navidata, Beyond 20/20 ®)

– Conventional databases, with the capacity to store data and metadata, and to dynamically generate the required information

– Multidimensional systems, or OLAP as they are now called, for storage and dissemination of data: Data Warehouse approach

Document structure normalisation

Is it necessary to spend time and resources structuring statistical data files or creating costly databases for dissemination?

1.- Automating data presentation tasks and achieving productivity gains will only be possible when the structure of the files generated is widely recognised, stable, repetitive … All of which will aid editing tasks, irrespective of the medium (paper, electronic or online publications)

2.- Presentation logic will respond to a metadata model, and metadata may be used to reinforce search functions

3.- Clear metadata documentation simplifies communication between services, producers and the dissemination unit, and enables concurrent working

4.- And it will subsequently facilitate the communication of data between organizations, or to individuals, Web Services, content syndication environments...

Structured formats. The key question: the role of metadata

• One way or another, using metadata will bring us closer to a matrix model

• However, how easy is it to structure tables as multidimensional matrices that reflect the possible cross-classifications of variables, based on the metadata used to describe them?…

• Not always easy: sometimes “cubist art” (using cubes) is required for complex tables such as this one:

Structured formats.The key question: the role of metadata

• This “cubist art” demands that, besides concerning ourselves with metadata, we focus on clearly identifying matrices resulting from the tabulation process and which are valid for dissemination systems.

• Sometimes it is necessary to manipulate a tabulated matrix
– Dividing it into several matrices
– Combining a classification variable with a counting one
– Concatenating classification variables
– Combining previous actions

• It should be recognised this may entail an added workload

Structured formats.The key question: the role of metadata
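
Two of the manipulations listed above, dividing a matrix and concatenating classification variables, can be sketched in Python as follows. The cube, its variables and the helper names `concatenate` and `divide` are hypothetical and do not describe any INE tool.

```python
from itertools import product

# Hypothetical cube: counts keyed by (sex, age group, activity).
sexes = ["Men", "Women"]
ages = ["16-19", "20-24"]
activities = ["Active", "Inactive"]
cube = {key: 9.0 for key in product(sexes, ages, activities)}


def concatenate(cube, i, j, sep=" / "):
    """Merge the classification variables at positions i and j into one.

    The merged variable is placed first in the new key.
    """
    merged = {}
    for key, value in cube.items():
        joined = key[i] + sep + key[j]
        rest = tuple(k for pos, k in enumerate(key) if pos not in (i, j))
        merged[(joined,) + rest] = value
    return merged


def divide(cube, i):
    """Split a cube into one smaller matrix per value of the variable at i."""
    parts = {}
    for key, value in cube.items():
        parts.setdefault(key[i], {})[key[:i] + key[i + 1:]] = value
    return parts


flat = concatenate(cube, 0, 1)   # 3 dimensions -> 2 dimensions
by_sex = divide(cube, 0)         # one 2-dimensional matrix per sex
print(len(flat), sorted(by_sex))
```

Both operations keep every cell of the original cube. Only the way the cells are addressed changes, which is why they add workload without adding information.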

• The decision:

– Have we already opted to simply produce structured files or databases with statistical tables ordered as matrices, resulting from systematically crossing variables, and accompanied by all the metadata necessary for their interpretation?...

• If the answer is ‘YES’, it is necessary to talk of available formats and procedures

Structured formats.The key question: the role of metadata

• EDIFACT: Electronic Data Interchange For Administration, Commerce and Transport; electronic document structures promoted by the United Nations for exchanging documents electronically in the field of trade and public administrations

• GESMES = GEneric Statistical MESsage
– adaptation for statistical purposes of the EDIFACT EDI syntax
– Designed by a workgroup composed of statistics institutes, customs bodies and central banks
– Financed as part of the European Union IDA project (Interchange of Data between Administrations)
– Published in 1993
– Adapted to multidimensional “data set” structures including their own metadata
– Complete, detailed, and somewhat complex
– In use between EUROSTAT and all the national statistical institutes (INEs), on a communication system based on the “Stadium - Statel - Testa services” extranet

Different storage formats Standards promoted by official statistics institutions

EDIFACT/GESMES, EDAMIS

• Basic EDIFACT syntax:
– An EDIFACT exchange consists of a sequence of segments
– Each segment has a unique 3-character identifier
– There are rules of order for segments
– “Entity-relation” modelling techniques were used to design message syntax

Different storage formats Standards promoted by official statistics institutions

EDIFACT/GESMES
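
The segment structure described above can be made concrete with a minimal Python sketch. It assumes the default EDIFACT separators (an apostrophe ends a segment, `+` separates data elements, `:` separates components) and ignores the `?` release character, so it is not a full parser. The sample is a few segments copied from the GESMES message shown later in this presentation; `parse_segments` is a hypothetical helper name.

```python
def parse_segments(message):
    """Split an EDIFACT interchange into (tag, elements) pairs.

    Assumes the default separators and does not handle the ? release
    character, so escaped separators inside values are not supported.
    """
    segments = []
    for raw in message.split("'"):
        raw = raw.strip()
        if not raw:
            continue
        tag, *elements = raw.split("+")
        segments.append((tag, [e.split(":") for e in elements]))
    return segments


SAMPLE = (
    "UNH+01001+GESMES:D:95A:E6'"
    "BGM+74:::PC-AXIS Win 2.0'"
    "DTM+137:19980813:101'"
    "NAD+MS+ine'"
)

segments = parse_segments(SAMPLE)
for tag, elements in segments:
    print(tag, elements)
```

Each pair starts with the 3-character segment identifier, followed by its data elements broken into components.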

• GESMES implementations

– ECOSER (economic time series)

– BOPSTA (balance of payments)

– PRODCOM (production data)

– CLASET (statistical classifications)

– RDRMES (raw data collection)

– GESMES / CB (central banks short term economic indicators)

Different storage formats Standards promoted by official statistics institutions

EDIFACT/GESMES

• Problems …

– The more popular table presentation programs often do not have the capability of exporting and importing data with GESMES (PC-Axis does this)

– Through intensive use of the internet, new technologies emerged, particularly the standard XML / SDMX

Different storage formats Standards promoted by official statistics institutions : EDIFACT/GESMES

• An example of a message:

UNH+01001+GESMES:D:95A:E6'
BGM+74:::PC-AXIS Win 2.0'
DTM+137:19980813:101'
NAD+MS+ine'
CTA+CC+:INE Difusion e-mail: [email protected] Fax: +34 91 5839158'
NAD+MR+eurostat'
ASI+01001'
SCD+4+sexo++++:1'
SCD+4+grupos de edad (4)++++:2'
SCD+4+relacion con la actividad economica++++:3'
SCD+3+Poblacion de 16 y mas años++++:4'
DSI+epa4t97'
GIR+5+SDB:AB+01:AC+Ejemplo.- Resultados nacionales:AD'
ARR++9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:++9.0:9.0:9.0:9.0:9.0'
IDE+5+01001'

Different storage formats Standards promoted by official statistics institutions EDIFACT/GESMES

• SDMX (Statistical Data and Metadata eXchange), http://www.sdmx.org/, is an initiative promoted by the BIS (Bank for International Settlements), OECD, IMF, World Bank, European Central Bank, UN and EUROSTAT in order to:

• Promote the use of standards in exchanging statistical information between institutions

• There are already pilot projects in place, or experiences such as:

– Eurostat SODI (SDMX Open Data Interchange)

– NAWWE (“The primary objective of the NAWWE project is to use a web based mechanism for collecting national accounts data based on already internationally agreed national accounts standards”)

http://stats.oecd.org/nawwe/

– Mexico : http://www2.inegi.gob.mx/estestint/Standards/default.asp

Different storage formats. SDMX

• Not institutionally standardised, although they have come to be “de facto” standards

• Specially designed for holding and presenting data and metadata

• Reader programs mimic the functions of OLAP multidimensional data presentation (showing dimensions and hierarchies, pivoting, drilling down, nesting, showing graphics and statistical maps)

• Full metadata handling capability

• Several programs, several regional markets:
– PC-Axis (Sweden): Nordic countries, UNECE, other EU countries (Spain too), South Africa, Guatemala…
– CUB.X / EVA: Eurostat program
– Beyond 20/20: USA, Canada, UK, France ...
– SuperTABLE: Australia

Different storage formats

De facto standards and “pseudo OLAP” visualisers

• Statistical table management application with a spreadsheet interface

• Windows environment

• Simple to handle and use by non-IT experts

• Simple to generate: files are written in ASCII with tags; they are structured, self-documenting and easy to translate to XML (adaptation to the SDMX version 2 format is under way)

• Easy to associate with Office applications

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

• The program has been developed by Statistics Sweden

• The GUI shows typically statistical elements: universes or contents, variables, modify variable and value selections, nestings, etc...

• File generation can be fully automated:
– By means of robots or tabulation program macros (SAS)
– From relational or multidimensional databases containing the information (such as our Tempus 2 system)
– Or simply displayed online using “cgi / web gateways” type programs

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

The table file (*.px) contains both metadata and data:

Lines of metadata …

AXIS-VERSION="2000";
CREATION-DATE="20070228 09:44";
SUBJECT-AREA="Demography";
SUBJECT-CODE="l1";
MATRIX="L10026E";
TITLE="Population of main capital cities.";
CONTENTS="Population of the largest urban agglomeration. Year 2005";
DESCRIPTION="Population of the largest urban agglomeration. Year 2005 ";
DECIMALS=0;
SHOWDECIMALS=0;
STUB="Country/Agglomeration";
HEADING="population (thousands)";
UNITS="population (thousands)";
LAST-UPDATED="26/03/07";
CONTACT="INE E-mail:www.ine.es/infoine Internet:www.ine.es Tel:+34 91 5839100 """;
VALUES("Country/Agglomeration")="Afghanistan. Kabul","Albania. Tirana","Algeria. Algiers","American Samoa. Pago Pago", …

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

SOURCE="Statistical Yearbook of Spain ";
COPYRIGHT=YES;
NOTE="Information Source: United Nations Demographic Yearbook. ";
VALUENOTE("Country/Agglomeration","Australia. Sydney ")=" Including Christmas Island, Cocos (Keeling) Island and Norfolk ""Island.";
VALUENOTE("Country/Agglomeration","Channel Islands. ST. Helier")="Including the islands of Guernsey and Jersey. ";
VALUENOTE("Country/Agglomeration","China. Shanghai")="For statistical purposes, the data for China do not include Hong Kong #""and Macao Special Administrative regions (SAR) of China. #"" ";
VALUENOTE("Country/Agglomeration","Comoros (The). Moroni")="Including the island of Mayotte. ";
VALUENOTE("Country/Agglomeration","Finland. Helsinki")="Including Aland Islands. ";
VALUENOTE("Country/Agglomeration","Mauritius. Port Louis")="Including Agalega, Rodrigues and Saint Brandon. ";
VALUENOTE("Country/Agglomeration","Norway. Oslo")="Including Savalbard and Jan Mayen islands. ";
VALUENOTE("Country/Agglomeration","Saint Helena. Half Tree Hollow")="Including Ascension and Tristan da Cunha. ";

Lines of data …

DATA=2994 388 3200 …

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis
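
Because the file is plain tagged text, reading it takes very little code. The following Python sketch treats the file as a series of `KEYWORD=value;` statements and splits on semicolons that fall outside double quotes. It is a simplified reading of the format based only on the excerpt above, not a complete PC-Axis parser; the helper names are hypothetical and the sample is a shortened version of the file shown.

```python
import re


def split_statements(text):
    """Split PX text on semicolons that are not inside double quotes."""
    statements, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ";" and not in_quotes:
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return [s for s in statements if s]


def parse_px(text):
    """Return a dict mapping each keyword to its value or list of values."""
    result = {}
    for statement in split_statements(text):
        # Simplification: the first "=" is assumed to end the keyword.
        key, _, value = statement.partition("=")
        quoted = re.findall(r'"([^"]*)"', value)
        if quoted:
            result[key.strip()] = quoted if len(quoted) > 1 else quoted[0]
        else:
            result[key.strip()] = value.strip()
    return result


SAMPLE = (
    'MATRIX="L10026E";'
    'TITLE="Population of main capital cities.";'
    'DECIMALS=0;'
    'STUB="Country/Agglomeration";'
    'VALUES("Country/Agglomeration")="Afghanistan. Kabul","Albania. Tirana";'
    'DATA=2994 388;'
)

table = parse_px(SAMPLE)
print(table["TITLE"], table["DATA"].split())
```

The metadata keywords and the data cells come out of the same pass, which is the sense in which the format is self-documented.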

The correspondence between tables and the program interface is intuitive: publication / folder or database, subject areas, tables, variables, values, data ...

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

View a table

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

Other functions:

• DDE and OLE communication with other Office programs: Excel, Word...

• Multiple export formats: Excel, Text, Html, Dbase, Gesmes and, shortly, SDMX...

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

• Easy to combine with browsing structures based on static and dynamic HTML pages (as done by the INE in INEbase and monthly INEbase)

• Statistical Graphs and Maps (with PX-Map)

• In several languages

• Calculation functions, on rows, columns and between tables of equal dimensions

• Customisable views and printing

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

• Site in English maintained by Statistics Sweden with links to all the programs in the PC-Axis suite, to countries where PC-Axis solutions are used, to the forum, download area, etc

• http://www.pc-axis.scb.se/

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

• How the INE uses PC-Axis:

– As a reference for designing the dissemination database, and particularly for its metadata storage structures (INEbase and Tempus 2)

– As a format for the files from all statistical operations not yet uploaded to Tempus 2, or which are not anticipated to be uploaded (a program, ‘Jaxi’, displays them online)

– As another export format offered by INEbase

– And for building “offline” programs: monthly INEbase, EPA ...

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

• One or more CGI programs provide “pseudo OLAP” browsing, search and presentation functions

• Data is held, by means of PC-Axis files, on the internet file server, in a directory structure which follows the logical subject tree of the organisation – that of the ISO-.

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

Different storage formats

De facto standards and “pseudo OLAP” visualisers: PC-Axis

• EVA (ex CUB.X) is a very similar program to PC-Axis, which also enables handling of multidimensional tables

• http://epp.eurostat.cec.eu.int/extraction/evajava/evajava/help/en/homepage.htm states that:

“EVA stands for Eurostat's Visual Application, the Eurostat's Common Browser for its statistical databases. EVA is a specialised multidimensional statistical table browser”

Different storage formats

De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA

• Eurostat’s “New Cronos” database and its HTML presentation

Different storage formats

De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA

Shell HTML

Different storage formats

De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA

Retrieving data in HTML table format

Different storage formats De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA

The visualiser for Windows

• The dimension “blocks” show the groups of values of the variable, and allow for rotating the chosen values

• “Spreadsheet” type interface, drag and drop functions to modify the header row and the header column

• Values and codes

• Multiple export formats: Excel, Dbase, Gesmes...

Different storage formats

De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA

An example of an EVA file:

INFOFri Dec 19 10:16:08 1997 @_# trueValues=5246 on 5544 #_@LASTUPFri, 19 Dec 1997 10:11:04 +0100TYPERVDELIMS(),@DIMLST(soft,theme,domain,collect,table,indic,country,time)DIMUSE(R,R,N,N,N,V,V,V)POSLST(newcronos)(theme1)(eur2)(01-cn)(01-cn-a)(cnpib90a)

(01,22,30,11,34,32,14,28,16,24,18,36,38,40,41,26,42,46)(1999a00,1998a00,1997a00,1996a00,1995a00,1994a00,1993a00,1992a00,1991a00,1990a00,1989a00,1988a00,1987a00,1986a00)FORMATFORMATRNOTAV:VALLST(0)(6309801.90,6119851.10,5942454.30,5792810.20,5691611.00,5554092.30,5394748.60,5424698.30,5373350.50,5198353.80,

Different storage formats

De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA

– Distributed by the Canadian company of the same name, which also produces:

• the (independent) file and metadata preparation system
• and an “internet file server” version of the program

– Data and metadata are stored in binary files that are not directly readable; Beyond 20/20 files cannot be created outside its own “builder” programs

– Capabilities: spreadsheet interface, drag and drop, exchange and nesting of variables, statistical graphs and maps

– Two main types of file: tables and “extracts”. A distinguishing feature of Beyond 20/20 is its capability for handling microdata and tables with the same program. “Extracts” are indexed microdata files which are specially handled so that online tabulation is quick

Different storage formats

De facto standards and “pseudo OLAP” visualisers: Beyond 20/20
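
The idea of indexing microdata so that tables can be produced quickly on request can be sketched in Python as follows. This shows the general technique only, under the assumption of a simple in-memory index; it says nothing about the actual Beyond 20/20 file format, and the records, variable names and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical microdata: one record per respondent.
records = [
    {"sex": "Men", "age": "16-19", "status": "Active"},
    {"sex": "Women", "age": "16-19", "status": "Inactive"},
    {"sex": "Women", "age": "20-24", "status": "Active"},
    {"sex": "Men", "age": "20-24", "status": "Active"},
]


def build_index(records):
    """Map each (variable, value) pair to the set of matching record numbers."""
    index = defaultdict(set)
    for number, record in enumerate(records):
        for variable, value in record.items():
            index[(variable, value)].add(number)
    return index


def cross_tabulate(index, row_var, col_var):
    """Count records for every combination of row_var and col_var values."""
    rows = sorted(v for (var, v) in index if var == row_var)
    cols = sorted(v for (var, v) in index if var == col_var)
    return {
        (r, c): len(index[(row_var, r)] & index[(col_var, c)])
        for r in rows
        for c in cols
    }


index = build_index(records)
table = cross_tabulate(index, "sex", "status")
print(table)
```

The index is built once; each requested table then only needs set intersections, without rereading the microdata.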

• Chapters, tables and “extracts”

Different storage formats

De facto standards and “pseudo OLAP” visualisers: Beyond 20/20

• Viewing tables

Different storage formats

De facto standards and “pseudo OLAP” visualisers: Beyond 20/20

• Selecting variables from an “extract” or microdata file

Different storage formats

De facto standards and “pseudo OLAP” visualisers: Beyond 20/20

Structured storage. Conventional databases with the capability to store data and metadata and dynamically generate the information requested

• It is also possible to create Databases for online statistical dissemination, with robust metadata support :

– Adhering in its design to a pre-existing structured format (the Swedish model, Spain –Oracle-,...)

– Or with a model of its own (Holland, the StatLine system)

Different storage formats

Databases

• StatLine, from the Netherlands Central Bureau of Statistics, is an excellent reference for how to combine a database with a look-up system…

• It looks simple, but it is the outcome of several years’ work...

Different storage formats

Databases

http://statline.cbs.nl/


• StatLine: powerful presentation of metadata and “pseudo OLAP” look-up functions. (The data are held in a relational database on the server; the interface uses Java applet technology; ample bandwidth is recommended …)

Viewing metadata, cubes, dimensions

Different storage formats

Databases


Viewing data: drill-down, nesting, pivoting, drag and drop

StatLine: powerful presentation of metadata and “pseudo OLAP” look-up functions. (The data are held in a relational database on the server; the interface uses Java applet technology; ample bandwidth is recommended …)

Different storage formats

Databases


The INEbase Tempus II subsystem (Time Series databank)

Relational database systems are also widely used as dissemination tools. The INE uses them:

• 1.- As a more compact store than PC-Axis files, distributing the different metadata components and the data across different tables, and enabling:

– construction of queries on demand

– exporting to PC-Axis, Excel formats …

A growing part of the information is uploaded to the INEbase Tempus II subsystem, a relational database system (Oracle) which reconciles:

• a single information store

• and presentation in two possible forms: tables and chronological series
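To make the idea of "metadata and data in different tables, queried on demand" concrete, here is a minimal sketch using SQLite. It is not the Tempus II schema: the table names, columns, series code and values are all assumptions invented for illustration.

```python
import sqlite3

# In-memory database standing in for the dissemination store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE series (
        id    INTEGER PRIMARY KEY,
        code  TEXT,   -- short identifier of the series
        title TEXT,   -- descriptive metadata
        unit  TEXT
    );
    CREATE TABLE observation (
        series_id INTEGER REFERENCES series(id),
        period    TEXT,
        value     REAL
    );
    """
)
conn.execute("INSERT INTO series VALUES (1, 'S1', 'Example index', 'index points')")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?)",
    [(1, "2007-01", 100.0), (1, "2007-02", 101.5)],  # made-up values
)


def get_series(code):
    """Query on demand: return (period, value) pairs for one series code."""
    return conn.execute(
        "SELECT o.period, o.value FROM observation o "
        "JOIN series s ON s.id = o.series_id "
        "WHERE s.code = ? ORDER BY o.period",
        (code,),
    ).fetchall()


print(get_series("S1"))
```

Because the data are stored once, the same rows can be rendered either as a single chronological series (as above) or grouped with other series into a table.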

Different storage formats

Databases


Analyze, Define, Create

1.- Relational Model

2.- GES_Tempus: tool for managing processes, using the new TP2 format

3.- Gathering data from Tempus, PC-Axis and other sources

4.- Tempus 2 (model + data)

5.- Displaying tables (collection of series) (March 2004)

6.- Access to series (first version)

7.- Extracting data from T2, e.g. IMF (FMI)

8.- Program for accessing Tempus 2 series


The INEbase Tempus II subsystem (Time Series databank)

Different storage formats

Databases


The INEbase Tempus II subsystem

We developed a tool (Ges-Tempus) for managing all operations in Tempus 2

Different storage formats

Databases


Division = statistical operation; Change = APR-06 -> DEC-07; % = change over the 20 months

No. Division      APR-06   SEP-06   APR-07   AUG-07   DEC-07   Change       %
  1 CCM               54       54       54       54       54        0       -
  2 CNTR             957      921    1,013    1,011    1,011       54    5.64
  3 CTA            1,272    1,272    1,275    1,272    1,272        0       -
  4 DPOD          24,537   24,537   24,537   24,537   24,537        0       -
  5 DPOH           8,201    8,201    8,198    8,201    8,201        0       -
  6 DPOP          24,543   24,543   24,546   24,546   24,546        3    0.01
  7 EIE           11,237   11,237   10,757   11,237   11,237        0       -
  8 EOT                -        -    9,791   10,043   10,043   10,043     NEW
  9 EPA           85,229  126,023  157,196  212,130  212,130  126,901  148.89
 10 ETCL           5,029    5,029    5,029    5,029    5,029        0       -
 11 HPT           37,940   37,940   37,940   37,940   37,940        0       -
 12 IAS              234      234      365      354      354      120   51.28
 13 ICM            1,128    1,191      818    1,191    1,191       63    5.59
 14 ICN                -        -       21       21       21       21     NEW
 15 IDB            1,764    1,764    1,764    1,814    1,764        0    0.00
 16 IEP                -        -       21       21       21       21     NEW
 17 IPC           82,705   82,705  115,878  115,878  126,765   44,060   53.27
 18 IPCA               -        -      393      393      393      393     NEW
 19 IPI            4,613    4,613    4,616    4,613    4,613        0       -
 20 IPP            5,362    5,362   10,867   10,867   10,867    5,505  102.67
 21 IPR            3,902    3,902    3,960    3,958    3,958       56    1.44
 22 MNP                -   93,290   97,582  101,874  101,874  101,874     NEW
 23 Reserv. IPC        -        -  157,728  157,728  157,728  157,728     NEW
 24 EPOB               -        -        -   18,624   18,624   18,624     NEW
 25 TV                 -        -        -      145      145      145     NEW
 26 DIR                -        -        -      660  179,502  179,502     NEW
 27 ECPF               -        -        -    3,596   34,617   34,617     NEW
 28 ECM                -        -        -    7,920        -        0     NEW
    TOTAL        298,707  432,818  674,349  754,141  978,437  679,730  227.56

Tempus 2. Divisions and series
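The last two columns of the table can be reproduced with simple arithmetic. The sketch below recomputes the absolute change and the percentage growth for the TOTAL and EPA rows from their first and last values; the `growth` helper is a name chosen here for illustration.

```python
def growth(first, last):
    """Return (absolute change, percentage change rounded to 2 decimals)."""
    change = last - first
    return change, round(100 * change / first, 2)


# First (APR-06) and last values taken from the TOTAL and EPA rows of the table.
total_change, total_pct = growth(298_707, 978_437)
epa_change, epa_pct = growth(85_229, 212_130)

print(total_change, total_pct)
print(epa_change, epa_pct)
```

Divisions marked NEW have no APR-06 value, so no percentage can be computed for them.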


2.- As a dissemination system for statistical data closer to the concept of “lists” than of “tables”. An example: the List of place names

Filterable lists, not cross-tabulations of variables

Different storage formats

Databases


3.- As a dissemination system for statistical data closer to the concept of “lists” than of “tables”. Another example: the Industrial Product Survey

Filterable lists, not cross-tabulations of variables

Different storage formats

Databases


What role might BI/DW systems play in a statistical dissemination strategy?

• When, in the company or economic subject being studied …

– The number of variables or dimensions to be analysed is high

– Granularity, or the level of subject or territorial detail, is also high

– It is difficult to predict many of the possible subject and territorial crosses, as well as the hierarchical presentation levels appropriate for different types of users

• … we need to model “n-dimensional cubes” populated by numbers of cells significantly greater than 10^5 …

• We can continue to use traditional relational modelling systems; however … it is time to speak to an expert in multidimensional analysis!
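To see how quickly a cube passes the 10^5-cell mark, note that the number of cells is the product of the number of categories in each dimension. The dimension names and sizes below are hypothetical, chosen only to show the arithmetic.

```python
from math import prod

# Hypothetical dimensions of a small-area cube and their category counts.
dimensions = {
    "municipality": 8000,
    "sex": 2,
    "age group": 20,
    "nationality group": 5,
}

cells = prod(dimensions.values())
print(cells)           # total number of cells in the cube
print(cells > 10**5)   # does it exceed the threshold mentioned above?
```

Adding one more dimension multiplies the total again, which is why cross-tabulations cannot all be prepared in advance at this level of detail.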

Different storage formats

OLAP systems for data dissemination Multidimensional databases


• This means no longer working with multiple “data sets”, or with a set of relational tables, and instead storing (fewer) “cubes” that contain a large amount of data with a high level of subject, territorial or temporal granularity

• Ideal for displaying results of Censuses and other operations enabling small-area statistics

– Censuses

– Large surveys, large company or establishment directories, high level of detail

• Intranet or Internet

• Use of the most advanced OLAP techniques

• One example is the experience of the INE in the 2001 Censuses

Different storage formats

OLAP systems for data dissemination Multidimensional databases


• It is not yet the most common dissemination technology; however, it is increasingly foreseeable that, around the XML format (and the SDMX project), the following will be built in addition to data exchange standards …

– automated surveying systems for companies via the internet

– data dissemination systems

• They are ideal for combining with the data-structuring systems inherent to XML, using “classic” or “multidimensional (OLAP)” databases

• An interesting experiment under way: the pilot project on external debt, based on the SDMX standard
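As a loose illustration of why XML suits statistical data, the sketch below reads a small message in which each series carries its dimensions as attributes and each observation carries a period and a value. The XML is a simplified, SDMX-inspired fragment written for this example; it does not conform to the SDMX schemas, and the attribute names and values are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, SDMX-inspired sample; NOT a schema-valid SDMX message. Values are made up.
SAMPLE = """<DataSet>
  <Series FREQ="A" REF_AREA="ES" INDICATOR="EXT_DEBT">
    <Obs TIME_PERIOD="2006" OBS_VALUE="10.5"/>
    <Obs TIME_PERIOD="2007" OBS_VALUE="11.2"/>
  </Series>
</DataSet>"""


def read_series(xml_text):
    """Return a list of (dimension dict, [(period, value), ...]) pairs."""
    root = ET.fromstring(xml_text)
    result = []
    for series in root.iter("Series"):
        key = dict(series.attrib)  # the dimensions identifying the series
        obs = [
            (o.get("TIME_PERIOD"), float(o.get("OBS_VALUE")))
            for o in series.iter("Obs")
        ]
        result.append((key, obs))
    return result


parsed = read_series(SAMPLE)
print(parsed[0][0]["REF_AREA"], parsed[0][1])
```

The same parsed structure could be loaded into either a relational table or a cube, which is the combination the slide describes.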

Different storage formats

International standardisation experiences in operation: SDMX-XML

Interesting example of the OECD using SDMX: http://stats.oecd.org/nawwe/csp/default.html

Different storage formats

International standardisation experiences in operation: SDMX-XML

• We have said “Structure to disseminate”. However ...

• What if the data is completely unstructured, as is the case with old, paper-based publications?

Example of an application with unstructured data

• The INE does not rule out using the internet to disseminate these valuable historical collections. The INEbase HISTORIA project, currently in the final stages of cataloguing, combines mass OCR processing (scanning), a relational database management system (RDBMS) and a file server to provide guided access and search systems for viewing and downloading pages from those publications in PDF and Excel formats

• This will be covered in another presentation


Example of an application with unstructured data


• Depending on the IT infrastructure dedicated to the storage and dissemination of data and metadata, we can use different tools to structure the information to be disseminated, from greater to lesser complexity …

– A metadata creation environment in a multidimensional database system (the INE uses one in the 2001 Census DW)

– Or one associated with a relational database (the INEbase Tempus II environment …)

– Or something as simple as using PX-Make (or PX-Edit) to produce PC-Axis files ...
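To give a feel for what a tool like PX-Make produces, here is a minimal sketch that writes a table as keyword/value text in the style of a PC-Axis file. It is not PX-Make's own logic, and the output is deliberately incomplete: a valid PX file needs further mandatory keywords that are omitted here. The `make_px` helper and the sample table are hypothetical.

```python
def make_px(title, stub, heading, values, data):
    """Build PX-style text: metadata keywords first, then the data matrix."""
    lines = [
        f'TITLE="{title}";',
        f'STUB="{stub}";',        # variable shown in the rows
        f'HEADING="{heading}";',  # variable shown in the columns
    ]
    for var in (stub, heading):
        quoted = ",".join(f'"{v}"' for v in values[var])
        lines.append(f'VALUES("{var}")={quoted};')
    lines.append("DATA=")
    for row in data:
        lines.append(" ".join(str(x) for x in row))
    lines.append(";")
    return "\n".join(lines)


# Made-up table: two regions (rows) by two sexes (columns).
px = make_px(
    "Population by region and sex",
    "region",
    "sex",
    {"region": ["North", "South"], "sex": ["F", "M"]},
    [[10, 12], [7, 9]],
)
print(px)
```

The point of the format is that the metadata (titles, variables, value labels) travels in the same file as the figures, so a viewer can rebuild the table without any other source.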

Example of a tool for structuring data: PX-Make


Example of a tool for structuring data: PX-Make

• Interface designed for working with PX files

• Exchange of data with Excel, Access..., and with PX files already made

• EASY: a day’s training is enough. Used by service promoters

• It is part of the “PC-Axis suite” software


Example of a tool for structuring data: PX-Make


Example of a tool for structuring data: PX-Make


Data retrieval and presentation

Are our ‘official’ statistical data naked or boring data?

http://blogstats.wordpress.com/2007/07/20/naked-data/


Are ‘official’ statistical data boring, or ‘naked data’?

Let’s look at some ways of giving our users friendlier access to our information.

Statistical data can even be amusing!

Especially if the information is structured ... of course!


• “Some say that statistics or data aren’t very sexy, that they have the image of being quite difficult, of being boring or even of being biased and not worth to be studied”. http://blogstats.wordpress.com/category/0223-gapminder/

• “Listening to representatives of Gapminder or Swivel one could think official statistics is just naked data, difficult to access and not considering new technologies. Is this true?” http://blogstats.wordpress.com/2007/07/20/naked-data/

Are ‘official’ statistical data boring, or ‘naked data’?

• “Official statistics are a key “public good” that foster the progress of societies”. OECD World Forum, Istanbul Declaration, June 2007

We recommend watching the video ‘Unveiling the beauty of statistics’, presented by Hans Rosling at the OECD World Forum in Istanbul in June 2007


Are ‘official’ statistical data boring, or ‘naked data’?

One of the questions we asked in our survey was:

“Do you think blogs and forums are interesting in statistical dissemination?”

Blogs can be ideal tools for statisticians to learn about new initiatives for improving statistical data dissemination

e.g. http://blogstats.wordpress.com

BLOGS


Private initiatives

Using ‘our’ official data, they try to make it user-friendly

• Gapminder: www.gapminder.org “Gapminder developed the Trendalyzer software that converts international statistics into moving, interactive and enjoyable graphics.” (http://en.wikipedia.org/wiki/Hans_Rosling)

• Many Eyes: “Our goal is to "democratize" visualization and to enable a new social kind of data analysis” (www.many-eyes.com). Example: Internet Penetration and Usage in Europe, by Country, Sept. 2007

• Swivel: www.swivel.org “Where Curious People Explore Data”. Example: Average age at death by Age at retirement

• Netvibes: www.netvibes.com “Built-in Netvibes modules include an RSS/Atom feed reader” (http://en.wikipedia.org/wiki/Netvibes)


Are ‘official’ statistical data boring, or ‘naked data’?

Graphs “A picture is worth a thousand words”

Different ‘friendly’ visualization styles for similar data


Are ‘official’ statistical data boring, or ‘naked data’?

Enabling users to make calculations, even using everyday language to explain the objective of the program


Are ‘official’ statistical data boring, or ‘naked data’?

‘Gossiping’ (why not?) about some in-demand data

The most frequent names (Zurich)

Or surnames (Spain)

‘Friendly’ style

‘Naked’ style


Are ‘official’ statistical data boring, or ‘naked data’?

Giving users tools to help them use our sites

Search engines (including suggested links) and sitemap


Are ‘official’ statistical data boring, or ‘naked data’?

Giving users the ability to search for values, configure the results screen, export to different formats ...

Selection of variables (INE Spain) ... or PX-Web model (Finland)


Special sections for ‘other’ types of users

With children in mind (Brazil) ... or journalists (Spain)


Historical information is very popular and much in demand

Evolution of municipalities in Spain


Even providing plain-language texts for non-specialist users

The same information is available for specialized users in other sections


The icing on the cake

Interesting maps (Switzerland) ... Population clock (Census, USA)


Thank you very much for your attention. Any questions?

Storing data. Structure for dissemination

• Different data storage formats

• Data retrieval and presentation

European Statistical Training Programme