practical introduction to spatial data integrator powered by

42
camptocamp SA / Foss4g2008 / www.camptocamp.com / [email protected] Practical introduction to Spatial Data Integrator powered by OpenSource SpatialETL Agenda : Camptocamp and Talend presentation Spatial Data Integrator (SDI) powered by Talend overview Tutorial What's next ?

Upload: others

Post on 13-Feb-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

camptocamp SA / Foss4g2008 / www.camptocamp.com / [email protected]

Practical introduction to Spatial Data Integrator powered by

OpenSource SpatialETL

Agenda:Camptocamp and Talend presentationSpatial Data Integrator (SDI) powered by Talend overviewTutorialWhat's next ?

2

Camptocamp, an Open Source Base Camp ! 45 employees

Switzerland & France About 50 to 70 % of growth per year since 2002 3 activity domains

Spatial solutions Business solutions Infrastructure solutions

4 services poles Consulting Engeneering Supporting Training

Geo-spatial Solutions

Infrastructure Solutions

Business Solutions

CONSULTING

ENGENEERING

SUPPORT

TRAINING

WebmappingGIS / MetadataSpatial Data InfrastructuresWeb Services

ERPBusiness IntelligenceETL

SecurityLinux ServerVoIP

3

Talend overview

Talend is the first provider of open source data integration software

Located in France, USA, Germany, China 70 employees

First product release: 2006 Leader in open source data integration

Rival large established proprietary players

4

What is ETL? Extract / Transform / Load

ETL is a process in Data Warehousing. « How to get data in ? » is ETL process name.

http://en.wikipedia.org/wiki/Extract,_transform,_load)

Extractextract data from source system where data originates.

Transformapply series of rules or functions to the extracted data (selecting, translating, joining, ...)

Loadonce data transformed and cleaned, load the data in a data warehouse

5

Spatial Data integration

Synchronize and check integrity

of your applications data

ExternalData Files

Migrate legacyapplications

Parcel

RoadsNetwork Production SoE

CentralGeodata

warehouse

Extract, Transform and Load Data

GeospatialDatabase

Replicate subset of datainto subject matter DM

Datamart

Datamart

Exchange / sharedata with customers

or suppliers

eCommerce

Govt agency

6

Productivity & Ease of Use

Graphical development Dramatically increased productivity & ramp up Combined graphical & technical views Drag-and-drop mapping interface Large library of components & connectors

Leverage industry-standard languages Java, Perl, SQL

7

Job pannel

Componentspalette

Job start/stopComponentproperties

tab

8

GeoSpatial components

Feature manipulation Formats Raster processing*

MetadataGeocoding Viewer

* next release (version 1.3)

*

*

*

*

*

*

*

*

9

Tutorial: Places around natural reserves

Unzip the application Start Talend Create a connection (set the email address)

Create a new project in Java (Create button)

With GeoNames.org data for South Africa (isoCode ZA), use nature reserves (fcode=RESN) and search for populated places (more than 10 000 inhabitants)in .2 width circle.

10

1.Create a new job

First step is to create a new job.On the repository Tab, in Job design, click on create a new Job

Job name: getData This will create a new view to design your job

11

2.Download the data & unzip

12

3.Define a schema to use later in jobs

13

4.Convert text file to MapInfo format

14

5.Filter natural reserves ...

15

... and compute buffer

16

6.Search for populated place

17

7.Search for populated place

Intersect buffers with places

Filter population

18

8.Display results

19

9.Run the full process in a job of sub-jobs

20

New release coming soon

Version 1.3 Changes

New Components : WFS, GPX, simplify, transform, ...

Added Geometry type Added wizard for automatic schema

creation Embed uDig to view data Linked to Sextante library to process

RASTER Demo

21

Spatial Data Integrator project

Community website http://spatialdataintegrator.org forum, wiki, tutorials

Developpers: SVN repository Prototype with uDig (thanks Jesse) Sandbox for Sextante (thanks Victor) Sandbox for Grass

camptocamp SA / Foss4g2008 / www.camptocamp.com / [email protected]

Practical introduction to Spatial Data Integrator powered by

OpenSource SpatialETL

Agenda:Camptocamp and Talend presentationSpatial Data Integrator (SDI) powered by Talend overviewTutorialWhat's next ?

Data integration is a key process

Data volumes in exponential growth

Diversity and heterogeneity of data sources

Data processing plays a major role in implementing GIS projects

Consolidating and aggregating spatial data with data from other sources is often required

GIS data integration situation

Use command or hand-made script from various tools and libraries

gdal/ogr commands, fwtools, postgis command, ...

Proprietary Spatial ETL such as FME

Lack of Open Source global geo-spatial data integrator

SDI prototyped in 2007 and presented at Foss4g 2007

Foss4g2008 / Francois Prunayre

2

Camptocamp, an Open Source Base Camp ! 45 employees

Switzerland & France About 50 to 70 % of growth per year since 2002 3 activity domains

Spatial solutions Business solutions Infrastructure solutions

4 services poles Consulting Engeneering Supporting Training

Geo-spatial Solutions

Infrastructure Solutions

Business Solutions

CONSULTING

ENGENEERING

SUPPORT

TRAINING

WebmappingGIS / MetadataSpatial Data InfrastructuresWeb Services

ERPBusiness IntelligenceETL

SecurityLinux ServerVoIP

Foss4g2008 / Francois Prunayre

3

Talend overview

Talend is the first provider of open source data integration software

Located in France, USA, Germany, China 70 employees

First product release: 2006 Leader in open source data integration

Rival large established proprietary players

Foss4g2008 / Francois Prunayre

4

What is ETL? Extract / Transform / Load

ETL is a process in Data Warehousing. « How to get data in ? » is ETL process name.

http://en.wikipedia.org/wiki/Extract,_transform,_load)

Extractextract data from source system where data originates.

Transformapply series of rules or functions to the extracted data (selecting, translating, joining, ...)

Loadonce data transformed and cleaned, load the data in a data warehouse

more info http://en.wikipedia.org/wiki/Extract,_transform,_load

Foss4g2008 / Francois Prunayre 5

5

Spatial Data integration

Synchronize and check integrity

of your applications data

ExternalData Files

Migrate legacyapplications

Parcel

RoadsNetwork Production SoE

CentralGeodata

warehouse

Extract, Transform and Load Data

GeospatialDatabase

Replicate subset of datainto subject matter DM

Datamart

Datamart

Exchange / sharedata with customers

or suppliers

eCommerce

Govt agency

Spatial Data Integrator is one component of the SDI useful for ...

Data manipulation (Extraction, Quality checking, Conversion, Projection)

Data & metadata production (vector and Raster analysis)

Data & metadata manager (Network files and database manipulation, archiving)

Data dissemination (WWW publication, Deploy jobs as webservice)

Data reporting (Indicators, Analysis, ...)

... End user tools to define common tasks (ie. Job, process, script) usually made by hand or scripting in desktop GIS.

Foss4g2008 / Francois Prunayre 6

6

Productivity & Ease of Use

Graphical development Dramatically increased productivity & ramp up Combined graphical & technical views Drag-and-drop mapping interface Large library of components & connectors

Leverage industry-standard languages Java, Perl, SQL

Key features Business-oriented process modeling Graphical development Robust and scalable execution Broadest connectivity to support all systems Project repository for design and execution Real-time debugging

A high adoption rate 100,000 product downloads 20% register as users

Active community 1,000 beta testers 500 forum contributors

Highest performance, robust and scalable execution Grid-distributed processing Industry-standard code generated (Java or Perl) Leverage both ETL and ELT architectures Process data closest to the source

Foss4g2008 / Francois Prunayre

7

Click to add title

Job pannel

Componentspalette

Job start/stopComponentproperties

tab

Broadest connectivity to support all systems 100+ connectors available out of the box

RDBMS: Oracle, PostgreSQL, MySQL, DB2, SQL Server, Sybase, Ingres, …

Web: Web Services, FTP, HTTP, POP, SMTP…

Files: Delimited, positional, XML, Excel…

Business Applications: SugarCRM, SalesForce.com, LDAP…

Geospatial GIS Format (Shapefile, MIF, WFS, GPX, PostGIS), Geometry manipulation based on

JTS (buffer, relation, transformation)

8

GeoSpatial components

Feature manipulation Formats Raster processing*

MetadataGeocoding Viewer

* next release (version 1.3)

*

*

*

*

*

*

*

*

Foss4g2008 / Francois Prunayre

9

Tutorial: Places around natural reserves

Unzip the application Start Talend Create a connection (set the email address)

Create a new project in Java (Create button)

With GeoNames.org data for South Africa (isoCode ZA), use nature reserves (fcode=RESN) and search for populated places (more than 10 000 inhabitants)in .2 width circle.

GeoSpatial components are only available in Java

Slides, datasets are available here

http://spatialdataintegrator.org/foss4g2008/data.zip

Foss4g2008 / Francois Prunayre

10

1.Create a new job

First step is to create a new job.On the repository Tab, in Job design, click on create a new Job

Job name: getData This will create a new view to design your job

Foss4g2008 / Francois Prunayre

11

2.Download the data & unzip

Click to add an outline

1.DOWNLOAD THE DATA

From the palette view, Add a Internet/tFileFetch component to the workspace.

In the component properties tab, set URI to "http://download.geonames.org/export/dump/ZA.zip" to get geonames.org data for South Africa.

Set the destination directory (eg. ''/tmp/'' or ''c:/temp/'') and filename (''ZA.zip'').

2.UNZIP

Add a File/FileManagement/tFileUnarchive component.

Right click the tFileFetch and trigger « OnComponentOk » the tFileUnarchive.

Set the archive file (eg. ''/tmp/ZA.zip'') and the extraction directory (eg. ''/tmp'')

3. RUN THE JOB (F6) and check that a ZA.txt file and a Readme.txt should be in your temp directory.

NOTE: In Talend, all strings should be quoted in component's properties.

NOTE: In the name of the component, the first letter « t » stands for Talend initial

components, « s » for Spatial ones, « u » for Users ones.

HINT: If a panel could not be find in the Talend workspace (eg. « Palette »), click on menu

« Window>Show view », and then search for the « Palette ».

Foss4g2008 / Francois Prunayre

12

3.Define a schema to use later in jobs

Click to add an outline

In the repository tab, Metadata is use to store information to be used in all the jobs of your workspace. Let's create one for the geonames text file format.

Right click File Delimited and create a file delimited

Set name (eg. Geonames)

Click next to go to step 2

* Browse to select the file you unzipped in the previous step

Click next to go to step 3

* Set Field separator to tabulation

Click next to go to step 4

* Talend automaticaly detect column type (but modify type of column6 to be String instead of Character because ESRI Shapefile doesn't support that datatype)

* Rename column name in order to easily use this later. Geonames provides a Readme.txt with information about column names (Columns list: is: geonameid, name, asciiname, altname, lat, lon, fclass, fcode, country, cc2, adm1, adm2, adm3, adm4, pop, elevation, gtopo, tz, date)

NOTE: you could also import a schema from an XML file (http://spatialdataintegrator.org/foss4g/)

Foss4g2008 / Francois Prunayre

13

4.Convert text file to MapInfo format

Click to add an outline

Create a new job named « convertToGIS ».

Drag & drop, the geoname file delimited from the repository object into this new job view.

Add a Geo/Manipulators/s2DPointReplacer.

* Right click the tFileDelimited component and create a row/main link between the 2 components.

* In the s2DPointReplacer select lon , lat columns to define X and Y to be used to create a point geometry. If you can't see column list in the properties, click the « Sync Column » button to update the column of the current component.

Add a Geo/File/Output/sMapinfoOutput components.

* Create a row/main link between the s2DPointReplacer and this component.

* Set the file name (eg. ''/tmp/ZA.mif'')

Run the job (F6). Try to turn on statistics in the run job tab and run the job again.

Display the new layer in a GIS (eg. QGis).

Foss4g2008 / Francois Prunayre

14

5.Filter natural reserves ...

Click to add an outline

The objectives is to filter geonames data to extract natural reserve.

Add a processing/Replicate component after the s2DReplacer to duplicate the flow. Remove connection between s2dPointReplacer and sMapinfoOutput. Connect s2DPointReplacer to tReplicate, and tReplicate to sMapinfoOutput.

Add a tFilterRow component and connect the tReplicate to the tFilterRow. In the tFilterRow properties, set a filter on the « fcode » column to get only the natural reserve of South Africa (ie. InputColumn: fcode, value=''RESN''; do not forget « '' »).

Add a Geo/File/Output/sShapefileOutput component and set the filename (eg. ''/tmp/resn.shp''). Connect the tFilterRow (filter flow) to the new output component. Click the « synch column » button to synchronise columns between the 2 components.

Run the job (F6) (optionnaly with statistics)

Display the new layer in a GIS.

Foss4g2008 / Francois Prunayre

15

... and compute buffer

Click to add an outline

The objectives is to compute a buffer around natural reserve for later use.

Add a processing/Replicate component after the tFilterRow to duplicate the flow.

Add a Geo/Manipulators/sBufferCalculator component. Connect the tReplicate to the sBufferCalculator. In the sBufferCalculator properties set the distance property (eg. « .2 ») to define the width of the buffer around the natural reserves.

Add a Geo/File/Output/sShapefileOutput component and set the filename (eg. ''/tmp/buffer.shp''). Connect the sBufferCalculator to the new output component.

Run the job (F6)

Display the new layer in a GIS.

Foss4g2008 / Francois Prunayre

16

6.Search for populated place

Click to add an outline

Search for populated place with a population greater than 10 000 inhabitants.

Create a new job named « intersect ».

Drag & drop, the geoname file delimited object into this new job.

Add a Geo/Manipulators/s2DPointReplacer and set properties (X and Y columns to be used to create the point geometry as in first job).

Add a processing/tMap component and connect the main flow from the s2DPointReplacer to the tMap.

Add a Geo/File/Input/sShapefileInput. Set filename to the file containing the buffer around natural reserves (eg. ''/tmp/buffer.shp''). Connect it to the tMap.

Double click the tMap...

Foss4g2008 / Francois Prunayre

17

7.Search for populated place

Click to add an outline

Intersect buffers with places

Filter population

The tMap component is one of the more advanced component in Talend. It allows complex join, mapping and filter actions.

In the left side of the tMap panel, INPUT components are listed (row2 should be the geonames text files and row3 the buffer).

1.CREATE A JOIN TO INTERSECT BUFFER AND GEONAMES DATA

In the buffer input (row3), click the « activate filter expression » button.

Click in the expression filter section, and set the join « GeoOperation.INTERSECTS(row2.the_geom , row3.the_geom) » (use CTRL+SPACE to turn on autocompletion)

2. CREATE THE OUTPUT FLOW

In the right side, click the « Add output table » button. Set the name (eg. « place »)

Drag & drop the columns from the left to the right (eg. Geonameid, name, the_geom, pop) to set the schema of the output.

Click in the expression filter section, and set the filter on the population column « row2.pop > 10000 »

Click ok.

3. CREATE OUTPUT

Add a sShapefileOutput component, set filename (eg. ''/tmp/place.shp'') and connect the tMap output flow (named « place ») to this new component.

Run the job & display in a GIS.

Foss4g2008 / Francois Prunayre

18

8.Display results

Click to add an outline

Foss4g2008 / Francois Prunayre

19

9.Run the full process in a job of sub-jobs

Click to add an outline

Create a new job.

From the repository> Job, Drag & Drop the first job (getData) in the job designer.

Drag & Drop the convertToGIS job

Drag & Drop the intersect job.

Then connect the 3 jobs. Trigger onSubJobOk events, the next job to process the 3 jobs in the correct order.

Foss4g2008 / Francois Prunayre

20

New release coming soon

Version 1.3 Changes

New Components : WFS, GPX, simplify, transform, ...

Added Geometry type Added wizard for automatic schema

creation Embed uDig to view data Linked to Sextante library to process

RASTER Demo

Changes

* Update to TOS 2.4.1

* Update to GeoTools 2.5

* 7 new Components

** sWfsInput:

** sGpxInput:

** sSimplify: Use DouglasPeuckerSimplifier and TopologyPreservingSimplifier algorithme to simplify geometries

** sSimpleGeomToMulti: Convert simple geometry (POINT, LINESTRING, POLYGON) to collection (MULTIPOINT, MULTILINESTRING, MULTIPOLYGON).

** sGraticuleBuilder: Create grid

** sChangeLineDirection: Invert line direction

** sTransform: Compute affine transformation (translation, rotation and scaling)

Foss4g2008 / Francois Prunayre

21

Spatial Data Integrator project

Community website http://spatialdataintegrator.org forum, wiki, tutorials

Developpers: SVN repository Prototype with uDig (thanks Jesse) Sandbox for Sextante (thanks Victor) Sandbox for Grass