geospatial csv imports hidden complexity


Upload: rafael-de-la-torre

Post on 15-Apr-2017


GEOSPATIAL CSV IMPORTS HIDDEN COMPLEXITY
Rafa de la Torre

The hidden complexity of importing geospatial CSVs

CartoDB

The easiest way to create maps and analyze geospatial information

Editor, platform with APIs

60k+ users, ~1k paying users, product 3+ years old

- Migrant files
- Stabilized apartments (John Krauss)
- Multas Madrid (Madrid fines; Feb '15, 17.5M)
- Illustreets

Agenda

1) CSV Format Issues
2) Import Issues

CSV FORMAT ISSUES

Intro

.csv / MIME: text/csv
Unknown birthdate (80s?)
RFC 4180 (2005)

- Table
- Columns: commas; rows: line breaks
- Fortran '67, Fortran 77 '78
- Interchange between databases

Intro

Plain text
Simple format
Simple rules

- MS-DOS-style lines that end with (CR/LF) characters (optional for the last line)

- An optional header record (there is no sure way to detect whether it is present).

- Each record "should" contain the same number of comma-separated fields.

- Any field may be quoted (with double quotes).

- Fields containing a line-break, double-quote, and/or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).
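The quoting rule above is where naive parsers break. A minimal sketch, assuming Ruby's standard `csv` library and a made-up two-record file (`SAMPLE` and both helper names are illustrative, not from the talk), comparing a split on CR/LF and commas with a quote-aware parse:

```ruby
require "csv"

# Hypothetical two-record file. The second field of the data record holds a
# comma, an escaped double quote and a CR/LF, so it has to be quoted.
SAMPLE = %Q{id,comment\r\n1,"said ""hi"", then\r\nleft"\r\n}

# Naive approach: cut on CR/LF first, then on commas.
def naive_records(text)
  text.split("\r\n").map { |line| line.split(",") }
end

# Quote-aware approach using Ruby's standard CSV library.
def parsed_records(text)
  CSV.parse(text)
end

p naive_records(SAMPLE).length # more than 2: the quoted field was torn apart
p parsed_records(SAMPLE)       # header plus one intact data record
```

The naive version reports an extra record because it treats the line break inside the quoted field as a record separator.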

Usage

Import example

(everything except the drag and drop)

CSV

0101000020E610000000000000008049C000000000000038C0,1083

"alien",2014-11-04 15:24:40.43413+00

Category 1,

"jumpjump up!", {""value"":""es""}

1. WKB, int
2. String, date (ISO)
3. String, empty string? NULL?
4. String with line breaks, CSV
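The third sample row (`Category 1,`) raises the question of what a trailing empty field means. As a small illustration, Ruby's standard `csv` library distinguishes an unquoted empty field from a quoted empty string; whether an importer should keep that distinction is a policy decision. `import_value` and its `empty_as_null` flag are hypothetical names for such a policy, not CartoDB's importer:

```ruby
require "csv"

# Hypothetical policy knob: should a quoted empty string also become NULL?
def import_value(field, empty_as_null: false)
  return nil if field.nil?
  return nil if empty_as_null && field.empty?
  field
end

unquoted = CSV.parse_line('Category 1,')   # second field: nothing at all
quoted   = CSV.parse_line('Category 1,""') # second field: quoted empty string

p unquoted
p quoted
p import_value(quoted[1], empty_as_null: true)
```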

WKT: Well-Known Text

POINT (30 10)
LINESTRING (30 10, 10 30, 40 40)
POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
MULTIPOINT ((10 40), (40 30), (20 20), (30 10))
MULTIPOLYGON (((30 20, 45 40, 10 40, 30 20)),((15 5, 40 10, 10 20, 5 10, 15 5)))
https://en.wikipedia.org/wiki/Well-known_text
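To show what an importer has to recognize inside a text column, here is a deliberately tiny sketch that reads only the two simplest WKT shapes from the examples above. `parse_simple_wkt` is a hypothetical helper; a real import would hand WKT to a geometry library or to the database rather than use a regular expression:

```ruby
# Parses only POINT and LINESTRING; any other WKT type raises ArgumentError.
def parse_simple_wkt(text)
  match = text.strip.match(/\A(POINT|LINESTRING)\s*\((.*)\)\z/i)
  raise ArgumentError, "unsupported WKT: #{text}" unless match

  # "30 10, 10 30" becomes [[30.0, 10.0], [10.0, 30.0]]
  coords = match[2].split(",").map { |pair| pair.split.map { |n| Float(n) } }
  match[1].upcase == "POINT" ? coords.first : coords
end

p parse_simple_wkt("POINT (30 10)")
p parse_simple_wkt("LINESTRING (30 10, 10 30, 40 40)")
```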

WKB: Well-Known Binary

POINT(2.0 4.0) = 000000000140000000000000004010000000000000
https://en.wikipedia.org/wiki/Well-known_text#Well-known_binary
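The hex string above can be unpacked by hand: one byte-order byte, a 4-byte geometry type, then two 8-byte doubles. The sketch below decodes points only and also accepts the PostGIS EWKB variant, where a flag bit in the type announces a 4-byte SRID before the coordinates. `decode_wkb_point` and `EWKB_SRID_FLAG` are names chosen for this example:

```ruby
EWKB_SRID_FLAG = 0x20000000 # PostGIS extension bit: an SRID follows the type

# Decodes a hex-encoded WKB or EWKB point into a hash. Points only.
def decode_wkb_point(hex)
  bytes = [hex].pack("H*")
  little_endian = bytes.unpack1("C") == 1
  # unpack directives: V/E are little-endian, N/G are big-endian
  u32, f64 = little_endian ? %w[V E] : %w[N G]

  type = bytes.byteslice(1, 4).unpack1(u32)
  offset = 5
  srid = nil
  if (type & EWKB_SRID_FLAG) != 0
    srid = bytes.byteslice(offset, 4).unpack1(u32)
    offset += 4
  end
  raise ArgumentError, "not a point" unless (type & 0xFF) == 1

  x, y = bytes.byteslice(offset, 16).unpack("#{f64}2")
  { srid: srid, x: x, y: y }
end

p decode_wkb_point("000000000140000000000000004010000000000000")
p decode_wkb_point("0101000020E610000000000000008049C000000000000038C0")
```

The second call uses the value from the first sample CSV row, which is little-endian and carries the SRID flag.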

GeoJSON

{
  "type": "Feature",
  "geometry": { "type": "Point", "coordinates": [125.6, 10.1] },
  "properties": { "name": "Dinagat Islands" }
}
http://geojson.org/
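A GeoJSON value found in a CSV column can be read with a plain JSON parser. This sketch, assuming Ruby's standard `json` library, turns the Point feature above into WKT; note that GeoJSON coordinates are ordered longitude first, then latitude. `geojson_point_to_wkt` is a hypothetical helper that handles Point features only:

```ruby
require "json"

# Converts a GeoJSON Point feature (as a JSON string) to a WKT string.
def geojson_point_to_wkt(feature_json)
  geometry = JSON.parse(feature_json).fetch("geometry")
  raise ArgumentError, "only Point is handled" unless geometry["type"] == "Point"

  lon, lat = geometry.fetch("coordinates") # GeoJSON order: longitude, latitude
  "POINT (#{lon} #{lat})"
end

feature_json = <<~GEOJSON
  { "type": "Feature",
    "geometry": { "type": "Point", "coordinates": [125.6, 10.1] },
    "properties": { "name": "Dinagat Islands" } }
GEOJSON

puts geojson_point_to_wkt(feature_json)
```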

IMPORT ISSUES

Typical

Huge files (>1GB)
Lots of rows (2M+)
Lots of columns (~1600)
XLS/XLSX -> CSV

Typical

Stream HTTP downloaded file
Stream file between servers
Stream data import to DB

IO.copy_stream(src, file)
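`IO.copy_stream` copies in chunks, so the whole file never has to sit in memory. A minimal sketch of how the call might be wrapped, with a `StringIO` standing in for the HTTP response body; `spool_to_tempfile` is a hypothetical helper and the real importer's plumbing is not shown in the slides:

```ruby
require "stringio"
require "tempfile"

# Streams src into a temporary file, then yields the rewound file and the
# number of bytes copied. The file is removed when the block returns.
def spool_to_tempfile(src)
  Tempfile.create(["import", ".csv"]) do |file|
    file.binmode
    copied = IO.copy_stream(src, file)
    file.rewind
    yield file, copied
  end
end

src = StringIO.new("lon,lat\n-3.7,40.4\n") # stand-in for a download stream
spool_to_tempfile(src) do |file, copied|
  puts "copied #{copied} bytes"
  puts file.read
end
```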

Typical

CartoDB-specific

Content guessing (e.g. lat/lon)
Type guessing
Fixing geometry errors
Sync tables -> No downtime allowed
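Content and type guessing can be sketched as two small heuristics: look for column names that suggest coordinates, and check whether every non-empty value in a column fits a narrower type. The name lists, the regular expressions and both helper names below are assumptions made for this example, not CartoDB's actual rules:

```ruby
require "csv"

# Hypothetical header names that hint at coordinate columns.
LAT_NAMES = %w[lat latitude].freeze
LON_NAMES = %w[lon lng long longitude].freeze

# Returns [lat_header, lon_header] when both are present, otherwise nil.
def guess_lat_lon(headers)
  lat = headers.find { |h| LAT_NAMES.include?(h.to_s.downcase) }
  lon = headers.find { |h| LON_NAMES.include?(h.to_s.downcase) }
  lat && lon ? [lat, lon] : nil
end

# Picks the narrowest type that fits every non-empty value in a column.
def guess_type(values)
  present = values.compact.reject(&:empty?)
  return :string if present.empty?
  return :integer if present.all? { |v| v.match?(/\A[+-]?\d+\z/) }
  return :float if present.all? { |v| v.match?(/\A[+-]?\d+(\.\d+)?\z/) }
  :string
end

table = CSV.parse("name,Lat,LON,visits\nA,40.4,-3.7,10\nB,41.3,2.1,\n", headers: true)
p guess_lat_lon(table.headers)
table.headers.each { |h| puts "#{h}: #{guess_type(table[h])}" }
```

Guessing over every row of a multi-gigabyte file is expensive, so a real importer would probably sample rows; this sketch scans whole columns for simplicity.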

DB-Specific

Leave DB indexes as the last step
Prefer one big INSERT to multiple UPDATEs
GDAL's ogr2ogr > Ruby/Python scripts
http://www.gdal.org/ogr2ogr.html
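As an illustration of delegating the load to ogr2ogr instead of a hand-written script, the sketch below only builds a command line and prints it; nothing is executed. The table name, database name and column names are placeholders, the options used (`-f`, `-nln`, `-oo` with the CSV driver's `X_POSSIBLE_NAMES`/`Y_POSSIBLE_NAMES`, and the `PG_USE_COPY` config option) should be checked against the installed GDAL version, and the exact invocation used by CartoDB is not given in the slides:

```ruby
require "shellwords"

# Builds (does not run) an ogr2ogr invocation that loads a CSV into PostGIS.
# All argument values are placeholders for this example.
def ogr2ogr_command(csv_path, table:, dbname:, lon: "lon", lat: "lat")
  [
    "ogr2ogr",
    "--config", "PG_USE_COPY", "YES",    # bulk load with COPY, not row INSERTs
    "-f", "PostgreSQL", "PG:dbname=#{dbname}",
    csv_path,
    "-nln", table,                       # name of the destination table
    "-oo", "X_POSSIBLE_NAMES=#{lon}",    # CSV column holding longitude
    "-oo", "Y_POSSIBLE_NAMES=#{lat}"     # CSV column holding latitude
  ]
end

cmd = ogr2ogr_command("points.csv", table: "points", dbname: "gis")
puts Shellwords.join(cmd)
```

Keeping the command as an array avoids shell-quoting problems if it is later passed to `system(*cmd)`.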

Questions?

Thanks!


@Kartones

11/26/15
