geospatial csv imports hidden complexity
GEOSPATIAL CSV IMPORTS HIDDEN COMPLEXITY
Rafa de la Torre
The hidden complexity of importing geospatial CSVs
CartoDB
The easiest way to create maps and analyze geospatial information
Editor, platform with APIs
60k+ users, ~1k paying users, 3+ year-old product
- Migrant files
- Stabilized apartments (John Krauss)
- Multas Madrid (Feb '15, 17.5M)
- Illustreets
Agenda
1) CSV Format Issues
2) Import Issues
CSV FORMAT ISSUES
Intro
.csv / MIME: text/csv
Unknown birthdate (80s?)
RFC 4180 (2005)
- Table
- Columns: commas; rows: line breaks
- Fortran '67, Fortran 77 '78
- Data exchange between databases
Intro
Plain text
Simple format
Simple rules
- MS-DOS-style lines that end with (CR/LF) characters (optional for the last line)
- An optional header record (there is no sure way to detect whether it is present).
- Each record "should" contain the same number of comma-separated fields.
- Any field may be quoted (with double quotes).
- Fields containing a line-break, double-quote, and/or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).
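The quoting rules above can be checked with a small sketch using Ruby's standard `csv` library. The sample data is made up for illustration: a header, a field containing a comma and an escaped double quote, and a quoted field spanning two lines.

```ruby
require "csv"

# Three records with CR/LF line endings: a header, a row with a quoted
# comma and an escaped double quote, and a row whose quoted field
# contains a line break.
raw = "id,comment\r\n" \
      "1,\"Hello, \"\"world\"\"\"\r\n" \
      "2,\"line one\r\nline two\"\r\n"

rows = CSV.parse(raw, row_sep: "\r\n")
rows.each { |row| p row }
```

The parser returns three records even though the raw text has four physical lines, which is why splitting a CSV file on line breaks before parsing is unsafe.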
Usage
Import example
(everything except drag and drop)
CSV
0101000020E610000000000000008049C000000000000038C0,1083
"alien",2014-11-04 15:24:40.43413+00
Category 1,
"jumpjump up!", {""value"":""es""}
1. WKB, int
2. String, date (ISO)
3. String, empty string? NULL?
4. String with line breaks, CSV
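The "empty string? NULL?" question from row 3 depends on the parser. As one illustration, Ruby's standard `csv` library, with its default options, treats an unquoted empty field and a quoted empty field differently. Other importers may choose differently.

```ruby
require "csv"

# An unquoted empty field and a quoted empty field are different tokens.
unquoted = CSV.parse_line('Category 1,')
quoted   = CSV.parse_line('Category 1,""')

p unquoted # the trailing field is read as nil
p quoted   # the trailing field is read as an empty string
```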
WKT: Well-Known Text
POINT (30 10)
LINESTRING (30 10, 10 30, 40 40)
POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
MULTIPOINT ((10 40), (40 30), (20 20), (30 10))
MULTIPOLYGON (((30 20, 45 40, 10 40, 30 20)),((15 5, 40 10, 10 20, 5 10, 15 5)))
https://en.wikipedia.org/wiki/Well-known_text
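To show how little structure the simplest WKT forms have, here is a minimal sketch of a reader for `POINT` and `LINESTRING` only. `parse_simple_wkt` is a hypothetical helper written for this example; a real import would rely on a full parser such as the one in PostGIS or GDAL.

```ruby
# Hypothetical helper: understands only 2D POINT and LINESTRING.
def parse_simple_wkt(text)
  match = text.strip.match(/\A(POINT|LINESTRING)\s*\((.*)\)\z/i)
  raise ArgumentError, "unsupported WKT: #{text}" unless match

  # "30 10, 10 30" -> [[30.0, 10.0], [10.0, 30.0]]
  coords = match[2].split(",").map { |pair| pair.split.map { |n| Float(n) } }
  type = match[1].upcase
  { type: type, coordinates: type == "POINT" ? coords.first : coords }
end

p parse_simple_wkt("POINT (30 10)")
p parse_simple_wkt("LINESTRING (30 10, 10 30, 40 40)")
```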
WKB: Well-Known Binary
POINT(2.0 4.0) = 000000000140000000000000004010000000000000
https://en.wikipedia.org/wiki/Well-known_text#Well-known_binary
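The hex string above can be decoded by hand: one byte for byte order, four bytes for the geometry type, then two 8-byte doubles. The sketch below does that for 2D points only. `decode_wkb_point` is a hypothetical helper, and the SRID handling assumes the PostGIS EWKB extension (flag `0x20000000` on the type, followed by a 4-byte SRID), which is the form used in the first row of the CSV sample earlier.

```ruby
# Hypothetical helper: decodes a hex-encoded WKB or EWKB 2D point only.
EWKB_SRID_FLAG = 0x20000000

def decode_wkb_point(hex)
  bytes = [hex].pack("H*")
  # Byte order marker: 0 = big-endian, 1 = little-endian.
  little_endian = bytes.unpack1("C") == 1
  uint32 = little_endian ? "V" : "N"
  double = little_endian ? "E" : "G"

  type = bytes.byteslice(1, 4).unpack1(uint32)
  offset = 5
  srid = nil
  if (type & EWKB_SRID_FLAG) != 0
    # EWKB: a 4-byte SRID follows the type.
    srid = bytes.byteslice(offset, 4).unpack1(uint32)
    offset += 4
    type &= ~EWKB_SRID_FLAG
  end
  raise ArgumentError, "not a point (type #{type})" unless type == 1

  x, y = bytes.byteslice(offset, 16).unpack("#{double}2")
  { srid: srid, x: x, y: y }
end

p decode_wkb_point("000000000140000000000000004010000000000000")
```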
GeoJSON
{
  "type": "Feature",
  "geometry": { "type": "Point", "coordinates": [125.6, 10.1] },
  "properties": { "name": "Dinagat Islands" }
}
http://geojson.org/
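Reading such a feature needs nothing beyond a JSON parser. The sketch below uses Ruby's standard `json` library on the same feature. Note that GeoJSON stores coordinates as longitude first, then latitude.

```ruby
require "json"

feature = JSON.parse(<<~GEOJSON)
  {
    "type": "Feature",
    "geometry": { "type": "Point", "coordinates": [125.6, 10.1] },
    "properties": { "name": "Dinagat Islands" }
  }
GEOJSON

# GeoJSON coordinate order is [longitude, latitude].
lon, lat = feature.dig("geometry", "coordinates")
name = feature.dig("properties", "name")
puts "#{name}: lon=#{lon}, lat=#{lat}"
```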
IMPORT ISSUES
Typical
Huge files (>1GB)
Lots of rows (+2M)
Lots of columns (~1600)
XLS/XLSX -> CSV
Typical
Stream HTTP downloaded file
Stream file between servers
Stream data import to DB
IO.copy_stream(src, file)
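As a minimal, self-contained sketch of that call: `IO.copy_stream` copies from a source to a destination without first building the whole payload as one string in application code. Here a `StringIO` stands in for the download stream and a temporary file for the destination. Both are assumptions made for the example, not the deck's actual setup.

```ruby
require "stringio"
require "tempfile"

# Stand-in for an HTTP response body or another server's stream.
src = StringIO.new("id,name\n1,alien\n")

copied = nil
contents = nil
Tempfile.create("import") do |file|
  # Returns the number of bytes copied.
  copied = IO.copy_stream(src, file)
  file.rewind
  contents = file.read
end

puts "copied #{copied} bytes"
```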
Typical
CartoDB-specific
Content guessing (e.g. lat/lon)
Type guessing
Geometry errors fixing
Sync tables -> No downtime allowed
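A rough idea of what content and type guessing can look like, as a sketch only. The helpers `guess_type` and `guess_lat_lon` and the candidate column names are hypothetical and are not CartoDB's actual logic; a production guesser would also sample rows, handle dates and booleans, and validate coordinate ranges.

```ruby
# Hypothetical candidate header names for coordinate columns.
LAT_NAMES = %w[lat latitude].freeze
LON_NAMES = %w[lon lng long longitude].freeze

# Guess a column type from its string values, ignoring blanks.
def guess_type(values)
  present = values.compact.reject(&:empty?)
  return :string if present.empty?
  return :integer if present.all? { |v| v.match?(/\A[+-]?\d+\z/) }
  return :float if present.all? { |v| v.match?(/\A[+-]?\d+(\.\d+)?\z/) }

  :string
end

# Return the [lat, lon] header pair if both can be found, else nil.
def guess_lat_lon(headers)
  lat = headers.find { |h| LAT_NAMES.include?(h.downcase) }
  lon = headers.find { |h| LON_NAMES.include?(h.downcase) }
  lat && lon ? [lat, lon] : nil
end

p guess_type(%w[1083 42])
p guess_lat_lon(%w[name Lat LNG])
```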
DB-Specific
Leave DB indexes as last step
Prefer big INSERT to multiple UPDATE
GDAL's ogr2ogr > Ruby/Python scripts
http://www.gdal.org/ogr2ogr.html
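The first two tips can be sketched as an ordering of SQL statements: create the table, load everything in one multi-row INSERT, and build the index only at the end. `bulk_load_statements`, the two-column schema, and the index name are hypothetical and chosen for illustration. Real code would use parameterized statements or the database's bulk loader rather than string building.

```ruby
# Hypothetical helper: returns SQL statements in load order.
# Table and column names are assumed to be trusted identifiers.
def bulk_load_statements(table, rows)
  values = rows.map do |id, name|
    # Integer() rejects non-numeric ids; single quotes are doubled.
    "(#{Integer(id)}, '#{name.gsub("'", "''")}')"
  end

  [
    "CREATE TABLE #{table} (id integer, name text);",
    "INSERT INTO #{table} (id, name) VALUES #{values.join(', ')};",
    # Index creation comes last, after all rows are loaded.
    "CREATE INDEX #{table}_id_idx ON #{table} (id);"
  ]
end

puts bulk_load_statements("places", [[1, "alien"], [2, "O'Hare"]])
```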
Questions?
Thanks!
@Kartones
11/26/15