slc ingestion presentation-boston_sep2012

Contains Company Confidential Material – Do Not Disclose Ingestion 101 Presenter: Oleg Krook September 29-30, 2012 Boston, MA

TRANSCRIPT

Page 1: Slc ingestion presentation-boston_sep2012

Contains Company Confidential Material – Do Not Disclose

Ingestion 101

Presenter: Oleg Krook

September 29-30, 2012

Boston, MA

Page 2: Slc ingestion presentation-boston_sep2012

Ingestion Pipeline Overview

Landing Zone provides an entry point for data

Input data is defined in Ed-Fi format, found at http://www.ed-fi.org/technical-documentation/

Two input methods are supported:
• XML files followed by a control file
• A compressed ZIP file containing the above files
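The second input method can be sketched in a few lines of Python. This is only an illustration, not part of the SLC tooling: the function name `build_landing_zone_zip`, the flat archive layout, and the example file names (including the `.ctl` extension) are assumptions made for the example.

```python
import zipfile
from pathlib import Path


def build_landing_zone_zip(archive: Path, data_files: list[Path], control_file: Path) -> None:
    """Bundle the XML data files and their control file into one ZIP archive.

    Assumption: entries are stored flat (no sub-directories) so that the file
    names listed in the control file match the archive entry names.
    """
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        for path in data_files:
            bundle.write(path, arcname=path.name)
        bundle.write(control_file, arcname=control_file.name)
```

A call such as `build_landing_zone_zip(Path("job.zip"), [Path("data.xml")], Path("job.ctl"))` would produce a single archive to place in the landing zone.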

Page 3: Slc ingestion presentation-boston_sep2012

Control File Format

The control file is used solely to define the set of inbound data files and to perform basic integrity checking on those files. It contains one row of comma-separated values for each data file. Leading and trailing spaces are considered part of the values and are not trimmed. The last value in a row must not be followed by a comma.

The row format is:

<file format>,<file type>,<file name>,<file checksum>

where:

<file format> Specifies the file format. At this time, edfi-xml is the only supported file format.

<file type> Represents the type of object(s) found in the file. In the case of Ed-Fi XML, the file type maps to the name of the appropriate interchange schema.

Anatomy of an ingestion job: Control files, Ed-Fi

Page 4: Slc ingestion presentation-boston_sep2012

<file name> Specifies the file's name. File names are case sensitive. This field may or may not be enclosed in double quotes. File names containing double quotes and/or commas should be enclosed in double quotes. A double quote appearing inside a field must be escaped by preceding it with another double quote.

<file checksum> Is the file's MD5 checksum. The MD5 checksum is expressed as 32 hexadecimal digits, with alphabetic characters always in lowercase.
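The row rules above (four comma-separated values, quoting only where needed, a doubled double quote as the escape, a lowercase MD5) can be illustrated with a small Python sketch. The helper names `md5_hex` and `control_row` are hypothetical and not taken from the ingestion code; the standard `csv` module is used here because its default quoting behaviour happens to match the rules described.

```python
import csv
import hashlib
import io
from pathlib import Path


def md5_hex(path: Path) -> str:
    """Return the file's MD5 checksum as 32 lowercase hexadecimal digits."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()  # hexdigest() always returns lowercase digits


def control_row(file_format: str, file_type: str, path: Path) -> str:
    """Build one control file row: format, type, file name, checksum."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL quotes a value only when it contains a comma or a double
    # quote, and doubles any embedded double quote. Values are not trimmed.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([file_format, file_type, path.name, md5_hex(path)])
    return buffer.getvalue().rstrip("\n")  # no trailing comma, no newline
```

For a file named `data.xml`, `control_row("edfi-xml", "StudentEnrollment", Path("data.xml"))` yields a row in the `<file format>,<file type>,<file name>,<file checksum>` shape.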

Anatomy of an ingestion job: Control files, Ed-Fi (cont.)

Page 5: Slc ingestion presentation-boston_sep2012

The control file format allows for specification of job-level parameters. These are specified in the control file as line entries preceded with the @ symbol.

The following table describes the parameters that are currently supported in the control file:

@dry-run Indicates that the results of ingestion processing should not be written to the core data store.

@purge Deletes all previously ingested data from this tenant. All other content of the control file is ignored.

A job control file may look as follows:

@dry-run
edfi-xml,StudentEnrollment,data.xml,756a5e96e330082424b83902908b070a
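To make the layout concrete, here is a rough Python sketch of how such a control file could be read. It is not the ingestion service's parser: `ControlFile` and `parse_control_file` are hypothetical names, and skipping blank lines is an assumption the slides do not state.

```python
import csv
import re
from dataclasses import dataclass, field


@dataclass
class ControlFile:
    parameters: list[str] = field(default_factory=list)  # e.g. "dry-run", "purge"
    entries: list[tuple[str, ...]] = field(default_factory=list)  # one 4-tuple per data file


def parse_control_file(text: str) -> ControlFile:
    """Split a control file into @ parameters and data file rows."""
    result = ControlFile()
    for line in text.splitlines():
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("@"):
            result.parameters.append(line[1:])
            continue
        # csv.reader handles quoted file names and doubled double quotes,
        # and keeps leading/trailing spaces as part of each value.
        row = next(csv.reader([line]))
        if len(row) != 4:
            # A trailing comma shows up here as a fifth, empty value.
            raise ValueError(f"expected 4 values, got {len(row)}: {line!r}")
        if not re.fullmatch(r"[0-9a-f]{32}", row[3]):
            raise ValueError(f"checksum is not 32 lowercase hex digits: {row[3]!r}")
        result.entries.append(tuple(row))
    return result
```

Applied to the two-line example above, this returns one parameter (`dry-run`) and one entry for `data.xml`.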

Anatomy of an ingestion job: Control files, Ed-Fi (cont.)

Page 6: Slc ingestion presentation-boston_sep2012

In the course of ingestion, several log files are created and placed in the landing zone. These files capture warnings and errors at the job level (per control file) or at the resource level (per XML file within a job).

Error/Status Logs

job-<jobId>.log

Once for every job

INFO <jobId information>
INFO [file] <resourceId> (<internalschema>)
INFO [file] <resourceId> records considered: <#>
INFO [file] <resourceId> records ingested successfully: <#>
INFO [file] <resourceId> records failed: <#>
INFO [configProperty] <list of config parameters>
INFO <All|#> records process successfully
INFO Processed <#> records

job_warn-<jobId>.log

Job-level (non-resource specific) warnings present

WARN <warning detail>

job_error-<jobId>.log

Job-level (non-resource specific) errors present

ERROR <error detail>

warn.<resourceId>-<jobId>.log

Resource-level warnings present

WARN <warning detail>

error.<resourceId>-<jobId>.log

Resource-level errors present

ERROR <error detail>
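As an illustration of how the per-resource counts in job-<jobId>.log might be collected, the sketch below matches the three "records ..." line templates with a regular expression. It assumes that real log lines follow those templates literally and that a resource id contains no spaces; the function name `summarize_job_log` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the three per-resource count lines from the job log template.
# Assumption: <resourceId> contains no whitespace.
_COUNT_LINE = re.compile(
    r"INFO\s+\[file\]\s+(?P<resource>\S+)\s+records\s+"
    r"(?P<kind>considered|ingested successfully|failed):\s*(?P<count>\d+)"
)


def summarize_job_log(lines):
    """Return {resourceId: {count kind: number}} for the count lines found."""
    summary = defaultdict(dict)
    for line in lines:
        match = _COUNT_LINE.search(line)
        if match:
            summary[match["resource"]][match["kind"]] = int(match["count"])
    return dict(summary)
```

A resource whose "failed" count is non-zero would be the natural place to look for a matching error.<resourceId>-<jobId>.log file.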

Page 7: Slc ingestion presentation-boston_sep2012

The Offline Validation Tool is an open-source tool that checks ingestion files for Ed-Fi format compliance before they are transmitted for ingestion.

This provides an opportunity to check the file format on the spot instead of waiting to transmit and process the file on the SLI side.

The tool checks only structure and XML compliance; it does not check the referential integrity of the data.
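The kind of local, structure-only check described here can be approximated as follows. This is not the Offline Validation Tool itself: it is a minimal sketch that tests XML well-formedness only, using Python's standard library, and it does not validate against the Ed-Fi schemas. The function name `check_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_well_formed(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means the file parsed.

    Only XML well-formedness is checked. Schema compliance and referential
    integrity between records are out of scope for this sketch.
    """
    try:
        ET.parse(path)
    except ET.ParseError as error:
        # ParseError messages include the line and column of the failure.
        return [f"{path.name}: {error}"]
    return []
```

Running a check like this before transmitting a file catches malformed XML on the spot, which is the same motivation given for the offline tool.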

Offline Validation Tool