slc ingestion presentation-boston_sep2012
TRANSCRIPT
Contains Company Confidential Material – Do Not Disclose
Ingestion 101
Presenter: Oleg Krook
September 29-30, 2012
Boston, MA
Ingestion Pipeline Overview
Landing Zone provides an entry point for data
Input data is defined in Ed-Fi format, documented at http://www.ed-fi.org/technical-documentation/
Two input methods are supported:
• XML files followed by a control file
• A compressed ZIP file containing the above files
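The second input method can be sketched in Python as follows. This is only an illustration: the function name build_job_zip and the control file name job.ctl are hypothetical, and each control file row follows the layout described in the Control File Format section (format, type, name, MD5 checksum).

```python
import csv
import hashlib
import io
import zipfile


def build_job_zip(files, control_name="job.ctl"):
    """Package XML files plus a generated control file into one ZIP.

    files maps a file name to a (file type, content bytes) pair.
    control_name is a hypothetical name chosen for this sketch.
    """
    control = io.StringIO()
    # csv.writer quotes names that contain commas or double quotes
    # and doubles any embedded double quote.
    writer = csv.writer(control, lineterminator="\n")
    for name, (file_type, data) in files.items():
        checksum = hashlib.md5(data).hexdigest()  # 32 lowercase hex digits
        writer.writerow(["edfi-xml", file_type, name, checksum])

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, (_, data) in files.items():
            archive.writestr(name, data)
        archive.writestr(control_name, control.getvalue())
    return buffer.getvalue()
```

For example, build_job_zip({"data.xml": ("StudentEnrollment", xml_bytes)}) returns the bytes of an archive holding data.xml and the generated control file.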
Control File Format
The control file is used solely to define the set of inbound data files and to perform basic integrity checking on these files. It contains a row of comma-separated values for each data file. Leading/trailing spaces are considered part of the values and will not be trimmed. The last value in any row must not be followed by a comma.
The row format is:
<file format>,<file type>,<file name>,<file checksum>
where:
<file format> Specifies the file format. At this time, edfi-xml is the only supported file format.
<file type> Represents the type of object(s) found in the file. In the case of Ed-Fi XML, the file type maps to the name of the appropriate interchange schema.
Anatomy of an ingestion job Control files, Ed-Fi
<file name> Specifies the file's name. File names are case sensitive. This field may or may not be enclosed in double quotes. File names containing double quotes and/or commas should be enclosed in double quotes. A double quote appearing inside a field must be escaped by preceding it with another double quote.
<file checksum> Is the file's MD5 checksum. The MD5 checksum is expressed as 32 hexadecimal digits with alphabetic characters always in lowercase.
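The row rules above can be checked with a short Python sketch. It is not the ingestion service's own parser; the names ControlRow, parse_row and md5_hex are hypothetical. Python's csv module happens to match the stated quoting rules: it accepts optional double quotes, treats a doubled double quote as an escaped one, and does not trim spaces.

```python
import csv
import hashlib
import re
from typing import NamedTuple


class ControlRow(NamedTuple):
    file_format: str
    file_type: str
    file_name: str
    checksum: str


# 32 hexadecimal digits, alphabetic characters in lowercase only.
MD5_PATTERN = re.compile(r"[0-9a-f]{32}")


def parse_row(line: str) -> ControlRow:
    """Parse one data-file row of a control file and check its shape."""
    fields = next(csv.reader([line]))
    # A trailing comma produces an extra empty value and is rejected here.
    if len(fields) != 4:
        raise ValueError(f"expected 4 values, got {len(fields)}")
    row = ControlRow(*fields)
    # Spaces are part of the value, so " edfi-xml" does not match.
    if row.file_format != "edfi-xml":
        raise ValueError(f"unsupported file format: {row.file_format!r}")
    if not MD5_PATTERN.fullmatch(row.checksum):
        raise ValueError(f"malformed MD5 checksum: {row.checksum!r}")
    return row


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest as 32 lowercase hexadecimal digits."""
    return hashlib.md5(data).hexdigest()
```

Comparing md5_hex(file_bytes) with row.checksum before transmitting a file gives the same basic integrity check that the control file is meant to support.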
Anatomy of an ingestion job Control files, Ed-Fi Cont.
The control file format allows for specification of job-level parameters. These are specified in the control file as line entries preceded with the @ symbol.
The following table describes the parameters that are currently supported in the control file:
@dry-run    Indicates that the results of ingestion processing should not be written to the core data store.
@purge      Deletes all previously ingested data from this tenant. All other content of the control file is ignored.
A job control file may look as follows:
@dry-run
edfi-xml,StudentEnrollment,data.xml,756a5e96e330082424b83902908b070a
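A minimal Python sketch of how a reader of such a file might separate job-level parameters from data-file rows. The names ControlFile and parse_control_file are hypothetical. Skipping blank lines and rejecting unknown @ parameters are assumptions of this sketch, not documented behavior, and returning no rows when @purge is present is one interpretation of "all other content is ignored".

```python
import csv
from dataclasses import dataclass, field
from typing import List

SUPPORTED_PARAMETERS = {"@dry-run", "@purge"}


@dataclass
class ControlFile:
    dry_run: bool = False
    purge: bool = False
    rows: List[List[str]] = field(default_factory=list)


def parse_control_file(text: str) -> ControlFile:
    """Split a control file into job-level parameters and data-file rows."""
    result = ControlFile()
    for line in text.splitlines():
        if not line:
            continue  # assumption: blank lines are skipped
        if line.startswith("@"):
            if line not in SUPPORTED_PARAMETERS:
                # assumption: unknown parameters are treated as errors
                raise ValueError(f"unsupported parameter: {line!r}")
            if line == "@dry-run":
                result.dry_run = True
            else:
                result.purge = True
        else:
            result.rows.append(next(csv.reader([line])))
    if result.purge:
        # With @purge, the rest of the control file is ignored.
        return ControlFile(purge=True)
    return result
```

Applied to the two-line example above, this yields dry_run set to True and a single row of four values.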
Anatomy of an ingestion job Control files, Ed-Fi Cont.
In the course of ingestion, several log files are created and placed in the landing zone. These files capture warnings and errors at the job level (per control file) or at the resource level (per XML file within a job).
Error/Status Logs
job-<jobId>.log
Once for every job
INFO <jobId information>
INFO [file] <resourceId> (<internalschema>)
INFO [file] <resourceId> records considered: <#>
INFO [file] <resourceId> records ingested successfully: <#>
INFO [file] <resourceId> records failed: <#>
INFO [configProperty] <list of config parameters>
INFO <All|#> records process successfully
INFO Processed <#> records
job_warn-<jobId>.log
Job-level (non-resource specific) warnings present
WARN <warning detail>
job_error-<jobId>.log
Job-level (non-resource specific) errors present
ERROR <error detail>
warn.<resourceId>-<jobId>.log
Resource-level warnings present
WARN <warning detail>
error.<resourceId>-<jobId>.log
Resource-level errors present
ERROR <error detail>
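A script that watches the landing zone could sort these files by name alone. The Python sketch below assumes only the naming patterns listed above; the function name classify_log is hypothetical, and it makes no attempt to split <resourceId> from <jobId>, since either may contain hyphens.

```python
from typing import Optional, Tuple

# (prefix, (scope, level)) pairs taken from the file name patterns above.
_PREFIXES = [
    ("job_warn-", ("job", "warn")),
    ("job_error-", ("job", "error")),
    ("job-", ("job", "info")),
    ("warn.", ("resource", "warn")),
    ("error.", ("resource", "error")),
]


def classify_log(file_name: str) -> Optional[Tuple[str, str]]:
    """Return (scope, level) for an ingestion log file name, else None."""
    if not file_name.endswith(".log"):
        return None
    for prefix, kind in _PREFIXES:
        if file_name.startswith(prefix):
            return kind
    return None
```

Any file for which classify_log returns a level of "warn" or "error" signals that the job or one of its resources needs attention.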
The Offline Validation Tool is an open-source tool that checks ingestion files for Ed-Fi format compliance before they are transmitted for ingestion.
This provides an opportunity to check the file format on the spot instead of waiting to transmit and process the file on the SLI side.
The tool checks only structure and XML compliance; it does not check the referential integrity of the data.
Offline Validation Tool
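To illustrate the kind of check that can be done before transmission, here is a rough Python sketch that tests XML well-formedness only. It is not the Offline Validation Tool and does not validate against the Ed-Fi interchange schemas (that would need an XSD validator); the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def is_well_formed(xml_text: str) -> Tuple[bool, str]:
    """Report whether xml_text parses as XML, with the parser's message if not."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        # The message includes the line and column of the first problem.
        return False, str(error)
    return True, ""
```

Like the tool described above, a check of this kind says nothing about referential integrity: a file can parse cleanly and still refer to records that do not exist.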