oscar engineering part 1

Post on 20-Aug-2015








Our objective seemed simple at first: to create a unified and up-to-date view of incoming data for all users, without misplacing a single bit.

It started small: just a few vendors to connect, maybe 5 important feeds.

5 became 10, then grew to 50.

In the beginning: there were data streams

Formats became more varied, as did the properties of the feeds themselves.

Achieving the goal was going to require thought, but first, let’s look at the data.


ASCII blobs: schemas define character ranges and data types.

Example: transaction log from a COBOL DB, 254 tables in a single feed.

Aggressive parsing required! Often, data types and NULL constraints are the only obvious clues indicating data integrity.

Fixed-Width: ASCII byte ranges


00226345user.ix1 00000074 2013102112110400CHANGED 00000074IT8208IV*Z IT8208 IT8208 INFO| 100000.00 100000.00 100000.00 A20050101INITLD20120720IT8208 215 91515 91414 11114 7 0 9 814 7 1 0111414 0 010 7 0 51514 01415 41515 01415 0 714 0 01015 11114 51514 41515 0151414 0 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3SASSSASSSASSNORM 0225DEMO AHAUSSLER@SS-HEALTHCARE.COM 18991231 NORMNORM A00819870000271400000002C0002589 C00272560000050000000087
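As a minimal sketch of schema-driven fixed-width parsing: each field is a character range plus a type converter and a NULL constraint, and any violation halts the parse. The field names, offsets, and the record below are invented for illustration; they are not the actual layout of the sample line above.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Any, Callable


@dataclass(frozen=True)
class Field:
    name: str
    start: int                      # inclusive character offset
    end: int                        # exclusive character offset
    convert: Callable[[str], Any]   # type converter, raises on bad input
    nullable: bool = False


def parse_record(line: str, schema: list[Field]) -> dict[str, Any]:
    """Slice one fixed-width line by schema; raise on any type or NULL violation."""
    record: dict[str, Any] = {}
    for f in schema:
        raw = line[f.start:f.end]
        if len(raw) < f.end - f.start:
            raise ValueError(f"{f.name}: line too short")
        text = raw.strip()
        if not text:
            if not f.nullable:
                raise ValueError(f"{f.name}: NULL not allowed")
            record[f.name] = None
            continue
        try:
            record[f.name] = f.convert(text)
        except Exception as exc:
            raise ValueError(f"{f.name}: cannot convert {raw!r}") from exc
    return record


# Hypothetical layout, not the real schema of the feed shown above.
SCHEMA = [
    Field("record_id", 0, 8, int),
    Field("table", 8, 20, str),
    Field("changed_at", 20, 34, lambda s: datetime.strptime(s, "%Y%m%d%H%M%S")),
    Field("amount", 34, 46, Decimal, nullable=True),
]

# Synthetic record built to match the hypothetical layout.
line = "00226345" + "user.ix1".ljust(12) + "20131021121104" + "100000.00".rjust(12)
row = parse_record(line, SCHEMA)
```

Raising on the first bad field, rather than logging and continuing, is what lets types and NULL constraints act as integrity checks.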

Since 1989 (v2), XML as of 2005 (v3)

Targets clinical and administrative data interchange among hospitals

Core reference model has been called an incoherent standard (Smith & Ceusters, 2006)

HL7: Health Level Seven


<POLB_IN224200 ITSVersion="XML_1.0" xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <id root="2.16.840.1.113883.19.1122.7" extension="CNTRL-3456"/> <creationTime value="200202150930-0400"/> <versionCode code="2006-05"/> <interactionId root="2.16.840.1.113883.1.6" extension="POLB_IN224200"/> <processingCode code="P"/> <processingModeCode nullFlavor="OTH"/> <acceptAckCode code="ER"/> <receiver typeCode="RCV"> <device classCode="DEV" determinerCode="INSTANCE"> <id extension="GHH LAB" root="2.16.840.1.113883.19.1122.1"/> <asLocatedEntity classCode="LOCE"> … </POLB_IN224200>
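A small sketch of reading header fields from an HL7 v3 message with Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a trimmed subset of the fragment above, with closing tags added only so that it is well-formed; the function name and the choice of fields are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# HL7 v3 elements live in this default namespace (taken from the fragment above).
HL7_NS = {"hl7": "urn:hl7-org:v3"}

# Trimmed, well-formed subset of the fragment; not a complete message.
SAMPLE = """<POLB_IN224200 ITSVersion="XML_1.0" xmlns="urn:hl7-org:v3">
  <id root="2.16.840.1.113883.19.1122.7" extension="CNTRL-3456"/>
  <creationTime value="200202150930-0400"/>
  <interactionId root="2.16.840.1.113883.1.6" extension="POLB_IN224200"/>
  <receiver typeCode="RCV">
    <device classCode="DEV" determinerCode="INSTANCE">
      <id extension="GHH LAB" root="2.16.840.1.113883.19.1122.1"/>
    </device>
  </receiver>
</POLB_IN224200>"""


def message_header(xml_text: str) -> dict[str, str]:
    """Extract a few header attributes; raise if any expected node is absent."""
    root = ET.fromstring(xml_text)

    def attr(path: str, name: str) -> str:
        node = root.find(path, HL7_NS)
        if node is None or name not in node.attrib:
            raise ValueError(f"missing {path}@{name}")
        return node.attrib[name]

    return {
        "control_id": attr("hl7:id", "extension"),
        "created": attr("hl7:creationTime", "value"),
        "interaction": attr("hl7:interactionId", "extension"),
        "receiver": attr("hl7:receiver/hl7:device/hl7:id", "extension"),
    }


header = message_header(SAMPLE)
```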

Since 1971

Elements, segments, loops, and hierarchy. As with XML, the schema is up to the designers. Unlike XML, structure must be derived at parse time.

Context-sensitive grammars! Need a parse stack to understand all EDI document types.

EDI: Electronic Data Interchange


ISA*00* *00* *12*ABCCOM *01*99999999*101127*1719*U*00400*000003438*0*P*~GS*PO*4405197800*999999999*20101127*1719*1421*X*004010VICS~ST*834*0179~BGN*00*1*20050315*110650****~REF*38*SAMPLE_POLICY_NUMBER~DTP*303*D8*20080321~N1*P5*COMPAN_NAME*FI*000000000~INS*Y*18*030*20*A REF*0F*SUBSCRIBER_NUMBER~NM1*IL*1*JOHNDOE*R***34*1*0000000~PER*IP**HP*2138051111~N3*123 SAMPLE RD~N4*CITY*ST*12345~DMG*D8*19690101*F~HD*030~DTP*348*D8*20080101~REF*1L*INDIV_POLICY_NO~SE*16*0179~GE*1*1421~IEA*1*000003438
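To illustrate why a parse stack is needed, here is a rough sketch that splits an interchange into segments and elements and uses a stack to rebuild the envelope nesting (ISA/IEA, GS/GE, ST/SE). It assumes `~` and `*` as separators, as in the sample above, and handles only envelopes. Loops inside a transaction set need per-document rules that this sketch does not attempt. The `SAMPLE` string is synthetic.

```python
# Envelope openers mapped to the trailer segment that must close them.
OPENERS = {"ISA": "IEA", "GS": "GE", "ST": "SE"}
CLOSERS = set(OPENERS.values())


def parse_edi(text, segment_sep="~", element_sep="*"):
    """Return a nested tree of segments; raise on mismatched envelopes."""
    root = {"tag": "ROOT", "elements": [], "children": []}
    stack = [root]  # the parse stack: innermost open envelope is last
    for raw in text.split(segment_sep):
        raw = raw.strip()
        if not raw:
            continue
        elements = raw.split(element_sep)
        tag = elements[0]
        node = {"tag": tag, "elements": elements[1:], "children": []}
        if tag in CLOSERS:
            opener = stack[-1]
            if OPENERS.get(opener["tag"]) != tag:
                raise ValueError(f"unexpected {tag} inside {opener['tag']}")
            opener["trailer"] = node["elements"]
            stack.pop()
        else:
            stack[-1]["children"].append(node)
            if tag in OPENERS:
                stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed {stack[-1]['tag']}")
    return root


# Synthetic, heavily shortened interchange for illustration only.
SAMPLE = (
    "ISA*00*SENDER~GS*PO*1421~ST*834*0179~"
    "REF*38*SAMPLE_POLICY_NUMBER~NM1*IL*1*JOHNDOE~"
    "SE*4*0179~GE*1*1421~IEA*1*000003438~"
)
tree = parse_edi(SAMPLE)
```

The same segment tag can mean different things depending on which envelope or loop is open, which is why the tree cannot be recovered from a flat list of segments alone.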

Get a feed, write a parser, model it, store it, try to do it right, then deploy it. 50 times! In 50 ways?

Easy, right? Sure, as long as we stay organized


Every feed has its own characteristics.

Every project seems to want its own solution.

It’s so easy at first just to implement things. Then comes The Mess.

Complexity is the Enemy! Insurance is hard. Let’s simplify it.

Can we identify common properties needed by our systems, and guarantee that those properties are satisfied in reaching a fundamental goal?


Control latency. Integrate data as quickly as possible.

And last but far from least:

Maintain privacy. Our jobs often handle user data, and they must treat private data with great care.

To keep our data correct, we need to ensure some fundamental system properties


Respect order. You process things out of order, you corrupt your data.

Break on failure. Never write bad data or continue in an abnormal state.

Be idempotent. Mid-process or mid-transaction errors are to be expected.
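One common way to get idempotence, sketched here with the standard `sqlite3` module: record each applied transaction ID in the same database transaction as the write itself, so replaying after a mid-process failure is a no-op. The table names, the `apply_txn` function, and the balance example are hypothetical and are not taken from oscaretl.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the hypothetical data table and the applied-transactions ledger."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS balances ("
        "account TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    db.execute("CREATE TABLE IF NOT EXISTS applied (txn_id INTEGER PRIMARY KEY)")
    db.commit()
    return db


def apply_txn(db, txn_id, account, delta):
    """Apply one transaction at most once.

    Returns True if it was applied, False if it had already been applied.
    Any exception rolls back both the data change and the ledger entry.
    """
    with db:  # commits on success, rolls back if an exception escapes
        seen = db.execute(
            "SELECT 1 FROM applied WHERE txn_id = ?", (txn_id,)
        ).fetchone()
        if seen:
            return False
        db.execute(
            "INSERT OR IGNORE INTO balances(account, amount) VALUES (?, 0)",
            (account,),
        )
        db.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (delta, account),
        )
        db.execute("INSERT INTO applied(txn_id) VALUES (?)", (txn_id,))
        return True


db = open_store()
first = apply_txn(db, 1, "acct-1", 100)
replay = apply_txn(db, 1, "acct-1", 100)  # same ID again: skipped
balance = db.execute(
    "SELECT amount FROM balances WHERE account = ?", ("acct-1",)
).fetchone()[0]
```

Because the ledger row and the data change commit together, a crash between them cannot leave a half-applied transaction behind.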

Judgment is required here, in abundance. These goals are more nebulous but of critical importance to a growing engineering team.

Let’s not forget some important higher-order properties


Implement consistently. Avoid The Mess. Consolidate best practices.

Deploy consistently. Every new deploy style carries a constant increase in operational complexity.

Parse and model each feed as a custom step, but let the framework handle common properties.

oscaretl: a framework for transactional safety


Data streams whose records are not independent (often the case) benefit from being handled as transaction logs.

Monotonically increasing transaction IDs are very useful. Try to derive them if they don’t exist naturally in the data.
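As one hedged example of deriving IDs when a feed has none: if a vendor delivers one file per day, the pair (file date, line number) can be packed into a single integer that increases with delivery order. The daily-file assumption, the `LINES_PER_FILE` bound, and both function names are illustrative, not part of the original design.

```python
from datetime import date

# Assumed upper bound on records in a single daily file.
LINES_PER_FILE = 10**9


def derive_txn_id(file_date: date, line_no: int) -> int:
    """Pack (file date, line number) into one integer that sorts in feed order."""
    if not 0 <= line_no < LINES_PER_FILE:
        raise ValueError(f"line_no out of range: {line_no}")
    return file_date.toordinal() * LINES_PER_FILE + line_no


def in_order(ids):
    """Yield IDs unchanged, raising as soon as one fails to increase."""
    previous = None
    for txn_id in ids:
        if previous is not None and txn_id <= previous:
            raise ValueError(f"out-of-order transaction: {txn_id} after {previous}")
        previous = txn_id
        yield txn_id


ids = [
    derive_txn_id(date(2015, 8, 19), 0),
    derive_txn_id(date(2015, 8, 19), 1),
    derive_txn_id(date(2015, 8, 20), 0),
]
checked = list(in_order(ids))
```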

Parsers and schemas are custom, but data formatting, safe writes, and safe execution can be factored out.

Common schema types can be re-used.

Factoring out the common elements


Good start, but we need more: at this point, a good runtime model and a job scheduler.

What have we achieved?


Strict ordering. Processing will halt on missing data.

Idempotence. Careful state binding allows us to resume where we left off.

Break on failure. Processing will halt on error.
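The three properties can be sketched together in a small runner. It assumes batches carry dense sequence numbers (1, 2, 3, ...) and keeps its state in a JSON checkpoint file; `run`, `MissingData`, and the checkpoint format are hypothetical names for illustration, not the oscaretl API.

```python
import json
import os
import tempfile


class MissingData(Exception):
    """Raised when the next expected sequence number is absent."""


def load_checkpoint(path):
    try:
        with open(path) as fh:
            return json.load(fh)["last_seq"]
    except FileNotFoundError:
        return 0  # nothing processed yet


def save_checkpoint(path, seq):
    # Write to a temp file and rename, so a crash never leaves a torn checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"last_seq": seq}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def run(batches, handle, checkpoint_path):
    """Process (seq, payload) pairs in order, resuming after the checkpoint."""
    last = load_checkpoint(checkpoint_path)
    for seq, payload in batches:
        if seq <= last:
            continue  # already processed in an earlier run
        if seq != last + 1:
            raise MissingData(f"expected batch {last + 1}, got {seq}")
        handle(payload)  # any exception propagates and halts the run
        save_checkpoint(checkpoint_path, seq)
        last = seq
    return last


processed = []
with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.json")
    run([(1, "a"), (2, "b")], processed.append, checkpoint)
    # A later run sees the same batches again plus one new one.
    run([(1, "a"), (2, "b"), (3, "c")], processed.append, checkpoint)
```

The handler call and the checkpoint write are not atomic in this sketch, so a crash between them replays one batch on restart; the handler therefore has to be safe to repeat.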
