Virginia’s Longitudinal Data System: A Federated Approach to Longitudinal Data (April 4th, 2011)



Upload: isaac-gilmore

Post on 17-Jan-2018


DESCRIPTION

The Challenge: To develop a Statewide Longitudinal Data System (SLDS) that, without violating privacy policies or law, lets users query, link, download and create reports from record-level or aggregate data across one or more agencies. Because of existing Commonwealth law, the SLDS could not be based on an underlying data warehouse. De-identified data may be merged when a viable reason exists. However, the use of persistent, de-identified, linked (merged) data was determined to be highly inefficient and raised political issues which could have endangered the project. December 13, 2010

TRANSCRIPT

Slide 1: Virginia's Longitudinal Data System: A Federated Approach to Longitudinal Data
April 4th, 2011

Slide 2: Agenda
- The Challenge
- Virginia's Approach
- Best Practice and SME Findings
- Design Considerations
- Proposed Solution
- Summary

Slide 3: The Challenge
- To develop a Statewide Longitudinal Data System (SLDS) that, without violating privacy policies or law, lets users query, link, download and create reports from record-level or aggregate data across one or more agencies.
- Because of existing Commonwealth law, the SLDS could not be based on an underlying data warehouse.
- De-identified data may be merged when a viable reason exists. However, the use of persistent, de-identified, linked (merged) data was determined to be highly inefficient and raised political issues which could have endangered the project.
(Slide footer date: December 13, 2010)

Slide 4: Virginia's Approach
- Virginia undertook a comprehensive investigation of best practices and consulted subject matter experts to determine the feasibility of a federated data model.
- Between October and December 2010, the Center for Innovative Technology (CIT), the Virginia Information Technologies Agency (VITA) and the Department of Education (DOE) interviewed six best practice organizations and ten subject matter experts.
- Those findings led to an SLDS technical architecture that fulfilled the objective of the grant while adhering to the Commonwealth's privacy constraints.
Slide 5: Significant Findings
Best Practice Interviews:
- Stakeholder Management
- Data Governance
- Leveraging Existing Systems
- Requirements Drive System Architecture
Subject Matter Expert Interviews:
- Federated Systems Perform Poorly
- Use of Commercial Solutions
- Use of Multiple Hash Keys
- Clearly Defined Security Policies

Slide 6: Important Design Considerations
- User friendly
- Maximize use of existing technologies/solutions
- Minimize sustainment costs
- Record-level data queries were not time sensitive
- Strong central security model

Slide 7: The Solution
- A federated data model and technical architecture comprising a web-based user interface (UI), a query/linking engine, a multi-level security module, a rich business intelligence (BI) capability, a Lexicon and an integrated workflow.

Slide 8: Data

Slide 9: Conceptual Portal

Slide 10: Portal Components
- Shaker
  - Distributed Query Engine (DQE)
  - For use by agency employees and named users
- Reports
  - Public Facing: Aggregated Data
  - Named Users: Query Building Tool (QBT)
- Lexicon
- Workflow
  - Account request
  - Data request

Slide 11: Portal Features (Public Facing)
- Aggregated data reports
- Lexicon
- Links to agency reports
- Help files
- FAQs
- Request for a named user account

Slide 12: Portal Features (Named Users)
- Help / Training
- Reports: non-suppressed aggregated data
- Query Building Tool (QBT)
- Lexicon
- Workflow
  - Account and data requests
  - Data retrieval
  - File attachment for uploading NDAs, etc.
  - Ability to check status of, modify or cancel an account and/or data request
- Password reset

Slide 13: Data

Slide 14: Security Overview
Diagram labels:
- Access categories: Aggregated Data (Suppressed); Aggregated Data (Non-Suppressed); Unit Record Level Data; Account Management; Portal Components
- User types: Anonymous; Named; Schools; Researchers; Agency Employees; System Admin

Slide 15: Security
Diagram labels:
- Authentication
- Authorization
- Role-based permission: Viewing; Editing
- Suppressed Data; Non-Suppressed Data
- Database; Table; Column

Slide 16: Data

Slide 17: Workflow

Slide 18: Data

Slide 19: Reporting: Record Level Linked Data
Diagram labels: Report Creation (Ad Hoc interface; notes 1, 2); Lexicon; Shell Database (shell database notes 1, 2); Ad Hoc Metadata; Approval; Query Results (notes 5, 6); Source Data (DOE, SCHEV, VEC)
Shell Database notes:
1. Instantiates the information contained in the Lexicon.
2. Contains dummy data.
Notes:
1. The report link will display the report with dummy data.
2. The report will have a button that allows submission of the report to workflow.
3. The distributed query engine generates queries to each of the source data systems and joins the result sets.
4. The engine will interact with the Lexicon.
5. Options for report display include a Logi Analysis Grid (depending on the number of records returned) or a link to download a file.
6. Access may be provided through the Ad Hoc report portal.
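
Note 3 on slide 19 (the distributed query engine sends a query to each source system and joins the result sets) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: in-memory SQLite databases stand in for the agency source systems, and the table names, column names and the shared `hashed_id` join column are hypothetical. The slides do not specify how the engine is implemented.

```python
import sqlite3


def make_source(rows, schema, table):
    """Build an in-memory SQLite database standing in for one agency's source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({schema})")
    placeholders = ", ".join("?" for _ in rows[0])
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return conn


def federated_join(sub_queries):
    """Run one sub-query per source, then inner-join the result sets on 'hashed_id'.

    sub_queries is a list of (connection, sql) pairs. Each sql statement must
    return a 'hashed_id' column (a hypothetical name for the shared link key).
    """
    joined = None
    for conn, sql in sub_queries:
        cur = conn.execute(sql)
        cols = [d[0] for d in cur.description]
        result = {}
        for row in cur:
            rec = dict(zip(cols, row))
            result[rec["hashed_id"]] = rec
        if joined is None:
            joined = result
        else:
            # Keep only IDs present in every result set so far (inner join).
            joined = {k: {**joined[k], **result[k]}
                      for k in joined.keys() & result.keys()}
    if joined is None:
        return []
    return sorted(joined.values(), key=lambda r: r["hashed_id"])


# Hypothetical sample data for two sources.
doe = make_source([("h1", 2009), ("h2", 2010), ("h3", 2010)],
                  "hashed_id TEXT, grad_year INTEGER", "graduates")
schev = make_source([("h2", "BS"), ("h3", "BA"), ("h4", "BS")],
                    "hashed_id TEXT, degree TEXT", "enrollments")

linked = federated_join([
    (doe, "SELECT hashed_id, grad_year FROM graduates WHERE grad_year = 2010"),
    (schev, "SELECT hashed_id, degree FROM enrollments"),
])
for record in linked:
    print(record)
```

The join happens in the engine rather than in any one source, which matches the constraint that no underlying warehouse holds the merged data.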
Slide 19 (continued), diagram labels: Results; Shaker (notes 3, 4)

Slide 20: Reporting: Aggregate Linked Data
Diagram labels: Aggregate Linked Data (aggregate data note 3); Source Data (DOE, SCHEV, VEC); Prebuilt Reports (notes 1, 2); User; ETL (notes 1, 2); HTTP; Record Level Linked Data; Direct DB Connection; SLDS Portal; Portal (note 1); Public Reports
Prebuilt Reports notes:
1. There will be prebuilt reports for linked data from the different sources (e.g., DOE to SCHEV, SCHEV to VEC).
2. The prebuilt reports may provide the user with some capabilities to perform analysis on the data (e.g., crosstabbing, grouping, filtering, etc.).
ETL and aggregate data notes:
1. The ETL process will periodically pull source data and load aggregate data tables.
2. The tool used for the ETL process may be SSIS or LogiETL.
3. Data access is through stored procedures, which will handle data suppression.
Portal note:
1. Prebuilt reports will be displayed within iFrames in the portal.

Slide 21: Data

Slide 22: Lexicon
- Defined Transformations & Matching Algorithms

Slide 23: Lexicon Maintenance
To maintain accuracy and manage extensibility, the linking module will process all data sources periodically, at a predetermined time/interval, looking for:
- Changes in data ranges (e.g., a new code was added for race/ethnicity)
- New fields (more data, more data, more data!)
- Anything else that would disrupt the probabilistic matching or provide more ways to slice and dice the data
Anomalies found by the linking module will prompt an alert for a system administrator to modify the matching algorithm or add query choices.
For new sources, or those with known common fields/links, this would be the method of entry.

Slide 24: Shaker
Diagram labels: Data; Lexicon; Shaker Process; DS 1, DS 2, DS 3; Linking Control; Data Access Control; User Interface / Portal / LogiXML; Sub-Query Optimization; Hashed ID Matrix; Authorized Query; Query Results; Workflow Manager; Sample Data; Shell Database; Query Building Process (Pre-Authorization)
- Linking uses common IDs (deterministic) or common elements with appropriate transforms, matching algorithms and thresholds (probabilistic).
- A linking engine process will update the Lexicon periodically to allow query building on known, available matched data fields. No data is used in this process.
- Queries are built on the relationships between data fields in the Lexicon.

Slide 26: Matched Hash ID Values
- The SLDS server will match records from different agencies using the Hash ID.
- After records are matched, the SLDS server will delete the Hash ID values and replace them with randomly generated unique IDs.
(Slide footer date: January 30, 2016)

Slide (number not recovered): Joining Sub-Queries on Hashed-IDs
- Possible connection using a web service: create a Web Services Data Source (Oracle), which enables application and data integration by turning an external web service into an SQL data source, making external web services appear as regular SQL tables. This table function represents the output of calling external web services and can be used in an SQL query.
- Possible connection using a homogeneous link between Oracle DBs: establish synonyms for the global names of remote objects in the distributed system so that the Shaker can access them with the same syntax as local objects.
- Possible connection using a heterogeneous link: use an available Transparent Gateway or Generic ODBC/OLE.
- Sub-query processing priority will be determined for each query to minimize unnecessary data transfer (e.g., not downloading unmatched records unless specifically requested) and to optimize join performance; see Sub-Query Process Optimization.
Diagram labels: Additional Data Sources

Slide 27: Sub-Query Process Optimization
Process flow labels:
- Create Hash-Key
- Agency creates Hash-IDs
- Query
- Derive JOIN criteria from the Lexicon: common IDs (deterministic) or common elements with appropriate transforms, matching algorithm and thresholds (probabilistic)
- Parse sub-queries
- Get COUNTS from each DS web service for each set of limiting criteria
- Run 1st sub-query
- Run 2nd sub-query
- Join sub-queries on hashed ID
- Query Results
Callouts:
- The 1st DS to query is the DS with the least count using the specified criteria. Query the 1st DS using today's key; it returns a set with hashed IDs.
- The 2nd DS to query is the DS with the next-least count using the specified criteria (if inner join). Query the 2nd DS using today's key AND the hashed-ID list from the 1st DS.
Diagram labels: DS 1; DS 2; DS 3; Lexicon

Slide 28: Data

Slide 29: Data Architecture
Diagram labels: DS 1; Lexicon; SPs (note 3); Aggregate Linked Data
Notes:
1. Contains DBs for Shaker, Ad Hoc metadata, logging, auditing, etc.
2. Database for the Shaker process, which temporarily stores linked record-level data. The temporary tables will be dropped after a set period of time.
3. For canned reports, stored procedures will be used for data querying and suppression.
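
The hash-ID matching on slide 26 and the sub-query ordering on slide 27 can be sketched together. This is a minimal illustration under stated assumptions, not the Shaker's actual design: the slides mention a hash key and "today's key" but do not name a hash algorithm, so HMAC-SHA256 is an assumption, and `AgencySource`, `hash_id`, `linked_inner_join` and the `link_id` field are hypothetical names.

```python
import hashlib
import hmac
import uuid


def hash_id(identifier: str, daily_key: bytes) -> str:
    """Keyed hash an agency applies to its own identifier (HMAC-SHA256 is an assumption)."""
    return hmac.new(daily_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


class AgencySource:
    """Hypothetical stand-in for one agency's data source.

    Raw identifiers stay inside this class; only hashed IDs are returned.
    """

    def __init__(self, name, records):
        self.name = name
        self._records = records  # list of dicts, each with an "id" field

    def count(self, predicate):
        """Number of records matching the limiting criteria."""
        return sum(1 for r in self._records if predicate(r))

    def query(self, predicate, daily_key, restrict_to=None):
        """Return {hashed_id: fields}, optionally limited to already-matched hashed IDs."""
        out = {}
        for r in self._records:
            if not predicate(r):
                continue
            h = hash_id(r["id"], daily_key)
            if restrict_to is not None and h not in restrict_to:
                continue
            out[h] = {k: v for k, v in r.items() if k != "id"}
        return out


def linked_inner_join(sources_with_criteria, daily_key):
    """Query the smallest source first, pass its hashed IDs on, then drop the hashes."""
    # Order sources by how many records satisfy their criteria (least count first).
    ordered = sorted(sources_with_criteria, key=lambda sc: sc[0].count(sc[1]))
    matched = None
    for source, predicate in ordered:
        restrict = None if matched is None else set(matched)
        rows = source.query(predicate, daily_key, restrict_to=restrict)
        if matched is None:
            matched = rows
        else:
            # rows only contains IDs already matched, so this is an inner join.
            matched = {h: {**matched[h], **rows[h]} for h in rows}
    # Replace hash IDs with randomly generated unique IDs before returning results.
    return [{"link_id": uuid.uuid4().hex, **rec} for rec in (matched or {}).values()]
```

Querying the smallest result set first means later sources only receive, and only return, the hashed IDs that can still match, which is the data-transfer saving the slides describe for inner joins.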
Slide 29 (continued), diagram labels: Shaker / De-identified Record Level Data (note 2); VITA (CESC); Aggregate Linked Reports; Record Level Query / Reports; Lexicon UI / Admin; ETL (note 1); Metadata and Security (note 1); Shell DB; Workflow; DS 2; DS 3; SLDS Portal

Slide 30: Physical Infrastructure

Slide 31: Physical Infrastructure
- Shaker Production Env. (CESC)

Slide 32: SLDS Components Matrix
Each entry gives the component, then Custom / COTS, then the suggested product where one is listed.
- Portal: Custom
- Security: Custom
  - Authentication: COV AUTH
  - Authorization: Mixed
- Workflow: COTS; MS Dynamics
- Reports
  - Public Facing: COTS; Logi Info
  - Query Building: COTS; Logi Ad-Hoc
- Lexicon: Custom
- Shaker
  - Extract, Transform & Load: COTS; Logi ETL, SSIS or Informatica
  - Distributed Query Engine (DQE): Custom or COTS; Syncsort, Informatica or Custom

Questions?

Slide 34: Back-Up Slides

Slide 35: Security
- Authentication: COV AUTH
- Authorization: role based
  - Roles: Anonymous User; Named User; System Administrator; Agency Employee; Researcher
  - Permissions: Workflow; Reports (Suppressed and Non-Suppressed); Query Building Tool; Lexicon; Data elements; User Account Management
- Data security enforced by/at:
  - Portal
  - Lexicon: Viewing; Editing
  - Reports: Suppressed Data; Non-Suppressed Data
  - Workflow
  - Data: Database; Table; Column
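
The role-based split between suppressed and non-suppressed aggregated data on the security slides can be sketched as follows. The role names come from slide 35, but the rest is assumed for illustration: the slides give no suppression threshold (10 is a placeholder), they do not state which roles see non-suppressed data beyond named users, and `aggregate_counts` is a hypothetical function name.

```python
from collections import Counter
from enum import Enum


class Role(Enum):
    ANONYMOUS = "anonymous user"
    NAMED_USER = "named user"
    RESEARCHER = "researcher"
    AGENCY_EMPLOYEE = "agency employee"
    SYSTEM_ADMIN = "system administrator"


# Assumption: the slides do not give a threshold; 10 is a placeholder value.
SUPPRESSION_THRESHOLD = 10

# Assumption: every role except the anonymous user may view non-suppressed aggregates.
NON_SUPPRESSED_ROLES = frozenset(Role) - {Role.ANONYMOUS}


def aggregate_counts(rows, group_by, role):
    """Count rows per group, hiding small cells from roles limited to suppressed data."""
    counts = Counter(row[group_by] for row in rows)
    if role in NON_SUPPRESSED_ROLES:
        return dict(counts)
    # Small cells are replaced with None so they cannot identify individuals.
    return {group: (n if n >= SUPPRESSION_THRESHOLD else None)
            for group, n in counts.items()}
```

For example, with twelve rows in one group and three in another, an anonymous user would see the first count and a suppressed (None) cell for the second, while a named user would see both counts.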