159497270 data profiling basics

Upload: sarferazulhaque

Post on 02-Jun-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/10/2019 159497270 Data Profiling Basics

    1/4

    Tril lium SoftwareSolut ion Guide

    Key Considerations

    for Selecting aData Pro ling Tool

    1. Who is pro ling :business users, IT,or both

    2. Common enviroment to communicate, review, and interpret results

    3. Complexity of analysis, number of sources

    4. Security of da ta

    5. Ongoing support and monitoring

    What is Data Proling?Data pro ling is a process for ana lyzing large dat a set s. Standa rdda ta pro ling a utomat ically compiles stat istics and ot her summaryinformat ion about t he d at a records. It includes a nalysis by eldfor minimum and maximum values and other basic statistics, fre-q uency count s for elds, da ta type and pat terns/formats, and con-formity to expected values. Other advanced pro ling techniquesalso perform ana lysis about the relat ionships bet ween elds, suchas dependencies betw een elds in a single set and b etw een eldsin separate da ta set s.

    Why Do People Prole?People may w ant to pro le for several reasons, including:

    Assessing risks Can d ata support the new initiative?

    Planning projects What are realistic time lines and whatda ta , systems, and resources will the project involve?

    Scoping projects Which data and systems will be includedba sed on priority, q uality, and level of effort required?

    Assessing data quality How accurate, consistent,and complete is the dat a w ithin a single system?

    Designing new systems What should the ta rget st ructureslook like? What ma ppings or transformations need to o ccur?

    Checking/monitoring data Does the data continue to meetbusiness requirements after systems have gone live and changesand additions occur?

    Data Proling Basics

  • 8/10/2019 159497270 Data Profiling Basics

    2/4

    Who Should Be Proling the Data?Data pro ling is primarily considered pa rt o fIT projects, but the mo st successful effortsinvolve a blend of ITresources and business

    users of the da ta . IT, business users, and dat astewards each contribute valuable insightscritical to the process:

    IT system owners, developers, and projectmanagers analyze and understand issuesof data structure: how complete is thedata, how consistent are the formats, arekey elds uniq ue, is referent ial integ rityenforced?

    Business users and subject matter expertsunderstand the dat a content: wha t the da tameans, how it is applied in existing businessprocesses, what data is required for newprocesses, what data is inaccurate or outof context?

    Data stewards understand corporate sta n-dards and enterprise data requirementsas a whole. They can cont ribute to boththe requirements for speci c projects andthe co rporation.

    How Do People Prole Data?The techniques for pro ling are either manualor automated via a pro ling tool:

    Manual techniques involve people siftingthrough the data to assess its condition,q uery by q uery. Manua l pro ling is ap-propriate for small data sets from a singlesource, with few er than 50 elds, where theda ta is relat ively simple.

    Automated techniques use software toolsto collect summary statistics and analy-ses. These tools are the most appropriat efor projects with hundreds of thousands

    of records, many elds, multiple sources, and questionabledocumenta tion and metad ata . Sophisticat ed da ta pro lingtechnology was built to handle complex problems, especiallyfor high-pro le and mission-critica l projects.

    How Do Data Proling Tools Differ?Data pro ling tools vary both in the architecture they use to ana -lyze da ta and in the working environment they provide for theda ta pro ling team.

    Architecture option: Query-based proling Some pro lingtechnologies involve crafting SQL queries that are run againstsource systems or against a snapshot copy of the source data.While this generates some good information ab out the da ta, it

    has several limita tions:

    Performance risks: Queries strain live systems, slowing downoperat ions, sometimes signi can tly. When add itiona l informa-tion is required, or if users want to see the actual data, a sec-ond q uery executes, creating even more strain on the system.Organizations reduce this risk by making a copy of the data,but this requires replicating the entire environmentbothhardware and software systemswhich can be costly andtime-consuming.

    Tracea bility risks: Data in production systems changes con-stantly. The sta tistics and metada ta captured from query-ba sed pro ling risk being out o f da te immediately.

    Completeness risks: It is d if cult t o gain comprehensiveinsights using q uery-ba sed ana lysis. Queries are ba sed o nassumptions, and the purpose is to confirm and quantifyexpectations about what is wrong and right in the data.Given t his, it is easy to overlook problems tha t you a re notalready aw are of.

    Pro ling by q uery is valuab le when you w ant to mo ni-tor production data for certain conditions. But it is notthe best way to analyze large volumes of data in prepa-ration for large-scale data integrations and migrations.

  • 8/10/2019 159497270 Data Profiling Basics

    3/4

    Tril lium Sof t ware Solut ion Guide: Data Pro li ng Basics

    Architecture option: Data proling repositoryOther pro ling techno logies pro le data a s part o f a scheduledprocess and sto re results in a pro ling repo sitory. Stored resultscan include content such as summary statistics, metad ata, pat-

    terns, keys, relationships, and da ta va lues. Results can then befurther ana lyzed by users or stored for later t rending a nalysis.

    Pro ling repositories tha t a llow users to d rill down on informa-tion and see original dat a values in the context of source recordsprovide the most versat ility and stability for non-technical audi-ences. Independence from operat ional source systems coupledwith the vast a mount of meta dat a and information derived froma point in t ime pro le provide a cross-functiona l team of b usi-ness and ITresources a common, comprehensive view of source

    system data from which traceable decisions can b e ba sed.

    Volume considerat ions: Should tables o r les enter intothe range o f hund reds of millions of records, a p ro lingrepository strateg y should be considered. With volumesthis large, the best strategy may be a blend between week-end-scheduled pro ling processes a nd focused, non-con-ten tious q uery-based pro ling , closely monitored by IT.

    Work environment: Multi-user workspaceSome pro ling tools are designed as desktop solutions for re-sources to use as a tea m of one. How many resources will beinvolved in your da ta pro ling efforts? For large projects, thereis generally a cross-functiona l team involved.

    Consider the environment a pro ling tool provides since mul-tiple users with different skills, different expertise, and varyinglevel of techn ical skills all need to be ab le to a ccess and clearlysee the condition of the da ta . Even if some prospective da tapro lers are skilled in SQL and database technologies, pro l-ing tools that foster collab oration bet ween business users andIToffer greater value overall. With a common w indow on t heda ta set s, people with d iverse ba ckground s can concretely andproduct ively discuss the dat a, its current stat e, and what is re-q uired to move forwa rd.

    Work environment: Graphical InterfaceBecause users may not be familiar with data-base structures and technologies, it is impor-tant t o nd a tool tha t provides an intuitive

    easy-to-learn graphical user interface (GUI).Appropriate security features should also bea part o f the work environment, to ensure tha taccess to restricted elds or records ca n be a l-lowed or denied, for sensitive informat ion.

    What Follows Data Proling?Once the ta sk of da ta pro ling is complete,there is more to do. Keep in mind both theshort- and long-term goals driving the need to

    pro le your data . Leverag e your investmentsby understand ing wha t follows and see if thereare logical extensions to your pro ling effortstha t can be executed w ithin the same tool.

    ETL projects for data integration or migra-tion use pro ling results to design targetsystems, de ne how to a ccurately integ ratemultiple da ta sets, and ef ciently move datato a new system, taking all data conditionsinto consideration.

    Data quality processes that improve theaccuracy, consistency, and completenessof data use results to identify problems oranoma lies and then d evelop rules for auto-mated cleansing and standardization.

    Data monitoring initiatives use pro ling re-sults to establish automated processes forongoing assessment of key data elementsand acute data conditions in productionsystems. The pro ling reposito ry capturesresults, sends alerts, and centrally managesdata standards.

  • 8/10/2019 159497270 Data Profiling Basics

    4/4

    Tril lium Sof t war e Solut ion Guide: Data Pro li ng Basics

    TS Discovery for Data ProlingTS Discovery is unlike ma ny ot her da ta pro- ling too ls on the market. It performs as abest -of-breed da ta pro ling tool, and also

    has several key differentiators that advanceits overall value:

    User Interface the interface is designedspeci cally for a business user. It is intui-tive, easy t o use, and a llow s for immediatedrill dow n for further ana lysis, without hit-ting product ion systems.

    Collaborative Environment team mem-

    bers can log into a common repository, viewthe same data, and contribute to prioritizingand determining appropriate actions to ta kefor addressing anomalies, improvements,integration rules, and monitoring thresholds.

    Proling Repository the repository storesmetadata created by reverse-engineeringthe data. This metadata can be summarized,synthesized, drilled down into, and used torecreate original source record replicas. Busi-ness rules and data standards can be devel-oped w ithin the repository to run aga inst themetadata or deployed to run systematicallyagainst production systems, complete withalert notications.

    Robust proling functionsin additionto basic pro ling, TS Discovery providesad vanced pro ling opt ions such as:pattern analysis, soundex routines, meta-phones, natural key analysis, join analysis,dependency analysis, comparisons againstde ned da ta standards, and regulationag ainst esta blished b usiness rules.

    Improve Data Immediately data can be cleansed andstanda rdized directly using TS Discovery. Name and ad dresscleansing, ad dress valida tion, and recoding processes can berun using TS Discovery. Cleansed da ta is placed in new elds,

    never overwriting source dat a. The cleansed les can be usedimmediate ly in other systems and business processes.

    Helpful Modeling Functions data architects and mod-elers rely on results from key integrity, natural key, join,and depend ency analyses. Physical da ta models can beproduced through the effects of reverse engineering theda ta , to validate models and identify prob lem areas. Venndiag rams can be used to ident ify outlier records and orphans.

    Monitoring capabilities monitor data sources for specicevents and/or errors. Notify users when data values do not meetpredened requirements, such as unacceptable data values orincomplete elds.These powerful features give users the envi-ronment necessary to understand the true nature of their cur-rent data landscape and how data relates across systems.

    TS Discovery: Investing in the FutureA data pro ling solution cannot exist in a vacuum because it isalso a part o f a larger process. While da ta pro ling is a w ay to

    understa nd the cond ition of dat a, TS Discovery provides b ridg esto larger initiatives for da ta integ rations, da ta q uality, and da tamonitoring. Trillium Software is committ ed to expanding thosebridges. It continues to innova te and provide new ways to track,control, and access da ta a cross the ent erprise. It a lso cont inuesto integrate TS Quality functiona lity into TS Discovery to estab-lish and promote high quality data in complex and dynamicbusiness environments.

    Harte-Hanks Trillium Softwarewww.trilliumsoftware.comCorporate Headquarters

    + 1 (978) [email protected]

    European Headquarters+ 44 (0)118 940 7600

    [email protected]