tdwi checklist report: active data archiving

TDWI CHECKLIST REPORT

TDWI RESEARCH

tdwi.org

Active Data Archiving

For Big Data, Compliance, and Analytics

By Philip Russom

Sponsored by: RainStor

TABLE OF CONTENTS

FOREWORD

NUMBER ONE Embrace modern practices and platforms for active data archiving

NUMBER TWO Assure and improve data governance by using a compliance data archive

NUMBER THREE Consider an analytics archive for critical, high-value, and aging analytics data

NUMBER FOUR Rethink how data is committed to an archive

NUMBER FIVE Rethink how archived data is accessed and used actively

NUMBER SIX Deploy archiving systems that have multiple storage and processing tiers

NUMBER SEVEN Make security a high priority because it will make or break an archive

ABOUT OUR SPONSOR

ABOUT THE AUTHOR

ABOUT TDWI RESEARCH

ABOUT THE TDWI CHECKLIST REPORT SERIES

MAY 2014

© 2014 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

555 S Renton Village Place, Ste. 700 Renton, WA 98057-3295

T 425.277.9126 F 425.687.2842 E [email protected]

FOREWORD

Data archiving presents various problems in the enterprise today. Many organizations don’t archive at all. Others mistakenly think that mere data backups can serve as archives, whereas tape is actually the final burial place of data, from which it rarely returns. Equally off base, others believe a data warehouse is an archive. Although data archiving processes exist today in some organizations, these are rarely formalized or policy driven, so data is archived in an ad hoc fashion (typically per application or per department) without an enterprise standard or strategy.

Even when an organization makes an honest attempt at an enterprise data archive, the result is usually not trustworthy (because data is easily altered), not auditable (due to poor metadata and documentation), not compliant (due to inadequate usage monitoring or the inability to purge data at specified milestones), and not properly secured (lacking encryption, masking, and security standards). Furthermore, with most existing data archives, it’s hard to get data in with integrity and out with speed because the primary platform is not online, active, and highly available.

Why don’t more organizations invest in formal archiving processes and technical solutions? Most likely, the reason is the common belief that archives provide little or no return on investment (ROI) because users rarely (if ever) access them. Without prominent and frequent usage, a respectable ROI is unlikely.

A data archive can achieve ROI by serving multiple uses and users from an online, active platform. Yes, organizations do need to retain data; that’s not in question. However, archived data is not just insurance for compliance, audit, and legal contingencies. Those are important goals, but a data archive should also be treated as an enterprise asset to be leveraged, typically via analytics. Hence, a data archive can be more than a cost center; it can achieve ROI when it serves multiple uses (archiving, compliance, and analytics of deep historical data sets) and it manages data online for active access at any time by a wide range of users.

Users must start planning today for active data archiving. To help them prepare, this TDWI Checklist Report drills into the desirable attributes, use cases, user best practices, and enabling technologies of active data archiving.

NUMBER ONE

EMBRACE MODERN PRACTICES AND PLATFORMS FOR ACTIVE DATA ARCHIVING

There are compelling reasons for improving data archives. Traditional reasons for data archives still apply: namely, supplying data for compliance, audit, and legal requirements. However, a modern online data archive brings greater speed, accuracy, and credibility to these tasks so they are a smaller drain on enterprise processes and resources.

New reasons have come into play as well: namely, organizations’ voracious hunger for actionable insights discovered through advanced analysis of raw source data, big data, and a broadening diversity of data types. One of the most influential changes, however, concerns the state of the art in data platforms, both hardware and software. Their speed, scale, and functionality continue to rise even as their costs fall, which in turn makes the improvement of users’ data archive solutions feasible for both technical and financial reasons.

Active data archiving can address these problems and opportunities. Enterprises need to embrace the emerging practice of active data archiving along with its enabling technologies. A modern solution for active data archiving will:

• Be built primarily for compliance or data governance but also serve the archival needs of analytics and sometimes data backup and disaster recovery.

• Be open to active access by a wide range of users, including those who need simple lookups and easy data exploration.

• Manage data as an immutable record that cannot be altered so that data is trustworthy for compliance and legal requirements.

• Be secured like a bank vault, for data security, privacy, and trust, using role-based permission access, data masking, encryption, and multiple data security standards.

• Scale up to multi-terabyte and petabyte data volumes, using fast bulk loads and data compression, both to embrace new big data sources and because archives inevitably grow over time.

• Operate online with high availability around the clock to enable active data loads and extracts that keep the archive current up to the minute. Furthermore, data is constantly appended to an active archive without downtime or performance degradation.

• Support high-performance access based on SQL and other standards because users expect quick responses as they run queries and searches against archived data.
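One of the attributes listed above, data masking, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the field names in SENSITIVE_FIELDS and the rule of keeping the last four characters visible are hypothetical, and a production archive would rely on the masking features of its platform rather than on application code like this.

```python
# Hypothetical names of fields that hold sensitive values.
SENSITIVE_FIELDS = {"ssn", "card_number"}


def mask_value(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        key: mask_value(val) if key in SENSITIVE_FIELDS else val
        for key, val in record.items()
    }


# Made-up sample record for illustration only.
record = {
    "customer": "A. Jones",
    "card_number": "4111111111111111",
    "ssn": "078051120",
}
masked = mask_record(record)
print(masked)
```

The original record is left untouched; only the copy handed to the user is masked, which is consistent with keeping archived data in its original state.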



NUMBER TWO

ASSURE AND IMPROVE DATA GOVERNANCE BY USING A COMPLIANCE DATA ARCHIVE

Two broad archive categories, defined by their content and the primary use of that information, can coexist and overlap in active data archiving solutions:

• Compliance archives: Data retained in the content, format, and timeframes prescribed by legislation and other regulations or obligations (e.g., those involving partners, lenders, and legal liabilities)

• Analytic archives: Detailed source data from operational and transactional applications, extracted for general business intelligence purposes but retained for advanced analytics (as defined in the next section of this report)

Compliance archives have a number of desirable process and technical attributes:

Data that’s properly archived is solid evidence of an organization’s compliance. In legal terms, honest attempts at archiving constitute proper intent, whereas a lack of archiving may be construed as malfeasance.

Data archived for compliance must support appropriate regulations. These vary by industry. For example, in the United States, the most stringent regulations target banking and the financial services industry, as seen in the Dodd-Frank legislation or SEC Rule 17a-4. Similarly, the telecommunications industry is subject to legal hold and lawful intercept requirements that demand timed data retention.

Archived data must be tamper proof to be trusted. Most archived data is captured and stored in its original form so it’s a credible representation of a transaction, report, business process, or other event at a specific time. If archived data is altered, it is no longer considered credible. For example, stock trades are stored for exact timeframes to protect both trader and institution. Transparency is of the utmost importance to compliance archives, and WORM (write once, read many times) storage has become key.

Archived data demands a convincingly documented audit trail. Most audits commence with a request for information, followed by a request for an audit trail for the supplied information. With data stored properly in an active archive, audits go faster (and perhaps more accurately, too) than with traditional offline, ad hoc archives. The speedy, documented response builds confidence with auditing bodies and contributes to favorable outcomes.

An active data archive should have tracking functions so an organization can monitor and study its own activities to assure compliance and make improvements. The same tracking functions can flag data that has aged beyond its compliance requirements and should be deleted.
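The tracking function that flags data aged beyond its compliance requirements can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the record classes, the retention periods in RETENTION_DAYS, and the sample records are hypothetical, and real retention periods must come from the regulations that apply to a given organization.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods, in days, per class of record.
RETENTION_DAYS = {"trade": 6 * 365, "web_log": 365}


@dataclass(frozen=True)
class ArchivedRecord:
    record_id: str
    record_class: str
    archived_on: date


def is_expired(record: ArchivedRecord, today: date) -> bool:
    """Return True when the record has aged past its retention period."""
    limit = timedelta(days=RETENTION_DAYS[record.record_class])
    return today - record.archived_on > limit


def flag_for_purge(records, today):
    """List the IDs of records that are now eligible for deletion."""
    return [r.record_id for r in records if is_expired(r, today)]


# Made-up sample records for illustration only.
records = [
    ArchivedRecord("t-1", "trade", date(2007, 1, 15)),
    ArchivedRecord("t-2", "trade", date(2013, 6, 1)),
    ArchivedRecord("w-1", "web_log", date(2012, 3, 9)),
]
purge_candidates = flag_for_purge(records, today=date(2014, 5, 1))
print(purge_candidates)  # ['t-1', 'w-1']
```

The sketch only flags candidates. Whether flagged records are actually deleted is a policy decision, for example when a legal hold overrides normal retention.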

NUMBER THREE

CONSIDER AN ANALYTICS ARCHIVE FOR CRITICAL, HIGH-VALUE, AND AGING ANALYTICS DATA

Archiving operational data for analytic purposes is on the rise. As more advanced forms of analytics have gained credence over the last 15 to 20 years, user organizations have been retaining more detailed source data. The traditional practice was to extract data from operational applications and other sources, process that data and load the results into a data warehouse (DW), then delete the extracted source data. The accepted practice today keeps most source data because it is also the preferred material for analytics based on data mining, statistical analyses, natural language processing, and SQL-based analytics.

An analytic archive and a data warehouse are similar but different. Because of the stepped-up data retention, the data staging areas within most data warehouse architectures today are bigger than their core warehouses. This is tantamount to data archiving, though few BI/DW professionals call it archiving. All they know is that they have to do something to improve the content and accessibility of their analytic data archives. Furthermore, they need to offload this burden from core warehouses, which have higher priorities than analytics (namely reporting, OLAP, and performance management). Hence, as BI/DW professionals ponder where to put certain classes of analytic data, they should consider a platform for active data archiving.

An analytic archive easily integrates with multi-platform DW architectures. DW system architectures have always been multi-platform, but this trend has accelerated in recent years as users have extended their DW environments by adding new platforms for columnar databases, appliances, NoSQL, and Hadoop. An additional platform, one that specializes in archiving data for advanced analytics, would wring more value from archived source data and easily integrate with multi-platform DW architectures.

A data archive can future-proof analytic applications. Most data warehouses are designed by their users (not vendors) for the data requirements of reporting, OLAP, and performance management. These practices need calculated, aggregated, standardized, and time-series numeric values modeled in multidimensional structures that don’t exist in source systems. Advanced analytics has different data requirements. It needs a very large store of unaltered (or lightly transformed) detailed source data. Other than that, it’s impossible to anticipate data requirements for future analytic applications. Accordingly, an analytic archive preserves source data in its original form, so the source is there for future analytic applications to explore and repurpose.



NUMBER FOUR

RETHINK HOW DATA IS COMMITTED TO AN ARCHIVE

A data archive has to be more than a dumping ground. For one thing, there needs to be a strategy, based on new and evolving user requirements for aging, less frequently accessed data, along with metrics for identifying which data should be archived, at what level, and on what schedule. Note that not all data should be archived: some data belongs elsewhere, say, in its original application database or in a data warehouse. Archive specialists need to interview a broad range of business users and managers to determine users’ needs for archived data. If your organization has a legal department and compliance officers, give priority to their needs but without neglecting the rest of the enterprise.

On a technology level, develop interfaces and integration logic for getting data into the archive quickly and in lightly transformed states that are conducive to query and search, without altering the essential content of archived data. Finally, assume that all the data in the archive needs an audit trail and documentation (via metadata, etc.) that is sufficient to satisfy even the most aggressive users and auditors.

What if data comes from applications that have been upgraded or customized (which can alter data models)? Look for a data archiving platform that can manage changing data models. That way, the platform understands changes to source schemas and adjusts metadata and pointers accordingly.

What if archived data comes from an application that was decommissioned (also known as application retirement)? When the only application that can read a dataset with full integrity is gone, that application’s data may need to be lightly transformed before entering an archive (or after it’s in the archive) so it can be easily accessed by common query and search tools. This practice is inspired by data warehousing but it does not require the full-blown time, skills, and expense of the average data warehouse.

Some archived data needs encryption (for security) or compression (to reduce its storage footprint). Look for a platform that can apply these and other data operations as data enters the archive or after data is in the archive. Furthermore, as data growth rates continue to rise over time and business demands for retaining older data grow, data should be stored in a compressed state to optimize storage capacity and scale over time. Similarly, the security classification of data can change as organizational rules and policies evolve.
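As a rough illustration of applying a data operation as data enters the archive, the following Python sketch compresses a payload on the way in and verifies on the way out that the original content is unchanged. The function names commit_to_archive and read_from_archive and the sample data are hypothetical; zlib and SHA-256 stand in for whatever compression and integrity mechanisms an actual archiving platform provides.

```python
import hashlib
import zlib


def commit_to_archive(raw: bytes) -> dict:
    """Compress a payload and record a checksum of the original bytes."""
    return {
        "sha256": hashlib.sha256(raw).hexdigest(),
        "original_size": len(raw),
        "payload": zlib.compress(raw, 9),
    }


def read_from_archive(entry: dict) -> bytes:
    """Decompress a payload and verify it against the stored checksum."""
    raw = zlib.decompress(entry["payload"])
    if hashlib.sha256(raw).hexdigest() != entry["sha256"]:
        raise ValueError("archived payload failed its integrity check")
    return raw


# Made-up, highly repetitive sample data; real compression ratios vary.
source = b"2014-05-01,ACME,BUY,100,12.50\n" * 1000
entry = commit_to_archive(source)
restored = read_from_archive(entry)
print(entry["original_size"], len(entry["payload"]), restored == source)
```

Storing the checksum of the uncompressed bytes means the stored form can later change (for example, recompression or added encryption) while the essential content can still be shown to be unaltered.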

NUMBER FIVE

RETHINK HOW ARCHIVED DATA IS ACCESSED AND USED ACTIVELY

Let’s be honest: We’ve all worked in organizations where archives were purely pro forma, without a credible effort to preserve data in a state that’s quickly or easily accessed by anyone, much less the growing number of employees who can benefit from accessing the information. Luckily, this old “worst practice” is giving way to the realization that all enterprise datasets, including archived data, are valuable assets that can contribute to many business goals. The recent craze for analytics with big data has led many organizations to seek more business value from their datasets.

With that in mind, active data archiving is a bit of a culture shock in some organizations. To get past the shock, these organizations need upper management to define a mandate for modern archiving based on the following goals:

Archived data must be leveraged. Typical use cases include fast, documented auditing for compliance, a source for analytic applications, data exploration, and information lookups.

Some data will come out of the archive to be used elsewhere. To enable a broad range of users, tools, and purposes, the archive should support both query and search mechanisms. Furthermore, the archive should serve as a source for other data platforms, especially those for business intelligence and analytics.

A growing constituency of users will have access to archived data. This is a sticking point in organizations that define data governance and compliance as the process of limiting data access. The challenge is to balance access and control, typically through well-defined user types controlled via role-based user access and strong security features in the archival platform.

Accessing archived data will be timely. First, to be truly active, the archive must be online like a database, not offline like magnetic tapes and optical disks or any media that demand a distracting and time-consuming restoration process. Second, data access mechanisms should perform at or near real time for the sake of user productivity.
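The balance between access and control described above can be sketched as a role check placed in front of an ordinary SQL lookup. This is a minimal Python example using the standard sqlite3 module. The roles, the trades table, and the sample rows are all hypothetical, and a real archive would normally enforce such rules in the platform’s own security layer rather than in application code.

```python
import sqlite3

# Hypothetical roles and the columns each role may read.
ROLE_COLUMNS = {
    "auditor": {"trade_id", "account", "amount", "traded_on"},
    "analyst": {"trade_id", "amount", "traded_on"},
}


def lookup_trades(conn, role, columns, since):
    """Return trades on or after `since`, limited to columns the role may read."""
    allowed = ROLE_COLUMNS.get(role, set())
    denied = [c for c in columns if c not in allowed]
    if denied:
        raise PermissionError(f"role {role!r} may not read {denied}")
    # Column names are safe to interpolate: each one passed the allow-list above.
    sql = (
        f"SELECT {', '.join(columns)} FROM trades "
        "WHERE traded_on >= ? ORDER BY trade_id"
    )
    return conn.execute(sql, (since,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id INTEGER, account TEXT, amount REAL, traded_on TEXT)"
)
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?)",
    [(1, "ACC-001", 1250.0, "2013-11-04"), (2, "ACC-002", 980.5, "2014-02-17")],
)

rows = lookup_trades(conn, "analyst", ["trade_id", "amount"], since="2014-01-01")
print(rows)  # [(2, 980.5)]

try:
    lookup_trades(conn, "analyst", ["account"], since="2014-01-01")
    analyst_blocked = False
except PermissionError:
    analyst_blocked = True
print(analyst_blocked)  # True
```

Here the analyst role can explore amounts and dates but cannot see account identifiers, while the auditor role can read every column.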



NUMBER SIX

DEPLOY ARCHIVING SYSTEMS THAT HAVE MULTIPLE STORAGE AND PROCESSING TIERS

For a data archive to be truly active, its primary tier should be based on a robust database management system (DBMS). The DBMS must include traditional relational functions (for query and data exploration) and functions for multiple security strategies, scalability, and high availability. The assumptions here are that most data being archived will be structured and that most users and applications will need to access data via queries. Even so, some functions of the DBMS should be controlled; for example, inserting and updating data can destroy data’s original state, whereas appending data avoids such integrity problems. In addition to relational technology, free text search is critical to finding records of interest and to enabling non-technical users.

An active data archiving platform can host many archives, each with its own unique requirements, similar to how a DBMS can manage several databases (defined as collections of data). Thus, multi-tenancy is another key assumption for a modern data archive.

In most cases, an archive platform is not a data processing or analytics platform. Hence, archived data is best extracted, then moved to a DBMS or other data platform that is more conducive to in-database analytics, intense SQL-based analytics, and miscellaneous forms of advanced analytics. For these purposes, mature organizations already have in place relational data warehouses, columnar databases, and DW appliances, possibly NoSQL databases and Hadoop. As an exception, when an active archive runs atop Hadoop, it may make sense to process and analyze data on the same platform where it’s archived. Note that the DBMS in the primary tier of a data archive does not replace other DBMSs, especially not those deployed for analytics. Instead, it complements them and (in addition to its archival purpose) serves as yet another source of data for analytics (largely historical data).

The storage tier of an active archive should be diverse. This is to accommodate subsystems users already have as well as newer commodity-priced types such as content-addressable storage (CAS) hardware or the Hadoop Distributed File System (HDFS). Even a modern active archive might include systems for magnetic tape and optical disk in the storage tier. After all, many organizations have pre-existing magnetic tape or optical disk libraries that they must maintain. Note that these archaic media are antithetical to an active data archive; if possible, their data should be migrated into the active archive so it’s online and available when users need it.

In the case of a compliance archive (for, say, a financial services institution), the archive must reside in a WORM storage platform. This, in turn, requires a DBMS that supports WORM devices. WORM technologies are worth the investment because they keep compliance and risk officers happy and they avoid fines, penalties, and damaging publicity.
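The idea of controlling DBMS functions so that committed data keeps its original state can be approximated in software with triggers. The Python sketch below uses SQLite through the standard sqlite3 module. The table and trigger names are hypothetical, an append is modeled as the insertion of a new row, and any later update or delete of an existing row is rejected. This is only an illustration of write-once behavior at the database level; it is not a substitute for a WORM storage platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE archive (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);

    -- Reject any attempt to change a committed row.
    CREATE TRIGGER archive_no_update BEFORE UPDATE ON archive
    BEGIN
        SELECT RAISE(ABORT, 'archive rows are immutable');
    END;

    -- Reject any attempt to remove a committed row.
    CREATE TRIGGER archive_no_delete BEFORE DELETE ON archive
    BEGIN
        SELECT RAISE(ABORT, 'archive rows are immutable');
    END;
    """
)


def attempt(sql, params=()):
    """Run a statement and report whether the archive accepted it."""
    try:
        conn.execute(sql, params)
        return "allowed"
    except sqlite3.DatabaseError as exc:
        return f"rejected: {exc}"


append_result = attempt("INSERT INTO archive (payload) VALUES (?)", ("trade 1",))
update_result = attempt("UPDATE archive SET payload = 'edited' WHERE id = 1")
delete_result = attempt("DELETE FROM archive WHERE id = 1")
stored = conn.execute("SELECT payload FROM archive").fetchall()
print(append_result, update_result, delete_result, stored)
```

After the three attempts, the table still holds the single original row, because only the append was accepted.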

Users should consider Hadoop as both a highly scalable storage platform for archiving and a low-cost processing platform for analytics. Note that open-source Hadoop’s poor support for two key standards—SQL (and other relational technologies) and security (especially LDAP and Linux PAM)—keeps it unpalatable for mature IT organizations.

Despite these two limitations, Hadoop has roles to play in multi-platform archive architectures. Hadoop excels with very large data volumes, as well as with file-based data, data documents (XML and JSON), textual content (e-mail and word processing files), unstructured and non-relational structured data, and schema-free data. Hadoop’s low price is appropriate to many kinds of lower-value (but high-volume) historic data, such as Web logs. However, due to limitations in current releases, purely open-source Hadoop may not be the best choice for structured data that needs relational processing (such as intense SQL or multi-way joins) or sensitive data that demands high security. That’s not a show stopper because a number of software vendors offer products that integrate with Hadoop to give it stronger and broader support for security and relational technologies like standard SQL.

Consider economics as you select platforms, tools, and features for a new active archiving architecture. For example, it’s technically possible to include almost any brand of relational DBMS in an archiving solution. However, the older and more mature vendor brands are relatively expensive, especially once an archive scales into multi-terabytes, and they include far more features and functions than are required for archiving. A more cost-effective choice is a DBMS designed for archiving or one of the newer columnar, open-source, or appliance-based DBMSs. In this context, Hadoop is affordable in terms of dollars per terabyte of storage. Similarly, data compression is a feature that can reduce storage costs because it reduces the footprint of archived data in storage.



NUMBER SEVEN

MAKE SECURITY A HIGH PRIORITY BECAUSE IT WILL MAKE OR BREAK AN ARCHIVE

Put succinctly, if an archive isn’t secure, it won’t meet the compliance goals that are its primary purpose. Furthermore, if users don’t trust the security of the archival platform, they won’t use it or its data, and the archive will fail to demonstrate a positive ROI.

The primary line of defense is the security layer built into the relational DBMS at the heart of an active data archiving platform. Most mature IT departments and DBMS teams prefer role-based approaches to security, and many have LDAP and other directories they’d like to reuse and apply within the active archiving solution.

If Hadoop is to be part of an active archive’s infrastructure, note that security in purely open-source Hadoop today is mostly about general access privileges controlled through Kerberos. However, a few third parties now offer add-on products that enable LDAP, Active Directory, and other approaches to security for the Hadoop family of products.

Almost all modern data archives are loaded with sensitive data, including information about customers, partners, and employees, Social Security numbers, credit card numbers, transactions, internal financials, and so on. Encryption or data masking can make this data unreadable in the event of a hack or other unauthorized access.

Additional layers of data protection may be used to keep data locked and immutable. This provides evidence that data records and files have not been altered, which is fundamental to a credible audit. Likewise, records and files cannot be deleted before their retention periods expire.
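One common way to produce evidence that records have not been altered is to chain cryptographic hashes, so that each entry’s hash depends on the entry before it. The Python sketch below illustrates that general technique under stated assumptions. It is not a description of how any particular archiving product works, and the names GENESIS, append_entry, and verify_chain, as well as the sample records, are hypothetical.

```python
import hashlib
import json

# Fixed starting value for the first entry in the chain.
GENESIS = "0" * 64


def _digest(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(chain, record):
    """Append a record, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "hash": _digest(prev_hash, record)})


def verify_chain(chain):
    """Return True only if every entry still matches its recorded hash."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append_entry(chain, {"id": 1, "event": "trade", "amount": 100})
append_entry(chain, {"id": 2, "event": "trade", "amount": 250})
intact_before = verify_chain(chain)

chain[0]["record"]["amount"] = 999  # simulate tampering with an archived record
intact_after = verify_chain(chain)
print(intact_before, intact_after)  # True False
```

A check like this detects alteration after the fact; preventing alteration in the first place still depends on the locking and immutability features of the storage layer.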



ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence and data warehousing solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences as well as strategic planning services to user and vendor organizations.

ABOUT THE AUTHOR

Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI), where he oversees many of TDWI’s research-oriented publications, services, and events. He’s been an industry analyst at Forrester Research and Giga Information Group, where he researched, wrote, spoke, and consulted about BI issues. Before that, Russom worked in technical and marketing positions for various database vendors. Over the years, Russom has produced over 500 publications and speeches. You can reach him at [email protected].

ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.

ABOUT OUR SPONSOR

www.rainstor.com

RainStor provides the world’s most efficient database solutions that reduce the cost, complexity, and compliance risk of managing data. RainStor delivers solutions to the enterprise that let you quickly deploy an Analytical Archive or Compliance Archive so you continue to create business value and stay compliant. RainStor runs anywhere: on-premises or in the cloud and natively on Hadoop. Among RainStor’s customers are 20 of the world’s largest communications providers and 10 of the biggest banks and financial services organizations, which use RainStor to manage historical data, while saving millions. For more info: www.rainstor.com or join the conversation: @rainstor.

http://bit.ly/1742G6d

http://rainstor.com/solutions/analytical-archive/

http://rainstor.com/solutions/compliance-archive/
