EDUCATION
Information Classification: The Cornerstone to Information Management
Sheila Childs, EMC
© 2007 Storage Networking Industry Association. All Rights Reserved.
SNIA Legal Notice
• The material contained in this tutorial is copyrighted by the SNIA.
• Member companies and individuals may use this material in presentations and literature under the following conditions:
– Any slide or slides used must be reproduced without modification
– The SNIA must be acknowledged as source of any material used in the body of any document containing material from these presentations.
• This presentation is a project of the SNIA Education Committee.
Abstract
Anyone who is trying to get a handle on information growth, compliance-related risk mitigation and information management costs realizes that without an understanding of the information under management, these objectives are difficult to achieve. Fundamental to any ILM strategy is the practice of information classification.
Information classification requires that I.T. administrators work with Line-of-Business and knowledge workers to gain an understanding of the data to be managed. This session will explore the different types of classification methodologies, including file system metadata-based classification and content/context-based classification. Manual versus automated classification procedures will be discussed, along with the pros and cons of each approach. We will discuss the difference between indexing and classification, and where each approach makes sense.
The session will include information on various standards under development and focus on the work of the SNIA DMF ILM initiative. It will culminate with a view of the benefits to be derived through classification of information – better risk mitigation, lowered management costs and a better understanding of corporate information.
Why Classification? Why Now?
• Growth of Information hasn’t slowed down…
• Sarbanes-Oxley initiatives are evolving into broader enterprise risk management activities
– Numerous regulations and corporate requirements switch focus from capital spend to operating expense (management)
• Storage landscape is changing
– Is the DVD player more important or the movies you watch?
• Storage people are reacting to an unfamiliar and new set of guidelines
Classification Drivers
Storage TCO
– External disk storage purchase projected to grow at 52% annually
– Capacity is the #1 storage issue, driven by email and unstructured data
– Significant transition to disk-based archival storage
– Digital archive capacity will increase nearly tenfold between 2005 and 2010
* Sources: Merrill Lynch 2007-08 storage forecast & views from CIOs, Enterprise Strategy Group 2006 Digital Archive study
[Chart: Total Digital Archived Capacity, Worldwide, 2005 to 2010, by category (Email, Database, Unstructured); CAGR labels shown: 54%, 79%, 68%]
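As a rough cross-check of the "nearly tenfold" figure, the compound annual growth rate implied by a tenfold increase over the five years from 2005 to 2010 can be computed directly. This is an illustrative calculation that is not part of the original slide, and the function name is hypothetical.

```python
def implied_cagr(growth_multiple: float, years: int) -> float:
    """Return the compound annual growth rate implied by a total growth multiple."""
    return growth_multiple ** (1.0 / years) - 1.0


# Tenfold growth over the five years from 2005 to 2010 (assumed exact for illustration).
tenfold_cagr = implied_cagr(10.0, 5)
print(f"Implied CAGR: {tenfold_cagr:.1%}")
```

The result is a little under 60% per year, which falls inside the range of the per-category CAGR labels on the chart.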
Classification Drivers
Risk management
– Compliance:
• Payment Card Industry Data Security Standard (PCI), Health Insurance Portability and Accountability Act (HIPAA), new Federal Rules of Civil Procedure (FRCP)
• EU Directive on Privacy and Electronic Communications (2002/58/EC)
– Information Security
• Protecting Personally Identifiable Information (PII)
• Companies will spend as much as $80B on compliance by 2009
• Compliant records data growing 60% per year generating more than 2PB of new storage capacity requirements in 2007
• Single fastest growing application segment of the storage industry
* Source: Fred Moore, Horison, Storage Spectrum 2006
Classification Drivers
Improved productivity
• The average knowledge worker spends six hours per week searching for information
– 50% of all searches fail to locate desired information
– 15% of the average knowledge worker’s time is spent recreating existing information
• Need
– Better organization of information
– Accurate search
– Consistent management of information
– Shortened “time-to-information”
Source: IDC
Classification for Security and Risk Mitigation
• To date, over 54 million identities have been stolen
• An estimated 19,000 more identities are stolen each day
• Companies on average are spending over 1,500 hours per incident at a cost of $40,000 to $90,000 per victim
Top 10 Customer Data-Loss Incidents Prior to Jan 2006*
Company / Organization | Number of affected customers | Date of initial disclosure
CardSystems | 40 million | June 17, 2005
Citigroup | 3.9 million | June 6, 2005
DSW Shoe Warehouse | 1.4 million | March 8, 2005
Bank of America | 1.2 million | Feb 25, 2005
Wachovia, Bank of America, PNC Financial Services Group, Commerce Bancorp | 676,000 | April 28, 2005
Time Warner | 600,000 | May 2, 2005
Georgia Department of Motor Vehicles | 465,000 | April 2005
LexisNexis | 310,000 | July 19, 2005
University of Southern California | 270,000 | July 19, 2005
Marriott International | 206,000 | Dec 28, 2005
* Privacy Rights Clearinghouse, InformationWeek, Fred Moore, Storage Spectrum
Classification for Litigation Support
• eDiscovery and records management coming together
– Driven by huge costs and risks
– Changes to the Federal Rules of Civil Procedure
• Electronically Stored Information (ESI) is subject to production (the way it is managed from cradle to grave will impact costs and risks of eDiscovery)
• There will be an early “meet and confer”
• Word “preserving” appears in the rules for the first time
• There is a need to understand the “sources” of ESI
– Average eDiscovery costs can run into the millions of dollars per event
Some Relevant Terms
• Data
– Data is what I.T. manages: files, volumes, bits and bytes
• Information
– Information is data with context
– Data Lifecycle supports the Information Lifecycle
• Record
– Recorded information, regardless of medium or characteristics, made or received by an organization that is evidence of its operations, and has value requiring its retention for a specific period of time (ARMA)
• See the SNIA Technical dictionary for additional definitions: http://www.snia.org/education/dictionary
Information Lifecycle Management and Classification
• Information Lifecycle Management
– The policies, processes, practices and tools used to align the business value of information with the most appropriate and cost effective IT infrastructure from the time information is created through its final disposition
– Information is aligned with business processes through management policies and service levels associated with applications, metadata, information and data
• Information Classification
– The process of identifying and categorizing information associated with a business process, in order to produce requirements for the management of this information, within a defined scope
Who Needs to Classify
• Traditionally – Records Information Managers
• Now – IT and numerous other groups are involved
– Line of Business (LOB information stakeholders)
• Application performance, availability, recoverability…
• Staff response time, asset reporting…
• Cost
– Corporate information stakeholders:
• Security officer: Secret, confidential, proprietary…
• Records Managers: corporate system of record
• Compliance officers: authorization, retention
• Litigation support: eDiscovery
Check out SNIA Tutorial: ILM and Tiers of Storage
Classification is Challenging
• Need is to classify “ESI” – Electronically Stored Information
– Stakeholders are numerous
• RM is a component of a bigger information management strategy
• Legal provides Litigation support requirements
• IT provides ILM TCO requirements
– Huge amounts of data – do we need to classify it all?
– Current gaps between records managers, IT, application owners
– How much risk are you willing to bear?
• Example: Major bank implementing a top-down information management strategy run by an ILM senior architect
– “board room stuff”
Cross-disciplinary Information Management Executive Committee
The Classification Process: Application Classification
Focused on Business Applications
Drivers for Application Classification
– Disaster recovery and business continuity
– Server consolidation
– Application performance
Application Classification is fairly “simple”
– Establishes a ranking of applications
– All information associated with the application is treated the same
– Works best when applications are segmented by server
Application Classification is often “good enough”
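The idea that all information associated with an application is treated the same can be sketched in a few lines. This is a minimal illustration only: the tier names, application names and server names are hypothetical, and it assumes each application is segmented onto its own server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    name: str
    server: str
    tier: str  # hypothetical service tier, "tier-1" being the most critical


# Hypothetical ranking of applications, one application per server.
APPLICATIONS = [
    Application("order-entry", "srv-01", "tier-1"),
    Application("email", "srv-02", "tier-2"),
    Application("test-lab", "srv-03", "tier-3"),
]

SERVER_TO_APP = {app.server: app for app in APPLICATIONS}


def classify_by_application(server: str) -> str:
    """Everything stored on a server inherits the tier of its application."""
    app = SERVER_TO_APP.get(server)
    return app.tier if app else "unclassified"


print(classify_by_application("srv-01"))
print(classify_by_application("srv-99"))
```

The lookup never inspects individual files, which is why this approach is simple and also why it cannot distinguish important from unimportant information inside one application.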
The Classification Process: File Attribute Classification
• FILE ATTRIBUTE classification is largely based on file attributes and access patterns
• What is the file named?
• What is the file type?
• Who owns the data?
• Where is it located?
• When was it created?
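The questions on this slide map closely onto metadata that a file system already exposes. The sketch below gathers those attributes with the Python standard library; the function name and the sample file are hypothetical. Creation time is not exposed portably across operating systems, so the sketch records modification and access times instead.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_attributes(path: Path) -> dict:
    """Collect name, type, owner, location and timestamps for one file."""
    info = path.stat()
    return {
        "name": path.name,
        "type": path.suffix.lower() or "(none)",
        "owner_uid": info.st_uid,  # numeric owner id; always 0 on Windows
        "location": str(path.parent),
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
    }


# Demonstrate on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Q1_report.XLS"
    sample.write_text("placeholder")
    attrs = file_attributes(sample)

print(attrs["name"], attrs["type"])
```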
File Attribute Classification
• File attributes offer limited input, therefore limited recognition
– Still useful but class of solutions limited
– Generally useful in optimizing HSM or archiving strategies
– Tends not to meet complex ILM needs (security, retention, etc.)
• Pros & Cons
– Fast, lightweight, not invasive
– May not address changing business value over time
Data Movement Across Tiers
Example: File Attribute Classification
Primary Storage
Secondary Disk-based Storage
Secondary or Tertiary Non-disk storage
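A file-attribute policy for moving data across these tiers might look like the sketch below. The age thresholds are hypothetical and chosen only for illustration; a real policy would come from the business and would likely consider more than last-access time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: how long a file may sit unaccessed before it moves.
SECONDARY_AFTER = timedelta(days=90)
TERTIARY_AFTER = timedelta(days=365)


def target_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier from the time since the file was last accessed."""
    age = now - last_accessed
    if age >= TERTIARY_AFTER:
        return "secondary or tertiary non-disk storage"
    if age >= SECONDARY_AFTER:
        return "secondary disk-based storage"
    return "primary storage"


now = datetime(2007, 4, 16, tzinfo=timezone.utc)
for days in (10, 120, 400):
    print(days, "days ->", target_tier(now - timedelta(days=days), now))
```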
The Classification Process: Content Classification
• Categories and taxonomies have been in use since ancient Greece
• Classification based on CONTENT makes use of indexes, lexicons and taxonomies
• What keywords?
• How is this data related to other data?
• How should data be retained/disposed of for compliance or otherwise used by the business?
Content Classification: Some Additional Definitions
• Taxonomy
– A hierarchical structure used for categorizing a body of information or knowledge, allowing an understanding of how that body of knowledge can be broken down into parts, and how its various parts relate to each other. Taxonomies are used to organize information in systems, therefore helping users to find it
• Related terms: ontology, categories, evidence structures
• Lexicon
– The vocabulary of a language, an individual speaker or group of speakers, or a subject
• Example: A dictionary of over 200,000 medical, pharmaceutical, biomedical and healthcare acronyms and abbreviations is a medical lexicon
• Related terms: thesaurus, vocabulary
• Indexing
– The act of preparing data for search and/or classification, based on content and/or metadata
– Index when keyword searches only are sufficient
– Index when looking to find information quickly within a particular document or file
• Search
– The act of looking for something in a set of data
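The relationship between indexing and search can be illustrated with a small inverted index: indexing prepares the data once, and search then looks terms up in the prepared structure. This is a minimal sketch; the documents and function names are hypothetical, and real indexing engines do considerably more (stemming, ranking, metadata fields).

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Indexing: map every term to the set of documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Search: return the documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents for illustration.
DOCS = {
    "memo-1": "Retention policy for customer records",
    "memo-2": "Lunch menu for Friday",
    "memo-3": "Customer complaint about retention period",
}
index = build_index(DOCS)
print(sorted(search(index, "customer retention")))
```

An index answers "which documents mention these words"; it does not by itself say what category a document belongs to, which is the step classification adds.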
Why Content Classification?
• How does I.T. determine what type of services these files require?
• Are they important?
Manual Analysis of Files
• File names are not intuitive
• Content is difficult to decipher manually
Manual Analysis of Files
• Content analysis requires expertise
• Cost becomes a burden, leading to increased risk
Automated Content Classification
Automated Classification speeds time-to-information effectiveness
Automated Content Classification makes sense
– When multiple classification options result in confusion
– When there is an overwhelming volume of items to classify
– When some documents require time-consuming review by subject matter experts
– When there are a large number of non-business documents
– When you don’t want to have idiosyncratic results
“The highest quality and accuracy occurs when records management is as non-intrusive as possible to the desktop end users and does not interfere with the normal work routines of professional staff in the enterprise”**
** Timothy J. Sprehe and Charles R. McClure, “Lifting the Burden.” Information Management Journal, Vol. 39 Issue 4 (Jul/Aug 2005), 475
Content Classification Algorithms
• All content-based classification is based on “natural language”
• Two general types:
Rules-based content classification algorithms
– Conceptual and Semantic Analysis
– Keywords
– Term frequency
– Pattern matching
– Stemming
– Compound terms
– Latent semantic analysis (synonymy and polysemy)
“Learning”-based content classification algorithms
– Neural Networks
– Probabilistic modeling
– Bayesian Inference
– Shannon’s Information Theory
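To make the "learning"-based family more concrete, the sketch below shows a small naive Bayes text classifier, one common form of Bayesian inference applied to documents. It is an illustration under stated assumptions, not the method of any particular product: the class name, training sentences and category labels are hypothetical, and it uses add-one smoothing and ignores words never seen in training.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self) -> None:
        self.class_doc_counts: Counter = Counter()
        self.term_counts: dict = defaultdict(Counter)
        self.vocabulary: set = set()

    def train(self, text: str, label: str) -> None:
        self.class_doc_counts[label] += 1
        for term in tokenize(text):
            self.term_counts[label][term] += 1
            self.vocabulary.add(term)

    def classify(self, text: str) -> str:
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Start from the log prior, then add the log likelihood of each known term.
            score = math.log(doc_count / total_docs)
            total_terms = sum(self.term_counts[label].values())
            for term in tokenize(text):
                if term not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                count = self.term_counts[label][term] + 1
                score += math.log(count / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples.
clf = NaiveBayesClassifier()
clf.train("invoice payment due amount account", "financial")
clf.train("quarterly revenue invoice statement", "financial")
clf.train("lunch party friday cake", "non-business")
clf.train("holiday party photos", "non-business")

print(clf.classify("invoice for account payment"))
print(clf.classify("friday party"))
```

Unlike a rules-based approach, nothing here states which words define a category; the classifier infers that from labeled examples, so its accuracy depends on the quality and volume of the training set.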
Example of Rules-based Content Classification
• Personally identifiable/identifying information (PII)
– Any piece of information which can potentially be used to uniquely identify, contact, locate, stalk or steal the identity of a single person.
• Payment Card Industry (PCI) Data Security Standard
– Protects PII: Forbids retailers from storing credit and debit card data on point-of-sale systems
– All retailers must ensure that their POS systems are purged of such information, which includes magnetic stripe, PIN and card verification value data
• Classification becomes the cornerstone for identification of PII
– Need to go beyond formal security classifications that have existed in the U.S. for years
– Example: Major bank
• Formal security policies include identifying categories of information: “bank-confidential”, “bank customer confidential”, “other”
• Patterns include customer account numbers according to PCI standards
Classification for Security and Privacy: Pattern Matching
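Pattern matching for account numbers can be sketched as a regular expression plus a checksum. The sketch below is an assumption-laden illustration rather than the bank's actual rules: it treats any run of 13 to 19 digits (optionally separated by spaces or hyphens) that passes the Luhn checksum as a candidate payment card number, and maps a hit to the slide's "bank customer confidential" category. The function names are hypothetical.

```python
import re

# Assumption: candidate card numbers are 13 to 19 digits, optionally
# separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True when the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings in the text that look like valid card numbers."""
    found = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


def classify_document(text: str) -> str:
    """Map a pattern hit to a category name taken from the slide's example."""
    return "bank customer confidential" if find_card_numbers(text) else "other"


# 4111 1111 1111 1111 is a widely used test number that passes the Luhn check.
print(classify_document("Card 4111 1111 1111 1111 on file"))
print(classify_document("Order ref 1234 5678 9012 3456"))
```

The checksum step matters because digit runs alone produce many false positives (order numbers, phone numbers); a rule that combines a pattern with a validity test is one way to keep review effort manageable.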
Example of Rules-based Content Classification
A search for “Washington” delivers 333,000,000 results
Classification for targeted search: keyword
Example of Rules-based Content Classification
Further refining my category rules to: “Washington state sightseeing” gets me what I need
[Search result shown: VERY SPECIAL PLACES, Distinctive Washington Lodging, Tours & Unique Attractions]
Classification for targeted search: keyword
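The effect of refining a category rule can be shown with a simple keyword rule: a document belongs to a category only when it contains every keyword in that rule. The rules and documents below are hypothetical and exist only to mirror the "Washington" example.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


# Hypothetical category rules: every keyword in a rule must be present.
CATEGORY_RULES = {
    "washington": {"washington"},
    "washington state sightseeing": {"washington", "state", "sightseeing"},
}

# Hypothetical documents.
DOCUMENTS = {
    "doc-1": "George Washington was the first U.S. president",
    "doc-2": "Washington State sightseeing: lodging, tours and attractions",
    "doc-3": "Washington D.C. hotel listings",
}


def matching(category: str) -> list[str]:
    """Return the ids of documents that satisfy the rule for a category."""
    keywords = CATEGORY_RULES[category]
    return [doc_id for doc_id, text in DOCUMENTS.items() if keywords <= tokenize(text)]


print(matching("washington"))
print(matching("washington state sightseeing"))
```

The broad rule matches every document, while the refined rule narrows the result to the one document that is actually about sightseeing in Washington state.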
Classification: What is “Good Enough”?
Challenges of classification – various types
• Some human intervention always required to review results of classification
– Automated tools improve efficiency
• Documents with little text – how are these classified?
– PowerPoint slides, email, etc.
– Varying document types
– Metadata classification might be better in this case
• Lack of consistency in naming, structure, format
– Metadata classification may be best
Factors affecting accuracy
• Document consistency / naming consistency
• The strength of the taxonomy (content)
• Applicability of classification algorithms to specific content
What is a reasonable cost per document?
What is the cost of a document that is incorrectly classified?
Does value to the organization outweigh the cost?
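One way to reason about these cost questions is a simple expected-cost model: the cost of processing every document plus the expected cost of the documents that end up misclassified. The model and every number below are hypothetical and chosen only to show the arithmetic; they are not figures from the presentation.

```python
def classification_cost(
    documents: int,
    cost_per_document: float,
    error_rate: float,
    cost_per_error: float,
) -> float:
    """Processing cost plus the expected cost of misclassified documents."""
    processing = documents * cost_per_document
    expected_errors = documents * error_rate * cost_per_error
    return processing + expected_errors


# Hypothetical comparison for 100,000 documents.
manual = classification_cost(100_000, cost_per_document=1.00, error_rate=0.05, cost_per_error=20.0)
automated = classification_cost(100_000, cost_per_document=0.05, error_rate=0.10, cost_per_error=20.0)

print(f"manual:    {manual:,.0f}")
print(f"automated: {automated:,.0f}")
```

With these made-up inputs, the cheaper per-document approach does not come out ahead, because its higher error rate offsets the savings. The point is that cost per document and cost per misclassification have to be weighed together.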
Classification: Getting Started
• Obtain buy-in from Execs
• Establish cross-disciplinary information management team
• Start small, identify business drivers for classification (risk, cost, information access?)
• Audit current state, determine desired outcome and assess gaps
• Begin building policies and procedures to deliver classification strategy
– Develop taxonomies and associated rules, if required
– Evaluate and select classification tools
Summary
Classification - Immediate Benefits
– Better understand and organize your information
– Mitigate risk associated with unmanaged information
– Better deployment and alignment of your I.T. resources (storage/server consolidation, “smart” purchases, etc.)
– Better compliance readiness and eDiscovery
Classification - Longer term benefits
– Service Level Management improves I.T. service delivery
– Information management automation
– Cost reduction
Continue Your SNIA Education Experience At SNW
• Attend Hands-On Labs in:
– Data Classification: Key to Service Level Management
– Data Security and Protection: Data Assurance Solutions to Meet Corporate Requirements
– IP Storage: iSCSI, Your IP SAN
– Storage Management: Manage Storage or Be Managed By It
– Storage Virtualization: Increasing Productivity
– Zero to SAN: Fibre Channel Connectivity in No Time
Sessions begin Monday afternoon, April 16 and continue through Wednesday, April 18. All sessions in Emma/Maggie/Annie, 3rd Floor of the Hyatt Manchester. Registration at the SNW Registration area
Q&A / Feedback
• Please send any questions or comments on this presentation to SNIA: [email protected]
Many thanks to the following individuals for their contributions to this tutorial.
SNIA Education Committee
Edgar St.Pierre, Bob Rodgers, Rob Peglar, Jeff Porter