storage 2.0 (unstructured data)

Vikas Deolaliker 2008

Upload: vikas-deolaliker

Post on 25-May-2015


TRANSCRIPT

Page 1: Storage 2.0 (Unstructured Data)

Vikas Deolaliker, 2008

Page 2: Storage 2.0 (Unstructured Data)

Executive Summary – I

Opportunity

Fixed content mining is a computationally intensive operation. A purpose-built appliance with adequate integration hooks to back-end data warehousing systems, and with add-ons/plug-ins for the most popular BI clients, will meet the requirements of departments and small and medium-sized businesses.

The key value multiplier for such a product is its ability to integrate seamlessly with existing enterprise systems and to generate reports that can be printed or imported (into Excel, Access, etc.).

The market for such an appliance is expected to reach $200M in 2010 (not counting the storage/server pull-through).

Industry

Unstructured data, or “Fixed Content”, refers to digital content that is generated outside a business context, i.e. the data does not have a schema and is not stored in databases. Normally, all non-transactional data such as email, IM, media, web content, metadata and customer-generated content is considered unstructured.

Businesses increasingly want to add information from “unstructured data” to their library of information sources to improve business decision making. The Business Intelligence industry offers numerous products that enable discovery, intercept, metadata extraction, semantic analysis, storage and lifecycle management (ILM) of unstructured data. Companies such as Interwoven, Vignette, Informatica, Manugistics, IBM and Oracle have products that enable warehousing of content that is considered “unstructured”. BusObj recently acquired Inxight Software for text analytics.

Storage system vendors such as EMC and NetApp have created a new category of storage systems called “Content Addressable Storage” or CAS. IBM recently acquired XIV, maker of the Nextra system, to compete with EMC.

SNIA, a storage industry standards body, has an initiative called XAM (eXtensible Access Method). XAM-compliant products offer a programmable interface for archiving applications to query (search), retrieve and control access to fixed content. It also allows compliance software to access fixed content for SOX and other regulatory compliance tests.

Page 3: Storage 2.0 (Unstructured Data)

Executive Summary – II

Market

Fixed content is stored on NAS appliances and clusters. The market for rich media NAS is growing fast and is expected to surpass $1B in 2009. EMC is the current leader in NAS with 36% market share and is expected to lead the NAS market for fixed content as well.

BI on fixed content is an emerging market. The market for fixed content BI software is still in its infancy, with annual revenues under $20M. It came into focus with the acquisition of Inxight Software by BusObj. This software mostly runs on Windows and accesses storage using iSCSI over IP.

Content Service Providers (CSPs) like Google and Yahoo store fixed content in clusters of servers that run their own proprietary filesystems, a.k.a. Storage 2.0. This market is proprietary, as those filesystems are a source of differentiation for the CSPs.

Page 4: Storage 2.0 (Unstructured Data)

The Trend: According to IDC, transactional data is growing at 32.3% while fixed content (unstructured data) is growing at 63.7%. Replicated or backup data is growing at 43% p.a.

Source: IDC, Dec. 2007

- Unstructured data growth means growth in file services over LAN/WAN.
- Replicated data growth is lower than expected.
- Structured data growth is lowest because most of the data generated today is outside a transactional context.
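As a rough illustration of what these rates imply, the sketch below computes how long each data category takes to double and how much it grows in five years. The rates are the IDC figures quoted above; the function names, the five-year horizon and the assumption that each rate compounds annually at a constant level are ours, added only for illustration.

```python
import math

# Annual growth rates quoted above (IDC, Dec. 2007).
GROWTH = {
    "transactional": 0.323,
    "fixed content": 0.637,
    "replicated/backup": 0.43,
}


def doubling_time_years(rate):
    """Years for a volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


def growth_multiple(rate, years):
    """Factor by which a volume grows after the given number of years."""
    return (1 + rate) ** years


for name, rate in GROWTH.items():
    print(f"{name}: doubles in {doubling_time_years(rate):.1f} years, "
          f"x{growth_multiple(rate, 5):.1f} after 5 years")
```

Under these assumptions fixed content doubles in well under two years, which is the pressure on file services that the slide points to.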

Page 5: Storage 2.0 (Unstructured Data)

The Opportunity: Client can take a three-pronged strategy to enter the fixed content market: (a) develop asset management software, (b) develop Storage 2.0 infrastructure, (c) develop a purpose-built BI appliance for fixed content.

                           2005  2006  2007  2008  2009  2010   2011   2012  2007 Share (%)  2007-2012 CAGR (%)  2012 Share (%)
Unix                         82    97   113   131   153   179    210    250            22.6                17.2            20.9
Linux/other open source       6     7     9    11    14    17     22     26             1.8                23.9             2.1
Windows 32 and 64           197   238   285   341   409   492    590    703            57.3                19.8            58.8
Other                        64    77    91   108   129   153    182    218            18.3                19              18.2
Total                       349   418   498   591   705   841  1,004  1,196           100                  19.2           100

Digital Asset Management Software, $M (IDC, 2008)
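The CAGR column can be checked from the 2007 and 2012 revenue figures with the standard compound annual growth rate formula. The sketch below is our own illustration, not part of the IDC data; the function name `cagr` and the selection of rows are arbitrary. Because the published revenue figures are rounded to whole $M, the recomputed rates can differ from the table by a few tenths of a point, most visibly on small rows.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


# (2007 revenue, 2012 revenue) in $M, copied from the table above.
REVENUE = {
    "Unix": (113, 250),
    "Windows 32 and 64": (285, 703),
    "Total": (498, 1196),
}

for segment, (rev_2007, rev_2012) in REVENUE.items():
    print(f"{segment}: {cagr(rev_2007, rev_2012, 5) * 100:.1f}% CAGR 2007-2012")
```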

Client can enter the market with media asset management software. The software pulls through server and storage infrastructure.

a) Digital Asset Management
- Workflow and ILM management of fixed content
- Content intelligence functionality such as text analytics, search, visualization

b) File System Infrastructure (a.k.a. Storage 2.0)
- Enhanced NAS storage for low-latency, API-based access to media content in filesystems

c) Business can add modules to the BI pipeline and make a purpose-built appliance for fixed content.

Page 6: Storage 2.0 (Unstructured Data)

The Market: The fixed content market totals approximately $1B. The majority of the market is storage hardware and software. NAS is the dominant storage organization for fixed content. Emerging products such as Storage 2.0, fixed media asset management software and BI appliances are currently at a nascent stage.

Hardware (~$1B)
- EMC, NetApp, HDS, IBM, HP
- Sell directly to CSPs like Yahoo, Google, eBay

Software (~$20M)
- Asset Management: HP, IBM
- Business Intelligence: BusObj (Inxight)
- Fixed Content Search/Retrieval: Endeca, Lucene, Microsoft (FAST)
- Content Editing: Adobe, Microsoft, Apple

Page 7: Storage 2.0 (Unstructured Data)

Storage 2.0: The focus has shifted from I/O in Storage 1.0 to higher-level file services. The driver is no longer access protocol but content preference.

Feature              Storage 1.0                                  Storage 2.0
Management           Local application                            Web application
Access               SCSI over FC, GbE or bus                     SCSI over IP
Provisioning         LUN-level granularity                        Filesystem size
Virtualization       LUN, Volume Mgr.                             Object level
Application profile  Write Many Read One                          WORM
Oversubscription     1:1 (provisioning is equal to allocation)    N:1 (provisioning can be N times allocation)
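The oversubscription row can be made concrete with a small thin-provisioning sketch. This is our own illustration under stated assumptions: "allocation" is read as the physical capacity actually backing the filesystems, and the class `ThinPool` and its methods are hypothetical names, not an API from any product mentioned in the slides.

```python
class ThinPool:
    """Physical capacity shared by filesystems provisioned larger than it."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.provisioned = {}  # filesystem name -> provisioned size in GB
        self.used = {}         # filesystem name -> GB actually written

    def provision(self, name, size_gb):
        # Provisioning only records a promise; no physical space is consumed.
        self.provisioned[name] = size_gb
        self.used[name] = 0

    def write(self, name, gb):
        if self.used[name] + gb > self.provisioned[name]:
            raise ValueError("filesystem is full")
        if sum(self.used.values()) + gb > self.physical_gb:
            raise RuntimeError("pool is out of physical capacity")
        self.used[name] += gb

    def oversubscription(self):
        """Ratio of provisioned capacity to physical capacity (the N in N:1)."""
        return sum(self.provisioned.values()) / self.physical_gb


pool = ThinPool(physical_gb=100)
for fs in ("media", "mail", "web"):
    pool.provision(fs, 100)
print(pool.oversubscription())  # three 100 GB filesystems on 100 GB of disk
```

In the Storage 1.0 column the same pool could hold only one 100 GB filesystem, since every provisioned gigabyte is allocated up front.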

Page 8: Storage 2.0 (Unstructured Data)

The Infrastructure Play: Filesystems are turning into a platform with programming interfaces, data routers, load balancers, autonomic functions, analyzers and parsers.

[Figure: block diagrams comparing Filesystem 1.0 and Filesystem 2.0. Filesystem 1.0 labels: User Space, Kernel Space, Process/Socket, VFS, FileSys, iNode Cache, Object Cache, Buffer, Block Driver, Disk. Filesystem 2.0 labels: User Space, Kernel Space, Tools, ClusterFS, Name Server, MetaData Server, Client, File Driver, Block Driver, Disk.]

Filesystem 1.0: Most functional blocks are in kernel space. Data and control share the same path. Block-level semantics are exposed to applications.

Filesystem 2.0: Most functional blocks are in user space. Data and control have separate paths. Block-level semantics are hidden from applications.

Page 9: Storage 2.0 (Unstructured Data)

Filesystem 2.0: A content-centric filesystem where the primitive is no longer a block but a file.

Content Addressable Storage

- Next-generation NAS devices implement a CAS filesystem. The focus shifts from the block device driver to a file driver, which manages the “chunks” of a file on multiple underlying block devices.
- NAS with CAS does not use FC disks; it prefers SAS/SATA disks for their low cost. It replaces the backplane with a high-performance interconnect with RDMA.
- It provides an API for applications. When an application queries the metadata server for a file, it is given a file ID instead of an inode with block-level addresses. It uses file addresses to locate a file, which creates the need for a file data router.
- The actual data can be stored on an existing node-level filesystem or a cluster filesystem.
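The lookup path described above (metadata server returns a file ID, a router locates the chunks on underlying nodes) can be sketched in a few lines. This is a minimal illustration and not the design of any vendor's CAS product: the class names, the tiny chunk size, the use of SHA-256 digests as chunk addresses and the modulo-based routing are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # tiny, hypothetical chunk size so the example is easy to follow


class StoreNode:
    """Stands in for one underlying block device or node-level filesystem."""

    def __init__(self):
        self.chunks = {}  # chunk address (hex digest) -> chunk bytes


class MetadataServer:
    """Maps a file ID to the ordered chunk addresses that make up the file."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.files = {}  # file ID -> list of chunk addresses

    def _route(self, address):
        # The "file data router": pick a node from the chunk address alone.
        return self.nodes[int(address, 16) % len(self.nodes)]

    def put(self, data):
        addresses = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            address = hashlib.sha256(chunk).hexdigest()
            self._route(address).chunks[address] = chunk
            addresses.append(address)
        file_id = hashlib.sha256(data).hexdigest()  # ID derived from content
        self.files[file_id] = addresses
        return file_id

    def get(self, file_id):
        return b"".join(self._route(a).chunks[a] for a in self.files[file_id])


mds = MetadataServer([StoreNode() for _ in range(3)])
file_id = mds.put(b"fixed content")
print(mds.get(file_id))
```

The application only ever holds `file_id`; block addresses stay hidden behind the metadata server. A side effect of addressing by content is that identical chunks are stored once.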

Page 10: Storage 2.0 (Unstructured Data)

BI Play: Fixed content is being warehoused in the enterprise, and information from this content is being analyzed and delivered in many ways to the client. Client’s opportunity lies in adding support for fixed content across the BI pipeline.

[Figure: BI pipeline with fixed content support. Stage labels: Mining, Tools, Analytics, Delivery. Other labels: Visualization, Web Service, Reports, Real Time, Portal; Supply Chain, Customer Relationship, Financials, Sales Force, Human Resources, Image/Video; Query, Search, Media Search, Report Generators, OLAP; Router, ETL, Warehousing, Metadata Extraction, Scatter/Gather, Data Storage, Workflow Integration; Fixed Content Support.]

Page 11: Storage 2.0 (Unstructured Data)

BI Modules I: The BI pipeline needs to be enhanced to support the 54% of enterprise data that is fixed content.

Warehouse

- Existing warehousing techniques depend on the ETL (Extract, Transform & Load) methodology, i.e. changing the format of the data to make it amenable to further processing.
- Fixed content warehousing needs tools such as transcoders, metadata annotation tools, caching, Variable Bit Rate (VBR) encoding, etc.

Tools

- Fixed content needs tools such as search that covers both content and metadata. Semantic analysis is going to end up in this domain once it is well specified by industry bodies.
- Mashup is a tool that will probably end up in the BI domain once it is standardized. Current XHR-based mashups work in web browsers only.
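To make "search which searches content and metadata" concrete, here is a minimal inverted-index sketch. It is an assumption-laden illustration: the class `FixedContentIndex`, the `key:value` convention for metadata terms and the sample documents are all hypothetical, and a real system would add tokenization, ranking and media-specific extraction.

```python
from collections import defaultdict


class FixedContentIndex:
    """Toy inverted index over both text content and metadata annotations."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document IDs

    def add(self, doc_id, text="", metadata=None):
        # Content terms: lowercase whitespace-separated words.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        # Metadata terms: indexed as "key:value" so they can be queried too.
        for key, value in (metadata or {}).items():
            self.postings[f"{key.lower()}:{str(value).lower()}"].add(doc_id)

    def search(self, *terms):
        """Return IDs of documents matching every term (content or metadata)."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()


index = FixedContentIndex()
index.add("memo-1", "quarterly sales memo", {"type": "email"})
index.add("clip-7", "", {"type": "video", "speaker": "CFO"})
print(index.search("sales"))
print(index.search("type:video", "speaker:cfo"))
```

The video clip has no searchable text at all, so it is reachable only through its metadata, which is why annotation tools appear in the warehouse list above.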

Page 12: Storage 2.0 (Unstructured Data)

BI Modules II: Analytics

- Fixed content analytics requires recognition and analysis of all media types: text, voice and video. Streaming video is a challenge.
- The visualization back end is going to end up in this domain. Visualization transcodes data based on the access device.

Delivery

- The end user can ask for data to be delivered as a real-time stream, as a downloadable file, as a visual image or as a web service.

Page 13: Storage 2.0 (Unstructured Data)

Content Management Play: Digitization and automation are redefining the upstream portion of the value chain. SOA is increasingly being used as the integration technology to drive the C&P framework.

[Figure: content value chain with stages Repository, Processing and Distribution, spanned by Workflow Orchestration. Content sources: Film, DVD, Camera, File, Music. Distribution channels: Broadcast, Cinema, Cable/Sat, Print, Tape, Internet. Content attributes: Static/Dynamic, Structured/Un-, Format(s), DRM, Batch/Auto. Other labels: Metadata Tagging/Cataloging, Domain Specifics, Ontologies; iSCSI Block, File Storage, Search; Adaptation for devices, Integration with User Interface; Formats, Encryption; Streaming, Download, Repackaging.]

Page 14: Storage 2.0 (Unstructured Data)

The Product: Enhancing existing storage products to make them metadata aware and global namespace aware will enable the business to enter the fixed content market.

Target CSPs with enhanced FS for unstructured data.
- FS expects nodes to be shared memory. Enable support of COTS servers.
- Separate name server, head node and store nodes.

Target the enterprise unstructured ILM market with xFrame.
- Enhance xFrame to include unstructured data ILM by moving time-sensitive data into near storage and tagging stale data for archival.
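The ILM rule in the last bullet amounts to classifying each object by how recently it was used. The sketch below is a generic illustration and does not describe xFrame's actual behavior: the two thresholds, the use of last-access time as the signal and the action names are all hypothetical, since the slides do not define "time-sensitive" or "stale".

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the slides do not define "time-sensitive" or "stale".
NEAR_WINDOW = timedelta(days=30)
STALE_AFTER = timedelta(days=365)


def ilm_action(last_access, now):
    """Return the ILM action for one object based on its last access time."""
    age = now - last_access
    if age <= NEAR_WINDOW:
        return "move-to-near-storage"
    if age >= STALE_AFTER:
        return "tag-for-archival"
    return "leave-in-place"


def plan(objects, now):
    """Map each object name to its ILM action."""
    return {name: ilm_action(ts, now) for name, ts in objects.items()}


now = datetime(2008, 6, 1)
objects = {
    "campaign.mov": datetime(2008, 5, 20),
    "old-mail.pst": datetime(2006, 1, 1),
    "report.pdf": datetime(2008, 1, 1),
}
for name, action in plan(objects, now).items():
    print(name, "->", action)
```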

Page 15: Storage 2.0 (Unstructured Data)

GTM Strategy: Enterprises, CSPs and SMBs form the three segments of the fixed content market.

To target the segment   Make alliances with and/or make enhancements to                With the following offering
Enterprises             Fixed content asset management software companies              Enhanced FS on cluster with enhanced xFrame
CSPs                    Integrate with CSPs' proprietary metadata servers and tools    Server and storage
SMBs                    Create a CSP-In-A-Box solution                                 CSP-In-A-Box

Page 16: Storage 2.0 (Unstructured Data)

Unstructured Data, Storage 2.0

Vikas Deolaliker, 2008