SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Prakash Chandrasekaran, Anand Gupta, Gautham Narayanasamy, Vijayaraghavan Subbaiah

Upload: sharis

Post on 15-Feb-2016



TRANSCRIPT

Page 1: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Prakash Chandrasekaran, Anand Gupta, Gautham Narayanasamy, Vijayaraghavan Subbaiah

Page 2: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Motivation

Importance of storage space

Finding enough space to meet customer demand has been a huge challenge for cloud providers.

Using less storage saves significant resources during web crawling, indexing, and search.

Backup strategies

Data is backed up and replicated across many geographical locations.

Ingenious techniques are needed to use the storage space more efficiently.

Page 3: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Deduplication

Removing duplicate copies of files and storing only pointers to the original copy.

Block-level deduplication

Works at a finer granularity than whole files and hence offers a greater reduction in storage space.

Requires more processing power than file-level deduplication.
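A minimal sketch of block-level deduplication, assuming fixed-size blocks and SHA-1 digests as block identifiers. The slides do not say how blocks are cut or hashed, so `CHUNK_SIZE`, `split_chunks`, `store_blocks`, and `restore` are hypothetical names chosen for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_blocks(data: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Store each unseen block once and return the file's list of block digests."""
    recipe = []
    for chunk in split_chunks(data):
        digest = hashlib.sha1(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # keep only the first copy
        recipe.append(digest)                  # pointer to the stored block
    return recipe


def restore(recipe: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Rebuild a file by following its pointers in order."""
    return b"".join(chunk_store[digest] for digest in recipe)


chunk_store: dict[str, bytes] = {}
file_a = b"AAAABBBBCCCC"
file_b = b"AAAABBBBDDDD"  # differs from file_a only in the last block
recipe_a = store_blocks(file_a, chunk_store)
recipe_b = store_blocks(file_b, chunk_store)
```

File-level deduplication would store both files in full because they are not identical. Here the two shared blocks are stored once, at the cost of hashing every block.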

Use case

Storage of snapshots of virtual machine (VM) images in a virtualized cloud environment.

Detecting exact duplicates and near duplicates in web pages.
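One common way to detect near-duplicate web pages is to compare MinHash signatures built from word shingles. The sketch below shows that general technique only; the slides do not describe how their minhash is computed, so it may differ. The helper names, the shingle length, and the number of hash functions are assumptions.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed: int, item: str) -> int:
    """Simulate a family of hash functions by salting SHA-1 with a seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each hash function, keep the smallest hash value over all items."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)
```

Exact duplicates produce identical signatures. Near duplicates agree in a fraction of positions that approximates their Jaccard similarity, so pages can be compared without holding their full text.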

Page 4: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Architecture

Page 5: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Cassandra Schema

create keyspace minhash;
create column family minhash_chunks with column_type=Super;
create column family minhash_filerecipe with column_type=Super;
create column family minhash_fullhash;

create keyspace files;
create column family files_minhash;

Page 6: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Data Distribution

Client / Application

Cassandra Cluster

Load Balancing

Cassandra Nodes
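Cassandra places each row on a node by hashing its row key onto a token ring. The toy ring below illustrates that idea only. It assumes an MD5-based hash, and it derives node tokens from node names, which a real cluster does not do; `TokenRing` and `node_for` are hypothetical names.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: a key belongs to the first node token at or after its hash."""

    def __init__(self, nodes: list[str]):
        # Simplification: real clusters assign tokens, they are not hashed names.
        self._ring = sorted((self._token(name), name) for name in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")

    def node_for(self, row_key: str) -> str:
        """Return the node responsible for a row key."""
        idx = bisect.bisect_left(self._tokens, self._token(row_key))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = TokenRing(["node1", "node2", "node3"])
owner = ring.node_for("example-row-key")
```

Because hashed keys spread roughly evenly over the token space, rows tend to spread across the nodes without a central coordinator.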

Page 7: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Data Flow in Cassandra

[Diagram: data flow between the Client and the Cassandra Cluster]

Input: OS snapshot file / web page

File input to Client

File name match: check whether the file already exists

Start chunking process, producing chunks

Compute minhash and fullhash; check whether the full hash already exists

Insert <fileid, minhash>

Insert <minhash, filerecipe>

Insert <minhash, fullhash>

Insert <minhash, chunkData>
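The flow above can be sketched with in-memory dictionaries standing in for the four column families. Several details are assumptions, since the slides do not define them: the minhash is taken to be the smallest chunk digest, the fullhash is the digest of the whole file, chunks are fixed-size, and recipes are keyed by file id inside each minhash row. A small `get` is included only to show that a recipe can be replayed.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; real chunk sizes would be far larger


def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class DedupStore:
    """In-memory stand-in for the column families named on the schema slide."""

    def __init__(self):
        self.files_minhash = {}       # file id -> minhash
        self.minhash_fullhash = {}    # minhash -> set of fullhashes
        self.minhash_filerecipe = {}  # minhash -> {file id: [chunk hash, ...]}
        self.minhash_chunks = {}      # minhash -> {chunk hash: chunk data}

    def put(self, file_id: str, data: bytes) -> bool:
        """Store a file; return False if the file name already exists."""
        if file_id in self.files_minhash:  # file name match
            return False
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        chunk_hashes = [sha1(c) for c in chunks]
        # Assumption: minhash = smallest chunk digest, fullhash = whole-file digest.
        minhash = min(chunk_hashes) if chunk_hashes else sha1(b"")
        fullhash = sha1(data)
        self.files_minhash[file_id] = minhash                   # <fileid, minhash>
        self.minhash_filerecipe.setdefault(minhash, {})[file_id] = chunk_hashes
        known = self.minhash_fullhash.setdefault(minhash, set())
        if fullhash in known:  # exact duplicate: its chunks are already stored
            return True
        known.add(fullhash)                                     # <minhash, fullhash>
        row = self.minhash_chunks.setdefault(minhash, {})
        for chunk_hash, chunk in zip(chunk_hashes, chunks):
            row.setdefault(chunk_hash, chunk)                   # <minhash, chunkData>
        return True

    def get(self, file_id: str) -> bytes:
        """Rebuild a file from its recipe and the chunks in its minhash row."""
        minhash = self.files_minhash[file_id]
        recipe = self.minhash_filerecipe[minhash][file_id]
        row = self.minhash_chunks.get(minhash, {})
        return b"".join(row[chunk_hash] for chunk_hash in recipe)


store = DedupStore()
store.put("snapshot-1", b"AAAABBBBCCCC")
store.put("snapshot-2", b"AAAABBBBCCCC")  # same content under a new name
```

Keying every row by the minhash means a file's recipe and chunks sit in the same row, so a put or get touches one row per column family.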

Page 8: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

System Implementation

Page 9: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Sequence – put

Page 10: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Sequence – get

Page 11: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

System Efficiency

Calculating the total amount of space saved.

Demonstrating the extent of similarity across various snapshots and web pages.

Measuring the overhead associated with file storage and retrieval in our system.
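Space saved is usually reported as the fraction of logical bytes that never had to be stored. A minimal sketch follows; the function names and the byte counts in the example are made up for illustration and are not results from this system.

```python
def space_saved(logical_bytes: int, stored_bytes: int) -> float:
    """Fraction of space saved: 1 - stored / logical."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return 1.0 - stored_bytes / logical_bytes


def dedup_ratio(logical_bytes: int, stored_bytes: int) -> float:
    """Deduplication ratio: logical bytes written per byte actually stored."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return logical_bytes / stored_bytes


# Hypothetical numbers: 100 bytes written by clients, 25 bytes kept on disk.
saved = space_saved(100, 25)
ratio = dedup_ratio(100, 25)
```

`stored_bytes` should include the recipes and hash indexes, not only the unique chunks, so that the figure reflects the metadata overhead as well.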

Page 12: SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Questions?