SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Prakash Chandrasekaran, Anand Gupta, Gautham Narayanasamy, Vijayaraghavan Subbaiah

Motivation

Importance of storage space
  Finding enough storage space to meet customer demand is a major challenge for cloud providers.
  Efficient storage saves significant resources during web crawling, indexing, and search.

Backup strategies
  Data is backed up and replicated across many geographical locations.
  This creates a need for techniques that use storage space more efficiently.

Deduplication

Removing duplicate copies of files and storing only pointers to the original copy.

Block-level deduplication
  Works at a finer granularity and hence offers a greater reduction in storage space.
  Requires more processing power than file-level deduplication.
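The trade-off between file-level and block-level deduplication can be illustrated with a minimal sketch. The fixed 4-byte block size, the use of SHA-256, and the function names are assumptions chosen for illustration; they are not taken from the presented system.

```python
import hashlib

def file_level_store(files):
    """Keep one copy per distinct whole-file hash; return bytes stored."""
    store = {}
    for data in files:
        store.setdefault(hashlib.sha256(data).hexdigest(), data)
    return sum(len(v) for v in store.values())

def block_level_store(files, block_size=4):
    """Keep one copy per distinct fixed-size block; return bytes stored."""
    store = {}
    for data in files:
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            # One hash per block: more hashing work than one hash per file.
            store.setdefault(hashlib.sha256(block).hexdigest(), block)
    return sum(len(v) for v in store.values())

# Three 12-byte files; the third is an exact copy of the first,
# the second differs from the first only in its last block.
files = [b"AAAABBBBCCCC", b"AAAABBBBDDDD", b"AAAABBBBCCCC"]
file_bytes = file_level_store(files)    # two distinct files are kept
block_bytes = block_level_store(files)  # four distinct blocks are kept
print(file_bytes, block_bytes)
```

File-level deduplication only removes the exact copy, while block-level deduplication also shares the blocks common to the two distinct files, at the cost of hashing every block.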

Use case

Storage of snapshots of virtual machine (VM) images in a virtualized cloud environment.
Detecting exact duplicates and near duplicates in web pages.

Architecture

Cassandra Schema

create keyspace minhash;
create column family minhash_chunks with column_type=Super;
create column family minhash_filerecipe with column_type=Super;
create column family minhash_fullhash;

create keyspace files;
create column family files_minhash;

Data Distribution

[Figure: client / application, load balancing, and Cassandra nodes within the Cassandra cluster]

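Data Flow in Cassandra

1. An OS snapshot file or web page is given to the client as the input file.
2. File name match: the client checks whether the file already exists in the Cassandra cluster.
3. If it does not, the chunking process starts and splits the file into chunks.
4. For each chunk, the client computes the minhash and the full hash.
5. The client checks whether the full hash already exists.
6. The client inserts <fileid, minhash>, <minhash, filerecipe>, <minhash, fullhash>, and <minhash, chunkData> into the cluster.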
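The data flow can be sketched end to end with plain dictionaries standing in for the Cassandra column families. This is a sketch under several assumptions: the row and column layout of each table, the fixed chunk size, the single-hash MinHash over byte shingles, and the choice of MD5 and SHA-1 are all hypothetical, as is the decision to write chunk data only when the full hash is new.

```python
import hashlib

CHUNK_SIZE = 8  # assumed fixed chunk size, kept tiny for illustration
SHINGLE = 4     # assumed shingle width used by the min-hash

# In-memory stand-ins for the column families named in the schema.
# The layouts below are assumptions, not the authors' actual design.
files_minhash = {}       # file id -> list of chunk min-hashes
minhash_filerecipe = {}  # min-hash -> {file id: [(position, full hash)]}
minhash_fullhash = {}    # min-hash -> set of full hashes seen so far
minhash_chunks = {}      # min-hash -> {full hash: chunk bytes}

def full_hash(chunk):
    """Exact-content fingerprint of a chunk."""
    return hashlib.sha1(chunk).hexdigest()

def min_hash(chunk):
    """Single-hash MinHash: the smallest hash over all byte shingles."""
    n = max(1, len(chunk) - SHINGLE + 1)
    return min(hashlib.md5(chunk[i:i + SHINGLE]).hexdigest() for i in range(n))

def put(file_id, data):
    """Store a file; return how many chunks were newly written."""
    if file_id in files_minhash:  # file name match: already stored
        return 0
    new_chunks = 0
    files_minhash[file_id] = []
    for pos, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunk = data[start:start + CHUNK_SIZE]
        mh, fh = min_hash(chunk), full_hash(chunk)
        files_minhash[file_id].append(mh)
        recipe = minhash_filerecipe.setdefault(mh, {})
        recipe.setdefault(file_id, []).append((pos, fh))
        known = minhash_fullhash.setdefault(mh, set())
        if fh not in known:  # full hash not seen: store the chunk data
            known.add(fh)
            minhash_chunks.setdefault(mh, {})[fh] = chunk
            new_chunks += 1
    return new_chunks

first = put("snap1", b"AAAAAAAABBBBBBBBCCCCCCCC")
second = put("snap2", b"AAAAAAAABBBBBBBBDDDDDDDD")
again = put("snap1", b"AAAAAAAABBBBBBBBCCCCCCCC")
stored = sum(len(v) for v in minhash_chunks.values())
print(first, second, again, stored)
```

In this sketch the second snapshot writes only the one chunk it does not share with the first, and re-submitting a known file name writes nothing.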

System Implementation

Sequence - put

Sequence - get
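Since the sequence diagrams are not reproduced here, the following is only a guess at what a get might involve: look up the file's min-hashes, read its recipe entries, and reassemble the chunks in order. The table layouts are assumptions, and keys such as "m1" and "f1" are placeholder min-hash and full-hash values filled in by hand.

```python
# Hand-filled stand-ins for the tables that earlier puts would have written.
# Layouts and key values are hypothetical, chosen only for illustration.
files_minhash = {
    "snap1": ["m1", "m2", "m3"],
    "snap2": ["m1", "m2", "m3"],
}
minhash_filerecipe = {
    "m1": {"snap1": [(0, "f1")], "snap2": [(0, "f1")]},
    "m2": {"snap1": [(1, "f2")], "snap2": [(1, "f2")]},
    "m3": {"snap1": [(2, "f3")], "snap2": [(2, "f4")]},
}
minhash_chunks = {
    "m1": {"f1": b"AAAAAAAA"},
    "m2": {"f2": b"BBBBBBBB"},
    "m3": {"f3": b"CCCCCCCC", "f4": b"DDDDDDDD"},
}

def get(file_id):
    """Rebuild a file from its recipe entries and the stored chunks."""
    if file_id not in files_minhash:
        raise KeyError(file_id)
    pieces = []
    # Visit each min-hash once, even if several chunks share it.
    for mh in set(files_minhash[file_id]):
        for pos, fh in minhash_filerecipe[mh].get(file_id, []):
            pieces.append((pos, minhash_chunks[mh][fh]))
    pieces.sort(key=lambda p: p[0])  # restore the original chunk order
    return b"".join(chunk for _, chunk in pieces)

restored = get("snap2")
print(restored)
```

Both snapshots read the shared chunks under "m1" and "m2" from the same stored copies; only the last chunk differs.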

System Efficiency

Calculate the total amount of space saved.
Demonstrate the extent of similarity across various snapshots and web pages.
Measure the overhead associated with file storage and retrieval in our system.
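One simple way to express the space saved is the difference between the bytes submitted and the bytes actually stored, together with that difference as a fraction of the submitted bytes. The function name and the example figures below are hypothetical; no measured results from the system are implied.

```python
def space_saved(logical_bytes, stored_bytes):
    """Return (bytes saved, fraction saved) for a deduplicating store.

    logical_bytes: total size of all files as submitted by clients.
    stored_bytes:  total size of the unique chunks actually kept.
    """
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    saved = logical_bytes - stored_bytes
    return saved, saved / logical_bytes

# Made-up example: 36 bytes submitted, 16 bytes of unique chunks kept.
saved, fraction = space_saved(36, 16)
print(saved, round(fraction, 3))
```

A full accounting would also subtract the metadata the store keeps per chunk (hashes and recipes), which this sketch ignores.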

Questions?
