ubiquitous data access doppalapudi raghu chaitanya jaliparthi gangadhar

Ubiquitous Data Access

Doppalapudi Raghu Chaitanya

Jaliparthi Gangadhar

Outline

Ubiquitous Data

History: NFS, AFS

CODA File System

Cedar

LBFS

Operation Shipping

MFS

Data Staging on Untrusted Surrogates

Portable SoulPads

Portable & Distributed Storage

GFS

Conclusion

Ubiquitous Data

“In ten years, billions of people will be using the Web, but a trillion "gizmos" will also be connected to the Web.” Asilomar Rep. on DB Research, Dec. 1998

“Fundamentally, the ability to access all information from anywhere and have ONE unified and synchronized information repository is critical to making appliances useful.”

Ubiquitous data access will put existing data management techniques to the test, in all aspects – searching, location, reliability, consistency, …

Ubiquitous Data Access: State of the Art

Everyone uses a database system and/or a search engine every day, although they may not realize it (the true test of “ubiquity”).

The Internet and WWW have become a ubiquitous means of global data dissemination and exchange.

Databases play a crucial but largely invisible role here.

XML and related standards are enabling increasingly sophisticated interoperation.

Wireless access provides anytime-anywhere access and enables location-centric applications.

Characteristics of Ubiquitous Data systems

functionality

scalability

serializability

optimality

interoperability

personalization

globalization

synchronization

flow regulation

integration

History

NFS (1985), Sun Microsystems

NFS allows a computer attached to a network to access the file systems on the hard disk of another computer on the same network.

AFS (Andrew File System)

AFS was developed at CMU

AFS has many benefits in the areas of security and scalability

AFS uses Kerberos for authentication

Read and write operations on an open file are directed only to the locally cached copy

When a modified file is closed, the changed portions are copied back to the file server

Cache consistency is maintained by a mechanism called callback: the server notifies clients holding a cached copy when that file is changed

AFS influenced many of today’s distributed file systems, such as CODA
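The callback mechanism can be sketched as follows. This is a toy model under stated assumptions, not AFS code: the names `FileServer`, `Client`, `fetch`, `store`, and `break_callback` are hypothetical, files are plain strings, and whole files are written back rather than only the changed portions.

```python
class FileServer:
    """Toy AFS-style server that remembers which clients cache each file."""

    def __init__(self):
        self.files = {}      # path -> contents
        self.callbacks = {}  # path -> set of clients holding a callback

    def fetch(self, client, path):
        # Hand out the file together with a callback promise.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, writer, path, data):
        self.files[path] = data
        # Break the callback on every other client that cached the old copy.
        for client in self.callbacks.get(path, set()) - {writer}:
            client.break_callback(path)
        self.callbacks[path] = {writer}


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # path -> contents; valid while the callback is held

    def open_read(self, path):
        if path not in self.cache:  # no valid callback, so refetch
            self.cache[path] = self.server.fetch(self, path)
        return self.cache[path]

    def write_close(self, path, data):
        # Writes go to the local copy and reach the server on close.
        self.cache[path] = data
        self.server.store(self, path, data)

    def break_callback(self, path):
        self.cache.pop(path, None)  # cached copy is now stale
```

As long as a client holds a callback it can serve reads from its cache without contacting the server, which is where the scalability benefit comes from.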

CODA File System

CODA is a network file system that achieves high availability using two techniques: server replication and disconnected operation

Disconnected operation is the mode of operation that enables a client to continue accessing critical data during temporary failures of network connectivity

Server replication involves maintaining read-write replicas at more than one server. The set of replication sites for a volume is its volume storage group (VSG)

The main idea behind both techniques is caching data to improve availability

Design

On each client, a user-level process called Venus manages a file cache on the local disk. It is Venus that bears the brunt of disconnected operation

Venus States

Venus operates in three states

Hoarding

Emulation

Reintegration

Hoarding

Used when there is good connectivity between client and server

In this state Venus hoards useful data in anticipation of disconnection

It must estimate which files will be used later and prefetch them for disconnected operation

Hoard walking: Venus keeps the client cache in equilibrium, so that the high-priority files are the ones cached for high availability. It periodically restores equilibrium by performing a hoard walk.
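A hoard walk can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `hoard_walk`, a hoard database that is a plain dict of priorities, and a capacity counted in files rather than bytes. Real Coda also weighs recent usage, which is left out.

```python
def hoard_walk(hoard_db, cache, capacity, fetch):
    """Bring the cache back to equilibrium: keep the highest-priority files.

    hoard_db: dict path -> priority (higher means more important)
    cache:    dict path -> contents, mutated in place
    capacity: maximum number of cached files (stand-in for disk space)
    fetch:    callable path -> contents, used while still connected
    """
    # The files that belong in the cache, highest priority first.
    wanted = sorted(hoard_db, key=lambda p: (-hoard_db[p], p))[:capacity]

    # Evict anything that no longer earns its place.
    for path in list(cache):
        if path not in wanted:
            del cache[path]

    # Prefetch what is missing while the server is reachable.
    for path in wanted:
        if path not in cache:
            cache[path] = fetch(path)
    return wanted
```

Running the walk periodically means the cache is already in a useful state whenever disconnection happens.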

Emulation

Used when the client is disconnected from, or only very weakly connected to, the server

Venus acts as a pseudo-server and assumes full responsibility for access

When a client asks for a file, Venus provides the file if it is stored in the cache

If the requested file is not in the cache, Venus reports an error to the application, since the miss cannot be serviced

Logging: during emulation Venus records sufficient information to replay update activity when it reintegrates.

Reintegration

Entered when network connectivity is restored between client and server

Reintegration is a transitory state through which Venus passes in changing roles from pseudo-server to cache manager

Venus propagates the changes made during emulation and updates its cache to reflect the current server state
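The three states and the replay log fit together roughly as in the sketch below. It is a simplified model, not Coda's implementation: the server is a plain dict, the class and method names are hypothetical, and conflict handling during reintegration is omitted.

```python
from enum import Enum


class State(Enum):
    HOARDING = "hoarding"
    EMULATION = "emulation"
    REINTEGRATION = "reintegration"


class Venus:
    def __init__(self, server):
        self.server = server  # dict path -> contents (stand-in for servers)
        self.cache = {}
        self.log = []         # updates recorded while disconnected
        self.state = State.HOARDING

    def disconnect(self):
        self.state = State.EMULATION

    def read(self, path):
        if self.state is State.EMULATION:
            if path not in self.cache:
                raise FileNotFoundError(path)  # miss surfaces as an error
            return self.cache[path]
        self.cache[path] = self.server[path]
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data
        if self.state is State.EMULATION:
            self.log.append((path, data))  # remember for later replay
        else:
            self.server[path] = data

    def reconnect(self):
        self.state = State.REINTEGRATION
        for path, data in self.log:        # replay logged updates
            self.server[path] = data
        self.log.clear()
        for path in self.cache:            # refresh cache from the server
            self.cache[path] = self.server[path]
        self.state = State.HOARDING
```

The log is what makes emulation safe: nothing written while disconnected is lost, it is only delayed until reintegration.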

Conflict handling

Drawbacks

Updates are not visible to other clients

Cache misses may impede progress

Exhaustion of cache space is a concern

Update conflicts become more likely

Updates are at risk due to theft, loss or damage

Google gears

Cedar

Mobile database access over low-bandwidth networks

Relational databases are at the core of business processes

Cedar is useful for mobile commerce, traveling salespeople, and disaster recovery

A stale client replica can be used to reduce data transmission volume

Basics of database

Cedar Architecture

Content Addressable Storage (CAS)

Stores information so that it can be retrieved based on its content

The system records a content address, which is an identifier uniquely and permanently linked to the information content itself.

A request to retrieve information from a CAS system must provide the content identifier, from which the system can determine the physical location of the data and retrieve it

Any change to a data element will necessarily change its content address

A CAS device will not permit editing information once it has been stored.
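The properties listed above can be shown in a few lines. This is a minimal in-memory sketch; the class name is hypothetical and SHA-256 is an assumed choice of hash, not necessarily the one Cedar uses.

```python
import hashlib


class ContentAddressableStore:
    """Minimal in-memory CAS: the address is a hash of the content."""

    def __init__(self):
        self._blobs = {}  # content address -> bytes

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Write-once: stored content is never edited, only added.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]
```

Storing the same bytes twice yields the same address, and changing a single byte yields a different one, so an address doubles as a check that the retrieved data is the data that was asked for.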

Cedar Protocol

Transparency of cedar

Application Transparency

Database Transparency

Adaptive Interposition

Commonality detection

Exploring structure in data

Generating compact CAS descriptions
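One way to picture commonality detection against a stale client replica is the sketch below. It is an assumption-laden illustration rather than Cedar's protocol: rows are Python tuples, each row is hashed individually with SHA-256, and `answer_query` is a hypothetical helper that runs both sides in one process.

```python
import hashlib


def row_hash(row):
    # Content address of one result row.
    return hashlib.sha256(repr(row).encode()).hexdigest()


def answer_query(server_rows, stale_client_rows):
    """Return (result, rows_shipped) for one query.

    server_rows:       authoritative result computed at the server
    stale_client_rows: result of the same query on the client's stale replica
    """
    local = {row_hash(r): r for r in stale_client_rows}

    # Compact CAS description: the server sends hashes, not rows.
    description = [row_hash(r) for r in server_rows]

    # Only rows whose hash the client cannot resolve locally are shipped.
    missing = {h for h in description if h not in local}
    shipped = {row_hash(r): r for r in server_rows if row_hash(r) in missing}

    result = [local[h] if h in local else shipped[h] for h in description]
    return result, len(shipped)
```

The client always ends up with the server's current result; the stale replica only determines how much of it has to cross the slow link.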

Creating and refreshing client replicas

Hoard Granularity

Database hoard profiles

Tools for handling

Refreshing stale client replicas

Results of Cedar

Drawbacks of cedar

LBFS-Low bandwidth Network File System

LBFS-Low Bandwidth Network File System

A network file system that makes efficient use of the network when bandwidth is low

LBFS exploits the similarities between files, or between versions of the same file, to save bandwidth

Avoids sending data over the network when the same data can already be found in the server's file system or the client's cache

Combined with compression and caching to improve performance

Design

The LBFS server divides the files it stores into chunks and indexes the chunks by hash value.

The client similarly indexes a large persistent cache

Whenever data must be transferred, each side identifies the chunks the recipient already has, so that only the missing chunks are sent
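The chunk-and-index idea can be sketched as below. The boundary rule is a deliberately simple stand-in: a rolling sum over a 16-byte window replaces the fingerprinting a real system would use, and `WINDOW`, `MASK`, and the function names are assumptions chosen for illustration.

```python
import hashlib

WINDOW = 16  # bytes in the sliding window (assumed value)
MASK = 0x3F  # boundary when the low 6 bits of the window sum are zero


def chunks(data: bytes):
    """Split data at content-defined boundaries using a rolling window sum."""
    out, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # slide the window forward
        # A boundary depends only on nearby content, not on file offset.
        if i + 1 - start >= WINDOW and (rolling & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])  # trailing partial chunk
    return out


def index(data: bytes):
    """Index a file's chunks by hash value."""
    return {hashlib.sha256(c).hexdigest(): c for c in chunks(data)}


def bytes_to_send(new_file: bytes, receiver_index) -> int:
    """Count bytes in chunks the receiver does not already hold."""
    return sum(len(c) for c in chunks(new_file)
               if hashlib.sha256(c).hexdigest() not in receiver_index)
```

Because boundaries follow the content, inserting a byte near the start of a file changes only the chunk around the insertion; the later chunks keep their hashes and do not need to be resent.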

Reading a file in LBFS

Observations

Drawbacks

The same file looks different when encrypted differently, so LBFS does not help in that case

Synchronization problems with different chunk sizes

Helps only when files share at least some common content

Operation Shipping

Operation Shipping for Mobile File Systems

How can an updated large file be propagated from a weakly connected client to its server?

Operation shipping, or operation-based update propagation, can be used to solve this problem.

Value shipping

Operation shipping

The user operation is sent to a surrogate client that is strongly connected to the server

The surrogate replays the user operation, regenerates the files, checks whether they are identical to original files, and, if so, sends the files to the servers on behalf of the client.

Forward error correction is used to restore minor re-execution discrepancies.

Operation shipping

Observations:

Network traffic reductions of 12 to 400 times

Speedups ranging from 1.4 to nearly 50 times

Correctness of the re-executed file is ensured

May not be feasible when the surrogate doesn't support the user operation

Some side effects can make the re-executed file differ from the original file. In such cases we have to fall back to value shipping.
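The decision between operation shipping and value shipping can be sketched as follows. This is a simplified model under assumptions: `ship_update` and `surrogate_ops` are hypothetical names, the surrogate is a dict of callables, SHA-256 stands in for the fingerprint, and forward error correction is left out.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def ship_update(op_name, args, client_result: bytes, surrogate_ops):
    """Return (method, bytes_over_weak_link, file_for_server).

    surrogate_ops maps operation names to callables the surrogate can run.
    """
    # Over the weak link the client sends only the operation and a
    # fingerprint of the file it produced locally.
    request = repr((op_name, args)).encode()
    expected = fingerprint(client_result)

    op = surrogate_ops.get(op_name)
    if op is not None:
        regenerated = op(*args)  # re-execution on the surrogate
        if fingerprint(regenerated) == expected:
            return "operation", len(request) + len(expected), regenerated

    # Unsupported operation or differing output: ship the file itself.
    return "value", len(client_result), client_result
```

The fingerprint check is what keeps re-execution safe: the surrogate's copy reaches the server only if it is identical to what the client produced.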

Data Staging on Untrusted Surrogates

Data staging on Untrusted Surrogates

How can untrusted computers be used to facilitate secure mobile data access?

Data staging can improve the performance of distributed file systems

Data staging opportunistically prefetches files and caches them on nearby surrogates.

Surrogates are untrusted and unmanaged: we use end-to-end encryption and secure hashes to provide privacy and authenticity of data.

Results show reduction in average latency by 54%
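The authenticity half of this design can be sketched with a secure hash check. This is an illustration only: the function names are hypothetical, SHA-256 is an assumed hash, the surrogate and file server are plain dicts, and the encryption that provides privacy is omitted.

```python
import hashlib
import hmac


def secure_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_via_surrogate(path, trusted_hashes, surrogate, file_server):
    """Return (data, source), accepting surrogate data only if it verifies.

    trusted_hashes: path -> hash obtained from the trusted file server
    surrogate:      path -> staged bytes held by the untrusted surrogate
    file_server:    path -> bytes on the distant, trusted server
    """
    staged = surrogate.get(path)
    if staged is not None and hmac.compare_digest(
            secure_hash(staged), trusted_hashes[path]):
        return staged, "surrogate"  # fast path: nearby copy is authentic
    # Missing or tampered data: pay the latency of the real server.
    return file_server[path], "server"
```

A malicious surrogate can still refuse service or slow things down, but it cannot make the client accept corrupted data.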

System model

observations

Pros/cons

PROS

Reduces the latency between server and client

Increases pervasiveness by supporting small devices with small memory and limited power

CONS

Surrogates are manually located at present

Malicious surrogates pose risks such as eavesdropping, denial of service, and corruption of data

Portable SoulPads

Architecture

ISR (Internet Suspend/Resume)

User’s computation state is stored as a check-pointed virtual machine image.

Remote Desktop

SoulPad

Knoppix for Auto-configuring host OS

VMware workstation for the VMM

Windows or Linux for guest OS

Observations

SoulPad provides AES-128 block encryption

When the USB drive is removed, all memory related to SoulPad operations is erased.

Backups are created on network file systems whenever the host has an Internet connection.

Resume & Suspend Latencies

Application Response times

Instruction set Architecture diversity

Practical Implementation

Mojopac

Install Mojopac on a USB pen drive

Install software on Mojopac

Use that software on whichever system you want

Copyright violations need to be addressed

Integrating Portable and Distributed Storage

Architecture

Each has its own pros and cons

Performance and availability increase when portable and distributed storage are integrated

Lookaside caching
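Lookaside caching can be pictured as checking a portable device by content hash before fetching over the network. The sketch below is an illustration under assumptions: the helper names, the dict-based server and device, and SHA-256 are all chosen here, not taken from the system itself.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_index(portable_files):
    """Index the portable device by content hash, ignoring file names."""
    return {sha(data): data for data in portable_files.values()}


def fetch(path, server_hashes, server_files, portable_index, stats):
    """Fetch a file, looking aside to the portable device first."""
    wanted = server_hashes[path]  # small metadata request to the server
    data = portable_index.get(wanted)
    if data is not None and sha(data) == wanted:
        stats["lookaside_hits"] += 1  # contents come from the device
        return data
    stats["server_fetches"] += 1      # full transfer over the network
    return server_files[path]
```

The server stays authoritative: the device is only used when its contents match the hash the server reports, so stale or unrelated files on the device are simply ignored.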

GFS Google file system

GFS

A scalable distributed file system for large, distributed, data-intensive applications.

Fault tolerant while running on inexpensive hardware.

Google’s storage platform for generation and processing of data.

Hundreds of terabytes of storage across thousands of disks on thousands of machines, accessed by hundreds of clients

GFS Architecture

Working of GFS

Single master, multiple chunkservers, multiple users

Fixed-size chunks (giant blocks) of 64 MB

64-bit ids for each chunk

Clients read/write chunks directly from chunkservers

Chunks are the unit of replication

Master maintains all metadata:

namespace and access control

map from filenames to chunk ids

current locations for each chunk

Metadata is cached at clients
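The metadata path described above can be sketched as follows. It is a toy model, not the GFS API: `Master`, `Client`, `lookup`, and `locate` are hypothetical names, and only the two maps and the client-side metadata cache are modelled.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks


class Master:
    """Holds metadata only; file data never passes through it."""

    def __init__(self):
        self.file_chunks = {}      # filename -> list of 64-bit chunk ids
        self.chunk_locations = {}  # chunk id -> list of chunkserver names

    def lookup(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        chunk_id = self.file_chunks[filename][chunk_index]
        return chunk_id, self.chunk_locations[chunk_id]


class Client:
    def __init__(self, master):
        self.master = master
        self.metadata_cache = {}  # (filename, chunk index) -> lookup result
        self.master_calls = 0

    def locate(self, filename, offset):
        key = (filename, offset // CHUNK_SIZE)
        if key not in self.metadata_cache:
            self.master_calls += 1
            self.metadata_cache[key] = self.master.lookup(filename, offset)
        # The client would now read or write at one of these chunkservers.
        return self.metadata_cache[key]
```

With 64 MB chunks, many reads of the same region resolve from the client's cache, which keeps the single master from becoming a bottleneck.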

Other Google technologies

Bigtable: A Distributed Storage System for Structured Data

Used for Google Earth and Google Finance.

Bigtable has successfully provided a flexible, high-performance solution for these Google products

References

1. Disconnected Operation in the Coda File System – James J. Kistler, CMU

2. Exploiting weak connectivity for Mobile File Access - Lily B. Mummert, CMU

3. A Low-Bandwidth Network File System – Athicha Muthitacharoen, MIT

4. Data staging on untrusted surrogates – Jason Flinn, Intel Research

5. Operation Shipping for Mobile File Systems – Yui-Wah Lee, IEEE

6. Improving Mobile Database Access over WANs – Niraj Tolia, CMU

7. Reincarnating PCs with Portable SoulPads – Ramon Caceres, IBM Research

8. Pervasive Personal Computing in an Internet Suspend/Resume System – M. Satyanarayanan, CMU

9. Integrating portable and distributed storage – Niraj Tolia, CMU

10. The Google File System – Sanjay Ghemawat, Google

11. Coda File System – M. Satyanarayanan, CMU
