Ubiquitous Data Access
Doppalapudi Raghu Chaitanya
Jaliparthi Gangadhar
Outline
Ubiquitous Data
History - NFS, AFS
CODA File system
Cedar
LBFS
Operation shipping
MFS
Data Staging on untrusted surrogates
Portable soul pads
Portable & distributed storage
GFS
Conclusion
Ubiquitous Data
“In ten years, billions of people will be using the Web, but a trillion "gizmos" will also be connected to the Web.” Asilomar Rep. on DB Research, Dec. 1998
“Fundamentally, the ability to access all information from anywhere and have ONE unified and synchronized information repository is critical to making appliances useful.”
Ubiquitous data access will put existing data management techniques to the test, in all aspects – searching, location, reliability, consistency, …
Ubiquitous Data Access: State of the Art
Everyone uses a database system and/or search engine every day, although they may not realize it (the true test of “ubiquity”).
The Internet and WWW have become a ubiquitous means of global data dissemination and exchange.
Databases play a crucial but largely invisible role here.
XML and related standards are enabling increasingly sophisticated interoperation.
Wireless access provides anytime-anywhere access and enables location-centric applications.
Characteristics of Ubiquitous Data systems
functionality
scalability
serializability
optimality
interoperability
personalization
globalization
synchronization
flow regulation
integration
History
NFS (1985), Sun Microsystems
NFS allows one computer attached to a network to access the file systems present on the hard disk of another computer on the network.
AFS (Andrew File System)
AFS was developed at CMU
AFS has many benefits in the areas of security and scalability
AFS uses Kerberos for authentication
Read and write operations on an open file are directed only to the locally cached copy
When a modified file is closed, the changed portions are copied back to the file server
Cache consistency is maintained by a mechanism called callback
AFS influenced many of today’s distributed file systems, such as CODA
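The open/close and callback behaviour described for AFS can be illustrated with a small in-memory sketch. The class and method names (FileServer, Client, break_callback) are hypothetical and not taken from AFS itself, and for simplicity the sketch copies back the whole file on close rather than only the changed portions.

```python
class FileServer:
    """Toy server that remembers which clients cache each file."""

    def __init__(self):
        self.files = {}       # path -> bytes
        self.callbacks = {}   # path -> set of clients holding a callback

    def fetch(self, client, path):
        # Handing out a copy registers a callback promise for this client.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files.get(path, b"")

    def store(self, client, path, data):
        self.files[path] = data
        # Break the callbacks of every other client caching this file.
        for other in self.callbacks.get(path, set()) - {client}:
            other.break_callback(path)
        self.callbacks[path] = {client}


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}       # path -> locally cached bytes

    def open(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.fetch(self, path)
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data          # touches only the local copy

    def close(self, path):
        self.server.store(self, path, self.cache[path])

    def break_callback(self, path):
        self.cache.pop(path, None)       # next open() refetches the file
```

A second client keeps reading its cached copy until the writer closes the file; the close breaks the callback and the next open fetches the new contents.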
CODA
CODA File System
CODA is a network file system that achieves high availability using two techniques: server replication and disconnected operation
Disconnected operation is the mode of operation that enables a client to continue accessing critical data during temporary failures of network connectivity
Server replication involves maintaining read-write replicas at more than one server. The set of replication sites for a volume is its volume storage group (VSG)
The main idea is to cache data in order to improve availability
Design
On each client, a user-level process called Venus manages a file cache on the local disk. It is Venus that bears the brunt of disconnected operation
Venus States
Venus operates in three states
Hoarding
Emulation
Reintegration
Hoarding
Applies when there is good connectivity between client and server
In this state Venus hoards useful data in anticipation of disconnection
It must predict which files will be needed later and prefetch them for disconnected operation
Hoard walking: Venus keeps the client cache in equilibrium, caching high-priority files for high availability, and periodically restores equilibrium by performing a hoard walk.
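A hoard walk can be sketched as a greedy pass over files ordered by priority. This is a simplified illustration, not Venus's actual algorithm: the function name hoard_walk, the integer priorities, and the single capacity figure are all assumptions made for the example.

```python
def hoard_walk(cached, priorities, sizes, capacity):
    """Decide what to fetch and what to evict so the cache holds the
    highest-priority files that fit within `capacity`.

    cached     -- set of paths currently in the cache
    priorities -- path -> priority (higher means more important)
    sizes      -- path -> size in the same unit as capacity
    """
    wanted, used = set(), 0
    # Visit files from highest to lowest priority (ties broken by name).
    for path in sorted(priorities, key=lambda p: (-priorities[p], p)):
        if used + sizes[path] <= capacity:
            wanted.add(path)
            used += sizes[path]
    to_fetch = wanted - cached
    to_evict = cached - wanted
    return to_fetch, to_evict


to_fetch, to_evict = hoard_walk(
    cached={"notes.txt"},
    priorities={"thesis.tex": 10, "refs.bib": 5, "notes.txt": 1},
    sizes={"thesis.tex": 4, "refs.bib": 4, "notes.txt": 2},
    capacity=8,
)
```

Here the two high-priority files fill the cache, so the low-priority file is evicted to restore equilibrium.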
Emulation
Entered when the client is disconnected from, or only very weakly connected to, the server
Venus acts as a pseudo-server and assumes full responsibility for access
When a client asks for a file, Venus provides the file if it is stored in the cache
If the requested file is not present in the cache, Venus reports an error rather than handling it as an ordinary cache miss
Logging: during emulation Venus records sufficient information to replay update activity when it reintegrates.
Reintegration
Entered when network connectivity is resumed between client and server
Reintegration is a transitory state through which Venus passes in changing roles from pseudo-server to cache manager
Venus propagates the changes made during emulation and updates its cache to reflect the current server state
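The three Venus states and the replay log can be sketched as a small state machine. This is a minimal illustration under stated assumptions: the server is modelled as a plain dictionary, the class and method names are hypothetical, and conflict detection during reintegration is left out entirely.

```python
from enum import Enum


class State(Enum):
    HOARDING = "hoarding"
    EMULATION = "emulation"
    REINTEGRATION = "reintegration"


class Venus:
    def __init__(self, server):
        self.server = server           # dict standing in for the file server
        self.cache = dict(server)      # assume everything was hoarded
        self.state = State.HOARDING
        self.log = []                  # updates recorded while disconnected

    def disconnect(self):
        self.state = State.EMULATION

    def read(self, path):
        if self.state is State.HOARDING and path in self.server:
            self.cache[path] = self.server[path]
        if path not in self.cache:
            # While emulating, a miss surfaces as an error to the caller.
            raise FileNotFoundError(path)
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data
        if self.state is State.EMULATION:
            self.log.append((path, data))   # remember for later replay
        else:
            self.server[path] = data

    def reconnect(self):
        self.state = State.REINTEGRATION
        for path, data in self.log:         # replay updates in order
            self.server[path] = data
        self.log.clear()
        for path in self.cache:             # refresh cached copies
            self.cache[path] = self.server.get(path, self.cache[path])
        self.state = State.HOARDING
```

Writes made while disconnected stay invisible to the server until reconnect() replays the log.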
Conflict handling
Drawbacks
Updates are not visible to other clients
Cache misses may impede progress
Exhaustion of cache space is a concern
Update conflicts become more likely
Updates are at risk due to theft, loss or damage
Google gears
Cedar
Cedar
Mobile database access over low-bandwidth networks
Relational databases are at the core of business processes
Cedar is useful for mobile commerce, traveling salespeople, and disaster recovery
A stale client replica can be used to reduce data transmission volume
Basics of database
Cedar Architecture
Content Addressable Storage (CAS)
Stores information so that it can be retrieved based on its content
The system records a content address, an identifier uniquely and permanently linked to the information content itself
A request to retrieve information from a CAS system must provide the content identifier, from which the system can determine the physical location of the data and retrieve it
Any change to a data element necessarily changes its content address
A CAS device does not permit editing information once it has been stored
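The CAS properties listed above can be shown with a tiny in-memory store. This is only a sketch: the ContentStore class is hypothetical, and SHA-256 is an assumed choice of hash for deriving the content address.

```python
import hashlib


class ContentStore:
    """Write-once store whose keys are derived from the stored bytes."""

    def __init__(self):
        self._blobs = {}   # content address -> bytes

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Storing identical content twice is a no-op; nothing is overwritten.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]


store = ContentStore()
addr = store.put(b"SELECT * FROM orders")
changed = store.put(b"SELECT * FROM orders;")
```

There is deliberately no update method: changing the data yields a different address, so the original entry stays untouched.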
Cedar Protocol
Transparency of cedar
Application Transparency
Database Transparency
Adaptive Interposition
Commonality detection
Exploring structure in data
Generating compact CAS descriptions
Creating and refreshing client replicas
Hoard Granularity
Database hoard profiles
Tools for handling
Refreshing stale client replicas
Results of Cedar
Drawbacks of cedar
LBFS-Low bandwidth Network File System
LBFS-Low Bandwidth Network File System
A network file system designed to use the network efficiently over low-bandwidth links
LBFS exploits the similarities between files or versions of the same file to save bandwidth
Avoids sending data over the network when the same data can already be found in the server file system or the client cache
Combined with compression and caching to improve performance
Design
The LBFS server divides the files it stores into chunks and indexes the chunks by hash value.
The client similarly indexes a large persistent cache
When a data transfer is requested, each side identifies the chunks the other already holds
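The chunk-and-index idea can be sketched as follows. Note the substitutions: the LBFS paper places chunk boundaries using Rabin fingerprints, whereas this sketch recomputes zlib.crc32 over a small window as a stand-in, and the window size, mask, and function names are illustrative assumptions.

```python
import hashlib
import zlib

WINDOW = 16    # bytes examined when testing for a boundary (assumed value)
MASK = 0x3F    # boundary when the low 6 bits of the window hash are zero


def chunks(data: bytes):
    """Split data at content-defined boundaries."""
    out, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # The decision depends only on the WINDOW bytes before `end`,
        # not on the absolute position in the file.
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            out.append(data[start:end])
            start = end
    if start < len(data):
        out.append(data[start:])
    return out


def index(data: bytes):
    """Map each chunk's SHA-1 hash to the chunk itself."""
    return {hashlib.sha1(c).hexdigest(): c for c in chunks(data)}


def missing_chunks(new_data: bytes, known_hashes):
    """Chunks of new_data the other side does not already hold."""
    return [c for c in chunks(new_data)
            if hashlib.sha1(c).hexdigest() not in known_hashes]
```

Because boundaries depend only on nearby content, inserting bytes near the start of a file shifts later chunks without changing them, so only the chunks around the edit need to be sent.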
Reading a file in LBFS
Observations
Drawbacks
The same file appears different when encrypted differently, so LBFS is not useful for such data
Synchronization problems arise with different chunk sizes
Useful only when there is at least some commonality between files
Operation Shipping
Operation Shipping for Mobile File Systems
How to propagate an updated large file from a weakly connected client to its server?
Operation shipping, or operation-based update propagation, can be used to solve this problem.
Value shipping
Operation shipping
The user operation is sent to a surrogate client that is strongly connected to the server
The surrogate replays the user operation, regenerates the files, checks whether they are identical to the original files, and, if so, sends the files to the server on behalf of the client.
Forward error correction is used to repair minor re-execution discrepancies.
Operation shipping
Observations:
Network traffic reductions from 12 to 400 times
Speedups in the range from 1.4 to nearly 50 times
Correctness of the re-executed file is ensured
May not be feasible when the surrogate does not support the user operation
Some side effects can make the re-executed file differ from the original file. In such cases we have to fall back to value shipping.
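The replay, check, and fallback steps can be sketched in a few lines. This is an illustration under assumptions: operations are modelled as named Python functions in a hypothetical OPERATIONS table, the identity check uses a SHA-256 fingerprint, and the forward error correction step is omitted.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical operations that both client and surrogate can execute.
OPERATIONS = {
    "upper": lambda src: src.upper(),
    "reverse": lambda src: src[::-1],
}


def client_update(op_name, source):
    """Run the operation locally; ship only its name and a fingerprint."""
    result = OPERATIONS[op_name](source)
    return result, {"op": op_name, "fingerprint": fingerprint(result)}


def surrogate_propagate(message, source, fetch_value):
    """Replay on the surrogate; fall back to value shipping on mismatch.

    fetch_value is a callable that transfers the full file from the client.
    """
    op = OPERATIONS.get(message["op"])
    if op is not None:
        regenerated = op(source)
        if fingerprint(regenerated) == message["fingerprint"]:
            return regenerated, "operation"
    return fetch_value(), "value"


result, message = client_update("upper", b"report draft")
shipped, mode = surrogate_propagate(message, b"report draft", lambda: result)
```

Only the small message crosses the weak link in the common case; the full file is transferred only when the surrogate cannot reproduce it.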
Data Staging on Untrusted Surrogates
Data staging on Untrusted Surrogates
How can untrusted computers be used to facilitate secure mobile data access?
Data staging can improve the performance of distributed file systems
Data staging opportunistically prefetches files and caches them on a nearby surrogate.
Surrogates are untrusted and unmanaged: end-to-end encryption and secure hashes are used to provide privacy and authenticity of data.
Results show reduction in average latency by 54%
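The authenticity check against an untrusted surrogate can be sketched as below. The sketch covers only the secure-hash half; encryption is omitted to keep the example dependency-free. The function names and the dictionary standing in for the surrogate are assumptions.

```python
import hashlib
import hmac


def trusted_digest(data: bytes) -> str:
    """Digest the client obtains from the trusted file server."""
    return hashlib.sha256(data).hexdigest()


def read_via_surrogate(surrogate, path, digest, fetch_from_server):
    """Use staged data only if it matches the digest from the server."""
    staged = surrogate.get(path)
    if staged is not None:
        actual = hashlib.sha256(staged).hexdigest()
        if hmac.compare_digest(actual, digest):
            return staged, "surrogate"
    # Missing or tampered data: go to the distant but trusted server.
    return fetch_from_server(path), "server"


original = b"quarterly figures"
digest = trusted_digest(original)
honest = {"/docs/q1": original}
malicious = {"/docs/q1": b"quarterly figurez"}
```

A malicious surrogate can still deny service by returning bad data, but it cannot make the client accept corrupted contents.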
System model
observations
Pros/cons
PROS
Reduces the latency between server and client
Increases pervasiveness by supporting small devices with small memory and limited power
CONS
Surrogates are manually located at present
Malicious surrogates pose risks such as eavesdropping, denial of service, and corruption of data
Portable Soul pads
Architecture
ISR (Internet Suspend/Resume)
User’s computation state is stored as a check-pointed virtual machine image.
Remote Desktop
Soul pad
Knoppix for Auto-configuring host OS
VMware workstation for the VMM
Windows or Linux for guest OS
Observations
SoulPad provides AES-128 block encryption
When the USB drive is removed, all memory related to SoulPad operations is erased
Backups are created on network file systems whenever the host has an Internet connection
Resume and suspend latencies
Application response times
Instruction set architecture diversity
Practical Implementation
Mojopac
Install Mojopac on a USB pen drive
Install software on Mojopac
Use that software on whichever system you want
Copyright violation issues need to be addressed
Integrating Portable and Distributed Storage
Architecture
Portable storage and distributed storage each have their own pros and cons
Performance and availability increase when portable and distributed storage are integrated
Lookaside caching
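Lookaside caching can be sketched as a hash lookup on the portable device before a remote fetch. This is a simplified illustration: the function name, the dictionary-based portable index, and the use of SHA-256 are assumptions rather than details from the slides.

```python
import hashlib


def fetch_with_lookaside(path, server_hashes, portable_index, fetch_remote):
    """Return file contents, preferring a verified copy on portable storage.

    server_hashes  -- path -> content hash, obtained from the file server
    portable_index -- content hash -> bytes found on the portable device
    fetch_remote   -- callable that reads the file over the network
    """
    wanted = server_hashes[path]
    local = portable_index.get(wanted)
    # Re-hash the local copy so a stale or damaged file is never used.
    if local is not None and hashlib.sha256(local).hexdigest() == wanted:
        return local, "portable"
    return fetch_remote(path), "server"


contents = b"slides v3"
server_hashes = {"/talk/slides": hashlib.sha256(contents).hexdigest()}
usb_index = {hashlib.sha256(contents).hexdigest(): contents}
```

The server remains the authority on what the current version is; the portable device only supplies the bytes when its copy matches.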
GFS Google file system
GFS
A scalable distributed file system for large, distributed, data-intensive applications.
Fault tolerant while running on inexpensive hardware.
Google’s storage platform for generation and processing of data.
Hundreds of terabytes of storage across thousands of disks on thousands of machines, accessed by hundreds of clients
GFS Architecture
Working of GFS
Single master, multiple chunkservers, multiple users
Files are stored as fixed-size chunks (giant blocks) of 64 MB
64-bit IDs for each chunk
Clients read/write chunks directly from chunkservers
Chunks are the unit of replication
Master maintains all metadata:
  namespace and access control
  map from filenames to chunk IDs
  current locations for each chunk
Metadata is cached at clients
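The read path implied by this design can be sketched as an offset-to-chunk translation followed by a cached metadata lookup. The 64 MB chunk size comes from the slide; the Master class, the locate function, and the chunkserver names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB fixed-size chunks


class Master:
    """Holds metadata only; file data never passes through it."""

    def __init__(self):
        self.file_chunks = {}       # filename -> list of chunk ids
        self.chunk_locations = {}   # chunk id -> list of chunkserver names

    def lookup(self, filename, chunk_index):
        chunk_id = self.file_chunks[filename][chunk_index]
        return chunk_id, self.chunk_locations[chunk_id]


def locate(master, cache, filename, offset):
    """Translate a byte offset into (chunk id, replicas, offset in chunk)."""
    chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
    key = (filename, chunk_index)
    if key not in cache:            # ask the master only on a cache miss
        cache[key] = master.lookup(filename, chunk_index)
    chunk_id, servers = cache[key]
    return chunk_id, servers, chunk_offset


master = Master()
master.file_chunks["/logs/web"] = [0x1A, 0x2B]
master.chunk_locations = {0x1A: ["cs1", "cs2"], 0x2B: ["cs2", "cs3"]}
client_cache = {}
found = locate(master, client_cache, "/logs/web", CHUNK_SIZE + 5)
```

After this call the client would contact one of the returned chunkservers directly, and repeated reads of the same chunk skip the master because the metadata is cached.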
Other Google technologies
Bigtable: A Distributed Storage System for Structured Data
Used for Google Earth and Google Finance.
Bigtable has successfully provided a flexible, high-performance solution for all of these Google products
References
1. Disconnected Operation in the Coda File System – James J. Kistler, CMU
2. Exploiting Weak Connectivity for Mobile File Access – Lily B. Mummert, CMU
3. A Low-Bandwidth Network File System – Athicha Muthitacharoen, MIT
4. Data Staging on Untrusted Surrogates – Jason Flinn, Intel Research
5. Operation Shipping for Mobile File Systems – Yui-Wah Lee, IEEE
6. Improving Mobile Database Access over WANs – Niraj Tolia, CMU
7. Reincarnating PCs with Portable SoulPads – Ramon Caceres, IBM Research
8. Pervasive Personal Computing in an Internet Suspend/Resume System – M. Satyanarayanan, CMU
9. Integrating Portable and Distributed Storage – Niraj Tolia, CMU
10. The Google File System – Sanjay Ghemawat, Google
11. Coda File System – M. Satyanarayanan, CMU