
ALLOCATION METHODS

Introduction

Disk space in secondary storage is used for the physical storage of programs that are logically executed in main memory; the operating system allocates space for a program in physical memory. The allocation of disk space involves several concerns, such as searching for a range of space, indexing data on disk, and retrieving and ordering content, which together enable efficient utilization of disk space. Much of the earlier research was carried out in terms of ordered pairs of sets, as in buddy systems. Earlier work addressed contiguous allocation, linked allocation, and indexed allocation, where the major contribution lay in defining data structures and methods to enhance the effective utilization of disk space. Prior to 1970, disc allocation was administered by means of a map held in core, with one bit per disc page, managed by a method devised by J. L. Smith. B. J. Austin retained all of Smith's procedure, except that the technique for searching the map was improved. Allocation was dynamic in that space was given to a file only as needed, not reserved in advance. A list of holes in the disc map was also maintained; this list was added to whenever a search for allocation found a string of adjacent free pages whose length was less than that required for the request in hand.
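
How such a core map might be scanned is easy to sketch. The following is a minimal illustration, not Smith's or Austin's actual procedure: one bit per disc page, a hole list consulted first, and a map scan that records any too-short runs it passes over, as the text describes. All names are hypothetical.

```python
class DiscMap:
    """One bit per disc page: True = free, False = in use (illustration only)."""

    def __init__(self, num_pages):
        self.free = [True] * num_pages
        self.holes = []  # remembered runs of free pages: (start, length)

    def allocate(self, length):
        """Return the start of a run of `length` free pages, or None."""
        # Try the hole list before scanning the map.
        for i, (start, hole_len) in enumerate(self.holes):
            if hole_len >= length:
                del self.holes[i]
                if hole_len > length:  # keep the unused tail as a smaller hole
                    self.holes.append((start + length, hole_len - length))
                self._mark(start, length, free=False)
                return start
        # Fall back to scanning the map, noting too-short runs as new holes.
        run_start, run_len = 0, 0
        for page, is_free in enumerate(self.free):
            if is_free:
                run_start = page if run_len == 0 else run_start
                run_len += 1
                if run_len == length:
                    self._mark(run_start, length, free=False)
                    return run_start
            else:
                if 0 < run_len < length:
                    self.holes.append((run_start, run_len))
                run_len = 0
        return None

    def _mark(self, start, length, free):
        for page in range(start, start + length):
            self.free[page] = free
```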

File loading experiments showed that the proposed allocation algorithm produced about 80 page tables, compared with 540 page tables under the previous allocation algorithm. A programmer may produce a file whose logical addressing space is non-contiguous; such a file always has a page table. The time for the file loading process was considerably reduced for two reasons: the number of disc operations is reduced, and the time spent searching the map is reduced.

The dumping and reloading of file storage takes a little less than two hours of machine time: the load takes about three quarters of an hour, and the dump, which involves verification of the tapes produced, takes about an hour. With the previous algorithm, the number of page tables immediately after a file load was 500 to 1,000, and during a day an additional 100 to 200 would be formed.

The system no longer spends a large amount of time searching the allocation map when the disc is practically full, although no quantitative evaluation of this effect is available. The algorithm appears satisfactory for both file loading and normal operation.

D. E. Gold et al. proposed a basic permutation algorithm and its variations; their work deals with masking rotational latency. Algorithms for the permutation of blocks in a memory hierarchy were proposed, including a slow-to-fast transition algorithm, a fast-to-slow transition algorithm, special cases of permutation (the addition and deletion of blocks), an arbitrary permutation algorithm, a standard permutation algorithm, and permutation assignment algorithms. The method is implemented for a head-per-track disk system by performing row permutations in intermediate (bulk storage) random access memory. The first row permutation is accomplished by accumulating blocks in an output buffer as they come from the primary memory; when a row has accumulated, these blocks begin to be output to the disk in their permuted order. The final row permutations are performed similarly, by accumulating a row of blocks in the input buffer before transmission to primary memory in permuted order.
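
The row-permutation step can be pictured with a small sketch (my own illustration, not Gold's algorithm verbatim): blocks arriving from one memory level are accumulated into a one-row buffer, and only when the row is complete are they emitted in permuted order.

```python
def permute_rows(blocks, row_len, perm):
    """Write each row of `blocks` out in the order given by `perm`.

    blocks  -- flat list of blocks streaming from primary memory
    row_len -- number of blocks accumulated before output begins
    perm    -- permutation of range(row_len) applied within every row
    """
    out = []
    buffer = []
    for block in blocks:
        buffer.append(block)            # accumulate the incoming row
        if len(buffer) == row_len:      # row complete: emit in permuted order
            out.extend(buffer[p] for p in perm)
            buffer.clear()
    return out

# Example: rotate each row of 4 blocks one position to mask rotational latency.
print(permute_rows(list(range(8)), 4, [1, 2, 3, 0]))  # [1, 2, 3, 0, 5, 6, 7, 4]
```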

Howard Lee Morgan proposed an optimization model for the assignment of files to disk packs, and of packs to either resident or nonresident status; heuristics are suggested for those cases in which it is inefficient to compute the actual optimum. When the amount of space required for file storage exceeds the amount which can be kept online, decisions must be made as to which files are to be permanently resident and which mountable. These decisions affect the number of mount requests issued to the operators; mounting is often a bottleneck in a computing facility, and reducing the number of mounts thus decreases turnaround time.
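
Morgan's resident/nonresident decision has the flavor of a 0/1 knapsack problem: each pack has a size and a benefit (the mounts avoided if it stays online), and online capacity is the knapsack. The sketch below is an illustration of that reading, with invented weights and values; Morgan's actual model and heuristics are richer.

```python
def choose_resident_packs(sizes, mounts_avoided, capacity):
    """0/1 knapsack: pick packs to keep online so as to avoid the most mounts."""
    n = len(sizes)
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for cap in range(capacity + 1):
            best[i][cap] = best[i - 1][cap]  # pack i-1 stays mountable
            if sizes[i - 1] <= cap:          # or it becomes resident
                candidate = best[i - 1][cap - sizes[i - 1]] + mounts_avoided[i - 1]
                best[i][cap] = max(best[i][cap], candidate)
    # Recover the chosen set by walking the table backwards.
    chosen, cap = [], capacity
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= sizes[i - 1]
    return best[n][capacity], sorted(chosen)

print(choose_resident_packs([4, 3, 2], [10, 7, 6], 5))  # (13, [1, 2])
```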

In summary, it is clear that the problem of optimally using disk storage devices is a complex one, with effects contributed by queuing and scheduling as well as space allocation. Morgan's paper attempts to describe where space allocation fits in, and to prescribe some methods for handling this important problem; test cases of multiple-knapsack algorithms were executed.

Kenneth K. Shen et al. presented an extension of the buddy method, called the weighted buddy method, for dynamic storage allocation. The weighted buddy method allows block sizes of 2^k and 3*2^k, whereas the original buddy method allowed only block sizes of 2^k. This extension is achieved at an additional cost of only two bits per block. Simulation results comparing this method with the buddy method indicate that, for a uniform request distribution, the buddy system has less total memory fragmentation than the weighted buddy algorithm; however, the total fragmentation is smaller for the weighted buddy method when the requests are for exponentially distributed block sizes.

James A. Hinds proposed a simple scheme for determining the location of a block of storage relative to other blocks; the scheme is applicable to buddy-type storage allocation systems.

Warren Burton worked on a generalization of the buddy system for storage allocation. The set of permitted block sizes {SIZE_i}, i = 0 to n, must satisfy the condition SIZE_i = SIZE_{i-1} + SIZE_{i-k(i)}, where k may be any meaningful integral-valued function. This makes it possible to force logical storage blocks to coincide with physical storage blocks, such as tracks and cylinders.

James L. Peterson presented two algorithms for implementing any of a class of buddy systems for dynamic storage allocation. Each buddy system corresponds to a set of recurrence relations which relate the block sizes provided to each other. Analyses of the internal fragmentation of the binary buddy system, the Fibonacci buddy system, and the weighted buddy system are given, along with comparative simulation results for internal, external, and total fragmentation. The total fragmentation of the weighted buddy system is generally worse than that of the binary or Fibonacci buddy systems, and the total fragmentation of the binary buddy system varies widely because of its internal fragmentation characteristics. Still, the variation among these buddy systems is not great, and the lower execution time of the binary buddy system would therefore seem to recommend it for general use, although the execution time of the Fibonacci buddy system is not much greater. The weighted buddy system seems less desirable than either the binary or the Fibonacci system owing to its higher execution time and greater external fragmentation. In conclusion, the recommendation is that the memory management module of a system be constructed as either a binary or a Fibonacci buddy system.
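
The buddy variants surveyed above differ mainly in the recurrence that generates the permitted block sizes; as Peterson observes, each system corresponds to such a set of recurrence relations. The following sketch (hypothetical helper code) generates the three size series and measures the internal fragmentation of a single request:

```python
def binary_sizes(n):
    """Binary buddy: SIZE_i = 2^i."""
    return [2 ** i for i in range(n)]

def fibonacci_sizes(n):
    """Fibonacci buddy: SIZE_i = SIZE_{i-1} + SIZE_{i-2}."""
    sizes = [1, 2]
    while len(sizes) < n:
        sizes.append(sizes[-1] + sizes[-2])
    return sizes[:n]

def weighted_sizes(n):
    """Weighted buddy (Shen and Peterson): block sizes 2^k and 3 * 2^k."""
    raw = []
    for k in range(n):          # generate enough of both families
        raw.append(2 ** k)
        raw.append(3 * 2 ** k)
    return sorted(raw)[:n]

def internal_fragmentation(request, sizes):
    """Waste when a request is rounded up to the smallest permitted size."""
    block = min(s for s in sizes if s >= request)
    return block - request

print(binary_sizes(6))      # [1, 2, 4, 8, 16, 32]
print(fibonacci_sizes(6))   # [1, 2, 3, 5, 8, 13]
print(weighted_sizes(6))    # [1, 2, 3, 4, 6, 8]
print(internal_fragmentation(5, binary_sizes(6)))  # 3 (a block of 8 is used)
```
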
H. C. Du et al. presented Cartesian product files, which have been shown to exhibit attractive properties for partial match queries. Their paper considers the file allocation problem for Cartesian product files, which can be stated as follows: given a k-attribute Cartesian product file and an m-disk system, allocate buckets among the m disks in such a way that, for all possible partial match queries, the concurrency of disk accesses is maximized. The Disk Modulo (DM) allocation method is described first, and it is shown to be strictly optimal under many conditions commonly occurring in practice, including all possible partial match queries when the number of disks is 2 or 3. It is also shown that, although it has good performance, the DM allocation method is not strictly optimal for all possible partial match queries when the number of disks is greater than 3. The General Disk Modulo (GDM) allocation method is then described, and a sufficient but not necessary condition for strict optimality of the GDM method for all partial match queries and any number of disks is derived. Simulation studies comparing the DM and random allocation methods in terms of the average number of disk accesses, in response to various classes of partial match queries, show the former to be significantly more effective even when the number of disks is greater than 3, that is, even in cases where the DM method is not strictly optimal. The results, derived formally and confirmed by simulation, can be used for more effective design of optimal file systems for partial match queries. When considering multiple-disk systems with independent access paths, it is important to ensure that similar records are clustered into the same or similar buckets, while similar buckets are dispersed uniformly among the disks.

Free disk space management

Matthew S. Hecht et al. observed that schemes for managing free disk pages are so widely known that they must be considered part of the folklore of computer science. Two popular data structures are bitmaps and linked lists: the bitmap has bit position i set when disk page number i is free, and cleared when disk page number i is in use, while the linked list contains the page numbers of free disk pages. Less widely known, however, are variations of such schemes that preserve the consistency of these data across failures. While recoverable schemes for managing free disk pages are all based on the principle of maintaining a copy (complete, or incremental with a base) on memory media with an independent failure mode, the details of such schemes vary considerably. The general problem considered is how to make the free-disk-space data structures survive two kinds of failures: (1) failure of main memory (e.g., loss of power) resulting in loss of its contents, and (2) failure of a disk transfer resulting in an unreadable disk page. The paper presents a programming technique, using a linked list for managing the free disk pages of a file system and using shadowing (also known as careful replacement [10]) for failure recovery, which enjoys the following properties (a sketch of the shadowing idea appears after the list):

(1) The state of allocation at the previous checkpoint (a consistent system state preserved on disk) is always maintained on the disk.

(2) The data structure describing free space on disk is never copied during a checkpoint or recover (from a main-memory failure) operation.

(3) A window of only two pages of main memory is required for accessing and maintaining the data describing free space.

(4) System information need be written to disk only during a checkpoint, rather than every time it changes.
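
As a rough illustration of the shadowing idea behind these properties (this is a toy model, not the paper's technique in detail), one can keep two superblock slots that alternately record the checkpointed free-list state, switching the live slot only after the new checkpoint is safely on disk:

```python
class ShadowedFreeList:
    """Toy model: the free list survives main-memory loss via two superblock slots."""

    def __init__(self):
        self.slots = [None, None]   # on-disk superblock copies (checkpointed state)
        self.live = 0               # which slot holds the last valid checkpoint
        self.free_pages = []        # in-memory free list since the last checkpoint

    def checkpoint(self):
        """Write the state to the shadow slot, then atomically switch to it."""
        shadow = 1 - self.live
        self.slots[shadow] = list(self.free_pages)  # new state reaches disk first
        self.live = shadow                          # single atomic switch

    def recover(self):
        """After a main-memory failure, reload the last checkpointed state."""
        self.free_pages = list(self.slots[self.live] or [])
```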

Lorie [7] describes a scheme similar to theirs that uses bitmaps and shadowing. Gray [1, 2] describes the update-in-place-with-logging paradigm, which can be applied to the problem of managing free disk pages across failures. Sturgis, Mitchell, and Israel [9] (see also Mitchell and Dion [8]) describe an abstraction called stable storage whereby a page-allocation bitmap is recorded redundantly on disk; the second page is not written until the first has been written successfully.

Distributed File System

Bruce Walker et al. presented LOCUS, a distributed operating system which supports transparent access to data through a network-wide file system, permits automatic replication of storage, supports transparent distributed process execution, supplies a number of high-reliability functions such as nested transactions, and is upward compatible with Unix. Partitioned operation of subnets and their dynamic merge is also supported. The system had been operational for about two years at UCLA, and extensive experience in its use had been obtained; the complete system architecture is outlined in the paper, and that experience is summarized. The most obvious conclusion to be drawn from the LOCUS work is that a high-performance, network-transparent, distributed file system which contains all of the various functions indicated throughout the paper is feasible to design and implement, even in a small-machine environment. Replication of storage is valuable, both from the user's and the system's point of view. However, much of the work is in recovery and in dealing with the various races and failures that can exist; nothing is free. In order to avoid performance degradation when resources are local, the cost has been converted into additional code and substantial care in implementation architecture: LOCUS is approximately a third bigger than Unix and certainly more complex. The difficulties involved in dynamically reconfiguring an operating system are both intrinsic to the problem and dependent on the particular system. Rebuilding lock tables and synchronizing processes running in separate environments are problems of inherent difficulty. Most of the system-dependent problems can be avoided, however, with careful design. The fact that LOCUS uses specialized protocols for operating-system-to-operating-system communication made it possible to control message traffic quite selectively, and the ability to alter specific protocols to simplify the reconfiguration solution was particularly appreciated. The task of developing a protocol by which sites would agree about the membership of a partition proved to be surprisingly difficult: balancing the needs of protocol synchronization and failure detection while maintaining good performance presented a considerable challenge. Since reconfiguration software runs precisely when the network is flaky, those problems are real, not unlikely events. Nevertheless, it was possible to design and implement a solution that exhibits reasonably high performance. Further work is still needed to assure that scaling to a large network will maintain that performance characteristic, but experience with the present solution is cause for optimism. In summary, use of LOCUS indicates the enormous value of a highly transparent, distributed operating system. Since file activity is often the dominant part of the operating system load, it seems clear that the LOCUS architecture, constructed on a distributed file system base, is rather attractive.

1987 - Disk File Allocation Based on the Buddy System

Philip D. L. Koch noted that the buddy system is known for its speed and simplicity, but that high internal and external fragmentation have made it unattractive for use in operating system file layout. A variant of the binary buddy system that reduces fragmentation is described: files are allocated in up to t extents, and inoptimally allocated files are periodically reallocated. The Dartmouth Time-Sharing System (DTSS) uses this method, and several installations of DTSS, representing different classes of workload, are studied to measure the method's performance. Internal fragmentation varies from 2-6 percent, and external fragmentation varies from 0-10 percent for expected request sizes. Less than 0.1 percent of the CPU is spent executing the algorithm. In addition, most files are stored contiguously on disk: the mean number of extents per file is less than 1.5, and the upper bound is t. Compared to the file layout method used by UNIX, the buddy system results in more efficient access but less efficient utilization of disk space. As disks become larger and less expensive per byte, strategies that achieve efficient I/O throughput at the expense of some storage loss become increasingly attractive.

Eliezer Levy and Abraham Silberschatz presented the purpose of a distributed file system (DFS): to allow users of physically distributed computers to share data and storage resources by using a common file system. A typical configuration for a DFS is a collection of workstations and mainframes connected by a local area network (LAN), and a DFS is implemented as part of the operating system of each of the connected computers. Their paper establishes a viewpoint that emphasizes the dispersed structure and decentralization of both data and control in the design of such systems. It defines the concepts of transparency, fault tolerance, and scalability and discusses them in the context of DFSs. The paper claims that the principle of distributed operation is fundamental for a fault-tolerant and scalable DFS design. It also presents alternatives for the semantics of sharing and methods for providing access to remote files. A survey of contemporary UNIX-based systems, namely UNIX United, Locus, Sprite, Sun's Network File System, and ITC's Andrew, illustrates the concepts and demonstrates various implementations and design alternatives. Based on the assessment of these systems, the paper makes the point that a departure from the approach of extending centralized file systems over a communication network is necessary to accomplish sound distributed file system design.

ON THE OPTIMALITY OF DISK ALLOCATION FOR CARTESIAN PRODUCT FILES

Khaled A. S. Abdel-Ghaffar et al. addressed the problem of disk allocation: how to distribute a file on several disks in order to maximize concurrent disk accesses in response to a partial match query. The paper presents a coding-theoretic analysis of the disk allocation problem, providing both necessary and sufficient conditions for the existence of strictly optimal allocation methods. Based on a class of optimal codes, known as maximum distance separable (MDS) codes, strictly optimal allocation methods are constructed. Using the necessary conditions proved, the authors argue that the standard definition of strict optimality is too strong and cannot be attained in general; hence they reconsider the definition of optimality. Instead of basing it on an abstract definition that may not be attainable, they propose a new definition based on the best possible allocation method, and, using coding theory, develop allocation methods that are optimal according to the proposed criterion.

In their conclusion, the authors note that they have shown the equivalence of the problem of strictly optimal disk allocation and the class of MDS codes. One main open problem in this area is the development of tight necessary and sufficient conditions for the existence of optimal disk allocation [8]; the results in their Section 4 provide both necessary and sufficient conditions for the existence of such methods. These results formalize the intuitive ideas developed by Faloutsos and Metaxas [6], as well as extend and generalize several other previous results, especially those presented by Sung [18]. Unlike the necessary condition in [18] (Theorem 3.1), which considered binary Cartesian product files, their Theorem 4.1 applies to any p-ary Cartesian product file. The results also complement those in [5] and [8], where strictly optimal disk allocation methods are derived when p > m; this paper studies the case p < m (especially important since, often, p = 2), and for this case implies better allocation methods than those mentioned in the literature. For example, when p = 8, n = 6, and m = 64, the field exclusive method of [8] gives a response time of 2.4 on average for queries with two unspecified fields, whereas these techniques yield a strictly optimal allocation method that, of course, has response time 1 for such queries.

Using the necessary conditions proved, the authors argue that the standard definition of strict optimality is too strong and cannot be attained in general. They therefore define a new criterion for optimality whose objective is to design allocation methods that yield response time 1 for all queries with a minimum number of specified attributes. Using coding theory, they determine this minimum number for binary Cartesian product files that have up to 16 attributes, assuming that the number of disks is a power of 2. This approach contrasts with most previous results, where ad hoc methods were used [4, 5, 6, 8, 17]. Based on results from coding theory, optimal allocation methods for response time 1 are constructed, resulting in better performance than previous methods in this case. For example, compared to [6], when p = 2, n = 8, and m = 16, this approach ensures a response time of 1 if five attributes are specified, whereas in [6] six or more attributes must be specified to guarantee such a response time. Furthermore, it is shown that if all queries with s specified attributes have response time p^(n-s)/m, where m is a power of p, then all queries with s' < s specified attributes have response time p^(n-s')/m. This implies that, if an allocation method is optimal for all queries with a particular number of unspecified attributes, then it is optimal for all queries with more unspecified attributes. Finally, the paper addresses some of the implications of these results for the design of multiple-disk file systems. The number of attributes of a given file is usually predetermined by the application; hence the system designer has two parameters left to determine: p, the number of partitions into which the attribute domains are partitioned, and m, the number of disks on which the file is distributed. Section 4 shows that, for any given p, the range of choices for m is fairly limited when the goal is to design a strictly optimal allocation method. In contrast, the new optimality criterion, which ensures response time 1 with a minimum number of specified attributes, provides the system designer with much more flexibility in determining p and m. Although the results concerning the new optimality criterion are based on the assumption that m is a power of p, the approach gives tight bounds on the performance of the best possible allocation methods even when this assumption does not hold. This can be achieved by replacing m, which is not necessarily a power of p, by p^(ceil(log_p m)) and p^(floor(log_p m)), that is, by considering allocation methods where the number of disks is a power of p close to m. For example, if p = 2, n = 8, and m = 5, a lower bound on s1, the minimum number such that any query with s1 specified attributes has response time 1, may be obtained by replacing m = 5 by m = 8; from their Table III, s1 >= 7. On the other hand, the same table shows that there is an allocation method for m = 4 disks that gives response time 1 for any query with seven specified attributes. This allocation method can be applied to the case of five disks (by using only four of them) to guarantee that every query with s1 = 7 specified attributes has response time 1. From the lower bound, this value of s1 is optimal; that is, there is no allocation method that gives response time 1 for every query with six specified attributes when the number of disks equals 5.
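
The Disk Modulo method described earlier is simple to state in code. Below is a minimal sketch following the DM definition (bucket (i1, ..., ik) goes to disk (i1 + ... + ik) mod m), together with a brute-force response-time check of my own for partial match queries:

```python
from itertools import product

def disk_modulo(bucket, m):
    """DM allocation: bucket (i1, ..., ik) is stored on disk (i1 + ... + ik) mod m."""
    return sum(bucket) % m

def response_time(query, domain_sizes, m):
    """Max number of qualifying buckets assigned to any one disk.

    query: tuple with either a fixed value or None (unspecified) per attribute.
    """
    ranges = [range(d) if q is None else [q]
              for q, d in zip(query, domain_sizes)]
    load = [0] * m
    for bucket in product(*ranges):
        load[disk_modulo(bucket, m)] += 1
    return max(load)

# 3 attributes with 4 partitions each on m = 2 disks; one attribute unspecified.
# The 4 qualifying buckets split evenly across the 2 disks: response time 2.
print(response_time((1, None, 3), (4, 4, 4), 2))  # 2
```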

1997 - HFS: A Performance-Oriented Flexible File System Based on Building-Block Compositions

1998 - Efficient Disk Allocation for Fast Similarity Searching

Sunil Prabhakar et al. presented a new scheme which provides good declustering for similarity searching. In particular, it performs global declustering as opposed to local declustering, exploits the availability of extra disks, and does not limit the partitioning of the data space. The technique is based upon the cyclic declustering schemes which were developed for range and partial match queries, and the authors establish, in general, that cyclic declustering techniques outperform previously proposed techniques. The problem of efficient similarity searching is becoming important for databases as non-textual information is stored; it reduces to one of finding nearest neighbors in high-dimensional spaces. In this paper, a new disk allocation method for declustering high-dimensional data to optimize nearest-neighbor queries is developed. The new scheme, called cyclic allocation, is simple to implement and is a general allocation method in that it imposes no restrictions on the partitioning of the data space. Furthermore, it exploits the availability of any number of disks to improve performance. Finally, by varying the skip values, the method can be adapted to yield allocations that are optimized for various criteria. The authors demonstrate the superior performance of the cyclic approach compared to existing schemes, both those originally designed for range queries (FX, DM, and HCAM) and those designed specifically for nearest neighbors (NOD). The FX and DM schemes are found to be inappropriate for nearest-neighbor queries. HCAM performs reasonably well for odd numbers of disks, but extremely poorly for even numbers. NOD was found not to achieve as much parallelism as the cyclic scheme in most cases, except when retrieving only direct neighbors with a small number of disks. NOD also has the potential to give better performance for some dimensions when the number of disks is close to that required to achieve near-optimality. On the other hand, NOD is restricted to 2-way partitioning of each dimension, and its cost remains the same even when more disks are available beyond those required for near-optimal declustering; this results in a saturation of the gains produced by NOD beyond that point. In contrast, the cyclic approach is not restricted to 2-way partitioning and makes use of all available disks; in fact, its cost tracks the lower bound and falls as the number of disks increases. Overall, the cyclic scheme gives the best performance for nearest-neighbor queries more consistently than any other scheme. Given the success of the cyclic schemes for two-dimensional range queries [PAAE98], and their flexibility for nearest-neighbor queries, the authors expect them to give good performance for systems that require both types of queries.
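
A sketch of the cyclic allocation rule (an assumed general form; the skip values are the tunable parameters the paper optimizes): each grid cell's disk is a skip-weighted sum of its coordinates modulo the number of disks.

```python
def cyclic_disk(cell, skips, num_disks):
    """Cyclic declustering: disk = (sum_i cell[i] * skips[i]) mod num_disks."""
    return sum(c * s for c, s in zip(cell, skips)) % num_disks

# A 4x4 grid of partitions declustered over 5 disks with skip values (1, 2):
# adjacent cells land on different disks, so a nearest-neighbor query that
# touches a cluster of neighboring cells is served by several disks in parallel.
for row in range(4):
    print([cyclic_disk((row, col), (1, 2), 5) for col in range(4)])
# [0, 2, 4, 1]
# [1, 3, 0, 2]
# [2, 4, 1, 3]
# [3, 0, 2, 4]
```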

2003 - An application of a context-aware file system

Ubiquitous computing environments stretch the requirements of traditional infrastructures used to facilitate the development of applications. Activities are often supported by collections of applications, some of which are automatically launched with little or no human intervention. This task-driven environment challenges existing application construction and data management techniques. The paper describes a file system that organises application data based on contextual information, imports user data based on its physical presence, and supports format conversions to accommodate device context. The authors describe several applications developed within their ubiquitous computing infrastructure and show how they leverage the novel features of the file system to reduce their complexity. Ubiquitous computing is poised to revolutionise computer systems design and use. New challenges arise beyond what is required in typical distributed systems, such as how to integrate context, account for mobility, and integrate heterogeneous devices. Applications are no longer tied to a single machine, but may be mapped to a physical space based on resource availability or the role that the space is playing. Applications running in a physical space may be affected by context, such as the location, what is happening in the space, or who is present. The file system addresses these issues by restricting the scope of information to what is meaningful for the current context, such as application configurations or application data, making personal information available conditioned on user presence, and adapting content for resource availability through dynamic data types. While much work has been conducted in location-aware data retrieval, the authors believe that other types of context information can be useful in such environments.

A design philosophy of this work has been to create a programmable system in which the user retains a level of control. By allowing agreements between a space and its applications, users are able to tag data in such a way that it becomes available to those applications conditioned on the current context, and the system is able to operate with deterministic behaviour. The applications built so far take advantage of the features of the system to simplify functions that usually require additional effort on the part of the developer or user. The performance of the system easily satisfies the requirements of the applications developed. As more applications are built, it will become possible to determine what types of scenarios and applications are best supported by such a system. Through experience with the system, several areas for improvement have been identified, and some of these functions will be incorporated into the infrastructure. Future work will also involve the inclusion of access control: the research group is working on a role-based security framework that will be integrated into the system.

2003 - The Google File System

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real-world use. The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. While some design decisions are specific to our unique setting, many may apply to data processing tasks of a similar magnitude and cost consciousness. We started by reexamining traditional file system assumptions in light of our current and anticipated application workloads and technological environment. Our observations have led to radically different points in the design space. We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system. Our system provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery. Chunk replication allows us to tolerate chunkserver failures. The frequency of these failures motivated a novel online repair mechanism that regularly and transparently repairs the damage and compensates for lost replicas as soon as possible. Additionally, we use checksumming to detect data corruption at the disk or IDE subsystem level, which becomes all too common given the number of disks in the system. Our design delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. We achieve this by separating file system control, which passes through the master, from data transfer, which passes directly between chunkservers and clients. Master involvement in common operations is minimized by a large chunk size and by chunk leases, which delegate authority to primary replicas in data mutations. This makes possible a simple, centralized master that does not become a bottleneck. We believe that improvements in our networking stack will lift the current limitation on the write throughput seen by an individual client. GFS has successfully met our storage needs and is widely used within Google as the storage platform for research and development as well as production data processing.
It is an important tool that enables us to continue to innovate and attack problems on the scale of the entire web.
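
The block-level checksumming idea is easy to illustrate. The sketch below is generic verification code, not GFS's implementation, though the granularity follows the paper: GFS breaks each chunk into 64 KB blocks, each with a 32-bit checksum.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, each guarded by a 32-bit checksum

def checksum_blocks(data):
    """Compute one CRC32 per 64 KB block of a chunk's data."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def verify(data, checksums):
    """Return indexes of corrupted blocks, so they can be repaired from replicas."""
    return [i for i, c in enumerate(checksum_blocks(data)) if c != checksums[i]]

chunk = bytes(200 * 1024)            # a 200 KB chunk of zeros
sums = checksum_blocks(chunk)
damaged = bytearray(chunk)
damaged[70 * 1024] = 0xFF            # flip a byte in the second 64 KB block
print(verify(bytes(damaged), sums))  # [1]
```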

2005 - Algorithms and Data Structures for Flash Memories

Flash memory is a type of electrically erasable programmable read-only memory (EEPROM). Because flash memories are nonvolatile and relatively dense, they are now used to store files and other persistent objects in handheld computers, mobile phones, digital cameras, portable music players, and many other computer systems in which magnetic disks are inappropriate. Flash, like earlier EEPROM devices, suffers from two limitations: first, bits can only be cleared by erasing a large block of memory; second, each block can only sustain a limited number of erasures, after which it can no longer reliably store data. Due to these limitations, sophisticated data structures and algorithms are required to use flash memories effectively. These algorithms and data structures support efficient not-in-place updates of data, reduce the number of erasures, and level the wear of the blocks in the device. The survey presents these algorithms and data structures, many of which had previously been described only in patents. (Note: this survey covers many file systems.)
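
A minimal sketch of the not-in-place update idea (hypothetical code, not from the survey): a page-mapped translation layer writes every logical-page update to a fresh physical page and, when space runs out, erases the least-worn fully stale block, leveling wear across the device.

```python
class ToyFTL:
    """Page-mapped flash translation: updates go to fresh pages, never in place."""

    def __init__(self, num_blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        self.erase_counts = [0] * num_blocks
        self.mapping = {}                       # logical page -> ((block, page), data)
        self.next_free = [(b, p) for b in range(num_blocks)
                          for p in range(pages_per_block)]

    def write(self, logical_page, data):
        """A rewrite simply remaps the logical page to a fresh physical page."""
        if not self.next_free:
            self._reclaim()
        self.mapping[logical_page] = (self.next_free.pop(0), data)

    def _reclaim(self):
        """Erase the least-worn block whose pages are all stale (wear leveling)."""
        live = {loc[0] for loc, _ in self.mapping.values()}
        candidates = [b for b in range(len(self.erase_counts)) if b not in live]
        if not candidates:
            raise RuntimeError("no reclaimable block (sketch omits copying live pages)")
        victim = min(candidates, key=lambda b: self.erase_counts[b])
        self.erase_counts[victim] += 1
        self.next_free = [(victim, p) for p in range(self.pages_per_block)]
```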

Improving Storage System Availability With D-GRAID

We present the design, implementation, and evaluation of D-GRAID, a gracefully degrading and quickly recovering RAID storage array. D-GRAID ensures that most files within the file system remain available even when an unexpectedly high number of faults occur. D-GRAID achieves high availability through aggressive replication of semantically critical data and fault-isolated placement of logically related data. D-GRAID also recovers from failures quickly, restoring only live file system data to a hot spare. Both graceful degradation and live-block recovery are implemented in a prototype SCSI-based storage system underneath unmodified file systems, demonstrating that powerful file-system-like functionality can be implemented within a semantically smart disk system behind a narrow block-based interface. D-GRAID turns the simple binary failure model found in most storage systems into a continuum, increasing the availability of storage by continuing operation under partial failure and quickly restoring live data after a failure does occur. In this article, we have shown the potential benefits of D-GRAID, established the limits of semantic knowledge, and shown how a successful D-GRAID implementation can be achieved despite these limits. Through simulation and the evaluation of a prototype implementation, we have found that D-GRAID can be built underneath a standard block-based interface, without any file system modification, and that it delivers graceful degradation and live-block recovery and, through access-driven diffusion, good performance. We conclude with a discussion of the lessons we have learned in the process of implementing D-GRAID:

Limited knowledge within the disk does not imply limited functionality. One of the main contributions of this article is a demonstration of both the limits of semantic knowledge, as well as the proof via implementation that despite such limitations, interesting functionality can be built inside of a semantically smart disk system. We believe any semantic disk system must be careful in its assumptions about file system behavior, and hope that our work can guide others who pursue a similar course.

Semantically smart disks would be easier to build with some help from above. Because of the way file systems reorder, delay, and hide operations from disks, reverse engineering exactly what they are doing at the SCSI level is difficult. We believe that small modifications to file systems could substantially lessen this difficulty. For example, if the file system could inform the disk whenever it believes the file system structures are in a consistent on-disk state, many of the challenges in the disk would be lessened. This is one example of many small alterations that could ease the burden of semantic disk development.

Semantically smart disks stress file systems in unexpected ways. File systems were not built to operate on top of disks that behave as D-GRAID does; specifically, they may not behave particularly well when part of a volume address space becomes unavailable. Perhaps because of its heritage as an OS for inexpensive hardware, Linux file systems handle unexpected conditions fairly well. However, the exact model for dealing with failure is inconsistent: data blocks can go missing and then reappear, but the same is not true for inodes. As semantically smart disks push new functionality into storage, file systems may need to evolve to accommodate them.

2006 - An approach to virtual allocation in storage systems

Storage area networks (SANs) and storage virtualization [Huang et al. 2004] allow storage devices to be shared by multiple heterogeneous operating systems. However, native file systems, such as Windows NTFS or Linux ext2, expect to have exclusive access to their volumes. In other words, each operating system reserves storage devices for its own use, and the space in a storage device owned by one operating system cannot be used by another. This seriously hampers the flexibility of storage management [Menon et al. 2003], and the lack of flexibility manifests in the following restrictions: (a) file systems cannot use storage space in a device owned by another file system; and (b) a computer can only create files on devices it owns. Traditional file systems are closely tied to their underlying storage. File systems currently employ a static allocation approach, where the entire storage space is claimed at the time of file system creation. This is somewhat like running with only physical memory in a memory system: memory systems employ virtual memory for many reasons, such as allowing flexible allocation of resources and sharing memory safely, and traditional static allocation of storage at file system creation lacks such flexibility. To address this problem and to enable storage as a discoverable resource, the authors developed a virtual allocation technique at the block level. Virtual allocation has the following combination of characteristics:

It uses the generic block interface widely employed in today's systems. This allows it to support a wide range of heterogeneous platforms, and permits the simplest reuse of existing operating system and file system technologies.

It provides a mechanism to create virtual storage systems, where unused storage space can be provided to any application or file system as a discoverable resource.

It can be built on existing systems with little change to operating systems.

File system-level: IBM's Storage Tank [Menon et al. 2003] separates metadata operations from data operations to allow a common pool of storage space to be shared across heterogeneous environments. Log-structured file systems (e.g., LFS [Rosenblum and Ousterhout 1992] and WAFL [Hitz et al. 1994]) allow storage allocation to be detached from file systems because all new data is written at the end of a log. File systems such as ReiserFS [Reiser 2004] and JFS (journaled file system) [Best 2000] support expansion of the file system. None of these systems allocates storage space based on actual usage.

Block-level: Veritas Volume Manager [Veritas 2002] and other existing storage resource management (SRM) products, such as IBM Tivoli Storage Manager [Kaczmarski et al. 2003], provide considerable flexibility in storage management, but allocate storage space based on file system size. Loge [English and Stepanov 1992] separates storage allocation from the file system to optimize write operations; it writes blocks near the current disk-head position. The Loge system does not provide storage allocation flexibility with traditional file systems; by contrast, the focus of this article is on integrating a flexible storage allocation scheme into a traditional file system environment. HP's AutoRAID employed dynamic allocation in order to reduce the small-write penalty of RAID systems [Wilkes et al. 1996]. Petal [Lee and Thekkath 1996] proposes a disk-like interface which provides an easy-to-manage distributed storage system to many clients (e.g., file systems and databases). Petal's virtual disks cleanly separate a client's view of storage from the physical resources, which allows the physical resources to be shared more flexibly among many clients, but it still allocates storage space according to the configured file system size. More recently, a storage resource allocation technique called thin provisioning was introduced [3PARdata 2003], which provides storage allocation flexibility, as this system does; however, the details of its architecture, implementation, and performance are unknown. Virtual allocation is on-demand allocation; it is orthogonal to the storage virtualization products sold by vendors. Storage virtualization hides the connectivity, physical characteristics, and organization of devices from file systems. However, existing products allocate storage based on file system size, and don't allow sharing of unused space. This is akin to programs allocating the maximum amount of memory they might need and never relinquishing or sharing it, even when they actually need much less. Such products employ static allocation, where every mapping between logical and physical blocks is predetermined for a given size when creating a logical volume. By contrast, in VA, every mapping is determined dynamically as data is written (on demand), so that storage can be allocated based on actual use. Using VA, several different file systems may share a single storage resource, or a single file system may span multiple storage resources, resulting in better flexibility.
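
The allocate-on-write core of virtual allocation can be sketched in a few lines (hypothetical code; the actual system works with on-disk extent maps and block-level allocation units):

```python
class VirtualVolume:
    """Allocate-on-write: logical extents get physical space on first write."""

    def __init__(self, shared_pool, extent_size=1024):
        self.pool = shared_pool        # physical extents shared by all volumes
        self.extent_size = extent_size
        self.map = {}                  # logical extent -> physical extent

    def write(self, logical_block, data):
        extent = logical_block // self.extent_size
        if extent not in self.map:     # first touch: claim space on demand
            self.map[extent] = self.pool.pop()
        # Return the physical block the data would be written to.
        return self.map[extent] * self.extent_size + logical_block % self.extent_size

pool = list(range(100))                # 100 physical extents shared by all volumes
vol_a, vol_b = VirtualVolume(pool), VirtualVolume(pool)
vol_a.write(0, b"x")
vol_b.write(5000, b"y")
print(len(pool))                       # 98: space is consumed only when written
```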

Conclusion

We have proposed virtual allocation employing an allocate-on-write policy for improving the flexibility of managing storage across multiple file systems and platforms. By separating storage allocation from file system creation, common storage space can be pooled across different file systems and flexibly managed to meet their varying needs. Virtual allocation also makes it possible for storage devices to expand easily, so that existing devices can incorporate other available resources. We have shown that this scheme can be implemented with existing file systems, without significant changes to operating systems. Our experimental results from a Linux PC-based prototype system demonstrate the effectiveness of the approach.

2006 - The Conquest File System: Better Performance Through a Disk/Persistent-RAM Hybrid Design

Modern file systems assume the use of disk, a system-wide performance bottleneck for over a decade. Current disk caching and RAM file systems either impose high overhead to access memory content or fail to provide mechanisms to achieve data persistence across reboots. The Conquest file system is based on the observation that memory is becoming inexpensive, which enables all file system services to be delivered from memory, except for providing large storage capacity. Unlike caching, Conquest uses memory with battery backup as persistent storage, and provides specialized and separate data paths to memory and disk. Therefore, the memory data path contains no disk-related complexity, and the disk data path consists of optimizations only for the specialized disk usage pattern. Compared to a memory-based file system, Conquest incurs little performance overhead. Compared to several disk-based file systems, Conquest achieves 1.3x to 19x faster memory performance, and 1.4x to 2.0x faster performance when exercising both memory and disk. Conquest realizes most of the benefits of persistent RAM at a fraction of the cost of a RAM-only solution. It also demonstrates that disk-related optimizations impose high overheads for accessing memory content in a memory-rich environment. The Conquest disk/persistent-RAM hybrid file system addresses the performance problem of disk. The key observation is that the cost of persistent RAM (e.g., battery-backed DRAM) is declining rapidly, and the assumption of RAM as a scarce resource is becoming less true for average users. Conquest explores these emerging memory-rich environments and their effects on file system architecture and performance. The Conquest experience also teaches the following lessons: (1) current operating systems have a deep-rooted assumption of high-latency storage throughout the computing stack, which is difficult to bypass or remove; (2) file systems designed for disks fail to exploit the full potential of memory performance in a memory-rich environment; and (3) separating data paths into low- and high-latency storage, and matching workload characteristics to the appropriate storage media, can yield significant performance gains and data-path simplifications.

Conquest is a disk/persistent-RAM hybrid file system that delivers all file system services from persistent RAM, with the single exception that high-capacity storage is still provided by traditional disks. In essence, Conquest provides two specialized and simplified data paths to persistent-RAM and disk storage: (1) It stores all small files and metadata (e.g., directories and file attributes) in RAM; and (2) disk holds only the data content of the remaining large files (with their metadata stored in persistent RAM).
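
A sketch of this placement policy (illustrative only; the Conquest papers use a size threshold of about 1 MB to separate small files from large ones):

```python
LARGE_FILE_THRESHOLD = 1 << 20   # ~1 MB cutoff between small and large files

def place(file_size):
    """Decide where a file's data lives under the Conquest-style split.

    Metadata always stays in persistent RAM; only the data content of
    large files goes to disk.
    """
    data_store = "disk" if file_size > LARGE_FILE_THRESHOLD else "persistent RAM"
    return {"metadata": "persistent RAM", "data": data_store}

print(place(4 * 1024))          # small file: everything in persistent RAM
print(place(300 * (1 << 20)))   # large file: data on disk, metadata in RAM
```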

By partitioning data in this fashion, Conquest performs all file system management on memory-resident data structures, thereby minimizing disk accesses. Tailoring both file system data structures and management to the physical characteristics of memory significantly improves performance, as compared to disk-only designs. In addition, traversing the memory data path incurs no disk-related overhead, and the disk data path consists of only the processing costs for handling access patterns that are suitable for disk. Another benefit of Conquest is its ability to provide a smooth and cost-effective transition from disk-based to persistent-RAM-based storage. Unlike other memory file systems [McKusick et al. 1990; Douglis et al. 1994; Wu and Zwaenepoel 1994], Conquest allows persistent RAM to assume more file system responsibility as memory prices decline. Thus, it can achieve most of the benefits of persistent RAM without the high cost of RAM-only solutions. The experience of designing and implementing Conquest offers several major lessons:

The handling of disk characteristics permeates file system design, even at levels above the device layer. For example, the default VFS routines contain read-ahead and buffer cache mechanisms that add high and unnecessary overheads to low-latency main store. Because of the need to bypass these mechanisms, building Conquest was much more difficult than initially expected. For example, certain internal storage routines anticipate the data structures associated with disk handling; reusing these routines involves either constructing memory-specific access routines from scratch or finding ways to invoke them with memory-based data structures.

File systems that are optimized for disk are not suitable for an environment where memory is abundant. For example, ext2, reiserfs, and SGI XFS do not exploit the speed of RAM as well as had been anticipated; disk-related optimizations impose high overheads on in-memory accesses. Matching the physical characteristics of media to storage objects provides opportunities for faster performance and considerable simplification of each medium-specific data path. Conquest applies this principle of specialization:

Leaving only the data content of large files on disk leads to simpler and cleaner management for both memory and disk storage. This observation may seem obvious, but good results are not achieved automatically. For example, should the L2 cache footprint of two specialized data paths exceed the size of a single generic data path, the resulting performance can go in either direction, depending on the size of the physical cache.

Access to cached data in traditional file systems incurs performance costs due to commingled disk-related code. Removing disk-related complexity for in-memory storage under Conquest therefore yields unexpected benefits, even for cache accesses. In particular, one surprising result was Conquest's ability to outperform ramfs by 7% to 14% in bandwidth under microbenchmarks, despite the fact that the storage data paths in ramfs are already heavily optimized.

It is much more difficult to use RAM to improve disk performance than might appear at first. Simple approaches, such as increasing the buffer cache size or installing simple RAM-disk drivers, do not produce a full-featured, high-performance solution.

Conquest demonstrates how this process of rethinking underlying assumptions can lead to significant performance benefits and architectural simplifications. This experience suggests that the radical changes in hardware, applications, and user expectations of the past decade should also lead us to reflect on future file system design and other aspects of operating system design.

Contributing Storage Using the Transparent File System

Contributory applications allow users to donate unused resources on their personal computers to a shared pool. Applications such as SETI@home, Folding@home, and Freenet are now in wide use and provide a variety of services, including data processing and content distribution. However, while several research projects have proposed contributory applications that support peer-to-peer storage systems, their adoption has been comparatively limited. We believe that a key barrier to the adoption of contributory storage systems is that contributing a large quantity of local storage interferes with the principal user of the machine.

To overcome this barrier, we introduce the Transparent File System (TFS). TFS provides background tasks with large amounts of unreliable storage (all of the currently available space) without impacting the performance of ordinary file access operations. We show that TFS allows a peer-to-peer contributory storage system to provide 40% more storage at twice the performance when compared to a user-space storage mechanism. We analyze the impact of TFS on replication in peer-to-peer storage systems and show that TFS does not appreciably increase the resources needed for file replication.

Contributions:

This article presents the Transparent File System (TFS), a file system that can contribute 100% of the idle space on a disk while imposing a negligible performance penalty on the local user. TFS operates by storing files in the free space of the file system so that they are invisible to ordinary files; in essence, normal file allocation proceeds as if the system were not contributing any space at all. We show in Section 5 that TFS imposes nearly no overhead on the local user. TFS achieves this both by minimizing interference with the file system's block allocation policy and by sacrificing persistence for contributed space: normal files may overwrite contributed space at any time. TFS takes several steps to limit this unreliability, but because contributory applications are already designed to work with unreliable machines, they behave appropriately in the face of unreliable files. Furthermore, we show that TFS does not appreciably impact the bandwidth needed for replication. Users typically create little data in the course of a day [Bolosky et al. 2000], so the erasure of contributed storage is negligible compared to the rate of machine failures.
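
A toy model of the transparent-block idea (my own sketch, not the TFS implementation): ordinary allocation treats transparent blocks as free and may overwrite them at will, while contributed storage only ever occupies truly free blocks.

```python
FREE, ORDINARY, TRANSPARENT = 0, 1, 2

class ToyTFS:
    def __init__(self, num_blocks):
        self.state = [FREE] * num_blocks

    def allocate_ordinary(self, block):
        """Normal files proceed as if no space were contributed: transparent
        blocks look free and may be overwritten at any time."""
        overwrote = self.state[block] == TRANSPARENT
        self.state[block] = ORDINARY
        return overwrote   # True means contributed data at this block is lost

    def allocate_transparent(self, block):
        """Contributed storage only ever uses truly free blocks."""
        if self.state[block] == FREE:
            self.state[block] = TRANSPARENT
            return True
        return False

fs = ToyTFS(8)
fs.allocate_transparent(3)
print(fs.allocate_ordinary(3))   # True: the transparent block was sacrificed
```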

TFS is especially useful for replicated storage systems executing across relatively stable machines with plentiful bandwidth, as in a university or corporate network. This environment is the same one targeted by distributed storage systems such as FARSITE [Adya et al. 2002]. As others have shown previously, for high-failure environments such as wide-area Internet-based systems, the key limitation is the bandwidth between nodes, not the total storage: the bandwidth needed to replicate data after failures essentially limits the amount of storage the network can use [Blake and Rodrigues 2003]. In a stable network, TFS offers substantially more storage than dynamic, user-space techniques for contributing storage.

Using free disk space. Recognizing that the file system on a typical desktop is nearly half-empty, researchers have investigated ways to make use of the extra storage. FS2 [Huang et al. 2005] is a file system that uses extra space on the disk for block-level replication to reduce average seek time. FS2 dynamically analyzes disk traffic to determine which blocks are frequently used together. It then creates replicas of blocks so that the spatial locality on disk matches the observed temporal locality. FS2 uses a policy that deletes replicas on demand as space is needed. We believe that it could benefit from a TFS-like allocation policy, where all replicas except the primary one would be stored as transparent blocks; in this way, the entire disk could be used for block replication. A number of peer-to-peer storage systems have been proposed that make use of replication and free disk space to provide reliability. These include distributed hash tables such as Chord [Morris et al. 2001] and Pastry [Rowstron and Druschel 2001a], as well as complete file systems like the Chord file system [Dabek et al. 2001] and PAST [Rowstron and Druschel 2001b].

CA-NFS: A Congestion-Aware Network File System

ALEXANDROS BATSAKIS et al.

We develop a holistic framework for adaptively scheduling asynchronous requests in distributed file systems. The system is holistic in that it manages all resources, including network bandwidth, server I/O, server CPU, and client and server memory utilization. It accelerates, defers, or cancels asynchronous requests in order to directly improve application-perceived performance. We employ congestion pricing via online auctions to coordinate the use of system resources by the file system clients, so that they can detect shortages and adapt their resource usage. We implement our modifications in the Congestion-Aware Network File System (CA-NFS), an extension of the ubiquitous Network File System (NFS). Our experimental results show that CA-NFS yields a 20% improvement in execution times compared with NFS for a variety of workloads. CA-NFS introduces a new dimension in resource management by implicitly managing and coordinating the usage of file system resources among all clients. It unifies fairness and priorities in a single framework, ensuring that realizing optimization goals benefits the file system's users, not just the file system servers.
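As a rough illustration of the pricing idea, the toy model below charges exponentially more for a resource as its utilization rises, and a client compares the server's advertised price against its own valuation of an asynchronous write. The pricing function, constants, and thresholds are assumptions for this sketch, not the pricing actually used by CA-NFS:

#include <math.h>
#include <stdio.h>

/* Toy congestion price: grows exponentially with utilization u in
 * [0,1], so the price stays near zero until the resource is scarce. */
static double price(double u, double k)
{
    return (pow(k, u) - 1.0) / (k - 1.0);   /* 0 at u=0, 1 at u=1 */
}

int main(void)
{
    /* A client holds a private valuation for an asynchronous write.
     * It defers the write while the server's price exceeds that
     * valuation and accelerates it when the price is low. */
    double server_price = price(0.9, 64.0); /* heavily loaded server */
    double client_value = 0.3;              /* cheap to keep buffering */

    if (server_price > client_value)
        printf("defer write-back (price %.2f)\n", server_price);
    else
        printf("accelerate write-back (price %.2f)\n", server_price);
    return 0;
}

The exponential shape is what lets clients ignore resources that are plentiful and back off sharply only as a resource approaches exhaustion.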

Flash-Aware Cluster Allocation Method Based on Filename Extension for FAT File System

To support conventional file systems such as FAT, NAND flash memory needs an FTL to provide transparent block-device emulation to the file system. However, the log-block FTL, the most popular FTL, suffers from performance degradation under workloads that contain many random writes. When the file system is not aware of the NAND flash, it often allocates files in a fragmented fashion on the flash, which causes many random writes. Consequently, to improve the performance of the FTL, the file system must allocate files contiguously on the NAND flash. In this paper, we propose a NAND flash-aware cluster allocation method for the FAT file system, named FECA (Flash-aware Extension-based Cluster Allocation). FECA exploits the following two observations. The first is that the effort to defragment small files may not always improve performance.

The second is that, in most cases, there is a very strong correlation between the size of a file and its filename extension. Based on these observations, FECA predicts file sizes from their extensions and determines the allocation policy accordingly. To evaluate the effectiveness of FECA, we devise two defragmentation metrics that account for the features of the NAND flash, and we show through extensive experiments that FECA outperforms previous methods in terms of both metrics. The results show that FECA improves performance by 10% and reduces garbage-collection frequency by up to 35% compared to the previous methods.
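A minimal sketch of extension-based prediction; the extension table and the two-policy split below are illustrative assumptions, not FECA's actual tables or policies:

#include <string.h>
#include <strings.h>

/* Illustrative extension table: extensions that usually indicate
 * large files get contiguous (defragmented) allocation; everything
 * else is treated as small. The entries are assumptions for this
 * sketch. */
static const char *large_ext[] = { "mp3", "mp4", "avi", "jpg", "zip" };

enum policy { ALLOC_SMALL, ALLOC_CONTIGUOUS };

enum policy pick_policy(const char *name)
{
    const char *dot = strrchr(name, '.');
    if (!dot)
        return ALLOC_SMALL;
    for (size_t i = 0; i < sizeof large_ext / sizeof *large_ext; i++)
        if (strcasecmp(dot + 1, large_ext[i]) == 0)
            return ALLOC_CONTIGUOUS;  /* predicted large: defragment */
    return ALLOC_SMALL;               /* predicted small: don't bother */
}

The benefit of predicting from the extension is that the allocation decision can be made at file-create time, before any data has been written and the eventual size is known.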

The file system space is managed in a unit called a cluster or block, which is a group of consecutive sectors of the disk [9]. In this paper, to prevent confusion with the NAND block, we use the term cluster only. We assume that a cluster corresponds to a page of the NAND flash. There are two existing methods popularly used to allocate free clusters: SCA (Simple Cluster Allocation) and GCA (Greedy Cluster Allocation). Figure 1 illustrates how these methods work. We assume that the FTL is a log-block mapping scheme and that the logical space of the file system aligns with the NAND blocks of the physical space. In the figure, logical space means the status of the space as seen by the file system, and physical space indicates the status of the physical NAND block space. They are aligned with each other, but only the physical space may contain obsolete pages, which were not erased after invalidation. Each slot of the logical and physical space corresponds to a cluster and a page, respectively. As shown in the figure, clusters 0-2, 6-8, and a-b of the logical space were already allocated to some files. Note that the file system sees clusters 3-5 and 9 as free clusters, but they are actually obsolete in the physical space. We now describe what happens when files A and B, whose sizes are five clusters and one cluster respectively, are written in each file system.
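The following sketch contrasts the two allocators under simplifying assumptions: SCA is taken here as a plain first-free scan of the FAT, and GCA as a greedy variant that prefers the cluster group (NAND block) with the most free clusters. Both are illustrative readings of the setup above, not the paper's code:

#include <stdbool.h>

#define NCLUSTERS 16
#define GROUP 4                      /* clusters per NAND block */

static bool used[NCLUSTERS];         /* FAT free/used view */

/* SCA, sketched as a simple first-fit scan: return the first free
 * cluster at or after 'hint'. (Assumed behavior for illustration.) */
int sca_alloc(int hint)
{
    for (int c = hint; c < NCLUSTERS; c++)
        if (!used[c]) { used[c] = true; return c; }
    return -1;
}

/* GCA, sketched as a greedy variant: pick a cluster from the cluster
 * group (NAND block) containing the most free clusters, so writes
 * concentrate in one block. (Assumed behavior for illustration.) */
int gca_alloc(void)
{
    int best = -1, best_free = 0;
    for (int g = 0; g < NCLUSTERS / GROUP; g++) {
        int nfree = 0;
        for (int c = g * GROUP; c < (g + 1) * GROUP; c++)
            nfree += !used[c];
        if (nfree > best_free) { best_free = nfree; best = g; }
    }
    if (best < 0) return -1;
    for (int c = best * GROUP; c < (best + 1) * GROUP; c++)
        if (!used[c]) { used[c] = true; return c; }
    return -1;
}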

5.1 Experimental environments

We implemented the allocation methods on a real NAND Flash-based embedded board. The test board is equipped with an S3C2413 (ARM926) MCU, 128 MB SDRAM, and 1 GB MLC (Multi-Level Cell) NAND Flash [14]. The access latency of the read, write, and erase operations is 121 us/page, 861 us/page, and 1.5 ms/block, respectively. The latency includes the transfer time between the microprocessor and the page register in the NAND flash, as well as the transfer time between the page register and the NAND flash basic cell. We use the log-block mapping FTL and a FAT-compatible file system. We set the cluster size to 2 KB, so that 128 clusters fit in one cluster group (NAND block). To speed up the experiments, we restricted the file system size to 512 MB, but we believe we would obtain similar results at a larger size. In order to collect FDI and SDI, we devise I/O control interfaces named IOCTL_FDI and IOCTL_SDI. Upon an IOCTL_FDI request, the FDI value is returned by scanning all files. For an IOCTL_SDI request, the SDI value is calculated using the cluster group bitmaps. We target a smart phone that supports full Internet browsing, where read, write, and delete operations happen frequently. To obtain the workload characteristics, we analyzed several web server workloads [15][16] and Microsoft Internet Explorer Temporary Internet Files. As a result, most files involved in Internet browsing are under 256 KB, and 90% of them are under 16 KB, as shown in Figure 5. Besides, general systems are known to handle mostly small files [7][17]. To reflect such workloads, we create two groups of files: the small-sized and the large-sized group. The former consists of files whose sizes are uniformly distributed between 1 byte and 16 KB. The latter consists of files whose sizes
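The custom interfaces might be invoked from user space as below. The paper names IOCTL_FDI and IOCTL_SDI but does not give their numeric encodings, value types, or the mount point, so everything in this sketch is assumed:

#include <stdio.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical request codes: the names come from the paper, but the
 * encodings here are invented for illustration. */
#define IOCTL_FDI _IOR('f', 1, double)
#define IOCTL_SDI _IOR('f', 2, double)

int main(void)
{
    double fdi, sdi;
    int fd = open("/mnt/fat", O_RDONLY);  /* assumed mount point */
    if (fd < 0) return 1;
    if (ioctl(fd, IOCTL_FDI, &fdi) == 0)  /* scans all files */
        printf("FDI = %f\n", fdi);
    if (ioctl(fd, IOCTL_SDI, &sdi) == 0)  /* uses group bitmaps */
        printf("SDI = %f\n", sdi);
    close(fd);
    return 0;
}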

6. CONCLUSION

Since FAT is widely used, it is necessary for CE devices to support FAT-compatible file systems. However, due to the unique physical characteristics of the NAND flash memory used in most CE devices, existing allocation methods optimized for disk-based storage systems behave inefficiently when adopted on NAND flash. In this paper, we have proposed a flash-aware cluster allocation method optimized for FAT on NAND flash. Since defragmentation is important for the NAND flash to improve write performance, we first suggested a defragmentation policy that takes the size of the file into account. Then, using the fact that the extension of a file is strongly correlated with its size, we proposed a flash-aware cluster allocation method based on the filename extension, called FECA. To evaluate the effectiveness of FECA, we also designed two indices, FDI (File Defragmentation Index) and SDI (Space Defragmentation Index), which represent the degree of defragmentation while considering the features of the NAND flash. Several experimental results have shown that FECA increased both FDI and SDI, and hence significantly improved performance in comparison with SCA and GCA.

Request Bridging and Interleaving: Improving the Performance of Small Synchronous Updates under Seek-Optimizing Disk Subsystems

Write-through caching in modern disk drives protects data in the event of power failures, as well as from certain disk errors, where a write-back cache does not. Host systems achieve these benefits at the price of significant performance degradation, especially for small disk writes. We present new block-level techniques to address the performance problem of write-through caching disks. Our techniques are strongly motivated by some interesting results observed when disk-level caching is turned off. By extending conventional request merging, request bridging increases the request size and amortizes the inherent delays in the disk drive across more bytes of data. Like sector interleaving, request interleaving rearranges requests to prevent the disk head from narrowly missing the target sector position, and thus reduces disk latency. We have evaluated our block-level approach using a variety of I/O workloads and shown that it increases disk I/O throughput by up to about 50%. For some real-world workloads, the disk performance is comparable or even superior to that of using the write-back disk cache. In practice, our simple yet effective solutions achieve a better tradeoff between data reliability and disk performance when applied to write-through caching disks.
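The bridging idea can be sketched as an extension of conventional request merging: two writes separated by a small sector gap are combined into one larger sequential request, provided the gap sectors can be filled (for example, by reading them back first). The gap threshold and data structures below are illustrative assumptions, not the paper's implementation:

#include <stdbool.h>
#include <stdint.h>

struct req { uint64_t start, len; };  /* in sectors */

/* Bridging, sketched: conventional merging joins only contiguous
 * requests; bridging also joins requests separated by a small gap,
 * at the cost of transferring the gap sectors too. */
#define MAX_GAP 8                     /* illustrative threshold */

bool try_bridge(struct req *a, const struct req *b)
{
    uint64_t a_end = a->start + a->len;
    if (b->start < a_end || b->start - a_end > MAX_GAP)
        return false;                 /* overlapping or too far apart */
    /* One bigger sequential request amortizes per-request overhead:
     * the gap sectors must be filled (e.g., read back first) before
     * the combined write is issued. */
    a->len = (b->start + b->len) - a->start;
    return true;
}

The tradeoff captured here is that transferring a few extra gap sectors is far cheaper than paying the full per-request latency of a write-through disk twice.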

The proposed block I/O techniques, request bridging and interleaving, improve small-write performance by efficiently merging and reordering requests. In multithreaded and multiprocess environments, our techniques have proven especially effective at handling concurrent I/O streams, which produce a significant fraction of slow, small block writes in disk drives. From a detailed analysis of block-level disk operations, we identified the causes of the performance degradation in write-through caching disks, which are otherwise a good way to ensure data integrity across power failures and some disk errors. In write-intensive multiprocess or multithreaded environments, using a write-through cache with request bridging and interleaving is better for both data integrity and system performance than using a write-back cache. To conclude: our approach can be the best-performing tradeoff between performance and data reliability for synchronous disk I/O operations. We implemented the proposed techniques in the Linux kernel and conducted experiments using several benchmarks. Our experiments show that the combination of these techniques increases disk I/O throughput by up to about 50%. For some real-world workloads, the disk performance was comparable or even superior to that achieved with a write-back disk cache. The proposed technique was designed to minimize possible negative impact on multi-streamed I/O requests issued concurrently to a write-through disk; sequential single-stream writes and read-mostly workloads are out of scope. For popular multicore CPUs and the rapidly increasing number of I/O-intensive parallel applications running on such machines, however, we expect our approach to be even more effective as long as the semantic gap due to the narrow interface between disk I/O layers remains.

FRASH: Exploiting Storage Class Memory in Hybrid File System for Hierarchical Storage

JAEMIN JUNG and YOUJIP WON

We develop a novel hybrid file system, FRASH, for storage-class memory and NAND flash. Despite the promising physical characteristics of storage-class memory, its scale is an order of magnitude smaller than that of current storage devices. This makes it less than desirable as an independent storage device. We carefully analyze in-memory and on-disk file system objects in a log-structured file system, and exploit the memory and storage aspects of storage-class memory to overcome the drawbacks of the current log-structured file system. FRASH provides a hybrid view of storage-class memory: it harbors in-memory data structures as well as on-disk structures. It provides nonvolatility to key data structures that have been maintained in-memory in a legacy log-structured file system. This approach greatly improves mount latency and effectively resolves the robustness issue. By maintaining on-disk structures in storage-class memory, FRASH provides byte-addressability to file system objects and page metadata, and consequently greatly improves I/O performance compared to the legacy log-structured approach. While storage-class memory offers byte granularity, it is still far slower than its DRAM counterpart. We develop a copy-on-mount technique to overcome the access-latency difference between main memory and storage-class memory. Our file system reduced the mount time by 92%, and file system I/O performance increased by 16%.

Once realized at proper scale, storage-class memory will clearly resolve significant issues in current storage and memory systems. Despite all these promising characteristics, for the next few years the scale of storage-class memory devices will remain an order of magnitude smaller (e.g., 1/1000) than that of current storage devices. We argue that storage-class memory should be exploited as a new hybrid layer between main memory and storage, rather than positioned as a full substitute for memory or storage. Via this approach, storage-class memory can complement the physical characteristics of the two: that is, the volatility of main memory and the block access granularity of storage. The key ingredient in this file system design is how to use storage-class memory in the system hierarchy. It can be mapped onto the main memory address space; in this case, it is possible to provide nonvolatility to data stored in the respective address range. On the other hand, storage-class memory can be used as part of the block device; in this case, I/O becomes faster, and an I/O-bound workload may become CPU-bound. The data structures and objects to be maintained in storage-class memory must be selected very carefully, since storage-class memory is still too small to accommodate all file system objects. In this work, we exploit both the memory and storage aspects of storage-class memory: FRASH harbors in-memory data as well as on-disk structures for the file system, and by maintaining on-disk structures in storage-class memory it provides byte-addressability to the on-disk file system objects and page metadata.
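A minimal sketch of the copy-on-mount idea, assuming metadata is bulk-copied from storage-class memory into DRAM at mount time and written back at unmount; the structure names and the write-back point are illustrative, not FRASH's internals:

#include <stdlib.h>
#include <string.h>

/* Copy-on-mount, sketched: storage-class memory is byte-addressable
 * but slower than DRAM, so the file system copies SCM-resident
 * structures into DRAM once at mount time and serves all subsequent
 * accesses from the DRAM copy. */
struct fs_meta { char data[4096]; };  /* stand-in for LFS metadata */

struct fs_meta *mount_meta(const struct fs_meta *scm)
{
    struct fs_meta *dram = malloc(sizeof *dram);
    if (dram)
        memcpy(dram, scm, sizeof *dram);  /* one bulk copy at mount */
    return dram;                          /* fast DRAM accesses after */
}

void unmount_meta(struct fs_meta *scm, struct fs_meta *dram)
{
    memcpy(scm, dram, sizeof *scm);       /* persist back to SCM */
    free(dram);
}

One bulk copy is far cheaper than paying the SCM access latency on every metadata reference, which is how the technique hides the DRAM/SCM speed gap.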
The contribution of the FRASH file system is threefold: (i) mount latency, which has been regarded as a major drawback of the log-structured file system, is decreased by an order of magnitude; (ii) I/O performance improves significantly by migrating the on-disk structure to the storage-class memory layer; and (iii) by maintaining the directory snapshot and file tree in storage-class memory, the system becomes more robust against unexpected failure. In summary, we successfully developed a state-of-the-art hybrid file system and showed that storage-class memory can be exploited effectively to resolve various technical issues in existing file systems.

MFTL: A Design and Implementation for MLC Flash Memory Storage Systems

JEN-WEI HSIEH

NAND flash memory has gained popularity as a storage medium in a variety of applications due to its low power consumption, nonvolatility, high performance, physical stability, and portability. In particular, Multi-Level Cell (MLC) flash memory, which provides a lower-cost and higher-density solution, occupies the largest share of the NAND flash-memory market. However, MLC flash memory also introduces new challenges: (1) pages in a block must be written sequentially; and (2) information indicating that a page is obsolete cannot be recorded in its spare area, due to the limitation on the number of partial programming operations. Since most applications access NAND flash memory through the FAT file system, this article designs an MLC Flash Translation Layer (MFTL) for flash-memory storage systems that takes the constraints of MLC flash memory and the access behaviors of the FAT file system into consideration. A series of trace-driven simulations was conducted to evaluate the performance of the proposed scheme. Although MFTL is designed for MLC flash memory and the FAT file system, it is applicable to SLC flash memory and other file systems as well. Our experimental results show that the proposed MFTL achieves good performance for various access patterns, even on SLC flash memory.

2.3. Data Structures

There are two major data structures used to implement address translation: a direct map and an inverse map [Gal and Toledo 2005]. The direct map is the fundamental data structure of an FTL; it maps a logical address to a physical address. The translation process can be as simple as an array lookup, although it may also involve searching a tree, and a logical address may need to go through several passes of translation to reach its corresponding physical address. The direct map usually resides in SRAM or in the flash memory itself, but for the sake of efficiency, at least the referenced parts of the direct map should be kept in SRAM.

The inverse map is essentially made up of identifiers of flash pages or blocks and is always kept in flash memory. When scanning a physical page or block of flash memory, in order to identify valid data during garbage collection or to recover the FTL after a system failure, we can easily obtain its logical address from the inverse map. In brief, the inverse map guarantees that we can always identify the logical address of a page or block, while the direct map provides fast address translation. Note that an FTL need not employ a direct map, and that not all FTLs store a complete mirror of the direct map in permanent storage.
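A minimal sketch of the two maps under simplifying assumptions: a single-level direct map held entirely in RAM, one logical address stored per spare (out-of-band) area, and spare areas initialized to an invalid marker on erase. The structures are illustrative, not those of any particular FTL:

#include <stdint.h>

#define NPAGES 4096
#define INVALID 0xFFFFFFFFu

/* Direct map: logical page -> physical page; at least the hot part
 * is kept in SRAM for fast lookup. */
static uint32_t direct[NPAGES];

/* Inverse map: each physical page records the logical address it
 * holds, typically in the page's spare area, so the mapping can be
 * rebuilt by scanning flash after a crash. */
struct spare { uint32_t logical; /* plus ECC, state flags, ... */ };
static struct spare oob[NPAGES];

uint32_t translate(uint32_t lpn)          /* array-lookup translation */
{
    return lpn < NPAGES ? direct[lpn] : INVALID;
}

void rebuild_direct_map(void)             /* crash recovery by scan */
{
    for (uint32_t p = 0; p < NPAGES; p++)
        if (oob[p].logical < NPAGES)      /* INVALID marks free pages */
            direct[oob[p].logical] = p;
}

The sketch shows the complementary roles described above: translate() is the fast path, and rebuild_direct_map() is the recovery path that only the inverse map makes possible.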

Conclusion

Flash is a promising memory technology that has been developing for decades. Since in-place updates are not supported, address translation becomes an indispensable technique for making flash memory a new layer in the modern storage hierarchy. The FTL is a software layer, built into the firmware or the system software, that implements address translation, garbage collection, wear leveling, and so forth, and wraps the flash memory into a normal block device like a magnetic disk. Unfortunately, many of these technologies are either undocumented or described only in patents. This survey provides a broad overview of some typical state-of-the-art address translation technologies described in patents, journals, and conference proceedings. We hope that this survey will facilitate future research and development of flash-based applications. This article proposed a Multi-Level Cell (MLC) flash translation layer, MFTL, for MLC flash-memory storage systems. It provides a transparent way for existing file systems to directly access large-capacity MLC flash memory. Observing that most existing user applications access NAND flash memory through the FAT file system, this article takes the constraints of MLC flash memory and the access behaviors of the FAT file system into consideration. The proposed MFTL scheme treats different access behaviors adaptively. It adopts a hybrid mapping approach to balance performance against the RAM requirement. Since data integrity is an important issue for storage systems, MFTL also provides an effective and efficient mechanism for recovery from a system crash. An extensive set of experiments has been conducted to evaluate the performance of the proposed MFTL. Our experimental results show that the proposed scheme achieves good performance in terms of extra page writes and block erasures, while the RAM requirement is largely reduced compared with existing schemes.

SCMFS: A File System for Storage Class Memory and its Extensions

Modern computer systems have been built around the assumption that persistent storage is accessed via a slow, block-based interface. However, emerging nonvolatile memory technologies, sometimes referred to as storage class memory (SCM), are poised to revolutionize storage systems. SCM devices can be attached directly to the memory bus and offer fast, fine-grained access to persistent storage. In this article, we propose a new file system, SCMFS, which is specially designed for storage class memory. SCMFS is implemented on the virtual address space and utilizes the existing memory management module of the operating system to help manage the file system space. As a result, we largely simplified the file system operations of SCMFS, which allowed us to better explore the performance gains from SCM. We have implemented a prototype in Linux and evaluated its performance through multiple benchmarks. The experimental results show that SCMFS outperforms other memory-resident file systems (tmpfs, ramfs, and ext2 on a RAM disk) and achieves about 70% of memory bandwidth for file read/write operations. With compatibility in mind, this file system exports interfaces identical to those of regular file systems, so that all existing applications can work on it. In this file system, we aim to minimize the CPU overhead of file system operations.
We build our file system on the virtual memory space and utilize the memory management unit (MMU) to map file system addresses to physical addresses on SCM. The layouts in both the physical and virtual address spaces are very simple. We also keep the space contiguous for each file in SCMFS, to simplify the handling of read/write requests in the file system. We will show, through results based on multiple benchmarks, that the simplicity of SCMFS makes it easy to implement and improves performance. In this article, we focus on building a file system on a single processor; our approach can be extended to larger, multiprocessor systems through distributed shared memory.

The primary contributions of this article are as follows.

This article proposes a new, low-overhead file system for storage class memory. The proposed file system, SCMFS, largely simplifies regular file operations by leveraging the existing memory management module of the operating system. We employ superpages within our file system to relieve the pressure on system resources such as the TLB and to further improve performance. We show that SCMFS outperforms tmpfs, ramfs, and ext2 on a RAM disk and achieves about 70% of memory bandwidth for file read/write operations. The remainder of the article is organized as follows: we discuss related work in Section 2; Section 3 provides the design and implementation details of SCMFS, which are evaluated in Section 4; Section 5 presents further extensions of SCMFS; and Section 6 concludes the article.

A number of file systems have been designed for flash devices (YAFFS; Woodhouse [2011]). BPFS [Condit et al. 2009] is a file system designed for nonvolatile byte-addressable memory, which uses shadow-paging techniques to provide fast and consistent updates. Our file system aims to simplify the design and eliminate unnecessary overheads to improve performance. DFS [Josephson et al. 2010] is the file system most similar to our own. DFS incorporates the functionality of block management in the device driver and firmware to simplify the file system, and also keeps files contiguous in a huge address space. It is designed for a PCIe-based SSD device by FusionIo and relies on specific features of that hardware. Since there is no Linux kernel-level implementation of BPFS, we did not compare our performance to BPFS.

In this work, we aim to design a file system for SCM devices. With traditional persistent storage devices, the overhead brought by I/O latency is much higher than that of the file system layer itself, so storage system performance usually depends on the device's characteristics and the performance of the I/O scheduler. However, when the storage device is directly attached to the memory bus, it shares critical system resources with main memory: the bandwidth of the memory bus, the CPU caches, and the TLBs for both instructions and data. We believe that lower file system complexity can reduce the CPU overhead in the storage system and thus improve total performance. Our design is motivated by the need to minimize the number of operations required to carry out file system requests. SCMFS simplifies its data structures by allocating contiguous logical address space within each file. We build the file system on the virtual address space, which can be larger than the physical address space of the SCM. Each file is allocated a contiguous virtual address space, while the file data may be stored in physical pages that are not contiguous. With this per-file contiguity, we do not need complicated data structures to keep track of the logical address space; we simply store the start address and the size of each file. This mechanism significantly simplifies the processing of read/write requests: to locate the requested data, we just add the offset to the start address of the file. The actual physical location of the data is available through the page-mapping data structures, again leveraging the system infrastructure.
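A minimal sketch of this per-file bookkeeping, assuming each file already occupies a contiguous virtual range mapped by the MMU; the names are illustrative, not SCMFS's internal structures:

#include <stddef.h>
#include <string.h>

/* SCMFS-style bookkeeping, sketched: because each file occupies a
 * contiguous range of virtual addresses, per-file metadata reduces
 * to a start address and a size; the MMU's page tables map the
 * range onto (possibly scattered) physical SCM pages. */
struct scm_file { char *start; size_t size; };

/* Locating request data is a single addition, not a block-map walk. */
size_t scm_read(const struct scm_file *f, size_t off, void *buf, size_t n)
{
    if (off >= f->size) return 0;
    if (n > f->size - off) n = f->size - off;
    memcpy(buf, f->start + off, n);   /* MMU resolves physical pages */
    return n;
}

The design choice the sketch highlights is that the file system delegates all scatter/gather work to hardware page translation, which is exactly what allows the file-system code path to stay short.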

REFERENCES

1. Gold, D. E., & Kuck, D. J. (1974). A model for masking rotational latency by dynamic disk allocation. Communications of the ACM, 17, 278–288. doi:10.1145/360980.361006

2. Morgan, H. L. (1974). Optimal space allocation on disk storage devices. Communications of the ACM, 17(3), 139–142. doi:10.1145/360860.360867

3. Shen, K. K., & Peterson, J. L. (1974). A weighted buddy method for dynamic storage allocation. Communications of the ACM, 17, 558–562. doi:10.1145/355620.361164

4. Burton, W. (1976). A buddy system variation for disk storage allocation. Communications of the ACM, 19(7), 416–417. doi:10.1145/360248.360259

5. Peterson, J. L., & Norman, T. A. (1977). Buddy systems. Communications of the ACM, 20, 421–431. doi:10.1145/359605.359626

6. Du, H. C., & Sobolewski, J. S. (1982). Disk allocation for Cartesian product files on multiple-disk systems. ACM Transactions on Database Systems, 7(1), 82–101. doi:10.1145/319682.319698

7. Hecht, M. S., & Gabbe, J. D. (1983). Shadowed management of free disk pages with a linked list. ACM Transactions on Database Systems, 8(4), 503–514. doi:10.1145/319996.320002

8. Walker, B., Popek, G., English, R., Kline, C., & Thiel, G. (1983). The LOCUS distributed operating system. Proceedings of the Ninth ACM Symposium on Operating Systems Principles (SOSP '83), 49–70.

9. Koch, P. D. L. (1987). Disk file allocation based on the buddy system. ACM Transactions on Computer Systems, 5(4), 352–370. doi:10.1145/29868.29871

10. Levy, E., & Silberschatz, A. (1990). Distributed file systems: Concepts and examples. ACM Computing Surveys, 22(4), 321–374.

11. Abdel-Ghaffar, K. A. S., & El Abbadi, A. (1993). Optimal disk allocation for partial match queries. ACM Transactions on Database Systems, 18(1), 132–156. doi:10.1145/151284.151288

12. Krieger, O., & Stumm, M. (1997). HFS: A performance-oriented flexible file system based on building-block compositions. ACM Transactions on Computer Systems, 15(3), 286–321. doi:10.1145/263326.263356

13. Prabhakar, S., Agrawal, D., & El Abbadi, A. (1998). Efficient disk allocation for fast similarity searching. Proceedings of the Tenth ACM Symposium on Parallel Algorithms and Architectures (SPAA '98), 78–87.

14. Hess, C. K., & Campbell, R. H. (2003). An application of a context-aware file system. Personal and Ubiquitous Computing, 7(6), 339–352. doi:10.1007/s00779-003-0250-y

15. Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5), 29–43. doi:10.1145/1165389.945450

16. Gal, E., & Toledo, S. (2005). Algorithms and data structures for flash memories. ACM Computing Surveys, 37(2), 138–163. doi:10.1145/1089733.1089735

17. Sivathanu, M., Prabhakaran, V., Arpaci-Dusseau, A. C., & Arpaci-Dusseau, R. H. (2005). Improving storage system availability with D-GRAID. ACM Transactions on Storage, 1(2), 133–170. doi:10.1145/1063786.1063787

18. Kang, S., & Reddy, A. L. N. (2006). An approach to virtual allocation in storage systems. ACM Transactions on Storage, 2(4), 371–399. doi:10.1145/1210596.1210597

19. Wang, A.-I. A., Kuenning, G., Reiher, P., & Popek, G. (2006). The Conquest file system. ACM Transactions on Storage, 2(3), 309–348. doi:10.1145/1168910.1168914

20. Agrawal, N., Bolosky, W. J., Douceur, J. R., & Lorch, J. R. (2007). A five-year study of file-system metadata. ACM Transactions on Storage, 3(3).

21. Cipar, J., Corner, M. D., & Berger, E. D. (2007). Contributing storage using the transparent file system. ACM Transactions on Storage, 3(3), Article 12. doi:10.1145/1288783.1288787

22. Batsakis, A., Burns, R., Kanevsky, A., Lentini, J., & Talpey, T. (2009). CA-NFS: A congestion-aware network file system. ACM Transactions on Storage, 5(4), 1–24. doi:10.1145/1629080.1629085

23. Thomasian, A., & Blaum, M. (2009). Higher reliability redundant disk arrays. ACM Transactions on Storage, 5(3), 1–59. doi:10.1145/1629075.1629076

24. Ryu, S., Lee, C., Yoo, S., & Seo, S. (2010). Flash-aware cluster allocation method based on filename extension for FAT file system. Proceedings of the 2010 ACM Symposium on Applied Computing (SAC '10), 502. doi:10.1145/1774088.1774192

25. Jung, J., Won, Y., Kim, E., Shin, H., & Jeon, B. (2010). FRASH: Exploiting storage class memory in hybrid file system for hierarchical storage. ACM Transactions on Storage, 6(1), 1–25. doi:10.1145/1714454.1714457

26. Shin, D. I., Yu, Y. J., Kim, H. S., Eom, H., & Yeom, H. Y. (2011). Request bridging and interleaving: Improving the performance of small synchronous updates under seek-optimizing disk subsystems. ACM Transactions on Storage, 7(2), 1–31. doi:10.1145/1970348.1970349

27. Wu, X., & Reddy, A. L. N. (2011). SCMFS: A file system for storage class memory. Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 1–11. doi:10.1145/2063384.2063436

28. Hsieh, J.-W., Wu, C.-H., & Chiu, G.-M. (2012). MFTL: A design and implementation for MLC flash memory storage systems. ACM Transactions on Storage, 8(2), 1–29. doi:10.1145/2180905.2180908

29. Paulo, J., & Pereira, J. (2014). A survey and classification of storage deduplication systems. ACM Computing Surveys, 47(1), 1–30. doi:10.1145/2611778

FILE ALLOCATION TABLE

Microsoft was granted a series of patents for key parts of the FAT file system in the 1990s. All four patents relate to the long-filename extension to FAT that was first seen in Windows 95.

Limitations when using the FAT32 file system with Windows XP:

Clusters cannot be 64 kilobytes (KB) or larger. If clusters are 64 KB or larger, some programs (such as Setup programs) may incorrectly calculate disk space.

A FAT32 volume must contain a minimum of 65,527 clusters. You cannot increase the cluster size on a volume that uses the FAT32 file system so that it contains fewer than 65,527 clusters.

The maximum disk size is approximately 8 terabytes when you take into account the following variables: The maximum possible number of clusters on a FAT32 volume is 268,435,445, and there is a maximum of 32 KB per cluster, along with the space required for the file allocation table (FAT).
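The arithmetic behind the approximate 8-terabyte figure (taking 32 KB as $2^{15}$ bytes):

\[
268{,}435{,}445\ \text{clusters} \times 32\ \text{KB/cluster}
= 268{,}435{,}445 \times 2^{15}\ \text{bytes}
\approx 2^{43}\ \text{bytes} \approx 8\ \text{TB},
\]

to which the space required for the FAT itself is added.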

You cannot decrease the cluster size on a FAT32 volume so that the size of the FAT is larger than 16 megabytes (MB) minus 64 KB.

You cannot format a volume larger than 32 gigabytes (GB) in size using the FAT32 file system during the Windows XP installation process. Windows XP can mount and support FAT32 volumes larger than 32 GB (subject to the other limits), but you cannot create a FAT32 volume larger than 32 GB by using the Format tool during Setup. If you need to format a volume that is larger than 32 GB, use the NTFS file system to format it. Another option is to start from a Microsoft Windows 98 or Microsoft Windows Millennium Edition (Me) Startup disk and use the Format tool included on the disk.

FILE SYSTEM

Table 3. Comparison of file systems (columns: Author, Publication and References, Year, File System, Operating System, Characteristics).

Table 10.1 shows a comparison of FAT16 and FAT32.

Table 10.1 FAT16 and FAT32 comparison

FAT16: Most operating systems (MS-DOS, Windows 98, Windows NT, OS/2, and UNIX) are designed to implement and use it.
FAT32: Currently, FAT32 can be used only in Windows 98 and Windows 95 OSR2.

FAT16: It is efficient, both in speed and storage, on logical drives smaller than 256 MB.
FAT32: Drives smaller than 512 MB are not supported by FAT32.

FAT16: Disk compression, such as DrvSpace, is supported.
FAT32: Disk compression is not supported with FAT32.

FAT16: FAT16 is limited to 65,525 clusters, with each cluster fixed in size relative to the logical drive. If both the maximum quantity of clusters and their maximum size (32 KB) are reached, the largest drive is limited to 2 GB.
FAT32: FAT32 allows for (x) clusters; therefore, the largest drive can be up to 2 terabytes, based on the 32 KB cluster-size limitation.

FAT16: Storing files on a FAT16 system can be inefficient on larger drives as the size of the cluster increases. The space allocated for storing a file is based on the cluster allocation granularity, not the file size; a 10 KB file stored in a 32 KB cluster wastes 22 KB of disk space.
FAT32: FAT32 cluster sizes are 4 KB (drives less than 800 MB).

Cluster Sizes of FAT16 and FAT32

The largest possible file for a FAT32 drive is 4 GB minus