Distributed file system for cloud

A distributed file system for cloud is a file system that allows many clients to access the same data and supports the usual operations on it (create, delete, modify, read, write). Each file may be partitioned into several parts called chunks, and each chunk is stored on a remote machine, which facilitates the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture, and each solution must suit a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured: confidentiality, availability and integrity are the main keys to a secure system. Thanks to cloud computing, users can share resources from any computer or device, anywhere, through the internet. Cloud computing is typically characterized by scalable and elastic resources, such as physical servers, applications and services, that are virtualized and allocated dynamically. Synchronization is therefore required to make sure that all devices are up to date. Distributed file systems also enable enterprises of any size to store and access their remote data exactly as they do local data, facilitating the use of variable resources.

Overview

History

Today, there are many implementations of distributed file systems. The first file servers were developed by researchers in the 1970s, and Sun's Network File System became available in the 1980s. Before that, people who wanted to share files used the sneakernet method. Once computer networks started to progress, it became obvious that the existing file systems had many limitations and were unsuitable for multi-user environments. At first, many users used FTP to share files.[1] FTP started running on the PDP-10 at the end of 1973. Even with FTP, files needed to be copied from the source computer onto a server and then from the server onto the destination computer, which forced users to know the physical addresses of all the computers involved in the file sharing.[2]

Supporting techniques

Cloud computing uses several important techniques to improve the performance of the whole system. Modern data centers provide a huge environment with data center networking (DCN), consisting of a large number of computers with different storage capacities. The MapReduce framework has shown its performance with data-intensive computing applications in parallel and distributed systems. Moreover, virtualization is employed to provide dynamic resource allocation and to allow multiple operating systems to coexist on the same physical server.

Applications

Cloud computing provides large-scale computing thanks to its ability to give the user the needed CPU and storage resources with complete transparency. This makes it well suited to applications that require large-scale distributed processing. That kind of data-intensive computing needs a high-performance file system that can share data between virtual machines (VMs).[3]

The cloud computing and cluster computing paradigms are becoming increasingly important in industrial data processing and in scientific applications, such as astronomy and physics, which frequently demand a huge number of computers to carry out the required experiments. Cloud computing represents a new way of using computing infrastructure: users dynamically allocate the resources they need, release them once they are finished, and pay only for what they use instead of paying for resources for a period fixed in advance (the pay-as-you-go model). Such services are often provided in the context of a service-level agreement.[4]

Architectures

Most distributed file systems are built on the client-server architecture, but decentralized solutions exist as well.

Client-server architecture

Remote access model

NFS is one of the most widely used systems based on this architecture. It makes it possible to share files between a number of machines on a network as if they were located locally. It provides a standardized view of the local file system. The NFS protocol allows heterogeneous client processes, possibly running on different operating systems and machines, to access files on a distant server, ignoring the actual location of the files. However, relying on a single server makes the NFS protocol suffer from low availability and poor scalability. Using multiple servers does not solve the problem, since each server works independently.[5] The model of NFS is the remote file service. This model is also called the remote access model, in contrast with the upload/download model:

• remote access model: provides transparency; the client has access to a file and sends requests to the remote file, which remains on the server.[6]
• upload/download model: the client can access the file only locally. It has to download the file, make the modifications and upload it again so that it can be used by other clients.

The file system offered by NFS is almost the same as the one offered by Unix systems. Files are hierarchically organized into a naming graph in which directories and files are represented by nodes.

Cluster-based architectures

This architecture is an improvement on the client-server architecture that improves the execution of parallel applications. The technique used here is file striping: a file is split into several segments that are stored on multiple servers, so that different parts of the file can be accessed in parallel. If the application does not benefit from this technique, it may be more convenient to simply store different files on different servers. File striping becomes more interesting when organizing a distributed file system for large data centers, such as those of Amazon and Google, that offer services to web clients allowing multiple operations (reading, updating, deleting, ...) on a huge number of files distributed among a massive number of computers. Note that a massive number of computers means more hardware, and thus a higher probability of hardware failures.[7] Two of the most widely used DFS are the Google File System and the Hadoop Distributed File System. In both systems, the file system is implemented by user-level processes running on top of a standard operating system (in the case of GFS, Linux).[8]
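The file-striping idea described above can be illustrated with a minimal sketch. It assumes a simple round-robin placement and in-memory "servers"; the function names `stripe` and `reassemble` are hypothetical and do not come from any particular file system.

```python
from typing import Dict, List, Tuple


def stripe(data: bytes, num_servers: int, stripe_size: int) -> Dict[int, List[Tuple[int, bytes]]]:
    """Assign fixed-size segments of `data` to servers in round-robin order.

    Each server keeps (segment_index, segment) pairs so the file can be
    rebuilt later. Round-robin placement is an assumption of this sketch.
    """
    placement: Dict[int, List[Tuple[int, bytes]]] = {s: [] for s in range(num_servers)}
    for index, offset in enumerate(range(0, len(data), stripe_size)):
        segment = data[offset:offset + stripe_size]
        placement[index % num_servers].append((index, segment))
    return placement


def reassemble(placement: Dict[int, List[Tuple[int, bytes]]]) -> bytes:
    """Collect the segments from all servers and join them in index order."""
    pairs = [pair for segments in placement.values() for pair in segments]
    return b"".join(segment for _, segment in sorted(pairs))
```

Because consecutive segments land on different servers, a client could in principle fetch them in parallel, which is the benefit the striping technique aims for.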

Design principles

Goals

GFS and HDFS are specifically built for handling batch processing on very large data sets. To that end, the following assumptions must be taken into account:[9]

• High availability: the cluster can contain thousands of file servers, and some of them can be down at any time
• A server belongs to a rack, a room, a data center, a country and a continent, so that its geographical location can be identified precisely
• The size of a file can vary from many gigabytes to many terabytes. The file system should be able to support a massive number of files
• Append operations must be supported, and file contents must be visible even while a file is being written
• Communication among working machines is reliable: TCP/IP is used with a remote procedure call (RPC) communication abstraction. TCP allows the client to know almost immediately that there is a problem, so that it can try to set up a new connection.[10]

Figures: load balancing and rebalancing after a file is deleted and after a new server is added.

Load balancing is essential for efficient operations in distributed environments. It means distributing the amount of work to do between different servers[11] in order to get more work done in the same amount of time and serve clients faster. In this case, consider a large-scale distributed file system. The system contains N chunkservers in a cloud (N can be 1000, 10000, or more), where a certain number of files are stored. Each file is split into several parts or chunks of fixed size (for example 64 megabytes). The load of each chunkserver is proportional to the number of chunks hosted by the server.[12] In a load-balanced cloud, the resources can be well used while maximizing the performance of MapReduce-based applications.

In a cloud computing environment, failure is the norm,[13][14] and chunkservers may be upgraded, replaced, and added in the system. Files can also be dynamically created, deleted, and appended. That leads to load imbalance in a distributed file system, meaning that the file chunks are not distributed equitably between the nodes.

Distributed file systems in clouds such as GFS and HDFS rely on central servers (the master for GFS and the NameNode for HDFS) to manage the metadata and the load balancing. The master rebalances replicas periodically: data must be moved from one DataNode/chunkserver to another if its free space falls below a certain threshold.[15] However, this centralized approach can turn those servers into a bottleneck, as they become unable to manage a large number of file accesses. Handling load imbalance on the central nodes therefore makes the situation worse, because it adds to their already heavy load. The load rebalance problem is NP-hard.[16]

To make a large number of chunkservers work in collaboration and to solve the load-balancing problem in distributed file systems, several approaches have been proposed, such as reallocating file chunks so that they are distributed as uniformly as possible while keeping the movement cost as low as possible.[12]
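As a rough illustration of chunk reallocation, the sketch below uses a simple greedy heuristic: it moves one chunk at a time from the most loaded chunkserver to the least loaded one. This is an assumption made for illustration only, not the algorithm of the cited work, and the function name `rebalance` is hypothetical. The load of a server is taken to be its number of chunks, as in the text.

```python
from typing import Dict, List, Tuple


def rebalance(loads: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return a list of (source, destination) single-chunk moves.

    Greedy heuristic (an assumption of this sketch): repeatedly move one
    chunk from the most loaded server to the least loaded one until the
    two differ by at most one chunk.
    """
    loads = dict(loads)  # work on a copy, leave the caller's data untouched
    moves: List[Tuple[str, str]] = []
    while True:
        heavy = max(loads, key=loads.get)
        light = min(loads, key=loads.get)
        if loads[heavy] - loads[light] <= 1:
            return moves
        loads[heavy] -= 1
        loads[light] += 1
        moves.append((heavy, light))
```

The length of the returned list is the movement cost of this heuristic. A real system would also have to weigh network distance and replica placement constraints, which this sketch ignores.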

Google file system architecture

Figure: splitting a file.
The main article for this category is Google File System.
Description

Among the biggest internet companies, Google has created its own distributed file system, named Google File System (GFS), to meet the rapidly growing demands of Google's data processing needs, and it is used for all cloud services. GFS is a scalable distributed file system for data-intensive applications. It provides fault-tolerant data storage and offers high performance to a large number of clients.

GFS is used together with MapReduce, which allows users to create programs and run them on multiple machines without thinking about parallelization and load-balancing issues. The GFS architecture is based on a single master, multiple chunkservers and multiple clients.[17]

The master server, running on a dedicated node, is responsible for coordinating storage resources and managing files' metadata (such as the equivalent of inodes in classical file systems).[9] Each file is split into multiple chunks of 64 megabytes. Each chunk is stored on a chunkserver. A chunk is identified by a chunk handle, a globally unique 64-bit number that is assigned by the master when the chunk is first created.

As said previously, the master maintains all of the files' metadata, including their names, directories and the mapping of files to the list of chunks that contain each file's data. The metadata is kept in the master's main memory. Updates to this data are logged to an operation log on disk, which is also replicated onto remote machines. When the log becomes too large, a checkpoint is made, and the main-memory data is stored in a B-tree structure so that it can easily be mapped back into main memory.[18]
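The master's in-memory metadata described above can be pictured with a small sketch. The class and method names (`Master`, `create_chunk`, `lookup`) are hypothetical, the handle counter is a stand-in for however the real master generates handles, and the operation log is a plain list rather than a replicated on-disk log.

```python
import itertools
from typing import Dict, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated in the text


class Master:
    """Toy model of the master's metadata: file path -> list of chunk handles."""

    def __init__(self) -> None:
        self._next_handle = itertools.count(1)
        self.file_chunks: Dict[str, List[int]] = {}
        self.operation_log: List[Tuple[str, str, int]] = []

    def create_chunk(self, path: str) -> int:
        """Assign a new 64-bit handle and append it to the file's chunk list."""
        handle = next(self._next_handle) & 0xFFFFFFFFFFFFFFFF
        # Log the update before applying it, mirroring the operation log idea.
        self.operation_log.append(("create_chunk", path, handle))
        self.file_chunks.setdefault(path, []).append(handle)
        return handle

    def lookup(self, path: str, byte_offset: int) -> int:
        """Translate a byte offset in a file into the handle of its chunk."""
        return self.file_chunks[path][byte_offset // CHUNK_SIZE]
```

With fixed-size chunks, the chunk index is simply the byte offset divided by the chunk size, so a client only needs one metadata lookup before talking to a chunkserver.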

Fault tolerance

For fault tolerance, a chunk is replicated onto multiple chunkservers, three by default.[19] A chunk is available as long as at least one chunkserver holds it. The advantage of this scheme is its simplicity. The master is responsible for allocating the chunkservers for each chunk and is contacted only for metadata information. For all other data, the client has to interact with the chunkservers.

Moreover, the master keeps track of where a chunk is located. However, it does not attempt to maintain the chunk locations precisely, but only occasionally contacts the chunkservers to see which chunks they have stored.[20] GFS is a scalable distributed file system for data-intensive applications.[21] The master does not become a bottleneck despite all the work it has to do. In fact, when a client wants to access data, it communicates with the master only to find out which chunkserver holds that data. After that, the communication takes place between the client and the chunkserver concerned.

In GFS, most files are modified by appending new data rather than overwriting existing data. In fact, once written, files are usually only read, and often only sequentially rather than randomly, which makes this DFS most suitable for scenarios in which many large files are created once but read many times.[22][23]

File process

When a client wants to write to or update a file, the master designates a replica for this operation. This replica is the primary replica, since it is the first one to get the modification from clients. The writing process is decomposed into two steps:[9]

• sending: First, the client contacts the master to find out which chunkservers hold the data. The client is given a list of replicas identifying the primary chunkserver and the secondary ones. The client then contacts the nearest replica chunkserver and sends the data to it. This server sends the data to the next closest one, which forwards it to yet another replica, and so on. At this point the data has been propagated but not yet written to a file (it sits in a cache).
• writing: When all the replicas have received the data, the client sends a write request to the primary chunkserver, identifying the data that was sent in the sending phase. The primary assigns a sequence number to each write operation it has received, applies the writes to the file in serial-number order, and forwards the write requests in that order to the secondaries. Meanwhile, the master is kept out of the loop.

Consequently, two types of flows can be distinguished: the data flow and the control flow. The data flow is associated with the sending phase and the control flow with the writing phase. This ensures that the primary chunkserver controls the order of writes. Note that when the master grants the write operation to a replica, it increments the chunk version number and informs all of the replicas containing that chunk of the new version number. Chunk version numbers make it possible to detect a replica that missed an update because its chunkserver was down.[24]

It seems that some new Google applications did not work well with the 64-megabyte chunk size. To address that, Google started in 2004 to implement the BigTable approach.[1]
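The separation between data flow and control flow can be sketched as follows. This is a simplified, single-process model under several assumptions: the class names `Replica` and `Primary`, the `data_id` strings and the helper `stale_replicas` are all hypothetical, and network transfer, failures and the master itself are left out.

```python
from typing import Dict, List


class Replica:
    """A chunk replica: caches pushed data, then applies writes in order."""

    def __init__(self) -> None:
        self.cache: Dict[str, bytes] = {}  # data received in the sending phase
        self.chunk = bytearray()           # data actually written to the chunk
        self.version = 0                   # chunk version number

    def receive(self, data_id: str, data: bytes) -> None:
        """Data flow: store the pushed data without writing it yet."""
        self.cache[data_id] = data

    def apply(self, serial: int, data_id: str) -> None:
        """Control flow: write previously cached data to the chunk."""
        self.chunk += self.cache.pop(data_id)


class Primary(Replica):
    """The primary replica chooses the write order for all replicas."""

    def __init__(self, secondaries: List[Replica]) -> None:
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id: str) -> int:
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)          # apply locally first
        for secondary in self.secondaries:   # then forward in the same order
            secondary.apply(serial, data_id)
        return serial


def stale_replicas(replicas: List[Replica], current_version: int) -> List[Replica]:
    """Replicas whose version number is behind missed at least one update."""
    return [r for r in replicas if r.version < current_version]
```

Because only the primary hands out serial numbers, every replica applies the writes in the same order even when several clients write concurrently.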

Hadoop distributed file system

The main article for this category is Apache Hadoop.

HDFS, hosted by the Apache Software Foundation, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes). Its architecture is similar to that of GFS, i.e. a master/slave architecture. HDFS is normally installed on a cluster of computers. The design of Hadoop is based on Google's: Google File System, Google MapReduce and BigTable correspond respectively to the Hadoop Distributed File System (HDFS), Hadoop MapReduce and Hadoop Base (HBase).[25]

An HDFS cluster consists of a single NameNode and several DataNode machines. The NameNode, a master server, manages and maintains the metadata of the storage DataNodes in its RAM. DataNodes manage the storage attached to the nodes that they run on. The NameNode and DataNode are software programs designed to run on everyday-use machines, which typically run a GNU/Linux OS. HDFS can be run on any machine that supports Java, and such a machine can therefore run either the NameNode or the DataNode software.[26]

More explicitly, a file is split into one or more equal-size blocks, except for the last block, which may be smaller. Each block is stored on multiple DataNodes and may be replicated to guarantee high availability. By default, each block is replicated three times, a process called "Block Level Replication".[27]

The NameNode manages file system namespace operations, such as opening, closing and renaming files and directories, and regulates file access. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients, managing block allocation and deletion, and replicating blocks.[28]
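The block layout just described can be sketched in a few lines. The block size is left as a parameter because the text does not fix one, and the round-robin placement in `place_replicas` is purely an assumption of this sketch, not the placement policy of HDFS. Both function names are hypothetical.

```python
from typing import List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the block lengths: all equal to block_size except possibly the last."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, datanodes: List[str], replication: int = 3) -> List[List[str]]:
    """Choose `replication` distinct DataNodes per block (round-robin, illustrative only)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return [
        [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    ]
```

For example, a 300-unit file with 128-unit blocks yields two full blocks and a last block of 44 units, each of which is then assigned to three different DataNodes.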

When a client wants to read or write data, it contacts the NameNode and the NameNode checks where the data should be read from or written to. After that, the client has the location of the DataNode and can send read or write requests to it.

The HDFS is typically characterized by its compatibility with data rebalancing schemes. In general, managing the free space on a DataNode is very important. Data must be moved from one DataNode to another one if its free space is not adequate, and in the case of creating additional replicas, data should move to assure the balance of the system.[27]

Other examples

Distributed file systems can be classified into two categories. The first category consists of DFS designed for internet services, such as GFS. The second category includes DFS that support intensive applications usually executed in parallel.[29] Some examples from the second category are Ceph FS, Fraunhofer File System (FhGFS), Lustre File System, IBM General Parallel File System (GPFS) and Parallel Virtual File System.

The Ceph file system is a distributed file system that provides excellent performance and reliability.[30] It addresses several challenges: dealing with huge files and directories, coordinating the activity of thousands of disks, providing parallel access to metadata on a massive scale, handling both scientific and general-purpose workloads, authenticating and encrypting at scale, and growing or shrinking dynamically because of frequent device decommissioning, device failures and cluster expansions.[31]

FhGFS is the high-performance parallel file system from the Fraunhofer Competence Centre for High Performance Computing. Its distributed metadata architecture has been designed to provide the scalability and flexibility needed to run the most widely used HPC applications.[32]

Lustre File System has been designed and implemented to deal with the issue of bottlenecks traditionally found in distributed systems. Lustre is characterized by its efficiency, scalability and redundancy.[33] GPFS was also designed with the goal of removing the bottlenecks.[34]

Communication

The high performance of distributed file systems requires efficient communication between computing nodes and fast access to the storage system. Operations such as open, close, read, write, send and receive should be fast to ensure that performance. Note that each read or write request accesses the remote disk, which may take a long time because of network latencies.[35]

The data communication (send/receive) operation transfers the data from the application buffer to the kernel on the machine. TCP controls the process of sending data and is implemented in the kernel. However, in case of network congestion or errors, TCP may not send the data directly. While data is being transferred from a buffer in the kernel to the application, the machine does not read the byte stream from the remote machine. In fact, TCP is responsible for buffering the data for the application.[36]

Providing a high level of communication performance can be done by choosing the buffer size for file reading and writing, or file sending and receiving, at the application level. Specifically, the buffer mechanism is built as a circular linked list.[37] It consists of a set of BufferNodes. Each BufferNode has a DataField, which contains the data, and a pointer called NextBufferNode that points to the next BufferNode. To keep track of the current position, two pointers are used, CurrentBufferNode and EndBufferNode, which represent the last written position and the last read position in the buffer. If the buffer has no free space, it sends a wait signal to the client, telling it to wait until space is available.[38]
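A minimal sketch of such a circular linked-list buffer is given below. It keeps the BufferNode, DataField and NextBufferNode ideas from the text, but the attribute names `write_node` and `read_node`, and the choice to signal "wait" by returning `False`, are assumptions of this sketch rather than details from the cited work.

```python
from typing import Optional


class BufferNode:
    """One slot of the ring: a data field and a pointer to the next node."""

    def __init__(self) -> None:
        self.data_field: Optional[bytes] = None
        self.next_buffer_node: "BufferNode" = self


class CircularBuffer:
    def __init__(self, size: int) -> None:
        nodes = [BufferNode() for _ in range(size)]
        for i, node in enumerate(nodes):
            node.next_buffer_node = nodes[(i + 1) % size]  # close the ring
        self.write_node = nodes[0]  # next position to write
        self.read_node = nodes[0]   # next position to read

    def write(self, data: bytes) -> bool:
        """Store data; return False (the 'wait' signal) when the ring is full."""
        if self.write_node.data_field is not None:
            return False
        self.write_node.data_field = data
        self.write_node = self.write_node.next_buffer_node
        return True

    def read(self) -> Optional[bytes]:
        """Return the oldest unread data, or None when the ring is empty."""
        data = self.read_node.data_field
        if data is None:
            return None
        self.read_node.data_field = None  # free the slot for the writer
        self.read_node = self.read_node.next_buffer_node
        return data
```

The ring never reallocates memory: a slot freed by a read becomes available to the writer again, which is why a full buffer only requires the sender to wait.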

Cloud-based Synchronization of Distributed File System

More and more users have multiple devices with ad hoc connectivity. These devices need to be synchronized. In fact, an important point is to maintain user data by synchronizing replicated data sets between an arbitrary number of servers. This is useful for backups and also for offline operation. Indeed, when the user's network conditions are poor, the user's device selectively replicates the part of the data that will be modified later, offline. Once network conditions improve, the device synchronizes.[39] Two approaches exist to tackle the distributed synchronization issue: user-controlled peer-to-peer synchronization and cloud master-replica synchronization.[39]

• user-controlled peer-to-peer: software such as rsync must be installed on every user computer that contains the data. The files are synchronized peer-to-peer, and users have to supply the network addresses of all the devices and the synchronization parameters, which makes it a manual process.
• cloud master-replica synchronization: widely used by cloud services. A master replica that contains all the data to be synchronized is retained as a central copy in the cloud, and all updates and synchronization operations are pushed to this central copy, offering a high level of availability and reliability in case of failures.

Security keys

In cloud computing, the most important security concepts are confidentiality, availability and integrity. In fact, confidentiality becomes indispensable in order to keep private data from being disclosed and maintain privacy. In addition, integrity assures that data is not corrupted.[40]

Confidentiality

Confidentiality means that data and computation tasks are confidential: neither the cloud provider nor other clients can access the data. Much research has been done on confidentiality because it is one of the crucial points that still poses challenges for cloud computing. The lack of trust in cloud providers is a related issue.[41] The cloud infrastructure must therefore ensure that consumers' data is not accessed by any unauthorized person. The environment becomes insecure if the service provider:[42]

• can locate consumer's data in the cloud
• has the privilege to access and retrieve consumer's data
• can understand the meaning of data (types of data, functionalities and interfaces of the application and format of the data).

If these three conditions are satisfied simultaneously, the situation becomes very dangerous.

The geographic location of data stores influences privacy and confidentiality. Furthermore, the location of clients should be taken into account. For example, clients in Europe may not be interested in using datacenters located in the United States, because the confidentiality of their data would not be guaranteed. To address that problem, some cloud computing vendors have included the geographic location of the hosting as a parameter of the service-level agreement made with the customer,[43] allowing users to choose for themselves the locations of the servers that will host their data.

One approach that may help with confidentiality is data encryption;[44] otherwise, there are serious risks of unauthorized use. In the same context, other solutions exist, such as encrypting only sensitive data[45] and supporting only some operations in order to simplify computation.[46] Furthermore, cryptographic techniques and tools such as fully homomorphic encryption (FHE) are also used to strengthen privacy preservation in the cloud.[47]

Availability

Availability is generally achieved through replication.[48][49][50][51] Meanwhile, consistency must be guaranteed. However, consistency and availability cannot both be achieved at the same time: either consistency is relaxed so that the system remains available, or consistency is made a priority and the system is sometimes unavailable.[52] On the other hand, data must have an identity to be accessible. For instance, Skute[48] is a mechanism based on a key/value store that allows dynamic data allocation in an efficient way. Each server is identified by a label of the form “continent-country-datacenter-room-rack-server”. A server can reference multiple virtual nodes, and each node holds a selection of data (or multiple partitions of multiple data). Each piece of data is identified by a key in a key space generated by a one-way cryptographic hash function (e.g. MD5) and is located by the hash value of this key. The key space may be split into multiple partitions, each of which refers to a part of the data. To perform replication, virtual nodes must be replicated and thus referenced by other servers. To maximize data availability and durability, the replicas must be placed on different servers, and the servers should be in different regions, because data availability increases with geographical diversity. The replication process includes an evaluation of the data availability, which must stay above a certain minimum; otherwise, the data is replicated to another chunkserver. Each partition i has an availability value given by the following formula:

$avail_i=\sum_{i=0}^{|s_i|}\sum_{j=i+1}^{|s_i|} conf_i \cdot conf_j \cdot diversity(s_i,s_j)$

where $s_{i}$ are the servers hosting the replicas, $conf_{i}$ and $conf_{j}$ are the confidence values of servers $s_{i}$ and $s_{j}$ (which depend on technical factors such as hardware components and on non-technical ones such as the economic and political situation of a country), and the diversity is the geographical distance between $s_{i}$ and $s_{j}$.[53]
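The formula sums, over every pair of servers hosting a replica, the product of the two confidence values and the diversity of the pair. A small sketch makes this concrete. The text only says that diversity is a geographical distance, so the `diversity` function below is a hypothetical stand-in: it counts how many levels of the six-level server label lie below the prefix the two labels share.

```python
from itertools import combinations
from typing import Dict, List

LEVELS = 6  # continent-country-datacenter-room-rack-server


def diversity(label_a: str, label_b: str) -> int:
    """Hypothetical distance: label levels that lie below the shared prefix."""
    shared = 0
    for part_a, part_b in zip(label_a.split("-"), label_b.split("-")):
        if part_a != part_b:
            break
        shared += 1
    return LEVELS - shared


def availability(servers: List[str], conf: Dict[str, float]) -> float:
    """Sum conf_i * conf_j * diversity(s_i, s_j) over all pairs of replica servers."""
    return sum(
        conf[a] * conf[b] * diversity(a, b)
        for a, b in combinations(servers, 2)
    )
```

Under this stand-in distance, two replicas in the same rack contribute 1 to the sum, while two replicas on different continents contribute 6, so spreading replicas geographically raises the availability value, as the text states.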

Replication is a good solution for ensuring data availability, but it costs a lot in terms of storage space.[54] DiskReduce[54] is a modified version of HDFS that is based on RAID technology (RAID-5 and RAID-6) and allows asynchronous encoding of replicated data. A background process looks for widely replicated data and deletes the extra copies after encoding it. Another approach is to replace replication with erasure coding.[55] In addition, many approaches to ensuring data availability allow data recovery: data is encoded, and if it is lost, it can be recovered from fragments constructed during the encoding phase.[56] Other approaches that apply different mechanisms to guarantee availability include the Reed-Solomon code of Microsoft Azure and RaidNode for HDFS. Google is also working on a new approach based on an erasure coding mechanism.[57]
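The idea of recovering lost data from encoded fragments can be shown with the simplest possible erasure code, a single XOR parity block in the style of RAID-5. This is only a sketch: it is not the encoding used by DiskReduce, Reed-Solomon or any of the systems named above, it assumes all blocks have the same length, and it can rebuild only one lost block. The function names are hypothetical.

```python
from functools import reduce
from operator import xor
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def encode(blocks: List[bytes]) -> List[bytes]:
    """Return the data blocks followed by one parity block."""
    return blocks + [xor_blocks(blocks)]


def recover(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the data blocks when at most one block of the stripe is None."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one lost block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)  # XOR of the rest restores it
    return repaired[:-1]  # drop the parity block
```

With three data blocks and one parity block, the stored volume is 4/3 of the original data, compared with 3 times the original for triple replication, which is the storage saving that motivates erasure coding.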

Until now there is no RAID implementation established for cloud storage.[55]

Integrity

Integrity in cloud computing implies both data integrity and computing integrity. It means that data has to be stored correctly on cloud servers and that failures or incorrect computation have to be detected.

Data integrity is easy to achieve thanks to cryptography (typically through message authentication codes, or MACs, on data blocks).[58]

Data integrity can be affected in different ways, either by malicious events or by administration errors (e.g. during backup and restore, data migration, or changing memberships in P2P systems).[59]

Several mechanisms exist for checking data integrity. For instance:
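As an illustration of MACs on data blocks, the sketch below uses HMAC-SHA256 from the Python standard library. The choice of HMAC-SHA256, the inclusion of the block index in the authenticated message, and the function names are assumptions of this sketch; the text does not prescribe a particular MAC construction.

```python
import hashlib
import hmac
from typing import List


def _mac(key: bytes, index: int, block: bytes) -> bytes:
    # Binding the block index into the MAC prevents blocks from being
    # swapped or reordered without detection.
    return hmac.new(key, index.to_bytes(8, "big") + block, hashlib.sha256).digest()


def tag_blocks(key: bytes, blocks: List[bytes]) -> List[bytes]:
    """Compute one tag per data block before the blocks are sent to the cloud."""
    return [_mac(key, index, block) for index, block in enumerate(blocks)]


def verify_block(key: bytes, index: int, block: bytes, tag: bytes) -> bool:
    """Check a block read back from storage against its stored tag."""
    return hmac.compare_digest(_mac(key, index, block), tag)
```

The client keeps the secret key; any modification of a block on the server changes its MAC, so verification fails when the block is read back.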

• HAIL (High-Availability and Integrity Layer) is a distributed cryptographic system that allows a set of servers to prove to a client that a stored file is intact and retrievable.[60]
• Hach PORs[61] (proofs of retrievability for large files) is based on a symmetric cryptographic system: there is only one verification key, which must be stored in the file to improve its integrity. The method encrypts a file F and then generates a random string named a sentinel that must be added at the end of the encrypted file. The server cannot locate the sentinel, because it is impossible to distinguish from other blocks, so a small change reveals whether the file has been modified.
• PDP (provable data possession) checking: a class of efficient and practical methods for checking data integrity on untrusted servers:
    • PDP:[62] Before storing the data on a server, the client must store some metadata locally. At a later time, and without downloading the data, the client can ask the server to check that the data has not been falsified. This approach is used for static data.
    • Scalable PDP:[63] This approach is based on symmetric-key cryptography, which is more efficient than public-key encryption. It supports some dynamic operations (modification, deletion and append), but it cannot be used for public verification.
    • Dynamic PDP:[64] This approach extends the PDP model to support several update operations, such as append, insert, modify and delete, which makes it well suited for intense computation.

Economic aspects

Cloud computing is growing rapidly. US government spending on cloud computing was projected to grow at a compound annual growth rate (CAGR) of 40%, reaching 7 billion dollars by 2015, a figure large enough that it should be taken into consideration.[65]

More and more companies use cloud computing to manage massive amounts of data and to overcome a lack of storage capacity. Companies can use resources as a service to meet their computing needs without having to invest in infrastructure, paying only for what they use (the pay-as-you-go model).[66]

Every application provider has to periodically pay the cost of each server where replicas of its data are stored. The cost of a server is generally determined by the quality of the hardware, the storage capacity, and its query-processing and communication overhead.[67]

Cloud computing makes it easier for enterprises to scale their services according to client demand. The pay-as-you-go model has also made it easier for startup companies that wish to benefit from compute-intensive business. Cloud computing also offers a huge opportunity to many third-world countries that do not otherwise have enough resources, thus enabling IT services there. Cloud computing can lower IT barriers to innovation.[68]

Despite the wide use of cloud computing, efficient sharing of large volumes of data in an untrusted cloud is still a challenging research topic.

References

1. ^ Sun microsystem, p. 1.
2. ^ Fabio Kon, p. 1
3. ^ Kobayashi et al. 2011, p. 1.
4. ^ Angabini et al. 2011, p. 1.
5. ^ Di Sano et al. 2012, p. 2.
6. ^ Andrew & Maarten 2006, p. 492.
7. ^ Andrew & Maarten 2006, p. 496
8. ^ Humbetov 2012, p. 2
9. ^ a b c Krzyzanowski 2012, p. 2
10. ^ Pavel Bžoch, p. 7.
11. ^ Kai et al. 2013, p. 23.
12. ^ a b Hsiao et al. 2013, p. 2.
13. ^ Hsiao et al. 2013, p. 952.
14. ^
15. ^
16. ^ Hsiao et al. 2013, p. 953.
17. ^ Di Sano et al. 2012, pp. 1–2
18. ^ Krzyzanowski 2012, p. 4
19. ^ Di Sano et al. 2012, p. 2
20. ^ Andrew & Maarten 2006, p. 497
21. ^ Humbetov 2012, p. 3
22. ^ Humbetov 2012, p. 5
23. ^ Andrew & Maarten 2006, p. 498
24. ^ Krzyzanowski 2012, p. 5
25. ^ Fan-Hsun et al. 2012, p. 2
26. ^ Azzedin 2013, p. 2
27. ^ a b Adamov 2012, p. 2
28. ^ Yee & Thu Naing 2011, p. 122
29. ^ Soares et al. 2013, p. 158
30. ^ Weil et al. 2006, p. 307
31. ^ MALTZAHN et al. 2010, p. 39
32. ^ Jacobi Lingemann, p. 10
33. ^ Schwan Philip 2003, p. 401
34. ^
35. ^ Upadhyaya et al. 2008, p. 400.
36. ^ Upadhyaya et al. 2008, p. 403.
37. ^ Upadhyaya et al. 2008, p. 401.
38. ^ Upadhyaya et al. 2008, p. 402.
39. ^ a b
40. ^ Zhifeng & Yang 2013, p. 854
41. ^ Zhifeng & Yang 2013, pp. 845–846
42. ^ Yau & An 2010, p. 353
43. ^
44. ^ Yau & An 2010, p. 352
45. ^ Miranda & Siani 2009
46. ^
47. ^ Zhifeng & Yang 2013, p. 854
48. ^ a b
49. ^ Cuong et al. 2012, p. 5
50. ^ Undheim, Chilwan & Heegaard 2011, p. 3
51. ^ Qian, Medhi & Trivedi 2011, p. 3
52. ^ Vogels 2009, p. 2
53. ^
54. ^ a b Carnegie et al. 2009, p. 1
55. ^ a b Wang et al. 2012, p. 1
56. ^
57. ^ Wang et al. 2012, p. 9
58. ^ Juels & Oprea 2013, p. 4
59. ^ Zhifeng & Yang 2013, p. 5
60. ^ Bowers, Juels & Oprea 2009
61. ^ Juels & S. Kaliski 2007, p. 2
62. ^ Ateniese et al. 2007
63. ^ Ateniese et al. 2008, p. 9
64. ^ Erway et al. 2009, p. 2
65. ^ Kaufman 2009, p. 2
66. ^ Angabini et al. 2011, p. 1
67. ^
68. ^ Marston et al. 2011, p. 3

Bibliography

1. Architecture, structure, and design
• Zhang, Qi-fei; Pan, Xue-zeng; Shen, Yan; Li, Wen-juan (2012). "A Novel Scalable Architecture of Cloud Storage System for Small Files Based on P2P". Coll. of Comput. Sci. & Technol., Zhejiang Univ., Hangzhou, China. Cluster Computing Workshops (CLUSTER WORKSHOPS), 2012 IEEE International Conference on. doi:10.1109/ClusterW.2012.27.
• Azzedin, Farag (2013). "Towards A Scalable HDFS Architecture". Information and Computer Science Department, King Fahd University of Petroleum and Minerals. Collaboration Technologies and Systems (CTS), 2013 International Conference on: 155–161. doi:10.1109/CTS.2013.6567222.
• Krzyzanowski, Paul (2012). "Distributed File Systems".
• Kobayashi, K; Mikami, S; Kimura, H; Tatebe, O (2011). "The Gfarm File System on Compute Clouds". Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on. Grad. Sch. of Syst. & Inf. Eng., Univ. of Tsukuba, Tsukuba, Japan. doi:10.1109/IPDPS.2011.255.
• Humbetov, Shamil (2012). "Data-Intensive Computing with Map-Reduce and Hadoop". Department of Computer Engineering, Qafqaz University, Baku, Azerbaijan. Application of Information and Communication Technologies (AICT), 2012 6th International Conference on: 1–5. doi:10.1109/ICAICT.2012.6398489.
• Hsiao, Hung-Chang; Chung, Hsueh-Yi; Shen, Haiying; Chao, Yu-Chang (2013). "Load Rebalancing for Distributed File Systems in Clouds". National Cheng Kung University, Tainan. Parallel and Distributed Systems, IEEE Transactions on 24 (5): 951–962. doi:10.1109/TPDS.2012.196.
• Kai, Fan; Dayang, Zhang; Hui, Li; Yintang, Yang (2013). "An Adaptive Feedback Load Balancing Algorithm in HDFS". State Key Lab. of Integrated Service Networks, Xidian Univ., Xi'an, China. Intelligent Networking and Collaborative Systems (INCoS), 2013 5th International Conference on: 23–29. doi:10.1109/INCoS.2013.14.
• Upadhyaya, B; Azimov, F; Doan, T.T; Choi, Eunmi; Kim, Sangbum; Kim, Pilsung (2008). "Distributed File System: Efficiency Experiments for Data Access and Communication". Sch. of Bus. IT, Kookmin Univ., Seoul. Networked Computing and Advanced Information Management, 2008. NCM '08. Fourth International Conference on 2: 400–405. doi:10.1109/NCM.2008.164.
• Soares, Tiago S.; Dantas, M.A.R; de Macedo, Douglas D.J.; Bauer, Michael A (2013). "A Data Management in a Private Cloud Storage Environment Utilizing High Performance Distributed File Systems". Inf. & Statistic Dept. (INE), Fed. Univ. of Santa Catarina (UFSC), Florianopolis, Brazil. Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 2013 IEEE 22nd International Workshop on: 158–163. doi:10.1109/WETICE.2013.12.
• Adamov, Abzetdin (2012). "Distributed File System as a basis of Data-Intensive Computing". Comput. Eng. Dept., Qafqaz Univ., Baku, Azerbaijan. Application of Information and Communication Technologies (AICT), 2012 6th International Conference on: 1–3. doi:10.1109/ICAICT.2012.6398484.
• Schwan, Philip (2003). "Lustre: Building a File System for 1,000-node Clusters". Cluster File Systems, Inc. Proceedings of the 2003 Linux Symposium: 400–407.
• Jones, Terry; Koniges, Alice; Yates, R. Kim (2000). "Performance of the IBM General Parallel File System". Lawrence Livermore National Laboratory. Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International.
• Weil, Sage A.; Brandt, Scott A.; Miller, Ethan L.; Long, Darrell D. E. (2006). "Ceph: A Scalable, High-Performance Distributed File System". University of California, Santa Cruz.
• Maltzahn, Carlos; Molina-Estolano, Esteban; Khurana, Amandeep; Nelson, Alex J.; Brandt, Scott A.; Weil, Sage (2010). "Ceph as a scalable alternative to the Hadoop Distributed FileSystem".
• Brandt, S.A.; Miller, E.L.; Long, D.D.E.; Xue, Lan (2003). "Efficient metadata management in large distributed storage systems". Storage Syst. Res. Center, California Univ., Santa Cruz, CA, USA. Mass Storage Systems and Technologies, 2003. (MSST 2003). Proceedings. 20th IEEE/11th NASA Goddard Conference on: 290–298. doi:10.1109/MASS.2003.1194865.
• Gibson, Garth A.; Van Meter, Rodney M. (November 2000). "Network attached storage architecture". Communications of the ACM 43 (11).
• Yee, Tin Tin; Thu Naing, Thinn (2011). "PC-Cluster based Storage System Architecture for Cloud Storage". The Smithsonian/NASA Astrophysics Data System.
• Cho Cho, Khaing; Thinn Thu, Naing (2011). "The efficient data storage management system on cluster-based private cloud data center". Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE International Conference on: 235–239. doi:10.1109/CCIS.2011.6045066.
• S.A., Brandt; E.L., Miller; D.D.E., Long; Lan, Xue (2011). "A carrier-grade service-oriented file storage architecture for cloud computing". PCN&CAD Center, Beijing Univ. of Posts & Telecommun., Beijing, China. Web Society (SWS), 2011 3rd Symposium on: 16–20. doi:10.1109/SWS.2011.6101263.
• Ghemawat, Sanjay; Gobioff, Howard; Leung, Shun-Tak (2003). "The Google File System". SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles: 29–43. doi:10.1145/945445.945450.
2. Security Concept
• Vecchiola, C; Pandey, S; Buyya, R (2009). "High-Performance Cloud Computing: A View of Scientific Applications". Dept. of Comput. Sci. & Software Eng., Univ. of Melbourne, Melbourne, VIC, Australia. Pervasive Systems, Algorithms, and Networks (ISPAN), 2009 10th International Symposium on: 4–16. doi:10.1109/I-SPAN.2009.150.
• Miranda, Mowbray; Siani, Pearson (2009). "A client-based privacy manager for cloud computing". COMSWARE '09 Proceedings of the Fourth International ICST Conference on COMmunication System softWAre and middlewaRE. doi:10.1145/1621890.1621897.
• Naehrig, Michael; Lauter, Kristin (2011). "Can homomorphic encryption be practical?". CCSW '11 Proceedings of the 3rd ACM workshop on Cloud computing security workshop: 113–124. doi:10.1145/2046660.2046682.
• Du, Hongtao; Li, Zhanhuai (2012). "Efficient metadata management in large distributed storage systems". Comput. Coll., Northwestern Polytech. Univ., XiAn, China. Measurement, Information and Control (MIC), 2012 International Conference on 1: 327–331. doi:10.1109/MIC.2012.6273264.
• Kaufman, Lori M. (2009). "Data Security in the World of Cloud Computing". Security & Privacy, IEEE 7 (4): 161–64. doi:10.1109/MSP.2009.87.
• Bowers, Kevin; Juels, Ari; Oprea, Alina (2009). "HAIL: a high-availability and integrity layer for cloud storage". Proceedings of the 16th ACM conference on Computer and communications security: 187–198. doi:10.1145/1653662.1653686.
• Juels, Ari; Oprea, Alina (2013). "New approaches to security and availability for cloud data". Communications of the ACM 56 (2): 64–73. doi:10.1145/2408776.2408793.
• Zhang, Jing; Wu, Gongqing; Hu, Xuegang; Wu, Xindong (2012). "A Distributed Cache for Hadoop Distributed File System in Real-Time Cloud Services". Dept. of Comput. Sci., Hefei Univ. of Technol., Hefei, China. Grid Computing (GRID), 2012 ACM/IEEE 13th International Conference on: 12–21. doi:10.1109/Grid.2012.17.
• Pan, A.; Walters, J.P.; Pai, V.S.; Kang, D.-I.D.; Crago, S.P. (2012). "Integrating High Performance File Systems in a Cloud Computing Environment". Dept. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA. High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion: 753–759. doi:10.1109/SC.Companion.2012.103.
• Fan-Hsun, Tseng; Chi-Yuan, Chen; Li-Der, Chou; Han-Chieh, Chao (2012). "Implement a reliable and secure cloud distributed file system". Dept. of Comput. Sci. & Inf. Eng., Nat. Central Univ., Taoyuan, Taiwan. Intelligent Signal Processing and Communications Systems (ISPACS), 2012 International Symposium on: 227–232. doi:10.1109/ISPACS.2012.6473485.
• Di Sano, M; Di Stefano, A; Morana, G; Zito, D (2012). "File System As-a-Service: Providing Transient and Consistent Views of Files to Cooperating Applications in Clouds". Dept. of Electr., Electron. & Comput. Eng., Univ. of Catania, Catania, Italy. Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 2012 IEEE 21st International Workshop on: 173–178. doi:10.1109/WETICE.2012.104.
• Zhifeng, Xiao; Yang, Xiao (2013). "Security and Privacy in Cloud Computing". Communications Surveys & Tutorials, IEEE 15 (2): 843–859. doi:10.1109/SURV.2012.060912.00182.
• Horrigan, John B. (2008). "Use of cloud computing applications and services".
• Yau, Stephen; An, Ho (2010). "Confidentiality Protection in cloud computing systems". Int J Software Informatics 4 (4): 351–365.
• Carnegie, Bin Fan; Tantisiriroj, Wittawat; Xiao, Lin; Gibson, Garth (2009). "DiskReduce: RAID for data-intensive scalable computing". PDSW '09 Proceedings of the 4th Annual Workshop on Petascale Data Storage: 6–10. doi:10.1145/1713072.1713075.
• Wang, Jianzong; Gong, Weijiao; Varman, P.; Xie, Changsheng (2012). "Reducing Storage Overhead with Small Write Bottleneck Avoiding in Cloud RAID System". Grid Computing (GRID), 2012 ACM/IEEE 13th International Conference on: 174–183. doi:10.1109/Grid.2012.29.
• Abu-Libdeh, Hussam; Princehouse, Lonnie; Weatherspoon, Hakim (2010). "RACS: a case for cloud storage diversity". SoCC '10 Proceedings of the 1st ACM symposium on Cloud computing: 229–240. doi:10.1145/1807128.1807165.
• Vogels, Werner (2009). "Eventually consistent". Communications of the ACM 52 (1): 40–44. doi:10.1145/1435417.1435432.
• Cuong, Pham; Cao, Phuong; Kalbarczyk, Z; Iyer, R.K (2012). "Toward a high availability cloud: Techniques and challenges". Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on: 1–6. doi:10.1109/DSNW.2012.6264687.
• Undheim, A.; Chilwan, A.; Heegaard, P. (2011). "Differentiated Availability in Cloud Computing SLAs". Grid Computing (GRID), 2011 12th IEEE/ACM International Conference on: 129–136. doi:10.1109/Grid.2011.25.
• Qian, Haiyang; Medhi, D.; Trivedi, T. (2011). "A hierarchical model to evaluate quality of experience of online services hosted by cloud computing". Integrated Network Management (IM), 2011 IFIP/IEEE International Symposium on: 105–112. doi:10.1109/INM.2011.5990680.
• Ateniese, Giuseppe; Burns, Randal; Curtmola, Reza; Herring, Joseph; Kissner, Lea; Peterson, Zachary; Song, Dawn (2007). "Provable data possession at untrusted stores". CCS '07 Proceedings of the 14th ACM conference on Computer and communications security: 598–609. doi:10.1145/1315245.1315318.
• Ateniese, Giuseppe; Di Pietro, Roberto; V. Mancini, Luigi; Tsudik, Gene (2008). "Scalable and efficient provable data possession". Proceedings of the 4th international conference on Security and privacy in communication networks. doi:10.1145/1460877.1460889.
• Erway, Chris; Küpçü, Alptekin; Tamassia, Roberto; Papamanthou, Charalampos (2009). "Dynamic provable data possession". Proceedings of the 16th ACM conference on Computer and communications security: 213–222. doi:10.1145/1653662.1653688.
• Juels, Ari; S. Kaliski, Burton (2007). "PORs: proofs of retrievability for large files". Proceedings of the 14th ACM conference on Computer and communications security: 584–597. doi:10.1145/1315245.1315317.
• Bonvin, Nicolas; Papaioannou, Thanasis; Aberer, Karl (2010). "A self-organized, fault-tolerant and scalable replication scheme for cloud storage". SoCC '10 Proceedings of the 1st ACM symposium on Cloud computing: 205–216. doi:10.1145/1807128.1807162.
• Kraska, Tim; Hentschel, Martin; Alonso, Gustavo; Kossmann, Donald (2009). "Consistency rationing in the cloud: pay only when it matters". Proceedings of the VLDB Endowment 2 (1): 253–264.
• Abadi, Daniel J. (2009). "Data Management in the Cloud: Limitations and Opportunities". IEEE.
3. Synchronization
• Uppoor, S; Flouris, M.D; Bilas, A (2010). "Cloud-based Synchronization of Distributed File System Hierarchies". Inst. of Comput. Sci. (ICS), Found. for Res. & Technol. - Hellas (FORTH), Heraklion, Greece. Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Conference on: 1–4. doi:10.1109/CLUSTERWKSP.2010.5613087.
4. Economic aspects
• Kaufman, Lori M. (2009). "Data Security in the World of Cloud Computing". Security & Privacy, IEEE 7 (4): 161–64. doi:10.1109/MSP.2009.87.
• Marston, Sean; Li, Zhi; Bandyopadhyay, Subhajyoti; Zhang, Juheng; Ghalsasi, Anand (2011). "Cloud computing — The business perspective". Decision Support Systems 51 (1): 176–189. doi:10.1016/j.dss.2010.12.006.
• Angabini, A; Yazdani, N; Mundt, T; Hassani, F (2011). "Suitability of Cloud Computing for Scientific Data Analyzing Applications; An Empirical Study". Sch. of Electr. & Comput. Eng., Univ. of Tehran, Tehran, Iran. P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2011 International Conference on: 193–199. doi:10.1109/3PGCIC.2011.37.