Storage Resource Manager

Storage Resource Manager (SRM) technology was initiated by the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL) and developed in response to the growing need to manage large datasets on a variety of storage systems. Dynamic storage management is essential for:
(i) preventing data loss,
(ii) lowering error rates in data replication, and
(iii) shortening analysis time by ensuring that analysis tasks have the storage space they need to run to completion.

There are numerous examples of data from simulations running on leadership-class machines being lost because they were not moved to a mass storage system in time. Storage Resource Managers (SRMs) address such issues by coordinating storage allocation, streaming data between sites, and enforcing secure interfaces to the storage systems (that is, handling the special security requirements of each storage system at its home institution). For example, in a production environment in the STAR project, using SRMs reduced the error rate of large-scale replication from 1% to 0.02%. SRMs can also prevent job failures. When jobs run on clusters, some of the local disks can fill up before a job finishes, which results in lost productivity and delayed analysis. This happens when space is not dynamically allocated and files that are no longer needed are not removed. While there are tools for dynamically allocating compute and network resources, SRMs are the only tool available for providing dynamic space reservation, guaranteeing secure file availability with lifetime support, and performing automatic garbage collection that prevents clogging of storage systems.
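
The idea of reserving space before a job starts, so that it cannot run out of disk halfway through, can be illustrated with a minimal sketch. The class and method names below (`SpaceManager`, `reserve`, `release`) are hypothetical and chosen for illustration only; they are not taken from the SRM specification or from any SRM implementation.

```python
import itertools


class InsufficientSpace(Exception):
    """Raised when a reservation cannot be guaranteed up front."""


class SpaceManager:
    """Toy disk-space manager (hypothetical, not an SRM API)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.reserved = {}                 # token -> bytes held for that job
        self._tokens = itertools.count(1)  # source of unique reservation tokens

    def free_bytes(self):
        """Space that is not promised to any job."""
        return self.capacity - sum(self.reserved.values())

    def reserve(self, nbytes):
        """Guarantee nbytes before the job starts, or fail immediately."""
        if nbytes > self.free_bytes():
            raise InsufficientSpace(
                f"need {nbytes} bytes, only {self.free_bytes()} free"
            )
        token = next(self._tokens)
        self.reserved[token] = nbytes
        return token

    def release(self, token):
        """Return a reservation to the pool once the job is done."""
        self.reserved.pop(token, None)


def demo():
    disk = SpaceManager(capacity_bytes=100)
    job_a = disk.reserve(70)
    try:
        disk.reserve(50)           # would overcommit the disk
        outcome = "accepted"
    except InsufficientSpace:
        outcome = "rejected"       # the second job fails fast, before running
    disk.release(job_a)            # job A finishes and frees its space
    disk.reserve(50)               # now the second job can be admitted
    return outcome, disk.free_bytes()
```

In this sketch a job that cannot be given its space is refused at submission time instead of failing after hours of computation, which is the failure mode the paragraph above describes.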

The SRM specification has evolved into an international de facto standard, and many projects have committed to using this technology, especially in the high-energy physics (HEP) and high-energy and nuclear physics (HENP) communities. One example is the Worldwide LHC Computing Grid (WLCG), which supports ATLAS and CMS. The SRM approach is to develop a uniform standard interface that allows multiple implementations by various institutions to interoperate. This removes the dependence on a single implementation and permits multiple groups to develop SRM systems for their specific storage resources. It has become crucial to the interoperation of storage systems in large-scale projects that have to manage and distribute massive amounts of data efficiently and securely. Without such a unifying technology, these projects cannot scale and are bound to fail. This problem will only grow over time as computing facilities move into the petascale regime.
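
A uniform interface with several independent implementations can be sketched as follows. This is only an illustration of the design idea: the interface `StorageResourceManager`, the two backends, and the `replicate` helper are hypothetical names and do not reproduce the operations defined by the SRM specification.

```python
from abc import ABC, abstractmethod


class StorageResourceManager(ABC):
    """Hypothetical common interface that every backend must provide."""

    @abstractmethod
    def get(self, name: str) -> bytes:
        """Return the contents of a file."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None:
        """Store a file."""


class DiskSRM(StorageResourceManager):
    """Backend for plain disk storage (modelled here as a dictionary)."""

    def __init__(self):
        self._files = {}

    def get(self, name):
        return self._files[name]

    def put(self, name, data):
        self._files[name] = data


class ArchiveSRM(StorageResourceManager):
    """Stand-in for a mass storage system: files are staged before reading."""

    def __init__(self):
        self._tape = {}
        self._staged = {}

    def put(self, name, data):
        self._tape[name] = data

    def get(self, name):
        if name not in self._staged:
            # Simulate copying the file from tape to a disk cache.
            self._staged[name] = self._tape[name]
        return self._staged[name]


def replicate(source: StorageResourceManager,
              target: StorageResourceManager,
              names: list) -> int:
    """Copy files between any two backends using only the shared interface."""
    for name in names:
        target.put(name, source.get(name))
    return len(names)
```

Because `replicate` depends only on the shared interface, a site can swap in its own backend without any change to client code, which is the interoperability property described above.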

Another important problem that SRMs address is storage clogging. Storage clogging is a critical problem for large-scale shared storage systems, since the removal of files after they are used is not automated. This increases the cost of storage and slows the analysis and discovery process. SRMs help unclog temporary storage systems by providing lifetime management of accessed files. This capability is crucial to efficient usage of storage under cost constraints.
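
Lifetime management can be pictured as attaching an expiry time to each file that is accessed and periodically removing the files whose lifetime has passed. The sketch below assumes a simple numeric clock passed in by the caller; `LifetimeCache`, `pin`, and `collect` are hypothetical names used for illustration, not SRM operations.

```python
class LifetimeCache:
    """Toy temporary store in which every file has an expiry time."""

    def __init__(self):
        self._expiry = {}   # file name -> time after which it may be removed

    def pin(self, name, now, lifetime):
        """Keep a file available until at least now + lifetime."""
        # A later pin can extend the expiry but never shortens it.
        self._expiry[name] = max(self._expiry.get(name, 0), now + lifetime)

    def collect(self, now):
        """Remove every file whose lifetime has expired; return their names."""
        expired = sorted(n for n, t in self._expiry.items() if t <= now)
        for name in expired:
            del self._expiry[name]
        return expired

    def files(self):
        """Names of the files currently held."""
        return sorted(self._expiry)


cache = LifetimeCache()
cache.pin("a.root", now=0, lifetime=10)    # expires at t = 10
cache.pin("b.root", now=0, lifetime=100)   # expires at t = 100
cache.pin("a.root", now=5, lifetime=10)    # re-accessed: now expires at t = 15

removed_early = cache.collect(now=12)      # nothing has expired yet
removed_late = cache.collect(now=20)       # a.root is past its lifetime
```

With this scheme no user has to remember to delete files: anything not re-pinned is reclaimed automatically, so shared temporary storage does not fill up with stale data.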

SRMs also serve as gateways to secure data access. By routing all external access to storage systems through a standard SRM interface, one can ensure not only authenticated access but also the enforcement of authorized access to files. The SRM technology was highly successful in SciDAC-1 and is currently used in production in several large collaborations. SRM implementations that interoperate have been developed at LBNL, FNAL, and TJNAF, as well as at several sites in Europe. Furthermore, this technology increases scientists' productivity by eliminating the tedious and time-consuming tasks of managing storage, performing robust data movement, and dealing with security requirements at various storage sites.

In addition to leading the SRM standard development by coordinating with multiple institutions, the LBNL team has developed SRM systems for disk storage and mass storage systems, including HPSS. These SRMs have been used in several application domains, including multiple projects at the SDM center, the Earth System Grid, the STAR experiment, and the Open Science Grid (OSG). As data sets continue to grow and become ever more complex, these projects depend on the continued development and support of the SRM implementations from LBNL. It is essential to capitalize on the SciDAC-1 successes by sustaining current projects that depend on the SRM technology, further improving and deploying SRMs in additional projects and application domains, and continuing to evolve the SRM standard. Specifically, based on past experience, we have identified important features that require further development and coordination. These include sophisticated aspects of resource monitoring that can be used for performance estimation, authorization enforcement, and accounting tracking and reporting for the purpose of enforcing quota usage in SRMs. Another aspect that needs further development is SRMs for multi-component storage systems. Such systems, made up of a combination of multiple disk arrays, parallel file systems, and archival storage, are becoming more prevalent as the volume of data that needs to be managed grows exponentially with petascale computing.

Use of SRMs in real applications

The SRM interfaces have been cooperatively defined, and multiple implementations have been developed in the US and Europe. LBNL introduced the concepts and subsequently led a coordinated effort to define a community-based common interface. Several implementations have been deployed in various applications including HEP, HENP, and the Earth System Grid (ESG), as well as new application domains such as fusion simulation and biology. Some specifics of SRM usage to date are:

  • LBNL’s SRMs have been used in production over the last few years to support intensive, robust data movement from BNL to NERSC at a rate of about 10,000 files (about 1 TB) per week in an automated fashion. This arrangement resulted in a 50X reduction in the error rate, from 1% to 0.02%, in the STAR project.
  • In one application, called GridCollector, SRMs were used in combination with an efficient indexing method to greatly speed up STAR analysis. In several cases the analysis task was performed in a day, compared to previous efforts where scientists waited for months to sift out the relevant data. This work received a Best Paper Award at ISC’05.
  • The SRM collaboration has grown as a grassroots activity between LBNL, FNAL, and BNL, and later CERN and RAL. Consequently, a common interface was developed, and this activity continues at this time. This standard has been adopted by the WLCG collaboration.
  • SRMs have been used in production by several facilities including BNL, NERSC, FNAL, CERN, TJNAF, ORNL and NCAR, and other facilities in Europe and Asia.
  • Another example of a successful deployment is the SRM-dCache developed at FNAL. It is widely deployed for use in the CMS project, and it interoperates with the SRM-Castor at CERN. This effort demonstrated the usefulness of SRMs by achieving sustained SRM-to-SRM managed transfers from Castor to FNAL dCache and onto tape at a rate between 40 and 60 MB/s.
  • SRMs are used by TJNAF to provide the CLAS and Lattice QCD collaborations with remote access to the JASMine mass storage system. Such access has allowed researchers to utilize computing resources at universities and other collaborating institutions to process and analyze data weeks or months sooner than if done using only TJNAF computing resources.
  • LBNL’s SRMs have been used in production in the Earth System Grid (ESG) project to provide transparent access to multiple remote storage systems at NERSC, NCAR, ORNL, LLNL, and LANL, including HPSS and NCAR-MSS. A disk version of an SRM has been used by the ESG portal to manage the disk space when it is shared as file storage for multiple clients.
  • The use of SRMs for the CPES fusion project for large-scale robust data movement will be incorporated into workflow engines as part of the SDM center activities.
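
The figures quoted in the list above can be put in perspective with a little arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and, for the transfer-rate conversion, a rate sustained for a full 24 hours; both are assumptions made for illustration, and the variable names are arbitrary.

```python
FILES_PER_WEEK = 10_000      # BNL to NERSC replication volume, from the text
TB_PER_WEEK = 1.0            # about 1 TB per week, from the text
MB_PER_TB = 1_000_000        # assumption: decimal units
SECONDS_PER_DAY = 86_400


def expected_failures(n_files, error_rate):
    """Expected number of failed file transfers at a given error rate."""
    return n_files * error_rate


def tb_per_day(mb_per_second):
    """Daily volume if a transfer rate were sustained for 24 hours."""
    return mb_per_second * SECONDS_PER_DAY / MB_PER_TB


failures_before = expected_failures(FILES_PER_WEEK, 0.01)     # at 1%
failures_after = expected_failures(FILES_PER_WEEK, 0.0002)    # at 0.02%
reduction_factor = 0.01 / 0.0002                              # the "50X"
average_file_mb = TB_PER_WEEK * MB_PER_TB / FILES_PER_WEEK

print(f"failed files per week: {failures_before:.0f} -> {failures_after:.0f}")
print(f"reduction factor: {reduction_factor:.0f}x")
print(f"average file size: {average_file_mb:.0f} MB")
print(f"40-60 MB/s sustained: {tb_per_day(40):.1f} to {tb_per_day(60):.1f} TB/day")
```

Under these assumptions, the drop from 1% to 0.02% corresponds to roughly 100 failed files per week falling to about 2, the average file is about 100 MB, and a sustained 40 to 60 MB/s corresponds to roughly 3.5 to 5.2 TB per day.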

List of Storage Resource Manager software:
