HPC4U

Highly Predictable Cluster for Internet Grids

Expertise:

Engineering of complex IT systems 

Factsheet:

Introduction

A Service Level Agreement (SLA) is a powerful instrument for describing a job’s requirement profile. It is the exact statement of all obligations and expectations within the business partnership between the resource provider and the Grid user as its customer: it describes which resources are to be provided, in what amount, for how long and at what quality. It also specifies the price for resource consumption and the penalty fee for violating the agreement.

Many research projects already focus on SLA functionality within the Grid middleware. Because Grid middleware systems rely on local Resource Management Systems (RMSs) to execute Grid jobs, these RMSs must also be able to guarantee the contents of a negotiated SLA. Current Resource Management Systems are at best able to reserve resources for a fixed time span. Comparing these capabilities with the requirements of future Grid systems reveals a gap between the two.

The goal of the HPC4U project (Highly Predictable Cluster for Internet Grids) is to provide an application-transparent, software-only solution for a reliable Resource Management System. It will allow the Grid to negotiate SLAs, and it will feature mechanisms such as process and storage checkpointing to realize fault tolerance and to assure adherence to the agreed SLAs. The HPC4U solution will act as an active Grid component, using available Grid resources to further improve its level of fault tolerance. To ensure compatibility with existing middleware, HPC4U will use the WS-Agreement and WS-Agreement Negotiation protocols defined by the GRAAP (Grid Resource Allocation Agreement Protocol) working group of the Global Grid Forum (GGF).

The HPC4U cluster middleware will consist of tailored building blocks for ensuring a high level of fault tolerance:

  • an SLA-aware resource management system
  • a subsystem for process checkpointing
  • a storage subsystem for snapshot and virtualization
  • a network subsystem for fault tolerance and transparent failover

In an exceptional situation, e.g. the outage of hardware resources, the HPC4U system will use its fault tolerance (FT) mechanisms to assure the completion of a job.
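
The following minimal sketch illustrates the general idea of checkpoint-based recovery: a job periodically saves its progress, and after an outage it resumes from the last checkpoint instead of starting over. It is a simulation under stated assumptions, not the HPC4U implementation; the function name, its parameters and the step-based job model are all hypothetical.

```python
def run_with_checkpoints(total_steps, checkpoint_every, fail_at=frozenset()):
    """Simulate a job that survives outages by rolling back to its last checkpoint.

    total_steps      -- number of work steps the job needs (hypothetical job model)
    checkpoint_every -- a checkpoint is written after every N completed steps
    fail_at          -- step counts at which a simulated outage strikes once

    Returns (steps_executed, restarts).
    """
    checkpoint = 0                      # progress stored in the last checkpoint
    step = 0                            # current progress of the job
    executed = 0                        # total work done, including repeated work
    restarts = 0
    pending_failures = set(fail_at)

    while step < total_steps:
        if step in pending_failures:
            # Simulated outage: progress since the last checkpoint is lost.
            pending_failures.discard(step)
            step = checkpoint
            restarts += 1
            continue
        step += 1
        executed += 1
        if step % checkpoint_every == 0:
            checkpoint = step           # persist progress
    return executed, restarts


if __name__ == "__main__":
    print(run_with_checkpoints(total_steps=10, checkpoint_every=4, fail_at={6}))
```

In this model the work lost per outage is bounded by the checkpoint interval, which is the trade-off a provider would weigh against the overhead of writing checkpoints more often.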

Phases of Operation within HPC4U

Today, users receive only best-effort service when they use Grid infrastructures to compute their jobs. However, customers may have specific demands, e.g. regarding the required level of service quality. To ensure that these demands can be fulfilled at runtime, an SLA negotiation phase precedes the runtime phase.

In this phase, the customer contacts the resource provider, which returns an SLA template that the customer can use to specify all requirements. The template thus forms a framework for requesting service quality. Furthermore, it defines all constraints that the customer has to consider (e.g. specific services can only be requested in combination, or exclude each other).

The resource provider first checks a newly received SLA request to see whether all constraints are fulfilled. This check represents the static phase of SLA request analysis. In the dynamic phase, the resource provider has to validate whether the requested service level can be realized with the available resources. This check is based on the current or predicted system condition. If the resource provider accepts the SLA request, it issues a guarantee to the requesting customer that all formulated requirements will be fulfilled at runtime.
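
The two-stage analysis described above can be sketched roughly as follows. This is an illustrative model only: the `SlaTemplate` class, the service names and the use of a free-node count as the sole measure of system condition are assumptions made for the example, not part of HPC4U or WS-Agreement.

```python
from dataclasses import dataclass, field


@dataclass
class SlaTemplate:
    # Hypothetical template model: offered services plus combination rules.
    services: set
    requires: dict = field(default_factory=dict)  # service -> set of services it needs
    excludes: dict = field(default_factory=dict)  # service -> set of services it rules out


def static_check(template, requested):
    """Static phase: return constraint violations (empty list = well-formed)."""
    problems = []
    for service in sorted(requested):
        if service not in template.services:
            problems.append(f"unknown service: {service}")
            continue
        missing = template.requires.get(service, set()) - requested
        if missing:
            problems.append(f"{service} requires {sorted(missing)}")
        clashing = template.excludes.get(service, set()) & requested
        if clashing:
            problems.append(f"{service} excludes {sorted(clashing)}")
    return problems


def dynamic_check(nodes_requested, free_nodes):
    """Dynamic phase (simplified): can the request be served with free resources?"""
    return nodes_requested <= free_nodes


def analyse_request(template, requested, nodes_requested, free_nodes):
    """Run the static check first, then the dynamic check.

    Returns (accepted, reasons); reasons is empty when the request is accepted.
    """
    requested = set(requested)
    problems = static_check(template, requested)
    if problems:
        return False, problems
    if not dynamic_check(nodes_requested, free_nodes):
        return False, ["insufficient free nodes"]
    return True, []
```

Running the static check first means a malformed request is rejected without consulting the system state at all; only well-formed requests reach the more expensive resource check.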

In the next phase, the pre-runtime phase, the validity period of the SLA has not yet started, but the system has to prepare for the new job. This means that the network has to be configured, the assigned compute nodes have to be initialized and the storage has to be provided.

The main phase of operation is the runtime phase, which starts at the beginning of the validity period of the SLA. As a first step, the necessary input data is transferred from the Grid customer to the compute resource. This process is called stage-in. Once all input data is available, the job can start its computation, generating temporary and result data and using all agreed resources. During this computation, the HPC4U cluster middleware has to ensure compliance with all QoS statements of the SLA, which may imply the regular generation of checkpoints or the use of available fault tolerance mechanisms. When the computation has finished, the resulting data is transferred back to the user in the stage-out process.

The final phase of operation is the post-runtime phase, in which the validity period of the SLA has ended, the job has concluded and all output data has left the system. After this phase, the system can be reconfigured to "normal" operation.
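
The sequence of phases can be summarized as a simple state machine. The sketch below is only an illustration of the ordering described in this section; the `Phase` enum, the transition table and the function names are hypothetical and do not reflect the actual HPC4U code.

```python
from enum import Enum


class Phase(Enum):
    NEGOTIATION = "negotiation"
    PRE_RUNTIME = "pre-runtime"
    RUNTIME = "runtime"
    POST_RUNTIME = "post-runtime"


# Each phase has exactly one successor; post-runtime is terminal.
NEXT_PHASE = {
    Phase.NEGOTIATION: Phase.PRE_RUNTIME,
    Phase.PRE_RUNTIME: Phase.RUNTIME,
    Phase.RUNTIME: Phase.POST_RUNTIME,
}

# Steps carried out inside the runtime phase, in order.
RUNTIME_STEPS = ("stage-in", "compute", "stage-out")


def advance(phase):
    """Return the phase that follows `phase`, or raise if it is the last one."""
    if phase not in NEXT_PHASE:
        raise ValueError(f"{phase.value} is the final phase")
    return NEXT_PHASE[phase]


def lifecycle():
    """List every phase and runtime step in the order a job passes through them."""
    trace = []
    phase = Phase.NEGOTIATION
    while True:
        trace.append(phase.value)
        if phase is Phase.RUNTIME:
            trace.extend(f"runtime/{step}" for step in RUNTIME_STEPS)
        if phase not in NEXT_PHASE:
            return trace
        phase = advance(phase)


if __name__ == "__main__":
    for entry in lifecycle():
        print(entry)
```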

Basic Structure of an SLA

The WS-Agreement protocols specify the form of the negotiation but not its content. Their application is therefore not bound to a specific application domain. Using these protocols for a concrete application requires the definition of specific Service Description Terms (SDTs). With these SDTs, the user can specify, e.g., the number of nodes, the available disk space, the interconnect type, etc. Agreements concerning higher-level services, compositions of services such as workflows, etc. are not in the scope of HPC4U.
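
As a rough illustration of what domain-specific terms might carry, the sketch below models the three example properties named above as a small record with basic validation. The class, its field names, the units and the list of interconnect types are assumptions chosen for the example; they are not taken from the WS-Agreement schema or from the HPC4U term definitions.

```python
from dataclasses import asdict, dataclass

# Hypothetical set of interconnect types a provider might offer.
KNOWN_INTERCONNECTS = {"ethernet", "infiniband", "myrinet"}


@dataclass
class ServiceDescriptionTerms:
    nodes: int            # number of compute nodes requested
    disk_space_gb: int    # disk space requested, in gigabytes (assumed unit)
    interconnect: str     # requested interconnect type

    def validate(self):
        """Return a list of problems; an empty list means the terms are usable."""
        problems = []
        if self.nodes < 1:
            problems.append("nodes must be at least 1")
        if self.disk_space_gb < 0:
            problems.append("disk_space_gb must not be negative")
        if self.interconnect not in KNOWN_INTERCONNECTS:
            problems.append(f"unknown interconnect: {self.interconnect}")
        return problems


if __name__ == "__main__":
    terms = ServiceDescriptionTerms(nodes=16, disk_space_gb=200, interconnect="infiniband")
    print(asdict(terms), terms.validate())
```

Keeping the terms in a plain record like this separates what is being requested from how it is negotiated, which mirrors the split between content and protocol described above.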

Conclusion

A first prototype implementation of the architecture has been completed. It enables users to request fault-tolerant handling of their single-node jobs. The HPC4U system starts such a job within a virtual bubble, using the subsystems for transparent checkpointing and migration within the same cluster system. Ongoing work within HPC4U focuses on extending checkpointing and migration to parallel (multi-node) jobs and on realizing inter-cluster Grid migration.