Déjà vu Software
Component failures are endemic to any large-scale
computational resource. As inexpensive computational clusters
continue to push the limits of performance and scalability, keeping
large-scale cluster systems stable remains a fundamental problem.
The higher component count inherent in cluster-based systems makes
the resource as a whole less stable, because the integrated system
depends on the combined failure rates of all of its individual
components. For instance, even if a cluster is built from very
high reliability nodes, where each node has a software or
hardware failure only once a year, a 1000-node cluster will
see a failure two to three times a day on average (1000 / 365 is
about 2.7). Stabilizing ever-growing networked collections of
cluster systems requires a systemic software solution that
provides reliable access to computing resources.
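The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustrative model, not part of Déjà vu: it assumes node failures are independent and exponentially distributed, and the function names are hypothetical.

```python
import math


def cluster_failures_per_day(nodes: int, node_mtbf_days: float) -> float:
    """Expected cluster-wide failures per day, assuming independent nodes."""
    return nodes / node_mtbf_days


def prob_failure_free(nodes: int, node_mtbf_days: float, run_days: float) -> float:
    """Probability that no node fails during run_days (exponential model)."""
    return math.exp(-nodes * run_days / node_mtbf_days)


# 1000 nodes, each failing about once a year (365 days between failures).
rate = cluster_failures_per_day(1000, 365.0)
clean_day = prob_failure_free(1000, 365.0, 1.0)

print(f"{rate:.2f} expected failures per day")
print(f"{clean_day:.3f} probability of a failure-free day")
```

Under these assumptions, a job that needs the whole cluster for a full day would rarely finish without some form of checkpointing or recovery.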
Under an exclusive license from Virginia Tech, California Digital
is developing the first comprehensive solution to the decades-old
problem of transparent fault tolerance: enabling large-scale cluster
supercomputers to mask hardware, operating system, and software
failures. The goal of our software solution, called Déjà vu, is
not just to implement "point" solutions but to deliver an
integrated system that will constitute a fundamental
component of enterprise technical computing resources.
Déjà vu provides:
- A transparent parallel checkpointing and
recovery mechanism that recovers from any combination of system
failures without any modification to parallel applications.
- An online migration subsystem that recovers
from failures by migrating work from failed nodes to "hot spares".
The migration subsystem can be invoked through administrative
control or automatically by the queuing system.
- A novel post-compiler analysis system that
transparently instruments the application and captures application
state.
- A systems architecture that seamlessly integrates
user-initiated and system-initiated checkpoints in a single
framework, enabling the effective use of a wide variety of
domain-specific knowledge.
- Novel runtime mechanisms for transparent
incremental checkpointing that efficiently capture the minimum
amount of state required to maintain global consistency.
- A novel communications architecture that
enables transparent migration of existing MPI codes without
source-code modifications to the application.
- Recoverable I/O subsystems that can be tailored
to specific storage environments.
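To make the idea of incremental checkpointing concrete, here is a minimal sketch of the general technique, not Déjà vu's actual mechanism, which this page does not describe. It assumes application state is available as a flat byte string and detects changes by hashing fixed-size blocks; `BLOCK_SIZE` and all function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block granularity for change detection


def block_digests(state: bytes) -> list[bytes]:
    """Hash each fixed-size block of the state."""
    return [hashlib.sha256(state[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(state), BLOCK_SIZE)]


def incremental_checkpoint(state: bytes, previous: list[bytes]):
    """Return only the blocks that changed since the previous checkpoint,
    plus the new digest list to compare against next time."""
    current = block_digests(state)
    changed = {}
    for index, digest in enumerate(current):
        if index >= len(previous) or previous[index] != digest:
            start = index * BLOCK_SIZE
            changed[index] = state[start:start + BLOCK_SIZE]
    return changed, current


def apply_checkpoint(base: bytes, changed: dict[int, bytes], length: int) -> bytes:
    """Rebuild a state of the given length from a base image and a delta."""
    buf = bytearray(base[:length].ljust(length, b"\0"))
    for index, block in changed.items():
        start = index * BLOCK_SIZE
        buf[start:start + len(block)] = block
    return bytes(buf)


# First checkpoint: every block is new, so all four blocks are saved.
state0 = bytes(4 * BLOCK_SIZE)
delta0, digests0 = incremental_checkpoint(state0, [])

# Modify one byte in the second block; only that block is saved next time.
modified = bytearray(state0)
modified[5000] = 0xFF
state1 = bytes(modified)
delta1, digests1 = incremental_checkpoint(state1, digests0)

recovered = apply_checkpoint(state0, delta1, len(state1))
print(sorted(delta1), recovered == state1)
```

A real system would track modified memory pages rather than rehash the whole state, and would coordinate checkpoints across processes to keep the global state consistent. The sketch only shows why saving deltas is cheaper than saving full images.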