You can contact us at
sales@californiadigital.com
or call 1-888-546-8948 for more
information.

Deja Vu Software

Component failures are endemic to any large-scale computational resource. As inexpensive computational clusters continue to push the limits of performance and scalability, keeping large-scale cluster systems stable remains a fundamental problem. The higher component count inherent in cluster-based systems makes the resource as a whole less stable, because the integrated system depends on every one of its components and single-component failure rates therefore add up. For instance, even if a cluster is built from very reliable nodes, each suffering a software or hardware failure only once a year, a 1000-node cluster will see a failure roughly 2.7 times a day on average (1000 failures per year spread over 365 days). Bringing stability to ever-growing networked collections of cluster systems requires a systemic software solution that provides reliable access to computing resources.
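The arithmetic behind that example can be sketched in a few lines of Python. This is an illustrative model only: it assumes node failures are independent and occur at a constant rate, and the function names are hypothetical rather than part of any product.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def cluster_failures_per_day(nodes: int, node_failures_per_year: float = 1.0) -> float:
    """Expected number of node failures per day across the whole cluster."""
    return nodes * node_failures_per_year / 365


def cluster_mtbf_hours(nodes: int, node_failures_per_year: float = 1.0) -> float:
    """Mean time between failures for the cluster, in hours."""
    return HOURS_PER_YEAR / (nodes * node_failures_per_year)


def prob_failure_free(nodes: int, run_hours: float,
                      node_failures_per_year: float = 1.0) -> float:
    """Probability that a job using every node runs for run_hours with no failure.

    Assumes exponentially distributed (memoryless) failures, which is a
    modelling assumption and not a claim about any particular hardware.
    """
    rate_per_hour = nodes * node_failures_per_year / HOURS_PER_YEAR
    return math.exp(-rate_per_hour * run_hours)


if __name__ == "__main__":
    n = 1000
    print(f"failures per day:    {cluster_failures_per_day(n):.2f}")
    print(f"cluster MTBF, hours: {cluster_mtbf_hours(n):.2f}")
    print(f"P(24 h job survives): {prob_failure_free(n, 24):.3f}")
```

Under these assumptions, a job that needs all 1000 nodes for a full day is unlikely to finish without interruption, which is the motivation for transparent checkpointing and recovery.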

Under an exclusive license from Virginia Tech, California Digital is developing the first comprehensive solution to the decades-old problem of transparent fault tolerance, enabling large-scale cluster supercomputers to mask hardware, operating system, and software failures. The goal of our software solution, called Déjà vu, is not just to implement "point" solutions but to deliver an integrated system that will constitute a fundamental component of enterprise technical computing resources.

Déjà vu provides:

  • A transparent parallel checkpointing and recovery mechanism that recovers from any combination of systems failures without any modification to parallel applications.
  • An online migration subsystem that recovers from failures by migrating the work of failed nodes to "hot spares". The migration subsystem can be invoked through administrative control or automatically by the queuing system.
  • A novel post-compiler analysis system that transparently instruments the application and captures application state.
  • A systems architecture that seamlessly integrates user-initiated and system-initiated checkpoints in a single framework, enabling the effective use of a wide variety of domain-specific knowledge.
  • Novel runtime mechanisms for transparent incremental checkpointing, to efficiently capture the least amount of state required to maintain global consistency.
  • A novel communications architecture that enables transparent migration of existing MPI codes without source-code modifications to the application.
  • Recoverable I/O subsystems that can be tailored to specific storage environments.
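To make the idea of incremental checkpointing in the list above concrete, here is a minimal, single-process sketch. It is not Déjà vu's mechanism, which is not described here: it simply hashes fixed-size blocks of a byte buffer and saves only the blocks that changed since the previous checkpoint. The class name, function names, and block size are all hypothetical, and the sketch does not address global consistency across the processes of a parallel job.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, chosen arbitrarily


class IncrementalCheckpointer:
    """Toy incremental checkpointer: records only blocks that changed."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self._digests = []  # block digests seen at the previous checkpoint

    def checkpoint(self, state: bytes) -> dict:
        """Return the state length plus every block that differs from last time."""
        bs = self.block_size
        digests = []
        changed = {}
        for index, offset in enumerate(range(0, len(state), bs)):
            block = state[offset:offset + bs]
            digest = hashlib.sha256(block).digest()
            digests.append(digest)
            # A block is saved if it is new or its contents changed.
            if index >= len(self._digests) or self._digests[index] != digest:
                changed[index] = block
        self._digests = digests
        return {"length": len(state), "blocks": changed}


def restore(base: bytes, checkpoint: dict, block_size: int = BLOCK_SIZE) -> bytes:
    """Apply one incremental checkpoint on top of the previously restored state."""
    length = checkpoint["length"]
    # Truncate or zero-pad the base so it matches the checkpointed length.
    buf = bytearray(base[:length].ljust(length, b"\0"))
    for index, block in checkpoint["blocks"].items():
        start = index * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    ckpt = IncrementalCheckpointer()
    state = bytearray(3 * BLOCK_SIZE)
    first = ckpt.checkpoint(bytes(state))    # first checkpoint saves every block
    state[BLOCK_SIZE + 10] = 0xFF            # modify one byte in the second block
    second = ckpt.checkpoint(bytes(state))   # saves only the modified block
    recovered = restore(restore(b"", first), second)
    print(len(first["blocks"]), len(second["blocks"]), recovered == bytes(state))
```

The first checkpoint is necessarily a full one; later checkpoints shrink to whatever changed. Production systems usually track modified memory pages directly rather than hashing the whole state, so treat this purely as an illustration of the trade-off.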

Call us at 1-888-546-8948 or (510) 651-8811; Copyright © 2005 California Digital