Fault Tolerance

Enhanced Definition

In the z/OS environment, fault tolerance is the ability of a system or application to keep operating correctly and without interruption when one or more of its components fail. It protects availability and data integrity through redundancy and automatic recovery mechanisms.

Key Characteristics

    • Redundancy: Involves duplicating critical hardware components (e.g., power supplies, network adapters, storage paths) and software instances to provide backup in case of failure.
    • Automatic Failover: The system can automatically detect a component failure and seamlessly switch processing to a redundant component or path without manual intervention or service disruption.
    • Error Detection and Correction: Errors are detected at multiple levels (hardware, software, data) and are either corrected in place or contained by isolating the faulty component.
    • Isolation: Failures are contained within a specific component or subsystem to prevent them from propagating and affecting the entire system.
    • Data Integrity: Ensures that data remains consistent, uncorrupted, and available even during and after component failures.
    • Continuous Operation: The primary goal is to maintain continuous service availability for critical applications and data.
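
The interplay of redundancy, error detection, isolation, and automatic failover listed above can be sketched in a few lines. This is a minimal illustration, not a z/OS API: `RedundantGroup`, `ComponentFailure`, `primary_path`, and `backup_path` are hypothetical names invented for the example, and real failover on z/OS is handled by the platform and subsystems rather than by application code like this.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


class ComponentFailure(Exception):
    """Raised by a component that cannot service a request."""


@dataclass
class RedundantGroup:
    """A set of interchangeable components (hypothetical model)."""
    components: List[Callable[[str], str]]
    # Indexes of components that have failed and been taken out of service.
    isolated: Set[int] = field(default_factory=set)

    def call(self, request: str) -> str:
        for index, component in enumerate(self.components):
            if index in self.isolated:
                continue  # isolation: never route work to a known-bad component
            try:
                return component(request)
            except ComponentFailure:
                # Error detection: record the failure, then fail over
                # to the next redundant component without caller involvement.
                self.isolated.add(index)
        raise RuntimeError("all redundant components have failed")


def primary_path(request: str) -> str:
    # Simulates a failed component.
    raise ComponentFailure("primary path offline")


def backup_path(request: str) -> str:
    return f"handled {request} on backup path"


group = RedundantGroup([primary_path, backup_path])
result = group.call("TXN001")
```

The caller sees only a successful result; the failed primary stays isolated, so later calls go straight to the backup. Once every component has failed, the group raises an error instead of returning a wrong answer.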

Use Cases

    • Online Transaction Processing (OLTP): Critical CICS or IMS transactions that require 24/7 availability and cannot tolerate downtime, often leveraging Parallel Sysplex for resilience.
    • Database Systems: Db2 and IMS databases where continuous data access and integrity are paramount, utilizing data sharing and replication technologies.
    • Core z/OS Services: Ensuring essential system components like JES2, VTAM, and vital address spaces remain operational despite hardware or software issues.
    • Batch Processing: Designing long-running batch jobs with restart and recovery capabilities to handle unexpected interruptions without losing significant processing progress.
    • High-Volume Data Ingestion: Systems processing large volumes of incoming data where any interruption could lead to data loss or significant backlogs.
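
The restart and recovery idea from the batch processing use case can be illustrated with a simple checkpoint scheme. The sketch below is a generic illustration under stated assumptions, not a z/OS checkpoint/restart facility: `run_batch`, `save_checkpoint`, the JSON checkpoint format, and the `checkpoint_every` parameter are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, next_index):
    # Write to a temporary file and rename it, so an interruption
    # mid-write never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_batch(records, checkpoint_path, process, checkpoint_every=2):
    """Process records in order, resuming from the last checkpoint.

    Returns the index the run started from.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for index in range(start, len(records)):
        process(records[index])
        done = index + 1
        if done % checkpoint_every == 0 or done == len(records):
            save_checkpoint(checkpoint_path, done)
    return start


# Demonstration: the first run is interrupted at the fourth record.
workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "job.ckpt")
records = ["r1", "r2", "r3", "r4", "r5"]
processed = []
fail_once = {"armed": True}


def process(record):
    if record == "r4" and fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated interruption")
    processed.append(record)


try:
    run_batch(records, checkpoint, process)
except RuntimeError:
    pass  # the job stops here; the checkpoint on disk survives

resumed_from = run_batch(records, checkpoint, process)
```

The rerun resumes at index 2, the last checkpoint, rather than at the start, so only `r3` is processed twice. Work done after the last checkpoint is always repeated, which is why the per-record step should be safe to rerun. Checkpointing more often reduces the repeated work at the cost of more I/O.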

Related Concepts

Fault tolerance is a foundational concept for achieving High Availability (HA), as it provides the mechanisms (like redundancy and failover) that enable systems to remain operational. It complements Disaster Recovery (DR) by addressing local component failures, whereas DR focuses on recovering from site-wide catastrophic events. The Parallel Sysplex architecture is a prime example of a fault-tolerant design on z/OS, providing workload balancing, data sharing, and automatic failover across multiple LPARs.
