Automatic Restart Manager

ARM

Enhanced Definition

The Automatic Restart Manager (ARM) is a z/OS component that automatically restarts failed batch jobs and started tasks (STCs) that have registered with it as elements within a `sysplex`. Interactive TSO/E user sessions are not ARM elements. Its primary purpose is to improve availability and reduce manual intervention after a program failure or a system outage.

Key Characteristics

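- Policy-Driven Recovery: ARM operates according to an installation-defined policy stored in the ARM couple data set. The policy controls how registered elements are restarted, for example the restart method, the number of restart attempts allowed, and the systems eligible to host a restart.
- Sysplex Scope: ARM is sysplex-aware. If an element fails, ARM can restart it on the same system; if the whole system fails, ARM can restart its elements on another z/OS image in the sysplex (cross-system restart).
- Monitors Registered Address Spaces: Critical address spaces (e.g., CICS regions, DB2 subsystems, IMS control regions) register with ARM as elements, using the IXCARM macro, so that ARM can act when they fail.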
- Resource Dependency Management: Policies can express dependencies through restart order levels, and a restarted element can wait for its predecessor elements (such as a specific DB2 subsystem) to signal that they are ready before it continues.
- Restart Groups: Related elements can be placed in a restart group so that, after a system failure, they are restarted together on the same target system, with restart order levels controlling the startup sequence.
- Failure Detection: ARM acts when a registered element terminates unexpectedly (for example, through an abend) or when the system it is running on fails or leaves the sysplex.
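
The restart-group and restart-order ideas above can be illustrated with a small model. This is a minimal sketch, not ARM's implementation: the `Element` class, its fields, the `plan_restarts` function, and the element and group names are all hypothetical, and attaching a level directly to each element is a simplification of how a real policy assigns levels. It only shows the intended effect: elements are planned per group, and lower levels start before higher ones.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for an address space registered with ARM."""
    name: str
    group: str  # restart group the element belongs to
    level: int  # lower levels are restarted first


def plan_restarts(failed):
    """Return {group: [names at lowest level, names at next level, ...]}."""
    by_group = defaultdict(lambda: defaultdict(list))
    for elem in failed:
        by_group[elem.group][elem.level].append(elem.name)
    # Within a group, emit one batch per level in ascending level order.
    return {
        group: [sorted(levels[lvl]) for lvl in sorted(levels)]
        for group, levels in by_group.items()
    }


# Hypothetical elements that were running on a failed system.
failed = [
    Element("CICSPROD", "ONLINE", 2),
    Element("DB2PROD", "ONLINE", 1),
    Element("IMSCTL", "IMSGRP", 1),
    Element("CICSTEST", "ONLINE", 2),
]
plan = plan_restarts(failed)
```

In this model the database element in the `ONLINE` group is planned first and the two CICS elements follow as one batch, which mirrors the "dependencies start first" intent of restart ordering.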

Use Cases

- Critical Started Task Availability: Automatically restarts vital started tasks like CICS regions, DB2 subsystems, or IMS control regions if they terminate unexpectedly, minimizing downtime for online transactions.
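- Batch Job Resilience: Helps critical batch jobs (for example, COBOL or PL/I programs) run to completion by restarting them automatically after an abend. The job must register with ARM and be designed to be safely restartable; ARM reduces manual intervention but does not by itself avoid reprocessing.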
- TSO/E Users: ARM restarts batch jobs and started tasks rather than interactive TSO/E sessions. Users log on again after a failure, while the started tasks and subsystems their work depends on can be recovered automatically by ARM.
- System-Wide Resilience: Contributes to the overall resilience of a z/OS sysplex by automating recovery processes for essential system components and applications.
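
Automatic restart is normally bounded so that an element that keeps failing is not restarted forever. The sketch below models that idea as "at most N restarts within a time window". It is an illustration under stated assumptions, not ARM's algorithm: the `RestartLimiter` class and the values 3 and 300 seconds are chosen for the example only.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_attempts restarts within a sliding time window."""

    def __init__(self, max_attempts, window_seconds):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of restarts still in the window

    def should_restart(self, now):
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_attempts:
            return False  # threshold reached: leave the element down
        self._restarts.append(now)
        return True


# Hypothetical failure times, in seconds, for one element.
limiter = RestartLimiter(max_attempts=3, window_seconds=300)
decisions = [limiter.should_restart(t) for t in (0, 60, 120, 180, 400)]
```

With these example values the fourth failure is not restarted because three restarts already occurred inside the window, while the failure at 400 seconds is restarted again because the earliest restarts have aged out.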

Related Concepts

ARM is tightly integrated with the z/OS sysplex environment: it relies on XCF (Cross-System Coupling Facility) for communication and keeps its policy in the ARM couple data set. It works with WLM (Workload Manager), which supplies the capacity information ARM can use when choosing a target system for a cross-system restart. ARM handles only the automatic restart of failed elements; it complements other recovery mechanisms such as DFSMShsm (formerly DFHSM) for data recovery and GDPS for disaster recovery.

Best Practices

- Define Comprehensive Policies: Create ARM policies that explicitly cover all critical elements, their restart criteria, and any interdependencies, rather than relying on defaults.
- Test Policies Regularly: Periodically test ARM policies in a non-production sysplex to validate their behavior under different failure scenarios (element abend, system failure) and confirm that they meet recovery objectives.
- Monitor ARM Activity: Monitor the system log for ARM messages, which are issued with the XCF prefix IXC, to track restart attempts, success rates, and any problems during automated recovery.
- Coordinate with Operations: Ensure operations staff understand what ARM does, how to query element status with `DISPLAY XCF,ARMSTATUS`, and how to intervene if necessary.
- Consider Resource Availability: Design ARM policies around the resources that restarted elements depend on (e.g., DB2 subsystems, VSAM files), using restart ordering so that dependent elements start only after those resources are available.
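
As an illustration of the monitoring practice above, the following sketch counts ARM-related message IDs in a captured log extract. It assumes that ARM messages use the XCF prefix IXC and fall in the IXC8nn range; verify that range against the z/OS messages documentation before relying on it. The function name, the regular expression, and the sample lines are hypothetical, and the sample lines contain placeholder text rather than real message text.

```python
import re
from collections import Counter

# Assumed pattern: IXC prefix, an 8nn number, and a one-letter severity suffix.
MSG_ID = re.compile(r"\b(IXC8\d{2}[A-Z])\b")


def count_arm_messages(lines):
    """Count occurrences of each assumed ARM message ID in the given lines."""
    counts = Counter()
    for line in lines:
        match = MSG_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Placeholder log lines for the example; not real system output.
sample_log = [
    "10:01:02 IXC812I (placeholder text) element A",
    "10:01:05 IEF403I (placeholder text) unrelated job message",
    "10:07:40 IXC812I (placeholder text) element B",
    "10:09:13 IXC813I (placeholder text) element B",
]
counts = count_arm_messages(sample_log)
```

A count per message ID is enough for a first trend view; a production setup would more likely route these messages through the installation's existing automation or log-analysis tooling.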