Automatic Restart Manager
The Automatic Restart Manager (ARM) is a z/OS component that automatically restarts failed batch jobs, started tasks (STCs), and TSO/E address spaces within a `sysplex`. Its primary purpose is to enhance system availability and reduce the need for manual intervention during outages or program failures.
Key Characteristics
- Policy-Driven Recovery: ARM operates based on user-defined policies stored in a `sysplex` couple dataset, specifying which address spaces to monitor and how to restart them.
- Sysplex Scope: ARM is a `sysplex`-aware component, coordinating restart actions across multiple z/OS images to ensure consistent recovery.
- Monitors Registered Address Spaces: Critical address spaces (e.g., `CICS` regions, `DB2` subsystems, `IMS` control regions) register with ARM to be monitored for failure.
- Resource Dependency Management: Policies can include rules for resource availability, preventing restarts if critical dependencies (like a specific `DB2` subsystem) are not met.
- Restart Groups: Allows grouping related address spaces into restart groups for coordinated and sequential restarts, ensuring proper application startup order.
- Failure Detection: Detects various types of failures, including program abends, system crashes, and resource-related issues, triggering appropriate restart actions.
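The restart-group and policy ideas above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ARM's interface: the `Element` class, its fields, and the rule that lower levels restart first are hypothetical names chosen for illustration, loosely inspired by the idea of limiting restart attempts within an interval and ordering restarts inside a group.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for an address space registered for restart."""
    name: str
    level: int = 1           # assumed rule: lower levels restart first
    max_attempts: int = 3    # restarts allowed inside the interval
    interval_s: int = 300    # length of the sliding window, in seconds
    restart_times: list = field(default_factory=list)

    def may_restart(self, now: float) -> bool:
        """True if one more restart fits within the attempts/interval budget."""
        recent = [t for t in self.restart_times if now - t < self.interval_s]
        return len(recent) < self.max_attempts

    def record_restart(self, now: float) -> None:
        """Remember when a restart was issued."""
        self.restart_times.append(now)


def restart_order(group):
    """Return element names batched by level, lowest level first."""
    levels = {}
    for element in group:
        levels.setdefault(element.level, []).append(element.name)
    return [sorted(levels[level]) for level in sorted(levels)]


# Illustrative group: the database comes back before the regions that use it.
group = [
    Element("CICSPAY1", level=2),
    Element("DB2A", level=1),
    Element("CICSPAY2", level=2),
]
print(restart_order(group))
```

In a real sysplex these decisions are made by ARM from the active policy; the model only shows why a policy needs both an ordering rule and a cap on repeated restarts.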
Use Cases
- Critical Started Task Availability: Automatically restarts vital started tasks like `CICS` regions, `DB2` subsystems, or `IMS` control regions if they terminate unexpectedly, minimizing downtime for online transactions.
- Batch Job Resilience: Ensures the completion of critical `COBOL` or `PL/I` batch jobs by automatically restarting them after an abend, reducing manual intervention and reprocessing.
- TSO/E Session Recovery: Recovers `TSO/E` address spaces that might fail, allowing users to quickly resume their interactive work without losing context.
- System-Wide Resilience: Contributes to the overall resilience of a z/OS `sysplex` by automating recovery processes for essential system components and applications.
Related Concepts
ARM is tightly integrated with the z/OS sysplex environment: it relies on XCF (cross-system coupling facility) for communication and keeps its policies in a sysplex couple data set. It often works in conjunction with WLM (Workload Manager) to manage system resources and priorities, especially during the restart of critical workloads. ARM handles only the automatic restart of failed components, so it complements rather than replaces other recovery mechanisms such as DFSMShsm (formerly DFHSM) for data recovery or GDPS for disaster recovery.
Best Practices
- Define Comprehensive Policies: Create robust ARM policies (`ARM Policy`) that explicitly define all critical address spaces, their restart criteria, and any interdependencies to ensure effective recovery.
- Test Policies Regularly: Periodically test ARM policies in a non-production environment to validate their behavior during various failure scenarios and ensure they meet recovery objectives.
- Monitor ARM Activity: Implement monitoring for ARM messages (prefix `IXC`) and system logs to track restart attempts, success rates, and any issues encountered during automated recovery.
- Coordinate with Operations: Ensure operations staff are familiar with ARM's capabilities, how to query its status using commands like `D XCF,ARMSTATUS`, and how to intervene if necessary.
- Consider Resource Availability: Design ARM policies to account for the availability of critical resources (e.g., `DB2` subsystems, `VSAM` files) that restarted tasks might depend on, potentially delaying restarts until resources are online.
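As a rough illustration of the monitoring practice, the sketch below counts message IDs in log lines. It rests on assumptions that should be checked against the z/OS messages documentation: that the ARM messages of interest carry the XCF prefix `IXC` with numbers in the 800 range, and that each log line contains the message ID as a separate token. The sample lines are made up, with the message text deliberately omitted.

```python
import re
from collections import Counter

# Matches IDs such as IXC812I: prefix, 3-4 digits, one type letter.
MSG_ID = re.compile(r"\b(IXC\d{3,4}[A-Z])\b")


def count_message_ids(lines, prefix="IXC8"):
    """Count message IDs that start with the given prefix (assumed ARM range)."""
    counts = Counter()
    for line in lines:
        for msg_id in MSG_ID.findall(line):
            if msg_id.startswith(prefix):
                counts[msg_id] += 1
    return counts


# Hypothetical log excerpt; the layout and timestamps are illustrative only.
sample = [
    "12:00:01 SYS1 IXC812I (message text omitted)",
    "12:00:02 SYS1 IEF403I (message text omitted)",
    "12:05:40 SYS2 IXC812I (message text omitted)",
    "12:05:41 SYS2 IXC813I (message text omitted)",
]
counts = count_message_ids(sample)
for msg_id, n in sorted(counts.items()):
    print(msg_id, n)
```

A count like this is only a starting point; a production setup would more likely rely on the installation's automation or log-analysis tooling than on an ad hoc script.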