ARM - Automatic Restart Management

Enhanced Definition

Automatic Restart Management (ARM) is a z/OS function, provided through the cross-system coupling facility (XCF), that improves the availability of critical applications and system services within a Sysplex. It restarts registered batch jobs and started tasks (called elements) after they terminate abnormally or after the system they run on fails, which reduces downtime without operator intervention.

Key Characteristics

- Sysplex-wide Automation: ARM operates across a z/OS Sysplex. When an element fails, ARM restarts it on the same system; when the system itself fails, ARM restarts the element on another system in the Sysplex, as the policy allows.
- Policy-Driven Restart: Restart actions are governed by an ARM policy, which is stored in the ARM couple data set, defined with the IXCMIAPU administrative data utility, and activated with SETXCF START,POLICY,TYPE=ARM. The policy specifies which elements to manage, restart limits, and target systems.
- Element Registration: Critical address spaces or elements (e.g., CICS regions, DB2 DBM1, IMS control regions) must explicitly register with ARM using the IXCARM macro to be monitored and managed.
- Persistent Restart Information: ARM keeps the active policy and the status of registered elements in the ARM couple data set, a shared DASD data set rather than a Coupling Facility structure. Because every participating system can access it, a surviving system can restart elements that were running on a failed system.
- Integration with WLM: For cross-system restarts, ARM uses capacity information from Workload Manager (WLM) to choose, among the eligible target systems, one with enough available capacity to host the restarted elements.
- Restart Groups: Elements can be grouped into restart groups. After a system failure, ARM restarts all elements of a group on the same target system, and the policy's restart order levels determine the sequence in which interdependent elements are brought back.
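
The "restart limits" mentioned under Policy-Driven Restart can be pictured as a cap on restart attempts within a time interval. The following Python sketch is a toy model only, not ARM's implementation: the class name RestartLimiter, its fields, and the sliding-window accounting are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RestartLimiter:
    """Toy model of a 'max attempts per interval' restart limit (hypothetical)."""
    max_attempts: int
    interval_seconds: float
    _history: deque = field(default_factory=deque)

    def allow_restart(self, now: float) -> bool:
        # Forget attempts that have aged out of the interval (assumed sliding window).
        while self._history and now - self._history[0] >= self.interval_seconds:
            self._history.popleft()
        # Refuse the restart once the cap for the current interval is reached.
        if len(self._history) >= self.max_attempts:
            return False
        self._history.append(now)
        return True


# Example: at most 3 restarts in any 300-second window.
limiter = RestartLimiter(max_attempts=3, interval_seconds=300)
decisions = [limiter.allow_restart(t) for t in (0, 10, 20, 30, 301)]
```

In this model the fourth failure, at 30 seconds, is not restarted because three attempts already fall inside the window; by 301 seconds the first attempt has aged out, so a restart is allowed again.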

Use Cases

- CICS Region Recovery: Automatically restarting a failed CICS Transaction Server region on the same or another LPAR in the Sysplex to minimize application downtime for online transactions.
- DB2 Subsystem Availability: Ensuring continuous availability of a DB2 subsystem by restarting its critical address spaces (e.g., DBM1, IRLM) after an abnormal termination, maintaining database access.
- IMS Control Region Resilience: Automatically restarting an IMS control region if it terminates unexpectedly, preserving transaction processing capabilities and message queues.
- Critical Batch Job Management: Restarting a high-priority batch job or started task that has registered with ARM, preventing delays in critical batch processing workflows.
- Custom Application Server High Availability: Managing the restart of custom application servers or middleware components running as z/OS address spaces that have registered with ARM, ensuring their continuous service.

Related Concepts

ARM is a building block for high availability within a z/OS Sysplex. It depends on XCF and on the ARM couple data set, which holds the active policy and the status of registered elements and is shared by all participating systems. ARM also draws on Workload Manager (WLM): WLM does not define the ARM policy, but it supplies the capacity information that ARM uses to choose a target system when elements must be restarted on a different system.
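
As a rough illustration of capacity-based placement, the Python sketch below picks a target system for a cross-system restart. It is a simplified model under stated assumptions: the function name pick_target_system, the system names, and the capacity figures are hypothetical, and the real selection logic takes more factors into account than this sketch does.

```python
from typing import Dict, List, Optional


def pick_target_system(
    candidates: List[str],
    available_capacity: Dict[str, float],
    failed_system: str,
) -> Optional[str]:
    """Return the eligible system with the most available capacity, or None."""
    # Only systems named in the policy, still active, and with known capacity qualify.
    eligible = [
        name for name in candidates
        if name != failed_system and name in available_capacity
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda name: available_capacity[name])


# Hypothetical Sysplex: SYSB has failed, so its elements need a new home.
target = pick_target_system(
    candidates=["SYSA", "SYSB", "SYSC"],
    available_capacity={"SYSA": 10.0, "SYSB": 40.0, "SYSC": 25.0},
    failed_system="SYSB",
)
```

Here SYSC is chosen: SYSB is excluded because it is the failed system, and SYSC has more available capacity than SYSA.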

Best Practices

- Define Comprehensive Policies: Create ARM policies that cover all critical elements, specifying appropriate restart limits, target systems, and dependencies.
- Test Policies Regularly: Periodically simulate failures in a test environment, for example by cancelling a registered job with CANCEL jobname,ARMRESTART (a plain CANCEL does not trigger an ARM restart), and confirm that ARM's restart actions meet your recovery objectives.
- Monitor ARM Activity: Use system log (SYSLOG) messages and the DISPLAY XCF,ARMSTATUS command to monitor element status, restart attempts, and overall effectiveness in your environment.
- Coordinate with WLM: Check that the target systems named in ARM policies have the capacity and WLM service definitions needed to run restarted workloads, so that restarts do not cause resource contention or missed performance goals.
- Leverage Restart Groups: Use restart groups and restart order levels to coordinate the restart of interdependent applications, so that prerequisite components are available before dependent ones are brought back online.
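
The coordinated, ordered restart of a restart group can be sketched as a series of "waves", where every element at a lower level is restarted before any element at a higher level. This Python example is illustrative only: the function restart_waves, the element names, and the level numbers are hypothetical, and a real policy expresses the ordering through restart order levels rather than code.

```python
from itertools import groupby
from typing import Dict, List


def restart_waves(element_levels: Dict[str, int]) -> List[List[str]]:
    """Group elements into restart waves, lowest level first (toy model)."""
    # Sort by level, then by name so the output is deterministic.
    ordered = sorted(element_levels.items(), key=lambda item: (item[1], item[0]))
    return [
        [name for name, _level in group]
        for _level, group in groupby(ordered, key=lambda item: item[1])
    ]


# Hypothetical group: the database and its lock manager come back first,
# then the transaction managers that depend on them.
waves = restart_waves({"CICSA": 3, "DB2A": 1, "IRLMA": 1, "IMSA": 2})
```

In this model the first wave contains DB2A and IRLMA, the second IMSA, and the third CICSA; each wave would begin only after the previous one is ready or has timed out.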