ARM - Automatic Restart Management
Automatic Restart Management (ARM) is a z/OS recovery function, provided by the cross-system coupling facility (XCF), that improves the availability of critical applications and system services within a Sysplex. It monitors registered address spaces (called elements) and automatically restarts them after an abnormal termination, reducing downtime and the need for operator intervention.
Key Characteristics
- Sysplex-wide Automation: ARM operates across a z/OS Sysplex. It restarts a failed element in place on the same logical partition (LPAR) or, when the hosting system itself fails, on a different LPAR within the Sysplex, as the policy allows.
- Policy-Driven Restart: Restart actions are governed by an ARM policy, which is defined with the IXCMIAPU administrative data utility and stored in the ARM couple data set. The policy specifies which elements to manage, restart limits, and target systems.
- Element Registration: Critical address spaces or elements (e.g., CICS regions, DB2 DBM1, IMS control regions) must explicitly register with ARM using the IXCARM macro to be monitored and managed.
- Persistent Restart Information: ARM keeps element registrations and restart state in the ARM couple data set, which is shared by the systems in the Sysplex, so a surviving system can restart elements after the system that hosted them fails.
- Integration with WLM: ARM works in conjunction with Workload Manager (WLM) when choosing where to restart an element on another system, taking available system capacity and workload goals into account.
- Restart Groups: Elements can be grouped together into restart groups, allowing ARM to manage their restarts as a coordinated unit so that interdependent components are brought back online in the correct order.
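The restart limits and ordered group restart described above can be illustrated with a small conceptual model. This is only a sketch in Python, not ARM's implementation or any z/OS API: the class `RestartLimit`, the function `restart_order`, the element names, and the sliding-window interpretation of "N attempts per interval" are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RestartLimit:
    """Hypothetical policy limit: at most max_attempts restarts per interval."""
    max_attempts: int
    interval_seconds: float
    _history: deque = field(default_factory=deque)

    def allow_restart(self, now: float) -> bool:
        # Forget restarts that have aged out of the sliding interval.
        while self._history and now - self._history[0] >= self.interval_seconds:
            self._history.popleft()
        # Refuse the restart once the limit for the interval is reached.
        if len(self._history) >= self.max_attempts:
            return False
        self._history.append(now)
        return True


def restart_order(levels: dict[str, int], failed: set[str]) -> list[str]:
    """Order failed elements by their (assumed) policy level, lowest first."""
    return sorted(failed, key=lambda name: (levels[name], name))


# Three restarts are allowed within 300 seconds; the fourth is refused
# until earlier attempts fall outside the interval.
limit = RestartLimit(max_attempts=3, interval_seconds=300)
decisions = [limit.allow_restart(t) for t in (0, 10, 20, 30, 400)]

# The database comes back before the transaction manager and the application.
order = restart_order(
    {"DB2A": 1, "CICSA": 2, "APPSRV": 3},
    {"CICSA", "APPSRV", "DB2A"},
)
```

In this model a looping element stops being restarted once it exhausts its attempts, which is the reason a policy carries restart limits in the first place.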
Use Cases
- CICS Region Recovery: Automatically restarting a failed CICS Transaction Server region on the same or another LPAR in the Sysplex to minimize application downtime for online transactions.
- DB2 Subsystem Availability: Ensuring continuous availability of a DB2 subsystem by restarting its critical address spaces (e.g., DBM1, IRLM) after an abnormal termination, maintaining database access.
- IMS Control Region Resilience: Automatically restarting an IMS control region if it terminates unexpectedly, preserving transaction processing capabilities and message queues.
- Critical Batch Initiator Management: Restarting a dedicated batch initiator address space that handles high-priority jobs, preventing delays in critical batch processing workflows.
- Custom Application Server High Availability: Managing the restart of custom application servers or middleware components running as z/OS address spaces that have registered with ARM, ensuring their continuous service.
Related Concepts
ARM is a building block for high availability within a z/OS Sysplex. It relies on the ARM couple data set, shared through XCF, to hold the active policy and persistent restart information and to coordinate actions across multiple LPARs. ARM also works with Workload Manager (WLM), which supplies the capacity information ARM uses when deciding where to restart an element on another system, so that restarted workloads stay aligned with overall system goals and service levels.
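As a rough illustration of capacity-aware placement, the sketch below picks a restart target from a list of candidate systems. It is a simplified model written in Python under stated assumptions, not the algorithm ARM or WLM actually uses: the function `choose_target`, the system names, and the single numeric "spare capacity" score are hypothetical.

```python
from typing import Optional


def choose_target(
    candidates: list[str],
    spare_capacity: dict[str, float],
    failed_system: str,
) -> Optional[str]:
    """Pick the eligible system with the most spare capacity (assumed metric)."""
    # Only systems named as candidates, still reporting capacity,
    # and other than the failed one are eligible.
    eligible = [
        system
        for system in candidates
        if system != failed_system and system in spare_capacity
    ]
    if not eligible:
        return None  # No system can take the restart.
    return max(eligible, key=lambda system: spare_capacity[system])


# SYSA has failed, so the choice is between SYSB and SYSC.
target = choose_target(
    ["SYSA", "SYSB", "SYSC"],
    {"SYSA": 50.0, "SYSB": 20.0, "SYSC": 35.0},
    failed_system="SYSA",
)
```

Restricting the choice to the candidate list mirrors the way a policy names its target systems, while the capacity score stands in for the information WLM contributes.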
Best Practices
- Define Comprehensive Policies: Create detailed ARM policies that cover all critical address spaces, specifying appropriate restart limits, target systems, and dependencies to ensure robust recovery.
- Test Policies Regularly: Periodically simulate failures (e.g., using the CANCEL command with the ARMRESTART option, or SLIP traps) and test ARM's restart actions to confirm that policies are correctly implemented and meet recovery objectives.
- Monitor ARM Activity: Use system logs (SYSLOG), RMF reports, and the DISPLAY XCF,ARMSTATUS command to monitor ARM's status, restart attempts, and overall effectiveness in your environment.
- Coordinate with WLM: Ensure ARM policies are aligned with WLM service definitions to prevent resource contention, place restarted workloads sensibly, and meet performance goals.
- Leverage Restart Groups: Use restart groups to manage the coordinated restart of interdependent applications, so that all necessary components are available before an application is fully brought back online.