Critical

Enhanced Definition

In the mainframe context, "critical" describes either a resource, task, or workload that is essential to the continued operation, integrity, or availability of the system or a core business function, or a condition severe enough to threaten one. A critical condition requires immediate attention because it can cause significant disruption, data loss, or business interruption if it is not addressed promptly.

Key Characteristics

- High Impact: A critical event or resource failure can severely disrupt business operations, lead to system outages, data corruption, or significant financial loss.
- Immediate Action Required: Problems classified as critical demand urgent investigation and resolution, often involving a dedicated incident response team and expedited processes.
- System Availability/Integrity: Often relates directly to the uptime, stability, and data integrity of the z/OS operating system, key subsystems (like CICS, DB2, IMS), or vital applications.
- Priority in Workload Management: Critical workloads or address spaces are typically given a high importance level and demanding goals in the z/OS Workload Manager (WLM) service definition. WLM then adjusts dispatching priorities dynamically so that this work gets preferential access to system resources.
- Severity Levels: Used in problem management systems (e.g., IBM support cases, formerly PMRs, and internal incident tracking) to denote the highest level of severity, commonly labeled Severity 1, which usually implies a production system outage or severe degradation.
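
As an illustration of how a severity scale like the one described above might be applied, here is a minimal Python sketch. The four-level scale, the `Incident` fields, and the `classify` rules are all hypothetical; a real problem management system defines its own levels and criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical four-level scale; 1 is the most severe."""
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


@dataclass
class Incident:
    production: bool          # affects a production system
    outage: bool              # service is completely unavailable
    severe_degradation: bool  # service is up but effectively unusable
    workaround: bool          # an acceptable workaround exists


def classify(incident: Incident) -> Severity:
    """Map an incident to a severity using illustrative rules only."""
    # A production outage or severe degradation is treated as critical.
    if incident.production and (incident.outage or incident.severe_degradation):
        return Severity.CRITICAL
    # Production impact with no workaround ranks just below critical.
    if incident.production and not incident.workaround:
        return Severity.HIGH
    if incident.production:
        return Severity.MEDIUM
    return Severity.LOW
```

Writing the rules down in one place, whatever form they take, keeps classification consistent across the people who open incidents.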

Use Cases

- System Abend: An abend (abnormal end) of a critical z/OS system task, such as the master scheduler, a core component of a database manager (e.g., the DB2 DBM1 address space), or a critical CICS region.
- Performance Degradation: Severe performance degradation affecting a critical online transaction processing system (e.g., CICS region) that renders it unusable for end-users or fails to meet service level agreements.
- Resource Contention: A critical dataset or enqueue becoming unavailable or excessively contended, blocking multiple essential batch jobs or online transactions, potentially leading to a system-wide stall.
- Security Breach: Detection of an active security breach or unauthorized access attempt on a critical system component, sensitive data, or a production environment.
- Disaster Recovery: Identifying critical applications and data as part of a disaster recovery plan, meaning their recovery is paramount. They are assigned the shortest Recovery Time Objective (RTO), the maximum acceptable time to restore service, and the smallest Recovery Point Objective (RPO), the maximum acceptable window of data loss.
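
The RTO and RPO mentioned in the disaster recovery use case measure different things: time to restore service versus the amount of data that may be lost. The following Python sketch shows one way to check a recovery test against both. The class names, field names, and the sample values in the usage note are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjective:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


@dataclass
class RecoveryTestResult:
    recovery_time: timedelta     # measured time from failure to restored service
    data_loss_window: timedelta  # age of the newest recoverable data at failure


def meets_objectives(objective: RecoveryObjective,
                     result: RecoveryTestResult) -> bool:
    """Return True only if both the RTO and the RPO were met."""
    return (result.recovery_time <= objective.rto
            and result.data_loss_window <= objective.rpo)
```

For example, with an assumed objective of a 1-hour RTO and a 5-minute RPO, a test that restored service in 45 minutes but lost 10 minutes of data would fail, because the RPO was missed even though the RTO was met.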

Related Concepts

The concept of "critical" is fundamental to Problem Management, where incidents are classified by severity to prioritize resolution efforts and allocate resources effectively. It is closely tied to System Monitoring and Alerting, as critical conditions trigger immediate notifications to operations staff and automated recovery actions. Workload Management (WLM) uses criticality as a key input to assign appropriate service levels and resource priorities, ensuring vital applications perform as expected even under stress. Furthermore, it's a core consideration in High Availability and Disaster Recovery (HA/DR) planning, where critical systems and data are identified for rapid restoration and business continuity.

Best Practices

- Define Criticality: Clearly define what constitutes a critical system, application, or incident within your organization, including specific metrics, impact assessments, and escalation procedures.
- Robust Monitoring: Implement comprehensive monitoring tools (e.g., OMEGAMON, NetView) with aggressive thresholds and immediate alerting for critical system components, application performance, and resource availability.
- Incident Response Plan: Establish and regularly test a well-defined incident response plan for critical events, including clear escalation paths, communication protocols, and dedicated support teams.
- Prioritize Resolution: Ensure that critical problems receive the highest priority for diagnosis and resolution, allocating necessary resources and expertise immediately, often involving 24/7 support.
- Regular HA/DR Testing: Periodically review and test disaster recovery plans specifically for critical applications and data to validate recovery procedures and ensure RTO/RPO objectives can be met.
- WLM Service Definitions: Configure WLM service definitions to accurately reflect the criticality of workloads, ensuring that critical tasks receive preferential treatment for CPU, I/O, and memory during resource contention.
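
The idea of more aggressive monitoring thresholds for critical components can be sketched in Python as follows. This is not how OMEGAMON or NetView are configured; the tier names, the threshold values, and the `alert_level` function are assumptions chosen only to show the principle that a critical tier alerts earlier than a standard one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float   # percent utilization that raises a warning
    critical: float  # percent utilization that raises a critical alert


# Hypothetical values: the critical tier alerts at lower utilization.
THRESHOLDS = {
    "critical": Threshold(warning=70.0, critical=85.0),
    "standard": Threshold(warning=85.0, critical=95.0),
}


def alert_level(value: float, tier: str) -> str:
    """Return the alert level for a utilization reading in the given tier."""
    threshold = THRESHOLDS[tier]
    if value >= threshold.critical:
        return "CRITICAL"
    if value >= threshold.warning:
        return "WARNING"
    return "OK"
```

Under these assumed values, the same 88 percent reading raises a critical alert for a critical-tier component but only a warning for a standard-tier one.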