Incident

Enhanced Definition

In the mainframe and z/OS context, an **incident** is an unplanned interruption to an IT service or a degradation in the quality of an IT service. It represents an event that deviates from normal operation and requires immediate attention to restore service as quickly as possible. Incidents can range from minor application errors to major system outages affecting critical business functions.

Key Characteristics

- Unplanned Event: By definition, an incident is an unexpected occurrence, not a planned system maintenance or outage.
- Service Impact: It degrades, or threatens to degrade, the availability, performance, or security of a mainframe service or application.
- Urgent Resolution: Incidents demand prompt investigation and resolution to minimize business disruption and meet Service Level Agreements (SLAs).
- Detection: Incidents are often detected by automated monitoring tools (e.g., OMEGAMON, NetView), surfaced in system logs (SYSLOG), or reported by end users.
- Categorization and Prioritization: Incidents are typically categorized (e.g., application, network, security) and prioritized based on their impact and urgency to ensure appropriate resource allocation.
- Root Cause vs. Symptom: An incident is often a symptom of an underlying problem, which requires deeper analysis to resolve permanently.
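
The prioritization characteristic above can be sketched as a simple lookup from impact and urgency to a priority. This is a minimal Python sketch under stated assumptions: the three-step `Level` scale, the `P1` to `P4` labels, the matrix values, and the `prioritize` function are all hypothetical and would be replaced by whatever scheme the organization has agreed on.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Illustrative matrix: (impact, urgency) -> priority label.
# The labels and their placement are assumptions, not a standard.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "P1",
    (Level.HIGH, Level.MEDIUM): "P2",
    (Level.HIGH, Level.LOW): "P3",
    (Level.MEDIUM, Level.HIGH): "P2",
    (Level.MEDIUM, Level.MEDIUM): "P3",
    (Level.MEDIUM, Level.LOW): "P4",
    (Level.LOW, Level.HIGH): "P3",
    (Level.LOW, Level.MEDIUM): "P4",
    (Level.LOW, Level.LOW): "P4",
}


def prioritize(impact: Level, urgency: Level) -> str:
    """Return the priority label for an incident's impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


if __name__ == "__main__":
    # A medium-impact incident that must be fixed urgently.
    print(prioritize(Level.MEDIUM, Level.HIGH))  # P2
```

Keeping the matrix as data rather than nested conditionals makes it easy to review with service owners and to change without touching the lookup logic.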

Use Cases

- Application Abend: A critical COBOL batch job or CICS transaction program terminates abnormally (abend), preventing further processing or user interaction.
- System Performance Degradation: High CPU utilization across an LPAR, excessive I/O wait times, or slow response times for online CICS or IMS transactions.
- Resource Unavailability: A Db2 subsystem becomes unresponsive, an IMS control region crashes, or a critical dataset is unavailable, halting dependent applications.
- Security Breach Attempt: Unauthorized access attempts against sensitive data or system resources are detected by the security manager (RACF, ACF2, or Top Secret), which raises alerts for investigation.
- Automated Operations Alert: An automated operations tool detects a threshold breach (e.g., disk space critically low, queue depth exceeded) and raises an alert requiring operator intervention.
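
The threshold-breach case in the last item can be illustrated with a small check. This is a minimal Python sketch, not the behavior of any particular automation product: the `Threshold` class, the `evaluate` function, the metric names (`disk_free_pct`, `queue_depth`), and the limits are hypothetical, and real tools would collect the values and route the alert themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical threshold definition for one monitored metric."""
    metric: str
    limit: float
    # True: breach when value >= limit (e.g. queue depth).
    # False: breach when value <= limit (e.g. free disk space).
    alert_when_above: bool = True


def evaluate(threshold: Threshold, value: float) -> Optional[str]:
    """Return an alert message if the value breaches the threshold, else None."""
    if threshold.alert_when_above:
        breached = value >= threshold.limit
    else:
        breached = value <= threshold.limit
    if not breached:
        return None
    return f"ALERT {threshold.metric}: value {value} breached limit {threshold.limit}"


if __name__ == "__main__":
    disk = Threshold("disk_free_pct", 10.0, alert_when_above=False)
    queue = Threshold("queue_depth", 5000)
    # Assumed sample readings; only breaches produce a message.
    for th, reading in [(disk, 4.5), (queue, 1200)]:
        message = evaluate(th, reading)
        if message is not None:
            print(message)
```

Returning `None` for the normal case keeps the caller's decision explicit: only a returned message would be turned into an incident record or an operator notification.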

Related Concepts

Incidents are central to Incident Management, which is a core ITIL process focused on restoring normal service operation as quickly as possible. They are closely related to Problem Management, where incidents often serve as symptoms that trigger a deeper investigation to find and eliminate the root cause, preventing recurrence. Furthermore, incidents directly impact Service Level Agreements (SLAs), as they represent deviations from agreed-upon service levels. Effective Change Management processes are crucial to minimize incidents caused by poorly implemented changes.

Best Practices

- Establish a Clear Incident Management Process: Define roles, responsibilities, escalation paths, and communication protocols for all types of mainframe incidents.
- Implement Robust Monitoring and Alerting: Utilize tools like OMEGAMON, RMF, and NetView for proactive detection of anomalies and automated alerting to operations staff.
- Prioritize Based on Impact and Urgency: Develop a clear incident prioritization matrix to ensure critical business services receive immediate attention.
- Document Resolutions and Create a Knowledge Base: Maintain a comprehensive knowledge base of past incidents and their resolutions to facilitate faster diagnosis and resolution of future occurrences.
- Perform Root Cause Analysis (RCA): For significant or recurring incidents, conduct thorough RCA to identify underlying problems and implement permanent solutions, linking incident management to problem management.
- Communicate Effectively: Provide timely and accurate updates to affected users and stakeholders during an incident, managing expectations and minimizing uncertainty.
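
One way to decide which recurring incidents deserve RCA is to count how often the same symptom appears on the same component. The following is a minimal Python sketch of that idea, assuming incident records have already been reduced to a `(component, symptom)` pair; the `recurring_signatures` function, the job names, and the `min_count` cutoff are hypothetical, and the abend codes are used only as sample symptom values.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Signature = Tuple[str, str]  # (component, symptom), an assumed record shape


def recurring_signatures(
    incidents: Iterable[Signature], min_count: int = 3
) -> List[Tuple[Signature, int]]:
    """Return signatures seen at least min_count times, most frequent first."""
    counts = Counter(incidents)
    return [(sig, n) for sig, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Illustrative history; job and transaction names are made up.
    history = [
        ("PAYJOB1", "S0C7"),
        ("ORDTRAN", "ASRA"),
        ("PAYJOB1", "S0C7"),
        ("INVJOB2", "S0C4"),
        ("PAYJOB1", "S0C7"),
    ]
    for (component, symptom), n in recurring_signatures(history):
        print(f"{component} {symptom}: {n} incidents, candidate for RCA")
```

A signature that crosses the cutoff would typically be raised as a problem record, which is the hand-off from incident management to problem management described above.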