Checkpoint/Restart Logic - Long-Running Batch

Interview Question

"Monthly claims reconciliation job runs 12 hours processing 100 million records:

Current State:

  • Job ABENDs 8-10 hours in (DB2 timeout, space issues, etc.)
  • Each ABEND = reprocess all records from start
  • Wasted 80+ hours CPU time last month
  • Business pressure: 'Fix this or we'll complain to CIO'

Job Structure:

Step 1: Extract claims from DB2 (5M rows) - 1 hour
Step 2: Sort by policy (DFSORT) - 1 hour  
Step 3: Match with policies (DB2 lookup) - 2 hours
Step 4: Calculate reserves (complex logic) - 6 hours ← Usually fails here
Step 5: Update totals (DB2 updates) - 1 hour
Step 6: Generate reports - 1 hour

Requirements:

  • Implement restart logic for Step 4
  • Commit every 10,000 records
  • Must handle DB2 rollback on ABEND
  • Track progress for reporting
  • No impact on normal processing performance

Design the complete restart solution with COBOL code."

What This Tests

  • Checkpoint/restart design patterns
  • DB2 commit strategies
  • Production resilience thinking
  • Complex batch job architecture

Good Answer Should Include

1. Checkpoint Table Design:

CREATE TABLE BATCH_CHECKPOINT (
    JOB_NAME        CHAR(8) NOT NULL,
    JOB_DATE        DATE NOT NULL,
    STEP_NAME       CHAR(8) NOT NULL,
    LAST_KEY        CHAR(20),
    RECORDS_PROCESSED INTEGER,
    COMMIT_TIMESTAMP TIMESTAMP,
    STATUS          CHAR(1),
    PRIMARY KEY (JOB_NAME, JOB_DATE, STEP_NAME)
);

2. COBOL Restart Logic:

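WORKING-STORAGE SECTION.
    EXEC SQL INCLUDE SQLCA END-EXEC.

01  CHECKPOINT-CONTROLS.
    05  COMMIT-FREQUENCY      PIC 9(5) VALUE 10000.
    05  COMMIT-COUNTER        PIC 9(5) VALUE ZERO.
    * S9(9) COMP is the host-variable form that matches INTEGER
    05  TOTAL-PROCESSED       PIC S9(9) COMP VALUE ZERO.
    05  RESTART-KEY           PIC X(20) VALUE SPACES.
    05  RESTART-FLAG          PIC X VALUE 'N'.
    05  WS-STATUS             PIC X.
    05  WS-JOB-NAME           PIC X(8) VALUE 'CLMRECON'.
    05  WS-STEP-NAME          PIC X(8) VALUE 'CALCRSRV'.

* Host variables for the claims cursor (column types are assumed)
01  CLAIM-ROW.
    05  CLAIM-NUMBER          PIC X(20).
    05  POLICY-NUMBER         PIC X(20).
    05  CLAIM-AMOUNT          PIC S9(11)V99 COMP-3.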

PROCEDURE DIVISION.
MAIN-LOGIC.
    PERFORM INITIALIZE-RESTART
    PERFORM PROCESS-CLAIMS
    PERFORM FINALIZE-CHECKPOINT
    STOP RUN.

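INITIALIZE-RESTART.
    * Look for an unfinished run of this step:
    * 'R' = in progress, 'A' = ended abnormally.
    * JOB_DATE = CURRENT DATE assumes the rerun happens on the same
    * calendar day; a run that crosses midnight needs the business
    * date supplied as a parameter instead.
    EXEC SQL
        SELECT LAST_KEY, RECORDS_PROCESSED, STATUS
        INTO :RESTART-KEY, :TOTAL-PROCESSED, :WS-STATUS
        FROM BATCH_CHECKPOINT
        WHERE JOB_NAME = :WS-JOB-NAME
          AND JOB_DATE = CURRENT DATE
          AND STEP_NAME = :WS-STEP-NAME
          AND STATUS IN ('R', 'A')
    END-EXEC.
    
    EVALUATE SQLCODE
        WHEN 0
            MOVE 'Y' TO RESTART-FLAG
            DISPLAY 'RESTARTING FROM KEY: ' RESTART-KEY
            DISPLAY 'ALREADY PROCESSED: ' TOTAL-PROCESSED ' RECORDS'
        WHEN 100
            * Cold start - initialize checkpoint
            MOVE SPACES TO RESTART-KEY
            MOVE ZERO TO TOTAL-PROCESSED
            EXEC SQL
                INSERT INTO BATCH_CHECKPOINT
                VALUES (:WS-JOB-NAME, CURRENT DATE, :WS-STEP-NAME,
                        '', 0, CURRENT TIMESTAMP, 'R')
            END-EXEC
            IF SQLCODE NOT = 0
                * -803 means a row already exists, e.g. the step
                * already completed ('C') for this date
                DISPLAY 'CHECKPOINT INSERT FAILED, SQLCODE: ' SQLCODE
                MOVE 12 TO RETURN-CODE
                STOP RUN
            END-IF
            EXEC SQL COMMIT END-EXEC
        WHEN OTHER
            DISPLAY 'CHECKPOINT LOOKUP FAILED, SQLCODE: ' SQLCODE
            MOVE 12 TO RETURN-CODE
            STOP RUN
    END-EVALUATE.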

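PROCESS-CLAIMS.
    * One cursor covers both cold start and restart. DECLARE CURSOR
    * is not an executable statement, so it cannot be selected with
    * IF/ELSE, and a cursor name can be declared only once.
    * On a cold start RESTART-KEY is spaces, so every non-blank
    * claim number qualifies.
    IF RESTART-FLAG NOT = 'Y'
        MOVE SPACES TO RESTART-KEY
    END-IF.

    * WITH HOLD keeps the cursor open and positioned across COMMIT.
    * Without it the first COMMIT closes C1 and the next FETCH
    * fails with SQLCODE -501.
    EXEC SQL
        DECLARE C1 CURSOR WITH HOLD FOR
        SELECT CLAIM_NUMBER, POLICY_NUMBER, CLAIM_AMOUNT
        FROM CLAIMS_WORK
        WHERE CLAIM_NUMBER > :RESTART-KEY
        ORDER BY CLAIM_NUMBER
    END-EXEC.
    
    EXEC SQL OPEN C1 END-EXEC.
    
    * Priming read. FETCH is always the last SQL statement before
    * the loop test, so the test never sees the SQLCODE left by a
    * COMMIT or by SQL inside CALCULATE-RESERVE.
    EXEC SQL
        FETCH C1 INTO :CLAIM-NUMBER, :POLICY-NUMBER, :CLAIM-AMOUNT
    END-EXEC.
    
    PERFORM UNTIL SQLCODE NOT = 0
        PERFORM CALCULATE-RESERVE
        ADD 1 TO COMMIT-COUNTER
        ADD 1 TO TOTAL-PROCESSED
        
        * Checkpoint logic: the position and the data changes are
        * committed in the same unit of work
        IF COMMIT-COUNTER >= COMMIT-FREQUENCY
            PERFORM WRITE-CHECKPOINT
            EXEC SQL COMMIT END-EXEC
            MOVE ZERO TO COMMIT-COUNTER
            DISPLAY 'CHECKPOINT: ' TOTAL-PROCESSED ' RECORDS'
        END-IF
        
        EXEC SQL
            FETCH C1 INTO :CLAIM-NUMBER, :POLICY-NUMBER, :CLAIM-AMOUNT
        END-EXEC
    END-PERFORM.
    
    * 100 = end of data. Anything else is an error: back out the
    * uncommitted batch and leave the checkpoint row at its last
    * committed position so the rerun resumes from there.
    IF SQLCODE NOT = 100
        DISPLAY 'FETCH FAILED, SQLCODE: ' SQLCODE
        EXEC SQL ROLLBACK END-EXEC
        MOVE 12 TO RETURN-CODE
        STOP RUN
    END-IF.
    
    EXEC SQL CLOSE C1 END-EXEC.
    
    * Final commit for remaining records
    IF COMMIT-COUNTER > 0
        PERFORM WRITE-CHECKPOINT
        EXEC SQL COMMIT END-EXEC
    END-IF.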

WRITE-CHECKPOINT.
    * Update checkpoint table with current position
    EXEC SQL
        UPDATE BATCH_CHECKPOINT
        SET LAST_KEY = :CLAIM-NUMBER,
            RECORDS_PROCESSED = :TOTAL-PROCESSED,
            COMMIT_TIMESTAMP = CURRENT TIMESTAMP,
            STATUS = 'R'
        WHERE JOB_NAME = :WS-JOB-NAME
          AND JOB_DATE = CURRENT DATE
          AND STEP_NAME = :WS-STEP-NAME
    END-EXEC.

FINALIZE-CHECKPOINT.
    * Mark as complete
    EXEC SQL
        UPDATE BATCH_CHECKPOINT
        SET STATUS = 'C',
            COMMIT_TIMESTAMP = CURRENT TIMESTAMP
        WHERE JOB_NAME = :WS-JOB-NAME
          AND JOB_DATE = CURRENT DATE
          AND STEP_NAME = :WS-STEP-NAME
    END-EXEC.
    EXEC SQL COMMIT END-EXEC.
    
    DISPLAY 'JOB COMPLETED. TOTAL RECORDS: ' TOTAL-PROCESSED.

3. JCL Restart Implementation:

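//CLMRECON JOB  (ACCT),'CLAIMS RECON',CLASS=A,
//         MSGCLASS=X,NOTIFY=&SYSUID
//*  Rerun after an ABEND: add RESTART=STEP4 to the JOB statement so
//*  Steps 1-3 are skipped. Leave it off for a normal run.
//*
//STEP1    EXEC PGM=EXTRACT
//*  ... (Steps 1-3 omitted)
//*
//*  DB2 batch programs normally run under the TSO terminal monitor
//*  program. DSN1 is a placeholder for the DB2 subsystem ID.
//STEP4    EXEC PGM=IKJEFT01
//STEPLIB  DD DSN=PROD.LOADLIB,DISP=SHR
//SYSTSPRT DD SYSOUT=*
//SYSOUT   DD SYSOUT=*
//CLAIMS   DD DSN=WORK.CLAIMS.DATA,DISP=SHR
//SYSTSIN  DD *
  DSN SYSTEM(DSN1)
  RUN PROGRAM(CALCRSRV) PLAN(CLMRECON) PARMS('RESTART=AUTO')
  END
/*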

4. Enhanced Error Handling:

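* EXEC CICS HANDLE ABEND applies only to CICS programs, and COBOL has
* no standalone ON EXCEPTION ... END-ON statement, so neither works
* in this batch step.
*
* On an ABEND, DB2 backs out the uncommitted unit of work. The
* checkpoint row was committed together with the last complete batch,
* so it already matches the data and the rerun resumes from it.
* Do not write and commit a checkpoint while failing: that would
* commit a partial batch, possibly including a half-processed claim.
*
* For SQL errors the program detects itself, PERFORM this paragraph
* after each SQL statement:
CHECK-SQL-ERROR.
    IF SQLCODE < 0
        DISPLAY 'SQL ERROR, SQLCODE: ' SQLCODE
        EXEC SQL ROLLBACK END-EXEC
        PERFORM MARK-ABNORMAL
        DISPLAY 'ABNORMAL TERMINATION - ROLLED BACK TO LAST CHECKPOINT'
        MOVE 12 TO RETURN-CODE
        STOP RUN
    END-IF.

MARK-ABNORMAL.
    * After the ROLLBACK, LAST_KEY and RECORDS_PROCESSED still hold
    * the last committed position, so only the status changes.
    * The restart lookup must accept STATUS 'A' as well as 'R'.
    EXEC SQL
        UPDATE BATCH_CHECKPOINT
        SET STATUS = 'A'
        WHERE JOB_NAME = :WS-JOB-NAME
          AND JOB_DATE = CURRENT DATE
          AND STEP_NAME = :WS-STEP-NAME
    END-EXEC.
    EXEC SQL COMMIT END-EXEC.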

5. Monitoring & Reporting:

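* Progress reporting (every 100K records)
* WS-TOTAL-EXPECTED: row count of CLAIMS_WORK, read once at startup
* WS-RUN-PROCESSED:  records handled by this run only, so the rate
*                    stays correct after a restart
IF FUNCTION MOD(TOTAL-PROCESSED, 100000) = 0
    COMPUTE WS-PCT-COMPLETE = 
        (TOTAL-PROCESSED * 100) / WS-TOTAL-EXPECTED
    DISPLAY 'PROGRESS: ' WS-PCT-COMPLETE '% COMPLETE'
    
    * Estimate time remaining. Both times are in seconds, taken
    * from FUNCTION CURRENT-DATE at startup and now.
    COMPUTE WS-ELAPSED = WS-NOW-SECONDS - WS-START-SECONDS
    * Skip the estimate until there is a measurable rate
    IF WS-ELAPSED > 0 AND WS-RUN-PROCESSED > 0
        COMPUTE WS-REMAINING =
            (WS-TOTAL-EXPECTED - TOTAL-PROCESSED) * WS-ELAPSED
            / WS-RUN-PROCESSED
        DISPLAY 'ESTIMATED TIME REMAINING: ' WS-REMAINING ' SECONDS'
    END-IF
END-IF.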
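
The progress fragment above leaves its work fields and the elapsed-time calculation undeclared. The following is a minimal, self-contained sketch of the same arithmetic with no DB2 dependency, written in fixed-format COBOL so it can be tried with a compiler such as GnuCOBOL. The program name ETADEMO, the fields WS-AT-RESTART, WS-RUN-PROCESSED, and WS-TOTAL-EXPECTED, and all sample values are assumptions made for illustration; they are not part of the original design.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ETADEMO.
      * Hypothetical demo of restart-aware progress reporting.
      * All names and sample values are illustrative assumptions.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-TOTAL-EXPECTED     PIC 9(9)  VALUE 100000000.
      *    Count restored from the checkpoint row at restart
       01  WS-AT-RESTART         PIC 9(9)  VALUE 60000000.
       01  TOTAL-PROCESSED       PIC 9(9)  VALUE 70000000.
       01  WS-RUN-PROCESSED      PIC 9(9).
       01  WS-START-SECONDS      PIC 9(12).
       01  WS-NOW-SECONDS        PIC 9(12).
       01  WS-SECONDS-OUT        PIC 9(12).
       01  WS-ELAPSED            PIC 9(9).
       01  WS-REMAINING          PIC 9(9).
       01  WS-PCT-COMPLETE       PIC ZZ9.9.
      *    Layout of the 21-character FUNCTION CURRENT-DATE value
       01  WS-TIMESTAMP.
           05  WS-TS-DATE        PIC 9(8).
           05  WS-TS-HH          PIC 9(2).
           05  WS-TS-MM          PIC 9(2).
           05  WS-TS-SS          PIC 9(2).
           05  FILLER            PIC X(7).

       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM GET-SECONDS
           MOVE WS-SECONDS-OUT TO WS-NOW-SECONDS
      *    Pretend this run started one hour ago
           COMPUTE WS-START-SECONDS = WS-NOW-SECONDS - 3600
           PERFORM REPORT-PROGRESS
           STOP RUN.

       GET-SECONDS.
      *    Seconds since a fixed epoch, so a run that crosses
      *    midnight still yields a positive elapsed time
           MOVE FUNCTION CURRENT-DATE TO WS-TIMESTAMP
           COMPUTE WS-SECONDS-OUT =
               FUNCTION INTEGER-OF-DATE (WS-TS-DATE) * 86400
               + WS-TS-HH * 3600 + WS-TS-MM * 60 + WS-TS-SS.

       REPORT-PROGRESS.
           COMPUTE WS-PCT-COMPLETE ROUNDED =
               TOTAL-PROCESSED * 100 / WS-TOTAL-EXPECTED
           DISPLAY 'PROGRESS: ' WS-PCT-COMPLETE '% COMPLETE'
      *    The rate uses only this run's records and elapsed time
           COMPUTE WS-RUN-PROCESSED =
               TOTAL-PROCESSED - WS-AT-RESTART
           COMPUTE WS-ELAPSED = WS-NOW-SECONDS - WS-START-SECONDS
           IF WS-ELAPSED > 0 AND WS-RUN-PROCESSED > 0
               COMPUTE WS-REMAINING =
                   (WS-TOTAL-EXPECTED - TOTAL-PROCESSED)
                   * WS-ELAPSED / WS-RUN-PROCESSED
               DISPLAY 'ESTIMATED TIME REMAINING: '
                       WS-REMAINING ' SECONDS'
           END-IF.
```

With these sample values the run has handled 10 million records in one hour and 30 million remain, so the estimate works out to 10800 seconds. Dividing the full TOTAL-PROCESSED by this run's elapsed time would instead count the 60 million records from before the restart and overstate the rate.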

6. Benefits Quantified:

Scenario        Without Restart   With Restart     Savings
Fail at 8hrs    8hrs wasted       0hrs wasted      8hrs CPU
Fail at 10hrs   10hrs wasted      0hrs wasted      10hrs CPU
Monthly total   80hrs wasted      ~5hrs overhead   75hrs saved
Cost savings    -                 -                $15,000/month

Red Flags

  • ❌ Suggests using GDG for restart (too coarse-grained)
  • ❌ No DB2 commit strategy
  • ❌ Checkpoint table not indexed properly
  • ❌ Doesn't handle concurrent job runs (same date)
  • ❌ No progress reporting
  • ❌ Doesn't test restart logic

Follow-Up Questions

  • "What if two instances of the job run concurrently?"
  • "How do you handle a DB2 rollback after 5,000 records?"
  • "What's the performance overhead of checkpoint writes?"
  • "How do you test restart logic without waiting 8 hours?"
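
One possible way to approach the last question is to shrink the problem: a small record count, a small commit frequency, and a failure injected at a chosen record. The sketch below simulates that in memory, with no DB2, in fixed-format COBOL that can be tried with a compiler such as GnuCOBOL. The program name RSTTEST and every field in it are hypothetical. CKPT-LAST-KEY and CKPT-PROCESSED stand in for the committed BATCH_CHECKPOINT row, and DONE-COUNT stands in for the committed output of Step 4.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. RSTTEST.
      * Hypothetical harness: simulates checkpoint/restart in memory.
      * CKPT-* fields model the committed BATCH_CHECKPOINT row and
      * DONE-COUNT models the committed results of Step 4.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  TEST-CONTROLS.
           05  TOTAL-RECORDS       PIC 9(5) VALUE 1000.
           05  COMMIT-FREQUENCY    PIC 9(5) VALUE 100.
           05  FAIL-AT-RECORD      PIC 9(5) VALUE 357.
       01  COMMITTED-STATE.
           05  CKPT-LAST-KEY       PIC 9(5) VALUE ZERO.
           05  CKPT-PROCESSED      PIC 9(5) VALUE ZERO.
       01  RUN-STATE.
           05  RUN-NUMBER          PIC 9    VALUE ZERO.
           05  FAILED-FLAG         PIC X    VALUE 'N'.
           05  CURRENT-KEY         PIC 9(5) VALUE ZERO.
           05  COMMIT-COUNTER      PIC 9(5) VALUE ZERO.
           05  TOTAL-PROCESSED     PIC 9(5) VALUE ZERO.
           05  START-IDX           PIC 9(5) VALUE ZERO.
           05  REC-IDX             PIC 9(5) VALUE ZERO.
           05  ERROR-COUNT         PIC 9(5) VALUE ZERO.
       01  RESULT-TABLE.
           05  DONE-COUNT          PIC 9 OCCURS 1000 TIMES.

       PROCEDURE DIVISION.
       MAIN-LOGIC.
           INITIALIZE RESULT-TABLE
      *    Run 1 fails part-way; run 2 is the restart
           PERFORM RUN-STEP
           IF FAILED-FLAG = 'Y'
               PERFORM RUN-STEP
           END-IF
           PERFORM VERIFY-RESULTS
           STOP RUN.

       RUN-STEP.
           ADD 1 TO RUN-NUMBER
           MOVE 'N' TO FAILED-FLAG
           MOVE ZERO TO COMMIT-COUNTER
      *    The restart position comes only from committed state
           MOVE CKPT-LAST-KEY TO CURRENT-KEY
           MOVE CKPT-PROCESSED TO TOTAL-PROCESSED
           DISPLAY 'RUN ' RUN-NUMBER ' RESUMES AFTER KEY ' CURRENT-KEY
           PERFORM UNTIL CURRENT-KEY >= TOTAL-RECORDS
                      OR FAILED-FLAG = 'Y'
               ADD 1 TO CURRENT-KEY
               IF RUN-NUMBER = 1 AND CURRENT-KEY = FAIL-AT-RECORD
      *            Injected failure: work since the last commit
      *            is never applied, which models the DB2 backout
                   MOVE 'Y' TO FAILED-FLAG
               ELSE
                   ADD 1 TO COMMIT-COUNTER
                   ADD 1 TO TOTAL-PROCESSED
                   IF COMMIT-COUNTER >= COMMIT-FREQUENCY
                       PERFORM COMMIT-BATCH
                   END-IF
               END-IF
           END-PERFORM
           IF FAILED-FLAG = 'N' AND COMMIT-COUNTER > 0
               PERFORM COMMIT-BATCH
           END-IF.

       COMMIT-BATCH.
      *    Data changes and checkpoint position commit together
           COMPUTE START-IDX = CKPT-LAST-KEY + 1
           PERFORM VARYING REC-IDX FROM START-IDX BY 1
                   UNTIL REC-IDX > CURRENT-KEY
               ADD 1 TO DONE-COUNT (REC-IDX)
           END-PERFORM
           MOVE CURRENT-KEY TO CKPT-LAST-KEY
           MOVE TOTAL-PROCESSED TO CKPT-PROCESSED
           MOVE ZERO TO COMMIT-COUNTER.

       VERIFY-RESULTS.
      *    Every record must be committed exactly once
           MOVE ZERO TO ERROR-COUNT
           PERFORM VARYING REC-IDX FROM 1 BY 1
                   UNTIL REC-IDX > TOTAL-RECORDS
               IF DONE-COUNT (REC-IDX) NOT = 1
                   ADD 1 TO ERROR-COUNT
               END-IF
           END-PERFORM
           IF ERROR-COUNT = 0 AND CKPT-PROCESSED = TOTAL-RECORDS
               DISPLAY 'RESTART TEST PASSED'
           ELSE
               DISPLAY 'RESTART TEST FAILED: ' ERROR-COUNT
               MOVE 8 TO RETURN-CODE
           END-IF.
```

With the values shown, the first run commits at keys 100, 200, and 300, then fails at record 357, so the second run should resume after key 300. The check passes only if no record was skipped or committed twice. Against real DB2 the same idea applies: load a small CLAIMS_WORK table, lower the commit frequency, and force a failure at a known claim.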

Difficulty Level

Senior/Expert

Relevant Roles

Senior Developer, Architect, Batch Lead