Checkpoint/Restart Logic - Long-Running Batch
Interview Question
"Monthly claims reconciliation job runs 12 hours processing 100 million records:
Current State:
- Job ABENDs 8-10 hours in (DB2 timeout, space issues, etc.)
- Each ABEND = reprocess all records from start
- Wasted 80+ hours CPU time last month
- Business pressure: 'Fix this or we'll complain to CIO'
Job Structure:
Step 1: Extract claims from DB2 (5M rows) - 1 hour
Step 2: Sort by policy (DFSORT) - 1 hour
Step 3: Match with policies (DB2 lookup) - 2 hours
Step 4: Calculate reserves (complex logic) - 6 hours ← Usually fails here
Step 5: Update totals (DB2 updates) - 1 hour
Step 6: Generate reports - 1 hour
Requirements:
- Implement restart logic for Step 4
- Commit every 10,000 records
- Must handle DB2 rollback on ABEND
- Track progress for reporting
- No impact on normal processing performance
Design the complete restart solution with COBOL code."
What This Tests
Good Answer Should Include
1. Checkpoint Table Design:
CREATE TABLE BATCH_CHECKPOINT (
    JOB_NAME          CHAR(8)   NOT NULL,
    JOB_DATE          DATE      NOT NULL,
    STEP_NAME         CHAR(8)   NOT NULL,
    LAST_KEY          CHAR(20),
    RECORDS_PROCESSED INTEGER,
    COMMIT_TIMESTAMP  TIMESTAMP,
    STATUS            CHAR(1),
    PRIMARY KEY (JOB_NAME, JOB_DATE, STEP_NAME)
);
2. COBOL Restart Logic:
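       WORKING-STORAGE SECTION.
      * SQLCA supplies the SQLCODE field tested after each SQL call
           EXEC SQL INCLUDE SQLCA END-EXEC.
       01  CHECKPOINT-CONTROLS.
           05  COMMIT-FREQUENCY    PIC 9(5)  VALUE 10000.
           05  COMMIT-COUNTER      PIC 9(5)  VALUE ZERO.
      * Binary host variable for the INTEGER column RECORDS_PROCESSED
           05  TOTAL-PROCESSED     PIC S9(9) COMP VALUE ZERO.
           05  RESTART-KEY         PIC X(20) VALUE SPACES.
           05  RESTART-FLAG        PIC X     VALUE 'N'.
      * Receives STATUS in INITIALIZE-RESTART
           05  WS-STATUS           PIC X     VALUE SPACE.
           05  WS-JOB-NAME         PIC X(8)  VALUE 'CLMRECON'.
           05  WS-STEP-NAME        PIC X(8)  VALUE 'CALCRSRV'.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM INITIALIZE-RESTART
           PERFORM PROCESS-CLAIMS
           PERFORM FINALIZE-CHECKPOINT
           STOP RUN.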
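       INITIALIZE-RESTART.
      * Look for an unfinished run: 'R' = in flight when the step
      * died, 'A' = flagged by the error handler.
      * CURRENT DATE assumes the rerun happens on the same calendar
      * day; a run that crosses midnight needs the business date
      * passed in instead.
           EXEC SQL
               SELECT LAST_KEY, RECORDS_PROCESSED, STATUS
                 INTO :RESTART-KEY, :TOTAL-PROCESSED, :WS-STATUS
                 FROM BATCH_CHECKPOINT
                WHERE JOB_NAME  = :WS-JOB-NAME
                  AND JOB_DATE  = CURRENT DATE
                  AND STEP_NAME = :WS-STEP-NAME
                  AND STATUS IN ('R', 'A')
           END-EXEC.
           EVALUATE SQLCODE
               WHEN 0
                   MOVE 'Y' TO RESTART-FLAG
                   DISPLAY 'RESTARTING FROM KEY: ' RESTART-KEY
                   DISPLAY 'ALREADY PROCESSED: ' TOTAL-PROCESSED
                           ' RECORDS'
               WHEN 100
      * Cold start - initialize checkpoint
                   EXEC SQL
                       INSERT INTO BATCH_CHECKPOINT
                       VALUES (:WS-JOB-NAME, CURRENT DATE,
                               :WS-STEP-NAME, '', 0,
                               CURRENT TIMESTAMP, 'R')
                   END-EXEC
                   EXEC SQL COMMIT END-EXEC
               WHEN OTHER
      * Any other SQLCODE is a real error, not a cold start
                   DISPLAY 'CHECKPOINT LOOKUP FAILED, SQLCODE: '
                           SQLCODE
                   MOVE 12 TO RETURN-CODE
                   STOP RUN
           END-EVALUATE.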
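       PROCESS-CLAIMS.
      * One cursor serves both cold start and restart. DECLARE CURSOR
      * is a declaration, not an executable statement, so it cannot
      * be selected with IF. WITH HOLD keeps the cursor open and
      * positioned across COMMIT; without it the first COMMIT closes
      * the cursor.
      * Assumes CLAIM_NUMBER is unique in CLAIMS_WORK; duplicates of
      * the checkpointed key would be skipped on restart.
           EXEC SQL
               DECLARE C1 CURSOR WITH HOLD FOR
               SELECT CLAIM_NUMBER, POLICY_NUMBER, CLAIM_AMOUNT
                 FROM CLAIMS_WORK
                WHERE CLAIM_NUMBER > :RESTART-KEY
                ORDER BY CLAIM_NUMBER
           END-EXEC.
      * Cold start: LOW-VALUES sorts below every real claim number
           IF RESTART-FLAG NOT = 'Y'
               MOVE LOW-VALUES TO RESTART-KEY
           END-IF.
           EXEC SQL OPEN C1 END-EXEC.
      * Priming fetch: the loop test always sees the FETCH SQLCODE,
      * never one left behind by CALCULATE-RESERVE or COMMIT
           PERFORM FETCH-NEXT-CLAIM.
           PERFORM UNTIL SQLCODE NOT = 0
               PERFORM CALCULATE-RESERVE
               ADD 1 TO COMMIT-COUNTER
               ADD 1 TO TOTAL-PROCESSED
      * Checkpoint logic: the position and the work it describes
      * are committed in the same unit of work
               IF COMMIT-COUNTER >= COMMIT-FREQUENCY
                   PERFORM WRITE-CHECKPOINT
                   EXEC SQL COMMIT END-EXEC
                   MOVE ZERO TO COMMIT-COUNTER
                   DISPLAY 'CHECKPOINT: ' TOTAL-PROCESSED
                           ' RECORDS'
               END-IF
               PERFORM FETCH-NEXT-CLAIM
           END-PERFORM.
           IF SQLCODE NOT = 100
      * A negative SQLCODE (for example -911, timeout): back out the
      * partial interval. The last committed checkpoint row still
      * marks the restart point.
               DISPLAY 'FETCH FAILED, SQLCODE: ' SQLCODE
               EXEC SQL ROLLBACK END-EXEC
               MOVE 12 TO RETURN-CODE
               STOP RUN
           END-IF.
           EXEC SQL CLOSE C1 END-EXEC.
      * Final commit for remaining records
           IF COMMIT-COUNTER > 0
               PERFORM WRITE-CHECKPOINT
               EXEC SQL COMMIT END-EXEC
           END-IF.
       FETCH-NEXT-CLAIM.
           EXEC SQL
               FETCH C1
                INTO :CLAIM-NUMBER, :POLICY-NUMBER, :CLAIM-AMOUNT
           END-EXEC.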
       WRITE-CHECKPOINT.
      * Update checkpoint table with current position
           EXEC SQL
               UPDATE BATCH_CHECKPOINT
                  SET LAST_KEY          = :CLAIM-NUMBER,
                      RECORDS_PROCESSED = :TOTAL-PROCESSED,
                      COMMIT_TIMESTAMP  = CURRENT TIMESTAMP,
                      STATUS            = 'R'
                WHERE JOB_NAME  = :WS-JOB-NAME
                  AND JOB_DATE  = CURRENT DATE
                  AND STEP_NAME = :WS-STEP-NAME
           END-EXEC.
       FINALIZE-CHECKPOINT.
      * Mark as complete
           EXEC SQL
               UPDATE BATCH_CHECKPOINT
                  SET STATUS           = 'C',
                      COMMIT_TIMESTAMP = CURRENT TIMESTAMP
                WHERE JOB_NAME  = :WS-JOB-NAME
                  AND JOB_DATE  = CURRENT DATE
                  AND STEP_NAME = :WS-STEP-NAME
           END-EXEC.
           EXEC SQL COMMIT END-EXEC.
           DISPLAY 'JOB COMPLETED. TOTAL RECORDS: ' TOTAL-PROCESSED.
3. JCL Restart Implementation:
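//CLMRECON JOB (ACCT),'CLAIMS RECON',CLASS=A,
//         MSGCLASS=X,NOTIFY=&SYSUID,
//         RESTART=STEP4
//* RESTART=STEP4 is coded only when resubmitting after a failure
//* in Step 4; the normal monthly run omits it so Steps 1-3 execute.
//*
//STEP1    EXEC PGM=EXTRACT
//* ... (Steps 1-3 omitted)
//*
//* A DB2 batch program runs under the TSO terminal monitor program;
//* the plan is named on the RUN subcommand, not on a DD statement.
//* DB2P is a placeholder subsystem ID.
//STEP4    EXEC PGM=IKJEFT01
//STEPLIB  DD DSN=PROD.LOADLIB,DISP=SHR
//SYSTSPRT DD SYSOUT=*
//SYSOUT   DD SYSOUT=*
//CLAIMS   DD DSN=WORK.CLAIMS.DATA,DISP=SHR
//SYSTSIN  DD *
  DSN SYSTEM(DB2P)
  RUN PROGRAM(CALCRSRV) PLAN(CLMRECON) -
      LIB('PROD.LOADLIB') PARMS('RESTART=AUTO')
  END
/*
//*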
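The JCL passes RESTART=AUTO to the step, but the COBOL in section 2 never reads it. The following is a minimal sketch of how a batch main program might receive that string, assuming the usual z/OS convention of a halfword length followed by the text. The program name PARMDEMO, the JCL-PARM layout names, WS-RESTART-MODE, and the 'COLD' override value are all hypothetical and not part of the original design.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PARMDEMO.
      * Hypothetical stand-alone sketch: shows only how a batch COBOL
      * main program receives the PARM string on z/OS.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Default used when no PARM is coded
       01  WS-RESTART-MODE         PIC X(4)  VALUE 'AUTO'.
       LINKAGE SECTION.
      * z/OS passes a halfword length followed by the PARM text
       01  JCL-PARM.
           05  PARM-LENGTH         PIC S9(4) COMP.
           05  PARM-TEXT           PIC X(100).
       PROCEDURE DIVISION USING JCL-PARM.
       MAIN-LOGIC.
      * Only the first PARM-LENGTH bytes of PARM-TEXT are valid;
      * 'RESTART=' plus a 4-byte mode needs at least 12 bytes
           IF PARM-LENGTH >= 12
              AND PARM-TEXT(1:8) = 'RESTART='
               MOVE PARM-TEXT(9:4) TO WS-RESTART-MODE
           END-IF
           EVALUATE WS-RESTART-MODE
               WHEN 'AUTO'
                   DISPLAY 'RESTART MODE: AUTO (USE CHECKPOINT ROW)'
               WHEN 'COLD'
                   DISPLAY 'RESTART MODE: COLD (IGNORE CHECKPOINT)'
               WHEN OTHER
                   DISPLAY 'UNKNOWN RESTART MODE: ' WS-RESTART-MODE
                   MOVE 16 TO RETURN-CODE
           END-EVALUATE
           GOBACK.
```

In the real step, the EVALUATE branches would decide whether INITIALIZE-RESTART consults BATCH_CHECKPOINT or forces a cold start. An explicit override of this kind can be useful when operations need to rerun the step from the beginning despite an existing checkpoint row.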
4. Enhanced Error Handling:
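      * EXEC CICS HANDLE ABEND applies only to CICS transactions, and
      * COBOL has no general ON EXCEPTION block, so a batch step
      * checks SQLCODE after each SQL statement and routes failures
      * here.
           IF SQLCODE < 0
               PERFORM EMERGENCY-CHECKPOINT
               DISPLAY 'ABNORMAL TERMINATION - CHECKPOINT FLAGGED'
               MOVE 12 TO RETURN-CODE
               STOP RUN
           END-IF.
       EMERGENCY-CHECKPOINT.
      * Back out the uncommitted interval first. LAST_KEY and
      * RECORDS_PROCESSED are deliberately left at their last
      * committed values: advancing them here would make the restart
      * skip claims whose updates were just rolled back.
           EXEC SQL ROLLBACK END-EXEC.
      * 'A' = abnormal status
           EXEC SQL
               UPDATE BATCH_CHECKPOINT
                  SET COMMIT_TIMESTAMP = CURRENT TIMESTAMP,
                      STATUS           = 'A'
                WHERE JOB_NAME  = :WS-JOB-NAME
                  AND JOB_DATE  = CURRENT DATE
                  AND STEP_NAME = :WS-STEP-NAME
           END-EXEC.
           EXEC SQL COMMIT END-EXEC.
      * On a true ABEND (S0C7, S322, operator cancel) no program code
      * runs; DB2 rolls back to the last COMMIT and the checkpoint
      * row keeps STATUS = 'R', which is still a valid restart point.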
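The question names DB2 timeouts as a common failure, and the error handling above depends on telling those apart from normal end of data. The sketch below shows one way the decision might be laid out. It runs without DB2 by classifying a few sample SQLCODE values, so the program name SQLCLASS, the sample table, and the action texts are illustrative assumptions. The SQLCODE meanings themselves are standard: -911 is a deadlock or timeout where DB2 has already rolled back, -913 is the same condition without an automatic rollback, and -904 is an unavailable resource.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SQLCLASS.
      * Hypothetical sketch: classifies sample SQLCODE values without
      * connecting to DB2, to show the decision logic only.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-SAMPLE-VALUES.
           05  FILLER              PIC S9(9) COMP VALUE 0.
           05  FILLER              PIC S9(9) COMP VALUE 100.
           05  FILLER              PIC S9(9) COMP VALUE -911.
           05  FILLER              PIC S9(9) COMP VALUE -913.
           05  FILLER              PIC S9(9) COMP VALUE -904.
       01  WS-SAMPLE-TABLE REDEFINES WS-SAMPLE-VALUES.
           05  WS-SAMPLE-CODE      PIC S9(9) COMP OCCURS 5 TIMES.
       01  WS-IDX                  PIC 9(2).
       01  WS-CODE                 PIC S9(9) COMP.
       01  WS-CODE-EDIT            PIC -(5)9.
       01  WS-ACTION               PIC X(40).
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM VARYING WS-IDX FROM 1 BY 1 UNTIL WS-IDX > 5
               MOVE WS-SAMPLE-CODE(WS-IDX) TO WS-CODE
               PERFORM CLASSIFY-SQLCODE
               MOVE WS-CODE TO WS-CODE-EDIT
               DISPLAY 'SQLCODE ' WS-CODE-EDIT ' -> ' WS-ACTION
           END-PERFORM
           GOBACK.
       CLASSIFY-SQLCODE.
           EVALUATE WS-CODE
               WHEN 0
                   MOVE 'CONTINUE' TO WS-ACTION
               WHEN 100
                   MOVE 'END OF DATA' TO WS-ACTION
               WHEN -911
      * Deadlock or timeout; DB2 has already rolled back
                   MOVE 'FLAG CHECKPOINT, END STEP FOR RESTART'
                     TO WS-ACTION
               WHEN -913
      * Deadlock or timeout; the program must roll back itself
                   MOVE 'ROLLBACK, FLAG CHECKPOINT, END STEP'
                     TO WS-ACTION
               WHEN OTHER
      * Any other error (for example -904, resource unavailable)
                   MOVE 'ROLLBACK, FLAG CHECKPOINT, END STEP'
                     TO WS-ACTION
           END-EVALUATE.
```

In the production step, the same EVALUATE would test the real SQLCODE from the SQLCA. Issuing ROLLBACK after a -911 is harmless, which is why a single error paragraph can serve every negative code.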
5. Monitoring & Reporting:
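      * Progress reporting (every 100K records).
      * WS-EXPECTED-TOTAL holds the expected row count for the step.
      * WS-RUN-PROCESSED counts rows handled since this run started,
      * so the rate is not inflated by rows done before a restart.
      * CURRENT-TIME and START-TIME are seconds values that the
      * program maintains.
           IF FUNCTION MOD(TOTAL-PROCESSED, 100000) = 0
               COMPUTE WS-PCT-COMPLETE =
                   (TOTAL-PROCESSED / WS-EXPECTED-TOTAL) * 100
               DISPLAY 'PROGRESS: ' WS-PCT-COMPLETE '% COMPLETE'
      * Estimate time remaining
               COMPUTE WS-ELAPSED = CURRENT-TIME - START-TIME
               IF WS-ELAPSED > 0 AND WS-RUN-PROCESSED > 0
                   COMPUTE WS-RATE = WS-RUN-PROCESSED / WS-ELAPSED
                   COMPUTE WS-REMAINING =
                       (WS-EXPECTED-TOTAL - TOTAL-PROCESSED) / WS-RATE
                   DISPLAY 'ESTIMATED TIME REMAINING: '
                           WS-REMAINING ' SECONDS'
               END-IF
           END-IF.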
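The progress code subtracts START-TIME from CURRENT-TIME but does not show where those values come from. One possible way to obtain them, sketched here as an assumption and not as the original author's method, is to convert FUNCTION CURRENT-DATE into an absolute count of seconds so that the subtraction stays correct when the run crosses midnight. The program name ELAPSED and all WS- names are hypothetical, and the sketch ignores clock changes such as daylight saving shifts.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ELAPSED.
      * Hypothetical sketch: derives elapsed seconds from the system
      * clock in a way that survives a run crossing midnight.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * FUNCTION CURRENT-DATE returns 21 characters:
      * YYYYMMDD HHMMSS, then hundredths and the UTC offset
       01  WS-NOW.
           05  WS-NOW-DATE         PIC 9(8).
           05  WS-NOW-HH           PIC 9(2).
           05  WS-NOW-MM           PIC 9(2).
           05  WS-NOW-SS           PIC 9(2).
           05  FILLER              PIC X(7).
       01  WS-ABS-SECONDS          PIC S9(18) COMP.
       01  WS-START-SECS           PIC S9(18) COMP.
       01  WS-CURRENT-SECS         PIC S9(18) COMP.
       01  WS-ELAPSED              PIC S9(9)  COMP.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM GET-ABS-SECONDS
           MOVE WS-ABS-SECONDS TO WS-START-SECS
      * ... claim processing would run here ...
           PERFORM GET-ABS-SECONDS
           MOVE WS-ABS-SECONDS TO WS-CURRENT-SECS
           COMPUTE WS-ELAPSED = WS-CURRENT-SECS - WS-START-SECS
           DISPLAY 'ELAPSED SECONDS: ' WS-ELAPSED
           GOBACK.
       GET-ABS-SECONDS.
      * Day number times 86,400 plus seconds since midnight
           MOVE FUNCTION CURRENT-DATE TO WS-NOW
           COMPUTE WS-ABS-SECONDS =
               FUNCTION INTEGER-OF-DATE(WS-NOW-DATE) * 86400
               + WS-NOW-HH * 3600
               + WS-NOW-MM * 60
               + WS-NOW-SS.
```

Calling the clock only at each 100K-record report, as the progress code does, keeps the overhead negligible and is in line with the requirement not to slow normal processing.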
6. Benefits Quantified:
| Scenario | Without Restart | With Restart | Savings |
|---|---|---|---|
| Fail at 8 hrs | 8 hrs wasted | At most one commit interval (10,000 records) redone | ~8 hrs CPU |
| Fail at 10 hrs | 10 hrs wasted | At most one commit interval (10,000 records) redone | ~10 hrs CPU |
| Monthly total | 80 hrs wasted | ~5 hrs overhead | ~75 hrs saved |
| Cost savings | - | - | $15,000/month |
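The work redone after a failure is bounded by one commit interval. The estimate below is rough and assumes Step 4 processes records at a uniform rate over its 6 hours. The question gives both 100 million records and 5M rows, so both cases are worked. The last line simply divides the table's own figures and is an inference, not a number stated in the source.

```latex
% T = Step 4 run time, N = records in the step, c = commit interval
\[
t_{\text{rework}} \le T \cdot \frac{c}{N},
\qquad T = 6\ \text{h} = 21{,}600\ \text{s},
\qquad c = 10^{4}
\]
\[
N = 10^{8}: \quad
t_{\text{rework}} \le 21{,}600 \cdot \frac{10^{4}}{10^{8}} = 2.16\ \text{s}
\]
\[
N = 5 \times 10^{6}: \quad
t_{\text{rework}} \le 21{,}600 \cdot \frac{10^{4}}{5 \times 10^{6}} = 43.2\ \text{s}
\]
% Implied cost rate from the table: savings divided by hours saved
\[
\frac{\$15{,}000}{80\ \text{h} - 5\ \text{h}} = \$200\ \text{per CPU-hour}
\]
```

Under either record count, the rework after a failure is measured in seconds, compared with the 8 to 10 hours lost today.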
Red Flags
Follow-Up Questions
Difficulty Level
Senior/Expert
Relevant Roles
Senior Developer, Architect, Batch Lead