Data Set

Enhanced Definition

A fundamental unit of data storage on IBM mainframe systems running z/OS, representing a collection of logically related records. It is the mainframe equivalent of a file in other operating systems, managed by z/OS and its data management services.

Key Characteristics

- Organization: Data sets can be organized as sequential (PS), partitioned (PDS or PDSE), or Virtual Storage Access Method (VSAM) data sets (KSDS, ESDS, RRDS, LDS). Related data sets can also be grouped chronologically into a Generation Data Group (GDG), which is a collection of data set generations rather than an organization of its own.
- Naming Convention: Each data set is identified by a unique data set name (DSN), which is typically hierarchical (e.g., PROD.APPL.COBOL.SOURCE).
- Allocation: A data set must be allocated before it is used, typically on Direct Access Storage Devices (DASD), using JCL or utilities and specifying attributes such as space, record format, and block size.
- Attributes: Defined by characteristics such as RECFM (Record Format - Fixed, Variable, Undefined), LRECL (Logical Record Length), BLKSIZE (Block Size), and DSORG (Data Set Organization).
- Management: Managed by DFSMSdfp, the data management component of z/OS DFSMS (formerly the Data Facility Product, DFP), which handles allocation, cataloging, and I/O operations.
- Cataloging: Most production data sets are cataloged in the Integrated Catalog Facility (ICF) to allow programs and users to locate them by DSN without needing specific volume information.
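The naming convention can be checked mechanically. The following is a minimal sketch, assuming the basic z/OS rules: a DSN is at most 44 characters including periods, and each qualifier is 1 to 8 characters, starting with a letter or national character (#, @, $). The function names are hypothetical, and installation-specific standards (for example, required high-level qualifiers) are not covered.

```python
import re

# One qualifier: 1-8 characters. The first is a letter or a national
# character (#, @, $); the rest may also be digits or a hyphen.
_QUALIFIER = re.compile(r"[A-Z#@$][A-Z0-9#@$-]{0,7}")
MAX_DSN_LENGTH = 44  # total length, including the separating periods


def is_valid_dsn(dsn: str) -> bool:
    """Return True if dsn follows the basic data set naming rules."""
    if not dsn or len(dsn) > MAX_DSN_LENGTH:
        return False
    # Every period-separated qualifier must match on its own;
    # an empty qualifier (two adjacent periods) fails the match.
    return all(_QUALIFIER.fullmatch(q) for q in dsn.split("."))


def high_level_qualifier(dsn: str) -> str:
    """Return the first qualifier, often used to indicate ownership."""
    return dsn.split(".")[0]


if __name__ == "__main__":
    for name in ("PROD.APPL.COBOL.SOURCE", "PROD..SOURCE", "1PROD.DATA"):
        print(name, is_valid_dsn(name))
```

The check only accepts upper-case names, which matches how DSNs are normally written in JCL.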

Use Cases

- Source Code Storage: PDS or PDSE libraries commonly hold program source written in COBOL, PL/I, or Assembler, as well as JCL and cataloged procedures.
- Program Libraries: Executable programs invoked by batch jobs, CICS transactions, or IMS applications are stored as load modules in a PDS or as program objects in a PDSE.
- Transaction Data: VSAM KSDS or ESDS are frequently used by online transaction processing systems like CICS and IMS for high-volume, random, or sequential access to application data.
- Batch Processing Input/Output: Sequential data sets are extensively used for input files, intermediate work files, and output reports generated by batch jobs.
- System Logs and Journals: Sequential data sets or VSAM ESDS can store system logs, audit trails, and journal records for recovery or compliance.

Related Concepts

Data sets are the fundamental building blocks for data storage on z/OS. In batch jobs they are defined and referenced through JCL (Job Control Language) DD statements, which specify the data set name, disposition, space, and record attributes. COBOL and other programming languages interact with data sets through file definitions (SELECT, FD) and I/O statements (OPEN, READ, WRITE). CICS applications commonly access VSAM data sets, and DB2 stores its table spaces and indexes in VSAM Linear Data Sets (LDS) that it manages itself.
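To make the role of the DD statement concrete, here is a minimal sketch that formats one from a handful of attributes. The function name, the ddname OUTFILE, the DSN, and the space values are hypothetical; the sketch assumes a new, cataloged data set on a system where unit and volume are assigned by default (for example, under SMS), so UNIT and VOLUME are omitted.

```python
def build_dd_statement(ddname, dsn, recfm, lrecl, blksize,
                       space_unit="CYL", primary=10, secondary=5):
    """Format a DD statement that allocates and catalogs a new data set."""
    if not 1 <= len(ddname) <= 8:
        raise ValueError("ddname must be 1 to 8 characters")
    params = [
        f"DSN={dsn}",
        "DISP=(NEW,CATLG,DELETE)",  # catalog on success, delete on abend
        f"SPACE=({space_unit},({primary},{secondary}),RLSE)",
        f"DCB=(RECFM={recfm},LRECL={lrecl},BLKSIZE={blksize})",
    ]
    first = f"//{ddname:<8} DD "
    # Continuation lines start with // and resume in column 15,
    # inside the column 4-16 range that JCL allows.
    continuation = "//" + " " * (len(first) - 2)
    lines = []
    for i, param in enumerate(params):
        prefix = first if i == 0 else continuation
        suffix = "," if i < len(params) - 1 else ""  # comma signals continuation
        line = prefix + param + suffix
        if len(line) > 71:
            raise ValueError("JCL text must not extend past column 71")
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_dd_statement("OUTFILE", "PROD.APPL.DATA.MASTFILE",
                             "FB", 80, 27920))
```

Generating JCL as text like this is only an illustration of which attributes a DD statement carries; real jobs usually code the statement directly or take many of these values from SMS data classes.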

Best Practices

- Descriptive Naming: Use clear, hierarchical DSNs that indicate ownership, application, and content (e.g., PROD.APPL.DATA.MASTFILE) for easier identification and management. Avoid the SYS1 high-level qualifier for application data, since it is conventionally reserved for system data sets.
- Optimal Block Size: Choose a BLKSIZE that balances I/O performance and DASD utilization. For fixed-blocked (FB) records, BLKSIZE must be a multiple of LRECL; on 3390 DASD, half-track blocking (at most 27,998 bytes per block) is a common target. Omitting BLKSIZE or coding BLKSIZE=0 generally lets the system determine a suitable value.
- Cataloging: Always catalog production data sets in the ICF to simplify access and facilitate data set management.
- Space Allocation: Allocate sufficient primary and secondary space to prevent out-of-space abends (B37, D37, E37), but avoid gross over-allocation that wastes DASD.
- GDG Usage: Use Generation Data Groups (GDGs) for sequential data sets that are created on a regular cycle or archived. Relative generation numbers such as (0) and (+1) simplify JCL and provide inherent version control.
- Data Set Security: Implement RACF (or equivalent security product) profiles to control access to data sets, granting each user only the access level needed (in RACF: NONE, EXECUTE, READ, UPDATE, CONTROL, or ALTER) to protect sensitive information.
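The block size guidance can be illustrated with a short calculation. This is a minimal sketch for fixed-blocked (RECFM=FB) records only, assuming 3390 DASD, where a block of at most 27,998 bytes fits twice on a track; the function names are hypothetical. Variable-length formats follow different rules and are not handled here.

```python
HALF_TRACK_3390 = 27998  # largest block that fits twice on a 3390 track


def half_track_blksize(lrecl: int, limit: int = HALF_TRACK_3390) -> int:
    """Largest multiple of lrecl that does not exceed the half-track limit."""
    if lrecl <= 0 or lrecl > limit:
        raise ValueError("lrecl must be between 1 and the block size limit")
    return (limit // lrecl) * lrecl


def records_per_block(lrecl: int, blksize: int) -> int:
    """Blocking factor for FB records; blksize must be a multiple of lrecl."""
    if lrecl <= 0 or blksize % lrecl != 0:
        raise ValueError("for RECFM=FB, BLKSIZE must be a multiple of LRECL")
    return blksize // lrecl


if __name__ == "__main__":
    for lrecl in (80, 133, 100):
        blksize = half_track_blksize(lrecl)
        print(lrecl, blksize, records_per_block(lrecl, blksize))
```

In practice, letting the system determine the block size usually gives the same result without hard-coding a device-dependent value in JCL.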