# Record Objects The **Record** class is DBTK's universal data structure. Every cursor query and file reader returns Records, and every `*Surge` class expects them as input. Understanding Record unlocks the full power of DBTK's pipeline architecture. ## Why Record? ETL pipelines process hundreds of thousands or millions of rows, all with the same columns. This creates a design tension: - **Dicts** give you flexible key-based access, but store keys on every instance. A million rows means a million copies of the same column names. - **Tuples** are memory-efficient, but force you to remember column positions. `row[7]` is fragile and unreadable. - **Named tuples** are immutable - you can't add or modify fields during transformation. Record solves this by storing column names once on a shared class while each instance holds only its values. You get dict-like flexibility with tuple-like memory efficiency. ```python # All of these work on the same Record object: row['name'] # Dict-style access row.name # Attribute access row[1] # Index access row[1:3] # Slicing name, email = row[1:3] # Unpack a slice row.get('phone', 'N/A') # Safe access with default dict(row) # Convert to dict when needed '\t'.join(str(v) for v in row.values()) # Join values ``` ## Access Patterns ### Dict-Style Access ```python # Key access user_name = row['name'] # Safe access with default phone = row.get('phone', 'unlisted') # Check if column exists if 'email' in row: send_notification(row['email']) # Iterate over keys for column in row.keys(): print(column) # Iterate over key-value pairs for column, value in row.items(): print(f"{column}: {value}") ``` ### Attribute Access ```python # Clean, readable syntax print(f"Welcome, {row.name}!") # Works with normalized names (see below) row.employee_id # Works even if column is "Employee_ID" ``` ### Index and Slice Access ```python # Positional access first_col = row[0] last_col = row[-1] # Slicing first_three = row[:3] middle = row[2:5] # Unpack specific positions id, name, email = row[:3] ``` ### Tuple Unpacking ```python # Unpack entire row id, name, email, created = row # Unpack in loop for id, name, status in cursor: process(id, name, status) ``` ## Normalized Field Names Record stores both original column names and normalized versions. Normalization exists because Python attribute syntax doesn't allow `row.FULL NAME` or `row.Employee-ID` - only valid Python identifiers work as attributes. Normalization applies the following rules in order: 1. Lowercase and strip surrounding whitespace 2. Replace runs of non-alphanumeric characters with a single underscore 3. Strip leading and trailing underscores — **except** an explicit leading `_` in the original name, which is preserved as a single `_` 4. Prefix `n` if the result starts with a digit | Original Column | Normalized Name | Notes | |-------------------|------------------|----------------------------------------| | `Employee_ID` | `employee_id` | | | `FULL NAME` | `full_name` | | | `Start Year` | `start_year` | | | `Contact Email!` | `contact_email` | trailing `!` stripped | | `#Term Code` | `term_code` | leading `#` stripped (not an explicit `_`) | | `$Secret_Code!` | `secret_code` | leading `$` stripped | | `_internal_id` | `_internal_id` | explicit leading `_` preserved | | `__version__` | `_version` | explicit `_` preserved, extras stripped | | `2025_sales` | `n2025_sales` | digit prefix gets `n` | **Access rules:** - **Normalized names** work with both attribute access (`row.employee_id`) and dict access (`row['employee_id']`) - **Original names** only work with dict access (`row['FULL NAME']`) — they may contain characters invalid for Python attributes ```python cursor.execute('SELECT Employee_ID, "FULL NAME" FROM users') for row in cursor: # Normalized - works both ways row.employee_id # Attribute access row['employee_id'] # Dict access # Original - dict only row['FULL NAME'] # Works row.FULL NAME # SyntaxError! ``` ### Why This Matters When defining Table field mappings, use normalized names and your code works regardless of source column casing: ```python # This Table config works whether the CSV has # "Employee ID", "employee_id", or "EMPLOYEE_ID" employee_table = Table('employees', { 'emp_id': {'field': 'employee_id', 'key': True}, 'full_name': {'field': 'full_name'}, 'hire_date': {'field': 'start_date', 'fn': 'date'}, }, cursor=cursor) ``` You write the mapping once. DBTK handles the column name variations. ### Duplicate Column Names If normalization produces duplicate names, Record appends `_2`, `_3`, etc.: ```python # Columns: "Status", "STATUS", "status!" # Normalized: "status", "status_2", "status_3" ``` ### Reserved Name Collisions Record has its own methods and attributes (`get`, `keys`, `values`, `items`, `update`, `pop`, `copy`, etc.). If a database column normalizes to one of these names it would shadow the method, making the Record unusable. To prevent this, `set_fields()` checks normalized names against a reserved set and renames any collision the same way it handles duplicates — appending `_2`, `_3`, etc. — and logs a warning so the rename is visible: ```python # Column named "values" would shadow Record.values() # → normalized to "values_2", warning logged: # WARNING: Attribute access for 'values' was renamed to 'values_2' due to # collision with existing Record method/attribute. row['values'] # still works via original name row.values_2 # attribute access uses the renamed form row.values() # the method is still intact ``` The same applies to internal attributes like `_fields`, `_added`, and classmethods like `set_fields`. ### Accessing Both Forms ```python # Get original column names columns = row.keys() # ['Employee_ID', 'FULL NAME'] # Get normalized column names columns = row.keys(normalized=True) # ['employee_id', 'full_name'] # Convert to dict with normalized keys data = row.to_dict(normalized=True) # {'employee_id': 123, 'full_name': 'Aang'} ``` ## Modifying Records Records are mutable - you can add, update, and delete fields during transformation. ### Update Existing Fields ```python # By key row['status'] = 'active' # By attribute row.status = 'active' # Multiple updates row.update({'status': 'active', 'modified': datetime.now()}) row.update(name='Aang', age=130) # Coalesce - only fills in None values, leaves existing values alone row.coalesce({'phone': 'unlisted', 'status': 'active'}) ``` ### Add New Fields ```python # Add computed fields row['full_name'] = f"{row.first_name} {row.last_name}" # Add via attribute row.processed_at = datetime.now() ``` ### Delete Fields ```python # Remove a field del row['temp_field'] # Or use pop to get the value old_status = row.pop('old_status', None) ``` ## Conversion Methods ```python # Convert to tuple or list of values values = tuple(row) values = list(row) # Convert to dict (standard Python idiom) data = dict(row) # Use to_dict() when you need normalized keys data = row.to_dict(normalized=True) ``` ## Debugging ### Pretty Print ```python row.pprint() # Employee_ID : 12345 # FULL NAME : Avatar Aang # Start Date : 2024-01-15 # Status : active # With normalized names row.pprint(normalized=True) # employee_id : 12345 # full_name : Avatar Aang # start_date : 2024-01-15 # status : active ``` `FixedWidthRecord.pprint()` accepts an additional `add_comments=True` option that appends each column's comment (from the `FixedColumn` definition) inline: ```python row.pprint(add_comments=True) # record_type_code : 6 # Entry Detail # routing_number : 021000021 # 9-digit ABA # amount : 0000015000 # Cents, no decimal ``` Columns without a comment are left blank in that position. ### String Representation ```python print(row) # Record('employee_id': 12345, 'full_name': 'Avatar Aang', ...) print(repr(row)) # Record(12345, 'Avatar Aang', datetime.date(2024, 1, 15), 'active') ``` ## Performance Note Record achieves tuple-like speed through a subclass-per-query pattern. Each time a cursor executes a query, DBTK dynamically creates a Record subclass with those column names as class attributes. Every row returned from that query shares the subclass - column names exist once in memory, not on every row. ```python cursor.execute("SELECT id, name, email FROM users") # DBTK creates a Record subclass with _fields = ['id', 'name', 'email'] # All 10,000 rows share this class, each storing only [id_val, name_val, email_val] cursor.execute("SELECT order_id, total, status FROM orders") # New query = new subclass with different fields # Previous subclass is eligible for garbage collection ``` This is why Record can offer dict-like convenience without dict-like memory overhead when processing millions of rows. The same pattern applies to file readers - each reader creates a Record subclass based on the file's header row. ## FixedWidthRecord `FixedWidthRecord` is a `Record` subclass returned by `FixedReader` and `EDIReader`. It carries the column layout alongside the values, enabling exact reconstruction of the original fixed-width line via `to_line()`. ### `to_line(truncate_overflow=False)` Reconstructs the original fixed-width line from the record's current values. ```python with readers.FixedReader(open('claims.txt'), columns) as reader: for claim in reader: original_line = claim.to_line() # reproduces the source line exactly ``` **How it works:** `to_line()` allocates a space-filled buffer as wide as the longest column's end position, then splices each formatted value into its exact byte range using the column's `start_pos`. Because placement is position-based rather than sequential: - Columns defined in any order are placed at the correct positions - Gaps between columns (bytes not covered by any column definition) stay as spaces - The output width is determined by the rightmost column end position, not the number of columns **Alignment and padding** are applied per-column before splicing. Defaults follow the column's `column_type`: | `column_type` | Default alignment | Default pad | |--------------|-------------------|-------------| | `text`, `date`, `datetime` | left | space | | `int`, `float` | right | `'0'` | Override with explicit `align=` and `pad_char=` on `FixedColumn`. Both must be set together when overriding numeric columns — setting `align` alone does not change the pad character: ```python # Wrong: left-aligned but still zero-padded → '4200000000' FixedColumn('amount', 1, 10, 'int', align='left') # Correct: left-aligned and space-padded → '42 ' FixedColumn('amount', 1, 10, 'int', align='left', pad_char=' ') ``` **Overflow handling:** ```python # Default: raises ValueError naming the offending field record.to_line() # Silently truncate values that exceed column width record.to_line(truncate_overflow=True) ``` ### Modifying Records Before Writing Because values are spliced by position, only the modified fields change — the rest of the line is reproduced exactly: ```python with readers.FixedReader(open('payments.txt'), columns) as reader: with open('payments_updated.txt', 'w') as out: for record in reader: if record.status == 'P': record['status'] = 'C' # only this field changes out.write(record.to_line() + '\n') ``` ### ACH File Round-Trip Example ```python from dbtk.readers.fixed_width import EDIReader from dbtk.formats.edi import ACH_COLUMNS with EDIReader(open('payroll.ach'), ACH_COLUMNS) as reader: with open('payroll_modified.ach', 'w') as out: for record in reader: if record.record_type_code == '6': record['individual_name'] = record.individual_name.upper() out.write(record.to_line() + '\n') ```