Record Objects

The Record class is DBTK’s universal data structure. Every cursor query and file reader returns Records, and every *Surge class expects them as input. Understanding Record unlocks the full power of DBTK’s pipeline architecture.

Why Record?

ETL pipelines process hundreds of thousands or millions of rows, all with the same columns. This creates a design tension:

Dicts give you flexible key-based access, but store keys on every instance. A million rows means a million copies of the same column names.
Tuples are memory-efficient, but force you to remember column positions. row[7] is fragile and unreadable.
Named tuples are immutable - you can’t add or modify fields during transformation.

Record solves this by storing column names once on a shared class while each instance holds only its values. You get dict-like flexibility with tuple-like memory efficiency.

# All of these work on the same Record object:
row['name']              # Dict-style access
row.name                 # Attribute access
row[1]                   # Index access
row[1:3]                 # Slicing
name, email = row[1:3]   # Unpack a slice
row.get('phone', 'N/A')  # Safe access with default
dict(row)                # Convert to dict when needed
'\t'.join(str(v) for v in row.values())  # Join values

Access Patterns

Dict-Style Access

# Key access
user_name = row['name']

# Safe access with default
phone = row.get('phone', 'unlisted')

# Check if column exists
if 'email' in row:
    send_notification(row['email'])

# Iterate over keys
for column in row.keys():
    print(column)

# Iterate over key-value pairs
for column, value in row.items():
    print(f"{column}: {value}")

Attribute Access

# Clean, readable syntax
print(f"Welcome, {row.name}!")

# Works with normalized names (see below)
row.employee_id  # Works even if column is "Employee_ID"

Index and Slice Access

# Positional access
first_col = row[0]
last_col = row[-1]

# Slicing
first_three = row[:3]
middle = row[2:5]

# Unpack specific positions
id, name, email = row[:3]

Tuple Unpacking

# Unpack entire row
id, name, email, created = row

# Unpack in loop
for id, name, status in cursor:
    process(id, name, status)

Normalized Field Names

Record stores both original column names and normalized versions. Normalization exists because Python attribute syntax doesn’t allow row.FULL NAME or row.Employee-ID - only valid Python identifiers work as attributes.

Normalization applies the following rules in order:

Lowercase and strip surrounding whitespace
Replace runs of non-alphanumeric characters with a single underscore
Strip leading and trailing underscores — except an explicit leading _ in the original name, which is preserved as a single _
Prefix n if the result starts with a digit

Original Column	Normalized Name	Notes
`Employee_ID`	`employee_id`
`FULL NAME`	`full_name`
`Start Year`	`start_year`
`Contact Email!`	`contact_email`	trailing `!` stripped
`#Term Code`	`term_code`	leading `#` stripped (not an explicit `_`)
`$Secret_Code!`	`secret_code`	leading `$` stripped
`_internal_id`	`_internal_id`	explicit leading `_` preserved
`__version__`	`_version`	explicit `_` preserved, extras stripped
`2025_sales`	`n2025_sales`	digit prefix gets `n`

Access rules:

Normalized names work with both attribute access (row.employee_id) and dict access (row['employee_id'])
Original names only work with dict access (row['FULL NAME']) — they may contain characters invalid for Python attributes

cursor.execute('SELECT Employee_ID, "FULL NAME" FROM users')
for row in cursor:
    # Normalized - works both ways
    row.employee_id       # Attribute access
    row['employee_id']    # Dict access

    # Original - dict only
    row['FULL NAME']      # Works
    row.FULL NAME         # SyntaxError!

Why This Matters

When defining Table field mappings, use normalized names and your code works regardless of source column casing:

# This Table config works whether the CSV has
# "Employee ID", "employee_id", or "EMPLOYEE_ID"
employee_table = Table('employees', {
    'emp_id': {'field': 'employee_id', 'key': True},
    'full_name': {'field': 'full_name'},
    'hire_date': {'field': 'start_date', 'fn': 'date'},
}, cursor=cursor)

You write the mapping once. DBTK handles the column name variations.

Duplicate Column Names

If normalization produces duplicate names, Record appends _2, _3, etc.:

# Columns: "Status", "STATUS", "status!"
# Normalized: "status", "status_2", "status_3"

Reserved Name Collisions

Record has its own methods and attributes (get, keys, values, items, update, pop, copy, etc.). If a database column normalizes to one of these names it would shadow the method, making the Record unusable. To prevent this, set_fields() checks normalized names against a reserved set and renames any collision the same way it handles duplicates — appending _2, _3, etc. — and logs a warning so the rename is visible:

# Column named "values" would shadow Record.values()
# → normalized to "values_2", warning logged:
# WARNING: Attribute access for 'values' was renamed to 'values_2' due to
#          collision with existing Record method/attribute.
row['values']    # still works via original name
row.values_2     # attribute access uses the renamed form
row.values()     # the method is still intact

The same applies to internal attributes like _fields, _added, and classmethods like set_fields.

Accessing Both Forms

# Get original column names
columns = row.keys()                    # ['Employee_ID', 'FULL NAME']

# Get normalized column names
columns = row.keys(normalized=True)     # ['employee_id', 'full_name']

# Convert to dict with normalized keys
data = row.to_dict(normalized=True)     # {'employee_id': 123, 'full_name': 'Aang'}

Modifying Records

Records are mutable - you can add, update, and delete fields during transformation.

Update Existing Fields

# By key
row['status'] = 'active'

# By attribute
row.status = 'active'

# Multiple updates
row.update({'status': 'active', 'modified': datetime.now()})
row.update(name='Aang', age=130)

# Coalesce - only fills in None values, leaves existing values alone
row.coalesce({'phone': 'unlisted', 'status': 'active'})

Add New Fields

# Add computed fields
row['full_name'] = f"{row.first_name} {row.last_name}"

# Add via attribute
row.processed_at = datetime.now()

Delete Fields

# Remove a field
del row['temp_field']

# Or use pop to get the value
old_status = row.pop('old_status', None)

Conversion Methods

# Convert to tuple or list of values
values = tuple(row)
values = list(row)

# Convert to dict (standard Python idiom)
data = dict(row)

# Use to_dict() when you need normalized keys
data = row.to_dict(normalized=True)

Debugging

Pretty Print

row.pprint()
# Employee_ID : 12345
# FULL NAME   : Avatar Aang
# Start Date  : 2024-01-15
# Status      : active

# With normalized names
row.pprint(normalized=True)
# employee_id : 12345
# full_name   : Avatar Aang
# start_date  : 2024-01-15
# status      : active

FixedWidthRecord.pprint() accepts an additional add_comments=True option that appends each column’s comment (from the FixedColumn definition) inline:

row.pprint(add_comments=True)
# record_type_code : 6          # Entry Detail
# routing_number   : 021000021  # 9-digit ABA
# amount           : 0000015000 # Cents, no decimal

Columns without a comment are left blank in that position.

String Representation

print(row)
# Record('employee_id': 12345, 'full_name': 'Avatar Aang', ...)

print(repr(row))
# Record(12345, 'Avatar Aang', datetime.date(2024, 1, 15), 'active')

Performance Note

Record achieves tuple-like speed through a subclass-per-query pattern. Each time a cursor executes a query, DBTK dynamically creates a Record subclass with those column names as class attributes. Every row returned from that query shares the subclass - column names exist once in memory, not on every row.

cursor.execute("SELECT id, name, email FROM users")
# DBTK creates a Record subclass with _fields = ['id', 'name', 'email']
# All 10,000 rows share this class, each storing only [id_val, name_val, email_val]

cursor.execute("SELECT order_id, total, status FROM orders")
# New query = new subclass with different fields
# Previous subclass is eligible for garbage collection

This is why Record can offer dict-like convenience without dict-like memory overhead when processing millions of rows. The same pattern applies to file readers - each reader creates a Record subclass based on the file’s header row.

Working with Streaming Records

Cursors and file readers hand you records one at a time, and every DBTK writer accepts that same stream directly — most data never needs to live in a list at all. If you need to drop a column, keep only a few, or rename headers before writing, materializing the whole result set into a list just to reshape it throws away that advantage. For a result set with a few thousand rows it won’t matter; for a multi-million-row export, it’s the difference between constant memory usage and a program that runs out of it. dbtk.writers provides four small generator functions that reshape a record stream one row at a time, so filtering, reordering, or renaming columns costs nothing beyond what you’re already paying to read and write the data.

Each does one job, and they compose freely.

select_columns — keep some columns, in a given order

from dbtk.writers import select_columns

cursor.execute("SELECT * FROM users")
writers.to_excel(select_columns(cursor, ['name', 'email', 'signup_date']), 'report.xlsx')

col_names is an allow-list; only those columns appear in the output, in the order given.

exclude_columns — drop some columns, keep the rest

from dbtk.writers import exclude_columns

writers.to_csv(exclude_columns(cursor, ['ssn', 'password_hash']), 'users.csv')

col_names is a block-list — a list, tuple, or set all work, since order never matters for exclusion. Everything else passes through in its original order.

rename_columns — relabel some columns, leave the rest alone

from dbtk.writers import rename_columns

writers.to_csv(rename_columns(cursor, {'signup_date': 'Joined On'}), 'customers.csv')

mapping is a dict of source name -> new name. Only the columns you list are relabeled; everything else keeps its original name and position — this is a pure rename, not a select. A falsy value ('' or None) is a no-op for that column, so you can build the mapping programmatically without filtering out unchanged entries.

Composing them

select_columns, exclude_columns, and rename_columns all accept and return the same kind of self-describing rows, so you can chain them:

writers.to_csv(
    rename_columns(exclude_columns(cursor, ['ssn']), {'signup_date': 'Joined On'}),
    'customers.csv',
)

tuples_to_records — attach column names to positional data

select_columns, exclude_columns, and rename_columns all expect self-describing rows — cursor results, Record, dict, or namedtuple — and infer column names from the first row. Plain tuples/lists carry no names of their own, so attach them first:

from dbtk.record import tuples_to_records
# also available as: from dbtk.writers import tuples_to_records

records = tuples_to_records(rows, ['id', 'name', 'email'])

Pass a falsy entry (None or '') at any position to drop that column entirely instead of naming it — handy for filler/padding data you never want to carry forward:

records = tuples_to_records(rows, ['id', 'name', None, 'email'])  # 3rd position dropped
writers.to_csv(select_columns(records, ['name', 'email']), 'users.csv')

All four functions build one Record subclass up front, using the same shared-class pattern described in Performance Note above, and reuse it for every row — so reshaping a stream stays cheap no matter how many rows pass through it.

FixedWidthRecord

FixedWidthRecord is a Record subclass returned by FixedReader and EDIReader. It carries the column layout alongside the values, enabling exact reconstruction of the original fixed-width line via to_line().

`to_line(truncate_overflow=False)`

Reconstructs the original fixed-width line from the record’s current values.

with readers.FixedReader(open('claims.txt'), columns) as reader:
    for claim in reader:
        original_line = claim.to_line()   # reproduces the source line exactly

How it works: to_line() allocates a space-filled buffer as wide as the longest column’s end position, then splices each formatted value into its exact byte range using the column’s start_pos. Because placement is position-based rather than sequential:

Columns defined in any order are placed at the correct positions
Gaps between columns (bytes not covered by any column definition) stay as spaces
The output width is determined by the rightmost column end position, not the number of columns

Alignment and padding are applied per-column before splicing. Defaults follow the column’s column_type:

`column_type`	Default alignment	Default pad
`text`, `date`, `datetime`	left	space
`int`, `float`	right	`'0'`

Override with explicit align= and pad_char= on FixedColumn. Both must be set together when overriding numeric columns — setting align alone does not change the pad character:

# Wrong: left-aligned but still zero-padded → '4200000000'
FixedColumn('amount', 1, 10, 'int', align='left')

# Correct: left-aligned and space-padded → '42        '
FixedColumn('amount', 1, 10, 'int', align='left', pad_char=' ')

Overflow handling:

# Default: raises ValueError naming the offending field
record.to_line()

# Silently truncate values that exceed column width
record.to_line(truncate_overflow=True)

Modifying Records Before Writing

Because values are spliced by position, only the modified fields change — the rest of the line is reproduced exactly:

with readers.FixedReader(open('payments.txt'), columns) as reader:
    with open('payments_updated.txt', 'w') as out:
        for record in reader:
            if record.status == 'P':
                record['status'] = 'C'      # only this field changes
            out.write(record.to_line() + '\n')

ACH File Round-Trip Example

from dbtk.readers.fixed_width import EDIReader
from dbtk.formats.edi import ACH_COLUMNS

with EDIReader(open('payroll.ach'), ACH_COLUMNS) as reader:
    with open('payroll_modified.ach', 'w') as out:
        for record in reader:
            if record.record_type_code == '6':
                record['individual_name'] = record.individual_name.upper()
            out.write(record.to_line() + '\n')