Advanced Features & Performance

Multiple Configuration Locations

DBTK searches for configuration files in this order:

Explicitly set path via dbtk.set_config_file('path/to/config.yml')
./dbtk.yml or ./dbtk.yaml (project-specific)
~/.config/dbtk.yml or ~/.config/dbtk.yaml (user-specific)

This lets you maintain per-project configurations while having a fallback for personal databases. If no config is found, a sample is created at ~/.config/dbtk_sample.yml.

Custom Driver Registration

If you’re using a database driver not built into DBTK, you can register it:

from dbtk.database import register_user_drivers

custom_drivers = {
    'my_postgres_fork': {
        'database_type': 'postgres',
        'priority': 10,
        'param_map': {'database': 'dbname'},
        'required_params': [{'host', 'database', 'user'}],
        'optional_params': {'port', 'password'},
        'connection_method': 'kwargs',
        'default_port': 5432
    }
}

register_user_drivers(custom_drivers)

Performance Tips

Use appropriate batch sizes - Larger batches are faster but use more memory:

 import dbtk
 db = dbtk.connect('fire_nation_archive')
 cur = db.cursor()
 ...
 table = dbtk.etl.Table('intel', intel_cols, cursor=cur)
 bulk_writer = dbtk.etl.DataSurge(table, batch_size=5000) # Tune based on your data
 bulk_writer.insert(reader)

Materialize results when needed - Don’t fetch twice:

data = cursor.fetchall()  # Fetch once
dbtk.writers.to_csv(data, 'output.csv')
dbtk.writers.to_excel(data, 'output.xlsx')

Use transactions for bulk operations - Commit once for many inserts:

with db.transaction():
    for record in records:
        table.set_values(record)
        table.execute('insert')

Use DataSurge for bulk operations - Much faster than row-by-row:

bulk_writer = DataSurge(table)
bulk_writer.insert(records)

Use prepared statements for repeated queries - Read and parse SQL once:

stmt = cursor.prepare_file('query.sql')
for params in parameter_sets:
    stmt.execute(params)

Let the database do the work - Use db_expr in Table definitions to leverage database functions instead of processing in Python.

IdentityManager & ValidationCollector

For detailed documentation on identity resolution, validation, and logging tools for production ETL pipelines, see ETL: Tools & Logging.

IdentityManager - Resolves source-system keys to target-system identifiers with caching, status tracking, and state persistence. Essential for multi-stage imports and CRM/ERP integrations.

ValidationCollector - Collects and validates coded values during processing, with optional lookup enrichment.

Advanced Features & Performance

Multiple Configuration Locations

Custom Driver Registration

Performance Tips

IdentityManager & ValidationCollector

See Also