Advanced Features & Performance
Multiple Configuration Locations
DBTK searches for configuration files in this order:
Explicitly set path via
dbtk.set_config_file('path/to/config.yml')./dbtk.ymlor./dbtk.yaml(project-specific)~/.config/dbtk.ymlor~/.config/dbtk.yaml(user-specific)
This lets you maintain per-project configurations while having a fallback for personal databases. If no config is found, a sample is created at ~/.config/dbtk_sample.yml.
Custom Driver Registration
If you’re using a database driver not built into DBTK, you can register it:
from dbtk.database import register_user_drivers
custom_drivers = {
'my_postgres_fork': {
'database_type': 'postgres',
'priority': 10,
'param_map': {'database': 'dbname'},
'required_params': [{'host', 'database', 'user'}],
'optional_params': {'port', 'password'},
'connection_method': 'kwargs',
'default_port': 5432
}
}
register_user_drivers(custom_drivers)
Performance Tips
Use appropriate batch sizes - Larger batches are faster but use more memory:
import dbtk db = dbtk.connect('fire_nation_archive') cur = db.cursor() ... table = dbtk.etl.Table('intel', intel_cols, cursor=cur) bulk_writer = dbtk.etl.DataSurge(table, batch_size=5000) # Tune based on your data bulk_writer.insert(reader)
Materialize results when needed - Don’t fetch twice:
data = cursor.fetchall() # Fetch once dbtk.writers.to_csv(data, 'output.csv') dbtk.writers.to_excel(data, 'output.xlsx')
Use transactions for bulk operations - Commit once for many inserts:
with db.transaction(): for record in records: table.set_values(record) table.execute('insert')
Use DataSurge for bulk operations - Much faster than row-by-row:
bulk_writer = DataSurge(table) bulk_writer.insert(records)
Use prepared statements for repeated queries - Read and parse SQL once:
stmt = cursor.prepare_file('query.sql') for params in parameter_sets: stmt.execute(params)
Let the database do the work - Use
db_exprin Table definitions to leverage database functions instead of processing in Python.
IdentityManager & ValidationCollector
For detailed documentation on identity resolution, validation, and logging tools for production ETL pipelines, see ETL: Tools & Logging.
IdentityManager - Resolves source-system keys to target-system identifiers with caching, status tracking, and state persistence. Essential for multi-stage imports and CRM/ERP integrations.
ValidationCollector - Collects and validates coded values during processing, with optional lookup enrichment.
See Also
ETL: Tools & Logging - IdentityManager, ValidationCollector, and integration logging
Configuration & Security - Custom driver registration, config file locations
ETL: Table & Transforms - Using db_expr for database-side processing
ETL: DataSurge & BulkSurge - Performance tuning for bulk operations