# Advanced Features & Performance

## Multiple Configuration Locations

DBTK searches for configuration files in this order:

1. Explicitly set path via `dbtk.set_config_file('path/to/config.yml')`
2. `./dbtk.yml` or `./dbtk.yaml` (project-specific)
3. `~/.config/dbtk.yml` or `~/.config/dbtk.yaml` (user-specific)

This lets you maintain per-project configurations while having a fallback for personal databases.

If no config is found, a sample is created at `~/.config/dbtk_sample.yml`.

## Custom Driver Registration

If you're using a database driver not built into DBTK, you can register it:

```python
from dbtk.database import register_user_drivers

custom_drivers = {
    'my_postgres_fork': {
        'database_type': 'postgres',
        'priority': 10,
        'param_map': {'database': 'dbname'},
        'required_params': [{'host', 'database', 'user'}],
        'optional_params': {'port', 'password'},
        'connection_method': 'kwargs',
        'default_port': 5432
    }
}

register_user_drivers(custom_drivers)
```

## Performance Tips

1. **Use appropriate batch sizes** - Larger batches are faster but use more memory:

   ```python
   import dbtk

   db = dbtk.connect('fire_nation_archive')
   cur = db.cursor()
   ...
   table = dbtk.etl.Table('intel', intel_cols, cursor=cur)
   bulk_writer = dbtk.etl.DataSurge(table, batch_size=5000)  # Tune based on your data
   bulk_writer.insert(reader)
   ```

2. **Materialize results when needed** - Don't fetch twice:

   ```python
   data = cursor.fetchall()  # Fetch once
   dbtk.writers.to_csv(data, 'output.csv')
   dbtk.writers.to_excel(data, 'output.xlsx')
   ```

3. **Use transactions for bulk operations** - Commit once for many inserts:

   ```python
   with db.transaction():
       for record in records:
           table.set_values(record)
           table.execute('insert')
   ```

4. **Use DataSurge for bulk operations** - Much faster than row-by-row:

   ```python
   bulk_writer = dbtk.etl.DataSurge(table)
   bulk_writer.insert(records)
   ```

5. **Use prepared statements for repeated queries** - Read and parse SQL once:

   ```python
   stmt = cursor.prepare_file('query.sql')
   for params in parameter_sets:
       stmt.execute(params)
   ```

6. **Let the database do the work** - Use `db_expr` in Table definitions to leverage database functions instead of processing in Python.

## IdentityManager & ValidationCollector

For detailed documentation on identity resolution, validation, and logging tools for production ETL pipelines, see [ETL: Tools & Logging](09-etl-tools.md).

**IdentityManager** - Resolves source-system keys to target-system identifiers with caching, status tracking, and state persistence. Essential for multi-stage imports and CRM/ERP integrations.

**ValidationCollector** - Collects and validates coded values during processing, with optional lookup enrichment.

## See Also

- [ETL: Tools & Logging](09-etl-tools.md) - IdentityManager, ValidationCollector, and integration logging
- [Configuration & Security](02-configuration.md) - Custom driver registration, config file locations
- [ETL: Table & Transforms](07-table.md) - Using db_expr for database-side processing
- [ETL: DataSurge & BulkSurge](08-datasurge.md) - Performance tuning for bulk operations
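
As a supplementary illustration, the search order described under Multiple Configuration Locations can be sketched in plain Python. This is not DBTK's implementation: the function `find_config` and its `explicit`, `cwd`, and `home` parameters are hypothetical names chosen for this sketch, and only the candidate paths come from the documented list. It also assumes that a missing explicit path falls through to the next location, which the real library may handle differently.

```python
from pathlib import Path
from typing import List, Optional


def find_config(explicit: Optional[str] = None,
                cwd: Optional[Path] = None,
                home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing config file in the documented order, or None.

    Hypothetical helper for illustration; not part of the dbtk API.
    """
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)

    candidates: List[Path] = []
    if explicit is not None:
        # 1. Explicitly set path (assumption: if it does not exist,
        #    the search simply continues with the next location).
        candidates.append(Path(explicit))
    candidates += [
        cwd / "dbtk.yml",                # 2. project-specific
        cwd / "dbtk.yaml",
        home / ".config" / "dbtk.yml",   # 3. user-specific
        home / ".config" / "dbtk.yaml",
    ]

    # The first existing file wins, so earlier locations shadow later ones.
    for path in candidates:
        if path.is_file():
            return path
    return None
```

Passing `cwd` and `home` explicitly keeps the sketch easy to exercise against temporary directories. Called with no arguments, it checks the current working directory and the real home directory.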