DBTK - Data Benders ToolKit

Like the elemental benders of Avatar, this library gives you precise control over data, the world’s most rapidly growing element. Extract data from various sources, transform it through powerful operations, and load it exactly where it needs to go. This library is designed by and for data integrators.

Why DBTK?

Most data integration tools make you run infrastructure.

Airbyte, Boomi, Pentaho — powerful tools, but you’re deploying Docker containers, configuring connectors through a UI, and maintaining a platform just to move data between systems. A slow job needs scheduling, monitoring, failure recovery, a maintenance window. When something breaks, you’re debugging someone else’s abstraction layer.

The alternative — SQLAlchemy + pandas + custom batch loops — means writing the same boilerplate across every interface, slowly accreting defensive code for every edge case. And when you finally write the database load, you’re still doing row-at-a-time inserts through an ORM that knows nothing about COPY or bcp.

DBTK is a library, not a platform.

pip install dbtk and you have native bulk loading for PostgreSQL, Oracle, MySQL, and SQL Server — in a cron job, a Lambda function, a Jupyter notebook, or embedded in your application. No Docker. No daemon. No server. The pipeline is Python.

It handles the formats everyone else ignores.

Finance, healthcare, logistics, and government move enormous data volumes in fixed-width text and EDI formats. The modern Python data stack has largely ignored these because they’re unglamorous. DBTK handles them with the same clean API as CSV and JSON — because real integration work lives there.

DBTK threads the needle between power and control.

It’s declarative enough to eliminate repetition, explicit enough that you stay in control. When something breaks, you know exactly where to look.

Perfect for:

Data integration and ELT jobs
Record-level transformations (cleaning, parsing, validation)
Simple CRUD operations with multiple database types
Moving data between systems

Not designed for:

Heavy in-memory data transformations (use Pandas/polars)
Complex aggregations (leverage your database’s strengths)
Data analysis and statistics

Features:

Universal Database Connectivity - Unified interface across PostgreSQL, Oracle, MySQL, SQL Server, and SQLite with intelligent driver auto-detection
Portable SQL Queries - Write SQL once with named parameters, runs on any database regardless of parameter style
Smart Cursors - All cursors and readers return Record objects with the speed and efficiency of tuples and the flexibility of dicts
Flexible File Reading - CSV, Excel (XLS/XLSX), JSON, NDJSON, XML, DataFrame and fixed-width text files with consistent API
Transparent Compression - Automatic decompression of .gz, .bz2, .xz, and .zip files with smart member selection
Multiple Export Formats - Write to CSV, Excel, JSON, NDJSON, XML, fixed-width text, or directly between databases
Advanced ETL Framework - Full-featured Table class for complex data transformations, validations, and merging
Data Transformations - Built-in functions for dates, phones, emails, and custom data cleaning with international support
High-Performance Bulk Operations - DataSurge for blazing-fast batch operations; BulkSurge for even faster direct loading when supported
Integrated Logging - Timestamped log files with automatic cleanup, split error logs, and zero-config setup
Encrypted Configuration - YAML-based config with password encryption and environment variable support

Speed and Memory BulkSurge streaming from polars and doing direct loads to PostgreSQL will process 1M rows in 3-4 seconds. Even with a standard Python CSV reader and numerous column transforms, DataSurge writes 1M rows to every supported database in 5-10 seconds. These aren’t toy benchmarks — they include field mapping, type conversions, validation, and real database constraints. The same pipeline that takes seconds in DBTK can take hours in a drag-and-drop ETL platform.

Philosophy

Modern databases excel at aggregating and transforming data at scale. DBTK embraces this by focusing on what Python does well: flexible record-by-record transformations, connecting disparate systems, and orchestrating data movement.

If you need to pivot, aggregate, or perform complex SQL operations - write SQL and let your database handle it. If you need dataframes and heavy analytics - reach for Pandas or polars. DBTK sits in between: getting your data where it needs to be, cleaned and validated along the way.

Quick Start

import dbtk
from dbtk.etl import Table
from dbtk.etl.transforms import parse_date, email_validate

# Read data, transform records, load to database
with dbtk.connect('my_db') as db:
    cursor = db.cursor()
    table = Table('users', columns={
        'email': {'field': 'email', 'fn': email_validate},
        'signup_date': {'field': 'date', 'fn': parse_date}
    }, cursor=cursor)

    with dbtk.readers.CSVReader(open('users.csv')) as reader:
        for record in reader:
            table.set_values(record)
            # Use raise_error=False to track incomplete records
            table.execute('insert', raise_error=False)

    print(f"Inserted: {table.counts['insert']}")
    print(f"Skipped: {table.counts['incomplete']}")

Contents:

DBTK - Data Benders ToolKit

Why DBTK?

Philosophy

Quick Start

Indices and tables