Hey everyone, I'm Bijan from Gemma Analytics. We're a data consultancy, which means we build and maintain data pipelines for a living. I want to walk you through how our data loading stack changed over the last five or six years, what broke along the way, and where we're headed next.
Quick context. We're Gemma Analytics, about 20 people in Berlin. We do data engineering, analytics, and AI. 70-something projects since 2020. We don't just build pipelines — we create end-to-end data infrastructure for many clients in parallel. Historically that's meant the modern data stack: Airflow for orchestration, dbt for transformation, cloud data warehouses, and various visualization tools. So our tooling has to scale across teams, not just within one project.
Here's the map. Four iterations, four questions. Bespoke was "how do we stop reinventing the wheel every time?" EWAH was "how do we standardize?" The hybrid was "what if dlt handles the hard part?" And standalone was "why are we still inside Airflow at all?" Each one solved something real and created new problems. I'll walk through each.
Before we had any framework, every client got a hand-built Airflow setup. New project? Copy-paste from the last one. Even loading was custom — manual INSERT statements, type mapping, all from scratch. If we fixed something for one client, it never made it to the others. It was slow, unreliable, and a maintenance black hole. So we built a framework.
EWAH solved the copy-paste problem. One framework, 30+ connectors, YAML config, containerized and versioned as a public Docker image. Ran in production for years. But centralizing created new problems. Handing it over to client teams or even other Gemma engineers was really hard because of the complexity. Any bug fix meant a new version for everyone. The codebase bloated. Airflow upgrades took weeks. So we decided: move to dlt for loading, and use a best practice repo that clients can fork and own.
No more centralized management of images or code. We built a custom Airflow provider where source hooks yield dicts and dlt handles all of the loading. Switching destinations became a config change. But the custom provider was very complex, and you still couldn't run anything without Airflow. We were still hand-writing pagination for every API, and everything lived in one deployment. So our answer: standalone containerized dlt connectors that you can optionally orchestrate via Airflow.
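To make the core pattern concrete, here's a minimal sketch in plain dlt rather than our actual provider code: a source function yields plain dicts, and dlt takes care of schema inference, typing, and loading. The function, pipeline, and dataset names here are made up for illustration.

```python
import dlt

def fetch_orders():
    # Stand-in for a source hook: any iterable of dicts works.
    for order_id in range(3):
        yield {"order_id": order_id, "status": "paid"}

pipeline = dlt.pipeline(
    pipeline_name="shopify_orders",
    destination="duckdb",   # swapping this string is the "config change"
    dataset_name="raw_shopify",
)
load_info = pipeline.run(fetch_orders(), table_name="orders")
print(load_info)
```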
Here's the repo. Each connector is its own directory with its own pyproject.toml, Dockerfile, and dependencies. The Shopify connector doesn't know the Facebook connector exists. You can version, deploy, and scale them independently.
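As a rough illustration (the connector and file names are invented, not the actual repo contents), the layout is something like:

```
dlt-connectors/
├── shopify/
│   ├── pyproject.toml
│   ├── Dockerfile
│   └── pipeline.py
└── facebook_ads/
    ├── pyproject.toml
    ├── Dockerfile
    └── pipeline.py
```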
Run it locally with DuckDB during development. Point it at Postgres with an env file. Build a Docker image and run it in production. Orchestrate it with whatever you already have. The connector doesn't know or care.
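A hedged sketch of what that looks like (the DLT_DESTINATION variable name is an assumption, not our exact setup): the connector picks its destination from the environment, so the same code runs against a local DuckDB file on a laptop and against Postgres in production.

```python
import os
import dlt

def orders():
    # Placeholder source; a real connector would paginate an API here.
    yield {"order_id": 1, "status": "paid"}

# DLT_DESTINATION is an illustrative variable read by this script;
# "duckdb" and "postgres" are real dlt destination names.
destination = os.getenv("DLT_DESTINATION", "duckdb")

pipeline = dlt.pipeline(
    pipeline_name="orders",
    destination=destination,
    dataset_name="raw",
)
print(pipeline.run(orders(), table_name="orders"))
```

Run it bare and you get a local DuckDB file; in the Docker image, the env file sets the destination to postgres plus the Postgres credentials, which dlt can pick up from its own env vars (for example DESTINATION__POSTGRES__CREDENTIALS).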
A few things this unlocked. We can use any connector from the dlt ecosystem out of the box. The dev flow is simple — develop and test locally without spinning up Airflow. New people can learn from dlt's community. And we can migrate one connector at a time. On the flip side, credentials moved from Airflow's nice UI to TOML files pasted into variables or hidden env vars. We now manage Docker registries and image builds. And without a central repo, each connector needs its own maintenance. These are real tradeoffs. We decided the decoupling was worth it.
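On the credentials point, here's a hedged sketch of how that looks now (the source and field names are made up): a connector declares the secret it needs, and dlt resolves it from a local secrets.toml in development or from env vars in deployment.

```python
import dlt

@dlt.source
def shopify_source(api_key: str = dlt.secrets.value):
    # dlt injects api_key from .dlt/secrets.toml locally, or from
    # environment variables in production (naming follows dlt's config sections).
    @dlt.resource
    def shop_info():
        yield {"shop": "example", "api_key_present": bool(api_key)}

    return shop_info
```

The connector code stays the same either way; only where the secret lives changes.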
We have 30+ EWAH connectors still to migrate. Doing it by hand doesn't scale. So we use coding agents. You give the agent the EWAH code, point it at dlt patterns, and it produces a working connector. The bottleneck shifted from writing the code to verifying the output. That's what we're solving now.
Where are we going? We're building two best practice repos as building blocks. dlt-connectors defines how we build connectors. airflow3 defines how we deploy Airflow — independently, but connectors plug right in. The exciting part: with AI, specifically Claude Code, you give it the context of both repos, tell it to build, and you get a working data infrastructure in about an hour. That's the speed-up we're after.
Five things. One: separating extract from load was the best thing we did, not because of the separation itself, but because it unlocked a decoupled architecture. Two: dlt's open-source-first approach is what made the move away from EWAH possible. Three: decoupled architecture is simpler for everyone. Four: open-source tools like dlt are perfectly positioned for the AI era — AI agents supercharge them, while SaaS tools risk being replaced entirely. And five: what takes a teammate hours might take an AI agent minutes, as long as you give it the right context. That's why simplicity and good documentation compound.
Here's the thing: dlt already makes spinning up data loading trivially easy. And with AI tools like Claude Code, deploying more complex infrastructure is faster than ever. The maintenance problem of many independent repos? AI agents can propagate updates across all of them automatically. But there are open questions. How do you build true self-service analytics with LLM bots? And what happens to the transformation layer as AI gets more capable? These are the questions we're exploring.
That's it. Happy to talk about any of the four approaches, the migration, dlt specifics, the agent workflow, or how this works in a consultancy context. Thanks for listening.