
Evolution of Data Loading Frameworks at Gemma

Serving Many Clients, Fast

Bijan Soltani

Founder & Managing Director

Berlin — February 17, 2026 | dltHub Community Meetup


Who we are

  • Data & AI consultancy, ~20 people, based in Berlin, founded in 2020
  • 70+ projects across data engineering, analytics engineering, and AI
  • We create end-to-end data infrastructure for multiple clients at the same time
  • Historically focused on the "modern data stack": Airflow, dbt, cloud DWHs, and many different viz tools

Four iterations, four questions

Bespoke (~2019 – 2020)
  • Hand-built Airflow per client
  • Custom extract + load
  • Copy-paste between projects
  • "How do we stop reinventing the wheel?"

EWAH (2020 – 2023)
  • Custom Airflow framework
  • 30+ connectors, YAML config
  • Extract + Load in one monolith
  • "How do we standardize across all clients?"

Hybrid (2023 – 2025)
  • Custom extraction hooks
  • dlt handles all loading
  • Still inside Airflow
  • "What if we just replace the hardest part?"

Standalone (2025 – now)
  • Pure dlt connectors
  • One container per source
  • Orchestrate with anything
  • "What if each connector is just... a container?"

Bespoke: Hand-built pipelines (~2019–2020)

  • Every client got their own hand-built Airflow setup — copy-paste between projects
  • Even loading was custom-built (pre-Airbyte, pre-dlt)
  • Fix a connector for one client? It should have been copied to all others. It never happened.
  • Slow to build, unreliable for clients, terrible to maintain

Our answer: Build a framework. One repo, one set of connectors, all clients.


EWAH: One framework, all clients (2020–2023)

  • DAG Factory (YAML) → Operator → Hook → Uploader
  • 30+ connectors in a single repo
  • Centralized, declarative config — ran in production for years
  • Containerized, versioned public Docker image as base for client infrastructure

Where it broke:

  • Very difficult to hand over to internal engineering teams at clients — or even to other people at Gemma — due to its complexity
  • Any bug fix → new version published for everyone
  • Codebase bloated from supporting every edge case
  • Airflow upgrades became multi-week ordeals — locked into Airflow

Our answer: dlt for loading + a best practice repo that clients copy and own.


The Hybrid: dlt as the loading layer (2023–2025)

  • No centralized management of images or code anymore
  • Custom Airflow provider: source hooks yield Python dicts, dlt handles all loading (see the sketch below)
  • Switch destinations by changing one connection — zero custom upload code
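
A minimal sketch of that pattern, with the Airflow provider plumbing left out; fetch_pages, the orders resource, and the dummy record are illustrative stand-ins, not our actual hook code:

import dlt

def fetch_pages(updated_since: str):
    # Placeholder for the per-API pagination logic we still hand-wrote in
    # this phase; a real hook would page through the source API here.
    yield [{"id": 1, "updated_at": updated_since, "total_price": "19.90"}]

@dlt.resource(name="orders", primary_key="id", write_disposition="merge")
def orders(updated_since: str):
    # Yielding plain dicts is enough: dlt infers the schema, creates the
    # tables, and merges incrementally on the declared primary key.
    for page in fetch_pages(updated_since):
        yield from page

pipeline = dlt.pipeline(
    pipeline_name="hybrid_example",
    destination="duckdb",  # switching destinations = changing this value (or the configured connection)
    dataset_name="raw_data",
)
print(pipeline.run(orders(updated_since="2024-01-01")))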

What still hurt:

  • The custom Airflow provider is very complex, and nothing can run without Airflow
  • Still hand-writing extraction logic for every API
  • All connectors in one Airflow deployment

Our answer: Standalone containerized dlt connectors, optionally orchestrated via Airflow.


Standalone: The architecture

dlt-connectors/
  connectors/
    shopify/
      shopify_pipeline.py    ← entry point (~25 lines)
      shopify.py             ← dlt source + resources
      Dockerfile
      pyproject.toml
      .dlt/secrets_example.toml
    facebook/
      ...
    google_ads/
      ...

Each connector: own deps, own container, own lifecycle
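
A rough sketch of what such an entry point can look like; the shopify_source import and the DLT_DESTINATION handling are assumptions inferred from the run examples on the next slide, not the exact file:

# shopify_pipeline.py (sketch): build a pipeline, run the source, print the result.
import os

import dlt

from shopify import shopify_source  # dlt source defined in shopify.py (assumed name)

def run() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="shopify",
        # Defaults to local duckdb for development; production containers set
        # DLT_DESTINATION=postgres (or bigquery, snowflake, ...).
        destination=os.environ.get("DLT_DESTINATION", "duckdb"),
        dataset_name="shopify_raw",
    )
    load_info = pipeline.run(shopify_source())
    print(load_info)

if __name__ == "__main__":
    run()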


Standalone: Run it anywhere

# Locally with duckdb
uv run python shopify_pipeline.py

# Locally with Postgres
DLT_DESTINATION=postgres uv run --env-file=.env python shopify_pipeline.py

# As a Docker container
docker run --rm \
  --env DLT_DESTINATION=postgres \
  --env DLT_SECRETS_TOML_BASE64=$(base64 -w 0 .dlt/secrets.toml) \
  dlt-connector-shopify:latest

Orchestrate with anything: Airflow, Dagster, K8s CronJob, Cloud Run...
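
The Docker example above passes secrets as base64. A minimal sketch of how that can be wired, assuming a small entrypoint decodes the variable into .dlt/secrets.toml before the pipeline starts (an illustration, not the exact image contents):

# entrypoint.py (sketch): runs inside the container before the pipeline.
import base64
import os
import pathlib
import runpy

secrets_b64 = os.environ.get("DLT_SECRETS_TOML_BASE64")
if secrets_b64:
    # Materialize the secrets file where dlt expects to find it.
    secrets_path = pathlib.Path(".dlt/secrets.toml")
    secrets_path.parent.mkdir(parents=True, exist_ok=True)
    secrets_path.write_bytes(base64.b64decode(secrets_b64))

# Hand over to the connector's entry point as if it were run directly.
runpy.run_path("shopify_pipeline.py", run_name="__main__")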


Standalone: What it enables — and honest tradeoffs

What it enables:

  • Tap into the entire dlt ecosystem — use any existing connector out of the box
  • Simple dev flow: develop and test locally, no Airflow needed
  • New teammates learn from dlt docs, YouTube, Slack — not just internal wiki
  • Migrate one connector at a time, old and new coexist

Honest tradeoffs:

  • Credentials go from Airflow's UI to TOML files or env vars — harder to manage at scale
  • New operational work: Docker registries, image builds, container config
  • Many independent client repos to maintain — changes don't propagate automatically

Currently working on: Agents migrating connectors

  • 30+ EWAH connectors still to migrate; doing it by hand doesn't scale
  • Coding agents take the EWAH code plus dlt patterns and produce a working connector
  • The bottleneck has shifted from writing the code to verifying the output
  • Our challenge now: building a validation loop for autonomous migration


What's next: Building blocks

Best practice repos as building blocks:

  • dlt-connectors — best practice repo for how we build dlt connectors (open-sourcing soon)
  • airflow3 — best practice repo for Airflow deployment, independent of dlt — but dlt connectors plug right in

AI-powered infrastructure setup:

  • A new Airflow repo can be built in ~1 hour with this setup + AI (e.g. Claude Code)
  • Give Claude the context of the dlt and Airflow best practice repos → tell it to build → it just works
  • Extreme speed-up in deploying data infrastructure

Key takeaways

  1. Separating extract from load was the highest-impact change — not for the separation itself, but because it enabled a decoupled, less centralized architecture
  2. dlt's open-source-first approach enabled us to move away from EWAH to something fundamentally better
  3. Decoupled architecture is simpler — to understand, develop, and use
  4. dlt is built for the AI era — open-source libraries get supercharged by AI agents, not replaced like SaaS tools
  5. What takes a teammate hours may take an agent minutes — if the context is right

The future is now: Data in the age of AI

  • dlt makes it extremely easy to spin up data loading
  • AI makes it super easy and fast to deploy more complex infrastructure
  • AI agents can propagate maintenance and updates across many repos automatically

Open questions:

  • How to best create true self-service analytics with LLM bots?
  • How will data transformation be affected as AI usage grows?
  • What data infrastructure enables all of the above — cost-efficiently, fast to implement, and easy to maintain?

Thanks!

Questions?

bijan.soltani@gemmaanalytics.com

Hey everyone, I'm Bijan from Gemma Analytics. We're a data consultancy, which means we build and maintain data pipelines for a living. I want to walk you through how our data loading stack changed over the last five or six years, what broke along the way, and where we're headed next.

Quick context. We're Gemma Analytics, about 20 people in Berlin. We do data engineering, analytics, and AI. 70-something projects since 2020. We don't just build pipelines — we create end-to-end data infrastructure for many clients in parallel. Historically that's meant the modern data stack: Airflow for orchestration, dbt for transformation, cloud data warehouses, and various visualization tools. So our tooling has to scale across teams, not just within one project.

Here's the map. Four iterations, four questions. Bespoke was "how do we stop reinventing the wheel every time?" EWAH was "how do we standardize?" The hybrid was "what if dlt handles the hard part?" And standalone was "why are we still inside Airflow at all?" Each one solved something real and created new problems. I'll walk through each.

Before we had any framework, every client got a hand-built Airflow setup. New project? Copy-paste from the last one. Even loading was custom — manual INSERT statements, type mapping, all from scratch. If we fixed something for one client, it never made it to the others. It was slow, unreliable, and a maintenance black hole. So we built a framework.

EWAH solved the copy-paste problem. One framework, 30+ connectors, YAML config, containerized and versioned as a public Docker image. Ran in production for years. But centralizing created new problems. Handing it over to client teams or even other Gemma engineers was really hard because of the complexity. Any bug fix meant a new version for everyone. The codebase bloated. Airflow upgrades took weeks. So we decided: move to dlt for loading, and use a best practice repo that clients can fork and own.

No more centralized management of images or code. We built a custom Airflow provider where source hooks yield dicts and dlt handles all loading. Switching destinations became a config change. But the provider was very complex, and you still couldn't run anything without Airflow. We were still hand-writing pagination for every API, and everything lived in one deployment. So our answer: standalone containerized dlt connectors that you can optionally orchestrate via Airflow.

Here's the repo. Each connector is its own directory with its own pyproject.toml, Dockerfile, and dependencies. Shopify doesn't know Facebook exists. You can version, deploy, and scale them independently.

Run it locally with duckdb during development. Point it at Postgres with an env file. Build a Docker image and run it in production. Orchestrate with whatever you already have. The connector doesn't know or care.

A few things this unlocked. We can use any connector from the dlt ecosystem out of the box. The dev flow is simple — develop and test locally without spinning up Airflow. New people can learn from dlt's community. And we can migrate one connector at a time. On the flip side, credentials moved from Airflow's nice UI to TOML files pasted into variables or hidden env vars. We now manage Docker registries and image builds. And without a central repo, each connector needs its own maintenance. These are real tradeoffs. We decided the decoupling was worth it.

We have 30+ EWAH connectors still to migrate. Doing it by hand doesn't scale. So we use coding agents. You give the agent the EWAH code, point it at dlt patterns, and it produces a working connector. The bottleneck shifted from writing the code to verifying the output. That's what we're solving now.

Where are we going? We're building two best practice repos as building blocks. dlt-connectors defines how we build connectors. airflow3 defines how we deploy Airflow — independently, but connectors plug right in. The exciting part: with AI, specifically Claude Code, you give it the context of both repos, tell it to build, and you get a working data infrastructure in about an hour. That's the speed-up we're after.

Five things. One: separating extract from load was the best thing we did, not because of the separation itself, but because it unlocked a decoupled architecture. Two: dlt's open-source-first approach is what made the move away from EWAH possible. Three: decoupled architecture is simpler for everyone. Four: open-source tools like dlt are perfectly positioned for the AI era — AI agents supercharge them, while SaaS tools risk being replaced entirely. And five: what takes a teammate hours might take an AI agent minutes, as long as you give it the right context. That's why simplicity and good documentation compound.

Here's the thing: dlt already makes spinning up data loading trivially easy. And with AI tools like Claude Code, deploying more complex infrastructure is faster than ever. The maintenance problem of many independent repos? AI agents can propagate updates across all of them automatically. But there are open questions. How do you build true self-service analytics with LLM bots? And what happens to the transformation layer as AI gets more capable? These are the questions we're exploring.

That's it. Happy to talk about any of the four approaches, the migration, dlt specifics, the agent workflow, or how this works in a consultancy context. Thanks for listening.