Extensibility System

DataGenFlow’s extensibility system lets engineers consume DataGenFlow as a Docker image and maintain custom blocks and templates in their own repositories.

Overview

Engineers clone the DataGenFlow repository once, then:

  1. Build the Docker image locally
  2. Mount custom user_blocks/ and user_templates/ directories
  3. Manage extensions with the dgf CLI or the Extensions UI page

A typical project layout:
your-repo/
  user_blocks/
    sentiment_analyzer.py
    translator.py
  user_templates/
    my_qa_pipeline.yaml
  docker-compose.yml
  .env

The system provides:

  • Block registry with source tracking (builtin, custom, user)
  • Dependency declaration via class attribute on blocks
  • Hot reload via file watcher (watchdog) with 500ms debounce
  • CLI tool (dgf) for managing blocks, templates, and images
  • Extensions page in the frontend showing all blocks and templates with status

Quick Start

# 1. clone DataGenFlow
git clone https://github.com/your-org/DataGenFlow.git
cd DataGenFlow

# 2. build the Docker image
docker build -f docker/Dockerfile -t datagenflow:local .

# 3. create your project directory
mkdir -p my-project/user_blocks my-project/user_templates my-project/data
cd my-project

# 4. create docker-compose.yml (see Docker Setup section)

# 5. start DataGenFlow
docker-compose up -d

# 6. scaffold a block
cd ../DataGenFlow && uv run dgf blocks scaffold SentimentAnalyzer -c validators
mv sentiment_analyzer.py ../my-project/user_blocks/

# 7. check it's registered
uv run dgf blocks list

# 8. open the Extensions page in the UI
open http://localhost:8000/extensions

Writing Custom Blocks

Custom blocks follow the same BaseBlock interface as builtin blocks. See How to Create Custom Blocks for the full guide.

Block with Dependencies

Blocks can declare pip dependencies via a dependencies class attribute. Missing dependencies are detected at registration time, and the block appears as “unavailable” in the UI with an actionable error.

from lib.blocks.base import BaseBlock
from lib.entities.block_execution_context import BlockExecutionContext
from typing import Any


class SentimentAnalyzer(BaseBlock):
    name = "Sentiment Analyzer"
    description = "Analyze text sentiment using transformers"
    category = "validators"
    inputs = ["text"]
    outputs = ["sentiment", "confidence"]

    # declare pip dependencies
    dependencies = ["transformers>=4.30.0", "torch>=2.0.0"]

    def __init__(self, model: str = "distilbert-base-uncased-finetuned-sst-2-english"):
        self.model = model
        self._pipeline = None

    async def execute(self, context: BlockExecutionContext) -> dict[str, Any]:
        if self._pipeline is None:
            from transformers import pipeline
            self._pipeline = pipeline("sentiment-analysis", model=self.model)

        text = context.get_state("text", "")
        result = self._pipeline(text)[0]

        return {
            "sentiment": result["label"],
            "confidence": result["score"],
        }

Install missing dependencies via CLI or the Extensions page:

dgf blocks list                                    # see which blocks are unavailable
# POST /api/extensions/blocks/SentimentAnalyzer/install-deps
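Conceptually, the registration-time check boils down to asking whether each declared distribution is installed. A minimal sketch using importlib.metadata (the server's actual check may differ; version specifiers are ignored here):

```python
import re
from importlib import metadata


def missing_dependencies(dependencies: list[str]) -> list[str]:
    """Return the declared requirements whose distribution is not installed.

    Sketch only: the distribution name is checked, version specifiers are not.
    """
    missing = []
    for requirement in dependencies:
        # strip extras/specifiers: "transformers>=4.30.0" -> "transformers"
        name = re.split(r"[<>=!~\[;]", requirement, maxsplit=1)[0].strip()
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(requirement)
    return missing
```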

Block Discovery

Blocks are discovered from three directories:

Directory             Source Label   Purpose
lib/blocks/builtin/   builtin        Ships with DataGenFlow
lib/blocks/custom/    custom         Project-specific blocks
user_blocks/          user           User-mounted blocks (extensibility)

Any .py file (not starting with _) containing a BaseBlock subclass is auto-discovered. The user_blocks/ path is configurable via the DATAGENFLOW_BLOCKS_PATH environment variable.
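The scan can be sketched as follows. This is a hedged illustration, not DataGenFlow's registry code: it uses a name-based MRO check so the snippet stays self-contained, whereas the real registry imports BaseBlock from lib.blocks.base and uses issubclass.

```python
import importlib.util
import inspect
from pathlib import Path


def _is_block_class(cls: type) -> bool:
    # name-based MRO check keeps this sketch self-contained; the real
    # registry does issubclass(cls, BaseBlock) against lib.blocks.base
    mro_names = [c.__name__ for c in cls.__mro__]
    return "BaseBlock" in mro_names and cls.__name__ != "BaseBlock"


def discover_blocks(directory: str, source: str) -> dict[str, tuple[type, str]]:
    """Map class name -> (class, source label) for every block in a directory."""
    registry: dict[str, tuple[type, str]] = {}
    for path in sorted(Path(directory).glob("*.py")):
        if path.name.startswith("_"):
            continue  # private/helper modules are skipped
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            if _is_block_class(cls):
                registry[cls.__name__] = (cls, source)
    return registry
```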

Writing Custom Templates

Templates are YAML files that define pre-configured pipelines.

name: "My QA Pipeline"
description: "Generate question-answer pairs from content"

blocks:
  - type: TextGenerator
    config:
      model: "gpt-4o-mini"
      user_prompt: |
        Generate a question-answer pair from:
        {{ content }}

Place templates in user_templates/ (or the path set by DATAGENFLOW_TEMPLATES_PATH). They appear in the Templates section of the UI and CLI.

Note: If a user template has the same ID (filename stem) as a builtin template, the builtin takes precedence and the user template is skipped.
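That precedence rule amounts to a merge in which builtin IDs win. A hypothetical sketch (merge_template_ids is not part of DataGenFlow's API):

```python
from pathlib import Path


def merge_template_ids(
    builtin: dict[str, str], user_paths: list[str]
) -> tuple[dict[str, str], list[str]]:
    """Merge user templates into the builtin registry; builtins win on ID clashes.

    Returns the merged {template_id: path} registry and the list of skipped IDs.
    """
    registry = dict(builtin)
    skipped: list[str] = []
    for path in user_paths:
        template_id = Path(path).stem  # the ID is the filename stem
        if template_id in registry:
            skipped.append(template_id)  # builtin takes precedence
        else:
            registry[template_id] = path
    return registry, skipped
```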

CLI Reference

The dgf CLI is included in the DataGenFlow repository. Run it with uv:

cd /path/to/DataGenFlow
uv run dgf --help

Or install globally (requires the repo to be cloned):

cd /path/to/DataGenFlow
uv pip install -e .
dgf --help

Status

dgf status

Shows server health, block counts, template counts, and hot reload status.

Blocks Commands

dgf blocks list                          # list all blocks with status and source
dgf blocks validate ./my_block.py        # check syntax and find block classes
dgf blocks scaffold MyBlock -c general   # generate a starter block file

Templates Commands

dgf templates list                       # list all templates with source
dgf templates validate ./flow.yaml       # check YAML structure and required fields
dgf templates scaffold "My Flow"         # generate a starter template YAML

Image Commands

dgf image scaffold --blocks-dir ./user_blocks  # generate Dockerfile with deps
dgf image build -t my-datagenflow:latest       # build custom Docker image

The scaffold command parses the dependencies attribute from each block file and generates a Dockerfile.custom with the matching uv pip install commands.

Configuration

dgf configure --show                     # show current endpoint
dgf configure --endpoint https://my-server:8000

Configuration resolution order:

  1. DATAGENFLOW_ENDPOINT environment variable (highest priority)
  2. .env file in current directory
  3. Default: http://localhost:8000
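The lookup order can be sketched as follows. This is a simplified illustration that assumes plain KEY=VALUE lines in .env; the CLI's actual parser may handle quoting and comments:

```python
import os
from pathlib import Path

DEFAULT_ENDPOINT = "http://localhost:8000"


def resolve_endpoint(env_file: str = ".env") -> str:
    # 1. environment variable has highest priority
    endpoint = os.environ.get("DATAGENFLOW_ENDPOINT")
    if endpoint:
        return endpoint
    # 2. .env file in the current directory
    path = Path(env_file)
    if path.exists():
        for line in path.read_text().splitlines():
            line = line.strip()
            if line.startswith("DATAGENFLOW_ENDPOINT="):
                value = line.split("=", 1)[1].strip()
                if value:
                    return value
    # 3. fall back to the default
    return DEFAULT_ENDPOINT
```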

Hot Reload

The file watcher monitors user_blocks/ and user_templates/ for changes. When a file is created, modified, or deleted:

  • Blocks: The block registry re-scans all directories
  • Templates: The specific template is registered or unregistered

Events are debounced at 500ms (configurable via DATAGENFLOW_HOT_RELOAD_DEBOUNCE_MS) to handle rapid saves.
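The debounce itself is a small trailing-edge timer: every event restarts it, and the reload fires only once the stream of events goes quiet. A sketch using the standard library (the actual watcher wires this into watchdog callbacks):

```python
import threading


class Debouncer:
    """Run reload_fn once, interval_ms after the last trigger() call."""

    def __init__(self, reload_fn, interval_ms: int = 500):
        self._reload_fn = reload_fn
        self._interval = interval_ms / 1000.0
        self._timer: threading.Timer | None = None
        self._lock = threading.Lock()

    def trigger(self) -> None:
        # restart the timer; rapid saves keep pushing the reload back
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._interval, self._reload_fn)
            self._timer.daemon = True
            self._timer.start()
```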

Environment Variable                 Default   Description
DATAGENFLOW_HOT_RELOAD               true      Enable/disable file watching
DATAGENFLOW_HOT_RELOAD_DEBOUNCE_MS   500       Debounce interval in milliseconds

Tip: Set DATAGENFLOW_HOT_RELOAD=false in production to avoid unnecessary file system overhead.

Extensions API

All extension endpoints live under /api/extensions/.

Method   Endpoint                                      Description
GET      /api/extensions/status                        Block/template counts by source
GET      /api/extensions/blocks                        List all blocks with source and availability
GET      /api/extensions/templates                     List all templates with source
POST     /api/extensions/reload                        Trigger manual reload of all extensions
POST     /api/extensions/blocks/{name}/validate        Validate block availability and dependencies
GET      /api/extensions/blocks/{name}/dependencies    Get dependency info for a block
POST     /api/extensions/blocks/{name}/install-deps    Install missing dependencies via uv

Example response for GET /api/extensions/status:

{
  "blocks": {
    "total": 14,
    "builtin_blocks": 12,
    "custom_blocks": 0,
    "user_blocks": 2,
    "available": 13,
    "unavailable": 1
  },
  "templates": {
    "total": 6,
    "builtin_templates": 4,
    "user_templates": 2
  }
}
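A client consuming this payload might condense it into a one-line summary. summarize_status below is a hypothetical helper, not part of the dgf CLI; the field names match the response above:

```python
def summarize_status(status: dict) -> str:
    """Condense a /api/extensions/status payload into a one-line summary."""
    blocks = status["blocks"]
    templates = status["templates"]
    return (
        f"{blocks['available']}/{blocks['total']} blocks available "
        f"({blocks['unavailable']} unavailable), "
        f"{templates['total']} templates"
    )
```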

Extensions Page

The Extensions page (/extensions) in the frontend shows:

  • Status cards with block and template counts by source
  • Block list with availability status, source badges, and dependency info. Unavailable blocks show a red border, error message, and an “Install Deps” button.
  • Template list with source badges and a “Create Pipeline” button that creates a pipeline from the template and navigates to /pipelines
  • Reload button to trigger a manual re-scan of all extension directories

Docker Setup

Building the Image

# from DataGenFlow repository root
docker build -f docker/Dockerfile -t datagenflow:local .

docker-compose.yml for Your Project

Create this in your project directory (outside DataGenFlow):

services:
  datagenflow:
    image: datagenflow:local
    ports:
      - "8000:8000"
    volumes:
      - ./user_blocks:/app/user_blocks
      - ./user_templates:/app/user_templates
      - ./data:/app/data
    env_file:
      - .env
    environment:
      - DATAGENFLOW_HOT_RELOAD=true
    restart: unless-stopped

Environment Variables

Create a .env file:

# Required: LLM provider API key
LLM_API_KEY=your-api-key

# Optional: endpoint for dgf CLI
DATAGENFLOW_ENDPOINT=http://localhost:8000

# Optional: hot reload settings
DATAGENFLOW_HOT_RELOAD=true
DATAGENFLOW_HOT_RELOAD_DEBOUNCE_MS=500

All extensibility variables:

Variable                     Default                 Description
DATAGENFLOW_ENDPOINT         http://localhost:8000   API endpoint (for CLI)
DATAGENFLOW_BLOCKS_PATH      user_blocks             Path to user blocks directory
DATAGENFLOW_TEMPLATES_PATH   user_templates          Path to user templates directory
DATAGENFLOW_HOT_RELOAD       true                    Enable file watching

Building Custom Images

For production, pre-bake dependencies into the image:

# 1. generate Dockerfile with dependencies from your blocks
cd /path/to/DataGenFlow
uv run dgf image scaffold --blocks-dir /path/to/my-project/user_blocks -o /path/to/my-project/Dockerfile.custom

# 2. build the custom image (from DataGenFlow repo root)
docker build -f /path/to/my-project/Dockerfile.custom -t my-datagenflow:latest .

# 3. update docker-compose.yml to use new image
# image: my-datagenflow:latest

The generated Dockerfile builds from source and runs uv pip install for all declared dependencies.
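A static-analysis sketch of that parsing step, using the standard ast module so block files never have to be imported (the CLI's actual implementation may differ):

```python
import ast
from pathlib import Path


def collect_dependencies(blocks_dir: str) -> list[str]:
    """Gather every `dependencies = [...]` class attribute found in a directory."""
    deps: set[str] = set()
    for path in Path(blocks_dir).glob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if not isinstance(node, ast.ClassDef):
                continue
            for stmt in node.body:
                # look for a plain `dependencies = [...]` assignment
                if (
                    isinstance(stmt, ast.Assign)
                    and any(
                        isinstance(t, ast.Name) and t.id == "dependencies"
                        for t in stmt.targets
                    )
                    and isinstance(stmt.value, ast.List)
                ):
                    for elt in stmt.value.elts:
                        if isinstance(elt, ast.Constant) and isinstance(elt.value, str):
                            deps.add(elt.value)
    return sorted(deps)
```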

Troubleshooting

Block not appearing in UI

  • Cause: File not in a discovered directory, or class doesn’t inherit from BaseBlock
  • Fix: Verify the file is in user_blocks/, the filename doesn’t start with _, and the class inherits from BaseBlock

Block shows as unavailable

Two sub-cases:

  1. Import succeeded but runtime deps are missing: the dependencies attribute is readable, GET /dependencies lists them, and POST /install-deps installs them and reloads automatically.
  2. Import itself failed (syntax error, missing module): block_class is None, so /dependencies and /install-deps both return 422 with the import error. Fix the source file (or install the missing module), then trigger a reload via POST /api/extensions/reload. Once the class loads successfully, the block becomes available.

Hot reload not working

  • Cause: DATAGENFLOW_HOT_RELOAD=false or directory doesn’t exist at startup
  • Fix: Check the environment variable and ensure user_blocks/ and user_templates/ exist before the server starts

CLI cannot connect

  • Cause: Wrong endpoint or server not running
  • Fix: Run dgf configure --show to check the endpoint, then dgf status to test connectivity

User template ignored

  • Cause: Template ID (filename stem) conflicts with a builtin template
  • Fix: Rename the template file to avoid the collision. Check server logs for “skipped: conflicts with builtin” warnings.