Extensibility System
DataGenFlow’s extensibility system lets engineers consume DataGenFlow as a Docker image and maintain custom blocks and templates in their own repositories.
Table of Contents
- Overview
- Quick Start
- Writing Custom Blocks
- Writing Custom Templates
- CLI Reference
- Hot Reload
- Extensions API
- Extensions Page
- Docker Setup
- Building Custom Images
- Troubleshooting
Overview
Engineers clone the DataGenFlow repository once, then:
- Build the Docker image locally
- Mount custom
user_blocks/anduser_templates/directories - Manage extensions with the
dgfCLI or the Extensions UI page
your-repo/
user_blocks/
sentiment_analyzer.py
translator.py
user_templates/
my_qa_pipeline.yaml
docker-compose.yml
.env
The system provides:
- Block registry with source tracking (
builtin,custom,user) - Dependency declaration via class attribute on blocks
- Hot reload via file watcher (watchdog) with 500ms debounce
- CLI tool (
dgf) for managing blocks, templates, and images - Extensions page in the frontend showing all blocks and templates with status
Quick Start
# 1. clone DataGenFlow
git clone https://github.com/your-org/DataGenFlow.git
cd DataGenFlow
# 2. build the Docker image
docker build -f docker/Dockerfile -t datagenflow:local .
# 3. create your project directory
mkdir -p my-project/user_blocks my-project/user_templates my-project/data
cd my-project
# 4. create docker-compose.yml (see Docker Setup section)
# 5. start DataGenFlow
docker-compose up -d
# 6. scaffold a block
cd ../DataGenFlow && uv run dgf blocks scaffold SentimentAnalyzer -c validators
mv sentiment_analyzer.py ../my-project/user_blocks/
# 7. check it's registered
uv run dgf blocks list
# 8. open the Extensions page in the UI
open http://localhost:8000/extensions
Writing Custom Blocks
Custom blocks follow the same BaseBlock interface as builtin blocks. See How to Create Custom Blocks for the full guide.
Block with Dependencies
Blocks can declare pip dependencies via a dependencies class attribute. Missing dependencies are detected at registration time, and the block appears as “unavailable” in the UI with an actionable error.
from lib.blocks.base import BaseBlock
from lib.entities.block_execution_context import BlockExecutionContext
from typing import Any
class SentimentAnalyzer(BaseBlock):
name = "Sentiment Analyzer"
description = "Analyze text sentiment using transformers"
category = "validators"
inputs = ["text"]
outputs = ["sentiment", "confidence"]
# declare pip dependencies
dependencies = ["transformers>=4.30.0", "torch>=2.0.0"]
def __init__(self, model: str = "distilbert-base-uncased"):
self.model = model
self._pipeline = None
async def execute(self, context: BlockExecutionContext) -> dict[str, Any]:
if self._pipeline is None:
from transformers import pipeline
self._pipeline = pipeline("sentiment-analysis", model=self.model)
text = context.get_state("text", "")
result = self._pipeline(text)[0]
return {
"sentiment": result["label"],
"confidence": result["score"],
}
Install missing dependencies via CLI or the Extensions page:
dgf blocks list # see which blocks are unavailable
# POST /api/extensions/blocks/SentimentAnalyzer/install-deps
Block Discovery
Blocks are discovered from three directories:
| Directory | Source Label | Purpose |
|---|---|---|
lib/blocks/builtin/ | builtin | Ships with DataGenFlow |
lib/blocks/custom/ | custom | Project-specific blocks |
user_blocks/ | user | User-mounted blocks (extensibility) |
Any .py file (not starting with _) containing a BaseBlock subclass is auto-discovered. The user_blocks/ path is configurable via the DATAGENFLOW_BLOCKS_PATH environment variable.
Writing Custom Templates
Templates are YAML files that define pre-configured pipelines.
name: "My QA Pipeline"
description: "Generate question-answer pairs from content"
blocks:
- type: TextGenerator
config:
model: "gpt-4o-mini"
user_prompt: |
Generate a question-answer pair from:
{{ content }}
Place templates in user_templates/ (or the path set by DATAGENFLOW_TEMPLATES_PATH). They appear in the Templates section of the UI and CLI.
Note: If a user template has the same ID (filename stem) as a builtin template, the builtin takes precedence and the user template is skipped.
CLI Reference
The dgf CLI is included in the DataGenFlow repository. Run it with uv:
cd /path/to/DataGenFlow
uv run dgf --help
Or install globally (requires the repo to be cloned):
cd /path/to/DataGenFlow
uv pip install -e .
dgf --help
Status
dgf status
Shows server health, block counts, template counts, and hot reload status.
Blocks Commands
dgf blocks list # list all blocks with status and source
dgf blocks validate ./my_block.py # check syntax and find block classes
dgf blocks scaffold MyBlock -c general # generate a starter block file
Templates Commands
dgf templates list # list all templates with source
dgf templates validate ./flow.yaml # check YAML structure and required fields
dgf templates scaffold "My Flow" # generate a starter template YAML
Image Commands
dgf image scaffold --blocks-dir ./user_blocks # generate Dockerfile with deps
dgf image build -t my-datagenflow:latest # build custom Docker image
The scaffold command parses dependencies attributes from block files and generates a Dockerfile.custom with the right uv pip install commands.
Configuration
dgf configure --show # show current endpoint
dgf configure --endpoint https://my-server:8000
Configuration resolution order:
DATAGENFLOW_ENDPOINTenvironment variable (highest priority).envfile in current directory- Default:
http://localhost:8000
Hot Reload
The file watcher monitors user_blocks/ and user_templates/ for changes. When a file is created, modified, or deleted:
- Blocks: The block registry re-scans all directories
- Templates: The specific template is registered or unregistered
Events are debounced at 500ms (configurable via DATAGENFLOW_HOT_RELOAD_DEBOUNCE_MS) to handle rapid saves.
| Environment Variable | Default | Description |
|---|---|---|
DATAGENFLOW_HOT_RELOAD | true | Enable/disable file watching |
DATAGENFLOW_HOT_RELOAD_DEBOUNCE_MS | 500 | Debounce interval in milliseconds |
Tip: Set
DATAGENFLOW_HOT_RELOAD=falsein production to avoid unnecessary file system overhead.
Extensions API
All extension endpoints live under /api/extensions/.
| Method | Endpoint | Description |
|---|---|---|
GET | /api/extensions/status | Block/template counts by source |
GET | /api/extensions/blocks | List all blocks with source and availability |
GET | /api/extensions/templates | List all templates with source |
POST | /api/extensions/reload | Trigger manual reload of all extensions |
POST | /api/extensions/blocks/{name}/validate | Validate block availability and dependencies |
GET | /api/extensions/blocks/{name}/dependencies | Get dependency info for a block |
POST | /api/extensions/blocks/{name}/install-deps | Install missing dependencies via uv |
Example response — GET /api/extensions/status:
{
"blocks": {
"total": 14,
"builtin_blocks": 12,
"custom_blocks": 0,
"user_blocks": 2,
"available": 13,
"unavailable": 1
},
"templates": {
"total": 6,
"builtin_templates": 4,
"user_templates": 2
}
}
Extensions Page
The Extensions page (/extensions) in the frontend shows:
- Status cards with block and template counts by source
- Block list with availability status, source badges, and dependency info. Unavailable blocks show a red border, error message, and an “Install Deps” button.
- Template list with source badges and a “Create Pipeline” button that creates a pipeline from the template and navigates to
/pipelines - Reload button to trigger a manual re-scan of all extension directories
Docker Setup
Building the Image
# from DataGenFlow repository root
docker build -f docker/Dockerfile -t datagenflow:local .
docker-compose.yml for Your Project
Create this in your project directory (outside DataGenFlow):
services:
datagenflow:
image: datagenflow:local
ports:
- "8000:8000"
volumes:
- ./user_blocks:/app/user_blocks
- ./user_templates:/app/user_templates
- ./data:/app/data
env_file:
- .env
environment:
- DATAGENFLOW_HOT_RELOAD=true
restart: unless-stopped
Environment Variables
Create a .env file:
# Required: LLM provider API key
LLM_API_KEY=your-api-key
# Optional: endpoint for dgf CLI
DATAGENFLOW_ENDPOINT=http://localhost:8000
# Optional: hot reload settings
DATAGENFLOW_HOT_RELOAD=true
DATAGENFLOW_HOT_RELOAD_DEBOUNCE_MS=500
All extensibility variables:
| Variable | Default | Description |
|---|---|---|
DATAGENFLOW_ENDPOINT | http://localhost:8000 | API endpoint (for CLI) |
DATAGENFLOW_BLOCKS_PATH | user_blocks | Path to user blocks directory |
DATAGENFLOW_TEMPLATES_PATH | user_templates | Path to user templates directory |
DATAGENFLOW_HOT_RELOAD | true | Enable file watching |
Building Custom Images
For production, pre-bake dependencies into the image:
# 1. generate Dockerfile with dependencies from your blocks
cd /path/to/DataGenFlow
uv run dgf image scaffold --blocks-dir /path/to/my-project/user_blocks -o /path/to/my-project/Dockerfile.custom
# 2. build the custom image (from DataGenFlow repo root)
docker build -f /path/to/my-project/Dockerfile.custom -t my-datagenflow:latest .
# 3. update docker-compose.yml to use new image
# image: my-datagenflow:latest
The generated Dockerfile builds from source and runs uv pip install for all declared dependencies.
Troubleshooting
Block not appearing in UI
- Cause: File not in a discovered directory, or class doesn’t inherit from
BaseBlock - Fix: Verify the file is in
user_blocks/, the filename doesn’t start with_, and the class inherits fromBaseBlock
Block shows as unavailable
Two sub-cases:
- Import succeeded but runtime deps are missing —
dependenciesattribute is readable,GET /dependencieslists them,POST /install-depsinstalls and reloads automatically. - Import itself failed (syntax error, missing module) —
block_classisNone, so/dependenciesand/install-depsboth return422with the import error. Fix the source file (or install the missing module), then trigger a reload viaPOST /api/extensions/reload. Once the class loads successfully the block becomes available.
Hot reload not working
- Cause:
DATAGENFLOW_HOT_RELOAD=falseor directory doesn’t exist at startup - Fix: Check the environment variable and ensure
user_blocks/anduser_templates/exist before the server starts
CLI cannot connect
- Cause: Wrong endpoint or server not running
- Fix: Run
dgf configure --showto check the endpoint, thendgf statusto test connectivity
User template ignored
- Cause: Template ID (filename stem) conflicts with a builtin template
- Fix: Rename the template file to avoid the collision. Check server logs for “skipped: conflicts with builtin” warnings.