Overview
DataGenFlow transforms complex data generation workflows into intuitive visual pipelines. It is a minimal tool designed to help you generate, validate, and export quality data with full transparency.
Why DataGenFlow
The Problem
Creating quality datasets for LLM training, testing, or validation often involves:
- Writing repetitive boilerplate code
- Managing complex orchestration logic
- Debugging opaque generation failures
- Manual review of thousands of records
- Difficulty extending or customizing workflows
The Solution
DataGenFlow provides:
- Visual Pipeline Builder: Drag-and-drop blocks instead of writing orchestration code
- Auto-Discovery: Custom blocks appear automatically - no configuration needed
- Full Transparency: Complete execution traces show exactly how each result was generated
- Accumulated State: Data flows automatically between blocks - no manual wiring
- Easy Extension: Add domain-specific logic in minutes with simple Python classes
Quick Start
Get DataGenFlow running in under 2 minutes:
# Install dependencies
make setup
make dev
# Launch application
make run-dev
# Open http://localhost:8000
That’s it! No complex configuration or external dependencies required.
Note: DataGenFlow works with any OpenAI-compatible LLM endpoint (Ollama, OpenAI, etc.). Configure your endpoint in the .env file.
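For illustration, the configuration could look something like the snippet below; the variable names are assumptions, not DataGenFlow's actual settings - check the project's own example configuration for the real ones:
# hypothetical variable names, shown only to illustrate the idea
LLM_API_BASE=http://localhost:11434/v1   # e.g. a local Ollama endpoint
LLM_API_KEY=not-needed-for-local-models
LLM_MODEL=llama3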
Core Concepts
Pipelines
A pipeline is a sequence of blocks that process data. Think of it as a visual workflow where:
- Data enters from seed files (JSON) - see the example after this list
- Each block transforms or validates the data
- Results accumulate and flow to the next block
- Final output is ready for review and export
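For example, a seed file could be a JSON array of records, each record becoming the starting state for one pipeline run (the layout here is illustrative):
[
  {"topic": "Python", "level": "beginner"},
  {"topic": "SQL", "level": "intermediate"}
]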
Blocks
Blocks are reusable processing units. Each block:
- Declares what inputs it needs
- Declares what outputs it produces
- Executes asynchronously
- Adds its outputs to accumulated state
Built-in blocks:
- LLMBlock: Generate text with an LLM using Jinja2 prompt templates
- ValidatorBlock: Check text quality (length, patterns, forbidden words)
- JSONValidatorBlock: Parse and validate JSON structures
- OutputBlock: Format final results with Jinja2 templates
Custom blocks: Create your own in minutes - just inherit from BaseBlock and implement execute().
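A minimal sketch of what that could look like - the import path, attribute names, and execute() signature below are assumptions, not DataGenFlow's exact API:
from app.blocks import BaseBlock  # import path is an assumption

class WordCountBlock(BaseBlock):
    """Hypothetical custom block: counts words in the generated answer."""
    inputs = ["assistant"]        # declared inputs (attribute names assumed)
    outputs = ["word_count"]      # declared outputs

    async def execute(self, state: dict) -> dict:
        # read from accumulated state, return only this block's new outputs
        return {"word_count": len(state["assistant"].split())}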
Accumulated State
One of DataGenFlow’s most powerful features: data automatically flows between blocks.
Seed Data: {"topic": "Python", "level": "beginner"}
↓
LLMBlock outputs: {"assistant": "Python is..."}
↓ (state: topic, level, assistant)
ValidatorBlock outputs: {"is_valid": true}
↓ (state: topic, level, assistant, is_valid)
OutputBlock can access: ALL previous data
No manual wiring needed - every block sees all previous outputs plus the original seed data.
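Conceptually, the flow behaves like the sketch below (illustrative only - the attribute and method names are assumptions, not the real engine code):
async def run_pipeline(pipeline, seed: dict) -> dict:
    state = dict(seed)                        # start from the seed record
    for block in pipeline.blocks:             # hypothetical `blocks` attribute
        outputs = await block.execute(state)  # each block sees everything so far
        state.update(outputs)                 # merge its outputs into the state
    return state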
Jobs & Review
Jobs track batch generation:
- Run pipelines on multiple seeds
- Monitor progress in real-time
- Track success/failure counts
- Review results by job
Review workflow:
- Filter records by job
- Accept (A), Reject (R), or Edit (E) records
- View execution traces for debugging
- Export approved records as JSONL
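For reference, a JSONL export contains one JSON object per line; reusing the example data from the Accumulated State section, an exported record might look like:
{"topic": "Python", "level": "beginner", "assistant": "Python is...", "is_valid": true}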
Key Features
Visual Pipeline Builder
- ReactFlow-based editor with drag-and-drop blocks
- Accumulated state visualization shows available data at each step
- Block status indicators (Not Configured, Not Connected)
- Template pipelines for quick start
- Pipeline validation before save
Real-Time Progress Tracking
- Global indicator in header shows active jobs
- Detailed progress view with:
- Progress bar and percentage
- Current block being executed
- Success/failure counts
- Elapsed time
- Job cancellation support
Complete Execution Traces
Every record includes a full trace:
- Block-by-block execution history
- Input/output for each block
- Accumulated state at each step
- Execution timing per block
- Error context if failures occur
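As a rough illustration (the field names below are assumptions, not the exact trace schema), a trace might look like:
{
  "blocks": [
    {"block": "LLMBlock", "inputs": {"topic": "Python", "level": "beginner"},
     "outputs": {"assistant": "Python is..."}, "duration_ms": 1240},
    {"block": "ValidatorBlock", "inputs": {"assistant": "Python is..."},
     "outputs": {"is_valid": true}, "duration_ms": 3}
  ]
}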
Job-Scoped Operations
- Filter by job to review specific generation runs
- Export by job for organized datasets
- Delete by job to clean up experiments
- Only one concurrent job is allowed (prevents resource conflicts)
Jinja2 Template Support
Use powerful template syntax in LLMBlock and OutputBlock:
System: You are a {{ role }} expert in {{ domain }}.
User: Explain {{ topic }} at {{ level }} level.
{% if include_examples %}
Include practical examples.
{% endif %}
Output: {{ assistant | truncate(500) }}
Variables come from:
- Seed data metadata
- Previous block outputs
- Accumulated state
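Rendering a template against the accumulated state can be reproduced standalone with the Jinja2 library; a minimal sketch (the state values reuse the earlier example):
from jinja2 import Template

# the accumulated state acts as the template context
state = {"topic": "Python", "level": "beginner", "assistant": "Python is a general-purpose language..."}

prompt = Template("Explain {{ topic }} at {{ level }} level.").render(**state)
output = Template("Output: {{ assistant | truncate(500) }}").render(**state)
print(prompt)   # Explain Python at beginner level.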
Next Steps
Ready to dive deeper?
- How to Use: Complete walkthrough of building pipelines, generating data, and reviewing results
- Create Custom Blocks: Extend DataGenFlow with your own processing logic
- Developer Guide: Technical architecture, API reference, and debugging tools
- Contributing: Share improvements and join the community
Happy data generating! 🌱