How to Use DataGenFlow
This guide walks you through using DataGenFlow to create datasets powered by LLMs.
Table of Contents
- Overview
- Prerequisites
- Phase 1: Building Pipelines
- Phase 2: Generating Records
- Phase 3: Reviewing Results
- Tips and Best Practices
- Common Issues
- Next Steps
Overview
DataGenFlow uses a three-phase workflow:
- Build: Create pipelines by connecting blocks
- Generate: Run pipelines on seed data to produce records
- Review: Validate, edit, and export results
Quick Start with Templates
DataGenFlow includes pre-configured templates for common tasks. See Pipeline Templates for:
- JSON Extraction - Extract structured information from text
- Text Classification - Classify text into categories with confidence scores
- Q&A Generation - Generate question-answer pairs from content
Templates use simplified seeds with just a `content` field in `metadata`.
Prerequisites
- LLM endpoint configured (Ollama, OpenAI, etc.)
- Application running at http://localhost:8000
- Seed data file prepared (JSON format)
Phase 1: Building Pipelines
What is a Pipeline?
A pipeline is a sequence of blocks that process data. Each block:
- Takes inputs from accumulated state (previous blocks’ outputs + seed data)
- Performs an operation (LLM generation, validation, formatting)
- Outputs results that get added to accumulated state for next blocks
Key concept: All block outputs accumulate, so later blocks can access any data from earlier blocks.
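As a rough sketch (plain Python, not DataGenFlow's internal code), you can picture accumulated state as a dictionary that each block extends:

```python
import json

# Seed metadata starts the accumulated state.
state = {"content": "Cleaning Your Laptop Screen"}

# Block 1 (e.g. a generator) adds its output under its own key.
state["generated"] = '{"title": "Laptop Care", "description": "How to clean a screen."}'

# Block 2 (e.g. a validator) can read ANY earlier field, not just the previous block's.
state["parsed_json"] = json.loads(state["generated"])
state["valid"] = "title" in state["parsed_json"] and "description" in state["parsed_json"]
```

Nothing is overwritten or discarded along the way, which is why a validator at the end of a long pipeline can still see the original seed variables.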
Creating Your First Pipeline
Step 1: Navigate to Pipelines page
- Open http://localhost:8000
- Click “Pipelines” in navigation
Step 2: Choose creation method
- From template: Click a template card to create pre-configured pipeline
- From scratch: Click “+ New Pipeline” to use visual editor
Step 3: Visual Pipeline Editor (if creating from scratch)
- Start/End blocks: Automatically added (circular nodes, cannot be deleted)
- Add blocks: Drag blocks from left palette onto canvas
- Connect blocks: Drag from output handle to input handle
- Configure blocks: Click gear icon on block to open config panel
- See accumulated state: Each block shows available fields at that point
Step 4: Configure blocks
- Click gear icon (⚙️) on any block
- Configuration panel opens on right
- Fill in required parameters
- Red “Not Configured” badge shows missing fields
- Yellow “Not Connected” badge shows disconnected blocks
Step 5: Save pipeline
- Enter pipeline name at top
- Click “Save Pipeline” button
- Pipeline validation runs (checks all blocks configured and connected)
- Pipeline appears in list
Example: JSON Generation Pipeline
This example comes from the built-in “JSON Generation with Validation” template.
Goal: Generate structured JSON objects about topics with automatic validation
Pipeline Blocks:
- StructuredGenerator - Generates structured JSON from topic
- JSONValidatorBlock - Validates and parses JSON structure
Block Configurations:
Block 1: StructuredGenerator
- Temperature: 0.7
- Max tokens: 2048
- User prompt: "Extract key information from this text and structure it as JSON with 'title' and 'description' fields. Text: {{ content }}"
- JSON schema: Defines structure with `title` and `description` fields
Block 2: JSONValidatorBlock
- Field name: `generated` (validates generator output)
- Required fields: `["title", "description"]`
- Strict mode: `true` (enforces schema)
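The exact schema dialect depends on your LLM endpoint, but as an illustration, a schema for the title/description structure above might look like the following (expressed here as a Python dict; the field names come from the example, everything else is an assumption):

```python
# Hypothetical JSON schema for the StructuredGenerator's output.
# The schema format your endpoint accepts may differ from this sketch.
json_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["title", "description"],
}
```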
Example Seed Data:
```json
{
  "repetitions": 3,
  "metadata": {
    "content": "Cleaning Your Laptop Screen"
  }
}
```
What Happens:
- StructuredGenerator receives the content and generates structured JSON with title and description
- JSONValidatorBlock validates the JSON structure
- Result is saved for review with a full execution trace showing the `generated`, `valid`, and `parsed_json` fields
Phase 2: Generating Records
Preparing Seed Data
Seed files define the variables used in your pipeline templates.
Simple content-based format (for templates):
```json
[
  {
    "repetitions": 3,
    "metadata": {
      "content": "Electric cars reduce emissions but require charging infrastructure."
    }
  },
  {
    "repetitions": 2,
    "metadata": {
      "content": "Machine learning helps doctors diagnose diseases more accurately."
    }
  }
]
```
Custom variables format (for custom pipelines):
```json
[
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python lists",
      "difficulty": "beginner"
    }
  },
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python dictionaries",
      "difficulty": "intermediate"
    }
  }
]
```
Fields:
- `repetitions`: How many times to run the pipeline with this seed
- `metadata`: Variables accessible in block templates via `{{ variable_name }}`
- For templates, use the `content` field; for custom pipelines, use any variable names
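Seed files can be written by hand, but a small script helps keep them valid. This is a generic sketch (plain Python, standard library only), not part of DataGenFlow itself:

```python
import json

seeds = [
    {"repetitions": 1, "metadata": {"topic": "Python lists", "difficulty": "beginner"}},
    {"repetitions": 1, "metadata": {"topic": "Python dictionaries", "difficulty": "intermediate"}},
]

# Sanity-check the structure before writing the file.
for seed in seeds:
    assert isinstance(seed.get("repetitions"), int) and seed["repetitions"] > 0
    assert isinstance(seed.get("metadata"), dict) and seed["metadata"]

with open("seeds.json", "w", encoding="utf-8") as f:
    json.dump(seeds, f, indent=2)
```

Generating seeds programmatically also makes it easy to sweep over topic lists or difficulty levels without editing JSON by hand.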
Running Generation
- Navigate to Generator page
- Upload seed file
- Click “Choose File”
- Select your JSON seed file
- Select pipeline
- Choose from saved pipelines in dropdown
- Click “Generate”
- Job starts in background
- Can only run one job at a time
- Monitor progress:
- Global indicator: Top-right corner shows active job
- Detailed progress: Generator page shows:
- Progress bar with percentage
- Current step (e.g., “Processing seed 5/10”)
- Current block being executed
- Success/failure counts
- Elapsed time
- Progress updates every 2 seconds
- Cancel job (optional):
- Click “Cancel Job” to stop generation
- Wait for completion:
- Records saved automatically as generated
- Navigate to Review page when done
- Job appears in job selector
Phase 3: Reviewing Results
Review Interface
The Review page shows generated records with a clean card-based layout:
- Job selector: Filter records by generation job (required)
- Status filter: View Pending, Accepted, or Rejected records
- Record cards: Each shows output, metadata, and actions
- Trace viewer: Collapsible execution history with timing
Reviewing Records
Step 1: Navigate to Review page
Step 2: Select a job (required)
- Choose from dropdown of recent jobs
- Only shows jobs that generated records
- Stats update when switching jobs
Step 3: Filter by status
- Pending: Not yet reviewed (default view)
- Accepted: Approved records
- Rejected: Discarded records
Step 4: Review each record
- Read the output in card format
- Click “View Trace” to see execution details
- Check metadata (seed variables used)
Step 5: Take action
- Accept (A key): Mark as approved
- Reject (R key): Mark as discarded
- Edit (E key): Modify the output
- Opens edit modal
- Make changes to output text
- Click “Save” to update record
Using Execution Traces
Traces help debug issues by showing:
- Which blocks executed
- How long each took
- What data passed between blocks
- Any errors that occurred
To view trace:
- Click “View Trace” on a record
- Expand to see block-by-block execution
- Check `accumulated_state` to see data flow
Exporting Results
Export is scoped to the currently selected job:
- Select job from dropdown
- Filter by status (optional - export all or just accepted/rejected)
- Click “Export JSONL” button
- File downloads automatically with job-scoped records
Export format:
```jsonl
{"id": 1, "metadata": {...}, "status": "accepted", "accumulated_state": {...}, "created_at": "...", "updated_at": "..."}
{"id": 2, "metadata": {...}, "status": "accepted", "accumulated_state": {...}, "created_at": "...", "updated_at": "..."}
```
The `accumulated_state` contains only the block outputs (e.g., `assistant`, `valid`, `parsed_json`), excluding metadata to avoid duplication.
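Once downloaded, the JSONL export is straightforward to post-process. A minimal sketch, assuming the record shape shown above:

```python
import json

def load_outputs(path, status="accepted"):
    """Collect accumulated_state from records with the given review status."""
    outputs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("status") == status:
                outputs.append(record["accumulated_state"])
    return outputs
```

For example, `load_outputs("export.jsonl")` yields just the accepted records' block outputs, ready for fine-tuning or downstream analysis.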
Deleting Records:
- “Delete All” button removes all records for selected job
- Also deletes the job itself
- Job disappears from selector after deletion
Tips and Best Practices
Pipeline Design
- Start simple: Begin with one LLM block, add validation later
- Use templates: Load from templates as starting points
- Test with small batches: Use `repetitions: 1` while developing
Seed Data
- Use descriptive variable names: `{{ topic }}`, not `{{ t }}`
- Include variety: Different seeds produce diverse outputs
- Validate JSON: Use a JSON validator before uploading
Quality Control
- Check traces for failures: View trace if output looks wrong
- Edit instead of reject: Save good records that need minor fixes
- Use validation blocks: Catch issues early in the pipeline
Debugging
- Enable debug mode: Set `DEBUG=true` in `.env`
- Check backend logs: Look for trace IDs and timing info
- Review block configurations: Ensure templates render correctly
Common Issues
“Invalid JSON format” error
- Cause: Seed file is not valid JSON
- Fix: Use https://jsonlint.com to validate your file
Generation hangs or times out
- Cause: LLM endpoint is slow or unreachable
- Fix: Check `LLM_ENDPOINT` in `.env`, test endpoint separately
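To test the endpoint separately, a quick reachability check helps rule out network problems before blaming the pipeline. This is a generic sketch; the path to probe depends on your provider (for Ollama, `GET /api/tags` is one option):

```python
import urllib.request
import urllib.error

def endpoint_reachable(url, timeout=5):
    """Return True if the endpoint answers an HTTP request at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure, timeout, ...

# Example: endpoint_reachable("http://localhost:11434/api/tags")
```

If this returns False, fix connectivity (or the `LLM_ENDPOINT` value) first; if it returns True but generation still hangs, look at model load times and timeouts on the LLM side.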
Empty outputs
- Cause: Block template doesn’t match available variables in accumulated state
- Fix: Check trace to see what variables are available, update block configuration
Validation always fails
- Cause: Validator rules too strict
- Fix: Check ValidatorBlock config, adjust min/max length
Next Steps
- Create custom blocks: Extend DataGenFlow with your own logic - Custom Blocks Guide
- Developer guide: Deep dive into architecture and API - DEVELOPERS.md
- Contribute: Share improvements and ideas - CONTRIBUTING.md