How to Use DataGenFlow

This guide walks you through using DataGenFlow to create datasets powered by LLMs.

Overview
Prerequisites
Phase 1: Building Pipelines
Phase 2: Generating Records
- Preparing Seed Data
- Running Generation
Phase 3: Reviewing Results
Tips and Best Practices
Common Issues
Next Steps

Overview

DataGenFlow uses a three-phase workflow:

Build: Create pipelines by connecting blocks
Generate: Run pipelines on seed data to produce records
Review: Validate, edit, and export results

Quick Start with Templates

DataGenFlow includes pre-configured templates for common tasks. See Pipeline Templates for:

JSON Extraction - Extract structured information from text
Text Classification - Classify text into categories with confidence scores
Q&A Generation - Generate question-answer pairs from content

Templates use simplified seeds with just a content field in metadata.

Prerequisites

LLM endpoint configured (Ollama, OpenAI, etc.)
Application running at http://localhost:8000
Seed data file prepared (JSON format)

Phase 1: Building Pipelines

What is a Pipeline?

A pipeline is a sequence of blocks that process data. Each block:

Takes inputs from accumulated state (previous blocks’ outputs + seed data)
Performs an operation (LLM generation, validation, formatting)
Outputs results that get added to accumulated state for next blocks

Key concept: All block outputs accumulate, so later blocks can access any data from earlier blocks.

Creating Your First Pipeline

Step 1: Navigate to Pipelines page

Open http://localhost:8000
Click “Pipelines” in navigation

Step 2: Choose creation method

From template: Click a template card to create pre-configured pipeline
From scratch: Click ”+ New Pipeline” to use visual editor

Step 3: Visual Pipeline Editor (if creating from scratch)

Start/End blocks: Automatically added (circular nodes, cannot be deleted)
Add blocks: Drag blocks from left palette onto canvas
Connect blocks: Drag from output handle to input handle
Configure blocks: Click gear icon on block to open config panel
See accumulated state: Each block shows available fields at that point

Step 4: Configure blocks

Click gear icon (⚙️) on any block
Configuration panel opens on right
Fill in required parameters
Red “Not Configured” badge shows missing fields
Yellow “Not Connected” badge shows disconnected blocks

Step 5: Save pipeline

Enter pipeline name at top
Click “Save Pipeline” button
Pipeline validation runs (checks all blocks configured and connected)
Pipeline appears in list

Example: JSON Generation Pipeline

This example comes from the built-in “JSON Generation with Validation” template.

Goal: Generate structured JSON objects about topics with automatic validation

Pipeline Blocks:

StructuredGenerator - Generates structured JSON from topic
JSONValidatorBlock - Validates and parses JSON structure

Block Configurations:

Block 1: StructuredGenerator

Temperature: 0.7
Max tokens: 2048
User prompt: "Extract key information from this text and structure it as JSON with 'title' and 'description' fields. Text: {{ content }}"
JSON schema: Defines structure with title and description fields

Block 2: JSONValidatorBlock

Field name: generated (validates generator output)
Required fields: ["title", "description"]
Strict mode: true (enforces schema)

Example Seed Data:

{
  "repetitions": 3,
  "metadata": {
    "content": "Cleaning Your Laptop Screen"
  }
}

What Happens:

StructuredGenerator receives the content and generates structured JSON with title and description
JSONValidatorBlock validates the JSON structure
Result is saved for review with full execution trace showing generated, valid, and parsed_json fields

Phase 2: Generating Records

Preparing Seed Data

Seed files define the variables used in your pipeline templates.

Simple content-based format (for templates):

[
  {
    "repetitions": 3,
    "metadata": {
      "content": "Electric cars reduce emissions but require charging infrastructure."
    }
  },
  {
    "repetitions": 2,
    "metadata": {
      "content": "Machine learning helps doctors diagnose diseases more accurately."
    }
  }
]

Custom variables format (for custom pipelines):

[
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python lists",
      "difficulty": "beginner"
    }
  },
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python dictionaries",
      "difficulty": "intermediate"
    }
  }
]

Fields:

repetitions: How many times to run the pipeline with this seed
metadata: Variables accessible in block templates via {{ variable_name }}
For templates, use content field; for custom pipelines, use any variable names

Running Generation

Navigate to Generator page
Upload seed file
- Click “Choose File”
- Select your JSON seed file
Select pipeline
- Choose from saved pipelines in dropdown
Click “Generate”
- Job starts in background
- Can only run one job at a time
Monitor progress:
- Global indicator: Top-right corner shows active job
- Detailed progress: Generator page shows:
  - Progress bar with percentage
  - Current step (e.g., “Processing seed 5/10”)
  - Current block being executed
  - Success/failure counts
  - Elapsed time
- Progress updates every 2 seconds
Cancel job (optional):
- Click “Cancel Job” to stop generation
Wait for completion:
- Records saved automatically as generated
- Navigate to Review page when done
- Job appears in job selector

Phase 3: Reviewing Results

Review Interface

The Review page shows generated records with a clean card-based layout:

Job selector: Filter records by generation job (required)
Status filter: View Pending, Accepted, or Rejected records
Record cards: Each shows output, metadata, and actions
Trace viewer: Collapsible execution history with timing

Reviewing Records

Step 1: Navigate to Review page

Step 2: Select a job (required)

Choose from dropdown of recent jobs
Only shows jobs that generated records
Stats update when switching jobs

Step 3: Filter by status

Pending: Not yet reviewed (default view)
Accepted: Approved records
Rejected: Discarded records

Step 4: Review each record*

Read the output in card format
Click “View Trace” to see execution details
Check metadata (seed variables used)

Step 5: Take action

Accept (A key): Mark as approved
Reject (R key): Mark as discarded
Edit (E key): Modify the output
- Opens edit modal
- Make changes to output text
- Click “Save” to update record

Using Execution Traces

Traces help debug issues by showing:

Which blocks executed
How long each took
What data passed between blocks
Any errors that occurred

To view trace:

Click “View Trace” on a record
Expand to see block-by-block execution
Check accumulated_state to see data flow

Exporting Results

Export is scoped to the currently selected job:

Select job from dropdown
Filter by status (optional - export all or just accepted/rejected)
Click “Export JSONL” button
File downloads automatically with job-scoped records

Export format:

{"id": 1, "metadata": {...}, "status": "accepted", "accumulated_state": {...}, "created_at": "...", "updated_at": "..."}
{"id": 2, "metadata": {...}, "status": "accepted", "accumulated_state": {...}, "created_at": "...", "updated_at": "..."}

The accumulated_state contains only the block outputs (e.g., assistant, valid, parsed_json), excluding metadata to avoid duplication.

Deleting Records:

“Delete All” button removes all records for selected job
Also deletes the job itself
Job disappears from selector after deletion

Tips and Best Practices

Pipeline Design

Start simple: Begin with one LLM block, add validation later
Use templates: Load from templates as starting points
Test with small batches: Use repetitions: 1 while developing

Seed Data

Use descriptive variable names: {{ topic }} not {{ t }}
Include variety: Different seeds produce diverse outputs
Validate JSON: Use a JSON validator before uploading

Quality Control

Check traces for failures: View trace if output looks wrong
Edit instead of reject: Save good records that need minor fixes
Use validation blocks: Catch issues early in the pipeline

Debugging

Enable debug mode: Set DEBUG=true in .env
Check backend logs: Look for trace IDs and timing info
Review block configurations: Ensure templates render correctly

Common Issues

”Invalid JSON format” error

Cause: Seed file is not valid JSON
Fix: Use https://jsonlint.com to validate your file

Generation hangs or times out

Cause: LLM endpoint is slow or unreachable
Fix: Check LLM_ENDPOINT in .env, test endpoint separately

Empty outputs

Cause: Block template doesn’t match available variables in accumulated state
Fix: Check trace to see what variables are available, update block configuration

Validation always fails

Cause: Validator rules too strict
Fix: Check ValidatorBlock config, adjust min/max length

Next Steps

Create custom blocks: Extend DataGenFlow with your own logic - Custom Blocks Guide
Developer guide: Deep dive into architecture and API - DEVELOPERS.md
Contribute: Share improvements and ideas - CONTRIBUTING.md

Documentation