Data Augmentation Template

Overview
Pipeline Architecture
Seed Format
Output Format
How It Works
Use Cases
Customization
Filtering Duplicates
Tuning Parameters
Common Issues
Example Workflow
Related Documentation

Overview

Complexity: Advanced (3 blocks with multiplier) Use Case: Generate synthetic data that preserves statistical patterns from samples

This template creates realistic synthetic records from sample data while maintaining:

Statistical distributions (e.g., “electronics” appears 50% of the time)
Numeric range constraints (e.g., electronics prices $299-$899, furniture prices $199-$349)
Semantic coherence (LLM-generated fields match context)
Output diversity (duplicate detection via embeddings)

Special Features:

Statistical sampling preserves distributions
LLM-powered semantic field generation
Embedding-based duplicate detection
Supports field dependencies

Pipeline Architecture

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Structure  │──►│  Semantic   │──►│  Duplicate  │
│   Sampler   │   │  Infiller   │   │   Remover   │
└─────────────┘   └─────────────┘   └─────────────┘

Input: samples array
  ↓
+ category, _hints (multiplies: 1 seed → N skeletons)
  ↓
+ description, price (LLM-generated fields)
  ↓
+ is_duplicate, similarity_to_seeds, similarity_to_generated

Blocks:

StructureSampler - Learns distributions from samples, generates statistical skeletons
SemanticInfiller - Completes skeletons with LLM-generated semantic fields
DuplicateRemover - Filters similar records using embedding similarity

Key Concept: The StructureSampler is a multiplier block that generates N skeletons from one seed. Each skeleton flows through the remaining blocks to create one record.

Seed Format

Required fields:

samples - Array of example records (minimum 3 recommended)
target_count - Number of synthetic records to generate
categorical_fields - Fields to preserve distribution
fields_to_generate - Fields for LLM to generate

Optional fields:

numeric_fields - Numeric distributions to preserve
dependencies - Field relationships (e.g., role depends on plan)
comparison_fields - Fields for duplicate detection

Example seed (Product Catalog):

[
  {
    "repetitions": 1,
    "metadata": {
      "samples": [
        {"category": "electronics", "price": 299, "description": "Wireless noise-canceling headphones with premium sound quality"},
        {"category": "electronics", "price": 899, "description": "13-inch laptop with high-resolution display"},
        {"category": "furniture", "price": 199, "description": "Ergonomic office chair with lumbar support"},
        {"category": "furniture", "price": 349, "description": "Adjustable standing desk with memory presets"}
      ],
      "target_count": 10,
      "categorical_fields": ["category"],
      "numeric_fields": ["price"],
      "fields_to_generate": ["description", "price"],
      "comparison_fields": ["description"]
    }
  }
]

Field Explanations:

samples - Example products showing the data structure (4 samples provided)
target_count - How many new products to generate (10 in this example)
categorical_fields - Fields with discrete values that preserve distribution (50% electronics, 50% furniture)
numeric_fields - Fields with numeric ranges that provide hints to the LLM (electronics: $299-$899, furniture: $199-$349)
fields_to_generate - Fields for the LLM to create NEW content for (description and price)
comparison_fields - Fields to check for duplicates using embedding similarity (description)

Note: price appears in both numeric_fields and fields_to_generate. This provides range hints to guide the LLM while letting it generate contextually appropriate prices.

Tip: Use 4-10 diverse samples for best results. More samples = better distribution learning.

Output Format

The pipeline outputs a generated_samples array containing the final records.

Each generated record contains:

Sampled categorical fields (preserving distribution)
LLM-generated semantic fields
Duplicate detection metadata

Example output:

{
  "generated_samples": [
    {
      "category": "electronics",
      "price": 449,
      "description": "Bluetooth speaker with 360-degree sound and waterproof design",
      "is_duplicate": false,
      "similarity_to_seeds": 0.45,
      "similarity_to_generated": 0.42
    }
  ]
}

Each record contains:

Sampled categorical fields (category)
LLM-generated fields (price, description)
Duplicate detection metadata:
- similarity_to_seeds: highest similarity to original seed samples
- similarity_to_generated: highest similarity to other generated records
- is_duplicate: true if either similarity exceeds threshold

Note: Input configuration fields like samples, target_count, categorical_fields, etc. are NOT included in the output.

How It Works

Stage 1: StructureSampler (Statistical Skeleton Generation)

What it does:

Analyzes sample data to learn categorical frequencies
Computes numeric statistics (min, max, mean) for range hints
Respects field dependencies (e.g., role depends on plan)
Generates N skeletons respecting learned distributions

Example: If samples show “Free” plan 40% and “Pro” 30%, generated skeletons maintain these ratios.

Output per skeleton:

{
  "category": "electronics",
  "_hints": {
    "price_range": [199.0, 899.0],
    "exemplars": [
      {"category": "electronics", "price": 299, "description": "Wireless headphones"},
      {"category": "electronics", "price": 899, "description": "13-inch laptop"}
    ]
  }
}

Stage 2: SemanticInfiller (LLM-Powered Field Completion)

What it does:

Receives skeleton with locked statistical fields
Builds contextual prompt with numeric hints and exemplar examples
Calls LLM to generate semantic fields (bio, description, etc.)
Restores locked fields if LLM overwrites them

Prompt structure:

You are a data generator. Complete the following record skeleton.

Skeleton: {category: "electronics"}

Numeric hints:
- price should be between 199-899

Matching examples:
- {category: "electronics", price: 299, description: "Wireless headphones"}

Generate: ["description", "price"]
Return JSON: {"description": "...", "price": ...}

Locked fields behavior: Categorical fields sampled by StructureSampler (e.g., category) are preserved even if the LLM tries to modify them.

Stage 3: DuplicateRemover (Similarity Filtering)

What it does:

Extracts text from comparison fields
Generates embeddings via embedding model
Computes cosine similarity with cached embeddings
Marks records as duplicates if similarity > threshold

Output:

{
  "category": "electronics",
  "price": 549,
  "description": "Portable bluetooth speaker with waterproof design",
  "is_duplicate": false,
  "similarity_to_seeds": 0.72,
  "similarity_to_generated": 0.45
}

Output fields:

similarity_to_seeds: highest similarity to any original sample
similarity_to_generated: highest similarity to previously generated records
is_duplicate: true if either similarity exceeds threshold

Note: DuplicateRemover gracefully degrades if embedding model is unavailable - marks all records as is_duplicate: false with similarity scores of 0.0.

Use Cases

Perfect for:

Expanding training datasets while maintaining patterns
Creating realistic test data for applications
Generating synthetic user profiles with distributions
Data augmentation for ML training sets
Privacy-preserving data generation (learn from real, generate synthetic)

Not ideal for:

Time-series data (no temporal modeling)
Graph/network data (no relationship modeling)
Highly correlated numeric fields (limited correlation preservation)

Customization

Modify the template in lib/templates/data_augmentation.yaml:

Adjust generation count:

blocks:
  - type: StructureSampler
    config:
      target_count: 100  # Generate 100 records

Change LLM creativity:

  - type: SemanticInfiller
    config:
      temperature: 0.9  # Higher = more creative (0.7-0.9 recommended)
      max_tokens: 300   # Longer outputs

Adjust duplicate threshold:

  - type: DuplicateRemover
    config:
      similarity_threshold: 0.9  # Stricter (0.8-0.9 recommended)

Add more dependencies:

{
  "dependencies": {
    "role": ["plan"],
    "storage": ["plan"]
  }
}

Filtering Duplicates

Records marked as is_duplicate: true should be filtered post-generation:

Via API:

result = await pipeline.execute(seed_data)
generated = result.result.get("generated_samples", [])
unique_records = [r for r in generated if not r.get("is_duplicate")]

Via export (manual filter):

# Export all records
curl http://localhost:8000/api/export?job_id=1 > output.jsonl

# Filter duplicates from generated_samples
jq '.generated_samples[] | select(.is_duplicate == false)' output.jsonl > unique.jsonl

Note: Keeping duplicates in the trace allows adjusting the threshold post-generation and analyzing similarity score distributions (similarity_to_seeds and similarity_to_generated).

Tuning Parameters

Quality vs Speed

High quality (slower):

target_count: 100
temperature: 0.9
max_tokens: 300
similarity_threshold: 0.9

Fast iteration (lower quality):

target_count: 20
temperature: 0.7
max_tokens: 150
similarity_threshold: 0.75

Diversity vs Fidelity

Preserve distributions (higher fidelity):

Include all important categorical_fields
Specify dependencies accurately
Include numeric_fields with tight ranges

Increase diversity (creative generation):

Omit some categorical_fields (LLM generates freely)
Higher temperature (0.8-0.9)
Lower similarity_threshold (0.75-0.8)

Common Issues

Low diversity (many duplicates)

Causes:

Too few samples (<5)
Temperature too low (<0.5)
Fields too restrictive

Fixes:

Add more diverse samples
Increase temperature to 0.8-0.9
Generate more semantic fields
Increase similarity_threshold to 0.85-0.9

Unrealistic outputs

Causes:

Dependencies not specified
Numeric hints too broad
Temperature too high (>0.95)

Fixes:

Add dependencies config
Provide numeric_fields for constraints
Reduce temperature to 0.7-0.8
Include exemplar samples matching target patterns

LLM errors (invalid JSON)

Causes:

max_tokens too low (truncated JSON)
Complex nested structures

Fixes:

Increase max_tokens to 200-300
Simplify fields (fewer nested objects)
SemanticInfiller handles markdown wrappers automatically

Missing embeddings

Cause: Embedding model not configured

Behavior: DuplicateRemover marks all as is_duplicate: false

Fix: Configure default embedding model in Settings page

Example Workflow

Goal: Generate 100 synthetic user profiles

Step 1: Prepare samples (6 examples)

[
  {"plan": "Free", "role": "Viewer", "storage": 1, "bio": "Student learning"},
  {"plan": "Free", "role": "Viewer", "storage": 2, "bio": "Just exploring"},
  {"plan": "Pro", "role": "Editor", "storage": 50, "bio": "Freelance designer"},
  {"plan": "Pro", "role": "Editor", "storage": 75, "bio": "Agency owner"},
  {"plan": "Pro", "role": "Admin", "storage": 100, "bio": "Team lead"},
  {"plan": "Enterprise", "role": "Admin", "storage": 500, "bio": "CTO"}
]

Step 2: Create pipeline from template

curl -X POST http://localhost:8000/api/pipelines/from_template/data_augmentation \
  -H "Content-Type: application/json" \
  -d '{"name": "User Profile Augmentation"}'

Step 3: Start generation

curl -X POST http://localhost:8000/api/generate \
  -F "file=@seed_data_augmentation.json" \
  -F "pipeline_id=1"

Step 4: Monitor progress

# Poll job status
curl http://localhost:8000/api/jobs/1

Step 5: Review and export

# Export unique records only
curl http://localhost:8000/api/export?job_id=1 | jq 'select(.is_duplicate == false)' > unique_users.jsonl

Result: 100 synthetic user profiles preserving original distributions

Tip: For large datasets, start with 20 records to verify quality before scaling up.

Templates Overview - All available templates
How to Use - Running pipelines with templates
Custom Blocks - Creating custom blocks and understanding multipliers

Documentation

Data Augmentation Template

Table of Contents

Overview

Pipeline Architecture

Seed Format

Output Format

How It Works

Stage 1: StructureSampler (Statistical Skeleton Generation)

Stage 2: SemanticInfiller (LLM-Powered Field Completion)

Stage 3: DuplicateRemover (Similarity Filtering)

Use Cases

Customization

Filtering Duplicates

Tuning Parameters

Quality vs Speed

Diversity vs Fidelity

Common Issues

Low diversity (many duplicates)

Unrealistic outputs

LLM errors (invalid JSON)

Missing embeddings

Example Workflow

Step 1: Prepare samples (6 examples)

Step 2: Create pipeline from template

Step 3: Start generation

Step 4: Monitor progress

Step 5: Review and export

Related Documentation