Data Augmentation Template
Table of Contents
- Overview
- Pipeline Architecture
- Seed Format
- Output Format
- How It Works
- Use Cases
- Customization
- Filtering Duplicates
- Tuning Parameters
- Common Issues
- Example Workflow
- Related Documentation
Overview
Complexity: Advanced (3 blocks, one multiplier)
Use Case: Generate synthetic data that preserves statistical patterns from sample records
This template creates realistic synthetic records from sample data while maintaining:
- Statistical distributions (e.g., “electronics” appears 50% of the time)
- Numeric range constraints (e.g., electronics prices $299-$899, furniture prices $199-$349)
- Semantic coherence (LLM-generated fields match context)
- Output diversity (duplicate detection via embeddings)
Special Features:
- Statistical sampling preserves distributions
- LLM-powered semantic field generation
- Embedding-based duplicate detection
- Supports field dependencies
Pipeline Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Structure │──►│ Semantic │──►│ Duplicate │
│ Sampler │ │ Infiller │ │ Remover │
└─────────────┘ └─────────────┘ └─────────────┘
Input: samples array
↓
+ category, _hints (multiplies: 1 seed → N skeletons)
↓
+ description, price (LLM-generated fields)
↓
+ is_duplicate, similarity_to_seeds, similarity_to_generated
Blocks:
- StructureSampler - Learns distributions from samples, generates statistical skeletons
- SemanticInfiller - Completes skeletons with LLM-generated semantic fields
- DuplicateRemover - Filters similar records using embedding similarity
Key Concept: The StructureSampler is a multiplier block that generates N skeletons from one seed. Each skeleton flows through the remaining blocks to create one record.
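The fan-out described above can be sketched in a few lines. This is a minimal illustration of the multiplier idea, not the template's actual implementation; the `expand` function and `_skeleton_id` field are hypothetical names.

```python
# Hypothetical sketch: one seed fans out into target_count skeletons,
# and each skeleton then flows through the remaining blocks on its own.
seed = {"target_count": 3, "categorical_fields": ["category"]}

def expand(seed: dict) -> list[dict]:
    """One seed in, N skeleton records out (names are illustrative)."""
    return [{"_skeleton_id": i} for i in range(seed["target_count"])]

skeletons = expand(seed)
# len(skeletons) == 3: one downstream record per skeleton
```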
Seed Format
Required fields:
- samples - Array of example records (minimum 3 recommended)
- target_count - Number of synthetic records to generate
- categorical_fields - Fields whose value distribution should be preserved
- fields_to_generate - Fields for the LLM to generate
Optional fields:
- numeric_fields - Numeric distributions to preserve
- dependencies - Field relationships (e.g., role depends on plan)
- comparison_fields - Fields used for duplicate detection
Example seed (Product Catalog):
[
{
"repetitions": 1,
"metadata": {
"samples": [
{"category": "electronics", "price": 299, "description": "Wireless noise-canceling headphones with premium sound quality"},
{"category": "electronics", "price": 899, "description": "13-inch laptop with high-resolution display"},
{"category": "furniture", "price": 199, "description": "Ergonomic office chair with lumbar support"},
{"category": "furniture", "price": 349, "description": "Adjustable standing desk with memory presets"}
],
"target_count": 10,
"categorical_fields": ["category"],
"numeric_fields": ["price"],
"fields_to_generate": ["description", "price"],
"comparison_fields": ["description"]
}
}
]
Field Explanations:
- samples - Example products showing the data structure (4 samples provided)
- target_count - How many new products to generate (10 in this example)
- categorical_fields - Fields with discrete values whose distribution is preserved (50% electronics, 50% furniture)
- numeric_fields - Fields with numeric ranges that provide hints to the LLM (electronics: $299-$899; furniture: $199-$349)
- fields_to_generate - Fields for the LLM to create new content for (description and price)
- comparison_fields - Fields checked for duplicates via embedding similarity (description)
Note: price appears in both numeric_fields and fields_to_generate. This provides range hints to guide the LLM while still letting it generate contextually appropriate prices.
Tip: Use 4-10 diverse samples for best results. More samples = better distribution learning.
Output Format
The pipeline outputs a generated_samples array containing the final records.
Each generated record contains:
- Sampled categorical fields (preserving distribution)
- LLM-generated semantic fields
- Duplicate detection metadata
Example output:
{
"generated_samples": [
{
"category": "electronics",
"price": 449,
"description": "Bluetooth speaker with 360-degree sound and waterproof design",
"is_duplicate": false,
"similarity_to_seeds": 0.45,
"similarity_to_generated": 0.42
}
]
}
Each record contains:
- Sampled categorical fields (category)
- LLM-generated fields (price, description)
- Duplicate detection metadata:
  - similarity_to_seeds: highest similarity to any original seed sample
  - similarity_to_generated: highest similarity to other generated records
  - is_duplicate: true if either similarity exceeds the threshold
Note: Input configuration fields like samples, target_count, categorical_fields, etc. are NOT included in the output.
How It Works
Stage 1: StructureSampler (Statistical Skeleton Generation)
What it does:
- Analyzes sample data to learn categorical frequencies
- Computes numeric statistics (min, max, mean) for range hints
- Respects field dependencies (e.g., role depends on plan)
- Generates N skeletons respecting learned distributions
Example: If samples show “Free” plan 40% and “Pro” 30%, generated skeletons maintain these ratios.
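A minimal sketch of this distribution-preserving sampling, using only the standard library (the block's real implementation may differ):

```python
import random
from collections import Counter

# Toy samples: 40% Free, 30% Pro, 30% Enterprise
samples = [{"plan": "Free"}] * 4 + [{"plan": "Pro"}] * 3 + [{"plan": "Enterprise"}] * 3

# Learn categorical frequencies from the samples...
counts = Counter(s["plan"] for s in samples)
values, weights = zip(*counts.items())

# ...then draw skeletons with the same weights, so the 40% "Free"
# share in the samples stays roughly 40% in the generated set.
random.seed(0)
skeletons = [{"plan": random.choices(values, weights=weights)[0]} for _ in range(1000)]
generated = Counter(s["plan"] for s in skeletons)
```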
Output per skeleton:
{
"category": "electronics",
"_hints": {
"price_range": [199.0, 899.0],
"exemplars": [
{"category": "electronics", "price": 299, "description": "Wireless headphones"},
{"category": "electronics", "price": 899, "description": "13-inch laptop"}
]
}
}
Stage 2: SemanticInfiller (LLM-Powered Field Completion)
What it does:
- Receives skeleton with locked statistical fields
- Builds contextual prompt with numeric hints and exemplar examples
- Calls LLM to generate semantic fields (bio, description, etc.)
- Restores locked fields if LLM overwrites them
Prompt structure:
You are a data generator. Complete the following record skeleton.
Skeleton: {category: "electronics"}
Numeric hints:
- price should be between 199-899
Matching examples:
- {category: "electronics", price: 299, description: "Wireless headphones"}
Generate: ["description", "price"]
Return JSON: {"description": "...", "price": ...}
Locked fields behavior: Categorical fields sampled by StructureSampler (e.g., category) are preserved even if the LLM tries to modify them.
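The restore step can be pictured as a merge that gives the sampler's values the last word. A minimal sketch, assuming the function name and signature (they are illustrative, not the block's actual API):

```python
def restore_locked(skeleton: dict, llm_output: dict, locked: list[str]) -> dict:
    """Merge LLM output into the skeleton, then force locked fields
    back to the values the StructureSampler chose."""
    record = {**skeleton, **llm_output}
    for field in locked:
        if field in skeleton:
            record[field] = skeleton[field]
    return record

skeleton = {"category": "electronics"}
llm_output = {"category": "furniture", "description": "Bluetooth speaker", "price": 449}
record = restore_locked(skeleton, llm_output, locked=["category"])
# record["category"] stays "electronics" even though the LLM changed it
```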
Stage 3: DuplicateRemover (Similarity Filtering)
What it does:
- Extracts text from comparison fields
- Generates embeddings via embedding model
- Computes cosine similarity with cached embeddings
- Marks records as duplicates if similarity > threshold
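The similarity check above amounts to taking the maximum cosine similarity against each cache and comparing it to the threshold. A self-contained sketch (function names are illustrative; the real block uses an embedding model rather than hand-made vectors):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mark_duplicate(emb, seed_embs, gen_embs, threshold=0.85):
    # Highest similarity against each cache; 0.0 when a cache is empty
    sim_seeds = max((cosine(emb, e) for e in seed_embs), default=0.0)
    sim_gen = max((cosine(emb, e) for e in gen_embs), default=0.0)
    return {
        "similarity_to_seeds": sim_seeds,
        "similarity_to_generated": sim_gen,
        "is_duplicate": sim_seeds > threshold or sim_gen > threshold,
    }

result = mark_duplicate([1.0, 0.0], seed_embs=[[1.0, 0.0]], gen_embs=[[0.0, 1.0]])
# identical to a seed embedding, so the record is flagged as a duplicate
```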
Output:
{
"category": "electronics",
"price": 549,
"description": "Portable bluetooth speaker with waterproof design",
"is_duplicate": false,
"similarity_to_seeds": 0.72,
"similarity_to_generated": 0.45
}
Output fields:
- similarity_to_seeds: highest similarity to any original sample
- similarity_to_generated: highest similarity to previously generated records
- is_duplicate: true if either similarity exceeds the threshold
Note: DuplicateRemover degrades gracefully when no embedding model is available - it marks every record as is_duplicate: false with similarity scores of 0.0.
Use Cases
Perfect for:
- Expanding training datasets while maintaining patterns
- Creating realistic test data for applications
- Generating synthetic user profiles with distributions
- Data augmentation for ML training sets
- Privacy-preserving data generation (learn from real, generate synthetic)
Not ideal for:
- Time-series data (no temporal modeling)
- Graph/network data (no relationship modeling)
- Highly correlated numeric fields (limited correlation preservation)
Customization
Modify the template in lib/templates/data_augmentation.yaml:
Adjust generation count:
blocks:
- type: StructureSampler
config:
target_count: 100 # Generate 100 records
Change LLM creativity:
- type: SemanticInfiller
config:
temperature: 0.9 # Higher = more creative (0.7-0.9 recommended)
max_tokens: 300 # Longer outputs
Adjust duplicate threshold:
- type: DuplicateRemover
config:
similarity_threshold: 0.9 # Stricter (0.8-0.9 recommended)
Add more dependencies:
{
"dependencies": {
"role": ["plan"],
"storage": ["plan"]
}
}
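One way to picture how dependencies constrain sampling: a dependent field is drawn only from values observed alongside its parent. This is a simplified sketch under that assumption, not the template's actual algorithm:

```python
import random

# Toy samples where role depends on plan
samples = [
    {"plan": "Free", "role": "Viewer"},
    {"plan": "Pro", "role": "Editor"},
    {"plan": "Pro", "role": "Admin"},
]

# Build a conditional table: plan -> roles observed with that plan
roles_by_plan: dict[str, list[str]] = {}
for s in samples:
    roles_by_plan.setdefault(s["plan"], []).append(s["role"])

random.seed(1)
plan = random.choice([s["plan"] for s in samples])
role = random.choice(roles_by_plan[plan])  # role sampled conditionally on plan
# A "Free" skeleton can only get "Viewer"; a "Pro" one gets "Editor" or "Admin"
```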
Filtering Duplicates
Records marked as is_duplicate: true should be filtered post-generation:
Via API:
result = await pipeline.execute(seed_data)
generated = result.result.get("generated_samples", [])
unique_records = [r for r in generated if not r.get("is_duplicate")]
Via export (manual filter):
# Export all records
curl "http://localhost:8000/api/export?job_id=1" > output.jsonl
# Filter duplicates from generated_samples
jq '.generated_samples[] | select(.is_duplicate == false)' output.jsonl > unique.jsonl
Note: Keeping duplicates in the trace lets you adjust the threshold post-generation and analyze the similarity score distributions (similarity_to_seeds and similarity_to_generated).
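Because the similarity scores are kept on every record, a stricter or looser threshold can be re-applied without rerunning the pipeline. A minimal sketch (the helper name is illustrative):

```python
def apply_threshold(records, threshold):
    """Re-derive the duplicate decision from stored similarity scores."""
    return [
        r for r in records
        if max(r["similarity_to_seeds"], r["similarity_to_generated"]) <= threshold
    ]

records = [
    {"similarity_to_seeds": 0.45, "similarity_to_generated": 0.42},
    {"similarity_to_seeds": 0.91, "similarity_to_generated": 0.40},
]
kept = apply_threshold(records, 0.85)  # drops the second record
```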
Tuning Parameters
Quality vs Speed
High quality (slower):
target_count: 100
temperature: 0.9
max_tokens: 300
similarity_threshold: 0.9
Fast iteration (lower quality):
target_count: 20
temperature: 0.7
max_tokens: 150
similarity_threshold: 0.75
Diversity vs Fidelity
Preserve distributions (higher fidelity):
- Include all important categorical_fields
- Specify dependencies accurately
- Include numeric_fields with tight ranges
Increase diversity (creative generation):
- Omit some categorical_fields (the LLM generates them freely)
- Use a higher temperature (0.8-0.9)
- Lower similarity_threshold (0.75-0.8)
Common Issues
Low diversity (many duplicates)
Causes:
- Too few samples (<5)
- Temperature too low (<0.5)
- Fields too restrictive
Fixes:
- Add more diverse samples
- Increase temperature to 0.8-0.9
- Generate more semantic fields
- Increase similarity_threshold to 0.85-0.9
Unrealistic outputs
Causes:
- Dependencies not specified
- Numeric hints too broad
- Temperature too high (>0.95)
Fixes:
- Add dependencies config
- Provide numeric_fields for constraints
- Reduce temperature to 0.7-0.8
- Include exemplar samples matching target patterns
LLM errors (invalid JSON)
Causes:
- max_tokens too low (truncated JSON)
- Complex nested structures
Fixes:
- Increase max_tokens to 200-300
- Simplify fields (fewer nested objects)
- SemanticInfiller handles markdown wrappers automatically
Missing embeddings
Cause: Embedding model not configured
Behavior: DuplicateRemover marks all as is_duplicate: false
Fix: Configure default embedding model in Settings page
Example Workflow
Goal: Generate 100 synthetic user profiles
Step 1: Prepare samples (6 examples)
[
{"plan": "Free", "role": "Viewer", "storage": 1, "bio": "Student learning"},
{"plan": "Free", "role": "Viewer", "storage": 2, "bio": "Just exploring"},
{"plan": "Pro", "role": "Editor", "storage": 50, "bio": "Freelance designer"},
{"plan": "Pro", "role": "Editor", "storage": 75, "bio": "Agency owner"},
{"plan": "Pro", "role": "Admin", "storage": 100, "bio": "Team lead"},
{"plan": "Enterprise", "role": "Admin", "storage": 500, "bio": "CTO"}
]
Step 2: Create pipeline from template
curl -X POST http://localhost:8000/api/pipelines/from_template/data_augmentation \
-H "Content-Type: application/json" \
-d '{"name": "User Profile Augmentation"}'
Step 3: Start generation
curl -X POST http://localhost:8000/api/generate \
-F "file=@seed_data_augmentation.json" \
-F "pipeline_id=1"
Step 4: Monitor progress
# Poll job status
curl http://localhost:8000/api/jobs/1
Step 5: Review and export
# Export unique records only
curl "http://localhost:8000/api/export?job_id=1" | jq '.generated_samples[] | select(.is_duplicate == false)' > unique_users.jsonl
Result: 100 synthetic user profiles preserving original distributions
Tip: For large datasets, start with 20 records to verify quality before scaling up.
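After a run, it is worth checking that the exported records actually preserve the seed shares. A small sketch of such a check, assuming the records have already been loaded from the export (the helper name is illustrative):

```python
from collections import Counter

def plan_shares(records):
    """Fraction of records per plan, rounded to two decimals."""
    counts = Counter(r["plan"] for r in records)
    total = sum(counts.values())
    return {plan: round(n / total, 2) for plan, n in counts.items()}

# Seed shares were Free 2/6, Pro 3/6, Enterprise 1/6; a good run
# should land close to 0.33 / 0.50 / 0.17.
generated = (
    [{"plan": "Free"}] * 33 + [{"plan": "Pro"}] * 50 + [{"plan": "Enterprise"}] * 17
)
shares = plan_shares(generated)
```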
Related Documentation
- Templates Overview - All available templates
- How to Use - Running pipelines with templates
- Custom Blocks - Creating custom blocks and understanding multipliers