unstructured-pdf-generation
Unstructured PDF Generation
Generate realistic synthetic PDF documents with an LLM for RAG (Retrieval-Augmented Generation) and other unstructured data use cases.
Overview
This skill uses the generate_pdf_documents MCP tool to create professional PDF documents with:
- LLM-generated content based on your description
- Accompanying JSON files with questions and evaluation guidelines (for RAG testing)
- Automatic upload to Unity Catalog Volumes
Quick Start
Use the generate_pdf_documents MCP tool:
catalog: "my_catalog"
schema: "my_schema"
description: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
count: 10
This generates 10 PDF documents and saves them to /Volumes/my_catalog/my_schema/raw_data/pdf_documents/ (using default volume and folder).
With Custom Location
Use the generate_pdf_documents MCP tool:
catalog: "my_catalog"
schema: "my_schema"
description: "HR policy documents..."
count: 10
volume: "custom_volume"
folder: "hr_policies"
overwrite_folder: true
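Both calls above write to a path of the form /Volumes/<catalog>/<schema>/<volume>/<folder>/. As a minimal sketch, the destination can be computed from the same parameters. The helper name `output_path` is hypothetical and not part of the tool; its defaults simply mirror the documented defaults for `volume` and `folder`.

```python
def output_path(catalog: str, schema: str,
                volume: str = "raw_data", folder: str = "pdf_documents") -> str:
    """Build the Unity Catalog Volume path where generated files are expected.

    Hypothetical helper for illustration only; the defaults mirror the
    documented defaults of the `volume` and `folder` parameters.
    """
    return f"/Volumes/{catalog}/{schema}/{volume}/{folder}/"


# Quick Start call: default volume and folder
default_path = output_path("my_catalog", "my_schema")

# Custom location call: explicit volume and folder
custom_path = output_path("my_catalog", "my_schema", "custom_volume", "hr_policies")

print(default_path)
print(custom_path)
```

Computing the path up front makes it easy to check the right folder after generation, especially when `overwrite_folder` is set to true.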
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| catalog | string | Yes | - | Unity Catalog name |
| schema | string | Yes | - | Schema name |
| description | string | Yes | - | Detailed description of what the PDFs should contain |
| count | int | Yes | - | Number of PDFs to generate |
| volume | string | No | raw_data | Volume name (created if it does not exist) |
| folder | string | No | pdf_documents | Folder within the volume for output files |
| doc_size | string | No | MEDIUM | Document size: SMALL (~1 page), MEDIUM (~5 pages), LARGE (~10+ pages) |
| overwrite_folder | bool | No | false | If true, deletes existing folder contents first |
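The parameter table can be mirrored in code to catch mistakes before the tool is called. The sketch below is illustrative only: the `PdfRequest` class is hypothetical, and the validation rules (non-empty strings, a count of at least 1) are assumptions rather than documented behavior of the tool.

```python
from dataclasses import dataclass, asdict

VALID_DOC_SIZES = {"SMALL", "MEDIUM", "LARGE"}


@dataclass
class PdfRequest:
    """Hypothetical container for generate_pdf_documents arguments."""
    catalog: str
    schema: str
    description: str
    count: int
    volume: str = "raw_data"
    folder: str = "pdf_documents"
    doc_size: str = "MEDIUM"
    overwrite_folder: bool = False

    def __post_init__(self) -> None:
        # Required string parameters must be non-empty.
        for name in ("catalog", "schema", "description"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} is required")
        # Assumption: a count below 1 is never meaningful.
        if self.count < 1:
            raise ValueError("count must be at least 1")
        if self.doc_size not in VALID_DOC_SIZES:
            raise ValueError(f"doc_size must be one of {sorted(VALID_DOC_SIZES)}")


request = PdfRequest(
    catalog="my_catalog",
    schema="my_schema",
    description="HR policy documents for a technology company.",
    count=10,
)
arguments = asdict(request)  # plain dict holding every parameter, defaults included
print(arguments)
```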
Document Size Guide
- SMALL: ~1 page, concise content. Best for quick demos or testing.
- MEDIUM: ~4-6 pages, comprehensive coverage. Good balance for most use cases.
- LARGE: ~10+ pages, exhaustive documentation. Use for thorough RAG evaluation.
Output Files
For each document, the tool creates two files:
- PDF file (<model_id>.pdf): The generated document
- JSON file (<model_id>.json): Metadata for RAG evaluation
JSON Structure
{
"title": "API Authentication Guide",
"category": "Technical",
"pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
"question": "What authentication methods are supported by the API?",
"guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}
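Before feeding these files into an evaluation run, it can help to confirm that each one carries the five fields shown above. The following is a minimal sketch; `parse_record` is a hypothetical helper, and treating all five fields as mandatory is an assumption based on the example.

```python
import json

# Field names taken from the example JSON structure.
EXPECTED_FIELDS = ("title", "category", "pdf_path", "question", "guideline")


def parse_record(text: str) -> dict:
    """Parse one metadata file's content and check that the expected fields exist."""
    record = json.loads(text)
    missing = [field for field in EXPECTED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


sample = """
{
  "title": "API Authentication Guide",
  "category": "Technical",
  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
  "question": "What authentication methods are supported by the API?",
  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}
"""

record = parse_record(sample)
print(record["question"])
```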
Common Patterns
Pattern 1: HR Policy Documents
Use the generate_pdf_documents MCP tool:
catalog: "ai_dev_kit"
schema: "hr_demo"
description: "HR policy documents for a technology company including employee handbook, leave policies, performance review procedures, benefits guide, and workplace conduct guidelines."
count: 15
folder: "hr_policies"
overwrite_folder: true
Pattern 2: Technical Documentation
Use the generate_pdf_documents MCP tool:
catalog: "ai_dev_kit"
schema: "tech_docs"
description: "Technical documentation for a SaaS analytics platform including installation guides, API references, troubleshooting procedures, security best practices, and integration tutorials."
count: 20
folder: "product_docs"
overwrite_folder: true
Pattern 3: Financial Reports
Use the generate_pdf_documents MCP tool:
catalog: "ai_dev_kit"
schema: "finance_demo"
description: "Financial documents for a retail company including quarterly reports, expense policies, budget guidelines, and audit procedures."
count: 12
folder: "reports"
overwrite_folder: true
Pattern 4: Training Materials
Use the generate_pdf_documents MCP tool:
catalog: "ai_dev_kit"
schema: "training"
description: "Training materials for new software developers including onboarding guides, coding standards, code review procedures, and deployment workflows."
count: 8
folder: "courses"
overwrite_folder: true
Workflow
- Ask for destination: Default to the ai_dev_kit catalog, ask user for schema name
- Get description: Ask what kind of documents they need
- Generate PDFs: Call the generate_pdf_documents MCP tool with appropriate parameters
- Verify output: Check the volume path for generated files
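For the verification step, one simple check is that every PDF has a matching JSON file and vice versa. The sketch below assumes the output folder is reachable as an ordinary filesystem path (as Unity Catalog Volumes typically are from Databricks compute); `unpaired_files` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path


def unpaired_files(folder: str) -> dict:
    """Return PDF and JSON file stems that lack their counterpart in `folder`."""
    root = Path(folder)
    pdf_stems = {p.stem for p in root.glob("*.pdf")}
    json_stems = {p.stem for p in root.glob("*.json")}
    return {
        "pdf_without_json": sorted(pdf_stems - json_stems),
        "json_without_pdf": sorted(json_stems - pdf_stems),
    }


# Example call (path is illustrative):
# report = unpaired_files("/Volumes/my_catalog/my_schema/raw_data/pdf_documents")
# Both lists in the report are empty when every document has its pair.
```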
Best Practices
- Detailed descriptions: The more specific your description, the better the generated content
  - BAD: "Generate some HR documents"
  - GOOD: "HR policy documents for a technology company including employee handbook covering remote work policies, leave policies with PTO and sick leave details, performance review procedures with quarterly and annual cycles, and workplace conduct guidelines"
- Appropriate count:
  - For demos: 5-10 documents
  - For RAG testing: 15-30 documents
  - For comprehensive evaluation: 50+ documents
- Folder organization: Use descriptive folder names that indicate content type
  - hr_policies/
  - technical_docs/
  - training_materials/
- Use overwrite_folder: Set to true when regenerating to ensure a clean state
Integration with RAG Pipelines
The generated JSON files are designed for RAG evaluation:
- Ingest PDFs: Use the PDF files as source documents for your vector database
- Test retrieval: Use the question field to query your RAG system
- Evaluate answers: Use the guideline field to assess whether the RAG response is correct
Example evaluation workflow:
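import glob
import json

# Load one question/guideline record from each generated JSON file
paths = glob.glob(f"/Volumes/{catalog}/{schema}/{volume}/{folder}/*.json")
for path in paths:
    with open(path) as f:
        q = json.load(f)
    # Query the RAG system (rag_system is a placeholder for your own client)
    response = rag_system.query(q["question"])
    # Evaluate using the guideline (evaluate_response is a placeholder for your own judge)
    is_correct = evaluate_response(response, q["guideline"])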
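The loop above leaves the RAG client and the judging function undefined. The sketch below shows one way to wrap the same loop so that both are passed in and the results are summarized. Everything here is an assumption for illustration: `run_evaluation`, `fake_query`, and `fake_judge` are hypothetical names, and the keyword check is a naive stand-in for a real judge such as an LLM-based grader.

```python
from typing import Callable, Iterable


def run_evaluation(records: Iterable[dict],
                   query_fn: Callable[[str], str],
                   judge_fn: Callable[[str, str], bool]) -> dict:
    """Ask each question, judge each answer, and summarize the results."""
    results = []
    for record in records:
        answer = query_fn(record["question"])
        passed = judge_fn(answer, record["guideline"])
        results.append({"title": record.get("title"), "passed": passed})
    total = len(results)
    correct = sum(r["passed"] for r in results)
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,
        "results": results,
    }


# Stand-ins for illustration only: replace with a real RAG client and judge.
def fake_query(question: str) -> str:
    return "The API supports OAuth 2.0, API keys, and JWT tokens."


def fake_judge(answer: str, guideline: str) -> bool:
    # Naive placeholder: passes whenever the answer mentions OAuth.
    return "OAuth" in answer


summary = run_evaluation(
    [{
        "title": "API Authentication Guide",
        "question": "What authentication methods are supported by the API?",
        "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases.",
    }],
    fake_query,
    fake_judge,
)
print(summary["accuracy"])
```

Passing the query and judge functions as arguments keeps the loop independent of any particular RAG stack, so the same records can be replayed against different systems.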
Environment Configuration
The tool requires LLM configuration via environment variables:
# Databricks Foundation Models (default)
LLM_PROVIDER=DATABRICKS
DATABRICKS_MODEL=databricks-meta-llama-3-3-70b-instruct
# Or Azure OpenAI
LLM_PROVIDER=AZURE
AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
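A quick pre-flight check can surface missing variables before a generation run starts. This is a sketch under stated assumptions: `missing_llm_vars` is a hypothetical helper, the set of variables required per provider is inferred from the example configuration above rather than from the tool itself, and DATABRICKS is treated as the provider when LLM_PROVIDER is unset because it is marked as the default.

```python
from typing import Mapping

# Assumption: required variables per provider, inferred from the example
# configuration rather than from the tool's source.
REQUIRED_VARS = {
    "DATABRICKS": ("DATABRICKS_MODEL",),
    "AZURE": (
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_DEPLOYMENT",
    ),
}


def missing_llm_vars(env: Mapping[str, str]) -> list:
    """Return the names of unset or empty variables for the selected provider."""
    provider = env.get("LLM_PROVIDER", "DATABRICKS").upper()
    if provider not in REQUIRED_VARS:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider}")
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


# Example with an incomplete Azure configuration (values are placeholders).
example_env = {"LLM_PROVIDER": "AZURE", "AZURE_OPENAI_API_KEY": "your-api-key"}
missing = missing_llm_vars(example_env)
print("Missing LLM configuration:", ", ".join(missing))
```

In practice the function would be called with `os.environ` in place of `example_env`.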
Common Issues
| Issue | Solution |
|---|---|
| "No LLM endpoint configured" | Set DATABRICKS_MODEL or AZURE_OPENAI_DEPLOYMENT environment variable |
| "Volume does not exist" | The tool creates volumes automatically; ensure you have CREATE VOLUME permission |
| "PDF generation timeout" | Reduce count or check LLM endpoint availability |
| Low quality content | Provide more detailed description with specific topics and document types |