mixedbread-parsing
Mixedbread Parsing
Parse documents, extract structured content, and run OCR using the Parsing API. Supports PDFs, Word documents, PowerPoint presentations, and images.
Docs: https://www.mixedbread.com/docs/parsing/overview.md
Agent-readable docs: https://www.mixedbread.com/docs/llms.txt
Latest docs search: https://www.mixedbread.com/question?q=parsing&section=docs
Setup
pip install mixedbread # Python
npm install @mixedbread/sdk # TypeScript
export MXBAI_API_KEY=your_api_key
Quick Start
Python:
from mixedbread import Mixedbread
mxbai = Mixedbread()
# Upload and parse a document (waits for completion)
with open("report.pdf", "rb") as f:
    job = mxbai.parsing.jobs.upload_and_poll(
        file=f,
        return_format="markdown",
    )
for chunk in job.result.chunks:
    print(chunk.content)
TypeScript:
import Mixedbread from '@mixedbread/sdk';
import fs from 'fs';
const mxbai = new Mixedbread();
const job = await mxbai.parsing.jobs.uploadAndPoll(
  fs.createReadStream('report.pdf'),
  { return_format: 'markdown' },
);
for (const chunk of job.result.chunks) {
  console.log(chunk.content);
}
Decision Tree
- Which convenience method?
  - File on disk → upload_and_poll() (uploads + creates job + polls)
  - File already uploaded via Files API → create_and_poll() (creates job + polls)
  - Need async control → upload() or create(), then poll() separately
- Which parsing mode?
  - Born-digital PDF (selectable text) → fast mode. Fastest, lowest cost. Extracts text, structure, and layout.
  - Scanned document, image, or complex layout → high_quality mode. Uses OCR. Extracts text with confidence scores, handles rotated/skewed pages, multi-column layouts.
- Need specific elements only? → Set element_types to reduce processing time
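The mode and element choices in the tree above can be sketched as a small helper that builds keyword arguments for a parsing job. The function name choose_parse_options is hypothetical, and the string "fast" is assumed from the mode's name; this document only shows mode="high_quality" spelled out literally.

```python
def choose_parse_options(scanned_or_complex, element_types=None):
    """Map the decision tree to keyword arguments for a parsing job.

    scanned_or_complex: True for scans, images, or complex layouts.
    element_types: optional iterable of element types to request.
    """
    # "fast" is an assumed literal; "high_quality" appears in this document.
    options = {"mode": "high_quality" if scanned_or_complex else "fast"}
    if element_types:
        # Only restrict elements when the caller asked for specific ones.
        options["element_types"] = list(element_types)
    return options
```

The resulting dictionary could then be unpacked into a call such as upload_and_poll(file=f, **options).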
Supported File Types
PDF (.pdf), Word (.doc, .docx, .dotx, .docm, .dotm, .odt, .rtf), Slides (.ppt, .pptx, .ppsx, .pptm, .potm, .ppsm, .odp), Images (.jpeg, .png, .webp, .avif).
Element types: text, title, section-header, header, footer, page-number, list-item, figure, picture, table, form, footnote, caption, formula.
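A quick local check against the extension list above can catch unsupported files before any upload. This is a minimal sketch: is_supported is a hypothetical helper, and it looks only at the file extension, not at the file's actual contents.

```python
from pathlib import Path

# Extensions copied from the "Supported File Types" list above.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".doc", ".docx", ".dotx", ".docm", ".dotm", ".odt", ".rtf",
    ".ppt", ".pptx", ".ppsx", ".pptm", ".potm", ".ppsm", ".odp",
    ".jpeg", ".png", ".webp", ".avif",
}


def is_supported(path):
    """Return True if the file extension appears in the supported list."""
    # Lowercase the suffix so "REPORT.PDF" and "report.pdf" match alike.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Files that fail the check can be converted to a supported format first or skipped.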
Workflows
Extract Tables from Documents
Filter for table elements to pull structured data from reports.
Python:
# Request only table elements; high_quality mode uses OCR and reports confidence
with open("financial-report.pdf", "rb") as f:
    job = mxbai.parsing.jobs.upload_and_poll(
        file=f,
        element_types=["table"],
        return_format="html",
        mode="high_quality",
    )
for chunk in job.result.chunks:
    for element in chunk.elements:
        if element.type == "table":
            print(f"Page {element.page}, confidence {element.confidence:.2f}")
            print(element.content)
TypeScript:
const job = await mxbai.parsing.jobs.uploadAndPoll(
  fs.createReadStream('financial-report.pdf'),
  { element_types: ['table'], return_format: 'html', mode: 'high_quality' },
);
for (const chunk of job.result.chunks) {
  for (const element of chunk.elements) {
    if (element.type === 'table') {
      console.log(`Page ${element.page}, confidence ${element.confidence.toFixed(2)}`);
      console.log(element.content);
    }
  }
}
Batch Parse Multiple Files
Upload multiple files asynchronously, then poll all jobs:
Python:
import os
jobs = []
for filename in os.listdir("./documents"):
    if filename.endswith(".pdf"):
        # upload() creates the job without waiting for parsing to finish
        with open(os.path.join("./documents", filename), "rb") as f:
            job = mxbai.parsing.jobs.upload(
                file=f,
                return_format="markdown",
            )
        jobs.append(job)
# Poll all jobs
for job in jobs:
    completed = mxbai.parsing.jobs.poll(job_id=job.id)
    print(f"{completed.filename}: {len(completed.result.chunks)} chunks")
TypeScript:
import { readdirSync, createReadStream } from 'fs';
import path from 'path';
const files = readdirSync('./documents').filter(f => f.endsWith('.pdf'));
const jobs = await Promise.all(
  files.map(f => mxbai.parsing.jobs.upload(
    createReadStream(path.join('./documents', f)),
    { return_format: 'markdown' },
  )),
);
// Poll all jobs
for (const job of jobs) {
  const completed = await mxbai.parsing.jobs.poll(job.id);
  console.log(`${completed.filename}: ${completed.result.chunks.length} chunks`);
}
Rules
CRITICAL
- Don't double-parse. Store uploads auto-parse documents. Files uploaded with parsing_strategy: "high_quality" automatically get OCR text (images), summaries (images), and transcriptions (audio & video) extracted. These are available as fields on search result chunks. There is no benefit to also running the Parsing API on the same file. Use the Parsing API only for standalone document extraction outside of stores.
- Use upload_and_poll() / create_and_poll() instead of manual polling loops. These methods handle backoff automatically. Manual while loops with retrieve() are fragile and waste API calls.
HIGH
- Specify element_types when you only need certain elements. Requesting all types increases processing time and response size. If you only need tables, set element_types to table only.
- Use fast mode for born-digital PDFs. The high_quality mode adds OCR overhead that provides no benefit when text is already selectable.
- Check confidence scores on OCR output. Low-confidence elements (< 0.5) may contain errors. Filter or flag them.
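The confidence rule above could be applied with a filter along these lines. This is a sketch under assumptions: split_by_confidence is a hypothetical helper, the demo objects are stand-ins shaped like job.result.chunks rather than real SDK types, and treating a missing score as trusted is a choice made here, not documented behavior.

```python
from types import SimpleNamespace

LOW_CONFIDENCE = 0.5  # threshold taken from the rule above


def split_by_confidence(chunks, threshold=LOW_CONFIDENCE):
    """Separate elements into (trusted, flagged) lists by confidence score."""
    trusted, flagged = [], []
    for chunk in chunks:
        for element in chunk.elements:
            # Assumption: an element without a score is kept as trusted.
            score = getattr(element, "confidence", None)
            if score is not None and score < threshold:
                flagged.append(element)
            else:
                trusted.append(element)
    return trusted, flagged


# Stand-in objects for illustration only, not real API responses.
demo_chunks = [
    SimpleNamespace(elements=[
        SimpleNamespace(type="table", confidence=0.92, content="<table>...</table>"),
        SimpleNamespace(type="text", confidence=0.31, content="Tota1 revnue"),
    ]),
]
trusted, flagged = split_by_confidence(demo_chunks)
```

Flagged elements can be routed to manual review instead of being dropped silently.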
MEDIUM
- Check job.error before retrying failed jobs. Common causes: unsupported file type, corrupt file, file too large. Blindly retrying wastes quota.
- Use content_to_embed for embedding pipelines. Each chunk provides both content (full text) and content_to_embed (optimized for embedding). Use the latter when feeding into vector stores outside Mixedbread.
- Verify file format before parsing. Only PDF, Word, PowerPoint, and images are supported. Convert other formats first.
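The first two rules above can be combined into one step that surfaces job.error before any text is handed to an embedding pipeline. The helper name texts_for_embedding is hypothetical, and the demo job is a stand-in object, so the exact shape of a real job's error field is an assumption.

```python
from types import SimpleNamespace


def texts_for_embedding(job):
    """Return content_to_embed for each chunk, or raise if the job failed."""
    # Assumption: a failed job carries a truthy job.error value.
    error = getattr(job, "error", None)
    if error:
        raise RuntimeError(f"Parsing job failed, inspect before retrying: {error}")
    return [chunk.content_to_embed for chunk in job.result.chunks]


# Stand-in job for illustration only, not a real API response.
ok_job = SimpleNamespace(
    error=None,
    result=SimpleNamespace(chunks=[
        SimpleNamespace(
            content="# Q3 Report\nRevenue grew.",
            content_to_embed="Q3 Report. Revenue grew.",
        ),
    ]),
)
texts = texts_for_embedding(ok_job)
```

Raising instead of retrying keeps the failure reason visible, so unsupported or corrupt files are not resubmitted.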
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Job stuck in pending | Queue is busy | Use poll() with a longer poll_timeout_ms. Check job status with retrieve(). |
| Job status failed | Unsupported file type, corrupt file, or file too large | Check job.error for details. Verify file format is supported. |
| Empty chunks in result | File has no extractable content (blank pages) | Verify the file has content. Try high_quality mode for scanned documents. |
| Low confidence scores | Scanned or low-resolution source | Use high_quality mode for better OCR accuracy. |
| Missing tables or figures | Element types not requested | Set element_types to include table and figure explicitly. |
| upload_and_poll() timeout | Very large document or slow processing | Increase poll_timeout_ms, or use upload() + poll() separately for more control. |
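The last row's fix, splitting upload() and poll(), might look like the sketch below. It assumes poll() accepts poll_timeout_ms as a keyword argument, as the table suggests; the helper name parse_with_long_timeout and the 600,000 ms default are illustrative, not values from the SDK.

```python
def parse_with_long_timeout(client, path, poll_timeout_ms=600_000):
    """Upload a file, then poll separately with a longer timeout.

    client is a Mixedbread client instance. poll_timeout_ms as a keyword
    of poll() is an assumption based on the troubleshooting table.
    """
    with open(path, "rb") as f:
        # upload() creates the job without waiting for parsing to finish.
        job = client.parsing.jobs.upload(file=f, return_format="markdown")
    # job.id is available here and could be stored to resume polling later.
    return client.parsing.jobs.poll(job_id=job.id, poll_timeout_ms=poll_timeout_ms)
```

Keeping the two steps separate means a timeout during polling does not require uploading the file again.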