pinecone-full-text-search
Requires
The pinecone Python SDK ≥ 9.0 (`pip install "pinecone>=9.0"`). The FTS document-schema API lives under `pinecone.preview` and is incomplete or absent in earlier SDK builds. The packaged helper scripts pin `pinecone==9.0.0` via PEP 723 inline metadata; if you're writing your own code against this skill, pin v9 explicitly. The wire API version is `2026-01.alpha`.
Authoritative reference (last resort). If you hit a question this skill and its `references/*.md` files don't answer, the official Pinecone FTS docs are at https://docs.pinecone.io/guides/search/full-text-search. Prefer this skill's content for anything covered here — the docs may describe surfaces (e.g. the classic vector API) that don't apply to the document-schema FTS path. Consult the link only when you're genuinely stuck.
Tell the user up front: "This skill ships a helper at `scripts/ingest.py` that handles bulk ingestion safely (batched upsert, error inspection, readiness polling). When we get to the ingest step, I'll use it." Surface this at the start of the conversation so the user knows the helper exists. Queries, by contrast, are hand-written `documents.search(...)` calls per the Querying section below — there is no query helper.
A workflow skill for building a Pinecone full-text-search index with the preview API (pinecone.preview, API version 2026-01.alpha, public preview as of April 2026). Covers schema design (text, dense vector, sparse vector, filterable metadata), ingestion (including async indexing and polling), and query construction (text / query_string / dense_vector / sparse_vector scoring; $match_phrase / $match_all / $match_any text-match filters; $eq / $in / $gte / $exists / $and / $or / $not metadata filters).
Scope — this skill is for the document-schema FTS API only
This skill covers pc.preview.indexes.create(..., schema=...), pc.preview.index(name), idx.documents.upsert(...) / idx.documents.batch_upsert(...) / idx.documents.search(...). If you find yourself reaching for any of the following, stop — those are different Pinecone APIs and this skill's guidance and helpers won't apply:
- Classic vector / records API: `pc.Index(name)`, `index.upsert(vectors=[...])` / `index.upsert_records(...)`, `index.query(vector=..., sparse_vector=...)`, `index.search_records(...)`, `pc.create_index(...)` with `ServerlessSpec`, and the legacy `pinecone_text.sparse.BM25Encoder` for sparse-dense hybrid. For indexes WITHOUT a schema (raw vectors).
- Integrated-embedding indexes: `pc.create_index_for_model(...)` with `embed={...}`. Pinecone vectorizes text server-side. Different upsert/search shapes. Cannot be combined with `full_text_search` fields in the same index.
If the user already has a non-document-schema index, they can stand up a separate document-schema index alongside it — the two are independent — but you can't add FTS fields to a classic index after the fact.
Querying — construct documents.search(...) calls
For any task that asks you to query an FTS index, you write a documents.search(...) call directly. The schema is authoritative — describe the index live before constructing the call so you know which fields are FTS-enabled, which are filterable, and which are vectors.
Workflow:
- Discover the schema. Call `pc.preview.indexes.describe(<index>)` and read the `schema.fields` dict. Each field's class indicates its type (`PreviewStringField`, `PreviewIntegerField`, `PreviewDenseVectorField`, etc.); attributes tell you whether it's FTS-enabled (`full_text_search`), filterable, or carries a `dimension`. Skip this step only if you've already seen the schema in this conversation. A minimal discovery sketch follows this list.
- Construct the call matching the rules below — one scoring type per request, hard requirements in `filter`, ranking signals in `score_by`, `include_fields` explicit on every call.
- Execute with `idx = pc.preview.index(name=<index>); resp = idx.documents.search(...)` and read `resp.matches`.
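A minimal sketch of the discovery step, assuming the preview surfaces described in this skill (`pc.preview.indexes.describe` and the `schema.fields` dict); the exact attributes on each field object may vary by SDK build, so treat the attribute reads as illustrative:

```python
from pinecone import Pinecone

pc = Pinecone()  # reads PINECONE_API_KEY

# Describe the index and summarize each field so the query is built against the
# real schema rather than an assumed one. "my-fts-index" is a placeholder name.
desc = pc.preview.indexes.describe("my-fts-index")
for name, field in desc.schema.fields.items():
    kind = type(field).__name__                       # e.g. PreviewStringField
    fts = getattr(field, "full_text_search", None)    # set => BM25 / $match_* eligible
    filterable = getattr(field, "filterable", False)  # set => metadata-filter eligible
    dim = getattr(field, "dimension", None)           # set => dense vector field
    print(f"{name}: {kind} fts={bool(fts)} filterable={filterable} dim={dim}")
```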
Canonical shapes:
```python
# Pure BM25 keyword search
resp = idx.documents.search(
    namespace="__default__",
    top_k=10,
    score_by=[{"type": "text", "field": "body", "query": "machine learning"}],
    filter={"year": {"$gt": 2024}, "category": {"$eq": "ai"}},  # optional
    include_fields=["*"],  # always pass explicitly
)

# Hybrid: dense ranking with a lexical filter (one type in score_by + filter narrows)
resp = idx.documents.search(
    namespace="__default__",
    top_k=10,
    score_by=[{"type": "dense_vector", "field": "embedding", "values": query_embedding}],
    filter={"body": {"$match_all": "TensorFlow"}, "year": {"$gt": 2024}},
    include_fields=["*"],
)
```
Key rules (the server enforces these; following them locally keeps the agent loop tight):
- `score_by` is a list of clauses, but exactly one scoring type per request (the server rejects mixed types). Multi-field BM25 is the one exception: multiple `text` clauses, or one `query_string` with `fields: [...]`. To combine BM25 + dense signals, restrict the dense search with a text-match filter (`$match_all` / `$match_phrase` / `$match_any`); do NOT mix scoring types in `score_by`.
- `filter` keys are field names (must exist in the schema and be filterable) OR logical operators (`$and`, `$or`, `$not`). Field values are operator dicts (`{"$gt": 5}`, NOT bare values).
- `include_fields` is required on every call. Pass `["*"]` for all stored fields, `[]` for ids + score only, or a list of names. Some SDK builds return 400/422 if it's omitted.
Clause shapes (for score_by):

| type | Required keys | When to pick this |
|---|---|---|
| `text` | `field` (string FTS), `query` | Open-ended keyword search; BM25 ranking on one field |
| `query_string` | `query` (Lucene), `fields` optional | Lucene boost (`^N`), proximity (`~N`), cross-field boolean, phrase prefix |
| `dense_vector` | `field` (dense_vector), `values` (list of floats) | Semantic / mood / topic ranking |
| `sparse_vector` | `field` (sparse_vector), `sparse_values` (`{indices, values}`) | Custom sparse-encoder ranking |
text / dense_vector / sparse_vector use singular field. Only query_string accepts a fields array (and also accepts singular field as an alias). sparse_vector uses sparse_values (NOT values) — distinct from dense.
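Two clause shapes not shown in the canonical examples above, sketched with placeholder field names (an index with FTS `title` / `body` fields and a sparse-vector field named `sparse`):

```python
# Lucene query_string: cross-field boolean with a boost and a proximity clause
resp = idx.documents.search(
    namespace="__default__",
    top_k=10,
    score_by=[{
        "type": "query_string",
        "query": 'title:(eagle^3) OR body:("bald eagle"~5)',
        "fields": ["title", "body"],  # optional; query_string is the only type that takes a list
    }],
    include_fields=["*"],
)

# sparse_vector: ranking by a custom sparse encoding (note sparse_values, NOT values)
resp = idx.documents.search(
    namespace="__default__",
    top_k=10,
    score_by=[{
        "type": "sparse_vector",
        "field": "sparse",
        "sparse_values": {"indices": [102, 5870, 94012], "values": [0.4, 1.2, 0.7]},
    }],
    include_fields=[],
)
```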
Filter operators by field type:

| Field type | Legal operators |
|---|---|
| string with FTS | `$match_phrase`, `$match_all`, `$match_any` |
| string filterable | `$eq`, `$ne`, `$in`, `$nin`, `$exists` |
| string_list filterable | `$in`, `$nin`, `$exists` |
| float filterable | `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$exists` |
| boolean filterable | `$eq`, `$exists` |
| logical wrappers | `$and: [filters]`, `$or: [filters]`, `$not: filter` |
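A sketch of how these operators compose, assuming filterable `category`, `year`, and `in_stock` fields and an FTS `body` field (all placeholder names):

```python
# (category in {"ai", "ml"}) AND (year >= 2024) AND (body mentions "benchmark")
# AND NOT (in_stock == False)
filter = {
    "$and": [
        {"category": {"$in": ["ai", "ml"]}},
        {"year": {"$gte": 2024}},
        {"body": {"$match_all": "benchmark"}},
        {"$not": {"in_stock": {"$eq": False}}},
    ]
}
```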
Match shape on response:
```python
for m in resp.matches:
    m._id        # document id
    m._score     # match score (NOT `score`); some older SDK builds may also surface `score`
    m.to_dict()  # full doc payload (when include_fields includes the field)
```
For deeper coverage — multi-field BM25, Lucene patterns, hybrid composition, RRF merges, common error symptoms — see references/querying.md. For schema field types and what they enable on the query side, see references/schema-design.md.
Ingesting — use the packaged helper
For any task that asks you to bulk-ingest a JSONL file into an existing FTS index, the canonical path is to invoke the bundled helper, NOT to hand-write a Python script. Do not read the script's source — everything you need is in this section.
The script does three things bare-LLM ingest code reliably skips, each of which corresponds to a silent production failure:
- Bulk-upserts in batches. No per-doc `upsert` loops.
- Inspects every batch result. `batch_upsert` returns 202 even when individual documents fail; the failures live in `result.errors` / `result.has_errors`. Without inspection, "100 docs ingested" silently becomes "73 docs ingested + 27 lost."
- Polls until searchable. After upsert, Pinecone is still building the inverted index. A `documents.search` call during that window returns empty. Without the poll, the user debugs their query code for an hour without finding the indexing race.
You provide a prepared, schema-conformant JSONL file and the index name; the script does the rest. Schema validation is an upstream concern (your prep pipeline, or `prepare_documents.py` when it lands) — `ingest.py` trusts what you hand it.
Invocation:
```bash
uv run --script .claude/skills/pinecone-fts-index/scripts/ingest.py \
  --data processed.jsonl \
  --index <index_name> \
  --sentinel-field <fts_field>
```
Flags:
| Flag | Short | Required | Purpose |
|---|---|---|---|
| `--data` | `-d` | yes | Path to JSONL file with prepared documents (one per line) |
| `--index` | `-i` | yes | Pinecone index name (must already exist) |
| `--sentinel-field` | `-f` | yes | An FTS-enabled field on the index, used for the readiness-poll query. Pick the longest free-text field on your schema. |
| `--namespace` | `-n` | no | Default `__default__` |
| `--batch-size` | `-b` | no | Default 100. Reduce for large dense vectors. A 50-doc batch with 3072-dim float vectors lands ~5-10 MB and can be rejected; drop to `--batch-size 50` (or lower) at high dimensions. |
| `--poll-deadline` | — | no | Default 300 (seconds). Time to wait for documents to become searchable before giving up. |
| `--sentinel` | `-s` | no | Token used for the readiness-poll query. Default: first whitespace-separated token of `doc[0][sentinel-field]`. |
What the script prints:
```text
Loading processed.jsonl ...
Loaded 5000 document(s).
Sentinel: body='The'
Upserting in batches of 100 ...
batch @ 0: 100 docs in 0.42s (total: 100/5000)
batch @ 100: 100 docs in 0.39s (total: 200/5000)
...
Upsert complete: 5000 doc(s) in 21.4s.
Polling for searchability (deadline 300s) ...
Searchable after 12.3s (3 probe(s)).
Done — total 33.7s.
```
If a batch fails, the script prints every error message and exits non-zero. If the poll deadline expires, the script prints a hint about why (sentinel field isn't FTS-enabled, deadline too tight, docs structurally upserted but rejected by the inverted-index builder) and exits non-zero. Don't suppress these errors — they're surfacing real problems with the data or the index.
When you should NOT use the script:
- The user is doing per-doc patch updates (single-doc `documents.upsert` calls with selective fields). The script is for bulk loads, not per-record operations.
- The user is ingesting from a non-JSONL source (CSV, Parquet, Postgres dump). Convert to JSONL first; the script doesn't parse other formats.
- The user explicitly asks you to write the ingestion code from scratch (teaching context). Honor the request and follow the canonical pattern: `documents.batch_upsert` + `result.has_errors` inspection + `documents.search` polling with sentinel and deadline — see the sketch after this section.
The script lives at `.claude/skills/pinecone-fts-index/scripts/ingest.py`. It's a PEP 723 inline-metadata script — `uv run --script` installs `typer` and `pinecone` automatically on first invocation. No setup needed.
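If the user does ask for hand-written ingestion (third bullet above), a minimal sketch of the canonical pattern — batched `batch_upsert`, per-batch error inspection, then a sentinel poll. It assumes `docs` is a list of schema-conformant dicts, `body` is an FTS-enabled field, and the result attributes (`has_errors`, `errors`, `error_message`) are as described in the gotchas below; keyword names follow the upsert shape in the Quick template:

```python
import time

idx = pc.preview.index(name="my-fts-index")  # placeholder index name

# 1. Batched upsert with per-batch error inspection.
for start in range(0, len(docs), 100):
    batch = docs[start:start + 100]
    result = idx.documents.batch_upsert(namespace="__default__", documents=batch)
    if result.has_errors:
        for err in result.errors:
            print(f"batch @ {start}: {err.error_message}")
        raise SystemExit("aborting: batch_upsert reported document-level failures")

# 2. Poll until a sentinel query returns a hit — batch_upsert returning != searchable.
sentinel = docs[0]["body"].split()[0]
deadline = time.time() + 300
while time.time() < deadline:
    resp = idx.documents.search(
        namespace="__default__", top_k=1,
        score_by=[{"type": "text", "field": "body", "query": sentinel}],
        include_fields=[],
    )
    if resp.matches:
        break
    time.sleep(5)
else:
    raise SystemExit("documents never became searchable within the deadline")
```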
Use cases
Three concrete shapes to model your task on. Match the user's request to the closest one and follow its steps; improvise if the task is genuinely a hybrid.
UC-1: Index a new corpus end-to-end
Trigger. "Index this CSV / JSONL / folder for search," "build a search backend over [my articles / products / tickets / transcripts]," "make my [dataset] searchable."
For unprocessed / messy data, load the onboarding walkthrough first. If the user is showing up with raw data (unclear field types, possibly long text fields exceeding FTS limits, comma-separated tag strings, dates as strings, possibly duplicate IDs, etc.) and they haven't given you an explicit schema, read references/onboarding-walkthrough.md and follow it stage-by-stage. It's a conversational guide — meet the data, surface the processing decisions to the user, propose a schema, confirm before creating, then process+ingest+verify together. The walkthrough exists because schemas are immutable and "onboarding a new corpus" is a high-stakes flow that benefits from explicit user buy-in at each decision point.
If the user already gave you a clean JSONL + a schema spec, follow the abbreviated steps below.
Steps (when data is already prepared and the schema is decided):
- Inspect the corpus shape — text fields, structured metadata, do you also need a vector? Match it to one of the canonical shapes in `references/schema-design.md` (articles, products, tickets, image library, code).
- Pick analyzer settings on each text field — `language`, `stemming`, `stop_words`. Stemming on for long prose, off for proper nouns / identifiers.
- Assemble the schema with `SchemaBuilder` and confirm it with the user before calling `indexes.create` — schemas are immutable in `2026-01.alpha`, so a wrong call costs a re-ingest. (A schema sketch with explicit analyzer settings follows this use case.)
- Create the index, poll `describe()` until `status.ready: true`.
- Run `scripts/ingest.py --data <jsonl> --index <name> --sentinel-field <fts_field>` — see the Ingesting — use the packaged helper section above. The script handles `batch_upsert` + per-batch error inspection + post-upsert readiness polling in one invocation. Don't hand-write the loop unless the user explicitly asks you to.
- (The script polls automatically — by the time it exits cleanly, the index is searchable. If you skip the script and roll your own, you must poll `documents.search` with a sentinel query and a deadline; `batch_upsert` returning ≠ searchable.)
- Validate with one or two probe queries against fields you know contain the sentinel content.
Result. A working documents.search call against the user's data, returning ranked matches.
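A sketch of the analyzer-settings and schema-assembly steps for an article-style corpus, assuming the `SchemaBuilder` surface shown in the Quick template below; the field names and analyzer choices are illustrative, not prescriptive:

```python
from pinecone.preview import SchemaBuilder

schema = (
    SchemaBuilder()
    # Long prose: stemming helps recall, stop-word removal keeps postings lean.
    .add_string_field("body", full_text_search={"language": "en", "stemming": True, "stop_words": True})
    # Titles carry proper nouns / product names: leave stemming off (the default).
    .add_string_field("title", full_text_search={"language": "en"})
    # Exact-match metadata for filtering, not ranked search.
    .add_string_field("category", filterable=True)
    .add_integer_field("year", filterable=True)
    .build()
)
# Show this to the user and get explicit confirmation BEFORE indexes.create —
# schemas are immutable in 2026-01.alpha, so a wrong field costs a full re-ingest.
```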
UC-2: Add a dense (or sparse) signal to a text-only corpus
Trigger. "Add semantic search," "add embeddings," "make this hybrid," or any prompt that describes a query pattern text alone can't serve (visual similarity, mood, cross-modal "looks like").
Steps.
- Confirm the new signal represents a modality or signal text can't express — image / audio / external score, or a different corpus than the existing FTS field. Re-encoding the same text into a dense field is an anti-pattern (`references/schema-design.md` → "When to add a dense field at all").
- Because schemas are immutable, plan a new index, not a migration. Get user confirmation before recreating.
- Pick an embedding provider and pin its output dimension at schema time. Beware payload-size pitfalls at native dimensions — Gemini-3072 etc. need truncation (`references/ingestion.md` → "Dense-vector payload size").
- Schema → create → wait Ready → ingest with embeddings inline or pre-cached.
- Validate with a hybrid query: `dense_vector` score_by + text-match filter (`$match_phrase` / `$match_all`). That's the supported single-call cross-modal shape — see the sketch after this use case.
Result. One index, two retrieval shapes — pure text and dense+filter hybrid — both runnable without further setup.
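A sketch of the ingest and validation shapes once the dense field exists, assuming a 768-dim `embedding` field alongside an FTS `body` field, that dense values are passed as a plain list of floats under the field name, and that `embed(...)` is your own provider wrapper — all of these are placeholders, not part of this skill:

```python
# Ingest: dense values ride along in the document dict under the vector field's name.
idx.documents.upsert(
    namespace="__default__",
    documents=[{
        "_id": "doc-1",
        "body": "A field guide to bald eagles in the Pacific Northwest.",
        "embedding": embed("A field guide to bald eagles...", dim=768),  # your provider, truncated to the schema dimension
    }],
)

# Validate: semantic ranking, lexically pinned to documents that actually mention "eagle".
resp = idx.documents.search(
    namespace="__default__",
    top_k=5,
    score_by=[{"type": "dense_vector", "field": "embedding", "values": embed("birds of prey", dim=768)}],
    filter={"body": {"$match_all": "eagle"}},
    include_fields=["*"],
)
```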
UC-3: Build a documents.search call from a natural-language user prompt (agent mode)
Trigger. Agent receives a user prompt like "find articles about machine learning that mention TensorFlow and were published after 2024" or "documents about climate policy ranked by similarity to this paragraph." The index already exists.
Steps.
- (Optional) Discover the schema by calling `pc.preview.indexes.describe(<NAME>)` and reading `schema.fields`. Skip if you already know the field types from earlier in the conversation.
- Decompose the user's prompt into `score_by` / `filter` shapes using the agent-mode decomposition table below. (Hard requirements → `filter`. Ranking signals → `score_by`. Always include `include_fields` explicitly.)
- Construct the `documents.search(...)` call following the rules in the Querying section above — one scoring type per request, operator/field-type matching, `include_fields` always set.
- Execute the call. The response carries `resp.matches`; iterate to get `m._id`, `m._score`, and field values via `m.to_dict()`. Use the matches in whatever shape the user asked for.
- If results come back empty or wrong, walk the failure tree in Common gotchas.
Result. Live search results matching the user's intent.
The four common UC-3 mistakes to actively avoid (a wrong-vs-right sketch follows this list):
- Mixing scoring types in `score_by` (the server rejects it). Put hard requirements in `filter`; rank by one signal in `score_by`.
- Putting hard requirements in `score_by` as BM25 terms instead of in `filter` as `$match_all` / `$match_phrase` (returns ranked results that don't guarantee the term is present).
- Operator/field-type mismatches (e.g. `$match_all` on a float field, `$gt` on a string field). Consult the operator table in the Querying section.
- Omitting `include_fields` (some SDK builds return 400/422). Always pass it explicitly.
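A compact wrong-vs-right sketch of the first two mistakes, using the same placeholder field names as the canonical shapes above:

```python
# WRONG — mixes dense_vector and text in score_by; the server rejects mixed scoring types.
score_by = [
    {"type": "dense_vector", "field": "embedding", "values": query_embedding},
    {"type": "text", "field": "body", "query": "TensorFlow"},
]

# WRONG — "must mention TensorFlow" expressed as a ranking term; top results may not contain it at all.
score_by = [{"type": "text", "field": "body", "query": "machine learning TensorFlow"}]

# RIGHT — one scoring type ranks, the hard requirement lives in filter.
resp = idx.documents.search(
    namespace="__default__",
    top_k=10,
    score_by=[{"type": "dense_vector", "field": "embedding", "values": query_embedding}],
    filter={"body": {"$match_all": "TensorFlow"}, "year": {"$gt": 2024}},
    include_fields=["*"],
)
```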
Agent-mode query decomposition
Map user prompt cues to API shapes. Read top-down — identify the cue, copy the corresponding shape.
| User prompt cue | API shape |
|---|---|
| Open-ended keywords ("articles about machine learning", search-bar query) | score_by=[{"type": "text", "field": "<field>", "query": "<terms>"}] — BM25 token-OR |
| Exact phrase, drives ranking ("rank by 'beautifully written'") | score_by=[{"type": "query_string", "query": '<field>:("phrase here")'}] |
| Exact phrase, hard requirement ("must contain 'machine learning'") | filter={"<field>": {"$match_phrase": "machine learning"}} |
| Required tokens, any order ("must mention TensorFlow", "must be about Illinois") | filter={"<field>": {"$match_all": "tokens space-separated"}} — preferred over query_string +token because it's a true hard filter, doesn't contribute to score |
| At least one of these tokens ("contains AI or ML or robotics") | filter={"<field>": {"$match_any": "AI ML robotics"}} |
| Excluded tokens ("not about deprecated", "no opinion pieces") | filter={"$not": {"<field>": {"$match_any": "deprecated opinion"}}} — or -token inside query_string |
| Boolean / boost / slop / phrase-prefix ("weight 'eagle' 3x", "within N words") | score_by=[{"type": "query_string", "query": '<expr with ^N / ~N / "…"*>'}] — only Lucene supports these |
| Cross-field boolean ("title or body contains X") | score_by=[{"type": "query_string", "query": 'title:(X) OR body:(X)'}] |
| Numeric / date / range / boolean metadata ("after 2024", "rating > 4", "in stock") | filter={"<field>": {"$gt": ..., "$gte": ..., "$eq": ..., "$exists": true}} |
| Category / tag / list membership ("category = fiction", "tagged X") | filter={"<field>": {"$in": [...]}} (works on string and string_list filterable fields) |
| Semantic similarity / mood / topic ("articles about ML", "documents that feel sombre") | score_by=[{"type": "dense_vector", "field": "<embedding_field>", "values": embed(<text>)}] — requires a dense_vector field |
| Visual appearance / cross-modal text query against an image corpus | Same dense_vector shape, with the embedding model that produced the stored image vectors. Multimodal embedders (Gemini-2 etc.) map a text query into the image space. |
| Hybrid: lexical requirement + semantic ranking ("articles about ML that mention TensorFlow") | Lexical → filter ($match_all / $match_phrase); semantic → score_by (dense_vector). Single call. |
Two structural rules the agent must enforce, no exceptions:
- One scoring type per request. `score_by` accepts `text` / `query_string` / `dense_vector` / `sparse_vector`, but a request ranks by one. Don't mix dense + text in `score_by` — the server rejects it. Multi-field BM25 is the only "list" pattern that's allowed (multiple `text` clauses, or one cross-field `query_string`).
- Hybrid = filter + score_by, not two `score_by` clauses. When a prompt has both a lexical requirement and a semantic ranking signal, lexical goes in `filter` (via `$match_*` operators) and semantic goes in `score_by`. If both signals genuinely need to drive ranking, run two searches and merge IDs client-side — see the sketch below.
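When both signals genuinely need to drive ranking, a minimal client-side merge sketch — reciprocal-rank fusion is one reasonable choice, and the constant `k=60` is a common default rather than anything this API prescribes (see `references/querying.md` for the RRF discussion):

```python
def rrf_merge(result_lists, k=60, top_k=10):
    """Merge several ranked match lists by reciprocal-rank fusion on _id."""
    scores = {}
    for matches in result_lists:
        for rank, m in enumerate(matches):
            scores[m._id] = scores.get(m._id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

bm25 = idx.documents.search(
    namespace="__default__", top_k=50,
    score_by=[{"type": "text", "field": "body", "query": "machine learning TensorFlow"}],
    include_fields=[],
)
dense = idx.documents.search(
    namespace="__default__", top_k=50,
    score_by=[{"type": "dense_vector", "field": "embedding", "values": query_embedding}],
    include_fields=[],
)
merged_ids = rrf_merge([bm25.matches, dense.matches])
```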
Workflow at a glance
Three phases. Each has its own reference file — consult it before writing code for that phase.
- Design the schema. Decide which string fields are full-text-searchable, which are filterable metadata, whether you need a `dense_vector` field (and whether it earns its place), whether you also need a `sparse_vector` field, and which numeric / boolean / array filters to declare. Schemas are fixed at index creation in `2026-01.alpha` — plan carefully. → `references/schema-design.md`
- Ingest documents. For bulk loads from a prepared JSONL, run the bundled `scripts/ingest.py` helper (it does `batch_upsert` + error inspection + readiness polling correctly by construction — see the Ingesting — use the packaged helper section above). For per-doc patch updates, hand-call `documents.upsert`. Either way, documents are indexed asynchronously after the HTTP call returns; `batch_upsert` returning 202 ≠ searchable. → `references/ingestion.md` for the canonical pattern in detail.
- Query the index. A single search request ranks by one scoring type — pass exactly one of `text`, `query_string`, `dense_vector`, or `sparse_vector` in `score_by` (multi-field BM25 is supported via multiple `text` clauses or a cross-field `query_string`). Layer `filter={...}` for text-match (`$match_phrase` / `$match_all` / `$match_any`) and metadata filters (`$eq` / `$in` / `$gte` / `$exists` / `$and` / `$or` / `$not`). Control the response payload with `include_fields`. → `references/querying.md`
Quick template
End-to-end skeleton for a minimal text + filterable-metadata index. Copy it and edit every spot marked # TODO:. The template deliberately omits external embedding calls so it stays generic; see references/ingestion.md for dense / sparse field patterns and embedding-provider integration, and references/querying.md for the four scoring shapes plus text-match and metadata filters.
```python
import time

from pinecone import Pinecone
from pinecone.preview import SchemaBuilder

INDEX_NAME = "my-fts-index"  # TODO: name your index (lowercase alphanumeric + hyphens, ≤45 chars)
NAMESPACE = "__default__"    # TODO: pick a namespace; auto-created on first upsert

pc = Pinecone()  # reads PINECONE_API_KEY
# TODO: preprod backends require an x-environment header on the client:
# pc = Pinecone(additional_headers={"x-environment": "preprod-aws-0"})

# 1. Schema — one FTS string field, one filterable string, one filterable float.
#    Field names must NOT start with `_` (reserved for `_id` / `_score`) or `$`
#    (reserved for filter operators), and are limited to 64 bytes.
schema = (
    SchemaBuilder()
    .add_string_field("body", full_text_search={"language": "en"})  # TODO: rename for your content
    .add_string_field("category", filterable=True)                  # TODO: any exact-match metadata
    .add_integer_field("year", filterable=True)                     # TODO: any numeric filter — emits `"type": "float"` on the wire
    .build()
)

# 2. Create the index. read_capacity defaults to {"mode": "OnDemand"}; pass
#    {"mode": "Dedicated", ...} only if you specifically want provisioned reads.
if not pc.preview.indexes.exists(INDEX_NAME):
    pc.preview.indexes.create(name=INDEX_NAME, schema=schema)

# 3. Wait for the index itself to become Ready.
while not pc.preview.indexes.describe(INDEX_NAME).status.ready:
    time.sleep(5)

idx = pc.preview.index(name=INDEX_NAME)

# 4. Upsert a single document. `_id` is required, every other field is optional.
#    upsert REPLACES the document on conflict — there is no per-field merge in 2026-01.alpha.
idx.documents.upsert(
    namespace=NAMESPACE,
    documents=[{
        "_id": "doc-1",
        "body": "Full-text search is great for keyword queries.",
        "category": "intro",
        "year": 2025.0,
    }],
)

# 5. Poll until the FTS side is searchable (upsert returns BEFORE docs are indexed).
deadline = time.time() + 300
while time.time() < deadline:
    resp = idx.documents.search(
        namespace=NAMESPACE, top_k=1,
        score_by=[{"type": "text", "field": "body", "query": "search"}],  # TODO: sentinel query likely to hit
        include_fields=[],  # required on every search; [] = lightest payload (ids + _score only)
    )
    if resp.matches:
        break
    time.sleep(5)

# 6. Search — text scoring composed with metadata filter.
resp = idx.documents.search(
    namespace=NAMESPACE,
    top_k=5,
    score_by=[{"type": "text", "field": "body", "query": "keyword queries"}],
    filter={"year": {"$gte": 2024}},  # TODO: adjust filter or drop it
    include_fields=["*"],             # "*" = all stored fields; [] = `_id` + `_score` only
)

for m in resp.matches:
    print(m._id, getattr(m, "_score", getattr(m, "score", None)), m.to_dict())
```
Common gotchas
- One scoring type per search request. `score_by` accepts `text`, `query_string`, `dense_vector`, or `sparse_vector` — but a request ranks by one type. Multi-field BM25 is fine (pass several `text` clauses, or a single cross-field `query_string`). To combine BM25 ranking with a `dense_vector` (or `sparse_vector`) signal, restrict the dense search with a text-match `filter` operator (`$match_phrase` / `$match_all` / `$match_any`) on the lexical field, not by mixing types in `score_by`. The "blend a dense vector and a text clause in `score_by`" pattern is rejected by the server.
- Text-match filter operators are the cross-modal hinge. `$match_phrase` (exact phrase), `$match_all` (every token, any order), `$match_any` (at least one token) are filter-side operators on `full_text_search` fields. Each takes a single string (max 128 tokens). They reuse the field's tokenizer / stemmer, compose under `$and` / `$or` / `$not`, and are the supported way to compose lexical pre-filtering with dense or sparse ranking. Phrase slop (`"…"~N`), term boost (`^N`), and phrase prefix (`"… word"*`) are scoring-only — they live in `query_string`, not in `filter`.
- Preprod backends need `additional_headers={"x-environment": "..."}` on the `Pinecone()` client. Missing the header lands you on prod and you'll see "index not found" / empty-result symptoms that look like code bugs but aren't.
- `include_fields` is required on every `documents.search(...)` call. When omitted, it defaults to `[]` (`_id` + `_score` only). Pass `["*"]` for all stored fields or a list of names to project. Omitting it on some SDK builds yields 400/422 instead of the documented default; always pass it explicitly to avoid surprises.
- Match score is `_score`; doc id is `_id`. Public-preview docs return the system match score on the `_score` field so a user metadata field literally named `score` can coexist. Always prefer `_score` on read; some older SDK builds may still surface plain `score`, so for defensive code use `getattr(m, "_score", getattr(m, "score", None))`.
- Reserved field names: leading `_` and `$`, max 64 bytes. `_` is for system fields (`_id`, `_score`); `$` is for filter operators. Schema validation rejects names that violate either rule. The length cap is bytes, not characters — be careful with non-ASCII names.
- Vector-field cardinality: at most one `dense_vector` and at most one `sparse_vector` per index in `2026-01.alpha`. Multiple text fields are fine.
- `batch_upsert` failures are silent by default. The return value carries `has_errors`, `failed_batch_count`, and a list of `BatchError` objects with `error_message`. If you don't inspect them, you'll see "Uploaded 0 / N" and an indefinite "not yet indexed" poll — with the real cause (payload-too-large, schema mismatch, reserved field name) hidden. Always print `result.errors[*].error_message` before downstream steps.
- Dense-vector payload size matters at batch time. A 50-doc batch with 3072-dim float vectors lands around 5–10 MB and can be rejected by the preview backend. If every batch fails, try reducing the embedding dimension via your provider's truncation knob (e.g. Gemini's `output_dimensionality=768`) before debugging schema.
- Async indexing: `batch_upsert` returning ≠ searchable. The server builds inverted indexes in the background after the HTTP call returns. If you query immediately you'll see empty result sets. Always poll `documents.search` with a sentinel query and a deadline (pattern in `references/ingestion.md`).
- String FTS field shape is `full_text_search={...}` (dict). Pass `{}` to enable with all server defaults. User-settable sub-fields: `language`, `stemming`, `stop_words`. Server-applied (visible in `describe()` responses but NOT settable at index creation): `lowercase` (default `true`) and `max_token_length` (default 40). Stemming is opt-in (default `false`); `stop_words` is opt-in (default `false`, opposite of pre-public-preview docs). The earlier SDK shape `full_text_searchable=True, language="en"` is legacy and should be avoided.
- Schemas are fixed at index creation in `2026-01.alpha`. Adding, removing, or retyping fields after creation is not supported. Changing dimension or metric on an existing vector field requires a new index. Plan the schema once.
- No partial / per-field updates. `documents.upsert` always replaces the entire document for a given `_id`. To update one field, fetch the doc, modify it in client code, and upsert the full doc back under the same `_id` (see the sketch after this list).
- Document operations: search supports `filter`, fetch and delete do not. Fetch is ID-only (`POST /documents/fetch` with `ids: [...]`); delete accepts only `ids` or `delete_all: true`. To act on a metadata expression, search first to collect IDs, then fetch or delete those IDs.
- Namespaces auto-create on first upsert. Pass any namespace string to `documents.upsert` / `batch_upsert` and the namespace is created on the fly; documents from different namespaces are fully isolated. Use `"__default__"` if you don't need partitioning. Caveat: the namespace management endpoints (`POST /namespaces`, `GET /namespaces`, `DELETE /namespaces/{namespace}`) and `describe_index_stats` are NOT yet supported on indexes with document schemas — you can write to a namespace, you just can't list / delete them via the API yet.
- Document and request size limits (preview): per-document max 2 MB; per-request max 2 MB and 1000 documents; per FTS-enabled `string` field max 100 KB and 10,000 tokens (tokens > 256 bytes are truncated by the analyzer); per-document filterable metadata (everything not in an FTS field) max 40 KB. A schema can declare up to 100 FTS string fields. For long-prose corpora, chunk before ingest — see `references/ingestion.md`.
- `score_by` clause shape — singular `field` is canonical for `text` / `dense_vector` / `sparse_vector`; only `query_string` takes a `fields` array. `text`: `{"type": "text", "field": "<fts_field>", "query": "<terms>"}`. `query_string`: `{"type": "query_string", "query": "<lucene>", "fields": ["<a>", "<b>"]}` (the optional `fields` array; `query_string` also accepts a bare `"fields": "body"` string and the legacy `"field": "body"` as an alias). `dense_vector`: `{"type": "dense_vector", "field": "<dense_field>", "values": [/*floats*/]}`. `sparse_vector`: `{"type": "sparse_vector", "field": "<sparse_field>", "sparse_values": {"indices": [...], "values": [...]}}` — note `sparse_values` (NOT `values`) for sparse clauses.
- Single-term prefix wildcards aren't supported. `auto*` doesn't work in `query_string`; use phrase prefix (`"machine lea"*` — the phrase must contain at least two terms, the last term is matched as a prefix).
- Indexes can't be created in CMEK-enabled projects, no backup/restore, no fuzzy or regex search, no S3 bulk import for document-shaped indexes in `2026-01.alpha`. If any of these are hard requirements, the public-preview FTS surface isn't yet ready.
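A sketch of the no-partial-update workaround from the gotcha above, assuming a fetch surface shaped like `idx.documents.fetch(namespace=..., ids=[...])` that returns documents keyed by id — the fetch return shape here is an assumption, so verify it against your SDK build before relying on it:

```python
# 1. Fetch the full document (fetch is ID-only; no filter support).
fetched = idx.documents.fetch(namespace="__default__", ids=["doc-1"])
doc = fetched.documents["doc-1"].to_dict()  # assumed return shape — check your SDK build

# 2. Modify the one field client-side.
doc["category"] = "archived"

# 3. Upsert the WHOLE document back under the same _id — upsert replaces, never merges.
idx.documents.upsert(namespace="__default__", documents=[doc])
```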
Extension points
Currently shipped under scripts/:
- `scripts/ingest.py` — bulk-ingest a prepared JSONL into an existing FTS index. Handles `batch_upsert` in safe-sized chunks, inspects every batch's `result.errors` and aborts loudly on failure, then polls `documents.search` with a sentinel + deadline until docs are searchable. Schema-agnostic: takes only `--data`, `--index`, `--sentinel-field`. Usage in the Ingesting — use the packaged helper section above.
Query construction does NOT have a packaged helper — write documents.search(...) calls directly per the Querying section above.