byted-bytehouse-multimodal-search
ByteHouse Multimodal Search Skill
🚀 Quick Start
Environment Setup
    pip install clickhouse-connect "volcengine-python-sdk[ark]" numpy
Environment Variables
Read configuration from environment variables first. Never hard-code sensitive values in plaintext:
    # ByteHouse configuration
    export BYTEHOUSE_HOST="<your ByteHouse connection address>"
    export BYTEHOUSE_PORT="<ByteHouse port>"
    export BYTEHOUSE_USER="<ByteHouse username>"
    export BYTEHOUSE_PASSWORD="<ByteHouse password>"
    export BYTEHOUSE_DATABASE="<default database; optional, defaults to default>"
    export BYTEHOUSE_SECURE="<whether to enable encryption; optional, defaults to true>"
    # Volcengine Ark API configuration
    export ARK_API_KEY="<Volcengine Ark API key>"
    export ARK_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
    export EMBEDDING_MODEL="doubao-embedding-vision-251215"
    export EMBEDDING_DIMENSIONS="1536"  # optional, defaults to 1536
If an environment variable is not set, the user is prompted to enter it.
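The read-then-prompt fallback can be sketched as follows. This is a minimal sketch under assumptions, not the skill's actual implementation: `load_config` and its `prompt` parameter are hypothetical names, the port is assumed to be an integer, and the returned keys are chosen to match keyword arguments accepted by `clickhouse_connect.get_client`.

```python
import os
from getpass import getpass

# Required variables have no default; optional ones fall back to the
# defaults stated in the configuration listing.
REQUIRED = ["BYTEHOUSE_HOST", "BYTEHOUSE_PORT", "BYTEHOUSE_USER", "BYTEHOUSE_PASSWORD"]
OPTIONAL = {"BYTEHOUSE_DATABASE": "default", "BYTEHOUSE_SECURE": "true"}


def load_config(env=None, prompt=getpass):
    """Read ByteHouse settings from `env`, prompting for missing required values.

    getpass is used for every prompt so the password is never echoed; a real
    implementation might use input() for the non-secret fields.
    """
    env = os.environ if env is None else env
    values = {}
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            value = prompt(f"{name}: ")
        values[name] = value
    for name, default in OPTIONAL.items():
        values[name] = env.get(name) or default
    return {
        "host": values["BYTEHOUSE_HOST"],
        "port": int(values["BYTEHOUSE_PORT"]),
        "username": values["BYTEHOUSE_USER"],
        "password": values["BYTEHOUSE_PASSWORD"],
        "database": values["BYTEHOUSE_DATABASE"],
        "secure": values["BYTEHOUSE_SECURE"].strip().lower() in ("1", "true", "yes"),
    }
```

The resulting dictionary could then be passed as `clickhouse_connect.get_client(**load_config())`, so that no secret appears in source code.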
📚 Core Capabilities
1. Multimodal Embedding
Built on the Doubao multimodal embedding model doubao-embedding-vision-251215:
| Input type | Supported formats | Limits |
|---|---|---|
| Text | Plain text string | No length limit |
| Image | JPG/PNG/GIF/WEBP/BMP | <10 MB, width and height >14 px |
| Video | MP4/AVI/MOV | <50 MB |
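The limits in the table can be checked locally before a file is sent for embedding. A minimal sketch, assuming "MB" means 1024 × 1024 bytes and that the format is judged by file extension; `check_media` is a hypothetical helper, and the image width/height check is left out because it would need an image library.

```python
import os

MB = 1024 * 1024  # assumption: the table's "MB" is read as 1024 * 1024 bytes

# Extension sets and size limits taken from the table.
# ".jpeg" is treated as JPG (an assumption about naming, not a new format).
LIMITS = {
    "image": ({".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}, 10 * MB),
    "video": ({".mp4", ".avi", ".mov"}, 50 * MB),
}


def check_media(kind, filename, size_bytes):
    """Return a list of problems; an empty list means the file passes."""
    extensions, max_bytes = LIMITS[kind]
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in extensions:
        problems.append(f"unsupported {kind} format: {ext or '(none)'}")
    if size_bytes >= max_bytes:  # the table states a strict "<" limit
        problems.append(f"{kind} must be smaller than {max_bytes // MB} MB")
    return problems
```

Returning a list rather than raising lets a caller report every problem with a file at once.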
Key constraints:
- Multimodal embedding must call the `/embeddings/multimodal` endpoint
- Image/video input format: `{"type": "image_url", "image_url": {"url": "xxx"}}`
- Some models do not support the `dimensions` parameter
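A request body that respects these constraints might be assembled as below. This is a sketch under assumptions: the `image_url` item shape and the `/embeddings/multimodal` path come from the text, while the `{"type": "text", "text": ...}` item, the top-level `model` and `input` fields, and the helper name `build_request` are assumptions to verify against the Ark API documentation. No network call is made.

```python
import json

ARK_BASE_URL = "https://ark.cn-beijing.volces.com/api/v3"
EMBEDDING_MODEL = "doubao-embedding-vision-251215"


def build_request(text=None, image_url=None, dimensions=None):
    """Return (url, JSON body) for a multimodal embedding request."""
    items = []
    if text is not None:
        # Assumed shape for a text item; not stated in this document.
        items.append({"type": "text", "text": text})
    if image_url is not None:
        # Shape given in the key constraints.
        items.append({"type": "image_url", "image_url": {"url": image_url}})
    if not items:
        raise ValueError("provide text, image_url, or both")
    body = {"model": EMBEDDING_MODEL, "input": items}
    if dimensions is not None:
        # Only sent on request, because some models reject this parameter.
        body["dimensions"] = dimensions
    return ARK_BASE_URL + "/embeddings/multimodal", json.dumps(body)
```

Leaving `dimensions` out by default keeps the request valid for models that do not accept it.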
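2. Vector Search Functions
| Function | Method | Description |
|---|---|---|
| Pure vector search | vector_search() | Retrieves by vector similarity |
| Hybrid search | hybrid_search() | Fuses vector search with full-text search |
| Text-to-image search | text_search_image() | Finds images from a text query |
| Image-to-image search | image_search_image() | Finds images similar to a query image |
| Text-to-video search | text_search_video() | Finds videos from a text query |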
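To make the first two rows concrete, here is a small in-memory sketch of what pure vector search and hybrid search compute. It does not use ByteHouse or the skill's `vector_search()` and `hybrid_search()` methods, whose internals are not shown in this document. Reciprocal rank fusion (RRF) stands in as one common way to merge rankings; the document does not say which fusion method the skill applies, so that choice, the constant `k=60`, and all function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_vector(query, docs, top_k=10):
    """Brute-force search: rank doc ids by similarity to the query vector."""
    scored = sorted(docs.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


def rrf_fuse(rankings, k=60, top_k=10):
    """Merge several ranked id lists with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


docs = {1: [1.0, 0.0], 2: [0.7, 0.7], 3: [0.0, 1.0]}
vector_hits = top_k_by_vector([1.0, 0.1], docs, top_k=3)  # [1, 2, 3]
keyword_hits = [3, 1]  # stand-in for a full-text ranking
fused = rrf_fuse([vector_hits, keyword_hits])             # [1, 3, 2]
```

Document 3 moves ahead of document 2 after fusion because it ranks first in the full-text list, which is the effect hybrid search is meant to have.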
📖 Code Implementation
The complete example code is in the scripts/ directory:
- scripts/embedding.py - multimodal embedding module
- scripts/search_client.py - ByteHouse search client
- scripts/examples.py - usage examples
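Quick Usage

    from scripts import ByteHouseMultimodalSearch

    # Initialize the client
    search = ByteHouseMultimodalSearch(connection_type="http")

    # Create the table
    search.create_multimodal_table("my_index")

    # Insert a document
    search.insert_document("my_index", doc_id=1, content_type="text",
                           content="ByteHouse multimodal search", title="Introduction")

    # Vector search. `embedding` must already hold the query vector,
    # for example one produced by the module in scripts/embedding.py.
    results = search.vector_search("my_index", query_embedding=embedding, top_k=10)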
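⚙️ Best Practices
Index Selection
| Data scale | Index type | Use case |
|---|---|---|
| <1 million | HNSW | Small to medium scale, low latency |
| 1 million to 100 million | HNSW_SQ | Large scale, balances performance and cost |
| >100 million | IVF_PQ_FS | Very large scale |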
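The table reduces to a simple threshold lookup. A minimal sketch, with `pick_index_type` as a hypothetical helper; treating exactly 1 million and exactly 100 million rows as part of the middle band is an assumption, since the table does not say which side the boundaries belong to.

```python
def pick_index_type(row_count):
    """Map an expected row count to the index type suggested in the table."""
    if row_count < 1_000_000:
        return "HNSW"
    if row_count <= 100_000_000:
        return "HNSW_SQ"
    return "IVF_PQ_FS"
```

Sizing by the expected row count rather than the current one avoids rebuilding the index shortly after launch.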
Performance Tuning

    SETTINGS
        index_granularity = 1024,
        index_granularity_bytes = 0,
        enable_vector_index_preload = 1
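Instruction Optimization
| Scenario | Query-side instruction |
|---|---|
| General text-to-image search | Target_modality: image. Instruction:根据文本描述找到对应的图片. (Find the image that matches the text description.) |
| E-commerce product search | Target_modality: image. Instruction:找到和描述匹配的同款商品图片. (Find images of the same product that match the description.) |
| Exact-image lookup | Target_modality: image. Instruction:查找和本图完全相同的图片. (Find images identical to this one.) |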
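The instruction strings in the table share one template, so they can be generated rather than copied by hand. A minimal sketch: the scenario keys and the helper `build_instruction` are hypothetical, and the Chinese instruction text is kept verbatim because it is the literal string from the table. How the instruction is attached to the query before embedding is not specified in this document, so that step is left out.

```python
# Instruction texts copied verbatim from the table; the dictionary keys are
# hypothetical labels introduced for this sketch.
INSTRUCTIONS = {
    "general_text_to_image": "根据文本描述找到对应的图片",
    "ecommerce_product": "找到和描述匹配的同款商品图片",
    "exact_image": "查找和本图完全相同的图片",
}


def build_instruction(scenario, target_modality="image"):
    """Render the query-side instruction string for a scenario."""
    text = INSTRUCTIONS[scenario]
    return f"Target_modality: {target_modality}. Instruction:{text}."
```

An unknown scenario raises `KeyError`, which surfaces typos early instead of silently sending a query without an instruction.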
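❓ FAQ
Q1: How should I choose the vector dimension?
- 1536 dimensions is the recommended general-purpose value
- Higher dimensions give higher accuracy, but also raise storage and compute cost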
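The cost side of this trade-off can be estimated with simple arithmetic. A rough sketch, assuming each component is stored as a 4-byte float and ignoring index structures and compression; `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(dimensions, rows, bytes_per_value=4):
    """Raw storage for `rows` vectors, excluding any index overhead."""
    return dimensions * rows * bytes_per_value


per_vector = raw_vector_bytes(1536, 1)          # 6144 bytes
one_million = raw_vector_bytes(1536, 1_000_000)  # 6_144_000_000 bytes
```

For 1,000,000 vectors at 1536 dimensions this comes to about 6.1 GB of raw vector data, and the figure scales linearly with the dimension, so doubling the dimension doubles it.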
Q2: How do I handle low recall?
- Increase the `hnsw_ef_s` parameter
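Whether a larger `hnsw_ef_s` actually helps can be checked by measuring recall before and after the change. A minimal sketch, assuming the exact top-k ids are available for a sample of queries (for instance from a brute-force scan); `recall_at_k` is a hypothetical helper and is independent of ByteHouse.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k results that the approximate search returned."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact & set(approx_ids[:k])) / len(exact)
```

Averaging this value over a few hundred sample queries gives a stable number to compare across parameter settings.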
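Q3: How do I troubleshoot failed API calls?
- 404: check that the path is `/embeddings/multimodal`
- 400: check the input format; some models do not support `dimensions`
- 401: check that `ARK_API_KEY` is correct
- 429: reduce the request rate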
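The checklist can be turned into a small lookup, with exponential backoff as one way to reduce the request rate on 429. This is a sketch under assumptions: `send` is a hypothetical zero-argument callable that returns `(status_code, body)`, and the retry count and delays are illustrative values, not limits documented by the API.

```python
import time

HINTS = {
    404: "check that the request path is /embeddings/multimodal",
    400: "check the input format; some models do not support `dimensions`",
    401: "check that ARK_API_KEY is correct",
    429: "reduce the request rate",
}


def diagnose(status_code):
    """Return the troubleshooting hint for an HTTP status code."""
    return HINTS.get(status_code, f"unexpected status {status_code}")


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-429 status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s with the defaults
    return status, body
```

Only 429 is retried, since 400, 401, and 404 indicate a request that will keep failing until it is corrected.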