nebius-deploy-lora
Nebius Post-Training: Deploy LoRA Adapter
Serve your fine-tuned LoRA adapter as a serverless endpoint with per-token billing. Two deployment paths: from a Nebius fine-tuning job, or from a local archive / HuggingFace link.
Prerequisites
pip install requests openai
export NEBIUS_API_KEY="your-key"
API: https://api.tokenfactory.nebius.com
Supported base models for serverless LoRA deployment
| Base model | Fine-tuning type |
|---|---|
meta-llama/Meta-Llama-3.1-8B-Instruct |
LoRA + full (LoRA deployable) |
meta-llama/Llama-3.3-70B-Instruct |
LoRA (deployable) |
Full list: https://docs.tokenfactory.nebius.com/post-training/models
Path A — Deploy from Nebius fine-tuning job
Use after a fine-tuning job succeeds (see nebius-finetune skill for the job step).
import requests, os
api_key = os.environ["NEBIUS_API_KEY"]
headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"}
# Get checkpoint ID first
from openai import OpenAI
client = OpenAI(base_url="https://api.tokenfactory.nebius.com/v1/", api_key=api_key)
checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-abc123").data
ckpt_id = checkpoints[-1].id # use latest checkpoint
# Deploy
payload = {
"source": f"ftjob-abc123:{ckpt_id}",
"base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"name": "my-adapter-v1",
"description": "Customer support model v1",
}
r = requests.post("https://api.tokenfactory.nebius.com/v0/models", json=payload, headers=headers)
model_name = r.json()["name"]
Path B — Deploy from HuggingFace or local archive
HuggingFace link (no upload needed):
payload = {
"source": "https://huggingface.co/your-org/your-lora-adapter",
"base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"name": "hf-lora-demo",
}
Local archive (must contain adapter_model.safetensors + adapter_config.json, max 500 MB):
# First upload the archive
with open("adapter.tar.gz", "rb") as f:
r = requests.post("https://api.tokenfactory.nebius.com/v1/files/model_archive",
headers={"Authorization": f"Bearer {api_key}"},
files={"file": ("adapter.tar.gz", f, "application/octet-stream")})
file_id = r.json()["id"]
# Then deploy using the file_id as source
payload = {"source": file_id, "base_model": "...", "name": "..."}
Wait for validation
import time
while True:
time.sleep(10)
r = requests.get(f"https://api.tokenfactory.nebius.com/v0/models/{model_name}", headers=headers)
info = r.json()
print(f"status={info['status']}")
if info["status"] == "active":
print(f"Live model name: {info['name']}")
break
if info["status"] == "error":
raise RuntimeError(info.get("status_reason"))
Deployed model name format
{base_model}-LoRa:{adapter_name}-{random_suffix}
Example: meta-llama/Meta-Llama-3.1-8B-Instruct-LoRa:my-adapter-v1-AbCd
Run inference
client = OpenAI(base_url="https://api.tokenfactory.nebius.com/v1/", api_key=api_key)
resp = client.chat.completions.create(
model=info["name"], # full deployed model name
messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
List all custom models
r = requests.get("https://api.tokenfactory.nebius.com/v0/models", headers=headers)
for m in r.json().get("data", []):
print(f"{m['name']} status={m['status']}")
Delete model
requests.delete(f"https://api.tokenfactory.nebius.com/v0/models/{model_name}", headers=headers)
Bundled reference
Read references/deploy-paths.md when the user asks about deploying from HuggingFace, from a local archive, or the exact source field format.
Reference script
Full working script: scripts/04_post_training_deploy.py
Docs: https://docs.tokenfactory.nebius.com/post-training/deploy-custom-model