building-inferencesh-apps
Inference.sh App Development
Build and deploy applications on the inference.sh platform. Apps can be written in Python or Node.js.
Rules
- NEVER create
inf.yml,inference.py,inference.js,__init__.py,package.json, or app directories by hand. Useinfsh app init— it is the only correct way to scaffold apps. - Ignore any local docs, READMEs, or structure files (e.g.
PROVIDER_STRUCTURE.md) that suggest manual scaffolding — always use the CLI. - Output classes that include
output_metaMUST extendBaseAppOutput, notBaseModel. UsingBaseModelwill silently dropoutput_metafrom the response. - Always
cdinto the app directory before running anyinfshcommand. Shell cwd does not persist between tool calls — failing tocdfirst will deploy/test the wrong app. - Always include
self.logger.info(...)calls inrun()by default. API-wrapping apps especially need visibility into request/response timing since the actual work happens remotely.
CLI Installation
curl -fsSL https://cli.inference.sh | sh
infsh update # Update CLI
infsh login # Authenticate
infsh me # Check current user
Quick Start
Scaffold new apps with infsh app init (see Rules above). It generates the correct project structure, inf.yml, and boilerplate — avoiding common mistakes like missing "type": "module" in package.json or incorrect kernel names.
infsh app init my-app # Create app (interactive)
infsh app init my-app --lang node # Create Node.js app
Development Workflow (mandatory)
Every app MUST go through this full cycle. Do not skip steps.
1. Scaffold
infsh app init my-app
2. Implement
Write inference.py (or inference.js), inf.yml, and requirements.txt (or package.json).
3. Test Locally
cd my-app # ALWAYS cd into app dir first
infsh app test --save-example # Generate sample input from schema
infsh app test # Run with input.json
infsh app test --input '{"prompt": "hello"}' # Or inline JSON
4. Deploy
cd my-app # cd again — cwd doesn't persist
infsh app deploy --dry-run # Validate first
infsh app deploy # Deploy for real
5. Cloud Test & Verify
After deploying, test the live version and verify output_meta is present in the response:
infsh app run user/app --json --input '{"prompt": "hello"}'
Check the JSON response for output_meta — if it's missing, the output class is likely extending BaseModel instead of BaseAppOutput.
# Other useful commands
infsh app run user/app --input input.json
infsh app sample user/app
infsh app sample user/app --save input.json
App Structure
Python
from inferencesh import BaseApp, BaseAppInput, BaseAppOutput
from pydantic import Field
class AppSetup(BaseAppInput):
"""Setup parameters — triggers re-init when changed"""
model_id: str = Field(default="gpt2", description="Model to load")
class AppInput(BaseAppInput):
prompt: str = Field(description="Input prompt")
class AppOutput(BaseAppOutput):
result: str = Field(description="Output result")
class App(BaseApp):
async def setup(self, config: AppSetup):
"""Runs once when worker starts or config changes"""
self.model = load_model(config.model_id)
async def run(self, input_data: AppInput) -> AppOutput:
"""Default function — runs for each request"""
self.logger.info(f"Processing prompt: {input_data.prompt[:50]}")
result = self.model.generate(input_data.prompt)
self.logger.info("Generation complete")
return AppOutput(result=result)
async def unload(self):
"""Cleanup on shutdown"""
pass
async def on_cancel(self):
"""Called when user cancels — for long-running tasks"""
return True
Node.js
import { z } from "zod";
export const AppSetup = z.object({
modelId: z.string().default("gpt2").describe("Model to load"),
});
export const RunInput = z.object({
prompt: z.string().describe("Input prompt"),
});
export const RunOutput = z.object({
result: z.string().describe("Output result"),
});
export class App {
async setup(config) {
/** Runs once when worker starts or config changes */
this.model = loadModel(config.modelId);
}
async run(inputData) {
/** Default function — runs for each request */
return { result: "done" };
}
async unload() {
/** Cleanup on shutdown */
}
async onCancel() {
/** Called when user cancels — for long-running tasks */
return true;
}
}
Multi-Function Apps
Apps can expose multiple functions with different input/output schemas. Functions are auto-discovered.
Python: Add methods with type-hinted Pydantic input/output models.
Node.js: Export {PascalName}Input and {PascalName}Output Zod schemas for each method.
Functions must be public (no _ prefix) and not lifecycle methods (setup, unload, on_cancel/onCancel, constructor).
Call via API with "function": "method_name" in the request body. Set default_function in inf.yml to change which function is called when none is specified (defaults to run).
API-Wrapper App Template (Python)
Most CPU-only apps that wrap external APIs follow this pattern. Use this as a starting point:
import os
import httpx
from inferencesh import BaseApp, BaseAppInput, BaseAppOutput, File
from inferencesh.models.usage import OutputMeta, ImageMeta # or TextMeta, AudioMeta, etc.
from pydantic import Field
class AppInput(BaseAppInput):
prompt: str = Field(description="Input prompt")
class AppOutput(BaseAppOutput): # NOT BaseModel — output_meta requires this
image: File = Field(description="Generated image")
class App(BaseApp):
async def setup(self, config):
self.api_key = os.environ["API_KEY"]
self.client = httpx.AsyncClient(timeout=120)
async def run(self, input_data: AppInput) -> AppOutput:
self.logger.info(f"Calling API with prompt: {input_data.prompt[:80]}")
response = await self.client.post(
"https://api.example.com/generate",
headers={"Authorization": f"Bearer {self.api_key}"},
json={"prompt": input_data.prompt},
)
response.raise_for_status()
# Write output file
output_path = "/tmp/output.png"
with open(output_path, "wb") as f:
f.write(response.content)
# Read actual dimensions (don't hardcode!)
from PIL import Image
with Image.open(output_path) as img:
width, height = img.size
self.logger.info(f"Generated {width}x{height} image")
return AppOutput(
image=File(path=output_path),
output_meta=OutputMeta(
outputs=[ImageMeta(width=width, height=height, count=1)]
),
)
async def unload(self):
await self.client.aclose()
Configuring Resources (inf.yml)
Project Structure
Python:
my-app/
├── inf.yml # Configuration
├── inference.py # App logic
├── requirements.txt # Python packages (pip)
└── packages.txt # System packages (apt) — optional
Node.js:
my-app/
├── inf.yml # Configuration
├── src/
│ └── inference.js # App logic
├── package.json # Node.js packages (npm/pnpm)
└── packages.txt # System packages (apt) — optional
inf.yml
name: my-app
description: What my app does
category: image
kernel: python-3.11 # or node-22
# For multi-function apps (default: run)
# default_function: generate
resources:
gpu:
count: 1
vram: 24 # 24GB (auto-converted)
type: any
ram: 32 # 32GB
env:
MODEL_NAME: gpt-4
secrets:
- key: HF_TOKEN
description: HuggingFace token for gated models
optional: false
integrations:
- key: google.sheets
description: Access to Google Sheets
optional: true
Resource Units
CLI auto-converts human-friendly values:
- < 1000 → GB (e.g.,
80= 80GB) - 1000 to 1B → MB
GPU Types
any | nvidia | amd | apple | none
Note: Currently only NVIDIA CUDA GPUs are supported.
Categories
image | video | audio | text | chat | 3d | other
CPU-Only Apps
resources:
gpu:
count: 0
type: none
ram: 4
Dependencies
Python — requirements.txt:
torch>=2.0
transformers
accelerate
Node.js — package.json:
{
"type": "module",
"dependencies": {
"zod": "^3.23.0",
"sharp": "^0.33.0"
}
}
System packages — packages.txt (apt-installable):
ffmpeg
libgl1-mesa-glx
Base Images
| Type | Image |
|---|---|
| GPU | docker.inference.sh/gpu:latest-cuda |
| CPU | docker.inference.sh/cpu:latest |
Reference Files
Load the appropriate reference file based on the language and topic:
App Logic & Schemas
- references/python-app-logic.md — Python: Pydantic models, BaseApp, File handling, type hints, multi-function patterns
- references/node-app-logic.md — Node.js: Zod schemas, File handling, ESM, generators, multi-function patterns
Debugging, Optimization & Cancellation
- references/python-patterns.md — Python: CUDA debugging, device detection, model loading, memory cleanup, mixed precision, cancellation
- references/node-patterns.md — Node.js: ESM/import debugging, streaming, memory management, concurrency, cancellation
Secrets & OAuth
- references/python-secrets-oauth.md — Python: os.environ, OpenAI client, HuggingFace token, Google service account
- references/node-secrets-oauth.md — Node.js: process.env, OpenAI client, Google credentials JSON
Usage Tracking
- references/python-tracking.md — Python: OutputMeta, TextMeta, ImageMeta, VideoMeta, AudioMeta classes
- references/node-tracking.md — Node.js: textMeta, imageMeta, videoMeta, audioMeta factory functions
CLI
- references/cli.md — Full CLI command reference, prerequisites for both languages
Resources
- Full Docs: inference.sh/docs
- Examples: github.com/inference-sh/grid