scaffold-connector
OpenMetadata Connector Building Skill
When to Activate
When a user asks to build, create, add, or scaffold a new connector, source, or integration for OpenMetadata.
Core Insight
One JSON Schema definition cascades through 6 layers: Python Pydantic models, Java models, UI forms (RJSF auto-render), API validation, test fixtures, and documentation. Define the schema once — everything else is generated or guided.
Workflow: 10 Phases (0-9)
Phase 0: ENVIRONMENT — Set Up Python Dev Environment
Before any make or python commands, set up the environment from the repo root:
```shell
python3.11 -m venv env
source env/bin/activate
make install_dev generate
```
Always activate before running commands: `source env/bin/activate`
Phase 1: SCAFFOLD — Generate Boilerplate
Run the scaffold CLI to collect inputs and generate files:
```shell
source env/bin/activate
metadata scaffold-connector
```
Interactive mode collects: connector name, service type, connection type, auth types, capabilities, docs URL, SDK package, API endpoints, implementation notes, Docker image, container port.
Non-interactive mode:
```shell
metadata scaffold-connector \
  --name my_db \
  --service-type database \
  --connection-type sqlalchemy \
  --scheme "mydb+pymydb" \
  --auth-types basic \
  --capabilities metadata lineage usage profiler \
  --docs-url "https://docs.example.com/api" \
  --sdk-package "mydb-sdk" \
  --docker-image "mydb/mydb:latest" \
  --docker-port 5432
```
Output: JSON Schema + test connection JSON + Python files + CONNECTOR_CONTEXT.md as an AI working document. SQLAlchemy database connectors get concrete code templates; all others get skeleton files with pointers to reference connectors.
CONNECTOR_CONTEXT.md handling: The scaffold generates CONNECTOR_CONTEXT.md in the connector directory as a working document for any AI tool (Claude Code, Cursor, Codex, Copilot, Windsurf). It is gitignored — it stays local and is never committed to the repo. No cleanup needed.
Phase 2: CLASSIFY — Understand the Source
The scaffold classifies along 3 dimensions. Verify the choices:
Dimension 1 — Service Type (determines directory + base class):
| Service Type | Base Class | Reference |
|---|---|---|
| database | `CommonDbSourceService` | `mysql/` |
| dashboard | `DashboardServiceSource` | `metabase/` |
| pipeline | `PipelineServiceSource` | `airflow/` |
| messaging | `MessagingServiceSource` | `kafka/` |
| mlmodel | `MlModelServiceSource` | `mlflow/` |
| storage | `StorageServiceSource` | `s3/` |
| search | `SearchServiceSource` | `elasticsearch/` |
| api | `ApiServiceSource` | `rest/` |
Dimension 2 — Connection Type (database only):
- `sqlalchemy` → `BaseConnection[Config, Engine]` + SQLAlchemy dialect
- `rest_api` → `get_connection()` + custom REST client (ref: `salesforce/`)
- `sdk_client` → `get_connection()` + vendor SDK wrapper
Dimension 3 — Capabilities (determines extra files):
metadata (always), lineage, usage, profiler, stored_procedures, data_diff
Read the source-type-specific standard at ${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md for detailed patterns.
Phase 3: RESEARCH — API/SDK Discovery
Read the CONNECTOR_CONTEXT.md generated by the scaffold. Then research the source's API/SDK.
If you can dispatch sub-agents (Claude Code): Launch a connector-researcher agent:
Agent: openmetadata-skills:connector-researcher
Prompt: "Research {source_name} for an OpenMetadata {service_type} connector.
Find: API docs, auth methods, key endpoints, pagination, rate limits, SDK packages."
If you cannot dispatch sub-agents: Perform the research yourself using WebSearch and WebFetch.
Phase 4: IMPLEMENT — Fill in the TODO Items
The scaffold generates files with # TODO markers. Read the relevant standards before implementing:
- `${CLAUDE_SKILL_DIR}/standards/connection.md` — Connection patterns
- `${CLAUDE_SKILL_DIR}/standards/patterns.md` — Error handling, pagination, auth
- `${CLAUDE_SKILL_DIR}/standards/performance.md` — Pagination, lookup optimization, anti-patterns
- `${CLAUDE_SKILL_DIR}/standards/memory.md` — Memory management, streaming, OOM prevention
- `${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md` — Service-specific patterns
SQLAlchemy database: Templates are mostly complete. Customize _get_client() if needed.
Non-SQLAlchemy: Study the reference connector, then implement each skeleton file.
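For SQLAlchemy connectors, the usual customization point is URL construction. A minimal sketch of a hypothetical builder — the scheme and parameter names mirror the scaffold example above, and the real template wires this through the framework's connection helpers, so treat this as illustration only:

```python
from urllib.parse import quote_plus


def build_connection_url(host_port, username=None, password=None,
                         database=None, scheme="mydb+pymydb"):
    """Assemble a SQLAlchemy URL, percent-encoding credentials (hypothetical helper)."""
    url = f"{scheme}://"
    if username:
        url += quote_plus(username)
        if password:
            url += f":{quote_plus(password)}"
        url += "@"
    url += host_port
    if database:
        url += f"/{database}"
    return url


print(build_connection_url("localhost:5432", "admin", "p@ss:w", "main"))
# mydb+pymydb://admin:p%40ss%3Aw@localhost:5432/main
```

Percent-encoding matters: passwords containing `@` or `:` break the URL silently if passed through raw.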
Critical for JSON Schema:
- Make auth fields (`username`, `password`, `token`) required when the service needs authentication by default. If omitting a field means an opaque 401 at runtime, make it required so the UI validates upfront.
- Include SSL/TLS config (`verifySSL` + `sslConfig` `$ref`) for any connector that communicates over HTTPS — enterprise deployments use internal CAs.
- SSL must be wired end-to-end: schema → `connection.py` (resolve with `get_verify_ssl_fn`) → `client.py` (`session.verify = verify_ssl`). Missing wiring triggers SonarQube Security Review failure.
- See `${CLAUDE_SKILL_DIR}/standards/schema.md` for the `$ref` patterns and required-fields guidance.
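As a sketch of what the end-to-end SSL wiring amounts to, here is a hypothetical resolver. The enum values and fallback semantics are assumptions for illustration — the real helper is the framework's `get_verify_ssl_fn`:

```python
def resolve_verify_ssl(verify_ssl="no-ssl", ca_certificate_path=None):
    """Map a verifySSL setting to the value assigned to session.verify.
    Enum values ("validate"/"ignore"/"no-ssl") are assumed here for illustration."""
    if verify_ssl == "validate":
        # requests accepts either True or a CA-bundle path as the verify value
        return ca_certificate_path or True
    # "ignore" disables verification; "no-ssl" has nothing to verify
    return False

# client.py then wires it through: session.verify = resolve_verify_ssl(...)
```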
Critical for Pydantic API models (models.py):
- Always set `model_config = ConfigDict(populate_by_name=True)` when using `Field(alias=...)` — without this, constructing instances with Python attribute names raises `ValidationError`.
- See `${CLAUDE_SKILL_DIR}/standards/code_style.md` for the full pattern.
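A minimal, runnable illustration of why the flag matters (the model and field names are made up):

```python
from pydantic import BaseModel, ConfigDict, Field


class Dashboard(BaseModel):
    # Without populate_by_name=True, Dashboard(dashboard_id=...) raises ValidationError
    model_config = ConfigDict(populate_by_name=True)
    dashboard_id: str = Field(alias="dashboardId")
    display_name: str = Field(alias="displayName")


# Both construction styles now work:
from_api = Dashboard.model_validate({"dashboardId": "1", "displayName": "Sales"})
from_code = Dashboard(dashboard_id="1", display_name="Sales")
assert from_api == from_code
```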
Critical for non-database connectors (client.py):
- Every list endpoint MUST implement pagination if the API supports it. Check the API docs.
- Missing pagination causes silent data loss — only the first page is ingested.
- Build dicts for repeated lookups (e.g., folder path → folder name) instead of iterating lists.
- See `${CLAUDE_SKILL_DIR}/standards/performance.md` for correct patterns and anti-patterns.
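The two rules above can be sketched together. `list_dashboards` here is an in-memory stand-in for a real client method, not an actual API:

```python
def fetch_all(list_page, page_size=100):
    """Drain a paginated list endpoint instead of stopping at page one."""
    offset, items = 0, []
    while True:
        page = list_page(offset=offset, limit=page_size)
        items.extend(page)
        if len(page) < page_size:  # short page → last page
            break
        offset += page_size
    return items


# Hypothetical data source standing in for a remote API
DATA = [{"id": i, "name": f"dash-{i}"} for i in range(250)]

def list_dashboards(offset=0, limit=100):
    return DATA[offset:offset + limit]


dashboards = fetch_all(list_dashboards)                 # all 250, across 3 "requests"
name_by_id = {d["id"]: d["name"] for d in dashboards}   # build once, e.g. in prepare()
```

Calling the naive `list_dashboards()` once would return only 100 items — exactly the silent first-page data loss described above; the dict makes each later lookup O(1) instead of a list scan per entity.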
Critical for storage connectors and any connector that reads files:
- Never `.read()` entire files without a size check — causes OOM on production instances.
- Use framework streaming readers (`metadata/readers/dataframe/`) for data files.
- `del` large objects after processing and call `gc.collect()`.
- See `${CLAUDE_SKILL_DIR}/standards/memory.md` for correct patterns.
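A sketch of the check-size-then-chunk pattern. The 10 MB cap and helper name are illustrative, not framework constants — real data files should go through the framework's streaming readers:

```python
import io
import os

MAX_BYTES = 10 * 1024 * 1024  # illustrative cap; tune per deployment


def read_small_file(path, limit=MAX_BYTES, chunk_size=64 * 1024):
    """Refuse oversized files up front, then read in chunks rather than one .read()."""
    if os.path.getsize(path) > limit:
        raise ValueError(f"{path} exceeds {limit} bytes; use a streaming reader instead")
    buf = io.BytesIO()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            buf.write(chunk)
    return buf.getvalue()
```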
Critical for lineage:
- Never use wildcard `table_name="*"` in search queries — this links every table in a database to each entity, producing incorrect lineage.
- If the source doesn't provide table-level info, skip lineage and document the limitation.
- See `${CLAUDE_SKILL_DIR}/standards/lineage.md` for correct patterns.
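A minimal sketch of the guard — names are hypothetical, and a real connector would yield framework lineage requests rather than tuples:

```python
def lineage_pairs(entity_name, source_tables):
    """Yield (table, entity) pairs, skipping wildcards and entities with no table info."""
    if not source_tables:
        return  # no table-level info → emit no lineage, document the limitation
    for table_name in source_tables:
        if table_name == "*":
            continue  # a wildcard would link every table in the database
        yield (table_name, entity_name)


edges = list(lineage_pairs("sales_dashboard", ["orders", "*", "customers"]))
```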
Phase 5: REGISTER — Integration Points
Read ${CLAUDE_SKILL_DIR}/standards/registration.md for detailed instructions. Summary:
| Step | File | Change |
|---|---|---|
| 1 | `openmetadata-spec/.../entity/services/{serviceType}Service.json` | Add to type enum + connection oneOf |
| 2 | `openmetadata-ui/.../utils/{ServiceType}ServiceUtils.tsx` | Import schema + add switch case |
| 3 | `openmetadata-ui/.../locale/languages/` | Add i18n display name keys |
Phase 6: GENERATE & FORMAT — Run Code Generation and Formatting
This step is mandatory — always run it before committing. Ensure the Python environment is set up:
```shell
# Ensure environment is active and tools are installed
source env/bin/activate
pip install -e ".[dev]" 2>/dev/null || make install_dev

# Generate models from schemas
make generate                                                  # Python Pydantic models
mvn clean install -pl openmetadata-spec                        # Java models
cd openmetadata-ui/src/main/resources/ui && yarn parse-schema  # UI schemas

# Format ALL code (mandatory before commit)
cd /path/to/repo/root
make py_format      # black + isort + pycln
mvn spotless:apply  # Format Java
```
If make py_format fails: The most common cause is missing dev dependencies. Run make install_dev first, then retry.
Never skip formatting — unformatted code will fail CI.
Phase 7: VALIDATE — Run Static Analysis and Checklist
Run the static analyzer as a self-check before submitting:
```shell
python skills/connector-review/scripts/analyze_connector.py {service_type} {name}
```
Fix any issues it reports. Then verify the full checklist:
[ ] JSON Schema: validates, $ref resolves, supports* flags correct
[ ] JSON Schema: auth fields required when service mandates authentication
[ ] JSON Schema: SSL/TLS config included for HTTPS connectors
[ ] Code gen: make generate + mvn install + yarn parse-schema succeed
[ ] Connection: creates client, test_connection passes all steps
[ ] Source: create() validates config type, ServiceSpec is discoverable
[ ] Pydantic models: populate_by_name=True on all aliased models
[ ] Client: all list endpoints paginate (check API docs for pagination support)
[ ] Client: dict lookups in prepare(), not list iteration per entity
[ ] Lineage: no wildcard table_name="*" — skip if no table-level info available
[ ] Tests: unit + connection integration + metadata integration pass (no empty stubs)
[ ] Formatting: make py_format + mvn spotless:apply pass with no changes
[ ] Cleanup: CONNECTOR_CONTEXT.md is gitignored (verify it's not staged)
[ ] Cleanup: no leftover TODO scaffolding comments
Phase 8: TEST LOCALLY — Deploy and Test in the UI
Build everything and bring up a full local OpenMetadata stack with Docker:
Full build (first time or after Java/UI changes):
```shell
./docker/run_local_docker.sh -m ui -d mysql -s false -i true -r true
```
Fast rebuild (ingestion-only changes, ~2-3 minutes):
```shell
./docker/run_local_docker.sh -m ui -d mysql -s true -i true -r false
```
Once services are up (~3-5 minutes):
- Open http://localhost:8585
- Go to Settings → Services → {Your Service Type}
- Click Add New Service and select your connector
- Configure connection details and click Test Connection
- If test passes, run metadata ingestion to verify entities are created
Other service URLs:
- Airflow: http://localhost:8080 (admin / admin)
- Elasticsearch: http://localhost:9200
Tear down: `cd docker/development && docker compose down -v`
Troubleshooting:
- Connector not in dropdown → check service schema registration, rebuild without `-s true`
- Test connection fails → check `test_fn` keys match test connection JSON step names
- Container logs: `docker compose -f docker/development/docker-compose.yml logs ingestion`
Phase 9: CREATE PR — Submit with Quality Summary
When creating a PR for the connector, include the review summary in the PR description so reviewers see the quality assessment upfront:
```shell
# Run the static analyzer
analysis=$(python skills/connector-review/scripts/analyze_connector.py {service_type} {name} --json)

# Create PR with quality summary in description
gh pr create --title "feat(ingestion): Add {Name} {service_type} connector" --body "$(cat <<'EOF'
## Summary
- New {service_type} connector for {Name}
- Capabilities: {list capabilities}

## Test plan
- [ ] Unit tests pass (`pytest ingestion/tests/unit/topology/{service_type}/test_{name}.py`)
- [ ] Integration tests pass
- [ ] Local Docker test: connector appears in UI, test connection passes

## Connector Quality Review
**Verdict**: {VERDICT} | **Score**: {SCORE}/10

| Category | Score |
|----------|-------|
| Schema & Registration | X/10 |
| Connection & Auth | X/10 |
| Source, Topology & Performance | X/10 |
| Test Quality | X/10 |
| Code Quality & Style | X/10 |

**Blockers**: 0 | **Warnings**: {count} | **Suggestions**: {count}

<details>
<summary>Static analysis output</summary>

{paste analyze_connector.py output here}
</details>

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```
The quality summary gives maintainers confidence about the connector's state without needing to review every file manually.
Standards Reference
All standards are in ${CLAUDE_SKILL_DIR}/standards/:
| Standard | Content |
|---|---|
| `main.md` | Architecture overview, connector anatomy, service types |
| `patterns.md` | Error handling, logging, pagination, auth, filters |
| `testing.md` | Unit test patterns, integration tests, pytest style |
| `code_style.md` | Python style, JSON Schema conventions, naming |
| `schema.md` | Connection schema patterns, `$ref` usage, test connection JSON |
| `connection.md` | BaseConnection vs function patterns, SSL, client wrapper |
| `service_spec.md` | DefaultDatabaseSpec vs BaseSpec |
| `registration.md` | Service enum, UI utils, i18n |
| `performance.md` | Pagination, batching, rate limiting |
| `memory.md` | Memory management, streaming, OOM prevention |
| `lineage.md` | Lineage extraction methods, dialect mapping, query logs |
| `sql.md` | SQLAlchemy patterns, URL building, auth, multi-DB |
| `source_types/*.md` | Service-type-specific patterns |
References
Architecture guides in ${CLAUDE_SKILL_DIR}/references/:
| Reference | Content |
|---|---|
| `architecture-decision-tree.md` | Service type, connection type, base class selection |
| `connection-type-guide.md` | SQLAlchemy vs REST API vs SDK client |
| `capability-mapping.md` | Capabilities by service type, schema flags, generated files |