OpenMetadata Connector Building Skill

When to Activate

When a user asks to build, create, add, or scaffold a new connector, source, or integration for OpenMetadata.

Core Insight

One JSON Schema definition cascades through 6 layers: Python Pydantic models, Java models, UI forms (RJSF auto-render), API validation, test fixtures, and documentation. Define the schema once — everything else is generated or guided.

Workflow: 10 Phases (0–9)

Phase 0: ENVIRONMENT — Set Up Python Dev Environment

Before any make or python commands, set up the environment from the repo root:

python3.11 -m venv env
source env/bin/activate
make install_dev generate

Always activate before running commands: source env/bin/activate

Phase 1: SCAFFOLD — Generate Boilerplate

Run the scaffold CLI to collect inputs and generate files:

source env/bin/activate
metadata scaffold-connector

Interactive mode collects: connector name, service type, connection type, auth types, capabilities, docs URL, SDK package, API endpoints, implementation notes, Docker image, container port.

Non-interactive mode:

metadata scaffold-connector \
  --name my_db \
  --service-type database \
  --connection-type sqlalchemy \
  --scheme "mydb+pymydb" \
  --auth-types basic \
  --capabilities metadata lineage usage profiler \
  --docs-url "https://docs.example.com/api" \
  --sdk-package "mydb-sdk" \
  --docker-image "mydb/mydb:latest" \
  --docker-port 5432

Output: JSON Schema + test connection JSON + Python files + CONNECTOR_CONTEXT.md as an AI working document. SQLAlchemy database connectors get concrete code templates; all others get skeleton files with pointers to reference connectors.

CONNECTOR_CONTEXT.md handling: The scaffold generates CONNECTOR_CONTEXT.md in the connector directory as a working document for any AI tool (Claude Code, Cursor, Codex, Copilot, Windsurf). It is gitignored — it stays local and is never committed to the repo. No cleanup needed.

Phase 2: CLASSIFY — Understand the Source

The scaffold classifies along 3 dimensions. Verify the choices:

Dimension 1 — Service Type (determines directory + base class):

| Service Type | Base Class | Reference |
|--------------|------------|-----------|
| database | CommonDbSourceService | mysql/ |
| dashboard | DashboardServiceSource | metabase/ |
| pipeline | PipelineServiceSource | airflow/ |
| messaging | MessagingServiceSource | kafka/ |
| mlmodel | MlModelServiceSource | mlflow/ |
| storage | StorageServiceSource | s3/ |
| search | SearchServiceSource | elasticsearch/ |
| api | ApiServiceSource | rest/ |

Dimension 2 — Connection Type (database only):

  • sqlalchemy — BaseConnection[Config, Engine] + SQLAlchemy dialect
  • rest_api — get_connection() + custom REST client (ref: salesforce/)
  • sdk_client — get_connection() + vendor SDK wrapper

Dimension 3 — Capabilities (determines extra files): metadata (always), lineage, usage, profiler, stored_procedures, data_diff

Read the source-type-specific standard at ${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md for detailed patterns.

Phase 3: RESEARCH — API/SDK Discovery

Read the CONNECTOR_CONTEXT.md generated by the scaffold. Then research the source's API/SDK.

If you can dispatch sub-agents (Claude Code): Launch a connector-researcher agent:

Agent: openmetadata-skills:connector-researcher
Prompt: "Research {source_name} for an OpenMetadata {service_type} connector.
Find: API docs, auth methods, key endpoints, pagination, rate limits, SDK packages."

If you cannot dispatch sub-agents: Perform the research yourself using WebSearch and WebFetch.

Phase 4: IMPLEMENT — Fill in the TODO Items

The scaffold generates files with # TODO markers. Read the relevant standards before implementing:

  • ${CLAUDE_SKILL_DIR}/standards/connection.md — Connection patterns
  • ${CLAUDE_SKILL_DIR}/standards/patterns.md — Error handling, pagination, auth
  • ${CLAUDE_SKILL_DIR}/standards/performance.md — Pagination, lookup optimization, anti-patterns
  • ${CLAUDE_SKILL_DIR}/standards/memory.md — Memory management, streaming, OOM prevention
  • ${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md — Service-specific patterns

SQLAlchemy database: Templates are mostly complete. Customize _get_client() if needed. Non-SQLAlchemy: Study the reference connector, then implement each skeleton file.

Critical for JSON Schema:

  • If the service requires authentication by default, mark auth fields (username, password, token) as required. If omitting a field means an opaque 401 at runtime, make it required so the UI validates it upfront.
  • Include SSL/TLS config (verifySSL + sslConfig $ref) for any connector that communicates over HTTPS — enterprise deployments use internal CAs.
  • SSL must be wired end-to-end: schema → connection.py (resolve with get_verify_ssl_fn) → client.py (session.verify = verify_ssl). Missing wiring triggers SonarQube Security Review failure.
  • See ${CLAUDE_SKILL_DIR}/standards/schema.md for the $ref patterns and required fields guidance.

Critical for Pydantic API models (models.py):

  • Always set model_config = ConfigDict(populate_by_name=True) when using Field(alias=...) — without this, constructing instances with Python attribute names raises ValidationError.
  • See ${CLAUDE_SKILL_DIR}/standards/code_style.md for the full pattern.
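A sketch of the pattern with a hypothetical `Dashboard` model (field names invented for illustration) — without `populate_by_name=True`, the second construction below would raise `ValidationError`:

```python
from pydantic import BaseModel, ConfigDict, Field


class Dashboard(BaseModel):
    """Hypothetical API model: the API returns camelCase, Python uses snake_case."""

    model_config = ConfigDict(populate_by_name=True)

    dashboard_id: str = Field(alias="dashboardId")
    display_name: str = Field(alias="displayName")


# Both construction styles now work:
from_api = Dashboard.model_validate({"dashboardId": "1", "displayName": "Sales"})
in_tests = Dashboard(dashboard_id="1", display_name="Sales")
```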

Critical for non-database connectors (client.py):

  • Every list endpoint MUST implement pagination if the API supports it. Check the API docs.
  • Missing pagination causes silent data loss — only the first page is ingested.
  • Build dicts for repeated lookups (e.g., folder path → folder name) instead of iterating lists.
  • See ${CLAUDE_SKILL_DIR}/standards/performance.md for correct patterns and anti-patterns.
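The two rules above can be sketched as follows. Offset pagination is assumed here purely for illustration — the real API may use cursors or page tokens, so check its docs:

```python
from typing import Any, Callable, Dict, Iterator, List

PAGE_SIZE = 100


def list_all(
    fetch_page: Callable[[int, int], List[Dict[str, Any]]],
    page_size: int = PAGE_SIZE,
) -> Iterator[Dict[str, Any]]:
    """Exhaust a paginated list endpoint instead of ingesting only page one."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # short page => no more data
            return
        offset += page_size


def build_folder_lookup(folders: List[Dict[str, Any]]) -> Dict[str, str]:
    """Build the path -> name lookup once in prepare(); O(1) per entity after."""
    return {folder["path"]: folder["name"] for folder in folders}
```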

Critical for storage connectors and any connector that reads files:

  • Never .read() entire files without a size check — causes OOM on production instances.
  • Use framework streaming readers (metadata/readers/dataframe/) for data files.
  • del large objects after processing and call gc.collect().
  • See ${CLAUDE_SKILL_DIR}/standards/memory.md for correct patterns.

Critical for lineage:

  • Never use wildcard table_name="*" in search queries — this links every table in a database to each entity, producing incorrect lineage.
  • If the source doesn't provide table-level info, skip lineage and document the limitation.
  • See ${CLAUDE_SKILL_DIR}/standards/lineage.md for correct patterns.
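The guard can be sketched like this — the entity shape (a dict with an optional upstream `"table"` key) is a made-up stand-in for whatever the source returns:

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def yield_table_lineage(
    entities: Iterable[Dict[str, Any]],
) -> Iterator[Tuple[str, str]]:
    """Emit (entity_name, table_name) pairs only when the source names a table."""
    for entity in entities:
        table_name: Optional[str] = entity.get("table")
        if not table_name or table_name == "*":
            # No table-level info: skip and document the limitation
            # rather than linking every table via a wildcard search.
            continue
        yield entity["name"], table_name
```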

Phase 5: REGISTER — Integration Points

Read ${CLAUDE_SKILL_DIR}/standards/registration.md for detailed instructions. Summary:

| Step | File | Change |
|------|------|--------|
| 1 | openmetadata-spec/.../entity/services/{serviceType}Service.json | Add to type enum + connection oneOf |
| 2 | openmetadata-ui/.../utils/{ServiceType}ServiceUtils.tsx | Import schema + add switch case |
| 3 | openmetadata-ui/.../locale/languages/ | Add i18n display name keys |

Phase 6: GENERATE & FORMAT — Run Code Generation and Formatting

This step is mandatory — always run it before committing. Ensure the Python environment is set up:

# Ensure environment is active and tools are installed
source env/bin/activate
pip install -e ".[dev]" 2>/dev/null || make install_dev

# Generate models from schemas
make generate                                # Python Pydantic models
mvn clean install -pl openmetadata-spec      # Java models
cd openmetadata-ui/src/main/resources/ui && yarn parse-schema  # UI schemas

# Format ALL code (mandatory before commit)
cd /path/to/repo/root
make py_format                               # black + isort + pycln
mvn spotless:apply                           # Format Java

If make py_format fails: The most common cause is missing dev dependencies. Run make install_dev first, then retry.

Never skip formatting — unformatted code will fail CI.

Phase 7: VALIDATE — Run Static Analysis and Checklist

Run the static analyzer as a self-check before submitting:

python skills/connector-review/scripts/analyze_connector.py {service_type} {name}

Fix any issues it reports. Then verify the full checklist:

[ ] JSON Schema: validates, $ref resolves, supports* flags correct
[ ] JSON Schema: auth fields required when service mandates authentication
[ ] JSON Schema: SSL/TLS config included for HTTPS connectors
[ ] Code gen: make generate + mvn install + yarn parse-schema succeed
[ ] Connection: creates client, test_connection passes all steps
[ ] Source: create() validates config type, ServiceSpec is discoverable
[ ] Pydantic models: populate_by_name=True on all aliased models
[ ] Client: all list endpoints paginate (check API docs for pagination support)
[ ] Client: dict lookups in prepare(), not list iteration per entity
[ ] Lineage: no wildcard table_name="*" — skip if no table-level info available
[ ] Tests: unit + connection integration + metadata integration pass (no empty stubs)
[ ] Formatting: make py_format + mvn spotless:apply pass with no changes
[ ] Cleanup: CONNECTOR_CONTEXT.md is gitignored (verify it's not staged)
[ ] Cleanup: no leftover TODO scaffolding comments

Phase 8: TEST LOCALLY — Deploy and Test in the UI

Build everything and bring up a full local OpenMetadata stack with Docker:

Full build (first time or after Java/UI changes):

./docker/run_local_docker.sh -m ui -d mysql -s false -i true -r true

Fast rebuild (ingestion-only changes, ~2-3 minutes):

./docker/run_local_docker.sh -m ui -d mysql -s true -i true -r false

Once services are up (~3-5 minutes):

  1. Open http://localhost:8585
  2. Go to Settings → Services → {Your Service Type}
  3. Click Add New Service and select your connector
  4. Configure connection details and click Test Connection
  5. If test passes, run metadata ingestion to verify entities are created

Other service URLs:

Tear down: cd docker/development && docker compose down -v

Troubleshooting:

  • Connector not in dropdown → check service schema registration, rebuild without -s true
  • Test connection fails → check test_fn keys match test connection JSON step names
  • Container logs: docker compose -f docker/development/docker-compose.yml logs ingestion
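The test_fn mismatch in particular is easy to catch with a self-check. This sketch assumes the generated test connection definition has the shape `{"steps": [{"name": ...}, ...]}`; the key names are illustrative:

```python
from typing import Any, Callable, Dict, List


def check_test_fn_steps(
    test_fn: Dict[str, Callable],
    test_connection_def: Dict[str, Any],
) -> List[str]:
    """Return step names from the test connection JSON with no test_fn key."""
    step_names = {step["name"] for step in test_connection_def.get("steps", [])}
    return sorted(step_names - set(test_fn))


definition = {"steps": [{"name": "CheckAccess"}, {"name": "GetDashboards"}]}
test_fn = {"CheckAccess": lambda: None}
missing = check_test_fn_steps(test_fn, definition)  # ["GetDashboards"]
```

Any names returned are steps the UI will show but the connector never executes.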

Phase 9: CREATE PR — Submit with Quality Summary

When creating a PR for the connector, include the review summary in the PR description so reviewers see the quality assessment upfront:

# Run the static analyzer
analysis=$(python skills/connector-review/scripts/analyze_connector.py {service_type} {name} --json)

# Create PR with quality summary in description
gh pr create --title "feat(ingestion): Add {Name} {service_type} connector" --body "$(cat <<'EOF'
## Summary
- New {service_type} connector for {Name}
- Capabilities: {list capabilities}

## Test plan
- [ ] Unit tests pass (`pytest ingestion/tests/unit/topology/{service_type}/test_{name}.py`)
- [ ] Integration tests pass
- [ ] Local Docker test: connector appears in UI, test connection passes

## Connector Quality Review

**Verdict**: {VERDICT} | **Score**: {SCORE}/10

| Category | Score |
|----------|-------|
| Schema & Registration | X/10 |
| Connection & Auth | X/10 |
| Source, Topology & Performance | X/10 |
| Test Quality | X/10 |
| Code Quality & Style | X/10 |

**Blockers**: 0 | **Warnings**: {count} | **Suggestions**: {count}

<details>
<summary>Static analysis output</summary>

{paste analyze_connector.py output here}

</details>

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"

The quality summary gives maintainers confidence about the connector's state without needing to review every file manually.

Standards Reference

All standards are in ${CLAUDE_SKILL_DIR}/standards/:

| Standard | Content |
|----------|---------|
| main.md | Architecture overview, connector anatomy, service types |
| patterns.md | Error handling, logging, pagination, auth, filters |
| testing.md | Unit test patterns, integration tests, pytest style |
| code_style.md | Python style, JSON Schema conventions, naming |
| schema.md | Connection schema patterns, $ref usage, test connection JSON |
| connection.md | BaseConnection vs function patterns, SSL, client wrapper |
| service_spec.md | DefaultDatabaseSpec vs BaseSpec |
| registration.md | Service enum, UI utils, i18n |
| performance.md | Pagination, batching, rate limiting |
| memory.md | Memory management, streaming, OOM prevention |
| lineage.md | Lineage extraction methods, dialect mapping, query logs |
| sql.md | SQLAlchemy patterns, URL building, auth, multi-DB |
| source_types/*.md | Service-type-specific patterns |

References

Architecture guides in ${CLAUDE_SKILL_DIR}/references/:

| Reference | Content |
|-----------|---------|
| architecture-decision-tree.md | Service type, connection type, base class selection |
| connection-type-guide.md | SQLAlchemy vs REST API vs SDK client |
| capability-mapping.md | Capabilities by service type, schema flags, generated files |