OpenMetadata Connector Building Skill

When to Activate

When a user asks to build, create, add, or scaffold a new connector, source, or integration for OpenMetadata.

Core Insight

One JSON Schema definition cascades through 6 layers: Python Pydantic models, Java models, UI forms (RJSF auto-render), API validation, test fixtures, and documentation. Define the schema once — everything else is generated or guided.

Workflow: 10 Phases (0–9)

Phase 0: ENVIRONMENT — Set Up Python Dev Environment

Before any make or python commands, set up the environment from the repo root:

python3.11 -m venv env
source env/bin/activate
make install_dev generate

Always activate before running commands: source env/bin/activate

Phase 1: SCAFFOLD — Generate Boilerplate

Run the scaffold CLI to collect inputs and generate files:

source env/bin/activate
metadata scaffold-connector

Interactive mode collects: connector name, service type, connection type, auth types, capabilities, docs URL, SDK package, API endpoints, implementation notes, Docker image, container port.

Non-interactive mode:

metadata scaffold-connector \
  --name my_db \
  --service-type database \
  --connection-type sqlalchemy \
  --scheme "mydb+pymydb" \
  --auth-types basic \
  --capabilities metadata lineage usage profiler \
  --docs-url "https://docs.example.com/api" \
  --sdk-package "mydb-sdk" \
  --docker-image "mydb/mydb:latest" \
  --docker-port 5432

Output: JSON Schema + test connection JSON + Python files + CONNECTOR_CONTEXT.md as an AI working document. SQLAlchemy database connectors get concrete code templates; all others get skeleton files with pointers to reference connectors.

CONNECTOR_CONTEXT.md handling: The scaffold generates CONNECTOR_CONTEXT.md in the connector directory as a working document for any AI tool (Claude Code, Cursor, Codex, Copilot, Windsurf). It is gitignored — it stays local and is never committed to the repo. No cleanup needed.

Phase 2: CLASSIFY — Understand the Source

The scaffold classifies along 3 dimensions. Verify the choices:

Dimension 1 — Service Type (determines directory + base class):

| Service Type | Base Class | Reference |
|--------------|------------|-----------|
| database | CommonDbSourceService | mysql/ |
| dashboard | DashboardServiceSource | metabase/ |
| pipeline | PipelineServiceSource | airflow/ |
| messaging | MessagingServiceSource | kafka/ |
| mlmodel | MlModelServiceSource | mlflow/ |
| storage | StorageServiceSource | s3/ |
| search | SearchServiceSource | elasticsearch/ |
| api | ApiServiceSource | rest/ |

Dimension 2 — Connection Type (database only):

  • sqlalchemy — BaseConnection[Config, Engine] + SQLAlchemy dialect
  • rest_api — get_connection() + custom REST client (ref: salesforce/)
  • sdk_client — get_connection() + vendor SDK wrapper

Dimension 3 — Capabilities (determines extra files): metadata (always), lineage, usage, profiler, stored_procedures, data_diff

Read the source-type-specific standard at ${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md for detailed patterns.

Phase 3: RESEARCH — API/SDK Discovery

Read the CONNECTOR_CONTEXT.md generated by the scaffold. Then research the source's API/SDK.

If you can dispatch sub-agents (Claude Code): Launch a connector-researcher agent:

Agent: openmetadata-skills:connector-researcher
Prompt: "Research {source_name} for an OpenMetadata {service_type} connector.
Find: API docs, auth methods, key endpoints, pagination, rate limits, SDK packages."

If you cannot dispatch sub-agents: Perform the research yourself using WebSearch and WebFetch.

Phase 4: IMPLEMENT — Fill in the TODO Items

The scaffold generates files with # TODO markers. Read the relevant standards before implementing:

  • ${CLAUDE_SKILL_DIR}/standards/connection.md — Connection patterns
  • ${CLAUDE_SKILL_DIR}/standards/patterns.md — Error handling, pagination, auth
  • ${CLAUDE_SKILL_DIR}/standards/performance.md — Pagination, lookup optimization, anti-patterns
  • ${CLAUDE_SKILL_DIR}/standards/memory.md — Memory management, streaming, OOM prevention
  • ${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md — Service-specific patterns

SQLAlchemy database: Templates are mostly complete. Customize _get_client() if needed. Non-SQLAlchemy: Study the reference connector, then implement each skeleton file.

Critical for JSON Schema:

  • If the service requires authentication by default, mark auth fields (username, password, token) as required. If omitting a field means an opaque 401 at runtime, make it required so the UI validates it upfront.
  • Include SSL/TLS config (verifySSL + sslConfig $ref) for any connector that communicates over HTTPS — enterprise deployments use internal CAs.
  • SSL must be wired end-to-end: schema → connection.py (resolve with get_verify_ssl_fn) → client.py (session.verify = verify_ssl). Missing wiring triggers SonarQube Security Review failure.
  • See ${CLAUDE_SKILL_DIR}/standards/schema.md for the $ref patterns and required fields guidance.

Critical for Pydantic API models (models.py):

  • Always set model_config = ConfigDict(populate_by_name=True) when using Field(alias=...) — without this, constructing instances with Python attribute names raises ValidationError.
  • See ${CLAUDE_SKILL_DIR}/standards/code_style.md for the full pattern.
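A sketch of the pattern with a hypothetical `Dashboard` model (field names invented for illustration) — without `populate_by_name=True`, the second construction below would raise `ValidationError`:

```python
from pydantic import BaseModel, ConfigDict, Field


class Dashboard(BaseModel):
    """Hypothetical API model: the API returns camelCase, Python uses snake_case."""

    model_config = ConfigDict(populate_by_name=True)

    dashboard_id: str = Field(alias="dashboardId")
    display_name: str = Field(alias="displayName")


# Both construction styles now work:
from_api = Dashboard.model_validate({"dashboardId": "1", "displayName": "Sales"})
in_tests = Dashboard(dashboard_id="1", display_name="Sales")
```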

Critical for non-database connectors (client.py):

  • Every list endpoint MUST implement pagination if the API supports it. Check the API docs.
  • Missing pagination causes silent data loss — only the first page is ingested.
  • Build dicts for repeated lookups (e.g., folder path → folder name) instead of iterating lists.
  • See ${CLAUDE_SKILL_DIR}/standards/performance.md for correct patterns and anti-patterns.
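The two rules above can be sketched as follows. Offset pagination is assumed here purely for illustration — the real API may use cursors or page tokens, so check its docs:

```python
from typing import Any, Callable, Dict, Iterator, List

PAGE_SIZE = 100


def list_all(
    fetch_page: Callable[[int, int], List[Dict[str, Any]]],
    page_size: int = PAGE_SIZE,
) -> Iterator[Dict[str, Any]]:
    """Exhaust a paginated list endpoint instead of ingesting only page one."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # short page => no more data
            return
        offset += page_size


def build_folder_lookup(folders: List[Dict[str, Any]]) -> Dict[str, str]:
    """Build the path -> name lookup once in prepare(); O(1) per entity after."""
    return {folder["path"]: folder["name"] for folder in folders}
```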

Critical for storage connectors and any connector that reads files:

  • Never .read() entire files without a size check — causes OOM on production instances.
  • Use framework streaming readers (metadata/readers/dataframe/) for data files.
  • del large objects after processing and call gc.collect().
  • See ${CLAUDE_SKILL_DIR}/standards/memory.md for correct patterns.

Critical for lineage:

  • Never use wildcard table_name="*" in search queries — this links every table in a database to each entity, producing incorrect lineage.
  • If the source doesn't provide table-level info, skip lineage and document the limitation.
  • See ${CLAUDE_SKILL_DIR}/standards/lineage.md for correct patterns.
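The guard can be sketched like this — the entity shape (a dict with an optional upstream `"table"` key) is a made-up stand-in for whatever the source returns:

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def yield_table_lineage(
    entities: Iterable[Dict[str, Any]],
) -> Iterator[Tuple[str, str]]:
    """Emit (entity_name, table_name) pairs only when the source names a table."""
    for entity in entities:
        table_name: Optional[str] = entity.get("table")
        if not table_name or table_name == "*":
            # No table-level info: skip and document the limitation
            # rather than linking every table via a wildcard search.
            continue
        yield entity["name"], table_name
```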

Phase 5: REGISTER — Integration Points

Read ${CLAUDE_SKILL_DIR}/standards/registration.md for detailed instructions. Summary:

| Step | File | Change |
|------|------|--------|
| 1 | openmetadata-spec/.../entity/services/{serviceType}Service.json | Add to type enum + connection oneOf |
| 2 | openmetadata-ui/.../utils/{ServiceType}ServiceUtils.tsx | Import schema + add switch case |
| 3 | openmetadata-ui/.../locale/languages/ | Add i18n display name keys |

Phase 6: GENERATE & FORMAT — Run Code Generation and Formatting

This step is mandatory — always run it before committing. Ensure the Python environment is set up:

# Ensure environment is active and tools are installed
source env/bin/activate
pip install -e ".[dev]" 2>/dev/null || make install_dev

# Generate models from schemas
make generate                                # Python Pydantic models
mvn clean install -pl openmetadata-spec      # Java models
cd openmetadata-ui/src/main/resources/ui && yarn parse-schema  # UI schemas

# Format ALL code (mandatory before commit)
cd /path/to/repo/root
make py_format                               # black + isort + pycln
mvn spotless:apply                           # Format Java

If make py_format fails: The most common cause is missing dev dependencies. Run make install_dev first, then retry.

Never skip formatting — unformatted code will fail CI.

Phase 7: VALIDATE — Run Static Analysis and Checklist

Run the static analyzer as a self-check before submitting:

python skills/connector-review/scripts/analyze_connector.py {service_type} {name}

Fix any issues it reports. Then verify the full checklist:

[ ] JSON Schema: validates, $ref resolves, supports* flags correct
[ ] JSON Schema: auth fields required when service mandates authentication
[ ] JSON Schema: SSL/TLS config included for HTTPS connectors
[ ] Code gen: make generate + mvn install + yarn parse-schema succeed
[ ] Connection: creates client, test_connection passes all steps
[ ] Source: create() validates config type, ServiceSpec is discoverable
[ ] Pydantic models: populate_by_name=True on all aliased models
[ ] Client: all list endpoints paginate (check API docs for pagination support)
[ ] Client: dict lookups in prepare(), not list iteration per entity
[ ] Lineage: no wildcard table_name="*" — skip if no table-level info available
[ ] Tests: unit + connection integration + metadata integration pass (no empty stubs)
[ ] Formatting: make py_format + mvn spotless:apply pass with no changes
[ ] Cleanup: CONNECTOR_CONTEXT.md is gitignored (verify it's not staged)
[ ] Cleanup: no leftover TODO scaffolding comments

Phase 8: TEST LOCALLY — Deploy and Test in the UI

Build everything and bring up a full local OpenMetadata stack with Docker:

Full build (first time or after Java/UI changes):

./docker/run_local_docker.sh -m ui -d mysql -s false -i true -r true

Fast rebuild (ingestion-only changes, ~2-3 minutes):

./docker/run_local_docker.sh -m ui -d mysql -s true -i true -r false

Once services are up (~3-5 minutes):

  1. Open http://localhost:8585
  2. Go to Settings → Services → {Your Service Type}
  3. Click Add New Service and select your connector
  4. Configure connection details and click Test Connection
  5. If test passes, run metadata ingestion to verify entities are created

Other service URLs:

Tear down: cd docker/development && docker compose down -v

Troubleshooting:

  • Connector not in dropdown → check service schema registration, rebuild without -s true
  • Test connection fails → check test_fn keys match test connection JSON step names
  • Container logs: docker compose -f docker/development/docker-compose.yml logs ingestion
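The test_fn mismatch in particular is easy to catch with a self-check. This sketch assumes the generated test connection definition has the shape `{"steps": [{"name": ...}, ...]}`; the key names are illustrative:

```python
from typing import Any, Callable, Dict, List


def check_test_fn_steps(
    test_fn: Dict[str, Callable],
    test_connection_def: Dict[str, Any],
) -> List[str]:
    """Return step names from the test connection JSON with no test_fn key."""
    step_names = {step["name"] for step in test_connection_def.get("steps", [])}
    return sorted(step_names - set(test_fn))


definition = {"steps": [{"name": "CheckAccess"}, {"name": "GetDashboards"}]}
test_fn = {"CheckAccess": lambda: None}
missing = check_test_fn_steps(test_fn, definition)  # ["GetDashboards"]
```

Any names returned are steps the UI will show but the connector never executes.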

Phase 9: CREATE PR — Submit with Quality Summary

When creating a PR for the connector, include the review summary in the PR description so reviewers see the quality assessment upfront:

# Run the static analyzer
analysis=$(python skills/connector-review/scripts/analyze_connector.py {service_type} {name} --json)

# Create PR with quality summary in description
gh pr create --title "feat(ingestion): Add {Name} {service_type} connector" --body "$(cat <<'EOF'
## Summary
- New {service_type} connector for {Name}
- Capabilities: {list capabilities}

## Test plan
- [ ] Unit tests pass (`pytest ingestion/tests/unit/topology/{service_type}/test_{name}.py`)
- [ ] Integration tests pass
- [ ] Local Docker test: connector appears in UI, test connection passes

## Connector Quality Review

**Verdict**: {VERDICT} | **Score**: {SCORE}/10

| Category | Score |
|----------|-------|
| Schema & Registration | X/10 |
| Connection & Auth | X/10 |
| Source, Topology & Performance | X/10 |
| Test Quality | X/10 |
| Code Quality & Style | X/10 |

**Blockers**: 0 | **Warnings**: {count} | **Suggestions**: {count}

<details>
<summary>Static analysis output</summary>

{paste analyze_connector.py output here}

</details>

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"

The quality summary gives maintainers confidence about the connector's state without needing to review every file manually.

Standards Reference

All standards are in ${CLAUDE_SKILL_DIR}/standards/:

| Standard | Content |
|----------|---------|
| main.md | Architecture overview, connector anatomy, service types |
| patterns.md | Error handling, logging, pagination, auth, filters |
| testing.md | Unit test patterns, integration tests, pytest style |
| code_style.md | Python style, JSON Schema conventions, naming |
| schema.md | Connection schema patterns, $ref usage, test connection JSON |
| connection.md | BaseConnection vs function patterns, SSL, client wrapper |
| service_spec.md | DefaultDatabaseSpec vs BaseSpec |
| registration.md | Service enum, UI utils, i18n |
| performance.md | Pagination, batching, rate limiting |
| memory.md | Memory management, streaming, OOM prevention |
| lineage.md | Lineage extraction methods, dialect mapping, query logs |
| sql.md | SQLAlchemy patterns, URL building, auth, multi-DB |
| source_types/*.md | Service-type-specific patterns |

References

Architecture guides in ${CLAUDE_SKILL_DIR}/references/:

| Reference | Content |
|-----------|---------|
| architecture-decision-tree.md | Service type, connection type, base class selection |
| connection-type-guide.md | SQLAlchemy vs REST API vs SDK client |
| capability-mapping.md | Capabilities by service type, schema flags, generated files |