rhino-sdk-harmonize
Rhino Health SDK — Data Harmonization Guide
Guide users through the data harmonization pipeline in the rhino-health Python SDK (v2.1.x): vocabulary setup, semantic mappings, syntactic mappings, configuration, and execution.
Context Loading
Before responding, read these reference files:
- API Reference: ../../context/sdk_reference.md. Focus on §SemanticMappingEndpoints (line ~282) and §SyntacticMappingEndpoints (line ~306) for method signatures, and §CreateInput Summaries for DataHarmonizationRunInput.
- Patterns & Gotchas: ../../context/patterns_and_gotchas.md. Focus on §9 (Async/Wait) for harmonization wait patterns, §11 (Common Import Paths) for harmonization imports, and §12 (Gotchas) for output_dataset_uids triple nesting.
Pipeline Overview
Data harmonization transforms source data into a target data model (OMOP, FHIR, or custom). The end-to-end flow:
1. Create Vocabulary (optional, for semantic lookups)
↓
2. Create Semantic Mapping → wait_for_completion()
↓
3. Create Syntactic Mapping (references semantic mappings)
↓
4. Configure mapping (global_configuration + table_configurations)
— or use generate_config() for LLM-based auto-generation
↓
5. Run harmonization → wait_for_completion()
↓
6. Access output datasets (triply nested UIDs)
Not every step is required — simple transformations may skip vocabularies and semantic mappings.
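As a rough illustration of which steps apply in a given situation, the sketch below derives an ordered step list from three yes/no questions. The function name, its parameters, and the step labels are all hypothetical and exist only for this example; nothing here calls the SDK.

```python
from typing import List


def plan_harmonization_steps(
    needs_semantic_lookup: bool,
    needs_custom_vocabulary: bool = False,
    auto_generate_config: bool = False,
) -> List[str]:
    """Return the pipeline steps, in order, for one harmonization job.

    Hypothetical helper: it only mirrors the flow described above.
    """
    steps: List[str] = []
    if needs_semantic_lookup:
        # A vocabulary is only relevant when semantic lookups are used.
        if needs_custom_vocabulary:
            steps.append("create vocabulary")
        steps.append("create semantic mapping and wait for completion")
    steps.append("create syntactic mapping")
    if auto_generate_config:
        steps.append("generate config")
    else:
        steps.append("configure mapping manually")
    steps.append("run harmonization and wait for completion")
    steps.append("read output dataset UIDs")
    return steps


simple_plan = plan_harmonization_steps(needs_semantic_lookup=False)
full_plan = plan_harmonization_steps(True, True, True)
```

Here simple_plan skips the vocabulary and semantic mapping steps, while full_plan contains all six.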
Key Concepts
Target Data Models
from rhino_health.lib.endpoints.syntactic_mapping.syntactic_mapping_dataclass import SyntacticMappingDataModel
SyntacticMappingDataModel.OMOP # OMOP Common Data Model
SyntacticMappingDataModel.FHIR # HL7 FHIR resources
SyntacticMappingDataModel.CUSTOM # User-defined schema
Transformation Types
Each column mapping uses a TransformationType to define how source values become target values:
from rhino_health.lib.endpoints.syntactic_mapping.syntactic_mapping_dataclass import TransformationType
| Type | When to use |
|---|---|
| SPECIFIC_VALUE | Hardcode a constant value for the target column |
| SOURCE_DATA_VALUE | Direct pass-through of the source column value |
| ROW_PYTHON | Custom Python code executed per row |
| TABLE_PYTHON | Custom Python code executed on the full table |
| SEMANTIC_MAPPING | Map values using a semantic mapping vocabulary lookup |
| VLOOKUP | Look up values from another data source |
| CUSTOM_MAPPING | User-defined mapping logic |
| SECURE_UUID | Generate a secure UUID |
| DATE | Date format transformation |
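To make the difference between the first three types concrete, here is a conceptual sketch in plain Python. It does not use the SDK and is not how the platform executes transformations; the helper names, column names, and sample row are assumptions made up for illustration.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]
Rule = Callable[[Row], Any]


def specific_value(constant: Any) -> Rule:
    """Mimics SPECIFIC_VALUE: every row receives the same constant."""
    return lambda row: constant


def source_data_value(source_column: str) -> Rule:
    """Mimics SOURCE_DATA_VALUE: copy the source column unchanged."""
    return lambda row: row[source_column]


def row_python(func: Rule) -> Rule:
    """Mimics ROW_PYTHON: arbitrary logic evaluated once per row."""
    return func


# Hypothetical target columns mapped to illustrative rules.
column_rules: Dict[str, Rule] = {
    "gender_source_value": source_data_value("sex"),
    "data_origin": specific_value("site_a"),
    "age_years": row_python(lambda row: 2024 - row["birth_year"]),
}


def transform_row(row: Row, rules: Dict[str, Rule]) -> Row:
    """Apply each target column's rule to a single source row."""
    return {target: rule(row) for target, rule in rules.items()}


source_row = {"sex": "F", "birth_year": 1980}
target_row = transform_row(source_row, column_rules)
```

The remaining types (TABLE_PYTHON, SEMANTIC_MAPPING, VLOOKUP, and so on) need whole-table context or external lookups, so they are left out of this per-row sketch.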
Vocabulary Types
from rhino_health.lib.endpoints.semantic_mapping.semantic_mapping_dataclass import VocabularyType
Used when creating semantic mappings to define the vocabulary standard (e.g., ICD-10, SNOMED, LOINC).
Configuration Structure
Syntactic mappings use a two-level configuration:
- global_configuration: settings that apply to the entire mapping (target model, global transforms)
- table_configurations: per-table column mappings, each specifying source column, target column, and transformation type
Use session.syntactic_mapping.generate_config() for LLM-based auto-generation of the configuration (async operation).
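The two-level layout can be pictured as nested dictionaries. The sketch below is a shape illustration only: apart from global_configuration and table_configurations, every key name (target_data_model, target_table, column_mappings, source_column, target_column, transformation_type) is an assumption and may not match the SDK's real schema, so check the API reference before relying on it. The small checker shows one way to catch column mappings that lack one of the three pieces named above.

```python
from typing import Any, Dict, List, Tuple

# Assumed field names for one column mapping (not confirmed SDK schema).
REQUIRED_COLUMN_FIELDS = {"source_column", "target_column", "transformation_type"}

# Hypothetical configuration with one table and two column mappings.
example_config: Dict[str, Any] = {
    "global_configuration": {"target_data_model": "OMOP"},
    "table_configurations": [
        {
            "target_table": "person",
            "column_mappings": [
                {
                    "source_column": "sex",
                    "target_column": "gender_source_value",
                    "transformation_type": "SOURCE_DATA_VALUE",
                },
                {
                    "source_column": "patient_id",
                    "target_column": "person_source_value",
                    "transformation_type": "SOURCE_DATA_VALUE",
                },
            ],
        }
    ],
}


def find_incomplete_mappings(config: Dict[str, Any]) -> List[Tuple[str, int, List[str]]]:
    """Return (table, mapping index, missing fields) for each incomplete mapping."""
    problems: List[Tuple[str, int, List[str]]] = []
    for table in config["table_configurations"]:
        for index, mapping in enumerate(table["column_mappings"]):
            missing = REQUIRED_COLUMN_FIELDS - set(mapping)
            if missing:
                problems.append((table["target_table"], index, sorted(missing)))
    return problems


problems = find_incomplete_mappings(example_config)
```

A pre-flight check like this is optional; it simply surfaces incomplete mappings before a run is submitted.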
Endpoint Methods
Semantic Mappings
# Create
mapping = session.semantic_mapping.create_semantic_mapping(
    semantic_mapping_create_input=SemanticMappingCreateInput(...),
    return_existing=True,
)
# Wait for indexing (can be slow)
mapping.wait_for_completion(timeout_seconds=6000)
# Lookup
mapping = session.semantic_mapping.get_semantic_mapping_by_name("My Mapping")
Syntactic Mappings
# Create
mapping = session.syntactic_mapping.create_syntactic_mapping(
    syntactic_mapping_input=SyntacticMappingCreateInput(...),
    return_existing=True,
)
# Auto-generate config (async, LLM-based)
response = session.syntactic_mapping.generate_config(mapping.uid)
# Lookup
mapping = session.syntactic_mapping.get_syntactic_mapping_by_name("My Mapping")
Running Harmonization
Two execution paths exist:
Preferred: via SyntacticMappingEndpoints
from rhino_health.lib.endpoints.syntactic_mapping.syntactic_mapping_dataclass import (
    DataHarmonizationRunInput,
)
run_params = DataHarmonizationRunInput(
    input_dataset_uids=[dataset.uid],  # List[str]
    semantic_mapping_uids_by_vocabularies={},  # dict: vocab_uid -> semantic_mapping_uid
    timeout_seconds=600.0,
)
code_run = session.syntactic_mapping.run_data_harmonization(
    syntactic_mapping_or_uid=mapping.uid,
    run_params=run_params,
)
result = code_run.wait_for_completion()
Legacy: via CodeObjectEndpoints
Used in older examples (e.g., fhir_pipeline.py). Requires a pre-existing harmonization code object:
code_run = session.code_object.run_data_harmonization(
    code_object_uid=harmonization_code_object_uid,
    run_params=run_params,
)
result = code_run.wait_for_completion()
Accessing Output Datasets
Output UIDs are triply nested — List[workgroups][slots][dataset_uids]:
output_uid = result.output_dataset_uids.root[0].root[0].root[0]
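When a run produces more than one output dataset, indexing only the first element at each level drops the rest. The sketch below walks all three levels; it uses plain nested lists as a stand-in for the SDK result, where each level is reached through .root as in the line above. The helper name and the sample UIDs are hypothetical.

```python
from typing import Iterator, List


def iter_output_dataset_uids(nested: List[List[List[str]]]) -> Iterator[str]:
    """Yield every UID from a [workgroups][slots][dataset_uids] nesting."""
    for workgroup in nested:
        for slot in workgroup:
            for dataset_uid in slot:
                yield dataset_uid


# Hypothetical stand-in for the unwrapped contents of result.output_dataset_uids.
example_nested = [
    [["uid-a1", "uid-a2"], ["uid-b1"]],  # workgroup 0: two slots
    [["uid-c1"]],                        # workgroup 1: one slot
]

all_uids = list(iter_output_dataset_uids(example_nested))
first_uid = example_nested[0][0][0]  # same position as .root[0].root[0].root[0]
```

With a real result object, the same loops would iterate over the .root attribute at each level instead of the lists directly.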
Response Format
Structure every response as:
- Where in the pipeline: identify which step the user is at or needs help with
- Next step with code: complete, runnable code for that step with correct imports
- Gotchas: triply nested output UIDs, long wait_for_completion timeouts for semantic mapping indexing, correct import paths
Working Example
Check ../../context/examples/INDEX.md for matching examples. The key harmonization example is:
- fhir_pipeline.py: end-to-end data harmonization, FHIR resource generation, and CSV export. Read the full file at ../../context/examples/fhir_pipeline.py when relevant.