Cognite Data Fusion is an industrial data platform that ingests, models, and exposes data from operational technology (OT) and information technology (IT) systems. Agents use CDF to build data pipelines (extract → transform → contextualize), define industrial knowledge graphs via data modeling, query resources via REST API or SDKs, and deploy configurations using the Cognite Toolkit. Key entry points: REST API at https://{cluster}.cognitedata.com/api/v1, Python SDK (cognite-sdk), JavaScript SDK (@cognite/sdk), and the Cognite Toolkit CLI (cdf). Primary docs: docs.cognite.com
## When to use
Reach for this skill when:
- **Integrating data**: Setting up extractors (OPC UA, PI, SAP, databases) to stream data into CDF, configuring extraction pipelines, or monitoring data ingestion
- **Modeling data**: Designing data models with spaces, containers, views, and instances; querying graphs with GraphQL or REST; managing access via spaces
- **Working with Assets in CDM**: Creating or updating CogniteAsset hierarchies, populating assets via transformations, querying assets in CDF Search (shown as "Asset")
- **Working with Assets (legacy)**: Maintaining existing asset-centric applications using the `/assets` API
- **Querying resources**: Retrieving assets, time series, events, files, or sequences; using external IDs for lookups; filtering with advanced query language (AQL)
- **Deploying infrastructure**: Using the Cognite Toolkit to manage CDF projects as code, setting up CI/CD pipelines, or deploying modules
- **Building applications**: Authenticating with OAuth 2.0/OIDC, using the Python/JavaScript SDKs to read/write data, or calling REST endpoints
- **Managing access**: Configuring OIDC providers, creating groups, assigning capabilities, or controlling access to spaces and data sets
- **Automating workflows**: Setting up data workflows with tasks and triggers, transforming data, or orchestrating multi-step processes
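As a minimal sketch of the "building applications" path, the snippet below assembles an OAuth 2.0 client-credentials token request for CDF. The token URL, client ID, and secret are placeholders from your identity provider, and the `.default` scope form assumes a Microsoft Entra ID app registration; other IdPs may use different scope conventions.

```python
from urllib.parse import urlencode

def build_token_request(token_url: str, client_id: str,
                        client_secret: str, cluster: str):
    """Build the form-encoded body for an OAuth 2.0 client-credentials
    token request whose scope targets a CDF cluster."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Scope targeting the CDF API on the given cluster
        "scope": f"https://{cluster}.cognitedata.com/.default",
    })
    return token_url, body

# Placeholder identity-provider values for illustration only
url, body = build_token_request(
    "https://login.example.com/tenant-id/oauth2/v2.0/token",
    "my-client-id", "my-client-secret", "westeurope-1",
)
```

POST the body to the token URL with `Content-Type: application/x-www-form-urlencoded`, then pass the returned access token as a `Authorization: Bearer ...` header on CDF API calls; the SDKs wrap this flow for you.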
## Core data modeling vs. legacy
The Cognite Data Fusion (CDF) platform supports two modes:
| Mode | Use case | Key concepts | API |
| --- | --- | --- | --- |
| Core data modeling (CDM) | New projects, knowledge graphs | `CogniteAsset`, `CogniteTimeSeries`, `CogniteFile` | Instances API, GraphQL |
| Legacy | Existing asset-centric applications | Assets, Time series, Events, Files | `/assets`, `/timeseries`, `/events`, `/files` |
For new development, prefer core data modeling. Use the legacy APIs only when maintaining existing applications.
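To make the contrast concrete, here is a sketch of how the same asset lookup by external ID is addressed in each mode. Project and space names are placeholders; field names follow the v1 REST API (`/assets/byids` for legacy, `/models/instances/byids` for CDM), but verify exact shapes against the API reference.

```python
def legacy_asset_request(project: str, external_id: str):
    """Legacy asset-centric lookup: POST /assets/byids."""
    url = f"/api/v1/projects/{project}/assets/byids"
    body = {"items": [{"externalId": external_id}]}
    return url, body

def cdm_asset_request(project: str, space: str, external_id: str):
    """CDM lookup: instances are addressed by space + externalId."""
    url = f"/api/v1/projects/{project}/models/instances/byids"
    body = {"items": [{
        "instanceType": "node",
        "space": space,
        "externalId": external_id,
    }]}
    return url, body

legacy = legacy_asset_request("my-project", "pump-01")
cdm = cdm_asset_request("my-project", "my-space", "pump-01")
```

The key difference: legacy resources are identified by external ID alone, while CDM instances always carry a space as part of their identity.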
## Quick reference
### Core resource types (core data modeling)
| Resource | CDM concept | Purpose | API / SDK |
| --- | --- | --- | --- |
| Assets | `CogniteAsset` | Hierarchical entities; shown as "Asset" in CDF Search | Instances API / `data_modeling.instances` |
| Time series | `CogniteTimeSeries` | Ordered data points over time | Instances API + Time Series API (data points) |
| Files | `CogniteFile` | Documents, diagrams, images | Instances API + File content API (upload) |
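Since time series data points go through the Time Series API in both modes, a sketch of an aggregate query body may help. This builds the JSON for a data points list request (e.g. `POST /timeseries/data/list`); the external ID is a placeholder, and timestamps are epoch milliseconds.

```python
def datapoints_query(external_id: str, start_ms: int, end_ms: int,
                     granularity: str = "1h"):
    """Build a data points request body asking for hourly averages
    over the given epoch-millisecond window."""
    return {
        "items": [{
            "externalId": external_id,
            "start": start_ms,
            "end": end_ms,
            "aggregates": ["average"],   # server-side aggregation
            "granularity": granularity,  # bucket size for aggregates
        }]
    }

# Placeholder sensor and a roughly one-day window
query = datapoints_query("temp-sensor-01",
                         1_700_000_000_000, 1_700_086_400_000)
```

Omit `aggregates` and `granularity` to fetch raw data points instead of aggregated buckets.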
### Legacy resource types
| Resource | Legacy API | Use when |
| --- | --- | --- |
| Assets | `/assets` | Maintaining existing asset-centric apps |
| Time series | `/timeseries` | Metadata only; data points work with both |
| Events | `/events` | Legacy event storage |
| Files | `/files` | Metadata only; content works with both |
### Other resource types
| Resource | Purpose | Key fields |
| --- | --- | --- |
| RAW | Unstructured staging data | `dbName`, `tableName` |
| Data models | Graphs with spaces, containers, views | `space`, `externalId` |
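For RAW, rows are keyed strings with free-form column objects. As a sketch, this builds the request for inserting rows into a RAW table (`POST /raw/dbs/{dbName}/tables/{tableName}/rows`); the project, database, and table names are placeholders.

```python
def raw_rows_request(project: str, db_name: str, table_name: str,
                     rows: dict):
    """Build a RAW row-insert request. `rows` maps row keys to
    column dicts; RAW imposes no schema on the columns."""
    url = (f"/api/v1/projects/{project}/raw/dbs/{db_name}"
           f"/tables/{table_name}/rows")
    body = {"items": [
        {"key": key, "columns": columns}
        for key, columns in rows.items()
    ]}
    return url, body

# Placeholder staging data for illustration
url, body = raw_rows_request(
    "my-project", "staging", "pumps",
    {"pump-01": {"name": "Pump 01", "site": "Plant A"}},
)
```

RAW is typically the landing zone for extractors; transformations then read these rows and write typed resources or CDM instances.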
### Common developer goals
| Goal | CDM approach | Documentation |
| --- | --- | --- |
| Build an asset hierarchy | Create `CogniteAsset` instances with `parent` via Instances API | |
## Common pitfalls

- **External ID uniqueness**: External IDs are unique per resource type, not globally. An asset and a time series can both have `externalId=123`. Enforce uniqueness in your source system mapping.
- **Search vs. CRUD consistency**: Advanced query (AQL) is eventually consistent and slower than the CRUD endpoints. Don't use it for large-scale batch synchronization; use `/byids` or `/list` instead.
- **Pagination limits**: Max 10,000 items per page. For data modeling queries, pagination only works at the top level; nested results cannot be paginated.
- **Data modeling access control**: Access is scoped to spaces, not data sets. Users need `dataModelsAcl.READ` on the space and `dataModelInstancesAcl.READ`/`WRITE` on instances.
- **Extraction pipeline monitoring**: Create extraction pipelines to track ingestion health. Without them, you won't see run history or failure notifications.
- **Toolkit module ordering**: Spaces must be created before containers, containers before views, and views before data models. Use numbered prefixes (e.g., `01_space.yaml`, `02_container.yaml`) if dependencies exist within the same type.
- **Token expiration**: OAuth 2.0 tokens expire. SDKs handle refresh automatically, but custom HTTP clients must implement token refresh logic.
- **Rate limiting**: CDF throttles parallel requests (by default, 20 parallel time series operations and 10 parallel data point operations). Respect these limits or requests will be queued.
- **Deprecated API versions**: API v0.5 and v0.6 are removed. Always use the latest API version (v1).
- **Missing capabilities**: Users need specific capabilities (e.g., `timeseries:read`, `assets:write`) to perform operations. Check access errors against the capabilities reference.
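The pagination pitfall above is easiest to avoid with a small cursor loop. This sketch is API-agnostic: `fetch_page` stands in for any CDF list call that returns items plus a `nextCursor`-style token (the real endpoint and field names depend on the resource).

```python
def paginate(fetch_page, limit=10_000):
    """Yield all items from a cursor-paginated source.

    fetch_page(cursor, limit) must return (items, next_cursor),
    with next_cursor None on the last page -- mirroring the
    nextCursor field on CDF /list responses."""
    cursor = None
    while True:
        items, cursor = fetch_page(cursor, limit)
        yield from items
        if cursor is None:
            break

# Fake three-page source standing in for a CDF list endpoint
pages = {None: ([1, 2], "c1"), "c1": ([3], "c2"), "c2": ([4], None)}
collected = list(paginate(lambda cur, lim: pages[cur]))
```

The SDKs expose equivalent iterators, but custom HTTP clients must implement this loop themselves to get past the 10,000-item page cap.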
## Verification checklist
Before submitting work:
- [ ] **Authentication works**: Test with `cdf status` (Toolkit), `client.data_modeling.instances.list()` (SDK, CDM), or `client.assets.list()` (SDK, legacy)
- [ ] **External IDs are set**: All resources have unique, consistent external IDs from source systems
- [ ] **Data flows end-to-end**: Verify data appears in CDF (check extraction pipeline runs, query a sample asset/time series)
- [ ] **Access is configured**: Test that intended users/groups can read/write resources; check space and capability assignments
- [ ] **Queries are efficient**: For data modeling, run with `profile: true` to check debug notices; avoid full table scans
- [ ] **Pagination is handled**: If retrieving >10,000 items, implement cursor-based pagination
- [ ] **Dry-run passes**: Run `cdf deploy --dry-run` and review changes before applying
- [ ] **CI/CD is integrated**: Toolkit configs are in version control and the CI/CD pipeline validates/deploys on merge
- [ ] **Monitoring is in place**: Extraction pipelines, data workflows, and functions have alerts configured
- [ ] **Documentation is updated**: Record external ID mappings, data model schema, and access policies