npc-ml-agents
Use this skill when the problem is learning behavior, not just running inference. ML-Agents covers environment design, observation/action spaces, reward shaping, training configuration, run monitoring, and exporting a model that Unity can later consume.
When to use this skill
- Training an NPC to navigate, fight, cooperate, or adapt
- Designing observation vectors, action spaces, and episode boundaries
- Shaping rewards without creating easy exploits
- Running PPO training and monitoring reward curves
- Exporting a trained ONNX model and deploying it via unity-sentis
- Resuming or extending an interrupted training run
Instructions
Step 1: Treat the environment as the product
The agent only learns what the environment exposes. Define:
- observation order and scale
- action shape
- success, failure, and timeout conditions
- reset logic for every episode
```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class NpcAgent : Agent
{
    [SerializeField] private Transform target;
    private Rigidbody rb;

    public override void Initialize()
    {
        rb = GetComponent<Rigidbody>();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // 3 + 3 + 3 = 9 floats; keep every value roughly in [-1, 1]
        sensor.AddObservation(transform.localPosition / 5f);
        sensor.AddObservation((target.position - transform.position).normalized);
        sensor.AddObservation(rb.linearVelocity / 10f);
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Two continuous actions: force along X and Z
        var moveX = actions.ContinuousActions[0];
        var moveZ = actions.ContinuousActions[1];
        rb.AddForce(new Vector3(moveX, 0f, moveZ) * 10f);
    }

    public override void OnEpisodeBegin()
    {
        // Reset dynamics and randomize start and target positions each episode
        rb.linearVelocity = Vector3.zero;
        transform.localPosition = Random.insideUnitSphere * 3f;
        target.localPosition = Random.insideUnitSphere * 3f;
    }
}
```
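Before training, it helps to verify the action plumbing by hand. A minimal Heuristic() override, added inside NpcAgent, maps keyboard input to the same two continuous actions; this sketch is not part of the class above and assumes the legacy Input Manager axes "Horizontal" and "Vertical":

```csharp
// Sketch: manual control for sanity-checking the action contract.
// Set Behavior Type to "Heuristic Only" in Behavior Parameters to use it.
public override void Heuristic(in ActionBuffers actionsOut)
{
    var continuous = actionsOut.ContinuousActions;
    continuous[0] = Input.GetAxis("Horizontal"); // drives moveX
    continuous[1] = Input.GetAxis("Vertical");   // drives moveZ
}
```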
Step 2: Build rewards that teach the right shortcut
Start with dense, interpretable rewards:
```csharp
// Small per-step time penalty plus terminal rewards
float distance = Vector3.Distance(transform.position, target.position);
AddReward(-0.001f);

if (distance < 1.5f)
{
    AddReward(1.0f);   // reached the target
    EndEpisode();
}

if (transform.localPosition.y < -1f)
{
    AddReward(-1.0f);  // fell off the platform
    EndEpisode();
}
```
Use reward shaping to encourage progress, not accidental exploits. See references/reward-design.md.
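A common shaping pattern rewards the change in distance to the target rather than raw proximity, which is harder to exploit by idling near the goal. A minimal sketch under that assumption: the previousDistance field and the 0.01f scale are illustrative, and the method would be called from OnActionReceived() with previousDistance reset in OnEpisodeBegin():

```csharp
// Sketch: reward only the progress made this step, not absolute closeness.
private float previousDistance;

private void AddProgressReward()
{
    float distance = Vector3.Distance(transform.position, target.position);
    AddReward(0.01f * (previousDistance - distance)); // positive when moving closer
    previousDistance = distance;
}
```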
Step 3: Match Behavior Parameters to code
If CollectObservations() emits 9 floats, the Vector Observation Space Size in Behavior Parameters must be exactly 9. The same contract applies to the continuous and discrete action counts.
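A runtime log of the configured sizes can catch a mismatch before a long run. This sketch is optional tooling, not part of the required setup; the property names come from Unity.MLAgents.Policies.BehaviorParameters:

```csharp
using Unity.MLAgents.Policies;
using UnityEngine;

// Sketch: log the configured observation/action sizes so they can be
// compared against what CollectObservations() and OnActionReceived() expect.
public class ContractCheck : MonoBehaviour
{
    private void Awake()
    {
        var bp = GetComponent<BehaviorParameters>();
        Debug.Log($"Observations: {bp.BrainParameters.VectorObservationSize}, " +
                  $"continuous actions: {bp.BrainParameters.ActionSpec.NumContinuousActions}");
    }
}
```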
Step 4: Train with mlagents-learn
```yaml
behaviors:
  NPCBrain:
    trainer_type: ppo
    hyperparameters:
      batch_size: 64
      buffer_size: 2048
      learning_rate: 3.0e-4
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 10000
```
Run training, then monitor it with TensorBoard:

```bash
mlagents-learn config/npc.yaml --run-id=npc_v1
tensorboard --logdir results
```
The official getting started guide emphasizes watching cumulative reward rise over time and keeping the generated results/<run-id>/<behavior>.onnx artifact for deployment.
Step 5: Resume or extend training deliberately
```bash
mlagents-learn config/npc.yaml --run-id=npc_v1 --resume
```
Do not change observation or action contracts mid-run unless you intend to invalidate prior checkpoints.
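To branch a fresh run from an existing policy instead of continuing the same run-id, ML-Agents also supports --initialize-from; the npc_v2 run-id below is illustrative:

```bash
mlagents-learn config/npc.yaml --run-id=npc_v2 --initialize-from=npc_v1
```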
Step 6: Deploy the ONNX model back into Unity
The trained ONNX model belongs in the runtime pipeline:
- drag the model into the project
- keep the training-time observation order unchanged
- deploy through Behavior Parameters inference or unity-sentis
This is where npc-ml-agents hands off to unity-sentis.
Advanced patterns
Self-play
Use when the NPC is learning against another adaptive opponent:
```yaml
# nested under the NPCBrain behavior entry in the same config file
self_play:
  save_steps: 20000
  team_change: 100000
  swap_steps: 2000
  window: 10
```
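Self-play also expects competing agents to carry distinct Team Ids on their Behavior Parameters components. This is normally set in the Inspector; the sketch below does the same from code, with the serialized team value as an illustrative choice:

```csharp
using Unity.MLAgents.Policies;
using UnityEngine;

// Sketch: assign opposing agents to different teams for self-play.
public class TeamSetup : MonoBehaviour
{
    [SerializeField] private int teamId; // e.g. 0 and 1 for the two sides

    private void Awake()
    {
        GetComponent<BehaviorParameters>().TeamId = teamId;
    }
}
```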
Curriculum learning
Use when training fails on the hardest environment from step zero:
```yaml
environment_parameters:
  difficulty:
    curriculum:
      - name: easy
        completion_criteria:      # required on every lesson except the last
          measure: progress
          behavior: NPCBrain
          threshold: 0.3          # illustrative: advance after 30% of max_steps
        value: 0.2
      - name: hard
        value: 1.0
```
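Inside the agent, the current lesson's value is read through the EnvironmentParameters API, typically at episode reset. The API call is standard ML-Agents; how "difficulty" is applied here (scaling the target spawn radius) is only an illustration:

```csharp
// Sketch: read the curriculum parameter each episode (defaults to 1.0f
// when no curriculum is active), then scale the target spawn radius with it.
float difficulty = Academy.Instance.EnvironmentParameters.GetWithDefault("difficulty", 1.0f);
target.localPosition = Random.insideUnitSphere * (3f * difficulty);
```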
Examples
Example 1: Navigation NPC
```bash
mlagents-learn config/npc.yaml --run-id=npc_nav_v1
```
Use a reward mix of time penalty, distance improvement, success bonus, and failure penalty.
Example 2: Training to deployment handoff
```
npc-ml-agents
  -> train policy
  -> export results/NPCBrain.onnx
  -> unity-sentis loads the ONNX model at runtime
  -> unity-mcp validates scripts, packages, and console state
```
Example 3: Debugging a stalled run
If cumulative reward does not trend upward:
- check observation normalization
- reduce action complexity
- simplify the success condition
- look for reward hacking before increasing model size
Best practices
- Keep observation and action contracts stable and documented.
- Use multiple parallel agents in-scene when the environment supports it to speed up learning.
- Track TensorBoard reward trends before tuning hyperparameters blindly.
- Resume training with `--resume` instead of discarding useful checkpoints.
- Export ONNX only after the intended checkpoint is safely written.
- Pair this skill with unity-sentis for runtime deployment and unity-mcp for editor automation.
References
- https://unity-technologies.github.io/ml-agents/Getting-Started/
- See references/reward-design.md for reward shaping patterns