model-extraction-relu-logits
Model Extraction for ReLU Networks
This skill provides guidance for extracting internal weight matrices from black-box ReLU neural networks using only input-output access.
Problem Understanding
Model extraction tasks typically involve:
- A black-box neural network that accepts inputs and returns outputs (logits)
- The goal of recovering internal parameters (weight matrices, biases)
- No direct access to the network's implementation or internal state
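As a minimal sketch of this setting, the snippet below hides a small ReLU network behind a single `query` function. The factory name `make_black_box`, the single hidden layer, the scalar logit, and the default sizes are all assumptions made for illustration; the real target's architecture is not specified here. Extraction code would only ever see `query`.

```python
import random


def make_black_box(input_dim=4, hidden_dim=3, seed=None):
    """Build a hypothetical stand-in target: logit = a . relu(W x + b) + c.

    The parameters live only inside the closure, so callers get
    input-output access and nothing else.
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
         for _ in range(hidden_dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
    a = [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
    c = rng.gauss(0.0, 1.0)

    def query(x):
        """Return the logit for one input vector x (a list of floats)."""
        if len(x) != input_dim:
            raise ValueError("expected %d inputs, got %d" % (input_dim, len(x)))
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
                  for row, bi in zip(W, b)]
        return sum(ai * hi for ai, hi in zip(a, hidden)) + c

    return query


# The extraction code receives only this callable.
query = make_black_box(seed=0)
logit = query([0.5, -1.0, 2.0, 0.0])
```

Keeping the parameters inside a closure makes it hard to read them by accident while prototyping, which mirrors the access an evaluator is likely to grant.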
Critical Principle: True Black-Box Treatment
Treat the target network as a genuine black box. Never rely on implementation details that may change during evaluation:
- Do not hardcode hidden layer dimensions from example code
- Do not assume specific random seeds or initialization schemes
- Do not directly compare extracted weights to "true" weights read from source files
- Do not assume the test environment matches any provided example; it may use completely different parameters
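One way to check an extraction without reading "true" weights is to compare outputs on fresh queries. The sketch below assumes a one-hidden-layer network with a scalar logit; the helper names `relu_net` and `max_query_gap` and all numeric values are hypothetical. It also shows why an element-wise weight comparison can mislead: permuting hidden neurons, or scaling a neuron's incoming weights and bias by a positive factor while dividing its outgoing weight by the same factor, leaves the function unchanged.

```python
import random


def relu_net(W, b, a, c, x):
    """Evaluate logit = a . relu(W x + b) + c for one input vector x."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
              for row, bi in zip(W, b)]
    return sum(ai * hi for ai, hi in zip(a, hidden)) + c


def max_query_gap(query, candidate, input_dim, n_samples=1000, seed=0):
    """Largest |query(x) - candidate(x)| over random Gaussian probes."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
        gap = max(gap, abs(query(x) - candidate(x)))
    return gap


# Hypothetical target, used here only to produce a query function.
W = [[1.0, -2.0], [0.5, 0.3], [-1.0, 1.0]]
b = [0.1, -0.2, 0.0]
a = [2.0, -1.0, 0.5]
c = 0.3


def query(x):
    return relu_net(W, b, a, c, x)


# Candidate extraction: same function, but neurons are reordered and
# each one is rescaled by a positive factor.
order = [2, 0, 1]
scale = [4.0, 0.5, 2.0]
W_hat = [[s * w for w in W[i]] for i, s in zip(order, scale)]
b_hat = [s * b[i] for i, s in zip(order, scale)]
a_hat = [a[i] / s for i, s in zip(order, scale)]


def candidate(x):
    return relu_net(W_hat, b_hat, a_hat, c, x)


# Close to zero (floating-point noise) even though W_hat != W.
gap = max_query_gap(query, candidate, input_dim=2)
```

A small gap on random probes is evidence of functional agreement, not proof; using more probes, or probes concentrated near suspected activation boundaries, strengthens the check.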