mcp-server-evaluations
MCP Server Evaluations Skill
Systematically evaluate MCP servers to ensure they function correctly, handle errors gracefully, and meet quality standards.
Workflow
Phase 1: Environment Verification
- Verify MCP server is running
curl -s http://localhost:3030/health
# Expected: 200 OK
curl -s -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"ping"}'
# Expected: {"jsonrpc":"2.0","id":1,"result":{}}
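If the server was started moments ago, a single health probe can fail simply because the process is still booting. A minimal sketch of a retry wrapper follows; `wait_until` is a hypothetical helper name, and the attempt count and delay are assumptions to tune for your setup.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0                      # command succeeded, stop retrying
    n=$((n + 1))
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"                      # pause only between attempts
    fi
  done
  return 1                                # every attempt failed
}
```

For example, `wait_until 10 1 curl -sf -o /dev/null http://localhost:3030/health` would probe the health endpoint up to ten times, one second apart. The `-f` flag makes curl exit non-zero on HTTP 4xx/5xx responses.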
Phase 2: Tool Discovery
- List all available tools
curl -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"tools/list","id":1}'
- Verify tool completeness
  - All OpenAPI operations exposed as tools
  - Tool names follow consistent convention (e.g., getUsers, createOrder)
  - Descriptions are clear and actionable
  - Required vs optional parameters clearly marked
  - Parameter types match OpenAPI schema
- Document discovered tools: create an inventory of tools for systematic testing.
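The naming check in this phase can be partly automated. The sketch below assumes the response follows the standard MCP `tools/list` shape (`.result.tools[].name`) and that "consistent convention" means lowerCamelCase, as the `getUsers` / `createOrder` examples suggest. Both function names are hypothetical.

```shell
# Print one tool name per line from a tools/list response on stdin.
# Assumes the standard MCP result shape: .result.tools[].name
tool_names() {
  jq -r '.result.tools[].name'
}

# Read tool names on stdin and print those that are not lowerCamelCase.
# The lowerCamelCase rule is an assumption; adjust the pattern to your convention.
nonconforming_names() {
  grep -Ev '^[a-z][a-zA-Z0-9]*$' || true   # grep exits 1 when it prints nothing
}
```

Piping the discovery call through both, for example `curl -s ... | tool_names | nonconforming_names`, prints nothing when every tool name fits the pattern.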
Phase 3: Functional Testing
For each discovered tool:
- Basic functionality test
curl -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { <valid_arguments> }
  },
  "id": 2
}'
- Verify response structure
  - Response contains expected data
  - Data types match schema
  - No unexpected null values
  - Pagination works (if applicable)
- Error handling test: call with invalid/missing arguments:
curl -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": {}
  },
  "id": 3
}'
- Verify error response quality
  - Error message is actionable
  - Missing required parameters identified
  - HTTP status codes propagated correctly
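The "missing required parameters identified" criterion can be checked mechanically once the error text is extracted. This is a rough sketch, not a definitive validator. `error_text` and `mentions_all` are hypothetical names. The extraction assumes the error arrives either as a JSON-RPC `error` object or as text items in a tool result, which is where MCP servers report tool-level failures flagged with `isError`.

```shell
# Extract human-readable error text from a response on stdin.
# Tries the JSON-RPC "error.message" first, then falls back to the
# text items in result.content[] of a tool result.
error_text() {
  jq -r '.error.message // (.result.content[]? | select(.type == "text") | .text)'
}

# Succeed only if the message names every expected parameter.
# Usage: mentions_all "<message>" param1 param2 ...
mentions_all() {
  message=$1
  shift
  for param in "$@"; do
    case $message in
      *"$param"*) ;;                               # mentioned, keep checking
      *) echo "not mentioned: $param"; return 1 ;; # report the first gap
    esac
  done
  return 0
}
```

A plain substring match is deliberately loose: it confirms the parameter name appears somewhere in the message, not that the message is well phrased, so the "actionable" criterion still needs a human read.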
Phase 4: Question-Based Evaluation
Generate and test with realistic user questions:
- Generate 10+ test questions covering:
  - Simple single-tool queries
  - Multi-step workflows requiring multiple tools
  - Edge cases (empty results, large datasets)
  - Error scenarios (invalid IDs, unauthorized access)
- Execute each question through MCP client or Inspector
- Score responses using evaluation criteria:
  - Correctness: Does the answer match expected result?
  - Completeness: Is all relevant information included?
  - Clarity: Is the response well-structured?
  - Performance: Response time within acceptable limits?
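One way to turn these per-question judgments into a number is to record them in a small tab-separated file and tally the results. The layout below is an assumption, not part of the skill: one line per question with the columns `id`, `correct`, `complete`, `clear` (each 0 or 1, assigned by the reviewer) and the response time in seconds. A question counts as passed only when all three flags are 1 and the time is under the limit. The default of 5 seconds is borrowed from the Performance row in Phase 5.

```shell
# Hypothetical helper: percentage of questions that pass every criterion.
# Usage: question_accuracy [max_seconds] < results.tsv
question_accuracy() {
  awk -F '\t' -v limit="${1:-5}" '
    NF >= 5 {
      total++
      # pass = correct, complete, clear, and faster than the limit
      if ($2 == 1 && $3 == 1 && $4 == 1 && $5 + 0 < limit + 0) passed++
    }
    END {
      if (total == 0) { print "0"; exit }   # no results recorded
      printf "%.0f\n", 100 * passed / total
    }'
}
```

The all-or-nothing rule is a design choice. If partial credit suits your evaluation better, score each criterion separately and average them.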
Phase 5: Quality Scoring
Calculate overall quality score:
| Category | Weight | Criteria |
|---|---|---|
| Tool Discovery | 20% | All operations exposed, proper naming |
| Basic Functionality | 30% | Valid inputs return correct responses |
| Error Handling | 20% | Graceful errors with actionable messages |
| Question Accuracy | 20% | Test questions answered correctly |
| Performance | 10% | Response times < 5s for standard ops |
Pass threshold: 80% overall score
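The overall score is a weighted sum of the five category scores. As a worked example with illustrative (not measured) category scores of 100, 90, 70, 80, and 50:

0.20 × 100 + 0.30 × 90 + 0.20 × 70 + 0.20 × 80 + 0.10 × 50 = 82.0, which clears the 80% threshold.

The same arithmetic as a small sketch, assuming each category has already been scored as a percentage. `overall_score` and `passes` are hypothetical helper names.

```shell
# Usage: overall_score discovery functionality errors questions performance
# Each argument is that category's score as a percentage (0-100).
overall_score() {
  awk -v d="$1" -v f="$2" -v e="$3" -v q="$4" -v p="$5" 'BEGIN {
    # weights taken from the table: 20% / 30% / 20% / 20% / 10%
    printf "%.1f\n", 0.20*d + 0.30*f + 0.20*e + 0.20*q + 0.10*p
  }'
}

# Succeed when a score meets the 80% pass threshold.
passes() {
  awk -v s="$1" 'BEGIN { exit !(s + 0 >= 80) }'
}
```

Usage: `passes "$(overall_score 100 90 70 80 50)" && echo "PASS" || echo "FAIL"`.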
Quick Evaluation Checklist
Run this minimal check for fast validation:
# 1. Health check
curl -sf -o /dev/null http://localhost:3030/health && echo "✓ Health OK" || echo "✗ Health FAILED"
# 2. MCP ping
curl -s -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"ping"}' | jq -e '.jsonrpc == "2.0" and .result' > /dev/null && echo "✓ Ping OK" || echo "✗ Ping FAILED"
# 3. Tools list
curl -s -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"tools/list","id":1}' | jq '.result.tools | length' | xargs -I {} echo "✓ {} tools discovered"
# 4. Sample tool call (adjust tool name and args)
curl -s -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"listPets","arguments":{}},"id":2}' | jq -e '.result != null and .result.isError != true' > /dev/null && echo "✓ Tool call OK" || echo "✗ Tool call FAILED"
Test Question Templates
Use these patterns to generate effective test questions:
- List/Query: "Show me all [resources] that match [criteria]"
- Get Details: "What are the details of [resource] with ID [id]?"
- Create: "Create a new [resource] with [properties]"
- Update: "Update [resource] [id] to change [field] to [value]"
- Delete: "Remove [resource] with ID [id]"
- Aggregate: "How many [resources] exist with [status]?"
- Search: "Find [resources] where [field] contains [term]"
- Workflow: "Create a [resource], then update it, then list all"
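Filling these templates by hand gets tedious once there are many resources. A minimal sketch of a substitution helper follows. `fill_template` is a hypothetical name, and it assumes single-line templates whose placeholders are written exactly as `[key]`.

```shell
# Usage: fill_template "<template>" key=value ...
# Replaces every [key] placeholder with its value; placeholders without
# a matching key are left untouched.
fill_template() {
  text=$1
  shift
  for pair in "$@"; do
    key=${pair%%=*}      # text before the first "="
    value=${pair#*=}     # text after the first "="
    # Literal (non-regex) replacement, so brackets need no escaping.
    text=$(printf '%s' "$text" | KEY="[$key]" VALUE="$value" awk '
      {
        out = ""
        while ((i = index($0, ENVIRON["KEY"])) > 0) {
          out = out substr($0, 1, i - 1) ENVIRON["VALUE"]
          $0 = substr($0, i + length(ENVIRON["KEY"]))
        }
        print out $0
      }')
  done
  printf '%s\n' "$text"
}
```

For the Petstore case, `fill_template "What are the details of [resource] with ID [id]?" resource=pet id=1` yields "What are the details of pet with ID 1?".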
References
For detailed documentation:
- references/mcp-inspector-guide.md — Inspector setup & usage
- references/evaluation-criteria.md — Quality metrics & scoring
- references/question-templates.md — Test question generation
Example: Petstore API Evaluation
# 1. Run health checks
curl -s http://localhost:3030/health
curl -s -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"ping"}' | jq -e '.jsonrpc == "2.0" and .result' > /dev/null && echo "✓ Ping OK" || echo "✗ Ping FAILED"
# 2. Tool discovery
curl -s -X POST http://localhost:3030/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"tools/list","id":1}' | jq '.result.tools'
# 3. Test questions:
# - "List all available pets"
# - "Show details of pet with ID 1"
# - "Find pets with status 'available'"
# - "Create a new pet named 'Fluffy'"