pipelines-health-check
Pipelines & PR Health Check
This skill checks the health of MSBuild's CI pipelines, the status of insertion PRs in the VS repository, and the health of VMR codeflow PRs that flow MSBuild source into the dotnet/dotnet unified repository.
When to Use
- User asks about MSBuild pipeline health, CI status, or build failures
- User asks about VS insertion PR status or whether insertions are going through
- User asks about VMR codeflow PRs, dotnet/dotnet PR status, or dotnet-unified-build pipeline
- User asks to check if there are failing checks on PRs
- User asks for a health check or status overview
- Periodic monitoring requests
Prerequisites
- `az` CLI must be installed and authenticated (`az login` with access to the DevDiv organization)
- Azure DevOps extension for `az` must be installed: `az extension add --name azure-devops`
- PowerShell 5.1+ or PowerShell Core
Optional: WorkIQ (for infrastructure issue investigation)
WorkIQ is an MCP server / CLI that can query Microsoft 365 data (people, emails, Teams, documents) to find service ownership, contacts, and incident context when pipeline failures are caused by infrastructure issues outside MSBuild's control.
Check availability:
workiq version
# Expected: 0.2.x or later
If not installed, set it up:
# Install globally (use --registry if your .npmrc redirects @microsoft scope to GitHub Packages)
npm install -g @microsoft/workiq --registry https://registry.npmjs.org
# Accept the EULA (required once)
workiq accept-eula
WorkIQ is not required for the core health check. If unavailable, the skill will still work — it will simply skip the ownership lookup and suggest manual investigation or offer to help install WorkIQ.
Reference Information
Pipelines
| Pipeline | Org/Project | ID | Purpose |
|---|---|---|---|
| MSBuild | devdiv/DevDiv | 9434 | Main CI pipeline — builds and tests on every commit to main |
| MSBuild-OptProf | devdiv/DevDiv | 17389 | Optimization/profiling pipeline — runs on schedule |
| dotnet-unified-build | dnceng-public/public | — | VMR build pipeline — runs on codeflow PRs in dotnet/dotnet |
Key URLs
- MSBuild pipeline: https://devdiv.visualstudio.com/DevDiv/_build?definitionId=9434
- OptProf pipeline: https://devdiv.visualstudio.com/DevDiv/_build?definitionId=17389
- VS PRs assigned to MSBuild: https://dev.azure.com/devdiv/DevDiv/_git/VS/pullrequests?_a=active&assignedTo=66cc9d27-aef7-4399-ba2c-3dccb4489098
- VMR codeflow PRs: https://github.com/dotnet/dotnet/pulls?q=is:pr+is:open+"Source+code+updates+from+dotnet/msbuild"
- dotnet-unified-build (public): https://dev.azure.com/dnceng-public/public/_build
Phase 1: Collect Data & Present Overview Table
Step 1: Run all data collection scripts
Run these three scripts from the repository root in parallel. They output JSON to stdout. Each script may take 1–3 minutes depending on the number of PRs and pipeline runs to fetch, so use an initial_wait of 120 seconds or higher.
# Pipeline health (checks both MSBuild and MSBuild-OptProf)
$pipelineJson = & .\.github\skills\pipelines-health-check\check-pipeline-health.ps1
# VS PR status (checks active non-Experimental PRs and last merged PR)
$prJson = & .\.github\skills\pipelines-health-check\check-vs-pr-status.ps1
# VMR codeflow status (checks open codeflow PRs from msbuild and their dotnet-unified-build runs)
$vmrJson = & .\.github\skills\pipelines-health-check\check-vmr-codeflow.ps1
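All scripts write progress messages with Write-Host, which PowerShell 5+ sends to the information stream rather than the success (output) stream, so assigning a script's output to a variable captures only the JSON result.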
The scripts sanitize error messages (stripping control characters, truncating to 500 chars) so the JSON output can be parsed directly with ConvertFrom-Json without additional cleanup.
Step 2: Present the overview table IMMEDIATELY
Parse the JSON outputs and render the status overview tables to the user before doing any deeper investigation. This gives the user instant visibility. Present ALL of the tables below, including those for both pipelines and for the VS insertion PRs. Do not omit any of them unless the user explicitly asks for one specific overview.
Pipeline Health Table
For each pipeline in the JSON output, render one row:
| Pipeline | Last Success | Age | Recent Runs | Status |
|---|---|---|---|---|
| {pipelineName} ({pipelineId}) | {lastSuccessfulRun.finishTime} | {lastSuccessfulRun.ageHours}h | emoji sequence | status emoji + label |
Recent Runs column: Show an emoji for each run in the recentRuns array (newest first):
- `✅` for `succeeded`
- `❌` for `failed`
- `⏳` for `inProgress`
- `⚪` for `canceled` or other
Status column — derive from healthSummary and lastSuccessfulRun.ageHours:
- `✅ HEALTHY`: healthSummary starts with "HEALTHY"
- `⚠️ FLAKY`: healthSummary starts with "FLAKY"
- `🔴 UNHEALTHY`: healthSummary starts with "UNHEALTHY"
- Add `⚠️` if ageHours > 24, `🔴` if ageHours > 48 (even if some runs succeed, stale success is a concern)
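The two column rules above can be sketched in Python, the language this document names for parsing the scripts' JSON output. This is a minimal sketch, not the skill's own implementation: the function names, the `❓ UNKNOWN` fallback, and the wording of the staleness suffix are assumptions, and `run_states` is assumed to be a plain list of run state strings already extracted from `recentRuns`.

```python
RUN_EMOJI = {"succeeded": "✅", "failed": "❌", "inProgress": "⏳"}

# Order matters only for readability: "UNHEALTHY" does not start with "HEALTHY".
STATUS_LABELS = [
    ("HEALTHY", "✅ HEALTHY"),
    ("FLAKY", "⚠️ FLAKY"),
    ("UNHEALTHY", "🔴 UNHEALTHY"),
]


def recent_runs_cell(run_states):
    """Build the Recent Runs cell from run states ordered newest first."""
    return "".join(RUN_EMOJI.get(state, "⚪") for state in run_states)


def pipeline_status_cell(health_summary, age_hours):
    """Build the Status cell from healthSummary and lastSuccessfulRun.ageHours."""
    status = "❓ UNKNOWN"  # hypothetical fallback, not defined by the skill
    for prefix, label in STATUS_LABELS:
        if health_summary.startswith(prefix):
            status = label
            break
    # Stale success is flagged even when the summary itself looks fine.
    if age_hours > 48:
        status += " 🔴 (last success > 48h ago)"
    elif age_hours > 24:
        status += " ⚠️ (last success > 24h ago)"
    return status


print(recent_runs_cell(["succeeded", "failed", "inProgress", "canceled"]))
print(pipeline_status_cell("FLAKY: intermittent failures", 30))
```

Keeping the prefix table as data makes it easy to adjust if the script's `healthSummary` vocabulary changes.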
VS Insertion PRs Table (non-Experimental)
For each PR in the prs array:
| PR | Title | Checks ✅ | Checks ⏳ | Checks ❌ | Status |
|---|---|---|---|---|---|
| {id} | {title} (truncated) | {checks.succeeded} | {checks.pending} | {checks.failed} | status |
Status column:
- `🔴 Failing`: if `actionNeeded` is true (has failed required checks)
- `⏳ Running`: if `checks.pending > 0` and no failures
- `✅ Green`: if all checks succeeded or notApplicable
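As a rough illustration of the status rules above, the sketch below applies them to one parsed entry of the `prs` array. The function name is hypothetical, and the final fallback label covers a case the rules do not spell out (failed checks that are not required), so treat it as an assumption.

```python
def vs_pr_status_cell(pr):
    """Build the Status cell for one entry of the prs array."""
    checks = pr["checks"]
    if pr["actionNeeded"]:
        return "🔴 Failing"
    if checks["pending"] > 0 and checks["failed"] == 0:
        return "⏳ Running"
    if checks["pending"] == 0 and checks["failed"] == 0:
        return "✅ Green"
    # Assumption: only optional checks failed; the skill defines no label for this.
    return "⚪ Optional checks failing"


sample = {"id": 1, "actionNeeded": False,
          "checks": {"succeeded": 4, "pending": 2, "failed": 0}}
print(vs_pr_status_cell(sample))
```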
Last Merged Insertion Row
| Last Merged PR | Date | Age | Status |
|---|---|---|---|
| {lastMergedPr.id} | {lastMergedPr.closedDate} | {ageDays} days | status |
Status:
- `✅ Recent`: ageHours ≤ 48 (≤ 2 business days)
- `⚠️ Getting stale`: ageHours > 48 and ≤ 96
- `🔴 Stale insertion`: ageHours > 96 (> 4 business days)
Note on weekends: When computing business-day age, be aware that weekends inflate the hour count. If today is Monday and the last merge was Friday, that's ~72h but only 1 business day. Mention this nuance to the user if the age seems borderline.
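To make the weekend effect concrete, here is a small sketch that counts only weekday hours between two timestamps and applies the 48/96-hour thresholds from the list above. Both function names are hypothetical, the scripts themselves report raw `ageHours`, and holidays are deliberately ignored.

```python
from datetime import datetime, timedelta


def business_hours_between(start, end):
    """Hours elapsed from start to end, skipping Saturdays and Sundays."""
    total = timedelta()
    cursor = start
    while cursor < end:
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), datetime.min.time(), tzinfo=cursor.tzinfo
        )
        segment_end = min(next_midnight, end)
        if cursor.weekday() < 5:  # Monday=0 ... Friday=4
            total += segment_end - cursor
        cursor = segment_end
    return total.total_seconds() / 3600


def insertion_status_cell(age_hours):
    """Apply the Last Merged Insertion thresholds to an age in hours."""
    if age_hours <= 48:
        return "✅ Recent"
    if age_hours <= 96:
        return "⚠️ Getting stale"
    return "🔴 Stale insertion"


merged = datetime(2024, 5, 3, 10, 0)   # a Friday
now = datetime(2024, 5, 6, 10, 0)      # the following Monday
raw_hours = (now - merged).total_seconds() / 3600
print(raw_hours, business_hours_between(merged, now))
```

In this Friday-to-Monday case the raw age is 72 hours, which lands in the "Getting stale" band, while the weekday-only age is 24 hours. That gap is the nuance worth mentioning to the user when a value is borderline.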
VMR Codeflow PRs Table
For each codeflow PR in the codeflowPRs array from $vmrJson:
| Codeflow PR | Age | Pipeline Runs | Upstream PRs | Status |
|---|---|---|---|---|
| #{prNumber} | {prAge}h | emoji sequence from pipelineRuns | count | status emoji + label |
Pipeline Runs column: Show an emoji for each run in pipelineRuns (newest first):
- `✅` for `result == "succeeded"`
- `❌` for `result == "failed"`
- `⏳` for `status == "inProgress"`
- `⚪` for other/no runs
Status column — derive from healthSummary:
- `✅ HEALTHY`: healthSummary starts with "HEALTHY"
- `🔄 IN PROGRESS`: healthSummary starts with "IN_PROGRESS" or "RETRYING"
- `🔴 FAILING`: healthSummary starts with "UNHEALTHY"
- `⚠️ MIXED`: healthSummary starts with "MIXED"
- `❓ UNKNOWN`: no pipeline runs found
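The codeflow rules differ from the pipeline table in that each run carries separate `result` and `status` fields. A minimal sketch of both columns follows; the function names are hypothetical, and returning `❓ UNKNOWN` for an unrecognized `healthSummary` prefix is an assumption.

```python
CODEFLOW_LABELS = [
    ("HEALTHY", "✅ HEALTHY"),
    ("IN_PROGRESS", "🔄 IN PROGRESS"),
    ("RETRYING", "🔄 IN PROGRESS"),
    ("UNHEALTHY", "🔴 FAILING"),
    ("MIXED", "⚠️ MIXED"),
]


def codeflow_run_emoji(run):
    """Pick the emoji for one entry of pipelineRuns."""
    if run.get("result") == "succeeded":
        return "✅"
    if run.get("result") == "failed":
        return "❌"
    if run.get("status") == "inProgress":
        return "⏳"
    return "⚪"


def codeflow_runs_cell(runs):
    """Build the Pipeline Runs cell; an empty run list renders as a single ⚪."""
    return "".join(codeflow_run_emoji(run) for run in runs) or "⚪"


def codeflow_status_cell(health_summary, runs):
    """Build the Status cell from healthSummary, treating no runs as UNKNOWN."""
    if not runs:
        return "❓ UNKNOWN"
    for prefix, label in CODEFLOW_LABELS:
        if health_summary.startswith(prefix):
            return label
    return "❓ UNKNOWN"  # assumption: unrecognized summaries are also UNKNOWN


runs = [{"status": "inProgress"}, {"status": "completed", "result": "failed"}]
print(codeflow_runs_cell(runs), codeflow_status_cell("RETRYING: rerun queued", runs))
```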
If a PR has failures, also render a failure details sub-table:
| Failed Job | Failed Task | Error Category | Related Upstream PRs |
|---|---|---|---|
| {job.name} | {task.name} | {category} | PR links from failureCorrelation |
The failureCorrelation array maps each failed build to error categories and potentially related upstream PRs (matched by title keywords). This helps quickly identify which msbuild change likely caused a VMR build failure.
Upstream PRs list: For each codeflow PR, list the included upstream msbuild PRs:
Included changes: #13175 Add App Host Support, #13306 IBuildEngine callbacks, ...
Step 3: Identify problems
After rendering the table, build a list of distinct problems. A "problem" is any of:
- Pipeline failure: A pipeline whose latest run on main failed, especially if `lastSuccessfulRun.ageHours > 24`
- PR check failure: An active non-Experimental PR that has `actionNeeded: true` (failed required checks)
- Stale insertion: `lastMergedPr.ageHours > 48` (no successful insertion in >2 business days)
- All checks pending: A PR where all checks are still pending/queued (may indicate a stuck pipeline or queue issue)
- VMR codeflow failure: A codeflow PR whose `healthSummary` starts with "UNHEALTHY" (dotnet-unified-build failing)
- VMR codeflow stale: A codeflow PR older than 48 hours with no successful pipeline run
If there are no problems, report ✅ ALL CLEAR — pipelines healthy, PRs on track, insertions flowing, VMR codeflow green and stop. Do not proceed to Phase 2.
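The problem definitions above can be expressed as a single pass over the three parsed outputs. The sketch below is illustrative only and rests on several assumptions about JSON shape that this document does not confirm: that the pipeline output is a list of pipeline objects, that each `recentRuns` and `pipelineRuns` entry has a `result` field, that `prAge` is in hours, and that "all checks pending" means `pending > 0` with zero succeeded and zero failed. The function name and the problem-kind strings are hypothetical.

```python
def find_problems(pipelines, vs, vmr):
    """Return (kind, identifier) tuples for every problem found in the parsed JSON."""
    problems = []

    for pipeline in pipelines:
        runs = pipeline.get("recentRuns", [])  # assumed newest first
        if runs and runs[0].get("result") == "failed":
            problems.append(("pipeline_failure", pipeline["pipelineName"]))

    for pr in vs.get("prs", []):
        checks = pr["checks"]
        if pr.get("actionNeeded"):
            problems.append(("pr_check_failure", pr["id"]))
        elif checks["pending"] > 0 and checks["succeeded"] == 0 and checks["failed"] == 0:
            problems.append(("all_checks_pending", pr["id"]))

    last_merged = vs.get("lastMergedPr")
    if last_merged and last_merged["ageHours"] > 48:
        problems.append(("stale_insertion", last_merged["id"]))

    for codeflow in vmr.get("codeflowPRs", []):
        if codeflow["healthSummary"].startswith("UNHEALTHY"):
            problems.append(("vmr_codeflow_failure", codeflow["prNumber"]))
        has_success = any(
            run.get("result") == "succeeded" for run in codeflow.get("pipelineRuns", [])
        )
        if codeflow["prAge"] > 48 and not has_success:
            problems.append(("vmr_codeflow_stale", codeflow["prNumber"]))

    return problems


if not find_problems([], {"prs": []}, {"codeflowPRs": []}):
    print("ALL CLEAR")
```

An empty result corresponds to the ALL CLEAR case; otherwise each tuple becomes one subagent investigation.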
Phase 2: Investigate Problems via Subagents
For each distinct problem identified in Step 3, launch a separate subagent to perform a DEEP, DETAILED investigation (use #tool:agent/runSubagent to run the investigation tasks). Launch them in parallel when possible. Seed each one with the template below that matches its problem type.
Subagent prompt templates
For pipeline failures
Investigate why the Azure DevOps pipeline "{pipelineName}" (ID: {pipelineId}) is failing.
Recent failed runs on branch {branch}:
{for each failed run, list: Run ID, start time, URL, and the failedTasks with their error messages}
Last successful run: {lastSuccessfulRun.finishTime} ({ageHours} hours ago)
URL: {lastSuccessfulRun.url}
Tasks:
1. Categorize each failure as one of:
- BUILD ERROR: compilation failures, test failures, task execution errors in MSBuild code
- CONFIG/PERMISSION: signing errors, NuGet authentication, certificate issues, feed access
- INFRA/TRANSIENT: errors indicating unavailability or outage of services or resources
2. Check if all recent failures share the same root cause or if there are different issues
3. If infra/transient: suggest retrying the pipeline (provide the pipeline URL)
4. If build error:
- Check the `For build errors` section below on how to investigate build errors with binlogs
- Identify which component/task is failing and check recent commits to main to find the offending one.
5. If infrastructure issues:
- Try to distill the exact reason for the issue, check if there are other failing pipelines with the same issue or any open bugs for the issue.
- **Use WorkIQ** to find the owning team and contacts. Check if `workiq` CLI is available (`workiq version`).
- If available, run: `workiq ask -q "Who owns the {failing service/task name} service in Microsoft DevDiv? Who should be contacted about {brief error description}?"`
- Include the WorkIQ response in your findings — it typically returns team names, distribution lists, contact people, and escalation paths.
- You can also ask WorkIQ about known outages: `workiq ask -q "Are there any known outages or incidents for {service name} in Azure DevOps?"`
- If WorkIQ is NOT available, note this in your report and suggest the user install it:
```
npm install -g @microsoft/workiq --registry https://registry.npmjs.org
workiq accept-eula
```
- Put together a concise overview of the issue, along with links to the failure messages, the owning team/contacts from WorkIQ, and suggested next steps.
Return: A comprehensive root cause analysis with category, explanation, links to failure messages, ownership info (from WorkIQ if available), and recommended action.
For PR check failures
Investigate failing checks on VS insertion PR #{prId}: "{prTitle}"
PR URL: {prUrl}
Failed checks:
{for each item in checks.failedChecks: genre, name, description, isRequired}
Pending checks (still running):
{for each item in checks.pendingChecks: genre, name, description, isRequired}
Pipeline health context:
{brief summary of pipeline health from Phase 1 — are pipelines also failing?}
Tasks:
1. Identify which failed checks are required vs optional
2. If required checks are failing, determine if this could be related to pipeline failures (same root cause)
3. If checks are just pending/queued, note that they may still be running and suggest waiting
4. Recommend specific actions: retry checks, investigate pipeline, or wait
5. If a check is failing, try to traverse the chain of called pipelines to the actual error, then:
- Check the `For build errors` section below on how to investigate build errors with binlogs
- Identify which component/task is failing and check recent commits to msbuild main to find the offending one.
Return: Which checks need attention, likely cause, and recommended action.
For stale insertion
Investigate why MSBuild insertions into VS appear stale.
Last successfully merged non-Experimental PR: #{lastMergedPr.id} "{lastMergedPr.title}"
Merged: {lastMergedPr.closedDate} ({ageDays} days ago)
URL: {lastMergedPr.url}
Currently active non-Experimental PRs:
{for each PR: id, title, url, checks summary, actionNeeded}
Pipeline health:
{brief pipeline health summary}
Tasks:
1. Check if there are active non-Experimental PRs waiting — if none, the issue may be that no insertion was triggered
2. If there are active PRs with failing checks, identify if those failures are blocking the insertion
3. If there are active PRs with all checks pending, they may just need time
4. Correlate with pipeline health — if the CI pipeline is broken, insertions can't succeed
5. Recommend specific actions to unblock
Return: Explanation of why insertion appears stuck and what to do about it.
For VMR codeflow failures
Investigate failing dotnet-unified-build pipeline for VMR codeflow PR #{prNumber}: "{prTitle}"
PR URL: {prUrl}
PR Branch: {prBranch}
This is a codeflow PR that brings MSBuild source changes into the dotnet/dotnet VMR (Virtual Monolithic Repository).
The dotnet-unified-build pipeline runs in the dnceng-public/public Azure DevOps org.
Failed pipeline runs:
{for each failed run in pipelineRuns: buildId, buildNumber, URL, stages, failedJobs with their failedTasks and errors}
Failure categories detected: {failureCorrelation[].categories}
Upstream MSBuild PRs included in this codeflow:
{for each PR in upstreamPRs: number, title, url, merged status}
Failure-to-change correlation (from script):
{for each item in failureCorrelation: buildId, categories, relatedUpstreamPRs}
Tasks:
1. For each failed job, examine the error messages and categorize:
- TASK_HOST: MSB4216 errors about MSBuild not finding the task host executable — typically caused by changes to MSBuild's app host, node launching, or SDK layout
- SOURCE_BUILD_TASK_HOST: Same as TASK_HOST but in the source-only build (CentOS offline) — the previously-source-built SDK doesn't have the expected MSBuild executable
- COMPILATION: CS/VB/FS compilation errors — a code change broke the build
- BUILD_COMMAND: MSB3073 "exited with code N" — a build script or test failed
- NUGET_AUTH/SIGNING/TIMEOUT/RESOURCE: Infrastructure issues
2. Use the failure-to-change correlation to identify which upstream MSBuild PR most likely caused each failure:
- Check if the error is in an area touched by one of the upstream PRs
- If TASK_HOST errors: look for PRs touching NodeLauncher, NodeProvider, task host, app host, or BuildEnvironmentHelper
- If COMPILATION errors: check which files the upstream PRs modified and whether any could cause the compilation break
- If test failures: identify which test is failing and which upstream PR likely affects that code path
3. Check if there is an in-progress retry build that might resolve the issue
4. If the failure appears to be an infrastructure issue (not caused by MSBuild changes), note that
5. For build errors, check the `For build errors` section on how to investigate with binlogs
6. Recommend specific actions:
- If an upstream PR is clearly at fault, suggest reverting it or filing a fix
- If a VMR-side fix is needed (e.g., SDK layout change), describe what needs to change
- If a retry might help (transient infra), suggest re-running the pipeline
- If the issue is already being retried (in-progress build), suggest waiting
Return: Root cause analysis mapping each failure to the likely upstream PR, with explanation, links, and recommended action.
For build errors
Tasks:
- Try to find .binlog file(s) in the build or step artifacts and fetch them
- Make sure to acquire the binlog-failure-analysis skill together with the binlog-mcp (spawn via `dnx -y baronfel.binlog.mcp@0.0.13`)
- Use the binlog analysis skill and MCP to analyze the binlog(s) you found and diagnose the problems they reveal
Phase 3: Final Report
After all subagent results return, present the findings below the overview table under a "🔍 Problems & Recommendations" heading. For each problem:
### Problem: {brief title}
**Category:** {INFRA | BUILD | CONFIG | PR_CHECKS | STALE_INSERTION | VMR_CODEFLOW}
**Details:** {subagent's explanation}
**Ownership:** {owning team, contacts, DL from WorkIQ — include only for INFRA/CONFIG issues}
**Recommended Action:** {subagent's recommendation}
If all problems are infra/transient, add a note: "All current issues appear to be infrastructure-related. Consider retrying the pipelines and checking again in 30 minutes."
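One way to assemble the per-problem blocks and the closing note is sketched below. The dictionary keys (`title`, `category`, `details`, `ownership`, `action`) and function names are hypothetical, and "all problems are infra/transient" is interpreted here as every problem having category `INFRA`, which is an assumption.

```python
INFRA_NOTE = (
    "All current issues appear to be infrastructure-related. "
    "Consider retrying the pipelines and checking again in 30 minutes."
)


def render_problem(problem):
    """Render one problem using the report template's four fields."""
    lines = [
        f"### Problem: {problem['title']}",
        f"**Category:** {problem['category']}",
        f"**Details:** {problem['details']}",
    ]
    # Ownership is included only for INFRA/CONFIG issues, and only when known.
    if problem["category"] in ("INFRA", "CONFIG") and problem.get("ownership"):
        lines.append(f"**Ownership:** {problem['ownership']}")
    lines.append(f"**Recommended Action:** {problem['action']}")
    return "\n".join(lines)


def render_report(problems):
    """Join all problem blocks and append the infra note when it applies."""
    parts = [render_problem(problem) for problem in problems]
    if problems and all(problem["category"] == "INFRA" for problem in problems):
        parts.append(INFRA_NOTE)
    return "\n\n".join(parts)


example = {
    "title": "Signing step unavailable",
    "category": "INFRA",
    "details": "Placeholder explanation from the subagent.",
    "ownership": "Placeholder team from WorkIQ",
    "action": "Retry the pipeline.",
}
print(render_report([example]))
```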
Troubleshooting
"az: command not found" or "az account get-access-token" fails
The az CLI is not installed or not authenticated. Run az login first.
Scripts return empty arrays
- Check that you have access to the DevDiv organization
- The branch filter defaults to `main`; if checking a different branch, pass `-Branch <name>` to the pipeline script
PR statuses all show as "pending"
This is normal for newly created PRs. The checks take time to queue and run. If checks are pending for more than a few hours, this may indicate a stuck pipeline or queue issue.
Timeout or rate limiting
If the scripts take a long time or fail with 429 errors, Azure DevOps may be rate-limiting. Wait a minute and retry.
WorkIQ not found or EULA not accepted
If workiq version fails, install it:
npm install -g @microsoft/workiq --registry https://registry.npmjs.org
workiq accept-eula
Note: If your .npmrc redirects the @microsoft scope to GitHub Packages, use --registry https://registry.npmjs.org to override, or pass --userconfig pointing to a clean .npmrc.
WorkIQ returns empty or unhelpful results
WorkIQ queries Microsoft 365 data (Outlook, Teams, SharePoint). Results depend on your account's access and the data available in your tenant. Try rephrasing the question or being more specific about the service name. Example queries that work well:
- `workiq ask -q "Who owns the MicroBuild service in Microsoft?"`
- `workiq ask -q "Who owns the CloudBuild signing service in DevDiv?"`
- `workiq ask -q "Who should I contact about NuGet feed authentication failures in Azure DevOps?"`
VMR codeflow script returns no PRs
- Verify there are actually open codeflow PRs: check https://github.com/dotnet/dotnet/pulls?q=is:pr+is:open+"Source+code+updates+from+dotnet/msbuild"
- Verify `gh` CLI is authenticated: run `gh auth status`
VMR pipeline runs not found
- The dotnet-unified-build pipeline runs in `dnceng-public/public`, which is publicly accessible
- If no runs appear, the pipeline may not have been triggered yet for that PR
- Codeflow PRs may take a few minutes after creation before CI triggers
JSON output too large or contains unexpected characters
- Error messages from Azure DevOps timelines can contain Windows paths, control characters, and multi-KB stack traces
- The scripts sanitize and truncate these to 500 characters — if you still see issues, check that you're running the latest version of the scripts
- Use `ConvertFrom-Json` in PowerShell or `json.loads()` in Python to parse the output; avoid manual string manipulation