Data Analysis
Uses the DataVoyager agent (hosted on Cloud Run, accessed via asta-gateway) to run a multi-agent data-science pipeline: the agent writes and executes code against the caller's dataset(s) in a sandboxed notebook and answers a research question. Authenticate with `asta auth login`.
Step 1 — Draft a tightened query
Before asking the user anything, analyze their request and the surrounding context (current project, conversation history, files they've been working on) to produce a tightened analytical question that:
- Names the specific dataset(s) that will be analyzed
- States what decision or insight the user is after, not just "analyze X"
- Is phrased as a question DataVoyager can actually answer with code
Examples:
- User says "look at this CSV" → inspect the file path, sample the top rows if possible, produce a concrete query like "Which columns in `sales_q3.csv` have the strongest correlation with quarterly revenue, controlling for region?"
- User says "what's in the titanic data" → "What features best predict survival in `titanic.csv`, and how do survival rates differ across passenger class and sex?"
- User gives a specific question with a specific dataset → echo it verbatim
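When the user hands over a raw file, a quick peek at the header helps ground the tightened query. A minimal sketch — the sample CSV here is fabricated purely for illustration:

```shell
# Peek at a CSV before drafting the tightened query.
# /tmp/sales_q3.csv is a fabricated example file, not part of the skill.
printf 'region,quarter,revenue\nwest,Q3,1200\neast,Q3,950\n' > /tmp/sales_q3.csv
head -n 1 /tmp/sales_q3.csv | tr ',' '\n'   # column names, one per line
head -n 3 /tmp/sales_q3.csv                 # header plus a couple of sample rows
```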
Step 2 — Confirm with one chat question
In chat (not AskUserQuestion), present:
Proposed analysis:
- Dataset(s): `<path>` (will be uploaded to your DataVoyager workspace)
- Question: `<tightened>`

You can:
- Reply yes / go to run as-is
- Reply with edits (e.g. "focus on just Q3", "ignore missing values") and I'll revise the question
- (Only if AskUserQuestion is available) Reply interview to refine the question through a form

Only include the interview bullet when the AskUserQuestion tool is available.
Wait for the user's response. Paths:
- Affirmative ("yes", "go", "proceed", "looks good") → Step 4.
- Natural-language edit → update the question, re-show the same prompt.
- "Interview" / "refine" → Step 3 (AskUserQuestion required; if unavailable, ask in chat instead).
Never ask the user to pre-upload the dataset. The skill handles the upload — see Step 4 — and the user just supplies the local file path.
Step 3 — Optional interview (only when requested)
Fire one AskUserQuestion with:
- Question — `<tightened>` vs `<alternative reframe>` (if you have one)
- Scope — "focused (single hypothesis)" / "exploratory (multiple angles)"
- Extra context — free-text "anything DataVoyager should know about the data?" (field meanings, known caveats)
Fold the answers back into the query string.
Step 4 — Submit
A single submit call mints a session UUID, uploads the file(s) under `context/<uuid>/`, and starts the analysis. The response carries the task ID (`id`) and the session UUID (`contextId`); capture both for polling and any follow-ups.
asta analyze-data submit \
--output "/tmp/analyze-data-$$.json" \
"<confirmed question>" ./sales.csv ./regions.csv
TID=$(jq -r .id /tmp/analyze-data-$$.json)
CTX=$(jq -r .contextId /tmp/analyze-data-$$.json)
Resumability. `$CTX` identifies the DataVoyager session. To ask a follow-up against the same workspace, run `asta analyze-data submit --context-id "$CTX" '<follow-up question>' [<new-files>...]`. New files (if any) attach to the existing context; if no files are passed, the agent reuses what's already there. To start a clean session over the same datasets, omit `--context-id` and pass the files again.
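A follow-up invocation can be sketched as below. The session ID `ctx-abc` and the question are placeholders; the command and flags are the ones documented above:

```shell
# Sketch: follow-up submit against the same DataVoyager workspace.
# CTX is a placeholder here; normally it comes from the first submit's contextId.
CTX="ctx-abc"
followup="asta analyze-data submit --context-id $CTX 'Did the Q3 trend persist into Q4?' ./q4.csv"
echo "$followup"   # inspect, then run it once CTX holds a real session UUID
```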
Step 5 — Poll
Don't poll in a foreground loop (it blocks the session), and don't issue individual `sleep 60`-then-check turns (the harness blocks long leading sleeps). Instead, run the `poll` subcommand backgrounded — it exits on a terminal state and the harness will notify you when it finishes.
asta analyze-data poll "$TID" --output "/tmp/analyze-data-$TID.json"
Run with `run_in_background: true`. Status ticks go to stderr; the final Task JSON is written to `--output`. When the completion notification fires, read `/tmp/analyze-data-$TID.json` for the final payload.
While it's running, do not proactively check. Work on other things or wait — the notification is authoritative. If the user asks for a status check before the notification, only then tail the background task's stderr.
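The backgrounded-poll shape can be sketched with a stand-in function so it runs anywhere; in the harness the real call is `asta analyze-data poll "$TID" --output ...` launched with `run_in_background: true` rather than `&`, and the `.status.state` field is an assumption about the final Task JSON:

```shell
# fake_poll stands in for `asta analyze-data poll "$TID" --output <file>`:
# it blocks, then writes the final Task JSON on reaching a terminal state.
fake_poll() { sleep 1; printf '{"status":{"state":"completed"}}\n' > "$1"; }
fake_poll /tmp/analyze-data-demo.json &   # harness: run_in_background: true
wait $!                                    # harness: completion notification
jq -r .status.state /tmp/analyze-data-demo.json
```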
Terminal states:
- completed → Step 6
- failed → report `status.message` and stop
- input-required → relay to user, then `asta analyze-data send-message --task-id <ID> --context-id "$CTX" '<reply>'` and re-kick the polling loop
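Once the notification fires, dispatch on the state found in the output file. A sketch with a fabricated payload — the `.status.state` and `.status.message` field names are assumptions about the Task JSON shape, inferred from the terminal states above:

```shell
# Fabricated final payload for illustration.
cat > /tmp/analyze-data-final.json <<'EOF'
{"id":"task-123","contextId":"ctx-abc","status":{"state":"completed"}}
EOF
# Branch on the terminal state.
case "$(jq -r .status.state /tmp/analyze-data-final.json)" in
  completed)      echo "export artifacts (Step 6)" ;;
  failed)         jq -r .status.message /tmp/analyze-data-final.json ;;   # report and stop
  input-required) echo "relay the agent's question, then send-message" ;;
esac
```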
Runtime: highly variable (simple EDAs finish in a few minutes; multi-step modeling runs can take 20–40 min). Don't hard-fail before ~2 hours.
Step 6 — Export and index
Hand off to the Asta Artifacts skill to export the task output (tables, plots, the notebook, any written analysis) and register each artifact with asta-documents. Pass analyze-data as the invoking skill and a slug derived from the analytical question; Asta Artifacts handles the path convention, manifest, and index.yaml.
Step 7 — Summarize for the user
Present, in this order:
- Indexing + exploration paths — one short block naming both ways to browse. Always include BOTH (the skill path for semantic search, the filesystem path for direct reading):

  Indexed N artifacts in `.asta/analyze-data/<slug>/index.yaml`. Explore via `asta documents search --summary='<concept>' --root=.asta/analyze-data/<slug>` or open the directory directly: `open <absolute-path-to-slug-dir>`

  Use the absolute path (e.g. `/Users/.../project/.asta/analyze-data/2026-04-23-…/`). Pick `<concept>` from a term central to the analysis (concrete, not generic — e.g. a column name, a model type).
- One-paragraph synthesis — 2–4 sentences written fresh for this run. What's the headline finding? What did the data say vs. what did the user expect? Surface surprises, caveats, and whether the analysis answered the original question. This is discretionary — don't template it, read the output and synthesize.
- Table of key findings / charts — one row per notable insight or figure: finding + 1–2 sentence detail + (if applicable) chart filename.
Don't dump raw JSON. Don't repeat every step the agent took. Don't add a trailing "let me know if you'd like…" summary — the exploration block already tells the user how to keep going.