The Agent Skills Directory

[PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection because it retrieves and processes metadata, descriptions, and README files from external, potentially user-controlled platforms (HuggingFace, OpenML, GitHub, and Semantic Scholar). This content is presented to the agent, which could allow an attacker to embed instructions in dataset metadata designed to influence agent behavior.
Ingestion points: Dataset search results and detailed metadata fetched via external APIs in scripts/search_ml_datasets.py.
Boundary markers: The data is formatted into structured markdown tables, but there are no explicit instructions for the agent to treat the content as untrusted or to ignore embedded commands.
Capability inventory: The agent has access to tools like run_terminal and write_file, which could be targeted by malicious instructions found in external data.
Sanitization: Dataset descriptions are truncated to 200 characters, but README files and full metadata objects are ingested with fewer constraints.
[COMMAND_EXECUTION]: The included script scripts/search_ml_datasets.py executes the GitHub CLI tool (gh) using subprocess.run. While the implementation correctly uses a list of arguments to prevent shell injection, it relies on the presence of external system binaries.
[EXTERNAL_DOWNLOADS]: The skill performs network operations to fetch metadata from well-known services (huggingface.co, openml.org, semanticscholar.org). It also lists the third-party requests library as a required Python dependency.

dataset-discovery