# Data Download & Acquisition

## Objectives
- Download datasets from various sources (HTTP, API, cloud platforms)
- Handle authentication and API keys securely
- Implement retry logic and resume capability
- Validate downloaded data integrity
- Cache downloads to avoid redundant requests
## Core Strategy

### 1. Choose the Right Method

Select the download method based on the source; a small dispatch sketch follows the list:
- Built-in libraries: Use sklearn, TensorFlow, PyTorch, or HuggingFace `datasets` when available (fastest, most reliable)
- Direct HTTP: For simple file URLs, use `requests` with streaming
- Platform APIs: Use official clients for Kaggle, HuggingFace, AWS S3
- Web scraping: Only when no API exists (see the `dev-web_scraping` skill)
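A minimal sketch of that decision as code, assuming the source can be recognized from its URL or an identifier prefix (the patterns and return labels here are illustrative, not part of this skill's scripts):

```python
def pick_download_method(source: str) -> str:
    """Illustrative dispatch from a source string to one of the methods above."""
    if source.startswith('kaggle:'):
        return 'platform-api'        # e.g. 'kaggle:uciml/iris' -> Kaggle client
    if 'drive.google.com' in source:
        return 'platform-api'        # gdown
    if source.startswith('s3://'):
        return 'platform-api'        # boto3 / official SDK
    if source.startswith(('http://', 'https://')):
        return 'direct-http'         # requests with streaming
    return 'built-in-loader'         # sklearn / keras / HuggingFace datasets
```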
### 2. Implement Reliability

Always include retry logic with exponential backoff:

```python
from pathlib import Path

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=4, max=10))
def download_with_retry(url: str, output_path: Path):
    """Stream the file to disk, retrying up to 3 times with exponential backoff."""
    response = requests.get(url, timeout=30, stream=True)
    response.raise_for_status()
    with open(output_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```
### 3. Cache Downloads

Avoid redundant downloads:

```python
def download_with_cache(url: str, cache_dir: Path = Path('.cache')) -> Path:
    """Return the cached copy if present; otherwise download and cache it."""
    filename = url.split('/')[-1]
    cache_path = cache_dir / filename
    if cache_path.exists():
        print(f"Using cached: {cache_path}")
        return cache_path
    cache_dir.mkdir(parents=True, exist_ok=True)
    download_with_retry(url, cache_path)
    return cache_path
```
### 4. Validate Data

After download, verify integrity:

```python
# Check file size
if output_path.stat().st_size == 0:
    raise ValueError("Downloaded file is empty")

# Verify checksum if available
if expected_md5:
    verify_checksum(output_path, expected_md5)

# Validate format
df = pd.read_csv(output_path)  # Will raise if invalid
```
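The snippet above calls `verify_checksum` without defining it; a minimal sketch using `hashlib` (the function name comes from the snippet, the body is an assumption):

```python
import hashlib
from pathlib import Path

def verify_checksum(path: Path, expected_md5: str, chunk_size: int = 8192) -> None:
    """Hash the file in chunks and fail loudly if it does not match the expected MD5."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        raise ValueError(
            f"Checksum mismatch for {path}: got {md5.hexdigest()}, expected {expected_md5}"
        )
```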
## Platform-Specific Downloads

### Built-in Datasets

```python
# Scikit-learn
from sklearn.datasets import load_diabetes, fetch_openml
data = load_diabetes()

# HuggingFace
from datasets import load_dataset
dataset = load_dataset('imdb', cache_dir='./cache')

# TensorFlow/Keras
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```
### Kaggle

```python
# Setup: pip install kaggle, then place kaggle.json in ~/.kaggle/
import kaggle

kaggle.api.dataset_download_files('uciml/iris', path='data/', unzip=True)
```
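The Kaggle CLI that ships with the same package offers an equivalent one-liner:

```bash
kaggle datasets download -d uciml/iris -p data/ --unzip
```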
### Direct HTTP

```python
import requests
from pathlib import Path

response = requests.get(url, stream=True)
response.raise_for_status()
with open(output_path, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
```
### Google Drive

```python
# Setup: pip install gdown
import gdown

gdown.download(f'https://drive.google.com/uc?id={file_id}', output_path)
```
## Security Best Practices

### Store API Keys Securely

```python
import os
from dotenv import load_dotenv

# Use environment variables; never hardcode keys in source code!
load_dotenv()
api_key = os.getenv('KAGGLE_KEY')
```
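One possible `.env` layout for the snippet above (keys and values are placeholders; the Kaggle client also reads `KAGGLE_USERNAME`/`KAGGLE_KEY` from the environment):

```
# .env: listed in .gitignore, never committed
KAGGLE_USERNAME=your_username
KAGGLE_KEY=your_api_key
```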
### Add to .gitignore

```
.env
.secrets/
kaggle.json
data/
*.csv
*.zip
.cache/
```
## Common Patterns

### Pattern 1: Download and Extract

```python
import zipfile

# Download
download_with_retry(url, Path('temp.zip'))

# Extract
with zipfile.ZipFile('temp.zip', 'r') as zip_ref:
    zip_ref.extractall('data/')

# Cleanup
Path('temp.zip').unlink()
```
### Pattern 2: Batch Download

```python
from concurrent.futures import ThreadPoolExecutor

def download_batch(urls: list[str], output_dir: Path, max_workers: int = 5):
    """Download several URLs concurrently using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(download_file, url, output_dir) for url in urls]
        for future in futures:
            future.result()  # Re-raise any exception from a failed download
```
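`download_file` is not defined in this skill; a minimal per-URL helper the batch pattern could call (the name and behaviour here are assumptions), reusing the retry helper from above:

```python
def download_file(url: str, output_dir: Path) -> Path:
    """Stream one URL into output_dir, naming the file after the last URL segment."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / url.split('/')[-1]
    download_with_retry(url, output_path)
    return output_path
```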
### Pattern 3: Resume Partial Downloads

```python
def download_resumable(url: str, output_path: Path):
    """Resume a partial download with an HTTP Range request."""
    headers = {}
    mode = 'wb'
    if output_path.exists():
        existing_size = output_path.stat().st_size
        headers['Range'] = f'bytes={existing_size}-'
        mode = 'ab'
    response = requests.get(url, headers=headers, stream=True, timeout=30)
    response.raise_for_status()
    if mode == 'ab' and response.status_code != 206:
        mode = 'wb'  # Server ignored the Range header; restart from the beginning
    with open(output_path, mode) as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```
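Resuming only helps if the server honours `Range` requests; when it also reports `Content-Length`, a HEAD request can tell whether the local copy is already complete (a sketch under that assumption):

```python
def remote_size(url: str) -> int | None:
    """Return the remote file size from a HEAD request, or None if not reported."""
    head = requests.head(url, timeout=30, allow_redirects=True)
    head.raise_for_status()
    length = head.headers.get('Content-Length')
    return int(length) if length is not None else None

# url and output_path as in download_resumable above
expected = remote_size(url)
if expected is not None and output_path.exists() and output_path.stat().st_size >= expected:
    print("Already complete; skipping download")
else:
    download_resumable(url, output_path)
```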
## Validation Checklist

Before using downloaded data (a combined check is sketched after the list):
- File downloaded completely (check size > 0)
- Checksum verified (if available)
- File format is valid (can be opened/parsed)
- Data structure matches expectations
- Cached for future use
- API keys stored securely (not in code)
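A combined check covering the first four items, assuming a CSV payload and an optional MD5 (a sketch, not part of this skill's scripts; `verify_checksum` is the helper sketched in section 4):

```python
import pandas as pd
from pathlib import Path

def validate_download(path: Path, expected_md5: str | None = None) -> pd.DataFrame:
    """Run the basic integrity checks from the checklist and return the parsed data."""
    if not path.exists() or path.stat().st_size == 0:
        raise ValueError(f"Download failed or file is empty: {path}")
    if expected_md5:
        verify_checksum(path, expected_md5)
    df = pd.read_csv(path)  # Raises if the format is invalid
    if df.empty:
        raise ValueError(f"No rows parsed from {path}")
    return df
```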
## Common Issues

- Timeout errors → Increase the timeout: `requests.get(url, timeout=300)`
- SSL certificate error → Verify SSL: `requests.get(url, verify=True)`
- Rate limiting → Add delays with `time.sleep(1)` between requests (sketched below)
- Memory error (large files) → Use streaming: `response.iter_content(chunk_size=8192)`
- Partial download → Implement resume capability with `Range` headers
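A minimal polite-download loop for the rate-limiting case, reusing the cache helper so repeated runs skip finished files (the one-second delay mirrors the tip above; `urls` is a placeholder list):

```python
import time

for url in urls:
    download_with_cache(url)  # Cached files return immediately, no extra request
    time.sleep(1)             # Stay under the provider's rate limit
```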
## Helper Scripts

Use the provided scripts for common tasks:

```bash
# Download single file
python .skills/dev-data_download/scripts/download_file.py <url> <output>

# Download from Kaggle
python .skills/dev-data_download/scripts/download_kaggle.py <dataset-id>

# Batch download from URLs file
python .skills/dev-data_download/scripts/batch_download.py urls.txt
```
## References

- For detailed code examples: see `references/examples.md`
- For platform-specific guides: see `references/platforms.md`
- For API authentication setup: see `references/authentication.md`
## Quick Reference

```python
# Simple download
response = requests.get(url)
Path('data.csv').write_bytes(response.content)

# With retry and cache
download_with_cache(url, cache_dir=Path('.cache'))

# From Kaggle
kaggle.api.dataset_download_files('dataset-id', path='data/', unzip=True)

# From HuggingFace
dataset = load_dataset('dataset-name', cache_dir='./cache')

# From sklearn
from sklearn.datasets import load_iris
data = load_iris()
```