# Data Download & Acquisition

## Objectives
- Download datasets from various sources (HTTP, API, cloud platforms)
- Handle authentication and API keys securely
- Implement retry logic and resume capability
- Validate downloaded data integrity
- Cache downloads to avoid redundant requests
## Core Strategy

### 1. Choose the Right Method

Select the download method based on the source; a small dispatch sketch follows the list:
- Built-in libraries: Use sklearn, TensorFlow, PyTorch, or HuggingFace `datasets` when available (fastest, most reliable)
- Direct HTTP: For simple file URLs, use `requests` with streaming
- Platform APIs: Use official clients for Kaggle, HuggingFace, AWS S3
- Web scraping: Only when no API exists (see the `dev-web_scraping` skill)
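A minimal sketch of that decision as code, assuming the source can be recognized from its URL or an identifier prefix (the patterns and return labels here are illustrative, not part of this skill's scripts):

```python
def pick_download_method(source: str) -> str:
    """Illustrative dispatch from a source string to one of the methods above."""
    if source.startswith('kaggle:'):
        return 'platform-api'        # e.g. 'kaggle:uciml/iris' -> Kaggle client
    if 'drive.google.com' in source:
        return 'platform-api'        # gdown
    if source.startswith('s3://'):
        return 'platform-api'        # boto3 / official SDK
    if source.startswith(('http://', 'https://')):
        return 'direct-http'         # requests with streaming
    return 'built-in-loader'         # sklearn / keras / HuggingFace datasets
```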
### 2. Implement Reliability

Always include retry logic with exponential backoff:

```python
from pathlib import Path

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=4, max=10))
def download_with_retry(url: str, output_path: Path):
    """Stream the file to disk, retrying up to 3 times with exponential backoff."""
    response = requests.get(url, timeout=30, stream=True)
    response.raise_for_status()
    with open(output_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```
### 3. Cache Downloads

Avoid redundant downloads:

```python
def download_with_cache(url: str, cache_dir: Path = Path('.cache')) -> Path:
    """Return the cached copy if present; otherwise download and cache it."""
    filename = url.split('/')[-1]
    cache_path = cache_dir / filename
    if cache_path.exists():
        print(f"Using cached: {cache_path}")
        return cache_path
    cache_dir.mkdir(parents=True, exist_ok=True)
    download_with_retry(url, cache_path)
    return cache_path
```
### 4. Validate Data

After download, verify integrity:

```python
# Check file size
if output_path.stat().st_size == 0:
    raise ValueError("Downloaded file is empty")

# Verify checksum if available
if expected_md5:
    verify_checksum(output_path, expected_md5)

# Validate format
df = pd.read_csv(output_path)  # Will raise if invalid
```
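The snippet above calls `verify_checksum` without defining it; a minimal sketch using `hashlib` (the function name comes from the snippet, the body is an assumption):

```python
import hashlib
from pathlib import Path

def verify_checksum(path: Path, expected_md5: str, chunk_size: int = 8192) -> None:
    """Hash the file in chunks and fail loudly if it does not match the expected MD5."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        raise ValueError(
            f"Checksum mismatch for {path}: got {md5.hexdigest()}, expected {expected_md5}"
        )
```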
## Platform-Specific Downloads

### Built-in Datasets

```python
# Scikit-learn
from sklearn.datasets import load_diabetes, fetch_openml
data = load_diabetes()

# HuggingFace
from datasets import load_dataset
dataset = load_dataset('imdb', cache_dir='./cache')

# TensorFlow/Keras
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```
### Kaggle

```python
# Setup: pip install kaggle, then place kaggle.json in ~/.kaggle/
import kaggle

kaggle.api.dataset_download_files('uciml/iris', path='data/', unzip=True)
```
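The Kaggle CLI that ships with the same package offers an equivalent one-liner:

```bash
kaggle datasets download -d uciml/iris -p data/ --unzip
```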
### Direct HTTP

```python
import requests
from pathlib import Path

response = requests.get(url, stream=True)
response.raise_for_status()
with open(output_path, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
```
### Google Drive

```python
# Setup: pip install gdown
import gdown

gdown.download(f'https://drive.google.com/uc?id={file_id}', output_path)
```
## Security Best Practices

### Store API Keys Securely

```python
import os
from dotenv import load_dotenv

# Use environment variables; never hardcode keys in source code!
load_dotenv()
api_key = os.getenv('KAGGLE_KEY')
```
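One possible `.env` layout for the snippet above (keys and values are placeholders; the Kaggle client also reads `KAGGLE_USERNAME`/`KAGGLE_KEY` from the environment):

```
# .env: listed in .gitignore, never committed
KAGGLE_USERNAME=your_username
KAGGLE_KEY=your_api_key
```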
### Add to .gitignore

```
.env
.secrets/
kaggle.json
data/
*.csv
*.zip
.cache/
```
## Common Patterns

### Pattern 1: Download and Extract

```python
import zipfile

# Download
download_with_retry(url, Path('temp.zip'))

# Extract
with zipfile.ZipFile('temp.zip', 'r') as zip_ref:
    zip_ref.extractall('data/')

# Cleanup
Path('temp.zip').unlink()
```
### Pattern 2: Batch Download

```python
from concurrent.futures import ThreadPoolExecutor

def download_batch(urls: list[str], output_dir: Path, max_workers: int = 5):
    """Download several URLs concurrently using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(download_file, url, output_dir) for url in urls]
        for future in futures:
            future.result()  # Re-raise any exception from a failed download
```
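`download_file` is not defined in this skill; a minimal per-URL helper the batch pattern could call (the name and behaviour here are assumptions), reusing the retry helper from above:

```python
def download_file(url: str, output_dir: Path) -> Path:
    """Stream one URL into output_dir, naming the file after the last URL segment."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / url.split('/')[-1]
    download_with_retry(url, output_path)
    return output_path
```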
### Pattern 3: Resume Partial Downloads

```python
def download_resumable(url: str, output_path: Path):
    """Resume a partial download with an HTTP Range request."""
    headers = {}
    mode = 'wb'
    if output_path.exists():
        existing_size = output_path.stat().st_size
        headers['Range'] = f'bytes={existing_size}-'
        mode = 'ab'
    response = requests.get(url, headers=headers, stream=True, timeout=30)
    response.raise_for_status()
    if mode == 'ab' and response.status_code != 206:
        mode = 'wb'  # Server ignored the Range header; restart from the beginning
    with open(output_path, mode) as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```
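Resuming only helps if the server honours `Range` requests; when it also reports `Content-Length`, a HEAD request can tell whether the local copy is already complete (a sketch under that assumption):

```python
def remote_size(url: str) -> int | None:
    """Return the remote file size from a HEAD request, or None if not reported."""
    head = requests.head(url, timeout=30, allow_redirects=True)
    head.raise_for_status()
    length = head.headers.get('Content-Length')
    return int(length) if length is not None else None

# url and output_path as in download_resumable above
expected = remote_size(url)
if expected is not None and output_path.exists() and output_path.stat().st_size >= expected:
    print("Already complete; skipping download")
else:
    download_resumable(url, output_path)
```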
## Validation Checklist

Before using downloaded data (a combined check is sketched after the list):
- File downloaded completely (check size > 0)
- Checksum verified (if available)
- File format is valid (can be opened/parsed)
- Data structure matches expectations
- Cached for future use
- API keys stored securely (not in code)
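A combined check covering the first four items, assuming a CSV payload and an optional MD5 (a sketch, not part of this skill's scripts; `verify_checksum` is the helper sketched in section 4):

```python
import pandas as pd
from pathlib import Path

def validate_download(path: Path, expected_md5: str | None = None) -> pd.DataFrame:
    """Run the basic integrity checks from the checklist and return the parsed data."""
    if not path.exists() or path.stat().st_size == 0:
        raise ValueError(f"Download failed or file is empty: {path}")
    if expected_md5:
        verify_checksum(path, expected_md5)
    df = pd.read_csv(path)  # Raises if the format is invalid
    if df.empty:
        raise ValueError(f"No rows parsed from {path}")
    return df
```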
## Common Issues

- Timeout errors → Increase the timeout: `requests.get(url, timeout=300)`
- SSL certificate error → Verify SSL: `requests.get(url, verify=True)`
- Rate limiting → Add delays with `time.sleep(1)` between requests (sketched below)
- Memory error (large files) → Use streaming: `response.iter_content(chunk_size=8192)`
- Partial download → Implement resume capability with `Range` headers
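A minimal polite-download loop for the rate-limiting case, reusing the cache helper so repeated runs skip finished files (the one-second delay mirrors the tip above; `urls` is a placeholder list):

```python
import time

for url in urls:
    download_with_cache(url)  # Cached files return immediately, no extra request
    time.sleep(1)             # Stay under the provider's rate limit
```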
## Helper Scripts

Use the provided scripts for common tasks:

```bash
# Download single file
python .skills/dev-data_download/scripts/download_file.py <url> <output>

# Download from Kaggle
python .skills/dev-data_download/scripts/download_kaggle.py <dataset-id>

# Batch download from URLs file
python .skills/dev-data_download/scripts/batch_download.py urls.txt
```
## References

- For detailed code examples: see `references/examples.md`
- For platform-specific guides: see `references/platforms.md`
- For API authentication setup: see `references/authentication.md`
## Quick Reference

```python
# Simple download
response = requests.get(url)
Path('data.csv').write_bytes(response.content)

# With retry and cache
download_with_cache(url, cache_dir=Path('.cache'))

# From Kaggle
kaggle.api.dataset_download_files('dataset-id', path='data/', unzip=True)

# From HuggingFace
dataset = load_dataset('dataset-name', cache_dir='./cache')

# From sklearn
from sklearn.datasets import load_iris
data = load_iris()
```