google-cloud-configs

Use when:

  • Setting up BigQuery ML for SQL-based machine learning
  • Configuring Vertex AI custom training jobs
  • Setting up GCP authentication for ML workflows
  • Selecting appropriate GPU/TPU configurations
  • Estimating costs for GCP ML training
  • Deploying models to Vertex AI endpoints
  • Configuring distributed training on GCP
  • Optimizing cost vs performance for cloud ML

Platform Overview

BigQuery ML

What it is: SQL-based machine learning run directly inside BigQuery.

Best for:

  • Quick ML prototypes using existing data warehouse data
  • Classification, regression, forecasting on structured data
  • Users familiar with SQL but not Python/ML frameworks
  • Large-scale batch predictions

Available Models:

  • Linear/Logistic Regression
  • XGBoost (BOOSTED_TREE)
  • Deep Neural Networks (DNN)
  • AutoML Tables
  • TensorFlow/PyTorch imported models

Pricing:

  • Based on data processed (same as BigQuery queries)
  • $5 per TB processed for analysis
  • AutoML: $19.32/hour for training
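
For orientation, training a model is a single SQL statement billed like any other query. A minimal sketch via the google-cloud-bigquery client; the project, dataset, and table names are placeholders:

```python
# Minimal sketch: train a BigQuery ML model from Python.
# Project/dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

create_model_sql = """
CREATE OR REPLACE MODEL `my-project.mydataset.duration_model`
OPTIONS (
  model_type = 'BOOSTED_TREE_REGRESSOR',
  input_label_cols = ['trip_duration']
) AS
SELECT trip_distance, passenger_count, trip_duration
FROM `my-project.mydataset.trips`
WHERE trip_duration IS NOT NULL
"""

# CREATE MODEL runs as a regular query job; billing follows bytes processed.
client.query(create_model_sql).result()
```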

Vertex AI Training

What it is: A fully managed ML training platform.

Best for:

  • Custom PyTorch/TensorFlow training
  • Large-scale distributed training
  • GPU/TPU-accelerated workloads
  • Production ML pipelines

Available Compute:

  • CPUs: n1-standard, n1-highmem, n1-highcpu
  • GPUs: NVIDIA T4, P4, V100, P100, A100, L4
  • TPUs: v2, v3, v4, v5e (8 cores to 512 cores)

Pricing:

  • CPU: $0.05-0.30/hour depending on machine type
  • GPU T4: $0.35/hour
  • GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
  • TPU v3: $8.00/hour (8 cores)
  • TPU v4: $11.00/hour (8 cores)

GPU/TPU Selection Guide

GPU Selection (Vertex AI)

T4 (16GB VRAM):

  • Use case: Inference, light training, small models
  • Cost: $0.35/hour
  • Good for: BERT-base, small CNNs, inference serving

V100 (16GB VRAM):

  • Use case: Mid-size training, mixed precision training
  • Cost: $2.48/hour
  • Good for: ResNet training, medium transformers

A100 (40GB/80GB VRAM):

  • Use case: Large model training, distributed training
  • Cost: $3.67/hour (40GB), $4.95/hour (80GB)
  • Good for: GPT-style models, large vision models, multi-GPU training

L4 (24GB VRAM):

  • Use case: Modern alternative to T4, better performance
  • Cost: $0.66/hour
  • Good for: Mid-size models, efficient inference

TPU Selection (Vertex AI)

TPU v2 (8 cores):

  • Use case: TensorFlow/JAX training, matrix operations
  • Cost: $4.50/hour
  • Memory: 8GB per core (64GB total)
  • Good for: Legacy TensorFlow models

TPU v3 (8 cores):

  • Use case: Standard TPU training
  • Cost: $8.00/hour
  • Memory: 16GB per core (128GB total)
  • Good for: BERT, T5, image classification

TPU v4 (8 cores):

  • Use case: Latest generation, best performance
  • Cost: $11.00/hour
  • Memory: 32GB per core (256GB total)
  • Good for: Large language models, cutting-edge research

TPU v5e (8 cores):

  • Use case: Cost-optimized TPU
  • Cost: $2.50/hour
  • Good for: Development and budget-conscious training at scale

Multi-node TPU Pods:

  • v3-32: 32 cores, $32/hour
  • v3-128: 128 cores, $128/hour
  • v4-128: 128 cores, $176/hour
  • Use for: Massive distributed training (GPT-3 scale)

Usage

Setup BigQuery ML Environment

bash scripts/setup-bigquery-ml.sh

Prompts for:

  • GCP Project ID
  • BigQuery dataset name
  • Service account credentials
  • Default model type preference

Creates:

  • bigquery_config.json - Project configuration
  • .bigqueryrc - CLI configuration
  • Example training SQL in examples/

Setup Vertex AI Training Environment

bash scripts/setup-vertex-ai.sh

Prompts for:

  • GCP Project ID
  • Region (us-central1, europe-west4, etc.)
  • Service account credentials
  • Default machine type
  • GPU/TPU preference

Creates:

  • vertex_config.yaml - Training job configuration
  • vertex_requirements.txt - Python dependencies
  • Training script template

Configure GCP Authentication

bash scripts/configure-auth.sh

Prompts for:

  • Authentication method (service account, user account, workload identity)
  • Service account key path (if applicable)
  • IAM roles needed

What it does:

  • Creates .gcp_auth_config - Authentication configuration
  • Sets the GOOGLE_APPLICATION_CREDENTIALS environment variable
  • Validates the granted permissions

Required IAM Roles:

  • BigQuery ML: roles/bigquery.dataEditor, roles/bigquery.jobUser
  • Vertex AI: roles/aiplatform.user, roles/storage.objectAdmin
  • Both: roles/serviceusage.serviceUsageConsumer
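
The validation step can also be reproduced by hand. A minimal sketch using the google-auth library, assuming credentials are already configured (the scope shown is the standard cloud-platform scope):

```python
# Minimal sketch: verify that Application Default Credentials resolve.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key,
# or that `gcloud auth application-default login` has been run.
import google.auth
from google.auth.transport.requests import Request

credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(Request())  # forces a token fetch; raises if auth is broken
print(f"Authenticated against project: {project}")
```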

Estimate GCP Training Costs

bash scripts/estimate-gcp-cost.sh

Interactive prompts:

  • Platform: BigQuery ML or Vertex AI
  • If BigQuery ML: Data size to process
  • If Vertex AI:
    • Machine type (CPU/GPU/TPU)
    • Number of machines
    • Training duration estimate
    • Storage requirements

Output:

  • Estimated compute cost
  • Storage cost
  • Data transfer cost (if applicable)
  • Total estimated cost
  • Cost comparison with other GCP options
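
For a quick sanity check without running the script, the core arithmetic is straightforward. A sketch using the on-demand hourly rates quoted earlier; treat the rates as illustrative, since GCP pricing changes:

```python
# Back-of-envelope Vertex AI accelerator cost, using the hourly rates above.
GPU_RATES = {"T4": 0.35, "V100": 2.48, "A100-40GB": 3.67, "A100-80GB": 4.95}

def estimate_cost(gpu: str, gpus_per_machine: int, machines: int, hours: float) -> float:
    """Accelerator cost only; add machine, storage, and egress separately."""
    return GPU_RATES[gpu] * gpus_per_machine * machines * hours

# Example: 4x A100 (40GB) on one machine for 24 hours.
print(f"${estimate_cost('A100-40GB', 4, 1, 24):,.2f}")  # -> $352.32
```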

Templates

BigQuery ML Training Template (templates/bigquery_ml_training.sql)

SQL template for creating and training models:

  • Model creation syntax
  • Feature engineering examples
  • Training options (L1/L2 reg, learning rate, etc.)
  • Evaluation queries
  • Prediction queries

Supported model types:

  • LINEAR_REG, LOGISTIC_REG
  • BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR
  • DNN_CLASSIFIER, DNN_REGRESSOR
  • AUTOML_CLASSIFIER, AUTOML_REGRESSOR
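
To make the template's shape concrete, here is a sketch of a CREATE MODEL statement with a TRANSFORM clause and training options, held in a Python string for programmatic use; all table and column names are placeholders, and the actual template may differ:

```python
# Sketch of the statement structure (names are placeholders).
TRAINING_SQL = """
CREATE OR REPLACE MODEL `project.dataset.model_name`
TRANSFORM (
  ML.STANDARD_SCALER(numeric_feature) OVER () AS numeric_feature_scaled,
  category_feature,                      -- categorical columns are one-hot encoded
  label
)
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  l1_reg = 0.1,
  l2_reg = 0.1,
  learn_rate = 0.05,
  early_stop = TRUE
) AS
SELECT numeric_feature, category_feature, label
FROM `project.dataset.training_table`
"""
```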

Vertex AI Training Job Template (templates/vertex_training_job.py)

Python template for custom training:

  • Training loop structure
  • Distributed training setup (PyTorch DDP)
  • Checkpointing and model saving
  • Metrics logging to Vertex AI
  • Hyperparameter tuning integration

Includes:

  • Single GPU training
  • Multi-GPU training (DataParallel, DistributedDataParallel)
  • TPU training with PyTorch/XLA
  • Cloud Storage integration

GPU Configuration Template (templates/vertex_gpu_config.yaml)

YAML configuration for GPU training jobs:

  • Machine type selection
  • GPU type and count
  • Disk configuration
  • Network configuration
  • Environment variables

Presets included:

  • Single T4 (budget)
  • Single A100 (standard)
  • 4x A100 (distributed)
  • 8x A100 (large-scale)
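
The same presets can be expressed programmatically. A sketch of submitting the single-T4 preset through the google-cloud-aiplatform SDK; project, bucket, and image names are placeholders, and the template's exact YAML fields may differ:

```python
# Sketch: submit the "single T4 (budget)" preset via the Python SDK.
# Project, bucket, and container image are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomJob(
    display_name="t4-budget-training",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    }],
)
job.run()  # blocks until the job finishes; use job.submit() to return immediately
```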

TPU Configuration Template (templates/vertex_tpu_config.yaml)

YAML configuration for TPU training jobs:

  • TPU type and topology
  • TPU version selection
  • JAX/TensorFlow runtime
  • XLA compilation flags

Presets included:

  • v3-8 (single TPU)
  • v4-32 (TPU pod slice)
  • v5e-8 (cost-optimized)
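
For comparison, Vertex AI expresses TPU hardware through the same worker pool spec format. A sketch for the v3-8 preset; the image name is a placeholder:

```python
# Sketch: worker pool spec for a v3-8 TPU, per Vertex AI's spec format.
tpu_worker_pool_specs = [{
    "machine_spec": {
        "machine_type": "cloud-tpu",     # dedicated machine type for TPU jobs
        "accelerator_type": "TPU_V3",
        "accelerator_count": 8,          # 8 cores = one v3-8 host
    },
    "replica_count": 1,
    "container_spec": {"image_uri": "gcr.io/my-project/tpu-trainer:latest"},
}]
```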

GCP Authentication Template (templates/gcp_auth.json)

Service account configuration template:

  • Project ID
  • Service account email
  • Key file path
  • Required scopes
  • IAM role assignments

Security notes:

  • Uses placeholders only (never real keys)
  • Documents how to create service accounts
  • Includes .gitignore protection

Examples

BigQuery ML Regression Example (examples/bigquery-regression-example.sql)

Complete example:

  • Dataset: NYC taxi trip data
  • Task: Predict trip duration
  • Model: BOOSTED_TREE_REGRESSOR
  • Includes feature engineering, training, evaluation

Demonstrates:

  • CREATE MODEL syntax
  • TRANSFORM clause for feature engineering
  • Model evaluation with ML.EVALUATE
  • Batch predictions
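
The evaluation and prediction steps follow the same query pattern. A sketch with placeholder names, assuming the regression model trained in the earlier sketch:

```python
# Sketch: evaluate the trained model, then run a batch prediction.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

eval_rows = client.query("""
    SELECT * FROM ML.EVALUATE(MODEL `my-project.mydataset.duration_model`)
""").result()
for row in eval_rows:
    print(dict(row))  # r2_score, mean_absolute_error, etc. for a regressor

# ML.PREDICT adds a predicted_<label> column to the input rows.
predictions = client.query("""
    SELECT predicted_trip_duration, trip_distance
    FROM ML.PREDICT(
        MODEL `my-project.mydataset.duration_model`,
        (SELECT trip_distance, passenger_count
         FROM `my-project.mydataset.new_trips`)
    )
""").result()
```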

Vertex AI PyTorch Training Example (examples/vertex-pytorch-training.py)

Complete training script:

  • Dataset: IMDB sentiment analysis
  • Model: DistilBERT fine-tuning
  • Training: Single GPU
  • Logging: Vertex AI experiments

Demonstrates:

  • Loading data from GCS
  • Training loop with mixed precision
  • Checkpointing to GCS
  • Metrics logging
  • Model export to Vertex AI
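
The mixed-precision step uses the standard torch.cuda.amp pattern. A compressed sketch with a toy model; the bucket and blob paths are placeholders:

```python
# Sketch: one mixed-precision step plus checkpointing to Cloud Storage.
# Bucket and blob names are placeholders.
import torch
from google.cloud import storage

device = "cuda"
model = torch.nn.Linear(10, 2).to(device)   # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 10, device=device)
y = torch.randint(0, 2, (32,), device=device)

with torch.cuda.amp.autocast():              # FP16 forward pass
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                # scaled to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()

# Save locally, then copy the checkpoint into GCS.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, "/tmp/ckpt.pt")
storage.Client().bucket("my-bucket") \
    .blob("checkpoints/ckpt.pt").upload_from_filename("/tmp/ckpt.pt")
```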

Vertex AI Distributed Training Example (examples/vertex-distributed-training.py)

Multi-GPU training example:

  • Dataset: ImageNet subset
  • Model: ResNet-50
  • Training: 4x A100 with DDP
  • Scaling: Linear scaling rule

Demonstrates:

  • PyTorch DistributedDataParallel
  • Gradient accumulation
  • Learning rate scaling
  • Synchronized batch norm
  • Multi-node coordination
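
The linear scaling rule referenced above is plain arithmetic. A sketch with illustrative base values:

```python
# Sketch: linear LR scaling with gradient accumulation (illustrative values).
base_lr = 0.1          # tuned for a reference batch size of 256
base_batch = 256
world_size = 4         # e.g. 4x A100 with DDP
per_gpu_batch = 128
accum_steps = 2        # gradient accumulation multiplies the effective batch

effective_batch = per_gpu_batch * world_size * accum_steps  # 1024
scaled_lr = base_lr * effective_batch / base_batch          # 0.4

print(effective_batch, scaled_lr)
```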

Hugging Face Fine-tuning on Vertex AI (examples/vertex-huggingface-finetuning.py)

Production fine-tuning template:

  • Dataset: Custom text classification
  • Model: BERT/RoBERTa/DeBERTa
  • Training: Hugging Face Trainer API
  • Deployment: Vertex AI endpoint

Demonstrates:

  • Hugging Face Trainer integration
  • Hyperparameter tuning with Vertex AI
  • Model versioning
  • Endpoint deployment
  • Online predictions
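
The Trainer wiring follows the standard transformers pattern. A compressed sketch; the checkpoint name and the IMDB stand-in dataset are placeholders for the custom setup:

```python
# Sketch: standard Hugging Face Trainer setup for text classification.
# Model checkpoint and dataset are placeholders for the custom ones.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")  # stand-in for the custom dataset
tokenized = dataset.map(lambda b: tokenizer(b["text"], truncation=True),
                        batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="/tmp/out",
                           per_device_train_batch_size=16,
                           num_train_epochs=3, fp16=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables padded batching via the default collator
)
trainer.train()
trainer.save_model("/tmp/out/final")  # then upload to GCS / register in Vertex AI
```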

Cost Optimization Tips

BigQuery ML

Reduce data processed:

  • Use partitioned tables
  • Filter data in WHERE clause before training
  • Use table sampling for experimentation
  • Cache intermediate results

Use appropriate model types:

  • Start with LINEAR_REG/LOGISTIC_REG (cheapest)
  • Use BOOSTED_TREE for better accuracy at moderate cost
  • Reserve AutoML for when simpler models fail

Optimize queries:

  • Avoid SELECT * (specify columns)
  • Use clustering on filter columns
  • Materialize views for repeated training

Vertex AI

Machine type selection:

  • Start with CPU for prototyping
  • Use T4 for small models (cheapest GPU)
  • Use A100 only for large models that need it
  • Consider TPU v5e for TensorFlow/JAX (very cost-effective)

Training optimization:

  • Use preemptible instances (60-70% cheaper, can be interrupted)
  • Enable automatic checkpoint/resume for preemptible
  • Use mixed precision training (FP16/BF16) for faster training
  • Profile to eliminate CPU bottlenecks

Storage optimization:

  • Store datasets in Cloud Storage (cheaper than persistent disk)
  • Use Filestore only if needed for POSIX filesystem
  • Clean up old model artifacts
  • Use lifecycle policies to archive old data

Multi-GPU efficiency:

  • Ensure near-linear scaling before adding more GPUs
  • Profile inter-GPU communication
  • Use gradient accumulation instead of larger batch sizes
  • Consider 2x GPUs instead of 1x larger GPU (often same cost, better availability)

Integration with ML Training Plugin

This skill integrates with other ml-training components:

  • training-patterns: Provides GCP configs for generated training scripts
  • cost-calculator: Uses GCP pricing data for budget planning
  • monitoring-dashboard: Integrates with Vertex AI TensorBoard
  • validation-scripts: Validates GCP credentials and permissions
  • integration-helpers: Deploys trained models to Vertex AI endpoints

Common Workflows

Workflow 1: Quick BigQuery ML Prototype

  1. Run bash scripts/setup-bigquery-ml.sh
  2. Copy templates/bigquery_ml_training.sql to your project
  3. Modify SQL for your dataset and features
  4. Run training query in BigQuery console
  5. Evaluate with built-in ML.EVALUATE()
  6. Export predictions with ML.PREDICT()

Time: 30 minutes setup + training time
Cost: $5 per TB of data processed

Workflow 2: Custom PyTorch Training on Vertex AI

  1. Run bash scripts/configure-auth.sh
  2. Run bash scripts/setup-vertex-ai.sh
  3. Copy templates/vertex_training_job.py
  4. Customize training loop for your model
  5. Copy templates/vertex_gpu_config.yaml
  6. Submit job: gcloud ai custom-jobs create ...
  7. Monitor in Vertex AI console

Time: 1 hour setup + training time
Cost: Depends on GPU/TPU selection

Workflow 3: Large-Scale Distributed Training

  1. Setup Vertex AI (workflow 2)
  2. Copy examples/vertex-distributed-training.py
  3. Adapt for your model architecture
  4. Test locally with 1 GPU
  5. Test with 2 GPUs to verify scaling
  6. Scale to 4-8 GPUs for full training
  7. Use preemptible instances with checkpointing

Time: 2-4 hours setup + training time
Cost: $15-60/hour depending on GPU count

Troubleshooting

BigQuery ML Issues

"Insufficient permissions":

  • Verify roles/bigquery.dataEditor and roles/bigquery.jobUser
  • Check dataset-level permissions
  • Ensure billing is enabled

"Model training failed":

  • Check for NULL values in features
  • Verify data types match model expectations
  • Review feature engineering TRANSFORM clause
  • Check for sufficient training data

Vertex AI Issues

"Service account lacks permissions":

  • Verify roles/aiplatform.user
  • Add roles/storage.objectAdmin for GCS access
  • Check project-level IAM policies

"GPU/TPU quota exceeded":

  • Request quota increase in GCP console
  • Use different region with availability
  • Start with smaller GPU/TPU configuration
  • Use preemptible instances (separate quota)

"Training job crashes":

  • Check for CUDA OOM (reduce batch size)
  • Verify dependencies in requirements.txt
  • Review logs in Cloud Logging
  • Test locally before submitting to Vertex

Security Best Practices

Credentials Management

DO:

  • ✅ Use service accounts with minimal permissions
  • ✅ Store credentials in Secret Manager
  • ✅ Use Workload Identity for GKE deployments
  • ✅ Rotate service account keys regularly
  • ✅ Add .gitignore for *.json key files

DON'T:

  • ❌ Hardcode credentials in code
  • ❌ Commit service account keys to git
  • ❌ Use overly permissive roles (e.g., Owner)
  • ❌ Share service account keys across projects
  • ❌ Use personal credentials for production

IAM Best Practices

  • Use separate service accounts for training vs serving
  • Grant roles at resource level, not project level when possible
  • Use Workload Identity Federation instead of keys when possible
  • Enable Cloud Audit Logs for ML API usage
  • Review IAM permissions quarterly

Performance Benchmarks

BigQuery ML vs Vertex AI

BigQuery ML:

  • Best for: Structured data, SQL users, quick prototypes
  • Training time: Minutes to hours (depends on data size)
  • Scalability: Automatic (serverless)
  • Cost: $5/TB processed

Vertex AI Custom Training:

  • Best for: Deep learning, custom architectures, GPU/TPU workloads
  • Training time: Hours to days (configurable hardware)
  • Scalability: Manual (choose machine type)
  • Cost: $0.35-20/hour depending on hardware

Rule of thumb:

  • Use BigQuery ML for tabular data with < 100M rows
  • Use Vertex AI for images, text, audio, or custom models
  • Use Vertex AI for models requiring GPU/TPU acceleration
