Data Analysis

Expert guidance for data analysis and data science workflows. Provides tool selection, workflow patterns, and curated resources across the full data stack.

Tool Selection Quick Reference

Data Manipulation

Need	Recommended Tool	Alternative
General tabular data	Pandas	Polars (faster, multithreaded)
Large datasets (out-of-core)	Polars, Dask	Vaex, Modin
GPU-accelerated DataFrames	cuDF (RAPIDS)	CuPy (NumPy on GPU)
Parallel pandas operations	Pandarallel	Modin
Cross-engine portability	Fugue (Pandas/Spark/Dask)	-
Fuzzy string matching	TheFuzz	-
Date/time handling	Pendulum, Arrow	DateUtil

Automated EDA

Need	Recommended Tool
One-line auto visualization	AutoViz
EDA with dataset comparison	Sweetviz
Data quality profiling	YData Profiling
Missing data patterns	Missingno
Interactive GUI analysis	D-Tale, PyGWalker
Conversational analysis (LLM)	PandasAI

Data Quality & Validation

Need	Recommended Tool
Outlier/anomaly detection	PyOD, Alibi Detect
Schema validation (DataFrames)	Pandera
Schema validation (general)	Pydantic, Cerberus
Data pipeline testing	Great Expectations

Feature Engineering

Need	Recommended Tool
Automated feature engineering	FeatureTools
Scikit-learn compatible FE	Feature Engine
Dimensionality reduction (PCA/MCA)	Prince
Categorical encoding	Category Encoders
Imbalanced datasets	Imbalanced Learn
Distribution fitting	Fitter

Visualization

Need	Recommended Tool	Alternative
Static plots	Matplotlib	Plotnine (ggplot2-style)
Statistical visualization	Seaborn	-
Interactive plots	Plotly	Bokeh, Altair
Geographic/maps	Folium, GeoPandas	QGIS, OSMnx
Large dataset rendering	Datashader	Deck.gl
3D/scientific	VisPy	Glumpy
No-code visualization	Flourish	-
Chart selection guide	From Data to Viz	Data Viz Catalogue

Dashboards & BI

Need	Recommended Tool
Quick data apps	Streamlit
Production dashboards	Dash (Plotly)
ML demo interfaces	Gradio
Notebook to web app	Voila
Full-stack Python web	Reflex, Taipy
Enterprise BI	Tableau, Power BI
Open-source BI	Apache Superset, Metabase
Monitoring dashboards	Grafana

SQL & Databases

Need	Recommended Tool
In-process analytics	DuckDB
Python ORM	SQLAlchemy
PostgreSQL adapter	Psycopg2
SQL linting/formatting	SQLFluff
SQL transpilation	SQLGlot
Natural language to SQL	Vanna.AI
Time-series database	TimescaleDB, TDengine
Database GUI	DBeaver, Beekeeper Studio

Statistics & Probability

Need	Recommended Tool
General scientific computing	SciPy
Statistical modeling	Statsmodels
Bayesian modeling	PyMC, NumPyro
User-friendly stats	Pingouin
Survival analysis	Lifelines, scikit-survival
Causal inference	DoWhy, CausalImpact
Bayesian visualization	ArviZ
GAMs	PyGAM

Time Series

Need	Recommended Tool
General forecasting	Facebook Prophet
Bayesian forecasting	Uber Orbit
ML time series framework	sktime
Deep learning forecasting	PyTorch Forecasting, GluonTS
Zero-shot forecasting	TimesFM (Google)
Feature extraction	TSFresh
ARIMA modeling	pmdarima

Machine Learning

Need	Recommended Tool
Classical ML	Scikit-learn
Gradient boosting	XGBoost, LightGBM, CatBoost
AutoML	H2O
GPU-accelerated ML	cuML (RAPIDS)
Model explainability	SHAP, InterpretML
Hyperparameter tuning	Optuna

Deep Learning

Need	Recommended Tool
Research & production	PyTorch
High-level API	Keras
Fast prototyping	Fast.ai
Transformers/NLP	HuggingFace Transformers
Object detection	Ultralytics (YOLOv8)
Model interoperability	ONNX
Efficient fine-tuning	PEFT, Unsloth
Graph neural networks	PyTorch Geometric

Data Engineering

Need	Recommended Tool
SQL transformations	dbt-core
Workflow orchestration	Apache Airflow, Dagster, Prefect
Distributed processing	Apache Spark
Event streaming	Apache Kafka
Table format (lakehouse)	Apache Iceberg, Delta Lake
Distributed SQL queries	Trino
Data lineage	OpenLineage, DataHub
Reproducible pipelines	Kedro

MLOps

Need	Recommended Tool
Experiment tracking	MLflow, Wandb
Data/model versioning	DVC
Model/data drift monitoring	Evidently
Model serving	BentoML, KServe
LLM inference (high-throughput)	vLLM
Unified LLM API	LiteLLM
ML on Kubernetes	Kubeflow
Feature store	Feast

Web Scraping

Need	Recommended Tool
Simple HTTP requests	Requests, HTTPX
HTML parsing	BeautifulSoup
Browser automation	Playwright, Selenium
Full crawling framework	Scrapy
AI-powered scraping	ScrapeGraph AI, Crawl4AI
Auto-detect patterns	AutoScraper
Text extraction from web	Trafilatura

NLP

Need	Recommended Tool
Production NLP pipeline	SpaCy
Text preprocessing	NLTK, TextBlob
Sentence embeddings	SentenceTransformers
Topic modeling	Gensim
Transformer models	HuggingFace Transformers
Conversational AI	Rasa
Adversarial NLP testing	TextAttack

Data Analysis Workflow

1. Data Acquisition

Files: OpenPyXL (Excel), PyPDF2/Camelot (PDF), CleverCSV (messy CSV)
Web: Requests/HTTPX + BeautifulSoup, or Scrapy/Crawl4AI for large-scale
Databases: SQLAlchemy + DuckDB for analytics, Psycopg2 for PostgreSQL
APIs: HTTPX (async), Requests-cache (with caching)
Synthetic: Faker, Mimesis for test data generation

2. Exploration & Profiling

Run YData Profiling or Sweetviz for automated EDA
Use Missingno to visualize missing data patterns
Use D-Tale or PyGWalker for interactive exploration
For quick auto-visualization: AutoViz

3. Cleaning & Validation

Pandera for DataFrame schema validation
Great Expectations for data pipeline testing
PyOD for outlier detection
Pandas DQ for automatic type correction and cleaning

4. Feature Engineering

FeatureTools for automated feature engineering
Category Encoders for categorical variables
Prince for dimensionality reduction (PCA/MCA)
Imbalanced Learn for handling class imbalance

5. Analysis & Modeling

Statistical: SciPy, Statsmodels, Pingouin
Classical ML: Scikit-learn, XGBoost/LightGBM/CatBoost
Deep Learning: PyTorch + relevant framework
Time Series: Prophet, sktime, pmdarima
NLP: SpaCy, HuggingFace Transformers
Hyperparameter tuning: Optuna

6. Visualization & Reporting

Matplotlib/Seaborn for static publication plots
Plotly/Altair for interactive visualizations
Streamlit/Dash for data apps and dashboards
Gradio for ML model demos

7. Deployment & Monitoring

MLflow/Wandb for experiment tracking
DVC for data versioning
BentoML/KServe for model serving
Evidently for drift monitoring

Reference Files

Read these for detailed tool lists and learning resources:

File	Contents	Read when...
`references/python-ecosystem.md`	Python tools: data processing, EDA, validation, feature engineering, specialized tools	Working with Python data libraries
`references/sql-databases.md`	SQL resources, database tools, drivers, learning materials	Working with SQL or databases
`references/visualization-dashboards.md`	Visualization tools, dashboard frameworks, BI software	Building charts, dashboards, or data apps
`references/statistics-ml.md`	Statistics, ML, deep learning, NLP, time series, AI tools	Doing statistical analysis or building models
`references/data-engineering.md`	Data pipelines, MLOps, cloud infrastructure, web scraping	Building data pipelines or deploying models
`references/learning-resources.md`	Courses, datasets, cheatsheets, interview prep, career	Looking for learning materials or datasets

File Processing Quick Reference

Format	Library
Excel (.xlsx)	OpenPyXL, Xlwings
CSV (messy)	CleverCSV
PDF (text)	PyPDF2, PyMuPDF
PDF (tables)	Camelot
Word (.docx)	Python-docx
HTML to Markdown	Python-markdownify, MarkItDown
XML	Xmltodict
JSON querying	JmesPath, jq (CLI)
YAML	yq (CLI)
Tabular export	Tablib (XLSX/JSON/CSV)

Command-Line Data Tools

Tool	Purpose
jq	JSON processor
yq	YAML/XML processor
q	Run SQL on CSV/TSV files
VisiData	Interactive tabular data explorer
csvkit	CSV manipulation suite
Miller	Multi-format data processor (CSV/JSON/etc.)
DuckDB CLI	SQL analytics on files
hyperfine	Benchmarking
termgraph	Terminal-based graphs

data-analysis