Data Analysis

Expert guidance for data analysis and data science workflows: tool selection, workflow patterns, and curated resources across the full data stack.

Tool Selection Quick Reference

Data Manipulation

| Need | Recommended Tool | Alternative |
| --- | --- | --- |
| General tabular data | Pandas | Polars (faster, multithreaded) |
| Large datasets (out-of-core) | Polars, Dask | Vaex, Modin |
| GPU-accelerated DataFrames | cuDF (RAPIDS) | CuPy (NumPy on GPU) |
| Parallel pandas operations | Pandarallel | Modin |
| Cross-engine portability | Fugue (Pandas/Spark/Dask) | - |
| Fuzzy string matching | TheFuzz | - |
| Date/time handling | Pendulum, Arrow | DateUtil |

Automated EDA

| Need | Recommended Tool |
| --- | --- |
| One-line auto visualization | AutoViz |
| EDA with dataset comparison | Sweetviz |
| Data quality profiling | YData Profiling |
| Missing data patterns | Missingno |
| Interactive GUI analysis | D-Tale, PyGWalker |
| Conversational analysis (LLM) | PandasAI |

Data Quality & Validation

| Need | Recommended Tool |
| --- | --- |
| Outlier/anomaly detection | PyOD, Alibi Detect |
| Schema validation (DataFrames) | Pandera |
| Schema validation (general) | Pydantic, Cerberus |
| Data pipeline testing | Great Expectations |

Feature Engineering

| Need | Recommended Tool |
| --- | --- |
| Automated feature engineering | FeatureTools |
| Scikit-learn compatible FE | Feature Engine |
| Dimensionality reduction (PCA/MCA) | Prince |
| Categorical encoding | Category Encoders |
| Imbalanced datasets | Imbalanced Learn |
| Distribution fitting | Fitter |

Visualization

| Need | Recommended Tool | Alternative |
| --- | --- | --- |
| Static plots | Matplotlib | Plotnine (ggplot2-style) |
| Statistical visualization | Seaborn | - |
| Interactive plots | Plotly | Bokeh, Altair |
| Geographic/maps | Folium, GeoPandas | QGIS, OSMnx |
| Large dataset rendering | Datashader | Deck.gl |
| 3D/scientific | VisPy | Glumpy |
| No-code visualization | Flourish | - |
| Chart selection guide | From Data to Viz | Data Viz Catalogue |

Dashboards & BI

| Need | Recommended Tool |
| --- | --- |
| Quick data apps | Streamlit |
| Production dashboards | Dash (Plotly) |
| ML demo interfaces | Gradio |
| Notebook to web app | Voila |
| Full-stack Python web | Reflex, Taipy |
| Enterprise BI | Tableau, Power BI |
| Open-source BI | Apache Superset, Metabase |
| Monitoring dashboards | Grafana |

SQL & Databases

| Need | Recommended Tool |
| --- | --- |
| In-process analytics | DuckDB |
| Python ORM | SQLAlchemy |
| PostgreSQL adapter | Psycopg2 |
| SQL linting/formatting | SQLFluff |
| SQL transpilation | SQLGlot |
| Natural language to SQL | Vanna.AI |
| Time-series database | TimescaleDB, TDengine |
| Database GUI | DBeaver, Beekeeper Studio |
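
To illustrate what "in-process analytics" means in practice, here is a minimal sketch that runs an aggregate query inside the Python process with no database server. It uses the standard-library `sqlite3` module purely as a dependency-free stand-in, not DuckDB itself; the `sales` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory database lives and dies with this Python process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],  # hypothetical rows
)

# Aggregate in SQL rather than in Python loops.
totals = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
conn.close()

print(totals)
```

The same query shape carries over to an analytical engine; only the connection setup would differ.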

Statistics & Probability

| Need | Recommended Tool |
| --- | --- |
| General scientific computing | SciPy |
| Statistical modeling | Statsmodels |
| Bayesian modeling | PyMC, NumPyro |
| User-friendly stats | Pingouin |
| Survival analysis | Lifelines, scikit-survival |
| Causal inference | DoWhy, CausalImpact |
| Bayesian visualization | ArviZ |
| GAMs | PyGAM |

Time Series

| Need | Recommended Tool |
| --- | --- |
| General forecasting | Facebook Prophet |
| Bayesian forecasting | Uber Orbit |
| ML time series framework | sktime |
| Deep learning forecasting | PyTorch Forecasting, GluonTS |
| Zero-shot forecasting | TimesFM (Google) |
| Feature extraction | TSFresh |
| ARIMA modeling | pmdarima |

Machine Learning

| Need | Recommended Tool |
| --- | --- |
| Classical ML | Scikit-learn |
| Gradient boosting | XGBoost, LightGBM, CatBoost |
| AutoML | H2O |
| GPU-accelerated ML | cuML (RAPIDS) |
| Model explainability | SHAP, InterpretML |
| Hyperparameter tuning | Optuna |

Deep Learning

| Need | Recommended Tool |
| --- | --- |
| Research & production | PyTorch |
| High-level API | Keras |
| Fast prototyping | Fast.ai |
| Transformers/NLP | HuggingFace Transformers |
| Object detection | Ultralytics (YOLOv8) |
| Model interoperability | ONNX |
| Efficient fine-tuning | PEFT, Unsloth |
| Graph neural networks | PyTorch Geometric |

Data Engineering

| Need | Recommended Tool |
| --- | --- |
| SQL transformations | dbt-core |
| Workflow orchestration | Apache Airflow, Dagster, Prefect |
| Distributed processing | Apache Spark |
| Event streaming | Apache Kafka |
| Table format (lakehouse) | Apache Iceberg, Delta Lake |
| Distributed SQL queries | Trino |
| Data lineage | OpenLineage, DataHub |
| Reproducible pipelines | Kedro |

MLOps

| Need | Recommended Tool |
| --- | --- |
| Experiment tracking | MLflow, Weights & Biases (wandb) |
| Data/model versioning | DVC |
| Model/data drift monitoring | Evidently |
| Model serving | BentoML, KServe |
| LLM inference (high-throughput) | vLLM |
| Unified LLM API | LiteLLM |
| ML on Kubernetes | Kubeflow |
| Feature store | Feast |

Web Scraping

| Need | Recommended Tool |
| --- | --- |
| Simple HTTP requests | Requests, HTTPX |
| HTML parsing | BeautifulSoup |
| Browser automation | Playwright, Selenium |
| Full crawling framework | Scrapy |
| AI-powered scraping | ScrapeGraph AI, Crawl4AI |
| Auto-detect patterns | AutoScraper |
| Text extraction from web | Trafilatura |

NLP

| Need | Recommended Tool |
| --- | --- |
| Production NLP pipeline | SpaCy |
| Text preprocessing | NLTK, TextBlob |
| Sentence embeddings | SentenceTransformers |
| Topic modeling | Gensim |
| Transformer models | HuggingFace Transformers |
| Conversational AI | Rasa |
| Adversarial NLP testing | TextAttack |

Data Analysis Workflow

1. Data Acquisition

  • Files: OpenPyXL (Excel), PyPDF2 (now maintained as pypdf)/Camelot (PDF), CleverCSV (messy CSV)
  • Web: Requests/HTTPX + BeautifulSoup, or Scrapy/Crawl4AI for large-scale
  • Databases: SQLAlchemy + DuckDB for analytics, Psycopg2 for PostgreSQL
  • APIs: HTTPX (async), Requests-cache (with caching)
  • Synthetic: Faker, Mimesis for test data generation
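
As a minimal sketch of the file-based path above, assuming pandas is installed, the snippet below reads CSV text into a typed DataFrame. The column names and sample rows are hypothetical, and `io.StringIO` stands in for a real file path or URL.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a downloaded or exported file.
RAW_CSV = """order_id,order_date,region,amount
1,2024-01-05,north,120.50
2,2024-01-06,south,
3,2024-01-06,north,80.00
"""


def load_orders(source) -> pd.DataFrame:
    """Read a CSV source into a DataFrame with explicit column types."""
    return pd.read_csv(
        source,
        parse_dates=["order_date"],  # parse ISO dates into datetimes
        dtype={"order_id": "int64", "region": "string"},
    )


orders = load_orders(io.StringIO(RAW_CSV))
print(orders.dtypes)
print(orders)
```

Declaring types at load time surfaces bad rows early; the empty `amount` in the second row arrives as a missing value instead of silently becoming a string.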

2. Exploration & Profiling

  • Run YData Profiling or Sweetviz for automated EDA
  • Use Missingno to visualize missing data patterns
  • Use D-Tale or PyGWalker for interactive exploration
  • For quick auto-visualization: AutoViz
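
Before reaching for a full profiling report, a few lines of plain pandas can give a first look at column types and missingness. This is a hand-rolled sketch, not what YData Profiling or Sweetviz produce; the `profile` helper and the sample frame are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with some gaps.
df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "city": ["Oslo", "Lima", None, "Oslo"],
    "income": [52000.0, 61000.0, None, None],
})


def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing-value share, and distinct count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_pct": frame.isna().mean() * 100,
        "n_unique": frame.nunique(),  # NaN is excluded by default
    })


summary = profile(df)
print(summary)
```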

3. Cleaning & Validation

  • Pandera for DataFrame schema validation
  • Great Expectations for data pipeline testing
  • PyOD for outlier detection
  • Pandas DQ for automatic type correction and cleaning
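
The two ideas in this step, rule-based validation and outlier flagging, can be sketched in plain pandas. This is an illustration of the concepts only: it does not use the Pandera or PyOD APIs, the outlier rule is Tukey's IQR fence instead of a PyOD detector, and the column names and rules are hypothetical.

```python
import pandas as pd

# Hypothetical sensor readings; one value is clearly off.
df = pd.DataFrame({
    "sensor_id": ["a", "b", "c", "d", "e", "f"],
    "reading": [10.1, 9.8, 10.4, 10.0, 55.0, 9.9],
})


def validate(frame: pd.DataFrame) -> list[str]:
    """Return a list of human-readable rule violations (empty if clean)."""
    problems = []
    if frame["sensor_id"].duplicated().any():
        problems.append("sensor_id must be unique")
    if frame["reading"].isna().any():
        problems.append("reading must not be null")
    if (frame["reading"] < 0).any():
        problems.append("reading must be non-negative")
    return problems


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


print(validate(df))
print(df[iqr_outliers(df["reading"])])
```

Schema rules catch structural problems, while the outlier flag catches values that are well-formed but implausible; the two checks are complementary.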

4. Feature Engineering

  • FeatureTools for automated feature engineering
  • Category Encoders for categorical variables
  • Prince for dimensionality reduction (PCA/MCA)
  • Imbalanced Learn for handling class imbalance
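
As a small illustration of this step, the sketch below builds three common feature types with plain pandas: one-hot encoded categories, a date part, and a standardized numeric column. It stands in for, and is much simpler than, what Category Encoders or FeatureTools provide; the columns are hypothetical.

```python
import pandas as pd

# Hypothetical user table.
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "team"],
    "signup": pd.to_datetime(
        ["2024-01-01", "2024-02-15", "2024-03-10", "2024-03-31"]
    ),
    "visits": [3, 40, 5, 22],
})

# One-hot encode the categorical column as 0/1 integers.
features = pd.get_dummies(df, columns=["plan"], prefix="plan", dtype=int)

# Derive a calendar feature from the timestamp.
features["signup_month"] = features["signup"].dt.month

# Standardize a numeric column (zero mean, unit sample standard deviation).
features["visits_scaled"] = (
    df["visits"] - df["visits"].mean()
) / df["visits"].std()

features = features.drop(columns=["signup"])
print(features)
```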

5. Analysis & Modeling

  • Statistical: SciPy, Statsmodels, Pingouin
  • Classical ML: Scikit-learn, XGBoost/LightGBM/CatBoost
  • Deep Learning: PyTorch, plus a task-specific library (e.g., HuggingFace Transformers, PyTorch Geometric)
  • Time Series: Prophet, sktime, pmdarima
  • NLP: SpaCy, HuggingFace Transformers
  • Hyperparameter tuning: Optuna
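
For the statistical end of this step, the following sketch fits an ordinary least squares line and reports R². It uses only NumPy so that it runs without extra installs; in practice Statsmodels or Scikit-learn would give the same fit along with diagnostics. The synthetic data (true slope 2, intercept 1) is an assumption made for illustration.

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus Gaussian noise (fixed seed for repeatability).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Design matrix with an intercept column, then solve min ||X b - y||^2.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Coefficient of determination on the training data.
pred = X @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(f"intercept={intercept:.3f} slope={slope:.3f} r2={r2:.3f}")
```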

6. Visualization & Reporting

  • Matplotlib/Seaborn for static publication plots
  • Plotly/Altair for interactive visualizations
  • Streamlit/Dash for data apps and dashboards
  • Gradio for ML model demos

7. Deployment & Monitoring

  • MLflow/Weights & Biases (wandb) for experiment tracking
  • DVC for data versioning
  • BentoML/KServe for model serving
  • Evidently for drift monitoring
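
To make "drift monitoring" concrete, here is a minimal sketch of one widely used drift statistic, the Population Stability Index (PSI), written in NumPy. It is not Evidently's API or necessarily the metric it applies by default; the `psi` helper, the bin count, and the synthetic samples are assumptions for illustration.

```python
import numpy as np


def psi(reference, current, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two 1-D numeric samples."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Inner bin edges from reference quantiles, so each reference bin
    # holds roughly equal mass; the outer bins are open-ended.
    inner = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(
        np.searchsorted(inner, reference, side="right"), minlength=bins
    )
    cur_counts = np.bincount(
        np.searchsorted(inner, current, side="right"), minlength=bins
    )
    # Clip fractions away from zero to keep the logarithm finite.
    ref_frac = np.clip(ref_counts / reference.size, eps, None)
    cur_frac = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(42)
training = rng.normal(0.0, 1.0, 5000)   # reference window
stable = rng.normal(0.0, 1.0, 5000)     # same distribution
shifted = rng.normal(1.0, 1.0, 5000)    # mean moved by one std dev

print("stable :", psi(training, stable))
print("shifted:", psi(training, shifted))
```

A common rule of thumb treats PSI above roughly 0.2 to 0.25 as a signal worth investigating, though the threshold should be tuned per feature.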

Reference Files

Read these for detailed tool lists and learning resources:

| File | Contents | Read when... |
| --- | --- | --- |
| references/python-ecosystem.md | Python tools: data processing, EDA, validation, feature engineering, specialized tools | Working with Python data libraries |
| references/sql-databases.md | SQL resources, database tools, drivers, learning materials | Working with SQL or databases |
| references/visualization-dashboards.md | Visualization tools, dashboard frameworks, BI software | Building charts, dashboards, or data apps |
| references/statistics-ml.md | Statistics, ML, deep learning, NLP, time series, AI tools | Doing statistical analysis or building models |
| references/data-engineering.md | Data pipelines, MLOps, cloud infrastructure, web scraping | Building data pipelines or deploying models |
| references/learning-resources.md | Courses, datasets, cheatsheets, interview prep, career | Looking for learning materials or datasets |

File Processing Quick Reference

| Format | Library |
| --- | --- |
| Excel (.xlsx) | OpenPyXL, Xlwings |
| CSV (messy) | CleverCSV |
| PDF (text) | PyPDF2 (now maintained as pypdf), PyMuPDF |
| PDF (tables) | Camelot |
| Word (.docx) | Python-docx |
| HTML to Markdown | Python-markdownify, MarkItDown |
| XML | Xmltodict |
| JSON querying | JMESPath, jq (CLI) |
| YAML | yq (CLI) |
| Tabular export | Tablib (XLSX/JSON/CSV) |
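
For simple format conversions, the standard library is often enough before adding a dependency. The sketch below flattens nested JSON records into CSV using only `json` and `csv`; the record layout and the `flatten` helper are hypothetical, and a query library such as JMESPath would replace the hand-written field access for deeper structures.

```python
import csv
import io
import json

# Hypothetical nested API response.
RAW_JSON = """
[
  {"id": 1, "user": {"name": "Ada", "country": "UK"}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "Lin", "country": "SG"}, "tags": []}
]
"""


def flatten(record: dict) -> dict:
    """Pull the nested fields of interest into a flat row."""
    return {
        "id": record["id"],
        "name": record["user"]["name"],
        "country": record["user"]["country"],
        "tag_count": len(record["tags"]),
    }


rows = [flatten(r) for r in json.loads(RAW_JSON)]

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["id", "name", "country", "tag_count"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```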

Command-Line Data Tools

| Tool | Purpose |
| --- | --- |
| jq | JSON processor |
| yq | YAML/XML processor |
| q | Run SQL on CSV/TSV files |
| VisiData | Interactive tabular data explorer |
| csvkit | CSV manipulation suite |
| Miller | Multi-format data processor (CSV/JSON/etc.) |
| DuckDB CLI | SQL analytics on files |
| hyperfine | Benchmarking |
| termgraph | Terminal-based graphs |