Data Analysis
Expert guidance for data analysis and data science workflows. Provides tool selection, workflow patterns, and curated resources across the full data stack.
Tool Selection Quick Reference
Data Manipulation
| Need | Recommended Tool | Alternative |
|---|---|---|
| General tabular data | Pandas | Polars (faster, multithreaded) |
| Large datasets (out-of-core) | Polars, Dask | Vaex, Modin |
| GPU-accelerated DataFrames | cuDF (RAPIDS) | CuPy (NumPy on GPU) |
| Parallel pandas operations | Pandarallel | Modin |
| Cross-engine portability | Fugue (Pandas/Spark/Dask) | - |
| Fuzzy string matching | TheFuzz | - |
| Date/time handling | Pendulum, Arrow | python-dateutil |
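
As a rough illustration of the default choice in the table above, here is a minimal sketch of a grouped aggregation in Pandas. The column names and values are invented for the example; they are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sales records; column names and values are invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "units": [10, 15, 7, 3, 5],
        "price": [2.0, 2.0, 3.0, 3.0, 3.0],
    }
)

# Derive a revenue column, then aggregate per region using named aggregations.
sales["revenue"] = sales["units"] * sales["price"]
summary = (
    sales.groupby("region")
    .agg(total_units=("units", "sum"), total_revenue=("revenue", "sum"))
    .reset_index()
)
print(summary)
```

The same shape of pipeline (derive, group, aggregate) usually carries over to the alternatives listed, although method names differ between libraries.
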
Automated EDA
| Need | Recommended Tool |
|---|---|
| One-line auto visualization | AutoViz |
| EDA with dataset comparison | Sweetviz |
| Data quality profiling | YData Profiling |
| Missing data patterns | Missingno |
| Interactive GUI analysis | D-Tale, PyGWalker |
| Conversational analysis (LLM) | PandasAI |
Data Quality & Validation
| Need | Recommended Tool |
|---|---|
| Outlier/anomaly detection | PyOD, Alibi Detect |
| Schema validation (DataFrames) | Pandera |
| Schema validation (general) | Pydantic, Cerberus |
| Data pipeline testing | Great Expectations |
Feature Engineering
| Need | Recommended Tool |
|---|---|
| Automated feature engineering | FeatureTools |
| Scikit-learn compatible FE | Feature Engine |
| Dimensionality reduction (PCA/MCA) | Prince |
| Categorical encoding | Category Encoders |
| Imbalanced datasets | Imbalanced Learn |
| Distribution fitting | Fitter |
Visualization
| Need | Recommended Tool | Alternative |
|---|---|---|
| Static plots | Matplotlib | Plotnine (ggplot2-style) |
| Statistical visualization | Seaborn | - |
| Interactive plots | Plotly | Bokeh, Altair |
| Geographic/maps | Folium, GeoPandas | QGIS, OSMnx |
| Large dataset rendering | Datashader | Deck.gl |
| 3D/scientific | VisPy | Glumpy |
| No-code visualization | Flourish | - |
| Chart selection guide | From Data to Viz | Data Viz Catalogue |
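
For the static-plot row above, a minimal Matplotlib sketch might look like the following. The data is invented, and the `Agg` backend is an assumption made so the example runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented monthly values, purely for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
signups = [120, 135, 160, 150]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, signups, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Signups")
ax.set_title("Monthly signups (sample data)")
fig.tight_layout()

# Render to an in-memory PNG; fig.savefig also accepts a file path.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

Seaborn builds on the same figure and axes objects, so a plot started this way can typically be extended with statistical layers later.
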
Dashboards & BI
| Need | Recommended Tool |
|---|---|
| Quick data apps | Streamlit |
| Production dashboards | Dash (Plotly) |
| ML demo interfaces | Gradio |
| Notebook to web app | Voila |
| Full-stack Python web | Reflex, Taipy |
| Enterprise BI | Tableau, Power BI |
| Open-source BI | Apache Superset, Metabase |
| Monitoring dashboards | Grafana |
SQL & Databases
| Need | Recommended Tool |
|---|---|
| In-process analytics | DuckDB |
| Python ORM | SQLAlchemy |
| PostgreSQL adapter | Psycopg2 |
| SQL linting/formatting | SQLFluff |
| SQL transpilation | SQLGlot |
| Natural language to SQL | Vanna.AI |
| Time-series database | TimescaleDB, TDengine |
| Database GUI | DBeaver, Beekeeper Studio |
Statistics & Probability
| Need | Recommended Tool |
|---|---|
| General scientific computing | SciPy |
| Statistical modeling | Statsmodels |
| Bayesian modeling | PyMC, NumPyro |
| User-friendly stats | Pingouin |
| Survival analysis | Lifelines, scikit-survival |
| Causal inference | DoWhy, CausalImpact |
| Bayesian visualization | ArviZ |
| GAMs | PyGAM |
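
As one example of the SciPy row above, the following sketch compares two groups with Welch's t-test. The samples are synthetic, drawn with a fixed seed, and the group names are placeholders.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumption: two independent, roughly normal groups).
rng = np.random.default_rng(seed=0)
control = rng.normal(loc=0.0, scale=1.0, size=200)
treatment = rng.normal(loc=1.0, scale=1.0, size=200)

# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

Statsmodels or Pingouin may be a better fit when you need effect sizes, confidence intervals, or a full model summary in one call.
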
Time Series
| Need | Recommended Tool |
|---|---|
| General forecasting | Facebook Prophet |
| Bayesian forecasting | Uber Orbit |
| ML time series framework | sktime |
| Deep learning forecasting | PyTorch Forecasting, GluonTS |
| Zero-shot forecasting | TimesFM (Google) |
| Feature extraction | TSFresh |
| ARIMA modeling | pmdarima |
Machine Learning
| Need | Recommended Tool |
|---|---|
| Classical ML | Scikit-learn |
| Gradient boosting | XGBoost, LightGBM, CatBoost |
| AutoML | H2O |
| GPU-accelerated ML | cuML (RAPIDS) |
| Model explainability | SHAP, InterpretML |
| Hyperparameter tuning | Optuna |
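
A minimal sketch of the classical ML row, assuming a small built-in dataset and a scaled logistic regression as the baseline. Both the dataset and the model are illustrative choices, not recommendations from this guide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in toy dataset, used here only so the sketch is self-contained.
X, y = load_iris(return_X_y=True)

# Putting the scaler inside the pipeline keeps it from seeing validation folds.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```

A cross-validated baseline like this gives a reference point before moving to gradient boosting or a tuning loop.
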
Deep Learning
| Need | Recommended Tool |
|---|---|
| Research & production | PyTorch |
| High-level API | Keras |
| Fast prototyping | Fast.ai |
| Transformers/NLP | HuggingFace Transformers |
| Object detection | Ultralytics (YOLOv8) |
| Model interoperability | ONNX |
| Efficient fine-tuning | PEFT, Unsloth |
| Graph neural networks | PyTorch Geometric |
Data Engineering
| Need | Recommended Tool |
|---|---|
| SQL transformations | dbt-core |
| Workflow orchestration | Apache Airflow, Dagster, Prefect |
| Distributed processing | Apache Spark |
| Event streaming | Apache Kafka |
| Table format (lakehouse) | Apache Iceberg, Delta Lake |
| Distributed SQL queries | Trino |
| Data lineage | OpenLineage, DataHub |
| Reproducible pipelines | Kedro |
MLOps
| Need | Recommended Tool |
|---|---|
| Experiment tracking | MLflow, Wandb |
| Data/model versioning | DVC |
| Model/data drift monitoring | Evidently |
| Model serving | BentoML, KServe |
| LLM inference (high-throughput) | vLLM |
| Unified LLM API | LiteLLM |
| ML on Kubernetes | Kubeflow |
| Feature store | Feast |
Web Scraping
| Need | Recommended Tool |
|---|---|
| Simple HTTP requests | Requests, HTTPX |
| HTML parsing | BeautifulSoup |
| Browser automation | Playwright, Selenium |
| Full crawling framework | Scrapy |
| AI-powered scraping | ScrapeGraph AI, Crawl4AI |
| Auto-detect patterns | AutoScraper |
| Text extraction from web | Trafilatura |
NLP
| Need | Recommended Tool |
|---|---|
| Production NLP pipeline | SpaCy |
| Text preprocessing | NLTK, TextBlob |
| Sentence embeddings | SentenceTransformers |
| Topic modeling | Gensim |
| Transformer models | HuggingFace Transformers |
| Conversational AI | Rasa |
| Adversarial NLP testing | TextAttack |
Data Analysis Workflow
1. Data Acquisition
- Files: OpenPyXL (Excel), pypdf (successor to the deprecated PyPDF2) or Camelot (PDF), CleverCSV (messy CSV)
- Web: Requests/HTTPX + BeautifulSoup, or Scrapy/Crawl4AI for large-scale
- Databases: SQLAlchemy + DuckDB for analytics, Psycopg2 for PostgreSQL
- APIs: HTTPX (async), Requests-cache (with caching)
- Synthetic: Faker, Mimesis for test data generation
2. Exploration & Profiling
- Run YData Profiling or Sweetviz for automated EDA
- Use Missingno to visualize missing data patterns
- Use D-Tale or PyGWalker for interactive exploration
- For quick auto-visualization: AutoViz
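
Before reaching for the profiling tools above, a quick first pass with plain Pandas can be useful. This sketch is a lightweight stand-in, not a replacement for those tools, and the DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with some gaps; values are invented for illustration.
df = pd.DataFrame(
    {
        "age": [34, np.nan, 29, 41],
        "city": ["Oslo", "Lima", None, None],
        "score": [0.7, 0.4, 0.9, 0.5],
    }
)

# One row per column: type, missing count, missing share, distinct values.
overview = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),
    }
)
print(overview)
```
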
3. Cleaning & Validation
- Pandera for DataFrame schema validation
- Great Expectations for data pipeline testing
- PyOD for outlier detection
- Pandas DQ for automatic type correction and cleaning
4. Feature Engineering
- FeatureTools for automated feature engineering
- Category Encoders for categorical variables
- Prince for dimensionality reduction (PCA/MCA)
- Imbalanced Learn for handling class imbalance
5. Analysis & Modeling
- Statistical: SciPy, Statsmodels, Pingouin
- Classical ML: Scikit-learn, XGBoost/LightGBM/CatBoost
- Deep Learning: PyTorch, plus a task-specific library where needed (e.g. HuggingFace Transformers for NLP, PyTorch Geometric for graphs)
- Time Series: Prophet, sktime, pmdarima
- NLP: SpaCy, HuggingFace Transformers
- Hyperparameter tuning: Optuna
6. Visualization & Reporting
- Matplotlib/Seaborn for static publication plots
- Plotly/Altair for interactive visualizations
- Streamlit/Dash for data apps and dashboards
- Gradio for ML model demos
7. Deployment & Monitoring
- MLflow/Wandb for experiment tracking
- DVC for data versioning
- BentoML/KServe for model serving
- Evidently for drift monitoring
Reference Files
Read these for detailed tool lists and learning resources:
| File | Contents | Read when... |
|---|---|---|
| references/python-ecosystem.md | Python tools: data processing, EDA, validation, feature engineering, specialized tools | Working with Python data libraries |
| references/sql-databases.md | SQL resources, database tools, drivers, learning materials | Working with SQL or databases |
| references/visualization-dashboards.md | Visualization tools, dashboard frameworks, BI software | Building charts, dashboards, or data apps |
| references/statistics-ml.md | Statistics, ML, deep learning, NLP, time series, AI tools | Doing statistical analysis or building models |
| references/data-engineering.md | Data pipelines, MLOps, cloud infrastructure, web scraping | Building data pipelines or deploying models |
| references/learning-resources.md | Courses, datasets, cheatsheets, interview prep, career | Looking for learning materials or datasets |
File Processing Quick Reference
| Format | Library |
|---|---|
| Excel (.xlsx) | OpenPyXL, Xlwings |
| CSV (messy) | CleverCSV |
| PDF (text) | pypdf (successor to the deprecated PyPDF2), PyMuPDF |
| PDF (tables) | Camelot |
| Word (.docx) | Python-docx |
| HTML to Markdown | Python-markdownify, MarkItDown |
| XML | Xmltodict |
| JSON querying | JMESPath, jq (CLI) |
| YAML | yq (CLI) |
| Tabular export | Tablib (XLSX/JSON/CSV) |
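
As an example of the Excel row above, this sketch writes and re-reads a small workbook with OpenPyXL. The sheet name, file name, and rows are made up, and a temporary directory is used so nothing is left on disk.

```python
import tempfile
from pathlib import Path

from openpyxl import Workbook, load_workbook

# Invented rows: a header followed by two records.
rows = [("name", "qty"), ("apples", 3), ("pears", 5)]

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "sample.xlsx")

    # Write: create a workbook, rename the default sheet, append rows.
    wb = Workbook()
    ws = wb.active
    ws.title = "inventory"
    for row in rows:
        ws.append(row)
    wb.save(path)

    # Read: read_only mode streams rows instead of loading the whole sheet.
    loaded = load_workbook(path, read_only=True)
    read_back = list(loaded["inventory"].iter_rows(values_only=True))
    loaded.close()

print(read_back)
```

For analysis work it is often simpler to hand the file to Pandas, which can use OpenPyXL as its `.xlsx` engine; direct use is mainly worthwhile when formatting or sheet structure matters.
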
Command-Line Data Tools
| Tool | Purpose |
|---|---|
| jq | JSON processor |
| yq | YAML/XML processor |
| q | Run SQL on CSV/TSV files |
| VisiData | Interactive tabular data explorer |
| csvkit | CSV manipulation suite |
| Miller | Multi-format data processor (CSV/JSON/etc.) |
| DuckDB CLI | SQL analytics on files |
| hyperfine | Benchmarking |
| termgraph | Terminal-based graphs |

Repository: 0xkynz/codekit