Using HuggingFace

Comprehensive integration with the HuggingFace Hub for AI model deployment, dataset management, and inference.

What This Skill Does

Connects your projects to the HuggingFace ecosystem:

  • Model integration: Load and use pre-trained models
  • Dataset management: Access and process HuggingFace datasets
  • Inference API: Call models via HuggingFace API
  • Fine-tuning: Train models on custom data
  • Spaces deployment: Deploy interactive ML demos
  • Model hub search: Discover and compare models

Quick Start

Load a Model

from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love using HuggingFace!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

Use a Dataset

from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")
print(dataset["train"][0])

Inference API

node scripts/huggingface-inference.js "Translate to French: Hello world"

HuggingFace Workflow

graph TD
    A[HuggingFace Hub] --> B{Resource Type}
    B -->|Model| C[Load Model]
    B -->|Dataset| D[Load Dataset]
    B -->|Space| E[Deploy App]

    C --> F[Local Inference]
    C --> G[API Inference]
    C --> H[Fine-tune]

    D --> I[Preprocess Data]
    I --> J[Train Model]
    J --> K[Evaluate]

    E --> L[Gradio/Streamlit]
    L --> M[Public Demo]

    H --> N[Push to Hub]
    K --> N

    style M fill:#99ff99
    style N fill:#99ff99

Model Integration

Transformers Library

Installation:

pip install transformers torch

Basic usage:

from transformers import AutoTokenizer, AutoModel

# Load pre-trained model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize input
inputs = tokenizer("Hello, world!", return_tensors="pt")

# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state

Common Pipelines

Text Classification:

from transformers import pipeline

classifier = pipeline("text-classification",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

result = classifier("This product is amazing!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

Named Entity Recognition (NER):

ner = pipeline("ner",
              model="dbmdz/bert-large-cased-finetuned-conll03-english",
              aggregation_strategy="simple")

text = "Apple Inc. was founded by Steve Jobs in California."
entities = ner(text)

for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']}")
# Apple Inc: ORG
# Steve Jobs: PER
# California: LOC

Text Generation:

generator = pipeline("text-generation", model="gpt2")

prompt = "Once upon a time"
result = generator(prompt, max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])

Translation:

translator = pipeline("translation_en_to_fr",
                     model="Helsinki-NLP/opus-mt-en-fr")

result = translator("Hello, how are you?")
# [{'translation_text': 'Bonjour, comment allez-vous?'}]

Question Answering:

qa = pipeline("question-answering")

context = "HuggingFace is a company that develops NLP tools."
question = "What does HuggingFace develop?"

result = qa(question=question, context=context)
# {'answer': 'NLP tools', 'score': 0.98}

Model Selection Guide

| Task | Recommended Models | Use Case |
| --- | --- | --- |
| Text Classification | distilbert-base, roberta-base | Sentiment, topic classification |
| NER | bert-large-cased, roberta-large | Entity extraction |
| Text Generation | gpt2, gpt-neo-2.7B | Content creation |
| Translation | Helsinki-NLP/opus-mt-* | Language translation |
| Summarization | facebook/bart-large-cnn | Document summarization |
| Question Answering | bert-base-uncased, distilbert-base | Q&A systems |
| Zero-shot | facebook/bart-large-mnli | Classification without training |
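The zero-shot row above needs no task-specific training data. A minimal sketch, assuming facebook/bart-large-mnli as in the table (the example text and candidate labels are illustrative):

```python
from transformers import pipeline

# Zero-shot classification scores arbitrary candidate labels against the input
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "I just booked a flight to Tokyo for next month.",
    candidate_labels=["travel", "cooking", "finance"],
)
print(result["labels"][0])  # highest-scoring label first
```

The result dict also carries a `scores` list aligned with `labels`, so a confidence threshold can be applied before trusting the top label.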

Dataset Management

Loading Datasets

From Hub:

from datasets import load_dataset

# Load full dataset
dataset = load_dataset("imdb")
print(dataset.keys())  # dict_keys(['train', 'test', 'unsupervised'])

# Load specific split
train_dataset = load_dataset("imdb", split="train")

# Load subset
small_dataset = load_dataset("imdb", split="train[:1000]")

Custom datasets:

from datasets import Dataset

# From dictionary
data = {
    "text": ["Hello", "World"],
    "label": [1, 0]
}
dataset = Dataset.from_dict(data)

# From pandas
import pandas as pd
df = pd.read_csv("data.csv")
dataset = Dataset.from_pandas(df)

# From CSV directly
dataset = load_dataset("csv", data_files="data.csv")

Dataset Operations

Filtering:

# Filter by condition
long_texts = dataset.filter(lambda x: len(x["text"]) > 100)

# Filter by index
subset = dataset.select(range(1000))

Mapping:

# Preprocess function
def preprocess(example):
    example["text"] = example["text"].lower()
    return example

# Apply to dataset
processed = dataset.map(preprocess)

# Batch processing
def batch_preprocess(examples):
    examples["text"] = [text.lower() for text in examples["text"]]
    return examples

processed = dataset.map(batch_preprocess, batched=True)

Shuffling and Splitting:

# Shuffle
shuffled = dataset.shuffle(seed=42)

# Train/test split
split_dataset = dataset.train_test_split(test_size=0.2)
train = split_dataset["train"]
test = split_dataset["test"]

Dataset Features

from datasets import ClassLabel, Value, Features

# Define schema
features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["negative", "positive"]),
    "score": Value("float32")
})

# Create dataset with schema
dataset = Dataset.from_dict(data, features=features)

Inference API

REST API Integration

Setup:

// scripts/huggingface-inference.js
import fetch from 'node-fetch';

const HF_API_KEY = process.env.HUGGINGFACE_API_KEY;
const API_URL = "https://api-inference.huggingface.co/models/";

async function query(model, inputs) {
  const response = await fetch(`${API_URL}${model}`, {
    headers: {
      "Authorization": `Bearer ${HF_API_KEY}`,
      "Content-Type": "application/json"
    },
    method: "POST",
    body: JSON.stringify({ inputs })
  });

  return await response.json();
}

// Text generation
const result = await query(
  "gpt2",
  "The future of AI is"
);

console.log(result);

Common API endpoints:

// Sentiment analysis
await query("distilbert-base-uncased-finetuned-sst-2-english",
           "I love this product!");

// Translation
await query("Helsinki-NLP/opus-mt-en-fr",
           "Hello, how are you?");

// Binary payloads (images, audio) and binary responses (generated images)
// require raw request/response bodies rather than JSON, so the query()
// helper above must be adapted for these models:
// - Image classification: google/vit-base-patch16-224
// - Text-to-image: stabilityai/stable-diffusion-2-1
// - Speech-to-text: openai/whisper-large-v2

Python API Client

import os
from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ["HUGGINGFACE_API_KEY"])

# Text generation
response = client.text_generation(
    "The future of AI is",
    model="gpt2",
    max_new_tokens=50
)

# Image generation
image = client.text_to_image(
    "A beautiful sunset over mountains",
    model="stabilityai/stable-diffusion-2-1"
)
image.save("output.png")

# Chat completion
messages = [
    {"role": "user", "content": "What is machine learning?"}
]
response = client.chat_completion(
    messages,
    model="meta-llama/Llama-2-7b-chat-hf"
)

Fine-Tuning Models

Training Setup

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset

# Load model and tokenizer
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare dataset
dataset = load_dataset("imdb")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"]
)

# Train
trainer.train()

# Save model
trainer.save_model("./my-finetuned-model")

Push to Hub

import os
from huggingface_hub import login

# Login with an access token
login(token=os.environ["HUGGINGFACE_API_KEY"])

# Push model
model.push_to_hub("my-username/my-model-name")
tokenizer.push_to_hub("my-username/my-model-name")

# Push dataset
dataset.push_to_hub("my-username/my-dataset-name")

Spaces Deployment

Gradio App

Create app:

# app.py
import gradio as gr
from transformers import pipeline

# Load model
classifier = pipeline("sentiment-analysis")

def predict(text):
    result = classifier(text)[0]
    # Two return values, one per output component below
    return result["label"], result["score"]

# Create interface
demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=3, placeholder="Enter text here..."),
    outputs=[
        gr.Label(label="Sentiment"),
        gr.Number(label="Confidence")
    ],
    title="Sentiment Analysis",
    description="Analyze the sentiment of text"
)

demo.launch()

Deploy to Space:

# Create requirements.txt
echo "transformers
torch
gradio" > requirements.txt

# Push to HuggingFace Space
git init
git add .
git commit -m "Initial commit"
git remote add origin https://huggingface.co/spaces/username/space-name
git push origin main
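As an alternative to git, the huggingface_hub client can upload a folder directly. A sketch, assuming the Space repo already exists (the folder, repo_id, and token arguments are placeholders):

```python
from huggingface_hub import HfApi

def deploy_space(folder: str, repo_id: str, token: str) -> None:
    """Upload a local app folder to an existing HuggingFace Space."""
    api = HfApi(token=token)
    api.upload_folder(folder_path=folder,
                      repo_id=repo_id,
                      repo_type="space")
```

Calling `deploy_space(".", "username/space-name", token)` pushes app.py and requirements.txt without a local git checkout.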

Streamlit App

# app.py
import streamlit as st
from transformers import pipeline

st.title("Text Summarization")

# Load model
@st.cache_resource
def load_model():
    return pipeline("summarization", model="facebook/bart-large-cnn")

summarizer = load_model()

# Input
text = st.text_area("Enter text to summarize", height=200)

if st.button("Summarize"):
    if text:
        summary = summarizer(text, max_length=130, min_length=30)[0]
        st.write("**Summary:**")
        st.write(summary['summary_text'])

Advanced Features

Model Hub Search

from huggingface_hub import HfApi

api = HfApi()

# Search models
models = api.list_models(
    filter="text-classification",
    sort="downloads",
    direction=-1,
    limit=10
)

for model in models:
    print(f"{model.modelId}: {model.downloads} downloads")

# Get model info
model_info = api.model_info("bert-base-uncased")
print(model_info.tags)
print(model_info.pipeline_tag)

Private Models

import os
from huggingface_hub import login

# Login with token
login(token=os.environ["HUGGINGFACE_API_KEY"])

# Load private model
model = AutoModel.from_pretrained("my-org/private-model")

# Push private model
model.push_to_hub(
    "my-username/my-private-model",
    private=True
)

Model Versioning

# Load specific revision
model = AutoModel.from_pretrained(
    "bert-base-uncased",
    revision="v1.0.0"
)

# List model revisions
from huggingface_hub import list_repo_refs

refs = list_repo_refs("bert-base-uncased")
for branch in refs.branches:
    print(branch.name)

Best Practices

Model Selection

  1. Start small: Use distilled models (distilbert, distilgpt2) for faster iteration
  2. Check benchmarks: Review model performance on common datasets
  3. Consider size: Larger models = better performance but slower inference
  4. License awareness: Check model licenses before commercial use
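To act on points 1 and 3, parameter counts can be compared directly before committing to a checkpoint. A quick sketch (downloads each model, so run it once and cache):

```python
from transformers import AutoModel

# Compare rough model sizes by parameter count
param_counts = {}
for name in ["distilbert-base-uncased", "bert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    param_counts[name] = sum(p.numel() for p in model.parameters())
    print(f"{name}: {param_counts[name] / 1e6:.0f}M parameters")
```

DistilBERT has roughly 40% fewer parameters than BERT-base, which is why it is the usual starting point for iteration.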

Performance Optimization

Quantization:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load in 8-bit (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

Caching:

# Cache models locally
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "bert-base-uncased",
    cache_dir="./model_cache"
)

Batching:

# Process multiple inputs
classifier = pipeline("sentiment-analysis")

texts = ["Great product!", "Terrible service", "Okay experience"]
results = classifier(texts)
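For larger workloads, pipelines also accept a `batch_size` argument so inputs are grouped into batches rather than processed one at a time. A sketch with illustrative inputs:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# Group 32 inputs into batches of 8 during inference
texts = [f"Review {i}: the product worked well." for i in range(32)]
results = classifier(texts, batch_size=8)
print(len(results))
```

Batching helps most on GPU; on CPU the default of one input at a time is often just as fast.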

Error Handling

from transformers import pipeline
import logging

try:
    model = pipeline("sentiment-analysis")
    result = model("Test text")
except Exception as e:
    logging.error(f"Model loading failed: {e}")
    # Fallback to simpler model
    model = pipeline("sentiment-analysis",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

Common Use Cases

1. Content Moderation

classifier = pipeline("text-classification",
                     model="unitary/toxic-bert")

comments = [
    "This is a great post!",
    "You're an idiot",
    "Nice work, keep it up"
]

for comment in comments:
    result = classifier(comment)[0]
    if result['label'] == 'toxic' and result['score'] > 0.8:
        print(f"⚠️ Toxic: {comment}")

2. Document Search

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode documents
documents = [
    "Python is a programming language",
    "Machine learning is a subset of AI",
    "HuggingFace provides ML tools"
]
doc_embeddings = model.encode(documents)

# Search
query = "What is Python?"
query_embedding = model.encode(query)

similarities = util.cos_sim(query_embedding, doc_embeddings)
best_match = documents[similarities.argmax()]
print(f"Best match: {best_match}")

3. Chatbot

from transformers import pipeline, Conversation

# Note: the "conversational" pipeline and Conversation class were removed
# in transformers v4.42; this example requires an earlier version.
chatbot = pipeline("conversational",
                  model="microsoft/DialoGPT-medium")

conversation = Conversation("Hello!")
conversation = chatbot(conversation)
print(conversation.generated_responses[-1])

conversation.add_user_input("How are you?")
conversation = chatbot(conversation)
print(conversation.generated_responses[-1])

Integration Patterns

Next.js API Route

// app/api/sentiment/route.ts
import { HfInference } from '@huggingface/inference';

const hf = new HfInference(process.env.HUGGINGFACE_API_KEY);

export async function POST(request: Request) {
  const { text } = await request.json();

  try {
    const result = await hf.textClassification({
      model: 'distilbert-base-uncased-finetuned-sst-2-english',
      inputs: text
    });

    return Response.json(result);
  } catch (error) {
    return Response.json({ error: 'Analysis failed' }, { status: 500 });
  }
}

React Component

// components/SentimentAnalyzer.tsx
'use client';

import { useState } from 'react';

export function SentimentAnalyzer() {
  const [text, setText] = useState('');
  const [result, setResult] = useState<{ label: string; score: number }[] | null>(null);

  const analyze = async () => {
    const response = await fetch('/api/sentiment', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text })
    });

    const data = await response.json();
    setResult(data);
  };

  return (
    <div>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button onClick={analyze}>Analyze</button>
      {result && <div>Sentiment: {result[0].label}</div>}
    </div>
  );
}

Advanced Topics

For detailed information:

  • Model Fine-tuning Guide: resources/fine-tuning-guide.md
  • Dataset Processing: resources/dataset-processing.md
  • Inference Optimization: resources/inference-optimization.md
  • Spaces Deployment: resources/spaces-deployment.md
