observability
Observability Skill
Developer-focused observability: logging, debugging, profiling, and performance monitoring.
When This Skill Activates
- Adding logging to code
- Debugging errors or issues
- Profiling performance bottlenecks
- Analyzing memory usage
- Adding instrumentation
- Setting up structured logging
- Keywords: "logging", "debug", "profiling", "performance", "trace"
Structured Logging
Why Structured Logging?
Traditional (unstructured):
print(f"User {user_id} logged in from {ip_address}")
# Hard to parse, search, filter
Structured (JSON):
logger.info("user_login", extra={
"user_id": user_id,
"ip_address": ip_address,
"timestamp": datetime.utcnow().isoformat()
})
# Easy to parse, search, filter, aggregate
Benefits:
- ✅ Easy to search/filter
- ✅ Machine-readable
- ✅ Aggregate metrics
- ✅ Correlate events
- ✅ Feed to log aggregators (ELK, Splunk, Datadog)
Python Logging Setup
Basic configuration:
import logging
import sys
# Configure root logger
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.StreamHandler(sys.stdout), # Console
logging.FileHandler('app.log') # File
]
)
logger = logging.getLogger(__name__)
# Use it
logger.info("Application started")
logger.warning("Database connection slow")
logger.error("Failed to process request", exc_info=True)
Log Levels
| Level | Value | Use When | Example |
|---|---|---|---|
| DEBUG | 10 | Development, detailed tracing | Variable values, function calls |
| INFO | 20 | Normal operations | User logged in, request processed |
| WARNING | 30 | Unexpected but handled | Deprecated API used, slow query |
| ERROR | 40 | Error occurred | Failed to save file, API error |
| CRITICAL | 50 | System failure | Database down, disk full |
Usage:
logger.debug(f"Processing item: {item}") # Development only
logger.info(f"User {user_id} logged in") # Normal event
logger.warning(f"Query took {duration}s (slow!)") # Unexpected
logger.error(f"Failed to save: {error}") # Error
logger.critical(f"Database unreachable!") # System failure
Structured JSON Logging
Using python-json-logger:
from pythonjsonlogger import jsonlogger
import logging
# Setup JSON formatter
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(
'%(asctime)s %(name)s %(levelname)s %(message)s'
)
logHandler.setFormatter(formatter)
logger = logging.getLogger()
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)
# Log with extra fields
logger.info("user_login", extra={
"user_id": 123,
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0..."
})
Output:
{
"asctime": "2025-10-24 12:00:00,000",
"name": "root",
"levelname": "INFO",
"message": "user_login",
"user_id": 123,
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0..."
}
Context Logging
Add context to all logs in a function:
import logging
from contextvars import ContextVar
# Context variable
request_id_var: ContextVar[str] = ContextVar('request_id', default=None)
class ContextFilter(logging.Filter):
def filter(self, record):
record.request_id = request_id_var.get()
return True
# Add filter
logger.addFilter(ContextFilter())
# Use in request handler
def handle_request(request):
request_id = generate_request_id()
request_id_var.set(request_id)
logger.info("Processing request") # Includes request_id automatically
process_data()
logger.info("Request complete") # Includes request_id automatically
Best Practices for Logging
✅ DO:
# 1. Use appropriate log levels
logger.debug("Variable value: {value}") # Debug details
logger.info("User action completed") # Normal operations
logger.error("Failed to process", exc_info=True) # Errors
# 2. Include context
logger.info("Processing payment", extra={
"user_id": user_id,
"amount": amount,
"currency": currency
})
# 3. Log exceptions with stack traces
try:
process_data()
except Exception as e:
logger.error("Processing failed", exc_info=True) # Includes stack trace
# 4. Use lazy formatting
logger.debug("Processing %s items", len(items)) # Only formats if DEBUG enabled
❌ DON'T:
# 1. Don't log sensitive data
logger.info(f"User password: {password}") # NEVER!
logger.info(f"Credit card: {card_number}") # NEVER!
# 2. Don't log in loops (too verbose)
for item in items:
logger.info(f"Processing {item}") # 1M items = 1M log lines!
# 3. Don't use print() in production
print("Debug info") # Use logger.debug() instead
# 4. Don't catch and ignore
try:
process()
except:
pass # Silent failure! At least log it!
Debugging
Print Debugging
Strategic print statements:
def process_data(data):
print(f"[DEBUG] Input: {data}") # See input
result = transform(data)
print(f"[DEBUG] After transform: {result}") # See intermediate
validated = validate(result)
print(f"[DEBUG] After validate: {validated}") # See output
return validated
Use temporary, remove before commit:
# ✅ GOOD: Temporary debug
print(f"[DEBUG] value={value}") # Remove before commit
# ❌ BAD: Permanent debug prints
print("Processing...") # Clutters output forever
Interactive Debugger (pdb)
Built-in Python debugger:
import pdb
def buggy_function(x, y):
result = x + y
pdb.set_trace() # Debugger starts here
return result * 2
# When debugger starts:
# (Pdb) p x # Print variable x
# (Pdb) p y # Print variable y
# (Pdb) p result # Print result
# (Pdb) n # Next line
# (Pdb) s # Step into function
# (Pdb) c # Continue execution
# (Pdb) q # Quit debugger
Breakpoint (Python 3.7+):
def buggy_function(x, y):
result = x + y
breakpoint() # Better than pdb.set_trace()
return result * 2
IPython Debugger (ipdb)
Enhanced debugger with syntax highlighting:
pip install ipdb
import ipdb
def buggy_function(x, y):
result = x + y
ipdb.set_trace() # Colorized, tab completion
return result * 2
Commands:
(ipdb) p x # Print x
(ipdb) pp x # Pretty-print x
(ipdb) l # List code around current line
(ipdb) w # Where (stack trace)
(ipdb) u # Up stack frame
(ipdb) d # Down stack frame
(ipdb) h # Help
Post-Mortem Debugging
Debug after exception:
import pdb
import sys
def buggy_function():
x = 1
y = 0
return x / y # ZeroDivisionError
try:
buggy_function()
except:
pdb.post_mortem(sys.exc_info()[2])
# Debugger starts at exception point
Automatic on uncaught exception:
import sys
import pdb
def exception_handler(exc_type, exc_value, exc_traceback):
if exc_type is KeyboardInterrupt:
sys.__excepthook__(exc_type, exc_value, exc_traceback)
return
print("Uncaught exception, starting debugger:")
pdb.post_mortem(exc_traceback)
sys.excepthook = exception_handler
# Now any uncaught exception starts debugger
Profiling
CPU Profiling (cProfile)
Find slow functions:
import cProfile
import pstats
def slow_function():
# Your code here
pass
# Profile
profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()
# Print stats
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10) # Top 10 slowest functions
Output:
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.500 0.050 0.800 0.080 mymodule.py:10(slow_function)
100 0.300 0.003 0.300 0.003 mymodule.py:20(helper)
Sort options:
cumulative- Total time (includes subfunctions)time- Internal time (excludes subfunctions)calls- Number of calls
Profile Decorator
Profile specific functions:
import cProfile
import pstats
from functools import wraps
def profile(func):
@wraps(func)
def wrapper(*args, **kwargs):
profiler = cProfile.Profile()
profiler.enable()
result = func(*args, **kwargs)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)
return result
return wrapper
@profile
def slow_function():
# Your code
pass
Line Profiler (line_profiler)
Profile line-by-line:
pip install line_profiler
from line_profiler import LineProfiler
def slow_function():
total = 0
for i in range(1000000):
total += i
return total
profiler = LineProfiler()
profiler.add_function(slow_function)
profiler.enable()
slow_function()
profiler.disable()
profiler.print_stats()
Output:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 def slow_function():
3 1 2.0 2.0 0.0 total = 0
4 1000001 150000.0 0.1 30.0 for i in range(1000000):
5 1000000 350000.0 0.4 70.0 total += i
6 1 1.0 1.0 0.0 return total
Memory Profiling (memory_profiler)
Find memory leaks:
pip install memory_profiler
from memory_profiler import profile
@profile
def memory_hog():
big_list = [0] * (10 ** 7) # 10M items
return sum(big_list)
memory_hog()
Output:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
2 40.0 MiB 40.0 MiB 1 @profile
3 def memory_hog():
4 116.4 MiB 76.4 MiB 1 big_list = [0] * (10 ** 7)
5 116.4 MiB 0.0 MiB 1 return sum(big_list)
Sampling Profiler (py-spy)
Profile running processes (no code changes needed):
pip install py-spy
# Profile running Python process
py-spy top --pid 12345
# Generate flamegraph
py-spy record -o profile.svg --pid 12345
Advantages:
- ✅ No code modification
- ✅ Profile production code
- ✅ Low overhead
- ✅ Visualizations (flamegraphs)
Stack Traces
Print Stack Trace
import traceback
try:
risky_operation()
except Exception as e:
traceback.print_exc() # Print stack trace to stderr
Capture to string:
import traceback
try:
risky_operation()
except Exception as e:
trace = traceback.format_exc()
logger.error(f"Error occurred: {trace}")
Rich Tracebacks (rich)
Beautiful, informative tracebacks:
pip install rich
from rich.traceback import install
install(show_locals=True)
def buggy_function():
x = 42
y = None
return x + y # TypeError with beautiful traceback
buggy_function()
Shows:
- ✅ Syntax highlighted code
- ✅ Local variables at each frame
- ✅ Clear error message
Performance Monitoring
Timing Decorator
Measure function execution time:
import time
from functools import wraps
def timer(func):
@wraps(func)
def wrapper(*args, **kwargs):
start = time.perf_counter()
result = func(*args, **kwargs)
end = time.perf_counter()
duration = end - start
print(f"{func.__name__} took {duration:.4f}s")
return result
return wrapper
@timer
def slow_function():
time.sleep(1)
return "done"
slow_function()
# Output: slow_function took 1.0001s
Context Manager Timer
import time
from contextlib import contextmanager
@contextmanager
def timer(name):
start = time.perf_counter()
yield
end = time.perf_counter()
print(f"{name} took {end - start:.4f}s")
# Usage
with timer("Database query"):
result = db.query(...)
with timer("File processing"):
process_file(path)
Performance Assertions
Fail if too slow:
import time
def performance_test():
start = time.perf_counter()
# Your code
process_data()
duration = time.perf_counter() - start
assert duration < 0.1, f"Too slow: {duration:.4f}s"
Simple Metrics
Counter
class Metrics:
def __init__(self):
self.counters = {}
def increment(self, name, value=1):
self.counters[name] = self.counters.get(name, 0) + value
def get(self, name):
return self.counters.get(name, 0)
def reset(self):
self.counters = {}
# Usage
metrics = Metrics()
def process_item(item):
metrics.increment("items_processed")
if item.is_valid():
metrics.increment("items_valid")
else:
metrics.increment("items_invalid")
# Later
print(f"Processed: {metrics.get('items_processed')}")
print(f"Valid: {metrics.get('items_valid')}")
print(f"Invalid: {metrics.get('items_invalid')}")
Histogram
Track value distributions:
from collections import defaultdict
class Histogram:
def __init__(self):
self.buckets = defaultdict(int)
def record(self, value):
# Bucket by value range
if value < 1:
bucket = "< 1s"
elif value < 5:
bucket = "1-5s"
elif value < 10:
bucket = "5-10s"
else:
bucket = "> 10s"
self.buckets[bucket] += 1
def report(self):
for bucket, count in sorted(self.buckets.items()):
print(f"{bucket}: {count}")
# Usage
histogram = Histogram()
for duration in query_durations:
histogram.record(duration)
histogram.report()
# Output:
# < 1s: 850
# 1-5s: 120
# 5-10s: 25
# > 10s: 5
Debugging Best Practices
1. Binary Search Debugging
Narrow down the problem:
# Problem: Something breaks between start and end
def debug_binary_search():
# Step 1: Check middle
middle_result = process_half()
print(f"Middle result: {middle_result}")
# Step 2: Narrow to first or second half
if middle_result == expected:
# Problem in second half
debug_second_half()
else:
# Problem in first half
debug_first_half()
2. Rubber Duck Debugging
Explain code out loud (or to a rubber duck):
- Describe what code should do
- Explain what code actually does
- Often you'll spot the bug while explaining!
3. Add Assertions
Catch bugs early:
def divide(a, b):
assert b != 0, "Divisor cannot be zero"
assert isinstance(a, (int, float)), "a must be numeric"
assert isinstance(b, (int, float)), "b must be numeric"
result = a / b
assert isinstance(result, (int, float)), "Result should be numeric"
return result
4. Simplify and Isolate
Reduce problem to minimal case:
# ❌ BAD: Complex function with bug
def complex_function(data, config, options, flags):
# 100 lines of code
pass
# ✅ GOOD: Extract minimal failing case
def minimal_failing_case():
data = [1, 2, 3]
result = transform(data) # Fails here
return result
Logging Anti-Patterns
❌ DON'T:
# 1. Log sensitive data
logger.info(f"Password: {password}")
# 2. Log in tight loops
for item in million_items:
logger.info(f"Processing {item}")
# 3. Catch and ignore silently
try:
critical_operation()
except:
pass # Silent failure!
# 4. Over-log (log everything)
logger.debug("Starting function")
logger.debug("Variable x = 1")
logger.debug("Variable y = 2")
logger.debug("Calling helper")
logger.debug("Helper returned")
# 5. Use wrong log level
logger.error("User logged in") # Should be INFO
logger.info("Database crashed") # Should be CRITICAL
Quick Reference
Logging
logger.debug("Detailed trace")
logger.info("Normal event")
logger.warning("Unexpected")
logger.error("Error occurred", exc_info=True)
logger.critical("System failure")
Debugging
breakpoint() # Start debugger
pdb.set_trace() # Start debugger (old way)
traceback.print_exc() # Print stack trace
Profiling
# CPU
python -m cProfile -s cumulative script.py
# Memory
@profile # from memory_profiler
def func(): pass
# Live process
py-spy top --pid 12345
Timing
@timer
def func(): pass
with timer("operation"):
do_work()
Key Takeaways
- Structured logging - Use JSON for machine-readable logs
- Appropriate log levels - DEBUG/INFO/WARNING/ERROR/CRITICAL
- Don't log secrets - Never log passwords, API keys, etc.
- Use debugger - breakpoint() or ipdb for interactive debugging
- Profile first - Don't optimize without profiling
- cProfile for CPU - Find slow functions
- memory_profiler for memory - Find memory leaks
- Stack traces on errors - Always log exc_info=True
- Time critical paths - Use @timer decorator
- Simple metrics - Track counters and histograms
Version: 1.0.0 Type: Knowledge skill (no scripts) See Also: python-standards (best practices), testing-guide (test debugging), security-patterns (safe logging)
More from akaszubski/anyclaude-local
testing-guide
Complete testing methodology - TDD, progression tracking, regression prevention, and test patterns
1git-workflow
Git best practices, commit conventions, branching strategies, and pull request workflows. Use when working with git operations, commits, branches, or PRs.
1python-standards
Python code quality standards (PEP 8, type hints, docstrings). Use when writing Python code.
1github-workflow
GitHub-first workflow - Issues, PRs, milestones, auto-tracking for solo developer productivity
1code-review
This skill should be used when reviewing code or preparing code for review. It provides guidelines for what to look for in reviews, how to write constructive feedback, and standards for review comments.
1database-design
Database schema design, migrations, query optimization, and ORM patterns. Use when designing database schemas, writing migrations, optimizing queries, or working with ORMs like SQLAlchemy or Django ORM.
1