distribution-search
Distribution Search
Overview
This skill provides structured guidance for finding probability distributions that satisfy specific statistical constraints. These problems typically involve constructing a discrete probability distribution over a large vocabulary that achieves target values for metrics like KL divergence, entropy, or other information-theoretic quantities.
Mathematical Analysis First
Before writing any code, perform thorough mathematical analysis to constrain the solution space:
-
Derive analytical relationships between the target metrics and distribution properties
- For KL divergence from uniform: KL(P||U) = -H(P) + log(V) where H(P) is entropy and V is vocabulary size
- For KL divergence to uniform: KL(U||P) = log(V) - (1/V) * Σ log(P(i))
-
Calculate fixed quantities from the problem specification
- Vocabulary size V determines log(V)
- Target values combined with log(V) constrain feasible entropy ranges
-
Identify implied constraints from simultaneous requirements
- Multiple target metrics often severely restrict the feasible solution space
- Determine if the problem is over-constrained before attempting optimization
-
Estimate the solution structure analytically
- Determine the approximate entropy the distribution must have
- Estimate how concentrated or spread the probability mass should be
Parameterization Strategy
High-dimensional optimization over all probability values is impractical. Use reduced parameterizations:
Recommended Parameterizations
-
Power law with uniform tail: k high-probability elements following p_i ∝ i^(-α), remaining elements share probability equally
- Parameters: k (number of special elements), α (power law exponent), p_rest (probability for tail)
-
Two-group distribution: k elements with probability p_high, remaining elements with probability p_low
- Parameters: k, p_high (p_low is determined by normalization)
-
Multi-tier distribution: Several groups of elements with distinct probability levels
- Parameters: group sizes and probability levels
Parameter Selection Guidance
- Start with few parameters (2-3) and add flexibility only if needed
- Derive parameter bounds from mathematical constraints rather than arbitrary choices
- Calculate reasonable initial values from the analytical solution estimates
Implementation Approach
Modular Code Structure
Organize code into separate, tested functions:
- kl_forward(P, V): Compute KL(P||Uniform)
- kl_backward(P, V): Compute KL(Uniform||P)
- create_distribution(params, V): Generate distribution from parameterization
- objective(params): Optimization objective combining constraints
- verify_solution(P, V, targets): Independent verification
Computational Efficiency
- Avoid full array operations when possible - use analytical formulas for symmetric distributions
- For two-group distributions: compute entropy/KL directly from group sizes and probabilities
- Estimate computational cost before running - set appropriate timeouts
Optimization Strategy
- Coarse grid search over reduced parameter space to find promising regions
- Local optimization from multiple starting points in promising regions
- Refinement with tighter tolerances near solutions
Verification
Independent Verification
Always verify solutions with code independent from the optimization:
- Check probability constraints: All values non-negative, sum to 1.0
- Compute metrics directly: Use explicit formulas, not the optimization objective
- Test against known cases: Verify computation on uniform distribution or other known solutions
Common Verification Bugs
- Incorrect handling of log(0) - use appropriate thresholds (e.g., max(p, 1e-30))
- Array indexing errors when elements are counted vs indexed
- Forgetting to normalize after parameter adjustments
- Multiplication errors when computing sums over groups
Common Pitfalls
Computational
- Starting with full-dimensional optimization: Immediately recognize the need for reduced parameterization
- Insufficient timeout estimation: Estimate iterations needed before running
- Rewriting entire scripts: Use modular code to enable targeted fixes
Mathematical
- Incomplete constraint analysis: Fully leverage mathematical relationships before coding
- Arbitrary parameter bounds: Derive bounds from problem constraints
- Poor initial values: Use analytical estimates for starting points
Verification
- Trusting optimization output directly: Always verify with independent computation
- Not testing verification code: Verify the verifier using known solutions first
- Numerical precision issues: Handle small probabilities carefully
Problem-Solving Workflow
- Analyze: Derive analytical relationships and estimate solution structure
- Parameterize: Choose reduced parameterization matching expected structure
- Implement: Write modular, tested code for each component
- Search: Grid search followed by local optimization
- Verify: Independent verification of candidate solutions
- Refine: Adjust parameterization if tolerances not achieved
When Solutions Are Not Found
If optimization fails to find valid solutions:
- Check if the problem is mathematically feasible given constraints
- Verify the parameterization can represent valid solutions
- Expand the parameterization to add flexibility
- Check for bugs in objective function or constraint handling
- Try different optimization algorithms or starting points