Observability First
SKILL.md
Observability First (可观测性优先)
Instructions
- 先定义要观测的关键流程与指标
- 先填写 Required Inputs(流程、阈值、负责人)并冻结
- 依序创建 Crash/ANR、结构化日志、性能指标
- 一次只补强一类信号,避免噪音扩散
- 完成后对照 Quick Checklist
When to Use
- 发布前需要稳定监控与回馈闭环
- 事故频繁但缺乏可定位信息
- 需要把性能与稳定性纳入日常决策
Example Prompts
- "请创建支付流程的可观测性指标与事件"
- "请设计 Crash/ANR 的告警门槛"
- "帮我创建结构化日志的字段规格"
- "请用 OpenTelemetry 追踪关键 API 调用链"
Workflow
- 先确认 Required Inputs(关键流程、SLO、告警接收人)
- 定义关键流程与 SLO,并建立事件命名规范
- 创建 Crash/ANR 与结构化事件
- 加入性能指标与告警门槛
- 建立回馈回路(分析 -> 修复 -> 验证)与值班流程
- 运行 Monitoring Gate 验收命令并记录结果
Practical Notes (2026)
- 先有指标再谈优化,避免主观调整
- 事件字段需一致,便于查找与汇整
- 告警门槛要可运行且可回溯
- 同一个业务事件只定义一个 canonical name,避免跨系统同义词
- P0 告警必须有 owner 与响应时限(SLA),避免“看见但没人处理”
- 仪表板要区分 release/build version,支持回归比对
Minimal Template
目标:
关键流程 owner:
告警接收渠道:
SLO 窗口(日/周):
关键流程:
指标/事件:
告警门槛:
验收: Quick Checklist
Required Inputs (执行前输入)
关键流程清单(启动/登录/支付/列表等)SLO(指标定义、统计窗口、目标值)Owner & Oncall(每个 P0/P1 指标对应负责人)告警渠道(PagerDuty/Slack/Email)事件命名与字段字典(event name、必填字段、枚举值)发布维度(build version、flavor、region)
Deliverables (完成后交付物)
SLO 文档(含阈值、owner、响应时限)结构化事件字典(字段说明 + 示例)Crash/ANR 上下文策略(custom keys 与分级)性能仪表板(启动、网络、滚动、关键交易)告警规则(P0/P1/P2)与升级路径反馈闭环记录模板(问题 -> 修复 -> 指标回归)
Monitoring Gate (验收门槛)
# 1) 基础质量
./gradlew lint test assemble
# 2) 性能量测(若项目有 benchmark 模块)
./gradlew :benchmark:connectedBenchmarkAndroidTest
# 3) 关键信号验证(按项目脚本调整)
./gradlew :app:connectedDebugAndroidTest
# 4) 发布前手动核查
# - Crashlytics/Performance dashboard 有最近 24h 数据
# - P0 告警路由已验证(至少演练一次)
没有 benchmark 模块时,需在 PR 说明中记录替代量测方法与结果。
Signals & SLOs
关键流程清单
| 流程 | P0 指标 | SLO 目标 |
|---|---|---|
| 启动 | Cold Start 时间 | P95 < 1.5s |
| 登录 | 成功率 | > 99.5% |
| 核心交易 | 完成率、延迟 | 成功率 > 99.9%, P95 < 3s |
| 列表滚动 | 掉帧率 | Jank < 1% |
| 网络请求 | 成功率、延迟 | 成功率 > 99%, P95 < 500ms |
指标分级
enum class MetricPriority {
P0, // Crash/ANR 率、关键流程失败率 — 立即告警
P1, // 首次渲染时间、列表滚动流畅度 — 每日检视
P2 // 特定功能转换率或完成率 — 每周检视
}
Firebase Performance Monitoring
自定义 Trace
class CheckoutTracer @Inject constructor() {
fun <T> traceCheckout(block: () -> T): T {
val trace = Firebase.performance.newTrace("checkout_flow")
trace.start()
return try {
val result = block()
trace.putAttribute("result", "success")
result
} catch (e: Exception) {
trace.putAttribute("result", "failure")
trace.putAttribute("error", e.javaClass.simpleName)
throw e
} finally {
trace.stop()
}
}
}
// 使用
class CheckoutUseCase @Inject constructor(
private val tracer: CheckoutTracer,
private val repository: OrderRepository
) {
suspend fun execute(order: Order): OrderResult {
return tracer.traceCheckout {
repository.submitOrder(order)
}
}
}
网络请求自动追踪
class PerformanceInterceptor : Interceptor {
override fun intercept(chain: Interceptor.Chain): Response {
val request = chain.request()
val metric = Firebase.performance.newHttpMetric(
request.url.toString(),
request.method
)
metric.start()
return try {
val response = chain.proceed(request)
metric.setResponseContentType(response.header("Content-Type"))
metric.setHttpResponseCode(response.code)
metric.setResponsePayloadSize(response.body?.contentLength() ?: 0)
response
} catch (e: IOException) {
metric.putAttribute("result", "io_exception")
throw e
} finally {
metric.stop()
}
}
}
Structured Events
统一事件接口
interface AnalyticsTracker {
fun track(event: AnalyticsEvent)
}
data class AnalyticsEvent(
val name: String,
val params: Map<String, Any> = emptyMap()
)
class CompositeTracker @Inject constructor(
private val trackers: Set<@JvmSuppressWildcards AnalyticsTracker>
) : AnalyticsTracker {
override fun track(event: AnalyticsEvent) {
trackers.forEach { it.track(event) }
}
}
class FirebaseTracker @Inject constructor() : AnalyticsTracker {
override fun track(event: AnalyticsEvent) {
Firebase.analytics.logEvent(event.name) {
event.params.forEach { (key, value) ->
when (value) {
is String -> param(key, value)
is Long -> param(key, value)
is Double -> param(key, value)
is Bundle -> param(key, value)
}
}
}
}
}
事件字段规格
object EventKeys {
const val FLOW_ID = "flow_id"
const val USER_TIER = "user_tier"
const val BUILD_VERSION = "build_version"
const val LATENCY_MS = "latency_ms"
const val RESULT = "result"
const val ERROR_CODE = "error_code"
const val SCREEN_NAME = "screen_name"
}
// 使用
tracker.track(AnalyticsEvent(
name = "checkout_completed",
params = mapOf(
EventKeys.FLOW_ID to flowId,
EventKeys.LATENCY_MS to duration,
EventKeys.RESULT to "success",
EventKeys.USER_TIER to "premium"
)
))
Crash / ANR Strategy
Crash 上下文增强
class CrashContextManager @Inject constructor() {
fun setFlowContext(flowName: String, params: Map<String, String> = emptyMap()) {
Firebase.crashlytics.apply {
setCustomKey("current_flow", flowName)
setCustomKey("flow_timestamp", System.currentTimeMillis().toString())
params.forEach { (k, v) -> setCustomKey(k, v) }
}
}
fun clearFlowContext() {
Firebase.crashlytics.setCustomKey("current_flow", "none")
}
}
Non-Fatal 分级策略
enum class NonFatalSeverity { LOW, MEDIUM, HIGH }
class NonFatalReporter @Inject constructor() {
fun report(
exception: Exception,
severity: NonFatalSeverity,
context: Map<String, String> = emptyMap()
) {
if (severity == NonFatalSeverity.LOW) return
Firebase.crashlytics.apply {
setCustomKey("severity", severity.name)
context.forEach { (k, v) -> setCustomKey(k, v) }
recordException(exception)
}
}
}
Performance Signals
Startup 量测上报
class StartupMetricReporter @Inject constructor(
private val tracker: AnalyticsTracker
) {
private var processStartTime: Long = 0L
fun onProcessStart() {
processStartTime = SystemClock.elapsedRealtime()
}
fun onFirstFrameRendered() {
val duration = SystemClock.elapsedRealtime() - processStartTime
tracker.track(AnalyticsEvent(
name = "app_startup",
params = mapOf(
EventKeys.LATENCY_MS to duration,
"startup_type" to "cold"
)
))
}
}
CI Gate 性能门槛
# .github/workflows/performance-gate.yml
name: Performance Gate
on:
pull_request:
branches: [main]
jobs:
benchmark:
runs-on: macos-latest
steps:
- uses: actions/checkout@v4
- name: Run Macrobenchmark
uses: reactivecircus/android-emulator-runner@v2
with:
api-level: 34
script: ./gradlew :benchmark:connectedBenchmarkAndroidTest
- name: Check Startup Threshold
run: |
STARTUP_MS=$(jq '.benchmarks[0].metrics.timeToInitialDisplayMs.median' benchmark/build/outputs/connected_android_test_additional_output/benchmarkData.json)
if (( $(echo "$STARTUP_MS > 1500" | bc -l) )); then
echo "Startup ${STARTUP_MS}ms exceeds 1500ms threshold"
exit 1
fi
Alerting & Feedback Loop
告警门槛
| 指标 | 告警门槛 | 动作 |
|---|---|---|
| Crash-free rate | < 99.5% | P0 立即通知 |
| ANR rate | > 0.5% | P0 立即通知 |
| API 成功率 | < 99% | P1 当日处理 |
| Cold Start P95 | > 2s | P1 当日处理 |
| Jank rate | > 5% | P2 本周处理 |
回馈回路
发现问题 → 定位根因 → 修复 → 验证指标回归 → 更新 SLO
↑ │
└──────────────────────────────────────────────┘
Quick Checklist
- Required Inputs 已填写并冻结(流程/SLO/owner/告警渠道)
- 关键流程与 SLO 定义完成
- 关键事件命名与字段字典完成(含必填字段)
- Firebase Performance 自定义 Trace 覆盖核心流程
- 事件字段统一且可查找
- Crash/ANR 上下文增强(Custom Keys)
- Non-Fatal 分级策略避免噪音
- 性能指标有量测与 CI Gate 门槛
- 告警门槛与回馈回路已创建
- 每个 P0 指标有明确 owner 与响应时限
- Monitoring Gate 已执行并记录结果