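The Agent Eval Landscape: How to Evaluate, How to Design, How to Learn
The evaluation paradigm is breaking down. With SWE-bench retired, how should agent product teams measure real capability? This article breaks down the Agent Eval landscape along three dimensions: hands-on workflow, design methodology, and learning path.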
Algorithm Engineer. System Builder. AI Explorer.
I'm an algorithm engineer at a leading digital marketing group, where I design and build real-time bidding systems, ML model serving pipelines, and budget optimization algorithms for programmatic advertising at scale. My day-to-day involves Go and TensorFlow Serving — turning ad auction math into production models that handle millions of bid requests.
On the side, I run an AI infrastructure project: an LLM API gateway aggregating 40+ model providers, a lightweight agent framework, and a service quality monitoring system built on real-token probing. I care about systems that actually work under load — not just demos.
My path: from search-ads-rec system architecture to algorithm research. Currently exploring LLM4Rec and unified sequence modeling for large-scale recommendation — where transformer architectures meet feature interaction in conversion prediction. I believe the best way to understand a system is to build it yourself.
Agent Harness Observability — detect errors, context rot, and regressions in AI agent systems.
A native macOS voice-to-text app — press Fn, speak, and polished text lands at your cursor in any app.
A production-ready multi-agent platform with sandboxed execution, budget control, and observability.
A Claude Code skill that generates daily AI/tech intelligence reports from Hacker News and HuggingFace Papers.
A Claude Code skill that generates importable Excalidraw architecture diagrams from source code.