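GPU Inference Deployment Study Guide: From VRAM Calculation to Performance Optimization
Given a GPU with 16GB of VRAM, how large a model can you deploy? Covering VRAM calculation, the memory hierarchy, the Roofline Model, and quantization strategies, this guide builds a complete mental framework for GPU inference deployment, organized by Bloom's taxonomy.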
Algorithm Engineer. System Builder. AI Explorer.
I'm an algorithm engineer at a leading digital marketing group, where I design and build real-time bidding systems, ML model serving pipelines, and budget optimization algorithms for programmatic advertising at scale. My day-to-day involves Go and TensorFlow Serving — turning ad auction math into production models that handle millions of bid requests.
On the side, I run an AI infrastructure project: an LLM API gateway aggregating 40+ model providers, a lightweight agent framework, and a service quality monitoring system built on real-token probing. I care about systems that actually work under load — not just demos.
My path: from search-ads-rec system architecture to algorithm research. Currently exploring LLM4Rec and unified sequence modeling for large-scale recommendation — where transformer architectures meet feature interaction in conversion prediction. I believe the best way to understand a system is to build it yourself.
Agent Harness Observability — detect errors, context rot, and regressions in AI agent systems.
A native macOS voice-to-text app — press Fn, speak, and polished text lands at your cursor in any app.
A production-ready multi-agent platform with sandboxed execution, budget control, and observability.
A Claude Code skill that generates daily AI/tech intelligence reports from Hacker News and HuggingFace Papers.
A Claude Code skill that generates importable Excalidraw architecture diagrams from source code.
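Why did RL succeed for DeepSeek-R1 while most academic replication attempts failed? Through the lens of pass@k, this post examines the fundamental limits of training LLMs with RL, the five major failure modes, and the entropy collapse phenomenon.
From the principles of PagedAttention to KV Cache capacity calculation in multi-GPU deployments, this post explains why a single-GPU metric can show 52K while the deployment still serves 200K+ tokens of context. It covers the Block Manager architecture, TP/CP parallelism strategies, and the particularities of the MLA architecture.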