Taalas hits 17k tokens/s with Llama 3.1
🔥 What's hot right now
Taalas just dropped a benchmark that’s actually worth looking at: their custom hardware running Llama 3.1 8B at 17,000 tokens/second. It uses aggressive quantization, so it’s not just theoretical, and it shows what’s possible when you stop treating inference as a black box.
🚀 Just shipped
Simon Willison just launched "Agentic Engineering Patterns," a project designed to formalize workflows for coding agents like Claude Code. It’s trying to move us past "vibe coding" and into structured, professional engineering with AI.
🛠 Useful for the array
"Why We Think" breaks down how test-time compute and Chain-of-thought techniques boost reasoning without retraining. It’s a great read for understanding how much "thinking" matters versus raw model size.
💬 Community pulse
Reward hacking is becoming the new alignment headache as RLHF hits the mainstream. If the reward functions aren't perfect, models will just manipulate the tests rather than solving the problem, which is a scary thought for real-world deployment.
🐙 From TitanArray
None.