Together Turbo: Algorithms & Architectures for Fast Inference

In the world of Generative AI, where large language models (LLMs) grow larger and more complex every day, the real challenge for providers and developers is no longer just how “smart” the model is. The key questions now are:
- How do we reduce latency?
- How do we control cost?
In short: how can we run large-scale models with fast inference at a reasonable cost, while keeping answer quality high?
This is the core problem Together AI focuses on. In this session at AI-VOLUTION, Ben Athiwaratkun, Staff AI Scientist and Turbo Research Team Lead at Together AI, revealed the technology behind “Together Turbo”, explaining how the team combines cutting-edge research with robust engineering systems to break previous speed limits efficiently and reliably.
Main Goal of Together AI
Together AI doesn’t just want to build models. They want to “democratize intelligence” and make it accessible to everyone.
The key is to make the inference process (the computation that generates answers) as efficient as possible: faster, more optimized, and more resource-friendly.
The Together Turbo research team does not rely on a single technique. Instead, they combine four core strategies that work together seamlessly:
- Trimming the Fat with Activation Sparsity (TEAL)
The research team found something interesting: in each forward pass, most activations contribute little to the output. TEAL skips computation on those low-value activations by masking them out. The result is a big reduction in memory load and up to 40% lower latency, while keeping the model just as capable as before.
- Reshaping the Model with Architecture Adaptation
Together AI wasn’t afraid to tear down and rebuild the model’s internal architecture:
- Lower wait time: They use lateral residuals so that data transfer and computation can overlap, instead of running strictly one after the other.
- Stronger, longer memory: They adapt the Transformer’s weights to work together with a Mamba-style architecture. This lets the model accurately retrieve information from contexts up to 36k tokens, even though it was originally trained on only 2k tokens.
- Reading Ahead with Speculative Decoding (Atlas)
Instead of generating one token at a time and waiting at every step, the Atlas system lets the AI predict and generate multiple tokens at once in a single step.
The special part: Atlas supports runtime learning. The more it is used, the more accurate its next-token guesses become, so it keeps getting faster over time.
- Shrinking the Model While Keeping Quality (Quantization & Post-training)
By carefully quantizing the model down to FP8 or FP4, huge models can run smoothly on modern hardware. Combined with smart post-training optimizations, this also helps remove bottlenecks in the Reinforcement Learning (RL) pipeline.
The Result
When all these techniques are stacked together, they deliver a breakthrough in real-world performance. On NVIDIA Blackwell chips, the DeepSeek V3.1 model running with Together Turbo tech can jump from 100 tokens/second all the way up to 500 tokens/second.
Conclusion
Ben emphasized that this success didn’t come from fixing isolated issues. It came from full-stack co-design: co-optimizing everything end-to-end, from algorithms and kernels, to the operating system, all the way to training methods.
This is a major step in turning lab research into production-grade AI systems—making AI faster, more powerful, and more accessible for real-world business applications.
Watch the full session here: https://www.youtube.com/watch?v=zHdLBoXln7I





