
Catalyst DeepSeek: The innovation behind its cost efficiency

Yang Zhao
Updated 23:01, 09-Feb-2025



Editor's Note: The AI Action Summit 2025 is being held in Paris from February 10 to 11. Last month, Chinese AI company DeepSeek sent shockwaves through the global market, making China's voice and solutions a focal point of the summit. We continue our series of commentary articles by CGTN technology reporter and commentator Yang Zhao on DeepSeek. In this chapter, we break down, in the simplest terms, how DeepSeek is leveraging innovative thinking to overcome U.S. chip restrictions.

Let me start with my conclusion: DeepSeek's key to success is maximizing efficiency under constraints.

Due to U.S. chip export restrictions, Chinese companies cannot access cutting-edge AI chips like NVIDIA's H100, which outclass their export-compliant counterparts in interconnect bandwidth and communication speed. This forced DeepSeek to innovate under hardware limitations, pushing efficiency to the extreme by minimizing computational waste and making every GPU cycle count.

Here are some examples of how DeepSeek optimized performance:

MoE (Mixture of Experts): Traditional models like GPT-3.5 activate the entire model for every task. DeepSeek's MoE approach divides the model into multiple specialized "experts" and activates only the ones a given input needs, so most of the network sits idle for any single token and the computational overhead drops sharply.
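As a rough illustration of top-k expert routing, here is a minimal PyTorch sketch. All names and sizes (eight experts, two active per token) are invented for the example and are not DeepSeek's actual architecture:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative sketch only)."""
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)   # router: scores each expert per token
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)       # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token; the rest stay idle.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot].unsqueeze(1) * expert(x[token_ids])
        return out

moe = TinyMoE()
y = moe(torch.randn(16, 64))                    # each token touches 2 of 8 experts
```

With this routing, each token pays for two expert MLPs instead of eight, roughly a quarter of the dense compute, which is exactly the saving described above.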

DeepSeekMLA (Multi-head Latent Attention): This technique compresses memory usage by retaining key contextual information rather than storing everything – akin to remembering the essence of a book rather than every word. Latent attention distills each token's attention state into a small latent representation, so the model caches fewer, more relevant pieces of information while maintaining high performance, reducing the strain on memory and speeding up processing.
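The caching trick can be sketched in a few lines, assuming a single shared down-projection for keys and values (real MLA also handles rotary position embeddings and per-head details omitted here):

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Sketch of the MLA idea: cache one small latent per token
    instead of the full keys and values (simplified)."""
    def __init__(self, dim=1024, latent_dim=128):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim)   # compress: dim -> latent_dim
        self.up_k = nn.Linear(latent_dim, dim)   # reconstruct keys on the fly
        self.up_v = nn.Linear(latent_dim, dim)   # reconstruct values on the fly

    def forward(self, h, cache):                 # h: (batch, 1, dim), one new token
        cache = torch.cat([cache, self.down(h)], dim=1)  # store only the latent
        return self.up_k(cache), self.up_v(cache), cache

mla = LatentKVCache()
cache = torch.zeros(2, 0, 128)                   # empty cache, batch of 2
for _ in range(3):
    k, v, cache = mla(torch.randn(2, 1, 1024), cache)
print(cache.shape)  # (2, 3, 128): 128 floats cached per token, not 2 * 1024
```

Under these toy numbers the cache shrinks sixteen-fold: the "remember the essence, not every word" effect in code.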

Precision Optimization: Instead of using high-precision formats like BF16 or FP32, DeepSeek stores parameters in FP8, reducing memory requirements without significant loss of accuracy. Imagine replacing high-resolution photos with detailed sketches – less data, same impact.
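A quick experiment (PyTorch 2.1 or later) makes the saving concrete; note that production FP8 training also applies scaling factors to keep values inside FP8's narrow range, which this sketch omits:

```python
import torch

w = torch.randn(4096, 4096)                  # stand-in for a weight matrix
w_bf16 = w.to(torch.bfloat16)
w_fp8 = w.to(torch.float8_e4m3fn)            # 1 byte per parameter

print(w_bf16.element_size(), "bytes/param ->", w_fp8.element_size(), "byte/param")

# Round-trip error stays small relative to the weights' own scale:
err = (w_fp8.to(torch.float32) - w).abs().mean() / w.abs().mean()
print(f"mean relative error after FP8 round trip: {err:.1%}")
```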

DeepSeek's technical report on its V3 model notes that training used NVIDIA's H800 GPUs. This product exists because of the U.S. government's chip export restrictions on China: the H100, one of the most powerful GPUs for AI training, was unavailable to Chinese companies, so NVIDIA created the H800 as a "scaled-down" version to comply with export controls.

So, what exactly does "scaled-down" mean? The primary difference lies in cross-GPU communication bandwidth. When AI tasks are distributed across multiple GPUs, they require fast data exchange, similar to a group of workers collaborating on a task; if the bandwidth is restricted, this communication slows down and drags on overall computation efficiency. The H800's NVLink bandwidth is significantly reduced (reportedly from the H100's 900 GB/s to about 400 GB/s) – much like workers going from face-to-face communication to using walkie-talkies, collaboration becomes less efficient.
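To make the cost of that exchange visible, here is a micro-benchmark of a single all-reduce, the collective operation that synchronizes GPUs during training. The script name, tensor size and launch command are arbitrary choices for illustration:

```python
import os
import torch
import torch.distributed as dist

# Launch (hypothetical filename): torchrun --nproc_per_node=8 bench_allreduce.py
def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    x = torch.randn(64 * 1024 * 1024, device="cuda")   # 256 MB of FP32 "gradients"

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    dist.all_reduce(x)          # every GPU exchanges and sums the tensor
    end.record()
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print(f"256 MB all-reduce took {start.elapsed_time(end):.1f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On hardware with reduced NVLink bandwidth this step simply takes longer, and it sits on the critical path of every training iteration – which is why hiding or minimizing it pays off so handsomely.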

DeepSeek's approach was to skip the "manager" and direct the workers itself. NVIDIA already provides a high-level management system for GPU computing – CUDA (Compute Unified Device Architecture). Think of CUDA as a factory manager who automatically assigns tasks to workers (GPU cores) without the user needing to worry about low-level details. But given the H800's limitations, DeepSeek found that CUDA's default scheduling wasn't sufficient for its extreme optimization needs.

To overcome this hardware bottleneck, DeepSeek's engineers decided to bypass parts of CUDA and control the GPU with lower-level instructions. They used PTX (Parallel Thread Execution), NVIDIA's low-level instruction set that sits beneath CUDA and is far more granular. If CUDA is the high-level "manager" in a factory, PTX is the act of instructing each worker (GPU core) directly. While this approach increases development complexity, it let DeepSeek fine-tune task distribution, improving the H800's performance and compensating for the bandwidth limitation.
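For a flavor of what dropping below CUDA looks like, here is a toy example using CuPy to compile a kernel whose inner add is issued as a raw PTX instruction. This is purely illustrative; DeepSeek's actual kernels are unpublished and far more elaborate:

```python
import cupy as cp

# CUDA C source compiled at runtime; the asm() line embeds a PTX instruction
# directly instead of leaving instruction selection to the compiler.
kernel_src = r'''
extern "C" __global__
void ptx_add(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int r;
        asm volatile("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a[i]), "r"(b[i]));
        out[i] = r;
    }
}
'''
ptx_add = cp.RawKernel(kernel_src, "ptx_add")

n = 1 << 20
a = cp.arange(n, dtype=cp.int32)
b = cp.arange(n, dtype=cp.int32)
out = cp.empty_like(a)
ptx_add(((n + 255) // 256,), (256,), (a, b, out, cp.int32(n)))
assert bool((out == a + b).all())
```

A single hand-placed instruction buys nothing by itself; the gains described above come from scheduling entire computation and communication pipelines at this granularity, beyond what CUDA's defaults choose automatically.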

DeepSeek demonstrated that even with the H800's limitations, extreme optimization could still maintain high efficiency in AI training. This means that the impact of NVIDIA's restricted version of GPUs might not be as significant as initially anticipated. The market began to reassess the viability of AI development in China and whether there would be a future reliance on NVIDIA's high-end chips. This series of events could be one of the factors contributing to the drop in NVIDIA's stock price.

Of course, there are many reasons for NVIDIA's stock price drop. Aside from DeepSeek's breakthroughs, the market is also concerned that more AI companies might explore alternatives to NVIDIA's ecosystem, such as products from AMD, Intel and domestic chipmakers. DeepSeek's success isn't just a technical breakthrough – it could also be signaling a shift in the AI industry's landscape.

Preview: In the next article, we will explore how China is building globally competitive tech companies, from policy to innovation, and what this means for the future of artificial intelligence.

About the author:

Yang Zhao is in charge of CGTN's science, technology, and environmental coverage. He also founded CGTN's Tech It Out studio, which produces award-winning scientific documentaries, including Human Carbon Footprint, Architectural Intelligence, and Land of Diversity.
