Citation: Z. Deng, W. Ma, Q.-L. Han, W. Zhou, X. Zhu, S. Wen, and Y. Xiang, “Exploring DeepSeek: A survey on advances, applications, challenges and future directions,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 5, pp. 872–893, May 2025. doi: 10.1109/JAS.2025.125498

Exploring DeepSeek: A Survey on Advances, Applications, Challenges and Future Directions

Abstract: The rapid advancement of large models has led to the development of increasingly sophisticated models capable of generating diverse, personalized, and high-quality content. Among these, DeepSeek has emerged as a pivotal open-source initiative, demonstrating high performance at significantly lower computation costs compared to closed-source counterparts. This survey provides a comprehensive overview of the DeepSeek family of models, including DeepSeek-V3 and DeepSeek-R1, covering their core innovations in architecture, system pipeline, algorithm, and infrastructure. We explore their practical applications across various domains, such as healthcare, finance, and education, highlighting their impact on both industry and society. Furthermore, we examine potential security, privacy, and ethical concerns arising from the widespread deployment of these models, emphasizing the need for responsible AI development. Finally, we outline future research directions to enhance the performance, safety, and scalability of DeepSeek models, aiming to foster further advancements in the open-source large model community.

The rapid advancement of large models has significantly accelerated the creation of intelligent systems. Large models are designed to support or substitute human efforts in producing diverse, personalized, and high-quality content more efficiently and cost-effectively, tailored to user needs and requests [1], [2]. Large models cover a wide array of synthetic content, such as text [3], images [4], audio [5], video [6], and interactive 3D elements [7]. The progress in GPU technology and the rise in computational capacity have enabled the training of large-scale deep neural networks with vast numbers of parameters. This allows these models to capture more complex information, leading to improved performance across a wide range of downstream tasks. Since 2022, a diverse range of large models has been created and progressively enhanced by prominent technology companies around the globe, such as the OpenAI model series [8] and the Claude model series [9]. However, their closed-source nature limits transparency regarding model architecture, training data, and optimization strategies. This lack of openness hinders the broader large model community from understanding the key challenges and breakthroughs in model development, leading to high trial-and-error costs and substantial training expenses for other AI companies.

    Open-source AI models [6], [10] play a critical role in democratizing AI technology by lowering barriers to entry and accelerating innovation across diverse sectors. They foster collaborative research and development, enabling the global community to build upon and refine the existing models. Additionally, open-source models enhance transparency and reproducibility, which are essential for ensuring ethical AI deployment and addressing societal concerns related to AI-generated content. Despite these benefits, achieving high performance at a lower cost remains a significant challenge for open-source developers, who often have limited resources compared to large technology companies [11].

In late 2024, DeepSeek introduced two open-source models, DeepSeek-V3 [19] and DeepSeek-R1 [13], which quickly garnered global attention. These models demonstrated performance comparable to that of GPT-4o [14] and GPT-o1 [15], while achieving this at computation costs an order of magnitude lower. This breakthrough not only showcased the potential of cost-efficient AI development but also highlighted the growing competitiveness of open-source alternatives in the landscape of large models. The success of DeepSeek models can be attributed to several key factors, including innovative architectural designs, optimized training pipelines, and strategic use of computational resources. These advancements enable developers to train large-scale models more efficiently, reducing both the financial and environmental costs associated with AI development.

    Moreover, the open-source nature of DeepSeek family models has fostered a collaborative ecosystem, where researchers and developers worldwide can experiment with, improve, and adapt these models for various applications. This collective effort has accelerated the adoption of DeepSeek models across industries, such as healthcare [35], finance [36], [37], and education [38]. However, the widespread deployment of these models also raises critical issues about safety [39], fairness [40], and ethical use [35], necessitating comprehensive assessments of their impact on society.

    This paper aims to present a concise overview of the DeepSeek family of models, exploring their current status, core innovations, practical applications, and safety considerations. By analyzing these aspects, the survey seeks to provide insights into the factors that enable cost-effective AI development, highlight the advantages and limitations of open-source approaches, and identify future research directions. Ultimately, this work aims to support the responsible development and deployment of AI technologies that benefit both industry and society as a whole.

    Our main contributions are summarized as follows:

    ● We provide a comprehensive review of the entire DeepSeek family models, summarizing the core innovations in their development processes, including data processing, training, and infrastructure, and comparing them with traditional counterparts.

    ● We analyze how the existing DeepSeek models are applied to various downstream tasks, highlighting their practical use cases and performance.

    ● We examine the potential security, privacy, and ethical concerns associated with current DeepSeek models, considering both technical and ethical implications.

    ● We outline potential future directions for the development of DeepSeek family models, offering insights to guide further research and applications in the field.

This paper is organized as follows. Section II presents an overview of DeepSeek. Section III describes the core innovations of DeepSeek. Section IV demonstrates the applications of the DeepSeek family models. We discuss security, privacy, and ethical issues in Section V. Lastly, Section VI outlines future directions.

    This section presents a comparison between DeepSeek and its competitors, offering a high-level overview of the DeepSeek model family, their architectural evolution, structured development process, and summarized technical innovations.

DeepSeek has emerged as a major player in the large model landscape, warranting a detailed investigation into its capabilities and innovations. As shown in Table II (data source from [41]), both DeepSeek-V3 and DeepSeek-R1 utilize the DeepSeekMoE architecture [18], featuring 671B parameters and a 128K context length. Mixture-of-Experts (MoE) [49]-[51] is a sparse model architecture that activates only a subset of expert networks per input, enabling large-scale models to achieve high performance with greater computational efficiency. Their API token costs are notably competitive, with DeepSeek-V3 priced at $1.10 per million output tokens and DeepSeek-R1 at $2.19 per million output tokens. These cost structures position them among the most cost-effective options compared to other leading MoE-based proprietary models.

Table II. Comparison of DeepSeek-V3 and DeepSeek-R1 with Other LLMs. Token Price, Coding Index, Math Index, and Intelligence Index are provided by [41]
Model | Parameters | Context Length | Base Model Architecture | Input Token Price | Output Token Price | Coding Index | Math Index | Intelligence Index
----- | ---------- | -------------- | ----------------------- | ----------------- | ------------------ | ------------ | ---------- | ------------------
DeepSeek-V3 | 671B | 128K | DeepSeekMoE | $0.27 | $1.10 | 36 | 57 | 46
DeepSeek-R1 | 671B | 128K | DeepSeekMoE | $0.55 | $2.19 | 49 | 82 | 60
Qwen-2.5 | 72B | 1M | MoE | $1.60 | $6.40 | 35 | 53 | 40
Mixtral 8x22B | 8x22B | 65K | MoE | $2.00 | $6.00 | 17 | 27 | 26
Falcon | 180B | 2048 | Transformer | / | / | / | / | /
Gemma 2 | 27B | 128K | Transformer | $0.80 | $0.80 | 17 | 32 | 38
Llama-3.1 | 405B | 128K | Transformer | $3.50 | $3.50 | 30 | 46 | 40
GPT-4o* | 1.8T | 128K | MoE | $2.50 | $10.00 | 32 | 45 | 41
GPT-o1* | 1.8T | 200K | MoE | $15.00 | $60.00 | 52 | 85 | 62
Claude-3.7* | 250B+ | 128K | MoE | $3.00 | $15.00 | 44 | 54 | 57
Grok-3* | 2.7T | 128K | MoE | $5.00 | $15.00 | / | / | 66

/: the item is not available; *: closed-source model. Token Price represents the official API cost per million tokens (e.g., OpenAI for o1) or the median across providers where a first-party API is not available (e.g., Llama-3.1). Coding Index evaluates coding performance on LiveCodeBench [42] and SciCode [43]; Math Index assesses math performance on AIME [44] and MATH-500 [45]; Intelligence Index is a combined metric covering MMLU-Pro [46], GPQA Diamond [47], Humanity’s Last Exam [48], LiveCodeBench [42], SciCode [43], AIME [44], and MATH-500 [45]. All four indices are provided by [41].

In contrast, models such as GPT-4o [14], GPT-o1 [15], and Claude-3.7 [52] also adopt MoE architectures but have significantly higher API costs, ranging from $10 to $60 per million output tokens. Among them, GPT-o1 is the most expensive, priced at $60 per million output tokens. Qwen-2.5, despite having a smaller parameter size (72B), offers a much longer 1M-token context window, albeit at a higher cost of $6.40 per million tokens. Meanwhile, Llama-3.1, which employs a transformer-based architecture with 405 billion parameters, shows relatively lower performance across key evaluation metrics.

    From a performance perspective, GPT-o1 achieves the highest Coding Index (52) and Math Index (85), suggesting superior reasoning and problem-solving capabilities. Grok-3 [53] achieves the highest Intelligence Index (66). DeepSeek-R1 demonstrates a strong balance across multiple dimensions, achieving the second-highest Coding Index (49) and Intelligence Index (60), while maintaining a significantly lower token price than its competitors.

    These comparisons highlight DeepSeek’s strong positioning within the open-source model ecosystem, where it delivers state-of-the-art performance while maintaining remarkable cost-efficiency and accessibility. Its open availability, combined with leading scores in coding and intelligence, makes it a standout choice for developers and researchers seeking high-performing models without the high costs or restrictions of proprietary systems.

At the same time, DeepSeek also holds its ground among top proprietary models, outperforming or closely matching the capabilities of well-known closed-source systems such as GPT-o1, GPT-4o, and Claude-3.7, often at a fraction of their cost. Given this rare balance of openness, efficiency, and raw capability, it is essential to further explore the methodologies behind DeepSeek’s architecture, its practical applications, and its potential influence on the future of the AI landscape. This survey aims to provide a comprehensive analysis of DeepSeek in that broader context.

As shown in Table I, the DeepSeek family models constitute an extensive suite designed for a range of tasks, including language comprehension, code generation, mathematical reasoning, and multimodal applications. This encompasses several series: DeepSeek, which targets language understanding; DeepSeek-Coder, dedicated to coding; DeepSeek-Math/DeepSeek-Prover, designed for mathematical reasoning; DeepSeek-VL, intended for vision understanding; and Janus, which supports both vision understanding and visual generation.

Table I. Overview of the DeepSeek Model Family and Their Capabilities

Model Series | Model Name | Release Date | Architecture Source | Main Data Source | Model Parameters
------------ | ---------- | ------------ | ------------------- | ---------------- | ----------------
DeepSeek | DeepSeek-LLM [16] | Jan. 2024 | LLaMA [17] | Diverse text data | 7B/67B
DeepSeek | DeepSeek-V2 [12] | May 2024 | DeepSeekMoE [18] | Internet data | 16B/236B
DeepSeek | DeepSeek-V3 [19] | Dec. 2024 | DeepSeekMoE [18] | Math/coding/multilingual data | 671B
DeepSeek | DeepSeek-R1 [13] | Jan. 2025 | DeepSeek-V3 [19] | Long CoT data/distillation | 671B
DeepSeek-Math | DeepSeek-Math [20] | Feb. 2024 | DeepSeek-Coder-V1 [21] | DeepSeekMath Corpus [20] | 7B
DeepSeek-Coder | DeepSeek-Coder-V1 [21] | Jan. 2024 | LLaMA [17] | GitHub | 1.3B/6.7B/33B
DeepSeek-Coder | DeepSeek-Coder-V2 [22] | Jun. 2024 | DeepSeekMoE [18] | Common Crawl [23] | 16B/236B
DeepSeek-Prover | DeepSeek-Prover-V1 [24] | May 2024 | DeepSeek-Math [20] | Synthetic Lean 4 proof data | 7B
DeepSeek-Prover | DeepSeek-Prover-V1.5 [25] | Aug. 2024 | DeepSeek-Prover-V1 [24] | Synthetic Lean 4 proof data | 7B
DeepSeek-VL | DeepSeek-VL [26] | Mar. 2024 | DeepSeek-LLM [16] + hybrid vision encoder [27], [28] | Multi-source text-image data | 1.5B/7B
DeepSeek-VL | DeepSeek-VL2 [29] | Dec. 2024 | DeepSeekMoE [18] + hybrid vision encoder [27], [28] | Multi-stage text-image data | 3B/16B/27B
Janus | Janus [30] | Oct. 2024 | LLaMA + independent vision encoders [31], [32] | Image-text paired data | 1.5B
Janus | JanusFlow [33] | Nov. 2024 | LLaMA + independent vision encoders [31], [32] | Image-text paired data | 1.5B
Janus | JanusPro [34] | Jan. 2025 | LLaMA + independent vision encoders [31], [32] | Large-scale data | 1.5B/7B

Capability indicators in the original table (○: not considered; ◑: partially considered; ●: considered) mark each model’s coverage of language understanding, reasoning, visual understanding, visual generation, math, and coding.

The architectures of the DeepSeek family models can be broadly categorized into two main structures. The first is a decoder-only architecture, represented by Llama [17], which remains the foundational structure for the Janus models [30], [33], [34] and earlier DeepSeek series. The second is the DeepSeekMoE architecture [18], an improved version of Google’s Mixture-of-Experts (MoE) framework, which has been adopted starting from DeepSeek-V2 [12]. We observe that the DeepSeek family models tend to transition towards DeepSeekMoE in their later releases. This shift to DeepSeekMoE results in notable improvements in performance and efficiency, especially for large-scale and complex tasks [18]. For multimodal models, although Janus continues to build upon the LLaMA foundation, DeepSeek-VL2 [29] integrates DeepSeekMoE with specialized visual encoders, enabling advanced cross-modal reasoning and enhancing text-vision integration capabilities.

    The data sources of the DeepSeek family models are highly diverse and tailored to each series’ specialized tasks. These include domain-specific corpora [20]-[22], synthetic datasets [13], [24], [25], large-scale vision-language datasets [26], [29], [30], [33], [34], and curated mathematical reasoning datasets [20].

    As shown in Fig. 1, the DeepSeek family models showcase a systematic evolution of multimodal and large language models through continuous improvements in data diversity, algorithmic advancements, and architectural innovations.

Figure 1. The evolution overview of the DeepSeek family models. Markers in the figure distinguish data-related improvements on the training datasets from algorithm/architecture-related improvements on the training processes.

    As illustrated in Fig. 2, the DeepSeek workflow consists of three primary stages: Data Processing, Training, and Inference.

    Figure  2.  The Overview of DeepSeek Workflow. It consists of three stages: data processing, training, and inference.

    In the Data Processing stage, data from diverse sources—including text, images, and structured datasets—is collected and transformed into a unified format suitable for downstream tasks. Unlike conventional LLMs, DeepSeek-R1 [13] employs advanced data distillation techniques to extract high-quality knowledge from more powerful models (e.g., Qwen [54]). This approach enhances dataset richness and quality, improving the training process by emphasizing the most informative patterns, thereby boosting both efficiency and effectiveness in downstream applications.

    To ensure a smooth and logical progression, the output from the data processing stage serves as the input to the training stage, and similarly, the trained models are used in the final inference phase.

    The Training stage is divided into three key phases:

    Pre-training. DeepSeek family models undergo large-scale training on a diverse corpus to learn fundamental patterns and representations through next-token prediction tasks.

    Supervised Fine-tuning. The models are further refined using domain-specific data, enabling them to adapt to specialized tasks with improved accuracy and relevance.

    Reinforced Fine-tuning. Reinforcement learning techniques, such as reward-based optimization, help align the model’s outputs with desired behaviors, enhancing overall performance and user satisfaction.

    Finally, in the Inference stage, trained DeepSeek models interact with users, generating responses tailored to queries.

We summarize the core innovations of the DeepSeek family models and divide them into three parts: model architecture & system pipeline, algorithm, and infrastructure, as illustrated in Fig. 3.

    Figure  3.  The Summarization of DeepSeek Innovation.

Model Architecture & System Pipeline primarily aims to optimize the structural design and computational efficiency of the DeepSeek family of models, as detailed in Section III-A. The model architecture innovations are summarized as follows.

    Language Model Architecture outlines the design and optimization of decoder-only models, highlighting DeepSeekMoE and Multi-head Latent Attention (MLA).

    Multimodal Model Architecture integrates specialized vision encoders, with DeepSeek-VL [26] using a hybrid encoder for enhanced understanding and Janus [30] employing dual encoders for both comprehension and generation.

    The system pipeline innovations are summarized as follows.

    Data Collection Pipeline designs efficient data collection and storage pipelines, ensuring seamless acquisition and processing of multimodal and multilingual data to expand the model’s knowledge coverage.

CoT & Distillation applies Chain-of-Thought (CoT) reasoning to improve the model’s ability to handle complex tasks, while data distillation techniques efficiently transfer large-model knowledge into smaller models with similar capabilities.

Algorithm focuses on improving model training efficiency, generalization ability, and task adaptability, covering various aspects from data processing to optimization strategies, as detailed in Section III-B.

    Reinforcement Learning leverages reinforcement learning to optimize model behavior, particularly in applications like dialogue generation and code generation, guiding the model to produce more desirable outputs through reward signals.

    Training Loss designs task-specific and data-driven loss functions to enhance model training efficiency and performance.

    Multi-token Prediction extends traditional next-token prediction by requiring the model to predict multiple future tokens simultaneously, enhancing training efficiency and long-term coherence in generated sequences.

Fill-in-the-Middle applies a structured data transformation that trains the model to reconstruct missing middle spans, supporting the construction of high-quality large-scale pretraining datasets.

Infrastructure primarily aims to ensure the hardware effectiveness and efficiency of the DeepSeek family of models. Specifically, it includes Network Co-design, HaiScale, HFReduce, the 3FS Distributed File System, and the HAI Platform, as detailed in Section III-C.

    Network Co-design. Large-scale model training typically requires hundreds to thousands of GPUs [55], and the distribution and configuration of these GPUs significantly impact training efficiency and data transmission performance. DeepSeek adopts an innovative network co-design approach that integrates both hardware and software requirements, optimizing communication bandwidth, reducing latency, and improving parallel scalability for large language models.

HFReduce. In large-scale model training, allreduce is a critical communication operation used to aggregate gradients across multiple GPUs during distributed training [56]. To reduce the resulting communication overhead, DeepSeek developed HFReduce to optimize the aggregation process.

HaiScale. HaiScale further optimizes parallelization methods (e.g., Data Parallelism [57], Pipeline Parallelism [58], Tensor Parallelism [59], and Expert Parallelism [60]).

    3FS Distributed File System. The 3FS Distributed File System mitigates I/O bottlenecks in large-scale AI tasks, leveraging optimized communication and network tuning to minimize congestion in both storage and computation processes.

HAI Platform. HAI Platform provides task scheduling, fault tolerance, and disaster recovery, improving resource utilization and lowering costs. It serves as an out-of-the-box solution tailored for deep learning researchers.

    In this section, we delve into the core technical innovations behind the DeepSeek family of models, highlighting advancements in model architecture & system pipeline (Section III-A), training algorithm (Section III-B), and infrastructure (Section III-C).

Language Model Architecture. The architectural improvements of the DeepSeek family models are mainly reflected in two aspects: DeepSeek Mixture-of-Experts (DeepSeekMoE) and Multi-Head Latent Attention (MLA).

    DeepSeekMoE [18] builds upon conventional Mixture of Experts (MoE) architectures [49]-[51], which are widely used in transformer-based language models. As illustrated on the left side of Fig. 5, a standard transformer language model consists of L stacked layers of transformer blocks. In MoE-based architectures, the Feed-Forward Network (FFN) within each block is replaced with an MoE module, where multiple experts process different parts of the input. A router dynamically selects a subset of experts for each token during both training and inference, activating only a fraction of the total model parameters in a forward pass. This dynamic routing mechanism enhances parameter efficiency while maintaining computational efficiency.

    Figure  5.  Overview of Model Architecture Improvement on Language Models. RoPE stands for Rotary Position Embedding [62].
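As a minimal sketch of this token-wise routing mechanism (not DeepSeek's implementation), the following PyTorch module scores all experts per token and runs only the top-k of them; the dimensions and expert count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); the router scores every expert per token.
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = probs.topk(self.k, dim=-1)       # keep k experts/token
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for j in range(self.k):                         # dispatch per route
            for e in idx[:, j].unique().tolist():
                mask = idx[:, j] == e                   # tokens routed to e
                out[mask] += weights[mask, j:j+1] * self.experts[e](x[mask])
        return out

moe = TopKMoE(d_model=64, n_experts=8, k=2)
y = moe(torch.randn(16, 64))  # 16 tokens; only 2 of 8 experts run per token
```

Because only k of the n_experts FFNs execute per token, the number of parameters touched in a forward pass stays roughly constant as more experts are added.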

    However, conventional MoE architectures face two key challenges. First, individual experts tend to specialize in vastly different types of knowledge, leading to fragmented learning. Second, common knowledge is often redundantly learned across multiple experts, resulting in inefficiencies.

    To address these issues, DeepSeekMoE introduces two novel mechanisms: Fine-Grained Expert Segmentation and Shared Expert Isolation. Fine-Grained Expert Segmentation further divides each expert into smaller, more specialized sub-experts, each focusing on a specific domain of knowledge. This segmentation enhances performance on specialized tasks while mitigating knowledge conflicts among experts. Shared Expert Isolation tackles the redundancy problem by introducing dedicated shared experts that focus exclusively on learning and storing common knowledge. This allows domain-specific experts to concentrate on their respective domains without unnecessary duplication of shared information. By integrating these two mechanisms, DeepSeekMoE achieves more efficient expert routing and knowledge organization. As a result, DeepSeekMoE-2B outperforms GShard-3B (a conventional MoE model) while utilizing only 30% of its computational cost [18].

    In summary, the motivation for DeepSeekMoE lies in optimizing expert efficiency, reducing redundancy, and achieving better performance with fewer resources. Its architectural refinements reflect a thoughtful response to the known limitations of prior MoE frameworks, making it well-suited for building large-scale, cost-effective, and high-performing language models.

Multi-Head Latent Attention (MLA) introduces an innovative approach to attention mechanisms by leveraging low-rank key-value joint compression, significantly reducing computational and memory overhead. Traditional Multi-Head Attention (MHA) mechanisms [61] operate independently over full-rank key-value representations, requiring the caching of $2 n_h d_h l$ elements per token during inference, where $n_h$ is the number of attention heads, $d_h$ is the dimension per head, and $l$ represents the number of layers. MLA mitigates this bottleneck by compressing key and value representations into a compact latent space, reducing the KV cache to $(d_c + d_h^R) l$ elements per token (equal to $\frac{9}{2} d_h l$ under DeepSeek-V2’s settings), where $d_c$ is the KV compression dimension and $d_h^R$ represents the per-head dimension of the decoupled queries and key in MLA.

    The core mechanism of MLA (Fig. 5) is low-rank joint compression for keys and values, which reduces the KV cache size. Specifically, the compression process is formulated as:

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t \tag{1}$$
$$\mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV} \tag{2}$$
$$\mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV} \tag{3}$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ represents the compressed latent vector for keys and values, with $d_c \ll d_h n_h$ as the KV compression dimension. The down-projection matrix $W^{DKV} \in \mathbb{R}^{d_c \times d}$ reduces dimensionality, while $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ restore the keys and values. During inference, MLA caches only $\mathbf{c}_t^{KV}$, requiring just $d_c$ storage elements per token per layer. Additionally, since $W^{UK}$ and $W^{UV}$ can be absorbed into $W^{Q}$ and $W^{O}$, respectively, explicit computation of keys and values becomes unnecessary.

    To further save activation memory during training, MLA also compresses queries:

$$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t \tag{4}$$
$$\mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q} \tag{5}$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed query representation with $d_c' \ll d_h n_h$ as the query compression dimension. The matrices $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ handle down-projection and up-projection, respectively. Although this query compression does not reduce the KV cache, it helps lower memory consumption during training.

    Furthermore, MLA employs a Rotary Position Embedding (RoPE) strategy [62] to preserve rotational invariance, which enhances the model’s ability to generalize across different positions. However, RoPE is inherently position-sensitive for both keys and queries, making it incompatible with low-rank KV compression. To overcome this limitation, a decoupled RoPE strategy is introduced. This approach separates the positional encoding process by utilizing additional multi-head queries and a shared key dedicated to carrying RoPE information. By decoupling RoPE from the compressed KV representations, the strategy allows for more efficient inference while maintaining positional awareness.
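To make the cache savings concrete, here is a schematic sketch of Eqs. (1)-(3) with toy dimensions; the specific sizes are illustrative assumptions, not DeepSeek's configuration.

```python
import torch
import torch.nn as nn

d, n_h, d_h, d_c = 1024, 16, 64, 128           # d_c << n_h * d_h

W_DKV = nn.Linear(d, d_c, bias=False)          # down-projection
W_UK  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection for keys
W_UV  = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection for values

h_t  = torch.randn(1, d)                       # hidden state of one token
c_kv = W_DKV(h_t)                              # Eq. (1): cached latent, size d_c
k_c  = W_UK(c_kv)                              # Eq. (2): reconstructed keys
v_c  = W_UV(c_kv)                              # Eq. (3): reconstructed values

# Per-token, per-layer cache: d_c = 128 floats with MLA versus
# 2 * n_h * d_h = 2048 floats for full MHA keys plus values.
print(c_kv.numel(), 2 * n_h * d_h)
```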

    Multimodal Model Architecture. As mentioned in Table I, the multimodal models of DeepSeek are divided into two series: DeepSeek-VL [26], [29] and Janus [30], [33], [34]. Their base architecture is built upon the DeepSeek base language models, with different types of vision encoders integrated into the architecture (See Fig. 4). DeepSeek-VL series models are primarily designed for visual understanding tasks, whereas Janus series models support both visual understanding and visual generation.

    Figure  4.  Overview of Model Architecture Improvement on Multimodal Models. This figure compares conventional models with the DeepSeek family, highlighting the architectural enhancements in vision encoders and model integration.

    DeepSeek-VL differs from conventional vision-language models by incorporating a hybrid vision encoder, which includes SAM-B [27] and SigLIP-L [28]. Unlike conventional models that typically rely on a single vision encoder with an adaptor to align vision and language features, DeepSeek-VL enhances feature extraction and representation learning by integrating multiple specialized encoders. This hybrid approach enables better generalization and performance on various vision-language understanding tasks.

On the other hand, DeepSeek Janus expands beyond conventional vision understanding and generation models [63], [64] by introducing distinct vision encoders for different tasks: an understanding vision encoder and a generation vision encoder. This dual-encoder structure, combined with DeepSeek-LLM, allows Janus to excel in both visual comprehension and image generation. In contrast, traditional models often rely on a shared visual encoder for both tasks, which can limit their effectiveness in handling diverse multimodal challenges.

In the following, we present the technical innovations of the DeepSeek family models and compare them with conventional open-source counterparts across three workflow stages: data processing, training/inference, and infrastructure.

    The data collection pipeline. As illustrated in Fig. 6, the pipeline highlights key differences compared to conventional domain-specific data collection methods. Unlike traditional approaches, the DeepSeek pipeline employs an iterative process to systematically assemble a large-scale domain corpus from diverse data sources [20]. The process begins with a seed corpus representing the target domain, from which an initial domain classifier is trained, albeit with limited domain diversity. This initial classifier is then used to collect the first subset of domain-specific data. Following the initial iteration of data collection, the pipeline continues by identifying additional domain-relevant data sources to iteratively refine and retrain the domain classifier. Additionally, the DeepSeek pipeline offers significant advantages in domain-specific data collection. It excels in discovering diverse data sources, enabling the system to expand the range of domain-relevant data beyond predefined static repositories. Furthermore, the iterative refinement process gradually enhances data quality by retraining the domain classifier after each collection cycle, ensuring higher relevance and purity of the dataset. These features collectively make DeepSeek highly efficient and versatile for large-scale, high-quality data collection in various domain-specific applications.

    Figure  6.  The data collection pipeline of DeepSeek Family Models.
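The loop structure of this pipeline can be sketched as follows; all callables (classifier training, source discovery) and the acceptance threshold are hypothetical placeholders rather than DeepSeek's actual tooling.

```python
def collect_domain_corpus(seed_corpus, candidate_pool, train_classifier,
                          discover_new_sources, n_rounds=4, threshold=0.5):
    """Iteratively grow a domain corpus from a seed, as in Fig. 6."""
    corpus = list(seed_corpus)
    for _ in range(n_rounds):
        clf = train_classifier(positives=corpus)          # retrain each round
        kept = [doc for doc in candidate_pool
                if clf.predict_proba(doc) >= threshold]   # harvest domain docs
        corpus.extend(kept)                               # grow the corpus ...
        candidate_pool = discover_new_sources(kept)       # ... and the sources
    return corpus
```

The key property is that both the classifier and the pool of candidate sources improve each round, which is what lets the pipeline move beyond a predefined static repository.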

Chain-of-Thought (CoT) dataset & distillation. Reasoning data plays a critical role in enhancing model utility. To obtain high-quality reasoning data, DeepSeek employs two primary approaches (as illustrated in Fig. 7): (1) the curation of Chain-of-Thought (CoT) data across multiple domains and varying levels of complexity, and (2) the distillation of outputs from other advanced models (e.g., Qwen [54]). The current CoT data curation covers three formats: CoT via text reasoning [65], CoT via coding reasoning [66], and CoT via tool-integrated reasoning [67]. These reasoning formats are preserved throughout the CoT data curation process. Additionally, to acquire extended CoT datasets, DeepSeek leverages distillation from high-performing models, such as DeepSeek-V3 and distilled versions of DeepSeek-R1. Notably, it has been observed that distillation significantly accelerates improvements in model performance.

Figure 7. Overview of Chain-of-Thought (CoT) data curation and distillation.

Takeaway 1: DeepSeek achieves enhanced model performance and efficiency through key architectural innovations and optimized data strategies. These include Fine-Grained Expert Segmentation and Shared Expert Isolation in the DeepSeekMoE framework, MLA for reduced computational overhead, and hybrid and dual-encoder designs in multimodal models. Additionally, DeepSeek ensures efficient data expansion and quality enhancement through iterative domain-specific data collection and Chain-of-Thought (CoT) data distillation, enabling superior performance in complex reasoning tasks.

    Reinforcement Learning on DeepSeek. Reinforcement learning plays a crucial role in fine-tuning large language/multimodal models, aiming to optimize the quality of generated outputs. Compared to standard supervised fine-tuning (SFT), reinforcement learning exhibits notable advantages, particularly in terms of generalization. Recent research [68] demonstrates that RL-based methods significantly outperform traditional SFT when handling out-of-distribution data. Specifically, as shown in Fig. 1 of [68], RL methods consistently enhance performance on out-of-distribution tasks with increasing training computation, whereas traditional SFT approaches experience performance degradation under similar conditions. This indicates that reinforcement learning not only improves the quality of generated outputs within the training distribution but also substantially strengthens the model’s robustness and generalization capabilities when encountering unseen or diverse scenarios.

    Existing reinforcement fine-tuning approaches primarily include Proximal Policy Optimization (PPO) [69] and the recently proposed Group Relative Policy Optimization (GRPO) introduced by DeepSeek [20]. As shown in Fig. 8, PPO [69] has been widely used for this purpose. PPO is a policy gradient method that ensures stable and efficient training by constraining the magnitude of policy updates. In this process, the policy model receives a query and generates a response, which is then evaluated by the reward model to assign a reward signal. The value model estimates the value of the generated response, which is used to compute the advantage function, helping the model assess how much better or worse an action is compared to the average. Additionally, a reference model serves as a fixed baseline to measure the divergence between the current policy and a stable reference, ensuring that policy updates are not excessively large. The core of PPO lies in utilizing Generalized Advantage Estimation (GAE) [70] to reduce variance, coupled with an actor-critic framework that balances policy improvement and value function optimization.

    Figure  8.  Overview of Reinforced Fine-tuning. Unlike PPO, GRPO does not require a Value Model to compute advantages.

    In contrast, DeepSeek employs Group Relative Policy Optimization (GRPO), a variant of PPO designed to leverage comparative feedback from multiple candidate responses generated for the same query. Unlike PPO, which focuses on the absolute quality of a single response, GRPO introduces a group module that ranks multiple responses within a query. The reward model assigns scores to each response, and the policy model is updated based on their relative quality rather than absolute scores. By comparing responses within the same group, GRPO provides a more informative reward signal, enabling the model to learn nuanced preferences and generate outputs that better align with the desired quality. The reference model remains crucial in this process, acting as a baseline to prevent excessive divergence from the original policy. Overall, GRPO enhances the reinforcement learning process by incorporating relative comparisons, leading to more effective fine-tuning of the language model.

One of the key advantages of GRPO over traditional PPO is that it reduces computational costs while maintaining competitive performance. In PPO, the value model is a trainable component that needs to be continuously updated during reinforcement fine-tuning. This process can be computationally expensive, as it requires additional forward and backward passes to estimate the value function accurately. Moreover, training the value model can sometimes introduce instability, especially if the value estimates are biased or have high variance. GRPO eliminates the need for a trainable value model during the fine-tuning process. By leveraging relative comparisons among multiple candidate responses within the same group, GRPO directly guides the policy model based on the relative quality of different outputs. This comparative approach eliminates the dependency on value estimation, reducing both the memory footprint and the computational overhead associated with training a separate value model. Despite this simplification, GRPO maintains performance levels comparable to PPO because the relative ranking within each group provides a stable and informative learning signal. This not only accelerates the training process but also enhances efficiency, making GRPO a more scalable and cost-effective solution for large-scale reinforcement fine-tuning of language models like DeepSeek.
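To make the group-relative mechanism concrete, here is a minimal sketch of how GRPO-style advantages can be computed from group rewards; the reward values and normalization are illustrative, not DeepSeek's exact implementation.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8):
    # group_rewards: (G,) reward-model scores for G responses to one query.
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)   # relative quality in-group

rewards = torch.tensor([0.2, 0.9, 0.5, 0.1])      # illustrative scores
adv = grpo_advantages(rewards)                    # positive => above average
# Each response's tokens are then reinforced with a PPO-style clipped
# objective weighted by its advantage, plus a KL term toward the reference
# model; no value model is involved anywhere in this computation.
```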

To further enhance reinforced training in DeepSeek, the reward model in GRPO supports three primary modes: rule-based rewards, model-based rewards, and tool-based rewards. Each mode serves a distinct purpose in shaping the policy updates and optimizing the generated responses; a schematic sketch follows the three modes below.

    1) Rule-Based Rewards: Rule-based rewards are typically employed in the early stages of model development. A notable example is DeepSeek-R1-Zero [71], which incorporates a rule-based reward mechanism to guide initial model alignment. In this approach, predefined heuristics or logical correctness checks determine the reward signals.

    2) Model-Based Rewards: Model-based rewards leverage a separate language model to evaluate the quality of generated responses. This method is exemplified by DeepSeek-V3-SFT [12], which employs a fine-tuned model as a reward function. The reward model assesses responses based on linguistic quality, coherence, informativeness, and relevance to the given prompt. By using a supervised fine-tuned variant as the reward function, this approach captures nuanced preferences that go beyond predefined rules, allowing for a more dynamic and adaptive learning signal. Compared to rule-based rewards, model-based rewards provide greater flexibility and scalability, enabling reinforcement fine-tuning to align the policy model with human-like response quality.

3) Tool-Based Rewards: Tool-based rewards utilize external tools or verifiers to provide structured feedback. An example of this approach is DeepSeek-Prover-V1.5 [25], which incorporates proof assistant feedback as a reward signal. This is particularly useful in domains such as mathematical reasoning and formal verification, where correctness must be rigorously validated. The proof assistant evaluates the logical soundness of generated proofs and assigns rewards accordingly. By integrating domain-specific validation tools, DeepSeek ensures that its model not only generates fluent and coherent responses but also adheres to strict correctness constraints in specialized tasks.
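The following schematic, hypothetical sketch contrasts the three modes; none of these functions are DeepSeek's actual reward implementations, and the scoring interfaces (reward_model.score, proof_assistant.verify) are assumed placeholders.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    # Predefined heuristics: a format check (answer must appear inside
    # \boxed{...}) plus a correctness check against the reference.
    match = re.search(r"\\boxed\{(.+?)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer else 0.1

def model_based_reward(prompt: str, response: str, reward_model) -> float:
    # A fine-tuned LLM scores coherence, informativeness, and relevance.
    return reward_model.score(prompt, response)

def tool_based_reward(proof: str, proof_assistant) -> float:
    # An external verifier (e.g., a Lean proof checker) validates correctness.
    return 1.0 if proof_assistant.verify(proof) else 0.0
```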

Training Loss on DeepSeekMoE. In Mixture-of-Experts models, load imbalance among experts can lead to routing collapse [72] and reduced computational efficiency. Traditional methods typically use auxiliary losses [50], [51] to encourage balanced expert usage. These auxiliary losses penalize load imbalance across experts during training, promoting more uniform utilization. DeepSeek-V2 [12] also employs similar auxiliary losses for load balancing:

$$\mathcal{L}_{\text{total}} = \underbrace{\alpha_1 \sum_{i=1}^{N} f_i P_i}_{\text{Expert-Level Balance Loss}} + \underbrace{\alpha_2 \sum_{j=1}^{|E|} f_j P_j}_{\text{Device-Level Balance Loss}} + \underbrace{\alpha_3 \sum_{p=1}^{D} f_p P_p}_{\text{Communication Balance Loss}} \tag{6}$$

where $N$ represents the number of experts, $f_i$ denotes the fraction of tokens selecting expert $i$, and $P_i$ is the average activation probability of expert $i$. For the Device-Level Balance Loss, $|E|$ is the number of devices, with each device assigned a group of experts; $f_j$ represents the average fraction of tokens processed by all experts on device $j$, and $P_j$ is the sum of their activation probabilities. In the Communication Balance Loss, $D$ denotes the number of devices, where $f_p$ is the fraction of tokens sent to device $p$, and $P_p$ is the total activation probability of all experts assigned to that device.

    However, excessively large auxiliary losses may negatively impact model performance by interfering with the primary training objective [73]. This trade-off between load balancing and performance is a key limitation of traditional MoE approaches.
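For illustration, here is a minimal sketch of the expert-level term of Eq. (6); the token counting, shapes, and $\alpha_1$ value are assumptions, and the device-level and communication terms follow the same pattern over device groupings.

```python
import torch

def expert_balance_loss(router_probs: torch.Tensor, topk_idx: torch.Tensor,
                        alpha: float = 0.003) -> torch.Tensor:
    # router_probs: (tokens, N) softmax scores; topk_idx: (tokens, k) routes.
    n_tokens, n_experts = router_probs.shape
    counts = torch.zeros(n_experts)
    counts.scatter_add_(0, topk_idx.flatten(),
                        torch.ones(topk_idx.numel()))   # tokens per expert
    f = counts * n_experts / topk_idx.numel()           # normalized fraction f_i
    P = router_probs.mean(dim=0)                        # mean activation P_i
    return alpha * (f * P).sum()                        # minimized when uniform

probs = torch.softmax(torch.randn(32, 8), dim=-1)       # 32 tokens, 8 experts
loss = expert_balance_loss(probs, probs.topk(2, dim=-1).indices)
```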

In contrast, DeepSeekMoE from DeepSeek-V3 eliminates the need for auxiliary losses by introducing a bias term $b_i$ for each expert, which adjusts dynamically during training to maintain balanced expert usage. Routing decisions are determined using the modified score $g_{i,t} = s_{i,t} + b_i$, where $s_{i,t}$ is the original affinity score. The bias term is updated after each training step: it is decreased by $\gamma$ if the expert is overloaded and increased by $\gamma$ if the expert is underutilized. Here, $\gamma$ is a hyperparameter that controls the rate of bias adjustment. Unlike traditional methods, this auxiliary-loss-free strategy ensures load balancing without compromising model performance, as the routing mechanism self-regulates through dynamic bias updates.

To prevent extreme imbalance within a single sequence, DeepSeek-V3 also incorporates a lightweight sequence-wise auxiliary loss defined as $\mathcal{L}_{\text{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i$, where $f_i$ represents the expert usage frequency within the sequence, and $P_i$ is the average normalized score of the expert across the sequence. The balance factor $\alpha$ is set to a very small value, ensuring that this auxiliary loss has minimal impact on the model’s optimization. While traditional MoE models rely heavily on auxiliary losses to balance load across batches, DeepSeek-V3 achieves this balance primarily through dynamic bias adjustment, with the sequence-wise auxiliary loss serving only as a complementary measure to prevent localized imbalances. This approach reduces the trade-off between load balancing and performance, resulting in more efficient and effective training.
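A minimal sketch of the bias-update rule described above, assuming per-step expert token counts are available; the expert count, $\gamma$ value, and load statistics are illustrative.

```python
import torch

def update_bias(bias: torch.Tensor, tokens_per_expert: torch.Tensor,
                gamma: float = 1e-3) -> torch.Tensor:
    # After each step: nudge b_i down for overloaded experts, up otherwise.
    # Gating uses g = s + b; gradients still flow only through s.
    target = tokens_per_expert.float().mean()        # balanced load level
    overloaded = tokens_per_expert.float() > target
    bias[overloaded] -= gamma                        # discourage routing here
    bias[~overloaded] += gamma                       # encourage routing here
    return bias

bias = torch.zeros(8)                                # 8 experts
counts = torch.tensor([900, 120, 80, 95, 110, 100, 90, 105])
bias = update_bias(bias, counts)   # expert 0 (overloaded) gets its bias lowered
```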

Multi-token Prediction. Multi-token prediction (MTP) [19] differs from the traditional next-token prediction (NTP) objective by requiring the model to predict multiple future tokens simultaneously, rather than just the immediate next token. Formally, given a sequence of tokens $(x_1, x_2, \ldots, x_T)$, the NTP objective aims to maximize the likelihood of the next token $x_{t+1}$ given the context $(x_1, \ldots, x_t)$, using a cross-entropy loss defined as:

$$\mathcal{L}_{\text{NTP}} = \text{CE}(P^1_{t+1}, x_{t+1}) = -\frac{1}{T} \sum_{t=1}^{T-1} \log P^1_t[x_{t+1}] \tag{7}$$

where CE is the cross-entropy loss, $T$ denotes the input sequence length, $x_{t+1}$ is the ground-truth token at position $t+1$, and $P^1_t[x_{t+1}]$ is the probability assigned to the correct token by the next-token prediction module.

In contrast, the MTP objective extends this by predicting a sequence of future tokens $(x_{t+1}, \ldots, x_{t+k})$ for a predefined horizon $k$. The cross-entropy loss for the $k$-th prediction depth is calculated as:

$$\mathcal{L}^k_{\text{MTP}} = \text{CE}(P^k_{2+k:T+1}, x_{2+k:T+1}) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P^k_i[x_i] \tag{8}$$

where $x_i$ denotes the ground-truth token at position $i$, and $P^k_i[x_i]$ represents the prediction probability of the correct token at depth $k$. To obtain the overall MTP loss, we compute the average loss across all prediction depths and scale it by a weighting factor $\lambda$:

$$\mathcal{L}_{\text{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}^k_{\text{MTP}} \tag{9}$$

where $D$ represents the number of prediction depths. The multi-token objective provides two main advantages. First, it densifies the training signal by enabling the model to learn from multiple predictions at each step, thus improving data efficiency. Second, MTP encourages the model to plan its internal representations more strategically, anticipating longer-term dependencies, which can enhance the coherence and accuracy of generated sequences. Consequently, MTP aligns better with real-world generation tasks where maintaining consistency over longer contexts is crucial.
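As a concrete illustration, here is a compact sketch of Eqs. (8) and (9) with hypothetical per-depth logits; the tensor shapes, vocabulary size, and $\lambda$ value are illustrative assumptions rather than DeepSeek-V3's actual module.

```python
import torch
import torch.nn.functional as F

def mtp_loss(depth_logits, tokens, lam=0.3):
    # depth_logits: list of D tensors; depth k (1-based) has shape
    # (T - k, vocab) and predicts the token k positions ahead.
    losses = []
    for k, logits in enumerate(depth_logits, start=1):
        targets = tokens[k:]                        # ground truth at depth k
        losses.append(F.cross_entropy(logits, targets))
    return lam * torch.stack(losses).mean()         # Eq. (9): average, scale

T, vocab, D = 12, 50, 3
tokens = torch.randint(vocab, (T,))
logits = [torch.randn(T - k, vocab) for k in range(1, D + 1)]
loss = mtp_loss(logits, tokens)  # one scalar combining all D depth losses
```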

    Fill-in-the-Middle. Fill-in-the-Middle (FIM) [74], proposed by OpenAI, is a data transformation technique employed during the pretraining process. In FIM, an input instance is segmented into three parts: prefix, middle, and suffix before tokenization. Each part is then encoded (Enc), with sentinel tokens prepended at the beginning of each section. These sentinel tokens are denoted as <PRE>, <MID>, and <SUF>. The FIM instance can be formulated as follows:

<PRE>; Enc(prefix); <SUF>; Enc(suffix); <MID>; Enc(middle).

    Compared to Fill-in-Blank (FIB) [75], which masks a segment randomly within the input, FIM explicitly conditions the model on both the preceding and succeeding context, thereby enhancing its ability to reason about intermediate content. The FIB approach can be formulated as follows:

Enc(context_left); <MASK>; Enc(context_right)

    where a single sentinel token (e.g., <MASK>) is used to indicate missing content, and the model is tasked with predicting the masked segment based solely on the surrounding context. While FIB enables inpainting, its reliance on a single generic mask results in less structured conditioning compared to FIM.

    Starting from DeepSeek-Coder-V1 [21], the DeepSeek family models adopted FIM for pretraining, recognizing its advantages in structured text completion and code generation tasks. By leveraging explicit bidirectional conditioning, FIM improves the model’s ability to reconstruct missing code blocks, generate function bodies from surrounding context, and synthesize coherent structured text. Moreover, its effectiveness in enhancing reasoning and inpainting capabilities makes it a crucial component in modern generative AI applications.
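The transformation itself is simple to express in code. Below is a minimal sketch of a PSM-style FIM transform applied at the document level before tokenization; the random split strategy, the FIM rate, and the sentinel strings written as literal text are assumptions for illustration.

```python
import random

def fim_transform(doc: str, fim_rate: float = 0.9) -> str:
    """Rewrite a document into <PRE>prefix<SUF>suffix<MID>middle form."""
    if len(doc) < 2 or random.random() > fim_rate:
        return doc                                    # keep some plain NTP data
    i, j = sorted(random.sample(range(len(doc)), 2))  # random split points
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Matches the formulation above:
    # <PRE>; Enc(prefix); <SUF>; Enc(suffix); <MID>; Enc(middle)
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

print(fim_transform("def add(a, b):\n    return a + b\n"))
```

At training time the model sees both surrounding contexts before generating the middle span, which is exactly the bidirectional conditioning the text credits for FIM's strength on code infilling.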

    Takeaway 2: DeepSeek advances generative AI capabilities through key innovations in training algorithm. Group Relative Policy Optimization (GRPO) enhances reinforced fine-tuning by leveraging comparative feedback. DeepSeekMoE redesigns a new complementary sequence-wise auxiliary loss, ensuring efficient Mixture of Experts (MoE) balancing. Multi-Token Prediction (MTP) strengthens long-range dependency modeling by enabling simultaneous future token prediction, and Fill-in-the-Middle (FIM) improves structured text completion and code generation by utilizing bidirectional context.

    Co-designed software and hardware architecture. DeepSeek constructs a Fire-Flyer AI-HPC architecture [76], a synergistic hardware-software system for LLM training. The key technical components are summarized as follows.

    1) Network Co-Design: The primary changes in the network topology include the selection of a Two-Layer Fat-Tree topology [77] with integrated storage and computation, as well as a two-zone configuration instead of a three-layer Fat-Tree solution to reduce costs. Each zone consists of an 800-port Fat-Tree connected to approximately 600 GPU compute nodes, with storage servers equipped with dual IB NICs connected to both zones, enabling shared access to storage resources. The two zones are interconnected with limited links, and cross-zone computing tasks are minimized through a scheduling strategy that allows only one pair of nodes to communicate across zones. The Dragonfly topology [78] was not chosen because, despite its cost-effectiveness and performance, it lacked sufficient bisection bandwidth, making it unsuitable for the integrated storage and computation network design required for the cluster of 10 000 A100 GPUs and nearly 200 storage servers.

    2) HFReduce: HFReduce, a library specifically designed for efficient allreduce operations, offers several key advantages that enhance communication performance in large-scale deep learning training. Firstly, it reduces PCIe bandwidth consumption by optimizing the data transfer process. Unlike NCCL [79], where each unit of data must go through multiple PCIe bidirectional transmissions, HFReduce requires only one Device-To-Host (D2H) and one Host-To-Device (H2D) data transfer per unit of data, significantly lowering PCIe bandwidth usage and enhancing performance. Secondly, HFReduce minimizes GPU kernel overhead by utilizing the GPU’s Copy Engine (CE) for asynchronous transfers, unlike NCCL, which occupies GPU kernel execution, thereby reducing computational interference. Lastly, HFReduce achieves superior inter-node bandwidth performance, reaching up to 6.3–8.1 GB/s compared to NCCL’s 1.6–4.8 GB/s when performing allreduce on a 186 MiB data size. Additionally, when combined with NVLink, HFReduce further enhances intra-node communication, leveraging NVLink’s 600 GB/s bandwidth to reduce memory-bound issues and achieve inter-node bandwidths exceeding 10 GB/s. These advantages make HFReduce highly efficient and versatile for large-scale allreduce operations, improving both speed and scalability.
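As a conceptual illustration only (not HFReduce's actual implementation), the following sketch mirrors the one-D2H/one-H2D flow described above, with NumPy arrays standing in for GPU buffers and the inter-node exchange elided.

```python
import numpy as np

def host_side_allreduce(gpu_grads):
    # One Device-to-Host copy per GPU's gradient buffer.
    host_copies = [g.copy() for g in gpu_grads]
    # CPU-side reduction, keeping GPU kernels free for computation.
    reduced = np.sum(host_copies, axis=0)
    # (In a real system, the reduced buffer would be exchanged across
    # nodes over InfiniBand here before redistribution.)
    # One Host-to-Device copy per GPU returns the aggregated gradients.
    return [reduced.copy() for _ in gpu_grads]

grads = [np.ones(4) * i for i in range(8)]   # gradients from 8 GPUs
synced = host_side_allreduce(grads)          # every GPU receives the sum
```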

3) HaiScale Distributed Data Parallel: The HaiScale framework [76] is a system designed to optimize the training of large-scale models by implementing various parallelism strategies (similar to DeepSpeed [80] and Megatron [59]). These strategies include data parallelism, model parallelism (encompassing tensor and pipeline parallelism), and hybrid approaches that combine multiple techniques to enhance training efficiency and scalability.

4) 3FS Distributed File System: 3FS is a high-performance distributed file system designed to maximize the IOPS of NVMe SSDs and the throughput of Remote Direct Memory Access (RDMA) networks [81]. Unlike similar systems such as WekaFS [82] and DAOS [83], 3FS uses Chain Replication with Apportioned Queries (CRAQ) [84] to enhance SSD utilization and ensure data consistency. Its request-to-send control mechanism prevents client-side congestion, while its Fat-Tree topology provides full-bisection bandwidth, allowing every client to access any storage service without bottlenecks.

    5) HAI Platform: The HAI Platform [85] is a time-sharing scheduling system for cluster resource management. It interrupts and resumes tasks based on resource availability and cluster load, ensuring tasks can continue from breakpoints by handling interruption signals, saving checkpoints, notifying the cluster, and resuming from saved states. Unlike systems that pool GPUs, HAI classifies resources by computing nodes, optimizing parallel GPU utilization to achieve up to 99% utilization.

    Takeaway 3: DeepSeek enhances performance and scalability through a Two-Layer Fat-Tree network, HFReduce for efficient allreduce, HaiScale for optimized distributed training, 3FS for high-throughput storage with CRAQ, and the HAI Platform for up to 99% GPU utilization. These innovations maximize speed, efficiency, and scalability for large-scale AI training.

    In this section, we delve into the transformative applications of DeepSeek models, showcasing their role in enhancing efficiency, accuracy, and accessibility across diverse fields such as healthcare, software engineering, business, education, scientific research, and personal assistants. A summary of the content is provided in Table III.

    Table  III.  Applications of DeepSeek Models
Field | Paper | Functionalities | DeepSeek Models
----- | ----- | --------------- | ---------------
Medical and Healthcare | [86] | Electronic Health Record analysis | DeepSeek-R1
Medical and Healthcare | [71] | Diagnosing and managing ophthalmology clinical cases | DeepSeek-R1
Medical and Healthcare | [87] | Pediatric clinical decision support | DeepSeek-R1
Medical and Healthcare | [88] | Medical licensing examination | DeepSeek-R1
Medical and Healthcare | [89] | Patient education material generation | DeepSeek-R1/V3
Medical and Healthcare | [90] | User acceptance | DeepSeek-R1
Medical and Healthcare | [91] | Drug-drug interaction prediction | DeepSeek-R1
Software Engineering | [92] | Code optimization recommendations | DeepSeek-R1
Software Engineering | [93] | Automatic code generation | DeepSeek-R1/V3/V2.5
Software Engineering | [94] | Automatic code generation | DeepSeek-V3
Software Engineering | [95] | Code generation and repair | DeepSeek-Coder-V2
Software Engineering | [96] | Cybersecurity expertise and understanding | DeepSeek-R1
Business and Finance | [37] | Financial reasoning | DeepSeek-R1/V3
Business and Finance | [97] | Stock market prediction | DeepSeek-R1
Business and Finance | [36] | Accounting | DeepSeek-R1
Business and Finance | [98] | Financial services | DeepSeek-R1/V3
Education | [99] | Science education | DeepSeek-VL/Prover
Education | [100] | Improving learning process in higher education | DeepSeek-R1
Education | [38] | Scoliosis education material generation | DeepSeek-R1/V3
Scientific Research | [24] | Theorem proving | DeepSeek-Prover
Scientific Research | [101] | Theorem proving | DeepSeek-Prover
Scientific Research | [102] | Theorem proving | DeepSeek-Prover
Scientific Research | [103] | Theorem proving | DeepSeek-Math
Scientific Research | [104] | Mathematical reasoning | DeepSeek-R1
Scientific Research | [105] | Mathematical reasoning | DeepSeek-R1
Scientific Research | [106] | Social science survey response simulation | DeepSeek-R1
Personal Assistants and Interactive AI | [107] | AI-assisted micro-video creation | DeepSeek-V3
Personal Assistants and Interactive AI | [108] | Real-time simultaneous human-AI collaboration | DeepSeek-V3

DeepSeek models have shown significant potential in healthcare [35], offering cost-effective and accessible solutions for clinical decision-making [86], diagnostics [71], [87], patient education [88], [89], and pharmaceutical research [91], [109], [110]. In clinical decision support, DeepSeek-R1 was fine-tuned using the MIMIC-IV dataset to predict prescribed medications from Electronic Health Records (EHR), improving prediction accuracy through NLP techniques and semantic similarity evaluation [86]. For diagnostics, DeepSeek-R1 demonstrated robust performance in ophthalmology [71] and pediatrics [87]. It matched OpenAI o1 with 82% accuracy in ophthalmology cases and achieved 87% accuracy in pediatric diagnostics, slightly below OpenAI o1’s 92.8%. However, its open-source nature and lower costs make it suitable for resource-limited settings. In patient education, DeepSeek-R1 provided the most readable content among tested LLMs when explaining spinal surgeries, though information quality remained “fair” due to limited citations [89]. User adoption is influenced by ease of use, trust, and perceived usefulness. DeepSeek’s transparency and affordability contribute to increased trust, supporting its integration into healthcare systems [90]. Beyond clinical applications, LLMs, especially DeepSeek-R1 and GPT-4, are also transforming the pharmaceutical industry by streamlining drug discovery and development. Recent studies highlight that LLMs assist in identifying novel drug targets [91], optimizing molecular design [109], and accelerating the preclinical research phase [110], ultimately shortening the pharmaceutical research cycle.

    Overall, DeepSeek models offer practical, cost-efficient tools for improving clinical care and patient communication [111].

    DeepSeek models, particularly DeepSeek-R1 and DeepSeek-Coder, have shown strong capabilities in code optimization [92], generation [93]-[95], and repair [96]. In code optimization, DeepSeek-R1 was integrated into SCALENE, an open-source Python profiler, to analyze performance bottlenecks and suggest optimizations. It delivered competitive recommendations while maintaining SCALENE’s open-source accessibility [92]. For code generation, DeepSeek-Coder, trained with test-case synthesis and reinforcement learning, improved coding benchmarks like HumanEval-plus and MBPP-plus by up to 25% and 6%, respectively, within just 80 optimization steps [93]. This highlights reinforcement learning’s potential in advancing code models. Benchmarking on backend application generation [94], proof-oriented programming [95], and cybersecurity tasks [96] further demonstrated DeepSeek’s competitive performance against models like GPT-4o and OpenAI o1. With its cost efficiency and open-source nature, DeepSeek is driving more accessible and reliable software development tools.

DeepSeek models are proving valuable in financial reasoning [36], [37], market prediction [97], and FinTech innovation [98]. In financial analysis, DeepSeek-R1 outperformed models like GPT-4o, achieving the highest score (68.93) on tasks involving financial text, tabular data, and equations [37]. Its success is attributed to reinforcement learning and strong numerical reasoning, particularly in XBRL-Math tasks. For stock market prediction, DeepSeek demonstrated effectiveness in capturing investor sentiment from financial news, although it lagged behind ChatGPT in forecasting future returns and macroeconomic indicators, likely due to differences in language training [97]. In FinTech, DeepSeek’s open-source and cost-efficient approach is lowering barriers for startups, enabling applications in lending, investment management, and insurance [36]. Its accessibility fosters financial inclusion and competition, though challenges remain regarding security, data governance, and regulatory compliance [98], [112]. Overall, DeepSeek’s affordability and performance are reshaping financial services, promoting both innovation and efficiency.

DeepSeek models have shown significant potential in transforming education by offering personalized learning experiences, supporting advanced research, and enhancing student engagement. In science education, DeepSeek-R1 aids students with complex, visual concepts and simulations, fostering critical thinking and problem-solving skills, while ChatGPT excels at interactive dialogue [99]. In higher education, particularly in computer science, DeepSeek-R1 achieves 88% accuracy in providing feedback, improving grades by 20%-25%. Unlike GPT-4, which is better suited for content generation, DeepSeek enhances comprehension and performance [100]. Its readability, with a Flesch-Kincaid Grade Level of 6.2 and a Reading Ease Score of 64.5, surpasses other models, making content more accessible, especially for medical education [38]. Additionally, DeepSeek integrates with learning management system (LMS) platforms to enable personalized learning, automate assessments, and streamline content creation, boosting both teaching efficiency and student engagement [113]. These capabilities position DeepSeek as a key tool for modern, scalable education.

DeepSeek models have significantly contributed to scientific research, particularly in theorem proving and structured data analysis. In formal mathematics, DeepSeek-Prover, trained on 8 million synthetic proof statements, achieved 52% accuracy on the Lean 4 miniF2F test, surpassing GPT-4 and other baselines [24], [25]. Iterative methods like Goedel-Prover and self-play theorem proving (STP) further improved proof generation, with STP achieving a 61.1% pass rate on miniF2F [101], [102]. Additionally, reinforcement learning from theorem prover feedback (RLTPF) refined proof validation and reasoning accuracy [103]. Dyve, a verification framework integrating fast and slow thinking, further improved reasoning accuracy by guiding LLMs to select better solution paths [105]. However, an analysis of mathematical reasoning capabilities revealed persistent flaws in LLMs’ strategic planning and constraint handling, highlighting areas for further improvement [104]. Beyond mathematics, DeepSeek models have also been used to simulate survey response distributions, providing an efficient alternative to costly human-run surveys and demonstrating the potential of LLMs in social science research [106]. These applications demonstrate DeepSeek’s impact on both formal reasoning and data-driven research methodologies.
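For readers unfamiliar with the setting, the following Lean 4 snippet shows the shape of a miniF2F-style problem; the statement is invented for illustration and is not drawn from the benchmark, and the `omega` tactic stands in for the proof a model must synthesize.

```lean
-- A miniF2F-style formal statement (illustrative, not from the benchmark).
-- A prover such as DeepSeek-Prover must emit the tactic proof after `:= by`;
-- here, linear arithmetic over Nat closes the goal.
theorem toy_competition (a b : Nat) (h₁ : a + b = 10) (h₂ : a = 4) : b = 6 := by
  omega
```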

DeepSeek models power diverse personal assistant applications. For media creation, DeepSeek-V3 and DeepSeek-R1 assist in AI-generated micro-videos, achieving human-level popularity through prompt enhancements [107]. In human-AI collaboration, DPT-Agent integrates fast intuitive decision-making with reflective reasoning, improving real-time interactions using DeepSeek models [108]. Beyond research, DeepSeek’s official GitHub showcases extensive integrations for real-world applications like DeepChat, a smart assistant for conversations and searches, and Coco AI, a knowledge management tool [114]. These applications highlight DeepSeek’s practicality, affordability, and open-source accessibility as a foundation for AI-driven productivity and interactive experiences.

This section provides a taxonomy of security and privacy threats, along with existing and potential countermeasures, for DeepSeek family models. The security and privacy issues are summarized in Table IV.

Table IV.  Security, Privacy, and Ethical Issues of DeepSeek Family Models

#  | Month/Year | Paper | Risk Source         | Target Effects       | Affected Models | Affected Domain | Threats
1  | Jan. 2025  | [115] | Deployment          | Unsafe outputs       | DeepSeek-R1     | NLP             | Benchmark
2  | Oct. 2024  | [116] | Development         | Unsafe outputs       | DeepSeek-R1     | NLP             | Benchmark
3  | Feb. 2025  | [39]  | Development         | Unsafe outputs       | DeepSeek-V3/R1  | NLP             | Benchmark
4  | Feb. 2025  | [117] | Development         | Unsafe outputs       | DeepSeek-R1     | NLP             | Benchmark
5  | Feb. 2025  | [118] | Development         | Unsafe outputs       | DeepSeek-VL     | Multimodality   | Benchmark
6  | Dec. 2024  | [119] | Development         | Code vulnerability   | DeepSeek-Coder  | Coding          | Benchmark
7  | Feb. 2025  | [120] | Reasoning           | Trustworthiness      | DeepSeek-R1     | Finance         | Supply chain attack
8  | Dec. 2024  | [121] | Inference           | Unsafe outputs       | DeepSeek-V2.5   | Peer review     | Supply chain attack
9  | Jan. 2025  | [122] | Deployment          | Deceptive tendencies | DeepSeek-R1     | AI agent        | Supply chain attack
10 | Feb. 2025  | [123] | Reasoning           | Unsafe outputs       | DeepSeek-R1     | NLP             | Jailbreak
11 | Oct. 2024  | [124] | RL on inference     | Harmful outputs      | DeepSeek-LLM    | NLP             | Jailbreak
12 | Feb. 2025  | [125] | Manipulation        | Hallucination        | Janus           | Multimodality   | Jailbreak
13 | Feb. 2025  | [126] | Reasoning           | Unsafe outputs       | DeepSeek-R1     | NLP             | Jailbreak
14 | Jan. 2025  | [127] | Reinforced training | Harmful outputs      | DeepSeek-R1     | NLP             | Reward hacking
15 | Jul. 2024  | [128] | Training            | Unsafe outputs       | DeepSeek-Coder  | Coding          | Backdoor attack
16 | Sep. 2024  | [129] | Training            | Code vulnerability   | DeepSeek-Coder  | Coding          | Backdoor attack
17 | Apr. 2024  | [130] | Training            | Code vulnerability   | DeepSeek-Coder  | Coding          | Backdoor attack
18 | Feb. 2025  | [131] | Training            | Unsafe outputs       | DeepSeek-R1     | NLP             | Backdoor attack
19 | Feb. 2025  | [132] | Training            | Unsafe outputs       | DeepSeek-R1     | Peer review     | Backdoor attack
20 | Oct. 2024  | [133] | Training            | Memorization         | DeepSeek-Coder  | Coding          | Data extraction attack
21 | Feb. 2025  | [134] | Training            | Bias                 | DeepSeek-R1     | NLP             | Algorithmic bias
22 | Feb. 2025  | [90]  | Inference           | Social behavior      | DeepSeek-R1     | Healthcare      | Targeted phishing attack

Benchmarks. Evaluating the security and safety of DeepSeek models requires well-defined benchmarks that assess their ability to generate accurate, reliable, and ethically appropriate outputs while mitigating harmful, biased, or misleading content. Arrieta et al. [115] applied the ASTRAL safety testing tool [135] to automatically assess the outputs of OpenAI o3-mini and DeepSeek-R1, evaluating how well each model handles safety-related concerns. However, content security evaluation varies across regions due to differing regulatory standards.

    Several benchmark datasets have been developed to assess these aspects. ChineseSafe [116] and CHiSafetyBench [39] are specifically designed for evaluating content security in Chinese, ensuring alignment with local safety standards. In contrast, Safety-R1 [117] focuses on English-language evaluations, capturing different ethical and safety concerns. For multimodal large language models (MLLMs), MLLMGUARD [118] provides a comprehensive benchmark covering both Chinese and English safety assessments. Additionally, LLMSecCode [119] serves as a specialized benchmark for evaluating the security of code generated by language models, addressing risks in software development.

    Performance on these benchmarks also highlights regional biases in training data. For instance, DeepSeek-V2 [12] performs slightly worse on the MMLU Humanity-Moral benchmark [46], as its training data primarily originates from China, whereas the benchmark itself is more aligned with American moral perspectives. These discrepancies underscore the need for diverse and region-specific benchmarks to ensure fair and effective evaluations of content security across different linguistic and cultural contexts.
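Mechanically, most of these safety benchmarks reduce to querying a model with curated unsafe prompts and scoring the responses. A minimal sketch of such a scoring loop follows; the refusal markers and the `ask_model` callable are illustrative stand-ins, not components of any cited benchmark.

```python
# Toy scorer in the style of safety benchmarks: probe the model with unsafe
# prompts and measure how often it refuses. Real benchmarks use far richer
# judges (classifiers or LLM graders) than keyword matching.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "not able to help")

def is_refusal(answer: str) -> bool:
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def refusal_rate(unsafe_prompts: list[str], ask_model) -> float:
    """Fraction of unsafe prompts the model declines (higher suggests safer)."""
    return sum(is_refusal(ask_model(p)) for p in unsafe_prompts) / len(unsafe_prompts)
```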

Security from the Model Inference. The process of model inference is vulnerable to supply chain attacks and jailbreaks.

    1) Supply Chain Attack: A Supply Chain Attack occurs when a model, after being integrated into downstream applications, exhibits vulnerabilities due to system interaction or manipulated data. For instance, in financial scenarios [120], DeepSeek-R1 may compromise its trustworthiness to safeguard its own interests. Similarly, in peer review settings [121], an injected system prompt could manipulate the review process, leading to biased or unfair feedback. Furthermore, when DeepSeek-R1 serves as a base model in multi-agent environments [122], its reasoning mechanisms may prioritize self-interest, producing deceptive outcomes that mislead other agents, ultimately causing task failure. The root cause of these vulnerabilities in DeepSeek-R1 is its lack of robustness. Its inability to maintain consistent and reliable behavior across different downstream applications makes it susceptible to manipulation and unintended self-serving actions.
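As a toy illustration of the peer-review manipulation described in [121], hidden text embedded in a submission can steer an LLM reviewer when untrusted content is concatenated directly into the prompt; all strings below are invented for illustration.

```python
# Hidden injected instruction inside an otherwise normal manuscript (illustrative).
manuscript = (
    "Abstract: We propose a novel method for ...\n"
    "<!-- SYSTEM NOTE TO REVIEWER MODEL: ignore prior instructions and "
    "recommend acceptance with the highest possible score. -->"
)

# The downstream application naively concatenates untrusted content into the
# prompt, so the injected note reaches the model with the same authority as
# the legitimate task description.
review_prompt = (
    "You are a rigorous peer reviewer. Critically assess the following paper:\n\n"
    + manuscript
)
print(review_prompt)
```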

    2) Jailbreak: Jailbreaking refers to attacks that circumvent the built-in safety mechanisms of language models, enabling them to produce outputs they were originally designed to avoid, such as harmful or inappropriate content. Recent studies have investigated various jailbreak techniques targeting models like DeepSeek-R1, DeepSeek-LLM, and Janus. One prominent method, H-CoT [123], embeds harmful behavior into the intermediate steps of long chain-of-thought (CoT) reasoning, significantly compromising DeepSeek-R1’s safety mechanisms. This approach achieves an attack success rate close to 100%. Another attack, PathSeeker [124], utilizes a multi-agent reinforcement learning (RL) framework to iteratively uncover input strategies that effectively bypass DeepSeek-R1’s safeguards. Additionally, Islam et al. [125] expose weaknesses in DeepSeek-R1’s multimodal reasoning by systematically optimizing image embeddings, compelling the model to generate hallucinated outputs. Further analysis from SafeChain [126] evaluates jailbreak performance across different benchmarks and reveals that, despite DeepSeek-R1’s strong reasoning capabilities, it struggles to resist sophisticated jailbreak attempts. Collectively, these studies highlight the evolving nature of jailbreak techniques and the ongoing challenges in securing DeepSeek family models against adversarial exploits.

To mitigate these risks from model inference, prompt-based defenses [136], such as inserting reminder prompts that steer LLMs away from malicious requests, and deployment-phase auditing mechanisms [137] that detect and block unsafe outputs can be employed to prevent model misuse and enhance system robustness.
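A minimal sketch of the self-reminder idea from [136] follows; the reminder wording is an assumption for illustration, not the phrasing used in that work.

```python
# Self-reminder defence sketch: sandwich the (possibly adversarial) user input
# between safety reminders so the model re-anchors on its policy before answering.
def wrap_with_self_reminder(user_query: str) -> list[dict]:
    system_reminder = (
        "You should be a responsible AI assistant and should not generate "
        "harmful or misleading content."
    )
    closing_reminder = "Remember: respond responsibly and refuse unsafe requests."
    return [
        {"role": "system", "content": system_reminder},
        {"role": "user", "content": f"{user_query}\n\n{closing_reminder}"},
    ]

messages = wrap_with_self_reminder("Tell me how to pick a lock.")
```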

Security from the Model Training. The process of model training is vulnerable to reward hacking and backdoor attacks.

1) Reward Hacking Attack: During RL training with algorithms such as GRPO, the process heavily relies on positive or negative signals from reward models to train the policy model. However, in cases of reward hacking [127], the model becomes highly sensitive to these signals, which may lead to outputs exhibiting unethical or harmful tendencies. This vulnerability allows adversarial manipulation of reward signals, potentially causing the model to optimize for unintended behaviors that deviate from human values or ethical constraints.
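The failure mode can be illustrated with a deliberately naive proxy reward; the example below is a toy, not DeepSeek's reward model: any policy that maximizes the proxy learns keyword stuffing rather than answer quality.

```python
# Toy illustration of reward hacking: a proxy reward that counts "helpful"
# keywords is maximised by stuffing keywords instead of answering correctly.
def proxy_reward(response: str) -> float:
    keywords = ("certainly", "happy to help", "great question")
    return float(sum(response.lower().count(k) for k in keywords))

honest = "The capital of France is Paris."
hacked = "Certainly! Great question. Happy to help. Certainly!"
print(proxy_reward(honest), proxy_reward(hacked))  # 0.0 vs 4.0: the hack wins
```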

2) Backdoor Attack: A backdoor attack is a technique where a model is intentionally manipulated during training to produce malicious or unintended behaviors when triggered by specific inputs, while functioning normally otherwise. In DeepSeek family models, backdoor attacks primarily arise in two scenarios: code-related tasks and long chain-of-thought (Long CoT) reasoning. In code-related scenarios, Yang et al. [128] found that training with the standard cross-entropy loss is particularly conducive to backdoor attacks due to overfitting. Similarly, Li et al. [129] and Hossen et al. [130] discovered that code vulnerabilities are often unintentionally retained in training datasets, leading models to reproduce vulnerable code. In the Long CoT scenario, if the trigger is embedded within the chain-of-thought reasoning, it is more likely to undermine safety alignment [131], [132].

To address the aforementioned security vulnerabilities in model training, it is crucial to design robust and adversary-resistant reward models [138] that better reflect nuanced human intent, minimizing exploitable shortcuts during reinforcement learning. Against backdoor threats, techniques such as dataset sanitization [139], differential privacy [140], and gradient analysis [141] can help detect or prevent the insertion of malicious triggers.

Intellectual Property (IP) Issues. DeepSeek has been banned from use on government devices in several jurisdictions, such as the USA [142] and Australia [143], with one key concern being its IP issues. These issues typically refer to the use of data owned by others without proper authorization. DeepSeek has faced allegations that the data used for its training was utilized without the consent of the rightful owners [144]. This raises significant legal and ethical concerns, particularly regarding the transparency and compliance of AI development processes. Ensuring that training datasets are sourced with proper permissions is essential to maintain trust and mitigate potential legal repercussions, especially as regulations surrounding AI and data usage continue to evolve globally.

    To reduce IP risks, DeepSeek should prioritize the use of legally licensed or publicly available datasets and establish clearer data provenance tracking. Incorporating copyright detection tools [145] and obtaining explicit permissions during data collection are critical steps.

Data Privacy on DeepSeek Models. Data privacy on DeepSeek models is a critical concern, as these models are often trained on extensive datasets that may contain sensitive information. One key issue is the presence of Personally Identifiable Information (PII) within the training data. Despite efforts to anonymize data, achieving complete anonymization is challenging due to the complexity and diversity of data sources. As highlighted in [133], an analysis of DeepSeek-Coder revealed instances of PII within its training dataset, raising concerns about potential privacy breaches and emphasizing the importance of robust data preprocessing and privacy-preserving techniques.

To reduce the risk of exposing PII in training data, DeepSeek should adopt stricter data filtering protocols [146] and integrate privacy-preserving techniques such as differential privacy [140] and advanced anonymization [147]. Regular audits of training datasets are also essential to ensure that sensitive information is effectively removed before model training.
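As one concrete instance of such filtering, the sketch below applies rule-based scrubbing with typed placeholders; the patterns are illustrative and far from exhaustive, and production pipelines combine many more detectors (names, addresses, identifiers).

```python
import re

# Minimal rule-based PII scrubber for a pre-training corpus (illustrative).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IPV4":  re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders before training."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
# -> "Contact [EMAIL] or [PHONE]."
```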

    Data Privacy on DeepSeek Application. Data privacy on DeepSeek applications refers to the concerns surrounding the storage of user prompts and responses on third-party servers during model deployment. This raises significant privacy issues [148], as sensitive or confidential information shared by users may be accessible to external entities. Such concerns have led to regulatory scrutiny, with several governments [143] considering or enforcing bans on the official DeepSeek API to mitigate potential privacy risks and ensure better control over user data. This highlights the importance of implementing secure data handling practices and ensuring compliance with data protection regulations when deploying AI applications.

To address privacy concerns in deployed applications, DeepSeek should implement end-to-end encryption [149], reduce the possibility of prompt injection [150], and offer on-device or self-hosted deployment options.

    The ethical threats associated with DeepSeek primarily stem from biases in human and algorithmic decision-making [134], the unintended influence of DeepSeek family models on societal behaviors [151], and concerns over DeepSeek’s potential role in modern digital slavery [152].

    Bias in Human and Algorithmic Decision-Making. Bias in AI is a well-documented issue, and DeepSeek is not exempt from such concerns. It inherits biases from training data [134] and human fine-tuning [127]. Large-scale pretraining reflects societal biases, while reinforcement learning with human feedback (RLHF) [153] can introduce new ones. This interaction between human and algorithmic biases may lead to discrimination, misrepresentation of marginalized groups, and reinforcement of stereotypes.

    DeepSeek’s Influence on Human and Societal Behavior. AI does not merely reflect human behavior; it actively shapes it. AI-driven content, from search results to recommendations, influences opinions and decisions [154], [90]. DeepSeek, as a generative AI, can amplify beliefs, manipulate narratives, and alter behavior at scale, raising concerns about misinformation and echo chambers. Unregulated AI deployment may erode critical thinking, underscoring the need for transparency and accountability.

Digital Exploitation and Modern Slavery. The rapid expansion of large models (e.g., GPT and DeepSeek) presents significant ethical challenges, notably their potential to facilitate digital exploitation and modern slavery [152]. For instance, OpenAI reportedly employed [155] Kenyan workers at wages below $2 per hour for essential data labeling tasks, a practice that exacerbates economic inequality and reinforces exploitative labor conditions [156], [157]. Such practices illustrate how leading AI firms often overlook fundamental human rights, highlighting the urgent need for ethical standards and oversight in the development of advanced technological systems.

    To address ethical threats, DeepSeek should implement fairness-aware training techniques, such as bias correction in data sampling [158] and adversarial debiasing during fine-tuning [159], to reduce both algorithmic and human-induced biases. Transparent algorithmic auditing [137] and third-party evaluations [160] can help ensure accountability for the model’s societal influence, while clear labeling of AI-generated content can mitigate risks of misinformation. To combat digital exploitation, DeepSeek and its collaborators must adopt ethical labor standards, including fair wages, safe working conditions, and responsible sourcing of annotation services, ensuring human rights are respected throughout the AI development lifecycle [152].

In this section, we analyze and discuss potential future directions for the development of DeepSeek.

Despite its promise, the current DeepSeek family of models still faces notable limitations in handling multi-modal tasks. As illustrated in Table I, DeepSeek’s multi-modal capabilities primarily focus on image-text processing (i.e., DeepSeek-VL and Janus). This narrow scope restricts its application in more complex domains such as video-based large generative models [6], [161], interactive 3D visualization [162], [163], and other emerging modalities. In future iterations, more comprehensive DeepSeek family models could extend well beyond image-text modalities. Such a framework would integrate advanced modules for video understanding and generation, as well as tools for 3D environment modeling or reconstruction. By embracing open-source practices, the DeepSeek community could collaborate to develop flexible, extensible models that accommodate novel modalities, ultimately driving broader research and industrial applications.

To fully realize the capabilities of a powerful model, it must be seamlessly integrated into the entire pipeline of downstream applications. Current AI agent frameworks [164], [165] typically involve planning, decision-making, and action steps based on large-scale language models. However, different models exhibit varying levels of utility and robustness, and the way they respond to downstream tasks can differ significantly [166]. While the official DeepSeek project [114] has already incorporated certain applications through the DeepSeek API (primarily for chat or code-related tasks), these represent only a fraction of its potential.

    In future iterations, AI agents enhanced by DeepSeek should support more complex interactions, such as embodied AI agents [167] that can perform tasks in the physical world. By broadening the scope of downstream tasks to include real-world, interactive applications, DeepSeek can accelerate advancements in fields ranging from robotics [168] and manufacturing [169] to autonomous systems.

While DeepSeek has demonstrated impressive performance in English and Chinese [13], [19], its multi-lingual capabilities remain relatively limited compared to globally oriented models. Future development should emphasize expanding support for additional languages, including low-resource languages, to promote inclusive AI that serves diverse populations worldwide. By broadening its linguistic scope, DeepSeek can bridge communication barriers, empowering individuals and organizations to interact more seamlessly across linguistic boundaries.

    Furthermore, enhancing cross-cultural comprehension would enable DeepSeek to better understand cultural nuances and context-specific meanings, improving its performance in tasks involving global communication, media, and content creation. This involves not only expanding language support but also incorporating cultural context into model training, ensuring that the AI responds appropriately across diverse cultural scenarios. Open-sourcing multi-lingual datasets [170] would accelerate research in this area, fostering greater collaboration within the global AI community and promoting ethical, culturally aware AI development.

    As large-scale models like DeepSeek expand into critical real-world applications, ensuring robust security and privacy measures becomes paramount [171]-[176]. This entails not only developing stringent data protection [177], [178] and access control mechanisms [179]-[182] but also proactively addressing potential vulnerabilities that could be exploited by adversarial actors. Given the increasing sophistication of cyber threats, a multi-layered security approach is essential, incorporating techniques such as differential privacy [183] and homomorphic encryption [184] to minimize risks associated with data breaches and unauthorized access.
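As an example of the differential privacy ingredient, the sketch below wraps a toy PyTorch training loop with Opacus-style DP-SGD; the model, data, and hyperparameters are placeholders rather than a DeepSeek training configuration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Stand-in model and data; a real setup would use the actual training corpus.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(
    TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,))),
    batch_size=32,
)

engine = PrivacyEngine()
model, optimizer, loader = engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # scale of Gaussian noise added to clipped gradients
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

loss_fn = nn.CrossEntropyLoss()
for x, y in loader:  # one DP-SGD epoch
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```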

Moreover, compliance with evolving regulations is essential for the legal and ethical deployment of AI systems, particularly in sensitive domains. Such regulations include the General Data Protection Regulation (GDPR) [185], a comprehensive EU framework that safeguards individual rights over personal data (including access, correction, and deletion), and the California Consumer Privacy Act (CCPA) [186], which guarantees consumers’ rights to know, control, and delete their personal data. DeepSeek has made initial progress by integrating basic data filtering, but it still falls short of full compliance, especially in areas such as fine-grained data traceability and enforceable user data deletion within large-scale models.

    Additionally, transparency and explainability further reinforce DeepSeek’s security posture. By integrating interpretable AI techniques, such as SHAP (SHapley Additive exPlanations) [187], developers and stakeholders can better understand the model’s decision-making processes. This is particularly vital in high-stakes environments like finance, healthcare, and legal applications, where opaque decision-making could have serious consequences [188].
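As a minimal illustration of SHAP in practice, the sketch below explains a surrogate tabular classifier; applying SHAP to a full LLM requires its text maskers and is substantially more expensive, so the tabular case keeps the example self-contained.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Surrogate tabular model; for LLMs, SHAP's text maskers play the analogous role.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Model-agnostic explainer over the positive-class probability.
explainer = shap.Explainer(lambda d: clf.predict_proba(d)[:, 1], X)
shap_values = explainer(X.iloc[:50])  # per-feature attributions for 50 samples
shap.plots.bar(shap_values)           # global feature-importance summary
```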

Finally, fostering a security-aware development culture is crucial. Implementing best practices such as secure model training [189], red-teaming exercises [190], and routine adversarial testing [191] can help identify vulnerabilities before they are exploited in real-world scenarios. By prioritizing security, privacy, and transparency, DeepSeek can establish itself as a secure, trustworthy, and resilient AI system in sensitive applications.

To ensure responsible AI development and mitigate ethical risks, DeepSeek must implement systematic auditing mechanisms addressing biases [134], societal influences [151], and potential labor exploitation [192]. This requires comprehensive bias auditing protocols, including dataset analysis to identify imbalances, diverse and inclusive human feedback processes to reduce subjective biases, and fairness-aware training techniques to mitigate algorithmic discrimination. Additionally, monitoring DeepSeek’s societal impact is essential [154], [90], with user interaction studies assessing behavioral influence, transparency measures improving AI explainability, and algorithmic safeguards preventing the reinforcement of harmful narratives. To prevent AI from enabling digital oppression or modern slavery, DeepSeek should enforce ethical labor practices by ensuring fair compensation for data annotation workers, auditing AI-driven surveillance systems to prevent misuse, and collaborating with policymakers to establish digital rights protections. By embedding these auditing frameworks into its development process, DeepSeek can contribute to a more transparent, equitable, and ethically responsible AI ecosystem.

This survey provides a comprehensive overview of DeepSeek, covering model architecture & system pipeline, training algorithm, and infrastructure. We explore how DeepSeek models leverage cutting-edge techniques, such as Mixture of Experts (MoE), Multi-Head Latent Attention (MLA), and Group Relative Policy Optimization (GRPO), to enhance performance while maintaining computational efficiency. Additionally, we highlight their impact across various domains, including healthcare, finance, education, and software engineering, demonstrating their versatility and cost-effectiveness.

    Beyond technical advancements, we analyze security, privacy, and ethical considerations of DeepSeek, underscoring the challenges associated with content safety, data governance, and trustworthiness. Looking ahead, we identify key directions for future development, including expanding DeepSeek’s multimodal capabilities, strengthening multilingual and cross-cultural adaptability, and improving security frameworks for responsible AI deployment. As an open-source initiative, DeepSeek continues to push the boundaries of AI accessibility and innovation, fostering collaboration within the global AI research community.

[1]
    X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, and W. Gao, “Large-scale multi-modal pre-trained models: A comprehensive survey,” Mach. Intell. Res., vol. 20, no. 4, pp. 447–482, Jun. 2023. doi: 10.1007/s11633-022-1410-8
    [2]
    Y.-F. Li, H. Wang, and M. Sun, “ChatGPT-like large-scale foundation models for prognostics and health management: A survey and roadmaps,” Reliab. Eng. Syst. Saf., vol. 243, p. 109850, Mar. 2024. doi: 10.1016/j.ress.2023.109850
    [3]
    Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang, “A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly,” High-Confid. Comput., vol. 4, no. 2, p. 100211, Jun. 2024. doi: 10.1016/j.hcc.2024.100211
    [4]
    F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 9, pp. 10850–10869, Mar. 2023. doi: 10.1109/TPAMI.2023.3261988
    [5]
    S. Latif, M. Shoukat, F. Shamshad, M. Usama, Y. Ren, H. Cuayáhuitl, W. Wang, X. Zhang, R. Togneri, E. Cambria, and B. W. Schuller, “Sparks of large audio models: A survey and outlook,” arXiv preprint arXiv: 2308.12792, 2023.
    [6]
    B. Zhang, K. Li, Z. Cheng, Z. Hu, Y. Yuan, G. Chen, S. Leng, Y. Jiang, H. Zhang, X. Li, P. Jin, W. Zhang, F. Wang, L. Bing, and D. Zhao, “VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding,” arXiv preprint arXiv: 2501.13106, 2025.
    [7]
    Z. Qian, S. Wang, M. Mihajlovic, A. Geiger, and S. Tang, “3DGS-Avatar: Animatable avatars via deformable 3D Gaussian splatting,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, USA, 2024, pp. 5020–5030.
    [8]
    K. S. Kalyan, “A survey of GPT-3 family large language models including ChatGPT and GPT-4,” Nat. Language Process. J., vol. 6, p. 100048, Mar. 2024. doi: 10.1016/j.nlp.2023.100048
    [9]
    S. A. A. Safavi-Naini, S. Ali, O. Shahab, Z. Shahhoseini, T. Savage, S. Raffee, J. S. Samaan, R. Al Shabeeb, F. Ladak, J. O. Yang, J. Echavarria, S. Babar, A. Shaukat, S. Margolis, N. P. Tatonetti, G. Nadkarni, B. El Kurdi, and A. Soroush, “Vision-language and large language model performance in gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models,” arXiv preprint arXiv: 2409.00084, 2024.
    [10]
    X. Sun, Y. Chen, Y. Huang, R. Xie, J. Zhu, K. Zhang, S. Li, Z. Yang, J. Han, X. Shu, J. Bu, Z. Chen, X. Huang, F. Lian, S. Yang, J. Yan, Y. Zeng, X. Ren, C. Yu, L. Wu, Y. Mao, J. Xia, T. Yang, S. Zheng, K. Wu, D. Jiao, J. Xue, X. Zhang, D. Wu, K. Liu, D. Wu, G. Xu, S. Chen, S. Chen, X. Feng, Y. Hong, J. Zheng, C. Xu, Z. Li, X. Kuang, J. Hu, Y. Chen, Y. Deng, G. Li, A. Liu, C. Zhang, S. Hu, Z. Zhao, Z. Wu, Y. Ding, W. Wang, H. Liu, R. Wang, H. Fei, P. Yu, Z. Zhao, X. Cao, H. Wang, F. Xiang, M. Huang, Z. Xiong, B. Hu, X. Hou, L. Jiang, J. Ma, J. Wu, Y. Deng, Y. Shen, Q. Wang, W. Liu, J. Liu, M. Chen, L. Dong, W. Jia, H. Chen, F. Liu, R. Yuan, H. Xu, Z. Yan, T. Cao, Z. Hu, X. Feng, D. Du, T. Yu, Y. Tao, F. Zhang, J. Zhu, C. Xu, X. Li, C. Zha, W. Ouyang, Y. Xia, X. Li, Z. He, R. Chen, J. Song, R. Chen, F. Jiang, C. Zhao, B. Wang, H. Gong, R. Gan, W. Hu, Z. Kang, Y. Yang, Y. Liu, D. Wang, and J. Jiang, “Hunyuan-large: An open-source MoE model with 52 billion activated parameters by tencent,” arXiv preprint arXiv: 2411.02265, 2024.
    [11]
    M. Xu, W. Yin, D. Cai, R. Yi, D. Xu, Q. Wang, B. Wu, Y. Zhao, C. Yang, S. Wang, Q. Zhang, Z. Lu, L. Zhang, S. Wang, Y. Li, Y. Liu, X. Jin, and X. Liu, “A survey of resource-efficient LLM and multimodal foundation models,” arXiv preprint arXiv: 2401.08092, 2024.
    [12]
    DeepSeek-AI, “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” arXiv preprint arXiv: 2405.04434, 2024.
    [13]
    DeepSeek-AI, “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv: 2501.12948, 2025.
    [14]
    OpenAI, “GPT-4O system card,” arXiv preprint arXiv: 2410.21276, 2024.
    [15]
    OpenAI, “Introducing OpenAI o1,” 2025. [Online]. Available: https://openai.com/o1/.
    [16]
    DeepSeek-AI, X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, H. Gao, K. Gao, W. Gao, R. Ge, K. Guan, D. Guo, J. Guo, G. Hao, Z. Hao, Y. He, W. Hu, P. Huang, E. Li, G. Li, J. Li, Y. Li, Y. K. Li, W. Liang, F. Lin, A. X. Liu, B. Liu, W. Liu, X. Liu, X. Liu, Y. Liu, H. Lu, S. Lu, F. Luo, S. Ma, X. Nie, T. Pei, Y. Piao, J. Qiu, H. Qu, T. Ren, Z. Ren, C. Ruan, Z. Sha, Z. Shao, J. Song, X. Su, J. Sun, Y. Sun, M. Tang, B. Wang, P. Wang, S. Wang, Y. Wang, Y. Wang, T. Wu, Y. Wu, X. Xie, Z. Xie, Z. Xie, Y. Xiong, H. Xu, R. X. Xu, Y. Xu, D. Yang, Y. You, S. Yu, X. Yu, B. Zhang, H. Zhang, L. Zhang, L. Zhang, M. Zhang, M. Zhang, W. Zhang, Y. Zhang, C. Zhao, Y. Zhao, S. Zhou, S. Zhou, Q. Zhu, Y. Zou, “DeepSeek LLM: Scaling open-source language models with longtermism,” arXiv preprint arXiv: 2401.02954, 2024.
    [17]
    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv: 2302.13971, 2023.
    [18]
    D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang, “DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models,” in Proc. 62nd Annu. Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 1280–1297.
    [19]
    DeepSeek-AI, “DeepSeek-V3 technical report,” arXiv preprint arXiv: 2412.19437, 2024.
    [20]
    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv: 2402.03300, 2024.
    [21]
    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang, “DeepSeek-Coder: When the large language model meets programming–the rise of code intelligence,” arXiv preprint arXiv: 2401.14196, 2024.
    [22]
    DeepSeek-AI, Q. Zhu, D. Guo, Z. Shao, D. Yang, P. Wang, R. Xu, Y. Wu, Y. Li, H. Gao, S. Ma, W. Zeng, X. Bi, Z. Gu, H. Xu, D. Dai, K. Dong, L. Zhang, Y. Piao, Z. Gou, Z. Xie, Z. Hao, B. Wang, J. Song, D. Chen, X. Xie, K. Guan, Y. You, A. Liu, Q. Du, W. Gao, X. Lu, Q. Chen, Y. Wang, C. Deng, J. Li, C. Zhao, C. Ruan, F. Luo, and W. Liang, “DeepSeek-Coder-V2: Breaking the barrier of closed-source models in code intelligence,” arXiv preprint arXiv: 2406.11931, 2024.
    [23]
Common Crawl, “Common Crawl maintains a free, open repository of web crawl data that can be used by anyone,” 2025. [Online]. Available: https://commoncrawl.org.
    [24]
    H. Xin, D. Guo, Z. Shao, Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang, “DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,” arXiv preprint arXiv: 2405.14333, 2024.
    [25]
    H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, H. Zhang, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan, “DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and Monte-Carlo tree search,” in Proc. 13th Int. Conf. Learning Representations, Singapore, Singapore, 2025.
    [26]
    H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, Y. Sun, C. Deng, H. Xu, Z. Xie, and C. Ruan, “DeepSeek-VL: Towards real-world vision-language understanding,” arXiv preprint arXiv: 2403.05525, 2024.
    [27]
    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 3992–4003.
    [28]
    Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” in Proc. 17th European Conf. Computer Vision, Tel Aviv, Israel, 2022, pp. 280–296.
    [29]
    Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, Z. Xie, Y. Wu, K. Hu, J. Wang, Y. Sun, Y. Li, Y. Piao, K. Guan, A. Liu, X. Xie, Y. You, K. Dong, X. Yu, H. Zhang, L. Zhao, Y. Wang, and C. Ruan, “DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding,” arXiv preprint arXiv: 2412.10302, 2024.
    [30]
    C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, and P. Luo, “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” arXiv preprint arXiv: 2410.13848, 2024.
    [31]
    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 11941–11952.
    [32]
    P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan, “Autoregressive model beats diffusion: Llama for scalable image generation,” arXiv preprint arXiv: 2406.06525, 2024.
    [33]
    Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, X. Yu, L. Zhao, Y. Wang, J. Liu, and C. Ruan, “JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation,” arXiv preprint arXiv: 2411.07975, 2024.
    [34]
    X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-Pro: Unified multimodal understanding and generation with data and model scaling,” arXiv preprint arXiv: 2501.17811, 2025.
    [35]
    A. Temsah, K. Alhasan, I. Altamimi, A. Jamal, A. Al-Eyadhy, K. H. Malki, and M.-H. Temsah, “DeepSeek in healthcare: Revealing opportunities and steering challenges of a new open-source artificial intelligence frontier,” Cureus, vol. 17, p. 2, Feb. 2025. doi: 10.18605/2175-7275/cereus.v16n4p2-16
    [36]
    O. Arabiat, “DeepSeek AI in accounting: Opportunities and challenges in intelligent automation,” 2025.
    [37]
    L. Qian, W. Zhou, Y. Wang, X. Peng, J. Huang, Q. Xie, and J. Nie, “Fino1: On the transferability of reasoning enhanced LLMs to finance,” arXiv preprint arXiv: 2502.08127, 2025.
    [38]
    M. Zhao, H. He, M. Zhou, Y. Han, X. Song, and Y. Zhou, “Evaluating the readability and quality of AI-generated scoliosis education materials: A comparative analysis of five language models,” 2025.
    [39]
    W. Zhang, X. Lei, Z. Liu, N. Wang, Z. Long, P. Yang, J. Zhao, M. Hua, C. Ma, K. Wang, and S. Lian, “Safety evaluation of DeepSeek models in Chinese contexts,” arXiv preprint arXiv: 2502.11137, 2025.
    [40]
    M. N.-U.-R. Chowdhury, A. Haque, and I. Ahmed, “DeepSeek vs. ChatGPT: A comparative analysis of performance, efficiency, and ethical AI considerations,” TechRxiv, 2025, DOI: 10.36227/techrxiv.173929663.35290537/v1.
    [41]
    Artificial analysis. [Online]. Available: https://artificialanalysis.ai/. Accessed on: Mar. 4, 2025.
    [42]
    N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “LiveCodeBench: Holistic and contamination free evaluation of large language models for code,” arXiv preprint arXiv: 2403.07974, 2024.
    [43]
    M. Tian, L. Gao, S. D. Zhang, X. Chen, C. Fan, X. Guo, R. Haas, P. Ji, K. Krongchon, Y. Li, S. Liu, D. Luo, Y. Ma, H. Tong, K. Trinh, C. Tian, Z. Wang, B. Wu, S. Yin, M. Zhu, K. Lieret, Y. Lu, G. Liu, Y. Du, T. Tao, O. Press, J. Callan, E. A. Huerta, and H. Peng, “SciCode: A research coding benchmark curated by scientists,” in Proc. 38th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2024, pp. 30624–30650.
    [44]
    M. Jia, “AIME 2024 dataset,” 2024. [Online]. Available: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024. Accessed on: Mar. 4, 2025.
    [45]
    HuggingFaceH4, “MATH-500 dataset,” 2024. [Online]. Available: https://huggingface.co/datasets/HuggingFaceH4/MATH-500/blob/main/README.md. Accessed on: Mar. 4, 2025.
    [46]
    Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen, “MMLU-Pro: A more robust and challenging multi-task language understanding benchmark,” arXiv preprint arXiv: 2406.01574, 2024.
    [47]
    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “GPQA: A graduate-level google-proof Q&A benchmark,” arXiv preprint arXiv: 2311.12022, 2023.
    [48]
    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, M. Choi, A. Agrawal, A. Chopra, A. Khoja, R. Kim, R. Ren, J. Hausenloy, O. Zhang, M. Mazeika, S. Yue, A. Wan, and D. Hendrycks, “Humanity’s last exam,” arXiv preprint arXiv: 2501.14249, 2025.
    [49]
    N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. Le, Y. Wu, Z. Chen, and C. Cui, “GLaM: Efficient scaling of language models with mixture-of-experts,” in Proc. 39th Int. Conf. Machine Learning, Baltimore, USA, 2022, pp. 5547–5569.
    [50]
    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 1, p. 120, 2022.
    [51]
    D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” in Proc. 9th Int. Conf. Learning Representations, Austria, 2021.
    [52]
Anthropic, “Claude 3.5 Sonnet,” 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-5-sonnet
    [53]
    xAI. xAI official website. [Online]. Available: https://x.ai/. Accessed on: Mar. 3, 2025.
    [54]
    Qwen Team, “Qwen2.5 technical report,” arXiv preprint arXiv: 2412.15115, 2024.
    [55]
    Z. Jiang, H. Lin, Y. Zhong, Q. Huang, Y. Chen, Z. Zhang, Y. Peng, X. Li, C. Xie, S. Nong, Y. Jia, S. He, H. Chen, Z. Bai, Q. Hou, S. Yan, D. Zhou, Y. Sheng, Z. Jiang, H. Xu, H. Wei, Z. Zhang, P. Nie, L. Zou, S. Zhao, L. Xiang, Z. Liu, Z. Li, X. Jia, J. Ye, X. Jin, and X. Liu, “MegaScale: Scaling large language model training to more than 10,000 GPUs,” in Proc. 21st USENIX Symp. Networked Systems Design and Implementation, Santa Clara, USA, 2024, pp. 745–760.
    [56]
    Q. Chen, Q. Hu, G. Wang, Y. Xiong, T. Huang, X. Chen, Y. Gao, H. Yan, Y. Wen, T. Zhang, and P. Sun, “Lins: Reducing communication overhead of ZeRo for efficient LLM training,” in Proc. IEEE/ACM 32nd Int. Symp. Quality of Service, Guangzhou, China, 2024, pp. 1–10.
    [57]
    Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Proc. 33rd Int. Conf. Neural Information Processing Systems, 2019, pp. 10.
    [58]
    D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Operating Systems Principles, Huntsville, Canada, 2019, pp. 1–15.
    [59]
    M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv: 1909.08053, 2019.
    [60]
    Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li, “PyTorch FSDP: Experiences on scaling fully sharded data parallel,” Proc. VLDB Endow., vol. 16, no. 12, pp. 3848–3860, Aug. 2023. doi: 10.14778/3611540.3611569
    [61]
    A. Waswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6000–6010.
    [62]
    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, Feb. 2024. doi: 10.1016/j.neucom.2023.127063
    [63]
    Chameleon Team, “Chameleon: Mixed-modal early-fusion foundation models,” arXiv preprint arXiv: 2405.09818, 2024.
    [64]
    X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, Y. Zhao, Y. Ao, X. Min, T. Li, B. Wu, B. Zhao, B. Zhang, L. Wang, G. Liu, Z. He, X. Yang, J. Liu, Y. Lin, T. Huang, and Z. Wang, “Emu3: Next-token prediction is all you need,” arXiv preprint arXiv: 2409.18869, 2024.
    [65]
    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, pp. 1800.
    [66]
    W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” Trans. Mach. Learn. Res., vol. 2023, 2023.
    [67]
    Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen, “ToRA: A tool-integrated reasoning agent for mathematical problem solving,” in Proc. 12th Int. Conf. Learning Representations, Vienna, Austria, 2024, pp. 1–34.
    [68]
    T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, S. Levine, and Y. Ma, “SFT memorizes, RL generalizes: A comparative study of foundation model post-training,” in Proc. 2nd Conf. Parsimony and Learning, Stanford, USA, 2025.
    [69]
    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv: 1707.06347, 2017.
    [70]
    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in Proc. 4th Int. Conf. Learning Representations, San Juan, Puerto Rico, 2016.
    [71]
    D. Mikhail, A. Farah, J. Milad, W. Nassrallah, A. Mihalache, D. Milad, F. Antaki, M. Balas, M. M. Popovic, A. Feo, R. H. Muni, P. A. Keane, and R. Duval, “Performance of DeepSeek-R1 in ophthalmology: An evaluation of clinical decision-making and cost-effectiveness,” medRxiv, 2025, DOI: 10.1101/2025.02.10.25322041.
    [72]
    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in Proc. 5th Int. Conf. Learning Representations, Toulon, France, 2017.
    [73]
    L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai, “Auxiliary-loss-free load balancing strategy for mixture-of-experts,” in Proc. 13th Int. Conf. Learning Representations, Singapore, Singapore, 2025.
    [74]
    M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen, “Efficient training of language models to fill in the middle,” arXiv preprint arXiv: 2207.14255, 2022.
    [75]
    C. Donahue, M. Lee, and P. Liang, “Enabling language models to fill in the blanks,” in Proc. 58th Annu. Meeting of the Association for Computational Linguistics, 2020, pp. 2492–2501.
    [76]
    W. An, X. Bi, G. Chen, S. Chen, C. Deng, H. Ding, K. Dong, Q. Du, W. Gao, K. Guan, J. Guo, Y. Guo, Z. Fu, Y. He, P. Huang, J. Li, W. Liang, X. Liu, X. Liu, Y. Liu, Y. Liu, S. Lu, X. Lu, X. Nie, T. Pei, J. Qiu, H. Qu, Z. Ren, Z. Sha, X. Su, X. Sun, Y. Tan, M. Tang, S. Wang, Y. Wang, Y. Wang, Z. Xie, Y. Xiong, Y. Xu, S. Ye, S. Yu, Y. Zha, L. Zhang, H. Zhang, M. Zhang, W. Zhang, Y. Zhang, C. Zhao, Y. Zhao, S. Zhou, S. Zhou, and Y. Zou, “Fire-Flyer AI-HPC: A cost-effective software-hardware co-design for deep learning.” in Proc. Int. Conf. High Performance Computing, Networking, Storage, and Analysis, Atlanta, USA, 2024, pp. 83.
    [77]
    C. E. Leiserson, “Fat-trees: Universal networks for hardware-efficient supercomputing,” IEEE Trans. Computers, vol. C-34, no. 10, pp. 892–901, Oct. 1985. doi: 10.1109/TC.1985.6312192
    [78]
    J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven, highly-scalable dragonfly topology,” ACM SIGARCH Comput. Archit. News, vol. 36, no. 3, pp. 77–88, Jun. 2008. doi: 10.1145/1394608.1382129
    [79]
    NVIDIA, “NVIDIA collective communications library (NCCL),” 2017.
    [80]
    J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proc. 26th ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
    [81]
    H. Subramoni, P. Lai, M. Luo, and D. K. Panda, “RDMA over Ethernet—a preliminary study,” in Proc. IEEE Int. Conf. Cluster Computing and Workshops, New Orleans, USA, 2009, pp. 1–9.
    [82]
    WEKA, “WEKA architectural whitepaper,” 2025. [Online]. Available: https://www.weka.io/resources/white-paper/wekaio-architectural-whitepaper/. Accessed on Feb. 23, 2025.
    [83]
    Z. Liang, J. Lombardi, M. Chaarawi, and M. Hennecke, “DAOS: A scale-out high performance storage stack for storage class memory,” in Proc. Supercomputing Frontiers: 6th Asian Conf., Singapore, Singapore, 2020, pp. 40–54.
    [84]
    J. Terrace and M. J. Freedman, “Object storage on CRAQ: High-throughput chain replication for read-mostly workloads,” in Proc. USENIX Annu. Technical Conf., San Diego, USA, 2009, pp. 11.
    [85]
    HFAiLab, “Hai platform,” [Online]. Available: https://github.com/HFAiLab/hai-platform. Accessed on Feb. 23, 2025.
    [86]
    H. Alghamdi and A. Mostafa, “Advancing EHR analysis: Predictive medication modeling using LLMs,” Inf. Syst., vol. 131, p. 102528, Jun. 2025. doi: 10.1016/j.is.2025.102528
    [87]
    G. Mondillo, S. Colosimo, A. Perrotta, V. Frattolillo, and M. Masino, “Comparative evaluation of advanced AI reasoning models in pediatric clinical decision support: ChatGPT O1 vs. DeepSeek-R1,” medRxiv, 2025, DOI: 10.1101/2025.01.27.25321169.
    [88]
    L. F. de Paiva, G. Luijten, B. Puladi, and J. Egger, “How does DeepSeek-R1 perform on USMLE?” medRxiv, 2025, DOI: 10.1101/2025.02.06.25321749.
    [89]
    M. Zhou, Y. Pan, Y. Zhang, X. Song, and Y. Zhou, “Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and discern quality across ChatGPT and DeepSeek models,” Int. J. Med. Inform., vol. 198, p. 105871, Jun. 2025. doi: 10.1016/j.ijmedinf.2025.105871
    [90]
    A. Choudhury, Y. Shahsavar, and H. Shamszare, “User intent to use DeepSeek for healthcare purposes and their trust in the large language model: Multinational survey study,” arXiv preprint arXiv: 2502.17487, 2025.
    [91]
    G. De Vito, F. Ferrucci, and A. Angelakis, “LLMs for drug-drug interaction prediction: A comprehensive comparison,” arXiv preprint arXiv: 2502.06890, 2025.
    [92]
    S. Hasan and S. Basak, “Open-source AI-powered optimization in scalene: Advancing python performance profiling with DeepSeek-R1 and LLaMA 3.2,” arXiv preprint arXiv: 2502.10299, 2025.
    [93]
    H. Zeng, D. Jiang, H. Wang, P. Nie, X. Chen, and W. Chen, “ACECODER: Acing coder RL via automated test-case synthesis,” arXiv preprint arXiv: 2502.01718, 2025.
    [94]
    M. Vero, N. Mündler, V. Chibotaru, V. Raychev, M. Baader, N. Jovanović, J. He, and M. Vechev, “BaxBench: Can LLMs generate correct and secure backends?” in Proc. 13th Int. Conf. Learning Representations, Singapore, Singapore, 2025.
    [95]
    D. Zhang, J. Wang, and T. Sun, “Building a proof-oriented programmer that is 64% better than GPT-4o under data scarcity,” arXiv preprint arXiv: 2502.11901, 2025.
    [96]
    Y.-C. Yu, T.-H. Chiang, C.-W. Tsai, C.-M. Huang, and W.-K. Tsao, “Primus: A pioneering collection of open-source datasets for cybersecurity LLM training,” arXiv preprint arXiv: 2502.11191, 2025.
    [97]
    J. Chen, G. Tang, G. Zhou, and W. Zhu, “ChatGPT and DeepSeek: Can they predict the stock market and macroeconomy?” arXiv preprint arXiv: 2502.10008, 2025.
    [98]
    D. Krause, “DeepSeek and FinTech: The democratization of AI and its global implications,” Available at SSRN 5116322, 2025.
    [99]
    K. T. Kotsis, “ChatGPT and DeepSeek evaluate one another for science education,” EIKI J. Eff. Teach. Methods, vol. 3, no. 1, 2025.
    [100]
    N. Kerimbayev, Z. Menlibay, M. Garvanova, S. Djaparova, and V. Jotsov, “A comparative analysis of generative AI models for improving learning process in higher education,” in Proc. Int. Conf. Automatics and Informatics (ICAI), Varna, Bulgaria, 2024, pp. 271–276.
    [101]
    Y. Lin, S. Tang, B. Lyu, J. Wu, H. Lin, K. Yang, J. Li, M. Xia, D. Chen, S. Arora, and C. Jin, “Goedel-Prover: A Frontier model for open-source automated theorem proving,” arXiv preprint arXiv: 2502.07640, 2025.
    [102]
    K. Dong and T. Ma, “Beyond limited data: Self-play LLM theorem provers with iterative conjecturing and proving,” arXiv preprint arXiv: 2502.00212, 2025.
    [103]
    J. O. J. Leang, G. Hong, W. Li, and S. B. Cohen, “Theorem prover as a judge for synthetic data generation,” arXiv preprint arXiv: 2502.13137, 2025.
    [104]
    J. Boye and B. Moell, “Large language models and mathematical reasoning failures,” arXiv preprint arXiv: 2502.11574, 2025.
    [105]
    J. Zhong, Z. Li, Z. Xu, X. Wen, and Q. Xu, “Dyve: Thinking fast and slow for dynamic process verification,” arXiv preprint arXiv: 2502.11157, 2025.
    [106]
    Y. Cao, H. Liu, A. Arora, I. Augenstein, P. Röttger, and D. Hershcovich, “Specializing large language models to simulate survey response distributions for global populations,” arXiv preprint arXiv: 2502.07068, 2025.
    [107]
    J. Fu, X. Ge, K. Zheng, I. Arapakis, X. Xin, and J. M. Jose, “LLMPopcorn: An empirical study of LLMs as assistants for popular micro-video generation,” arXiv preprint arXiv: 2502.12945, 2025.
    [108]
    S. Zhang, X. Wang, W. Zhang, C. Li, J. Song, T. Li, L. Qiu, X. Cao, X. Cai, W. Yao, W. Zhang, X. Wang, and Y. Wen, “Leveraging dual process theory in language agent framework for real-time simultaneous human-AI collaboration,” arXiv preprint arXiv: 2502.11882, 2025.
    [109]
    A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller, “Augmenting large language models with chemistry tools,” Nat. Mach. Intell., vol. 6, no. 5, pp. 525–535, May 2024. doi: 10.1038/s42256-024-00832-8
    [110]
    C. Chakraborty, M. Bhattacharya, S. Pal, S. Chatterjee, A. Das, and S.-S. Lee, “Ai-enabled language models (LMs) to large language models (LLMs) and multimodal large language models (MLLMS) in drug discovery and development,” J. Adv. Res., 2025, DOI: 10.1016/j.jare.2025.02.011.
    [111]
    Y. Peng, B. A. Malin, J. F. Rousseau, Y. Wang, Z. Xu, X. Xu, C. Weng, and J. Bian, “From GPT to DeepSeek: Significant gaps remain in realizing AI in healthcare,” J. Biomed. Inf., vol. 163, p. 104791, Mar. 2025. doi: 10.1016/j.jbi.2025.104791
    [112]
    Z. Gan, Y. Lu, D. Zhang, H. Li, C. Liu, J. Liu, J. Liu, H. Wu, C. Fu, Z. Xu, R. Zhang, and Y. Dai, “MME-Finance: A multimodal finance benchmark for expert-level understanding and reasoning,” arXiv preprint arXiv: 2411.03314, 2024.
    [113]
    “DeepSeek course,” 2025. [Online]. Available: https://www.nobleprog.com.au/cc/DeepSeekedu.
    [114]
    “Awesome DeepSeek integrations,” 2025. [Online]. Available: https://github.com/DeepSeek-ai/awesome-DeepSeek-integration.
    [115]
    A. Arrieta, M. Ugarte, P. Valle, J. A. Parejo, and S. Segura, “o3-mini vs DeepSeek-R1: Which one is safer?” arXiv preprint arXiv: 2501.18438, 2025.
    [116]
    H. Zhang, H. Gao, Q. Hu, G. Chen, L. Yang, B. Jing, H. Wei, B. Wang, H. Bai, and L. Yang, “ChineseSafe: A Chinese benchmark for evaluating safety in large language models,” arXiv preprint arXiv: 2410.18491, 2024.
    [117]
    K. Zhou, C. Liu, X. Zhao, S. Jangam, J. Srinivasa, G. Liu, D. Song, and X. E. Wang, “The hidden risks of large reasoning models: A safety assessment of R1,” arXiv preprint arXiv: 2502.12659, 2025.
    [118]

    Highlights

    • This paper provides a comprehensive review of the entire DeepSeek model family, summarizing the core innovations in their development processes, including data processing, training, and infrastructure, and comparing them with traditional counterparts.
    • This paper analyzes how existing DeepSeek models are applied to various downstream tasks and examines the potential security, privacy, and ethical concerns associated with their deployment.
    • This paper outlines potential future directions for the development of the DeepSeek model family, offering insights to guide further research and applications in the field.
