DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
The rapid development of open-source large language models (LLMs) has been
truly remarkable. However, the scaling laws reported in previous literature reach
varying conclusions, which casts doubt on scaling up LLMs. We delve into the
study of scaling laws and present our distinctive findings that facilitate the
scaling of large-scale models in two commonly used open-source configurations,
7B and 67B. Guided by the scaling laws, we introduce DeepSeek
LLM, a project dedicated to advancing open-source language models with a
long-term perspective. To support the pre-training phase, we have developed a
dataset that currently consists of 2 trillion tokens and is continuously
expanding. We further conduct supervised fine-tuning (SFT) and Direct
Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the
creation of DeepSeek Chat models. Our evaluation results demonstrate that
DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in
the domains of code, mathematics, and reasoning. Furthermore, open-ended
evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance
compared to GPT-3.5.
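For readers unfamiliar with the Direct Preference Optimization step mentioned above, the following is a minimal sketch of the generic DPO objective (the published loss, not DeepSeek's training code); the tensor names and the beta value are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO objective: push the policy to prefer the chosen response
    over the rejected one, measured relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Each argument is a per-example sum of token log-probabilities for the chosen or rejected response; only the policy model receives gradients.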
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model
characterized by economical training and efficient inference. It comprises 236B
total parameters, of which 21B are activated for each token, and supports a
context length of 128K tokens. DeepSeek-V2 adopts innovative architectures
including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA enables
efficient inference by significantly compressing the Key-Value (KV) cache
into a latent vector, while DeepSeekMoE enables training strong models at an
economical cost through sparse computation. Compared with DeepSeek 67B,
DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of
training costs, reducing the KV cache by 93.3%, and boosting the maximum
generation throughput to 5.76 times that of DeepSeek 67B. We pretrain DeepSeek-V2 on a high-quality
and multi-source corpus consisting of 8.1T tokens, and further perform
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock
its potential. Evaluation results show that, even with only 21B activated
parameters, DeepSeek-V2 and its chat versions still achieve top-tier
performance among open-source models.
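As a rough illustration of the KV-cache saving attributed to MLA above: instead of caching full per-head keys and values, only a small shared latent vector per token is stored, and keys/values are re-expanded from it at attention time. The module name, dimensions, and projection layout below are assumptions for the sketch, not DeepSeek-V2's actual architecture or configuration.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy sketch of MLA-style KV compression: cache a small per-token
    latent instead of full per-head keys and values."""
    def __init__(self, d_model: int = 1024, d_latent: int = 64, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress hidden state
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # re-expand to keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # re-expand to values

    def compress(self, hidden: torch.Tensor) -> torch.Tensor:
        # Only this (batch, seq, d_latent) tensor needs to be cached,
        # instead of (batch, seq, n_heads, d_head) keys plus values.
        return self.down(hidden)

    def expand(self, latent: torch.Tensor):
        b, t, _ = latent.shape
        k = self.up_k(latent).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(latent).view(b, t, self.n_heads, self.d_head)
        return k, v
```

In this toy setting the cached state per token shrinks from 2 × d_model values (keys and values) to d_latent values, which is the kind of reduction the abstract's 93.3% KV-cache figure refers to, though the real architecture differs in detail.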
