Yekun Chai

Contact:
chaiyekun (at) gmail.com

I am a staff research engineer working on large language models (LLMs) at Baidu NLP. Before that, I was with the Institute of Automation, Chinese Academy of Sciences (CASIA). I graduated from Edinburgh Informatics in 2018 under the supervision of Adam Lopez and Naomi Saphra.

My research centers on the generative pre-training paradigm of NLP, with a particular emphasis on:

General language model pre-training, prompting, instruction tuning, and their variants across tasks, languages, and modalities;
LLM alignment with human preferences;
Augmented LLMs with non-parametric priors.

news

Feb 20, 2024	One paper on HumanEval-XL, a multilingual code generation benchmark, has been accepted to LREC-COLING 2024. We’ve released the code and data!
Jan 16, 2024	One paper on reward models with tool-augmented feedback has been accepted to ICLR 2024 (spotlight). Dive into our research and code now!
Sep 23, 2023	One paper on XAI has been accepted to NeurIPS 2023 Datasets and Benchmarks Track. Code is available here.
May 02, 2023	ERNIE-Code on multilingual text and code pre-training has been accepted to ACL 2023 Findings. Check our code and models.

selected publications

preprint
Dual Modalities of Text: Visual and Textual Generative Pre-training

Yekun Chai, Qingyi Liu^, Jingwu Xiao^, Shuohuan Wang, Yu Sun, and Hua Wu

2024

@misc{chai2024pixelgpt,
  title = {Dual Modalities of Text: Visual and Textual Generative Pre-training},
  author = {Chai, Yekun and Liu^, Qingyi and Xiao^, Jingwu and Wang, Shuohuan and Sun, Yu and Wu, Hua},
  year = {2024},
  eprint = {2404.10710},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}
preprint
On Training Data Influence of GPT Models

Qingyi Liu*^, Yekun Chai*, Shuohuan Wang, Yu Sun, Qiwei Peng, Keze Wang, and Hua Wu

2024

@misc{gptfluence2024training,
  title = {On Training Data Influence of GPT Models},
  author = {Liu*^, Qingyi and Chai*, Yekun and Wang, Shuohuan and Sun, Yu and Peng, Qiwei and Wang, Keze and Wu, Hua},
  year = {2024},
  eprint = {2404.07840},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}

LREC-COLING

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Qiwei Peng*, Yekun Chai*, and Xuhong Li

2024

@misc{he-xl-2024-pcl,
  title = {HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization},
  author = {Peng*, Qiwei and Chai*, Yekun and Li, Xuhong},
  year = {2024},
  eprint = {2402.16694},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}

ICLR Spotlight
Tool-Augmented Reward Modeling

Lei Li*^, Yekun Chai*, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, and Hua Wu

In The Twelfth International Conference on Learning Representations, 2024

@inproceedings{li2024toolaugmented,
  title = {Tool-Augmented Reward Modeling},
  author = {Li*^, Lei and Chai*, Yekun and Wang, Shuohuan and Sun, Yu and Tian, Hao and Zhang, Ningyu and Wu, Hua},
  booktitle = {The Twelfth International Conference on Learning Representations},
  year = {2024},
  url = {https://openreview.net/forum?id=d94x0gWTUX},
}
ACL Findings
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
@inproceedings{chai-etal-2023-ernie-code,
  url = {https://aclanthology.org/2023.findings-acl.676},
  title = {{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages},
  author = {Chai, Yekun and Wang, Shuohuan and Pang, Chao and Sun, Yu and Tian, Hao and Wu, Hua},
  editor = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
  month = jul,
  year = {2023},
  address = {Toronto, Canada},
  publisher = {Association for Computational Linguistics},
  doi = {10.18653/v1/2023.findings-acl.676},
  pages = {10628--10650}
}