Yekun's Note

Machine learning notes and writeup.


Large Language Models for Programming Languages

A note on code pre-trained language models (PLMs).


Model Source #params L2R LM Mask LM seq2seq LM Code structure Warmup tokenizer Model #PLs Data
CuBERT ICML’20 (Google) 345M ✔️ - python tokenizer BERT-large 1 7.4M Python files
CodeBERT EMNLP’20 findings (MSRA) 125M ✔️ ✔️ BBPE fine-tuned RoBERTa 6 CodeSearchNet (6 PL languages)
GPT-C ESEC/FSE’20 (MS) 366M ✔️ - BBPE GPT-2 variant 1/multi monolingual/multilingual PLs
CodeGPT (MSRA) 124M ✔️ both BBPE GPT-2 variant 1 from GitHub
PLBART NAACL’21 140M ✔️ - SentencePiece BART-base 2 470M Java, 219M Python, 47M NL.
CodeT5 EMNLP’21 (Salesforce/NTU) 220M (T5 base) 60M (T5 small) ✔️ identifier - BBPE T5-base 6+2 8.35M instances (CodeSearchNet/Collected)
UniXcoder ACL’22 (MSRA) ~110M (BERT-base) ✔️ ✔️ - BBPE BERT-base 6 CodeSearchNet
DOBF NeurIPS’21 (FAIR France) base ✔️ code obfuscation both BBPE CodeBERT init/ from scratch 2 Python/Java files in GitHub repo from Google BigQuery
GraphCodeBERT ICLR’21 (MSRA) 125M ✔️ data flow ✔️ BBPE CodeBERT init. 6 CodeSearchNet (6 PLs)
SynCoBERT AAAI-22 125M ✔️ ✔️ ✔️ BBPE CodeBERT init. 6 CodeSearchNet
CodeParrot Hugging Face 1.5B ✔️ - - BBPE GPT-2 1 20M Python files from the Google BigQuery GitHub database
GPT-Neo EleutherAI 2.7B ✔️ - - BBPE Transformer Decoder - Mix
GPT-NeoX EleutherAI 20B ✔️ - - BBPE GPT-NeoX - Mix
GPT-J (open source) 6B ✔️ - - BBPE GPT - Mix
PolyCoder CMU 2.7B ✔️ - - BBPE GPT-2 12 24M files (12 PLs).
Codex OpenAI 12B ✔️ - ✔️ BBPE fine-tuned GPT-3 1 159GB python files after filtering.
AlphaCode DeepMind 41B / 9B ✔️ - - SentencePiece enc-dec 12 715.1 GB after filtering.
Google’s (Austin 2021) Google Research 137B ✔️ - - SentencePiece Decoder - mixed (2.97B documents)
InCoder Meta AI 6.7B ✔️ - - BBPE Decoder 28 1TB -> 250GB. GitHub and GitLab via API.
CodeGen Salesforce 16.1B ✔️ - - BBPE Decoder Multi GitHub
PaLM-Coder Google Research 540B ✔️ - - SentencePiece Decoder Multi Mixed
StarCoder BigCode project 15.5B ✔️ - - BPE Decoder 86 The Stack (GitHub)

Evaluation task

  • Program understanding: code search, program repair, bug detection and localization.
  • Program generation: code completion, program synthesis, code summarization, source code to pseudo-code mapping, API-sequence prediction, natural language to code mapping, document generation.

Code token types: local variables, methods or APIs, arguments, punctuation, language keywords, delimiters.

CodeXGLUE[5] includes 14 datasets, consisting of 10 diversified PL understanding and generation tasks.

  • code-code:
    1. Clone detection: Measure the semantic similarity between codes. It includes two subtasks: binary classification between a pair of codes and code retrieval, where the goal is to find semantically similar codes.
    2. Defect detection: The objective is to identify whether a body of source code contains defects that may be used to attack software systems, such as resource leaks, use-after-free vulnerabilities, and DoS attacks.
    3. Cloze test: Predict the masked token of a code snippet. Two subtasks: (1) measure the accuracy of predicting the masked token from the whole vocabulary; (2) test the semantic reasoning ability by distinguishing between “max” and “min”.
    4. Code completion: Predict following tokens based on a code context. Two subtasks: (1) token-level completion: check whether the next token has been predicted correctly; (2) line-level completion: test the goodness of the generated line.
    5. Code repair: to refine the code by fixing the bugs automatically.
    6. Code-to-code translation: translating from one PL to another one.
  • text-code:
    1. NL code search: It measures the semantic relatedness between texts and codes. Two subtasks: (1) Given an NL query, find the most relevant code in a collection of codes; (2) Given a query-code pair, predict whether the code answers the query or not.
    2. Text-to-code generation: generate code from an NL description.
  • code-text:
    1. Code summarization: generate the NL comment for a code.
  • text-text:
    1. Documentation translation: translate code documentation from one NL to another one.

Code PLMs


Background: Before CuBERT, there had been no attempt to obtain high-quality contextual embeddings of source code and to evaluate them on multiple program-understanding tasks simultaneously. That is the gap CuBERT aims to fill.

CuBERT[1] (code understanding BERT) presents the first attempt at pre-training on (Python) source code.


  • Pre-training data: [1] curated 7.4 million Python files with a total of 9.3 billion tokens (1.6 billion unique).
  • Tokenization: first tokenize the Python program using the standard Python tokenizer (the tokenize package); then greedily compress the tokens into a subword vocabulary using the SubwordTextEncoder in the Tensor2Tensor project, resulting in ~50k tokens.
  • Vocabulary size: ~50K.
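
The first tokenization step can be illustrated with the standard library alone. This is a minimal sketch, not the CuBERT pipeline itself: the helper name `python_lexical_tokens` is hypothetical, layout tokens are simply dropped here, and the second step (the Tensor2Tensor subword encoder) is not reproduced.

```python
import io
import tokenize

# Layout-only token types that this sketch drops (an assumption made for
# readability; a real pipeline may encode them differently).
_SKIP = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
         tokenize.DEDENT, tokenize.ENDMARKER}


def python_lexical_tokens(source: str) -> list[str]:
    """Return the token strings produced by Python's standard tokenizer."""
    reader = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIP]


tokens = python_lexical_tokens("def add(a, b):\n    return a + b\n")
print(tokens)
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
```

A subword encoder would then split rare tokens from this list into smaller units to keep the vocabulary near the ~50K budget.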


  • Model config: BERT-large models.
  • Training details: linear warmup over the first 10% of examples.
  • Pre-training task: masked language model (MLM); next sentence prediction (NSP).
  • Models and datasets


It shows that CuBERT can match the baselines trained with full data and many more epochs while using only 33% of the labeled data and only 2 epochs.


Background: The success of PLMs drives the surge of multi-modal pre-training, in which models are learned from bimodal data.

CodeBERT[2] is a bimodal PLM for natural language (NL) and programming language (PL).
It is the first large NL-PL PLM for multiple PLs.


  • Pre-training data: CodeSearchNet[3] (1) bimodal data of NL-PL pairs: 2.1M datapoints; (2) large amount of unimodal code data without paired documents: 6.4M codes across six PLs (Python, Java, JavaScript, PHP, Ruby, and Go).

  • Tokenization: RoBERTa BBPE.


  • Model config: RoBERTa_base. (125M params)
  • Pre-training task:
    (1) MLM;
    (2) Replaced token detection (RTD). Different from ELECTRA, it uses n-gram LMs as PL and NL generator as shown in the figure.
  • Fine-tuning settings: (1) For NL code search, it fine-tunes the pre-trained CodeBERT; (2) For code-to-text generation, it uses an encoder-decoder model and initializes the encoder with CodeBERT.


  • NL code search: “Init=S” (trained from scratch) vs. “Init=R” (initialized from RoBERTa) $\rightarrow$ the RoBERTa warmup confers a performance gain. This observation differs from OpenAI Codex!
  • NL-PL probing (zero-shot): construct a dataset in which the model fills in a keyword from {max, maximize, min, minimize, less, greater}.
  • Code documentation generation: [2] initializes the encoder of the encoder-decoder framework with CodeBERT and evaluates the results with the smoothed BLEU score.
  • Generalization to an unseen PL: tested on C#, which is not seen during pre-training, CodeBERT performs worse than code2seq, which uses the compositional paths of the abstract syntax tree (AST). The AST experiments of CodeBERT fail.


CodeGPT[5] is a variant of GPT-2 (L12/A12/H768). [5] trained both from scratch with newly obtained vocabularies and from GPT-2 initialization with original vocabularies (termed CodeGPT-adapted).

  • Pre-training data: monolingual data on Python and Java in the CodeSearchNet dataset, including 1.1M Python functions and 1.6M Java methods.
  • Huggingface model: CodeGPT-small, CodeGPT-small-java-adapted.


  • Background: The majority of argument completion in code completion systems only works when the name of the method or API call is already typed in, thus leaving the task of completing method calls to software developers.
  • Cons: Previously existing code completion tools have focused on specific token types or features, often failing to have a holistic view of the surrounding context.
  • Motivating example: The example below shows a method completion and an argument completion in C# served by the IntelliCode extension in the Visual Studio IDE, and the whole-line code completion generated by IntelliCode Compose.

GPT-C [4] (i.e., IntelliCode Compose), a variant of GPT-2, can generate syntactically correct code in multiple PLs, capable of completing an entire line of code in a couple of key strokes. It is able to learn to infer types of PL identifiers and long-range code semantics without inputs extracted by means of a static analyzer explicitly passed to the model as features.


[4] collects 52k top-starred (non-fork) projects on GitHub, containing over 4.7M source code files and comprising over 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript.

  • Data split: It splits the data into 70/30 as dev/test on the repository level. The dev set is then split into 80/20 as training/validation set. The final deployed model is re-trained using the entire dataset.

Tokenization (see below figure example):

  1. BPE tokenization. It uses the SentencePiece tokenizer with special tokens for control flow and code structure representations. The control flow tokens <BOF> and <EOF> mark the beginning and ending of a file in order to disambiguate similar identifier names in different files, and <EOL> marks the ending of a line. Since Python uses white-space and indentation to demarcate code scope, [4] introduces <INDENT> and <DEDENT> tokens to represent those scope delimiters.
  2. Splitting PL identifiers using casing conventions. $\rightarrow$ work for PL, not for NL.
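
Splitting identifiers by casing conventions can be sketched with a regular expression. The rules below (snake_case boundaries, camelCase humps, acronym runs, digit runs) are an assumption about what "casing conventions" covers; [4] does not spell out its exact rules, and `split_identifier` is a hypothetical helper name.

```python
import re

# One alternative per kind of word piece:
#   [A-Z]+(?![a-z])  an acronym run such as "HTTP"
#   [A-Z]?[a-z]+     a capitalized or lowercase word such as "Name" or "get"
#   \d+              a run of digits
_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name: str) -> list[str]:
    """Split an identifier on underscores and on case boundaries."""
    parts: list[str] = []
    for chunk in name.split("_"):
        parts.extend(_PIECE.findall(chunk))
    return parts


print(split_identifier("getUserName"))         # ['get', 'User', 'Name']
print(split_identifier("parse_HTTPResponse"))  # ['parse', 'HTTP', 'Response']
```

Such a split is meaningful for PL identifiers but would break ordinary NL words, which is why the note above restricts it to PL.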

Exposing sensitive data through code suggestions. The figure shows an example completion served by the TabNine system exposing irrelevant and potentially sensitive data.

To address this problem, the training should be shielded from inadvertently gaining access to secrets or personally identifiable data. For this reason, [4] identifies and normalizes numeric literals, string literals and comments, including docstrings, to <NUM_LIT>, <STR_LIT>, and <COMMENT> special tokens, respectively.
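
For Python input, this normalization can be approximated with the standard `tokenize` module. The sketch below is an illustration under assumptions, not the pipeline of [4]: `normalize_literals` is a hypothetical helper, docstrings are treated as ordinary string literals rather than as comments, and f-strings are not handled.

```python
import io
import tokenize

# Map token types to the special tokens named in the text.
_REPLACEMENTS = {
    tokenize.NUMBER: "<NUM_LIT>",
    tokenize.STRING: "<STR_LIT>",
    tokenize.COMMENT: "<COMMENT>",
    tokenize.NEWLINE: "<EOL>",
    tokenize.INDENT: "<INDENT>",
    tokenize.DEDENT: "<DEDENT>",
}


def normalize_literals(source: str) -> list[str]:
    """Replace literals and comments with placeholder tokens."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NL, tokenize.ENDMARKER):
            continue  # blank-line markers and end-of-file carry no content
        out.append(_REPLACEMENTS.get(tok.type, tok.string))
    return out


code = 'password = "hunter2"  # secret\nretries = 3\n'
print(normalize_literals(code))
# ['password', '=', '<STR_LIT>', '<COMMENT>', '<EOL>',
#  'retries', '=', '<NUM_LIT>', '<EOL>']
```

After normalization neither the string value nor the comment text reaches the training data, which is the property the paragraph above asks for.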


  • Model config: GPT-2.
    • Best monolingual model: L24, H16, #vocab 50k.
    • Best multilingual model: L26, H16, #vocab 60k.
  • Training details: training from scratch; weight tying;
  • Decoding: beam search.

Code completion system

For a good user experience, a response time under 100ms is necessary to avoid any feeling of delay or lag. To achieve it in a cloud-based model deployment setting, [4] presents caching on the client side. When the user types a non-alphanumeric character, suggestions are queried from the server. Those suggestions, each a list of tokens along with their scores, are stored in a trie placed in the cache. This allows the tree to be pruned efficiently at the character level as the user continues typing. [4] simply traverses this tree greedily by always branching to the node with the highest score.

To preserve accuracy, [4] terminates the completion-tree traversal if none of the child nodes has a score that is equal to or larger than the score of its parent multiplied by a ratio $R$, defined as:

$$R = \frac{\alpha}{1 + e^{-L/\kappa}}$$

where $L$ is the position of the root node of the trie, $\alpha$ is the relaxation factor, and $\kappa$ is the curvature factor. $\alpha$ is used to adjust the values of $R$ for very small or very large values of $L$. A lower value of $\alpha$ would relax the policy producing longer completion suggestions, while a value closer to 1.0 would tighten the policy producing shorter suggestions. $\kappa$ controls the rate of increase of the $R$: A smaller $\kappa$ would give a steeper curve for smaller values of $L$, producing shorter suggestions, while a larger value of $\kappa$ would yield a flatter curve resulting in longer completion suggestion. [4] selects $\alpha=0.8$ and $\kappa=10$ to gain a balance between suggestion length and relevance.
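
The greedy traversal with early termination can be sketched as follows. The sketch assumes a logistic form $R = \alpha / (1 + e^{-L/\kappa})$, which matches the behaviour of $\alpha$ and $\kappa$ described above; the `Node` class and function names are hypothetical and only illustrate the idea.

```python
import math
from dataclasses import dataclass, field


def threshold_ratio(L: int, alpha: float = 0.8, kappa: float = 10.0) -> float:
    """Assumed logistic form of R; grows with L and saturates at alpha."""
    return alpha / (1.0 + math.exp(-L / kappa))


@dataclass
class Node:
    token: str
    score: float
    children: list["Node"] = field(default_factory=list)


def greedy_completion(root: Node, L: int,
                      alpha: float = 0.8, kappa: float = 10.0) -> list[str]:
    """Follow the best-scoring child until its score falls below parent * R."""
    ratio = threshold_ratio(L, alpha, kappa)
    path, node = [], root
    while node.children:
        best = max(node.children, key=lambda child: child.score)
        if best.score < node.score * ratio:
            break  # no child is confident enough: stop the suggestion here
        path.append(best.token)
        node = best
    return path


root = Node("", 1.0, [Node("foo", 0.9, [Node("(", 0.2)]), Node("bar", 0.3)])
print(greedy_completion(root, L=0))  # ['foo']
```

With $L = 0$ the ratio is $0.4$, so `foo` (0.9) is accepted but `(` (0.2 against a threshold of 0.36) is not, and the suggestion ends after one token.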

Multilingual model

  • Tokenization: BPE tokenizer.
  • Evaluation metric: perplexity, ROUGE-L, Levenshtein similarity. The ROUGE-L variant is based on longest common subsequence (LCS) statistics, which take structural similarity into account and identify the longest co-occurring in-sequence n-grams. Levenshtein distance measures how many single-character edits (insertion, substitution, or deletion) it takes to transform one sequence of tokens into another.
  • Online evaluation metric: surfacing rate (SR) and click-through rate (CTR).
    (1) SR is the total number of completions displayed divided by the total number of times a completion could potentially be shown, which is after every character typed into a code document. The SR is not only dependent on the accuracy of the model but also on the typing speed of a user and their network reliability.
    (2) The CTR is defined as the fraction of accepted completions over the total number of completions displayed. The low CTR can be partially attributed to the momentum in typing.
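
The two offline similarity measures rest on classic dynamic programs, sketched below. The normalization in `edit_similarity` (one minus distance over the longer length) is a common convention assumed here, not necessarily the exact definition used in [4]; both functions work on strings as well as on lists of tokens.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, the core of ROUGE-L."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_similarity(a, b) -> float:
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))   # 3
print(lcs_length("ABCBDAB", "BDCABA"))    # 4
```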


  1. Language-agnostic baseline.
  2. Language type-embedding. Add a language-type embedding to the token and position embeddings.
  3. Language-specific control codes. Insert a sequence of tokens in the beginning of each training sample code: “lang %s remaining token sequence”, where lang $\in$ {Python, C#, JavaScript, TypeScript}.
  4. Add a PL classification pre-training task to detect programming languages besides language modeling.


  • Background: Existing code PLMs regard the code snippets as a sequence of tokens, while ignoring the inherent structure of code that contains semantics.

GraphCodeBERT[6] is the first code PLM that uses semantic structure of code to learn code representation. It presents two structure-aware pre-training tasks: (1) data flow edge prediction (to learn from code structure); (2) variable alignment across source code and data flow (to align source code and code structure).

  • Data: CodeSearchNet (6 PLs)

GraphCodeBERT[6] incorporates the code structure of data flow into pre-training. Data flow is a graph that represents dependency relation between variables, in which nodes represent variables and edges represent where the value of each variable comes from.

Data flow

Given a source code, [6] first parses the code into an abstract syntax tree (AST), which includes the syntax structure of the code. The terminal nodes (leaves) are used to identify the variable sequence $V$. Each variable is taken as a node of the graph, and a directed edge $\epsilon = \langle v_i, v_j \rangle$ from $v_i$ to $v_j$ indicates that the value of the $j$-th variable comes from the $i$-th variable. With the set of directed edges $E = \{ \epsilon_1, \epsilon_2, \cdots, \epsilon_l \}$, the graph $\mathcal{G}(C) = (V, E)$ is the data flow that represents the dependency relation between variables in the source code.


[6] concatenates the comment, source code, and variables as the input sequence. It uses a graph-guided masked attention that represents the relation between source code tokens and nodes of the data flow. Let $E'$ be the set of pairs $\langle v_i, c_j \rangle / \langle c_j, v_i \rangle$ such that the variable $v_i$ is identified from the source code token $c_j$; a node and a code token are allowed to attend to each other if and only if $\langle v_i, c_j \rangle / \langle c_j, v_i \rangle \in E'$.

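The graph-guided masked attention matrix $M$ is as follows, where $q_i$ and $k_j$ are the query and key positions, $W$ is the set of comment tokens and $C$ is the set of source code tokens:

$$M_{ij} = \begin{cases} 0 & \text{if } q_i \in \{[CLS], [SEP]\} \text{ or } q_i, k_j \in W \cup C \text{ or } \langle q_i, k_j \rangle \in E \cup E' \\ -\infty & \text{otherwise} \end{cases}$$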


  • Edge prediction: randomly sample 20% of the nodes, mask the edges connected to them by adding infinite values (i.e., $-\infty$) in the mask matrix, then predict these masked edges. The probability of an edge is the dot product of the two node representations followed by a sigmoid function.
  • Node alignment: randomly sample 20% of the nodes, mask the edges between code tokens and the sampled nodes, then predict the masked edges.


The table reports the Mean Reciprocal Rank (MRR) on the CodeSearchNet.
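
For reference, MRR averages the reciprocal of the rank at which the correct code snippet appears for each query. A minimal sketch follows; the function names and the toy candidate lists are hypothetical.

```python
def reciprocal_rank(ranked_candidates, relevant) -> float:
    """1 / rank of the relevant item, or 0 if it is not retrieved at all."""
    for position, candidate in enumerate(ranked_candidates, start=1):
        if candidate == relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results) -> float:
    """results: iterable of (ranked_candidates, relevant_item) pairs."""
    results = list(results)
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


queries = [
    (["a", "b", "c"], "a"),  # relevant at rank 1 -> 1.0
    (["a", "b", "c"], "b"),  # relevant at rank 2 -> 0.5
    (["a", "b", "c"], "z"),  # not retrieved      -> 0.0
]
print(mean_reciprocal_rank(queries))  # 0.5
```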

  • Case study
    After a small change to the input, GraphCodeBERT w/ data flow still makes the correct prediction, while the model w/o data flow cannot.


TransCoder[11] uses a transformer encoder-decoder model to translate between PLs using only monolingual source code; the encoder is initialized from XLM, and the decoder is randomly initialized.

[11] instantiates the pre-training with the following settings of unsupervised transcompilation:

  1. Cross PL model pre-training. The cross-lingual nature comes from the significant number of common tokens (anchor points) that exist across languages. In the context of English-French translation, the anchor points consist essentially of digits and city and people names. In PLs, these anchor points come from common keywords (e.g., for, while, if, try), and also digits, mathematical operators, and English strings that appear in the source code. [11] applies masked language modeling (MLM) pre-training on source code sequences.
  2. Denoising auto-encoding (DAE). [11] predicts a sequence of code tokens given a corrupted version of that sequence, in which input tokens are randomly masked, removed and shuffled.
  3. Back-translation (BT). The translation quality tends to be low if the model is never trained to do what it is expected to do at test time, i.e., to translate functions from one language to another. [11] applies back-translation, one of the most effective methods to leverage monolingual data in a weakly-supervised scenario.
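
The corruption used by the denoising auto-encoding objective (mask, remove, shuffle) can be sketched as below. All probabilities, the window size and the `<MASK>` symbol are hypothetical values chosen for illustration; [11] does not list them in this note.

```python
import random


def corrupt(tokens, rng, p_mask=0.1, p_drop=0.1, shuffle_window=3,
            mask="<MASK>"):
    """Randomly drop and mask tokens, then shuffle them locally."""
    kept = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                       # token removed
        kept.append(mask if r < p_drop + p_mask else tok)
    # Local shuffle: sort by position plus bounded noise, so a token moves
    # only a few places away from where it started.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


source = "def add ( a , b ) : return a + b".split()
noisy = corrupt(source, random.Random(0))
print(noisy)  # the model is trained to reconstruct `source` from `noisy`
```

Setting both probabilities and the window to zero returns the input unchanged, which is a convenient sanity check for the noise function.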


  • GitHub public dataset on Google BigQuery contains more than 2.8 million open source GitHub repositories.
  • Tokenization: the javalang tokenizer for Java, the tokenizer of the standard library for Python, and clang for C++. These tokenizers ensure that meaningless modifications (e.g., adding extra new lines or spaces) in the code do not have any impact on the tokenized sequences. [11] learns BPE codes using fastBPE on the extracted tokens, and splits tokens into subword units.
  • TransCoder trains the DAE and BT objectives on functions only. Keeping comments in the source code increases the number of anchor points across languages, which results in better overall performance.


  • Evaluation:
    (1) BLEU.
    (2) Reference match: the percentage of translations that perfectly match the ground truth reference.
    (3) Computational accuracy: whether the hypothesis function generates the same outputs as the reference when given the same inputs.
  • Decoding: beam search


  • Background: Previous PL pre-training uses masked language model objectives, which were initially designed for NL and do not leverage the particular structure of source code. PL is more structured than NL, which makes predicting masked tokens much easier for PLs.

Deobfuscation (DOBF) [10] proposes a new objective based on the deobfuscation of identifier names in source code. It leverages the particular structure of PLs. Although it does not require any parallel corpora of source code aligned to NL, DOBF outperforms GraphCodeBERT, CodeBERT and MLM pre-training on multiple downstream tasks.

Deobfuscation objective

DOBF obfuscates code snippets by replacing class, function and variable names with special tokens, and trains a model to recover the original names. When an identifier is selected, all of its instances in the code are replaced by the same special token. This differs from MLM, where the name of a variable can appear multiple times while being masked only once. As a result, the fraction of meaningful tokens masked by the objective is language independent: for more verbose languages (e.g., Java), the less informative syntax-related tokens will not be masked out by the DOBF objective.

Each identifier is replaced with probability $p_{obf} \in [0, 1]$. The original input is always modified: if no identifier is replaced, a random one is drawn to obfuscate. When $p_{obf} = 0$, only one random identifier in the input is obfuscated. When $p_{obf} = 1$, all the identifiers defined in the file are obfuscated. The model needs to recover a dictionary mapping special tokens to their initial values.
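
The obfuscation step can be sketched on a pre-tokenized snippet. This is a simplification under assumptions: the caller supplies the list of identifiers defined in the file, a single hypothetical `VAR_i` prefix stands in for the separate class, function and variable tokens, and `obfuscate` is an illustrative helper rather than the implementation of [10].

```python
import random


def obfuscate(tokens, identifiers, p_obf, rng):
    """Replace selected identifiers everywhere; return tokens and the target map."""
    chosen = [name for name in identifiers if rng.random() < p_obf]
    if not chosen:
        # Guarantee that the input is modified: pick one identifier at random.
        chosen = [rng.choice(identifiers)]
    mapping = {name: f"VAR_{i}" for i, name in enumerate(chosen)}
    obfuscated = [mapping.get(tok, tok) for tok in tokens]
    # The model is trained to output this dictionary (special token -> name).
    target = {special: name for name, special in mapping.items()}
    return obfuscated, target


tokens = "def add ( a , b ) : return a + b".split()
obfuscated, target = obfuscate(tokens, ["add", "a", "b"], 1.0, random.Random(0))
print(obfuscated)
# ['def', 'VAR_0', '(', 'VAR_1', ',', 'VAR_2', ')', ':',
#  'return', 'VAR_1', '+', 'VAR_2']
print(target)  # {'VAR_0': 'add', 'VAR_1': 'a', 'VAR_2': 'b'}
```

Every occurrence of `a` maps to the same `VAR_1`, which is the property that distinguishes this objective from token-level masking.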


  • Pre-training data: Python/Java files within GitHub public repos available on Google BigQuery.
  • Model: Encoder-decoder.
  • Tokenizer: BBPE (same as CodeBERT).

CodeXGLUE results

  1. DOBF beats CodeBERT by a wide margin on NL code search and code summarization, showing that PL data aligned with NL is unnecessary to train an effective model on those tasks.
  2. Objectives such as MLM and DAE that provide unstructured noise are complementary to DOBF.


PLBART (Program and Language BART)[7] is a bidirectional and autoregressive transformer pre-trained on unlabeled data across PL and NL to learn multilingual representations applicable to a broad spectrum of program and language understanding and generation applications.


  • Data: Java and Python repos on Google BigQuery; StackOverflow posts (including both questions and answers, excluding code snippets) obtained by downloading the data dump (7th Sep 2020) from Stack Exchange.
  • Tokenizer: sentencepiece.
  • Vocabulary: (newly trained) #50k subwords.

One key challenge in aggregating data from different modalities is that some modalities may have more data; here there is 14 times more data in PL than in NL. Thus, it mixes and up/down samples the data following XLM-R[9] to alleviate the bias towards PL. It samples instances for pre-training according to a multinomial distribution with probabilities ($q_1, q_2, \cdots, q_N$) derived from the smoothed per-language data shares, where $N$ is the total number of languages and $n_i$ is the total number of instances in language $i$. The smoothing parameter is $\alpha=0.3$.
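
The effect of the smoothing can be illustrated with the exponent-smoothed multinomial commonly used in XLM-style sampling, $q_i \propto p_i^{\alpha}$ with $p_i = n_i / \sum_j n_j$. This form is an assumption for illustration; the exact expression in [7] may include a further rescaling. The counts below are the Java, Python and NL figures from the table at the top of this note.

```python
def smoothed_sampling_probs(counts, alpha=0.3):
    """Exponent-smoothed sampling distribution (assumed XLM-style form)."""
    total = sum(counts)
    shares = [n / total for n in counts]       # p_i: raw data shares
    weights = [p ** alpha for p in shares]     # p_i ** alpha flattens them
    norm = sum(weights)
    return [w / norm for w in weights]


counts = [470, 219, 47]  # millions: Java, Python, NL
for name, n, q in zip(["Java", "Python", "NL"], counts,
                      smoothed_sampling_probs(counts)):
    print(f"{name}: raw share {n / sum(counts):.3f} -> sampling prob {q:.3f}")
```

With $\alpha < 1$ the low-resource NL portion is sampled more often than its raw share, and the dominant PL portions less often, while their ordering is preserved.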

Denoising pre-training

  • Config: BART-base (L6 encoder, L6 decoder, H768, A12) ~140M params.
  • Pre-training tasks: mask 35% of the tokens in each instance.
    1. token masking.
    2. token deletion.
    3. token infilling: sample text spans and replace them with a single mask token.
  • Input/output format: A language id symbol (e.g., <java>, <python>) is appended / prepended to the encoder/decoder inputs, respectively.


  • Evaluation metrics:
    1. BLEU for generation, except smoothed BLEU for code summarization;
    2. CodeBLEU: considers grammatical and logical correctness based on the AST and data-flow structure.
    3. Exact match (EM): evaluates if generated sequence exactly matches the reference.

It shows that PLBART learns better generic program semantics: it achieves the highest improvement in Ruby, even though PLBART is not pre-trained on Ruby.


  • Background: Previous work relies on encoder-only or decoder-only models, i.e., BERT or GPT, which are suboptimal for generation and understanding tasks, respectively. Initializing the encoder with CodeBERT while randomly initializing the decoder means the decoder cannot benefit from pre-training. Also, most works regard the PL as a sequence of tokens like NL, ignoring the rich structural information in the code, which is vital for comprehending the code semantics.

CodeT5[8] is a unified encoder-decoder model, which considers the token type information in the source code. It proposes an identifier-aware pre-training objective.


  • Data: CodeSearchNet, plus C/C# collected from BigQuery. ~8.35M instances for pre-training (8 PLs).
  • Tokenizer: (newly trained) BBPE. It largely reduces the length of tokenized code sequence by 30%-45% on downstream tasks.
  • Vocabulary: #32k, plus [PAD], [CLS], [SEP], [MASK0-99].


  • Config: CodeT5-small (60M); CodeT5-base (220M).


  • Encoding NL/PL: CodeT5 converts the PL segment into an Abstract Syntax Tree (AST) and extracts the node type for each code token. Then, it constructs a sequence of binary labels for the PL segment, where each label represents whether the code token is an identifier or not.
  1. Masked span prediction: the same corruption rate (15%) as T5, with an average span length of 3 obtained by uniformly sampling spans of 1 to 5 tokens. It also employs whole word masking as in ERNIE.
  2. Identifier tagging: use the CodeT5 encoder to predict whether the token is an identifier or not (binary classification).
  3. Masked identifier prediction: mask all identifiers in the PL and use a unique sentinel token for all occurrences of one specific identifier. It is called obfuscation where changing identifier names does not impact the code semantics. See the (c) subfigure below.
  4. Bimodal dual generation. Train NL $\rightarrow$ PL and PL $\rightarrow$ NL generation simultaneously, which can be seen as a special case of T5’s (full) span masking.
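
The binary identifier labels used by identifier tagging can be approximated for Python with the standard tokenizer. This is a lexical stand-in under assumptions: CodeT5 derives node types from an AST, whereas the hypothetical helper below simply treats every non-keyword NAME token as an identifier.

```python
import io
import keyword
import tokenize

_LAYOUT = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER}


def identifier_labels(source: str) -> list[tuple[str, int]]:
    """Pair each code token with 1 if it is an identifier, else 0."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _LAYOUT:
            continue
        is_identifier = (tok.type == tokenize.NAME
                         and not keyword.iskeyword(tok.string))
        pairs.append((tok.string, int(is_identifier)))
    return pairs


for token, label in identifier_labels("def add(a, b):\n    return a + b\n"):
    print(label, token)
```

The tokens labeled 1 (`add`, `a`, `b`) are also the ones that masked identifier prediction would replace with sentinel tokens.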


In code summarization tasks, CodeT5-small outperforms the previous SOTA with fewer parameters (60M vs 140M).

Ablation test:

  1. Masked span prediction (MSP)
  2. Identifier tagging (IT)
  3. Masked identifier prediction (MIP)

It is observed that removing MSP can largely reduce the generation task performance but instead increase the defect detection performance, indicating that MSP can capture syntactic information for generation tasks.

Removing MIP would hurt the defect detection task the most, indicating that it might focus more on code semantic understanding.


  • Background: The cons of previous work:
    (1) Encoder-only models are inferior on generation tasks, which require an additional decoder for generation.
    (2) Decoder-only models underperform in understanding tasks.
    (3) Encoder-decoder models (PLBART, CodeT5) are sub-optimal for auto-regressive tasks, especially code completion that requires a decoder-only manner for efficient inference.

UniXCoder[12] uses a UniLM structure for code pre-training. It uses three objectives:

  1. MLM
  2. Unidirectional LM
  3. Denoising objective (similar to T5 for enc-dec mode): first split the input sequence into $\max(\lfloor \frac{n \times r}{l} \rfloor, 1)$ chunks and then randomly mask a span of 1 to $2l-1$ tokens for each chunk, where $n$ is the length of the input, $r=15\%$ is the corruption rate and $l=5$ is the average length of masked spans.


  • Background: Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined.

InCoder[13], a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling), is the first large generative code model (6.7B) that is able to infill arbitrary regions.

It learns to infill by randomly replacing spans of code with a sentinel token and moving them to the end of the sequence. The model is trained to predict all tokens in the complete sequence in this permuted ordering. During inference, it can edit code by replacing spans with sentinel tokens, prompting the model with the new sequence, and having it generate new tokens to replace the masked spans.


It samples the number of spans from a Poisson distribution with a mean of one, truncated to the support [1, 256], so that there are typically a small number of spans. The length of each span is sampled uniformly from the length of the document, and the set of sampled spans is rejected and resampled if any spans overlap.

Once spans are sampled, each span $k$ is replaced with a special masked sentinel token <MASK:k>. The sequence of tokens in the span is then moved to the end of the document. Let “Left” be the left context, “Right” be the right context, and “Span” be the sampled span between the left and right contexts; it then maximizes the log probability of the masked document: log P([Left; <MASK:0>; Right; <MASK:0>; Span; <EOM>]).

[13] computes the probability of the sequence auto-regressively and trains the model using cross-entropy loss on all tokens except the mask sentinel tokens <MASK:k>, so that the model does not generate these tokens during inference.
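
The rearrangement of a document into its infilling form can be sketched on a token list. The function below is a hypothetical illustration: it takes already sampled, sorted, non-overlapping spans as half-open index pairs and does not reproduce the span sampling itself.

```python
def make_infilling_example(tokens, spans):
    """Replace each span with <MASK:k> and append '<MASK:k> span <EOM>' at the end."""
    masked, moved = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<MASK:{k}>"
        masked.extend(tokens[cursor:start])   # context before the span
        masked.append(sentinel)               # placeholder left in place
        moved.extend([sentinel, *tokens[start:end], "<EOM>"])
        cursor = end
    masked.extend(tokens[cursor:])            # remaining right context
    return masked + moved


tokens = ["def", "f", "(", "x", ")", ":", "return", "x"]
print(make_infilling_example(tokens, [(3, 4)]))
# ['def', 'f', '(', '<MASK:0>', ')', ':', 'return', 'x',
#  '<MASK:0>', 'x', '<EOM>']
```

At inference time only the part up to the second `<MASK:0>` is given as the prompt, and the model generates the span until it emits `<EOM>`.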


During inference, [13] samples the target span autoregressively from the distribution P(· | [Left; <MASK:0>; Right; <MASK:0>]) until an <EOM> token is generated.

Training Data

It uses (1) public code with permissive, non-copyleft, open-source licenses and (2) StackOverflow questions, answers, and comments.

Code data

(1) Code files and repo metadata from GitHub and GitLab via public APIs. ~670K public non-fork repos, including all code from a list of 28 PLs (determined by file extension).
(2) All other Python and Jupyter files obtainable through the GitHub archive on BigQuery that were not already obtained from GitHub directly.
(3) All text and code (with markdown formatting removed from text cells) in Jupyter notebooks.

(1) Remove duplicate code files using exact match on the sequence of alphanumeric tokens in the file.
(2) Use regular expressions to replace email addresses with the dummy address “”


  • Remove overlap between training data and the evaluation set. Remove any repos contained in the validation and test set of CodeSearchNet.

Remove files that

  • contain any line longer than 3000 tokens
  • have an average line length greater than 100 tokens
  • have less than 40% of their characters being alphanumeric or underscores
  • appear to be automatically generated, using substring match.
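
The first three filters can be sketched as a single predicate. Two points are assumptions of this sketch: whitespace splitting stands in for the real tokenizer (any callable can be passed as `tokenize_line`), and the alphanumeric fraction is computed over all characters of the file. The auto-generation check is omitted because the matched substrings are not listed here.

```python
def keep_file(text, tokenize_line=str.split, max_line_tokens=3000,
              max_avg_line_tokens=100.0, min_alnum_fraction=0.4):
    """Return True if the file passes the length and character filters."""
    lines = text.splitlines()
    if not lines:
        return False
    lengths = [len(tokenize_line(line)) for line in lines]
    if max(lengths) > max_line_tokens:
        return False                      # some line is too long
    if sum(lengths) / len(lengths) > max_avg_line_tokens:
        return False                      # lines are too long on average
    alnum = sum(ch.isalnum() or ch == "_" for ch in text)
    return alnum / len(text) >= min_alnum_fraction


print(keep_file("def compute_total(values):\n    return sum(values)\n"))  # True
print(keep_file("{}[]();" * 20))                                          # False
```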


It trains a new BBPE, allowing tokens to extend across whitespace (excluding newline characters) so that common code idioms (e.g., import numpy as np) are single tokens in the vocabulary. It reduces the total number of tokens required to encode the training corpus by 45% relative to the BBPE tokenizer and vocabulary of GPT-2.


The table compares the generative code models on the HumanEval and MBPP benchmarks, which require models to condition on NL descriptions (docstrings) to produce Python programs (typically a single function), and evaluates overall functional accuracy (pass rates).

InCoder achieves comparable performance on the HumanEval metrics to CodeGen-Multi[14].


Codex[15], a fine-tuned variant of GPT-3 created by OpenAI, powers GitHub Copilot and excels at a variety of coding tasks.

Pre-training Setup

  • Model: GPT-3 family models with up to 12B parameters (Transformer decoder).


Collect 179GB unique Python files under 1MB.
Filter out files:

  • which were likely auto-generated
  • had an average line length greater than 100
  • had a maximum line length greater than 1000
  • contained a small percentage of alphanumeric characters.

After filtering, it has 159GB.

Tokenizer: GPT-3 tokenizer plus an additional set of tokens for whitespace runs of different lengths (multi-whitespace tokens), which reduces the number of tokens by approximately 30%.


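Pass@k metric: first generate $n \ge k$ samples per task, count the number of correct samples $c \le n$ that pass the unit tests, and calculate the unbiased estimator $\text{pass@}k = \mathbb{E}_{\text{problems}} \left[ 1 - \binom{n-c}{k} / \binom{n}{k} \right]$.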

The numpy script for the unbiased estimate of pass@k.
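
A numerically stable version can be sketched as below. It evaluates $1 - \binom{n-c}{k} / \binom{n}{k}$ as a running product instead of computing the binomial coefficients directly, which avoids very large intermediate values; the type hints and the final `float` conversion are additions of this sketch.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate of pass@k.

    n: total number of samples generated for the problem
    c: number of samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a success.
        return 1.0
    # prod_{i = n-c+1}^{n} (1 - k / i) equals C(n-c, k) / C(n, k).
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


print(pass_at_k(n=5, c=2, k=2))  # 1 - C(3,2)/C(5,2) = 0.7, up to rounding
```

The benchmark score is the mean of this estimate over all problems.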



AlphaCode[16] is an encoder-decoder transformer model developed by DeepMind, ranking on average within the top 54.3% in Codeforces competitions with more than 5,000 human participants.

Pre-training Setup

  • Tokenizer: Sentencepiece
  • Vocabulary size: 8k, trained on a mix of GitHub and CodeContests data.
  • Data: GitHub repos including several popular languages. It follows Codex to filter out all files larger than 1MB or with lines longer than 1000 characters, to exclude automatically generated code. It also removes duplicates of the same file, ignoring whitespace in comparisons. It has 715.1GB of code in total.


PolyCoder[17] is a 2.7B code language model trained on 12 different PLs, achieving the new SOTA in the C language.

  • Model: GPT-2.
  • Tokenizer: BBPE.
  • Data: GitHub repositories with at least 50 stars for 12 PLs (stopping at 15k per language).


CodeGen[14] is a 16.1B causal language model pre-trained on code created by Salesforce, outperforming OpenAI Codex on HumanEval.


PaLM-Coder[18] is a fine-tuned 540B PaLM with a decoder-only setup, trained on GitHub repositories.


The BigCode community proposes StarCoder[19], a 15.5B causal LLM with 8k context length, which was trained with the Fill-in-the-Middle (FIM) objective on 1T tokens of 86 programming languages from The Stack, an open-source code corpus from GitHub. It uses multi-query attention (for faster inference) and learned absolute positional embeddings. StarCoder fine-tuned on Python outperforms OpenAI code-cushman-001 on HumanEval.


  1. Kanade, Aditya, Petros Maniatis, Gogul Balakrishnan and Kensen Shi. “Learning and Evaluating Contextual Embedding of Source Code.” ICML (2020).
  2. Feng, Zhangyin, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang and Ming Zhou. “CodeBERT: A Pre-Trained Model for Programming and Natural Languages.” Findings of EMNLP (2020).
  3. Husain, Hamel, Hongqi Wu, Tiferet Gazit, Miltiadis Allamanis and Marc Brockschmidt. “CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.” ArXiv abs/1909.09436 (2019).
  4. Svyatkovskiy, Alexey, Shao Kun Deng, Shengyu Fu and Neel Sundaresan. “IntelliCode Compose: Code Generation Using Transformer.” Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2020).
  5. Lu, Shuai, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu and Shujie Liu. “CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation.” ArXiv abs/2102.04664 (2021).
  6. Guo, Daya, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang and M. Zhou. “GraphCodeBERT: Pre-training Code Representations with Data Flow.” ICLR (2021).
  7. Ahmad, Wasi Uddin, Saikat Chakraborty, Baishakhi Ray and Kai-Wei Chang. “Unified Pre-training for Program Understanding and Generation.” NAACL (2021).
  8. Wang, Yue, Weishi Wang, Shafiq R. Joty and Steven C. H. Hoi. “CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation.” EMNLP (2021).
  9. Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. “Unsupervised Cross-lingual Representation Learning at Scale.” ACL (2020).
  10. Rozière, Baptiste, Marie-Anne Lachaux, Marc Szafraniec and Guillaume Lample. “DOBF: A Deobfuscation Pre-Training Objective for Programming Languages.” NeurIPS (2021).
  11. Lachaux, Marie-Anne, Baptiste Rozière, Lowik Chanussot and Guillaume Lample. “Unsupervised Translation of Programming Languages.” NeurIPS (2020).
  12. 12.Guo, Daya, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou and Jian Yin. “UniXcoder: Unified Cross-Modal Pre-training for Code Representation.” ACL (2022).
  13. 13.Fried, Daniel, Armen Aghajanyan, Jessy Lin, Sida I. Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer and Mike Lewis. “InCoder: A Generative Model for Code Infilling and Synthesis.” ArXiv abs/2204.05999 (2022).
  14. 14.Nijkamp, Erik, Bo Pang, Hiroaki Hayashi, Lifu Tu, Haiquan Wang, Yingbo Zhou, Silvio Savarese and Caiming Xiong. “A Conversational Paradigm for Program Synthesis.” ArXiv abs/2203.13474 (2022).
  15. 15.Chen, Mark et al. “Evaluating Large Language Models Trained on Code.” ArXiv abs/2107.03374 (2021).
  16. 16.Li, Yujia et al. “Competition-Level Code Generation with AlphaCode.” ArXiv abs/2203.07814 (2022): n. pag.
  17. 17.Xu, Frank F., Uri Alon, Graham Neubig and Vincent J. Hellendoorn. “A Systematic Evaluation of Large Language Models of Code.” DL4C @ ICLR 2022 (2022).
  18. 18.Chowdhery, Aakanksha et al. “PaLM: Scaling Language Modeling with Pathways.” ArXiv abs/2204.02311 (2022).
  19. 19.Li, R., Allal, L.B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., Liu, Q., Zheltonozhskii, E., Zhuo, T.Y., Wang, T., Dehaene, O., Davaadorj, M., Lamy-Poirier, J., Monteiro, J., Shliazhko, O., Gontier, N., Meade, N., Zebaze, A., Yee, M., Umapathi, L.K., Zhu, J., Lipkin, B., Oblokulov, M., Wang, Z., Murthy, R., Stillerman, J., Patel, S.S., Abulkhanov, D., Zocca, M., Dey, M., Zhang, Z., Fahmy, N., Bhattacharyya, U., Yu, W., Singh, S., Luccioni, S., Villegas, P., Kunakov, M., Zhdanov, F., Romero, M., Lee, T., Timor, N., Ding, J., Schlesinger, C., Schoelkopf, H., Ebert, J., Dao, T., Mishra, M., Gu, A., Robinson, J., Anderson, C.J., Dolan-Gavitt, B., Contractor, D., Reddy, S., Fried, D., Bahdanau, D., Jernite, Y., Ferrandis, C.M., Hughes, S.M., Wolf, T., Guha, A., Werra, L.V., & Vries, H.D. (2023). StarCoder: may the source be with you! ArXiv, abs/2305.06161.