phi-1 – Textbooks Are All You Need

Large Language Models
Author

Maxime Labonne

Published

August 6, 2024

Tip

This paper introduces phi-1, a model with 1.3B parameters that achieves a pass@1 rate of 50.6% on HumanEval thanks to a novel training process. Unfortunately, the weights are not available.

📝 Paper: https://arxiv.org/pdf/2306.11644.pdf

The authors argue that high-quality data can change the shape of the scaling laws, allowing small models to match the performance of bigger ones.

The importance of high-quality data

Motivation: The authors observe that standard code datasets like The Stack, StackOverflow, and CodeContests suffer from several drawbacks: samples are often not self-contained (they depend on external modules or files), many of them are trivial while the most complex ones are poorly documented, and the overall distribution is skewed towards certain topics and use cases.

They train their solution (phi-1) on a new dataset (<7B tokens) as follows:

  1. Pretraining on CodeTextbook, comprising a filtered version of The Stack and StackOverflow (~6B tokens) plus synthetic textbook-style samples generated by GPT-3.5 (<1B tokens)
  2. Fine-tuning on CodeExercises, a small synthetic dataset of Python exercises and solutions also generated by GPT-3.5 (~180M tokens)

Filtering code

The authors use the Python subset of the deduplicated version of The Stack and StackOverflow (35B tokens). They use GPT-4 to annotate the quality of a subset (100k samples), using the following prompt: “determine its educational value for a student whose goal is to learn basic coding concepts.”

Tip

This prompt could probably be improved by asking GPT-4 to break down its reasoning into steps before outputting the final value.
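
As a rough illustration, a revised annotation prompt could look like the following (the wording is my own, not the paper's):

```python
# Hypothetical annotation prompt (not from the paper): it asks GPT-4 to reason
# step by step before committing to a final educational-value label.
ANNOTATION_PROMPT = """\
You will be shown a Python code snippet.
Determine its educational value for a student whose goal is to learn basic coding concepts.
First, reason step by step about its readability, its use of standard practices, and the concepts it illustrates.
Then output a single final line: "Educational value: high" or "Educational value: low".

Code snippet:
{snippet}
"""
```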

This creates a dataset of code snippets annotated with quality scores. The authors produce embeddings of each code snippet using a pretrained CodeGen model and train a random forest classifier on these embeddings to predict the quality of each sample, as sketched below.
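
A minimal sketch of this filtering pipeline, assuming mean-pooled CodeGen hidden states as the snippet embedding (the paper does not specify the exact pooling strategy or classifier hyperparameters):

```python
# Sketch of the quality-filtering classifier: embed snippets with a pretrained
# CodeGen model, then fit a random forest on the GPT-4 quality annotations.
# Pooling strategy and hyperparameters are assumptions, not the authors' setup.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
encoder = AutoModel.from_pretrained("Salesforce/codegen-350M-mono")

def embed(snippet: str) -> torch.Tensor:
    """Mean-pool the last hidden states to get one vector per code snippet."""
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Toy placeholders: real snippets come from The Stack/StackOverflow,
# labels from the GPT-4 annotation step (1 = high educational value).
snippets = ["def add(a, b):\n    return a + b", "x=1;y=2;print(x+y)"]
labels = [1, 0]

X = torch.stack([embed(s) for s in snippets]).numpy()
clf = RandomForestClassifier(n_estimators=200)
clf.fit(X, labels)
```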

Generating synthetic data

The authors argue the synthetic samples should be diverse (concepts, skills, scenarios, difficulty, complexity, style) and non-repetitive, to reduce the risk of overfitting/memorization and make the model more robust. Inspired by TinyStories, they use randomized seeds à la Alpaca to generate samples with GPT-3.5:

  • Synthetic training data (<1B tokens): code and text with examples and constraints.
  • CodeExercises (~180M tokens): Python exercises and solutions, where each exercise is a docstring of a function that needs to be completed (see the illustrative sample after this list).
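
For illustration, a CodeExercises-style sample could look like the following (this exercise is invented, not taken from the dataset):

```python
# Invented CodeExercises-style sample: the signature and docstring define the
# exercise, and the model learns to generate the solution body below it.
def count_vowels(sentence: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in the sentence, ignoring case.

    >>> count_vowels("Textbooks Are All You Need")
    10
    """
    return sum(ch in "aeiou" for ch in sentence.lower())
```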

Model architecture and training

phi-1 is a decoder-only transformer using rotary position embeddings (RoPE), FlashAttention, and multi-head attention (MHA), with parallel attention and MLP layers, and it reuses codegen-350M-mono's tokenizer.

Tip

Its architecture is very much inspired by CodeGen and does not include Fill-in-the-Middle (FIM) or Multi-Query Attention (MQA) like StarCoder, which looks like low-hanging fruit for improvement.

Hyperparameters for phi-1 with 1.3B/350M parameters:

  • 24/20 layers
  • Hidden dimension = 2048/1024
  • MLP-inner dimension = 8192/4096
  • 32/16 attention heads with dimension = 64 (both models)
  • Sequence length = 2048
  • Objective = next-token prediction
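
As a compact restatement of the list above (field names are illustrative, not taken from the phi-1 codebase):

```python
# The two phi-1 configurations as plain config dicts; key names are my own.
PHI1_CONFIGS = {
    "phi-1 (1.3B)": dict(n_layers=24, hidden_dim=2048, mlp_inner_dim=8192,
                         n_heads=32, head_dim=64, seq_len=2048),
    "phi-1-small (350M)": dict(n_layers=20, hidden_dim=1024, mlp_inner_dim=4096,
                               n_heads=16, head_dim=64, seq_len=2048),
}
```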

The importance of fine-tuning

Fine-tuning phi-1 on CodeExercises greatly improves the model’s performance, even for tasks that are not in the fine-tuning dataset.

The authors notice that the model gets better at interpreting questions and logical relationships in the prompts. Interestingly, it also becomes better at using external libraries, even when they do not appear in the fine-tuning set (e.g., Pygame and Tkinter).

Performance evaluation

The authors argue that HumanEval's binary score (the code either passes the unit tests or it fails) does not capture the nuances of the model's performance. They introduce an LLM-based grading scheme using GPT-4 (scores between 0 and 10), which does not require unit tests.

There’s a concern that CodeExercises might contain samples that are similar to exercises in HumanEval. The authors propose to remove these samples and retrain phi-1 on this decontaminated set.

They report no meaningful n-gram overlap between CodeExercises and HumanEval (4 false positives). They then use a combination of embedding-based and syntax-based distances to find similar code snippets:

  • Semantics: They use the L2 distance between embeddings produced by a pretrained CodeGen model.
  • Syntax: They calculate the (string) edit distance between the abstract syntax trees (ASTs) of two code snippets (see the sketch after this list).
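
A minimal sketch of the syntax-based check, assuming the ASTs are serialized with ast.dump and compared with a plain Levenshtein distance (the paper does not publish its exact distance computation):

```python
# Hedged sketch of an AST edit distance between two code snippets: parse both
# snippets, serialize the trees to strings, and compute a Levenshtein distance
# normalized by the longer serialization.
import ast

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ast_distance(code_a: str, code_b: str) -> float:
    """Edit distance between serialized ASTs, normalized to [0, 1]."""
    dump_a, dump_b = ast.dump(ast.parse(code_a)), ast.dump(ast.parse(code_b))
    return edit_distance(dump_a, dump_b) / max(len(dump_a), len(dump_b))

# Two snippets that only differ by identifier names are very close in AST space.
print(ast_distance("def f(x): return x + 1", "def g(y): return y + 1"))
```

The semantic side can reuse an embedding distance similar to the filtering sketch above (L2 distance between CodeGen embeddings).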

Despite this data pruning, the authors claim that phi-1 still outperforms StarCoder on HumanEval.