LIMA is a 65B-parameter LLaMA model fine-tuned (standard supervised learning) on only 1,000 curated samples, and it outperforms DaVinci003 in human evaluation.
📝 Paper: https://arxiv.org/abs/2305.11206
Superficial Alignment Hypothesis: alignment is a process where the model learns to leverage knowledge and capabilities that were already acquired during pre-training.
Although I find this hypothesis credible, I do not feel like the paper truly proves it. Proving it would require a much more in-depth analysis based on OOD tasks and information.
The model is fine-tuned on 1,000 examples that approximate user prompts and high-quality responses (750 from forums + 250 manually written examples), optimized for quality and diversity.
Alignment Data
Training models is easy but building high-quality datasets is hard. This is the most interesting part of the paper to me.
The authors collect data from three popular QA websites:
- Stack Exchange:
  - Divide the exchanges into 75 STEM exchanges + 99 other (English, cooking, travel, etc.) and discard 5 niche exchanges.
  - Sample 200 Q&As from each of the two sets (STEM and other).
  - Within each exchange, take the questions with the highest score (the question needs to be self-contained in the title, i.e. no body).
  - Select the top answer for each question, provided it has a score >= 10, is between 1,200 and 4,096 characters long, is not written in the first person, and does not reference other answers (a filtering sketch follows this list).
  - Links, images, and other HTML tags (except code blocks and lists) are also removed from the answers.
  - Since Stack Exchange questions have both a title and a description, they randomly select one or the other as the prompt.
- wikiHow:
  - Sample 200 articles while ensuring a diversity of categories.
  - Prompt = title (“How to cook an omelette?”) and response = article’s body.
  - Replace “This article…” with “The following answer…”
  - Remove links, images, and certain sections of the text.
- Pushshift Reddit Dataset:
  - Restricted to two subreddits: r/AskReddit and r/WritingPrompts.
  - Manually select examples from within the most upvoted posts.
  - Q&As from r/AskReddit are deemed not necessarily reliable, which is why they are only used as part of the test set.
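As a concrete illustration of the Stack Exchange filtering step, here is a minimal Python sketch of the answer heuristics listed above. The regexes and field handling are my own assumptions; the paper does not publish its exact cleaning code.

```python
import re

# Hypothetical filter for Stack Exchange answers, following the paper's criteria.
FIRST_PERSON = re.compile(r"\b(I|I'm|I've|my|mine)\b")
OTHER_ANSWER_REF = re.compile(r"\b(other answers?|the accepted answer)\b", re.I)

def keep_answer(text: str, score: int) -> bool:
    """Quality heuristics: score >= 10, 1,200 < length < 4,096 characters,
    no first person, no reference to other answers."""
    return (
        score >= 10
        and 1200 < len(text) < 4096
        and not FIRST_PERSON.search(text)
        and not OTHER_ANSWER_REF.search(text)
    )

def clean_html(html: str) -> str:
    """Drop links, images, and most HTML tags, keeping code blocks and lists
    (approximated here with regexes; a real pipeline would use an HTML parser)."""
    html = re.sub(r"<img[^>]*>", "", html)                          # remove images
    html = re.sub(r"<a [^>]*>(.*?)</a>", r"\1", html)               # unwrap links
    html = re.sub(r"</?(?!pre|code|ul|ol|li)\w+[^>]*>", "", html)   # strip other tags
    return html.strip()
```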
The authors also manually authored examples. Two groups of authors (Group A and Group B) each wrote 250 prompts based on their own interests. From Group A, 200 prompts were kept for training and 50 as a held-out development set; 230 prompts from Group B are used for test only.
They also manually wrote high-quality answers in a uniform tone, in which the assistant acknowledges the question and then answers it. This data includes 13 adversarial training prompts (toxic or malevolent requests) paired with answers that reject the request.
They also add 50 samples from Super-Natural Instructions, which are modified to correspond to the style of the 200 manual examples.
Training LIMA
They trained a 65B parameter LLaMA model on 1,000 samples. To distinguish user and assistant, they introduce an end-of-turn token (EOT) at the end of each utterance.
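The paper does not give implementation details for the EOT token; below is a minimal sketch of how one might register it with a Hugging Face LLaMA tokenizer. The checkpoint path and the "<EOT>" string are placeholders, not values from the paper.

```python
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-65b")  # placeholder path
model = LlamaForCausalLM.from_pretrained("path/to/llama-65b")

# Register a new special end-of-turn token and resize the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<EOT>"]})
model.resize_token_embeddings(len(tokenizer))

def format_example(prompt: str, response: str) -> str:
    """Append <EOT> at the end of each utterance to mark speaker turns."""
    return f"{prompt}<EOT>{response}<EOT>"
```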
Hyperparameters:
- 15 epochs using AdamW (β1 = 0.9, β2 = 0.95, and weight decay of 0.1).
- No warmup steps; initial learning rate = 1e-5, linearly decaying to 1e-6 by the end of training.
- Batch size of 32 examples (64 for smaller models).
- Texts longer than 2048 tokens (LLaMA’s context window) are trimmed.
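A quick PyTorch sketch of this optimizer and schedule; the stand-in model and the step count are placeholders, not details from the paper.

```python
import torch

model = torch.nn.Linear(8, 8)        # stand-in for the 65B LLaMA model
total_steps = 15 * (1000 // 32)      # 15 epochs over ~1,000 samples, batch size 32

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-5, betas=(0.9, 0.95), weight_decay=0.1
)

# Linear decay from 1e-5 to 1e-6 (multiplier goes from 1.0 to 0.1), no warmup.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: 1.0 - 0.9 * min(step, total_steps) / total_steps,
)
```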
They also apply dropout over residual connections, starting at p = 0.0 at the bottom layer and linearly raising the rate to p = 0.3 at the last layer (p = 0.2 for smaller models).
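The per-layer rates follow a simple linear interpolation; here is a small sketch (the 80-layer count corresponds to LLaMA-65B).

```python
def residual_dropout_rates(num_layers: int, p_bottom: float = 0.0, p_top: float = 0.3):
    """Return one residual-dropout probability per transformer layer,
    linearly interpolated from the bottom layer to the top layer."""
    return [
        p_bottom + (p_top - p_bottom) * i / (num_layers - 1)
        for i in range(num_layers)
    ]

rates = residual_dropout_rates(80)   # use p_top=0.2 for the smaller models
print(rates[0], rates[-1])           # 0.0 ... 0.3
```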
They manually selected checkpoints between the 5th and the 10th epochs using the held-out 50 samples (not based on perplexity).
There are two interesting improvements/modifications over the original model: EOT and residual dropout.
Human Evaluation
Baselines: Alpaca 65B, DaVinci003, Bard, Claude, GPT-4.
Generation parameters: nucleus sampling (p = 0.9), temperature of 0.7, a repetition penalty of 1.2 on previously generated tokens, max token length = 2048.
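For illustration, these settings map roughly onto Hugging Face's generate() arguments. The checkpoint path is a placeholder, and transformers' repetition_penalty may not match the paper's exact penalty implementation.

```python
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("path/to/lima-65b")  # placeholder path
model = LlamaForCausalLM.from_pretrained("path/to/lima-65b")

inputs = tokenizer("How do I cook an omelette?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,               # nucleus sampling
    temperature=0.7,
    repetition_penalty=1.2,  # penalize previously generated tokens
    max_length=2048,         # LLaMA's context window
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```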
Multi-Turn Dialogue
The authors created a small multi-turn dialogue dataset with 30 samples (10 written manually, 20 based on modified comments from Stack Exchange).
They train a LIMA model on the 1,000 original samples + 30 multi-turn dialogue examples and show the performance of the model greatly improves (from 45.2% to 76.1% excellent responses).