Introduction
In the XCS330 Problem Set 4, we will explore methods for performing few-shot learning with pre-trained LMs.
Datasets from HuggingFace:
- Amazon Reviews: a five-way classification problem. Given a review, predict the star rating, from 1 to 5.
- XSum: news articles from the BBC. Given an article, return a one-sentence summary. Use n-gram overlap to measure the score.
- bAbI: question-answering tasks that require reasoning about contextual information. An AI benchmark developed by Facebook AI Research.
Fine-tuning
Here, we fine-tune the entire model on k examples using two different sizes of smaller BERT models.
We need to implement a function that returns the parameters to fine-tune:
def parameters_to_fine_tune(model: nn.Module, mode: str) -> List:
...
The mode can be [all, last, first, middle, loraN].
In PyTorch, we can:
- use model.parameters() to get the list of parameters (the weights and the biases).
- use model.named_parameters() to get the parameter name and the parameter itself.
Transformer Block
The requirement of parameters_to_fine_tune() says that when mode = last, it should return the parameters of the last 2 transformer blocks.
The following implementation is wrong:
list(model.parameters())[-2:]
It only grabs the last two parameter tensors (which, per the printout below, are the final LayerNorm's weight and bias), not the last two blocks. When I print out the model:
for name, param in model.named_parameters():
    print(name, param.shape)
I got:
transformer.wte.weight torch.Size([50257, 1024])
transformer.wpe.weight torch.Size([1024, 1024])
# Transformer block 0:
transformer.h.0.ln_1.weight torch.Size([1024])
transformer.h.0.ln_1.bias torch.Size([1024])
transformer.h.0.attn.c_attn.weight torch.Size([1024, 3072])
transformer.h.0.attn.c_attn.bias torch.Size([3072])
transformer.h.0.attn.c_proj.weight torch.Size([1024, 1024])
transformer.h.0.attn.c_proj.bias torch.Size([1024])
transformer.h.0.ln_2.weight torch.Size([1024])
transformer.h.0.ln_2.bias torch.Size([1024])
transformer.h.0.mlp.c_fc.weight torch.Size([1024, 4096])
transformer.h.0.mlp.c_fc.bias torch.Size([4096])
transformer.h.0.mlp.c_proj.weight torch.Size([4096, 1024])
transformer.h.0.mlp.c_proj.bias torch.Size([1024])
# ...
# Transformer block 23:
transformer.h.23.ln_1.weight torch.Size([1024])
transformer.h.23.ln_1.bias torch.Size([1024])
transformer.h.23.attn.c_attn.weight torch.Size([1024, 3072])
transformer.h.23.attn.c_attn.bias torch.Size([3072])
transformer.h.23.attn.c_proj.weight torch.Size([1024, 1024])
transformer.h.23.attn.c_proj.bias torch.Size([1024])
transformer.h.23.ln_2.weight torch.Size([1024])
transformer.h.23.ln_2.bias torch.Size([1024])
transformer.h.23.mlp.c_fc.weight torch.Size([1024, 4096])
transformer.h.23.mlp.c_fc.bias torch.Size([4096])
transformer.h.23.mlp.c_proj.weight torch.Size([4096, 1024])
transformer.h.23.mlp.c_proj.bias torch.Size([1024])
transformer.ln_f.weight torch.Size([1024])
transformer.ln_f.bias torch.Size([1024])
Therefore, I need to grab whole blocks!
There are some useful tricks:
print(len(model.transformer.h))
# output: 24
This means model.transformer.h is an indexable container of the 24 transformer blocks, so you can slice it to grab a group of them!
We can then do something like:
# Get the last 2 transformer blocks and return all of their params in a flat list
blocks = model.transformer.h[-2:]
return [param for block in blocks for param in block.parameters()]
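For reference, here is how I'd sketch the dispatch over the non-LoRA modes, assuming a GPT-2-style model whose blocks live under model.transformer.h, and assuming first and middle also refer to two transformer blocks (check the handout for the exact definition):
def parameters_to_fine_tune(model, mode):
    if mode == 'all':
        return list(model.parameters())
    if mode == 'last':
        blocks = model.transformer.h[-2:]                 # last 2 transformer blocks
    elif mode == 'first':
        blocks = model.transformer.h[:2]                  # first 2 transformer blocks
    elif mode == 'middle':
        mid = len(model.transformer.h) // 2
        blocks = model.transformer.h[mid - 1:mid + 1]     # 2 blocks around the middle
    else:
        raise ValueError(f'unhandled mode: {mode}')       # loraN is covered in the LoRA section below
    return [param for block in blocks for param in block.parameters()]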
Implementation 1.a
Some basic techniques for implementing 1.a:
# Calculate cross entropy loss given logits (N, C) and targets (N)
loss = F.cross_entropy(logits, targets)
# Calculate accuracy
preds = torch.argmax(logits, dim=1) # Get the index of the max value
acc = (preds == targets).float().mean().item() # Get the mean of the correct prediction
Note that .item() is important here, because we want to convert a tensor object into a plain Python number. BTW, a zero-dim tensor is simply a scalar value.
In-Context Learning
What surprises people is LLMs' emergent ability to learn in-context, that is, their ability to learn a task without updating their parameters.
In task 2.a, we need to implement prompt creation for the XSum and bAbI tasks. This is very similar to the prompt library that I have been using. There are some tricks worth sharing.
Shuffle Data
In the get_icl_prompts() function, we need to shuffle support_inputs and support_labels together. This is how we can shuffle two arrays in the same order:
import numpy as np

a = np.array([...])
b = np.array([...])
indices = np.arange(a.shape[0])
np.random.shuffle(indices)
a = a[indices]
b = b[indices]
Or we can zip the two lists together, shuffle them, and unpack them:
import random

a = ['a', 'b', 'c']
b = [1, 2, 3]
c = list(zip(a, b))
random.shuffle(c)
a, b = zip(*c)  # unzip c
print(a)  # e.g. ('b', 'a', 'c')
print(b)  # e.g. (2, 1, 3)
Note that the second approach returns tuples instead of lists.
Update: I realized that the requirement forces us to use np.random.permutation to shuffle the data. Otherwise, I cannot pass all the hidden unit tests, because of the way they freeze the random seed. Therefore, I should use:
indices = np.random.permutation(len(support_inputs))
support_inputs = [support_inputs[i] for i in indices]
support_labels = [support_labels[i] for i in indices]
Greedy Sample
In 2.a.ii, we need to implement a greedy sampling algorithm.
In utils.py, there is a function called get_model_and_tokenizer, which returns a pre-trained model from HuggingFace and the corresponding tokenizer.
The note says:
(Don’t use model.generate; implement this yourself in a loop)
That translates to:
- Given input_ids, generate one token at a time.
- Select the token with the highest probability (the greedy sampling algorithm).
- Stop the loop when: a. the max number of tokens is reached, or b. the stop token is returned.
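Here's a minimal sketch of that loop, assuming a HuggingFace causal LM; the function name, arguments, and stop condition are illustrative rather than the assignment's exact signature:
import torch

@torch.no_grad()
def greedy_sample(model, input_ids, max_new_tokens, stop_token_id):
    # input_ids: (1, prompt_len) tensor of prompt token ids
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                            # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick: (1, 1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == stop_token_id:                      # e.g. tokenizer.eos_token_id
            break
    return input_ids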
AutoModelForCausalLM
AutoModelForCausalLM is a generic interface for loading pre-trained causal language models, such as GPT-2. Causal language models are designed for auto-regressive tasks, where the model predicts the next token based on the previous tokens.
When you pass the input tokens to the model, the output logits have the shape (batch_size, sequence_length, vocab_size):
- batch_size: the number of input sequences
- sequence_length: the length of the input sequence
- vocab_size: the size of the vocabulary
What confused me at first is the second dimension, sequence_length.
Shouldn't the model just output the probability distribution for the last token, and therefore have shape (batch_size, vocab_size)?
In the HuggingFace forum, I found people asking a similar question.
GitHub Copilot gave me the answer:
If the input sequence is `[1, 2, 3]` (with `batch_size=1`), the logits will provide predictions for the next token after each position:
- logits for position 1 predict the next token after `[1]`.
- logits for position 2 predict the next token after `[1, 2]`.
- logits for position 3 predict the next token after `[1, 2, 3]`.
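A quick way to see this in code (a sketch using the small gpt2 checkpoint; only the shapes matter here):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The sky is blue", return_tensors="pt").input_ids  # shape (1, seq_len)
with torch.no_grad():
    logits = model(input_ids).logits  # shape (1, seq_len, 50257): one distribution per position
next_token_logits = logits[:, -1, :]  # the distribution over the token that follows the whole prompt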
Tensor Size
import torch
a = torch.tensor([[1, 2, 3]])
a.shape # torch.Size([1, 3])
b = torch.tensor(4)
b # tensor(4)
b.shape # torch.Size([])
b.item() # 4
b = b.unsqueeze(0).unsqueeze(0)
b # tensor([[4]])
Tokenizer
We can use HuggingFace’s tokenizer to encode and decode the tokens.
import torch
from transformers import AutoTokenizer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load the tokenizer
prompt = "sky is blue."
input_ids = tokenizer(prompt, return_tensors="pt").to(DEVICE).input_ids  # Returns a tensor of token ids
# Call the model here; using placeholder output ids for illustration
output_ids = [101, 102, 103]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
Note:
- the tokenizer by default gives you results on the CPU, therefore we need to call to(DEVICE) to move the tensors to the proper device.
- access the tokens from the input_ids property.
- use skip_special_tokens if you want to remove special tokens when decoding.
Tokenize GPT-2 Batch
In this task, we need to implement the tokenization step for a batch of examples for GPT-2.
Not too bad. We already have the tokenizer returned alongside the GPT-2 model, and this is the building block. Let's dive into it:
import transformers

x = ['Who is the singer for the band Queen?', 'What is the capital of France?']
y = ['Freddie Mercury', 'Paris']
tokenizer = transformers.GPT2Tokenizer.from_pretrained('gpt2')
print(tokenizer(x).keys())
# dict_keys(['input_ids', 'attention_mask'])
tokenizer(x)['input_ids']
# token ids for x (the inputs)
[[8241, 318, 262, 14015, 329, 262, 4097, 7542, 30],
[2061, 318, 262, 3139, 286, 4881, 30]]
tokenizer(y)['input_ids']
# token ids for y (the output, i.e. the target)
[[30847, 11979, 21673],
[40313]]
When we combine the input and target together (GPT-2 has no pad token by default, so we reuse the EOS token, id 50256, for padding):
tokenizer.pad_token = tokenizer.eos_token  # required for padding=True
tokenizer_dict = tokenizer([x_ + y_ for x_, y_ in zip(x, y)], return_tensors='pt', padding=True)
tokenizer_dict['input_ids']
# input ids, each row represents (x, y) pair
tensor([[ 8241, 318, 262, 14015, 329, 262, 4097, 7542, 30, 30847, 11979, 21673],
[ 2061, 318, 262, 3139, 286, 4881, 30, 40313, 50256, 50256, 50256, 50256]])
tokenizer_dict['attention_mask']
# attention mask, 0 means padding
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
Now, let’s check the requirement of this function tokenize_gpt2_batch
, it needs to return:
input_ids
: a tensor[batch_size, seq_length]
containing token idsattention_mask
: a tensor[batch_size, seq_length]
containing1
s and0
s indicating paddingslabels
: a tensor[batch_size, seq_length]
with-100
for non-target tokens
In the above scenario:
[[-100, -100, -100, -100, -100, -100, -100, -100, -100, 30847, 11979, 21673],
[-100, -100, -100, -100, -100, -100, -100, 40313, -100, -100, -100, -100]]
With these concepts, we can:
- zip x and y
- use the tokenizer function to get the batch input_ids and attention_mask
- use the attention_mask to set the padded locations to -100
- use the length of tokenizer(x)['input_ids'] to set the input (non-target) locations to -100
Pay attention to the last point. We should not use the length of x itself, because the original string and its tokens have different lengths!
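Here's a minimal sketch of that logic, assuming the tokenizer's pad token is set as above; the actual assignment function may differ in signature and details:
def tokenize_gpt2_batch(tokenizer, x, y):
    # Tokenize the concatenated (input, target) pairs with padding.
    combined = tokenizer([x_ + y_ for x_, y_ in zip(x, y)],
                         return_tensors='pt', padding=True)
    labels = combined['input_ids'].clone()
    # Ignore padded positions in the loss.
    labels[combined['attention_mask'] == 0] = -100
    # Ignore the input (prompt) positions, measured in tokens, not characters.
    for i, x_ in enumerate(x):
        prompt_len = len(tokenizer(x_)['input_ids'])
        labels[i, :prompt_len] = -100
    combined['labels'] = labels
    return combined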
LoRA
One of the fine-tuning techniques is Low-Rank Adaptation (LoRA). The LoRA approach makes fine-tuning more efficient by representing the weight update with two smaller matrices through low-rank decomposition.
Let’s denote the pre-trained weight matrix $W^0_l$ and the fine-tuned parameters $W^ft_l$, we define two matrices $A$ and $B$:
$$ W^\text{ft}_l = W^0_l + AB^T $$
where both $W$ matrices have dimension $d_1 \times d_2$, $A$ has dimension $d_1 \times p$, $B$ has dimension $d_2 \times p$, and $p \ll d_1, d_2$.
Parameter Saving
The ratio of the parameters fine-tuned by LoRA to the number of parameters in $W^0_l$ is:
$$ \frac{d_1 \times p + d_2 \times p}{d_1 \times d_2} = \frac{d_1 + d_2}{d_1 \times d_2} \cdot p $$
When $d_1 = d_2 = d$, the ratio becomes $\frac{2p}{d}$!
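For the square case, e.g. $d_1 = d_2 = 1024$ (the hidden size of the model printed above) and $p = 4$, that's $\frac{2 \times 4}{1024} \approx 0.8\%$, so LoRA trains well under 1% of the parameters of each wrapped weight matrix.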
Implementation Details
Named Parameters
For the loraN mode, we can use the named_parameters() function to iterate over (name, parameter) pairs and return all the parameters whose names start with lora.
We also need to check whether a module is an instance of LoRAConv1DWrapper:
lora_params = []
for m in model.modules():
    if isinstance(m, LoRAConv1DWrapper):
        for name, param in m.named_parameters():
            if name == "lora_A" or name == "lora_B":
                lora_params.append(param)
return lora_params
Weight Shape
We can use base_module.weight.shape to extract the shape of the pre-trained weight matrix. Note that the wrapped module is a HuggingFace Conv1D, whose weight is stored as (in_features, out_features), i.e. $(d_1, d_2)$ in the notation above:
d1, d2 = base_module.weight.shape
Initialization
When we initialize lora_A and lora_B, we just need to make sure one of the two matrices is a zero matrix. For instance, lora_A = torch.randn(d1, p) and lora_B = torch.zeros(d2, p), so that the product $AB^T$ starts out as zero and the wrapped module initially behaves exactly like the pre-trained one.
If we set both matrices to zeros, then during backpropagation the gradients of both A and B will be all zeros (each one's gradient is proportional to the other matrix), and training would never move. Note that we should wrap the tensors with nn.Parameter() so that they get registered with the module and their gradients are computed automatically.
Forward Path
In the forward() function, we could do something like this:
def forward(self, x):
    # Explicitly materialize the adapted weight W0 + A B^T.
    # (Keep it differentiable: don't assign into weight.data, or A and B get no gradients.)
    adapted_weight = self.base_module.weight + self.lora_A @ self.lora_B.T
    # HF Conv1D computes x @ W + b, so apply the adapted weight manually.
    return x @ adapted_weight + self.base_module.bias
We explicitly construct the matrix $AB^T$ here. The assignment note says we don't need to do it explicitly, and here's how to do that (x @ A has shape (..., p), so we never materialize the full $d_1 \times d_2$ update):
return self.base_module(x) + x @ self.lora_A @ self.lora_B.T
Another trick here is that we need to freeze the pre-trained model:
self.base_module.weight.requires_grad = False
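Putting the pieces together, here's a minimal standalone sketch of the wrapper; the constructor signature and the initialization scale are my own choices, not necessarily what the starter code expects:
import torch
import torch.nn as nn

class LoRAConv1DWrapper(nn.Module):
    """Wraps a HuggingFace Conv1D module and adds a low-rank update A @ B.T to its weight."""

    def __init__(self, base_module, lora_rank):
        super().__init__()
        self.base_module = base_module
        d1, d2 = base_module.weight.shape            # Conv1D weight is (in_features, out_features)
        # A random, B zero: A @ B.T starts at zero, so the initial forward pass
        # matches the pre-trained module exactly.
        self.lora_A = nn.Parameter(torch.randn(d1, lora_rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(d2, lora_rank))
        # Freeze the pre-trained weight; only lora_A and lora_B get gradients.
        self.base_module.weight.requires_grad = False

    def forward(self, x):
        # Equivalent to applying W0 + A B^T, without materializing the full d1 x d2 update.
        return self.base_module(x) + x @ self.lora_A @ self.lora_B.T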