Introduction

In XCS330 Problem Set 4, we explore methods for performing few-shot learning with pre-trained language models (LMs).

Datasets from HuggingFace:

  • Amazon Reviews: a five-way classification problem. Given a review, predict the star rating, from 1 to 5.
  • XSum: news articles from the BBC. Given an article, return a one-sentence summary; n-gram overlap is used to measure the score.
  • bAbI: question-answering tasks that require reasoning about contextual information. An AI benchmark developed by Facebook AI Research.

Fine-tuning

Here, we fine-tune the entire model on k examples using two different sizes of smaller BERT models.

What we need to implement is a function that returns the parameters we need to fine-tune:

def parameters_to_fine_tune(model: nn.Module, mode: str) -> List:
    ...

The mode can be one of: all, last, first, middle, loraN.

In PyTorch, we can:

  • use model.parameters() to iterate over the parameters - the weights and the biases.
  • use model.named_parameters() to get each parameter's name together with the parameter itself.
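
For example, here is a minimal sketch of the easiest branch (mode == "all"), using the helper above; the other modes are worked out later in this post:

def parameters_to_fine_tune(model, mode):
    # Sketch of the simplest mode only: fine-tune every parameter.
    if mode == "all":
        return list(model.parameters())
    # ... the other modes (last, first, middle, loraN) are handled below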

Transformer Block

The requirement for parameters_to_fine_tune() says that when mode = last, it should return the parameters of the last 2 transformer blocks.

The following implementation:

list(model.parameters())[-2:]

is wrong! When I print out the model's parameters:

for name, param in model.named_parameters():
    print(name, param.shape)

I got:

transformer.wte.weight torch.Size([50257, 1024])
transformer.wpe.weight torch.Size([1024, 1024])

# Transformer block 0:
transformer.h.0.ln_1.weight torch.Size([1024])
transformer.h.0.ln_1.bias torch.Size([1024])
transformer.h.0.attn.c_attn.weight torch.Size([1024, 3072])
transformer.h.0.attn.c_attn.bias torch.Size([3072])
transformer.h.0.attn.c_proj.weight torch.Size([1024, 1024])
transformer.h.0.attn.c_proj.bias torch.Size([1024])
transformer.h.0.ln_2.weight torch.Size([1024])
transformer.h.0.ln_2.bias torch.Size([1024])
transformer.h.0.mlp.c_fc.weight torch.Size([1024, 4096])
transformer.h.0.mlp.c_fc.bias torch.Size([4096])
transformer.h.0.mlp.c_proj.weight torch.Size([4096, 1024])
transformer.h.0.mlp.c_proj.bias torch.Size([1024])

# ...

# Transformer block 23:
transformer.h.23.ln_1.weight torch.Size([1024])
transformer.h.23.ln_1.bias torch.Size([1024])
transformer.h.23.attn.c_attn.weight torch.Size([1024, 3072])
transformer.h.23.attn.c_attn.bias torch.Size([3072])
transformer.h.23.attn.c_proj.weight torch.Size([1024, 1024])
transformer.h.23.attn.c_proj.bias torch.Size([1024])
transformer.h.23.ln_2.weight torch.Size([1024])
transformer.h.23.ln_2.bias torch.Size([1024])
transformer.h.23.mlp.c_fc.weight torch.Size([1024, 4096])
transformer.h.23.mlp.c_fc.bias torch.Size([4096])
transformer.h.23.mlp.c_proj.weight torch.Size([4096, 1024])
transformer.h.23.mlp.c_proj.bias torch.Size([1024])

transformer.ln_f.weight torch.Size([1024])
transformer.ln_f.bias torch.Size([1024])

Therefore, I need to get the whole blocks, not just the last two parameter tensors!

There are some useful tricks:

print(len(model.transformer.h))
# output: 24

This means model.transformer.h holds the 24 transformer blocks, and you can slice it to get a group of them!

We can then do something like:

# Get the last 2 transformer blocks, and return all their params as a flat list
blocks = model.transformer.h[-2:]
return [param for block in blocks for param in block.parameters()]

Implementation 1.a

Some basic techniques for implementing 1.a are:

# Calculate cross-entropy loss given logits (N, C) and targets (N)
loss = F.cross_entropy(logits, targets)

# Calculate accuracy
preds = torch.argmax(logits, dim=1) # Get the index of the max value along the class dimension
acc = (preds == targets).float().mean().item() # Fraction of correct predictions

Note that .item() is important here, because it converts the zero-dim tensor into a plain Python number (a zero-dim tensor is simply a scalar value).

In-Context Learning

What surprises people is LLMs’ emergent ability to learn in-context, that is, their ability to learn a task without updating their parameters.

In task 2.a, we need to implement the prompt creation for the XSum and the bAbI tasks. This is very similar to the prompt library that I have been using.
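
As a rough illustration (not necessarily the assignment's exact template), a few-shot prompt can be built by concatenating each support input with its label and appending the query input at the end:

# Hypothetical prompt builder; the assignment's exact format (separators,
# task-specific markers such as "TL;DR:", etc.) may differ.
def build_prompt(support_inputs, support_labels, test_input):
    prompt = ""
    for x, y in zip(support_inputs, support_labels):
        prompt += f"{x} {y} "
    prompt += test_input
    return prompt.strip()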

There are some tricks worth sharing.

Shuffle Data

In the get_icl_prompts() function, we need to shuffle the support_inputs and support_labels together, in the same order.

This is how we can shuffle two arrays in the same order:

import numpy as np

a = np.array([...])
b = np.array([...])

indices = np.arange(a.shape[0])
np.random.shuffle(indices) # shuffle the index array in place

a = a[indices]
b = b[indices]

Or we can zip the two arrays together, shuffle, and unpack them:

import random

a = ['a', 'b', 'c']
b = [1, 2, 3]
c = list(zip(a, b))
random.shuffle(c)
a, b = zip(*c) # unzip c

print(a) # e.g. ('b', 'a', 'c')
print(b) # e.g. (2, 1, 3)

Note that the second approach returns tuples instead of lists.

Update: I realized that the requirement forces us to use np.random.permutation to shuffle the data. Otherwise, I could not pass all the hidden unit tests, because of the way they freeze the random seed. Therefore, I should use:

indices = np.random.permutation(len(support_inputs))
support_inputs = [support_inputs[i] for i in indices]
support_labels = [support_labels[i] for i in indices]

Greedy Sample

In 2.a.ii, we need to implement a greedy sampling algorithm. In utils.py, there is a function called get_model_and_tokenizer, which returns a pre-trained model from HuggingFace and the corresponding tokenizer.

The note says:

(Don’t use model.generate; implement this yourself in a loop)

That translates to:

  1. Given input_ids, generate one token at a time
  2. Select the token with the highest probability (the greedy sampling algorithm)
  3. Stop the loop when (a) the max number of tokens is reached, or (b) the stop token is generated
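
Here is a minimal sketch of that loop, assuming input_ids has shape (1, seq_len) and that max_tokens and stop_token_id are given (the assignment's exact signature and stop condition may differ):

import torch

@torch.inference_mode()
def greedy_sample(model, input_ids, max_tokens, stop_token_id):
    sampled_tokens = []
    for _ in range(max_tokens):
        logits = model(input_ids).logits          # (1, seq_len, vocab_size)
        next_token = torch.argmax(logits[0, -1])  # greedy: most likely next token
        if next_token.item() == stop_token_id:
            break
        sampled_tokens.append(next_token.item())
        # Append the new token and feed the longer sequence back into the model
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)
    return sampled_tokens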

AutoModelForCausalLM

AutoModelForCausalLM is a generic interface for loading pre-trained causal language models, such as GPT-2. Causal language models are designed for auto-regressive tasks, where the model predicts the next token based on the previous tokens.

When you pass the input tokens to the model, the output logits have the shape (batch_size, sequence_length, vocab_size):

  • batch_size: the number of input sequences
  • sequence_length: the length of the input sequence
  • vocab_size: the size of the vocabulary

What confused me at first is the second dimension, sequence_length. Shouldn’t the model just output the probability distribution for the last token, so that the size would be (batch_size, vocab_size)? On the HuggingFace forum, I found people asking similar questions.

GitHub Copilot gave me the answer:

If the input sequence is `[1, 2, 3]` (with `batch_size=1`), the logits will provide predictions for the next token after each position:

- logits for position 1 predict the next token after `[1]`.
- logits for position 2 predict the next token after `[1, 2]`.
- logits for position 3 predict the next token after `[1, 2, 3]`.
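
A quick sanity check of those shapes with GPT-2 (a small sketch):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("sky is blue.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits

print(logits.shape)            # (batch_size, sequence_length, vocab_size); vocab_size is 50257 for GPT-2
print(logits[0, -1].argmax())  # id of the most likely token after the whole prompt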

Tensor Size

import torch

a = torch.tensor([[1, 2, 3]])
a.shape # torch.Size([1, 3])

b = torch.tensor(4)
b # tensor(4)
b.shape # torch.Size([])
b.item() # 4

b = b.unsqueeze(0).unsqueeze(0) # add two size-1 dimensions: shape [] -> [1, 1]
b # tensor([[4]])

Tokenizer

We can use HuggingFace’s tokenizer to encode and decode the tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load the tokenizer
prompt = "sky is blue."
input_ids = tokenizer(prompt, return_tensors="pt").to(DEVICE).input_ids # Returns a tensor of token ids

# Call the model, then decode the generated token ids back into text

output_ids = [101, 102, 103] # placeholder for the token ids produced by the model
output = tokenizer.decode(output_ids, skip_special_tokens=True)

Note:

  • the tokenizer by default gives you tensors on the CPU, so we need to call to(DEVICE) to move them to the proper device.
  • access the token ids from the input_ids property
  • use skip_special_tokens=True if you want to remove special tokens when decoding

Tokenize GPT-2 Batch

In this task, we need to implement the tokenization step for a batch of examples for GPT-2.

Not too bad. We already have the tokenizer that comes with the GPT-2 model, and that is the building block.

Let’s dive into it:

x = ['Who is the singer for the band Queen?', 'What is the capital of France?']
y = ['Freddie Mercury', 'Paris']

tokenizer = transformers.GPT2Tokenizer.from_pretrained('gpt2')

print(tokenizer(x).keys())
# dict_keys(['input_ids', 'attention_mask'])

tokenizer(x)['input_ids']
# token ids for x (the inputs)
[[8241, 318, 262, 14015, 329, 262, 4097, 7542, 30],
 [2061, 318, 262, 3139, 286, 4881, 30]]

tokenizer(y)['input_ids']
# token ids for y (the output, i.e. the target)
[[30847, 11979, 21673],
 [40313]]

When we combine the input and target together:

# GPT-2's tokenizer has no pad token by default; here the pad token has been set to the EOS token (id 50256, visible as padding below)
tokenizer_dict = tokenizer([x_ + y_ for x_, y_ in zip(x, y)], return_tensors='pt', padding=True)

tokenizer_dict['input_ids']
# input ids, each row represents (x, y) pair
tensor([[ 8241, 318, 262, 14015, 329, 262, 4097, 7542, 30, 30847, 11979, 21673],
        [ 2061, 318, 262,  3139, 286,  4881, 30, 40313, 50256, 50256, 50256, 50256]])

tokenizer_dict['attention_mask']
# attention mask, 0 means padding
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])

Now, let’s check the requirements of the function tokenize_gpt2_batch; it needs to return:

  • input_ids: a tensor [batch_size, seq_length] containing token ids
  • attention_mask: a tensor [batch_size, seq_length] containing 1s and 0s, where 0 indicates padding
  • labels: a tensor [batch_size, seq_length] with -100 for non-target tokens

In the above scenario, the labels would be:

[[-100, -100, -100, -100, -100, -100, -100, -100,   -100,  30847, 11979, 21673],
 [-100, -100, -100, -100, -100, -100, -100,  40313, -100, -100,  -100,  -100]]

With these concepts, we can:

  1. zip x and y and concatenate each (input, target) pair
  2. use the tokenizer to get the batch input_ids and attention_mask
  3. use the attention_mask to set the padding positions of the labels to -100
  4. use the length of tokenizer(x)['input_ids'] (the tokenized input alone) to set the non-target positions to -100

Pay attention to the last point. We should not use the length of x itself, because the original string and its tokens have different lengths!
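
Here is a minimal sketch of that recipe (assuming the tokenizer's pad token has already been set, e.g. to the EOS token):

def tokenize_gpt2_batch(tokenizer, x, y):
    # Tokenize the concatenated (input + target) strings as one padded batch
    combined = tokenizer([x_ + y_ for x_, y_ in zip(x, y)],
                         return_tensors='pt', padding=True)
    labels = combined['input_ids'].clone()

    # Padding positions are not targets
    labels[combined['attention_mask'] == 0] = -100

    # Input (non-target) positions are not targets either; use the tokenized
    # input length, not len(x_), to find where the target starts
    for i, x_ in enumerate(x):
        x_len = len(tokenizer(x_)['input_ids'])
        labels[i, :x_len] = -100

    combined['labels'] = labels
    return combined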

LoRA

Low-Rank Adaptation (LoRA) is one of the fine-tuning techniques. It makes fine-tuning more efficient by representing the weight update with two smaller matrices through low-rank decomposition.

Let’s denote the pre-trained weight matrix $W^0_l$ and the fine-tuned weight matrix $W^\text{ft}_l$. We define two matrices $A$ and $B$ such that:

$$ W^\text{ft}_l = W^0_l + AB^T $$

where both $W^0_l$ and $W^\text{ft}_l$ have dimension $d_1 \times d_2$, $A$ has dimension $d_1 \times p$, $B$ has dimension $d_2 \times p$, and $p \ll d_1, d_2$.

Parameter Saving

The ratio of the parameters fine-tuned by LoRA to the number of parameters in $W^0_l$ is:

$$ \frac{d_1 \times p + d_2 \times p}{d_1 \times d_2} = \frac{d_1 + d_2}{d_1 \times d_2} \times p $$

When $d_1 = d_2 = d$, the ratio becomes $\frac{2p}{d}$!
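
For example, for a square $1024 \times 1024$ weight like attn.c_proj above, choosing a small rank such as $p = 4$ gives

$$ \frac{2p}{d} = \frac{2 \times 4}{1024} = \frac{1}{128} \approx 0.78\% $$

so LoRA trains well under 1% of that weight's parameters.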

Implementation Details

Named Parameters

For the loraN mode, we can use the named_parameters() function to iterate over (name, parameter) pairs and return all the parameters whose names start with lora.

We can also check whether a module is an instance of LoRAConv1DWrapper:

lora_params = []
for m in model.modules():
    if isinstance(m, LoRAConv1DWrapper):
        for name, param in m.named_parameters():
            if name == "lora_A" or name == "lora_B":
                lora_params.append(param)
return lora_params

Weight Shape

We can use base_module.weight.shape to extract the shape of the pre-trained weight matrix.

(in_features, out_features) = base_module.weight.shape # GPT-2's Conv1D stores the weight as (in_features, out_features)

Initialization

When we initialize lora_A and lora_B, we just need to make sure one of the matrices is a zero matrix. For instance, lora_A = torch.randn(d1, p) and lora_B = torch.zeros(p, d2). If we set both matrices to zeros, then during backpropagation the gradients of both A and B will be all zeros, and nothing will ever be learned.

Note that we need to wrap the tensors with nn.Parameter() so that they are registered as module parameters, show up in named_parameters(), and have their gradients tracked automatically.
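
A minimal sketch of that initialization (assuming LoRAConv1DWrapper wraps one of GPT-2's Conv1D modules and lora_rank is the rank p):

import torch
import torch.nn as nn

class LoRAConv1DWrapper(nn.Module):
    def __init__(self, base_module, lora_rank):
        super().__init__()
        self.base_module = base_module

        # GPT-2's Conv1D stores its weight as (in_features, out_features)
        d1, d2 = base_module.weight.shape

        # One factor random, the other zero: the initial update A @ B is zero,
        # so the wrapped layer starts out behaving exactly like the pre-trained one.
        self.lora_A = nn.Parameter(torch.randn(d1, lora_rank))
        self.lora_B = nn.Parameter(torch.zeros(lora_rank, d2))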

Forward Path

In the forward() function, we could do something like this:

def forward(self, x):
    # Explicitly construct the updated weight W0 + A @ B without assigning it back
    # to weight.data (that would accumulate the update on every forward call)
    updated_weight = self.base_module.weight + self.lora_A @ self.lora_B
    return x @ updated_weight + self.base_module.bias

We explicitly construct the matrix $AB^T$ here (with the shapes above, lora_B already stores $B^T$). The assignment note says we don’t need to construct it explicitly, and here’s how to do that:

return self.base_module(x) + x @ self.lora_A @ self.lora_B

Another trick here is that we need to freeze the pre-trained model:

self.base_module.weight.requires_grad = False
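
To sanity-check this, we can iterate over named_parameters() after wrapping the model (a small sketch; the exact parameter names depend on where the wrapper is installed):

for name, param in model.named_parameters():
    if 'lora' in name:
        # Expect the lora_A / lora_B tensors to appear here with requires_grad=True
        print(name, param.shape, param.requires_grad)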