Introduction

In traditional machine learning, we have a large dataset for a single task. In meta-learning, we have many tasks, each with a small dataset, and the hope is that we can train a model that picks up some fundamental ideas from the other tasks and reuses them on a new one.

Put it this way: the goal of traditional ML is to optimize performance on a single task, while the goal of meta-learning is to optimize for adaptability.

In this assignment, we will learn the MAML algorithm. MAML stands for Model-Agnostic Meta-Learning, and it is a very popular algorithm in meta-learning.

In a nutshell, MAML tries to find a parameter vector $\theta$ that allows the model to be quickly adapted to new tasks. Here’s the image that I got from the XCS330 lecture notes:

[Figure: MAML]

In this assignment, we will need to implement the inner loop and outer loop of the MAML algorithm.

MAML Deep-Dive

Besides the XCS330 lecture notes, I also watched Professor Hung-yi Lee (李宏毅)’s lectures on MAML. For those who understand Mandarin Chinese, I highly recommend them. Here’s the link to the first lecture.

In this section, I want to share some notes from the video.

First of all, the goal of MAML is to find a function $F$ that outputs a mapping function $f$ which can learn new tasks quickly. Again, the goal here is adaptability:

[Figure: MAML goal]

There are two loops in the MAML algorithm:

inner loop
outer loop

Inner Loop

In the inner loop, MAML focuses on task-specific training. In this step, MAML uses the support set to update the parameters $\phi$, which is very similar to traditional ML. Here’s the definition of the loss function in this step:

$$ \newcommand{\dataset}{\mathcal{D}} \begin{equation} \mathcal{L}(\phi, \dataset_i) = \frac{1}{\lvert \dataset_i \rvert} \sum_{(x^j, y^j) \in \dataset_i} -\log p_\phi (y = y^j \mid x^j) \end{equation} $$
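To make the loss concrete, here is a minimal sketch in PyTorch. The linear classifier, the toy support set, and all variable names are my own assumptions for illustration, not the assignment's setup. It writes out the mean negative log-likelihood term by term and checks it against `F.cross_entropy`, which computes the same average.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy support set: 5 examples, 4 features, 3 classes.
x_support = torch.randn(5, 4)
y_support = torch.tensor([0, 1, 2, 1, 0])

# Hypothetical task-specific parameters phi of a linear classifier.
phi_w = torch.zeros(4, 3, requires_grad=True)
phi_b = torch.zeros(3, requires_grad=True)

logits = x_support @ phi_w + phi_b

# Mean of -log p_phi(y = y^j | x^j) over the support set.
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(5), y_support].mean()

# The built-in cross-entropy computes the same quantity.
builtin_loss = F.cross_entropy(logits, y_support)
print(manual_loss.item(), builtin_loss.item())
```

With all-zero parameters the model predicts a uniform distribution over the 3 classes, so both values come out as $\log 3$.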

Outer Loop

In the outer loop, MAML focuses on meta-learning. In this step, MAML optimizes $\theta$ on the query data so that the model is optimized for adaptability.

Note that in the inner loop the model optimizes the task-specific parameters $\phi$, while here it optimizes the meta-learning parameters $\theta$. (Note: $\phi$ starts as a copy of $\theta$ at the beginning of each inner loop.)

[Figure: MAML params]
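Putting the two loops together, here is a rough sketch of one meta-training step. This is not the assignment's code: the linear model, the random toy tasks, the helper names `forward` and `inner_loop`, and the learning rates are all assumptions I made up to keep the example short. The key detail is `create_graph=True`, which lets the query loss be differentiated through the inner update back to $\theta$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def forward(params, x):
    # Hypothetical linear classifier; params is a [weight, bias] list.
    w, b = params
    return x @ w + b

def inner_loop(theta, x_support, y_support, inner_lr=0.4, num_steps=1):
    # phi starts as theta and is adapted on the support set.
    phi = list(theta)
    for _ in range(num_steps):
        loss = F.cross_entropy(forward(phi, x_support), y_support)
        # create_graph=True keeps the graph of this gradient so the
        # outer loss can backpropagate through the update to theta.
        grads = torch.autograd.grad(loss, phi, create_graph=True)
        phi = [p - inner_lr * g for p, g in zip(phi, grads)]
    return phi

# Meta-parameters theta (toy sizes: 4 features, 3 classes).
theta = [torch.zeros(4, 3, requires_grad=True),
         torch.zeros(3, requires_grad=True)]
optimizer = torch.optim.Adam(theta, lr=0.01)

# A hypothetical batch of 2 tasks: (support x, support y, query x, query y).
tasks = [(torch.randn(5, 4), torch.randint(0, 3, (5,)),
          torch.randn(5, 4), torch.randint(0, 3, (5,)))
         for _ in range(2)]

optimizer.zero_grad()
outer_loss = 0.0
for x_s, y_s, x_q, y_q in tasks:
    phi = inner_loop(theta, x_s, y_s)            # inner loop: support set
    outer_loss = outer_loss + F.cross_entropy(forward(phi, x_q), y_q)
outer_loss = outer_loss / len(tasks)             # outer loop: query set
outer_loss.backward()                            # gradients reach theta via phi
optimizer.step()
```

I wrote the model in a functional style (parameters passed in as a list) so that $\phi$ can be ordinary tensors derived from $\theta$ instead of a second copy of the module.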

Implementation

In this section, I will talk about some implementation details. The notes here are very specific to the assignment and PyTorch. Feel free to skip this section, since most readers will not have the context.

Understand `requires_grad`

To implement the inner loop, I needed to understand torch.autograd, so I read this tutorial. Here are some notes that I took from it.

Background

Neural networks are a collection of functions executed on some input data.

Training a neural network happens in two steps:

forward propagation: $y = f(x)$
backward propagation: the NN adjusts its parameters by traversing backwards from the output, using the error and its derivatives to update the parameters.
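The two steps above can be sketched as a tiny training loop. The toy regression data, the model size, and the learning rate are assumptions I picked for illustration; they do not come from the tutorial.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy regression data: y = 2x plus a little noise.
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x + 0.01 * torch.randn(20, 1)

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(100):
    prediction = model(x)                    # forward propagation
    loss = ((prediction - y) ** 2).mean()    # error of the prediction
    optimizer.zero_grad()                    # clear gradients from the last step
    loss.backward()                          # backward propagation
    optimizer.step()                         # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss should shrink over training
```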

Example

Let’s say we have a function $Q = 3a^3-b^2$, where $Q$ is the loss, and $a$ and $b$ are parameters.

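import torch

# a and b are the parameters; requires_grad=True tells autograd to track them
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Q = 3*a**3 - b**2

When we call .backward() on Q, autograd calculates these gradients and stores them in the tensors’ .grad attribute. Since Q is a vector rather than a scalar, we need to pass an explicit gradient argument with the same shape as Q.

# Q is a vector, so backward() needs a gradient of the same shape
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

# compare against the analytic gradients dQ/da = 9a^2 and dQ/db = -2b
print(9*a**2 == a.grad)  # tensor([True, True])
print(-2*b == b.grad)    # tensor([True, True])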

I highly recommend these YouTube videos:

Understand `autograd.grad()`

I got these answers from Google (copy-paste):

In PyTorch, torch.autograd.grad() is a function used for computing gradients. It calculates the sum of gradients of specified output tensors with respect to input tensors.

Unlike tensor.backward(), which accumulates gradients into the .grad attribute of leaf tensors, torch.autograd.grad() returns the gradients directly without modifying the .grad attributes.

This provides more flexibility when computing gradients for specific tensors or when needing to prevent accumulation.

Here’s the example (copy-paste):

import torch

# Example usage
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y

gradients = torch.autograd.grad(z, [x, y])
print(gradients)  # Output: (tensor(3.), tensor(2.))
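For MAML, the part I care about is that `torch.autograd.grad()` can also build a graph of the gradient itself. Here is a small sketch of my own (the function $y = x^3$ is just a toy choice) showing that `.grad` stays untouched and that `create_graph=True` makes the returned gradient differentiable again.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# autograd.grad returns the gradient and leaves x.grad untouched.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
print(dy_dx.item())   # dy/dx = 3x^2 = 27.0
print(x.grad)         # None

# Because create_graph=True, dy_dx is itself differentiable.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)
print(d2y_dx2.item()) # d2y/dx2 = 6x = 18.0
```

Without `create_graph=True`, the second call would fail because the first gradient would not be connected to `x` in the graph.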
