Here’s a pretty cool article on understanding PyTorch conv1d shapes for text classification.

In this article, the shape of the example is:

  • n = 1: batch size (one sentence in the batch)
  • d = 3: dimension of the word embedding
  • l = 5: length of the sentence (number of words)
import torch
import torch.nn as nn

# Example represents one sentence here, with shape (n, l, d) = (1, 5, 3)
example = torch.rand(1, 5, 3)
example.shape # torch.Size([1, 5, 3])
example

# This is the output:
tensor([[[0.6075, 0.2709, 0.4999],
         [0.4998, 0.1681, 0.2203],
         [0.4491, 0.5006, 0.8735],
         [0.6581, 0.3150, 0.9370],
         [0.4392, 0.8715, 0.4723]]])

In the above output, you can imagine that each row represents one word and each column one embedding dimension.

To feed the tensor into a Conv1d layer, we need to rearrange it so that the embedding dimension becomes the channel dimension and comes first, i.e. shape (n, d, l):

# Batch size remains the same
# Use permute() to move the embedding (channel) dimension before the sentence length:
example = example.permute(0, 2, 1)
example

# This is the output:
tensor([[[0.6075, 0.4998, 0.4491, 0.6581, 0.4392],
         [0.2709, 0.1681, 0.5006, 0.3150, 0.8715],
         [0.4999, 0.2203, 0.8735, 0.9370, 0.4723]]])

With the new shape, you can think of the structure as an image, where each row is one channel of the image: R, G, B.
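As a quick sanity check (continuing with the permuted example from above), the tensor now has the (batch, channels, length) layout that nn.Conv1d expects:

example.shape # torch.Size([1, 3, 5]) -> (batch=1, channels=3, length=5)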

In terms of a sentence, you can think of each row as a different semantic space of the same sentence. In the blog post that I linked above, the example sentence the author uses is “word embedding is so cool”.
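As an illustration, here is a minimal sketch of how a (n, l, d) tensor like ours could come out of an embedding layer; the vocabulary and token indices below are made up for this example:

# Hypothetical vocabulary: {"word": 0, "embedding": 1, "is": 2, "so": 3, "cool": 4}
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
sentence = torch.tensor([[0, 1, 2, 3, 4]]) # (n=1, l=5) token indices
embedded = embedding(sentence)             # (n=1, l=5, d=3)
embedded = embedded.permute(0, 2, 1)       # (n=1, d=3, l=5), ready for a Conv1d layer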

Now let’s define a conv1d layer:

conv1 = nn.Conv1d(3, 1, 2)

where:

  • input channels = 3, the same as the word embedding dimension
  • output channels = 1, which in this case serves as the single classification score
  • kernel size = 2
  • stride = 1, this is the default value
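
With these settings (and the default padding of 0), the output length follows the usual Conv1d formula; a quick check with the values above:

# l_out = (l_in - kernel_size) / stride + 1 = (5 - 2) / 1 + 1 = 4
l_in, kernel_size, stride = 5, 2, 1
l_out = (l_in - kernel_size) // stride + 1
l_out # 4 convolution values for our 5-word sentence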

The kernel size defines the field of view of the convolution. For 2D convolutions over images, a common choice of kernel size is 3, which means a $3 \times 3$ patch of pixels.

Here’s a visualization that I got from the blog post “An Introduction to different Types of Convolutions in Deep Learning”:

(Figure: Kernel Size)
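
For comparison, a rough sketch of what that looks like in code; the channel count and image size here are made-up values, not from the article:

conv2 = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3) # 3x3 kernel over the R, G, B channels
image = torch.rand(1, 3, 32, 32)                                # (batch, channels, height, width)
conv2(image).shape                                              # torch.Size([1, 8, 30, 30])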

In our case, the kernel size is set to 2, which means we operate on 2 words at a time. Again, given the example “word embedding is so cool” from the blog post, the kernel slides over the following windows (see the sketch after this list):

  • [word, embedding] -> value
  • [embedding, is] -> value
  • [is, so] -> value
  • [so, cool] -> value
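
A minimal way to see these windows in code (continuing with our permuted example tensor) is to unfold the length dimension into windows of size 2 with a step of 1:

# unfold(dimension=2, size=2, step=1): slide a window of 2 positions along the sentence
windows = example.unfold(2, 2, 1)
windows.shape # torch.Size([1, 3, 4, 2]) -> 4 windows, each covering 2 words across all 3 channels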

Let’s run through the math for my example:

example
# the example:
tensor([[[0.6075, 0.4998, 0.4491, 0.6581, 0.4392],
         [0.2709, 0.1681, 0.5006, 0.3150, 0.8715],
         [0.4999, 0.2203, 0.8735, 0.9370, 0.4723]]])

conv1.weight
# the weight matrix, shape (out_channels=1, in_channels=3, kernel_size=2):
tensor([[[ 0.2121,  0.3245],
         [ 0.3465, -0.0467],
         [-0.0528,  0.0373]]], requires_grad=True)

conv1.bias
# the bias term:
tensor([-0.0737], requires_grad=True)

Let’s calculate the convolution value for the first window, [word, embedding]:

The embedding values (the first two columns of the permuted tensor) are:

# "word" -> example[0, :, 0]:
[0.6075, 0.2709, 0.4999]

# "embedding" -> example[0, :, 1]:
[0.4998, 0.1681, 0.2203]

The result should be:

# "word" * weight
(0.6075 * 0.2121) + (0.2709 * 0.3465) + (0.4999 * -0.0528)

# "embedding" * weight
+ (0.4998 * 0.3245) + (0.1681 * -0.0467) + (0.2203 * 0.0373)

# bias term
+ (-0.0737)

= 0.2851749
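
A quick way to reproduce this by hand in PyTorch (using the example and conv1 from above): slice out the first window, multiply it elementwise with the kernel weights, and add the bias:

first_window = example[0, :, 0:2] # the "word" and "embedding" columns, shape (3, 2)
manual = (first_window * conv1.weight[0]).sum() + conv1.bias[0]
manual # tensor(0.2852, ...) -- matches the value computed above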

If I run the example through the Conv1d layer:

output = conv1(example)

# Output
tensor([[[0.2852, 0.2338, 0.3826, 0.2449]]], grad_fn=<ConvolutionBackward0>)

As you can see, the first convolution value is 0.2852, which matches the manual calculation above (up to rounding).

For more detail, check out the blog post that I mentioned above.