Using conv2d op on RGB image

As someone who doesn’t have a lot of experience working with tensors, I’m a bit confused about the expected shape of the filter argument in the built-in conv2d operation.

I’m trying to apply a Gaussian blur to an RGB image. I understand that I should be able to use a simple NxN matrix applied to each color channel of each pixel, but I can’t figure out how the in/out-channels components of the filter shape relate to the groups arg of the conv2d op in order to achieve that.

Hey there! To apply an NxN filter to an RGB image, we’d expect the filter to have shape [N, N, 3, 3], where 3 is the number of input channels (RGB) and also the number of output channels (still RGB), per the docs. If you would like to apply the same filter to each channel, you can broadcast the [N, N] filter to the expected [N, N, 3, 3].
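
For reference, here’s a minimal sketch of the shapes involved (plain NumPy arrays with a made-up image size, just to illustrate the layout):

import numpy as np

image = np.zeros((1, 256, 256, 3))  # input layout: [batch, height, width, in_channels]
N = 5
filt = np.zeros((N, N, 3, 3))       # filter layout: [N, N, in_channels, out_channels]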


Hello Austin, thanks! I have the snippet below. It does work if N=3 (using the term “work” loosely, since the output is a bit odd, which could be a symptom of me doing this wrong), but if I try any other value of N the broadcast fails.

# Add a batch dim: [H, W, 3] -> [1, H, W, 3], and work in float64
image = orig_image.broadcast_to((1,) + tuple(orig_image.shape.static_dims)).cast(DType.float64)
N = 3
arr = gkern(N, sig=1.)  # N x N Gaussian kernel as a NumPy array
kernel = ops.constant(arr, dtype=DType.float64)
kernel = kernel.broadcast_to((N, N, 3, 3))  # this is the broadcast that fails for other N
res = ops.conv2d(
    image,
    kernel
)[0].tensor

Interesting! Could you share the implementation of the gkern function and also an example of what the output is when you try a different value of N?

I’ve just inlined the content of gkern here:

def gaussian_blur(orig_image: TensorValue, sigma: float=1.) -> TensorValue:
    assert_rgb(orig_image)

    orig_dtype = orig_image.dtype

    # Add a batch dim: [H, W, 3] -> [1, H, W, 3], and work in float64
    image = orig_image.broadcast_to((1,) + tuple(orig_image.shape.static_dims)).cast(DType.float64)
    N = 6

    # Inlined gkern: build an N x N Gaussian kernel and normalize it to sum to 1
    ax = np.linspace(-(N - 1) / 2., (N - 1) / 2., N)
    gauss = np.exp(-0.5 * np.square(ax) / np.square(sigma))
    kernel = np.outer(gauss, gauss)
    arr = kernel / np.sum(kernel)

    kernel = ops.constant(arr, dtype=DType.float64)
    kernel = kernel.broadcast_to((N, N, 3, 3))  # fails here when N=6 (error below)
    res = ops.conv2d(
        image,
        kernel
    )[0].tensor

    return res.cast(orig_dtype)

This is the error I get when it tries to broadcast the kernel with N=6:

  File "/Users/bgreni/Coding/max-cv/.magic/envs/showcase/lib/python3.12/site-packages/max/graph/graph.py", line 314, in _add_op_get_op_with_results
    raise ValueError(
ValueError: Failed to create op 'broadcast_to':
Inputs:
    input = TensorValue(dtype=float64, shape=[Dim(6), Dim(6)], device=None)
    new_shape = Attribute(#mosh<ape[6, 6, 3, 3]> : !mosh.ape)

Diagnostics:
    [broadcast_to] input dim must be either 1 or equal to corresponding output dim starting from the rightmost dim
Operation creation failed

Awesome, thanks for the context! I see what is happening here, and I think it might be a bug. I’m going to follow up with the team and get it fixed, but you should be able to work around the issue if you change your code to the following:

    kernel = ops.constant(arr, dtype=DType.float64)
    kernel = kernel.reshape(N, N, 1, 1)  # explicitly create the trailing size-1 dims
    kernel = kernel.broadcast_to((N, N, 3, 3))

EDIT: Actually, this is expected behavior; there’s no bug to fix here 🙂 Broadcasts only implicitly create leading dimensions, so any trailing dimensions have to be created explicitly via reshape. Other frameworks (e.g. PyTorch) behave the same way.
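
For reference, PyTorch enforces the same rule; here’s a quick illustrative sketch (PyTorch code, not MAX):

import torch

k = torch.ones(6, 6)
# torch.broadcast_to(k, (6, 6, 3, 3))      # fails: shapes are matched from the rightmost dim
k = k.reshape(6, 6, 1, 1)                  # create the trailing size-1 dims explicitly
k = torch.broadcast_to(k, (6, 6, 3, 3))    # now legal: the size-1 dims expand to 3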


Hi @bgreni

The offending line is:
kernel = kernel.broadcast_to((N, N, 3, 3))

Here you are trying to broadcast shape (6, 6) to (6, 6, 3, 3), which is not considered legal. The MAX stack implements the broadcasting rules found in NumPy/PyTorch, and under those frameworks this fails too. The NumPy broadcasting documentation is a good place to learn the mechanics fully.

I think I see what you want to do, though, and there are a variety of solutions. I think the simplest is to reshape your (6, 6) tensor to shape (6, 6, 1, 1), then broadcast (6, 6, 1, 1) → (6, 6, 3, 3).

This detail comes down to how broadcasting “pads” missing dimensions: they are added on the leading side rather than the trailing side, so the (6, 6) kernel is treated as (1, 1, 6, 6) when matched against a 4-D shape, and the rightmost dims then fail to line up.
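
To make the padding direction concrete, here’s a small NumPy sketch of the same mechanics (illustrative only):

import numpy as np

np.broadcast_shapes((3, 3), (6, 6, 3, 3))        # OK -> (6, 6, 3, 3): missing leading dims are padded with 1
np.broadcast_shapes((6, 6, 1, 1), (6, 6, 3, 3))  # OK -> (6, 6, 3, 3): size-1 dims expand
# np.broadcast_shapes((6, 6), (6, 6, 3, 3))      # ValueError: 6 vs 3 mismatch at the rightmost dims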


Yup that works, thank you so much!