How not to pay Adobe and process multiple photos with Python (using the GPU no less!)

Thiago Lira
Nov 23, 2022

There is a very elegant technique for removing crowds from photos: set up a tripod and take multiple shots from the exact same spot. Every single picture might have tons of people, but each picture has people in different places! So if you average the pixels, most of the crowd will simply disappear. (Technically, the median works better, but I'm getting ahead of myself.)

I took both of these pictures at the Zu Lai Temple in Cotia, Brazil. Notice that both have people passing by, but in slightly different spots.

Photoshop has a feature that lets you select a bunch of pictures, “stack” them, and then apply some function to aggregate their pixels, like the mean or median. But honestly, I already pay Adobe way too much money for Lightroom, which is the software I use to develop my photos. Even worse, to get Lightroom they make you buy the package with cloud storage you don't even need, it doesn't come with Photoshop, and you have to pay MONTHLY.

NOT TODAY ADOBE. I will take my own median from my own goddamn pixels thank you very much.

It’s all very straightforward: with a library like PIL you can load any .jpg file in Python as a NumPy array with dimensions:

(HEIGHT, WIDTH, RGB)

We want to stack all the photos taken from the same spot into a single array, or, simply:

all_images = np.stack(img_list)

So, now, our array dimensions are like so:

(N_IMAGES, HEIGHT, WIDTH, RGB)

To give a concrete example, one of the pictures I produced came from an array with dimensions (17, 4160, 6240, 3), which means I stacked data from 17 pictures, each 4160×6240 pixels (height × width), with 3 colors per pixel (the RGB channels).
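Loading and stacking might look something like the sketch below (the file paths here are hypothetical; any list of JPEGs shot from the same tripod position will do):

```python
import numpy as np
from PIL import Image

def load_stack(paths):
    """Load each JPEG as a (HEIGHT, WIDTH, 3) array and stack
    them into a single (N_IMAGES, HEIGHT, WIDTH, 3) array."""
    img_list = [np.array(Image.open(p).convert("RGB")) for p in paths]
    return np.stack(img_list)
```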

Now it is just a question of taking the median, but with so many dimensions it is easy to lose track of what exactly we are taking the median of.

Imagine we are aligning the pixels of the 17 images: for each aligned pixel position we take the median of the 17 RGB values, element-wise. Each pixel is a triple of 3 colors, so the median for each pixel is computed per channel, something like (median(R₁…R₁₇), median(G₁…G₁₇), median(B₁…B₁₇)).
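Here is a tiny worked example of that per-channel median for a single pixel position, with three made-up RGB values standing in for three shots:

```python
import numpy as np

# Hypothetical RGB values for the same pixel position across three shots.
pixel_stack = np.array([
    [10, 200, 30],   # shot 1
    [12, 100, 33],   # shot 2
    [240, 110, 31],  # shot 3
])

# The median is taken independently per channel; axis=0 runs across the shots.
median_pixel = np.median(pixel_stack, axis=0)
# → array([ 12., 110.,  31.])
```

Note how the outlier value 240 in the red channel of shot 3 (say, a person walking through) simply gets discarded.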

So, the NumPy expression to perform this calculation along the first dimension (the stacked images) is as follows:

result = np.median(all_images, axis=0)

And we get back a result which is a single image, with dimensions (4160, 6240, 3), ready to be exported.
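One small gotcha when exporting: np.median returns a float array, so it needs to be cast back to 8-bit before PIL will save it as a normal JPEG. A minimal sketch (the function name is my own):

```python
import numpy as np
from PIL import Image

def export_result(result, path):
    """Save a float (HEIGHT, WIDTH, 3) median image back to disk as a JPEG."""
    # np.median produces float64; Image.fromarray expects uint8 for RGB.
    Image.fromarray(result.astype(np.uint8)).save(path)
```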

That little crowd on the right and the woman on the left are gone!

No person in sight! Well, almost. The result would have been better if I had taken more pictures, but this was just to prove a point.

And to be even more pedantic about it, I wrote code that does the exact same thing on the CPU (with NumPy) and on the GPU (with PyTorch). Processing this image took 5s on my CPU and 2s on my GPU. Not too bad!
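The PyTorch version could look roughly like this (a sketch, not the author's exact code; it falls back to the CPU when no CUDA device is available, and note that `torch.median` along a dimension returns a `(values, indices)` pair rather than a bare tensor):

```python
import numpy as np
import torch

def median_stack(all_images):
    """Median over the first (image) axis, on the GPU when one is available.

    Assumes `all_images` is a NumPy array shaped (N_IMAGES, HEIGHT, WIDTH, 3).
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    t = torch.from_numpy(all_images).to(device).float()
    # Unlike np.median, torch.median with a dim returns (values, indices).
    values = t.median(dim=0).values
    return values.cpu().numpy()
```

One caveat: for an even number of images, `torch.median` returns the lower of the two middle values, whereas `np.median` averages them, so the two backends can differ slightly there.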
