Fast Subpixel Coordinate Regression with SoftArgMax

Artur B. Carneiro

Real-time inference demo at 30fps on a MacBook

Coordinate regression — predicting the $(x, y)$ location of something in an image — is one of those problems that sounds like it should be easy. You have a CNN, it learns some representations, you stick a fully connected layer on top, train with MSE, done. Except it doesn't work nearly as well as you'd expect, for reasons that are subtle and interesting. And the fix is a very elegant idea: a parameter-free, differentiable operator called SoftArgMax that achieves subpixel precision by preserving the spatial structure that the naive approach throws away.

I spent the last year applying this idea to a specific problem: estimating an aircraft's pose from runway images during landing approach, as part of my work at the Stanford Intelligent Systems Laboratory (SISL). The resulting paper was published at the Digital Avionics Systems Conference (DASC) 2025 [1]. But the technique itself is completely general: it works for any task where you need to locate keypoints in an image, whether that's runway corners, facial landmarks, pose estimation, etc. (comment some ideas you might have!)

A lot of the decision making and tricks we used to derive this solution did not make it into the final paper, so I thought it would be cool to walk through some of that in this post. Let's get into it!

The motivating problem: landing an airplane with a camera

To make this concrete, here's the application that motivated the work. Autonomous aircraft landing requires precise pose estimation — the aircraft needs to know where it is ($\mathbf{p} \in \mathbb{R}^3$) and how it's oriented ($\mathbf{R} \in SO(3)$) relative to the runway (see the image below for the axes of rotation).

Aircraft axes of rotation (yaw, pitch, roll).

GPS handles this most of the time, but GPS can be jammed, spoofed, or simply unavailable. So what if the aircraft could just look out the window and figure out where it is? It turns out that if you can locate four known points on the runway (say, the four corners) in a camera image, recovering the full 6-DoF pose is just geometry — specifically, the Perspective-n-Point (PnP) problem:

$$\arg\min_{(\mathbf{p}, \mathbf{R})} \sum_{i=1}^{K} \left\| \mathbf{y}_i - \text{project}_{\mathcal{C}}\!\left(\boldsymbol{\xi}_i, (\mathbf{p}, \mathbf{R})\right) \right\|_2^2$$

where $\boldsymbol{\xi}_i$ is the known 3D location of corner $i$, $\mathbf{y}_i$ is its observed 2D pixel location, and $\text{project}_{\mathcal{C}}$ is the camera projection. This optimization problem can be solved with Newton-Raphson, Levenberg-Marquardt, etc. We get into this in more detail in the paper, but it won't be the focus of this post.
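To make the geometry tangible, here's a toy version of that optimization solved with Levenberg-Marquardt via `scipy.optimize.least_squares`. Everything here (the pinhole intrinsics, the frame conventions, the initial guess) is made up for illustration; this is not the paper's solver:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

# Hypothetical pinhole intrinsics: focal length and principal point in pixels.
K_CAM = np.array([[800.0,   0.0, 320.0],
                  [  0.0, 800.0, 240.0],
                  [  0.0,   0.0,   1.0]])

def project(xi_world, rvec, t):
    """project_C: map 3D points xi (N, 3) to pixel coords (N, 2) under pose (rvec, t)."""
    p_cam = Rotation.from_rotvec(rvec).apply(xi_world) + t
    uvw = (K_CAM @ p_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def solve_pnp(xi_world, y_pixels, x0=None):
    """Minimize the reprojection residual over (rvec, t) with Levenberg-Marquardt."""
    if x0 is None:
        x0 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 20.0])  # crude init: 20 m out
    def residuals(x):
        return (project(xi_world, x[:3], x[3:]) - y_pixels).ravel()
    sol = least_squares(residuals, x0, method="lm")
    return sol.x[:3], sol.x[3:]
```

With four non-degenerate corners and a sane initial guess, the solver recovers the pose that generated the observations. In the real system, the sensitivity of this inverse map is exactly why subpixel keypoints matter.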

The catch is that at low altitudes, this inverse problem becomes really sensitive to the pixel coordinates $\mathbf{y}_i$. A couple of pixels of error in your corner estimate can translate to meters of error in position. So the interesting problem (the one we can use a neural network for) is: given an image, predict the 2D pixel coordinates of specific keypoints with as much precision as possible.

This is coordinate regression, and it's a fundamental task in computer vision that shows up far beyond aviation.

Here's what the regression task looks like — given a runway image, predict the pixel locations of its four corners (A, B, C, D):

Example regression task: given a runway image, predict the precise pixel locations of its four corners.

Why coordinate regression is hard (or: why naive doesn't work)

The obvious thing to do, if you've stared at enough CNNs, goes like this:

  1. Take a pretrained ResNet or similar backbone.
  2. Flatten the final feature map into a big vector.
  3. Stick a fully connected layer on top that outputs your coordinates.
  4. Train with MSE.
  5. Done.

This works, sort of. But it has a problem that's easy to miss until you actually try it: you're throwing away a lot of the spatial representations the CNN just spent 17 layers building.

Think about what a CNN's final feature map actually is. If you feed in a 224×224 image and get out a $7 \times 7 \times C$ tensor, each of those 49 spatial locations corresponds to some region of the input image. Convolutional features are inherently positional — a "corner-like" activation in the top-right cell tells you something very different from the same activation in the bottom-left. When you flatten and fully-connect, you're asking the FC layer to re-learn that spatial correspondence from scratch, as a bunch of weights. You need a lot of parameters, and you need a lot of data, and even then the predictions can be unstable across slightly different inputs.

The second problem is that MSE on pixel coordinates, combined with a fully connected head, has a nasty failure mode: if the network isn't sure where a keypoint is, it hedges to the average. You get these weirdly timid predictions pulled toward the center of the image.
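You can see this hedging in a two-line thought experiment (plain numpy, not from the project): if a keypoint is equally likely to be at two locations, the constant prediction that minimizes MSE is their mean, a location where the keypoint never actually is.

```python
import numpy as np

# A keypoint that is equally likely at x = 0 or x = 1 (an ambiguous image).
targets = np.array([0.0, 1.0])

# Evaluate the MSE of every constant prediction c in [0, 1].
candidates = np.linspace(0.0, 1.0, 101)
mse = np.array([np.mean((targets - c) ** 2) for c in candidates])

best = candidates[np.argmin(mse)]  # the MSE-optimal guess is the mean, 0.5
```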

There's a third problem, which is that pretrained backbones were never trained for this. They were trained on ImageNet for image classification (e.g., "this is a cat"). Getting them to say "this pixel is at $(312.47, 418.91)$" is asking them to translate across domains of representation. The flatten-and-FC approach asks the network to do that translation entirely in the final layer, which is a lot to ask.

The key idea: SoftArgMax

The trick, which comes from a 2016 paper by Levine, Finn, Darrell, and Abbeel [2] on end-to-end training of deep visuomotor policies, is to not throw away the spatial structure at all. Instead, interpret the feature map as a probability map over where the keypoint is, and take its center of mass.

Here's the full architecture. Notice how the backbone feeds into a $1 \times 1$ conv that produces one heatmap per keypoint, and then the SoftArgMax layer extracts the coordinates:

Input (224 × 224 × 3) → CNN backbone (e.g. ResNet-18) → feature map (H′ × W′ × C′) → 1×1 conv → heatmaps (H′ × W′ × K) → softmax → probability maps (H′ × W′ × K) → E[x, y] → coordinates for keypoints A–D (K × 2, in [0, 1]). The spatial SoftArgMax stage has 0 trainable parameters.

Here's how SoftArgMax works, step by step. We take the CNN backbone's output feature tensor $F \in \mathbb{R}^{H' \times W' \times C'}$ and apply a single $1 \times 1$ convolution that maps it to $K$ channels (one per keypoint):

$$H = \text{Conv}_{1\times 1}(F) \in \mathbb{R}^{H' \times W' \times K}$$

Each channel $h_k \in \mathbb{R}^{H' \times W'}$ is a "heatmap" for keypoint $k$. The SoftArgMax operator then does two things:

First, it turns the heatmap into a probability distribution via a spatial softmax:

$$P_{i,j} = \frac{\exp(h_k[i,j])}{\sum_{i'=0}^{H'-1} \sum_{j'=0}^{W'-1} \exp(h_k[i',j'])}$$

Second, it takes the expected value of the grid coordinates under that distribution:

$$\hat{x}_k = \sum_{i=0}^{H'-1} \sum_{j=0}^{W'-1} P_{i,j} \cdot \frac{j}{W'-1}, \qquad \hat{y}_k = \sum_{i=0}^{H'-1} \sum_{j=0}^{W'-1} P_{i,j} \cdot \frac{i}{H'-1}$$

That's it. The diagram below shows the operator visually — a heatmap goes through a softmax to become a probability map, then that distribution is element-wise multiplied with coordinate grids and summed to produce the final $(x, y)$:

SoftArgMax layer: input heatmap → Softmax → probability map → element-wise multiply with X and Y coordinate grids → sum → (x, y) output
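The two steps above translate almost line-for-line into PyTorch. This is my own minimal sketch of the operator, not the repo's `SpatialSoftArgmax` class, but it computes the same thing:

```python
import torch

def spatial_soft_argmax(heatmaps: torch.Tensor) -> torch.Tensor:
    """(B, K, H, W) heatmaps -> (B, K, 2) coordinates in [0, 1], ordered (x, y)."""
    B, K, H, W = heatmaps.shape
    # Step 1: spatial softmax turns each heatmap into a probability map.
    probs = torch.softmax(heatmaps.reshape(B, K, H * W), dim=-1).reshape(B, K, H, W)
    # Step 2: expected value of the normalized grid coordinates.
    xs = torch.linspace(0.0, 1.0, W, device=heatmaps.device)  # j / (W - 1)
    ys = torch.linspace(0.0, 1.0, H, device=heatmaps.device)  # i / (H - 1)
    x_hat = (probs.sum(dim=2) * xs).sum(dim=-1)  # marginalize rows, expect over cols
    y_hat = (probs.sum(dim=3) * ys).sum(dim=-1)  # marginalize cols, expect over rows
    return torch.stack([x_hat, y_hat], dim=-1)
```

A sharp peak at cell $(i, j) = (3, 5)$ of a $7 \times 7$ map comes out at $(x, y) \approx (5/6, 1/2)$: a continuous location, even though the grid is coarse.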

Notice what we've gained:

  • The output is continuous. Because we're taking a soft argmax (expected value) rather than a hard one (index of max), the predicted coordinates live in $[0, 1]$ as continuous values. The feature map might be $7 \times 7$, but we can still predict subpixel locations in the original 224×224 image.

  • Spatial structure is preserved end-to-end. The network doesn't need to learn "top-left pixels correspond to top-left coordinates" — it's baked into the layer.

  • Almost no new parameters. The $1 \times 1$ conv adds $(C' + 1) \times K$ weights. For our setup that's around 2,000 parameters. The SoftArgMax operator itself is parameter-free. You could in theory add more convolutional layers before the $1 \times 1$ conv to "smooth" the downsampling. We actually tried that, and it improved performance, but not enough to justify the added trainable parameters.

  • It's differentiable. All the gradients flow back cleanly through the softmax and the expectation.
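Quick sanity check on that parameter count, using $C' = 512$ (ResNet-18's final feature depth) and $K = 4$ corners:

```python
import torch.nn as nn

# The 1x1 conv head: 512 backbone channels -> 4 heatmaps (one per corner).
conv = nn.Conv2d(in_channels=512, out_channels=4, kernel_size=1)

# (C' + 1) * K = (512 + 1) * 4 = 2052: 2048 weights plus 4 biases.
n_params = sum(p.numel() for p in conv.parameters())
```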

There's something almost suspicious about how simple this is. We didn't add a decoder, we didn't upsample, we didn't do anything clever with multi-scale features. We just reinterpreted the feature map as a distribution and took its mean.

The implementation

The good news is you can see all the work we did, as we open-sourced it: github.com/sisl/LARDRunwayPredict [3]. But let's walk through the most important parts together.

One of the main design goals was to make the SoftArgMax head completely backbone-agnostic. Swapping the backbone is a one-line change:

# ResNet
model = LARDModel(model_name="resnet18")
 
# EfficientNet
model = LARDModel(model_name="efficientnet_b5")

Under the hood, the model is strikingly simple: a backbone, a $1 \times 1$ conv, and the SoftArgMax layer:

import timm
import torch.nn as nn

class SAMPredictor(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.backbone = timm.create_model(
            model_name="resnet18",
            pretrained=True,
            features_only=True  # returns a list of feature maps
        )
        self.head = nn.Sequential(
            nn.Conv2d(in_channels=512, out_channels=4,
                      kernel_size=1),  # 2k params
            SpatialSoftArgmax(),       #  0 params!
        )
 
    def forward(self, x):
        y = self.backbone(x)[-1]  # deepest feature map, (B, 512, 7, 7)
        return self.head(y)

This also works with Vision Transformers — we reshape the patch tokens back into a $14 \times 14$ spatial grid after dropping the CLS token, and the rest is identical. We didn't benchmark ViTs extensively for this paper, but it's a natural next step and would be a good exploration.
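The token-to-grid reshape is only a couple of lines. A minimal sketch (the function name is mine; it assumes a standard ViT with a leading CLS token and a 14 × 14 patch grid):

```python
import torch

def tokens_to_grid(tokens: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """ViT tokens (B, 1 + grid*grid, D) -> spatial feature map (B, D, grid, grid)."""
    patch_tokens = tokens[:, 1:, :]  # drop the CLS token
    B, N, D = patch_tokens.shape
    assert N == grid * grid, "token count must match the patch grid"
    # (B, N, D) -> (B, D, N) -> (B, D, grid, grid), row-major patch order
    return patch_tokens.transpose(1, 2).reshape(B, D, grid, grid)
```

The output plugs straight into the same 1×1 conv + SoftArgMax head.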

A few other decisions ended up mattering more than I expected.

Pre-cropping the image with bogo crop. The LARD dataset [4] gives us full-resolution images that are way bigger than $224 \times 224$. Resizing down to fit would destroy the subpixel precision we're trying so hard to preserve. So instead we pre-crop: we randomly pick a crop of the original image that contains all four runway corners, scale it to $224 \times 224$, and transform the corner coordinates to match. I remember calling it "bogo crop" when Roman suggested it and, funnily enough, the name made it into the paper. (If you didn't get the reference, "bogo crop" is a nod to bogosort, my favorite sorting algorithm.)
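Here's roughly what that augmentation looks like. This is my own sketch, not the repo's implementation: the actual pixel resize is omitted, and the corners are assumed to lie strictly inside the image:

```python
import numpy as np

def bogo_crop(img, corners, out_size=224, rng=None):
    """Randomly crop a window containing all corners; return the crop and the
    corner coordinates mapped into the (out_size x out_size) resized frame.
    corners: (K, 2) array of (x, y) pixel coordinates, strictly inside img."""
    rng = rng or np.random.default_rng()
    H, W = img.shape[:2]
    # The crop's top-left must be above/left of every corner...
    x0 = int(rng.integers(0, int(corners[:, 0].min()) + 1))
    y0 = int(rng.integers(0, int(corners[:, 1].min()) + 1))
    # ...and its bottom-right must be below/right of every corner.
    x1 = int(rng.integers(int(np.ceil(corners[:, 0].max())), W)) + 1
    y1 = int(rng.integers(int(np.ceil(corners[:, 1].max())), H)) + 1
    crop = img[y0:y1, x0:x1]  # resizing the pixels to out_size is omitted here
    # Map corner coordinates into the resized crop's frame.
    scale = np.array([out_size / (x1 - x0), out_size / (y1 - y0)])
    new_corners = (corners - np.array([x0, y0])) * scale
    return crop, new_corners
```

Because the crop always contains the corner bounding box, the transformed coordinates land inside $[0, \text{out\_size}]$, so the normalized labels stay in $[0, 1]$.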

The loss. We started with the classic MSE on the normalized coordinates in $(0, 1)$, but in the DASC paper we upgraded this to a Gaussian negative log-likelihood (NLL) that also predicts per-coordinate variance:

$$\mathcal{L}_{\text{NLL}} = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{2} \left\| \mathbf{L}_k^{-1} \left( \mathbf{y}_k^{\text{true}} - \boldsymbol{\mu}_k \right) \right\|_2^2 + \log \left| \boldsymbol{\Sigma}_k \right| \right]$$

where $\mathbf{L}_k$ is the Cholesky factor of the predicted covariance, $\boldsymbol{\Sigma}_k = \mathbf{L}_k \mathbf{L}_k^\top$.

The intuition: the first term penalizes prediction error (scaled by the predicted uncertainty $\boldsymbol{\Sigma}_k$), and the second term prevents the network from cheating by predicting infinite uncertainty — if it's unsure, it pays a cost for that too. This means the network outputs not just "the corner is at $(x, y)$" but "the corner is at $(x, y)$ with standard deviations $(\sigma_x, \sigma_y)$." That uncertainty estimate is what the downstream RAIM-style integrity monitor uses to flag bad predictions at runtime, which is crucial for aviation. The paper goes into much more detail on this [1].
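For a diagonal covariance (the network predicts a log-variance per coordinate), the loss is a few lines of PyTorch. This is a simplified sketch of the idea, not the paper's exact implementation, and the `log_var` head is an assumption:

```python
import torch

def keypoint_nll(mu, log_var, target):
    """Diagonal-Gaussian NLL. mu, log_var, target: (B, K, 2) tensors,
    where log_var holds log(sigma^2) per coordinate."""
    # Mahalanobis term: squared error scaled by the predicted variance.
    err = 0.5 * ((target - mu) ** 2 * torch.exp(-log_var)).sum(dim=-1)
    # Log-determinant term: punishes blanket "I'm very unsure" predictions.
    logdet = 0.5 * log_var.sum(dim=-1)
    return (err + logdet).mean()
```

Shrinking `log_var` toward minus infinity blows up the first term, and inflating it blows up the second, so the optimum is an honest variance estimate.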

Following Andrej Karpathy's advice — "monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy)" [5] — we also tracked what we ended up calling pixel error: the mean Euclidean distance (in pixels) between predicted and ground truth keypoints.

$$\text{Pixel Error} = \frac{1}{K} \sum_{k=1}^{K} \left\| \hat{\mathbf{y}}_k - \mathbf{y}_k \right\|_2$$

It tells you, on average, how many pixels off your prediction is from the ground truth. Much easier to reason about than a normalized NLL value.
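As a metric it's essentially one line (a sketch; coordinates here are assumed to already be in pixel units):

```python
import torch

def pixel_error(pred, target):
    """Mean Euclidean distance between predicted and true keypoints.
    pred, target: (B, K, 2) tensors in pixel coordinates."""
    return (pred - target).norm(dim=-1).mean()
```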

Results

We compared three configurations: a baseline ResNet-18 with a fully connected head, a ResNet-18 with the SoftArgMax head, and an EfficientNet-B5 with the SoftArgMax head. All trained on the LARD dataset, all evaluated on held-out runways.

Here's the convergence plot. The SAM head reaches the FC baseline's final accuracy about 3× faster, and then keeps improving:

Training convergence: Baseline FC ResNet18 (blue), ResNet18 with SoftArgMax (green), EfficientNet B5 with SoftArgMax (orange). SAM reaches baseline accuracy 3x faster.

And here's the final accuracy across all configurations:

| Model | FC (Mean Pixel Error) | SAM (Mean Pixel Error) |
| --- | --- | --- |
| ResNet-18 | 1.46 | 0.80 |
| ResNet-50 | 2.50 | 0.65 |
| EfficientNet-B5 | 10.59 | 0.50 |

The EfficientNet result is worth pausing on. With the FC head, EfficientNet-B5 does worse than the smaller ResNets — 10.59 pixels of error versus 1.46. EfficientNet produces richer, higher-dimensional feature maps than ResNet-18, but that's exactly what makes it harder to flatten and regress through an FC layer: more dimensions means more spatial structure to destroy, and more weights the FC layer needs to learn from scratch. With the SAM head, which preserves all that spatial information, EfficientNet becomes the best model we tested (0.50 px). The better the backbone's features, the more SoftArgMax has to work with.

You can see the difference visually: here's a zoomed-in view of the same runway corner predicted by all three models. The FC baseline (blue) is noticeably off, while the two SAM models (green and orange) cluster tightly around the ground truth:

Zoomed-in corner comparison: FC baseline (blue) is off, ResNet18+SAM (green) and EfficientNet B5+SAM (orange) cluster tightly around the true corner location.

What I took away

The lesson that kept repeating itself on this project is that the right interface between two parts of a system often matters more than the sophistication of either part. CNN backbones are good at producing spatially-structured feature maps. Downstream solvers (PnP, Kalman filters, whatever) are good at working with noisy point estimates. The bottleneck was the thing in the middle: how you convert "a bunch of convolutional features" into "here's where the keypoint is, with subpixel precision."

The FC head is a bad interface because it destroys the spatial structure. A full upsampling decoder is a good interface, but it's expensive and introduces a lot of new parameters. SoftArgMax is a great interface because it respects the structure of what's on either side of it: it takes a feature map and produces continuous coordinates, with almost no overhead, and all the gradients flow through cleanly.

This project gave me a lot of appreciation for small, well-placed architectural choices. Sometimes the best thing you can add to a neural network is almost nothing.

If you want to read the full paper, it's on arXiv [1] and includes the uncertainty quantification and the RAIM-based integrity monitoring built on top. The code is at github.com/sisl/LARDRunwayPredict [3] and trains in a few hours on a single GPU. And if you ever find yourself having to regress coordinates from an image (whether it's runway corners, facial landmarks, or the tip of a robot arm) I'd really recommend trying SoftArgMax before you try anything else.

References

Footnotes

  1. R. Valentin, S. M. Katz, A. B. Carneiro, D. Walker, and M. J. Kochenderfer. "Predictive Uncertainty for Runtime Assurance of a Real-Time Computer Vision-Based Landing System." arXiv preprint arXiv:2508.09732, 2025.

  2. S. Levine, C. Finn, T. Darrell, and P. Abbeel. "End-to-End Training of Deep Visuomotor Policies." Journal of Machine Learning Research (JMLR), 2016.

  3. LARDRunwayPredict repository. A PyTorch implementation of SoftArgMax-based runway corner regression with support for ResNet and ViT backbones.

  4. M. Ducoffe et al. "LARD - Landing Approach Runway Detection - Dataset for Vision Based Landing." Recherche Data Gouv, 2023.

  5. Andrej Karpathy. "A Recipe for Training Neural Networks." 2019.