DPT

Created on March 24, 2025

2025 · Transformer · CV-Object

DPT(Dense Prediction Transformer)

https://arxiv.org/pdf/2103.13413

https://github.com/isl-org/DPT

Encoder:

Using VIT

The image embeddings are thus concatenated with a learnable position embedding to add this information to the representation.
Following work in NLP, the ViT additionally adds a special token that is not grounded in the input image and serves as the final, global image representation which is used for classification. We refer to this special token as the readout token.

Decoder:

Read Operation:

Concatenate & $Resample_s$ & Project

Dˆ = 256 in paper

Task Head

Enjoy Reading This Article?

Here are some more articles you might like to read next:

3DGS Ray Tracing

Modeling (3) — 3D Shape

Modeling (2) — Surface Reconstruction

Modeling (1) — Curve, Surface, Mesh