DPT

DPT(Dense Prediction Transformer)

https://arxiv.org/pdf/2103.13413

https://github.com/isl-org/DPT

Encoder:

Using VIT

  • The image embeddings are thus concatenated with a learnable position embedding to add this information to the representation.
  • Following work in NLP, the ViT additionally adds a special token that is not grounded in the input image and serves as the final, global image representation which is used for classification. We refer to this special token as the readout token.

Decoder:

Read Operation:

Concatenate & $Resample_s$ & Project

  • Dˆ = 256 in paper

Task Head




    Enjoy Reading This Article?

    Here are some more articles you might like to read next:

  • 3DGS
  • NeRF
  • SDS
  • Pretrain Diffusion
  • Latent Diffusion