DPT
DPT (Dense Prediction Transformer)
https://arxiv.org/pdf/2103.13413
https://github.com/isl-org/DPT
Encoder:
Uses a ViT (Vision Transformer) as the backbone.
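- Transformers are set-to-set functions and do not retain the spatial positions of individual tokens on their own, so the image embeddings are concatenated with a learnable position embedding to add this information to the representation.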
- Following work in NLP, the ViT additionally adds a special token that is not grounded in the input image and serves as the final, global image representation which is used for classification. We refer to this special token as the readout token.
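The two points above can be sketched in NumPy to make the token shapes concrete. This is a minimal illustration, not the DPT implementation: `embed_patches`, the patch size of 16, the small embedding width, and the random weights standing in for learned parameters are all assumptions made for the example. The sketch also adds the position embedding element-wise, as common ViT implementations do, which is an assumption about how the paper's "concatenated" is realised in practice.

```python
import numpy as np

def embed_patches(image, patch=16, dim=64, seed=0):
    """Turn an (H, W, C) image into ViT-style tokens of shape (Np + 1, dim).

    All weights here are random stand-ins for learned parameters (assumption).
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    gh, gw = h // patch, w // patch  # size of the patch grid

    # Split the image into non-overlapping patches and flatten each one.
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

    w_embed = rng.normal(size=(patch * patch * c, dim)) * 0.02  # linear patch embedding
    readout = rng.normal(size=(1, dim)) * 0.02                  # readout token t_0
    pos = rng.normal(size=(gh * gw + 1, dim)) * 0.02            # position embedding

    # Prepend the readout token, then inject position information.
    tokens = np.concatenate([readout, patches @ w_embed], axis=0)
    return tokens + pos

tokens = embed_patches(np.zeros((384, 384, 3)))
print(tokens.shape)
```

For a 384 × 384 input with 16 × 16 patches there are 24 × 24 = 576 patch tokens plus one readout token, so the encoder works on 577 tokens.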
Decoder:

Read Operation:

Concatenate & $\mathrm{Resample}_s$ & Project

- $\hat{D} = 256$ in the paper, where $\hat{D}$ is the feature dimension of the reassembled feature maps.
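The decoder steps listed above can be sketched in NumPy as follows. This is a hedged illustration based on the paper's description of the read variants (ignore, add, project), the reshaping of tokens into an image-like grid, and the 1×1 projection to $\hat{D}$ channels. The function names, the tanh-based GELU approximation, and the weights passed in are assumptions for the example, and the strided or transposed convolution that performs the actual spatial resampling is left out.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU (assumption: the exact form is not essential here)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def read_ignore(tokens):
    """Drop the readout token t_0 and keep the Np patch tokens."""
    return tokens[1:]

def read_add(tokens):
    """Add the readout token to every patch token."""
    return tokens[1:] + tokens[:1]

def read_proj(tokens, w, b):
    """Concatenate the readout token to each patch token, then map 2D -> D."""
    patch_tokens = tokens[1:]
    readout = np.broadcast_to(tokens[:1], patch_tokens.shape)
    return gelu(np.concatenate([patch_tokens, readout], axis=1) @ w + b)

def concatenate_tokens(patch_tokens, grid_h, grid_w):
    """Place the Np tokens back on the (H/p, W/p) patch grid."""
    return patch_tokens.reshape(grid_h, grid_w, -1)

def project_1x1(fmap, w):
    """A 1x1 convolution is a per-pixel linear map from D to D_hat channels."""
    return fmap @ w

# Toy example: 1 readout token + a 2x2 grid of patch tokens with D = 4.
tokens = np.arange(20, dtype=float).reshape(5, 4)
fmap = concatenate_tokens(read_ignore(tokens), 2, 2)
out = project_1x1(fmap, np.ones((4, 256)))  # D_hat = 256 as in the paper
print(fmap.shape, out.shape)
```

In the full model, the projected map would then be resized to the target scale $s$ before being passed to the fusion stages.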
Task Head
