[1] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[9] Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mechanisms in deep networks. In ICCV, 2019.
[10] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[11] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020.
[12] Xiangxiang Chu, Bo Zhang, Zhi Tian, Xiaolin Wei, and Huaxia Xia. Do we really need explicit position encodings for vision transformers? arXiv preprint arXiv:2102.10882, 2021.
[13] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605, 2021.
[14] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In ACL, 2018.
[15] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
[16] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
[17] Zhiheng Huang, Davis Liang, Peng Xu, and Bing Xiang. Improve transformer models with better relative position embeddings. In EMNLP, 2020.
[18] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.