
Generalized Decoding for Pixel, Image, and Language


Figure 1. With one suite of parameters, X-Decoder after pretraining supports all types of image segmentation tasks ranging from open-vocabulary instance/semantic/panoptic segmentation to referring segmentation, and vision-language tasks including image-text retrieval and image captioning (labeled in green boxes). It further empowers composite tasks like referring captioning using X-Decoder itself, and image editing by combining with generative models such as Stable Diffusion [66] (labeled in yellow boxes).


Abstract

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing shown in Fig. 1). Code, demo, video and visualization are available at: https://x-decoder-vl.github.io.

* The work was developed during an internship at Microsoft.

1. Introduction

Visual understanding at different levels of granularity has been a longstanding problem in the vision community. The tasks span from image-level tasks (e.g., image classification [15], image-text retrieval, image captioning [8], and visual question answering (VQA) [2]), region-level localization tasks (e.g., object detection and phrase grounding [63]), to pixel-level grouping tasks (e.g., image instance/semantic/panoptic segmentation [28, 37, 51]).
Until recently, most of these tasks have been separately tackled with specialized model designs, preventing the synergy of tasks across different granularities from being exploited. In light of the versatility of transformers [72], we are now witnessing a growing interest in building general-purpose models that can learn from and be applied to a diverse set of vision and vision-language tasks, through multi-task learning [27, 32], sequential decoding [7, 54, 76, 85], or a unified learning strategy [84, 91, 94, 95]. While these works have shown encouraging cross-task generalization capabilities, most target the unification of image-level and region-level tasks, leaving the important pixel-level understanding underexplored. In [7, 54], the authors attempt to unify segmentation into the decoding of a coordinate sequence or a color map, which, however, produces suboptimal performance and limited support for open-world generalization.

Arguably, understanding images down to the pixel level is one of the most important yet challenging problems in that: (1) pixel-level annotations are costly and undoubtedly much more scarce than other types of annotations; (2) grouping every pixel and recognizing them in an open-vocabulary manner is less studied; and (3) more importantly, it is non-trivial to learn from data at two substantially different granularities while also obtaining mutual benefits. Some recent efforts have attempted to bridge this gap from different aspects. In [12], Cheng et al. propose a unified architecture, Mask2Former, that tackles all three types of segmentation tasks, but in a closed set. To support open-vocabulary recognition, a number of works study how to transfer or distill rich semantic knowledge from image-level vision-language foundation models such as CLIP [64] and ALIGN [34] to specialist models [17, 25, 65]. However, all these initial explorations focus on specific segmentation tasks of interest and do not show generalization to tasks at different granularities. In this work, we take one step further to build a generalized decoder called X-Decoder¹ towards the unification of pixel-level and image-level vision-language understanding, as shown in Figure 1.

¹ Here, 'X' denotes versatile, and also represents 'piXel'.

A generalized decoding framework. We formulate all tasks, including pixel-level image segmentation, image-level retrieval and vision-language tasks, into a generic decoding procedure. Specifically, X-Decoder is built on top of a vision backbone and a transformer encoder for extracting multi-scale image features, following the framework of Mask2Former [12]. The key novelty lies in the decoder design. First, it takes two sets of queries as input: (i) generic non-semantic queries that aim to decode segmentation masks for universal segmentation, similar to Mask2Former [12], and (ii) newly introduced textual queries to make the decoder language-aware for a diverse set of language-related vision tasks. Second, it predicts two types of outputs: pixel-level masks and token-level semantics, and their different combinations can seamlessly support all tasks of interest. Third, we use a single text encoder to encode the textual corpus involved in all tasks, including concepts in segmentation, phrases in referring segmentation, tokens in image captioning, questions in VQA, etc. As a result, our X-Decoder can naturally facilitate the synergy across tasks and advocate the learning of a shared visual-semantic space, while respecting the heterogeneous nature of different tasks.

An end-to-end learning paradigm. With our generalized decoder design, we propose an end-to-end pretraining method to learn from all granularities of supervision. We unite three types of data: panoptic segmentation, referring segmentation, and image-text pairs. Unlike previous works that use pseudo-labeling techniques to extract fine-grained supervision from image-text pairs [25, 95], X-Decoder directly groups and proposes a few meaningful segmentation candidates, so that it can easily map the regions to the contents described in the captions on the fly. Meanwhile, the referring segmentation task bridges generic segmentation and image captioning by sharing the pixel-level decoding with the former and semantic queries with the latter.

Strong zero-shot and task-specific transferability to a wide range of segmentation and VL tasks. Pretrained with a limited amount of segmentation data and millions of image-text pairs, our X-Decoder supports a diversity of tasks in a zero-shot and open-vocabulary manner. Concretely, our model can be directly applied to all three types of segmentation tasks in a wide range of domains, establishing new state-of-the-art results on ten settings of seven datasets. When transferred to specific tasks, our model also exhibits consistent superiority to previous works. Finally, we observe some intriguing properties of our model in that it can support novel task compositions and efficient finetuning, thanks to the flexibility endowed by our model design.

2. From Specialist to Generalist Models

2.1. Pixel-Level Understanding

Pixel-level image understanding, also known as image segmentation, has been a long-standing problem [23, 62].

Generic Segmentation. There are mainly three well-defined tasks for pixel-level understanding: semantic [51], instance [28], and panoptic [37] segmentation. Semantic segmentation cares about the per-pixel semantics within an image [6, 11, 51], whereas instance segmentation groups pixels of the same semantic meaning into object instances. Models for both tasks have evolved from CNN-based architectures [51] to transformer-based ones [11], and from two-stage models [29] to one-stage models [3, 71] and the recent query-based approaches [18, 100]. With the capability of per-pixel and instance-level understanding, a natural step was taken to formulate panoptic segmentation [12, 37, 73].
Most recently, Mask2Former [12] proposed to address all three tasks with a unified encoder-decoder architecture. Nevertheless, all these works cope with a limited number of categories, i.e., the models can hardly recognize concepts absent from the training set. In MSeg [40], the authors manually merge different datasets and train a more generalized model on the composite set, which is still limited to a closed set.

Open-Vocabulary Segmentation. Recently, a number of works opt to transfer or distill the rich visual-semantic knowledge from foundation models like CLIP [64] and ALIGN [34] to specific segmentation tasks. Prominent examples include LSeg [41], OpenSeg [25], and [33]. Instead of using existing models, GroupViT [81] performed language-image pretraining from scratch with a bottom-up grouping ViT [19], while DenseCLIP [65] demonstrated the superiority of foundation models over supervised models in finetuning settings. Recently, MaskCLIP [17] proposed to tackle open-vocabulary panoptic and semantic segmentation by leveraging CLIP, and achieved SoTA performance on ADE20K [98] and PASCAL [22, 59].

Referring Segmentation is by nature open-vocabulary in that it does not presume a fixed number of phrases at training and inference time. Models are usually designed specifically to learn from target datasets using various multimodal fusion strategies [31, 49, 57, 88, 92]. Since the emergence of vision transformers, works like LAVT [86] enhance the cross-modal interactions from the very beginning, which led to SoTA on RefCOCO [92], RefCOCO+ [92] and G-Ref [56, 60]. CLIPSeg [55] extended the textual query to a visual query and showed superior performance not only on referring segmentation but also on semantic segmentation.

In this work, we propose X-Decoder, the first model to tackle generic and referring segmentation tasks all in one model. Furthermore, the generalized decoder jointly learns from segmentation data and image-text pairs end-to-end, and thus can augment the synergy across tasks for rich pixel-level and image-level understanding.

2.2. Vision-Language Understanding

Vision-language (VL) pretraining has proven to be effective for various VL tasks [44, 53, 69, 70]. The field has evolved from transformer fusion models [10, 46, 96] with pre-extracted object features [1] to end-to-end transformers [21, 36, 43] that directly learn from raw image pixels. Recently, researchers [68, 78, 79] have found that image-text data at scale can be helpful for visual representation learning (e.g., enabling zero-shot image classification [34, 64] and action recognition [91, 94]). VL pretrained models can be further extended to region-level tasks, such as phrase grounding and open-vocabulary object detection [26, 35, 58, 97], and unified frameworks that aim to combine image-text pairs with region-level data have also been proposed [4, 20, 45, 87, 95]. A comprehensive review on this topic is provided in [24].

We are witnessing a clear trend from building specialist models to generalist ones. Early efforts [27, 32] build a multi-task learning paradigm to accommodate a diversity of tasks. However, the interactions among different tasks in these works are less studied, and the combination usually leads to performance degradation compared with specialist models. Recently, a number of works aim to reformulate the tasks into a unified sequential decoding process [7, 38, 54, 76, 85]. In this work, instead of developing a unified interface for vision and VL tasks, our X-Decoder builds a generalized decoding paradigm that can seamlessly connect the tasks by taking what is common (e.g., semantics) while respecting the natural differences (e.g., spatial masks vs. sequential language), leading to significant improvements for different segmentation and VL tasks across the board.

Figure 2. Overall pipeline of our model. It consists of an image encoder, a text encoder, and our own designed X-Decoder.

3. X-Decoder

3.1. Formulation

Our model follows the generic encoder-decoder design, as shown in Fig. 2. Given an input image I ∈ R^{H×W×3}, we first use an image encoder Enc_I to extract features Z. Afterwards, we use the text encoder Enc_T to encode a textual query T into Q^t = ⟨q^t_1, ..., q^t_n⟩ of length n. The visual features, the textual queries and the m non-semantic or latent queries Q^h = ⟨q^h_1, ..., q^h_m⟩ are fed to our X-Decoder to predict the outputs:

⟨O^p, O^s⟩ = XDec(⟨Q^h, Q^t⟩; Z)    (1)

where O^p and O^s are the pixel-level masks and token-level semantics, respectively. In the above formula, we note three critical designs that empower the generalization ability of our X-Decoder to a variety of vision and vision-language tasks.

We define two types of queries and outputs for X-Decoder. As discussed earlier, the queries for the decoder are categorized into latent queries Q^h and text queries Q^t, which undertake generic vision and vision-language tasks, respectively, and their combinations can further support various language-aware tasks such as referring segmentation, VQA, etc.

Likewise, the outputs are categorized into pixel-level masks O^p and semantic embeddings O^s. By simply using different combinations, we can adapt our X-Decoder to various tasks with the same suite of parameters.

We employ a single text encoder Enc_T to encode the textual corpus from all tasks. The common text encoder is used to encode referring phrases, text descriptions and image captions in the tasks of referring segmentation, image-text retrieval and image captioning, respectively. Furthermore, we reformulate the mask classification in segmentation into a mask-text matching problem between O^s and the textual embeddings of prompted textual concepts, similar to [25, 84]. Sharing the text encoder for all textual corpora can maximally exchange knowledge across different tasks and learn a richer and more coherent semantic space.

We fully decouple the image and text encoder. In many previous unified encoder-decoder models [7, 35, 85], the image and text are fused on the encoder side. This design makes it intractable not only for global image-text contrastive learning [64, 84], but also for generative pretraining [75]. In contrast, by fully decoupling the image and text encoders and using their outputs all as queries, X-Decoder can learn from both intra-image and inter-image supervisions, which is essential for learning stronger pixel-level representations and supporting different granularities of tasks.

3.2. Unification of Tasks

Based on the above designs, X-Decoder can be used to seamlessly unify different vision and vision-language tasks, simply with different combinations of queries as inputs.

Generic Segmentation. For this task, there are no textual queries as inputs. Hence, Eq. (1) becomes:

⟨O^p, O^s⟩ = XDec(Q^h; Z)    (2)

where O^p and O^s have the same size as Q^h. Eq. (2) reduces to Mask2Former [12], but with open-vocabulary capacity, since we use mask-text matching for mask classification.

Referring Segmentation. It requires both latent and text queries as inputs, and thus shares the same formula as Eq. (1). Similar to generic segmentation, we only use the first m decoded outputs corresponding to the latent queries. Compared with Eq. (2), referring segmentation can be regarded as language-conditioned generic segmentation.
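To make the formulation concrete, below is a minimal PyTorch-style sketch of the generalized decoding interface in Eqs. (1) and (2): one decoder consumes m learnable latent queries, optionally concatenated with text queries, and returns mask logits for the latent queries plus semantic embeddings for all queries. All class, attribute and argument names (XDecoderSketch, mask_embed, sem_embed, pixel_feats, etc.) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the generalized decoding interface in Eq. (1)/(2); names are illustrative.
import torch
import torch.nn as nn

class XDecoderSketch(nn.Module):
    def __init__(self, d_model=512, num_latent=101, num_layers=9, nhead=8):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(num_latent, d_model))      # Q^h
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead, batch_first=True) for _ in range(num_layers)]
        )
        self.mask_embed = nn.Linear(d_model, d_model)  # projects queries for mask prediction
        self.sem_embed = nn.Linear(d_model, d_model)   # projects queries into the shared semantic space

    def forward(self, visual_feats, pixel_feats, text_queries=None):
        # visual_feats: (B, S, d) flattened multi-scale features from the image encoder
        # pixel_feats:  (B, d, H, W) per-pixel features used to render masks
        # text_queries: (B, n, d) Q^t from the text encoder, or None for generic segmentation
        B, m = visual_feats.size(0), self.latent_queries.size(0)
        q = self.latent_queries.unsqueeze(0).expand(B, -1, -1)
        if text_queries is not None:
            q = torch.cat([q, text_queries], dim=1)            # <Q^h, Q^t>
        for layer in self.layers:
            q = layer(q, visual_feats)                         # cross- and self-attention refinement
        sem = self.sem_embed(q)                                # O^s: token-level semantics for all queries
        masks = torch.einsum("bqd,bdhw->bqhw",
                             self.mask_embed(q[:, :m]), pixel_feats)  # O^p: mask logits for latent queries only
        return masks, sem
```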

Image-Text Retrieval. The decoupled image and text encoders in our X-Decoder make inter-image retrieval tasks straightforward. Specifically, we only feed the latent queries to the decoder and obtain the semantic representation of an image:

O^s = XDec(Q^h; Z)    (3)

where O^s has the same length as Q^h, and the last (m-th) token in O^s is then used to compute the similarities between images and texts.

Image Captioning and VQA. For both tasks, X-Decoder takes both latent and text queries and decodes the outputs:

O^s = XDec(⟨Q^h, Q^t⟩; Z)    (4)

where O^s correspondingly has equal size to Q^t, and no masks are predicted. There are two slight differences between the two tasks. First, caption prediction follows a causal masking strategy while VQA does not. Second, we use all the outputs in O^s for captioning, but only the last one to predict the answer for VQA.

The adaptation of our X-Decoder to each task is further depicted in Fig. 3. Based on this unification, we can pretrain our X-Decoder jointly on all tasks using a proper combination of queries and losses, and further finetune it for individual tasks without any extra heads.² As discussed earlier, a lineup of works exploited a sequential decoding interface for unification [7, 13, 54, 77, 85]. However, in this work, we advocate unification by functionality rather than by interface; namely, we maximally share the common parts of different tasks while keeping the remaining parts unchanged for individual tasks.

² VQA is not used in pretraining, following common practice.
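As a companion to the sketch above, the routing below shows, under the same illustrative assumptions, how each task in Eqs. (2)-(4) is just a different choice of query combination and output slice from the same decoder call; the helper names are hypothetical.

```python
# Illustrative task routing over the XDecoderSketch above (Eqs. (2)-(4)); not the released API.
def generic_segmentation(decoder, visual_feats, pixel_feats):
    masks, sem = decoder(visual_feats, pixel_feats)                  # Eq. (2): latent queries only
    return masks, sem                                                # masks + embeddings for mask-text matching

def referring_segmentation(decoder, visual_feats, pixel_feats, phrase_queries):
    masks, _ = decoder(visual_feats, pixel_feats, text_queries=phrase_queries)   # Eq. (1)
    return masks                                                     # keep only the m latent-query masks

def image_embedding_for_retrieval(decoder, visual_feats, pixel_feats):
    _, sem = decoder(visual_feats, pixel_feats)                      # Eq. (3): no text queries
    m = decoder.latent_queries.size(0)
    return sem[:, m - 1]                                             # last (m-th) latent output represents the image

def caption_token_outputs(decoder, visual_feats, pixel_feats, token_queries):
    _, sem = decoder(visual_feats, pixel_feats, text_queries=token_queries)      # Eq. (4)
    m = decoder.latent_queries.size(0)
    return sem[:, m:]                                                # semantic outputs aligned with the n text queries
```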


3.3. Unified Architecture

We follow Mask2Former [12] to build our decoder architecture. Given an image I ∈ R^{H×W×3}, we extract hierarchical visual features from L layers:

Z = Enc_I(I) = ⟨z_l⟩_{l=1}^{L}    (5)

where z_l ∈ R^{H_l×W_l×d}, {H_l, W_l} is the size of the feature map at level l, and d is the feature dimension. These hierarchical feature maps are important for pixel-level understanding at different scales.

One Decoder XDec for All Tasks. Given the visual features Z, X-Decoder uses a stack of transformer layers to refine the queries and render the outputs. At layer l, it first cross-attends the visual features and then performs self-attention among the latent and text queries:

⟨Q̂^h_{l-1}, Q̂^t_{l-1}⟩ = CrossAtt(⟨Q^h_{l-1}, Q^t_{l-1}⟩; Z)    (6)
⟨Q^h_l, Q^t_l⟩ = SelfAtt(⟨Q̂^h_{l-1}, Q̂^t_{l-1}⟩)    (7)

In Eq. (6), we let all queries cross-attend the visual features. For the latent queries, we use a masked cross-attention mechanism as in [12], and full attention for the textual queries. In Eq. (7), we specifically design the self-attention mechanism to prompt the synergy of tasks: (i) we use the last latent query to extract the global image representation and the remaining ones for generic segmentation; (ii) for image captioning, each textual query can attend itself, its predecessors and all latent queries; (iii) for referring segmentation, latent queries attend all text queries to use them as the language condition. Based on these rules, the resulting self-attention pattern in our X-Decoder is shown in Fig. 4.
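The task-dependent self-attention in Eq. (7) can be pictured as a boolean mask over the (m + n) queries. The sketch below builds such a mask following rules (i)-(iii) as described in the text; the exact wiring in Fig. 4 and in the released code may differ, and the function name and the convention (True = may attend) are assumptions.

```python
import torch

def build_self_attention_mask(m, n, language_conditioned=True):
    # Returns an (m+n) x (m+n) boolean matrix; entry [i, j] is True when query i may attend query j.
    # The m latent queries come first, followed by the n text queries.
    size = m + n
    allowed = torch.zeros(size, size, dtype=torch.bool)
    allowed[:m, :m] = True                                    # latent queries attend each other (segmentation + global query)
    allowed[m:, :m] = True                                    # text queries attend all latent queries
    allowed[m:, m:] = torch.tril(torch.ones(n, n)).bool()     # causal attention among text queries (captioning)
    if language_conditioned:                                  # referring segmentation: latent queries also see the text queries
        allowed[:m, m:] = True
    return allowed
```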

The output of our X-Decoder is also categorized into two types: 1) pixel-wise masks and 2) semantic outputs. X-Decoder always produces masks only for the m latent queries, i.e., O^p = {o^p_1, ..., o^p_m} ∈ {0, 1}^{m×H×W}. As for the semantic outputs, X-Decoder predicts outputs for both latent and text queries, i.e., O^s = {o^s_1, ..., o^s_{m+n}} ∈ R^{(m+n)×d}, to cover both mask recognition and caption generation.

One Encoder Enc_T for All Texts. Our text encoder consists of a number of transformer layers. Given a raw text such as a phrase or a caption, we convert it to discrete tokens using an off-the-shelf tokenizer and then send it to the text encoder. We apply causal masking to ensure its outputs are compatible with caption decoding. For segmentation, we follow [64, 84] to convert each class name into a phrase with a text prompt (e.g., "dog" → "an image of dog") and encode the phrase as above.

3.4. End-to-End Pre-training

We train our X-Decoder in an end-to-end manner with two types of losses corresponding to the two types of outputs.

Semantic Loss. There are three losses on the semantic outputs, corresponding to three tasks. For image-text retrieval, we compute the language-image contrastive loss as in [64]. We take the last valid token feature of Q^t from the text encoder to represent a text as q̂^t, and take the last entry in O^s derived from X-Decoder as ô^s. As a result, we obtain B pairs of features ⟨q̂^t_i, ô^s_i⟩_{i=1}^{B} for a minibatch of B image-text pairs. Afterwards, we compute the dot-products between these B × B feature pairs to obtain an affinity matrix S_it ∈ R^{B×B}, and compute the bidirectional cross-entropy loss:

L_it = CE(S_it, y_it) + CE(S_it^T, y_it)    (8)

where y_it are the class labels corresponding to the diagonal entries in S_it, and S_it^T is the transpose of S_it.

For mask classification, we encode all C class names, including "background", into C text queries and take the last valid token feature from each to represent the concept. Afterwards, we take the decoder outputs corresponding to the first (m − 1) latent queries and compute the dot-products between these outputs and the concept embeddings to obtain an affinity matrix S_cls ∈ R^{(m−1)×C}, and compute the loss L_cls = CE(S_cls, y_cls) with the ground-truth classes y_cls.

For image captioning, we first extract the embeddings for all tokens in the vocabulary of size V from the text encoder. Given the last n semantic outputs from X-Decoder, we compute the dot-products with all token embeddings to obtain an affinity matrix S_cap ∈ R^{n×V}. Then we compute the cross-entropy loss L_cap = CE(S_cap, y_cap) with the ground-truth next-token ids y_cap.

Mask Loss. Given the predictions ⟨O^p, O^s⟩ derived from the m latent queries, we use Hungarian matching [5, 12] to find the matched entries of the first (m − 1) outputs to the ground-truth annotations. Afterwards, we follow [12] to use the binary cross-entropy loss L_bce and the dice loss L_dice to compute the loss for masks. We combine the above four losses to pretrain our X-Decoder. More details can be found in the Appendix.
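As an illustration of the bidirectional loss in Eq. (8), here is a minimal sketch; the temperature, the normalization, and the function name are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb: (B, d) last latent semantic output per image (ô^s)
    # text_emb:  (B, d) last valid token feature per caption (q̂^t)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                # affinity matrix S_it of shape (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal entries are the matched pairs
    # Bidirectional cross-entropy: image-to-text plus text-to-image.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
```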


4. Experiments

4.1. Experimental Setup

Datasets and Settings. We pretrain X-Decoder on three types of data: panoptic segmentation, image-text pairs (itp), and referring segmentation. For panoptic and referring segmentation, we use COCO2017 [48] with segmentation annotations and exclude the validation sets of RefCOCOg UMD [92] and COCO Karpathy [89]. In total, there are 104k images for segmentation pretraining, out of which 30k images come with referring segmentation annotations. For image-text pairs, we use the standard 4M corpora, including Conceptual Captions [67], SBU Captions [61], Visual Genome [39], and COCO Captions [9]. We broadly evaluate our models on all tasks covered by pretraining, including generic (semantic/instance/panoptic) segmentation, referring segmentation, image-text retrieval, and image captioning. In particular, we benchmark on 10 settings of 7 datasets covering a wide range of domains. Moreover, we finetune and report results on VQA for fine-grained visual reasoning.

Implementation Details. Our visual encoder follows [12] to use 100 latent queries and 9 decoder layers for segmentation, and we add one additional latent query for the image-level task. However, we do not adopt a deformable encoder, as it does not generalize well to open-vocabulary settings (see the Appendix). We adopt Focal-T [83] and DaViT-B/L [16] as the vision encoders and a transformer text encoder with causal masking [64, 94] as the language encoder. The models are pretrained on large-scale image-text data [94] (Base or Large) or with UniCL [84] for the tiny model. During pretraining, we set the minibatch size to 32 for segmentation and 1024 for image-text pairs. The image resolution is set to 1024 for segmentation and 224 for image-text data, respectively. We follow a balanced sampling strategy similar to [84] to ensure the segmentation data are always observed for a consistent number of epochs, regardless of the total number of image-text pairs. Based on this, we pretrain all models for 50 epochs using AdamW [52] as the optimizer. During finetuning, we have task-specific designs; please refer to the details in the Appendix.
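The balanced sampling described above can be approximated as follows: one pretraining "epoch" is defined by a full pass over the (much smaller) segmentation set, while image-text minibatches are drawn from an endless stream that simply wraps around. This is a sketch in spirit only; the loader names and the one-to-one pairing of minibatches are assumptions, not the paper's exact procedure.

```python
def endless(loader):
    # Wrap a DataLoader into a stream that restarts whenever it is exhausted.
    while True:
        for batch in loader:
            yield batch

def joint_epoch(seg_loader, itp_stream):
    # One 'epoch' = one full pass over the segmentation data; an image-text minibatch is
    # drawn alongside every segmentation minibatch, so the segmentation set is revisited
    # at a fixed rate regardless of how many image-text pairs are available.
    for seg_batch, itp_batch in zip(seg_loader, itp_stream):
        yield seg_batch, itp_batch

# Usage sketch: itp_stream = endless(itp_loader); iterate joint_epoch(seg_loader, itp_stream) each epoch.
```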

4.2. Task-Specific Transfer

Without any architecture change except adding a head for VQA, we directly finetune X-Decoder to demonstrate its task-transfer capability. Table 1 presents the comparisons with previous specialized and generalized models.

Comparison with segmentation models. We list the most recent models for individual tasks, including MaskFormer [12], Panoptic SegFormer [47] and kMaX-DeepLab [93] for generic segmentation, and LAVT [86] for referring segmentation. Notably, our 25-epoch finetuned X-Decoder (L) establishes a new SoTA on the ADE20K dataset, outperforming the current SoTA kMaX-DeepLab (L) on ADE panoptic segmentation (our model trained with 1024 resolution achieves 51.0 PQ), as well as the instance segmentation SoTA, Mask2Former-L. On COCO, our model attains comparable performance to Mask2Former and kMaX-DeepLab. There are three reasons that explain the minor inferiority. First, we do not use deformable attention in X-Decoder, which typically benefits supervised settings but hurts open-vocabulary performance. Second, we use a language-image pretrained model as the backbone, which can understand richer semantics but lags behind supervised models on classification tasks [42]. Third, we use 100 latent queries for segmentation, which is half of that in Mask2Former (L). Finally, we compare with LAVT [86] on COCO G-Ref. It is worth pointing out that with lightweight finetuning, our tiny model already outperforms LAVT-Base (61.9 vs. 61.2). Further increasing the model size brings additional gains of 2.6 and 2.7 points respectively, which helps to set a new record on this benchmark.

Comparison with VL models. We compare with a set of VL models on image-text retrieval, image captioning and VQA in Table 1. X-Decoder achieves competitive performance across the board. Specifically, X-Decoder outperforms the strong baseline UNITER [10] and rivals VinVL [96] on COCO retrieval, and even beats all the methods on Flickr30K [63]. Unlike all these works, the image and text encoders are fully decoupled in X-Decoder, which leads to a much faster inference speed.


On captioning and VQA, our models also demonstrate superior performance to their counterparts. For example, X-Decoder outperforms VinVL by 1.3 and 1.7 points on CIDEr and BLEU, respectively. Note that most of these works use sophisticatedly designed training objectives, such as masked data modeling, image-text matching and hard-negative mining [20, 43, 74]. In contrast, X-Decoder is pretrained with image-text contrastive learning and image captioning, along with the segmentation losses. The simplicity and effectiveness imply a great potential of using X-Decoder as a general pretraining paradigm for VL.

Comparison with generalist models. We further compare with prior arts that explore general-purpose vision models. Limited works report generic segmentation performance. Our model outperforms UViM [38] and Pix2Seq v2 [7] significantly on COCO panoptic (56.7 vs. 45.8) and instance segmentation (46.7 vs. 38.2), respectively. With the same amount of segmentation data, these margins strongly justify our model design, i.e., unifying by functionality without any tweaks for individual tasks. When compared with GLIPv2 [95], our model achieves comparable performance. Note that GLIPv2 uses over 10M pretraining data, including around 2M with box supervision. Despite the huge gap in pretraining data, X-Decoder outperforms GLIPv2 on both captioning and VQA. Furthermore, X-Decoder also beats other general-purpose models like UniT [32], GPV [27], UniTAB [85] and Unified-IO [54].

Efficient Finetuning. Finally, we study whether our pretrained X-Decoder can be finetuned for segmentation at a low cost. In Table 3, we show that we can simply finetune the class embedding layer, the mask embedding layer, or the whole decoder to reach decent segmentation performance and surpass fully finetuned tiny SoTA models like kMaX-DeepLab [93]. These results imply an efficient way of using our pretrained X-Decoder models.
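A minimal sketch of this kind of efficient finetuning: freeze the whole pretrained model, then unfreeze only the chosen part. The attribute names (class_embed, mask_embed, decoder) are hypothetical stand-ins for the corresponding layers of the pretrained model, not the released module names.

```python
def configure_efficient_finetuning(model, mode="class_embed"):
    # Freeze everything, then unfreeze only the selected component (cf. Table 3).
    for p in model.parameters():
        p.requires_grad = False
    selected = {
        "class_embed": model.class_embed,   # only the class embedding layer
        "mask_embed": model.mask_embed,     # only the mask embedding layer
        "decoder": model.decoder,           # the whole decoder
    }[mode]
    for p in selected.parameters():
        p.requires_grad = True
    # Return only the trainable parameters for the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```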


Table 3. Performance with different efficient finetuning strategies for X-Decoder Large, and comparisons with fully finetuned models.

4.3. Zero-Shot Transfer

Without any change in model weights, X-Decoder can be directly applied to various segmentation tasks and datasets after pretraining. In Table 2, we evaluate our model in a zero-shot manner on seven commonly used segmentation datasets in 10 different settings from diverse domains, including common indoor (e.g., ADE20K [98] and Pascal [22]), outdoor (e.g., Cityscapes [14]) and self-driving scenarios (e.g., BDD [90]). We report PQ, mAP and mIoU for panoptic, instance and semantic segmentation, respectively, and we visualize the predicted open-vocabulary segmentation results on each dataset in Fig. 5.

Comparison with baselines. We build two X-Decoder variants: (1) X-Decoder-Seg, which is trained only with COCO panoptic segmentation using a text encoder for class names; and (2) X-Decoder-Seg+, where we take the heuristic approach of extracting noun phrases from COCO captions and using them as extra supervision on top of the matched decoder outputs. First, X-Decoder-Seg shows clear advantages in open-vocabulary segmentation over MSeg [40], which manually conducts label mapping across different datasets. Second, the extra supervision from COCO captions improves model performance on 9 out of 15 metrics, which indicates the benefit of joint learning with image-level supervision. Third, when pretraining with the full X-Decoder, the performance is significantly boosted. Notably, the mIoU metric is improved by 7.4, 3.4 and 2.6 points on SUN, ADE-150 and PC-459, respectively.

Comparison with state-of-the-art. We further compare with the most advanced methods for open-vocabulary image segmentation in Table 2.


Clearly, our models achieve the best results across all datasets. Among the base-sized models, X-Decoder (B) outperforms OpenSeg (B) [25] on two challenging datasets, ADE-150 and PC-459, for semantic segmentation. Scaling X-Decoder to the large size further improves mIoU by 2.4 and 1.4 on these two datasets. Among prior arts, MaskCLIP [17] was the first proposed for open-vocabulary panoptic segmentation, combining MaskFormer with CLIP models. With COCO caption supervision, our simple baseline X-Decoder-Seg+ already performs comparably. The full version of our tiny model, X-Decoder (T), surpasses MaskCLIP across the board except on A-847. We note that these comparisons are not strictly fair in terms of supervision, settings and models used. However, these results demonstrate the effectiveness of our X-Decoder in learning end-to-end from different granularities of supervision for open-vocabulary segmentation, which leads to new SoTA on 10 settings of 7 datasets across three segmentation tasks.

4.4. Model Inspection

Pretraining Tasks. By default, we exploit four pretraining tasks: generic and referring segmentation, captioning and retrieval. In Table 6, we keep generic segmentation while ablating the importance of the other pretraining tasks. Accordingly, we have the following observations:

• Image-text retrieval can help open-vocabulary segmentation. On ADE, the mIoU drops from 23.4 to 21.8, and PQ drops by 0.7, without image-text retrieval. Since we share the same semantic space for both tasks, a good visual-semantic alignment learned from the retrieval task can directly benefit the recognition of novel concepts.


• Image captioning helps referring segmentation and vice versa. We observe a drop of 2.0 points on COCO G-Ref without the captioning task, and a drop of 3.2 CIDEr points from removing the referring task. The two tasks share the same text encoder for text queries; joint training therefore improves the understanding of text inputs.

• Image captioning and retrieval can mutually benefit each other. When removing captioning during pretraining, the image retrieval R@1 drops by 0.8, and the captioning CIDEr drops significantly, by 3.2 points, from removing the retrieval task. Our X-Decoder promotes harmony between generative and contrastive learning.

The above observations verify that the unified design of X-Decoder can prompt the synergy of different tasks.

Query Interactions. The interaction among tasks is highly dependent on the interaction between latent and text queries. We have described how the queries interact with each other by default in Fig. 4. Here, we investigate how our model behaves with different interactions. In Table 4, we show the performance across tasks with ablated versions and have the following takeaways:

• Image captioning requires both fine-grained and global image information. Comparing the first with the second and third rows in the table, we find the CIDEr score significantly drops if we cut off the information flow from the global latent query or the other latent queries to the text queries (82.0 → 78.6 and 78.9, respectively).


• Language conditioning is important for referring segmentation. In the last row, we turn off the interaction from text queries to latent queries. This significantly hurts referring segmentation (59.7 → 57.6). On the one hand, this indicates that we can convert generic segmentation to referring segmentation using post-hoc matching with referring texts. On the other hand, sending the text phrase as input to X-Decoder is essential to modulate our model to specifically decode the targets.

VL Batch Size & Dataset. The default batch size for the VL tasks is 1024; here we explore gradually decreasing the VL batch size. In addition, each VL dataset is removed individually to investigate the pretrained performance on different tasks.

• Decreasing the VL batch size hurts VL tasks and open-vocabulary segmentation performance. As shown in Table 5, decreasing the VL task batch size from 1024 to 256 significantly hurts the retrieval and captioning performance, where IR@1, TR@1 and CIDEr decrease by 3.2, 4.3 and 8.9 points, respectively. Further, the open-vocabulary performance also drops by 0.3 points on each metric.

• The VG dataset hurts VL task performance during pretraining but improves open-vocabulary segmentation. As shown in Table 7, removing Visual Genome from the pretraining VL data significantly improves the captioning task by 22.1 points during pretraining, but only by 0.2 points after finetuning. Moreover, open-vocabulary semantic segmentation drops by around 0.8 points.

4.5. Task Composition

X-Decoder has the unique benefit of task interaction, thanks to the sophisticated architecture design on latent and text queries as well as the decoder architecture. It enables joint task inference and iterative task inference with a single set of weights.

gle set of weights. In Fig. 6, we show our model can

perform region-based retrieval and referring based captioning without any architecture/weight change. For example,given a set of animal images (row 1, Fig. 6) and text query,our model first retrieves the correct image (flamingo andgiraffe) and then grounds the query with pixel-level predictions. Further, our model can easily adapted to referring captioning by first localizing a given word and thenmodulating the predicted mask in the cross-attention layers.Lastly, we also integrate X-Deocder with diffusion modelto do referring image editing demonstrated in the latter halfof the second row in Fig. 6.5. ConclusionWe present X-Decoder, a model that seamlessly supportspixel-level and image-level vision-language understanding.With a simple and generalized design, X-Decoder can uniteand support generic segmentation, referring segmentationand VL tasks effortlessly, achieving strong generalizabilityand competitive or even SoTA performance. We hope thiswork can shed a light on the design of the next-generationgeneral-purpose vision system.Acknowledgements. We appreciated the constructive discussion with Haotian Zhang. This work was also supported in part by NSF CAREER IIS2150012, the WisconsinAlumni Research Foundation, and the Institute of Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government (MSIT) (No.2022- 0-00871, Development of AI Autonomy and Knowl