社区模型库

飞桨目前包含260+社区模型，覆盖CV、NLP、推荐等多个领域，详细内容如下表：

图像分类

序号	论文名称(链接)	摘要	数据集	快速开始
1	Wide Residual Networks	Abstract Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https://github.com/szagoruyko/wide-residual-networks	CIFAR-10(WRN-28-20-dropout): 96.55%	快速开始
2	Colorful Image Colorization	Abstract Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a "colorization Turing test," asking human participants to choose between a generated and ground truth color image. Our method successfully fools humans on 32% of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. This approach results in state-of-the-art performance on several feature learning benchmarks.	AuC: non-rebal=89.5%rebal=67.3%VGG Top-1 Class Acc=56%AMT Labeled Real=32.3%	快速开始
3	Prototypical Networks for Few-shot Learning	Abstract Dropout is a powerful and widely used technique to regularize the training ofdeep neural networks. In this paper, we introduce a simple regularizationstrategy upon dropout in model training, namely R-Drop, which forces the outputdistributions of different sub models generated by dropout to be consistentwith each other. Specifically, for each training sample, R-Drop minimizes thebidirectional KL-divergence between the output distributions of two sub modelssampled by dropout. Theoretical analysis reveals that R-Drop reduces thefreedom of the model parameters and complements dropout. Experiments on$\bf{5}$ widely used deep learning tasks ($\bf{18}$ datasets in total),including neural machine translation, abstractive summarization, languageunderstanding, language modeling, and image classification, show that R-Drop isuniversally effective. In particular, it yields substantial improvements whenapplied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large,and BART, and achieves state-of-the-art (SOTA) performances with the vanillaTransformer model on WMT14 English$\to$German translation ($\bf{30.91}$ BLEU)and WMT14 English$\to$French translation ($\bf{43.95}$ BLEU), even surpassingmodels trained with extra large-scale data and expert-designed advancedvariants of Transformer models. Our code is available atGitHub{\url{this https URL}}.	1-shot=49.42%, 5-shot=68.2%	快速开始
4	R-Drop: Regularized Dropout for Neural Networks	Abstract Inspired by recent work in machine translation and object detection, weintroduce an attention based model that automatically learns to describe thecontent of images. We describe how we can train this model in a deterministicmanner using standard backpropagation techniques and stochastically bymaximizing a variational lower bound. We also show through visualization howthe model is able to automatically learn to fix its gaze on salient objectswhile generating the corresponding words in the output sequence. We validatethe use of attention with state-of-the-art performance on three benchmarkdatasets: Flickr8k, Flickr30k and MS COCO.	ViT-B/16+RD=93.29	快速开始
5	Weight Uncertainty in Neural Networks	Abstract We introduce a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network, called Bayes by Backprop. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification. We then demonstrate how the learnt uncertainty in the weights can be used to improve generalisation in non-linear regression problems, and how this weight uncertainty can be used to drive the exploration-exploitation trade-off in reinforcement learning.	MNIST: Test Error=1.32%	快速开始
6	Matching Networks for One Shot Learning	Abstract Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.	omniglot k-way=5, n-shot=1, acc = 98.1%	快速开始
7	Modeling Relational Data with Graph Convolutional Networks	Abstract Recognizing arbitrary multi-character text in unconstrained naturalphotographs is a hard problem. In this paper, we address an equally hardsub-problem in this domain viz. recognizing arbitrary multi-digit numbers fromStreet View imagery. Traditional approaches to solve this problem typicallyseparate out the localization, segmentation, and recognition steps. In thispaper we propose a unified approach that integrates these three steps via theuse of a deep convolutional neural network that operates directly on the imagepixels. We employ the DistBelief implementation of deep neural networks inorder to train large, distributed neural networks on high quality images. Wefind that the performance of this approach increases with the depth of theconvolutional network, with the best performance occurring in the deepestarchitecture we trained, with eleven hidden layers. We evaluate this approachon the publicly available SVHN dataset and achieve over $96\%$ accuracy inrecognizing complete street numbers. We show that on a per-digit recognitiontask, we improve upon the state-of-the-art, achieving $97.84\%$ accuracy. Wealso evaluate this approach on an even more challenging dataset generated fromStreet View imagery containing several tens of millions of street numberannotations and achieve over $90\%$ accuracy. To further explore theapplicability of the proposed system to broader text recognition tasks, weapply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of themost secure reverse turing tests that uses distorted text to distinguish humansfrom bots. We report a $99.8\%$ accuracy on the hardest category of reCAPTCHA.Our evaluations on both tasks indicate that at specific operating thresholds,the performance of the proposed system is comparable to, and in some casesexceeds, that of human operators.	Accuracy = 95.83%	快速开始
8	Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks	Abstract Deep neural networks are typically trained by optimizing a loss function withan SGD variant, in conjunction with a decaying learning rate, untilconvergence. We show that simple averaging of multiple points along thetrajectory of SGD, with a cyclical or constant learning rate, leads to bettergeneralization than conventional training. We also show that this StochasticWeight Averaging (SWA) procedure finds much flatter solutions than SGD, andapproximates the recent Fast Geometric Ensembling (FGE) approach with a singlemodel. Using SWA we achieve notable improvement in test accuracy overconventional SGD training on a range of state-of-the-art residual networks,PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, andImageNet. In short, SWA is extremely easy to implement, improvesgeneralization, and has almost no computational overhead.	Accuracy=95.65%	快速开始
9	Averaging Weights Leads to Wider Optima and Better Generalization	Abstract Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.	VGG16+SWA 1budget, CIFAR10 top1=93.59	快速开始
10	Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation	Abstract Deep learning (DL) based semantic segmentation methods have been providing state-of-the-art performance in the last few years. More specifically, these techniques have been successfully applied to medical image classification, segmentation, and detection tasks. One deep learning technique, U-Net, has become one of the most popular for these applications. In this paper, we propose a Recurrent Convolutional Neural Network (RCNN) based on U-Net as well as a Recurrent Residual Convolutional Neural Network (RRCNN) based on U-Net models, which are named RU-Net and R2U-Net respectively. The proposed models utilize the power of U-Net, Residual Network, as well as RCNN. There are several advantages of these proposed architectures for segmentation tasks. First, a residual unit helps when training deep architecture. Second, feature accumulation with recurrent residual convolutional layers ensures better feature representation for segmentation tasks. Third, it allows us to design better U-Net architecture with same number of network parameters with better performance for medical image segmentation. The proposed models are tested on three benchmark datasets such as blood vessel segmentation in retina images, skin cancer segmentation, and lung lesion segmentation. The experimental results show superior performance on segmentation tasks compared to equivalent models including U-Net and residual U-Net (ResU-Net).	R2U-Net F1-score=0.8171	快速开始
11	Unsupervised Representation Learning by Predicting Image Rotations	Abstract Over the last years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that are available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method in various unsupervised feature learning benchmarks and we exhibit in all of them state-of-the-art performance. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning. For instance, in PASCAL VOC 2007 detection task our unsupervised pre-trained AlexNet model achieves the state-of-the-art (among unsupervised methods) mAP of 54.4% that is only 2.4 points lower from the supervised case. We get similarly striking results when we transfer our unsupervised learned features on various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. The code and models of our paper will be published on: https://github.com/gidariss/FeatureLearningRotNet .	RotNet+conv, CIFAR-10上top1=91.16	快速开始
12	FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence	Abstract Over the last decade, Convolutional Neural Network (CNN) models have beenhighly successful in solving complex vision problems. However, these deepmodels are perceived as "black box" methods considering the lack ofunderstanding of their internal functioning. There has been a significantrecent interest in developing explainable deep learning models, and this paperis an effort in this direction. Building on a recently proposed method calledGrad-CAM, we propose a generalized method called Grad-CAM++ that can providebetter visual explanations of CNN model predictions, in terms of better objectlocalization as well as explaining occurrences of multiple object instances ina single image, when compared to state-of-the-art. We provide a mathematicalderivation for the proposed method, which uses a weighted combination of thepositive partial derivatives of the last convolutional layer feature maps withrespect to a specific class score as weights to generate a visual explanationfor the corresponding class label. Our extensive experiments and evaluations,both subjective and objective, on standard datasets showed that Grad-CAM++provides promising human-interpretable visual explanations for a given CNNarchitecture across multiple tasks including classification, image captiongeneration and 3D action recognition; as well as in new settings such asknowledge distillation.	cifar 10, 40label: 93.6%, 250 label 95.31%, 4000 labels 95.77%	快速开始
13	MLP-Mixer: An all-MLP Architecture for Vision	Abstract Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.	Mixer-B/16CIFAR-10upstream: ImageNet96.72%upstream: ImageNet-21k96.82%(官方JAX repo提供)	快速开始
14	Deep Networks with Stochastic Depth	Abstract Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).	CIFAR-10 test error=5.25	快速开始
15	Recurrent Models of Visual Attention	Abstract Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.	28*28 Mnist, RAM, 6 glimpses, 8 × 8, 1 scale达到论文指标	快速开始
16	Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet	Abstract Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0\% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3\% top1 accuracy in image resolution 384×384 on ImageNet. (Code: this https URL)	ImageNet1k: T2T-ViT-7, 71.7%	快速开始
17	Rethinking Spatial Dimensions of Vision Transformers	Abstract Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the transformer-based architecture has been innovative for computer vision modeling, the design convention towards an effective architecture has been less studied yet. From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon the original ViT model. We show that PiT achieves the improved model capability and generalization performance against ViT. Throughout the extensive experiments, we further show PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source codes and ImageNet models are available at https://github.com/naver-ai/pit.	ImageNet1k: pit_ti 73.0%	快速开始
18	Masked Autoencoders Are Scalable Vision Learners	Abstract This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.	ImageNet1k: val 83.6%	快速开始
19	XCiT: Cross-Covariance Image Transformers	Abstract Following tremendous success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlyingtransformers yields global interactions between all tokens, i.e. words or image patches, andenables flexible modelling of image data beyond the local interactions of convolutions. Thisflexibility, however, comes with a quadratic complexity in time and memory, hinderingapplication to long sequences and high-resolution images. We propose a “transposed”version of self-attention that operates across feature channels rather than tokens, wherethe interactions are based on the cross-covariance matrix between keys and queries. Theresulting cross-covariance attention (XCA) has linear complexity in the number of tokens,and allows efficient processing of high-resolution images. Our cross-covariance imagetransformer (XCiT) – built upon XCA – combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness andgenerality of XCiT by reporting excellent results on multiple vision benchmarks, including (self-supervised) image classification on ImageNet-1k, object detection and instancesegmentation on COCO, and semantic segmentation on ADE20k.	ImageNet; xcit_nano_12_p8224: top1=73.8224: top1=76.3384: top1=77.8	快速开始
20	MicroNet: Improving Image Recognition with Extremely Low FLOPs	Abstract This paper aims at addressing the problem of substantial performance degradation at extremely low computational cost (e.g. 5M FLOPs on ImageNet classification). We found that two factors, sparse connectivity and dynamic activation function, are effective to improve the accuracy. The former avoids the significant reduction of network width, while the latter mitigates the detriment of reduction in network depth. Technically, we propose micro-factorized convolution, which factorizes a convolution matrix into low rank matrices, to integrate sparse connectivity into convolution. We also present a new dynamic activation function, named Dynamic Shift Max, to improve the non-linearity via maxing out multiple dynamic fusions between an input feature map and its circular channel shift. Building upon these two new operators, we arrive at a family of networks, named MicroNet, that achieves significant performance gains over the state of the art in the low FLOP regime. For instance, under the constraint of 12M FLOPs, MicroNet achieves 59.4% top-1 accuracy on ImageNet classification, outperforming MobileNetV3 by 9.6%. Source code is at https://github.com/liyunsheng13/micronet.	ImageNet1k数据集, MicroNet-M3 top1-acc 62.5%	快速开始
21	Rethinking Bottleneck Structure for Efficient Mobile Network Design	Abstract The inverted residual block is dominating architecture design for mobile networks recently. It changes the classic residual bottleneck by introducing two design rules: learning inverted residuals and using linear bottlenecks. In this paper, we rethink the necessity of such design changes and find it may bring risks of information loss and gradient confusion. We thus propose to flip the structure and present a novel bottleneck design, called the sandglass block, that performs identity mapping and spatial transformation at higher dimensions and thus alleviates information loss and gradient confusion effectively. Extensive experiments demonstrate that, different from the common belief, such bottleneck structure is more beneficial than the inverted ones for mobile networks. In ImageNet classification, by simply replacing the inverted residual block with our sandglass block without increasing parameters and computation, the classification accuracy can be improved by more than 1.7% over MobileNetV2. On Pascal VOC 2007 test set, we observe that there is also 0.9% mAP improvement in object detection. We further verify the effectiveness of the sandglass block by adding it into the search space of neural architecture search method DARTS. With 25% parameter reduction, the classification accuracy is improved by 0.13% over previous DARTS models. Code can be found at: this https URL.	ImageNet1k数据集, MobileNeXt-1.00 top1-acc 74.02%	快速开始
22	CvT: Introducing Convolutions to Vision Transformers	Abstract We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (\ie shift, scale, and distortion invariance) while maintaining the merits of Transformers (\ie dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (\eg ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7\% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at \url{https://github.com/leoxiaobin/CvT}.	ImageNet1k数据集CvT-13, 81.6%	快速开始
23	An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection	Abstract As DenseNet conserves intermediate features with diverse receptive fields by aggregating them with dense connection, it shows good performance on the object detection task. Although feature reuse enables DenseNet to produce strong features with a small number of model parameters and FLOPs, the detector with DenseNet backbone shows rather slow speed and low energy efficiency. We find the linearly increasing input channel by dense connection leads to heavy memory access cost, which causes computation overhead and more energy consumption. To solve the inefficiency of DenseNet, we propose an energy and computation efficient architecture called VoVNet comprised of One-Shot Aggregation (OSA). The OSA not only adopts the strength of DenseNet that represents diversified features with multi receptive fields but also overcomes the inefficiency of dense connection by aggregating all features only once in the last feature maps. To validate the effectiveness of VoVNet as a backbone network, we design both lightweight and large-scale VoVNet and apply them to one-stage and two-stage object detectors. Our VoVNet based detectors outperform DenseNet based ones with 2x faster speed and the energy consumptions are reduced by 1.6x - 4.1x. In addition to DenseNet, VoVNet also outperforms widely used ResNet backbone with faster speed and better energy efficiency. In particular, the small object detection performance has been significantly improved over DenseNet and ResNet.	imagenet VoVNet-39, top1 acc 0.7677	快速开始
24	Residual Attention: A Simple but Effective Method for Multi-Label Recognition	Abstract Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations.	VOC2007 dataset, resnet101, head num=1, mAP 94.7%	快速开始
25	HashNet: Deep Learning to Hash by Continuation	Abstract Learning to hash has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval, due to its computation efficiency and retrieval quality. Deep learning to hash, which improves retrieval quality by end-to-end representation learning and hash encoding, has received increasing attention recently. Subject to the ill-posed gradient difficulty in the optimization with sign activations, existing deep learning to hash methods need to first learn continuous representations and then generate binary hash codes in a separated binarization step, which suffer from substantial loss of retrieval quality. This work presents HashNet, a novel deep architecture for deep learning to hash by continuation method with convergence guarantees, which learns exactly binary hash codes from imbalanced similarity data. The key idea is to attack the ill-posed gradient problem in optimizing deep networks with non-smooth binary activations by continuation method, in which we begin from learning an easier network with smoothed activation function and let it evolve during the training, until it eventually goes back to being the original, difficult to optimize, deep network with the sign activation function. Comprehensive empirical evidence shows that HashNet can generate exactly binary hash codes and yield state-of-the-art multimedia retrieval performance on standard benchmarks.	MS COCO 16bits 0.6873, 32bits 0.7184, 48bits 0.7301, 64bits 0.7362	快速开始
26	Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN	Abstract To convert the input into binary code, hashing algorithm has been widely used for approximate nearest neighbor search on large-scale image sets due to its computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural network. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally makes the optimization NP hard. In this work, we adopt the greedy principle to tackle this NP hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intactly to the front layer to avoid the vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks.	cifar10(1) 12bits 0.774, 24bits 0.795, 32bit 0.810, 48bits 0.822	快速开始
27	Trusted Multi-View Classification	Abstract Multi-view classification (MVC) generally focuses on improving classification accuracy by using information from different views, typically integrating them into a unified comprehensive representation for downstream tasks. However, it is also crucial to dynamically assess the quality of a view for different samples in order to provide reliable uncertainty estimations, which indicate whether predictions can be trusted. To this end, we propose a novel multi-view classification method, termed trusted multi-view classification, which provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. The algorithm jointly utilizes multiple views to promote both classification reliability and robustness by integrating evidence from each view. To achieve this, the Dirichlet distribution is used to model the distribution of the class probabilities, parameterized with evidence from different views and integrated with the Dempster-Shafer theory. The unified learning framework induces accurate uncertainty and accordingly endows the model with both reliability and robustness for out-of-distribution samples. Extensive experimental results validate the effectiveness of the proposed model in accuracy, reliability and robustness.	nan	快速开始
28	See Better Before Looking Closer: Weakly Supervised Data Augmentation Network for Fine-Grained Visual Classification	Abstract Data augmentation is usually adopted to increase the amount of training data, prevent overfitting and improve the performance of deep models. However, in practice, random data augmentation, such as random image cropping, is low-efficiency and might introduce many uncontrolled background noises. In this paper, we propose Weakly Supervised Data Augmentation Network (WS-DAN) to explore the potential of data augmentation. Specifically, for each training image, we first generate attention maps to represent the object's discriminative parts by weakly supervised learning. Next, we augment the image guided by these attention maps, including attention cropping and attention dropping. The proposed WS-DAN improves the classification accuracy in two folds. In the first stage, images can be seen better since more discriminative parts' features will be extracted. In the second stage, attention regions provide accurate location of object, which ensures our model to look at the object closer and further improve the performance. Comprehensive experiments in common fine-grained visual classification datasets show that our WS-DAN surpasses the state-of-the-art methods, which demonstrates its effectiveness.	WS-DAN CUB-200-2011 acc 89.4%, FGVC-Aircraft 93.0%, Standford Cars 94.5%	快速开始
29	CycleMLP: A MLP-like Architecture for Dense Prediction	Abstract This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. As compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation, CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity to image size by using local windows. In contrast, previous MLPs have O(N2) computations due to fully spatial connections. We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs. We expand the MLP-like models' applicability, making them a versatile backbone for dense prediction tasks. CycleMLP achieves competitive results on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP-Tiny outperforms Swin-Tiny by 1.3% mIoU on ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on ImageNet-C dataset. Code is available at https://github.com/ShoufaChen/CycleMLP.	CycleMLP-B1：78.9	快速开始
30	A ConvNet for the 2020s	Abstract The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.	ConvNeXt-T top1 acc 0.821	快速开始
31	Visual Attention Network	Abstract While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN surpasses similar size vision transformers(ViTs) and convolutional neural networks(CNNs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on ImageNet benchmark and set new state-of-the-art performance (58.2 PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T 4% mIoU (50.1 vs. 46.1) for semantic segmentation on ADE20K benchmark, 2.6% AP (48.8 vs. 46.2) for object detection on COCO dataset. It provides a novel method and a simple yet strong baseline for the community. Code is available at https://github.com/Visual-Attention-Network.	VAN-Tiny top1 acc 0.754	快速开始
32	Pelee: A Real-Time Object Detection System on Mobile Devices	Abstract An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and MobileNetV2. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy and over 1.8 times faster speed than MobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system2, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on MS COCO dataset at the speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in consideration of a higher precision, 13.6 times lower computational cost and 11.3 times smaller model size.	peleenet top1 acc 0.726	快速开始
33	All Tokens Matter: Token Labeling for Training Better Vision Transformers	Abstract In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.	LV-ViT-T @ImageNet val top1 acc=79.1%	快速开始
34	Destruction and Construction Learning for Fine-grained Image Recognition	Abstract Delicate feature representation about object parts plays a critical role in fine-grained recognition. For example, experts can even distinguish fine-grained objects relying only on object parts according to professional knowledge. In this paper, we propose a novel "Destruction and Construction Learning" (DCL) method to enhance the difficulty of fine-grained recognition and exercise the classification model to acquire expert knowledge. Besides the standard classification backbone network, another "destruction and construction" stream is introduced to carefully "destruct" and then "reconstruct" the input image, for learning discriminative regions and features. More specifically, for "destruction", we first partition the input image into local regions and then shuffle them by a Region Confusion Mechanism (RCM). To correctly recognize these destructed images, the classification network has to pay more attention to discriminative regions for spotting the differences. To compensate the noises introduced by RCM, an adversarial loss, which distinguishes original images from destructed ones, is applied to reject noisy patterns introduced by RCM. For "construction", a region alignment network, which tries to restore the original spatial layout of local regions, is followed to model the semantic correlation among local regions. By jointly training with parameter sharing, our proposed DCL injects more discriminative local details to the classification network. Experimental results show that our proposed framework achieves state-of-the-art performance on three standard benchmarks. Moreover, our proposed method does not need any external knowledge during training, and there is no computation overhead at inference time except the standard classification network feed-forwarding. Source code: https://github.com/JDAI-CV/DCL.	ResNet50, CUB-200-2011 acc 87.8%, Stanford Cars 94.5%, FGVC-Aircraft 93.0%	快速开始
35	How to Trust Unlabeled Data? Instance Credibility Inference for Few-Shot Learning	Abstract Deep learning based models have excelled in many computer vision tasks and appear to surpass humans' performance. However, these models require an avalanche of expensive human labeled training data and many iterations to train their large number of parameters. This severely limits their scalability to the real-world long-tail distributed categories, some of which are with a large number of instances, but with only a few manually annotated. Learning from such extremely limited labeled examples is known as Few-shot learning (FSL). Different to prior arts that leverage meta-learning or data augmentation strategies to alleviate this extremely data-scarce problem, this paper presents a statistical approach, dubbed Instance Credibility Inference (ICI) to exploit the support of unlabeled instances for few-shot visual recognition. Typically, we repurpose the self-taught learning paradigm to predict pseudo-labels of unlabeled instances with an initial classifier trained from the few shot and then select the most confident ones to augment the training set to re-train the classifier. This is achieved by constructing a (Generalized) Linear Model (LM/GLM) with incidental parameters to model the mapping from (un-)labeled features to their (pseudo-)labels, in which the sparsity of the incidental parameters indicates the credibility of the corresponding pseudo-labeled instance. We rank the credibility of pseudo-labeled instances along the regularization path of their corresponding incidental parameters, and the most trustworthy pseudo-labeled examples are preserved as the augmented labeled instances. Theoretically, under mild conditions of restricted eigenvalue, irrepresentability, and large error, our approach is guaranteed to collect all the correctly-predicted instances from the noisy pseudo-labeled set.	mini-ImageNet, semi ICIR 1shot 73.12%, 5shot 83.28%	快速开始

目标检测

序号	论文名称(链接)	摘要	数据集	快速开始
1	EfficientDet: Scalable and Efficient Object Detection	Abstract Accurate depth estimation from images is a fundamental task in manyapplications including scene understanding and reconstruction. Existingsolutions for depth estimation often produce blurry approximations of lowresolution. This paper presents a convolutional neural network for computing ahigh-resolution depth map given a single RGB image with the help of transferlearning. Following a standard encoder-decoder architecture, we leveragefeatures extracted using high performing pre-trained networks when initializingour encoder along with augmentation and training strategies that lead to moreaccurate results. We show how, even for a very simple decoder, our method isable to achieve detailed high-resolution depth maps. Our network, with fewerparameters and training iterations, outperforms state-of-the-art on twodatasets and also produces qualitatively better results that capture objectboundaries more faithfully. Code and corresponding pre-trained weights are madepublicly available.	efficientdet_d0 mAP: 33.6	快速开始
2	High Quality Monocular Depth Estimation via Transfer Learning	Abstract We show that the YOLOv4 object detection neural network based on the CSPapproach, scales both up and down and is applicable to small and large networkswhile maintaining optimal speed and accuracy. We propose a network scalingapproach that modifies not only the depth, width, resolution, but alsostructure of the network. YOLOv4-large model achieves state-of-the-art results:55.5% AP (73.4% AP50) for the MS COCO dataset at a speed of ~16 FPS on TeslaV100, while with the test time augmentation, YOLOv4-large achieves 56.0% AP(73.3 AP50). To the best of our knowledge, this is currently the highestaccuracy on the COCO dataset among any published work. The YOLOv4-tiny modelachieves 22.0% AP (42.0% AP50) at a speed of 443 FPS on RTX 2080Ti, while byusing TensorRT, batch size = 4 and FP16-precision the YOLOv4-tiny achieves 1774FPS.	NYU Depth v2 δ1: 0.895 参考原论文table1	快速开始
3	Scaled-YOLOv4: Scaling Cross Stage Partial Network	Abstract The highest accuracy object detectors to date are based on a two-stageapproach popularized by R-CNN, where a classifier is applied to a sparse set ofcandidate object locations. In contrast, one-stage detectors that are appliedover a regular, dense sampling of possible object locations have the potentialto be faster and simpler, but have trailed the accuracy of two-stage detectorsthus far. In this paper, we investigate why this is the case. We discover thatthe extreme foreground-background class imbalance encountered during trainingof dense detectors is the central cause. We propose to address this classimbalance by reshaping the standard cross entropy loss such that itdown-weights the loss assigned to well-classified examples. Our novel FocalLoss focuses training on a sparse set of hard examples and prevents the vastnumber of easy negatives from overwhelming the detector during training. Toevaluate the effectiveness of our loss, we design and train a simple densedetector we call RetinaNet. Our results show that when trained with the focalloss, RetinaNet is able to match the speed of previous one-stage detectorswhile surpassing the accuracy of all existing state-of-the-art two-stagedetectors. Code is at: this https URL.	YOLOv4-P5 mAP: 51.2	快速开始
4	Focal Loss for Dense Object Detection	Abstract There are a huge number of features which are said to improve ConvolutionalNeural Network (CNN) accuracy. Practical testing of combinations of suchfeatures on large datasets, and theoretical justification of the result, isrequired. Some features operate on certain models exclusively and for certainproblems exclusively, or only for small-scale datasets; while some features,such as batch-normalization and residual-connections, are applicable to themajority of models, tasks, and datasets. We assume that such universal featuresinclude Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections(CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT)and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation,Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, andcombine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50)for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Sourcecode is at this https URL	RetinaNet R-50-FPN 1x mAP: 35.7 参考github	快速开始
5	YOLOv4: Optimal Speed and Accuracy of Object Detection	Abstract Cascade is a classic yet powerful architecture that has boosted performanceon various tasks. However, how to introduce cascade to instance segmentationremains an open question. A simple combination of Cascade R-CNN and Mask R-CNNonly brings limited gain. In exploring a more effective approach, we find thatthe key to a successful instance segmentation cascade is to fully leverage thereciprocal relationship between detection and segmentation. In this work, wepropose a new framework, Hybrid Task Cascade (HTC), which differs in twoimportant aspects: (1) instead of performing cascaded refinement on these twotasks separately, it interweaves them for a joint multi-stage processing; (2)it adopts a fully convolutional branch to provide spatial context, which canhelp distinguishing hard foreground from cluttered background. Overall, thisframework can learn more discriminative features progressively whileintegrating complementary features together in each stage. Without bells andwhistles, a single HTC obtains 38.4 and 1.5 improvement over a strong CascadeMask R-CNN baseline on MSCOCO dataset. Moreover, our overall system achieves48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018Challenge Object Detection Task. Code is available at:this https URL.	input size: 416x416, MS COCO上mAP=41.2	快速开始
6	Hybrid Task Cascade for Instance Segmentation	Abstract We trained a convolutional neural network (CNN) to map raw pixels from asingle front-facing camera directly to steering commands. This end-to-endapproach proved surprisingly powerful. With minimum training data from humansthe system learns to drive in traffic on local roads with or without lanemarkings and on highways. It also operates in areas with unclear visualguidance such as in parking lots and on unpaved roads.The system automatically learns internal representations of the necessaryprocessing steps such as detecting useful road features with only the humansteering angle as the training signal. We never explicitly trained it todetect, for example, the outline of roads.Compared to explicit decomposition of the problem, such as lane markingdetection, path planning, and control, our end-to-end system optimizes allprocessing steps simultaneously. We argue that this will eventually lead tobetter performance and smaller systems. Better performance will result becausethe internal components self-optimize to maximize overall system performance,instead of optimizing human-selected intermediate criteria, e.g., lanedetection. Such criteria understandably are selected for ease of humaninterpretation which doesn't automatically guarantee maximum systemperformance. Smaller networks are possible because the system learns to solvethe problem with the minimal number of processing steps.We used an NVIDIA DevBox and Torch 7 for training and an NVIDIA DRIVE(TM) PXself-driving car computer also running Torch 7 for determining where to drive.The system operates at 30 frames per second (FPS).	HTC R-50-FPN 1x box AP: 42.3 mask AP: 37.4	快速开始
7	Holistically-Nested Edge Detection	Abstract Humans recognize the visual world at multiple levels: we effortlesslycategorize scenes and detect objects inside, while also identifying thetextures and surfaces of the objects along with their different compositionalparts. In this paper, we study a new task called Unified Perceptual Parsing,which requires the machine vision systems to recognize as many visual conceptsas possible from a given image. A multi-task framework called UPerNet and atraining strategy are developed to learn from heterogeneous image annotations.We benchmark our framework on Unified Perceptual Parsing and show that it isable to effectively segment a wide range of concepts from images. The trainednetworks are further applied to discover visual knowledge in natural scenes.Models are available at \url{this https URL}.	BSD500 dataset (ODS F-score of .782)	快速开始
8	Show, Attend and Tell: Neural Image Caption Generation with Visual Attention	Abstract We present an approach to efficiently detect the 2D pose of multiple peoplein an image. The approach uses a nonparametric representation, which we referto as Part Affinity Fields (PAFs), to learn to associate body parts withindividuals in the image. The architecture encodes global context, allowing agreedy bottom-up parsing step that maintains high accuracy while achievingrealtime performance, irrespective of the number of people in the image. Thearchitecture is designed to jointly learn part locations and their associationvia two branches of the same sequential prediction process. Our method placedfirst in the inaugural COCO 2016 keypoints challenge, and significantly exceedsthe previous state-of-the-art result on the MPII Multi-Person benchmark, bothin performance and efficiency.	bleu-1: 67%, bleu-2: 45.7%, bleu-3: 31.4%, bleu-4: 21.3%	快速开始
9	Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering	Abstract Recent work has demonstrated that deep neural networks are vulnerable toadversarial examples---inputs that are almost indistinguishable from naturaldata and yet classified incorrectly by the network. In fact, some of the latestfindings suggest that the existence of adversarial attacks may be an inherentweakness of deep learning models. To address this problem, we study theadversarial robustness of neural networks through the lens of robustoptimization. This approach provides us with a broad and unifying view on muchof the prior work on this topic. Its principled nature also enables us toidentify methods for both training and attacking neural networks that arereliable and, in a certain sense, universal. In particular, they specify aconcrete security guarantee that would protect against any adversary. Thesemethods let us train networks with significantly improved resistance to a widerange of adversarial attacks. They also suggest the notion of security againsta first-order adversary as a natural and broad security guarantee. We believethat robustness against such well-defined classes of adversaries is animportant stepping stone towards fully resistant deep learning models. Code andpre-trained models are available at this https URLand this https URL.	coco 2014 bleu-1=79.8%	快速开始
10	Pixel Recurrent Neural Networks	Abstract Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.	NLL test 81.3	快速开始
11	Residual Attention Network for Image Classification	Abstract In this work, we propose "Residual Attention Network", a convolutional neural network using attention mechanism which can incorporate with state-of-art feed forward network architecture in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers going deeper. Inside each Attention Module, bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets including CIFAR-10 (3.90% error), CIFAR-100 (20.45% error) and ImageNet (4.8% single model and single crop, top-5 error). Note that, our method achieves 0.6% top-1 accuracy improvement with 46% trunk depth and 69% forward FLOPs comparing to ResNet-200. The experiment also demonstrates that our network is robust against noisy labels.	Attention-92 top1 error 4.99%	快速开始
12	Fast R-CNN	Abstract This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.	COCO R50 mAP=37.8	快速开始
13	Simple Baselines for Human Pose Estimation and Tracking	Abstract There has been significant progress on pose estimation and increasing interests on pose tracking in recent years. At the same time, the overall algorithm and system complexity increases as well, making the algorithm analysis and comparison more difficult. This work provides simple and effective baseline methods. They are helpful for inspiring and evaluating new ideas for the field. State-of-the-art results are achieved on challenging benchmarks. The code will be available at https://github.com/leoxiaobin/pose.pytorch.	MPII; 256x256_pose_resnet_50 mean=88.53	快速开始
14	VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection	Abstract Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to a RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.	KITTI; VoxelNet 3D detection kitti validation: car easy: 81, 97 moderate: 65.46 hard: 62.85(参考原论文table2 github repo实现精度 easy: 53.43 moderate: 48.78 hard: 48.06)	快速开始
15	MnasNet: Platform-Aware Neural Architecture Search for Mobile	Abstract Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8x faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3x faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. Code is at this https URL	ImageNet: MnasNet-A top-1 73.5	快速开始
16	Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning	Abstract While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.	KITTI: MSE 0.007	快速开始
17	Gradient Harmonized Single-stage Detector	Abstract This paper revisits feature pyramids networks (FPN) for one-stage detectors and points out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion. From the perspective of optimization, we introduce an alternative way to address the problem instead of adopting the complex feature pyramids - {\em utilizing only one-level feature for detection}. Based on the simple and efficient solution, we present You Only Look One-level Feature (YOLOF). In our method, two key components, Dilated Encoder and Uniform Matching, are proposed and bring considerable improvements. Extensive experiments on the COCO benchmark prove the effectiveness of the proposed model. Our YOLOF achieves comparable results with its feature pyramids counterpart RetinaNet while being 2.5× faster. Without transformer layers, YOLOF can match the performance of DETR in a single-level feature manner with 7× less training epochs. With an image size of 608×608, YOLOF achieves 44.3 mAP running at 60 fps on 2080Ti, which is 13% faster than YOLOv4. Code is available at \url{https://github.com/megvii-model/YOLOF}.	coco: R50 37.0	快速开始
18	You Only Look One-level Feature	Abstract Many modern object detectors demonstrate outstanding performances by using the mechanism of looking and thinking twice. In this paper, we explore this mechanism in the backbone design for object detection. At the macro level, we propose Recursive Feature Pyramid, which incorporates extra feedback connections from Feature Pyramid Networks into the bottom-up backbone layers. At the micro level, we propose Switchable Atrous Convolution, which convolves the features with different atrous rates and gathers the results using switch functions. Combining them results in DetectoRS, which significantly improves the performances of object detection. On COCO test-dev, DetectoRS achieves state-of-the-art 55.7% box AP for object detection, 48.5% mask AP for instance segmentation, and 50.0% PQ for panoptic segmentation. The code is made publicly available.	coco: R50 37.5	快速开始
19	YOLOX: Exceeding YOLO Series in 2021	Abstract In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector — YOLOX. We switch the YOLO detector to an anchor-free manner and conduct other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA to achieve state-of-the-art results across a large scale range of models: For YOLONano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L with roughly the same amount of parameters as YOLOv4- CSP, YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and * Equal contribution. † Corresponding author. researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, NCNN, and Openvino supported. Source code is at https://github.com/ Megvii-BaseDetection/YOLOX.	COCO 2017 test-dev, YOLOX-X (640*640) map: 51.2	快速开始

图像分割

序号	论文名称(链接)	摘要	数据集	快速开始
1	PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation	Abstract Few prior works study deep learning on point sets. PointNet by Qi et al. is apioneer in this direction. However, by design PointNet does not capture localstructures induced by the metric space points live in, limiting its ability torecognize fine-grained patterns and generalizability to complex scenes. In thiswork, we introduce a hierarchical neural network that applies PointNetrecursively on a nested partitioning of the input point set. By exploitingmetric space distances, our network is able to learn local features withincreasing contextual scales. With further observation that point sets areusually sampled with varying densities, which results in greatly decreasedperformance for networks trained on uniform densities, we propose novel setlearning layers to adaptively combine features from multiple scales.Experiments show that our network called PointNet++ is able to learn deep pointset features efficiently and robustly. In particular, results significantlybetter than state-of-the-art have been obtained on challenging benchmarks of 3Dpoint clouds.	ModelNet40: 89.2%	快速开始
2	PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space	Abstract The ability to perform pixel-wise semantic segmentation in real-time is ofparamount importance in mobile applications. Recent deep neural networks aimedat this task have the disadvantage of requiring a large number of floatingpoint operations and have long run-times that hinder their usability. In thispaper, we propose a novel deep neural network architecture named ENet(efficient neural network), created specifically for tasks requiring lowlatency operation. ENet is up to 18$\times$ faster, requires 75$\times$ lessFLOPs, has 79$\times$ less parameters, and provides similar or better accuracyto existing models. We have tested it on CamVid, Cityscapes and SUN datasetsand report on comparisons with existing state-of-the-art methods, and thetrade-offs between accuracy and processing time of a network. We presentperformance measurements of the proposed architecture on embedded systems andsuggest possible software improvements that could make ENet even faster.	ModelNet40: 90.7%	快速开始
3	ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation	Abstract We develop a new edge detection algorithm that tackles two important issuesin this long-standing vision problem: (1) holistic image training andprediction; and (2) multi-scale and multi-level feature learning. Our proposedmethod, holistically-nested edge detection (HED), performs image-to-imageprediction by means of a deep learning model that leverages fully convolutionalneural networks and deeply-supervised nets. HED automatically learns richhierarchical representations (guided by deep supervision on side responses)that are important in order to approach the human ability resolve thechallenging ambiguity in edge and object boundary detection. We significantlyadvance the state-of-the-art on the BSD500 dataset (ODS F-score of .782) andthe NYU Depth dataset (ODS F-score of .746), and do so with an improved speed(0.4 second per image) that is orders of magnitude faster than some recentCNN-based edge detection algorithms.	Cityscapes mIoU 58.3%	快速开始
4	Unified Perceptual Parsing for Scene Understanding	Abstract In this work we address the task of semantic image segmentation with DeepLearning and make three main contributions that are experimentally shown tohave substantial practical merit. First, we highlight convolution withupsampled filters, or 'atrous convolution', as a powerful tool in denseprediction tasks. Atrous convolution allows us to explicitly control theresolution at which feature responses are computed within Deep ConvolutionalNeural Networks. It also allows us to effectively enlarge the field of view offilters to incorporate larger context without increasing the number ofparameters or the amount of computation. Second, we propose atrous spatialpyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPPprobes an incoming convolutional feature layer with filters at multiplesampling rates and effective fields-of-views, thus capturing objects as well asimage context at multiple scales. Third, we improve the localization of objectboundaries by combining methods from DCNNs and probabilistic graphical models.The commonly deployed combination of max-pooling and downsampling in DCNNsachieves invariance but has a toll on localization accuracy. We overcome thisby combining the responses at the final DCNN layer with a fully connectedConditional Random Field (CRF), which is shown both qualitatively andquantitatively to improve localization performance. Our proposed "DeepLab"system sets the new state-of-art at the PASCAL VOC-2012 semantic imagesegmentation task, reaching 79.7% mIOU in the test set, and advances theresults on three other datasets: PASCAL-Context, PASCAL-Person-Part, andCityscapes. All of our code is made publicly available online.	Cityscapes mIoU 80.1%	快速开始
5	DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs	Abstract In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.	Cityscapes mIoU 71.4%	快速开始
6	Real-Time High-Resolution Background Matting	Abstract We introduce a real-time, high-resolution background replacement technique which operates at 30fps in 4K resolution, and 60fps for HD on a modern GPU. Our technique is based on background matting, where an additional frame of the background is captured and used in recovering the alpha matte and the foreground layer. The main challenge is to compute a high-quality alpha matte, preserving strand-level hair details, while processing high-resolution images in real-time. To achieve this goal, we employ two neural networks; a base network computes a low-resolution result which is refined by a second network operating at high-resolution on selective patches. We introduce two largescale video and image matting datasets: VideoMatte240K and PhotoMatte13K/85. Our approach yields higher quality results compared to the previous state-of-the-art in background matting, while simultaneously yielding a dramatic boost in both speed and resolution.	PhotoMatte85 SAD8.65、MSE9.57	快速开始
7	Panoptic Feature Pyramid Networks	Abstract The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.	Cityscapes mIoU 75.8%	快速开始
8	SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation	Abstract We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. We show that SegNet provides good performance with competitive inference time and more efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.	image size: 360× 480; Dataset: Camvid; mIOU: 60.1	快速开始
9	YOLACT: Real-time Instance Segmentation	Abstract We develop an algorithm that can detect pneumonia from chest X-rays at alevel exceeding practicing radiologists. Our algorithm, CheXNet, is a 121-layerconvolutional neural network trained on ChestX-ray14, currently the largestpublicly available chest X-ray dataset, containing over 100,000 frontal-viewX-ray images with 14 diseases. Four practicing academic radiologists annotate atest set, on which we compare the performance of CheXNet to that ofradiologists. We find that CheXNet exceeds average radiologist performance onthe F1 metric. We extend CheXNet to detect all 14 diseases in ChestX-ray14 andachieve state of the art results on all 14 diseases.	Image size: 550 Resnet101-FPN FPS=33.5 mAP=29.8	快速开始
10	YOLACT++: Better Real-time Instance Segmentation	Abstract We present a new method for efficient high-quality image segmentation ofobjects and scenes. By analogizing classical computer graphics methods forefficient rendering with over- and undersampling challenges faced in pixellabeling tasks, we develop a unique perspective of image segmentation as arendering problem. From this vantage, we present the PointRend (Point-basedRendering) neural network module: a module that performs point-basedsegmentation predictions at adaptively selected locations based on an iterativesubdivision algorithm. PointRend can be flexibly applied to both instance andsemantic segmentation tasks by building on top of existing state-of-the-artmodels. While many concrete implementations of the general idea are possible,we show that a simple design already achieves excellent results. Qualitatively,PointRend outputs crisp object boundaries in regions that are over-smoothed byprevious methods. Quantitatively, PointRend yields significant gains on COCOand Cityscapes, for both instance and semantic segmentation. PointRend'sefficiency enables output resolutions that are otherwise impractical in termsof memory or computation compared to existing approaches. Code has been madeavailable atthis https URL.	Image size: 550 Resnet50-FPN FPS=33.5 mAP=34.1	快速开始
11	PointRend: Image Segmentation as Rendering	Abstract Recent works have widely explored the contextual dependencies to achieve moreaccurate segmentation results. However, most approaches rarely distinguishdifferent types of contextual dependencies, which may pollute the sceneunderstanding. In this work, we directly supervise the feature aggregation todistinguish the intra-class and inter-class context clearly. Specifically, wedevelop a Context Prior with the supervision of the Affinity Loss. Given aninput image and corresponding ground truth, Affinity Loss constructs an idealaffinity map to supervise the learning of Context Prior. The learned ContextPrior extracts the pixels belonging to the same category, while the reversedprior focuses on the pixels of different classes. Embedded into a conventionaldeep CNN, the proposed Context Prior Layer can selectively capture theintra-class and inter-class contextual dependencies, leading to robust featurerepresentation. To validate the effectiveness, we design an effective ContextPrior Network (CPNet). Extensive quantitative and qualitative evaluationsdemonstrate that the proposed model performs favorably against state-of-the-artsemantic segmentation approaches. More specifically, our algorithm achieves46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU onCityscapes. Code is available at this https URL.	cityscapes resnet50+FPN mIoU 78.3% 参考P5 architecture	快速开始
12	Context Prior for Scene Segmentation	Abstract BiSeNet has been proved to be a popular two-stream network for real-timesegmentation. However, its principle of adding an extra path to encode spatialinformation is time-consuming, and the backbones borrowed from pretrainedtasks, e.g., image classification, may be inefficient for image segmentationdue to the deficiency of task-specific design. To handle these problems, wepropose a novel and efficient structure named Short-Term Dense Concatenatenetwork (STDC network) by removing structure redundancy. Specifically, wegradually reduce the dimension of feature maps and use the aggregation of themfor image representation, which forms the basic module of STDC network. In thedecoder, we propose a Detail Aggregation module by integrating the learning ofspatial information into low-level layers in single-stream manner. Finally, thelow-level features and deep features are fused to predict the finalsegmentation results. Extensive experiments on Cityscapes and CamVid datasetdemonstrate the effectiveness of our method by achieving promising trade-offbetween segmentation accuracy and inference speed. On Cityscapes, we achieve71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti,which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0FPS while inferring on higher resolution images.	cityscapes resnet101 mIoU 81.3% 参考论文table6	快速开始
13	Rethinking BiSeNet For Real-time Semantic Segmentation	Abstract BiSeNet has been proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use the aggregation of them for image representation, which forms the basic module of STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on Cityscapes and CamVid dataset demonstrate the effectiveness of our method by achieving promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images.	imgsize 512 × 1024 cityscapes STDC2-Seg50 mIoU 74.2% 参见论文table6	快速开始
14	ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network	Abstract We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on four different tasks: (1) object classification, (2) semantic segmentation, (3) object detection, and (4) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the PenTree bank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network outperforms ESPNet by 4-5% and has 2-4x fewer FLOPs on the PASCAL VOC and the Cityscapes dataset. Compared to YOLOv2 on the MS-COCO object detection, ESPNetv2 delivers 4.4% higher accuracy with 6x fewer FLOPs. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2	cityscapes ESPNetv2-val mIoU 66.4% 参见论文 fig7	快速开始
15	Exploring Cross-Image Pixel Contrast for Semantic Segmentation	Abstract Current semantic segmentation methods focus only on mining "local" context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HR-Net), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff, CamVid). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.	HRNet-W48 Cityscaoes mIOU=80.18	快速开始
16	Category-Level Adversarial Adaptation for Semantic Segmentation using Purified Features	Abstract We target the problem named unsupervised domain adaptive semantic segmentation. A key in this campaign consists inreducing the domain shift, so that a classifier based on labeled data from one domain can generalize well to other domains. With theadvancement of adversarial learning blackmethod, recent works prefer the strategy of aligning the marginal distribution in the featurespaces for minimizing the domain discrepancy. However, based on the observance in experiments, only focusing on aligning globalmarginal distribution but ignoring the local joint distribution alignment fails to be the optimal choice. Other than that, the noisy factorsexisting in the feature spaces, which are not relevant to the target task, entangle with the domain invariant factors improperly and makethe domain distribution alignment more difficult. To address those problems, we introduce two new modules, Significance-awareInformation Bottleneck (SIB) and Category-level alignment (CLA), to construct a purified embedding-based category-level adversarialnetwork. As the name suggests, our designed network, CLAN, can not only disentangle the noisy factors and suppress their influencesfor target tasks but also utilize those purified features to conduct a more delicate level domain calibration, i.e., global marginal distributionand local joint distribution alignment simultaneously. In three domain adaptation tasks, i.e., GTA5 → Cityscapes, SYNTHIA → Cityscapesand Cross Season, we validate that our proposed method matches the state of the art in segmentation accuracy.	Resnet101 Cityscapes mIoU 45.5%	快速开始
17	Brain Tumor Segmentation with Deep Neural Networks	Abstract In this paper, we present a fully automatic brain tumor segmentation method based on Deep Neural Networks (DNNs). The proposed networks are tailored to glioblastomas (both low and high grade) pictured in MR images. By their very nature, these tumors can appear anywhere in the brain and have almost any kind of shape, size, and contrast. These reasons motivate our exploration of a machine learning solution that exploits a flexible, high capacity DNN while being extremely efficient. Here, we give a description of different model choices that we've found to be necessary for obtaining competitive performance. We explore in particular different architectures based on Convolutional Neural Networks (CNN), i.e. DNNs specifically adapted to image data. We present a novel CNN architecture which differs from those traditionally used in computer vision. Our CNN exploits both local features as well as more global contextual features simultaneously. Also, different from most traditional uses of CNNs, our networks use a final layer that is a convolutional implementation of a fully connected layer which allows a 40 fold speed up. We also describe a 2-phase training procedure that allows us to tackle difficulties related to the imbalance of tumor labels. Finally, we explore a cascade architecture in which the output of a basic CNN is treated as an additional source of information for a subsequent CNN. Results reported on the 2013 BRATS test dataset reveal that our architecture improves over the currently published state-of-the-art while being over 30 times faster.	BRATS 2013 test: Dice Complete 0.84, Core 0.72, Enhancing 0.57	快速开始
18	Dynamic Graph CNN for Learning on Point Clouds	Abstract Point clouds provide a flexible geometric representation suitable for countless applications in computer graphics; they also comprise the raw output of most 3D data acquisition devices. While hand-designed features on point clouds have long been proposed in graphics and vision, however, the recent overwhelming success of convolutional neural networks (CNNs) for image analysis suggests the value of adapting insight from CNN to the point cloud world. Point clouds inherently lack topological information so designing a model to recover topology can enrich the representation power of point clouds. To this end, we propose a new neural network module dubbed EdgeConv suitable for CNN-based high-level tasks on point clouds including classification and segmentation. EdgeConv acts on graphs dynamically computed in each layer of the network. It is differentiable and can be plugged into existing architectures. Compared to existing modules operating in extrinsic space or treating each point independently, EdgeConv has several appealing properties: It incorporates local neighborhood information; it can be stacked applied to learn global shape properties; and in multi-layer systems affinity in feature space captures semantic characteristics over potentially long distances in the original embedding. We show the performance of our model on standard benchmarks including ModelNet40, ShapeNetPart, and S3DIS.	mIOU=85.2% 参考论文 Table.6	快速开始
19	Adaptive Pyramid Context Network for Semantic Segmentation	Abstract Semi-supervised learning (SSL) provides an effective means of leveragingunlabeled data to improve a model's performance. In this paper, we demonstratethe power of a simple combination of two common SSL methods: consistencyregularization and pseudo-labeling. Our algorithm, FixMatch, first generatespseudo-labels using the model's predictions on weakly-augmented unlabeledimages. For a given image, the pseudo-label is only retained if the modelproduces a high-confidence prediction. The model is then trained to predict thepseudo-label when fed a strongly-augmented version of the same image. Despiteits simplicity, we show that FixMatch achieves state-of-the-art performanceacross a variety of standard semi-supervised learning benchmarks, including94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 -- just4 labels per class. Since FixMatch bears many similarities to existing SSLmethods that achieve worse performance, we carry out an extensive ablationstudy to tease apart the experimental factors that are most important toFixMatch's success. We make our code available atthis https URL.	Cityscapes: mIOU=79.28%	快速开始
20	CGNet: A Light-weight Context Guided Network for Semantic Segmentation	Abstract We focus on the challenging task of real-time semantic segmentation in thispaper. It finds many practical applications and yet is with fundamentaldifficulty of reducing a large portion of computation for pixel-wise labelinference. We propose an image cascade network (ICNet) that incorporatesmulti-resolution branches under proper label guidance to address thischallenge. We provide in-depth analysis of our framework and introduce thecascade feature fusion unit to quickly achieve high-quality segmentation. Oursystem yields real-time inference on a single GPU card with decent qualityresults evaluated on challenging datasets like Cityscapes, CamVid andCOCO-Stuff.	Cityscapes valset: M3N21, mIOU=68.27%	快速开始
21	ICNet for Real-Time Semantic Segmentation on High-Resolution Images	Abstract Recent deep learning based approaches have shown promising results for thechallenging task of inpainting large missing regions in an image. These methodscan generate visually plausible image structures and textures, but often createdistorted structures or blurry textures inconsistent with surrounding areas.This is mainly due to ineffectiveness of convolutional neural networks inexplicitly borrowing or copying information from distant spatial locations. Onthe other hand, traditional texture and patch synthesis approaches areparticularly suitable when it needs to borrow textures from the surroundingregions. Motivated by these observations, we propose a new deep generativemodel-based approach which can not only synthesize novel image structures butalso explicitly utilize surrounding image features as references during networktraining to make better predictions. The model is a feed-forward, fullyconvolutional neural network which can process images with multiple holes atarbitrary locations and with variable sizes during the test time. Experimentson multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD) andnatural images (ImageNet, Places2) demonstrate that our proposed approachgenerates higher-quality inpainting results than existing ones. Code, demo andmodels are available at: this https URL.	Cityscapes mIOU 69.6%	快速开始
22	Context Encoding for Semantic Segmentation	Abstract Recent work has made significant progress in improving spatial resolution for pixelwise labeling with Fully Convolutional Network (FCN) framework by employing Dilated/Atrous convolution, utilizing multi-scale features and refining boundaries. In this paper, we explore the impact of global contextual information in semantic segmentation by introducing the Context Encoding Module, which captures the semantic context of scenes and selectively highlights class-dependent featuremaps. The proposed Context Encoding Module significantly improves semantic segmentation results with only marginal extra computation cost over FCN. Our approach has achieved new state-of-the-art results 51.7% mIoU on PASCAL-Context, 85.9% mIoU on PASCAL VOC 2012. Our single model achieves a final score of 0.5567 on ADE20K test set, which surpass the winning entry of COCO-Place Challenge in 2017. In addition, we also explore how the Context Encoding Module can improve the feature representation of relatively shallow networks for the image classification on CIFAR-10 dataset. Our 14 layer network has achieved an error rate of 3.45%, which is comparable with state-of-the-art approaches with over 10 times more layers. The source code for the complete system are publicly available.	Cityscapes; Cityscapes mIOU = 78.55%	快速开始
23	BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation	Abstract Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.	Cityscapes: resnet18 mIOU=74.8 对应论文 Table.6 中实现	快速开始
24	FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation	Abstract Modern approaches for semantic segmentation usually employ dilated convolutions in the backbone to extract high-resolution feature maps, which brings heavy computation complexity and memory footprint. To replace the time and memory consuming dilated convolutions, we propose a novel joint upsampling module named Joint Pyramid Upsampling (JPU) by formulating the task of extracting high-resolution feature maps into a joint upsampling problem. With the proposed JPU, our method reduces the computation complexity by more than three times without performance loss. Experiments show that JPU is superior to other upsampling modules, which can be plugged into many existing approaches to reduce computation complexity and improve performance. By replacing dilated convolutions with the proposed JPU module, our method achieves the state-of-the-art performance in Pascal Context dataset (mIoU of 53.13%) and ADE20K dataset (final score of 0.5584) while running 3 times faster.	ADE20K: EncNet+JPU (resnet50)mIOU=42.5	快速开始
25	Dynamic Multi-Scale Filters for Semantic Segmentation	Abstract Multi-scale representation provides an effective way to address scale variation of objects and stuff in semantic segmentation. Previous works construct multi-scale representation by utilizing different filter sizes, expanding filter sizes with dilated filters or pooling grids, and the parameters of these filters are fixed after training. These methods often suffer from heavy computational cost or have more parameters, and are not adaptive to the input image during inference. To address these problems, this paper proposes a Dynamic Multi-scale Network (DMNet) to adaptively capture multi-scale contents for predicting pixel-level semantic labels. DMNet is composed of multiple Dynamic Convolutional Modules (DCMs) arranged in parallel, each of which exploits context-aware filters to estimate semantic representation for a specific scale. The outputs of multiple DCMs are further integrated for final segmentation. We conduct extensive experiments to evaluate our DMNet on three challenging semantic segmentation and scene parsing datasets, PASCAL VOC 2012, Pascal-Context, and ADE20K. DMNet achieves a new record 84.4% mIoU on PASCAL VOC 2012 test set without MS COCO pre-trained and post-processing, and also obtains state-of-the-art performance on PascalContext and ADE20K.	Cityscapes: mIOU = 79.64%	快速开始
26	ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation	Abstract We introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.	Cityscapes; 1. mIOU=60.3 对应论文 Table.1(a) 中实现; 2. 训练日志中包含周期性的在 valset 上的评估结果。	快速开始
27	Adversarial Learning for Semi-Supervised Semantic Segmentation	Abstract We propose a method for semi-supervised semantic segmentation using an adversarial network. While most existing discriminators are trained to classify input images as real or fake on the image level, we design a discriminator in a fully convolutional manner to differentiate the predicted probability maps from the ground truth segmentation distribution with the consideration of the spatial resolution. We show that the proposed discriminator can be used to improve semantic segmentation accuracy by coupling the adversarial loss with the standard cross entropy loss of the proposed model. In addition, the fully convolutional discriminator enables semi-supervised learning through discovering the trustworthy regions in predicted results of unlabeled images, thereby providing additional supervisory signals. In contrast to existing methods that utilize weakly-labeled images, our method leverages unlabeled images to enhance the segmentation model. Experimental results on the PASCAL VOC 2012 and Cityscapes datasets demonstrate the effectiveness of the proposed algorithm.	Pascal VOC 2012: data amount= 1/8; DeepLab-v2 with ResNet-101; mIOU=69.5 对应论文 Table.1	快速开始
28	V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation	Abstract Convolutional Neural Networks (CNNs) have been recently employed to solve problems from both the computer vision and medical image analysis fields. Despite their popularity, most approaches are only able to process 2D images while most medical data used in clinical practice consists of 3D volumes. In this work we propose an approach to 3D image segmentation based on a volumetric, fully convolutional, neural network. Our CNN is trained end-to-end on MRI volumes depicting prostate, and learns to predict segmentation for the whole volume at once. We introduce a novel objective function, that we optimise during training, based on Dice coefficient. In this way we can deal with situations where there is a strong imbalance between the number of foreground and background voxels. To cope with the limited number of annotated volumes available for training, we augment the data applying random non-linear transformations and histogram matching. We show in our experimental evaluation that our approach achieves good performances on challenging test data while requiring only a fraction of the processing time needed by other previous methods.	Prostate dataset Dice coefficient: 0.869参考论文指标	快速开始
29	Polarized Self-Attention: Towards High-quality Pixel-wise Regression	Abstract Pixel-wise regression is probably the most common problem in fine-grained computer vision tasks, such as estimating keypoint heatmaps and segmentation masks. These regression problems are very challenging particularly because they require, at low computation overheads, modeling long-range dependencies on high-resolution inputs/outputs to estimate the highly nonlinear pixel-wise semantics. While attention mechanisms in Deep Convolutional Neural Networks(DCNNs) has become popular for boosting long-range dependencies, element-specific attention, such as Nonlocal blocks, is highly complex and noise-sensitive to learn, and most of simplified attention hybrids try to reach the best compromise among multiple types of tasks. In this paper, we present the Polarized Self-Attention(PSA) block that incorporates two critical designs towards high-quality pixel-wise regression: (1) Polarized filtering: keeping high internal resolution in both channel and spatial attention computation while completely collapsing input tensors along their counterpart dimensions. (2) Enhancement: composing non-linearity that directly fits the output distribution of typical fine-grained regression, such as the 2D Gaussian distribution (keypoint heatmaps), or the 2D Binormial distribution (binary segmentation masks). PSA appears to have exhausted the representation capacity within its channel-only and spatial-only branches, such that there is only marginal metric differences between its sequential and parallel layouts. Experimental results show that PSA boosts standard baselines by 2−4 points, and boosts state-of-the-arts by 1−2 points on 2D pose estimation and semantic segmentation benchmarks.	数据集 cityscapes valset验收指标：1. HRNetV2-OCR+PSA(s) mIOU= 86.7% 参考论文 Table.52. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg	快速开始
30	nnFormer: Interleaved Transformer for Volumetric Segmentation	Abstract Transformer, the model of choice for naturallanguage processing, has drawn scant attention from themedical imaging community. Given the ability to exploitlong-term dependencies, transformers are promising tohelp atypical convolutional neural networks to overcometheir inherent shortcomings of spatial inductive bias. However, most of recently proposed transformer-based segmentation approaches simply treated transformers as assisted modules to help encode global context into convolutional representations. To address this issue, we introducennFormer (i.e., not-another transFormer), a 3D transformerfor volumetric medical image segmentation. nnFormer notonly exploits the combination of interleaved convolutionand self-attention operations, but also introduces localand global volume-based self-attention mechanism to learnvolume representations. Moreover, nnFormer proposes touse skip attention to replace the traditional concatenation/summation operations in skip connections in U-Netlike architecture. Experiments show that nnFormer significantly outperforms previous transformer-based counterparts by large margins on three public datasets. Comparedto nnUNet, nnFormer produces significantly lower HD95and comparable DSC results. Furthermore, we show thatnnFormer and nnUNet are highly complementary to eachother in model ensembling. Codes and models of nnFormerare available at https://git.io/JSf3i.	数据集 ACDC：注册后下载https://acdc.creatis.insa-lyon.fr/#phase/5846c3ab6a3c7735e84b67f2验收指标：1.Dice = 91.78% 对应论文 Table.3中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg	快速开始
31	Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation	Abstract In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at this https URL.	数据集SYNAPSE：联系 jienengchen01@gmail.com 获取处理后数据链接，或者在synapse 官网下载。验收指标：1. Avg Dice = 79.13% 对应论文 Table.1中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg	快速开始
32	FCCDN: Feature Constraint Network for VHR Image Change Detection	Abstract Change detection is the process of identifying pixelwise differences in bitemporal co-registered images. It is of great significance to Earth observations. Recently, with the emergence of deep learning (DL), the power and feasibility of deep convolutional neural network (CNN)-based methods have been shown in the field of change detection. However, there is still a lack of effective supervision for change feature learning. In this work, a feature constraint change detection network (FCCDN) is proposed. We constrain features both in bitemporal feature extraction and feature fusion. More specifically, we propose a dual encoder-decoder network backbone for the change detection task. At the center of the backbone, we design a nonlocal feature pyramid network to extract and fuse multiscale features. To fuse bitemporal features in a robust way, we build a dense connection-based feature fusion module. Moreover, a self-supervised learning-based strategy is proposed to constrain feature learning. Based on FCCDN, we achieve state-of-the-art performance on two building change detection datasets (LEVIR-CD and WHU). On the LEVIR-CD dataset, we achieve an IoU of 0.8569 and an F1 score of 0.9229. On the WHU dataset, we achieve an IoU of 0.8820 and an F1 score of 0.9373. Moreover, for the first time, the acquisition of accurate bitemporal semantic segmentation results is achieved without using semantic segmentation labels. This is vital for the application of change detection because it saves the cost of labeling.	数据集 LEVIR-CD验收指标：1.FCCDN (512) F1 = 92.29% 参考论文 Table.32. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS	快速开始
33	A Transformer-Based Siamese Network for Change Detection	Abstract This paper presents a transformer-based Siamese network architecture (abbreviated by ChangeFormer) for Change Detection (CD) from a pair of co-registered remote sensing images. Different from recent CD frameworks, which are based on fully convolutional networks (ConvNets), the proposed method unifies hierarchically structured transformer encoder with Multi-Layer Perception (MLP) decoder in a Siamese network architecture to efficiently render multi-scale long-range details required for accurate CD. Experiments on two CD datasets show that the proposed end-to-end trainable ChangeFormer architecture achieves better CD performance than previous counterparts. Our code is available at https://github.com/wgcban/ChangeFormer.	数据集 LEVIR-CD验收指标：1.ChangeFormer F1 = 90.4% 参考论文 Table.12. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS	快速开始
34	TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation	Abstract Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.	数据集SYNAPSE：联系 jienengchen01@gmail.com 获取处理后数据链接，或者在synapse 官网下载验收指标：1. Avg Dice = 77.48% 对应论文 Table.1中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg	快速开始
35	FactSeg: Foreground Activation Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery	Abstract The small object semantic segmentation task is aimed at automatically extracting key objects from high-resolution remote sensing (HRS) imagery. Compared with the large-scale coverage areas for remote sensing imagery, the key objects, such as cars and ships, in HRS imagery often contain only a few pixels. In this article, to tackle this problem, the foreground activation (FA)-driven small object semantic segmentation (FactSeg) framework is proposed from perspectives of structure and optimization. In the structure design, FA object representation is proposed to enhance the awareness of the weak features in small objects. The FA object representation framework is made up of a dual-branch decoder and collaborative probability (CP) loss. In the dual-branch decoder, the FA branch is designed to activate the small object features (activation) and suppress the large-scale background, and the semantic refinement (SR) branch is designed to further distinguish small objects (refinement). The CP loss is proposed to effectively combine the activation and refinement outputs of the decoder under the CP hypothesis. During the collaboration, the weak features of the small objects are enhanced with the activation output, and the refined output can be viewed as the refinement of the binary outputs. In the optimization stage, small object mining (SOM)-based network optimization is applied to automatically select effective samples and refine the direction of the optimization while addressing the imbalanced sample problem between the small objects and the large-scale background. The experimental results obtained with two benchmark HRS imagery segmentation datasets demonstrate that the proposed framework outperforms the state-of-the-art semantic segmentation methods and achieves a good tradeoff between accuracy and efficiency. Code will be available at: http://rsidea.whu.edu.cn/FactSeg.htm	数据集 iSAID：https://captain-whu.github.io/iSAID/dataset.html验收指标：1. FactSeg ResNet-50 mIOU= 64.79% 参考论文Table.4 实现2. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS	快速开始
36	PSANet: Point-wise Spatial Attention Network for Scene Parsing	Abstract We notice information flow in convolutional neural networks is restricted inside local neighborhood regions due to the physical design of convolutional filters, which limits the overall understanding of complex scenes. In this paper, we propose the point-wise spatial attention network (PSANet) to relax the local neighborhood constraint. Each position on the feature map is connected to all the other ones through a self-adaptively learned attention mask. Moreover, information propagation in bi-direction for scene parsing is enabled. Information at other positions can be collected to help the prediction of the current position and vice versa, information at the current position can be distributed to assist the prediction of other ones. Our proposed approach achieves top performance on various competitive scene parsing datasets, including ADE20K, PASCAL VOC 2012 and Cityscapes, demonstrating its effectiveness and generality.	数据集 cityscapes valset 验收指标：1. PSANet-resnet50 输入分辨率512x1024 mIOU=77.24% 参考 https://github.com/open-mmlab/mmsegmentation/tree/master/configs/psanet2. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg	快速开始
37	CCNet: Criss-Cross Attention for Semantic Segmentation	Abstract Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at \url{https://github.com/speedinghzl/CCNet}.	数据集 cityscapes valset验收指标：1. CCNet-resnet101 R=2+OHEM mIOU= 80.0% 参考 https://github.com/speedinghzl/CCNet/tree/pure-python2. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg	快速开始
38	Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes	Abstract Semantic segmentation is a key technology for autonomous vehicles to understand the surrounding scenes. The appealing performances of contemporary models usually come at the expense of heavy computations and lengthy inference time, which is intolerable for self-driving. Using light-weight architectures (encoder-decoder or two-pathway) or reasoning on low-resolution images, recent methods realize very fast scene parsing, even running at more than 100 FPS on a single 1080Ti GPU. However, there is still a significant gap in performance between these real-time methods and the models based on dilation backbones. To tackle this problem, we proposed a family of efficient backbones specially designed for real-time semantic segmentation. The proposed deep dual-resolution networks (DDRNets) are composed of two deep branches between which multiple bilateral fusions are performed. Additionally, we design a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) to enlarge effective receptive fields and fuse multi-scale context based on low-resolution feature maps. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both Cityscapes and CamVid dataset. In particular, on a single 2080Ti GPU, DDRNet-23-slim yields 77.4% mIoU at 102 FPS on Cityscapes test set and 74.7% mIoU at 230 FPS on CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models and requires much less computation. Codes and trained models are available online.	数据集 cityscapes valset验收指标：1. DDRNet-23 mIOU= 79.5% 参考论文 Table.42. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg	快速开始
39	nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation	Abstract The U-Net was presented in 2015. With its straight-forward and successful architecture it quickly evolved to a commonly used benchmark in medical image segmentation. The adaptation of the U-Net to novel problems, however, comprises several degrees of freedom regarding the exact architecture, preprocessing, training and inference. These choices are not independent of each other and substantially impact the overall performance. The present paper introduces the nnU-Net ('no-new-Net'), which refers to a robust and self-adapting framework on the basis of 2D and 3D vanilla U-Nets. We argue the strong case for taking away superfluous bells and whistles of many proposed network designs and instead focus on the remaining aspects that make out the performance and generalizability of a method. We evaluate the nnU-Net in the context of the Medical Segmentation Decathlon challenge, which measures segmentation performance in ten disciplines comprising distinct entities, image modalities, image geometries and dataset sizes, with no manual adjustments between datasets allowed. At the time of manuscript submission, nnU-Net achieves the highest mean dice scores across all classes and seven phase 1 tasks (except class 1 in BrainTumour) in the online leaderboard of the challenge.	数据集 MSD-Lung :https://drive.google.com/drive/folders/1HqEgzS8BV2c7xYNrZdEAnrHk7osJJ--2验收指标：1. 3DUnet-cascade Avg Dice = 66.85%；ensemble 2DUnet+ 3DUnet Avg Dice = 61.18%；对应论文 Table.2中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg	快速开始
40	UNETR: Transformers for 3D Medical Image	Abstract Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr	nan	快速开始

OCR

序号	论文名称(链接)	摘要	数据集	快速开始
1	Detecting Text in Natural Image with Connectionist Text Proposal Network	Abstract We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of image, making it powerful to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi- language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-processing. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpass- ing recent results [8, 35] by a large margin. The CTPN is computationally efficient with 0:14s/image, by using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/.	icdar2015: 0.61	快速开始
2	Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network	Abstract Attention-based scene text recognizers have gained huge success, whichleverages a more compact intermediate representation to learn 1d- or 2d-attention by a RNN-based encoder-decoder architecture. However, such methodssuffer from attention-drift problem because high similarity among encodedfeatures leads to attention confusion under the RNN-based local attentionmechanism. Moreover, RNN-based methods have low efficiency due to poorparallelization. To overcome these problems, we propose the MASTER, aself-attention based scene text recognizer that (1) not only encodes theinput-output attention but also learns self-attention which encodesfeature-feature and target-target relationships inside the encoder and decoderand (2) learns a more powerful and robust intermediate representation tospatial distortion, and (3) owns a great training efficiency because of hightraining parallelization and a high-speed inference because of an efficientmemory-cache mechanism. Extensive experiments on various benchmarks demonstratethe superior performance of our MASTER on both regular and irregular scenetext. Pytorch code can be found at this https URL,and Tensorflow code can be found at this https URL.	ResNet18 ctw1500 0.806	快速开始
3	MASTER: Multi-Aspect Non-local Network for Scene Text Recognition	Abstract Temporal action proposal generation is an important and challenging task invideo understanding, which aims at detecting all temporal segments containingaction instances of interest. The existing proposal generation approaches aregenerally based on pre-defined anchor windows or heuristic bottom-up boundarymatching strategies. This paper presents a simple and efficient framework(RTD-Net) for direct action proposal generation, by re-purposing aTransformer-alike architecture. To tackle the essential visual differencebetween time and space, we make three important improvements over the originaltransformer detection framework (DETR). First, to deal with slowness prior invideos, we replace the original Transformer encoder with a boundary attentivemodule to better capture long-range temporal information. Second, due to theambiguous temporal boundary and relatively sparse annotations, we present arelaxed matching scheme to relieve the strict criteria of single assignment toeach groundtruth. Finally, we devise a three-branch head to further improve theproposal confidence estimation by explicitly predicting its completeness.Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstratethe effectiveness of RTD-Net, on both tasks of temporal action proposalgeneration and temporal action detection. Moreover, due to its simplicity indesign, our framework is more efficient than previous proposal generationmethods, without non-maximum suppression post-processing. The code and modelsare made available at this https URL.	IIIT5K: 95 SVT: 90.6 IC03: 96.4 IC13: 95.3IC15: 79.4 SVTP: 834.5 CT80: 84.5 avg: 89.81	快速开始
4	Fourier Contour Embedding for Arbitrary-Shaped Text Detection	Abstract One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. Most of existing methods model text instances in image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures. We further construct FCENet with a backbone, feature pyramid networks (FPN) and a simple post-processing with the Inverse Fourier Transformation (IFT) and Non-Maximum Suppression (NMS). Different from previous methods, FCENet first predicts compact Fourier signatures of text instances, and then reconstructs text contours via IFT and NMS during test. Extensive experiments demonstrate that FCE is accurate and robust to fit contours of scene texts even with highly-curved shapes, and also validate the effectiveness and the good generalization of FCENet for arbitrary-shaped text detection. Furthermore, experimental results show that our FCENet is superior to the state-of-the-art (SOTA) methods on CTW1500 and Total-Text, especially on challenging highly-curved text subset.	ResNet50 + DCNv2 ctw1500 0.851	快速开始
5	Primitive Representation Learning for Scene Text Recognition	Abstract Scene text recognition is a challenging task due to diverse variations of text instances in natural scene images. Conventional methods based on CNN-RNN-CTC or encoder-decoder with attention mechanism may not fully investigate stable and efficient feature representations for multi-oriented scene texts. In this paper, we propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images. We model elements in feature maps as the nodes of an undirected graph. A pooling aggregator and a weighted aggregator are proposed to learn primitive representations, which are transformed into high-level visual text representations by graph convolutional networks. A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding. Furthermore, by integrating visual text representations into an encoderdecoder model with the 2D attention mechanism, we propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods. Experimental results on both English and Chinese scene text recognition tasks demonstrate that PREN keeps a balance between accuracy and efficiency, while PREN2D achieves state-of-theart performance.	SynthText+Mjsynth; IIIT5k: 86.03%, SVT: 87.17%, IC03: 95.16%, IC13: 93.93%, IC15: 78.52%, SVTP: 81.71%, CUTE80: 75.69%, avg: 85.5%	快速开始
6	RF-Learning:Reciprocal Feature Learning via Explicit and Implicit Tasks in Scene Text Recognition	Abstract Text recognition is a popular topic for its broad applications. In this work, we excavate the implicit task, character counting within the traditional text recognition, without additional labor annotation cost. The implicit task plays as an auxiliary branch for complementing the sequential recognition. We design a two-branch reciprocal feature learning framework in order to adequately utilize the features from both the tasks. Through exploiting the complementary effect between explicit and implicit tasks, the feature is reliably enhanced. Extensive experiments on 7 benchmarks show the advantages of the proposed methods in both text recognition and the new-built character counting tasks. In addition, it is convenient yet effective to equip with variable networks and tasks. We offer abundant ablation studies, generalizing experiments with deeper understanding on the tasks. Code is available.	RF-Learning visual IIIT5K: 96, SVT:94.7 IC03:96.2 IC13:95.9 IC15:88.7 SVTP:86.7 CUTE80:88.2 avg: 92.34	快速开始
7	Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection	Abstract Arbitrary shape text detection is a challenging task due to the high variety and complexity of scenes texts. In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection. In our method, an innovative local graph bridges a text proposal model via Convolutional Neural Network (CNN) and a deep relational reasoning network via Graph Convolutional Network (GCN), making our network end-to-end trainable. To be concrete, every text instance will be divided into a series of small rectangular components, and the geometry attributes (e.g., height, width, and orientation) of the small components will be estimated by our text proposal model. Given the geometry attributes, the local graph construction model can roughly establish linkages between different text components. For further reasoning and deducing the likelihood of linkages between the component and its neighbors, we adopt a graph-based network to perform deep relational reasoning on local graphs. Experiments on public available datasets demonstrate the state-of-the-art performance of our method.	ResNet50 ctw1500 0.840	快速开始
8	Scene Text Telescope: Text-Focused Scene Image Super-Resolution	Abstract Image super-resolution, which is often regarded as a preprocessing procedure of scene text recognition, aims to recover the realistic features from a low-resolution text image. It has always been challenging due to large variations in text shapes, fonts, backgrounds, etc. However, most existing methods employ generic super-resolution frameworks to handle scene text images while ignoring text-specific properties such as text-level layouts and character-level details. In this paper, we establish a text-focused super-resolution framework, called Scene Text Telescope (STT). In terms of text-level layouts, we propose a Transformer-Based Super-Resolution Network (TBSRN) containing a Self-Attention Module to extract sequential information, which is robust to tackle the texts in arbitrary orientations. In terms of character-level details, we propose a Position-Aware Module and a Content-Aware Module to highlight the position and the content of each character. By observing that some characters look indistinguishable in low-resolution conditions, we use a weighted cross-entropy loss to tackle this problem. We conduct extensive experiments, including text recognition with pre-trained recognizers and image quality evaluation, on TextZoom and several scene text recognition benchmarks to assess the super-resolution images. The experimental results show that our STT can indeed generate text-focused super-resolution images and outperform the existing methods in terms of recognition accuracy.	CRNN+tbsrn，easy: 0.5979, medium: 0.4507, hard: 0.3418, avg: 0.4634	快速开始
9	When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition	Abstract Recently, most handwritten mathematical expression recognition (HMER) methods adopt the encoder-decoder networks, which directly predict the markup sequences from formula images with the attention mechanism. However, such methods may fail to accurately read formulas with complicated structure or generate long markup sequences, as the attention results are often inaccurate due to the large variance of writing styles or spatial layouts. To alleviate this problem, we propose an unconventional network for HMER named Counting-Aware Network (CAN), which jointly optimizes two tasks: HMER and symbol counting. Specifically, we design a weakly-supervised counting module that can predict the number of each symbol class without the symbol-level position annotations, and then plug it into a typical attention-based encoder-decoder model for HMER. Experiments on the benchmark datasets for HMER validate that both joint optimization and counting results are beneficial for correcting the prediction errors of encoder-decoder models, and CAN consistently outperforms the state-of-the-art methods. In particular, compared with an encoder-decoder model for HMER, the extra time cost caused by the proposed counting module is marginal. The source code is available at https://github.com/LBH1024/CAN.	1.ExpRate=65.89 2.复现后合入PaddleOCR套件，并添加TIPC	快速开始
10	RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition	Abstract The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences) which is unacceptable in most of real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. Contextual information, which the existing approaches heavily rely on, causes the problem of attention drift. To suppress such side-effect, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to enable the encoder to output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted for more robust feature via an element-wise gate mechanism. Theoretically, our proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks while without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.	IIIT5K: 95.1 SVT:89.2 IC13:93.1 IC15:77.8 SVTP:80.3 CT80:90.3 avg 87.63	快速开始
11	SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition	Abstract Arbitrary text appearance poses a great challenge in scene text recognition tasks. Existing works mostly handle with the problem in consideration of the shape distortion, including perspective distortions, line curvature or other style variations. Therefore, methods based on spatial transformers are extensively studied. However, chromatic difficulties in complex scenes have not been paid much attention on. In this work, we introduce a new learnable geometric-unrelated module, the Structure-Preserving Inner Offset Network (SPIN), which allows the color manipulation of source data within the network. This differentiable module can be inserted before any recognition architecture to ease the downstream tasks, giving neural networks the ability to actively transform input intensity rather than the existing spatial rectification. It can also serve as a complementary module to known spatial transformations and work in both independent and collaborative ways with them. Extensive experiments show that the use of SPIN results in a significant improvement on multiple text recognition benchmarks compared to the state-of-the-arts.	IIIT5K: 94.6, SVT:89, IC03: 93.3, IC13:94.2,IC15:80.7,SVTP:83,CUTE80:84.7,avg: 88.5	快速开始

图像生成

序号	论文名称(链接)	摘要	数据集	快速开始
1	Deep Image Prior	Abstract Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them, and to restore images based on flash-no flash input pairs. Apart from its diverse applications, our approach highlights the inductive bias captured by standard generator network architectures. It also bridges the gap between two very popular families of image restoration methods: learning-based methods using deep convolutional networks and learning-free methods based on handcrafted image priors such as self-similarity. Code and supplementary material are available at https://dmitryulyanov.github.io/deep_image_prior .	8× super-resolution, avg psnr=24.15%	快速开始
2	Progressive Growing of GANs for Improved Quality, Stability, and Variation	Abstract We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.	CelebA MS-SSIM=0.2838, SWD=2.64(64)	快速开始
3	Image Inpainting for Irregular Holes Using Partial Convolutions	Abstract Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but are expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.	CelebA 人眼评估生成的图像(参考论文Figure8)	快速开始
4	Generative Adversarial Text to Image Synthesis	Abstract Synthesizing high resolution photorealistic images has been a long-standingchallenge in machine learning. In this paper we introduce new methods for theimproved training of generative adversarial networks (GANs) for imagesynthesis. We construct a variant of GANs employing label conditioning thatresults in 128x128 resolution image samples exhibiting global coherence. Weexpand on previous work for image quality assessment to provide two newanalyses for assessing the discriminability and diversity of samples fromclass-conditional image synthesis models. These analyses demonstrate that highresolution samples provide class information not present in low resolutionsamples. Across 1000 ImageNet classes, 128x128 samples are more than twice asdiscriminable as artificially resized 32x32 samples. In addition, 84.7% of theclasses have samples exhibiting diversity comparable to real ImageNet data.	Oxford-102 人眼评估生成的图像(参考论文中展示的生成图片)	快速开始
5	Conditional Image Synthesis With Auxiliary Classifier GANs	Abstract We introduce SinGAN, an unconditional generative model that can be learnedfrom a single natural image. Our model is trained to capture the internaldistribution of patches within the image, and is then able to generate highquality, diverse samples that carry the same visual content as the image.SinGAN contains a pyramid of fully convolutional GANs, each responsible forlearning the patch distribution at a different scale of the image. This allowsgenerating new samples of arbitrary size and aspect ratio, that havesignificant variability, yet maintain both the global structure and the finetextures of the training image. In contrast to previous single image GANschemes, our approach is not limited to texture images, and is not conditional(i.e. it generates samples from noise). User studies confirm that the generatedsamples are commonly confused to be real images. We illustrate the utility ofSinGAN in a wide range of image manipulation tasks.	ImageNet 人眼评估生成的图像(参考论文中展示的生成图片)	快速开始
6	SinGAN: Learning a Generative Model from a Single Natural Image	Abstract We propose spatially-adaptive normalization, a simple but effective layer forsynthesizing photorealistic images given an input semantic layout. Previousmethods directly feed the semantic layout as input to the deep network, whichis then processed through stacks of convolution, normalization, andnonlinearity layers. We show that this is suboptimal as the normalizationlayers tend to ``wash away'' semantic information. To address the issue, wepropose using the input layout for modulating the activations in normalizationlayers through a spatially-adaptive, learned transformation. Experiments onseveral challenging datasets demonstrate the advantage of the proposed methodover existing approaches, regarding both visual fidelity and alignment withinput layouts. Finally, our model allows user control over both semantic andstyle. Code is available at this https URL .	任意一张图片人眼评估生成的图像(可参考论文中展示的生成图片Figure6)	快速开始
7	Semantic Image Synthesis with Spatially-Adaptive Normalization	Abstract We present a generic image-to-image translation framework, pixel2style2pixel(pSp). Our pSp framework is based on a novel encoder network that directlygenerates a series of style vectors which are fed into a pretrained StyleGANgenerator, forming the extended W+ latent space. We first show that our encodercan directly embed real images into W+, with no additional optimization. Next,we propose utilizing our encoder to directly solve image-to-image translationtasks, defining them as encoding problems from some input domain into thelatent domain. By deviating from the standard invert first, edit latermethodology used with previous StyleGAN encoders, our approach can handle avariety of tasks even when the input image is not represented in the StyleGANdomain. We show that solving translation tasks through StyleGAN significantlysimplifies the training process, as no adversary is required, has bettersupport for solving tasks without pixel-to-pixel correspondence, and inherentlysupports multi-modal synthesis via the resampling of styles. Finally, wedemonstrate the potential of our framework on a variety of facialimage-to-image translation tasks, even when compared to state-of-the-artsolutions designed specifically for a single task, and further show that it canbe extended beyond the human facial domain.	cityscapes mIoU=62.3, accu=81.9, FID=71.8, 及人眼观察可视化效果	快速开始
8	Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation	Abstract The task of age transformation illustrates the change of an individual'sappearance over time. Accurately modeling this complex transformation over aninput facial image is extremely challenging as it requires making convincing,possibly large changes to facial features and head shape, while stillpreserving the input identity. In this work, we present an image-to-imagetranslation method that learns to directly encode real facial images into thelatent space of a pre-trained unconditional GAN (e.g., StyleGAN) subject to agiven aging shift. We employ a pre-trained age regression network to explicitlyguide the encoder in generating the latent codes corresponding to the desiredage. In this formulation, our method approaches the continuous aging process asa regression task between the input age and desired target age, providingfine-grained control over the generated image. Moreover, unlike approaches thatoperate solely in the latent space using a prior on the path controlling age,our method learns a more disentangled, non-linear path. Finally, we demonstratethat the end-to-end nature of our approach, coupled with the rich semanticlatent space of StyleGAN, allows for further editing of the generated images.Qualitative and quantitative evaluations show the advantages of our methodcompared to state-of-the-art approaches.	CelebA LPIPS=0.17, similarity=0.56, MSE=0.03 (task of StyleGAN Inversion)	快速开始
9	Only a Matter of Style: Age Transformation Using a Style-Based Regression Model	Abstract The task of age transformation illustrates the change of an individual's appearance over time. Accurately modeling this complex transformation over an input facial image is extremely challenging as it requires making convincing, possibly large changes to facial features and head shape, while still preserving the input identity. In this work, we present an image-to-image translation method that learns to directly encode real facial images into the latent space of a pre-trained unconditional GAN (e.g., StyleGAN) subject to a given aging shift. We employ a pre-trained age regression network to explicitly guide the encoder in generating the latent codes corresponding to the desired age. In this formulation, our method approaches the continuous aging process as a regression task between the input age and desired target age, providing fine-grained control over the generated image. Moreover, unlike approaches that operate solely in the latent space using a prior on the path controlling age, our method learns a more disentangled, non-linear path. Finally, we demonstrate that the end-to-end nature of our approach, coupled with the rich semantic latent space of StyleGAN, allows for further editing of the generated images. Qualitative and quantitative evaluations show the advantages of our method compared to state-of-the-art approaches.	CelebA 人眼评估生成的图像(参考论文中展示的生成图片Figure4, 6, 8)	快速开始
10	ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic	Abstract We aim at accelerating super-resolution (SR) networks on large images (2K-8K). The large images are usually decomposed into small sub-images in practical usages. Based on this processing, we found that different image regions have different restoration difficulties and can be processed by networks with different capacities. Intuitively, smooth areas are easier to super-solve than complex textures. To utilize this property, we can adopt appropriate SR networks to process different sub-images after the decomposition. On this basis, we propose a new solution pipeline -- ClassSR that combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulties, then applies an SR-Module to perform SR for different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions. We further introduce a new classification method with two losses -- Class-Loss and Average-Loss to produce the classification results. After joint training, a majority of sub-images will pass through smaller networks, thus the computational cost can be significantly reduced. Experiments show that our ClassSR can help most existing methods (e.g., FSRCNN, CARN, SRResNet, RCAN) save up to 50% FLOPs on DIV8K datasets. This general framework can also be applied in other low-level vision tasks.	DIV2K PSNR=26.39, FLOPs=21.22G(65%)(Test2K, ClassSR-RCAN)	快速开始
11	Self-Attention Generative Adversarial Networks	Abstract In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.	ImageNet FID=18.28 Inception score=52.52	快速开始
12	Generative Image Inpainting with Contextual Attention	Abstract We apply basic statistical reasoning to signal reconstruction by machinelearning -- learning to map corrupted observations to clean signals -- with asimple and powerful conclusion: it is possible to learn to restore images byonly looking at corrupted examples, at performance at and sometimes exceedingtraining using clean data, without explicit image priors or likelihood modelsof the corruption. In practice, we show that a single model learns photographicnoise removal, denoising synthetic Monte Carlo images, and reconstruction ofundersampled MRI scans -- all corrupted by different processes -- based onnoisy data only.	L1Loss=8.6%, L2Loss=2.1%, PSNR=18.91, TVLoss=25.3%	快速开始
13	Noise2Noise: Learning Image Restoration without Clean Data	Abstract While humans easily recognize relations between data from different domainswithout any supervision, learning to automatically discover them is in generalvery challenging and needs many ground-truth pairs that illustrate therelations. To avoid costly pairing, we address the task of discoveringcross-domain relations given unpaired data. We propose a method based ongenerative adversarial networks that learns to discover relations betweendifferent domains (DiscoGAN). Using the discovered relations, our proposednetwork successfully transfers style from one domain to another whilepreserving key attributes such as orientation and face identity. Source codefor official implementation is publicly availablethis https URL	Denoised 与clear image PSNR持平 (Gaussian noise (σ = 25))	快速开始
14	Learning to Discover Cross-Domain Relations with Generative Adversarial Networks	Abstract We propose to restore old photos that suffer from severe degradation througha deep learning approach. Unlike conventional restoration tasks that can besolved through supervised learning, the degradation in real photos is complexand the domain gap between synthetic images and real old photos makes thenetwork fail to generalize. Therefore, we propose a novel triplet domaintranslation network by leveraging real photos along with massive syntheticimage pairs. Specifically, we train two variational autoencoders (VAEs) torespectively transform old photos and clean photos into two latent spaces. Andthe translation between these two latent spaces is learned with syntheticpaired data. This translation generalizes well to real photos because thedomain gap is closed in the compact latent space. Besides, to address multipledegradations mixed in one old photo, we design a global branch with apartialnonlocal block targeting to the structured defects, such as scratches and dustspots, and a local branch targeting to the unstructured defects, such as noisesand blurriness. Two branches are fused in the latent space, leading to improvedcapability to restore old photos from multiple defects. Furthermore, we applyanother face refinement network to recover fine details of faces in the oldphotos, thus ultimately generating photos with enhanced perceptual quality.With comprehensive experiments, the proposed pipeline demonstrates superiorperformance over state-of-the-art methods as well as existing commercial toolsin terms of visual quality for old photos restoration.	可视化, 参考论文图7, 8, 9	快速开始
15	Old Photo Restoration via Deep Latent Space Translation	Abstract We present a novel method for constructing Variational Autoencoder (VAE).Instead of using pixel-by-pixel loss, we enforce deep feature consistencybetween the input and the output of a VAE, which ensures the VAE's output topreserve the spatial correlation characteristics of the input, thus leading theoutput to have a more natural visual appearance and better perceptual quality.Based on recent deep learning works such as style transfer, we employ apre-trained deep convolutional neural network (CNN) and use its hidden featuresto define a feature perceptual loss for VAE training. Evaluated on the CelebAface dataset, we show that our model produces better results than other methodsin the literature. We also show that our method can produce latent vectors thatcan capture the semantic information of face expressions and can be used toachieve state-of-the-art performance in facial attribute prediction.	PSNR=23.33, SSIM= 0.69, LPIPS=0.25, FID=134.35(table2)	快速开始
16	Deep Feature Consistent Variational Autoencoder	Abstract Scene text detection, an important step of scene text reading systems, haswitnessed rapid development with convolutional neural networks. Nonetheless,two main challenges still exist and hamper its deployment to real-worldapplications. The first problem is the trade-off between speed and accuracy.The second one is to model the arbitrary-shaped text instance. Recently, somemethods have been proposed to tackle arbitrary-shaped text detection, but theyrarely take the speed of the entire pipeline into consideration, which may fallshort in practical this http URL this paper, we propose an efficient andaccurate arbitrary-shaped text detector, termed Pixel Aggregation Network(PAN), which is equipped with a low computational-cost segmentation head and alearnable post-processing. More specifically, the segmentation head is made upof Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM).FPEM is a cascadable U-shaped module, which can introduce multi-levelinformation to guide the better segmentation. FFM can gather the features givenby the FPEMs of different depths into a final feature for segmentation. Thelearnable post-processing is implemented by Pixel Aggregation (PA), which canprecisely aggregate text pixels by predicted similarity vectors. Experiments onseveral standard benchmarks validate the superiority of the proposed PAN. It isworth noting that our method can achieve a competitive F-measure of 79.9% at84.2 FPS on CTW1500.	Average accuracies=88.73 (Table1. VAE-345)	快速开始
17	Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data	Abstract Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts in the synthesis process. In addition, we employ a U-Net discriminator with spectral normalization to increase discriminator capability and stabilize the training dynamics. Extensive comparisons have shown its superior visual performance than prior works on various real datasets. We also provide efficient implementations to synthesize training pairs on the fly.	DIV2K and Flickr2K and OST; 可视化效果与论文一致	快速开始
18	GFP-GAN: Towards Real-World Blind Face Restoration with Generative Facial Prior	Abstract Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric prior while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN that leverages rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via novel channel-split spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, our GFP-GAN could jointly restore facial details and enhance colors with just a single forward pass, while GAN inversion methods require expensive image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.	CelebA-Test： LPIPS=0.3646， FID=42.62	快速开始
19	Aggregated Contextual Transformations for High-Resolution Image Inpainting	Abstract State-of-the-art image inpainting approaches can suffer from generating distorted structures and blurry textures in high-resolution images (e.g., 512x512). The challenges mainly drive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN), for high-resolution image inpainting. Specifically, to enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block. The AOT blocks aggregate contextual transformations from various receptive fields, allowing to capture both informative distant image contexts and rich patterns of interest for context reasoning. For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task. Such a training objective forces the discriminator to distinguish the detailed appearances of real and synthesized patches, and in turn, facilitates the generator to synthesize clear textures. Extensive comparisons on Places2, the most challenging benchmark with 1.8 million high-resolution images of 365 complex scenes, show that our model outperforms the state-of-the-art by a significant margin in terms of FID with 38.60% relative improvement. A user study including more than 30 subjects further validates the superiority of AOT-GAN. We further evaluate the proposed AOT-GAN in practical applications, e.g., logo removal, face editing, and object removal. Results show that our model achieves promising completions in the real world. We release code and models in https://github.com/researchmm/AOT-GAN-for-Inpainting.	Places365-val(20-30% ): PSNR=26.03, SSIM=0.890	快速开始
20	Progressive Image Deraining Networks: A Better and Simpler Baseline	Abstract Along with the deraining performance improvement of deep networks, their structures and learning become more and more complicated and diverse, making it difficult to analyze the contribution of various network modules when developing new deraining networks. To handle this issue, this paper provides a better and simpler baseline deraining network by considering network architecture, input and output, and loss functions. Specifically, by repeatedly unfolding a shallow ResNet, progressive ResNet (PRN) is proposed to take advantage of recursive computation. A recurrent layer is further introduced to exploit the dependencies of deep features across stages, forming our progressive recurrent network (PReNet). Furthermore, intra-stage recursive computation of ResNet can be adopted in PRN and PReNet to notably reduce network parameters with graceful degradation in deraining performance. For network input and output, we take both stage-wise result and original rainy image as input to each ResNet and finally output the prediction of {residual image}. As for loss functions, single MSE or negative SSIM losses are sufficient to train PRN and PReNet. Experiments show that PRN and PReNet perform favorably on both synthetic and real rainy images. Considering its simplicity, efficiency and effectiveness, our models are expected to serve as a suitable baseline in future deraining research. The source codes are available at https://github.com/csdwren/PReNet.	Rain100H数据集,PReNet模型,psnr=29.46, ssim=0.899	快速开始
21	StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery	Abstract Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping a text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.	可视化	快速开始
22	GAN Prior Embedded Network for Blind Face Restoration in the Wild	Abstract Blind face restoration (BFR) from severely degraded face images in the wild is a very challenging problem. Due to the high illness of the problem and the complex unknown degradation, directly training a deep neural network (DNN) usually cannot lead to acceptable results. Existing generative adversarial network (GAN) based methods can produce better results but tend to generate over-smoothed restorations. In this work, we propose a new method by first learning a GAN for high-quality face image generation and embedding it into a U-shaped DNN as a prior decoder, then fine-tuning the GAN prior embedded DNN with a set of synthesized low-quality face images. The GAN blocks are designed to ensure that the latent code and noise input to the GAN can be respectively generated from the deep and shallow features of the DNN, controlling the global face structure, local face details and background of the reconstructed image. The proposed GAN prior embedded network (GPEN) is easy-to-implement, and it can generate visually photo-realistic results. Our experiments demonstrated that the proposed GPEN achieves significantly superior results to state-of-the-art BFR methods both quantitatively and qualitatively, especially for the restoration of severely degraded face images in the wild. The source code and models can be found at https://github.com/yangxy/GPEN.	FID=31.72（CelebA-HQ-val ）	快速开始

图像修复

序号	论文名称(链接)	摘要	数据集	快速开始
1	Simple Baselines for Image Restoration	Abstract Although there have been significant advances in the field of image restoration recently, the system complexity of the state-of-the-art (SOTA) methods is increasing as well, which may hinder the convenient analysis and comparison of methods. In this paper, we propose a simple baseline that exceeds the SOTA methods and is computationally efficient. To further simplify the baseline, we reveal that the nonlinear activation functions, e.g. Sigmoid, ReLU, GELU, Softmax, etc. are not necessary: they could be replaced by multiplication or removed. Thus, we derive a Nonlinear Activation Free Network, namely NAFNet, from the baseline. SOTA results are achieved on various challenging benchmarks, e.g. 33.69 dB PSNR on GoPro (for image deblurring), exceeding the previous SOTA 0.38 dB with only 8.4% of its computational costs; 40.30 dB PSNR on SIDD (for image denoising), exceeding the previous SOTA 0.28 dB with less than half of its computational costs. The code and the pre-trained models are released at https://github.com/megvii-research/NAFNet.	SIDD PSNR: 40.3045, SSIM:0.9614	快速开始
2	HINet: Half Instance Normalization Network for Image Restoration	Abstract In this paper, we explore the role of Instance Normalization in low-level vision tasks. Specifically, we present a novel block: Half Instance Normalization Block (HIN Block), to boost the performance of image restoration networks. Based on HIN Block, we design a simple and powerful multi-stage network named HINet, which consists of two subnetworks. With the help of HIN Block, HINet surpasses the state-of-the-art (SOTA) on various image restoration tasks. For image denoising, we exceed it 0.11dB and 0.28 dB in PSNR on SIDD dataset, with only 7.5% and 30% of its multiplier-accumulator operations (MACs), 6.8 times and 2.9 times speedup respectively. For image deblurring, we get comparable performance with 22.5% of its MACs and 3.3 times speedup on REDS and GoPro datasets. For image deraining, we exceed it by 0.3 dB in PSNR on the average result of multiple datasets with 1.4 times speedup. With HINet, we won 1st place on the NTIRE 2021 Image Deblurring Challenge - Track2. JPEG Artifacts, with a PSNR of 29.70. The code is available at https://github.com/megvii-model/HINet.	SIDD PSNR: 39.99, SSIM:0.958	快速开始
3	Invertible Denoising Network: A Light Solution for Real Noise Removal	Abstract Invertible networks have various benefits for image denoising since they are lightweight, information-lossless, and memory-saving during back-propagation. However, applying invertible models to remove noise is challenging because the input is noisy, and the reversed output is clean, following two different distributions. We propose an invertible denoising network, InvDN, to address this challenge. InvDN transforms the noisy input into a low-resolution clean image and a latent representation containing noise. To discard noise and restore the clean image, InvDN replaces the noisy latent representation with another one sampled from a prior distribution during reversion. The denoising performance of InvDN is better than all the existing competitive models, achieving a new state-of-the-art result for the SIDD dataset while enjoying less run time. Moreover, the size of InvDN is far smaller, only having 4.2% of the number of parameters compared to the most recently proposed DANet. Further, via manipulating the noisy latent representation, InvDN is also able to generate noise more similar to the original one. Our code is available at: https://github.com/Yang-Liu1082/InvDN.git.	SIDD PSNR: 39.28, SSIM:0.955	快速开始
4	SwinIR: Image Restoration Using Swin Transformer	Abstract Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model SwinIR for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45dB, while the total number of parameters can be reduced by up to 67%.	CBSD68, average PSNR, noise 15: 34.42	快速开始
5	Neighbor2Neighbor: Self-Supervised Denoising from Single Noisy Images	Abstract In the last few years, image denoising has benefited a lot from the fast development of neural networks. However, the requirement of large amounts of noisy-clean image pairs for supervision limits the wide use of these models. Although there have been a few attempts in training an image denoising model with only single noisy images, existing self-supervised denoising approaches suffer from inefficient network training, loss of useful information, or dependence on noise modeling. In this paper, we present a very simple yet effective method named Neighbor2Neighbor to train an effective image denoising model with only noisy images. Firstly, a random neighbor sub-sampler is proposed for the generation of training image pairs. In detail, input and target used to train a network are images sub-sampled from the same noisy image, satisfying the requirement that paired pixels of paired images are neighbors and have very similar appearance with each other. Secondly, a denoising network is trained on sub-sampled training pairs generated in the first stage, with a proposed regularizer as additional loss for better performance. The proposed Neighbor2Neighbor framework is able to enjoy the progress of state-of-the-art supervised denoising networks in network architecture design. Moreover, it avoids heavy dependence on the assumption of the noise distribution. We explain our approach from a theoretical perspective and further validate it through extensive experiments, including synthetic experiments with different noise distributions in sRGB space and real-world experiments on a denoising benchmark dataset in raw-RGB space.	Gaussion 25, BSD300: PSNR: 30.79, SSIM:0.873	快速开始
6	Restormer: Efficient Transformer for High-Resolution Image Restoration	Abstract Since convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data, these models have been extensively applied to image restoration and related tasks. Recently, another class of neural architectures, Transformers, have shown significant performance gains on natural language and high-level vision tasks. While the Transformer model mitigates the shortcomings of CNNs (i.e., limited receptive field and inadaptability to input content), its computational complexity grows quadratically with the spatial resolution, therefore making it infeasible to apply to most image restoration tasks involving high-resolution images. In this work, we propose an efficient Transformer model by making several key designs in the building blocks (multi-head attention and feed-forward network) such that it can capture long-range pixel interactions, while still remaining applicable to large images. Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks, including image deraining, single-image motion deblurring, defocus deblurring (single-image and dual-pixel data), and image denoising (Gaussian grayscale/color denoising, and real image denoising). The source code and pre-trained models are available at https://github.com/swz30/Restormer.	CBSD68, average PSNR, noise 15: 34.39	快速开始
7	Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising	Abstract Discriminative model learning for image denoising has been recently attracting considerable attentions due to its favorable denoising performance. In this paper, we take one step forward by investigating the construction of feed-forward denoising convolutional neural networks (DnCNNs) to embrace the progress in very deep architecture, learning algorithm, and regularization method into image denoising. Specifically, residual learning and batch normalization are utilized to speed up the training process as well as boost the denoising performance. Different from the existing discriminative denoising models which usually train a specific model for additive white Gaussian noise (AWGN) at a certain noise level, our DnCNN model is able to handle Gaussian denoising with unknown noise level (i.e., blind Gaussian denoising). With the residual learning strategy, DnCNN implicitly removes the latent clean image in the hidden layers. This property motivates us to train a single DnCNN model to tackle with several general image denoising tasks such as Gaussian denoising, single image super-resolution and JPEG image deblocking. Our extensive experiments demonstrate that our DnCNN model can not only exhibit high effectiveness in several general image denoising tasks, but also be efficiently implemented by benefiting from GPU computing.	BSD68: average PSNR, noise 15: 31.73	快速开始
8	Learning Enriched Features for Real Image Restoration and Enhancement	Abstract With the goal of recovering high-quality image content from its degraded version, image restoration enjoys numerous applications, such as in surveillance, computational photography, medical imaging, and remote sensing. Recently, convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration task. Existing CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations. In the former case, spatially precise but contextually less robust results are achieved, while in the latter case, semantically reliable but spatially less accurate outputs are generated. In this paper, we present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network and receiving strong contextual information from the low-resolution representations. The core of our approach is a multi-scale residual block containing several key elements: (a) parallel multi-resolution convolution streams for extracting multi-scale features, (b) information exchange across the multi-resolution streams, (c) spatial and channel attention mechanisms for capturing contextual information, and (d) attention based multi-scale feature aggregation. In a nutshell, our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Extensive experiments on five real image benchmark datasets demonstrate that our method, named as MIRNet, achieves state-of-the-art results for a variety of image processing tasks, including image denoising, super-resolution, and image enhancement. The source code and pre-trained models are available at https://github.com/swz30/MIRNet.	SIDD PSNR: 39.72, SSIM:0.959	快速开始

异常检测

序号	论文名称(链接)	摘要	数据集	快速开始
1	Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection	Abstract Anomaly detection is commonly pursued as a one-class classification problem, where models can only learn from normal training samples, while being evaluated on both normal and abnormal test samples. Among the successful approaches for anomaly detection, a distinguished category of methods relies on predicting masked information (e.g. patches, future frames, etc.) and leveraging the reconstruction error with respect to the masked information as an abnormality score. Different from related methods, we propose to integrate the reconstruction-based functionality into a novel self-supervised predictive architectural building block. The proposed self-supervised block is generic and can easily be incorporated into various state-of-the-art anomaly detection methods. Our block starts with a convolutional layer with dilated filters, where the center area of the receptive field is masked. The resulting activation maps are passed through a channel attention module. Our block is equipped with a loss that minimizes the reconstruction error with respect to the masked area in the receptive field. We demonstrate the generality of our block by integrating it into several state-of-the-art frameworks for anomaly detection on image and video, providing empirical evidence that shows considerable performance improvements on MVTec AD, Avenue, and ShanghaiTech. We release our code as open source at https://github.com/ristea/sspcab.	MVTec AD数据集，结合CutPaste方法，3-way detection AUROC 96.1%	快速开始
2	CutPaste: Self-Supervised Learning for Anomaly Detection and Localization	Abstract We aim at constructing a high performance model for defect detection that detects unknown anomalous patterns of an image without anomalous data. To this end, we propose a two-stage framework for building anomaly detectors using normal training data only. We first learn self-supervised deep representations and then build a generative one-class classifier on learned representations. We learn representations by classifying normal data from the CutPaste, a simple data augmentation strategy that cuts an image patch and pastes at a random location of a large image. Our empirical study on MVTec anomaly detection dataset demonstrates the proposed algorithm is general to be able to detect various types of real-world defects. We bring the improvement upon previous arts by 3.1 AUCs when learning representations from scratch. By transfer learning on pretrained representations on ImageNet, we achieve a new state-of-theart 96.6 AUC. Lastly, we extend the framework to learn and extract representations from patches to allow localizing defective areas without annotations during training.	MVTec AD数据集，3-way detection AUROC 95.2%	快速开始
3	Anomaly Detection via Reverse Distillation from One-Class Embedding	Abstract Knowledge distillation (KD) achieves promising results on the challenging problem of unsupervised anomaly detection (AD).The representation discrepancy of anomalies in the teacher-student (T-S) model provides essential evidence for AD. However, using similar or identical architectures to build the teacher and student models in previous studies hinders the diversity of anomalous representations. To tackle this problem, we propose a novel T-S model consisting of a teacher encoder and a student decoder and introduce a simple yet effective "reverse distillation" paradigm accordingly. Instead of receiving raw images directly, the student network takes teacher model's one-class embedding as input and targets to restore the teacher's multiscale representations. Inherently, knowledge distillation in this study starts from abstract, high-level presentations to low-level features. In addition, we introduce a trainable one-class bottleneck embedding (OCBE) module in our T-S model. The obtained compact embedding effectively preserves essential information on normal patterns, but abandons anomaly perturbations. Extensive experimentation on AD and one-class novelty detection benchmarks shows that our method surpasses SOTA performance, demonstrating our proposed approach's effectiveness and generalizability.	MVTec AD数据集， 256尺度，wide-resnet50，detection AUROC 98.5%， loc-AUROC 97.8%， loc-PRO 93.9%	快速开始
4	FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows	Abstract Unsupervised anomaly detection and localization is crucial to the practical application when collecting and labeling sufficient anomaly data is infeasible. Most existing representation-based approaches extract normal image features with a deep convolutional neural network and characterize the corresponding distribution through non-parametric distribution estimation methods. The anomaly score is calculated by measuring the distance between the feature of the test image and the estimated distribution. However, current methods can not effectively map image features to a tractable base distribution and ignore the relationship between local and global features which are important to identify anomalies. To this end, we propose FastFlow implemented with 2D normalizing flows and use it as the probability distribution estimator. Our FastFlow can be used as a plug-in module with arbitrary deep feature extractors such as ResNet and vision transformer for unsupervised anomaly detection and localization. In training phase, FastFlow learns to transform the input visual feature into a tractable distribution and obtains the likelihood to recognize anomalies in inference phase. Extensive experimental results on the MVTec AD dataset show that FastFlow surpasses previous state-of-the-art methods in terms of accuracy and inference efficiency with various backbone networks. Our approach achieves 99.4% AUC in anomaly detection with high inference efficiency.	MVTec AD数据集，ResNet18 , image-level AUC 97.9%，pixel-leval AUC 97.2%	快速开始
5	PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization	Abstract We present a new framework for Patch Distribution Modeling, PaDiM, to concurrently detect and localize anomalies in images in a one-class learning setting. PaDiM makes use of a pretrained convolutional neural network (CNN) for patch embedding, and of multivariate Gaussian distributions to get a probabilistic representation of the normal class. It also exploits correlations between the different semantic levels of CNN to better localize anomalies. PaDiM outperforms current state-of-the-art approaches for both anomaly detection and localization on the MVTec AD and STC datasets. To match real-world visual industrial inspection, we extend the evaluation protocol to assess performance of anomaly localization algorithms on non-aligned dataset. The state-of-the-art performance and low complexity of PaDiM make it a good candidate for many industrial applications.	resnet18 mvtec image-level auc 0.891, pixel-level auc: 0.968	快速开始
6	Student-Teacher Feature Pyramid Matching for Anomaly Detection	Abstract Anomaly detection is a challenging task and usually formulated as an one-class learning problem for the unexpectedness of anomalies. This paper proposes a simple yet powerful approach to this issue, which is implemented in the student-teacher framework for its advantages but substantially extends it in terms of both accuracy and efficiency. Given a strong model pre-trained on image classification as the teacher, we distill the knowledge into a single student network with the identical architecture to learn the distribution of anomaly-free images and this one-step transfer preserves the crucial clues as much as possible. Moreover, we integrate the multi-scale feature matching strategy into the framework, and this hierarchical feature matching enables the student network to receive a mixture of multi-level knowledge from the feature pyramid under better supervision, thus allowing to detect anomalies of various sizes. The difference between feature pyramids generated by the two networks serves as a scoring function indicating the probability of anomaly occurring. Due to such operations, our approach achieves accurate and fast pixel-level anomaly detection. Very competitive results are delivered on the MVTec anomaly detection dataset, superior to the state of the art ones.	resnet18 mvtec image-level auc 0.893, pixel-level auc: 0.951	快速开始
7	Multiresolution Knowledge Distillation for Anomaly Detection	Abstract Unsupervised representation learning has proved to be a critical component of anomaly detection/localization in images. The challenges to learn such a representation are two-fold. Firstly, the sample size is not often large enough to learn a rich generalizable representation through conventional techniques. Secondly, while only normal samples are available at training, the learned features should be discriminative of normal and anomalous samples. Here, we propose to use the "distillation" of features at various layers of an expert network, pre-trained on ImageNet, into a simpler cloner network to tackle both issues. We detect and localize anomalies using the discrepancy between the expert and cloner networks' intermediate activation values given the input data. We show that considering multiple intermediate hints in distillation leads to better exploiting the expert's knowledge and more distinctive discrepancy compared to solely utilizing the last layer activation values. Notably, previous methods either fail in precise anomaly localization or need expensive region-based training. In contrast, with no need for any special or intensive training procedure, we incorporate interpretability algorithms in our novel framework for the localization of anomalous regions. Despite the striking contrast between some test datasets and ImageNet, we achieve competitive or significantly superior results compared to the SOTA methods on MNIST, F-MNIST, CIFAR-10, MVTecAD, Retinal-OCT, and two Medical datasets on both anomaly detection and localization.	mvtec detection AUROC 87.74%, loc AUROC 90.71%	快速开始
8	Towards Total Recall in Industrial Anomaly Detection	Abstract Being able to spot defective parts is a critical component in large-scale industrial manufacturing. A particular challenge that we address in this work is the cold-start problem: fit a model using nominal (non-defective) example images only. While handcrafted solutions per class are possible, the goal is to build systems that work well simultaneously on many different tasks automatically. The best performing approaches combine embeddings from ImageNet models with an outlier detection model. In this paper, we extend on this line of work and propose \textbf{PatchCore}, which uses a maximally representative memory bank of nominal patch-features. PatchCore offers competitive inference times while achieving state-of-the-art performance for both detection and localization. On the challenging, widely used MVTec AD benchmark PatchCore achieves an image-level anomaly detection AUROC score of up to 99.6%, more than halving the error compared to the next best competitor. We further report competitive results on two additional datasets and also find competitive results in the few samples regime.\freefootnote{∗ Work done during a research internship at Amazon AWS.} Code: github.com/amazon-research/patchcore-inspection.	resnet18 mvtec image-level auc 0.973, pixel-level auc: 0.976	快速开始
9	Semi-orthogonal Embedding for Efficient Unsupervised Anomaly Segmentation	Abstract We present the efficiency of semi-orthogonal embedding for unsupervised anomaly segmentation. The multi-scale features from pre-trained CNNs are recently used for the localized Mahalanobis distances with significant performance. However, the increased feature size is problematic to scale up to the bigger CNNs, since it requires the batch-inverse of multi-dimensional covariance tensor. Here, we generalize an ad-hoc method, random feature selection, into semi-orthogonal embedding for robust approximation, cubically reducing the computational cost for the inverse of multi-dimensional covariance tensor. With the scrutiny of ablation studies, the proposed method achieves a new state-of-the-art with significant margins for the MVTec AD, KolektorSDD, KolektorSDD2, and mSTC datasets. The theoretical and empirical analyses offer insights and verification of our straightforward yet cost-effective approach.	mvtec pro 0.942, roc 0.982	快速开始

人脸识别

序号	论文名称(链接)	摘要	数据集	快速开始
1	RetinaFace: Single-stage Dense Face Localisation in the Wild	Abstract We propose a novel Connectionist Text Proposal Network (CTPN) that accuratelylocalizes text lines in natural image. The CTPN detects a text line in asequence of fine-scale text proposals directly in convolutional feature maps.We develop a vertical anchor mechanism that jointly predicts location andtext/non-text score of each fixed-width proposal, considerably improvinglocalization accuracy. The sequential proposals are naturally connected by arecurrent neural network, which is seamlessly incorporated into theconvolutional network, resulting in an end-to-end trainable model. This allowsthe CTPN to explore rich context information of image, making it powerful todetect extremely ambiguous text. The CTPN works reliably on multi-scale andmulti- language text without further post-processing, departing from previousbottom-up methods requiring multi-step post-processing. It achieves 0.88 and0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpass- ing recentresults [8, 35] by a large margin. The CTPN is computationally efficient with0:14s/image, by using the very deep VGG16 model [27]. Online demo is availableat: this http URL.	MAP: 52.318	快速开始
2	Real-time Convolutional Neural Networks for Emotion and Gender Classification	Abstract In this paper we propose an implement a general convolutional neural network (CNN) building framework for designing real-time CNNs. We validate our models by creating a real-time vision system which accomplishes the tasks of face detection, gender classification and emotion classification simultaneously in one blended step using our proposed CNN architecture. After presenting the details of the training procedure setup we proceed to evaluate on standard benchmark sets. We report accuracies of 96% in the IMDB gender dataset and 66% in the FER-2013 emotion dataset. Along with this we also introduced the very recent real-time enabled guided back-propagation visualization technique. Guided back-propagation uncovers the dynamics of the weight changes and evaluates the learned features. We argue that the careful implementation of modern CNN architectures, the use of the current regularization methods and the visualization of previously hidden features are necessary in order to reduce the gap between slow performances and real-time architectures. Our system has been validated by its deployment on a Care-O-bot 3 robot used during RoboCup@Home competitions. All our code, demos and pre-trained architectures have been released under an open-source license in our public repository.	IMDB: 96%	快速开始
3	PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models	Abstract The primary aim of single-image super-resolution is to construct high-resolution (HR) images from corresponding low-resolution (LR) inputs. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present an algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require supervised training on databases of LR-HR image pairs). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the "downscaling loss," which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee realistic outputs. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show proof of concept of our approach in the domain of face super-resolution (i.e., face hallucination). We also present a discussion of the limitations and biases of the method as currently implemented with an accompanying model card with relevant metrics. Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.	CelebA HQ 3.6	快速开始
4	Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks	Abstract Face detection and alignment in unconstrained environment are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost up their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark location in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmark for face detection, and AFLW benchmark for face alignment, while keeps real time performance.	FDDB: 0.82	快速开始
5	Finding Tiny Faces	Abstract Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different than those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task fashion: they make use of features extracted from multiple layers of single (deep) feature hierarchy. While training detectors for large objects is straightforward, the crucial challenge remains training detectors for small objects. We show that context is crucial, and define templates that make use of massively-large receptive fields (where 99% of the template extends beyond the object of interest). Finally, we explore the role of scale in pre-trained deep networks, providing ways to extrapolate networks tuned for limited scales to rather extreme ranges. We demonstrate state-of-the-art results on massively-benchmarked face datasets (FDDB and WIDER FACE). In particular, when compared to prior art on WIDER FACE, our results reduce error by a factor of 2 (our models produce an AP of 82% while prior art ranges from 29-64%).	WIDER_FACE: resnet50 500*500 easy: 0.902 medium: 0.892 medium: 0.892	快速开始

行为识别

序号	论文名称(链接)	摘要	数据集	快速开始
1	Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks	Abstract This paper addresses the visualisation of image classification models, learntusing deep Convolutional Networks (ConvNets). We consider two visualisationtechniques, based on computing the gradient of the class score with respect tothe input image. The first one generates an image, which maximises the classscore [Erhan et al., 2009], thus visualising the notion of the class, capturedby a ConvNet. The second technique computes a class saliency map, specific to agiven image and class. We show that such maps can be employed for weaklysupervised object segmentation using classification ConvNets. Finally, weestablish the connection between the gradient-based ConvNet visualisationmethods and deconvolutional networks [Zeiler et al., 2013].	可视化方法	快速开始
2	Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?	Abstract The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch	ucf-101; resnet18 112*112 Kinetics400 : 66.1 ucf-101: 42.4(不加载预训练)	快速开始
3	Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution	Abstract In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.	imagenet: 1.125 MobileNet (v2) 3224224 imagenet top1 73	快速开始
4	Learning Spatiotemporal Features with 3D Convolutional Networks	Abstract We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.	UCF101: 128x171 TOP1=83.27%	快速开始
5	MVFNet: Multi-View Fusion Network for Efficient Video Recognition	Abstract Conventionally, spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.	UCF101: 4x16, Top1=96.6%	快速开始
6	Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition	Abstract Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition. In GCNs, graph topology dominates feature aggregation and therefore is the key to extracting representative features. In this work, we propose a novel Channel-wise Topology Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies and effectively aggregate joint features in different channels for skeleton-based action recognition. The proposed CTR-GC models channel-wise topologies through learning a shared topology as a generic prior for all channels and refining it with channel-specific correlations for each channel. Our refinement method introduces few extra parameters and significantly reduces the difficulty of modeling channel-wise topologies. Furthermore, via reformulating graph convolutions into a unified form, we find that CTR-GC relaxes strict constraints of graph convolutions, leading to stronger representation capability. Combining CTR-GC with temporal modeling modules, we develop a powerful graph convolutional network named CTR-GCN which notably outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.	nan	快速开始
7	Revisiting Skeleton-based Action Recognition	Abstract Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt graph convolutional networks (GCN) to extract features on top of human skeletons. Despite the positive results shown in previous works, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseC3D, a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseC3D is more effective in learning spatiotemporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings. Also, PoseC3D can handle multiple-person scenarios without additional computation cost, and its features can be easily integrated with other modalities at early fusion stages, which provides a great design space to further boost the performance. On four challenging datasets, PoseC3D consistently obtains superior performance, when used alone on skeletons and in combination with the RGB modality.	UCF101 split1, top1=87.0	快速开始

自然语言处理

序号	论文名称(链接)	摘要	数据集	快速开始
1	A Structured Self-attentive Sentence Embedding	Abstract This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.	SNLI: accuracy=84.4%(见论文Table 2)	快速开始
2	End-To-End Memory Networks	Abstract This article offers an empirical exploration on the use of character-levelconvolutional networks (ConvNets) for text classification. We constructedseveral large-scale datasets to show that character-level convolutionalnetworks could achieve state-of-the-art or competitive results. Comparisons areoffered against traditional models such as bag of words, n-grams and theirTFIDF variants, and deep learning models such as word-based ConvNets andrecurrent neural networks.	Penn Treebank: ppl=111; Text8: ppl=147	快速开始
3	Character-level Convolutional Networks for Text Classification	Abstract Building open-domain chatbots is a challenging area for machine learningresearch. While prior work has shown that scaling neural models in the numberof parameters and the size of the data they are trained on gives improvedresults, we show that other ingredients are important for a high-performingchatbot. Good conversation requires a number of skills that an expertconversationalist blends in a seamless way: providing engaging talking pointsand listening to their partners, and displaying knowledge, empathy andpersonality appropriately, while maintaining a consistent persona. We show thatlarge scale models can learn these skills when given appropriate training dataand choice of generation strategy. We build variants of these recipes with 90M,2.7B and 9.4B parameter models, and make our models and code publiclyavailable. Human evaluations show our best models are superior to existingapproaches in multi-turn dialogue in terms of engagingness and humannessmeasurements. We then discuss the limitations of this work by analyzing failurecases of our models.	Amazon Review Full: error rate=40.45%; Yahoo! Answers: error rate=28.80%	快速开始
4	Recipes for building an open-domain chatbot	Abstract Pre-trained language models like BERT and its variants have recently achievedimpressive performance in various natural language understanding tasks.However, BERT heavily relies on the global self-attention block and thussuffers large memory footprint and computation cost. Although all its attentionheads query on the whole input sequence for generating the attention map from aglobal perspective, we observe some heads only need to learn localdependencies, which means the existence of computation redundancy. We thereforepropose a novel span-based dynamic convolution to replace these self-attentionheads to directly model local dependencies. The novel convolution heads,together with the rest self-attention heads, form a new mixed attention blockthat is more efficient at both global and local context learning. We equip BERTwith this mixed attention design and build a ConvBERT model. Experiments haveshown that ConvBERT significantly outperforms BERT and its variants in variousdownstream tasks, with lower training cost and fewer model parameters.Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher thanELECTRAbase, while using less than 1/4 training cost. Code and pre-trainedmodels will be released.	BlenderbotForConditionalGeneration模型和BlenderbotSmallForConditionalGeneration模型前向推理输出与论文对齐(90M和2.7B distilled to 360M两个权重)	快速开始
5	ConvBERT: Improving BERT with Span-based Dynamic Convolution	Abstract Natural Language Processing (NLP) has recently achieved great success byusing huge pre-trained models with hundreds of millions of parameters. However,these models suffer from heavy model sizes and high latency such that theycannot be deployed to resource-limited mobile devices. In this paper, wepropose MobileBERT for compressing and accelerating the popular BERT model.Like the original BERT, MobileBERT is task-agnostic, that is, it can begenerically applied to various downstream NLP tasks via simple fine-tuning.Basically, MobileBERT is a thin version of BERT_LARGE, while equipped withbottleneck structures and a carefully designed balance between self-attentionsand feed-forward networks. To train MobileBERT, we first train a speciallydesigned teacher model, an inverted-bottleneck incorporated BERT_LARGE model.Then, we conduct knowledge transfer from this teacher to MobileBERT. Empiricalstudies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASEwhile achieving competitive results on well-known benchmarks. On the naturallanguage inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7 (0.6lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuADv1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of90.0/79.2 (1.5/2.1 higher than BERT_BASE).	QNLI测试集accuracy=93.2%(见论文Table 3), SQuAD v1.1验证集上Exact Match=84.7%, F1=90.9%, SQuAD v2.0验证集Exact Match=80.6%, F1=83.1%(见论文Table 4)	快速开始
6	MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices	Abstract BERT adopts masked language modeling (MLM) for pre-training and is one of themost successful pre-training models. Since BERT neglects dependency amongpredicted tokens, XLNet introduces permuted language modeling (PLM) forpre-training to address this problem. However, XLNet does not leverage the fullposition information of a sentence and thus suffers from position discrepancybetween pre-training and fine-tuning. In this paper, we propose MPNet, a novelpre-training method that inherits the advantages of BERT and XLNet and avoidstheir limitations. MPNet leverages the dependency among predicted tokensthrough permuted language modeling (vs. MLM in BERT), and takes auxiliaryposition information as input to make the model see a full sentence and thusreducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on alarge-scale dataset (over 160GB text corpora) and fine-tune on a variety ofdown-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNetoutperforms MLM and PLM by a large margin, and achieves better results on thesetasks compared with previous state-of-the-art pre-trained methods (e.g., BERT,XLNet, RoBERTa) under the same model setting. The code and the pre-trainedmodels are available at: this https URL.	MNLI验证集-m/mm accuracy=83.3/82.6(见论文table 4), SQuAD 1.1验证集F1/EM=90.0/82.9, SQuAD 2.0验证集F1/EM=79.2/76.2(见论文table 5)	快速开始
7	MPNet: Masked and Permuted Pre-training for Language Understanding	Abstract BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. The code and the pre-trained models are available at: https://github.com/microsoft/MPNet.	QQP验证集accuracy=91.9(见论文Table 3), SQuAD 1.1 F1/EM (dev set)=92.7/86.9, SQuAD 2.0 F1/EM (dev set)=85.7/82.7(见论文Table 4)	快速开始
8	Reformer: The Efficient Transformer	Abstract Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(LlogL), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.	ReformerModel, ReformerForSequenceClassification和ReformerForQuestionAnswering网络前向推理输出与论文对齐	快速开始
9	SqueezeBERT: What can computer vision teach NLP about efficient neural networks?	Abstract Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The SqueezeBERT code will be released.	QQP验证集accuracy=89.4(见论文Table 2); SqueezeBERT模型加速比对比BERT-Base达到4.3x(见论文Table 2)	快速开始
10	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	Abstract Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.	GLUE dev set上达到平均指标85.97, CNNDM dev set上达到ROUGE-2=20.90(见论文Table 15)	快速开始
11	VisualBERT: A Simple and Performant Baseline for Vision and Language	Abstract We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.	VQA: Test-Dev=70.80, Test-Std=71.00; NLVR: accuracy=67.4(见论文Table1, Table3)	快速开始
12	ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information	Abstract Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntax and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the {\it glyph} and {\it pinyin} information of Chinese characters into language model pretraining. The glyph embedding is obtained based on different fonts of a Chinese character, being able to capture character semantics from the visual features, and the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields significant performance boost over baseline models with fewer training steps. The porpsoed model achieves new SOTA performances on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, sentence pair matching, and competitive performances in named entity recognition. Code and pretrained models are publicly available at https://github.com/ShannonAI/ChineseBert.	CMRC dev/test=70.70/78.05(见论文Table 2), XNLI dev/test=82.7/81.6(见论文Table 4), ChnSentiCorp dev/test=95.8/95.9	快速开始
13	CTRL: A Conditional Transformer Language Model for Controllable Generation	Abstract Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://github.com/salesforce/ctrl.	CTRLLMHeadModel模型和CTRLForSequenceClassification模型前向推理输出与论文对齐	快速开始
14	Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing	Abstract With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at https://github.com/laiguokun/Funnel-Transformer.	QNLI验证集上accuracy=95.1%(见论文table 3), SQuAD v1.1验证集上F1/EM=94.7/89.0, SQuAD v2.0验证集F1/EM=90.4/87.6(见论文table 5)	快速开始
15	Splinter: Few-Shot Question Answering by Pretraining Span Selection	Abstract In several question answering benchmarks, pretrained models have reached human parity through fine-tuning on an order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available, and observe that standard models perform poorly, highlighting the discrepancy between current pretraining objectives and question answering. We propose a new pretraining scheme tailored for question answering: recurring span selection. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting.	SQuAD 1.1验证集, 16 examples F1=54.6, 128 examples F1=72.7, 1024 Examples F1=82.8(见论文Table1)	快速开始
16	UNILMv1: Unified Language Model Pre-training for Natural Language Understanding and Generation	Abstract This paper presents a new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UniLM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.	QNLI测试集达到92.7(见论文table11), CoQA验证集F1=82.5(见论文 table 7)	快速开始
17	BERT for Joint Intent Classification and Slot Filling	Abstract Intent classification and slot filling are two essential tasks for natural language understanding. They often suffer from small-scale human-labeled training data, resulting in poor generalization capability, especially for rare words. Recently a new language representation model, BERT (Bidirectional Encoder Representations from Transformers), facilitates pre-training deep bidirectional representations on large-scale unlabeled corpora, and has created state-of-the-art models for a wide variety of natural language processing tasks after simple fine-tuning. However, there has not been much effort on exploring BERT for natural language understanding. In this work, we propose a joint intent classification and slot filling model based on BERT. Experimental results demonstrate that our proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on several public benchmark datasets, compared to the attention-based recurrent neural network models and slot-gated models.	在Snips测试集指标达到98.6, 97.0, 92.8; 在ATIS测试集上指标达到97.9, 96.0, 88.6 (见table2)	快速开始
18	Fastformer: Additive Attention Can Be All You Need	Abstract Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models and can meanwhile achieve comparable or even better long text modeling performance.	Amazon F1=43.23(见论文table4), Pubmed测试集R-L=34.81(见论文table6),	快速开始
19	Convolutional Sequence to Sequence Learning	Abstract The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks.1 Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT’14 English-German and WMT’14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.	在WMT’14 English-German测试集上BLEU=25.16 或者在WMT’16 English-Romanian测试集上BLEU=30.02	快速开始
20	FNet: Mixing Tokens with Fourier Transforms	Abstract We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.	QQP验证集Acc=85%, SST-2验证集Acc=95%(见论文Table1)	快速开始
21	Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism	Abstract Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).	megatron-bert-cased-345m模型在MNLI验证集上Acc=89.7/90.0; 2. megatron-bert-cased-345m模型在SQuAD-1.1验证集上F1/EM=94.2/88.0, 在SQuAD-2.0验证集上F1/EM=88.1/84.8	快速开始
22	LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention	Abstract Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.	Open Entity: F1=78.2, SQuAD1.1: F1=95.4, EM=90.2(见论文Table1 & Table5)	快速开始
23	RemBERT: Rethinking embedding coupling in pre-trained language models	Abstract We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.	XTREME: Sentence-pair Classification: Acc=84.2(见论文Table7)	快速开始
24	Nyströmformer: A Nystöm-based Algorithm for Approximating Self-Attention	Abstract Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences -- a topic being actively studied in the community. To address this limitation, we propose Nystr\"{o}mformer -- a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nystr\"{o}m method to approximate standard self-attention with O(n) complexity. The scalability of Nystr\"{o}mformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nystr\"{o}mformer performs comparably, or in a few cases, even slightly better, than standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nystr\"{o}mformer performs favorably relative to other efficient self-attention methods. Our code is available at https://github.com/mlpen/Nystromformer.	IMDB: F1=93.2, LRA benchmark下text任务 acc=65.52;	快速开始
25	ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training	Abstract This paper presents a new sequence-tosequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of optimizing one-stepahead prediction in the traditional sequenceto-sequence model, the ProphetNet is optimized by n-step ahead prediction that predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large-scale dataset (160GB), respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.	prophetnet-large-uncased模型在CNN/DailyMail测试集上R-1=44.20, R-2=21.17, R-L=41.30 (见论文Table 4); prophetnet-large-uncased模型在Gigaword测试集上R-1=39.51, R-2=20.42, R-L=36.69 (见论文Table 4);	快速开始
26	Universal Language Model Fine-tuning for Text Classification	Abstract Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18- 24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We opensource our pretrained models and code1 .	AG’s News: Err=5.01% (见论文Table 3)	快速开始
27	ByT5: Towards a token-free future with pre-trained byte-to-byte models	Abstract Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.	在GEM-Xsum验证集上, small model BLEU score=9.1; 在TweetQA验证集上, small model BLEU-1/ROUGE-L=65.7/69.7 (见论文table3)	快速开始

多模态

序号	论文名称(链接)	摘要	数据集	快速开始
1	Comprehensive Semi-Supervised Multi-Modal Learning	Abstract Multi-modal learning refers to the process of learning a precise model to represent the joint representations of different modalities. Despite its promise for multi-modal learning, the co-regularization method is based on the consistency principle with a sufficient assumption, which usually does not hold for real-world multi-modal data. Indeed, due to the modal insufficiency in real-world applications, there are divergences among heterogeneous modalities. This imposes a critical challenge for multi-modal learning. To this end, in this paper, we propose a novel Comprehensive Multi-Modal Learning (CMML) framework, which can strike a balance between the consistency and divergency modalities by considering the insufficiency in one unified framework. Specifically, we utilize an instance level attention mechanism to weight the sufficiency for each instance on different modalities. Moreover, novel diversity regularization and robust consistency metrics are designed for discovering insufficient modalities. Our empirical studies show the superior performances of CMML on real-world data in terms of various criteria.	Coverage: 2.669 Average Precision: 0.914 Ranking Loss: 0.058 Example AUC: 0.942 Micro AUC: 0.94 Macro AUC: 0.932	快速开始
2	Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks	Abstract We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, pro-cessing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.	RefCOCO+-val=72.34	快速开始
3	Attention on Attention for Image Captioning	Abstract Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which could make the decoder give misled results. In this paper, we propose an “Attention on Attention” (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries. AoA first generates an “information vector” and an “attention gate” using the attention result and the current context, then adds another attention by applying element-wise multiplication to them and finally obtains the “attended information”, the expected useful knowledge. We apply AoA to both the encoder and the decoder of our image captioning model, which we name as AoA Network (AoANet). Experiments show that AoANet outperforms all previously published methods and achieves a new state-ofthe-art performance of 129.8 CIDEr-D score on MS COCO “Karpathy” offline test split and 129.6 CIDEr-D (C40) score on the official online testing server. Code is available at https://github.com/husthuaan/AoANet.	COCO; {‘Bleu_1’: 0.8054903453672397, ‘Bleu_2’: 0.6523038976984842, ‘Bleu_3’: 0.5096621263772566, ‘Bleu_4’: 0.39140307771618477, ‘METEOR’: 0.29011216375635934, ‘ROUGE_L’: 0.5890369750273199, ‘CIDEr’: 1.2892294296245852, ‘SPICE’: 0.22680092759866174}	快速开始
4	VinVL: Revisiting Visual Representations in Vision-Language Models	Abstract This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.	COCO 2014; Oscar-Large: COCO-Text Retireval: Recall@1=89.8 Recall@5=98.8 Recall@10=99.7 COCO-Image Retireval: Recall@1=78.2 Recall@5=95.8 Recall@10=98.3	快速开始
5	From Recognition to Cognition: Visual Commonsense Reasoning	Abstract Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.	VQA val, Q->A 63.8%, QA->R: 67.2%, Q-AR: 43.1%	快速开始
6	Uniter: Learning universal image-text representations	Abstract Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2. Code is available at https://github.com/ChenRocks/UNITER.	IR-flickr30K-R1=73.66	快速开始
7	Efficient Low-rank Multimodal Fusion with Modality-Specific Factors	Abstract Multimodal research is an emerging field of artificial intelligence, and one of the main research problems in this field is multimodal fusion. The fusion of multimodal data is the process of integrating multiple unimodal representations into one compact multimodal representation. Previous research in this field has exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from exponential increase in dimensions and in computational complexity introduced by transformation of input into tensor. In this paper, we propose the Low-rank Multimodal Fusion method, which performs multimodal fusion using low-rank tensors to improve efficiency. We evaluate our model on three different tasks: multimodal sentiment analysis, speaker trait analysis, and emotion recognition. Our model achieves competitive results on all these tasks while drastically reducing computational complexity. Additional experiments also show that our model can perform robustly for a wide range of low-rank settings, and is indeed much more efficient in both training and inference compared to other methods that utilize tensor representations.	F1-Happy, 85.8%, F1-Sad 85.9%, F1-Angry 89.0%, F1-Neutral 71.7%	快速开始

科学计算

序号	论文名称(链接)	摘要	数据集	快速开始
1	Solving inverse problems using conditional invertible neural networks	Abstract Inverse modeling for computing a high-dimensional spatially-varying property field from indirect sparse and noisy observations is a challenging problem. This is due to the complex physical system of interest often expressed in the form of multiscale PDEs, the high-dimensionality of the spatial property of interest, and the incomplete and noisy nature of observations. To address these challenges, we develop a model that maps the given observations to the unknown input field in the form of a surrogate model. This inverse surrogate model will then allow us to estimate the unknown input field for any given sparse and noisy output observations. Here, the inverse mapping is limited to a broad prior distribution of the input field with which the surrogate model is trained. In this work, we construct a two- and three-dimensional inverse surrogate models consisting of an invertible and a conditional neural network trained in an end-to-end fashion with limited training data. The invertible network is developed using a flow-based generative model. The developed inverse surrogate model is then applied for an inversion task of a multiphase flow problem where given the pressure and saturation observations the aim is to recover a high-dimensional non-Gaussian permeability field where the two facies consist of heterogeneous permeability and varying length-scales. For both the two- and three-dimensional surrogate models, the predicted sample realizations of the non-Gaussian permeability field are diverse with the predictive mean being close to the ground truth even when the model is trained with limited data.	2D/3D模型下得到与Fig16和Fig17相吻合的结果	快速开始
2	Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems	Abstract Deep learning has been shown to be an effective tool in solving partial differential equations (PDEs) through physics-informed neural networks (PINNs). PINNs embed the PDE residual into the loss function of the neural network, and have been successfully employed to solve diverse forward and inverse PDE problems. However, one disadvantage of the first generation of PINNs is that they usually have limited accuracy even with many training points. Here, we propose a new method, gradient-enhanced physics-informed neural networks (gPINNs), for improving the accuracy and training efficiency of PINNs. gPINNs leverage gradient information of the PDE residual and embed the gradient into the loss function. We tested gPINNs extensively and demonstrated the effectiveness of gPINNs in both forward and inverse PDE problems. Our numerical results show that gPINN performs better than PINN with fewer training points. Furthermore, we combined gPINN with the method of residual-based adaptive refinement (RAR), a method for improving the distribution of training points adaptively during training, to further improve the performance of gPINN, especially in PDEs with solutions that have steep gradients.	完成论文3.2 得到和Fig2 Fig3 Fig3相吻合的结果，论文3.3 得到fig 6 fig7相吻合的结果，论文3.4.1 gPINN fig10 11	快速开始
3	DeepCFD: Efficient Steady-State Laminar Flow Approximation with Deep Convolutional Neural Networks	Abstract Computational Fluid Dynamics (CFD) simulation by the numerical solution of the Navier-Stokes equations is an essential tool in a wide range of applications from engineering design to climate modeling. However, the computational cost and memory demand required by CFD codes may become very high for flows of practical interest, such as in aerodynamic shape optimization. This expense is associated with the complexity of the fluid flow governing equations, which include non-linear partial derivative terms that are of difficult solution, leading to long computational times and limiting the number of hypotheses that can be tested during the process of iterative design. Therefore, we propose DeepCFD: a convolutional neural network (CNN) based model that efficiently approximates solutions for the problem of non-uniform steady laminar flows. The proposed model is able to learn complete solutions of the Navier-Stokes equations, for both velocity and pressure fields, directly from ground-truth data generated using a state-of-the-art CFD code. Using DeepCFD, we found a speedup of up to 3 orders of magnitude compared to the standard CFD approach at a cost of low error rates.	DeepCFD MSE (Ux= 0.773±0.0897，Uy=0.2153±0.0186，P=1.042±0.0431，Total 2.03±0.136）	快速开始
4	Unsupervised deep learning for super-resolution reconstruction of turbulence	Abstract Recent attempts to use deep learning for super-resolution reconstruction of turbulent flows have used supervised learning, which requires paired data for training. This limitation hinders more practical applications of super-resolution reconstruction. Therefore, we present an unsupervised learning model that adopts a cycle-consistent generative adversarial network (CycleGAN) that can be trained with unpaired turbulence data for super-resolution reconstruction. Our model is validated using three examples: (i) recovering the original flow field from filtered data using direct numerical simulation (DNS) of homogeneous isotropic turbulence; (ii) reconstructing full-resolution fields using partially measured data from the DNS of turbulent channel flows; and (iii) generating a DNS-resolution flow field from large-eddy simulation (LES) data for turbulent channel flows. In examples (i) and (ii), for which paired data are available for supervised learning, our unsupervised model demonstrates qualitatively and quantitatively similar performance as that of the best supervised learning model. More importantly, in example (iii), where supervised learning is impossible, our model successfully reconstructs the high-resolution flow field of statistical DNS quality from the LES data. Furthermore, we find that the present model has almost universal applicability to all values of Reynolds numbers within the tested range. This demonstrates that unsupervised learning of turbulence data is indeed possible, opening a new door for the wide application of super-resolution reconstruction of turbulent fields.	可复现三个example中的任意一个，若对于example3可得到论文中cycleGAN对应的结果（图11,12,13,14,15）	快速开始
5	Lettuce: PyTorch-based Lattice Boltzmann Framework	Abstract The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent studies have not addressed possible connections between machine learning and LBM. Here, we introduce Lettuce, a PyTorch-based LBM code with a threefold aim. Lettuce enables GPU accelerated calculations with minimal source code, facilitates rapid prototyping of LBM models, and enables integrating LBM simulations with PyTorch's deep learning and automatic differentiation facility. As a proof of concept for combining machine learning with the LBM, a neural collision model is developed, trained on a doubly periodic shear layer and then transferred to a different flow, a decaying turbulence. We also exemplify the added benefit of PyTorch's automatic differentiation framework in flow control and optimization. To this end, the spectrum of a forced isotropic turbulence is maintained without further constraining the velocity field. The source code is freely available from https://github.com/lettucecfd/lettuce.	得到图1配置的展示结果与图2的曲线吻合	快速开始
6	Systems Biology: Identifiability analysis and parameter identification via systems-biology informed neural networks	Abstract The dynamics of systems biological processes are usually modeled by a system of ordinary differential equations (ODEs) with many unknown parameters that need to be inferred from noisy and sparse measurements. Here, we introduce systems-biology informed neural networks for parameter estimation by incorporating the system of ODEs into the neural networks. To complete the workflow of system identification, we also describe structural and practical identifiability analysis to analyze the identifiability of parameters. We use the ultridian endocrine model for glucose-insulin interaction as the example to demonstrate all these methods and their implementation.	结果吻合Fig13	快速开始
7	TorchMD: A deep learning framework for molecular simulations	Abstract Molecular dynamics simulations provide a mechanistic description of molecules by relying on empirical potentials. The quality and transferability of such potentials can be improved leveraging data-driven models derived with machine learning approaches. Here, we present TorchMD, a framework for molecular simulations with mixed classical and machine learning potentials. All of force computations including bond, angle, dihedral, Lennard-Jones and Coulomb interactions are expressed as PyTorch arrays and operations. Moreover, TorchMD enables learning and simulating neural network potentials. We validate it using standard Amber all-atom simulations, learning an ab-initio potential, performing an end-to-end training and finally learning and simulating a coarse-grained model for protein folding. We believe that TorchMD provides a useful tool-set to support molecular simulations of machine learning potentials. Code and data are freely available at \url{github.com/torchmd}.	Di-alanine 688 8 min 44 s， Trypsin 3,248 13 min 2 s。	快速开始
8	PHYSICS-INFORMED NEURAL NETWORKS WITH HARD CONSTRAINTS FOR INVERSE DESIGN∗	Abstract We achieve the same objective as conventional PDE-constrained optimization methods based on adjoint methods and numerical PDE solvers, but find that the design obtained from hPINN is often simpler and smoother for problems whose solution is not unique.	得到图6的结果，（the PDE loss is below 10−4 (Fig. 6A), and the L2 relative error of \|E\|2 between hPINN and FDFD for the final ε is 1.2%）可以展示图7	快速开始

其他

序号	论文名称(链接)	摘要	数据集	快速开始
1	End to End Learning for Self-Driving Cars	Abstract Point cloud is an important type of geometric data structure. Due to itsirregular format, most researchers transform such data to regular 3D voxelgrids or collections of images. This, however, renders data unnecessarilyvoluminous and causes issues. In this paper, we design a novel type of neuralnetwork that directly consumes point clouds and well respects the permutationinvariance of points in the input. Our network, named PointNet, provides aunified architecture for applications ranging from object classification, partsegmentation, to scene semantic parsing. Though simple, PointNet is highlyefficient and effective. Empirically, it shows strong performance on par oreven better than state of the art. Theoretically, we provide analysis towardsunderstanding of what the network has learnt and why the network is robust withrespect to input perturbation and corruption.	能在模拟器上运行不偏离路面, 模拟器地址https: //github.com/udacity/self-driving-car-sim	快速开始
2	Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields	Abstract Top-down visual attention mechanisms have been used extensively in imagecaptioning and visual question answering (VQA) to enable deeper imageunderstanding through fine-grained analysis and even multiple steps ofreasoning. In this work, we propose a combined bottom-up and top-down attentionmechanism that enables attention to be calculated at the level of objects andother salient image regions. This is the natural basis for attention to beconsidered. Within our approach, the bottom-up mechanism (based on FasterR-CNN) proposes image regions, each with an associated feature vector, whilethe top-down mechanism determines feature weightings. Applying this approach toimage captioning, our results on the MSCOCO test server establish a newstate-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability ofthe method, applying the same approach to VQA we obtain first place in the 2017VQA Challenge.	full test set 75.6%	快速开始
3	Towards Deep Learning Models Resistant to Adversarial Attacks	Abstract Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.	PGD-steps100-restarts20-sourceA: 89.3%PGD-steps100-restarts20-sourceA: 95.7%PGD-steps40-restarts1-sourceB: 96.4%	快速开始
4	Stacked Hourglass Networks for Human Pose Estimation	Abstract This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.	MPII Human Pose Dataset, hourglass52, size: 384x384, mean@0.1: 0.366 size: 256x256, mean@0.1: 0.317	快速开始
5	Learning to See in the Dark	Abstract Imaging in low light is challenging due to low photon count and low SNR. Short-exposure images suffer from noise, while long exposure can induce blur and is often impractical. A variety of denoising, deblurring, and enhancement techniques have been proposed, but their effectiveness is limited in extreme conditions, such as video-rate imaging at night. To support the development of learning-based pipelines for low-light image processing, we introduce a dataset of raw short-exposure low-light images, with corresponding long-exposure reference images. Using the presented dataset, we develop a pipeline for processing low-light images, based on end-to-end training of a fully-convolutional network. The network operates directly on raw sensor data and replaces much of the traditional image processing pipeline, which tends to perform poorly on such data. We report promising results on the new dataset, analyze factors that affect performance, and highlight opportunities for future work. The results are shown in the supplementary video at https://youtu.be/qWKUFK7MWvg	PSNR/SSIM: 28.88/0.787	快速开始
6	Adversarial Autoencoders	Abstract In this paper, we propose the "adversarial autoencoder" (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.	MNIST, Log-likelihood(10K): 340±2	快速开始
7	Supervised Contrastive Learning	Abstract Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement, and reference TensorFlow code is released at https://t.ly/supcon.	基于Contrastive loss, CIFAR10数据集在ResNet-50上, Top-1 Acc=96%	快速开始
8	CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning	Abstract We present a simple, fully-convolutional model for real-time (>30 fps)instance segmentation that achieves competitive results on MS COCO evaluated ona single Titan Xp, which is significantly faster than any previousstate-of-the-art approach. Moreover, we obtain this result after training ononly one GPU. We accomplish this by breaking instance segmentation into twoparallel subtasks: (1) generating a set of prototype masks and (2) predictingper-instance mask coefficients. Then we produce instance masks by linearlycombining the prototypes with the mask coefficients. We find that because thisprocess doesn't depend on repooling, this approach produces very high-qualitymasks and exhibits temporal stability for free. Furthermore, we analyze theemergent behavior of our prototypes and show they learn to localize instanceson their own in a translation variant manner, despite beingfully-convolutional. We also propose Fast NMS, a drop-in 12 ms fasterreplacement for standard NMS that only has a marginal performance penalty.Finally, by incorporating deformable convolutions into the backbone network,optimizing the prediction head with better anchor scales and aspect ratios, andadding a novel fast mask re-scoring branch, our YOLACT++ model can achieve 34.1mAP on MS COCO at 33.5 fps, which is fairly close to the state-of-the-artapproaches while still running at real-time.	Resnet50 AUC: Atelectasis=0.707 Cardiomegaly=0.81 Effusion=0.73 Infiltration =0.61 Mass=0.56Nodule=0.71 Pneumonia=0.63 Pneumothorax=0.78	快速开始
9	TabNet: Attentive Interpretable Tabular Learning	Abstract Though tremendous strides have been made in uncontrolled face detection,accurate and efficient face localisation in the wild remains an open challenge.This paper presents a robust single-stage face detector, named RetinaFace,which performs pixel-wise face localisation on various scales of faces bytaking advantages of joint extra-supervised and self-supervised multi-tasklearning. Specifically, We make contributions in the following five aspects:(1) We manually annotate five facial landmarks on the WIDER FACE dataset andobserve significant improvement in hard face detection with the assistance ofthis extra supervision signal. (2) We further add a self-supervised meshdecoder branch for predicting a pixel-wise 3D shape face information inparallel with the existing supervised branches. (3) On the WIDER FACE hard testset, RetinaFace outperforms the state of the art average precision (AP) by 1.1%(achieving AP equal to 91.4%). (4) On the IJB-C test set, RetinaFace enablesstate of the art methods (ArcFace) to improve their results in faceverification (TAR=89.59% for FAR=1e-6). (5) By employing light-weight backbonenetworks, RetinaFace can run real-time on a single CPU core for aVGA-resolution image. Extra annotations and code have been made available at:this https URL.	Forest Cover Type: acc=96.99%	快速开始
10	Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps	Abstract Relational reasoning is a central component of generally intelligentbehavior, but has proven difficult for neural networks to learn. In this paperwe describe how to use Relation Networks (RNs) as a simple plug-and-play moduleto solve problems that fundamentally hinge on relational reasoning. We testedRN-augmented networks on three tasks: visual question answering using achallenging dataset called CLEVR, on which we achieve state-of-the-art,super-human performance; text-based question answering using the bAbI suite oftasks; and complex reasoning about dynamic physical systems. Then, using acurated dataset called Sort-of-CLEVR we show that powerful convolutionalnetworks do not have a general capacity to solve relational questions, but cangain this capacity when augmented with RNs. Our work shows how a deep learningarchitecture equipped with an RN module can implicitly discover and learn toreason about entities and their relations.	可视化方法	快速开始
11	A simple neural network module for relational reasoning	Abstract We introduce the variational graph auto-encoder (VGAE), a framework forunsupervised learning on graph-structured data based on the variationalauto-encoder (VAE). This model makes use of latent variables and is capable oflearning interpretable latent representations for undirected graphs. Wedemonstrate this model using a graph convolutional network (GCN) encoder and asimple inner product decoder. Our model achieves competitive results on a linkprediction task in citation networks. In contrast to most existing models forunsupervised learning on graph-structured data and link prediction, our modelcan naturally incorporate node features, which significantly improvespredictive performance on a number of benchmark datasets.	CLEVR: Acc = 95.5%	快速开始
12	Variational Graph Auto-Encoders	Abstract This paper presents a self-supervised framework for training interest pointdetectors and descriptors suitable for a large number of multiple-view geometryproblems in computer vision. As opposed to patch-based neural networks, ourfully-convolutional model operates on full-sized images and jointly computespixel-level interest point locations and associated descriptors in one forwardpass. We introduce Homographic Adaptation, a multi-scale, multi-homographyapproach for boosting interest point detection repeatability and performingcross-domain adaptation (e.g., synthetic-to-real). Our model, when trained onthe MS-COCO generic image dataset using Homographic Adaptation, is able torepeatedly detect a much richer set of interest points than the initialpre-adapted deep model and any other traditional corner detector. The finalsystem gives rise to state-of-the-art homography estimation results on HPatcheswhen compared to LIFT, SIFT and ORB.	CiteSeer AUC: 90.8, AP: 92	快速开始
13	SuperPoint: Self-Supervised Interest Point Detection and Description	Abstract The demand of applying semantic segmentation model on mobile devices has beenincreasing rapidly. Current state-of-the-art networks have enormous amount ofparameters hence unsuitable for mobile devices, while other small memoryfootprint models follow the spirit of classification network and ignore theinherent characteristic of semantic segmentation. To tackle this problem, wepropose a novel Context Guided Network (CGNet), which is a light-weight andefficient network for semantic segmentation. We first propose the ContextGuided (CG) block, which learns the joint feature of both local feature andsurrounding context, and further improves the joint feature with the globalcontext. Based on the CG block, we develop CGNet which captures contextualinformation in all stages of the network and is specially tailored forincreasing segmentation accuracy. CGNet is also elaborately designed to reducethe number of parameters and save memory footprint. Under an equivalent numberof parameters, the proposed CGNet significantly outperforms existingsegmentation networks. Extensive experiments on Cityscapes and CamVid datasetsverify the effectiveness of the proposed approach. Specifically, without anypost-processing and multi-scale testing, the proposed CGNet achieves 64.8% meanIoU on Cityscapes with less than 0.5 M parameters. The source code for thecomplete system can be found at this https URL.	MS-COCO 2014 HPatches Homography Estimation, e=1 0.460	快速开始
14	Relaxed Transformer Decoders for Direct Action Proposal Generation	Abstract Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. The existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-alike architecture. To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR). First, to deal with slowness prior in videos, we replace the original Transformer encoder with a boundary attentive module to better capture long-range temporal information. Second, due to the ambiguous temporal boundary and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criteria of single assignment to each groundtruth. Finally, we devise a three-branch head to further improve the proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net, on both tasks of temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at https://github.com/MCG-NJU/RTD-Action.	THUMOS14, AR@50=41.52	快速开始
15	Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting	Abstract Multi-horizon forecasting problems often contain a complex mix of inputs -- including static (i.e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically -- without any prior information on how they interact with the target. While several deep learning models have been proposed for multi-step prediction, they typically comprise black-box models which do not account for the full range of inputs present in common scenarios. In this paper, we introduce the Temporal Fusion Transformer (TFT) -- a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, the TFT utilizes recurrent layers for local processing and interpretable self-attention layers for learning long-term dependencies. The TFT also uses specialized components for the judicious selection of relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of regimes. On a variety of real-world datasets, we demonstrate significant performance improvements over existing benchmarks, and showcase three practical interpretability use-cases of TFT.	Dataset: Electricity P90 loss: 0.027	快速开始
16	Learning to Adapt Structured Output Space for Semantic Segmentation	Abstract Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor intensive, developing algorithms that can adapt source ground truth labels to the target domain is of great interest. In this paper, we propose an adversarial learning method for domain adaptation in the context of semantic segmentation. Considering semantic segmentations as structured outputs that contain spatial similarities between the source and target domains, we adopt adversarial learning in the output space. To further enhance the adapted model, we construct a multi-level adversarial network to effectively perform output space domain adaptation at different feature levels. Extensive experiments and ablation study are conducted under various domain adaptation settings, including synthetic-to-real and cross-city scenarios. We show that the proposed method performs favorably against the state-of-the-art methods in terms of accuracy and visual quality.	Cityscapes: resnet101, mIOU=42.4	快速开始
17	Unsupervised Monocular Depth Estimation with Left-Right Consistency	Abstract Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.	KiTTI: Abs Rel 0.130	快速开始
18	Digging Into Self-Supervised Monocular Depth Estimation	Abstract Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.	KiTTI: Abs Rel 0.106	快速开始
19	Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation	Abstract Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach.	IMDb测试集error rates=4.6%, TREC-6测试集error rates=3.6% , AG’s News测试集 error rates=5.01%(见论文Table 2 & Table 3)	快速开始
20	PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds	Abstract We introduce Position Adaptive Convolution (PAConv), a generic convolution operation for 3D point cloud processing. The key of PAConv is to construct the convolution kernel by dynamically assembling basic weight matrices stored in Weight Bank, where the coefficients of these weight matrices are self-adaptively learned from point positions through ScoreNet. In this way, the kernel is built in a data-driven manner, endowing PAConv with more flexibility than 2D convolutions to better handle the irregular and unordered point cloud data. Besides, the complexity of the learning process is reduced by combining weight matrices instead of brutally predicting kernels from point positions. Furthermore, different from the existing point convolution operators whose network architectures are often heavily engineered, we integrate our PAConv into classical MLP-based point cloud pipelines without changing network configurations. Even built on simple networks, our method still approaches or even surpasses the state-of-the-art models, and significantly improves baseline performance on both classification and segmentation tasks, yet with decent efficiency. Thorough ablation studies and visualizations are provided to understand PAConv. Code is released on https://github.com/CVMI-Lab/PAConv.	Classification accuracy (%) on ModelNet40：93.9	快速开始
21	Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery	Abstract Geospatial object segmentation, as a particular semantic segmentation task, always faces with larger-scale variation, larger intra-class variance of background, and foreground-background imbalance in the high spatial resolution (HSR) remote sensing imagery. However, general semantic segmentation methods mainly focus on scale variation in the natural scene, with inadequate consideration of the other two problems that usually happen in the large area earth observation scene. In this paper, we argue that the problems lie on the lack of foreground modeling and propose a foreground-aware relation network (FarSeg) from the perspectives of relation-based and optimization-based foreground modeling, to alleviate the above two problems. From perspective of relation, FarSeg enhances the discrimination of foreground features via foreground-correlated contexts associated by learning foreground-scene relation. Meanwhile, from perspective of optimization, a foreground-aware optimization is proposed to focus on foreground examples and hard examples of background during training for a balanced optimization. The experimental results obtained using a large scale dataset suggest that the proposed method is superior to the state-of-the-art general semantic segmentation methods and achieves a better trade-off between speed and accuracy. Code has been made available at: \url{this https URL}.	数据集 iSAID 验收指标：1. FarSeg ResNet-50 mIOU= 63.71% 参考论文Table.62. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS	快速开始
22	PYSKL: Towards Good Practices for Skeleton Action Recognition	Abstract We present PYSKL: an open-source toolbox for skeleton-based action recognition based on PyTorch. The toolbox supports a wide variety of skeleton action recognition algorithms, including approaches based on GCN and CNN. In contrast to existing open-source skeleton action recognition projects that include only one or two algorithms, PYSKL implements six different algorithms under a unified framework with both the latest and original good practices to ease the comparison of efficacy and efficiency. We also provide an original GCN-based skeleton action recognition model named ST-GCN++, which achieves competitive recognition performance without any complicated attention schemes, serving as a strong baseline. Meanwhile, PYSKL supports the training and testing of nine skeleton-based action recognition benchmarks and achieves state-of-the-art recognition performance on eight of them. To facilitate future research on skeleton action recognition, we also provide a large number of trained models and detailed benchmark results to give some insights. PYSKL is released at https://github.com/kennymckormick/pyskl and is actively maintained. We will update this report when we add new features or benchmarks. The current version corresponds to PYSKL v0.2.	1.joint top1=97.4 2.复现后合入PaddleVideo套件，并添加TIPC	快速开始
23	STANet for remote sensing image change detection	Abstract Remote sensing image change detection (CD) is done to identify desired significant changes between bitemporal images. Given two co-registered images taken at different times, the illumination variations and misregistration errors overwhelm the real object changes. Exploring the relationships among different spatial–temporal pixels may improve the performances of CD methods. In our work, we propose a novel Siamese-based spatial–temporal attention neural network. In contrast to previous methods that separately encode the bitemporal images without referring to any useful spatial–temporal dependency, we design a CD self-attention mechanism to model the spatial–temporal relationships. We integrate a new CD self-attention module in the procedure of feature extraction. Our self-attention module calculates the attention weights between any two pixels at different times and positions and uses them to generate more discriminative features. Considering that the object may have different scales, we partition the image into multi-scale subregions and introduce the self-attention in each subregion. In this way, we could capture spatial–temporal dependencies at various scales, thereby generating better representations to accommodate objects of various sizes. We also introduce a CD dataset LEVIR-CD, which is two orders of magnitude larger than other public datasets of this field. LEVIR-CD consists of a large set of bitemporal Google Earth images, with 637 image pairs (1024 _x0002_ 1024) and over 31 k independently labeled change instances. Our proposed attention module improves the F1-score of our baseline model from 83.9 to 87.3 with acceptable computational overhead. Experimental results on a public remote sensing image CD dataset show our method outperforms several other state-of-the-art methods.	数据集 LEVIR-CD验收指标：1. STANet-PAM F1=87.3% 参考论文 Table.42. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS	快速开始
24	SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images	Abstract Change detection is an important task in remote sensing (RS) image analysis. It is widely used in natural disaster monitoring and assessment, land resource planning, and other fields. As a pixel-to-pixel prediction task, change detection is sensitive about the utilization of the original position information. Recent change detection methods always focus on the extraction of deep change semantic feature, but ignore the importance of shallow-layer information containing high-resolution and fine-grained features, this often leads to the uncertainty of the pixels at the edge of the changed target and the determination miss of small targets. In this letter, we propose a densely connected siamese network for change detection, namely SNUNet-CD (the combination of Siamese network and NestedUNet). SNUNet-CD alleviates the loss of localization information in the deep layers of neural network through compact information transmission between encoder and decoder, and between decoder and decoder. In addition, Ensemble Channel Attention Module (ECAM) is proposed for deep supervision. Through ECAM, the most representative features of different semantic levels can be refined and used for the final classification. Experimental results show that our method improves greatly on many evaluation criteria and has a better tradeoff between accuracy and calculation amount than other state-of-the-art (SOTA) change detection methods.	数据集 CDD https://drive.google.com/file/d/1GX656JqqOyBi_Ef0w65kDGVto-nHrNs9/edit验收指标：1. SNUNet-c32 F1-Score=95.3% 参考https://paperswithcode.com/sota/change-detection-for-remote-sensing-images-on2. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS	快速开始
25	Remote Sensing Image Change Detection with Transformers	Abstract Modern change detection (CD) has achieved remarkable success by the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD remains challenging due to the complexity of objects in the scene. Objects with the same semantic concept may show distinct spectral characteristics at different times and spatial locations. Most recent CD pipelines using pure convolutions are still struggling to relate long-range concepts in space-time. Non-local self-attention approaches show promising performance via modeling dense relations among pixels, yet are computationally inefficient. Here, we propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain. Our intuition is that the high-level concepts of the change of interest can be represented by a few visual words, i.e., semantic tokens. To achieve this, we express the bitemporal image into a few tokens, and use a transformer encoder to model contexts in the compact token-based space-time. The learned context-rich tokens are then feedback to the pixel-space for refining the original features via a transformer decoder. We incorporate BIT in a deep feature differencing-based CD framework. Extensive experiments on three CD datasets demonstrate the effectiveness and efficiency of the proposed method. Notably, our BIT-based model significantly outperforms the purely convolutional baseline using only 3 times lower computational costs and model parameters. Based on a naive backbone (ResNet18) without sophisticated structures (e.g., FPN, UNet), our model surpasses several state-of-the-art CD methods, including better than four recent attention-based methods in terms of efficiency and accuracy. Our code is available at https://github.com/justchenhao/BIT\_CD.	数据集 LEVIR-CD验收指标：1. STANet-PAM F1 = 89.31% 参考论文 Table.12. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS	快速开始
26	Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition	Abstract In skeleton-based action recognition, graph convolutional networks (GCNs), which model the human body skeletons as spatiotemporal graphs, have achieved remarkable performance. However, in existing GCN-based methods, the topology of the graph is set manually, and it is fixed over all layers and input samples. This may not be optimal for the hierarchical GCN and diverse samples in action recognition tasks. In addition, the second-order information (the lengths and directions of bones) of the skeleton data, which is naturally more informative and discriminative for action recognition, is rarely investigated in existing methods. In this work, we propose a novel two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition. The topology of the graph in our model can be either uniformly or individually learned by the BP algorithm in an end-to-end manner. This data-driven method increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples. Moreover, a two-stream framework is proposed to model both the first-order and the second-order information simultaneously, which shows notable improvement for the recognition accuracy. Extensive experiments on the two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the state-of-the-art with a significant margin.	NTU-RGBD数据集, X-Sub=88.5%, X-view=95.1%	快速开始
27	MLDA-Net: Multi-Level Dual Attention-BasedNetwork for Self-Supervised MonocularDepth Estimation	Abstract The success of supervised learning-based single image depth estimation methods critically depends on the availability of large-scale dense per-pixel depth annotations, which requires both laborious and expensive annotation process. Therefore, the self-supervised methods are much desirable, which attract significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with shaper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy which can learn rich hierarchical representation. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to conduct effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.	KITTI， ResNet18，RMSE：4.690	快速开始
28	FFA-Net: Feature Fusion Attention Network for Single Image Dehazing	Abstract In this paper, we propose an end-to-end feature fusion at-tention network (FFA-Net) to directly restore the haze-free image. The FFA-Net architecture consists of three key components: 1) A novel Feature Attention (FA) module combines Channel Attention with Pixel Attention mechanism, considering that different channel-wise features contain totally different weighted information and haze distribution is uneven on the different image pixels. FA treats different features and pixels unequally, which provides additional flexibility in dealing with different types of information, expanding the representational ability of CNNs. 2) A basic block structure consists of Local Residual Learning and Feature Attention, Local Residual Learning allowing the less important information such as thin haze region or low-frequency to be bypassed through multiple local residual connections, let main network architecture focus on more effective information. 3) An Attention-based different levels Feature Fusion (FFA) structure, the feature weights are adaptively learned from the Feature Attention (FA) module, giving more weight to important features. This structure can also retain the information of shallow layers and pass it into deep layers. The experimental results demonstrate that our proposed FFA-Net surpasses previous state-of-the-art single image dehazing methods by a very large margin both quantitatively and qualitatively, boosting the best published PSNR metric from 30.23db to 36.39db on the SOTS indoor test dataset. Code has been made available at GitHub.	nan	快速开始
29	YOWO: You only watch once: A unified cnn architecture for real-time spatiotemporal action localization	Abstract Spatiotemporal action localization requires the incorporation of two sources of information into the designed architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract these information with separate networks and use an extra mechanism for fusion to get detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. The YOWO architecture is fast providing 34 frames-per-second on 16-frames input clips and 62 frames-per-second on 8-frames input clips, which is currently the fastest state-of-the-art architecture on spatiotemporal action localization task. Remarkably, YOWO outperforms the previous state-of-the art results on J-HMDB-21 and UCF101-24 with an impressive improvement of ~3% and ~12%, respectively. Moreover, YOWO is the first and only single-stage architecture that provides competitive results on AVA dataset. We make our code and pretrained models publicly available.	UCF101-24数据集，YOWO (16-frame)模型，frame-mAP under IoU threshold of 0.5=80.4	快速开始
30	Token Shift Transformer for Video Classification	Abstract Transformer achieves remarkable successes in understanding 1 and 2-dimensional signals (e.g., NLP and Image Content Understanding). As a potential alternative to convolutional neural networks, it shares merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing varying length inputs. However, its encoders naturally contain computational intensive operations such as pair-wise self-attention, incurring heavy computational burden when being applied on the complex 3-dimensional video signals. This paper presents Token Shift Module (i.e., TokShift), a novel, zero-parameter, zero-FLOPs operator, for modeling temporal relations within each transformer encoder. Specifically, the TokShift barely temporally shifts partial [Class] token features back-and-forth across adjacent frames. Then, we densely plug the module into each encoder of a plain 2D vision transformer for learning 3D video representation. It is worth noticing that our TokShift transformer is a pure convolutional-free video transformer pilot with computational efficiency for video understanding. Experiments on standard benchmarks verify its robustness, effectiveness, and efficiency. Particularly, with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision: 79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101 datasets, comparable or better than existing SOTA convolutional counterparts. Our code is open-sourced in: https://github.com/VideoNetworks/TokShift-Transformer.	UCF101数据集，无预训练模型条件下，8x256x256输入尺寸，Top1=91.60	快速开始
31	XLM: Cross-lingual Language Model Pretraining	Abstract This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.	XNLI测试集average accuracy=75.1%（见论文Table 1）	快速开始
32	A Closer Look at Few-shot Classification	Abstract Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the \miniI and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.	nan	快速开始
33	Matching Networks for One Shot Learning	Abstract Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.	omniglot k-way=5, n-shot=1, 精度98.1	快速开始
34	Few-Shot Learning with Graph Neural Networks	Abstract We propose to study the problem of few-shot learning with the prism of inference on a partially observed graphical model, constructed from a collection of input images whose label can be either observed or not. By assimilating generic message-passing inference algorithms with their neural-network counterparts, we define a graph neural network architecture that generalizes several of the recently proposed few-shot learning models. Besides providing improved numerical performance, our framework is easily extended to variants of few-shot learning, such as semi-supervised or active learning, demonstrating the ability of graph-based models to operate well on ‘relational’ tasks.	nan	快速开始
35	Exploring Simple Siamese Representation Learning	Abstract Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.	下面二者之一即可1. bs 256的情况下，top1-acc 68.3%2. bs512的情况下，top1-acc 68.1%	快速开始
36	Unsupervised Learning of Visual Features by Contrasting Cluster Assignments	Abstract Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.	swav 2x224 + 6x96 ImageNet-1k 100epoch linear top1 acc 72.1%	快速开始
37	Dense Contrastive Learning for Self-Supervised Visual Pre-Training	Abstract To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Code is available at: https://git.io/AdelaiDet	densecl_resnet50_8xb32-coslr-200epoch ImageNet-1k linear top1 acc 63.62%	快速开始
38	EfficientGCN: Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition	Abstract One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the complexity of the recent State-Of-The-Art (SOTA) models for this task tends to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures in large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on such the baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, termed EfficientGCN-Bx, where "x" denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch version and the pretrained models are available at https://github.com/yfsong0709/EfficientGCNv1.	NTU RGB+D 60数据集，EfficientGCN-B0模型，X-sub=90.2% X-view=94.9%	快速开始
39	CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation	Abstract Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.	CANINE-S模型，TYDI-QA Passage Selection Task上F1=66.0, TYDI-QA Minimal Answer Span Task上F1=52.5（见论文Table2）	快速开始
40	INFOXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training	Abstract In this work, we present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, inspired by the framework, we propose a new pre-training task based on contrastive learning. Specifically, we regard a bilingual sentence pair as two views of the same meaning and encourage their encoded representations to be more similar than the negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available at https://aka.ms/infoxlm.	Taboeba测试集 cross-lingual sentence retrival avg(xx to en; en to xx)分别达到77.8, 80.6(见论文table2); XNLI测试集 transfer gap score=10.3(见论文table 5)	快速开始

序号	论文名称(链接)	摘要	数据集	快速开始
1	Field-Embedded Factorization Machines for Click-through rate prediction	Abstract Click-through rate (CTR) prediction models are common in many online applications such as digital advertising and recommender systems. Field-Aware Factorization Machine (FFM) and Field-weighted Factorization Machine (FwFM) are state-of-the-art among the shallow models for CTR prediction. Recently, many deep learning-based models have also been proposed. Among deeper models, DeepFM, xDeepFM, AutoInt+, and FiBiNet are state-of-the-art models. The deeper models combine a core architectural component, which learns explicit feature interactions, with a deep neural network (DNN) component. We propose a novel shallow Field-Embedded Factorization Machine (FEFM) and its deep counterpart Deep Field-Embedded Factorization Machine (DeepFEFM). FEFM learns symmetric matrix embeddings for each field pair along with the usual single vector embeddings for each feature. FEFM has significantly lower model complexity than FFM and roughly the same complexity as FwFM. FEFM also has insightful mathematical properties about important fields and field interactions. DeepFEFM combines the FEFM interaction vectors learned by the FEFM component with a DNN and is thus able to learn higher order interactions. We conducted comprehensive experiments over a wide range of hyperparameters on two large publicly available real-world datasets. When comparing test AUC and log loss, the results show that FEFM and DeepFEFM outperform the existing state-of-the-art shallow and deep models for CTR prediction tasks. We have made the code of FEFM and DeepFEFM available in the DeepCTR library (https://github.com/shenweichen/DeepCTR).	criteo auc >0.8	快速开始
2	Deep Learning Recommendation Model for Personalization and Recommendation Systems	Abstract With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.	criteo auc > 0.79	快速开始
3	A Dual Input-aware Factorization Machine for CTR Prediction	Abstract Factorization Machines (FMs) refer to a class of general predictors working with real valued feature vectors, which are well-known for their ability to estimate model parameters under significant sparsity and have found successful applications in many areas such as the click-through rate (CTR) prediction. However, standard FMs only produce a single fixed representation for each feature across different input instances, which may limit the CTR model’s expressive and predictive power. Inspired by the success of Input-aware Factorization Machines (IFMs), which aim to learn more flexible and informative representations of a given feature according to different input instances, we propose a novel model named Dual Input-aware Factorization Machines (DIFMs) that can adaptively reweight the original feature representations at the bit-wise and vector-wise levels simultaneously. Furthermore, DIFMs strategically integrate various components including Multi-Head Self-Attention, Residual Networks and DNNs into a unified end-to-end model. Comprehensive experiments on two real-world CTR prediction datasets show that the DIFM model can outperform several state-of-the-art models consistently.	crito auc >0.799	快速开始
4	FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine	Abstract Click through rate (CTR) estimation is a fundamental task in personalized advertising and recommender systems. Recent years have witnessed the success of both the deep learning based model and attention mechanism in various tasks in computer vision (CV) and natural language processing (NLP). How to combine the attention mechanism with deep CTR model is a promising direction because it may ensemble the advantages of both sides. Although some CTR model such as Attentional Factorization Machine (AFM) has been proposed to model the weight of second order interaction features, we posit the evaluation of feature importance before explicit feature interaction procedure is also important for CTR prediction tasks because the model can learn to selectively highlight the informative features and suppress less useful ones if the task has many input features. In this paper, we propose a new neural CTR model named Field Attentive Deep Field-aware Factorization Machine (FAT-DeepFFM) by combining the Deep Field-aware Factorization Machine (DeepFFM) with Compose-Excitation network (CENet) field attention mechanism which is proposed by us as an enhanced version of Squeeze-Excitation network (SENet) to highlight the feature importance. We conduct extensive experiments on two real-world datasets and the experiment results show that FAT-DeepFFM achieves the best performance and obtains different improvements over the state-of-the-art methods. We also compare two kinds of attention mechanisms (attention before explicit feature interaction vs. attention after explicit feature interaction) and demonstrate that the former one outperforms the latter one significantly.	crito AUC>=0.8099	快速开始
5	BERT4Rec：Sequential Recommendation with Bidirectional Encoder Representations from Transformer	Abstract Top-$N$ sequential recommendation models each user as a sequence of itemsinteracted in the past and aims to predict top-$N$ ranked items that a userwill likely interact in a `near future'. The order of interaction implies thatsequential patterns play an important role where more recent items in asequence have a larger impact on the next item. In this paper, we propose aConvolutional Sequence Embedding Recommendation Model (\emph{Caser}) as asolution to address this requirement. The idea is to embed a sequence of recentitems into an `image' in the time and latent spaces and learn sequentialpatterns as local features of the image using convolutional filters. Thisapproach provides a unified and flexible network structure for capturing bothgeneral preferences and sequential patterns. The experiments on public datasetsdemonstrated that Caser consistently outperforms state-of-the-art sequentialrecommendation methods on a variety of common evaluation metrics.	1、Beauty HR@10=0.30252、Steam HR@10=0.40133、ML-1m HR@10=0.69704、ML-20m HR@10=0.7473	快速开始
6	Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding	Abstract Top-N sequential recommendation models each user as a sequence of items interacted in the past and aims to predict top-N ranked items that a user will likely interact in a `near future'. The order of interaction implies that sequential patterns play an important role where more recent items in a sequence have a larger impact on the next item. In this paper, we propose a Convolutional Sequence Embedding Recommendation Model (\emph{Caser}) as a solution to address this requirement. The idea is to embed a sequence of recent items into an `image' in the time and latent spaces and learn sequential patterns as local features of the image using convolutional filters. This approach provides a unified and flexible network structure for capturing both general preferences and sequential patterns. The experiments on public datasets demonstrated that Caser consistently outperforms state-of-the-art sequential recommendation methods on a variety of common evaluation metrics.	1、MovieLens MAP=0.15072、Gowalla MAP=0.09283、Foursquare MAP=0.09094、Tmall MAP=0.0310	快速开始
7	SASRec：Self-Attentive Sequential Recommendation	Abstract Sequential dynamics are a key feature of many modern recommender systems, which seek to capture the `context' of users' activities on the basis of actions they have performed recently. To capture such patterns, two approaches have proliferated: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). Markov Chains assume that a user's next action can be predicted on the basis of just their last (or last few) actions, while RNNs in principle allow for longer-term semantics to be uncovered. Generally speaking, MC-based methods perform best in extremely sparse datasets, where model parsimony is critical, while RNNs perform better in denser datasets where higher model complexity is affordable. The goal of our work is to balance these two goals, by proposing a self-attention based sequential model (SASRec) that allows us to capture long-term semantics (like an RNN), but, using an attention mechanism, makes its predictions based on relatively few actions (like an MC). At each time step, SASRec seeks to identify which items are `relevant' from a user's action history, and use them to predict the next item. Extensive empirical studies show that our method outperforms various state-of-the-art sequential models (including MC/CNN/RNN-based approaches) on both sparse and dense datasets. Moreover, the model is an order of magnitude more efficient than comparable CNN/RNN-based models. Visualizations on attention weights also show how our model adaptively handles datasets with various density, and uncovers meaningful patterns in activity sequences.	Hit Rate@10(Recall@10; Precision@10) andNDCG@10,	快速开始
8	Neural Collaborative Reasoning	Abstract Existing Collaborative Filtering (CF) methods are mostly designed based on the idea of matching, i.e., by learning user and item embeddings from data using shallow or deep models, they try to capture the associative relevance patterns in data, so that a user embedding can be matched with relevant item embeddings using designed or learned similarity functions. However, as a cognition rather than a perception intelligent task, recommendation requires not only the ability of pattern recognition and matching from data, but also the ability of cognitive reasoning in data. In this paper, we propose to advance Collaborative Filtering (CF) to Collaborative Reasoning (CR), which means that each user knows part of the reasoning space, and they collaborate for reasoning in the space to estimate preferences for each other. Technically, we propose a Neural Collaborative Reasoning (NCR) framework to bridge learning and reasoning. Specifically, we integrate the power of representation learning and logical reasoning, where representations capture similarity patterns in data from perceptual perspectives, and logic facilitates cognitive reasoning for informed decision making. An important challenge, however, is to bridge differentiable neural networks and symbolic reasoning in a shared architecture for optimization and inference. To solve the problem, we propose a modularized reasoning architecture, which learns logical operations such as AND (∧), OR (∨) and NOT (¬) as neural modules for implication reasoning (→). In this way, logical expressions can be equivalently organized as neural networks, so that logical reasoning and prediction can be conducted in a continuous space. Experiments on real-world datasets verified the advantages of our framework compared with both shallow, deep and reasoning models.	ML100K : HR@K >0.68	快速开始
9	TiSASRec: Time Interval Aware Self-Attention for Sequential Recommendation	Abstract Sequential recommender systems seek to exploit the order of users’ interactions, in order to predict their next action based on the context of what they have done recently. Traditionally, Markov Chains (MCs), and more recently Recurrent Neural Networks (RNNs) and Self Attention (SA) have proliferated due to their ability to capture the dynamics of sequential patterns. However a simplifying assumption made by most of these models is to regard interaction histories as ordered sequences, without regard for the time intervals between each interaction (i.e., they model the time-order but not the actual timestamp). In this paper, we seek to explicitly model the timestamps of interactions within a sequential modeling framework to explore the influence of different time intervals on next item prediction. We propose TiSASRec (Time Interval aware Self-attention based sequential recommendation), which models both the absolute positions of items as well as the time intervals between them in a sequence. Extensive empirical studies show the features of TiSASRec under different settings and compare the performance of self-attention with different positional encodings. Furthermore, experimental results show that our method outperforms various state-of-the-art sequential models on both sparse and dense datasets and different evaluation metrics.	ml-1m: NDCG@10: 0.5706, Hit@10: 0.8038	快速开始
10	Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation	Abstract To provide more accurate recommendation, it is a trending topic to go beyond modeling user-item interactions and take context features into account. Factorization Machines (FM) with negative sampling is a popular solution for context-aware recommendation. However, it is not robust as sampling may lost important information and usually leads to non-optimal performances in practical. Several recent e_x001d_orts have enhanced FM with deep learning architectures for modelling high-order feature interactions. While they either focus on rating prediction task only, or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic _x001e_uctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for context-aware recommendation. In this paper, we propose to learn FM without sampling for ranking tasks that helps context-aware recommendation particularly. Despite e_x001d_ectiveness, such a non-sampling strategy presents strong challenge in learning e_x001c_ciency of the model. Accordingly, we further design a new ideal framework named E_x001c_cient Non-Sampling Factorization Machines (ENSFM). ENSFM not only seamlessly connects the relationship between FM and Matrix Factorization (MF), but also resolves the challenging e_x001c_ciency issue via novel memorization strategies. Through extensive experiments on three realworld public datasets, we show that 1) the proposed ENSFM consistently and signi_x001b_cantly outperforms the state-of-the-art methods on context-aware Top-K recommendation, and 2) ENSFM achieves signi_x001b_cant advantages in training e_x001c_ciency, which makes it more applicable to real-world large-scale systems. Moreover, the empirical results indicate that a proper learning method is even more important than advanced neural network structures for Top-K recommendation task. Our implementation has been released 1 to facilitate further developments on e_x001c_cient non-sampling methods.	Movielens: HR@5: 0.0601, HR@10: 0.1024, HR@20: 0.1690 (论文table3)	快速开始
11	AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction	Abstract Learning feature interactions is crucial for click-through rate (CTR) prediction in recommender systems. In most existing deep learning models, feature interactions are either manually designed or simply enumerated. However, enumerating all feature interactions brings large memory and computation cost. Even worse, useless interactions may introduce noise and complicate the training process. In this work, we propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS). AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence. In the search stage, instead of searching over a discrete set of candidate feature interactions, we relax the choices to be continuous by introducing the architecture parameters. By implementing a regularized optimizer over the architecture parameters, the model can automatically identify and remove the redundant feature interactions during the training process of the model. In the re-train stage, we keep the architecture parameters serving as an attention unit to further boost the performance. Offline experiments on three large-scale datasets (two public benchmarks, one private) demonstrate that AutoFIS can significantly improve various FM based models. AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service, where a 10-day online A/B test demonstrated that AutoFIS improved the DeepFM model by 20.3% and 20.1% in terms of CTR and CVR respectively.	Criteo; (DeepFM)auc: 0.8009, logloss: 0.5404 (table1)	快速开始
12	Training Deep AutoEncoders for Collaborative Filtering	Abstract �is paper proposes a model for the rating prediction task in recommender systems which signi�cantly outperforms previous stateof-the art models on a time-split Net�ix data set. Our model is based on deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training. We empirically demonstrate that: a) deep autoencoder models generalize much be�er than the shallow ones, b) non-linear activation functions with negative parts are crucial for training deep models, and c) heavy use of regularization techniques such as dropout is necessary to prevent over��ing. We also propose a new training algorithm based on iterative output re-feeding to overcome natural sparseness of collaborate �ltering. �e new algorithm signi�cantly speeds up training and improves model performance. Our code is available at h�ps://github.com/NVIDIA/DeepRecommender.	RMSE: Netflix 3 months: 0.9373, Netflix 6 months : 0.9207, Netflix 1 year : 0.9225, Netflix full: 0.9099;	快速开始
13	DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning	Abstract The Mixture-of-experts (MoE) architecture is showing promising results in multitask learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable “sparse gate” to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: the first, continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. Our gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k in the context of MTL, on both synthetic and real datasets with up to 128 tasks. Our experiments indicate that MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance. Notably, on a real-world large-scale recommender system, DSelect-k achieves over 22% average improvement in predictive performance compared to the Top-k gate. We provide an open-source TensorFlow implementation of our gate1 .	MNIST: Accuracy1 92.56% Accuracy2: 90.98 %	快速开始
14	Self-Supervised Multi-Channel Hypergraph Convolutional Network for Social Recommendation	Abstract Social relations are often used to improve recommendation quality when user-item interaction data is sparse in recommender systems. Most existing social recommendation models exploit pairwise relations to mine potential user preferences. However, real-life interactions among users are very complicated and user relations can be high-order. Hypergraph provides a natural way to model complex high-order relations, while its potentials for improving social recommendation are under-explored. In this paper, we fill this gap and propose a multi-channel hypergraph convolutional network to enhance social recommendation by leveraging high-order user relations. Technically, each channel in the network encodes a hypergraph that depicts a common high-order user relation pattern via hypergraph convolution. By aggregating the embeddings learned through multiple channels, we obtain comprehensive user representations to generate recommendation results. However, the aggregation operation might also obscure the inherent characteristics of different types of high-order connectivity information. To compensate for the aggregating loss, we innovatively integrate self-supervised learning into the training of the hypergraph convolutional network to regain the connectivity information with hierarchical mutual information maximization. The experimental results on multiple real-world datasets show that the proposed model outperforms the SOTA methods, and the ablation study verifies the effectiveness of the multi-channel setting and the selfsupervised task. The implementation of our model is available via https://github.com/Coder-Yu/RecQ.	LastFM: Precision@10: 20.052, Recall@10: 20.375, NDCG@10: 24.395	快速开始
15	DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems	Abstract Learning effective feature crosses is the key behind building recommender systems. However, the sparse and large feature space requires exhaustive search to identify effective crosses. Deep & Cross Network (DCN) was proposed to automatically and efficiently learn bounded-degree predictive feature interactions. Unfortunately, in models that serve web-scale traffic with billions of training examples, DCN showed limited expressiveness in its cross network at learning more predictive feature interactions. Despite significant research progress made, many deep learning models in production still rely on traditional feed-forward neural networks to learn feature crosses inefficiently. In light of the pros/cons of DCN and existing feature interaction learning approaches, we propose an improved framework DCN-V2 to make DCN more practical in large-scale industrial settings. In a comprehensive experimental study with extensive hyper-parameter search and model tuning, we observed that DCN-V2 approaches outperform all the state-of-the-art algorithms on popular benchmark datasets. The improved DCN-V2 is more expressive yet remains cost efficient at feature interaction learning, especially when coupled with a mixture of low-rank architecture. DCN-V2 is simple, can be easily adopted as building blocks, and has delivered significant offline accuracy and online business metrics gains across many web-scale learning to rank systems at Google.	Logloss: 0.4406, AUC: 0.8115	快速开始
16	FLEN: Leveraging Field for Scalable CTR Prediction	Abstract Click-Through Rate (CTR) prediction systems are usually based on multi-field categorical features, i.e., every feature is categorical and belongs to one and only one field. Modeling feature conjunctions is crucial for CTR prediction accuracy. However, it usually requires a massive number of parameters to explicitly model all feature conjunctions, which is not scalable for real-world production systems. In this paper, we describe a novel Field-Leveraged Embedding Network (FLEN) which has been deployed in the commercial recommender systems in Meitu and serves the main traffic. FLEN devises a field-wise bi-interaction pooling technique. By suitably exploiting field information, the field-wise bi-interaction pooling layer captures both inter-field and intra-field feature conjunctions with a small number of model parameters and an acceptable time complexity for industrial applications. We show that some classic shallow CTR models can be regarded as special cases of this technique, i.e., MF, FM and FwFM. We identify a unique challenge in this technique, i.e., the FM module in our model may suffer from the coupled gradient issue, which will damage the performance of the model. To solve this challenge, we develop Dicefactor: a novel dropout method to prevent independent latent features from co-adapting. Extensive experiments, including offline evaluations and online A/B testing on real production systems, demonstrate the effectiveness and efficiency of FLEN against the state-of-the-art models. In particular, compared to the previous version deployed on the system (i.e. NFM), FLEN has obtained 5.19% improvement on CTR with 1/6 of memory usage and computation time.	AUC: 0.7519, Logloss: 0.3944;	快速开始
17	Deep Position-wise Interaction Network For CTR Prediction	Abstract Click-through rate (CTR) prediction plays an important role in online advertising and recommender systems. In practice, the training of CTR models depends on click data which is intrinsically biased towards higher positions since higher position has higher CTR by nature. Existing methods such as actual position training with fixed position inference and inverse propensity weighted training with no position inference alleviate the bias problem to some extend. However, the different treatment of position information between training and inference will inevitably lead to inconsistency and sub-optimal online performance. Meanwhile, the basic assumption of these methods, i.e., the click probability is the product of examination probability and relevance probability, is oversimplified and insufficient to model the rich interaction between position and other information. In this paper, we propose a Deep Position-wise Interaction Network (DPIN) to efficiently combine all candidate items and positions for estimating CTR at each position, achieving consistency between offline and online as well as modeling the deep non-linear interaction among position, user, context and item under the limit of serving performance. Following our new treatment to the position bias in CTR prediction, we propose a new evaluation metrics named PAUC (position-wise AUC) that is suitable for measuring the ranking quality at a given position. Through extensive experiments on a real world dataset, we show empirically that our method is both effective and efficient in solving position bias problem. We have also deployed our method in production and observed statistically significant improvement over a highly optimized baseline in a rigorous A/B test.	按照论文数据，预计以DIN模型作为对比，AUC获得性能提升	快速开始
18	AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks	Abstract Click-through rate (CTR) prediction, which aims to predict the probability of a user clicking on an ad or an item, is critical to many online applications such as online advertising and recommender systems. The problem is very challenging since (1) the input features (e.g., the user id, user age, item id, item category) are usually sparse and high-dimensional, and (2) an effective prediction relies on high-order combinatorial features (\textit{a.k.a.} cross features), which are very time-consuming to hand-craft by domain experts and are impossible to be enumerated. Therefore, there have been efforts in finding low-dimensional representations of the sparse and high-dimensional raw features and their meaningful combinations. In this paper, we propose an effective and efficient method called the \emph{AutoInt} to automatically learn the high-order feature interactions of input features. Our proposed algorithm is very general, which can be applied to both numerical and categorical input features. Specifically, we map both the numerical and categorical features into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled. The whole model can be efficiently fit on large-scale raw data in an end-to-end fashion. Experimental results on four real-world datasets show that our proposed approach not only outperforms existing state-of-the-art approaches for prediction but also offers good explainability. Code is available at: \url{https://github.com/DeepGraphLearning/RecommenderSystems}.	验收标准：1.按照论文数据Criteo，AUC 80.61% 2.复现后合入PaddleRec套件，并添加TIPC	快速开始
19	Personalized News Recommendation with Knowledge-aware Interactive Matching	Abstract The most important task in personalized news recommendation is accurate matching between candidate news and user interest. Most of existing news recommendation methods model candidate news from its textual content and user interest from their clicked news in an independent way. However, a news article may cover multiple aspects and entities, and a user usually has different kinds of interest. Independent modeling of candidate news and user interest may lead to inferior matching between news and users. In this paper, we propose a knowledge-aware interactive matching method for news recommendation. Our method interactively models candidate news and user interest to facilitate their accurate matching. We design a knowledge-aware news co-encoder to interactively learn representations for both clicked news and candidate news by capturing their relatedness in both semantic and entities with the help of knowledge graphs. We also design a user-news co-encoder to learn candidate news-aware user interest representation and user-aware candidate news representation for better interest matching. Experiments on two real-world datasets validate that our method can effectively improve the performance of news recommendation.	AUC 67.13 ,MRR 32.08,NDCG@5 35.49, NDCG@10 41.79	快速开始
20	Package Recommendation with Intra- and Inter-Package Attention Networks	Abstract With the booming of online social networks in the mobile internet, an emerging recommendation scenario has played a vital role in information acquisition for user, where users are no longer recommended with a single item or item list, but a combination of heterogeneous and diverse objects (called a package, e.g., a package including news, publisher, and friends viewing the news). Different from the conventional recommendation where users are recommended with the item itself, in package recommendation, users would show great interests on the explicitly displayed objects that could have a significant influence on the user behaviors. However, to the best of our knowledge, few effort has been made for package recommendation and existing approaches can hardly model the complex interactions of diverse objects in a package. Thus, in this paper, we make a first study on package recommendation and propose an Intra- and inter-package attention network for Package Recommendation (IPRec). Specifically, for package modeling, an intra-package attention network is put forward to capture the object-level intention of user interacting with the package, while an inter-package attention network acts as a package-level information encoder that captures collaborative features of neighboring packages. In addition, to capture users preference representation, we present a user preference learner equipped with a fine-grained feature aggregation network and coarse-grained package aggregation network. Extensive experiments on three real-world datasets demonstrate that IPRec significantly outperforms the state of the arts. Moreover, the model analysis demonstrates the interpretability of our IPRec and the characteristics of user behaviors. Codes and datasets can be obtained at https://github.com/LeeChenChen/IPRec.	论文搜集的真实数据集，3-dayAUC：0.66915-dayAUC：0.6754， 10-dayAUC：0.6853	快速开始
21	Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising	Abstract In most real-world large-scale online applications (e.g., e-commerce or finance), customer acquisition is usually a multi-step conversion process of audiences. For example, an impression->click->purchase process is usually performed of audiences for e-commerce platforms. However, it is more difficult to acquire customers in financial advertising (e.g., credit card advertising) than in traditional advertising. On the one hand, the audience multi-step conversion path is longer. On the other hand, the positive feedback is sparser (class imbalance) step by step, and it is difficult to obtain the final positive feedback due to the delayed feedback of activation. Multi-task learning is a typical solution in this direction. While considerable multi-task efforts have been made in this direction, a long-standing challenge is how to explicitly model the long-path sequential dependence among audience multi-step conversions for improving the end-to-end conversion. In this paper, we propose an Adaptive Information Transfer Multi-task (AITM) framework, which models the sequential dependence among audience multi-step conversions via the Adaptive Information Transfer (AIT) module. The AIT module can adaptively learn what and how much information to transfer for different conversion stages. Besides, by combining the Behavioral Expectation Calibrator in the loss function, the AITM framework can yield more accurate end-to-end conversion identification. The proposed framework is deployed in Meituan app, which utilizes it to real-timely show a banner to the audience with a high end-to-end conversion rate for Meituan Co-Branded Credit Cards. Offline experimental results on both industrial and public real-world datasets clearly demonstrate that the proposed framework achieves significantly better performance compared with state-of-the-art baselines.	AUC：0.6043 purchase AUC：0.6525	快速开始
22	Detecting Beneficial Feature Interactions for Recommender Systems	Abstract Feature interactions are essential for achieving high accuracy in recommender systems. Many studies take into account the interaction between every pair of features. However, this is suboptimal because some feature interactions may not be that relevant to the recommendation result, and taking them into account may introduce noise and decrease recommendation accuracy. To make the best out of feature interactions, we propose a graph neural network approach to effectively model them, together with a novel technique to automatically detect those feature interactions that are beneficial in terms of recommendation accuracy. The automatic feature interaction detection is achieved via edge prediction with an L0 activation regularization. Our proposed model is proved to be effective through the information bottleneck principle and statistical interaction theory. Experimental results show that our model (i) outperforms existing baselines in terms of accuracy, and (ii) automatically identifies beneficial feature interactions.	Movielens；AUC：0.9407，ACC：0.8970	快速开始
23	Deep Session Interest Network for Click-Through Rate Prediction	Abstract Easy-to-use,Modular and Extendible package of deep-learning based CTR models.DeepFM,DeepInterestNetwork(DIN),DeepInterestEvolutionNetwork(DIEN),DeepCrossNetwork(DCN),AttentionalFactorizationMachine(AFM),Neural Factorization Machine(NFM),AutoInt,Deep Session Interest Network(DSIN)	advertising-challenge-datase logloss; AUC > 0.63	快速开始
24	Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising	Abstract In recommender systems and advertising platforms, marketers always want to deliver products, contents, or advertisements to potential audiences over media channels such as display, video, or social. Given a set of audiences or customers (seed users), the audience expansion technique (look-alike modeling) is a promising solution to identify more potential audiences, who are similar to the seed users and likely to finish the business goal of the target campaign. However, look-alike modeling faces two challenges: (1) In practice, a company could run hundreds of marketing campaigns to promote various contents within completely different categories every day, e.g., sports, politics, society. Thus, it is difficult to utilize a common method to expand audiences for all campaigns. (2) The seed set of a certain campaign could only cover limited users. Therefore, a customized approach based on such a seed set is likely to be overfitting. In this paper, to address these challenges, we propose a novel two-stage framework named Meta Hybrid Experts and Critics (MetaHeac) which has been deployed in WeChat Look-alike System. In the offline stage, a general model which can capture the relationships among various tasks is trained from a meta-learning perspective on all existing campaign tasks. In the online stage, for a new campaign, a customized model is learned with the given seed set based on the general model. According to both offline and online experiments, the proposed MetaHeac shows superior effectiveness for both content marketing campaigns in recommender systems and advertising campaigns in advertising platforms. Besides, MetaHeac has been successfully deployed in WeChat for the promotion of both contents and advertisements, leading to great improvement in the quality of marketing. The code has been available at \url{https://github.com/easezyc/MetaHeac}.	AUC>=0.7239	快速开始
25	Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction	Abstract Easy-to-use, Modular and Extendible package of deep-learning based CTR models. DeepFM, DeepInterestNetwork(DIN), DeepInterestEvolutionNetwork(DIEN), DeepCrossNetwork(DCN), AttentionalFactorizationMachine(AFM), Neural Factorization Machine(NFM), AutoInt, Deep Session Interest Network(DSIN)	AUC:80.22% ,Log Loss:0.5388	快速开始

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

社区模型库

图像分类

目标检测

图像分割

OCR

图像生成

图像修复

异常检测

人脸识别

行为识别

自然语言处理

多模态

科学计算

推荐系统

其他

Files

README.md

Latest commit

History

README.md

File metadata and controls

社区模型库

图像分类

目标检测

图像分割

OCR

图像生成

图像修复

异常检测

人脸识别

行为识别

自然语言处理

多模态

科学计算

推荐系统

其他