Conditional Random Fields

DOTA: Deformable Optimized Transformer Architecture for End-to-End Text Recognition with Retrieval-Augmented Generation
Text recognition in natural images remains a highly complex yet vital challenge, with wide-ranging applications in computer vision and natural language processing. In this paper, we present a novel end-to-end framework that integrates ResNet and Vision Transformer (ViT) backbones with cutting-edge techniques such as Deformable Convolutions, Retrieval-Augmented Generation, and Conditional Random Fields (CRF). These innovations work together to significantly improve feature representation and Optical Character Recognition (OCR) performance. By replacing the standard convolution layers in the third and fourth blocks with Deformable Convolutions, the framework adapts more flexibly to complex text layouts, while adaptive dropout helps prevent overfitting and enhance generalization. Moreover, incorporating CRFs refines the sequence modeling for more accurate text recognition. Extensive experiments on six benchmark datasets—IC13, IC15, SVT, IIIT5K, SVTP, and CUTE80—demonstrate the framework’s exceptional performance, achieving remarkable accuracies of 97.32% on IC13, 58.26% on IC15, 88.10% on SVT, 74.13% on IIIT5K, 82.17% on SVTP, and 66.67% on CUTE80, yielding an average accuracy of 77.77%. These results establish a new state-of-the-art for text recognition, showing the robustness of our approach across diverse and challenging datasets. Our method represents a significant leap forward in OCR technology, addressing challenges in recognizing text with various distortions, fonts, and orientations. The framework has proven not only effective in controlled conditions but also adaptable to more complex, real-world scenarios. The code for this framework is available at https://github.com/kaopanboonyuen/DOTA.
Road segmentation of remotely-sensed images using deep convolutional neural networks with landscape metrics and conditional random fields
Semantic segmentation of remotely-sensed aerial (or very-high resolution, VHS) images and satellite (or high-resolution, HR) images has numerous application domains, particularly in road extraction, where the segmented objects serve as essential layers in geospatial databases. Despite several efforts to use deep convolutional neural networks (DCNNs) for road extraction from remote sensing images, accuracy remains a challenge. This paper introduces an enhanced DCNN framework specifically designed for road extraction from remote sensing images by incorporating landscape metrics (LMs) and conditional random fields (CRFs). Our framework employs the exponential linear unit (ELU) activation function to improve the DCNN, leading to a higher quantity and more accurate road extraction. Additionally, to minimize false classifications of road objects, we propose a solution based on the integration of LMs. To further refine the extracted roads, a CRF method is incorporated into our framework. Experiments conducted on Massachusetts road aerial imagery and Thailand Earth Observation System (THEOS) satellite imagery datasets demonstrated that our proposed framework outperforms SegNet, a state-of-the-art object segmentation technique, in most cases regarding precision, recall, and F1 score across various types of remote sensing imagery.