Langevin Resonance Transformer for Cross-Script Writer Identification
Abstract
Vision Transformers (ViTs) have emerged as a powerful architecture for various computer vision tasks, including writer identification. However, like their CNN counterparts, they are susceptible to performance degradation in cross-script scenarios because their standard self-attention mechanism learns script-specific visual correlations. To overcome this, we propose a novel architecture, the Langevin Resonance Transformer (LRT). The LRT fundamentally redefines self-attention by replacing the abstract mathematical operation with a physically grounded dynamic simulation in which each image patch is treated as a particle. The core of the LRT is a novel Langevin Attention Layer, where the interaction between pairs of particles is governed by a learnable potential energy function. The net force on each particle is aggregated from all other particles, and its state is evolved according to the Langevin equation, which models motion under both deterministic and stochastic forces. The LRT thus treats a writer's style as a physical system defined by an energy landscape. Because the LRT models writing biomechanics rather than script-specific visual shapes, the learned representation is largely script-invariant. We evaluate the LRT on the BRS-ID dataset with a custom augmentation strategy. The results show that the LRT achieves higher accuracy than standard Vision Transformers and other state-of-the-art models on the cross-script identification task.
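The attention mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: the learnable potential is stood in for by a fixed quadratic pair potential U(x_i, x_j) = ½k‖x_i − x_j‖², and the state update follows the overdamped Langevin equation, combining the deterministic force with Gaussian thermal noise. The function name, parameters, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def langevin_attention_step(x, k=0.1, dt=0.01, temperature=0.05, rng=None):
    """One overdamped Langevin update over patch states.

    x : (n_patches, dim) array; each row is one patch treated as a particle.
    NOTE: the quadratic pair potential below is an illustrative stand-in
    for the paper's learnable potential energy function.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise displacements x_i - x_j, shape (n, n, dim)
    diffs = x[:, None, :] - x[None, :, :]
    # Net deterministic force on particle i: F_i = -sum_j dU/dx_i = -k * sum_j (x_i - x_j)
    force = -k * diffs.sum(axis=1)
    # Langevin equation: drift from the force plus stochastic thermal kicks
    noise = rng.standard_normal(x.shape)
    return x + dt * force + np.sqrt(2.0 * temperature * dt) * noise
```

With an attractive quadratic potential and low temperature, repeated steps pull the particles toward their common mean, i.e. the system relaxes toward a low-energy configuration; a learned potential would instead shape writer-specific equilibria.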
© 2026 Sk Golam Sarowar Hossain, Mridul Ghosh, Tonmoy Mete, Mária Ždímalová, Kaushik Roy, Sk Md Obaidullah, published by Slovak Academy of Sciences, Mathematical Institute
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.