
Efficient Hierarchical Temporal Audio-Video Cross-Attention Fusion Network for Audio-Enhanced Text-To-Video Retrieval

By: R. Rashmi and  H. K. Chethan  
Open Access
|Apr 2026

Figures & Tables

Figure 1:

Diagram for the proposed method.

Figure 2:

Flow diagram for the Video Swin feature pyramid video encoder. FPN, feature pyramid network.

Figure 3:

Flowchart for final summation output in multimodal systems.

Figure 4:

Heat map of the proposed model.

Figure 5:

Model loss of the proposed model.

Figure 6:

Accuracy of the proposed model.

Figure 7:

Recall of the proposed model.

Figure 8:

Precision of the proposed model.

Figure 9:

F1 score of the proposed model.

Figure 10:

Feature values representation of DTW. DTW, dynamic time warping.
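Figure 10 reports feature values obtained with dynamic time warping. As background for readers unfamiliar with the technique, the following is a minimal sketch of the classic DTW distance between two 1-D feature sequences; it is illustrative only and is not the paper's implementation.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences.

    cost[i][j] holds the minimal accumulated alignment cost of matching
    the first i elements of `a` with the first j elements of `b`.
    """
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between elements
            # extend the cheapest of the three admissible warping moves
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[-1][-1]
```

Identical sequences yield a distance of zero; the distance grows with the residual mismatch that warping cannot absorb.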

Figure 11:

Feature values representation of FPN. FPN, feature pyramid network.

Figure 12:

Recall score (@1, @5, @10) for the proposed model.

Figure 13:

(A, B, C): Recall comparison of the proposed model with the existing models at R@1, R@5, and R@10.

Figure 14:

MdR comparison of the proposed model. MdR, median rank.

Figure 15:

MnR comparison of the proposed model. MnR, mean rank.

Figure 16:

RMSE comparison for the proposed model. RMSE, root mean square error.

Figure 17:

MAE comparison of the proposed model. MAE, mean absolute error.

Comparative performance analysis of the proposed model along with the existing methods

| Model | R@1 (%) | R@5 (%) | R@10 (%) | MdR | MnR | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| X-Pool | 45.0 | 55.0 | 60.0 | 8 | 50 | 2.5 | 1.75 |
| CAMoE + DSL | 50.0 | 58.0 | 62.0 | | 50 | 3.0 | 1.75 |
| TEFAL [31] | 55.0 | 65.0 | 68.0 | 7 | 40 | 2.0 | 1.25 |
| TS2-Net [31] | 8 | | | | | | |
| Proposed model | 80.0 | 82.0 | 80.0 | 5 | 5 | 0.5 | 0.25 |
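The retrieval metrics in the table above are standard for text-to-video retrieval: R@K is the percentage of queries whose correct video appears in the top K results, MdR is the median rank of the correct video, and MnR its mean rank. As a reference sketch (not the authors' evaluation code), these can be computed from the 1-based rank of the correct video for each query:

```python
from statistics import mean, median

def retrieval_metrics(ranks):
    """Compute R@K (percent), median rank (MdR), and mean rank (MnR)
    from a list of 1-based ranks of the ground-truth video per query."""
    def recall_at(k):
        # fraction of queries whose correct video ranks within the top k
        return 100.0 * sum(r <= k for r in ranks) / len(ranks)
    return {
        "R@1": recall_at(1),
        "R@5": recall_at(5),
        "R@10": recall_at(10),
        "MdR": median(ranks),
        "MnR": mean(ranks),
    }
```

Higher R@K indicates better retrieval, while lower MdR and MnR are better, which matches the direction of the comparison in the table.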
Language: English
Submitted on: Jul 25, 2025
Published on: Apr 7, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 R. Rashmi, H. K. Chethan, published by Macquarie University, Australia
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.