
Efficient Hierarchical Temporal Audio-Video Cross-Attention Fusion Network for Audio-Enhanced Text-To-Video Retrieval

By: R. Rashmi and  H. K. Chethan  
Open Access
|Apr 2026

Figures & Tables

Figure 1:

Diagram for the proposed method.

Figure 2:

Flow diagram for the Video Swin feature pyramid video encoder. FPN, feature pyramid network.

Figure 3:

Flowchart for final summation output in multimodal systems.

Figure 4:

Heat map of the proposed model.

Figure 5:

Model loss of the proposed model.

Figure 6:

Accuracy of the proposed model.

Figure 7:

Recall of the proposed model.

Figure 8:

Precision of the proposed model.

Figure 9:

F1 score of the proposed model.

Figure 10:

Feature values representation of DTW. DTW, dynamic time warping.
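Figure 10 reports feature values obtained with dynamic time warping. As background for readers unfamiliar with the technique, the following is a minimal sketch of the classic DTW distance between two 1-D feature sequences; it is illustrative only and is not the paper's implementation.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences.

    cost[i][j] holds the minimal accumulated alignment cost of matching
    the first i elements of `a` with the first j elements of `b`.
    """
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between elements
            # extend the cheapest of the three admissible warping moves
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[-1][-1]
```

Identical sequences yield a distance of zero; the distance grows with the residual mismatch that warping cannot absorb.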

Figure 11:

Feature values representation of FPN. FPN, feature pyramid network.

Figure 12:

Recall score (@1, @5, @10) for the proposed model.

Figure 13:

(A, B, C): Recall comparison of the proposed model with the existing models at R@1, R@5, and R@10.

Figure 14:

MdR comparison of the proposed model. MdR, median rank.

Figure 15:

MnR comparison of the proposed model. MnR, mean rank.

Figure 16:

RMSE comparison for the proposed model. RMSE, root mean square error.

Figure 17:

MAE comparison of the proposed model. MAE, mean absolute error.

Comparative performance analysis of the proposed model along with the existing methods

| Model | R@1 (%) | R@5 (%) | R@10 (%) | MdR | MnR | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| X-Pool | 45.0 | 55.0 | 60.0 | 8 | 50 | 2.5 | 1.75 |
| CAMoE + DSL | 50.0 | 58.0 | 62.0 | | 50 | 3.0 | 1.75 |
| TEFAL [31] | 55.0 | 65.0 | 68.0 | 7 | 40 | 2.0 | 1.25 |
| TS2-Net [31] | 8 | | | | | | |
| Proposed model | 80.0 | 82.0 | 80.0 | 5 | 5 | 0.5 | 0.25 |
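The retrieval metrics in the table above are standard for text-to-video retrieval: R@K is the percentage of queries whose correct video appears in the top K results, MdR is the median rank of the correct video, and MnR its mean rank. As a reference sketch (not the authors' evaluation code), these can be computed from the 1-based rank of the correct video for each query:

```python
from statistics import mean, median

def retrieval_metrics(ranks):
    """Compute R@K (percent), median rank (MdR), and mean rank (MnR)
    from a list of 1-based ranks of the ground-truth video per query."""
    def recall_at(k):
        # fraction of queries whose correct video ranks within the top k
        return 100.0 * sum(r <= k for r in ranks) / len(ranks)
    return {
        "R@1": recall_at(1),
        "R@5": recall_at(5),
        "R@10": recall_at(10),
        "MdR": median(ranks),
        "MnR": mean(ranks),
    }
```

Higher R@K indicates better retrieval, while lower MdR and MnR are better, which matches the direction of the comparison in the table.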
Language: English
Submitted on: Jul 25, 2025
Published on: Apr 7, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 R. Rashmi, H. K. Chethan, published by Macquarie University, Australia
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.