Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Severity distribution before and after LLM-based interpretation (Windows App SDK, C/C++)

| Severity Level | Findings (SA) | Findings (LLM-Based) | Main Vulnerability Categories | Static Analysis Tools |
|---|---|---|---|---|
| 5 (Critical) | 1 | 0 | Privilege escalation (baseline highest-risk item) | AppScan Static Analyzer [51] |
| 4 (High) | 1 | 2 | Command injection; reclassified critical item (context-limited) | AppScan Static Analyzer [51] |
| 3 (Medium) | 11 | 7 | Improper resource access control; permission/validation warnings | Flawfinder; AppScan Static Analyzer [47,51] |
| 2 (Low) | 42 | 28 | Information exposure; input validation; dependency integrity; API pattern alerts | AppScan Static Analyzer; Fluid Attacks; Cppcheck; RATS [48-51] |

Effectiveness of LLM-based alert consolidation across static analysis tools

| SA Tool | Raw Alerts | Unique Code Locations | LLM-Refined Findings | Alert Reduction (%) |
|---|---|---|---|---|
| Flawfinder | 8 | 6 | 3 | 62.5% |
| RATS | 7 | 6 | 3 | 57.1% |
| Cppcheck | 6 | 6 | 0 | 100.0% |
| Fluid Attacks | 2 | 2 | 2 | 0.0% |
| AppScan Static Analyzer | 1 | 1 | 1 | 0.0% |
| Total | 24 | 21 | 9 | 62.5% |
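
The percentages in the last column can be reproduced with a short calculation. The sketch below assumes that "Alert Reduction" means the share of raw alerts removed by LLM refinement, i.e. (raw − refined) / raw. That definition is inferred from the table values and is not stated explicitly in the source. The names `ALERTS` and `alert_reduction` are illustrative only.

```python
# Per-tool counts copied from the table: (raw alerts, LLM-refined findings).
ALERTS = {
    "Flawfinder": (8, 3),
    "RATS": (7, 3),
    "Cppcheck": (6, 0),
    "Fluid Attacks": (2, 2),
    "AppScan Static Analyzer": (1, 1),
}


def alert_reduction(raw: int, refined: int) -> float:
    """Assumed definition: percentage of raw alerts removed by refinement."""
    if raw <= 0:
        raise ValueError("raw alert count must be positive")
    return 100.0 * (raw - refined) / raw


# Per-tool reduction, rounded to one decimal place as in the table.
per_tool = {
    tool: round(alert_reduction(raw, refined), 1)
    for tool, (raw, refined) in ALERTS.items()
}

# The "Total" row pools the counts first, then applies the same formula.
total_raw = sum(raw for raw, _ in ALERTS.values())
total_refined = sum(refined for _, refined in ALERTS.values())
overall = round(alert_reduction(total_raw, total_refined), 1)

for tool, pct in per_tool.items():
    print(f"{tool}: {pct}%")
print(f"Total: {total_raw} -> {total_refined} ({overall}%)")
```

Under this assumption the overall figure is computed from the pooled counts (24 raw alerts, 9 refined findings) and is not the average of the per-tool percentages, which would weight small tools such as AppScan Static Analyzer as heavily as Flawfinder.
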

Project details

| Metric | ID | Metric Value |
|---|---|---|
| Application Name | AN | Windows App SDK 1.6.2 |
| Review Date | RD | December 12, 2025 |
| Objective | OBJ | Security Code Review |
| Number of Lines (LOC) | LOC | 167,894 |
| Code Review Mode | CRM | Static |

Comparison of research work on bug detection

| Ref. | Semantic Reasoning | Explainability | Hybrid Static + AI | Failure-Mode Analysis | Evaluation |
|---|---|---|---|---|---|
| [37] | ✓ | × | ✓ (LLM) | ✓ (shallow reasoning) | High FDR (>50%) |
| [38] | × | × | ✓ (LLM) | ✓ (industrial errors) | FP reduced by ≈ 94–98% |
| [39] | × | × | × | ✓ (count bias) | F1 ≈ 0.97; Recall < 30% |
| [40] | × | × | × | × | Accuracy ≈ 0.87; F1 ≈ 0.86 |
| [41] | ✓ | × | × | × | Accuracy ≈ 0.86; F1 ≈ 0.85 |
| [42] | × | × | × | × | Accuracy ≈ 0.90; F1 ≈ 0.91 |
| [43] | ✓ | × | × | × | Accuracy ≈ 0.87 |
| [44] | × | ✓ | × | × | Accuracy ≈ 91.8% |
| Our Proposed Model | ✓ | ✓ | ✓ (LLM) | ✓ (tool disagreement) | Alert reduction 62.5% |