Evidence-Grounded Decision Support for Aircraft Line Maintenance Using Conformal Prediction and Retrieval-Augmented NLP from Technical Log Records

Open Access | Jun 2026

Figures & Tables

Fig. 1.

Post hoc confusion matrix aggregated to major ATA chapter groups for the temporal test split.

Fig. 2.

Retrieval quality metrics on the temporal test set: Recall@K (K = 1, 3, 5, 10) together with Mean Reciprocal Rank (MRR). Relevance is defined as JASC-code label consistency between the retrieved case and the query.
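
The two retrieval metrics in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that Recall@K is scored per query as a hit if at least one of the top K retrieved cases shares the query's JASC code, and that MRR uses the rank of the first such case. The function names and the toy labels are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(query_label: str, retrieved_labels: Sequence[str], k: int) -> float:
    """Return 1.0 if any of the top-k retrieved cases has the query's label."""
    return float(query_label in retrieved_labels[:k])


def reciprocal_rank(query_label: str, retrieved_labels: Sequence[str]) -> float:
    """Return 1/rank of the first label-consistent case, or 0.0 if none is found."""
    for rank, label in enumerate(retrieved_labels, start=1):
        if label == query_label:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    queries: List[Tuple[str, Sequence[str]]], ks: Sequence[int] = (1, 3, 5, 10)
) -> Dict[str, float]:
    """Average Recall@K and MRR over (query_label, ranked_retrieved_labels) pairs."""
    n = len(queries)
    metrics = {
        f"recall@{k}": sum(recall_at_k(q, r, k) for q, r in queries) / n for k in ks
    }
    metrics["mrr"] = sum(reciprocal_rank(q, r) for q, r in queries) / n
    return metrics


# Toy example with made-up labels (not data from the study).
toy_queries = [
    ("3350", ["5240", "3350", "3442"]),  # first match at rank 2
    ("2410", ["2410"]),                  # first match at rank 1
    ("7200", ["3350", "5240"]),          # no label-consistent case retrieved
]
toy_metrics = evaluate_retrieval(toy_queries)
```

Under this hit-based definition, Recall@K cannot decrease as K grows, which matches the monotonic curve described for Figure 3.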

Fig. 3.

Recall@K curve for the temporal test set, showing the monotonic gain in recall as K increases from 1 to 10, with MRR = 0.47 annotated for reference. Relevance is defined as in Figure 2.

Fig. 4.

Selective prediction trade-off on the temporal test set at α = 0.10. Error rate on accepted top-1 predictions is plotted against abstention rate, parameterized by the maximum acceptable prediction-set size m (labeled at each point as s ≤ m; see equation (4) in Section 3.9). Stricter thresholds (smaller m) move the curve toward higher abstention and lower accepted-case error.
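
The trade-off plotted here can be illustrated with a short sketch. It assumes that a case is accepted when its conformal prediction-set size s satisfies s ≤ m and is otherwise routed to review, which is how the caption describes the threshold. The function name and the toy inputs are hypothetical, and equation (4) of the paper is not reproduced here.

```python
import math
from typing import Sequence, Tuple


def selective_tradeoff(
    set_sizes: Sequence[int], top1_correct: Sequence[bool], m: int
) -> Tuple[float, float]:
    """Return (abstention rate, error rate among accepted cases) for threshold m.

    A case is accepted when its prediction-set size is at most m.
    """
    accepted = [ok for size, ok in zip(set_sizes, top1_correct) if size <= m]
    abstention_rate = 1.0 - len(accepted) / len(set_sizes)
    if not accepted:
        # No accepted cases: the accepted-case error is undefined.
        return abstention_rate, math.nan
    error_rate = 1.0 - sum(accepted) / len(accepted)
    return abstention_rate, error_rate


# Toy inputs (not data from the study).
toy_sizes = [1, 3, 72, 8, 15]
toy_correct = [True, True, False, False, True]
toy_abstain, toy_error = selective_tradeoff(toy_sizes, toy_correct, m=10)
```

Sweeping m over a range of values and plotting the resulting pairs would trace a curve of the kind shown in the figure.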

Fig. 5.

Ablation summary showing incremental contributions of the classifier (top-1 accuracy), shortlist output (top-3 accuracy), retrieval evidence (Recall@10), and conformal uncertainty layer (coverage@90%), with abstention@Smax=10 and error among accepted cases reported for the conformal setting.

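Included records by year

| Year | Included records | Unique aircraft | Unique make-model | Region codes |
|---|---|---|---|---|
| 2021 | 1,500 | 811 | 185 | 9 |
| 2022 | 1,500 | 831 | 164 | 9 |
| 2023 | 1,430 | 848 | 176 | 8 |
| Total | 4,430 | 2,204 | 290 | 10 |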

Split counts and held-out station/fleet composition

Panel A. Temporal split (chronological 70/10/20 by Difficulty Date).

| Metric | All data | Train | Calibration | Test |
|---|---|---|---|---|
| Difficulty Date range | 01 Jan 2021 – 22 Dec 2023 | 01 Jan 2021 – 14 Jan 2023 | 15 Jan 2023 – 24 Apr 2023 | 24 Apr 2023 – 22 Dec 2023 |
| Events (unique SDRs) | 4,430 | 3,101 (70.0%) | 443 (10.0%) | 886 (20.0%) |
| Stations (Receiving Region Code), n | 10 | 10 | 8 | 8 |
| Fleets (make-model), n | 290 | 245 | 79 | 143 |
| Aircraft (RegistryNumber), n | 2,204 | 1,605 | 263 | 552 |
| Operators (Operator Designator), n | 97 | 90 | 34 | 41 |
| JASC classes (JASC Code), n | 306 | 279 | 95 | 160 |
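
A chronological 70/10/20 split of this kind can be sketched as below. This is an assumption about the mechanics, not the authors' code. Records are sorted by date and cut by rank, and the function name and the `date_key` parameter are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    records: Sequence[T],
    date_key: Callable[[T], object],
    train_frac: float = 0.70,
    cal_frac: float = 0.10,
) -> Tuple[List[T], List[T], List[T]]:
    """Sort records by date and cut them into train / calibration / test by rank."""
    ordered = sorted(records, key=date_key)
    n = len(ordered)
    n_train = round(n * train_frac)
    n_cal = round(n * cal_frac)
    train = ordered[:n_train]
    calibration = ordered[n_train:n_train + n_cal]
    test = ordered[n_train + n_cal:]  # the remainder, about 20% by default
    return train, calibration, test


# Integers stand in for dated records here; 4,430 is the corpus size in the table.
demo_train, demo_cal, demo_test = chronological_split(list(range(4430)), lambda r: r)
```

With 4,430 records this rank-based cut yields 3,101 / 443 / 886 events, the same counts as Panel A. A rank-based cut can also place records that share a date on both sides of a boundary, which would be consistent with 24 Apr 2023 appearing in both the calibration and test date ranges.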

Panel B. Station hold-out split (region generalization test).

| Metric | Train stations (non-held-out) | Held-out stations (S1-S2) |
|---|---|---|
| Stations (ReceivingRegionCode), n | 8 | 2 |
| Events (unique SDRs) | 1,766 | 2,664 (60.14%) |
| Fleets (make-model), n | 177 | 192 |
| Fleet overlap with training (n) | | 79 |
| Fleets exclusive to held-out (OOD), n | | 113 |
| Aircraft (RegistryNumber), n | 935 | 1,312 |
| Operators (Operator Designator), n | 64 | 49 |
| ATA/JASC classes (JASC Code), n | 249 | 226 |

Missingness summary for the core modeling schema fields defined in Table 3, in the analytic corpus (n = 4,430)

| Field | Missing (n) | Missing (%) |
|---|---|---|
| Difficulty Date | 0 | 0 |
| Submission Date | 0 | 0 |
| Receiving Region Code | 0 | 0 |
| Registry Number | 12 | 0.271 |
| Aircraft Make | 3 | 0.068 |
| Aircraft Model | 5 | 0.113 |
| Discrepancy | 0 | 0 |
| JASC Code | 0 | 0 |

Overall dataset summary (SDRS exports, 2021-2023)

| Item | Value |
|---|---|
| Study period (Difficulty Date) | 01 Jan 2021 – 22 Dec 2023 |
| Records extracted (raw rows across files) | 4,929 |
| Records included after cleaning (unique SDRs with required fields) | 4,430 |
| Excluded (duplicates / missing critical fields / missing narrative) | 499 |
| Unique SDR identifiers (Operator Control Number) | 4,430 |
| Unique aircraft (RegistryNumber) | 2,204 |
| Unique operators (OperatorDesignator) | 97 |
| Location/station proxy available | Receiving Region Code |
| Unique region codes (Receiving Region Code) | 10 |
| Narrative availability (Discrepancy present) | 4,430 / 4,430 (100%) |
| Label availability (JASC Code present) | 4,430 / 4,430 (100%) |

Robustness under dataset shift on the station-comparable temporal test subset: top-1 accuracy with approximate 95% confidence intervals, and descriptive F1 metrics for non-held-out stations versus station-held-out regions

| Split | n | Top-1 accuracy (95% CI) | Macro-F1 | Weighted-F1 |
|---|---|---|---|---|
| Time-held-out (temporal test, non-held-out stations) | 337 | 0.504 (0.451-0.557) | 0.263 | 0.468 |
| Station-held-out (regions S1 and S2) | 473 | 0.452 (0.408-0.497) | 0.185 | 0.421 |
| Temporal test (all stations, label-in-distribution; station-comparable subset) | 810 | 0.474 (0.440-0.508) | 0.209 | 0.437 |
| Absolute difference (non-held-out - station-held-out) | | 0.052 (-0.018 to 0.122) | 0.078 | 0.047 |
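
The "approximate 95% confidence intervals" in this table are numerically consistent with a normal-approximation (Wald) interval. The sketch below assumes that method, which the table itself does not name. The function names are hypothetical, and the inputs are the rounded values from the table.

```python
import math
from typing import Tuple


def wald_ci(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion p observed on n cases."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


def wald_diff_ci(
    p1: float, n1: int, p2: float, n2: int, z: float = 1.96
) -> Tuple[float, float]:
    """Normal-approximation interval for p1 - p2 from two independent samples."""
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Time-held-out row: accuracy 0.504 on n = 337 gives roughly (0.451, 0.557).
time_ci = wald_ci(0.504, 337)

# Difference between the two groups: roughly (-0.018, 0.122).
diff_ci = wald_diff_ci(0.504, 337, 0.452, 473)
```

Small discrepancies in the third decimal place can arise because the accuracies used here are already rounded. The difference interval includes zero, so the 0.052 gap between the groups is not clearly distinguishable from sampling noise at this sample size.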

Core data fields, operational meaning, and modeling role in the recommended SDRS schema

| Field | Example / Format | Role in analysis | Notes |
|---|---|---|---|
| Aircraft/Tail (hashed) | a3f9… | Entity linkage | Enables repeat-defect and within-aircraft history without revealing identity |
| Timestamp | ISO datetime; optional bins (e.g., day/week/shift) | Temporal ordering, shift tests | Binning used when required for privacy or stability |
| Station | ICAO/IATA code | Site effects, station-shift evaluation | Supports cross-station robustness testing |
| Write-up text | Free text | Primary NLP input | Defect narrative; cleaned/normalized for abbreviations where feasible |
| Action taken text (if available) | Free text | Evidence/outcome context | Useful for retrieval, resolution summarization, and traceability |
| JASC code (supervised label) | ATA chapter/subchapter | Supervised label | Target for triage classification when present and reliable |
| Deferral indicators (if present) | MEL/CDL flag; deferral code | Operational constraint marker | Captures deferral behavior relevant to dispatch and risk |
| Closure status/time (if present) | Open/closed; minutes/hours | Outcome/efficiency metric | Supports time-to-close variability and operational impact analyses |

Example of the proposed decision-support output for a de-identified temporal test case

| Output element | Content |
|---|---|
| Current write-up | Emergency exit light inoperative at R1 door; removed and replaced emergency exit battery pack M1675/STA 322R with a serviceable unit in accordance with B737-800 AMM 33-51-06; test satisfactory. |
| Top-3 JASC candidates | 1. JASC 3350; 2. JASC 5240; 3. JASC 3442 |
| Retrieved precedents | Case 1: R1 door emergency exit light inoperative; battery pack replaced; operational check satisfactory. Case 2: Aft left emergency exit lights inoperative; battery pack replaced; checks satisfactory. Case 3: Emergency exit light assembly near R1 door removed and replaced; test satisfactory. Case 4: Inoperative emergency exit battery pack replaced; test satisfactory. Case 5: Similar emergency exit battery pack fault corrected by replacement, followed by a satisfactory test. |
| Conformal output | Prediction-set size = 72; status = review required. |
| Recommended workflow action | Review required – the shortlist and retrieved precedents are coherent, but predictive uncertainty remains too high for silent autosuggestion. |
| Final human disposition | Human reviewer confirms JASC 3350 and records battery-pack replacement with satisfactory operational test. |

Dataset scale, JASC label imbalance, and split composition (SDRS, 2021-2023)

| Quantity | All data | Train | Calibration | Test |
|---|---|---|---|---|
| Study period (Difficulty Date) | 01 Jan 2021 – 22 Dec 2023 | 01 Jan 2021 – 14 Jan 2023 | 15 Jan 2023 – 24 Apr 2023 | 24 Apr 2023 – 22 Dec 2023 |
| Events (unique SDRs) | 4,430 | 3,101 (70.0%) | 443 (10.0%) | 886 (20.0%) |
| Unique aircraft (Registry Number) | 2,204 | 1,605 | 263 | 552 |
| Unique operators (Operator Designator) | 97 | 90 | 34 | 41 |
| Unique fleets (make-model) (both fields present) | 290 | 245 | 79 | 143 |
| Stations (Receiving Region Code) | 10 | 10 | 8 | 8 |
| Distinct JASC classes (JASC Code) | 306 | 279 | 95 | 160 |
| Majority class count | 582 | 421 | 53 | 108 |
| Majority class share (%) | 13.14% | 13.58% | 11.96% | 12.19% |
| Top-10 classes cumulative share (%) | 43.68% | 43.28% | 53.72% | 44.58% |
| Minority class count (min frequency) | 1 | 1 | 1 | 1 |
| Majority/minority ratio (max/min) | 582:1 | 421:1 | 53:1 | 108:1 |

Ablation summary showing the incremental contribution of each pipeline module on the temporal test set

| Configuration | Primary metric | Value |
|---|---|---|
| Classifier only | Top-1 accuracy | 0.509 |
| Classifier + shortlist | Top-3 accuracy | 0.683 |
| Retrieval evidence | Recall@10 | 0.688 |
| Conformal + abstention | Coverage@90% | 0.919 |

Prediction performance on the temporal test set after label-based in-distribution filtering (labels observed in training)

| Model | Evaluated temporal test subset | Top-1 accuracy | Top-3 accuracy | Macro-F1 | Weighted-F1 | Excluded test cases with OOD labels (n) |
|---|---|---|---|---|---|---|
| TF-IDF + Linear SVM (baseline) | 848/886 (95.7%) | 0.5094 | 0.6828 | 0.2386 | 0.4768 | 38 |

Fleet composition (top 10 make-model by count)

| Rank | Fleet (Aircraft Make + Aircraft Model) | Count |
|---|---|---|
| 1 | BOEING 7377H4 | 226 |
| 2 | AIRBUS A320232 | 209 |
| 3 | DOUG MD11F | 166 |
| 4 | EMB ERJ170200LR | 151 |
| 5 | AIRBUS A321231 | 134 |
| 6 | CNDAIR CL6002D24 | 128 |
| 7 | CNDAIR CL6002C10 | 124 |
| 8 | BOEING 737823 | 120 |
| 9 | BOEING 737890 | 109 |
| 10 | BOEING 737 | 106 |

Field/label availability (after cleaning)

| Field | SDRS column | Availability in the included set |
|---|---|---|
| Unique control # (dedup/audit trail) | OperatorControlNumber | 100% |
| Event/occurrence date | DifficultyDate | 100% |
| Report/submission date | SubmissionDate | 100% |
| Aircraft ID | RegistryNumber | High (used for unique count) |
| Fleet composition | AircraftMake, AircraftModel | High |
| Location/station proxy | ReceivingRegionCode | High |
| Defect narrative (NLP input) | Discrepancy | 100% |
| System label (JASC code) | JASCCode | 100% |

Conformal prediction set efficiency on the temporal test set: empirical coverage and prediction-set size summary across target coverage levels (α)

| Target coverage | α | q̂ | Empirical coverage | Average set size | Median set size | 25th percentile set size | 75th percentile set size |
|---|---|---|---|---|---|---|---|
| 0.80 | 0.20 | 0.993285 | 0.828033 | 9.016490 | 8 | 4 | 13 |
| 0.85 | 0.15 | 0.995652 | 0.870436 | 15.93168 | 12 | 6 | 24 |
| 0.90 | 0.10 | 0.997414 | 0.923439 | 40.87986 | 23 | 10 | 57 |
| 0.92 | 0.08 | 0.997839 | 0.943463 | 60.44405 | 30 | 13 | 85 |
| 0.95 | 0.05 | 0.998647 | 0.971731 | 122.2839 | 83 | 22 | 245 |
| 0.98 | 0.02 | 0.999292 | 0.990577 | 196.5524 | 266 | 91 | 277 |
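
For readers unfamiliar with how such prediction sets are built, a generic split-conformal sketch follows. It is not confirmed to be the authors' exact procedure. It assumes a nonconformity score of one minus the predicted probability of a class, a threshold (called `q_hat` here) taken as a finite-sample-corrected quantile of calibration scores, and hypothetical function names and toy probabilities.

```python
import math
from typing import Dict, Sequence, Set


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    cal_scores holds one score per calibration case, assumed here to be
    1 - (predicted probability of the true class).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed order statistic
    if rank > n:
        # Too few calibration cases for this alpha: every class must be included.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(class_probs: Dict[str, float], q_hat: float) -> Set[str]:
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return {label for label, p in class_probs.items() if 1.0 - p <= q_hat}


# Toy calibration scores and class probabilities (not data from the study).
toy_cal_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8]
toy_q_hat = conformal_threshold(toy_cal_scores, alpha=0.25)
toy_set = prediction_set({"3350": 0.60, "5240": 0.25, "3442": 0.15}, toy_q_hat)
```

Lowering α raises the threshold, so sets can only grow. This matches the pattern in the table, where average set size rises from about 9 at α = 0.20 to about 197 at α = 0.02.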

Dataset-shift diagnostics for robustness evaluation: narrative length and similarity-to-training for time-held-out versus station-held-out groups

| Group | n | Median tokens in narrative | Median max similarity to training |
|---|---|---|---|
| Time-held-out (non-held-out stations) | 337 | 36 | 0.325201 |
| Station-held-out (S1 and S2) | 473 | 29 | 0.275161 |

Language: English
Page range: 53 - 85
Submitted on: Jan 23, 2026
Accepted on: Mar 16, 2026
Published on: Jun 17, 2026
In partnership with: Paradigm Publishing Services

© 2026 Arthur Dela Peña, Jefferson Clariza, Mary Ann Aballiar-Vista, published by ŁUKASIEWICZ RESEARCH NETWORK – INSTITUTE OF AVIATION
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.