Evidence-Grounded Decision Support for Aircraft Line Maintenance Using Conformal Prediction and Retrieval-Augmented NLP from Technical Log Records

Arthur Dela Peña; Jefferson Clariza; Mary Ann Aballiar-Vista

doi:10.2478/tar-2026-0009

Transactions on Aerospace Research

Volume 2026 (2026): Issue 2 (June 2026)

By: Arthur Dela Peña, Jefferson Clariza and Mary Ann Aballiar-Vista

Open Access

Jun 2026

Figures & Tables

Fig. 1. Post hoc confusion matrix aggregated to major ATA chapter groups for the temporal test split.

Fig. 2. Retrieval quality metrics on the temporal test set: Recall@K (K = 1, 3, 5, 10) together with Mean Reciprocal Rank (MRR). Relevance is defined as JASC-code label consistency between the retrieved case and the query.

Fig. 3. Recall@K curve for the temporal test set, showing the monotonic gain in recall as K increases from 1 to 10, with MRR = 0.47 annotated for reference. Relevance is defined as in Figure 2.
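The two retrieval metrics named in these captions can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Recall@K is computed as a hit rate (a query counts as recalled when at least one of its top-K retrieved cases carries the same JASC code), and the function names and toy codes are hypothetical.

```python
def recall_at_k(retrieved, query_labels, k):
    """Fraction of queries with at least one label-consistent case in the top k."""
    hits = sum(1 for labels, q in zip(retrieved, query_labels) if q in labels[:k])
    return hits / len(query_labels)


def mean_reciprocal_rank(retrieved, query_labels):
    """Average of 1/rank of the first label-consistent case (0 if none is retrieved)."""
    total = 0.0
    for labels, q in zip(retrieved, query_labels):
        for rank, label in enumerate(labels, start=1):
            if label == q:
                total += 1.0 / rank
                break
    return total / len(query_labels)


# Toy data (hypothetical): JASC codes of retrieved cases, best match first.
retrieved = [
    ["3350", "5240", "3350"],
    ["2100", "3442", "3350"],
    ["7200", "7200", "2400"],
]
queries = ["3350", "3350", "3246"]

r1 = recall_at_k(retrieved, queries, 1)
r3 = recall_at_k(retrieved, queries, 3)
mrr = mean_reciprocal_rank(retrieved, queries)
print(f"Recall@1={r1:.3f} Recall@3={r3:.3f} MRR={mrr:.3f}")
```

Under this hit-rate definition Recall@K cannot decrease as K grows, which is consistent with the monotonic curve described in the caption.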

Fig. 4. Selective prediction trade-off on the temporal test set at α = 0.10. Error rate on accepted top-1 predictions is plotted against abstention rate, parameterized by the maximum acceptable prediction-set size m (labeled at each point as s ≤ m; see equation (4) in Section 3.9). Stricter thresholds (smaller m) move the curve toward higher abstention and lower accepted-case error.
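The trade-off described in this caption can be traced with a few lines of code. The sketch below assumes the acceptance rule is exactly "accept when prediction-set size s ≤ m, abstain otherwise"; equation (4) of the paper is not reproduced on this page, so the rule, the function name, and the toy values are assumptions.

```python
def selective_tradeoff(set_sizes, top1_correct, m):
    """Accept a case when its prediction-set size is <= m; abstain otherwise.

    Returns (abstention_rate, error_rate_on_accepted_cases).
    """
    accepted = [ok for size, ok in zip(set_sizes, top1_correct) if size <= m]
    abstention = 1.0 - len(accepted) / len(set_sizes)
    if not accepted:
        return abstention, float("nan")  # nothing accepted, error undefined
    error = 1.0 - sum(accepted) / len(accepted)
    return abstention, error


# Toy values (hypothetical): one prediction-set size and top-1 outcome per case.
set_sizes = [2, 5, 8, 12, 40, 72, 3, 9]
top1_correct = [True, True, False, True, False, False, True, True]

curve = {m: selective_tradeoff(set_sizes, top1_correct, m) for m in (3, 10, 50)}
for m, (abstention, error) in curve.items():
    print(f"s <= {m}: abstention={abstention:.3f}, accepted-case error={error:.3f}")
```

Sweeping m from small to large yields one point per threshold, which is how a curve of accepted-case error against abstention rate can be drawn.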

Fig. 5. Ablation summary showing incremental contributions of the classifier (top-1 accuracy), shortlist output (top-3 accuracy), retrieval evidence (Recall@10), and conformal uncertainty layer (coverage@90%), with abstention@Smax=10 and error among accepted cases reported for the conformal setting.

Included records by year

Year	Included records	Unique aircraft	Unique make-model	Region codes
2021	1,500	811	185	9
2022	1,500	831	164	9
2023	1,430	848	176	8
Total	4,430	2,204	290	10

Split counts and held-out station/fleet composition

Panel A. Temporal split (chronological 70/10/20 by difficulty date).
Metric	All data	Train	Calibration	Test
Difficulty Date range	01 Jan 2021 – 22 Dec 2023	01 Jan 2021 – 14 Jan 2023	15 Jan 2023 – 24 Apr 2023	24 Apr 2023 – 22 Dec 2023
Events (unique SDRs)	4,430	3,101 (70.0%)	443 (10.0%)	886 (20.0%)
Stations (Receiving Region Code), n	10	10	8	8
Fleets (make-model), n	290	245	79	143
Aircraft (RegistryNumber), n	2,204	1,605	263	552
Operators (Operator Designator), n	97	90	34	41
JASC classes (JASC Code), n	306	279	95	160
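A chronological 70/10/20 split of the kind summarized in Panel A can be sketched as below. The field name `difficulty_date` and the synthetic records are hypothetical, and integer percentages are used so the cut points do not depend on floating-point rounding; the authors' exact procedure is not shown on this page.

```python
from datetime import date, timedelta


def temporal_split(records, train_pct=70, calib_pct=10):
    """Sort by event date, then cut into train / calibration / test blocks."""
    ordered = sorted(records, key=lambda r: r["difficulty_date"])
    n = len(ordered)
    n_train = n * train_pct // 100
    n_calib = n * calib_pct // 100
    train = ordered[:n_train]
    calib = ordered[n_train:n_train + n_calib]
    test = ordered[n_train + n_calib:]
    return train, calib, test


# Synthetic records (hypothetical), sized like the analytic corpus.
records = [
    {"id": i, "difficulty_date": date(2021, 1, 1) + timedelta(days=i % 1000)}
    for i in range(4430)
]
train, calib, test = temporal_split(records)
print(len(train), len(calib), len(test))
```

With 4,430 records this yields blocks of 3,101, 443, and 886 events, matching the counts in Panel A. Because the cut is made on sorted positions, records sharing a boundary date can fall on both sides of a cut, which is one possible reason a date such as 24 Apr 2023 appears in both the calibration and test ranges.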

Panel B. Station hold-out split (region generalization test).
Metric	Train stations (non-held-out)	Held-out stations (S1-S2)
Stations (ReceivingRegionCode), n	8	2
Events (unique SDRs)	1,766	2,664 (60.14%)
Fleets (make-model), n	177	192
Fleet overlap with training (n)	–	79
Fleets exclusive to held-out (OOD), n	–	113
Aircraft (RegistryNumber), n	935	1,312
Operators (Operator Designator), n	64	49
ATA/JASC classes (JASC Code), n	249	226
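The fleet-overlap rows of Panel B follow from simple set operations once the stations are partitioned. The sketch below is illustrative only; the record keys `station` and `fleet` and the toy values are hypothetical.

```python
def station_holdout(records, held_out_stations):
    """Split records by station: held-out stations never appear in training."""
    train = [r for r in records if r["station"] not in held_out_stations]
    held = [r for r in records if r["station"] in held_out_stations]
    return train, held


def fleet_overlap(train, held):
    """Count held-out fleets, those shared with training, and those exclusive (OOD)."""
    train_fleets = {r["fleet"] for r in train}
    held_fleets = {r["fleet"] for r in held}
    shared = held_fleets & train_fleets
    exclusive = held_fleets - train_fleets
    return len(held_fleets), len(shared), len(exclusive)


# Toy records (hypothetical station and fleet values).
records = [
    {"station": "A", "fleet": "F1"},
    {"station": "A", "fleet": "F2"},
    {"station": "B", "fleet": "F2"},
    {"station": "S1", "fleet": "F2"},
    {"station": "S1", "fleet": "F3"},
    {"station": "S2", "fleet": "F4"},
]
train, held = station_holdout(records, {"S1", "S2"})
n_held, n_shared, n_exclusive = fleet_overlap(train, held)
print(len(train), len(held), n_held, n_shared, n_exclusive)
```

The shared and exclusive counts always sum to the number of held-out fleets, as they do in Panel B (79 + 113 = 192).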

Missingness summary for the core modeling schema fields defined in Table 3, in the analytic corpus (n = 4,430)

Field	Missing (n)	Missing (%)
Difficulty Date	0	0
Submission Date	0	0
Receiving Region Code	0	0
Registry Number	12	0.271
Aircraft Make	3	0.068
Aircraft Model	5	0.113
Discrepancy	0	0
JASC Code	0	0

Overall dataset summary (SDRS exports, 2021-2023)

Item	Value
Study period (Difficulty Date)	01 Jan 2021 – 22 Dec 2023
Records extracted (raw rows across files)	4,929
Records included after cleaning (unique SDRs with required fields)	4,430
Excluded (duplicates / missing critical fields / missing narrative)	499
Unique SDR identifiers (Operator Control Number)	4,430
Unique aircraft (RegistryNumber)	2,204
Unique operators (OperatorDesignator)	97
Location/station proxy available	Receiving Region Code
Unique region codes (Receiving Region Code)	10
Narrative availability (Discrepancy present)	4,430 / 4,430 (100%)
Label availability (JASC Code present)	4,430 / 4,430 (100%)

Robustness under dataset shift on the station-comparable temporal test subset: top-1 accuracy with approximate 95% confidence intervals, and descriptive F1 metrics for non-held-out stations versus station-held-out regions

Split	n	Top-1 accuracy (95% CI)	Macro-F1	Weighted-F1
Time-held-out (temporal test, non-held-out stations)	337	0.504 (0.451-0.557)	0.263	0.468
Station-held-out (regions S1 and S2)	473	0.452 (0.408-0.497)	0.185	0.421
Temporal test (all stations, label-in-distribution; station-comparable subset)	810	0.474 (0.440-0.508)	0.209	0.437
Absolute difference (non-held-out - station-held-out)	–	0.052 (-0.018 to 0.122)	0.078	0.047
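The "approximate 95% confidence intervals" in this table can be approximated with a normal (Wald) interval for a proportion. The page does not state which interval method the authors used, so the sketch below is an assumption; it takes the rounded accuracies and sample sizes from the table as inputs.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal


def wald_ci(p, n, z=Z95):
    """Normal-approximation interval for a proportion p observed on n cases."""
    se = math.sqrt(p * (1.0 - p) / n)
    return p - z * se, p + z * se


def diff_ci(p1, n1, p2, n2, z=Z95):
    """Difference p1 - p2 with a normal-approximation interval (independent groups)."""
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    d = p1 - p2
    return d, d - z * se, d + z * se


lo, hi = wald_ci(0.504, 337)                      # non-held-out stations
d, d_lo, d_hi = diff_ci(0.504, 337, 0.452, 473)   # non-held-out minus station-held-out
print(f"accuracy CI: {lo:.3f} to {hi:.3f}")
print(f"difference: {d:.3f} ({d_lo:.3f} to {d_hi:.3f})")
```

Under this assumption the results land close to the tabulated intervals; differences in the third decimal can arise from rounding of the inputs or from a different interval method (for example, Wilson). The interval for the difference includes zero, so the accuracy gap between station groups is not clearly distinguishable from sampling noise at this sample size.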

Core data fields, operational meaning, and modeling role in the recommended SDRS schema

Field	Example / Format	Role in analysis	Notes
Aircraft/Tail (hashed)	a3f9…	Entity linkage	Enables repeat-defect and within-aircraft history without revealing identity
Timestamp	ISO datetime; optional bins (e.g., day/week/shift)	Temporal ordering, shift tests	Binning used when required for privacy or stability
Station	ICAO/IATA code	Site effects, station-shift evaluation	Supports cross-station robustness testing
Write-up text	Free text	Primary NLP input	Defect narrative; cleaned/normalized for abbreviations where feasible
Action taken text (if available)	Free text	Evidence/outcome context	Useful for retrieval, resolution summarization, and traceability
JASC code (supervised label)	ATA chapter/subchapter	Supervised label	Target for triage classification when present and reliable
Deferral indicators (if present)	MEL/CDL flag; deferral code	Operational constraint marker	Captures deferral behavior relevant to dispatch and risk
Closure status/time (if present)	Open/closed; minutes/hours	Outcome/efficiency metric	Supports time-to-close variability and operational impact analyses

Example of the proposed decision-support output for a de-identified temporal test case

Current write-up	Top-3 JASC candidates	Retrieved precedents	Conformal output	Recommended workflow action	Final human disposition
Emergency exit light inoperative at R1 door; removed and replaced emergency exit battery pack M1675/STA 322R with a serviceable unit in accordance with B737-800 AMM 33-51-06; test satisfactory.	1. JASC 3350; 2. JASC 5240; 3. JASC 3442	Case 1: R1 door emergency exit light inoperative; battery pack replaced; operational check satisfactory. Case 2: Aft left emergency exit lights inoperative; battery pack replaced; checks satisfactory. Case 3: Emergency exit light assembly near R1 door removed and replaced; test satisfactory. Case 4: Inoperative emergency exit battery pack replaced; test satisfactory. Case 5: Similar emergency exit battery pack fault corrected by replacement, followed by a satisfactory test.	Prediction-set size = 72; status = review required.	Review required – the shortlist and retrieved precedents are coherent, but predictive uncertainty remains too high for silent autosuggestion.	Human reviewer confirms JASC 3350 and records battery-pack replacement with satisfactory operational test.

Dataset scale, JASC label imbalance, and split composition (SDRS, 2021-2023)

Quantity	All data	Train	Calibration	Test
Study period (Difficulty Date)	01 Jan 2021 – 22 Dec 2023	01 Jan 2021 – 14 Jan 2023	15 Jan 2023 – 24 Apr 2023	24 Apr 2023 – 22 Dec 2023
Events (unique SDRs)	4,430	3,101 (70.0%)	443 (10.0%)	886 (20.0%)
Unique aircraft (Registry Number)	2,204	1,605	263	552
Unique operators (Operator Designator)	97	90	34	41
Unique fleets (make-model) (both fields present)	290	245	79	143
Stations (Receiving Region Code)	10	10	8	8
Distinct JASC classes (JASC Code)	306	279	95	160
Majority class count	582	421	53	108
Majority class share (%)	13.14%	13.58%	11.96%	12.19%
Top-10 classes cumulative share (%)	43.68%	43.28%	53.72%	44.58%
Minority class count (min frequency)	1	1	1	1
Majority/minority ratio (max/min)	582:1	421:1	53:1	108:1
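The imbalance rows in this table (majority share, top-10 cumulative share, max/min ratio) are straightforward to derive from a list of labels. The sketch below uses a hypothetical helper name and toy labels, with `top_n=2` only to keep the example small.

```python
from collections import Counter


def imbalance_summary(labels, top_n=10):
    """Summarize class imbalance from a flat list of class labels."""
    counts = Counter(labels).most_common()  # sorted by descending frequency
    total = len(labels)
    majority = counts[0][1]
    minority = counts[-1][1]
    return {
        "n_classes": len(counts),
        "majority_count": majority,
        "majority_share": majority / total,
        "top_n_share": sum(c for _, c in counts[:top_n]) / total,
        "max_min_ratio": majority / minority,
    }


# Toy labels (hypothetical JASC codes).
labels = ["3350"] * 5 + ["5240"] * 3 + ["3442"] + ["2100"]
summary = imbalance_summary(labels, top_n=2)
print(summary)
```

Applied to the full corpus, the same arithmetic gives the majority share reported above: 582 / 4,430 ≈ 13.14%.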

Ablation summary showing the incremental contribution of each pipeline module on the temporal test set

Configuration	Primary metric	Value
Classifier only	Top-1 accuracy	0.509
Classifier + shortlist	Top-3 accuracy	0.683
Retrieval evidence	Recall@10	0.688
Conformal + abstention	Coverage@90%	0.919

Prediction performance on the temporal test set after label-based in-distribution filtering (labels observed in training)

Model	Evaluated temporal test subset	Top-1 accuracy	Top-3 accuracy	Macro-F1	Weighted-F1	Excluded test cases with OOD labels (n)
TF-IDF + Linear SVM (baseline)	848/886 (95.7%)	0.5094	0.6828	0.2386	0.4768	38
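The evaluation protocol implied by this table has two steps: drop test cases whose label never occurs in training, then score the ranked predictions of the remaining cases. The sketch below illustrates both steps with hypothetical data structures (`label`, `ranked`); it does not reproduce the TF-IDF + Linear SVM model itself.

```python
def filter_in_distribution(test_cases, train_labels):
    """Keep test cases whose true label was observed in training."""
    seen = set(train_labels)
    kept = [case for case in test_cases if case["label"] in seen]
    return kept, len(test_cases) - len(kept)


def top_k_accuracy(cases, k):
    """Share of cases whose true label is among the k highest-ranked predictions."""
    hits = sum(1 for case in cases if case["label"] in case["ranked"][:k])
    return hits / len(cases)


# Toy data (hypothetical): "ranked" lists predicted JASC codes, best first.
train_labels = ["3350", "5240", "3442"]
test_cases = [
    {"label": "3350", "ranked": ["3350", "5240", "3442"]},
    {"label": "5240", "ranked": ["3350", "3442", "5240"]},
    {"label": "3442", "ranked": ["5240", "3350", "2100"]},
    {"label": "7200", "ranked": ["3350", "5240", "3442"]},  # label unseen in training
]

kept, n_excluded = filter_in_distribution(test_cases, train_labels)
top1 = top_k_accuracy(kept, 1)
top3 = top_k_accuracy(kept, 3)
print(len(kept), n_excluded, round(top1, 3), round(top3, 3))
```

Because excluded cases could never be classified correctly by a closed-set model, accuracy on the filtered subset (848 of 886 cases in the table) is an upper bound on accuracy over the full test split.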

Fleet composition (top 10 make-model by count)

Rank	Fleet (Aircraft Make + Aircraft Model)	Count
1	BOEING 7377H4	226
2	AIRBUS A320232	209
3	DOUG MD11F	166
4	EMB ERJ170200LR	151
5	AIRBUS A321231	134
6	CNDAIR CL6002D24	128
7	CNDAIR CL6002C10	124
8	BOEING 737823	120
9	BOEING 737890	109
10	BOEING 737	106

Field/label availability (after cleaning)

Field	SDRS column	Availability in the included set
Unique control # (dedup/audit trail)	OperatorControlNumber	100%
Event/occurrence date	DifficultyDate	100%
Report/submission date	SubmissionDate	100%
Aircraft ID	RegistryNumber	High (used for unique count)
Fleet composition	AircraftMake, AircraftModel	High
Location/station proxy	ReceivingRegionCode	High
Defect narrative (NLP input)	Discrepancy	100%
System label (JASC code)	JASCCode	100%

Conformal prediction set efficiency on the temporal test set: empirical coverage and prediction-set size summary across target coverage levels (1 − α)

Target coverage	α	q̂	Empirical coverage	Average set size	Median set size	25th percentile set size	75th percentile set size
0.80	0.20	0.993285	0.828033	9.016490	8	4	13
0.85	0.15	0.995652	0.870436	15.93168	12	6	24
0.90	0.10	0.997414	0.923439	40.87986	23	10	57
0.92	0.08	0.997839	0.943463	60.44405	30	13	85
0.95	0.05	0.998647	0.971731	122.2839	83	22	245
0.98	0.02	0.999292	0.990577	196.5524	266	91	277
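The quantities in this table (a threshold q̂ per α, then a prediction set per test case) are those of split conformal prediction. The sketch below assumes the common nonconformity score 1 − p(true label), which would be consistent with q̂ values close to 1, but the score function and the way class probabilities are obtained from the classifier are not given on this page and are assumptions here.

```python
import math


def conformal_quantile(calib_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(calib_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every label is included
    return sorted(calib_scores)[k - 1]


def prediction_set(probs, q_hat):
    """Labels whose nonconformity score 1 - p(label) does not exceed q_hat."""
    return {label for label, p in probs.items() if 1.0 - p <= q_hat}


# Hypothetical calibration scores: 1 - probability assigned to the true label.
calib_scores = [0.10, 0.35, 0.60, 0.20, 0.90, 0.05, 0.45, 0.75, 0.30, 0.55]
q_hat = conformal_quantile(calib_scores, alpha=0.20)

# Hypothetical class probabilities for one test narrative.
probs = {"3350": 0.55, "5240": 0.30, "3442": 0.10, "2100": 0.05}
pred_set = prediction_set(probs, q_hat)
print(q_hat, sorted(pred_set))
```

A smaller α raises q̂ and admits more labels into each set, which is the pattern visible in the table: coverage rises from about 0.83 to 0.99 while the average set size grows from about 9 to about 197.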

Dataset-shift diagnostics for robustness evaluation: narrative length and similarity-to-training for time-held-out versus station-held-out groups

Group	n	Median tokens in narrative	Median max similarity to training
Time-held-out (non-held-out stations)	337	36	0.325201
Station-held-out (S1 and S2)	473	29	0.275161
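The two diagnostics in this table can be sketched as follows: for each narrative in a group, take its highest similarity to any training narrative, then report the group medians. The representation used here (whitespace tokens, raw counts, cosine similarity) is an assumption for illustration; the page does not state how the authors vectorized the text.

```python
import math
import statistics
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_similarity_to_training(narrative, training_vectors):
    """Highest cosine similarity between one narrative and any training narrative."""
    query = Counter(narrative.lower().split())
    return max(cosine(query, vec) for vec in training_vectors)


# Toy narratives (hypothetical).
training = ["emergency exit light inoperative", "hydraulic pump leaking at fitting"]
training_vectors = [Counter(text.lower().split()) for text in training]
group = ["emergency exit light inoperative at door", "cabin window cracked"]

sims = [max_similarity_to_training(text, training_vectors) for text in group]
median_tokens = statistics.median(len(text.split()) for text in group)
median_max_sim = statistics.median(sims)
print(median_tokens, round(median_max_sim, 3))
```

A lower median for a group indicates that its narratives sit further from anything seen in training, which is the direction of the gap the table reports for the station-held-out group.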

DOI: https://doi.org/10.2478/tar-2026-0009 | Journal eISSN: 2545-2835

Language: English

Page range: 53 - 85

Submitted on: Jan 23, 2026

Accepted on: Mar 16, 2026

Published on: Jun 17, 2026

Published by: ŁUKASIEWICZ RESEARCH NETWORK – INSTITUTE OF AVIATION

In partnership with: Paradigm Publishing Services

Keywords: aircraft maintenance diagnostics, line maintenance triage, JASC code classification, evidence-grounded retrieval, conformal prediction, abstention

Related subjects: Engineering; Introductions and overviews; Materials sciences, other; Physics; Physics, other

© 2026 Arthur Dela Peña, Jefferson Clariza, Mary Ann Aballiar-Vista, published by ŁUKASIEWICZ RESEARCH NETWORK – INSTITUTE OF AVIATION
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
