Introduction
Generative artificial intelligence (GenAI), “a transformer-based machine learning model trained in an unsupervised manner on extensive datasets and specifically optimized for generating valuable data through prompts” (Oniani et al., 2023), has rapidly emerged as a transformative force across industries (Maslej et al., 2024). Its potential within the military domain is profound (Grand-Clément, 2023).
The ability of GenAI to generate, synthesize, and analyze vast amounts of information has sparked growing interest among defense organizations seeking to enhance strategic planning, decision-making, and operational efficiency. From simulating war-gaming scenarios (Hua et al., 2024; Rivera et al., 2024) to enhancing intelligence analysis and acting as a force multiplier through disinformation and false signaling (Zarrar & Kakar, 2024), GenAI applications are increasingly becoming integral to modern defense strategies.
The integration of GenAI into military operations presents both unprecedented opportunities and complex challenges. On one hand, GenAI has the potential to revolutionize decision-support systems, mission planning, and autonomous warfare by processing and interpreting large datasets in real time. It can improve operational readiness through enhanced training simulations (Chuang & Cheng, 2022), optimize logistics and supply chain management (Boone et al., 2025), and contribute to cyber defense by detecting anomalies and novel threats (Uddin et al., 2025). On the other hand, the deployment of GenAI in military contexts raises critical concerns, including ethical considerations, questions of bias mitigation (Stebbins et al., 2024), adversarial vulnerabilities (Shayea et al., 2025), and the secure implementation of AI-driven technologies within classified environments.
While algorithmic bias is not solely a GenAI issue, since all AI systems may be skewed in some manner due to data, the design of the model’s algorithms, the learning process, and objectives or performance metrics (Blanchard & Bruun, 2024), GenAI brings new challenges; being more relatable to human users, it is capable of greatly altering the way humans interact with technology (Schneider, 2025).
This paper introduces GenAI and the latest advances in its field before proceeding to explore the evolving landscape of GenAI in military applications. The focus is on identifying key trends, emerging opportunities, and associated challenges. By reviewing the latest research and technological advancements, we analyze how GenAI is shaping the future of defense operations. Further, we examine the implications of GenAI adoption in military contexts, including its technical challenges and limitations and its consequences for human-AI collaboration, ethical governance, and strategic policy frameworks. The study also discusses the role of NATO and allied forces in fostering research, development, and responsible implementation of GenAI-driven defense solutions.
This study provides an understanding of how GenAI is amplifying military capabilities and the strategic steps necessary for its successful deployment. The insights contribute to the discourse on AI-driven military innovations and inform future policy and research.
Methods
This study presents a comprehensive survey and analysis of literature published on GenAI in the military and defense sectors between 2022 and 2025. The selected timeframe aligns with the rapid advancements in GenAI technology and its increasing adoption in defense applications. The research questions concern the current state of research and development of GenAI in the military context. What are the challenges? What are the assumed trends and recommended courses of action for stakeholders?
To ensure a structured and relevant review, publications were sourced from RAND Corporation reports and Royal United Services Institute (RUSI) publications due to their military focus, and from academic databases and search engines such as Google Scholar. The search query used to identify relevant literature included the following: “(Military OR Defense) AND (GenAI OR LLM OR Generative Artificial Intelligence)”.
The results were filtered to focus on studies published within the last three years (2022–2025). The query resulted in 24,100 hits for this period. Manual filtering to reduce the number of initial search hits was conducted; apart from institutional platforms, only the first 200 returns, sorted by ‘relevance’, were assessed, focusing on open- and institution-access papers. From this dataset, 49 publications were selected for closer examination based on abstracts and indicated relevance to military applications of GenAI. Studies that were duplicative, lacked direct relevance to GenAI, or were considered low-impact (e.g., lacking peer review, expert credibility or detailed methodology) were excluded. Of the 49 papers, 29 were selected for the review. Publications with a GenAI focus or relevance and a military context were selected. The distribution of publications is displayed in Figure 1. Six papers were included despite being pre-prints due to their applicative nature.

Figure 1
The distribution of publication venues for the reviewed papers.
Each selected publication was categorized based on its research method and primary contribution. The classification included six categories:
Survey – Studies presenting an overview of existing research without introducing new methods.
Review – Critical evaluations of past research with additional insights or critiques.
Policy analysis – Studies focused on policies, governance or strategic frameworks.
Application – Studies proposing tested and evaluated solutions for military use.
Proposition – Theoretical frameworks, new architectures, or conceptual models.
Overview – Papers addressing singular topics.
To enhance consistency in classification, each paper was analyzed by two researchers to validate its category assignment. Further, application-focused and propositional papers underwent deeper examination to assess readiness levels, technological feasibility, and real-world implementation challenges.
By systematically analyzing these sources, this study provides a comprehensive understanding of the current research landscape, identifies gaps, and highlights emerging opportunities and challenges for GenAI in military applications.
GenAI and the State of the Art
While GenAI’s rise to prominence is a very recent phenomenon, its development can be traced back to the mid-20th century. Joseph Weizenbaum introduced ELIZA, an early chatbot that employed rule-based scripts to generate a human-like conversation, in 1966 (Weizenbaum, 1966). In the late 1980s, probabilistic methods such as Bayesian networks were used to model complex probabilistic relationships (Pearl, 1988). This foundation was enhanced by advancements in neural networks such as recurrent neural networks (RNN) and long short-term memory (LSTM; see Hochreiter & Schmidhuber, 1997). The modern era of generative AI can be deemed to have started in 2014, when Ian Goodfellow and his colleagues (2014) introduced generative adversarial networks (GAN). The idea of a GAN is to train a discriminator and a generator network, where the generator aims to generate new data and the discriminator evaluates how closely the generated (i.e., synthetic) outputs match the source data. The GAN proved to be very efficient and successful in many tasks, notably image generation (Radford et al., 2016), and was followed by even more capable diffusion methods (Bishop & Bishop, 2023). The current state of GenAI development began with the invention of the transformer architecture with an attention mechanism (Vaswani et al., 2017).
Essentially, the transformer is an encoder–decoder architecture in which the input is encoded into a sequence of continuous representations by the encoder and then decoded into an output sequence that constitutes the generated content. This mechanism can be used in applications such as language translation. The attention mechanism enables the model to evaluate relationships between all parts of an input sequence regardless of their distance, in contrast to the sequential recurrence used in RNNs. Specifically, Vaswani and his colleagues (2017) introduced multi-head attention into the encoder–decoder structure – a system allowing multiple types of contextual relationships to be computed in parallel. For the decoder output, each word in the learned vocabulary is assigned a probability of being the next word in the generated sequence, conditioned on the context calculated by the decoder stack. Most current architectures consist of only one of the two, the encoder or the decoder, depending on the task and scope.
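The attention computation described above can be illustrated with a minimal sketch. It assumes plain Python lists as vectors and, for brevity, omits the learned projection matrices that a real transformer applies to queries, keys, and values; the function names are illustrative and not taken from Vaswani et al. (2017).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: every query is compared with every key,
    # regardless of position, and the values are averaged with those weights.
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(x, num_heads):
    # Simplification (assumption): each head attends over its own slice of the
    # vector instead of a learned projection; head outputs are concatenated.
    d = len(x[0])
    assert d % num_heads == 0
    size = d // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * size:(h + 1) * size] for row in x]
        heads.append(attention(part, part, part))
    return [sum((heads[h][i] for h in range(num_heads)), []) for i in range(len(x))]

tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
contextualized = multi_head_attention(tokens, num_heads=2)
```

Each output row has the same length as its input row, but now mixes information from all positions in the sequence, which is the property that distinguishes attention from sequential recurrence.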
For the time being, advances in technology are mainly due to the scaling up of the transformer models and extensive training with vast datasets, resulting in very large, multimodal models (where “multimodal” designates data types other than text and natural language). The research and development of transformer architecture has since been accelerated by at least eight major innovations, explained below.
Mixture of Experts
In 2017, a Google research team introduced a mixture of experts (MoE) model to leverage “outrageously” large neural networks more efficiently (Shazeer et al., 2017). The innovation, achieved in the same year as the transformer with attention mechanism (Vaswani et al., 2017), focuses on conditional computation, in which only parts of a very large network are active for any given input. The proposed sparsely-gated MoE utilizes a gating network that determines a sparse combination of sub-networks (experts) for each input, achieving vastly greater model capacity. For example, the suggested solution enables the number of parameters to be scaled up more than 1000 times without significantly increasing computational costs during inference.
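A minimal sketch of sparse gating may help make the idea concrete. It assumes a linear gating network and toy experts, and it leaves out the noise term and load-balancing losses used in practice; all names and values are hypothetical.

```python
import math

def top_k_gate(scores, k):
    # Keep only the k highest gate scores and normalize them with a softmax;
    # all other experts implicitly receive a gate value of zero.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, gate_weights, experts, k=2):
    # Gate scores are a linear function of the input (one weight row per expert).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = top_k_gate(scores, k)
    output = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)  # only the selected experts are evaluated
        output = [o + g * yi for o, yi in zip(output, y)]
    return output, gates

# Hypothetical toy experts: each one scales its input by a different factor.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
output, gates = moe_forward([1.0, 2.0], gate_weights, experts, k=2)
```

Because only `k` of the experts run per input, the total parameter count can grow with the number of experts while the per-input computation stays roughly constant.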
Retrieval-Augmented Generation (RAG)
After the launch of widely known GenAI applications such as ChatGPT, issues of GenAI reliability have become a concern. GenAI models are prone to generate unintended or plainly false responses, a phenomenon known as “hallucination” (Ji et al., 2023).
To combat this, Lewis et al. (2020) have introduced the retrieval-augmented generation (RAG) framework, which utilizes a vectorized database of source documents, from which top-K documents or chunks are retrieved for generation for each query. The aim is to reduce generative AI model hallucination and focus on the scope of the information that exists within the document database. Results show that RAG models are able to generate more specific and factual answers than parametric-only baselines. The approach has significantly improved the utilization of a particular, factual information database that needs to be queried with a minimum of errors – something significant in a military context.
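The retrieval step can be sketched as follows. This is an illustration under simplifying assumptions: a bag-of-words vector stands in for the learned dense embeddings a real RAG system would use, and the documents, function names, and prompt wording are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding (assumption): word counts instead of a dense vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Rank all documents by similarity to the query and keep the top K.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    # The retrieved chunks are placed in the prompt so that generation is
    # conditioned on them rather than on parametric memory alone.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

DOCS = [
    "Convoy routes are reviewed every morning",
    "The radio frequency plan changes weekly",
    "Fuel is stored at the northern depot",
]
prompt = build_prompt("where is fuel stored", DOCS, k=1)
```

The resulting prompt would then be passed to a generative model; that call is omitted here because it depends on the model and interface in use.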
Few-Shot Learning
Few-shot learning, a machine learning method in which models are trained or prompted using only a small number of examples, has been shown to produce results comparable to fine-tuning when applied to LLMs (Brown et al., 2020). In essence, few-shot learning utilizes K examples – in Brown and colleagues’ study, between 10 and 100 – consisting of context-completion pairs, followed by a final example of context for which the model is expected to generate a completion. A one-shot approach uses K = 1, while zero-shot learning relies solely on a natural language prompt without any task-specific examples. Brown et al. found that the few-shot learning approach nearly matches state-of-the-art fine-tuned results, providing a computationally efficient alternative to full fine-tuning. The experiment was carried out on GPT-3, which has 175 billion parameters, but the premise has been shown to hold for smaller models as well (Ateia & Kruschwitz, 2024).
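The difference between zero-shot, one-shot, and few-shot prompting comes down to how many context-completion pairs precede the final context. A minimal sketch, with a hypothetical prompt layout and example data:

```python
def build_few_shot_prompt(task_description, examples, query, k):
    # k = 0 gives a zero-shot prompt, k = 1 one-shot, larger k few-shot.
    blocks = [task_description]
    for context, completion in examples[:k]:
        blocks.append(f"Input: {context}\nOutput: {completion}")
    # The final context is left open for the model to complete.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Hypothetical context-completion pairs for a translation task.
EXAMPLES = [("cheese", "fromage"), ("bread", "pain"), ("water", "eau")]
zero_shot = build_few_shot_prompt("Translate English to French.", EXAMPLES, "milk", k=0)
few_shot = build_few_shot_prompt("Translate English to French.", EXAMPLES, "milk", k=3)
```

No model weights are updated in either case; the examples only shape the context from which the model generates its completion.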
Chain of Thought
To advance from replication to reasoning, Wei et al. (2022) have proposed a chain of thought (CoT) structure that utilizes a series of intermediate reasoning steps, improving the LLM’s capacity for complex reasoning. This is achieved through CoT prompting, where the models employ few-shot prompting of the reasoning steps, aiding inference in complex questions. The prompts are triplets with ⟨input, CoT, output⟩, where the CoT is a series of reasoning steps leading to the output.
In effect, CoT enables the model to partition the problem into subproblems, effectively enriching the problem analysis with more tokens and thus applying more computation to it. This leads to improved performance as well as better explainability in the reasoning steps, as these steps can be viewed and evaluated by the user. Improved inference comes at a cost, however: generating the additional reasoning tokens makes the approach computationally more intensive. While Wei et al. demonstrate a great performance boost for large models, they note that models smaller than 100 billion parameters would not benefit from a CoT approach. Hence, it can be argued that the CoT approach and the resulting models, referred to as large reasoning models (LRMs), may remain unfeasible for low-resource on-premise or edge implementations unless new innovations achieve comparable performance in smaller models.
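A CoT prompt built from ⟨input, CoT, output⟩ triplets might be assembled as in the sketch below. The layout, the phrase "The answer is", and the example triplet are assumptions made for illustration.

```python
def build_cot_prompt(triplets, question):
    # Each exemplar shows the intermediate reasoning steps before the answer,
    # so the model is nudged to produce its own steps for the new question.
    blocks = []
    for inp, chain, out in triplets:
        steps = " ".join(chain)
        blocks.append(f"Q: {inp}\nA: {steps} The answer is {out}.")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical exemplar triplet.
TRIPLETS = [(
    "A depot holds 5 crates and receives 2 deliveries of 3 crates. How many crates are there?",
    ["Two deliveries of 3 crates add 6 crates.", "5 + 6 = 11."],
    "11",
)]
cot_prompt = build_cot_prompt(TRIPLETS, "A unit has 4 vehicles and receives 3 more. How many vehicles are there?")
```

The extra reasoning text in each exemplar, and in the model's own answer, is where the additional token and compute cost discussed above comes from.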
Teacher-Student Knowledge Distillation
A prominent and much-studied topic in the domain of GenAI is the pursuit of large language models that are more computationally efficient in terms of inference and memory (Xu et al., 2024; Gupta & Agrawal, 2021). The proposed solutions include model pruning, quantization, and knowledge distillation. Through pruning, models are scaled down in ways that deliberately minimize effects on performance (removing low-magnitude weights, for example); quantization eases the memory requirements by using a lower bit size for the weights; “knowledge distillation” describes methods of transferring the performance of a large and widely capable model into a smaller model, creating a computationally feasible and easily deployable version with high performance metrics in the chosen field (Gupta & Agrawal, 2021).
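Pruning and quantization can be illustrated on a flat list of weights. The sketch assumes magnitude-based pruning and symmetric 8-bit quantization with a single scale factor, which are common but by no means the only variants; the weight values are made up.

```python
def magnitude_prune(weights, fraction):
    # Zero out the given fraction of weights with the smallest magnitude.
    n_drop = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def quantize_int8(weights):
    # Symmetric uniform quantization: floats become integers in [-127, 127]
    # plus one shared floating-point scale.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.05, -0.9, 0.3, -0.01]
pruned = magnitude_prune(weights, fraction=0.5)
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Pruning trades a small loss of accuracy for sparsity, while quantization trades a bounded rounding error (at most half the scale per weight) for a smaller memory footprint.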
The method of knowledge distillation using a teacher-student scheme, in which a smaller model learns from the outputs of a larger teacher model, was proposed by Ba and Caruana (2014). In essence, the smaller or shallower model learns to mimic the function of the deeper model, aiming to achieve similar performance with more attainable computational resources.
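The teacher-student idea can be sketched with a toy example in which a small linear student is trained only on the outputs of a teacher, never on ground-truth labels. The teacher function, the squared-error objective, and the hyperparameters are assumptions chosen for illustration.

```python
import random

def teacher_logit(x):
    # Hypothetical stand-in for the output of a large, already trained model.
    return 3.0 * x[0] - 2.0 * x[1] + 0.5

def train_student(inputs, steps=2000, lr=0.05, seed=0):
    # The student is a linear model fitted by stochastic gradient descent
    # to mimic the teacher's outputs (squared-error loss).
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        x = rng.choice(inputs)
        error = (w[0] * x[0] + w[1] * x[1] + b) - teacher_logit(x)
        w = [wi - lr * error * xi for wi, xi in zip(w, x)]
        b -= lr * error
    return w, b

# Unlabeled inputs on a small grid; only the teacher provides targets.
INPUTS = [(a / 2, c / 2) for a in range(-2, 3) for c in range(-2, 3)]
student_w, student_b = train_student(INPUTS)
```

In this toy setting the student can reproduce the teacher exactly; with real models the student has far less capacity and only approximates the teacher's behavior.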
Unsupervised Reinforcement Learning
The AI landscape was shaken in late 2024 and early 2025 by the announcement of R1, developed by the Chinese company DeepSeek: a model with reported results comparable to top-tier models such as OpenAI-o1 (Guo et al., 2025). In their paper, Guo et al. showcase post-training as large-scale unsupervised reinforcement learning (RL) – a machine learning method in which a model learns to make decisions by interacting with an environment and receiving rewards or penalties based on its actions, without relying on supervised fine-tuning (SFT – a training process in which a pre-trained model is adjusted using labeled data to improve its performance). This is achieved by applying RL directly to the V3 base model to create R1-Zero. The final production model, R1, used a multi-stage pipeline with cold-start SFT, reasoning-oriented RL, and rejection-sampled self-generated SFT. The demonstrated shift from all-human SFT is a major innovation, as it suggests that the laborious part of data preparation can be automated with a capable base model that can curate the data for fine-tuning without human involvement.
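The rejection-sampling step of such a pipeline can be sketched in a few lines: several candidate answers are sampled per question, and only those that pass an automatic check are kept as fine-tuning data. The sampler, the checker, and the arithmetic task are hypothetical stand-ins, not the procedure reported by Guo et al. (2025).

```python
import random

def sample_candidates(question, n, rng):
    # Hypothetical stand-in for sampling n answers from an RL-trained model;
    # some answers are deliberately perturbed to simulate mistakes.
    a, b = question
    return [a + b + rng.choice([-1, 0, 0, 1]) for _ in range(n)]

def is_correct(question, answer):
    # Rule-based check (assumption): the task has an automatically verifiable answer.
    return answer == sum(question)

def build_sft_set(questions, n=8, seed=0):
    rng = random.Random(seed)
    dataset = []
    for q in questions:
        accepted = [c for c in sample_candidates(q, n, rng) if is_correct(q, c)]
        if accepted:  # questions without any accepted sample are skipped
            dataset.append((q, accepted[0]))
    return dataset

sft_data = build_sft_set([(1, 2), (10, 5), (7, 7)])
```

No human labels the accepted samples, which is the sense in which data curation is automated.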
Guo et al. (2025) also demonstrate the distillation of the reasoning patterns of larger models into smaller ones, permitting performance and computational resources to be re-balanced in favor of performance. The distillation experiment was carried out by fine-tuning open-source models like Qwen and Llama, using 800,000 samples curated with DeepSeek-R1. From a computationally restricted perspective, the performance of the smallest models, Llama 8B and Qwen 7B, is interesting: the benchmark results show that these distilled models are comparable to GPT-4o, Claude-3.5-Sonnet and even OpenAI o1-mini in performance, with some variance between benchmarks. The researchers note that only SFT was applied to the distilled models and that unsupervised RL could provide a further, substantial boost in model performance.
Titans Architecture and Variants
Behrouz et al. (2024) have recently introduced a group of novel architectures named Titans, which aim to mimic the memory of humans with varying implementations of certain core ideas. In essence, the Titans architecture introduces three types of memory: short-term, long-term, and persistent. Short-term memory in the architecture core consists of an attention mechanism and resembles human working memory; long-term memory, meanwhile, a deep neural memory, stores information over extended periods of time. The persistent memory acts as a meta-memory that ensures patterns and frameworks are held in the memory by learning data-independent meta-information. The persistent memory weights remain fixed during test-time as these are responsible for storing the knowledge about the task, whereas neural memory learns during test time and working memory (core) uses in-context learning, enabled by the attention mechanism associated with transformer architectures.
The major improvements of the proposed architectures are, as the paper’s title Titans: Learning to Memorize at Test Time indicates, their ability to learn during test time (i.e., during inference) and to scale beyond a two-million-token context window. The context window refers to the amount of input that can be processed within a single input–output interaction.
The neural long-term memory utilizes a surprise metric, which is estimated from the gradient so that a larger gradient indicates a greater difference from the previous data. This surprise metric is divided into past surprise and momentary surprise to take both the recent past and the latest incoming data into account. Likewise, the architecture incorporates a forgetting mechanism to forget past data that is no longer relevant.
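The interplay of momentary surprise, past surprise, and forgetting can be sketched with a single scalar memory weight. This is an illustration that assumes a momentum-style update; the coefficient names `eta`, `theta`, and `alpha` and the scalar memory are simplifications, not the deep neural memory of the Titans paper.

```python
def memory_step(w, surprise, key, value, eta=0.5, theta=0.1, alpha=0.0):
    # Gradient of the squared error (w * key - value) ** 2 with respect to w:
    # a larger gradient means the new data differs more from what is stored.
    grad = 2.0 * (w * key - value) * key
    momentary = -theta * grad
    # Past surprise decays by eta and the momentary surprise is added to it.
    surprise = eta * surprise + momentary
    # alpha > 0 forgets a share of the old memory before the update is applied.
    w = (1.0 - alpha) * w + surprise
    return w, surprise

# Updates happen at inference time, one incoming (key, value) pair at a time.
w, s = 0.0, 0.0
for _ in range(200):
    w, s = memory_step(w, s, key=1.0, value=2.0)
```

With repeated identical inputs the surprise shrinks toward zero and the memory settles on the association it has seen, while a sudden change in the data would produce a large gradient and a large update.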
The result is a more effective architecture that scales beyond a two-million-token context window while performing above the baselines. The proposed architectures excel in handling long sequences, due to the surprise metric, and introduce test-time learning into the current state of the art. The implications are extensive, as the proposed architectures may have a tremendous impact on military applications: test-time learning may improve domain-specific tuning during use, without the standard cycle of training and deployment.
Agentic AI
In a recent development, the AI models crafted and trained with the methods described above are turned into “agents” that can execute tasks unsupervised (Shavit et al., 2023). While OpenAI engineers define the degree of agentic AI in terms of goal complexity, environmental complexity, adaptability, and independent execution, the primary characteristic is the ability to “take actions which consistently contribute towards achieving goals over an extended period of time, without their behavior having been specified in advance” (Shavit et al., 2023). The key idea is to automate certain workflows to enhance the productivity and performance of the complete process, in which the AI model functions more like a human worker instead of a component within the workflow.
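The basic control loop of such an agent can be sketched as follows. In a real system the policy would be a GenAI model choosing the next action; here a hand-written rule and a single lookup tool serve as hypothetical placeholders.

```python
def run_agent(goal, policy, tools, max_steps=10):
    # The agent repeatedly picks an action, executes it with a tool, and
    # records the observation until it decides that the goal is reached.
    history = []
    for _ in range(max_steps):
        action, argument = policy(goal, history)
        if action == "finish":
            return argument, history
        observation = tools[action](argument)
        history.append((action, argument, observation))
    return None, history  # step budget exhausted without finishing

def toy_policy(goal, history):
    # Placeholder policy (assumption): look the goal up once, then finish.
    if not history:
        return "lookup", goal
    return "finish", history[-1][2]

TOOLS = {"lookup": lambda key: {"status": "ready"}.get(key, "unknown")}
result, history = run_agent("status", toy_policy, TOOLS)
```

The `max_steps` budget is one simple way of bounding independent execution; the sequence of actions itself is not specified in advance but chosen step by step.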
Summary
This section traced the evolution of GenAI, grounding the current landscape in the foundational transformer architecture. The state of the art is characterized by a dynamic tension between two competing trends. On one hand, innovations like mixture of experts (MoE) and chain of thought (CoT) are pushing the boundaries of model scale and reasoning capability, primarily in large, resource-intensive settings. On the other hand, a parallel and equally important trend focuses on efficiency, accessibility, and adaptation. Techniques such as knowledge distillation are enabling the creation of smaller, specialized models that retain a meaningful level of performance compared to their larger counterparts but are feasible for deployment in constrained environments.
More recent breakthroughs promise to accelerate this trend. Unsupervised reinforcement learning offers a path to automate and scale model fine-tuning without laborious human oversight, while novel architectures like Titans introduce the capacity for models to learn continuously during use (“test-time learning”).
Together, these advancements create a technological landscape with considerable potential for agentic AI and for the utilization of GenAI in the military. The following literature review analyzes the extent to which these state-of-the-art capabilities have been translated into practical military applications, exploring the opportunities and the significant challenges that arise when applying these powerful but complex technologies in a defense context.
Literature Review
Table 1 displays all 29 articles, classified by type according to the categories presented in the Methods section. The majority of these papers fall into the survey and application categories, while policy analysis, proposition, and overview papers are present in near-equal proportions. The “other” category, which covers rarely used methodologies, includes a case study and a policy analysis.
Table 1
Articles Analyzed in the Literature Review.
| TYPE | COUNT | PAPERS |
|---|---|---|
| Survey | 8 | Andreoni et al. (2024), Geist et al. (2024), Liu G. et al. (2024), Stebbins et al. (2024), Feffer et al. (2024), Moy and Gradon (2023), Huang et al. (2025), Mikhailov (2023) |
| Proposition | 4 | Black et al. (2024), Oniani et al. (2023), Lee et al. (2025), Tian et al. (2025) |
| Overview | 3 | Brearcliffe et al. (2023), Kelly and Smith (2024), Rashid et al. (2023) |
| Application | 10 | Hua et al. (2024), Goecks and Waytowich (2024), Chuang and Cheng (2022), Liu X. et al. (2024), Liu T. et al. (2025), Barzyk et al. (2024), Lee et al. (2023), Ruiz and Sell (2024), Rivera et al. (2024), Lin et al. (2024) |
| Review | 2 | Zarrar and Kakar (2024), Caballero and Jenkins (2025) |
| Other | 2 | Marcellino et al. (2023), Beauchamp-Mustafaga et al. (2024) |
| Total | 29 | |
While the survey papers provide valuable insight into previous research, current trends and future prospects, this analysis focuses on the application studies as these are hypothesized to provide concrete examples of both challenges and opportunities in adopting GenAI into military use.
Table 2 lists the application studies with general topics and key contributions; proposition papers are briefly analyzed, as these papers address a problem with a proposition that, hypothetically, ought to give insight into future applications.
Table 2
GenAI Application (Type A) and Proposition (Type P) Papers.
| PAPER | TYPE | TOPIC | KEY CONTRIBUTION |
|---|---|---|---|
| Lin et al. (2024) | A | Decision making and decision support | Historic battle analysis with LLM multiagent simulations |
| Hua et al. (2024) | A | Decision making and decision support | Historic strategic international conflict analysis with LLM multiagent simulations |
| Goecks and Waytowich (2024) | A | Decision making and decision support | Enhancing course of action (COA) generation with LLMs |
| Rivera et al. (2024) | A | Decision making and decision support | Examining escalation risks associated with LLM use in decision making |
| Chuang and Cheng (2022) | A | Decision support | Conversational AI systems for military training using intent detection and response generation techniques |
| Liu X. et al. (2024) | A | Information extraction and fine-tuning | LLM approach for extracting military equipment entities from unstructured text to build a military knowledge base |
| Barzyk et al. (2024) | A | Cybersecurity | GenAI methodology for automating data tagging in military zero trust architecture cybersecurity frameworks |
| Lee et al. (2023) | A | Decision making and decision support | Proposes Deep AI Military Staff (DAMS), a multi-agent AI system for battlefield decision-making, and introduces Multi-Agent Collaboration Architecture to enhance situational awareness |
| Ruiz and Sell (2024) | A | Information extraction and fine-tuning | Presents TRACLM, a fine-tuned large language model (LLM) developed for the U.S. Army to improve AI-driven decision-making and intelligence analysis |
| Liu T. et al. (2025) | A | Cybersecurity | The use of GenAI for enhancing cross-layer covert communication in military networks |
| Black et al. (2024) | P | Strategic advantage | Proposes a strategic framework that positions GenAI within the broader context of military competition |
| Oniani et al. (2023) | P | Ethical principles | Proposes ethical cross-domain principles for transparency, value-alignment and accountability |
| Lee et al. (2025) | P | Collaboration | Proposes a system-level architecture using federated learning (FL) as a collaborative framework for LLM training for allied nations |
| Tian et al. (2025) | P | Unmanned systems | Explores the integration of LLMs and LVMs into UAVs |
In-Depth Analysis of GenAI Applications and Propositions
The papers presented in Table 2 show that a majority of the applications are focused on decision making and decision support. Cybersecurity follows with two papers, on par with information extraction and fine-tuning.
Decision Making and Decision Support
In the process of decision making, a choice is made according to constraints presented by the environment, available information, and cognitive capacity (Simon, 1977). Balis and O’Neill (2022) have noted that while AI has had a profound impact on military decision making, it has attracted little attention beyond specialists. To enhance decision making, decision-support systems (DSS) have been devised (Sprague, 1980), recently integrating AI into the military decision-making process (MDMP; see Nadibaidze et al., 2024).
Goecks and Waytowich (2024) have researched the use of GenAI in accelerating course of action (COA) development in military operations. COA development is a critical component in most MDMPs, where several different COAs are developed for comparison to enhance decision making with regard to the situation and the objective. The research utilizes the Operation TigerClaw scenario (Narayanan et al., 2021) and the StarCraft II Learning Environment (Vinyals et al., 2017) as the setting and environment, with OpenAI’s GPT-4-Turbo and GPT-4-Vision prompted to act as COA-GPT (GPT standing for “generative pre-trained transformer”). The research indicates that using LLMs for COA generation accelerates the process substantially, outperforming reinforcement learning baselines, with better correlation to the commander’s intent. However, expert human scores are still comparable or superior to the baselines. The most striking difference is in friendly force casualties, where both the textual and the visual COA-GPT incur a far greater number of casualties.
Rivera et al. (2024) have studied the escalation risk of LLMs in military and diplomatic decision making. The research setting puts eight LLMs as nation agents in a simulation where the models engage in military and foreign policy. The LLMs are selected from five publicly available models – GPT-3.5, GPT-4, Claude 2.0, Llama-2-Chat and GPT-4-Base. The reported results indicate that the models are more prone to escalate than to de-escalate a situation and that individual models differ in their profiles and predictability. It was also shown that without instruction tuning, a base model will choose the most severe actions more often than other models – but it should be noted that none of the models were tuned in any way to follow a military or foreign policy doctrine, acting, instead, on the basis of their training data and system prompts.
In a similar setting, Hua et al. (2024) have researched an LLM-based multi-agent system named WarAgent in simulating historical events. The simulation used GPT-4 models as country agents, with secretary agents as safeguards to avoid the impact of fallacies. The agents were prompted to act according to specific roles. To evaluate the inevitability of war the models were also tuned with counterfactual information to avoid mere replication of history. The results suggested that avoiding war depends on changes in national policies.
In parallel with this strategic-level point of view, Lin et al. (2024) created BattleAgent, combining a large vision language model (VLM) and a multi-agent system to simulate historical battles to increase understanding. The setting zooms from the macro-level approach by Hua et al. (2024) to the battlefield, with commander and soldier agents acting in coordination with their respective side against the similarly designed adversary. The approach was tested in emulating three historic battles; on average the results aligned with historical data, albeit with a notable variance between the models GPT-4, GPT-4-vision, and Claude-3. While both WarAgent and BattleAgent focused on testing and improving historic analysis, the approach and initial results demonstrated how GenAI can be leveraged in simulating warfighting outcomes on different levels, from a tactical setting to strategy and diplomacy.
Chuang and Cheng (2022) designed a task-oriented dialogue system for military scenarios. The system integrated slot-filling, intent detection, and retrieval-based answering into answer generation, using a Chinese corpus of military training missions. The methodology leverages natural language processing (NLP) and natural language generation (NLG) methods without LLMs. The proposed system extracted features for intent detection and created intent-entity relations for question generation. The results showed a query satisfaction greater than 80% over 8 scenarios after two rounds of dialog.
Lee et al. (2023) demonstrated a multi-agent-based collaboration architecture for manned and unmanned assets. The focus of the paper is on enhancing battlefield data processing and situational awareness with AI and automation. A part of the framework uses a conditional variational auto-encoder with a GAN to complete images of objects that are hidden or occluded. The results indicate that AI solutions, including narrow applications of GenAI, can make the battlefield considerably more transparent than it currently is.
Cybersecurity
Barzyk et al. (2024) have studied supporting zero trust architecture implementations with GenAI-automated data tagging, proposing a novel labeling tool to remove error-prone and time-consuming manual tagging. The zero trust approach does not assume any part of the system to be a trusted zone; instead, it applies appropriate control measures to connections and data access throughout, which requires data tagging. This means, for example, assigning classifications or other criteria to the data. The research group tested fine-tuning BERT (Devlin et al., 2019) and TinyLlama (Zhang et al., 2024) for a classification task to sort messages into categories. The results indicated that comparatively simple GenAI methods with modest fine-tuning afforded adequate results and therefore offer a strong foundation for improving the utilization of zero trust architectures.
Liu T. et al. (2025) have researched GenAI in cross-layer covert communication, where the layers consist of physical, network, and application layers. They propose a GenAI framework in which a generative model is used for knowledge transfer for channel optimization, channel quality evaluation for communication path planning, and enhanced concealment for warden evasion. In this regard, the warden systematically collects and analyzes data to identify patterns of covert transmissions. The applicatory part of the study compared diffusion RL as diffusion soft actor-critic (DSAC) to the traditional soft actor-critic (SAC) approach, providing numeric results that showcased both training efficiency and performance advantage for the diffusion-based approach. However, the DSAC used is not described in detail and the experimental setup description leaves room for interpretation.
Information Extraction and Fine-Tuning
Liu X. et al. (2024) have researched military equipment entity extraction using LLMs, using what they call chain-of-thought named entity recognition (CoTNER). The research highlights difficulties presented by military terminology in contextual analysis and the data requirements and computational burdens of very large models. As a solution, the researchers use GPT-3.5 for data augmentation with chain-of-thought, combined with high-quality data filtering, in order to fine-tune a relatively small Llama-3-8B-Instruct model. In comparison with models such as BERT-MRC (Li et al., 2019) and GPT-NER (Wang et al., 2023), the proposed CoTNER showcases better performance in all metrics. In a way, the data augmentation part can be viewed as knowledge distillation, as GPT-3.5 (175B), a considerably larger LLM, is used to produce the fine-tuning data for the eight-billion-parameter Llama 3 model.
Ruiz and Sell (2024) have researched fine-tuning open-source LLMs for the army domain. While their experimental fine-tuning was performed on small LLMs at the 3–7 billion parameter scale, novel methods show promise for expanding local fine-tuning up to and beyond 70 billion parameter models. The best results were obtained when the fine-tuning was enhanced with a synthetic instruction tuning data set (knowledge distillation, created with the Mixtral-8x7B-Instruct-v0.1 model). A key takeaway regarding domain-specific considerations was the relation between tokenization and the contextual connections that result in domain-specific knowledge.
Proposed Frameworks, Concepts and Architectures
In addition to technical applications, a subset of the reviewed literature focuses on the development of conceptual frameworks and architectures aimed at guiding the integration of GenAI in military contexts. These works are essential for setting strategic direction, enabling interoperability, and addressing ethical and organizational considerations beyond model performance alone.
While not solely GenAI focused, Black et al. (2024) propose a strategic framework that positions GenAI within the broader context of military competition, emphasizing the need for national policies that anticipate the disruptive nature of AI-enabled capabilities. Their work outlines how strategic advantages may stem not only from technological superiority but also from organizational adaptation and ethical governance.
Oniani et al. (2023) contribute a conceptual model for cross-domain ethical principles, drawing parallels between GenAI use in military and healthcare domains. Their framework emphasizes the importance of transparency, accountability, and value alignment in AI-enabled decision-making – principles that are especially relevant in high-stakes environments such as conflict scenarios and lethal autonomous systems.
Lee et al. (2025) explore a system-level architecture for federated learning (FL) (McMahan et al., 2017; Ji et al., 2024) among allied nations, proposing a collaborative framework for secure training of military LLMs. The use of FL as a collaborative framework for LLM training is interesting and has many upsides for military applications, including reduced communication overheads and enhanced data privacy (Sani et al., 2024; Iacob et al., 2024). The work of Lee et al. (2025) highlights the vulnerability of GenAI systems to prompt injection attacks [1], especially in decentralized training environments, and offers mitigation strategies to ensure robustness and trust. This architecture represents an important conceptual advance in enabling secure distributed GenAI development within coalition contexts.
Tian et al. (2025) present LLMs, along with vision foundation models (VFMs) and vision language models (VLMs), as foundation models that can be leveraged for unmanned aerial vehicles (UAVs). The key difference from LLMs is that VLMs integrate textual and visual information, permitting operations such as visual reasoning, whereas VFMs have become a core technology in computer vision. In their work, Tian et al. (2025) examine the role and use of these foundation models and propose a framework for an agentic UAV capable of processing multimodal data from different sources to advance autonomy and swarm behavior through the cooperation of multiple agents. In essence, the paper maps current technologies to multimodal data sources and lays out a framework to create GenAI-powered UAVs.
Summary
In short, the ten analyzed GenAI application papers highlight two technical approaches: the use of proprietary LLMs in a surrogate task to experiment with and evaluate off-the-shelf capability; and the fine-tuning of smaller models into mission- and task-specific versions. While most of the papers had some decision-support point of view, it has also been argued that humans remain in a critical role in wargaming (Hinton, 2023), which is a key component of the military decision-making process.
While the relatively low number of cybersecurity topics is worth noting, most military cybersecurity research and development is, of course, not public, which may explain the discrepancy. Moreover, the growing number of increasingly complex and intelligent systems means that security is in increasingly high demand, and there exists plenty of research on GenAI cybersecurity topics (see, for example, Yigit et al., 2024) that, while not solely military-focused, is still applicable to the military domain. In a similar manner, while GenAI-specific algorithmic bias is not well covered, broader military AI bias has been researched (Bode and Bhila, 2024).
Together, these studies offer valuable building blocks for understanding how GenAI systems can be effectively integrated into defense infrastructure. However, a common theme across the literature is the lack of standardization in conceptual models and the absence of empirical validation for many of the proposed frameworks. As GenAI technologies evolve, there is a growing need for operationally grounded, interoperable, and security-conscious architectures that can guide real-world implementations across national and allied defense systems, as well as for military baselines and benchmarks that can be used to evaluate the performance of AI models and systems.
Out of the introduced state-of-the-art GenAI research, none except few-shot learning and agent-like approaches were experimentally present in the analyzed research papers as the focus of interest. Rather, certain technologies such as MoE and distillation have enabled a scaling up of the large, leading-edge proprietary models, indirectly influencing or enabling the analyzed application studies. Hence, it can be argued that the cutting edge of research, driven by industry, is yet to enter the military domain.
All in all, the analyzed research shows a tendency towards smaller models that underwent fine-tuning or the use of proprietary large models with system prompting to get favorable results. The state-of-the-art research from the past years enables both the scaling up of large models as well as size reduction through distillation whilst maintaining performance. According to what is known as the scaling law, performance increases linearly when model size increases exponentially (Matarazzo & Torlone, 2025). Proprietary models are driving the development, as they exhibit the best performance and function as the stepping stone towards capable smaller models. The immense data and computation requirements make experimenting with in-house base models difficult (Matarazzo & Torlone, 2025). The resource-intensive nature of GenAI entrenches the importance of large cloud service providers that have, by the nature of their business, significant computation and memory capacity and large amounts of data. As an alternative, open-weight models can be leveraged for domain-specific purposes with smaller computational resources while reaching similar performance, but with considerably more effort, especially with regard to fine-tuning. The challenges of military data sets (Rettore et al., 2024), military-tuned models, benchmarks, collaboration and frameworks hinder evaluation, validation, and incremental research and development.
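Read literally, the scaling-law statement above describes a logarithmic relationship between model size and performance. As a schematic illustration only, let P denote a generic performance score, N the parameter count, and a and b unspecified constants; all of these symbols are assumptions introduced here and are not taken from the cited work.

```latex
P(N) \approx a + b \log_{10} N
\quad\Longrightarrow\quad
P(10N) - P(N) \approx b
```

Under this reading, each tenfold increase in model size adds roughly the same fixed increment b to the score, which is why constant gains in performance demand rapidly growing data and computation.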
The implications of frameworks greatly extend the conceptual scope of certain key issues. The need for governance frameworks, for example, is evident in high-stakes environments, as the applicatory solutions need to be rooted at higher levels of guidance, conduct of operations, and doctrine. The strategic impact (Black et al., 2024), ethical framework (Oniani et al., 2023) and secure collaboration (Lee et al., 2025) serve as the foundation on which ground-level capabilities such as unmanned systems (Tian et al., 2025) can be deployed.
Current State, Trends and Challenges
The landscape of GenAI research over the past two to three years reflects a dual trend: the continued scaling of large proprietary models and a growing ecosystem of smaller, open-weight models tailored for domain-specific use. As shown in Figure 2, state-of-the-art LLMs now span from massive, multimodal systems with tens or hundreds of billions of parameters to computationally cost-efficient, specialized models that can run in constrained environments. This trend is of particular interest to the military domain, where deployment constraints, security requirements, and operational reliability often preclude the use of cloud-based or opaque AI systems.

Figure 2
The known or estimated (uncertain) parameter counts for a selection of top models from 2023 to 2025 (Cardillo, 2025; Farabet & Warkentin, 2025; LLM stats, 2025; Luukkonen et al., 2024; Ali et al., 2024; Martins et al., 2024; Martins et al., 2025).
Despite major advancements in architectures (mixture of experts and Titans, for example), reasoning strategies (chain of thought, for example), and training paradigms such as unsupervised RL and distillation, the current research base reveals several important gaps:
Limited operational realism: Most application studies rely on unclassified data, simulated settings, and public LLMs. There is a lack of experimentation with models trained on domain-specific or sensitive military corpora.
Lack of end-to-end system integration: Research tends to focus on model capability (COA generation or information extraction, for example), but few studies demonstrate full system-level integration into command and control or intelligence, surveillance, and reconnaissance (ISR) workflows.
Underrepresentation of security and ethical risks: While technical breakthroughs are covered, there is comparatively less focus on adversarial risks, governance frameworks, or explainability under operational pressure.
Despite these gaps, recent non-military-focused research has addressed several problems with GenAI, such as cognitive impairment (Kosmyna et al., 2025), explainability (Schneider, 2024), and the risks of agentic AI (Narajala & Narayan, 2025). The emergence of open-source ecosystems and the proliferation of modular, composable AI architectures present a tangible opportunity for military organizations to shape GenAI systems around their own operational needs. Techniques such as teacher-student distillation, federated training, fine-tuning, and test-time learning offer pathways to align GenAI performance with domain-specific constraints, particularly in low-resource or disconnected environments.
In general, the findings of this study illustrate a dynamic and rapidly evolving landscape in the application of GenAI to military contexts. While the potential of GenAI to transform key functions such as decision making, simulation, cybersecurity, and training is clear, the majority of the current research remains in exploratory or experimental stages. Most application studies reviewed rely on either publicly available large models or small-scale fine-tuned versions, typically operating outside of classified or high-assurance environments.
Importantly, the analysis highlights a disconnect between state-of-the-art GenAI capabilities and their practical military adoption. Breakthroughs such as mixture of experts architectures, unsupervised reinforcement learning, and scalable long-context models are primarily driven by commercial actors, and their integration into defense systems is slowed by operational, ethical, regulatory, and classification barriers. For example, Garcia (2024) notes that the lack of international law on AI and algorithmic decision making compounds the existing ethical challenges.
While the research community is making rapid progress on model architecture and training efficiency, the military application of these innovations remains in its early stages. The limited availability of secure, in-domain datasets and systems further restricts experimentation with powerful LLMs in high-stakes scenarios. A key insight from the analysis is the strong orientation toward decision-support systems, which aligns with broader trends in both civilian and defense AI. GenAI shows considerable promise in accelerating COA generation, simulation and wargaming, information extraction, and situational awareness enhancement. However, as demonstrated in multiple studies, the outcomes generated by large language models (LLMs) are still vulnerable to hallucinations, unintended escalation behaviors, and unpredictable outputs – particularly when models are not aligned with military doctrine or contextual nuance.
It can be argued that proprietary models, notwithstanding their impressive benchmark results, are not suitable for military use as an off-the-shelf product. This is demonstrated both by the escalation behavior observed in Rivera et al. (2024), which may stem from the disconnect between the training data and the intended use case, and by COA-GPT (Goecks & Waytowich, 2024). The research highlights the issues with very large proprietary models, mainly the lack of explainability due to the vast size of the model, undisclosed model weights, and training data, which are linked to the expected behavior. Models of this size and scale are bound to have irregularities when applied to a specific, possibly ill-suited domain and its constraints, such as the military context. As an example, we can hypothesize that the escalation behavior may stem from the training data, as the public internet can be argued to be filled with input-output pairs that are more aggressive than actual military doctrines.
As a countertrend, the adoption of smaller, fine-tuned models is emerging as a pragmatic approach for military deployment. These models offer the benefits of interpretability, deployability in secure or constrained environments, and lower computational demands. Yet, fine-tuning alone does not solve challenges related to contextual reasoning, long-term memory, or instruction-following. The performance of smaller models falls short, and the general applicability is naturally lower. Recent advances such as chain of thought prompting, knowledge distillation, and test-time learning offer promising pathways, but these methods remain underutilized in military settings, often due to resources, data (Menthe et al., 2024), or policy limitations. This disconnect between model performance, explainability and suitability for a particular task in a constrained environment prompts a need for a larger, military-specific base model that can be used as both the military equivalent of large proprietary models and as a distillation base model for mission-specific use cases. Hence, the bottlenecks are data availability and computation.
Lee et al. (2025) note a potential route to credible GenAI capability for allied nations: a framework for a secure, system-level federated learning architecture in which LLMs are trained collaboratively. The FL paradigm would solve several issues, as it promotes privacy and the ownership of data, reduces communication overheads, and enables tuning for user-specific needs. As a result, allied nations would have their own local LLMs trained on local and possibly sensitive military data, and only model parameters would be shared, under differential privacy, with the global model for system-level updates. In the FL scheme, the global model is updated according to local model inputs and the global parameters are then distributed back to local use, without sharing the data itself. This approach may eventually provide a credible architecture for enhancing GenAI capabilities while simultaneously dealing with certain military limitations.
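The update cycle described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the architecture of Lee et al. (2025): parameters are plain lists of floats, all participants are weighted equally, and privacy is approximated by clipping each update and adding Gaussian noise. The function names and the values of max_norm and noise_std are hypothetical.

```python
import random


def clip(update, max_norm):
    """Scale an update vector down so that its L2 norm is at most max_norm."""
    norm = sum(x * x for x in update) ** 0.5
    if norm <= max_norm or norm == 0.0:
        return list(update)
    return [x * max_norm / norm for x in update]


def privatize(update, max_norm, noise_std, rng):
    """Clip the update, then add Gaussian noise to each coordinate."""
    return [x + rng.gauss(0.0, noise_std) for x in clip(update, max_norm)]


def federated_round(global_params, local_params_list,
                    max_norm=1.0, noise_std=0.0, seed=0):
    """One round: participants share only privatized parameter deltas, never data."""
    rng = random.Random(seed)
    deltas = []
    for local in local_params_list:
        # Difference between the locally trained parameters and the global ones.
        delta = [l - g for l, g in zip(local, global_params)]
        deltas.append(privatize(delta, max_norm, noise_std, rng))
    n = len(deltas)
    # Unweighted mean of the shared deltas (an assumption made for simplicity).
    mean_delta = [sum(d[i] for d in deltas) / n for i in range(len(global_params))]
    # The updated global parameters are what would be sent back to participants.
    return [g + m for g, m in zip(global_params, mean_delta)]
```

For example, federated_round([0.0, 0.0], [[0.2, 0.0], [0.0, 0.4]]) averages the two local updates into new global parameters without either participant exposing its training data. Setting noise_std above zero trades some accuracy for stronger privacy.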
The analysis thus points out two major obstacles to successful GenAI adaptation to military use: reliance on proprietary models, and a lack of resources, architecture and infrastructure. Even those open-weight models fine-tuned for military use originate either in industry or academia. Besides the AI behemoths in the technology industry, smaller models have been designed and trained by public-sector actors, including EuroLLM (Martins et al., 2025) and Teuken (Ali et al., 2024). Military AI capability driven solely by academia or industry cannot be deemed acceptable in the long term. To design and train, for example, a NATO base model family, the military alliance requires resources, infrastructure and cooperation with industry and academia. Here “resources and infrastructure” denote the data, computation and personnel; “architecture” denotes the way these are connected to enable training and fine-tuning “military-grade” AI models. Bridging this gap requires a deliberate shift toward applied experimentation, secure deployment environments, and closer collaboration between AI researchers, defense technologists, and policy stakeholders.
Finally, while ethical concerns, human-machine teaming, and governance frameworks are acknowledged in the literature, they appear underrepresented in empirical application studies. This gap underscores the need for multidisciplinary collaboration that goes beyond technical validation to include normative, strategic, and policy-oriented perspectives. The role of NATO and allied defense organizations in setting standards, enabling trusted infrastructure, and coordinating joint research efforts will be critical in closing this gap.
Discussion
The focus on publicly available research presents a clear limitation of this study. Further research and development conducted by individual nations and coalitions lie beyond the scope of the present analysis. While this limitation is undeniable, it can also be argued that the pace of innovation is primarily set by industry. The cutting edge of research is typically disseminated through open-access academic publications, with observable trends gradually moving from civilian to military applications, as documented in this paper.
Although not exhaustive, this study highlights evidence-based trends and challenges at the structural level related to the military adaptation of GenAI. The current rate of advancement in GenAI appears to divide stakeholders into two groups: early adopters and onlookers. It can be hypothesized that the current craze for GenAI may hinder the adaptation of current state-of-the-art due to the promise of what might come next. Some research already considers the possible invention of artificial general intelligence (AGI; Black et al., 2024), which can be perceived in two ways. AGI may be seen as a strategic capability requiring further effort if it is to be attained; or it may be held to be an inevitable outcome requiring that we simply wait. As development is already driven primarily by industry, this study does not advocate waiting for future technological advancements. Instead, it encourages the creation and development of functional frameworks and architectures, the accumulation of high-quality data, and the establishment of mission-relevant benchmarks to facilitate the adaptation of GenAI and other AI methods sooner rather than later.
Conclusion
This study has presented a structured review and analysis of current research on the use of GenAI in military applications, covering 29 publications from 2022 to early 2025. The literature reveals both significant potential and clear limitations in the application of GenAI across defense-related domains such as decision making, cybersecurity, training, simulation, and information processing.
A key finding is that while large language models offer valuable capabilities, especially in accelerating planning and improving information synthesis, their use in mission-critical military environments remains constrained by challenges in trust, reliability, and operational security. Many current applications rely on either general-purpose public models or fine-tuned small-scale models, with limited alignment to military doctrine, context, or classification requirements.
At the same time, recent advances in GenAI such as chain of thought prompting, knowledge distillation, unsupervised reinforcement learning, and architectures enabling test-time learning are rapidly expanding the technological frontier. These innovations offer viable strategies to bridge the gap between model performance and deployability in defense contexts. However, these methods have yet to be widely adapted for military use, particularly in secure or resource-constrained environments.
To realize the full potential of GenAI in the military domain, several priorities emerge as recommendations to stakeholders. Foremost, AI governance frameworks need to be operationalized to ensure responsible deployment, especially in high-stakes scenarios. The use of GenAI capabilities necessitates the development of human-AI teaming models that integrate explainability and oversight into decision-making workflows.
From a technical point of view, the adaptation requires developing and deploying secure and robust system architectures that enable and boost AI development and deployment at scale, including data pipelines and management. Fundamentally for GenAI in the military domain, the design and training of a family of military base models, mimicking the current field of proprietary models, should be a top priority for allied nations. This should be done in collaboration with trusted industry partners and academia and aligned with investment in the research and development of scalable, domain-specific fine-tuning pipelines for small and medium-sized LLMs that can operate in classified or disconnected environments. This technical endeavor should be complemented by the strengthening of NATO and allied research collaborations to foster joint development, experimentation, and standard-setting for GenAI in defense and security.
The trajectory of GenAI in the military domain is promising but uneven. Continued research and cross-sector coordination are essential for the move from theoretical promises to robust, mission-ready capabilities. The growing body of literature provides a strong foundation for future work that integrates technical, ethical, strategic, and operational perspectives into a cohesive vision for AI-enabled defense.
Abbreviations
AI – artificial intelligence
COA – course of action
CoT – chain of thought
FL – federated learning
GAN – generative adversarial network
GenAI – generative artificial intelligence
LLM – large language model
LRM – large reasoning model
LSTM – long short-term memory
MDMP – military decision-making process
MoE – mixture of experts
NLG – natural language generation
NLP – natural language processing
RL – reinforcement learning
RNN – recurrent neural network
SFT – supervised fine-tuning
VFM – vision foundation model
VLM – vision language model
Notes
[1] A prompt injection attack is a security exploit where an adversary provides malicious input to an LLM to make it bypass its safety constraints or execute unintended commands (Perez & Ribeiro, 2022).
Competing Interests
The authors have no competing interests to declare.
