Giving Cows a Digital Voice – AI-Enabled Bioacoustics and Smart Sensing in Precision Livestock Management – A Review

Mayuri Kate; Suresh Neethirajan

doi:10.2478/aoas-2025-0091

Annals of Animal Science

Volume 26 (2026): Issue 3 (April 2026)

By: Mayuri Kate and Suresh Neethirajan

Open Access | Apr 2026

Figures & Tables

Twenty-year evolution of AI methods for bovine vocalization research, illustrating the shift from manual spectrogram analysis to multimodal, edge-deployed models enhanced by large language models

PRISMA flow diagram summarizing the literature search and screening process. Out of 248 initially retrieved records, 124 core studies and 30 supporting background papers were included in the final qualitative synthesis

Comparison of a traditional, human-centric welfare assessment workflow (left) with an AI-enhanced, sensor-driven loop (right). The manual pathway relies on periodic visual scoring and can delay intervention by days, whereas the smart pathway fuses continuous acoustic, motion and video data, runs edge AI for instant anomaly detection, and provides interpretable alerts that prompt rapid farmer action

Illustration of an AI-driven acoustic analysis pipeline for decoding bovine vocalizations, from audio acquisition through preprocessing and modeling to real-time farm alerts

Noise adaptation pipeline used in our review’s NRFAR-style studies. Raw barn audio first undergoes spectral gating and bandpass filtering, then passes through an adaptive denoiser whose coefficients are fine-tuned on site-specific noise samples. A log-Mel feature bank feeds a noise-aware CNN that outputs both class and confidence; low-confidence events trigger a feedback loop that stores new noise exemplars and refreshes denoiser parameters, maintaining robustness without full model retraining
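The gating and feedback steps in this caption can be illustrated with a minimal sketch. It is not the pipeline of any reviewed study: the class `SpectralGate`, the function `process`, the multiplier `k`, and the confidence cut-off `min_conf` are all hypothetical, and a simple per-bin threshold (mean plus `k` standard deviations of stored noise magnitudes) stands in for the adaptive denoiser.

```python
from statistics import mean, pstdev


class SpectralGate:
    """Per-bin spectral gate whose noise profile can be refreshed on site."""

    def __init__(self, noise_frames, k=1.5):
        # noise_frames: list of magnitude spectra (one list of floats per frame)
        self.k = k
        self.noise_frames = list(noise_frames)
        self._refresh()

    def _refresh(self):
        # Threshold per frequency bin = mean + k * std of stored noise magnitudes
        bins = list(zip(*self.noise_frames))
        self.threshold = [mean(b) + self.k * pstdev(b) for b in bins]

    def apply(self, frame):
        # Zero every bin that does not rise above its noise threshold
        return [m if m > t else 0.0 for m, t in zip(frame, self.threshold)]

    def add_noise_exemplar(self, frame):
        # Feedback step: store a new noise frame and recompute thresholds
        self.noise_frames.append(frame)
        self._refresh()


def process(frame, gate, classify, min_conf=0.6):
    """Gate a frame, classify it, and feed low-confidence events back."""
    cleaned = gate.apply(frame)
    label, conf = classify(cleaned)  # classify is any callable -> (label, confidence)
    if conf < min_conf:
        # Low confidence: treat the raw frame as a new noise exemplar
        gate.add_noise_exemplar(frame)
    return label, conf
```

Only the gate thresholds change in the feedback step; the classifier itself is untouched, which mirrors the caption's point that robustness is maintained without full model retraining.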

NLP and LLM approaches to “cow language” translation

Schematic model for multimodal data integration (audio, movement, video) in precision livestock farming, emphasizing how sensor fusion yields context-rich insights into animal health and stress

Hybrid explainable AI multimodal (HEAM) pipeline. Four synchronous streams (audio, video, collar IMU signals and environmental sensors) are preprocessed and fused into a unified 320-dimensional feature vector. A gradient-boosted decision-tree classifier provides transparent rule-based predictions, while a large language model (LLM) converts those rules plus sensor context into plain-language guidance for the farmer. A feedback loop allows new labeled events to fine-tune the CNN front-end and refresh the tree, enabling continuous on-farm adaptation without sacrificing interpretability
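As a rough illustration of the fusion and explanation steps, the sketch below concatenates four per-stream vectors into one 320-dimensional vector and renders a fired rule path as a sentence. The caption gives only the 320-dimensional total, so the per-stream sizes in `STREAM_DIMS` are assumptions, and the string template in `explain` is a stand-in for the LLM step, not a description of it.

```python
# Hypothetical per-stream embedding sizes; only the 320-dim total comes from the caption.
STREAM_DIMS = {"audio": 128, "video": 128, "imu": 48, "env": 16}
FUSED_DIM = 320


def fuse(features):
    """Concatenate per-stream feature vectors in a fixed order."""
    fused = []
    for name, dim in STREAM_DIMS.items():
        vec = features[name]
        if len(vec) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(vec)}")
        fused.extend(vec)
    assert len(fused) == FUSED_DIM
    return fused


def explain(rule_path, label):
    """Template stand-in for the LLM step: turn a fired rule path into plain text."""
    conditions = " and ".join(f"{feat} {op} {thr}" for feat, op, thr in rule_path)
    return f"Alert '{label}' because {conditions}."


# Usage with placeholder zero vectors and an invented rule path
vector = fuse({name: [0.0] * dim for name, dim in STREAM_DIMS.items()})
message = explain([("call_f0_hz", ">", 600), ("lying_time_h", ">", 14)], "possible pain")
```

A fixed concatenation order matters here because a tree-based classifier refers to features by position, so the same index must always carry the same sensor quantity.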

Expanded key studies on bovine vocalization analysis

No.	Reference	Context	Recording setup	Algorithm applied	Data volume (calls/hours)	Performance metric(s)	Major insight/Key finding
1	Mac et al., 2022	Calf distress at weaning	3 chest-high mics, 44.1 kHz, indoor pen	k-NN on MFCC mean ± SD	600 calls	94% accuracy	High-pitched, long calls reliably indicated distress
2	Sharma and Kadyan, 2023	Dairy estrus detection	Neck collar mic, 16 kHz	SVM, RF comparison	2000 calls	SVM 95% accuracy	Estrus vocalization has signature harmonic pattern
3	Vidana-Vila et al., 2023	Continuous barn monitoring	12 ceiling mics, 8 kHz	MobileNet CNN detector	25 h audio	AUROC 0.93	Real-time detection feasible on edge device
4	Patil et al., 2024	Hunger vs. cough vs. estrus vs. normal	Hand-held recorder, 48 kHz	7-layer CNN	5200 clips	0.97 accuracy	Deep CNN discriminates four intent categories
5	Ferrero et al., 2023	6-class health dataset	Static barn mic array	CNN-LSTM hybrid	7800 segments	0.80 macro-F1	Temporal context boosts recall on rare classes
6	Röttgen et al., 2020	Individual ID in group	Dual-mic collar (airborne + structure-borne)	CNN	∼2171 events	87% correct cow ID	Wearable sensors enable individual vocal detection
7	Hagiwara, 2023	Self-supervised AVES	Mixed-species archive, cow subset	Transformer encoder	160 h unlabeled + 800 labeled	+7 pp F1 vs. CNN	SSL cuts annotation cost, improves few-shot performance
8	Martinez-Rau et al., 2023a	Chew detection collar	Collar mic + accel	RF on chewing spectra	4 h per cow × 20	92% chew vs. rumination	Detects feeding bouts for intake estimation
9	Gavojdian et al., 2024	Stress isolation study	Lav-mics, 22 kHz	Bi-LSTM	3000 sequences	0.91 F1	Sequence model spots stress more reliably
10	Sattar, 2022	Multi-intent cough/food/estrus	6 mics, 48 kHz	Spectrogram CNN	4400 clips	0.82 macro-F1	Combined dataset demonstrates multi-class viability
11	Peng et al., 2024	Behavior fusion EdgeNeXt	Audio + ACC	EdgeNeXt + fusion	220 h	95% behavior acc.	Multimodal fusion > single modality
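The table mixes accuracy and macro-F1, which are not directly comparable when classes are imbalanced. The worked example below, on made-up labels, shows how a classifier that never predicts a rare class can still score high accuracy while its macro-F1 collapses. The function names are illustrative and the numbers are not taken from any study in the table.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: 9 "normal" calls and 1 rare "pain" call
y_true = ["normal"] * 9 + ["pain"]
y_pred = ["normal"] * 10  # the classifier never predicts the rare class

print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.474
```

Because macro-F1 weights every class equally, it is the more informative figure when the rare classes (for example pain or calving distress calls) are the ones that matter for welfare.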

Comparative analysis of AI methods in bovine vocalization classification

No.	AI approach/Architecture	Typical training data volume	Key input representation	Reported best accuracy/F1	Strengths in reviewed studies	Main limitations/Failure modes	Representative use-case(s)
1	Random Forest (RF)	≈500–3000 labeled calls	Hand-crafted MFCC + temporal stats	88–93% F1 (distress vs. non-distress)	Robust to noise, interpretable feature importance	Needs manual feature engineering, weak on temporal context	Estrus-call detection (Sharma and Kadyan, 2023)
2	Support Vector Machine (SVM)	200–2 000 calls	MFCC mean ± SD, fundamental F0	86–95% accuracy (estrus vs. baseline)	Performs well on small datasets, strong margins	Sensitive to parameter tuning, scales poorly with >10 k samples	Early estrus detection wearables (Peng et al., 2023)
3	k-Nearest Neighbour (k-NN)	600 calls	Spectral centroid, duration, energy	94% accuracy for open- vs. closed-mouth calls	Simple, no training time	Storage heavy, cannot model sequence	Call-type classifier in Japanese Black cattle (Peng et al., 2023)
4	CNN (2-D spectrogram)	≥5000 call segments	Mel-spectrogram images (128 bins)	97% accuracy, 0.96 F1 (4-class)	Learns spectral patterns, no manual features	Needs GPU and large data, poor temporal memory alone	Multi-intent classifier (hunger, cough, estrus, normal) (Patil et al., 2024)
5	Lightweight CNN (MobileNet)	25 h continuous barn audio	64-bin log-mel	AUROC 0.93 at 1 s stride	Fast edge inference (<20 ms), low power	Precision drops in heavy machinery noise	Real-time call detection collar (Vidana-Vila et al., 2023)
6	LSTM/Bi-LSTM	3000 labeled sequences	Per-frame MFCC + delta MFCC (time series)	91% F1 (calf isolation vs. contact)	Captures temporal dynamics, good on sequences	Over-fitting on short clips, GPU-heavy	Isolation stress monitor (Martinez-Rau et al., 2025)
7	Hybrid CNN + LSTM	7800 segments (6 classes)	CNN spectrogram embedding → LSTM	80% overall F1, +6 pp over CNN-only on rare classes	Combines spectrum + sequence info	Needs >10 k samples to beat pure CNN	Multi-class health event detector (Ferrero et al., 2023)
8	Transformer Audio Encoder (AVES)	160 h unlabeled pretrain + 800 labels finetune	Raw 16 kHz waveform	+3–7 pp F1 over baseline CNN	Self-supervised, strong few-shot performance, domain-adaptable	Needs GPU for pretraining, complex	Few-shot call classification after self-supervised pre-training (Hagiwara, 2023)
9	EdgeNeXt Multi-Sensor Fusion	220 cow-hours (ACC, audio)	Spectrogram + 6-DoF inertial images	95% accuracy behavior classification	Multimodal, noise-robust	Needs synchronized sensors, heavy preprocessing	Social licking vs. ruminating (Peng et al., 2024)
10	Explainable AutoML DT/Rule set	1200 calls	24 acoustic stats features	90% accuracy, full rule trace	Human-readable decision paths	3–4 pp lower F1 vs. deep nets	White-box distress detection
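The k-NN row notes that the method needs no training time but is storage heavy. The minimal sketch below shows why: the "model" is simply the stored training set, and every prediction scans it. The feature pairs (F0 in kHz, duration in s) and labels are invented for illustration, and `knn_predict` is a hypothetical helper rather than code from any cited study.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature_vector."""
    # Sort all stored examples by Euclidean distance to the query, keep the k closest
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up (F0 in kHz, duration in s) examples, for illustration only
train = [
    ((0.60, 2.0), "distress"), ((0.70, 2.5), "distress"), ((0.65, 3.0), "distress"),
    ((0.18, 1.0), "contact"), ((0.20, 1.5), "contact"), ((0.15, 0.9), "contact"),
]

print(knn_predict(train, (0.62, 2.2)))  # distress
print(knn_predict(train, (0.17, 1.1)))  # contact
```

In practice the features would need scaling to comparable ranges before distances are computed, since an unscaled feature with a large numeric range would dominate the vote.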

Acoustic characteristics and contextual interpretation of bovine vocalizations

Vocalization type	Dominant frequency (Hz)	Typical duration (s)	Typical mouth/posture	Principal behavioral context	Practical welfare interpretation
Maternal contact (lowing/closed-mouth call) (Green et al., 2021)	F0 ∼120–280 Hz (mean ∼180 Hz)	∼0.8–2.5 s	Closed or partially open, head lowered toward calf	Cow-calf proximity, gentle bonding, reassurance	Indicates calm social contact and maternal bonding, normally a positive welfare cue
Calf isolation distress call (Mac et al., 2022)	F0 ∼450–780 Hz	∼1–4 s (modal ∼2 s)	Open-mouth, elevated head, often repeated bouts	Calf separated from dam/herd	Signals acute distress, should trigger rapid reunion or comfort
Adult distress/pain call (Martinez-Rau et al., 2025)	F0 ∼600–1200 Hz	> 2 s (mean ∼3.1 s)	Fully open mouth, tense neck	Pain (e.g., lameness, injury) or extreme fear	High-urgency alert, immediate welfare check required
Hunger/feed-anticipation call (Sattar, 2022)	F0 ∼220–380 Hz	∼0.5–2.0 s	Open-mouth, pacing near feed-gate	Imminent feeding, empty trough	Indicates motivational state (feed expectation)
Estrus (heat) call (Sharma and Kadyan, 2023)	F0 ∼160–320 Hz (rich harmonic stack)	∼0.8–3 s	Extended vocal tract, head raised	Reproductive behavior, seeking mates	Reliable cue for breeding/AI scheduling, positive management indicator
Social affiliative call (Schnaider et al., 2022a)	F0 ∼110–260 Hz	∼0.4–1.2 s	Closed-mouth, nasal	Group re-joining, mild excitement	Normal herd cohesion signal, neutral/positive welfare
Alarm/novel object call (Miron et al., 2025)	F0 ∼650–1100 Hz	–	Sudden, sharp, head-up stance	Perceived predator, startling event	Short-term fear, monitor environment and animal safety
Cough/respiratory (Sattar, 2022)	Broadband burst 200–1200 Hz	∼0.12–0.35 s	Forced exhalation, closed glottis	Respiratory irritation or disease onset	Early health-risk indicator (e.g., BRD), triggers clinical exam
Pain-related moan (low-frequency) (Volkmann et al., 2021)	F0 ∼90–190 Hz	∼1.5–5 s	Mouth partially open, minimal movement	Chronic discomfort (lameness, parturition)	Persistent occurrence warrants veterinary assessment
Play/excitement call (Vogt et al., 2025)	F0 ∼260–450 Hz	∼0.3–0.9 s	Short bursts during running/bucking	Calf play, social excitement	Positive affect indicates good welfare environment
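The ranges in this table overlap considerably, which a simple lookup makes visible. The sketch below encodes only the rows that give both a bounded F0 range and a bounded duration range (adult distress, alarm, and cough are left out because their duration or F0 is open-ended, missing, or broadband) and returns every call type compatible with a measured F0 and duration. The function name and data layout are illustrative assumptions.

```python
# (call type, F0 min Hz, F0 max Hz, duration min s, duration max s), copied from the table
CALL_RANGES = [
    ("maternal contact", 120, 280, 0.8, 2.5),
    ("calf isolation distress", 450, 780, 1.0, 4.0),
    ("hunger/feed anticipation", 220, 380, 0.5, 2.0),
    ("estrus", 160, 320, 0.8, 3.0),
    ("social affiliative", 110, 260, 0.4, 1.2),
    ("pain-related moan", 90, 190, 1.5, 5.0),
    ("play/excitement", 260, 450, 0.3, 0.9),
]


def candidate_call_types(f0_hz, duration_s):
    """Return every call type whose F0 and duration ranges contain the measurement."""
    return [
        name
        for name, f_lo, f_hi, d_lo, d_hi in CALL_RANGES
        if f_lo <= f0_hz <= f_hi and d_lo <= duration_s <= d_hi
    ]


print(candidate_call_types(600, 2.0))  # ['calf isolation distress']
print(candidate_call_types(200, 1.0))  # ['maternal contact', 'estrus', 'social affiliative']
```

The second query matches three call types with different welfare meanings, which is one reason F0 and duration alone are unlikely to suffice and contextual or multimodal information is needed to disambiguate.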

Key public (or semi-public) datasets and benchmarks for bovine bioacoustics – scope, best-fit models, and limitations

No.	Dataset/Benchmark	Scope and modality snapshot	Best-fit models and intended task	Strengths for model development	Main limitation
1	CowVox-2023 Mini (Sharma and Kadyan, 2023)	8 h audio, 10k labeled calls, 2 Holstein farms	SVM/RF for estrus-call detection	Clean labels, free download (CC-BY)	Narrow breed and low noise
2	DeepSound26 Archive (Ferrero et al., 2023)	120 h audio + 5 h collar IMU, 4 farms/3 breeds	CNN-LSTM fusion for multimodal health events	Synchronized streams; individual IDs	Non-standard file names; requires resync
3	BEANS bovine subset (Hagiwara et al., 2023)	6 h cow audio inside 35 h multi-species corpus	Wav2Vec 2.0 or AVES SSL encoder for zero-/few-shot stress detection	Noise-rich clips; ready for SSL pre-train	Sparse bovine labels; class imbalance
4	Agri-LLM Pilot Set (Chen et al., 2024)	200 paired “call → English tag” clips, Jersey herd	AudioGPT/GPT-J prompt-tuning for captioning	Paired acoustic–semantic examples	Tiny; heavy text bias
5	SmartFarm Open-Noise (Martinez-Rau et al., 2025)	40 h barn ambience (negative class), 5 barn layouts	Spec-BERT masking or NRFAR denoiser pre-train	Diverse negative class for contrastive learning	No positive calls; must be combined with other sets

Compact overview of sensor types that can be fused with barn-acoustic streams on low-power edge devices

No.	Sensor (mount)	Signal + Edge load*	Audio-synergy example (welfare alert)	Field limitation
1	Tri-axial ACC (collar/ear) (Martinez-Rau et al., 2023a; Peng et al., 2024)	100 Hz; low (3 kbps)	Chew rate high + high-F0 “feed call” → early feeding cue	Battery life; collar fit
2	UWB/RFID (tag grid) (Wang et al., 2022)	Distance events; low	>10 m isolation + distress bawl → weaning-stress alert	Antenna cost; metal interference
3	Thermal cam (fixed) (Slob et al., 2021)	5–15 fps; med (0.5 Mbps)	Eye-temp high + panting sound → heat-stress risk	Night IR; occlusion
4	Top view RGB cam (Arazo et al., 2022)	25 fps; high unless pruned	Limp posture + low-F0 moan → lameness warning	Bandwidth; privacy; dirt
5	Directional mic (collar) (Röttgen et al., 2020)	16 kHz; low	High-F0 estrus call detected → identify cycling cow	Collar fit; battery life
6	NH₃/CO₂ gas (wall) (Pérez-Granados and Schuchmann, 2023)	1 Hz; low	Gas spike + drop in calling → respiratory risk	Sensor drift
7	Water trough pressure mat (Shi et al., 2024)	Sip events; low	Few sips + thirst call → blocked drinker alert	Hardware wear
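The audio-synergy column pairs an acoustic event with a second sensor reading before raising an alert. Three of those pairings can be written as explicit rules, as in the sketch below. The `Window` fields, the event names, and the `few_sips` threshold are hypothetical; only the 10 m isolation distance and the alert wording come from the table.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Events observed for one cow in one time window (all field names hypothetical)."""
    audio_event: str = ""            # e.g. "distress_bawl", "panting", "thirst_call"
    distance_to_herd_m: float = 0.0  # from UWB/RFID positioning
    eye_temp_high: bool = False      # from the thermal camera
    sips: int = 0                    # from the water trough pressure mat


def fused_alerts(w, few_sips=2):
    """Raise an alert only when the acoustic cue and the second sensor agree."""
    alerts = []
    if w.audio_event == "distress_bawl" and w.distance_to_herd_m > 10:
        alerts.append("weaning-stress alert")
    if w.audio_event == "panting" and w.eye_temp_high:
        alerts.append("heat-stress risk")
    if w.audio_event == "thirst_call" and w.sips <= few_sips:
        alerts.append("blocked drinker alert")
    return alerts


print(fused_alerts(Window(audio_event="distress_bawl", distance_to_herd_m=12.0)))
# ['weaning-stress alert']
print(fused_alerts(Window(audio_event="distress_bawl", distance_to_herd_m=3.0)))
# []
```

Requiring agreement between two modalities is a simple way to cut false alerts from audio alone, at the cost of missing events when the second sensor is offline.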

Key technical and ethical challenges in AI-driven bovine bioacoustics vs. proposed solutions

No.	Challenge/Pain-point	Underlying cause(s) and typical manifestation	Impact on research/farm adoption	Proposed technical/operational solutions
1	Data scarcity and class imbalance	Costly, time-consuming manual labeling of calls; rare yet critical events (e.g., pain bawls, calving distress) under-represented; farm privacy limits data sharing	Over-fitting, poor generalization; models ignore rare but critical classes	Large open acoustic repositories (multi-farm, multi-breed); self-supervised pre-training (Wav2Vec, AVES) to cut labels by ∼70%; synthetic data via generative models (GAN vocoders) to upsample rare calls; transfer learning to share model weights, not raw audio
2	Cross-farm variability and domain shift	Differences in barn acoustics, microphone type, breed dialects, management routines	Performance drop when models deployed outside training site, farmer distrust	Domain-adversarial training, feature-space alignment; calibration period and incremental fine-tuning on each new farm; capture metadata (mic height, barn SNR) for conditional normalization
3	Background noise and multi-speaker overlap	Machinery, wind, multiple cows calling simultaneously	High false positives/negatives, missed welfare events	Beamforming or multi-mic arrays for source separation; bi-spectral denoising + mask-based enhancement; event-wise confidence scoring and noise-aware thresholds
4	Limited interpretability (AI)	Deep nets learn latent features not visible to users	Farmers hesitant to trust alerts, regulators demand transparency	SHAP/LIME heatmaps on spectrograms; rule extraction or surrogate decision trees; dashboard displays “Top 3 acoustic drivers” behind each alert
5	Sparse contextual labeling (why a call occurred)	Audio often logged without behavioral or physiological context	Misclassification of benign calls as distress (or vice versa)	Multimodal fusion (sync audio with accelerometer, video); mobile annotation apps for on-farm event tagging
6	Real-time processing on resource-constrained edge devices	GPU-heavy models vs. limited power/connectivity in barns	Latency or dropout; costly cloud fees	Lightweight architectures (MobileNet, DistilBERT-audio); on-device quantization and pruning
7	Ethical risk of anthropomorphism and over-interpretation	AI may project human emotion labels inaccurately; farmers may act on unverified alerts	Questionable welfare interventions, misleading claims	Cross-validation against physiological stress markers; expert-in-the-loop verification before deploying new labels
8	Farmer adoption and usability barriers	Alert fatigue, complex interfaces, unclear ROI	System ignored despite accuracy; missed welfare benefit	Tiered alerting (red/high vs. yellow/medium); ROI calculators (savings on vet costs, improved conception); hands-on training and local-language interfaces
9	Data privacy and ownership concerns	Audio streams may reveal proprietary operations	Reluctance to share data, slows collaborative progress	Federated or encrypted model updates; clear data-use agreements (farmer retains raw-data ownership); on-premises processing options
10	Regulatory alignment and standardization gaps	No harmonized acoustic welfare metrics yet	Hard to benchmark systems; variable certification hurdles	Develop ISO-style standards for recording and annotation; open benchmarking datasets and leaderboards; engage policymakers early to shape guidelines
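The tiered alerting proposed in row 8 (red/high vs. yellow/medium) can be combined with the event-wise confidence scoring from row 3. The sketch below is one possible mapping, not a published scheme: the function `alert_tier` and the thresholds `red` and `yellow` are assumptions chosen for illustration.

```python
def alert_tier(confidence, severity, red=0.85, yellow=0.6):
    """Map a detection to 'red', 'yellow', or None (suppressed).

    confidence: classifier confidence in [0, 1]
    severity:   'high' for urgent event classes, anything else for the rest
    """
    # Red is reserved for urgent event classes detected with high confidence
    if confidence >= red and severity == "high":
        return "red"
    # Everything else above the lower threshold is surfaced as yellow
    if confidence >= yellow:
        return "yellow"
    # Below the lower threshold the event is not shown, to limit alert fatigue
    return None


print(alert_tier(0.90, "high"))    # red
print(alert_tier(0.90, "medium"))  # yellow
print(alert_tier(0.50, "high"))    # None
```

Suppressed events need not be discarded; they could be logged for later review or used as candidates for expert-in-the-loop labeling, in line with row 7.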

NLP/LLM techniques applied to bovine vocalizations

No.	Technique/Model	Upstream pre-training base	Fine-tuning data (bovine)	Key output/Capability	Demonstrated advantage	Current limitations
1	Wav2Vec 2.0 (SSL)	960 h LibriSpeech human speech	2 h labeled cow calls	768-dim latent embeddings → downstream classifier	Cuts labeled data need by ≈70% (Hagiwara, 2023)	Requires long GPU pre-training, bovine prosody differs, latent units not explainable
2	HuBERT-style Audio LM	60 k h YouTube-Audio8M	5 h cow distress calls	Discrete token stream for LLM conditioning	Self-supervised tokens improve LLM prompt ability	–
3	Whisper (large-v2)	680 k h multilingual speech	Zero-shot (no cow data)	“Transcript” string + log-prob	Noise-robust segmentation, auto-timestamp	Tokenizer trained on words → outputs nonsense on raw moos; needs post-filter
4	AudioGPT Controller	GPT-4 (text) + plug-in ASR/encoders	50 labeled prompts (few-shot)	Multi-step reasoning over acoustic embeddings	Flexible zero-shot Q&A about herd sounds	Heavy compute, pipeline latency, still prototype
5	CNN Encoder + GPT-2 Decoder	ImageNet CNN weights	7 k spectrograms with text tags	Generates sentence caption (e.g., “hungry calf call”)	Early end-to-end audio-caption success	Needs dataset of paired call + explanation, currently small
6	Prompt-Tuned GPT-J	6 B-param code GPT	400 synthetic “call→meaning” pairs	Rapid adaptation to cow vocabulary (<1 epoch)	Works with minimal GPU	Synthetic pairs risk bias; real validation pending
7	Spec-BERT	100 h farm audio (masked)	800 labeled segments	Predicts masked time-frequency patches, improves downstream F1 +4 pp	Learns robust representations under barn noise	Mask strategy sensitivity; limited to short clips

DOI: https://doi.org/10.2478/aoas-2025-0091 | Journal eISSN: 2300-8733 | Journal ISSN: 1642-3402

Language: English

Page range: 751 - 788

Submitted on: May 22, 2025

Accepted on: Aug 18, 2025

Published on: Apr 30, 2026

Published by: National Research Institute of Animal Production

In partnership with: Paradigm Publishing Services

Publication frequency: 4 issues per year

Keywords: bovine bioacoustics, precision livestock farming, sensor fusion, explainable AI, animal welfare outcomes

Related subjects: Zoology

© 2026 Mayuri Kate, Suresh Neethirajan, published by National Research Institute of Animal Production
This work is licensed under the Creative Commons Attribution 4.0 License.
