Giving Cows a Digital Voice – AI-Enabled Bioacoustics and Smart Sensing in Precision Livestock Management – A Review

Open Access | Apr 2026

Figures & Tables

Figure 1.

Twenty-year evolution of AI methods for bovine vocalization research, illustrating the shift from manual spectrogram analysis to multimodal, edge-deployed models enhanced by large language models

Figure 2.

PRISMA flow diagram summarizing the literature search and screening process. Out of 248 initially retrieved records, 124 core studies and 30 supporting background papers were included in the final qualitative synthesis

Figure 3.

Comparison of a traditional, human-centric welfare assessment workflow (left) with an AI-enhanced, sensor-driven loop (right). The manual pathway relies on periodic visual scoring and can delay intervention by days, whereas the smart pathway fuses continuous acoustic, motion and video data, runs edge AI for instant anomaly detection, and provides interpretable alerts that prompt rapid farmer action

Figure 4.

Illustration of an AI-driven acoustic analysis pipeline for decoding bovine vocalizations, from audio acquisition through preprocessing and modeling to real-time farm alerts

Figure 5.

Noise adaptation pipeline used in our review’s NRFAR-style studies. Raw barn audio first undergoes spectral gating and bandpass filtering, followed by an adaptive denoiser whose coefficients are fine-tuned on site-specific noise samples. A log-Mel feature bank feeds a noise-aware CNN that outputs both a class and a confidence score; low-confidence events trigger a feedback loop that stores new noise exemplars and refreshes the denoiser parameters, maintaining robustness without full model retraining
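
The spectral gating and feedback steps named in the Figure 5 caption can be illustrated with a minimal sketch. It is not the pipeline of any reviewed study: the gate rule (per-bin mean plus k standard deviations), the `NoiseBank` class, the `min_confidence` value and the choice to treat every low-confidence event as a noise exemplar are all assumptions made for illustration. Bandpass filtering, the log-Mel bank and the CNN are omitted.

```python
from statistics import mean, pstdev


def noise_thresholds(noise_frames, k=1.5):
    """Per-bin gate threshold: mean + k * std of magnitudes over noise-only frames."""
    n_bins = len(noise_frames[0])
    thresholds = []
    for b in range(n_bins):
        column = [frame[b] for frame in noise_frames]
        thresholds.append(mean(column) + k * pstdev(column))
    return thresholds


def spectral_gate(frames, thresholds):
    """Zero every time-frequency bin whose magnitude does not exceed its threshold."""
    return [[m if m > t else 0.0 for m, t in zip(frame, thresholds)]
            for frame in frames]


class NoiseBank:
    """Hypothetical store of site-specific noise exemplars that refreshes the gate."""

    def __init__(self, noise_frames, k=1.5, min_confidence=0.6):
        self.noise_frames = list(noise_frames)
        self.k = k
        self.min_confidence = min_confidence  # assumed cut-off, not from the source
        self.thresholds = noise_thresholds(self.noise_frames, k)

    def feedback(self, frames, confidence):
        """Keep low-confidence events as new exemplars; return True if refreshed."""
        if confidence >= self.min_confidence:
            return False
        self.noise_frames.extend(frames)
        self.thresholds = noise_thresholds(self.noise_frames, self.k)
        return True


# Two noise-only frames with two frequency bins each (toy magnitudes).
bank = NoiseBank([[1.0, 2.0], [1.0, 4.0]])
print(spectral_gate([[5.0, 4.0]], bank.thresholds))
```

Only the thresholds are recomputed in the feedback step, which mirrors the caption's point that the classifier itself does not need full retraining.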

Figure 6.

NLP and LLM approaches to “cow language” translation

Figure 7.

Schematic model for multimodal data integration (audio, movement, video) in precision livestock farming, emphasizing how sensor fusion yields context-rich insights into animal health and stress

Figure 8.

Hybrid explainable AI multimodal (HEAM) pipeline. Four synchronous streams (audio, video, collar IMU signals and environmental sensors) are preprocessed and fused into a unified 320-dimensional feature vector. A gradient-boosted decision-tree classifier provides transparent rule-based predictions, while a large language model (LLM) converts those rules plus sensor context into plain-language guidance for the farmer. A feedback loop allows new labeled events to fine-tune the CNN front-end and refresh the tree, enabling continuous on-farm adaptation without sacrificing interpretability
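
As a rough illustration of the fusion step in the Figure 8 caption, the sketch below concatenates four per-stream embeddings into one 320-dimensional vector. Only the total of 320 comes from the caption. The per-stream sizes in `STREAM_DIMS`, the z-score normalization and the function names are assumptions, and the gradient-boosted classifier and LLM stages are not shown.

```python
from statistics import mean, pstdev

# Hypothetical per-stream embedding sizes; only their sum (320) is from the caption.
STREAM_DIMS = {"audio": 128, "video": 128, "imu": 48, "env": 16}


def zscore(values):
    """Standardize one stream so no modality dominates purely by scale."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def fuse(embeddings):
    """Concatenate the normalized streams in a fixed order into one feature vector."""
    fused = []
    for name, dim in STREAM_DIMS.items():
        vec = embeddings[name]
        if len(vec) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(vec)}")
        fused.extend(zscore(vec))
    return fused


# Toy embeddings standing in for the outputs of the per-stream front-ends.
example = {name: [float(i) for i in range(dim)] for name, dim in STREAM_DIMS.items()}
print(len(fuse(example)))
```

Keeping a fixed stream order means each position in the fused vector always refers to the same sensor feature, which is what lets a tree-based classifier expose readable rules.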

Expanded key studies on bovine vocalization analysis

| No. | Reference | Context | Recording setup | Algorithm applied | Data volume (calls/hours) | Performance metric(s) | Major insight/Key finding |
|---|---|---|---|---|---|---|---|
| 1 | Mac et al., 2022 | Calf distress at weaning | 3 chest-high mics, 44.1 kHz, indoor pen | k-NN on MFCC mean ± SD | 600 calls | 94% accuracy | High-pitched, long calls reliably indicated distress |
| 2 | Sharma and Kadyan, 2023 | Dairy estrus detection | Neck collar mic, 16 kHz | SVM, RF comparison | 2000 calls | SVM 95% accuracy | Estrus vocalization has signature harmonic pattern |
| 3 | Vidana-Vila et al., 2023 | Continuous barn monitoring | 12 ceiling mics, 8 kHz | MobileNet CNN detector | 25 h audio | AUROC 0.93 | Real-time detection feasible on edge device |
| 4 | Patil et al., 2024 | Hunger vs. cough vs. estrus | Hand-held recorder, 48 kHz | 7-layer CNN | 5200 clips | 0.97 accuracy | Deep CNN discriminates four intent categories |
| 5 | Ferrero et al., 2023 | 6-class health dataset | Static barn mic array | CNN-LSTM hybrid | 7800 segments | 0.80 macro-F1 | Temporal context boosts recall on rare classes |
| 6 | Röttgen et al., 2020 | Individual ID in group | Dual-mic collar (airborne + structure-borne) | CNN | ∼2171 events | 87% correct cow ID | Wearable sensors enable individual vocal detection |
| 7 | Hagiwara, 2023 | Self-supervised AVES | Mixed-species archive, cow subset | Transformer encoder | 160 h unlabeled + 800 labeled | +7 pp F1 vs. CNN | SSL cuts annotation cost, improves few-shot |
| 8 | Martinez-Rau et al., 2023 a | Chew detection collar | Collar mic + accel | RF on chewing spectra | 4 h per cow × 20 | 92% chew vs. rumination | Detects feeding bouts for intake estimation |
| 9 | Gavojdian et al., 2024 | Stress isolation study | Lav-mics, 22 kHz | Bi-LSTM | 3000 sequences | 0.91 F1 | Sequence model spots stress more reliably |
| 10 | Sattar, 2022 | Multi-intent cough/food/estrus | 6 mics, 48 kHz | Spectrogram CNN | 4400 clips | 0.82 macro-F1 | Combined dataset demonstrates multi-class viability |
| 11 | Peng et al., 2024 | Behavior fusion EdgeNeXt | Audio + ACC | EdgeNeXt + fusion | 220 h | 95% behavior acc. | Multimodal fusion > single modality |
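
The first row's "k-NN on MFCC mean ± SD" can be sketched as follows. This is a generic illustration, not the code of the cited study. It assumes the per-frame MFCCs have already been extracted by some front-end, and the value of `k`, the Euclidean distance and the toy two-coefficient frames are all assumptions.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev


def summarize(mfcc_frames):
    """Collapse a call's per-frame MFCCs into [mean_1..mean_n, sd_1..sd_n]."""
    columns = list(zip(*mfcc_frames))
    return [mean(c) for c in columns] + [pstdev(c) for c in columns]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy calls: each is a list of frames, each frame a list of MFCC coefficients.
train = [
    (summarize([[10.0, 2.0], [12.0, 2.0]]), "distress"),
    (summarize([[11.0, 3.0], [13.0, 3.0]]), "distress"),
    (summarize([[1.0, 0.0], [1.0, 2.0]]), "calm"),
    (summarize([[2.0, 1.0], [2.0, 1.0]]), "calm"),
]
query = summarize([[10.0, 2.0], [14.0, 4.0]])
print(knn_predict(train, query))
```

Summarizing each call by mean and SD gives a fixed-length vector regardless of call duration, at the cost of discarding the temporal order that the sequence models in the table exploit.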

Comparative analysis of AI methods in bovine vocalization classification

| No. | AI approach/Architecture | Typical training data volume | Key input representation | Reported best accuracy/F1 | Strengths in reviewed studies | Main limitations/Failure modes | Representative use-case(s) |
|---|---|---|---|---|---|---|---|
| 1 | Random Forest (RF) | ≈500–3 000 labeled calls | Hand-crafted MFCC + temporal stats | 88–93% F1 (distress vs. non-distress) | Robust to noise, interpretable feature importance | Needs manual feature engineering, weak on temporal context | Estrus-call detection (Sharma and Kadyan, 2023) |
| 2 | Support Vector Machine (SVM) | 200–2 000 calls | MFCC mean ± SD, fundamental F0 | 86–95% accuracy (estrus vs. baseline) | Performs well on small datasets, strong margins | Sensitive to parameter tuning, scales poorly with >10 k samples | Early estrus detection wearables (Peng et al., 2023) |
| 3 | k-Nearest Neighbour (k-NN) | 600 calls | Spectral centroid, duration, energy | 94% accuracy for open- vs. closed-mouth calls | Simple, no training time | Storage heavy, cannot model sequence | Call-type classifier in Japanese Black cattle (Peng et al., 2023) |
| 4 | CNN (2-D spectrogram) | ≥5000 call segments | Mel-spectrogram images (128 bins) | 97% accuracy, 0.96 F1 (multi-class-4) | Learns spectral patterns, no manual features | Needs GPU and large data, poor temporal memory alone | Multi-intent classifier (hunger, cough, estrus, normal) (Patil et al., 2024) |
| 5 | Lightweight CNN (MobileNet) | 25 h continuous barn audio | 64-bin log-mel | AUROC 0.93 at 1 s stride | Fast edge inference (<20 ms), low power | Precision drops in heavy machinery noise | Real-time call detection collar (Vidana-Vila et al., 2023) |
| 6 | LSTM/Bi-LSTM | 3000 labeled sequences | Per-frame MFCC + delta MFCC (time series) | 91% F1 (calf isolation vs. contact) | Captures temporal dynamics, good on sequences | Over-fitting on short clips, GPU-heavy | Isolation stress monitor (Martinez-Rau et al., 2025) |
| 7 | Hybrid CNN + LSTM | 7800 segments (6 classes) | CNN spectrogram embedding → LSTM | 80% overall F1, +6 pp over CNN-only on rare classes | Combines spectrum + sequence info | Needs >10 k samples to beat pure CNN | Multi-class health event detector (Ferrero et al., 2023) |
| 8 | Transformer Audio Encoder (AVES) | 160 h unlabeled pretrain + 800 labels finetune | Raw 16 kHz waveform | 3–7 pp F1 increase over baseline CNN | Self-supervised, strong few-shot performance; domain adaptable | Needs GPU for pretrain, complex | Few-shot call classification after self-pre-training (Hagiwara, 2023) |
| 9 | EdgeNeXt Multi-Sensor Fusion | 220 cow-hours (ACC, audio) | Spectrogram + 6-DoF inertial images | 95% accuracy behavior classification | Multimodal, noise-robust | Needs synchronized sensors, heavy preprocessing | Social licking vs. ruminating (Peng et al., 2024) |
| 10 | Explainable AutoML DT/Rule set | 1200 calls | 24 acoustic stats features | 90% accuracy, full rule trace | Human-readable decision paths | 3–4 pp lower F1 vs. deep nets | White-box distress detection |
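
The tables mix accuracy, F1 and macro-F1, which are not directly comparable when classes are imbalanced. The sketch below, using made-up labels rather than data from any reviewed study, shows how a detector that never predicts a rare class can still reach high accuracy while its macro-F1 collapses. The function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN), computed separately for every label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical imbalanced test set: 8 normal calls, 2 coughs, all predicted "normal".
y_true = ["normal"] * 8 + ["cough"] * 2
y_pred = ["normal"] * 10
print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 3))
```

This is one reason a 0.80 macro-F1 on a six-class health dataset and a 97% accuracy on a four-class dataset should not be ranked against each other without knowing the class balance.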

Acoustic characteristics and contextual interpretation of bovine vocalizations

| Vocalization type | Dominant frequency (Hz) | Typical duration (s) | Typical mouth/posture | Principal behavioral context | Practical welfare interpretation |
|---|---|---|---|---|---|
| Maternal contact (lowing/closed-mouth call) (Green et al., 2021) | F0 ∼120–280 Hz (mean ∼180 Hz) | ∼0.8–2.5 s | Closed or partially open, head lowered toward calf | Cow-calf proximity, gentle bonding, reassurance | Indicates calm social contact and maternal bonding, normally a positive welfare cue |
| Calf isolation distress call (Mac et al., 2022) | F0 ∼450–780 Hz | ∼1–4 s (modal ∼2 s) | Open-mouth, elevated head, often repeated bouts | Calf separated from dam/herd | Signals acute distress, should trigger rapid reunion or comfort |
| Adult distress/pain call (Martinez-Rau et al., 2025) | F0 ∼600–1200 Hz | > 2 s (mean ∼3.1 s) | Fully open mouth, tense neck | Pain (e.g., lameness, injury) or extreme fear | High-urgency alert, immediate welfare check required |
| Hunger/feed-anticipation call (Sattar, 2022) | F0 ∼220–380 Hz | ∼0.5–2.0 s | Open-mouth, pacing near feed-gate | Imminent feeding, empty trough | Indicates motivational state (feed expectation) |
| Estrus (heat) call (Sharma and Kadyan, 2023) | F0 ∼160–320 Hz (rich harmonic stack) | ∼0.8–3 s | Extended vocal tract, head raised | Reproductive behavior, seeking mates | Reliable cue for breeding/AI scheduling, positive management indicator |
| Social affiliative call (Schnaider et al., 2022 a) | F0 ∼110–260 Hz | ∼0.4–1.2 s | Closed-mouth, nasal | Group re-joining, mild excitement | Normal herd cohesion signal, neutral/positive welfare |
| Alarm/novel object call (Miron et al., 2025) | F0 ∼650–1100 Hz |  | Sudden, sharp, head-up stance | Perceived predator, startling event | Short-term fear, monitor environment and animal safety |
| Cough/respiratory (Sattar, 2022) | Broadband burst 200–1 200 Hz | ∼0.12–0.35 s | Forced exhalation, closed glottis | Respiratory irritation or disease onset | Early health-risk indicator (e.g., BRD), triggers clinical exam |
| Pain-related moan (low-frequency) (Volkmann et al., 2021) | F0 ∼90–190 Hz | ∼1.5–5 s | Mouth partially open, minimal movement | Chronic discomfort (lameness, parturition) | Persistent occurrence warrants veterinary assessment |
| Play/excitement call (Vogt et al., 2025) | F0 ∼260–450 Hz | ∼0.3–0.9 s | Short bursts during running/bucking | Calf play, social excitement | Positive affect indicates good welfare environment |
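
The F0 and duration ranges in this table overlap, so they narrow down candidate call types rather than identify one. The lookup below is a sketch that encodes the table's approximate ranges. `CALL_RANGES` and `candidate_call_types` are illustrative names. The cough row is left out because it is a broadband burst without an F0 range, the alarm row has no duration entry and is left unconstrained, and "> 2 s" is approximated as an inclusive lower bound.

```python
from math import inf

# (call type, F0 min Hz, F0 max Hz, min duration s, max duration s), from the table.
CALL_RANGES = [
    ("Maternal contact", 120, 280, 0.8, 2.5),
    ("Calf isolation distress", 450, 780, 1.0, 4.0),
    ("Adult distress/pain", 600, 1200, 2.0, inf),   # "> 2 s" taken as >= 2 s
    ("Hunger/feed-anticipation", 220, 380, 0.5, 2.0),
    ("Estrus", 160, 320, 0.8, 3.0),
    ("Social affiliative", 110, 260, 0.4, 1.2),
    ("Alarm/novel object", 650, 1100, 0.0, inf),    # no duration given in the table
    ("Pain-related moan", 90, 190, 1.5, 5.0),
    ("Play/excitement", 260, 450, 0.3, 0.9),
]


def candidate_call_types(f0_hz, duration_s):
    """Return every call type whose F0 and duration ranges both contain the input."""
    return [
        name
        for name, f_lo, f_hi, d_lo, d_hi in CALL_RANGES
        if f_lo <= f0_hz <= f_hi and d_lo <= duration_s <= d_hi
    ]


print(candidate_call_types(500, 2.0))   # a single candidate
print(candidate_call_types(150, 1.0))   # two overlapping candidates
```

A 150 Hz, 1 s call matches both the maternal contact and social affiliative rows, which illustrates why posture and behavioral context appear as separate columns.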

Key public (or semi-public) datasets and benchmarks for bovine bioacoustics – scope, best-fit models, and limitations

| No. | Dataset/Benchmark | Scope and modality snapshot | Best-fit models and intended task | Strengths for model development | Main limitation |
|---|---|---|---|---|---|
| 1 | CowVox-2023 Mini (Sharma and Kadyan, 2023) | 8 h audio, 10k labeled calls, 2 Holstein farms | SVM/RF for estrus-call detection | Clean labels, free download (CC-BY) | Narrow breed and low noise |
| 2 | DeepSound26 Archive (Ferrero et al., 2023) | 120 h audio + 5 h collar IMU, 4 farms/3 breeds | CNN-LSTM fusion for multimodal health events | Synchronized streams; individual IDs | Non-standard file names; requires resync |
| 3 | BEANS bovine subset (Hagiwara et al., 2023) | 6 h cow audio inside 35 h multi-species corpus | Wav2Vec 2.0 or AVES SSL encoder for zero-/few-shot stress detection | Noise-rich clips; ready for SSL pre-train | Sparse bovine labels; class imbalance |
| 4 | Agri-LLM Pilot Set (Chen et al., 2024) | 200 paired “call → English tag” clips, Jersey herd | AudioGPT/GPT-J prompt-tuning for captioning | Paired acoustic–semantic examples | Tiny; heavy text bias |
| 5 | SmartFarm Open-Noise (Martinez-Rau et al., 2025) | 40 h barn ambience (negative class), 5 barn layouts | Spec-BERT masking or NRFAR denoiser pre-train | Diverse negative class for contrastive learning | No positive calls; must be combined with other sets |

Compact overview of sensor types that can be fused with barn-acoustic streams on low-power edge devices

| No. | Sensor (mount) | Signal + Edge load* | Audio-synergy example (welfare alert) | Field limitation |
|---|---|---|---|---|
| 1 | Tri-axial ACC (collar/ear) (Martinez-Rau et al., 2023 a; Peng et al., 2024) | 100 Hz; low (3 kbps) | Chew rate high + high-F0 “feed call” → early feeding cue | Battery life; collar fit |
| 2 | UWB/RFID (tag grid) (Wang et al., 2022) | Distance events; low | >10 m isolation + distress bawl → weaning-stress alert | Antenna cost; metal interference |
| 3 | Thermal cam (fixed) (Slob et al., 2021) | 5–15 fps; med (0.5 Mbps) | Eye-temp high + panting sound → heat-stress risk | Night IR; occlusion |
| 4 | Top view RGB cam (Arazo et al., 2022) | 25 fps; high unless pruned | Limp posture + low-F0 moan → lameness warning | Bandwidth; privacy; dirt |
| 5 | Directional mic (collar) (Röttgen et al., 2020) | 16 kHz; low | High-F0 estrus call detected → identify cycling cow | Collar fit; battery life |
| 6 | NH3/CO2 gas (wall) (Pérez-Granados and Schuchmann, 2023) | 1 Hz; low | Gas spike + drop in calling → respiratory risk | Sensor drift |
| 7 | Water trough pressure mat (Shi et al., 2024) | Sip events; low | Few sips + thirst call → blocked drinker alert | Hardware wear |
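
Each "audio-synergy" entry pairs a sensor event with an acoustic event. One simple way to express such a pairing is a time-window co-occurrence rule, sketched here for the UWB/RFID row. The `Event` class, the event labels, the 60 s window and the alert name are hypothetical and are not taken from the reviewed systems.

```python
from dataclasses import dataclass


@dataclass
class Event:
    cow_id: str
    time_s: float
    kind: str  # hypothetical labels such as "isolation_gt_10m" or "distress_bawl"


def co_occurrence_alerts(events, kind_a, kind_b, alert, window_s=60.0):
    """Emit an alert for each kind_b event that has a kind_a event from the
    same animal within window_s seconds (before or after)."""
    a_events = [e for e in events if e.kind == kind_a]
    alerts = []
    for b in (e for e in events if e.kind == kind_b):
        if any(a.cow_id == b.cow_id and abs(a.time_s - b.time_s) <= window_s
               for a in a_events):
            alerts.append((b.cow_id, b.time_s, alert))
    return alerts


events = [
    Event("calf-07", 100.0, "isolation_gt_10m"),
    Event("calf-07", 130.0, "distress_bawl"),   # same calf, 30 s later: alert
    Event("calf-09", 135.0, "distress_bawl"),   # different calf: no alert
    Event("calf-07", 500.0, "distress_bawl"),   # outside the window: no alert
]
print(co_occurrence_alerts(events, "isolation_gt_10m", "distress_bawl",
                           "weaning_stress"))
```

Requiring both events from the same animal is what distinguishes this from audio-only detection: a bawl alone does not raise the alert.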

Key technical and ethical challenges in AI-driven bovine bioacoustics vs. proposed solutions

| No. | Challenge/Pain-point | Underlying cause(s) and typical manifestation | Impact on research/farm adoption | Proposed technical/operational solutions |
|---|---|---|---|---|
| 1 | Data scarcity and class imbalance | Costly, time-consuming manual labeling of calls; rare yet critical events (e.g., pain bawls, calving distress) under-represented; farm privacy limits data sharing | Over-fitting, poor generalization; models ignore rare but critical classes | Large open acoustic repositories (multi-farm, multi-breed); self-supervised pre-training (Wav2Vec, AVES) to cut labels by ∼70%; synthetic data via generative models (GAN vocoders) to upsample rare calls; transfer learning to share model weights, not raw audio |
| 2 | Cross-farm variability and domain shift | Differences in barn acoustics, microphone type, breed dialects, management routines | Performance drop when models deployed outside training site, farmer distrust | Domain-adversarial training, feature-space alignment; calibration period and incremental fine-tuning on each new farm; capture meta-data (mic height, barn SNR) for conditional normalization |
| 3 | Background noise and multi-speaker overlap | Machinery, wind, multiple cows calling simultaneously | High false positives/negatives, missed welfare events | Beam-forming or multi-mic arrays for source separation; bi-spectral denoising + mask-based enhancement; event-wise confidence scoring and noise-aware thresholds |
| 4 | Limited interpretability (AI) | Deep nets learn latent features not visible to users | Farmers hesitant to trust alerts, regulators demand transparency | SHAP/LIME heatmaps on spectrograms; rule-extraction or surrogate decision trees; dashboard displays “Top 3 acoustic drivers” behind each alert |
| 5 | Sparse contextual labeling (why a call occurred?) | Audio often logged without behavioral or physiological context | Misclassification of benign calls as distress (or vice-versa) | Multimodal fusion: sync audio with accelerometer and video; mobile annotation apps for on-farm event tagging |
| 6 | Real-time processing on resource-constrained edge devices | GPU-heavy models vs. limited power/connectivity in barns | Latency or dropout; costly cloud fees | Lightweight architectures (MobileNet, DistilBERT-audio); on-device quantization and pruning |
| 7 | Ethical risk of anthropomorphism and over-interpretation | AI may project human emotion labels inaccurately; farmers may act on unverified alerts | Questionable welfare interventions, misleading claims | Cross-validation against physiological stress markers; expert-in-the-loop verification before deploying new labels |
| 8 | Farmer adoption and usability barriers | Alert fatigue, complex interfaces, unclear ROI | System ignored despite accuracy; missed welfare benefit | Tiered alerting (red/high vs. yellow/medium); ROI calculators (savings on vet costs, improved conception); hands-on training and local language interfaces |
| 9 | Data privacy and ownership concerns | Audio streams may reveal proprietary operations | Reluctance to share data, slows collaborative progress | Federated or encrypted model updates; clear data-use agreements (farmer retains raw-data ownership); on-premises processing options |
| 10 | Regulatory alignment and standardization gaps | No harmonized acoustic welfare metrics yet | Hard to benchmark systems; variable certification hurdles | Develop ISO-style standards for recording and annotation; open benchmarking datasets and leaderboards; engage policymakers early to shape guidelines |
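
Two of the proposed solutions, noise-aware thresholds (row 3) and tiered alerting (row 8), can be combined in a few lines. The sketch below is only one possible reading. The threshold values, the SNR cut-off and the fixed penalty are hypothetical defaults chosen for illustration and would need calibration on each farm.

```python
def alert_tier(confidence, snr_db, red=0.90, yellow=0.70,
               low_snr_db=10.0, penalty=0.05):
    """Map a detection confidence to 'red', 'yellow' or None.

    In noisy conditions (SNR below low_snr_db) both thresholds are raised by
    `penalty`, so that the same confidence triggers a weaker alert or none.
    All numeric defaults are assumptions, not values from the review.
    """
    if snr_db < low_snr_db:
        red = min(red + penalty, 1.0)
        yellow = min(yellow + penalty, 1.0)
    if confidence >= red:
        return "red"
    if confidence >= yellow:
        return "yellow"
    return None


for confidence, snr_db in [(0.92, 20.0), (0.92, 5.0), (0.72, 20.0), (0.72, 5.0)]:
    print(confidence, snr_db, alert_tier(confidence, snr_db))
```

Demoting alerts under poor signal conditions instead of suppressing the detector is one way to limit alert fatigue while still logging the event for later review.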

NLP/LLM techniques applied to bovine vocalizations

| No. | Technique/Model | Up-stream pre-training base | Fine-tuning data (bovine) | Key output/Capability | Demonstrated advantage | Current limitations |
|---|---|---|---|---|---|---|
| 1 | Wav2Vec 2.0 (SSL) | 960 h LibriSpeech human speech | 2 h labeled cow calls | 768-dim latent embeddings → downstream classifier | Cuts labeled data need by ≈70% (Hagiwara, 2023) | Requires long GPU pre-train, bovine prosody differs, latent units not explainable |
| 2 | HuBERT-style Audio LM | 60 k h Youtube-Audio8M | 5 h cow distress calls | Discrete token stream for LLM conditioning | Self-supervised tokens improve LLM prompt ability |  |
| 3 | Whisper (large-v2) | 680 k h multilingual speech | Zero-shot (no cow data) | “Transcript” string + log-prob | Noise-robust segmentation, auto-timestamp | Tokenizer trained on words → outputs nonsense on raw moos; needs post-filter |
| 4 | AudioGPT Controller | GPT-4 (text) + plug-in ASR/encoders | 50 labeled prompts (few-shot) | Multi-step reasoning over acoustic embeddings | Flexible zero-shot Q&A about herd sounds | Heavy compute, pipeline latency, still prototype |
| 5 | CNN Encoder + GPT-2 Decoder | ImageNet CNN weights | 7 k spectrograms with text tags | Generates sentence caption (e.g., “hungry calf call”) | Early end-to-end audio-caption success | Needs dataset of paired call + explanation, currently small |
| 6 | Prompt-Tuned GPT-J | 6 B-param code GPT | 400 synthetic “call→meaning” pairs | Rapid adaptation to cow vocabulary (<1 epoch) | Works with minimal GPU | Synthetic pairs risk bias; real validation pending |
| 7 | Spec-BERT | 100 h farm audio (masked) | 800 labeled segments | Predicts masked time-frequency patches, improves downstream F1 +4 pp | Learns robust representations under barn noise | Mask strategy sensitivity; limited to short clips |
DOI: https://doi.org/10.2478/aoas-2025-0091 | Journal eISSN: 2300-8733 | Journal ISSN: 1642-3402
Language: English
Page range: 751 - 788
Submitted on: May 22, 2025
Accepted on: Aug 18, 2025
Published on: Apr 30, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2026 Mayuri Kate, Suresh Neethirajan, published by National Research Institute of Animal Production
This work is licensed under the Creative Commons Attribution 4.0 License.