
ImproVision Equilibrium: Toward Multimodal Musical Human–Machine Interaction


1 Introduction

Consider the musical scenario found in the second movement of Rachmaninoff's second piano concerto, illustrated in Figure 1. There is a prescribed musical procedure for the performers, meaning the pianist will play a series of notes as written, and the flutes will accompany with their synchronous notes as written. On the other hand, there are many free dimensions of expression; in particular, the rubato style at the ritardando allows the pianist and flautists to 'negotiate' the exact timing of the sequence and, most importantly, a coherent, perfect landing of the cadence to Adagio sostenuto. So, with at least three bodies involved in this conversation, how can a decision be made? Perhaps the pianist will interpret the score and make some gesture with their head and shoulders to indicate when the final note will be played, with hopes that the flautists will observe and follow. Perhaps the pianist may make eye contact with these flautists to receive their buy-in to the timing decision. It is even possible that, at this moment of eye contact, the pianist may see that one flautist's facial expression suggests they are running out of air, and the pianist may even shorten the length of their cadenza accordingly. In practice, all of these musical and physical considerations are most often facilitated by a fourth body, a conductor, whose role is to observe, process, and guide both the musicians and the musical content.

Figure 1

The return to Tempo I during the second movement of Rachmaninoff’s second piano concerto requires a highly coordinated musical effort. The flutes (‘Fl.,’ top line of second system) sustain a fermata, while the piano soloist interprets a closing ornamentation at their own tempo. All parties are expected to land in complete synchronization on the following cadential downbeat, most often facilitated by a conductor actively listening to, observing, and communicating with the three musicians.

In this research, we describe machines that can contribute to musical co‑creativity with human performers along various dimensions of musical action, such as bidirectional communication of musical intention through non‑audible gestures and led decision‑making in improvisatory ensembles. In doing so via a robotic system, our case studies also provide opportunities for discussion of physical action constraints and differences that create a gap between human and machine capability in musical spaces.

While the use of machines in musical processes is well-established (Hsu and Sosnick, 2009; Manzo, 2016; Rowe, 1992), we focus specifically on the use of robotic systems that imitate aspects of physical interaction. In musical performance, coordination in timing is often underpinned by entrainment—the temporal coupling of rhythmic processes across individuals and modalities (Clayton et al., 2005). For example, neural entrainment studies show that cortical oscillations in delta–theta bands (<8 Hz) can align with musical note rates, supporting chunking of auditory streams, and that beta-band activity (15–30 Hz) contributes to temporal prediction (Doelling and Poeppel, 2015). Sensorimotor theories further suggest that, even in the absence of overt movement, covert motor networks simulate rhythmic actions to scaffold beat perception (Ross and Balasubramaniam, 2022). These findings suggest that human–robot musical interaction could be made more similar to human music-making if such systems account for cross-modal entrainment mechanisms, including the visual, gestural, and auditory cues central to our system; this motivates our use of embodied robotic systems to exchange visual and physical information with musicians.

ImproVision is a framework meant to extend these conducting principles to machine agents. A conducting machine differs from an interactive music‑generating machine in its multi‑modality; an effective conductor in human settings maintains awareness and guidance of human elements through eye contact and physical gesture, constituting an augmentation to typical dimensions of music‑generating systems that only listen and audiate.

In this research, we extend the ImproVision framework introduced by Greer et al. (2024) by presenting three systems: Guided Harmony, Equilibrium, and Gestured Improvisation. Together, these modes explore different directions of control and interaction between human musicians and machine agents. We also evaluate the framework using the Standardized Procedure for Evaluating Creative Systems (SPECS), an established creativity evaluation setup.

2 ImproVision in Context: Multi‑modality, Music, and Human–machine Co‑creativity

Improvisational musical games refer to structured or semi‑structured systems that guide improvisation through rule‑based interactions, gestures, or cues. These games often operate with a shared vocabulary of signals that enable real‑time decision‑making among performers. They serve both as artistic frameworks and experimental platforms to explore group creativity, turn‑taking, and communication. Examples range from historical systems like the ‘musical dice games’ of the 18th century (Hedges, 1978) to modern improvisational structures like ‘conduction’ (Stanley, 2009), ‘soundpainting’ (Thompson, 2024), and ‘Cobra’ (Brackett, 2010; van der Schyff, 2013), which use codified gestures or cards. We refer to these as ‘games’ according to the definitions of Caillois (2001), Huizinga (1971), and Tekinbas and Zimmerman (2003) since these are voluntary activities with rules that structure behavior in experiential, creative, interactive environments. Such systems highlight how visual and nonverbal cues can structure co‑creative dynamics. This principle is echoed in human–robot music systems that rely on motion, posture, or gaze to convey intent.

Applied to a situation in which a human musician and a robot communicate nonverbally in a musical setting, one must consider this relationship from both perspectives: How does the human best ‘encode’ cues to send to the robot (and which robot cues are best for the human to decode), and how does the robot best ‘encode’ cues to send to the human (and which human cues are best for the robot to decode)?

2.1 Entrainment and nonverbal communication in musical settings

Nonverbal communication, representing all communication not expressed through words, is customarily divided into three categories: paralanguage, kinesics, and proxemics (Gordy, 1999; Kurkul, 2007). Paralanguage describes characteristics of one’s voice, including pitch, pauses, rhythm, speed, volume, and verbal fillers (Gordy, 1999; Hall et al., 2021). Musically, paralanguage is expressed through stylistic choices such as dramatic pauses or rubato. Kinesics covers body movements, eye contact, facial expressions, and posture. Proxemics is the study of one’s body in relation to those of others. From a musical perspective, all three categories of nonverbal communication appear in conjunction when playing (Gordy, 1999). This research, using the ImproVision framework, presents a system that receives and transmits information through kinesics supported by a robotic platform and pose recognition, which conveys information about musical intentions during continuous audio streams.

Musical cues of communication can also help establish a ‘messaging’ system while playing, particularly in improvisation. Indicating the end of a phrase by clearly playing a cadence or holding a high note are effective ways of telling your fellow musicians that you have reached the end of your solo section (Seddon, 2005). Smaller ensembles, especially when performing familiar pieces, rely primarily on auditory rather than visual communication (Bishop and Goebl, 2018). More complex music, however, strongly benefits from visual cues. Musical entry points (such as a sudden tempo change or fermata) often benefit from visual cues on top of musical ones to achieve precise coordination (Bishop and Goebl, 2018; Kawase, 2014); examples include a first violinist taking an exaggerated visual breath as a cue to start a faster section together with her string quartet or a conductor guiding an ensemble through Rachmaninoff’s second piano concerto described above.

In a fully human ensemble setting, particularly in conducted improvisation or performances without a score, real‑time communication between musicians and the conductor is a complex, multimodal process. Musicians primarily communicate through their musical output (auditory mode), interpreting the conductor’s cues and responding accordingly. They also employ nonverbal cues (visual mode) such as smiling, eye contact, and body movements. The conductor, conversely, communicates primarily through visual movement cues, not directly producing musical output. This indirect influence is nonetheless perceptible in the ensemble’s performance, even to a listener with closed eyes, as the conductor’s actions shape the creative trajectory of the musicians.

Beyond conducting, the wider field of musical gesture research encompasses ‘musical shaping gestures’ (King, 2013), a term describing physical movements in musical discourse that may be descriptive (illustrating a musical quality) or prescriptive (indicating how a passage should be performed). Gritten and King (2006, 2011) provide extensive cross‑disciplinary perspectives on gesture’s role in performance, rehearsal, and pedagogy. Incorporating such distinctions helps situate our robotic gestures within a taxonomy that spans both temporal coordination and expressive shaping, and we adopt the above terms in our description of the robot’s messaging in the musical interactions presented in this work; the modes demonstrated here are prescriptive in nature, while proposed future research leans into descriptive movement space, where the robot can be used to illustrate dynamic control, gradual change of tempo, and other long‑term continuous‑spectrum behaviors. We note that even descriptive gestures offer significant freedom to the performers or improvisers in these musical settings, which jointly informs future system output, further facilitating the co‑creative (rather than strictly described or prescribed) process.

On a neural level, communication between musicians facilitates entrainment, which, broadly defined, is the synchronization of two or more autonomous rhythmic processes through weak coupling (Clayton et al., 2005). In musical contexts, this occurs at multiple levels: neural entrainment between auditory rhythms and cortical oscillations; overt motor entrainment in synchronized performance; and covert motor entrainment, where motor planning regions engage during passive listening (Ross and Balasubramaniam, 2022). In human ensembles, entrainment serves as the foundation for collective timing decisions, enabling coordinated entries, tempo changes, and phrase endings (Kim et al., 2019). The concept extends to ‘musical social entrainment’, where rhythmic alignment is embedded in social interaction and may span intra‑individual, inter‑individual, and inter‑group processes. Our ImproVision framework operationalizes this in a human–robot setting: visual gestures (kinesics) and musical timing cues function as bidirectional coupling signals, allowing the robot not just to track but also to predict and shape human timing.

This intricate interplay of visual and auditory cues forms a nonverbal communication system facilitating artistic expression. Translating these dynamics to human–machine collaborative settings requires consideration of how a human agent can cooperate with a machine to support creative output, extending beyond purely auditory modes.

2.2 Human–machine co‑creativity

Group creativity emerges not only from individual intentions but from their dynamic interplay. Creative agency may shift between human and machine partners, where each interaction opens up new trajectories (Boden, 2013). These interactions are comparable to real‑time collaborative practices seen in improvisational systems, such as musical jam sessions or even digital phenomena like TikTok duets, where new creative content is layered on top of prior inputs to produce emergent musical expressions.

In this research, we consider creativity to be an abstract intention (whether predetermined or improvised) in the mind of a human agent. The machine should be able to adapt to this human agent's behaviors in a way that amplifies or interacts with the creative intention. The task can be human-creative—that is, the intention only exists in the human agent, and the machine agent is expected to support the intention—or it can be co-creative, where interactions with the machine agent further drive the next intentions of the human agent in a way that, to the human agent, is novel, surprising, or unexpected. In our framework, we consider that both the human and the machine have some hidden intention, I_a, which becomes expressed in an observable form, E_a, where the subscript a indicates the originating agent (human or machine). The combination of these observable forms becomes the creative output, O, of the co-creative system. Information flow from system output to machine intention is the traditional feedback loop that gives the machine the ability to intelligently construct new intentions based on previous output. Flow from a traditional human-to-machine control interface represents the ability of the human agent to directly modify the machine's intention and influence its output in a direct manner. The intention of our system is not centered on this traditional mode of control, as it reduces the machine's role in providing creative input to the performance. Rather, we emphasize the presence of a co-creative line, by which the output produced by the machine influences the creative intentions of the human agent. We model this relationship in Figure 2.

Figure 2

This illustration depicts our framing of the flow of information within the human–machine performance system, with arrows added to highlight ideas we build into the framework. The flow from system output to machine intention is the traditional feedback loop that gives the machine the ability to intelligently construct new intentions based on previous output. The co‑creative line shows this creative input, by which the output produced by the machine influences the creative intentions of the human agent.

As an example, consider two systems: one in which a human clarinetist improvises over a constant electronic drone, and another in which the same clarinetist improvises against an electronic pitch that changes key at random, according to a distribution conditioned on the notes it hears from the human. The machine agent influences the intentions of the human in both systems, but it is the latter that is co-creative: changes in the machine agent are novel, surprising, and unexpected to the musician, allowing for influence in 'both directions.'

To comprehend the dynamics of human–machine co‑creativity in music, we adapt established concepts from human–machine co‑creativity in other artistic domains. Applying these terminologies and frameworks to musical co‑creativity allows exploration of analogous dynamics in this context. The terms and concepts discussed subsequently are grounded in the insights derived from Davis (2013).

The foundation of co-creativity is a so-called 'creative trajectory'—the mutual desire to jointly contribute creatively in a particular agreed-upon direction. One party executes an 'establishing action' that chooses the conversation topic. The other party then accepts or rejects the creative trajectory through a 'verification' or 'rejection action.' Future improvisatory actions are then effectively a conversation between both parties, with both expressing support for and building upon established themes ('elaboration actions'). Either party can also perform a 'deviation action,' which suggests a new creative trajectory. Successful interaction, then, is characterized by the ability to identify themes and patterns and respond to them meaningfully (comparable to call-and-response in jazz improvisation).

Similar turn‑taking and elaboration mechanics can be found in improvisation games, in which players alternate between reinforcing or diverging from established musical motifs (Brackett, 2010). These parallels highlight the importance of shared vocabularies and responsive feedback in co‑creative systems.

2.3 ImproVision framework

Building on these concepts, in this research, we extend the general ImproVision framework for musical human–machine interaction, where the machine assumes a central, quasi-conducting role, as follows:

  1. Humans provide auditory and visual output through musical performance and nonverbal cues.

  2. The machine provides visual output through movement, akin to a conductor’s gestures.

  3. The interaction between humans and machine becomes a multimodal, co‑creative process, with both parties contributing and responding to the evolving musical performance.

This framework explores new possibilities in musical co‑creativity and may inform human–machine co‑creativity in other artistic domains such as dance and visual arts.

3 ImproVision Equilibrium

Greer et al. (2024) present ‘Guided Harmony,’ a musical game in which a robotic system responds to musicians’ nonverbal cues through camera movements in order to move the group through a set score, employing a pan–tilt–zoom (PTZ) camera as a minimal robotic system that can take in camera input and also respond through movements that mimic nodding and shaking its head.

As an extension of the above, we introduce ‘ImproVision Equilibrium,’ a system that guides a group of musicians from a dissonant constellation of pitches to the closest major or minor chord, depending on which cue is given by a human. Whereas ‘Guided Harmony’ processes visual input, ‘Equilibrium’ expands this to true multi‑modality by integrating sound.

3.1 System overview

After the (human) musicians initialize on a dissonant chord, one note per musician, the ImproVision Equilibrium system operates in a four‑stage process: gesture detection, pitch detection, novel chord determination, and conducting through camera movements.

3.1.1 Gesture detection

Our system initializes the camera stream and pose detection using MMPose Coco Wholebody (Jin et al., 2020),1 where we treat pose detection as the computer vision problem of estimating the location of joint ‘keypoints’ (e.g., arms, wrists, face, and substructures) on each detected human in the frame (Fan et al., 2022). After a 3‑s delay to promote camera stream stability, we begin comparing wrist and nose keypoints for all humans (individually) to look for two gestures: ‘raised hand’ and ‘hand to head,’ as shown in Figure 3. The former occurs when any musician raises their hand significantly above their head (i.e., either wrist keypoint is significantly above their nose keypoint, depending on an adjustable threshold); the latter signifies a musician bringing their hand near their face (i.e., if the Euclidean distance between either wrist keypoint and the nose is below a proximity threshold). The ‘raised hand’ cue communicates a musician’s desire to resolve the group’s dissonant chord to the nearest major chord, while the ‘hand to head’ cue indicates that they prefer to move to the nearest minor chord.

Figure 3

The system looks for two gestures, ‘hand raised’ and ‘hand to head,’ which indicate a musician’s desire to move to a major or minor chord, respectively.
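To make the detection logic concrete, the following is a minimal sketch of the two gesture checks described above, assuming each person's keypoints are available as (x, y) pixel coordinates from an MMPose COCO-WholeBody prediction; the threshold names and values are illustrative placeholders rather than the system's actual parameters.

```python
import math

# Illustrative thresholds in pixels; the actual system uses adjustable values.
RAISE_MARGIN = 80          # how far above the nose a wrist must be for 'raised hand'
PROXIMITY_THRESHOLD = 60   # maximum wrist-to-nose distance for 'hand to head'

def detect_gesture(keypoints):
    """Classify one person's pose as 'raised_hand', 'hand_to_head', or None.

    `keypoints` is assumed to map names ('nose', 'left_wrist', 'right_wrist')
    to (x, y) image coordinates. Image y grows downward, so a wrist above the
    nose has a smaller y value.
    """
    nose = keypoints['nose']
    for wrist_name in ('left_wrist', 'right_wrist'):
        wrist = keypoints[wrist_name]
        if wrist[1] < nose[1] - RAISE_MARGIN:
            return 'raised_hand'      # request: resolve to the nearest major chord
        if math.dist(wrist, nose) < PROXIMITY_THRESHOLD:
            return 'hand_to_head'     # request: resolve to the nearest minor chord
    return None
```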

The ImproVision framework can be extended to a larger range of gestures and corresponding signals, to the level of precision and musical options desired by its users. In surveying instrumentalists, no clear consensus emerged on a set of preferred or available gestures suited to all instruments uniformly, so we chose these two gestures because they are large enough to be reliably detected by MMPose as a demonstration of the system concept, particularly in a noisy scene with multiple people and instruments. Furthermore, major and minor chords lack inherent, widely understood gesture representations, so prioritizing usability over naturalness of representation in our cue selection is both practical and justified. For more discussion on appropriate cues, we refer readers to Greer et al. (2024). In any system implementation, it is possible to train a computer vision model to recognize individual gestures by fine-tuning pose-classification models on the desired gesture set, and even to tailor it to specific people if highly precise, less generalizable performance is desired for a particular ensemble.

3.1.2 Pitch detection

If either gesture is detected, the system employs the Fraunhofer Institute for Digital Media Technology's MusicBricks PitchDetectionIDMT library2 for real-time polyphonic pitch detection. We selected this library because it is a research-grade tool developed by a reputable institute, and it proved to be the most reliable solution among the pitch-detection libraries we evaluated, particularly for polyphonic input across multiple instruments. The MusicBricks library operates on short, overlapping audio frames, using spectral analysis to identify prominent frequency peaks. In polyphonic mode, instead of attempting to recover all harmonics, it compares detected fundamentals against a reference set of pitches within a defined range, filtering out irrelevant frequencies and returning one matched frequency per detected source. The library outputs at most as many top frequencies as there are musicians, selected according to how often each frequency is detected across multiple iterations of the audio processing loop. We manually set the number of musicians in this prototype, but this capability could be extended using standard object-detection libraries. During each iteration, detected frequencies are stored and their occurrences counted. At the end of the processing period, frequencies that appear in at least 25% of the total iterations are considered significant; this filters out transient or noise-induced frequencies. The threshold was selected by a coarse parameter sweep. The list of significant frequencies is then sorted by decreasing occurrence, and the top frequencies (up to the number of musicians), if available, are included in the final output along with the corresponding MIDI note names. This selection process ensures that only the most consistently detected frequencies are reported, representing the most reliably detected musical notes of the input chord.
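As a sketch of this aggregation step (under the assumption that each iteration of the audio loop yields a list of detected note identifiers; the function name and example values are illustrative):

```python
from collections import Counter

def select_significant_pitches(frame_detections, num_musicians, min_fraction=0.25):
    """Aggregate per-iteration pitch detections into a final chord estimate.

    `frame_detections` is assumed to be one list of detected notes (e.g., MIDI
    note numbers) per iteration of the audio processing loop. Notes appearing
    in at least `min_fraction` of the iterations are kept, sorted by how often
    they occurred, and at most `num_musicians` notes are returned.
    """
    total_iterations = len(frame_detections)
    counts = Counter(note for frame in frame_detections for note in set(frame))

    # Keep only notes detected consistently enough to rule out transients and noise.
    significant = [(note, c) for note, c in counts.items()
                   if c >= min_fraction * total_iterations]

    # Most consistently detected notes first; report at most one note per musician.
    significant.sort(key=lambda item: item[1], reverse=True)
    return [note for note, _ in significant[:num_musicians]]

# Example: four iterations, three musicians; the transient note 71 is capped out.
frames = [[60, 64, 67], [60, 64, 67, 71], [60, 64], [60, 67, 64]]
print(select_significant_pitches(frames, num_musicians=3))  # [60, 64, 67]
```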

3.1.3 Novel chord determination

In the third stage, depending on the detected gesture, the system processes this input of detected notes to determine the major or minor chord that can be reached from the current constellation of pitches with the minimal overall semitone movement. Concretely, given the detected notes, each voice is allowed to move up or down within a maximum movement range (currently set to two semitones in either direction; this constraint simplifies the conducting process described below). The algorithm enumerates all such combinations of small adjustments, applies them to the initial notes, and evaluates the resulting pitch‑class collection.

All pitches are represented in 12-tone equal temperament and reduced to pitch classes, with octave and enharmonic equivalence assumed (e.g., C# = D♭). The harmonic search space is constrained to the 24 common-practice tonal triads (12 major and 12 minor), defined as any transposition of the pitch-class set {0, 4, 7} for major and {0, 3, 7} for minor, where 0 corresponds to the root pitch class. Inversions (e.g., C major voiced as {E, G, C}) and doublings (e.g., including a second C) are naturally covered, since the system evaluates only the set of pitch classes, not the order or spacing of voices.

For each candidate configuration within the semitone movement limit, the algorithm tests whether its pitch-class set matches a transposed version of the desired template (major or minor). Among all valid matches, it selects the solution with the smallest total movement, measured as the L1 norm: the sum of absolute semitone shifts across all voices. This corresponds to parsimonious voice leading (Tymoczko, 2008): the minimal aggregate movement required by the ensemble to move into harmonic alignment. This formulation defines a discrete region in the broader 'musical possibility spaces' described by Atkinson (2020), ensuring that the system's harmonic transitions are theoretically valid for real-time performance. In cases where multiple candidates achieve the same minimal cost, the system resolves ties deterministically by selecting the first encountered configuration.

As an example, if four musicians initialize on C#, E, G, and B and the system receives the 'minor' cue (i.e., a musician brings their hand to their face), the algorithm moves C# down two semitones to B while leaving the other notes unchanged, producing the notes B, E, G, and B, which together form an E minor triad. This solution requires only two half-steps of total movement and is the chord of the desired quality reachable with the least change from the original notes. Once this optimal chord is determined, the system translates it into a series of camera movement instructions, which effectively 'conduct' the ensemble toward a more harmonious musical state, as described in the following section.
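A minimal sketch of this search, assuming the detected notes are given as MIDI numbers; the function and constant names are illustrative rather than drawn from the ImproVision codebase:

```python
from itertools import product

MAJOR, MINOR = {0, 4, 7}, {0, 3, 7}   # triad templates as pitch-class sets

def nearest_triad(notes, quality='minor', max_shift=2):
    """Find per-voice semitone shifts moving `notes` (MIDI numbers) onto the
    closest triad of the requested quality, minimizing total movement (L1)."""
    template = MAJOR if quality == 'major' else MINOR
    targets = [{(pc + root) % 12 for pc in template} for root in range(12)]

    best = None  # (total_movement, shifted_notes)
    for shifts in product(range(-max_shift, max_shift + 1), repeat=len(notes)):
        shifted = [n + s for n, s in zip(notes, shifts)]
        if {n % 12 for n in shifted} in targets:      # matches some transposition
            cost = sum(abs(s) for s in shifts)        # L1 voice-leading cost
            if best is None or cost < best[0]:        # ties: first encountered wins
                best = (cost, shifted)
    return best

# The example from the text: C#4, E4, G4, B4 with the 'minor' cue.
print(nearest_triad([61, 64, 67, 71], quality='minor'))  # (2, [59, 64, 67, 71]) -> E minor
```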

3.1.4 Conducting through camera movements

In the initial ‘ImproVision Guided Harmony’ case study, Greer et al. designed a toolkit of camera movements that function as a method of rudimentary conducting, illustrated in Figure 4 (Greer et al., 2024). The camera centers on a specific musician, effectively making eye contact with them, and then tells them what note to change to by tilting its ‘head’ up or down according to semitone movements (i.e., a cue of two short successive nods upwards corresponds to moving up two half steps). After sharing individual instructions, the camera centers and delivers a final ‘downbeat’ cue, indicating to all musicians to now change to their respective new notes. This moves the initial dissonant chord to a major or minor consonant chord.

Figure 4

(Top) The ImproVision robot pans to look at a particular musician. It will then tilt up or down, corresponding to the desired note change for this musician. (Bottom) To indicate an ensemble down‑beat, the robot pans to the center, tilts up, and then tilts down.
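To illustrate how a chord solution is turned into this cueing protocol, the sketch below serializes per-musician semitone shifts into an ordered list of abstract camera commands; the command vocabulary ('pan_to', 'nod_up', 'nod_down', 'downbeat') is a placeholder for whatever driver controls the PTZ hardware, not an actual camera API.

```python
def conducting_sequence(voice_shifts):
    """Translate per-musician semitone shifts (ordered left to right) into camera commands."""
    commands = []
    for musician_index, shift in enumerate(voice_shifts):
        if shift == 0:
            continue                                   # unchanged notes need no cue
        commands.append(('pan_to', musician_index))    # 'eye contact' with that musician
        nod = ('nod_up',) if shift > 0 else ('nod_down',)
        commands.extend([nod] * abs(shift))            # one nod per semitone of movement
    commands.append(('pan_center',))                   # return to the whole ensemble
    commands.append(('downbeat',))                     # tilt up then down: change together
    return commands

# The E minor example above: only the first musician moves, down two semitones.
print(conducting_sequence([-2, 0, 0, 0]))
# [('pan_to', 0), ('nod_down',), ('nod_down',), ('pan_center',), ('downbeat',)]
```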

Previous empirical work on conductor–musician synchronization has identified specific kinematic features that most effectively convey temporal information. Beat perception is strongly linked to patterns of acceleration and deceleration along the gesture trajectory, with periods of maximal deceleration often corresponding to the most synchronous ensemble responses (Luck and Nte, 2008; Luck and Sloboda, 2008). These studies also revealed that smoother trajectories, characterized by continuous curvature and gradual speed changes, are generally easier for performers to follow than abrupt or highly angular movements. This aligns with findings from the expressive gesture literature suggesting that minimizing sudden changes in direction supports clearer temporal communication. While our current camera motions were optimized primarily for field of view and discrimination between discrete intentions, research on conducting gestures indicates that smoother, continuous movements—particularly those with consistent curvature—can improve synchronization accuracy. In future iterations, trajectory generation could incorporate spatio-temporal parameters from conductor gesture studies, such as minimizing abrupt velocity changes, to facilitate more intuitive temporal alignment between robot motion and human musical partners, to the extent allowed by the robotic hardware, which we discuss in Section 6.

4 ImproVision Gestured Improvisation

In addition to the ‘Guided Harmony’ and ‘Equilibrium’ systems described above, we recognize the importance of giving the machine increased agency and therefore explore a third interaction mode, which incorporates machine improvisation. Unlike ‘Guided Harmony,’ where the machine conducts musicians through a fixed harmonic structure, ‘Gestured Improvisation’ reverses the direction of control: human performers use body gestures to guide a machine improvising agent. This system leverages Somax2 (Borg, 2019), a modular improvisation engine developed at IRCAM, and uses live pose estimation to translate physical gestures into control signals that drive real‑time generative output.

Somax2 builds on more than two decades of research into how machines can generate music in real‑time (Assayag and Dubnov, 2004; Assayag et al., 2006; Bonnasse‑Gahot, 2012). Somax2’s improvisation agent analyzes and segments large corpora of prerecorded music, then recombines the segments in real‑time based on multidimensional input features. These input features typically stem from live audio or MIDI input; however, as the modular system runs on top of the visual programming language Max, it also accepts inputs via Open Sound Control (OSC). We leverage OSC to provide the improvisation agent with non‑musical signals—such as arm gestures—in addition to audio input. The agent then improvises based on the audio and gestures it receives from the human musician(s).

The Somax2 improvisation agent has various parameters, such as 'continuity' (whether to prioritize continuation of the current musical thought or to jump to a novel idea), 'probability' (the likelihood of the generated output being played), and 'playing mode' (whether the player generates continuous improvisation or waits for new input). In our extension, each detected gesture corresponds to a specific one of these parameters, allowing performers to adjust the machine's musical output without interrupting their instrumental activity.

The system architecture builds on the same PTZ camera and pose estimation framework used in ‘Guided Harmony’ and ‘Equilibrium,’ but, here, the camera functions solely as an observer. Using MMPose with COCO Wholebody keypoints, the system detects a small (but easily extensible) vocabulary of gestures made by the performer.

In its current form, the system recognizes six distinct gestures: raised hand, hand to head, and opposite shoulder touch, each distinguishable for left and right arms. These gestures are defined using relative distances between wrist, head, and shoulder keypoints. When a gesture is detected, it is transmitted via OSC to Somax2, where it maps to pre‑assigned high‑level improvisation parameters. For instance, a raised hand might switch Somax2 into ‘reactive mode,’ which requires explicit auditory input before new material is generated, while a hand to head cue enables continuous improvisation.
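As an illustration of this routing, the sketch below forwards detected gestures to a Somax2 patch over OSC using the python-osc package; the OSC addresses, parameter values, and port are assumptions for the sake of example and should be replaced with the address scheme of the actual Max patch.

```python
from pythonosc.udp_client import SimpleUDPClient

# Placeholder mapping from detected gestures to Somax2-style parameters;
# the addresses and values below are hypothetical.
GESTURE_TO_OSC = {
    'left_raised_hand':     ('/somax/player1/playing_mode', 'reactive'),
    'left_hand_to_head':    ('/somax/player1/playing_mode', 'continuous'),
    'left_shoulder_touch':  ('/somax/player1/continuity',   0.8),
    'right_raised_hand':    ('/somax/player1/continuity',   0.2),
    'right_hand_to_head':   ('/somax/player1/probability',  1.0),
    'right_shoulder_touch': ('/somax/player1/probability',  0.5),
}

client = SimpleUDPClient('127.0.0.1', 7400)   # port on which the Max patch listens (assumed)

def send_gesture(gesture_name):
    """Forward a detected gesture to the improvisation agent as an OSC message."""
    if gesture_name in GESTURE_TO_OSC:
        address, value = GESTURE_TO_OSC[gesture_name]
        client.send_message(address, value)

send_gesture('left_raised_hand')   # e.g., switch the agent into reactive mode
```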

The system is designed to be extensible and adaptable. Because gestures are detected via pose rather than sensors or physical controllers, they can be redefined or expanded without modifying the core architecture. This makes it possible to adapt the system for different instrument types or performer constraints (e.g., a violinist could control parameters with foot movements rather than hand gestures). More broadly, ‘Gestured Improvisation’ demonstrates how embodied interaction can support dynamic, co‑creative relationships between human musicians and generative AI systems.

5 Evaluating ImproVision Creativity

Jordanous developed the Standardized Procedure for Evaluating Creative Systems (SPECS) (Jordanous, 2012), aligned with definitions of and discourse around creativity from both human and computational production perspectives. In this section, we evaluate ImproVision using SPECS.

5.1 SPECS step 1: identify a definition of creativity that your system should satisfy to be considered creative

We use the base definition of creativity from Jordanous (2012)—namely, that a system should display characteristics of active involvement and persistence; deal with uncertainty; display general intellectual ability, intention and emotional involvement, progression and development, spontaneity, and subconscious processing; add value; and generate results. It should also show domain competence; independence and freedom; originality; social interaction and communication; thinking and evaluation; and variety, divergence, and experimentation.

SPECS requires the identification of aspects of creativity that are particularly important to the domain the system works in, as well as those less important. We consider active involvement and persistence; progression and development; showing domain competence; independence and freedom; social interaction and communication; and variety, divergence, and experimentation to be the core components of collaborative musical agents for the purpose of creating shared human–computer musical output. Less important are the aspects of general intellectual ability, originality, and thinking and evaluation. We downplay these aspects because creativity in a musical context often involves invoking familiar patterns rather than creating wholly new sounds or sequences, and originality (taken to an extreme) can push output beyond the expected range of human musical output, violating the principle that computationally creative output should resemble human creative output. General displays of intellectual ability and thinking are not necessary for the system to give meaningful output in the musical domain, as the output of ImproVision is generally scoped to the current music-making session with human musicians (e.g., the material searched for displays of intelligence is reasonably constrained to the current session).

5.2 SPECS step 2: using step 1, state what standards are used to evaluate creativity of the system

According to Jordanous (2012), each standard identified from step 1 should be evaluated with a separate criterion, which we define here:

  1. Active involvement and persistence: Is the system performing actively and concurrently with human musicians for the entirety of the music‑making session?

  2. Progression and development: Does the system develop musical output in sequence typical to the musical style?

  3. Showing domain competence: Does the system produce musical output idiomatic to its genre?

  4. Independence and freedom: Does the system contribute musical ideas not derived only from content created by the human counterpart?

  5. Social interaction and communication: Does the system express ideas to the human, such that the response from the human is observable and correlated?

  6. Variety, divergence, and experimentation: Is system output quantifiably dissimilar or varying on a musically meaningful metric between sessions?

5.3 SPECS step 3: test the creative system against the standards set in step 2 and report the results

Following the standards set in step 2, our system performs as follows:

  • 1. The system performs actively and concurrently with the human musicians; it continuously observes the human musicians and responds following each observation.

  • 2 and 3. In both Equilibrium and Gestured Improvisation, the system follows the expected musical style by providing chord adjustments within the harmonic expectations of the Western canon. The development of progressions follows a dissonance-to-resolution structure.

  • 4. In ImproVision Equilibrium, no, the ideas are reactive to content created by the human counterpart. In ImproVision Gestured Improvisation, yes, musical content is generated independently of the human counterpart.

  • 5. In Equilibrium, the system expresses ideas to the human that the human can measurably respond to by their choice of pitches to play in the next chord. In Gestured Improvisation, the expression comes in the form of direct musical output, which the human can react to by changing their own playing style to match cohesively with the generated output.

  • 6. In both discussed modes, the output is quantifiably dissimilar. Randomness plays a role in both generative algorithms; while the chords of Equilibrium must follow harmonic guidelines, there is random choice involved in specific chord selection. The output of Gestured Improvisation is the stochastic output of Somax2, which varies between iterations even given similar input.

6 Limitations and Future Research

6.1 Information dynamics of nonverbal communication and limitations of robotic systems

The message-sharing conducted by the ImproVision system, in PTZ form, has rate limitations that may affect what types of musical messages are useful in the above-defined performance applications. In our example, the robot takes on positional 'up' and 'down' states to transmit pitch modification instructions or perform a 'downbeat.' The 'up' motion, in both range and speed, is constrained by the actuators of the robotic system. If we consider the interval for the robot to move from neutral to 'up' to take time t, then the robot can communicate one message every t seconds, or at a frequency of 1/t messages per second. To interact in a way that is musically meaningful, these messages must be delivered at a rate acceptable to performance practice. In the case of the PTZ camera, the maximum tilt velocity is 69.9 degrees per second. Thus, a 45-degree tilt can be achieved in 0.643 s, allowing a message frequency of approximately 1.555 messages per second, or approximately 93 messages per minute. Compared to a beats-per-minute (bpm) measurement of musical tempo, this may be practical for slow performances. On the other hand, the messages passed by ImproVision are more complex than a simple 'up' or 'down.' Even the composite 'downbeat' motion reduces the rate to 46 messages per minute, and adding in the necessary panning motion to indicate attention to different performers or voices further slows the message rate to the point where only broad harmonic or stylistic information may be communicated in this discrete-message form. For example, harmonic information may change at the bar level (chord changes) or phrase/section level (tonality changes), which may be appropriate to influence with this mechanism, whereas beat-level changes would be rendered impractical by this physical constraint. This motivates further research into the media by which these messages are conveyed (e.g., visual scores).
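The rate estimate above can be reproduced directly from the stated hardware constants (a short illustrative calculation, not part of the system code):

```python
TILT_SPEED = 69.9    # maximum PTZ tilt velocity, degrees per second
TILT_RANGE = 45.0    # tilt excursion used for a single 'up' or 'down' message, degrees

t = TILT_RANGE / TILT_SPEED            # time for one neutral-to-'up' motion: ~0.644 s
messages_per_minute = 60 / t           # ~93 single-motion messages per minute
downbeats_per_minute = 60 / (2 * t)    # a 'downbeat' uses two motions: ~46.6 per minute

print(f"{t:.3f} s per motion, {messages_per_minute:.1f} msg/min, "
      f"{downbeats_per_minute:.1f} downbeats/min")
```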

6.2 Localization and applications

A current limitation of the ‘ImproVision Equilibrium’ system is its lack of sound source separation and localization. Without localization, it is impossible to know which musician is playing which note and, thus, to whom to give what instructions. To circumvent this issue, we have added the constraint that musicians are seated in increasing order of frequency from left to right. There are a variety of recent approaches to sound source localization, especially those augmented by a video modality, which may be promising extensions to the ImproVision system (Qian et al., 2020; Rascon and Meza, 2017; Senocak et al., 2022).

Adding localization opens up promising possibilities for future expansions of this system. For instance, we see applications in ensemble rehearsal, where this system could assist with tuning, correcting dissonant chords, and building intuition for ensemble characteristics such as timbral and harmonic chord balance, along with benefitting compositional aspects such as harmonic resolution. A system able not only to recognize but also to indicate and correct performance mistakes at the individual level may improve rehearsal efficiency, providing feedback on the fly without stopping the rehearsal to give individual notes and leaving valuable time for group musical considerations.

Musical games provide pedagogical value (Cassidy and Paisley, 2013; Denis and Jouvelot, 2005; Harrop‑Allin, 2017; Jimenez, 2018) and have also gained recent cultural popularity within social media spaces. A contemporary example within the ‘game’ setting is an online singing challenge where a group of participants simultaneously sing arbitrary notes until they harmonize by chance.3 With added localization, ImproVision Equilibrium could guide singers to a more pleasing chord.

Besides games based on pitch and standing harmony, other means of improvised musical expression, such as dynamics, motif, style, and orchestration, can also be examined as possible mappings, for example relating a particular robotic instruction to a change in dynamic contour or to the selection of a new improvisational leader for a section. While the robotic hardware used here has a limited rate of motion, other robotic systems may be able to convey different messages to performers, subject to their own constraints: which human-like features they best emulate, their suitability for a given performance setting, their motion constraints, and their communication rate (both hardware latency and the assembly and delivery of meaningful instructional signals from their motions).

There exist many avenues for continued research, including the addition of an audio output mode to the system. Such audio could be generated by a variety of learning techniques (pretrained or conditioned on the musical styles of the performers) and use the visual stream as a control signal (Dubnov et al., 2022; Dubnov et al., 2023).

Even with the current modes, we also pose for future research the question of whether discrete codebooks of gestures should be pre‑determined or if a relationship between gesture and musical intention can be learned through jointly observing the visual and audio output of the musician. In the latter case, ImproVision performances could (after sufficient training) adapt directly to the musical intentions of performers, issuing musical instructions and interactions to the ensemble based on observations of individual or multiple performers.

While our focus here is on gesture paradigms prevalent in Western ensemble practice, we acknowledge the rich and diverse gestural vocabularies present in other musical traditions, such as recent neuro‑ethnomusicological work on gesture–vocal coupling in Karnatak music (Pearson and Pouw, 2022). These traditions employ culturally specific gestural lexicons that mediate timing, phrasing, and interaction in ways that may differ substantially from Western conducting norms. Our present system does not attempt to replicate these repertoires, but future cross‑cultural studies could explore how robotic gesture design might adapt to, or be informed by, such practices.

7 Conclusion

ImproVision Equilibrium marks an advancement in multimodal musical human–machine interaction. By integrating real‑time pitch detection with visual cues, both musician‑derived and conductor‑directed, this system demonstrates the potential for machines to participate actively in musical co‑creation. Working toward audiovisual interactions mirrors the complex, multimodal nature of human ensemble communication.

The system’s ability to guide musicians from dissonant to consonant major and minor chords showcases its potential as both a creative tool and an educational aid. Bridging the gap between human musical intuition and machine precision offers new possibilities for musical exploration and learning.

In addition to Equilibrium, this work integrates two other systems (Guided Harmony and Gestured Improvisation) to illustrate a spectrum of interaction types between human and machine agents in musical contexts. Using the SPECS framework for evaluating creative systems, we demonstrated how ImproVision satisfies core criteria for creativity in collaborative musical settings, including communication, adaptability, and co‑creation. This supports the system’s potential as a creative and pedagogical tool. Extensions to the current system, particularly in sound localization, lay the groundwork for more sophisticated human–machine interactions. Future developments could include enhanced localization capabilities, more nuanced conducting gestures, and adaptation to different musical attributes. Additionally, the principles developed as an overall framework here could inform broader applications in human‑machine co‑creativity across other artistic domains.

As we continue to explore the intersection of technology and music, systems like ImproVision point toward a future where machines are not just tools for music creation, but active, responsive collaborators in a co‑creative process. This research opens up exciting possibilities for enhancing musical education, performance, and composition through intelligent, multimodal human–machine interaction.

Data Accessibility

We provide demonstration videos of ImproVision via a demo repository at https://tinyurl.com/ImproVisionDemos. We also provide a code repository to reproduce the ImproVision musical performance games at https://github.com/rossgreer/ImproVision.

Competing Interests

The authors have no competing interests to declare.

Authors’ Contributions

All authors made substantial contributions to the conception or design of the work; the acquisition, analysis, and interpretation of data for the work; the drafting of the work; and to revising the draft critically for important intellectual content.

Notes

[1] OpenMMLab Pose Estimation Toolbox and Benchmark, https://github.com/open-mmlab/mmpose, 2020.

DOI: https://doi.org/10.5334/tismir.225
Submitted: Sep 2, 2024 | Accepted: Oct 6, 2025 | Published: Nov 17, 2025
Published by Ubiquity Press

© 2025 Ross Greer, Laura Fleig, Shlomo Dubnov, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.