Circles, an experimental approach to film music composition through sonification of moving images
This paper discusses an experimental approach to film music composition through sonification. Sonification, understood here as the practice of transforming moving images into sounds, is not a new concept. More generally, there have been many attempts to present data as sound; this technique is called “data sonification” and it is the counterpart of the more established practice of “data visualization”. From stock market data to volcanic activity, from gravitational waves to urban pollution, many kinds of data have been treated with sonification. The aim here is to apply sonification to a video, a film or a documentary by extracting data that can be converted into a musical piece. In doing so, the film composer may find new sources of artistic inspiration and new composing techniques and approaches that lead to unexpected and evocative musical results.

1. INTRODUCTION:

1.1 Goals

The aim of this research is to explore the possibility of letting the video compose its own music. This could lead to several interesting outcomes. First, it would create unpredictable artistic results by forcing the composer to deviate from the usual creative workflow of watching the film, gathering musical themes, harmonies and ideas, and then composing the score. Second, it would speed up the process of music creation, because the length of the video would not influence the duration of the writing process: once the sonification of the video data is set up, the algorithm would create the music automatically and in real time. Third, the suggested approach could be extended beyond the sonification of a video. By using video cameras, computer vision, artificial intelligence systems and real-time object detection devices, several interactive synesthetic experiences could be created for a general audience by capturing human body movement data and transforming it into music.
This form of movement interpretation could help to explain the meaning of sound, movement and music in relation to everyone's physical experience. Fourth, this research could lead to new software, an algorithm or a plug-in that enhances the creative workflow of composers, video makers, production companies and others who could benefit from automated music creation tools driven by video data.

1.2 Challenges

The first challenge was to find meaningful ways to extract usable data from a video. Many software tools are available today; the choice here was Max/MSP, and specifically the set of cv.jit objects designed by Jean-Marc Pelletier. Several patches and algorithms built in Max/MSP made it possible to extract numeric values from visual parameters such as brightness, horizontal and vertical position of various objects, size, movement, speed, contrast and saturation. The second challenge was to find ways to attribute a musical meaning to the collected data. The biggest challenge was creating a musical, melodic, harmonic and sonic vocabulary that could use those data artistically. This work had three main goals. First, to create a piece of music that is meaningful, pleasant and understandable rather than random and chaotic. Second, to create a piece of music that enhances and comments on the story of the video in a narrative way, as a traditional film composer would do. Third, to arrive at a piece of music that is an aural representation of its visual counterpart: a certain level of similarity between what we see and what we hear needed to be achieved. This required a careful understanding of how people perceive sounds and images in their everyday physical experience, in a multisensory approach.

2. BACKGROUND:

The idea of using a picture, a drawing or a moving image such as a film as a source for music composition is not new.
There are several examples of this practice, from the graphical scores of composer Sylvano Bussotti to the “clavier à lumières” (“keyboard with lights”), a synesthetic musical instrument invented by the composer Alexander Scriabin for his work Prometheus: Poem of Fire. The work of Conlon Nancarrow, his Studies for Player Piano, his graphical scores and his extensive use of auto-playing instruments are valuable examples of sonification too. The ANS synthesizer, created by Russian engineer Evgeny Murzin between 1937 and 1957, is another example of the attempt to convert a graphical image, a drawing or a drawn sound spectrum into a piece of music. Other machines, such as the “Oramics” system designed in 1957 by musician Daphne Oram or the Variophone developed by Evgeny Sholpo in 1930, are further examples of graphical sound techniques designed to create a more literal relationship between visual and audio material. The common characteristic of those early projects was that they all used a static image as a source for sonification. The image was usually scanned from left to right in order to produce sound. Unfortunately, the relationship between sound and time was lost. By contrast, if a video is used as the source of sonification, the link between what we see at a specific moment and what we hear is maintained, because the process of sonification happens in real time.

3. COLLECTING DATA FROM IMAGE:

For the preliminary step of extracting data from the video, two main approaches were designed. The first is called “Centroid Blobs” and the second “Pixel Mosaic”.

3.1 “Centroid Blobs”

This approach uses the main features of the various objects and shapes present in the video by identifying clusters of similar pixels from one frame to the next. Whenever the algorithm identifies a corner, a line, a mass or another salient feature, it assigns it a “centroid”, a “blob” and a “label”. The recognition process operates on a black-and-white version of the video.
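
The idea of labelling a bright region and reducing it to a centroid can be illustrated outside Max/MSP with a minimal Python sketch. This is not the cv.jit implementation: the function name `find_blobs`, the brightness threshold of 128 and the use of 4-connected flood fill are all assumptions made for illustration, and the frame is assumed to be a small grid of 8-bit grey values.

```python
from collections import deque


def find_blobs(frame, threshold=128):
    """Return one (centroid_x, centroid_y, mass) tuple per bright region.

    `frame` is a list of rows of grey values (0-255). The threshold and the
    4-connected neighbourhood are illustrative assumptions.
    """
    height, width = len(frame), len(frame[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or frame[y0][x0] < threshold:
                continue
            # Flood-fill the connected region that starts at (x0, y0).
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            sum_x = sum_y = mass = 0
            while queue:
                x, y = queue.popleft()
                sum_x += x
                sum_y += y
                mass += 1
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and not seen[ny][nx] and frame[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # The centroid is the mean pixel position; the mass is the pixel count.
            blobs.append((sum_x / mass, sum_y / mass, mass))
    return blobs
```

Each tuple gives a blob's horizontal position, vertical position and pixel count, which is the kind of per-frame data a tracking patch would emit.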
Additional controls for saturation, brightness and contrast can modify the behavior of the algorithm. Each blob corresponds to an object or a recognizable feature and produces three numeric values at any moment: horizontal position, vertical position and mass (size). Those three values are converted into MIDI information; each blob is assigned a specific instrument and a MIDI channel, and the data are sent from Max/MSP to Ableton Live through several MIDI ports.

3.2 From raw data to music:

The stream of horizontal and vertical movement data and the size of each blob are converted into MIDI. The translation of those raw numbers into a musical vocabulary aims to preserve the most obvious correspondences between how people perceive sounds and their physical, bodily experience.

The horizontal position of each blob can be effectively translated into a stereo panning value, from left to right and vice versa. It makes sense to place a sound on the left, center or right of the stereo field if the corresponding object is in the same visual position in the video. For this, the MIDI continuous controller “cc10” (panning) seemed to be the best option.

The vertical position of each blob can be translated into pitch variations, from low to high, moving bottom-up or top-down. This translation seems quite obvious too: in music a sound is described as “low” (low pitch) or “high” (high pitch). High frequencies tend to be perceived as higher (closer to our head) and smaller, whereas low frequencies tend to be perceived as bigger, heavier and lower (closer to our guts or feet).

The mass (size) of the blobs can be translated into a variation of volume and loudness. The most suitable continuous controller is “cc7” (MIDI volume). It is worth noting that our brain tends to perceive the size of a sound not just in terms of volume (a bigger or closer object will sound louder and vice versa); a variation of the frequency spectrum can suggest a variation in size too.
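
The three mappings described above (horizontal position to cc10, vertical position to pitch, mass to cc7) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `blob_to_midi`, the linear scaling, the note range 36 to 84, the use of a MIDI note number for pitch and the normalizing constant `max_mass` are all hypothetical choices, not values taken from the Max/MSP patches.

```python
def blob_to_midi(x, y, mass, width, height, max_mass, low_note=36, high_note=84):
    """Map one blob to a pan value (cc10), a note number and a volume (cc7).

    Coordinates are assumed to have their origin at the top-left of the
    frame. All ranges and the linear scaling are illustrative assumptions.
    """

    def unit(value, maximum):
        # Normalize to 0..1 and clamp; a zero maximum maps everything to 0.
        if maximum <= 0:
            return 0.0
        return min(max(value / maximum, 0.0), 1.0)

    # cc10: 0 is hard left, 127 is hard right.
    pan = round(unit(x, width - 1) * 127)
    # Image rows grow downward, so y is inverted: top of frame = highest note.
    note = low_note + round((1.0 - unit(y, height - 1)) * (high_note - low_note))
    # cc7: a larger blob produces a louder sound.
    volume = round(unit(mass, max_mass) * 127)
    return {"cc10_pan": pan, "note": note, "cc7_volume": volume}
```

A blob in the top-left corner of a 320 by 240 frame therefore yields a hard-left pan and the highest note of the chosen range, while a blob in the bottom-right corner yields a hard-right pan and the lowest note.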
Sounds with less low-frequency content tend to be perceived as “thinner” and therefore “smaller”, whereas sounds with more low-frequency content are perceived as “fatter” and “bigger”. It was found that translating the blob mass value into MIDI control of low-pass and high-pass filters can convey an effective perception of size.

As mentioned before, the blobs' MIDI data are routed into Ableton Live. Here, a certain level of artistic freedom remains: each blob can be associated with a specific scale (diatonic, chromatic) and a musical key or mode. The choice of sounds is free as well; several patches and variations have been designed using stock synthesizers in Ableton Live as well as Native Instruments' Kontakt sound banks, pure sine waves or wavetables.

3.3 “Pixel Mosaic”:

The second approach is called “Pixel Mosaic”. Its data extraction technique and its interpretation of the video are completely different from those of the “Centroid Blobs” system. Here the video canvas is treated almost like a digital musical instrument. The video is converted to black and white and downscaled to a matrix of forty by eighty pixels, for a total of three thousand two hundred pixels. Each pixel represents a sound, either a pure sine wave (in an additive synthesis setup) or a filter frequency (in a subtractive synthesis setup). After the black-and-white conversion, the system uses the luma value (brightness) to control the loudness of each pixel-sound. Each pixel can range from complete silence (black) to full volume (white), with all the in-between nuances of the grey scale. In this way the video acts as a score for the music. Depending on its content, some pixels will be brighter and others darker, and the musical result will be different every time.

3.4 From raw data to music:

The 3200 pixels are divided into eighty vertical columns.
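
The pixel grid described so far can be sketched in Python as a mapping from luma to gain, followed by a regrouping into columns. The sketch assumes 8-bit luma values and a linear luma-to-gain curve; both are assumptions, as are the function names `luma_to_gains` and `columns`, and the actual patches may use a different response curve.

```python
ROWS, COLS = 40, 80  # grid size given in the text: 40 x 80 = 3200 pixels


def luma_to_gains(frame):
    """Convert a 40 x 80 grid of 8-bit luma values into gains in 0..1.

    Black (0) maps to silence and white (255) to full volume. The linear
    mapping is an illustrative assumption.
    """
    if len(frame) != ROWS or any(len(row) != COLS for row in frame):
        raise ValueError("expected a 40 x 80 luma grid")
    return [[value / 255.0 for value in row] for row in frame]


def columns(gains):
    """Regroup row-major gains into 80 columns of 40 values each."""
    return [[gains[r][c] for r in range(ROWS)] for c in range(COLS)]
```

Each of the eighty lists returned by `columns` holds the forty gains that would drive one column's oscillators or filter bands for the current frame.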
Each column contains forty pixels (and consequently forty pure sounds or filter bands). For a clearer correspondence between what we see and what we hear, column one is panned hard left and column eighty hard right, with all the other columns placed according to their relative horizontal position in the video. This gives the most natural audio-visual correspondence. Every column is tuned in the same manner, and the tuning follows common musical keys and scales (C major, D Dorian, E minor, a C minor triad in root position or its inversions, et cetera). The tuning of the columns can be changed through a sub-patch called “Transposing Machine”, which can be triggered via dedicated buttons corresponding to the various keys and scales or via MIDI keyboard input. The sound is generated inside Max/MSP using sound modules such as iosc banks, pink noise, white noise, multiband filters or wavetables.

4. CONCLUSIONS:

Several conclusions can be drawn from experimenting with the presented framework. First of all, the aim of this research was to compose the music for a feature-length narrative documentary called “Circles”, a forty-five-minute film. The documentary narrates the alternation of life and death, the spirituality of human beings and the meaning of our existence. The composition was planned around extensive use of the sonification algorithms presented in this paper. One of the focal points of this research was to discover whether the designed systems of algorithmic and automated composition could be useful and valuable in a “real-time scenario” of composing an entire soundtrack for a full film. The artistic results appear to answer that question positively. More specifically, the following points emerged:

-The algorithm can only comment musically on the film in a simple, literal and linear relationship. We hear what we see.
The system cannot decide which melody, harmony, sound or scale is most appropriate for a specific scene. In other words, the job of determining the best musical vocabulary is still left to the composer.
-The proposed technique can be a valuable tool for designing new and fresh sonic landscapes and audio palettes that interact very well with their visual counterpart. The concept of “mickey-mousing” (following every movement in a video with a musical gesture) is particularly emphasized here. This can be a positive or a negative element depending on the aesthetic and artistic results that the composer wishes to achieve.

5. REFERENCES:

Harrison, John (2001). Synaesthesia: The Strangest Thing. ISBN 0-19-263245-0.
Zimmerman, Walter (2020). Desert Plants – Conversations with 23 American Musicians. Berlin: Beginner Press in cooperation with Mode Records (originally published in 1976 by A.R.C., Vancouver).
Gann, Kyle (2006). The Music of Conlon Nancarrow, p. 38. ISBN 978-0521028073.
Levin, Thomas (2003). Tones from out of Nowhere: Rudolf Pfenninger and the Archaeology of Synthetic Sound. Grey Room 12 (Fall 2003), pp. 32-79.
Oram, Daphne (1972). An Individual Note: Of Music, Sound And Electronics. Galliard. ISBN 978-0-8524-9109-6.
Pelletier, J.M. (2008). "Sonified Motion Flow Fields as a Means of Musical Expression". In Proceedings of the International Conference on New Interfaces for Musical Expression, Genova, Italy, pp. 158-163.
Kramer, Gregory, ed. (1994). Auditory Display: Sonification, Audification, and Auditory Interfaces. Santa Fe Institute Studies in the Sciences of Complexity, Proceedings Volume XVIII. Reading, MA: Addison-Wesley. ISBN 978-0-201-62603-2.