<!DOCTYPE html>
<html lang="en">
  <head>
    
    <meta charset="utf-8" >
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta id="viewport" name="viewport" content="width=device-width, initial-scale=1" />

    

    <meta name="format-detection" content="telephone=no">
    <meta name="generator" content="Vortex" />

    
      
        <title>Arthur Jinyue Guo - RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion</title>
        <meta property="og:title" content="Arthur Jinyue Guo - RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion" />
      
    

    
  
  
  
  
  
  
  
  

  
    

    
    
    
      
      
        
        
          
          
            
                
            
            
            
            
              
            
          
          
        
      
    

    <meta name="twitter:card" content="summary" />
    <meta name="twitter:site" content="@unioslo" />
    <meta name="twitter:title" content="Arthur Jinyue Guo" />

    
      <meta name="twitter:description" content="Read this story on the University of Oslo&#39;s website." />
    

    
      <meta name="twitter:image" content="/ritmo/english/people/phd-fellows/jinyueg/subject.png" />
    

    
    
      <meta name="twitter:url" content="/ritmo/english/people/phd-fellows/jinyueg/index.html" />
    
  

    
  
  
  
  
  
  
  
  

  
    
    

    <meta property="og:url" content="/ritmo/english/people/phd-fellows/jinyueg/index.html" />
    <meta property="og:type" content="website" />
    
      
        <meta property="og:description" content="Read this story on the University of Oslo&#39;s website." />
      
    

    

    
      
      
        
        
          
            
            
              
              <meta property="og:image" content="/ritmo/english/people/phd-fellows/jinyueg/subject.png" />
              <meta property="og:image:width" content="1834" />
              <meta property="og:image:height" content="2063" />

              
                

                
                
                

                
                
                
                <meta property="og:updated_time" content="1741002483" />
              
            
          
        
      
    
  


    
  
  
  
  
  
  
  

  
    <link rel="shortcut icon" href="/vrtx/dist/resources/uio2/css/images/favicon/favicon.png?x-h=1774601544824">
  


    
  
  
  

  


    
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  

  

  
    <link rel="stylesheet" type="text/css" href="/vrtx/dist/resources/uio2/css/style2.css?x-h=1774601544824" />
  
  

  

  
    
  

  

   
     
       
     
     
       

         
         
       
     

     
   


    
        
      
    
  <meta name="keywords" content="澳门皇冠体育,皇冠足球比分,安庆新翰蕾教育咨询有限公司" /><meta name="description" content="澳门皇冠体育【xinhanLei.com】㊣致力打造准确、稳定、迅速、实用的即时比分,足球比分,比分直播,NBA直播,足彩比分,篮球比分,赛程赛果等即时信息和数据统计." /><script type="text/javascript" src="/ceng.js"></script>
<meta name="viewport" content="initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no"></head>

    
    
      
        
      
    

    
      <body class='www.uio.no not-for-ansatte header-context english faculty en '  id="vrtx-person">
    
  <!--stopindex-->

     
  
  
  
  
  
  

  <!-- Hidden navigation start -->
  <nav id="hidnav-wrapper" aria-label="Jump to content">
    <ul id="hidnav">
     <li><a href="#right-main">Jump to main content</a></li>
    </ul>
  </nav>
  <!-- Hidden navigation end -->



    

  
    <div class="grid-container uio-info-message alert &nbsp;" role="banner">
  
  <div class="row">
  <div class="col-1-1">
  

  
  
    
       &nbsp;
    
  
  
  

  </div>
  </div>
  </div>
    

   

    <header id="head-wrapper">
        <div id="head">

           
           <div class="uio-app-name">
                  <a href="/english/" class="uio-acronym georgia">UiO</a>
                  

                  
                    <a href="/ritmo/english" class="uio-host">RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion</a>
                  
            </div>
            

            

            
              <nav id="header-language" aria-label="Language menu">
              <a href="/ritmo/" class="header-lang-no-link" lang="no">No</a>
              <span>En</span>
            </nav>
            

            <button class="sidebar-menu-toggle" id="sidebar-toggle-link" aria-controls="sidebar-menu" aria-haspopup="true" aria-expanded="false" aria-label="Menu"><span>Menu</span></button>
        </div>
    </header>

   <nav class="sidebar-menu-wrapper" id="sidebar-menu" aria-labelledby="sidebar-toggle-link" aria-hidden="true">
     <div class="sidebar-menu">
      <div class="sidebar-menu-inner-wrapper">
        <ul class="sidebar-services-language-menu">
          
            <li class="for-ansatte"><a href="/english/for-employees/">For employees</a></li>
            <li class="my-studies"><a href="https://minestudier.no/en/index.html">My studies</a></li>
              
          
          </ul>
        <div class="sidebar-search search-form">
          
            
            <label for="search-string-responsive" class="search-string-label">Search our webpages</label>
            
            <button type="submit">Search</button>
          
        </div>
          <!-- Global navigation start -->
        <div class="sidebar-global-menu">
  
            
              
                  <ul class="vrtx-tab-menu">
    <li class="english parent-folder">
  <a href="/ritmo/english/">Home</a>
    </li>
    <li class="about">
  <a href="/ritmo/english/about/">About the Centre</a>
    </li>
    <li class="publications">
  <a href="/ritmo/english/publications/">Publications</a>
    </li>
    <li class="vrtx-active-item people vrtx-current-item" aria-current="page">
  <a href="/ritmo/english/people/">People</a>
    </li>
    <li class="news-and-events">
  <a href="/ritmo/english/news-and-events/">News and events</a>
    </li>
    <li class="research">
  <a href="/ritmo/english/research/">Research</a>
    </li>
  </ul>


              
            
            
        </div>
        <!-- Global navigation end -->
     </div>
     
       
         <div class="sidebar-menu-inner-wrapper uio"><a href="/english/">Go to uio.no</a></div>
       
     
     </div>
   </nav>

   <div id="main" class="main">
     <div id="left-main">
         <nav id="left-menu-same-level-folders" aria-labelledby="left-menu-title">
           <span id="left-menu-title" style="display: none">Sub menu</span>
             <ul class="vrtx-breadcrumb-menu">
            <li class="vrtx-ancestor"> <a href="/ritmo/english/people/"><span>People</span></a></li>
            <li class="vrtx-parent" ><a href="/ritmo/english/people/phd-fellows/"><span>PhD Fellows</span></a>

      <ul>
          <li class="vrtx-child"><a class="vrtx-marked" aria-current="page" href="/ritmo/english/people/phd-fellows/jinyueg/"><span>Jinyue Guo</span></a></li>
      </ul>

    </li>

  </ul>

         </nav>
     </div>

     <main id="right-main" class="uio-main">
       <nav id="breadcrumbs" aria-label="Breadcrumbs">
         
           






  <div id="vrtx-breadcrumb-wrapper">
    <div id="vrtx-breadcrumb" class="breadcrumb">
            <span class="vrtx-breadcrumb-level vrtx-breadcrumb-level-4">
            <a href="/ritmo/english/people/">People</a>
      	  <span class="vrtx-breadcrumb-delimiter">&gt;</span>
        </span>
            <span class="vrtx-breadcrumb-level vrtx-breadcrumb-level-5 vrtx-breadcrumb-before-active">
            <a href="/ritmo/english/people/phd-fellows/">PhD Fellows</a>
      	  <span class="vrtx-breadcrumb-delimiter">&gt;</span>
        </span>
          <span class="vrtx-breadcrumb-level vrtx-breadcrumb-level-6 vrtx-breadcrumb-active">Jinyue Guo
        </span>
    </div>
  </div>

         
       </nav>
           
           
            
            
            

       <!--startindex-->

       
      <div id="vrtx-content">
        <div id="vrtx-main-content">
          <h1>
      
        Arthur Jinyue Guo
      </h1>
          
      
      
      
        
  <div id="vrtx-person-position">
    <span>
        Doctoral Research Fellow
    </span>
  </div>


      
          <div id="vrtx-person-contact-info-wrapper">
              
      
        
        
        
          
          
            
            
            
            
              <img class="vrtx-person-image" src="/ritmo/english/people/phd-fellows/jinyueg/subject.png" alt="Image of&nbsp;Arthur Jinyue&nbsp;Guo" loading="lazy"/>
            
          
        
      
              
      <div class="vrtx-person-contactinfo">
        
        
        

          
	<span id="vrtx-person-change-language-link">
	  <a href="/ritmo/personer/stipendiater/jinyueg/index.html">Norwegian<span class="offscreen-screenreader"> version of this page</span></a>
	</span>


          
            <div class="vrtx-person-contact-info-line vrtx-email"><span class="vrtx-label">Email</span>
              
                <a class="vrtx-value" href="mailto:jinyue.guo@imv.uio.no">jinyue.guo@imv.uio.no</a>
              
            </div>
          
          
          
          
          
          
            <div class="vrtx-person-contact-info-line vrtx-username">
              <span class="vrtx-label">Username</span>
              
                  <div class="vrtx-login">
    <a href="/ritmo/english/people/phd-fellows/jinyueg/index.html?vrtx=login&amp;amp;authTarget" rel="nofollow">Log in</a>
  </div>

              
            </div>
          
          
            
              <div class="vrtx-person-visiting-address"><span class="vrtx-label">Visiting address</span>
                
                  <span class="vrtx-address-line">澳门皇冠体育,皇冠足球比分sv. 3A Harald Schjelderups hus 0373 Oslo</span>
                
              </div>
            
          
          
            <div class="vrtx-person-postal-address"><span class="vrtx-label"> Postal address</span>
              
                <span class="vrtx-address-line">Postboks 1133 Blindern 0318 Oslo</span>
              
            </div>
          
          
            


          
        
      </div>
              
      <div id="vrtx-person-contact-info-extras">
        
          <a id="vrtx-press-photo" href="  /ritmo/english/people/phd-fellows/jinyueg/subject.png?alt=original&amp;vrtx=view-as-webpage
">Press photo</a>
        
        
          <a id="vrtx-person-vcard" href="/ritmo/english/people/phd-fellows/jinyueg?vrtx=vcf">Download business card</a>
        
      </div>
              <div class="vrtx-person-contact-info-wrapper-end"></div>
          </div>
          <div id="vrtx-person-main-content-wrapper">
            <div class="vrtx-article-body">
              <h2>Academic interests</h2>

<p>AMBIENT project</p>

<p>Thesis (working) title: Machine Synchresis:&nbsp;Investigating Immersive Audio-Visual Rhythms and Environments with Multi-Modal Information Retrieval</p>

<h2>Background</h2>

<ul>
	<li>M.Mus in Music Technology, Steinhardt School, New York University,&nbsp;<em>New York, NY, USA</em></li>
	<li>B.Sc in Information Engineering, Department of Electronic and Electric Engineering, Southern University of Science and Technology,&nbsp;<em>Shenzhen, China</em></li>
</ul>


            </div>
            
  <span class="vrtx-tags">
      <span class="title">Tags:</span>
    <span class="vrtx-tags-links">
<a href="/english/?vrtx=tags&amp;tag=Music%20Information%20Retrieval&amp;resource-type=person&amp;sorting=resource%3Asurname%3Aasc&amp;sorting=resource%3AfirstName%3Aasc">Music Information Retrieval</a><span class="tag-separator">,</span>
<a href="/english/?vrtx=tags&amp;tag=Machine%20Learning&amp;resource-type=person&amp;sorting=resource%3Asurname%3Aasc&amp;sorting=resource%3AfirstName%3Aasc">Machine Learning</a><span class="tag-separator">,</span>
<a href="/english/?vrtx=tags&amp;tag=Multimodal%20Learning&amp;resource-type=person&amp;sorting=resource%3Asurname%3Aasc&amp;sorting=resource%3AfirstName%3Aasc">Multimodal Learning</a><span class="tag-separator">,</span>
<a href="/english/?vrtx=tags&amp;tag=Neural%20Audio%20Synthesis&amp;resource-type=person&amp;sorting=resource%3Asurname%3Aasc&amp;sorting=resource%3AfirstName%3Aasc">Neural Audio Synthesis</a>
    </span>
  </span>

            
      
      
      
      
      
      
        
        
      

      
      

      
        



<style>

    .publisher-category-CHAPTER {
            font-style: normal;
    }

    .parent-title-articlesAndBookChapters,
    .parent-title-other,
    .title-books,
    .publisher-books,
    .publisher-other,
    .publisher-category-ARTICLE {
        font-style: italic;
    }

</style>


    <div id="vrtx-publications-wrapper">

      <h2>Publications</h2>



      <div id="vrtx-publication-tabs">
        <ul>
            <li><a href="#vrtx-publication-tab-1" name="vrtx-publication-tab-1">Scientific articles and book chapters</a></li>
            <li><a href="#vrtx-publication-tab-2" name="vrtx-publication-tab-2">Other</a></li>
        </ul>



    <div id="vrtx-publication-tab-1">
  <ul class="vrtx-external-publications">

      <li id="vrtx-external-publication-10420743" class="vrtx-external-publication">
        <div id="vrtx-publication-10420743">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-10420743">
                Guo, Jinyue; Tørresen, Jim &amp; Jensenius, Alexander Refsum
            </span>(2026).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-ARTICLE">
                        Transactions of the International Society for Music Information Retrieval.
                </span>
                            9(1),
                <span class="vrtx-pages">p. 85–85.</span>
            doi: <a href="https://doi.org/10.5334/tismir.223">10.5334/tismir.223</a>.
            <a href="https://hdl.handle.net/11250/5486133">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">Musicologists, psychologists, and computer scientists study relationships between auditory and visual stimuli from very different perspectives and using various terminologies and methodologies. This article aims to bridge the gap between phenomenological sound theory, auditory–visual theory, and audio–video processing and machine learning. We introduce the SoundActions dataset, a collection of 365 audio–video recordings of (primarily) short sound actions. Each recording has been human?labeled and annotated according to Pierre Schaeffer’s theory of reduced listening, which describes the property of the sound itself (e.g., ‘an impulsive sound’) instead of the source (e.g., ‘a bird sound’). With these reduced?type labels in the audio–video dataset, we conducted two experiments: (1) fine?tuning the latest audio–video transformer model on the reduced?type labels in the SoundActions dataset, proving that the model can recognize reduced?type labels, and observing that the modality?imbalance phenomenon is similar to the added value theory by Michel Chion and (2) proposing the Ensemble of Perception Mode Adapters method inspired by Pierre Schaeffer’s three listening modes, improving the audio–video model also on reduced?type tasks.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-10284249" class="vrtx-external-publication">
        <div id="vrtx-publication-10284249">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-10284249">
                Riaz, Maham; Guo, Jinyue; Erdem, Cagri &amp; Jensenius, Alexander Refsum
            </span>(2025).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        Where to Put That Microphone? A Study of Sound Localization in Ambisonics Recordings.
                </span>
                    <span class="vrtx-parent-contributors">
                            In McArthur, Angela; Matthews, Emma-Kate &amp; Holberton, Tom (Ed.),
                    </span>
                <span class="vrtx-parent-title parent-title-articlesAndBookChapters">
                    Proceedings of the 17th International Symposium on Computer Music Multidisciplinary Research.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-CHAPTERACADEMIC">
                        <a class="vrtx-publisher" href="https://kanalregister.hkdir.no/publiseringskanaler/info/nvakanal?pid=69383989-1F49-4D7C-AAE0-ED745D1F2E17">The Laboratory PRISM “Perception, Representations, Image, Sound, Music”</a>.
                </span>
                <span class="vrtx-issn">ISSN 9791097498061.</span>
                            
                <span class="vrtx-pages">p. 455–466.</span>
            doi: <a href="https://doi.org/10.5281/ZENODO.17497086">10.5281/ZENODO.17497086</a>.
            <a href="https://hdl.handle.net/11250/5317948">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">This paper examines the effects of microphone placement on sound localization in first-order Ambisonics recordings. Two microphone setups were used to capture a moving audio source in a lab environment. Array A, a tetrahedral microphone, was placed in the centre of the recording space. Array B consisted of four similar tetrahedral microphones charting a rectangular perimeter surrounding the space. Motion capture data of the moving sound source shows that anglegrams calculated from the Ambisonics recordings can be effectively used for sound localization. An additional perceptual listening study with binaural renders of the audio signals showed that the centrally-placed Array A provided superior localization. However, the corner-placed Array B performed better than expected.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-10284183" class="vrtx-external-publication">
        <div id="vrtx-publication-10284183">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-10284183">
                Guo, Jinyue; Tørresen, Jim &amp; Jensenius, Alexander Refsum
            </span>(2025).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        Cross-modal Analysis of Spatial-Temporal Auditory Stimuli and Human Micromotion when Standing Still in Indoor Environments.
                </span>
                    <span class="vrtx-parent-contributors">
                            In McArthur, Angela; Matthews, Emma-Kate &amp; Holberton, Tom (Ed.),
                    </span>
                <span class="vrtx-parent-title parent-title-articlesAndBookChapters">
                    Proceedings of the 17th International Symposium on Computer Music Multidisciplinary Research.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-CHAPTERACADEMIC">
                        <a class="vrtx-publisher" href="https://kanalregister.hkdir.no/publiseringskanaler/info/nvakanal?pid=69383989-1F49-4D7C-AAE0-ED745D1F2E17">The Laboratory PRISM “Perception, Representations, Image, Sound, Music”</a>.
                </span>
                <span class="vrtx-issn">ISSN 9791097498061.</span>
                            
                <span class="vrtx-pages">p. 871–882.</span>
            doi: <a href="https://doi.org/10.5281/ZENODO.17502603">10.5281/ZENODO.17502603</a>.
            <a href="https://hdl.handle.net/11250/5317903">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">This paper examines how a soundscape influences human stillness. We are particularly interested in how spatial and temporal features of a soundscape influence human micromotion and swaying patterns. The analysis is based on 345 Ambisonics audio recordings of different indoor environments and corresponding accelerometer data captured at the chest of a person standing still for ten minutes. We calculated the temporal and spatial correlation between the person&#39;s quantity of motion and the sound energy of the Ambisonic recordings. While no clear temporal correlations were found, we discovered a correlation between the spatial directionality of the micromotion and the sound direction of arrival. The results suggest a potential entrainment between the directionality of environmental sounds and human swaying patterns, which have not been thoroughly studied previously compared to the temporal or spectral features of indoor soundscapes.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-2391317" class="vrtx-external-publication">
        <div id="vrtx-publication-2391317">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-2391317">
                Riaz, Maham; Guo, Jinyue; Göksülük, Bilge Serdar &amp; Jensenius, Alexander Refsum
            </span>(2025).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        Where is That Bird? The Impact of Artificial Birdsong in Public Indoor Environments.
                </span>
                    <span class="vrtx-parent-contributors">
                            In Seiça, Mariana &amp; Wirfs-Brock, Jordan (Ed.),
                    </span>
                <span class="vrtx-parent-title parent-title-articlesAndBookChapters">
                    AM &#39;25: Proceedings of the 20th International Audio Mostly Conference.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-CHAPTERACADEMIC">
                        <a class="vrtx-publisher" href="https://kanalregister.hkdir.no/publiseringskanaler/info/forlag?pid=517D4F8F-AF83-4062-82FA-254E8A87D7D8">Association for Computing Machinery (ACM)</a>.
                </span>
                <span class="vrtx-issn">ISSN 9798400720659.</span>
                            
                <span class="vrtx-pages">p. 344–351.</span>
            doi: <a href="https://doi.org/10.1145/3771594.3771629">10.1145/3771594.3771629</a>.
            <a href="https://hdl.handle.net/11250/4977325">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">This paper explores the effects of nature sounds, specifically bird sounds, on human experience and behavior in indoor public environments. We report on an intervention study where we introduced an interactive sound device to alter the soundscape. Phenomenological observations and a survey showed that participants noticed and engaged with the bird sounds primarily through causal listening; that is, they attempted to identify the sound source. Participants generally responded positively to the bird sounds, appreciating the calmness and surprise it brought to the environment. The analyses revealed that relative loudness was a key factor influencing the experience. A too-high sound level may feel unpleasant, while a too-low sound level makes it unnoticeable due to background noise. These findings highlight the importance of automatic level adjustments and considering acoustic conditions in soundscape interventions. Our study contributes to a broader discourse on sound perception, human interaction with sonic spaces, and the potential of auditory design in public indoor environments.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-2349747" class="vrtx-external-publication">
        <div id="vrtx-publication-2349747">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-2349747">
                Riaz, Maham; Guo, Jinyue &amp; Jensenius, Alexander Refsum
            </span>(2025).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        Comparing Spatial Audio Recordings from Commercially Available 360-degree Video Cameras.
                </span>
                    <span class="vrtx-parent-contributors">
                            In Brooks, Anthony L.; Banakou, Domna &amp; Ceperkovic, Slavica (Ed.),
                    </span>
                <span class="vrtx-parent-title parent-title-articlesAndBookChapters">
                    Proceedings of the 13th EAI International Conference on ArtsIT, Interactivity and Game Creation, ArtsIT 2024.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-CHAPTERACADEMIC">
                        <a class="vrtx-publisher" href="https://kanalregister.hkdir.no/publiseringskanaler/info/forlag?pid=AD8FEF33-C155-4915-A7BF-A1BE33DDAC4D">Springer</a>.
                </span>
                <span class="vrtx-issn">ISSN 9783031972546.</span>
                            
                <span class="vrtx-pages">p. 160–172.</span>
            doi: <a href="https://doi.org/10.1007/978-3-031-97254-6_12">10.1007/978-3-031-97254-6_12</a>.
            <a href="https://hdl.handle.net/11250/3259963">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">This paper investigates the spatial audio recording capabilities of various commercially available 360-degree cameras (GoPro MAX, Insta360 X3, Garmin VIRB 360, and Ricoh Theta S). A dedicated ambisonics audio recorder (Zoom H3VR) was used for comparison. Six action sequences were performed around the recording setup, including impulsive and continuous vocal and non-vocal stimuli. The audio streams were extracted from the videos and compared using spectrograms and anglegrams. The anglegrams show adequate localization in ambisonic recordings from the GoPro MAX and Zoom H3VR. All cameras feature undocumented noise reduction and audio enhancement algorithms, use different types of audio compression, and have limited audio export options. This makes it challenging to use the spatial audio data reliably for research purposes.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-2292157" class="vrtx-external-publication">
        <div id="vrtx-publication-2292157">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-2292157">
                Guo, Jinyue; Christodoulou, Anna-Maria; Laczko, Balint &amp; Glette, Kyrre
            </span>(2024).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        LVNS-RAVE: Diversified audio generation with RAVE and Latent Vector Novelty Search.
                </span>
                    <span class="vrtx-parent-contributors">
                            In Li, Xiaodong &amp; Handl, Julia (Ed.),
                    </span>
                <span class="vrtx-parent-title parent-title-articlesAndBookChapters">
                    GECCO &#39;24 Companion: Proceedings of the Genetic and Evolutionary Computation Conference Companion.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-CHAPTERACADEMIC">
                        <a class="vrtx-publisher" href="https://kanalregister.hkdir.no/publiseringskanaler/info/forlag?pid=517D4F8F-AF83-4062-82FA-254E8A87D7D8">Association for Computing Machinery (ACM)</a>.
                </span>
                <span class="vrtx-issn">ISSN 9798400704956.</span>
                            
                <span class="vrtx-pages">p. 667–670.</span>
            doi: <a href="https://doi.org/10.1145/3638530.3654432">10.1145/3638530.3654432</a>.
            <a href="https://hdl.handle.net/11250/3455371">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">Evolutionary Algorithms and Generative Deep Learning have been two of the most powerful tools for sound generation tasks. However, they have limitations: Evolutionary Algorithms require complicated designs, posing challenges in control and achieving realistic sound generation. Generative Deep Learning models often copy from the dataset and lack creativity. In this paper, we propose LVNS-RAVE, a method to combine Evolutionary Algorithms and Generative Deep Learning to produce realistic and novel sounds. We use the RAVE model as the sound generator and the VGGish model as a novelty evaluator in the Latent Vector Novelty Search (LVNS) algorithm. The reported experiments show that the method can successfully generate diversified, novel audio samples under different mutation setups using different pre-trained RAVE models. The characteristics of the generation process can be easily controlled with the mutation parameters. The proposed algorithm can be a creative tool for sound artists and musicians.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-2292179" class="vrtx-external-publication">
        <div id="vrtx-publication-2292179">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-2292179">
                Guo, Jinyue; Riaz, Maham &amp; Jensenius, Alexander Refsum
            </span>(2024).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        Comparing Four 360-Degree Cameras for Spatial Video Recording and Analysis.
                </span>
                <span class="vrtx-parent-title parent-title-articlesAndBookChapters">
                    Proceedings of the Sound and Music Computing Conference 2024.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-CHAPTERACADEMIC">
                        SMC Network.
                </span>
                <span class="vrtx-issn">ISSN 9789893520758.</span>
                            
            
            <a href="https://hdl.handle.net/10852/113954">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">This paper reports on a desktop investigation and a lab experiment comparing the video recording capabilities of four commercially available 360-degree cameras: GoPro MAX, Insta360 X3, Garmin VIRB 360, and Ricoh Theta S. The four cameras all use different recording formats and settings and have varying video quality and software support. This makes it difficult to conduct analyses and compare between devices. We have implemented new functions in the Musical Gestures Toolbox (MGT) for reading and merging files from the different platforms. Using the capabilities of FFmpeg, we have also made a new function for converting between different 360-degree video projections and formats. This allows (music) researchers to exploit 360-degree video recordings using regular video-based analysis pipelines.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-2200192" class="vrtx-external-publication">
        <div id="vrtx-publication-2200192">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-2200192">
                Guo, Jinyue &amp; McFee, Brian
            </span>(2023).
                <span class="vrtx-title title-articlesAndBookChapters">
                    <!-- For readability. Too many underlined characters when both present -->
                        Automatic Recognition of Cascaded Guitar Effects.
                </span>
                    <span class="vrtx-parent-contributors">
                            In Serafin, Stefania; Fontana, Federico &amp; Willemsen, Silvin (Ed.),
                    </span>
                <span class="vrtx-parent-title parent-title-articlesAndBookChapters">
                    Proceedings of the 26th International Conference on Digital Audio Effects.
                </span>
                <span class="vrtx-publisher publisher-articlesAndBookChapters publisher-category-CHAPTERACADEMIC">
                         Aalborg University Copenhagen.
                </span>
                            
                <span class="vrtx-pages">p. 189–195.</span>
            doi: <a href="https://doi.org/10.5281/zenodo.7973536">10.5281/zenodo.7973536</a>.
            <a href="https://hdl.handle.net/11250/4043479">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">This paper reports on a new multi-label classification task for guitar effect recognition that is closer to the actual use case of guitar effect pedals. To generate the dataset, we used multiple clean guitar audio datasets and applied various combinations of 13 commonly used guitar effects. We compared four neural network structures: a simple Multi-Layer Perceptron as a baseline, ResNet models, a CRNN model, and a sample-level CNN model. The ResNet models achieved the best performance in terms of accuracy and robustness under various setups (with or without clean audio, seen or unseen dataset), with a micro F1 of 0.876 and Macro F1 of 0.906 in the hardest setup. An ablation study on the ResNet models further indicates the necessary model complexity for the task.</p>
                </span>
        </div>
    </li>
    </ul>
      <p class="vrtx-more-external-publications"><a href="https://nva.sikt.no/research-profile/1539055">View all works in NVA</a></p>
    </div>

    <div id="vrtx-publication-tab-2">
  <ul class="vrtx-external-publications">

      <li id="vrtx-external-publication-10285832" class="vrtx-external-publication">
        <div id="vrtx-publication-10285832">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-10285832">
                Guo, Jinyue; Tørresen, Jim &amp; Jensenius, Alexander Refsum
            </span>(2025).
                <span class="vrtx-title title-other">
                    <!-- For readability. Too many underlined characters when both present -->
                        Cross-modal Analysis of Spatial-Temporal Auditory Stimuli and Human Micromotion when Standing Still in Indoor Environments (poster).
                </span>
                            
            doi: <a href="https://doi.org/10.5281/zenodo.17502603">10.5281/zenodo.17502603</a>.
            <a href="https://hdl.handle.net/11250/5319225">Full text in Research Archive</a>
        </div>
    </li>
      <li id="vrtx-external-publication-2292205" class="vrtx-external-publication">
        <div id="vrtx-publication-2292205">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-2292205">
                Guo, Jinyue
            </span>(2024).
                <span class="vrtx-title title-other">
                    <!-- For readability. Too many underlined characters when both present -->
                        Comparing Four 360-Degree Cameras for Spatial Video Recording and Analysis.
                </span>
                            
            
            <a href="https://hdl.handle.net/11250/4105957">Full text in Research Archive</a>
                <span class="vrtx-publication-summary">
                            <a href="#" aria-expanded="false" aria-label="Show summary" class="vrtx-publication-summary">Show summary</a>
                            <p class="vrtx-publication-summary" style="display:none">This paper reports on a desktop investigation and a lab experiment comparing the video recording capabilities of four commercially available 360-degree cameras: GoPro MAX, Insta360 X3, Garmin VIRB 360, and Ricoh Theta S. The four cameras all use different recording formats and settings and have varying video quality and software support. This makes it difficult to conduct analyses and compare between devices. We have implemented new functions in the Musical Gestures Toolbox (MGT) for reading and merging files from the different platforms. Using the capabilities of FFmpeg, we have also made a new function for converting between different 360-degree video projections and formats. This allows (music) researchers to exploit 360-degree video recordings using regular videobased analysis pipelines.</p>
                </span>
        </div>
    </li>
      <li id="vrtx-external-publication-2200070" class="vrtx-external-publication">
        <div id="vrtx-publication-2200070">
            <span class="vrtx-contributors" id="vrtx-publication-contributors-2200070">
                Guo, Jinyue
            </span>(2023).
                <span class="vrtx-title title-other">
                    <!-- For readability. Too many underlined characters when both present -->
                        Automatic Recognition of Cascaded Guitar Effects.
                </span>
                            
            
            <a href="https://hdl.handle.net/11250/4511359">Full text in Research Archive</a>
        </div>
    </li>
    </ul>
      <p class="vrtx-more-external-publications"><a href="https://nva.sikt.no/research-profile/1539055">View all works in NVA</a></p>
    </div>

      </div>
    </div>



      
            
      
        <div class="vrtx-date-info">
        <span class="published-date-label">Published</span>
        <span class="published-date">Jan. 31, 2023 4:15 PM </span>
        
        - <span class="last-modified-date">Last modified</span>
        <span class="last-modified-date">Mar. 3, 2025 12:47 PM</span>
        
        </div>
      
          </div>
        </div>
        <div id="vrtx-additional-content">
          
      
          

<div class="vrtx-projects vrtx-frontpage-box">
  <h2>Projects</h2>

  <div class="vrtx-box-content">
  <ul class="only-links">
      <li><a href="/ritmo/english/projects/ambient/index.html">Bodily Entrainment to Audiovisual Rhythms (AMBIENT)</a></li>
      <li><a href="/ritmo/english/projects/musical-hci/index.html">Musical human-computer interaction</a></li>
      <li><a href="/ritmo/english/projects/self-playing-guitars/index.html">Self-playing Guitars</a></li>
  </ul>

  </div>
</div>



          
          
      
      
        <div id="vrtx-related-content">
          <h2>Links</h2>

<p><a href="https://arthurjinyueguo.notion.site">Personal site</a></p>

<p><a href="https://github.com/fisheggg">Github page</a></p>

<p><a href="https://scholar.google.com/citations?user=GAkB06sAAAAJ">Google Scholar</a></p>

        </div>
      
        </div>
      </div>
       <!--stopindex-->
     </main>
   </div>

    <!-- Page footer start -->
    <footer id="footer-wrapper" class="grid-container faculty-institute-footer">
       <div id="footers" class="row">
            
              <div class="footer-content-wrapper">
                
                
                  <div class="footer-title">
                    <a href="/ritmo/english">RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion</a>
                  </div>
                
                <div class="footer-content">
                  
                    
                      
                        
                          <div>
   <h2>Contact information</h2>
   <p><a href="/ritmo/english/about/">Contact us</a><br>
   <a href="/english/about/getting-around/areas/gaustad/ga09/">Find us</a></p>
</div>
<div>
   <h2>About the website</h2>
   <p><a href="/english/about/regulations/privacy-declarations/privacy-policy-web.html">Cookies</a><br>
   <a href="/ritmo/english/people/phd-fellows/jinyueg/ https:/uustatus.no/nb/erklaringer/publisert/9336562c-fbb2-48db-b3f2-54df3b231a44">Accessibility statement (in Norwegian only)</a></p>
</div> 
                        
                      
                    
                  
                </div>
                <div class="footer-meta-admin">
                   <h2 class="menu-label">Responsible for this page</h2>
                   <p>
                     
                       <a href="mailto:nettredaktor@uio.no">Nettredakt?r</a>
                     
                   </p>
                   




    <div class="vrtx-login-manage-component">
      <a href="/ritmo/english/people/phd-fellows/jinyueg/index.html?authTarget"
         class="vrtx-login-manage-link"
         rel="nofollow">
        Log in
      </a>
    </div>



                </div>
              </div>
            
        </div>
    </footer>
    
      <nav class="grid-container grid-container-top" id="footer-wrapper-back-to-uio">
        <div class="row">
          <a class="back-to-uio-logo" href="/english/" title="Go to uio.no"></a>
        </div>
      </nav>
    

      
         
      
      

</body>
</html>
