Abstract
Hearing is one of the fundamental senses that help us reason about our environment. There is an intricate relationship between the visual appearance of a location and the distribution of sounds present there. We propose leveraging this relationship to formulate the task of soundscape mapping: predicting the most probable distribution of sounds that could be perceived at a given geographic location, as observed in its overhead imagery. To support research on this task, we curated a comprehensive dataset, GeoSound, which consists of geotagged audio recordings from various sources, paired with both low- and high-resolution overhead imagery. We approach soundscape mapping from the perspective of multimodal representation learning and develop a series of frameworks, GeoCLAP, PSM, and Sat2Sound, to address the task. For each framework, we demonstrate the effectiveness of a shared multimodal representation space in generating soundscape maps for any geographic area using only overhead imagery and audio or textual queries. Across these frameworks, we progressively incorporate several desirable capabilities, including temporal conditioning, multi-scale mapping, uncertainty quantification, and fine-grained soundscape prediction. We believe that the introduction of this novel task, together with our dataset and proposed methodologies, will encourage research toward creating high-resolution, global-scale soundscape maps with minimal effort, while also enabling location-conditioned soundscape synthesis for immersive virtual exploration.
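The shared multimodal representation space mentioned above is the key mechanism: if overhead-image embeddings and audio/text embeddings are aligned during training, a soundscape map can later be produced by scoring every image tile in a region against a single query embedding. The sketch below illustrates this idea in generic PyTorch. It is a minimal illustration, not the dissertation's actual GeoCLAP, PSM, or Sat2Sound code; the encoder architectures, dimensions, and the ToyEncoder and contrastive_loss helpers are hypothetical stand-ins.

```python
# Minimal sketch of CLAP-style contrastive alignment between overhead-image
# embeddings and audio (or text) embeddings, followed by zero-shot soundscape
# mapping. All names and dimensions here are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128

class ToyEncoder(nn.Module):
    """Stand-in for a real backbone (e.g., a ViT for imagery or an audio
    spectrogram transformer); projects inputs into the shared space."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, EMBED_DIM))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

image_enc = ToyEncoder(in_dim=512)   # overhead-image features
audio_enc = ToyEncoder(in_dim=512)   # audio or text-caption features

def contrastive_loss(img_emb, aud_emb, temperature=0.07):
    """Symmetric InfoNCE: co-located (image, audio) pairs lie on the diagonal
    of the similarity matrix and are pulled together; all others are pushed apart."""
    logits = img_emb @ aud_emb.t() / temperature
    targets = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# --- toy training step on a batch of co-located (image, audio) pairs ---
imgs, auds = torch.randn(32, 512), torch.randn(32, 512)
loss = contrastive_loss(image_enc(imgs), audio_enc(auds))
loss.backward()

# --- zero-shot soundscape map: score a grid of tiles against one query ---
with torch.no_grad():
    tile_grid = torch.randn(10 * 10, 512)   # features for a 10x10 tile grid
    query = torch.randn(1, 512)             # e.g., an audio clip or caption
    scores = image_enc(tile_grid) @ audio_enc(query).t()
    soundscape_map = scores.view(10, 10)    # higher score = more probable sound
```

Because the map is computed purely from precomputed tile embeddings and a query embedding, the same trained space supports audio queries, textual queries, and arbitrarily large regions without retraining, which is what makes the shared-representation formulation attractive for global-scale mapping.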
Committee Chair
Tao Ju
Committee Members
Alvitta Ottley; Claire Masteller; Jiaxin Huang; Nathan Jacobs; Tao Ju
Degree
Doctor of Philosophy (PhD)
Author's Department
Computer Science & Engineering
Document Type
Dissertation
Date of Award
8-18-2025
Language
English (en)
DOI
https://doi.org/10.7936/wh2x-9s57
Recommended Citation
Khanal, Subash, "Multimodal Representation Learning for Geospatial Soundscape Mapping" (2025). McKelvey School of Engineering Theses & Dissertations. 1283.
The definitive version is available at https://doi.org/10.7936/wh2x-9s57