GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

Xuran Hu¹,², Zhitong Xiong³,*, Zhongcheng Hong², Yifang Ban¹, Xiaoxiang Zhu³, Wufan Zhao²,*

¹KTH Royal Institute of Technology
²The Hong Kong University of Science and Technology (Guangzhou)
³Technical University of Munich
*Corresponding authors


Overview of GeoHeight-Bench (+), which comprises ten diverse tasks organized into four hierarchical levels: pixel-level retrieval, object-level extraction, scene-level analysis, and reasoning-level inference.

Abstract

Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot", successfully unlocking a new paradigm of interactive height reasoning in existing optical models.

Benchmark Generation

The GeoHeight-Bench generation and verification pipeline.
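
To make the idea of a relative-height question concrete, here is a minimal sketch of how one question/answer pair could be derived from a height raster and two region masks. It is an illustration only and not the authors' pipeline: the function names `region_height` and `relative_height_qa`, the use of the median as the region statistic, the 0.5 m tolerance, and the question wording are all assumptions made for this example.

```python
from statistics import median


def region_height(height_map, mask):
    """Median height of the pixels selected by a boolean mask (hypothetical helper)."""
    values = [
        h
        for row_h, row_m in zip(height_map, mask)
        for h, m in zip(row_h, row_m)
        if m
    ]
    if not values:
        raise ValueError("mask selects no pixels")
    return median(values)


def relative_height_qa(height_map, mask_a, mask_b, tolerance=0.5):
    """Build one illustrative QA pair comparing two regions.

    `tolerance` (same unit as the height map) is an assumed threshold below
    which the two regions are reported as similar in height.
    """
    h_a = region_height(height_map, mask_a)
    h_b = region_height(height_map, mask_b)
    if abs(h_a - h_b) <= tolerance:
        answer = "similar"
    elif h_a > h_b:
        answer = "A"
    else:
        answer = "B"
    return {
        "question": "Which region is taller, A or B?",
        "answer": answer,
        "height_a": h_a,
        "height_b": h_b,
    }


# Toy 3x4 height raster in metres: a tall block (left), a low block (right), ground (bottom).
height_map = [
    [12.0, 12.5, 3.0, 3.2],
    [11.8, 12.2, 3.1, 2.9],
    [0.0, 0.1, 0.0, 0.2],
]
mask_a = [
    [True, True, False, False],
    [True, True, False, False],
    [False, False, False, False],
]
mask_b = [
    [False, False, True, True],
    [False, False, True, True],
    [False, False, False, False],
]

qa = relative_height_qa(height_map, mask_a, mask_b)
print(qa["question"], "->", qa["answer"])
```

The median is used here only because it is robust to a few outlier pixels along region boundaries; a mean or maximum would be equally valid choices for a sketch like this.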

GeoHeightChat

The proposed GeoHeightChat framework comprises two training stages: Cross-Modal Geo-Alignment and Geo-Aware Instruction Tuning.

Comparison

Comparison between GeoHeightChat and other LMMs on multimodal reasoning.

BibTeX

@misc{hu2026geoheight,
      title={GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing}, 
      author={Xuran Hu and Zhitong Xiong and Zhongcheng Hong and Yifang Ban and Xiaoxiang Zhu and Wufan Zhao},
      year={2026},
      eprint={2603.25565},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2603.25565}, 
}