An Interpretable Vision–Language Foundation Model for Sustainable Smart Cities with Physical World Data-to-Reasoning Traceability

Abstract

Sustainable smart cities represent a long-term vision for human development. Yet modern cities generate massive, heterogeneous, and physics-driven multimodal data whose complexity makes large-scale understanding extremely challenging. Although multimodal large language models offer new analytical possibilities, they ultimately learn from linguistic causal logic rather than the physical laws that govern real urban systems. This fundamental mismatch leaves them ill-equipped to reason about and predict the actual dynamics of cities. Here we present TransCity, a multimodal foundation model for sustainable smart cities that learns directly from the physical spatio-temporal world rather than symbolic linguistic descriptions. Trained on over 21 million city-days of multimodal data collected over the past decade, including NASA satellite imagery, ground-sensor streams, and urban text records, TransCity builds a unified representation of urban mobility, infrastructure usage, and system dynamics. Powered by a multi-expert architecture with vision–language grounding and dedicated Modality Meta Nets for structured feature extraction, the model supports flexible prediction, analysis, and decision support across diverse urban tasks. Across evaluations conducted on diverse real-world datasets from cities worldwide, TransCity exceeds the predictive and analytical performance of frontier foundation models such as ChatGPT and Gemini. Empirical results further show that, at city-edge regions with imbalanced sensor data, our model achieves around 40% improvement in prediction accuracy over the baselines, promoting social equity in smart city development. The pretrained model is publicly available, enabling broad fine-tuning and deployment for real-world smart city applications.

Data-to-Reasoning Traceability

TransCity is trained to construct chain-of-thought reasoning over data retrieved by agents. Through the task-aware mixture-of-experts (TAME) architecture, specialized experts are dynamically selected to analyze evidence. Each reasoning step is explicitly linked to its underlying data sources, and the final answer preserves the original dataset references and database links.
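One way to picture this linkage is a record per reasoning step that carries its own source pointers. The following is a minimal sketch, not the TransCity implementation: the `SourceRef` and `ReasoningStep` schemas, the `render_answer` helper, and the `db://example/...` links are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRef:
    """Pointer to the data a step relied on (hypothetical schema)."""
    dataset: str  # original dataset reference
    link: str     # database link or query locator


@dataclass
class ReasoningStep:
    claim: str
    expert: str  # name of the expert that analyzed the evidence
    sources: list[SourceRef] = field(default_factory=list)


def render_answer(steps: list[ReasoningStep]) -> str:
    """Render steps with numbered citations, then a deduplicated source list."""
    refs: list[SourceRef] = []
    lines = []
    for i, step in enumerate(steps, start=1):
        marks = []
        for src in step.sources:
            if src not in refs:
                refs.append(src)
            marks.append(f"[{refs.index(src) + 1}]")
        lines.append(f"{i}. {step.claim} {''.join(marks)}".rstrip())
    lines.append("Sources:")
    lines += [f"[{n}] {r.dataset} ({r.link})" for n, r in enumerate(refs, start=1)]
    return "\n".join(lines)


# Placeholder data, not real TransCity datasets.
traffic = SourceRef("example-traffic-counts", "db://example/traffic?id=1")
weather = SourceRef("example-weather", "db://example/weather?id=7")
steps = [
    ReasoningStep("Traffic volume rose at the queried junction.", "traffic", [traffic]),
    ReasoningStep("Rainfall coincided with the increase.", "weather", [weather, traffic]),
]
answer = render_answer(steps)
print(answer)
```

Because every citation mark is derived from the step's own `sources` list, a reader can go from any sentence in the answer back to the dataset and link that supported it.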

Spatiotemporal Learning

TransCity models urban dynamics through its Agent System, which retrieves data from nearby sensors according to the sensor topology. MoMExp Nets then extract features from this data together with the user query. These features are fed into the core model, allowing it to capture spatiotemporal dependencies and better predict urban system behavior.
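Topology-based retrieval can be sketched as a bounded graph search. The example below assumes the sensor topology is an undirected adjacency list and that "nearby" means within a fixed number of hops; the page does not specify either, and the `nearby_sensors` function and sensor IDs are hypothetical.

```python
from collections import deque


def nearby_sensors(topology: dict[str, list[str]], anchor: str, max_hops: int) -> dict[str, int]:
    """Return sensors within max_hops of anchor, mapped to hop distance (BFS)."""
    dist = {anchor: 0}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in topology.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Toy topology: s1 - s2 - s4 - s5, plus s1 - s3.
topology = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s4"],
    "s3": ["s1"],
    "s4": ["s2", "s5"],
    "s5": ["s4"],
}
neighbors = nearby_sensors(topology, "s1", max_hops=2)
print(neighbors)  # {'s1': 0, 's2': 1, 's3': 1, 's4': 2}
```

The hop distances returned here could serve as a simple spatial feature alongside the sensor readings themselves.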

Expert-Level Interpretability

TransCity provides expert-level interpretability through its TAME mixture-of-experts design. A dynamic router assigns different inputs to specialized experts, and the resulting expert activations exhibit clear modality preferences.
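The page does not state the routing rule, so the sketch below substitutes top-k softmax gating, a common choice in mixture-of-experts models, purely for illustration. The `route_top_k` function and the example logits are assumptions, not TAME's actual router.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


# Hypothetical gate logits for one input over four experts.
weights = route_top_k([2.0, 0.5, 1.0, -1.0], k=2)
print(weights)  # experts 0 and 2 are selected; their weights sum to 1
```

Logging the selected expert indices per input modality is one way the modality preferences described above could be measured.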

Expert outputs are fused through attention, producing step-wise attention scores that link each reasoning step to the most influential evidence. These scores let us quantify how different modalities (e.g., traffic signals, energy measurements, POIs, weather, and visual inputs) contribute to the final answer.
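As a rough illustration of how step-wise attention could be turned into modality-level contributions, the sketch below sums attention mass per modality and normalizes it into shares. The aggregation rule, the `modality_contributions` function, and the evidence IDs are assumptions; the page does not describe the exact computation.

```python
from collections import defaultdict


def modality_contributions(step_attention: list[dict[str, float]],
                           evidence_modality: dict[str, str]) -> dict[str, float]:
    """Sum attention per modality over all steps and normalize to shares."""
    totals: dict[str, float] = defaultdict(float)
    for scores in step_attention:
        for evidence_id, weight in scores.items():
            totals[evidence_modality[evidence_id]] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {modality: value / grand_total for modality, value in totals.items()}


# Two reasoning steps, each with attention over the evidence it used (toy values).
step_attention = [
    {"e1": 0.6, "e2": 0.4},
    {"e1": 0.2, "e3": 0.8},
]
evidence_modality = {"e1": "traffic", "e2": "weather", "e3": "energy"}
shares = modality_contributions(step_attention, evidence_modality)
print(shares)  # traffic and energy each get about 0.4, weather about 0.2
```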

Fig. 2: Different modalities tend to be handled by different experts.

MoMExp Nets Interpretability

We analyze the attention patterns of the MoMExp Nets and observe clear region-aware expert behaviors across different urban areas. When the user query targets different parts of the city, the expert networks selectively focus on different modalities and urban signals rather than applying uniform representations.

In high-flow urban areas, MoMExp Nets tend to emphasize commercial buildings, office districts, and mobility-related data. In low-flow or exurban regions, attention shifts toward schools, residential areas, and physical facilities with stronger temporal locality.

Fig. 3: Experts adapt attention by region and activity intensity.

Agent System Interpretability

TransCity makes the agent pipeline interpretable by compiling each user question into an explicit executable plan DAG. Each node corresponds to a single-purpose agent, and the orchestrator schedules runnable steps with dependency awareness and parallel execution. Every step writes outputs into a shared context store under a stable key, together with provenance metadata.
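A plan DAG with dependency-aware, parallel scheduling can be sketched with Python's standard `graphlib.TopologicalSorter`. This is an illustrative stand-in rather than the TransCity orchestrator: the `Step` record, the `execute_plan` function, the provenance fields, and the toy agents are all hypothetical, and the sketch assumes every dependency names a step in the same plan.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable


@dataclass
class Step:
    key: str                              # stable key in the context store
    deps: tuple[str, ...]                 # upstream keys this step consumes
    run: Callable[[dict[str, Any]], Any]  # single-purpose agent (stand-in)


def execute_plan(steps: list[Step]) -> dict[str, dict[str, Any]]:
    """Run ready steps in parallel waves; store outputs with provenance."""
    by_key = {s.key: s for s in steps}
    sorter = TopologicalSorter({s.key: set(s.deps) for s in steps})
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    store: dict[str, dict[str, Any]] = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # steps whose dependencies are all done
            inputs = [{d: store[d]["value"] for d in by_key[k].deps} for k in ready]
            results = pool.map(lambda pair: by_key[pair[0]].run(pair[1]),
                               zip(ready, inputs))
            for key, value in zip(ready, results):
                store[key] = {"value": value, "produced_by": key,
                              "consumed": list(by_key[key].deps)}
                sorter.done(key)
    return store


plan = [
    Step("traffic", (), lambda _: [120, 135, 160]),
    Step("weather", (), lambda _: "rain"),
    Step("summary", ("traffic", "weather"),
         lambda ctx: f"peak {max(ctx['traffic'])} during {ctx['weather']}"),
]
store = execute_plan(plan)
print(store["summary"])
```

Here `traffic` and `weather` have no dependencies and run in the same wave, while `summary` waits for both. Its store entry records which upstream keys it consumed.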

This design enables end-to-end auditability: every evidence block used in the final answer can be traced back to the plan step that produced it, the upstream keys it consumed, and the intermediate states recorded in the store. When the system detects evidence alignment deficits, it triggers a bounded refinement loop that performs additional retrieval and verification while logging iteration identifiers for step-by-step replay. A spatial anchoring rule further prevents semantic drift by keeping baseline city signals sampled at the user anchor rather than the incident region.
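The bounded refinement loop can be sketched as follows. The page does not define how an evidence alignment deficit is measured, so `alignment` is a placeholder scoring function supplied by the caller, and `refine_evidence`, `retrieve_more`, `threshold`, and `max_iters` are hypothetical names.

```python
from typing import Callable


def refine_evidence(evidence: list[str],
                    alignment: Callable[[list[str]], float],
                    retrieve_more: Callable[[int], list[str]],
                    threshold: float,
                    max_iters: int) -> tuple[list[str], list[dict]]:
    """Retrieve extra evidence until alignment reaches threshold or the budget is spent."""
    log = [{"iteration": 0, "score": alignment(evidence), "n_evidence": len(evidence)}]
    iteration = 0
    while log[-1]["score"] < threshold and iteration < max_iters:
        iteration += 1
        evidence = evidence + retrieve_more(iteration)
        # Log every pass with its iteration identifier so it can be replayed.
        log.append({"iteration": iteration, "score": alignment(evidence),
                    "n_evidence": len(evidence)})
    return evidence, log


# Toy alignment: fraction of required evidence types already covered.
required = {"traffic", "weather", "energy"}
extra = [["weather"], ["energy"], ["poi"]]
final_evidence, log = refine_evidence(
    ["traffic"],
    alignment=lambda ev: len(required & set(ev)) / len(required),
    retrieve_more=lambda i: extra[i - 1],
    threshold=1.0,
    max_iters=3,
)
print(final_evidence)            # ['traffic', 'weather', 'energy']
print([e["iteration"] for e in log])  # [0, 1, 2]
```

The `max_iters` cap is what makes the loop bounded: if alignment never reaches the threshold, the loop still stops and the log shows each attempt.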

Fig. 4: Agent-driven evidence workflow with an explicit plan DAG and traceable records.

Safety Control System

TransCity incorporates a system-level Safety Control System that governs both user queries and agent tool execution. For each incoming user query, the system performs a safety assessment and routes the request through one of several control paths, such as direct allowance, constrained execution, abstraction-only response, or refusal with a standardized template.
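The routing step can be pictured as a mapping from an assessment result to a control path. The sketch below assumes the safety assessment yields a scalar risk score in [0, 1] and uses arbitrary thresholds; neither is stated on this page, and `ControlPath` and `route_query` are hypothetical names.

```python
from enum import Enum


class ControlPath(Enum):
    ALLOW = "direct allowance"
    CONSTRAIN = "constrained execution"
    ABSTRACT = "abstraction-only response"
    REFUSE = "refusal with standardized template"


def route_query(risk: float) -> ControlPath:
    """Map a risk score in [0, 1] to a control path (thresholds are illustrative)."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be in [0, 1]")
    if risk < 0.25:
        return ControlPath.ALLOW
    if risk < 0.5:
        return ControlPath.CONSTRAIN
    if risk < 0.75:
        return ControlPath.ABSTRACT
    return ControlPath.REFUSE


for score in (0.1, 0.4, 0.6, 0.9):
    print(score, route_query(score).value)
```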

The same safety mechanism is applied to agent systems and tool usage. Agent plans may be fully executed, executed under constraints, or blocked and recalled by the planner based on safety evaluation, with feedback collected for iterative refinement.

Fig. 5: Unified safety control across model and tools.

Citation

@article{transcity2025,
  title   = {TransCity: Next-Generation Smart City Foundation Model},
  author  = {TODO},
  journal = {TODO},
  year    = {2025},
  note    = {TODO}
}