Unified Language-driven Zero-shot Domain Adaptation

1 The Chinese University of Hong Kong; 2 Harbin Institute of Technology, Shenzhen; 3 The Chinese University of Hong Kong, Shenzhen


This paper introduces Unified Language-driven Zero-shot Domain Adaptation (ULDA), a novel task setting that enables a single model to adapt to diverse target domains without explicit domain-ID knowledge. We identify the constraints of the existing language-driven zero-shot domain adaptation task, particularly its requirement for domain IDs and domain-specific models, which may restrict flexibility and scalability. To overcome these issues, we propose a new framework for ULDA consisting of three components: Hierarchical Context Alignment (HCA), Domain Consistent Representation Learning (DCRL), and Text-Driven Rectifier (TDR). Respectively, these components align simulated features with target text across multiple visual levels, retain semantic correlations between different regional representations, and rectify biases between simulated and real target visual features. Our extensive empirical evaluations demonstrate that this framework achieves competitive performance in both settings, surpassing even the model that requires domain IDs, which shows its generalization ability. The proposed method is also practical and efficient, as it introduces no additional computational cost during inference. Our model and collected data will be open-sourced.
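The idea of aligning simulated features with a target text description at several visual levels can be illustrated with a small sketch. This is not the authors' implementation: it assumes that visual features and the text embedding live in a shared embedding space and that alignment is measured by cosine distance, neither of which is specified on this page. The function names `alignment_loss` and `multi_level_alignment_loss` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(simulated_feature, text_embedding):
    """Hypothetical loss: 0 when the feature points in the same direction
    as the text embedding, 1 when the two are orthogonal."""
    return 1.0 - cosine_similarity(simulated_feature, text_embedding)


def multi_level_alignment_loss(features_by_level, text_embedding):
    """Average the alignment loss over features taken at several levels
    (for example whole-image, regional, and pixel-level features).
    The choice of levels and the equal weighting are assumptions."""
    losses = [alignment_loss(f, text_embedding) for f in features_by_level]
    return sum(losses) / len(losses)
```

In a setup like this, the text embedding would come from encoding a description such as "driving in rain," and the loss would be minimized with respect to the simulated features; no target-domain image is needed at any point.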


Video comparison (panels): Raw Video | Source Model | Our Model

Our method does not access any target-domain images; it uses only a simple text description, e.g., "driving in rain."


      title={Unified Language-driven Zero-shot Domain Adaptation},
      author={Yang, Senqiao and Tian, Zhuotao and Jiang, Li and Jia, Jiaya},
      journal={arXiv preprint arXiv:2404.07155},