May 25, 2024 by B. D. T. Sibarani, Yohandi, D. D. Halim, N. L. Tandaju
In today's visually driven world, combining text and imagery offers a distinctive way to explore and document a subject. Our project sits at this intersection and focuses on the architectural landscape of CUHK-Shenzhen. We chose this topic for three reasons. First, it lets new and exchange students familiarize themselves with the campus architecture before they arrive. Second, it contributes to the architectural and cultural documentation of CUHK-Shenzhen, preserving its identity for posterity. Third, it supports the academic community at CUHK-Shenzhen, particularly future research and development on Generative Adversarial Networks (GANs).
This project extends prior work on text-to-image synthesis, building on the foundational paper "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. (2016). While the original DC-GAN framework relies on a noise embedding alone to generate images, we replace this noise embedding with the concatenation of the noise and a text embedding of the image label. Mathematically, the original noise embedding
Besides modifying the existing method, we collected a new dataset as a second contribution. It consists of labeled photos of CUHK-Shenzhen buildings, gathered through a careful collection procedure. Because the dataset covers the architectural diversity of the university, the model can generate images in a variety of scenarios. This curated dataset supports text-to-image synthesis and contributes to the academic documentation of CUHK-Shenzhen, providing material for future scholarly research in image processing.
We began with data collection: we took 144 photos of different CUHK-Shenzhen buildings and annotated each with an English description, yielding the text-image pairs used to train our model. For pre-processing, we created a Text2ImageDataset class that serves training examples for text-to-image generation. Each example pairs a correct image with its corresponding text embedding and adds a randomly sampled "wrong" image of a different building.
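The sampling behaviour described above can be sketched in a few lines. This is a framework-agnostic illustration, not the project's actual Text2ImageDataset: the record keys (`building`, `image`, `embedding`), the file names, and the building names are hypothetical placeholders, and a real implementation would load image tensors rather than file names.

```python
import random


class Text2ImageDataset:
    """Sketch: serves (right image, text embedding, wrong image) triples."""

    def __init__(self, samples, seed=0):
        # Each sample is a dict with the hypothetical keys
        # "building", "image", and "embedding".
        self.samples = list(samples)
        self.rng = random.Random(seed)
        buildings = {s["building"] for s in self.samples}
        if len(buildings) < 2:
            raise ValueError("need at least two buildings to sample wrong images")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        right = self.samples[index]
        # A "wrong" image is any photo of a different building.
        candidates = [s for s in self.samples if s["building"] != right["building"]]
        wrong = self.rng.choice(candidates)
        return {
            "right_image": right["image"],
            "text_embedding": right["embedding"],
            "wrong_image": wrong["image"],
        }


# Toy records; names and values are placeholders, not project data.
samples = [
    {"building": "library", "image": "lib_01.jpg", "embedding": [0.1, 0.2]},
    {"building": "library", "image": "lib_02.jpg", "embedding": [0.1, 0.3]},
    {"building": "gym", "image": "gym_01.jpg", "embedding": [0.9, 0.4]},
]
dataset = Text2ImageDataset(samples)
item = dataset[0]
```

Sampling the wrong image at access time, rather than fixing it once, means each epoch can pair the same text with a different mismatched photo.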
Our model is a Deep Convolutional Generative Adversarial Network (DC-GAN). The architecture consists of two main components, a generator and a discriminator, which are trained adversarially.
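The generator's input, a noise vector concatenated with the text embedding of the label, can be illustrated as follows. This is a minimal sketch using plain Python lists; the function name `make_generator_input` and the default noise dimension of 100 are assumptions made for illustration, not values taken from the project.

```python
import random


def make_generator_input(text_embedding, noise_dim=100, rng=None):
    """Concatenate standard-normal noise with a text embedding."""
    rng = rng if rng is not None else random.Random()
    # Noise drawn from a normal distribution with mean 0 and std 1.
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    # The generator sees noise followed by the conditioning text embedding.
    return noise + list(text_embedding)


# Toy two-dimensional embedding; a real text embedding is much longer.
latent = make_generator_input([0.5, -0.5], noise_dim=4, rng=random.Random(0))
```

Because the text embedding is appended unchanged, the same description with different noise draws yields different inputs, which is what allows several distinct images per description.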
The training process alternates between two steps, discriminator training and generator training, repeated over several epochs. In discriminator training, the text encoder first converts the matching text into a text embedding. Using a noise vector drawn from a normal distribution, the generator produces a fake image. The discriminator then evaluates the real, wrong, and fake images together with the text embedding and produces a score for each. The discriminator loss is computed with Binary Cross-Entropy (BCE), which penalizes misclassification, and the discriminator parameters are updated with the Adam optimizer.
In generator training, the generator loss is a weighted sum of BCE, Mean Absolute Error (MAE), and Mean Squared Error (MSE) terms. The MAE term pushes the generated images to be as similar as possible to the real images, which likely helps preserve detailed textures. The MSE term pushes the discriminator's scores for fake images toward its scores for real images, so that the generated images align closely with the real data distribution. The generator parameters are also updated with the Adam optimizer. This alternation continues until a predefined number of epochs and training batches is completed, allowing the model to generate increasingly realistic images that correspond to the given text descriptions.
We chose these loss functions so that the discriminator learns to distinguish between real images, fake images, and real images with mismatched text. This rewards both image-text matching and image realism.
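The two losses can be written out explicitly. The sketch below works on plain Python floats so that it runs without a deep learning framework. The equal weighting of the three discriminator terms and the default weights `w_bce`, `w_mae`, and `w_mse` are assumptions, since the values used in the project are not stated here.

```python
import math


def bce(score, target, eps=1e-7):
    """Binary cross-entropy for one score in [0, 1] and a 0/1 target."""
    score = min(max(score, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(score) + (1.0 - target) * math.log(1.0 - score))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(real_scores, wrong_scores, fake_scores):
    # Real image with matching text should score 1; wrong and fake should score 0.
    real = mean(bce(s, 1.0) for s in real_scores)
    wrong = mean(bce(s, 0.0) for s in wrong_scores)
    fake = mean(bce(s, 0.0) for s in fake_scores)
    return real + wrong + fake  # equal weighting is an assumption


def generator_loss(fake_scores, real_scores, fake_pixels, real_pixels,
                   w_bce=1.0, w_mae=1.0, w_mse=1.0):
    # BCE: the generator wants its fakes to be scored as real.
    adversarial = mean(bce(s, 1.0) for s in fake_scores)
    # MAE: pixel-level distance between fake and real images.
    mae = mean(abs(f - r) for f, r in zip(fake_pixels, real_pixels))
    # MSE: distance between discriminator scores for fake and real images.
    mse = mean((f - r) ** 2 for f, r in zip(fake_scores, real_scores))
    return w_bce * adversarial + w_mae * mae + w_mse * mse


# Toy values: one score per image, two pixels per image.
d_loss = discriminator_loss([0.9], [0.2], [0.1])
g_loss = generator_loss([0.5], [1.0], [0.0, 0.0], [1.0, 1.0])
```

In a framework such as PyTorch the same structure would operate on batched tensors, with each step followed by an Adam update of the corresponding network.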
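The discriminator in the original GAN learns from two inputs: generated images and real images. Our discriminator also receives a third type of input, real images paired with mismatched text, so that it learns to judge image-text matching as well as image realism.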
The pseudocode is written as follows:
For the text embedding, we adopt from an existing model:
The progression of generated images over different epochs is illustrated in the following figure. Around epoch 100, the images become recognizable to the naked eye, which shows that the model generates increasingly coherent and detailed images from textual descriptions as training progresses. Visual inspection alone, however, is subjective.
To quantify the training progress and determine the stopping criteria, we examine the loss metrics for both the generator and the discriminator.
We define the stopping criterion for the generator by observing the discriminator's loss curve. As shown in the above figure, the discriminator loss decreases over time and settles at a lower bound, where the discriminator is adept at distinguishing between real and fake images. Once the discriminator reaches this level of performance, the generator loss also converges, indicating that the generator has learned to produce images that are challenging for the discriminator.
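One way to turn this observation into an automatic check is to test whether the discriminator loss has stopped changing. The sketch below is an assumption about how such a rule could look, not the procedure used in the project: the function name `has_plateaued`, the window size, and the tolerance are all hypothetical.

```python
def has_plateaued(losses, window=5, tolerance=1e-3):
    """Return True when the mean loss of the last `window` epochs differs
    from the mean of the `window` epochs before it by less than `tolerance`."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return abs(previous - recent) < tolerance


# Toy per-epoch discriminator losses; values are illustrative only.
still_decreasing = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
flattened = [1.0, 0.5, 0.3, 0.3, 0.3, 0.3]
stop_early = has_plateaued(still_decreasing, window=2, tolerance=0.05)
stop_late = has_plateaued(flattened, window=2, tolerance=0.05)
```

Comparing window averages rather than single epochs makes the check less sensitive to the epoch-to-epoch noise that is typical of GAN loss curves.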
By modifying the original DC-GAN framework to include text embeddings and introducing a third type of input for the discriminator, we improved the model's ability to generate realistic images that match the given text descriptions. The results show that the approach bridges the gap between textual descriptions and visual representations on our dataset. The qualitative improvement of the generated images over epochs, together with the analysis of the loss metrics, confirms the model's learning progress. Once training reaches the convergence point where the discriminator is adept at distinguishing real from fake images, the generator produces realistic and detailed images.
Our research lays a foundation for further exploration in the field of text-based image generation. There are several interesting possibilities for future work:
Personalization and Contextualization: Inspired by the concept of personalization in text-to-image diffusion models such as DreamBooth, future research can focus on fine-tuning the pre-trained models to bind unique identifiers with specific campus buildings. By embedding these identifiers into the model's output domain, we can synthesize fully novel photo-realistic images of campus buildings contextualized in different scenes, lighting conditions, and architectural styles.
Stable Diffusion for Transfer Learning: Using diffusion models instead of GANs offers more opportunities for cross-domain transfer learning, which may make it easier to adapt the text-based image generation model to domains beyond campus buildings. By transferring knowledge learned in one domain to another, the model can generalize better and produce high-quality images across diverse contexts. However, diffusion models require substantial computational power because of their complex architecture.
By pursuing these directions, we aim to further advance the capabilities and applications of text-based image generation for campus buildings and beyond.