The aim of multi-exposure fusion (MEF) is to generate a high-dynamic-range-like image from images captured by common cameras under different exposure settings. Existing generative adversarial network (GAN)-based MEF methods have achieved fair performance. However, when handling extremely exposed images, these methods still suffer from limitations such as detail loss and unnatural visual effects, partly because they cannot sufficiently aggregate multiscale context information. To address these limitations, we propose a cross-scale bilevel aggregation-based conditional generative adversarial network (CBA-cGAN). The generator contains two subnetworks: a cross-scale aggregation GhostNetV2 (CSA-GhostNetV2) for capturing and integrating multiscale context information with long-range dependencies, and a U-shaped local network (ULN) for extracting local features. To fully aggregate short-range and multiscale long-range dependencies, we introduce bilevel aggregation in CSA-GhostNetV2: intrablock and interblock aggregation. The intrablock aggregation scheme, built on dilated convolution and GhostBlockV2, aggregates multiscale context information and long-range relationships. The interblock aggregation scheme balances local and global contextual information during fusion by combining the features extracted by CSA-GhostNetV2 and the ULN. Additionally, to better align the visual effect with human perception, we supply the average of the input images to the discriminator. We compare the proposed method with traditional and deep learning methods on two publicly available datasets in terms of six objective metrics. Extensive experiments demonstrate that the proposed CBA-cGAN outperforms existing state-of-the-art methods in retaining local details while preserving the overall visual effect.
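To make the intrablock aggregation idea concrete, the following is a minimal PyTorch-style sketch of a block that combines parallel dilated convolutions (for multiscale context) with a simplified Ghost-style module (for cheap feature expansion). The module names, dilation rates, and the simplified Ghost module are illustrative assumptions, not the paper's exact CSA-GhostNetV2 design.

```python
import torch
import torch.nn as nn


class GhostModule(nn.Module):
    """Simplified Ghost-style module (after GhostNet): a small set of primary
    convolutions is expanded by a cheap depthwise convolution."""
    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        primary_ch = out_ch // ratio
        cheap_ch = out_ch - primary_ch
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, 1, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, cheap_ch, 3, padding=1,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        primary = self.primary(x)
        return torch.cat([primary, self.cheap(primary)], dim=1)


class IntraBlockAggregation(nn.Module):
    """Hypothetical intrablock aggregation: parallel dilated convolutions
    capture context at several receptive-field sizes, and their outputs are
    fused with the Ghost-module branch via a 1x1 convolution plus a residual
    connection."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.ghost = GhostModule(channels, channels)
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False)
            for d in dilations)
        self.fuse = nn.Conv2d(channels * (len(dilations) + 1), channels, 1)

    def forward(self, x):
        branches = [self.ghost(x)] + [conv(x) for conv in self.dilated]
        return self.fuse(torch.cat(branches, dim=1)) + x  # residual aggregation


if __name__ == "__main__":
    block = IntraBlockAggregation(channels=32)
    y = block(torch.randn(1, 32, 64, 64))
    print(y.shape)  # torch.Size([1, 32, 64, 64])
```

In this sketch, the interblock aggregation described in the abstract would then combine the outputs of several such blocks with the ULN's local features; that step is omitted here.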