Original Info
- ArXiv ID: 2512.18388

Abstract
Generative AI has begun to democratize creative work, enabling novices to produce complex artifacts such as code, images, and videos. However, in practice, existing interaction paradigms often fail to support divergent exploration: users tend to converge too quickly on early "good enough" results and struggle to move beyond them, leading to premature convergence and design fixation that constrains their creative potential. To address this, we propose a structured, process-oriented human-AI co-creation paradigm including divergent and convergent thinking stages, grounded in Wallas's model of creativity. To avoid design fixation, our paradigm scaffolds both high-level exploration of conceptual ideas in the early divergent thinking phase and low-level exploration of variations in the later convergent thinking phase. We instantiate this paradigm in HAIExplore, an image co-creation system that (i) scaffolds divergent thinking through a dedicated brainstorming stage for exploring high-level ideas in a conceptual space, and (ii) scaffolds convergent refinement through an interface that externalizes users' refinement intentions as interpretable parameters and options, making the refinement process more controllable and easier to explore. We report on a within-subjects study comparing HAIExplore with a widely used linear chat interface (ChatGPT) for creative image generation. Our findings show that explicitly scaffolding the creative process into brainstorming and refinement stages can mitigate design fixation, improve perceived controllability and alignment with users' intentions, and better support the non-linear nature of creative work. We conclude with design implications for future creativity support tools and human-AI co-creation workflows.

Full Content
Divergent thinking involves generating many ideas and exploring a broad space of possibilities, whereas convergent thinking focuses on refining them into a single or a few viable solutions [17]. Previous research has established that creative processes often unfold as transitions between divergent and convergent thinking [46]. For instance, in Wallas’s four stages of creativity (i.e., preparation, incubation, illumination, and verification), divergent thinking is often associated with preparation and incubation, while convergent thinking is more associated with illumination and verification [25,48].
More recently, researchers have investigated how generative models perform on divergent and convergent thinking tasks, and how they affect human-AI co-creation processes. Existing work has found that generative models show exceptional divergent thinking capabilities, often surpassing human baselines on tasks such as the Alternate Uses Test (AUT) [18,21]. However, despite these capabilities, studies have found that during human-AI co-creation, generative models do not necessarily help humans think more divergently and can instead lead users to anchor on the model’s initial outputs rather than exploring a broader space of alternatives, resulting in premature convergence and design fixation [42,47]. Based on these prior findings, our work focuses on operationalizing divergent and convergent thinking in the design of our human-AI co-creation workflow. Specifically, we operationalize these two modes of thinking into two complementary stages, brainstorming and refinement, and design interfaces that scaffold both divergent exploration and convergent refinement.
Generative AI has democratized content generation, enabling novices to produce artifacts such as tasks [16,52], code [13,53], and images [6,50] that previously required professional expertise and complex tools. Early work primarily framed generative models as productivity tools that automate content creation, such as synthesizing programming tasks and generating code [3,52,53]. More recent research has begun to design human-AI co-creation tools in which generative models are treated as collaborators in creative processes [40,42,44,50]. We broadly group human-AI co-creation tools into two categories: (i) systems that support divergent thinking by designing interaction mechanisms and interfaces for exploration and (ii) systems that support convergent thinking by helping users select, refine, and control outputs.
To support divergent thinking, recent systems help users explore a design space by generating design dimensions. For example, Luminate [42] infers dimensions of a design space (e.g., genre, tone) from a user’s creative writing task, while DesignWeaver [44] extracts design dimensions from a designer’s specification to scaffold product design exploration.
Similarly, AIdeation [50] generates brainstormed visual ideas from predefined dimensions such as theme, art style, color palette, and lighting, allowing users to browse and combine options. Although these systems support exploration, the dimensions they surface are often tied to surface-level presentation attributes (e.g., style, color, tone) rather than to the high-level conceptual properties that shape the overall direction and meaning of the design. In contrast, our system supports divergent exploration by suggesting high-level conceptual ideas (e.g., metaphors, narratives, framings) that define the artifact’s direction and meaning, treating presentation-level dimensions as secondary considerations that are better addressed during convergent refinement.
Another line of work focuses on convergent refinement by helping users articulate and elaborate prompts to better control generative models’ outputs. For instance, Promptify [6] supports users by expanding prompts with additional subjects and stylistic details, reducing the effort required to craft effective text-to-image prompts. Other systems, such as PromptCharm [51] and GenTune [49], aim to ease prompt engineering by automatically optimizing user prompts and suggesting alternatives. While these systems can reduce the burden of prompt writing, they also tend to hide much of the prompt construction process from users, limiting opportunities to develop transferable prompting skills. In contrast, our system treats convergent refinement as an opportunity for learning: users still author high-level prompts, but we introduce an intermediate layer where the generative models infer and externalize their intent as editable options. Users can select, modify, and customize these options, and always retain access to the full underlying prompt. This design not only helps bridge the “gulf of envisioning” [41] by making tacit intent more concrete, but also supports users in learning how to prompt models for convergent refinement in a more controlled way.
In this section, we deepen the analysis of why current human-AI co-creation workflows are misaligned with creative practice and, from there, derive design goals for our system. We ground this analysis in Wallas’s theory of the creative process and recent work on creativity-support tools, and then show how these insights motivate our structured, two-stage co-creation paradigm.
In 1926, Wallas proposed one of the first formal models of creativity, describing how the creative process unfolds in four stages (preparation, incubation, illumination, and verification), with both conscious and unconscious processes contributing across these phases [48]. During preparation and incubation, individuals explore the problem, gather information, and consider multiple possibilities. These early phases are often associated with divergent thinking, where the goal is to open up the space of options [17,25]. By contrast, illumination and especially verification are more focused: a promising idea emerges, and then it is evaluated, tested, and refined. These later phases tend to involve more convergent thinking, where possibilities are assessed and shaped into coherent, workable solutions [17,25].
From this perspective, a human-AI co-creation paradigm that aims to support creative processes would benefit from scaffolding both divergent and convergent phases. During divergent thinking, it can help users explore the problem, gather information, and consider multiple possibilities. During convergent thinking, it can assist users in evaluating, testing, and refining ideas into more concrete artifacts.
Current interaction paradigms for human-AI co-creation, especially for novices, are largely based on a “slot machine” model: users submit a prompt to a generative model (e.g., ChatGPT), inspect the returned output, optionally modify the prompt several times, and then either accept one of the results or abandon the attempt. This paradigm can work reasonably well when users have a clear vision and can articulate highly specific prompts. In those cases, the generative model functions primarily as an efficient executor of a pre-determined idea. For creative work, however, the slot-machine paradigm introduces several challenges. First, it encourages a rush to generation: users are immediately presented with fully rendered artifacts, making it tempting to settle on the first “good enough” result rather than exploring alternative directions. Prior work has shown that such workflows often lead to premature convergence and design fixation, where users become anchored to initial outputs instead of engaging in broad, divergent exploration [42,47]. In Wallas’s terms, the slot-machine workflow collapses the stages of preparation, incubation, illumination, and verification into a single prompt-artifact step, emphasizing verification-like behavior at the expense of early-stage exploration.
Second, once users decide to refine an initial output, they frequently struggle to communicate nuanced changes to generative models and often become fixated on a single or a few refined outputs [41,51,57,58]. There is a gap between the mental vision users have and the prompts they are able to write, often referred to as the “gulf of envisioning” [41,58].
Users often know what feels wrong or what they wish to adjust after seeing the output, yet they cannot easily translate these intentions into concrete textual instructions that the model can interpret reliably. For instance, a user may ask the model to “make the image feel more lively” for a street scene. The model may interpret this by adding bright saturated colors, and the user only realizes that this is not what they intended after seeing the result, pushing them into a trial-and-error prompting loop. Moreover, users can also fixate on the model’s initial “good enough” refinement results. For example, the model may interpret “make the image feel more lively” by changing the time of day, and the user feels this is roughly correct and accepts it. However, other possibilities, such as adding small groups of people or adjusting the lighting, may remain unexplored. As a result, users may end up refining through local edits, reinforcing design fixation and leading to suboptimal convergence.
To mitigate design fixation and premature convergence, recent work has proposed creativity-support tools that explicitly embed exploratory mechanisms into human-AI co-creation. These systems typically structure exploration through dimensions of a design space. For instance, AIdeation [50] introduces an explicit ideation phase in which the system generates design ideas organized along dimensions such as theme, content, art style, color palette, and lighting. Other systems, such as Luminate [42] and DesignWeaver [44], dynamically generate or surface design dimensions to help users navigate alternatives. While dimension-based systems make the design space more visible and easier to navigate, they still exhibit important limitations when viewed through Wallas’s staged model.
First, the dimensions they expose are often surface-level rather than conceptual-level, which limits their usefulness in the early stages of preparation and incubation. For example, when designing a poster, surface-level dimensions such as style, color, or font primarily affect the visual presentation, whereas a designer may first want to explore deeper metaphors, framings, or narratives that determine the core message and overall direction. Moreover, these systems typically treat dimensions as independent, overlooking the coherence, hierarchy, and interrelationships that naturally exist among conceptual and visual choices. Exposing surface-level dimensions too early can thus shift users’ attention toward selecting and filtering dimensions instead of engaging in open-ended conceptual ideation.
Second, these systems primarily support Preparation and Illumination by helping users explore possibilities. However, support is weaker in the Verification stage, where users converge on a chosen direction yet still need some continued exploration to avoid design fixation, while also requiring refinements that are targeted and controllable in line with their intent. Verification support would benefit from enabling users to convey high-level refinement intent and operationalizing it into precise, predictable adjustments, while exposing structured low-level alternatives for systematic comparison.
Third, these systems provide limited opportunities for users to learn from the creative process itself. Many tools are optimized to help users produce artifacts in the moment, but they often sideline users from key decisions, reducing their sense of agency in directing the process. As a result, users have fewer opportunities to develop reusable knowledge, skills, and workflow patterns that transfer beyond a single tool. When they switch to other systems (e.g., ChatGPT) for creative work, they may continue to struggle.
Finally, both chat-based interfaces and many dimension-based tools tend to impose a largely linear interaction flow.
History is presented as a flat sequence of prompts, making it difficult for users to branch, revisit earlier conceptual directions, or fluidly alternate between divergent and convergent modes, even though creativity theories emphasize iterative movement across stages rather than a single pass [60].
Drawing on these challenges and the theoretical framing provided by Wallas’s stage model, along with its connection to divergent and convergent thinking, we derive four design goals (DG1-DG4) for a human-AI co-creation paradigm that supports creative processes:
• DG1: Scaffold conceptual-level divergent exploration to reduce design fixation. The system should explicitly support preparation and incubation by helping users generate, externalize, and organize multiple high-level conceptual directions (e.g., metaphors, narratives, framings) prior to committing to specific artifacts. This aims to reduce premature convergence and design fixation.
• DG2: Scaffold convergent refinement to bridge the “gulf of envisioning” and avoid fixation. During illumination and verification, the system should help users translate their intentions into actionable refinements, and encourage exploration of multiple alternative refinements rather than fixation on a single “good enough” refinement.
• DG3: Support transferable learning during the co-creation process. Systems should not only support the production of high-quality artifacts, but also foster the acquisition of reusable knowledge, skills, or process patterns through use, enabling users to work effectively even when transferring to other generative AI tools (e.g., ChatGPT).
• DG4: Enable non-linear, iterative workflows that preserve context. The system should support fluid movement between divergent and convergent modes, including branching, revisiting earlier ideas, and comparing alternatives side-by-side. History and intermediate artifacts should be preserved in a way that maintains context for both conceptual directions and concrete outputs.
To address these design goals, we operationalize the paradigm by developing HAIExplore, a human-AI co-creation system for creative image generation, a domain commonly used in related studies [6,44,51]. However, our paradigm is not limited to images and can also be applied to other creative domains, such as story writing and product design.
HAIExplore contains two stages (Fig. 1, DG4): (i) a Brainstorming stage (DG1) that supports divergent exploration by having the system propose high-level conceptual ideas, from which users can select, edit, and generate initial images;
and (ii) a Refinement stage (DG2), where users provide a refinement prompt and the system first translates the user’s refinement intentions into interpretable, controllable parameters and options that users can select and adjust, before producing refined images. In this section, we first describe the user interfaces for each of the stages, followed by the backend implementation.
Overall User Interface. The brainstorming interface (Fig. 2) consists of three primary areas: the Input Section for prompting, the Idea Grid for idea exploration, and the Image Library for review. Users initiate the brainstorming process by entering a text prompt in the Input Section (Fig. 2A) and then clicking the “Brainstorm” button.
Idea Generation. Given the user’s prompt, HAIExplore populates the Idea Grid with structured Idea Cards (Fig. 2B). As detailed in Fig. 3 (top left), each card represents a distinct conceptual idea with five key components: a concise title of the idea, a thumbnail visual illustrating the idea, a textual description, category tags, and contextual background information revealed on hover. This structured card presentation allows users to quickly scan and compare distinct conceptual directions without being overwhelmed by unstructured text. Furthermore, surfacing this information helps users learn new concepts that they can reuse in other scenarios. For example, in Fig. 3 (top left), the idea card titled “The Interrupting Cow” introduces a popular meme that the user may not have encountered before. Later, when designing related content (e.g., humor- or meme-inspired visuals), users can recall this reference and use it as inspiration (DG3).

Fig. 1 (caption): Workflow overview of HAIExplore. HAIExplore supports two stages: Brainstorming and Refinement, and allows users to switch between them at any time. In Brainstorming, the user provides an ideation prompt describing the desired image. Based on this prompt, HAIExplore generates diverse conceptual ideas, which the user can select, edit, or extend to create new ideas. When an idea seems promising, the user can generate an image from it. In Refinement, the user iterates on generated images. To refine an image, the user provides a high-level refinement prompt, and HAIExplore translates it into concrete, interpretable parameters. Each parameter includes suggested options to help users explore alternative variations. The user can edit these options or create their own, then apply the selected options to generate a refined image. The workflow is non-linear, enabling flexible iteration between brainstorming and refinement.
Idea Operations. After idea cards are generated, users can interact with these ideas through the following operations:
• Create: Users can manually create new idea cards (Fig. 2C) by clicking “Create your own idea” and then entering an idea title and a detailed description. For example, the “Empty Bench Outside University” card shown in Fig. 2B is a user-created idea card.
• Expand: Users can click “More Ideas” (Fig. 2D) and optionally provide a contextual prompt to broaden and steer the idea space with additional system-generated ideas. HAIExplore then generates new ideas based on the provided context and appends the new idea cards to the existing ones in the Idea Grid.
• Edit: Users can edit any idea card by first selecting it and then clicking the pencil icon on the card to modify the card information (Fig. 3, bottom left) so the idea aligns more closely with their intent. Users can also delete irrelevant cards to declutter the idea grid.
Image Generation. Once users feel an idea is promising and want to use it, they can select the corresponding idea card and click the spark icon on the card (Fig. 2E) to trigger image generation based on that idea. Users can view the image via an image viewer that displays the full image alongside a textual Explanation (Fig. 3, right). The explanation is included to better illustrate the underlying intention. For example, Fig. 3 (right) shows an image inspired by the “The Interrupting Cow” meme with an explanation to elaborate on how the meme’s idea and the overall image relate to the user’s original prompt about reducing phone usage.
Technical Challenges and Methods. For the brainstorming stage, our goal is to generate high-level conceptual ideas that support open-ended divergent exploration and from which users can generate images. These ideas should (i) operate at a conceptual level rather than at low-level dimensions that control surface-level presentation (e.g., color, style), (ii) be diverse and span different domains, and (iii) be usable in ways that lead to creative, non-obvious images.
To meet these requirements, we draw on prior research suggesting that associative thinking, the ability to connect seemingly unrelated concepts, can boost creativity [5,29]. Motivated by this, we explicitly prompt the generative model to perform associative thinking during idea generation by drawing connections from various domains (e.g., artworks, historical events, mythology, metaphors) [30]. These associative ideas are then combined with the user’s design goal when generating images, increasing the potential for novel yet relevant images. For instance, in Fig. 2, the input prompt asks for a poster to encourage students to spend less time on their phones (Fig. 2A). One resulting idea generated through associative thinking was “The Interrupting Cow” (Fig. 2E), which originates from a popular meme.
The generative model may infer that “spending less time on the phone” relates to disrupting an ongoing habit and therefore associates this theme with the meme.
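As an illustration of how such associative-thinking instructions might be encoded, the following is a minimal sketch of an ideation prompt builder. The function name, domain list, and output schema are illustrative assumptions, not the exact prompt used by HAIExplore.

```python
# Hypothetical sketch of an associative-thinking ideation prompt.
# The wording, domains, and output schema are illustrative assumptions,
# not the exact prompt used by HAIExplore.

def build_ideation_prompt(user_goal: str, n_ideas: int = 9) -> str:
    domains = "artworks, historical events, mythology, memes, metaphors, science"
    return (
        f"You are brainstorming visual concepts for: {user_goal}\n"
        f"Use associative thinking: connect the goal to seemingly unrelated "
        f"concepts drawn from {domains}.\n"
        f"Return {n_ideas} distinct ideas, each with a title, a short "
        f"description, category tags, and background information."
    )

print(build_ideation_prompt("a poster encouraging students to use their phones less"))
```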
After viewing and creating different ideas and generating initial images in the Brainstorming stage, users can explore various candidates and eventually decide to focus on a specific image. Typically, this initial image incorporates a concept that the user finds interesting; however, the fine-grained details may not perfectly match the user’s vision. HAIExplore addresses this problem by providing a dedicated refinement stage (DG2, see Fig. 4).
Overall User Interface. To trigger the refinement process for an initial image generated during the Brainstorming stage, users can select a generated image and click “Refine” (available in the image viewer). This action opens a dedicated Refine Tab alongside the original Brainstorm Tab (Fig. 4A). This tabbed structure allows users to maintain parallel workflows, refining different images in multiple tabs while keeping the brainstorming session on standby (DG4). In a Refine Tab, the selected image is used as the Base Image and displayed at the top of the Image Library (Fig. 4F). The Image Library in each tab is independent, containing only the base image and its subsequent refined variations. Once a Refine Tab is opened (Fig. 4A), the user can begin the refinement workflow by following these steps:
• Specify refinement intent (Fig. 4B): The user provides a prompt describing the desired refinement in the input text box and clicks “Refine.”
• Generate parameters and options (Fig. 4C): HAIExplore analyzes the refinement prompt and the Base Image to generate a set of dynamic Parameters and Options. The system infers relevant parameters from the user’s refinement prompt and pre-populates each parameter with concrete, context-aware options that help users explore different alternatives. For example, the system infers that cow_role is a useful parameter to adjust as the user mentions that “the cow is guiding people.” It then provides several dropdown options for the parameter cow_role, including “a friendly mascot …,” “a playful coach …,” “a tour guide …,” and “a magician ….” Users can review and edit these options or enter custom values.
• Select options and edit the prompt (Fig. 4D): As users select and edit options, a Prompt Preview updates in real time, showing the full refinement prompt that will be sent to the model. The bolded segments in this prompt correspond to the options users have selected or edited, making it clear how their selections shape the final prompt (DG3). Users can also manually edit the prompt to specify additional details, or just skip prompt editing.
• Generate refined images (Fig. 4E): Once the user is satisfied with the selected options and/or the refinement prompt, they click “Generate Image.” The resulting refined images appear as Variations in the Image Library (Fig. 4F). Clicking a variation reveals the full image alongside its generation prompt.
The refinement process also supports non-linear iteration: users can modify options to generate new variations, edit the input refinement prompt to trigger an entirely new set of parameters and options, or switch back to the Brainstorm tab to explore more ideas (DG4).
Technical Challenges and Methods. The key technical challenge in refinement is generating parameters that help articulate users’ refinement intent and providing concrete, useful options for each parameter to support exploring alternatives and reducing design fixation. To address this, we introduce a representation we call a Sketch. We define a Sketch as a parametric function (implemented as Python code) that takes abstract parameters as input and returns a concrete prompt string. When a user provides a refinement prompt, HAIExplore synthesizes a Sketch function. The body of this function encodes the prompt template, and its arguments represent semantic dimensions (e.g., back_activity, cow_tone) extracted from the user’s refinement prompt. This sketch-based representation enables real-time interactivity, as the frontend can re-execute the function locally to update the prompt preview without issuing new queries to the generative models.
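To make the Sketch representation concrete, below is a minimal, hypothetical example of what a synthesized Sketch function could look like for the cow-guidance refinement discussed above. The parameter names and template wording are illustrative assumptions, not output taken from the system.

```python
# Hypothetical Sketch synthesized for a refinement prompt such as
# "make the cow guide people away from their phones".
# Parameter names and template text are illustrative assumptions.

def sketch(cow_role: str = "a friendly mascot waving students over",
           back_activity: str = "students chatting on a sunlit lawn",
           cow_tone: str = "warm and playful") -> str:
    """Return the concrete refinement prompt for the selected options."""
    return (
        f"Refine the base image so the cow appears as {cow_role}. "
        f"In the background, show {back_activity}. "
        f"Keep the overall tone {cow_tone}, and preserve the composition, "
        f"colors, and layout of the original poster."
    )

# Re-executing the function locally updates the prompt preview as the
# user changes dropdown options, without querying the generative model:
preview = sketch(cow_role="a tour guide holding a small flag")
print(preview)
```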
We implemented HAIExplore as a full-stack web application. The frontend is built using React to ensure a responsive user interface. The backend API is developed with Django REST Framework. We use OpenAI’s gpt-5-2025-08-07 model for generating ideas and sketches. For generating the actual user-selected images (in both the Brainstorming and Refinement stages), we use the gpt-image-1 model. To optimize the workflow, images in the Brainstorming stage are generated with quality set to Medium to support fast, iterative exploration, whereas images in the Refinement stage are generated with quality set to Auto to allow for higher fidelity and detail in the final image.
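As a rough sketch of how the two quality settings might be passed to the image model via the OpenAI Python SDK (the exact parameters, defaults, and response handling are assumptions and should be checked against the current API documentation):

```python
# Minimal sketch of the two image-generation calls, assuming the OpenAI
# Python SDK; parameter values here are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()

def generate_image(prompt: str, stage: str) -> bytes:
    """Generate an image, using faster settings during brainstorming."""
    quality = "medium" if stage == "brainstorm" else "auto"
    response = client.images.generate(
        model="gpt-image-1",
        prompt=prompt,
        quality=quality,
    )
    # gpt-image-1 returns base64-encoded image data.
    return base64.b64decode(response.data[0].b64_json)
```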
To generate visuals for the ideas shown in the Brainstorming phase, we prompt the text-to-image gpt-image-1-mini model to produce a single composite image arranged in a 3 × 3 tiled layout, where each tile visualizes a single idea. The frontend then processes this image, slicing the grid into nine separate thumbnails for display. This method effectively reduces both computational cost and API latency compared to generating images individually.
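The slicing step itself is straightforward; the following is an illustrative Pillow-based sketch (HAIExplore performs this step in the React frontend, so the function below is only a Python analogue for clarity):

```python
# Illustrative sketch: slice a 3 x 3 composite image into nine thumbnails,
# one per idea card. File name and helper are assumptions for illustration.
from PIL import Image

def slice_grid(composite_path: str, rows: int = 3, cols: int = 3):
    grid = Image.open(composite_path)
    tile_w, tile_h = grid.width // cols, grid.height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            tiles.append(grid.crop(box))
    return tiles  # nine thumbnails, row by row

thumbnails = slice_grid("idea_grid.png")
```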
To evaluate the effectiveness of HAIExplore in supporting users’ image creation processes, we conducted a within-subjects comparative user study. We compared HAIExplore against ChatGPT (powered by GPT-5 with gpt-image-1 as the image generation model), which we chose as a representative baseline because it is one of the most widely used and accessible text-to-image prompting interfaces for general users.
We recruited 12 participants (5 female, 7 male) from the local university, aged 24-33 years (M = 27.8, SD = 3.2). Six participants reported no prior experience or training in digital art or design. The other six had some prior experience: three had less than one year of self-taught experience, and three had 1-3 years of self-taught or informal experience. All but one participant had previously used AI image generation or editing features, while the remaining participant had only heard of such tools but had never used them. Each participant received a €10 coupon and a bag as a token of appreciation for their participation.

Task instructions shown to participants: “Your task is to create images for the given task (see below) using only the provided system. Topic: Spending Less Time on Phone. Create images that encourage university students to spend less time on their phones and reconnect with real life and others. Imagine one of these images being printed as a poster displayed in a student café or library: it should motivate people to pause, look up, and engage with the world around them. You have 20 minutes to complete this task. Within the time limit, feel free to create as many images as you like; there are no right or wrong answers. There are no additional constraints for the image; for example, there are no constraints on its size, style, or format.”

Table 1. The 12 unique experimental conditions used to counterbalance task pairing, task order, and system order.
In the study, participants were asked to create poster images to convey a specific message. We chose poster design because our participants were primarily university students, for whom posters are a familiar and common form of communication in campus settings. Moreover, poster design is a widely accessible, non-professional creative activity that requires both conceptual thinking and visual creativity. For poster design, we defined three comparable task topics (Tasks A, B, and C). We employed a within-subjects design to compare the creative workflows facilitated by HAIExplore and ChatGPT.
To minimize learning effects, each participant completed two distinct image-creation sessions (one per system), using a different task topic for each. To control for task-specific variance and carryover effects, we employed a Balanced Incomplete Block Design with fully counterbalanced ordering [26]. Specifically, we controlled for three factors: task pair (selecting a subset of two out of the three topics: A&B, B&C, or A&C), task order (e.g., Task A followed by Task B), and system order (HAIExplore first vs. ChatGPT first). This design yielded exactly 12 unique experimental conditions (3 task pairs × 2 task orders × 2 system orders) (see Table 1). Participants were randomly assigned to these conditions, ensuring that every permutation of task assignment and system order was observed with equal frequency. The study lasted approximately 1 hour and 50 minutes per participant. The procedure is illustrated in Fig. 6 and detailed as follows:
Pre-study Survey and Setup (5 minutes). Participants were first briefed on the study objectives, procedures, and data handling protocols. After reviewing and signing a consent form, they completed a pre-study questionnaire collecting demographics, prior experience with text-to-image tools, and baseline attitudes toward AI-assisted creativity. All participants used the same laptop (Apple MacBook Pro 14) and web browser.
Two Image Creation Sessions (85 minutes). The session began with a brief onboarding tutorial (up to 15 minutes). We provided a walkthrough of the assigned system, after which participants were allowed to explore the interface freely. If a participant felt comfortable with the system before the 15 minutes were up, they could proceed directly to the main task. Following the tutorial, participants were given their first assigned task (e.g., Task A) and the corresponding system (e.g., HAIExplore). After the session, they were asked to download any images they felt were ready to be printed as a poster. During this session, all screen interactions and downloaded images were recorded. Following the task, participants completed a short post-task survey (approximately 5 minutes). This included the Creativity Support Index (CSI) [8], administered without paired comparisons to reduce fatigue, the UMUX-Lite [27], and a single item asking whether they felt they had learned something previously unknown. After this, participants were given a 5-minute break to reduce fatigue and minimize cross-condition interference. Then, they completed a second image creation session (40 minutes), using the other system (e.g., ChatGPT for Task B) and following the same procedure as in the first session.
Post-study Interview (20 minutes). After completing both sessions, participants engaged in a semi-structured interview to reflect on their experiences. The interview explored participants’ perceptions of workflow guidance, creative exploration, and moments of inspiration, as well as comparisons between the two systems, perceived learning, and intentions for future use. All interviews were audio-recorded with consent and later transcribed for thematic analysis.
We collected both quantitative and qualitative data from multiple sources, including post-task surveys, screen recordings of participants’ interactions, semi-structured interviews, the final images downloaded by participants, and post-study generations for ablation analyses. For all comparisons of quantitative measures, we employed the Wilcoxon signed-rank test [54] with a significance level of α = 0.05. We describe the specific data collection and analysis methods below.
Survey. After each creation session, participants completed three questionnaires: the unweighted Creativity Support Index (CSI) [8], the UMUX-Lite [27], and a self-reported learning question. The CSI questionnaire [8] measures how well a creativity support tool aids users across dimensions such as enjoyment, exploration, and expressiveness (0-10 scale). We used the unweighted CSI without paired comparisons to reduce participant burden [31]. Since our study did not involve collaboration, we omitted the Collaboration factor from the standard CSI, as in prior work [42,43]. The UMUX-Lite [27] measures perceived system usability (1-7 scale). Finally, for self-reported learning, we asked a single item about whether the participant learned something previously unknown during the session (1-7 scale).
Quantitative Metrics from Screen Recordings. We analyzed screen recordings for quantitative metrics about participants’ creation processes. For both HAIExplore and ChatGPT, we computed two quantitative metrics (Image Clusters and Refinement Prompts) and two additional metrics for HAIExplore only. These metrics are defined as follows:
• Image Clusters: We define a cluster as a group of images derived from the same initial image, including the initial image and all its subsequent variations. For each participant, the number of image clusters indicates exploration breadth, i.e., how many distinct ideas they explored.
• Refinement Prompts: This is defined as the number of times participants prompted the system to refine an image, reflecting the effort required during the refinement process.
• Idea Customization: This is measured as the number of user-edited or user-created idea cards in the Brainstorming stage, capturing how actively participants created their own ideas or tailored the generated ones.
• Default Option Adoption: This is defined as the proportion of refinements in the Refinement stage where participants used the default options (i.e., the first option for each parameter) to generate variation images. A high proportion indicates that users rarely needed options beyond the defaults, while a low proportion suggests the opposite, highlighting the value of offering multiple options to choose from.
Lastly, for ChatGPT, we examined each participant’s first prompt to see whether the participant followed a brainstorming-first workflow (as in HAIExplore) by asking ChatGPT for idea suggestions. We compared the two system-order groups (i.e., HAIExplore first vs. ChatGPT first) to see whether prior experience with HAIExplore influenced participants to adopt a brainstorming-first approach when using ChatGPT.
Thematic Analysis. We analyzed the interview transcripts using thematic analysis [7,9]. Two researchers conducted open coding on the transcripts. They first independently coded transcripts from two randomly selected participants, then met to discuss and refine the initial codes. After this, each researcher coded half of the remaining transcripts.
Finally, they met again to review and consolidate the codes into overarching themes.
Evaluation of the Fluency and Diversity of Generated Images. To evaluate the quality of the final images generated and then downloaded by participants in each system condition, we operationalized artifact quality using four metrics: Fluency, Diversity, Novelty, and Usefulness.
• Fluency: We operationalized Fluency as the number of final images downloaded by each participant in each system condition.
• Diversity: This captures the semantic difference among a participant’s final images. Specifically, we measured Diversity by extracting image embeddings (using CLIP ViT-bigG-14 [38]) for all downloaded images and computing the average pairwise cosine distance between these embeddings. Higher scores indicate that the images are more different from each other in the embedding space, reflecting higher semantic diversity (a minimal computation sketch is given below).
Note that computing diversity requires at least two downloaded images per participant; two participants who downloaded only a single final image in both conditions were excluded from this analysis, resulting in a final sample size of N = 10 for this metric.
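The diversity computation itself is simple once embeddings are available. Below is a minimal sketch assuming the CLIP image embeddings have already been extracted (embedding extraction is omitted); variable names and the embedding dimension are assumptions for illustration.

```python
# Average pairwise cosine distance over a participant's final images,
# given precomputed CLIP image embeddings (one row per image).
import numpy as np
from scipy.spatial.distance import pdist

def diversity_score(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance; requires at least two images."""
    assert embeddings.shape[0] >= 2, "diversity needs >= 2 images"
    return float(pdist(embeddings, metric="cosine").mean())

# Example with random stand-in embeddings for three images
# (1280-dimensional vectors are an assumption for ViT-bigG-14).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 1280))
print(diversity_score(fake_embeddings))
```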
Evaluation of Novelty and Usefulness of Generated Images. To assess Novelty and Usefulness of the generated images, we recruited two independent evaluators representing the target audience for the posters. We administered a double-blind rating task in which evaluators viewed all final images without knowledge of the participants or system conditions.
They independently rated each image on a 7-point Likert scale along the two dimensions of Novelty and Usefulness following these definitions:
• Novelty: This captures how original or surprising the idea behind the image is. Higher scores indicate images that are very surprising and original, i.e., not something one would normally consider. Lower scores indicate images that are common, predictable, or clichéd.
• Usefulness: This captures how well the image works as a real poster for the given task. Higher scores indicate that the image could be used as-is for printing and influencing the audience. Lower scores indicate that the image does not work well for the task and would not be used as a poster.
For each participant (not to be confused with an evaluator), we computed Novelty and Usefulness scores as follows: we averaged the ratings from the two evaluators to obtain a single consensus score for each image, and then averaged these image-level scores across all images produced by that participant under a given condition, yielding one Novelty and one Usefulness score per participant per condition. We then reported the means, standard deviations, and results of statistical comparisons for these participant-level scores in each condition.
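A minimal sketch of this two-step aggregation, assuming the image-level ratings are stored in a table with hypothetical column names:

```python
# Aggregate evaluator ratings to one Novelty and one Usefulness score
# per participant per condition. Column names are illustrative assumptions.
import pandas as pd

ratings = pd.DataFrame({
    "participant": ["P1", "P1", "P1", "P1"],
    "condition":   ["HAIExplore", "HAIExplore", "ChatGPT", "ChatGPT"],
    "image_id":    ["img1", "img1", "img2", "img2"],
    "evaluator":   ["E1", "E2", "E1", "E2"],
    "novelty":     [6, 5, 3, 4],
    "usefulness":  [5, 6, 5, 5],
})

# Step 1: consensus score per image (mean over the two evaluators).
per_image = ratings.groupby(
    ["participant", "condition", "image_id"]
)[["novelty", "usefulness"]].mean()

# Step 2: participant-level score per condition (mean over that
# participant's images in that condition).
per_participant = per_image.groupby(["participant", "condition"]).mean()
print(per_participant)
```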
Effectiveness of Associative-Thinking Prompting on Idea Diversity. To foster idea diversity in the brainstorming phase of HAIExplore, we explicitly incorporated principles of associative thinking into the prompts given to the generative model, aiming to support divergent thinking and encourage more varied ideas. To examine whether this design choice indeed leads to more diverse idea sets, we conducted an ablation comparing two prompting conditions given to the same generative model: (i) an associative prompt that explicitly instructed the model to perform associative thinking and draw on remote connections to generate diverse ideas, and (ii) a non-associative prompt that requested diverse ideas but did not mention associative thinking. To do this, we took the 12 initial prompts that participants entered when using HAIExplore (one prompt per participant). For each of these 12 prompts, we queried GPT-5 under both prompting conditions. To account for sampling randomness, each condition was run 3 times independently per prompt, and in each run the model was asked to generate 9 ideas (the same setting as the brainstorming phase of HAIExplore). For every set of 9 ideas, we computed an embedding-based diversity score. We encoded the name of each idea using the same CLIP model [38] and calculated the average pairwise cosine distance across all nine ideas within that set.
This average distance serves as our diversity metric, with higher values indicating that the ideas are more dissimilar from one another in the embedding space. To test for a statistically significant difference between the associative and non-associative conditions, we first averaged the diversity scores across the three runs for each distinct prompt to obtain a single aggregated score per prompt, before applying the Wilcoxon signed-rank test (N = 12).
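A minimal sketch of this aggregation and paired test, assuming the per-run diversity scores are stored in arrays (the array names and the random stand-in values are illustrative):

```python
# Compare associative vs. non-associative prompting on idea diversity.
# diversity[condition] has shape (12 prompts, 3 runs); values are stand-ins.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
diversity = {
    "associative":     rng.normal(0.64, 0.05, size=(12, 3)),
    "non_associative": rng.normal(0.61, 0.06, size=(12, 3)),
}

# Average the three runs per prompt to get one score per prompt and condition.
assoc = diversity["associative"].mean(axis=1)
non_assoc = diversity["non_associative"].mean(axis=1)

# Paired, non-parametric comparison over the 12 prompts.
stat, p_value = wilcoxon(assoc, non_assoc)
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.4f}")
```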
In this section, we present our findings organized by the four research questions (RQ1-RQ4) proposed in Section 1, combining both quantitative and qualitative results.
This section presents the results on participants’ overall experience and perceived effort with the two workflows.
Participants perceived HAIExplore as better for supporting creativity and their image co-creation needs. We first compared participants’ perceived creativity support and system usability between HAIExplore and ChatGPT. On the unweighted CSI (Table 2a), participants rated HAIExplore significantly higher on Enjoyment (p = 0.0034), Exploration (p = 0.0029), and Results Worth Effort (p = 0.0137). For system usability (UMUX-Lite; Table 2b), participants rated HAIExplore’s capabilities as significantly better meeting their requirements (M = 5.92, SD = 0.67) than ChatGPT’s (M = 3.92, SD = 1.38, p = 0.0049), whereas ease-of-use ratings did not differ (HAIExplore: M = 6.17, SD = 1.03; ChatGPT: M = 6.00, SD = 0.95, p = 0.7656). Per our thematic analysis, participants described HAIExplore’s two-stage workflow as natural and intuitive, noting that they “got used to it immediately” (P8). A few participants noted that ChatGPT can still be useful when they have a clear vision of what they want. However, HAIExplore would be their preferred choice for most scenarios: “I would actually prefer HAIExplore, unless I really have a very specific idea in mind; in that case, I could just use ChatGPT. But I feel like these cases are actually very rare.” (P10).
Participants generated more novel images with HAIExplore. Table 2c summarizes the comparison of final images created during the user study with HAIExplore and ChatGPT. The two systems yielded similar fluency and usefulness scores, but with HAIExplore, participants produced significantly more novel images (p = 0.004) and tended to show higher diversity, although this diversity difference was not statistically significant (p = 0.084). In interviews, several participants described HAIExplore as particularly valuable for novices and non-experts in image prompting, characterizing it as an expert “guide” that structures the process and suggests concrete next steps: “what could be the next step, what I can try. That guidance is very important.” (P12).
Participants experienced greater agency and ownership with ChatGPT. Our thematic analysis revealed differences in perceived agency and ownership between the two systems. A few participants reported feeling more agency when working with ChatGPT. For example, P1 explained, “[ChatGPT] wasn’t biasing me towards a certain set of ideas. So it gave me the perception that I was thinking more by myself.” Similarly, P8 reflected, “The ideas are mostly HAIExplore’s.
I just guided the system to produce what I wanted, and then I selected. So, yeah, I would feel higher ownership with ChatGPT than with HAIExplore.”
This section presents findings on how HAIExplore supports the initial brainstorming phase.
HAIExplore serves as a creative catalyst, generating diverse conceptual ideas when users lack a starting point. As shown in Table 3, there was a trend towards exploring more image clusters with HAIExplore (M = 3.08, SD = 1.31) than with ChatGPT (M = 2.08, SD = 0.79), although this difference was not statistically significant (p = 0.0898). Next, since the ideas in HAIExplore were generated using associative-thinking prompting, we also compared associative-thinking prompting vs. non-associative-thinking prompting for idea diversity. We found that associative-thinking prompting significantly outperformed its non-associative counterpart (p = 0.016), producing higher diversity scores (M = 0.64, SD = 0.05 vs. M = 0.61, SD = 0.06). Further, all participants reported that HAIExplore’s brainstorming stage effectively sparked “diversified” (P3), “creative” (P8, P12) ideas. Many participants described feeling stuck about how to start their creative process, and they found that HAIExplore provided a helpful jumpstart by generating a variety of concepts to consider: “it’s very nice when I have literally no clue where I want to go” (P3), “it’s very useful if you’re trying to generate images for things that you don’t really know much about” (P10), “I think it was ideal that there were many ideas, because I didn’t know what I wanted, so that was really good” (P7).
Text and visual descriptions help users understand idea relevance and filter ideas quickly. Participants explicitly appreciated the visually-grounded idea descriptions. P2 praised the “initial ideas with text and also some illustrations” for helping them “immediately know what each one is about”. These explanations were vital for bridging the gap between novel concepts and user understanding: “Seeing different ideas visually with a short description is useful. It helps me quickly see, out of these nine ideas, which ones I like” (P5).
Several participants reported cognitive overload with HAIExplore’s output. While most participants appreciated the abundance of ideas generated by HAIExplore, a few reported a ‘cost of divergence’, feeling overwhelmed by the sheer volume of idea options. The volume of generated concepts was described as “sometimes overstimulating” (P1), and these participants called for convergence mechanisms to help them narrow down choices.
This section presents findings regarding the refinement stage of HAIExplore, focusing on user control and iteration.
Participants prefer HAIExplore for its efficiency and controllability. Table 4 includes the result for the number of refinements per image cluster. There was a trend toward fewer refinements with HAIExplore (M = 1.45, SD = 1.22) than ChatGPT (M = 2.50, SD = 1.41; p = 0.0513), although this difference was not statistically significant. Table 4 also reports how often participants selected the default options when refining with HAIExplore (M = 26.4%, SD = 29.7%), suggesting that participants frequently needed options beyond the default suggestions. As per our thematic analysis, many participants expressed frustration with ChatGPT’s tendency to be “volatile”, changing the images randomly or overwriting previous versions entirely: “ChatGPT misinterprets my intention, so it changes things I do not want it to change […] It’s very unpredictable.” (P4), “[With ChatGPT] when I try to refine things it sometimes just randomly changes things.” (P9). In contrast, HAIExplore was praised for its ability to keep images prompt-consistent and match ideas closely: “I also like that [HAIExplore] kept track of the specific details in the prompt when generating the image, so the image doesn’t change too much other than what I asked to change” (P9), “[HAIExplore] keeps context or portions of the previous image and continues from there.” (P12).
“Guided Refinement” reduces cognitive load; option selection is preferred over writing complex corrective prompts. The thematic analysis revealed that almost all participants expressed a strong preference for HAIExplore’s guided refinement approach. They appreciated the ability to select from concrete options rather than having to perform the high-effort trial-and-error prompting in ChatGPT. P4 noted that HAIExplore was “very clear because you had the options and it knew what to modify,” unlike ChatGPT, where the changes were opaque. P8 found the structured suggestions “very useful” because they “take the burden off you to be very explicit about every detail.” Similarly, P11 praised the intuitive nature of the options, stating that “I don’t have to remember many terms or type so many prompts,” which “saves my time and makes it easier for me.”

Table 5. Results for RQ4: Self-reported learning scores (Mean ± SD). More details in Section 6.4.
Learning: HAIExplore 5.25 ± 1.91, ChatGPT 2.50 ± 1.78, p = 0.0059
It is sometimes difficult to map HAIExplore’s abstract text options to visual outcomes. Several participants reported difficulty understanding what the options meant until they saw the refined images. They noted the desire for clearer traceability to understand exactly how a text selection would alter the visual output. For example, P10 suggested that refinement options could be “more descriptive”, noting that abstract terms like “mosaic” were hard to interpret without visual aids: “If I had seen this “mosaic” idea […] I wouldn’t have generated this image because I would already know I wouldn’t like it.” They recommended adding “visualization images” or “longer descriptions” to clarify the expected outcomes.
Users desire direct manipulation to complement dropdowns. While the menu-driven refinement was helpful for broad changes, some participants reported wanting additional interaction mechanisms for more targeted adjustments. For example, P9 described the dropdown approach as “limiting” for specific, granular edits. Participants also described an interest in interacting more directly with images. P1 wanted to “add text as you like directly to the image” to “edit and fix it faster” without writing prompts. P12 echoed this, stating a preference to “mark something directly on the image” to avoid the ambiguities of describing visual changes in text: “I would prefer to just mark, ‘Here, do something.’” In addition, P9 and P12 mentioned a desire for better side-by-side comparison tools, with P12 noting, “I want to compare them side by side and then decide which to pick.”
This section presents results on the educational aspects of using HAIExplore compared to ChatGPT. Overall, participants reported significantly higher self-perceived learning with HAIExplore (M = 5.25, SD = 1.91) compared to ChatGPT (M = 2.50, SD = 1.78; p = 0.0059) (Table 5). Below, we elaborate on more detailed findings.
HAIExplore expands participants’ creative vocabulary and conceptual knowledge beyond their initial expertise. Our thematic analysis recorded many participants reporting that HAIExplore introduced them to new concepts, styles, and terminology that they were previously unaware of. For example, P3 noted learning about “Maslow’s hierarchy” and the “social support map,” realizing connections they hadn’t considered: “I know about the Pomodoro technique […] but I wouldn’t see it as a mental well-being technique”. Similarly, P5 discovered new frameworks like the “20-5-3 outdoor rule” and “attention restoration theory,” stating, “I wasn’t aware of this topic. These ideas and concepts could also be useful later.”

Visualizing HAIExplore’s underlying logic helps participants become better prompt engineers. Participants described the ability to view the underlying prompt and keywords as useful for learning how to communicate more effectively with AI systems. P2 noted that seeing the underlying prompt “teaches you how to prompt, so next time you can write a better and more detailed prompt.” P4 reported that this made it easier to express intentions in the future: “it becomes much easier in the future for me to express myself.” P6 observed that the system’s prompts were “more expressive” and included details they hadn’t specified, such as “font hierarchy,” concluding that “using this kind of prompt […] would generally produce a better result” when directly interacting with ChatGPT.
Participants expressed interest in transferring HAIExplore’s structured workflow to their ChatGPT use. We examined whether participants prompted ChatGPT to brainstorm ideas at the beginning of their ChatGPT session. Among the six participants who experienced HAIExplore first, three (50%) did so, compared to one out of six participants (16.7%) who used ChatGPT first. In interviews, several participants expressed interest in adopting HAIExplore’s structured approach (diverge-then-converge) and planning into their own creative workflows, even when using other tools like ChatGPT. They saw value in the systematic process itself, beyond just the specific system. For example, P5 stated, “I would like to start by asking ChatGPT to give me more brainstorming ideas in a similar way before going to the refinement phase.” P9, who described themself as a ‘power user’, felt they needed to “use this system first to learn how to do it properly” before applying the workflow to ChatGPT. P12 went further, suggesting they might “create my own custom GPT that simulates this guidance”, structuring prompts to “verify requirements” and “brainstorm potential ideas” before generation.
This section discusses our findings and highlights the broader implications of our results for future research and the design of human-AI co-creative systems.
Quantitative results show that participants rated our system significantly higher in creativity support and usability compared to ChatGPT across multiple dimensions. Qualitative findings corroborate this, revealing that participants strongly preferred the structured, two-stage workflow over ChatGPT’s linear chat paradigm, finding it more efficient and less cognitively demanding, though some noted a trade-off in perceived agency.
These findings align with the concept of functional fixedness in problem-solving, where individuals struggle to see beyond their initial ideas or the standard use of tools [2]. The linear chat interface reinforces a “rush to generation” mentality, exacerbating the fixation problem where users settle on early, suboptimal ideas [23]. By enforcing a structural separation between ideation and generation, HAIExplore breaks this fixation loop. Our results also resonate with the “gulf of envisioning” described by recent work on prompt-based interactions [41], which highlights the cognitive gap between a user’s abstract intent and the precise prompt required to realize it. The preference for HAIExplore suggests that structured scaffolding bridges this gulf more effectively than open-ended chat. For system designers, these results suggest that the “chatbot” paradigm may not be the optimal default for creative tasks. Instead of relying solely on natural language conversation, generative tools should incorporate explicit process scaffolding: interfaces that guide users through distinct stages of divergent and convergent thinking [10,18,35]. Practically, this could mean that future versions of tools like Midjourney or ChatGPT should offer a “Brainstorm Mode” distinct from their “Generate Mode,” allowing users to toggle between exploring concepts and refining details.
Regarding agency, the tension we observed reflects the “Control vs. Serendipity” paradox in co-creative systems [36].
While users value the serendipitous ideas generated by AI, the very act of the system taking initiative can degrade their feeling of ownership [11]. Our findings confirm that while guidance improves outcomes, it must be balanced carefully to preserve the user’s psychological ownership [55]. To address these agency and ownership concerns, systems should frame AI suggestions as “options” rather than “decisions,” ensuring the user always retains final control.
The results suggest that the brainstorming stage exhibited divergent thinking, with the associative prompting mechanisms significantly increasing idea diversity. Participants appreciated the system’s role as a creative catalyst that offered visually grounded concepts, although the volume of generated ideas occasionally led to cognitive overload.
Our associative approach addresses “collaborative fixation”, which posits that while working with others can stimulate ideas, it can also lead to convergence if the partner’s ideas are too similar [24]. By explicitly prompting the AI model to seek diverse associations, HAIExplore acts as a collaborator that actively resists convergence, pushing the user out of any initial mental model. As a practical takeaway, co-creative tools should prioritize divergence through associative thinking in their ideation modules. Rather than simply asking the AI to “generate ideas,” systems should be designed to look for metaphors, analogies, and cross-domain connections to maximize diversity.
The cognitive overload reported by some participants can be explained by the paradox of choice [39]. While divergent thinking requires quantity, presenting too many unstructured options can paralyze decision-making. This suggests a need for balancing idea suggestion with effective filtering mechanisms. To mitigate overload, future systems could implement progressive disclosure of ideas: showing a small batch of high-quality, diverse options first, with the ability to load more on demand. Additionally, a more nuanced grouping or clustering of ideas based on thematic similarity (as HAIExplore attempts to do with tags) could help users navigate large idea spaces more effectively without being overwhelmed.
In the refinement stage, participants preferred HAIExplore’s menu-driven, option-based approach over the trial-and-error prompting required by ChatGPT. The system’s ability to maintain consistency while allowing parameter-based tweaks was a key advantage. However, some participants expressed a need for better visualization of abstract parameters and more direct manipulation capabilities.
The struggle with ChatGPT’s refinement process highlights the challenges of Generative AI Literacy [4], especially the difficulty of writing precise prompts [41,58]. HAIExplore addresses this by first acting as a translator, converting high-level goals into low-level parameters, and then as a server, presenting a menu of options that users can select from rather than requiring them to compose new prompts. This aligns with recent work on assisted prompt engineering [14], which argues that users are better at recognizing good options than at recalling technical terms. The preference for HAIExplore supports the shift from “recall-based” interaction (typing prompts from memory) to “recognition-based” interaction (selecting from a curated menu). As a result, users needed fewer refinements with HAIExplore than with ChatGPT to achieve their desired outcomes. The low percentage of refinements made with default options further highlights the limitations of even state-of-the-art AI models in predicting users’ preferences: the AI correctly anticipated the user’s choices (i.e., users kept the default options) only 26.4% of the time, underscoring the importance of keeping users in the loop to select their own preferences. In comparison, ChatGPT directly generates outputs based on its best guess of the user’s intent, which often leads to misalignment and more trial-and-error.
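To make the recognition-based interaction concrete, the sketch below shows hypothetical data structures for a parameter menu with a pre-selected default, plus a helper that computes how often users kept the defaults; these structures are illustrative and not HAIExplore’s actual API.

```python
# Hypothetical sketch of recognition-based refinement menus.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class RefinementParameter:
    name: str                 # e.g. "color palette"
    options: list[str]        # curated values the user can recognize and pick
    default: str              # the model's best guess, pre-selected for the user
    selected: str | None = None

    def choose(self, option: str) -> None:
        assert option in self.options
        self.selected = option

def default_acceptance_rate(params: list[RefinementParameter]) -> float:
    """Fraction of decided parameters where the user kept the model's default guess."""
    decided = [p for p in params if p.selected is not None]
    if not decided:
        return 0.0
    return sum(p.selected == p.default for p in decided) / len(decided)
```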
The desire for direct manipulation echoes a limitation of modern deep learning and generative models, which often fail to support fine-grained control [59]. Text is an inherently low-bandwidth channel for describing spatial and visual changes, which explains why users want to “point to” or “mark” the image directly [56]. Future co-creation systems should move beyond text-only prompting by adopting hybrid interfaces that combine natural language with graphical user interface controls. While text is powerful for broad conceptual changes, it is inefficient for precise adjustments. Integrating the ability to highlight and mark specific image regions for targeted refinement would bridge the gap between the ease of text and the precision of manual editing. To further reduce prompting effort, systems could shift the user’s role from “writing” to “selecting.” By proactively surfacing relevant parameters with valid options, tools can lower the barrier to entry for novice users who lack the technical vocabulary to craft complex prompts from scratch. Furthermore, to address the traceability issue, future versions of HAIExplore should provide visual previews for refinement options. Instead of merely listing “Mosaic Style” as a text option, the system could show a small thumbnail previewing that effect to reduce the risk of users selecting options that do not match their mental model.
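The hybrid text-plus-region refinement described above could be represented roughly as follows; the class names and fields are assumptions for illustration rather than the paper’s design.

```python
# Hypothetical sketch of a hybrid refinement request pairing a text instruction
# with a directly marked image region and a previewable menu option.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Region:
    x: int        # top-left corner, in pixels
    y: int
    width: int
    height: int

@dataclass
class RefinementOption:
    label: str            # e.g. "Mosaic Style"
    preview_path: str     # small thumbnail rendered ahead of time
    prompt_fragment: str  # text the system appends to the generation prompt

@dataclass
class RefinementRequest:
    instruction: str                  # free-form text for broad conceptual changes
    region: Region | None             # marked area for precise, localized edits
    option: RefinementOption | None   # menu choice selected instead of typed

# Example: apply a selected style only to a marked region of the current image.
request = RefinementRequest(
    instruction="make this area look hand-painted",
    region=Region(x=120, y=80, width=200, height=150),
    option=RefinementOption("Mosaic Style", "previews/mosaic.png", "in a mosaic style"),
)
```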
Participants reported significantly higher self-perceived learning with HAIExplore, noting that it expanded their conceptual vocabulary and improved their prompting skills. The structured workflow was also seen as a valuable model for creative endeavors, with several participants expressing interest in applying its principles to future interactions with ChatGPT.
These findings suggest that HAIExplore functions not just as a tool, but as a scaffold for learning. By exposing the “hidden state” of the creative process (e.g., showing the diverse ideas with their definitions and links to the user’s intent, explicitly naming a style, and displaying the refinement prompts; see Fig. 2, 3, and 4), the system supports reflection-in-action [37]. This mirrors findings in educational AI, where exposing the chain of thought helps students understand the process rather than just the answer [28]. The transfer of the “brainstorming-first” strategy to ChatGPT indicates that users can acquire Generative AI literacy through tool usage: they learned the two-stage prompting strategy of ideation followed by refinement simply by experiencing a better workflow. These findings highlight a dual role of co-creation systems: they should be designed for both task performance (helping users create good results now) and user learning (helping users become better creators over time) [19]. While the former goal is often prioritized, the latter is equally important for long-term user satisfaction and development [1,33]. Tools should therefore be transparent by design, turning every interaction into a potential learning opportunity and helping users build the vocabulary and mental models necessary to master generative AI.
Our study has several limitations that should be considered when interpreting the results. First, our sample size was small (N = 12) and relatively homogeneous, consisting of students from the same university. This may limit the generalizability of our findings to broader populations with different backgrounds. Second, we evaluated our paradigm only within the domain of creative image generation. While the principles of divergent and convergent thinking are broadly applicable, the specific impacts of our workflow in other creative domains remain to be verified. Finally, our user study was a single-session evaluation, which prevents us from assessing long-term effects, such as whether the observed learning benefits persist over time or how users’ reliance on scaffolding evolves with extended use.
This paper proposed a structured human-AI co-creation paradigm designed to foster exploration, mitigate design fixation, and support the non-linear nature of creative work. Grounded in Wallas’s model of creativity, this paradigm explicitly separates and scaffolds the cognitive processes of divergent and convergent thinking. We instantiated this approach in HAIExplore, a system with two stages, Brainstorming and Refinement, which encourage exploration and bridge the “gulf of envisioning”. Our user study with 12 participants demonstrated that, compared to a linear chat baseline, this structured workflow significantly improved perceived creativity support, usability, and the novelty of the generated outcomes, while also helping users acquire transferable prompting skills.
Our findings also point to several avenues for future research. First, we observed a tension between automated scaffolds and user agency: while the structured workflow reduced cognitive load, some users felt a diminished sense of agency and were thus more reluctant to claim ownership. Future work should explore interaction mechanisms that better balance scaffolding with user autonomy, such as incorporating adaptive levels of guidance. Second, our evaluation focused only on the image generation domain. To better generalize the results, future studies should investigate this paradigm across different creative domains, such as video production and UI design. Finally, future research could build on our findings about learning effects to explore further mechanisms that enable users to learn seamlessly while using such systems.