"Make it Pop, but not Like That": A Taxonomy of Iterative Prompting Strategies for Refining AI-Generated Web Interfaces

doi:10.21203/rs.3.rs-8994174/v1

"Make it Pop, but not Like That": A Taxonomy of Iterative Prompting Strategies for Refining AI-Generated Web Interfaces

2026 · doi:10.21203/rs.3.rs-8994174/v1

preprint OA: closed



"Make it Pop, but not Like That": A Taxonomy of Iterative Prompting Strategies for Refining AI-Generated Web Interfaces | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article "Make it Pop, but not Like That": A Taxonomy of Iterative Prompting Strategies for Refining AI-Generated Web Interfaces Zhenjiang Song This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-8994174/v1 This work is licensed under a CC BY 4.0 License Status: Under Review Version 1 posted 4 You are reading this latest preprint version Abstract The rapid proliferation of Large Language Models (LLMs) and generative tools (e.g., GPT-4, Tongyi Lingma, Trae) has fundamentally democratized the landscape of web development, shifting the paradigm from manual syntax construction to natural language intent specification. However, while the barrier to "drafting" initial code has been lowered, a significant "Refinement Crisis" has emerged. 
As task complexity scales from static landing pages to dynamic, stateful applications, novice users encounter a profound "Gulf of Evaluation" when attempting to repair AI-generated errors. Unlike text generation, where errors are semantic and visible, web interface generation involves a complex interplay between visual presentation (CSS) and invisible state management (JavaScript). In this paper, we present a large-scale observational study with 200 novice participants tasked with using IDE-integrated AI assistants to build a fully functional CRUD (Create, Read, Update, Delete) note-taking application. Through an analysis of interaction logs and source code snapshots, we reveal that while 90% of users could generate a baseline prototype, 80% encountered severe "invisible state" breakdowns (e.g., data persistence failure), and 50% suffered from persistent layout regressions. We contribute a detailed taxonomy of four repair strategies: Perceptual Refinement, Behavioral Correction, Structural Deixis, and Global Reset. Furthermore, we characterize the "Whack-a-Mole" effect, a phenomenon where repairing visual elements inadvertently corrupts functional logic due to the AI's lack of holistic state awareness. Our findings provide empirical evidence for the limitations of current chat-based coding interfaces and offer design implications for future "State-Aware" AI IDEs that reify invisible data flows to bridge the gap between user intent and execution.

Figures: Figure 1, Figure 2

1. Introduction

1.1 The Paradigm Shift: From Syntax to Intent

The landscape of software engineering is undergoing a seismic shift. For decades, the primary barrier to entry was Syntax: learning the rigid grammar of languages like JavaScript or Python. Today, Large Language Models (LLMs) such as GPT-4, Tongyi Lingma, and Trae have effectively democratized code generation, transitioning the user's role from a "Writer of Code" to an "Architect of Intent."
However, this lowering of the floor has inadvertently raised the ceiling for Validation. In the context of web development, the output of an LLM is not a single linear text but a complex, multi-dimensional artifact comprising Structure (HTML), Presentation (CSS), and Behavior (JavaScript). While a novice user can easily prompt "Make a note-taking app," they often lack the mental model to understand the invisible dependencies between these layers. When the generated application fails (for instance, when a note disappears after a page refresh), the user is thrust into a "Refinement Crisis." They must debug a system they did not write and do not understand, often using vague natural language that the AI misinterprets.

1.2 The Unique Challenge of Web State

Unlike text generation tasks (e.g., writing an email), where errors are semantic and visible (e.g., a wrong tone), errors in web generation are often State-Based and Invisible. Our preliminary observations indicate a critical misalignment:

The Desktop Metaphor: Novice users often transfer their experience from desktop applications (like Word or Excel) to the Web. They assume that if they "edit" a text on screen, it is saved; or if they "open" a new page, the data travels with them.

The Stateless Reality: The Web is inherently stateless. Without explicit instruction to use localStorage or a database, any data entered into the DOM is ephemeral.

This gap creates a profound "Gulf of Evaluation" (Norman, 1986). The user sees the visual interface (the "View") but is blind to the underlying data flow (the "Model"). When they attempt to repair these issues using visual descriptors (e.g., "Make the button red"), they often inadvertently trigger regressive bugs in the logical layer, a phenomenon we term the "Whack-a-Mole" Effect.
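The hand-off that the desktop metaphor takes for granted ("the data travels with them") has to be written explicitly on the Web. The following is a minimal sketch of one common way to do it, not code from the study: the list page persists the notes and puts only a note id in the URL, and the detail page re-reads the note from storage. The page name `detail.html`, the storage key `notes`, the example origin, and all function names are hypothetical, and a small in-memory object stands in for the browser's `localStorage` so the sketch also runs outside a browser.

```javascript
// In-memory stand-in for localStorage (assumption: only getItem/setItem are needed).
function createFakeStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => { data.set(key, String(value)); },
  };
}

const NOTES_KEY = "notes"; // hypothetical storage key

// List page: persist the notes before navigating away.
function saveNotes(storage, notes) {
  storage.setItem(NOTES_KEY, JSON.stringify(notes));
}

// List page: build a link that carries only the note id as a URL parameter.
function buildDetailUrl(baseUrl, noteId) {
  const url = new URL("detail.html", baseUrl); // hypothetical page name
  url.searchParams.set("id", String(noteId));
  return url.toString();
}

// Detail page: JavaScript memory starts empty here, so the note is re-fetched
// from storage using the id found in the URL.
function loadNoteFromUrl(storage, href) {
  const id = new URL(href).searchParams.get("id");
  const notes = JSON.parse(storage.getItem(NOTES_KEY) ?? "[]");
  return notes.find((note) => String(note.id) === id) ?? null;
}

const storage = createFakeStorage();
saveNotes(storage, [{ id: 1, text: "Buy milk" }, { id: 2, text: "Call Li" }]);

const href = buildDetailUrl("https://example.test/app/", 2);
const note = loadNoteFromUrl(storage, href);
console.log(href);
console.log(note);
```

In a real page, `storage` would be `window.localStorage` and `href` would be `window.location.href`; nothing else in the sketch depends on the browser.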

1.3 Research Context and Contributions

To empirically investigate this phenomenon, we orchestrated a large-scale observational study with N = 200 novice developers in a controlled university setting. The participants, who had zero prior coding experience, were tasked with building a fully functional CRUD (Create, Read, Update, Delete) application using AI tools. This paper moves beyond simple success/failure metrics to analyze the process of failure. We contribute: (1) a Taxonomy of Repair Strategies, categorizing how novices linguistically navigate technical breakdowns; (2) the identification of the "Singin.html" Phenomenon, a specific architectural misconception where users attempt to manage state via file creation; and (3) the "Aesthetic Compensation" Theory, explaining why users pivot to visual polishing when functional logic fails.

2. Related Work

2.1 Large Language Models in Software Engineering

Recent advancements in LLMs, exemplified by tools like OpenAI's Codex and StarCoder, have revolutionized code generation (Chen et al., 2021). Studies by Vaithilingam et al. (2022) have shown that these tools significantly improve developer productivity by automating boilerplate code. However, existing research predominantly focuses on algorithmic correctness in backend languages (e.g., Python, Java), where success is defined in binary terms by passing unit tests. Less attention has been paid to Frontend Web Development, where "correctness" is subjective, visual, and highly context-dependent. Furthermore, most studies recruit professional developers. Our work shifts the focus to Novice End-User Programmers, investigating how those without a mental model of "variables" or "functions" navigate the complexities of full-stack application construction.

2.2 The Challenge of End-User Prompting

As generative tools democratize access to coding, "Prompt Engineering" has emerged as a critical barrier. Zamfirescu-Pereira et al.
(CHI '23), in their seminal work "Why Johnny Can't Prompt," highlighted that non-experts often struggle with overgeneralization: assuming that if the AI understands one concept (e.g., "blue button"), it understands the entire context. While their study focused on text-based chatbots, our research extends this to the Visual-Functional domain. In web development, users must describe spatial relationships ("move this to the right") and dynamic behaviors ("update the list after deleting"). These concepts suffer from Spatial Deixis Failure, where natural language lacks the precision to describe layout constraints (e.g., to distinguish margin, padding, and float), leading to unique frustrations not seen in text generation tasks.

2.3 Mental Models of Web Architecture and State

Central to the "Refinement Crisis" is the divergence between the actual architecture of the Web and the user's mental model of it. Mental models are internal representations of how systems work (Gentner & Stevens, 1983). For novice programmers, these models are often built upon the "Desktop Metaphor," where files are discrete, stateful entities that retain information when opened or moved. In the context of Web development, the architecture is inherently Stateless (Fielding, 2000), meaning data does not persist across page navigations without explicit state management (e.g., LocalStorage, Session, or Database). Prior research in end-user programming (Ko & Myers, 2004) identified the "selection barrier," where users struggle to identify the correct technical components to achieve a goal. Our study extends this by investigating how a faulty mental model (viewing the Web as a collection of persistent files rather than a dynamic state machine) leads to catastrophic failures when users attempt to refine multi-page applications like our CRUD note-taking task.
By identifying the specific "Singin.html" misconception, we contribute to the understanding of how AI-assisted coding may inadvertently reinforce these faulty mental models by fulfilling syntactic requests while ignoring semantic gaps.

3. Methodology

3.1 Participant Demographics and Recruitment

We recruited 200 students from a vocational college in Central China, all enrolled in a non-Computer Science course titled "Introduction to Digital Media."

Inclusion Criteria: Participants were screened via a pre-study survey to ensure they had zero prior experience in HTML, CSS, or JavaScript. This "Tabula Rasa" (Blank Slate) condition is crucial for our study, as it ensures that their prompting strategies reflect pure natural language intent rather than "pseudo-code" thinking.

Tool Environment: Participants were divided into two groups, one using Tongyi Lingma and the other using Trae. Both tools are state-of-the-art, Chinese-native AI coding assistants integrated into VS Code, eliminating language barriers as a confounding variable.

3.2 The Task: Why CRUD?

While previous studies (Kim et al., 2022) focused on static styling tasks, we specifically chose a CRUD Note-Taking Application to force interaction with State Management. The task required five specific milestones:

Creation: Input text and add it to a list (tests DOM manipulation).
Persistence: Data must survive a page refresh (tests localStorage understanding).
Read/Sort: Display notes, ideally sorted by time (tests Array logic).
Update: Edit existing note content (the distinct "State Persistence" challenge).
Delete: Remove a note (tests Event Listeners and Array mutation).

This incremental complexity allowed us to observe exactly where the user's mental model diverged from the AI's execution.

3.3 Data Analysis: Qualitative Coding

We collected a corpus of over 3,000 lines of interaction logs. Data analysis followed a Grounded Theory approach (Charmaz, 2006) to identify emergent patterns in user repair strategies.
Open Coding: Two researchers independently tagged log entries with descriptive labels (e.g., "Complaining about style," "Asking for file creation").
Axial Coding: These labels were grouped into broader categories. For instance, commands like "Make it red" and "Move to right" were categorized under "Perceptual Refinement."
Selective Coding: We identified core themes, such as the "Singin.html" phenomenon and "Aesthetic Compensation," by analyzing the temporal sequence of prompts (e.g., identifying that aesthetic requests frequently spike following a functional failure).

4. Results: The Anatomy of Breakdown

4.1 Quantitative Overview: Success vs. Struggle

Among the N = 200 participants, the majority demonstrated the capability to build functional web applications using AI tools, though success levels varied significantly.

Completion Rates: 65% (n = 130) of participants successfully implemented the full "CRUD + Search" functionality. Another 25% (n = 50) completed the basic "MVP" features (Create/Read/Delete) but failed to implement advanced logic like Search or Update. The remaining 10% (n = 20) abandoned the task after repeated failures (Global Reset).

Interaction Effort: Participants engaged in an average of 15 interaction turns (SD = 4.2). This high number indicates that "one-shot" generation is a myth for complex apps.

The "Invisible State" Barrier: The most pervasive breakdown was the "Data Persistence" issue. A staggering 80% (n = 160) of participants encountered data loss upon page refresh. Original logs show a repetitive pattern of complaints: "After I add a note and refresh, it disappears." This reveals a widespread failure in the AI's default strategy, which prioritizes DOM manipulation over localStorage implementation.

The "Visual Fidelity" Gap: Aesthetic and layout issues affected 50% of users. Common complaints involved "ugly buttons," "misalignment," and "height collapse," triggering extensive visual repair loops.
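The "Invisible State" barrier above comes down to where the notes array lives. As a rough illustration (a sketch under stated assumptions, not the code the AI tools generated), the two variants below differ only in whether each change is written through to storage. A "page refresh" is modelled by constructing the app object again; `createFakeStorage`, `createVolatileNotes`, `createPersistentNotes`, and the `notes` key are all hypothetical names, and the fake storage mimics only the `getItem`/`setItem` part of `localStorage`.

```javascript
// In-memory stand-in for localStorage (assumption: only getItem/setItem are needed).
function createFakeStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => { data.set(key, String(value)); },
  };
}

// Variant A: state lives only in a JavaScript array, as in a DOM-only app.
function createVolatileNotes() {
  const notes = [];
  return {
    add: (text) => { notes.push({ text }); },
    list: () => notes.slice(),
  };
}

// Variant B: state is loaded from storage on start and saved after each change.
function createPersistentNotes(storage, key = "notes") {
  const notes = JSON.parse(storage.getItem(key) ?? "[]");
  const persist = () => storage.setItem(key, JSON.stringify(notes));
  return {
    add: (text) => { notes.push({ text }); persist(); },
    remove: (index) => { notes.splice(index, 1); persist(); },
    list: () => notes.slice(),
  };
}

const storage = createFakeStorage();

let volatileApp = createVolatileNotes();
volatileApp.add("first note");
volatileApp = createVolatileNotes(); // simulated refresh: memory is wiped
const volatileAfterRefresh = volatileApp.list();

let persistentApp = createPersistentNotes(storage);
persistentApp.add("first note");
persistentApp = createPersistentNotes(storage); // simulated refresh: state is reloaded
const persistentAfterRefresh = persistentApp.list();

console.log(volatileAfterRefresh.length, persistentAfterRefresh.length); // 0 1
```

Both variants look identical on screen until the refresh, which is why the symptom reported in the logs gives the user no hint about the cause.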

4.2 A Taxonomy of Repair Strategies

Through qualitative coding of the 3,000+ prompt lines, we identified four distinct patterns of repair strategies that novices employ when the AI output fails to meet expectations.

Type I: Perceptual Refinement (The Aesthetic Overload). Users lacking CSS vocabulary rely on cultural metaphors.
Evidence: Participant Li: "The delete button... Change it to rainbow colors, Hello Kitty button shape."
Analysis: This reveals a Semantic Gap where AI interprets "concise" as minimalist code, while users intend visual elegance.

Type II: Behavioral Correction (The State-Blindness Struggle). Users focus on the disconnect between data and view.
Evidence: Participant Gao: "Clicking modify should not make it disappear... cover the old note after modification."
Analysis: Users understand the symptom (data loss) but not the mechanism (LocalStorage), leading to repetitive storytelling prompts.

Type III: Structural Deixis (Layout Struggles). Users struggle with layout stability using spatial language.
Evidence: Participant Zhang: "Center the pagination module... change width to 1200px."
Analysis: The recurrence of specific pixel values suggests users pivot to Hard-coding Constraints when vague descriptions fail.

Type IV: Global Reset (The Abandonment).
Evidence: "Rewrite everything" or "Forget the previous code" appeared in 10% of sessions.
Analysis: This indicates a low tolerance for Cognitive Debt.

4.3 The "Singin.html" Misconception: Architectural Mismatch

A pervasive phenomenon observed in the second batch of logs (e.g., participants P12, P35, P42) was the explicit request to "Create a file named singin.html for note details." Interviews and prompt analysis reveal that singin is not a typo for "Sign In," but a novice shorthand for a "Single Note View."

The User's Model (File-Based): Novices view a web app like a file system.
They believe that clicking a note "opens" a specific file (singin.html), carrying the content with it naturally, much like opening a .docx file from a folder.

The Web's Reality (Stateless): The Web is inherently stateless. When the browser navigates from index.html to singin.html, the JavaScript memory (RAM) containing the note array is wiped.

The AI's Failure: The AI faithfully executes the syntax () but fails to implement the invisible glue code (URL Parameters or localStorage) required to re-fetch the data.

The Outcome: Users reported verbatim: "Clicking the note jumps to the new page, but the content is gone/empty." This is not a bug in code syntax, but a failure in State Persistence, causing a "Silent Crash."

4.4 Granular Analysis of CRUD Friction

By decomposing the task into granular operations, we found that user struggles are not uniform.

4.4.1 Create: The "Enter Key" and Timestamp Anxiety

The "WeChat" Habit: Many users (e.g., Chang Xinyi) requested "Press Enter to send note." The AI often added a keydown listener but failed to prevent the default behavior (newline), resulting in bugs where a blank note was added simultaneously.

Timestamp Formatting: Users like Participant Wang were obsessed with "Add recording time." However, simply appending new Date() resulted in ugly strings. Users spent 5–10 turns just refining this string, treating the AI as a string formatter.

4.4.2 Read: The Pagination Paradox

The Paradox: Despite dealing with small datasets (< 10 notes), a disproportionate number of users (e.g., Participant P5) demanded "Pagination."

The Logic Conflict: AI often implemented pagination using slice(), but when users simultaneously asked for "Newest notes on top" (Sorting), the AI used reverse(). These two logic blocks frequently conflicted, leading to users seeing the oldest notes on Page 1.

4.4.3 Update: The Navigation vs. Modal Conflict

Navigation Model: As noted in 4.3, users often requested a jump to singin.html.
The Saving Loop: When users realized data was lost on jump, they tried to pivot to an Inline Edit model. However, prompts like "Make the document modifiable" led the AI to simply add contenteditable="true" without any save logic, resulting in "Phantom Edits" that vanished on refresh.

4.4.4 Delete: The "Spatial Deixis" of the Red Button

Deixis Failure: The "Delete" function generated the most visual-specific prompts. Users treated the button as a physical object. Participant Liu instructed: "Move delete button to the right... make it red." The AI often struggled to interpret "Right" (Float vs. Flexbox), causing layout shifts where adding a timestamp pushed the button off-screen.

4.5 The "Whack-a-Mole" Effect: Systemic Regression Cycles

A unique and critical finding in our study is the "Whack-a-Mole" phenomenon: a regression cycle where repairing one feature leads to the breaking of another.

4.5.1 Case Study A: The "Modify" Button Loop (Participant Zhang)

The Conflict: The AI successfully generated the logic (onclick="editNote()") initially. However, when the user prompted "Put them close together" (Visual Constraint), the AI rewrote the HTML structure, wrapping the buttons in a new container element but "forgetting" to re-attach the event listener.

The Outcome: The button looked perfect but became unclickable. The user assumed the feature was broken and deleted it entirely.

4.5.2 Case Study B: The "Delete" Button Layout Crash (Participant Wang)

Step 1 (Logic Success): The user successfully implemented the delete function.
Step 2 (Visual Prompt): User prompted: "Add a timestamp."
Step 3 (Regression): The AI inserted the timestamp string into the note container. Due to lack of CSS space management, the long string pushed the "Delete" button out of the visible area.
User Reaction: "My delete button is gone!" The user assumed the logic was deleted, whereas it was merely visually displaced.
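The regression in Case Study A is mechanical enough to detect automatically. As a minimal sketch (an illustration, not a tool used in the study), the check below compares the inline event handlers present before and after a visual edit and reports any that were dropped. The function names and both markup snippets are hypothetical, and the regular expression assumes handlers are written as double-quoted inline attributes; a real tool would parse the DOM instead.

```javascript
// Collect inline handlers such as onclick="editNote()" from an HTML string.
// Assumption: handlers appear as double-quoted inline attributes.
function extractInlineHandlers(html) {
  const pattern = /\bon[a-z]+\s*=\s*"([^"]*)"/gi;
  const handlers = new Set();
  for (const match of html.matchAll(pattern)) {
    handlers.add(match[1].trim());
  }
  return handlers;
}

// Return the handlers that exist in the old markup but not in the new one.
function findDroppedHandlers(beforeHtml, afterHtml) {
  const after = extractInlineHandlers(afterHtml);
  return [...extractInlineHandlers(beforeHtml)].filter((h) => !after.has(h));
}

// Hypothetical markup before the visual prompt.
const before = `
  <li>Note text
    <button onclick="editNote()">Modify</button>
    <button onclick="deleteNote()">Delete</button>
  </li>`;

// Hypothetical regenerated markup after "Put them close together":
// the buttons are wrapped in a new element and one handler is lost.
const after = `
  <li>Note text
    <div class="actions">
      <button>Modify</button>
      <button onclick="deleteNote()">Delete</button>
    </div>
  </li>`;

const dropped = findDroppedHandlers(before, after);
console.log(dropped); // contains only "editNote()"
```

A check of this kind would not catch Case Study B, where the handler survives and the button is only pushed out of view; that regression lives in the layout rather than in the markup's logic.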

4.6 The Emotional Trajectory: Aesthetic Compensation

The logs reveal a distinct emotional arc, culminating in a phenomenon we term "Aesthetic Compensation."

Phase 1: The "Magic" Honeymoon (0–10 Minutes)
Evidence: Participant Xue wrote: "Very interesting! I learned new knowledge from the errors, happy!"
Analysis: The "Gulf of Execution" feels non-existent. The AI's ability to produce something creates an illusion of mastery.

Phase 2: The Friction Plateau (10–30 Minutes)
Evidence: Participant Pei scolded: "What you generated is wrong, regenerate it." Participant Jia despaired: "The website crashed halfway through."
Analysis: This marks the onset of the Refinement Crisis. The "Singin.html" errors and CRUD frictions accumulate here.

Phase 3: Resignation and "Aesthetic Compensation" (30+ Minutes)
Evidence: Unable to fix complex logic (like "Save Edit"), users pivoted entirely to superficial beautification. Participant P3 requested "Change theme to pink," and Participant P6 asked for "Add cloud images" specifically after failing logic tasks.
Theoretical Implication: Users unconsciously attempt to compensate for the lack of functional utility (broken logic) by maximizing hedonic value (visuals). By making the broken site look "professional" or "cute," they regain a sense of closure.

5. Discussion

5.1 Bridging the Gulf of Execution: From "What Code" to "What Happened"

Our results reveal a critical gap in current AI-assisted development: the asymmetry between Static Code Generation and Dynamic State Management. When participants encountered a logical bug (e.g., "the delete button doesn't work"), they were forced to debug via natural language guessing ("Is the code wrong?"). We argue that for novice users, code is not the feedback; behavior is. The AI should not just output a corrected code block; it must verify and visualize the consequence of that code.
The current paradigm of "Chat-to-Code" is insufficient for dynamic applications; we need a paradigm shift to "Chat-to-Behavior."

5.2 Design Implication I: The Logic Inspector

Based on the "State-Blindness" strategy identified in Section 4.2, we propose that future AI coding tools should embrace "Text-to-Behavior Visualization." Specifically for logical operations (CRUD), the IDE should implement a "Logic Inspector" that reifies invisible data flows.

Scenario: When a user reports "Can't Delete," instead of just regenerating the JavaScript, the AI should present a visual state diff: User Clicked Delete -> Array Updated ✅ -> View Updated ❌.

Impact: By explicitly visualizing where the logic broke (Data vs. View), the AI shifts the user's role from a "blind guesser" describing symptoms to an "informed verifier" identifying root causes. This effectively bridges the Gulf of Evaluation.

5.3 Design Implication II: Logic Guardrails against "Whack-a-Mole"

To mitigate the regression cycles described in Section 4.5 (The "Whack-a-Mole" Effect), we suggest a "Logic-Style Separation" mechanism. Current LLMs often regenerate the entire file based on a new prompt, introducing the risk of overwriting existing logic.

The Problem: When a user prompts "Make the button pink," the AI rewrites the HTML element. In doing so, it might accidentally drop the onclick="deleteNote()" attribute or change the class name that the JavaScript was listening to.

The Solution: Future IDEs should allow users to "Pin" or "Lock" verified functional blocks. When a user prompts for a visual change, the AI should be technically constrained to only regenerate the presentation layer (CSS) while preserving the locked logical integrity (JS/HTML structure). This feature would serve as a "Guardrail" for novice developers, giving them the confidence to iterate on aesthetics without fear of breaking functionality.
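The state diff proposed in Section 5.2 can be approximated with very little machinery. The sketch below is one possible reading of the "Logic Inspector" idea, not a description of an existing IDE feature: it runs a delete against an app object and reports, step by step, whether the data and the view changed. The `app` shape (`notes`, `rendered`, `deleteNote`), the function names, and the use of a plain array as a stand-in for the rendered list are all assumptions made for illustration, and the view check compares lengths only.

```javascript
// Hypothetical inspector: perform a delete and report which layer changed.
function inspectDelete(app, index) {
  const modelCountBefore = app.notes.length;
  app.deleteNote(index);
  const steps = [
    { step: "User clicked Delete", ok: true },
    { step: "Array updated", ok: app.notes.length === modelCountBefore - 1 },
    // Simplification: the view counts as updated when it shows as many items as the model holds.
    { step: "View updated", ok: app.rendered.length === app.notes.length },
  ];
  return steps.map(({ step, ok }) => `${step} ${ok ? "✅" : "❌"}`).join(" -> ");
}

// A buggy app: the model is mutated but the view is never re-rendered.
function createBuggyApp(initial) {
  const app = {
    notes: [...initial],
    rendered: [...initial], // stand-in for the list items shown on screen
    deleteNote(i) { app.notes.splice(i, 1); },
  };
  return app;
}

// A fixed app: the view is rebuilt from the model after each mutation.
function createFixedApp(initial) {
  const app = createBuggyApp(initial);
  app.deleteNote = (i) => {
    app.notes.splice(i, 1);
    app.rendered = [...app.notes];
  };
  return app;
}

const buggyReport = inspectDelete(createBuggyApp(["a", "b", "c"]), 1);
const fixedReport = inspectDelete(createFixedApp(["a", "b", "c"]), 1);
console.log(buggyReport);
console.log(fixedReport);
```

The point of the report is that it names the layer that failed, so the user's next prompt can say "the list is not redrawn after deleting" rather than "delete doesn't work."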

5.4 Theoretical Implication: The Trap of "High-Fidelity" Hallucination

The "Singin.html" phenomenon and the "Aesthetic Compensation" behavior point to a deeper issue: High-Fidelity Hallucination. Traditional prototyping tools (like Balsamiq) are intentionally low-fidelity to force focus on logic. AI tools, however, generate High-Fidelity UI (polished CSS, real buttons) from the very first prompt. This creates a dangerous illusion of completeness. The user sees a "Sign In" button and assumes the "Sign In" logic exists. The user sees a "Detail Page" (singin.html) and assumes data transfer logic exists. This visual realism masks the logical emptiness underneath. Future AI tools must consider "Progressive Fidelity," perhaps rendering new features as "Wireframes" first to force the user to verify the logic before applying the "Pink Theme."

5.5 Limitations and Future Work

Our study is subject to several limitations that outline directions for future research.

Participant Demographics: Our participants were vocational college students (novices). Professional developers might employ different prompting strategies (e.g., pasting specific API documentation or error codes directly), which could lead to a different taxonomy of repair strategies.

Tool Specificity: We used Tongyi Lingma and Trae. While these represent state-of-the-art IDE agents, other models (like GPT-4o or Claude 3.5 Sonnet) might have lower hallucination rates. However, we argue that the fundamental "Logic-Visual Gap" is an epistemological problem regarding how users articulate intent, not just a model performance problem, and is likely to persist across platforms.

Short-term Study: The study focused on a single session. We do not know if users would learn to prompt better over time (longitudinal effects). Future work should investigate how these strategies evolve as novices gain experience and whether the "Whack-a-Mole" effect diminishes with practice.

6. Conclusion

In this large-scale study of 200 novice developers, we demonstrated that while AI can easily generate the syntax of a web application, it struggles to help users manage the semantics of application state. We identified four key repair strategies (Perceptual, Behavioral, Structural, and Reset) and characterized the "Whack-a-Mole" phenomenon as a critical friction point in multi-modal generation. The "Refinement Crisis" we observed suggests that the future of AI programming tools must evolve from "Code Generators" to "Logic Visualizers." Only by reifying the invisible logic and providing guardrails against regression can we truly empower end-users to see not just the code they wrote, but the reality they created.

Declarations

Ethics declaration: This study involved human participants and was approved by the Institutional Research Committee of Anyang Preschool Teachers College.
Ethics approval: This study involved human participants (200 students) and was conducted in accordance with the ethical standards of the institutional research committee at Anyang Preschool Teachers College.
Consent to participate: Informed consent was obtained from all individual participants included in the study.
Consent to publish: The authors affirm that human research participants provided informed consent for the publication of the research findings and any accompanying images in an anonymized format.
Competing interests: The authors have no relevant financial or non-financial interests to disclose.
Funding: The authors received no specific funding for this research. The study was conducted as part of the authors' educational and research activities at Anyang Preschool Teachers College.
Author Contribution: A wrote the main manuscript text.

References

Zamfirescu-Pereira JD, Wong RY, Hartmann B, Yang Q. (2023). Why Johnny Can't Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23).
Kim T, Bragg J, Chilton LB. (2022). Stylette: Styling the Web with Natural Language. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22).
Ko AJ, Myers BA. (2004). Six Learning Barriers in End-User Programming Systems. Proceedings of the 2004 IEEE Symposium on Visual Languages - Human Centric Computing (VL/HCC '04).
Vaithilingam P, Zhang T, Glassman EL. (2022). Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22).
Chen M, Tworek J, Jun H et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
Norman DA. (1986). Cognitive Engineering. User Centered System Design: New Perspectives on Human-Computer Interaction.
Barke L, James MB, Polikarpova N. (2023). Grounding Copilots in Human Interaction. Proceedings of the ACM on Programming Languages (OOPSLA '23).
Charmaz K. (2006). Constructing Grounded Theory: A Practical Guide through Qualitative Analysis. Sage.
Gentner D, Stevens AL. (1983). Mental Models. Psychology.
Fielding RT. (2000). Architectural Styles and the Design of Network-based Software Architectures. Ph.D. Dissertation, University of California, Irvine.

Additional Declarations: No competing interests reported.
Supplementary Files: Appendices.pdf

Status: Under Review. Version 1 posted.
Editorial decision: Revision requested, 05 Mar, 2026
Editor assigned by journal: 28 Feb, 2026
Submission checks completed at journal: 28 Feb, 2026
First submitted to journal: 28 Feb, 2026
We contribute:\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eA \u003cb\u003eTaxonomy of Repair Strategies\u003c/b\u003e, categorizing how novices linguistically navigate technical breakdowns.\u003c/p\u003e\u003cp\u003eThe identification of the \u003cb\u003e\"Singin.html\" Phenomenon\u003c/b\u003e, a specific architectural misconception where users attempt to manage state via file creation.\u003c/p\u003e\u003cp\u003eThe \u003cb\u003e\"Aesthetic Compensation\" Theory\u003c/b\u003e, explaining why users pivot to visual polishing when functional logic fails.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e"},{"header":"2. Related Work","content":"\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Large Language Models in Software Engineering\u003c/h2\u003e \u003cp\u003eRecent advancements in LLMs, exemplified by tools like OpenAI's Codex and StarCoder, have revolutionized code generation (Chen et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). Studies by Vaithilingam et al. (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) have shown that these tools significantly improve developer productivity by automating boilerplate code. However, existing research predominantly focuses on \u003cb\u003ealgorithmic correctness\u003c/b\u003e in backend languages (e.g., Python, Java), where success is binarily defined by passing unit tests. Less attention has been paid to \u003cb\u003eFrontend Web Development\u003c/b\u003e, where \"correctness\" is subjective, visual, and highly context-dependent. Furthermore, most studies recruit professional developers. 
Our work shifts the focus to \u003cb\u003eNovice End-User Programmers\u003c/b\u003e, investigating how those without a mental model of \"variables\" or \"functions\" navigate the complexities of full-stack application construction.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.2 The Challenge of End-User Prompting\u003c/h2\u003e \u003cp\u003eAs generative tools democratize access to coding, \"Prompt Engineering\" has emerged as a critical barrier. Zamfirescu-Pereira et al. (CHI '23) in their seminal work \u003cem\u003e\"Why Johnny Can't Prompt\"\u003c/em\u003e highlighted that non-experts often struggle with \u003cb\u003eovergeneralization\u003c/b\u003e\u0026mdash;assuming that if the AI understands one concept (e.g., \"blue button\"), it understands the entire context. While their study focused on text-based chatbots, our research extends this to the \u003cb\u003eVisual-Functional domain\u003c/b\u003e. In web development, users must describe spatial relationships (\"move this to the right\") and dynamic behaviors (\"update the list after deleting\"). These concepts suffer from \u003cb\u003eSpatial Deixis Failure\u003c/b\u003e, where natural language lacks the precision to describe layout constraints (e.g., whether \"to the right\" calls for a margin, padding, or float adjustment), leading to unique frustrations not seen in text generation tasks.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Mental Models of Web Architecture and State\u003c/h2\u003e \u003cp\u003eCentral to the \"Refinement Crisis\" is the divergence between the actual architecture of the Web and the user\u0026rsquo;s mental model of it. Mental models are internal representations of how systems work (Gentner \u0026amp; Stevens, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e1983\u003c/span\u003e). 
For novice programmers, these models are often built upon the \u003cb\u003e\"Desktop Metaphor,\"\u003c/b\u003e where files are discrete, stateful entities that retain information when opened or moved. In the context of Web development, the architecture is inherently \u003cb\u003eStateless\u003c/b\u003e (Fielding, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2000\u003c/span\u003e), meaning data does not persist across page navigations without explicit state management (e.g., localStorage, a session, or a database). Prior research in end-user programming (Ko \u0026amp; Myers, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2004\u003c/span\u003e) identified the \"selection barrier,\" where users struggle to identify the correct technical components to achieve a goal. Our study extends this by investigating how a faulty mental model\u0026mdash;viewing the Web as a collection of persistent files rather than a dynamic state machine\u0026mdash;leads to catastrophic failures when users attempt to refine multi-page applications like our CRUD note-taking task. By identifying the specific \"Singin.html\" misconception, we contribute to the understanding of how AI-assisted coding may inadvertently reinforce these faulty mental models by fulfilling syntactic requests while ignoring semantic gaps.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Methodology","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Participant Demographics and Recruitment\u003c/h2\u003e \u003cp\u003eWe recruited 200 students from a vocational college in Central China, all enrolled in a non-Computer Science course titled \"Introduction to Digital Media.\"\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eInclusion Criteria\u003c/strong\u003e \u003cp\u003eParticipants were screened via a pre-study survey to ensure they had \u003cb\u003ezero prior experience\u003c/b\u003e in HTML, CSS, or JavaScript. 
This \"Tabula Rasa\" (Blank Slate) condition is crucial for our study, as it ensures that their prompting strategies reflect pure natural language intent rather than \"pseudo-code\" thinking.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eTool Environment\u003c/strong\u003e \u003cp\u003eParticipants were divided into two groups using \u003cb\u003eTongyi Lingma\u003c/b\u003e and \u003cb\u003eTrae\u003c/b\u003e. Both tools are state-of-the-art, Chinese-native AI coding assistants integrated into VS Code, eliminating language barriers as a confounding variable.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.2 The Task: Why CRUD?\u003c/h2\u003e \u003cp\u003eWhile previous studies (Kim et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) focused on static styling tasks, we specifically chose a \u003cb\u003eCRUD Note-Taking Application\u003c/b\u003e to force interaction with \u003cb\u003eState Management\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv description=\"3-1\" class=\"Drawing\" id=\"1\" name=\"图片 1\"\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe task required five specific milestones:\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eCreation\u003c/strong\u003e \u003cp\u003eInput text and add it to a list (Tests DOM manipulation).\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003ePersistence\u003c/strong\u003e \u003cp\u003eData must survive a page refresh (Tests localStorage understanding).\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eRead/Sort\u003c/strong\u003e \u003cp\u003eDisplay notes, ideally sorted by time (Tests Array logic).\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eUpdate\u003c/strong\u003e \u003cp\u003eEdit existing note content (The distinct \"State Persistence\" challenge).\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eDelete\u003c/strong\u003e \u003cp\u003eRemove a 
note (Tests Event Listeners and Array mutation).\u003c/p\u003e \u003c/p\u003e \u003cp\u003eThis incremental complexity allowed us to observe exactly \u003cem\u003ewhere\u003c/em\u003e the user's mental model diverged from the AI's execution.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Data Analysis: Qualitative Coding\u003c/h2\u003e \u003cp\u003eWe collected a corpus of over \u003cb\u003e3,000 lines of interaction logs\u003c/b\u003e. Data analysis followed a \u003cb\u003eGrounded Theory\u003c/b\u003e approach (Charmaz, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2006\u003c/span\u003e) to identify emergent patterns in user repair strategies.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eOpen Coding\u003c/strong\u003e \u003cp\u003eTwo researchers independently tagged log entries with descriptive labels (e.g., \"Complaining about style,\" \"Asking for file creation\").\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAxial Coding\u003c/strong\u003e \u003cp\u003eThese labels were grouped into broader categories. For instance, commands like \u003cem\u003e\"Make it red\"\u003c/em\u003e and \u003cem\u003e\"Move to right\"\u003c/em\u003e were categorized under \u003cb\u003e\"Perceptual Refinement.\"\u003c/b\u003e\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eSelective Coding\u003c/strong\u003e \u003cp\u003eWe identified core themes\u0026mdash;such as the \u003cb\u003e\"Singin.html\" phenomenon\u003c/b\u003e and \u003cb\u003e\"Aesthetic Compensation\"\u003c/b\u003e\u0026mdash;by analyzing the temporal sequence of prompts (e.g., identifying that aesthetic requests frequently spike following a functional failure).\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Results: The Anatomy of Breakdown","content":"\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Quantitative Overview: Success vs. 
Struggle\u003c/h2\u003e \u003cp\u003eAmong the N\u0026thinsp;=\u0026thinsp;200 participants, the majority demonstrated the capability to build functional web applications using AI tools, though success levels varied significantly.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eCompletion Rates: 65% (n\u0026thinsp;=\u0026thinsp;130)\u003c/b\u003e of participants successfully implemented the full \"CRUD\u0026thinsp;+\u0026thinsp;Search\" functionality. Another \u003cb\u003e25% (n\u0026thinsp;=\u0026thinsp;50)\u003c/b\u003e completed the basic \"MVP\" features (Create/Read/Delete) but failed to implement advanced logic like Search or Update. The remaining \u003cb\u003e10% (n\u0026thinsp;=\u0026thinsp;20)\u003c/b\u003e abandoned the task after repeated failures (Global Reset).\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eInteraction Effort\u003c/strong\u003e \u003cp\u003eParticipants engaged in an average of \u003cb\u003e15 interaction turns\u003c/b\u003e (SD\u0026thinsp;=\u0026thinsp;4.2). This high number suggests that \"one-shot\" generation is unrealistic for applications of this complexity.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cb\u003eThe \"Invisible State\" Barrier\u003c/b\u003e: The most pervasive breakdown was the \"Data Persistence\" issue. A staggering \u003cb\u003e80% (n\u0026thinsp;=\u0026thinsp;160)\u003c/b\u003e of participants encountered data loss upon page refresh. Original logs show a repetitive pattern of complaints: \u003cem\u003e\"After I add a note and refresh, it disappears.\"\u003c/em\u003e This reveals a widespread failure in the AI's default strategy, which prioritizes DOM manipulation over localStorage implementation.\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe \"Visual Fidelity\" Gap\u003c/strong\u003e \u003cp\u003eAesthetic and layout issues affected \u003cb\u003e50%\u003c/b\u003e of users. 
Common complaints involved \"ugly buttons,\" \"misalignment,\" and \"height collapse,\" triggering extensive visual repair loops.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e4.2 A Taxonomy of Repair Strategies\u003c/h2\u003e \u003cp\u003eThrough qualitative coding of the 3,000\u0026thinsp;+\u0026thinsp;prompt lines, we identified four distinct patterns of repair strategies that novices employ when the AI output fails to meet expectations.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eType I: Perceptual Refinement (The Aesthetic Overload)\u003c/b\u003e: Users lacking CSS vocabulary rely on cultural metaphors.\u003c/p\u003e\u003cp\u003e \u003cem\u003eEvidence\u003c/em\u003e: Participant Li: \u003cem\u003e\"The delete button... Change it to rainbow colors, Hello Kitty button shape.\"\u003c/em\u003e\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAnalysis\u003c/strong\u003e \u003cp\u003eThis reveals a \u003cb\u003eSemantic Gap\u003c/b\u003e where AI interprets \"concise\" as minimalist code, while users intend visual elegance.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cb\u003eType II: Behavioral Correction (The State-Blindness Struggle)\u003c/b\u003e: Users focus on the disconnect between data and view.\u003c/p\u003e \u003cp\u003e \u003cem\u003eEvidence\u003c/em\u003e: Participant Gao: \u003cem\u003e\"Clicking modify should not make it disappear... 
cover the old note after modification.\"\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAnalysis\u003c/strong\u003e \u003cp\u003eUsers understand the \u003cb\u003esymptom\u003c/b\u003e (data loss) but not the \u003cb\u003emechanism\u003c/b\u003e (localStorage), leading to repetitive storytelling prompts.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cb\u003eType III: Structural Deixis (Layout Struggles)\u003c/b\u003e: Users struggle with layout stability using spatial language.\u003c/p\u003e \u003cp\u003e \u003cem\u003eEvidence\u003c/em\u003e: Participant Zhang: \u003cem\u003e\"Center the pagination module... change width to 1200px.\"\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAnalysis\u003c/strong\u003e \u003cp\u003eThe recurrence of specific pixel values suggests users pivot to \u003cb\u003eHard-coding Constraints\u003c/b\u003e when vague descriptions fail.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cb\u003eType IV: Global Reset (The Abandonment)\u003c/b\u003e:\u003c/p\u003e \u003cp\u003e \u003cem\u003eEvidence: \"Rewrite everything\"\u003c/em\u003e or \u003cem\u003e\"Forget the previous code\"\u003c/em\u003e appeared in 10% of sessions.\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAnalysis\u003c/strong\u003e \u003cp\u003eThis indicates a low tolerance for \u003cb\u003eCognitive Debt\u003c/b\u003e.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e4.3 The \"Singin.html\" Misconception: Architectural Mismatch\u003c/h2\u003e \u003cp\u003eA pervasive phenomenon observed in the second batch of logs (e.g., participants \u003cb\u003eP12, P35, P42\u003c/b\u003e) was the explicit request to \u003cem\u003e\"Create a file named singin.html for note details.\"\u003c/em\u003e Interviews 
and prompt analysis reveal that singin is not a typo for \"Sign In,\" but a novice shorthand for a \u003cb\u003e\"Single Note View.\"\u003c/b\u003e\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe User's Model (File-Based)\u003c/strong\u003e \u003cp\u003eNovices view a web app like a file system. They believe that clicking a note \"opens\" a specific file (singin.html), carrying the content with it naturally, much like opening a .docx file from a folder.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe Web's Reality (Stateless)\u003c/strong\u003e \u003cp\u003eThe Web is inherently stateless. When the browser navigates from index.html to singin.html, the in-memory JavaScript state holding the note array is discarded.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe AI's Failure\u003c/strong\u003e \u003cp\u003eThe AI faithfully executes the \u003cem\u003esyntax\u003c/em\u003e (\u0026lt;a href=\"singin.html\"\u0026gt;) but fails to implement the invisible \u003cem\u003eglue code\u003c/em\u003e (URL parameters or localStorage) required to re-fetch the data on the new page.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cb\u003eThe Outcome\u003c/b\u003e: Users reported verbatim: \u003cem\u003e\"Clicking the note jumps to the new page, but the content is gone/empty.\"\u003c/em\u003e This is not a bug in code syntax, but a failure in \u003cb\u003eState Persistence\u003c/b\u003e, causing a \"Silent Crash.\"\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Granular Analysis of CRUD Friction\u003c/h2\u003e \u003cp\u003eBy decomposing the task into granular operations, we found that user struggles are not uniform.\u003c/p\u003e \u003cdiv id=\"Sec18\" class=\"Section3\"\u003e \u003ch2\u003e4.4.1 Create: The \"Enter Key\" and Timestamp Anxiety:\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eThe \"WeChat\" 
Habit\u003c/strong\u003e \u003cp\u003eMany users (e.g., \u003cb\u003eChang Xinyi\u003c/b\u003e) requested \u003cem\u003e\"Press Enter to send note.\"\u003c/em\u003e The AI often added a keydown listener but failed to prevent the default behavior (newline), resulting in bugs where a blank note was added simultaneously.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eTimestamp Formatting\u003c/strong\u003e \u003cp\u003eUsers like \u003cb\u003eParticipant Wang\u003c/b\u003e were obsessed with \u003cem\u003e\"Add recording time.\"\u003c/em\u003e However, simply appending new Date() resulted in ugly strings. Users spent 5\u0026ndash;10 turns just refining this string, treating the AI as a string formatter.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section3\"\u003e \u003ch2\u003e4.4.2 Read: The Pagination Paradox:\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eThe Paradox\u003c/strong\u003e \u003cp\u003eDespite dealing with small datasets (\u0026lt;\u0026thinsp;10 notes), a disproportionate number of users (e.g., \u003cb\u003eParticipant P5\u003c/b\u003e) demanded \u003cb\u003e\"Pagination\"\u003c/b\u003e.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe Logic Conflict\u003c/strong\u003e \u003cp\u003eThe AI often implemented pagination using slice(), but when users simultaneously asked for \u003cem\u003e\"Newest notes on top\"\u003c/em\u003e (Sorting), the AI used reverse(). These two logic blocks frequently conflicted, leading to users seeing the \u003cem\u003eoldest\u003c/em\u003e notes on Page 1.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section3\"\u003e \u003ch2\u003e4.4.3 Update: The Navigation vs. 
Modal Conflict:\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eNavigation Model\u003c/strong\u003e \u003cp\u003eAs noted in 4.3, users often requested a jump to singin.html.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe Saving Loop\u003c/strong\u003e \u003cp\u003eWhen users realized data was lost on jump, they tried to pivot to an \u003cb\u003eInline Edit\u003c/b\u003e model. However, prompts like \u003cem\u003e\"Make the document modifiable\"\u003c/em\u003e led the AI to simply add contenteditable=\"true\" without any logic to save the edited text, resulting in \"Phantom Edits\" that vanished on refresh.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section3\"\u003e \u003ch2\u003e4.4.4 Delete: The \"Spatial Deixis\" of the Red Button:\u003c/h2\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cem\u003eDeixis Failure\u003c/em\u003e: The \"Delete\" function generated the most visual-specific prompts. Users treated the button as a physical object. \u003cb\u003eParticipant Liu\u003c/b\u003e instructed: \u003cem\u003e\"Move delete button to the right... 
make it red.\"\u003c/em\u003e The AI often struggled to interpret \"Right\" (Float vs Flexbox), causing layout shifts where adding a timestamp pushed the button off-screen.\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e4.5 The \"Whack-a-Mole\" Effect: Systemic Regression Cycles\u003c/h2\u003e \u003cp\u003eA unique and critical finding in our study is the \u003cb\u003e\"Whack-a-Mole\" phenomenon\u003c/b\u003e\u0026mdash;a regression cycle where repairing one feature leads to the breaking of another.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003e4.5.1 Case Study A: The \"Modify\" Button Loop (Participant Zhang):\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eThe Conflict\u003c/strong\u003e \u003cp\u003eThe AI successfully generated the logic (onclick=\"editNote()\") initially. However, when the user prompted \u003cem\u003e\"Put them close together\"\u003c/em\u003e (Visual Constraint), the AI rewrote the HTML structure, wrapping buttons in a new \u0026lt;div\u0026gt; but \"forgetting\" to re-attach the event listener.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe Outcome\u003c/strong\u003e \u003cp\u003eThe button looked perfect but became unclickable. 
The user assumed the feature was broken and deleted it entirely.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section3\"\u003e \u003ch2\u003e4.5.2 Case Study B: The \"Delete\" Button Layout Crash (Participant Wang):\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eStep 1 (Logic Success)\u003c/strong\u003e \u003cp\u003eThe user successfully implemented the delete function.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cem\u003eStep 2 (Visual Prompt)\u003c/em\u003e: User prompted: \u003cem\u003e\"Add a timestamp.\"\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eStep 3 (Regression)\u003c/strong\u003e \u003cp\u003eThe AI inserted the timestamp string into the note container. Due to lack of CSS space management, the long string pushed the \"Delete\" button out of the visible area.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cem\u003eUser Reaction: \"My delete button is gone!\"\u003c/em\u003e The user assumed the logic was deleted, whereas it was merely visually displaced.\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec25\" class=\"Section2\"\u003e \u003ch2\u003e4.6 The Emotional Trajectory: Aesthetic Compensation\u003c/h2\u003e \u003cp\u003eThe logs reveal a distinct emotional arc, culminating in a phenomenon we term \u003cb\u003e\"Aesthetic Compensation.\"\u003c/b\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003ePhase 1: The \"Magic\" Honeymoon (0\u0026ndash;10 Minutes)\u003c/b\u003e:\u003c/p\u003e\u003cp\u003e \u003cem\u003eEvidence\u003c/em\u003e: \u003cb\u003eParticipant Xue\u003c/b\u003e wrote: \u003cem\u003e\"Very interesting! 
I learned new knowledge from the errors, happy!\"\u003c/em\u003e\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAnalysis\u003c/strong\u003e \u003cp\u003eThe \"Gulf of Execution\" feels non-existent. The AI's ability to produce \u003cem\u003esomething\u003c/em\u003e creates an illusion of mastery.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cb\u003ePhase 2: The Friction Plateau (10\u0026ndash;30 Minutes)\u003c/b\u003e:\u003c/p\u003e \u003cp\u003e \u003cem\u003eEvidence\u003c/em\u003e: \u003cb\u003eParticipant Pei\u003c/b\u003e scolded: \u003cem\u003e\"What you generated is wrong, regenerate it.\"\u003c/em\u003e \u003cb\u003eParticipant Jia\u003c/b\u003e despaired: \u003cem\u003e\"The website crashed halfway through.\"\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAnalysis\u003c/strong\u003e \u003cp\u003eThis marks the onset of the \u003cb\u003eRefinement Crisis\u003c/b\u003e. The \"Singin.html\" errors and CRUD frictions accumulate here.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003e \u003cb\u003ePhase 3: Resignation and \"Aesthetic Compensation\" (30\u0026thinsp;+\u0026thinsp;Minutes)\u003c/b\u003e:\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eEvidence\u003c/strong\u003e \u003cp\u003eUnable to fix complex logic (like \"Save Edit\"), users pivoted entirely to superficial beautification. 
\u003cb\u003eParticipant P3\u003c/b\u003e requested \u003cem\u003e\"Change theme to pink,\"\u003c/em\u003e and \u003cb\u003eParticipant P6\u003c/b\u003e asked for \u003cem\u003e\"Add cloud images\"\u003c/em\u003e specifically \u003cem\u003eafter\u003c/em\u003e failing logic tasks.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eTheoretical Implication\u003c/strong\u003e \u003cp\u003eUsers unconsciously attempt to compensate for the lack of \u003cem\u003efunctional utility\u003c/em\u003e (broken logic) by maximizing \u003cem\u003ehedonic value\u003c/em\u003e (visuals). By making the broken site look \"professional\" or \"cute,\" they regain a sense of closure.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5. Discussion","content":"\u003cdiv id=\"Sec27\" class=\"Section2\"\u003e \u003ch2\u003e5.1 Bridging the Gulf of Execution: From \"What Code\" to \"What Happened\"\u003c/h2\u003e \u003cp\u003eOur results reveal a critical gap in current AI-assisted development: the asymmetry between \u003cb\u003eStatic Code Generation\u003c/b\u003e and \u003cb\u003eDynamic State Management\u003c/b\u003e. When participants encountered a logical bug (e.g., \"the delete button doesn't work\"), they were forced to debug via natural language guessing (\"Is the code wrong?\"). We argue that for novice users, \u003cb\u003ecode is not the feedback; behavior is.\u003c/b\u003e The AI should not just output a corrected code block; it must verify and visualize the \u003cem\u003econsequence\u003c/em\u003e of that code. 
The current paradigm of \"Chat-to-Code\" is insufficient for dynamic applications; we need a paradigm shift to \"Chat-to-Behavior.\"\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec28\" class=\"Section2\"\u003e \u003ch2\u003e5.2 Design Implication I: The Logic Inspector\u003c/h2\u003e \u003cp\u003eBased on the \"State-Blindness\" strategy identified in Section \u003cspan refid=\"Sec15\" class=\"InternalRef\"\u003e4.2\u003c/span\u003e, we propose that future AI coding tools should embrace \u003cb\u003e\"Text-to-Behavior Visualization.\"\u003c/b\u003e Specifically for logical operations (CRUD), the IDE should implement a \u003cb\u003e\"Logic Inspector\"\u003c/b\u003e that reifies invisible data flows.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cb\u003eScenario\u003c/b\u003e: When a user reports \"Can't Delete,\" instead of just regenerating the JavaScript, the AI should present a \u003cb\u003evisual state diff\u003c/b\u003e: \u003cem\u003eUser Clicked Delete -\u0026gt; Array Updated ✅ -\u0026gt; View Updated ❌\u003c/em\u003e.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eImpact\u003c/strong\u003e \u003cp\u003eBy explicitly visualizing \u003cem\u003ewhere\u003c/em\u003e the logic broke (Data vs. View), the AI shifts the user's role from a \"blind guesser\" describing symptoms to an \"informed verifier\" identifying root causes. This helps bridge the Gulf of Evaluation.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec29\" class=\"Section2\"\u003e \u003ch2\u003e5.3 Design Implication II: Logic Guardrails against \"Whack-a-Mole\"\u003c/h2\u003e \u003cp\u003eTo mitigate the regression cycles described in Section \u003cspan refid=\"Sec22\" class=\"InternalRef\"\u003e4.5\u003c/span\u003e (The \"Whack-a-Mole\" Effect), we suggest a \u003cb\u003e\"Logic-Style Separation\"\u003c/b\u003e mechanism. 
Current LLMs often regenerate the entire file based on a new prompt, introducing the risk of overwriting existing logic.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe Problem\u003c/strong\u003e \u003cp\u003eWhen a user prompts \u003cem\u003e\"Make the button pink,\"\u003c/em\u003e the AI rewrites the HTML element. In doing so, it might accidentally drop the onclick=\"deleteNote()\" attribute or change the class name that the JavaScript was listening to.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eThe Solution\u003c/strong\u003e \u003cp\u003eFuture IDEs should allow users to \u003cb\u003e\"Pin\" or \"Lock\" verified functional blocks\u003c/b\u003e. When a user prompts for a visual change, the AI should be technically constrained to only regenerate the presentation layer (CSS) while preserving the locked logical integrity (JS/HTML structure). This feature would serve as a \"Guardrail\" for novice developers, giving them the confidence to iterate on aesthetics without fear of breaking functionality.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec30\" class=\"Section2\"\u003e \u003ch2\u003e5.4 Theoretical Implication: The Trap of \"High-Fidelity\" Hallucination\u003c/h2\u003e \u003cp\u003eThe \"Singin.html\" phenomenon and the \"Aesthetic Compensation\" behavior point to a deeper issue: \u003cb\u003eHigh-Fidelity Hallucination.\u003c/b\u003e Traditional prototyping tools (like Balsamiq) are intentionally low-fidelity to force focus on logic. AI tools, however, generate \u003cb\u003eHigh-Fidelity UI\u003c/b\u003e (polished CSS, real buttons) from the very first prompt. This creates a dangerous illusion of completeness.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThe user sees a \"Sign In\" button and assumes the \"Sign In\" logic exists.\u003c/p\u003e\u003cp\u003eThe user sees a \"Detail Page\" (singin.html) and assumes data transfer logic exists. This visual realism masks the logical emptiness underneath. 
Future AI tools must consider \u003cb\u003e\"Progressive Fidelity\"\u003c/b\u003e\u0026mdash;perhaps rendering new features as \"Wireframes\" first to force the user to verify the logic before applying the \"Pink Theme.\"\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec31\" class=\"Section2\"\u003e \u003ch2\u003e5.5 Limitations and Future Work\u003c/h2\u003e \u003cp\u003eOur study is subject to several limitations that outline directions for future research.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eParticipant Demographics\u003c/strong\u003e \u003cp\u003eOur participants were vocational college students (novices). Professional developers might employ different prompting strategies (e.g., pasting specific API documentation or error codes directly), which could lead to a different taxonomy of repair strategies.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eTool Specificity\u003c/strong\u003e \u003cp\u003eWe used \u003cem\u003eTongyi Lingma\u003c/em\u003e and \u003cem\u003eTrae\u003c/em\u003e. While these represent state-of-the-art IDE agents, other models (like GPT-4o or Claude 3.5 Sonnet) might have lower hallucination rates. However, we argue that the fundamental \"Logic-Visual Gap\" is an epistemological problem regarding how users articulate intent, not just a model performance problem, and is likely to persist across platforms.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eShort-term Study\u003c/strong\u003e \u003cp\u003eThe study focused on a single session. We do not know if users would learn to prompt better over time (Longitudinal effects). Future work should investigate how these strategies evolve as novices gain experience and whether the \"Whack-a-Mole\" effect diminishes with practice.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"6. 
Conclusion","content":"\u003cp\u003eIn this large-scale study of 200 novice developers, we demonstrated that while AI can easily generate the syntax of a web application, it struggles to help users manage the semantics of application state. We identified four key repair strategies\u0026mdash;Perceptual, Behavioral, Structural, and Reset\u0026mdash;and characterized the \"Whack-a-Mole\" phenomenon as a critical friction point in multi-modal generation. The \"Refinement Crisis\" we observed suggests that the future of AI programming tools must evolve from \"Code Generators\" to \"Logic Visualizers.\" Only by reifying the invisible logic and providing guardrails against regression can we truly empower end-users to see not just the code they wrote, but the reality they created.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eEthics declaration: This study involved human participants and was approved by the Institutional Research Committee of Anyang Preschool Teachers College.\u003c/p\u003e\u003cp\u003e \u003ch2\u003eEthics approval:\u003c/h2\u003e \u003cp\u003eThis study involved human participants (200 students) and was conducted in accordance with the ethical standards of the institutional research committee at Anyang Preschool Teachers College.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eConsent to participate:\u003c/strong\u003e \u003cp\u003eInformed consent was obtained from all individual participants included in the study.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eConsent to publish:\u003c/strong\u003e \u003cp\u003eThe authors affirm that human research participants provided informed consent for the publication of the research findings and any accompanying images in an anonymized format.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eCompeting interests:\u003c/strong\u003e \u003cp\u003eThe authors have no relevant financial or non-financial interests to disclose.\u003c/p\u003e 
\u003c/p\u003e\u003ch2\u003eFunding:\u003c/h2\u003e \u003cp\u003eThe authors received no specific funding for this research. The study was conducted as part of the authors' educational and research activities at Anyang Preschool Teachers College.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eA wrote the main manuscript text\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eZamfirescu-Pereira JD, Wong RY, Hartmann B, Yang Q. (2023). Why Johnny Can\u0026rsquo;t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim T, Bragg J, Chilton LB. (2022). Stylette: Styling the Web with Natural Language. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKo AJ, Myers BA. (2004). Six Learning Barriers in End-User Programming Systems. Proceedings of the 2004 IEEE Symposium on Visual Languages - Human Centric Computing (VL/HCC '04).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaithilingam P, Zhang T, Glassman EL. (2022). Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen M, Tworek J, Jun H et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNorman DA. Cognitive Engineering. User Centered System Design. New Perspectives on Human-Computer Interaction; 1986.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarke L, James MB, Polikarpova N. (2023). Grounding Copilots in Human Interaction. 
Proceedings of the ACM on Programming Languages (OOPSLA '23).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCharmaz K. Constructing Grounded Theory: A Practical Guide through Qualitative Analysis. Sage; 2006.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGentner D, Stevens AL. (1983). Mental Models. Psychology.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFielding RT. (2000). Architectural Styles and the Design of Network-based Software Architectures. Ph.D. Dissertation, University of California, Irvine.\u003c/span\u003e\u003c/li\u003e \u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"discover-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"diai","sideBox":"Learn more about [Discover Artificial Intelligence](https://www.springer.com/44163)","snPcode":"","submissionUrl":"","title":"Discover Artificial Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Discover Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8994174/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8994174/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe rapid proliferation of Large Language Models (LLMs) and generative tools (e.g., GPT-4, Tongyi Lingma, Trae) has fundamentally democratized the landscape of web 
development, shifting the paradigm from manual syntax construction to natural language intent specification. However, while the barrier to \"drafting\" initial code has been lowered, a significant \"Refinement Crisis\" has emerged. As task complexity scales from static landing pages to dynamic, stateful applications, novice users encounter a profound \"Gulf of Evaluation\" when attempting to repair AI-generated errors. Unlike text generation, where errors are semantic and visible, web interface generation involves a complex interplay between visual presentation (CSS) and invisible state management (JavaScript). In this paper, we present a large-scale observational study with \u003cb\u003e200 novice participants\u003c/b\u003e tasked with utilizing IDE-integrated AI assistants to build a fully functional CRUD (Create, Read, Update, Delete) note-taking application.\u003c/p\u003e \u003cp\u003eThrough a rigorous analysis of interaction logs and source code snapshots, we reveal that while 90% of users could generate a baseline prototype, \u003cb\u003e80% encountered severe \"invisible state\" breakdowns\u003c/b\u003e (e.g., data persistence failure), and \u003cb\u003e50% suffered from persistent layout regressions\u003c/b\u003e. We contribute a detailed taxonomy of four repair strategies: \u003cem\u003ePerceptual Refinement\u003c/em\u003e, \u003cem\u003eBehavioral Correction\u003c/em\u003e, \u003cem\u003eDiagnostic Proxy\u003c/em\u003e, and \u003cem\u003eGlobal Reset\u003c/em\u003e. Furthermore, we characterize the \u003cb\u003e\"Whack-a-Mole\" effect\u003c/b\u003e\u0026mdash;a phenomenon where repairing visual elements inadvertently corrupts functional logic due to the AI's lack of holistic state awareness. 
Our findings provide empirical evidence for the limitations of current chat-based coding interfaces and offer critical design implications for future \"State-Aware\" AI IDEs that reify invisible data flows to bridge the gap between user intent and execution.\u003c/p\u003e","manuscriptTitle":"\"Make it Pop, but not Like That\": A Taxonomy of Iterative Prompting Strategies for Refining AI-Generated Web Interfaces","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-08 16:39:35","doi":"10.21203/rs.3.rs-8994174/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-03-05T09:53:15+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-28T14:48:56+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-28T14:46:37+00:00","index":"","fulltext":""},{"type":"submitted","content":"Discover Artificial Intelligence","date":"2026-02-28T09:42:11+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"discover-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"diai","sideBox":"Learn more about [Discover Artificial Intelligence](https://www.springer.com/44163)","snPcode":"","submissionUrl":"","title":"Discover Artificial Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Discover Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"6c8ae0b2-97c2-4557-b88d-f7cc1472b1c9","owner":[],"postedDate":"March 8th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-03-24T08:25:16+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-08 
16:39:35","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8994174","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8994174","identity":"rs-8994174","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
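
The "visual state diff" scenario from Section 5.2 (User Clicked Delete -> Array Updated -> View Updated) can be illustrated with a small sketch. This is not the paper's implementation: the function names `inspectDelete` and `formatTrace`, the note shape `{ id, text }`, and the idea of representing the rendered view as a plain list of note IDs are all assumptions made for illustration. The sketch only shows how a tool might check the data layer and the view layer separately after a delete action, so that a failure can be attributed to one or the other.

```javascript
// Hypothetical Logic Inspector sketch. All names here are illustrative
// assumptions; the paper describes the concept but gives no implementation.

// Compare the note array and the rendered view before/after a delete.
// notesBefore / notesAfter: arrays of { id, ... } objects (the data layer).
// renderedIdsAfter: IDs of the notes currently shown on screen (the view layer).
function inspectDelete(notesBefore, notesAfter, renderedIdsAfter, deletedId) {
  // Data check: the note existed before and is gone from the array now.
  const arrayUpdated =
    notesBefore.some((n) => n.id === deletedId) &&
    !notesAfter.some((n) => n.id === deletedId);

  // View check: the deleted note is no longer rendered.
  const viewUpdated = !renderedIdsAfter.includes(deletedId);

  return {
    trigger: "User Clicked Delete",
    checks: [
      { step: "Array Updated", ok: arrayUpdated },
      { step: "View Updated", ok: viewUpdated },
    ],
  };
}

// Render the trace in the arrow notation used in the Section 5.2 scenario.
function formatTrace(trace) {
  const parts = trace.checks.map((c) => `${c.step} ${c.ok ? "✅" : "❌"}`);
  return [trace.trigger, ...parts].join(" -> ");
}

// Example: a handler that removes the note from the array
// but never re-renders, so the view is stale.
const before = [{ id: 1, text: "a" }, { id: 2, text: "b" }];
const after = before.filter((n) => n.id !== 2);
const staleView = [1, 2];

console.log(formatTrace(inspectDelete(before, after, staleView, 2)));
// prints: User Clicked Delete -> Array Updated ✅ -> View Updated ❌
```

Keeping the data check and the view check as separate entries is the point of the sketch: a "Can't Delete" report then maps to a specific failing layer instead of prompting a full regeneration of the JavaScript.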
