Your AI Is Doing the Wrong Job. That's On You.
What two weeks of Moodle import errors taught me about right-sizing roles

Two weeks of debugging. Every single failure was XML. Not the AI. XML!

I build Python-based deployment pipelines for professional certification programs delivered on Moodle. Course content is authored by Team 1 — a group of AI agents working alongside a subject matter expert who stays in the loop as a human reviewer. Call the whole team T1. I take that content and compile it into a deployable Moodle course package. The pipeline is automated. The process is repeatable. It works.

Except for two weeks in April, it didn't. And the whole time, the answer was sitting right in front of me.

Quick context on the team

I reference T1, T2, and the SME throughout this article. If you have not read the previous piece, here is the 30-second version.

T1 is the content team — a group of AI agents working alongside a subject matter expert (SME) who reviews and approves every deliverable before it leaves T1's hands. The AI agents produce the bulk of the work fast. The SME is the accuracy gate. It is not fully autonomous. That human-in-the-loop (HIL) is deliberate — the AI agents are getting sharper every module, but the SME stays in the loop until the system earns full trust on each task.

T2 is the infrastructure — the agent personas, the prompting architecture, the agentic workflows, the QA measurement tools, and the Python pipeline that compiles T1's output into a deployable course package. I designed and built all of it. When I describe a failure in this article, I am describing a failure in my own architecture.

The distinction matters because the XML problem was not a T1 failure. It was a pipeline design failure. T1 was doing exactly what it was asked. I asked it for the wrong thing.
And to be clear about what T1 is already doing: for every module of a 12-module professional certification course, T1 produces learning objectives, participant guides, facilitator guides, handouts, activities, graphics, and assessment questions — all at the medical-grade accuracy required for NCCA credentialing. A wrong answer key on a quiz is not a typo. It is a compliance failure.

That is T1's job. Content creation at medical-grade accuracy across an entire course catalog. Asking that team to also enforce Moodle's XML schema on top of all of that was the mistake. One function. One job.

What the wrong job looks like

The wrong thing was Moodle quiz XML. If you have never tried to import assessment questions into Moodle programmatically, you probably assume the XML is straightforward. It is not. Every question type has a different schema. The rules are scattered across Moodle's PHP source code, not documented in any single reference. And the importer fails silently on half of them.

Here is a single True/False question in valid, importable Moodle XML. One question. Pay attention to how much structure surrounds four words of actual content:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<quiz>
  <question type="category">
    <category>
      <text>$course$/CertPro/Question Bank/M01/TrueFalse</text>
    </category>
  </question>
  <question type="truefalse">
    <name><text>M01-TF-01</text></name>
    <questiontext format="html">
      <text><![CDATA[<p>Audit logs must be retained for a minimum of seven years under federal standards.</p>]]></text>
    </questiontext>
    <defaultgrade>1.0000000</defaultgrade>
    <penalty>1.0000000</penalty>
    <hidden>0</hidden>
    <answer fraction="100" format="plain_text">
      <text>true</text>
      <feedback format="html">
        <text><![CDATA[<p>Correct. Seven years is the federal minimum.</p>]]></text>
      </feedback>
    </answer>
    <answer fraction="0" format="plain_text">
      <text>false</text>
      <feedback format="html">
        <text><![CDATA[<p>Incorrect. Review the retention policy section.</p>]]></text>
      </feedback>
    </answer>
  </question>
</quiz>
```

That is one question. Now here is a Matching question — same file, different question type, completely different schema:

```xml
<question type="matching">
  <name><text>M01 Matching A - Compliance Terms</text></name>
  <questiontext format="html">
    <text><![CDATA[<p>Match each compliance term to its correct definition.</p>]]></text>
  </questiontext>
  <shuffleanswers>1</shuffleanswers>
  <correctfeedback format="html">
    <text><![CDATA[<p>All correct.</p>]]></text>
  </correctfeedback>
  <partiallycorrectfeedback format="html">
    <text><![CDATA[<p>Some incorrect. Review and retry.</p>]]></text>
  </partiallycorrectfeedback>
  <incorrectfeedback format="html">
    <text><![CDATA[<p>Incorrect. Return to Module 01 and retry.</p>]]></text>
  </incorrectfeedback>
  <subquestion format="html">
    <text>Audit trail</text>
    <answer><text>A chronological record of system activity</text></answer>
  </subquestion>
  <subquestion format="html">
    <text>Data custodian</text>
    <answer><text>The person responsible for maintaining data integrity</text></answer>
  </subquestion>
</question>
```

Two question types. Two completely different schemas. Cloze and Essay have their own structures too — each one requires its own creation logic. Every element has rules. Most of the rules are not documented in any single reference.

What Moodle's importer actually enforces

Here is what Moodle's PHP importer actually enforces at parse time:

- `format="html"` is required on almost every text-containing element. Omit it from `<questiontext>`, `<feedback>`, or `<subquestion>` and Moodle silently drops the content or aborts the import. No clear error message.
- `<text>` nodes containing HTML must use CDATA — or fully escaped entities. A `<text>` node with a raw `<p>` child is not a string to PHP's `trim()`. It's an array. You get: `Error: trim(): Argument #1 ($string) must be of type string, array given`. Not obvious. Every HTML-containing text node needs `<![CDATA[...]]>` or escaped markup.
- True/False answer text must be lowercase. `<text>True</text>` fails silently. Moodle can't determine which answer is correct and imports a broken question. Must be `<text>true</text>`. Four characters. Costs you the whole question.
- Matching `<subquestion>` elements must be direct children of `<question>`. Not wrapped. Moodle's PHP reads `$question->subquestion` directly. Wrap them in a `<subquestions>` parent — a completely logical authoring choice — and you get `Undefined array key "subquestion"` on every single matching question in the file.
- Category paths use a pseudo-filesystem with a `$course$` variable. The content of the `<category>` block determines which question pool a question lands in. Use `M01 - Introduction/TrueFalse` for your first authoring batch and `M01/TrueFalse` for the second — both valid XML, both syntactically fine — and Moodle creates two separate categories. Your randomized question pool is now split. Students across delivery cohorts draw from different pools. The exam is no longer audit-defensible.
- Cloze syntax is embedded inside escaped HTML inside a CDATA block. `{1:SHORTANSWER:=answer1~%100%answer2}` lives inside the questiontext string. It has to survive XML parsing, CDATA unwrapping, and PHP string processing. Double-encode a single character upstream and the answer matching silently breaks.
- Encoding artifacts compound all of it. Smart quotes from word processors. Mojibake from double-UTF8 encoding — `â€"` showing up where `—` should be. Bare HTML entities like `&ndash;` outside CDATA blocks. Some fail loudly. Some import the question with corrupted text that only surfaces when you open it in the Moodle UI three days later.

What happened when I asked T1 to write this directly

First delivery: 65 errors. I gave T1 explicit feedback. Showed it the specific failures. Corrected examples. Second delivery: 49 errors. Different errors.
| Issue | Delivery 1 | Delivery 2 |
|---|---|---|
| Questions in wrong category | YES — matching landed in TrueFalse pool | Fixed |
| Capital True/False | Fixed | YES — 28 instances |
| Raw HTML in `<text>` nodes | Fixed | YES — 21 instances |
| Smart quotes and dashes | YES — 131 instances | Fixed |
| Correct question count | NO — 74 of 83 | NO — 74 of 83 |

Not metaphorically. Literally different errors each time. And that is with a human expert reviewing the output before it reached me.

This is not a prompting problem. T1 understood the requirements. The SME reviewed the files. They fixed what I told them to fix. But Moodle quiz XML has ~15 interdependent rules across four question types, and an LLM generating XML improvises on those rules every generation. It cannot hold all of them consistently across 83 questions and 12 module files in a single pass. The human reviewer caught content errors. Nobody caught all the structural ones — because they are invisible until Moodle's PHP importer rejects them.

It was the wrong tool for the job. I was the one using it wrong.

The answer was already in the pipeline

Every other piece of this pipeline runs on HTML. Course pages — HTML. Activity descriptions — HTML. Lesson content — HTML. Moodle renders HTML everywhere. Even the Moodle XML we were targeting wraps HTML inside CDATA blocks inside every single `<text>` node.

Our entire pipeline is built on an architecture we call HTML-as-JSON — structured HTML with embedded data-* attributes that serves as both the human-readable deliverable and the machine-parseable data source. The AI writes content in a format it produces fluently. The Python pipeline extracts the data it needs from the DOM. No translation layer. No schema enforcement in the prompt.

A bonus: HTML is the only structured format in this stack where the subject matter expert can open the file in a browser and immediately see what the student will see. My co-founder said it plainly: "If I give you a spec file, I cannot see how it looks until you build it. With HTML, I can see it immediately."
That is a free QA step baked into the format choice. An SME cannot review XML. They cannot review JSON. But they can open a browser, look at a page, and tell you in ten seconds whether the content is right. The human-in-the-loop works because the format is human-readable without tooling.

The format we were asking T1 to produce — XML — was dead wrong for the workflow we were using. T1 naturally produces HTML in every other context we give it. Every course page it writes comes out clean. Every activity description, every lesson block, every rubric. HTML. Consistent. Parseable. HIL-reviewable. No encoding surprises.

We thought we could train it out of its XML errors. Two deliveries, explicit feedback, corrected examples. Still broken. Differently broken. Wrong approach. We went back to what the LLM does natively and what the rest of the pipeline already speaks: HTML. KISS — Keep It Simple, Stupid (me). Stop asking the tool to do the hard part. Do the hard part yourself, in code, once.

The AI agents are getting sharper every module. The HIL keeps the content accurate. Neither of them should be debugging XML schema compliance. That is a machine job.

The fix

As a programmer, you write functions that do one thing. You probably learned the hard way what happens when you don't. Why would we expect an LLM to be any different?

The problem was not that T1 is bad at writing course content. It is excellent at that — the AI agents produce clean, consistent material and the SME keeps it accurate. The problem is that I was asking them to simultaneously be content authors and XML schema enforcers. Those are two completely different tasks. One has room for creativity and judgment. The other has zero tolerance for variation. Asking one tool to do both is how you get 49 different errors on the second try.
Hard separation:

```text
T1 (LLM)         →  HTML template              →  Python converter      →  Moodle XML
[content author]    [interface contract:          [T2 schema enforcer]     [import target]
                     structured but forgiving]
```

The interface contract

HTML is a format LLMs produce fluently and consistently. It's also forgiving — minor variations (True instead of true) don't break the parse. The template I gave T1 uses data-* attributes as machine-readable markers:

```html
<section data-type="truefalse" data-include="yes">
  <article data-id="C1-M01-TF-01">
    <p class="question">Audit logs must be retained for a minimum of seven years.</p>
    <p class="correct-answer" data-correct="true"></p>
    <p class="feedback-correct">Correct. Seven years is the federal minimum.</p>
    <p class="feedback-wrong">Incorrect. Review Module 01.</p>
  </article>
</section>
```

T1 writes content. The SME reviews it. Neither of them ever touches `format="html"`, CDATA, `fraction="100"`, or category paths. Those don't exist in their world.

Cloze blanks use a dead-simple inline marker instead of embedded XML syntax:

```html
<p class="question">
  Records must be retained for [BLANK:seven|7] years under [BLANK:federal] guidelines.
</p>
```

Five authoring rules. Worked examples for all four question types. Everything in HTML comments — no separate setup document to get lost or go stale.

T1's self-check on the first HTML file it delivered under the new system: 5 matching sets, 12 T/F, 8 cloze, 4 essay, zero smart quotes, zero bare blanks, all data-correct lowercase. One invisible BOM at the file start — stripped automatically by the converter.

First try. Human reviewer signed off before handoff. No XML debugging session.
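On the converter side, T2 reads those markers straight out of the DOM. This is not the repo's actual code — just a minimal sketch using Python's stdlib XML parser, and it assumes the section snippet is well-formed (the production converter would use a real HTML parser, which is more forgiving):

```python
import xml.etree.ElementTree as ET

TEMPLATE = """\
<section data-type="truefalse" data-include="yes">
  <article data-id="C1-M01-TF-01">
    <p class="question">Audit logs must be retained for a minimum of seven years.</p>
    <p class="correct-answer" data-correct="true"></p>
    <p class="feedback-correct">Correct. Seven years is the federal minimum.</p>
    <p class="feedback-wrong">Incorrect. Review Module 01.</p>
  </article>
</section>"""

def extract_truefalse(section_html: str) -> list[dict]:
    """Pull question data out of the data-* attributes and class names."""
    root = ET.fromstring(section_html)
    if root.get("data-include") != "yes":
        return []  # section explicitly excluded by the author
    questions = []
    for article in root.iter("article"):
        # Map each <p> by its class attribute so field order never matters.
        fields = {p.get("class"): p for p in article.iter("p")}
        questions.append({
            "id": article.get("data-id"),
            "type": root.get("data-type"),
            "question": fields["question"].text.strip(),
            "correct": fields["correct-answer"].get("data-correct"),
            "feedback_correct": fields["feedback-correct"].text.strip(),
            "feedback_wrong": fields["feedback-wrong"].text.strip(),
        })
    return questions

for q in extract_truefalse(TEMPLATE):
    print(q["id"], "->", q["correct"])  # C1-M01-TF-01 -> true
```

The point of the sketch: nothing in the authoring format is positional or schema-strict. The extractor keys off class names and data-* attributes, so minor authoring variation cannot break the parse the way it breaks Moodle's importer.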
The enforcer

The Python converter owns 100% of the Moodle XML rules:

- Reads `<meta name="module">` and data-type to build the correct category path — every time, the same way
- Adds `format="html"` to every element that needs it
- Wraps all HTML content in `<![CDATA[...]]>` automatically
- Converts `[BLANK:a|b]` to `{1:SHORTANSWER:=a~%100%b}` cloze syntax
- Always outputs true/false lowercase regardless of what T1 wrote
- Strips all non-ASCII before writing — mojibake is structurally impossible
- Validates minimum pool sizes per delivery mode before writing output
- Generates `<subquestion format="html">` as direct children. No wrapper.

Written once. Tested once. Never improvises.

The broader pattern

This isn't a Moodle problem. It shows up anywhere LLMs need to produce output conforming to a strict target schema:

| Generating... | Don't ask the LLM to write... | Ask it to write... |
|---|---|---|
| Moodle quiz XML | Raw XML | Structured HTML template |
| API request payloads | JSON with strict schema | Simple key-value markdown |
| Database seed files | Raw SQL | CSV with header row |
| Config files | YAML/TOML | Annotated plain text |
| SCORM packages | IMS XML manifests | HTML with data attributes |

Same pattern every time:

1. Define an interface format the LLM can produce reliably
2. Build a converter that transforms it into the target schema
3. Push every schema rule into the converter — none in the prompt, none in the LLM's head
4. Validate at conversion time, not at import time

The LLM's job: write good content in a forgiving format. The converter's job: enforce every rule, every time, with zero tolerance for variation.

What actually changed

For two weeks I was debugging XML. Now I am reviewing content quality. Which is where my attention should have been from the start.

But the deeper shift is this: everyone in the pipeline is now doing the job they are actually built for. The AI agents generate content. The SME validates accuracy. The Python converter enforces schema.
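To make "the converter enforces schema" concrete: each rule from the enforcer list becomes a small, deterministic function. A minimal sketch of a few of them — the helper names are mine, not the repo's actual code:

```python
import re

# Map smart punctuation to ASCII before the blanket non-ASCII strip,
# so quotes and dashes survive as plain characters.
SMART_CHARS = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
               "\u2013": "-", "\u2014": "-"}

def normalize(text: str) -> str:
    """Replace smart punctuation, then drop any remaining non-ASCII
    so mojibake is structurally impossible in the output."""
    for bad, good in SMART_CHARS.items():
        text = text.replace(bad, good)
    return text.encode("ascii", "ignore").decode("ascii")

def to_cloze(text: str) -> str:
    """Convert [BLANK:a|b] markers into Moodle cloze syntax."""
    def repl(m: re.Match) -> str:
        first, *rest = m.group(1).split("|")
        return ("{1:SHORTANSWER:=" + first
                + "".join(f"~%100%{a}" for a in rest) + "}")
    return re.sub(r"\[BLANK:([^\]]+)\]", repl, text)

def html_text(content: str) -> str:
    """Wrap HTML content in CDATA so PHP's trim() sees a string, never
    an array of child nodes. The caller attaches format="html" to the
    enclosing element."""
    return f"<text><![CDATA[{normalize(content)}]]></text>"

def tf_answer(value: str) -> str:
    # Moodle requires lowercase true/false; enforce it regardless of input.
    return f"<text>{value.strip().lower()}</text>"
```

Each function is written once, tested once, and never improvises — exactly the property the LLM could not provide across 83 questions.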
Each role is sized to what it can do reliably — not to what we hoped it could do with enough prompting.

When you right-size the roles, the errors stop being random and start being findable. You can measure them. Fix them. Track whether they come back. Review. Measure. Revise. Repeat. That loop is the whole game. Not getting it perfect the first time — nobody does. Getting it measurably better every iteration. The agents are sharper this month than last month. The converter catches things the pre-check missed at first. The SME's review is faster because the template is cleaner. The system improves because each component has a clear job and a clear failure mode.

The question is never "can the LLM write this?" It is always "what is the smallest, most reliable job I can give each part of the system — and how do I know whether it is doing that job?" That is an architecture problem. It compounds. Prompting doesn't.

The code

All three tools from this article are open source. Clone, adapt, use them: GitHub: EdFife/HTML-as-JSON

| File | What it does |
|---|---|
| python-scaffold/quiz_template_universal.html | The universal HTML authoring template — all 4 question types, all instructions in comments |
| python-scaffold/html_to_moodle_xml.py | Converts the HTML template to valid Moodle XML — owns 100% of the schema rules |
| python-scaffold/precheck_quiz_html.py | Pre-conversion validator — catches authoring errors before they become import failures |

The converter works whether T1 is an AI team or a single instructor writing quiz questions on a Saturday. The HTML template is simpler to author than Moodle's own quiz UI.

The repo also contains our full AI agent persona library, the agentic workflow architecture from the first article, and the Python scaffold we use to build the rest of the course package. If you are building anything on Moodle with AI, start there.

The pipeline ran three more courses after this fix. Zero XML debugging sessions. The system is still improving.
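For a sense of what the pre-check layer buys you, here is a toy version. These checks mirror the failure modes from the error table earlier in the article, but the function and its rules are my illustration — not the contents of precheck_quiz_html.py:

```python
def precheck(html: str) -> list[str]:
    """Return a list of authoring problems to fix before conversion.
    Illustrative only -- not the repo's actual validator."""
    problems = []
    if html.startswith("\ufeff"):
        problems.append("invisible BOM at file start")
    if any(ch in html for ch in "\u2018\u2019\u201c\u201d\u2013\u2014"):
        problems.append("smart quotes or dashes present")
    if 'data-correct="True"' in html or 'data-correct="False"' in html:
        problems.append("data-correct must be lowercase true/false")
    if "[BLANK:]" in html:
        problems.append("empty [BLANK:] marker")
    return problems
```

Running something like this before conversion turns a silent Moodle import failure three days later into a ten-second fix in the authoring file.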
If you are solving a similar problem or want to argue about the approach, I am easy to find.

Tags: ai python xml moodle llm architecture devops opensource