Stage 2 of Backward Design: Assessment Evidence That Survives AI

In the first piece in this series I argued that most courses break in Stage 1 of Backward Design — that faculty who sincerely believe they start from outcomes usually start from content, and the syllabus reveals it. If you got Stage 1 right, you now have outcomes that name what a student should be able to do with what they learned.

Stage 2 is where you decide how you'll know they can. Wiggins and McTighe call it "determining acceptable evidence," and it is the stage that AI just detonated.

Here's the uncomfortable version. For most of the history of higher education, Stage 2 had a comfortable shortcut: assign the work — the essay, the problem set, the lab report, the case analysis — and treat that finished work as evidence of the cognition behind it. The essay stood in for the thinking. That shortcut held because producing the work required the cognition. You couldn't write a competent literature review without having read and synthesized the literature.

That coupling is gone. A student can now produce a competent piece of work with little of the cognition it was supposed to evidence. The work didn't just get easier to fake; it got trivial to fake. And this is the part faculty keep getting wrong: AI didn't create a new problem. It exposed an old one. Stage 2 was always supposed to ask "what evidence would actually demonstrate this outcome?", and for decades we answered "the essay" without checking whether the essay demonstrated the outcome or merely the output. AI took away the option of not checking.

So Stage 2 has to be rebuilt around a harder question: what evidence of this outcome remains valid when the work itself can be generated? Two kinds survive.

Evidence type one: alignment you can trace

The first kind of surviving evidence isn't about catching cheaters — it's about whether the assessment measures the outcome at all. Most assessments fail a simple trace test: if you put the outcome and the assessment side by side, the assessment measures something narrower, broader, or just adjacent to what the outcome claims.

An outcome that says "students will evaluate competing methodological approaches and justify a choice for a given research question" is not assessed by a multiple-choice exam on methodology terms, and it's not assessed by an essay that merely describes three methods. It's assessed by a task that forces the student to choose under specific constraints and defend the choice against the alternatives. When the outcome verb and the assessment task verb match (evaluate with evaluate, design with design, not "evaluate" with "identify"), you have alignment. Alignment is evidence that survives regardless of what tools the student used: a well-aligned task is hard to complete without the underlying capability, whether or not AI helped draft the prose.

This is where most "AI-proofing" energy is misspent. Faculty reach for surveillance (lockdown browsers, plagiarism detectors, proctoring) when the cheaper and more durable fix is an assessment that's actually aligned to a demanding outcome. A genuinely well-aligned performance task is most of the defense you need.

Evidence type two: performance that's hard to fake

The second kind is evidence generated under conditions the finished work alone can't satisfy, where the student has to perform the cognition live or adapt it on demand, not just hand in a product. This is evidence AI can't stand in for, because nothing sits between the student and the assessor.

It doesn't require an exam hall. It requires designing the evidence so that producing it depends on having done the thinking:

Defense of the work. Let them use whatever tools they want to produce it — then have them explain a specific decision in it, justify an alternative they rejected, or extend it to a case they haven't seen. A student who outsourced the thinking can produce the essay; they can't defend a choice they didn't make.
In-context application. Give the concept, then a novel situation in the room, and ask them to apply it. The transfer is the evidence. Transfer is precisely what generated work doesn't demonstrate.
Process made visible. Evidence of the reasoning path — the rejected approaches, the revision in response to a constraint, the "why this and not that" — rather than only the polished end state.

The key design move: don't grade more. Grade the part that requires the cognition, and let the rest be tool-assisted without apology. A student using AI to clean up prose on a task they can defend is not a problem. A student who can't defend any of it is the only thing you were ever trying to catch — and a hard-to-fake performance component catches it without surveilling the honest majority.

How to actually redesign a Stage-2 assessment

You don't rebuild every assessment in your course. You apply a triage:

For each outcome, name the verb. What cognition does it claim?
Look at the current assessment and name its verb. Does it match, or did you settle for the finished work as proxy?
If the work is now easy to fake and the stakes are real, add a hard-to-fake component — a short defense, an in-class application, a visible-process requirement. It can be small. Five minutes of "walk me through why you chose this" recovers most of the validity a take-home lost.
Leave the low-stakes formative work alone. Not everything needs to resist faking. Practice can be tool-assisted; that's fine, even good. Reserve the rigor for the evidence that actually certifies the outcome.
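
For readers who think in checklists, the triage above can be sketched as a few lines of Python. This is only an illustration under stated assumptions: the Assessment fields, the triage function, and the exact-match comparison of verbs are all hypothetical names and simplifications introduced here, not part of Backward Design or any tool. Real verb matching takes judgment, and the low-stakes shortcut is one reading of the fourth step.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    # All field names are hypothetical, chosen for this sketch only.
    outcome_verb: str   # cognition the outcome claims, e.g. "evaluate"
    task_verb: str      # cognition the current task demands, e.g. "describe"
    easy_to_fake: bool  # can the finished work be generated without the thinking?
    high_stakes: bool   # does this evidence certify the outcome?


def triage(a: Assessment) -> list[str]:
    """Return suggested redesign actions for one assessment."""
    # Step 4: low-stakes formative work is left alone.
    if not a.high_stakes:
        return ["leave alone: low-stakes formative work"]

    actions = []
    # Steps 1 and 2: compare the outcome verb with the task verb.
    # Exact string matching is a simplifying assumption.
    if a.outcome_verb.lower() != a.task_verb.lower():
        actions.append(
            f"realign task: outcome says '{a.outcome_verb}', "
            f"task asks '{a.task_verb}'"
        )
    # Step 3: easy to fake and stakes are real, so add a small live component.
    if a.easy_to_fake:
        actions.append(
            "add a hard-to-fake component: short defense, "
            "in-class application, or visible process"
        )
    return actions or ["keep as is"]


if __name__ == "__main__":
    essay = Assessment("evaluate", "describe", easy_to_fake=True, high_stakes=True)
    for action in triage(essay):
        print(action)
```

The point of the sketch is the ordering: stakes are checked first, so the effort goes only to the evidence that certifies an outcome.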

Where AI fits — on both sides of the desk

It would be dishonest to write this as though AI is only the threat. It's also the most useful Stage-2 design partner faculty have had:

It can check alignment — paste an outcome and an assessment and get an honest read on whether the verbs match and what the task actually measures.
It can generate the hard-to-fake component — propose three defense questions for a given piece of work, or a novel transfer scenario for a given concept, in seconds.
And, increasingly, it can conduct the live questioning at scale — the part that made oral defense impractical for a 90-student course. That's a different essay, but it's the reason performance under live questioning stopped being a luxury reserved for dissertations.

What AI can't do is decide what your outcome means or whether a given piece of evidence honors it. That's Stage 2, and it's still yours.

The bottom line

Stage 2 was always asking the right question — what evidence would actually demonstrate this outcome? — and for a long time we were allowed to answer lazily because the work and the cognition were welded together. AI cut the weld. The fix isn't to police the work harder; it's to design evidence that's either tightly aligned to a demanding outcome or generated under conditions generated work can't satisfy. Do that, and you don't have an AI problem. You have the assessment you should have had all along.

TeachingsByDesign helps faculty get Stage 1 and Stage 2 right — aligned outcomes, evidence that holds, and authentic assessment that scales.

References

Biggs, J. (1996). Enhancing teaching through constructive alignment. Higher Education, 32(3), 347–364.

Wiggins, G., & McTighe, J. (2005). Understanding by Design (Expanded 2nd ed.). ASCD.

About the author

Thomas R. Christian is the founder of TeachingsByDesign, an AI-native academic platform built around the coherence engine thesis — that alignment from outcomes downward, not feature-by-feature LMS plumbing, is where higher education actually breaks and where AI can actually help.

He holds a Master's in Adult and Continuing Education from Rutgers University and has spent twenty years designing instruction, training, and curriculum across enterprise CX, healthcare, and financial services. He writes about course design, AI in higher education, and the discipline of getting Stage 1 right.