🌻 !black box

9 Apr 2025

Of course there are hundreds of useful ways evaluators can use AI. The one use that bothers me is getting it to make evaluative judgements, so here are some rules of thumb.

Don't use AI as a "black box".

Limit the AI's freedom to make evaluative judgements.

Do not ask the AI “what are the main or most important causal stories in the document?”, as this is a significant evaluative judgement, carried out in an opaque way by a machine we have no special reason to trust.

Do not ask the AI to make summaries, as making a summary is an evaluative act.

Instead, break down your high-level, evaluative question into simpler tasks like "does this paragraph mention changes in health behaviour?".

Break down your text into small units.
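A minimal sketch of these two breakdowns, assuming paragraphs are separated by blank lines. The function names, the second question, the sample text and the prompt wording are all made up for illustration and do not come from any particular tool.

```python
from itertools import product


def split_into_units(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines (an assumption about the input)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def build_tasks(units: list[str], questions: dict[str, str]) -> list[dict]:
    """Pair every text unit with every simple question: one small task per pair."""
    tasks = []
    for (unit_id, unit), (question_id, question) in product(
        enumerate(units), questions.items()
    ):
        tasks.append({
            "unit_id": unit_id,
            "question_id": question_id,
            "prompt": f"{question} Answer only 'yes' or 'no'.\n\n{unit}",
        })
    return tasks


# Hypothetical questions and document, for illustration only.
QUESTIONS = {
    "health_behaviour": "Does this paragraph mention changes in health behaviour?",
    "income": "Does this paragraph mention changes in household income?",
}
DOCUMENT = (
    "People started boiling their water.\n\n"
    "The market reopened in March.\n\n\n\n"
    "Sales went up."
)

units = split_into_units(DOCUMENT)
tasks = build_tasks(units, QUESTIONS)
print(len(units), "units x", len(QUESTIONS), "questions =", len(tasks), "tasks")
```

Each task is small enough that a person could answer it by reading one paragraph, which is the point of the breakdown.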

In the end you will have broken down your high-level task into very many smaller, much simpler tasks, with instructions about how to reassemble the low-level results into a high-level answer, so that in principle you don't need an AI at all. If you had a lot of time and patience you could instead use hundreds of schoolchildren who have been given adequate background knowledge. For each small task, there should be a high degree of inter-subjective agreement about how to answer it.

This break-it-down and build-it-up logic is also the logic of rubrics.

Use the AI only as a tireless low-level assistant to exhaustively and transparently carry out each small task on each piece of text, usually with zero "temperature" to make results as reproducible as possible.
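The loop this implies might look roughly like the sketch below. `ask_model` is a hypothetical stand-in for whatever model call you use, assumed to be configured with temperature zero. Here it is replaced by a keyword-matching stub so the example runs without any AI service. The part that matters is the log: every prompt and every raw answer is kept so that it can be checked later.

```python
import csv
import io
from typing import Callable


def run_tasks(tasks: list[dict], ask_model: Callable[[str], str]) -> list[dict]:
    """Run every small task and keep the prompt and raw answer for later checking."""
    results = []
    for task in tasks:
        raw = ask_model(task["prompt"])
        answer = raw.strip().lower()
        results.append({
            "unit_id": task["unit_id"],
            "question_id": task["question_id"],
            "prompt": task["prompt"],
            "raw_answer": raw,
            # Anything other than a clean yes/no is flagged for a human to look at.
            "answer": answer if answer in ("yes", "no") else "unclear",
        })
    return results


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; NOT an AI, just a keyword check for the demo."""
    return "Yes" if "boiling" in prompt else "No"


QUESTION = "Does this paragraph mention changes in health behaviour?"
tasks = [
    {"unit_id": 0, "question_id": "health_behaviour",
     "prompt": QUESTION + "\n\nPeople started boiling their water."},
    {"unit_id": 1, "question_id": "health_behaviour",
     "prompt": QUESTION + "\n\nThe market reopened in March."},
]
results = run_tasks(tasks, stub_model)

# Write the full audit trail as CSV (to a string here; a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(results[0].keys()))
writer.writeheader()
writer.writerows(results)
log_csv = buffer.getvalue()
```

Passing the model in as a plain function keeps the procedure itself independent of any one provider, and makes it easy to rerun the same tasks with a different model or prompt and compare the logs.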

The advantage of using AI in this way is game-changing: we can process enormous amounts of text in a reproducible and verifiable way, and experiment with and optimise different procedures at very little cost.

But this means that if you are using a chat interface like ChatGPT you have to do a lot of book-keeping, copying and pasting, or else use specialised software. Then reassemble the results in a transparent way (not using AI) to help answer the bigger question.
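The reassembly step can be nothing more than counting. A sketch, assuming the low-level results are yes/no answers per paragraph and that the evaluator has written down the rule in advance. Here the rule is that a theme counts as present in a document if at least two paragraphs mention it. The threshold, document names and data are invented for illustration.

```python
from collections import Counter

# Hypothetical low-level results: one row per (document, paragraph, question).
results = [
    {"doc": "interview_01", "unit_id": 0, "question_id": "health_behaviour", "answer": "yes"},
    {"doc": "interview_01", "unit_id": 1, "question_id": "health_behaviour", "answer": "no"},
    {"doc": "interview_01", "unit_id": 2, "question_id": "health_behaviour", "answer": "yes"},
    {"doc": "interview_02", "unit_id": 0, "question_id": "health_behaviour", "answer": "no"},
    {"doc": "interview_02", "unit_id": 1, "question_id": "health_behaviour", "answer": "yes"},
]

MIN_PARAGRAPHS = 2  # the evaluator's rule, stated openly; not chosen by the AI


def count_yes(rows: list[dict]) -> Counter:
    """Count 'yes' answers per (document, question) pair."""
    counts = Counter()
    for row in rows:
        if row["answer"] == "yes":
            counts[(row["doc"], row["question_id"])] += 1
    return counts


def theme_present(rows: list[dict], min_paragraphs: int) -> dict:
    """Apply the fixed threshold rule to every (document, question) pair."""
    counts = count_yes(rows)
    pairs = {(row["doc"], row["question_id"]) for row in rows}
    return {pair: counts[pair] >= min_paragraphs for pair in sorted(pairs)}


summary = theme_present(results, MIN_PARAGRAPHS)
for (doc, question_id), present in summary.items():
    print(doc, question_id, "present" if present else "not present")
```

Because the rule is ordinary code, anyone can read it, dispute it, or change the threshold and see exactly how the high-level answer shifts.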

The responsibility for how to break down the question(s) and reassemble the answer(s) is all yours, as the evaluator or evaluation team. So is the responsibility for checking the work of the AI, looking for bias, misunderstandings, etc.
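One simple way to check the AI's work is to answer a sample of the small tasks yourself and compare. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) between a human coder and the AI, and lists the items to re-read. The answers are invented for illustration, and what level of agreement is good enough remains your call.

```python
def percent_agreement(a: list[str], b: list[str]) -> float:
    """Share of items on which the two coders gave the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical spot-check: the same eight paragraphs coded by a human and by the AI.
human = ["yes", "yes", "no", "no", "yes", "no", "no", "no"]
ai = ["yes", "no", "no", "no", "yes", "no", "yes", "no"]

agreement = percent_agreement(human, ai)
kappa = cohens_kappa(human, ai)
disagreements = [i for i, (h, m) in enumerate(zip(human, ai)) if h != m]

print(f"agreement={agreement:.2f} kappa={kappa:.2f} re-read items: {disagreements}")
```

The list of disagreements is often more useful than the numbers: reading those paragraphs shows whether the AI misunderstood the question or whether the question itself was ambiguous.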