LLM use
Some are of the opinion that LLM use is a wholly positive thing for students and teachers alike. According to these AI utopians, LLMs represent a democratisation of academia, in which students use LLMs to understand material more quickly and more deeply, to write better essays unburdened by circumstances such as being a non-native speaker, and so on. In my experience, this is laughably inaccurate: students have repeatedly used LLMs as a cheap and easy way to cheat in essays and dissertations. They do not use them to increase their knowledge but, in fact, to decrease it. Others have discussed how LLM use has affected a number of different sectors, where 'AI slop' is seriously submitted to open source software projects, academic journals, legal cases, etc.
However, I am not a luddite; I do not think that LLM use is an absolute evil. I use LLMs almost every day: to help me understand a new topic; to help me write low-quality text like the draft of an email or of presentation slides; to help me write computer code; to help me summarise text; etc. LLMs are a fantastic tool for many, many tasks. However, it is important to understand why LLMs are useful: in what context? for what purpose? with what level of fidelity? LLMs are fantastic at what they were built to do:
LLMs are an absolutely fantastic resource for vectorising plain text and generating plain text related to their training set. The problem comes when we expect them to do things they were not designed to do. LLMs were not designed to 'think' or 'understand', and they cannot do so. They can produce a simulacrum of understanding by repeating words and phrases from their training set with a high probability of statistical correlation with the input prompt. For many low-stakes tasks this is a huge advantage, and it absolutely makes my work easier to do. In many ways LLMs are similar in fidelity to a Wikipedia article (and indeed Wikipedia is the source of a large amount of the training text that LLMs process): they are great at quickly generating well-known types and examples, but one should expect that a large number of the sources are unreliable, incorrect, biased, or simply out of place. They are therefore highly useful as a first approximation, or as an introduction to a subject of which you have zero current knowledge; if I wanted to understand the outline of quantum mechanics, in years past I would have turned to its Wikipedia page; now I might ask ChatGPT. From writing a simple Python script applying some known data-cleaning technique, to generating a summary of a 10,000-word article I'm not sure I want to read, LLMs make short work of tedious tasks. In many ways they simply expand what computers are already good at: they can do a million repetitions of a tedious task without complaint in less than a second. This is absolutely not trivial.
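To give a concrete sense of the 'tedious task' category, the data-cleaning script mentioned above might look something like this (a hypothetical sketch; the file and column names are invented for illustration):

```python
# A hypothetical data-cleaning script of the well-known kind an LLM
# generates easily: drop duplicates, normalise a text column, parse dates.
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # invented file name

# Standard, well-trodden cleaning steps
df = df.drop_duplicates()
df["country"] = df["country"].str.strip().str.title()
df["response_date"] = pd.to_datetime(df["response_date"], errors="coerce")
df = df.dropna(subset=["response_date"])

df.to_csv("survey_responses_clean.csv", index=False)
```

Nothing here is novel; every line has thousands of near-identical examples in the training set, which is precisely why an LLM produces it quickly and correctly.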
However, it quickly becomes apparent that attempting to use LLMs to do things that they were not designed to do is a fool's errand. They cannot evaluate the logical consistency of an argument, except in extremely simple cases. They cannot evaluate empirical evidence in research except in the simplest cases. They cannot evaluate the usefulness of a particular methodology or research technique except in the simplest cases. They cannot evaluate a student's essay or dissertation in any useful way. They cannot write a novel program of more than about 500 lines without making many mistakes; novel programs of over 1,000 lines of code become mired in repeated failures of understanding, repeated unreasonable choices of software stack and architecture, much wasted time creating more bugs to fix existing bugs, repeated misunderstanding of instructions, claims to have solved problems that remain unsolved, mis-naming of variables previously named differently elsewhere, etc.
Similarly, LLMs cannot be used to write any text that is in any way novel, or otherwise not simply an example of a well-known type. If emails are written using LLMs, you will spend almost as much time revising and editing them to take out the linguistic tics as you would have spent writing them in the first place. Simple emails to colleagues and students are so short that this is not worth it. Larger, more complex texts may sometimes benefit from judicious use of LLMs, where you believe that some major part of the text follows a well-known type; any novel parts may then be edited in later. This can be done with much longer emails that are expected to follow a strict structure because they contain detailed step-by-step instructions, lay out a number of options, etc. The parts that the LLM can do easily are those wordy, structure-laden parts that resemble a template or boilerplate text. Writing those parts yourself will not save you time or lead to any further clarity: they are boilerplate because they are the same everywhere. Similarly with presentation slides: where the slides are introductory, and follow some well-known theory or example, these tedious tasks can be sped up with the judicious use of LLMs. As I explain elsewhere, I like to write presentation slides using plain-text software instead of Microsoft PowerPoint; the boilerplate in such slides is a great use case for LLMs. Where you are presenting your own research, for example, LLMs are great at summarising long texts into just a few slides. As with the other examples, however, where your presentation gets more specific or complex, and you want to present it in a particular idiosyncratic style, you need to edit more and more of the slides. You may reach a point where LLM-generated text becomes more of a burden than a useful tool.
What about writing essays, dissertations, articles, grant applications, research proposals, etc.?
Unsurprisingly, these longer texts may also benefit from some judicious use of LLMs, but the LLMs quickly reach their limitations. A research outline may be quickly drafted from one's notes; this is simply summarisation of what is already there. You can quickly generate a research proposal outline from your detailed notes on the literature review, methodological direction, specific empirical and theoretical challenges, etc. But beyond summarisation, there is not much that LLMs can do to help you. And this is where some students go wrong. Since LLMs seem so knowledgeable on every topic imaginable (they have consumed and incorporated pretty much the whole internet in their training set), they can quickly write 2,000 words on any topic you give them. For longer university dissertations of 10,000 words, the student will have to prompt the LLM by chapter or section, but it can generate something that looks realistic; it looks like a master's dissertation on the chosen topic. But it is not.
The essay or dissertation produced will resemble a dissertation or article because it looks right; it will resemble academic language in the correct tone because the model has trained on many academic articles. It will generate text that seems broadly okay: the text may contain a number of realistic 'facts' that may actually be correct (if they are well-known enough); it will generate a research methodology that may be broadly realistic (this varies widely by research field; I have found that social science methods are a particular weakness of LLMs); it may produce a research analysis that looks broadly okay (another relative weak point of LLMs in social science); and it may produce a findings section that looks broadly right (this tends to follow the broad consensus of the field, rather than actually following from the claimed "research" sources and analysis). If a marker is particularly rushed (this is almost always the case), or less knowledgeable about LLM use (many non-technical colleagues do not keep up with the emerging capacities of this field), they are likely to judge that the essay they just read was bland, possibly with glaring mistakes in methodology or analysis, overly flowery language, and unclear findings, but broadly acceptable for a just-pass score. They will mark the essay accordingly and move on to the next one.
This is exactly the situation I want to avoid. The student did not write the essay at all: neither in the sense of writing each word, sentence, or paragraph, nor in the sense of planning, understanding, or thinking about the challenges and issues with particular theories or empirical data. When the marker does their job they are being cheated; they are not marking the student's work at all, they are marking ChatGPT. The student has only learned one thing: they do not need to engage with the course whatsoever; they merely need to feed the essay instructions into their LLM of choice and copy/paste the resulting text into a Word document. Even worse (is this possible?), perhaps the student does not even need to lift a finger to copy/paste: they can simply request that the LLM generate a finished .docx document for them. The LLM accomplishes this with a well-known Python library that easily injects text into a .docx document.
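To show how low this bar is, a minimal sketch of the kind of thing meant here, using the widely used python-docx library (the text and file name are placeholders):

```python
# A minimal sketch of turning generated text into a finished Word file
# with python-docx (pip install python-docx). The values are placeholders.
from docx import Document

generated_text = "...chapter text produced by the LLM..."  # placeholder

doc = Document()
doc.add_heading("Dissertation", level=0)
for paragraph in generated_text.split("\n\n"):
    doc.add_paragraph(paragraph)
doc.save("dissertation.docx")
```

A dozen lines, and the student need never even open Word.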
Even worse than the cheating student (make no mistake, using LLMs to write essays is just as much cheating as paying for an essay, or plagiarising from another text) is the effect on everyone else. Even if (and this is a big if) only one student in a single class engages in LLM cheating, their barely-pass score affects other students who did poor work but did not use an LLM. What if markers were instructed to mark on a curve? The poor student who might otherwise have scraped a pass will now fail, not because their work is poor, but because it looks poor when compared to the correct grammar, technical vocabulary, and reasonable structure of the LLM essay, even though the poor student may have had the seed of a good idea buried inside that poor essay. In a class full of weak students, the LLM essay may actually receive a good mark; again, not because it is actually good, but because overly rushed markers who don't understand LLMs will look at the technical vocabulary and perfect grammar and judge it more positively than perhaps they should. This is a particular problem in the social sciences and humanities, where a beautifully written essay without a clear or novel idea may be marked higher simply for looking more elegant than its peers. In such circumstances, honest students may be actively disadvantaged compared to dishonest students. We may be putting ourselves in the situation whereby honesty not only doesn't pay, it is a direct and active impediment. We as institutions and lecturers are in effect promoting cheating. This is not why I got involved in education; I flatter myself that my teaching, inadequate as it is, may sometimes, occasionally, actually change my students' lives for the better. Widespread use and tacit acceptance of LLM cheating actively makes the world a worse place to live.
Detecting LLM plagiarism
So, what can be done about this problem? There are considerable challenges to creating a software detection system for LLM-generated text itself. The only popular model for this is to use an LLM trained on LLM-generated texts to recognise LLM-generated texts. It is extremely unreliable; I don't think that a responsible institution or marker should use such tools at all. However, there is an attractive alternative model: instead of detecting LLM-generated text, analyse the writing process itself. Since Microsoft Word is, for better or worse, the standard writing tool for most scholars and students in the social sciences, detection software can examine how a Word file was written, focusing on copy/paste behaviour rather than on the generated text itself.
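One hypothetical signal along these lines (an illustrative sketch, not a description of any existing product): Word stamps revision save identifiers (RSIDs) into a .docx for each editing session in which the document is saved, so an essay pasted in wholesale and saved once tends to carry far fewer distinct RSIDs than one drafted over weeks. The threshold below is invented for illustration:

```python
# A rough sketch: count the distinct RSIDs (revision save IDs) that Word
# records in a .docx. Very few RSIDs suggests very few editing sessions,
# which is consistent with (but does not prove) wholesale pasting.
import re
import zipfile

def count_rsids(path: str) -> int:
    """Count distinct RSID values in the main document part of a .docx."""
    with zipfile.ZipFile(path) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    # RSIDs appear as attributes such as w:rsidR="00AB12CD"
    return len(set(re.findall(r'w:rsid\w*="([0-9A-Fa-f]{8})"', xml)))

if count_rsids("essay.docx") < 5:  # invented threshold, illustration only
    print("Very few editing sessions recorded; worth a closer look.")
```

Of course, RSIDs can be stripped or faked, and an honest student who drafts in another tool and pastes into Word would trip this heuristic too; it is one noisy signal to prompt a conversation, not a verdict.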
- These simple examples where LLMs can display thinking-like or logic-like processes are often ones where the training set has amply provided texts in which these simple logical problems are explicitly pointed out. They may also include particularly simple examples not in the training set where there is a textual break: the methodology section says I will use technique x, but in the results section it becomes apparent I used technique y; the author may claim to have used primary sources, but in the analysis section it is clear that only secondary sources have been used; etc. These examples do not require understanding what the author of the text is trying to do, but simply comparing two sentences, sections, or theories for obvious inconsistencies. ↩
- i.e. a program which is not simply an example of a well-known type, such as a full-stack website in JavaScript, or a well-known game such as Breakout in Python. Where AI companies claim that coding is a solved problem for LLMs, they are referring to these extremely well-known programs, with millions of online examples in repositories on GitHub that are part of their training set. ↩