17 June 2025

The AI Learning Centre: Researching Reinforcement Fine-Tuning

A case study in using Reinforcement Fine-Tuning for faster and fairer student placements.


All prospective students are required to take the University of Nicosia New English Placement Test Online (NEPTON). The purpose of this test is to place students in the appropriate level of English in order to support their academic studies at the University. The NEPTON is not a university entrance examination; previous academic performance (e.g. School Leaving Certificate) is taken into consideration with regard to university entrance requirements.

This required testing process generates hundreds of student essays that must be evaluated by our placement team. The evaluation is time-critical, labour-intensive, and—despite best efforts—susceptible to small variations in marking. This year we asked a sharper question:

Could a compact language model, trained only on our own data and rules, deliver immediate, consistent placement recommendations?

To explore that possibility we turned to Reinforcement Fine-Tuning (RFT)—a training method that rewards the model for meeting a precise objective rather than merely copying past answers.

Important note

The experiment described below is research only. It has not been deployed in live placement, and any future use would require rigorous human oversight and formal approval.

What makes RFT different?

Conventional fine-tuning is an exercise in imitation: we show a model thousands of paired prompts and “gold” answers, and it learns to echo those answers back. RFT, by contrast, lets the model try an answer first. A small grading script scores that attempt:

  • +1 for a perfect match
  • 0 for a wrong but well-formed answer
  • -1 for any reply that breaks the required format

The model then adjusts its own weights to chase higher scores. RFT is ideal when the product you need is governed by an unambiguous rule—in our case, a single JSON object that names a course level and nothing more.
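
As a concrete illustration, here is a minimal Python sketch of that kind of grader. The label names, the "level" field, and the function signature are placeholders for illustration, not the exact script used in the experiment.

```python
import json

# Placeholder labels; the real experiment uses the university's five course levels.
VALID_LEVELS = {"LEVEL-1", "LEVEL-2", "LEVEL-3", "LEVEL-4", "LEVEL-5"}

def grade(model_reply: str, gold_level: str) -> float:
    """Score one attempt: +1 exact match, 0 wrong but well-formed, -1 broken format."""
    try:
        parsed = json.loads(model_reply)
        level = parsed["level"]            # expected schema: {"level": "<label>"}
        if level not in VALID_LEVELS:
            return -1.0                    # valid JSON, but not one of the allowed labels
    except (json.JSONDecodeError, KeyError, TypeError):
        return -1.0                        # not valid JSON, or missing the "level" field
    return 1.0 if level == gold_level else 0.0

# A schema-conforming, correct answer earns the full reward; free text earns the penalty.
print(grade('{"level": "LEVEL-3"}', "LEVEL-3"))   # 1.0
print(grade('{"level": "LEVEL-1"}', "LEVEL-3"))   # 0.0
print(grade('Probably level three?', "LEVEL-3"))  # -1.0
```

Because the reward depends only on that one JSON object, the same script can grade every attempt identically, with no human in the loop.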

Preparing the data—step-by-step
  1. Matching IDs. Our quiz export lists students by username, while the registry sheets store the same students under student ID. We converted both strings to lowercase and aligned them one-to-one (steps 1 to 3 are sketched in code after this list).

  2. Cleaning the essays. Stray HTML such as <br> or empty <p>&nbsp;</p> tags contributes exactly zero meaning. We stripped them all, leaving only the students’ sentences.

  3. Homework vs. exam piles. We set aside 258 essays for the model to learn from and 88 essays for it to face later as an unseen test. Think of it as homework and a final exam.

  4. Wrapping each example for RFT. RFT requires the final turn in each training example to come from the user, so that the model generates the answer itself during training and the grader can score it (see the message-format sketch after this list).
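
To make steps 1 to 3 concrete, here is a rough pandas sketch. The file names, column names, and random seed are assumptions for illustration; they are not the actual pipeline.

```python
import re
import pandas as pd

# Hypothetical file and column names.
quiz = pd.read_csv("quiz_export.csv")          # columns: "username", "essay"
registry = pd.read_csv("registry_sheets.csv")  # columns: "student_id", "level"

# Step 1: normalise identifiers to lowercase and align the two sources one-to-one.
quiz["key"] = quiz["username"].str.lower().str.strip()
registry["key"] = registry["student_id"].str.lower().str.strip()
merged = quiz.merge(registry, on="key", how="inner", validate="one_to_one")

# Step 2: strip stray HTML such as <br> tags and empty <p>&nbsp;</p> paragraphs.
def clean_essay(text: str) -> str:
    text = re.sub(r"<p>(?:&nbsp;|\s)*</p>", " ", text)  # drop empty paragraphs
    text = re.sub(r"<[^>]+>", " ", text)                # drop any remaining tags
    text = text.replace("&nbsp;", " ")
    return re.sub(r"\s+", " ", text).strip()

merged["essay"] = merged["essay"].astype(str).map(clean_essay)

# Step 3: carve off an unseen test pile and keep the rest for training.
test = merged.sample(n=88, random_state=42)
train = merged.drop(test.index)  # the remaining 258 essays
```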
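
Step 4 then wraps each cleaned record as a short chat whose final turn is the student’s essay, so the model has to produce the answer itself and the grader can score it. The JSONL layout below follows the common chat-message convention; the exact field names required by a given RFT service may differ.

```python
import json

SYSTEM_PROMPT = (
    "You are an English placement assistant. Read the essay and reply with a single "
    'JSON object of the form {"level": "<one of the five course levels>"}.'
)

def to_rft_record(essay: str, gold_level: str) -> dict:
    """Wrap one essay as an RFT example: the last turn is from the user."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": essay},  # the model responds to this during training
        ],
        "gold_level": gold_level,                # read by the grading script, never shown to the model
    }

# `train` comes from the previous sketch.
with open("rft_train.jsonl", "w", encoding="utf-8") as f:
    for _, row in train.iterrows():
        f.write(json.dumps(to_rft_record(row["essay"], row["level"])) + "\n")
```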

Early learning curves

After only four optimisation steps the model moved from a negative reward to a clear positive trajectory and eliminated every formatting error. Two results capture the key moments:

  • The training-reward curve climbs from –0.07 to +0.19.

  • The grading-error rate collapses to 0 % by step 2.

Even at this early checkpoint the model demonstrates a working grasp of the five valid labels and the exact JSON schema—crucial prerequisites for trustworthy deployment.

Practical implications
  • Time saved: Marking 258 essays now takes seconds of GPU time instead of many staff hours.
  • Consistency: One unambiguous rubric, no fatigue bias, identical treatment for every student.
  • Transparency: Each output is a single line of JSON plus a numeric reward score—trivial to audit.

Conclusion

Within minutes of reinforcement fine-tuning, a compact language model learned to read student essays, obey a strict JSON schema, and assign plausible course levels, all without imitating a single gold answer. The preliminary reward curves are already trending steadily upward, and the infrastructure is light enough to run overnight on standard hardware. This proof of concept suggests a promising path towards faster, fairer placement for every incoming cohort.
