Large Language Models and Assessment Development – Finetune Generate, ChatGPT and Beyond

Charles Foster and Jesse Hamer

Introduction

Since 2021, at Finetune we have seen the potential of Large Language Models (LLMs) to transform the way professionals in education & assessment work. The dramatic pace of progress in this space means that a concept can go from research toy one week to viral product the next.

It was no surprise, then, to see how enthusiastic the response to ChatGPT was: with a single demo, everyone understood that we are standing on the threshold of something significant. Given the present excitement and uncertainty, one might wonder: how does Finetune Generate fit into this landscape? If I could just ask a generic chatbot to do my writing for me, why would I need anything else?

We like to think of Large Language Models as foundation models: AI systems whose extensive and diverse training lets them act as the bedrock for a wide range of use cases. A few organizations, including Anthropic, EleutherAI, and OpenAI (the developer of ChatGPT), train these giant models and make them available for others to use. But the models themselves are merely the base layer: they have much greater potential when woven into a larger system tailored to a specific application. As with other general-purpose technologies such as the Web, it may take a whole generation of researchers and entrepreneurs building systems on top of LLMs for them to realize their full potential. In an interview with Ezra Klein, OpenAI CEO Sam Altman expressed a similar sentiment:

What I think we’re not the best in the world at, nor do we want to really divert our attention [from], are all of the wonderful products that will be built on top of [large language models]. And so we think about our role as to figure out how to build the most capable A.I. systems in the world and then make them available to anybody who follows our rules to build all of these systems on top of them.

Altman, 2023

By combining LLMs with more traditional technologies like knowledge bases and human-in-the-loop interfaces, we can create mature technology stacks, or generative applications, that channel the capabilities of LLMs into smart tools for all sorts of application areas. Generate and ChatGPT are two early examples.

With this framework in mind, let’s compare ChatGPT and Finetune Generate as generative applications both built on GPT-3, from the standpoint of item development.

Design Goals

Both ChatGPT and Finetune Generate are intended to provide a more intuitive interface for users to interact with generative models like GPT-3. Beyond that, the two applications are quite different. OpenAI has a mission to build safe, general-purpose AI systems for all, and built ChatGPT to give the general public a taste of what language models are capable of doing with natural language, and to serve as a sandbox for builders to test out new ideas.

At Finetune, although we do engage with the broader research community around language model innovations (see our collaboration with OpenAI on improvements to semantic search), our aim with Generate was not primarily to build new general-purpose systems, but rather to build the best tool possible for AI-assisted item writing. That is why Generate is built specifically with item writers in mind, around their best practices, language, and workflows. All of our design choices were grounded in engagement with a wide variety of early adopters. Each Generate model we build is designed to reflect the unique structure of the assessment it serves, and gives users the specific controls they need for the task. Moreover, entire teams of item writers can collaborate on developing items using Generate, with built-in functionality for permission management and structured export into formats like QTI.
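QTI (Question & Test Interoperability) represents each item as structured XML that downstream assessment platforms can ingest. As a rough illustration of what structured export involves (this is not Finetune's actual export code, and the item content and helper function are hypothetical), a multiple-choice item can be serialized to a minimal QTI-style element using Python's standard library:

```python
import xml.etree.ElementTree as ET

def item_to_qti(identifier, stem, choices, correct_id):
    """Serialize a multiple-choice item as a minimal QTI-style XML string.

    Illustrative only: a real QTI 2.x export also carries namespaces,
    response processing rules, and schema validation.
    """
    item = ET.Element("assessmentItem", identifier=identifier)
    decl = ET.SubElement(item, "responseDeclaration",
                         identifier="RESPONSE", cardinality="single")
    correct = ET.SubElement(decl, "correctResponse")
    ET.SubElement(correct, "value").text = correct_id
    body = ET.SubElement(item, "itemBody")
    interaction = ET.SubElement(body, "choiceInteraction",
                                responseIdentifier="RESPONSE", maxChoices="1")
    ET.SubElement(interaction, "prompt").text = stem
    for choice_id, text in choices.items():
        ET.SubElement(interaction, "simpleChoice",
                      identifier=choice_id).text = text
    return ET.tostring(item, encoding="unicode")

xml = item_to_qti(
    "item-001",
    "Which layer of a generative application holds domain knowledge?",
    {"A": "The pretraining corpus",
     "B": "A customer-specific knowledge base",
     "C": "The chat interface"},
    correct_id="B",
)
```

Because the format is machine-readable, exported items can flow directly into item banks and delivery systems without manual re-entry.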

Specificity

Large language models go through an initial training phase called pretraining, in which, over one long session, they learn from millions of pages of web text, books, and other sources. Because this learning is computationally expensive, a model's knowledge is typically frozen in place afterwards. Since it is a thin dialogue wrapper on top of GPT-3, ChatGPT likewise has a fixed knowledge base that cannot be amended. If, say, a technician wanted help with some proprietary system, such a model would probably not be helpful, because it has no way of learning new subject matter.

Finetune’s partners run the gamut from K-12 to higher education to licensure & certification, and span a wide variety of domains. As such, it is critical that the models we build for them learn from their unique content, even when that content is highly specialized or novel, and can be updated with new materials as they become available.

To make this possible, our AI R&D team has refined our own methods to efficiently incorporate new knowledge into language models and to target them to the specific guidelines of an assessment. Moreover, Generate dynamically learns over time to better target items to the specific content and style of each customer’s tasks. Throughout this year we plan to roll out several more features that will continue to improve the controllability and adaptability of our models, from key phrase targeting to fine-grained control over cognitive complexity and beyond.

Security

As an experimental demo, ChatGPT is meant to elicit feedback on how people interact with language models, so that OpenAI can improve the fundamental technology backing its APIs. Because of this, when users talk with ChatGPT, those interactions are stored and may make their way into future training datasets to help train the next generation of models. That means that if you develop an assessment item with ChatGPT, future models may know about it or even have memorized it, potentially exposing your items and item style in ways you never intended and compromising their security.

Security is a key concern within item development.

Generate keeps items secure and walled off, with each customer accessing only their own models. Even within a single customer account, users can be restricted to accessing only specific generated items. With Generate, customers always own the items they produce, whether they are trying out an initial model or have adopted the tool at scale.

Trust & Support

Much of what makes it difficult to use an LLM productively is that it is fundamentally random: ask it the same question twice and it may give you two different answers. This runs against what we usually expect from our tools: we count on them to be reliable. This leads to one of the most persistent problems with ChatGPT and other LLM tools: it is hard to trust their outputs when you do not know why those outputs were chosen. Were they based on facts the model recalls, or falsehoods the model made up, or even plagiarized from some unseen source?
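This randomness is not a bug but a design choice in how output is produced: at each step the model scores every possible next token, and the application samples from the resulting probability distribution, usually with a temperature parameter controlling how spread out the sampling is. A toy sketch in Python, with made-up token scores, illustrates why the same prompt can yield different completions:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one next token from a softmax over raw scores (logits).

    Toy illustration: a real LLM scores tens of thousands of tokens,
    but the sampling principle is the same.
    """
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for three candidate next words.
logits = {"valid": 2.0, "reliable": 1.5, "novel": 0.5}

# At temperature 1.0, repeated calls can return different tokens.
samples = [sample_next_token(logits, temperature=1.0) for _ in range(20)]

# As temperature approaches zero, sampling approaches a deterministic argmax.
greedy = [sample_next_token(logits, temperature=0.01) for _ in range(20)]
```

Driving the temperature toward zero is one way an application trades diversity for repeatability, but it does nothing to guarantee that the most probable answer is the factually correct one.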

The standards for trust within education & assessment are high, much higher than for casual chatbots. Customers want to know that items they produce through Generate are truly novel, are based on their own materials, and are valid.

Our Measurement and AI R&D teams work with each customer to create models tailored to their needs, and to incorporate their feedback into ongoing model improvements.

We also perform manual & automated checks to verify that the suggestions Generate makes match the customer’s specifications. We will soon be rolling out a new feature that will allow users to easily cross-reference generated items with reference materials, so that they can have immediate reassurance that the items they produce are grounded in fact.
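One common family of techniques for this kind of cross-referencing compares a generated item against reference passages in a shared vector space and flags items that match nothing well. Production systems typically use learned embeddings from a semantic search model; as a minimal sketch of the idea only (the function names, threshold, and example texts here are our own inventions, not Generate's implementation), a bag-of-words cosine similarity check looks like this:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two texts as bag-of-words vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_reference(item_text, references, threshold=0.3):
    """Return the best-matching reference passage, or None when no
    passage clears the threshold (i.e., the item may be ungrounded)."""
    scored = [(cosine_similarity(item_text, ref), ref) for ref in references]
    score, best = max(scored)
    return best if score >= threshold else None

references = [
    "Pretraining fixes a model's knowledge at training time.",
    "Items should be reviewed by subject matter experts before use.",
]
item = "A model's knowledge is fixed at pretraining time."
match = most_similar_reference(item, references)
```

A `None` result routes the item to a human reviewer rather than rejecting it outright, keeping subject matter experts in the loop.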

Conclusion

This is an exciting time in which hundreds of generative applications will be built, all pursuing different potential use cases for LLMs. As you explore them, if you care deeply about the quality of assessment in education, certification, and licensure, we recommend always keeping the following questions in mind:

  • Who is this application designed for?
  • Is the model this application uses trained specifically for what my organization needs, including our security needs?
  • How will the data I provide be used?
  • Do I want to invest the time and money to make a raw general-purpose model usable (e.g., by building the appropriate UI) and trusted by our Subject Matter Experts (SMEs), so that it can be integrated into our workflows and high-stakes use cases?

We are still in the early days of this profoundly impressive technology, but the range of capabilities that generative applications will enable across multiple industries is already becoming apparent. So, too, are the voices of caution raised by Gary Marcus of NYU and others.

At Finetune we are very excited to continue showcasing more features in our third year that will make Generate even more performant, even more reliable, and even more helpful across the entire learning and assessment landscape.