Working with Language Models: Principles Over Hype

Sara Vispoel, Brad Bolender, Charles Foster, Jesse Hamer, Sierra Magnotta, and Safat Siddiqui


In recent months, we have witnessed an explosion of interest in large language models (LLMs) such as GPT-4, and in how Finetune is harnessing the technology. Everyone is on the lookout: established teams looking to test out emerging tech, rising startups looking to turn research into transformative products, and fly-by-night operators looking to make a quick buck in this gold rush. With this explosion of interest, however, we are also seeing an explosion of confusion. People are asking: “How do I navigate this new terrain?”, “What should I be looking out for?”, “How can I get real value out of this technological paradigm shift?”

Having worked with LLMs since well before the hype, we’d like to offer some clarity. We have seen how powerful tools that integrate this technology can be. Through pre-training on millions of pages of text to learn complex conceptual associations, plus additional, more granular guidance (through methods like “fine-tuning”, “reinforcement learning from human feedback”, and “prompt engineering”), transformer models can be made useful for all sorts of tasks. But newcomers to the space are often surprised to find that actually making LLMs useful for real work is not easy, especially in areas where quality counts.

A principled approach

At Finetune, for several years we have leveraged LLMs to augment our partners’ content generation and tagging workflows. Through those partnerships, and through the hard lessons that come with real-world experience, we’ve found that the technology is most impactful when combined with a principled framework. Doing it right, rather than merely doing it fast, is what matters.

Why not just do it the easy way? Well, say you just ask GPT-4 to compose a new “Shakespearean” sonnet, or to write a test question on a particular topic. At first glance, the output will often seem acceptable. But remember: these models act like skillful impersonators. Look past the surface of that sonnet and you’ll see a hollow core: most of Shakespeare’s underlying beliefs, intellect, and attitude are completely left out. Likewise, inspect that test question and you’ll see major issues: no attention paid to any underlying construct, or to how one might optimally sample the domain to support inferences of proficiency, or to any purpose driving the test. In sum, it lacks psychometric validity!

To build in validity and everything else that professionals in our industry expect, one needs to go beyond the raw language model through a synthesis of measurement and learning science, psychometrics, and AI.

Here are some core principles of what that synthesis looks like: 

  1. Design for the workflow, not for the AI
  2. Center the human in the loop
  3. Build trustworthiness through transparency

Design for the workflow, not for the AI

Merely having an LLM integrated into an application is not enough: the focus has to be on giving users the AI tools that best support their work. Be wary of providers that boast an integration with one particular model, and seek out ones that keep up with AI progress, especially by being LLM-agnostic. After all, particular models come and go: GPT-3 had its day in the sun and then became old hat. Today there is a wealth of options, both well-known, like GPT-4 and Claude, and lesser-known, such as GPT-NeoX, FLAN, and fine-tuned models.
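To make the LLM-agnostic idea concrete, here is a minimal sketch (illustrative names only, not Finetune’s actual architecture): workflow code depends on a small provider interface, so the underlying model can be swapped out as the field moves, without touching the rest of the application.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Minimal provider interface: the application codes against this,
    never against any one vendor's API directly."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...


class StubGPT4Provider(LLMProvider):
    # Stand-in for a real vendor client; production code would call
    # the vendor's API here instead of returning a canned string.
    def generate(self, prompt: str) -> str:
        return f"[gpt-4 draft for: {prompt}]"


class StubClaudeProvider(LLMProvider):
    def generate(self, prompt: str) -> str:
        return f"[claude draft for: {prompt}]"


def draft_item(provider: LLMProvider, topic: str) -> str:
    """Workflow code depends only on the interface, so swapping models
    is a one-line change at the call site."""
    return provider.generate(f"Write a practice question about {topic}.")
```

Because `draft_item` only knows about `LLMProvider`, retiring one model in favor of another never ripples into the workflow logic.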

This focus on the workflow is why, at Finetune, we design AI models to fit the work they need to support. As soon as we begin work with a customer, our Measurement team collects key artifacts to describe, organize, and prioritize the key constructs for their assessments and the design patterns required to measure them. This results in a structured set of test and item specifications, which our AI scientists then incorporate into the model development process. Before release, the Measurement and AI teams go through several iterations of quality assurance to confirm that the model outputs test the correct constructs at the appropriate levels of cognitive complexity, and that the items adhere to both test writing guidelines and best practices in assessment.

Center the human in the loop

While many pay lip service to the value of user input, few actually live that out. Subject matter experts (SMEs) should be equal partners in model development, alongside data scientists and other stakeholders. Also, validation should not stop at deployment. LLMs like GPT-4 stop learning after their initial training, so application developers need to find ways to give control to the user and to keep up with their users’ needs. Even out in the field, AI models should receive continual improvements, so that the user is always in the driver’s seat.

For example, feedback from SMEs helps us determine which constructs should be measured by AI-generated content, what parts of the content they most need help with, what constitutes high quality, and how the model improves over time. We meet regularly with customers throughout model building to discuss progress and areas for improvement and to solicit SME feedback. Also, with a feature we call Learn, SMEs are able to flag the best AI-generated items, feeding them back into the AI self-improvement flywheel. Rather than growing stale, through SME feedback your models can get better over time.
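As a rough illustration of that flywheel (hypothetical names; this is not the actual implementation of Learn), SME-flagged items can simply accumulate into a dataset that seeds the next round of model improvement:

```python
class FeedbackStore:
    """Illustrative store: SME-approved items accumulate into the next
    fine-tuning dataset, so the model improves instead of growing stale."""

    def __init__(self):
        self._approved = []

    def flag_as_good(self, prompt: str, item: str) -> None:
        # An SME marked this generated item as high quality.
        self._approved.append({"prompt": prompt, "completion": item})

    def training_examples(self) -> list:
        # Export approved pairs in a simple prompt/completion format
        # suitable for a later fine-tuning run.
        return list(self._approved)
```

The point of the sketch is the loop itself: every SME judgment is captured as future training signal rather than discarded after review.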

Build trustworthiness through transparency

Without transparency, how can you trust the output of an LLM? These models are often opaque and prone to making confident false statements. Any LLM-supported tool should have built-in capabilities to trace the model outputs back to a trusted source. Moreover, the need for trust goes beyond trust in the AI system, encompassing trust in data security and privacy.

This trust has been quite important to us. For Generate, it motivated us to build features like AI-assisted reference lookup and the ability to do generation directly from reference materials. Likewise, on our AI tagging product, Catalog, we had to develop methods for having our AI systems make tagging decisions systematically and with explanations, including a Rationale and Catalog Score breakdown. Just as a trusted human SME who assigns a tag should be able to explain the thought process behind the decision, so too should a trusted AI system. On the data security & privacy front, the models we develop are isolated on a per-customer basis and are only tuned on the data from that customer. That way, the models can learn the ins and outs of how a specific customer does their work, without fear of leakage.
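In code, a traceability requirement like this can be as simple as refusing to surface any generated item that lacks both a rationale and a source reference. A minimal sketch (hypothetical structures, not Catalog’s actual data model):

```python
from dataclasses import dataclass, field


@dataclass
class GeneratedItem:
    text: str
    rationale: str  # the system's explanation for why the item was produced
    source_refs: list = field(default_factory=list)  # trusted reference passages


def is_traceable(item: GeneratedItem) -> bool:
    """Accept an item only if it carries a non-empty rationale and at
    least one reference back to a trusted source."""
    return bool(item.rationale.strip()) and len(item.source_refs) > 0
```

A gate like this makes the transparency requirement enforceable: untraceable output is rejected before a user ever sees it.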


Aside from the remarkable qualitative improvements which LLMs have enjoyed in recent months, the improvements to accessibility have been equally astounding. We have entered an era where AI expertise is no longer a barrier to entry for interacting with LLMs. That said, the difference between interacting with an LLM and building a quality product with an LLM is as stark as the difference between having a frying pan and delivering a 5-star dining experience at scale: the latter is only possible with a team of dedicated experts implementing a principled design centered around user experience.

At Finetune, we recommend three simple, yet we believe necessary, principles that any product (not just Generate or Catalog) should adhere to in order to effectively leverage the power of LLMs. By designing for the workflow, instead of the AI, one ensures that the quality of the user experience is prioritized above the marketability of whichever LLM happens to have hype on that particular day. By centering the human in the loop, one acknowledges that regardless of the power of the particular LLM, the expertise of the SME is always required for leveraging LLMs at scale. By building trustworthiness through transparency, one demonstrates respect for the customer by emphasizing transparency in both LLM decision-making and data security. Underneath each of these principles is a central theme: an LLM, like any AI model, is a tool. At Finetune, we are proud not only of our expertise in AI and Measurement, but also of our nearly three years of experience in leveraging these powerful AI tools to deliver a high-quality user experience: one designed to amplify, rather than replace, the expertise of our customers.