Generative AI – Moving Beyond Proof of Concept

Since the release of ChatGPT in November 2022, Generative AI has been hailed as the future of pretty much every interaction between humans and technology. While it is undoubtedly an incredible step forward in AI development, modern history is littered with similar claims of inflection points, many of which failed to live up to their hype. On my own journey to learn more about this potentially transformative technology, I found plenty of proofs of concept that never went far beyond the very basics. So how do we begin to harness and productionise this new advancement?

When to Gen

The first question I needed to answer was: what is the right use case for GenAI and Large Language Models (LLMs, the machine learning models which power GenAI), and when is ‘traditional’ Machine Learning (ML) the more appropriate approach? For many, GenAI is seen as a way of cutting out the (M)idd(L)e man, i.e. removing the need for Data Scientists to be involved in the development of our AI applications at all. The renowned computer scientist and all-round AI guru Andrew Ng estimates that in a typical ML pipeline the ML itself makes up only about 5-10% of the overall flow, yet still demands thousands of hours and iterations to produce a decent working model. Why can’t we cut that out and just use pre-trained LLMs from the outset? Isn’t the point that they can turn their hands to anything we ask them to do?

In short, no. 

LLMs on their own are designed to come up with an answer, but it doesn’t actually have to be the right one. They are people pleasers by design, and we’ve all seen examples of users fairly easily talking them out of the most basic facts. Hallucinations aren’t a by-product of the way LLMs are trained; they are the actual function (the word Generative is the clue). For many specific use cases an LLM is both too large and too all-encompassing to be of reliable use, and a focused, purpose-trained ML model will realise more accurate and dependable results.

So are the only use cases writing bland LinkedIn posts, generating Eminem-style lyrics about my friends and their misfortunes, or rewriting blog posts like this one to make me sound more intelligent (which I have definitely done, by the way)? Again the answer is no. We can make the answers that LLMs create more reliable, but we need to supply a firm base of data for them to work with first, something solid which can be referenced to provide substance to the answer. We need RAG.

RAGtime

Retrieval Augmented Generation (RAG) is the way we keep our LLM grounded in fact (or at least the facts we supply it). We convert our data into embeddings (via a different type of model, an embedding model) and store them in a vector database; at query time the most relevant pieces are retrieved and passed to the LLM, which can then interpret and reference them in its reply. With this approach we have a new use case – we can interrogate our data using natural language, gaining deeper insight into our businesses and presenting it in an understandable and exciting way.
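To make the flow concrete, here’s a minimal sketch of the retrieve-then-generate loop. It uses an in-memory list in place of a real vector database, and `embed()` and `generate()` are hypothetical placeholders for whichever embedding model and LLM you actually call:

```python
# A minimal RAG sketch, not production code. embed() and generate() are
# hypothetical stand-ins for whichever embedding model and LLM endpoint you use.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model and return a vector."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call your LLM and return its reply."""
    raise NotImplementedError

def build_index(chunks: list[str]) -> list[tuple[str, np.ndarray]]:
    # Embed each chunk of our data and keep the vector alongside the text.
    return [(chunk, embed(chunk)) for chunk in chunks]

def answer(question: str, index: list[tuple[str, np.ndarray]], top_k: int = 3) -> str:
    # Retrieve: rank chunks by cosine similarity to the question.
    q = embed(question)

    def score(pair: tuple[str, np.ndarray]) -> float:
        _, v = pair
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))

    context = "\n".join(chunk for chunk, _ in sorted(index, key=score, reverse=True)[:top_k])

    # Generate: ground the reply in the retrieved context only.
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```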

So that means we can do whatever we want with our data, right? Gen AI can look at our information and tell us how to start making billions overnight? Again, no. RAG works using semantic search, simply returning the most semantically relevant information and basing its answer on that. It can’t chain together iterative reasoning steps the way a human being naturally would.

As an example, I recently wrote an article about creating a chatbot to query your CV, using AWS services (you can read the article here). If I ask the chatbot “What qualifications make this person a good fit for a job as a software engineer?” it can answer quickly, citing the relevant information contained in the document. But if I ask “How have those qualifications impacted their career progression?” it finds it much harder to give a coherent response, beyond reciting those qualifications again.

Even with these limitations, RAG makes the use of LLMs in our apps a powerful and profound tool, so how do we move towards a production-ready application that can leverage these capabilities?

Testing the Untestable

A requirement of safe, scalable production pipelines is always good test coverage. It assures us that the application hasn’t veered massively off course during each deployment, and lets us see what side effects our new code has produced before the user does. This all rests on the principle that the results of these tests will only change if our code or our input does. But with Gen AI, you can prompt an LLM with the exact same set of words 15 times and get 15 different answers. A human may be able to check that those answers are essentially the same, and arguably there should always be a final manual approval step in every production pipeline, but the point of automated testing is to find issues before we get to that point. We can define some basic unit tests by confining them to simple inputs and outputs, constraining the model’s response so it can be evaluated, but that often only covers the absolute basics. So, how do we test something we can’t exactly predict?
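For the basics that can be pinned down, constraining the response format gives us something deterministic to assert on. Here’s a minimal sketch (the `generate()` helper is a hypothetical stand-in for your LLM call) that tests the contract of the output rather than its exact wording:

```python
# Sketch of a "constrained" unit test: force the model into a fixed JSON shape,
# then assert on structure and allowed values, not exact wording.
import json

def generate(prompt: str) -> str:
    """Placeholder: call your LLM and return its raw text reply."""
    raise NotImplementedError

def test_sentiment_output_is_well_formed():
    prompt = (
        'Classify the sentiment of "I love this product" and reply with JSON '
        'only, in the form {"sentiment": "positive" | "negative" | "neutral"}.'
    )
    reply = json.loads(generate(prompt))

    # We can't pin the exact wording, but we can pin the contract.
    assert set(reply.keys()) == {"sentiment"}
    assert reply["sentiment"] in {"positive", "negative", "neutral"}
```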

The answer is to use Gen AI to help test itself. We can point another LLM, with a set of testing instructions, at the results of our tests and ask it to check them to make sure they are consistent with our expectations, and that the syntax, structure and grammar are functional and understandable. This process is sometimes referred to as LLM-as-judge, and is something LLMs thrive at doing, so leveraging it in this way is ideal. 

Care should be taken when choosing our testing model – ideally it should be more powerful than the model we are testing, and we should be confident we can trust it to interpret the information it is given. We also need to make sure our test prompt is ironclad, with no ambiguity about how it should present its findings, and we’ll need to stop the pipeline if the output is negative. Even with all that, a human will ultimately need to briefly check over the testing model’s response before opening the manual gate.
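A minimal sketch of what that gate might look like, assuming a hypothetical `generate(model, prompt)` helper for calling whichever (stronger) judge model you choose: the judge is forced into a strict PASS/FAIL verdict, and anything other than a pass stops the pipeline.

```python
# LLM-as-judge sketch. generate() and the model name are hypothetical
# placeholders for your own judge model and client code.
import json

def generate(model: str, prompt: str) -> str:
    """Placeholder: call the named LLM and return its raw text reply."""
    raise NotImplementedError

JUDGE_PROMPT = """You are reviewing the output of an automated test run.
Expected behaviour: {expectation}
Actual model output: {output}

Reply with JSON only, exactly in the form:
{{"verdict": "PASS" | "FAIL", "reason": "<one short sentence>"}}"""

def judge(expectation: str, output: str, judge_model: str = "your-strongest-model") -> dict:
    raw = generate(judge_model, JUDGE_PROMPT.format(expectation=expectation, output=output))
    return json.loads(raw)

def gate(expectation: str, output: str) -> None:
    result = judge(expectation, output)
    if result["verdict"] != "PASS":
        # Stop the pipeline and surface the judge's reasoning for the
        # final human check before the manual gate.
        raise SystemExit(f"LLM-as-judge failed the build: {result['reason']}")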

Mo’ Models, Mo’ Problems

As this technology blooms at an incredible rate, more models become available all the time, with billions more parameters making them ever more powerful. The current top two performing LLMs, ChatGPT and Claude 3.5, are estimated to have hundreds of billions of parameters (some suspect ChatGPT has hit one trillion), although their respective creators remain tight-lipped about specifics. Choosing the right one for our use cases will require investigation and experimentation, but for the most part, aiming for the least powerful model that meets our needs is the right path – it makes the application easier to test and its behaviour easier to predict. It also means we need to make sure the models we choose remain hosted, maintained and supported as newer, shinier ones emerge, and keep a keen eye out for when they are deprecated.

Prompt Departure

Prompts are the way we interact with Gen AI. Constructing a good prompt started out as little more than guesswork and experimentation, and has given rise to the heretofore unknown discipline of Prompt Engineering. Like well-written code, good prompts should be concise, clear and effective – it’s surprisingly easy to iterate yourself into a 2000-token behemoth trying to cover every eventuality. In their defining article, six distinguished AI and ML engineers who have spent the last year building with Gen AI models recommend starting with your prompting: focus on fundamental techniques, favour small prompts that each do a single thing well, and layer them rather than trying to create One Prompt To Rule Them All.
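As a sketch of that layering approach (again with a hypothetical `generate()` helper standing in for the model call), each prompt does one small job and the pipeline simply composes them:

```python
# Layering small single-purpose prompts instead of one giant one.
# generate() is a hypothetical stand-in for your LLM call.

def generate(prompt: str) -> str:
    """Placeholder: call your LLM and return its reply."""
    raise NotImplementedError

def summarise(text: str) -> str:
    return generate(f"Summarise the following in three bullet points:\n\n{text}")

def extract_actions(summary: str) -> str:
    return generate(f"List any action items in this summary, one per line:\n\n{summary}")

def to_email(actions: str) -> str:
    return generate(f"Draft a short, polite email asking the team to pick up these actions:\n\n{actions}")

# Each step is small enough to test and tune on its own, then composed.
def meeting_notes_to_email(notes: str) -> str:
    return to_email(extract_actions(summarise(notes)))
```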

Hallucinations

As mentioned before, LLMs will always want to answer a question, regardless of whether or not they are correct. We can apply certain guardrails on harmful and offensive responses, but removing the hallucinations that cause an LLM to manufacture fictional data to support its answer is still an issue. Good prompting technique and the use of RAG will help, as will so-called factual inconsistency guardrails (essentially evaluating the output with another model that aims to filter out wildly inaccurate generations). But for the moment the issue is here to stay and all Generative AI apps should caveat the interactions in such a way that it isn’t a surprise to the end user.
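A factual inconsistency guardrail can be sketched as a second model call that checks whether the answer is actually supported by the retrieved context before it reaches the user (with `generate()` once again a hypothetical stand-in for the checking model):

```python
# Sketch of a factual-inconsistency guardrail: a second model checks whether
# the answer is supported by the retrieved context before it reaches the user.

def generate(prompt: str) -> str:
    """Placeholder: call the checking LLM and return its reply."""
    raise NotImplementedError

def is_grounded(answer: str, context: str) -> bool:
    verdict = generate(
        "Does the ANSWER below contain any claim that is not supported by the "
        "CONTEXT? Reply with exactly one word: GROUNDED or UNGROUNDED.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    return verdict.strip().upper() == "GROUNDED"

def guarded_answer(answer: str, context: str) -> str:
    if is_grounded(answer, context):
        return answer
    # Fall back rather than passing a possible hallucination to the user.
    return "I couldn't find a reliable answer to that in the available data."
```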

Data Strategy is Cool, Kids

While model selection, prompt engineering and response guardrails all contribute to a cleaner and more satisfying user experience, one of the largest effects you can have on any RAG application is to collect and organise your data in a coherent manner. Everyone knows the adage of ‘garbage in, garbage out’ (or a less family-friendly version), and it is especially true when it comes to RAG apps. Define your data strategy early – what you are collecting, where you are storing it, how often you are updating it and so on. With ‘traditional’ ML, effective labelling of accurate data is essential to achieving a well-trained model, and we can draw parallels with RAG. Don’t labour over collecting huge swathes of data first though, as this can stunt your development cycle. Focus on a small amount of good-quality data to start, so you can get the other parts of your application fleshed out. That way, if it becomes apparent that the data strategy isn’t working as you’d hoped, you can pivot quickly without having wasted too much time. Once you are confident things are moving forward as expected, expand your data collection, checking that the app still behaves the way you had planned.

The Path Forward 

By building all of these elements into our app’s development cycle, we’ll have more confidence in its overall performance, and better feedback on it, when we release it to production. By combining good data strategy, Prompt Engineering, LLM-as-judge and inconsistency guardrails, we can start to create a safer and more consistent pipeline to production-ready applications. This is by no means a comprehensive list; the excitement of working with this cutting-edge technology is that new approaches are emerging all the time, with some incredibly talented and experienced minds constantly pushing the envelope and generously sharing their findings with the wider community. Whilst the future of Generative AI and its impact on the workplace cannot be predicted by even the most effective model, it is almost certainly going to be profound, touching almost every area of our professional lives. By embedding a strong ethic of reliability, scalability and procedure into the apps we create, we should be well prepared to harness this new frontier effectively.
