I came across a couple of Youtube videos recently that I thought were interesting. The focus of my investigation into Large Language Models and other forms of generative AI is to see where they can be incorporated into the development of business applications. I’m not interested in the hype, I’m interested in the practical applications. I’m also interested in the challenges that need to be overcome to make reliable tools that can be used in production.
There are some interviews online between Lex Fridman and Sam Altman, and another with Lex and Ilya Sutskever. I can’t remember where exactly but somewhere in there Lex asks one of them what they think would be the thing that could prevent them from achieving the kind of success they’re hoping for, and the reply was something along the lines of “if we can’t get the models to work reliably”. I think there is a real consensus around this - if these models are going to be useful outside the narrow confines of Chatbots and summarisation tools we need techniques to tame some of their more wild excesses.
In this presentation, Simon Willison talks about the weird world of LLMs and how non-intuitive working with them can be for experienced developers. Some key points here really resonated with me, as I’ve been trying to make workable solutions out of these models for a while and it’s fascinating but can also be frustrating.
Simon talks about how odd it is to ‘program’ the model with natural language, which is what ‘prompt engineering’ is. You structure the prompt you give the model to influence the outcome, adding statements like ‘you are a helpful assistant’ or a role to play to shape the results. As you experiment with LLMs you realize that prompting is not an exact science, you have to be willing to try different things and almost cajole the right kind of answers from the model. You can provide statements like “if you don’t know the answer, just say ‘I don’t know’” and that does at least help to reduce hallucinations, but none of this makes LLMs deterministic. That makes it quite hard to work out how to include them in any kind of automated business process.
This is a really hot area of research, with lots of very smart people trying different things. Langchain has functions designed for processing the outputs from LLMs, and there are plenty of other tools built around this like Open AIs specialised versions of their models trained to work with provided tools.
This video caught my attention as it combines using Open AI’s specialized function models, with Pydantic for trying to wrangle the outputs you get from your prompts to the model. The example code is some of the most structured stuff I’ve seen working with prompt results, and the examples given are really interesting. The final example, where the code gets results, then goes back over those results making further LLM calls to try and validate that the results generated are sourced from the material given, is a really interesting idea that could be used in scenarios where the accuracy of the results is critical.
There is a lot of talk at the moment about whether we’ve hit a plateau with LLMs, which would mean we wouldn’t see any significant improvement in performance or capability for some time. The kind of work I’ve illustrated above shows that even if that plateau is real, we still have a lot of work to do just taming the models we already have. I’m interested to see what comes out of this area of research over the next year or so.