A simple test of real-world LLM utility

I don’t have any firm plans for next weekend, so I thought I’d see if I could remedy my FOMO a little with an airshow visit. Without looking up the answer beforehand, I decided to test a few different LLM tools to answer what should be a simple question:

are either the Blue Angels or the Thunderbirds performing at a US airshow in the next three weeks?

I was looking for three things in the response:

Did the tool correctly identify the parameters (“next three weeks”, “US”, and “airshow”)?
Did it give a factually correct response?
Did it leave out anything important?

For reference, the Blue Angels are at Milwaukee 19-20 July and then at Seafair 1-3 August. The Thunderbirds are at Ft Wayne today (12 July), at Kingsley Field 19-20 July, and in Cheyenne 26 July.
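The check itself is just a date-window filter, and can be sketched in a few lines of Python. This is only an illustration of the question, not how any of the tools work. The schedule is copied from the paragraph above; the year 2025 comes from Grok's quoted response, the single-day end dates for Ft Wayne and Cheyenne are my assumption where only one date is given, and `shows_in_window` is a made-up helper name.

```python
from datetime import date, timedelta

# Reference schedule from the post: (team, venue, first day, last day).
# Year 2025 is assumed from the Grok quote; single-day shows use the same
# start and end date because only one date is given for them.
SHOWS = [
    ("Blue Angels", "Milwaukee", date(2025, 7, 19), date(2025, 7, 20)),
    ("Blue Angels", "Seafair", date(2025, 8, 1), date(2025, 8, 3)),
    ("Thunderbirds", "Ft Wayne", date(2025, 7, 12), date(2025, 7, 12)),
    ("Thunderbirds", "Kingsley Field", date(2025, 7, 19), date(2025, 7, 20)),
    ("Thunderbirds", "Cheyenne", date(2025, 7, 26), date(2025, 7, 26)),
]


def shows_in_window(shows, today, days=21):
    """Return shows whose dates overlap [today, today + days]."""
    window_end = today + timedelta(days=days)
    return [s for s in shows if s[2] <= window_end and s[3] >= today]


if __name__ == "__main__":
    for team, venue, start, end in shows_in_window(SHOWS, date(2025, 7, 12)):
        print(f"{team}: {venue}, {start:%d %b} to {end:%d %b}")
```

Starting the window on 12 July keeps all five shows. Starting it on 13 July, as Grok did, drops only the Ft Wayne date.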

Grok 3 got the date range right but then contradicted itself. It told me, “Based on available information, neither the Blue Angels nor the Thunderbirds are scheduled to perform at a US airshow in the next three weeks (from July 13, 2025, to August 3, 2025),” and then went on to say that the Blues are performing at Seafair 1-3 August and the Thunderbirds in Oregon (the Kingsley Field show) next weekend. It missed one show for the Blues and two for the Thunderbirds, counting today.

The free, no-login version of ChatGPT told me it couldn’t search the web, so I should look up the answer myself. After I logged in, still on the free version, it quickly identified both teams’ shows on 19-20 July, but that was all. It was faster than the other models but still didn’t produce a complete, correct answer.

My paid Claude subscription got the date range correct and found the Thunderbirds in Oregon and the Blues at Seafair, but didn’t get any of the others.

The free, no-login version of Copilot Chat got the date range correct, and was the only tool to spot the Thunderbirds’ performance today. Bizarrely, it included yesterday’s Blue Angels performance at Pensacola, but not their Seafair date.

This problem didn’t require advanced reasoning or searching. It’s disappointing that every tool produced incomplete or incorrect results for something this simple. We’re clearly not past the point of having to double-check the factual answers these tools give.
