Don't ask OpenAI's most cutting-edge large language model (LLM), GPT-4 Turbo, to perform exhaustive tasks over the winter holidays. That's the conclusion one can draw from a recent test, reported as statistically significant, conducted by an LLM enthusiast.
OpenAI claims that GPT-4 Turbo can handle highly complicated tasks within a single prompt, thanks to its more extensive training. The model also has an expanded context window of 128,000 tokens, the maximum amount of text it can take into account at one time. As a refresher, 1,000 tokens are roughly equivalent to 750 words, which means that OpenAI's latest offering can process an input of around 96,000 words.
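The token-to-word conversion above is simple arithmetic. As a minimal sketch, assuming the rule of thumb of 750 words per 1,000 tokens holds on average (the function and constant names here are illustrative, not part of any OpenAI API):

```python
# Rule of thumb quoted above: roughly 750 words per 1,000 tokens.
# This is an approximation; real ratios vary with language and content.
WORDS_PER_1000_TOKENS = 750


def tokens_to_words(tokens: int) -> int:
    """Estimate the word count that fits in a given number of tokens."""
    return tokens * WORDS_PER_1000_TOKENS // 1000


context_window_tokens = 128_000
print(tokens_to_words(context_window_tokens))  # 96000
```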
@ChatGPTapp @OpenAI @tszzl @emollick @voooooogel Wild result. gpt-4-turbo over the API produces (statistically significant) shorter completions when it "thinks" its December vs. when it thinks its May (as determined by the date in the system prompt).
I took the same exact prompt… pic.twitter.com/mA7sqZUA0r
— Rob Lynch (@RobLynch99) December 11, 2023
Recently, Rob Lynch, an LLM enthusiast, put GPT-4 Turbo through its paces. To his surprise, the LLM produced shorter responses when the date in the system prompt indicated that the current month was December than when it indicated May.
Specifically, Lynch obtained an average output of 4,298 tokens over 477 test runs from GPT-4 Turbo when it was prompted to believe that the current month was May. For December, the LLM gave a significantly shorter mean output of 4,086 tokens, a decrease in output length of around 5 percent.
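The "around 5 percent" figure follows directly from the two means. The sketch below reproduces that arithmetic and also shows one common way to check whether a gap between two sets of completion lengths is statistically significant. The article does not say which test Lynch used, so Welch's t statistic is an assumption here, and because the per-run lengths are not published in this article, the function is shown without data.

```python
import math
import statistics


def relative_decrease(baseline: float, observed: float) -> float:
    """Fractional drop of `observed` relative to `baseline`."""
    return (baseline - observed) / baseline


def welch_t(sample_a: list[float], sample_b: list[float]) -> float:
    """Welch's t statistic for two independent samples.

    Assumed choice of test, not necessarily the one used in the
    original experiment. Each sample needs at least two values.
    """
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Mean output lengths reported in the article.
may_mean_tokens = 4298
december_mean_tokens = 4086

drop = relative_decrease(may_mean_tokens, december_mean_tokens)
print(f"{drop:.1%}")  # 4.9%
```

With the raw per-run lengths in hand, `welch_t(may_lengths, december_lengths)` would give the test statistic; a large absolute value relative to the sample sizes indicates that the difference is unlikely to be noise.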
OMG, the AI Winter Break Hypothesis may actually be true?
There was some idle speculation that GPT-4 might perform worse in December because it "learned" to do less work over the holidays.
Here is a statistically significant test showing that this may be true. LLMs are weird.🎅 https://t.co/mtCY3lmLFF
— Ethan Mollick (@emollick) December 11, 2023
Commenting on the likely cause of this discrepancy, Ethan Mollick, a professor at Wharton, suggests that GPT-4 Turbo may have learned the human tendency to do less work in holiday-heavy December. If so, LLMs, despite extensive efforts to keep harmful human biases out of them, remain susceptible to inheriting some of the quirkier human shortcomings from their training data.
This development comes on the heels of earlier reports suggesting that OpenAI's GPT model was becoming progressively lazier, resorting to shortcuts instead of giving complete answers to queries. Anecdotally, some users have even pretended to have a disability to coax complete answers out of the LLM. The situation is apparently serious enough to prompt OpenAI to try to come up with a hotfix.
