Reasoning in Large Language Models


I was wowed by ChatGPT. While I understood tasks like text generation and summarization, something was different with ChatGPT. When I looked at the literature, I saw work exploring reasoning. Models reasoning, c’mon. As a very skeptical data scientist, that seemed far-fetched to me. But I had to explore.

I came upon the BIG-bench benchmark, composed of more than 200 reasoning tasks. The tasks include playing chess, describing code, guessing the perpetrator of a crime in a short story, identifying sarcasm, and even recognizing self-awareness. A common benchmark for testing models is Big Bench Hard (BBH), a subset of 23 tasks from BIG-bench. Early models like OpenAI’s text-ada-001 struggle to reach even the random-guessing score of 25. However, several newer models reach and surpass the average human rater score of 67.7. You can see results for these models in these publications: 1, 2, and 3.


A survey of the research pointed to some common starting points for evaluating reasoning in models: Arithmetic Reasoning, Symbolic Reasoning, and Commonsense Reasoning. This blog post provides examples of each, but you should try all of these examples yourself. Hugging Face has a space where you can test a Flan-T5 model yourself.

Arithmetic Reasoning

Let’s start with the following problem.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: The answer is 5

If you ask an older text generation model like GPT-2 to complete this, it doesn’t understand the question and instead continues to write a story like this.


While I don’t have access to the 540B-parameter PaLM model evaluated on Big Bench, I was able to work with Flan-T5 XXL using this publicly available space. I entered the problem and got this answer!


It solved it! I tried messing with it and changing the words, but it still answered correctly. To my untrained eye, it seems to take the numbers and perform a calculation using the surrounding information. This is an elementary problem, but it is more sophisticated than the GPT-2 response. I next wanted to try a more challenging problem like this:

Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?

The model gave an answer of 8, which isn’t correct. Recent research has found that chain-of-thought prompting can improve models’ reasoning ability. This involves providing intermediate reasoning steps to help the model determine the answer.

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 

The model correctly answers 11. To solve the juggling problem, I used this chain-of-thought prompt as an example. Giving the model some examples is known as few-shot learning. The new combined prompt using chain-of-thought and few-shot learning is:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?

Try it, it works! Giving the model an example and making it think everything through step by step was beneficial. This was fascinating to me. We don’t train the model in the sense of updating its weights. Instead, we guide it purely through the inference process.
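Under the hood, the combined prompt is just string concatenation. Here is a minimal Python sketch (the helper name is my own) of assembling the few-shot chain-of-thought prompt above; the commented-out model call is illustrative and assumes the transformers library and the google/flan-t5-xxl checkpoint.

```python
# Build a few-shot chain-of-thought prompt by prepending a worked example.

EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the worked example so the model imitates its reasoning steps."""
    return f"{EXEMPLAR}Q: {question}\nA:"

prompt = build_cot_prompt(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

# To run it against Flan-T5 (assumes transformers is installed):
# from transformers import pipeline
# generator = pipeline("text2text-generation", model="google/flan-t5-xxl")
# print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```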

Symbolic Reasoning

The first symbolic reasoning task was reversing a sequence, and Flan-T5 worked very well on this type of problem.

Reverse the sequence "glasses, pen, alarm, license".

A more complex problem involving coin flips was more interesting to me.

Q: A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?

For this one, I played around with different combinations of people flipping or not flipping the coin, and the model answered correctly each time. It was following the logic of what was happening to the coin.
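The logic the model needs to follow here is easy to make explicit. A minimal Python sketch (the function name is my own), purely to illustrate the state-tracking these coin-flip questions require: the coin ends heads up exactly when it starts heads up and is flipped an even number of times.

```python
def coin_is_heads(starts_heads: bool, flips: list) -> bool:
    """Track the coin's state; each True in `flips` is one person flipping it."""
    state = starts_heads
    for flipped in flips:
        if flipped:
            state = not state
    return state

# "A coin is heads up. Tom does not flip the coin. Mike does not flip the coin."
print(coin_is_heads(True, [False, False]))  # -> True: still heads up
```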

Common Sense Reasoning

The last category was common sense reasoning, and it was much less obvious to me how models solve these problems correctly.

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet
A: The answer is

I was amazed at how well the model did, even when I changed the order.


Another common reasoning example goes like this:

Q: Can Barack Obama have a conversation with George Washington? Give the rationale before answering.

I swapped in people who are currently living, and it still works well.


As a first step, please go try out these models for yourself. Google’s Flan-T5 is available under an Apache 2.0 license. Hugging Face has a space where you can try all these reasoning examples yourself. You can also replicate this using OpenAI’s GPT models or other language models. I have a short video on reasoning that also shows several examples.

The current language models have many known limitations. The next generation of models will likely be able to retrieve relevant information before answering. Additionally, language models will likely be able to delegate tasks to other services. You can see a demo of this in an integration of ChatGPT with Wolfram’s scientific API. By offloading such tasks, language models can focus on communication and reasoning.

The current generation of models is starting to solve some reasoning tasks and match average human raters. Performance also appears to still be increasing. What happens when there is a set of reasoning tasks at which computers are better than humans? While plenty of academic literature highlights the limitations, the overall trajectory is clear and has extraordinary implications.

Text style transfer in a spreadsheet using Hugging Face Inference Endpoints


We change our conversational style between informal and formal speech, often without thinking, depending on whether we are talking to friends or addressing a judge. Computers now have this capability! In this post, I use textual style transfer to convert informal text to formal text. To make it easy to use, we do it in a spreadsheet.

The first step is identifying an informal-to-formal text style model. Next, we deploy the model using Hugging Face Inference Endpoints. Inference Endpoints is a production-grade solution for model deployment.
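Before wiring the endpoint into a spreadsheet, it helps to see what calling it looks like. Here is a sketch in plain Python using only the standard library; the endpoint URL and token are placeholders for your own deployment’s values, and the `[{"generated_text": ...}]` response shape assumes a text-generation-style model, so treat this as an illustration rather than the exact API.

```python
import json
import urllib.request

# Placeholders -- substitute your own endpoint URL and Hugging Face token.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."

def build_request(text: str) -> urllib.request.Request:
    """Inference Endpoints take a JSON body with an `inputs` field and a
    bearer-token Authorization header."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
    )

def formalize(text: str) -> str:
    """Send informal text to the deployed style-transfer model."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.load(resp)[0]["generated_text"]

# Once the endpoint is live:
# print(formalize("gotta bounce, ttyl"))
```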

Let’s incorporate the endpoint into a Google Sheets custom function to make the model easy to use.

I added the code to Google Sheets through the Apps Script extension. Grab it here as a gist. Once that is saved, you can use the new function as a formula. Now, if I want to do textual style transfer, I can use one simple formula!


I created a YouTube 🎥 video for a more detailed walkthrough.

Go try this out with your favorite model! For another example, check out the positive-style textual model in a TikTok video.

Few shot text classification with SetFit


Data scientists often do not have large amounts of labeled data. This issue is even graver for problems with tens or hundreds of classes. In reality, very few text classification problems reach the point where adding more labeled data no longer improves performance.

SetFit offers a few-shot learning approach for text classification. The paper’s results show that, across many datasets, it’s possible to get better performance with less labeled data. The technique uses contrastive learning to build a larger dataset for fine-tuning a text classification model. This approach was new to me, which is why I made a video explaining how contrastive learning helps with text classification.
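To build intuition for why contrastive learning helps, here is my own minimal sketch of the pair-generation idea (not SetFit’s actual code): a few labeled texts expand quadratically into similar/dissimilar pairs, giving far more training signal than the raw examples alone.

```python
from itertools import combinations, product

def make_contrastive_pairs(labeled):
    """Expand a handful of labeled texts into (text_a, text_b, similar) pairs:
    same-label pairs are positives (1), cross-label pairs are negatives (0)."""
    by_label = {}
    for text, label in labeled:
        by_label.setdefault(label, []).append(text)
    pairs = []
    for texts in by_label.values():           # positives within each class
        for a, b in combinations(texts, 2):
            pairs.append((a, b, 1))
    for la, lb in combinations(by_label, 2):  # negatives across classes
        for a, b in product(by_label[la], by_label[lb]):
            pairs.append((a, b, 0))
    return pairs

examples = [
    ("I want to cancel my plan", "churn"),
    ("closing my account today", "churn"),
    ("love the new features", "no churn"),
    ("great service, thank you", "no churn"),
]
print(len(make_contrastive_pairs(examples)))  # -> 6 pairs from 4 examples
```

SetFit fine-tunes a sentence transformer on pairs like these and then trains a simple classification head on the resulting embeddings.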

I have created a Colab 📓 companion notebook and a YouTube 🎥 video that provide a detailed explanation. I walk through a simple churn example to give the intuition behind SetFit. The notebook trains on the CR (customer review) dataset highlighted in the SetFit paper.

The SetFit GitHub repository contains the code, and a great deep dive on text classification can be found on Philipp’s blog. For those looking to productionize a SetFit model, Philipp has also documented how to create a Hugging Face endpoint for a SetFit model.

So grab your favorite text classification dataset and give it a try!

Getting predictions intervals with conformal inference


Data scientists often overstate the certainty of their predictions. I have had engineers laugh at my point predictions and point out several types of errors in my model that create uncertainty. Prediction intervals are an excellent counterbalance for communicating the uncertainty of predictions.

Conformal inference offers a model agnostic technique for prediction intervals. It’s well known within statistics but not as well established in machine learning. This post focuses on a straightforward conformal inference technique, but there are more sophisticated techniques that provide more adaptable prediction intervals.
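To give a flavor of how simple the basic recipe is, here is a toy sketch of split conformal inference in plain Python (my own illustration; real applications should reach for a library like MAPIE):

```python
import math

def conformal_interval(cal_y_true, cal_y_pred, new_pred, alpha=0.1):
    """Split conformal prediction: calibrate an interval half-width from
    absolute residuals on a held-out set so roughly (1 - alpha) of future
    points fall inside the interval."""
    residuals = sorted(abs(y - p) for y, p in zip(cal_y_true, cal_y_pred))
    n = len(residuals)
    # conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest residual
    k = min(n, math.ceil((n + 1) * (1 - alpha))) - 1
    q = residuals[k]
    return new_pred - q, new_pred + q

# Toy calibration data: true values vs. a model's point predictions.
cal_true = [10, 12, 9, 11, 14, 8, 13, 10, 12, 11]
cal_pred = [11, 11, 10, 10, 12, 9, 12, 11, 13, 11]
print(conformal_interval(cal_true, cal_pred, 20))  # -> (18, 22)
```

The appeal is that this works on top of any point-prediction model; the interval width comes entirely from the calibration residuals, not from the model internals.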

I have created a Colab 📓 companion notebook and a YouTube 🎥 video that provide a detailed explanation. The explanation is a toy example to show how conformal inference works. Typical applications will use a more sophisticated methodology along with the implementations found in the resources below.

For Python folks, a great package for getting started with conformal inference is MAPIE (Model Agnostic Prediction Interval Estimator). It works for tabular and time series problems.

Further Resources:

Quick intro to conformal prediction using MAPIE on Medium

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, paper link

Awesome Conformal Prediction (lots of resources)

Explaining predictions from 🤗 transformer models


This post covers 3 easy-to-use 📦 packages to get started. You can also check out the Colab 📓 companion notebook and the YouTube 🎥 video for a deeper treatment.

Explanations help us understand model predictions. In the case of text, they highlight how the text influenced the prediction. They are helpful for 🩺 diagnosing model issues, 👀 helping stakeholders understand how a model is working, and 🧑‍⚖️ meeting regulatory requirements. Here is an explanation 👇 using shap. For more on explanations, check out the explanations in machine learning video.


Let’s review 3 packages you can use to get explanations. All of these work with transformers, provide visualizations, and only require a few lines of code.


  1. SHAP is a well-known, well-regarded, and robust package for explanations. When working with text, SHAP typically defaults to the Partition SHAP explainer. This method makes the SHAP computation tractable by using hierarchical clustering and Owen values. The image here shows the clustering for a simple phrase. If you want to learn more about Shapley values, I have a video on them, and a deep dive on the Partition SHAP explainer is here.


  2. Transformers Interpret uses Integrated Gradients from Captum to calculate the explanations. This approach is 🐇 quicker than SHAP! Check out this space to see a demo.


  3. Ferret is built for benchmarking interpretability techniques and includes multiple explanation methods (including Partition SHAP and Integrated Gradients). A Spaces demo for ferret is here, along with a paper that explains the various metrics incorporated in ferret.

You can see below how explanations can differ when using different explanation methods. A great reminder that explanations for text are complicated and need to be appropriately caveated.


Ready to dive in? 🟢

For a longer walkthrough of all the 📦 packages with code snippets, web-based demos, and links to documentation/papers, check out:

👉 Colab notebook: