Breaking Down Tasks with AI Chains, Verifiers, and Faithful CoT

Learn how to break down complex tasks into simpler subtasks and use techniques such as AI Chains, verifiers, probabilistic programming, sentence label manipulation, and faithful CoT to get the best output.

In this blog post, I will discuss several methods that have been developed for LLMs, including AI chains, verifiers, probabilistic programming, and techniques such as selection-inference prompting and Maieutic prompting, which help models reason reliably over long problems. I will also look at how sentence label manipulation can reduce hallucinations, and how a 'halter' prompt can reduce wrong answers, as shown in Faithful Reasoning Using Large Language Models.

Each paper discussed below explores a different way to improve language models, so let's dive in!

1. Break complex tasks into simpler subtasks:

The first paper we will be discussing is AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts, published in October 2021. This paper suggests that breaking complex tasks into simpler subtasks can improve the interaction between humans and AI. The paper also suggests that exposing intermediate outputs to users can make the interaction more transparent and controllable.

Findings:

Large language models (LLMs) are powerful tools, but they can be less effective on complex tasks because they lack transparency and are hard to control. To make LLMs more useful, researchers have developed a new technique called Chaining. This technique involves breaking down a complex task into smaller steps and having the output of each step become the input for the next step. This makes the overall task easier to manage and control.

Researchers tested this technique on a group of people and found that Chaining not only improved the quality of task outcomes but also made the system more transparent and easier to control. Users were able to calibrate the model's expectations and debug unexpected outputs by using this method. Chaining is a promising technique that could be used in future applications of LLMs.
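
To make this concrete, here is a minimal Python sketch of a two-step chain. The `llm` callable is a stand-in for whatever prompt-to-completion function you use; the task and prompt wording are illustrative, not from the paper:

```python
from typing import Callable

# Stand-in type: any function that maps a prompt to a model completion.
LLM = Callable[[str], str]

def chain_summarize_then_draft(llm: LLM, document: str) -> str:
    """Two-step chain: extract key points, then draft a reply from them."""
    # Step 1 produces the intermediate output. Exposing it to the user is
    # what makes the chain transparent and controllable: they can inspect
    # or edit it before it feeds step 2.
    key_points = llm(f"List the key points in this document:\n\n{document}")
    # Step 2: the output of step 1 becomes the input of step 2.
    return llm(f"Using only these key points, draft a short reply:\n\n{key_points}")
```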


2. Generate many candidates:

The second paper we will be discussing is Training Verifiers to Solve Math Word Problems, also published in October 2021. This paper explores how generating many candidates for a solution and then picking the best one can improve the output of language models. This method can be particularly useful in solving math word problems.

Findings:

State-of-the-art language models can perform well on many tasks but struggle with multi-step mathematical reasoning. To improve these models, researchers created a new dataset called GSM8K, which includes 8.5K math word problems from grade school that are linguistically diverse and high quality.

They found that even the largest transformer models struggled to solve these problems. To improve performance, researchers proposed training verifiers to judge the correctness of model completions. This involved generating many candidate solutions and selecting the one ranked highest by the verifier. They found that this method significantly improved performance on GSM8K and was more effective than finetuning when given more data.
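
A minimal sketch of the generate-and-rank loop, assuming a generic sampling function `llm` and a `verifier` that scores a (problem, solution) pair. In the paper the verifier is a separately trained model; here it is only assumed to score more plausible solutions higher:

```python
from typing import Callable

LLM = Callable[[str], str]              # prompt -> one sampled solution
Verifier = Callable[[str, str], float]  # (problem, solution) -> score

def best_of_n(llm: LLM, verifier: Verifier, problem: str, n: int = 100) -> str:
    """Sample n candidate solutions and return the verifier's top pick."""
    candidates = [llm(f"Solve step by step:\n{problem}") for _ in range(n)]
    return max(candidates, key=lambda sol: verifier(problem, sol))
```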


3. Reason step-by-step:

The third paper we will be discussing is Chain of Thought Prompting Elicits Reasoning in Large Language Models, published in January 2022. This paper suggests that models do better on reasoning tasks when they reason step-by-step before answering. This method can be particularly useful in improving the accuracy of language models on complex reasoning tasks.

Findings:

The authors of this paper explore how generating a chain of intermediate reasoning steps improves the ability of large language models to perform complex reasoning tasks. They introduce a method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting, and show that this method significantly improves performance on arithmetic, commonsense, and symbolic reasoning tasks.

The method achieves state-of-the-art accuracy on a math word problem benchmark, surpassing even a finetuned GPT-3 with a verifier. The authors demonstrate that large language models have the natural ability to reason step-by-step when prompted and that this method can be used to improve their performance on a range of complex reasoning tasks.
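
In practice, chain-of-thought prompting just means including worked reasoning in the few-shot exemplars. The exemplar below is the canonical one from the paper; the surrounding Python scaffolding is illustrative:

```python
from typing import Callable

# One few-shot exemplar with its reasoning written out (from the paper).
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

def chain_of_thought(llm: Callable[[str], str], question: str) -> str:
    # The model imitates the exemplar and reasons step-by-step first.
    return llm(COT_PROMPT.format(question=question))
```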


4. Generate many explanation-answer outputs:

The fourth paper we will be discussing is Self-Consistency Improves Chain of Thought Reasoning in Language Models, published in March 2022. This paper explores how generating many explanation-answer outputs and picking the most popular answer can improve step-by-step reasoning. This method can be particularly useful in improving the accuracy of language models on complex reasoning tasks.

Findings:

This paper proposes a new decoding strategy, self-consistency, to improve the performance of large language models on complex reasoning tasks. The self-consistency approach samples multiple reasoning paths and selects the most consistent answer among them, instead of just using a greedy approach.

The authors show that this method significantly improves the performance of chain-of-thought prompting on various reasoning benchmarks, such as GSM8K, SVAMP, AQuA, StrategyQA, and ARC-challenge. The idea is that just like how there can be multiple ways to solve a complex math problem, there can be multiple ways to arrive at the correct answer for a complex reasoning task, and the self-consistency approach explores these different paths to find the best answer.
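
A sketch of self-consistency over a chain-of-thought prompt. It assumes sampling with temperature > 0 so the reasoning paths differ, and that each completion ends with "The answer is X." (a parsing simplification):

```python
from collections import Counter
from typing import Callable

LLM = Callable[[str], str]  # assumed to sample with temperature > 0

def self_consistency(llm: LLM, cot_prompt: str, samples: int = 40) -> str:
    """Sample several reasoning paths; return the most popular final answer."""
    answers = []
    for _ in range(samples):
        completion = llm(cot_prompt)
        # Naive answer extraction; real parsing depends on your prompt format.
        answers.append(completion.rsplit("The answer is", 1)[-1].strip(" .\n"))
    return Counter(answers).most_common(1)[0][0]
```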


5. Fine-tune a step-by-step reasoner:

The fifth paper we will be discussing is STaR: Bootstrapping Reasoning With Reasoning, published in March 2022. This paper suggests that it is possible to fine-tune a step-by-step reasoner with multiple-choice question-and-answer data alone. This method can be particularly useful in improving the accuracy of language models on reasoning tasks.

Findings:

The paper proposes a technique called Self-Taught Reasoner (STaR) which improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. STaR uses a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. The technique relies on a loop where models generate rationales to answer many questions, prompted with a few rationale examples. If the generated answers are wrong, models try again to generate a rationale given the correct answer. Then, models are fine-tuned on all the rationales that ultimately yielded correct answers.

STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30× larger state-of-the-art language model on CommonsenseQA.
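
One iteration of the STaR loop can be sketched as follows. `llm` and `extract_answer` are stand-ins, and fine-tuning on the returned pairs is left to your training stack:

```python
from typing import Callable

LLM = Callable[[str], str]

def star_collect(llm: LLM, extract_answer: Callable[[str], str],
                 dataset: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only rationales whose final answer matches the gold answer."""
    kept = []
    for question, gold in dataset:
        rationale = llm(f"Q: {question}\nReason step by step, then answer.\nA:")
        if extract_answer(rationale) != gold:
            # Rationalization: retry with the correct answer given as a hint.
            rationale = llm(f"Q: {question} (the answer is {gold})\n"
                            "Explain step by step why.\nA:")
            if extract_answer(rationale) != gold:
                continue  # still wrong: discard this rationale
        kept.append((question, rationale))
    return kept  # fine-tune the model on these pairs, then repeat the loop
```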


6. Zero-shot reasoning:

The sixth paper we will be discussing is Large Language Models are Zero-Shot Reasoners, published in May 2022. This paper suggests that step-by-step reasoning works great even with zero examples. This means that language models can perform reasoning tasks even if they have never seen a similar problem before.

Findings:

The authors show that large language models (LLMs) can perform complex reasoning tasks without being explicitly trained on specific examples, a technique called zero-shot reasoning. They achieve this by adding a simple prompt, "Let's think step by step," before each answer.

This method, called Zero-shot-CoT, outperforms existing zero-shot LLMs on various reasoning tasks, including arithmetic and symbolic reasoning, without any hand-crafted few-shot examples. The authors suggest that there may be untapped and understudied zero-shot capabilities of LLMs, and highlight the importance of exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
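
Zero-shot-CoT is a two-stage prompt: first elicit the reasoning, then extract the answer. A sketch, with the trigger phrase taken from the paper:

```python
from typing import Callable

LLM = Callable[[str], str]

def zero_shot_cot(llm: LLM, question: str) -> str:
    # Stage 1: reasoning extraction with the paper's trigger phrase.
    reasoning = llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: answer extraction, conditioned on the generated reasoning.
    return llm(f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
               "Therefore, the answer is")
```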


7. Alternate 'selection' and 'inference' prompts:

The seventh paper we will be discussing is Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning, also published in May 2022. This paper suggests that alternating between a 'selection' prompt and an 'inference' prompt can improve the accuracy of language models on logical reasoning tasks. This method can be particularly useful in improving the interpretability of language models.

Findings:

Large language models (LLMs) are good at understanding and processing human language, and they can learn new tasks from just a little training. However, they struggle with problems that require multiple steps of logical thinking. To help them, researchers developed a new framework called Selection-Inference (SI). SI uses LLMs as building blocks, alternating between selecting the right information and using logical reasoning to put that information together to solve more complex problems.

When tested on 10 logical reasoning tasks, the SI framework with a 7B-parameter LLM more than doubled the performance of the same model used directly, and even outperformed a much larger 280B-parameter LLM. The SI framework can also explain how it arrived at its answer in a way people can understand. Think of it like building a tower from blocks: the LLM calls are the blocks, and the SI framework is the process of selecting and stacking them into a bigger, more complex structure.
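
The alternation can be sketched as a loop over two prompts. The prompt wording here is illustrative, not the paper's exact templates:

```python
from typing import Callable

LLM = Callable[[str], str]

def selection_inference(llm: LLM, facts: list[str], question: str,
                        steps: int = 3) -> list[str]:
    """Alternate a 'selection' prompt and an 'inference' prompt."""
    derived: list[str] = []
    for _ in range(steps):
        context = "\n".join(facts + derived)
        # Selection: pick the facts relevant to the next step.
        selected = llm(f"Facts:\n{context}\nQuestion: {question}\n"
                       "Select the facts needed for the next step:")
        # Inference: derive exactly one new fact from the selection.
        derived.append(llm(f"Selected facts:\n{selected}\n"
                           "What single new fact follows from these?"))
    return derived  # an interpretable trace of intermediate conclusions
```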


8. Splitting complex reasoning problems:

The eighth paper we will be discussing is Least-to-most Prompting Enables Complex Reasoning in Large Language Models, also published in May 2022. This paper explores how splitting complex reasoning problems into smaller pieces and solving them incrementally can improve the accuracy of language models. This method can be particularly useful in improving the efficiency of language models on long reasoning problems.

Findings:

The article discusses a new way to prompt large language models to solve complex problems. Chain-of-thought prompting, described above, has limitations when solving problems that are harder than the examples it has seen before. The new method, called least-to-most prompting, breaks down a complex problem into smaller subproblems that are solved sequentially. The answers to previously solved subproblems help with solving the next subproblem. The article provides examples of symbolic manipulation, compositional generalization, and math reasoning to show that least-to-most prompting can solve problems that are harder than those seen in the prompt.

The article concludes by stating that the GPT-3 code-davinci-002 model with least-to-most prompting outperforms chain-of-thought prompting by a large margin and can solve the SCAN benchmark with an accuracy of 99.7% using only 14 examples. In contrast, chain-of-thought prompting achieves an accuracy of 16.2%, and other neural-symbolic models specialized for solving SCAN are trained with over 15,000 examples.
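
A sketch of least-to-most prompting: ask the model to decompose the problem, then solve the subproblems in order, feeding each answer into the next prompt. The decomposition parsing is simplified for illustration:

```python
from typing import Callable

LLM = Callable[[str], str]

def least_to_most(llm: LLM, problem: str) -> str:
    # Stage 1: decompose the problem into simpler subproblems.
    plan = llm(f"To solve '{problem}', list the subproblems to answer first, "
               "one per line:")
    subproblems = [line.strip() for line in plan.splitlines() if line.strip()]
    # Stage 2: solve subproblems sequentially, accumulating Q/A context
    # so each answer helps with the next subproblem.
    context, answer = "", ""
    for sub in subproblems + [problem]:
        answer = llm(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return answer
```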


9. Analyzing good and bogus explanations:

The ninth paper we will be discussing is Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations, also published in May 2022. This paper suggests that language models can analyze both good and bogus explanations to figure out which set of explanations is most consistent. This method can be particularly useful in improving the logical consistency of language models on reasoning tasks.

Findings:

The article talks about a new method called Maieutic Prompting that helps large pre-trained language models reason more consistently. Currently, prompting these models to generate self-guided explanations is a promising approach but it has limitations because the explanations can be noisy and inconsistent. Maieutic Prompting solves this problem by inducing a tree of explanations abductively (e.g. X is true, because...) and recursively. It then frames the inference as a satisfiability problem over these explanations and their logical relations.

Maieutic Prompting has been tested on three challenging benchmarks that require complex commonsense reasoning and it achieves up to 20% better accuracy than state-of-the-art prompting methods. It is also fully unsupervised and performs competitively with supervised models. Maieutic Prompting improves the robustness of inference while providing interpretable rationales.
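
A drastically simplified, depth-one sketch of the abductive step. The real method recurses into a tree of explanations and solves a weighted satisfiability problem over their logical relations; here a hypothetical `believes` scorer stands in for that machinery:

```python
from typing import Callable

LLM = Callable[[str], str]
Believes = Callable[[str], float]  # hypothetical: how strongly the model endorses a claim

def maieutic_depth_one(llm: LLM, believes: Believes, statement: str) -> bool:
    # Abduce an explanation for each label, even the bogus one.
    exp_true = llm(f"{statement}\nThis statement is true, because")
    exp_false = llm(f"{statement}\nThis statement is false, because")
    # Keep the label whose explanation the model itself finds more credible.
    return believes(exp_true) >= believes(exp_false)
```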


10. Probabilistic programming:

The tenth paper we will be discussing is Language Model Cascades, published in July 2022. This paper frames the techniques above in terms of probabilistic programming, where systems comprise unreliable components such as language models. Composing models into cascades gives a unified language for expressing, and improving, these techniques.

Findings:

Language model cascades are a way of combining multiple model calls to improve few-shot learning. The key move is to express the system as a probabilistic program: a graphical model whose random variables take complex values such as strings. Techniques like scratchpads/chain of thought, verifiers, STaR, selection-inference, and tool use can all be written in this unified language, which supports control flow and dynamic structure. A probabilistic program of this kind is what the authors call a language model cascade.
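
Viewed as a probabilistic program, a scratchpad cascade is just sampling one string-valued latent variable and conditioning on it. A toy sketch:

```python
from typing import Callable

LLM = Callable[[str], str]

def scratchpad_cascade(llm: LLM, question: str) -> tuple[str, str]:
    """Sample thought ~ p(thought | question), then
    answer ~ p(answer | question, thought)."""
    thought = llm(f"Q: {question}\nReasoning:")  # latent string variable
    answer = llm(f"Q: {question}\nReasoning: {thought}\nAnswer:")
    return thought, answer  # repeated sampling yields draws from the joint
```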


11. Eliminating hallucination:

The final paper we will be discussing is Faithful Reasoning Using Large Language Models, published in August 2022. This paper suggests that language models can eliminate hallucination with sentence label manipulation, and reduce wrong answers with a 'halter' prompt. This method can be particularly useful in improving the accuracy and interpretability of language models.

Findings:

Faithful Reasoning is a process that uses large language models (LMs) to perform multi-step reasoning. It works by chaining together reasoning steps, each of which is the product of two fine-tuned LMs: one for selection and one for inference. The process carries out a beam search through the space of reasoning traces to improve reasoning quality. This approach has been shown to be effective on multi-step logical deduction and scientific question answering, outperforming baselines on final-answer accuracy and generating human-interpretable reasoning traces that users can verify.
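
A greedy sketch of a single reasoning trace with three components: a selection LM, an inference LM, and a 'halter' that decides when the question is answerable. The paper additionally runs beam search over many such traces; the prompt wording here is illustrative:

```python
from typing import Callable

LLM = Callable[[str], str]

def faithful_trace(select: LLM, infer: LLM, halt: LLM,
                   context: str, question: str, max_steps: int = 5) -> str:
    for _ in range(max_steps):
        # Selection LM: choose which known statements feed the next step.
        premises = select(f"Context:\n{context}\nQuestion: {question}\n"
                          "Select the statements needed next:")
        # Inference LM: derive one new statement from the selected premises.
        new_fact = infer(f"Premises:\n{premises}\nTherefore:")
        context += f"\n{new_fact}"
        # Halter: answer only once the trace supports an answer.
        verdict = halt(f"Question: {question}\nKnown:\n{context}\n"
                       "Answer if possible, otherwise say 'Unknown'.")
        if verdict.strip() != "Unknown":
            return verdict
    return "Unknown"  # declining to guess is how the halter reduces wrong answers
```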


This article is based on the chart below:

| Lesson | Paper | Date |
| --- | --- | --- |
| Break complex tasks into simpler subtasks (and consider exposing the intermediate outputs to users) | AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts | 2021 Oct |
| You can improve output by generating many candidates, and then picking the one that looks best | Training Verifiers to Solve Math Word Problems | 2021 Oct |
| On reasoning tasks, models do better when they reason step-by-step before answering | Chain of Thought Prompting Elicits Reasoning in Large Language Models | 2022 Jan |
| You can improve step-by-step reasoning by generating many explanation-answer outputs, and picking the most popular answer | Self-Consistency Improves Chain of Thought Reasoning in Language Models | 2022 Mar |
| If you want to fine-tune a step-by-step reasoner, you can do it with multiple-choice question & answer data alone | STaR: Bootstrapping Reasoning With Reasoning | 2022 Mar |
| The step-by-step reasoning method works great even with zero examples | Large Language Models are Zero-Shot Reasoners | 2022 May |
| You can do better than step-by-step reasoning by alternating a 'selection' prompt and an 'inference' prompt | Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning | 2022 May |
| On long reasoning problems, you can improve step-by-step reasoning by splitting the problem into pieces to solve incrementally | Least-to-most Prompting Enables Complex Reasoning in Large Language Models | 2022 May |
| You can have the model analyze both good and bogus explanations to figure out which set of explanations are most consistent | Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations | 2022 May |
| You can think about these techniques in terms of probabilistic programming, where systems comprise unreliable components | Language Model Cascades | 2022 Jul |
| You can eliminate hallucination with sentence label manipulation, and you can reduce wrong answers with a 'halter' prompt | Faithful Reasoning Using Large Language Models | 2022 Aug |

source: https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md

Conclusion:

In conclusion, these recent research papers have explored a variety of ways to improve language models, including breaking complex tasks into simpler subtasks, generating many candidates, reasoning step-by-step, generating many explanation-answer outputs, fine-tuning step-by-step reasoners, zero-shot reasoning, alternating 'selection' and 'inference' prompts, splitting complex reasoning problems, analyzing good and bogus explanations, and using probabilistic programming. Each method has its own benefits and limitations, and the best approach will depend on the specific task at hand.

Nonetheless, these advancements represent significant progress in the field of Natural Language Processing and bring us closer to developing more accurate, efficient, and transparent language models.

About the author

Von Wooding

Counsel Stack develops grounded language models equipped with research, retrieval, and drafting tools. We offer legal leads, pre-built intelligent applications, and white label solutions.

Counsel Stack Learn

Counsel Stack Learn is a comprehensive tech education center built to help attorneys maintain professional competence.
