Authors:
(1) Zhan Ling, UC San Diego and equal contribution;
(2) Yunhao Fang, UC San Diego and equal contribution;
(3) Xuanlin Li, UC San Diego;
(4) Zhiao Huang, UC San Diego;
(5) Mingu Lee, Qualcomm AI Research;
(6) Roland Memisevic, Qualcomm AI Research;
(7) Hao Su, UC San Diego.
Table of Links
Motivation and Problem Formulation
Deductively Verifiable Chain-of-Thought Reasoning
Conclusion, Acknowledgements and References
A Deductive Verification with Vicuna Models
C More Details on Answer Extraction
E More Deductive Verification Examples
B More Discussion on Improvements of Deductive Verification Accuracy Versus Improvements on Final Answer Correctness
In the main paper, we demonstrated that our verification approach significantly improves the verification accuracy of reasoning chains (Tab. 3 and 6), but barely improves the final answer accuracy (Tab. 4). We analyze this phenomenon further below:
Consider the GSM8K dataset as an example (recall that the final answer for a problem is obtained through majority voting over the sampled reasoning chains). For 91.6% of problems, |(number of votes received by the correct answer) − (largest number of votes received by a single wrong answer)| > 2, so their final answers are unlikely to be changed by our deductive verification approach. For the remaining 8.4% of problems, where deductive verification is more likely to affect the final answer (a short sketch after the list below illustrates this), we found that:
• Among all reasoning chains that arrive at the correct answer (these correct-answer chains account for 49.4% of all reasoning chain candidates), 46.2% are filtered out by our verification process.
• Among the reasoning chains that arrive at the correct answer but are filtered out by our verification process, 76.3% indeed exhibit incorrect reasoning.
• Among the reasoning chains that arrive at the correct answer and are not filtered out by our verification process, 78.0% indeed have correct reasoning.
• Among the reasoning chains that do not arrive at the correct answer and exhibit incorrect reasoning (these account for 50.6% of all reasoning chain candidates), 40.6% are filtered out by our verification process.
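To make the vote-margin argument concrete, below is a minimal, self-contained Python sketch of verification-filtered majority voting for a single problem. The answers, vote counts, and pass/fail flags are toy assumptions for illustration only, not the paper's code or data.

```python
from collections import Counter

# Each candidate chain for one hypothetical problem: (final answer, passed verification).
# "42" plays the role of the correct answer; all values are toy placeholders.
chains = [
    ("42", True), ("42", False), ("42", False),   # correct-answer chains; the verifier rejects two
    ("40", True), ("40", True),                   # wrong-answer chains; both pass verification
]

def majority_vote(answers):
    """Return the most frequent answer, or None if there are no votes."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0] if counts else None

plain = majority_vote([ans for ans, _ in chains])            # vote over all chains -> "42"
filtered = majority_vote([ans for ans, ok in chains if ok])  # vote over verified chains -> "40"

print(plain, filtered)
# The vote margin is only 1 (3 vs. 2), so filtering two correct-answer chains flips the result.
```

When the margin exceeds 2, as in 91.6% of problems, removing a couple of chains cannot flip the majority vote; the close cases above are where filtering matters.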
The above statistics show that a significant portion of the reasoning chains that arrive at correct answers but exhibit incorrect reasoning are successfully eliminated. Therefore, the reliability and trustworthiness of the reasoning chains that arrive at correct answers are significantly improved. Combined with the fact that a significant proportion of reasoning chains with incorrect answers are eliminated, and that our approach's verification accuracy significantly improves over naive verification approaches, our primary goal of improving LLM reasoning reliability is accomplished.
Nevertheless, the removal of many reasoning chains that yield correct answers (specifically, 46.2% × 49.4% of all chains) has a notable impact. It even exceeds the removal of reasoning chains with incorrect reasoning and incorrect answers (40.6% × 50.6% of all chains). As a result, the correct answer receives fewer votes when final answers are generated through majority voting, which limits the final answer accuracy. We believe that final answer accuracy can be improved in the future once a greater proportion of incorrect reasoning chains with incorrect answers is filtered out.
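For reference, the two products above can be checked with a line of arithmetic; the snippet below only restates the percentages already reported in this section.

```python
correct_chains = 0.494   # fraction of all chains that reach the correct answer
wrong_chains   = 0.506   # fraction that reach a wrong answer (with incorrect reasoning)

removed_correct = 0.462 * correct_chains   # correct-answer chains removed by verification
removed_wrong   = 0.406 * wrong_chains     # wrong-answer chains removed by verification

print(f"{removed_correct:.3f} vs. {removed_wrong:.3f}")   # 0.228 vs. 0.205
# Roughly 22.8% of all chains (votes for the correct answer) are removed, versus roughly
# 20.5% (votes for wrong answers), so majority voting loses slightly more correct votes.
```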
This paper is available on arXiv under the CC BY 4.0 DEED license.