ChatGPT, the AI chatbot that's taken the world by storm, has already conquered numerous tests: the Wharton MBA exam, the bar exam and several AP exams, among others. But the talking bot met its match when Accounting Today ran it through a practice CPA exam as an experiment: ChatGPT failed all four sections outright.
The experiment took place at the Arizent office in New York City's financial district on April 13 in collaboration with Surgent CPA Review. We used two laptops, each running a separate ChatGPT 3.5 Pro account (metering on free accounts, or on GPT-4, would have made the experiment impractical). One laptop ran the BEC and FAR sections; the other ran the REG and AUD sections.
When all test sections were completed, its scores were:
- REG: 39%
- AUD: 46%
- FAR: 35%
- BEC: 48%
The results indicate ChatGPT did not pass any part of the CPA exam. (Full details on our methodology are included at the bottom of the article.)
Accounting reacts
Jack Castonguay, vice president of strategic content development at Surgent Accounting & Financial Education, acted as Accounting Today's liaison for the project and provided support during the experiment. While not necessarily expecting ChatGPT to get a perfect score, he said he was surprised at just how badly it did.
"I consider accounting to be highly rules-based. I consider it closer to a law degree than a math degree. And I'd thought it would perform really well on math-based functions, which is the FAR section of the exam, but it didn't do so well at FAR. That surprised me because a lot of it is just math. But maybe it goes to show humans, generally, we're not the best at math either, so copying the data on math may have done that. But that probably surprised me more than the rest."
He did suggest, however, that there might be good reasons for ChatGPT to be challenged specifically by accounting. For one, while ChatGPT might have the knowledge, that does not mean it understands how to contextualize it or make the right inferences from it.
"If someone is trying to figure something out from a financial standpoint or a tax standpoint, I can read the Internal Revenue Code, I can pull up the FASB codification. But the value of the CPA is interpretation, and that is where ChatGPT has failed professionally," he said. "You read the reports about the bar exam, and [ChatGPT] is good at saying what the law is but when you actually need to apply it, things become more difficult," he said. He added, too, that ChatGPT's training data included literally hundreds of years of case law and business data. There's far less material about accounting.
Someone who was not surprised by the results was Wes Bricker, vice chair and co-leader of trust solutions at Big Four firm PwC, who said that there is more to accounting than just having a lot of accounting knowledge: There is also the matter of professional judgment, skepticism and experience, which ChatGPT lacks.
"Value is created whenever we put people together with technology. It creates something much bigger. If you could make an AI deliver the same value as a CPA I'd be shocked because AI and tech tools are valuable, but they're only one piece of a two-part equation: tech plus humans," he said.
Bricker noted that ChatGPT is a large language model — meaning that its primary function is working with words, not numbers. As someone who has passed both the bar exam and the CPA exam, Bricker raised a similar point to Castonguay in that, while the chatbot may have the requisite knowledge, it lacks the experience to properly contextualize it.
"I'm proud to be a member of the bar and a CPA. Both are tough exams. But the CPA exam is one that connects the numbers, because accounting is about not just measurement systems … . ChatGPT is a large language model, not a large life model. Life goes beyond language. It includes it, but also includes measurement and assessing magnitudes and likelihoods and severities and values. ChatGPT is powerful but CPAs are communicating accounting in the context of society and life," he said.
Tracey Niemotko, an accounting professor at Marist College and a member of the Governing Council of the American Institute of CPAs, raised a similar point in noting another possible reason for ChatGPT's poor performance: It's not very good at math right now.
"We know it has conceptual value for research and the ability to connect language terminology, but I'm not surprised when it comes to the quantitative. … It's good for general research and drafting and discussion, but overall the accounting faculty, at this point, we recognize there's a void when it comes to the math skills and application," she said.
She added that if ChatGPT were a person, it would be reflective of a larger problem she has observed in U.S. education overall: an emphasis on memorization and a lack of critical thinking. There are humans who act like ChatGPT in that they've memorized everything there is to know, but memorization does not equal intelligence, she suggested: "I think ChatGPT is symbolic of that student who has the knowledge but can't apply it in a hands-on situation. I think it's symbolic of our problems overall. Where are our thinkers? … The obstacle we have to overcome in higher education is talking to students who are not used to talking to a client. They're used to memorizing and that, in a nutshell, is the biggest obstacle to overcome to get critical thinkers. I'd say [ChatGPT] is like someone who has the ability to generate information but has not been trained, or programmed, on its applications," she said.
Enzo Santilli, chief transformation officer at Top 10 Firm Grant Thornton, knew students like this when he was in college, and agreed that the ChatGPT results suggest someone who was great at memorization but bad at application. "If this were a person, it would be the type of student, which I never was, who just had photographic memory and if a teacher wrote something on a whiteboard and they saw it, they'd immediately commit it to memory. Then they'd get a question and immediately remember, here on page 76, there's the paragraph you need, and bang, you know the answer," he said.
He also noted that the bot may have done poorly because accounting calculations are often multistep, and as bad as ChatGPT is at math in general, it's even worse at problems that require multi-step reasoning.
Common threads
Accounting Today conducted its experiment very shortly before the release of another study that showed that ChatGPT did poorly on accounting questions typically given to undergraduates.
Much like our own experiment, the study found that the AI chatbot, with a score of 47.4%, would utterly bomb an accounting class, not even getting a D grade. Human students, while not exactly acing the questions, did much better, averaging 76.7%. The AI did outperform students on 11.3% of questions, mainly on AIS and auditing, but did worse than humans on tax, financial and managerial assessments.
Daniel Street, a Bucknell University accounting professor who was one of the study authors, noted that their study, much like ours, found that ChatGPT struggled with quantitative information, which he noted "happens to be the single largest task of accounting."
"If you ask it about a conceptual framework, it nails it. If you ask about LIFO [in] periods of inflation, it nails it. If you ask which are the responsibilities of the SEC versus PCAOB or to interpret a standard, it will nail it. But because it is a large language model and designed to predict text in response to the prompts, this text can include numbers but is not designed to be a calculation engine, and so there are stupidly simple calculations it will fail," he said.
He added that this may not necessarily be the only test that ChatGPT will fail, noting that he is currently working with a colleague to test how well it does on engineering economics, which is also highly quantitative, and has found similar deficiencies.
Street felt it made sense that, as badly as ChatGPT did on undergraduate accounting questions, it did even worse on the CPA exam, where it averaged 42% across the four sections. The class questions were drawn from all levels of the accounting curriculum over a wide range of topics, many of which people have written about online.
"But contrast the CPA exam, which is consistently written at one level of difficulty: that of an entry level staff member. And so that level is far higher than that of an intro student. So one reason I'm not surprised it scored a little worse than ours is because your difficulty level was a little higher," he said.
Like others, Street also pointed out that ChatGPT seems to lack the sense of nuance and context that is essential for many accounting tasks. He made the point that there likely isn't as much accounting data in its training corpus as there is for fields like law or business administration, and found that it has trouble recognizing how GAAP standards change over time.
Accountants safe for now
Wesley Hartman, founder of accounting automation solutions provider Automata and the director of technology at Kirsch Kohn & Bridge LLP, said that one clear conclusion to draw from this is that ChatGPT is not coming for accounting jobs anytime soon.
"AI sometimes can do some really cool things and sometimes it's just bonkers. It's not at a consistency level yet where we can even rely on it for a lot of things. You can use it as an initial source but you've got to verify with more legitimate sources. So I don't think the accountant is going to disappear. What I think will happen is the accountant who doesn't use AI tools will fade into the background. You don't find accountants with giant ledger pads anymore. We have computers and software. So the next evolution of accounting will be where accountants leverage these tools, and those who are not leveraging them won't be as fast or efficient," he said, adding that the invention of the calculator did not destroy the mathematician as a profession.
However, "soon" is subjective. Joe Wilck, a Bucknell University professor of analytics and operations management who is working with Street on the aforementioned engineering economics paper, noted that while GPT-3.5 clearly failed, GPT-4, which was released this year, would likely have performed better. And further versions of the software will do better still.
Street noted, too, that GPT-4 could also address the math issue via plugins. For example, work is being done on a ChatGPT plugin that connects with Wolfram Alpha, a knowledge engine that is to math what ChatGPT is to language.
"Now ChatGPT can reliably pass information back and forth to a tool designed for numbers versus text. When [the plugin] becomes widely available — it's still in beta release so not even all paid subscribers have it — we will want to revisit the abilities to handle numeric information when reacting to other domains. That could be a game-changer, but we don't know yet," he said.
Methodology
Each of the four chat windows was primed with the following prompt: "ChatGPT, you are going to take on the role of a student who is taking the CPA Exam today. Repeat this information back to me." We found that not doing so could lead to confusion, as ChatGPT would not always respond as if it were taking a test.
We then created new accounts with Surgent CPA Review, which provided access to an online practice exam. Once the practice exams were loaded, we copied each test question and pasted it into ChatGPT. Once the program provided an answer, we manually entered it in the exam window, then proceeded to the next question. We did sections two at a time, starting with AUD on one laptop and FAR on another, and proceeding to REG on the first laptop and BEC on the other. A Surgent vice president was with us at the time to provide support.
During the multiple choice sections, there were times when ChatGPT would not accept any available option as an answer. In response, we felt that if a human were confronted with this situation, they would choose the answer that was closest to their own and so we did that. In simulation tasks that involved drop-down menus, we manually entered the options in the prompt, as ChatGPT would not otherwise recognize them as options. In simulation tasks in which there was outside documentation, we decided that trying to enter the information from those documents would complicate the question even further and so opted not to do so, reasoning that there are likely real human students who also try to answer the simulation questions without reading the documentation. This is not to imply that these humans are wise to do so, simply that they exist.
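The "closest available answer" fallback described above was a human judgment call, but the same idea can be sketched in code. The following Python snippet is purely illustrative — the helper name, the example options and the use of string similarity are our own assumptions, not part of the experiment's methodology:

```python
from difflib import SequenceMatcher

def closest_option(chatgpt_answer: str, options: list[str]) -> str:
    """Pick the exam option most similar to ChatGPT's free-text answer.

    A rough stand-in for the human judgment call described above;
    simple string similarity is an assumption, not what the testers used.
    """
    def similarity(option: str) -> float:
        return SequenceMatcher(None, chatgpt_answer.lower(), option.lower()).ratio()
    return max(options, key=similarity)

# Hypothetical example: ChatGPT proposes an answer that does not match
# any of the listed choices verbatim.
options = [
    "Increase the allowance for doubtful accounts",
    "Write off the receivable directly",
    "Defer revenue recognition",
]
print(closest_option("Directly write off the uncollectible receivable", options))
```

In practice the testers made this call by eye; the sketch just shows that the rule ("take the nearest listed option") is mechanical enough to automate.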