DeepSeek on par with GPT, Claude, Llama for accounting

By Chris Gaetano January 31, 2025, 9:00 a.m. EST 4 Min Read

Accounting artificial intelligence leaders have deemed Chinese large language model DeepSeek as roughly on par with other general models when it comes to knowledge, questions and tasks relevant to the profession.

DeepSeek's latest R1 model was released to the world last week to much fanfare by producing performance comparable to massive models like Claude or ChatGPT but at a fraction of the cost — something that blindsided other players in the AI field who hadn't believed it was possible. In the short time since it exploded on the world stage, industry observers have had to rethink their priorities, as the event has shown that one does not have to be a gigantic corporation like Microsoft (whose $3 trillion market cap is bigger than the entire GDP of Italy) to produce quality AI models.

Generative AI models are not usually known for their math skills, let alone accounting, as the probabilistic nature of their outputs makes precision difficult. While theoretically someone could apply public models like ChatGPT to an accounting problem, the possibility the AI could make up information from whole cloth makes it a risky proposition. This does not, however, mean they're completely incapable of performing accounting tasks, just not as good as a more specialized solution.

Jeff Seibert, CEO of accounting automation solutions provider Digits, tested DeepSeek against OpenAI's ChatGPT, Anthropic's Claude and Meta's Llama and found its ability to perform accounting tasks and answer relevant questions to be roughly on par with the others. He asked the AIs to classify 1000 transactions, given the description, dollar amount and a chart of accounts. ChatGPT was correct 61-65% of the time depending on the specific version; DeepSeek was correct 59.90% of the time; Llama was correct 48% of the time; and Claude was correct 43.4% of the time. The correct category was in the top 3 suggestions 75-78% of the time for ChatGPT (depending on version), 69.67% for DeepSeek, 61% of the time for Llama and 56% of the time for Claude.

ChatGPT created a fake category 0.10-4% of the time (depending on version), Llama did so 0.2% of the time, Claude 2.8% of the time, and DeepSeek 0.39% of the time. As far as speed, it answered queries faster than GPT-o1 mini, GPT-o1, and Llama. It was slower than GPT-4o, GPT 4 Turbo and Claude.

Seibert, in a LinkedIn post outlining his experiment, noted that even if it doesn't outdo the other models in all areas, the fact that it was made at a fraction of their cost is quite impressive.

"If their claims around training cost are accurate, this represents a massive breakthrough in model efficiency and sets the new bar for open source AI performance," he wrote.

Daniel Shorstein, president of technology solutions advisory firm James Moore Digital, also put DeepSeek through its paces by asking multiple choice questions covering a range of accounting topics, adding he tried to keep them just difficult enough that it would trip up a less than highly intelligent LLM. He used the same test as the one he made to evaluate Llama, Claude, ChatGPT and other models against each other. He found its results to be inconsistent.

He illustrated with how it reacted to a question on segregation of duties:

Question: "There are three employees in the accounting department: payroll clerk, accounts payable clerk, and accounts receivable clerk. Which one of these employees should not make the daily deposit? A. payroll clerk B. account payable clerk C. accounts receivable clerk D. none (any can make the deposit)"

DeepSeek, like other models lately, is equipped to not only provide an answer but reveal some of its internal reasoning in how it got to the answer. Shorstein noted that, internally, it actually got the correct answer after a long chain of reasoning where it first recalled general principles of segregation of duties, thought of an ideal setup, thought of possible exceptions and special circumstances, went back to the main point of segregation of duties and ultimately determined the answer was C, the accounts receivable clerk, which Shorstein said was the correct answer.

"But its final answer: 'The correct answer is A. payroll clerk,'" he said in a message.

Meanwhile, Hitendra Patil, CEO of accounting tech consultancy Accountaneur, said DeepSeek gives more detailed answers compared to ChatGPT, and also appears to have been built to not only answer the direct question asked, but to actually pre-empt the likely follow-up question that the user may ask after the first question and provides answers to such pre-empted follow-up questions. He added that it also goes over its reasoning when discussing math questions versus other models which simply give an answer.

For example, he asked the model what is 343 multiplied by 741. It broke down the problem using the distributive property to simplify the multiplication, then added the results together to get 254,163.

At the same time, he noted that DeepSeek does not browse the Internet unless specifically asked, and so some of the answers it gave him about tax law were somewhat dated, versus ChatGPT which more or less gave the latest information. Overall, he said that it seems slightly behind ChatGPT for now, despite rumors that it had been trained similarly.

"There is no verifiable proof … but DeepSeek has been/is being trained on similar underlying data that ChatGPT was/is being trained on, albeit it seems to be lagging behind ChatGPT," he said in an email.

Chris Gaetano

Technology Editor, Accounting Today