This article is part of a special issue of VentureBeat, “The Actual Cost of AI: Large-scale Performance, Efficiency, and ROI.” Read more about this special issue.
The advent of large language models (LLMs) has made it easier for enterprises to envision the kinds of projects they can undertake, and many pilot programs are now moving to deployment.
However, as these projects gained momentum, enterprises realized that the LLMs they had been using were unwieldy and, worse, expensive.
Enter small language models and distillation. Models like Google's Gemma family, Microsoft's Phi and Mistral's Small 3.1 have allowed businesses to choose fast, accurate models that work for specific tasks. Enterprises can opt for a smaller model for particular use cases, lowering the cost of running their AI applications and potentially achieving a better return on investment.
LinkedIn distinguished engineer Karthik Ramgopal told VentureBeat that companies opt for smaller models for a few reasons.
“Smaller models require less compute and memory and offer faster inference times, which translates directly into lower infrastructure OPEX and CAPEX, given GPU costs, availability and power requirements,” Ramgopal said. “Task-specific models have a narrower scope, which makes their behavior more aligned and maintainable over time without complex prompt engineering.”
Model developers price their small models accordingly. OpenAI's o4-mini costs $1.10 per million tokens for input and $4.40 per million tokens for output, compared with $40 per million output tokens for the full o3 version.
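To make the per-token pricing concrete, here is a minimal sketch of how a monthly bill could be estimated from per-million-token prices. The o4-mini prices are the ones quoted above; the request volume and token counts are hypothetical values chosen only for illustration, and the function name is not from any vendor SDK.

```python
# Prices in USD per million tokens, as quoted in the article for o4-mini.
O4_MINI_INPUT_PER_M = 1.10
O4_MINI_OUTPUT_PER_M = 4.40


def request_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Return the USD cost of one request at per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical workload (assumption): 100,000 requests a month,
# each with 2,000 input tokens and 500 output tokens.
REQUESTS_PER_MONTH = 100_000
per_request = request_cost(2_000, 500, O4_MINI_INPUT_PER_M, O4_MINI_OUTPUT_PER_M)
monthly_cost = REQUESTS_PER_MONTH * per_request

print(f"Cost per request: ${per_request:.4f}")
print(f"Monthly cost:     ${monthly_cost:,.2f}")
```

At these illustrative volumes the monthly bill comes to roughly $440. Swapping in another model's prices shows how quickly the gap widens as traffic grows.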
Enterprises today have a larger pool of small models, task-specific models and distilled models to choose from. These days, most flagship models come in a range of sizes. For example, Anthropic's Claude family of models includes Claude Opus, the largest model, Claude Sonnet, and Claude Haiku, the smallest version. Small models can be compact enough to run on portable devices such as laptops and mobile phones.
The savings question
However, when discussing return on investment, the question is always: What does ROI look like? Should it be a return on the costs incurred, or time savings that ultimately translate into dollars saved? Experts told VentureBeat that ROI can be difficult to judge because some companies believe they have already reached ROI by cutting the time spent on a task, while others are waiting for actual dollars saved or new business brought in before saying whether their AI investments have worked.
Typically, enterprises calculate ROI using a simple formula, as described in a post by Cognizant chief technologist Ravi Naarla: ROI = (benefits - cost) / cost. With AI programs, however, the benefits are not immediately apparent. He suggests that enterprises identify the benefits they expect to achieve, estimate them based on historical data, be realistic about the overall cost of AI, including hiring, implementation and maintenance, and understand that they have to be in it for the long term.
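The ROI calculation can be sketched in a few lines. This example assumes the standard net-benefit form of the formula (benefits minus costs, divided by costs) and uses the cost categories named above. Every dollar figure, along with the class and function names, is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AiProjectCosts:
    """Cost categories named in the article; amounts are hypothetical."""
    hiring: float
    implementation: float
    maintenance_per_year: float

    def total(self, years: int) -> float:
        # One-time costs plus maintenance accumulated over the period.
        return self.hiring + self.implementation + self.maintenance_per_year * years


def roi(benefits: float, costs: float) -> float:
    """Net-benefit ROI: (benefits - costs) / costs."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) / costs


# Hypothetical project (assumption): benefits estimated from historical data.
costs = AiProjectCosts(hiring=200_000, implementation=150_000, maintenance_per_year=50_000)
ANNUAL_BENEFIT = 250_000

for years in (1, 2, 3):
    value = roi(ANNUAL_BENEFIT * years, costs.total(years))
    print(f"Year {years}: ROI = {value:+.1%}")
```

With these made-up numbers the project shows a negative ROI in the first year and only turns positive from the second year on, which is one way to see why a long-term view matters.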
With small models, experts argue, implementation and maintenance costs come down, especially when the models are fine-tuned to give them more context about your enterprise.
Aible founder and CEO Arijit Sengupta said that how people bring context to their models dictates how much they can save. For those who need additional context in their prompts, such as long and complicated instructions, this can lead to higher token costs.
“You have to give models context one way or another. There is no free lunch. But with large models, that is usually done by putting it in the prompt,” he said. “Think of fine-tuning and post-training as an alternative way of giving models context. I might incur $100 of post-training costs, but it's not astronomical.”
Sengupta said his company has seen cost reductions of about 100 times from post-training alone, often dropping the cost of model use “from single-digit millions to something like $30,000.” He noted that this number includes software operating expenses and the ongoing cost of the model and vector databases.
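The trade-off Sengupta describes, paying for context in every prompt versus paying once for post-training, can be framed as a break-even calculation. The sketch below is a deliberate simplification: it assumes the post-trained model is billed at the same per-token rate and ignores hosting and maintenance. The $100 post-training figure comes from the quote above; the context size and input price are hypothetical.

```python
import math


def extra_prompt_cost_per_request(context_tokens, input_price_per_m):
    """USD spent per request on context that is repeated in every prompt."""
    return context_tokens * input_price_per_m / 1_000_000


def break_even_requests(post_training_cost, context_tokens, input_price_per_m):
    """Number of requests after which one-time post-training is cheaper."""
    per_request = extra_prompt_cost_per_request(context_tokens, input_price_per_m)
    if per_request <= 0:
        raise ValueError("context cost per request must be positive")
    return math.ceil(post_training_cost / per_request)


POST_TRAINING_COST_USD = 100.0  # figure mentioned in the quote
CONTEXT_TOKENS = 3_000          # hypothetical: instructions repeated per prompt
INPUT_PRICE_PER_M = 1.10        # hypothetical: USD per million input tokens

requests_needed = break_even_requests(
    POST_TRAINING_COST_USD, CONTEXT_TOKENS, INPUT_PRICE_PER_M
)
print(f"Post-training pays for itself after {requests_needed:,} requests")
```

Under these assumptions the one-time cost is recovered after a little over 30,000 requests. A more expensive model or a longer prompt shortens that horizon.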
“In terms of maintenance cost, if you do it manually with human experts, it can be expensive to maintain because small models need to be post-trained to produce results comparable to large models,” he said.
Experiments the company conducted showed that a task-specific, fine-tuned model performs well for some use cases, much like LLMs do, making the case that deploying several use-case-specific models is more cost-effective than relying on large models to do everything.
The company compared a post-trained version of Llama-3.3-70B-Instruct with a smaller 8B-parameter option of the same model. The 70B model, post-trained for $11.30, was 84% accurate in automated evaluations and 92% accurate in manual evaluations. Once fine-tuned at a cost of $4.58, the 8B model achieved 82% accuracy in manual evaluation, which would be suitable for more minor, more targeted use cases.
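One way to act on numbers like these is to pick the cheapest model that clears the accuracy bar a given use case needs. The sketch below uses the post-training costs and manual-evaluation accuracies reported above. The selection rule, the thresholds and all names are assumptions for illustration, not the company's method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    post_training_cost_usd: float
    manual_accuracy: float  # fraction between 0 and 1


# Figures reported in the article for the two post-trained models.
CANDIDATES = [
    Candidate("Llama-3.3-70B-Instruct (post-trained)", 11.30, 0.92),
    Candidate("8B option (fine-tuned)", 4.58, 0.82),
]


def cheapest_meeting(candidates, min_accuracy) -> Optional[Candidate]:
    """Return the lowest-cost candidate at or above the accuracy bar, if any."""
    eligible = [c for c in candidates if c.manual_accuracy >= min_accuracy]
    return min(eligible, key=lambda c: c.post_training_cost_usd, default=None)


# Hypothetical accuracy requirements for three different use cases.
for bar in (0.80, 0.90, 0.95):
    choice = cheapest_meeting(CANDIDATES, bar)
    label = choice.name if choice else "no candidate qualifies"
    print(f"Required accuracy {bar:.0%}: {label}")
```

A use case that tolerates 80% accuracy lands on the 8B option, one that needs 90% requires the 70B model, and a 95% bar is met by neither.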
Cost factors that fit your purpose
Right-sizing a model does not have to come at the expense of performance. These days, organizations understand that model selection does not just mean choosing between GPT-4o and Llama-3.1. It also means knowing that some use cases, such as summarization and code generation, are better served by a small model.
Daniel Hoske, chief technology officer at contact center AI products provider Cresta, said that starting development with LLMs gives a better picture of potential cost savings.
“You need to start with the biggest model and see if what you imagine works at all, because if it doesn't work with the biggest model, that doesn't mean it will work with a smaller model,” he said.
Ramgopal said LinkedIn follows a similar pattern because prototyping is the only way these issues start to emerge.
“Our typical approach to agentic use cases starts with general-purpose LLMs, as their broad generalizability allows us to rapidly prototype, validate hypotheses and assess product-market fit,” LinkedIn's Ramgopal said. “As the product matures and we encounter constraints around quality, cost or latency, we move to more customized solutions.”
During the experimentation phase, organizations can determine what they value most from their AI applications. Figuring this out lets developers better plan what they want to save on and select the model size that best suits their purpose and budget.
The experts cautioned that while it is important to build with the models that work best for what is being developed, high-parameter LLMs will always be more expensive. Large models will always require significant computing power.
However, overusing small, task-specific models also poses problems. Rahul Pathak, vice president of data and AI GTM at AWS, said in a blog post that cost optimization comes not just from using a model with low compute needs but from matching the model to the task. Smaller models may not have a large enough context window to understand more complex instructions, leading to increased workloads for human employees and higher costs.
Sengupta also cautioned that some distilled models can be brittle, so long-term use may not lead to savings.
Constantly evaluate
Regardless of model size, industry players emphasized the need for flexibility to address potential issues and new use cases. If an organization starts with a large model and later finds a smaller model with similar or better performance at lower cost, it cannot be precious about the model it originally chose.
Tessa Burg, CTO and head of innovation at brand marketing company Mod Op, told VentureBeat that organizations must understand that whatever they build now will always be superseded by a better version.
“We started with the mindset that the tech underneath the workflows we are creating, the processes we are making more efficient, is going to change. We knew that whatever model we used would be the worst version of a model.”
Burg said smaller models have helped her company and its clients save time in researching and developing concepts. She recommends breaking out high-cost, high-frequency use cases for lightweight models.
Sengupta noted that vendors are now making it easier to switch between models automatically, but he cautioned users to look for platforms that also facilitate fine-tuning so they do not incur additional costs.
