Updated: You Should Default To Google's Gemini Flash 1.5 Or GPT-4o mini For Your Agentic Workflows

Engineering
July 23, 2024
At ChannelFour.AI, we use LLMs in multiple ways, ranging from single API calls and multimodal image processing to powering our AI Agents. We made the choice early on not to pick one provider but instead to balance OpenAI and Google across these tasks. We collect and track their responses, durations, and costs to determine which is best for each job. Recently, a third party published a price-vs-performance comparison completely in line with what we're seeing: Gemini Flash is a game changer for AI Agents and Agentic Workflows.

Updated 18 Jul 2024

Today, OpenAI released "GPT-4o mini," which takes aim directly at Gemini Flash. Relative to Flash, GPT-4o mini is slightly slower (~12%), slightly more capable (~1%), and about 40% cheaper. Both are significantly cheaper than their big-brother models. With this model in the wild, you should be using either of these as your go-to Agent model. I've updated the title of the post and the comparison image below but otherwise left this post as is.

Comparing Gemini Flash and GPT-4o for AI Agents

Artificialanalysis.ai recently released a wonderful and comprehensive breakdown of quality vs. speed vs. price for most of the major LLM providers. I want to jump off their post to say that we are seeing very similar results to their findings. In particular, Google's Gemini Flash has an amazing quality-to-speed-to-price ratio; if you aren't at least trying it for your AI Agent and Agentic workflows, you should be.

[Updated image: Artificial Analysis quality vs. speed vs. price comparison, including GPT-4o mini]

[Original image: Artificial Analysis quality vs. speed vs. price comparison]

AI Agents

AI Agents are not simply LLM API calls; they are complex processes that may retry if they don't like their own output, may connect to and process data from elsewhere via tools, and may pass their output as context to other agents. This back and forth adds, in a way, a second level of non-determinism: the time it takes to complete the task at hand. When you add all this up, you get run times ranging from tens of seconds to minutes. Truth be told, I've run a research job with AutoGPT and GPT-4 Turbo that took over an hour and cost over $10. In enterprise settings, that may be no big deal, but for user-facing applications powered by AI Agents, it simply isn't going to work.
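To make that concrete, here is a minimal sketch of the loop an agent effectively runs. The helpers call_llm and looks_good are hypothetical placeholders for illustration, not CrewAI APIs:

def run_agent(call_llm, task: str, context: str = "", max_retries: int = 3) -> str:
    # Minimal agent loop: call the LLM, self-check the output, retry if needed.
    output = ""
    for _ in range(max_retries):
        # One or more LLM round-trips per attempt; tool calls would add more.
        output = call_llm(f"{task}\n\nContext:\n{context}")
        # Self-check: does the output actually satisfy the request?
        if looks_good(task, output):  # hypothetical validator
            return output  # this may become the next agent's context
        context += f"\nPrevious rejected attempt:\n{output}"
    return output

Each pass through that loop is another network round-trip, which is where the wide range of run times comes from.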

A Simple Example

Let's start with a simple prompt and see how long each model takes with straight LLM calls:

Prompt: 
Write me a blog post about the outcomes of artificialanalysis.ai recent 
post concerning the speed, price, and quality of all major LLMs.
It should be formatted to the specifications of Dreyer's English.
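
For reference, here's a minimal sketch of the timing harness for these straight calls, using the same LangChain wrappers as the reference code at the end of the post (it assumes OPENAI_API_KEY and GOOGLE_API_KEY are set in the environment):

import time

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI

prompt = "Write me a blog post about ..."  # the prompt above

# One wrapper per provider; same temperature for a fair comparison.
models = {
    "gpt-4o": ChatOpenAI(model_name="gpt-4o", temperature=0.7),
    "gemini-1.5-flash": ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.7),
}

for name, llm in models.items():
    start = time.perf_counter()
    response = llm.invoke(prompt)  # a single, straight LLM call
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f}s, {len(response.content)} chars")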

What's the takeaway? We can clearly see the speed difference between the two. For now, I'm less concerned about quality, and to be frank, this prompt is a bad idea, so I'm not going to extend this post with their output: neither model knows about the report, but both will happily make things up. We'll adapt this prompt to be Agentic in the coming steps, but this sets our baseline for speed.

Make it Agentic with CrewAI

CrewAI is a great framework for building Agentic workflows. It's easy to use but still quite powerful. In this simple example, we'll define two agents: one to summarize the data and one to format it. I'm giving the data in the prompt since, for the purposes of this test, I don't want to introduce the variability of HTTP requests; in a real scenario, I would definitely allow the research agent to search and scrape the web.

New prompt:

Compare and Contrast gpt-4o and gemini-flash-1.5 for AI agents and Agentic Workflows.

gpt-4o
Quality: GPT-4o is of higher quality compared to average, with a MMLU score of 0.887 and a Quality Index across evaluations of 100.
Price: GPT-4o is more expensive compared to average with a price of $7.50 per 1M Tokens (blended 3:1). 
GPT-4o Input token price: $5.00, Output token price: $15.00 per 1M Tokens.
Speed: GPT-4o is faster compared to average, with a output speed of 86.8 tokens per second.
Latency: GPT-4o has a lower latency compared to average, taking 0.45s to receive the first token (TTFT).
Context Window: GPT-4o has a larger context windows than average, with a context window of 130k tokens.

gemini-flash-1.5
Quality: Gemini 1.5 Flash is of higher quality compared to average, with a MMLU score of 0.789 and a Quality Index across evaluations of 84.
Price: Gemini 1.5 Flash is cheaper compared to average with a price of $0.53 per 1M Tokens (blended 3:1).
Gemini 1.5 Flash Input token price: $0.35, Output token price: $1.05 per 1M Tokens.
Speed: Gemini 1.5 Flash is faster compared to average, with a output speed of 163.6 tokens per second.
Latency: Gemini 1.5 Flash has a higher latency compared to average, taking 1.06s to receive the first token (TTFT).
Context Window: Gemini 1.5 Flash has a larger context windows than average, with a context window of 1.0M tokens.

Research Agent


    research_agent = Agent(
      role="Your job to research the latest articles and news on the topic",
      goal="Find the latest articles and news relating to the task at hand. when summarizing your findings, pick a winner",
      verbose=True,
      memory=False,
      backstory="""
    You are an expert in the field of research and have been tasked with finding the latest articles and news regarding the topic at hand. 
    You have access to a wide range of resources and are able to quickly and efficiently find the information needed to complete the task.
    You should provide significant context for the next agent to use.
    """,
      allow_delegation=False,
      cache=False,
      llm=llm,
    )

Formatter Agent


    format_agent = Agent(
      role="Your job is to format the information provided",
      goal="Format the information specifically to be a technical blog post and adhere to rules laid out by Benjamin Dreyer in the book Dreyer's English",
      verbose=True,
      memory=False,
      backstory="""
    You are an expert in the field of formatting and have been tasked with formatting the information provided by the previous agent.
    """,
      allow_delegation=False,
      llm=llm,
    )

Time to Process the tasks

For good measure, I also ran the prompt against gemini-1.5-pro and gpt-4-turbo. We can see the distinct speedup Flash enjoys. Artificialanalysis.ai has significantly better data and adds dimensions around price and capability; I highly suggest digging into that site for a deeper dive.

gpt-4o output


# A Detailed Comparison: GPT-4o vs. Gemini Flash 1.5 in AI Agentic Workflows

In the rapidly evolving landscape of AI agents and agentic workflows, two models have garnered significant attention: GPT-4o and Gemini Flash 1.5. Both models offer unique advantages and trade-offs, making them suitable for different types of applications. This detailed comparison will help you understand their strengths and weaknesses, assisting you in making an informed decision for your specific needs.

## Quality

### GPT-4o
- **MMLU Score**: 0.887
- **Quality Index**: 100

GPT-4o stands out for its exceptional quality, boasting a high MMLU score of 0.887 and a perfect Quality Index of 100. This makes it ideal for applications requiring high precision and nuanced understanding, such as advanced research, complex problem-solving, and high-stakes decision-making.

### Gemini Flash 1.5
- **MMLU Score**: 0.789
- **Quality Index**: 84

Gemini Flash 1.5, while still of high quality, falls short of GPT-4o in terms of MMLU score and Quality Index. With a score of 0.789 and an index of 84, it is more suited for tasks where ultra-high precision is not as critical but still requires a reliable performance.

## Price

### GPT-4o
- **Blended Price**: $7.50 per 1M Tokens
- **Input Token Price**: $5.00 per 1M Tokens
- **Output Token Price**: $15.00 per 1M Tokens

GPT-4o is on the expensive side, with a blended price of $7.50 per 1 million tokens. The input token price is $5.00, and the output token price is $15.00 per 1 million tokens. This cost factor makes it a considerable investment, suitable for projects where the budget aligns with the need for high-quality output.

### Gemini Flash 1.5
- **Blended Price**: $0.53 per 1M Tokens
- **Input Token Price**: $0.35 per 1M Tokens
- **Output Token Price**: $1.05 per 1M Tokens

Gemini Flash 1.5 is significantly cheaper, with a blended price of $0.53 per 1 million tokens. The input token price is $0.35, and the output token price is $1.05 per 1 million tokens. This makes it an economical choice for large-scale deployments or applications with budget constraints.

## Speed

### GPT-4o
- **Output Speed**: 86.8 tokens per second

GPT-4o offers a fast output speed of 86.8 tokens per second, which is above average. This speed is beneficial for applications where timely responses are crucial but not necessarily in real-time scenarios.

### Gemini Flash 1.5
- **Output Speed**: 163.6 tokens per second

Gemini Flash 1.5 excels in speed, with an impressive output rate of 163.6 tokens per second. This makes it ideal for real-time applications, such as live customer support, real-time data analysis, and other time-sensitive tasks.

## Latency

### GPT-4o
- **TTFT (Time to First Token)**: 0.45 seconds

GPT-4o has a lower latency, with a TTFT of 0.45 seconds. This quick response time enhances user experience, making it suitable for interactive applications where immediate feedback is essential.

### Gemini Flash 1.5
- **TTFT**: 1.06 seconds

Gemini Flash 1.5 has a higher latency, with a TTFT of 1.06 seconds. While this is still acceptable for many applications, it may not be ideal for scenarios requiring ultra-fast initial responses.

## Context Window

### GPT-4o
- **Context Window**: 130k tokens

GPT-4o offers a substantial context window of 130k tokens, making it well-suited for applications requiring a large amount of contextual information, such as comprehensive document analysis, long-form content generation, and complex conversation management.

### Gemini Flash 1.5
- **Context Window**: 1.0M tokens

Gemini Flash 1.5 surpasses GPT-4o with a massive context window of 1.0 million tokens. This feature is particularly advantageous for applications involving extensive data sets, long conversations, or large documents, where maintaining context over vast information is crucial.

## Conclusion

Both GPT-4o and Gemini Flash 1.5 offer compelling features tailored to different needs:

- **GPT-4o** is ideal for high-quality, high-precision tasks where cost is less of a concern and where a substantial but not excessive context window is sufficient. Its lower latency and high-quality index make it suitable for applications requiring quick, accurate responses.

- **Gemini Flash 1.5** is perfect for cost-sensitive applications requiring extremely fast output speed and the ability to handle very large context windows. Its lower price point and higher token output speed make it a highly efficient choice for large-scale, real-time tasks.

By weighing these factors against your specific requirements, you can select the AI model that best aligns with your operational goals and budgetary constraints.

gemini-1.5-flash output


## GPT-4o vs. Gemini Flash 1.5: Choosing the Right LLM for Your AI Agents

The world of large language models (LLMs) is a fast-paced one, with new contenders emerging seemingly every day.  Two prominent players in this arena are GPT-4o and Gemini Flash 1.5, both boasting impressive capabilities for AI agents and agentic workflows.  This deep dive explores their strengths and weaknesses, providing a comprehensive comparison to help you make the right choice for your needs.

**Quality: A Matter of Precision**

When it comes to quality, GPT-4o takes the lead. Its higher MMLU score of 0.887, compared to Gemini Flash 1.5's 0.789, suggests it excels at understanding and responding to complex tasks. This translates to potentially more accurate and nuanced outputs for your AI agents.  Furthermore, a comprehensive Quality Index across various evaluations shows GPT-4o scoring 100, while Gemini Flash 1.5 scores 84, further solidifying GPT-4o's overall higher performance.

**Price: Balancing Performance and Budget**

Both models offer competitive pricing, but GPT-4o comes with a higher cost of $7.50 per 1M Tokens (blended 3:1) compared to Gemini Flash 1.5's $0.53.  This makes Gemini Flash 1.5 a more budget-friendly option, especially for applications requiring extensive token processing.  However, the cost differential should be carefully weighed against the potential benefits of GPT-4o's higher quality and performance.

**Speed and Latency: The Race for Responsiveness**

Speed is a key consideration for real-time applications, and here, Gemini Flash 1.5 takes the crown.  Its output speed of 163.6 tokens per second eclipses GPT-4o's 86.8 tokens per second, leading to quicker response times for your AI agents.  However, GPT-4o compensates with lower latency, taking only 0.45s to deliver the first token compared to Gemini Flash 1.5's 1.06s. This makes GPT-4o more suitable for applications demanding immediate feedback and responsiveness.

**Context Window: Handling Complexity**

Both models offer substantial context windows, significantly exceeding the average. GPT-4o provides 130k tokens, while Gemini Flash 1.5 boasts a larger window of 1.0M tokens. This means Gemini Flash 1.5 can handle more complex and extensive inputs, enabling more sophisticated AI agents capable of understanding and responding to intricate scenarios.

**Conclusion:  Making the Right Choice**

The choice between GPT-4o and Gemini Flash 1.5 ultimately depends on your specific requirements and priorities. For AI agents that demand high quality and responsiveness, GPT-4o is a strong contender.  However, if budget constraints and a larger context window are paramount, Gemini Flash 1.5 presents a compelling alternative.

**Key Takeaways:**

* **Quality:** GPT-4o offers higher quality with a higher MMLU score and Quality Index.
* **Price:** Gemini Flash 1.5 is more affordable.
* **Speed:** Gemini Flash 1.5 boasts a faster output speed.
* **Latency:** GPT-4o offers lower latency for quicker response times.
* **Context Window:** Gemini Flash 1.5 provides a larger context window for handling more complex inputs.

By carefully considering these factors, you can select the most suitable LLM for your AI agents and agentic workflows, maximizing their potential and achieving optimal results.

Agents took 3x-5x longer. Why?

I'd like to dig into this in a future blog post, but at a high level, Agents:

  • call LLMs multiple times
  • provide significantly more instructions to follow
  • add processing overhead locally
  • check that their output matches the request and retry if needed

If we introduced HTTP calls for the research portion (and even the format portion), we could expect even longer durations and even more variability.
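For illustration, letting the research agent actually hit the web in CrewAI is mostly a matter of attaching tools. A minimal sketch using SerperDevTool, which is already imported in the reference code below (it assumes a SERPER_API_KEY environment variable):

from crewai import Agent
from crewai_tools import SerperDevTool

search_tool = SerperDevTool()  # Google-search tool; needs SERPER_API_KEY

research_agent = Agent(
    role="Researcher",
    goal="Find the latest articles and news relating to the task at hand",
    backstory="You are an expert researcher.",
    tools=[search_tool],  # each tool call adds an HTTP round-trip
    allow_delegation=False,
    llm=llm,  # whichever model you chose above
)

Every search or scrape the agent decides to make is another network call on top of the LLM round-trips.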

Best of Both Worlds when you need it

At least in CrewAI, we can mix and match LLMs. If we know an agent needs more ability to reason, we can have it use a different LLM. Let's test that real quick. I set the research agent to gpt-4o and the format agent to gemini-1.5-flash:
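
Concretely, that is just a matter of passing a different llm to each Agent. This sketch mirrors the reference code at the end of the post, with role, goal, and backstory elided:

# The reasoning-heavy research step gets the stronger model...
research_agent = Agent(
    role="...", goal="...", backstory="...",
    llm=openAI_llm,  # gpt-4o
)

# ...while the mechanical formatting step runs on the fast, cheap model.
format_agent = Agent(
    role="...", goal="...", backstory="...",
    llm=flash_llm,   # gemini-1.5-flash
)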

This took 37s and the final output was:


## GPT-4o vs. Gemini-Flash-1.5: Choosing the Right Language Model for AI Agents

The world of AI is abuzz with powerful language models, each offering unique capabilities. Two models that have captured the attention of developers and researchers are GPT-4o and Gemini-Flash-1.5. Both excel in driving AI agents and agentic workflows, but which one is the right fit for your project? This post dives into a comprehensive comparison, examining key factors like quality, price, speed, latency, and context window to help you make an informed decision.

### Quality: Precision and Nuance

**GPT-4o** shines when it comes to quality. With a **MMLU score of 0.887** and a **Quality Index of 100**, it consistently delivers high-quality output. This makes it a solid choice for applications demanding precise and nuanced understanding, such as complex reasoning tasks or generating creative content.

**Gemini-Flash-1.5** also outperforms average models, achieving a **MMLU score of 0.789** and a **Quality Index of 84**. While not quite reaching the heights of GPT-4o, it remains a strong contender for most practical use cases.

| Model | MMLU Score | Quality Index | Assessment |
|---|---|---|---|
| GPT-4o | 0.887 | 100 | High quality, ideal for precision and nuance |
| Gemini-Flash-1.5 | 0.789 | 84 | Good quality, suitable for most practical applications |

### Price: Balancing Power and Budget

When it comes to cost, there's a significant difference between the two models. 

**GPT-4o** is the more expensive option, with a **blended price of $7.50 per 1M tokens**. This might be a concern for budget-conscious projects or those needing to process large volumes of data.

**Gemini-Flash-1.5** offers a much more attractive price point, with a **blended price of $0.53 per 1M tokens**. This makes it a compelling choice for large-scale applications where cost efficiency is paramount.

| Model | Blended Price (3:1) | Input Token Price | Output Token Price | Assessment |
|---|---|---|---|---|
| GPT-4o | $7.50 per 1M Tokens | $5.00 per 1M Tokens | $15.00 per 1M Tokens | More expensive, suitable for high-value applications |
| Gemini-Flash-1.5 | $0.53 per 1M Tokens | $0.35 per 1M Tokens | $1.05 per 1M Tokens | Cost-effective, ideal for large-scale deployments |

### Speed and Latency: Time is of the Essence

Speed and latency play a crucial role in many AI applications. 

**GPT-4o** delivers a **speed of 86.8 tokens per second**, making it suitable for applications needing reasonably quick responses. Its **latency (TTFT) of 0.45 seconds** ensures quicker initial responses, enhancing user experience in interactive scenarios.

**Gemini-Flash-1.5** takes the lead in speed, boasting an **output speed of 163.6 tokens per second**. This makes it ideal for real-time applications where speed is critical. However, its **latency of 1.06 seconds** might be a factor to consider depending on the use case.

| Model | Output Speed | Latency (TTFT) | Assessment |
|---|---|---|---|
| GPT-4o | 86.8 tokens per second | 0.45 seconds | Good speed, low latency for interactive applications |
| Gemini-Flash-1.5 | 163.6 tokens per second | 1.06 seconds | Very fast, suitable for real-time applications |

### Context Window: Handling Extensive Information

The size of the context window determines how much information a model can process at once. 

**GPT-4o** offers a **context window of 130k tokens**, which is substantial and beneficial for applications requiring extensive context retention and understanding.

**Gemini-Flash-1.5** takes the crown in this category, boasting an **exceptional context window of 1.0M tokens**. This makes it a perfect choice for complex tasks involving long documents or conversations.

| Model | Context Window | Assessment |
|---|---|---|
| GPT-4o | 130k tokens | Large context window for extensive information |
| Gemini-Flash-1.5 | 1.0M tokens | Unparalleled context window for complex tasks |

### Conclusion: Making the Right Choice

Choosing between GPT-4o and Gemini-Flash-1.5 depends on your project's specific needs. 

- **GPT-4o** excels in quality and low latency, making it suitable for high-stakes applications where precision and quick initial responses are crucial. However, its higher cost might be a limiting factor for some.

- **Gemini-Flash-1.5** shines in speed, cost efficiency, and its unparalleled context window size. This makes it a great choice for large-scale deployments and scenarios where processing speed and cost are paramount.

Ultimately, the decision should align with the requirements and constraints of your AI project.  By carefully considering the factors discussed here, you can choose the language model that best empowers your AI agents and agentic workflows.

Summary

While the most important attribute for AI Agents is a strong ability to reason, at a certain point the models are strong enough for most tasks. The practical requirements of software demand a balance of cost and speed as well. Gemini 1.5 Flash is positioned high enough on reasoning and destroys the competition on speed and price, making it a great default for powering your agents.

If you're using GPT-3.5 or GPT-4 Turbo, Gemini 1.5 Flash will be faster, cheaper, and better. If you're using GPT-4o (or Gemini 1.5 Pro), you'd be giving up about 10% in reasoning & knowledge and 4.5% in general ability for a roughly 2x speedup and an order-of-magnitude price reduction (the blended prices above put GPT-4o at about 14x the cost of Flash). It's worth a test if nothing else. You can always let a particular agent use a higher-order model if the need arises.
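
To put the blended prices in concrete terms, here's a back-of-the-envelope cost comparison; the 100k-token figure is a hypothetical agentic run with several LLM round-trips, not a measurement:

# Blended 3:1 prices quoted above, in dollars per 1M tokens.
PRICES = {"gpt-4o": 7.50, "gemini-1.5-flash": 0.53}

tokens_per_run = 100_000  # hypothetical multi-step agent run
for model, price in PRICES.items():
    print(f"{model}: ${tokens_per_run / 1_000_000 * price:.3f} per run")
# gpt-4o: $0.750 per run
# gemini-1.5-flash: $0.053 per run (roughly 14x cheaper)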

Further Reading

Llama 3 on Groq was a standout for me in the report. If you're focused on more open models, this route looks very promising: it appears to be one of the fastest options available, and the price appears attractive.

Finally

If you want to learn more about AI Agents, CrewAI, or Agentic workflows, please reach out to Ryan at ryan@channelfour.ai

Reference - Final Python Code


import argparse
from datetime import datetime
from crewai import Agent, Task, Crew
from crewai_tools import SerperDevTool, WebsiteSearchTool  # optional web-research tools (unused in this timed run)
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI

def log_time(callback_input):
    # CrewAI calls this after each step; we only care about the timestamp.
    print(datetime.now())

prompt = """
Compare and Contrast gpt-4o and gemini-flash-1.5 for AI agents and Agentic Workflows.

gpt-4o
Quality:
GPT-4o is of higher quality compared to average, with a MMLU score of 0.887 and a Quality Index across evaluations of 100.
Price:
GPT-4o is more expensive compared to average with a price of $7.50 per 1M Tokens (blended 3:1).
GPT-4o Input token price: $5.00, Output token price: $15.00 per 1M Tokens.
Speed:
GPT-4o is faster compared to average, with a output speed of 86.8 tokens per second.
Latency:
GPT-4o has a lower latency compared to average, taking 0.45s to receive the first token (TTFT).
Context Window:
GPT-4o has a larger context windows than average, with a context window of 130k tokens.



gemini-flash-1.5
Quality:
Gemini 1.5 Flash is of higher quality compared to average, with a MMLU score of 0.789 and a Quality Index across evaluations of 84.
Price:
Gemini 1.5 Flash is cheaper compared to average with a price of $0.53 per 1M Tokens (blended 3:1).
Gemini 1.5 Flash Input token price: $0.35, Output token price: $1.05 per 1M Tokens.
Speed:
Gemini 1.5 Flash is faster compared to average, with a output speed of 163.6 tokens per second.
Latency:
Gemini 1.5 Flash has a higher latency compared to average, taking 1.06s to receive the first token (TTFT).
Context Window:
Gemini 1.5 Flash has a larger context windows than average, with a context window of 1.0M tokens.
"""
openAI_llm = ChatOpenAI(model_name="gpt-4o", temperature=0.7)
flash_llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.7)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Send a prompt to an LLM.")
    parser.add_argument("model_choice", type=int, choices=[1, 2], help="1 for GPT-4o, 2 for Gemini Flash")
    args = parser.parse_args()

    if args.model_choice == 1:
        llm = openAI_llm
    elif args.model_choice == 2:
        llm = flash_llm

    research_agent = Agent(
      role="Your job to research the latest articles and news on the topic",
      goal="Find the latest articles and news relating to the task at hand. when summarizing your findings, pick a winner",
      verbose=True,
      memory=False,
      backstory="""
    You are an expert in the field of research and have been tasked with finding the latest articles and news regarding the topic at hand. 
    You have access to a wide range of resources and are able to quickly and efficiently find the information needed to complete the task.
    You should provide significant context for the next agent to use.
    """,
      allow_delegation=False,
      tools=[],
      cache=False,
      #llm=llm,
      #llm=flash_llm
      llm=openAI_llm
    )

    format_agent = Agent(
      role="Your job is to format the information provided",
      goal="Format the information specifically to be a technical blog post and adhere to rules laid out by Benjamin Dreyer in the book Dreyer's English",
      verbose=True,
      memory=False,
      backstory="""
    You are an expert in the field of formatting and have been tasked with formatting the information provided by the previous agent.
    """,
      allow_delegation=False,
      #llm=llm,
      llm=flash_llm,
      #llm=openAI_llm
    )

    research_task = Task(
      description="Research {input}",
      expected_output=(
        "Content necessary for a technical blog post"
      ),
      agent=research_agent
    )

    format_task = Task(
      description="Format the information",
      expected_output=(
        "A formatted technical blog post"
      ),
      agent=format_agent,
    )

    crew = Crew(
      agents=[research_agent, format_agent],
      tasks=[research_task, format_task],
      step_callback=log_time,
      verbose=10,
    )

    log_time("")
    result = crew.kickoff(inputs={'input': prompt})
    log_time("")
    print(result)

Sign up for our mailing list