Gemini 1.5 Pro Multimodal AI Overview

4 weeks ago


Artificial intelligence has developed rapidly in recent years and now touches almost every aspect of our lives. One of the drivers of this change is Google's Gemini family of models, whose chatbot was previously called Bard. With its new version, Gemini 1.5 Pro, Google has taken another important step, most notably by letting the model take in far more data at once. So what is Gemini 1.5 Pro? In this article, I will walk you through the 12 tests I prepared for this AI and share some thoughts on where artificial intelligence is heading. Let's take a closer look.

About Gemini 1.5 Pro

Gemini 1.5 Pro, released to developers on February 15, 2024, is currently Google’s most advanced multimodal AI. Multimodal AIs are models that can process different types of data, such as text, code, images and video (ChatGPT, which most of you know, is also multimodal).

How Do I Use Gemini 1.5 Pro?

On February 15th, this model was opened only to developers through a waiting list, but last week it was made freely available to everyone on the Google AI Studio website. You can try Gemini 1.5 Pro by following the steps below.

1) Go to ai.google.dev
2) Click the “Get API key in Google AI Studio” button.
3) Sign in with your Google account.
4) Press the “New Prompt” button on the page that opens and start using Gemini 1.5 Pro.

That’s how simple it is to register. So what are the features of this artificial intelligence?

Gemini 1.5 Features

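First of all, what is a token? Briefly, a token is the basic unit of data that an AI model reads and writes; for text, it is roughly a word or a piece of a word. The number of tokens a model can take in at once is called its context window. As you can see in the table above, even GPT-4 Turbo, one of the most widely used AIs in the world, has a context window of 128 thousand tokens, while Gemini 1.5 Pro will reach 1 million tokens.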

With 1 million tokens, what can we have Gemini 1.5 Pro examine in a single prompt?

1 hour of video,
11 hours of audio,
More than 30,000 lines of code,
More than 700,000 words of text.
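
To get a feel for these figures, the short Python sketch below derives the rough token cost per unit that each of them implies when the 1 million token window is divided by the quantity listed. The constant and function names are made up for this illustration, and the resulting rates are only back-of-the-envelope values, not official numbers; real token counts depend on the tokenizer and on the content itself.

```python
# Back-of-the-envelope token rates implied by the capacity figures above.
# Assumption: each figure on its own fills the whole 1,000,000-token window.
CONTEXT_WINDOW_TOKENS = 1_000_000

# Quantities from the list above, expressed in a convenient base unit.
# The code and text figures are "more than" values, so their rates are upper bounds.
CAPACITY = {
    "video_minutes": 60,       # 1 hour of video
    "audio_minutes": 11 * 60,  # 11 hours of audio
    "code_lines": 30_000,      # more than 30,000 lines of code
    "text_words": 700_000,     # more than 700,000 words of text
}


def implied_tokens_per_unit(capacity=CAPACITY, window=CONTEXT_WINDOW_TOKENS):
    """Return the approximate tokens consumed by one unit of each data type."""
    return {name: window / amount for name, amount in capacity.items()}


def estimate_tokens(amount, unit, rates=None):
    """Estimate how many tokens `amount` units of `unit` would use."""
    rates = rates or implied_tokens_per_unit()
    return amount * rates[unit]


if __name__ == "__main__":
    for unit, rate in implied_tokens_per_unit().items():
        print(f"{unit}: about {rate:,.1f} tokens each")
    # A hypothetical 90,000-word novel, using the implied per-word rate.
    novel_tokens = estimate_tokens(90_000, "text_words")
    print(f"90,000 words: about {novel_tokens:,.0f} tokens")
```

By these rough rates, a word costs a little under one and a half tokens and a minute of video costs well over ten thousand, which is why video fills the window so much faster than text.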

This is a truly incredible scale, at least for now. As I will mention at the end of the article, even this number of tokens may look very low as NVIDIA's chips develop.

Gemini 1.5 Pro Usage Tests:

Grab a coffee and let's go through the results of the 12 tests I have prepared for you. I selected these tests carefully and ran them on Gemini 1.5 Pro, repeating many of them on other AIs for comparison. They are not rigorous technical benchmarks, just my comments on how each model performed.

Here are the 12 tests, in order:

Test 1: Writing a Story
Test 2: Summarizing a Story
Test 3: Solving Mistakes in the Text
Test 4: Complex Word Translation
Test 5: Image Description
Test 6: Chaotic Image Description
Test 7: Meme Explanation
Test 8: Drawing Explanation
Test 9: Code Analysis
Test 10: Solving a Problem
Test 11: Video Summarization
Test 12: Asking a Question About a Chapter in a Video

Test 1: Writing a Story

It wouldn't be much fun if I simply dictated the story I wanted. That's why I first asked Gemini 1.5 Pro to come up with the details for a story itself. It produced 61 sentences of details. I can't fit them all here, so I'm only showing the details about the characters below.

Then I asked Gemini 1.5 Pro to turn the 61 details it had prepared for me into a story.

The text generated by Gemini 1.5 Pro is shown in the picture above. Honestly, it's not very long and I wasn't very impressed. The flow and coherence of the text are good, but I thought it could be longer. For comparison, I gave the same details to Claude 3 Opus, which I trust for text generation, and asked it to write the story.

Claude's story continues beyond this, but the excerpt is enough for our purposes. It's longer than Gemini's. I won't comment on the quality and coherence of the writing, since that is a matter of taste, but I prefer Claude's version.

You may not have felt like reading Claude's whole story. So I immediately ran a summarization test on Gemini 1.5 Pro.

Test 2: Summarizing a Story

I asked Gemini 1.5 Pro to summarize the story that Claude wrote in Test 1.

I think Gemini 1.5 Pro did an adequate summarization. I also gave Claude the same summarization prompt. Claude's summary was longer than Gemini 1.5 Pro's. But since I asked for a summary, I can say that Gemini was more successful.

Test 3: Solving Mistakes in the Text

I had another AI write a text containing grammatical errors, then asked Gemini 1.5 Pro to correct it and suggest improvements. The result is shown in the photo below:

That's a really good explanation. Just what I wanted. I repeated this test with other multimodal AIs. Results:

ChatGPT 3.5: It just corrected the errors in the text and sent it back to me as a new text.
Gemini: It gave an explanation similar to Gemini 1.5 Pro's. It was shorter, but it was enough.
Gemini Advanced: It gave an explanation similar to Gemini's. It was longer, but not as detailed as Gemini 1.5 Pro's.
Claude 3 Opus: Similar to Gemini 1.5 Pro. But Claude 3 Opus only listed the errors and didn't give suggestions.

Test 4: Complex Word Translation

I asked it to translate the longest word in Turkish, “muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine”, into English. This word was deliberately constructed by stacking one affix on another to be the longest word in Turkish.

To be honest, I thought Gemini 1.5 Pro would only give the English translation. In fact, my prompt only asked for a translation. Although I didn't ask for an explanation, Gemini 1.5 Pro gave one anyway. And one thing should not be overlooked: the explanation is really good!

I tested this prompt with ChatGPT 3.5 and Claude 3 Opus.

ChatGPT 3.5 only gave the English translation.
Claude 3 Opus gave a long explanation like Gemini 1.5 Pro. You can see the explanation in the picture below:

As you can see, Gemini 1.5 Pro's explanation is more detailed than Claude 3 Opus's.

Test 5: Image Description

One of the most used features in multimodal AIs is image description.

When we ask Gemini Advanced to describe an image with a person in it, it says that it will not interpret the image because there is a human in it. Gemini 1.5 Pro, on the other hand, is willing to describe images that contain people.

Gemini 1.5 Pro handled this easily, but it was an easy picture. So how will Gemini 1.5 Pro explain a chaotic picture?

Test 6: Chaotic Image Description

I asked the Gemini 1.5 Pro model to interpret a chaotic picture in which you see different things every time you examine it. First, take a look at the picture below. What do you see? Then let’s examine together what Gemini 1.5 Pro sees.

I don't know what you think, but the sentence that impressed me the most was “The overall impression is one of confusion and disorganization”. My point here is not just that Gemini 1.5 Pro has a good image analysis system. Different companies have been producing software to detect objects in images for years. But object recognition alone would not be enough to infer “confusion and disorganization” from an image. This suggests that Gemini 1.5 Pro is also capable of scene understanding and interpretation.

Test 7: Meme Explanation

To explain this meme, Gemini 1.5 Pro needs to handle three main things.
1) The image
2) The text
3) The humor
I think Gemini 1.5 Pro did all of this at a sufficient level and explained the meme in a way that someone who doesn't get the joke can understand.

Test 8: Drawing Explanation

For this test I wanted a low-quality drawing, and since I am a “great painter”, I made a very beautiful one myself. In the drawing there is a happy person holding a Turkish flag and 2 unhappy people playing soccer. And the result:

Gemini 1.5 Pro noticed almost every detail: a smiling man holding a flag and 2 men who are not smiling. I specifically tried to make the flag look like the flag of Turkey, and it noticed that and wrote “flag of Turkey or another country with a similar flag design”. Gemini 1.5 Pro only failed to notice the soccer ball between the two people in the background. Maybe this is because of my inability to draw.

Test 9: Code Analysis

I passed the following prompt to Gemini 1.5 Pro for code analysis.

Gemini 1.5 Pro gave the following response:

The answer is really very good. But when I tested this code with Claude with the same prompt, I got a similar response. You can see the similarity in the screenshot below.

Test 10: Solving a Problem

I asked Gemini 1.5 Pro to solve the problem below.

“Imagine you are a logistics manager responsible for organizing the delivery of packages to multiple cities. You have a fleet of trucks with varying capacities, and each truck can make multiple trips. Your task is to develop an algorithm that optimizes the number of trips required to deliver all the packages while minimizing the total distance traveled. Describe your approach in detail, including any assumptions, data structures, and algorithms you would use. Additionally, discuss the time and space complexity of your solution and any potential limitations or edge cases you might encounter.”

The result is excellent. When I tried this question with other multimodal AIs, they all gave similar solutions and explained them in detail.
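
For readers curious what an answer to this kind of prompt can look like in code, here is a minimal Python sketch of one common heuristic, first-fit decreasing, for keeping the number of trips low. It is an illustration only, not the output of Gemini 1.5 Pro or any other model, and it makes strong simplifying assumptions: every truck has the same capacity, packages are described only by weight, and travel distance is ignored. The function name `plan_trips` is hypothetical.

```python
def plan_trips(package_weights, truck_capacity):
    """Group packages into trips using the first-fit decreasing heuristic.

    Simplifying assumptions (not part of the original problem statement):
    all trucks share one capacity, and distance between cities is ignored.
    Returns a list of trips, each trip being a list of package weights.
    """
    trips = []      # packages assigned to each trip
    free_space = [] # remaining capacity of each trip, same order as `trips`

    # Placing the heaviest packages first tends to leave fewer half-empty trips.
    for weight in sorted(package_weights, reverse=True):
        if weight > truck_capacity:
            raise ValueError(f"package of weight {weight} exceeds truck capacity")

        # Put the package into the first existing trip that still has room.
        for index, space in enumerate(free_space):
            if weight <= space:
                trips[index].append(weight)
                free_space[index] -= weight
                break
        else:
            # No existing trip can take it, so start a new trip.
            trips.append([weight])
            free_space.append(truck_capacity - weight)

    return trips


if __name__ == "__main__":
    print(plan_trips([4, 8, 1, 4, 2, 1], truck_capacity=10))
    # → [[8, 2], [4, 4, 1, 1]]
```

Sorting costs O(n log n) and the placement loop costs O(n × t) for n packages and t trips. A fuller answer to the prompt would also have to handle trucks of different sizes and route distances, which this sketch deliberately leaves out.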

Test 11: Video Summarization

Here is my favorite feature, and the one that sets Gemini 1.5 Pro apart from other multimodal AIs.

Video Duration to be summarized: 1 Minute

I started with a short 1-minute video for video summarization. Gemini 1.5 Pro processed the 1-minute video in about 10 seconds and sent me the summary.

Video Duration to be summarized: 10 Minutes

For a video 9 minutes and 57 seconds long, Gemini 1.5 Pro needed only about 60 seconds to process it and summarize it for me.

So how long a video can I have analyzed?

You are probably asking this question. With simple math, extrapolating from the 10-minute video to the 1 million token limit, Gemini 1.5 Pro can analyze a video of about 66 minutes at the same quality (720p).
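
The simple math here is a proportional extrapolation, which can be sketched in Python as follows. The exact token count of the 10-minute video is not given here, so the 150,000 used below is a placeholder chosen only because it is consistent with the roughly 66-minute result; substitute the token count that Google AI Studio reports for your own video. The function name is hypothetical, and the sketch assumes token usage grows linearly with video length.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # Gemini 1.5 Pro's limit as described in this article


def max_video_minutes(sample_tokens, sample_minutes, window=CONTEXT_WINDOW_TOKENS):
    """Extrapolate the longest video that fits in the context window.

    Assumes tokens grow linearly with duration at the same resolution.
    """
    if sample_tokens <= 0 or sample_minutes <= 0:
        raise ValueError("sample_tokens and sample_minutes must be positive")
    tokens_per_minute = sample_tokens / sample_minutes
    return window / tokens_per_minute


if __name__ == "__main__":
    # Placeholder assumption: a 10-minute 720p video used about 150,000 tokens.
    minutes = max_video_minutes(sample_tokens=150_000, sample_minutes=10)
    print(f"Estimated maximum video length: {minutes:.1f} minutes")
```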

Test 12: Asking a Question About a Chapter in a Video

I picked a single second from a 10-minute segment of a movie and asked what the woman was doing at that moment. You can see the scene and Gemini 1.5 Pro's answer in the photo below.

The answer is actually wrong: she was blowing on the canister. I ran a lot of tests like this, and it sometimes makes mistakes of this kind. But even so, this is great technology for now.

My Personal Comments about Gemini 1.5 Pro

Previously, I often used the multimodal AIs produced by Google. I sometimes used Bard, but after Gemini Advanced came out, it became one of the AIs I use most often for finding up-to-date information. Gemini is better than GPT-4 at retrieving current information, but it does not write as well as GPT-4. I live in Turkey, and for some reason that I don't know yet, Claude is not available here. After I managed to test Claude, I stopped using GPT-4 completely, because Claude is at a high level when it comes to writing and analyzing text. I may review Claude in another article, but as one example, I recently used it to write a petition. I just gave it the information and it wrote the petition exactly as I wanted. I don't think Gemini 1.5 Pro can do this so flawlessly. Of course, it may get better in the future with further training. So for now, the only thing I would use Gemini 1.5 Pro for is getting up-to-date information. But since the information in the developer version of Gemini 1.5 Pro is not up to date, I will continue to use Gemini Advanced.

Missing Features

I noticed 3 features that I think are missing.

Voice Input (asking questions by voice)
Image Generation
Internet Access

It would be wrong to expect internet access from an AI that is only open to developers. Its knowledge currently extends to November 2023. I think the image generation and voice input features found in Gemini Advanced will also arrive when Gemini 1.5 Pro is opened to everyone.

Latest Developments: NVIDIA's New Chips

By now, I think you understand how powerful Gemini 1.5 Pro's processing capacity is. Let's take a look at what may happen to that capacity in the future. You've probably heard about the chips that NVIDIA introduced at its event a few weeks ago. I had an AI calculate approximately how many tokens the power of these chips is equivalent to.

Nvidia Blackwell B200: It is estimated that this chip can process up to 1.5 billion tokens.
Nvidia Hopper: This chip is estimated to handle up to 2 billion tokens.
Nvidia GeForce RTX 4090 Ti: This graphics card is estimated to process up to 100 billion tokens.

These are truly scary numbers. While we marvel at an AI that can process up to 1 million tokens, these chips are estimated to process anywhere from 1.5 billion to 100 billion tokens.

As we come to the end of this article, I would like to give you some advice: Be innovative! Artificial intelligence has been growing very fast over the last 4 years. With growth like this, don't stick to a single AI assistant. Keep researching new AIs.