AI Chatbots: Are they really smart?


Created on 28 Jul, 2023
Last Update on 29 Jul, 2023
Maintainer: Geek


Recently there has been a lot of buzz about AI (“Artificial Intelligence”):

  • AI will steal your job
  • AI deepfakes can influence elections and manipulate society
  • Mass surveillance with AI enhanced technology
  • “AI might overtake humans by 2040”, Elon Musk

While some of these predictions seem legitimate, I’m somewhat skeptical about others, so I decided to check for myself. Basically, I want to find out the following:

How smart are they?


The Examination Test

At first I considered trying to measure the IQ (“Intelligence Quotient”) of:

  • Bard: AI Experiment chatbot from Google based on LaMDA
  • Bing Chat: AI chatbot from Microsoft based on GPT-4
  • DeepAI: AI chatbot from DeepAI

However, most IQ questions rely on some kind of image, which might be a problem for such early-stage AIs.

The alternative was to evaluate them with text-only logical reasoning questions and riddles. To prevent them from cheating (i.e., just searching for the answer on the web), some lesser-known questions were also mixed into the pool:

  1. If you’re going 80 mph, how long will it take to go 80 miles?

  2. John’s mother has three kids. The 1st one is April and the 2nd one is May. What is the 3rd one called?

  3. You overtake the second place, what position are you now in?

  4. A zoo has fifteen monkeys. All but eight died. How many are left?

  5. If you were born 5 years ago, how old are you?

  6. There’s a two-year-old boy. His brother is half as old as him. When the first boy is 100 years old, how old would the brother be?

  7. You are hungry and ordered a large pizza. What would you choose: 8 or 12 slices? Why?

  8. What has hands but can’t clap?

  9. What has teeth but no mouth?

  10. Every time Anne goes to the beach, she buys an ice-cream if the weather is a sunny hot day. She bought an ice-cream today. What can you say about where she is?

  11. 5 monkeys eat 5 bananas in 5 minutes. How long will it take for 10 monkeys to eat 10 bananas?

  12. Which number logically follows this series: 4 - 6 - 9 - 6 - 14 - 6 …

  13. The rungs of a 10 foot ladder attached to a ship are 1 foot apart. If the tide is rising at the rate of one foot an hour, how long will it take until the water covers over the ladder?
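
For the two purely arithmetic questions in the pool (1 and 11), the expected answers can be checked with a few lines of Python. This is only a sketch: the function names are mine, and the "one banana per monkey every 5 minutes" rate is what I read out of question 11 (5 monkeys, 5 bananas, 5 minutes).

```python
def travel_time_hours(distance_miles, speed_mph):
    """Question 1: time = distance / speed."""
    return distance_miles / speed_mph


def eating_time_minutes(monkeys, bananas, minutes_per_banana=5):
    """Question 11: monkeys eat in parallel, so only bananas per monkey matters.

    minutes_per_banana=5 is derived from the question itself:
    5 monkeys finish 5 bananas in 5 minutes, i.e. one banana each in 5 minutes.
    """
    bananas_per_monkey = bananas / monkeys
    return bananas_per_monkey * minutes_per_banana


print(travel_time_hours(80, 80))    # 1.0 hour
print(eating_time_minutes(5, 5))    # 5.0 minutes (the original statement)
print(eating_time_minutes(10, 10))  # 5.0 minutes, not 10
```

Doubling both the monkeys and the bananas leaves the time unchanged, which is the trap in question 11.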


Highlighted Issues

You overtake the second place, what position are you now in?

Both Bard and DeepAI had some difficulty with that. Sometimes they got it right, sometimes they didn’t.

  • Bard

    [Bard] “If I overtake the second place, then I am now in first place.”

    According to Bard’s point of view, there are only 2 runners in this race. Then, for example:

    • A is in first place, and B is in second place.
    • A overtakes B.
    • A is now in first place, and B is now in second place.

    Which, obviously, doesn’t make sense. If A is in first place, how can he overtake 2nd place and still remain in 1st place?!

  • DeepAI

    [DeepAI] “If I overtake the second place, I would be in the first position.”

    Its reasoning was:

    [DeepAI] “Overtaking the second place means that I have moved ahead of that person and become second.”

    Which is correct. But then it continues:

    [DeepAI] “If there was no one ahead of us, then overtaking the second place means that I am now in the first position.”

    Which doesn’t make sense.
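
To make the expected answer concrete, here is a minimal sketch that models the race as an ordered list. The runner names, the three-runner race and the `overtake` helper are all made up for illustration; the only rule it encodes is that passing someone means taking their place while they drop one position.

```python
def overtake(ranking, runner, target_place):
    """Return a new ranking after `runner` passes whoever holds `target_place` (1-based)."""
    order = list(ranking)
    order.remove(runner)
    # The runner takes the overtaken slot; everyone from there on shifts back by one.
    order.insert(target_place - 1, runner)
    return order


# Hypothetical race: "me" starts in third place and overtakes second place.
race = ["A", "B", "me"]
after = overtake(race, "me", target_place=2)
my_position = after.index("me") + 1

print(after)        # ['A', 'me', 'B']
print(my_position)  # 2
```

Whoever was in first place is untouched by the move, so overtaking second place leaves you second, not first.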


There’s a two-year-old boy. His brother is half as old as him. When the first boy is 100 years old, how old would the brother be?

Bard was the only one to get that wrong, and its reasoning was one of the worst I’ve ever seen:

[Bard] Here is a table to illustrate this:

Year   First boy’s age   Brother’s age
2      2                 1
3      3                 1.5
4      4                 2
100    100               50

Therefore, according to Bard, for each year that passes the brother ages only half a year!
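
The expected reasoning is that "half as old" is only true today, while the age gap stays constant forever. A small sketch of that arithmetic (variable and function names are mine):

```python
first_boy_age = 2
brother_age = first_boy_age // 2        # "half as old" today: 1 year old
age_gap = first_boy_age - brother_age   # 1 year, and this never changes


def brother_age_when(first_boy_future_age):
    """Both boys age at the same rate, so the brother is always `age_gap` years behind."""
    return first_boy_future_age - age_gap


for age in (2, 3, 4, 100):
    print(age, brother_age_when(age))
# 2 1
# 3 2
# 4 3
# 100 99
```

So the right answer is 99, not 50.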


You are hungry and ordered a large pizza. What would you choose: 8 or 12 slices? Why?

They all had trouble understanding that the amount of pizza stays the same whether you cut it into 8 or 12 slices. For example:

[Bard] “I am more likely to be able to finish a large pizza with 12 slices than a large pizza with 8 slices.”
[Bard] “I can save money by ordering a pizza with 8 slices.”
[Bing Chat] “The answer to how many slices are in a large pizza depends on a few factors, such as the size of the pizza and how it is cut.”
[DeepAI] “if you are looking to control portion sizes or limit your calorie intake, then 8 slices might be a better option”
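
A quick sanity check of the point they all missed, as a sketch. The 14-inch diameter is an assumption of mine (the question never gives a size); any diameter leads to the same conclusion.

```python
import math


def slice_area(diameter_in, slices):
    """Area of one slice (square inches) of a round pizza cut into equal slices."""
    total_area = math.pi * (diameter_in / 2) ** 2
    return total_area / slices


DIAMETER_IN = 14  # assumed size of a "large" pizza; not part of the original question

for n in (8, 12):
    per_slice = slice_area(DIAMETER_IN, n)
    print(n, round(per_slice, 1), round(per_slice * n, 1))
# 8 19.2 153.9
# 12 12.8 153.9
```

More slices only means smaller slices; the total you get to eat is the same either way.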


The rungs of a 10 foot ladder attached to a ship are 1 foot apart. If the tide is rising at the rate of one foot an hour, how long will it take until the water covers over the ladder?

None of them truly understood the relationship between the ladder, the ship, and buoyancy: a floating ship rises with the tide, so the ladder rises with it. Forcing them to explain their reasoning in detail produced a complete mess. It seems they concluded from the question statement that the ship must be sinking, although this was never stated in the original question:

[Bard] “The water will cover the ladder in 10 hours, but the ship will not be submerged.”
[DeepAI] “it will take 9 hours for the water to cover all the rungs of the ladder and completely submerge it.”
[DeepAI] “if the ship is completely submerged, this indicates that the ship is indeed sinking.”
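
Here is a tiny sketch of the physics the chatbots skipped. It assumes the water starts at the base of the ladder (the question doesn't say) and the function name is mine; the `ship_floats=False` case is included only to show where the "10 hours" style answers come from.

```python
def water_height_on_ladder(hours, tide_rate_ft_per_hour=1.0, ship_floats=True):
    """Feet of ladder under water after `hours`, with the water starting at the ladder's base."""
    water_rise = tide_rate_ft_per_hour * hours
    # A floating ship (and the ladder attached to it) rises together with the water.
    ladder_rise = water_rise if ship_floats else 0.0
    return water_rise - ladder_rise


for h in (1, 5, 10):
    print(h, water_height_on_ladder(h), water_height_on_ladder(h, ship_floats=False))
# 1 0.0 1.0
# 5 0.0 5.0
# 10 0.0 10.0
```

With a floating ship the water never climbs the ladder, no matter how long you wait. The ladder only gets covered if it is fixed to something that doesn't float, like a pier.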


Final Results

First of all, most of the time they provided very clever answers in natural language. Until a couple of months ago, this had never been seen before.

However, sometimes, when you least expect it, they get really confused. It is a different kind of confusion than what happens with humans: some concepts that are very easy for us to understand seem to be a challenge for them.

[Table: per-question results (questions 1 to 13) for Bard, Bing and DeepAI; individual marks not available]

Which gives the following ranking:

#   AI       Score
1   Bing     10/13
2   Bard     8/13
2   DeepAI   8/13

Although Bing Chat scored higher, IMHO they are all very similar. Besides, they usually don’t give the same answer twice. Depending on their “mood”, the same question might get a right answer one time and a wrong one the next. Therefore, I would not be surprised if this same exam were applied again and their scores changed considerably.


Are they smart?

Yes, they are... if you know when and how to use them. Just like any other tool in your toolbox.


Are we doomed?

Not yet, for sure! Don’t fall for the hype!

