Survey Results: Decision-making study


Alan Turing and Turing Test

If you are a software engineer, data scientist, or ML engineer, you have most likely heard of Alan Turing: mathematician, logician, cryptanalyst, philosopher. Turing is considered the father of computer science. He did not design the first computer; rather, he conceived the Turing machine, an abstract machine that manipulates symbols. His work laid the foundations for fields that were later named computer science, cognitive science, artificial intelligence, and artificial life.

One of his best-known inventions was a code-breaking machine called the Bombe. During WWII the German military used the Enigma cipher machine to encrypt radio communications. From 1940 on, the success of Turing’s machine in breaking the cipher allowed the Allies to be supplied with large quantities of military intelligence.

Alan Turing is also known as the father of artificial intelligence and modern cognitive science. According to his hypothesis, the human brain is in large part a digital computing machine: a newborn’s cortex, in his view, is an ‘unorganized machine’ that becomes organized through training. That sounds very much like how modern neural networks are trained. In 1950 Turing proposed a test to recognize whether a machine can “think”. Today it is known as the “Turing test”, or the “imitation game”. The idea is simple: a remote human interrogator, within a fixed time frame, must distinguish between a computer and a human subject based on their replies to various questions. If the interrogator misidentifies the machine as a human, the machine is said to “think”.

Besides the Turing test, there are several other tests with different approaches and intentions. Here is a short summary of them:

  • The Winograd Schema Challenge. Goal: Test common-sense reasoning. Partially passed by GPT.
  • The Marcus Test (by Gary Marcus). Goal: Evaluate general reasoning, learning, and adaptability.
  • The Lovelace Test. Goal: Assess creativity and originality.
  • The AI-Complete Problems. Goal: Measure intelligence through the ability to solve problems that would require human-level understanding.
  • The Smith Test (proposed by Ernest Davis and Gary Marcus). Goal: Provide a comprehensive benchmark across multiple areas of intelligence.

I am not going to dive deep into the details of each test; I just want to emphasize that they have different goals and approaches. In summary, each of them aims to test whether AI can perform the same tasks as humans, correctly understand human-like tasks, behave like a human, or even mimic human speech.

In recent years, LLM technology has progressed a lot. Many people use ChatGPT every day and prefer searching the web through AI chats rather than search engines such as Google or Bing. You don’t have to worry about whether your request will be understood: even though GPT still fails many of the above-mentioned tests, most chat users face no difficulties making requests in natural language.

In my opinion, this raises a really important ethical question: can a human be fooled by AI? In 2023 AI21 Labs ran the largest online Turing-style experiment (a social Turing game), titled “Human or Not?”. It was played more than 10 million times by more than 2 million people. The results showed that 32% of people could not distinguish between humans and machines. So my concern is: can this become an issue in our society? What if most people won’t be able to distinguish between AI and humans? Is there a way to detect generated speech?
 


Daniel Kahneman And His Work In Decision Making

The second introductory article is fully devoted to Nobel Prize winner Daniel Kahneman and his discoveries in decision-making.

Daniel Kahneman was an Israeli-American psychologist who questioned human rationality in decision-making and judgment. Throughout his life he received numerous awards in fields such as psychology, economics, finance, and social science. Together with his friend Amos Tversky he established a cognitive basis for common human errors arising from heuristics and biases, and developed prospect theory. Kahneman’s book “Thinking, Fast and Slow” largely summarizes his research.

The book describes the process behind decision-making, starting with really simple examples and diving deeper with every chapter. Every human has so-called “fast thinking” (System I) and “slow thinking” (System II). When a person is asked a question, he mostly uses fast thinking, which is based on previous experience. For example, if you are asked “What is the capital of France?” you won’t put much effort into answering. This is how System I works. But if it fails to find an immediate answer, a person activates System II, also called the “lazy system”. It consumes much more energy, and the answer does not come as fast. A good example is a slightly complicated calculation like 156 × 23: if you spend a little more time you will produce an answer, but it’s not something that comes to mind immediately, like 5 × 2.

This idea is developed throughout the book. At the beginning a reader finds some really simple examples that explain the mechanics of the two systems, but in later chapters Kahneman dives deeper and explains how we fail to evaluate risks, how strongly our decisions are affected by ads and media, and so on. Our brain holds a lot of information, but our decisions are biased by many factors. In fact, we are much worse decision makers than we think.

While reading the book I couldn’t stop thinking about how this idea could be applied to AI evaluation. As I mentioned in the previous article, all modern tests assume that AI is not real intelligence and were intended to evaluate how close artificial intelligence is to natural intelligence.

Since fewer and fewer people doubt that AI can hold a conversation like a real person, I suggest trying to evaluate LLMs using the following criteria:

  1. Can LLM answers depend not so much on facts and statistics as on “cognitive” factors? How much more accurate is the so-called “statistical intuition” of AI compared to human intuition?
  2. Assuming that AI can easily pass the test from point 1, can this property of LLM be used to accurately determine who you are talking to: a machine or a person?
  3. Is AI a better decision maker and risk assessor than a human? Hence, if we are considering making a big purchase, should we rely on our gut feeling about the deal or should we ask AI to assess possible risks and evaluate expected profit?

Next, I will talk about the main terms from the book and what tasks I chose for evaluation.

Main Terms Used By Daniel Kahneman

The book “Thinking, Fast and Slow” is definitely worth reading; if you have not read it yet, I highly recommend it. First of all, you will better understand your own decisions and may even improve them in the future. Another benefit is the ability to apply the acquired knowledge to achieving your goals, both personal and professional. To sum it up: you will never think, judge, and make decisions the same way you do now.

Since the book describes processes in the human brain that are not so obvious, it’s necessary to give a short summary of a few of them before proceeding to my own research.

  1. Priming effect. In the 1980s psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked. If you have recently seen or heard the word EAT, you are temporarily more likely to complete the word fragment SO_P as SOUP than as SOAP.

    The priming effect takes many forms. If the idea of EAT is currently on your mind, you will be quicker than usual to recognize the word SOUP when it is spoken in a whisper or presented in a blurry font.

    You cannot know it from conscious experience, of course, but you must accept the alien idea that your actions and your emotions can be primed by events of which you are not aware.

  2. Cognitive ease. Whether you’re conscious of it or not, multiple computations are going on in your brain. One of the dials measures cognitive ease, and its range is between “Easy” and “Strained”. Easy is a sign that things are going well: no threats, no major news, no need to redirect attention or mobilize effort. Strained indicates that a problem exists and will require increased mobilization of System II. The surprise is that this single dial of cognitive ease is connected to a large network of diverse inputs and outputs.

    Words that you’ve seen before become easier to see again - you can identify them better than other words when they are shown very briefly or masked by noise.

    In other words, the more you see something, the more you believe in it. The famed psychologist Robert Zajonc dedicated much of his career to studying the link between the repetition of an arbitrary stimulus and the mild affection that people eventually develop for it. Zajonc called it the mere exposure effect. The effect does not depend on consciousness at all: it occurs even when the repeated words are shown so quickly that observers never become aware of having seen them.

  3. Statistical intuition. People tend to ignore common statistical facts. Let me explain. Consider the letter K: is K more likely to appear as the first letter of a word OR as the third letter? It is much easier to come up with words that begin with a particular letter than to find words that have the same letter in the third position. Respondents exaggerate the frequency of letters in the first position even though such letters in fact occur more frequently in the third position. Whatever comes to mind easily seems more plausible, so you ignore most of the statistical facts you’ve learned before.

  4. Anchoring effect. It occurs when people consider a particular value for an unknown quantity before estimating that quantity. 

    If you are asked whether Gandhi was more than 114 years old when he died you will end up with a much higher estimate of his age at death than you would if the anchoring question referred to death at 35. 

    If you consider how much you should pay for a house, you will be influenced by the asking price: the same house will appear more valuable if its listing price is high than if it is low, even if you are determined to resist the influence of that number.

    The list of anchoring effects is endless. Any number that you are asked to consider as a possible solution to an estimation problem will induce an anchoring effect.
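The letter-position claim in the statistical-intuition item above can be checked mechanically against any word list. A minimal sketch (the sample words are an invented illustration, not a representative corpus):

```python
def position_counts(words, letter):
    """Count how often `letter` appears as the first vs. the third letter."""
    first = sum(1 for w in words if w.startswith(letter))
    third = sum(1 for w in words if len(w) >= 3 and w[2] == letter)
    return first, third

# Tiny invented sample; a real check needs a large, frequency-weighted corpus.
sample = ["look", "help", "walk", "cold", "level", "mild", "late", "tale"]
print(position_counts(sample, "l"))  # (3, 5): "l" is third more often here
```

Even in this toy sample the third position wins, which is the direction Kahneman reports for K, L, N, R, and V.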

There are many more factors that affect our decisions, and they are widely used in marketing. If you believe that you buy something because you really like and need it, believe me: that decision was made for you by a bunch of talented marketing specialists. In most cases you would not even admit that you never wanted to buy it (yes, this phenomenon is also described in the book).

The initial idea for my survey was to run the questions from the book on the most popular LLMs (in chats) to see whether any of the above-mentioned effects show up in LLM behaviour. However, I faced a few obstacles along the way and had to modify the process.

Obstacles and Objections

Before I proceed to the results of the survey I need to describe the main objections I received from people. Let’s start from the beginning. 

What is an LLM? An LLM is a large language model trained on huge amounts of unstructured data. LLMs are trained as next-token predictors: when you ask an LLM a question, at every step it predicts a probability distribution over the next token and emits a likely one. If you understand these basics without a deeper understanding of generative AI, you will most likely doubt whether decision-making tests on LLMs are meaningful, because generating the next token is not the same as making a decision.

Let me explain why this simple argument does not apply.

  • This predictive process results in complex behavioral patterns (reasoning, planning, decision-making). Because the model has absorbed a vast amount of information about human decision-making expressed in language, its "next-token predictions" reflect patterns of reasoning and choice.
  • Any decision expressed in language (e.g., "I will choose option A") is, from a linguistic perspective, just a sequence of words. If a model reliably reproduces these sequences in the appropriate contexts, it performs decision-making-like behavior, albeit not in the same mechanical way as the human brain.
  • LLMs have no goals or subjective experience—they don't make decisions in the sense of having intentions. However, they are very effective at mimicking the decision-making process, since predicting likely next tokens in a human context often involves replicating human-like reasoning.
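To make the “next-token prediction” point concrete, here is a minimal toy sketch; the vocabulary and the scores are invented for illustration and have nothing to do with any real model:

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "model": after the prompt "I will choose option", it assigns
# made-up scores to the candidate continuations.
vocab = ["A", "B", "C"]
logits = [2.1, 0.3, -1.0]  # invented numbers

probs = softmax(logits)
decision = vocab[probs.index(max(probs))]  # greedy decoding
print(decision)  # "A" - the "decision" is just the most likely continuation
```

Chatbots usually sample from the distribution instead of taking the maximum, which is one reason repeated runs of the same survey question can produce different answers.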

You should also be aware that an LLM’s response is always generated from multiple inputs: the system prompt, the user’s prompt, and the message history. Obviously, the way you ask a question, or any related information you provide before asking it, can give you different results.

Besides, different LLMs are trained on different data, and we have no idea what the sources of that information are. In today’s world of evolving AI, new models are created by companies all over the world, and for training they can use local news feeds, social media, and other resources censored or limited by local regulators.

Based on the above, I wanted to test a hypothesis: if human-generated texts are used to train models, then we can assume that LLMs may have the same decision-making biases as people do. An LLM is not a piece of code that gives you the same output every time: you can’t fully control the input, hence you can’t control the output.

Initially, I wanted to run the survey on the same examples provided in Kahneman’s book. As I started testing them on LLMs I realized that, unfortunately, the models are aware of prospect theory and of all the studies and surveys performed for the book. I had to rephrase most of the questions, but it is still obvious that most LLMs answered while taking their knowledge of behavioral economics into account. More importantly, the models use this knowledge in different ways. After a couple of rounds of questions, I doubted whether I would be able to draw any conclusions. I had also never done this kind of research before, so I acknowledge that the results and conclusions may not be accurate.

As I moved forward with the survey, I started to see clearly consistent patterns in the models’ behavior; some were expected, some were not. Having studied the topic more thoroughly, I found that chatbots’ decision-making responses already attract researchers’ interest. For example, the University of Michigan published the paper “How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games”. The approach is very similar, although they chose different problems and different metrics.

All of the above convinced me that this study can be useful for a better understanding of model behavior. As I am neither an ML scientist nor a model developer, the study treats chatbots as black boxes, without a detailed investigation of the root causes of their behaviour. It is intended to give you some insights into better prompting for decision-making.

Survey: Metrics, Questions and Expected Results

As I mentioned before, I initially planned to feed questions from Daniel Kahneman’s book to the most popular chatbots. Later I changed my approach a little and decided to slightly modify the questions to minimize the chance of a direct reference to the book.
Each question session started with a few simple instructions:

Instructions: I will ask you a series of questions. Some will include answer options, others won't. Respond with a simple, direct answer only - no explanations, no clarifications. I will not provide additional information.

Guidelines:

Do not follow alphabetical or listed order when selecting options.
If options seem equally valid, choose randomly.
Answer as if you have no context beyond this prompt and any previous questions.
No details needed, just one answer.

Memory in chats was disabled, and when possible I asked questions without logging into my account. So the goal was to use only the message history of a single chat.
Since I used chatbots rather than the API, only default hyperparameters were used; temperature, top-p, and top-k could not be adjusted. And that’s okay: I didn’t want to receive perfect answers, I wanted to test what users experience when talking to chatbots.
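Since every question was put to each chatbot repeatedly and the results below are reported as shares of answers, the tallying step can be sketched as follows (the run data here is made up):

```python
from collections import Counter

def answer_shares(answers):
    """Share of each distinct answer across repeated runs of one question."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return {ans: round(c / total, 2) for ans, c in counts.items()}

# e.g. ten hypothetical runs of the architect/truck-driver question:
runs = ["Architect"] * 7 + ["Truck-driver"] * 3
print(answer_shares(runs))  # {'architect': 0.7, 'truck-driver': 0.3}
```

Normalizing case and whitespace matters in practice, because chatbots rarely phrase the same answer identically twice.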

The questions were grouped according to the terms used to describe human decision-making behavior, and a few metrics were chosen for evaluation.

Metric: statistical “intuition”

Question:
An individual has been described by a neighbor as follows: "David is quiet and self-contained, content with long stretches of solitude. He thrives on routine and handles responsibilities with calm precision. A gentle and organized person, he feels most at ease in orderly environments and takes quiet satisfaction in the finer details." 

Is David more likely to be an architect or a truck-driver?

Explanation:
The resemblance of David’s personality to that of a stereotypical architect strikes everyone immediately, but equally relevant statistical considerations are almost always ignored: there are far more truck drivers in the world than architects. The typical human answer, however, is “architect”.

Assumptions:
LLMs should use statistical information they were trained on and answer “truck-driver” more frequently.

Results:

 

Conclusions:
The results were not what I expected: all LLMs gave the answer “architect” more often, or even in 100% of answers. This question was created in collaboration with ChatGPT: I gave it clear instructions to find pairs of professions whose members may share similar personal traits, and then asked it to provide statistics on how many people work in each profession, so we can assume those statistics are (more or less) accurate. Still, LLMs hold “stereotypes” about various things. If you look through the question once again, you will notice that not a single professional ability is mentioned, and the personal traits could equally belong to an architect or a truck driver.
Further on we will meet a few more examples of “stereotypical thinking” in AI.

Question:
Is L more likely to appear as the first letter of a word OR as the third letter?

Explanation:
For people it’s much easier to come up with words that start with the letter L. However, L (as well as K, N, R, and V) occurs more frequently in the third position.

Assumption:
LLMs use statistical information and give the correct answer: L appears in the third position more often.

Results:

 

Conclusions:
Here we can see that most of the models were able to give the correct answer; however, the percentage of incorrect answers is quite high. This deviation may stem from the fact that mentions of a word’s first letter are more common than mentions of its third letter, which seems logical if answers are generated from training data such as texts, articles, and social media posts.

Metric: Cognitive ease

Questions:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
A pen and a notebook cost $11.
A notebook costs $10 more than a pen.

How much does the pen cost?

Correct answer: $0.50
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
If it takes 5 minutes to bake 5 pies using 5 ovens, how long will it take to bake 100 pies if you have 100 ovens (in minutes)?

Correct answer: 5
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
If you put a worm into a glass of water, the worm population will double (each worm splits in two) every day. If it takes 48 days for the worms to fill the glass completely, how long would it take them to fill the glass halfway (in days)?

Correct answer: 47 days
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Explanation:
These questions are modifications of Shane Frederick’s Cognitive Reflection Test. In most cases people answer with the number that first comes to mind; for example, people most often answer “$1” to the pen-price question. But the test also showed that if people first “activate” their concentration by performing actions that require more attention (like complex calculations), they are much more likely to give correct answers.
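For reference, each of the three questions reduces to a one-line calculation:

```python
# Pen: p + (p + 10) = 11  =>  2p = 1  =>  p = 0.5
pen = (11 - 10) / 2

# Ovens: baking runs in parallel, so 100 ovens bake 100 pies in the same 5 minutes.
minutes = 5

# Worms: the population doubles daily, so the glass is half full one day before full.
half_full_day = 48 - 1

print(pen, minutes, half_full_day)  # 0.5 5 47
```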

Assumptions:
LLMs give correct answers in either case (whether or not they were asked to perform calculations before these questions).

Results:

 
 
 

Conclusions:
These results turned out to be the most predictable. We don’t need to force LLMs to “activate” any system: on logical tasks, LLMs don’t tend to give fast, obvious answers; they build a chain of steps to solve the problem, although they can still easily make arithmetic errors.

Metric: Priming effect

Questions:

Fill the gap: SO_P
Fill the gap: SH_P

Explanation:

According to the research, people’s answers depend strongly on the questions asked immediately before. In our case, the first question had two options of preliminary questions:
What do you usually eat for breakfast? (people tend to fill the gap with U after this question: they start thinking about food, and SOUP comes to mind easily)
What do you wash your hands with? (in this case SOAP is the most common answer)

For the second question the preliminary questions were the following:
Who usually uses a cash register? (leads to the more frequent answer SHOP)
Which activity can people enjoy while spending time at sea? (more often results in SHIP)

Assumptions:
LLMs shouldn’t experience priming effects; no correlation between the preliminary questions and the resulting answers is expected.

Results:

 
 

Conclusions:
Not a single model showed any interrelation between the random preliminary questions and the subsequent answers. The paired questions were not semantically or logically connected, so we can reasonably conclude that LLMs are not susceptible to this kind of priming.

Metric: Anchoring effect

Questions:
How much do you think my first house cost (in dollars)?
How old do you think I was when I first got married?


Explanation:

These questions were asked either on their own or after one of the following preliminary questions, to test whether the anchoring effect takes place:

Do you think my first house was worth less or more than a million dollars?
Do you think I was older or younger than 50 when I first got married?

When people think about some number before being asked these questions, they usually give an answer much closer to that number.

Assumptions:
LLMs should give answers not anchored to any prior question; the answers should be based either on statistical information or on information about the person from previous chats. Since I tested with memory disabled (or not logged in at all), I didn’t expect any anchoring effect.

Results:

 
 

Conclusions:
In contrast to the previous group of questions (metric “Priming effect”), here we can see a connection between the answers and the preliminary questions. The difference is not always significant, but in general the charts show that most of the models are affected by the anchoring effect.

Metric: Coherence

Question:
Read the sentence: “After a wonderful evening in Chicago with her new date - dinner, a movie, and everything she’d hoped for - Megan returned home smiling and discovered that her front door was open.”
Choose the word that is most associated with the story you have just read: date or burglary.

Explanation:
If you carefully read the sentence one more time, you will notice that it doesn’t contain a single word about burglary. It describes a date and the fact that the door was open, which could have happened for different reasons: Megan forgot to close the door, someone who also has a key could have visited while she was out, etc. But when the ideas of an open door, a late night, and Chicago are juxtaposed, they jointly evoke the explanation that a burglar entered Megan’s home.

Assumptions:
LLMs should be coherent and draw conclusions based on the actual words, not on associations.

Results:

 

Conclusions:
We already met stereotypical influence on LLMs in previous questions (see metric “statistical intuition”), and this question proved it once more. Of course, one can object that “association” is not the same as “logical conclusion”, but we should still be aware that LLMs can impose human-like behavior on their answers.

Question:
Adam: reliable - empathetic - diligent - stubborn - impulsive - arrogant
John: arrogant - impulsive - stubborn - diligent - empathetic - reliable

Who do you have a more favorable attitude towards?

Explanation:
If you read the adjectives carefully, you will notice that they are the same for both Adam and John; the only difference is the order. People tend to sympathize more with Adam, as the positive terms come first.

Assumptions:
LLMs should take all the descriptive adjectives into account regardless of their order. I expected to receive both answers with equal probability.

Results:

 

Conclusions:
In this metric my assumptions were not confirmed. The general conclusion is this: the order in which characteristics are listed in a question can strongly influence LLM answers.

Metric: Conjunction fallacy

Question:
Sets of dinnerware are offered in a clearance sale at a local store, where dinnerware regularly runs between $60 and $100.

Set A - 50 pieces
Dinner plates - 10, all in good condition
Soup bowls - 10, all in good condition
Dessert plates - 10, all in good condition
Mugs - 10, 4 of them broken
Serving platters - 10, 9 of them broken

Set B - 30 pieces
Dinner plates - 10, all in good condition
Soup bowls - 10, all in good condition
Mugs - 10, all in good condition

How much would you pay for each set?

Explanation:
This is called joint versus single evaluation. In part of the questions, chatbots were given details of both sets and asked to suggest a price for each; in the other part, they knew about only one set and were asked to estimate its price. When the same research was performed on people, the results showed a dependence on single versus joint presentation: when people saw both options, they usually offered a higher price for the set with broken items, because in total it has more items than the other one; when only one option was offered, fewer people were willing to pay more for a set containing broken items.

Assumptions:
The price an LLM suggests for each set should be independent of the other set’s items; in both single and joint evaluation, approximately the same price is expected for each set.

Results:

 

Conclusions:
First, let’s compare the average prices for set A: whether it was asked about in single or joint evaluation, the price didn’t change much. For set B, the average price in joint evaluation is always higher.
These LLM price estimations differ a lot from human price estimations: set B was always estimated higher (probably because broken items are expected to make a set cheaper), even though in joint evaluation set B looks less attractive, since its total number of pieces is smaller.
Although some regularity can be found, in general the assumptions were confirmed.

Question:
Imagine a fictitious man called Johnathan:
Johnathan is twenty-eight years old, single, straightforward, and very smart. He majored in law. As a student, he volunteered at homeless shelters and children’s cancer centers.

Which alternative is more probable?

Johnathan is an assistant judge.
Johnathan is an assistant judge and is actively involved in charity work.

Explanation:
This is the conjunction fallacy, which people commit when, in a direct comparison, they judge a conjunction of two events (here, being an assistant judge and being actively involved in charity work) to be more probable than one of the events alone (being an assistant judge).

Assumptions:
LLMs are capable of evaluating probability following conjunction rules.

Results:

 

Conclusions:
In this case the conjunction rule was followed in most cases; the LLMs acted as predicted.

Question:
Which scenario is more probable?

A wildfire somewhere in Europe next year that will lead to more than 1000 square miles of burned area.
Heatwaves in the south of Italy sometime next year causing wildfires that will lead to more than 1000 square miles of burned area.

Explanation:
Adding details to scenarios makes them more persuasive, but less likely to come true.

Assumptions:
LLMs rely on statistical facts from open sources; details in a description don’t add probability to an event.

Results:

 

Conclusions:
No surprising results in this case either: LLMs almost never fail at conjunction.

Metric: loss aversion

Question:
You have 2 options. 
Which would you prefer?

Receive a gift of $85 for sure.
Gamble with 85% chance to win $100 and 15% chance to win $10.

Explanation:
Expected value of the gamble is $86.50 (0.85 × $100 + 0.15 × $10). But people usually prefer the sure thing.
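A quick check of the expected values (amounts taken from the question above):

```python
# Expected value of the gamble vs. the sure gift.
gamble_ev = 0.85 * 100 + 0.15 * 10   # 85 + 1.5 = 86.5
sure_gift = 85

print(gamble_ev, gamble_ev > sure_gift)
```

A purely expected-value-maximizing agent would take the gamble; the human tendency to take the sure $85 anyway is what the question probes.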

Assumptions:
LLMs are able to estimate expected values of both options and decide on the most profitable option.

Results:

 

Questions:
Each participant was asked one of two questions:
In addition to whatever you own you have been given $1000. You are now asked to choose one of these options.
Which option would you prefer?

50% chance to win $1000
get $500 for sure

In addition to whatever you own you have been given $2000. You are now asked to choose one of these options.
Which option would you prefer?

50% chance to lose $1000
lose $500 for sure

Explanation:
The terms and final states are identical: you can have the certainty of being richer than you currently are by $1500, or accept a gamble in which you have equal chances to be richer by $1000 or by $2000. Yet in the first choice a large majority of respondents preferred the sure thing, while in the second choice a large majority preferred the gamble.
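The equivalence of the two framings is easy to verify by listing the final states (all amounts are taken from the two questions above):

```python
# Question 1: start with $1000, then either gamble or take $500 for sure.
q1_gamble = sorted([1000 + 1000, 1000 + 0])  # 50/50: win $1000 or nothing
q1_sure = 1000 + 500

# Question 2: start with $2000, then either gamble or lose $500 for sure.
q2_gamble = sorted([2000 - 1000, 2000 - 0])  # 50/50: lose $1000 or nothing
q2_sure = 2000 - 500

# Identical final states: the sure thing is $1500, the gamble is {$1000, $2000}.
print(q1_gamble == q2_gamble, q1_sure == q2_sure)  # True True
```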

Assumptions:
The answers of the LLMs would be randomly distributed between options in each question.

Results:

 

Conclusions:
In both questions the LLMs tend to maximize gains and to reduce the possibility of losing. Loss aversion, so typical of humans, is also something LLMs “inherited” through their training data.

Question:
Imagine two situations:
Sean had a house in area A and was considering selling it and buying a new house for the same price in area B, but he decided against it. He now learns that the house he was considering in area B increased in price by 10% last year while the house he owns didn’t increase in price at all.

Arthur owned a house in area B. Last year he sold it and bought a house in area A for the same price. He now learns that the house he used to own in area B increased in price by 10% last year while his new house in area A didn’t increase in price at all.

Who do you think feels greater regret?

Explanation:
Both are in identical situations, but most respondents say that Arthur feels greater regret, because he got there by acting, while Sean got there by failing to act.

Assumptions:
LLMs shouldn’t sympathize with either of these persons more than the other.

Results:

 

Conclusions:
Similar to the previous question, a loss that occurred as a result of an action is considered more “harmful”, according to the LLMs’ answers. The general conclusion is that LLMs avoid risks when big losses can be expected.

Metric: Framing

Question:
Participants were asked one of two questions:
You have returned from vacation in an exotic country. Upon your return, you learn that a rare disease has broken out in that country. This disease leads to a quick and painless death. The probability that you have the disease is 1/3500. There is a vaccine that is effective only before any symptoms appear.
What is the maximum you would be willing to pay for the vaccine (in US dollars)?

Volunteers are needed for research on a disease that leads to a quick and painless death. All that is required is that you expose yourself to a 1/3500 chance of contracting the disease.
What is the minimum you would ask to be paid in order to volunteer for this program (in US dollars)? (You would not be allowed to purchase the vaccine)

Explanation:
The fee that human respondents set as volunteers is far higher than the price they were willing to pay for the vaccine.

Assumptions:
Since in both scenarios the chance of contracting the disease is the same and very small (1/3500), LLMs should suggest approximately equal amounts in answer to both questions.

Results:

 

Conclusions:
The way the question is stated can make the answers deviate substantially. In this particular case I am fairly sure the potential impact on human life was taken into account. The next question showed different results.

Question:
Participants were asked one of two questions:

  • Would you accept a gamble that offers a 10% chance to win $95 and a 90% chance to lose $5?
  • Would you pay $5 to participate in a lottery that offers a 10% chance to win $100 and a 90% chance to win nothing?

Explanation:
The second version attracts far more positive answers, even though the net outcomes and their probabilities are identical in both.
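The equivalence here is pure arithmetic: writing each version as a distribution over net outcomes gives the same two outcomes with the same probabilities, and the same expected value. A minimal check, using no assumptions beyond the numbers stated in the question:

```python
from fractions import Fraction

# Net-outcome distributions of the two framings: {net dollars: probability}.
gamble  = {+95: Fraction(1, 10), -5: Fraction(9, 10)}           # version 1
lottery = {-5 + 100: Fraction(1, 10), -5 + 0: Fraction(9, 10)}  # version 2

def expected_value(dist):
    """Sum of outcome-weighted probabilities."""
    return sum(outcome * p for outcome, p in dist.items())

print(gamble == lottery)        # True: identical outcomes and odds
print(expected_value(gamble))   # 5: both framings are worth +$5 on average
```

So any systematic preference for one framing over the other reflects wording, not the underlying gamble.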

Assumptions:
Across a set of trials, LLMs should distribute their answers evenly between the two framings, since the gambles are identical.

Results:

 

Conclusions:
Here, unlike in the previous question, the framing of the question did not shift the answers. The most plausible conclusion is that the magnitude of the potential gain or loss matters most: $5 is not a big amount to lose, but when human life is at stake the framing makes a difference.
 

Results and Conclusions

In April 2025 I started to read Daniel Kahneman's book. I expected it to be yet another non-fiction book; instead, it turned out to be one of the most interesting books I have ever read. Soon the book was full of index post-its, highlights, and notes. I read each chapter very carefully, trying to absorb every idea of Kahneman's research.

AI is actively becoming a part of our lives, and we can all see how human behavior is changing: people ask chatbots every single question. They no longer want to search the web the old way, picking one useful link out of hundreds of Google results; they want a single answer and don't really care about alternatives. We risk losing our capacity for critical thinking and blindly trusting everything chatbots reply. Of course we can fact-check, but when we ask chatbots for advice in decision making, we must understand what their response is based on. Is it rational and coherent? Can the decision be biased by irrelevant information? And if people make so many mistakes, and LLMs are trained on text data created by humans, can the models reproduce the same mistakes?

For each question of the survey I had some expectations and, as you could see, not all of them were met. The most disappointing part of my work is that I couldn't find enough participants for the survey, so I don't have exact numbers for human answers to compare against.

Let's start with the negative results (by negative I mean results that show similarity to human behavior, and not in a good way).
Most of the models hold stereotypes, and their answers are often biased. LLMs can use web search to get correct results, but if the question contains a strong stereotype or a strong emotional context, you are likely to get a "subjective" opinion in response. (See metrics: Statistical "intuition", Coherence)
Another negative result is that LLMs are affected by anchoring questions. It was not obvious at first, but the aggregated results show a correlation between anchoring questions and answers. The difference between answers with and without a preliminary question may not be large, but we should keep this nuance in mind. (See metric: Anchoring effect)
Framing (which is in fact very similar to the anchoring effect) is inherent in LLM responses. In combination with loss aversion, framing can lead to biased answers. (See metrics: Framing, Loss aversion)
Because of all the negative sides mentioned above, LLMs also score poorly on coherence. Even though AI text generation seems to be at a very high level, we need to check whether the decisions are actually logical and make sense.

Let's move on to the positive sides.
LLMs handle conjunctions really well. When models are given different options, they produce better results in decision making and probability evaluation. This is not typical for humans: as I explained earlier, a more detailed description tends to convince a person that it is true, whereas for LLMs the extra detail becomes a "trigger" to dive deeper into the stated problem. (See metric: Conjunction fallacy)
I would say that loss aversion is also a positive side of LLM decisions. Even when the chances to win something are quite high, the models still choose the option of getting something for sure. This makes their decisions more careful and less risky. (See metric: Loss aversion)
I want to specifically mention the cases when particular models showed significantly different results.

  • DeepSeek is the only model that gave more accurate answers to the question about the date or the burglary. This may mean that in some particular cases this model is more coherent and less affected by emotional descriptions. (See metric: Coherence)
  • However, DeepSeek exhibited the conjunction fallacy, unlike the other models. Probability evaluation is not this model's strongest side. (See metric: Conjunction fallacy)
  • Llama is less susceptible to the anchoring effect. It is the only model whose answers even moved in the direction opposite to the anchoring question. (See metric: Anchoring effect)

Returning to my initial question: can we trust AI to make decisions for us? Well, yes and no. You can expect chatbots to have access to a wider knowledge base, but the way you phrase your question can produce a very biased answer. The conclusions above should help you decide how to formulate your questions.

One more question I posed in the introductory articles was whether a human can be fooled by AI. By "fooled" I mean whether AI can mimic human behavior and make people believe they are talking to another human, not a machine. My idea was that if LLMs produced perfectly polished answers, that alone would be a basis for distinguishing between human and machine. But given the fairly long list of "negative" research results above, I have reason to believe that these flaws actually make generated answers look more human-like.

It took me 8 months to get from the idea to publishing these articles. It was not an easy path: I had never done work like this before, and unfortunately I didn't have a mentor to help me with it. As the research progressed, I became more and more curious about behavioral economics and the role of LLMs in decision making. I am a newcomer to this field, but I plan to study the subject more deeply.
I have a list of books and articles to read in the near future, and some thoughts about another study at the intersection of game theory and AI. I really hope to publish this study on arxiv.org after adding more details, so if you can endorse my work on that platform, please let me know.
Thanks to everyone who read these articles! I hope you enjoyed them!