Survey Results: Decision-making study


Alan Turing and the Turing Test

If you are a software engineer, data scientist, ML engineer, or the like, you have most likely heard of Alan Turing. Mathematician, logician, cryptanalyst, philosopher. Turing is considered the father of computer science. He did not build the first computer; rather, he described the first abstract machine that operates on symbols, now known as the Turing machine. His work led to significant discoveries in areas that were later named computer science, cognitive science, artificial intelligence, and artificial life.

One of his best-known inventions was a code-breaking machine called the Bombe. During WWII the German military used the Enigma cipher machine to encrypt radio communications. In 1940 the success of Turing’s machine in breaking the cipher made it possible to supply the Allies with large quantities of military intelligence.

Alan Turing is also known as the father of artificial intelligence and modern cognitive science. According to his hypothesis, the human brain is in large part a digital computing machine. A newborn’s cortex, in his view, is an ‘unorganized machine’ that becomes organized through training. This sounds remarkably like the way modern neural networks are trained. In 1950 Turing proposed a test to determine whether a machine can “think”. Today it is known as the “Turing test”, or the “imitation game”. The idea is simple: a remote human interrogator, within a fixed time frame, must distinguish between a computer and a human subject based on their replies to various questions. If the interrogator misidentifies the machine as a human, the machine is said to “think”.
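The imitation game described above can be sketched as a minimal protocol. Everything in this sketch (the judge and responder callables, the single-round setup) is a hypothetical illustration of the idea, not an implementation of any real test:

```python
import random

def imitation_game(interrogator, human, machine, questions):
    """One round of the imitation game: the interrogator sees two
    anonymous transcripts and guesses which label is the human."""
    responders = [human, machine]
    random.shuffle(responders)                 # hide who is behind "A" and "B"
    labeled = dict(zip("AB", responders))
    transcripts = {label: [(q, reply(q)) for q in questions]
                   for label, reply in labeled.items()}
    guess = interrogator(transcripts)          # the label believed to be human
    machine_label = "A" if labeled["A"] is machine else "B"
    return guess == machine_label              # True: the machine fooled the judge
```

A judge who cannot tell the transcripts apart and guesses a fixed label is fooled in roughly half of the rounds, which is why Turing framed the criterion statistically rather than as a single pass/fail exchange.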

Besides the Turing test, there are several other tests with different approaches and intentions. Here is a short summary of them:

  • The Winograd Schema Challenge. Goal: Test common-sense reasoning. Partially passed by GPT.
  • The Marcus Test (by Gary Marcus). Goal: Evaluate general reasoning, learning, and adaptability.
  • The Lovelace Test. Goal: Assess creativity and originality.
  • The AI-Complete Problems. Goal: Measure intelligence through the ability to solve problems that would require human-level understanding.
  • The Smith Test (proposed by Ernest Davis and Gary Marcus). Goal: Provide a comprehensive benchmark across multiple areas of intelligence.

I am not going to dive deep into the details of each test; I just want to emphasize that they have different goals and approaches. Broadly, each of them aims to test whether AI can perform the same tasks as humans, correctly understand human-style tasks, behave like a human, or even mimic human speech.

In recent years, LLM technology has progressed enormously. Many people use ChatGPT every day and prefer searching the web through AI chats rather than search engines such as Google or Bing. You don’t have to worry about whether your request will be understood: even though GPT still fails many of the tests above, most chat users face no difficulty making requests in natural language.

In my opinion, this raises a really important ethical question: can a human be fooled by AI? In 2023 AI21 Labs ran the largest online Turing-style experiment (a social Turing game) titled “Human or Not?”. It was played more than 10 million times by more than 2 million people, and the results showed that 32% of participants could not distinguish between humans and machines. So my concern is: can this become an issue in our society? What if most people cannot distinguish between AI and humans? Is there a way to detect generated speech?
 


Daniel Kahneman And His Work In Decision Making

The second introductory article is fully devoted to Nobel Prize winner Daniel Kahneman and his discoveries in decision making.

Daniel Kahneman was an Israeli-American psychologist who questioned human rationality in decision making and judgement. Throughout his life he received numerous awards in fields such as psychology, economics, finance, and social science. Together with his friend Amos Tversky he established a cognitive basis for common human errors arising from heuristics and biases, and developed prospect theory. Kahneman’s book “Thinking, Fast and Slow” largely summarizes his research.

The book mainly describes the processes behind decision making, starting with very simple examples and diving deeper with every chapter. Every human has so-called “fast thinking” (System I) and “slow thinking” (System II). When asked a question, a person mostly uses fast thinking, which draws on previous experience. For example, asked “What is the capital of France?”, you won’t put much effort into answering. That is System I at work. But if it fails to find an immediate answer, the person activates System II, also called the “lazy system”. It consumes much more energy, and the answer comes more slowly. A good example is a slightly complicated calculation such as 156 × 23: given a little more time you will produce the answer, but it does not come to mind instantly the way 5 × 2 does.

This idea is developed throughout the book. At the beginning a reader finds some really simple examples that explain the mechanics of the two systems, but in later chapters Kahneman dives deeper and explains how we fail to evaluate risks, how strongly our decisions are affected by advertising and the media, and so on. Our brain holds a lot of information, yet our decisions are biased by many factors. In fact, we are much worse decision makers than we think.

While reading the book I couldn’t stop thinking about how this idea could be applied to AI evaluation. As I mentioned in the previous article, all modern tests assume that AI is not genuine intelligence, and they are intended to evaluate how close artificial intelligence is to the natural kind.

Since fewer and fewer people doubt that an AI can hold a conversation like a real person, I suggest evaluating LLMs using the following criteria:

  1. Do an LLM’s answers depend not so much on facts and statistics as on “cognitive” factors? How much more accurate is the so-called “statistical intuition” of an AI than human intuition?
  2. Assuming the AI easily passes the test in point 1, can this property of an LLM be used to accurately determine who you are talking to: a machine or a person?
  3. Is AI a better decision maker and risk assessor than a human? For instance, when considering a big purchase, should we rely on our gut feeling about the deal, or should we ask an AI to assess the possible risks and the expected profit?

Next, I will talk about the main terms from the book and what tasks I chose for evaluation.

Main Terms Used By Daniel Kahneman

The book “Thinking, Fast and Slow” is definitely worth reading. If you have not read it yet, I highly recommend it. First of all, you will better understand your own decisions and may even improve them in the future. You will also be able to apply what you learn to your goals, both personal and professional. To sum up: you will never think, judge, or make decisions the same way you do now.

As the book describes processes in the human brain that are far from obvious, I will give a short summary of a few of them before proceeding to my own research.

  1. Priming effect. In the 1980s psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked. If you have recently seen or heard the word EAT, you are temporarily more likely to complete the word fragment SO_P as SOUP than as SOAP.

    The priming effect takes many forms. If the idea of EAT is currently on your mind, you will be quicker than usual to recognize the word SOUP when it is spoken in a whisper or presented in a blurry font.

    You cannot know this from conscious experience, of course, but you must accept the alien idea that your actions and emotions can be primed by events of which you are not even aware.

  2. Cognitive ease. Whenever you are conscious, and perhaps even when you are not, multiple computations are going on in your brain. One of the dials measures cognitive ease, and its range lies between “Easy” and “Strained”. Easy is a sign that things are going well: no threats, no major news, no need to redirect attention or mobilize effort. Strained indicates that a problem exists and will require increased mobilization of System II. The surprise is that this single dial of cognitive ease is connected to a large network of diverse inputs and outputs.

    Words that you’ve seen before become easier to see again - you can identify them better than other words when they are shown very briefly or masked by noise.

    In other words, the more you see something, the more you believe in it. The famed psychologist Robert Zajonc dedicated much of his career to studying the link between the repetition of an arbitrary stimulus and the mild affection that people eventually have for it. Zajonc called this the mere exposure effect. The effect does not depend on consciousness at all: it occurs even when the repeated words are shown so quickly that observers never become aware of having seen them.

  3. Statistical intuition. People tend to ignore common statistical facts. Let me explain. Consider the letter K: is it more likely to appear as the first letter of a word or as the third? It is much easier to come up with words that begin with a particular letter than with words that have that letter in the third position, so respondents exaggerate the frequency of letters in the first position even though those letters in fact occur more frequently in the third position. Whatever comes to mind easily seems more plausible, so you ignore most of the statistical facts you learned before.

  4. Anchoring effect. It occurs when people consider a particular value for an unknown quantity before estimating that quantity. 

    If you are asked whether Gandhi was more than 114 years old when he died you will end up with a much higher estimate of his age at death than you would if the anchoring question referred to death at 35. 

    If you consider how much to pay for a house, you will be influenced by the asking price: the same house will appear more valuable if its listing price is high than if it is low, even if you are determined to resist the influence of that number.

    The list of anchoring effects is endless. Any number that you are asked to consider as a possible solution to an estimation problem will induce an anchoring effect.
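The letter-position claim in point 3 above is easy to check mechanically. The sketch below counts how many words in a list have a given letter first versus third. The tiny sample list is illustrative only; a real check would scan a full dictionary file, e.g. /usr/share/dict/words on many Unix systems (an assumption about your setup):

```python
def position_counts(words, letter):
    """Count words with `letter` in the first vs. the third position."""
    letter = letter.lower()
    first = sum(1 for w in words if w[:1].lower() == letter)
    third = sum(1 for w in words if w[2:3].lower() == letter)
    return first, third

# A tiny illustrative sample of English words.
sample = ["kite", "king", "like", "bake", "lake", "joke", "oak"]
print(position_counts(sample, "k"))  # (2, 5): K is third more often than first
```

Even on this toy sample the pattern matches Kahneman’s point: third-position words are harder to *recall*, not rarer, which is exactly what a mechanical count reveals.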

There are many more factors that affect our decisions, and they are widely used in marketing. If you believe that you buy something because you really like and need it, believe me: that decision was made for you by a bunch of talented marketing specialists. In most cases you would not even admit to yourself that you never wanted to buy it (yes, this phenomenon is also described in the book).

The initial idea for my survey was to run the questions from the book through the most popular LLMs (in their chat interfaces) to see whether any of the above-mentioned effects show up in LLM behaviour. However, I faced a few obstacles along the way and had to modify the process.
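The original plan can be sketched roughly as follows. Everything here is a hypothetical stand-in, not a real API: `ask_llm` is a placeholder for an actual chat call, and the model names and question wording are illustrative only:

```python
# Sketch of the intended survey loop: pose Kahneman-style questions to several
# chat models and collect the raw answers for later comparison with humans.

QUESTIONS = [
    "Is the letter K more likely to be the first or the third letter of a word?",
    "Was Gandhi more than 114 years old when he died? How old was he?",
]

def ask_llm(model, question):
    """Placeholder: a real implementation would call the model's chat API."""
    return f"[{model}] reply to: {question}"

def run_survey(models, questions):
    """Collect one answer per (model, question) pair for manual review."""
    return [{"model": m, "question": q, "answer": ask_llm(m, q)}
            for m in models for q in questions]

results = run_survey(["model-a", "model-b"], QUESTIONS)
```

The interesting part is not the loop itself but the question design: each question embeds a bias trigger (availability, anchoring) so that the answers can be scored against both the statistically correct response and the typical human one.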


The continuation will be published on January 21st.