Survey Results: Decision-Making Study


Alan Turing and the Turing Test

If you are a software engineer, data scientist, or ML engineer, you have most likely heard of Alan Turing: mathematician, logician, cryptanalyst, and philosopher. Turing is considered the father of the computer. He did not build the first physical computer; rather, he described the first abstract machine that manipulates symbols, now known as the Turing machine. His work laid the foundations for fields that were later named computer science, cognitive science, artificial intelligence, and artificial life.

One of his best-known inventions was a code-breaking machine called the Bombe. During WWII the German military used the Enigma cipher machine to encrypt radio communications. By 1940 the Bombe’s success in breaking the cipher allowed it to supply the Allies with large quantities of military intelligence.

Alan Turing is also known as the father of artificial intelligence and of modern cognitive science. He hypothesized that the human brain is, in large part, a digital computing machine: a newborn’s cortex, in his view, is an ‘unorganized machine’ that becomes organized through training. That sounds remarkably like how modern neural networks are trained. In 1950 Turing proposed a test to determine whether a machine can “think”. Today it is known as the Turing test, or the imitation game. The idea is simple: a remote human interrogator, within a fixed time frame, must distinguish between a computer and a human subject based on their replies to the interrogator’s questions. If the interrogator mistakes the machine for a human, the machine is said to “think”.
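
To make the protocol concrete, here is a toy simulation of a single round (my own sketch, not Turing’s formulation in detail: it compresses free-form dialogue over a fixed time into one canned exchange, and the reply functions are placeholders):

```python
# Toy simulation of one round of the imitation game.
# Two anonymous respondents, A and B; one is a human, one is a machine,
# assigned at random. If the interrogator's guess about which one is
# the machine is wrong, the machine "passes" this round.
import random

def human_reply(question: str) -> str:
    return "I'd have to think about that one."  # canned stand-in

def machine_reply(question: str) -> str:
    return "I'd have to think about that one."  # canned stand-in

def play_round(interrogator_guess: str) -> bool:
    """Return True if the machine fooled the interrogator."""
    machine_is = random.choice(["A", "B"])
    question = "What does a summer morning smell like?"
    replies = {
        "A": machine_reply(question) if machine_is == "A" else human_reply(question),
        "B": machine_reply(question) if machine_is == "B" else human_reply(question),
    }
    print(f"A: {replies['A']}\nB: {replies['B']}")
    return interrogator_guess != machine_is

if __name__ == "__main__":
    fooled = play_round(interrogator_guess="A")
    print("machine fooled the interrogator" if fooled else "machine identified")
```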

Besides the Turing test, there are several other tests with different approaches and intentions. Here is a short summary of them:

  • The Winograd Schema Challenge. Goal: Test common-sense reasoning through pronoun-resolution puzzles; GPT-class models have partially passed it. (A concrete schema appears in the sketch after this list.)
  • The Marcus Test (by Gary Marcus). Goal: Evaluate general reasoning, learning, and adaptability.
  • The Lovelace Test. Goal: Assess creativity and originality.
  • The AI-Complete Problems. Goal: Measure intelligence through the ability to solve problems that would require human-level understanding.
  • The Smith Test (proposed by Ernest Davis and Gary Marcus). Goal: Provide a comprehensive benchmark across multiple areas of intelligence.
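
To give a feel for what a Winograd schema looks like, here is a minimal sketch built on the canonical trophy/suitcase example. The grading logic and the ask_model stub are my own illustrative assumptions; real evaluations typically compare model likelihoods over both readings rather than parsing a free-text answer:

```python
# Minimal sketch of scoring one Winograd schema pair.
# A schema is a sentence pair differing in one "special" word; the
# correct referent of the ambiguous pronoun flips with that word.
SCHEMA = [
    # (sentence, question, correct answer)
    ("The trophy doesn't fit in the brown suitcase because it is too big.",
     "What is too big?", "trophy"),
    ("The trophy doesn't fit in the brown suitcase because it is too small.",
     "What is too small?", "suitcase"),
]

def ask_model(sentence: str, question: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a canned answer
    so the sketch runs offline."""
    return "trophy" if "big" in sentence else "suitcase"

correct = sum(
    ask_model(sentence, question).lower() == answer
    for sentence, question, answer in SCHEMA
)
print(f"{correct}/{len(SCHEMA)} schema variants resolved correctly")
```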

I am not going to dive deep into the details of each test; I only want to emphasize that, despite their different goals and approaches, each of them asks whether AI can perform the same tasks as humans: correctly understand human-style tasks, behave like a human, or even mimic human speech.

In recent years, LLM technology has progressed enormously. Many people use ChatGPT every day and prefer searching the web through AI chats rather than search engines such as Google or Bing. You don’t have to worry about whether your request will be understood: even though GPT-class models still fail many of the tests above, most chat users face no difficulty making requests in natural language.

In my opinion, this raises a really important ethical question: can a human be fooled by AI? In 2023 AI21 Labs ran the largest online Turing-style experiment to date, a social Turing game titled “Human or Not?”. It was played more than 10 million times by more than 2 million people, and the results showed that 32% of players could not distinguish between humans and machines. So my concern is: could this become a problem for society? What if most people cannot tell AI and humans apart? Is there a way to detect generated text?
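
On that last question: one common heuristic, used with many refinements by several public detectors, is to measure how predictable a text is to a language model, since machine-generated text tends to have lower perplexity than human writing. Here is a minimal sketch using the open GPT-2 model from the Hugging Face transformers library; the threshold is an illustrative assumption, not a calibrated value, and this approach is far from reliable on its own:

```python
# Minimal sketch: flag text as possibly machine-generated via perplexity.
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token loss."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return float(torch.exp(out.loss))

THRESHOLD = 40.0  # illustrative assumption, not a calibrated value

sample = "The quick brown fox jumps over the lazy dog."
ppl = perplexity(sample)
verdict = "possibly machine-generated" if ppl < THRESHOLD else "likely human"
print(f"perplexity = {ppl:.1f} -> {verdict}")
```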
 


Daniel Kahneman and His Work in Decision Making

This second introductory article is devoted entirely to Nobel Prize winner Daniel Kahneman and his discoveries in decision making.

Daniel Kahneman was an Israeli-American psychologist who questioned human rationality in decision making and judgement. Over his lifetime he received numerous awards across fields such as psychology, economics, finance, and social science. Together with his longtime collaborator Amos Tversky he established a cognitive basis for common human errors arising from heuristics and biases, and developed prospect theory. Kahneman’s book “Thinking, Fast and Slow” summarizes much of this research.

The book describes the processes behind decision making, starting with really simple examples and diving deeper with every chapter. Every human has so-called “fast thinking” (System 1) and “slow thinking” (System 2). When asked a question, a person mostly relies on fast thinking, which is based on previous experience. For example, if you are asked “What is the capital of France?”, you won’t put much effort into answering. That is System 1 at work. But if System 1 fails to find an immediate answer, System 2, which Kahneman also calls the “lazy” system, is activated. It consumes much more energy, and the answer comes more slowly. A good example is a slightly more complicated calculation, such as 156 × 23. Given a little time you will produce the answer, but it does not spring to mind the way 5 × 2 does.
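
To make the contrast concrete (my own decomposition, not an example from the book): 156 × 23 = 156 × 20 + 156 × 3 = 3,120 + 468 = 3,588. Each intermediate step is simple, but holding and combining the steps is exactly the deliberate, effortful work Kahneman attributes to System 2.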

This idea is developed throughout the book. The opening chapters offer simple examples so the reader can grasp the mechanics of the two systems; later chapters dive deeper, explaining how we fail to evaluate risks and how strongly our decisions are shaped by ads and media. Our brains hold a lot of information, yet our decisions are biased by many factors; in fact, we are much worse decision makers than we think.

While reading the book I couldn’t stop thinking about how this idea could be applied to AI evaluation. As I mentioned in the previous article, all modern tests are built on the assumption that AI is not real intelligence, and they are intended to evaluate how close artificial intelligence comes to the natural kind.

Since fewer and fewer people doubt that AI can hold a conversation like a real person, I suggest evaluating LLMs using the following criteria:

  1. Do LLM answers depend not so much on facts and statistics as on “cognitive” factors? And how much more accurate is the so-called “statistical intuition” of an LLM than human intuition? (A minimal probe of this idea appears in the sketch after this list.)
  2. Assuming an AI easily passes the test from point 1, can this property of LLMs be used to determine reliably whether you are talking to a machine or a person?
  3. Is AI a better decision maker and risk assessor than a human? For instance, if we are considering a big purchase, should we rely on our gut feeling about the deal, or should we ask an AI to assess the possible risks and estimate the expected profit?
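
As a concrete illustration of criterion 1, here is a minimal sketch of the kind of probe I have in mind, using the bat-and-ball problem, a classic question from the book that reliably triggers the intuitive but wrong System 1 answer in humans. The ask_llm function is a hypothetical stand-in for whatever chat API you use; the stub returns a canned reply so the script runs as written:

```python
# Probe an LLM with a classic System 1 trap.
# Bat-and-ball problem: a bat and a ball cost $1.10 in total; the bat
# costs $1.00 more than the ball. How much does the ball cost?
# Intuitive (wrong) answer: $0.10. Correct answer: $0.05.

PROMPT = (
    "A bat and a ball cost $1.10 in total. "
    "The bat costs $1.00 more than the ball. "
    "How much does the ball cost? Answer with a dollar amount only."
)

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-API call.
    This stub returns a canned reply so the sketch runs offline."""
    return "$0.05"

def classify(answer: str) -> str:
    """Label the reply as deliberate, intuitive-but-wrong, or other."""
    if "0.05" in answer:
        return "correct (System 2-style answer)"
    if "0.10" in answer:
        return "intuitive but wrong (System 1-style answer)"
    return "unclassified"

if __name__ == "__main__":
    reply = ask_llm(PROMPT)
    print(f"model reply: {reply!r} -> {classify(reply)}")
```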

Next, I will cover the key terms from the book and the tasks I chose for evaluation.


The continuation will be published on January 14th.