Q learning is a value-based, model-free reinforcement learning algorithm that looks for the best sequence of actions given the agent’s current state. The Q stands for quality, representing how valuable an action is for maximizing future rewards. A value-based algorithm trains a value function to learn which states are more valuable and then acts accordingly. This contrasts with the policy-based method, which trains the policy directly to learn which action to take in a given state.

A model-based algorithm uses the reward function and the transition function to estimate the optimal policy, building a model of the environment. A model-free algorithm such as Q learning instead learns the consequences of its actions through experience, without needing the reward function or the transition function.

How does Q Learning Work?

Q learning is designed for problems where an agent makes a sequence of decisions and, over time, learns to maximize its future reward. Let us briefly understand how Q learning works:

  1. The learning process starts by defining the environment in which the agent operates. This environment consists of states, actions, and rewards: the situations the agent can be in, the possible moves it can make, and numerical values showing the benefit of taking an action in a specific state.
  2. Q learning maintains a table known as the Q table. Each entry holds a numerical value that indicates the expected cumulative reward for taking a specific action in a particular state. At the beginning, the Q table is usually initialized with default values, such as zeros or small random numbers.
  3. In the next step, the agent interacts with the environment by taking actions based on the current state. During this step, it balances exploration (trying new actions) and exploitation (choosing the action with the highest known Q value).
  4. Once the action has been taken, the agent observes the resulting state and the immediate reward. The Q value for the chosen action in the current state is then updated using the Q learning update rule.
  5. The agent keeps repeating this process of taking actions, updating the Q values, and refining its policy.
  6. Eventually, once the Q values have converged, the optimal policy can be extracted by choosing the action with the highest Q value in every state.
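
The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the five-state corridor environment, the `step`, `train`, and `greedy_policy` helpers, and the hyperparameter values are all hypothetical and chosen for brevity. The update inside the loop is the standard Q learning rule, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max Q(s', a') - Q(s, a)), with an epsilon-greedy strategy standing in for exploration and exploitation.

```python
import random

# Step 1: a hypothetical toy environment (an assumption for illustration).
N_STATES = 5          # states 0..4; state 4 is the goal and ends the episode
ACTIONS = [0, 1]      # 0 = move left, 1 = move right


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 2: Q table with one row per state, initialized to zeros.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):              # Step 5: repeat over many episodes
        state, done = 0, False
        while not done:
            # Step 3: epsilon-greedy choice between exploring and exploiting.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                # Break ties at random so the untrained agent does not get stuck.
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            # Step 4: observe the result, then apply the update rule.
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    # Step 6: pick the action with the highest Q value in every state.
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES)]


q_table = train()
policy = greedy_policy(q_table)
print(policy[:N_STATES - 1])  # learned actions for the non-terminal states
```

In this toy setting the learned policy should move right from every non-terminal state, because that is the shortest path to the only reward. The entry for the goal state itself is meaningless, since no action is ever taken there.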


In conclusion, Q learning is efficient for problems with finite states and discrete actions. It has also been extended to handle continuous state and action spaces with the help of neural networks that approximate the Q values.