To derive an optimal action policy directly, we would need to know the complete state transition matrix and the reward associated with each state transition. Without this information, we cannot compute a value function that tells us the value of taking a certain action.
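To see why both pieces of information are needed, it can help to write out the Bellman optimality equation for the state value function. The notation here is an assumption, not taken from the text above: $P(s' \mid s, a)$ stands for the state transition matrix, $R(s, a, s')$ for the reward of a transition, and $\gamma \in [0, 1)$ for a discount factor.

$$
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]
$$

Both $P$ and $R$ appear on the right-hand side, so if either is unknown the equation cannot be evaluated directly.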

Because we are dealing with incomplete information, we instead approximate the value function from experience. This approximation, which assigns a value to each state-action pair, is called the Q-function.
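As a rough illustration of how such an approximation can be built, here is a minimal sketch of tabular Q-learning. Everything specific in it is an assumption made for the example: the toy environment `chain_step` (a hypothetical 4-state chain), the function name `q_learning`, and the values chosen for the learning rate `alpha`, the discount `gamma`, and the exploration rate `epsilon`. The learner only ever sees sampled transitions and never reads a transition matrix or reward table.

```python
import random


def q_learning(step, n_states, n_actions, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Estimate a Q-table from sampled transitions only."""
    rng = random.Random(seed)
    # Q[s][a] is the current estimate of the value of action a in state s.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0  # assumption: every episode starts in state 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                # Explore: pick a random action.
                action = rng.randrange(n_actions)
            else:
                # Exploit: pick a best-valued action, breaking ties randomly.
                best = max(Q[state])
                action = rng.choice(
                    [a for a in range(n_actions) if Q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bootstrapped target: observed reward plus discounted best next value.
            target = reward if done else reward + gamma * max(Q[next_state])
            # Move the estimate a fraction alpha toward the target.
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


def chain_step(state, action):
    """Hypothetical toy environment: states 0..3, action 1 moves right,
    action 0 moves left. Reaching state 3 pays 1.0 and ends the episode."""
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


Q = q_learning(chain_step, n_states=4, n_actions=2)
# Greedy action in each non-terminal state according to the learned table.
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(3)]
```

In this toy setup the learned table should favour moving right in every non-terminal state, even though the learner was never given the transition rules or the reward values up front.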

Q-learning solves the problem of not knowing the state transition matrix and having no advance knowledge of the rewards associated with each transition.

Q-learning is a type of Reinforcement Learning.
