Usually, when training a neural network, we compare the network's output with known expected outputs (labels).

In DQN, there is no ground truth: we don't know the correct output in advance.

Instead, DQN compares the neural network's output with a benchmark value.

To compute this benchmark there is a "target network": an older copy of the network that is being trained. Periodically, after a fixed number of training steps, the target network's weights are replaced with those of the current (online) network.
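A minimal sketch of this periodic update, assuming a PyTorch-style model (the architecture and names like `online_net`, `target_net`, and `TARGET_UPDATE_EVERY` are illustrative, not from the source):

```python
import copy

import torch.nn as nn

# Hypothetical Q-network; the architecture is illustrative.
online_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(online_net)  # start as an exact copy
target_net.eval()

TARGET_UPDATE_EVERY = 1000  # steps between target-network refreshes

def maybe_update_target(step: int) -> None:
    # Copy the online network's weights into the target network
    # every TARGET_UPDATE_EVERY training steps.
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
```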

The benchmark value is generated by taking a transition from the replay buffer and adding its reward signal to the discounted target-network estimate of the Q-value for the next state.
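Concretely, for a sampled transition $(s, a, r, s')$, the benchmark (often called the TD target) is, assuming a discount factor $\gamma$ and $y = r$ for terminal transitions:

$$y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$$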

The difference between the current network's estimate $Q(s, a)$ and the benchmark value $y$ gives the loss function used in DQN (typically a squared or Huber loss).
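A hedged sketch of the loss computation on a sampled batch, continuing the PyTorch assumptions above (`online_net` and `target_net` from the earlier sketch; tensor names and the `GAMMA` value are illustrative):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (illustrative value)

def dqn_loss(states, actions, rewards, next_states, dones):
    # Q-values the online network currently assigns to the taken actions.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Benchmark (TD target): reward + discounted target-network estimate
    # of the best next-state Q-value. No gradient flows through the target.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q * (1.0 - dones)

    # Huber (smooth L1) loss between current estimates and targets.
    return F.smooth_l1_loss(q_values, targets)
```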
