Model-free (reinforcement learning)
In reinforcement learning, "model-free" refers to a category of algorithms that learn optimal policies or value functions directly from experience, without explicitly constructing or utilizing a model of the environment's dynamics. A model of the environment describes how the environment will change in response to an agent's actions. Model-free methods bypass this step, focusing instead on directly estimating the optimal policy or value function through trial-and-error interaction with the environment.
Unlike model-based reinforcement learning algorithms, model-free methods do not attempt to learn the transition probabilities (the probability of reaching a particular next state from a given state and action) or the reward function (the expected reward for taking a specific action in a specific state). Instead, they learn directly from sampled experience how to behave.
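The distinction can be seen in what each update needs. The following minimal sketch uses a made-up three-state, two-action MDP with random parameters purely for illustration: a model-based backup needs the full transition matrix P and reward table R, while a model-free temporal-difference update needs only a single sampled transition (s, a, r, s').

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (values are random, for illustration only).
P = np.random.dirichlet(np.ones(3), size=(3, 2))  # P[s, a] = distribution over next states
R = np.random.rand(3, 2)                          # R[s, a] = expected immediate reward
gamma = 0.9
V = np.zeros(3)

s, a = 0, 1

# Model-based backup: requires knowing (or having learned) P and R.
model_based_target = R[s, a] + gamma * P[s, a] @ V

# Model-free TD(0) update: requires only one observed transition (s, a, r, s').
r, s_next = 0.5, 2   # obtained by actually interacting with the environment
alpha = 0.1
V[s] += alpha * (r + gamma * V[s_next] - V[s])
```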
Two major classes of model-free algorithms are:
- Value-based methods: These methods learn an optimal value function, which estimates the expected cumulative reward for being in a given state or taking a specific action in a given state. Examples include Q-learning and SARSA. The policy is then derived from the learned value function by choosing actions that maximize the expected reward. (A minimal Q-learning sketch follows this list.)
- Policy-based methods: These methods directly learn an optimal policy, which maps states to actions. Examples include REINFORCE and Actor-Critic methods. These methods search directly in the policy space, often using gradient-based optimization techniques to improve the policy.
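As a value-based example, here is a minimal sketch of tabular Q-learning. It assumes a classic Gym-style discrete environment in which reset() returns a state and step(action) returns (next_state, reward, done, info); the function name and hyperparameters are illustrative, not a reference implementation.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch. Assumes a classic Gym-style API:
    env.reset() -> state, env.step(a) -> (next_state, reward, done, info),
    and a discrete action space exposing env.action_space.n."""
    Q = defaultdict(lambda: [0.0] * env.action_space.n)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration: mostly exploit, occasionally explore.
            if random.random() < epsilon:
                action = random.randrange(env.action_space.n)
            else:
                action = max(range(env.action_space.n), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            # Off-policy TD target: reward plus the greedy value of the next state.
            best_next = max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```

The learned table Q can then be turned into a policy by acting greedily with respect to it, which is exactly the "derive the policy from the value function" step described above.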
Some algorithms, such as Actor-Critic methods, combine both value-based and policy-based approaches. These methods use an actor (policy) to take actions and a critic (value function) to evaluate those actions.
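A minimal sketch of this combination, assuming the same Gym-style discrete environment as above, is a one-step actor-critic with a tabular softmax policy (the actor) and a tabular state-value function (the critic); the critic's temporal-difference error both updates the value estimates and scales the actor's policy-gradient step. All names and learning rates here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def actor_critic(env, n_states, n_actions, episodes=500,
                 alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One-step actor-critic sketch with tabular actor and critic.
    Assumes a classic Gym-style discrete environment."""
    theta = np.zeros((n_states, n_actions))  # actor: policy preferences
    V = np.zeros(n_states)                   # critic: state values
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            probs = softmax(theta[state])
            action = np.random.choice(n_actions, p=probs)
            next_state, reward, done, _ = env.step(action)
            # Critic evaluates the action via the TD error.
            target = reward + (0.0 if done else gamma * V[next_state])
            td_error = target - V[state]
            V[state] += alpha_critic * td_error
            # Actor follows the policy-gradient direction, scaled by the TD error.
            grad_log = -probs
            grad_log[action] += 1.0
            theta[state] += alpha_actor * td_error * grad_log
            state = next_state
    return theta, V
```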
Model-free methods are often simpler to implement than model-based methods, especially in environments with complex or unknown dynamics, and they avoid the cost of learning and planning with a model, which can be prohibitive when the model is very large or hard to estimate. However, they are typically less sample-efficient: they need more interaction with the environment to learn a good policy, which is a significant drawback when experience is costly or limited. The agent must actively explore the environment to discover which actions lead to high reward, and this can be slow, especially when rewards are sparse.