F1 (classification)
The F1 score, also known as the F1-measure, is a single metric that combines precision and recall into their harmonic mean, giving a balanced summary of a classification model's performance on the positive class. It is particularly useful for datasets with imbalanced class distributions, where a simple accuracy score can be misleading.
Definition:
The F1 score is calculated as follows:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Where:
- Precision (also called positive predictive value) is the proportion of positive predictions that were actually correct. It answers the question: "Of all the instances predicted as positive, how many were actually positive?" Precision = True Positives / (True Positives + False Positives)
- Recall (also called sensitivity) is the proportion of actual positives that were correctly identified. It answers the question: "Of all the actual positive instances, how many were predicted as positive?" Recall = True Positives / (True Positives + False Negatives)
- True Positives (TP): the number of instances correctly predicted as positive.
- False Positives (FP): the number of instances incorrectly predicted as positive.
- False Negatives (FN): the number of instances incorrectly predicted as negative.
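The formulas above translate directly into code. The following minimal Python sketch computes precision, recall, and F1 from confusion-matrix counts; the example counts at the bottom are hypothetical values chosen for illustration, not taken from the text.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute the F1 score from confusion-matrix counts.

    Returns 0.0 when a ratio would otherwise be undefined
    (no positive predictions or no actual positives).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
# Precision = 0.8, recall ~= 0.667, so F1 ~= 0.727.
print(f1_from_counts(tp=80, fp=20, fn=40))
```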
Interpretation:
The F1 score ranges from 0 to 1, where:
- 1 represents perfect precision and recall (ideal score).
- 0 represents the worst possible score, which occurs when precision or recall is zero (i.e., the model correctly identifies no positive instances).
A higher F1 score indicates a better balance between precision and recall.
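Because F1 is a harmonic mean, it is pulled toward the weaker of the two components. The small Python sketch below makes this concrete; the precision and recall values are invented for illustration.

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; 0.0 if both are zero.
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


print(f1(0.9, 0.9))  # 0.9  -> balanced precision and recall give a high F1
print(f1(0.9, 0.1))  # 0.18 -> the harmonic mean is dragged toward the weaker value
print(f1(0.9, 0.0))  # 0.0  -> zero recall forces F1 to zero
```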
Use Cases:
The F1 score is commonly used in various machine learning tasks, including:
- Spam detection
- Medical diagnosis
- Fraud detection
- Information retrieval
- Natural language processing
Limitations:
While the F1 score provides a useful single metric, it's important to consider its limitations:
- It gives equal weight to precision and recall. In some scenarios, one might be more important than the other.
- Like precision and recall, it is sensitive to class imbalance: it ignores true negatives, and its value depends on which class is treated as positive (see the sketch after this list).
- It doesn't provide insight into the types of errors the model is making (e.g., confusing different classes).
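The introduction notes that accuracy can be misleading on imbalanced data, while the limitations above note that F1 still depends on the class distribution. The hedged sketch below (using scikit-learn's accuracy_score and f1_score, which the text itself does not mention, and an invented imbalanced dataset) shows F1 exposing weak positive-class performance that accuracy hides.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives (values invented for illustration).
y_true = [0] * 95 + [1] * 5

# A weak model that predicts "negative" almost everywhere: it flags only two
# instances as positive and gets one right (TP=1, FP=1, FN=4, TN=94).
y_pred = [0] * 94 + [1] + [0] * 4 + [1]

print(accuracy_score(y_true, y_pred))  # 0.95  -- looks impressive
print(f1_score(y_true, y_pred))        # ~0.29 -- reveals poor performance on the positive class
```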
Alternatives:
Depending on the specific application and requirements, other evaluation metrics can be considered, such as:
- Accuracy
- Precision
- Recall
- Area Under the ROC Curve (AUC-ROC)
- Area Under the Precision-Recall Curve (AUC-PR)
- Matthews Correlation Coefficient (MCC)
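As a rough illustration of how these alternatives are computed in practice, the sketch below assumes scikit-learn (not named in the text); the labels, hard predictions, and predicted probabilities are invented placeholders.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
    average_precision_score,
    matthews_corrcoef,
)

# Hypothetical ground truth, hard predictions, and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9, 0.4, 0.2, 0.7, 0.1]

print(accuracy_score(y_true, y_pred))           # Accuracy
print(precision_score(y_true, y_pred))          # Precision
print(recall_score(y_true, y_pred))             # Recall
print(roc_auc_score(y_true, y_prob))            # Area Under the ROC Curve (AUC-ROC)
print(average_precision_score(y_true, y_prob))  # Average precision, a common summary of the PR curve
print(matthews_corrcoef(y_true, y_pred))        # Matthews Correlation Coefficient (MCC)
```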