F1 (classification)
The F1 score, also known as the F1-measure, is a single metric that combines precision and recall into their harmonic mean, giving a balanced summary of a classification model's performance on the positive class. It is particularly useful for datasets with imbalanced class distributions, where a simple accuracy score can be misleading.
Definition:
The F1 score is calculated as follows:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Where:
- Precision (also called positive predictive value) is the proportion of positive predictions that were actually correct. It answers the question: "Of all the instances predicted as positive, how many were actually positive?" Precision = True Positives / (True Positives + False Positives)
- Recall (also called sensitivity) is the proportion of actual positives that were correctly identified. It answers the question: "Of all the actual positive instances, how many were predicted as positive?" Recall = True Positives / (True Positives + False Negatives)
- True Positives (TP): the number of instances correctly predicted as positive.
- False Positives (FP): the number of instances incorrectly predicted as positive.
- False Negatives (FN): the number of instances incorrectly predicted as negative.
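The formulas above translate directly into code. The following minimal Python sketch computes precision, recall, and F1 from confusion-matrix counts; the example counts at the bottom are hypothetical values chosen for illustration, not taken from the text.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute the F1 score from confusion-matrix counts.

    Returns 0.0 when a ratio would otherwise be undefined
    (no positive predictions or no actual positives).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
# Precision = 0.8, recall ~= 0.667, so F1 ~= 0.727.
print(f1_from_counts(tp=80, fp=20, fn=40))
```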
Interpretation:
The F1 score ranges from 0 to 1, where:
- 1 represents perfect precision and recall (ideal score).
- 0 represents the worst possible score, which occurs when precision or recall is zero (i.e., the model correctly identifies no positive instances).
A higher F1 score indicates a better balance between precision and recall.
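Because F1 is a harmonic mean, it is pulled toward the weaker of the two components. The small Python sketch below makes this concrete; the precision and recall values are invented for illustration.

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; 0.0 if both are zero.
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


print(f1(0.9, 0.9))  # 0.9  -> balanced precision and recall give a high F1
print(f1(0.9, 0.1))  # 0.18 -> the harmonic mean is dragged toward the weaker value
print(f1(0.9, 0.0))  # 0.0  -> zero recall forces F1 to zero
```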
Use Cases:
The F1 score is commonly used in various machine learning tasks, including:
- Spam detection
- Medical diagnosis
- Fraud detection
- Information retrieval
- Natural language processing
Limitations:
While the F1 score provides a useful single metric, it's important to consider its limitations:
- It gives equal weight to precision and recall. In some scenarios, one might be more important than the other.
- Like precision and recall, it is sensitive to class imbalance: it ignores true negatives, and its value depends on which class is treated as positive (see the sketch after this list).
- It doesn't provide insight into the types of errors the model is making (e.g., confusing different classes).
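The introduction notes that accuracy can be misleading on imbalanced data, while the limitations above note that F1 still depends on the class distribution. The hedged sketch below (using scikit-learn's accuracy_score and f1_score, which the text itself does not mention, and an invented imbalanced dataset) shows F1 exposing weak positive-class performance that accuracy hides.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives (values invented for illustration).
y_true = [0] * 95 + [1] * 5

# A weak model that predicts "negative" almost everywhere: it flags only two
# instances as positive and gets one right (TP=1, FP=1, FN=4, TN=94).
y_pred = [0] * 94 + [1] + [0] * 4 + [1]

print(accuracy_score(y_true, y_pred))  # 0.95  -- looks impressive
print(f1_score(y_true, y_pred))        # ~0.29 -- reveals poor performance on the positive class
```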
Alternatives:
Depending on the specific application and requirements, other evaluation metrics can be considered, such as:
- Accuracy
- Precision
- Recall
- Area Under the ROC Curve (AUC-ROC)
- Area Under the Precision-Recall Curve (AUC-PR)
- Matthews Correlation Coefficient (MCC)
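As a rough illustration of how these alternatives are computed in practice, the sketch below assumes scikit-learn (not named in the text); the labels, hard predictions, and predicted probabilities are invented placeholders.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
    average_precision_score,
    matthews_corrcoef,
)

# Hypothetical ground truth, hard predictions, and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9, 0.4, 0.2, 0.7, 0.1]

print(accuracy_score(y_true, y_pred))           # Accuracy
print(precision_score(y_true, y_pred))          # Precision
print(recall_score(y_true, y_pred))             # Recall
print(roc_auc_score(y_true, y_prob))            # Area Under the ROC Curve (AUC-ROC)
print(average_precision_score(y_true, y_prob))  # Average precision, a common summary of the PR curve
print(matthews_corrcoef(y_true, y_pred))        # Matthews Correlation Coefficient (MCC)
```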