AFM Tear Classification — Scan-Level Evaluation
Models are ranked by scan-level macro F1. Green marks strong values and bold marks the best value per metric. Overfit gap is train F1 minus eval F1; a large positive gap signals overfitting, and "--" means no gap was reported for that model.
| Model | Type | Balanced Acc. | Macro F1 | ROC AUC | PR AUC | Precision | Recall | Specificity | Overfit Gap |
|---|---|---|---|---|---|---|---|---|---|
| hist_gradient_boosting (fixed) | tabular | 0.912 | 0.874 | 0.970 | 0.991 | 0.977 | 0.896 | 0.929 | -- |
| random_forest (fixed) | tabular | 0.891 | 0.838 | 0.958 | 0.987 | 0.976 | 0.854 | 0.929 | -- |
| xgboost (fixed) | tabular | 0.866 | 0.832 | 0.949 | 0.984 | 0.955 | 0.875 | 0.857 | -- |
| lightgbm (fixed) | tabular | 0.881 | 0.821 | 0.961 | 0.989 | 0.976 | 0.833 | 0.929 | -- |
| hierarchical_mil_binary_fixed | dl | 0.808 | 0.712 | 0.808 | 0.909 | 0.971 | 0.688 | 0.929 | +0.216 |
| direct4ch_binary_fixed | dl | 0.717 | 0.690 | 0.717 | 0.861 | 0.884 | 0.792 | 0.643 | +0.203 |
| hybrid_fusion_binary_fixed | dl | 0.688 | 0.514 | 0.688 | 0.859 | 1.000 | 0.375 | 1.000 | +0.189 |
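The columns above are linked through the binary confusion matrix. The following is a minimal sketch of those relationships, not the pipeline's actual code. The function names are hypothetical, and the counts in the example are illustrative only: the report does not state confusion-matrix counts, and these were picked because they round to the rates in the top row.

```python
def scan_metrics(tp, fn, tn, fp):
    """Binary scan-level metrics from confusion-matrix counts.

    Assumes each class has at least one true and one predicted scan,
    so no denominator is zero.
    """
    recall = tp / (tp + fn)            # sensitivity on the positive class
    specificity = tn / (tn + fp)       # recall on the negative class
    precision = tp / (tp + fp)
    neg_precision = tn / (tn + fn)     # precision on the negative class
    f1_pos = 2 * precision * recall / (precision + recall)
    f1_neg = 2 * neg_precision * specificity / (neg_precision + specificity)
    return {
        "bal_acc": (recall + specificity) / 2,   # mean of per-class recall
        "macro_f1": (f1_pos + f1_neg) / 2,       # unweighted mean of per-class F1
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
    }


def overfit_gap(train_f1, eval_f1):
    """Train F1 minus eval F1; larger positive values suggest overfitting."""
    return train_f1 - eval_f1


# Hypothetical counts (NOT from the report), chosen to match the rounded
# rates of the top table row.
metrics = scan_metrics(tp=43, fn=5, tn=13, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because macro F1 averages the per-class F1 scores without weighting, a model with high precision on the positive class can still score low if the minority class is poorly separated.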
Diagnostic curves comparing all models at scan level. ROC AUC measures how well a model ranks positive scans above negative ones; PR AUC is more informative under class imbalance.
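To make the two summary numbers concrete, here is a small sketch that computes ROC AUC as a pairwise ranking probability and PR AUC as average precision. Both are common estimators, but the report does not say which estimator it used, so treat the choice as an assumption. The labels and scores are made up for illustration, and the average-precision helper assumes no tied scores.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision values taken at the rank of each positive.
    Assumes scores are distinct and at least one positive exists."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Illustrative data only (not from the report).
y_true = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
hard_preds = [1 if s >= 0.5 else 0 for s in scores]

print("ROC AUC on scores:", roc_auc(y_true, scores))
print("Average precision:", average_precision(y_true, scores))
print("ROC AUC on 0/1 predictions:", roc_auc(y_true, hard_preds))
```

When thresholded 0/1 predictions are passed instead of continuous scores, this ROC AUC reduces to balanced accuracy. The three DL rows in the table above report identical ROC AUC and balanced accuracy values, which would be consistent with that situation, though the report does not confirm how those values were computed.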
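Training curves for the DL models. early_stopped = True means training halted before max_epochs, typically because the validation metric stopped improving; it does not by itself show that the model converged. Note that best_val_f1 and best_epoch are NaN for all three runs, and final_lr displays as 0.000 at three-decimal precision.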
| model | final_train_loss | best_val_f1 | best_epoch | total_epochs | max_epochs | early_stopped | last_is_best | final_lr |
|---|---|---|---|---|---|---|---|---|
| direct4ch_binary_fixed | 0.256 | NaN | NaN | 13 | 25 | True | False | 0.000 |
| hierarchical_mil_binary_fixed | 0.211 | NaN | NaN | 12 | 25 | True | False | 0.000 |
| hybrid_fusion_binary_fixed | 0.394 | NaN | NaN | 7 | 20 | True | False | 0.000 |
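The summary columns above can be derived from a per-epoch validation history. The sketch below replays such a history under a patience-based stopping rule. The rule, the `patience` parameter, and the function name are assumptions, since the report does not describe the stopping criterion, and the example history is invented.

```python
import math


def summarize_run(val_f1_history, max_epochs, patience):
    """Replay a logged validation-F1 history under patience-based early stopping.

    Stops once `patience` consecutive epochs pass without a new best score.
    NaN scores never count as an improvement.
    """
    best_f1, best_epoch = math.nan, None
    epochs_run, stopped = 0, False
    for epoch, f1 in enumerate(val_f1_history[:max_epochs], start=1):
        epochs_run = epoch
        if not math.isnan(f1) and (best_epoch is None or f1 > best_f1):
            best_f1, best_epoch = f1, epoch
        elif best_epoch is not None and epoch - best_epoch >= patience:
            stopped = True
            break
    return {
        "best_val_f1": best_f1,
        "best_epoch": best_epoch,
        "total_epochs": epochs_run,
        "max_epochs": max_epochs,
        "early_stopped": stopped,
        "last_is_best": best_epoch == epochs_run,
    }


# Invented history for illustration (not from the report).
history = [0.50, 0.62, 0.70, 0.69, 0.68, 0.66, 0.65]
print(summarize_run(history, max_epochs=20, patience=3))
```

In this sketch, a history whose scores are all NaN leaves `best_val_f1` as NaN and `best_epoch` unset. That is one way the NaN columns in the table could arise, but the report does not state the cause.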
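MIL attention weights indicate how much each channel contributes to the model's prediction; higher attention means more influence. The four channel means range from 0.233 to 0.261, close to the uniform value of 0.25, so no single channel dominates.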
| channel | mean_attention | std | min | max |
|---|---|---|---|---|
| height_sensor | 0.2609 | 0.0447 | 0.0000 | 0.3951 |
| height | 0.2580 | 0.0458 | 0.0000 | 0.3746 |
| amplitude_error | 0.2481 | 0.0437 | 0.0000 | 0.3564 |
| phase | 0.2330 | 0.0664 | 0.0000 | 0.3498 |
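One way to read the table above is to compare each channel's mean attention with the uniform baseline of 1/4 that four equally weighted channels would receive. The sketch below does that using the mean values reported in the table; the helper name is hypothetical and the uniform baseline is an interpretive assumption, not something the report states.

```python
# Mean attention per channel, copied from the table above.
MEAN_ATTENTION = {
    "height_sensor": 0.2609,
    "height": 0.2580,
    "amplitude_error": 0.2481,
    "phase": 0.2330,
}


def deviation_from_uniform(mean_attention):
    """Difference between each channel's mean attention and 1 / n_channels."""
    uniform = 1 / len(mean_attention)
    return {channel: value - uniform for channel, value in mean_attention.items()}


deviations = deviation_from_uniform(MEAN_ATTENTION)
for channel, delta in sorted(deviations.items(), key=lambda kv: -kv[1]):
    print(f"{channel:>16}: {delta:+.4f}")
```

The deviations are small compared with the per-channel standard deviations in the table (roughly 0.04 to 0.07), so the ranking of channels should be read with caution.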