This notebook evaluates Model B (Motivation Classification), a multi-class classifier that categorizes fiscal acts by their primary motivation and determines if they are exogenous to the business cycle.
Primary Success Criteria:
Overall Accuracy > 0.75 on test set
Per-class F1 Score > 0.70 for each motivation category
Exogenous flag accuracy > 0.85
Model Configuration:
LLM: Claude Sonnet 4 (claude-sonnet-4-20250514)
Approach: Few-shot prompting (5 examples for each of the 4 motivation classes = 20 total)
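The few-shot setup above can be sketched as follows. The class names come from this report; the example stubs and the `build_prompt()` helper are purely illustrative placeholders (a Python sketch, not the actual pipeline code):

```python
# Illustrative sketch of assembling a few-shot classification prompt.
# Class names are from the report; everything else here is hypothetical.
CLASSES = ["Spending-driven", "Deficit-driven", "Countercyclical", "Long-run"]
EXAMPLES_PER_CLASS = 5

# Placeholder labeled snippets standing in for the real curated examples.
examples = {
    c: [f"<{c} example {i + 1}>" for i in range(EXAMPLES_PER_CLASS)]
    for c in CLASSES
}

def build_prompt(act_text: str) -> str:
    """Interleave the labeled examples, then append the act to classify."""
    shots = []
    for cls, snippets in examples.items():
        for snippet in snippets:
            shots.append(f"Act: {snippet}\nMotivation: {cls}")
    return "\n\n".join(shots) + f"\n\nAct: {act_text}\nMotivation:"

prompt = build_prompt("An act restructuring the tax code...")
n_shots = sum(len(v) for v in examples.values())  # 5 examples x 4 classes = 20
```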
The validation set shows strong performance that meets most Phase 0 success criteria:
Overall Accuracy: 90% ✅ Exceeds the 75% target by +15 percentage points
Macro F1: 0.881 ✅ Exceeds the 0.70 target, showing good balance across classes
Exogenous Accuracy: 90% ✅ Exceeds the 85% target by +5 percentage points
The model correctly classified 9 out of 10 acts in the validation set. The single misclassification was a Countercyclical act predicted as Long-run, indicating some difficulty distinguishing between cycle-motivated and efficiency-motivated reforms.
Per-class detail for Long-run (validation set):
Precision = 0.75 (one false positive: the Countercyclical act misclassified as Long-run)
Recall = 1.0 (all Long-run acts found)
Key Finding: Countercyclical classification is the weak point, with one instance confused with Long-run. This suggests the model has difficulty distinguishing cycle-motivated reforms from efficiency-motivated reforms when the language is ambiguous.
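The Long-run precision and recall above pin down that class's F1 directly; a quick check, with counts inferred from the stated figures:

```python
# Validation set, Long-run class: precision 0.75 and recall 1.0 imply
# 3 true positives, 1 false positive (the Countercyclical act), 0 false negatives.
tp, fp, fn = 3, 1, 0

precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)     # 1.0
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.857
```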
The test set shows below-target performance that does NOT meet Phase 0 success criteria:
Overall Accuracy: 66.7% ❌ Below the 75% target by 8.3 percentage points
Macro F1: 1.0 ⚠️ Misleading: Countercyclical (no true instances) and Long-run (no predictions) yield undefined scores and were dropped from the average, leaving only the two perfectly classified classes (see per-class analysis)
Exogenous Accuracy: 66.7% ❌ Below the 85% target by 18.3 percentage points
Critical Issue: The model correctly classified only 4 out of 6 acts (66.7%). The two misclassifications were both Long-run acts predicted as Countercyclical, creating a systematic error pattern.
Cascading Error Impact: Because Long-run acts should be classified as exogenous (TRUE) but Countercyclical is endogenous (FALSE), these motivation errors automatically create exogenous flag errors, dropping exogenous accuracy to 66.7%.
Important Context: The test set contains only 6 acts with an imbalanced distribution (3 Spending-driven, 0 Countercyclical, 1 Deficit-driven, 2 Long-run). Small sample size means each error has outsized impact (1 error = 16.7% drop in accuracy).
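The test-set numbers can be reproduced from the label counts stated in this report, and doing so shows concretely why the macro F1 of 1.0 is misleading: once all four classes are included, the average collapses.

```python
# Test-set labels reconstructed from the report's counts: 3 Spending-driven
# and 1 Deficit-driven classified correctly; both Long-run acts predicted
# as Countercyclical.
y_true = ["Spending-driven"] * 3 + ["Deficit-driven"] + ["Long-run"] * 2
y_pred = ["Spending-driven"] * 3 + ["Deficit-driven"] + ["Countercyclical"] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 4/6

def f1_per_class(cls):
    tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
    fp = sum(p == cls and t != cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if 2 * tp + fp + fn == 0:
        return None  # class absent from both truth and predictions
    return 2 * tp / (2 * tp + fp + fn)

classes = ["Spending-driven", "Deficit-driven", "Countercyclical", "Long-run"]
scores = {c: f1_per_class(c) for c in classes}

# Countercyclical and Long-run both score 0 here, so averaging over all four
# classes gives 0.5; the reported 1.0 comes from averaging only the two
# perfectly classified classes.
defined = [s for s in scores.values() if s is not None]
macro_all = sum(defined) / len(defined)
```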
The test set confusion matrix reveals a systematic misclassification pattern:
Long-run → Countercyclical (2 instances): BOTH Long-run acts in the test set were incorrectly classified as Countercyclical
This represents 100% failure rate on Long-run classification (0/2 recall)
Both predictions had high confidence (0.85 and 0.90), suggesting the model is confidently wrong
Root Cause Hypothesis: The model appears to confuse:
Long-run reforms aimed at improving efficiency and fairness (exogenous)
Countercyclical reforms aimed at stimulating the economy during recessions (endogenous)
When an act discusses “improving economic conditions” or “raising growth,” the model may incorrectly interpret this as countercyclical stabilization rather than long-run structural reform.
Correct Classifications:
Spending-driven: 3/3 perfect (100% precision and recall)
Deficit-driven: 1/1 perfect (100% precision and recall)
| Act | Year | True Motivation | Predicted Motivation | Confidence | True Exogenous | Predicted Exogenous |
|---|---|---|---|---|---|---|
| Public Law 90-26 (Restoration of the Investment Tax Credit) | 1967 | Long-run | Countercyclical | 0.85 | TRUE | FALSE |
Error Patterns:
Analyzing the 2 misclassified acts reveals a clear pattern:
Both errors follow the same pattern: Long-run → Countercyclical
Public Law 99-514 (likely Tax Reform Act of 1986)
True: Long-run (exogenous=TRUE)
Predicted: Countercyclical (exogenous=FALSE)
Confidence: 0.85
Revenue Act (year and context needed)
True: Long-run (exogenous=TRUE)
Predicted: Countercyclical (exogenous=FALSE)
Confidence: 0.90
Why This Matters:
Both acts are major tax reforms aimed at long-term efficiency gains
The model incorrectly interpreted them as recession-fighting measures
High confidence scores (0.85-0.90) indicate the model is systematically wrong, not uncertain
This creates cascading errors: motivation error → automatic exogenous flag error
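The cascade can be made concrete with a motivation-to-exogenous mapping. The report states the Long-run and Countercyclical mappings; the other two entries below are assumptions following the Romer-Romer convention, not something the report confirms:

```python
# Exogenous flag derived deterministically from motivation.
MOTIVATION_TO_EXOGENOUS = {
    "Long-run": True,          # stated in the report
    "Countercyclical": False,  # stated in the report
    "Deficit-driven": True,    # assumption (Romer-Romer convention)
    "Spending-driven": False,  # assumption (Romer-Romer convention)
}

def exogenous_flag(motivation: str) -> bool:
    return MOTIVATION_TO_EXOGENOUS[motivation]

# One of the two test-set errors: truth Long-run, prediction Countercyclical.
true_flag = exogenous_flag("Long-run")         # True
pred_flag = exogenous_flag("Countercyclical")  # False

# A single motivation error therefore flips the exogenous flag as well:
cascaded_error = true_flag != pred_flag  # True
```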
Confidence Calibration
```r
# Confidence calibration for test set
model_b_eval_test$calibration %>%
  filter(!is.na(confidence_bin)) %>%
  gt() %>%
  cols_label(
    confidence_bin = "Confidence Range",
    n = "N Predictions",
    accuracy = "Actual Accuracy"
  ) %>%
  fmt_number(
    columns = accuracy,
    decimals = 3
  ) %>%
  tab_header(
    title = "Test Set: Confidence Calibration",
    subtitle = "Does predicted confidence match actual accuracy?"
  ) %>%
  tab_options(
    table.width = pct(100)
  )
```
Test Set: Confidence Calibration
Does predicted confidence match actual accuracy?

| Confidence Range | N Predictions | Actual Accuracy |
|---|---|---|
| (0.8,0.9] | 5 | 0.600 |
| (0.9,1] | 1 | 1.000 |
Calibration Interpretation:
In a well-calibrated model, predictions made with 90% confidence are correct about 90% of the time.
The test set shows poor calibration:
5 predictions at 80-90% confidence → 60% actual accuracy (should be ~85%)
1 prediction at 90-100% confidence → 100% actual accuracy ✅
Key Finding: The model is overconfident in its incorrect predictions. The two Long-run misclassifications had 0.85-0.90 confidence, yet were wrong. This indicates the model doesn’t recognize when it’s uncertain about Long-run vs. Countercyclical distinctions.
Implication: We cannot rely on confidence scores to filter questionable predictions—the model is confident even when systematically wrong.
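The calibration table can be reproduced with a simple binning pass. The confidences below are illustrative, arranged to match the reported bins: five predictions in (0.8, 0.9], including the two errors at 0.85 and 0.90, and one correct prediction above 0.9.

```python
# (confidence, correct) pairs; illustrative data consistent with the
# report's calibration table, not the actual prediction log.
predictions = [
    (0.85, True), (0.85, True), (0.85, False),  # one error at 0.85
    (0.90, True), (0.90, False),                # one error at 0.90
    (0.95, True),                               # lone high-confidence correct
]

# Right-closed bins, matching the (0.8,0.9] / (0.9,1] labels in the table.
bins = [(0.8, 0.9), (0.9, 1.0)]

calibration = {}
for lo, hi in bins:
    in_bin = [ok for conf, ok in predictions if lo < conf <= hi]
    calibration[f"({lo},{hi}]"] = (len(in_bin), sum(in_bin) / len(in_bin))

# Per-bin (count, accuracy): the 80-90% bin is far below its nominal level.
```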
Model B presents a mixed performance with strong validation results but failing test set performance:
✅ Validation Set (10 acts):
- Accuracy: 90% (target: 75%): strong pass
- Macro F1: 0.881 (target: 0.70): strong pass
- Exogenous Accuracy: 90% (target: 85%): pass
- Only 1 misclassification out of 10

❌ Test Set (6 acts):
- Accuracy: 66.7% (target: 75%): FAIL by 8.3 points
- Exogenous Accuracy: 66.7% (target: 85%): FAIL by 18.3 points
- 2 misclassifications out of 6 (33% error rate)
- Both errors follow the same pattern: Long-run → Countercyclical
Root Cause: The model systematically confuses Long-run efficiency reforms with Countercyclical stabilization policies. This appears to be a conceptual failure in distinguishing “improving growth” (long-run) from “fighting recession” (countercyclical).
Small Sample Size Impact: With only 6 test acts, each error carries heavy weight (16.7% per error). The validation set’s larger size (10 acts) and different class distribution may not have exposed this weakness.
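The weight each test act carries can be quantified with a confidence interval. A 95% Wilson interval on 4/6 correct (a standard formula, not something the report computes) spans roughly 0.30 to 0.90, so a sample this small cannot statistically distinguish pass from fail at the 75% target:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_interval(4, 6)  # test set: 4 of 6 correct
# lo ~ 0.30, hi ~ 0.90: the interval straddles the 75% target.
```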
Status: Model B does NOT meet Phase 0 success criteria for production deployment and requires improvement before proceeding to Model C or Southeast Asia deployment.
Detailed Predictions
Sample Predictions (Test Set)
Show a few representative predictions to verify qualitative performance: