
Image by Editor
# Introduction
For many organizations, notably those in regulated industries or with restricted technical infrastructure, Excel and its XLMiner add-in serve as the primary platform for predictive modeling and machine learning workflows.
But Excel's accessibility masks a critical gap: the difference between running models and building trustworthy analytics systems. While working on a loan approval prediction project, I discovered that Excel-based machine learning fails not because of algorithmic limitations, but because of a handful of practices that are frequently overlooked.
This article distills those hard-won lessons into five comprehensive frameworks that can elevate your Excel-based machine learning work.
# Lesson 1: Multiple Methods for Outlier Detection
Outlier handling is more art than science, and premature removal can eliminate legitimate extreme values that carry important information. In one case, all residential asset values above the 95th percentile were removed using a simple IQR calculation, under the assumption that they were errors. Later analysis revealed that the removal had discarded legitimate ultra-high-value properties, a relevant segment for large loan approvals.
The lesson: Use multiple detection methods and manual review before removal. Create a comprehensive outlier detection framework.
In a new sheet adjacent to the main data, create detection columns:
- Column A: Original Value (residential_assets_value)
- Column B: IQR Method
=IF(A2 > QUARTILE.INC($A$2:$A$4270,3) + 1.5*(QUARTILE.INC($A$2:$A$4270,3)-QUARTILE.INC($A$2:$A$4270,1)), "Outlier_IQR", "Normal")
- Column C: 3-Sigma Method
=IF(ABS(A2-AVERAGE($A$2:$A$4270)) > 3*STDEV($A$2:$A$4270), "Outlier_3SD", "Normal")
- Column D: Percentile Method
=IF(A2 > PERCENTILE.INC($A$2:$A$4270,0.99), "Outlier_P99", "Normal")
- Column E: Combined Flag
=IF(COUNTIF(B2:D2,"Outlier*")>=2, "INVESTIGATE", "OK")
- Column F: Manual Review [Notes after investigating]
- Column G: Final Decision [Keep/Remove/Transform]
This multi-method approach revealed patterns in my loan data:
- Values flagged by all three methods (IQR, 3-sigma, and percentile): Likely errors
- Values flagged by IQR but not 3-sigma: Legitimate extreme values in skewed distributions
- Values flagged only by percentile: The extreme but valid cases I almost lost
The "Manual Review" column is crucial. For each flagged observation, document findings such as: "Legitimate luxury property, verified against public records" or "Likely data entry error, value exceeds market maximum by 10x."
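To see how often the methods agree before making any removal decisions, a small summary block helps (a minimal sketch; the cell ranges assume the 4,269-row layout above):
- IQR flags: =COUNTIF($B$2:$B$4270,"Outlier_IQR")
- 3-Sigma flags: =COUNTIF($C$2:$C$4270,"Outlier_3SD")
- Percentile flags: =COUNTIF($D$2:$D$4270,"Outlier_P99")
- Rows needing investigation: =COUNTIF($E$2:$E$4270,"INVESTIGATE")
If the three counts diverge sharply, the distribution is probably skewed, and the IQR threshold alone should not drive removals.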
# Lesson 2: Always Set Random Seeds
Few experiences are more frustrating than presenting excellent model results, then being unable to reproduce those exact numbers when preparing the final report. This happened with a classification tree model: one day's validation accuracy was 97.3%, but the next day's was 96.8%. The difference seems small, but it undermines credibility and leaves the audience wondering which number is real and how much the analysis can be trusted.
The lesson: The culprit is random partitioning with no fixed seed. Most machine learning algorithms involve randomness at some stage:
- Data partitioning: Which observations go into training vs. validation vs. test sets
- Neural networks: Initial weight randomization
- Some ensemble methods: Random feature selection
XLMiner uses random processes for partitioning the data. Running the same model twice with identical parameters yields slightly different results because the training/validation split differs each time.
The solution is simple but non-obvious. When using XLMiner's partition functionality (found in most model dialogs):
- Check the box labeled "Set seed" (it is unchecked by default)
- Enter a specific integer: 12345, 42, 2024, or any memorable number
- Document this seed value in the Model Log
Now, every time the model is run with this seed:
- Same training/validation/test splits
- Same model performance metrics
- Same predictions for the same observations
- Perfect reproducibility (a quick verification formula follows this list)
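A quick way to confirm reproducibility is to paste the predicted classes from two runs into adjacent sheets and count mismatches (a minimal sketch; the sheet names Run1 and Run2 and the prediction column D are assumptions):
=SUMPRODUCT(--(Run1!D2:D855<>Run2!D2:D855))
With the seed set, this returns 0; without it, any positive count shows exactly how many observations changed predictions between runs.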
Here is an example from the loan approval dataset without a seed (three runs of an identical logistic regression):
- Run 1: Validation Accuracy = 92.4%, F1 = 0.917
- Run 2: Validation Accuracy = 91.8%, F1 = 0.923
- Run 3: Validation Accuracy = 92.1%, F1 = 0.919
And with seed=12345 (three runs of an identical logistic regression):
- Run 1: Validation Accuracy = 92.1%, F1 = 0.928
- Run 2: Validation Accuracy = 92.1%, F1 = 0.928
- Run 3: Validation Accuracy = 92.1%, F1 = 0.928
The difference matters enormously for credibility. When tasked with recreating the analysis, it can be done with confidence, knowing the numbers will match.
Important caveat: The seed controls randomness in partitioning and initialization, but it does not make the analysis immune to other changes. If the data is modified (adding observations, altering transformations) or model parameters are adjusted, results will still differ, as they should.
# Lesson 3: Proper Data Partitioning: The Three-Way Split
Related to reproducibility is partition strategy. XLMiner's default settings create a 60/40 training/validation split. This seems reasonable until the question arises: where's the test set?
A common mistake involves building a neural network, tuning it based on validation performance, then reporting those validation metrics as final results.
The lesson: Without a separate test set, the optimization occurs directly on the data being reported, inflating performance estimates. The correct partitioning strategy uses three sets.
1. Training Set (50% of Data)
- Where the model learns patterns
- Used to fit parameters, coefficients, or weights
- For the loan dataset: ~2,135 observations
2. Validation Set (30% of Data)
- For model selection and hyperparameter tuning
- Used to compare different models or configurations
- Helps select the best pruned tree, optimal cutoff value, or ideal neural network architecture
- For the loan dataset: ~1,280 observations
3. Test Set (20% of Data)
- "Final exam" – only score once
- Used ONLY after all modeling decisions are complete
- Provides an unbiased estimate of real-world performance
- For the loan dataset: ~854 observations
Crucial rule: Never iterate on test set performance. The moment a model is chosen because "it performs better on the test set," that test set becomes a second validation set, and performance estimates become biased.
This is my workflow now:
- Set the seed to 12345
- Partition 50/30/20 (training/validation/test); a manual sketch of this step follows this list
- Build multiple model variants, evaluating each on the validation set only
- Select the best model based on validation performance and business requirements
- Score the test set exactly once with the chosen model
- Report test set performance as the expected real-world outcome
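XLMiner handles the split automatically once the percentages are entered, but the same idea can be sketched by hand if needed (the helper columns H and I are assumptions, not part of the dataset):
- Column H: =RAND(), then immediately copy the column and Paste Special → Values, because RAND() has no seed and recalculates on every change
- Column I: =IF(H2<=0.5,"Training",IF(H2<=0.8,"Validation","Test"))
Freezing the random numbers as values is the manual equivalent of setting a seed: the assignment stays fixed from then on.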
Here is an example from the loan approval project:
| Model Version | Training Acc | Validation Acc | Test Acc | Chosen? |
|---|---|---|---|---|
| Logistic Regression (all vars) | 90.6% | 89.2% | Not scored | No |
| Logistic Regression (stepwise) | 91.2% | 92.1% | Not scored | No |
| Classification Tree (depth=7) | 98.5% | 97.3% | Not scored | Yes |
| Classification Tree (depth=5) | 96.8% | 96.9% | Not scored | No |
| Neural Network (7 nodes) | 92.3% | 90.1% | Not scored | No |
After selecting the Classification Tree (depth=7) based on validation performance, the test set was scored exactly once: 97.4% accuracy. This test accuracy represents the expected production performance.
# Lesson 4: The Training/Validation Gap: Spotting Overfitting Before It Hurts
The problem: At first glance, the classification tree results from the project report look promising.
Training Data Performance:
- Accuracy: 98.45%
- Precision: 99%
- Recall: 96%
- F1 Score: 98.7%
The model appeared successful until attention shifted to the validation results.
Validation Data Performance:
- Accuracy: 97.27%
- Precision: 98%
- Recall: 94%
- F1 Score: 97.3%
The gap appeared small, just a 1.18 percentage point difference in accuracy. But determining whether that gap constituted a problem required a systematic framework.
The lesson: It is critical to recognize when the model memorizes instead of learns.
The practical solution: Create an overfitting monitor. Build a simple but systematic comparison sheet that makes overfitting obvious.
Step 1: Create the Comparison Framework
Here is the model performance comparison in the "Overfitting_Monitor" sheet:
| Metric | Training | Validation | Gap | Gap % | Status |
|---|---|---|---|---|---|
| Accuracy | 98.45% | 97.27% | 1.18% | 1.20% | ✓ Good |
| Precision | 99.00% | 98.00% | 1.00% | 1.01% | ✓ Good |
| Recall | 96.27% | 94.40% | 1.87% | 1.94% | ✓ Good |
| F1 Score | 98.76% | 97.27% | 1.49% | 1.51% | ✓ Good |
| Specificity | 96.56% | 92.74% | 3.82% | 4.06% | ? Watch |
And here are the interpretation rules:
- Gap < 3%: ✅ Good – Model generalizing well
- Gap 3-5%: ❓ Watch – Acceptable but monitor closely
- Gap 5-10%: ⚠️ Concerning – Possible overfit, consider simplification
- Gap > 10%: ❌ Problem – Definite overfit, must address
And this is the detailed assessment:
- Overall Assessment: GOOD
- Reasoning: All major metrics fall within a 2% gap. The specificity gap is slightly higher but still acceptable. The model appears to generalize well.
- Recommendation: Proceed with test set evaluation.
Step 2: Add the calculation formulas
These use structured table references, so format the monitor range as an Excel table first.
Cell: Gap (for Accuracy)
=[@Training] - [@Validation]
Cell: Gap % (for Accuracy)
=([@Training] - [@Validation]) / [@Training]
Cell: Status (for Accuracy)
=IF([@[Gap %]]<0.03, "✓ Good",
  IF([@[Gap %]]<0.05, "? Watch",
    IF([@[Gap %]]<0.10, "⚠ Concerning", "✗ Problem")))
Step 3: Create a Visual Overfitting Chart
Build a side-by-side bar chart comparing training vs. validation for each metric. This makes patterns instantly visible:

When the bars are close, the model generalizes well. When the training bars are much longer than the validation bars, there is overfitting.
Comparing Across Different Models
The real value comes from comparing overfitting patterns across candidate models. Here is the comparison from the "Model_Overfitting_Comparison" sheet:
| Model | Train Acc | Val Acc | Gap | Overfitting Risk |
|---|---|---|---|---|
| Logistic Regression | 91.2% | 92.1% | -0.9% | Low (negative gap) |
| Classification Tree | 98.5% | 97.3% | 1.2% | Low |
| Neural Network (5 nodes) | 90.7% | 89.8% | 0.9% | Low |
| Neural Network (10 nodes) | 95.1% | 88.2% | 6.9% | High – Reject |
| Neural Network (14 nodes) | 99.3% | 85.4% | 13.9% | Very High – Reject |
Interpretation: The neural networks with 10+ nodes are clearly overfitting. Despite the 14-node network's high training accuracy (99.3%), its validation accuracy drops to 85.4%. The model memorized training data patterns that do not generalize.
Best choice: Classification tree
- High performance (97.3% validation)
- Minimal overfitting (1.2% gap)
- Good balance of complexity and generalization
Here are some simple ways to reduce overfitting when it appears:
- For Classification Trees: Reduce maximum depth or increase minimum samples per node
- For Neural Networks: Reduce the number of nodes or layers
- For Logistic Regression: Remove variables or use stepwise selection
- For All Models: Add more training data if possible
# Lesson 5: Enforce Data Validation for Categorical Variables
Data entry errors are silent killers of machine learning projects. A single typo, such as "gradute" instead of "graduate", creates a third category in what should be a binary variable. The model then encounters an unexpected feature value it has never seen during training, potentially causing errors during deployment or, worse, silently producing incorrect predictions.
The prevention: Excel's data validation feature. Here is the implementation protocol for categorical variables:
In a hidden sheet (name it "Validation_Lists"), create lists of valid values:
- For education: Create a column with the entries "Graduate" and "Not Graduate"
- For self-employment: Create a column with the entries "Yes" and "No"
- For loan status: Create a column with the entries "Approved" and "Rejected"
In the data entry sheet:
- Select the entire column for a categorical variable (e.g., the column containing education data)
- Data → Data Validation → Settings tab
- Allow: List
- Source: Navigate to the hidden validation sheet and select the appropriate list
- Error Alert tab: Style = Stop, with a clear message: "Only 'Graduate' or 'Not Graduate' are valid entries"
Now it is impossible to enter invalid values. Users see a dropdown with valid choices, eliminating typographical errors entirely.
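Validation only blocks future bad entries, so data already in the sheet still needs an audit. A helper column like this flags existing values that are not in the approved list (a minimal sketch; the education data is assumed to sit in column C, with the valid list in Validation_Lists!$A$2:$A$3):
=IF(COUNTIF(Validation_Lists!$A$2:$A$3,C2)=0,"INVALID","OK")
Filter the helper column for "INVALID" to catch every legacy typo such as "gradute" before training.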
For numeric variables with known ranges, apply similar validation to prevent impossible values (an audit formula for rows entered before validation follows the settings below):
- Credit scores: Must be between 300 and 900
- Loan term: Must be between 1 and 30 years
- Annual income: Must be greater than 0
Select the column, apply data validation, and set:
- Allow: Whole Number (or Decimal)
- Data: between
- Minimum: 300 (for credit scores)
- Maximum: 900
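The same audit idea works for numeric ranges already in the sheet (a sketch; the credit score column E and the 300-900 bounds are illustrative):
=IF(OR(NOT(ISNUMBER(E2)),E2<300,E2>900),"OUT_OF_RANGE","OK")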
# Final Thoughts
Here is a summary of the lessons outlined in this article.

5 Excel AI Lessons I Learned the Hard Way
The five practices in this article (multi-method outlier detection, setting random seeds, three-way data partitioning, monitoring the training/validation gap, and enforcing data validation) share a common thread: they are all simple to implement but devastating to omit.
None of these practices requires advanced statistical knowledge or complex programming. They do not demand additional software or expensive tools. Excel with XLMiner is a powerful tool for accessible machine learning.
Rachel Kuznetsov has a Master's in Business Analytics and thrives on tackling complex data puzzles and searching for fresh challenges to take on. She is committed to making intricate data science concepts easier to understand and is exploring the various ways AI makes an impact on our lives. On her continuous quest to learn and grow, she documents her journey so others can learn alongside her. You can find her on LinkedIn.
