Dann brown

I am a Senior Fullstack Software Developer, sharpening my skills and learning new things about tech daily.

DANA 4830 — Feature selection paper: how to read it, MCAR/MAR/NMAR, and the big ideas

Study notes for DANA 4830 on the paper:

Evaluating the impact of filter-based feature selection in intrusion detection systems
Zouhri, Idri & Ratnani (2024), International Journal of Information Security
DOI 10.1007/s10207-023-00767-y

The paper is not a book chapter. You do not read it top-to-bottom on the first pass. The goal is to extract a pipeline: problem → data → methods → results → limits.

How to read this paper (5-minute first pass)

Step	Section	Goal
1	Title	Topic: feature selection in intrusion detection
2	Abstract	One-paragraph summary of everything
3	Research questions	What they actually test
4	Figures / tables	What won and what did not
5	Methodology (§5)	Experimental pipeline
6	Introduction	Context (read last)

Translation of the title: they compare filter-based feature selection methods and ask whether reducing variables helps classifiers detect cyber-attacks.

The core problem in one sentence

IDS datasets have many features (packet size, duration, protocol, bytes sent, …). Too many irrelevant or redundant features can cause overfitting, slower training, and worse generalization. The paper asks: which filter methods + which classifiers + which thresholds work best?

IDS concept	Fraud-detection analogue (your assignment angle)
Attack vs benign	Fraud vs legitimate transaction
Network traffic features	TransactionAmt, card type, device info, V1–V339
High false alarm rate	High false positives on normal users

Research questions (this is the map of the whole paper)

RQ	Question (plain English)
RQ1	Does reducing features help or hurt classification?
RQ2	Do univariate filters beat multivariate filters?
RQ3	Which filter + classifier combinations work best?
RQ4	Are there features useful across all attack types?

They test 228 model variants:

\[ 228 = 4\ \text{classifiers} \times 3\ \text{datasets} \times (5\ \text{univariate} \times 3\ \text{thresholds} + 3\ \text{multivariate} + 1\ \text{full feature set}) \]
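The count can be checked by enumerating the grid. This is a minimal sketch: the classifier, filter, and threshold codes follow the paper's naming as summarized in these notes, while the dataset labels and the variable names are placeholders of mine.

```python
from itertools import product

classifiers = ["SVM", "MLP", "RF", "XGB"]
datasets = ["dataset_1", "dataset_2", "dataset_3"]  # placeholder labels

univariate = ["K", "A", "MI", "RL", "C"]  # Chi-square, ANOVA, MI, ReliefF, Pearson
thresholds = [20, 40, 60]                 # percent of ranked features kept
multivariate = ["CFS", "CON", "DISR"]

# Feature sets per (classifier, dataset): 5 x 3 univariate + 3 multivariate + full set
feature_sets = [f"{f}{t}" for f, t in product(univariate, thresholds)]
feature_sets += multivariate
feature_sets.append("100")  # all original features, as in SVM100

variants = list(product(classifiers, datasets, feature_sets))
print(len(feature_sets), len(variants))  # 19 228
```

So each classifier sees 19 feature sets per dataset, and 4 × 3 × 19 = 228.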

Feature selection vs feature extraction (PCA connection)

This distinction appears early in the paper and matters for your assignment.

Approach	What it does	Keeps original columns?	Example
Feature selection	Picks a subset of existing variables	Yes	Keep `TransactionAmt`, drop `V47`
Feature extraction	Creates new variables from combinations	No	PCA: \(PC_1 = 0.7X_1 + 0.3X_2\)

PCA is not feature selection. PCA builds new components; filters like Chi-square or CFS remove or rank original columns.
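The difference can be sketched on a toy table. The row values are made up, and the 0.7 / 0.3 weights are simply the illustrative ones from the table above; real PCA centers the data and derives its weights from the covariance structure, which this sketch does not do.

```python
# Toy table: each row is one transaction (values are made up for illustration).
rows = [
    {"TransactionAmt": 120.0, "V47": 1.0},
    {"TransactionAmt": 450.0, "V47": 0.0},
    {"TransactionAmt": 85.0, "V47": 1.0},
]

def select_features(rows, keep):
    """Feature selection: keep a subset of the original columns unchanged."""
    return [{name: row[name] for name in keep} for row in rows]

def extract_component(rows, weights):
    """Feature extraction: build one NEW column as a weighted sum of originals."""
    return [sum(w * row[name] for name, w in weights.items()) for row in rows]

selected = select_features(rows, keep=["TransactionAmt"])
pc1 = extract_component(rows, weights={"TransactionAmt": 0.7, "V47": 0.3})

print(selected[0])  # the original column survives under its original name
print(pc1[0])       # a new number that is no longer any single original column
```

After selection you can still say "the model used `TransactionAmt`"; after extraction you can only say "the model used component 1".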

Three families of feature selection

Type	Idea	Speed	Depends on classifier?	Paper examples
Filter	Statistical score per feature or subset	Fast	No	Chi2, ANOVA, MI, Pearson, ReliefF, CFS, CON, DISR
Wrapper	Train model, try subsets, keep what improves performance	Slow	Yes	RFE (mentioned in related work, not main experiment)
Embedded	Selection inside the model	Medium	Built-in	XGBoost split importance, RF impurity

On the intuition that XGBoost already does its own feature selection: partly correct. In this paper's taxonomy XGB is a classifier with embedded selection, not a wrapper, and it learns internally which splits matter. That is why external FS sometimes does not help XGB: the model may already ignore noisy features, or the filter may remove features that are only useful jointly.

Filters used in the paper

Univariate (one feature at a time)

Code	Method	Question it answers
K	Chi-square	Is this feature dependent on the class?
A	ANOVA (F-test)	Do class means differ for this numeric feature?
MI	Mutual information	Does knowing the feature reduce uncertainty about the class?
RL	ReliefF	Does the feature separate near neighbors of different classes?
C	Pearson correlation	Linear correlation feature ↔ target
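As one worked example of a univariate score, mutual information between a discrete feature and the class can be computed directly from counts. This is a minimal sketch for discrete values only (continuous features need binning or an estimator); the function name and the toy data are mine, not the paper's.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

labels = [1, 1, 0, 0]
informative = ["tcp", "tcp", "udp", "udp"]    # protocol fully determines the class
uninformative = ["tcp", "udp", "tcp", "udp"]  # protocol tells you nothing

print(mutual_information(informative, labels))    # 1.0
print(mutual_information(uninformative, labels))  # 0.0
```

One bit means the feature removes all uncertainty about a balanced binary class; zero bits means knowing the feature does not help at all.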

Thresholds: keep top 20%, 40%, or 60% of ranked features.
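Ranking and thresholding can be sketched with the Pearson filter. The feature names and values are made up, and rounding the cut-off with `round` is my assumption; the paper's exact rule for turning a percentage into a feature count may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has no variance, so its correlation is treated as 0.
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def top_percent(features, target, percent):
    """Rank features by |Pearson r| with the target and keep the top `percent`."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(len(ranked) * percent / 100))
    return ranked[:k]

target = [0, 0, 0, 1, 1, 1]
features = {  # hypothetical columns
    "f_strong": [1, 2, 1, 8, 9, 8],
    "f_medium": [1, 5, 2, 4, 6, 5],
    "f_weak": [3, 1, 2, 2, 3, 1],
    "f_const": [4, 4, 4, 4, 4, 4],
    "f_noise": [1, 2, 3, 3, 2, 1],
}
print(top_percent(features, target, 40))  # ['f_strong', 'f_medium']
```

Note that the score looks at one feature at a time, so two redundant copies of `f_strong` would both be kept.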

Multivariate (subset at a time)

Based on the mRMR idea (Maximum Relevance, Minimum Redundancy): keep features that are informative about the class but not redundant with one another.

Method	Logic
CFS	High correlation with target, low correlation among selected features
CON	Subset with minimal inconsistency (same feature values but different class)
DISR	Features that together carry more information than the sum of parts
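The "inconsistency" behind CON can be made concrete: group rows by their values on a candidate subset, and count every row that disagrees with the majority class of its group. This is a sketch of the generic consistency measure, not necessarily the paper's exact implementation; the column names and rows are hypothetical.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """Share of rows that disagree with the majority class of their feature pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[name] for name in subset)
        groups[pattern][label] += 1
    # Within each pattern, every row outside the majority class is inconsistent.
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)

rows = [
    {"proto": "tcp", "flag": "S"},
    {"proto": "tcp", "flag": "S"},
    {"proto": "tcp", "flag": "F"},
    {"proto": "udp", "flag": "S"},
]
labels = ["attack", "attack", "benign", "benign"]

print(inconsistency_rate(rows, labels, ["proto"]))          # 0.25
print(inconsistency_rate(rows, labels, ["proto", "flag"]))  # 0.0
```

Neither column separates the classes alone, but the pair does, which is exactly what a subset-level (multivariate) filter can see and a univariate one cannot.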

Classifiers compared

Model	How it learns	Paper takeaway
SVM	Maximum-margin boundary; sensitive to scale and noise	Often benefits from FS
MLP	Neural network; needs tuning and enough data	FS helps inconsistently
RF	Many trees vote on random subsets	Strong baseline; FS often helps
XGB	Sequential trees fix previous errors	Very strong; FS not always needed

RF vs XGB (simple analogy):

Random Forest:     Tree1 ┐
                   Tree2 ├── vote
                   Tree3 ┘

XGBoost:           Tree1 → errors → Tree2 → errors → Tree3 → …
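The same contrast as a toy sketch. This is deliberately not real Random Forest or XGBoost: the "trees" in the boosting loop are the simplest possible learner (predict the mean residual), and the function names and learning rate are my own choices for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Bagging style: independent trees each cast one vote."""
    return Counter(predictions).most_common(1)[0][0]

def boost(targets, n_rounds, learning_rate=0.5):
    """Boosting style: each round fits the errors left by the previous rounds."""
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        residuals = [t - prediction for t in targets]
        step = sum(residuals) / len(residuals)  # the new learner fits the residuals
        prediction += learning_rate * step      # and only partly corrects them
        history.append(prediction)
    return history

print(majority_vote(["attack", "benign", "attack"]))  # attack
print(boost([10.0, 10.0, 10.0], n_rounds=3))          # [5.0, 7.5, 8.75]
```

In the first function the order of the trees does not matter; in the second, every round depends on what the earlier rounds got wrong.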

Benchmark, Scott–Knott, Borda Count

Benchmark here = same rules for everyone: same datasets, CV, metrics, hyperparameter search.
Scott–Knott (SK): groups models whose performance differences are not statistically meaningful into clusters.
Borda Count (BC): ranks models across Accuracy, Precision, Recall, F1, AUC so one metric does not dominate.

Example naming from the paper: SVMRL40 = SVM trained on features selected by ReliefF at the 40% threshold; SVM100 = all original features.
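A Borda-style ranking can be sketched in a few lines. The scores below are made up and are NOT results from the paper, `RFCON` is a hypothetical label following the same naming pattern, and ties are simply broken by sort order, which a careful implementation would handle explicitly.

```python
def borda_count(scores):
    """Rank models by summed Borda points across metrics (higher metric = better)."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for metric in metrics:
        # Worst model on this metric gets 0 points, best gets len(models) - 1.
        ordered = sorted(models, key=lambda m: scores[m][metric])
        for rank, model in enumerate(ordered):
            points[model] += rank
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# Made-up numbers for illustration only.
scores = {
    "SVM100":  {"accuracy": 0.90, "f1": 0.88, "auc": 0.93},
    "SVMRL40": {"accuracy": 0.92, "f1": 0.91, "auc": 0.92},
    "RFCON":   {"accuracy": 0.95, "f1": 0.94, "auc": 0.97},
}
print(borda_count(scores))  # [('RFCON', 6), ('SVMRL40', 2), ('SVM100', 1)]
```

`SVM100` wins on AUC against `SVMRL40` but loses on the other two metrics, so the combined ranking places it last; no single metric decides the order.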

Main result (short version)

Filter-based FS can reduce features without hurting detection, but not always. XGBoost and Random Forest with multivariate filters (CON, DISR) often perform very well. FS helps SVM and RF more often than XGB or MLP.

Feature selection is not universally better — it depends on method + model + dataset.

MCAR, MAR, NMAR — before you select features

The paper’s Step 1 is cleaning: remove missing values, duplicates, irrelevant columns. The paper does not define missing-data mechanisms, but you must understand them when applying the pipeline to IEEE-CIS Fraud or Credit Card Fraud.

These three terms describe why data are missing — not how to impute them.

MCAR — Missing Completely At Random

Missingness has nothing to do with observed or unobserved values.

Example (credit card): a logging server crashes for 2 hours; random transactions lose TransactionAmt. Fraud and non-fraud rows are equally affected.

isFraud   TransactionAmt
       120
       450
       NA    ← random outage, not related to fraud

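What you can do: listwise deletion does not bias estimates under MCAR, it only costs sample size (so check how many rows remain). Simple imputation (mean or median) is also relatively safe here, though it shrinks the variance of the imputed column.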

MAR — Missing At Random

Missingness depends on observed variables, not on the missing value itself.

Example: P_emaildomain is missing more often when ProductCD = 'WP' (web purchase) — and you see ProductCD in the table. Missingness is explained by something you observe.

ProductCD   P_emaildomain   isFraud
WP          NA              0
W           user@gmail.com  0
WP          NA              1

What you can do: impute using other columns (group medians, models). Still need care in evaluation.
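A quick diagnostic is to compare the missing rate of one column across the groups of another, observed column. This is a minimal sketch with toy rows in the spirit of the example above (the fourth row is made up); `None` stands in for NA.

```python
from collections import defaultdict

def missing_rate_by_group(rows, target_col, group_col):
    """Fraction of rows where `target_col` is missing, per value of `group_col`."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_col]
        total[group] += 1
        if row[target_col] is None:
            missing[group] += 1
    return {g: missing[g] / total[g] for g in total}

rows = [
    {"ProductCD": "WP", "P_emaildomain": None},
    {"ProductCD": "W", "P_emaildomain": "user@gmail.com"},
    {"ProductCD": "WP", "P_emaildomain": None},
    {"ProductCD": "W", "P_emaildomain": "user@yahoo.com"},
]
print(missing_rate_by_group(rows, "P_emaildomain", "ProductCD"))
# {'WP': 1.0, 'W': 0.0}
```

A large gap between groups argues against MCAR and is consistent with MAR. It cannot rule out NMAR, because that depends on values you never observe.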

NMAR — Not Missing At Random

Missingness depends on the missing value itself (or unobserved causes tied to it).

Example: very large TransactionAmt values are missing because high-value channels use a different system that was not merged into the dataset. The fact that the amount is large causes it to be missing.

TransactionAmt   isFraud
NA               1        ← amount might be 50,000; you never see it
85               0
120              0

What you can do: hardest case. Deleting rows or mean imputation can bias the model. You may need domain knowledge, sensitivity analysis, or separate modeling.
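The bias is easy to see on a toy column. The amounts and the 10,000 cut-off are made up; the only assumption that matters is that the largest values are exactly the ones that go missing.

```python
def mean(values):
    return sum(values) / len(values)

true_amounts = [85, 120, 450, 50_000]  # made-up amounts

# NMAR: the large amounts are the ones that never reach the dataset.
observed = [a if a < 10_000 else None for a in true_amounts]
seen = [a for a in observed if a is not None]

# Mean imputation fills the gap with the mean of what was seen.
imputed = [a if a is not None else mean(seen) for a in observed]

print(mean(true_amounts))  # 12663.75, the value you would want
print(mean(seen))          # about 218.3 after deleting the missing row
print(mean(imputed))       # about 218.3 again: imputation did not fix anything
```

Both deletion and mean imputation report a typical amount of about 218, far below the true 12,664, and nothing in the observed data warns you about it.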

Quick reference

Mechanism	Missingness depends on…	Risk if ignored
MCAR	Nothing systematic	Low bias; mainly lost sample size
MAR	Observed columns	Moderate; use informed imputation
NMAR	The hidden value itself	High bias; FS and models can look good on wrong data

Type I error reminder (from univariate tests): if you run ANOVA or t-tests on 400 features at \(\alpha = 0.05\) and none of them is truly related to the class, you still expect about 400 × 0.05 = 20 false positives by chance. That is one reason the paper validates with classifiers + cross-validation, not only p-values.
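The arithmetic behind that reminder, as a short sketch. The "at least one" probability assumes the 400 tests are independent, which real features rarely are, and the Bonferroni threshold is a standard correction I am adding for comparison; it is not something these notes attribute to the paper.

```python
n_tests = 400
alpha = 0.05

expected_false_positives = n_tests * alpha   # 20.0 if every null hypothesis is true
p_at_least_one = 1 - (1 - alpha) ** n_tests  # assumes independent tests
bonferroni_alpha = alpha / n_tests           # stricter per-test threshold

print(expected_false_positives)
print(p_at_least_one)
print(bonferroni_alpha)
```

With 400 tests, at least one false positive is practically guaranteed, and the per-test threshold that would keep the family-wise error at 0.05 is 0.000125.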

Mental template for any ML paper

1. What problem do they have?
2. What data do they use?
3. What methods do they compare?
4. How do they measure success?
5. What won?
6. What are the limitations?

For this paper:

#	Answer
1	Too many IDS features; need FS
2	CIC-IDS2017, CSE-CIC-IDS2018, CIC-ToN-IoT
3	5 univariate + 3 multivariate filters × 4 classifiers
4	10-fold CV; Accuracy, Precision, Recall, F1, AUC; SK + Borda
5	RF/XGB + multivariate filters often strong
6	Old-style IDS focus; no universal FS winner

My reproduction plan (finance / fraud)

Same pipeline, different domain:

Public fraud dataset (Credit Card or IEEE-CIS)
    → cleaning + MCAR/MAR check
    → encoding
    → scaling (for SVM / PCA)
    → univariate FS (Chi2, ANOVA, MI, Pearson, ReliefF)
    → multivariate FS (CFS-like, CON concept, DISR concept)
    → RF, XGB, SVM
    → metrics (Precision, Recall, F1, ROC-AUC, PR-AUC)
    → compare
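For the metrics step, Precision, Recall, and F1 come straight from the confusion counts. This is a minimal sketch with made-up predictions; ROC-AUC and PR-AUC need predicted scores rather than hard labels, so they are left out here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 6 legitimate transactions, 4 fraudulent ones.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

print(classification_metrics(y_true, y_pred))
# {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

On fraud data the positive class is rare, so these class-specific metrics say much more than accuracy, which stays high even for a model that never flags fraud.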


References

Zouhri, H., Idri, A., & Ratnani, A. (2024). Evaluating the impact of filter-based feature selection in intrusion detection systems. International Journal of Information Security, 23, 759–785.
Course paper PDF: W2_Sec1_Feature_Selection_Paper_1.pdf
Kaggle — Credit Card Fraud Detection (good starter size)
Kaggle — IEEE-CIS Fraud Detection (assignment-scale dataset)