Dann brown

DANA 4830 — Multivariate feature selection Part 1: mRMR and CFS

Multivariate (sometimes called bivariate-and-beyond) feature selection picks a subset of columns by judging them together. Univariate methods score each feature on its own, so they find strong individual features but can keep two columns that carry the same information.

Paper: Zouhri et al. (2024), §2.2 — CFS, CON, DISR, all grounded in mRMR.

Why univariate is not enough

Suppose you predict fraud with IEEE-CIS-style columns:

| Feature | Correlation with `isFraud` | Correlation with other features |
| --- | --- | --- |
| `TransactionAmt` | High | Low with others |
| `V12` | High | 0.95 with `V14` |
| `V14` | High | 0.95 with `V12` |

Univariate ANOVA might keep both V12 and V14. Multivariate logic says: pick one; they are redundant.

Univariate thinking:  "Who are the best solo players?"
Multivariate thinking: "Who forms the best TEAM without repeating roles?"
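
To make the "solo players" problem concrete, here is a minimal sketch on synthetic data. The column names mirror the table above, but the values, effect sizes, and the choice of `SelectKBest` with `f_classif` are assumptions made for illustration, not the paper's setup.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)            # synthetic fraud label (assumption)

# V12 and V14 are two noisy copies of the same underlying signal
shared = 1.5 * y + rng.normal(size=n)
X = pd.DataFrame({
    "TransactionAmt": 0.8 * y + rng.normal(size=n),   # weaker, independent signal
    "V12": shared + 0.2 * rng.normal(size=n),
    "V14": shared + 0.2 * rng.normal(size=n),
})

# Univariate ANOVA ranks each column alone and keeps the top 2
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = X.columns[selector.get_support()].tolist()
corr_v12_v14 = X["V12"].corr(X["V14"])

print("Univariate top-2:", kept)
print("corr(V12, V14):", round(corr_v12_v14, 2))
```

In this toy setup the univariate pass keeps the two near-duplicate columns and discards the only feature that adds a different signal.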

mRMR — Maximum Relevance, Minimum Redundancy

The paper’s multivariate filters follow mRMR:

At each step, add the feature that:

Has high relevance to the target (class).
Has low redundancy with features already chosen.

\[ \text{score}(f) = \text{relevance}(f, y) - \text{redundancy}(f, S) \]

where \(y\) is the target and \(S\) is the set of features already selected.

| Term | Meaning | Fraud example |
| --- | --- | --- |
| Relevance | Feature helps predict fraud | `TransactionAmt` higher in fraud rows |
| Redundancy | Feature duplicates info already in \(S\) | `V12` after `V14` is already in the set |
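
A small worked example of the score, using made-up numbers: the relevance values and pairwise correlations below are illustrative assumptions, and redundancy is taken as the mean correlation with the already-selected set.

```python
# Hypothetical relevance of each feature to the target
relevance = {"TransactionAmt": 0.30, "V12": 0.45, "V14": 0.50}

# Hypothetical absolute correlations between feature pairs
pair_corr = {
    frozenset({"V12", "V14"}): 0.95,
    frozenset({"TransactionAmt", "V12"}): 0.10,
    frozenset({"TransactionAmt", "V14"}): 0.10,
}

def score(f, selected):
    """relevance(f, y) minus mean redundancy of f with the selected set S."""
    if not selected:
        return relevance[f]
    red = sum(pair_corr[frozenset({f, s})] for s in selected) / len(selected)
    return relevance[f] - red

selected = []
remaining = list(relevance)
for _ in range(2):
    best = max(remaining, key=lambda f: score(f, selected))
    selected.append(best)
    remaining.remove(best)

print(selected)  # ['V14', 'TransactionAmt']
```

`V14` wins the first round on relevance alone. In the second round `V12` scores 0.45 - 0.95 = -0.50 while `TransactionAmt` scores 0.30 - 0.10 = 0.20, so the less relevant but non-redundant feature is added.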

CFS — Correlation-based Feature Subset Selection

Paper definition (§2.2a): choose features with

High feature–class correlation
Low feature–feature correlation inside the subset

Intuition

Target: isFraud

Good pair for CFS:
  TransactionAmt  ──high──► isFraud
  card_region     ──high──► isFraud
  TransactionAmt  ──low──► card_region   (different signal)

Bad pair for univariate-only ranking:
  V12 ──high──► isFraud
  V14 ──high──► isFraud
  V12 ──0.95──► V14   (almost same signal twice)

CFS prefers V12 OR V14, not necessarily both.
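
The intuition above can be turned into a number. Hall (1999) scores a subset of k features with a merit of k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature–class correlation and r_ff the mean feature–feature correlation. The sketch below assumes absolute Pearson correlation for both terms and uses synthetic data; `cfs_merit` is a hypothetical helper name, not a library function.

```python
import numpy as np
import pandas as pd

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset, using |Pearson| for both correlations."""
    k = len(subset)
    r_cf = X[subset].corrwith(y).abs().mean()          # mean feature-class corr
    if k == 1:
        return r_cf
    corr = X[subset].corr().abs().to_numpy()
    r_ff = corr[np.triu_indices(k, k=1)].mean()        # mean feature-feature corr
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
n = 2000
y = pd.Series(rng.integers(0, 2, size=n), name="isFraud")   # synthetic label

v12 = y + rng.normal(size=n)
X = pd.DataFrame({
    "TransactionAmt": y + rng.normal(size=n),   # independent signal
    "card_region": y + rng.normal(size=n),      # independent signal
    "V12": v12,
    "V14": v12 + 0.2 * rng.normal(size=n),      # near-duplicate of V12
})

merit_good = cfs_merit(X, y, ["TransactionAmt", "card_region"])
merit_bad = cfs_merit(X, y, ["V12", "V14"])
print(f"merit(TransactionAmt, card_region) = {merit_good:.3f}")
print(f"merit(V12, V14)                    = {merit_bad:.3f}")
```

All four columns are about equally relevant here, so the gap in merit comes almost entirely from the feature–feature term in the denominator.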

CFS-like implementation in Python

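scikit-learn has no built-in CFS class. The pipeline below is a CFS-like simplification that is good for learning and assignments: it ranks features by relevance and then drops the less relevant member of each highly correlated pair, instead of searching over subsets with Hall's merit score.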

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 1: relevance = |correlation| with target
relevance = X_train.corrwith(y_train).abs().sort_values(ascending=False)

# Step 2: candidate pool (top 60% by relevance — paper uses multiple thresholds)
pool_size = max(5, int(0.6 * X_train.shape[1]))
candidates = relevance.head(pool_size).index.tolist()

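# Step 3: drop redundant pairs (feature-feature |corr| > threshold)
X_pool = X_train[candidates]
corr_ff = X_pool.corr().abs()
# Keep only the upper triangle so each pair is checked once. Candidates are
# sorted by relevance, so each column is compared only with more relevant ones.
upper = corr_ff.where(np.triu(np.ones(corr_ff.shape), k=1).astype(bool))

to_drop = set()
for col in upper.columns:
    # Drop the less relevant member of any pair above the threshold
    if (upper[col] > 0.90).any():
        to_drop.add(col)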

selected_cfs = [c for c in candidates if c not in to_drop]
print(f"CFS-like: {len(selected_cfs)} features from {X_train.shape[1]}")
print(selected_cfs[:15])

What you can report in an assignment

| Question | How CFS answers it |
| --- | --- |
| Which features explain fraud? | High relevance list |
| Which features repeat each other? | Dropped by high feature–feature correlation |
| How many columns removed? | `before - after` count |
| Does RF still work? | Train before/after and compare AUC |
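
For the "which features repeat each other" and "how many columns removed" rows, a small report helper can be handy. This is a sketch on synthetic data; `redundant_pairs` is a hypothetical helper, and the 0.90 threshold is the same arbitrary cut-off used in the CFS-like code.

```python
import numpy as np
import pandas as pd

def redundant_pairs(X, threshold=0.90):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = X.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # each pair once
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return (
        pairs.rename_axis(["feature_a", "feature_b"])
        .reset_index(name="abs_corr")
    )

rng = np.random.default_rng(0)
n = 1000
shared = rng.normal(size=n)
X = pd.DataFrame({
    "TransactionAmt": rng.normal(size=n),
    "V12": shared + 0.2 * rng.normal(size=n),
    "V14": shared + 0.2 * rng.normal(size=n),
})

pairs = redundant_pairs(X)
# feature_b is simply the later column in X; sort columns by relevance first
# if the less relevant member of each pair should be the one dropped.
to_drop = set(pairs["feature_b"])
before, after = X.shape[1], X.shape[1] - len(to_drop)

print(pairs)
print(f"before={before}, after={after}, removed={before - after}")
```

The resulting table of pairs and the before/after counts can go straight into the write-up.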

Greedy mRMR (closer to the paper’s iterative idea)

from sklearn.feature_selection import mutual_info_classif

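def mrmr_greedy(X, y, k=12):
    """Greedy mRMR-style selection: MI relevance minus mean |corr| redundancy."""
    # Relevance: mutual information between each feature and the target
    mi = pd.Series(
        mutual_info_classif(X, y, random_state=42),
        index=X.columns
    )
    selected = []
    remaining = list(X.columns)

    # Never ask for more features than exist
    for _ in range(min(k, len(remaining))):
        best_score, best_feat = -np.inf, None
        for f in remaining:
            rel = mi[f]
            red = 0.0
            if selected:
                # Redundancy: mean absolute Pearson correlation with the chosen set
                red = X[selected].corrwith(X[f]).abs().mean()
            score = rel - red
            if score > best_score:
                best_score, best_feat = score, f
        if best_feat is None:  # every remaining score was NaN (e.g. constant columns)
            break
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected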

mrmr_features = mrmr_greedy(X_train, y_train, k=12)
print("mRMR subset:", mrmr_features)

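This mirrors the paper’s description: each iteration adds the feature with the best relevance minus redundancy. Note that the code measures relevance with mutual information and redundancy with absolute Pearson correlation, so the two terms are on different scales; Peng et al. (2005) use mutual information for both.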
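
For comparison, here is a minimal sketch that uses mutual information for both terms (the difference form, relevance minus mean redundancy). It assumes continuous features, so feature–feature MI is estimated with `mutual_info_regression`; `mrmr_mid` is a hypothetical name and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_mid(X, y, k=2, random_state=42):
    """Greedy mRMR with MI relevance and mean-MI redundancy (difference form)."""
    relevance = pd.Series(
        mutual_info_classif(X, y, random_state=random_state), index=X.columns
    )
    redundancy_sum = pd.Series(0.0, index=X.columns)
    selected, remaining = [], list(X.columns)

    for _ in range(min(k, len(remaining))):
        if selected:
            # Only the MI with the most recently added feature is new
            last = selected[-1]
            mi_last = mutual_info_regression(
                X[remaining], X[last], random_state=random_state
            )
            redundancy_sum.loc[remaining] += mi_last
            scores = relevance[remaining] - redundancy_sum[remaining] / len(selected)
        else:
            scores = relevance[remaining]
        best = scores.idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)                 # synthetic label
v12 = y + rng.normal(size=n)
X = pd.DataFrame({
    "TransactionAmt": y + rng.normal(size=n),
    "V12": v12,
    "V14": v12 + 0.2 * rng.normal(size=n),     # near-duplicate of V12
})

mid_features = mrmr_mid(X, y, k=2)
print("MI-based mRMR subset:", mid_features)
```

Because `V12` and `V14` share far more information with each other than either shares with the label, the second pick never completes that pair.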

Bivariate analysis vs multivariate FS

In coursework, bivariate analysis often means two variables at a time (scatter, correlation matrix, cross-tab). That is exploratory. CFS / mRMR is selective — it uses pairwise relationships to build a subset.

| Stage | What you do | Example |
| --- | --- | --- |
| Bivariate EDA | Understand pairs | `sns.heatmap(X.corr())`, `TransactionAmt` vs `isFraud` |
| Multivariate FS | Choose subset | CFS drops redundant pairs; mRMR builds a team |

import seaborn as sns
import matplotlib.pyplot as plt

cols = ["V14", "V12", "V10", "V17", "TransactionAmt"] if "TransactionAmt" in X.columns else ["V14", "V12", "V10", "V17", "V4"]
sns.heatmap(X_train[cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Bivariate view: feature–feature correlations")
plt.show()

High off-diagonal values indicate redundancy that CFS will penalize.

IEEE-CIS: where CFS helps most

IEEE-CIS has blocks of similar columns:

| Block | Typical issue | CFS effect |
| --- | --- | --- |
| `V1`–`V339` | Many correlated anonymized counts | Large reduction |
| `C1`–`C14`, `D1`–`D15` | Overlapping signals | Drop duplicates |
| Categorical (`card4`, `ProductCD`) | Need encoding before correlation | One-hot first, then CFS on numeric block |

Suggested workflow:

1. Split numeric vs categorical
2. One-hot encode categoricals
3. ANOVA / Chi2 univariate pass (optional pre-filter)
4. CFS or mRMR on numeric pool
5. Union with top categorical dummies (careful with redundancy)
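
The workflow can be sketched end to end on a tiny synthetic frame. Everything here is an assumption for illustration: the column values, the `k=3` ANOVA pre-filter, the 0.90 threshold, and keeping the top 2 dummies by chi2.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.default_rng(0)
n = 1000
y = pd.Series(rng.integers(0, 2, size=n), name="isFraud")
shared = 1.5 * y + rng.normal(size=n)
df = pd.DataFrame({
    "TransactionAmt": 0.8 * y + rng.normal(size=n),
    "V12": shared + 0.2 * rng.normal(size=n),
    "V14": shared + 0.2 * rng.normal(size=n),
    "noise": rng.normal(size=n),
    "card4": rng.choice(["visa", "mastercard"], size=n),
    "ProductCD": rng.choice(["W", "C", "R"], size=n),
})

# Split numeric vs categorical
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in df.columns if c not in num_cols]

# One-hot encode categoricals
dummies = pd.get_dummies(df[cat_cols], columns=cat_cols, dtype=float)

# Optional univariate pre-filter (ANOVA) on the numeric pool
pre = SelectKBest(f_classif, k=3).fit(df[num_cols], y)
pool = [c for c, keep in zip(num_cols, pre.get_support()) if keep]

# CFS-like pass: walk the pool in relevance order, skip redundant features
relevance = df[pool].corrwith(y).abs().sort_values(ascending=False)
corr = df[pool].corr().abs()
kept_num = []
for f in relevance.index:
    if all(corr.loc[f, g] <= 0.90 for g in kept_num):
        kept_num.append(f)

# Union with the top categorical dummies, ranked by chi2
chi2_scores, _ = chi2(dummies, y)
top_dummies = (
    pd.Series(chi2_scores, index=dummies.columns).nlargest(2).index.tolist()
)

final_features = kept_num + top_dummies
print("numeric kept:", kept_num)
print("final:", final_features)
```

On real IEEE-CIS data the same steps should be fitted on the training split only, and missing values need handling before the ANOVA and correlation steps.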

Compare univariate vs CFS with Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def rf_auc(X_tr, X_te, y_tr, y_te):
    rf = RandomForestClassifier(
        n_estimators=200, random_state=42, class_weight="balanced", n_jobs=-1
    )
    rf.fit(X_tr, y_tr)
    prob = rf.predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, prob)

X_train_cfs = X_train[selected_cfs]
X_test_cfs = X_test[selected_cfs]

print("AUC all features:", rf_auc(X_train, X_test, y_train, y_test))
print("AUC CFS-like:", rf_auc(X_train_cfs, X_test_cfs, y_train, y_test))
print("AUC mRMR:", rf_auc(X_train[mrmr_features], X_test[mrmr_features], y_train, y_test))

Paper-aligned expectation: CFS-like subsets are smaller, with similar or better AUC for RF/SVM on redundant IDS data. On the Credit Card data (30 features, 28 of which are PCA components and therefore nearly uncorrelated with each other), gains may be small; the big win shows up on high-dimensional IEEE-CIS.

CFS vs PCA (again)

| Aspect | CFS | PCA |
| --- | --- | --- |
| Output | Original column names | New components `PC1`, `PC2`, … |
| Interpretability | “We kept `TransactionAmt`” | “PC1 mixes 40 columns” |
| Redundancy | Explicitly minimized | Absorbed into components |
| Paper category | Feature selection | Feature extraction |

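For an assignment comparing both: run CFS → RF and PCA → RF with the same CV folds, fitting the selection or PCA step on each training fold only so that neither score is inflated by leakage.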
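
One way to set up that comparison is to wrap each reduction step in a pipeline so it is refitted inside every fold. The sketch below makes several assumptions: `CorrelationFilter` is a hypothetical minimal transformer (a greedy correlation filter, not full CFS), the data comes from `make_classification`, and `n_components=8` is arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Keep features in relevance order, skipping any that are too
    correlated with a feature already kept."""

    def __init__(self, threshold=0.90):
        self.threshold = threshold

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
        corr = np.abs(np.corrcoef(X, rowvar=False))
        kept = []
        for j in np.argsort(-rel):                      # most relevant first
            if all(corr[j, i] <= self.threshold for i in kept):
                kept.append(j)
        self.kept_ = np.array(sorted(kept))
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.kept_]

# Synthetic data with 3 exactly repeated columns
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=5,
    n_redundant=4, n_repeated=3, random_state=42,
)

def make_rf():
    return RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

cfs_rf = make_pipeline(CorrelationFilter(threshold=0.90), make_rf())
pca_rf = make_pipeline(StandardScaler(), PCA(n_components=8, random_state=42), make_rf())

# A fixed random_state makes both runs use identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_cfs = cross_val_score(cfs_rf, X, y, cv=cv, scoring="roc_auc")
auc_pca = cross_val_score(pca_rf, X, y, cv=cv, scoring="roc_auc")

filt = CorrelationFilter(threshold=0.90).fit(X, y)
print("features kept by filter:", len(filt.kept_), "of", X.shape[1])
print("AUC CFS-like -> RF:", auc_cfs.mean().round(3))
print("AUC PCA -> RF:     ", auc_pca.mean().round(3))
```

Because both pipelines see the same folds, the per-fold AUC values can be compared pairwise rather than only through their means.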

Checklist before Part 2

Univariate ranking done (ANOVA / MI at 20–60%)
Correlation heatmap on top 20 numeric columns
CFS-like or mRMR subset built
RF benchmark: full vs CFS vs mRMR
Document how many features removed and which pairs were redundant

Next post: CON, DISR, Scott–Knott mindset, and the full reproduction pipeline on fraud data.

References

Zouhri et al. (2024) — §2.2 Multivariate methods; CFS [Hall, 1999]
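Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy (mRMR). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8).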
Part 1 paper overview: MCAR/MAR/NMAR