Dann brown

I am a Senior Fullstack Software Developer, working on my skills and learning new things about tech daily.

DANA 4830 — Multivariate feature selection Part 2: CON, DISR, and fraud dataset pipeline

Part 2 of multivariate (bivariate-and-beyond) feature selection from Zouhri et al. (2024): CON, DISR, how they differ from CFS, and a full Python pipeline you can run on public fraud data (same logic as IEEE-CIS for your assignment).

Multivariate methods in the paper (recap)

Method	Core idea	One-line summary
CFS	High target correlation, low inter-feature correlation	No duplicate teammates
CON	Minimize inconsistency in the subset	Same X-values should not split across classes
DISR	Joint information beats sum of parts	Synergy between features

Paper result: XGBoost and Random Forest with CON and DISR subsets often matched or beat full feature sets on intrusion detection system (IDS) data.

CON — Consistency-based subset selection

Paper (§2.2b): the best subset has the lowest inconsistency rate.

A subset is inconsistent when two rows have the same values on all selected features but belong to different classes.

Example (toy fraud table)

Row	`Amount`	`Hour`	`isFraud`
1	100	14	0
2	100	14	1
3	500	3	1

If you select {Amount, Hour} only, rows 1 and 2 are identical on those features but different class → inconsistent.

CON searches for subsets where matching feature vectors usually imply the same class.
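The toy table can be checked directly in pandas. This is a minimal sketch: the frame `toy` simply re-types the three rows above, and a pattern is flagged as inconsistent when it maps to more than one class, as defined in this section.

```python
import pandas as pd

# The three rows from the toy fraud table above
toy = pd.DataFrame({
    "Amount": [100, 100, 500],
    "Hour": [14, 14, 3],
    "isFraud": [0, 1, 1],
})

selected = ["Amount", "Hour"]

# Number of distinct classes observed for each (Amount, Hour) pattern
classes_per_pattern = toy.groupby(selected)["isFraud"].nunique()

# A pattern is inconsistent when it maps to more than one class
inconsistent = classes_per_pattern[classes_per_pattern > 1]
print(inconsistent)
```

Only the pattern (100, 14) is flagged; row 3 has a unique pattern, so it cannot be inconsistent.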

Why it matters for IDS / fraud

Attack traffic (or fraud) sometimes shares surface stats with benign rows. CON pushes toward feature sets that separate classes cleanly in the training space — related to decision boundary clarity, not just marginal correlation.

Practical Python proxy

Exact CON uses search over subsets (expensive). For learning, use a consistency score on candidate features from mRMR/CFS:

import pandas as pd
import numpy as np

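def inconsistency_rate(X_sub: pd.DataFrame, y: pd.Series) -> float:
    """Inconsistency rate in the spirit of Dash & Liu.

    For each distinct feature pattern, count the rows that fall outside the
    pattern's majority class; return that count divided by the number of rows.
    """
    grouped = X_sub.assign(_y=y).groupby(list(X_sub.columns))["_y"]
    bad = 0
    for _, labels in grouped:
        # Rows that disagree with the most common class for this pattern
        bad += len(labels) - labels.value_counts().iloc[0]
    return bad / len(X_sub) if len(X_sub) else 0.0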

# Example on Credit Card subset
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

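candidate = ["V14", "V12", "V10", "V17", "V4", "V3"]

# The V-columns are continuous, so exact duplicate rows are rare.
# Bin each feature into quintiles first so that similar rows share a pattern.
binned = X_train[candidate].apply(
    lambda s: pd.qcut(s, q=5, labels=False, duplicates="drop")
)
print("Inconsistency:", inconsistency_rate(binned, y_train))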

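For assignments: bin the features, then compare the inconsistency rate of the full feature set against the CFS/mRMR subset. Adding features can never increase inconsistency, so the full set gives the floor; a good subset stays close to that floor with far fewer columns, provided AUC holds.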
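To make the "search over subsets" idea concrete, here is a minimal sketch of one cheap strategy: greedy backward elimination on already-binned features. It is an illustration under assumptions, not necessarily the search used in the paper; `pattern_inconsistency`, `con_backward`, and `tol` are hypothetical names, and the data is synthetic (the class depends on `a` and `b` only).

```python
import numpy as np
import pandas as pd

def pattern_inconsistency(X_sub: pd.DataFrame, y: pd.Series) -> float:
    """Rows outside the majority class of their feature pattern, over all rows."""
    df = X_sub.copy()
    df["_y"] = np.asarray(y)
    groups = df.groupby(list(X_sub.columns))["_y"]
    majority = groups.agg(lambda s: s.value_counts().iloc[0])
    return float((groups.size() - majority).sum()) / len(df)

def con_backward(X_binned: pd.DataFrame, y: pd.Series, tol: float = 0.0) -> list:
    """Drop features one at a time while inconsistency stays within tol of the full set."""
    floor = pattern_inconsistency(X_binned, y)  # full set gives the lowest possible rate
    selected = list(X_binned.columns)
    for col in list(X_binned.columns):
        if len(selected) == 1:
            break
        trial = [c for c in selected if c != col]
        if pattern_inconsistency(X_binned[trial], y) <= floor + tol:
            selected = trial  # removing col did not hurt consistency
    return selected

# Synthetic, already-discrete data: the class depends on a and b, not on noise
rng = np.random.default_rng(0)
X_binned = pd.DataFrame({
    "a": rng.integers(0, 3, 400),
    "b": rng.integers(0, 3, 400),
    "noise": rng.integers(0, 3, 400),
})
y = ((X_binned["a"] + X_binned["b"]) >= 3).astype(int)

print(con_backward(X_binned, y))
```

On real fraud data a small positive `tol` is usually more practical than `0.0`, because label noise means no subset is perfectly consistent.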

DISR — Double Input Symmetric Relevance

Paper (§2.2c): a set of variables together can carry more information about the class than the sum of each variable alone.

Intuition

Feature alone	Joint with partner
`country = US` → weak fraud signal	`country = US` and `email_domain missing` → strong signal
`Amount` moderate	`Amount` high and `new_device = 1` → strong signal

DISR targets synergy — classic bivariate interaction that univariate ANOVA misses.
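For small, discretized data the DISR score can be sketched by hand. The block below assumes the usual definition of symmetric relevance, the mutual information between a feature pair and the class divided by their joint entropy, and a greedy forward selection that adds the feature with the highest summed score against the already selected ones. The helper names (`entropy`, `symmetric_relevance`, `disr_select`) and the XOR-style toy data are illustrative assumptions, not the paper's code.

```python
import numpy as np
import pandas as pd

def entropy(*cols) -> float:
    """Shannon entropy in bits of the joint distribution of discrete columns."""
    joint = pd.DataFrame({i: np.asarray(c) for i, c in enumerate(cols)})
    p = joint.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def symmetric_relevance(xi, xj, y) -> float:
    """SR = I(Xi, Xj; Y) / H(Xi, Xj, Y) for discrete inputs."""
    h_all = entropy(xi, xj, y)
    mutual_info = entropy(xi, xj) + entropy(y) - h_all
    return mutual_info / h_all if h_all > 0 else 0.0

def disr_select(X_disc: pd.DataFrame, y, k: int) -> list:
    """Greedy forward selection driven by pairwise symmetric relevance."""
    remaining = list(X_disc.columns)
    # Seed with the single most informative feature: I(X; Y) = H(X) + H(Y) - H(X, Y)
    first = max(
        remaining,
        key=lambda c: entropy(X_disc[c]) + entropy(y) - entropy(X_disc[c], y),
    )
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: sum(
                symmetric_relevance(X_disc[c], X_disc[s], y) for s in selected
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# XOR-style toy data: neither a nor b predicts y alone, together they determine it
X_disc = pd.DataFrame({
    "a":     [0, 0, 1, 1, 0, 0, 1, 1],
    "b":     [0, 1, 0, 1, 0, 1, 0, 1],
    "noise": [0, 0, 0, 0, 1, 1, 1, 1],
})
y = X_disc["a"] ^ X_disc["b"]

print(symmetric_relevance(X_disc["a"], X_disc["b"], y))      # 0.5
print(symmetric_relevance(X_disc["a"], X_disc["noise"], y))  # 0.0
print(disr_select(X_disc, y, k=2))
```

Every single feature has zero mutual information with `y` here, which is exactly the case a univariate ranking cannot see. Continuous columns would need binning first, and on wide tables the pairwise loop becomes expensive.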

Practical proxy: interaction + tree importance

Full DISR needs specialized libraries. Two assignment-friendly proxies:

1. Add explicit interaction terms for top pairs

top = ["V14", "V12", "V10"]
X_int = X_train[top].copy()
for i in range(len(top)):
    for j in range(i + 1, len(top)):
        name = f"{top[i]}_x_{top[j]}"
        X_int[name] = X_train[top[i]] * X_train[top[j]]

2. Let XGBoost reveal joint splits (embedded FS)

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    random_state=42,
    eval_metric="logloss",
)
model.fit(X_train, y_train)

imp = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
top_disr_proxy = imp.head(15).index.tolist()
print(top_disr_proxy)

Report this as: “DISR concept — joint relevance approximated via interaction terms / XGB split gains.”

Bivariate analysis on your fraud dataset

Before multivariate FS, run pair-level EDA on the subset you kept in Part 1.

Correlation with target (bivariate numeric)

import seaborn as sns
import matplotlib.pyplot as plt

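features = ["V14", "V12", "V10", "V17", "V4"]
plot_df = X_train[features].assign(isFraud=y_train.values)

# Pair plots on the full training set are slow: keep every fraud row
# and a random sample of the normal rows.
plot_df = pd.concat([
    plot_df[plot_df["isFraud"] == 1],
    plot_df[plot_df["isFraud"] == 0].sample(5000, random_state=42),
])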

sns.pairplot(plot_df, hue="isFraud", corner=True, plot_kws={"alpha": 0.4, "s": 8})
plt.suptitle("Bivariate EDA: pairwise views", y=1.02)
plt.show()

Cross-tab (bivariate categorical — IEEE-CIS style)

After encoding:

# Example pattern for IEEE-CIS
# pd.crosstab(df["ProductCD"], df["isFraud"], normalize="index")

Read: if fraud rate changes sharply across categories, Chi2/univariate FS will rank that column high; CON/DISR ask whether combinations add separation.
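A small runnable version of that cross-tab pattern, extended to a pair of columns, is sketched below. The eight rows are invented for illustration, and `email_missing` is a hypothetical flag column rather than a real IEEE-CIS field.

```python
import pandas as pd

# Hypothetical mini-sample in IEEE-CIS style (values invented for illustration)
df_demo = pd.DataFrame({
    "ProductCD":     ["W", "W", "W", "W", "C", "C", "C", "C"],
    "email_missing": [0, 0, 1, 1, 0, 0, 1, 1],
    "isFraud":       [0, 0, 0, 1, 0, 0, 1, 1],
})

# Single column: fraud rate per ProductCD
single = pd.crosstab(df_demo["ProductCD"], df_demo["isFraud"], normalize="index")

# Pair of columns: fraud rate per (ProductCD, email_missing) combination
pair = pd.crosstab(
    [df_demo["ProductCD"], df_demo["email_missing"]],
    df_demo["isFraud"],
    normalize="index",
)

print(single[1])  # fraud share by product
print(pair[1])    # fraud share by product and email flag
```

In this toy sample `ProductCD = C` has a 50% fraud rate on its own, but the pair view shows all of it sits in the `email_missing = 1` rows. That is the kind of extra separation the combination-based methods look for.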

Full reproduction pipeline (paper → fraud domain)

Mirrors paper §5 — Steps 1–4 simplified without Bayesian optimization.

┌────────────────────────────────────────────────────────────┐
│ 1. Clean: missing, duplicates, drop ID columns             │
│ 2. Encode categoricals (IEEE-CIS)                          │
│ 3. Univariate FS: Chi2 / ANOVA / MI @ 20%, 40%, 60%        │
│ 4. Multivariate FS: CFS-like, mRMR, CON proxy, DISR proxy  │
│ 5. Models: RF, XGB, SVM (same CV folds)                    │
│ 6. Metrics: Precision, Recall, F1, ROC-AUC, PR-AUC         │
│ 7. Compare: full features vs each subset                   │
└────────────────────────────────────────────────────────────┘
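Steps 1 and 2 are not part of the script skeleton, so here is one possible sketch of them. The five-row frame `raw` is invented, the column names follow IEEE-CIS naming, and treating missing categories as their own `"missing"` level is an assumption rather than the paper's procedure.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Invented rows in IEEE-CIS style; rows 2 and 3 are exact duplicates
raw = pd.DataFrame({
    "TransactionID": [1, 2, 2, 3, 4],
    "TransactionAmt": [50.0, 20.0, 20.0, np.nan, 75.0],
    "ProductCD": ["W", "C", "C", "W", np.nan],
    "isFraud": [0, 1, 1, 0, 0],
})

# Step 1: drop exact duplicates, then the ID column
clean = raw.drop_duplicates().drop(columns=["TransactionID"])
y_clean = clean["isFraud"]
X_clean = clean.drop(columns=["isFraud"])

num_cols = X_clean.select_dtypes(include="number").columns.tolist()
cat_cols = X_clean.select_dtypes(exclude="number").columns.tolist()

# Missing categories become an explicit level (assumption, see lead-in)
X_clean[cat_cols] = X_clean[cat_cols].fillna("missing")

# Step 2: median-impute numerics, one-hot encode categoricals
prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
X_enc = prep.fit_transform(X_clean)
print(X_enc.shape)  # 4 rows; 1 numeric column + 3 one-hot columns
```

In a real run the `ColumnTransformer` belongs inside the modelling `Pipeline`, so that the medians and category lists are learned on training folds only.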

End-to-end script skeleton

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

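# --- Multivariate: CFS-like (from Part 1) ---
# Caution: selecting columns on the full data before CV lets the test folds
# influence the choice, so treat the scores below as slightly optimistic.
rel = X.corrwith(y).abs().sort_values(ascending=False)    # relevance to the target
pool = rel.head(int(0.6 * X.shape[1])).index              # keep the top 60% by relevance
cm = X[pool].corr().abs()                                 # feature-feature redundancy
upper = cm.where(np.triu(np.ones(cm.shape), k=1).astype(bool))
drop = {c for c in upper.columns if any(upper[c] > 0.9)}  # one of each highly correlated pair
cfs_cols = [c for c in pool if c not in drop]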

# --- Univariate 40% ---
k = max(1, int(0.4 * X.shape[1]))
anova = SelectKBest(f_classif, k=k).fit(X, y)
anova_cols = X.columns[anova.get_support()].tolist()

subsets = {
    "all_100": X.columns.tolist(),
    "anova_40": anova_cols,
    "cfs": cfs_cols,
}

models = {
    "RF": RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42, n_jobs=-1),
    "XGB": xgb.XGBClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.1,
        scale_pos_weight=(y == 0).sum() / (y == 1).sum(),
        random_state=42, eval_metric="logloss",
    ),
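    # Note: an RBF SVC scales poorly with the number of rows; on the full table
    # this model can take very long, so consider subsampling for the SVM runs.
    "SVM": Pipeline([
        ("scale", StandardScaler()),
        ("svc", SVC(kernel="rbf", class_weight="balanced", probability=True, random_state=42)),
    ]),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# "average_precision" is scikit-learn's PR-AUC scorer (step 6 of the pipeline)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc", "average_precision"]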

rows = []
for sub_name, cols in subsets.items():
    X_sub = X[cols]
    for model_name, model in models.items():
        scores = cross_validate(model, X_sub, y, cv=cv, scoring=scoring, n_jobs=-1)
        rows.append({
            "subset": sub_name,
            "model": model_name,
            "n_features": len(cols),
            **{m: scores[f"test_{m}"].mean() for m in scoring},
        })

results = pd.DataFrame(rows).sort_values("roc_auc", ascending=False)
print(results.round(4))
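The skeleton above selects columns on the full table before cross-validation. One way to avoid that leakage is to put the selector inside the `Pipeline`, so it is refit on each training fold. The sketch below uses a synthetic imbalanced dataset from `make_classification` as a stand-in for the fraud table; the sizes and the 5% positive rate are assumptions chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic imbalanced stand-in for the fraud table (assumption: ~5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=30, n_informative=6,
    weights=[0.95, 0.05], random_state=42,
)

k_demo = max(1, int(0.4 * X_demo.shape[1]))  # keep 40% of the columns
leak_free = Pipeline([
    ("select", SelectKBest(f_classif, k=k_demo)),  # refit on each training fold
    ("rf", RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_validate(
    leak_free, X_demo, y_demo, cv=cv_demo,
    scoring=["roc_auc", "average_precision"],
)
print({name: vals.mean() for name, vals in demo_scores.items()
       if name.startswith("test_")})
```

The same pattern works for any selector with a `fit`/`transform` interface; a custom CFS-like step would need to be wrapped as a scikit-learn transformer first.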

How to read your table (paper-style)

Pattern	Interpretation
`SVM` + `anova_40` » `SVM` + `all_100`	FS helps SVM (noise reduction)
`XGB` + `all_100` ≈ `XGB` + `cfs`	XGB already handles weak features
`RF` + `cfs` ≥ `RF` + `all_100` on IEEE-CIS-like data	Redundancy removed, speed up
High Accuracy but low Recall on fraud	Wrong metric focus — use PR-AUC
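The last row of the table is easy to reproduce. The sketch below uses invented labels (1,000 transactions, 10 of them fraud) and a "model" that always predicts normal, to show how accuracy hides what recall and PR-AUC expose.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Invented labels: 1000 transactions, 10 of them fraud (1%)
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts normal and gives every row the same score
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
pr_auc = average_precision_score(y_true, y_score)

print("accuracy:", acc)   # 0.99
print("recall:", rec)     # 0.0
print("PR-AUC:", pr_auc)  # 0.01, the fraud prevalence
```

A PR-AUC equal to the fraud prevalence is the no-skill baseline, which is why it is a more honest yardstick than accuracy on imbalanced data.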

Scott–Knott and Borda (simplified for your assignment)

You do not need to implement the Scott–Knott (SK) test on day one. Use this workflow:

1. Run 5-fold CV for each (subset, model) pair.
2. Group rows within 0.005 ROC-AUC (0.5 percentage points) of the best score as the “same cluster”.
3. Inside that best cluster, rank by the sum of ranks across Precision, Recall, F1, and AUC (Borda idea); the lowest total wins.

# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
best = results[results["roc_auc"] >= results["roc_auc"].max() - 0.005].copy()
best["borda"] = (
    best[["precision", "recall", "f1", "roc_auc"]]
    .rank(ascending=False)
    .sum(axis=1)
)
print(best.sort_values("borda"))

PCA side experiment (paper contrast)

Same folds, compare selection vs extraction:

from sklearn.decomposition import PCA

n = len(cfs_cols)  # match the CFS subset size from the script above
pca_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=n)),
    ("rf", RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)),
])

pca_scores = cross_validate(pca_pipe, X, y, cv=cv, scoring=scoring, n_jobs=-1)
print({m: pca_scores[f"test_{m}"].mean() for m in scoring})

Write-up sentence:

PCA reduced dimensionality by constructing new components, while CFS-like selection kept a subset of the original columns; on fraud data, the selected features can still be inspected by name, whereas PCA absorbs redundancy into latent factors that are harder to interpret.

IEEE-CIS assignment checklist

Step	Action
Download	IEEE-CIS Fraud Detection
Merge	`train_transaction` + `train_identity` on `TransactionID`
Target	`isFraud`
Missing	Report MCAR/MAR/NMAR per column group (D, V, C blocks)
Univariate	ANOVA on numeric; Chi2 on one-hot categoricals
Multivariate	CFS on V-block; interaction proxy for DISR
Models	RF, XGB, SVM — same as paper
Metrics	Recall, PR-AUC primary; Accuracy secondary
Compare	228 variants is optional; start with 12–20 key combos
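The Merge and Missing rows of the checklist can be sketched as follows. The two small frames stand in for `train_transaction.csv` and `train_identity.csv` with invented values. The left join reflects the fact that not every transaction has an identity record. The last line is only a first hint towards the MCAR/MAR/NMAR discussion, not a formal test.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the two competition files (values invented)
train_transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "isFraud": [0, 1, 0, 0],
    "D1": [0.0, np.nan, 3.0, np.nan],
    "V1": [1.0, 1.0, np.nan, 1.0],
    "C1": [1, 2, 1, 1],
})
train_identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# Left join: keep every transaction; identity columns are NaN when absent
merged = train_transaction.merge(train_identity, on="TransactionID", how="left")

# Share of missing values per column, averaged per column block (D, V, C)
missing = merged.isna().mean()
blocks = {
    prefix: missing[[c for c in merged.columns
                     if c.startswith(prefix) and c[1:].isdigit()]].mean()
    for prefix in ["D", "V", "C"]
}
print(blocks)

# Hint only: does the fraud rate differ when D1 is missing?
print(merged.groupby(merged["D1"].isna())["isFraud"].mean())
```

The `c[1:].isdigit()` check keeps columns such as `DeviceType` out of the D block.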

What the paper teaches (final takeaway)

Feature selection is not always an improvement — especially for XGB.
Multivariate filters (CON, DISR) can match full feature performance with fewer columns on high-dimensional IDS data.
Univariate methods rank fast but keep redundancy.
Benchmark discipline (same CV, multiple metrics, ranking) matters more than one lucky train/test split.
For fraud, connect IDS logic: attacks ≈ fraud, benign ≈ normal transactions.

Series index

Paper overview + MCAR/MAR/NMAR
Univariate FS
Multivariate Part 1 — CFS / mRMR
This post — CON, DISR, full pipeline

References

Zouhri et al. (2024) — §2.2, §5 Experimental design, §6 Results
Dash, M., & Liu, H. (1997). Consistency-based search in feature selection (CON)
IEEE-CIS Fraud Detection competition data
Credit Card Fraud Detection (Kaggle) — practice dataset