Stability and Performance on TieNet
Mehmet Ugurbil
University of Minnesota
02-Jul-2018
Aim
1. Show that the stability of a feature selector is not correlated with the performance of the selected features.
- - In other words, different sets of features can have equal signal strength.
2. Show that instability can exist even when no information equivalency is present in the dataset.
3. Show that these claims hold for datasets with multiplicity, without multiplicity (removed either according to the true graph or using Tie* results), and with a large number of weak variables, with and without added error.
Null Hypothesis
Unstable feature selection leads to poor performance.
Experiment Design
1. Feature selection and SVM classification on cross-validation sets of the small sample.
- - Assessment of performance on 50-repeat, 10-fold cross-validation.
- - Calculation of stability metrics.
2. Feature selection and SVM classification on the entire small sample.
- - Model validation performance assessed on a held-out testing set.
3. SVM classification using the features selected in (2) on a large-sample training set.
- - Feature validation performance assessed on a held-out testing set.
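Step 1 of the design can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the experiments: the report does not say which stability metrics were computed, so the mean pairwise Jaccard index is used here as one common choice, and the splitter is a plain (unstratified) repeated k-fold. The names `repeated_kfold`, `jaccard`, and `mean_pairwise_jaccard` are hypothetical helpers introduced for this sketch.

```python
import random
from itertools import combinations


def repeated_kfold(n_samples, n_folds=10, n_repeats=50, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Assumption: unstratified splits; the report does not state whether
    the folds were stratified by class.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # fresh permutation for every repeat
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]  # every n_folds-th sample
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, train_idx, test_idx


def jaccard(a, b):
    """Jaccard index of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard index over all pairs of selected feature sets.

    Assumption: this stands in for the unspecified stability metrics.
    """
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

In use, a feature selector would be run on each `train_idx` produced by `repeated_kfold`, the resulting feature sets collected in a list, and that list passed to `mean_pairwise_jaccard`. A value of 1 means every fold selected the same features; values near 0 mean the selections barely overlap.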
Observations
1. Monotonic performance increase is observed as sample size increases.
2. Performance of the models approaches the theoretical limit in the large-sample limit.
3. Unstable features selected in the small-sample region generalize well.
4. Stability in RFE increases initially, reaches a minimum at medium sample sizes, then continues increasing.
Dataset Descriptions
TIE-Net = Original simulated data - TIE near-faithful causal network.
TIE-Net-Reduced1 = TIE-Net with multiplicity removed according to the original graph.
TIE-Net-Reduced2 = TIE-Net with multiplicity removed using Tie* Algorithm.
- - Note that this is not one dataset, but one for each repeat per sample size (550 total).
- - This also implies that feature stability is not meaningful for this dataset; it is included for completeness.
TIE-Net-Weak1 = TIE-Net with weak variables multiplied 50 times.
TIE-Net-Weak2 = TIE-Net-Weak1 with added Gaussian noise, uniformly random deviation.
Performance