Reducing the number of initial features to select
Can we use the predictive performance of individual features to eliminate useless features?
- Load data collected from Vespa
- Ranking features available
- Simplify target label
- Model
- Subset selection routine
- Analyze results
- Conclusion
The dataset used here was created by collecting ranking features from Vespa and joining them with the labelled data released in round 3 of the TREC-COVID competition.
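For context, here is a minimal sketch of how such a dataset could be assembled; the file names are hypothetical, and the notebook below simply loads the prepared CSV. The column names (`topic_id`, `cord_uid`, `relevancy`) are the ones used throughout this post.
import pandas as pd

# Hypothetical inputs: TREC-COVID round 3 judgments and one row of Vespa rank
# features per (topic_id, cord_uid) pair.
qrels = pd.read_csv("data/qrels_round3.csv")                 # topic_id, iteration, cord_uid, relevancy
rank_features = pd.read_csv("data/vespa_rank_features.csv")  # topic_id, cord_uid, feature columns

labelled_features = qrels.merge(rank_features, on=["topic_id", "cord_uid"])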
import pandas as pd

vespa_cord19 = pd.read_csv("data/2020-05-27-subset-selection/training_features.csv")
vespa_cord19.head(2)
There are 163 ranking features available.
features = [
    x for x in list(vespa_cord19.columns) if x not in [
        'topic_id', 'iteration', 'cord_uid', 'relevancy', 'binary_relevance', 'query',
        'query-rewrite', 'query-vector', 'question', 'narrative'
    ]
]
print(len(features))
features
The original labelled data has three types of label: 0, 1 and 2. To simplify, we will consider only two labels here: the document is either relevant (label = 1) or irrelevant (label = 0).
vespa_cord19["binary_relevance"] = vespa_cord19.apply(lambda row: 1 if row["relevancy"] > 0 else 0, axis=1)
vespa_cord19[['relevancy', 'binary_relevance']].head()
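As a quick sanity check (not part of the original notebook), we can look at the resulting class balance:
# Distribution of the binary relevance label
vespa_cord19["binary_relevance"].value_counts()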
We are going to fit logistic regressions with the objective of maximizing the log probability of the observed outcome.
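The score we use to compare feature subsets is the mean realized log-probability of the observed labels under the fitted model:

$$\frac{1}{n}\sum_{i=1}^{n}\log \hat{P}\left(Y_i = y_i \mid x_i\right)$$

where $\hat{P}$ is the probability estimated by the fitted logistic regression and $y_i \in \{0, 1\}$ is the binary relevance label; higher (closer to zero) is better.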
from sklearn.linear_model import LogisticRegression
from statistics import mean


def compute_mean_realize_log_prob(model, X, Y):
    # Mean of the log-probabilities the model assigns to the observed labels
    return mean([x[int(y)] for x, y in zip(model.predict_log_proba(X), Y)])


def fit_logistic_reg(X, Y):
    # Fit an unpenalized logistic regression and return the mean realized
    # log-probability on the training data
    model = LogisticRegression(penalty='none', fit_intercept=True)
    model.fit(X, Y)
    realized_log_prob = compute_mean_realize_log_prob(model, X, Y)
    return realized_log_prob
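A small usage example (not part of the original notebook), scoring a single feature in isolation; here we simply take the first entry of `features`, but any column name works:
# Score one feature by the mean realized log-probability of the observed labels
single_feature = features[0]
score = fit_logistic_reg(vespa_cord19[[single_feature]], vespa_cord19["binary_relevance"])
print(single_feature, score)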
Below we run the best subset selection routine restricted to subsets of size one, i.e., each feature is evaluated in isolation.
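Evaluating each feature on its own is cheap; the number of candidate subsets explodes as the subset size grows, which is presumably why `max_number_features` is capped at 1 below. A quick side calculation with Python's `math.comb` (not part of the original pipeline):
import math

# Number of candidate subsets of each size, out of 163 features
for k in (1, 2, 3):
    print(k, math.comb(163, k))
# 1 163
# 2 13203
# 3 708561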
import itertools
import pandas as pd
from tqdm.notebook import trange  # progress bar over the subset sizes
log_probs, feature_list = [], []
numb_features = []
max_number_features = min(1, len(features))
data = vespa_cord19
Y = data.binary_relevance
X = data[features]
for k in trange(1, max_number_features + 1):
    for combo in itertools.combinations(X.columns, k):
        tmp_result = fit_logistic_reg(X[list(combo)], Y)
        log_probs.append(tmp_result)
        feature_list.append(combo)
        numb_features.append(len(combo))
# Store the results in a DataFrame
df = pd.DataFrame(
    {
        'numb_features': numb_features,
        'log_probs': log_probs,
        'features': feature_list
    }
)
df
df['max_log_probs'] = df.groupby('numb_features')['log_probs'].transform(max)
df
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.scatter(df.numb_features,df.log_probs, alpha = .2, color = 'darkblue')
plt.xlabel('# Features')
plt.ylabel('log_probs')
plt.title('Best subset selection')
plt.plot(df.numb_features,df.max_log_probs, color = 'r', label = 'Best subset')
plt.show()
df_max = df.sort_values('log_probs', ascending=False)
for f in df_max.features:
    print(f)
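The natural next step for this kind of screening would be to keep only the top-ranked features for further experiments, e.g. (a sketch; the cutoff of 20 is arbitrary):
# Keep the 20 individually best-scoring features; each entry of `features` is a
# 1-element tuple because we only evaluated subsets of size one
top_features = [combo[0] for combo in df_max["features"].head(20)]
print(top_features)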
Using the predictive performance of individual features does not seem to be a good approach for eliminating features ahead of a more exhaustive search, such as best subset selection or greedy (stepwise) algorithms. The reason is that many features that perform poorly in isolation shine when combined with other, complementary features.
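To see why, here is a small synthetic illustration (not drawn from the CORD-19 data): two features that look useless on their own but are highly predictive together, because the signal only appears in their difference.
import numpy as np
from sklearn.linear_model import LogisticRegression


def mean_realized_log_prob(X, Y):
    # Same metric as compute_mean_realize_log_prob above, using the default
    # (L2-penalized) solver so it runs on any scikit-learn version
    model = LogisticRegression(fit_intercept=True, max_iter=5000)
    model.fit(X, Y)
    return np.mean([row[int(y)] for row, y in zip(model.predict_log_proba(X), Y)])


rng = np.random.RandomState(0)
n = 5000
y = rng.binomial(1, 0.5, size=n)
shared_noise = rng.normal(scale=10.0, size=n)

x1 = shared_noise + y + rng.normal(scale=0.5, size=n)  # label signal buried in noise
x2 = shared_noise                                      # pure noise, no signal at all

print(mean_realized_log_prob(x1.reshape(-1, 1), y))          # close to log(0.5) = -0.69: weak
print(mean_realized_log_prob(x2.reshape(-1, 1), y))          # close to log(0.5) = -0.69: useless
print(mean_realized_log_prob(np.column_stack([x1, x2]), y))  # much higher: x1 - x2 recovers y
A univariate screen would discard both of these features, yet together they form one of the most informative pairs.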