Best subset selection of Vespa ranking features
The dataset used here was created by collecting ranking features from Vespa for the labelled data released in round 3 of the TREC-COVID competition.
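If you want to reproduce the steps below, the collected features can first be loaded into a pandas DataFrame. The file name used here is only a placeholder; point it at wherever you stored the feature dump.

import pandas as pd

# Placeholder path: replace with the location of your own Vespa feature dump
vespa_cord19 = pd.read_csv("vespa_cord19_features.csv")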
vespa_cord19.head(2)
There are 163 ranking features available. Below we print the first three as a sample.
features = [
    x for x in list(vespa_cord19.columns) if x not in [
        'topic_id', 'iteration', 'cord_uid', 'relevancy', 'binary_relevance', 'query',
        'query-rewrite', 'query-vector', 'question', 'narrative'
    ]
]
print(len(features))
print(features[:3])
The original labelled data has three label values: 0, 1 and 2. To simplify, we will consider only two labels here: the document is either relevant (label = 1) or irrelevant (label = 0).
vespa_cord19["binary_relevance"] = vespa_cord19.apply(lambda row: 1 if row["relevancy"] > 0 else 0, axis=1)
vespa_cord19[['relevancy', 'binary_relevance']].head()
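As an optional check that is not part of the original notebook, we can look at how balanced the binarized label is. The counts depend on the actual data.

vespa_cord19.binary_relevance.value_counts()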
We are going to fit logistic regressions with the objective of maximizing the log-probability of the observed outcomes, and use the mean realized log-probability as our comparison metric.
from sklearn.linear_model import LogisticRegression
from statistics import mean

def compute_mean_realize_log_prob(model, X, Y):
    # Average log-probability the fitted model assigns to the observed labels
    return mean([x[int(y)] for x, y in zip(model.predict_log_proba(X), Y)])

def fit_logistic_reg(X, Y):
    # Unregularized logistic regression with an intercept
    model = LogisticRegression(penalty='none', fit_intercept=True)
    model.fit(X, Y)
    realized_log_prob = compute_mean_realize_log_prob(model, X, Y)
    return realized_log_prob
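As a quick sanity check, which is not part of the original analysis, we can fit a single regression on the first two features of the full dataset and inspect the mean realized log-probability. The value depends on the data; we only expect a finite non-positive number, with values closer to zero indicating a better fit.

# Fit on the first two features of the full dataset and print the metric
example_log_prob = fit_logistic_reg(vespa_cord19[features[:2]], vespa_cord19.binary_relevance)
print(example_log_prob)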
Given the high number of features, we can only apply the best subset selection algorithm up to three features. Instead of fitting each feature combination once on the full dataset, we will run 10 replications on smaller sampled datasets of 1,000 data points each.
The code below was adapted from this blog post written by Xavier Bourret Sicotte.
import itertools
import pandas as pd
from tqdm.notebook import trange  # progress bar for the feature-size loop

# Initialization variables
log_probs, feature_list = [], []
numb_features = []
data_sample = []

number_sample = 10
number_points_per_sample = 1000
max_number_features = min(3, len(features))
for i in range(number_sample):
    # Use a different seed for each replication so the samples actually differ
    data = vespa_cord19.sample(n=number_points_per_sample, random_state=456 + i)
    Y = data.binary_relevance
    X = data[features]
    # Looping over k = 1 to k = max_number_features features in X
    for k in trange(1, max_number_features + 1, desc='Loop...'):
        print(k)
        # Looping over all combinations of k features out of the 163 available
        for combo in itertools.combinations(X.columns, k):
            tmp_result = fit_logistic_reg(X[list(combo)], Y)  # Fit and score this combination
            log_probs.append(tmp_result)                      # Append results
            feature_list.append(combo)
            numb_features.append(len(combo))
            data_sample.append(i)
# Store the results in a DataFrame
df = pd.DataFrame(
    {
        'data_sample': data_sample,
        'numb_features': numb_features,
        'log_probs': log_probs,
        'features': feature_list
    }
)
Even with the limitation of checking at most 3 features, we end up running more than 7 million logistic regressions to obtain the results presented here.
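As a back-of-the-envelope check of that number, which is not in the original notebook, the count follows from the number of 1-, 2- and 3-feature combinations of the 163 features, times the 10 replications.

from math import comb

# Combinations evaluated per replication for k = 1, 2, 3 out of 163 features
combos_per_replication = sum(comb(163, k) for k in range(1, 4))
print(combos_per_replication)       # 721927
print(combos_per_replication * 10)  # 7219270 logistic regressions in total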
df
We can now average the results across the 10 replications we ran for each combination.
average_df = df.groupby(['numb_features', 'features'], as_index=False).mean()
average_df
average_df['max_log_probs'] = average_df.groupby('numb_features')['log_probs'].transform('max')
average_df
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.scatter(average_df.numb_features,average_df.log_probs, alpha = .2, color = 'darkblue')
plt.xlabel('# Features')
plt.ylabel('log_probs')
plt.title('Best subset selection')
plt.plot(average_df.numb_features,average_df.max_log_probs, color = 'r', label = 'Best subset')
plt.show()
Finally, for each subset size we print the feature combination with the highest average log-probability.
average_df_max = (
    average_df.sort_values('log_probs', ascending=False)
    .drop_duplicates(['numb_features'])
    .sort_values('numb_features')
)
for f in average_df_max.features:
    print(f)