Context

I fine-tuned a BERT model with a simple training routine and then deployed a simplified cord19 application on my laptop to validate it. Here we investigate some strange results found in a previous sprint.

Evaluate ranking functions

Connect to my local Vespa app.

from vespa.application import Vespa

app = Vespa(url = "http://localhost", port = 8080)

Define the query models. bert_index_1 uses the correct output of the fine-tuned BERT model, which is the probability that a document is relevant. bert was a mistake where I used the wrong model output to rank the documents; it is kept here because it will be part of the investigation later.

from vespa.query import Query, RankProfile as Ranking, OR

query_models = {
    "or_bm25": Query(
        match_phase = OR(),
        rank_profile = Ranking(name="bm25")
    ),
    "or_bm25_bert": Query(
        match_phase = OR(),
        rank_profile = Ranking(name="bert")
    ),
    "or_bm25_bert_index_1": Query(
        match_phase = OR(),
        rank_profile = Ranking(name="bert_index_1")
    )
}

The evaluation metrics that we want to compute.

from vespa.evaluation import MatchRatio, Recall, ReciprocalRank, NormalizedDiscountedCumulativeGain

eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10), NormalizedDiscountedCumulativeGain(at=10)]
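
For reference, recall at 10 is the fraction of a query's relevant documents that show up among the top 10 hits. A minimal sketch of the computation (my own illustration, not pyvespa's implementation):

# Illustration only: recall@10 as the fraction of relevant ids
# found among the first 10 retrieved ids.
def recall_at_10(retrieved_ids, relevant_ids):
    return len(set(retrieved_ids[:10]) & set(relevant_ids)) / len(relevant_ids)

With many judged relevant documents per query, recall@10 is necessarily small, which explains the small magnitudes (around 0.007) in the tables below.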

Load the labelled data. You can download it here.

import json

with open("cord19/labelled_data.json") as f:
    labelled_data = json.load(f)
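
Each entry carries a query id, the query string, and the relevant documents for that query; these are exactly the fields the evaluation loop below accesses. A quick, non-destructive way to confirm the shape:

# The evaluation loop below relies on the keys "query_id", "query"
# and "relevant_docs" being present in every entry.
print(labelled_data[0].keys())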

We need the tokenizer to convert the query string into the token ids that the BERT rank profiles expect.

from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
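
The BERT rank profiles expect the query token ids to be passed along as the query(query_token_ids) ranking feature. The loops below inline the tokenizer call, but a small helper (my own wrapper, not part of pyvespa) shows the intent:

# Hypothetical helper mirroring the tokenizer call inlined in the
# loops below: pad/truncate to 64 tokens, skip special tokens, and
# render the input ids as the string Vespa expects.
def query_token_ids_feature(query_string):
    token_ids = tokenizer(
        str(query_string),
        truncation=True,
        padding="max_length",
        max_length=64,
        add_special_tokens=False
    )["input_ids"]
    return str(token_ids)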

Compute evaluation metrics for each query model and each query.

from pandas import DataFrame

evaluations = {}
for query_model in query_models:
    evaluation = []
    for query_data in labelled_data:
        print(query_data["query_id"])
        # The BERT rank profiles read the query token ids from the
        # query(query_token_ids) ranking feature, so we send them along
        # with every request, padded/truncated to 64 tokens.
        evaluation_query = app.evaluate_query(
            eval_metrics=eval_metrics,
            query_model=query_models[query_model],
            query_id=query_data["query_id"],
            query=query_data["query"],
            id_field="cord_uid",
            relevant_docs=query_data["relevant_docs"],
            hits=10,
            timeout="100s",
            **{"ranking.features.query(query_token_ids)": str(tokenizer(
                str(query_data["query"]),
                truncation=True,
                padding="max_length",
                max_length=64,
                add_special_tokens=False
            )["input_ids"])}
        )
        evaluation.append(evaluation_query)
    evaluations[query_model] = DataFrame.from_records(evaluation)
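
Each evaluate_query call returns one record per query, with one column per metric named after the metric plus a _value suffix (e.g. recall_10_value); the reshaping step below relies on this naming. A quick way to check:

# Expect columns such as recall_10_value and ndcg_10_value.
print(evaluations["or_bm25"].columns.tolist())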

Organize the data into a nicer format to work with.

import pandas as pd

# Long format: one row per (query model, metric, query) combination.
metric_values = []
for query_model in query_models:
    for metric in eval_metrics:
        metric_values.append(
            pd.DataFrame(
                data={
                    "query_model": query_model, 
                    "metric": metric.name, 
                    "value": evaluations[query_model][metric.name + "_value"].to_list()
                }
            )
        )
metric_values = pd.concat(metric_values, ignore_index=True)

Recall issue

The recall issue: the three query models produced different recall@10 values, even though they all share the same match phase and first-phase ranking; the BERT profiles differ only in a second phase that should merely reorder the top 10 positions:

second-phase {
    rerank-count: 10
    expression: sum(eval)
}

metric_values[metric_values.metric == "recall_10"].groupby(['query_model', 'metric']).median()
                                   value
query_model          metric
or_bm25              recall_10  0.007412
or_bm25_bert         recall_10  0.008076
or_bm25_bert_index_1 recall_10  0.008118

It seems that Vespa reorders the top 11 documents, even though rerank-count is set to 10, as I show below.

Identify which queries are responsible for the difference:

from pandas import merge

recall_measures = merge(
    left=evaluations["or_bm25"], 
    right=evaluations["or_bm25_bert_index_1"],
    on="query_id"
)[["query_id", "recall_10_value_x", "recall_10_value_y"]]

recall_measures[recall_measures.recall_10_value_x != recall_measures.recall_10_value_y]
    query_id  recall_10_value_x  recall_10_value_y
14        15           0.006726           0.004484
16        17           0.006974           0.008368
20        21           0.006088           0.007610
32        33           0.006515           0.009772
38        39           0.007165           0.008188
40        41           0.014045           0.016854
49        50           0.020134           0.013423

Run the same query against the two rank profiles.

query_data = labelled_data[14]

result_bm25 = app.query(
    query=query_data["query"],
    query_model=query_models["or_bm25"],
    hits=10
)
bm25_ids = [hit["fields"]["cord_uid"] for hit in result_bm25.hits]

result_bm25_bert = app.query(
    query=query_data["query"],
    query_model=query_models["or_bm25_bert_index_1"],
    hits=10,
    timeout="100s",
    **{"ranking.features.query(query_token_ids)": str(tokenizer(
        str(query_data["query"]),
        truncation=True,
        padding="max_length",
        max_length=64,
        add_special_tokens=False
    )["input_ids"])}
)
bm25_bert_ids = [hit["fields"]["cord_uid"] for hit in result_bm25_bert.hits]

Check which id is in the BERT top 10 but not in the BM25 top 10:

id_in_bert_not_in_bm25 = [x for x in bm25_bert_ids if x not in bm25_ids]
id_in_bert_not_in_bm25
['ecu579el']

List the top 11 results from BM25. Notice that the missing doc is in the 11th position.

result_bm25_11 = [
    hit["fields"]["cord_uid"]
    for hit in app.query(query=query_data["query"], query_model=query_models["or_bm25"], hits=11).hits
]
result_bm25_11
['zpek8i5e',
 '75u57fw1',
 'up5jpq45',
 'qmrntk43',
 'cxfzs68n',
 'y2nhss9u',
 '94puwlbm',
 'zpmdrh4q',
 'fpexj3s5',
 'axljtddn',
 'ecu579el']
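
A quick check confirms that the document the BERT model pulled into its top 10 is exactly the 11th BM25 hit:

# The doc present in the BERT top 10 but missing from the BM25 top 10
# sits at position 11 of the BM25 ranking, so the second phase must
# have re-ranked 11 documents.
assert result_bm25_11[10] == id_in_bert_not_in_bm25[0]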

Positive and null NDCG issue

When querying @bergum's dev instance, I found cases where NDCG@10 was positive for the BM25 model and zero for the BERT re-rank model, as shown in the picture below, which makes no sense.

[Figure: per-query NDCG@10 from @bergum's dev instance, where the BM25 model scores positive and the BERT re-rank model scores zero on some queries]
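
To see why this makes no sense: the second phase only reorders the ten documents that BM25 already retrieved, so both models return the same document set, and any ordering of a set containing at least one relevant document has a positive NDCG@10. A toy illustration with made-up relevance labels:

from math import log2

# Toy example: rels[i] is the (made-up) relevance label of the
# document at rank i + 1. Reordering a fixed document set changes
# DCG, but cannot drive it to zero while another ordering of the
# same set is positive.
def dcg_at_10(rels):
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(rels[:10]))

bm25_order = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # two relevant docs
bert_order = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # same docs, reordered
print(dcg_at_10(bm25_order), dcg_at_10(bert_order))  # both > 0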

I could not reproduce the NDCG issue on my local instance; there the results were as expected, as I show below.

from pandas import merge

ndcg_measures = merge(
    left=evaluations["or_bm25"], 
    right=evaluations["or_bm25_bert_index_1"],
    on="query_id"
)[["query_id", "ndcg_10_value_x", "ndcg_10_value_y"]]
ndcg_measures
    query_id  ndcg_10_value_x  ndcg_10_value_y
0          1         0.683159         0.812003
1          2         0.000000         0.000000
2          3         0.450853         0.619669
3          4         0.000000         0.000000
4          5         0.397809         0.455605
5          6         0.678762         0.901013
6          7         0.888733         0.629200
7          8         0.527845         0.947807
8          9         0.859413         0.569139
9         10         0.541696         0.880740
10        11         0.000000         0.000000
11        12         0.510384         0.844481
12        13         0.500000         0.386853
13        14         0.847790         0.855857
14        15         0.882121         0.859719
15        16         0.588160         0.792087
16        17         0.674788         0.907813
17        18         0.758879         0.652883
18        19         0.695950         0.421776
19        20         0.769846         0.991829
20        21         0.581889         0.651197
21        22         0.301030         0.430677
22        23         0.735734         0.686502
23        24         0.963487         0.857319
24        25         0.570642         0.650921
25        26         0.918849         0.868415
26        27         0.533893         0.493208
27        28         0.548702         0.575719
28        29         0.966813         0.907663
29        30         0.811837         0.804581
30        31         0.000000         0.000000
31        32         0.430677         0.630930
32        33         0.707489         0.426919
33        34         0.430677         0.630930
34        35         0.489969         0.906025
35        36         0.846117         0.879854
36        37         0.950421         0.950421
37        38         0.783761         0.800132
38        39         0.976409         0.710612
39        40         0.815931         0.559814
40        41         0.806327         0.970409
41        42         0.763499         0.706410
42        43         0.926285         0.702700
43        44         0.833223         0.579375
44        45         0.804465         0.710549
45        46         0.965444         0.622067
46        47         0.796498         0.717333
47        48         0.505179         0.917655
48        49         0.531761         0.864027
49        50         0.445734         0.422790

import plotly.graph_objects as go
ids=ndcg_measures.query_id.tolist()

fig = go.Figure(data=[
    go.Bar(name='BM25', x=ids, y=ndcg_measures.ndcg_10_value_x.tolist()),
    go.Bar(name='BM25 + BERT', x=ids, y=ndcg_measures.ndcg_10_value_y.tolist())
])
# Change the bar mode
fig.update_layout(barmode='group', xaxis=dict(type='category'))
fig.show()

Bergum's instance

@bergum's dev instance was down, so I could not check his results again, but below is the code I would use.

from vespa.application import Vespa

app = Vespa(
    url="https://bergum.cord-19.vespa-team.aws-us-east-1c.dev.public.vespa.oath.cloud",
    cert="/Users/tmartins/projects/vespa/pyvespa/docs/sphinx/source/use_cases/cord19/data-plane-joint.txt"
)
from vespa.query import Query, RankProfile as Ranking, OR

query_models = {
    "or_bm25": Query(
        match_phase = OR(),
        rank_profile = Ranking(name="bm25")
    ),
    "or_bm25_bert": Query(
        match_phase = OR(),
        rank_profile = Ranking(name="bert")
    )
}

from pandas import DataFrame

evaluations = {}
for query_model in query_models:
    evaluation = []
    for query_data in labelled_data:
        print(query_data["query_id"])
        # Query parameters intended to mirror those used on @bergum's
        # instance, including field collapsing on title.
        body = {
            "yql": "select * from sources * where userQuery();",
            "query": query_data["query"],
            "type": "any",
            "model.defaultIndex": "default",
            "hits": 10,
            "collapsefield": "title",
            "timeout": "100s",
            "ranking": query_models[query_model].rank_profile.name
        }
        evaluation_query = app.evaluate_query(
            eval_metrics=eval_metrics,
            query_model=query_models[query_model],
            query_id=query_data["query_id"],
            query=query_data["query"],
            id_field = "cord_uid",
            relevant_docs=query_data["relevant_docs"],
            body=body,
        )
        evaluation.append(evaluation_query)
    evaluations[query_model] = DataFrame.from_records(evaluation)

Effect of using different model outputs from BERT

For a moment I thought that the results were similar no matter which model output I used, which would have indicated a bug. After further analysis I realized this was not the case: as expected, using the right model output yielded much better results.

import plotly.express as px

# Compare the ndcg_10 distributions of the wrong (bert) and correct
# (bert_index_1) model outputs.
fig = px.box(
    metric_values[(metric_values.metric == "ndcg_10") & (metric_values.query_model != "or_bm25")],
    x="query_model", y="value", points="all"
)
fig.show()