Debug first iteration of BERT models
Recall, positive-and-null NDCG, and wrong model output issues.
- Context
- Evaluate ranking functions
- Recall issue
- Positive and null NDCG issue
- Effect of using different model outputs from BERT
I have fine-tuned a BERT model using a simple training routine. I then deployed a simplified cord19 application on my laptop to validate the model. We will investigate some strange results we found in a previous sprint.
Connect to my local Vespa app.
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)
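As a quick sanity check that the connection works, we can hit Vespa's application status endpoint directly (assuming the container is reachable on port 8080):

import requests

# Vespa's application status endpoint should return HTTP 200 when the
# container is up and the application package is deployed.
response = requests.get("http://localhost:8080/ApplicationStatus")
print(response.status_code)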
Define different query models. bert_index_1 uses the correct output from the BERT model, which is the probability of the document being relevant. bert was a mistake I made when I used the wrong output to rank the documents. It is still here because it will be part of my investigations later.
from vespa.query import Query, RankProfile as Ranking, OR

query_models = {
    "or_bm25": Query(
        match_phase=OR(),
        rank_profile=Ranking(name="bm25")
    ),
    "or_bm25_bert": Query(
        match_phase=OR(),
        rank_profile=Ranking(name="bert")
    ),
    "or_bm25_bert_index_1": Query(
        match_phase=OR(),
        rank_profile=Ranking(name="bert_index_1")
    )
}
The evaluation metrics that we want to compute.
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank, NormalizedDiscountedCumulativeGain

eval_metrics = [
    MatchRatio(),
    Recall(at=10),
    ReciprocalRank(at=10),
    NormalizedDiscountedCumulativeGain(at=10)
]
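As a quick refresher on what NDCG@10 measures, here is a toy computation with binary relevance; the exact definition used by pyvespa (graded relevance, log base) may differ slightly:

import math

# Toy illustration of NDCG@10: relevance of the documents returned in
# positions 1..10 (1 = relevant, 0 = not relevant).
retrieved_relevance = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
ideal_relevance = sorted(retrieved_relevance, reverse=True)

def dcg(relevances):
    # Discounted cumulative gain: later positions contribute less.
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances))

ndcg_at_10 = dcg(retrieved_relevance) / dcg(ideal_relevance)
print(round(ndcg_at_10, 3))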
Load labeled data. You can download it here.
import json

with open("cord19/labelled_data.json", "r") as f:
    labelled_data = json.load(f)
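Each entry is expected to look roughly like the sketch below, based on the fields accessed later in this notebook; the id and text values are made up for illustration:

# Hypothetical shape of one entry in labelled_data. Only the keys (query_id,
# query, relevant_docs with document ids) reflect how the data is used below;
# the concrete values are invented.
example_entry = {
    "query_id": 1,
    "query": "coronavirus origin",
    "relevant_docs": [
        {"id": "some_cord_uid", "score": 2},
    ],
}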
We need the tokenizer to convert the query string into the token ids expected by the BERT model.
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
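For illustration, this is the kind of token id sequence we will later pass as the query(query_token_ids) ranking feature; the example query string is arbitrary:

# Tokenize an arbitrary example query the same way it is done in the
# evaluation loop below: truncate/pad to 64 tokens, no [CLS]/[SEP] tokens.
example_token_ids = tokenizer(
    "coronavirus origin",
    truncation=True,
    padding="max_length",
    max_length=64,
    add_special_tokens=False
)["input_ids"]
print(len(example_token_ids))   # 64
print(example_token_ids[:5])    # first few token ids, the rest is padding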
Compute evaluation metrics for each query model and each query.
from pandas import DataFrame

evaluations = {}
for query_model in query_models:
    evaluation = []
    for query_data in labelled_data:
        print(query_data["query_id"])
        evaluation_query = app.evaluate_query(
            eval_metrics=eval_metrics,
            query_model=query_models[query_model],
            query_id=query_data["query_id"],
            query=query_data["query"],
            id_field="cord_uid",
            relevant_docs=query_data["relevant_docs"],
            hits=10,
            timeout="100s",
            **{"ranking.features.query(query_token_ids)": str(tokenizer(
                str(query_data["query"]),
                truncation=True,
                padding="max_length",
                max_length=64,
                add_special_tokens=False
            )["input_ids"])}
        )
        evaluation.append(evaluation_query)
    evaluations[query_model] = DataFrame.from_records(evaluation)
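Each entry of evaluations is a DataFrame with one row per query and one column per metric value (for example recall_10_value and ndcg_10_value, which we use below), so we can peek at it directly:

# Peek at the per-query evaluation results for one query model.
evaluations["or_bm25"].head()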
Organize the data into a nicer format to work with.
import pandas as pd

metric_values = []
for query_model in query_models:
    for metric in eval_metrics:
        metric_values.append(
            pd.DataFrame(
                data={
                    "query_model": query_model,
                    "metric": metric.name,
                    "value": evaluations[query_model][metric.name + "_value"].to_list()
                }
            )
        )
metric_values = pd.concat(metric_values, ignore_index=True)
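With the metrics in long format, a quick pivot gives the median of every metric per query model:

# Median of each metric per query model, for a quick overview.
metric_values.groupby(["query_model", "metric"])["value"].median().unstack()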
The recall issue is that different query models were giving different recall metrics, even though they all use the same match phase and first-phase ranking, and the second-phase ranking below should only reorder the top 10 positions:
second-phase {
    rerank-count: 10
    expression: sum(eval)
}
metric_values[metric_values.metric == "recall_10"].groupby(['query_model', 'metric']).median()
It seems that Vespa reorders the top 11 documents, even though rerank-count: 10 is set, as I show below.
Identify which queries are responsible for the difference:
from pandas import merge

recall_measures = merge(
    left=evaluations["or_bm25"],
    right=evaluations["or_bm25_bert_index_1"],
    on="query_id"
)[["query_id", "recall_10_value_x", "recall_10_value_y"]]
recall_measures[recall_measures.recall_10_value_x != recall_measures.recall_10_value_y]
Query the application with two different rank profiles for one of the affected queries.
query_data = labelled_data[14]

result_bm25 = app.query(
    query=query_data["query"],
    query_model=query_models["or_bm25"],
    hits=10
)
bm25_ids = [hit["fields"]["cord_uid"] for hit in result_bm25.hits]

result_bm25_bert = app.query(
    query=query_data["query"],
    query_model=query_models["or_bm25_bert_index_1"],
    hits=10,
    timeout="100s",
    **{"ranking.features.query(query_token_ids)": str(tokenizer(
        str(query_data["query"]),
        truncation=True,
        padding="max_length",
        max_length=64,
        add_special_tokens=False
    )["input_ids"])}
)
bm25_bert_ids = [hit["fields"]["cord_uid"] for hit in result_bm25_bert.hits]
Check which id is in the BERT top 10 but not in the BM25 top 10:
id_in_bert_not_in_bm25 = [x for x in bm25_bert_ids if x not in bm25_ids]
id_in_bert_not_in_bm25
List the top 11 results from BM25. Notice that the missing doc is in the 11th position.
result_bm25_11 = [
    hit["fields"]["cord_uid"]
    for hit in app.query(
        query=query_data["query"],
        query_model=query_models["or_bm25"],
        hits=11
    ).hits
]
result_bm25_11
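To make this explicit, we can check that the document appearing only in the BERT top 10 sits at position 11 of the BM25 ranking (assuming there is exactly one such document):

# The doc that entered the BERT top 10 should be the 11th BM25 hit,
# which is what allows the re-ranker to change recall@10.
assert id_in_bert_not_in_bm25 == [result_bm25_11[10]]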
When querying the @bergum dev instance I found cases where NDCG@10 was positive for the BM25 model and zero for the BERT re-rank model, as we can see in the picture below, which makes no sense.
I could not reproduce the NDCG issue and the results were as expected on my local instance, as I show below.
from pandas import merge

ndcg_measures = merge(
    left=evaluations["or_bm25"],
    right=evaluations["or_bm25_bert_index_1"],
    on="query_id"
)[["query_id", "ndcg_10_value_x", "ndcg_10_value_y"]]
ndcg_measures
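Instead of eyeballing the table, we can also select the queries that would reproduce the reported pattern, positive NDCG@10 for BM25 and zero for the BERT re-ranker; an empty result confirms the issue does not show up locally:

# Queries where BM25 NDCG@10 is positive but the BERT re-ranker's is zero;
# an empty selection means the issue does not reproduce on the local instance.
ndcg_measures[
    (ndcg_measures.ndcg_10_value_x > 0) & (ndcg_measures.ndcg_10_value_y == 0)
]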
import plotly.graph_objects as go

ids = ndcg_measures.query_id.tolist()
fig = go.Figure(data=[
    go.Bar(name='BM25', x=ids, y=ndcg_measures.ndcg_10_value_x.tolist()),
    go.Bar(name='BM25 + BERT', x=ids, y=ndcg_measures.ndcg_10_value_y.tolist())
])
# Change the bar mode
fig.update_layout(barmode='group', xaxis=dict(type='category'))
fig.show()
The @bergum dev instance was down, so I could not check his results again, but below is the code I would use.
from vespa.application import Vespa

app = Vespa(
    url="https://bergum.cord-19.vespa-team.aws-us-east-1c.dev.public.vespa.oath.cloud",
    cert="/Users/tmartins/projects/vespa/pyvespa/docs/sphinx/source/use_cases/cord19/data-plane-joint.txt"
)
from vespa.query import Query, RankProfile as Ranking, OR

query_models = {
    "or_bm25": Query(
        match_phase=OR(),
        rank_profile=Ranking(name="bm25")
    ),
    "or_bm25_bert": Query(
        match_phase=OR(),
        rank_profile=Ranking(name="bert")
    )
}
from pandas import DataFrame

evaluations = {}
for query_model in query_models:
    evaluation = []
    for query_data in labelled_data:
        print(query_data["query_id"])
        body = {
            "yql": "select * from sources * where userQuery();",
            "query": query_data["query"],
            "type": "any",
            "model.defaultIndex": "default",
            "hits": 10,
            "collapsefield": "title",
            "timeout": "100s",
            "ranking": query_models[query_model].rank_profile.name
        }
        evaluation_query = app.evaluate_query(
            eval_metrics=eval_metrics,
            query_model=query_models[query_model],
            query_id=query_data["query_id"],
            query=query_data["query"],
            id_field="cord_uid",
            relevant_docs=query_data["relevant_docs"],
            body=body,
        )
        evaluation.append(evaluation_query)
    evaluations[query_model] = DataFrame.from_records(evaluation)
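Before running the full evaluation against the dev instance, it can be useful to sanity-check one hand-written request body directly, assuming the pyvespa version used here accepts a raw body in Vespa.query the same way evaluate_query does:

# Send one raw request body to the dev instance and inspect the number of hits.
# This reuses the last `body` built in the loop above.
sanity_result = app.query(body=body)
print(len(sanity_result.hits))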
For a moment I thought that the results were similar no matter which model output I used, which would have indicated a bug. After further analysis I realized this was not the case: as expected, using the right model output yielded much better results.
import plotly.express as px

fig = px.box(
    metric_values[(metric_values.metric == "ndcg_10") & (metric_values.query_model != "or_bm25")],
    x="query_model",
    y="value",
    points="all"
)
fig.show()
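For context on what "wrong output" means here, below is a minimal sketch of the kinds of values a standard BERT sequence-classification head exposes. The model name and texts are arbitrary, and exactly which value the bert rank profile used is not shown here; the point is that only the probability of class index 1 corresponds to the document being relevant, which is what bert_index_1 uses.

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Illustrative only: model name and query/document text are arbitrary.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

inputs = tok("coronavirus origin", "Example document text.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2), assuming [not relevant, relevant]

probabilities = torch.softmax(logits, dim=-1)
wrong_output = logits[0, 0].item()                    # e.g. a raw logit of the wrong class
relevance_probability = probabilities[0, 1].item()    # index 1: probability of being relevant
print(wrong_output, relevance_probability)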