Download processed data

We can start by downloading the data that we have processed before.

import requests, json
from pandas import read_csv

topics = json.loads(
    requests.get("https://thigm85.github.io/data/cord19/topics.json").text
)
relevance_data = read_csv("https://thigm85.github.io/data/cord19/relevance_data.csv")

topics contains data about the 50 topics available, including the query, question and narrative for each one.

topics["1"]
{'query': 'coronavirus origin',
 'question': 'what is the origin of COVID-19',
 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}

relevance_data contains the relevance judgments for each of the 50 topics.

relevance_data.head(5)
   topic_id  round_id  cord_uid  relevancy
0         1       4.5  005b2j4b          2
1         1       4.0  00fmeepz          1
2         1       0.5  010vptx3          2
3         1       2.5  0194oljo          1
4         1       4.0  021q9884          1
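
Before building the labeled data, it can be useful to get a feel for how many judgments each topic has. This is an optional inspection using plain pandas on the DataFrame we just loaded:

# Optional: judgment counts per topic and the distribution of relevance scores.
print(relevance_data.shape)
print(relevance_data.groupby("topic_id").size().describe())
print(relevance_data.relevancy.value_counts())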

Install pyvespa

We are going to use pyvespa to evaluate ranking functions from Python.

!pip install pyvespa

pyvespa provides a Python API to Vespa. It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications.

Format the labeled data into expected pyvespa format

pyvespa expects labeled data to follow the format illustrated below. It is a list of dicts, where each dict represents a query and contains query_id, query and a list of relevant_docs. Each relevant document contains a required id key and an optional score key.

labeled_data = [
    {
        'query_id': 1,
        'query': 'coronavirus origin',
        'relevant_docs': [{'id': '005b2j4b', 'score': 2}, {'id': '00fmeepz', 'score': 1}]
    },
    {
        'query_id': 2,
        'query': 'coronavirus response to weather changes',
        'relevant_docs': [{'id': '01goni72', 'score': 2}, {'id': '03h85lvy', 'score': 2}]
    }
]

We can create labeled_data from the topics and relevance_data that we downloaded before. Only documents with a relevance score greater than 0 are included in the final list.

# Build one entry per topic, keeping only judgments with relevancy > 0.
labeled_data = [
    {
        "query_id": int(topic_id), 
        "query": topics[topic_id]["query"], 
        "relevant_docs": [
            {
                "id": row["cord_uid"], 
                "score": row["relevancy"]
            } for idx, row in relevance_data[relevance_data.topic_id == int(topic_id)].iterrows() if row["relevancy"] > 0
        ]
    } for topic_id in topics.keys()]
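
As a quick sanity check, we can confirm that we ended up with one entry per topic and peek at the first one (output omitted here):

# Sanity check the generated labeled data.
print(len(labeled_data))                      # one entry per topic, 50 in total
print(labeled_data[0]["query_id"], labeled_data[0]["query"])
print(len(labeled_data[0]["relevant_docs"]))  # judged docs with relevancy > 0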

Define query models to be evaluated

We are going to define two query models to be evaluated here. Both will match all the documents that share at least one term with the query. This is defined by setting match_phase = OR().

The difference between the query models lies in the ranking phase. The or_default model ranks documents with nativeRank, while the or_bm25 model ranks them with BM25. A discussion of those two ranking functions is out of scope for this tutorial; it is enough to know that they score documents with two different formulas.
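
For intuition only, here is a rough sketch of a common BM25 term-scoring formula in plain Python. It is an illustration of the general idea, not Vespa's implementation; the helper name and the k1 and b defaults are just the usual textbook choices:

import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Contribution of a single query term to a document's BM25 score.

    tf: term frequency in the document, df: number of documents containing the term,
    n_docs: total number of documents, doc_len / avg_doc_len: document length stats.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# A document's BM25 score is the sum of this quantity over all query terms.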

Those ranking profiles were defined by the team behind the cord19 app and can be found here.

from vespa.query import Query, RankProfile, OR

query_models = {
    "or_default": Query(
        match_phase = OR(),
        rank_profile = RankProfile(name="default")
    ),
    "or_bm25": Query(
        match_phase = OR(),
        rank_profile = RankProfile(name="bm25t5")
    )
}

Define metrics to be used in the evaluation

We would like to compute the following metrics:

  • The percentage of documents matched by the query

  • Recall @ 10

  • Reciprocal rank @ 10

  • NDCG @ 10

from vespa.evaluation import (
    MatchRatio,
    Recall,
    ReciprocalRank,
    NormalizedDiscountedCumulativeGain,
)

eval_metrics = [
    MatchRatio(), 
    Recall(at=10), 
    ReciprocalRank(at=10), 
    NormalizedDiscountedCumulativeGain(at=10)
]
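
To make these metrics concrete, here is how recall @ 10 and reciprocal rank @ 10 could be computed by hand for a single toy query. This is only an illustration with made-up ids, not pyvespa's internal implementation:

# Toy example: ranked ids returned by a query vs. the set of relevant ids.
retrieved = ["d3", "d7", "d1", "d9", "d2"]   # top hits, best first
relevant = {"d1", "d4", "d7"}                # judged relevant (score > 0)

top10 = retrieved[:10]
recall_at_10 = len([d for d in top10 if d in relevant]) / len(relevant)
reciprocal_rank_at_10 = next(
    (1 / (i + 1) for i, d in enumerate(top10) if d in relevant), 0
)
print(recall_at_10)           # 2/3: d7 and d1 are retrieved, d4 is missed
print(reciprocal_rank_at_10)  # 1/2: the first relevant hit is at position 2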

Evaluate

Connect to a running Vespa instance:

from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")
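
Before running the full evaluation, we can send a single query as a spot check. The snippet below assumes the query interface of the same pyvespa version used above (Vespa.query taking a query string and a query model); the hits, fields and relevance attributes follow the standard Vespa result format, so treat it as a sketch:

# Spot check: run one query with the BM25 model and inspect the top hits.
result = app.query(
    query="coronavirus origin",
    query_model=query_models["or_bm25"],
    hits=5
)
for hit in result.hits:
    print(hit["fields"]["cord_uid"], hit["relevance"])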

Compute the metrics defined above for each query model and store the results in a dictionary.

evaluations = {}
for query_model in query_models:
    evaluations[query_model] = app.evaluate(
        labeled_data = labeled_data,
        eval_metrics = eval_metrics,
        query_model = query_models[query_model],
        id_field = "cord_uid",
        hits = 10
    )

Analyze results

Let’s first combine the data into one DataFrame in a format to facilitate a comparison between query models.

import pandas as pd

# Long format: one row per query, per query model, per metric.
metric_values = []
for query_model in query_models:
    for metric in eval_metrics:
        metric_values.append(
            pd.DataFrame(
                data={
                    "query_model": query_model, 
                    "metric": metric.name, 
                    "value": evaluations[query_model][metric.name + "_value"].to_list()
                }
            )
        )
metric_values = pd.concat(metric_values, ignore_index=True)
metric_values.head()
  query_model       metric     value
0  or_default  match_ratio  0.231523
1  or_default  match_ratio  0.755509
2  or_default  match_ratio  0.265400
3  or_default  match_ratio  0.843403
4  or_default  match_ratio  0.901592

The summary below shows that the query model based on BM25 is superior on all ranking metrics. The match ratio is identical for the two models, as expected, since they share the same OR match phase.

metric_values.groupby(['query_model', 'metric']).mean()
                                   value
query_model metric
or_bm25     match_ratio         0.412386
            ndcg_10             0.651929
            recall_10           0.007654
            reciprocal_rank_10  0.610270
or_default  match_ratio         0.412386
            ndcg_10             0.602556
            recall_10           0.005435
            reciprocal_rank_10  0.564437
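
The same summary can also be pivoted into a side-by-side table, one column per query model, which makes the comparison easier to scan:

# Mean value per metric, with one column per query model.
metric_values.pivot_table(
    index="metric", columns="query_model", values="value", aggfunc="mean"
)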

We can also visualize the distribution of the metrics across the queries to get a better picture of the results.

import plotly.express as px

fig = px.box(
    metric_values[metric_values.metric == "ndcg_10"], 
    x="query_model", 
    y="value", 
    title="Ndgc @ 10",
    points="all"
)
fig.show()
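
If you want to keep the comparison around outside the notebook, the figure can also be written to a standalone HTML file with plotly's write_html (the file name here is just an example):

# Save an interactive copy of the box plot.
fig.write_html("ndcg_10_box_plot.html")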