Vespa is, in my opinion, the fastest, most scalable and most advanced search engine currently available. It has a native tensor evaluation framework, can perform approximate nearest neighbor search and can deploy the latest advancements in NLP modeling, such as BERT models. This post gives you an overview of the Vespa python API available through the pyvespa library. The main goal of the library is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications.

We are going to connect to the CORD-19 search app and use it as an example here. You can later use your own application to replicate the following steps. Future posts will go deeper into each topic described in this overview tutorial.

You can also run the steps contained here from Google Colab.

Install

**Warning**: The library is under active development and backward incompatible changes may occur.

The library is available on PyPI and can therefore be installed with pip.

!pip install pyvespa

Connect to a running Vespa application

We can connect to a running Vespa application by creating an instance of Vespa with the appropriate URL. The resulting app instance will then be used to communicate with the application.

from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")
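If you are running your own application instead of the public CORD-19 endpoint, you would typically also pass the port. A small sketch; the localhost URL and port 8080 are assumptions about a default local deployment:

local_app = Vespa(url="http://localhost", port=8080)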

Define a Query model

Easily define matching and ranking criteria

When building a search application, we usually want to experiment with different query models. A Query model consists of a match phase and a ranking phase. The matching phase will define how to match documents based on the query sent and the ranking phase will define how to rank the matched documents. Both phases can get quite complex and being able to easily express and experiment with them is very valuable.

In the example below we define the match phase to be the Union of the WeakAnd and the ANN operators. The WeakAnd will match documents based on query terms while the Approximate Nearest Neighbor (ANN) operator will match documents based on the distance between the query and document embeddings. This is an illustration of how easy it is to combine term and semantic matching in Vespa.

from vespa.query import Union, WeakAnd, ANN
from random import random

match_phase = Union(
    WeakAnd(hits=10),
    ANN(
        doc_vector="title_embedding",  # document embedding field defined in the schema
        query_vector="title_vector",   # name of the query embedding sent with the request
        embedding_model=lambda x: [random() for _ in range(768)],  # placeholder embedding function
        hits=10,
        label="title"
    )
)
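The lambda above generates a random vector and is only there to keep the example self-contained. To use an actual model, you would pass any function that maps the query text to a 768-dimensional list of floats. Below is a minimal sketch, assuming the sentence-transformers package is available; the model name and the embed_query function are illustrative and not part of pyvespa.

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed 768-dim model

def embed_query(query_text):
    # Encode the query text into a plain list of floats, as ANN expects.
    return encoder.encode(query_text).tolist()

semantic_match = ANN(
    doc_vector="title_embedding",
    query_vector="title_vector",
    embedding_model=embed_query,
    hits=10,
    label="title"
)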

We then define the ranking to be done by the bm25 rank-profile that is already defined in the application schema. We set list_features=True so that we can collect rank features later in this tutorial. After defining the match_phase and the rank_profile we can instantiate the Query model.

from vespa.query import Query, RankProfile

rank_profile = RankProfile(name="bm25", list_features=True)

query_model = Query(match_phase=match_phase, rank_profile=rank_profile)

Query the Vespa app

Send queries via the query API. See the query page for more examples.

We can use the query_model that we just defined to issue queries to the application via the query method.

query_result = app.query(
    query="Is remdesivir an effective treatment for COVID-19?", 
    query_model=query_model
)

We can see the number of documents that were retrieved by Vespa:

query_result.number_documents_retrieved
1121

And the number of documents that were returned to us:

len(query_result.hits)
10
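Each hit follows Vespa's JSON result format, so we can inspect it directly. The field name title below is an assumption based on the CORD-19 schema; replace it with a field from your own application.

top_hit = query_result.hits[0]
top_hit["relevance"]            # score assigned by the bm25 rank-profile
top_hit["fields"].get("title")  # "title" is assumed to exist in the schema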

Labelled data

How to structure labelled data

We often need to either evaluate query models or collect data to improve query models through ML. In both cases we usually need labelled data. Let's create some labelled data to illustrate the expected format and how it is used in the library.

Each data point contains a query_id, a query and relevant_docs associated with the query.

labelled_data = [
    {
        "query_id": 0, 
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}]
    },
    {
        "query_id": 1, 
        "query": "Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus",
        "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}]
    }
]

Non-relevant documents are assigned "score": 0 by default. Relevant documents are assigned "score": 1 by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified through the appropriate methods.
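As an illustration of these defaults, the hypothetical data point below omits the score field, so its relevant documents are treated as having "score": 1; the query and document ids are made up.

data_point_with_defaults = {
    "query_id": 2,
    "query": "Cytokine storm in severe COVID-19 patients",
    "relevant_docs": [{"id": 7}, {"id": 12}]  # "score": 1 is assumed by default
}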

Collect training data

Collect training data to analyse and/or improve ranking functions. See the collect training data page for more examples.

We can collect training data with the collect_training_data method according to a specific Query model. Below we will collect two documents for each query in addition to the relevant ones.

training_data_batch = app.collect_training_data(
    labelled_data = labelled_data,
    id_field = "id",
    query_model = query_model,
    number_additional_docs = 2,
    fields = ["rankfeatures"]
)

Many rank features are returned by default. We can select some of them to inspect:

training_data_batch[
    [
        "document_id", "query_id", "label", 
        "textSimilarity(title).proximity", 
        "textSimilarity(title).queryCoverage", 
        "textSimilarity(title).score"
    ]
]
    document_id  query_id  label  textSimilarity(title).proximity  textSimilarity(title).queryCoverage  textSimilarity(title).score
0             0         0      1                         0.000000                             0.142857                     0.055357
1        255164         0      0                         1.000000                             1.000000                     1.000000
2        145189         0      0                         0.739583                             0.571429                     0.587426
3             3         0      1                         0.437500                             0.142857                     0.224554
4        255164         0      0                         1.000000                             1.000000                     1.000000
5        145189         0      0                         0.739583                             0.571429                     0.587426
6             1         1      1                         0.000000                             0.083333                     0.047222
7        232555         1      0                         1.000000                             1.000000                     1.000000
8         13944         1      0                         1.000000                             0.250000                     0.612500
9             5         1      1                         0.000000                             0.083333                     0.041667
10       232555         1      0                         1.000000                             1.000000                     1.000000
11        13944         1      0                         1.000000                             0.250000                     0.612500
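Since the collected data is returned as a DataFrame, it can feed directly into a model. The sketch below assumes scikit-learn is installed; it is not part of pyvespa and only illustrates fitting a simple classifier on a few of the collected rank features, with the relevance label as target.

from sklearn.linear_model import LogisticRegression

feature_names = [
    "textSimilarity(title).proximity",
    "textSimilarity(title).queryCoverage",
    "textSimilarity(title).score"
]
X = training_data_batch[feature_names]
y = training_data_batch["label"]

model = LogisticRegression().fit(X, y)  # a deliberately simple baseline
model.coef_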

Evaluating a query model

Define metrics and evaluate query models. See the evaluation page for more examples.

We will define the following evaluation metrics:

  • % of documents retrieved per query

  • recall @ 10 per query

  • MRR @ 10 per query

from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]

Evaluate:

evaluation = app.evaluate(
    labelled_data = labelled_data,
    eval_metrics = eval_metrics, 
    query_model = query_model, 
    id_field = "id",
)
evaluation
   query_id  match_ratio_retrieved_docs  match_ratio_docs_available  match_ratio_value  recall_10_value  reciprocal_rank_10_value
0         0                        1192                      309201           0.003855              0.0                         0
1         1                        1144                      309201           0.003700              0.0                         0
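The evaluation output is a per-query DataFrame, so standard pandas operations apply. For example, we can average each metric across queries to get a single summary per query model (a small sketch using the columns shown above):

evaluation[
    ["match_ratio_value", "recall_10_value", "reciprocal_rank_10_value"]
].mean()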