How to connect to and interact with search applications from Python
A pyvespa library overview. Connect, query, collect data and evaluate query models.
- Install
- Connect to a running Vespa application
- Define a Query model
- Query the vespa app
- Labelled data
- Collect training data
- Evaluating a query model
Vespa is, in my opinion, the fastest, most scalable and most advanced search engine currently available. It has a native tensor evaluation framework, can perform approximate nearest neighbor search, and can deploy the latest advancements in NLP modeling, such as BERT models. This post gives you an overview of the Vespa python API available through the pyvespa library. The main goal of the library is to allow faster prototyping and to facilitate Machine Learning experiments for Vespa applications.
We are going to connect to the CORD-19 search app and use it as an example here. You can later use your own application to replicate the following steps. Future posts will go deeper into each topic described in this overview tutorial.
You can also run the steps contained here from Google Colab.
The library is available on PyPI and can therefore be installed with pip.
!pip install pyvespa
We can connect to a running Vespa application by creating an instance of Vespa with the appropriate URL. The resulting app instance will then be used to communicate with the application.
from vespa.application import Vespa
app = Vespa(url = "https://api.cord19.vespa.ai")
When building a search application, we usually want to experiment with different query models. A Query model consists of a match phase and a ranking phase. The matching phase will define how to match documents based on the query sent and the ranking phase will define how to rank the matched documents. Both phases can get quite complex and being able to easily express and experiment with them is very valuable.
In the example below we define the match phase to be the Union of the WeakAnd and the ANN operators. The WeakAnd operator will match documents based on query terms, while the Approximate Nearest Neighbor (ANN) operator will match documents based on the distance between the query and document embeddings. This is an illustration of how easy it is to combine term-based and semantic matching in Vespa.
from vespa.query import Union, WeakAnd, ANN
from random import random
match_phase = Union(
    WeakAnd(hits=10),
    ANN(
        doc_vector="title_embedding",
        query_vector="title_vector",
        embedding_model=lambda query: [random() for _ in range(768)],
        hits=10,
        label="title"
    )
)
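The embedding_model above is a placeholder that returns random vectors. In a real experiment you would plug in an actual encoder; the only contract, as used here, is a callable that maps the query text to a 768-dimensional list of floats. Below is a minimal sketch with a hypothetical deterministic, hash-based encoder standing in for a real model (the function name and logic are ours, not part of pyvespa):

```python
import hashlib

def toy_embedding_model(query: str, dim: int = 768) -> list:
    """Hypothetical stand-in encoder: deterministically maps query text
    to a fixed-size vector, matching the callable signature expected by
    the embedding_model argument above."""
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    # Repeat the 32 hash bytes to fill `dim` slots, scaled to [0, 1).
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]

vector = toy_embedding_model("Is remdesivir an effective treatment for COVID-19?")
```

Any model with this signature, such as a sentence-level BERT encoder, can be dropped in the same way, as long as its output dimension matches the document embedding field.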
We then define the ranking to be done by the bm25 rank-profile that is already defined in the application schema. We set list_features=True to be able to collect rank features later in this tutorial. After defining the match_phase and the rank_profile we can instantiate the Query model.
from vespa.query import Query, RankProfile
rank_profile = RankProfile(name="bm25", list_features=True)
query_model = Query(match_phase=match_phase, rank_profile=rank_profile)
Query the Vespa app
Send queries via the query API. See the query page for more examples.
We can use the query_model that we just defined to issue queries to the application via the query method.
query_result = app.query(
query="Is remdesivir an effective treatment for COVID-19?",
query_model=query_model
)
We can see the number of documents that were retrieved by Vespa:
query_result.number_documents_retrieved
And the number of documents that were returned to us:
len(query_result.hits)
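Each entry in query_result.hits follows Vespa's standard result format, where a hit carries a relevance score and a fields dictionary. The sketch below walks a mocked hit list so it runs without a live application; the document ids, scores and the "title" field name are illustrative and depend on the schema:

```python
# Mocked hits in the shape of Vespa's default JSON result format.
hits = [
    {"id": "id:covid-19:doc::142", "relevance": 11.82,
     "fields": {"title": "Remdesivir in adults with severe COVID-19"}},
    {"id": "id:covid-19:doc::387", "relevance": 9.04,
     "fields": {"title": "Compassionate use of remdesivir"}},
]

# Rank position, relevance score and title for quick inspection.
summary = [
    (rank, hit["relevance"], hit["fields"]["title"])
    for rank, hit in enumerate(hits, start=1)
]
```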
We often need either to evaluate query models or to collect data to improve them through ML. In both cases we usually need labelled data. Let's create some labelled data to illustrate its expected format and usage in the library.
Each data point contains a query_id, a query and the relevant_docs associated with the query.
labelled_data = [
    {
        "query_id": 0,
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}]
    },
    {
        "query_id": 1,
        "query": "Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus",
        "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}]
    }
]
Non-relevant documents are assigned "score": 0 by default. Relevant documents are assigned "score": 1 by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified in the appropriate methods.
Collect training data
Collect training data to analyse and/or improve ranking functions. See the collect training data page for more examples.
We can collect training data with the collect_training_data method according to a specific Query model. Below we will collect two documents for each query in addition to the relevant ones.
training_data_batch = app.collect_training_data(
    labelled_data=labelled_data,
    id_field="id",
    query_model=query_model,
    number_additional_docs=2,
    fields=["rankfeatures"]
)
Many rank features are returned by default. We can select some of them to inspect:
training_data_batch[
    [
        "document_id", "query_id", "label",
        "textSimilarity(title).proximity",
        "textSimilarity(title).queryCoverage",
        "textSimilarity(title).score"
    ]
]
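Once collected, these features can feed a learning-to-rank step or a simple sanity check. As a minimal, library-free sketch of the kind of analysis you might start with, the rows below are made-up stand-ins for the rows of training_data_batch; label 1 marks relevant documents and label 0 the additionally collected ones:

```python
# Made-up rows mimicking the columns selected above.
rows = [
    {"label": 1, "textSimilarity(title).score": 0.82},
    {"label": 1, "textSimilarity(title).score": 0.74},
    {"label": 0, "textSimilarity(title).score": 0.31},
    {"label": 0, "textSimilarity(title).score": 0.12},
]

def mean_by_label(rows, feature, label):
    """Average value of one rank feature over rows with the given label."""
    values = [r[feature] for r in rows if r["label"] == label]
    return sum(values) / len(values)

# A feature is promising for ranking when relevant documents score
# visibly higher on it than non-relevant ones.
gap = (mean_by_label(rows, "textSimilarity(title).score", 1)
       - mean_by_label(rows, "textSimilarity(title).score", 0))
```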
Evaluating a query model
Define metrics and evaluate query models. See the evaluation page for more examples.
We will define the following evaluation metrics:
- % of documents retrieved per query
- recall @ 10 per query
- MRR @ 10 per query
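For reference, the two rank-based metrics can be written down in a few lines each. The sketch below is ours, not pyvespa's implementation: it computes recall @ k and reciprocal rank @ k from a ranked list of document ids and a set of relevant ids:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents found in the top k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy example: documents 0 and 3 are relevant for the query.
ranked = [4, 0, 7, 3, 9]
relevant = {0, 3}
```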
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank
eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]
Evaluate:
evaluation = app.evaluate(
    labelled_data=labelled_data,
    eval_metrics=eval_metrics,
    query_model=query_model,
    id_field="id",
)
evaluation