Build a basic text search application from python with Vespa
Introducing the pyvespa simplified API: build a Vespa application from python with a few lines of code.
This post will introduce you to the simplified pyvespa API that allows us to build a basic text search application from scratch with just a few lines of python code. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.
pyvespa exposes a subset of the Vespa API in python. The library's primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect and interact with running Vespa applications and to evaluate Vespa ranking functions from python. This time, we focus on building and deploying applications from scratch.
The pyvespa simplified API introduced here was released in version 0.2.0:
pip3 install "pyvespa>=0.2.0"
As an example, we will build an application to search through CORD19 sample data.
The first step is to create a Vespa ApplicationPackage:
from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name="cord19")
We can then add fields to the application's schema:

from vespa.package import Field

app_package.schema.add_fields(
    Field(
        name = "cord_uid",
        type = "string",
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title",
        type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    ),
    Field(
        name = "abstract",
        type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    )
)
- cord_uid will store the cord19 document ids, while title and abstract are self-explanatory.
- All the fields, in this case, are of type string.
- Including "index" in the indexing list means that Vespa will create a searchable index for title and abstract. You can read more about which options are available for indexing in the Vespa documentation.
- Setting index = "enable-bm25" makes Vespa pre-compute quantities to make it fast to compute the bm25 score. We will use BM25 to rank the documents retrieved.
A Fieldset groups fields together for searching. For example, the default fieldset defined below groups title and abstract together.
from vespa.package import FieldSet
app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)
We can specify how to rank the matched documents by defining a RankProfile. In this case, we define the bm25 rank profile, which combines the BM25 scores computed over the title and abstract fields.
from vespa.package import RankProfile
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25",
        first_phase = "bm25(title) + bm25(abstract)"
    )
)
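A rank profile is just a named ranking expression, so nothing stops us from adding more than one. As a purely illustrative, hypothetical variant (not used in the rest of this post), the profile below weights title matches twice as heavily as abstract matches:

from vespa.package import RankProfile

# Hypothetical extra profile for illustration only: boost BM25 matches
# in the title relative to matches in the abstract.
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bm25_title_boost",
        first_phase = "2 * bm25(title) + bm25(abstract)"
    )
)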
We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can deploy our app_package locally using Docker, without leaving the notebook, by creating an instance of VespaDocker, as shown below:
from vespa.package import VespaDocker
vespa_docker = VespaDocker(
    port=8080,
    disk_folder="/Users/username/cord19_app"
)

app = vespa_docker.deploy(
    application_package = app_package,
)
app now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.

It is important to know that pyvespa simply provides a convenient API to define Vespa application packages from python. vespa_docker.deploy exports the Vespa configuration files to the disk_folder defined above. Going through those files is an excellent way to start learning about Vespa syntax.
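For instance, a quick way to see what was generated is to list the exported files from python. The snippet below is plain standard-library code and assumes the disk_folder path used above:

import os

# Walk the folder that the application package was exported to and print
# every generated configuration file (e.g. services.xml and the .sd schema file).
for root, dirs, files in os.walk("/Users/username/cord19_app"):
    for file_name in files:
        print(os.path.join(root, file_name))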
Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame containing 100 rows and the cord_uid, title, and abstract columns required by our schema definition.
from pandas import read_csv
parsed_feed = read_csv(
    "https://thigm85.github.io/data/cord19/parsed_feed_100.csv"
)
parsed_feed
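Before feeding, it is worth checking that the DataFrame has the columns our schema expects. The check below is plain pandas and not part of the pyvespa API:

# Sanity check: 100 rows and the three columns used by the schema.
print(parsed_feed.shape)
assert {"cord_uid", "title", "abstract"}.issubset(parsed_feed.columns)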
We can then iterate through the DataFrame above and feed each row by using the app.feed_data_point method:
- The schema name is by default set to be equal to the application name, which is cord19 in this case.
- When feeding data to Vespa, we must have a unique id for each data point. We will use cord_uid here.
for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"])
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )
You can also inspect the response to each request if desired.
response.json()
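If you want to sanity-check the whole batch, one option is to keep the parsed responses while feeding. The sketch below simply re-runs the feeding loop above (feeding a document with an existing id overwrites it) and assumes, as above, that each response supports .json():

# Feed all rows again, this time keeping the parsed responses so we can
# inspect them afterwards.
feed_responses = []
for idx, row in parsed_feed.iterrows():
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = {
            "cord_uid": str(row["cord_uid"]),
            "title": str(row["title"]),
            "abstract": str(row["abstract"])
        },
    )
    feed_responses.append(response.json())

print(len(feed_responses))   # should equal the number of rows fed
print(feed_responses[0])     # the raw response for the first document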
With data fed, we can start to query our text search app. We can use the Vespa Query Language directly by sending the required parameters in the body argument of the app.query method.
query = {
    'yql': 'select * from sources * where userQuery();',
    'query': 'What is the role of endothelin-1',
    'ranking': 'bm25',
    'type': 'any',
    'presentation.timing': True,
    'hits': 3
}

res = app.query(body=query)
res.hits[0]
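To compare the returned documents at a glance, we can flatten the hits into a small DataFrame. This sketch assumes the default Vespa hit layout, where each hit is a dictionary carrying a relevance score and a fields dictionary:

from pandas import DataFrame

# Flatten the top hits into one row per document with id, title and relevance.
hits_df = DataFrame([
    {
        "cord_uid": hit.get("fields", {}).get("cord_uid"),
        "title": hit.get("fields", {}).get("title"),
        "relevance": hit.get("relevance"),
    }
    for hit in res.hits
])
print(hits_df)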
We can also define the same query by using the QueryModel abstraction, which allows us to specify how we want to match and rank our documents. In this case, we define that we want to:
- match our documents using the OR operator, which matches all the documents that share at least one term with the query;
- rank the matched documents using the bm25 rank profile defined in our application package.
from vespa.query import QueryModel, RankProfile as Ranking, OR
res = app.query(
    query="What is the role of endothelin-1",
    query_model=QueryModel(
        match_phase = OR(),
        rank_profile = Ranking(name="bm25")
    )
)
res.hits[0]
Using the Vespa Query Language as in our first example gives you the full power and flexibility that Vespa can offer. In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments, but that is a topic for a future post.