Semantic search with embedding representations from Python
Build and interact with an MS MARCO search application
- Install pyvespa
- Define the application
- Deploy the application
- Load the BERT model
- Feed data
- Define a query model
After the release of pyvespa version 0.2.0, we can now create a semantic search application based on embedding representations from Python with just a few lines of code:
from vespa.package import ApplicationPackage, Field, QueryTypeField, RankProfile

app_package = ApplicationPackage(name = "msmarco")

app_package.schema.add_fields(
    Field(
        name = "id", type = "string",
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    ),
    Field(
        name = "title_bert", type = "tensor<float>(x[768])",
        indexing = ["attribute"]
    )
)

app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(title_bert)",
        type="tensor<float>(x[768])"
    )
)

app_package.schema.add_rank_profile(
    RankProfile(
        name = "bert_title",
        first_phase = "sum(query(title_bert)*attribute(title_bert))"
    )
)
We will walk through this code block step by step in the next sections. The goal of this post is to illustrate the pyvespa API that will enable you to create your own application, rather than focusing on how to create an effective application or on choosing the best transformer model to use for embedding generation.
We will use version 0.2.0 here, which can be installed via pip:
pip install pyvespa==0.2.0
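If you want to confirm which version was installed, a quick check from Python (optional, just a sanity check) is:
import pkg_resources

# Print the installed pyvespa version; it should be 0.2.0 for this post
print(pkg_resources.get_distribution("pyvespa").version)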
The first step is to create an application package with the ApplicationPackage class. We will name our application msmarco because we want to build a full document ranking application based on the MS MARCO dataset.
from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name = "msmarco")
We will then add three fields to our application. Each document will have a unique id, a title, and a 768-dimensional vector named title_bert that will store the title embedding. The enable-bm25 index setting makes the bm25 rank feature available for the title field.
from vespa.package import Field
app_package.schema.add_fields(
    Field(
        name = "id", type = "string",
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    ),
    Field(
        name = "title_bert", type = "tensor<float>(x[768])",
        indexing = ["attribute"]
    )
)
We also need to define a query ranking feature to hold the query embedding at query time. The QueryTypeField below declares that the query property ranking.features.query(title_bert) is a 768-dimensional float tensor:
from vespa.package import QueryTypeField
app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(title_bert)",
        type="tensor<float>(x[768])"
    )
)
Now that we have defined a document and a query embedding in the application, we can use them in a RankProfile to rank the documents matched by the query. In this case we will define a dot-product between the query embedding query(title_bert) and the document embedding attribute(title_bert).
from vespa.package import RankProfile
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bert_title",
        first_phase = "sum(query(title_bert)*attribute(title_bert))"
    )
)
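The first-phase expression sums the element-wise product of the two tensors over the embedding dimension. Since we will L2-normalize both the document and the query embeddings before using them, this dot-product is the cosine similarity between the two vectors. A small numpy sketch with random vectors (purely illustrative, not part of the application) makes that explicit:
import numpy as np

# Stand-ins for query(title_bert) and attribute(title_bert): two random,
# L2-normalized 768-dimensional vectors
q = np.random.rand(768)
q = q / np.linalg.norm(q)
d = np.random.rand(768)
d = d / np.linalg.norm(d)

# sum(query(title_bert) * attribute(title_bert)) computed in numpy
score = float(np.sum(q * d))
assert np.isclose(score, np.dot(q, d))  # a dot-product, i.e. cosine similarity for unit vectors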
We can now deploy the application to a Docker container running locally:
import os
from vespa.package import VespaDocker

vespa_docker = VespaDocker(port=8089)

os.environ["WORK_DIR"] = "/Users/tmartins"
disk_folder = os.path.join(os.getenv("WORK_DIR"), "sample_application")

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder=disk_folder
)
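Note that disk_folder is hard-coded to the author's home directory above; it is simply the local folder where pyvespa writes the application package files before deploying. If you are following along, any writable path works, for example a temporary directory (a small variation on the original code):
import tempfile

# Use a throwaway directory instead of a hard-coded path
disk_folder = tempfile.mkdtemp(prefix="sample_application_")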
We now load one of the many pre-trained models available in the sentence-transformers library:
from sentence_transformers import SentenceTransformer
bert_model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
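This particular model produces 768-dimensional sentence embeddings, matching the tensor<float>(x[768]) type declared in the schema. An optional quick check:
# Encode a sample sentence and confirm the embedding dimension
sample_embedding = bert_model.encode(["a sample title"])[0]
print(sample_embedding.shape)  # expected: (768,)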
We then define a function that takes a text as input and returns a normalized vector of floats as output:
import numpy as np

def normalized_bert_encoder(text):
    # Encode the text and L2-normalize the resulting embedding
    vector = bert_model.encode([text])[0]
    norm = np.linalg.norm(vector)
    if norm > 0.0:
        vector = vector / norm
    return vector.tolist()
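We can sanity-check the function on a sample title; the output should be a plain Python list of 768 floats with unit norm:
vector = normalized_bert_encoder("What is science?")
print(len(vector))             # 768
print(np.linalg.norm(vector))  # approximately 1.0 after normalization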
Next, we load a small sample of MS MARCO documents and remove those with empty titles:
from pandas import read_csv
docs = read_csv("https://thigm85.github.io/data/msmarco/docs_100.tsv", sep = "\t")
docs = docs[docs['title'].str.strip().astype(bool)] # remove empty titles
docs.shape
docs.head(2)
We can now feed the documents to the application, generating the normalized title embedding for each one:
for idx, row in docs.iterrows():
    response = app.feed_data_point(
        schema = "msmarco",
        data_id = row["id"],
        fields = {
            "id": row["id"],
            "title": row["title"],
            "title_bert": {"values": normalized_bert_encoder(row["title"])}
        }
    )
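To verify that the documents were accepted, we can inspect the response from the last feed operation. This is a sketch that assumes the returned object exposes the underlying HTTP status code and JSON body, which may differ across pyvespa versions:
# Assumption: feed_data_point returns an HTTP response object in this pyvespa version
print(response.status_code)  # 200 means the document was accepted by the Document API
print(response.json())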
The last step is to define a query model. It computes the query embedding with normalized_bert_encoder, matches documents by the union of a WeakAnd operator over the title field and an approximate nearest neighbor (ANN) search over the title embedding, and ranks the matched documents with the bert_title rank profile:
from vespa.query import QueryModel, QueryRankingFeature, Union, WeakAnd, ANN, RankProfile

query_model = QueryModel(
    query_properties=[QueryRankingFeature(name="title_bert", mapping=normalized_bert_encoder)],
    match_phase=Union(
        WeakAnd(field="title", hits=10),
        ANN(
            doc_vector="title_bert",
            query_vector="title_bert",
            hits=10,
            label="ann_title"
        )
    ),
    rank_profile=RankProfile(name="bert_title")
)
At this point we can query our application:
query_results = app.query(query="What is science?", query_model=query_model, debug_request=False)
query_results.hits[0]
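Each hit follows Vespa's standard result format, with an id, a relevance score (here, the dot-product computed by the bert_title rank profile) and the requested summary fields. For example, to list the top titles and scores (a small sketch; the fields present depend on the document summary returned):
for hit in query_results.hits[:5]:
    # relevance holds the first-phase score; fields holds the document summary
    print(hit["relevance"], hit["fields"].get("title"))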