Semantic search with embedding representations from Python
Build and interact with an MS MARCO search application
- Install pyvespa
- Define the application
- Deploy the application
- Load the BERT model
- Feed data
- Define a query model
After the release of pyvespa version 0.2.0, we can now create a semantic search application based on embedding representations from Python with just a few lines of code:
from vespa.package import ApplicationPackage, Field, QueryTypeField, RankProfile

app_package = ApplicationPackage(name = "msmarco")

app_package.schema.add_fields(
    Field(
        name = "id", type = "string",
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    ),
    Field(
        name = "title_bert", type = "tensor<float>(x[768])",
        indexing = ["attribute"]
    )
)

app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(title_bert)",
        type="tensor<float>(x[768])"
    )
)

app_package.schema.add_rank_profile(
    RankProfile(
        name = "bert_title",
        first_phase = "sum(query(title_bert)*attribute(title_bert))"
    )
)
We will walk through this code block step by step in the next sections. The goal of this post is to illustrate the pyvespa API that will enable you to create your own application, rather than focusing on how to create an effective application or on choosing the best transformer model to use for embedding generation.
We will use version 0.2.0 here, which can be installed via pip:
pip install pyvespa==0.2.0
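If you want to confirm which version was installed, a quick check from Python (optional, just a sanity check) is:
import pkg_resources

# Print the installed pyvespa version; it should be 0.2.0 for this post
print(pkg_resources.get_distribution("pyvespa").version)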
The first step is to create an application package with the ApplicationPackage class. We will name our application msmarco because we want to build a full document ranking application based on the MS MARCO dataset.
from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name = "msmarco")
We will then add three fields to our application. Each document will have a unique id, a title, and a 768-dimensional vector named title_bert that will store the title embedding. The enable-bm25 index setting makes the bm25 rank feature available for the title field.
from vespa.package import Field
app_package.schema.add_fields(
    Field(
        name = "id", type = "string",
        indexing = ["attribute", "summary"]
    ),
    Field(
        name = "title", type = "string",
        indexing = ["index", "summary"],
        index = "enable-bm25"
    ),
    Field(
        name = "title_bert", type = "tensor<float>(x[768])",
        indexing = ["attribute"]
    )
)
We also need to define a query ranking feature to hold the query embedding at query time. The QueryTypeField below declares that the query property ranking.features.query(title_bert) is a 768-dimensional float tensor:
from vespa.package import QueryTypeField
app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(title_bert)",
        type="tensor<float>(x[768])"
    )
)
Now that we have defined a document and a query embedding in the application, we can use them in a RankProfile to rank the documents matched by the query. In this case we will define a dot-product between the query embedding query(title_bert) and the document embedding attribute(title_bert).
from vespa.package import RankProfile
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bert_title",
        first_phase = "sum(query(title_bert)*attribute(title_bert))"
    )
)
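The first-phase expression sums the element-wise product of the two tensors over the embedding dimension. Since we will L2-normalize both the document and the query embeddings before using them, this dot-product is the cosine similarity between the two vectors. A small numpy sketch with random vectors (purely illustrative, not part of the application) makes that explicit:
import numpy as np

# Stand-ins for query(title_bert) and attribute(title_bert): two random,
# L2-normalized 768-dimensional vectors
q = np.random.rand(768)
q = q / np.linalg.norm(q)
d = np.random.rand(768)
d = d / np.linalg.norm(d)

# sum(query(title_bert) * attribute(title_bert)) computed in numpy
score = float(np.sum(q * d))
assert np.isclose(score, np.dot(q, d))  # a dot-product, i.e. cosine similarity for unit vectors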
We can now deploy the application to a Docker container running locally:
import os
from vespa.package import VespaDocker

vespa_docker = VespaDocker(port=8089)

os.environ["WORK_DIR"] = "/Users/tmartins"
disk_folder = os.path.join(os.getenv("WORK_DIR"), "sample_application")

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder=disk_folder
)
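Note that disk_folder is hard-coded to the author's home directory above; it is simply the local folder where pyvespa writes the application package files before deploying. If you are following along, any writable path works, for example a temporary directory (a small variation on the original code):
import tempfile

# Use a throwaway directory instead of a hard-coded path
disk_folder = tempfile.mkdtemp(prefix="sample_application_")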
We now load one of the many pre-trained models available in the sentence-transformers library:
from sentence_transformers import SentenceTransformer
bert_model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
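This particular model produces 768-dimensional sentence embeddings, matching the tensor<float>(x[768]) type declared in the schema. An optional quick check:
# Encode a sample sentence and confirm the embedding dimension
sample_embedding = bert_model.encode(["a sample title"])[0]
print(sample_embedding.shape)  # expected: (768,)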
We then define a function that takes a text as input and returns a normalized vector of floats as output:
import numpy as np

def normalized_bert_encoder(text):
    # Encode the text and L2-normalize the resulting embedding
    vector = bert_model.encode([text])[0]
    norm = np.linalg.norm(vector)
    if norm > 0.0:
        vector = vector / norm
    return vector.tolist()
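We can sanity-check the function on a sample title; the output should be a plain Python list of 768 floats with unit norm:
vector = normalized_bert_encoder("What is science?")
print(len(vector))             # 768
print(np.linalg.norm(vector))  # approximately 1.0 after normalization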
Next, we load a small sample of MS MARCO documents and remove those with empty titles:
from pandas import read_csv
docs = read_csv("https://thigm85.github.io/data/msmarco/docs_100.tsv", sep = "\t")
docs = docs[docs['title'].str.strip().astype(bool)] # remove empty titles
docs.shape
docs.head(2)
We can now feed the documents to the application, generating the normalized title embedding for each one:
for idx, row in docs.iterrows():
    response = app.feed_data_point(
        schema = "msmarco",
        data_id = row["id"],
        fields = {
            "id": row["id"],
            "title": row["title"],
            "title_bert": {"values": normalized_bert_encoder(row["title"])}
        }
    )
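To verify that the documents were accepted, we can inspect the response from the last feed operation. This is a sketch that assumes the returned object exposes the underlying HTTP status code and JSON body, which may differ across pyvespa versions:
# Assumption: feed_data_point returns an HTTP response object in this pyvespa version
print(response.status_code)  # 200 means the document was accepted by the Document API
print(response.json())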
The last step is to define a query model. It computes the query embedding with normalized_bert_encoder, matches documents by the union of a WeakAnd operator over the title field and an approximate nearest neighbor (ANN) search over the title embedding, and ranks the matched documents with the bert_title rank profile:
from vespa.query import QueryModel, QueryRankingFeature, Union, WeakAnd, ANN, RankProfile

query_model = QueryModel(
    query_properties=[QueryRankingFeature(name="title_bert", mapping=normalized_bert_encoder)],
    match_phase=Union(
        WeakAnd(field="title", hits=10),
        ANN(
            doc_vector="title_bert",
            query_vector="title_bert",
            hits=10,
            label="ann_title"
        )
    ),
    rank_profile=RankProfile(name="bert_title")
)
At this point we can query our application:
query_results = app.query(query="What is science?", query_model=query_model, debug_request=False)
query_results.hits[0]
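Each hit follows Vespa's standard result format, with an id, a relevance score (here, the dot-product computed by the bert_title rank profile) and the requested summary fields. For example, to list the top titles and scores (a small sketch; the fields present depend on the document summary returned):
for hit in query_results.hits[:5]:
    # relevance holds the first-phase score; fields holds the document summary
    print(hit["relevance"], hit["fields"].get("title"))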