Build a news recommendation app in Python with Vespa
Part 2 - From news search to news recommendation with embeddings
- Add a user schema
- Recommendation using embeddings
- Query Profile Type
- Redeploy the application
- Feeding and partial updates: news and user embeddings
- Fetch the user embedding
- Get recommendations
In this part, we'll start transforming our application from news search to news recommendation using the embeddings created in this tutorial. An embedding vector will represent each user and news article. To make this post easier to follow, we make the embeddings available for download. When a user arrives, we retrieve their embedding and use it to fetch the closest news articles via an approximate nearest neighbor (ANN) search. We also show that Vespa can jointly apply general filtering and ANN search, unlike competing alternatives available in the market.
We assume that you have followed the news search tutorial. Therefore, you should have an app_package variable holding the news search app definition and a Docker container named news running a search application fed with news articles from the demo version of the MIND dataset.
We need to add another document type to represent a user. We set up the schema to search for a user_id and retrieve the user's embedding vector.
from vespa.package import Schema, Document, Field
app_package.add_schema(
    Schema(
        name="user",
        document=Document(
            fields=[
                Field(
                    name="user_id",
                    type="string",
                    indexing=["summary", "attribute"],
                    attribute=["fast-search"]
                ),
                Field(
                    name="embedding",
                    type="tensor<float>(d0[51])",
                    indexing=["summary", "attribute"]
                )
            ]
        )
    )
)
We build an index for the attribute field user_id by specifying the fast-search attribute. Remember that attribute fields are held in memory and are not indexed by default.
The embedding field is a tensor field. Tensors in Vespa are flexible multi-dimensional data structures and, as first-class citizens, can be used in queries, document fields, and constants in ranking. Tensors can be either dense or sparse or both and can contain any number of dimensions. Please see the tensor user guide for more information. Here we have defined a dense tensor with a single dimension (d0 - dimension 0), representing a vector. 51 is the size of the embeddings used in this post.
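In Python terms, a value for such a field is just a fixed-length vector. A minimal sketch of the cell layout of a tensor<float>(d0[51]), matching the schema above (the values here are made up for illustration):

```python
# A dense tensor of type tensor<float>(d0[51]) is a fixed-size vector:
# 51 float cells, each addressed by an index along the single dimension d0.
embedding = [0.0] * 51   # one cell per index 0..50 along d0
embedding[0] = 0.12      # the cell at address {d0: 0}

assert len(embedding) == 51
```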
We now have one schema for the news and one schema for the user.
[schema.name for schema in app_package.schemas]
Similarly to the user schema, we will use a dense tensor to represent the news embeddings. But unlike the user embedding field, we will index the news embedding by including index in the indexing argument, and we specify that the index should be built with the HNSW (hierarchical navigable small world) algorithm, using the euclidean distance metric. Read this blog post to learn more about Vespa's journey to implement ANN search.
from vespa.package import Field, HNSW
app_package.get_schema(name="news").add_fields(
    Field(
        name="embedding",
        type="tensor<float>(d0[51])",
        indexing=["attribute", "index"],
        ann=HNSW(distance_metric="euclidean")
    )
)
Next, we add a rank profile with a ranking expression using the closeness rank feature, which turns the euclidean distance into a score so that closer news articles rank higher. This rank profile depends on the nearestNeighbor search operator, which we'll get back to below when searching. For now, note that it expects a tensor in the query to use as the initial search point.
from vespa.package import RankProfile
app_package.get_schema(name="news").add_rank_profile(
    RankProfile(
        name="recommendation",
        inherits="default",
        first_phase="closeness(field, embedding)"
    )
)
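For the euclidean distance metric, the closeness feature is defined as 1 / (1 + distance), so closer documents get higher scores. A plain-Python sketch of the score the first phase computes (an illustration of the formula, not Vespa's implementation):

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closeness(a, b):
    # Vespa's closeness for the euclidean metric: 1 / (1 + distance).
    # Identical vectors score 1.0; farther vectors score lower.
    return 1.0 / (1.0 + euclidean_distance(a, b))

print(closeness([0.0, 0.0], [0.0, 0.0]))  # 1.0
print(closeness([0.0, 0.0], [3.0, 4.0]))  # 1 / (1 + 5) ≈ 0.1667
```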
The recommendation rank profile above requires that we send a tensor along with the query. For Vespa to bind the correct types, it needs to know the expected type of this query parameter.
from vespa.package import QueryTypeField
app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(user_embedding)",
        type="tensor<float>(d0[51])"
    )
)
This query profile type instructs Vespa to expect a float tensor with dimension d0[51] when the query parameter ranking.features.query(user_embedding) is passed. We'll see how this works together with the nearestNeighbor search operator below.
We have made all the required changes to turn our news search app into a news recommendation app. We can now redeploy the app_package to our running container named news.
from vespa.package import VespaDocker
vespa_docker = VespaDocker.from_container_name_or_id("news")
app = vespa_docker.deploy(application_package=app_package)
app.deployment_message
To keep this tutorial easy to follow, we make the parsed embeddings available for download. To build them yourself, please follow this tutorial.
import requests, json
user_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_user_embeddings_parsed.json").text
)
news_embeddings = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_news_embeddings_parsed.json").text
)
We just created the user schema, so we need to feed user data for the first time.
for user_embedding in user_embeddings:
    response = app.feed_data_point(
        schema="user",
        data_id=user_embedding["user_id"],
        fields=user_embedding
    )
For the news documents, we just need to update the embedding field added to the news schema.
for news_embedding in news_embeddings:
    response = app.update_data(
        schema="news",
        data_id=news_embedding["news_id"],
        fields={"embedding": news_embedding["embedding"]}
    )
Next, we create a query_user_embedding function to retrieve the user embedding by user_id. You could do this more efficiently using a Vespa Searcher as described here, but keeping everything in Python at this point makes learning easier.
def parse_embedding(hit_json):
    embedding_json = hit_json["fields"]["embedding"]["cells"]
    embedding_vector = [0.0] * len(embedding_json)
    for val in embedding_json:
        embedding_vector[int(val["address"]["d0"])] = val["value"]
    return embedding_vector

def query_user_embedding(user_id):
    result = app.query(body={"yql": "select * from sources user where user_id contains '{}';".format(user_id)})
    embedding = parse_embedding(result.hits[0])
    return embedding
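To make the cell format concrete, here is a hypothetical hit with a three-dimensional embedding (real hits carry 51 cells); parse_embedding is repeated from above so the example is self-contained:

```python
def parse_embedding(hit_json):
    # Convert Vespa's indexed-tensor cell list into a plain list of floats,
    # placing each value at the index given by its d0 address.
    embedding_json = hit_json["fields"]["embedding"]["cells"]
    embedding_vector = [0.0] * len(embedding_json)
    for val in embedding_json:
        embedding_vector[int(val["address"]["d0"])] = val["value"]
    return embedding_vector

# Hypothetical hit with only three cells, for illustration.
sample_hit = {
    "fields": {
        "embedding": {
            "cells": [
                {"address": {"d0": "0"}, "value": 0.1},
                {"address": {"d0": "1"}, "value": -0.2},
                {"address": {"d0": "2"}, "value": 0.3},
            ]
        }
    }
}

print(parse_embedding(sample_hit))  # [0.1, -0.2, 0.3]
```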
The function will query Vespa, retrieve the embedding, and parse it into a list of floats. Here are the first five elements of user U63195's embedding.
query_user_embedding(user_id="U63195")[:5]
The following yql instructs Vespa to select the title and the category from the ten news documents closest to the user embedding.
yql = "select title, category from sources news where ([{'targetHits': 10}]nearestNeighbor(embedding, user_embedding));"
We also specify that we want to rank those documents by the recommendation rank profile that we defined earlier, and we send the user embedding via the query parameter ranking.features.query(user_embedding) that we also defined in our app_package.
result = app.query(
    body={
        "yql": yql,
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)
Here are the first two hits out of the ten returned.
result.hits[0:2]
Vespa ANN search is fully integrated into the Vespa query tree. This integration means that we can include query filters and the ANN search will be applied only to documents that satisfy the filters. No need to do pre- or post-processing involving filters.
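A toy sketch (plain Python, nothing Vespa-specific) of why this matters: post-filtering an ANN result can leave you with fewer hits than requested, while filtering during the search always ranks enough matching candidates.

```python
# 100 toy documents; lower "score" means "nearer". Every fifth one is sports.
docs = [
    {"id": i, "category": "sports" if i % 5 == 0 else "news", "score": i}
    for i in range(100)
]

def top_k(candidates, k):
    # Stand-in for a nearest-neighbor search: the k nearest candidates.
    return sorted(candidates, key=lambda d: d["score"])[:k]

# Post-filtering: take the 10 nearest documents, then filter by category.
post_filtered = [d for d in top_k(docs, 10) if d["category"] == "sports"]

# Integrated filtering: restrict candidates first, then take the 10 nearest.
integrated = top_k([d for d in docs if d["category"] == "sports"], 10)

print(len(post_filtered))  # 2  (only docs 0 and 5 survive the filter)
print(len(integrated))     # 10
```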
The following yql searches over news documents that have sports as their category.
yql = "select title, category from sources news where " \
      "([{'targetHits': 10}]nearestNeighbor(embedding, user_embedding)) AND " \
      "category contains 'sports';"
result = app.query(
    body={
        "yql": yql,
        "hits": 10,
        "ranking.features.query(user_embedding)": str(query_user_embedding(user_id="U63195")),
        "ranking.profile": "recommendation"
    }
)
Here are the first two hits out of the ten returned. Notice the category field.
result.hits[0:2]