Build a News recommendation app from Python with Vespa
Part 1 - News search functionality
- Dataset
- Install pyvespa
- Create the search app
- Deploy the app on Docker
- Feed data to the app
- Query the app
- Use news popularity signal for ranking
We will build a news recommendation app in Vespa without leaving a Python environment. In this first part of the series, we develop an application with basic search functionality. Future posts will add recommendation capabilities based on embeddings and other ML models. This series is a simplified version of Vespa's News search and recommendation tutorial. We will also use the demo version of the Microsoft News Dataset (MIND) so that anyone can follow along on their laptop.
The original Vespa news search tutorial provides a script to download, parse and convert the MIND dataset to Vespa format. To make things easier for you, we made the final parsed data required for this tutorial available for download:
import requests, json
data = json.loads(
    requests.get("https://thigm85.github.io/data/mind/mind_demo_fields_parsed.json").text
)
data[0]
The final parsed data used here is a list where each element is a dictionary containing relevant fields about a news article, such as title and category. We also have information about the number of impressions and clicks the article has received. The demo version of the MIND dataset includes 28,603 news articles.
len(data)
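Since each record is later fed to Vespa as-is, it can help to check that a record carries the fields we expect before going further. The snippet below is a minimal sketch of such a check. The field names are the ones used in the schema defined later in this post, the helper missing_fields is our own, and the sample record is made up for illustration rather than taken from MIND.

```python
# Field names taken from the schema defined later in this post.
EXPECTED_FIELDS = {
    "news_id", "category", "subcategory", "title", "abstract",
    "url", "date", "clicks", "impressions",
}


def missing_fields(record, expected=EXPECTED_FIELDS):
    """Return the expected field names that are absent from a record."""
    return sorted(expected - set(record))


# Hypothetical record, made up for illustration (not taken from MIND).
sample = {
    "news_id": "N0001",
    "category": "sports",
    "subcategory": "soccer",
    "title": "Example title",
    "abstract": "Example abstract",
    "url": "https://example.com/n0001",
    "date": 20191110,
    "clicks": 3,
    "impressions": 40,
}

print(missing_fields(sample))                # nothing missing
print(missing_fields({"news_id": "N0002"}))  # every other field is missing
```

With the real dataset loaded, something like `[a["news_id"] for a in data if missing_fields(a)]` would list the records worth a closer look.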
!pip install pyvespa
Create the application package. app_package will hold all the relevant data related to your application's specification.
from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name="news")
Add fields to the schema. Here is a short description of the non-obvious arguments used below:
- indexing argument: configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing.
  - "index": Create a search index for this field.
  - "summary": Let this field be part of the document summary in the result set.
  - "attribute": Store this field in memory as an attribute, for sorting, querying and grouping.
- index argument: configures how Vespa should create the search index.
  - "enable-bm25": Set up an index compatible with bm25 ranking for text search.
- attribute argument: configures how Vespa should treat an attribute field.
  - "fast-search": Build an index for an attribute field. By default, no index is generated for attributes, and search over these defaults to a linear scan.
from vespa.package import Field

app_package.schema.add_fields(
    Field(name="news_id", type="string", indexing=["summary", "attribute"], attribute=["fast-search"]),
    Field(name="category", type="string", indexing=["summary", "attribute"]),
    Field(name="subcategory", type="string", indexing=["summary", "attribute"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="url", type="string", indexing=["index", "summary"]),
    Field(name="date", type="int", indexing=["summary", "attribute"]),
    Field(name="clicks", type="int", indexing=["summary", "attribute"]),
    Field(name="impressions", type="int", indexing=["summary", "attribute"]),
)
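To make the indexing, index and attribute arguments more concrete, here is a rough sketch of how a field definition maps to the text of a Vespa schema file. The render_field helper is our own hand-written approximation for illustration. It is not pyvespa's internal template, and the files pyvespa actually writes may differ in layout.

```python
def render_field(name, type_, indexing, index=None, attribute=None):
    """Render one field roughly as it would look in a Vespa schema file."""
    lines = [f"field {name} type {type_} {{"]
    # The indexing pipeline steps are joined with the pipe symbol.
    lines.append("    indexing: " + " | ".join(indexing))
    if index:
        lines.append(f"    index: {index}")
    for setting in attribute or []:
        lines.append(f"    attribute: {setting}")
    lines.append("}")
    return "\n".join(lines)


print(render_field("title", "string", ["index", "summary"], index="enable-bm25"))
print(render_field("news_id", "string", ["summary", "attribute"], attribute=["fast-search"]))
```

The first call prints a field with an `indexing: index | summary` line and an `index: enable-bm25` line, and the second one shows where `attribute: fast-search` ends up.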
Add a fieldset to the schema. A fieldset allows us to search over multiple fields easily. In this case, searching over the default fieldset is equivalent to searching over title and abstract.
from vespa.package import FieldSet
app_package.schema.add_field_set(
FieldSet(name="default", fields=["title", "abstract"])
)
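As a toy illustration of what the fieldset buys us, the sketch below matches a term against a fieldset and against its member fields one by one, and checks that both give the same documents. It is plain Python with naive whitespace tokenization, which is an assumption made for brevity and not how Vespa tokenizes or matches text. The documents are made up.

```python
FIELDSETS = {"default": ["title", "abstract"]}


def matches(doc, fields, term):
    """True if `term` occurs as a whitespace-separated word in any given field."""
    term = term.lower()
    return any(term in doc.get(field, "").lower().split() for field in fields)


# Hypothetical documents, made up for illustration.
docs = [
    {"title": "Live music tonight", "abstract": "A local band plays"},
    {"title": "City council vote", "abstract": "New music venue approved"},
    {"title": "Weather update", "abstract": "Rain expected"},
]

via_fieldset = [d for d in docs if matches(d, FIELDSETS["default"], "music")]
via_fields = [
    d for d in docs
    if matches(d, ["title"], "music") or matches(d, ["abstract"], "music")
]

print([d["title"] for d in via_fieldset])
print(via_fieldset == via_fields)
```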
We have enough to deploy the first version of our application. Later in this tutorial, we will include an article’s popularity in the relevance score used to rank the news that matches our queries.
If you have Docker installed on your machine, you can deploy the app_package in a local Docker container:
from vespa.package import VespaDocker

vespa_docker = VespaDocker(
    port=8080,
    container_memory="8G",
    disk_folder="/Users/tmartins/news"  # change for your desired absolute folder
)
app = vespa_docker.deploy(
    application_package=app_package,
)
vespa_docker will parse the app_package and write all the necessary Vespa config files to the disk_folder. It will then create the Docker containers and use the Vespa config files to deploy the Vespa application. We can then use the app instance to interact with the deployed application, such as for feeding and querying. If you want to know more about what happens behind the scenes, we suggest you go through this getting started with Docker tutorial.
We can use the feed_data_point method to feed data to the app. We need to specify:
- data_id: unique id to identify the data point.
- fields: dictionary with keys matching the field names defined in our application package schema.
- schema: name of the schema we want to feed data to. Creating the application package also created a default schema with the same name as the application, news in our case.
for article in data:
    # Feed one article at a time, using news_id as the document id.
    res = app.feed_data_point(
        data_id=article["news_id"],
        fields=article,
        schema="news"
    )
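The loop above overwrites res on every iteration, so a failed feed operation can go unnoticed. Below is a minimal sketch of a wrapper that tallies the responses instead. It assumes that the object returned by feed_data_point exposes a status_code attribute, which you should verify against the pyvespa version you have installed. FakeResponse and fake_feed are stand-ins we made up so the example runs without a deployed app.

```python
from collections import Counter


def feed_all(feed_fn, articles, schema="news"):
    """Feed every article and tally the status codes that come back."""
    statuses = Counter()
    for article in articles:
        response = feed_fn(
            data_id=article["news_id"],
            fields=article,
            schema=schema,
        )
        # Assumption: the response object has a `status_code` attribute.
        statuses[response.status_code] += 1
    return statuses


class FakeResponse:
    """Stand-in for a feed response, for illustration only."""

    def __init__(self, status_code):
        self.status_code = status_code


def fake_feed(data_id, fields, schema):
    """Pretend feed function: rejects articles that have an empty title."""
    return FakeResponse(200 if fields.get("title") else 400)


articles = [
    {"news_id": "N0001", "title": "Example title"},
    {"news_id": "N0002", "title": ""},
]
print(feed_all(fake_feed, articles))
```

With a deployed application, the call would look like `feed_all(app.feed_data_point, data)`.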
We can use the Vespa Query API through app.query to unlock the full query flexibility Vespa can offer.
Select all the fields from documents where default (title or abstract) contains the keyword 'music'.
res = app.query(body={"yql" : "select * from sources * where default contains 'music';"})
res.hits[0]
Select title and abstract where title contains 'music' and default contains 'festival'.
res = app.query(body = {"yql" : "select title, abstract from sources * where title contains 'music' AND default contains 'festival';"})
res.hits[0]
Select the title of all the documents with document type equal to news. Our application has only one document type, so the query below retrieves all our documents.
res = app.query(body = {"yql" : "select title from sources * where sddocname contains 'news';"})
res.hits[0]
Since date is not specified with attribute=["fast-search"], there is no index built for it. Therefore, searching over it is equivalent to doing a linear scan over the values of the field.
res = app.query(body={"yql" : "select title, date from sources * where date contains '20191110';"})
res.hits[0]
Since the default fieldset is formed by indexed fields, Vespa will first narrow the candidates down to the documents that contain the keyword 'weather' within title or abstract, before scanning the date field for '20191110'.
res = app.query(body={"yql" : "select title, abstract, date from sources * where default contains 'weather' AND date contains '20191110';"})
res.hits[0]
We can also perform range searches:
res = app.query(body={"yql" : "select date from sources * where date <= 20191110 AND date >= 20191108;"})
res.hits[0]
By default, Vespa sorts the hits by descending relevance score. The relevance score is given by the nativeRank unless something else is specified, as we will do later in this post.
res = app.query(body={"yql" : "select title, date from sources * where default contains 'music';"})
res.hits[:2]
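To see the ordering at a glance, it can be handy to pull just the title and relevance out of each hit. The sketch below assumes each hit is a dictionary with a "relevance" key and a "fields" dictionary, as in Vespa's default JSON result format. The helper names are ours, and the two hits are made up so the example runs without a deployed app.

```python
def summarize(hits, field="title"):
    """Return (field value, relevance) pairs in the order the hits were returned."""
    return [(hit["fields"].get(field), hit["relevance"]) for hit in hits]


def sorted_by_relevance(hits):
    """True if the hits are in descending relevance order."""
    scores = [hit["relevance"] for hit in hits]
    return scores == sorted(scores, reverse=True)


# Hypothetical hits with made-up titles and scores.
hits = [
    {"relevance": 0.25, "fields": {"title": "Music festival announces lineup", "date": 20191110}},
    {"relevance": 0.125, "fields": {"title": "Local music teacher retires", "date": 20191109}},
]

print(summarize(hits))
print(sorted_by_relevance(hits))
```

Against a live query result, `summarize(res.hits[:2])` would give the same kind of compact view.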
However, we can explicitly order by a given field with the order by keyword.
res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date;"})
res.hits[:2]
order by sorts in ascending order by default. We can override that with the desc keyword:
res = app.query(body={"yql" : "select title, date from sources * where default contains 'music' order by date desc;"})
res.hits[:2]
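The queries so far all follow the same select, where, order by pattern, so a small helper can assemble them. This is only a sketch: build_yql is a hypothetical function of our own, not part of pyvespa, and it does no escaping of the values placed in the where clause, so it should not be fed untrusted input.

```python
def build_yql(fields, where, order_by=None, descending=False):
    """Assemble a YQL string in the style of the queries used in this post."""
    select = ", ".join(fields) if fields else "*"
    yql = f"select {select} from sources * where {where}"
    if order_by:
        yql += f" order by {order_by}"
        if descending:
            yql += " desc"
    return yql + ";"


body = {
    "yql": build_yql(
        ["title", "date"],
        "default contains 'music'",
        order_by="date",
        descending=True,
    )
}
print(body["yql"])
```

The resulting dictionary can be passed along as `app.query(body=body)`, and it reproduces the descending date query shown above.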
We can use Vespa's grouping feature to compute the three news categories with the highest document counts:
- news with 9115 articles
- sports with 6765 articles
- finance with 1886 articles
res = app.query(body={"yql" : "select * from sources * where sddocname contains 'news' limit 0 | all(group(category) max(3) order(-count())each(output(count())));"})
res.hits[0]
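For intuition, the grouping expression above does roughly what the following plain-Python sketch does on the client side: count documents per category and keep the three largest groups. The difference is that Vespa computes this inside the engine without returning the documents. The sample records here are made up. Run over the full data list, the result should agree with the counts listed above, provided every article was fed successfully.

```python
from collections import Counter


def top_categories(articles, n=3):
    """Count articles per category and return the n largest groups."""
    counts = Counter(article["category"] for article in articles)
    return counts.most_common(n)


# Hypothetical records, made up for illustration.
sample = (
    [{"category": "news"}] * 4
    + [{"category": "sports"}] * 3
    + [{"category": "finance"}] * 2
    + [{"category": "travel"}]
)

print(top_categories(sample))
```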
Vespa uses nativeRank to compute relevance scores by default. We will create a new rank-profile that includes a popularity signal in our relevance score computation.
from vespa.package import RankProfile, Function

app_package.schema.add_rank_profile(
    RankProfile(
        name="popularity",
        inherits="default",
        functions=[
            Function(
                name="popularity",
                expression="if (attribute(impressions) > 0, attribute(clicks) / attribute(impressions), 0)"
            )
        ],
        first_phase="nativeRank(title, abstract) + 10 * popularity"
    )
)
Our new rank-profile will be called popularity. Here is a breakdown of what is included above:
- inherits="default"
  This configures Vespa to create a new rank profile named popularity, which inherits all the default rank-profile properties; only properties that are explicitly defined, or overridden, will differ from those of the default rank-profile.
- function popularity
  This sets up a function that can be called from other expressions. The function divides the number of clicks by the number of impressions as an indicator of popularity. This isn’t really the best way of calculating popularity, as an article with a low number of impressions can score high on such a value even though uncertainty is high. But it is a start :)
- first-phase
  Relevance calculations in Vespa are two-phased. The calculations done in the first phase are performed on every single document matching your query. In contrast, the second phase calculations are only done on the top n documents as determined by the calculations done in the first phase. We are just going to use the first-phase for now.
- expression: nativeRank(title, abstract) + 10 * popularity
  This expression is used to rank documents. Here, the nativeRank of the title and abstract fields (the fields in the default fieldset) is included to keep the results relevant to the query, while the second term calls the popularity function. The weighted sum of these two terms is the final relevance for each document. Note that the weight here, 10, is set by observation. A better approach would be to learn such values using machine learning, which we'll get back to in future posts.
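The arithmetic behind this rank-profile is easy to reproduce outside Vespa. The sketch below mirrors the popularity function and the first-phase expression in plain Python. The nativeRank values are made-up inputs, since the real ones are computed by Vespa per query and document.

```python
def popularity(clicks, impressions):
    """Clicks divided by impressions, or 0 when there are no impressions."""
    return clicks / impressions if impressions > 0 else 0.0


def first_phase(native_rank, clicks, impressions, weight=10):
    """Mirror of: nativeRank(title, abstract) + 10 * popularity."""
    return native_rank + weight * popularity(clicks, impressions)


# Same (made-up) text match score, different click statistics.
print(first_phase(0.25, clicks=50, impressions=100))  # well-observed article
print(first_phase(0.25, clicks=1, impressions=1))     # a single lucky impression
print(first_phase(0.25, clicks=0, impressions=0))     # no impressions yet
```

The second article outranks the first (10.25 versus 5.25) on the strength of one click out of one impression, which is the weakness of the plain clicks over impressions ratio mentioned above.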
Since we have changed the application package, we need to redeploy our application:
app = vespa_docker.deploy(
application_package=app_package,
)
app.deployment_message
When the redeployment is complete, we can use the new rank-profile to rank the matched documents by setting the ranking argument.
res = app.query(body={
    "yql" : "select * from sources * where default contains 'music';",
    "ranking" : "popularity"
})
res.hits[0]