Build a sentence/paragraph-level QA application from Python with Vespa
Retrieve paragraph- and sentence-level information with sparse and dense ranking features
- About the data
- Create and deploy the application
- Feed the data
- Sentence level retrieval
- Sentence level hybrid retrieval
- Paragraph level retrieval
- Conclusion and future work
We will walk through the steps necessary to create a question answering (QA) application that can retrieve sentence- or paragraph-level answers based on a combination of semantic and/or term-based search. We start by discussing the dataset used and the question and sentence embeddings generated for semantic search. We then include the steps necessary to create and deploy a Vespa application to serve the answers. We make all the required data available to feed the application and show how to query for sentence- and paragraph-level answers based on a combination of semantic and term-based search.
This tutorial is based on earlier work by the Vespa team to reproduce the results of the paper ReQA: An Evaluation for End-to-End Answer Retrieval Models by Ahmad et al. using the Stanford Question Answering Dataset (SQuAD) v1.1 dataset.
We are going to use the Stanford Question Answering Dataset (SQuAD) v1.1. The data contains paragraphs (denoted here as context), and each paragraph has questions that can be answered from the associated paragraph text. We have parsed the dataset and organized the data used in this tutorial to make it easier to follow along.
import requests, json
context_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_context_data.json").text
)
Each context data point contains a context_id that uniquely identifies a paragraph, a text field holding the paragraph string, and a questions field holding a list of question ids that can be answered from the paragraph text. We also include a dataset field to identify the data source if we want to index more than one dataset in our application.
context_data[0]
According to the data point above, context_id = 0 can be used to answer the questions with id = [0, 1, 2, 3, 4]. We can load the file containing the questions and display those first five questions.
from pandas import read_csv
# Note that squad_queries.txt is approx. 1 GB due to the 512-sized question embeddings
questions = read_csv(
    filepath_or_buffer="https://data.vespa.oath.cloud/blog/qa/squad_queries.txt",
    sep="\t",
    names=["question_id", "question", "number_answers", "embedding"]
)
questions[["question_id", "question"]].head()
To build a more accurate application, we can break the paragraphs down into sentences. For example, the first sentence below comes from the paragraph with context_id = 0 and can answer the question with question_id = 4.
# Note that qa_squad_sentence_data.json is approx. 1 GB due to the 512-sized sentence embeddings
sentence_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_sentence_data.json").text
)
{k: sentence_data[0][k] for k in ["text", "dataset", "questions", "context_id"]}
We want to combine semantic (dense) and term-based (sparse) signals to answer the questions sent to our application. To implement the semantic search, we have generated embeddings for both the questions and the sentences, each of size 512.
questions[["question_id", "embedding"]].head(1)
sentence_data[0]["sentence_embedding"]["values"][0:5] # display the first five elements
Here is the script containing the code that we used to generate the sentence and question embeddings. We used Google's Universal Sentence Encoder at the time, but feel free to replace it with embeddings generated by your preferred model.
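If you want to generate embeddings yourself, the sketch below shows the general idea, assuming the publicly available Universal Sentence Encoder module on TensorFlow Hub (the exact preprocessing in the script above may differ):
import tensorflow_hub as hub
# Load the Universal Sentence Encoder; it maps text to 512-dimensional vectors.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
# Encode a batch of questions or sentences; each row is one 512-dim embedding.
embeddings = encoder(["this is a test question"])
print(embeddings.shape)  # (1, 512)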
We can now build a sentence-level question answering application based on the data described above.
The context schema will have a document containing the four relevant fields described in the data section. We create an index for the text field and use enable-bm25 to pre-compute data required to speed up the use of BM25 for ranking. The summary indexing indicates that all the fields will be included in the requested context documents. The attribute indexing stores the fields in memory as attributes for sorting, querying, and grouping.
from vespa.package import Document, Field
context_document = Document(
    fields=[
        Field(name="questions", type="array<int>", indexing=["summary", "attribute"]),
        Field(name="dataset", type="string", indexing=["summary", "attribute"]),
        Field(name="context_id", type="int", indexing=["summary", "attribute"]),
        Field(name="text", type="string", indexing=["summary", "index"], index="enable-bm25"),
    ]
)
The default fieldset means query tokens will be matched against the text field by default. We defined two rank profiles (bm25 and nativeRank) to illustrate that we can define and experiment with as many rank profiles as we want. You can create different ones using the ranking expressions and features available.
from vespa.package import Schema, FieldSet, RankProfile
context_schema = Schema(
    name="context",
    document=context_document,
    fieldsets=[FieldSet(name="default", fields=["text"])],
    rank_profiles=[
        RankProfile(name="bm25", inherits="default", first_phase="bm25(text)"),
        RankProfile(name="nativeRank", inherits="default", first_phase="nativeRank(text)")
    ]
)
The document of the sentence schema will inherit the fields defined in the context document to avoid unnecessary duplication of the same field types. In addition, we add the sentence_embedding field, defined to hold a one-dimensional tensor of floats of size 512. We will store the field as an attribute in memory and build an ANN index using the HNSW (hierarchical navigable small world) algorithm. Read this blog post to learn more about Vespa's journey to implement ANN search, and see the documentation for more information about the HNSW parameters.
from vespa.package import HNSW
sentence_document = Document(
    inherits="context",
    fields=[
        Field(
            name="sentence_embedding",
            type="tensor<float>(x[512])",
            indexing=["attribute", "index"],
            ann=HNSW(
                distance_metric="euclidean",
                max_links_per_node=16,
                neighbors_to_explore_at_insert=500
            )
        )
    ]
)
For the sentence schema, we define three rank profiles. The semantic-similarity profile uses the Vespa closeness ranking feature, defined as 1/(1 + distance), so that sentences with embeddings closer to the question embedding are ranked higher than sentences that are far apart. The bm25 profile is an example of a term-based rank profile, and bm25-semantic-similarity combines term-based and semantic-based signals as an example of a hybrid approach.
sentence_schema = Schema(
    name="sentence",
    document=sentence_document,
    fieldsets=[FieldSet(name="default", fields=["text"])],
    rank_profiles=[
        RankProfile(
            name="semantic-similarity",
            inherits="default",
            first_phase="closeness(sentence_embedding)"
        ),
        RankProfile(
            name="bm25",
            inherits="default",
            first_phase="bm25(text)"
        ),
        RankProfile(
            name="bm25-semantic-similarity",
            inherits="default",
            first_phase="bm25(text) + closeness(sentence_embedding)"
        )
    ]
)
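To make the closeness feature concrete, here is a plain-Python illustration (not Vespa code) of how 1/(1 + distance) turns Euclidean distances into scores, so smaller distances mean higher relevance:
# Illustration of the closeness feature used by the semantic-similarity profile:
# closeness = 1 / (1 + distance), so distance 0 gives the maximum score of 1.
for distance in [0.0, 0.5, 1.0, 2.0]:
    print(distance, "->", round(1 / (1 + distance), 3))  # 1.0, 0.667, 0.5, 0.333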
We can now define our qa application by creating an application package with both the context_schema and the sentence_schema defined above. In addition, we need to inform Vespa that we plan to send a query ranking feature named query_embedding with the same type that we used to define the sentence_embedding field.
from vespa.package import ApplicationPackage, QueryProfile, QueryProfileType, QueryTypeField
app_package = ApplicationPackage(
    name="qa",
    schema=[context_schema, sentence_schema],
    query_profile=QueryProfile(),
    query_profile_type=QueryProfileType(
        fields=[
            QueryTypeField(
                name="ranking.features.query(query_embedding)",
                type="tensor<float>(x[512])"
            )
        ]
    )
)
We can deploy the app_package in a Docker container (or to Vespa Cloud):
from vespa.package import VespaDocker
vespa_docker = VespaDocker(
    port=8081,
    container_memory="8G",
    disk_folder="/Users/username/qa_app"  # requires absolute path
)
app = vespa_docker.deploy(application_package=app_package)
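Deployment can take a little while; a simple sanity check, assuming the container exposes port 8081 as configured above, is to hit Vespa's application status endpoint:
# Sanity check: the status endpoint answers once the application is up.
response = requests.get("http://localhost:8081/ApplicationStatus")
print(response.status_code)  # expect 200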
Once deployed, we can use the Vespa instance app to interact with the application. We can start by feeding context and sentence data.
for idx, sentence in enumerate(sentence_data):
    app.feed_data_point(schema="sentence", data_id=idx, fields=sentence)
for context in context_data:
    app.feed_data_point(schema="context", data_id=context["context_id"], fields=context)
The query below sends the first question embedding (questions.loc[0, "embedding"]) through the ranking.features.query(query_embedding) parameter and uses the nearestNeighbor search operator to retrieve the closest 100 sentences in embedding space, using Euclidean distance as configured in the HNSW settings. The sentences returned will be ranked by the semantic-similarity rank profile defined in the sentence schema.
result = app.query(body={
    'yql': 'select * from sources sentence where ([{"targetNumHits":100}]nearestNeighbor(sentence_embedding,query_embedding));',
    'hits': 100,
    'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
    'ranking.profile': 'semantic-similarity'
})
result.hits[0]
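Each hit carries the computed relevance, here the closeness score, alongside the document fields. For example:
# The top hit includes the rank score and the summary fields defined in the schema.
top_hit = result.hits[0]
print(top_hit["relevance"])       # closeness(sentence_embedding) for this sentence
print(top_hit["fields"]["text"])  # the retrieved sentence text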
In addition to sending the query embedding, we can send the question string (questions.loc[0, "question"]) via the query parameter and use the or operator to retrieve documents that satisfy either the semantic operator nearestNeighbor or the term-based operator userQuery. Setting type equal to any means that the term-based operator will retrieve all the documents that match at least one query token. The retrieved documents will be ranked by the hybrid rank profile bm25-semantic-similarity.
result = app.query(body={
    'yql': 'select * from sources sentence where ([{"targetNumHits":100}]nearestNeighbor(sentence_embedding,query_embedding)) or userQuery();',
    'query': questions.loc[0, "question"],
    'type': 'any',
    'hits': 100,
    'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
    'ranking.profile': 'bm25-semantic-similarity'
})
result.hits[0]
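Because the hybrid profile sums bm25(text) and closeness(sentence_embedding), hit relevance is now on a different scale than in the pure semantic query. A small sketch for inspecting the hybrid result (number_documents_retrieved may vary by pyvespa version):
# The hybrid score mixes an unbounded BM25 term with closeness in [0, 1].
print(result.hits[0]["relevance"])        # combined score of the top hit
print(result.number_documents_retrieved)  # documents matched by the OR query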
For paragraph-level retrieval, we use Vespa's grouping feature to retrieve paragraphs instead of sentences. In the sample query below, we group by context_id and use the paragraph's max sentence score to represent the paragraph-level score. We limit the number of paragraphs returned to 3, and each paragraph contains at most two sentences. We return all the summary features for each sentence. All those configurations can be changed to fit different use cases.
result = app.query(body={
    'yql': ('select * from sources sentence where ([{"targetNumHits":10000}]nearestNeighbor(sentence_embedding,query_embedding)) | '
            'all(group(context_id) max(3) order(-max(relevance())) each(max(2) each(output(summary())) as(sentences)) as(paragraphs));'),
    'hits': 0,
    'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
    'ranking.profile': 'semantic-similarity'  # rank profile defined in the sentence schema
})
paragraphs = result.json["root"]["children"][0]["children"][0]
paragraphs["children"][0] # top-ranked paragraph
paragraphs["children"][1] # second-ranked paragraph
This work used Google's Universal Sentence Encoder to generate the embeddings. It would be interesting to compare evaluation metrics with embeddings generated by Facebook's more recent Dense Passage Retrieval (DPR) methodology. This Vespa blog post uses DPR to reproduce the state-of-the-art baseline for retrieval-based question-answering systems within a single, scalable, production-ready application.