Build a sentence/paragraph-level QA application from Python with Vespa
Retrieve paragraph- and sentence-level information with sparse and dense ranking features
- About the data
- Create and deploy the application
- Feed the data
- Sentence level retrieval
- Sentence level hybrid retrieval
- Paragraph level retrieval
- Conclusion and future work
We will walk through the steps necessary to create a question answering (QA) application that can retrieve sentence- or paragraph-level answers based on a combination of semantic and/or term-based search. We start by discussing the dataset used and the question and sentence embeddings generated for semantic search. We then include the steps necessary to create and deploy a Vespa application to serve the answers. We make all the required data available to feed the application and show how to query for sentence- and paragraph-level answers based on a combination of semantic and term-based search.
This tutorial is based on earlier work by the Vespa team to reproduce the results of the paper ReQA: An Evaluation for End-to-End Answer Retrieval Models by Ahmad et al. using the Stanford Question Answering Dataset (SQuAD) v1.1.
We are going to use the Stanford Question Answering Dataset (SQuAD) v1.1. The data contains paragraphs (denoted here as context), and each paragraph has questions that have answers in the associated paragraph. We have parsed the dataset and organized the data that we will use in this tutorial to make it easier to follow along.
import requests, json

context_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_context_data.json").text
)
Each context data point contains a context_id that uniquely identifies a paragraph, a text field holding the paragraph string, and a questions field holding a list of question ids that can be answered from the paragraph text. We also include a dataset field to identify the data source if we want to index more than one dataset in our application.
context_data[0]
According to the data point above, context_id = 0 can be used to answer the questions with id = [0, 1, 2, 3, 4]. We can load the file containing the questions and display those first five questions.
from pandas import read_csv

# Note that squad_queries.txt is approx. 1 GB due to the 512-dimensional question embeddings
questions = read_csv(
    filepath_or_buffer="https://data.vespa.oath.cloud/blog/qa/squad_queries.txt",
    sep="\t",
    names=["question_id", "question", "number_answers", "embedding"]
)
questions[["question_id", "question"]].head()
To build a more accurate application, we can break the paragraphs down into sentences. For example, the first sentence below comes from the paragraph with context_id = 0 and can answer the question with question_id = 4.
# Note that qa_squad_sentence_data.json is approx. 1 GB due to the 512-dimensional sentence embeddings
sentence_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/qa_squad_sentence_data.json").text
)
{k: sentence_data[0][k] for k in ["text", "dataset", "questions", "context_id"]}
We want to combine semantic (dense) and term-based (sparse) signals to answer the questions sent to our application. To implement the semantic search, we have generated 512-dimensional embeddings for both the questions and the sentences.
questions[["question_id", "embedding"]].head(1)
sentence_data[0]["sentence_embedding"]["values"][0:5] # display the first five elements
Here is the script containing the code that we used to generate the sentence and question embeddings. We used Google's Universal Sentence Encoder at the time, but feel free to replace it with embeddings generated by your preferred model.
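For reference, here is a minimal sketch of how such embeddings can be generated with the Universal Sentence Encoder through tensorflow_hub. This is an illustration under those assumptions, not the exact script linked above, and the example sentence is arbitrary.

import tensorflow_hub as hub

# Load Google's Universal Sentence Encoder; it outputs 512-dimensional embeddings
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Encode a batch of sentences; the result is a tensor of shape (batch_size, 512)
embeddings = encoder(["When was the Eiffel Tower built?"])
query_embedding = embeddings.numpy()[0].tolist()  # 512 floats, the format fed to Vespa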
We can now build a sentence-level question-answering application based on the data described above.
The context schema will have a document containing the four relevant fields described in the data section. We create an index for the text field and use enable-bm25 to pre-compute data required to speed up the use of BM25 for ranking. The summary indexing indicates that all the fields will be included in the requested context documents. The attribute indexing stores the fields in memory as attributes for sorting, querying, and grouping.
from vespa.package import Document, Field

context_document = Document(
    fields=[
        Field(name="questions", type="array<int>", indexing=["summary", "attribute"]),
        Field(name="dataset", type="string", indexing=["summary", "attribute"]),
        Field(name="context_id", type="int", indexing=["summary", "attribute"]),
        Field(name="text", type="string", indexing=["summary", "index"], index="enable-bm25"),
    ]
)
The default fieldset means query tokens will be matched against the text field by default. We define two rank profiles (bm25 and nativeRank) to illustrate that we can define and experiment with as many rank profiles as we want. You can create different ones using the ranking expressions and features available.
from vespa.package import Schema, FieldSet, RankProfile

context_schema = Schema(
    name="context",
    document=context_document,
    fieldsets=[FieldSet(name="default", fields=["text"])],
    rank_profiles=[
        RankProfile(name="bm25", inherits="default", first_phase="bm25(text)"),
        RankProfile(name="nativeRank", inherits="default", first_phase="nativeRank(text)")
    ]
)
The document of the sentence schema will inherit the fields defined in the context document to avoid unnecessary duplication of the same field types. In addition, we add the sentence_embedding field, defined to hold a one-dimensional tensor of floats of size 512. We will store the field as an attribute in memory and build an ANN index using the HNSW (hierarchical navigable small world) algorithm. Read this blog post to learn more about Vespa's journey to implement ANN search, and the documentation for more information about the HNSW parameters.
from vespa.package import HNSW

sentence_document = Document(
    inherits="context",
    fields=[
        Field(
            name="sentence_embedding",
            type="tensor<float>(x[512])",
            indexing=["attribute", "index"],
            ann=HNSW(
                distance_metric="euclidean",
                max_links_per_node=16,
                neighbors_to_explore_at_insert=500
            )
        )
    ]
)
For the sentence schema, we define three rank profiles. The semantic-similarity profile uses the Vespa closeness ranking feature, which is defined as 1/(1 + distance), so that sentences with embeddings closer to the question embedding will be ranked higher than sentences that are far apart. The bm25 profile is an example of a term-based rank profile, and bm25-semantic-similarity combines both term-based and semantic-based signals as an example of a hybrid approach.
sentence_schema = Schema(
    name="sentence",
    document=sentence_document,
    fieldsets=[FieldSet(name="default", fields=["text"])],
    rank_profiles=[
        RankProfile(
            name="semantic-similarity",
            inherits="default",
            first_phase="closeness(sentence_embedding)"
        ),
        RankProfile(
            name="bm25",
            inherits="default",
            first_phase="bm25(text)"
        ),
        RankProfile(
            name="bm25-semantic-similarity",
            inherits="default",
            first_phase="bm25(text) + closeness(sentence_embedding)"
        )
    ]
)
We can now define our qa application by creating an application package with both the context_schema and the sentence_schema that we defined above. In addition, we need to inform Vespa that we plan to send a query ranking feature named query_embedding with the same type that we used to define the sentence_embedding field.
from vespa.package import ApplicationPackage, QueryProfile, QueryProfileType, QueryTypeField

app_package = ApplicationPackage(
    name="qa",
    schema=[context_schema, sentence_schema],
    query_profile=QueryProfile(),
    query_profile_type=QueryProfileType(
        fields=[
            QueryTypeField(
                name="ranking.features.query(query_embedding)",
                type="tensor<float>(x[512])"
            )
        ]
    )
)
We can deploy the app_package in a Docker container (or to Vespa Cloud):
from vespa.package import VespaDocker
vespa_docker = VespaDocker(
    port=8081,
    container_memory="8G",
    disk_folder="/Users/username/qa_app"  # requires absolute path
)
app = vespa_docker.deploy(application_package=app_package)
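Alternatively, to deploy to Vespa Cloud, something along the lines of the sketch below should work, assuming the VespaCloud class shipped with the same pyvespa release and a tenant registered at cloud.vespa.ai. The tenant name, key location, and instance name here are placeholders.

from vespa.package import VespaCloud

vespa_cloud = VespaCloud(
    tenant="my-tenant",                      # placeholder: your Vespa Cloud tenant
    application="qa",
    key_location="/Users/username/key.pem",  # placeholder: API key registered with the tenant
    application_package=app_package,
)
app = vespa_cloud.deploy(instance="default", disk_folder="/Users/username/qa_app")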
Once deployed, we can use the Vespa instance app to interact with the application. We can start by feeding context and sentence data.
for idx, sentence in enumerate(sentence_data):
    app.feed_data_point(schema="sentence", data_id=idx, fields=sentence)

for context in context_data:
    app.feed_data_point(schema="context", data_id=context["context_id"], fields=context)
The query below sends the first question embedding (questions.loc[0, "embedding"]) through the ranking.features.query(query_embedding) parameter and uses the nearestNeighbor search operator to retrieve the 100 sentences that are closest in embedding space, using Euclidean distance as configured in the HNSW settings. The returned sentences will be ranked by the semantic-similarity rank profile defined in the sentence schema.
result = app.query(body={
    'yql': 'select * from sources sentence where ([{"targetNumHits":100}]nearestNeighbor(sentence_embedding,query_embedding));',
    'hits': 100,
    'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
    'ranking.profile': 'semantic-similarity'
})
result.hits[0]
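Each hit follows the standard Vespa hit layout, so the relevance score and the matched sentence can be read directly from the result (the field names assume the default document summary):

top_hit = result.hits[0]
print(top_hit["relevance"])       # closeness score from the semantic-similarity profile
print(top_hit["fields"]["text"])  # the retrieved sentence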
In addition to sending the query embedding, we can send the question string (questions.loc[0, "question"]) via the query parameter and use the or operator to retrieve documents that satisfy either the semantic operator nearestNeighbor or the term-based operator userQuery. Setting type equal to any means that the term-based operator will retrieve all the documents that match at least one query token. The retrieved documents will be ranked by the hybrid rank profile bm25-semantic-similarity.
result = app.query(body={
    'yql': 'select * from sources sentence where ([{"targetNumHits":100}]nearestNeighbor(sentence_embedding,query_embedding)) or userQuery();',
    'query': questions.loc[0, "question"],
    'type': 'any',
    'hits': 100,
    'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
    'ranking.profile': 'bm25-semantic-similarity'
})
result.hits[0]
For paragraph-level retrieval, we use Vespa's grouping feature to retrieve paragraphs instead of sentences. In the sample query below, we group by context_id and use the paragraph's max sentence score as the paragraph-level score. We limit the number of paragraphs returned to 3, with each paragraph containing at most two sentences. We return all the summary features for each sentence. All those configurations can be changed to fit different use cases.
result = app.query(body={
    'yql': ('select * from sources sentence where ([{"targetNumHits":10000}]nearestNeighbor(sentence_embedding,query_embedding)) | '
            'all(group(context_id) max(3) order(-max(relevance())) each(max(2) each(output(summary())) as(sentences)) as(paragraphs));'),
    'hits': 0,
    'ranking.features.query(query_embedding)': questions.loc[0, "embedding"],
    'ranking.profile': 'semantic-similarity'
})
paragraphs = result.json["root"]["children"][0]["children"][0]
paragraphs["children"][0] # top-ranked paragraph
paragraphs["children"][1] # second-ranked paragraph
This work used Google's Universal Sentence Encoder to generate the embeddings. It would be interesting to compare evaluation metrics with embeddings generated by Facebook's more recent Dense Passage Retrieval (DPR) methodology. This Vespa blog post uses DPR to reproduce the state-of-the-art baseline for retrieval-based question-answering systems within a single, scalable, production-ready application.