Proposal for a simplified pyvespa API
Simplification, embeddings and consistency
While experimenting with embeddings in pyvespa, I noticed several ways to significantly improve the pyvespa API.
- Simplify the API for the most common and basic usage (one default schema, one default query profile with root query profile type) while making it easy to configure more complex usage (multiple schemas, for example).
Before simplification:
from vespa.package import Document, Field, Schema, FieldSet, RankProfile, ApplicationPackage

document = Document(
    fields=[
        Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
        Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
        Field(name = "body", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
    ]
)
msmarco_schema = Schema(
    name = "msmarco",
    document = document,
    fieldsets = [FieldSet(name = "default", fields = ["title", "body"])],
    rank_profiles = [RankProfile(name = "default", first_phase = "nativeRank(title, body)")]
)
app_package = ApplicationPackage(name = "msmarco", schema=msmarco_schema)
After simplification:
from vespa.package import Field, FieldSet, RankProfile, ApplicationPackage

app_package = ApplicationPackage(name = "msmarco")
app_package.schema.add_field(
    Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "body", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)
app_package.schema.add_field_set(FieldSet(name = "default", fields = ["title", "body"]))
app_package.schema.add_rank_profile(RankProfile(name = "default", first_phase = "nativeRank(title, body)"))
When required, we can use a similar pattern to handle use cases with multiple schemas, where the .schema attribute is short for .schemas("default"):
app_package.schemas("my_other_schema").add_field(
    Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "user_name", type = "string", indexing = ["attribute", "summary"]),
)
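The relationship between .schema and .schemas(...) could be implemented roughly as follows. This is only a sketch: the Field, Schema and ApplicationPackage classes below are reduced stand-ins written for illustration, not the real vespa.package classes, and creating a schema on first access is an assumption about the desired behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Field:
    # Stand-in for vespa.package.Field, reduced to what this sketch needs.
    name: str
    type: str


@dataclass
class Schema:
    name: str
    fields: List[Field] = field(default_factory=list)

    def add_field(self, *fields: Field) -> None:
        # Accept one or more fields, as in the examples above.
        self.fields.extend(fields)


class ApplicationPackage:
    def __init__(self, name: str) -> None:
        self.name = name
        self._schemas: Dict[str, Schema] = {}

    def schemas(self, name: str) -> Schema:
        # Assumption: a schema is created the first time its name is used.
        if name not in self._schemas:
            self._schemas[name] = Schema(name=name)
        return self._schemas[name]

    @property
    def schema(self) -> Schema:
        # Shorthand for the default schema.
        return self.schemas("default")


app_package = ApplicationPackage(name="msmarco")
app_package.schema.add_field(Field(name="id", type="string"))
app_package.schemas("my_other_schema").add_field(
    Field(name="id", type="string"),
    Field(name="user_name", type="string"),
)
```

With this shape, users who only need one schema never have to name it, while additional schemas are reached through the same add_field interface.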
To support embeddings, we need to enable the creation of query profiles and query profile types. Example of what we want to accomplish:
- query profile type
<query-profile-type id="root">
    <field name="ranking.features.query(tensor_bert)" type="tensor&lt;float&gt;(x[768])" />
</query-profile-type>
- query profile
<query-profile id="default" type="root">
    <field name="maxHits">1000</field>
</query-profile>
Verbose API:
query_profile_type = QueryProfileType(
    id="root",
    fields = [
        QueryTypeField(
            name="ranking.features.query(tensor_bert)",
            type="tensor<float>(x[768])"
        )
    ]
)
query_profile = QueryProfile(
    id="default",
    type="root",
    fields=[QueryField(name="maxHits", value=1000)]
)
app_package.add_query_profile_type(query_profile_type)
app_package.add_query_profile(query_profile)
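To make the mapping between these objects and the target XML concrete, here is a minimal sketch of how they could be serialized. The classes are illustrative stand-ins that reuse the names from this proposal, and the to_xml method is a hypothetical name, not an existing pyvespa API. Serialization uses Python's standard xml.etree.ElementTree, which escapes the angle brackets in the tensor type when writing the attribute.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryTypeField:
    name: str
    type: str


@dataclass
class QueryProfileType:
    id: str
    fields: List[QueryTypeField] = field(default_factory=list)

    def to_xml(self) -> str:
        # Hypothetical method: one <field> element per declared type field.
        root = ET.Element("query-profile-type", {"id": self.id})
        for f in self.fields:
            ET.SubElement(root, "field", {"name": f.name, "type": f.type})
        return ET.tostring(root, encoding="unicode")


@dataclass
class QueryField:
    name: str
    value: object


@dataclass
class QueryProfile:
    id: str
    type: str
    fields: List[QueryField] = field(default_factory=list)

    def to_xml(self) -> str:
        # Hypothetical method: field values become the element text.
        root = ET.Element("query-profile", {"id": self.id, "type": self.type})
        for f in self.fields:
            child = ET.SubElement(root, "field", {"name": f.name})
            child.text = str(f.value)
        return ET.tostring(root, encoding="unicode")


type_xml = QueryProfileType(
    id="root",
    fields=[
        QueryTypeField(
            name="ranking.features.query(tensor_bert)",
            type="tensor<float>(x[768])",
        )
    ],
).to_xml()
profile_xml = QueryProfile(
    id="default",
    type="root",
    fields=[QueryField(name="maxHits", value=1000)],
).to_xml()
print(type_xml)
print(profile_xml)
```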
Simplified API, assuming the default query profile and the root query profile type:
app_package.query_profile_type.add_field(
    QueryTypeField(
        name="ranking.features.query(tensor_bert)",
        type="tensor<float>(x[768])"
    )
)
app_package.query_profile.add_field(
    QueryField(name="maxHits", value=1000)
)
When required, we can use a similar pattern to handle multiple profiles, where the .query_profile attribute is short for .query_profiles("default"). The same applies to .query_profile_type.
Let's see how we can use embeddings for document and query representation using the simplified API.
app_package = ApplicationPackage(name = "msmarco")
app_package.schema.add_field(
    Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "title_bert", type = "tensor<float>(x[768])", indexing = ["attribute"])
)
app_package.query_profile_type.add_field(
    QueryTypeField(
        name="ranking.features.query(tensor_bert)",
        type="tensor<float>(x[768])"
    )
)
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bert_title",
        first_phase = "sum(query(tensor_bert)*attribute(title_bert))"
    )
)
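For reference, the first-phase expression above multiplies the query and document tensors element-wise along x and sums the result, which is a dot product. A plain-Python sketch of the same computation, using 3-dimensional toy vectors instead of 768 dimensions (the dot_product helper is written for illustration only):

```python
from typing import Sequence


def dot_product(query_tensor: Sequence[float], doc_tensor: Sequence[float]) -> float:
    # Mirrors sum(query(tensor_bert) * attribute(title_bert)) for dense
    # tensors that share a single indexed dimension.
    if len(query_tensor) != len(doc_tensor):
        raise ValueError("tensors must share the same dimension size")
    return sum(q * d for q, d in zip(query_tensor, doc_tensor))


score = dot_product([1.0, 2.0, 3.0], [0.5, 0.0, 2.0])
print(score)  # 1.0*0.5 + 2.0*0.0 + 3.0*2.0 = 6.5
```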
Having to specify {"values": [...]} for indexed tensors is annoying, but for now this is the user's responsibility.
response = app.feed_data_point(
    data_id = "test_id",
    fields = {
        "id": "test_id",
        "title": "this is a test title",
        "title_bert": {
            "values": create_embedding(text="this is a test title") # from string to list of floats
        }
    }
)
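Until the API handles this, a small helper on the user side can hide the {"values": [...]} wrapping. The sketch below is an assumption-laden illustration: to_indexed_tensor is a hypothetical helper name, and create_embedding is a toy placeholder (word lengths) standing in for a real model that would return 768 floats.

```python
from typing import Dict, List, Sequence


def create_embedding(text: str) -> List[float]:
    # Toy placeholder: a real embedding model would return 768 floats here.
    return [float(len(word)) for word in text.split()]


def to_indexed_tensor(values: Sequence[float]) -> Dict[str, List[float]]:
    # Hypothetical helper: wrap a dense vector in the {"values": [...]}
    # structure used when feeding indexed tensors.
    return {"values": [float(v) for v in values]}


fields = {
    "id": "test_id",
    "title": "this is a test title",
    "title_bert": to_indexed_tensor(create_embedding(text="this is a test title")),
}
print(fields["title_bert"])
```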
- Here I propose a unified approach to defining and using embeddings in applications.
The ANN operator currently accepts an embedding function and takes care of converting the query to an embedding.
results = app.query(
    query="Where is my text?",
    query_model = Query(
        match_phase=ANN(
            doc_vector="title_bert",
            query_vector="tensor_bert",
            embedding_model=create_embedding,
            hits=10,
        ),
        rank_profile=Ranking(name="bert_title")
    ),
)
But we need to do the conversion manually when not using an embedding-friendly operator, such as the term-based OR.
other_args = {
    "ranking.features.query(tensor_bert)": create_embedding(text="this is a test query")
}
results = app.query(
    query="Where is my text?",
    query_model = Query(
        match_phase=OR(),
        rank_profile=Ranking(name="default")
    ),
    hits = 2,
    **other_args
)
I think we should unify those two approaches by moving the embedding creation to the Query model. Besides unifying usage, this makes sense because the embedding used is actually part of the query model: changing the embedding function defines a different query model.
# term-based matching
query_model = Query(
    query_properties=[
        QueryRankingFeature(name="tensor_bert", mapping=create_embedding)
    ],
    match_phase=OR(),
    rank_profile=Ranking(name="default")
)
# embedding-based matching
query_model = Query(
    query_properties=[
        QueryRankingFeature(name="tensor_bert", mapping=create_embedding)
    ],
    match_phase=ANN(
        doc_vector="title_bert",
        query_vector="tensor_bert",
        hits=10,
    ),
    rank_profile=Ranking(name="bert_title")
)
# same usage for both
results = app.query(
    query="Where is my text?",
    query_model=query_model
)
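One way the Query model could resolve its query_properties into request parameters is sketched below. Everything here is an assumption made for illustration: the classes are stand-ins named after the proposal, the to_request_param and request_params methods are hypothetical, the match phase and rank profile are left out, and create_embedding is a toy placeholder rather than a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QueryRankingFeature:
    name: str
    mapping: Callable[[str], List[float]]

    def to_request_param(self, query: str) -> Dict[str, List[float]]:
        # Apply the mapping to the query text and expose the result under
        # the ranking feature name expected by the rank profile.
        return {f"ranking.features.query({self.name})": self.mapping(query)}


@dataclass
class Query:
    query_properties: List[QueryRankingFeature] = field(default_factory=list)

    def request_params(self, query: str) -> Dict[str, object]:
        # Collect the parameters contributed by every query property.
        params: Dict[str, object] = {}
        for prop in self.query_properties:
            params.update(prop.to_request_param(query))
        return params


def create_embedding(text: str) -> List[float]:
    # Toy placeholder: a real embedding model would return 768 floats here.
    return [float(len(word)) for word in text.split()]


query_model = Query(
    query_properties=[
        QueryRankingFeature(name="tensor_bert", mapping=create_embedding)
    ]
)
params = query_model.request_params("Where is my text?")
print(params)
```

Because the conversion lives in the query model, app.query would only need to merge these parameters into the request, regardless of which match operator is used.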