While experimenting with embeddings in pyvespa, I noticed that I could make significant improvements to the pyvespa API.

Simpler API

  • Simplify the API for the most common and basic usage (one default schema, one default query profile with a root query profile type) while keeping more complex usage (multiple schemas, for example) easy to configure.

Default Schema

Before simplification:

from vespa.package import Document, Field, Schema, FieldSet, RankProfile, ApplicationPackage

document = Document(
    fields=[
        Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
        Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
        Field(name = "body", type = "string", indexing = ["index", "summary"], index = "enable-bm25")        
    ]
)
msmarco_schema = Schema(
    name = "msmarco", 
    document = document, 
    fieldsets = [FieldSet(name = "default", fields = ["title", "body"])],
    rank_profiles = [RankProfile(name = "default", first_phase = "nativeRank(title, body)")]
)
app_package = ApplicationPackage(name = "msmarco", schema=msmarco_schema)

After simplification:

from vespa.package import Field, FieldSet, RankProfile, ApplicationPackage

app_package = ApplicationPackage(name = "msmarco")

app_package.schema.add_field(        
    Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "body", type = "string", indexing = ["index", "summary"], index = "enable-bm25")        
)
app_package.schema.add_field_set(FieldSet(name = "default", fields = ["title", "body"]))
app_package.schema.add_rank_profile(RankProfile(name = "default", first_phase = "nativeRank(title, body)"))

When required, the same pattern handles multi-schema use cases: the .schema attribute is shorthand for .schemas("default").

app_package.schemas("my_other_schema").add_field(        
    Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "user_name", type = "string", indexing = ["attribute", "summary"]),
)
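
To make the shorthand concrete, here is a minimal sketch of how a default accessor like this could be wired up. SketchSchema and SketchPackage are hypothetical stand-ins written for illustration, not pyvespa classes, and creating a schema on first access is an assumption of the sketch rather than part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchSchema:
    """Hypothetical stand-in for a schema; it only tracks field names."""

    name: str
    field_names: List[str] = field(default_factory=list)

    def add_field(self, *names: str) -> None:
        # Accept one or more fields per call, mirroring the proposed usage.
        self.field_names.extend(names)


class SketchPackage:
    """Hypothetical stand-in for an application package."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._schemas: Dict[str, SketchSchema] = {}

    def schemas(self, name: str) -> SketchSchema:
        # Assumption: a schema is created the first time it is requested.
        if name not in self._schemas:
            self._schemas[name] = SketchSchema(name=name)
        return self._schemas[name]

    @property
    def schema(self) -> SketchSchema:
        # Shorthand for the single-schema case.
        return self.schemas("default")


package = SketchPackage(name="msmarco")
package.schema.add_field("id", "title", "body")
package.schemas("my_other_schema").add_field("id", "user_name")
```

With this shape, single-schema users never touch the schema name, while multi-schema users pay only for the extra name argument.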

Default query profile with root query profile type

To support embeddings, we need to enable the creation of query profiles and query profile types. Here is an example of what we want to accomplish:

  • query profile type
<query-profile-type id="root">
  <field name="ranking.features.query(tensor_bert)" type="tensor&lt;float&gt;(x[768])" />
</query-profile-type>
  • query profile
<query-profile id="default" type="root">
  <field name="maxHits">1000</field>
</query-profile>

Verbose API:

from vespa.package import QueryProfileType, QueryTypeField, QueryProfile, QueryField

query_profile_type = QueryProfileType(
    id="root", 
    fields = [
        QueryTypeField(
            name="ranking.features.query(tensor_bert)",
            type="tensor<float>(x[768])"
        )
    ]
)
query_profile = QueryProfile(
    id="default", 
    type="root", 
    fields=[QueryField(name="maxHits", value=1000)]
)

app_package.add_query_profile_type(query_profile_type)
app_package.add_query_profile(query_profile)
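
For reference, here is a rough sketch of how objects like these could be rendered into the XML shown earlier, using only the standard library. The two render functions are hypothetical helpers written for this illustration and say nothing about how pyvespa generates its files. The sketch shows that the XML writer escapes the angle brackets in the tensor type, so users can write tensor<float>(x[768]) in Python.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def render_query_profile_type(type_id: str, fields: List[Tuple[str, str]]) -> str:
    """Render (name, type) pairs as a query-profile-type element."""
    root = ET.Element("query-profile-type", {"id": type_id})
    for name, field_type in fields:
        ET.SubElement(root, "field", {"name": name, "type": field_type})
    return ET.tostring(root, encoding="unicode")


def render_query_profile(profile_id: str, type_id: str, fields: List[Tuple[str, object]]) -> str:
    """Render (name, value) pairs as a query-profile element."""
    root = ET.Element("query-profile", {"id": profile_id, "type": type_id})
    for name, value in fields:
        child = ET.SubElement(root, "field", {"name": name})
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


type_xml = render_query_profile_type(
    "root",
    [("ranking.features.query(tensor_bert)", "tensor<float>(x[768])")],
)
profile_xml = render_query_profile("default", "root", [("maxHits", 1000)])
print(type_xml)
print(profile_xml)
```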

Simplified API, assuming the default query profile and the root query profile type:

app_package.query_profile_type.add_field(        
    QueryTypeField(
        name="ranking.features.query(tensor_bert)",
        type="tensor<float>(x[768])"
    )
)
app_package.query_profile.add_field(
    QueryField(name="maxHits", value=1000)
)

When required, the same pattern handles multiple profiles: the .query_profile attribute is shorthand for .query_profiles("default"). The same applies to .query_profile_type.

Embedding as doc/query representation

Create the application with the simplified API

Let's see how to use embeddings for document and query representation with the simplified API.

app_package = ApplicationPackage(name = "msmarco")

app_package.schema.add_field(        
    Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "title_bert", type = "tensor<float>(x[768])", indexing = ["attribute"])        
)
app_package.query_profile_type.add_field(        
    QueryTypeField(
        name="ranking.features.query(tensor_bert)",
        type="tensor<float>(x[768])"
    )
)
app_package.schema.add_rank_profile(
    RankProfile(
        name = "bert_title", 
        first_phase = "sum(query(tensor_bert)*attribute(title_bert))"
    )
)
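
The first_phase expression above multiplies the query tensor and the document tensor cell by cell and sums the result, which is a dot product. Here is a small plain-Python sketch of the same computation on toy three-dimensional vectors; first_phase_score is a hypothetical name used only for this illustration.

```python
from typing import Sequence


def first_phase_score(query_vector: Sequence[float], doc_vector: Sequence[float]) -> float:
    """Mimic sum(query(tensor_bert) * attribute(title_bert)) for dense vectors."""
    if len(query_vector) != len(doc_vector):
        raise ValueError("query and document vectors must have the same size")
    # Element-wise product followed by a sum over all cells.
    return sum(q * d for q, d in zip(query_vector, doc_vector))


score = first_phase_score([1.0, 0.0, 2.0], [0.5, 3.0, 1.0])
print(score)  # 1.0*0.5 + 0.0*3.0 + 2.0*1.0 = 2.5
```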

Feeding

Having to specify {"values": [...]} for indexed tensors is annoying, but for now this is the user's responsibility.

response = app.feed_data_point(
    data_id = "test_id", 
    fields = {
        "id": "test_id", 
        "title": "this is a test title", 
        "title_bert": {
            "values": create_embedding(text="this is a test title") # from string to list of floats
        }
    }
)
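
Until the library handles this, a small helper can hide the wrapping. This is a minimal sketch: as_indexed_tensor is a hypothetical helper, and fake_embedding is a toy stand-in for create_embedding that only keeps the example self-contained.

```python
from typing import Dict, List, Sequence


def fake_embedding(text: str, size: int = 4) -> List[float]:
    """Toy stand-in for create_embedding: text in, list of floats out."""
    return [float(len(text) % (i + 2)) for i in range(size)]


def as_indexed_tensor(values: Sequence[float]) -> Dict[str, List[float]]:
    """Wrap raw values in the {"values": [...]} form used when feeding."""
    return {"values": [float(v) for v in values]}


title = "this is a test title"
fields = {
    "id": "test_id",
    "title": title,
    "title_bert": as_indexed_tensor(fake_embedding(title)),
}
```

The resulting fields dict has the same shape as the one passed to feed_data_point above.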

Using embeddings on queries

  • Here I propose a unified approach to defining and using embeddings in applications.

The ANN operator currently accepts an embedding function and takes care of converting the query to an embedding.

results = app.query(
    query="Where is my text?", 
    query_model = Query(
        match_phase=ANN(
            doc_vector="title_bert", 
            query_vector="tensor_bert", 
            embedding_model=create_embedding, 
            hits=10, 
        ), 
        rank_profile=Ranking(name="bert_title")
    ),
)

But we need to do the conversion manually when not using an embedding-friendly operator, such as the term-based OR.

other_args = {
    # embed the same text that is sent as the query below
    "ranking.features.query(tensor_bert)": create_embedding(text="Where is my text?")
}

results = app.query(
    query="Where is my text?", 
    query_model = Query(
        match_phase=OR(), 
        rank_profile=Ranking(name="default")
    ),
    hits = 2,
    **other_args
)

I think we should unify those two approaches by moving the embedding creation into the Query model. Besides unifying usage, this makes sense because the embedding is part of the query model: changing the embedding function defines a different query model.

# term-based matching
query_model = Query(
    query_properties=[
        QueryRankingFeature(name="tensor_bert", mapping=create_embedding)
    ],
    match_phase=OR(), 
    rank_profile=Ranking(name="default")
)

# embedding-based matching
query_model = Query(
    query_properties=[
        QueryRankingFeature(name="tensor_bert", mapping=create_embedding)
    ],    
    match_phase=ANN(
        doc_vector="title_bert", 
        query_vector="tensor_bert", 
        hits=10, 
    ), 
    rank_profile=Ranking(name="bert_title")
)

# same usage for both
results = app.query(
    query="Where is my text?", 
    query_model=query_model
)
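
To illustrate how the proposed query_properties could work internally, here is a minimal sketch in which each ranking feature maps the query text to a request parameter. SketchQueryRankingFeature, collect_ranking_features and toy_embedding are hypothetical names invented for this sketch. The parameter key format is taken from the manual example above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SketchQueryRankingFeature:
    """Hypothetical stand-in for the proposed QueryRankingFeature."""

    name: str
    mapping: Callable[[str], List[float]]

    def to_request_params(self, query: str) -> Dict[str, List[float]]:
        # Apply the embedding function to the query text itself, so the
        # caller no longer builds "ranking.features.query(...)" by hand.
        key = "ranking.features.query({})".format(self.name)
        return {key: self.mapping(query)}


def collect_ranking_features(
    query: str, features: List[SketchQueryRankingFeature]
) -> Dict[str, List[float]]:
    """Merge the request parameters produced by every ranking feature."""
    params: Dict[str, List[float]] = {}
    for feature in features:
        params.update(feature.to_request_params(query))
    return params


def toy_embedding(text: str) -> List[float]:
    """Toy stand-in for create_embedding."""
    return [float(len(text)), float(text.count(" "))]


params = collect_ranking_features(
    "Where is my text?",
    [SketchQueryRankingFeature(name="tensor_bert", mapping=toy_embedding)],
)
```

Because the mapping always receives the text passed to app.query, the embedding cannot drift out of sync with the query the way a hand-built other_args dict can.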