Text classification with the Hugging Face inference API

Hosted API

import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer api_mxMKsfJoFDmvPdPziZLGuymSBQMbxVYoWg"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({"inputs": "This is a test"})
output
[[{'label': 'NEGATIVE', 'score': 0.9670119881629944},
  {'label': 'POSITIVE', 'score': 0.032987985759973526}]]

The first time you call the API, it returns an error indicating that it is still loading the model.

{'error': 'Model distilbert-base-uncased-finetuned-sst-2-english is currently loading',
 'estimated_time': 20}
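
If you need to block until the model is warm, a minimal retry helper like the sketch below works, reusing the query function above and the estimated_time hint from the error payload (the query_with_retry name and the retry count are my own):

import time

def query_with_retry(payload, max_retries=5):
    # Keep calling the API while it reports that the model is still loading
    for _ in range(max_retries):
        result = query(payload)
        if isinstance(result, dict) and "error" in result:
            # The error payload includes an estimated_time hint in seconds
            time.sleep(result.get("estimated_time", 20))
            continue
        return result
    return result

output = query_with_retry({"inputs": "This is a test"})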

Local with the transformers Python API

from transformers import Pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Build a bare pipeline from the tokenizer and the fine-tuned SST-2 model;
# called directly, it returns the raw model logits
pipeline = Pipeline(
    tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english"),
    model=AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
)
import math

# The pipeline above returns raw logits; apply a softmax to turn them into probabilities
logits = pipeline("This is a test").tolist()[0]
[math.exp(x)/(math.exp(logits[0])+math.exp(logits[1])) for x in logits]
[0.9670119572344816, 0.032988042765518436]
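
As an aside, newer transformers releases may not allow instantiating the base Pipeline class directly as above; the high-level pipeline factory is the usual route and returns label/score pairs in the same format as the hosted API. A minimal sketch (the scores should match the manual softmax above):

import transformers

classifier = transformers.pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier("This is a test")
# expected something like: [{'label': 'NEGATIVE', 'score': 0.967...}]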

Vespa stateless model API - the bare minimum

Create the application package and deploy

**Note**: Creating the application package below assumes you have already exported a model to ONNX.
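
Such an export could look roughly like the sketch below using torch.onnx.export; the input/output names and dynamic axes match what the deployment logs further down report, while the opset version and the return_dict trick are my own choices here, not necessarily how the file above was produced.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "google/bert_uncased_L-2_H-128_A-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False so the traced model returns a plain tuple of tensors
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
model.eval()

# A dummy encoding is only needed to trace the graph
encoding = tokenizer("this is a test", return_tensors="pt")

torch.onnx.export(
    model,
    (encoding["input_ids"], encoding["attention_mask"], encoding["token_type_ids"]),
    "data/2021-08-10-stateless-model-api/bert_tiny.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["output_0"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "output_0": {0: "batch"},
    },
    opset_version=12,
)

With the ONNX file in place, point a ModelServer at it:
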
from vespa.package import ModelServer

model_server = ModelServer(
    name="bert_model_server", 
    model_file_path = "data/2021-08-10-stateless-model-api/bert_tiny.onnx"
)

Deploy to a local Docker container (we could just as easily deploy to Vespa Cloud):

from vespa.deployment import VespaDocker

disk_folder = "/Users/tmartins/model_server_docker"
vespa_docker = VespaDocker(disk_folder=disk_folder, port=8081)
app = vespa_docker.deploy(application_package=model_server)
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Finished deployment.

Interact with the application through the Vespa REST API

Get the models available

!curl -s 'http://localhost:8081/model-evaluation/v1/'
{"bert_tiny":"http://localhost:8080/model-evaluation/v1/bert_tiny"}

Get information about a specific model

!curl -s 'http://localhost:8081/model-evaluation/v1/bert_tiny'
{"model":"bert_tiny","functions":[{"function":"output_0","info":"http://localhost:8080/model-evaluation/v1/bert_tiny/output_0","eval":"http://localhost:8080/model-evaluation/v1/bert_tiny/output_0/eval","arguments":[{"name":"input_ids","type":"tensor(d0[],d1[])"},{"name":"attention_mask","type":"tensor(d0[],d1[])"},{"name":"token_type_ids","type":"tensor(d0[],d1[])"}]}]}

Write custom code to generate URL-encoded inputs

**Note**: Writing custom code to get the inputs right is messy, and it does not improve speed unless the user writes a custom Java searcher.

# Load the tokenizer that matches the deployed bert_tiny model
tokenizer = AutoTokenizer.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2"
)
from urllib.parse import urlencode

def create_url_encoded_input(text, tokenizer):
    # Render each tokenizer output as a Vespa tensor literal,
    # e.g. { {d0: 0, d1: 0}: 101, ... }, and URL-encode the query string
    tokens = tokenizer(text)
    encoded_tokens = urlencode(
        {
            key: "{"
            + ",".join(
                [
                    "{{d0: 0, d1: {}}}: {}".format(idx, x)
                    for idx, x in enumerate(value)
                ]
            )
            + "}"
            for key, value in tokens.items()
        }
    )
    return encoded_tokens
encoded_tokens = create_url_encoded_input("this is a text", tokenizer)
encoded_tokens
'input_ids=%7B%7Bd0%3A+0%2C+d1%3A+0%7D%3A+101%2C%7Bd0%3A+0%2C+d1%3A+1%7D%3A+2023%2C%7Bd0%3A+0%2C+d1%3A+2%7D%3A+2003%2C%7Bd0%3A+0%2C+d1%3A+3%7D%3A+1037%2C%7Bd0%3A+0%2C+d1%3A+4%7D%3A+3793%2C%7Bd0%3A+0%2C+d1%3A+5%7D%3A+102%7D&token_type_ids=%7B%7Bd0%3A+0%2C+d1%3A+0%7D%3A+0%2C%7Bd0%3A+0%2C+d1%3A+1%7D%3A+0%2C%7Bd0%3A+0%2C+d1%3A+2%7D%3A+0%2C%7Bd0%3A+0%2C+d1%3A+3%7D%3A+0%2C%7Bd0%3A+0%2C+d1%3A+4%7D%3A+0%2C%7Bd0%3A+0%2C+d1%3A+5%7D%3A+0%7D&attention_mask=%7B%7Bd0%3A+0%2C+d1%3A+0%7D%3A+1%2C%7Bd0%3A+0%2C+d1%3A+1%7D%3A+1%2C%7Bd0%3A+0%2C+d1%3A+2%7D%3A+1%2C%7Bd0%3A+0%2C+d1%3A+3%7D%3A+1%2C%7Bd0%3A+0%2C+d1%3A+4%7D%3A+1%2C%7Bd0%3A+0%2C+d1%3A+5%7D%3A+1%7D'

Use the encoded tokens to get a prediction from Vespa:

from requests import get

get("http://localhost:8081/model-evaluation/v1/bert_tiny/output_0/eval?{}".format(encoded_tokens)).json()
{'cells': [{'address': {'d0': '0', 'd1': '0'}, 'value': -0.02798202447593212},
  {'address': {'d0': '0', 'd1': '1'}, 'value': -0.1420438140630722}]}
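
The cells hold the raw logits of output_0 (the bert_tiny checkpoint is not fine-tuned for this task, so the values are only illustrative). A small helper, sketched below with a name of my own choosing, converts them into probabilities with the same softmax used earlier:

import math

def cells_to_probabilities(response):
    # Pull the logit values out of the tensor cells and apply a softmax
    logits = [cell["value"] for cell in response["cells"]]
    total = sum(math.exp(x) for x in logits)
    return [math.exp(x) / total for x in logits]

response = get(
    "http://localhost:8081/model-evaluation/v1/bert_tiny/output_0/eval?{}".format(encoded_tokens)
).json()
cells_to_probabilities(response)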

Clean up the environment

import shutil

shutil.rmtree(disk_folder, ignore_errors=True)
vespa_docker.container.stop()
vespa_docker.container.remove()

Vespa stateless model API - the full Python API

Define the task and the model to use:

from vespa.ml import SequenceClassification

model = SequenceClassification(
    model_id="bert_tiny", 
    model="google/bert_uncased_L-2_H-128_A-2"
)

Create the application package; there is no need to manually export the model this time.

from vespa.package import ModelServer

model_server = ModelServer(
    name="bert_model_server",
    models=[model],
)

Deploy to a local Docker container (again, we could just as easily deploy to Vespa Cloud):

from vespa.deployment import VespaDocker

disk_folder = "/Users/tmartins/model_server_docker"
vespa_docker = VespaDocker(disk_folder=disk_folder, port=8081)
app = vespa_docker.deploy(application_package=model_server)
Using framework PyTorch: 1.7.1
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Finished deployment.

Get the models available:

app.get_model_endpoint()
{'bert_tiny': 'http://localhost:8080/model-evaluation/v1/bert_tiny'}

Get information about a specific model:

app.get_model_endpoint("bert_tiny")
{'model': 'bert_tiny',
 'functions': [{'function': 'output_0',
   'info': 'http://localhost:8080/model-evaluation/v1/bert_tiny/output_0',
   'eval': 'http://localhost:8080/model-evaluation/v1/bert_tiny/output_0/eval',
   'arguments': [{'name': 'input_ids', 'type': 'tensor(d0[],d1[])'},
    {'name': 'attention_mask', 'type': 'tensor(d0[],d1[])'},
    {'name': 'token_type_ids', 'type': 'tensor(d0[],d1[])'}]}]}

Get a prediction:

app.predict(x="this is a test", model_name="bert_tiny")
[0.009904447011649609, 0.04607260227203369]
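
These are again the raw logits of output_0 (still illustrative, since bert_tiny is not fine-tuned); applying the same softmax as before turns them into probabilities:

import math

logits = app.predict(x="this is a test", model_name="bert_tiny")
total = sum(math.exp(x) for x in logits)
[math.exp(x) / total for x in logits]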