Vespa stateless model evaluation
Experimenting with a Python API
- Text classification task with the Hugging Face inference API
- Vespa stateless model API - the bare minimum
- Interact with the application through the Vespa REST API
- Vespa stateless model - full Python API
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer <HF_API_TOKEN>"}  # replace with your own Hugging Face API token

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "This is a test"})
output
The first time you call the API, it returns an error indicating that the model is still loading, along with an estimated waiting time:
{'error': 'Model distilbert-base-uncased-finetuned-sst-2-english is currently loading',
'estimated_time': 20}
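Since the hosted model may still be warming up, it is handy to wait and retry. The helper below is just a sketch built on the query function above; the helper name and retry policy are ours, not part of the Hugging Face API:

import time

def query_with_retry(payload, max_retries=5):
    # retry while the hosted model is still loading, sleeping roughly
    # as long as the inference API says it needs
    result = query(payload)
    for _ in range(max_retries):
        if not (isinstance(result, dict) and "error" in result):
            break
        time.sleep(result.get("estimated_time", 10))
        result = query(payload)
    return result

query_with_retry({"inputs": "This is a test"})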
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

import math

# run the same sentence through the model locally and softmax the two
# logits to recover the class probabilities
logits = model(**tokenizer("This is a test", return_tensors="pt")).logits.tolist()[0]
[math.exp(x) / (math.exp(logits[0]) + math.exp(logits[1])) for x in logits]
**Note**: The step below assumes you have already exported a model to ONNX (a sketch of one way to do that follows).
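For example, the tiny BERT model can be exported with torch.onnx. This is only a sketch: the output path matches the one used below, and the input and output names are assumptions chosen to line up with the tokenizer fields and the output_0 function queried further down.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "google/bert_uncased_L-2_H-128_A-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# trace the model with a dummy input and write the ONNX file
dummy = tokenizer("a sample sentence", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "data/2021-08-10-stateless-model-api/bert_tiny.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["output_0"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "output_0": {0: "batch"},
    },
    opset_version=12,
)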
from vespa.package import ModelServer

model_server = ModelServer(
    name="bert_model_server",
    model_file_path="data/2021-08-10-stateless-model-api/bert_tiny.onnx",
)
Deploy the application in a local Docker container; we could just as easily deploy to Vespa Cloud:
from vespa.deployment import VespaDocker
disk_folder = "/Users/tmartins/model_server_docker"
vespa_docker = VespaDocker(disk_folder=disk_folder, port=8081)
app = vespa_docker.deploy(application_package=model_server)
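Once deploy returns, a quick call to the container's generic /ApplicationStatus handler confirms it is reachable (just a sanity check, not a required step):

from requests import get

# the container answers on port 8081 as configured above
get("http://localhost:8081/ApplicationStatus").status_code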
Get the models available:
!curl -s 'http://localhost:8081/model-evaluation/v1/'
Get information about a specific model:
!curl -s 'http://localhost:8081/model-evaluation/v1/bert_tiny'
Write custom code to generate URL-encoded inputs:
**Note**: Writing custom code to get the inputs right is messy, and it leaves no room to improve speed unless the user writes their own custom Java searcher.
tokenizer = AutoTokenizer.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2"
)
from urllib.parse import urlencode

def create_url_encoded_input(text, tokenizer):
    # Render each tokenizer output (input_ids, token_type_ids, attention_mask)
    # as a Vespa tensor literal and URL-encode the whole query string.
    tokens = tokenizer(text)
    encoded_tokens = urlencode(
        {
            key: "{"
            + ",".join(
                [
                    "{{d0: 0, d1: {}}}: {}".format(idx, x)
                    for idx, x in enumerate(value)
                ]
            )
            + "}"
            for key, value in tokens.items()
        }
    )
    return encoded_tokens
encoded_tokens = create_url_encoded_input("this is a text", tokenizer)
encoded_tokens
Use the encoded tokens to get a prediction from Vespa:
from requests import get
get("http://localhost:8081/model-evaluation/v1/bert_tiny/output_0/eval?{}".format(encoded_tokens)).json()
Clean up the environment:
import shutil
shutil.rmtree(disk_folder, ignore_errors=True)
vespa_docker.container.stop()
vespa_docker.container.remove()
Define the task and the model to use:
from vespa.ml import SequenceClassification

model = SequenceClassification(
    model_id="bert_tiny",
    model="google/bert_uncased_L-2_H-128_A-2"
)
Create the application package; this time there is no need to export the model manually.
from vespa.package import ModelServer

model_server = ModelServer(
    name="bert_model_server",
    models=[model],
)
Deploy with Docker; we could just as easily deploy to Vespa Cloud:
from vespa.deployment import VespaDocker
disk_folder = "/Users/tmartins/model_server_docker"
vespa_docker = VespaDocker(disk_folder=disk_folder, port=8081)
app = vespa_docker.deploy(application_package=model_server)
Get models available:
app.get_model_endpoint()
Get information about a specific model:
app.get_model_endpoint("bert_tiny")
Get a prediction:
app.predict(x="this is a test", model_name="bert_tiny")