This post creates a labeled dataset from the Flickr 8k image-caption dataset, builds a text processor that uses a CLIP model to map a text query into the same 512-dimensional space used to represent the images, and evaluates different query models with the Vespa Python API.

Create labeled data

An (image, caption) pair is considered relevant for our purposes if all three experts agreed on a relevance score of 4.

Load and check the expert judgments

import os

from pandas import read_csv

experts = read_csv(
    os.path.join(os.environ["DATA_FOLDER"], "ExpertAnnotations.txt"),
    sep="\t",
    header=None,
    names=["image_file_name", "caption_id", "expert_1", "expert_2", "expert_3"]
)
experts.head()
image_file_name caption_id expert_1 expert_2 expert_3
0 1056338697_4f7d7ce270.jpg 2549968784_39bfbe44f9.jpg#2 1 1 1
1 1056338697_4f7d7ce270.jpg 2718495608_d8533e3ac5.jpg#2 1 1 2
2 1056338697_4f7d7ce270.jpg 3181701312_70a379ab6e.jpg#2 1 1 2
3 1056338697_4f7d7ce270.jpg 3207358897_bfa61fa3c6.jpg#2 1 2 2
4 1056338697_4f7d7ce270.jpg 3286822339_5535af6b93.jpg#2 1 1 2

Check the cases where all three experts agree

experts_agreement_bool = experts.apply(
    lambda x: x["expert_1"] == x["expert_2"] and x["expert_2"] == x["expert_3"], 
    axis=1
)
experts_agreement = experts[experts_agreement_bool][
    ["image_file_name", "caption_id", "expert_1"]
].rename(columns={"expert_1":"expert"})
experts_agreement.head()
image_file_name caption_id expert
0 1056338697_4f7d7ce270.jpg 2549968784_39bfbe44f9.jpg#2 1
5 1056338697_4f7d7ce270.jpg 3360930596_1e75164ce6.jpg#2 1
6 1056338697_4f7d7ce270.jpg 3545652636_0746537307.jpg#2 1
8 106490881_5a2dd9b7bd.jpg 1425069308_488e5fcf9d.jpg#2 1
9 106490881_5a2dd9b7bd.jpg 1714316707_8bbaa2a2ba.jpg#2 2
experts_agreement["expert"].value_counts().sort_index()
1    2350
2     580
3     214
4     247
Name: expert, dtype: int64
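
As a side note, the row-wise apply above can be replaced by a vectorized check; a minimal sketch that produces the same boolean mask:

# A row has unanimous agreement when its three expert columns contain a single unique value.
experts_agreement_bool = experts[["expert_1", "expert_2", "expert_3"]].nunique(axis=1) == 1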

Load the captions data

captions = read_csv(
    os.path.join(os.environ["DATA_FOLDER"], "Flickr8k.token.txt"), 
    sep="\t", 
    header=None, 
    names=["caption_id", "caption"]
)
captions.head()
caption_id caption
0 1000268201_693b08cb0e.jpg#0 A child in a pink dress is climbing up a set o...
1 1000268201_693b08cb0e.jpg#1 A girl going into a wooden building .
2 1000268201_693b08cb0e.jpg#2 A little girl climbing into a wooden playhouse .
3 1000268201_693b08cb0e.jpg#3 A little girl climbing the stairs to her playh...
4 1000268201_693b08cb0e.jpg#4 A little girl in a pink dress going into a woo...
def get_caption(caption_id, captions):
    return captions[captions["caption_id"] == caption_id]["caption"].values[0]
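
For example, looking up one of the caption ids shown above:

get_caption(caption_id="1000268201_693b08cb0e.jpg#1", captions=captions)
'A girl going into a wooden building .'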

Relevant (image, text) pairs

relevant_data = experts_agreement[experts_agreement["expert"] == 4]
relevant_data.head(3)
image_file_name caption_id expert
43 1119015538_e8e796281e.jpg 416106657_cab2a107a5.jpg#2 4
53 1131932671_c8d17751b3.jpg 1131932671_c8d17751b3.jpg#2 4
66 115684808_cb01227802.jpg 115684808_cb01227802.jpg#2 4

Create the labeled data frame

from ntpath import basename
from pandas import DataFrame

labeled_data = DataFrame(
    data={
        "qid": list(range(relevant_data.shape[0])),
        # Use the caption text as the query, dropping the tokenized " ," and " ." punctuation.
        "query": [get_caption(
            caption_id=x,
            captions=captions
        ).replace(" ,", "").replace(" .", "") for x in list(relevant_data.caption_id)],
        "doc_id": [basename(x) for x in list(relevant_data.image_file_name)],
        "relevance": 1
    }
)
labeled_data.head()
qid query doc_id relevance
0 0 A white dog runs in the grass 1119015538_e8e796281e.jpg 1
1 1 A boy jumps from one bed to another 1131932671_c8d17751b3.jpg 1
2 2 Three people and a sled 115684808_cb01227802.jpg 1
3 3 A group of people walking a city street in war... 1174629344_a2e1a2bdbf.jpg 1
4 4 Two children one of which is holding a stick a... 1322323208_c7ecb742c6.jpg 1
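
As a sanity check, the labeled data should contain one row per pair that all three experts scored 4, i.e. the 247 pairs counted above:

labeled_data.shape  # (247, 4): qid, query, doc_id and relevance for each relevant pair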

From text to embeddings

Create a text processor to map a text string into the same 512-dimensional space used to embed the images.

import clip
import torch

class TextProcessor(object):
    def __init__(self, model_name):
        self.model, _ = clip.load(model_name)
        
    def embed(self, text):
        # Tokenize the text and encode it with the CLIP text encoder, then
        # normalize the embedding so that dot products equal cosine similarities.
        text_tokens = clip.tokenize(text)
        with torch.no_grad():
            text_features = self.model.encode_text(text_tokens).float()
            text_features /= text_features.norm(dim=-1, keepdim=True)
        return text_features.tolist()[0]

Evaluate

Define search evaluation metrics:

from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [
    MatchRatio(), 
    Recall(at=5), 
    Recall(at=100), 
    ReciprocalRank(at=5), 
    ReciprocalRank(at=100)
]

Instantiate TextProcessor with a specific CLIP model.

text_processor = TextProcessor(model_name="ViT-B/32")
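
A quick sanity check, assuming the CLIP weights download successfully: the text embedding should be a 512-dimensional list, the same dimension used for the image embeddings.

embedding = text_processor.embed("A white dog runs in the grass")
assert len(embedding) == 512  # matches the dimension of the vit_b_32_image field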

Create the QueryModels to be evaluated. In this case we create two query models based on the ViT-B/32 CLIP model: one that sends the query as is and another that prepends the prompt "A photo of " to the query before sending it, as suggested in the original CLIP paper.

from vespa.query import QueryModel

def create_vespa_query(query, prompt=False):
    if prompt:
        query = "A photo of " + query.lower()
    return {
        # Nearest-neighbor search over the stored image embeddings, retrieving up to 100 candidates.
        'yql': 'select * from sources * where ([{"targetNumHits":100}]nearestNeighbor(vit_b_32_image,vit_b_32_text));',
        'hits': 100,
        'ranking.features.query(vit_b_32_text)': text_processor.embed(query),
        'ranking.profile': 'vit-b-32-similarity',
        'timeout': 10
    }

query_model_1 = QueryModel(name="vit_b_32", body_function=create_vespa_query)
query_model_2 = QueryModel(name="vit_b_32_prompt", body_function=lambda x: create_vespa_query(x, prompt=True))
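
To inspect the request body the prompted model produces, we can call create_vespa_query directly; the 512-dimensional text embedding travels under ranking.features.query(vit_b_32_text):

body = create_vespa_query("A white dog runs in the grass", prompt=True)
body["yql"]                                          # nearestNeighbor search against vit_b_32_image
len(body["ranking.features.query(vit_b_32_text)"])   # 512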

Create a connection to the Vespa instance:

from vespa.application import Vespa

app = Vespa(
    url=os.environ["VESPA_END_POINT"],
    cert=os.environ["PRIVATE_CERTIFICATE_PATH"]
)

Evaluate the query models using the labeled data and the metrics defined earlier. The doc_id column of the labeled data holds image file names, so we pass id_field="image_file_name" to match hits on that field.

result = app.evaluate(
    labeled_data=labeled_data, 
    eval_metrics=eval_metrics, 
    query_model=[query_model_1, query_model_2], 
    id_field="image_file_name"
)

The results show that there is plenty of room for improvement over the pre-trained ViT-B/32 CLIP model.

result
model                         vit_b_32    vit_b_32_prompt
match_ratio           mean    0.012359    0.012359
                      median  0.012359    0.012359
                      std     0.000000    0.000000
recall_5              mean    0.417004    0.412955
                      median  0.000000    0.000000
                      std     0.494065    0.493365
recall_100            mean    0.870445    0.870445
                      median  1.000000    1.000000
                      std     0.336495    0.336495
reciprocal_rank_5     mean    0.279285    0.268084
                      median  0.000000    0.000000
                      std     0.394814    0.386606
reciprocal_rank_100   mean    0.304849    0.293651
                      median  0.111111    0.100000
                      std     0.378595    0.370633