This post creates a labeled dataset from the Flickr 8k image-caption dataset, builds a text processor that uses a CLIP model to map a text query into the same 512-dimensional space used to represent the images, and evaluates different query models with the Vespa Python API.

Create labeled data

An (image, caption) pair is considered relevant for our purposes if all three experts agreed on a relevance score of 4.

Load and check the expert judgments

import os

from pandas import read_csv

experts = read_csv(
    os.path.join(os.environ["DATA_FOLDER"], "ExpertAnnotations.txt"),
    sep="\t",
    header=None,
    names=["image_file_name", "caption_id", "expert_1", "expert_2", "expert_3"]
)
experts.head()
image_file_name caption_id expert_1 expert_2 expert_3
0 1056338697_4f7d7ce270.jpg 2549968784_39bfbe44f9.jpg#2 1 1 1
1 1056338697_4f7d7ce270.jpg 2718495608_d8533e3ac5.jpg#2 1 1 2
2 1056338697_4f7d7ce270.jpg 3181701312_70a379ab6e.jpg#2 1 1 2
3 1056338697_4f7d7ce270.jpg 3207358897_bfa61fa3c6.jpg#2 1 2 2
4 1056338697_4f7d7ce270.jpg 3286822339_5535af6b93.jpg#2 1 1 2

Check the cases where all three experts agree

experts_agreement_bool = experts.apply(
    lambda x: x["expert_1"] == x["expert_2"] and x["expert_2"] == x["expert_3"], 
    axis=1
)
experts_agreement = experts[experts_agreement_bool][
    ["image_file_name", "caption_id", "expert_1"]
].rename(columns={"expert_1":"expert"})
experts_agreement.head()
image_file_name caption_id expert
0 1056338697_4f7d7ce270.jpg 2549968784_39bfbe44f9.jpg#2 1
5 1056338697_4f7d7ce270.jpg 3360930596_1e75164ce6.jpg#2 1
6 1056338697_4f7d7ce270.jpg 3545652636_0746537307.jpg#2 1
8 106490881_5a2dd9b7bd.jpg 1425069308_488e5fcf9d.jpg#2 1
9 106490881_5a2dd9b7bd.jpg 1714316707_8bbaa2a2ba.jpg#2 2
experts_agreement["expert"].value_counts().sort_index()
1    2350
2     580
3     214
4     247
Name: expert, dtype: int64
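
As a side note, the row-wise apply above can be replaced by a vectorized check; a minimal sketch that produces the same boolean mask:

# A row has unanimous agreement when its three expert columns contain a single unique value.
experts_agreement_bool = experts[["expert_1", "expert_2", "expert_3"]].nunique(axis=1) == 1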

Load the captions data

captions = read_csv(
    os.path.join(os.environ["DATA_FOLDER"], "Flickr8k.token.txt"), 
    sep="\t", 
    header=None, 
    names=["caption_id", "caption"]
)
captions.head()
caption_id caption
0 1000268201_693b08cb0e.jpg#0 A child in a pink dress is climbing up a set o...
1 1000268201_693b08cb0e.jpg#1 A girl going into a wooden building .
2 1000268201_693b08cb0e.jpg#2 A little girl climbing into a wooden playhouse .
3 1000268201_693b08cb0e.jpg#3 A little girl climbing the stairs to her playh...
4 1000268201_693b08cb0e.jpg#4 A little girl in a pink dress going into a woo...
def get_caption(caption_id, captions):
    return captions[captions["caption_id"] == caption_id]["caption"].values[0]
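
For example, looking up one of the caption ids shown above:

get_caption(caption_id="1000268201_693b08cb0e.jpg#1", captions=captions)
'A girl going into a wooden building .'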

Relevant (image, text) pairs

relevant_data = experts_agreement[experts_agreement["expert"] == 4]
relevant_data.head(3)
image_file_name caption_id expert
43 1119015538_e8e796281e.jpg 416106657_cab2a107a5.jpg#2 4
53 1131932671_c8d17751b3.jpg 1131932671_c8d17751b3.jpg#2 4
66 115684808_cb01227802.jpg 115684808_cb01227802.jpg#2 4

Create the labeled data frame

from ntpath import basename
from pandas import DataFrame

labeled_data = DataFrame(
    data={
        "qid": list(range(relevant_data.shape[0])),
        # Use the caption text as the query, dropping the tokenized " ," and " ." punctuation.
        "query": [get_caption(
            caption_id=x,
            captions=captions
        ).replace(" ,", "").replace(" .", "") for x in list(relevant_data.caption_id)],
        "doc_id": [basename(x) for x in list(relevant_data.image_file_name)],
        "relevance": 1
    }
)
labeled_data.head()
qid query doc_id relevance
0 0 A white dog runs in the grass 1119015538_e8e796281e.jpg 1
1 1 A boy jumps from one bed to another 1131932671_c8d17751b3.jpg 1
2 2 Three people and a sled 115684808_cb01227802.jpg 1
3 3 A group of people walking a city street in war... 1174629344_a2e1a2bdbf.jpg 1
4 4 Two children one of which is holding a stick a... 1322323208_c7ecb742c6.jpg 1
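
As a sanity check, the labeled data should contain one row per pair that all three experts scored 4, i.e. the 247 pairs counted above:

labeled_data.shape  # (247, 4): qid, query, doc_id and relevance for each relevant pair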

From text to embeddings

Create a text processor to map a text string into the same 512-dimensional space used to embed the images.

import clip
import torch

class TextProcessor(object):
    def __init__(self, model_name):
        self.model, _ = clip.load(model_name)
        
    def embed(self, text):
        # Tokenize the text and encode it with the CLIP text encoder, then
        # normalize the embedding so that dot products equal cosine similarities.
        text_tokens = clip.tokenize(text)
        with torch.no_grad():
            text_features = self.model.encode_text(text_tokens).float()
            text_features /= text_features.norm(dim=-1, keepdim=True)
        return text_features.tolist()[0]

Evaluate

Define search evaluation metrics:

from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [
    MatchRatio(), 
    Recall(at=5), 
    Recall(at=100), 
    ReciprocalRank(at=5), 
    ReciprocalRank(at=100)
]

Instantiate TextProcessor with a specific CLIP model.

text_processor = TextProcessor(model_name="ViT-B/32")
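
A quick sanity check, assuming the CLIP weights download successfully: the text embedding should be a 512-dimensional list, the same dimension used for the image embeddings.

embedding = text_processor.embed("A white dog runs in the grass")
assert len(embedding) == 512  # matches the dimension of the vit_b_32_image field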

Create the QueryModels to be evaluated. In this case we create two query models based on the ViT-B/32 CLIP model: one that sends the query as is and another that prepends the prompt "A photo of " to the query before sending it, as suggested in the original CLIP paper.

from vespa.query import QueryModel

def create_vespa_query(query, prompt=False):
    if prompt:
        query = "A photo of " + query.lower()
    return {
        # Nearest-neighbor search over the stored image embeddings, retrieving up to 100 candidates.
        'yql': 'select * from sources * where ([{"targetNumHits":100}]nearestNeighbor(vit_b_32_image,vit_b_32_text));',
        'hits': 100,
        'ranking.features.query(vit_b_32_text)': text_processor.embed(query),
        'ranking.profile': 'vit-b-32-similarity',
        'timeout': 10
    }

query_model_1 = QueryModel(name="vit_b_32", body_function=create_vespa_query)
query_model_2 = QueryModel(name="vit_b_32_prompt", body_function=lambda x: create_vespa_query(x, prompt=True))
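
To inspect the request body the prompted model produces, we can call create_vespa_query directly; the 512-dimensional text embedding travels under ranking.features.query(vit_b_32_text):

body = create_vespa_query("A white dog runs in the grass", prompt=True)
body["yql"]                                          # nearestNeighbor search against vit_b_32_image
len(body["ranking.features.query(vit_b_32_text)"])   # 512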

Create a connection to the Vespa instance:

from vespa.application import Vespa

app = Vespa(
    url=os.environ["VESPA_END_POINT"],
    cert=os.environ["PRIVATE_CERTIFICATE_PATH"]
)

Evaluate the query models using the labeled data and the metrics defined earlier. The doc_id column of the labeled data holds image file names, so we pass id_field="image_file_name" to match hits on that field.

result = app.evaluate(
    labeled_data=labeled_data, 
    eval_metrics=eval_metrics, 
    query_model=[query_model_1, query_model_2], 
    id_field="image_file_name"
)

The results show that there is plenty of room for improvement over the pre-trained ViT-B/32 CLIP model.

result
model                         vit_b_32    vit_b_32_prompt
match_ratio           mean    0.012359    0.012359
                      median  0.012359    0.012359
                      std     0.000000    0.000000
recall_5              mean    0.417004    0.412955
                      median  0.000000    0.000000
                      std     0.494065    0.493365
recall_100            mean    0.870445    0.870445
                      median  1.000000    1.000000
                      std     0.336495    0.336495
reciprocal_rank_5     mean    0.279285    0.268084
                      median  0.000000    0.000000
                      std     0.394814    0.386606
reciprocal_rank_100   mean    0.304849    0.293651
                      median  0.111111    0.100000
                      std     0.378595    0.370633