Simple baseline for YouTube 8M video-level features with TensorFlow
Getting started with TensorFlow API
The goal of this blog post is to show how to use the TensorFlow API to create a multi-label logistic classification model that takes multiple inputs. The focus is not on the results, since we use only a sample dataset, but on the API itself. This post builds on a previous blog post that shows how to create a TensorFlow Dataset for the YouTube 8M video-level dataset.
This code works with TensorFlow 2.6.0.
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
The parsed dataset created in the previous blog post was saved using tf.data.experimental.save:
tf.data.experimental.save(parsed_dataset, os.path.join(data_folder, "dataset"))
Load the parsed dataset:
parsed_dataset = tf.data.experimental.load(os.path.join(os.environ["DATA_FOLDER"], "dataset"))
for parsed_record in parsed_dataset.take(1):
    print(repr(parsed_record))
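Before building the model, it can also help to inspect the dataset's element structure (an optional check, not part of the original pipeline); each element should expose the mean_rgb, mean_audio and labels keys used below:
print(parsed_dataset.element_spec)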
We now build a multi-label logistic classification model that takes the image and audio feature vectors as input.
According to the YouTube 8M video-level dataset description there are 3862 classes. We can check that the labels in our sample data fall within these 3862 classes. It is a good opportunity to use the tf.data.Dataset.reduce method.
def tf_reduce_unique_values(old, new):
    # Accumulate the unique label indices seen so far across all records.
    concat_tensor = tf.concat([old, new["labels"]], axis=0)
    y, _ = tf.unique(concat_tensor)
    return y

unique_labels = tf.sort(parsed_dataset.reduce(
    np.array([], dtype=np.int64), tf_reduce_unique_values
))
unique_labels
assert unique_labels[-1] <= 3861 # The dataset has a total of 3862 classes
We can then define the number of classes to be 3862:
number_classes = 3862
Use the Keras functional API to define a model with multiple inputs:
mean_rgb = keras.Input(name="mean_rgb", shape=(1024,))    # image features
mean_audio = keras.Input(name="mean_audio", shape=(128,))  # audio features
x = keras.layers.concatenate([mean_rgb, mean_audio])
x = keras.layers.Dense(activation="sigmoid", units=number_classes)(x)
model = keras.Model(inputs=[mean_rgb, mean_audio], outputs=[x])
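To inspect the resulting architecture, we can optionally print a model summary:
model.summary()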
Since each video can belong to more than one class, we need to build a multi-label classification model. We can then use the binary crossentropy loss function and the binary accuracy metric for reasons discussed in this blog post.
model.compile(
optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()],
)
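As a minimal sketch of why binary crossentropy fits the multi-label setting (the numbers below are made up for illustration), the loss scores each class as an independent yes/no decision and averages the per-class terms:
# Toy example with 4 classes: classes 1 and 2 are present (hypothetical values)
y_true = tf.constant([[0.0, 1.0, 1.0, 0.0]])
y_pred = tf.constant([[0.1, 0.9, 0.8, 0.2]])  # independent sigmoid outputs
print(float(keras.losses.BinaryCrossentropy()(y_true, y_pred)))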
The Keras training API accepts a tf.data.Dataset as input, but it expects a tuple containing (features, labels). We then need to preprocess our parsed_dataset to turn it into a train_dataset with the appropriate output format. We also need to transform the labels from a list of integers into a multi-hot encoding, as described in this blog post.
def training_preprocessing(data, number_classes):
    # Keep only the two feature vectors the model expects as inputs.
    features = {"mean_rgb": data["mean_rgb"], "mean_audio": data["mean_audio"]}
    # Turn the list of label indices into a single multi-hot vector.
    one_hot = tf.one_hot(indices=data["labels"], depth=number_classes)
    label = tf.reduce_max(one_hot, axis=0)
    return (features, label)

train_dataset = parsed_dataset.map(lambda x: training_preprocessing(x, number_classes=number_classes))

for data in train_dataset.take(1):
    print(repr(data))
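As a quick sanity check of the multi-hot trick (toy values, with the depth reduced to 4 for readability): tf.one_hot expands each label index into a one-hot row, and tf.reduce_max collapses those rows into a single multi-hot vector:
toy_labels = tf.constant([0, 2], dtype=tf.int64)
print(tf.reduce_max(tf.one_hot(indices=toy_labels, depth=4), axis=0).numpy())  # [1. 0. 1. 0.]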
We can then use the fit method with the train_dataset that we created above.
model.fit(train_dataset.batch(32), epochs=3)
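After training, a quick optional check is to run a prediction on a single batch and confirm that the output has one sigmoid score per class (the exact batch size depends on the sample data):
for features, label in train_dataset.batch(32).take(1):
    print(model.predict(features).shape)  # (batch_size, 3862)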