Explore the YouTube-8M video-level dataset. The goal of this section is to create a tf.data.Dataset from a set of .tfrecord files.

Requirements

This code was run with TensorFlow 2.6.0.

import tensorflow as tf
print(tf.__version__)
2.6.0

Load data

The sample data were downloaded with

curl data.yt8m.org/download.py | shard=1,100 partition=2/video/train mirror=us python

following the instructions on the YouTube-8M dataset download page.

Load raw dataset

Import libraries and specify data_folder.

import os
import glob

data_folder = "/home/default/video"

List .tfrecord files to be loaded.

filenames = glob.glob(os.path.join(data_folder, "*.tfrecord"))
print(filenames[0])
print(filenames[-1])
/home/default/video/train0093.tfrecord
/home/default/video/train3749.tfrecord

Load .tfrecord files into a raw (not parsed) dataset.

raw_dataset = tf.data.TFRecordDataset(filenames)
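
Each element of a raw dataset is one serialized record, which can be inspected before a parser is written. The sketch below is only an illustration under stated assumptions: it writes a tiny stand-in file to a temporary directory (the file name `sample.tfrecord`, the id `abcd`, and the label values are made up, not taken from the downloaded shards), loads it the same way, and decodes the first record to list its feature keys.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical stand-in for one downloaded shard, so the sketch runs without the data.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tfrecord")

example = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"abcd"])),
    "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 522])),
}))
with tf.io.TFRecordWriter(sample_path) as writer:
    writer.write(example.SerializeToString())

sample_raw_dataset = tf.data.TFRecordDataset([sample_path])

# Each element is a scalar tf.string tensor holding one serialized Example.
for raw_record in sample_raw_dataset.take(1):
    decoded = tf.train.Example.FromString(raw_record.numpy())
    feature_keys = sorted(decoded.features.feature.keys())
    print(raw_record.dtype, raw_record.shape)
    print(feature_keys)
```

The same loop should work on raw_dataset from the cell above, where it would list the feature keys stored in the downloaded files.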

Parse raw dataset

Create a function to parse the raw data. According to the download section of the YouTube-8M dataset site, the video-level data are stored as tensorflow.Example protocol buffers with the following text format:

features: {
  feature: {
    key  : "id"
    value: {
      bytes_list: {
        value: (Video id)
      }
    }
  }
  feature: {
    key  : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172]  # label list
      }
    }
  }
  feature: {
    # Average of all 'rgb' features for the video
    key  : "mean_rgb"
    value: {
      float_list: {
        value: [1024 float features]
      }
    }
  }
  feature: {
    # Average of all 'audio' features for the video
    key  : "mean_audio"
    value: {
      float_list: {
        value: [128 float features]
      }
    }
  }
}

# Create a description of the features.
feature_description = {
    'id': tf.io.FixedLenFeature([1], tf.string, default_value=''),
    'labels': tf.io.FixedLenSequenceFeature([], tf.int64, default_value=0, allow_missing=True),
    'mean_audio': tf.io.FixedLenFeature([128], tf.float32, default_value=[0.0] * 128),
    'mean_rgb': tf.io.FixedLenFeature([1024], tf.float32, default_value=[0.0] * 1024),
}

def _parse_function(example_proto):
  # Parse the input `tf.train.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset
<MapDataset shapes: {id: (1,), labels: (None,), mean_audio: (128,), mean_rgb: (1024,)}, types: {id: tf.string, labels: tf.int64, mean_audio: tf.float32, mean_rgb: tf.float32}>
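
The feature description can also be checked without the downloaded files. The following is a minimal sketch, assuming a synthetic record whose id, labels, and all-zero feature vectors are invented for illustration. It repeats the schema under the hypothetical name `sketch_description` so that it runs on its own, serializes one tf.train.Example in the format described above, and parses it back.

```python
import tensorflow as tf

# Same schema as the feature description above, repeated to keep the sketch self-contained.
sketch_description = {
    "id": tf.io.FixedLenFeature([1], tf.string, default_value=""),
    "labels": tf.io.FixedLenSequenceFeature(
        [], tf.int64, default_value=0, allow_missing=True),
    "mean_audio": tf.io.FixedLenFeature(
        [128], tf.float32, default_value=[0.0] * 128),
    "mean_rgb": tf.io.FixedLenFeature(
        [1024], tf.float32, default_value=[0.0] * 1024),
}

# Synthetic record: the values are placeholders, not real YouTube-8M data.
synthetic = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"abcd"])),
    "labels": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[1, 522, 11, 172])),
    "mean_rgb": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.0] * 1024)),
    "mean_audio": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.0] * 128)),
}))

# Parse the serialized record with the schema and collect the resulting shapes.
parsed = tf.io.parse_single_example(
    synthetic.SerializeToString(), sketch_description)
shapes = {key: value.shape.as_list() for key, value in parsed.items()}
print(shapes)
```

Only the length of labels depends on the record. The other three features have fixed shapes, which is why the dataset signature above reports (None,) for labels alone.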

Check parsed dataset

for parsed_record in parsed_dataset.take(1):
  print(repr(parsed_record))
{'id': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'eXbF'], dtype=object)>, 'labels': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([ 0, 12])>, 'mean_audio': <tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-1.2556146 ,  0.17297305,  0.53898615,  1.5446128 ,  1.4344678 ,
        0.41190457,  1.2042887 ,  0.9899097 , -0.28567997,  1.1892846 ,
        0.6182132 , -0.54916394, -0.02003632,  0.7124445 , -1.275734  ,
       -1.0121363 ,  0.8652152 ,  0.45430297, -0.5905393 , -0.8244694 ,
        0.95853716,  0.379509  , -1.1317158 ,  0.46737486,  1.3991169 ,
       -0.4367456 , -0.287044  , -0.7412639 ,  0.5608105 ,  0.9686536 ,
        0.36370906,  0.15887815,  1.1279035 , -0.08369077, -0.20577091,
       -1.467152  , -0.9784904 ,  0.44680086,  1.1796227 ,  0.14648826,
        1.3656982 ,  0.12989263, -0.9865609 , -1.2897152 ,  0.6123024 ,
        0.1184121 ,  0.49931577, -1.1900278 ,  0.0516886 ,  0.16899465,
       -1.0225939 , -0.6807922 , -1.1495618 ,  0.5336437 , -0.10267343,
       -0.14041142, -0.20417954,  0.37166587,  0.56979036,  0.7668918 ,
        0.17683779, -0.4835771 ,  0.188432  ,  0.948989  ,  0.59286505,
        0.7839421 , -0.29659215,  0.06305546,  0.13159767,  1.1180142 ,
        0.85737205,  1.5399523 , -0.28511164, -0.49676266,  0.21741751,
       -0.85834265,  0.88090146, -0.7543358 ,  0.4161103 , -0.19713208,
       -0.13404599,  0.9562638 ,  0.3493868 ,  0.9435329 , -0.8736879 ,
       -0.2188428 , -0.2544211 ,  0.24140158,  0.31994662,  0.69403017,
       -1.1273963 , -0.801281  ,  0.04793753, -0.69943386,  0.8120182 ,
       -0.28852168, -0.16166747,  0.94978464, -1.2834635 ,  0.32062864,
       -0.66567427,  1.2626008 , -1.583094  , -0.97621703,  1.3589919 ,
        0.43338794, -0.5152907 , -1.63595   , -0.4190133 , -0.16496386,
       -0.81412554, -0.22532192, -0.28386128, -0.4277658 , -0.7794566 ,
        0.16581193, -1.0593089 , -0.03117585,  0.0952237 ,  1.1476818 ,
       -0.28931737, -0.7578596 , -0.48096272,  0.36552775,  0.35063717,
        0.5677443 ,  1.4371959 ,  0.81667864], dtype=float32)>, 'mean_rgb': <tf.Tensor: shape=(1024,), dtype=float32, numpy=
array([ 0.5198898 ,  0.30175963, -0.5135856 , ...,  0.44089007,
        0.398037  , -0.48050806], dtype=float32)>}
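
Printing a full record is verbose because of the 128 and 1024 float values. A compact summary may be easier to scan. The helper below is a sketch, not part of the original code: `summarize_record` is a hypothetical name, and `demo_record` is a hand-built stand-in with the same keys and shapes as a parsed record.

```python
import tensorflow as tf


def summarize_record(record):
    """Return a short, printable summary of one parsed record."""
    return {
        # 'id' has shape (1,), so take its single element and decode the bytes.
        "id": record["id"].numpy()[0].decode("utf-8"),
        "labels": record["labels"].numpy().tolist(),
        "mean_audio_shape": record["mean_audio"].shape.as_list(),
        "mean_rgb_shape": record["mean_rgb"].shape.as_list(),
    }


# Hypothetical record with placeholder values, used only to exercise the helper.
demo_record = {
    "id": tf.constant([b"abcd"]),
    "labels": tf.constant([0, 12], dtype=tf.int64),
    "mean_audio": tf.zeros([128]),
    "mean_rgb": tf.zeros([1024]),
}
summary = summarize_record(demo_record)
print(summary)
```

With the real data, the helper could be applied in the same loop as above, for example by calling summarize_record(parsed_record) for each record in parsed_dataset.take(3).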