Required packages

import pandas as pd
import numpy as np
import tensorflow as tf

In-memory data - numeric features

Load data with pandas

For small datasets, we can load them into memory using a pandas DataFrame.

abalone_train = pd.read_csv(
    "https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv",
    names=["Length", "Diameter", "Height", "Whole weight", "Shucked weight",
           "Viscera weight", "Shell weight", "Age"])

abalone_train.head()
   Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Age
0   0.435     0.335   0.110         0.334          0.1355          0.0775        0.0965    7
1   0.585     0.450   0.125         0.874          0.3545          0.2075        0.2250    6
2   0.655     0.510   0.160         1.092          0.3960          0.2825        0.3700   14
3   0.545     0.425   0.125         0.768          0.2940          0.1495        0.2600   16
4   0.545     0.420   0.130         0.879          0.3740          0.1695        0.2300   13
abalone_train.dtypes
Length            float64
Diameter          float64
Height            float64
Whole weight      float64
Shucked weight    float64
Viscera weight    float64
Shell weight      float64
Age                 int64
dtype: object

Separate label and features

abalone_features = abalone_train.copy()
abalone_label = abalone_features.pop("Age")
abalone_features.head()
   Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight
0   0.435     0.335   0.110         0.334          0.1355          0.0775        0.0965
1   0.585     0.450   0.125         0.874          0.3545          0.2075        0.2250
2   0.655     0.510   0.160         1.092          0.3960          0.2825        0.3700
3   0.545     0.425   0.125         0.768          0.2940          0.1495        0.2600
4   0.545     0.420   0.130         0.879          0.3740          0.1695        0.2300
abalone_label.head()
0     7
1     6
2    14
3    16
4    13
Name: Age, dtype: int64

Numeric features as numpy array

X and y can be used to fit a model.

X, y = np.array(abalone_features), abalone_label
X
array([[0.435 , 0.335 , 0.11  , ..., 0.1355, 0.0775, 0.0965],
       [0.585 , 0.45  , 0.125 , ..., 0.3545, 0.2075, 0.225 ],
       [0.655 , 0.51  , 0.16  , ..., 0.396 , 0.2825, 0.37  ],
       ...,
       [0.53  , 0.42  , 0.13  , ..., 0.3745, 0.167 , 0.249 ],
       [0.395 , 0.315 , 0.105 , ..., 0.1185, 0.091 , 0.1195],
       [0.45  , 0.355 , 0.12  , ..., 0.1145, 0.0665, 0.16  ]])
y
0        7
1        6
2       14
3       16
4       13
        ..
3315    15
3316    10
3317    11
3318    16
3319    19
Name: Age, Length: 3320, dtype: int64
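
As an illustration, a small regression model can be fit directly on X and y. This is only a sketch; the architecture and training settings below are assumptions, not dictated by the data.

abalone_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)  # single regression output for Age
])

abalone_model.compile(
    loss=tf.keras.losses.MeanSquaredError(),
    optimizer=tf.keras.optimizers.Adam()
)
abalone_model.fit(X, y, epochs=10)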

In-memory data - mixed data types

Load data with pandas

titanic = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic.head()
   survived     sex   age  n_siblings_spouses  parch     fare  class     deck  embark_town alone
0         0    male  22.0                   1      0   7.2500  Third  unknown  Southampton     n
1         1  female  38.0                   1      0  71.2833  First        C    Cherbourg     n
2         1  female  26.0                   0      0   7.9250  Third  unknown  Southampton     y
3         1  female  35.0                   1      0  53.1000  First        C  Southampton     n
4         0    male  28.0                   0      0   8.4583  Third  unknown   Queenstown     y
titanic.dtypes
survived                int64
sex                    object
age                   float64
n_siblings_spouses      int64
parch                   int64
fare                  float64
class                  object
deck                   object
embark_town            object
alone                  object
dtype: object
titanic_features = titanic.copy()
titanic_label = titanic_features.pop("survived")

Pre-process mixed type data

Create a pre-processing model that can be used as part of a larger model. Such a pre-processing model can, for example, concatenate and normalize all numeric features of type float64 and apply one-hot encoding to the categorical features of type object.
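
The tf.keras.Model call below references two symbolic objects, inputs and preprocessed_inputs_cat, which have to be built first. The following is a minimal sketch of one way to construct them, assuming a recent TensorFlow 2.x; the choice of Normalization, StringLookup, and CategoryEncoding layers is illustrative, not the only option.

inputs = {}
for name, column in titanic_features.items():
    # Keep strings as tf.string, treat every numeric column as float32.
    dtype = tf.string if column.dtype == object else tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

# Concatenate the numeric inputs and normalize them together.
numeric_inputs = {name: inp for name, inp in inputs.items()
                  if inp.dtype == tf.float32}
x = tf.keras.layers.Concatenate()(list(numeric_inputs.values()))
norm = tf.keras.layers.Normalization()
norm.adapt(np.array(titanic_features[list(numeric_inputs)]))
preprocessed_inputs = [norm(x)]

# One-hot encode each string input via a vocabulary lookup.
for name, inp in inputs.items():
    if inp.dtype == tf.string:
        lookup = tf.keras.layers.StringLookup(
            vocabulary=np.unique(titanic_features[name]))
        one_hot = tf.keras.layers.CategoryEncoding(
            num_tokens=lookup.vocabulary_size())
        preprocessed_inputs.append(one_hot(lookup(inp)))

preprocessed_inputs_cat = tf.keras.layers.Concatenate()(preprocessed_inputs)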

titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)

Once the pre-processing model is set up, we can pass it a dict of features as input:

Parse pandas DataFrame to use as input

titanic_features_dict = {name: np.array(value) 
                         for name, value in titanic_features.items()}

Show first row of the dict:

{name: value[:1] for name, value in titanic_features_dict.items()}
{'sex': array(['male'], dtype=object),
 'age': array([22.]),
 'n_siblings_spouses': array([1]),
 'parch': array([0]),
 'fare': array([7.25]),
 'class': array(['Third'], dtype=object),
 'deck': array(['unknown'], dtype=object),
 'embark_town': array(['Southampton'], dtype=object),
 'alone': array(['n'], dtype=object)}
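
As a sanity check, the pre-processing model can be called on a one-row slice of this dict (this assumes the titanic_preprocessing model sketched above):

features_dict = {name: values[:1]
                 for name, values in titanic_features_dict.items()}
titanic_preprocessing(features_dict)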

tf.data.Dataset from in-memory data through from_tensor_slices

features_ds = tf.data.Dataset.from_tensor_slices(titanic_features_dict)

Check first example:

for example in features_ds:
    for name, value in example.items():
        print("{}: {}".format(name, value))
    break
sex: b'male'
age: 22.0
n_siblings_spouses: 1
parch: 0
fare: 7.25
class: b'Third'
deck: b'unknown'
embark_town: b'Southampton'
alone: b'n'

from_tensor_slices can handle any structure of nested dictionaries and tuples.

titanic_ds = tf.data.Dataset.from_tensor_slices((titanic_features_dict, titanic_label))
for feature, label in titanic_ds:
    for name, value in feature.items():
        print("{}: {}".format(name, value))
    print("Target: {}".format(label))
    break
sex: b'male'
age: 22.0
n_siblings_spouses: 1
parch: 0
fare: 7.25
class: b'Third'
deck: b'unknown'
embark_town: b'Southampton'
alone: b'n'
Target: 0

To train a model using this Dataset, you'll need to at least shuffle and batch the data.

titanic_batches = titanic_ds.shuffle(len(titanic_label)).batch(32)

titanic_batches can be passed directly to Model.fit in place of separate X and y arguments.
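
For example, a complete model could stack a small dense network on top of the titanic_preprocessing model sketched earlier and train on these batches; the layer sizes and training settings below are illustrative assumptions.

body = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)  # logit for survival
])

preprocessed = titanic_preprocessing(inputs)
result = body(preprocessed)
titanic_model = tf.keras.Model(inputs, result)

titanic_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam()
)
titanic_model.fit(titanic_batches, epochs=5)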

Create tf.data.Dataset from CSV file

Uncompressed file

We can create a tf.data.Dataset directly from a .csv file when the data does not fit into memory.

titanic_file_path = tf.keras.utils.get_file(
    "train.csv", 
    "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
)

Check the local path of the downloaded file:

titanic_file_path
'/Users/tmartins/.keras/datasets/train.csv'

Create a tf.data.Dataset from the .csv file above:

titanic_csv_ds = tf.data.experimental.make_csv_dataset(
    titanic_file_path,
    batch_size=5, # Artificially small to make examples easier to show.
    label_name='survived',
    num_epochs=1,
    ignore_errors=True
)

Take the first batch of data from the tf.data.Dataset:

for batch, label in titanic_csv_ds.take(1):
    for key, value in batch.items():
        print(f"{key:20s}: {value}")
    print()
    print(f"{'label':20s}: {label}")
sex                 : [b'male' b'female' b'male' b'female' b'female']
age                 : [43. 36. 50.  9. 44.]
n_siblings_spouses  : [0 1 1 3 0]
parch               : [0 0 0 2 0]
fare                : [ 8.05   17.4    55.9    27.9    27.7208]
class               : [b'Third' b'Third' b'First' b'Third' b'First']
deck                : [b'unknown' b'unknown' b'E' b'unknown' b'B']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton' b'Cherbourg']
alone               : [b'y' b'n' b'n' b'n' b'y']

label               : [0 1 0 0 1]

Compressed file

Download compressed file:

traffic_volume_csv_gz = tf.keras.utils.get_file(
    'Metro_Interstate_Traffic_Volume.csv.gz', 
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz",
    cache_dir='.', 
    cache_subdir='traffic'
)

Set the compression_type argument to "GZIP":

traffic_volume_csv_gz_ds = tf.data.experimental.make_csv_dataset(
    traffic_volume_csv_gz,
    batch_size=256,
    label_name='traffic_volume',
    num_epochs=1,
    compression_type="GZIP"
)

Take a peek at the first 5 values of each feature and the label in the first batch:

for batch, label in traffic_volume_csv_gz_ds.take(1):
    for key, value in batch.items():
        print(f"{key:20s}: {value[:5]}")
    print()
    print(f"{'label':20s}: {label[:5]}")
holiday             : [b'None' b'None' b'None' b'None' b'None']
temp                : [296.68 294.37 275.68 270.68 275.74]
rain_1h             : [0. 0. 0. 0. 0.]
snow_1h             : [0. 0. 0. 0. 0.]
clouds_all          : [ 0  0 90 75  1]
weather_main        : [b'Clear' b'Clear' b'Clouds' b'Clouds' b'Clear']
weather_description : [b'Sky is Clear' b'Sky is Clear' b'overcast clouds' b'broken clouds'
 b'sky is clear']
date_time           : [b'2013-08-12 20:00:00' b'2013-08-08 13:00:00' b'2013-04-10 07:00:00'
 b'2013-01-08 04:00:00' b'2013-10-22 10:00:00']

label               : [2805 5296 6812  758 4384]

List of files

Download and extract a zip archive containing a set of CSV files:

fonts_zip = tf.keras.utils.get_file(
    'fonts.zip',  "https://archive.ics.uci.edu/ml/machine-learning-databases/00417/fonts.zip",
    cache_dir='.', 
    cache_subdir='fonts',
    extract=True
)

List the downloaded CSV files:

import pathlib

font_csvs = sorted(str(p) for p in pathlib.Path('fonts').glob("*.csv"))
font_csvs[:10]
['fonts/AGENCY.csv',
 'fonts/ARIAL.csv',
 'fonts/BAITI.csv',
 'fonts/BANKGOTHIC.csv',
 'fonts/BASKERVILLE.csv',
 'fonts/BAUHAUS.csv',
 'fonts/BELL.csv',
 'fonts/BERLIN.csv',
 'fonts/BERNARD.csv',
 'fonts/BITSTREAMVERA.csv']
len(font_csvs)
153

Create a tf.data.Dataset from these files by passing a glob pattern; make_csv_dataset also accepts an explicit list of file paths:

fonts_ds = tf.data.experimental.make_csv_dataset(
    file_pattern="fonts/*.csv",
    batch_size=10,
    num_epochs=1,
    num_parallel_reads=20,
    shuffle_buffer_size=10000
)

Print the first 16 features of the first batch:

for features in fonts_ds.take(1):
    for i, (name, value) in enumerate(features.items()):
        if i>15:
            break
        print(f"{name:20s}: {value}")
    print('...')
    print(f"[total: {len(features)} features]")
font                : [b'BERNARD' b'JAVANESE' b'ONYX' b'MINGLIU' b'ELEPHANT' b'BANKGOTHIC'
 b'HANDPRINT' b'COMMERCIALSCRIPT' b'BERNARD' b'HARLOW']
fontVariant         : [b'BERNARD MT CONDENSED' b'JAVANESE TEXT' b'ONYX' b'MINGLIU_HKSCS-EXTB'
 b'ELEPHANT' b'BANKGOTHIC MD BT' b'scanned' b'COMMERCIALSCRIPT BT'
 b'BERNARD MT CONDENSED' b'HARLOW SOLID ITALIC']
m_label             : [176 219 103 195  72 186  54 100  68  97]
strength            : [0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
italic              : [0 1 1 1 0 0 0 1 0 0]
orientation         : [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
m_top               : [33 42 46 21 37 43  0 38 33 54]
m_left              : [23 32 18 19 23 24  0 24 20 23]
originalH           : [21 63 48 57 49 17 20 39 53 25]
originalW           : [21 52 34 44 61 24 20 51 34 30]
h                   : [20 20 20 20 20 20 20 20 20 20]
w                   : [20 20 20 20 20 20 20 20 20 20]
r0c0                : [  1   1   1   1 105   1   1   1 255   1]
r0c1                : [  1   1   1   1 178  86   1   1 255   1]
r0c2                : [  1   1   1   1 255 255   1   1 255   1]
r0c3                : [  1   1   1   1 255 255   1   1 255   1]
...
[total: 412 features]
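
The r0c0, r0c1, … columns are the pixel intensities of each rendered glyph: with h and w both equal to 20, the 412 features are 400 pixel columns plus 12 metadata columns. As a sketch (the helper below is an illustrative assumption, not part of the dataset's API), the pixel columns can be stacked into a (batch, 20, 20) image tensor with Dataset.map:

import re

def make_images(features):
    # Collect the pixel columns in row-major order; all other
    # features stay in the dict unchanged.
    pixel_keys = sorted(
        (key for key in features if re.fullmatch(r'r\d+c\d+', key)),
        key=lambda key: tuple(int(n) for n in re.findall(r'\d+', key)))
    pixels = tf.stack([features.pop(key) for key in pixel_keys], axis=-1)
    features['image'] = tf.reshape(pixels, [-1, 20, 20])
    return features

fonts_image_ds = fonts_ds.map(make_images)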