Load tabular data with TensorFlow
In-memory and out-of-memory tabular data loading with TensorFlow
- Required packages
- In-memory data - numeric features
- In-memory data - mixed data types
- Create tf.data.Dataset from CSV file
import pandas as pd
import numpy as np
import tensorflow as tf
For small datasets, we can load them into memory using a pandas DataFrame.
abalone_train = pd.read_csv(
"https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv",
names=["Length", "Diameter", "Height", "Whole weight", "Shucked weight",
"Viscera weight", "Shell weight", "Age"])
abalone_train.head()
abalone_train.dtypes
abalone_features = abalone_train.copy()
abalone_label = abalone_features.pop("Age")
abalone_features.head()
abalone_label.head()
X and y can then be used to fit a model, as sketched below.
X, y = np.array(abalone_features), abalone_label
X
y
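For illustration, here is a minimal sketch of fitting a small regression model on X and y. The architecture, loss, optimizer, and number of epochs are assumptions chosen for this example, not prescribed by the dataset.
# Illustrative sketch: a small regression model fit on X (features) and y (Age).
abalone_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)
])
# Mean squared error is a reasonable default loss for a numeric target like Age.
abalone_model.compile(loss=tf.keras.losses.MeanSquaredError(),
                      optimizer=tf.keras.optimizers.Adam())
abalone_model.fit(X, y, epochs=10)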
titanic = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic.head()
titanic.dtypes
titanic_features = titanic.copy()
titanic_label = titanic_features.pop("survived")
Create a preprocessing model that can be used as part of a larger model. The preprocessing model can, for example, concatenate and normalize all numeric features of type float64 and apply one-hot encoding to the categorical features of type object. See, for example, the sketch below.
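The line titanic_preprocessing = tf.keras.Model(...) that follows assumes that inputs (a dict of symbolic tf.keras.Input tensors, one per feature column) and preprocessed_inputs_cat (the concatenated, preprocessed outputs) have already been built. Below is a minimal sketch of one way to construct them with Keras preprocessing layers; the specific layers and variable names are illustrative assumptions and require a reasonably recent TensorFlow (2.6+).
# Symbolic inputs: tf.string for object columns, tf.float32 for everything else.
inputs = {}
for name, column in titanic_features.items():
    dtype = tf.string if column.dtype == object else tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

# Concatenate and normalize the numeric inputs.
numeric_inputs = {name: inp for name, inp in inputs.items() if inp.dtype == tf.float32}
x = tf.keras.layers.Concatenate()(list(numeric_inputs.values()))
norm = tf.keras.layers.Normalization()
norm.adapt(np.array(titanic_features[list(numeric_inputs)]))
all_numeric_inputs = norm(x)

# One-hot encode the string inputs and collect all preprocessed tensors.
preprocessed_inputs = [all_numeric_inputs]
for name, inp in inputs.items():
    if inp.dtype == tf.float32:
        continue
    lookup = tf.keras.layers.StringLookup(vocabulary=np.unique(titanic_features[name]))
    one_hot = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())
    preprocessed_inputs.append(one_hot(lookup(inp)))

preprocessed_inputs_cat = tf.keras.layers.Concatenate()(preprocessed_inputs)
With inputs and preprocessed_inputs_cat defined, the preprocessing can be wrapped into a reusable model: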
titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)
Once the preprocessing model is set up, we can pass a dict of features to it as input:
titanic_features_dict = {name: np.array(value)
for name, value in titanic_features.items()}
Show first row of the dict:
{name: value[:1] for name, value in titanic_features_dict.items()}
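Assuming the preprocessing model sketched above, we can check that it accepts the dict of features by calling it on that one-row slice:
# Illustrative: run the preprocessing model on the first row of the features dict.
one_row = {name: value[:1] for name, value in titanic_features_dict.items()}
titanic_preprocessing(one_row)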
features_ds = tf.data.Dataset.from_tensor_slices(titanic_features_dict)
Check first example:
for example in features_ds:
    for name, value in example.items():
        print("{}: {}".format(name, value))
    break
from_tensor_slices can handle any structure of nested dictionaries and tuples.
titanic_ds = tf.data.Dataset.from_tensor_slices((titanic_features_dict, titanic_label))
for feature, label in titanic_ds:
    for name, value in feature.items():
        print("{}: {}".format(name, value))
    print("Target: {}".format(label))
    break
To train a model using this Dataset, you'll need to at least shuffle and batch the data.
titanic_batches = titanic_ds.shuffle(len(titanic_label)).batch(32)
titanic_batches can be used in fit functions instead of X and y, as sketched below.
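As an illustration (the model body, loss, optimizer, and epoch count are assumptions for this sketch), a classifier built on top of the preprocessing model from earlier can be trained directly on titanic_batches:
# Chain the preprocessing model into a small classifier and train on the batched Dataset.
body = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)
])
result = body(titanic_preprocessing(inputs))  # `inputs` from the preprocessing sketch above
titanic_model = tf.keras.Model(inputs, result)
titanic_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                      optimizer=tf.keras.optimizers.Adam())
titanic_model.fit(titanic_batches, epochs=5)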
If the data does not fit into memory, we can create a tf.data.Dataset directly from a .csv file.
Download the dataset file:
titanic_file_path = tf.keras.utils.get_file(
    "train.csv",
    "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
)
titanic_file_path
Create a tf.data.Dataset from the .csv file above:
titanic_csv_ds = tf.data.experimental.make_csv_dataset(
titanic_file_path,
batch_size=5, # Artificially small to make examples easier to show.
label_name='survived',
num_epochs=1,
ignore_errors=True
)
Take the first batch of data from the tf.data.Dataset:
for batch, label in titanic_csv_ds.take(1):
    for key, value in batch.items():
        print(f"{key:20s}: {value}")
    print()
    print(f"{'label':20s}: {label}")
Download a gzip-compressed CSV file:
traffic_volume_csv_gz = tf.keras.utils.get_file(
'Metro_Interstate_Traffic_Volume.csv.gz',
"https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz",
cache_dir='.',
cache_subdir='traffic'
)
Set the compression_type argument to "GZIP":
traffic_volume_csv_gz_ds = tf.data.experimental.make_csv_dataset(
traffic_volume_csv_gz,
batch_size=256,
label_name='traffic_volume',
num_epochs=1,
compression_type="GZIP"
)
Take a peek at the first 5 values of each feature and the label in the first batch.
for batch, label in traffic_volume_csv_gz_ds.take(1):
    for key, value in batch.items():
        print(f"{key:20s}: {value[:5]}")
    print()
    print(f"{'label':20s}: {label[:5]}")
Download and extract a zip archive containing multiple CSV files:
fonts_zip = tf.keras.utils.get_file(
'fonts.zip', "https://archive.ics.uci.edu/ml/machine-learning-databases/00417/fonts.zip",
cache_dir='.',
cache_subdir='fonts',
extract=True
)
List the downloaded files:
import pathlib
font_csvs = sorted(str(p) for p in pathlib.Path('fonts').glob("*.csv"))
font_csvs[:10]
len(font_csvs)
Create a tf.data.Dataset from multiple files by passing a file pattern to make_csv_dataset (the list font_csvs would work as well):
fonts_ds = tf.data.experimental.make_csv_dataset(
file_pattern = "fonts/*.csv",
batch_size=10,
num_epochs=1,
num_parallel_reads=20,
shuffle_buffer_size=10000
)
Print features:
for features in fonts_ds.take(1):
    for i, (name, value) in enumerate(features.items()):
        if i > 15:
            break
        print(f"{name:20s}: {value}")
    print('...')
    print(f"[total: {len(features)} features]")