Ruddra.com

Convert TFRecords to Pandas and Pandas to TFRecords


The TFRecord (TensorFlow Records) format is a simple format for storing a sequence of binary records. It can be a more efficient way to store data, because binary records can take up less space than the original data (for example, a text CSV). Aside from that, you can partition TFRecords into multiple files. In this article, we are going to see how you can convert a Pandas DataFrame (essentially loaded from a CSV) to TFRecords and vice versa.
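To make "a sequence of binary records" a bit more concrete, here is a minimal standard-library sketch of the record framing as I understand it from the TensorFlow documentation: a little-endian 64-bit length, a masked CRC-32C of that length, the payload, and a masked CRC-32C of the payload. The helper names (`crc32c`, `masked_crc`, `write_record`, `read_records`) are made up for this illustration. In practice you would use `tf.io.TFRecordWriter` rather than hand-rolled code like this.

```python
import io
import struct

# CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC32C_TABLE = []
for _i in range(256):
    _c = _i
    for _ in range(8):
        _c = (_c >> 1) ^ 0x82F63B78 if _c & 1 else _c >> 1
    _CRC32C_TABLE.append(_c)


def crc32c(data: bytes) -> int:
    """Plain table-driven CRC-32C of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant (32-bit wrap)."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, payload: bytes) -> None:
    """Append one framed record: length, length CRC, payload, payload CRC."""
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc(payload)))


def read_records(stream):
    """Yield payloads one by one, checking both checksums."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        (length,) = struct.unpack("<Q", header)
        (header_crc,) = struct.unpack("<I", stream.read(4))
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if header_crc != masked_crc(header) or payload_crc != masked_crc(payload):
            raise ValueError("corrupted record")
        yield payload


buffer = io.BytesIO()
for payload in (b"first record", b"second"):
    write_record(buffer, payload)
buffer.seek(0)
records = list(read_records(buffer))
```

Each record costs 16 bytes of framing on top of its payload, and records can only be read in sequence, which is why the format suits streaming large datasets.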

Pandas to TFRecords

To write to a TFRecords file from Pandas, you need to serialize the data to make it writeable to the TFRecords file. Here is an example:

  1. Let us assume we have the following CSV file:
Name, age, sex
A, 20, M
B, 30, F
C, 35, M
D, 2, F

Then we can load it in Pandas:

import pandas as pd

# skipinitialspace drops the blank after each comma in the CSV above
df = pd.read_csv('example.csv', skipinitialspace=True)
  2. Now, we create an example serializer (or TFRecords example) for the above CSV:
import tensorflow as tf

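def create_tf_records_example(features):
    # features is one row of the data frame: [name, age, sex]
    tf_example = tf.train.Example(
        features=tf.train.Features(feature={
            # strings are stored as UTF-8 encoded bytes
            'name': tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[features[0].encode('utf-8')])),
            # age is an integer, so it goes into int64_list
            'age': tf.train.Feature(int64_list=tf.train.Int64List(
                value=[int(features[1])])),
            # sex is 'M' or 'F', so it is stored as bytes as well
            'sex': tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[features[2].encode('utf-8')])),
    }))
    return tf_example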
  3. After that, let us write the pandas data frame to a TFRecords file:
# the tf_records directory must already exist
with tf.io.TFRecordWriter("tf_records/example.tfrecords") as writer:
    for features in df.values:  # one array per row: [name, age, sex]
        example = create_tf_records_example(features)
        writer.write(example.SerializeToString())

This process will work in both TensorFlow 1 and 2.
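The serializer above hard-codes the three columns of the example CSV. As a rough sketch only, the same idea can be generalized by choosing the feature type from the Python type of each value. The helpers `to_feature` and `row_to_example` are hypothetical names, and the sketch assumes every cell is a string, a float, or an integer.

```python
import pandas as pd
import tensorflow as tf


def to_feature(value):
    """Wrap a single scalar in the matching tf.train.Feature."""
    if isinstance(value, str):
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[value.encode("utf-8")]))
    if isinstance(value, float):
        return tf.train.Feature(
            float_list=tf.train.FloatList(value=[value]))
    # Assumption: anything else is an integer in this sketch.
    return tf.train.Feature(
        int64_list=tf.train.Int64List(value=[int(value)]))


def row_to_example(row):
    """Build a tf.train.Example from a dict mapping column name to value."""
    feature = {str(name).lower(): to_feature(value)
               for name, value in row.items()}
    return tf.train.Example(features=tf.train.Features(feature=feature))


df = pd.DataFrame({"Name": ["A", "B"], "age": [20, 30], "sex": ["M", "F"]})
# to_dict("records") gives one dict per row.
serialized = [row_to_example(row).SerializeToString()
              for row in df.to_dict("records")]
```

Each item in `serialized` can be passed to `writer.write(...)` exactly like the output of `create_tf_records_example`.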

TFRecords to Pandas

Now let us get the pandas data frame from the TFRecords using the following steps:

  1. First, we need to load the TFRecords file:
dataset = tf.data.TFRecordDataset(
    ['tf_records/example.tfrecords'])  # the list can contain multiple TFRecords files
  2. Write a parser for getting the data from the TFRecords file:
def parse_df_element(element):
    # describe the type of every feature stored in a record
    parser = {
        'name': tf.io.FixedLenFeature([], tf.string),
        'age': tf.io.FixedLenFeature([], tf.int64),
        'sex': tf.io.FixedLenFeature([], tf.string),
    }
    # parse one serialized example into a dictionary of tensors
    content = tf.io.parse_single_example(element, parser)
    return content['name'], content['age'], content['sex']
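  3. Then, let us use this parser function to get the data from the TFRecords:
parsed_tf_records = dataset.map(parse_df_element)
df = pd.DataFrame(
    list(parsed_tf_records.as_numpy_iterator()),
    columns=['Name', 'age', 'sex']
)
# string features come back as bytes, so decode them to str
for column in df.select_dtypes(include=['object']):
    df[column] = df[column].str.decode('utf-8')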

Important: This implementation will work on TensorFlow 2 or above, because as_numpy_iterator() relies on eager execution to evaluate the records as NumPy values.
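Since TFRecords can be partitioned and `TFRecordDataset` accepts a list of files, here is a small sketch of writing records to several shard files and reading them back as one dataset. The `write_shards` helper and the file naming pattern are assumptions made for this illustration. The payloads are plain byte strings to keep it short; in practice they would be the output of `example.SerializeToString()`.

```python
import os
import tempfile

import tensorflow as tf


def write_shards(payloads, directory, num_shards):
    """Distribute serialized records round-robin over num_shards files."""
    paths = [
        os.path.join(directory, f"example-{i:05d}-of-{num_shards:05d}.tfrecords")
        for i in range(num_shards)
    ]
    writers = [tf.io.TFRecordWriter(path) for path in paths]
    for index, payload in enumerate(payloads):
        writers[index % num_shards].write(payload)
    for writer in writers:
        writer.close()
    return paths


payloads = [f"record-{i}".encode("utf-8") for i in range(5)]
shard_dir = tempfile.mkdtemp()
shard_paths = write_shards(payloads, shard_dir, num_shards=2)

# TFRecordDataset reads the listed files one after another.
dataset = tf.data.TFRecordDataset(shard_paths)
read_back = sorted(dataset.as_numpy_iterator())
```

Records come back file by file, so the order differs from the order in which they were written. The sketch sorts them only to make the comparison easy.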

In conclusion

Converting TFRecords to a Pandas DataFrame is not strictly necessary, because you can use TFRecords directly in Machine Learning (ML) models and apply aggregation, randomization, and other operations to the data. Still, it is very convenient to convert them back to a Pandas DataFrame. I hope this article helps you with your Machine Learning journey. Cheers!!

Last updated: Oct 19, 2024

