Should we define batch size during graph construction?

March 7, 2016 - just another day in heaven

Intro

So I’ve seen a number of Deep Learning snippets using Tensorflow, and have reused a mixture of them here and there in my code. But there are things that cannot be mixed. Some code follows its own philosophy (okay, let’s call it taste) and refuses to incorporate certain other kinds of code into its workflow. (It is possible that this is because of my noob technical ability, but nevertheless there are points worth mentioning.)

Studying this separation can be really beneficial when starting out with Tensorflow. Here I present two templates that I comfortably switch back and forth between, depending on the model being deployed. (And of course they cannot be mixed.)

Review

Again, I will throw out a terribly generalised statement about the workflow of any Deep Learning experiment:

You read data (supervised, so features \( X \) and labels \( Y \))
If you don’t have a test set, use 10-fold Cross Validation
Now that you have train and test sets, hold out a part of the train set for early stopping, if the train set is big enough
Now go on training (in batches), while simultaneously watching the hold-out set to decide when to stop. Then stop, evaluate the model on the test set and report the result (if you are using 10-fold CV, you’ll have to average the numbers).

So, this is the first code template of the two. It will look something like this:

import tensorflow as tf
import data_manipulator # you build this yourself

# Set up data
hope = FLAGS.hyperParameters.Hold_Out_PErcentage 
train, hold_out, test = data_manipulator.read(input_folder, hope)

# Set up model
class Model(object):
    def __init__(self, hyperParameters):
        self.loss_op, self.accuracy = self.Set_up_layers(hyperParameters)
        
    def Set_up_layers(self, hyperParameters):
        # 1. Set up placeholders self.x and self.y
        # 2. build layers gradually upon self.x 
        #    all the way through to the output layer
        # 3. calculate loss and accuracy, using self.y
        return loss, acc

# Session switch on
sess = tf.Session()
test_acc = float() # report this after the experiment

with sess.as_default():
    
    # Set up the graph
    hyperParams = FLAGS.hyperParameters
    m = Model(hyperParams)

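    # Set up trainer
    optimizer = tf.train.AdamOptimizer(1e-3) # Adagrad, RMSprop, etc.
    gradients = optimizer.compute_gradients(m.loss_op)
    train_op  = optimizer.apply_gradients(gradients)

    # Initialise all variables before the first sess.run on the model
    # (tf.initialize_all_variables() in Tensorflow releases before 0.12)
    sess.run(tf.global_variables_initializer())

    # Training set-ups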
    max_epoch = hyperParams.max_epoch
    batch_size = hyperParams.batch_size
    max_hold_acc = float(0.0) # keep track of the accuracy on hold-out set
    
    # Now start the training
    for batch in data_manipulator.yield_batch(train, batch_size, max_epoch):
        # 1. Let the tensors flow
        #    with gradients calculated and variables updated
        feed_dict = {m.x: batch.x, m.y: batch.y}
        _, train_acc = sess.run([train_op, m.accuracy], feed_dict)

        # 2a. Let the tensors simply flow through
        #     you may want to do this step much less frequently
        feed_dict = {m.x: hold_out.x, m.y: hold_out.y}
        hold_acc = sess.run(m.accuracy, feed_dict)

        # 2b. hold out peaked, report accuracy on test
        if hold_acc >= max_hold_acc:
            max_hold_acc = hold_acc
            feed_dict = {m.x: test.x, m.y: test.y}
            test_acc = sess.run(m.accuracy, feed_dict)
    
# now publish your paper maybe?
print(test_acc)
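
The template leaves data_manipulator to you. As a rough idea of what yield_batch could look like, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the Batch container, the seed argument and the choice to reshuffle every epoch are not prescribed by the template, which only needs objects with .x and .y attributes.

```python
import random
from collections import namedtuple

# Hypothetical container: the templates only assume `.x` and `.y` attributes.
Batch = namedtuple("Batch", ["x", "y"])


def yield_batch(data, batch_size, max_epoch, seed=0):
    """Yield shuffled mini-batches of `data` for `max_epoch` passes."""
    rng = random.Random(seed)
    indices = list(range(len(data.x)))
    for _ in range(max_epoch):
        rng.shuffle(indices)  # new example order every epoch
        for start in range(0, len(indices), batch_size):
            chosen = indices[start:start + batch_size]
            yield Batch(x=[data.x[i] for i in chosen],
                        y=[data.y[i] for i in chosen])


# Toy data set with ten examples, just to exercise the generator.
train = Batch(x=[[float(i)] for i in range(10)], y=list(range(10)))
batches = list(yield_batch(train, batch_size=4, max_epoch=2))
print(len(batches))                  # three batches per epoch, two epochs
print([len(b.x) for b in batches])   # the last batch of each epoch is shorter
```

Note that the last batch of each epoch holds only the remainder. That is harmless while the batch dimension is unknown to the graph, but a variant that drops or pads the remainder would be needed wherever the batch size is fixed at construction.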

The second template comes later in this post. I initially considered it to be completely lacking in good taste, but it solves problems. Go on reading if you are curious :D

Okay, so where is the fun/interesting/troublesome part?

The troublesome `batch_size` placeholder

What the above code snippet hides is that there are other placeholders besides m.x and m.y to be fed through feed_dict. Most likely they are m.drop_out_probability (dropout is a very powerful regularisation technique, definitely recommended in every deep model) and m.batch_size. Why are they placeholders? Because in the evaluation steps (2a and 2b) they take different values from step 1. Namely, m.drop_out_probability must be \( 1.0 \) in both 2a and 2b (despite its name, the value is the probability of keeping a unit, which is what tf.nn.dropout expects, so \( 1.0 \) switches dropout off), while m.batch_size must be hold_out.size in 2a and test.size in 2b.
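
To see why \( 1.0 \) is the right value for evaluation, here is a small sketch of inverted dropout in plain Python. It is only an illustration of the idea (keep each value with probability keep_prob and scale the survivors by 1 / keep_prob), not Tensorflow's implementation; the function name, the list-of-floats input and the rng argument are assumptions.

```python
import random


def dropout(values, keep_prob, rng):
    """Inverted dropout on a list of floats.

    Each value is kept with probability `keep_prob` and scaled by
    1 / keep_prob; otherwise it is replaced by 0.0.
    """
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]


rng = random.Random(0)
activations = [0.5, -1.0, 2.0, 3.5]

train_out = dropout(activations, keep_prob=0.5, rng=rng)  # step 1
eval_out = dropout(activations, keep_prob=1.0, rng=rng)   # steps 2a and 2b

print(train_out)                 # some entries zeroed, the rest doubled
print(eval_out == activations)   # keep_prob = 1.0 leaves the input untouched
```

With keep_prob fed as a placeholder, the same graph serves both training and evaluation; only the number in feed_dict changes.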

The good taste here is that this view of batch_size generalizes very well from the training set to both the hold-out set and the test set. It is brilliant to view the hold-out and test sets as two big fat batches, because who evaluates models in small batches anyway (slower execution for the same output)? (Well, I can imagine people doing this in specific situations, like when the hold-out and test sets are too big to fit in memory, but this is rare in my daily experiments.) When there is a generalization, there are fewer branching statements in your code, and when the code is cleaner, people will just love it.

But this also means that the actual value of hyperParameters.batch_size, although passed to the constructor of m, will not be used, because it does not apply to the hold-out and test sets. But hold on: in most cases you are fine without an m.batch_size placeholder at all, because Tensorflow allows tensors with a None dimension, as long as the tensor operations work out at run-time (e.g. matrix multiplication requires some dimension-matching) (cool!). That being said, there are cases when you do have to refer to batch_size explicitly in the construction phase (a.k.a. when instantiating the Model object). For example, the code below is from Tensorflow’s github repository:

def zero_state(self, batch_size, dtype):
    """Return state tensor (shape [batch_size x state_size]) filled with 0.
    Args:
      batch_size: int, float, or unit Tensor representing the batch size.
      dtype: the data type to use for the state.
    Returns:
      A 2D Tensor of shape [batch_size x state_size] filled with zeros.
    """

zero_state is a function that most people will use in every recurrent model. See the documentation? batch_size should be an int or float if you know the actual value at construction phase, or you can throw in a placeholder if you don’t.
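
To make the last two points concrete, here is a toy sketch (not taken from the Tensorflow repository) with a None batch dimension and a batch size that is itself a placeholder. The names x, w, y, state and the value of state_size are made up for illustration, and the import assumes a Tensorflow release that ships the tensorflow.compat.v1 module, so that the placeholder/Session style used in this post still runs.

```python
import tensorflow.compat.v1 as tf  # TF1-style graph API (assumed available)

state_size = 5  # hypothetical size, for illustration only

with tf.Graph().as_default() as graph:
    # The batch dimension of x is left unknown at construction time.
    x = tf.placeholder(tf.float32, [None, 3])
    w = tf.ones([3, 2])
    y = tf.matmul(x, w)  # only the inner dimension (3) has to match

    # The batch size itself is a scalar placeholder, fed at run time.
    batch_size = tf.placeholder(tf.int32, [])
    state = tf.zeros(tf.stack([batch_size, state_size]))

with tf.Session(graph=graph) as sess:
    for n in (4, 7):  # e.g. a training batch, then a bigger hold-out "batch"
        feed_dict = {x: [[1.0, 2.0, 3.0]] * n, batch_size: n}
        y_val, state_val = sess.run([y, state], feed_dict)
        print(y_val.shape, state_val.shape)
```

The same graph handles both batch sizes, which is exactly the generalization template 1 relies on.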

So what is the problem anyway?

The problem is, there are cases where you desperately need to iterate over range(0, batch_size) at construction phase. But when batch_size is a placeholder (you don’t know its value at construction phase), iterating over it makes no sense. Yes, Python raises an error immediately when I desperately (and naively) try this loop, expecting some magical collaboration between range() and Tensorflow’s placeholders.
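
The error is not really Tensorflow's fault: range() wants a concrete integer, and an object that merely promises a value later cannot provide one. The sketch below reproduces this with a toy stand-in; SymbolicBatchSize is a hypothetical class made up for this illustration and is not part of Tensorflow.

```python
class SymbolicBatchSize(object):
    """Toy stand-in for a placeholder: a named value with no number yet.

    Hypothetical class for illustration; it is not part of Tensorflow.
    """

    def __init__(self, name):
        self.name = name


batch_size = SymbolicBatchSize("batch_size")

try:
    for i in range(batch_size):  # range() needs a concrete integer
        pass
    loop_error = None
except TypeError as error:
    loop_error = error

print(type(loop_error).__name__)

# With a plain int, known at construction time, the loop is trivial.
visited = [i for i in range(3)]
print(visited)
```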

Is this particular looping need frequent and universal? I’ll say yes, as my colleagues and I have encountered it quite a number of times. In this post I point out a concrete situation with this looping need and a (cool) workaround for it. Notice I call it a workaround, because it is not a solution: I don’t know if there is any built-in Tensorflow op, like tensorflow.range() or something, that lets people loop over placeholders at construction phase. But the fact is, you DON’T need such an operation at all, as long as you use the following template (which I personally considered to be terribly lacking in taste, but which nevertheless works wonders).

Here it comes,

import tensorflow as tf
import data_manipulator # you build this yourself

# Set up data
hope = FLAGS.hyperParameters.Hold_Out_PErcentage 
train, hold_out, test = data_manipulator.read(input_folder, hope)

# Set up model
class Model(object):
    def __init__(self, hyperParameters, batch_size = None):
        # Here comes the branching ...
        if batch_size is None:
            self.batch_size = hyperParameters.batch_size
        else:
            self.batch_size = batch_size
        self.loss_op, self.accuracy = self.Set_up_layers(hyperParameters)
        
    def Set_up_layers(self, hyperParameters):
        # 1. Set up placeholders self.x and self.y
        # 2. build layers gradually upon self.x 
        #    all the way through to the output layer,
        #    feel free to loop over self.batch_size
        #    it is an actual int/float now!
        # 3. calculate loss and accuracy, using self.y
        return loss, acc

# Session switch on
sess = tf.Session()
test_acc = float() # report this after the experiment

with sess.as_default():

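    # Set up THREE graphs. They have to share one set of variables,
    # otherwise m_hold and m_test would evaluate untrained weights.
    # Assumption: Set_up_layers creates its variables with tf.get_variable
    hyperParams = FLAGS.hyperParameters
    with tf.variable_scope("model"):
        m_train = Model(hyperParams)
    with tf.variable_scope("model", reuse=True):
        m_hold = Model(hyperParams, hold_out.size)
        m_test = Model(hyperParams, test.size)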

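    # Set up trainer (only the training model gets a train op)
    optimizer = tf.train.AdamOptimizer(1e-3) # Adagrad, RMSprop, etc.
    gradients = optimizer.compute_gradients(m_train.loss_op)
    train_op  = optimizer.apply_gradients(gradients)

    # Initialise all variables before the first sess.run on the models
    # (tf.initialize_all_variables() in Tensorflow releases before 0.12)
    sess.run(tf.global_variables_initializer())

    # Training set-ups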
    max_epoch = hyperParams.max_epoch
    batch_size = hyperParams.batch_size
    max_hold_acc = float(0.0) # keep track of the accuracy on hold-out set
    
    # Now start training
    for batch in data_manipulator.yield_batch(train, batch_size, max_epoch):
        # 1. Let the tensors flow
        #    with gradients calculated and variables updated
        feed_dict = {m_train.x: batch.x, m_train.y: batch.y}
        _, train_acc = sess.run([train_op, m_train.accuracy], feed_dict)

        # 2a. Let the tensors simply flow through
        #     you may want to do this step much less frequently
        feed_dict = {m_hold.x: hold_out.x, m_hold.y: hold_out.y}
        hold_acc = sess.run(m_hold.accuracy, feed_dict)

        # 2b. hold out peaked, report accuracy on test
        if hold_acc >= max_hold_acc:
            max_hold_acc = hold_acc
            feed_dict = {m_test.x: test.x, m_test.y: test.y}
            test_acc = sess.run(m_test.accuracy, feed_dict)
    
# now publish your paper maybe?
print(test_acc)

Now see that this template has an if branch, so it is kinda more verbose (uglier) than the previous one, not to mention that it is longer and obviously requires more memory. Yes, there are ways to avoid the if, but you will not be able to avoid instantiating three different objects of class Model. The ugly part is that these three objects, although they contain only two Tensorflow-op attributes each, correspond to three separate underlying Tensorflow graphs. These graphs can be huge if your model is big enough, and guess what? All three of them are identical except for the loop part!
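
One detail worth spelling out: the three copies only make sense if they read the same weights, so that what m_train learns is what m_hold and m_test evaluate. A minimal sketch of one way to get that, assuming the model creates its variables with tf.get_variable inside a shared tf.variable_scope, is below. The build_model function, the scope name "model" and the tiny shapes are made up for illustration, and the import assumes a Tensorflow release that ships tensorflow.compat.v1.

```python
import tensorflow.compat.v1 as tf  # TF1-style graph API (assumed available)


def build_model(batch_size):
    """Tiny model whose construction loops over a concrete batch size."""
    x = tf.placeholder(tf.float32, [batch_size, 3])
    w = tf.get_variable("w", [3, 2])
    # A plain Python loop: one matmul op per example in the batch.
    rows = [tf.matmul(x[i:i + 1], w) for i in range(batch_size)]
    return x, tf.concat(rows, axis=0)


with tf.Graph().as_default() as graph:
    with tf.variable_scope("model"):
        x_train, out_train = build_model(4)
    with tf.variable_scope("model", reuse=True):
        x_hold, out_hold = build_model(10)  # same "w", ten-example copy

    variable_names = [v.name for v in tf.global_variables()]

print(variable_names)
print(out_train.get_shape().as_list(), out_hold.get_shape().as_list())
```

The ops are duplicated for every batch size, which is the redundancy complained about above, but the variables exist only once.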

For me, the redundancy here was unbearable when I discovered this template. But over time it gradually became acceptable, because three is a constant anyway. And constants are fine, look, they are \( O(1) \). Nevertheless, whenever I encounter a looping need, the first thing I do is stick with template 1 until I can no longer find a workaround for it. Because workarounds are conceptually very cool, and every developer needs that little stubbornness when it comes to coding style, don’t they? No? Okay then =(.