February 15, 2016 - just another day in heaven

Intro

So I’ve been working on this project of sentence classification for a while. The baseline is this very-simple-yet-effective CNN from Yoon Kim. It’s been a while since the model was proposed and several other models have beaten this one on several datasets. But here I’ll just work on this baseline with another focus - not beating the accuracy, but to see if injecting more temporal information into the structure is of any good. The reason is: sentences are sequential words, but CNN kinda produces a bag of patches to the next, typically, fully-connected discriminative layer. Even though patches of words contain some kind of sequential information, a bag is essentially the opposite. The work so far is fine. Until I encounter this problem of varying sentence length.

Problems with padding

One thing people commonly do with sentences is to pad them, as a preprossing step of their deep learning application, so that all sentences in a dataset are of the same length. The reason for this is: Tensorflow works with tensors only, apparently. And tensors essentially contains constant sized matrices or vectors, thus padding is unavoidable.

But say, sentences after being padded are passed into the structure. Then the padded part will be transformed and pushed through several layers of the network. There is a high chance that this padding information will mess with the actual content of the input at later layers and thus, create quite a bit of noise in the network’s output. Even if the padded part are all zeros, when going through a linear transformation, they will be added by some bias weight and becomes visible in contributing to the activations. So, eliminating these noise is desirable.

The dataset that I am working on have the ratio of the average sentence length to the maximum one being \( 1/4 \). So padding is injecting one hell of noise into the network. This, must be taken into account seriously.

Getting the last output signal of RNN

If you had a look on this awesome blog post on RNN, you’ll know that there is essentially two different uses of RNN when the input is a sequence. The first one is sequence-to-sequence, the second one is sequence-to-unit. The implementation of the first one looks something like this

outputs, state = rnn.rnn(lstm_cell, inputs, initial_state = initial_state)

While in the second one:

outputs, state = rnn.rnn(lstm_cell, inputs, initial_state = initial_state)
output = outputs[-1]

Not much of a difference huh? Yes, but in the second one, getting the last output signal is something very undesirable as indicated earlier. In order to come to this final element of the list outputs, the real content have to survive a considerable amount of paddings.

The solution

This should easily be the solution.

output = outputs[0]
for i in range(0, batch_size):
    # l_i : length of sentence number \\( i \\) in the batch
    l_i = sentence_length[i]
    output[i,:] = outputs[l_i - 1][i,:]

The problem is right there at range(0, batch_size). You suppose to know the actual value of batch_size at construction phase of the graph. Can we do it? Well definitely, why in the world we cannot at all? The thing is, if you set batch_size to a specific defined number while building this graph, then how would it allow the validation/test data (which will not comes in batches) to flow in later on? Actually this can be solved in many different ways (you can easily think of one right now), and at the end of the day it actually boils down to coding style. This post discusses the topic in much greater length. Right now, let’s assume we insist on not specifying a defined value for batch_size at construction phase.

So, if batch_size, instead of being a defined number, presents as a placeholder, how would we iterate over range(0, batch_size) and collect outputs at appropriate positions like above? Well, LSTM is theoretically a great subtitute solution thanks to its ability to ‘forget’ the unnecessary padding at the end of each sentences. And even if not using it, one can simply reverse the input so that the padding comes first and then (likely) be washed away as information propagates through several time steps.

But what if the padding creates a too noisy signal that it must be eliminated completely like in the solution above. Or what if we are the die-hard perfectionists willing to spend days and months cleansing the hell out of this noise or else we will not be able to go on with life?

Don’t worry! I’ve got here a cool and fun solution ;)

Defining the problem

So here is the problem statement:

Given

outputs, a list of length padded_sentence_length, each element being a tensor of size batch_size \( \times \) hidden_size.
sentence_length, a vector of length batch_size, with entry number \( i \) being the real length of sentence number \( i \) in the batch.

Question: how to calculate the tensor output of size batch_size \( \times \) hidden_size with the same value produced as in the solution code, without explicitly refering to batch_size’s value?

In other words, we are trying to come up with an equivalent piece of code, with much more stingy restrictions. (Excited yet?)

The walk-around Solution

Before moving on to the solution, I will give here a toy example for illustration purpose

# Assume hidden_size = 2, batch_size = 3, padded_sentence_length = 4
# And the given data is:
outputs = 
[[[1, 2],   [[0, 0],    [[0, 0],    [[0, 0],    
  [0, 0],    [0, 0],     [3, 4],     [0, 0],
  [0, 0]];   [5, 6]];    [0, 0]];    [0, 0]]]
sentence_length = [1, 3, 2]
# Then the expected return is
output =
[[1, 2],
 [3, 4],
 [5, 6]]

As in the restriction, you can’t iterate over batch_size, so let’s iterate over padded_sentence_length (this is pure trial and error reasoning). By this loop, we essentially go through each and every element of the list outputs. Note that each of these elements is of the same size with output, so a reasonable idea is to initialise output randomly and refine its value as we iterate through the loop using values of each outputs[i]

Specifically the main idea is, as we iterate through each element of outputs, imagine output being a filter, it receives the whole value of each element but only keeps what it needs, discards the rest and then moves on.

If you follow the reasoning, at this point it is very clear that during this loop, step number \( i \) is exactly when we must set the value for all rows of output that correspond the sentences with length \( i \) to the value of the same rows of outputs[i]. Well I admit I’ve just said something lengthy and confusing, so feel free to skip and just look at the example below to understand. Here we iterate by i in range(0, 4)

# Progression of output as we go through each iteration:
time  |initial | i = 0  | i = 1  | i = 2  | i = 3
------+--------+--------+--------+--------+--------
      |[[x, x],|[[1, 2],|[[1, 2],|[[1, 2],|[[1, 2],
output| [x, x],| [0, 0],| [0, 0],| [3, 4],| [3, 4],
      | [x, x]]| [0, 0]]| [5, 6]]| [5, 6]]| [5, 6]]
# By 'x', I mean 'anything, random'.

How do we do that? The first thing to realise is that at each interation, the current output has some rows that should be unchanged, and some others to be updated from outputs[i]. We can do this by a matrix mask having the same size as output with entries being \( 1 \) where output should be kept and \( 0 \) otherwise.

Filtery speaking, entries with value of \( 1 \) indicates that this position on the filter output is occupied by a number that didn’t make it through the filter at some previous step, and thus this position can no longer be occupied by anything else that comes later.

Again, let’s look at the value of mask in the example

# Progression of the mask as we go through each iteration:
i   |    0   |    1   |    2   |    3
----+--------+--------+--------+--------
    |[[0, 0],|[[1, 1],|[[1, 1],|[[1, 1],
mask| [0, 0],| [0, 0],| [0, 0],| [1, 1],
    | [0, 0]]| [0, 0]]| [1, 1]]| [1, 1]]

If we are able to calculate mask at each step, then the update rule for output is simply

output = tf.mul(output, mask) + tf.mul(outputs[i], 1 - mask)

So the remaining problem is to calculate mask. Looking at mask, we realise that all entries on the same row have the same value, thus mask’s value can be inferred from a single vector v that has the progression as follows:

# Progression of v as we go through each iteration:
i |  0  |  1  |  2  |  3
--+-----+-----+-----+------
  | [0, | [1, | [1, | [1,
v |  0, |  0, |  0, |  1,
  |  0] |  0] |  1] |  1]

See any patterns? Clearly the value of v must relies on the value of i (obviously, since v changes as i goes from 0 to 3), and the value of the constant sentence_length. Let’s put these three together:

l = sentence_length
# Progression of v as we go through each iteration:
i     |  0      |  1      |  2      |  3
------+---------+---------+---------+--------
      | [0, [1, | [1, [1, | [1, [1, | [1, [1,
v ; l |  0,  3, |  0,  3, |  0,  3, |  1,  3,
      |  0]; 2] |  0]; 2] |  1]; 2] |  1]; 2]

See the pattern now? Yes, it is indeed v = l < (i+1). And we are done (Pheeewww), here is the walk-around solution:

l = sentence_length
output = outputs[0]
one = tf.ones([1,hidden_size], tf.float32) # v * one = mask

for i in range(1, len(outputs)):
    v = tf.to_float(l < (i+1))
    v = tf.expand_dims(v, -1)
    mask = tf.matmul(v, one)
    output = tf.mul(output, mask)
    output += tf.mul(outputs[i], 1.0 - mask)

There you have it!

Update 28th Aug

There is this much simpler way to do exactly the same thing

def gather_by_lengths(outputs, seq_lengths):
    """
    :param outputs: [batch_size x max_seq_length x hidden_size] tensor of dynamic_rnn outputs
    :param seq_lengths: [batch_size] tensor of sequence lengths
    :return: [batch_size x hidden_size] tensor of last outputs
    """
    batch_size, max_seq_length, hidden_size = tf.unpack(tf.shape(outputs))
    index = tf.range(0, batch_size) * max_seq_length + (seq_lengths - 1)
return tf.gather(tf.reshape(outputs, [-1, hidden_size]), index)