Neural Net Inauguration Generator

In the last post, I tried making an inauguration generator using Markov models. Since then, I've become very interested in Deep Learning, and what better way to learn more about it than to make a Deep Learning inauguration generator!

A word of warning: the results shown here are admittedly poor. I put that down mainly to my rather unfortunate setup of Lubuntu running in VirtualBox on a Windows host laptop. This means I don't have access to a GPU to speed up the processing, and I had to keep my training data relatively small to fit in memory. On top of this, the inauguration address corpus is tiny and certainly not large enough for any serious Deep Neural Net work; it contains just under 10,000 unique words.

That being said, I think the general methodology here is pretty sound, and the main goal of this exercise was only to teach myself how all of the parts of a deep text synthesizer work and fit together.

We're going to generate the text at the word level. I've seen a lot of guides on generating text at the character level, but I wanted to try it at the word level. It's much easier to generate text at the character level because the model only needs to make predictions in ~30-dimensional space (26 letters plus a couple of special characters), but if you were to naively generate text at the word level you would have to make predictions in 50,000+ dimensional space (i.e. predict the probability of every known word occurring).

The way around making predictions in crazy high dimensional space is to encode the words into a lower dimensional space. But encoding them into arbitrary vectors won't do: the text generation model will work much better if similar vectors represent similar words. How do we learn which words should have similar vector representations? We use another model that we train before the text generation model. This technique, of encoding words into a lower dimensional space while preserving similarities, is called word embedding. One of the particularly well known implementations of this technique is Word2Vec.
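To make the idea concrete, here's a minimal sketch with a made-up three-word vocabulary and arbitrary numbers, contrasting a one-hot encoding (one dimension per word) with a dense embedding (a few shared dimensions, where related words end up close together):

import numpy as np

#hypothetical three-word vocabulary, purely for illustration
vocab = ['nation', 'country', 'taxes']

#one-hot: each word gets a vocab_size-dimensional vector with a single 1
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

#embedding: each word gets a short dense vector; similar words get similar vectors
embedding = {
    'nation':  np.array([0.81, 0.10]),
    'country': np.array([0.79, 0.12]),  #close to 'nation'
    'taxes':   np.array([0.05, 0.93]),  #far from both
}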

There will be two main sections to this post. In the first, we'll embed the vocabulary into 50-dimensional vector space to create a Word2Vec-like model. Then we'll use what this first model has learned to train a second model that will generate the text.

Step 0 - Imports

I'm using Keras with the Theano backend; I encountered some teething problems with the TensorFlow backend while writing this.

In [1]:
import os.path
import random
import numpy as np
from Queue import PriorityQueue
from nltk.corpus import inaugural
from scipy.spatial.distance import cosine

#Keras imports
from keras.models import Sequential
from keras.models import model_from_json
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.datasets.data_utils import get_file
from keras.utils.generic_utils import Progbar
Using Theano backend.

Step 1 - Building a Word2Vec model

If I were focused purely on results, I would have skipped this step entirely and just used Word2Vec, but I wanted to learn how to do it myself!

First things first, let's set up some variables.

In [2]:
embed_size = 50                           #we're embedding the words into 50-dimensional space
num_words = len(inaugural.words())      
vocab_size = len(set(inaugural.words()))  #the vocabulary size is the number of unique words in the corpus

Next we set up a dictionary that maps each word to a number; this will be used to place each word in the correct position of the array.

In [3]:
word_to_index = dict(zip(set(inaugural.words()), range(vocab_size)))

Because training took so long on my setup, I had to train over multiple days. This necessitated saving the model in between. Keras saves the model config and the learned weights separately.

In [4]:
embedding_model_layout = 'embedding_model_architecture.json'
embedding_model_weights = 'embedding_model_weights.h5'

If the model already exists on disk, just load from that. Otherwise create a new one.

The model takes as input a vector the size of our vocabulary, has a hidden layer the size of our desired embedding size, and an output layer which is again the size of our vocabulary. After the model is trained, we will extract the useful information from the weights of the hidden layer.

In [5]:
if not os.path.isfile(embedding_model_layout):
    embedding_model = Sequential()
    embedding_model.add(Dense(input_dim=vocab_size, output_dim=embed_size, activation="relu"))
    embedding_model.add(Dense(input_dim=embed_size, output_dim=vocab_size, activation="softmax"))
    embedding_model.compile(loss='categorical_crossentropy', optimizer='sgd')
    open(embedding_model_layout, 'w').write(embedding_model.to_json())
else:
    embedding_model = model_from_json(open(embedding_model_layout, 'r').read())

    
if os.path.isfile(embedding_model_weights):
    embedding_model.load_weights(embedding_model_weights)

So what do we train this model with? There are a couple of possible approaches here, but in essence we want the model to associate words with the contexts they usually appear in. Similar words should appear in similar contexts, so the model will learn to treat them similarly and give them similar weights.

The two main approaches are Continuous Bag-of-Words (CBOW) and Skip-gram. Essentially, the CBOW approach uses the context to predict the target word, and the Skip-gram approach uses the word to predict the context. If you want to learn more about these approaches, Tomas Mikolov is the guy to follow in this space. I found his paper, Efficient Estimation of Word Representations in Vector Space, particularly useful.
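To make the difference concrete, here's a rough sketch (not the exact pipeline used below) of the training pairs each approach would produce for a toy sentence with a context window of 1:

def cbow_pairs(sentence, window=1):
    #CBOW: (context words, target word) pairs
    pairs = []
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(sentence, window=1):
    #Skip-gram: (target word, context word) pairs
    pairs = []
    for context, target in cbow_pairs(sentence, window):
        pairs.extend((target, c) for c in context)
    return pairs

print(cbow_pairs(['we', 'the', 'people']))
#[(['the'], 'we'), (['we', 'people'], 'the'), (['the'], 'people')]
print(skipgram_pairs(['we', 'the', 'people']))
#[('we', 'the'), ('the', 'we'), ('the', 'people'), ('people', 'the')]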

I decided to use the CBOW model just to keep things simple, and because it's faster to train than the Skip-gram model, although Mikolov says that the Skip-gram approach works best for small datasets like ours.

Since we're using the CBOW approach, we need to construct a table of context words and a table of target words. This helper function will help us do that: it builds a list of context words from the words on either side of a given word.

In [6]:
def get_context_words(word_index, sentence, context_size):
    #words before the target plus words after it, excluding the target word itself
    return sentence[max(0, word_index-context_size):word_index] + sentence[word_index+1:word_index+context_size+1]

In this step, we iterate over every word in our corpus saving both the context and the target word that goes along with it.

In [7]:
contexts = np.zeros((num_words, vocab_size), dtype=np.int8)
target_word = np.zeros((num_words, vocab_size), dtype=np.int8)

print('Building training data')
example_num = 0

for sentence in inaugural.sents():    
    for word_index, word in enumerate(sentence):
        context_words = get_context_words(word_index, sentence, 3)
        for context_word in context_words:        
            contexts[example_num, word_to_index[context_word]] = 1
        target_word[example_num, word_to_index[word]] = 1
        example_num += 1
Building training data
In [8]:
X = contexts
y = target_word

Now comes training the Word2Vec model. We take each context and use it to try to predict the target word.

In [9]:
num_epoch = 20
for i in range(num_epoch):
    print('Training Model... Epoch', i+1, 'of', num_epoch)
    embedding_model.fit(X, y, verbose=1, nb_epoch=1)        
    embedding_model.save_weights(embedding_model_weights, overwrite=True)

Make the Word2Vec Map

With the Word2Vec model trained, we now need to extract the useful data from it. That data comes from the weights of the hidden layer.

Since figuring out why these weights give us what we want was a stumbling block for me when I tried to learn about this stuff, I'll do my best to explain it here for others.

We trained our model to predict the target word given a context. Our model is simple and only contains two sets of weights: one mapping the contexts to the hidden layer, and another mapping the output of the hidden layer to the target words. The weights into the hidden layer won't be particularly useful to us, but let's think about the weights out of the hidden layer. The weights from the hidden layer to the target word determine how important each of the 50 'signals' (activations) is to each word.

For example, the weights corresponding to the word 'Country' may place importance on hidden layer activations 2, 7, 31, and 49 (this is purely an example; importance is of course a gradation and not absolute). At the same time, the weights corresponding to the word 'Nation' may place importance on hidden layer activations 2, 7, 25, and 49. Hey, those words place importance on 3 of the same 4 signals! They could be pretty similar! We're going to use cosine similarity to compare two different vectors of weights, but the idea is the same. If the weights corresponding to two different words value similar sets of signals from the hidden layer, those words probably appear in the same contexts, and are therefore similar themselves.
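As a quick illustration of that idea, here are some made-up 6-dimensional weight vectors (the real ones are 50-dimensional): two words that put their weight on the same signals score a high cosine similarity. Note that scipy's cosine returns a distance, so similarity is 1 minus that.

from scipy.spatial.distance import cosine
import numpy as np

#made-up weight vectors, purely for illustration
country = np.array([0.9, 0.1, 0.8, 0.0, 0.7, 0.1])  #values signals 0, 2 and 4
nation  = np.array([0.8, 0.2, 0.9, 0.1, 0.6, 0.0])  #values much the same signals
taxes   = np.array([0.1, 0.9, 0.0, 0.8, 0.1, 0.7])  #values a different set of signals

print(1 - cosine(country, nation))  #close to 1 -> similar
print(1 - cosine(country, taxes))   #close to 0 -> dissimilar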

So let's put that into action.

In [10]:
word_weights = embedding_model.layers[1].get_weights()[0]   #Get the weights from the output of the hidden layer
word_weights = word_weights / word_weights.sum(axis=0)      #Normalise the weights 
word_weights.shape
Out[10]:
(50, 9754)

Now we tie each word to its corresponding vector.

In [11]:
def word_to_vec_pair(word):
    return (word, word_weights.T[word_to_index[word]])

word_to_vec = dict(map(word_to_vec_pair, inaugural.words()))

That's pretty much it. Now we can play around with it and generate a list of similar words for any word in the corpus.

In [12]:
word_vec = word_to_vec['country']
queue = PriorityQueue()
for other_word, other_word_vec in word_to_vec.iteritems():
    queue.put((cosine(word_vec, other_word_vec), other_word))
    
for i in range(7):
    print(queue.get())
(-6.8554002696785687e-08, u'country')
(0.1208314274193345, u'national')
(0.14875480380648587, u'life')
(0.16996928824016544, u'system')
(0.18475146698671385, u'institutions')
(0.18922386457700979, u'political')
(0.20473386425407103, u'Nation')

We can see that we're getting somewhere; the results make sense, but they aren't quite as good as those from Word2Vec. Pretty much as expected.

Let's create a helper function that we'll use later to get the closest word to a given vector.

In [13]:
def closest_word(word_vec):
    queue = PriorityQueue()
    for other_word, other_word_vec in word_to_vec.iteritems():
        queue.put((cosine(word_vec, other_word_vec), other_word))
    return queue.get()[1]

Generate the Text

To generate the text, we're going to have to train another model. This model will try to learn which words usually follow other words. We'll present the words to this model as the vector representations learned by the previous model. This works better than presenting it with a raw 1-of-n word encoding because the problem space is much smaller, and we've already learned which words are similar to each other.

We're going to grab sequences of words from the sentences of the corpus and ask the model to predict what the next word should be. This helper function will help us do that and should be self-explanatory.

In [14]:
def get_preceding_words(word_index, sentence, lookback):
    return sentence[word_index - lookback:word_index]

I used a sequence length of 3 for my dataset, that is, 3 preceding words and 1 target word. This was to make the training time more tractable on my setup.

In [15]:
seq_len = 3

preceding_words = np.zeros((num_words, seq_len, embed_size))
target_words = np.zeros((num_words, embed_size))

Now we iterate, starting at the 4th word of every sentence in our corpus. We gather the preceding words and the target word, then add them to the preceding_words and target_words arrays respectively.

In [16]:
example_num = 0
for sentence in inaugural.sents():    
    for word_index, word in enumerate(sentence[seq_len:]):
        previous_words = get_preceding_words(word_index + seq_len, sentence, seq_len)
        for i, previous_word in enumerate(previous_words):
            preceding_words[example_num, i] = word_to_vec[previous_word]
        target_words[example_num] = word_to_vec[word]
        example_num += 1
In [17]:
X = preceding_words
Y = target_words

The layout of this model has been borrowed from the Keras GitHub repo. They use it for character-level text generation, but it should work just as well at the word level.

In [18]:
text_synthesizer_model_layout = 'text_synthesizer_model_architecture.json'
text_synthesizer_model_weights = 'text_synthesizer_model_weights.h5'
In [19]:
if not os.path.isfile(text_synthesizer_model_layout):
    text_synthesizer = Sequential()
    text_synthesizer.add(LSTM(512, return_sequences=True, input_shape=(seq_len, embed_size)))
    text_synthesizer.add(Dropout(0.2))
    text_synthesizer.add(LSTM(512, return_sequences=False))
    text_synthesizer.add(Dropout(0.2))
    text_synthesizer.add(Dense(embed_size))
    text_synthesizer.add(Activation('softmax'))
    text_synthesizer.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    open(text_synthesizer_model_layout, 'w').write(text_synthesizer.to_json())
else:
    text_synthesizer = model_from_json(open(text_synthesizer_model_layout, 'r').read())

    
if os.path.isfile(text_synthesizer_model_weights):
    text_synthesizer.load_weights(text_synthesizer_model_weights)

The generate_text function is also borrowed heavily from the same Keras page, with a couple of necessary changes. The function proceeds as follows. First, a random starting point is chosen from the words of the corpus. The three words following the starting point are taken and turned into their vector representations. This three-vector sequence is fed into the trained model, and the model makes a prediction. The prediction is appended to the end of the vector sequence, and the sequence is truncated from the front, leaving a new vector sequence of length three that will be used to make the next prediction. This process continues for as many predictions as we wish to generate.

Each prediction is also turned back into English by finding the word with the closest vector representation. Concatenating these gives us our English sentence.

In [20]:
def generate_text(synthesizer_model, corpus, text_len):
    #pick a random starting point and take the seq_len words that follow it
    start_index = random.randint(0, len(corpus) - seq_len - 1)

    sentence_english = list(corpus[start_index: start_index + seq_len])
    word_vec_sequence = map(lambda word: word_to_vec[word], sentence_english)

    prog = Progbar(text_len)
    for _ in range(text_len):
        #build the model input from the current sequence of word vectors
        x = np.zeros((1, seq_len, embed_size))
        for i, word_vec in enumerate(word_vec_sequence):
            x[0, i] = word_vec

        #predict the next word vector (take the first, and only, row of the batch)
        y = synthesizer_model.predict(x, verbose=0)[0]

        #slide the window forward: drop the oldest vector, append the prediction
        word_vec_sequence = word_vec_sequence[1:] + [y]
        sentence_english.append(closest_word(y))
        prog.add(1)

    return ' '.join(sentence_english)

At each epoch we train the model and then use it to build a sentence. With any luck, we'll see the model gradually improve its English abilities to the point where it can produce a somewhat understandable sentence.

In [ ]:
num_epochs = 50
for iteration in range(num_epochs):
    
    print('Training model. Epoch', iteration+1, 'of', num_epochs)
    text_synthesizer.fit(X, Y, batch_size=128, nb_epoch=1, verbose=1)
    text_synthesizer.save_weights(text_synthesizer_model_weights, overwrite=True)
    
    print('Generating text')
    generated_text = generate_text(text_synthesizer, inaugural.words(), 100)
    print(generated_text)

I've omitted the final output of the text generator until I get a chance to run this on AWS. I was simply unable to train this model sufficiently on my setup. Deep LSTMs take a long time to train!

Despite the missing final result, hopefully this walkthrough is helpful to others. Feel free to download the notebook and play around with it yourself!

Download IPython Notebook