To play around with text synthesis, I used the Natural Language Toolkit (nltk) library to generate an Inauguration Address. It works just about as well as you might imagine, its terrible. But it was pretty fun seeing what it came up with.
The general method used here was something along the lines of a markov chain. At each state, we transition to the next with a probability that we have learned from training on the corpora. The random path that we take is used to form a sentence.
The project is broken into two experiments. In the first the states are represented as words, and in the second the states are presentented by the word's part of speech tag (e.g noun, verb, etc.).
This way of representing the states is the most common in speech synthesis. Its also pretty simple. The idea is that we build an adjacency structure for each word (or group of words) so we know what should follow it.
import numpy as np
from textblob import TextBlob
from nltk.corpus import inaugural
from collections import defaultdict
from IPython.core.display import HTML
We'll use start/stop tags to simplify the code. By sticking these on the start and end, we don't have to worry about edge cases so much
start_tag = 'START'
end_tag = 'STOP'
The following function generates the word-adjacency structure for a given corpus and order. Order here translates to the order of our markov chain. If the order is 1, then we look at the previous word to determine the next. If the order is 2, we look at the previous 2 words to determine the next and so on.
There is also a struct parameter that we can pass a list or a set. The importance of this choice will be explained later
def generate_adjacency_map(corpus, order, struct=list):
adjacency_map = defaultdict(struct)
for sentence in corpus.sents():
# generate a list where the start and end are buffered by start and end tags.
words = ([start_tag] * order) + sentence + ([end_tag] * order)
#slide over the list
#add each new word to the adjacency_map indexed by the preceding order-words
for i in range(0, len(sentence) + 1):
preceding_words = tuple([words[j] for j in range(i, i + order)])
word = words[i + order]
if struct is list:
adjacency_map[preceding_words].append(word)
elif struct is set:
adjacency_map[preceding_words].add(word)
return adjacency_map
word_adjacency_list = generate_adjacency_map(corpus = inaugural, order = 2, struct=list)
word_adjacency_set = generate_adjacency_map(corpus = inaugural, order = 2, struct=set)
Of course both structures contain the same number of unique elements, but in a set all unique words have equal probability to be picked at random. In a list, the more frequest the word, the more times it is expected to be picked at random.
print 'set : mean word adjacencies -', np.mean(map(len, word_adjacency_set.values()))
print 'list: mean word adjacencies -', np.mean(map(len, word_adjacency_list.values()))
The following function takes the adjacency structure and builds a random sentence out of it.
def build_sentence(adjacency_map, order, max_len=50):
sentence = []
next_word = ''
state = tuple([start_tag] * order)
while next_word != end_tag and len(sentence) < max_len:
sentence.append(next_word)
next_word = np.random.choice(list(adjacency_map[state]))
state = state[1:] + (next_word,)
return ' '.join(sentence)
sentence = build_sentence(word_adjacency_list, order=2)
sentence
At this point, I had an idea. I could use the TextBlob library to tag the words and use that as the markov state instead.
For each part of speech tag, I saved the assosiated bag of words, and built an adjacency structure similarly to what I did for words
def generate_tag_adjacencies(corpus, order, struct=list):
adjacency_map = defaultdict(struct)
tag_word_bags = defaultdict(struct, {end_tag : ['.']})
for sentence in corpus.sents():
tagged_sent = TextBlob(' '.join(sentence)).tags
#add tagged words to the relevent tag bag
for word, tag in tagged_sent:
if struct is list:
tag_word_bags[tag].append(word)
elif struct is set:
tag_word_bags[tag].add(word)
#build tag adjacency map similarly to the word-based one
tags = ([start_tag] * order) + map(lambda tup: tup[1], tagged_sent) + ([end_tag] * order)
for i in range(0, len(tagged_sent) + 1):
index = tuple([tags[j] for j in range(i, i + order)])
tag = tags[i + order]
if struct is list:
adjacency_map[index].append(tag)
elif struct is set:
adjacency_map[index].add(tag)
return adjacency_map, tag_word_bags
tag_adjacency_list, tag_bags_list = generate_tag_adjacencies(corpus = inaugural, order = 2, struct=list)
tag_adjacency_set, tag_bags_set = generate_tag_adjacencies(corpus = inaugural, order = 2, struct=set)
In this case, the list/set difference makes a significant difference.
print 'list: mean tag adjacencies -', np.mean(map(len, tag_adjacency_list.values())), ', mean word bag size -', np.mean(map(len, tag_bags_list.values()))
print 'set : mean tag adjacencies -', np.mean(map(len, tag_adjacency_set.values())), ', mean word bag size -', np.mean(map(len, tag_bags_set.values()))
This means that if we choose to use a set as the datastructure, our resulting sentences will feel much more random.
Next we build a sentence using the above structures similarly to before. But now the word we want to add to the sentence and what state we are entering are separate concepts.
def build_tag_sentence(adjacency_map, tag_word_bags, order, max_len=50):
sentence = []
next_tag = None
state = tuple([start_tag] * order)
while next_tag != end_tag and len(sentence) < max_len:
next_tag = np.random.choice(list(adjacency_map[state]))
next_word = np.random.choice(list(tag_word_bags[next_tag]))
state = state[1:] + (next_tag,)
sentence.append(next_word)
return ' '.join(sentence)
sentence = build_tag_sentence(tag_adjacency_list, tag_bags_list, order = 2)
sentence
Finally, we put together a bunch of sentences to make our Inaugural Address.
def output(sents):
return HTML('<p>' + ('</p><p>'.join(sents)) + '</p>')
output([build_sentence(word_adjacency_list, order=2) for i in range(10)])
output([build_tag_sentence(tag_adjacency_list, tag_bags_list, order = 2) for i in range(10)])
As you might have expected, its just a nonsense generator. Increasing the order can reduce the nonsense-rating, but on a corpus as small as this, it ends up regurgitating entire sentences.
The tag method seems to be even worse than the word-based method, but maybe you could combine them somehow to improve the result.