Representing text documents numerically is a challenging task in machine learning. Such representations are used for many purposes, for example document retrieval, web search, spam filtering, and topic modeling.
However, there are not many good techniques to do this. Many tasks use the well-known but simplistic bag-of-words (BOW) method, but the results are often mediocre, because BOW discards much of what a good representation should capture, e.g. word order.
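To make the word-order point concrete, here is a minimal sketch of a bag-of-words count. The helper name bag_of_words and the two example sentences are made up for illustration; real pipelines usually add tokenization and weighting on top of this.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The two sentences mean different things, yet their BOW vectors are identical.
print(a == b)  # True
```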
Latent Dirichlet Allocation (LDA) is also a common technique for topic modeling (extracting topics/keywords out of texts) but it’s very hard to tune, and results are hard to evaluate.
Doc2vec was introduced in 2014 by Le and Mikolov in the article Distributed Representations of Sentences and Documents.
Doc2vec is a very nice technique. It’s easy to use, gives good results, and, as you can tell from its name, is heavily based on word2vec. So we’ll start with a short introduction to word2vec.
Word2vec
Word2vec is a well-known method for generating vector representations of words.
In general, when you want to build a model using words, simply labeling or one-hot encoding them is a plausible way to go. However, with such an encoding the words lose their meaning. For example, if we encode Paris as id_4, France as id_6 and power as id_8, France will have the same relation to power as it has to Paris. We would prefer a representation in which France and Paris are closer to each other than France and power.
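The Paris/France/power example can be checked numerically. The sketch below assumes a vocabulary of 10 words and uses hand-picked two-dimensional dense vectors; those dense values are hypothetical, chosen only to illustrate the idea, and were not produced by any trained model.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


VOCAB_SIZE = 10  # assumed vocabulary size
paris, france, power = (one_hot(i, VOCAB_SIZE) for i in (4, 6, 8))

# With one-hot vectors every pair of distinct words is equally far apart.
print(euclidean(france, paris) == euclidean(france, power))  # True

# Hypothetical dense vectors, hand-picked for illustration (not trained).
dense = {"Paris": [0.9, 0.8], "France": [1.0, 0.7], "power": [-0.6, 0.2]}
closer = euclidean(dense["France"], dense["Paris"]) < euclidean(dense["France"], dense["power"])
print(closer)  # True
```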
Doc2vec
Once you understand what word2vec is, it is easier to understand how doc2vec works.
The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick.
Doc2vec (aka paragraph2vec, aka sentence embeddings) adapts the word2vec algorithm to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.
Input
Since the Doc2Vec class extends gensim’s original Word2Vec class, many of the usage patterns are similar. You can easily adjust the dimension of the representation, the size of the sliding window, the number of workers, or almost any other parameter that you can change with the Word2Vec model.
The one exception to this rule is the set of parameters relating to the training method used by the model. In the word2vec architecture, the two algorithm names are “continuous bag of words” (cbow) and “skip-gram” (sg); in the doc2vec architecture, the corresponding algorithms are “distributed memory” (dm) and “distributed bag of words” (dbow). Since the distributed memory model performed noticeably better in the paper, that algorithm is the default when running Doc2Vec. You can still force the dbow model if you wish, by passing the dm=0 flag to the constructor.
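As a rough illustration of how the two doc2vec algorithms differ, the sketch below lists the (input, target) pairs each one would learn from for a single labeled sentence. It is a simplified reading of the paper, not gensim's internal code: the function names are made up, a symmetric context window is assumed for the distributed memory case, and the sampling and averaging details are left out.

```python
def dm_examples(label, words, window=2):
    """Distributed memory: the label plus nearby words predict the word in the middle."""
    examples = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        examples.append(([label] + context, target))
    return examples


def dbow_examples(label, words):
    """Distributed bag of words: the label alone predicts each word of the text."""
    return [([label], target) for target in words]


words = ["some", "words", "here"]
print(dm_examples("SENT_1", words, window=1)[1])
# (['SENT_1', 'some', 'here'], 'words')
print(dbow_examples("SENT_1", words)[0])
# (['SENT_1'], 'some')
```

In both cases the label takes part in every training example drawn from its sentence, which is how it ends up with a vector of its own.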
The input to Doc2Vec is an iterator of LabeledSentence objects. Each such object represents a single sentence, and consists of two simple lists: a list of words and a list of labels:
from gensim.models.doc2vec import LabeledSentence
sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
The algorithm then runs through the sentences iterator twice: once to build the vocab, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset.
Although this architecture permits more than one label per sentence (and I myself have used it this way), I suspect the most popular use case would be to have a single label per sentence which is the unique identifier for the sentence. One could implement this kind of use case for a file with one sentence per line by using the following class as training data:
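from gensim.models.doc2vec import LabeledSentence
class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated more than once.
        with open(self.filename) as lines:
            for uid, line in enumerate(lines):
                # Label each sentence with a unique id derived from its line number.
                yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])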
A more robust version of the LabeledLineSentence class above is also included in the doc2vec module, so you can use that. Read the doc2vec API docs for all constructor parameters.
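Because the algorithm walks over the input twice, the training data has to be an object that can be iterated again from the start, which is why the class above implements __iter__. The sketch below shows the difference from a one-shot generator. To stay runnable without gensim it uses a namedtuple as a stand-in for LabeledSentence and a hypothetical in-memory class, LabeledLines, instead of a file.

```python
from collections import namedtuple

# Stand-in for gensim's LabeledSentence so the sketch runs without gensim.
LabeledSentence = namedtuple("LabeledSentence", ["words", "labels"])


class LabeledLines(object):
    """Hypothetical in-memory variant of LabeledLineSentence."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # A fresh generator is created on every call, so each pass starts over.
        for uid, line in enumerate(self.lines):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])


lines = ["some words here", "more words there"]
restartable = LabeledLines(lines)
one_shot = iter(LabeledLines(lines))  # a plain generator, usable only once

print(len(list(restartable)), len(list(restartable)))  # 2 2
print(len(list(one_shot)), len(list(one_shot)))        # 2 0
```

A one-shot generator would be used up by the vocabulary pass, leaving the training pass with no data.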
Training
Doc2Vec learns representations for words and labels simultaneously. If you wish to only learn representations for words, you can use the flag train_lbls=False in your Doc2Vec class. Similarly, if you only wish to learn representations for labels and leave the word representations fixed, the model also has the flag train_words=False.
One caveat of the way this algorithm runs is that, because the learning rate decreases over the course of a pass over the data, a label that appears in only one LabeledSentence is trained at a single, effectively fixed learning rate: whatever value is in effect when that sentence is reached. This frequently produces less than optimal results. I have obtained better results by iterating over the data several times and either randomizing the order of input sentences, or manually controlling the learning rate over the course of several iterations.
For example, if one wanted to manually control the learning rate over the course of 10 epochs, one could use the following:
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(sentences)  # sentences: an iterable of LabeledSentence objects
for epoch in range(10):
    model.train(sentences)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
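To see what this loop does to the learning rate, the following sketch computes the value used in each epoch and also shows the other option mentioned above, reshuffling the sentences before every pass. It is plain Python with no gensim dependency; the helper names alpha_schedule and shuffled_epochs are made up for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random


def alpha_schedule(start=0.025, step=0.002, epochs=10):
    """Learning rate in effect during each epoch of a step-wise manual decay."""
    return [round(start - step * epoch, 6) for epoch in range(epochs)]


def shuffled_epochs(sentences, epochs, seed=0):
    """Yield (epoch, sentences in a new random order) for each pass."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        order = list(sentences)  # copy, so the caller's list is left untouched
        rng.shuffle(order)
        yield epoch, order


print(alpha_schedule())
# [0.025, 0.023, 0.021, 0.019, 0.017, 0.015, 0.013, 0.011, 0.009, 0.007]

for epoch, order in shuffled_epochs(["SENT_0", "SENT_1", "SENT_2"], epochs=2):
    print(epoch, order)
```

With these values the rate stays positive for all 10 epochs. A larger step or more epochs would push it to zero or below, so the two numbers need to be chosen together.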
The code runs on optimized C (via Cython), just like the original word2vec, so it’s fairly fast.
Memory Usage
With the current implementation, all label vectors are stored separately in RAM. In the case above with a unique label per sentence, this causes memory usage to grow linearly with the size of the corpus, which may or may not be a problem depending on the size of your corpus and the amount of RAM available on your box. For example, I’ve successfully run this over a collection of over 2 million sentences with no problems whatsoever; however, when I tried to run it on 20x that much data my box ran out of RAM since it needed to create a new vector for each sentence.
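A back-of-the-envelope estimate shows why the larger run fails. The sketch below counts only the raw label vectors and assumes 100-dimensional vectors stored as 4-byte floats; both numbers are assumptions for illustration, and the real footprint will be higher once word vectors, output weights and vocabulary bookkeeping are included.

```python
def label_vector_bytes(num_labels, vector_size=100, bytes_per_float=4):
    """Bytes needed to hold one dense vector per label (assumed size and dtype)."""
    return num_labels * vector_size * bytes_per_float


for num_sentences in (2000000, 40000000):
    gigabytes = label_vector_bytes(num_sentences) / 1e9
    print(num_sentences, round(gigabytes, 1), "GB")
# 2000000 0.8 GB
# 40000000 16.0 GB
```

Under these assumptions, 20 times more sentences means 20 times more memory for the label vectors alone, which matches the linear growth described above.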