Using pre-trained word2vec with LSTM for word generation
keras
lstm
machine-learning
neural-network

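LSTM/RNN can be used for text generation. A linked example shows how to use pre-trained GloVe word embeddings in a Keras model.

  1. How can pre-trained Word2Vec word embeddings be used with a Keras LSTM model? This post did help.
  2. How can the next word be predicted/generated when the model is given a sequence of words as its input?

Sample approach tried: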

# Sample code to prepare word2vec word embeddings    
import gensim
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
sentences = [[word for word in document.lower().split()] for document in documents]

word_model = gensim.models.Word2Vec(sentences, size=200, min_count = 1, window = 5)

# Code tried to prepare LSTM model for word generation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
from keras.layers import Dense, Activation

embedding_layer = Embedding(input_dim=word_model.syn0.shape[0], output_dim=word_model.syn0.shape[1], weights=[word_model.syn0])

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(word_model.syn0.shape[1]))
model.add(Dense(word_model.syn0.shape[0]))   
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='mse')

Sample code or pseudocode for training the LSTM and predicting with it would be appreciated.

Source:
Stack Overflow
1 answer

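I've created a gist with a simple generator that builds on your initial idea: it's an LSTM network wired to pre-trained word2vec embeddings and trained to predict the next word in a sentence. The data is a list of abstracts from the arXiv website.

I'll highlight the most important parts here.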

Gensim Word2Vec

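Your code is fine, except for the number of training iterations. The default iter=5 seems rather low. Besides, word2vec is definitely not the bottleneck: LSTM training takes much longer. iter=100 looks better.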

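# Train word2vec on the tokenized sentences (gensim 1.x-3.x API, as in the question).
# In gensim >= 4.0 the equivalents are vector_size, epochs, wv.vectors,
# wv.key_to_index and wv.index_to_key.
word_model = gensim.models.Word2Vec(sentences, size=100, min_count=1, 
                                    window=5, iter=100)
pretrained_weights = word_model.wv.syn0
vocab_size, emdedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)

# Sanity check: list the nearest neighbours of a few words
print('Checking similar words:')
for word in ['model', 'network', 'train', 'learn']:
  most_similar = ', '.join('%s (%.2f)' % (similar, dist) 
                           for similar, dist in word_model.wv.most_similar(word)[:8])
  print('  %s -> %s' % (word, most_similar))

# Helpers to convert between words and their row indices in the embedding matrix
def word2idx(word):
  return word_model.wv.vocab[word].index
def idx2word(idx):
  return word_model.wv.index2word[idx]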

The resulting embedding matrix is saved in the pretrained_weights array, which has shape (vocab_size, emdedding_size).
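As a rough illustration of how the word index ties the vocabulary to the rows of that matrix, here is a minimal pure-Python sketch. The toy vocabulary, the toy weights and the `embed` helper are made up for this example and are not part of gensim.

```python
# Toy stand-ins for the gensim objects (all names here are hypothetical).
toy_index_to_word = ["model", "network", "train"]   # position = word index
toy_word_to_index = {w: i for i, w in enumerate(toy_index_to_word)}

# One row per word: shape (vocab_size, embedding_size) = (3, 2)
toy_weights = [[0.1, 0.2],
               [0.3, 0.4],
               [0.5, 0.6]]

def embed(word):
    """Look up a word's vector: its index selects a row of the matrix."""
    return toy_weights[toy_word_to_index[word]]

toy_vocab_size, toy_embedding_size = len(toy_weights), len(toy_weights[0])
print(toy_vocab_size, toy_embedding_size, embed("network"))
```

An Embedding layer initialized with such a matrix performs the same row lookup for every index in its input.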

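Keras model

Your code is almost correct, except for the loss function. Since the model predicts the next word, this is a classification task, so the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I've chosen the latter for efficiency reasons: it avoids one-hot encoding, which is pretty expensive for a big vocabulary.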

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, 
                    weights=[pretrained_weights]))
model.add(LSTM(units=emdedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

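Note that the pre-trained weights are passed to the weights argument.

Data preparation

To work with the sparse_categorical_crossentropy loss, both the sentences and the labels must be word indices. Short sentences must be padded with zeros to a common length.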
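To make the loss choice concrete, here is a minimal pure-Python sketch (not Keras code; the function names are made up for illustration) that computes the same cross-entropy once from a one-hot target and once from an integer label. The values agree, but the one-hot version needs a vector of vocabulary size for every target.

```python
import math

def one_hot(index, vocab_size):
    """Build a one-hot vector of length vocab_size (hypothetical helper)."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def dense_crossentropy(target_one_hot, probs):
    """Cross-entropy against a one-hot target: -sum(t_i * log(p_i))."""
    return -sum(t * math.log(p) for t, p in zip(target_one_hot, probs) if t > 0)

def sparse_crossentropy(target_index, probs):
    """Same loss, but the target is just the index of the correct word."""
    return -math.log(probs[target_index])

toy_probs = [0.1, 0.7, 0.2]   # toy softmax output over a 3-word vocabulary
toy_label = 1                 # index of the true next word

dense_loss = dense_crossentropy(one_hot(toy_label, len(toy_probs)), toy_probs)
sparse_loss = sparse_crossentropy(toy_label, toy_probs)
print(dense_loss, sparse_loss)
```

With a vocabulary of tens of thousands of words, skipping the one-hot vector per label is where the saving comes from.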

train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
  for t, word in enumerate(sentence[:-1]):
    train_x[i, t] = word2idx(word)
  train_y[i] = word2idx(sentence[-1])
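The same preparation can be sketched without NumPy on a toy corpus. This is an illustrative sketch only: `build_training_data`, the toy sentences and the toy vocabulary are assumptions made for the example, not part of the original script.

```python
def build_training_data(sentences, word_to_index, max_len):
    """Return (inputs, labels): zero-padded prefixes and last-word indices."""
    inputs, labels = [], []
    for sentence in sentences:
        # All words except the last form the input, truncated to max_len
        prefix = [word_to_index[w] for w in sentence[:-1]][:max_len]
        # Pad on the right with zeros up to the common length
        inputs.append(prefix + [0] * (max_len - len(prefix)))
        # The last word is the label to predict
        labels.append(word_to_index[sentence[-1]])
    return inputs, labels

toy_sentences = [["graph", "minors", "a", "survey"], ["the", "eps", "user"]]
toy_vocab = sorted({w for s in toy_sentences for w in s})
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}

toy_x, toy_y = build_training_data(toy_sentences, toy_word_to_index, max_len=4)
print(toy_x, toy_y)
```

Note that index 0 serves both as the padding value and as the index of a real word here, which mirrors the snippet above; reserving a dedicated padding index would be one way to avoid that ambiguity.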

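Sample generation

This is pretty straightforward: the model outputs a vector of probabilities, from which the next word is sampled and appended to the input. Note that the generated text is better and more diverse if the next word is sampled rather than picked as the argmax. I've used temperature-based random sampling here.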

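def sample(preds, temperature=1.0):
  # A non-positive temperature means greedy decoding: pick the most likely word
  if temperature <= 0:
    return np.argmax(preds)
  preds = np.asarray(preds).astype('float64')
  # Rescale the log-probabilities by the temperature, then renormalize
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  # Draw one word index from the rescaled distribution
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

def generate_next(text, num_generated=10):
  word_idxs = [word2idx(word) for word in text.lower().split()]
  for i in range(num_generated):
    # Pass the whole prefix as one sequence of shape (1, len(word_idxs));
    # a 1-D array would be read as a batch of one-word sequences.
    prediction = model.predict(x=np.array([word_idxs]))
    idx = sample(prediction[0], temperature=0.7)
    word_idxs.append(idx)
  return ' '.join(idx2word(idx) for idx in word_idxs)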
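To show what the temperature does to the distribution before sampling, here is a small pure-Python sketch. The helper name `apply_temperature` and the toy probabilities are assumptions for illustration; the rescaling itself is the same log, divide, exponentiate, renormalize sequence used in sample.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector: softmax(log(p) / temperature)."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

toy_dist = [0.6, 0.3, 0.1]
sharp = apply_temperature(toy_dist, 0.5)      # low temperature: more peaked
unchanged = apply_temperature(toy_dist, 1.0)  # temperature 1: same distribution
flat = apply_temperature(toy_dist, 2.0)       # high temperature: more uniform
print(sharp, unchanged, flat)
```

A temperature below 1 pushes sampling toward the argmax, while a temperature above 1 gives rare words more of a chance, so a value such as 0.7 sits between greedy decoding and sampling from the raw model output.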

Examples of the generated text

deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
simple and effective... -> simple and effective family of variables preventing compute automatically
a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
a... -> a function parameterization necessary both both intuitions with technique valpola utilizes

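It doesn't make much sense, but the model is able to produce sentences that look at least grammatically sound (sometimes).

Link to the complete runnable script.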
