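If you need very accurate results, you will need something better than tf-idf. The Universal Sentence Encoder is one of the most accurate encoders for finding the similarity between any two pieces of text. Google provides pretrained models that you can use in your own application without training anything from scratch. First, install tensorflow and tensorflow-hub: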
pip install tensorflow
pip install tensorflow_hub
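The code below converts any text into a fixed-length vector representation; you can then use the dot product to find the similarity between the vectors: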
import numpy as np
import tensorflow as tf  # this snippet uses the TensorFlow 1.x graph API
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# Sample text
messages = [
    # Smartphones
    "My phone is not good.",
    "Your cellphone looks great.",
    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",
    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
]

# Placeholder for a batch of strings, and the op that encodes them
similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings_ = session.run(
        similarity_message_encodings,
        feed_dict={similarity_input_placeholder: messages})

# Pairwise inner products of the embeddings give the similarity matrix
corr = np.inner(message_embeddings_, message_embeddings_)
print(corr)
heatmap(messages, messages, corr)  # heatmap() is defined in the plotting code below
And the plotting code (define it before running the snippet above):
import matplotlib.pyplot as plt
import numpy as np

def heatmap(x_labels, y_labels, values):
    fig, ax = plt.subplots()
    im = ax.imshow(values)

    # We want to show all ticks...
    ax.set_xticks(np.arange(len(x_labels)))
    ax.set_yticks(np.arange(len(y_labels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(x_labels)
    ax.set_yticklabels(y_labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(len(y_labels)):
        for j in range(len(x_labels)):
            text = ax.text(j, i, "%.2f" % values[i, j],
                           ha="center", va="center", color="w",
                           fontsize=6)

    fig.tight_layout()
    plt.show()
The result would be:
As you can see, each text is most similar to itself, and after that to the texts closest to it in meaning.
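To make the dot-product step concrete, here is a minimal sketch in plain Python. It does not call the encoder: the vectors in `toy_vectors` are made-up 3-dimensional stand-ins for real embeddings, and the helper names (`similarity_matrix`, `most_similar_other`) are hypothetical, chosen only for this illustration. The vectors are scaled to unit length first, so the dot product here is the cosine similarity.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def similarity_matrix(vectors):
    """Pairwise dot products of the unit-length vectors."""
    unit = [normalize(v) for v in vectors]
    return [[dot(u, v) for v in unit] for u in unit]

def most_similar_other(sim):
    """For each row, the index of the highest-scoring *other* row."""
    result = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result

# Made-up vectors (NOT real encoder output): two "phone" texts, two "weather" texts
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.3],
    [0.1, 0.8, 0.4],
]

sim = similarity_matrix(toy_vectors)
nearest = most_similar_other(sim)
print(nearest)
```

With real embeddings, the same idea applies to the `corr` matrix computed earlier: the diagonal holds each text's similarity to itself, and the largest off-diagonal entry in a row points to the closest other text.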
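IMPORTANT: the first time you run the code it will be slow, because it has to download the model. If you want to avoid downloading the model again and use the local copy instead, create a folder for the cache, add it to the environment variable, and after the first run use that path: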
import os
import tensorflow_hub as hub

tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir

# Point to the folder inside the cache dir; its name will be unique on your system
module_url = os.path.join(tf_hub_cache_dir, "d8fbeb5c580e50f975ef73e80bebba9654228449/")
embed = hub.Module(module_url)
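Since the folder name inside the cache directory differs from system to system, it can help to look it up rather than hard-code it. The following is a rough sketch using only the standard library. It assumes that each downloaded module sits in its own subdirectory of the cache folder (as the hashed folder name above suggests); `find_cached_modules` is a hypothetical helper, not part of tensorflow_hub.

```python
import os

def find_cached_modules(cache_dir):
    """Return the sorted paths of all subdirectories of cache_dir.

    Assumption: every cached module lives in its own subdirectory.
    Returns an empty list if cache_dir does not exist yet.
    """
    if not os.path.isdir(cache_dir):
        return []
    return sorted(
        os.path.join(cache_dir, name)
        for name in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, name))
    )

tf_hub_cache_dir = "universal_encoder_cached"
cached = find_cached_modules(tf_hub_cache_dir)
if cached:
    # With a single cached model, the first entry is the path to load from
    module_url = cached[0]
    print("Cached module found at:", module_url)
else:
    print("No cached module yet; run the download once first.")
```

If more than one model has been cached, the list will contain several folders and you would need to pick the right one yourself.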
More information: https://tfhub.dev/google/universal-sentence-encoder/2
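I am working on an NLP project, in any programming language (though Python would be my preference).
I want to take two documents and determine how similar they are.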