Causes

There are several possible causes:

  1. Other programs have already allocated a large amount of GPU memory
  2. The current program is requesting too much memory

Solutions

  1. Shut down other programs running on the same GPU
  2. Request fewer resources (a smaller memory allocation)
  3. Use a smaller batch_size
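Besides shrinking the workload, TensorFlow's session configuration can also cap how much GPU memory the process claims up front. A minimal sketch using the TF1-style `ConfigProto` options (the fraction value here is just an illustrative choice):

```python
import tensorflow as tf

# By default TensorFlow grabs nearly all GPU memory at startup.
# These options make it allocate lazily and/or cap the total.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # allocate on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # cap at 40% of the GPU

sess = tf.Session(config=config)
```

With `allow_growth` enabled, the process starts small and grows its allocation as needed, which makes it easier to share one GPU between several programs.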

The recommended way is to use a partitioner to shard this large tensor across several parts:

embedding = tf.get_variable("embedding", [1000000000, 20],
                            partitioner=tf.fixed_size_partitioner(3))
This will split the tensor into 3 shards along axis 0, but the rest of the program will see it as an ordinary tensor. The biggest benefit is to use a partitioner along with parameter server replication, like this:
with tf.device(tf.train.replica_device_setter(ps_tasks=3)):
  embedding = tf.get_variable("embedding", [1000000000, 20],
                              partitioner=tf.fixed_size_partitioner(3))
The key function here is tf.train.replica_device_setter. It allows you to run 3 different processes, called parameter servers, that store all of the model's variables. The large embedding tensor will be split across these servers.

References