24 November 2016

Spark accumulators were originally introduced to mimic Hadoop's Counters, but because of Spark's execution model they can easily return incorrect results. The Spark programming guide warns:


For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed.
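The caveat can be illustrated without a cluster. The following is a minimal pure-Python sketch (the `Accumulator` class and the generator are stand-ins of my own, not Spark APIs): an uncached transformation's closure re-runs every time the dataset is recomputed, so a counter bumped inside it is applied more than once even though the data itself comes out right.

```python
class Accumulator:
    """Stand-in for a Spark accumulator: a counter mutated as a side effect."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

acc = Accumulator()

def mapped(data):
    # Like rdd.map(...): the closure runs again on every recomputation.
    for x in data:
        acc.add(1)   # side-effect update inside a "transformation"
        yield x * 2

data = [1, 2, 3]
pipeline = lambda: mapped(data)  # uncached: each "action" recomputes the map

result1 = list(pipeline())  # first action
result2 = list(pipeline())  # second action re-executes the map closure

# The data is correct both times, but the counter has been applied twice:
# acc.value is 6, not the 3 you might expect for a 3-element dataset.
```

In real Spark the recomputation is triggered by lost partitions, speculative tasks, or simply calling two actions on an uncached RDD, but the failure mode is the same.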



However, these limitations can easily seem like a small corner case, when in fact they make accumulators a thoroughly incomplete replacement for MapReduce counters. If you do try to use accumulators outside of RDD actions, they are worse than useless: they are actively misleading.


Link to this article: Spark learning path: Notes on using Spark Accumulators
