Efficient Count Distinct with Apache Spark
visitors.distinct().count() would be the obvious way. With distinct you can specify the level of parallelism, which can also improve the speed. If it is possible to set up visitors as a stream and use DStreams, the count could be done in real time. You can stream directly from a directory and …
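
To make the two ideas above concrete, here is a minimal sketch in plain Python rather than Spark itself. It only models the mechanics: hash-partitioning the data so duplicates meet in the same partition (roughly what passing a partition count to distinct achieves), and keeping a running set of seen visitors across micro-batches (roughly what a streaming count would maintain). The function names, the sample visitor IDs, and the default partition count are all assumptions made for illustration and are not part of the Spark API.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator


def partitioned_distinct_count(items: Iterable[Hashable], num_partitions: int = 4) -> int:
    """Count distinct items by hash-partitioning, then deduplicating per partition."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions = defaultdict(set)
    for item in items:
        # Equal items hash equally, so every duplicate lands in the same partition.
        partitions[hash(item) % num_partitions].add(item)
    # Partitions are disjoint, so their sizes can simply be summed.
    return sum(len(p) for p in partitions.values())


def running_distinct_counts(batches: Iterable[Iterable[Hashable]]) -> Iterator[int]:
    """Yield the cumulative distinct count after each micro-batch (hypothetical helper)."""
    seen = set()
    for batch in batches:
        seen.update(batch)
        yield len(seen)


if __name__ == "__main__":
    # Made-up visitor IDs, for illustration only.
    visitors = ["alice", "bob", "alice", "carol", "bob", "dave"]
    print(partitioned_distinct_count(visitors, num_partitions=3))  # prints 4
    batches = [["alice", "bob"], ["bob", "carol"], ["alice"]]
    print(list(running_distinct_counts(batches)))  # prints [2, 3, 3]
```

In a real job each partition would be deduplicated on a different executor, which is where more partitions can help. The running set in the second function grows with the number of distinct visitors, so a streaming version would need to keep that state somewhere durable.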