Is there a way to limit the number of rows a MapReduce job produces?
I have a map-only MapReduce job (0 reducers). Is there a way to limit how many rows it produces? It's okay if the solution is approximate, i.e. it outputs a little more or less than the desired count.
I am looking for the MapReduce equivalent of
cat filename | $UNIXEY_THINGS | head -10000000
I thought about setting a limit for each mapper (divide $NUM_ROWS by $NUM_MAPPERS), but that would require fixing the number of mappers, and from what I've read that isn't reliably possible (the mapper count is driven by the input splits). Piping everything into a single reducer doesn't look like it would be performant.
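To make the per-mapper idea concrete, here is a minimal sketch of the counting logic I have in mind. This is not a real Hadoop `Mapper` (no Hadoop imports, and `ROWS_PER_MAPPER` is a made-up constant standing in for $NUM_ROWS / $NUM_MAPPERS); it just shows the cap each mapper would enforce inside its `map()` method:

```java
import java.util.ArrayList;
import java.util.List;

public class CappedMapper {
    // Hypothetical per-mapper quota, e.g. 10_000_000 / numMappers.
    // The problem: numMappers isn't known (or settable) up front.
    static final long ROWS_PER_MAPPER = 3;

    private long emitted = 0;
    private final List<String> output = new ArrayList<>();

    // Stand-in for Mapper.map(): emit rows until the local cap is
    // reached, then silently drop the rest of this split's input.
    void map(String row) {
        if (emitted < ROWS_PER_MAPPER) {
            output.add(row);
            emitted++;
        }
    }

    List<String> getOutput() {
        return output;
    }
}
```

In a real job the same check would live in the `map()` override (or in an overridden `run()` loop so the mapper can stop reading early), but the whole approach hinges on knowing the mapper count in advance.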
It seems like the coordination required between independent mappers makes this impossible, or at least impossible without a large performance hit. Am I right about that?