Is there a way to limit the number of rows a MapReduce job produces?

I have a map-side-only (0 reducers) MapReduce job. Is there a way to limit how many rows it produces? It's okay if the solution is approximate (i.e., it outputs slightly more or fewer rows than desired).

I am looking for the MapReduce equivalent of

cat filename | $UNIXEY_THINGS | head -10000000 

I thought about setting a limit for each mapper (divide $NUM_ROWS by $NUM_MAPPERS), but that means I'd have to fix the number of mappers, and my research suggests that isn't possible (Hadoop derives the number of map tasks from the input splits, so you can only hint at it). Funneling everything through a single reducer doesn't look like it would be performant.
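For what it's worth, here is a minimal sketch of the per-mapper cap I had in mind, written as a Hadoop Streaming-style mapper (a hypothetical `capped_mapper`; the limit would be $NUM_ROWS divided by the map-task count, which is exactly the number I can't pin down):

```python
import sys


def capped_mapper(records, limit):
    """Pass through at most `limit` input records, then stop.

    In a real Hadoop Streaming job this would read sys.stdin and write
    sys.stdout; each map task independently emits up to `limit` rows,
    so the job-wide total is roughly limit * num_map_tasks.
    """
    emitted = 0
    for record in records:
        if emitted >= limit:
            break  # skip the rest of this mapper's input split
        yield record
        emitted += 1


if __name__ == "__main__":
    # Hypothetical invocation: hadoop streaming -mapper "python mapper.py"
    # with the per-mapper limit baked in or passed via the environment.
    for line in capped_mapper(sys.stdin, 10_000_000):
        sys.stdout.write(line)
```

The catch is the comment in the docstring: the job-wide total depends on the map-task count, which is chosen by the framework, not by me.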

It seems like the coordination required between processes makes this impossible, or at least impossible without a large performance hit. Am I right about that?
