Splitting the file in Map-Reduce

Imagine that you have a file “data.csv” stored in HDFS and you need to split it into a number of smaller files, each containing different data, so they can be processed separately. To do this with Pig or Hive you would have to specify the file schema and describe it as a table, which may not be what you need (for instance, if different rows have different schemas). Here is an example of how it can be done with a MapReduce job using MultipleOutputs.

1. Mapper class:
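The original code listing is not available here, so the following is a minimal sketch of what such a mapper could look like. The class name `SplitMapper` is hypothetical; the `MultipleOutputs` API itself is the standard one from `org.apache.hadoop.mapreduce.lib.output`.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical mapper: writes each input row to an output file
// named after the row's first pipe-delimited field.
public class SplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Extract the first field; limit=2 avoids splitting the rest of the row
        String firstField = value.toString().split("\\|", 2)[0];
        // Third argument is the base output path for this record
        multipleOutputs.write(NullWritable.get(), value, firstField);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Must close MultipleOutputs, or output files may be left incomplete
        multipleOutputs.close();
    }
}
```

Note that the value of the first field is used directly as a file name here, so it must contain only characters that are valid in HDFS paths.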

2. Main class:
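Again, the original listing is missing, so this is a sketch of a typical driver for such a job. The class name `SplitFile` and the reference to `SplitMapper` are hypothetical; `LazyOutputFormat` is the standard way to suppress the empty default `part-m-*` files when all output goes through `MultipleOutputs`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver class for the map-only split job
public class SplitFile {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split file");
        job.setJarByClass(SplitFile.class);
        job.setMapperClass(SplitMapper.class);
        job.setNumReduceTasks(0);  // map-only job: no reduce phase
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // LazyOutputFormat avoids creating empty part-m-* files,
        // since all records are written via MultipleOutputs
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would be run as, for example, `hadoop jar splitfile.jar SplitFile /path/to/data.csv /path/to/output`.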

It is a map-only job that reads the pipe-delimited file and writes each row to one of several output files, chosen by the value of the first field in that row.
