Monday, November 6, 2023

Compressing File in bzip2 Format in Hadoop - Java Program

This post shows how to compress an input file in bzip2 format in Hadoop. The Java program reads the input file from the local file system and copies it to HDFS in compressed bzip2 format.

The input file is large enough to be stored as more than one HDFS block. That way you can also verify whether or not the file is splittable when used in a MapReduce job. Note that bzip2 is a splittable compression format in Hadoop.

Java program to compress file in bzip2 format

As explained in the post Data Compression in Hadoop, there are different codec (compressor/decompressor) classes for different compression formats. The codec class for the bzip2 compression format is “org.apache.hadoop.io.compress.BZip2Codec”.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class BzipCompress {

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    InputStream in = null;
    OutputStream out = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      // Input file - local file system
      in = new BufferedInputStream(new FileInputStream("netjs/Hadoop/Data/log.txt"));
      // Output file path in HDFS
      Path outFile = new Path("/user/out/test.bz2");
      // Verifying if the output file already exists
      if (fs.exists(outFile)) {
        System.out.println("Output file already exists");
        throw new IOException("Output file already exists");
      }

      out = fs.create(outFile);

      // bzip2 compression
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodecByClassName(
          "org.apache.hadoop.io.compress.BZip2Codec");
      CompressionOutputStream compressionOutputStream = codec.createOutputStream(out);

      try {
        // Copy from the local input stream into the compressing output stream
        IOUtils.copyBytes(in, compressionOutputStream, 4096, false);
        compressionOutputStream.finish();
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(compressionOutputStream);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
    

To run this Java program in the Hadoop environment, export the class path pointing to the directory where the .class file for the Java program resides.

export HADOOP_CLASSPATH=/home/netjs/eclipse-workspace/bin  
Then you can run the Java program using the following command.
$ hadoop org.netjs.BzipCompress
    
18/04/24 10:44:05 INFO bzip2.Bzip2Factory: Successfully
  loaded & initialized native-bzip2 library system-native
18/04/24 10:44:05 INFO compress.CodecPool: Got brand-new compressor [.bz2]
 
Once the program has executed successfully, you can check the number of HDFS blocks created by running the hdfs fsck command.
$ hdfs fsck /user/out/test.bz2

.Status: HEALTHY
 Total size: 228651107 B
 Total dirs: 0
 Total files: 1
 Total symlinks:  0
 Total blocks (validated): 2 (avg. block size 114325553 B)
 Minimally replicated blocks: 2 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks:  0 (0.0 %)
 Default replication factor: 1
 Average block replication: 1.0
 Corrupt blocks:  0
 Missing replicas:  0 (0.0 %)
 Number of data-nodes:  1
 Number of racks:  1
FSCK ended at Tue Apr 24 10:49:55 IST 2018 in 1 milliseconds
 

As you can see there are 2 HDFS blocks.
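
The same information can also be pulled programmatically through the FileSystem API. The following is a minimal sketch (the class name BlockInfo is just for illustration, and the HDFS path is the one used above) that prints the size and block locations of the compressed file.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/out/test.bz2");
    FileStatus status = fs.getFileStatus(file);
    // Block locations covering the whole file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    System.out.println("File size: " + status.getLen() + " B, blocks: " + blocks.length);
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset() + " length=" + block.getLength());
    }
  }
}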

To verify whether the MapReduce job creates more than one input split, this compressed file test.bz2 is given as input to a word count MapReduce program. Since bzip2 is a splittable compression format, there should be 2 input splits for the job.

   
hadoop jar /home/netjs/wordcount.jar org.netjs.WordCount /user/out/test.bz2 /user/mapout

18/04/24 10:57:10 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/24 10:57:11 WARN mapreduce.JobResourceUploader: Hadoop command-line
option parsing not performed. Implement the Tool interface and execute your
application with ToolRunner to remedy this.
18/04/24 10:57:11 WARN mapreduce.JobResourceUploader: No job jar file set.
User classes may not be found. See Job or Job#setJar(String).
18/04/24 10:57:11 INFO input.FileInputFormat: Total input files to process : 1
18/04/24 10:57:11 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/04/24 10:57:11 INFO mapreduce.JobSubmitter: number of splits:2
  
You can see from the console message that two input splits are created.
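
If you want to check the number of splits without running a full MapReduce job, the input format can be asked for its splits directly. The sketch below (the class name SplitCheck is just an illustration, not part of the original program) uses TextInputFormat, which produces one split per HDFS block for a splittable codec like BZip2Codec.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split check");
    FileInputFormat.addInputPath(job, new Path("/user/out/test.bz2"));
    // BZip2Codec implements SplittableCompressionCodec, so getSplits()
    // returns one split per HDFS block instead of a single split
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("Number of splits: " + splits.size());
  }
}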

That's all for this topic Compressing File in bzip2 Format in Hadoop - Java Program. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Compressing File in snappy Format in Hadoop - Java Program
  2. How to Compress MapReduce Job Output in Hadoop
  3. How to Configure And Use LZO Compression in Hadoop
  4. Java Program to Read File in HDFS
  5. Word Count MapReduce Program in Hadoop

You may also like-

  1. Replica Placement Policy in Hadoop Framework
  2. What is SafeMode in Hadoop
  3. YARN in Hadoop
  4. HDFS High Availability
  5. Speculative Execution in Hadoop
  6. MapReduce Flow in YARN
  7. How to Run a Shell Script From Java Program
  8. Compressing And Decompressing File in GZIP Format - Java Program

2 comments:

  1. Hi, I compressed a file using bzip2, so in Hadoop it got saved without any extension. I am using Flink and compressing data in batches, but when we download that file from HDFS and try to open it using bzip2 decompression it says it can't guess the file type, and I am not able to decompress the file now... do you have any idea how I can solve this?

    Replies
    1. You have to explicitly specify the extension in the output file path -

      // Output file path in HDFS
      Path outFile = new Path("/user/out/test.bz2");

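
For reference, a file compressed this way can be read back through the same codec API. The sketch below (the class name BzipDecompress and the local output path are assumptions for illustration) decompresses /user/out/test.bz2 from HDFS to the local file system; because the file has the .bz2 extension, CompressionCodecFactory can pick the codec from the path, while getCodecByClassName could be used if the file had no extension.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class BzipDecompress {

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    InputStream in = null;
    OutputStream out = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      // Compressed file in HDFS
      Path inFile = new Path("/user/out/test.bz2");
      // The factory picks the codec from the .bz2 extension
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodec(inFile);
      if (codec == null) {
        throw new IOException("No codec found for " + inFile);
      }
      // Decompress while reading from HDFS
      in = codec.createInputStream(fs.open(inFile));
      // Decompressed copy on the local file system (path assumed)
      out = new FileOutputStream("netjs/Hadoop/Data/log_out.txt");
      IOUtils.copyBytes(in, out, 4096, false);
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}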