Wednesday, November 8, 2023

Compressing File in snappy Format in Hadoop - Java Program

This post shows how to compress a file in snappy format in Hadoop using the Java API. The Java program reads an input file from the local file system and copies it to HDFS in compressed snappy format. The input file is large enough (more than 128 MB even after compression) that it is stored as more than one HDFS block. That also lets you see whether the file is splittable when used in a MapReduce job. Note that snappy is not a splittable compression format, so the MapReduce job will create only a single split for the whole data.

Java program to compress a file in snappy format

As explained in the post Data Compression in Hadoop, there are different codec (compressor/decompressor) classes for different compression formats. The codec class for the snappy compression format is “org.apache.hadoop.io.compress.SnappyCodec”.

package org.netjs;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class SnappyCompress {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    InputStream in = null;
    OutputStream out = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      // Input file - local file system
      in = new BufferedInputStream(new FileInputStream("/netjs/Hadoop/Data/log.txt"));
      // Output file path in HDFS
      Path outFile = new Path("/user/out/test.snappy");
      // Verify that the output file does not already exist
      if (fs.exists(outFile)) {
        throw new IOException("Output file already exists");
      }

      out = fs.create(outFile);

      // Get the snappy codec and wrap the HDFS output stream with it
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodecByClassName(
          "org.apache.hadoop.io.compress.SnappyCodec");
      CompressionOutputStream compressionOutputStream = codec.createOutputStream(out);

      try {
        // Copy the input stream to the compressing output stream
        IOUtils.copyBytes(in, compressionOutputStream, 4096, false);
        // finish() flushes the remaining compressed data to the underlying stream
        compressionOutputStream.finish();
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(compressionOutputStream);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

To run this Java program in the Hadoop environment, export the classpath pointing to the directory where the compiled .class file resides.

$ export HADOOP_CLASSPATH=/home/netjs/eclipse-workspace/bin 
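
If the class has not been compiled yet, one way to compile it against the Hadoop libraries (assuming SnappyCompress.java is in the current directory and the output directory matches the classpath exported above) is:

$ javac -cp $(hadoop classpath) -d /home/netjs/eclipse-workspace/bin SnappyCompress.java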

Then you can run the Java program using the following command.

$ hadoop org.netjs.SnappyCompress

18/04/24 15:49:41 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/04/24 15:49:41 INFO compress.CodecPool: Got brand-new compressor [.snappy]

Once the program has executed successfully, you can check the number of HDFS blocks created by running the hdfs fsck command.

$ hdfs fsck /user/out/test.snappy

 Total size: 419688027 B
 Total dirs: 0
 Total files: 1
 Total symlinks:  0
 Total blocks (validated): 4 (avg. block size 104922006 B)
 Minimally replicated blocks: 4 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks:  0 (0.0 %)
 Default replication factor: 1
 Average block replication: 1.0
 Corrupt blocks:  0
 Missing replicas:  0 (0.0 %)
 Number of data-nodes:  1
 Number of racks:  1
FSCK ended at Tue Apr 24 15:52:09 IST 2018 in 5 milliseconds

As you can see, the file is stored as 4 HDFS blocks; with the default block size of 128 MB, the 419688027 byte file spans a little more than 3 full blocks, so 4 blocks are needed.
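
Reading the compressed file back works symmetrically through the codec API. The following is a minimal sketch, not part of the original program: the SnappyDecompress class name and the local output path log_uncompressed.txt are made up for illustration. Here CompressionCodecFactory.getCodec() infers the codec from the file's .snappy extension instead of looking it up by class name.

package org.netjs;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class SnappyDecompress {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inFile = new Path("/user/out/test.snappy");
    // Codec is inferred from the .snappy file extension
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inFile);
    if (codec == null) {
      System.err.println("No codec found for " + inFile);
      return;
    }
    InputStream in = null;
    OutputStream out = null;
    try {
      // Wrap the HDFS input stream with a decompressing stream
      in = codec.createInputStream(fs.open(inFile));
      // Local output path - illustrative name only
      out = new BufferedOutputStream(new FileOutputStream("/netjs/Hadoop/Data/log_uncompressed.txt"));
      IOUtils.copyBytes(in, out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}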

Now you can give this compressed file test.snappy as input to a wordcount MapReduce program. Since the compression format used is snappy, which is not splittable, there will be only one input split even though there are 4 HDFS blocks.

$ hadoop jar /home/netjs/wordcount.jar org.netjs.WordCount /user/out/test.snappy /user/mapout1

18/04/24 15:54:44 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/24 15:54:45 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/24 15:54:46 INFO input.FileInputFormat: Total input files to process : 1
18/04/24 15:54:46 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/04/24 15:54:46 INFO mapreduce.JobSubmitter: number of splits:1
18/04/24 15:54:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1524565091782_0001
18/04/24 15:54:47 INFO impl.YarnClientImpl: Submitted application application_1524565091782_0001

You can see from the console message that only one input split is created for the MapReduce job.
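
If you want to check splittability programmatically rather than reading it off the job logs, the check FileInputFormat performs boils down to an instanceof test: a compressed input is treated as splittable only when its codec implements SplittableCompressionCodec, which SnappyCodec does not (among the built-in codecs, only BZip2Codec does). A minimal sketch, with the SplitCheck class name made up for illustration:

package org.netjs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplitCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // Codec is picked by file extension; .snappy maps to SnappyCodec
    CompressionCodec codec = factory.getCodec(new Path("/user/out/test.snappy"));
    // This mirrors the isSplitable() check in TextInputFormat
    boolean splittable = codec instanceof SplittableCompressionCodec;
    System.out.println("Splittable: " + splittable); // prints false for snappy
  }
}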

That's all for this topic Compressing File in snappy Format in Hadoop - Java Program. If you have any doubts or suggestions, please drop a comment. Thanks!



