
How to Configure And Use LZO Compression in Hadoop

In this post we’ll see how to configure and use LZO compression in Hadoop.

Since LZO is GPL licensed, it doesn't come bundled with the Hadoop installation, so you will have to install it separately.

By default LZO compression is not splittable, but LZO compressed files can be indexed to make them splittable. That requires downloading hadoop-lzo and building the hadoop-lzo jar, so these are the first two steps you need to complete in order to use LZO compression in Hadoop.


Installing LZO

Use the following command to install the LZO packages on Ubuntu.

$ sudo apt-get install liblzo2-2 liblzo2-dev

Downloading hadoop-lzo and building the hadoop-lzo jar

Clone the hadoop-lzo repository.

$ git clone https://github.com/twitter/hadoop-lzo.git 

See https://github.com/twitter/hadoop-lzo for additional build instructions.

To build the cloned code you will need Maven. If you don't already have Maven, you can download and install it using the following command.

$ sudo apt install maven 

Build the cloned code with Maven: go to the directory where you cloned the hadoop-lzo repository and run the following command.

$ mvn clean install 

If the build succeeds, a "target" folder is created with the hadoop-lzo-0.4.21-SNAPSHOT.jar file inside it.

Rather than building the jar yourself, you can also download the jars packaged as an rpm (preferred if you are not using Ubuntu) from https://code.google.com/archive/p/hadoop-gpl-packing/downloads

Configuring LZO compression in Hadoop

Now you need to configure LZO and the hadoop-lzo jar in your Hadoop environment.

Update the configuration file $HADOOP_INSTALLATION_DIR/etc/hadoop/core-site.xml to register LZO codecs.

    
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec
  </value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
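
A quick way to check that the codecs are actually picked up is to resolve a codec by file extension through CompressionCodecFactory, which reads io.compression.codecs from the configuration. The following is a minimal sketch, not part of the original post; the class name LzoCodecCheck and the test path are hypothetical, and it assumes the hadoop-lzo jar and core-site.xml are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoCodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Reads the io.compression.codecs list from core-site.xml
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Resolve a codec by file extension - .lzo maps to LzopCodec
        CompressionCodec codec = factory.getCodec(new Path("test.lzo"));
        System.out.println(codec == null
            ? "LZO codec not registered"
            : "Resolved codec: " + codec.getClass().getName());
    }
}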

Add the hadoop-lzo jar and the native library for the LZO compression codec to the Hadoop classpath. For that, add the following to $HADOOP_INSTALLATION_DIR/etc/hadoop/hadoop-env.sh:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:PATH_TO_HADOOP_LZO/target/hadoop-lzo-0.4.21-SNAPSHOT.jar
 
export JAVA_LIBRARY_PATH=PATH_TO_HADOOP_LZO/target/native/Linux-amd64-64:$HADOOP_INSTALLATION_DIR/lib/native

Copy the hadoop-lzo jar to $HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib.

 
sudo cp PATH_TO_HADOOP_LZO/target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib

Using LZO compression in Hadoop

Now you can use LZO compression in Hadoop. First, let's look at a Java program that compresses a file using LZO. The program reads a file from the local file system and stores it in LZO compressed format in HDFS.

Java program to compress a file in LZO format in Hadoop

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class LZOCompress {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        InputStream in = null;
        OutputStream out = null;
        try {
            FileSystem fs = FileSystem.get(conf);
            // Input file - local file system
            in = new BufferedInputStream(new FileInputStream("netjs/Hadoop/Data/log.txt"));
            // Output file path in HDFS
            Path outFile = new Path("/user/out/test.lzo");
            // Verify that the output file doesn't already exist
            if (fs.exists(outFile)) {
                throw new IOException("Output file already exists");
            }
            out = fs.create(outFile);

            // LZOP compression - wrap the HDFS output stream with the codec
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodecByClassName(
                "com.hadoop.compression.lzo.LzopCodec");
            if (codec == null) {
                throw new IOException("LZOP codec not found, check io.compression.codecs");
            }
            CompressionOutputStream compressionOutputStream = codec.createOutputStream(out);

            try {
                IOUtils.copyBytes(in, compressionOutputStream, 4096, false);
                compressionOutputStream.finish();
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(compressionOutputStream);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

To run this Java program in the Hadoop environment, export the classpath where the .class file for the program resides.

export HADOOP_CLASSPATH=/home/netjs/eclipse-workspace/bin 

Then you can run the Java program using the following command.

$ hadoop org.netjs.LZOCompress

18/04/27 18:13:30 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/04/27 18:13:30 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library 
[hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/04/27 18:13:30 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
18/04/27 18:13:30 INFO compress.CodecPool: Got brand-new compressor [.lzo]
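
Reading the file back works the same way in reverse. As a minimal sketch (the class name LZODecompress and the local output path are assumptions, not part of the original program), the compressed HDFS file can be opened through the codec's createInputStream so the bytes are decompressed on read:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LZODecompress {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inFile = new Path("/user/out/test.lzo");
        // Codec is resolved from the .lzo extension (LzopCodec)
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inFile);
        InputStream in = null;
        OutputStream out = null;
        try {
            // Wrap the HDFS input stream so bytes are decompressed on read
            in = codec.createInputStream(fs.open(inFile));
            // Local destination - this path is just an example
            out = new FileOutputStream("/tmp/log_decompressed.txt");
            IOUtils.copyBytes(in, out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}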

Using the hdfs fsck command you can get information about the compressed file created in HDFS.

hdfs fsck /user/out/test.lzo

 Total size: 417954457 B
 Total dirs: 0
 Total files: 1
 Total symlinks:  0
 Total blocks (validated): 4 (avg. block size 104488614 B)
 Minimally replicated blocks: 4 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks:  0 (0.0 %)
 Default replication factor: 1

As you can see, the compressed file is stored as 4 HDFS blocks.

To verify whether the MapReduce job creates input splits or not, give this compressed file test.lzo as input to a wordcount MapReduce program. Since the LZO compression format is not splittable by default, only one split should be created for the MapReduce job even though there are 4 HDFS blocks.

If an LZO compressed file is used as input, the input format has to be LzoTextInputFormat in the wordcount MapReduce program, so the following change is required in the job configuration of the MapReduce job.

job.setInputFormatClass(LzoTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
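
For context, here is a minimal sketch of where those two lines fit in the job driver. The original post's wordcount classes aren't shown, so this sketch uses the stock Hadoop wordcount mapper and reducer; the package name org.netjs is assumed only to match the run command below.

package org.netjs;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.hadoop.mapreduce.LzoTextInputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // LzoTextInputFormat (from hadoop-lzo) instead of TextInputFormat,
        // so that an indexed .lzo input file is split on index boundaries
        job.setInputFormatClass(LzoTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}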

Running the MapReduce job

hadoop jar /home/netjs/wordcount.jar org.netjs.WordCount /user/out/test.lzo /user/mapout

18/04/27 18:23:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/27 18:23:19 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. 
Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/27 18:23:19 WARN mapreduce.JobResourceUploader: No job jar file set.  User classes may not be found. 
See Job or Job#setJar(String).
18/04/27 18:23:19 INFO input.FileInputFormat: Total input files to process : 1
18/04/27 18:23:20 INFO mapreduce.JobSubmitter: number of splits:1

You can see from the console messages that only a single split is created, as the file is not indexed.

Running the LZO indexer to make the file splittable

In order to make an LZO file splittable, you will have to run the indexer as a preprocessing step. You can run the LZO indexer as a standalone Java program or as a MapReduce job.

As a Java program

$ hadoop jar PATH_TO_HADOOP_LZO/target/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/out/test.lzo

18/04/27 18:31:48 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/04/27 18:31:48 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/04/27 18:31:49 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /user/out/test.lzo, size 0.39 GB...
18/04/27 18:31:49 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
18/04/27 18:31:50 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.73 seconds (549.03 MB/s).  Index size is 32.48 KB.

You can verify that the /user/out/test.lzo.index file is created.
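
If you prefer to index from your own code rather than the command line, the same LzoIndexer class can be invoked programmatically. This is a minimal sketch (the class name IndexLzoFile is hypothetical, and it assumes the hadoop-lzo jar is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import com.hadoop.compression.lzo.LzoIndexer;

public class IndexLzoFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Writes /user/out/test.lzo.index next to the .lzo file
        LzoIndexer indexer = new LzoIndexer(conf);
        indexer.index(new Path("/user/out/test.lzo"));
    }
}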

As a MapReduce job

You can also run the indexer as a MapReduce job to take advantage of parallel processing.

$ hadoop jar PATH_TO_HADOOP_LZO/target/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/out/test.lzo

Now that the file is indexed, running the wordcount MapReduce job again shows that 4 input splits are created.

hadoop jar /home/netjs/wordcount.jar org.netjs.WordCount /user/out/test.lzo /user/mapout

18/04/27 18:38:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/27 18:38:13 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/27 18:38:13 WARN mapreduce.JobResourceUploader: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
18/04/27 18:38:13 INFO input.FileInputFormat: Total input files to process : 1
18/04/27 18:38:13 INFO mapreduce.JobSubmitter: number of splits:4

That's all for this topic How to Configure And Use LZO Compression in Hadoop. If you have any doubts or suggestions, please drop a comment. Thanks!



