Hadoop WebHDFS usage in combination with HAR (hadoop archive) from PHP

Hadoop is not very efficient at storing large numbers of small files, because the NameNode keeps the metadata for every file and block in memory.

If you nevertheless need to access many small files, you can use HAR (Hadoop Archive) to work around the small-files problem. Here are the steps I took to access such files from PHP.

1. Copy the files from local filesystem to HDFS
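
The upload is a plain `hdfs dfs -put`. A minimal sketch, assuming the small files live in a local directory `./api` (a hypothetical path; the index listing further down only shows that the archived paths start with `/api`) and should end up below `/tmp/har` in HDFS. The wrapper name `upload_to_hdfs` is illustrative only:

```shell
# Hypothetical helper: copy a local directory tree into an HDFS directory.
# $1 = local source directory (assumed), $2 = HDFS target directory.
upload_to_hdfs() {
  hdfs dfs -mkdir -p "$2" &&   # create the target directory if it is missing
  hdfs dfs -put "$1" "$2/"     # upload the tree, keeping the directory name
}
```

Called as `upload_to_hdfs ./api /tmp/har`, this would place the files below `/tmp/har/api`.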

2. Create a hadoop archive

Let Hadoop create a single HAR file named hadoop-api.har from the whole HDFS directory /tmp/har/.

This command will start a MapReduce job that creates the HAR file without deleting the original small files in HDFS.
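
Archives are built with the `hadoop archive` tool. The following is a sketch rather than the exact command used here: the destination directory `/har` is inferred from the `_index` path shown later in this post, and the function name `create_har` is made up for illustration.

```shell
# Hypothetical helper wrapping the `hadoop archive` tool.
# $1 = archive name, $2 = parent directory in HDFS, $3 = destination directory.
create_har() {
  # -p sets the parent path; with no explicit sources, everything below the
  # parent is archived and stored relative to it.
  hadoop archive -archiveName "$1" -p "$2" "$3"
}
```

For this post's layout the call would be `create_har hadoop-api.har /tmp/har /har`, which writes the archive to `/har/hadoop-api.har`.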

3. Delete the small files from HDFS
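
Because the archive job leaves the originals in place, they have to be removed by hand. A cautious sketch, assuming the archive sits at `/har/hadoop-api.har` and the originals at `/tmp/har`; it only deletes once the archive's `_index` file exists. The function name `remove_originals` is hypothetical.

```shell
# Hypothetical helper: delete the original files only if the archive exists.
# $1 = path of the .har directory in HDFS, $2 = HDFS directory with the originals.
remove_originals() {
  # `-test -e` returns 0 when the path exists; -rm -r deletes recursively.
  hdfs dfs -test -e "$1/_index" && hdfs dfs -rm -r "$2"
}
```

Example: `remove_originals /har/hadoop-api.har /tmp/har`. If the HDFS trash is enabled, the files are moved there first; adding `-skipTrash` to `-rm` frees the space immediately.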

4. HAR file content (HAR filesystem)
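
Through the `har://` scheme the archive looks like a normal directory tree again. A small sketch, assuming the archive path `/har/hadoop-api.har`; the function name `list_har` is illustrative only.

```shell
# Hypothetical helper: list the archived files through the HAR filesystem.
# $1 = absolute path of the .har directory in HDFS.
list_har() {
  # Prefixing the path with har:// makes the client resolve entries
  # through the archive index instead of showing the raw HDFS files.
  hdfs dfs -ls -R "har://$1"
}
```

`list_har /har/hadoop-api.har` runs the listing against `har:///har/hadoop-api.har`.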

5. HAR file content (HDFS filesystem)

The HAR file is not really a single file but a directory containing several files. Let's have a look at it with raw hdfs commands:

6. Structure of the HAR file index (how to access the individual files)
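
A sketch of such a raw listing, assuming the archive path `/har/hadoop-api.har`; the function name `list_har_raw` is made up for illustration.

```shell
# Hypothetical helper: list the archive as the plain HDFS directory it is.
# $1 = path of the .har directory in HDFS (no har:// scheme here).
list_har_raw() {
  hdfs dfs -ls "$1"
}
```

`list_har_raw /har/hadoop-api.har` typically shows the index files `_index` and `_masterindex` next to one or more `part-N` files that hold the concatenated content.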

hdfs dfs -cat /har/hadoop-api.har/_index
...snip...
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2Fserializer%2Fclass-use%2FJavaSerialization.html file part-0 17439924 4592 1401786436896+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Frecord%2Fmeta%2FUtils.html file part-0 86374093 9779 1401786547239+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fhttp%2Flib%2FStaticUserWebFilter.StaticUserFilter.html file part-0 12578713 14088 1401786409718+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Ffs%2Fftp%2FFTPFileSystem.html file part-0 33753102 56570 1401786511587+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Frecord%2Fcompiler%2Fclass-use%2FConsts.html file part-0 23911123 4493 1401786471791+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fsecurity%2Fproto%2Fclass-use%2FSecurityProtos.GetDelegationTokenResponseProto.html file part-0 27203013 22194 1401786486455+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fha%2FFenceMethod.html file part-0 9642822 11698 1401786398308+420+hadoop+supergroup

=>  %2Fapi%2Forg%2Fapache%2Fhadoop%2Fservice%2Fclass-use%2FService.html file part-0 27995653 13268 1401786490938+420+hadoop+supergroup
                                                                                    --------
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fha%2Fproto%2Fclass-use%2FHAServiceProtocolProtos.MonitorHealthRequestProto.html file part-0 11807109 26499 1401786404845+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Ffs%2Fpermission%2Fclass-use%2FAccessControlException.html file part-0 8820869 4647 1401786392428+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Fipc%2Fprotobuf%2FRpcHeaderProtos.RpcRequestHeaderProto.OperationProto.html file part-0 76175228 529335 1401786535978+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Ffs%2FHardLink.LinkStats.html file part-0 6370459 15816 1401786381184+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2FMapFile.Reader.html file part-0 13487270 42673 1401786413666+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2FRawComparator.html file part-0 13752188 12048 1401786415087+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Frecord%2Fcompiler%2Fclass-use%2FJByte.html file part-0 23924635 4482 1401786472012+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Futil%2Fclass-use%2FShell.OSType.html file part-0 29576158 7120 1401786501833+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Fha%2Fproto%2FHAServiceProtocolProtos.TransitionToActiveRequestProtoOrBuilder.html file part-0 46601135 515881 1401786515905+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fnet%2Funix%2Fpackage-summary.html file part-0 23093614 4293 1401786467275+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fipc%2Fprotobuf%2Fclass-use%2FRpcHeaderProtos.RpcResponseHeaderProtoOrBuilder.html file part-0 20622979 7652 1401786447960+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2Fretry%2Fclass-use%2FAtMostOnce.html file part-0 17192339 4502 1401786434497+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Ftools%2Fproto%2Fclass-use%2FGetUserMappingsProtocolProtos.GetUserMappingsProtocolService.html file part-0 28531720 7322 1401786494023+420+hadoop+supergroup
...snip...

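Each row of the index file describes one entry in space-separated columns. For a file entry these are:

  • The URL-encoded path inside the HAR file
  • The type of the entry, i.e. file or dir
  • The HDFS file (part file) which contains the content
  • The offset of the content within that part file, in bytes
  • The length of the file, in bytes
  • The properties, joined by "+": modification time, permissions (decimal, 420 = octal 644), owner and group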
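
To make the column layout concrete, here is a minimal sketch that pulls the part file, offset and length out of a single index row. The function name `parse_index_line` is hypothetical; entries that are not of type `file` are simply skipped.

```shell
# Print "part offset length" for one file entry of the _index file.
# $1 = a single index row; rows whose second column is not "file" print nothing.
parse_index_line() {
  printf '%s\n' "$1" | awk '$2 == "file" { print $3, $4, $5 }'
}
```

For the row marked with `=>` above (Service.html) this prints `part-0 27995653 13268`.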

7. Example access using curl
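
WebHDFS can return a byte range of a file through the `offset` and `length` parameters of `op=OPEN`, which is exactly what the index provides. A sketch under assumptions: the NameNode address (`namenode.example.com:50070` in the usage line) is a placeholder, and the function names `webhdfs_url` and `fetch_from_har` are illustrative.

```shell
# Build the WebHDFS URL that reads `length` bytes at `offset` from a part file.
# $1 = NameNode host:port (assumed), $2 = HAR directory in HDFS,
# $3 = part file name, $4 = offset, $5 = length.
webhdfs_url() {
  printf 'http://%s/webhdfs/v1%s/%s?op=OPEN&offset=%s&length=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Fetch one archived file; -L follows the redirect from NameNode to DataNode.
fetch_from_har() {
  curl -sS -L "$(webhdfs_url "$@")"
}
```

With the values of the marked Service.html row: `fetch_from_har namenode.example.com:50070 /har/hadoop-api.har part-0 27995653 13268`.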

8. PHP access to WebHDFS

In a simple PHP script the HAR index file is loaded and parsed, and the matching entry is used to construct the URL that downloads the content of the file (inside the HAR). The local / relative path of the wanted file is appended to the PHP script URL:

The part after index.php [/api/org/apache/hadoop/ha/proto/class-use/HAServiceProtocolProtos.MonitorHealthRequestProto.html] is an example HTML file which is included in the HAR file.
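
The lookup step can be sketched in shell as well. This is not the PHP source, only an illustration of the same idea: encode the requested path the way the index stores it, find the matching row, and hand back the values a WebHDFS range request needs. The function name `lookup_in_index` is hypothetical, and only "/" is encoded, which is enough for the sample paths in this post.

```shell
# Look up a relative path (the part after index.php) in the HAR index read
# from stdin and print "part offset length" for the matching file entry.
# Simplification: only "/" is URL-encoded; other special characters are not handled.
lookup_in_index() {
  encoded=$(printf '%s' "$1" | sed 's|/|%2F|g')
  awk -v p="$encoded" '$1 == p && $2 == "file" { print $3, $4, $5; exit }'
}
```

Usage: `hdfs dfs -cat /har/hadoop-api.har/_index | lookup_in_index /api/org/apache/hadoop/ha/FenceMethod.html`. A path that is not in the index produces no output, which a script can treat as "not found".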

The php file’s source is

9. Remarks

It would be great if the WebHDFS implementation allowed direct access to a HAR filesystem.

