Why Extended Attributes are Coming to HDFS

Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like filesize, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Recently, my Intel colleague Yi Liu led the implementation of extended attributes for HDFS (HDFS-2006). This work is largely motivated by Cloudera and Intel contributions to bringing at-rest encryption to Apache Hadoop (HDFS-6134; also see this post) under Project Rhino – extended attributes will be the mechanism for associating encryption key metadata with files and encryption zones — but it’s easy to imagine lots of other places where they could be useful.

For instance, you might want to store a document’s author and subject in sometime like user.author=cwl and user.subject=HDFS. You could store a file checksum in an attribute called user.checksum. Even just comments about a particular file or directory can be saved in an extended attribute.

In this post, you’ll learn some of the details of this feature from an HDFS user’s point of view.

Inside Extended Attributes

Extended attribute keys are java.lang.Strings and the values are byte[]s. By default, there is a maximum of 32 extended attribute key/value pairs per file or directory, and the (default) maximum size of the combined lengths of name and value is 16,384. You can configure these two limits with the dfs.namenode.fs-limits.max-xattrs-per-inode and dfs.namenode.fs-limits.max-xattr-size config parameters.

Every extended attribute name must include a namespace prefix, and just like in the Linux filesystem implementations, there are four extended attribute namespaces: user, trusted, system, and security. The system and security namespaces are for HDFS internal use only; only the HDFS super user can access trusted namespace. So, user extended attributes will generally reside in the user namespace (for example, “user.myXAttrkey”). Namespaces are case-insensitive and extended attribute names are case-sensitive.

Extended attributes can be accessed using the hdfs dfs command. To set an extended attribute on a file or directory, use the -setfattr subcommand. For example,

 

You can replace a value with the same command (and a new value) and you can delete an extended attribute with the -x option. The usage message shows the format. Setting an extended attribute does not change the file’s modification time.

To examine the value of a particular extended attribute or to list all the extended attributes for a file or directory, use the hdfs dfs –getfattr subcommand. There are options to recursively descend through a subtree and specify a format to write them out (text, hex, or base64). For example, to scan all the extended attributes for /foo, use the -d option:

 

The org.apache.hadoop.fs.FileSystem class has been enhanced with methods to set, get, and remove extended attributes from a path.

 

Current Status

Extended attributes are currently committed on the upstream trunk and branch-2 and will be included in CDH 5.2 and Apache Hadoop 2.5. They are enabled by default (you can disable them with the dfs.namenode.xattrs.enabled configuration option) and there is no overhead if you don’t use them.

Extended attributes are also upward compatible, so you can use an older client with a newer NameNode version. Try them out!

Charles Lamb is a Software Engineer at Cloudera, currently working on HDFS.

Filed under:

2 Responses
  • Abhay D / July 31, 2014 / 5:36 AM

    Hi Charles,
    Thanks for passing on the info. I just checked the recent commit logs for CDH 5 here ( http://archive.cloudera.com/cdh5/cdh/5/ ) but did not find any xattr commits there.
    Do you have any idea about the ETA for xattrs ?
    Thanks and Regards,
    Abhay

    • Justin Kestelyn (@kestelyn) / July 31, 2014 / 11:03 AM

      Abhay,

      To reiterate: The commit has been made to trunk, and you should see this feature ship in CDH 5.2 and Apache Hadoop 2.5.

Leave a comment


7 − = zero