Tuesday, July 27, 2010

Metadata extraction from MS Office documents with Apache POI

Microsoft Office documents carry metadata properties such as "Title", "Author", "Comments", "Keywords", "CreateDateTime", "LastSaveDateTime", etc. Apache POI, the Java API for Microsoft documents, includes HPSF, a neat library for extracting such properties from Word, Excel or PowerPoint documents. This is useful if users upload MS Office files and you would like to show or edit the metadata before the uploaded files are stored in a content repository or somewhere else. Let's write a Java class named MsOfficeExtractor.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;

public class MsOfficeExtractor
{
 private String[] properties;
 private Map<String, Method> methodMap;

 public MsOfficeExtractor(final String[] properties)
 {
  this.properties = (properties == null ? new String[] {} : properties);
  methodMap = new HashMap<String, Method>();
  try {
   // iterate over the null-safe field, not the (possibly null) constructor argument
   for (int i = 0; i < this.properties.length; i++) {
    methodMap.put(this.properties[i], SummaryInformation.class.getMethod("get" + this.properties[i], (Class[]) null));
   }
  } catch (SecurityException e) {
   // error handling
  } catch (NoSuchMethodException e) {
   // error handling
  }
 }
 
 public Map<String, Object> parseMetaData(final byte[] data)
 {
  if (properties.length == 0) {
   return Collections.<String, Object>emptyMap();
  }

  InputStream in = null;
  try {
   in = new ByteArrayInputStream(data);
   POIFSReader poifsReader = new POIFSReader();
   MetaDataListener metaDataListener = new MetaDataListener();
   poifsReader.registerListener(metaDataListener, "\005SummaryInformation");
   poifsReader.read(in);

   return metaDataListener.metaData;
  } catch (final IOException e) {
   // error handling
  } catch (final RuntimeException e) {
   // error handling
  } finally {
   if (in != null) {
    try {
     in.close();
    } catch (IOException e) {
     // nothing to do
    }
   }
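  }

  // reached only if parsing failed; return an empty result so the method always returns a value
  return Collections.<String, Object>emptyMap();
 }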
}
The constructor expects the names of the properties we want to extract. Each valid name corresponds to a getter defined in org.apache.poi.hpsf.SummaryInformation (for example, "Comments" maps to getComments()). The map methodMap maps each requested property to the getter that is called in the listener explained below. The core method is parseMetaData, which expects the given MS Office file as a byte array. We now need a listener class, MetaDataListener, which is called during parsing.
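// Declare this as an inner (non-static) class of MsOfficeExtractor: it reads the fields properties and methodMap.
public class MetaDataListener implements POIFSReaderListener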
{
 public final Map<String, Object> metaData;

 public MetaDataListener()
 {
  metaData = new HashMap<String, Object>();
 }

 public void processPOIFSReaderEvent(final POIFSReaderEvent event)
 {
  try {
   final SummaryInformation summaryInformation = (SummaryInformation) PropertySetFactory.create(event.getStream());

   for (int i = 0; i < properties.length; i++) {
    Method method = methodMap.get(properties[i]);
    if (method == null) {
     // the getter lookup failed in the constructor; skip this property
     continue;
    }
    Object propertyValue = method.invoke(summaryInformation, (Object[]) null);

    metaData.put(properties[i], propertyValue);
   }
  } catch (final Exception e) {
   // error handling
  }
 }
}
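The reflection part of the two classes above (resolving "get" + name in the constructor, then invoking the resolved method in the listener) can be tried in isolation. The following is only a minimal sketch: FakeSummary is a hypothetical stand-in for SummaryInformation so that it runs without POI on the classpath, and the names GetterLookupSketch, resolveGetters and readAll are made up for illustration. Unlike the constructor above, it skips an unknown property name instead of aborting the whole lookup.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class GetterLookupSketch
{
 // Hypothetical stand-in for SummaryInformation (assumption, not a POI class).
 public static class FakeSummary
 {
  public String getTitle() { return "Quarterly report"; }
  public String getAuthor() { return "Jane Doe"; }
 }

 // Resolves "get" + name for every property name; unknown names are left out.
 public static Map<String, Method> resolveGetters(final Class<?> type, final String[] names)
 {
  Map<String, Method> getters = new LinkedHashMap<String, Method>();
  for (String name : names) {
   try {
    getters.put(name, type.getMethod("get" + name));
   } catch (NoSuchMethodException e) {
    // no such getter on this type: skip the property
   }
  }
  return getters;
 }

 // Invokes every resolved getter on the target and collects the results by property name.
 public static Map<String, Object> readAll(final Object target, final Map<String, Method> getters)
 {
  Map<String, Object> values = new LinkedHashMap<String, Object>();
  for (Map.Entry<String, Method> entry : getters.entrySet()) {
   try {
    values.put(entry.getKey(), entry.getValue().invoke(target));
   } catch (Exception e) {
    throw new IllegalStateException("Could not read property " + entry.getKey(), e);
   }
  }
  return values;
 }

 public static void main(final String[] args)
 {
  String[] names = {"Title", "Author", "NoSuchProperty"};
  Map<String, Method> getters = resolveGetters(FakeSummary.class, names);
  // prints {Title=Quarterly report, Author=Jane Doe}
  System.out.println(readAll(new FakeSummary(), getters));
 }
}
```

Resolving the methods once and reusing them for every document keeps the per-document work down to the invoke calls.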
The goal of the MetaDataListener is to build a map from the requested property names to their extracted values. Usage is simple:
// initialize extractor
String[] poiProperties = new String[] {"Comments", "CreateDateTime", "LastSaveDateTime"};
MsOfficeExtractor msOfficeExtractor = new MsOfficeExtractor(poiProperties);

// get byte array of any MS office document
byte[] data = ...

// extract metadata
Map<String, Object> metadata = msOfficeExtractor.parseMetaData(data);
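Before showing the extracted values to a user, the returned map usually needs a little formatting. The sketch below is one possible way to do that, under the assumption that values arrive as strings, java.util.Date objects, or null when a property is not set. The class name MetadataPrinterSketch and its methods are hypothetical, and the sample map in main only stands in for the result of parseMetaData.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

public class MetadataPrinterSketch
{
 // Renders one extracted value as text; null is treated as "property not set".
 public static String format(final Object value)
 {
  if (value == null) {
   return "(not set)";
  }
  if (value instanceof Date) {
   // fixed pattern, locale and time zone so the output does not depend on machine settings
   SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
   dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
   return dateFormat.format((Date) value);
  }
  return value.toString();
 }

 // Builds one "name: value" line per entry of the metadata map.
 public static String describe(final Map<String, Object> metadata)
 {
  StringBuilder sb = new StringBuilder();
  for (Map.Entry<String, Object> entry : metadata.entrySet()) {
   sb.append(entry.getKey()).append(": ").append(format(entry.getValue())).append('\n');
  }
  return sb.toString();
 }

 public static void main(final String[] args)
 {
  // sample values standing in for the map returned by parseMetaData (assumption)
  Map<String, Object> metadata = new LinkedHashMap<String, Object>();
  metadata.put("Comments", "Draft for review");
  metadata.put("CreateDateTime", new Date(0L));
  metadata.put("LastSaveDateTime", null);
  System.out.print(describe(metadata));
 }
}
```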
In the next post I will show how to extract metadata from MS Outlook msg files. Apache POI-HSMF, Java API to access MS Outlook msg files, has limits and is not flexible enough. I will present my own powerful solution.

