Apache Tika Parser Customization in Liferay

November 16, 2016

Apache Tika is a widely used content analysis toolkit that detects and extracts metadata and text from over a thousand file types, including common formats such as docx, xlsx, and pdf.

Liferay uses Apache Tika to extract metadata from the files stored in its Document and Media Library. This information is displayed in the “Automatically Extracted Metadata” panel of Document and Media Library.

Out of the box, Liferay provides metadata extraction for all the common file types. However, it cannot cover every file type in use. For those file types, and for files to which you have added custom metadata, you need to customize the Tika parser that Liferay uses so that the metadata is extracted and displayed in Document and Media Library.

In this blog, I will explain the steps required to customize the Apache Tika code used by Liferay so that it includes a custom parser class. For the sake of simplicity, I am not going to cover how to write a custom parser in Apache Tika.

Prerequisite: To follow this blog, you need basic knowledge of Apache Tika and of Liferay Ext and Hook plugins.

Targeted Audience: Liferay users, including developers, who want to display metadata for file types that Liferay does not support out of the box.

Environment Information:

Liferay 6.2 CE GA4
JDK 7
Apache Tika 1.3

Note: Customizing the Tika code used by Liferay can be a lengthy process. I suggest reading and following all the steps carefully:

1. First of all, you need to figure out the version of Apache Tika used by Liferay. You can find the Tika jars in the {liferay-home}/tomcat-7.0.42/webapps/ROOT/WEB-INF/lib folder. Liferay 6.2 includes tika-core.jar and tika-parsers.jar. To determine the version, extract each jar file, open META-INF/MANIFEST.MF, and look for the Bundle-Version entry. Liferay 6.2 CE GA4 uses Apache Tika 1.3.
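If you prefer not to unpack the jars by hand, the manifest can also be read programmatically. The following is a minimal sketch using only the JDK's java.util.jar API; the class name TikaVersionCheck and the fallback to Implementation-Version are my own assumptions for illustration, not something Liferay or Tika ships.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class TikaVersionCheck {

    // Returns the Bundle-Version from the jar's manifest, falling back to
    // Implementation-Version (an assumption, in case the jar is not an OSGi
    // bundle). Returns null if the jar has no manifest or neither attribute.
    static String readVersion(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            Attributes attrs = manifest.getMainAttributes();
            String version = attrs.getValue("Bundle-Version");
            return (version != null) ? version : attrs.getValue("Implementation-Version");
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the full paths of tika-core.jar and tika-parsers.jar as arguments.
        for (String path : args) {
            System.out.println(path + " -> " + readVersion(path));
        }
    }
}
```

Run it with the paths of both jars from the lib folder as arguments; each line of output shows the jar path followed by the version found in its manifest.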

2. After determining the Tika version, get the source code of tika-core and tika-parsers from GitHub. Make sure to check out the branch or tag that matches that version.

3. tika-core and tika-parsers are Maven-based projects. Import them into your IDE as existing Maven projects. Have a look at tika-mimetypes.xml in tika-core: this XML file defines the mime types that Tika recognizes.

4. To add a custom parser class for a mime type already listed in tika-mimetypes.xml, or for a new mime type of your own, please visit this link.

5. Metadata is a structure in which each field holds a value. Liferay has a list of predefined fields taken from tika-core. You can find the list of tika-core interfaces it includes in the BaseRawMetadataProcessor class (package com.liferay.portal.metadata). These interfaces contain the fields that Liferay can read from the files added to Document and Media Library. If you have custom fields, you first need to create a separate interface for them in the org.apache.tika.metadata package of tika-core.

The following metadata interfaces are included by Liferay 6.2 out of the box:

ClimateForcast
CreativeCommons
DublinCore
Geographic
HttpHeaders
Message
MSOffice
TIFF
TikaMetadataKeys
TikaMimeKeys
XMPDM
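To make step 5 more concrete, here is a rough sketch of what a custom metadata interface could look like. The interface name ProjectMetadata and its two fields are hypothetical. In a real build the interface would be its own file in the org.apache.tika.metadata package of tika-core; it is nested here only so that the example compiles on its own. The listFields helper is likewise an illustration: it assumes the constants are discovered by reflection, which is my understanding of how BaseRawMetadataProcessor works, so please verify that against the Liferay source you are using.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

public class CustomMetadataSketch {

    // Hypothetical custom interface: each constant names one metadata field
    // that your custom parser would set on the Tika Metadata object.
    public interface ProjectMetadata {
        String PROJECT_CODE = "project:code";
        String REVIEW_STATUS = "project:reviewStatus";
    }

    // Collects "constant name -> metadata key" for every public String
    // constant declared on the given interface.
    static Map<String, String> listFields(Class<?> clazz) throws IllegalAccessException {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (Field field : clazz.getFields()) {
            Object value = field.get(null); // interface constants are static
            if (value instanceof String) {
                fields.put(field.getName(), (String) value);
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IllegalAccessException {
        for (Map.Entry<String, String> e : listFields(ProjectMetadata.class).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Plain String constants keep the sketch free of Tika dependencies. Some of the built-in Tika interfaces declare their fields as Property objects instead, so follow whichever style matches the Tika version you checked out.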

6. After adding the custom parser code and metadata fields, build the tika-core and tika-parsers jar files. Replace the Tika jar files in Liferay's lib folder (see step 1) with the newly built ones.

7. Restart Liferay so that it picks up the updated Tika jar files.

8. Add your files to the Liferay Document and Media Library. Based on the extension and mime type of each file, Liferay and the Tika processor extract the metadata from the file.

If you have created custom fields to hold the metadata for your file type, you need to follow two additional steps:

9. Create a Liferay ext plugin that adds your custom interface to the BaseRawMetadataProcessor class so that Liferay recognizes the custom fields for your file types. Deploy the ext plugin.

Disclaimer: It is advisable to test ext plugins extensively in your dev and test environments before deploying them to the production environment.

10. Liferay’s Language.properties file will most likely not contain labels for your custom fields. Create a hook that adds the language properties so that Liferay can display labels for the custom metadata in the “Automatically Extracted Metadata” panel of Document and Media Library. Deploy the hook.
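Writing the language entries by hand gets tedious when an interface has many fields, so a small generator can help. The sketch below prints one candidate line per constant of a hypothetical ProjectMetadata interface. The key pattern metadata.<InterfaceName>.<CONSTANT_NAME> is an assumption based on how the built-in metadata labels appear to be keyed in Language.properties; check the existing entries in your Liferay version before relying on it.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

public class MetadataLabelKeys {

    // Hypothetical custom interface, declared here so the sketch compiles alone.
    public interface ProjectMetadata {
        String PROJECT_CODE = "project:code";
        String REVIEW_STATUS = "project:reviewStatus";
    }

    // Turns a constant name such as REVIEW_STATUS into "Review Status".
    static String toLabel(String constantName) {
        StringBuilder label = new StringBuilder();
        for (String word : constantName.toLowerCase().split("_")) {
            if (word.isEmpty()) {
                continue;
            }
            if (label.length() > 0) {
                label.append(' ');
            }
            label.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
        }
        return label.toString();
    }

    // Builds sorted key -> label entries using the ASSUMED key pattern
    // metadata.<InterfaceName>.<CONSTANT_NAME>.
    static Map<String, String> languageEntries(Class<?> clazz) {
        Map<String, String> entries = new TreeMap<String, String>();
        for (Field field : clazz.getFields()) {
            String key = "metadata." + clazz.getSimpleName() + "." + field.getName();
            entries.put(key, toLabel(field.getName()));
        }
        return entries;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : languageEntries(ProjectMetadata.class).entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

For the two hypothetical fields above, this prints metadata.ProjectMetadata.PROJECT_CODE=Project Code and metadata.ProjectMetadata.REVIEW_STATUS=Review Status. Lines like these would go into the Language.properties file of the hook, with the labels adjusted by hand where needed.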
