Apache Tika OCR for parsing text within image files or images embedded in PDFs


Apache Tika OCR

Parsing and standardizing content from different sources and file types is a core requirement, e.g. for making content searchable. Files from shared resources, for instance, rarely share a common encoding or format. Users typically share Office files (e.g. Word or Excel documents), archives (e.g. ZIP files) or binaries (e.g. PDFs), all of which have different formats. In addition, developers frequently cannot anticipate which files or formats will be retrieved from these systems, neither now nor in the future. Therefore, a solution like Apache Tika is needed, which detects the type of each incoming file and automatically starts a parsing procedure tailored to its format.

The most widespread open source tool for this purpose is Apache Tika (see https://tika.apache.org/), which is capable of parsing many different file types, such as Office documents, PDF files, archives and many more. Apache Tika first identifies the format of a file (its MIME type) and then tries to extract its metadata and content. However, even when the format has been identified correctly, parsing can still be challenging because the content embedded in a file can be quite heterogeneous. For instance, PDFs are often generated by creating a Word document that predominantly contains text and saving it as a PDF. In this case, the content can be extracted by transforming the text within the PDF to plain text. However, many PDFs contain not only text but also text within images, especially if they were generated by scanners.
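
To make the idea of format detection more tangible, the following sketch guesses a MIME type from the first bytes of a file (its "magic number"). It is a deliberately simplified illustration in plain Java and not how Apache Tika is implemented; Tika's detection draws on far more information than two byte prefixes. The class MagicBytesSketch and the method guessMimeType are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MagicBytesSketch {

    // Leading bytes ("magic numbers") of two common formats.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04}; // "PK" followed by 3 and 4

    /** Guesses a MIME type from the first bytes of a file (hypothetical helper). */
    public static String guessMimeType(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            // This prefix also matches .docx and .xlsx files, which are ZIP containers.
            return "application/zip";
        }
        return "application/octet-stream"; // generic fallback for unknown data
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guessMimeType(sample)); // prints "application/pdf"
    }
}
```

The ZIP case hints at why detection alone is not enough: a Word document and a plain archive start with the same bytes, so the container has to be inspected further before a suitable parser can be chosen.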

To address this issue, Apache Tika 1.14 includes a solution for running OCR on images embedded in PDFs. In principle, Apache Tika can be integrated into Java applications (e.g. via Maven) or run as a server (REST). The following example demonstrates how to integrate Apache Tika into a Java application and how to run Apache Tika OCR standalone.
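
As a brief aside, the server variant is not covered further in this article. A request to it might look roughly like the sketch below, which assumes that a Tika server has been started separately and listens on localhost at its default port 9998. It uses the HTTP client shipped with Java 11 or later; the class TikaServerClientSketch and the method buildRequest are hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TikaServerClientSketch {

    /** Builds a PUT request that asks the server's /tika endpoint for plain text. */
    public static HttpRequest buildRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: TikaServerClientSketch <file>");
            return;
        }
        // Assumption: a Tika server is running locally on the default port.
        byte[] document = Files.readAllBytes(Paths.get(args[0]));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest("http://localhost:9998", document),
                        HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

The remainder of this article uses the embedded variant, in which the parser runs inside the application itself.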

Before we start coding, we have to install Tesseract-OCR on our system. Tesseract runs OCR on images and is used by Apache Tika for this purpose. Tesseract can be installed following this guide.

Initially, the dependencies are included via Maven:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.14</version>
</dependency>
<dependency>
  <groupId>com.levigo.jbig2</groupId>
  <artifactId>levigo-jbig2-imageio</artifactId>
  <version>1.6.5</version>
</dependency>
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.3.1</version>
</dependency>

There are two aspects to consider when integrating these dependencies. First, the official Apache Tika website points out that the “tika-parsers” dependency pulls many transitive dependencies into the project. It is therefore recommended to check the existing dependencies to avoid problems, e.g. those caused by conflicting versions. Second, the dependencies “levigo-jbig2-imageio” and “jai-imageio-core” have to be included separately. They are not bundled with Apache Tika because their licenses are not compatible with the Apache 2.0 license.

Afterwards, we create an InputStream for a sample PDF that contains both plain text and text within images, and a ByteArrayOutputStream to receive the output. In this example, the default configuration of Apache Tika is used. To parse the content from the InputStream, we create a BodyContentHandler object, which manages the processing of the content. The BodyContentHandler can be created in different ways. In this case, we pass an OutputStream so that the handler can write the parsed content into it. If nothing is passed, the parsed content is written to an internal string buffer with a limit of 100k characters. This limit can be changed by passing a different int value to the constructor, or “-1” to disable the limit entirely.

The parser object initiates the parsing. In our case, we use the AutoDetectParser so that Tika decides which parser to use for the detected format. The default behavior of Tika can be modified by the configuration that is passed to the parser. For instance, we can exclude the XMLParser and treat XML files as regular text files. The metadata of a file (e.g. its author) is written to the Metadata object. Finally, we create a ParseContext object, which can be used to further adjust the default behavior of Tika.

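import java.io.*;
import java.nio.file.*;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

InputStream pdf = Files.newInputStream(Paths.get("src/test/resources/testpdf.pdf"));
ByteArrayOutputStream out = new ByteArrayOutputStream();

TikaConfig config = TikaConfig.getDefaultConfig();
// TikaConfig fromFile = new TikaConfig("/path/to/file");
BodyContentHandler handler = new BodyContentHandler(out);
Parser parser = new AutoDetectParser(config);
Metadata meta = new Metadata();
ParseContext parsecontext = new ParseContext();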

Before we can start parsing the PDF with images, we have to set up a few more things. First, we create a configuration for the PDF parser and set the property for extracting inline images to “true” (the default is “false”). Next, we create a TesseractOCRConfig object, set the language property to English and German (separated by a plus sign), and set the path of the Tesseract installation directory. These configuration objects are passed to the parse context together with the parser itself, which is important if content within archives is supposed to be parsed as well.

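import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Extract images embedded in the PDF so that they can be passed to OCR
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

// Language codes are combined with "+"; the path depends on the local installation
TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setLanguage("eng+deu");
tesserConfig.setTesseractPath("D:/development/tesseract/Tesseract-OCR");

// Registering the parser itself allows embedded content to be parsed as well
parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);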

Finally, the parsing is started by calling the parse method of the parser. The parsed content can be retrieved from the OutputStream defined above. Please note that parsing may take some time, especially if the PDF contains many images.

parser.parse(pdf, handler, meta, parsecontext);
System.out.println(new String(out.toByteArray(), Charset.defaultCharset()));

It is worth noting that parsing can be very memory-intensive. Parsing procedures such as the one described above should therefore always include a strategy for catching and handling OutOfMemoryErrors, especially if content is retrieved from shared resources. The way users share files can challenge applications that interact with these resources (e.g. Excel files of 500 MB or archives containing thousands of PDFs and Office files).
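
One possible shape of such a strategy is sketched below in plain Java. It wraps an arbitrary parse task and skips the document if an OutOfMemoryError occurs instead of letting the error terminate the whole run. This is only a best-effort sketch under the assumption that the JVM is still usable after the error, which is not guaranteed. The class GuardedParseSketch and the method parseGuarded are hypothetical names; in a real application the supplier would wrap the call to parser.parse and convert its checked exceptions.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class GuardedParseSketch {

    /** Runs a parse task and returns an empty result instead of propagating an OutOfMemoryError. */
    public static Optional<String> parseGuarded(Supplier<String> parseTask) {
        try {
            return Optional.ofNullable(parseTask.get());
        } catch (OutOfMemoryError e) {
            // Whatever the task had built up so far is no longer referenced
            // at this point and can be reclaimed by the garbage collector.
            System.err.println("Skipping document, not enough memory: " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A task that succeeds.
        System.out.println(parseGuarded(() -> "parsed text").orElse("<skipped>"));

        // A task that simulates running out of memory.
        Supplier<String> failing = () -> {
            throw new OutOfMemoryError("simulated");
        };
        System.out.println(parseGuarded(failing).orElse("<skipped>"));
    }
}
```

Skipped documents can then be logged and retried later, for example with more heap or in a separate process, while the remaining files continue to be processed.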

Although generic OCR solutions such as the one described above can be integrated easily and quickly, their usefulness is limited. Compared to OCR solutions tailored to specific purposes (e.g. parsing documents with a well-defined structure), their results might be of poor quality. I would therefore personally recommend using such generic OCR solutions in applications that can handle unstructured and inaccurate data, such as search applications (e.g. to make content from shared resources searchable).

 

Apache, Apache Tika, Tika, Tesseract, and Tesseract-OCR are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Johannes Peter

Johannes was a consultant and architect for search and big data at Woodmark until the end of 2017. His specialty was the processing of unstructured data, including search, log analysis and natural language processing.
