Tika Extracting PDF File
To extract content from a PDF file, Tika uses PDFParser, a class that extracts both the content and the metadata of a PDF document. This class is located in the org.apache.tika.parser.pdf package.
Its constructor and methods are listed in the tables below.
Tika PDFParser Constructor
Constructor | Description |
---|---|
public PDFParser() | It creates an instance of this class. |
Tika PDFParser Methods
Method | Description |
---|---|
public Set<MediaType> getSupportedTypes(ParseContext context) | It returns the set of media types supported by this parser when used with the given parse context. |
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException | It parses a document stream into a sequence of XHTML SAX events. |
public PDFParserConfig getPDFParserConfig() | It returns the PDFParserConfig currently used by this parser. |
public void setPDFParserConfig(PDFParserConfig config) | It sets the PDFParserConfig to be used by this parser. |
public void setEnableAutoSpace(boolean v) | If true, the parser estimates where spaces should be inserted between words. |
public boolean getExtractAnnotationText() | It returns true if text in annotations is extracted. |
public void setExtractAnnotationText(boolean v) | If true (the default), text in annotations will be extracted. |
public void setSuppressDuplicateOverlappingText(boolean v) | If true, the parser should try to remove duplicated text over the same region. |
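The configuration methods in the table can be combined as in the following minimal sketch. It assumes that the Tika PDF parser module is on the classpath and that PDFParserConfig exposes setters with the same names as the PDFParser setters listed above; the class name PdfParserConfigSketch is only illustrative.

```java
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Illustrative class name, not part of Tika.
public class PdfParserConfigSketch {
    public static void main(String[] args) {
        // Collect the PDF-specific options in a config object.
        PDFParserConfig config = new PDFParserConfig();
        config.setEnableAutoSpace(true);                  // estimate spaces between words
        config.setExtractAnnotationText(false);           // skip text found in annotations
        config.setSuppressDuplicateOverlappingText(true); // drop text drawn twice over the same region

        // Hand the config to the parser before calling parse().
        PDFParser parser = new PDFParser();
        parser.setPDFParserConfig(config);
    }
}
```

Grouping the options in one PDFParserConfig keeps the parser setup in a single place, which is convenient when the same settings are reused for several parsers.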
Tika Extracting PDF File Example
In the following example, we are extracting content and metadata from a pdf file.
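A minimal sketch of such a program is given here. It assumes that the Tika parser libraries are on the classpath and that a file named sample.pdf exists in the working directory; the file name and the class name TikaPdfExample are placeholders, not taken from the original example.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Placeholder class name; sample.pdf is an assumed input file.
public class TikaPdfExample {
    public static void main(String[] args) throws IOException, SAXException, TikaException {
        BodyContentHandler handler = new BodyContentHandler(); // collects the body text
        Metadata metadata = new Metadata();                    // filled in by the parser
        ParseContext context = new ParseContext();

        // Parse the PDF; try-with-resources closes the stream afterwards.
        try (InputStream input = new FileInputStream("sample.pdf")) {
            PDFParser parser = new PDFParser();
            parser.parse(input, handler, metadata, context);
        }

        // Print the extracted text.
        System.out.println("Document Content: " + handler.toString());

        // Print every metadata entry as "name: value".
        System.out.println("Document Metadata:");
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```

The metadata names and values that are printed depend on the input file and on the Tika version in use.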
Output:
Document Content: Welcome to the tutoraspire. tutoraspire is a Technical portal that contains latest computer science topics.
Document Metadata:
pdf:PDFVersion: 1.4
xmp:CreatorTool: Online2PDF.com
access_permission:modify_annotations: true
access_permission:can_print_degraded: true
meta:creation-date: 2018-05-05T11:25:40Z
created: Sat May 05 16:55:40 IST 2018
access_permission:extract_for_accessibility: true
access_permission:assemble_document: true
xmpTPg:NPages: 1
Creation-Date: 2018-05-05T11:25:40Z
dcterms:created: 2018-05-05T11:25:40Z
dc:format: application/pdf; version=1.4
access_permission:extract_content: true
access_permission:can_print: true
pdf:docinfo:creator_tool: Online2PDF.com
access_permission:fill_in_form: true
pdf:encrypted: false
producer: Online2PDF.com
access_permission:can_modify: true
pdf:docinfo:producer: Online2PDF.com
pdf:docinfo:created: 2018-05-05T11:25:40Z
Content-Type: application/pdf