Tika Parsing Document to XHTML
Tika uses the ToXMLContentHandler class to produce output in XHTML format. The handler collects the XHTML content of the whole document, which can then be retrieved as a string.
This class contains the following constructors and methods.
Tika ToXMLContentHandler Constructors
The following are the constructors of the ToXMLContentHandler class.
Constructor | Description |
---|---|
public ToXMLContentHandler() | Creates an instance of the class. |
public ToXMLContentHandler(String encoding) | Creates an instance with the given character encoding name. |
Tika ToXMLContentHandler Methods
The following are the methods of the ToXMLContentHandler class.
Methods | Description |
---|---|
public void characters(char[] ch, int start, int length) throws SAXException | Writes the given characters to the underlying character stream. |
protected void write(char ch) throws SAXException | Writes the given character as-is. |
protected void write(String string) throws SAXException | Writes the given string of characters as-is. |
public void startDocument() throws SAXException | Writes the XML prefix. |
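To make the role of these methods more concrete, here is a simplified stand-in handler built only on the JDK's SAX classes. It is not Tika's implementation: the class name MiniXmlHandler and its escape helper are hypothetical, and the rule that the XML prefix is written only when an encoding was supplied is an assumption made for this sketch.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified stand-in for an XML-serializing SAX handler.
public class MiniXmlHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final String encoding; // null means "write no XML prefix" (assumption)

    public MiniXmlHandler() { this(null); }

    public MiniXmlHandler(String encoding) { this.encoding = encoding; }

    // Writes a single character as-is, without escaping.
    protected void write(char ch) { out.append(ch); }

    // Writes a string as-is, without escaping.
    protected void write(String string) { out.append(string); }

    // Writes a character, replacing XML special characters with entities.
    private void escape(char ch) {
        switch (ch) {
            case '<': write("&lt;"); break;
            case '>': write("&gt;"); break;
            case '&': write("&amp;"); break;
            case '"': write("&quot;"); break;
            default: write(ch);
        }
    }

    @Override
    public void startDocument() {
        if (encoding != null) {
            write("<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>\n");
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        write('<');
        write(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            write(' ');
            write(atts.getQName(i));
            write("=\"");
            for (char c : atts.getValue(i).toCharArray()) escape(c);
            write('"');
        }
        write('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        write("</" + qName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) escape(ch[i]);
    }

    @Override
    public String toString() { return out.toString(); }

    public static void main(String[] args) throws SAXException {
        MiniXmlHandler handler = new MiniXmlHandler("UTF-8");
        char[] text = "Hello & welcome".toCharArray();
        // Fire the SAX events a parser would send for <p>Hello & welcome</p>.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        System.out.println(handler);
    }
}
```

Running main prints the XML prefix line followed by `<p>Hello &amp; welcome</p>`: characters() escapes the ampersand, while write() passes markup through untouched.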
Tika Parsing Document to XHTML Example
This example produces output in XHTML format from a plain-text input file.
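A minimal sketch of such a program follows, assuming the tika-core and tika-parsers JARs are on the classpath. The class name TikaToXhtmlExample and the helper toXhtml are illustrative names, not part of Tika. The fallback to in-memory text when hello.txt is absent is only there to keep the sketch runnable on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

// Illustrative class name; not part of Tika.
public class TikaToXhtmlExample {

    // Parses the stream with Tika and returns the document as an XHTML string.
    public static String toXhtml(InputStream stream) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();   // detects the document type
        ContentHandler handler = new ToXMLContentHandler(); // serializes SAX events to XHTML
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // hello.txt is the sample input file; if it is missing,
        // the same text is parsed from memory instead (assumption for this sketch).
        Path input = Paths.get("hello.txt");
        byte[] sample = "Hello Welcome to Tutor Aspire".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = Files.exists(input)
                ? Files.newInputStream(input)
                : new ByteArrayInputStream(sample)) {
            System.out.println(toXhtml(in));
        }
    }
}
```

Passing "UTF-8" to the ToXMLContentHandler(String encoding) constructor instead of using the no-argument one is the other option listed in the constructor table above.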
Output:
The following is the content of the hello.txt file.
Hello Welcome to Tutor Aspire
After extraction, Tika produces the following output in XHTML format.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser" />
<meta name="Content-Encoding" content="ISO-8859-1" />
<meta name="Content-Type" content="text/plain; charset=ISO-8859-1" />
<title></title>
</head>
<body><p>Hello Welcome to Tutor Aspire</p>
</body></html>