Jericho HTML Parser

Software Details:

Version: 3.4

Upload Date: 10 Dec 15

Developer: Martin Jericho

Distribution Type: Freeware

Downloads: 105

Rating: 4.5/5 (Total Votes: 2)

Jericho HTML Parser is a Java library that can edit server-side and client-side tags, while reproducing verbatim any unrecognised or invalid HTML.

It also provides high-level HTML form manipulation functions.

Features:

The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised by the parser. This means that normal HTML is still parsed properly even if server tags are embedded inside it, which is common, for example, when dynamically setting element attributes.
A stream-based parsing option, the StreamedSource class, allows memory-efficient processing of large files using an event iterator. This is essentially a StAX alternative with the ability to process HTML and non-validating XML, as well as several other features not available in other streaming parsers.
In its standard form it is neither an event-based nor a tree-based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched for the characters relevant to each search operation.
Compared to a tree-based parser such as DOM, the memory and resource requirements can be far lower if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike with tree-based parsers, which must identify every node in the document from top to bottom.
Compared to an event-based parser such as SAX, the interface is at a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
The row and column number of each position in the source document are easily accessible.
Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
Built-in functionality to render HTML markup with simple text formatting.
Built-in functionality to format HTML source code so that elements are indented according to their depth in the document element hierarchy.
Built-in functionality to compact HTML source code by removing all unnecessary white space.
Custom tag types can be easily defined and registered for recognition by the parser.
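
The position-based editing and row/column features described above can be illustrated without the library itself. The following is a minimal sketch in plain Java, assuming that each parsed segment is known only by its begin and end character offsets. The class and method names (OffsetEditSketch, Edit, apply, rowColumn) are hypothetical and are not part of the Jericho API; the sketch only shows why keeping offsets lets selected segments be changed while all other text, including invalid HTML, is copied through untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: none of these names come from the Jericho API. */
class OffsetEditSketch {

    /** A replacement for the half-open character range [begin, end) of the source. */
    static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    /** Applies non-overlapping edits; every character outside them is copied verbatim. */
    static String apply(String source, List<Edit> edits) {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin));
        StringBuilder out = new StringBuilder();
        int pos = 0; // first source offset not yet written to the output
        for (Edit e : sorted) {
            if (e.begin < pos || e.end < e.begin || e.end > source.length()) {
                throw new IllegalArgumentException("overlapping or out-of-range edit at " + e.begin);
            }
            out.append(source, pos, e.begin); // untouched text before the edit
            out.append(e.replacement);
            pos = e.end;
        }
        out.append(source, pos, source.length()); // untouched tail
        return out.toString();
    }

    /** Converts a character offset into a 1-based {row, column} pair. */
    static int[] rowColumn(String source, int offset) {
        int row = 1;
        int col = 1;
        for (int i = 0; i < offset; i++) {
            if (source.charAt(i) == '\n') {
                row++;
                col = 1;
            } else {
                col++;
            }
        }
        return new int[] {row, col};
    }

    public static void main(String[] args) {
        // The markup is deliberately malformed: <b> and <i> are never closed.
        String html = "<p class=old>Hi <b>there\n<i>unclosed</p>";
        int begin = html.indexOf("old");
        System.out.println(apply(html, Arrays.asList(new Edit(begin, begin + 3, "new"))));
        int[] rc = rowColumn(html, html.indexOf("<i>"));
        System.out.println(rc[0] + ":" + rc[1]);
    }
}
```

Because the edit is expressed purely as offsets into the original text, the unclosed tags in the sample never need to be understood or repaired; they pass through to the output exactly as written.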

What is new in this release:

Added Source(File) constructor.
Added OutputDocument.getSegment() method.
Added OutputDocument.remove(int begin, int end) method.
Added Renderer.setHRLineLength() method.
Added RenderToText.jsp webapp sample.
Added Segment.getRowColumnVector() method.
Encoding detection now ignores common encodings specified in meta tags that have a code unit size incompatible with the preliminary encoding.
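
The reasoning behind the encoding detection change can be sketched as follows: if a meta tag could be read at all using the preliminary encoding, a declared encoding with a different code unit size cannot be the real one. This is one reading of the release note written in plain Java, not the library's actual detection code. The names EncodingChoiceSketch, codeUnitSize and choose, and the small table of encodings, are assumptions made for illustration.

```java
import java.util.Locale;

/** Illustrative only: these names are hypothetical and are not part of the Jericho API. */
class EncodingChoiceSketch {

    /** Bytes per code unit for a few common encodings, or -1 if the name is not in this small table. */
    static int codeUnitSize(String encoding) {
        String name = encoding.toUpperCase(Locale.ROOT);
        if (name.startsWith("UTF-32")) return 4;
        if (name.startsWith("UTF-16")) return 2;
        switch (name) {
            case "UTF-8":
            case "US-ASCII":
            case "ISO-8859-1":
            case "WINDOWS-1252":
                return 1;
            default:
                return -1;
        }
    }

    /**
     * Returns the encoding to use: the meta tag's declared encoding, unless both
     * encodings are in the table and their code unit sizes differ, in which case
     * the declaration is ignored and the preliminary encoding is kept.
     */
    static String choose(String preliminary, String metaCharset) {
        if (metaCharset == null) {
            return preliminary;
        }
        int preliminarySize = codeUnitSize(preliminary);
        int metaSize = codeUnitSize(metaCharset);
        boolean bothKnown = preliminarySize != -1 && metaSize != -1;
        if (bothKnown && preliminarySize != metaSize) {
            return preliminary; // incompatible declaration, ignore it
        }
        return metaCharset;
    }

    public static void main(String[] args) {
        System.out.println(choose("ISO-8859-1", "UTF-16")); // declaration ignored
        System.out.println(choose("ISO-8859-1", "UTF-8"));  // declaration accepted
    }
}
```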

What is new in version 3.1:

Bug Fixes:
Infinite loop on Segment.getAllStartTags()
Infinite loop on Segment.getAllElements()
Segment.getFirst* methods returned segments outside the bounding segment.
Segment.getAllElements methods did not return all enclosed elements in some circumstances.
Fixed documentation errors in Segment.getAllElements methods.
Added StreamedSource class.
Changes that could affect the behavior of existing programs:
Changed ParseText from class to interface.
Segment.getNodeIterator() now returns character references as separate nodes.
Added tag search methods based on attribute value regular expressions.
Added tag search methods based on HTML class attribute.
Added static Source.LegacyNodeIteratorCompatabilityMode property temporarily to restore Segment.getNodeIterator() functionality to that of previous versions.
Removed char[] based search methods in ParseText.
Added CharacterReference.appendCharTo(Appendable) method.
Added OutputDocument(Segment) constructor.
Added StreamedSourceCopy sample program.
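
The Segment.getNodeIterator() change listed above means that a character reference such as &amp; is no longer folded into the surrounding plain text. The effect on the node sequence can be pictured with a rough stand-in built on java.util.regex. This is not how Jericho parses, and the NodeSplitSketch class and its nodes method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a regex stand-in, not Jericho's parser or its API. */
class NodeSplitSketch {

    // Alternatives, tried in order: a tag, a character reference, a run of plain
    // text, or a stray '<' or '&' that did not start a tag or a reference.
    private static final Pattern NODE =
            Pattern.compile("<[^>]*>|&#?[A-Za-z0-9]+;|[^<&]+|[<&]");

    /** Splits the input into consecutive nodes that together cover every character. */
    static List<String> nodes(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = NODE.matcher(html);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Each character reference appears as its own list element.
        System.out.println(nodes("<p>Fish &amp; chips &#169; 2015</p>"));
    }
}
```

Concatenating the returned nodes reproduces the input exactly, so code that previously treated "Fish &amp; chips" as one text node would now see three nodes for the same span.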
