XEPOnline HTML to PDF Test Document

This document serves to test various sample HTML elements and their representation in the PDF.

Why Did we Develop This?

We set out to address many of the elements missing from the web page print solutions we found. Many of them merely convert the HTML fed to the browser, not the content of the browser DOM. We wanted to be able to process live content that can vary based on user interaction.

We found almost all solutions lacking in SVG support. You cannot actually inject the SVG into the print document; you can only insert a raster image of the SVG. We wanted high-resolution, vector-based SVG content. We also wanted to leverage formatting technology far superior to a "PDF" writer or a browser, neither of which is designed as a print composition engine.

We also found many situations where we wanted the flexibility of just having some XML content in the page for special formatting. You can easily extend this solution to format any XML markup in your page, not only HTML.

Push the "Print It!" button, get your result and keep reading.

Throughout this document you will also see print buttons like this located on certain headings. This is a demonstration of using the code to print a single <div>. Some of the buttons vary settings for print.

How it Works

The XEPOnline javascript library extracts the content of a named <div> element in the HTML page. It processes that <div> and embeds all css-based styling into the HTML. You can use the library to generate a PDF of any <div>, including those with dynamically generated content, because it processes the browser DOM of the HTML at the time of execution. In other words, this is not some canned HTML to print, it is the current HTML to print.

We'll cover advanced use concepts in the future, but the whole system is extensible and leverages XSL FO technology for print file generation while maintaining ease of styling via css. The extensibility allows you to customize output in many ways, to extend and expand upon a simple File->Print operation. Because it uses XSL FO at the core, one can certainly do more than generate PDF. You could generate PostScript, AFP or XPS print files if desired.

Because HTML is not XML, the solution does some additional processing to ensure a well-formed XML document is exported. That well-formed XML document is generated and sent to the XEPOnline formatter via REST with a reference to a specific XSL stylesheet for processing the HTML tag content to create XSL FO.

XEPOnline accepts the REST request, attempts to format the document with RenderX XEP and returns the result. There are several options for returning the data, from initiating a download to base64 encoding the result and inserting it into the document.

Currently Supported Elements

The XSL which processes incoming (X)HTML to XSL FO is written in XSL version 1.0. Most modern web environments make use of very few tags and control appearance through headings, div's, img's and span's with css styles. The template not only supports these core elements but also has support for many legacy structures. The following HTML structures are currently supported:

Block Elements - div, h1 through h6, p, blockquote, pre, address, main, figure, figcaption
Inline Elements - span, i, b, sup, sub, font, code, em, small, strike, strong, u, q, dfn, abbr, var, samp, kbd, mark
Special Handling - a, img, svg, header, footer
Tables - Table support includes thead, tbody, tfoot elements as well as column and row spanning
Lists - Partial support for both ul, ol

The new HTML5 tags <section> and <article> are only supported as block elements, not as separate page-generating elements at this time. One <header> and one <footer> element are allowed inside the printable <div> and are used for the printed document header and footer.

Style Handling

A selected list of appearance styles is passed in css format to the XSL stylesheet, which parses and interprets the css. There are some special considerations in the XSL to handle differences between HTML css and XSL FO style attributes. Every attempt is made to keep the XSL FO and resulting PDF output as close as possible to the css/browser representation.
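
As a rough illustration of passing only a selected list of styles, the sketch below builds an inline style string from a map of property values. The function name and the whitelist are hypothetical; the actual list of styles XEPOnline passes through is not given in this document.

```javascript
// Hypothetical whitelist for illustration only; not the list XEPOnline uses.
const SELECTED_STYLES = ["color", "font-size", "font-weight", "text-align", "border", "background-color"];

// Build an inline "style" string from a map of css properties (for example,
// values read from window.getComputedStyle in a browser), keeping only the
// selected properties and skipping empty values.
function buildInlineStyle(computed, selected = SELECTED_STYLES) {
  return selected
    .filter((name) => computed[name] !== undefined && computed[name] !== "")
    .map((name) => `${name}: ${computed[name]}`)
    .join("; ");
}

// "cursor" is not in the whitelist, so it is dropped from the result.
console.log(buildInlineStyle({ color: "red", cursor: "pointer", "font-size": "12pt" }));
```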

The solution also takes hidden items into account, that is, those with the "display" attribute set to "none". These items are not extracted to the print file. Clicking this paragraph will toggle the display properties of the following paragraph. (The following paragraph is a large gray paragraph in the sample document. You may not see it here if it is toggled to "display:none".)
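
The idea of leaving hidden items out of the print file can be sketched as a recursive filter. This is a minimal sketch over plain objects rather than real DOM nodes; the node shape and the function name are assumptions made for the example.

```javascript
// Assumed node shape: { tag, style: { display }, children: [...] };
// text content is represented as plain strings.
function pruneHidden(node) {
  if (typeof node === "string") return node;               // text is always kept
  if (node.style && node.style.display === "none") return null; // hidden subtree is dropped
  const children = (node.children || [])
    .map(pruneHidden)
    .filter((child) => child !== null);
  return { ...node, children };
}

const page = {
  tag: "div",
  children: [
    { tag: "p", children: ["Visible paragraph"] },
    { tag: "p", style: { display: "none" }, children: ["I am a test paragraph!"] },
  ],
};

// Only the visible paragraph remains in the pruned tree.
console.log(pruneHidden(page).children.length);
```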

I am a test paragraph!

Note also that some css attributes that have no effect on the appearance of the page in the browser can still be used in the HTML. For instance, you can use the css attribute "page-break-before" to start a new page.
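
One plausible way to carry "page-break-before" through to XSL FO is to translate it to the FO "break-before" property. The mapping below is our assumption for illustration, not the stylesheet's actual logic; in particular, mapping "left" to "even-page" and "right" to "odd-page" assumes a left-to-right document in which left pages are even-numbered.

```javascript
// Assumed mapping from css page-break-before values to XSL FO break-before values.
const BREAK_BEFORE_MAP = {
  always: "page",
  left: "even-page",   // assumption: left pages are even-numbered
  right: "odd-page",   // assumption: right pages are odd-numbered
  auto: "auto",
};

// Returns the FO break-before value, or null when there is no direct
// equivalent (for example "avoid", which FO expresses with keep properties).
function toFoBreakBefore(cssValue) {
  const value = String(cssValue).trim().toLowerCase();
  return Object.prototype.hasOwnProperty.call(BREAK_BEFORE_MAP, value)
    ? BREAK_BEFORE_MAP[value]
    : null;
}

console.log(toFoBreakBefore("always"));
```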

Print Media

This solution also supports various aspects of print media. The print media stylesheet is applied to the data before it is sent for processing, so you can make various style changes as well as inject page information this way. Some browsers fully support CSS3 print media @page attributes while others do not. However, the system was created so that you can either use @page or specify the page settings in code.

This paragraph will not print in the output because its class is set to "noprint" and the print media css used sets all "noprint" elements to "display:none".

Headings

Heading level 3 without any specific style changes

Heading Level 4

Heading Level 5

Heading Level 6

Other Block-level Structures

The following shows various block-level structures, some with the standard HTML interpretation of the style and others with css styling applied to augment or change the formatting.

This is a paragraph with some CSS styling applied

This is the standard <blockquote> element. It provides an indented look to the text. This is rarely used in the days of <div> elements, classes and css styling but it is supported.

This is the standard <pre> element. Again, most folks would use
css to style output like this, but for old HTML compatibility 
we support the "pre" element.

Notice: Throughout this document, "alert" boxes like this are used to convey information to you. In fact, this is a great example showing mapping css styles to formatting. It includes using border, colors, fonts and background-images all with css styles.

Inline Elements

Testing some inline elements. There are some elements still used like "b" for bold and "i" for italic. They can even be combined like bold italic underline. The more modern approach of using <span> with classes and css is also supported. A variety of other elements are supported, like the quote element, and even superscripts and subscripts.

Exception: Underline using the css style "text-decoration" in Chrome is not yet supported through to the PDF. It does not export the "text-decoration" attribute and instead uses a strange "0px" border construct that only Chrome interprets as an underline. The above css style for underline in Chrome is represented as "border: 0px none rgb(51, 252, 243)".

Tables

Tables: This section tests various tabular structures including borders, colors and spanning.

Notice: You can use the elements <thead> and <tfoot> in your HTML to mark areas of the table to be treated in XSL FO as table-header and table-footer. These would repeat at page breaks.
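
A small sketch of the section mapping described in this notice. The FO element names are standard XSL FO; the function name is hypothetical. As we understand the XSL FO content model, fo:table-footer is written before fo:table-body in the markup, so the sketch also reorders the sections.

```javascript
// HTML table sections and their XSL FO counterparts.
const FO_SECTION = {
  thead: "fo:table-header",
  tfoot: "fo:table-footer",
  tbody: "fo:table-body",
};

// Order assumed from the XSL FO content model: header, footer, then body.
const FO_ORDER = ["thead", "tfoot", "tbody"];

// Takes the section tags in HTML source order and returns the FO element
// names in the order they would be written to the FO document.
function toFoSectionOrder(htmlSections) {
  return htmlSections
    .filter((tag) => FO_ORDER.includes(tag))
    .sort((a, b) => FO_ORDER.indexOf(a) - FO_ORDER.indexOf(b))
    .map((tag) => FO_SECTION[tag]);
}

console.log(toFoSectionOrder(["thead", "tbody", "tfoot"]));
```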

Heading 1	Heading 2	Heading 3
Body Cell 1	Body Cell 2	Body Cell 3
Body Cell 1	Body Cell 2	Body Cell 3
Body Cell 1
Body Cell 1	Body Cell 2	Body Cell 3
Body Cell 1	Body Cell 2	Body Cell 3

A fancier table with all CSS styling. Also note that this table implements "thead", "tfoot" and "tbody", which map to the appropriate XSL FO constructs so that the table header and table footer are repeated at a break in the page.

Company Header	Contact Header	Country Header
Company Footer	Contact Footer	Country Footer
Alfreds Futterkiste	Maria Anders	Germany
Berglunds snabbköp	Christina Berglund	Sweden
Centro comercial Moctezuma	Francisco Chang	Mexico
Ernst Handel	Roland Mendel	Austria
Island Trading	Helen Bennett	UK
Königlich Essen	Philip Cramer	Germany
Laughing Bacchus Winecellars	Yoshi Tannamuri	Canada
Magazzini Alimentari Riuniti	Giovanni Rovelli	Italy
North/South	Simon Crowther	UK
Paris spécialités	Marie Bertrand	France

Lists

Exception: Lists in HTML can carry an attribute that indicates the style for the list (like decimal, upper-roman, disc, square, etc.). In HTML this style is inherited from parent lists to child lists unless the style is changed. There is no equivalent to this in XSL FO. The current implementation maps inherited styles but has some issues with assumed ones.

Note: There are many list styles in HTML. Currently this solution supports the following:

Bullet Lists:
- disc
- square
- circle
Numbered Lists:
- decimal
- decimal-leading-zero
- lower-alpha
- lower-latin
- lower-roman
- upper-alpha
- upper-latin
- upper-roman
Other:
- none
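
Since XSL FO has no direct equivalent of the HTML list style, the item labels have to be produced explicitly. The following is a minimal sketch of generating a label for each of the styles listed above, using the css keywords (for example "disc"). The function names, the bullet characters and the trailing period are our own choices for the example, not taken from the XEPOnline stylesheets.

```javascript
// Convert a positive integer to lowercase roman numerals (1 -> i, 4 -> iv).
function toRoman(n) {
  const numerals = [
    [1000, "m"], [900, "cm"], [500, "d"], [400, "cd"], [100, "c"], [90, "xc"],
    [50, "l"], [40, "xl"], [10, "x"], [9, "ix"], [5, "v"], [4, "iv"], [1, "i"],
  ];
  let out = "";
  for (const [value, symbol] of numerals) {
    while (n >= value) { out += symbol; n -= value; }
  }
  return out;
}

// Convert a positive integer to lowercase letters (1 -> a, 26 -> z, 27 -> aa).
function toAlpha(n) {
  let out = "";
  while (n > 0) {
    n -= 1;
    out = String.fromCharCode(97 + (n % 26)) + out;
    n = Math.floor(n / 26);
  }
  return out;
}

// Return the label text for item number n (1-based) of a list with the given style.
function listLabel(type, n) {
  switch (type) {
    case "disc": return "\u2022";
    case "circle": return "\u25E6";
    case "square": return "\u25AA";
    case "decimal": return `${n}.`;
    case "decimal-leading-zero": return `${String(n).padStart(2, "0")}.`;
    case "lower-alpha":
    case "lower-latin": return `${toAlpha(n)}.`;
    case "upper-alpha":
    case "upper-latin": return `${toAlpha(n).toUpperCase()}.`;
    case "lower-roman": return `${toRoman(n)}.`;
    case "upper-roman": return `${toRoman(n).toUpperCase()}.`;
    case "none": return "";
    default: return `${n}.`; // assumption: unknown styles fall back to decimal
  }
}

console.log(listLabel("lower-roman", 4), listLabel("upper-alpha", 27));
```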

Bullet-style Lists

Testing un-numbered lists, first a simple list

One
Two
Three

Another list with list-style-type setting in CSS

One
Two
Three

Now, nested lists

One
A list inside the other list
- Level 2 bullet
- A list inside the other list
  - Level 2 bullet
  - A list inside the other list
    - Level 2 bullet
    - Level 2 bullet
    - Level 2 bullet
  - Level 2 bullet
- Level 2 bullet
Three

Numbered Lists

Testing the same lists as above, numbered, first a simple list

One
Two
Three

Another set of lists with list-style-type setting in CSS

One
Two
Three

One
Two
Three

One
Two
Three

Now, nested lists

One
A list inside the other list
1. Level 2 bullet
2. A list inside the other list
  1. Level 2 bullet
  2. A list inside the other list
    1. Level 2 bullet
    2. Level 2 bullet
    3. Level 2 bullet
  3. Level 2 bullet
3. Level 2 bullet
Three

Special Processing

A very common use of the list element in HTML is to provide specialty structures like breadcrumbs or navigation tabs. While most of these structures are likely not something you would want in the print output, we wanted to represent the HTML as closely as possible. These types of lists normally set the "display" attribute to "inline" on the list tag.

We have attempted to take this into account, mapping a list HTML tag with "display:inline" differently than a normal list. This example actually is a set of lists with css applied that turns that list into a breadcrumb. It also shows some advanced concepts introduced later like creating links to internal destinations in the document.

XEPOnline HTML to PDF Test Document

» Lists

» Special Processing

Or maybe a cool effect like a set of tabs you can drop in the document to format and control the navigation.

Exception: The css for paddings and margins needs to be worked through in this example. There are differences in the way these are handled between XSL FO and HTML.

XEPOnline HTML to PDF Test Document

Lists

Special Processing

Images

Images in web pages can be linked via absolute or relative references or they can be embedded directly in the web page using a data-uri scheme. All of these methods are supported. In addition, modern browsers make significant use of SVG as a format that can be directly included in the page. This is why we selected XSL FO for back-end processing of the information. XSL FO, and specifically XEPOnline, supports SVG not by converting the SVG to an image, but by processing SVG to the output, retaining all the vector-based information.

Image Example #1: Static SVG

This is a static SVG inserted directly in the HTML page.

Image Example #2: Linked PNG

This is a PNG inserted using the "src" attribute of the "img" tag. When formatting, XEPOnline needs to be able to access the image in question, so the path of the web page is sent using xml:base to give XEPOnline the ability to resolve the path to the image.
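
Resolving an image path against the page path works the same way as ordinary URL resolution. The sketch below uses the standard URL class to show the effect; the function name and the example.com addresses are placeholders, and this is not the formatter's own code.

```javascript
// Resolve an image "src" against the URL of the page it appears on,
// much as a base URL (such as xml:base) allows a processor to do.
function resolveImageUrl(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

// A relative reference is resolved against the page location;
// an absolute URL is returned unchanged.
console.log(resolveImageUrl("images/logo.png", "https://example.com/docs/page.html"));
console.log(resolveImageUrl("https://cdn.example.com/a.png", "https://example.com/docs/page.html"));
```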

Notice: Since XEPOnline is a REST-based remote service, it will not be able to access the images if you are testing on a localhost system. If you choose, you can implement custom javascript that base-64 encodes images and injects them into the <img> tag's "src" attribute, eliminating the need for XEPOnline to reach back to your server for the image assets.
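
The encoding step itself is small. Here is a minimal sketch that turns raw image bytes into a data: URI suitable for a "src" attribute. It uses Node's Buffer so it can run outside a browser; in a page you would more likely obtain the bytes through FileReader or a canvas. The function name is hypothetical.

```javascript
// Encode raw image bytes as a data: URI.
// "bytes" may be a Buffer, a Uint8Array or an array of byte values.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Two sample bytes only; a real image would of course be much longer.
console.log(toDataUri(Uint8Array.from([104, 105]), "image/png"));
```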

Processing the <img> also requires special handling because HTML5 allows non-closed tags. Since XSL FO is an XML-based processing solution, the XEPOnline javascript handles this by creating valid XML first before submitting.
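
To give a feel for what "creating valid XML first" involves, the sketch below self-closes HTML void elements such as <img> and <br>. It is a simplified regular-expression approach written for this example, not the library's implementation, and it assumes attribute values do not contain a ">" character.

```javascript
// HTML void elements that may legally appear without a closing tag.
const VOID_TAG = /<(area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)\b([^>]*?)\s*\/?>/gi;

// Rewrite every void element as a self-closed tag so the markup can be
// parsed as XML. Tags that are already self-closed are left equivalent.
function closeVoidElements(html) {
  return html.replace(VOID_TAG, "<$1$2/>");
}

console.log(closeVoidElements('<p><img src="a.png"><br></p>'));
```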

Take special care if using auto-scaling of images. If you desire an exact size of the image, it should be specified in the HTML or css. The width is carried through to the PDF. Of the following two images, one is referenced locally on the submitting website and the other is the same image referenced remotely.

Image Example #3: Base-64 Encoded Images Using data: URI

This image of a folder icon is directly in the web page as base-64 encoded.

Image Example #4: Dynamic, Javascript-based SVG Charting

The following chart is dynamically generated using the Anychart JavaScript library. This shows that even page-based dynamic information can be sent to the engine after processing in its full SVG format. There is no pre-processing of this SVG to image format at any time. The full vector information is carried through to the PDF.

Image Example #5: Dynamic, Javascript-based SVG Charting II

The following chart is dynamically generated using the d3 JavaScript library. You can even see that dynamic SVG is printed as it exists at any time. This sample is dynamic: click this paragraph and the pie chart data changes. The print file will represent what is on the screen at that time.

Hyperlinks

Hyperlinks can be carried into the output PDF.

External

This link should go to the XEPOnline web site.

Internal

Links can also point to internal destinations which are carried through to the PDF. Using many of the concepts above, you could create a list that is the document table of contents styled with css to add images and control appearance.

Notice: Obviously if you are generating internal links you should make sure that all the destinations in question are in the document. This sample has links outside of this section and therefore these will not work if you use only the section print function.
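
A quick check for missing destinations can be written in a few lines. This sketch takes the link targets and element ids as plain arrays (in a page you would collect them from the <div> being printed); the function name is hypothetical.

```javascript
// Return the ids that internal links point to but that are not present
// among the ids of the content being printed.
function findBrokenInternalLinks(hrefs, ids) {
  const known = new Set(ids);
  return hrefs
    .filter((href) => href.startsWith("#") && href.length > 1) // internal links only
    .map((href) => href.slice(1))
    .filter((id) => !known.has(id));
}

// "#images" has no matching id in the printed section, so it is reported.
console.log(findBrokenInternalLinks(["#lists", "#images", "https://example.com"], ["lists"]));
```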

XEPOnline HTML to PDF Test Document

Floating Content

Many current implementations in HTML make use of responsive designs. The challenge is to try and replicate this HTML on the printed page. The trigger here is the css float. We will start with an easy example: replicating a drop cap.

This would be a good example of a floating container. A drop capital is frequently used in book publishing. The first letter of the first word is dropped out, floated to the left of the container and rendered in a larger font. Of course, to see the effect one needs a paragraph that extends beyond the height of the drop capital letter. So we have put a few additional sentences in this paragraph.

Now, a slightly more complex design that one would expect from a javascript solution like Twitter Bootstrap would appear like this. We are directly writing the styles and not using Twitter Bootstrap for testing purposes.

Heading for Column 1

This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column. This is a bunch of text in the column.

Heading for Column 2

Heading for Column 3

Custom Tips and Tricks

Pass-Through XSL FO Styling

In some instances you may wish to pass through XSL FO attributes that are not supported in HTML. This paragraph is an example: while the text in the HTML has a brown color applied, we have applied a CMYK color for the PDF generation through the use of the "fostyle" attribute. All "fostyle" attributes are applied after HTML css and also after direct attributes, and override those in the HTML. This paragraph also has "text-align" justify in the HTML and font-stretch, font-size-adjust and hyphenate only in the PDF output. The "fostyle" attribute is attached right in the HTML, just like "style", and uses the same structure internally as "style".
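
The override order described here can be sketched as a simple merge in which "fostyle" declarations are applied last. The parsing below is deliberately naive (it splits on ";" and ":") and the function names are our own; it only illustrates the precedence, not the stylesheet's actual processing.

```javascript
// Parse a "name: value; name: value" declaration string into an object.
// Naive by design: values containing ";" are not handled.
function parseDeclarations(text) {
  const result = {};
  for (const part of (text || "").split(";")) {
    const index = part.indexOf(":");
    if (index === -1) continue;
    const name = part.slice(0, index).trim().toLowerCase();
    const value = part.slice(index + 1).trim();
    if (name && value) result[name] = value;
  }
  return result;
}

// Declarations from "fostyle" are applied after those from "style",
// so they win whenever both set the same property.
function mergeFoStyle(style, fostyle) {
  return { ...parseDeclarations(style), ...parseDeclarations(fostyle) };
}

console.log(mergeFoStyle("color: brown; text-align: justify", "color: black; hyphenate: true"));
```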

Footnotes

Since HTML supports arbitrary XML inside the markup, one of the easiest ways to do some simple footnotes is to simply use the regular XSL FO markup for a footnote. This paragraph is an example of this technique. (The footnote attached to this paragraph reads: This is a sample footnote in the body of the document. The css shows the footnote on mouseover and the PDF will render as a footnote. Because the css is tuned for mouseover display, the HTML css for the structure is discarded but you can use <span> to style the output to PDF.) The XEPOnline stylesheets have implemented the footnote structure, will recognize it and transfer it to the backend. You would implement it inside the HTML as:

<footnote>
    <sup>your character here for inside the HTML</sup>
    <footnote-body>
        <block>
            <sup>optional footnote char here for inside print footnote</sup>
            your inline HTML markup here for footnote formatting 
            in PDF. Only inline styles (b, i, span) are supported in the css
            provided for footnotes. Of course, you can implement
            your own.
        </block>
    </footnote-body>
</footnote>

Custom XML in Web Page

With today's browser technology, you can use css to style any content you wish. This includes any generic XML inserted into the HTML file. For example, consider that you just wish to place the sales data through some dynamic process into a table. If you examine the source HTML for this page here, you will see only XML. It is 100% styled with external css in the file named "xmlsamp.css".

Account Description Account Number Balance NName Policy ID Product Group Flag
Investment Savings Account	3003747305	$15,998.45	SAV	0
Guaranteed Investment (GIC)	3500462605	$43,097.63	CD	0
Guaranteed Investment (GIC)	3500628788	$15,413.42	CD	0
Guaranteed Investment (GIC)	3501134125	$10,728.14	CD	0
Guaranteed Investment (GIC)	3501244435	$10,744.40	CD	0
Guaranteed Investment (GIC)	3502110030	$12,385.05	CD	0
Guaranteed Investment (GIC)	3502130256	$15,481.32	CD	0
Guaranteed Investment (GIC)	3502284665	$10,362.26	CD	0
Guaranteed Investment (GIC)	3502284702	$15,543.39	CD	0
Guaranteed Investment (GIC)	3502798416	$20,000.00	CD	0
Guaranteed Investment (GIC)	3502798430	$20,000.00	CD	0
Guaranteed Investment (GIC)	3502801909	$25,000.00	CD	0
Guaranteed Investment (GIC)	3503119658	$16,000.00	CD	0
Guaranteed Investment (GIC)	3503183512	$29,273.21	CD	0
Guaranteed Investment (GIC)	3503190978	$55,756.26	CD	0
Guaranteed Investment (GIC)	3505007720	$35,000.00	CD	0
Guaranteed Investment (GIC)	3505014308	$7,060.41	CD	0
Guaranteed Investment (GIC)	3505025032	$3,500.00	CD	0
Guaranteed Investment (GIC)	3505025070	$1,500.00	CD	0
Guaranteed Investment (GIC)	3505043641	$4,069.04	CD	0
Guaranteed Investment (GIC)	3505363000	$8,000.00	CD	0

And this table is formatted according to the css styling, even carried through to the PDF. You do not have to change XML tags; you only indicate their style as "table" or "table-row" or "block" or "inline" in the css and XEPOnline will format the generic XML according to the css styling for print.
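
Conceptually, the css "display" value decides which XSL FO object a generic XML tag becomes. The lookup below is a sketch of that idea: the FO element names are standard, while the function name and the inclusion of "table-cell" (not mentioned above) are our assumptions.

```javascript
// Assumed mapping from css display values to XSL FO formatting objects.
const DISPLAY_TO_FO = {
  "table": "fo:table",
  "table-row": "fo:table-row",
  "table-cell": "fo:table-cell", // assumption: not listed in the text above
  "block": "fo:block",
  "inline": "fo:inline",
};

// Return the FO element for a display value, or null if it is not mapped.
function foElementForDisplay(display) {
  const key = String(display).trim().toLowerCase();
  return Object.prototype.hasOwnProperty.call(DISPLAY_TO_FO, key)
    ? DISPLAY_TO_FO[key]
    : null;
}

console.log(foElementForDisplay("table-row"));
```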

What's Coming?

We've done the implementation to support @media print. Would love to implement page-templates (first, last, left/right). Only Chrome supports @media print @page directives though. Guess that will need to wait a bit.

Next, a little clean-up on the conversion to XSL FO; there are some issues pointed out in this document. There is certainly more work to do here, and we probably did not think of everything.

After that, well we can do form fields ... a fillable HTML form to a PDF fillable form ... that would be cool.