XML Journal Feature: Transforming Large XML Documents, An Alternative to XSLT

The XSL standard also became very popular for transforming XML data into XML, text, PDF, and other formats

As we know, in classical XSLT the full input DOM is loaded into memory before the transformation runs, so there is a fixed limit on the number of "Person" elements the transformation can handle without running out of memory. Success depends on the available system resources: a very large document can exhaust them, and it is not feasible to add system resources every time a transformation needs to complete. With classical XSLT, therefore, every system has an upper limit on the size of document it can transform. To remove this major shortcoming, loading the input source incrementally seems a viable solution, but that approach cannot be applied to a classical XSL transformation because the processor has no clue about the structure of the output file.
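The memory-bound baseline described above can be sketched with the standard JAXP API. This is a minimal illustration, not the article's engine; the element names follow the /OrgChart/Office/Department/Person structure used later in the article, and the inline stylesheet is my own stand-in. Even when fed a StreamSource, the processor materializes a full source tree internally, so memory grows with document size:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Classical XSLT: the whole input tree is built before any output is produced.
public class ClassicXslt {
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/OrgChart'>"
      + "<Persons><xsl:for-each select='Office/Department/Person'>"
      + "<Name><xsl:value-of select='First'/></Name>"
      + "</xsl:for-each></Persons>"
      + "</xsl:template></xsl:stylesheet>";

    public static String transform(String inputXml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        // The processor builds an in-memory source tree here; for a very
        // large document this is the step that exhausts memory.
        t.transform(new StreamSource(new StringReader(inputXml)),
                    new StreamResult(out));
        return out.toString();
    }
}
```

For small inputs this works fine; the problem the article addresses is that the same call fails once the source tree no longer fits in memory.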

For XML-to-XML transformation there is an advantage: the output format is known whenever a schema describes the output file. In the enterprise applications where XML-to-XML transformations are carried out, such a schema is often present so that the transformed file can be validated against it after the transformation. In such cases, where a schema defines the output XML data, this schema-based transformation may be used to transform a document by loading the input incrementally, driven by control attributes in the schema definition.
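The validation step mentioned above is standard JAXP as well. A small sketch, assuming a trivial inline stand-in schema (the article's actual schema lives in "personsinfo.xsd"): the same schema that defines the output can reject a malformed result after transformation.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Validate a transformed document against the schema that defines the output.
public class OutputValidation {
    // Stand-in for personsinfo.xsd: a Persons element holding Name elements.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Persons'><xs:complexType><xs:sequence>"
      + "<xs:element name='Name' type='xs:string' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    public static boolean isValid(String xml) {
        try {
            Validator v = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
            v.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false; // SAXException signals a schema violation
        }
    }
}
```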

Let's consider the example above to see how a schema-based approach can load the input DOM incrementally and discard each processed chunk after transformation, ideally completing an arbitrarily large document while consuming resources similar to those a classical XSL transformation needs for a small one.

The schema for the example's output can be found in the file "personsinfo.xsd." Some additional attributes are added to some of the elements in the schema declaration to enable the transformation.

In Listing 1, the basic schema definition is quite simple. To add the transformation instructions, attributes from the namespace xmlns:saxTran="http://oracle.schemaTransform/saxTran" are added so that instructions can be implemented to carry out the actual transformation. All of the control attributes in the Listing 1 schema definition are shown in italics. For now only the few attributes shown in Table 1 are used, but as transformations grow more complex this set of attributes will grow.

Basic SAX-based Transformation Implementation
Now, let's go under the hood of a basic implementation and see how incremental loading of the input DOM is possible using the schema-driven transformation approach.

Figure 1 shows the approach. The schema that defines the target XML has special attributes to match and map elements from input to target. First, the default output document structure is constructed from the schema, without any values, down to the element on which saxTran:streamNode is defined. The saxTran:match attribute of that element gives the XPath of the input node on which streaming is done. On each occurrence of that input node, a partial DOM is constructed in memory. All XPath references in the schema definition that the partially loaded DOM satisfies are evaluated, and the values are filled into the skeleton DOM already created from the schema definition. There may also be saxTran:function attributes that need the same XPath for function evaluation; in these cases the value at that XPath is computed and appended to the function-call expression as an argument. Once every XPath reference in the schema definition is satisfied, the node on which streaming is applied is unloaded from memory and the next one is loaded for processing. After all matching XPath references from the input source have been dealt with, the pending functions are evaluated and the final values are populated into the nodes containing those functions.
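The core of the steps above can be sketched with a plain SAX handler. This is a simplified illustration, not the article's implementation: the saxTran:match handling is hard-coded to the Person element, and the "skeleton DOM" is reduced to a list of extracted values. What it does show is the essential trick: only one streamed node's data is buffered at a time, and it is discarded as soon as its matched XPaths (here, just ./First) have been copied to the output side.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Streaming sketch: buffer one Person subtree's data, copy matched values
// into the output skeleton, then discard the buffer before the next Person.
public class StreamNodeHandler extends DefaultHandler {
    private final List<String> firstNames = new ArrayList<>(); // output-side values
    private StringBuilder text;   // buffer for the current text node
    private boolean inPerson;

    @Override public void startElement(String uri, String local,
                                       String qName, Attributes atts) {
        if ("Person".equals(qName)) inPerson = true;        // saxTran:match fires
        if (inPerson && "First".equals(qName)) text = new StringBuilder();
    }
    @Override public void characters(char[] ch, int start, int len) {
        if (text != null) text.append(ch, start, len);
    }
    @Override public void endElement(String uri, String local, String qName) {
        if ("First".equals(qName) && text != null) {
            firstNames.add(text.toString());                // XPath ./First satisfied
            text = null;
        }
        if ("Person".equals(qName)) inPerson = false;       // partial DOM discarded
    }

    public static List<String> extractFirstNames(String xml) throws Exception {
        StreamNodeHandler h = new StreamNodeHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        return h.firstNames;
    }
}
```

Because the handler never holds more than one Person's text at a time, its memory use stays flat no matter how many Person records the input contains.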

In the above example, the node on which streaming is applied is /OrgChart/Office/Department/Person. So at any time during the transformation the in-memory node will look like the following:

<Title>Office Manager</Title>
<EMail>[email protected]</EMail>
For each Person node, the values of all matching XPaths, such as "./First" and "./Last", are populated into the target skeleton structure. For aggregate functions such as count, sum, and avg, the function expression is updated continuously with the actual XPath values.

So, midway through the transformation process, after three Person nodes have been processed, the target node for the three aggregate functions will look like:


So, with each repeating Person node, additional arguments are added to the functions and populated with actual XPath values. All of the functions are evaluated once the input source has been fully read, populating the final values of these elements.
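The deferred evaluation described above can be sketched as a small accumulator (an illustrative stand-in for the saxTran:function machinery): each streamed Person contributes one value while the input is being read, and the aggregate itself is only finalized at the end of the document.

```java
// Incremental accumulation for aggregates such as count, sum, and avg:
// one value is added per streamed Person node; the final aggregate is
// read out only after the input source has been fully consumed.
public class AggregateAccumulator {
    private long count;
    private double sum;

    public void addValue(double v) {   // called once per streamed node
        count++;
        sum += v;
    }
    public long count() { return count; }
    public double sum()  { return sum; }
    public double avg()  {             // evaluated at end of input
        return count == 0 ? 0.0 : sum / count;
    }
}
```

Storing only the running count and sum keeps the aggregate's memory cost constant, which is what lets the stream nodes themselves be discarded.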

With this approach there is no limit on the size of the input file that can be processed. Because memory is reclaimed after each stream node is processed, the transformer can handle arbitrarily long XML documents without failing. Control attributes might also be specified for the output DOM so that it can be serialized after each required chunk of data is transformed. This provides the flexibility to use any large XML document in a transformation process, which was impossible with previous XSLT processors.

Huge database dumps, or XML produced from serialized data records, can now be transformed effectively with this approach. It can augment the classical XSLT engine to provide a fail-proof transformation engine. When transforming large XML files with XSLT, the bottleneck lies in loading the input XML as a DOM tree: in most cases the data for a particular element (a data record) repeats thousands of times, and loading all of it at once chokes the transformer's memory. This approach lets the classical XSLT engine stream the input source so the transformation can complete without failing. It can also store transient variables within itself to pass information to the next set of XSLT transformations in the pipeline. This approach will open a new dimension in XML transformation and provide a solution for the otherwise impossible task of transforming large XML files. Here I have discussed only the approach, so be sure to check back for the implementation in my next article.


More Stories By Indroniel Deb Roy

Indroniel Deb Roy works as a UI Architect for BlueCoat Systems. He has more than 10 years of development experience in J2EE and Web application development. In the past he developed web applications for Oracle, Novell, Packeteer, Knova, and others. He has a passion for innovation, works with various Web 2.0 and J2EE technologies, and recently started smartphone and iPhone development.
