XML Journal Feature: Transforming Large XML Documents, An Alternative to XSLT

The XSL standard became a popular way to transform XML data into XML, text, PDF, and other formats

With the evolution of XML, the XSL standard also became very popular for transforming XML data into XML, text, PDF, and other formats. However, XSLT transformation has some limitations. Today's XSLT processors rely on holding the input data in memory as a DOM tree while the transformation takes place. The tree structure in memory can be as much as ten times the original data size, so in practice the limit on data size for an XSLT conversion is just a few megabytes. As a result, XSLT can only handle documents of moderate size: the full input DOM must reside in memory for any XSL transformation to run.
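To make the constraint concrete, here is a minimal sketch of the classical DOM-based model using Python's standard library (an illustration of the memory behavior described above, not of any particular XSLT processor):

```python
import xml.dom.minidom as minidom

xml_text = """<OrgChart>
  <Office>
    <Department>
      <Person><First>Vernon</First><Shares>1500</Shares></Person>
      <Person><First>Ada</First><Shares>2500</Shares></Person>
    </Department>
  </Office>
</OrgChart>"""

# Classical DOM-based processing: the whole document is parsed into an
# in-memory tree before any work can start. The tree can be several times
# the size of the raw file, which is what caps XSLT input at a few megabytes.
dom = minidom.parseString(xml_text)
people = dom.getElementsByTagName("Person")
total = sum(int(p.getElementsByTagName("Shares")[0].firstChild.data)
            for p in people)
print(len(people), total)  # prints: 2 4000
```

For a two-person document this is harmless; for a multi-gigabyte feed, the `parseString` step alone can exhaust memory before any transformation begins.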

This major shortcoming of classical XSLT transformation may be solved with the schema-based transformation API discussed herein. This method uses a stream-based approach, loading only part of the document into memory at a time as the transformation proceeds, so at any given point enough resources are available for the transformation to complete. Whereas classical XSL transformation can convert an XML file into many formats (XML, text, EDI, EFT, PDF, and so on), this approach is restricted to XML-to-XML transformation. Here the XML schema plays a pivotal role: a schema can describe the data structure, hierarchy, and validation rules for any XML file. Transforming a source XML document into a destination format is therefore based on describing an XML schema for the destination file, with control attributes defined on the element definitions to aid the transformation, and on using full XPath APIs to carry out the actual transformation. This provides a scalable, stream-based way to transform XML to XML and can ideally handle input of any size, which is impossible with today's XSLT transformers. It is particularly well suited to the B2B domain, where a target schema is always present to validate the generated target XML.

The Problems Faced in Classical Transformation
Transforming a large XML DOM from one form to another with classical XML transformation entails many problems, because the full input DOM must be loaded into memory. For a huge XML input file, just loading the document may fail given limited system resources, and there is no published solution to date that tackles this issue. There have been efforts to serialize the DOM to permanent storage to free up memory while the transformation completes, but that gives rise to several I/O issues: even the simplest transformation of a large XML document takes a considerable amount of time under that scheme, so it is not feasible for enterprise use.

This approach considers the complexities of transforming large XML files and offers a scalable, real-time solution to the whole problem. Because it is schema-based and works on a few basic rules, defined in the subsequent sections, it avoids the performance bottlenecks described above.

The Approach in Detail
This is a schema-based approach to transforming a source XML document into a destination XML document. The schema-based approach is beneficial because the complete structure of the destination document can be populated from the schema definition, and the bare XML skeleton can then be filled with the required values with the help of attributes and annotations defined in the schema. The source XML is read in a stream-based fashion to load the nodes that match the schema definition; once a match occurs, every transformation that requires that XPath is performed to populate the corresponding skeleton nodes. When all of the target nodes that depend on the loaded input node have been populated, that node is removed from memory and the next chunk of data is loaded as a node for the next set of transformations, until the full input has been read.
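The load-match-process-discard cycle can be sketched with Python's streaming `iterparse` API (an illustration of the idea, not the article's actual API):

```python
import io
import xml.etree.ElementTree as ET

# A minimal sketch of the streaming idea: handle each Person node as soon as
# its end tag arrives, then clear it so memory holds only the current chunk
# of the input rather than the whole tree.
def stream_persons(source):
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Person":
            yield {child.tag: child.text for child in elem}
            elem.clear()  # free the node once its transformations are done

src = io.BytesIO(
    b"<OrgChart><Office><Department>"
    b"<Person><First>Vernon</First><Shares>1500</Shares></Person>"
    b"<Person><First>Ada</First><Shares>2500</Shares></Person>"
    b"</Department></Office></OrgChart>"
)
people = list(stream_persons(src))
print(people[0])  # prints: {'First': 'Vernon', 'Shares': '1500'}
```

Because each node is released as soon as its transformations complete, peak memory is bounded by the size of one matched node rather than the size of the whole document.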

One thing to note: the streaming of the input file (that is, which nodes are read from the input) depends entirely on the user and must be declared in the schema definition as control attributes. Since the schema provides the structure of the final target XML with special processing instructions embedded in the control attributes, the schema is queried a number of times to derive the correct structure of the target XML, which is then populated with data using the information provided as qualified attributes from the namespace xmlns:saxTran="http://oracle.schemaTransform/saxTran". Table 1 shows the first set of attributes needed to provide the basic stream-based transformation functionality.

There may be other attributes needed in due course of implementation, but for now these are the most crucial ones anticipated.
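Since Table 1 is not reproduced here, the control-attribute names below (saxTran:streamNode, saxTran:aggregate) are purely illustrative assumptions. They sketch how a destination schema might embed streaming instructions in the saxTran namespace:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:saxTran="http://oracle.schemaTransform/saxTran">
  <xs:element name="PersonsInfo">
    <xs:complexType>
      <xs:sequence>
        <!-- hypothetical: stream one node at a time from this source XPath -->
        <xs:element name="Persons"
                    saxTran:streamNode="/OrgChart/Office/Department/Person"/>
        <!-- hypothetical: aggregates computed over the streamed nodes -->
        <xs:element name="TotalPersons" saxTran:aggregate="count()"/>
        <xs:element name="TotalSharesWithPersons"
                    saxTran:aggregate="sum(Shares)"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

XML Schema permits attributes from a foreign namespace on element declarations, which is what lets the transformer piggyback its processing instructions on an otherwise ordinary target schema.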

Let's take a simple example to illustrate the approach in detail. In this example we select a simple OrgChart.xml that shows the organization hierarchy of a company. The basic structure is shown below:


<OrgChart>
  <Office>
    <Department>
      <Person>
        <First>Vernon</First>
        <Last>Callaby</Last>
        <Title>Office Manager</Title>
        <PhoneExt>582</PhoneExt>
        <EMail>v.callaby@nanonull.com</EMail>
        <Shares>1500</Shares>
      </Person>
      ............
      <Person>
      </Person>
    </Department>
    <Department>
      ................
    </Department>
  </Office>
  ........
  <Office>
    ........
  </Office>
</OrgChart>
After the transformation, the resulting document should list all of the persons in all departments, with some calculations such as count, average, and summation performed on fields of the Person element. To achieve this with classical XSL, a stylesheet is written, which can be found in the personinfo.xsl file. The basic structure of the document after the transformation is shown below:

<PersonsInfo>
  <Persons>
    <Person>
      <First>Vernon</First>
      <Last>Callaby</Last>
      <Title>Office Manager</Title>
      <PhoneExt>582</PhoneExt>
      <EMail>v.callaby@nanonull.com</EMail>
      <Shares>1500</Shares>
    </Person>
    <Person>
      ................
    </Person>
  </Persons>
  <TotalPersons>20</TotalPersons>
  <AvgSharePerPerson>200.0</AvgSharePerPerson>
  <TotalSharesWithPersons>4000</TotalSharesWithPersons>
</PersonsInfo>
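The whole example can be sketched as a single streaming pass: each Person node is copied into the target skeleton as it arrives, and the three aggregate fields are accumulated on the fly, so the full source DOM is never built. This is an illustrative sketch using Python's stdlib `iterparse`, not the article's actual API:

```python
import io
import xml.etree.ElementTree as ET

# One streaming pass copies each Person into the target skeleton and
# accumulates count/sum/average; only the target document (never the full
# source DOM) is assembled in memory.
def transform(source):
    out = ET.Element("PersonsInfo")
    persons = ET.SubElement(out, "Persons")
    count = total_shares = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Person":
            persons.append(elem)  # move the streamed node into the target
            count += 1
            total_shares += int(elem.findtext("Shares", "0"))
    ET.SubElement(out, "TotalPersons").text = str(count)
    ET.SubElement(out, "AvgSharePerPerson").text = str(total_shares / count)
    ET.SubElement(out, "TotalSharesWithPersons").text = str(total_shares)
    return out

result = transform(io.BytesIO(
    b"<OrgChart><Office><Department>"
    b"<Person><First>Vernon</First><Shares>1500</Shares></Person>"
    b"<Person><First>Ada</First><Shares>2500</Shares></Person>"
    b"</Department></Office></OrgChart>"
))
print(result.findtext("TotalSharesWithPersons"))  # prints: 4000
```

Here the target does retain every Person (as the PersonsInfo example requires); a transformation whose output omits the detail records would keep memory bounded by a single node.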

More Stories By Indroniel Deb Roy

Indroniel Deb Roy works as a UI Architect for BlueCoat Systems. He has more than 10 years of development experience in the fields of J2EE and Web application development. In the past he worked on developing web applications for Oracle, Novell, Packeteer, Knova, and others. He has a passion for innovation, works with various Web 2.0 and J2EE technologies, and recently started on smartphone and iPhone development.
