It’s inevitable that different XML parsers make different interpretations of the standards.
This leads to some fuzzy behavior where white space is concerned.
As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.
The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization.
This project started from my frustration that I could not find any simple, portable XML parser to use inside all my projects (for example, inside the award-winning TIMi software suite commercialized by the Business-Insight company).
I was using XML as the standard for all my input/output configuration and data files.
Let's look at the well-known Xerces C++ library: the complete Xerces project is 53 MB!
The source code of my small tools was usually around 600 KB.
Memory management is totally transparent through the use of smart pointers (in other words, you will never have to call new, delete, malloc or free).
"Smart pointers" can be thought of as a primitive version of the garbage collector in Java.
Based on the expertise gained during the development of this XML parsing library, I created a new, improved XML parser: the Incredible XML Parser.
The first exception to the significant white space rule deals with attribute values.
The XML parser uses a set of rules to normalize attribute values.
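To make the normalization rule concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module, which is an assumption of this example and not the parser discussed in the text. It assumes no DTD is supplied, so every attribute is treated as CDATA: literal tabs and newlines in an attribute value should each become a single space, while the same characters written as character references should survive unchanged. The element name `item`, the attribute name `label`, and the helper `attribute_value` are hypothetical.

```python
import xml.etree.ElementTree as ET


def attribute_value(xml_text, name):
    """Parse a one-element document and return the named attribute value."""
    return ET.fromstring(xml_text).get(name)


# Literal tab and newline characters typed directly into the attribute value.
literal = attribute_value('<item label="one\ttwo\nthree"/>', "label")

# The same characters written as character references (&#9; and &#10;).
referenced = attribute_value('<item label="one&#9;two&#10;three"/>', "label")

# Element content is not subject to attribute-value normalization.
content = ET.fromstring("<item>one\ttwo</item>").text

print(repr(literal))
print(repr(referenced))
print(repr(content))
```

Other parsers, or the same parser with a DTD that declares the attribute as a non-CDATA type, may additionally trim and collapse runs of spaces, so it is worth checking this behavior against the parser actually in use.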
Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials.
For each of eight dialect regions, 50 male and female speakers having a range of ages and educational backgrounds each read ten carefully chosen sentences.
TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name.