|
The Extensible Markup Language (XML) is a W3C-recommended general-purpose markup language for creating special-purpose markup languages. It is a simplified subset of SGML, capable of describing many different kinds of data. Its primary purpose is to facilitate the sharing of data across different systems, particularly systems connected via the Internet. Languages based on XML (for example, RDF, RSS, MathML, XHTML, SVG, and cXML) are defined in a formal way, allowing programs to modify and validate documents in these languages without prior knowledge of their form.
History
By the mid-1990s some practitioners of SGML had gained experience with the then-new World Wide Web, and believed that SGML offered solutions to some of the problems the Web was likely to face as it grew. Jon Bosak argued that the W3C should sponsor an "SGML on the Web" activity. With some support he was authorized to launch that activity in mid-1996, albeit with little involvement by or even support from the W3C leadership. Bosak was well-connected in the small community of people who had experience both in SGML and the Web. He received support in his efforts from Microsoft.
XML was designed by an eleven-member Working Group supported by an (approximately) 150-member Interest Group. Technical debate took place on the Interest Group mailing list, and issues were resolved by consensus or, when that failed, by majority vote of the Working Group. James Clark served as Technical Lead of the Working Group, notably contributing the empty-element "<empty/>" syntax and the name "XML". Other names that had been put forward for consideration included "MAGMA" (Minimal Architecture for Generalized Markup Applications), "SLIM" (Structured Language for Internet Markup) and "MGML" (Minimal Generalized Markup Language). The co-editors of the specification were originally Tim Bray and Michael Sperberg-McQueen. Midway through the project Bray accepted a consulting engagement with Netscape, provoking vociferous protests from Microsoft. Bray was temporarily asked to resign the editorship. This led to an intense dispute in the Working Group, eventually resolved by the appointment of Microsoft's Jean Paoli as a third co-editor.
The XML Working Group never met face-to-face; the design was accomplished using a combination of email and weekly teleconferences. The major design decisions were reached in twenty weeks of intense work between July and November of 1996. Further design work continued through 1997, and XML 1.0 became a W3C Recommendation on February 10, 1998.
XML can be considered a variant of LISP S-expressions, which form tree structures where each node may have its own property list.
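The comparison with S-expressions can be made concrete with a small sketch. The function name `to_sexpr`, the sample document, and the `:name "value"` rendering of attributes are all choices made for this illustration, not part of any standard; text that follows a child element (mixed content) is ignored to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def to_sexpr(element):
    """Render an element and its descendants as a LISP-style S-expression."""
    parts = [element.tag]
    # The attributes play the role of the node's property list.
    if element.attrib:
        props = " ".join(f':{k} "{v}"' for k, v in sorted(element.attrib.items()))
        parts.append(f"({props})")
    # Leading text content, if any, becomes a string atom.
    text = (element.text or "").strip()
    if text:
        parts.append(f'"{text}"')
    # Child elements become nested lists.
    parts.extend(to_sexpr(child) for child in element)
    return "(" + " ".join(parts) + ")"


doc = ET.fromstring(
    '<list kind="todo"><item>buy milk</item><item done="yes">call Bob</item></list>'
)
print(to_sexpr(doc))
```

Each element maps to one parenthesized list, so the nesting of the output mirrors the nesting of the tags.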
XML is suitable for the management, display, and organization of data. XML is a technology concerned with the description and structuring of data.
Binary file formats are often "proprietary". One may not be able to open binary files created by one application in another application, or in the same application running on another platform. Text files are likewise streams of bits. However, in a text file these bits are grouped together in standardized ways, so that they always form numbers. These numbers are then further mapped to characters.
The XML format combines the universality of text files with the efficiency and rich information storage capabilities of binary files. XML describes the syntax you use to create your own languages. XML is about making it easier to write software that accesses the information, by giving structure to data. With XML there is a standardized way to get at the information no matter how you structure the data. Thus, although each new structure of data brings a new method of extracting the information, XML can ease this work by standardizing how the structure is expressed.
Strengths and weaknesses
Some features of XML that make it well-suited for data transfer are:
its simultaneously human- and machine-readable format;
its support for Unicode, allowing almost any information in any human language to be communicated;
the ability to represent the most general computer science data structures: records, lists and trees;
the self-documenting format that describes structure and field names as well as specific values;
the strict syntax and parsing requirements that allow the necessary parsing algorithms to remain simple, efficient, and consistent.
XML is also heavily used as a format for document storage and processing, both online and offline, and offers several benefits:
its robust, logically-verifiable format is based on international standards;
the hierarchical structure is suitable for most (but not all) types of documents;
it manifests as plain text files, unencumbered by licenses or restrictions;
it is platform-independent, and thus relatively immune to changes in technology;
it and its predecessor, SGML, have been in use since 1986, so there is extensive experience and software available.
For certain applications, XML also has the following weaknesses:
Its syntax is fairly verbose and partially redundant. This can hurt human readability and application efficiency, and yields higher storage costs. It can also make XML difficult to apply in cases where bandwidth is limited, though compression can reduce the problem in some cases. This is particularly true for multimedia applications running on cell phones and PDAs which want to use XML to describe images and video.
Parsers should be designed to recurse through arbitrarily nested data structures and must perform additional checks to detect improperly formatted or differently ordered syntax or data (this is because the markup is descriptive and partially redundant, as noted above). This causes a significant overhead for most basic uses of XML, particularly where resources may be scarce, for example in embedded systems. Furthermore, additional security considerations arise when XML input is fed from untrustworthy sources, and resource exhaustion or stack overflows are possible.
Some consider the syntax to contain a number of obscure, unnecessary features born of its legacy of SGML compatibility. However, an effort to settle on a subset called "Minimal XML" led to the discovery that there was no consensus on which features were in fact obscure or unnecessary.
The basic parsing requirements do not support a very wide array of data types, so interpretation sometimes involves additional work in order to process the desired data from a document. For example, there is no provision in XML for mandating that "3.14159" is a floating-point number rather than a seven-character string. XML schema languages add this functionality.
Modelling overlapping (non-hierarchical) data structures requires extra effort.
Mapping XML to the relational or object-oriented paradigms is often cumbersome.
Some have argued that XML can be used for data storage only if the file is of low volume, but this is only true given particular assumptions about architecture, data, implementation, and other issues.
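The point about data types in the list above can be illustrated with a short sketch using Python's standard `xml.etree.ElementTree` module. The element names `circle` and `ratio` are invented for this example. The parser returns only characters; turning them into a number is left to the application (or to a schema-aware tool).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<circle><ratio>3.14159</ratio></circle>")

# The parser hands back a seven-character string, not a number.
raw = root.findtext("ratio")
print(type(raw).__name__, len(raw))

# Interpreting the characters as a floating-point number is the
# application's job; nothing in the markup itself requires it.
ratio = float(raw)
print(type(ratio).__name__)
```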
Quick syntax tour
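Here is an example of a simple recipe expressed using XML:
<?xml version="1.0" encoding="UTF-8"?>
<Recipe>
  <title>Basic bread</title>
  <ingredient amount="3" units="cups">Flour</ingredient>
  <ingredient>Yeast</ingredient>
  <ingredient>Warm Water</ingredient>
  <ingredient>Salt</ingredient>
  <Instructions>
    <step>Mix all ingredients together, and knead thoroughly.</step>
    <step>Cover with a cloth, and leave for one hour in a warm room.</step>
    <step>Knead again, place in a tin, and then bake in the oven.</step>
  </Instructions>
</Recipe>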
The first line is the XML declaration: it is an optional line stating what version of XML is in use (normally version 1.0), and may also contain information about character encoding and external dependencies.
The remainder of this document consists of nested elements, some of which have attributes and content. An element typically consists of two tags, a start tag and an end tag, possibly surrounding text and other elements. The start tag consists of a name surrounded by angle brackets, like "<step>"; the end tag consists of the same name surrounded by angle brackets, but with a forward slash preceding the name, like "</step>". The element's content is everything that appears between the start tag and the end tag, including text and other (child) elements. The following is a complete XML element, with start tag, text content, and end tag:
<step>Knead again, place in a tin, and then bake in the oven.</step>
In addition to content, an element can contain attributes: name-value pairs included in the start tag after the element name. Attribute values must always be quoted, using single or double quotes, and each attribute name should appear only once in any element.
<ingredient amount="3" units="cups">Flour</ingredient>
In this example, the ingredient element has two attributes: amount, having value "3", and units, having value "cups". In both cases, at the markup level, the names and values of the attributes, just like the names and content of the elements, are just textual data: the "3" and "cups" are not a quantity and unit of measure, respectively, but rather are just character sequences that the document author may be using to represent those things.
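A minimal sketch of how a program sees such an element, using Python's standard `xml.etree.ElementTree` module. The attribute names `amount` and `units` are assumed here for illustration. The attribute values arrive as plain strings, and a repeated attribute name makes the parser reject the input.

```python
import xml.etree.ElementTree as ET

ingredient = ET.fromstring('<ingredient amount="3" units="cups">Flour</ingredient>')

print(ingredient.tag)     # the element name
print(ingredient.attrib)  # attribute names mapped to string values
print(ingredient.text)    # the text content

# "3" is only a character sequence until the application converts it.
amount = int(ingredient.get("amount"))

# An attribute name may appear only once per element; a duplicate is a
# well-formedness error, which ElementTree reports as ParseError.
try:
    ET.fromstring('<ingredient amount="3" amount="4">Flour</ingredient>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```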
In addition to text, elements may contain other elements:
<Instructions>
  <step>Mix all ingredients together, and knead thoroughly.</step>
  <step>Cover with a cloth, and leave for one hour in a warm room.</step>
  <step>Knead again, place in a tin, and then bake in the oven.</step>
</Instructions>
In this case, the Instructions element contains three step elements. XML requires that elements be properly nested: elements may never overlap. For example, this is not well-formed XML, because the em and strong elements overlap:
<!-- WRONG! NOT WELL-FORMED XML! -->
<p>Normal <em>emphasized <strong>strong emphasized</em> strong</strong></p>
Every XML document must have exactly one top-level root element (alternatively called the document element), so the following would also be a malformed XML document:
<?xml version="1.0" encoding="UTF-8"?>
<!-- WRONG! NOT WELL-FORMED XML! -->
<thing>Thing one</thing>
<thing>Thing two</thing>
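Both kinds of error can be observed with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The em and strong elements overlap instead of nesting.
overlapping = "<p>Normal <em>emphasized <strong>strong emphasized</em> strong</strong></p>"
# Two top-level elements, so there is no single root.
two_roots = "<thing>Thing one</thing><thing>Thing two</thing>"
# Properly nested variant of the first example.
nested = "<p>Normal <em>emphasized <strong>strong emphasized</strong></em></p>"

for sample in (overlapping, two_roots, nested):
    print(is_well_formed(sample))
```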
XML provides special syntax for representing an element with empty content. Instead of writing a start tag followed immediately by an end tag, a document may contain an empty-element tag, in which a slash follows the element name. The following two examples are functionally equivalent:
<foo></foo>
<foo/>
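A quick way to see the equivalence is to parse both forms and compare the results. This is a small sketch with Python's standard `xml.etree.ElementTree` module; after parsing, the two spellings are indistinguishable.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<foo></foo>")
short_form = ET.fromstring("<foo/>")

# Same name, no text content, no children in either case.
for element in (long_form, short_form):
    print(element.tag, element.text, len(element))

# When writing the tree back out, the serializer picks one spelling.
print(ET.tostring(long_form))
```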
XML provides two methods for escaping (or simply representing) special characters: entity references and numeric character references. An entity in XML is a named body of data, usually text, such as an unusual character. An entity reference is a placeholder for that entity, and consists of the entity's name preceded by an ampersand ("&") and followed by a semicolon (";"). XML has five predeclared entities, such as "lt" (referenced as "&lt;") for the left angle bracket (<) and "amp" (referenced as "&amp;") for the ampersand (&) itself, and it is possible to declare additional ones if desired. Apart from representing single characters, reproducing chunks of boilerplate text is another common use for entities. Here is an example using a predeclared XML entity to escape the ampersand in the name "AT&T":
<company-name>AT&amp;T</company-name>
The full list of predeclared entities is:
&amp; (&)
&lt; (<)
&gt; (>)
&apos; (')
&quot; (")
If more entities need to be declared, this is done in the document's DTD, which is not demonstrated in this tutorial, for brevity.
Numeric character references look like entity references, but instead of a name, they contain the "#" character followed by a number between the ampersand and the semicolon. The number (in decimal, or in hexadecimal when prefixed with "x") represents a Unicode code point, and is typically used to represent characters that are not easily encodable, such as an Arabic character in a document produced on a European computer. The ampersand in the "AT&T" example could also be escaped like this (decimal 38 is the Unicode value for "&"):
<company-name>AT&#38;T</company-name>
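The escaping mechanisms described in this part of the tour can be tried out with Python's standard library. This is only a sketch: the `sig` entity and its replacement text are invented for the example, and `xml.sax.saxutils.escape` is one of several ways to produce escaped text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A named entity reference, a decimal reference and a hexadecimal
# reference all decode to the same character.
named = ET.fromstring("<company-name>AT&amp;T</company-name>")
decimal = ET.fromstring("<company-name>AT&#38;T</company-name>")
hexadecimal = ET.fromstring("<company-name>AT&#x26;T</company-name>")
print(named.text, decimal.text, hexadecimal.text)

# A bare ampersand is not well-formed and is rejected by the parser.
try:
    ET.fromstring("<company-name>AT&T</company-name>")
    bare_accepted = True
except ET.ParseError:
    bare_accepted = False

# When generating XML, escape text before placing it inside an element.
escaped = escape("AT&T <Labs>")

# An additional entity (hypothetical name "sig") declared in the
# document's internal DTD subset and used for boilerplate text.
doc = '<!DOCTYPE note [<!ENTITY sig "Regards, the editors">]><note>&sig;</note>'
note = ET.fromstring(doc)
print(note.text)
```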
There are many more rules necessary to be sure of writing well-formed XML documents, such as the exact characters allowed in an XML name, but this quick tour provides the basics necessary to read and understand many XML documents.
Correctness in an XML document
For an XML document to be correct, it must be:
Well-formed. A well-formed document conforms to all of XML's syntax rules. For example, if a non-empty element has an opening tag with no closing tag, it is not well-formed. A document that is not well-formed is not considered to be XML; a parser is required to refuse to process it.
Valid. A valid document has data that conforms to a particular set of user-defined content rules that describe correct data values and locations. For example, if an element in a document is required to contain text that can be interpreted as being an integer numeric value, and it instead has the text "hello", is empty, or has other elements in its content, then the document is not valid.
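The two levels of correctness can be sketched as two separate checks. In the Python sketch below, the parser decides well-formedness, while validity is decided by a hand-written rule that stands in for a real schema: a hypothetical `count` element must contain integer text and no child elements. The function name `check` and the rule itself are assumptions made for this illustration; real validation would normally be delegated to a schema-aware tool.

```python
import xml.etree.ElementTree as ET


def check(text):
    """Classify a document as 'malformed', 'invalid' or 'valid'."""
    # Level 1: well-formedness is decided by the parser alone.
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "malformed"
    # Level 2: validity against a user-defined content rule
    # (a stand-in for a schema): <count> holds an integer, nothing else.
    if root.tag != "count" or len(root) > 0:
        return "invalid"
    try:
        int(root.text or "")
    except ValueError:
        return "invalid"
    return "valid"


for sample in ("<count>42</count>", "<count>hello</count>",
               "<count></count>", "<count><n/></count>", "<count>42"):
    print(sample, "->", check(sample))
```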
Well-formed documents
An XML document is text, which is a sequence of characters. The specification requires support for the Unicode encodings UTF-8 and UTF-16 (UTF-32 is not mandatory). The use of other, non-Unicode based encodings, such as ISO-8859, is allowed and is indeed widely used and supported.
A well-formed document must conform to the following rules, among others:
One and only one root element exists for the document. However, the XML declaration, processing instructions, and comments can precede the root element.
Non-empty elements are delimited by both a start-tag and an end-tag.
Empty elements may be marked with an empty-element (self-closing) tag, such as <IAmEmpty/>. This is equal to <IAmEmpty></IAmEmpty>.
All attribute values are quoted, with either single (') or double (") quotes. Single quotes close a single quote and double quotes close a double quote.
Tags may be nested but may not overlap. Each non-root element must be completely contained in another element.
The document complies with its character set definition. The charset is usually defined in the XML declaration but it can be provided by the transport protocol, such as HTTP. If no charset is defined, usage of a Unicode encoding is assumed, defined by the Unicode Byte Order Mark. If the mark does not exist, UTF-8 is the default.
Element names are case-sensitive. For example, the following is a well-formed matching pair
<Step> ... </Step>
whereas this is not
<Step> ... </step>
The careful choice of names for XML elements will convey the meaning of the data in the markup. This increases human readability while retaining the rigor needed for software parsing.
Choosing meaningful names implies the semantics of elements and attributes to a human reader without reference to external documentation. However, this can lead to verbosity, which complicates authoring and increases file size.
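Two of the well-formedness rules listed in this section, case-sensitive names and agreement with the declared character encoding, can be observed directly with a parser. The following is a sketch using Python's standard `xml.etree.ElementTree` module; the element names and sample text are invented for the example.

```python
import xml.etree.ElementTree as ET


def parses(data):
    """Return True if the parser accepts the input (str or bytes)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Names are case-sensitive: </step> does not close <Step>.
print(parses("<Step>Knead</Step>"), parses("<Step>Knead</step>"))

# Bytes in ISO-8859-1 are fine when the declaration says so.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><word>café</word>'.encode("iso-8859-1")
# Without a declaration or byte order mark, UTF-8 is assumed.
default_utf8 = "<word>café</word>".encode("utf-8")
# The same ISO-8859-1 bytes without a declaration do not decode as UTF-8.
undeclared = "<word>café</word>".encode("iso-8859-1")

print(ET.fromstring(declared).text, ET.fromstring(default_utf8).text)
print(parses(undeclared))
```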
Valid documents
An XML document that complies with a particular schema, in addition to being well-formed, is said to be valid.
An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic constraints imposed by XML itself. A number of standard and proprietary XML schema languages have emerged for the purpose of formally expressing such schemas, and some of these languages are XML-based, themselves.
Before the advent of generalised data description languages such as SGML and XML, software designers had to define special file formats or small languages to share data between programs. This required writing detailed specifications and special-purpose parsers and writers.
XML's regular structure and strict parsing rules allow software designers to leave parsing to standard tools, and since XML provides a general, data model-oriented framework for the development of application-specific languages, software designers need only concentrate on the development of rules for their data, at relatively high levels of abstraction.
Well-tested tools exist to validate an XML document "against" a schema: the tool automatically verifies whether the document conforms to constraints expressed in the schema. Some of these validation tools are included in XML parsers, and some are packaged separately.
Other usages of schemas exist: XML editors, for instance, can use schemas to support the editing process.
DTD
The oldest schema format for XML is the Document Type Definition (DTD), inherited from SGML. While DTD support is ubiquitous due to its inclusion in the XML 1.0 standard, it is seen as limited for the following reasons:
It has no support for newer features of XML, most importantly namespaces.
It lacks expressivity. Certain formal aspects of an XML document cannot be captured in a DTD.
It uses a custom non-XML syntax, inherited from SGML, to describe the schema.
XML Schema
A newer XML schema language, described by the W3C as the successor of DTDs, is XML Schema, often referred to informally by the initialism XSD (XML Schema Definition), the name given to XML Schema instances. XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system, allow for more detailed constraints on an XML document's logical structure, and must be processed in a more robust validation framework. Additionally, XSDs use an XML-based format, which makes it possible to use ordinary XML tools to help process them, although WXS (W3C XML Schema) implementations require much more than just the ability to read XML.
Criticisms of WXS include the following:
The specification is very large, which makes it difficult to understand and implement.
The XML-based syntax leads to verbosity in schema description, which makes XSDs harder to read and write.
RELAX NG
Another popular schema language for XML is RELAX NG. Initially specified by OASIS, RELAX NG is now also an ISO international standard (as part of DSDL). It has two formats: an XML based syntax and a non-XML compact syntax. The compact syntax aims to increase readability and writability, but since there is a well-defined way to translate compact syntax to the XML syntax and back again by means of James Clark's Trang conversion tool, the advantage of using standard XML tools is not lost. Compared to XML Schema, RELAX NG has a simpler definition and validation framework, making it easier to use and implement. It also has the ability to use any datatype framework on a plug-in basis; for example, a RELAX NG schema author can require values in an XML document to conform to definitions in XML Schema Datatypes.
Other schema languages
Some schema languages not only describe the structure of a particular XML format but also offer limited facilities to influence processing of individual XML files that conform to this format. DTDs and XSDs both have this ability; they can for instance provide attribute defaults. RELAX NG intentionally does not provide these facilities.
International and worldwide use
XML fully supports Unicode character encodings in element names, attributes and data. Therefore the following is a perfectly well-formed XML document, even though it includes both Chinese and Russian characters:
<?xml version="1.0" encoding="UTF-8"?>
<俄語>данные</俄語>
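As a rough check of this claim, a document with a Chinese element name and Russian text content can be fed to a standard parser. The sample document below is made up for this sketch and is passed as UTF-8 bytes so that it matches its own encoding declaration.

```python
import xml.etree.ElementTree as ET

# Chinese element name, Russian character data (illustrative content).
source = '<?xml version="1.0" encoding="UTF-8"?><俄語>данные</俄語>'

root = ET.fromstring(source.encode("utf-8"))
print(root.tag, root.text)
```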
Displaying XML on the web
Extensible Stylesheet Language (XSL) is a supporting technology that describes how to format or transform the data in an XML document. The document is changed to a format suitable for browser display. The process is similar to applying a CSS to an HTML document for rendering.
Without using CSS or XSL, a generic XML document is rendered as raw XML text by most web browsers. Browsers like Internet Explorer, Mozilla and Mozilla Firefox display it with 'handles' that allow parts of the structure to be expanded or collapsed with mouse-clicks.
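In order to style the rendering in a browser with CSS, the XML document must include a special reference to the stylesheet, in the form of an xml-stylesheet processing instruction (the file name here is only a placeholder):
<?xml-stylesheet type="text/css" href="myStyleSheet.css"?>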
See the CSS article for an example of this in action.
This is different from specifying a stylesheet in HTML, which uses the <link> element.
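To specify a client-side XSL Transformation (XSLT), the following processing instruction is required in the XML (the file name here is only a placeholder):
<?xml-stylesheet type="text/xsl" href="myTransform.xslt"?>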
Client-side XSLT is not supported in Opera.
The alternative is conversion of XML into HTML, PDF and other formats on the server. Many such processors exist, and the end-user then need not be aware of what has been going on 'behind the scenes'.
See the XSLT article for an example of server-side XSLT in action.
XML extensions
XPath makes it possible to refer to individual components of an XML document. This allows stylesheets in (for example) XSL and XSLT to dynamically "cherry-pick" pieces of a document in any sequence needed in order to compose the required output.
XQuery is to XML what SQL is to relational databases.
XML namespaces enable the same document to contain XML elements and attributes taken from different vocabularies, without any naming collisions occurring.
XML Signature defines the syntax and processing rules for creating digital signatures on XML content.
XML Encryption defines the syntax and processing rules for encrypting XML content.
XPointer is a system for addressing components of XML based internet media.
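Of the extensions listed above, XPath and namespaces are the easiest to try out. The sketch below uses Python's standard `xml.etree.ElementTree` module, which implements only a limited subset of XPath. The namespace URIs, prefixes and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two vocabularies in one document: orders (default namespace) and money.
doc = """<order xmlns="http://example.com/orders" xmlns:m="http://example.com/money">
  <item sku="A1"><m:price currency="EUR">3.50</m:price></item>
  <item sku="B2"><m:price currency="USD">9.00</m:price></item>
</order>"""

# Prefixes used in the path expressions; they need not match the document's.
ns = {"o": "http://example.com/orders", "m": "http://example.com/money"}

root = ET.fromstring(doc)

# Select the item children of the root element.
skus = [item.get("sku") for item in root.findall("o:item", ns)]

# Select price elements anywhere in the tree with a given attribute value.
usd_prices = [p.text for p in root.findall(".//m:price[@currency='USD']", ns)]

print(skus, usd_prices)
```

Because each name is qualified by its namespace URI, an element called price from another vocabulary would not be matched by these expressions.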
Processing XML files
SAX and DOM are APIs widely used to process XML data. SAX is used for serial processing whereas DOM is used for random-access processing. Another form of XML Processing API is data binding, where XML data is made available as a strongly typed programming language data structure, in contrast to the DOM. Example data binding systems are the Java Architecture for XML Binding (JAXB) [http://java.sun.com/xml/jaxb/] and the Strathclyde Novel Architecture for Querying XML (SNAQue) [http://www.cis.strath.ac.uk/research/snaque/].
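The difference between the two styles can be sketched with Python's standard `xml.sax` and `xml.dom.minidom` modules. The handler class name `StepCounter` and the sample document are made up for this illustration. With SAX the program reacts to events as the parser streams through the input; with DOM the whole tree is built first and can then be visited in any order.

```python
import xml.sax
from xml.dom import minidom

RECIPE = "<Instructions><step>Mix</step><step>Cover</step><step>Bake</step></Instructions>"


class StepCounter(xml.sax.ContentHandler):
    """SAX handler that counts step elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.steps = 0

    def startElement(self, name, attrs):
        # Called once per start tag, in document order.
        if name == "step":
            self.steps += 1


# Serial processing: nothing is kept except what the handler stores.
handler = StepCounter()
xml.sax.parseString(RECIPE.encode("utf-8"), handler)
print(handler.steps)

# Random access: the full tree is in memory, so any node can be reached.
dom = minidom.parseString(RECIPE)
steps = dom.getElementsByTagName("step")
last_step = steps[2].firstChild.data
print(last_step)
```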
A filter in the Extensible Stylesheet Language (XSL) family can transform an XML file for displaying or printing.
XSL-FO is a declarative, XML-based page layout language. An XSL-FO processor can be used to convert an XSL-FO document into another non-XML format, such as PDF.
XSLT is a declarative, XML-based document transformation language. An XSLT processor can use an XSLT stylesheet as a guide for the conversion of the data tree represented by one XML document into another tree that can then be serialized as XML, HTML, plain text, or any other format supported by the processor.
XQuery is a W3C language for querying, constructing and transforming XML data.
XPath is a path expression language for selecting data within an XML file. XPath is a sublanguage of both XQuery and XSLT.
The native file format of OpenOffice.org and AbiWord is XML. Some parts of Microsoft Office 11 will also be able to edit XML files with a user-supplied schema (but not a DTD), and on June 2, 2005 Microsoft announced that, by late 2006 all the files created by users of its Office suite of software will be formatted with web-centered XML specifications. There are dozens of other XML editors available.
Versions of XML
There are two current versions of XML. The first, XML 1.0, was initially defined in 1998. It has undergone minor revisions since then, without being given a new version number, and is currently in its third edition, as published on February 4, 2004. It is widely implemented and still recommended for general use. The second, XML 1.1, was initially published on the same day as XML 1.0 Third Edition. It contains features — some contentious — that are intended to make XML easier to use for certain classes of users (mainframe programmers, mainly). XML 1.1 is not very widely implemented and is recommended for use only by those who need its unique features.
XML 1.0 and XML 1.1 differ in the requirements for characters used in element names, attribute names, etc. XML 1.0 only allows characters which are defined in Unicode 2.0, which includes most world scripts but excludes scripts that were added in later Unicode versions, such as Mongolian, Cambodian, Amharic and Burmese.
XML 1.1 only disallows certain control characters, which means that any other character can be used, even if it is not defined in the current version of Unicode.
It should be noted here that the restriction present in XML 1.0 only applies to element/attribute names: both XML 1.0 and XML 1.1 allow for the use of full Unicode in the content itself. Thus XML 1.1 is only needed if in addition to using a script added after Unicode 2.0 you also wish to write element and attribute names in that script.
Other minor changes between XML 1.0 and XML 1.1 are that control characters are now allowed to be included but only when escaped, and two special Unicode line break characters are included, which must be treated as whitespace.
XML 1.0 documents are well-formed XML 1.1 documents with one exception: XML documents that contain unescaped C1 control characters are now malformed, because XML 1.1 requires the C1 control characters to be escaped with numeric character references.
There are also discussions on an XML 2.0, although it remains to be seen if such will ever come about. XML-SW (SW for skunk works), written by one of the original developers of XML, contains some proposals for what an XML 2.0 might look like: elimination of DTDs from syntax, integration of namespaces, XML Base and XML Information Set (infoset) into the base standard.
The World Wide Web Consortium also has an XML Binary Characterization Working Group doing preliminary research into use cases and properties for a binary encoding of the XML infoset. The working group is not chartered to produce any official standards. Since XML is by definition text-based, ITU-T and ISO are using the name [http://asn1.elibel.tm.fr/xml/finf.htm Fast Infoset] for their own binary infoset to avoid confusion (see ITU-T Rec. X.891 | ISO/IEC 24824-1).
|