Type: Class
Constants:
LETTER: '[:alpha:]'
DIGIT: '[:digit:]'
COMBININGCHAR: ''
EXTENDER: ''
NCNAME_STR: "[#{LETTER}_:][-[:alnum:]._:#{COMBININGCHAR}#{EXTENDER}]*"
NAME_STR: "(?:(#{NCNAME_STR}):)?(#{NCNAME_STR})"
UNAME_STR: "(?:#{NCNAME_STR}:)?#{NCNAME_STR}"
NAMECHAR: '[\-\w\.:]'
NAME: "([\\w:]#{NAMECHAR}*)"
NMTOKEN: "(?:#{NAMECHAR})+"
NMTOKENS: "#{NMTOKEN}(\\s+#{NMTOKEN})*"
REFERENCE: "&(?:#{NAME};|#\\d+;|#x[0-9a-fA-F]+;)"
REFERENCE_RE: /#{REFERENCE}/
DOCTYPE_START: /\A\s*<!DOCTYPE\s/um
DOCTYPE_PATTERN: /\s*<!DOCTYPE\s+(.*?)(\[|>)/um
ATTRIBUTE_PATTERN: /\s*(#{NAME_STR})\s*=\s*(["'])(.*?)\4/um
COMMENT_START: /\A<!--/u
COMMENT_PATTERN: /<!--(.*?)-->/um
CDATA_START: /\A<!\[CDATA\[/u
CDATA_END: /^\s*\]\s*>/um
CDATA_PATTERN: /<!\[CDATA\[(.*?)\]\]>/um
XMLDECL_START: /\A<\?xml\s/u
XMLDECL_PATTERN: /<\?xml\s+(.*?)\?>/um
INSTRUCTION_START: /\A<\?/u
INSTRUCTION_PATTERN: /<\?(.*?)(\s+.*?)?\?>/um
TAG_MATCH: /^<((?>#{NAME_STR}))\s*((?>\s+#{UNAME_STR}\s*=\s*(["']).*?\5)*)\s*(\/)?>/um
CLOSE_MATCH: /^\s*<\/(#{NAME_STR})\s*>/um
VERSION: /\bversion\s*=\s*["'](.*?)['"]/um
ENCODING: /\bencoding\s*=\s*["'](.*?)['"]/um
STANDALONE: /\bstandalone\s*=\s*["'](.*?)['"]/um
ENTITY_START: /^\s*<!ENTITY/
IDENTITY: /^([!\*\w\-]+)(\s+#{NCNAME_STR})?(\s+["'](.*?)['"])?(\s+['"](.*?)["'])?/u
ELEMENTDECL_START: /^\s*<!ELEMENT/um
ELEMENTDECL_PATTERN: /^\s*(<!ELEMENT.*?)>/um
SYSTEMENTITY: /^\s*(%.*?;)\s*$/um
ENUMERATION: "\\(\\s*#{NMTOKEN}(?:\\s*\\|\\s*#{NMTOKEN})*\\s*\\)"
NOTATIONTYPE: "NOTATION\\s+\\(\\s*#{NAME}(?:\\s*\\|\\s*#{NAME})*\\s*\\)"
ENUMERATEDTYPE: "(?:(?:#{NOTATIONTYPE})|(?:#{ENUMERATION}))"
ATTTYPE: "(CDATA|ID|IDREF|IDREFS|ENTITY|ENTITIES|NMTOKEN|NMTOKENS|#{ENUMERATEDTYPE})"
ATTVALUE: "(?:\"((?:[^<&\"]|#{REFERENCE})*)\")|(?:'((?:[^<&']|#{REFERENCE})*)')"
DEFAULTDECL: "(#REQUIRED|#IMPLIED|(?:(#FIXED\\s+)?#{ATTVALUE}))"
ATTDEF: "\\s+#{NAME}\\s+#{ATTTYPE}\\s+#{DEFAULTDECL}"
ATTDEF_RE: /#{ATTDEF}/
ATTLISTDECL_START: /^\s*<!ATTLIST/um
ATTLISTDECL_PATTERN: /^\s*<!ATTLIST\s+#{NAME}(?:#{ATTDEF})*\s*>/um
NOTATIONDECL_START: /^\s*<!NOTATION/um
PUBLIC: /^\s*<!NOTATION\s+(\w[\-\w]*)\s+(PUBLIC)\s+(["'])(.*?)\3(?:\s+(["'])(.*?)\5)?\s*>/um
SYSTEM: /^\s*<!NOTATION\s+(\w[\-\w]*)\s+(SYSTEM)\s+(["'])(.*?)\3\s*>/um
TEXT_PATTERN: /\A([^<]*)/um
PUBIDCHAR: "\x20\x0D\x0Aa-zA-Z0-9\\-()+,./:=?;!*@$_%#"
Entity constants
SYSTEMLITERAL: %Q{((?:"[^"]*")|(?:'[^']*'))}
PUBIDLITERAL: %Q{("[#{PUBIDCHAR}']*"|'[#{PUBIDCHAR}]*')}
EXTERNALID: "(?:(?:(SYSTEM)\\s+#{SYSTEMLITERAL})|(?:(PUBLIC)\\s+#{PUBIDLITERAL}\\s+#{SYSTEMLITERAL}))"
NDATADECL: "\\s+NDATA\\s+#{NAME}"
PEREFERENCE: "%#{NAME};"
ENTITYVALUE: %Q{((?:"(?:[^%&"]|#{PEREFERENCE}|#{REFERENCE})*")|(?:'([^%&']|#{PEREFERENCE}|#{REFERENCE})*'))}
PEDEF: "(?:#{ENTITYVALUE}|#{EXTERNALID})"
ENTITYDEF: "(?:#{ENTITYVALUE}|(?:#{EXTERNALID}(#{NDATADECL})?))"
PEDECL: "<!ENTITY\\s+(%)\\s+#{NAME}\\s+#{PEDEF}\\s*>"
GEDECL: "<!ENTITY\\s+#{NAME}\\s+#{ENTITYDEF}\\s*>"
ENTITYDECL: /\s*(?:#{GEDECL})|(?:#{PEDECL})/um
EREFERENCE: /&(?!#{NAME};)/
DEFAULT_ENTITIES: {
  'gt' => [/&gt;/, '&gt;', '>', />/],
  'lt' => [/&lt;/, '&lt;', '<', /</],
  'quot' => [/&quot;/, '&quot;', '"', /"/],
  "apos" => [/&apos;/, "&apos;", "'", /'/]
}
MISSING_ATTRIBUTE_QUOTES: /^<#{NAME_STR}\s+#{NAME_STR}\s*=\s*[^"']/um
These are patterns to identify common markup errors, to make the error messages more informative.
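As a rough illustration of how such an error pattern might be applied, the sketch below rebuilds MISSING_ATTRIBUTE_QUOTES from the strings in the constants listing instead of loading it from the library. The module name MarkupErrorSketch and the helper missing_attribute_quotes? are hypothetical and are not part of REXML.

```ruby
# Local copies of the strings from the constants listing, rebuilt here
# (not loaded from REXML) so that the sketch is self-contained.
module MarkupErrorSketch
  LETTER        = '[:alpha:]'
  COMBININGCHAR = ''
  EXTENDER      = ''
  NCNAME_STR    = "[#{LETTER}_:][-[:alnum:]._:#{COMBININGCHAR}#{EXTENDER}]*"
  NAME_STR      = "(?:(#{NCNAME_STR}):)?(#{NCNAME_STR})"

  MISSING_ATTRIBUTE_QUOTES = /^<#{NAME_STR}\s+#{NAME_STR}\s*=\s*[^"']/um

  # Hypothetical helper: true when the first attribute of a start tag
  # is followed by something other than an opening quote.
  def self.missing_attribute_quotes?(markup)
    MISSING_ATTRIBUTE_QUOTES.match?(markup)
  end
end

puts MarkupErrorSketch.missing_attribute_quotes?("<a href=foo>")
puts MarkupErrorSketch.missing_attribute_quotes?("<a href='foo'>")
```

The pattern only inspects the first attribute after the tag name, so it serves as a hint for a better error message, not as a full validity check.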
Using the Pull Parser
This API is experimental, and subject to change.
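  require 'rexml/parsers/pullparser'

  parser = REXML::Parsers::PullParser.new( "<a>text<b att='val'/>txet</a>" )
  while parser.has_next?
    res = parser.pull
    # For a start element, res[0] is the tag name and res[1] the attribute hash.
    puts res[1]['att'] if res.start_element? and res[0] == 'b'
  end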
See the PullEvent class for information on the content of the results. The data is identical to the arguments passed for the various events to the StreamListener API.
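To make the correspondence with StreamListener arguments more concrete, here is a minimal sketch that records three kinds of events together with their arguments. It assumes REXML is installed (it ships as a bundled gem with recent Ruby versions); the helper name collect_events is hypothetical.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: pulls every event from the document and records the
# event type plus the arguments a StreamListener would receive for it.
def collect_events(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  events = []
  while parser.has_next?
    event = parser.pull
    case event.event_type
    when :start_element
      # Same arguments as StreamListener#tag_start(name, attrs)
      events << [:start_element, event[0], event[1]]
    when :end_element
      # Same argument as StreamListener#tag_end(name)
      events << [:end_element, event[0]]
    when :text
      # Same argument as StreamListener#text(text)
      events << [:text, event[0]]
    end
  end
  events
end

collect_events("<a>text<b att='val'/>txet</a>").each { |e| p e }
```

Indexing a PullEvent skips the event type, so event[0] is the first argument of the event, not the type symbol; the type is available through event_type.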
Notice that each event can also be checked for an error:
  require 'rexml/parsers/pullparser'

  parser = REXML::Parsers::PullParser.new( "<a>BAD DOCUMENT" )
  while parser.has_next?
    res = parser.pull
    raise res[1] if res.error?
  end
Nat Price gave me some good ideas for the API.