Textantrieb | UText/1 | UText/1.2 Manual

Text Selectors

The parameter of the UText Script operation select, and of some output tags such as [v], is a selector. A selector can be a plain role name, as in [v title], which matches the first child unit whose role is "title", or it can be a more complex text query. Example:

select website.#webpage."review" status begin
 out Page: [u] Title: [v header.title]
end

The first selector, website.#webpage."review" status, causes the interpreter to step through every webpage in the website that has review status, that is, every webpage that has a child ~status review. For each page, the tag u outputs the name of the page unit. The selector header.title then retrieves the first child of the current page that plays the role header and returns its first child unit with the role title. Finally, the tag v outputs the binary data of that unit.

Selectors can also be called programmatically. The above could be done in a Perl script this way:

$ut->foreach('website.#webpage."review" status', sub {
    print 'Page: ',
        $ut->getUnitName(),
        ' Title: ',
        $ut->getVar('header.title'),
        "\n";
});

Syntax

A selector consists of one or more clauses separated by periods, each selecting a level in the text structure.

level-1 . level-2 .  ... . level-N

Examples: title is a one-level selector; header.title defines two levels and matches all units with the role "title" that are direct children of a unit with the role "header".

Whitespace is irrelevant; the following selectors are all identical:

header . title
       header .title
header.title
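
The splitting into levels can be modelled in a few lines of Python. This is only an illustrative sketch, not the UText parser: the function name split_levels is hypothetical, and treating periods inside quotation marks as part of a data string is an assumption that the manual does not confirm.

```python
def split_levels(selector: str) -> list[str]:
    """Split a selector into level clauses at periods outside quotation marks."""
    levels, current, quote = [], [], None
    for ch in selector:
        if quote:                       # inside a quoted data string
            current.append(ch)
            if ch == quote:             # closing quotation mark
                quote = None
        elif ch in "'\"":               # opening quotation mark
            quote = ch
            current.append(ch)
        elif ch == ".":                 # level separator
            levels.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    levels.append("".join(current).strip())
    return levels


# The three spellings from the text yield the same two levels.
for variant in ("header . title", "       header .title", "header.title"):
    print(split_levels(variant))        # ['header', 'title']

print(split_levels('website.#webpage."review" status'))
# ['website', '#webpage', '"review" status']
```

A selector that begins with a period yields an empty first level in this model, for example split_levels(".??.title") gives ['', '??', 'title'].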

Each level contains a text unit identifier and, optionally, one or more prefix modifiers that apply to that level. Example:

website.!<styles

Modifiers are optional, and neither their order nor the whitespace around them matters. Both of the following lines select the same units:

website.#<:webpage."hobby" category
website . :  < #    webpage . "hobby" category

The wildcards ? and ?? can be used instead of identifiers to define levels.

website.?.title
website.??.title

The first line above returns the titles of all webpages, provided there is exactly one level (i.e. "webpage") between the website and the title units. The second form returns all titles under website, no matter how many levels below website they occur.

Semantics

An expression such as:

level-1 . level-2 . ... . level-N

returns all text units that match all given conditions. The matching follows these steps:

The root unit matching applies to the first given level. The clause is applied to the current unit and recursively to its parent unit, until a matching unit is found. If no root unit is found, the selector returns an empty list.

If the first level is empty (the selector begins with a dot), the root unit is unit; for example, .?? is the same as unit.??.

Once the root unit is found, the next level of the selector is applied to the children of the root unit as a filter. The root unit gets replaced by the child units that match this level's selector. If the input selector has more levels, this procedure is repeated for each level.

Which units are checked at each level depends on the type of restriction being applied.

A level clause =name returns the direct child with this name, if found.

A level clause ~role or :type returns not only the children with a matching role or type, respectively, but also, recursively, their matching children.

The wildcard ? matches every child; that is, no child is discarded when it is applied.

The wildcard ?? matches the current unit and all its descendants. This wildcard can occur at the beginning of a selector, too. In this case, the root unit is not a single unit but all descendants of the current unit.

If the prefix modifier ! is present, the matching is performed not only on the unit being processed, but also on all its parent units.

The selector ends up returning all descendants of the root unit that match all given conditions, each selector level applying consecutively to the next text level. Unless the selector itself specifies a sort order, the returned list follows the order of the underlying text.

Finally, the prefix modifier # causes the selector to return the level it marks instead of the last level that was evaluated.

The details of the semantics are explained below in the description of each selector element.

Selector Clauses

A selector clause matching one single level consists of a unit identifier and possibly some modifiers.

Identifying Units with =, ~, :

A unit can be identified by name, role or type by prefixing the identifier with "=", "~" or ":" respectively.

=index
~webpage
:string

In a selector, units are identified by role by default, so webpage is the same as ~webpage.

Wildcard Identifiers ?, ??

The wildcard identifier ? matches any unit. That is, no restriction applies to this level. For example, supposing you have this text:

^website {
    ^title :string
    ^webpage {
        ^title :string
    }
    ^article {
        ^title :string
        ^section {
            ^title :string
        }
    }
}
=geneaweb ~website {
    ~title Geneaweb
    ~webpage =smith {
        ~title The Smiths
    }
    ~webpage =smithereen {
        ~title The Smithereens
    }
    ~article {
        ~title Chronicle of Family Smith
        ~section {
            ~title The First Generation (1770-1790)
        }
        ~section {
            ~title Expansion in Middle America (1790-1820)
        }
    }
}

the selector website.?.title would match:

~title The Smiths
~title The Smithereens
~title Chronicle of Family Smith

The wildcard ?? matches the current unit plus all its descendants. The selector website.??.title applied to the text above would return:

~title Geneaweb
~title The Smiths
~title The Smithereens
~title Chronicle of Family Smith
~title The First Generation (1770-1790)
~title Expansion in Middle America (1790-1820)

With the selector .??, which is the same as unit.??, one gets all known text units. For example .??.title would return all ~title units currently registered by the interpreter, no matter where they are.

The ?? wildcard expansion can be restricted to a given number of levels by appending a number. Example:

website.??2.title

would match up to 2 levels under the unit ~website:

~title Geneaweb
~title The Smiths
~title The Smithereens
~title Chronicle of Family Smith
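
The three result lists above can be reproduced with a small Python model of the example text. This is a sketch under assumptions, not the interpreter's algorithm: the Unit class and both helper functions are hypothetical, and reading ??N as "title units at most N levels below the root" is inferred only from the example output shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    role: str
    data: str = ""
    children: list["Unit"] = field(default_factory=list)


def self_and_descendants(unit, max_depth=None, depth=0):
    """Yield the unit and its descendants in text order, optionally depth-limited."""
    yield unit
    if max_depth is None or depth < max_depth:
        for child in unit.children:
            yield from self_and_descendants(child, max_depth, depth + 1)


def titles(parents):
    """Data of every direct ~title child of the given parent units."""
    return [c.data for p in parents for c in p.children if c.role == "title"]


website = Unit("website", children=[
    Unit("title", "Geneaweb"),
    Unit("webpage", children=[Unit("title", "The Smiths")]),
    Unit("webpage", children=[Unit("title", "The Smithereens")]),
    Unit("article", children=[
        Unit("title", "Chronicle of Family Smith"),
        Unit("section", children=[Unit("title", "The First Generation (1770-1790)")]),
        Unit("section", children=[Unit("title", "Expansion in Middle America (1790-1820)")]),
    ]),
])

one_level = titles(website.children)                       # website.?.title
any_level = titles(self_and_descendants(website))          # website.??.title
two_levels = titles(self_and_descendants(website, 1))      # website.??2.title

print(one_level)
print(any_level)
print(two_levels)
```

In this model the wildcard ? keeps every child as a candidate parent, while ?? also keeps the root itself, which is why ~title Geneaweb appears only in the second and third result.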

Matching Data with "", ''

A prefix modifier consisting of an arbitrary string enclosed in quotation marks limits the matching to the units whose binary data are exactly this string.

"Family Smith" title

This matches a unit with the role ~title only if its binary contents are "Family Smith". It does not match these:

~title family smith
~title Family    Smith
~title " Family Smith"

it matches only this:

~title Family Smith

One can use single or double quotation marks. For example:

'My "Corner"' subtitle

Matches:

~subtitle My "Corner"
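
The exact-match rule can be illustrated with a short Python sketch. The helper names unquote and data_matches are hypothetical, and the sketch assumes that the data of the third non-matching unit above really starts with a space.

```python
def unquote(token: str) -> str:
    """Return the text between matching single or double quotation marks."""
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]
    raise ValueError(f"not a quoted string: {token!r}")


def data_matches(unit_data: str, quoted: str) -> bool:
    """Exact comparison: letter case and every space count."""
    return unit_data == unquote(quoted)


candidates = ["family smith", "Family    Smith", " Family Smith", "Family Smith"]
matching = [d for d in candidates if data_matches(d, '"Family Smith"')]
print(matching)                                        # ['Family Smith']

# Single quotation marks allow double quotation marks inside the string.
print(data_matches('My "Corner"', "'My \"Corner\"'"))  # True
```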

Multiple Matching with !

By default, a selector docks at the first matching unit, jumping to its parent only if nothing is found. For example, given this text:

^website {
    ^style :string
    ^webpage {
        ^style :string
    }
}
~website
~style base.css
~style print.css
~webpage =index
~style cover.css
~style external.css

Suppose the UText object is currently processing the webpage "index". The selector style then returns:

~style cover.css
~style external.css

If you want to get not only the styles for the current page but also any other styles that may be defined at the levels above webpage, use the selector !style instead, which returns:

~style base.css
~style print.css
~style cover.css
~style external.css

This also returns any other ~style units that exist above website, if there are any. To stop the search at a particular level, put that level first in the selector. For example:

website.webpage.!style

would return the styles defined for the current webpage and website, but not those above website.
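
The difference between the default lookup and the ! modifier can be sketched in Python as follows. This is a simplified model, not UText code: the Unit class, the two functions and the extra ~server level with global.css are hypothetical and exist only to show the effect of the upper limit. The result order (outermost level first) follows the example output above.

```python
class Unit:
    def __init__(self, role, data="", parent=None):
        self.role, self.data, self.parent, self.children = role, data, parent, []
        if parent is not None:
            parent.children.append(self)


def child_data(unit, role):
    return [c.data for c in unit.children if c.role == role]


def select_nearest(unit, role):
    """Default: dock at the first level, walking upwards, that has a match."""
    while unit is not None:
        found = child_data(unit, role)
        if found:
            return found
        unit = unit.parent
    return []


def select_all_levels(unit, role, top=None):
    """With '!': collect matches from the unit and its ancestors up to 'top'."""
    chain = []
    while unit is not None:
        chain.append(unit)
        if unit is top:
            break
        unit = unit.parent
    return [d for u in reversed(chain) for d in child_data(u, role)]


server = Unit("server")                      # hypothetical level above website
Unit("style", "global.css", server)
website = Unit("website", parent=server)
Unit("style", "base.css", website)
Unit("style", "print.css", website)
index = Unit("webpage", parent=website)
Unit("style", "cover.css", index)
Unit("style", "external.css", index)

print(select_nearest(index, "style"))              # style
print(select_all_levels(index, "style"))           # !style
print(select_all_levels(index, "style", website))  # website.webpage.!style
```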

Limiting Results with (n-m)

One can restrict the results of a selector to a particular item with the prefix modifier (n), where n is the position of the item in the list, counting from 1. For example:

(2) ~webpage

returns not a list containing all webpages but only the second one.

An interval can be expressed with (n-m). For example, (1-5) :string returns the list of the first five strings found. If fewer than 5 strings are found, it returns all of them.
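
In Python terms, the modifier behaves like a slice with 1-based, inclusive bounds. The function name limit and the sample data are hypothetical; the sketch only mirrors the behaviour described above.

```python
def limit(results, first, last=None):
    """Model of (n) and (n-m): 1-based, inclusive, tolerant of short lists."""
    last = first if last is None else last
    return results[first - 1:last]


webpages = ["index", "smith", "smithereen"]
second_page = limit(webpages, 2)            # (2) ~webpage
first_five = limit(["a", "b", "c"], 1, 5)   # (1-5) :string with only 3 strings

print(second_page)   # ['smith']
print(first_five)    # ['a', 'b', 'c']
```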

Fixing a Return Level with #

By default the selector returns the last matched level. This can be overridden by setting an explicit return level with the prefix modifier #.

A selector such as website.webpage.style returns a list of styles, and website.webpage returns a list of webpages. Sometimes, however, one wants a list of units of one level filtered by criteria that apply to a deeper level. For example, with the selector:

manual.module."kernel" category

you get a list such as

~category kernel
~category kernel
~category kernel
~category kernel

That is probably not of much use to you. If you want to get the modules themselves that are in that category, write:

manual.#module."kernel" category

With this selector you get a list of units that match ~module, but not all, just those that contain a child unit that conforms with ~category kernel.
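
The effect of # can be pictured as filtering on a child while returning the parent. The following Python sketch uses hypothetical module names and assumes, for brevity, that each module has exactly one category.

```python
modules = [
    {"name": "scheduler", "category": "kernel"},
    {"name": "editor", "category": "tools"},
    {"name": "memory", "category": "kernel"},
]


def select_categories(mods, wanted):
    """Model of manual.module."kernel" category: returns the category units."""
    return [("category", m["category"]) for m in mods if m["category"] == wanted]


def select_modules(mods, wanted):
    """Model of manual.#module."kernel" category: returns the marked level."""
    return [m["name"] for m in mods if m["category"] == wanted]


print(select_categories(modules, "kernel"))
# [('category', 'kernel'), ('category', 'kernel')]
print(select_modules(modules, "kernel"))
# ['scheduler', 'memory']
```

Both functions apply the same filter; only the level that ends up in the result list differs.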

Sorting Results with -, <, >

If you do not specify a sort order, the results are returned in the same order as the underlying text structure: first a unit and then its children ordered as they were fed.

With the prefix modifier "-" the order is reversed: first the children, from the last to the first, then the parent unit. If you have this text:

~books
~book Literary Machines
~book Augmenting Human Intellect
~book Software Pioneers

with -book you get:

~book Software Pioneers
~book Augmenting Human Intellect
~book Literary Machines

This reversal refers to the order of the text, not to its contents. You can sort according to the binary contents of the returned units with "<" (ascending) and ">" (descending). On the text above with <book you get:

~book Augmenting Human Intellect
~book Literary Machines
~book Software Pioneers

The modifier ">" is a shortcut for the modifiers "-<".
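
The three sort modifiers can be modelled in Python as shown below. The function apply_sort is hypothetical, and Python's default string comparison stands in for whatever collation UText actually uses, which this manual does not specify.

```python
books = ["Literary Machines", "Augmenting Human Intellect", "Software Pioneers"]


def apply_sort(results, modifiers):
    """Model of the prefix modifiers '-', '<' and '>' on a result list."""
    modifiers = modifiers.replace(">", "-<")   # '>' is a shortcut for '-<'
    out = sorted(results) if "<" in modifiers else list(results)
    if "-" in modifiers:
        out.reverse()
    return out


print(apply_sort(books, "-"))
# ['Software Pioneers', 'Augmenting Human Intellect', 'Literary Machines']
print(apply_sort(books, "<"))
# ['Augmenting Human Intellect', 'Literary Machines', 'Software Pioneers']
print(apply_sort(books, ">") == apply_sort(books, "-<"))   # True
```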

Concatenation with +

The results of two or more selectors can be concatenated with the character "+" (preceded by whitespace). Example:

history.book + science.textbook

This returns a single cursor that visits all history books and after that all science textbooks. The order of the nodes inside each selector's result is not altered.

Subtraction with -

The results of two or more selectors can be subtracted with the character "-" (preceded by whitespace). Example:

history.book - history.#book."obsolete" tag

This returns a cursor that visits all history books which are not tagged as "obsolete".

Intersection with *

The results of two or more selectors can be intersected with the character "*" (preceded by whitespace). Example:

#book."history" category * #book."1972" year

This returns a single cursor that visits all books in the category "history" that appeared in 1972.

Note that the results of the clauses must be compatible. In the example above, both clauses use "#" to get books as results. If the second clause were book. "1972" year, it would return ~year units instead of ~book units, and the intersection with the first clause would be empty.

The order of the nodes corresponds to the order in the second clause. This way one can restrict with one clause and sort with another one. For example, to get all books in the category "history" sorted by year:

#book."history" category * #book.<year
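
The three combining operators can be modelled on plain Python lists. This is a sketch with hypothetical book data and function names; it assumes that subtraction keeps the order of the first operand, which the manual does not state, and it takes the order of the second operand for the intersection, as described above.

```python
# (title, category, year); hypothetical sample data
books = [
    ("Rome", "history", 1972),
    ("Atoms", "science", 1972),
    ("Vikings", "history", 1965),
    ("Maya", "history", 1990),
]


def concat(a, b):
    """Model of '+': all of a, then all of b, each in its own order."""
    return a + b


def subtract(a, b):
    """Model of '-': items of a that do not occur in b."""
    return [x for x in a if x not in b]


def intersect(a, b):
    """Model of '*': common items, in the order of the second operand."""
    return [x for x in b if x in a]


history = [b for b in books if b[1] == "history"]
science = [b for b in books if b[1] == "science"]
in_1972 = [b for b in books if b[2] == 1972]
by_year = sorted(books, key=lambda b: b[2])        # like #book.<year

print([t for t, _, _ in concat(history, science)])
# ['Rome', 'Vikings', 'Maya', 'Atoms']
print([t for t, _, _ in subtract(history, in_1972)])
# ['Vikings', 'Maya']
print([t for t, _, _ in intersect(history, in_1972)])
# ['Rome']
print([t for t, _, _ in intersect(history, by_year)])
# ['Vikings', 'Rome', 'Maya']
```

The last line corresponds to the idea of restricting with the first clause and sorting with the second.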

Calling Selectors

In a UText script, selectors are passed to operations such as select.

They can also be used in output tags such as [v].

In Perl scripts, functions such as foreach and getVar admit selectors.

See Navigation.pm for more information on these function calls.