
Universal-Text Language

UTL/1.2

This page describes the Universal-Text Language UTL as implemented in UText/1.2.

UTL Syntax

Explicit Syntax Elements

Lines

UTL is a plain-text prefix notation in which each line is a separate instruction. An input such as:

~title Joan's Homepage

is interpreted as a text unit with the role ”title“ and the binary contents ”Joan's Homepage“. A UTL line consists of words, each preceded by a prefix sign, optionally followed by literal data. The prefixes and their meanings are:

~ the next word is a role name
^ the next word defines a new unit
: the next word is a type name
= the next word is a unit name
== the next word is a unit reference

When the first word without a prefix is found, that word and the rest of the line are interpreted as a chunk of binary data.

Empty lines are ignored. Whitespace at the beginning and the end of a line is ignored. Whitespace between prefixed words acts as a word delimiter. Whitespace between a prefix and the following word is irrelevant, and so is the order of the prefixed words. These lines are equivalent:

~title :string My Site
~ title : string My Site
: string ~ title My Site
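
The line rules above can be sketched in a few lines of Python. This is only an illustration and not the code of the UText interpreter; the function name parse_line and the field labels (role, define, type, name, ref) are hypothetical, and block brackets, quoted data and parenthesized references are deliberately left out.

```python
import re

# "==" is listed first so that it is not read as "=" followed by "=".
PREFIXES = {"==": "ref", "~": "role", "^": "define", ":": "type", "=": "name"}
PREFIXED_WORD = re.compile(r"\s*(==|~|\^|:|=)\s*(\S+)")


def parse_line(line):
    """Split one UTL line into its prefixed words and its binary data."""
    fields = {}
    rest = line.strip()
    while True:
        match = PREFIXED_WORD.match(rest)
        if match is None:
            break
        # The order of the prefixed words does not matter, so a dict is enough.
        fields[PREFIXES[match.group(1)]] = match.group(2)
        rest = rest[match.end():]
    # The first word without a prefix starts the binary data.
    return fields, rest.strip()
```

Under these assumptions the three equivalent lines above all yield the same pair of fields and binary data.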

Whitespace at the beginning and end of the binary data is ignored, but whitespace inside it is preserved. Both of these lines return the binary data ”My   Site“, with 3 space characters between the words and none at the beginning or the end:

~ title              My   Site
~ title My   Site

If you want binary data with leading or trailing spaces, enclose it in single or double quotation marks:

~ title "             My   Site  "
~ title '             My   Site  '
~ title              My   Site

The first and the second line return the binary data with its leading and trailing spaces intact (”             My   Site  “); the third returns ”My   Site“.
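
As a rough sketch of this trimming rule in Python (the function name binary_value is hypothetical, and the handling of a lone '' or "" marker, which opens multi-line data, is not covered here):

```python
def binary_value(raw):
    """Return the binary data for the literal part of a UTL line."""
    text = raw.strip()
    # A matching pair of single or double quotes protects the spaces inside.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    return text
```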

Binary data can span more than one line. If a line ends with '' or "", all following lines up to a line containing just the characters '' (or "", respectively) are interpreted as a single chunk of binary data. For example, the following defines a unit with the role ”code“ and binary data consisting of 4 lines:

~code ""
sub salute
{
    return "hello!";
}
""

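The multi-line rule can be pictured with the following Python sketch. It is an assumption-laden illustration, not the interpreter's code: the name read_units is hypothetical, input lines are assumed to carry no trailing newline, and the inner lines are assumed to be kept verbatim, indentation included.

```python
def read_units(lines):
    """Yield (head, data) pairs, joining multi-line binary data."""
    it = iter(lines)
    for line in it:
        stripped = line.strip()
        if not stripped:
            continue  # empty lines are ignored
        for mark in ("''", '""'):
            if stripped.endswith(mark):
                body = []
                # Consume lines from the same iterator up to the closing mark.
                for inner in it:
                    if inner.strip() == mark:
                        break
                    body.append(inner)
                yield stripped[:-2].strip(), "\n".join(body)
                break
        else:
            # No multi-line marker: the line is passed on unchanged.
            yield stripped, None
```
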
A line may also set a reference to another unit. Adding ==<unit name> to a line sets a reference to the unit with that name. When navigating the text, the interpreter jumps to the referred unit. Instead of a plain unit name, a reference can also take this form:

~title ==(toTitle webpage)

The reference is put in parentheses. The rightmost word is a unit name, here "webpage". The word "toTitle" is the name of a text transformation. A transformation is an operation that takes a text unit as its parameter, transforms it, and returns the transformed text unit. The resulting unit is the one referenced by the "~title" unit.

Blocks

The second syntax element of UTL is the line block. Lines are grouped by enclosing them in brackets. The meaning of a block depends on the type of bracket.

An opening curly bracket { enters a new unit level and a closing curly bracket } leaves it. For example:

~website =myweb My Homepage {
    ~webpage =index
    ~webpage =contact
}
~website =myshop The Hobby Corner
{
    ~webpage =welcome
    ~webpage =shoppingcart
}

The above declares two websites. The webpages index and contact belong to the website "myweb" and the webpages welcome and shoppingcart belong to the website "myshop". The opening curly bracket must be at the end of a line; it may be the only character on the line, but need not be. The closing curly bracket must be on a line of its own.

To have a block of lines parsed by an alternative parser instead of the UTL parser, put them in square brackets (read more in Alternative Parsers). Example:

~webpage =index Overview [
My Site
Welcome to my site!
This site is under construction.
]

To sum up: the UTL syntax has only two elements, lines and line blocks, with these meanings: unit declaration, enter level, leave level, parse block. They correspond to the four operations of the UText kernel: set, enter, leave, parse.

Comments

Finally, one can mark a portion of the input file as a comment; such lines are ignored by the interpreter. A comment block is enclosed between a {-- and a --} mark.

{--
To do: rewrite
I wrote this once but I am not sure it is right
--}

A single line beginning with -- is also a comment line. If a line begins with __END__, the rest of the file is ignored.

A shebang line is also recognized by the interpreter and ignored. For example, this may appear on the very first line of the file ”geneaweb.utl“:

#!/usr/bin/perl geneaweb.pl
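
The comment rules can be summarized with a small Python filter. This is a sketch under assumptions, not the interpreter's own logic: the name strip_comments is hypothetical, and the {-- and --} marks are assumed to stand on lines of their own, as in the example above.

```python
def strip_comments(lines):
    """Return only the lines that are not comments, shebang or trailing matter."""
    kept = []
    in_block = False
    for number, line in enumerate(lines):
        text = line.strip()
        if in_block:
            # Inside a comment block everything is skipped up to the --} mark.
            if text == "--}":
                in_block = False
            continue
        if number == 0 and text.startswith("#!"):
            continue  # shebang on the very first line
        if text.startswith("__END__"):
            break  # the rest of the file is ignored
        if text == "{--":
            in_block = True
            continue
        if text.startswith("--"):
            continue  # single comment line
        kept.append(line)
    return kept
```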

Implicit Syntax Elements

When reading a text representation, the interpreter takes its structure into account and infers some elements that are not explicit. The mechanisms involved are: child role inference, binary role inference and parent unit inference.

The following is correct UTL:

^webpage {
    ^title : string
    ^content {
        ^p : string
        ^h1 : string
    }
}
~webpage =index Overview
~content
~h1 My Site
Welcome to my site!
This site is under construction.

If you let the interpreter read this and then execute

print $ut->toString('unit','index');

you will get this output, which shows what the interpreter recorded:

=index ~webpage {
    ~title Overview
    ~content {
        ~h1 My Site
        ~p Welcome to my site!
        ~p This site is under construction.
    }
}

You can find the above code in the file ”implicit.pl“ in the samples directory of the distribution files.

Binary role inference

The line ~webpage =index Overview defines the binary data ”Overview“, but the unit ”index“ has the type ”webpage“ and is therefore not a binary unit. Which unit does this data belong to? The compiler searches the definition of the type ”webpage“ for the first binary unit. It finds the child unit ”title“ to be of type ”string“, which is known to be binary. Thus the compiler considers ”title“ to be the so-called ”default binary child“ of the type ”webpage“, and when reading such a line it expands it to:

=index ~webpage {
    ~title Overview
}

Child role inference

A similar mechanism lets us omit the unit role. The line ”Welcome to my site!“ above consists of just binary data, without any prefixed word. The compiler looks up the first child in the definition of the type "content"; this child becomes the so-called "default child" of the type "content". Note that the default child can be a binary unit, but need not be. When reading this line, the interpreter therefore converts it automatically to: ~p Welcome to my site!.

A type inherits the default child of its parent type if it does not have its own. For example, if ”prose“ has the default child ”p“ and you define ^page :prose without explicit children, then ”page“ has the default child ”p“, too. The same applies to the default binary child.
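
Both lookups, including the fallback to the parent type, can be sketched in Python as follows. The table layout and the function names are hypothetical; the types are the ones from the examples above, and the set of binary types is taken from the startup types listed under Binary Data.

```python
BINARY_TYPES = {"binary", "field", "bool", "cardinal", "string", "ustring"}

# type name -> its parent type (or None) and its own children as (role, type)
TYPES = {
    "webpage": {"type": None, "children": [("title", "string"), ("content", "content")]},
    "content": {"type": None, "children": [("p", "string"), ("h1", "string")]},
    "prose": {"type": None, "children": [("p", "string")]},
    "page": {"type": "prose", "children": []},
}


def default_child(type_name):
    """First child of the type, or of the nearest parent type that has one."""
    while type_name is not None:
        children = TYPES[type_name]["children"]
        if children:
            return children[0][0]
        type_name = TYPES[type_name]["type"]
    return None


def default_binary_child(type_name):
    """First binary child of the type, falling back to its parent types."""
    while type_name is not None:
        for role, child_type in TYPES[type_name]["children"]:
            if child_type in BINARY_TYPES:
                return role
        type_name = TYPES[type_name]["type"]
    return None
```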

Parent unit inference

Finally, the system also infers some curly brackets, which therefore do not need to be given. Our input text contained these lines without any curly brackets:

~webpage =index Overview
~content
~h1 My Site
Welcome to my site!
This site is under construction.

The interpreter processes them as follows:

It finds "~content". It checks: does the type "webpage" admit a child with the role "content"? Yes. So ~content refers to the previous ~webpage and a new level must be opened: a { is set implicitly.
It finds "~h1". It checks: does the type "content" admit a child with the role "h1"? Yes. So ~h1 refers to the previous ~content and a new level must be opened: a { is set implicitly.
It finds "Welcome to my site!". It checks: does the type "h1" have a default child? No. Does the type "content" have a default child? Yes, the role ~p. So the line "~p Welcome to my site!" refers to ~content. This is already the current level, so no curly brackets are set.
The same happens with the line ”This site is under construction.“.
It reaches the end of the text. All implicitly opened levels must be left, so two closing curly brackets are set automatically.
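
The walk-through above can be approximated in Python. The sketch below is a simplification and not the interpreter's algorithm: the name infer_parents and the CHILD_ROLES table are hypothetical, role names stand in for type names, the bare text lines are assumed to have been expanded to ~p already, and recursive type definitions are outside its scope.

```python
# Which child roles each type admits (taken from the webpage example).
CHILD_ROLES = {
    "webpage": ["title", "content"],
    "content": ["p", "h1"],
    "title": [],
    "p": [],
    "h1": [],
}


def infer_parents(roles):
    """Return (depth, role) pairs, depth being the inferred nesting level."""
    stack = []   # roles of the currently open units, outermost first
    placed = []
    for role in roles:
        # Walk up from the most recent unit until one admits this role.
        while stack and role not in CHILD_ROLES[stack[-1]]:
            stack.pop()
        placed.append((len(stack), role))
        stack.append(role)
    return placed
```

For the input webpage, content, h1, p, p this places content one level below webpage, and h1 and both p units one level below content, which matches the recorded structure shown earlier.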

Unit Names

Units can be given a name. The interpreter currently allows units to be identified by any sequence of digits and letters without whitespace. You set a name with the prefix ”=“.

=index ~webpage

This creates a unit with the name "index". The difference between upper and lower case is relevant. These are two different webpages:

=index ~webpage
=Index ~webpage

The names of the child units of any one parent unit are unique. Two different units can have the same name only if they have different parents.

To disambiguate names, one refers to more than one unit level, separated by a period. Suppose you have this text:

~website =private-web {
    =index ~webpage
}
~website =shop-web {
    =index ~webpage
}

To refer to the index page of the site "shop-web" from the index page of the site "private-web", you cannot write index, because this would point to the index page of the site "private-web". Instead you write: shop-web.index.

If you use the same unit name more than once under the same parent, the system consolidates the occurrences into a single unit. For example:

~website =web
    =index ~webpage
    ~p This is my personal website.
~website =web
    =index ~webpage
    ~p This site is under construction.

This leads to just one webpage named "index", which has 2 p child units. The second time, instead of ”=index ~webpage“ one can simply write ”=index“, because the role is already known.

But the following will raise an error:

^website {
    ^webpage {
        ^p : string
    }
    ^article {
        ^p : string
    }
}
~website =web
    =index ~webpage
    ~p This is my personal website.
    =index ~article
    ~p This site is under construction.

The interpreter realizes that the second "index" has a different role, so the two must be different units; but two child units of the same parent unit ("website") are not allowed to have the same name ("index"). The execution aborts with an error message.

Recursion

UText allows self-reference in unit definitions, thus allowing recursion without limit. Example:

^t {
    ^title :string
    ^t :t
}
~t first level {
    ~t second level {
        ~t third level {
            ~t fourth level
        }
    }
}

Note that the child t is defined as having its own parent as type. The line ^t :t is not the same as the line ^t; the latter defines a child that has itself as type.

If you do not use explicit block open and close marks { }, however, a ~t following a ~t is not regarded as having a deeper level, but as having the same level. Example:

~t first level
~t first level, too
~t first level, too
~t first level, too

A different approach is used when the recursive definition does not use the same name for the parent and for the child:

^parent {
 ^title :string
 ^child :parent
}
~parent first level
~child second level
~child third level

If you want to have two fixed levels of the same type, you can do this:

^BaseType {
 ^title :string
}
^parent :BaseType {
 ^child :BaseType
}
~parent first level
~child second level
~child second level

UText Semantics

Text Formula

The semantics of a text structure can be specified by this one formula:

<parent unit> {
    = <child unit> ~ <role> : <type> [<binary data>]
}

Here parent unit, child unit, role and type are all text units. The binary data is optional. This formula applies to all text units. But a UText instance always holds finite data, so there must necessarily be at least one unit of this form:

=U {
    =U ~U :U
}

That is, a unit that is its own parent, its own role and its own type. In UText such a unit is defined by the interpreter at startup under the name unit. When you define a new unit ^website, the interpreter records this structure:

^unit {
 ^website
}

These constraints apply to every unit:

The role must be either a) the unit itself (for definitions) or b) a child unit of the parent's type unit or of one of that type's ancestors.
The type must be compatible with the role. If the unit referred to as role has the type A, then the unit type must be A or a descendant type of it.

If one tries to define a text that violates these constraints the interpreter aborts execution with an error message.

Unit Types

Each text unit has a type. A line =A :B defines a unit A as having the type B, meaning: A is an occurrence of B. For example =elephant :mammal means that an elephant is a mammal.

All characteristics of a type apply to its instances. (See the file ”type.pl“ in the samples directory of the distribution files for the following code.)

^species {
    ^common-name :string
    ^scientific-name :string
    ^life-span :cardinal
}
^mammal : species

If we define a species as having a common and a scientific name, and then introduce the mammals as species, then mammals implicitly get a common and a scientific name, too. This is not restricted to one level: all descendant levels inherit from a type. For example, if an elephant is a mammal, and a mammal is a species, then an elephant has common and scientific names, too:

^elephant : mammal
~elephant {
    ~common-name savanna elephant
    ~scientific-name  Loxodonta africana africana
}

If a unit is an instance of a type, then its children must instantiate the type's children, too. In the lines above, the unit ~common-name savanna elephant must be a string, because this is the type of the common name according to its definition line ^common-name :string. Consider this example:

^zoo {
    ^animal :species
}

Here we introduced a zoo as a list of animals, each of them being an occurrence of a species. This list could be something like this:

=CityZoo ~zoo {
    ~animal :elephant
    ~animal :elephant
    ~animal :penguin
}

An elephant is an animal, therefore it can be placed as child of the unit CityZoo, and a penguin too. Children of a particular unit that play the same role can have different types, but they all share a type they are subtype of, which is the type of the role they are playing.
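
The subtype test behind this rule can be sketched in Python. The names TYPE_OF and is_subtype are hypothetical; the table lists each unit's type as in the definitions above, with a unit defined without an explicit type being its own type.

```python
# unit name -> name of its type; root definitions are their own type
TYPE_OF = {
    "species": "species",
    "mammal": "species",
    "elephant": "mammal",
    "penguin": "species",
}


def is_subtype(candidate, ancestor):
    """True if candidate is ancestor or descends from it through its types."""
    seen = set()
    while candidate not in seen:
        if candidate == ancestor:
            return True
        seen.add(candidate)
        candidate = TYPE_OF[candidate]
    # The chain ended at a self-typed unit without meeting the ancestor.
    return False
```

With these definitions both elephant and penguin pass the test against species, the type of the role animal, so both may appear as ~animal children of CityZoo.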

A subtype can define its own members in addition to the inherited ones.

^penguin : species {
    ^breeding-pairs : cardinal
}
~penguin {
    ~common-name Little Blue Penguin
    ~scientific-name Eudyptula minor
    ~breeding-pairs 300,000
}

The Universal-Text Interpreter guarantees the logical integrity of the entered text. If you try to define a text that does not conform to the previously defined text structure, parsing aborts with an error message. For example, this will not be accepted:

^species
^penguin : species
^elephant : species {
    ^trunk-length :cardinal
}
~penguin {
    ~trunk-length 2
}

A child unit of penguin is not allowed to play the role trunk-length, because according to the given definition only elephants have one. But the following would be allowed, since it is coherent, even though it makes no sense:

^species
^elephant : species {
    ^trunk-length :cardinal
}
^penguin : elephant
~penguin {
    ~trunk-length 2
}
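
The difference between the two cases can be made concrete with a Python sketch that collects the roles a type admits along its chain of types. The names allowed_roles, REJECTED and ACCEPTED are hypothetical, and the tables merely restate the two definitions above.

```python
# type name -> (its own type, the child roles it defines itself)
REJECTED = {
    "species": ("species", []),
    "penguin": ("species", []),
    "elephant": ("species", ["trunk-length"]),
}
ACCEPTED = {
    "species": ("species", []),
    "elephant": ("species", ["trunk-length"]),
    "penguin": ("elephant", []),
}


def allowed_roles(type_name, definitions):
    """Collect own and inherited child roles by following the type chain."""
    roles = []
    seen = set()
    while type_name not in seen:
        seen.add(type_name)
        parent, own = definitions[type_name]
        roles.extend(own)
        type_name = parent
    return roles
```

In the first table penguin never reaches elephant, so trunk-length is not among its roles; in the second it inherits the role through elephant.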

Note that types are units themselves. In the Universal-Text there is no categorical distinction between a unit and a type. On the contrary, the only fundamental category is the unit; a type is just a relationship a unit can have with another unit.

Unit References

It is not possible for a particular text unit to appear twice in the text structure, because each unit is restricted to have one parent, one type and one role unit. But one can set a so-called ”reference“ between units, which has a similar effect. References are set with the prefix ==. Suppose you have a table of contents containing an article list:

^article {
    ^title :string
    ...
}
^TOC {
    ^article :article
}

You could write the articles in place:

~TOC {
    ~article First Article {
    ...
    }
    ~article Second Article {
    ...
    }
}

But this forces a particular input order on you, and with many articles it becomes hard to read. Apart from that, this way you can have just one table of contents. What if you wanted different indexes for different media (say print and a shorter electronic edition)? With references you can define the articles anywhere and then refer to them by their identifiers in your article list:

~article =art1 First Article {
...
}
~article =art2 Second Article {
...
}
~TOC {
    ~article ==art1
    ~article ==art2
}

When navigating the table of contents, one automatically gets the data of the referred unit. For example:

select TOC.article do ln v title

gives this output:

First Article
Second Article

Note that a reference is not a unit type. When defining the article list above, we did not define it as a list of references to articles, but as a list of articles. Each list item can be an article or a reference to an article, and you can mix both at will.

Definition Lines

The reader may have noticed that the semantics of the text specifies units, types and roles, but does not say anything about definition words having the prefix ^. What is in a definition line?

The expression ^<name> is just a naming convention. It means something like =<name> ~<name>, that is, the unit that is being introduced gets a name and gets itself as its own role, too. If you do not give it a type, it gets itself as its own type, too.
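
Expressed as a tiny Python function (the name expand_definition and the dictionary keys are hypothetical), the convention reads:

```python
def expand_definition(name, type_name=None):
    """Expand ^<name> [:<type>] into an explicit name, role and type."""
    # Without an explicit type the unit becomes its own type.
    return {"name": name, "role": name, "type": type_name or name}
```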

Binary Data

The so-called ”binary data“ can be defined as all data that is not being further analyzed, that is not being structured as text. It is up to the user to decide where he or she ceases structuring data. The more you structure it, the more work you must do, but the finer-grained queries you can run against it, too. The interpreter allows you to associate chunks of data with units and retrieve them.

Note that ”binary data“ does not belong to the text theory. It is just an implementation decision of the Universal-Text Interpreter.

The interpreter defines the following root units:

^unit {
 ^unit
 ^binary
}

That is, every text unit in the system must necessarily be a descendant of "unit" or of "binary", the latter being flagged by the system as binary data.

The implementation of binary units is hardcoded and cannot be overridden. If you need some unit to have binary data, you must define it as having the type binary or, more commonly, as having the type of a descendant unit of binary.

The interpreter defines at startup these binary types:

^field :binary
^bool :field
^cardinal :field
^string :field
^ustring :string

Currently the system does not differentiate between these types, but it is recommended to use them precisely, for logical clarity and future compatibility. Currently binary data can only be parsed, stored and output as a character string.

If you need a type to contain strings, you define for example this:

^ title : string

Now you can enter binary data when defining a title:

~title My Great Homepage

Of course you can build your own types on the top of that. If you define a subtype of a binary type, it becomes automatically a binary type, too.

^ file {
    ^ line : string
    ^ comment : line
    ^ code : line
}
=to-do ~file {
    ~comment That must be done:
    ~code Function x() that returns the default value.
}

Here both "comment" and "code" are lines, thus strings, therefore they accept binary data. (You can find the above code in the file ”file.pl“ in the samples directory of the distribution files.)

Text Implementation

The Universal-Text Interpreter implements the text structure as a global array of scalars.

UNITS[ID]=(UNIT, REF, RID, TID, PID, BIN, CR, UP)

The index of an element is its internal ID. Each element consists of 8 scalars.

UNIT — A string containing the unit name (empty string if the unit has no name).
RID — The ID of the role unit
TID — The ID of the type unit
PID — The ID of the parent unit
BIN — A flag indicating if the unit is binary or not
REF — The ID of the referred unit, if == present
CR — The unit's creation timestamp
UP — The unit's last update timestamp

For example, the following text:

^ person
^ woman : person
^ man : person
^ family {
    ^ name : string
    ^ parent : person
    ^ child : person
}
=Smith ~family {
    ~name Smith
    ~parent =Mary : woman
    ~parent =John : man
}

is represented internally with these values (timestamps not shown):

 ID UNIT                 REF RID TID  PID  BIN BINARY
  0 unit                       0   0    0
  1 binary                     1   1    0    1
  2 field                      2   1    0    1
[...]
  5 string                     5   2    0    1
[...]
 34 person                    34  34    0
 35 woman                     35  34    0
 36 man                       36  34    0
 37 family                    37  37    0
 38 name                      38   5   37    1
 39 parent                    39  34   37
 40 child                     40  34   37
 41 Smith                     37  37    0
 42                           38  38   41    1 Smith
 43 Mary                      39  35   41
 44 John                      39  36   41
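
To make the table easier to read, the same rows can be loaded into a small Python structure and queried. This is only a reading aid under assumptions: the class Unit and the functions children and describe are hypothetical, a dictionary replaces the global array because some rows are elided, and the REF, CR and UP columns are left out.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str             # UNIT
    rid: int              # ID of the role unit
    tid: int              # ID of the type unit
    pid: int              # ID of the parent unit
    binary: bool = False  # BIN
    data: str = ""        # BINARY


UNITS = {
    0: Unit("unit", 0, 0, 0),
    5: Unit("string", 5, 2, 0, True),
    34: Unit("person", 34, 34, 0),
    35: Unit("woman", 35, 34, 0),
    36: Unit("man", 36, 34, 0),
    37: Unit("family", 37, 37, 0),
    38: Unit("name", 38, 5, 37, True),
    39: Unit("parent", 39, 34, 37),
    40: Unit("child", 40, 34, 37),
    41: Unit("Smith", 37, 37, 0),
    42: Unit("", 38, 38, 41, True, "Smith"),
    43: Unit("Mary", 39, 35, 41),
    44: Unit("John", 39, 36, 41),
}


def children(pid):
    """IDs of all units whose parent is pid (the root is its own parent)."""
    return [uid for uid, unit in UNITS.items() if unit.pid == pid and uid != pid]


def describe(uid):
    """Render one row back into UTL-like notation."""
    unit = UNITS[uid]
    parts = ["=" + unit.name] if unit.name else []
    parts.append("~" + UNITS[unit.rid].name)
    parts.append(":" + UNITS[unit.tid].name)
    if unit.data:
        parts.append(unit.data)
    return " ".join(parts)
```

For instance, describe(43) resolves the role and type IDs of row 43 back to names and gives =Mary ~parent :woman.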

There are some auxiliary arrays that extend the above and contain some indexes for each unit:

ROLES containing a hash with the children IDs grouped by role name
CHILDID containing a list of children IDs
CHILDREN containing a hash with the children IDs grouped by unit name
DEFCHILD containing the default child
DEFBIN containing the default binary child
BINARY containing the binary data for a unit

You can see the internal representation of the text currently in memory at the UText interactive shell with the instruction dump or dump <filename>. In a Perl script you get this with the class method list:

UText::list('<filename>');

This dumps the contents of the internal arrays in readable form into the file <filename>.