Textantrieb | UText/1 | UText/1.2 Manual

Round Tour

Introduction to the Universal-Text Interpreter

UText is an open-source text markup and script language. It is a tool for producing written output that automates logical dependencies while leaving the author in control of the content. It is also research software that applies, for the first time, the concept of text as a general-purpose data structure. See the-text.net for more about this research.

The Universal-Text is an implementation of the data structure ”text“. The software UText is based on it and can register arbitrary data, query it and transform it. To read data, UText provides the Universal-Text Language (UTL) and supports defining parsers in Perl, either to integrate custom formats into UTL expressions or to read documents in other file formats. To query and transform the data, UText provides a script language that can be used in batch processes and interactively in the UText Shell.

Universal-Text

Let us use Universal-Text Language (UTL) to introduce the data structure that we call ”Universal-Text“. At the beginning there is always a symbol definition.

 1: ^person

The above sentence creates the symbol person. Line numbers such as 1: are shown only for easy reference in the following paragraphs and are not part of the UTL language.

 2: ^ woman : person

After introducing a symbol, one can immediately instantiate it to create another symbol. The previous sentence creates the symbol woman as a particular occurrence of person. The Universal-Text calls this ”type“. We say: the symbol woman has the type person.

 3: ^ man : person
 4: ^ family {
 5:   ^ parent : person
 6:   ^ child : person
 7: }

The above code snippet defines a new symbol family with two subordinate symbols, parent and child, both of which are of type person. This means: a family consists of persons, each of whom plays the role of either a parent or a child in this family.

Let us now instantiate the type family:

=Smith ~family

The above introduces the symbol ”Smith“ as a particular family. But as we said before: a family consists of parents and children, therefore this family may consist of some particular parents and some particular children.

 8: =Smith ~family {
 9:    ~parent =Mary : woman
10:    ~parent =John : man
11:    ~child =Lena : woman
12:    ~child =Peter : man
13: }

The above means: The symbol Smith denotes a family that consists of two parents, a woman called Mary and a man called John, and two children, a woman called Lena and a man called Peter.

Note that this expression is only valid assuming the prior definition of the symbols family, parent, child, woman, man and person at lines 1-7. The Universal-Text Interpreter enforces the logical correctness of the expressions. If you write:

=Jones ~fam

The interpreter will abort with a message:

Error: Role 'fam' unknown

The same happens if you write something like the following:

14: ^country 
15: ~country =UK
16: =Jones ~family {
17:  ~child =Joan : country
18: }

The country UK at line 15 is accepted, because there is a declaration of the type country at line 14. But parsing aborts at line 17 with the message:

Error: Incorrect type 'country', expected: 'person'

The interpreter knows that both man and woman are of type person because of lines 2 and 3 above, and according to lines 5 and 6 it accepts a child of either of those types for a family. But it rejects a child of any other type, known or unknown.

The interpreter understands the underlying logical structure. That is its key feature. The text is parsed and it is recorded as symbols with specific relationships between them, namely the relationships: parent, role and type. This makes it possible to navigate and query the text, as we will see later.
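The type check described above can be modelled in a few lines of Python. This is only an illustrative sketch, not how UText is implemented: the class Symbol and the function check_unit are hypothetical names, and the error texts simply mirror the messages quoted in this tour.

```python
# Illustrative Python model of the type check; NOT UText's implementation.
class Symbol:
    """A declared symbol with an optional type and subordinate symbols."""

    def __init__(self, name, type_=None):
        self.name = name
        self.type = type_      # the symbol this one instantiates, if any
        self.children = {}     # subordinate symbols, keyed by name

    def is_a(self, other):
        """True if this symbol is `other` or has it in its type chain."""
        sym = self
        while sym is not None:
            if sym is other:
                return True
            sym = sym.type
        return False


def check_unit(container, role_name, unit_type):
    """Validate one subordinate unit (hypothetical helper)."""
    role = container.children.get(role_name)
    if role is None:
        raise ValueError(f"Role '{role_name}' unknown")
    if not unit_type.is_a(role.type):
        raise ValueError(
            f"Incorrect type '{unit_type.name}', expected: '{role.type.name}'"
        )


# Declarations corresponding to lines 1-7 and 14 of the tour.
person = Symbol("person")
woman = Symbol("woman", person)
man = Symbol("man", person)
country = Symbol("country")
family = Symbol("family")
family.children["parent"] = Symbol("parent", person)
family.children["child"] = Symbol("child", person)

check_unit(family, "child", woman)        # accepted: woman is a person

try:
    check_unit(family, "child", country)  # rejected, as at line 17
except ValueError as err:
    message = str(err)
    print("Error:", message)
```

The point of the sketch is that a role only needs to remember its declared type, and a candidate unit is accepted when that type appears anywhere in its type chain.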

The Universal-Text Interpreter also allows for unstructured data, apart from the structured data that we have just discussed, through the built-in type string.

^ book {
    ^ title : string
}
~ book {
    ~ title My Awesome Book
}    

If a unit is declared to have the type string (or a subtype of it), then every instance of it can have a character string attached.

Writing Prose

To write prose, one first defines a data structure. Say we want to write some articles about families. We could define a data type article this way:

^article {
    ^p :string
    ^title :string
    ^extract :string
    ^family : family
}

Then we can compose articles in plain text files, writing each paragraph on a separate line and prefixing the lines for other units with their role.

~article

~family ==Smith

~title The First Mention of Family Smith

~extract The oldest known occurrences of the family name Smith in America are the birth certificate of a William Smith in Virginia in 1621 and the church marriage of a Henrietta and Godric Smith in New England in 1632.

This article introduces the reader first to the family name Smith in America and then explains the difficulties of determining, among three candidates, the first occurrence of the name for the particular family that is the subject of this study.

The noun ”smith“ is, as is well known, a job title. Smith is an occupational name for a worker in metal, from Middle English smith (Old English smið, probably a derivative of smitan ‘to strike, hammer’). Metal-working was one of the earliest occupations for which specialist skills were required, and its importance ensured that this term and its equivalents were perhaps the most widespread of all occupational surnames in Europe. Medieval smiths were important not only in making horseshoes, plowshares, and other domestic articles, but above all for their skill in forging swords, other weapons, and armor. This is the most frequent of all American surnames; it has also absorbed, by assimilation and translation, cognates and equivalents from many other languages (for forms, see Hanks and Hodges 1988).

Source: Dictionary of American Family Names © 2013, Oxford University Press

As you can see, entering prose for articles in Universal-Text Language is very straightforward. Only a few tilde signs are needed to mark parts such as the title or the extract, so that we can operate on them semantically later on.

Depending on our particular needs for the family articles, we can use a more fine-grained document structure. For example, we could define a unit citation:

^ article {
    ...
    ^ citation {
        ^ content : string
        ^ source : string
    }
}

And enter the above article this way:

This article introduces [...]

The noun ”smith“ is, as is well known, a job title.

~citation {

Smith is an occupational name [...]

~source Dictionary of American Family Names [...]

}

This would allow us, for example, to format quotations appropriately in the output documents and to automatically generate a list of all sources used or a list of all quotes grouped by source.
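To make the benefit concrete, here is a small Python sketch of the two listings mentioned above. It is not UText Script: the dictionaries stand in for citation units, the function names are hypothetical, and the second citation is made-up placeholder data.

```python
from collections import defaultdict

# Stand-ins for citation units; the second entry is invented for illustration.
citations = [
    {"content": "Smith is an occupational name [...]",
     "source": "Dictionary of American Family Names"},
    {"content": "A second quotation (placeholder).",
     "source": "Hypothetical Source B"},
    {"content": "Another quotation from the dictionary (placeholder).",
     "source": "Dictionary of American Family Names"},
]


def list_sources(units):
    """Return every source once, in alphabetical order."""
    return sorted({unit["source"] for unit in units})


def group_by_source(units):
    """Map each source to the quotations taken from it, in input order."""
    grouped = defaultdict(list)
    for unit in units:
        grouped[unit["source"]].append(unit["content"])
    return dict(grouped)


for source, quotes in group_by_source(citations).items():
    print(source)
    for quote in quotes:
        print("  -", quote)
```

Both listings are only possible because content and source are stored as separate units rather than as one undifferentiated paragraph.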

It is up to you to decide whether to write all family articles in a single file or to distribute them across several files. To have the interpreter read files in UTL notation, use the Universal-Text Script command read:

read smith.utl, jones.utl

You could also use multiple files for each family and let the interpreter read whole directories:

read ./Smith/*.utl

You can even simplify your source files.

read ./Smith/*.utl begin
~article
~family ==Smith
\%content
end

With such a read command the UTL files do not need to repeat the same first lines for ~article and ~family <name>, which avoids redundancies and is especially useful if you eventually want to restructure your articles.

Alternative Formats

For your prose documents you may want to use a word processor such as LibreOffice Writer. The UText distribution contains an optional add-in module with which the interpreter can read documents in OpenDocument Format.

load.bind odt
read.odt smith.odt

Modules for other file formats can be programmed in Perl and registered in the UText Interpreter for the read.<format> function call.

Programming in Perl one can also define parsers for custom formats that can be embedded in UTL expressions. For example, you could simplify entering data for families with this format:

 <mother name>
 <father name>
 <female child name>
 ...
 [blank line]
 <male child name>
 ...

You write a Perl script that feeds ~parent and ~child units with the given names and types according to the above structure, and register it as the default format for the type family. Then you can write the following in your UTL files:

=Smith ~family [
Mary
John
Lena
[blank line]
Peter
]

When the Universal-Text Interpreter finds the opening square (instead of curly) bracket, it calls the default parser for the type family to read the following lines until the closing square bracket.
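The logic such a parser has to implement is small. The sketch below shows it in Python purely for illustration; an actual UText parser is written in Perl, and the registration API is not shown here. The function names parse_family and to_utl are hypothetical, and the placeholder [blank line] from the example is represented by an empty string.

```python
def parse_family(lines):
    """Turn the compact format into (role, name, type) triples.

    Assumed layout, following the tour: mother, father, female children,
    a blank line, then male children.
    """
    if len(lines) < 2:
        raise ValueError("expected at least a mother and a father line")
    units = [
        ("parent", lines[0].strip(), "woman"),
        ("parent", lines[1].strip(), "man"),
    ]
    child_type = "woman"
    for line in lines[2:]:
        name = line.strip()
        if not name:
            child_type = "man"   # the blank line switches to male children
            continue
        units.append(("child", name, child_type))
    return units


def to_utl(units):
    """Render the triples in the UTL notation used earlier in this tour."""
    return [f"~{role} ={name} : {type_}" for role, name, type_ in units]


source = ["Mary", "John", "Lena", "", "Peter"]
for line in to_utl(parse_family(source)):
    print(line)
```

For the Smith example this yields the same four ~parent and ~child lines that were written by hand at lines 9-12.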

You can also write a parser in Perl and register it in the interpreter under a name instead of binding it to a type. Then you can invoke the parser by its name at any place inside a UTL file.

[*my_parser
... lines in my format ...
]

Type-bound and standalone parsers of this kind can be written in Perl with a few lines of code and can help you achieve simpler and more compact source files tailored to your needs.

Text Queries

Once we have entered some data, we can let the interpreter query it according to its structure. Let us now see how to write queries in the UText Script language.

Imagine we have entered some data about families into the interpreter (see the file family.utl in the samples directory of the distribution files for the example used below). We could get a list of the registered families at the UText Shell with the command:

select family

This lists all text units that have the role family. In our example these are the results:

ut> select family
=Smith ~family
=Jones ~family
=Wagner ~family

To get all members of a particular family, we can use the selector =Jones.?. The question mark is a wildcard that stands for any unit, and the equal sign restricts the match to the unit with the given name. This selector thus retrieves all units directly subordinate to the unit named ”Jones“.

ut> select =Jones.?
=John ~parent :man
=Peggy ~parent :woman
=David ~child :man

To get the parents of this family, restrict to the role parent instead.

ut> select =Jones.parent
=John ~parent :man
=Peggy ~parent :woman

If we want to get all female members of a particular family, we restrict to this type using the colon character.

ut> select =Smith.:woman
=Mary ~parent :woman
=Lena ~child :woman

To get the children from all families, do:

ut> select ?.child
=Lena ~child :woman
=Peter ~child :man
=David ~child :man
=Mona ~child :woman
=Lisa ~child :woman

With selectors, you can retrieve units according to their parent, role and type relationships with other units. Selectors offer many more possibilities, such as sorting and filtering; see the manual page Text Selectors for details.
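The four selector forms used above (=name, ?, role and :type) can be mimicked with a short Python model. It is a sketch under stated assumptions, not UText's selector engine: the first segment is assumed to match units at any depth, each later segment matches direct subordinates only, types are compared by exact name, and the data is a subset of the sample families.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str
    role: str
    type: str = ""
    children: list = field(default_factory=list)


def matches(unit, segment):
    """Check one selector segment against one unit."""
    if segment == "?":
        return True                       # wildcard: any unit
    if segment.startswith("="):
        return unit.name == segment[1:]   # restrict by name
    if segment.startswith(":"):
        return unit.type == segment[1:]   # restrict by type (exact match only)
    return unit.role == segment           # otherwise restrict by role


def walk(units):
    """Yield every unit in the forest, depth first."""
    for unit in units:
        yield unit
        yield from walk(unit.children)


def select(roots, selector):
    """Evaluate a dot-separated selector (hypothetical helper)."""
    first, *rest = selector.split(".")
    current = [u for u in walk(roots) if matches(u, first)]
    for segment in rest:
        current = [c for u in current for c in u.children if matches(c, segment)]
    return current


smith = Unit("Smith", "family", children=[
    Unit("Mary", "parent", "woman"),
    Unit("John", "parent", "man"),
    Unit("Lena", "child", "woman"),
    Unit("Peter", "child", "man"),
])
jones = Unit("Jones", "family", children=[
    Unit("John", "parent", "man"),
    Unit("Peggy", "parent", "woman"),
    Unit("David", "child", "man"),
])
roots = [smith, jones]

for selector in ["family", "=Jones.?", "=Jones.parent", "=Smith.:woman", "?.child"]:
    print(selector, "->", [u.name for u in select(roots, selector)])
```

Each segment narrows the current result set by exactly one of the three relationships, which is why selectors compose so easily.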

Output Tags

In a UText Script one can output a string with the function out.

ut> out hello
hello

The output function expands the so-called tags, which are set in square brackets and delimit a substring segment inside a string.

For example:

out This is [fam Smith]

Before the UText Interpreter can output the above, you must define a binding for the tag fam. For example, you can bind it to the string ”family“ followed by the tag parameter:

ut> declare tag fam to family \%param 
ut> out This is [fam Smith]
This is family Smith
ut> out Here we talk about [fam Johnson] for the first time.
Here we talk about family Johnson for the first time.

The selectors can execute a command for each of the matching units.

ut> select family do ln out This is [fam [u]]
This is family Smith
This is family Jones
This is family Wagner

The tag u is a built-in tag that returns the name of a text unit. Tags can trigger selectors (as u does) and can themselves call other tags and functions, so a one-line command can be very expressive.

ut> select family do ln out Family [u].  Mother [u parent * :woman], Father [u parent * :man].
Family Smith. Mother Mary, Father John.
Family Jones. Mother Peggy, Father John.
Family Wagner. Mother Michaela, Father Rudolf.

Output tags are particularly useful for producing documents in several formats from the same source. For example, you can define a tag wp for Wikipedia references (see Output Processors for a more detailed example) and use it in your source files:

The first American Smith generation arrived in [wp Virginia] after sailing from [wp Wales].

To generate HTML files you bind the tag to a hyperlink:

ut> declare tag wp to <a href="http://en.wikipedia.org/wiki/\%param">\%param</a>
ut> out arrived in [wp Virginia] after sailing from [wp Wales]
arrived in <a href="http://en.wikipedia.org/wiki/Virginia">Virginia</a> after sailing from <a href="http://en.wikipedia.org/wiki/Wales">Wales</a>

And to generate LaTeX files you change the binding:

ut> set tag wp to \href{http://en.wikipedia.org/wiki/\%param}{\%param}
ut> out arrived in [wp Virginia] after sailing from [wp Wales]
arrived in \href{http://en.wikipedia.org/wiki/Virginia}{Virginia} after sailing from \href{http://en.wikipedia.org/wiki/Wales}{Wales}

Using output tags, your source files become more compact and manageable and remain independent of the target document formats. You can find a thorough description of tags in Output Processors.
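The mechanism of binding one tag to different templates can be imitated in Python as follows. This is a simplified sketch, not UText's output processor: the function expand is a hypothetical name, it handles only non-nested tags of the form [name parameter], and leaving unknown tags untouched is an assumption. The HTML binding here uses a properly closed anchor element.

```python
import re

# Matches a non-nested tag such as [wp Virginia].
TAG = re.compile(r"\[(\w+) ([^\[\]]*)\]")


def expand(text, bindings):
    """Replace each bound tag by its template with \\%param filled in."""
    def replace(match):
        name, param = match.group(1), match.group(2)
        if name not in bindings:
            return match.group(0)   # unknown tag: left as is (an assumption)
        return bindings[name].replace(r"\%param", param)
    return TAG.sub(replace, text)


html = {"wp": r'<a href="http://en.wikipedia.org/wiki/\%param">\%param</a>'}
latex = {"wp": r"\href{http://en.wikipedia.org/wiki/\%param}{\%param}"}

sentence = "arrived in [wp Virginia] after sailing from [wp Wales]"
print(expand(sentence, html))
print(expand(sentence, latex))
```

The source sentence never changes; only the table of bindings passed to the expansion step decides whether HTML or LaTeX comes out.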

Creating a Website

We have already visited the key components of the Universal-Text Interpreter. We know how to define data structures and how to enter data, how to navigate and query the data with selectors and how to format the input (parsers) and the output (tags).

Let us now see a practical application of the system. Let us generate a website regarding the families we collected information about.

To begin, we define the data structure. We start with a minimal setting for a website consisting of webpages, each of which has a title and content (headings, paragraphs and lists):

^website {
    ^webpage {
        ^title : string
        ^content {
            ^p : string
            ^h1 : string
            ^ul {
                ^li : string
            }
        }
    }
}

Let us begin with the cover page of the website.

=geneaweb ~website
=index ~webpage
~title The Genealogy Site
~content
    ~h1 The Genealogy Site
    Welcome to the Genealogy site!
    This Site reports the history of families:
    ~ul {
        Smith
        Jones
        Wagner
    }

This way we could write some pages of our website manually.

But the interpreter can also supply part of the page contents automatically, provided it can be queried from the stored data. For example, the interpreter can provide the list of families above. We rewrite the list this way:

    ~ul {
    [*script
    select =tour.family do ln out ~li [u]
    ]
    }

The *script tag induces the interpreter to execute the following script instructions until the closing square bracket and to parse their output. The output of the instruction

select =tour.family do ln out ~li [u]

is the three lines:

    ~li Smith
    ~li Jones
    ~li Wagner

The complete webpage registered by the interpreter is thus exactly the same as the one we first wrote manually, but the family list now reflects the stored data dynamically. That is, if we later collect information about additional families, they will appear automatically on this page.

Besides generating a part of a particular page's content, the interpreter can generate whole pages. Say we want to have a separate webpage for each family. This can be achieved with the following UText script:

select =tour.family begin
    out begin
        =tour
        =geneaweb
        =[u] ~webpage
        ~title Family [u]
        ~content
            ~h1 Family [u]
    end
end

After executing the script, the repository contains these units:

=Smith ~webpage {
    ~title Family Smith
    ~content {
        ~h1 Family Smith
    }
}
=Jones ~webpage {
    ~title Family Jones
    ~content {
        ~h1 Family Jones
    }
}
=Wagner ~webpage {
    ~title Family Wagner
    ~content {
        ~h1 Family Wagner
    }
}

We could now complete the family pages with, say, a list of all articles about each family, showing the title and the extract and linking to a separate webpage that shows a single article with its complete content. This would be achieved by similar means, using selectors that generate text units which the interpreter then reads back in.

Generating HTML files

We have seen so far how to collect information about some topics and build a website around it, with pages that are written manually or written partially or completely by the interpreter according to the stored data. But so far all the data exists only inside the interpreter. How do we get .html files to upload to the web server?

The units of type webpage contained in the running interpreter instance can be serialized as plain text files using UText Script. A very simple serialization could look like this:

select webpage begin
    save [u].html begin
        ln out <html><head><title>[v title]</title></head><body>
        select content.? do case begin
            when role h1 do ln out <h1>[v]</h1>
            when role p do ln out <p>[v]</p>
            when role ul begin
                ln out <ul>
                select li do ln out <li>[v]</li>
                ln out </ul>
            end
        end
        ln out </body></html>
    end
end

The script loops through the four existing webpage units and creates a file for each of them named respectively index.html, Jones.html, Smith.html and Wagner.html. After creating each file and setting its header, the interpreter steps over each unit under the role content and generates HTML code according to each unit's role and attached string.
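For readers more familiar with general-purpose languages, the same traversal can be sketched in Python. This is an illustration of the logic only, not a replacement for the UText script: content units are modelled as hypothetical (role, value, children) tuples, the result is returned as a string instead of being saved to a file, and the HTML escaping is an addition that the script above does not perform.

```python
from html import escape


def render_content(units):
    """Emit one HTML line per content unit, according to its role."""
    lines = []
    for role, value, children in units:
        if role in ("h1", "p"):
            lines.append(f"<{role}>{escape(value)}</{role}>")
        elif role == "ul":
            lines.append("<ul>")
            lines.extend(
                f"<li>{escape(item)}</li>"
                for item_role, item, _ in children
                if item_role == "li"
            )
            lines.append("</ul>")
    return lines


def render_page(title, content):
    """Wrap the rendered content in a minimal HTML skeleton."""
    header = f"<html><head><title>{escape(title)}</title></head><body>"
    return "\n".join([header] + render_content(content) + ["</body></html>"])


# The cover page from this tour, modelled as nested tuples.
index_content = [
    ("h1", "The Genealogy Site", []),
    ("p", "Welcome to the Genealogy site!", []),
    ("p", "This Site reports the history of families:", []),
    ("ul", "", [("li", "Smith", []), ("li", "Jones", []), ("li", "Wagner", [])]),
]

page = render_page("The Genealogy Site", index_content)
print(page)
```

As in the UText script, the role of each unit alone decides which markup is produced, so supporting a new role means adding one more branch.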

The samples ”tour.utl“ (website generation) and ”tour-html.utl“ (HTML file generation) can be found in the samples directory of the distribution files.

This example site and serialization script are of course just a simplistic exercise to show the basics. Real document generation is more complex and makes heavier use of the interpreter, defining more types and generic recursive functions. But real UText scripts, although longer than those we have seen in this tour, are typically under 100 lines long and clearly much shorter than the scripts one would have to write in a general-purpose script language such as Perl or Python to achieve the same goals.

Summary

In this round tour we have seen the main idea of the Universal-Text Interpreter. With the Universal-Text Language one defines a text structure consisting of symbols with relationships between them and some literal data. The system ensures that the text structure is coherent and provides query capabilities to traverse the text and present it under different views. With simple UText scripts one can define transformations on the text that generate more text and final documents. We have built a sample website by entering structured and unstructured data, processing it to generate pages and serializing them. We could also write UText scripts to generate, say, a LaTeX document from the same source files. Once the data structure is defined and the scripts are written, we can concentrate on collecting data and writing articles about our subject, and the system will automatically generate up-to-date documents to publish (e.g. HTML, LaTeX or PDF files), taking care of all formatting, indexing, listing and cross-referencing.

An important characteristic of the UText system is that it imposes no formatting rules. UText prescribes neither an input nor an output format. Users can easily adapt the formatting of source and target files to their own needs.

Finally, I want to stress the points that distinguish the Universal-Text Interpreter from document generation and web content management software. For one thing, UText can be used for many more purposes than just document generation, such as generating source code and providing information lookup applications. But generating documents with it is also a very different experience. UText does not dictate a data structure. Users define their own data structures according to their needs and are not limited to a document-oriented data structure. UText is especially useful for representing all kinds of semantics directly, which is not only rewarding for the intellectual work but also simplifies all programmatic procedures. Furthermore, the user is not enslaved by the data structures she initially came up with, but can reshape them at any time with a few changes in the source files.