Feeding Text

Before a text can be navigated and queried, it must be entered into the system. A text can be entered programmatically by a UText script or a Perl script, or it can be read by the interpreter in UTL format from source files or strings.

Reading UTL Data

Entering Strings

With the tag feed one can enter UTL strings. The interpreter parses the given UTL string and records it. For example, the following UText Script instruction defines a unit named woman with the type person:

feed ^woman :person

It is also possible to enter multiple lines at once by enclosing them between begin and end.

feed begin
^ person
^ woman : person
^ man : person
^ family {
    ^ name : string
    ^ parent : person
    ^ child : person
}
end

If the UTL string contains any of the reserved lowercase words begin, end or do, they must be escaped with \ so that the interpreter does not take them for instruction separators.

feed \begin with an easy case
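When a program generates the UTL string, the escaping can be automated. The following is a minimal Perl sketch and not part of UText: the helper name escape_reserved is hypothetical, and it assumes that only the whole lowercase words begin, end and do need a leading backslash.

```perl
use strict;
use warnings;

# Hypothetical helper (not a UText function): prefix the reserved
# lowercase words begin, end and do with a backslash so that they are
# not taken for instruction separators. Words that already carry a
# backslash are left unchanged.
sub escape_reserved {
    my ($utl) = @_;
    $utl =~ s/(?<!\\)\b(begin|end|do)\b/\\$1/g;
    return $utl;
}

print 'feed ', escape_reserved('begin with an easy case'), "\n";
```

The match is case sensitive and limited to whole words, so Begin or doing are not touched.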

With a Perl script one can also feed text. The UText object provides the method read to enter UTL strings directly.

After instantiating a UText object

use UText::UText;
$ut = new UText;

one can call its method read:

$ut->read('^woman :person');

Using Perl here-docs one can enter multiline strings:

$ut->read(<<'END');
^ person
^ woman : person
^ man : person
^ family {
    ^ name : string
    ^ parent : person
    ^ child : person
}
END

The content read is appended at the current writing position of the UText object. Each write operation affects the writing position. Note that the current write position is independent of the current read position, which is the one changed by select or foreach operations. If you want to write text into a particular unit, you must first go to this unit with a change unit (cu) call.

Reading Files Immediately

You can save the text source into a plain text file and let the interpreter read it using the tag read or the Perl method readfile.

For example, you can save the above string as a text file under the name "family.utl" and read it with the UText script:

read family.utl

The default extension is .utl, so the above can also be written as:

read family

To read this with a Perl script:

use UText::UText;
$ut = new UText;
$ut->readfile('family.utl');
$ut = undef;

Directories

A simple file name such as "family.utl" above is expected to be in the current working directory. To specify another directory, prepend it to the file name:

read /home/john/geneaweb/utl/family.utl

$ut->readfile('/home/john/geneaweb/utl/family.utl');

The path naming conventions depend on the operating system the script is running on. On Windows one would write something like this:

read C:\Documents and Settings\John\geneaweb\utl\family.utl

When reading a file, the current working directory is set to the directory where the file is placed. This is relevant if the file contains a [*script] section: the initial working directory of the script is always the directory where the file is saved regardless of the directory from where the file is being read.

Reading Multiple Files

One can let the interpreter read more than one file with a single command:

read family.utl, smith.utl, smithereen.utl, clark.utl

File names are separated by commas, and the whole instruction can span more than one line:

read first file,
    second file,
    third file

To read multiple files in Perl:

$ut->readfile('family.utl','smith.utl','smithereen.utl','clark.utl');

It is also possible to read all files whose names match a pattern using the wildcards * or ?:

read data/*.utl

Or in Perl with the function readfiles:

$ut->readfiles("data/*.utl");

This reads all files in the subdirectory data that have the file name extension ".utl". By default, file names ending with ~ or .bak are ignored when expanding wildcards. To change this, modify SKIPFILE.

If you list the files explicitly, they are read in the order you have given. If you read them with a wildcard pattern, they are read in alphabetical order by name.
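The expansion rules described above can be pictured in plain Perl. This is only a sketch under stated assumptions, not the interpreter's implementation: the helper expand_pattern is hypothetical, and the skip rule is modelled as a regular expression, which may differ from how SKIPFILE is really defined.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Basename qw(basename);

# Assumption: the skip rule is modelled as a regular expression that
# matches names ending with ~ or .bak.
my $skipfile = qr/(?:~|\.bak)\z/;

# Hypothetical helper: expand a wildcard pattern, drop skipped names,
# and return the remaining files in alphabetical order by name.
sub expand_pattern {
    my ($pattern) = @_;
    my @files  = grep { -f $_ && basename($_) !~ $skipfile } glob($pattern);
    my @sorted = sort { basename($a) cmp basename($b) } @files;
    return @sorted;
}

# Demonstration in a throwaway directory.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(smith.utl clark.utl family.utl family.utl~ old.utl.bak)) {
    open(my $fh, '>', "$dir/$name") or die "cannot create $name: $!";
    close($fh);
}
print basename($_), "\n" for expand_pattern("$dir/*");
```

In this demonstration the backup copies family.utl~ and old.utl.bak are left out, and the remaining names are listed alphabetically.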

Skipping Already Read Files

If you issue read commands for a particular file more than once in the same session, the interpreter silently ignores the calls after the file has been read once.

This can be useful to force a particular file to be read first while still using a wildcard for the rest. Example:

read scheme, *.utl
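The bookkeeping behind this behaviour can be sketched with a hash of names already seen. The helper read_once is hypothetical, and the sketch assumes that files are tracked by the name as given; the interpreter may normalize paths differently.

```perl
use strict;
use warnings;

my %already_read;   # file names read so far in this session
my @read_order;     # order in which files were actually read

# Hypothetical helper: "read" each file at most once per session.
# Repeated requests for the same name are silently ignored.
sub read_once {
    for my $file (@_) {
        next if $already_read{$file}++;
        push @read_order, $file;   # a real reader would parse the file here
    }
    return;
}

# scheme.utl is named first and then shows up again, as it would
# in the expansion of the wildcard *.utl.
read_once('scheme.utl', 'clark.utl', 'scheme.utl', 'smith.utl');
print join(', ', @read_order), "\n";
```

The second request for scheme.utl has no effect, so the file keeps its position at the front of the reading order.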

Read Preprocessing and Alternative File Formats

The read operation offers further possibilities:

The UTL can be preprocessed before parsing, using predefined arguments.
One can read word processor files in OpenDocument Format with the modifier odt, for example [read.odt document.odt].
A custom add-in module can add support for more file types.

See the odt add-in module and the Perl function getfile for more information.

Differences Reading Strings and Files

The content fed is exactly the same whether it is read from a string or from a file, so any source can be given either way. There are, however, two differences: context and node times.

Context

When reading strings, the internal cursor position of the interpreter is kept. For example, if you perform:

$ut->read('~family =Smith');
$ut->enter();
$ut->read('=Jane ~parent');
$ut->leave();

after the enter() instruction, the next read line gets ~family as parent.

If you perform a readfile, the cursor position is reset rather than kept. If you save the single line =Jane ~parent into the file jane.utl and then you perform:

$ut->read('~family =Smith');
$ut->enter();
$ut->readfile('jane.utl');
$ut->leave();

this is not accepted: execution stops with an error stating that no role ~parent is known under the unit named unit. Before reading a file, the interpreter sets its cursor back to unit. This way, reading a particular file always has the same effect regardless of where the file is read from.

As a consequence, each file must consist of whole lines or whole line blocks, not parts of them. Brackets must be balanced within every file that is read. This restriction does not apply to individual strings passed to read.
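A generator script can check this rule before it writes a file. The following is a deliberately naive Perl sketch: the helper brackets_balanced is hypothetical, it only considers the characters { } [ ], and it assumes that no brackets occur inside text values, which the real UTL parser may treat differently.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if every opening bracket in the text
# has a matching closing bracket in the right order, 0 otherwise.
sub brackets_balanced {
    my ($text) = @_;
    my %closer = ('{' => '}', '[' => ']');
    my @expected;   # stack of closing brackets still awaited
    for my $ch ($text =~ /[{}\[\]]/g) {
        if (exists $closer{$ch}) {
            push @expected, $closer{$ch};
        }
        else {
            return 0 unless @expected && pop(@expected) eq $ch;
        }
    }
    return @expected ? 0 : 1;
}

my $whole_block = "^ family {\n    ^ name : string\n}\n";
my $fragment    = "^ family {\n    ^ name : string\n";
print "whole block: ", brackets_balanced($whole_block), "\n";
print "fragment:    ", brackets_balanced($fragment), "\n";
```

A fragment like the second one could still be passed piecewise as strings to read, but it would not be suitable as the content of a file.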

Node Times

If you create text with readfile($filename) the creation and update time for the created nodes is automatically set to the creation and update time of the source file as returned by the operating system. If you create text with read($string) both are set to the parsing time.
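For orientation, this is how a Perl script can ask the operating system for a file's time. It is a sketch only: the helper file_mtime is hypothetical, and it looks at the modification time alone. Which fields of stat the interpreter actually uses for creation and update time is not specified here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use POSIX qw(strftime);

# Hypothetical helper: return the last modification time of a file
# as epoch seconds, as reported by the operating system.
sub file_mtime {
    my ($file) = @_;
    my @st = stat($file) or die "cannot stat $file: $!";
    return $st[9];   # element 9 of stat() is the modification time
}

# Demonstration: give a throwaway file a known time and read it back.
my ($fh, $source) = tempfile(UNLINK => 1);
close($fh);
my $stamp = 1_000_000_000;   # 2001-09-09 01:46:40 UTC
utime($stamp, $stamp, $source) or die "cannot set times on $source: $!";
print strftime('%Y-%m-%d %H:%M:%S', gmtime(file_mtime($source))), "\n";
```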

When generating text from a file source one can use setfiletime to get the node times to be set according to the source file. For example:

open(my $in, '<', $source) or die "Cannot open $source: $!";
$ut->setfiletime($source);
while (my $line = <$in>) {
    # ... process $line and generate utext
    # accordingly through $ut ...
}
close($in);

Now all units that are appended in the while-block get the same timestamps (both the creation and the last modification time) as the source file, as returned by the operating system. If the setfiletime instruction were omitted, their creation and update times would be the execution time of the script.

If a unit has a role named timestamp, the system updates the unit's creation and update times accordingly. Such units are expected to contain a binary value indicating a timestamp, which must be a string formatted as for the tag time.

If the generation occurs as a text transformation that appends new nodes based on existing ones, one can use setnodetime instead. This way the generated nodes get the same timestamps as the original ones.

Reading Files On Demand

~loadfile

Every text unit can have a child with the role "~loadfile". This determines the name of the file to be read in order to get the unit's contents. For example:

~website =geneaweb {
    ~loadfile /home/john/geneaweb/utl/family.utl
}

A unit can have more than one ~loadfile child; all of them are read when necessary.

When the interpreter parses this, it does not read the file family.utl immediately. At first the unit geneaweb gets just the one child ~loadfile. If the script you are running does not enter this unit, the file is never read. But the first time a script wants to access this unit, the file family.utl is read.

The interpreter notes when a unit with deferred loading is going to be accessed for the first time and triggers the file load. This happens for example the first time a Perl script jumps into this unit and then performs an enter(), or if it opens a cursor that penetrates into it. This happens also if another unit refers to a descendant of it, for example referring to "geneaweb.Smith".
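The idea of deferred loading can be modelled in a few lines of Perl. This is an illustration only and does not show how the interpreter is implemented: the hash layout and the helper enter_unit are hypothetical, and the actual file reading is replaced by a log entry.

```perl
use strict;
use warnings;

# Hypothetical model of a unit with deferred loading.
my %geneaweb = (
    loadfile => ['/home/john/geneaweb/utl/family.utl'],
    loaded   => 0,
);
my @log;   # records which files would have been read

# Hypothetical helper: the first access triggers the load,
# later accesses find the unit already loaded.
sub enter_unit {
    my ($unit) = @_;
    if (!$unit->{loaded}) {
        push @log, "reading $_" for @{ $unit->{loadfile} };
        $unit->{loaded} = 1;
    }
    return $unit;
}

print scalar(@log), " file(s) read after defining the unit\n";
enter_unit(\%geneaweb);   # first access: the file is read
enter_unit(\%geneaweb);   # second access: nothing happens
print "$_\n" for @log;
```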

~loaddir

With the role ~loaddir you can specify which directories are searched when loading files on demand. Example:

~website =geneaweb {
    ~loaddir /home/john/geneaweb/utl
    ~loadfile family.utl
    ~loadfile Smith.utl
    ~loadfile Clark.utl
}

When the files are loaded on demand, they are looked up in the directory /home/john/geneaweb/utl.

A unit can have more than one ~loaddir child. In that case the directories are searched until the first file with a matching name is found.
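Such a search can be sketched as follows. The helper find_in_loaddirs is hypothetical, and the sketch assumes that the directories are tried in the order in which they are listed, which is not stated explicitly above.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: return the path of the first existing file with
# the given name, trying the directories in the order given.
sub find_in_loaddirs {
    my ($name, @dirs) = @_;
    for my $dir (@dirs) {
        my $path = File::Spec->catfile($dir, $name);
        return $path if -f $path;
    }
    return;   # not found in any directory
}

# Demonstration with two throwaway directories.
my $first  = tempdir(CLEANUP => 1);
my $second = tempdir(CLEANUP => 1);
for my $path ("$second/family.utl", "$first/Smith.utl", "$second/Smith.utl") {
    open(my $fh, '>', $path) or die "cannot create $path: $!";
    close($fh);
}
my $family = find_in_loaddirs('family.utl', $first, $second);
my $smith  = find_in_loaddirs('Smith.utl',  $first, $second);
print "family.utl: $family\n";
print "Smith.utl:  $smith\n";
```

Smith.utl exists in both directories here, and under the stated assumption the copy in the first directory wins.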

Root UTL File

The Universal-Text Interpreter has a root UTL file that gets read once at startup. This file can contain a list of jump points to files defining main units. For example:

=Websites
~loadfile /home/john/www/utl/index.utl
=Job
~loadfile /home/john/job/index.utl
~loadfile /home/mary/job/index.utl
=Kids
~loadfile /home/wendy/index.utl
~loadfile /home/paul/index.utl

This way, scripts do not need to give particular file names. If a UText script begins with:

select =Websites.=geneaweb.webpage [...]

or a Perl script begins with:

use UText::UText;
$ut = new UText;
$ut->foreach('=Websites.=geneaweb.webpage', [...]

you get all pages of the website geneaweb. If you later change the file locations, you just need to change the index files and not each script.

Note that in each script only the files that are really needed are actually read.

The root file can also be used to load the add-in modules that are generally needed. Example:

[* script
load odt
]

Configuration: The name of the UTL root file is defined in the variable $UText::INDEXFILE. By default it is root.utl. At startup, a file with that name is read if it is found in the directory where the interpreter module UText.pm is located or in the current working directory. If both exist, both are read, in that order.

Generator Perl Scripts

A Perl script can also generate a text by driving the UText class directly, bypassing the language module UTL. The following UText methods are available:

set — Creates a text unit or jumps to an existing one

def — Defines a new text unit

enter — Enters into the child unit level

leave — Leaves the current unit level and returns to parent level

parse — Invokes an alternate parser

transform — Invokes a text transformation on the current unit

These methods are described in the manual page UText.pm.

For example, the following two scripts produce exactly the same text. (You can find this code in the file generate.pl in the samples directory of the distribution.)

Reading a UTL string:

use UText::UText;
$ut = new UText;
$ut->read(<<'END');
^webpage {
    ^title : string
}
~webpage =index
~title Overview
END
$ut = undef;

Calling the UText methods directly:

use UText::UText;
$ut = new UText;
$ut->def({def=>'webpage'});
$ut->enter();
$ut->def({def=>'title', type=>'string'});
$ut->leave();
$ut->set({role=>'webpage',unit=>'index'});
$ut->set({role=>'title',bin=>'Overview'});
$ut = undef;

A Perl script can combine reading UTL strings and files with direct UText method calls at will.

Another possibility is to write a UText parser that can be bound to a text type so that when processing this type a special format or language can be embedded in UTL strings and files. See Alternative Parsers for more on this.