Requirements

RDFTEF needs the latest CVS versions of Jena2 and ARQ. Working versions of both are provided with RDFTEF in the lib/ directory. The other packages in lib/ are required by Jena and ARQ.

RDFTEF was developed and tested with Java JDK 1.5. It has not been tested with JDK 1.4, though it might work with it.


Download the Software

Download the latest version from the project's CVS repository:

cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/rdftef login
cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/rdftef co -P RDFTEF2

For more details on how to do this, see the SourceForge instructions.


Files

The program has the following directory structure:

bin/ Contains the Java class files;
data/ Has example files and required configuration files:
     DivCommedia.owl This is the ontology file. It is required by RDFTEF to be here.
     result2-to-html.xsl This is a stylesheet file used to convert query results to HTML.
     CantoI-logic.xml An example file containing a slice of the "Divina Commedia", from Dante Alighieri, annotated in TEI format with logical information, like lines.
     CantoI-sint.xml Another example file, containing the same text as the above file, but with TEI annotations about the syntactic structure, like sentences and propositions.
     DivModel.xml An example file containing the RDF model obtained by RDFTEF from the two example files above.
doc/ Contains documentation files and the javadoc files.
lib/ Contains the required libraries besides Java JDK.
src/ The Java source files.


Running

Go to the root directory of RDFTEF (where it was copied). Make sure the DivCommedia.owl file is in the data/ directory. Run rdftef.bat if the operating system is MS Windows, or ./rdftef.sh if it is Linux. If this does not work, make sure the lib/ and bin/ directories are in your Java CLASSPATH (along with your Java 1.5), then run the program with java org.rdftef.Main.

Follow the program's usage steps in the next section.

Usage

When the program starts, it presents a command shell. All commands below are to be entered inside this shell.

First, enter help to see a quick reference of the available commands. The glossary command prints a short description of some of the acronyms used.

The program has three main modules: importing new models or loading existing ones, exporting or saving the model in memory, and querying the internal model, with some special functions. The three are detailed below.

Import and Load

To start making the program useful, type import data/CantoI-sint.xml, for example. This imports the TEI XML file CantoI-sint.xml into the internal RDF model. See the TEI tags section for the supported TEI annotation tags.

Another possibility is to import a plain text file, in which case it is annotated with words, punctuation symbols, and sentences (treating the characters .!?;: as final punctuation).
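
The plain text annotation described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not RDFTEF's actual (Java) importer; the function names tokenize and split_sentences are hypothetical, and the rule that a word is a run of letters or digits is assumed from the description of the #Word class later in this document.

```python
# Illustrative sketch only; RDFTEF's real importer is written in Java and may differ.
FINAL_PUNCTUATION = set(".!?;:")  # the sentence-ending marks listed above


def tokenize(text):
    """Split text into words (runs of letters/digits) and single punctuation marks."""
    tokens, word = [], []
    for ch in text:
        if ch.isalnum():
            word.append(ch)
            continue
        if word:  # a non-word character closes the current word
            tokens.append("".join(word))
            word = []
        if not ch.isspace():
            tokens.append(ch)  # any other visible character is a punctuation token
    if word:
        tokens.append("".join(word))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each final punctuation mark."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in FINAL_PUNCTUATION:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation (assumed to form a sentence)
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences(tokenize("Nel mezzo del cammin; mi ritrovai!")):
        print(sentence)
```

Whitespace is dropped and every other non-alphanumeric character becomes its own token, so a ";" or ":" closes a sentence just as "." does.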

If an RDF model file previously saved by RDFTEF already exists, it can be loaded with load data/DivModel.rdf, for example.

Multiple imports can be issued so that their contents are merged in the internal model. Note, however, that although this is a very useful feature, not all error cases are handled. Open source community contributions and further development will improve this feature.

Export and Save

To save the internal model use save data/DivModel.rdf, for example. The model is written in RDF/XML format.

An interesting feature of the program is exporting to TEI XML files. Various options can be used. The general command is export [OPTIONS] teifile.xml. The OPTIONS can be:

--includeformatting Specify that formatting structures (line and page breaks) should be output;
--includesyntax Specify that the syntactic structures (propositions) should be output;
--firstGROUP=NUMBER Specify that the first group to be output is the NUMBERth, counting from the very first group in the model (which is number 1);
--lastGROUP=NUMBER Analogous to the above option. If NUMBER is greater than the total number of GROUPs, then the last one is used.
--markclass http://uri#class=TAG
  Use the given TAG to mark the specified class. For example, if --markclass http://ontology#Periodo=b is provided, the exported file will contain structures like <b>symbol1 symbol2</b>. Currently, TAG cannot contain spaces, so things like ..class=hi b are not yet supported.
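
The behaviour of the range and --markclass options can be sketched as follows. This is a Python illustration under stated assumptions, not the program's real option handling: select_groups and parse_markclass are hypothetical names, and splitting the --markclass value at its last "=" is an assumption about how the URI and the TAG are separated.

```python
# Illustrative sketch only; names and parsing rules are assumptions, not RDFTEF's API.
def select_groups(groups, first=1, last=None):
    """Return groups first..last (1-based, inclusive); a too-large last means the final group."""
    if last is None or last > len(groups):
        last = len(groups)
    return groups[first - 1:last]


def parse_markclass(value):
    """Split 'http://uri#class=TAG' into (class_uri, tag) at the last '='."""
    class_uri, sep, tag = value.rpartition("=")
    if not sep or not class_uri or not tag or " " in tag:
        raise ValueError("expected URI=TAG, with a TAG that contains no spaces")
    return class_uri, tag


if __name__ == "__main__":
    print(select_groups(["g1", "g2", "g3", "g4"], first=2, last=10))
    print(parse_markclass("http://ontology#Periodo=b"))
```

A value such as http://ontology#Periodo=hi b is rejected here, mirroring the restriction that TAG cannot contain spaces.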

Query

In the program's shell, enter the command query to switch to another shell, the query shell. It accepts SPARQL queries and also has some specific commands of its own. Type help to view the available ones. A useful command is show prefix, which prints the prefixes that can be used inside queries. For example, it shows that the prefix divont can be used for the ontology's namespace, allowing queries to use terms like divont:Periodo.

The results of the queries can be output in specific ways by using set [FMT]output commands.

set output=FILE Defines that the results of subsequent queries will be written to FILE in plain text format (although with nice formatting).
set xmloutput=FILE The results will be written to FILE in XML format.
set htmloutput=FILE The results will be written to FILE in HTML format. The program first writes the results to temporary files in SPARQL query result format, and then uses an XSLT stylesheet to convert them to HTML and write them to the given FILE. The default stylesheet is "data/result2-to-html.xsl", but another can be specified with the command "set stylesheet=FILE".
set output The results will be written to the standard output (usually the user screen).

Any other command entered is treated as a SPARQL query. A query can span more than one line, so a ";" must be used at the end of a line to finish the command. (Note: take care with the ";" of SPARQL's own syntax. Do not use it at the end of a line.)
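
The line-continuation rule can be illustrated with a small sketch in Python. It is not the shell's actual implementation; read_commands is a hypothetical name, and discarding an unterminated final command is an assumption made here for simplicity.

```python
# Illustrative sketch only; the real query shell is part of the Java program.
def read_commands(lines):
    """Yield complete commands; a command ends at a line whose last character is ';'."""
    buffer = []
    for line in lines:
        stripped = line.rstrip()
        if stripped.endswith(";"):
            buffer.append(stripped[:-1])  # drop the terminator: it is not part of SPARQL
            yield "\n".join(buffer).strip()
            buffer = []
        else:
            buffer.append(stripped)  # no terminator yet: keep reading the same command
    # Anything left in the buffer was never terminated and is discarded (assumption).


if __name__ == "__main__":
    typed = ["SELECT ?x", "WHERE {", "?x a divont:Line .", "};"]
    for command in read_commands(typed):
        print(command)
```

Because only a ";" in the last position of a line ends the command, a ";" that belongs to SPARQL's own syntax is safe as long as it is followed by more text on the same line.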

All queries are treated as SELECT queries. Supporting other forms, such as ASK, is left for future implementation.

Alongside normal SPARQL constructs, special functions can be used to build more powerful queries. These functions are IsNumber(?x,?y), BelongsTo(?x,?y), Precedes(?x,?y), and InTextOrder(?x). Their namespace is given by the function prefix, which is already defined. Examples of use:

SELECT ?s
WHERE {
?s a divont:Periodo .
FILTER function:IsNumber(?s, 4) .
};
This selects the node (not the content) of the 4th sentence.
Improve this query with:
SELECT ?w
WHERE {
?s a divont:Periodo .
?w a divont:Word .
FILTER function:IsNumber(?s, 4) .
FILTER function:BelongsTo(?w, ?s) .
};
Now this selects the words of the 4th sentence.

Use

SELECT ?x
WHERE {
?w a divont:Word .
?x a divont:Word .
FILTER function:IsNumber(?w, 38) .
FILTER function:Precedes(?x, ?w) .
};
to select all words occurring before the 38th word.

And finally, use

SELECT ?c
WHERE {
?s a divont:Periodo .
?w a divont:Word .
?w divont:Printable_content ?c .
FILTER function:IsNumber(?s, 8) .
FILTER function:BelongsTo(?w, ?s) .
}
ORDER BY function:InTextOrder(?w);
to select the content of the words of the 8th sentence in order.

Examples: Usage and Powerful Queries

Run the program: from its root directory, run ./rdftef.sh on Linux, or rdftef.bat on Windows. Wait for the shell to appear, then enter the commands below and view the results.

# import data/CantoI-sint.xml

This reads a text from the "Divina Commedia", by Dante Alighieri, syntactically annotated in the TEI format, and converts it to the internal RDF+OWL model.

# import data/CantoI-logic.xml

The same as above, but with logically annotated information.

The two commands above import the same text into the internal model, but with different (and, most importantly, overlapping) annotation structures. This makes it possible to construct powerful queries naturally. First, of course, one needs to know which types of annotated information are available. For the imported texts above, the first contains annotations about paragraphs, sentences and propositions, and the second about lines of text. The propositions can be of various types: see the TEI Tags section.

To start querying the text, enter

# query

This will show another shell:

Q>

Every command entered in this shell is treated as a query in the SPARQL language, except for commands beginning with help, show, set, quit, bye, and exit.

For example, the command show prefix prints

PREFIX divont: <http://www.owl-ontologies.com/DivCommedia.owl#>
PREFIX rdftef: <http://semedia.deit.univpm.it/rdftef/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX daml: <http://www.daml.org/2001/03/daml+oil#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX function: <java:org.rdftef.query.functions.>

Now enter the following:

Q> SELECT ?x WHERE { ?x a divont:Line . };

Note the trailing semicolon (;): it is not part of SPARQL, but it must be entered to indicate the end of the query command. Thus, entering (part of) a query command and pressing Enter without first typing a ';' provides another line on which to continue the query. The above query could then be entered as:

Q> SELECT ?x
> WHERE {
> ?x a divont:Line .
> };

This query shows all the nodes in the model that are line resources. Though the result is not very useful yet, it gives some idea of the program's behaviour.

Now, extending the query:

Q> SELECT ?x
> WHERE {
> ?x a divont:Line .
> FILTER function:IsNumber(?x, 90)
> };

This query will select just the node of the 90th line, showing the result:

--------
| x    |
========
| _:b0 |
--------

Now, to finally make a more useful query, enter:

Q> SELECT ?c
> WHERE {
> ?w a divont:Printable_symbol .
> ?w divont:Printable_content ?c .
> ?x a divont:Line .
> FILTER function:IsNumber(?x, 90) .
> FILTER function:BelongsTo(?w, ?x)
> };

This shows the words of the 90th line, but in no specific order. To get the line itself, that is, its words in the order in which they appear in the text, use the ORDER BY clause:

Q> SELECT ?c
> WHERE {
> ?w a divont:Printable_symbol .
> ?w divont:Printable_content ?c .
> ?x a divont:Line .
> FILTER function:IsNumber(?x, 90) .
> FILTER function:BelongsTo(?w, ?x)
> }
> ORDER BY function:InTextOrder(?w);

Following the example above, many other queries can be made, for example:

Q> SELECT ?c
> WHERE {
> ?w a divont:Printable_symbol .
> ?w divont:Printable_content ?c .
> ?x a divont:Sentence .
> FILTER function:IsNumber(?x, 3) .
> FILTER function:BelongsTo(?w, ?x)
> }
> ORDER BY function:InTextOrder(?w);

This will return the 3rd sentence.

Among the interesting queries are those that mix different hierarchies. For example, suppose we want the sentences that begin at the beginning of a line or, more precisely, the words that are the first ones of both a sentence and a line. The query, then, is

Q> SELECT ?word
> WHERE {
> ?s a divont:Periodo .
> ?l a divont:Line .
> ?s divont:First_symbol ?symb .
> ?l divont:First_symbol ?symb .
> ?symb divont:Printable_content ?word .
> };

This will give the result

----------
| word   |
==========
| "Ahi"  |
| "Tant" |
| "Nel"  |
----------

Now, if we want to refine this and get all the sentences that are entirely contained within a single line, we may use the query

Q> SELECT ?s ?sf ?sl
> WHERE {
> ?s a divont:Periodo . ?l a divont:Line .
> ?s divont:First_symbol ?sf . ?l divont:First_symbol ?lf .
> ?s divont:Last_symbol ?sl . ?l divont:Last_symbol ?ll .
> FILTER (!function:Precedes(?sf,?lf)) .
> FILTER (!function:Precedes(?ll, ?sl)) .
> };

which will give the result

--------------------------------------------------------------------------------
| s    | sf                                | sl                                |
================================================================================
| _:b0 | rdftef:text#Printable_symbol_600  | rdftef:text#Printable_symbol_601  |
| _:b1 | rdftef:text#Printable_symbol_675  | rdftef:text#Printable_symbol_682  |
| _:b2 | rdftef:text#Printable_symbol_582  | rdftef:text#Printable_symbol_585  |
| _:b3 | rdftef:text#Printable_symbol_1162 | rdftef:text#Printable_symbol_1166 |
| _:b4 | rdftef:text#Printable_symbol_586  | rdftef:text#Printable_symbol_599  |
| _:b5 | rdftef:text#Printable_symbol_569  | rdftef:text#Printable_symbol_575  |
| _:b6 | rdftef:text#Printable_symbol_576  | rdftef:text#Printable_symbol_581  |
--------------------------------------------------------------------------------
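
The containment test expressed by the two negated Precedes filters can be checked on paper with a small sketch. It is written in Python and replaces symbols with integer text positions, which is an assumption made for illustration; precedes is a stand-in for function:Precedes, not its real implementation.

```python
# Illustrative sketch only; symbols are modelled as integer positions in the text.
def precedes(a, b):
    """Stand-in for function:Precedes: true when symbol a comes before symbol b."""
    return a < b


def contained_in(sentence, line):
    """Mirror the two FILTERs: the sentence starts no earlier and ends no later than the line."""
    sf, sl = sentence  # first and last symbol of the sentence
    lf, ll = line      # first and last symbol of the line
    return not precedes(sf, lf) and not precedes(ll, sl)


def sentences_within_a_line(sentences, lines):
    """Keep the sentences that fit entirely inside at least one line."""
    return [s for s in sentences if any(contained_in(s, l) for l in lines)]


if __name__ == "__main__":
    lines = [(1, 7), (8, 14)]
    sentences = [(1, 5), (6, 9), (8, 14)]
    print(sentences_within_a_line(sentences, lines))
```

In this example the sentence (6, 9) is dropped because it starts on one line and ends on the next, while a sentence that coincides exactly with a line is kept.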

[More examples to come]

Supported TEI Tags and Used Ontology Resources

Every word, consisting of a sequence of letters or digits with no whitespace in between, is created as an individual of the ontology class #Word, with the property #Printable_content specifying its content. Other characters are considered punctuation marks, and are created as individuals of the class #Punteggiatura, with the respective character given by the #Printable_content property. Both #Word and #Punteggiatura are subclasses of #Printable_symbol. Whitespace and newlines are ignored.

The following TEI tags are recognized by RDFTEF.

<TEI.2> Defines the TEI file content.
<teiHeader> Specifies the header, containing information such as title and author. It is not yet supported (so it is just ignored).
<text> Specifies the main section containing the text.
<body> Has to follow <text>.
<div1> Specifies the text division. The attribute 'type' is read to define the text type. Currently, only the 'canto' value is recognized (it corresponds to the ontology "#Capitolo_o_canto").
<head> Defines some internal header. Ignored for now.

Groups:

According to the ontology, a group can be an interval group or a bag group. An interval group is specified by the ontology class #Interval_group, and a bag group by #Bag_group. The difference is that an interval group has only the properties #First_symbol and #Last_symbol to define its content, while a bag group is an RDF Bag structure containing every member belonging to it, as rdf:_1, rdf:_2, etc.
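
The difference between the two kinds of group can be pictured with a minimal sketch. It is written in Python with hypothetical class names (IntervalGroup, BagGroup) and plain string identifiers for symbols; it illustrates the idea only and does not reproduce how RDFTEF builds its Jena model.

```python
# Illustrative sketch only; RDFTEF stores these structures as RDF, not Python objects.
from dataclasses import dataclass, field


@dataclass
class IntervalGroup:
    """Content is implied by its end points (#First_symbol and #Last_symbol)."""
    first_symbol: str
    last_symbol: str

    def members(self, text_order):
        """Resolve the members by walking the text from the first to the last symbol."""
        start = text_order.index(self.first_symbol)
        end = text_order.index(self.last_symbol)
        return text_order[start:end + 1]


@dataclass
class BagGroup:
    """Content is listed member by member, as in an RDF Bag."""
    items: list = field(default_factory=list)

    def rdf_members(self):
        """Pair each member with its rdf:_1, rdf:_2, ... membership property."""
        return [(f"rdf:_{i}", item) for i, item in enumerate(self.items, start=1)]


if __name__ == "__main__":
    order = ["w1", "w2", "w3", "w4", "w5"]
    print(IntervalGroup("w2", "w4").members(order))
    print(BagGroup(["w1", "w4"]).rdf_members())
```

An interval group is compact but can only describe a contiguous run of symbols, whereas a bag group can hold members that are not adjacent in the text, which is what overlapping propositions need.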

<lg> Defines a line group. It is an interval group of lines.
<l> Lines inside a line group. Internal tags (attributes) like foreign and hi are not yet supported. It is an interval group of printable symbols.
<p> Defines a paragraph. It has no attribute. It is an interval group of sentences, if any; otherwise, of printable symbols. It is defined by the ontology's class #Paragrafo.
<s> Defines a sentence. TEI provides some attributes for it, but they are not yet supported (thus ignored). It is an interval group of printable symbols, and is defined by the ontology's class #Periodo.

Milestones:

<pb/> Specifies a pagebreak. It is an interval group of linebreaks, if any; otherwise, of printable symbols. All previous members (lines or printable symbols) until the previous pagebreak are added to it, by defining the properties #First_symbol and #Last_symbol. Its ontology's class is #Page.
<lb/> Specifies a linebreak. It is an interval group of printable symbols. All previous printable symbols until the previous linebreak are added to it. Its ontology's class is #Line.
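
How a milestone tag closes a group over the symbols that precede it can be sketched as follows. This is a Python illustration, not the importer's code: LB is a hypothetical marker standing in for an <lb/> element, and skipping a linebreak that has no symbols before it is an assumption.

```python
# Illustrative sketch only; the marker and function name are hypothetical.
LB = object()  # stands in for an <lb/> milestone in the token stream


def lines_from_milestones(stream):
    """Return (first_symbol, last_symbol) for the symbols preceding each linebreak."""
    lines, current = [], []
    for item in stream:
        if item is LB:
            if current:  # an empty span produces no line (assumption)
                lines.append((current[0], current[-1]))
            current = []  # the next line starts after this linebreak
        else:
            current.append(item)
    return lines  # symbols after the last linebreak belong to no line yet


if __name__ == "__main__":
    stream = ["Nel", "mezzo", LB, "mi", "ritrovai", LB, "che"]
    print(lines_from_milestones(stream))
```

The same accumulation applies one level up for <pb/>, with linebreak groups (or printable symbols) collected until the next pagebreak.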

Joins:

<cl> Defines a sentence proposition. It is a bag group containing printable symbols. Its attributes type and function are analysed in order to use the corresponding ontology class. function can be princ or coord, for a #Principale proposition, or subord, for a #Secondario proposition. The type attribute defines specialized classes of #Principale and #Secondario (see the org.rdftef.io.TeiXmlInput source file). If a proposition has joins, that is, if it overlaps other propositions, its attributes id and next are parsed so that there is only one individual of this proposition, containing all the elements of its joins (beginning with the first next attribute, continuing with a matching id, which may contain another next, and so on).
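
The way id and next chain the fragments of one proposition together can be sketched as follows. This is a Python illustration under stated assumptions, not the logic of org.rdftef.io.TeiXmlInput: merge_joined_clauses is a hypothetical name, fragments are plain dictionaries, and a next value is assumed to match the target id exactly.

```python
# Illustrative sketch only; the real join handling lives in the Java importer.
def merge_joined_clauses(fragments):
    """Merge <cl> fragments linked by id/next into one symbol list per proposition.

    Each fragment is a dict with 'symbols' and optional 'id' and 'next' keys.
    """
    by_id = {f["id"]: f for f in fragments if f.get("id")}
    continuation_ids = {f["next"] for f in fragments if f.get("next")}
    merged = []
    for frag in fragments:
        if frag.get("id") in continuation_ids:
            continue  # reached through another fragment's next, so not a starting point
        symbols, visited = [], set()
        current = frag
        while current is not None and id(current) not in visited:
            visited.add(id(current))  # guard against a malformed circular chain
            symbols.extend(current["symbols"])
            current = by_id.get(current.get("next"))
        merged.append(symbols)
    return merged


if __name__ == "__main__":
    fragments = [
        {"id": "c1", "next": "c3", "symbols": ["a", "b"]},
        {"id": "c2", "symbols": ["c"]},
        {"id": "c3", "next": "c4", "symbols": ["d"]},
        {"id": "c4", "symbols": ["e", "f"]},
    ]
    print(merge_joined_clauses(fragments))
```

Here the fragments c1, c3 and c4 become a single proposition even though c2 sits between them in the file, which is why a bag group rather than an interval group is used for propositions.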

General properties:

All groups and printable symbols have the ontology property #Next, which links to the next object in the sequence. The propositions also have the property #refersTo, which points to the immediately containing proposition or, in the case of the first proposition, to the sentence. In this latter case, the sentence also has the property #FirstProposition.

Tags not recognized during import are added to the model as new classes with the ontology property #Unknown_class. For example, if the imported file has an unrecognized tag <t>, the program creates a new class in the model named #_t, and adds to it the property #Unknown_tag. This way, all tags in the file are actually imported, but the unknown ones do not offer any other specific behavior.