Requirements
RDFTEF needs the latest CVS versions of Jena2 and ARQ. Working versions of both are provided with RDFTEF, in the lib/ directory. The other packages in lib/ are required by Jena and ARQ.
RDFTEF was developed and tested with Java JDK 1.5. It has not been tested with JDK 1.4, though it may work with it.
Download the Software
Download the latest version from the project's CVS. For more details on how to do this, see the SourceForge instructions.
cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/rdftef login
cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/rdftef co -P RDFTEF2
Files
The program has the following directory structure:
bin/ | Contains the Java class files. |
data/ | Has example files and required configuration files: |
    DivCommedia.owl | This is the ontology file. RDFTEF requires it to be here. |
    result2-to-html.xsl | This is a stylesheet file used to convert query results to HTML. |
    CantoI-logic.xml | An example file containing an excerpt of the "Divina Commedia", by Dante Alighieri, annotated in TEI format with logical information, such as lines. |
    CantoI-sint.xml | Another example file, containing the same text as the file above, but with TEI annotations about the syntactic structure, such as sentences and propositions. |
    DivModel.xml | An example file containing the RDF model obtained by RDFTEF from the two example files above. |
doc/ | Contains documentation files and the javadoc files. |
lib/ | Contains the required libraries besides the Java JDK. |
src/ | The Java source files. |
Running
Go to the root directory of RDFTEF (where it was copied). Make sure the DivCommedia.owl file is in the data/ directory. Run rdftef.bat if the operating system is MS Windows, or ./rdftef.sh if it is Linux. If this does not work, do the following: make sure the lib/ and bin/ directories are in your Java CLASSPATH (along with your Java 1.5), then run the program with java org.rdftef.Main.
Follow the program's usage steps in the next section.
Usage
When the program is executed, it opens a command shell. All commands below are to be entered inside this shell.
First, enter help to see a quick reference of the available commands. The glossary command prints a short description of some of the acronyms used.
The program has three main modules: import new models or load existing ones, export or save models in memory, and query the internal model, with some special functions. The three are detailed below.
Import and Load
To start making the program useful, type import data/CantoI-sint.xml, for example. This will import the TEI XML file CantoI-sint.xml into the internal RDF model. See the TEI tags section for the supported TEI annotation tags.
Another possibility is to import a plain text file, in which case it will be annotated with respect to words, punctuation symbols, and sentences (treating the following characters as sentence-final punctuation: .!?;:).
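To illustrate the plain-text case, the sketch below closes a sentence at each of the final punctuation characters listed above. It is only a minimal sketch under that one rule, not RDFTEF's actual importer; the class and method names (PlainTextSentences, split) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of RDFTEF: illustrates sentence splitting only.
class PlainTextSentences {
    // The characters the text above lists as final punctuation.
    static final String FINAL_PUNCTUATION = ".!?;:";

    // Returns the sentences of the text; each keeps its final punctuation mark, if any.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            if (FINAL_PUNCTUATION.indexOf(c) >= 0) {
                addIfNotBlank(sentences, current);
            }
        }
        addIfNotBlank(sentences, current); // trailing text without final punctuation
        return sentences;
    }

    // Stores the trimmed buffer as a sentence (unless empty) and resets the buffer.
    private static void addIfNotBlank(List<String> sentences, StringBuilder current) {
        String sentence = current.toString().trim();
        if (sentence.length() > 0) {
            sentences.add(sentence);
        }
        current.setLength(0);
    }
}
```

Because ";" and ":" count as final punctuation, a text such as "Nel mezzo del cammin; mi ritrovai." yields two sentences under this rule.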
If there is already an RDF model file produced by RDFTEF, it can be loaded with load data/DivModel.rdf, for example.
Multiple imports can be issued in sequence to merge their contents into the internal model. Note, however, that although this is a very useful feature, not all error cases are handled. Open source community contributions and further development will improve this feature.
Export and Save
To save the internal model, use save data/DivModel.rdf, for example. The model is written in RDF/XML format.
An interesting feature of the program is exporting to TEI XML files. Various options can be used. The general command is export [OPTIONS] teifile.xml. The OPTIONS can be:
--includeformatting | Specifies that formatting structures (line and page breaks) should be output. |
--includesyntax | Specifies that the syntactic structures (propositions) should be output. |
--firstGROUP=NUMBER | Specifies that the first group to be output is the NUMBERth, counting from the very first group in the model (which is number 1). |
--lastGROUP=NUMBER | Analogous to the above option. If NUMBER is greater than the total number of GROUPs, then the last one is used. |
--markclass http://uri#class=TAG | Uses the given TAG to mark the specified class. For example, if --markclass http://ontology#Periodo=b is provided, the exported file will contain structures like <b>symbol1 symbol2</b>. Currently, TAG cannot contain spaces, so things like ..class=hi b are not yet supported. |
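The form of the --markclass value can be illustrated with a small parser. This is a sketch only: the class MarkClassOption and its methods are hypothetical, and splitting at the last "=" is an assumption of the sketch, not something the RDFTEF sources are known to do.

```java
// Hypothetical helper, not RDFTEF code: parses the value of a --markclass option.
class MarkClassOption {
    final String classUri;
    final String tag;

    MarkClassOption(String classUri, String tag) {
        this.classUri = classUri;
        this.tag = tag;
    }

    // Splits "http://uri#class=TAG" at the last '=' (an assumption of this sketch),
    // so that any '=' inside the URI itself is kept as part of the URI.
    static MarkClassOption parse(String value) {
        int eq = value.lastIndexOf('=');
        if (eq <= 0 || eq == value.length() - 1) {
            throw new IllegalArgumentException("Expected URI=TAG but got: " + value);
        }
        String tag = value.substring(eq + 1);
        if (tag.indexOf(' ') >= 0) {
            // Mirrors the documented limitation: TAG cannot contain spaces.
            throw new IllegalArgumentException("TAG cannot contain spaces: " + tag);
        }
        return new MarkClassOption(value.substring(0, eq), tag);
    }

    // Wraps exported content in the configured tag.
    String mark(String content) {
        return "<" + tag + ">" + content + "</" + tag + ">";
    }
}
```

With the value http://ontology#Periodo=b, mark("symbol1 symbol2") produces the structure shown in the table above.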
Query
In the program's shell, enter the command query to go to another shell, the query shell. This shell accepts SPARQL queries and also has some specific commands of its own. Type help to view the available ones. A useful command is show prefix, which prints the prefixes that can be used inside queries. For example, it shows that the prefix divont can be used to specify the ontology's namespace, allowing queries to use things like divont:Periodo.
The results of the queries can be output in specific ways by using the set [FMT]output commands.
set output=FILE | The results of subsequent queries will be written to FILE in plain text format (although with nice formatting). |
set xmloutput=FILE | The results will be written to FILE in XML format. |
set htmloutput=FILE | The results will be written to FILE in HTML format. The program first writes the results to temporary files, in SPARQL query result format, and then uses an XSLT stylesheet to convert them to HTML and write them to the given FILE. The default stylesheet is "data/result2-to-html.xsl", but another can be specified with the command "set stylesheet=FILE". |
set output | The results will be written to the standard output (usually the user's screen). |
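The XSLT step behind set htmloutput can be sketched with the standard javax.xml.transform API. This is an illustration under assumptions, not the program's code: the class name ResultToHtml is hypothetical, and strings are used here instead of the temporary files the program is described as writing.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not RDFTEF code: applies a stylesheet to XML query results.
class ResultToHtml {
    // Transforms the XML results with the given XSLT stylesheet and returns the output text.
    static String transform(String resultXml, String stylesheetXsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(stylesheetXsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(resultXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Covers both stylesheet compilation errors and transformation errors.
            throw new IllegalStateException("Could not apply the stylesheet", e);
        }
    }
}
```

To work with files, as the table describes, the StringReader and StringWriter could be replaced by StreamSource and StreamResult instances built from java.io.File objects.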
Any other command entered is treated as a SPARQL query. A query can span more than one line, so a ";" must be used at the end of a line to finish the command. (Note: take care with the ";" that is part of SPARQL's own syntax. Do not place it at the end of a line.)
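The line-continuation rule can be sketched as follows. This is a minimal illustration of the rule as described, not the shell's actual implementation; the class QueryReader and the method splitCommands are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not RDFTEF code: groups input lines into complete commands.
class QueryReader {
    // A command ends at a line whose last non-blank character is ';'.
    // The terminating ';' is dropped, since it is not part of SPARQL.
    static List<String> splitCommands(List<String> lines) {
        List<String> commands = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.endsWith(";")) {
                current.append(trimmed, 0, trimmed.length() - 1);
                commands.add(current.toString().trim());
                current.setLength(0);
            } else {
                // No terminator yet: keep the line and wait for more input.
                current.append(trimmed).append('\n');
            }
        }
        return commands;
    }
}
```

The sketch also shows why a ";" belonging to SPARQL's own syntax must not end a line: under this rule it would be taken as the end of the command.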
All queries are treated as SELECT queries. Supporting other forms, like ASK, is a task for a future version.
Along with normal SPARQL queries, special functions can be used to construct more powerful queries. These functions are IsNumber(?x,?y), BelongsTo(?x,?y), Precedes(?x,?y), and InTextOrder(?x). Their namespace is given by the function prefix, which is already defined. Examples of use:
SELECT ?s
WHERE {
?s a divont:Periodo .
FILTER function:IsNumber(?s, 4) .
};
This selects the node (not the content) of the 4th sentence.
Improve this query with:
SELECT ?w
WHERE {
?s a divont:Periodo .
?w a divont:Word .
FILTER function:IsNumber(?s, 4) .
FILTER function:BelongsTo(?w, ?s) .
};
Now this selects the words of the 4th sentence.
Use the following query to select all words occurring before the 38th word:
SELECT ?x
WHERE {
?w a divont:Word .
?x a divont:Word .
FILTER function:IsNumber(?w, 38) .
FILTER function:Precedes(?x, ?w) .
};
And finally, use the following query to select the content of the words of the 8th sentence, in text order:
SELECT ?c
WHERE {
?s a divont:Periodo .
?w a divont:Word .
?w divont:Printable_content ?c .
FILTER function:IsNumber(?s, 8) .
FILTER function:BelongsTo(?w, ?s) .
}
ORDER BY function:InTextOrder(?w);
Examples: Usage and Powerful Queries
Run the program: from its root directory, run ./rdftef.sh on Linux, or rdftef.bat on Windows. Wait for the shell to appear, then enter the commands below and view the results.
# import data/CantoI-sint.xml
This reads a text from "La Divina Commedia", by Dante Alighieri, syntactically annotated in the TEI format, and converts it to the internal RDF+OWL model.
# import data/CantoI-logic.xml
The same as above, but with logical annotations.
The two commands above import the same text into the internal model, but with different (and, most importantly, overlapping) annotation structures. This makes it possible to construct powerful queries naturally. First, of course, one needs to know which types of annotation are available. For the imported texts above, the first one contains annotations about paragraphs, sentences and propositions, and the second one about lines of text. The propositions can be of various types: see the TEI Tags section.
To start querying the text, enter
# query
This will show another shell:
Q>
Every command entered in this shell is considered a query in the SPARQL language, except for the commands beginning with help, show, set, quit, bye, and exit.
For example, the command show prefix prints
PREFIX divont: <http://www.owl-ontologies.com/DivCommedia.owl#>
PREFIX rdftef: <http://semedia.deit.univpm.it/rdftef/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX daml: <http://www.daml.org/2001/03/daml+oil#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX function: <java:org.rdftef.query.functions.>
Now enter the following:
Q> SELECT ?x WHERE { ?x a divont:Line . };
Note the trailing semicolon (;): it is not part of SPARQL, but it must be entered to indicate the end of the query command. Consequently, pressing Enter on a partial query without a trailing ';' gives another line on which to continue the query. The above query could therefore be entered as:
Q> SELECT ?x
> WHERE {
> ?x a divont:Line .
> };
This query will show all the nodes from the model that are line resources. Though the result is not very useful yet, it gives some idea of the program behaviour.
Now, extending the query:
Q> SELECT ?x
> WHERE {
> ?x a divont:Line .
> FILTER function:IsNumber(?x, 90)
> };
This query will select just the node of the 90th line, showing the result:
--------
| x |
========
| _:b0 |
--------
Now, to finally make a more useful query, enter:
Q> SELECT ?c
> WHERE {
> ?w a divont:Printable_symbol .
> ?w divont:Printable_content ?c .
> ?x a divont:Line .
> FILTER function:IsNumber(?x, 90) .
> FILTER function:BelongsTo(?w, ?x)
> };
This will show the words of the 90th line, but in no specific order. To actually get the 90th line, that is, the words of the 90th line in the order in which they appear in the text, use the ORDER BY clause:
Q> SELECT ?c
> WHERE {
> ?w a divont:Printable_symbol .
> ?w divont:Printable_content ?c .
> ?x a divont:Line .
> FILTER function:IsNumber(?x, 90) .
> FILTER function:BelongsTo(?w, ?x)
> }
> ORDER BY function:InTextOrder(?w);
Following the above example, several other queries can be made, for example:
Q> SELECT ?c
> WHERE {
> ?w a divont:Printable_symbol .
> ?w divont:Printable_content ?c .
> ?x a divont:Sentence .
> FILTER function:IsNumber(?x, 3) .
> FILTER function:BelongsTo(?w, ?x)
> }
> ORDER BY function:InTextOrder(?w);
This will return the 3rd sentence.
Among interesting queries are those that mix different hierarchies. For example, suppose we want the sentences that begin at the beginning of a line, or better, the words that are the first ones of both a sentence and a line. The query, then, is
Q> SELECT ?word
> WHERE {
> ?s a divont:Periodo .
> ?l a divont:Line .
> ?s divont:First_symbol ?symb .
> ?l divont:First_symbol ?symb .
> ?symb divont:Printable_content ?word .
> };
This will give the result
----------
| word |
==========
| "Ahi" |
| "Tant" |
| "Nel" |
----------
Now, if we want to refine this, and get all the sentences that are entirely contained on a single line, we may use the query
Q> SELECT ?s ?sf ?sl
> WHERE {
> ?s a divont:Periodo . ?l a divont:Line .
> ?s divont:First_symbol ?sf . ?l divont:First_symbol ?lf .
> ?s divont:Last_symbol ?sl . ?l divont:Last_symbol ?ll .
> FILTER (!function:Precedes(?sf,?lf)) .
> FILTER (!function:Precedes(?ll, ?sl)) .
> };
which will give the result
--------------------------------------------------------------------------------
| s | sf | sl |
================================================================================
| _:b0 | rdftef:text#Printable_symbol_600 | rdftef:text#Printable_symbol_601 |
| _:b1 | rdftef:text#Printable_symbol_675 | rdftef:text#Printable_symbol_682 |
| _:b2 | rdftef:text#Printable_symbol_582 | rdftef:text#Printable_symbol_585 |
| _:b3 | rdftef:text#Printable_symbol_1162 | rdftef:text#Printable_symbol_1166 |
| _:b4 | rdftef:text#Printable_symbol_586 | rdftef:text#Printable_symbol_599 |
| _:b5 | rdftef:text#Printable_symbol_569 | rdftef:text#Printable_symbol_575 |
| _:b6 | rdftef:text#Printable_symbol_576 | rdftef:text#Printable_symbol_581 |
--------------------------------------------------------------------------------
[More examples to come]
Supported TEI Tags and Used Ontology Resources
Every word, that is, a sequence of letters or digits with no whitespace in between, is created as an individual of the ontology class #Word, with the property #Printable_content specifying its content. Other characters are considered punctuation marks, and are created as individuals of the class #Punteggiatura, with the respective character given by the #Printable_content property. Both #Word and #Punteggiatura are subclasses of #Printable_symbol. Whitespace and newlines are ignored.
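The classification rule above can be sketched as a small tokenizer. It is a simplified illustration, not the importer's code: the class SymbolTokenizer and its Symbol type are hypothetical, and the use of Character.isLetterOrDigit to decide what counts as a letter or digit is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not RDFTEF code: classifies characters as described above.
class SymbolTokenizer {
    static final class Symbol {
        final String ontologyClass; // "#Word" or "#Punteggiatura"
        final String content;       // the value that #Printable_content would hold

        Symbol(String ontologyClass, String content) {
            this.ontologyClass = ontologyClass;
            this.content = content;
        }
    }

    // Runs of letters or digits become words; any other non-whitespace
    // character becomes a punctuation symbol; whitespace is dropped.
    static List<Symbol> tokenize(String text) {
        List<Symbol> symbols = new ArrayList<Symbol>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            flushWord(symbols, word);
            if (!Character.isWhitespace(c)) {
                symbols.add(new Symbol("#Punteggiatura", String.valueOf(c)));
            }
        }
        flushWord(symbols, word);
        return symbols;
    }

    // Emits the pending word, if any, and clears the buffer.
    private static void flushWord(List<Symbol> symbols, StringBuilder word) {
        if (word.length() > 0) {
            symbols.add(new Symbol("#Word", word.toString()));
            word.setLength(0);
        }
    }
}
```

Under this rule an apostrophe splits a word into two, with the apostrophe itself as a punctuation symbol; whether RDFTEF treats such cases the same way is not stated here.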
The following TEI tags are recognized by RDFTEF.
<TEI.2> | Defines the TEI file content. |
<teiHeader> | Specifies the header, containing information such as title and author, but is not yet supported (so it is just ignored). |
<text> | Specifies the main section containing the text. |
<body> | Has to follow <text>. |
<div1> | Specifies the text division. The attribute 'type' is read to define the text type. Currently, only the 'canto' value is recognized (it corresponds to the ontology "#Capitolo_o_canto"). |
<head> | Defines some internal header. Ignored for now. |
Groups:
According to the ontology, a group can be an interval group or a bag group. An interval group is specified by the ontology class #Interval_group, and a bag group by #Bag_group. The difference is that an interval group has only the properties #First_symbol and #Last_symbol to define its content, while a bag group is an RDF Bag structure containing every member belonging to it, as rdf:_1, rdf:_2, etc.
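The difference between the two kinds of group can be pictured with two small data structures. This is a conceptual sketch only: the class names are hypothetical, and identifying each symbol by an integer position in the text is an assumption made for the illustration, not the way the RDF model stores symbols.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not RDFTEF code. Symbols are identified here by
// their position in the text, which is an assumption of this illustration.
class GroupSketch {
    // Interval group: its content is defined only by a first and a last symbol,
    // so it always covers a contiguous stretch of text.
    static final class IntervalGroup {
        final int firstSymbol;
        final int lastSymbol;

        IntervalGroup(int firstSymbol, int lastSymbol) {
            this.firstSymbol = firstSymbol;
            this.lastSymbol = lastSymbol;
        }

        boolean contains(int symbol) {
            return firstSymbol <= symbol && symbol <= lastSymbol;
        }
    }

    // Bag group: every member is listed explicitly (as rdf:_1, rdf:_2, ...),
    // so the members do not have to be contiguous.
    static final class BagGroup {
        final List<Integer> members;

        BagGroup(Integer... members) {
            this.members = Arrays.asList(members);
        }

        boolean contains(int symbol) {
            return members.contains(symbol);
        }
    }
}
```

An interval group needs only two properties however long it is, while a bag group can represent content with gaps, at the cost of listing every member.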
<lg> | Defines a line group. It is an interval group of lines. |
<l> | Lines inside a line group. Internal tags (attributes) like foreign and hi are not yet supported. It is an interval group of printable symbols. |
<p> | Defines a paragraph. It has no attribute. It is an interval group of sentences, if any; otherwise, of printable symbols. It is defined by the ontology's class #Paragrafo. |
<s> | Defines a sentence. TEI provides some attributes for it, but they are not yet supported (thus ignored). It is an interval group of printable symbols, and is defined by the ontology's class #Periodo. |
Milestones:
<pb/> | Specifies a pagebreak. It is an interval group of linebreaks, if any; otherwise, of printable symbols. All previous members (lines or printable symbols) until the previous pagebreak are added to it, by defining the properties #First_symbol and #Last_symbol. Its ontology class is #Page. |
<lb/> | Specifies a linebreak. It is an interval group of printable symbols. All previous printable symbols until the previous linebreak are added to it. Its ontology class is #Line. |
Joins:
<cl> | Defines a sentence proposition. It is a bag group, containing printable symbols. Its attributes type and function are analyzed in order to use the corresponding ontology class. function can be princ or coord, for a #Principale proposition, or subord, for a #Secondario proposition. The type attribute defines specialized classes of #Principale and #Secondario (see the org.rdftef.io.TeiXmlInput source file). If a proposition has joins, that is, if it overlaps other propositions, its attributes id and next are parsed so that there is only one individual for this proposition, containing all the elements of its joins (beginning with the first next attribute, continuing with a matching id, which may contain another next, and so on). |
General properties:
All groups and printable symbols have the ontology property #Next, which links to the next object in the sequence. Propositions also have the property #refersTo, which points to the immediately containing proposition or, in the case of the first proposition, to the sentence. In this latter case, the sentence also has the property #FirstProposition.
Tags not recognized during import are added to the model as new classes with the ontology property #Unknown_class. For example, if the imported file has an unrecognized tag <t>, the program creates a new class in the model named #_t, and adds to it the property #Unknown_tag. This way, all tags in the file are actually imported, but the unknown ones do not offer any other specific behavior.