Configuring REMBRANDT

1. Configuration variables hierarchy

REMBRANDT can run with default configuration values. Nonetheless, sooner or later you will want to tweak REMBRANDT's configuration to adapt it to the text and/or the machine used. This can be done in several ways:

  1. Creating a configuration file named rembrandt.properties. The file must be valid XML, with a <configuration> root tag containing one or more <property> tags. A <property> tag must contain <name> and <value> tags, with an optional <description> tag.
  2. Passing a configuration file as the first argument on the command line (for example, java rembrandt.bin.Rembrandt conf-sample.xml).
  3. Setting Java system properties (the -D command-line options).

The configuration parameters are read in this order, and a parameter specified more than once is overwritten by the later source. Thus, the Java system properties (-D) take priority over the configuration file given on the command line, which in turn takes priority over the rembrandt.properties file.
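As a rough illustration of the file layout and the override order described above, the following Java sketch parses two small configuration documents and layers a third source on top. The class and method names (ConfigLayers, parse, resolve) and the sample values are hypothetical; this is not REMBRANDT's own loader, only a minimal model of "later sources overwrite earlier ones".

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of REMBRANDT: models the documented file
// layout and the "last source wins" override order.
final class ConfigLayers {

    // Reads every <property> entry (with its <name> and <value> children)
    // of a <configuration> document into a map. <description> is ignored.
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                result.put(text(property, "name"), text(property, "value"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable configuration document", e);
        }
    }

    // Returns the trimmed text of the first child element with the given tag.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    // Applies the sources in reading order; each later source overwrites
    // keys already set by an earlier one.
    @SafeVarargs
    static Map<String, String> resolve(Map<String, String>... sourcesInReadingOrder) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : sourcesInReadingOrder) {
            merged.putAll(source);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for the rembrandt.properties file (sample values only).
        String propertiesFile =
                "<configuration>"
              + "<property>"
              + "<name>rembrandt.core.doALT</name>"
              + "<value>true</value>"
              + "<description>Optional free-text note</description>"
              + "</property>"
              + "<property>"
              + "<name>rembrandt.output.tagstyle.verbose</name>"
              + "<value>1</value>"
              + "</property>"
              + "</configuration>";
        // Stand-in for the file passed as first command-line argument.
        String commandLineFile =
                "<configuration>"
              + "<property>"
              + "<name>rembrandt.output.tagstyle.verbose</name>"
              + "<value>2</value>"
              + "</property>"
              + "</configuration>";
        // Stand-in for a -Drembrandt.core.doALT=false option.
        Map<String, String> systemProperties = Map.of("rembrandt.core.doALT", "false");

        Map<String, String> effective =
                resolve(parse(propertiesFile), parse(commandLineFile), systemProperties);
        effective.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

In this sketch rembrandt.core.doALT ends up false (the -D stand-in wins) and rembrandt.output.tagstyle.verbose ends up 2 (the command-line file wins over rembrandt.properties).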

2. Ins and Outs of REMBRANDT

By default, REMBRANDT uses STDIN, STDOUT and STDERR with the encoding defined in Java's file.encoding parameter. To read text from or write text to files instead, set the rembrandt.${stream}.file parameters (${stream} can take the values input, output and err), as in the following example:

java -Drembrandt.input.file=file_input.txt -Drembrandt.output.file=file_output.txt rembrandt.bin.Rembrandt

STDERR can be used to output additional information. By default, STDERR is enabled (to disable it, use rembrandt.err.enabled=false) and writes verbose information to the rembrandt.err.log file about the mutations that NEs went through until their final state. STDERR can be reconfigured, as in:

echo "Rembrandt" | java -Drembrandt.err.file=file3.err -Drembrandt.err.writer=rembrandt.io.HTMLDocumentWriter -Drembrandt.err.styletag=rembrandt.io.HTMLStyleTag rembrandt.bin.Rembrandt

Now, file3.err will receive an HTML version of the tagged documents. Note the rembrandt.err.writer parameter: the rembrandt.${stream}.reader and rembrandt.${stream}.writer parameters set the file format used for reading and writing. Available formats include simple ones (rembrandt.io.UnformattedReader and rembrandt.io.UnformattedWriter), HTML-ish formats, XML serialized objects and the default REMBRANDT format. The parameter values must be valid classes that extend rembrandt.io.Reader and rembrandt.io.Writer, respectively.

The NE tag style, on the other hand, is configured by the rembrandt.output.styletag parameter, which takes the name of a class that extends rembrandt.io.StyleTag (RembrandtStyleTag, by default). Other tag style settings include:

  • rembrandt.output.tagstyle.lang, which sets the tag language for classifications
  • rembrandt.output.tagstyle.verbose, which defines the verbosity of the tag parameters:
    • 0 - just the classification
    • 1 - plus an id, sentence number and term number
    • 2 - plus grounding information from Wikipedia / DBpedia
    • 3 - plus the NE mutation history
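The verbosity levels are cumulative: each level adds information on top of the previous one. A small Java sketch of that reading follows. The class TagVerbosity and its labels are hypothetical descriptive strings, not actual REMBRANDT class or attribute names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the cumulative verbosity levels listed above.
// The labels are descriptions, not real REMBRANDT tag attribute names.
final class TagVerbosity {

    // Index i holds what level i adds on top of level i - 1.
    private static final String[] ADDED_AT_LEVEL = {
        "classification",
        "id, sentence number and term number",
        "Wikipedia / DBpedia grounding information",
        "NE mutation history"
    };

    // Returns everything a tag carries at the given verbosity level.
    static List<String> included(int level) {
        if (level < 0 || level >= ADDED_AT_LEVEL.length) {
            throw new IllegalArgumentException("verbosity must be between 0 and 3");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i <= level; i++) {
            result.add(ADDED_AT_LEVEL[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int level = 0; level <= 3; level++) {
            System.out.println(level + ": " + String.join("; ", included(level)));
        }
    }
}
```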

3. Configuring REMBRANDT's core

The rembrandt.core.doEntityRelation parameter, which can be true or false (default: false), sets whether REMBRANDT, after all entity recognition is done, tries to recover unclassified NEs through entity relation detection.

  • Advantages: Increases the amount of properly classified NEs.
  • Disadvantages: Not optimized, and it can take a considerable amount of time on longer documents.

The rembrandt.core.doALT parameter, which can be true or false (default: true), specifies if REMBRANDT can generate alternative annotations for the same text excerpt.

  • Advantages: Generates more NEs that are more complete regarding the text excerpt. For instance, 'University of Lisbon' becomes tagged as 'University of Lisbon' and 'Lisbon' at the same time.
  • Disadvantages: The tag used, <ALT>, repeats the text in order to present the alternatives, and therefore makes post-processing harder.

The rembrandt.core.removeRemainingUnknownNE parameter, which can be true or false (default: true), decides what happens to the remaining NEs that have no semantic classification (that is, with unknown meaning). By default, these NEs are deleted from the output.
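The three core switches and their documented defaults can be summarised in a short Java sketch. The class CoreFlags is hypothetical and only shows how such boolean parameters could be read with the defaults stated above; it does not reproduce REMBRANDT's internal code.

```java
import java.util.Properties;

// Hypothetical reader for the three core switches, using the defaults
// documented above. Not part of REMBRANDT.
final class CoreFlags {
    final boolean doEntityRelation;
    final boolean doALT;
    final boolean removeRemainingUnknownNE;

    CoreFlags(Properties props) {
        // The second argument of getProperty is the fallback when the key is unset.
        doEntityRelation = Boolean.parseBoolean(
                props.getProperty("rembrandt.core.doEntityRelation", "false"));
        doALT = Boolean.parseBoolean(
                props.getProperty("rembrandt.core.doALT", "true"));
        removeRemainingUnknownNE = Boolean.parseBoolean(
                props.getProperty("rembrandt.core.removeRemainingUnknownNE", "true"));
    }

    public static void main(String[] args) {
        // System properties include anything passed with -D on the command line.
        CoreFlags flags = new CoreFlags(System.getProperties());
        System.out.println("doEntityRelation = " + flags.doEntityRelation);
        System.out.println("doALT = " + flags.doALT);
        System.out.println("removeRemainingUnknownNE = " + flags.removeRemainingUnknownNE);
    }
}
```

Running the sketch with, for example, -Drembrandt.core.doALT=false flips only that switch and leaves the other two at their defaults.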

4. Configuring SASKIA on database access

To connect to the database, SASKIA uses the following parameters:

saskia.wikipedia.db.name - the database name (default: 'saskia').

saskia.wikipedia.db.url - the URL for the database connection, which allows connections to remote databases. The default value is jdbc:mysql://127.0.0.1.

saskia.wikipedia.db.user - the database user (default: 'saskia').

saskia.wikipedia.db.password - the database's user password (default: 'saskia').

saskia.wikipedia.db.params - for additional connection parameters. The default parameters for the MySQL connector/J are useUnicode=true&encodingCharset=UTF-8&autoReconnect=true, which enforce the use of UTF-8 on all MySQL transactions.

saskia.wikipedia.table.${name} - the database table names, where ${name} can be: page, category, categorylinks, pagelinks or redirect.
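To show how the connection parameters relate to each other, here is a Java sketch that assembles a JDBC connection string from them. The composition rule (url + "/" + name + "?" + params) is an assumption based on the usual MySQL JDBC URL shape, not something this page confirms about SASKIA, and the class SaskiaDbUrl is hypothetical. The default values are copied verbatim from the list above.

```java
import java.util.Properties;

// Hypothetical helper: assumes SASKIA joins the parameters as
// <url>/<name>?<params>. This page does not confirm that rule.
final class SaskiaDbUrl {

    static String build(Properties props) {
        // Defaults copied verbatim from the parameter list above.
        String url = props.getProperty("saskia.wikipedia.db.url", "jdbc:mysql://127.0.0.1");
        String name = props.getProperty("saskia.wikipedia.db.name", "saskia");
        String params = props.getProperty("saskia.wikipedia.db.params",
                "useUnicode=true&encodingCharset=UTF-8&autoReconnect=true");
        // Append the query part only when there are parameters to pass.
        return url + "/" + name + (params.isEmpty() ? "" : "?" + params);
    }

    public static void main(String[] args) {
        // Picks up any -Dsaskia.wikipedia.db.* options given on the command line.
        System.out.println(build(System.getProperties()));
    }
}
```

The user and password parameters are left out of the string on purpose; with JDBC they are typically passed separately, for instance to DriverManager.getConnection(url, user, password).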

Last modified 3 years ago.