Configuring REMBRANDT
1. Configuration variables hierarchy
REMBRANDT can run with default configuration values. Nonetheless, sooner or later you will want to tweak REMBRANDT's configuration to adapt it to the text and/or the machine used. This can be done in several ways:
- Create a configuration file named rembrandt.properties. The file must be valid XML, with a <configuration> root tag containing one or more <property> tags. Each <property> tag must contain <name> and <value> tags, with an optional <description> tag.
- Pass a configuration file as the first argument on the command line (for example, java rembrandt.bin.Rembrandt conf-sample.xml).
- Set Java environment variables (system properties, passed with -D).
The configuration parameters are read in this order, and a later value overwrites an earlier one when a parameter is specified more than once. Thus, the Java environment variables have priority over the variables set in the configuration file given on the command line, which in turn has priority over the rembrandt.properties file.
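As a minimal sketch of the first and third options, the shell snippet below writes a small rembrandt.properties file and then shows a one-off override on the command line. The parameter rembrandt.core.doALT is taken from this document. The description text is illustrative, and the snippet assumes the file is picked up from the working directory, which this document does not state.

```shell
# Write a minimal rembrandt.properties in the current directory
# (assumption: REMBRANDT looks for the file here).
cat > rembrandt.properties <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>rembrandt.core.doALT</name>
    <value>true</value>
    <description>Generate alternative annotations</description>
  </property>
</configuration>
EOF

# Override the same parameter for a single run; -D has the highest priority.
# Left commented out because it needs REMBRANDT on the classpath.
# echo "Rembrandt" | java -Drembrandt.core.doALT=false rembrandt.bin.Rembrandt
```

With both in place, the value given with -D would win over the one in the file, following the priority order described above.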
2. Ins and Outs of REMBRANDT
By default, REMBRANDT uses STDIN, STDOUT and STDERR, with the encoding defined in Java's file.encoding parameter. If you want to read from or write to files instead, set the rembrandt.${stream}.file parameters (${stream} can take the values input, output and err), as in the following example:
java -Drembrandt.input.file=file_input.txt -Drembrandt.output.file=file_output.txt rembrandt.bin.Rembrandt
STDERR can be used to output additional information. By default, STDERR is enabled (to disable it, use rembrandt.err.enabled=false) and writes verbose information about the mutations that NEs undergo until their final state to the rembrandt.err.log file. STDERR can be reconfigured, as in:
echo "Rembrandt" | java -Drembrandt.err.file=file3.err -Drembrandt.err.writer=rembrandt.io.HTMLDocumentWriter -Drembrandt.err.styletag=rembrandt.io.HTMLStyleTag rembrandt.bin.Rembrandt
Now, file3.err will be used to write an HTML version of the tagged documents. Note the parameter rembrandt.err.writer; the rembrandt.${stream}.reader and rembrandt.${stream}.writer parameters set the file format used for reading and writing. We can use simple formats (rembrandt.io.UnformattedReader and rembrandt.io.UnformattedWriter), HTML-ish formats, XML serialized objects or the default REMBRANDT format. The parameter values must be valid classes that extend rembrandt.io.Reader and rembrandt.io.Writer.
The NE tag style, on the other hand, is configured by the rembrandt.output.styletag parameter, which takes the name of a class that extends rembrandt.io.StyleTag (RembrandtStyleTag, by default). Other tag style configuration parameters include:
- rembrandt.output.tagstyle.lang, to set the tag language for classifications
- rembrandt.output.tagstyle.verbose, to define the verbosity of the tag parameters:
  - 0 - just the classification
  - 1 - plus an id, sentence number and term number
  - 2 - plus grounding information from Wikipedia / DBpedia
  - 3 - plus the NE mutation history
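To compare the four verbosity levels on the same text, one option is to generate one command per level, each writing to its own output file. The sketch below only prints the commands. The function name print_verbosity_runs and the out_v*.txt file names are hypothetical, and file_input.txt is reused from the earlier example.

```shell
# Print one REMBRANDT command per tag verbosity level (0 to 3).
# Nothing is executed here; pipe the output to "sh" to run the commands.
print_verbosity_runs() {
  for level in 0 1 2 3; do
    echo "java -Drembrandt.input.file=file_input.txt -Drembrandt.output.file=out_v${level}.txt -Drembrandt.output.tagstyle.verbose=${level} rembrandt.bin.Rembrandt"
  done
}

print_verbosity_runs
```

Comparing the resulting files side by side should show what each level adds to the tags.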
3. Configuring REMBRANDT's core
The rembrandt.core.doEntityRelation parameter, which can be true or false (default: false), sets whether, after all entity recognition is done, REMBRANDT tries to recover unclassified NEs through entity relation detection.
- Advantages: Increases the number of properly classified NEs.
- Disadvantages: Not optimized, and it can take a considerable amount of time on longer documents.
The rembrandt.core.doALT parameter, which can be true or false (default: true), specifies whether REMBRANDT can generate alternative annotations for the same text excerpt.
- Advantages: Generates more NEs, giving a more complete annotation of the text excerpt. For instance, 'University of Lisbon' is tagged both as 'University of Lisbon' and as 'Lisbon'.
- Disadvantages: The tag used, <ALT>, repeats the text to present the alternatives and, as such, makes post-processing harder.
The rembrandt.core.removeRemainingUnknownNE parameter, which can be true or false (default: true), decides what to do with the remaining NEs that have no semantic classification (that is, with unknown meaning). By default, these NEs are deleted from the output.
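The three core switches can be collected in a single configuration file and passed as the first command-line argument. The sketch below assumes the XML layout described in section 1. The file name conf-core.xml is hypothetical, and the values (the opposite of each default) are chosen only for illustration.

```shell
# Write a configuration file that flips each core parameter from its default.
cat > conf-core.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>rembrandt.core.doEntityRelation</name>
    <value>true</value>
  </property>
  <property>
    <name>rembrandt.core.doALT</name>
    <value>false</value>
  </property>
  <property>
    <name>rembrandt.core.removeRemainingUnknownNE</name>
    <value>false</value>
  </property>
</configuration>
EOF

# Pass the file as the first argument.
# Left commented out because it needs REMBRANDT on the classpath.
# echo "Rembrandt" | java rembrandt.bin.Rembrandt conf-core.xml
```

With these values, unclassified NEs would be kept in the output and no <ALT> tags would be produced, which may be easier to post-process at the cost of longer running times.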
4. Configuring SASKIA's database access
To connect to the database, SASKIA uses the following parameters:
saskia.wikipedia.db.name - the database name (default: 'saskia').
saskia.wikipedia.db.url - the URL for the database connection, which allows connections to remote databases. The default value is jdbc:mysql://127.0.0.1.
saskia.wikipedia.db.user - the database user (default: 'saskia').
saskia.wikipedia.db.password - the database user's password (default: 'saskia').
saskia.wikipedia.db.params - additional connection parameters. The default parameters for the MySQL Connector/J are useUnicode=true&encodingCharset=UTF-8&autoReconnect=true, which enforce the use of UTF-8 on all MySQL transactions.
saskia.wikipedia.table.${name} - the database table names, where ${name} can be: page, category, categorylinks, pagelinks or redirect.
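As an illustration of pointing SASKIA at a remote database, the sketch below builds a command line that overrides only the connection URL and the user. It assumes the SASKIA parameters can be passed with -D in the same way as the REMBRANDT ones. The host db.example.org and the user wiki_reader are hypothetical.

```shell
# Hypothetical remote host and user; replace with your own values.
DB_HOST="db.example.org"
DB_USER="wiki_reader"

# One -D option per SASKIA parameter that differs from its default.
SASKIA_OPTS="-Dsaskia.wikipedia.db.url=jdbc:mysql://${DB_HOST} -Dsaskia.wikipedia.db.user=${DB_USER}"

# Dry run: print the command instead of executing it.
CMD="java ${SASKIA_OPTS} rembrandt.bin.Rembrandt"
echo "$CMD"
```

The password is left out on purpose: a value passed with -D is visible in the process list, so a configuration file may be a safer place for saskia.wikipedia.db.password.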