NomesLex-PT01 in English

This resource is no longer supported by the XLDB Group. Please check the new location.



This page is also available in Portuguese.

NomesLex-PT01 is a lexicon of person names comprising 2,027 first names and 8,019 surnames, each with its frequency. These (mostly Portuguese) names were selected from the public list of the 2009 teacher recruitment, published on the Portuguese Ministry of Education website, https://servicos.dgrhe.min-edu.pt/ConsultaCandidatos/


NomesLex-PT01 is available for download from this page (see below) under a Creative Commons license.

Frequently Asked Questions (FAQ)

What can I do with NomesLex-PT01?

NomesLex-PT01 is especially useful for named entity recognition and resolution tasks involving Portuguese.

What is the format of NomesLex-PT01?

NomesLex-PT01 is available as two separate files, encoded in UTF-8 and formatted as TSV (tab-separated values):

- NomesLex-PT01-firstnames.utf8.tsv, made up of 2,027 first names.
- NomesLex-PT01-surnames.utf8.tsv, made up of 8,019 surnames.

Each line contains a name (in all capitals) followed by its frequency in the public list of the 2009 teacher recruitment.
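As an illustration of the format described above, here is a minimal Python sketch that reads name/frequency lines into a dictionary. It assumes exactly two tab-separated columns and no header row, which this page does not state explicitly; the function name `load_lexicon` and the in-memory sample are hypothetical and stand in for the real files.

```python
import io


def load_lexicon(stream):
    """Read NAME<TAB>FREQUENCY lines into a dict mapping name -> frequency."""
    lexicon = {}
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        # Skip blank, malformed, or header-like lines (assumption: the
        # frequency is a plain non-negative integer in the second column).
        if len(fields) < 2 or not fields[1].strip().isdigit():
            continue
        lexicon[fields[0].strip()] = int(fields[1].strip())
    return lexicon


# Small in-memory sample built from the frequencies quoted on this page.
sample = io.StringIO("MARIA\t14374\nANA\t7966\nCARLA\t2660\n")
first_names = load_lexicon(sample)
print(first_names["ANA"])
```

To read one of the actual files, the same function could be called on `open("NomesLex-PT01-firstnames.utf8.tsv", encoding="utf-8")`, assuming the file sits in the working directory.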

Where does the information in NomesLex-PT01 come from?

First names and surnames were selected from the public list of the 2009 teacher recruitment, published on the Portuguese Ministry of Education website, https://servicos.dgrhe.min-edu.pt/ConsultaCandidatos/.

How many names did you collect?

  • Number of collected names: 92,544

How was the extraction of first names and surnames performed?

To extract the first names and surnames, we first tokenized the full names, using the space as separator, and then removed the common connecting words in Portuguese names: "de", "da", "do", "das", "dos", "d'" and "e". Abbreviations (e.g. "S." and "S") and connectors specific to foreign languages (e.g. "del" and "y", in Spanish) were also removed from the name combinations. The first token was taken as the first name. Because it is very common in Portuguese culture to have two given names, we discarded the second token and took the third and subsequent tokens as family names, which make up the surnames.
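The procedure described above can be sketched roughly as follows. This is an illustrative reconstruction, not the original extraction code: the function name `split_full_name`, the rule that an abbreviation is a single letter with an optional period, and the three sample names are all assumptions made for the example. The foreign connector set contains only the two examples given in the text.

```python
import re
from collections import Counter

# Connecting words listed in the text above.
PT_CONNECTORS = {"de", "da", "do", "das", "dos", "d'", "e"}
# Only the two Spanish examples mentioned; the full list is not given.
FOREIGN_CONNECTORS = {"del", "y"}
# Assumption: an abbreviation is one letter, optionally followed by a period.
ABBREVIATION = re.compile(r"[^\W\d_]\.?")


def split_full_name(full_name):
    """Return (first_name, surnames) following the steps described above."""
    tokens = [t for t in full_name.split(" ") if t]
    kept = [
        t for t in tokens
        if t.lower() not in PT_CONNECTORS
        and t.lower() not in FOREIGN_CONNECTORS
        and not ABBREVIATION.fullmatch(t)
    ]
    if not kept:
        return None, []
    # First token: first name. Second token: discarded (likely a second
    # given name). Third token onwards: family names.
    return kept[0], kept[2:]


first_counts, surname_counts = Counter(), Counter()
# Made-up full names, used only to exercise the function.
for full_name in [
    "Maria da Conceição Ferreira dos Santos",
    "Ana S. Pereira Silva",
    "Carla Sofia Silva-Costa",
]:
    first, surnames = split_full_name(full_name)
    if first is not None:
        first_counts[first.upper()] += 1
    surname_counts.update(s.upper() for s in surnames)
```

Because tokens are split on spaces only, hyphenated forms stay intact as a single token. Note also that a full name with only two remaining tokens contributes a first name but no surname under this rule.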

Did you validate the extracted names?

We manually validated the extracted names to correct typographical and spelling errors, mostly concerning accents (e.g. Clìmaco > Clímaco; Conceicao > Conceição; Camoes > Camões; Assunçâo > Assunção).

Does NomesLex-PT01 contain compound names?

Forms joined by a hyphen were left unchanged and count as a single token. The lexicon contains a total of 3 compound given names and 46 compound surnames.

What are the five most frequent first names and surnames?

  • Most-frequent first names: MARIA (14374); ANA (7966); CARLA (2660); SANDRA (2300); PAULA (2021).
  • Most-frequent surnames: SILVA (10170); SANTOS (6785); FERREIRA (5918); PEREIRA (5557); OLIVEIRA (4614).

What is the percentage of first names and surnames with a single occurrence?

  • First names that only occur once: 956 entries (47%).
  • Surnames that only occur once: 3,534 entries (44%).
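For readers who want to recompute such figures from a loaded lexicon, the following short Python sketch counts entries with frequency 1 and expresses them as a share of all entries. The function name `single_occurrence_share` and the toy dictionary are hypothetical; the last lines only redo the arithmetic for the counts quoted above.

```python
def single_occurrence_share(lexicon):
    """Return (count, percentage) of entries whose frequency is exactly 1."""
    singles = sum(1 for freq in lexicon.values() if freq == 1)
    return singles, 100 * singles / len(lexicon)


# Toy lexicon (made-up entries) just to exercise the function.
toy = {"MARIA": 5, "ANA": 3, "ZULMIRA": 1, "IVONE": 1}
toy_count, toy_share = single_occurrence_share(toy)

# Arithmetic check of the figures reported above.
first_name_share = round(100 * 956 / 2027)   # 956 of 2,027 first names
surname_share = round(100 * 3534 / 8019)     # 3,534 of 8,019 surnames
```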

How can I obtain NomesLex-PT01?

You can download the two files directly from here:

What are the licensing terms of NomesLex-PT01?

NomesLex-PT01 is licensed under a Creative Commons Attribution 3.0 License (CC-BY).


Acknowledgments

NomesLex-PT01 was developed by the following team:

With support in part by the following grants from FCT:
