NLP2: Resources to Bring Portuguese NLP to the State of the Art

Putting together open data and tools to enable high-level NLP of the Portuguese language.

Leaders: Marcelo Finger and Thiago A. S. Pardo

We aim to produce resources for Brazilian Portuguese that will enable state-of-the-art tools and applications. We concentrate on both written and spoken modalities for Portuguese, focusing on three main tasks:

  1. with a syntactic view, we aim to produce a million-word, multi-genre corpus of annotated texts for building robust parsing models;
  2. with a language-model view, we aim to build a pipeline for constructing context-based neural models, with applications to natural language inference; and
  3. for spoken language, we aim to build multitask corpora for speech recognition, multi-speaker synthesis, speaker identification, voice cloning and speech-as-a-biomarker classification, producing a large corpus of recorded and transcribed spoken Brazilian Portuguese.

Each task includes an activity that intersects with another front. For instance, the transcribed speech and neural models will be used by the syntactic initiative to train neural parsing models; the parsing models may provide more data for the natural language inference and speech-based tools; and the speech data will be used to develop speech-as-a-biomarker neural models.

Initial applications will focus on speech-based disease diagnosis, opinion mining, and fake news detection.

We emphasize building and using open and open-source resources, so that they can be shared both inside and outside this project.


Our goal is to grow resources that bring Portuguese NLP to the world state of the art, effectively moving Portuguese out of the low-resource language scenario, with essential resources and tools as well as applications that address some currently critical societal demands.

We aim to obtain the following results for the syntactical front:

  1. Universal Dependencies-annotated corpus (at least 5M tokens);
  2. Refined linguistic annotation model (adapted to the Portuguese language and to multi-genre demands);
  3. Better parsing models for Portuguese.
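Universal Dependencies treebanks are normally distributed in the CoNLL-U format, so progress toward a token target like the one above can be tracked with a small script. The following is a minimal sketch, not the project's actual tooling: the function name `corpus_stats` and the two-sentence sample are hypothetical, and the sample annotations are illustrative only. It counts syntactic words and skips multiword-token ranges (such as the contraction "do" = "de" + "o"); whether the project's target counts surface tokens or syntactic words is not stated here.

```python
from collections import Counter

# Hypothetical two-sentence sample in CoNLL-U (10 tab-separated columns per line).
# The annotations are illustrative only and are not taken from the project corpus.
SAMPLE = """\
# text = Gosto do livro.
1\tGosto\tgostar\tVERB\t_\t_\t0\troot\t_\t_
2-3\tdo\t_\t_\t_\t_\t_\t_\t_\t_
2\tde\tde\tADP\t_\t_\t4\tcase\t_\t_
3\to\to\tDET\t_\t_\t4\tdet\t_\t_
4\tlivro\tlivro\tNOUN\t_\t_\t1\tobl\t_\tSpaceAfter=No
5\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_

# text = Ela canta.
1\tEla\tele\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tcanta\tcantar\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def corpus_stats(conllu_text):
    """Return (sentence count, syntactic word count, UPOS tag counts)."""
    sentences = 0
    words = 0
    upos = Counter()
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment or metadata line.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # Multiword-token range (e.g. 2-3) or empty node (e.g. 1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
        upos[cols[3]] += 1
    return sentences, words, upos


if __name__ == "__main__":
    n_sentences, n_words, tag_counts = corpus_stats(SAMPLE)
    print(n_sentences, n_words, tag_counts.most_common())
```

To apply the same count to a real treebank, read the file as UTF-8 text and pass its contents to `corpus_stats`.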

On distributional models and NLI:

  1. A 1-billion-token plain-text corpus made publicly available;
  2. Training pipeline for distributional models;
  3. Complete NLI classification with distribution report and gaps;
  4. Application of distributional models to NLI classification trained on SICK-BR (in an incremental approach).

For spoken language:

  1. Two trained models, one for speaker identification and the other for speech recognition, using the datasets compiled during the first year.


  • Marcelo Finger
  • Thiago Alexandre Salgueiro Pardo
  • Sandra Maria Aluísio
  • Ariani Di Felippo
  • Evandro Eduardo Seron Ruiz
  • Flaviane R. Fernandes Svartman
  • Arnaldo Cândido Junior
  • Norton Trevisan Roman
  • Solange Oliveira Rezende
  • Ricardo Marcondes Marcacini
  • Maria Clara Paixão de Sousa
  • Roberto Hirata Junior
  • Marli Quadros Leite
  • Ivandré Paraboni
  • Glauber de Bona
  • Marcelo Gomes de Queiroz
  • Miguel Arjona Ramirez
  • Alessandra Alaniz Macedo
  • José Augusto Baranauskas
  • Miguel Oliveira