NLP2 Resources to Bring NLP of Portuguese to State-of-Art

Putting together open data and tools to enable high-level NLP of the Portuguese language.

Leaders: Marcelo Finger and Thiago A. S. Pardo

We aim to produce resources for Brazilian Portuguese that will enable state-of-the-art tools and applications. We address both the written and spoken modalities of Portuguese, focusing on three main tasks:

  1. from a syntactic perspective, we aim to produce a million-word multi-genre corpus of annotated texts for building robust parsing models;
  2. from a language-modeling perspective, we aim to build a pipeline for constructing context-based neural models, with applications to natural language inference; and
  3. for spoken language, we aim to build multitask corpora for speech recognition, multi-speaker synthesis, speaker identification, voice cloning and speech-as-a-biomarker classification, producing a large corpus of recorded and transcribed spoken Brazilian Portuguese.

Each task includes an activity that intersects with another front. For instance, the syntactic initiative will use the transcribed speech and the neural models to train neural parsing models; the parsing models may provide more data to the natural language inference and speech-based tools; and the speech data will be used to develop speech-as-a-biomarker neural models.

Initial applications will focus on speech-based disease diagnosis, opinion mining, and fake news detection.

Emphasis is placed on building and using open and open-source resources, so that they can be shared both inside and outside this project.

Goals

To grow resources that bring Portuguese NLP to the world state of the art, effectively moving Portuguese out of the low-resource language scenario, with essential resources and tools as well as applications that address some current critical societal demands.

We aim to obtain the following results for the syntactical front:

  1. Universal Dependencies-annotated corpus (at least 5M tokens);
  2. Refined linguistic annotation model (adapted to the Portuguese language and multi-genre demands);
  3. Better parsing models for Portuguese.

On distributional models and NLI:

  1. A 1-billion-token plain-text corpus made publicly available;
  2. Training pipeline for distributional models;
  3. Complete NLI classification with distribution report and gaps;
  4. Application of distributional models to NLI classification trained on SICK-BR (in an incremental approach).

For spoken language:

  1. Two trained models, one for speaker identification and another for speech recognition, built with the datasets compiled during the first year.

Team

  • Name
    Relevant Information
  • Marcelo Finger
    IME-USP
  • Thiago Alexandre Salgueiro Pardo
    ICMC-USP
  • Sandra Maria Aluísio
    ICMC-USP
  • Ariani Di Felippo
    UFSCar
  • Evandro Eduardo Seron Ruiz
    FFCLRP-USP
  • Flaviane R. Fernandes Svartman
    FFLCH-USP
  • Arnaldo Cândido Junior
    UTFPR
  • Norton Trevisan Roman
    EACH-USP
  • Solange Oliveira Rezende
    ICMC-USP
  • Ricardo Marcondes Marcacini
    ICMC-USP
  • Maria Clara Paixão de Sousa
    FFLCH-USP
  • Roberto Hirata Junior
    IME-USP
  • Marli Quadros Leite
    FFLCH-USP
  • Ivandré Paraboni
    EACH-USP
  • Glauber de Bona
    POLI-USP
  • Marcelo Gomes de Queiroz
    IME-USP
  • Miguel Arjona Ramirez
    POLI-USP
  • Alessandra Alaniz Macedo
    FFCLRP-USP
  • José Augusto Baranauskas
    FFCLRP-USP
  • Miguel Oliveira
    UFAL