NLP2 Resources to Bring NLP of Portuguese to State-of-Art

Putting together open data and tools to enable high-level NLP of the Portuguese language.

Leaders: Marcelo Finger and Thiago A. S. Pardo

We aim to produce resources for Brazilian Portuguese that will enable state-of-the-art tools and applications. We concentrate on both written and spoken modalities for Portuguese, focusing on three main tasks:

  1. with a syntactical view, we aim to produce a million-word, multi-genre corpus of annotated texts for building robust parsing models;
  2. with a language model view, we aim to build a pipeline for constructing context-based neural models, with applications to natural language inference; and
  3. for spoken language, we aim to build multitask corpora for speech recognition, multi-speaker synthesis, speaker identification, voice cloning and speech-as-a-biomarker classification, producing a large corpus of recorded and transcribed spoken Brazilian Portuguese.

Each task includes an activity that intersects with another front. For instance, the transcribed speech and the neural models are to be used by the syntactical initiative for training neural parsing models; the parsing models may provide more data for the natural language inference and speech-based tools; and the speech data will be used for developing speech-as-a-biomarker neural models.

Initial applications are to be in speech-based disease diagnosis, opinion mining, and fake news detection.

We emphasize building and using open and open-source resources, so that they can be shared inside and outside this project.

Goals

To grow the resources needed to bring Portuguese NLP to the world state of the art, effectively moving Portuguese out of the low-resource language scenario, with essential resources and tools as well as applications that address some of society's current critical demands.

We aim to obtain the following results for the syntactical front:

  1. Universal Dependencies-annotated corpus (at least 5M tokens);
  2. Refined linguistic annotation model (adapted to the Portuguese language and to multi-genre demands);
  3. Better parsing models for Portuguese.

On distributional models and NLI:

  1. A 1-billion-token plain-text corpus made publicly available;
  2. Training pipeline for distributional models;
  3. Complete NLI classification with distribution report and gaps;
  4. Application of distributional models to NLI classification trained on SICK-BR (following an incremental approach).

For spoken language:

  1. Two trained models, one for speaker identification and the other for speech recognition, using the datasets compiled during the first year.

Project Websites (External)

POeTiSA - Portuguese processing: Towards Syntactic Analysis and parsing

TaRSila - Tarefa de Anotação para o Reconhecimento e Síntese de fala da Língua Portuguesa

Team

  • Name
    Relevant Information

  • Adriano S. R. Silva
    EACH-USP
  • Aleksander T. Souza
    FFCLRP-USP
  • Alessandra Alaniz Macedo
    FFCLRP-USP
  • Alexandre Moreli
    Instituto de Relações Internacionais
  • Ariani Di Felippo
    UFSCar
  • Aline Silva Costa
    LAPELINC-UESB
  • Arnaldo Cândido Junior
    UTFPR
  • Bruno Angelo Papa Dias
    FFLCH-USP
  • Bruno O. R. Silva
    FFCLRP-USP
  • Bruno Baldissera Carlotto
    ICMC-USP
  • Carolina Postali
    UFSCar
  • Caroline Adriane Alves
    ICMC-USP
  • Clarissa Lenina Scandarolli
    ICMC-USP
  • Cristiane Namuiti
    LAPELINC-UESB
  • Daniel Martins Arrais
    ICMC-USP
  • Daniel Pinto da Silva
    UTFPR
  • Diogo Castanho Emidio
    ICMC-USP
  • Dionéia M. Monte-Serrat
    FFCLRP-USP
  • Edresson Casanova
    ICMC-USP
  • Emanuel Huber da Silva
    ICMC-USP
  • Ester Gonçalves de Oliveira
    UFSCar
  • Evandro Eduardo Seron Ruiz
    FFCLRP-USP
  • Fabio D. Cunha
    ICMC-USP
  • Felipe Ribas Serras
    IME-USP
  • Fernando Gorgulho Fayet
    ICMC-USP
  • Fernando J. V. Silva
    EACH-USP
  • Flaviane R. Fernandes Svartman
    FFLCH-USP
  • Francimeire Leme Coelho
    UFSCar
  • Gabriel Ceregatto
    UFSCar
  • Gabriela Carolina Ferreira Gimenez
    ICMC-USP
  • Gabriela Wick Pedro
    UFSCar
  • Gilberto Nunes Neto
    ICMC-USP
  • Giovanna Costa e Silva
    ICMC-USP
  • Glauber de Bona
    EP-USP
  • Guilherme Lamartine de Mello
    IME-USP
  • Guilherme Martiniano de Oliveira
    FFCLRP-USP
  • Heliana Mello
    UFMG
  • Heloisa de Oliveira
    ICMC-USP
  • Ingrid da Mata
    ICMC-USP
  • Isaac Souza de Miranda
    UFSCar
  • Isabela Simões Vertoni
    ICMC-USP
  • Ivandré Paraboni
    EACH-USP
  • João Paulo C. F. Longo
    FFCLRP-USP
  • José Augusto Baranauskas
    FFCLRP-USP
  • Julia Trovó
    UFSCar
  • Ketlen V. M. Souza
    ICMC-USP
  • Laura Santos Gazana
    UFSCar
  • Livia Oushiro
    UNICAMP
  • Luana B. Belisário
    ICMC-USP
  • Lucas Gabriel Mendes Miranda
    ICMC-USP
  • Lucas Oliveira
    UTFPR
  • Lucelene Lopes
    ICMC-USP
  • Marcella Monteiro Lemos Couto
    UFSCar
  • Marcelo Finger
    IME-USP
  • Marcelo Gomes de Queiroz
    IME-USP
  • Marcio L. Inácio
    ICMC-USP
  • Marco A. Sobrevilla Cabezudo
    ICMC-USP
  • Magali S. Duran
    ICMC-USP
  • Maria Clara Paixão de Sousa
    FFLCH-USP
  • Maria Clara Ramos Morales Crespo
    FFLCH-USP
  • Maria das Graças V. Nunes
    ICMC-USP
  • Maria Lina de Souza Jeannine Rocha
    FFLCH-USP
  • Maria Luiza Azevedo Morais
    FFLCH-USP
  • Mariana Lourenço Sturzeneker
    FFLCH-USP
  • Mariana Marques da Silva
    FFLCH-USP
  • Marli Quadros Leite
    FFLCH-USP
  • Mateus Rossato Silva
    FFLCH-USP
  • Mateus T. Machado
    ICMC-USP
  • Matheus Jose Garcia Fagundes
    EACH-USP
  • Mayara Feliciano Palma
    FFLCH-USP
  • Miguel Arjona Ramirez
    EP-USP
  • Miguel Oliveira Jr
    UFAL
  • Moacir Ponti Jr
    ICMC-USP
  • Norton Trevisan Roman
    EACH-USP
  • Oto Vale
    UFSCar
  • Patrícia Brasil Silva
    FFLCH-USP
  • Paula Marin de Oliveira
    FFLCH-USP
  • Paulo Matheus Silva Oliveira
    FFLCH-USP
  • Priscila Starline Estrela Tuy Batista
    FFLCH-USP
  • Rafael Sicoli Pacheco
    FFLCH-USP
  • Raquel de Paula Guets
    FFLCH-USP
  • Renan de Lima Izaias
    FFLCH-USP
  • Renata Morais Mesquita
    FFLCH-USP
  • Ricardo Corso Fernandes Jr
    UTFPR
  • Ricardo Marcondes Marcacini
    ICMC-USP
  • Roberto Hirata Junior
    IME-USP
  • Rogério F. Sousa
    ICMC-USP
  • Ronald Beline Mendes
    FFLCH-USP
  • Roney L. S. Santos
    ICMC-USP
  • Ryan Marçal Saldanga Maganã Martinez
    UFSCar
  • Sandra Maria Aluísio
    ICMC-USP
  • Sebastião Carlos Leite Gonçalves
    UNESP Rio Preto
  • Solange Oliveira Rezende
    ICMC-USP
  • Sungwon Yoon
    EACH-USP
  • Thiago Alexandre Salgueiro Pardo
    ICMC-USP
  • Tommaso Raso
    UFMG
  • Vanessa Martins do Monte
    FFLCH-USP
  • Vinícius Gonçalves dos Santos
    FFLCH-USP
  • Welton A. Gomes
    FFCLRP-USP
  • Wesley Ramos dos Santos
    EACH-USP