NLP2 Resources to Bring NLP of Portuguese to the State of the Art
Putting together open data and tools to enable high-level NLP of the Portuguese language.
Leaders: Marcelo Finger and Thiago A. S. Pardo
We aim to produce resources for Brazilian Portuguese that will enable state-of-the-art tools and applications. We concentrate on both written and spoken modalities for Portuguese, focusing on three main tasks:
- from a syntactic perspective, we aim to produce a million-word multi-genre corpus of annotated texts for building robust parsing models;
- from a language-modeling perspective, we aim to build a pipeline for constructing context-based neural models, with applications to natural language inference; and
- for spoken language, we aim to build multitask corpora for speech recognition, multi-speaker synthesis, speaker identification, voice cloning and speech-as-a-biomarker classification, producing a large corpus of recorded and transcribed spoken Brazilian Portuguese.
Each task includes an activity that intersects with another front. For instance, the transcribed speech and neural models are to be used by the syntactic initiative to train neural parsing models; the parsing models may provide more data for the natural language inference and speech-based tools; and the speech data will be used to develop speech-as-a-biomarker neural models.
Initial applications are to be in speech-based disease diagnosis, opinion mining, and fake news detection.
Emphasis is placed on building and using open and open-source resources, so that they can be shared inside and outside this project.
The goal is to grow resources that bring Portuguese NLP to the world state of the art, effectively moving Portuguese out of the low-resource language scenario, with essential resources and tools as well as applications addressing some currently critical societal demands.
We aim to obtain the following results for the syntactical front:
- Universal Dependencies-annotated corpus (at least 5M tokens);
- Refined linguistic annotation model (adapted to the Portuguese language and to multi-genre demands);
- Better parsing models for Portuguese.
On distributional models and NLI:
- 1-billion-token plain-text corpus made publicly available;
- Training pipeline for distributional models;
- Complete NLI classification with distribution report and gaps;
- Application of distributional models to NLI classification trained on SICK-BR (in an evolutive approach).
For spoken language:
- Training of two models, one for speaker identification and another for speech recognition, using the datasets compiled during the first year.
Thiago Alexandre Salgueiro Pardo, ICMC-USP
Sandra Maria Aluísio, ICMC-USP
Ariani Di Felippo, UFSCar
Evandro Eduardo Seron Ruiz, FFCLRP-USP
Flaviane R. Fernandes Svartman, FFLCH-USP
Arnaldo Cândido Junior, UTFPR
Norton Trevisan Roman, EACH-USP
Solange Oliveira Rezende, ICMC-USP
Ricardo Marcondes Marcacini, ICMC-USP
Maria Clara Paixão de Sousa, FFLCH-USP
Roberto Hirata Junior, IME-USP
Marli Quadros Leite, FFLCH-USP
Glauber de Bona, POLI-USP
Marcelo Gomes de Queiroz, IME-USP
Miguel Arjona Ramirez, POLI-USP
Alessandra Alaniz Macedo, FFCLRP-USP
José Augusto Baranauskas, FFCLRP-USP