Research activities in the C4AI are organized around five Great Challenges that combine fundamental aspects of artificial intelligence with applications in selected fields such as agribusiness, climate, and health. The current challenges are:


Resources to Bring NLP of Portuguese to the State of the Art

Putting together open data and tools to enable high-level NLP of the Portuguese language.

Leaders: Marcelo Finger, Sandra M. Aluísio and Thiago A. S. Pardo


We aim to produce resources for Brazilian Portuguese that will enable state-of-the-art tools and applications. We concentrate on both written and spoken modalities for Portuguese, focusing on three main tasks:

  • with a syntactical view, we aim to produce a million-word multi-genre corpus of annotated texts for building robust parsing models;
  • with a language model view, we aim to generate a pipeline for constructing context-based neural models, with applications to natural language inference; and,
  • for spoken language, we aim to build multitask corpora for speech recognition, multi-speaker synthesis, speaker identification, voice cloning and speech-as-a-biomarker classification, producing a large corpus of recorded and transcribed spoken Brazilian Portuguese.

Our overall goal is to grow resources that bring Portuguese NLP to the world state of the art, effectively moving Portuguese out of the low-resource language scenario, providing essential resources and tools as well as applications that address pressing societal demands.

We aim to obtain the following results for the syntactical front:

On distributional models and NLI:

For spoken language:

  • Training two models – one for speaker identification and another for speech recognition – with the datasets compiled during the first year.
Project Websites (External)

NLP2 – Web portal

POeTiSA – Portuguese processing: Towards Syntactic Analysis and parsing

TaRSila – Annotation task for speech recognition and synthesis of the Portuguese language

Carolina – General corpus of contemporary Brazilian Portuguese


Artificial Intelligence Technologies to Strengthen the Indigenous Languages of Brazil

Using Artificial Intelligence in partnership with Indigenous communities to develop tools to preserve, revitalize and disseminate the Indigenous languages of Brazil.

Leaders: Claudio Pinhanez, Luciana Storto


Most of the Indigenous languages of Brazil are under threat of disappearing by the end of the 21st century. On the one hand, Indigenous peoples and their territories continue to be under attack by individuals and organizations, with invasions, the spread of disease, and the destruction of ecosystems on which they depend. On the other hand, the violent processes which began with colonization and persist until today, such as forced migration, catechesis, and the imposition of European languages, have significantly affected the number of speakers of Indigenous languages.

This joint project by IBM Research and USP explores the creation and use of Artificial Intelligence for the development, in partnership with Indigenous communities, of tools to preserve, revitalize, and disseminate Indigenous languages of Brazil. Although AI has made great strides in the last 10 years in languages such as English and Chinese, its use in Indigenous-language contexts is still incipient, hampered by the lack of data and of programs to support research and development. PROINDL focuses on exploring innovative solutions to these challenges.

This project is aligned with the objectives and principles of the International Decade of Indigenous Languages established in 2022 by the UN and UNESCO, which aims at the strengthening and continuity of Indigenous languages around the world, as articulated in the “Declaration of Los Pinos” (Chapoltepek). Within this context, we have ongoing partnerships with Indigenous communities in the São Paulo city area, exploring together with their members the development of necessary, desired, and sustainable solutions.


The project comprises the following areas of work:

  • Adaptation of AI Technologies to Indigenous Languages: techniques and algorithms that require little data are explored and developed, with the aid of Large Language Models (LLMs), to build automatic translators for both text and speech, and to support the writing and use of Indigenous languages by their communities at school, in everyday life, and on social media.
  • Tools to Support Linguistic Work: opportunities are being explored for the use of AI in linguistic documentation and analysis, through the monitoring of registration, data collection, research, and analysis activities carried out in and with various Indigenous communities. From the observations and through a co-design process with linguists and speakers of Indigenous languages, tools will be developed to support linguistic work.
  • Use of Indigenous Languages in Social Networks: based on a mapping of the use of Indigenous languages in social networks in Brazil, tools and support technologies are being investigated for the dissemination and use of Indigenous languages in social networks, under the control and management of the Indigenous leaders and communities.
  • Robots and Chatbots in Indigenous Education: in a pioneering effort, advanced technologies are being explored for the use of social robots and chatbots in educational activities with Indigenous children and youth, in partnership with Indigenous schools.
  • Teaching Information Technology, Programming, and Linguistics to Indigenous Peoples: the project includes programs for teaching information technology, computer programming, and linguistic documentation and analysis for members and supporters of Indigenous communities, towards a sustainable continuity of the technologies developed.


Knowledge-Enhanced Machine Learning for Reasoning about Ocean Data

Merging data-driven learning and knowledge-based reasoning to answer complex queries about the Blue Amazon.

Leaders: Fabio Cozman and Eduardo Tannuri


Recent breakthroughs in AI have depended on parallel processing of big datasets so as to learn large models through optimization. Further breakthroughs should be possible by judiciously enlisting knowledge representation and planning techniques so as to make learning more efficient, less brittle, and less biased.

In this context, we investigate conversational agents that can answer high-level questions. Conversations with such agents should include arguments, causes, explanations, and reasoning; it should be possible to conduct a conversation over time and with a purposeful goal, taking into account desires and intentions of the user. Overall, these conversational agents are a laboratory in which to study the connection between data-driven machine learning and knowledge-driven reasoning and planning.

  • The concrete goal is to develop a framework for conversational agents that can respond to high-level requests over time in a particular domain, including questions, arguments, causes, explanations, inferences, and plans about specific tasks. We are building a complete conversational expert on the Blue Amazon so as to test and showcase the framework; we expect to develop general tools that are not excessively tied to a particular domain, so that the framework can be specialized for any given domain of interest. A broader goal is to investigate how such conversational agents can benefit from data-driven and knowledge-driven techniques simultaneously.
  • The BLue Amazon Brain (BLAB) aspires to carry all existing information about the Blue Amazon, both by capturing technical expertise in the form of rules and facts and by harvesting data sources available from sensors and from textual information, including scientific papers and newspaper articles.


Graph-Oriented Machine Learning for Stroke Diagnosis and Rehabilitation

Improving Stroke diagnosis, treatment, and rehabilitation with graph-oriented machine learning on multimodal data.

Leaders: José Krieger and Zhao Liang


The recent advances of machine learning in medicine have been remarkable. However, there are still important issues that need to be addressed. Here we deal with two important questions:

1) How to integrate and select relevant medical features (biomarkers) from large-scale heterogeneous and dynamical sources?

In applications of machine learning in medicine, we often have to deal with large-scale, heterogeneous, and dynamical data sets. For example, in applications and scientific research related to stroke, or cerebrovascular accident (CVA), various kinds of data accumulated over long periods of time – such as texts, images, genetic biomarkers, electric signals, patients’ symptoms, and geographic information – are often available even for a single patient. Information integration is essential to correctly address health problems, as healthcare professionals rarely use only one type of information when solving a medical problem. Another important aspect when dealing with a large number of features is to properly select the most relevant ones: understanding which features are most relevant for the classification of a stroke provides important information for quick and accurate diagnosis and treatment.
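The relevance-ranking idea can be sketched with scikit-learn: mutual information scores each feature against the class label, and only the top-scoring features are kept. The synthetic "patient" features, names, and thresholds below are illustrative assumptions, not project data.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n = 400
age = rng.normal(60, 10, n)        # informative feature
biomarker = rng.normal(0, 1, n)    # informative feature
noise_a = rng.normal(0, 1, n)      # irrelevant feature
noise_b = rng.normal(0, 1, n)      # irrelevant feature
X = np.column_stack([age, biomarker, noise_a, noise_b])
# The label depends only on the two informative columns.
y = ((age > 60) & (biomarker > 0)).astype(int)

# Score each feature by its mutual information with the label,
# then keep the two highest-scoring ones.
score = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score, k=2).fit(X, y)
print(selector.get_support())  # True for the columns the label depends on
```

The same ranking step generalizes to any tabular representation; heterogeneous sources would first have to be encoded into a common feature space.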

2) How to interpret decisions made by machine learning algorithms and how to integrate human and artificial intelligence?

Currently, successful machine learning techniques do not provide an explicit mechanism to satisfactorily explain how a given result is achieved. Such a logical explanation is necessary in many medical applications, for example, in disease diagnosis. The lack of interpretability deeply impacts the possibilities of integrating human and artificial intelligence in medicine. In the majority of cases, healthcare professionals still treat machine learning algorithms as black boxes, which again reflects the lack of interpretability of machine learning strategies.

Our approaches primarily deal with Cerebrovascular Accident (CVA) as the application domain. According to the WHO, more than one billion people in the world have some disability; among chronic diseases, stroke stands out because it is the main cause of disability and the second cause of death in the world. Much progress has been made in understanding the risk factors, mortality, and rehabilitation of stroke; however, incidence continues to increase as a result of an aging population and other risk factors. The identification of more precise and sensitive stroke biomarkers can help to change this worrying situation. Furthermore, developing diagnostic approaches with high accuracy and prediction of individualized outcomes is one of the main ambitions of the WHO 2014–2021 global action plan (SDG 3 – good health and well-being for all at all ages).


The objective here is two-fold.

  • To contribute to machine learning by developing new techniques to handle the situations described above.
  • To apply the new graph-oriented machine learning (GOML) techniques developed in this project to obtain a better understanding of stroke (causes, impact, ways to improve decisions, and rehabilitation). It is also important to investigate ways to mitigate the impact of stroke in the Brazilian population, a major social contribution.
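One simple instance of the graph-oriented idea is label propagation, where a few known labels diffuse over a similarity graph built from the data to classify the remaining points. The sketch below uses scikit-learn's LabelSpreading on a synthetic dataset; the data and parameters are illustrative, not the project's GOML techniques.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Two interleaving clusters stand in for a graph-structured dataset;
# only five points per class carry a known label.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
y = np.full_like(y_true, -1)  # -1 marks unlabeled points
for c in (0, 1):
    idx = np.where(y_true == c)[0][:5]
    y[idx] = c

# Labels diffuse over a k-nearest-neighbor similarity graph.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
accuracy = float((model.transduction_ == y_true).mean())
print(f"transductive accuracy: {accuracy:.2f}")
```

This transductive setting mirrors medical data, where expert labels are scarce but unlabeled records are plentiful and related through similarity structure.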

For the proposed study, we will use datasets from ATLAS (Anatomical Tracings of Lesions After Stroke), InCor (Heart Institute of Medical School of USP) stroke dataset (200 T1-weighted MRIs and Reports), and the data sets of IMREA – Instituto de Medicina Física e Reabilitação do Hospital das Clínicas FMUSP.


Causal Multicriteria Decision Making in Food Production Networks

Developing causal multicriteria AI models for decision making under uncertainty in food production networks.

Leaders: Antonio Saraiva and Alexandre Delbem


The agribusiness productive cycles, environmental sustainability, and food security are current demands that challenge authorities worldwide. In these settings, proper modeling of heterogeneous large-scale information, resilient learning systems that work with the dynamicity of real environments, and methods that balance many concerns about costs and benefits are significant challenges. Representation learning, resilience enhancement, and multicriteria decision making are important tools to deal with those challenges.

The construction of reliable causal models is an open problem. Advanced methods for generating Dynamic Bayesian Networks (DBNs) based on the capture of tacit knowledge can enable causal models that combine continuous and discrete variables (one form of heterogeneity) and that are also adaptive.

Hybridization through ensembles of conventional knowledge-based models and learning methods is a possible way to produce useful solutions for real-world complex problems. Such processes can contribute to resilience through dataset evaluation and improvement, and through the selection of learner parameters (as meta-features) in a scenario of ensemble setup, dynamic ensemble selection, and meta-learning. The integration of resilience-enhanced models with the DBN-based approaches may generate a higher level of predictive resilience.
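As a minimal sketch of the ensemble idea, scikit-learn's StackingClassifier can stand in for the resilient ensembles described above: a meta-learner is trained on out-of-fold predictions of heterogeneous base learners. The dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a heterogeneous tabular dataset.
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]
# The meta-learner combines out-of-fold predictions of the base learners.
ensemble = StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression())
ensemble.fit(X_tr, y_tr)
print(f"test accuracy: {ensemble.score(X_te, y_te):.2f}")
```

Dynamic ensemble selection and evolutionary ensemble adaptation would replace this fixed stacking setup with learners chosen or evolved per region of the feature space.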

The construction of new approaches for multicriteria decision making that combine the solutions found by the conventional knowledge-based techniques and by the proposed learning methods seems a promising strategy to generate short- and long-term innovations.

An important aspect of food security is climate change, mainly involving water supply. Hydrological models are investigated with the aim of developing preliminary methods that can combine knowledge-based and data-driven approaches. Models for critical hydrological conditions, such as droughts and floods, are also investigated in order to improve predictions of crop water stress or perishability.

  • Representation Learning: new strategies for heterogeneous information can emerge by extending multiple representation techniques to construct a new unified feature space. In this way, an embedding is generated that incorporates the main patterns and correlations existing in multiple types of information. Its integration with modeling methods that capture tacit knowledge can contribute to dynamic representation learning. The first challenge is the automatic acquisition of those structures and their integration with DBNs.
  • Resilience Enhancement: the investigation of adaptive (evolutionary) ensembles based on large-margin distributions is a promising route to the resilience enhancement of learning. The multiobjective combination of separability measures can make it possible to find patterns in the marginal sample distribution that, in turn, can produce resilient learning. The investigation of DBN-based approaches can enable the integration of predictive resilience with representation learning, covering dynamicity (concept drift) and heterogeneity. Moreover, the construction of large-scale DBNs is a challenge that multiobjective evolutionary algorithms (MOEAs) with proper representations can address.
  • Decision Making: conventional knowledge-based techniques for multicriteria decision making are relevant for dealing with the conflicting demands in AgriBio. The robustness and stability of approximate Pareto fronts are the basis for creating new approaches dedicated to the AgriBio challenges under uncertainty. Resilient criteria should be chosen or formulated to address climate and market changes; they should also enable the construction of procedures for decision making from the solutions found by the techniques developed in the C4AI-AgriBio.
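The notion of an approximate Pareto front underlying the decision-making work can be made concrete with a small non-dominated filter. The candidate trade-offs below (cost versus water use, both to be minimized) are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated points, minimizing every objective.

    A point q dominates p when q is no worse than p in all objectives
    and strictly better in at least one.
    """
    front = []
    for p in points:
        dominated = any(
            all(o <= v for o, v in zip(q, p)) and any(o < v for o, v in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (cost, water-use) trade-offs for candidate crop plans.
candidates = [(2, 9), (3, 5), (5, 4), (6, 8), (4, 3), (7, 2)]
print(pareto_front(candidates))  # → [(2, 9), (3, 5), (4, 3), (7, 2)]
```

A decision procedure would then pick one point from this front according to resilient criteria such as tolerance to climate or market changes.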

AI Humanity

AI in Emerging Countries: Public Policies and the Future of Work

Mapping, understanding, and addressing the impact of AI in emerging countries.

Leaders: Glauco Arbix, João Paulo Veiga


Societies are increasingly delegating to AI systems many complex and risk-intensive decisions, such as diagnosing patients, hiring workers, granting parole, and managing financial transactions. At the same time, there is significant consensus that in the field of AI, emerging countries are lagging behind pioneering countries, in particular the USA and China.

Countries like Brazil urgently need to get closer to the best practices in AI. To that end, they must develop strategies to qualify professionals, build a specific ecosystem, and develop public policies aimed at realizing the country’s potential. Moreover, AI, automation, and rapid digitalization may reduce employment and alter the labor market; the use of biometric techniques can accentuate prejudices; and the conduct of companies without a base of values can erode ethical and even democratic principles adopted by society.

It is necessary to examine novel questions around liability, the limits of current regulatory frameworks in dealing with disparate and unexpected impacts, and ways of preventing algorithmic harms to society. Given AI’s broad impact, these pressing questions can only be successfully addressed from a multidisciplinary perspective.

  • Analyze the impacts of AI on individual job search strategies, company recruitment policies, corporate ethics, and the definition of professional qualifications.
  • Identify the new skills required by AI and define guidelines for the qualification of professionals in order to mitigate the impact on employment and the increase in inequality.
  • Assess the progress of biometric techniques and the regulatory framework taking shape in the country, in order to guarantee the security and privacy of personal data.
  • Develop the debate on corporate ethics, regulation and self-regulation.
  • Classify Brazilian companies based on indicators for corporate ethics and compare them to international indicators.
  • Analyze the relationship between the quality of information of public interest and the quality of democracy.
  • Explore the interactions between humans and intelligent social robots, with special attention to the social, cultural and economic context, in order to formulate guidelines and protocols for the development and application of technology aimed at establishing safe and ethical relationships between humans and machines.