New Evaluations

We performed the following evaluations using a Portuguese implementation of the EleutherAI LM Evaluation Harness (a sketch of how the corresponding metrics are computed follows the list):

  • ENEM (3-shot) - The Exame Nacional do Ensino Médio (ENEM) is an advanced high-school-level exam administered every year by the Brazilian government to students who wish to pursue a university degree. This dataset contains 1,430 questions, drawn from the 2010-2018, 2022, and 2023 exams, that do not require image understanding. - Data sources: [1], [2], [3], [4]. Metric - Accuracy.

  • BLUEX (3-shot) - BLUEX is a multimodal dataset built from the two leading university entrance exams conducted in Brazil, Convest (Unicamp) and Fuvest (USP), spanning from 2018 to 2024. The benchmark comprises 724 questions that do not have accompanying images. - Data sources: [1], [2], [3]. Metric - Accuracy.

  • OAB Exams (3-shot) - OAB Exams is a dataset of more than 2,000 questions from the Brazilian Bar Association's exams from 2010 to 2018. - Data sources: [1], [2]. Metric - Accuracy.

  • ASSIN2 RTE (15-shot) - ASSIN 2 (Avaliação de Similaridade Semântica e Inferência Textual - Evaluating Semantic Similarity and Textual Entailment) is the second edition of ASSIN, an evaluation shared task in the scope of the computational processing of Portuguese. Recognising Textual Entailment (RTE), also called Natural Language Inference (NLI), is the task of predicting if a given text (premise) entails (implies) another text (hypothesis). - Data sources: [1], [2], [3]. Metric - F1-macro.

  • ASSIN2 STS (15-shot) - Same dataset as above. Semantic Textual Similarity (STS) 'measures the degree of semantic equivalence between two sentences'. - Data sources: [1], [2], [3]. Metric - Pearson.

  • FAQUAD NLI (15-shot) - FaQuAD is a Portuguese reading comprehension dataset that follows the format of the Stanford Question Answering Dataset (SQuAD). The dataset addresses the abundance of questions sent by academics whose answers can be found in the institutional documents available in the Brazilian higher education system. It consists of 900 questions about 249 reading passages taken from 18 official documents of a computer science college of a Brazilian federal university and 21 Wikipedia articles related to the Brazilian higher education system. FaQuAD-NLI is a modified version of FaQuAD that repurposes the question-answering task as a textual entailment task between a question and its possible answers. - Data sources: [1], [2]. Metric - F1-macro.

  • HateBR (25-shot) - HateBR is the first large-scale, expert-annotated dataset of Brazilian Instagram comments for abusive language detection on the web and social media. HateBR was collected from comments on Brazilian politicians' Instagram accounts and manually annotated by specialists. It comprises 7,000 documents annotated with a binary classification (offensive versus non-offensive comments). - Data sources: [1], [2], [3]. Metric - F1-macro.
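For reference, the three metrics named above correspond to standard implementations. The following is a minimal sketch using scikit-learn and SciPy; the labels and scores are purely illustrative (the harness computes these metrics internally).

```python
# A minimal sketch of the metrics used above, assuming standard
# scikit-learn / SciPy implementations. Labels are illustrative only.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

# Multiple-choice tasks (ENEM, BLUEX, OAB Exams): plain accuracy.
gold = ["A", "C", "B", "D"]
pred = ["A", "C", "D", "D"]
print(accuracy_score(gold, pred))  # 0.75

# Classification tasks (ASSIN2 RTE, FAQUAD NLI, HateBR): macro-averaged F1,
# i.e., the unweighted mean of the per-class F1 scores.
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 0]
print(f1_score(gold, pred, average="macro"))

# Regression task (ASSIN2 STS): Pearson correlation between gold and
# predicted similarity scores (ASSIN 2 uses a 1-5 similarity scale).
gold = [4.5, 2.0, 3.5, 1.0]
pred = [4.0, 2.5, 3.0, 1.5]
print(pearsonr(gold, pred)[0])
```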

The notebook used to run these evaluations is lm-evaluation-harness-pt-br.ipynb, which is available on Colab.
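The harness can also be driven programmatically. Below is a minimal sketch assuming the lm_eval (>= 0.4) Python API; the task identifiers are assumptions, not confirmed names from the Portuguese port, and the model checkpoint is only an example.

```python
# A minimal sketch, assuming the lm_eval (>= 0.4) Python API. The task
# identifiers below are assumptions; check the Portuguese port's task
# registry for the exact names.
import lm_eval

# Few-shot settings as listed above, keyed by assumed task name.
FEWSHOT = {
    "enem_challenge": 3,
    "bluex": 3,
    "oab_exams": 3,
    "assin2_rte": 15,
    "assin2_sts": 15,
    "faquad_nli": 15,
    "hatebr_offensive": 25,
}

for task, n_shots in FEWSHOT.items():
    results = lm_eval.simple_evaluate(
        model="hf",  # Hugging Face causal-LM backend
        model_args="pretrained=nicholasKluge/TeenyTinyLlama-460m",  # example checkpoint
        tasks=[task],
        num_fewshot=n_shots,
    )
    print(task, results["results"][task])
```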


Benchmarks

| Model          | ASSIN2 RTE | ASSIN2 STS | BLUEX | ENEM  | FAQUAD NLI | HateBR | OAB Exams | Average |
|----------------|------------|------------|-------|-------|------------|--------|-----------|---------|
| Qwen-1.8B      | 64.83      | 19.53      | 26.15 | 30.23 | 43.97      | 33.33  | 27.20     | 35.03   |
| TinyLlama-1.1B | 58.93      | 13.57      | 22.81 | 22.25 | 43.97      | 36.92  | 23.64     | 31.72   |
| TTL-460m       | 53.93      | 12.66      | 22.81 | 19.87 | 49.01      | 33.59  | 27.06     | 31.27   |
| XGLM-564m      | 49.61      | 22.91      | 19.61 | 19.38 | 43.97      | 33.99  | 23.42     | 30.41   |
| Bloom-1b7      | 53.60      | 4.81       | 21.42 | 18.96 | 43.97      | 34.89  | 23.05     | 28.67   |
| TTL-160m       | 53.36      | 2.58       | 21.84 | 18.75 | 43.97      | 36.88  | 22.60     | 28.56   |
| OPT-125m       | 39.77      | 2.00       | 21.84 | 17.42 | 43.97      | 47.04  | 22.78     | 27.83   |
| Pythia-160m    | 33.33      | 12.81      | 16.13 | 16.66 | 50.36      | 41.09  | 22.82     | 27.60   |
| OLMo-1b        | 34.12      | 9.28       | 18.92 | 20.29 | 43.97      | 41.33  | 22.96     | 27.26   |
| TTL-460m-Chat  | 43.39      | 4.84       | 23.23 | 19.38 | 33.98      | 33.49  | 26.97     | 26.46   |
| Bloom-560m     | 33.33      | 8.48       | 18.92 | 19.03 | 43.97      | 37.07  | 23.05     | 26.26   |
| Pythia-410m    | 33.33      | 4.80       | 19.47 | 19.45 | 43.97      | 33.33  | 23.01     | 25.33   |
| OPT-350m       | 33.33      | 3.65       | 20.72 | 17.35 | 44.71      | 33.33  | 23.01     | 25.15   |
| GPT-2 small    | 33.26      | 0.00       | 10.43 | 11.20 | 43.52      | 33.68  | 13.12     | 20.74   |
| GPorTuguese    | 33.33      | 3.85       | 14.74 | 3.01  | 28.81      | 33.33  | 21.23     | 19.75   |
| Samba-1.1B     | 33.33      | 1.30       | 8.07  | 10.22 | 17.72      | 35.79  | 15.03     | 17.35   |
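The Average column is the plain arithmetic mean of the seven task scores (up to rounding of the per-task values), which can be checked directly:

```python
# Reproducing the Average column for the first row (Qwen-1.8B): it is the
# unweighted mean of the seven task scores shown in the table.
scores = [64.83, 19.53, 26.15, 30.23, 43.97, 33.33, 27.20]
print(f"{sum(scores) / len(scores):.2f}")  # 35.03
```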