MSc thesis / guided research project: Känguru LLM mathematical reasoning challenge
23.08.2024, Diploma theses, Bachelor's and Master's theses
Curate a mathematical reasoning benchmark dataset from the "Känguru der Mathematik" test, evaluate existing LLMs on it, and draw conclusions about students' performance over the years as well as potential insights into human vs. machine thinking
Introduction
The “Känguru der Mathematik” test is an international math and reasoning test taken by students at participating schools in more than 100 countries. In 2023, more than 5.5M students took it (!). The test has been running in Germany since 1998, and there are typically 4 different tests for different age groups. The exercises and solutions are available for all years and all age groups (3900 exercises in total). Besides the PDFs, we also have LaTeX sources for some of the more recent tests. In addition to the exercises and solutions, the results of students in Germany are publicly available (in somewhat aggregated form, i.e., in relatively fine-grained point ranges).
Goals
The main goal of this project is to use existing ML tools, especially large language models, to process the available data (exercise PDFs and LaTeX files) into a form that LLMs can read; to have LLMs take all of these tests under different input representations (how graphics and text are presented) and different prompts; and finally to evaluate the results both in isolation (or across different LLMs) and in comparison to students over the years.
Besides providing an entirely novel (as far as I know) German-language benchmark dataset for mathematical reasoning in LLMs (requiring both vision and language), this may also allow us to study the following scientific questions:
 First, we would like to better understand the historical development of students’ performance in mathematics in Germany. However, the average scores achieved over the years do not accurately reflect students’ mathematical knowledge, because the objective difficulty of the tests is hard to compare across years. Large multimodal language models like GPT-4 could, however, serve as an objective reference point: measured relative to the AI’s score, students’ results become easier to compare over the years.
 As a second step, we would like to analyze whether these tests can reveal fundamental differences between artificial and human analytical problem-solving abilities. For which “type” of task is human, adolescent, or child creativity superior to artificial intelligence? What kinds of tasks do language models solve particularly reliably? From this categorization, we ultimately hope to gain insights into how mathematics can be taught in schools as a creative and almost playful science.
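The reference-point idea in the first question can be made concrete with a small sketch. The posting does not fix a particular normalization, so the ratio used here (student mean divided by LLM score) is only one possible choice, and the class `YearResult`, the function `relative_performance`, and all numbers are hypothetical placeholders rather than real Känguru data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearResult:
    year: int
    student_mean: float  # mean student score as a fraction of the maximum (0..1)
    llm_score: float     # reference LLM score on the same test, same scale


def relative_performance(results):
    """Return {year: student_mean / llm_score}, skipping years where the LLM scored 0."""
    return {
        r.year: r.student_mean / r.llm_score
        for r in results
        if r.llm_score > 0
    }


# Made-up numbers purely for illustration, not real Känguru results.
toy = [
    YearResult(2010, student_mean=0.40, llm_score=0.80),
    YearResult(2020, student_mean=0.36, llm_score=0.60),
]
relative = relative_performance(toy)
for year, value in sorted(relative.items()):
    print(year, round(value, 2))
```

In this toy example the raw student mean drops from 0.40 to 0.36, while the score relative to the LLM rises from 0.5 to 0.6, which is the kind of pattern one would expect if the later test were simply harder.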
Work packages
On the methodological side, the project requires:
 reading PDFs and LaTeX files (with lots of pstricks graphics), from which we want to:
   extract each problem, together with its multiple-choice answer options and all relevant illustrations, as a PDF or image (one input modality for LLMs is the purely visual representation of each problem)
   extract the multiple-choice answer options as text (some options are themselves illustrations, in which case they should be extracted as graphics)
   extract the exercises as text and the corresponding graphics as images
   extract the correct solution for each exercise, as well as the historical performance of students, from the PDFs
 evaluating different LLMs (both commercial and open source) on the different possible input types of the exercises (image alone, image with answer options given, text + image, etc.) and collecting the results (possibly over multiple evaluations and temperatures)
 analyzing the results in comparison to human performance over the years, using (different) LLMs to normalize the difficulty of the tests across years, and performing further exploratory data analysis into which exercises are easy for both humans and machines, easy for humans but hard for machines, hard for humans but easy for machines, or hard for both (split by age group)
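The four-way split in the last work package could be prototyped roughly as follows. This is a minimal sketch under several assumptions: that a per-exercise human solve rate can be derived from the published results (which are only described as aggregated point ranges), that the LLM solve rate is the fraction of correct answers over repeated runs, and that a single threshold of 0.5 separates "easy" from "hard". The function names and exercise identifiers are hypothetical.

```python
from collections import defaultdict


def categorize(human_rate, llm_rate, threshold=0.5):
    """Label one exercise by whether humans and the LLM reach the solve-rate threshold."""
    human = "easy" if human_rate >= threshold else "hard"
    machine = "easy" if llm_rate >= threshold else "hard"
    return f"{human} for humans, {machine} for machines"


def group_by_category(exercises, threshold=0.5):
    """Map each category label to the list of exercise ids that fall into it."""
    groups = defaultdict(list)
    for ex_id, (human_rate, llm_rate) in exercises.items():
        groups[categorize(human_rate, llm_rate, threshold)].append(ex_id)
    return dict(groups)


# Made-up (human solve rate, LLM solve rate) pairs; ids are hypothetical.
toy_exercises = {
    "2015-A1": (0.90, 0.95),
    "2015-B4": (0.80, 0.20),
    "2015-C2": (0.15, 0.70),
    "2015-C8": (0.10, 0.05),
}
groups = group_by_category(toy_exercises)
for label, ids in groups.items():
    print(f"{label}: {ids}")
```

Running the same grouping separately per age group, and with several thresholds, would show how sensitive the categories are to the cut-off.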
Expectations
I expect the successful candidate to be comfortable and experienced in using large deep learning models and their APIs, to know how to find and use the best tool for the job (e.g., for document processing), and to be able to independently formulate interesting research questions and provide careful analyses of the obtained data. Ideally, the candidate has prior experience with high-performance compute environments (a Linux cluster managed by SLURM) and with running deep learning models on (multiple) GPUs.
Contact: niki.kilbertus@tum.de
More Information
Overview slide of the project (PDF, 684.3 kB)
