HiWi / Student Assistant: Automated classification of tuberculosis diagnosis using Machine Learning and Deep Learning
29.07.2024, Student assistants, internships, student research projects
For the winter semester, the AI consultant team at Helmholtz Munich is seeking a student to work on the automated classification of tuberculosis diagnosis using machine learning and deep learning.
Background
Tuberculosis (TB) is one of the top ten causes of death worldwide and the leading cause of death from an infection in humans, causing even more deaths than HIV/AIDS. TB is caused by the bacterium Mycobacterium tuberculosis (MTB), which most commonly infects the lungs and spreads through the air from a person with the active form of TB to others. The easy transmission of MTB leads to a high infection burden worldwide. However, the infection is asymptomatic in many cases, and hence not everyone infected with MTB is a carrier. The average risk of developing active TB (aTB) when infected with MTB is about 5-10%.
Once a person is infected with MTB, the cell-mediated immune response controls the infection, mainly via MTB-specific CD4 T cells. If the immune system can clear the MTB infection, the TB infection is eliminated. Often, however, the immune system cannot eliminate the infection and only keeps it under control, thereby reducing the likelihood of aTB. This form of TB infection is called latent tuberculosis infection (LTBI) and is the most common form of TB infection. Only about 5-10% of all LTBI patients will eventually develop aTB, the form of TB infection that causes clinical symptoms and can be detected with radiological or microbiological diagnostic tools.
Most traditional diagnostic tests primarily detect MTB infection but fail to differentiate between aTB, LTBI and successfully treated TB, which makes them unsuitable for diagnosing aTB and for treatment monitoring. Both tools that form the current gold standard rely on the detection of MTB in sputum samples, which makes them error-prone due to the risk of contamination during sample extraction. Furthermore, the liquid culture method they are based on is a lengthy process that yields test results only after a few weeks. T cell activation marker (TAM)-TB assays, on the other hand, give rapid test results and can distinguish between aTB, LTBI and treated TB by determining phenotypic and functional characteristics of MTB-specific CD4 T cells via flow cytometry [1-4].
Motivation
Despite its ability to discriminate between aTB, LTBI and treated TB, the TAM-TB assay remains an expensive technique that requires specialized equipment and trained personnel to operate. To diagnose the patient status via TAM-TB assays, medical personnel have to manually gate the marker profiles in flow cytometry (FACS) data. Manual gating is an ad-hoc strategy to identify and select populations of interest from multicolour flow cytometry data. The process is prone to inter-expert variation and does not scale well to the classification of hundreds of patient samples.
To make this diagnostic approach more broadly applicable, a reliable method for the automated and accurate analysis of cell profiles from FACS data is needed.
Test data
Annotated data from ongoing diagnostic TB studies at the Division of Infectious Diseases and Tropical Medicine, University Hospital, LMU Munich, and the German Center for Infection Research.
Approach
The goal of this project is to implement a machine learning pipeline as an alternative to manual gating. For this purpose, data from a ReFu Screen Study is provided that includes FACS data for 300 well-characterized patients, where the outcome for each patient was derived by manual gating. The pipeline should automatically and accurately analyse the cell profiles of the FACS data and provide a rapid classification of patients into aTB / non-aTB status based on their FACS data.
We have already implemented a machine learning pipeline that predicts the TB status of a patient from input FACS data. To this end, we compute input features from the provided FACS data as follows:
- merge cell profiles of FACS data from all training patients
- compute a neighborhood graph of cells from the FACS data
- cluster the cells with Louvain clustering [5] using the weighted adjacency matrix of the neighborhood graph
- get distribution of cells among clusters for each patient
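The feature computation steps above can be sketched as follows. This is a minimal illustration and not the team's actual implementation: the function name `cluster_distributions`, the parameters `n_neighbors` and `seed`, the inverse-distance edge weights, and the choice of scikit-learn and NetworkX are all assumptions made for the example.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities
from sklearn.neighbors import kneighbors_graph


def cluster_distributions(cells, patient_ids, n_neighbors=10, seed=0):
    """Hypothetical helper: cluster merged cells and summarize them per patient.

    cells:       (n_cells, n_markers) array merged over all training patients
    patient_ids: (n_cells,) array naming the patient each cell belongs to
    """
    # k-nearest-neighbor graph; stored values are Euclidean distances.
    dist = kneighbors_graph(cells, n_neighbors=n_neighbors, mode="distance")
    dist = dist.maximum(dist.T)  # symmetrize the graph

    # Assumed weighting: closer cells get larger edge weights.
    weights = dist.tocoo()
    edge_weights = 1.0 / (1.0 + weights.data)

    graph = nx.Graph()
    graph.add_nodes_from(range(cells.shape[0]))
    graph.add_weighted_edges_from(
        zip(weights.row.tolist(), weights.col.tolist(), edge_weights.tolist())
    )

    # Louvain clustering on the weighted neighborhood graph.
    communities = louvain_communities(graph, weight="weight", seed=seed)
    labels = np.empty(cells.shape[0], dtype=int)
    for cluster_index, members in enumerate(communities):
        labels[list(members)] = cluster_index

    # Fraction of each patient's cells that falls into each cluster.
    patients = np.unique(patient_ids)
    fractions = np.zeros((len(patients), len(communities)))
    for row, patient in enumerate(patients):
        counts = np.bincount(labels[patient_ids == patient],
                             minlength=len(communities))
        fractions[row] = counts / counts.sum()
    return patients, labels, fractions
```

Each row of `fractions` sums to one and can serve as the feature vector of one patient. Building the graph in pure Python like this is only practical for small examples; real FACS data with many cells per patient would likely need a more scalable graph library.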
Next, we trained a Random Forest model that predicts the TB status of a patient from the computed input features (cell distributions across clusters). To predict the status of a new patient, we must be able to compute the above-mentioned features for new FACS data. However, doing inference on a neighborhood graph is not trivial. For inference, we therefore trained a runtime- and memory-efficient multi-layer perceptron model that can precisely predict the cluster label for a cell in the FACS data.
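The two-step inference described above could look roughly like the sketch below. The function names `fit_two_step` and `predict_status`, the hyperparameters, and the use of scikit-learn are assumptions for illustration, not the existing pipeline's code. The sketch also assumes that the columns of the patient feature matrix follow the sorted order of the cluster labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier


def fit_two_step(cells, cell_clusters, patient_features, patient_status, seed=0):
    """Fit both models of the (hypothetical) two-step pipeline."""
    # Step 1: MLP that maps a single cell's marker profile to its cluster label.
    cluster_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                  random_state=seed)
    cluster_model.fit(cells, cell_clusters)

    # Step 2: Random Forest that maps per-patient cluster fractions to TB status.
    status_model = RandomForestClassifier(n_estimators=100, random_state=seed)
    status_model.fit(patient_features, patient_status)
    return cluster_model, status_model


def predict_status(new_cells, cluster_model, status_model):
    """Predict the status of one new patient without rebuilding the graph."""
    predicted = cluster_model.predict(new_cells)
    # Count cells per known cluster, in the sorted order of classes_.
    counts = np.array([(predicted == c).sum() for c in cluster_model.classes_])
    features = (counts / counts.sum()).reshape(1, -1)
    return status_model.predict(features)[0]
```

The point of this design is that the neighborhood graph and the clustering are computed only once on the training cells. New patients only need a forward pass through the perceptron followed by a histogram.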
Aim
- tasks on the existing pipeline:
  - review the existing code of the two-step model pipeline
  - add an explainability analysis for the final model
  - compare the results to manual gating
- modifications to the existing pipeline:
  - research an alternative feature representation approach that can handle large amounts of data and could replace the inefficient clustering approach
  - implement an alternative end-to-end pipeline in addition to the two-step classification baseline model, e.g. a multi-instance learning model where each patient represents one bag and the classification task can be replaced by an anomaly detection task. To this end, transfer approaches from histopathology, where multi-instance learning is successfully used in the presence of data scarcity [6-8]
  - automate the evaluation for potential applications in the clinic
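To make the multi-instance idea from the aims concrete, the sketch below shows the forward pass of attention-based pooling, one common way to turn a bag of cells into a single patient-level prediction. It is an illustration only: the function name `attention_mil_forward` and the parameters `V`, `w`, `clf_w` and `clf_b` are hypothetical, the parameters are untrained, and the project may well choose a different multi-instance architecture.

```python
import numpy as np


def attention_mil_forward(bag, V, w, clf_w, clf_b):
    """Forward pass of attention-based multi-instance pooling for one patient.

    bag:   (n_cells, n_markers) marker profiles of one patient (the bag)
    V:     (hidden, n_markers) attention projection (would be learned)
    w:     (hidden,) attention scoring vector (would be learned)
    clf_w: (n_markers,) weights of a linear bag-level classifier
    clf_b: scalar bias of the bag-level classifier
    """
    # One unnormalized attention score per cell.
    scores = np.tanh(bag @ V.T) @ w
    # Softmax over the cells of the bag (shifted for numerical stability).
    scores = scores - scores.max()
    attention = np.exp(scores) / np.exp(scores).sum()
    # Bag embedding: attention-weighted average of the cell profiles.
    embedding = attention @ bag
    # Bag-level probability from a logistic classifier on the embedding.
    probability = 1.0 / (1.0 + np.exp(-(embedding @ clf_w + clf_b)))
    return probability, attention
```

Because the pooling is a weighted sum over cells, the prediction does not depend on the order of the cells in a bag, and the attention weights indicate which cells contributed most to it.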
References
[1] Ahmed, Mohamed IM, et al. "Phenotypic changes on Mycobacterium tuberculosis-specific CD4 T cells as surrogate markers for tuberculosis treatment efficacy." Frontiers in immunology 9 (2018)
[2] Portevin, Damien, et al. "Assessment of the novel T-cell activation marker–tuberculosis assay for diagnosis of active tuberculosis in children: a prospective proof-of-concept study." The Lancet infectious diseases 14.10 (2014): 931-938.
[3] Schuetz, Alexandra, et al. "Monitoring CD27 expression to evaluate Mycobacterium tuberculosis activity in HIV-1 infected individuals in vivo." PloS one 6.11 (2011): e27284.
[4] Ahmed, Mohamed IM, et al. "The TAM-TB assay - a promising TB immune-diagnostic test with a potential for treatment monitoring." Frontiers in pediatrics 7 (2019): 27.
[5] Traag, V. A., Waltman, L., & van Eck, N. J. (2018). From Louvain to Leiden: guaranteeing well-connected communities. arXiv preprint.
[6] Kazeminia, S., Sadafi, A., Makhro, A., Bogdanova, A., Albarqouni, S., & Marr, C. (2022). Anomaly-aware multiple instance learning for rare anemia disorder classification. arXiv preprint arXiv:2207.01742.
[7] Kazeminia, S., Sadafi, A., Makhro, A., Bogdanova, A., Marr, C., & Rieck, B. (2023). Topologically-Regularized Multiple Instance Learning for Red Blood Cell Disease Classification. arXiv preprint arXiv:2307.14025.
[8] Engelmann, J. P., Palma, A., Tomczak, J. M., Theis, F. J., & Casale, F. P. (2023). Attention-based Multi-instance Mixed Models. arXiv preprint arXiv:2311.02455.
Contact: lisa.barros@helmholtz-munich.de