Álvaro García-Barragán

Tagline: Data Scientist & Computer Engineer | NLP Engineer | Transformers

Madrid, Spain

About

💻 Developer of machine learning models for medical data. 🏥

🕰️ Currently researching NLP for medical data, in particular fine-tuning Named Entity Recognition models and recent LLMs. ⚙️

⛏️ Used to working collaboratively with Git as a software developer and maintainer.
😺 GitHub user: https://github.com/Alvaro8gb

⌨️ Used to writing efficient, scalable code with software design patterns 🔷.

Programming languages I know, in order of preference: Python 🐍, Java ☕️, C 🐧.

Education

  • MSc

    from: 2023, until: 2024

    Field of study: Data Science, School: Universidad Politécnica de Madrid

    Description

    Courses:

    • Machine Learning
    • Deep Learning
    • Data Processes
    • Data Visualisation
    • Big Data
    • Statistical Data Analysis
    • Open Data and Knowledge Graphs
    • Cloud Computing and Big Data
    • Information Retrieval, Extraction and Integration
    • Programming for Data Processing
    • Image Processing, Analysis and Classification
    • Ethical, Legal and Societal Aspects in Data Science
    • Research Methodology
  • TCCLS Best Paper Award

    from: 2023, until: present

    Field of study: 2023 IEEE International Symposium on Computer-Based Medical Systems (CBMS), School: Università degli Studi dell'Aquila

  • BSc

    from: 2019, until: 2023

    Field of study: Computer Engineering, School: Universidad Complutense de Madrid

    Description

    9.5/10 in Final Degree Project, titled “Structuring Electronic Health Records of breast cancer with Natural Language Processing”

  • English

    from: 2019, until: 2021

    Field of study: English Language and Literature, General, School: Escuela Oficial de Idiomas

Work Experiences

  • Data Science Researcher

    from: 2024, until: present

    Organization: Medical Data Analytics Laboratory (MEDAL), Location: Madrid, Community of Madrid, Spain

    Description:

    Associated with the project: ELADAIS (Extracción, Almacenamiento y Análisis de Datos con Alto Impacto Social) https://eladais.org/

  • Freelance Reviewer

    from: 2024, until: present

    Organization: Freelance

    Description:

    Reviewing articles on NLP for high-impact journals:

    • Database: The Journal of Biological Databases and Curation
    • Computers in Biology and Medicine
    • Multimedia Tools and Applications
  • Teacher

    from: 2024, until: 2024

    Organization: Samsung Innovation Campus

    Description:

    The Samsung Innovation Campus, in collaboration with the Polytechnic University of Madrid (UPM), offers a fully funded training program exclusively for women. The course includes comprehensive instruction in programming with Python, covering essential topics such as algorithms and data structures. The program is divided into two parts: the first focuses on foundational programming skills, while the second delves into advanced Artificial Intelligence subjects, including Machine Learning (ML), Natural Language Processing (NLP), Artificial Neural Networks (ANN), and Deep Learning (DL).

  • Freelance Reviewer

    from: 2024, until: 2024

    Organization: 9th International Conference on Big Data Analytics, Data Mining and Computational Intelligence, Location: Budapest, Hungary

  • Professor

    from: 2024, until: 2024

    Organization: IAU Institute for American Universities, Location: Madrid, Community of Madrid, Spain

  • Professor

    from: 2024, until: 2024

    Organization: FUNDACIÓN ORTEGA-MARAÑÓN, Location: Madrid, Community of Madrid, Spain

  • Research Fellow

    from: 2021, until: 2024

    Organization: Universidad Politécnica de Madrid, Location: Community of Madrid, Spain

    Description:
    • Associated with the CLARIFY European Project (https://www.clarify2020.eu/)
    • Develop Natural Language Processing models with neural networks 🧠 such as Google BERT. Train, validate, and develop Named Entity Recognition models.
    • Use transformer-based LLMs, both pretrained and fine-tuned.
    • Use embedding search.
    • Deploy web services on a Linux server (Flask, brat, Prodigy).
    • Work in group (English or Spanish).
    • Explain code to others.
    • Learn breast and lung cancer terminology.
    • Write papers.
    • Program bash scripts.
    • Program in Python (pandas, csv, json, spaCy, scikit-learn, TensorFlow, Keras, statistics).
    • Program in Java (Apache Lucene).
    • Query SQL databases and MongoDB.
    • Execute programs on the Magerit-Cesvima supercomputer.
    • Learning how to be a researcher.
    • Program multithreaded scripts.
  • Research stay

    from: 2023, until: 2023

    Organization: TIB – Leibniz-Informationszentrum Technik und Naturwissenschaften, Location: Hannover, Lower Saxony, Germany

    Description:

    Integrating Electronic Health Records (EHR) for Spanish patients with breast cancer:

    • Ontology/Vocabulary: UMLS (NCI, SNOMED CT)
    • Neuro-symbolic system (LLM + KG) for Entity Alignment
  • Teacher

    from: 2019, until: 2023

    Organization: Private tutoring, Location: Madrid, Community of Madrid, Spain

    Description:

    Private lessons in physics/chemistry for first-year ESO (compulsory secondary education), chemistry/mathematics for Bachillerato, and programming in Java and C for technical university degrees.

Projects

  • MedSymbolicLLM: a Hybrid AI System for Normalizing Medical Entities in an Explainable Manner

    date: 2023

    Description:

    Electronic Health Records (EHRs) are crucial in enhancing healthcare systems by improving patient outcomes and reducing costs, thereby supporting the advancement of Health Care Information Systems (HCIS). However, the predominance of natural language storage complicates data mining as structuring this data requires more than just Named Entity Recognition (NER) models; it also needs normalization for effective organization and interoperability, crucial for data analysis across different institutions.

    This master’s thesis introduces MedSymbolicLLM, a method for normalizing medical entities using a hybrid neuro-symbolic AI architecture, addressing the critical need for ‘explainability’ in healthcare. Explainability is essential because, despite their proficiency, neural networks traditionally lack transparency in processes and decision-making. The proposed system integrates Large Language Models (LLMs) known for their advanced reasoning and explanation capabilities, which operate without extensive training. This design allows application across various medical fields and vocabularies while maintaining confidentiality in clinical documentation.

    The system unfolds in three phases: firstly, generating a candidate list from medical entities; secondly, employing a rule-based heuristic for necessary disambiguation; and thirdly, using an LLM for precise disambiguation. To assess this system, a case study with a breast cancer corpus was conducted, comparing it against leading tools like BioFalcon and scispaCy. This evaluation focuses on the LLMs’ explainability in disambiguating medical terms, using two types of LLMs: cloud-based GPT-4 and locally operated BioMistral. The results underscore the LLMs’ superior capacity to clarify medical disambiguation for human understanding, achieving a 15% performance enhancement over other models.
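    The three phases described above can be illustrated with a minimal sketch. It is not the thesis implementation: the toy vocabulary, the placeholder codes, the string-similarity ranking, the exact-match rule, and the `fake_llm_pick` stand-in for a real LLM call are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    code: str   # placeholder identifier, not a real UMLS code
    label: str  # preferred term, stored in lower case


# Toy vocabulary used only for this sketch.
VOCABULARY = [
    Concept("X001", "breast carcinoma"),
    Concept("X002", "breast cyst"),
    Concept("X003", "lung carcinoma"),
]


def generate_candidates(mention: str, vocabulary: List[Concept], top_k: int = 2) -> List[Concept]:
    """Phase 1: rank concepts by string similarity and keep the top_k."""
    def similarity(concept: Concept) -> float:
        return SequenceMatcher(None, mention.lower(), concept.label).ratio()
    return sorted(vocabulary, key=similarity, reverse=True)[:top_k]


def rule_based_pick(mention: str, candidates: List[Concept]) -> Optional[Concept]:
    """Phase 2: a single heuristic rule, accept a unique exact label match."""
    exact = [c for c in candidates if c.label == mention.lower()]
    return exact[0] if len(exact) == 1 else None


def fake_llm_pick(mention: str, candidates: List[Concept]) -> Tuple[Concept, str]:
    """Phase 3 stand-in: a real system would prompt an LLM with the candidates."""
    chosen = candidates[0]
    return chosen, f"LLM stand-in chose '{chosen.label}' for '{mention}'"


def normalize(
    mention: str,
    vocabulary: List[Concept],
    llm_pick: Callable[[str, List[Concept]], Tuple[Concept, str]],
) -> Tuple[Concept, str]:
    """Run the three phases and return the concept with an explanation."""
    candidates = generate_candidates(mention, vocabulary)
    resolved = rule_based_pick(mention, candidates)
    if resolved is not None:
        return resolved, "exact label match"
    return llm_pick(mention, candidates)


concept, explanation = normalize("Breast carcinoma", VOCABULARY, fake_llm_pick)
print(concept.code, "-", explanation)
```

    Passing the LLM step as a callable keeps the sketch independent of any particular model, so a cloud-based or a locally run model could be plugged in behind the same interface.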

  • NCP: NLP Cancer Pipeline

    date: 2022

    Description:

    Clinical records are written in natural language and, therefore, they consist of unstructured information. The objective of the project is to structure the information from clinical records of breast cancer patients in a public hospital in Madrid in order to obtain useful information for physicians. In this way, the proposal is to perform the structuring process using deep neural networks for entity classification, specifically Named Entity Recognition (NER), in combination with other NLP techniques. Ultimately, a semi-structured database in JSON format will be generated, containing the structured clinical records, which can be further processed for various purposes.
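
    As a rough illustration of the final structuring step, the sketch below groups entity annotations by patient and serializes them to JSON. The annotation tuples, label names, and output layout are hypothetical and are not taken from the NCP repository; a real pipeline would obtain the annotations from a trained NER model.

```python
import json
from collections import defaultdict

# Hypothetical NER output: (patient_id, entity_label, text, negated).
annotations = [
    ("p1", "DIAGNOSIS", "carcinoma ductal infiltrante", False),
    ("p1", "TREATMENT", "quimioterapia", True),
    ("p2", "DIAGNOSIS", "carcinoma lobulillar", False),
]


def structure_records(annotations):
    """Group entities by patient and by label into plain dictionaries."""
    records = defaultdict(lambda: defaultdict(list))
    for patient_id, label, text, negated in annotations:
        records[patient_id][label].append({"text": text, "negated": negated})
    # Convert nested defaultdicts to regular dicts so they serialize cleanly.
    return {pid: dict(fields) for pid, fields in records.items()}


structured = structure_records(annotations)
# ensure_ascii=False keeps accented Spanish characters readable in the output.
print(json.dumps(structured, ensure_ascii=False, indent=2))
```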

    Code: https://github.com/Alvaro8gb/NCP
    Paper: https://ieeexplore.ieee.org/document/10178871

Publications

  • Step-forward structuring disease phenotypic entities with LLMs for disease understanding

    Publisher: IEEE, Date: 2024
    Authors:
    Description:

    In the rapidly evolving field of biomedical text mining, the extraction of phenotypic entities from unstructured texts remains a pivotal challenge. This paper introduces a novel method that leverages Large Language Models (LLMs) to extract phenotypical entities from freely available texts such as Wikipedia. Our approach goes beyond traditional Named Entity Recognition (NER) techniques by utilizing both local and cloud-based LLMs. We present a comprehensive comparison with state-of-the-art tools. Our study confirms the significant advantages of LLMs in identifying relevant phenotypic entities, thus enhancing the ability of researchers and clinicians to understand and respond to disease dynamics more effectively. Therefore, this work underscores the potential of next-generation LLMs to redefine the standards for the extraction of phenotypic entities in biomedical research.

  • GPT for medical entity recognition in Spanish

    Publisher: Multimedia Tools and Applications, Date: 2024
    Authors:
    Description:

    Can newer Large Language Models (LLMs) like GPT be utilized without the need for extensive annotation, thereby enabling direct entity extraction? In this study, we explore this issue, comparing the efficacy of fine-tuning techniques with prompting methods to elucidate the potential of GPT in the identification of medical entities within Spanish electronic health records (EHR).

  • Transformers for extracting breast cancer information from Spanish clinical narratives

    Publisher: Artificial Intelligence in Medicine, Elsevier, Date: 2023
    Authors:
    Description:

    The wide adoption of electronic health records (EHRs) offers immense potential as a source of support for clinical research. However, previous studies focused on extracting only a limited set of medical concepts to support information extraction in the cancer domain for the Spanish language. Building on the success of deep learning for processing natural language texts, this paper proposes a transformer-based approach to extract named entities from breast cancer clinical notes written in Spanish and compares several language models. To facilitate this approach, a schema for annotating clinical notes with breast cancer concepts is presented, and a corpus for breast cancer is developed. Results indicate that both BERT-based and RoBERTa-based language models demonstrate competitive performance in clinical Named Entity Recognition (NER). Specifically, BETO and multilingual BERT achieve F-scores of 93.71% and 94.63%, respectively. Additionally, RoBERTa Biomedical attains an F-score of 95.01%, while RoBERTa BNE achieves an F-score of 94.54%. The findings suggest that transformers can feasibly extract information in the clinical domain in the Spanish language, with the use of models trained on biomedical texts contributing to enhanced results. The proposed approach takes advantage of transfer learning techniques by fine-tuning language models to automatically represent text features and avoiding the time-consuming feature engineering process.

  • Structuring Breast Cancer Spanish Electronic Health Records Using Deep Learning

    Publisher: 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS), Date: 2023
    Authors:
    Description:

    Using Natural Language Processing (NLP) in the clinical domain has increased the possibility of automatically extracting information from oncology clinical narratives. Specifically, deep learning methods have been used to extract information in the cancer domain. However, most of the above proposals have concentrated only on extracting named entities from clinical narratives, but those proposals do not include a methodology for structuring the information after an information extraction step. In this paper, we propose an automatic pipeline based on deep learning for structuring breast cancer information from clinical narratives written in Spanish. The pipeline inputs a set of clinical documents written in narrative form and automatically generates a structured JSON file that contains the information for each patient. This pipeline integrates both clinical entity extraction and negation and uncertainty detection. Obtained results have shown that deep learning methods are feasible for structuring information in the breast cancer domain.

  • Deep learning to extract Breast Cancer diagnosis concepts

    Publisher: IEEE, Date: 2022
    Authors:
    Description:

    The Bidirectional Encoder Representations from Transformers (BERT) has shown promising results in extracting information in the biomedical domain, including the cancer field. However, one of the challenges in the cancer domain is annotating resources to support information extraction. In this paper, we will show how models trained on a lung cancer corpus can be used to extract cancer concepts even in other cancer types. In particular, we will show the performance of BERT models on breast cancer data that was not used to train the models. Results are very promising as they show the possibility of applying deep learning-based models to predict cancer concepts in a different dataset to the one they were trained on, representing a considerable saving of time and resources.

Honors & Awards

  • Award for Outstanding Academic Achievement in the Data Science MSc Program

    date: 2024-08-01

    Description:

    For achieving one of the 5 best average marks in the master’s program.

  • 3rd prize in the 2nd EELISA Student Scientific Competition in the Informatics session

    date: 2024-07-01

    Description:

    For the study entitled "OncoRuleLLM: Oncological Rule-based LLM for normalizing Spanish EHR in an explainable manner" (BME, Budapest, Hungary).

  • TCCLS Best Paper Award

    date: 2023-07-01

    Description:

    At the 2023 IEEE International Symposium on Computer-Based Medical Systems (CBMS).

Skills

  • Transformers
  • BERT
  • Research Software
  • Research Computing
  • Statistical Research
  • Artificial Intelligence
  • Microservices
  • Editorial Processes
  • C++
  • Agile Methodologies
  • Software Development Methodologies
  • Operating Systems
  • Prompt Engineering
  • R (Programming Language)
  • Machine Learning
  • Data Processes
  • Statistics
  • Model Development
  • Java
  • UMLS
  • PyTorch
  • Technical Presentations
  • Tutorials
  • Presentations
  • XGBoost
  • Flask
  • Docker
  • Postman
  • Postman API
  • Elasticsearch
  • Web Services API
  • HTTP
  • CURL
  • Deep Learning
  • Linux
  • spaCy
  • Prodigy
  • Git
  • Bash
  • Linux Server
  • Python (Programming Language)
  • TensorFlow
  • Pandas (Software)
  • SQL
  • Keras
  • JSON
  • Natural Language Processing (NLP)
  • Deep Neural Networks (DNN)
  • English
  • Data Analysis
  • Data Mining
  • Big Data
  • Data Modeling
  • Data Science
  • Data Visualization
  • Problem Solving

Certifications

  • Inteligencia Artificial y Machine Learning

    Issue date: Dec 2019

  • Docker Essentials: A Developer Introduction

    Issue date: Mar 2023

    Issued by: cognitiveclass.ai

  • Postman API Fundamentals Student Expert

    Issue date: Aug 2023

    Issued by: badgr.io

  • Quantization Fundamentals

    Issue date: May 2024

    Issued by: deeplearning.ai

  • Prompt Engineering with Llama 2

    Issue date: Jun 2024

    Issued by: deeplearning.ai