Álvaro García-Barragán

Tagline: Data Scientist & Computer Engineer | Data Mining Engineer

Madrid, Spain

About

📊 Data Scientist specializing in the analysis of medical data.🏥

🕰️Enthusiast of artificial intelligence systems applied to medicine to improve healthcare.⚙️

🧤 Used to working collaboratively with Git as a software developer and maintainer.
⌨️ Used to writing efficient, scalable code with software design patterns 🔷.

🎯 GitHub: https://github.com/Alvaro8gb
🤗 Hugging Face: https://huggingface.co/Alvaro8gb
🖥️ Programming languages I know, in order of preference: Python 🐍, Java ☕️, C 🐧.
🧪 ORCID: https://orcid.org/0009-0007-6377-8150

Education

MSc
from: 2023, until: 2024
Field of study: Data Science
School: Universidad Politécnica de Madrid
Description:
Courses:

Machine Learning

Deep Learning

Data Processes

Data Visualisation

Big Data

Statistical Data Analysis

Open Data and Knowledge Graphs

Cloud Computing and Big Data

Information Retrieval, Extraction and Integration

Programming for Data Processing

Image Processing, Analysis and Classification

Ethical, Legal and Societal Aspects in Data Science

Research Methodology
BSc
from: 2019, until: 2023
Field of study: Computer Engineering
School: Universidad Complutense de Madrid
Description:
Selected courses:

Machine Learning and Big Data

Databases I and II

Probability and Statistics

Ethics, Legislation and Profession

Computer Architecture

Evaluation of Computer Systems

Web Engineering

Computer Music

Technology and Organization of Computer Systems

Advanced Mathematics

Fundamentals of Programming I and II

Computer Programming Technology I and II

Software Engineering I and II

Fundamentals of Algorithms and Data Structures

Operating Systems and Computer Networks
English
from: 2019, until: 2021
Field of study: English Language and Literature, General
School: Escuela Oficial de Idiomas
Science Baccalaureate (Bachillerato de Ciencias)
from: 2017, until: 2019
School: IES Camilo José Cela
Description:
Subjects:

Information Technologies I and II: High honors

Technology and Robotics

Physics

Chemistry

Biology

Mathematics

Philosophy

Language and Literature

History of Spain

English
https://site.educa.madrid.org/ies.camilojosecela.pozuelodealarcon/

Work Experiences

Data Science Analyst
from: 2024, until: present
Organization: Cigna Healthcare International Health
Location: Madrid, Community of Madrid, Spain
Description:
Development of supervised ML classification models for data mining in international medical claims. The models recognize data elements such as procedures, drugs, and provider names in claims from members in more than 100 countries. Only English-language claims are covered.
• Scaled model training for resource-constrained environments.
– Increased the training volume from 80k to 200k claims.
– Reduced the number of features from more than 50,000 to 50.
– Sped up training from hours to minutes.
• Designed and contributed to tools for analysing and evaluating model retraining (MLOps).
– Given two models, register all the parameters and data used in training.
– Given new input data, evaluate a model by segment: specialty, country, claim type.
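A minimal sketch of what evaluating a model by segment can look like, assuming each prediction is stored as a record with its expected and predicted label. The field names, the helper `accuracy_by_segment`, and the sample records are hypothetical illustrations, not the tooling used in the role.

```python
from collections import defaultdict


def accuracy_by_segment(records, segment_key):
    """Return {segment value: accuracy} for a list of prediction records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        segment = rec[segment_key]
        totals[segment] += 1
        hits[segment] += int(rec["predicted"] == rec["expected"])
    return {seg: hits[seg] / totals[seg] for seg in totals}


# Hypothetical, hand-made records; no real claims data is shown here.
records = [
    {"country": "ES", "claim_type": "dental", "expected": "Drug", "predicted": "Drug"},
    {"country": "ES", "claim_type": "medical", "expected": "Procedure", "predicted": "Drug"},
    {"country": "DE", "claim_type": "medical", "expected": "Procedure", "predicted": "Procedure"},
    {"country": "DE", "claim_type": "medical", "expected": "Provider Name", "predicted": "Provider Name"},
]

by_country = accuracy_by_segment(records, "country")
by_claim_type = accuracy_by_segment(records, "claim_type")
print(by_country)
print(by_claim_type)
```

The same helper can be called once per segment dimension, which keeps the comparison between two retrained models down to comparing two small dictionaries.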

Agile product development using tools like Jira and Confluence.
Cloud environment: S3, AWS SageMaker, Athena, RDS.
Professor
from: 2024, until: 2024
Organization: FUNDACIÓN ORTEGA-MARAÑÓN
Location: Madrid, Community of Madrid, Spain
Data Science Researcher
from: 2024, until: 2024
Organization: Medical Data Analytics Laboratory (MEDAL)
Location: Madrid, Community of Madrid, Spain
Description:
Associated with the project: ELADAIS (Extracción, Almacenamiento y Análisis de Datos con Alto Impacto Social) https://eladais.org/

Liaising with doctors to obtain medical data

Analysis of how LLMs can be applied to NLP tasks

Participation in congresses and scientific competitions

Development of data mining tools to standardise EHR clinical notes in the field of cancer.
Teacher
from: 2024, until: 2024
Organization: Samsung Innovation Campus
Description:
The Samsung Innovation Campus, in collaboration with the Polytechnic University of Madrid (UPM), offers a fully funded training program exclusively for women. The course includes comprehensive instruction in programming with Python, covering essential topics such as algorithms and data structures. The program is divided into two parts: the first focuses on foundational programming skills, while the second delves into advanced Artificial Intelligence subjects, including Machine Learning (ML), Natural Language Processing (NLP), Artificial Neural Networks (ANN), and Deep Learning (DL).
Professor
from: 2024, until: 2024
Organization: IAU Institute for American Universities
Location: Madrid, Community of Madrid, Spain
Research Intern
from: 2021, until: 2024
Organization: Medical Data Analytics Laboratory (MEDAL)
Location: Community of Madrid, Spain
Description:
Associated with the CLARIFY European project (https://www.clarify2020.eu/)

Develop Natural Language Processing models with neural networks 🧠 such as Google BERT. Train, validate, and develop Named Entity Recognition models.

Use transformer-based LLMs, both pretrained and fine-tuned.

Use embedding-based search.

Deploy web services on a Linux server (Flask, brat, Prodigy).

Work in a team (in English or Spanish).

Explain code to others.

Learn breast and lung cancer terminology.

Write papers.

Write Bash scripts.

Program in Python (pandas, csv, json, spacy, sklearn, tensorflow, keras, statistics).

Program in Java (Apache Lucene).

Query SQL databases and MongoDB.

Deploy an Elasticsearch service.

Run programs on the Magerit-Cesvima supercomputer.

Learn how to be a researcher.

Write multithreaded scripts.
Research stay
from: 2023, until: 2023
Organization: TIB – Leibniz-Informationszentrum Technik und Naturwissenschaften
Location: Hannover, Lower Saxony, Germany
Description:
Integrating Electronic Health Records (EHR) for Spanish patients with breast cancer:

Ontology/Vocabulary: UMLS (NCI, SNOMED CT)

Neuro-symbolic system (LLM + KG) for Entity Alignment
Private Teacher
from: 2019, until: 2023
Organization: Self-employed
Location: Madrid, Community of Madrid, Spain
Description:
Private tutoring in physics and chemistry for first-year ESO (lower secondary) students, chemistry and mathematics for Bachillerato students, and Java and C programming for students of technical university degrees.

Projects

PsoriAssistPG: Assistant for the Diagnosis of Generalized Pustular Psoriasis (GPP)
date: 2024
Description:
PsoriAssistPG is an innovative platform designed to support dermatologists in the diagnostic assessment of patients with psoriasis, providing an efficient, specialized tool that centralizes the available clinical knowledge. The solution aims to streamline the diagnostic process by providing fast and accurate access to critical information.

The main benefits of PsoriAssistPG include:

Specialization: Focused exclusively on dermatology, PsoriAssistPG offers specific tools for evaluating and diagnosing skin disorders such as psoriasis, ensuring clinical relevance.

Time optimization: The built-in short questionnaire, combined with an advanced clustering system, lets the dermatologist guide the diagnosis quickly and effectively, reducing workload and improving efficiency in consultations.

Knowledge integration: The platform provides direct access to up-to-date medical information, helping dermatologists make decisions based on recent scientific evidence.

Centralized analysis: Through an intuitive dashboard, PsoriAssistPG facilitates global data analysis, making it possible to identify clinical patterns and improve diagnostic accuracy, promoting more personalized and effective care for each patient.

With PsoriAssistPG, dermatologists have a comprehensive tool to improve their clinical practice, optimizing both diagnosis and decision-making based on the most recent knowledge.
Bizileku Bila: In Search of a Place to Live in the Basque Country (Euskadi)
date: 2024
Description:
The project arises in a context of migratory change driven by the rise of remote work, in which many people choose to move to quieter rural areas in search of a better quality of life. Bizileku Bila responds to this need by offering users a platform that compares municipalities in the Basque Country using key indicators such as quality of life, safety, green spaces, and economic development, drawing on open data provided by the Basque Government and other sources such as Wikidata.

The application offers two main features:

Find your ideal municipality: Lets users select specific indicators from thematic groups (such as environment, mobility, and economy) and customize their search to obtain a list of municipalities that match their preferences.

Explore the municipalities: Provides an overview of the municipalities, letting users explore them geographically and consult key information about each locality.
PandemIA CyL
date: 2024
Description:
This web application provides a visualization tool for monitoring hospital data related to COVID-19 in Castilla y León, Spain. Users can select different variables, such as hospitalizations, ICU admissions, discharges, and in-hospital deaths, and view the data interactively on a map.
MedSymbolicLLM: a Hybrid AI System for Normalizing Medical Entities in an Explainable Manner
date: 2023
Description:
Electronic Health Records (EHRs) are crucial in enhancing healthcare systems by improving patient outcomes and reducing costs, thereby supporting the advancement of Health Care Information Systems (HCIS). However, the predominance of natural language storage complicates data mining as structuring this data requires more than just Named Entity Recognition (NER) models; it also needs normalization for effective organization and interoperability, crucial for data analysis across different institutions.

This master’s thesis introduces MedSymbolicLLM, a method for normalizing medical entities using a hybrid neuro-symbolic AI architecture, addressing the critical need for ‘explainability’ in healthcare. Explainability is essential because, despite their proficiency, neural networks traditionally lack transparency in processes and decision-making. The proposed system integrates Large Language Models (LLMs) known for their advanced reasoning and explanation capabilities, which operate without extensive training. This design allows application across various medical fields and vocabularies while maintaining confidentiality in clinical documentation.

The system unfolds in three phases: firstly, generating a candidate list from medical entities; secondly, employing a rule-based heuristic for necessary disambiguation; and thirdly, using an LLM for precise disambiguation. To assess this system, a case study with a breast cancer corpus was conducted, comparing it against leading tools like BioFalcon and scispaCy. This evaluation focuses on the LLMs’ explainability in disambiguating medical terms, using two types of LLMs: cloud-based GPT-4 and locally operated BioMistral. The results underscore the LLMs’ superior capacity to clarify medical disambiguation for human understanding, achieving a 15% performance improvement over other models.
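The three phases described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis implementation: the toy vocabulary, the token-overlap candidate generator, the semantic-type rule, and the `fake_llm` stand-in are all hypothetical, and a real system would query a vocabulary such as UMLS and call an actual LLM.

```python
# Toy vocabulary: concept ID -> (preferred term, semantic type). Entirely made up.
VOCAB = {
    "C001": ("breast carcinoma", "disease"),
    "C002": ("breast", "anatomy"),
    "C003": ("carcinoma in situ", "disease"),
}


def generate_candidates(mention):
    """Phase 1: rank concepts by token overlap (Jaccard) with the mention."""
    mention_tokens = set(mention.lower().split())
    scored = []
    for cid, (term, _) in VOCAB.items():
        term_tokens = set(term.split())
        overlap = len(mention_tokens & term_tokens) / len(mention_tokens | term_tokens)
        if overlap > 0:
            scored.append((overlap, cid))
    # Highest overlap first; ties broken by concept ID for determinism.
    return [cid for _, cid in sorted(scored, key=lambda s: (-s[0], s[1]))]


def rule_filter(candidates, expected_type):
    """Phase 2: rule-based heuristic keeping only the expected semantic type."""
    return [cid for cid in candidates if VOCAB[cid][1] == expected_type]


def normalize(mention, expected_type, llm_choose):
    """Phase 3: ask the LLM only when several candidates survive the rules."""
    candidates = rule_filter(generate_candidates(mention), expected_type)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    return llm_choose(mention, candidates)


def fake_llm(mention, candidates):
    """Stand-in for a real LLM call: simply picks the top-ranked candidate."""
    return candidates[0]


result = normalize("carcinoma of the breast", "disease", fake_llm)
print(result)
```

Passing the LLM in as a callable keeps the first two phases testable without any model, and makes it easy to swap a cloud-based model for a local one.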
NCP: NLP Cancer Pipeline
date: 2022
Description:
Clinical records are written in natural language and, therefore, they consist of unstructured information. The objective of the project is to structure the information from clinical records of breast cancer patients in a public hospital in Madrid in order to obtain useful information for physicians. In this way, the proposal is to perform the structuring process using deep neural networks for entity classification, specifically Named Entity Recognition (NER), in combination with other NLP techniques. Ultimately, a semi-structured database in JSON format will be generated, containing the structured clinical records, which can be further processed for various purposes.
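As a hedged sketch of the final structuring step described above, the snippet below groups entities that a NER model might return into one JSON record per patient. The tuple layout, the entity labels, and the sample values are assumptions made for illustration; the project's actual schema is not given here.

```python
import json
from collections import defaultdict

# Hypothetical NER output: (patient_id, entity_label, text) tuples.
entities = [
    ("p1", "DIAGNOSIS", "carcinoma ductal infiltrante"),
    ("p1", "TREATMENT", "tamoxifeno"),
    ("p2", "DIAGNOSIS", "carcinoma lobulillar"),
]


def structure_records(entities):
    """Group recognised entities by patient and then by entity label."""
    records = defaultdict(lambda: defaultdict(list))
    for patient_id, label, text in entities:
        records[patient_id][label].append(text)
    # Convert the nested defaultdicts to plain dicts so they serialise cleanly.
    return {pid: dict(labels) for pid, labels in records.items()}


structured = structure_records(entities)
print(json.dumps(structured, ensure_ascii=False, indent=2))
```

Keeping a list per label leaves room for several mentions of the same entity type in one patient's notes.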

Publications

NSSC: a neuro-symbolic AI system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes
Publisher: Medical & Biological Engineering & Computing
Date: 2024
Authors:
Description:
Accurate recognition and linking of oncologic entities in clinical notes is essential for extracting insights across cancer research, patient care, clinical decision-making, and treatment optimization. We present the Neuro-Symbolic System for Cancer (NSSC), a hybrid AI framework that integrates neurosymbolic methods with named entity recognition (NER) and entity linking (EL) to transform unstructured clinical notes into structured terms using medical vocabularies, with the Unified Medical Language System (UMLS) as a case study. NSSC was evaluated on a dataset of clinical notes from breast cancer patients, demonstrating significant improvements in the accuracy of both entity recognition and linking compared to state-of-the-art models. Specifically, NSSC achieved a 33% improvement over BioFalcon and a 58% improvement over scispaCy. By combining large language models (LLMs) with symbolic reasoning, NSSC improves the recognition and interoperability of oncologic entities, enabling seamless integration with existing biomedical knowledge. This approach marks a significant advancement in extracting meaningful information from clinical narratives, offering promising applications in cancer research and personalized patient care.
Step-forward structuring disease phenotypic entities with LLMs for disease understanding
Publisher: IEEE
Date: 2024
Authors:
Description:
In the rapidly evolving field of biomedical text mining, the extraction of phenotypic entities from unstructured texts remains a pivotal challenge. This paper introduces a novel method that leverages Large Language Models (LLMs) to extract phenotypic entities from freely available texts such as Wikipedia. Our approach goes beyond traditional Named Entity Recognition (NER) techniques by utilizing both local and cloud-based LLMs. We present a comprehensive comparison with state-of-the-art tools. Our study confirms the significant advantages of LLMs in identifying relevant phenotypic entities, thus enhancing the ability of researchers and clinicians to understand and respond to disease dynamics more effectively. Therefore, this work underscores the potential of next-generation LLMs to redefine the standards for the extraction of phenotypic entities in biomedical research.
GPT for medical entity recognition in Spanish
Publisher: Multimedia Tools and Applications
Date: 2024
Authors:
Description:
Can newer Large Language Models (LLMs) like GPT be utilized without the need for extensive annotation, thereby enabling direct entity extraction? In this study, we explore this issue, comparing the efficacy of fine-tuning techniques with prompting methods to elucidate the potential of GPT in the identification of medical entities within Spanish electronic health records (EHR).
Transformers for extracting breast cancer information from Spanish clinical narratives
Publisher: Artificial Intelligence in Medicine, Elsevier
Date: 2023
Authors:
Description:
The wide adoption of electronic health records (EHRs) offers immense potential as a source of support for clinical research. However, previous studies focused on extracting only a limited set of medical concepts to support information extraction in the cancer domain for the Spanish language. Building on the success of deep learning for processing natural language texts, this paper proposes a transformer-based approach to extract named entities from breast cancer clinical notes written in Spanish and compares several language models. To facilitate this approach, a schema for annotating clinical notes with breast cancer concepts is presented, and a corpus for breast cancer is developed. Results indicate that both BERT-based and RoBERTa-based language models demonstrate competitive performance in clinical Named Entity Recognition (NER). Specifically, BETO and multilingual BERT achieve F-scores of 93.71% and 94.63%, respectively. Additionally, RoBERTa Biomedical attains an F-score of 95.01%, while RoBERTa BNE achieves an F-score of 94.54%. The findings suggest that transformers can feasibly extract information in the clinical domain in the Spanish language, with the use of models trained on biomedical texts contributing to enhanced results. The proposed approach takes advantage of transfer learning techniques by fine-tuning language models to automatically represent text features and avoiding the time-consuming feature engineering process.
Structuring Breast Cancer Spanish Electronic Health Records Using Deep Learning
Publisher: 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS)
Date: 2023
Authors:
Description:
Using Natural Language Processing (NLP) in the clinical domain has increased the possibility of automatically extracting information from oncology clinical narratives. Specifically, deep learning methods have been used to extract information in the cancer domain. However, most of the above proposals have concentrated only on extracting named entities from clinical narratives, but those proposals do not include a methodology for structuring the information after an information extraction step. In this paper, we propose an automatic pipeline based on deep learning for structuring breast cancer information from clinical narratives written in Spanish. The pipeline inputs a set of clinical documents written in narrative form and automatically generates a structured JSON file that contains the information for each patient. This pipeline integrates both clinical entity extraction and negation and uncertainty detection. Obtained results have shown that deep learning methods are feasible for structuring information in the breast cancer domain.
Deep learning to extract Breast Cancer diagnosis concepts
Publisher: IEEE
Date: 2022
Authors:
Description:
The Bidirectional Encoder Representations from Transformers (BERT) has shown promising results in extracting information in the biomedical domain, including the cancer field. However, one of the challenges in the cancer domain is annotating resources to support information extraction. In this paper, we will show how models trained in a lung cancer corpus can be used to extract cancer concepts even in other cancer types. In particular, we will show the performance of BERT models on breast cancer data that was not used to train the models. Results are very promising as they show the possibility of applying deep learning-based models to predict cancer concepts in a different dataset to the one they were trained on, representing a considerable saving of time and resources.

Honors & Awards

2nd Prize for Master's Final Project, with Cum Laude
date: 2024-10-01
Description:
In recognition of having produced one of the best Master’s final projects of the 2023/2024 cohort.
Award for Outstanding Academic Achievement in the Data Science MSc Program
date: 2024-08-01
Description:
For achieving one of the five best average marks in the 2023/2024 cohort of the MSc in Data Science.
3rd prize in the 2nd EELISA Student Scientific Competition in the Informatics session
date: 2024-07-01
Description:
Awarded for the study entitled "OncoRuleLLM: Oncological Rule-based LLM for normalizing Spanish EHR in an explainable manner".
BME, Budapest, Hungary
TCCLS Best Paper Award at the 2023 IEEE International Symposium on Computer-Based Medical Systems (CBMS)
date: 2023-08-01
Description:
In recognition of the paper “Structuring Breast Cancer Spanish Electronic Health Records Using Deep Learning”, presented at the 2023 IEEE International Symposium on Computer-Based Medical Systems (CBMS).

Skills

Transformers
BERT
Research Software
Research Computing
Statistical Research
Artificial Intelligence
Microservices
Editorial Processes
C++
Agile Methodologies
Software Development Methodologies
Operating Systems
Prompt Engineering
R (Programming Language)
Machine Learning
Data Processes
Statistics
Model Development
Java
UMLS
PyTorch
Technical Presentations
Tutorials
Presentations
XGBoost
Flask
Docker
Postman
Postman API
Elasticsearch
Web Services API
HTTP
CURL
Deep Learning
Linux
spaCy
Prodigy
Git
Bash
Linux Server
Python (Programming Language)
TensorFlow
Pandas (Software)
SQL
Keras
JSON
Natural Language Processing (NLP)
Deep Neural Networks (DNN)
English
Data Analysis
Data Mining
Big Data
Data Modeling
Data Science
Data Visualization
Problem Solving
Cancer
Analytical Skills
Workload Prioritization
Semantic Search
Language Modeling
Big-Picture Thinking
Clinical Engineering
Oral Communication
Interpersonal Skills
Data Standards
Clinical Data
Jira
Cloud Computing
Tokenization
Information Retrieval
LLMs
Collaborative Problem Solving
Text Classification
Generative AI
Communication
ChatGPT
Research
AWS SageMaker

Certifications

Artificial Intelligence and Machine Learning
Issue date: Dec 2019
Docker Essentials: A Developer Introduction
Issue date: Mar 2023
Issued by: cognitiveclass.ai
Postman API Fundamentals Student Expert
Issue date: Aug 2023
Issued by: badgr.io
Quantization Fundamentals
Issue date: May 2024
Issued by: deeplearning.ai
Prompt Engineering with Llama 2
Issue date: Jun 2024
Issued by: deeplearning.ai

Contacts

https://www.linkedin.com/in/%C3%A1lvaro-garc%C3%ADa-barrag%C3%A1n/
