Salma OUARDI

2 Years of Work Experience

ML Engineer | Full-Time (1 year)

September 2023 – Present

Ryte AI – Healthcare Startup – Paris, France

Contributed to both applied NLP/LLM development (RyteGPT) and traditional ML pipelines in a high-scale healthcare data environment. Built robust data and ML systems on Azure, combining practical engineering with research-driven prototyping.

Key Contributions

Entity Resolution Pipeline: Led the redesign of patient-provider matching pipelines, improving F1 score from 77% to 98% and reducing processing time from 3 days to 1 hour. Techniques used included fuzzy logic, TF-IDF, and custom heuristics.
Big Data & Azure Engineering: Developed Spark-based pipelines processing 10+ TB of healthcare data. Used Scala UDFs for complex transformations and deployed solutions with Azure Databricks, VMs, and Blob Storage.
ML Testing & Reliability: Built validation layers using pytest and Pydantic. Enforced TDD and schema checks across all core pipelines, achieving 100% test coverage in critical services.
RyteGPT – NLP & LLM Modules:

Intent Classification: Fine-tuned RoBERTa on synthetic data for 4-class intent detection (COR, INC, OOC, APPROPRIATE). Deployed to Azure Vertex AI with >90% accuracy.
Entity Extraction: Deployed LLaMA 3 and Mistral 7B models using vLLM + AWQ quantization. Designed IOB-tag-based evaluation pipeline using spaCy to track token/segment accuracy and F1.
Entity Linking: Built semantic search over medical terms using BioLORD embeddings + reranking with a cross-encoder. Achieved high top-k precision and improved strict match accuracy using GPT-4-generated variants.
Dataset Curation: Generated and labeled 1,000+ synthetic medical queries. Collaborated with MDs via Prodigy and GPT for multi-entity labeling (disease, procedure, specialty, location, distance).
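As an illustration of the IOB-tag-based evaluation mentioned above, here is a minimal sketch that computes token accuracy and span-level precision, recall, and F1 from two tag sequences. It uses only the Python standard library rather than spaCy, and the function names (`iob_to_spans`, `span_prf`, `token_accuracy`) and the example labels are assumptions made for illustration, not the production pipeline.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def iob_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a sequence of IOB tags into a set of labeled spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag without a matching B- opens a new span.
            start, label = i, tag[2:]
    return spans


def span_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Strict span-level precision, recall, and F1 (exact boundaries and label)."""
    g, p = iob_to_spans(gold), iob_to_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(a == b for a, b in zip(gold, pred)) / len(gold)


# Hypothetical example: the model misses the location entity.
gold_tags = ["B-DISEASE", "I-DISEASE", "O", "B-LOCATION"]
pred_tags = ["B-DISEASE", "I-DISEASE", "O", "O"]
print(span_prf(gold_tags, pred_tags))
print(token_accuracy(gold_tags, pred_tags))
```

Strict span matching penalizes partially correct boundaries, which is why it is usually reported alongside the more lenient token-level accuracy.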

ETL & Integration: Ingested and normalized data from external APIs and sources like PubMed, CMS, AHA, LexisNexis, and ClinicalTrials.gov. Implemented scalable deduplication, mapping, and linking workflows.
CI/CD & Deployment: Used Azure DevOps for version control and release pipelines. Managed GPU workloads for quantized model inference. Tracked model performance using MLflow and DVC.
R&D Initiatives: Prototyped new architectures (e.g. contrastive learning for entity resolution, multi-modal queries) and presented findings to product and engineering teams.
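To make the TF-IDF part of the entity resolution work more concrete, the sketch below matches names by cosine similarity over character-trigram TF-IDF vectors. It is a simplified, standard-library illustration under assumed inputs; the function names, the trigram choice, and the 0.5 threshold are hypothetical and do not describe the actual Ryte AI pipeline, which also combined fuzzy logic and custom heuristics.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Lowercase, collapse whitespace, pad with spaces, and emit character n-grams."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(names: List[str]) -> List[Dict[str, float]]:
    """Build L2-normalized TF-IDF vectors (smoothed IDF) for each name."""
    docs = [Counter(char_ngrams(name)) for name in names]
    # Document frequency: each n-gram counted once per name.
    df = Counter(gram for doc in docs for gram in doc)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        vec = {
            gram: tf * (math.log((1 + n_docs) / (1 + df[gram])) + 1.0)
            for gram, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


def best_matches(
    queries: List[str], candidates: List[str], threshold: float = 0.5
) -> List[Tuple[str, Optional[str], float]]:
    """For each query, return the most similar candidate, or None below the threshold."""
    vectors = tfidf_vectors(queries + candidates)
    q_vecs, c_vecs = vectors[:len(queries)], vectors[len(queries):]
    results = []
    for query, q_vec in zip(queries, q_vecs):
        score, match = max(
            (cosine(q_vec, c_vec), cand) for cand, c_vec in zip(candidates, c_vecs)
        )
        results.append((query, match if score >= threshold else None, score))
    return results


# Hypothetical provider names, for illustration only.
print(best_matches(["Dr. Jon Smith"], ["Jon Smith MD", "Maria Lopez"]))
```

Character n-grams tolerate typos and reordered titles better than word tokens. At scale, the pairwise comparison would normally be replaced by blocking or sparse matrix multiplication.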

Data Scientist GCP | Internship (6 months)

March 2023 – August 2023

Orange - Paris, France

As part of Orange’s AI team, I contributed to automating the classification of French-language customer complaints using a fully cloud-native solution built on Google Cloud Platform (GCP). The goal was to enhance feedback processing while reducing reliance on manual labeling and costly third-party services.

I led a key use case within this project: deploying an active learning pipeline for NLP classification on GCP's Vertex AI, resulting in improved precision, scalability, and operational efficiency.

Key Tasks

Designed and deployed an NLP model using Vertex AI AutoML to classify customer complaints, achieving ~86% precision and ~75% recall after iterative improvements.
Implemented an active learning loop to efficiently manage over 200K unlabeled complaints, significantly reducing manual annotation workload.
Built a scalable prediction pipeline on GCP (BigQuery + Cloud Functions), processing 1,000 rows in under 2 minutes with structured confidence outputs.
Led preprocessing using spaCy's French model for lemmatization, tokenization, and stopword removal, ensuring linguistic accuracy and model robustness.
Evaluated model performance with detailed metrics (log loss, precision, recall, F1-score) across three training iterations; replaced K-means with active learning due to initial underperformance.
Collaborated closely with Orange's business team to validate and manually review uncertain predictions, aligning ML outputs with domain-specific terminology.
Ensured data integrity and traceability by using claim_id as a primary key and integrating results back into BigQuery for visualization and analysis.
Used GCP tools such as Vertex AI Notebooks, Log Explorer, Cloud Storage, and BigQuery SQL scripts for full-stack ML pipeline orchestration.
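The selection step of an active learning loop like the one described above can be sketched as follows. This example assumes least-confidence sampling over per-class probabilities keyed by `claim_id`. The source does not state which query strategy was used, so the strategy, the function name `least_confident`, and the sample data are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def least_confident(
    predictions: Dict[str, List[float]], budget: int
) -> List[Tuple[str, float]]:
    """Return the `budget` claims whose top predicted probability is lowest.

    Uncertainty is defined as 1 - max(class probabilities), so a higher
    score means the model is less sure and the claim is a better
    candidate for manual review.
    """
    scored = [
        (claim_id, 1.0 - max(probs)) for claim_id, probs in predictions.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:budget]


# Hypothetical model outputs for three unlabeled complaints.
predictions = {
    "c1": [0.95, 0.03, 0.02],
    "c2": [0.40, 0.35, 0.25],
    "c3": [0.60, 0.30, 0.10],
}
to_label = least_confident(predictions, budget=2)
print([claim_id for claim_id, _ in to_label])
```

Each round, the selected claims would be sent for manual review, added to the training set, and the model retrained before scoring the remaining pool again.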

Manager Contact

Name: Benoit Eock Belinga
Role: Lead Data Scientist | Programme Data / IA
Email: [email protected]
Phone: +33 6 84 59 08 70

Recommendation Letter

For a detailed recommendation, please see the letter from my manager:

View Recommendation Letter

Data Engineer | Internship (6 months)

March 2022 – September 2022

Stellantis - Casablanca, Morocco

As a Data Engineering Intern at Stellantis, I worked on the Carflow MEA Dashboards project, which draws on data from diverse sources to monitor supply chain operations in the MEA region. The existing manual ETL (Extract, Transform, Load) process was slow, labor-intensive, and prone to human error. My primary responsibility was to automate this ETL process and streamline data management for the project.

Key Tasks

Collaborated closely with a data architect to establish an effective working environment, gaining valuable insights into the data team's operations.
Conducted in-depth research into Stellantis's Supply Chain business, facilitated by the supply chain business team, to understand the business context and requirements.
Analyzed the existing ETL solution, identified business requirements, and mapped out a strategic plan for process improvement.
Designed and implemented an automated ETL solution using PySpark, Apache Airflow, and Oracle Exadata, tools from the Stellantis Data department.
Conducted rigorous testing of the data pipelines and documented the end-to-end automation process to ensure knowledge transfer and future reference.
The implemented solution significantly improved the system's efficiency, reducing latency by 46 ms and decreasing the failure rate by 82%.

PROJECTS

Wav2Vec2 Fine-Tuning for ASR

January 2023

Implemented a fine-tuning approach on the Wav2Vec2 model for Automatic Speech Recognition (ASR). Fine-tuned on Spanish and Finnish speech datasets to improve performance on low-resource languages.

Project Steps

Setting up APIs
Loading and preprocessing the CSS10 dataset
Configuring the Wav2Vec2CTCTokenizer and Wav2Vec2FeatureExtractor
Fine-tuning and training the model
Evaluating the model using the Word Error Rate (WER) metric
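For reference, the Word Error Rate used in the final step is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal standard-library implementation written for illustration; it is not the evaluation code from the project, and the example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion over four reference words.
print(wer("el gato come pescado", "el pato come"))  # 0.5
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.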

Results

Achieved WER of 0.165 for Spanish and 0.376 for Finnish, demonstrating effectiveness in ASR tasks.

Future Work

Plan to fine-tune Wav2Vec2 on more low-resource languages and compare performance with other pretrained models.

Language: Python
Tools: Wav2Vec2 (Hugging Face Transformers), CSS10 dataset, WER metric, APIs, Google Drive.
GitHub: Wav2Vec2 Fine-Tuning for ASR

Bias Mitigation For Age Detection

October 2022

Developed models estimating age from images, focusing on mitigating biases related to age, gender, ethnicity, and facial expression.

Techniques Used

Data Augmentation with Albumentations and OpenCV
Customized loss functions
Base models: NASNet, ResNet

Language: Python
Tools: TensorFlow, PyTorch, OpenCV, NumPy, Plotly
GitHub: Bias Mitigation For Age Detection

Probabilistic Generative Models

October 2022

Implemented several probabilistic generative models from scratch using PyTorch, including Variational AutoEncoder, Restricted Boltzmann Machine, and Real-NVP Normalizing Flows.

Language: Python
Tools: PyTorch, NumPy, Matplotlib
GitHub: Probabilistic Generative Models

Flower Recognition with Fine-tuned VGG16

March 2022

Executed multi-class image classification on a flower dataset using VGG16 with Transfer Learning. Achieved 85% accuracy on the validation set.

Project Steps

Implemented image generators with data augmentation
Fine-tuned the VGG16 architecture
Evaluated model performance on the validation set

Dataset

Used a dataset with 4242 images across five classes: chamomile, tulip, rose, sunflower, and dandelion.

Language: Python
Tools: TensorFlow, ImageDataGenerator, VGG16, Data Augmentation Techniques.
GitHub: Flower Recognition with Fine-tuned VGG16

Classification with PySpark

January 2022

Trained an ALS recommendation model with PySpark MLlib on the MovieLens 100k dataset, stored on HDFS, to generate per-user recommendations and find similar items.

Project Steps

Built recommendation model using user preferences
Computed recommendations and similar items
Applied evaluation metrics for model performance

Language: Python
Tools: PySpark, MLlib, ALS, HDFS, MovieLens 100k dataset.
GitHub: Classification with PySpark

Education

Paris-Saclay University

Ecole des sciences de l'information

Classes Preparatoires Aux Grandes Ecoles

Skills

Technologies

Interests