Salma Ouardi

Paris, France · +33 7 58 22 41 69 · [email protected]

ML Engineer with strong problem-solving skills and a knack for bridging business and technical needs. Seeking a collaborative team where I can contribute to impactful projects and grow professionally. Passionate about using data science to improve lives.

2 Years of Work Experience

ML Engineer | Full-Time (1 year)

September 2023 – Present
Ryte AI – Healthcare Startup – Paris, France

Contributed to both applied NLP/LLM development (RyteGPT) and traditional ML pipelines in a large-scale healthcare data environment. Built robust data and ML systems on Azure, combining practical engineering with research-driven prototyping.

Key Contributions

  • Entity Resolution Pipeline: Led the redesign of patient-provider matching pipelines, improving F1 score from 77% to 98% and reducing processing time from 3 days to 1 hour. Techniques included fuzzy matching, TF-IDF, and custom heuristics.
  • Big Data & Azure Engineering: Developed Spark-based pipelines processing 10+ TB of healthcare data. Used Scala UDFs for complex transformations and deployed solutions with Azure Databricks, VMs, and Blob Storage.
  • ML Testing & Reliability: Built robust validation layers using PyTest and Pydantic. Enforced TDD and schema checks across all core pipelines, achieving 100% test coverage in critical services.
  • RyteGPT – NLP & LLM Modules:
    • Intent Classification: Fine-tuned RoBERTa on synthetic data for 4-class intent detection (COR, INC, OOC, APPROPRIATE). Deployed to Vertex AI with >90% accuracy.
    • Entity Extraction: Deployed LLaMA 3 and Mistral 7B models using vLLM + AWQ quantization. Designed IOB-tag-based evaluation pipeline using spaCy to track token/segment accuracy and F1.
    • Entity Linking: Built semantic search over medical terms using BioLORD embeddings + reranking with a cross-encoder. Achieved high top-k precision and improved strict match accuracy using GPT-4-generated variants.
    • Dataset Curation: Generated and labeled 1,000+ synthetic medical queries. Collaborated with MDs via Prodigy and GPT for multi-entity labeling (disease, procedure, specialty, location, distance).
  • ETL & Integration: Ingested and normalized data from external APIs and sources like PubMed, CMS, AHA, LexisNexis, and ClinicalTrials.gov. Implemented scalable deduplication, mapping, and linking workflows.
  • CI/CD & Deployment: Used Azure DevOps for version control and release pipelines. Managed GPU workloads for quantized model inference. Tracked model performance using MLflow and DVC.
  • R&D Initiatives: Prototyped new architectures (e.g. contrastive learning for entity resolution, multi-modal queries) and presented findings to product and engineering teams.
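
To illustrate the IOB-tag-based evaluation mentioned under Entity Extraction, here is a minimal pure-Python sketch of strict entity-level precision, recall, and F1 alongside token accuracy. It is not the production pipeline (which used spaCy); the function names and the tag names (`B-DISEASE`, `B-LOCATION`) are hypothetical and chosen only to mirror the entity types listed above.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start token index, end index exclusive)


def iob_to_spans(tags: List[str]) -> Set[Span]:
    """Convert IOB tags (e.g. B-DISEASE, I-DISEASE, O) into entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last open span
        prefix, _, ent = tag.partition("-")
        continues = prefix == "I" and ent == label
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # A stray I- tag with no open span is treated as a span start.
            start, label = i, ent
    return spans


def span_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Strict entity-level precision, recall, F1: a span counts only on exact match."""
    gold_spans, pred_spans = iob_to_spans(gold), iob_to_spans(pred)
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["B-DISEASE", "I-DISEASE", "O", "B-LOCATION"]
pred = ["B-DISEASE", "I-DISEASE", "O", "O"]
print(span_f1(gold, pred))         # precision 1.0, recall 0.5, F1 about 0.667
print(token_accuracy(gold, pred))  # 0.75
```

Reporting both numbers is useful because token accuracy can look high while whole entities are still missed, as in the example where one of two gold spans is not predicted.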

Data Scientist GCP | Internship (6 months)

March 2023 - August 2023
Orange - Paris, France

At Orange, we were faced with a complex challenge: our existing data ecosystem was costly and required extensive data engineering services from third-party providers. The solution came in the form of Google Cloud Platform (GCP), which offered a comprehensive suite of services including cost-effective storage, serverless architectures, and robust data engineering tools.

Recognizing the potential of GCP, Orange brought me on board as a Data Scientist to lead this transition and to explore the full range of solutions that GCP could offer. Our initial focus was on classifying customer complaints.

Key Tasks

  • Led the transition of Orange's data ecosystem to Google Cloud Platform (GCP) to reduce costs and reliance on third-party data engineering services.
  • Developed a machine learning model on Vertex AI to classify customer complaints, improving the understanding of customer feedback and identifying areas for improvement.
  • Performed data engineering tasks including establishing the architecture for machine learning solutions on GCP's Vertex AI.
  • Conducted extensive data preprocessing and cleaning to prepare data for machine learning model training.
  • Selected and tuned machine learning algorithms to optimize model performance.
  • Implemented active learning techniques to handle a large amount of unlabeled data, iteratively improving the model's performance.
  • Achieved a model accuracy of 91% on the complaint classification task.
  • Successfully tested and validated the new data architecture, confirming its efficiency and robustness.
  • Used the model's results to inform business decisions and strategy.
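
As an illustration of the active learning step listed above, here is a minimal sketch of one common query strategy, least-confidence sampling: the samples whose top predicted class probability is lowest are sent for labeling first. The strategy, the function name `least_confident`, and the probabilities are assumptions for illustration; the exact strategy used in the project is not specified here.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k samples whose top class probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical class probabilities for four unlabeled complaints (3 classes).
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.55, 0.40, 0.05],
    [0.34, 0.33, 0.33],
]
to_label = least_confident(pool_probs, k=2)
print(to_label)  # [3, 1]
```

In each round, the selected samples are labeled, added to the training set, and the model is retrained before the pool is scored again.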

Manager Contact

  • Name: Benoit Eock Belinga
  • Role: Lead Data Scientist | Programme Data / IA
  • Email: [email protected]
  • Phone: +33 6 84 59 08 70

Recommendation Letter

For a detailed recommendation, please see the letter from my manager:

View Recommendation Letter

Data Engineer | Internship (6 months)

March 2022 - September 2022
Stellantis - Casablanca, Morocco

As a Data Engineering Intern at Stellantis, I played a pivotal role in the Carflow MEA Dashboards project, a key initiative aimed at leveraging data from diverse sources to monitor supply chain operations in the MEA region. The existing manual ETL (Extract, Transform, Load) process posed challenges in terms of increased effort, potential human error, and inefficiencies. My primary responsibility was to automate this ETL process, thereby streamlining data management for the project.

Key Tasks

  • Collaborated closely with a data architect to establish an effective working environment, gaining valuable insights into the data team's operations.
  • Conducted in-depth research into Stellantis's Supply Chain business, facilitated by the supply chain business team, to understand the business context and requirements.
  • Analyzed the existing ETL solution, identified business requirements, and mapped out a strategic plan for process improvement.
  • Designed and implemented an automated ETL solution using PySpark, Apache Airflow, and Oracle Exadata, tools from the Stellantis Data department.
  • Conducted rigorous testing of the data pipelines and documented the end-to-end automation process to ensure knowledge transfer and future reference.
  • Improved the system's efficiency, reducing latency by 46 ms and the failure rate by 82%.

Education

Paris-Saclay University

September 2022 - September 2023
Paris, France
Master of Science, Artificial Intelligence
Main Courses: ML Algorithms, Deep Learning, Computer Vision, Large-Scale Distributed Data Processing, Probabilistic Generative Models, Applied Statistics, Advanced Optimization, Signal Processing, NLP, Information Retrieval, Reinforcement Learning.

Ecole des sciences de l'information

September 2018 - August 2022
Rabat, Morocco
Master of Engineering, Data and Knowledge

GPA: 3.25

Main Courses: Data Structures and Algorithms, Business Intelligence and Data Warehousing, Big Data, Artificial Intelligence, Expert Systems, Statistics, Machine Learning, Network Security, Operating Systems, Knowledge Management.

Classes Preparatoires Aux Grandes Ecoles

September 2016 - August 2018
Agadir, Morocco
MPSI, MP
Main Courses: Mathematics, Physics, Engineering Sciences, Chemistry, Computer Science.

Skills

Technical Skills
  • Languages & Data: Python (NumPy, Pandas, Scikit-learn, spaCy, PySpark), SQL, Bash
  • Machine Learning & NLP: CNNs, RNNs, LLMs, Active Learning, Transformers, Token Classification, Entity Resolution & Linking, IOB Tagging, AutoML, Clustering
  • Deep Learning & LLMs: PyTorch, LLaMA, Mistral, RoBERTa, GPT-3.5/4, BioLORD
  • Data Engineering & Cloud: Spark, Azure (Databricks, VMs, Blob Storage), GCP (Vertex AI, BigQuery ML, APIs), Airflow
  • MLOps & DevOps: MLflow, DVC, PyTest, Pydantic, CI/CD, Docker, vLLM
  • Tools: Git, GitHub, Jupyter, VS Code, Prodigy, Postman
Soft Skills
  • Problem Solving: Designing efficient ML pipelines and resolving model issues
  • Analytical Thinking: Interpreting data patterns and translating insights into decisions
  • Statistical Reasoning: Applying statistical methods to validate hypotheses and model performance
  • Communication: Presenting technical results clearly to diverse stakeholders
  • Collaboration: Working in Agile teams with product managers, data engineers, and researchers
  • Initiative: Prototyping new models and exploring state-of-the-art techniques
Languages
  • English: Bilingual Proficiency
  • French: Bilingual Proficiency
  • Arabic: Native

Projects

Wav2Vec2 Fine-Tuning for ASR

January 2023

Implemented a fine-tuning approach on the Wav2Vec2 model for Automatic Speech Recognition (ASR). Fine-tuned on Spanish and Finnish speech datasets to improve performance on low-resource languages.

Project Steps

  • Setting up APIs
  • Loading and preprocessing the CSS10 dataset
  • Configuring the Wav2Vec2CTCTokenizer and Wav2Vec2FeatureExtractor
  • Fine-tuning and training the model
  • Evaluating the model using the Word Error Rate (WER) metric
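
For reference, the Word Error Rate used in the evaluation step is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below computes it from scratch in pure Python; it is an illustration, not the project's code, and the function name and example sentences are hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,              # deletion: reference word missing
                curr[j - 1] + 1,          # insertion: extra hypothesis word
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("el gato duerme en casa", "el gato duerme casa"))  # 0.2
```

A WER of 0.2 here means one error (a dropped word) over five reference words; lower is better, and values above 1.0 are possible when the hypothesis contains many insertions.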

Results

Achieved a WER of 0.165 for Spanish and 0.376 for Finnish.

Future Work

Plan to fine-tune Wav2Vec2 on more low-resource languages and compare performance with other pretrained models.

  • Language: Python
  • Tools: Wav2Vec2 (Hugging Face Transformers), CSS10 dataset, WER metric, APIs, Google Drive.
  • GitHub: Wav2Vec2 Fine-Tuning for ASR

Bias Mitigation For Age Detection

October 2022

Developed models estimating age from images, focusing on mitigating biases related to age, gender, ethnicity, and facial expression.

Techniques Used

  • Data Augmentation with Albumentations and OpenCV
  • Customized loss functions
  • Base models: NASNet, ResNet

Probabilistic Generative Models

October 2022

Implemented several probabilistic generative models from scratch using PyTorch, including Variational AutoEncoder, Restricted Boltzmann Machine, and Real-NVP Normalizing Flows.

Flower Recognition with Fine-tuned VGG16

March 2022

Executed multi-class image classification on a flower dataset using VGG16 with Transfer Learning. Achieved 85% accuracy on the validation set.

Project Steps

  • Implemented image generators with data augmentation
  • Fine-tuned the VGG16 architecture
  • Evaluated model performance on the validation set

Dataset

Used a dataset with 4242 images across five classes: chamomile, tulip, rose, sunflower, and dandelion.

Recommendation with PySpark

January 2022

Trained an ALS (alternating least squares) recommendation model with PySpark on the MovieLens 100k dataset, stored on HDFS, to generate movie recommendations.

Project Steps

  • Built recommendation model using user preferences
  • Computed recommendations and similar items
  • Applied evaluation metrics for model performance

Interests

Traveling
Music
Chess
Basketball
Hiking