Jeremy Thaller

Senior Data Scientist jeremy@thaller.dev

I'm a dual MS engineer with degrees in materials science and applied physics. I have a passion for beautiful data visualization and specialize in deep learning predictive modeling.


Projects

Accredited Investors: Binary Classification and Probabilities

Python

To be eligible for accreditation, a user must fulfill certain criteria based on income, net worth, marital status, and homeownership. Through data augmentation, I was able to acquire a rough dataset of the features needed for classification; however, a large subset of users in this dataset still had incomplete data. To determine which of these "nebulous" users were in fact eligible for accreditation, I used supervised machine learning. Through extensive feature cleaning, imputation, K-fold cross-validation, and scikit-learn pipelines, I created a Random Forest model capable of predicting which users were eligible with 98.2% accuracy, 99.9% precision, and 86% recall. These scores are even higher when looking at the "edges" of the probability histogram. To conclude the project, I weighed the tradeoff between precision and recall via an ROC curve (AUC = 0.986).
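As a rough illustration of that precision/recall tradeoff (a minimal sketch, not the project's code), the snippet below computes both metrics at two probability thresholds. The labels, probabilities, and the `precision_recall` helper are made-up examples, not data or functions from the project.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Return (precision, recall) when probabilities >= threshold count as positive."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold
        if predicted_positive and label == 1:
            tp += 1  # correctly flagged as eligible
        elif predicted_positive and label == 0:
            fp += 1  # flagged as eligible, but is not
        elif not predicted_positive and label == 1:
            fn += 1  # eligible user that was missed
    # Convention assumed here: precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (1 = eligible) and made-up predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.80, 0.60, 0.30, 0.55, 0.20, 0.10, 0.05]

for threshold in (0.5, 0.7):
    p, r = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the one false positive but also drops a true positive, so precision rises while recall falls.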

Semiconductor Properties Prediction

Python

In this project, I used machine learning to optimize the transport integral of tetracene. Organic semiconductors are responsible for the OLED panels found in high-end TVs and cell phone screens. Unfortunately, computing electronic properties of theoretical materials, such as the molecular transport integral, is computationally expensive. Using machine learning, we can predict the transport integral of a material much more quickly, allowing for more rapid structural iterations. Given the Coulomb interactions (calculated as a function of the atomic positions and structure), I predicted the corresponding transport integrals. PCA was critical for reducing the computational cost without affecting performance. The best method found for the predictions was a neural net built using Keras, though Kernel Ridge Regression also performed quite well. K-fold cross-validation was used during training, and a randomized grid search was used for hyperparameter tuning.

Nanoparticle Disorder Neural Network

Python, Mathematica

In this project, I used convolutional neural networks to predict disorder in metallic nanoparticles. Given a XANES (X-ray Absorption Near Edge Spectroscopy) spectrum for the material, my neural network, built with Keras, predicts the mean relative squared displacement of atoms from their non-disordered, crystalline positions. This would allow researchers to quantify disorder in situ, rather than relying on supplementary measurements or collecting the full EXAFS (Extended X-ray Absorption Fine Structure) data to approximate the Debye-Waller factor (a similar measurement of disorder). In previous work, training sets have been created through computationally intensive simulations, requiring days' worth of running time on dedicated computing clusters. Instead, I developed a new method for creating disordered-structure training data through clever statistical averaging of non-disordered structures, reducing the computational cost to something that can be performed on a laptop in under an hour. Once the network was trained to predict disorder on simulation data, I used transfer learning to extend its predictive domain to experimental data. Due to the nature of the work, I can't link my GitHub repository or thesis until after publication.

Friend Identifier

Python

I trained a naive Bayes classifier to identify which one of my Facebook friends sent an unlabeled message. Using personal messages as training data, I took messages from a group chat (unseen by the classifier) and had the program determine which messages were sent by which person. The program can also make a classification based on a single unseen message. Instead of reusing my HTML scraping from the Message Counter project, I decided to download my Messenger JSON files from Facebook and load and manipulate the data from there.
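For readers unfamiliar with the technique, here is a minimal from-scratch sketch of a multinomial naive Bayes text classifier with add-one smoothing. It is not the project's code: the class name, the whitespace tokenization, and the senders "alice" and "bob" with their messages are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, messages, senders):
        self.word_counts = defaultdict(Counter)  # sender -> word -> count
        self.sender_counts = Counter(senders)    # sender -> number of messages
        self.vocabulary = set()
        for text, sender in zip(messages, senders):
            words = text.lower().split()
            self.word_counts[sender].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        total_messages = sum(self.sender_counts.values())
        vocab_size = len(self.vocabulary)
        best_sender, best_score = None, -math.inf
        for sender, n_messages in self.sender_counts.items():
            # Log prior: how often this sender writes at all.
            score = math.log(n_messages / total_messages)
            total_words = sum(self.word_counts[sender].values())
            for word in text.lower().split():
                count = self.word_counts[sender][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_sender, best_score = sender, score
        return best_sender


# Hypothetical training messages and senders.
messages = [
    "want to grab coffee later",
    "coffee sounds great",
    "did you watch the match",
    "what a match last night",
]
senders = ["alice", "alice", "bob", "bob"]

classifier = NaiveBayesTextClassifier().fit(messages, senders)
print(classifier.predict("coffee later"))
print(classifier.predict("watch the match"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.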

Facebook Message Counter

Python, HTML

The purpose of this project was to see messaging trends between my friend and me. Who sends more messages? Are there any anomalies? The main challenge of this project was to scrape the data and organize it into a usable format. BeautifulSoup4 was used for the HTML scrape, with the message content and senders saved into CSV spreadsheets. The CSVs were then loaded into a Pandas DataFrame, where I could write queries to count messages according to their timestamps. Matplotlib and Seaborn were used for data visualization.
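The counting step can be sketched as follows. This version uses only the Python standard library rather than the Pandas workflow described above, and the CSV column names (`sender`, `timestamp`, `content`), the timestamp format, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical CSV layout; the real scraped files may use different columns.
csv_text = """sender,timestamp,content
me,2019-01-03 09:15:00,morning
friend,2019-01-03 09:17:00,hey
me,2019-01-20 18:02:00,dinner?
friend,2019-02-01 12:00:00,lunch
"""


def count_messages_by_month(csv_file):
    """Count messages per (year-month, sender) pair."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        sent_at = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        counts[(sent_at.strftime("%Y-%m"), row["sender"])] += 1
    return counts


counts = count_messages_by_month(io.StringIO(csv_text))
for (month, sender), n in sorted(counts.items()):
    print(month, sender, n)
```

Grouping by a `YYYY-MM` string keeps the keys sortable in chronological order, which makes the counts easy to plot as a time series.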

Titanic

Python

The classic first project for anyone's machine learning journey. I tried a variety of machine learning models, eventually deciding to tune the hyperparameters for XGBoost and include a new interaction term between fare paid and passenger's sex. I spent some time stacking several models, but ended up abandoning this approach as it is impractical for most professional applications.

Toothbrushing Patterns

Python

Ever wonder about your toothbrushing patterns? No? Fair enough. My electric toothbrush actually happens to record lots of data, including the exact time and duration of every brushing session, as well as the percent of tooth coverage (and location!) during each session. Colgate doesn't let you download your data, but they let you sync your data with Apple HealthKit, which does let you download your data as an XML file. I imported this data into Excel and Pandas and created this heat map of 5 months of my toothbrushing data.

Image Classifier

Python

I've written several CIFAR-10 and MNIST classifier projects, and will include most of them here. The first classifier was a neural network written from scratch. I used a pretty simple architecture and obtained unremarkable results. I then wrote a more complicated network using PyTorch Lightning. Then I improved upon it by implementing CNNs, then RNNs, and finally including transfer learning. This was all work for a Deep Learning class at the Technical University of Munich. Because these are homework solutions, I have to keep my Colab notebooks and GitHub repository private, at least for a few years.


Education

ERASMUS MUNDUS

Masters in Materials Science Dual-Degree Program
Ludwig Maximilians and Technical University of Munich, Germany

Joint M.S. in Materials Science and Engineering

Adam Mickiewicz University, Poznań, Poland

M.S. in Applied Physics

Brookhaven National Laboratory, New York

Thesis: Investigation of Bond Strain Effects on XANES Spectra via Artificial Neural Networks

Sept. 2019 - Aug. 2021

Williams College

Bachelor of Arts with Honors
Physics - Pre-Engineering Studies

Thesis: Towards an Adhesion Based Measurement of Strain-Dependent Surface Stress in Soft Solids

Sept. 2015 - May 2019

Skills

Programming Languages & Competency
  • Python, 5 years. Expert.
  • MATLAB, 5 years. Expert.
  • Java, 7 years. Proficient (just rusty).
  • Mathematica, 4 years. Proficient.
  • HTML, 4 years. Proficient.
  • SQL, 2 years. Proficient.
  • R, 1 year. Basic Familiarity.

Relevant Coursework

  • Intro to Machine Learning
  • Deep Learning
  • Computational Materials Design
  • Molecular Dynamics Simulations
  • Particle Physics
  • General Relativity
  • Condensed Matter Physics
  • Statistical Mechanics
  • Multivariate Calculus
  • Linear Algebra
  • Partial Differential Equations

Interests

I'm happiest in life when watching the Liverpool game with a fresh cup of coffee and good company. When the weather is nice, I enjoy running. I was a sprinter and captain of the Track team in college, but I'm trying to make the great transition to distance running. I'm not sure I'll ever pick up my bassoon again, but I still enjoy playing jazz piano every day.

To see what I've been working on, check out my blog. Each week I recap what I've done, what I've learned, and decide what I'll do next. I've found this helps consolidate what I've learned and keeps me accountable for continuously learning more about data science and programming.