My Portfolio - Jeremy Thaller

Accredited Investors: Binary Classification and Probabilities

Python

To be eligible for accreditation, a user must fulfill certain criteria based on income, net worth, marital status, and home ownership. Through data augmentation, I was able to acquire a rough dataset of the features needed for classification; however, a large subset of users in this dataset still had incomplete data. To determine which of these "nebulous" users were in fact eligible for accreditation, I used supervised machine learning. Through extensive feature cleaning, imputation, K-fold cross-validation, and scikit-learn pipelines, I created a Random Forest model capable of predicting which users were eligible with 98.2% accuracy, 99.9% precision, and 86% recall. These scores are even higher when looking at the "edges" of the probability histogram, where the model's predicted probabilities are closest to 0 or 1. To conclude the project, I weighed the tradeoff between precision and recall via an ROC curve (AUC = 0.986).
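As a rough illustration of how accuracy, precision, and recall relate to a probability threshold, and why scores improve at the "edges" of the probability histogram, here is a minimal sketch in plain Python. The labels and probabilities are made up for the example, and the function names and the 0.1/0.9 cutoffs are assumptions, not the project's actual code or data.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn), treating probabilities >= threshold as positive."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, and recall at the given threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def confident_only(y_true, y_prob, low=0.1, high=0.9):
    """Keep only samples whose probability sits at the edges of the histogram."""
    kept = [(t, p) for t, p in zip(y_true, y_prob) if p <= low or p >= high]
    return [t for t, _ in kept], [p for _, p in kept]


# Made-up labels (1 = eligible) and model probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.95, 0.92, 0.40, 0.05, 0.08, 0.60, 0.97, 0.02]

all_scores = metrics(y_true, y_prob)
edge_true, edge_prob = confident_only(y_true, y_prob)
edge_scores = metrics(edge_true, edge_prob)
print(all_scores)
print(edge_scores)
```

In this toy data the two mistakes both come from mid-range probabilities, so dropping the uncertain middle of the histogram raises every score. The cost is that fewer users receive a prediction at all.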

GitHub Repository

Semiconductor Properties Prediction

Python

In this project, I used machine learning to optimize the transport integral of tetracene. Organic semiconductors are responsible for the OLED panels found in high-end TVs and cell phone screens. Unfortunately, computing the electronic properties of theoretical materials, such as the molecular transport integral, is computationally expensive. Using machine learning, we can predict the transport integral of a material much more quickly, allowing for more rapid structural iterations. Given the Coulomb interactions (calculated as a function of the atomic positions and structure), I predicted the corresponding transport integrals. PCA was critical for reducing the computational cost without affecting performance. The best method found for the predictions was a neural net built using Keras, though Kernel Ridge Regression also performed quite well. K-fold cross-validation was implemented during training, and a random grid search was used for hyperparameter tuning.
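One common way to turn atomic positions into Coulomb-interaction features is the Coulomb matrix descriptor, with diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. Whether the project used exactly this descriptor is an assumption; the sketch below only illustrates the idea in plain Python, with hypothetical function names and made-up coordinates.

```python
import math


def coulomb_matrix(atomic_numbers, positions):
    """Build the Coulomb matrix for a set of atoms.

    Diagonal:     0.5 * Z_i ** 2.4
    Off-diagonal: Z_i * Z_j / distance(i, j)
    Distances are in whatever length unit the positions use.
    """
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / distance
    return matrix


def upper_triangle(matrix):
    """Flatten the symmetric matrix into a feature vector (upper triangle only)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


# Made-up toy fragment: two carbons and one hydrogen (not real tetracene geometry).
atomic_numbers = [6, 6, 1]
positions = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.0, 1.0, 0.0)]

cm = coulomb_matrix(atomic_numbers, positions)
features = upper_triangle(cm)
print(len(features))
```

The flattened vector grows quadratically with the number of atoms, which is one reason a dimensionality reduction step such as PCA helps before fitting a regressor.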

GitHub Repository

Nanoparticle Disorder Neural Network

Python, Mathematica

In this project, I used convolutional neural networks to predict disorder in metallic nanoparticles. Given a XANES (X-ray Absorption Near Edge Spectroscopy) spectrum for the material, my neural network, built with Keras, predicts the mean relative squared displacement of atoms from their non-disordered, crystalline positions. This would allow researchers to quantify disorder in situ, rather than relying on supplementary measurements or collecting the full EXAFS (Extended X-ray Absorption Fine Structure) data to approximate the Debye-Waller factor (a similar measurement of disorder). In previous work, training sets have been created through computationally intensive simulations, requiring days' worth of running time on dedicated computing clusters. Instead, I have developed a new method for creating disordered-structure training data through clever statistical averaging of non-disordered structures, reducing the computational cost to something that can be performed on a laptop in under an hour. Once the network was trained to predict disorder on simulation data, I used transfer learning to extend its predictive domain to experimental data. Due to the nature of the work, I can't link my GitHub repository or thesis until after publication.

Friend Identifier

Python

I trained a naive Bayes classifier to identify which one of my Facebook friends sent an unlabeled message. Using personal messages as training data, I took messages from a group chat (unseen by the classifier) and had the program determine which messages were sent by which person. The program can also make a classification based on a single unseen message. Instead of reusing my HTML scraping from the Message Counter project, I decided to download my Messenger JSON files from Facebook and load and manipulate the data from there.
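To show the idea behind the classifier, here is a minimal from-scratch sketch of a multinomial naive Bayes model with Laplace smoothing. It is not the project's code: the class name, the whitespace tokenizer, and the sample messages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSender:
    """Multinomial naive Bayes over words, with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # sender -> word -> count
        self.message_counts = Counter()          # sender -> number of messages
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Deliberately simple: lowercase and split on whitespace.
        return text.lower().split()

    def fit(self, messages):
        """messages is an iterable of (sender, text) pairs."""
        for sender, text in messages:
            self.message_counts[sender] += 1
            for word in self.tokenize(text):
                self.word_counts[sender][word] += 1
                self.vocabulary.add(word)
        return self

    def log_scores(self, text):
        """Return log P(sender) + sum of log P(word | sender) for each sender."""
        total_messages = sum(self.message_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for sender, n_messages in self.message_counts.items():
            counts = self.word_counts[sender]
            total_words = sum(counts.values())
            score = math.log(n_messages / total_messages)
            for word in self.tokenize(text):
                numerator = counts[word] + self.alpha
                denominator = total_words + self.alpha * vocab_size
                score += math.log(numerator / denominator)
            scores[sender] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training messages, for illustration only.
training = [
    ("alice", "want to grab coffee later"),
    ("alice", "coffee sounds great"),
    ("bob", "did you watch the game"),
    ("bob", "great game last night"),
]
model = NaiveBayesSender().fit(training)
print(model.predict("coffee later"))
print(model.predict("watch the game"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, and the smoothing term keeps a single unseen word from zeroing out a sender's score.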

GitHub Repository
Explanatory Blog Post

Facebook Message Counter

Python, HTML

The purpose of this project was to see messaging trends between my friend and me. Who sends more messages? Are there any anomalies? The main challenge of this project was to scrape the data and organize it into a usable format. BeautifulSoup4 was used for the HTML scrape, with the message content and senders saved into CSV spreadsheets. The CSVs were then loaded into a Pandas DataFrame, where I could write queries to count messages according to their timestamps. Matplotlib and Seaborn were used for data visualization.
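The counting step can be sketched as follows. To keep the example self-contained it uses the standard library's csv module and a Counter rather than Pandas, and the column names (sender, timestamp, content), the ISO timestamp format, and the sample rows are assumptions, not the project's actual schema.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def count_messages_by_month(csv_file):
    """Count messages per (sender, 'YYYY-MM') from an open CSV file object."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        month = datetime.fromisoformat(row["timestamp"]).strftime("%Y-%m")
        counts[(row["sender"], month)] += 1
    return counts


# Made-up rows standing in for a scraped CSV.
sample = io.StringIO(
    "sender,timestamp,content\n"
    "Me,2019-01-03T09:15:00,morning\n"
    "Friend,2019-01-03T09:17:00,hey\n"
    "Me,2019-01-20T18:02:00,dinner?\n"
    "Friend,2019-02-01T12:00:00,lunch?\n"
)

monthly = count_messages_by_month(sample)
for (sender, month), n in sorted(monthly.items()):
    print(sender, month, n)
```

With Pandas, the same result would typically come from grouping on the sender column and a month derived from the timestamp column.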

GitHub Repository

Titanic

Python

The classic first project in anyone's machine learning journey. I tried a variety of machine learning models, eventually deciding to tune the hyperparameters for XGBoost and to include a new interaction term between the fare paid and the passenger's sex. I spent some time stacking several models, but ended up abandoning this approach, as it is impractical for most professional applications.
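An interaction term of this kind can be as simple as multiplying the fare by an indicator for the passenger's sex. The sketch below assumes that encoding (fare times 1 for female, 0 for male) and a hypothetical column name, Fare_x_Female; the notebook's exact definition may differ. The Sex and Fare column names follow the Kaggle Titanic dataset, and the rows are made up.

```python
def add_fare_sex_interaction(passengers):
    """Return copies of the passenger rows with a fare-by-sex interaction column."""
    enriched = []
    for passenger in passengers:
        is_female = 1 if passenger["Sex"] == "female" else 0
        row = dict(passenger)  # copy so the input rows are left untouched
        row["Fare_x_Female"] = passenger["Fare"] * is_female
        enriched.append(row)
    return enriched


# Made-up rows, for illustration only.
passengers = [
    {"Sex": "female", "Fare": 80.0},
    {"Sex": "male", "Fare": 80.0},
    {"Sex": "female", "Fare": 7.5},
]
with_interaction = add_fare_sex_interaction(passengers)
print([row["Fare_x_Female"] for row in with_interaction])
```

The extra column lets a model treat the same fare differently depending on sex without having to discover that relationship entirely on its own.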

Jupyter Notebook
GitHub Repository

Toothbrushing Patterns

Python

Ever wonder about your toothbrushing patterns? No? Fair enough. My electric toothbrush actually happens to record lots of data, including the exact time and duration of every brushing session, as well as the percent of tooth coverage (and location!) during each session. Colgate doesn't let you download your data, but they let you sync your data with Apple HealthKit, which does let you download your data as an XML file. I imported this data into Excel and Pandas and created a heat map of 5 months of my toothbrushing data.
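The aggregation behind such a heat map can be sketched with the standard library alone. The example assumes the export stores each session as a Record element with the type HKCategoryTypeIdentifierToothbrushingEvent and a startDate such as "2021-03-01 07:30:00 -0500"; treat the element layout, the date format, and the sample data as assumptions to check against a real export file.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"  # assumed export timestamp format
TOOTHBRUSHING = "HKCategoryTypeIdentifierToothbrushingEvent"


def brushing_grid(xml_text, record_type=TOOTHBRUSHING):
    """Count sessions per (weekday, hour); weekday 0 is Monday."""
    root = ET.fromstring(xml_text)
    grid = Counter()
    for record in root.iter("Record"):
        if record.get("type") != record_type:
            continue  # skip steps, heart rate, and other record types
        start = datetime.strptime(record.get("startDate"), DATE_FORMAT)
        grid[(start.weekday(), start.hour)] += 1
    return grid


# Made-up export fragment, for illustration only.
sample_xml = """
<HealthData>
  <Record type="HKCategoryTypeIdentifierToothbrushingEvent"
          startDate="2021-03-01 07:30:00 -0500" endDate="2021-03-01 07:32:00 -0500"/>
  <Record type="HKCategoryTypeIdentifierToothbrushingEvent"
          startDate="2021-03-01 22:10:00 -0500" endDate="2021-03-01 22:12:00 -0500"/>
  <Record type="HKCategoryTypeIdentifierToothbrushingEvent"
          startDate="2021-03-02 07:35:00 -0500" endDate="2021-03-02 07:37:00 -0500"/>
  <Record type="HKCategoryTypeIdentifierToothbrushingEvent"
          startDate="2021-03-08 07:31:00 -0500" endDate="2021-03-08 07:33:00 -0500"/>
  <Record type="HKQuantityTypeIdentifierStepCount"
          startDate="2021-03-01 08:00:00 -0500" endDate="2021-03-01 08:05:00 -0500"/>
</HealthData>
"""

grid = brushing_grid(sample_xml)
print(sorted(grid.items()))
```

The resulting (weekday, hour) counts can be reshaped into a 7 by 24 table and passed to a plotting library to draw the heat map.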

Image Classifier

Python

I've written several CIFAR-10 and MNIST classifier projects, and will include most of them here. The first classifier was a neural network written from scratch. I used a pretty simple architecture and obtained unremarkable results. I then wrote a more complicated network using PyTorch Lightning. I then improved upon it by implementing CNNs, then RNNs, and finally transfer learning. This was all work for a Deep Learning class at the Technical University of Munich. Because these are homework solutions, I have to keep my Colab Notebooks and GitHub Repository private, at least for a few years.

Jeremy Thaller

Education

ERASMUS MUNDUS

Williams College

Skills

Relevant Coursework

Interests