Jeremy Thaller
I'm a dual MS engineer with degrees in materials science and applied physics. I have a passion for beautiful data visualization and specialize in deep learning predictive modeling.
To be eligible for accreditation, a user must fulfill certain criteria based on income, net worth, marital status, and homeownership. Through data augmentation, I was able to acquire a rough dataset of the features needed for classification; however, a large subset of users in this dataset still had incomplete data. To determine which of these "nebulous" users were in fact eligible for accreditation, I used supervised machine learning. Through extensive feature cleaning, imputation, K-fold cross-validation, and scikit-learn pipelines, I created a Random Forest model capable of predicting which users were eligible with 98.2% accuracy, 99.9% precision, and 86% recall. These scores are even higher when looking at the "edges" of the probability histogram. To conclude the project, I weighed the tradeoff between precision and recall via an ROC curve (AUC = 0.986).
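As a rough illustration of the precision/recall tradeoff (not the project's actual code), the sketch below sweeps a decision threshold over predicted probabilities. The probabilities, labels, and the function name `precision_recall_at` are made up for this example.

```python
def precision_recall_at(threshold, probs, labels):
    """Return (precision, recall) when predicting positive for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels (1 = eligible).
probs = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for t in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(t, probs, labels)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold keeps only the most confident predictions, which tends to raise precision at the cost of recall.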
In this project, I used machine learning to optimize the transport integral of tetracene. Organic semiconductors are responsible for the OLED panels found in high-end TVs and cell phone screens. Unfortunately, computing electronic properties of theoretical materials, such as the molecular transport integral, is computationally expensive. Using machine learning, we can predict the transport integral of a material much more quickly, allowing for more rapid structural iterations. Given the Coulomb interactions (calculated as a function of the atomic positions and structure), I predicted the corresponding transport integrals. PCA was critical for reducing the computational cost without affecting performance. The best method I found for the predictions was a neural net built using Keras, though Kernel Ridge Regression also performed quite well. K-fold cross-validation was used during training, and a random grid search was used for hyperparameter tuning.
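The combination of K-fold cross-validation and random hyperparameter search can be sketched in plain Python. This is only an illustration: it swaps in a one-dimensional ridge regression on synthetic data as a stand-in for the Keras and Kernel Ridge Regression models, and every function name and value here is hypothetical.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def fit_ridge_1d(xs, ys, alpha):
    """Closed-form slope for 1-D ridge regression without an intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cv_error(xs, ys, alpha, k=5):
    """Mean squared validation error averaged over k folds."""
    errors = []
    for train, val in k_fold_indices(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in val) / len(val))
    return sum(errors) / len(errors)

# Synthetic data: y is roughly 2x plus a little noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

# Random search: sample alpha log-uniformly and keep the best CV score.
candidates = [10 ** rng.uniform(-4, 2) for _ in range(20)]
best_alpha = min(candidates, key=lambda a: cv_error(xs, ys, a))
print(f"best alpha: {best_alpha:.4g}, CV MSE: {cv_error(xs, ys, best_alpha):.4g}")
```

Sampling the regularization strength on a log scale is a common choice when the useful range spans several orders of magnitude.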
In this project, I used convolutional neural networks to predict disorder in metallic nanoparticles. Given a XANES (X-ray Absorption Near Edge Spectroscopy) spectrum for the material, my neural network, built with Keras, predicts the mean relative squared displacement of atoms from their non-disordered, crystalline positions. This would allow researchers to quantify disorder in situ, rather than relying on supplementary measurements or collecting the full EXAFS (Extended X-ray Absorption Fine Structure) data to approximate the Debye-Waller factor (a similar measurement of disorder). In previous work, training sets have been created through computationally intensive simulations, requiring days' worth of running time on dedicated computing clusters. Instead, I developed a new method for creating disordered-structure training data through statistical averaging of non-disordered structures, reducing the computational cost to something that can be performed on a laptop in under an hour. Once the network was trained to predict disorder on simulation data, I used transfer learning to extend its predictive domain to experimental data.
Due to the nature of the work, I can't link my GitHub repository or thesis until after publication.
I trained a naive Bayes classifier to identify which of my Facebook friends sent an unlabeled message. Using personal messages as training data, I took messages from a group chat (unseen by the classifier) and had the program determine which messages were sent by which person. The program can also classify a single unseen message.
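For readers unfamiliar with the technique, here is a minimal sketch of a multinomial naive Bayes text classifier with Laplace smoothing, written with only the standard library. It is not the project's code; the class name, the sender names "Alex" and "Sam", and the training messages are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over whitespace tokens with Laplace smoothing."""

    def fit(self, messages, senders):
        self.word_counts = defaultdict(Counter)
        self.sender_counts = Counter(senders)
        for text, sender in zip(messages, senders):
            self.word_counts[sender].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.sender_counts.values())
        best_sender, best_score = None, -math.inf
        for sender, n_msgs in self.sender_counts.items():
            counts = self.word_counts[sender]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_msgs / total)  # log prior for this sender
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_sender, best_score = sender, score
        return best_sender

# Invented training messages from two hypothetical senders.
train_messages = [
    "did you watch the match last night",
    "what a goal that was",
    "the gradient keeps exploding in my model",
    "try lowering the learning rate",
]
train_senders = ["Alex", "Alex", "Sam", "Sam"]

clf = NaiveBayesTextClassifier().fit(train_messages, train_senders)
print(clf.predict("what a match last night"))
print(clf.predict("my model needs a lower learning rate"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.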
Instead of reusing my HTML scraping from the Message Counter project, I decided to download my JSON Messenger files from Facebook and load and manipulate the data from there.
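Loading messages from a JSON export might look roughly like the sketch below. The field names (`messages`, `sender_name`, `timestamp_ms`, `content`) are assumptions about the export layout and should be checked against a real file; the sample data and the helper `load_text_messages` are hypothetical.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Inline stand-in for one exported conversation file. A real export would be
# read from disk instead. The field names here are assumptions.
raw = """
{
  "participants": [{"name": "Alex"}, {"name": "Sam"}],
  "messages": [
    {"sender_name": "Alex", "timestamp_ms": 1577882096000, "content": "Happy new year!"},
    {"sender_name": "Sam", "timestamp_ms": 1577882156000, "content": "You too!"},
    {"sender_name": "Alex", "timestamp_ms": 1577882216000}
  ]
}
"""

def load_text_messages(json_text):
    """Return (sender, UTC datetime, text) tuples, skipping entries with no text."""
    data = json.loads(json_text)
    rows = []
    for msg in data.get("messages", []):
        if "content" not in msg:  # e.g. an entry that carries no text
            continue
        sent_at = datetime.fromtimestamp(msg["timestamp_ms"] / 1000, tz=timezone.utc)
        rows.append((msg["sender_name"], sent_at, msg["content"]))
    return rows

rows = load_text_messages(raw)
print(Counter(sender for sender, _, _ in rows))
```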
The purpose of this project was to see messaging trends between my friend and me. Who sends more messages? Are there any anomalies? The main challenge of this project was to scrape the data and organize it into a usable format. BeautifulSoup4 was used for the HTML scrape, with the message content and senders saved into CSV spreadsheets. The CSVs were then loaded into a Pandas DataFrame, where I could write queries to count messages according to their timestamps. Matplotlib and Seaborn were used for data visualization.
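Counting messages by sender and month can be sketched as follows. The project itself used Pandas; this version uses only the standard library to stay dependency-free, and the CSV column names (`sender`, `timestamp`, `message`) and sample rows are assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Inline stand-in for a scraped CSV; the column names are assumptions.
csv_text = """sender,timestamp,message
Alex,2019-01-03 09:15,morning
Sam,2019-01-03 09:17,hey
Alex,2019-01-20 18:02,dinner?
Sam,2019-02-01 12:00,lunch
Alex,2019-02-01 12:01,sure
Alex,2019-02-14 20:30,running late
"""

def count_by_sender_and_month(csv_file):
    """Count messages per (sender, 'YYYY-MM') pair."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        month = row["timestamp"][:7]  # 'YYYY-MM' prefix of the timestamp
        counts[(row["sender"], month)] += 1
    return counts

counts = count_by_sender_and_month(io.StringIO(csv_text))
for (sender, month), n in sorted(counts.items()):
    print(f"{month}  {sender:<5} {n}")
```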
The classic first project for anyone's machine learning journey. I tried a variety of machine learning models, eventually deciding to tune the hyperparameters for XGBoost and include a new interaction term between fare paid and the passenger's sex. I spent some time stacking several models, but ended up abandoning this approach as it is impractical for most professional applications.
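One simple way to build such an interaction term is to multiply the fare by a binary encoding of sex. The encoding, the field names, and the helper `add_fare_sex_interaction` below are assumptions for illustration and may differ from what the project actually used.

```python
def add_fare_sex_interaction(passengers):
    """Return copies of passenger dicts with a fare-by-sex interaction feature."""
    enriched = []
    for p in passengers:
        is_female = 1 if p["sex"] == "female" else 0
        enriched.append(
            {**p, "is_female": is_female, "fare_x_female": p["fare"] * is_female}
        )
    return enriched

# Made-up passenger records.
passengers = [
    {"sex": "female", "fare": 71.28},
    {"sex": "male", "fare": 7.25},
]
for row in add_fare_sex_interaction(passengers):
    print(row)
```

The new column lets a model learn a fare effect that differs by sex, which a purely additive model of the two raw features cannot express.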
Ever wonder about your toothbrushing patterns? No? Fair enough. My electric toothbrush actually happens to record lots of data, including the exact time and duration of every brushing session, as well as the percent of tooth coverage (and location!) during each session. Colgate doesn't let you download your data, but they let you sync your data with Apple HealthKit, which does let you download your data as an XML file. I imported this data into Excel and Pandas and created this heat map of 5 months of my toothbrushing data.
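The data behind a weekday-by-hour heat map could be extracted from the XML along these lines. This is a sketch under assumptions: the record type string, the attribute names, and the date format are guesses about the export layout and should be verified against a real export file; the sample records are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

# Inline stand-in for an exported XML file. The record type and attribute
# names are assumptions about the export layout.
xml_text = """
<HealthData>
  <Record type="HKCategoryTypeIdentifierToothbrushingEvent"
          startDate="2020-03-02 07:31:10 +0100" endDate="2020-03-02 07:33:12 +0100"/>
  <Record type="HKCategoryTypeIdentifierToothbrushingEvent"
          startDate="2020-03-02 22:05:00 +0100" endDate="2020-03-02 22:07:01 +0100"/>
  <Record type="HKQuantityTypeIdentifierStepCount"
          startDate="2020-03-03 10:00:00 +0100" endDate="2020-03-03 10:05:00 +0100"/>
  <Record type="HKCategoryTypeIdentifierToothbrushingEvent"
          startDate="2020-03-09 07:40:00 +0100" endDate="2020-03-09 07:42:05 +0100"/>
</HealthData>
"""

DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"
BRUSHING = "HKCategoryTypeIdentifierToothbrushingEvent"

def brushing_grid(xml_string):
    """Count brushing sessions per (weekday, hour); weekday 0 is Monday."""
    grid = Counter()
    for record in ET.fromstring(xml_string).iter("Record"):
        if record.get("type") != BRUSHING:
            continue  # skip unrelated health records
        start = datetime.strptime(record.get("startDate"), DATE_FORMAT)
        grid[(start.weekday(), start.hour)] += 1
    return grid

grid = brushing_grid(xml_text)
print(dict(grid))
```

The resulting counts map directly onto the cells of a 7-by-24 grid, which a plotting library can then render as a heat map.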
I've written several CIFAR-10 and MNIST classifier projects, and will include most of them here. The first classifier was a neural network written from scratch. I used a pretty simple architecture and obtained unremarkable results. I then wrote a more complicated network using PyTorch Lightning. Then I improved upon it by implementing CNNs, then RNNs, and finally transfer learning. This was all work for a Deep Learning class at the Technical University of Munich. Because these are homework solutions, I have to keep my Colab notebooks and GitHub repository private, at least for a few years.
Joint M.S. in Materials Science and Engineering
M.S. in Applied Physics
Thesis: Investigation of Bond Strain Effects on XANES Spectra via Artificial Neural Networks
Thesis: Towards an Adhesion Based Measurement of Strain-Dependent Surface Stress in Soft Solids
I'm happiest in life when watching the Liverpool game with a fresh cup of coffee and good company. When the weather is nice, I enjoy running. I was a sprinter and captain of the Track team in college, but I'm trying to make the great transition to distance running. I'm not sure I'll ever pick up my bassoon again, but I still enjoy playing jazz piano every day.
To see what I've been working on, check out my blog. Each week I recap what I've done, what I've learned, and decide what I'll do next. I've found this helps consolidate what I've learned, as well as keep me accountable for continuously learning more about data science and programming.