UPDATE: Visualizing the COVID-19 Crisis Across the World and in the United States (5/1/20)

Introduction

I am continuing a series of blog posts concerning the COVID-19 crisis that contain world map and US state map visualizations of metrics I have found useful in analyzing the situation. COVID-19 is affecting countries all over the world, and in many places the number of cases is growing exponentially every day. This blog post, with its associated Jupyter Notebook, will look at different measures of how bad the outbreak is across the world and in the United States. Each metric will be displayed in a global or US choropleth map. Additionally, this exercise sets up repeatable code to use as the crisis continues and more daily data is collected.

Disclaimer

The point of this blog is not to try to develop a model or anything of the sort to detect COVID-19, as a poorly created model could cause more harm than good. This blog is simply to generate visualizations based on publicly available data about COVID-19. These visualizations will ideally help people understand the global effect of COVID-19 and the exponential pace at which cases are developing across the world and in the United States.

Data Sources

As stated in my previous blogs, the data used in this analysis is all publicly available. The COVID-19 global daily data has been provided by the European Centre for Disease Prevention and Control. This data source is updated daily throughout the crisis and can be used to update this exercise regularly going forward. The US State level COVID-19 data has been made publicly available by the New York Times in a public GitHub Repository. In addition to the COVID-19 data, global and US state population data was used to provide per capita metrics. The global data is from The World Bank, while the US State level population data is from The United States Census Bureau.

Python Code Access

If you are interested in seeing the code used to generate these visualizations, the python code and Jupyter Notebook can be found on GitHub.

Results

To begin, previous blogs can be found here:

As a reminder, the five metrics I will be viewing at both a country level and US state level are the following:

  • Number of 2020 Cumulative Cases
  • Number of 2020 Cumulative Deaths
  • 2020 Cases per Capita
  • 2020 Deaths per Capita
  • 2020 Death Rate

In this blog post, the global results are as of 5/1/20, while the US state level results are as of 4/29/20.
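
If you would like to reproduce a map like the ones below without opening the notebook, the following is a minimal sketch of how a cumulative-case choropleth could be built with pandas and Plotly Express. The file name and the ECDC column names (countriesAndTerritories, countryterritoryCode, cases) are assumptions based on the feed's format at the time, so treat this as an illustration rather than the exact notebook code; the per capita versions simply divide these totals by the World Bank population figures.

import pandas as pd
import plotly.express as px

# Daily ECDC extract saved locally; file name and column names are assumptions
daily = pd.read_csv("ecdc_covid19_daily.csv")

# Roll the daily case counts up into 2020 cumulative totals per country
cumulative = (daily
              .groupby(["countriesAndTerritories", "countryterritoryCode"], as_index=False)["cases"]
              .sum()
              .rename(columns={"cases": "cumulative_cases"}))

# World choropleth keyed on ISO-3 country codes
fig = px.choropleth(cumulative,
                    locations="countryterritoryCode",
                    color="cumulative_cases",
                    hover_name="countriesAndTerritories",
                    color_continuous_scale="Reds",
                    title="2020 Cumulative COVID-19 Cases")
fig.show()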

Global Results – 5/1/20

US State Level Results – 4/29/20

Conclusions

As you can see by looking at the various metrics, certain countries are handling the virus better than others. The United States has the most cases, and relative to its overall population, its case count is about as high as those of some European countries. The European countries are also struggling the most in terms of deaths per capita, with the US close behind. Death rates seem to have evened out across the globe as the virus spreads, and there are fewer outliers. European countries seem to have the highest death rates in general, with many hovering above a 10% death rate. France currently has an astonishing 18.8% death rate. Some of these high numbers may have to do with how widely tests are administered; testing only those with severe symptoms would show a higher death rate.

In the United States, certain states are facing worse COVID-19 circumstances than others. The New York area has been hit the hardest, with both New York and New Jersey having a very high number of cases and deaths. In addition to the Northeast region, states like Louisiana and Michigan have a high number of deaths per capita. Death rates seem to be fairly evenly spread throughout the states, with Michigan being the highest at 9%.

UPDATE: Visualizing the COVID-19 Crisis Across the World and in the United States (4/24/20)

Introduction

I am continuing a series of blog posts concerning the COVID-19 crisis that contain world map and US state map visualizations of metrics I have found useful in analyzing the situation. COVID-19 is affecting countries all over the world, and in many places the number of cases is growing exponentially every day. This blog post, with its associated Jupyter Notebook, will look at different measures of how bad the outbreak is across the world and in the United States. Each metric will be displayed in a global or US choropleth map. Additionally, this exercise sets up repeatable code to use as the crisis continues and more daily data is collected.

Disclaimer

The point of this blog is not to try to develop a model or anything of the sort to detect COVID-19, as a poorly created model could cause more harm than good. This blog is simply to generate visualizations based on publicly available data about COVID-19. These visualizations will ideally help people understand the global effect of COVID-19 and the exponential pace at which cases are developing across the world and in the United States.

Data Sources

As stated in my previous blogs, the data used in this analysis is all publicly available. The COVID-19 global daily data has been provided by the European Centre for Disease Prevention and Control. This data source is updated daily throughout the crisis and can be used to update this exercise regularly going forward. The US State level COVID-19 data has been made publicly available by the New York Times in a public GitHub Repository. In addition to the COVID-19 data, global and US state population data was used to provide per capita metrics. The global data is from The World Bank, while the US State level population data is from The United States Census Bureau.

Python Code Access

If you are interested in seeing the code used to generate these visualizations, the python code and Jupyter Notebook can be found on GitHub.

Results

To begin, previous blogs can be found here:

As a reminder, the five metrics I will be viewing at both a country level and US state level are the following:

  • Number of 2020 Cumulative Cases
  • Number of 2020 Cumulative Deaths
  • 2020 Cases per Capita
  • 2020 Deaths per Capita
  • 2020 Death Rate

In this blog post, the global results are as of 4/24/20, while the US state level results are as of 4/23/20.

Global Results – 4/24/20

US State Level Results – 4/23/20

Conclusions

As you can see by looking at the various metrics, certain countries are handling the virus better than others. The United States has the most cases, and relative to its overall population, its case count is about as high as those of some European countries. The European countries are also struggling the most in terms of deaths per capita. Death rates seem to have evened out across the globe as the virus spreads, and there are fewer outliers. European countries seem to have the highest death rates in general, with many hovering above a 10% death rate. France currently has an astonishing 18% death rate. Some of these high numbers may have to do with how widely tests are administered; testing only those with severe symptoms would show a higher death rate.

In the United States, certain states are facing worse COVID-19 circumstances than others. The New York area has been hit the hardest, with both New York and New Jersey having a very high number of cases and deaths. In addition to the Northeast region, states like Louisiana and Michigan have a high number of deaths per capita. Death rates seem to be fairly evenly spread throughout the states, with Michigan being the highest at 8.4%.

UPDATE: Visualizing the COVID-19 Crisis Across the World and in the United States (4/17/20)

Introduction

I am continuing a series of blog posts concerning the COVID-19 crisis that contain world map and US state map visualizations of metrics I have found useful in analyzing the situation. COVID-19 is affecting countries all over the world, and in many places the number of cases is growing exponentially every day. This blog post, with its associated Jupyter Notebook, will look at different measures of how bad the outbreak is across the world and in the United States. Each metric will be displayed in a global or US choropleth map. Additionally, this exercise sets up repeatable code to use as the crisis continues and more daily data is collected.

Disclaimer

The point of this blog is not to try to develop a model or anything of the sort to detect COVID-19, as a poorly created model could cause more harm than good. This blog is simply to generate visualizations based on publicly available data about COVID-19. These visualizations will ideally help people understand the global effect of COVID-19 and the exponential pace at which cases are developing across the world and in the United States.

Data Sources

As stated in my previous blogs, the data used in this analysis is all publicly available. The COVID-19 global daily data has been provided by the European Centre for Disease Prevention and Control. This data source is updated daily throughout the crisis and can be used to update this exercise regularly going forward. The US State level COVID-19 data has been made publicly available by the New York Times in a public GitHub Repository. In addition to the COVID-19 data, global and US state population data was used to provide per capita metrics. The global data is from The World Bank, while the US State level population data is from The United States Census Bureau.

Python Code Access

If you are interested in seeing the code used to generate these visualizations, the python code and Jupyter Notebook can be found on GitHub.

Results

To begin, previous blogs can be found here:

As a reminder, the five metrics I will be viewing at both a country level and US state level are the following:

  • Number of 2020 Cumulative Cases
  • Number of 2020 Cumulative Deaths
  • 2020 Cases per Capita
  • 2020 Deaths per Capita
  • 2020 Death Rate

In this blog post, the global results are as of 4/17/20, while the US state level results are as of 4/15/20.

Global Results – 4/17/20

US State Level Results – 4/15/20

Conclusions

As you can see by looking at the various metrics, certain countries are handling the virus better than others. The United States has the most cases, and relative to its overall population, its case count is about as high as those of some European countries. The European countries are also struggling the most in terms of deaths per capita. Death rates seem to have evened out across the globe as the virus spreads, and there are fewer outliers. European and African countries seem to have the highest death rates in general, with many hovering around a 15% death rate. France currently has an astonishing 16.5% death rate.

In the United States, certain states are facing worse COVID-19 circumstances than others. The New York area has been hit the hardest, with both New York and New Jersey having a very high number of cases and deaths. In addition to the Northeast region, states like Louisiana and Michigan have a high number of deaths per capita. Death rates seem to be fairly evenly spread throughout the states, with Michigan being the highest at 6.9%.

UPDATE: Visualizing the COVID-19 Crisis Across the World and in the United States (4/10/20)

Introduction

I am continuing a series of blog posts concerning the COVID-19 crisis that contain world map and US state map visualizations of metrics I have found useful in analyzing the situation. COVID-19 is affecting countries all over the world, and in many places the number of cases is growing exponentially every day. This blog post, with its associated Jupyter Notebook, will look at different measures of how bad the outbreak is across the world and in the United States. Each metric will be displayed in a global or US choropleth map. Additionally, this exercise sets up repeatable code to use as the crisis continues and more daily data is collected.

Disclaimer

The point of this blog is not to try to develop a model or anything of the sort to detect COVID-19, as a poorly created model could cause more harm than good. This blog is simply to generate visualizations based on publicly available data about COVID-19. These visualizations will ideally help people understand the global effect of COVID-19 and the exponential pace at which cases are developing across the world and in the United States.

Data Sources

As stated in my previous blogs, the data used in this analysis is all publicly available. The COVID-19 global daily data has been provided by the European Centre for Disease Prevention and Control. This data source is updated daily throughout the crisis and can be used to update this exercise regularly going forward. The US State level COVID-19 data has been made publicly available by the New York Times in a public GitHub Repository. In addition to the COVID-19 data, global and US state population data was used to provide per capita metrics. The global data is from The World Bank, while the US State level population data is from The United States Census Bureau.

Python Code Access

If you are interested in seeing the code used to generate these visualizations, the python code and Jupyter Notebook can be found on GitHub.

Results

To begin, global results as of 3/20/20 can be found in a previous blog.

Global results as of 3/27/20 and US results as of 3/25/20 can be found in this previous blog.

As a reminder, the five metrics I will be viewing at both a country level and US state level are the following:

  • Number of 2020 Cumulative Cases
  • Number of 2020 Cumulative Deaths
  • 2020 Cases per Capita
  • 2020 Deaths per Capita
  • 2020 Death Rate

In this blog post, the global results are as of 4/10/20, while the US state level results are as of 4/9/20.

Global Results – 4/10/20

US State Level Results – 4/9/20

Conclusions

As you can see by looking at the various metrics, certain countries are handling the virus better than others. The United States now has the most cases, but relative to its overall population, its case count is not as high as those of some European countries. European countries like Iceland, Spain, and Italy have a high number of cases per capita. These European countries are also struggling the most in terms of deaths per capita. Death rates seem to have evened out across the globe as the virus spreads, and there are fewer outliers. European and African countries seem to have the highest death rates in general, with many hovering around a 15% death rate.

In the United States, certain states are facing worse COVID-19 circumstances than others. The New York area has been hit the hardest, with both New York and New Jersey having a very high number of cases and deaths. In addition to the Northeast region, states like Louisiana and Michigan have a high number of deaths per capita. Death rates seem to be fairly evenly spread throughout the states.

Learning R from Python

Introduction

There is always a big debate about which language, R or Python, is best for statistical data analysis and machine learning. Both languages have pros and cons, so why not understand both? I have a strong Python background, but figured I should learn R as well. R has ggplot2, an amazing visualization library that seems to outdo what matplotlib and Seaborn offer in Python. Unfortunately, there seems to be a lot of information about switching from R to Python, but not much the other way around. So I decided to learn R by rebuilding a Python project I had already completed.

For reference, I built a spam classifier model in Python and documented the process in a previous blog. I wanted to rebuild this spam classifier model, in a very simplified form, in R. Doing this has helped me learn many useful skills in R that I will show in this blog.

Getting Started

To begin, I downloaded Python through Anaconda. I typically code with Jupyter notebooks, which come with Anaconda. I’d like to set up R in this environment as well.

First things first, let’s download R. Then, in order to install it properly, follow the PC or Mac steps found in this helpful blog by Rich Pauloo.

Coding with R

Installing Libraries

If you need to install any of the libraries I’ll use, then you can do this with the following line of code, just change the library name:

install.packages("caret")

Importing Libraries

I’ll be using the following libraries:

library(wordcloud)
library(RColorBrewer)
library(tm)
library(magrittr)
library(caret)
library(e1071)
library(SnowballC)

Reading in the data

I had the spam data in a CSV file from my Python project, but the data originally comes from the UCI Machine Learning Repository.

I read in the data with the following code:

df <- read.csv("smsspamcollection/spamham.csv")

Data Visualizations through Wordcloud Library

Now in order to visualize the text data, I separated the data into spam vs ham (i.e. not spam) and then created a word cloud for each group. This was accomplished with the following code:

# Split dataframe by ham and spam
spam_split <- split(df, df$label)
spam <- spam_split$spam
ham <- spam_split$ham

# Create a vector of just spam text data
spam_text <- spam$text

# Create a spam corpus  
spam_docs <- Corpus(VectorSource(spam_text))

# Clean spam text with tm library
spam_docs <- spam_docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
spam_docs <- tm_map(spam_docs, content_transformer(tolower))
spam_docs <- tm_map(spam_docs, removeWords, stopwords("english"))

# Create document spam term matrix
spam_dtm <- TermDocumentMatrix(spam_docs) 
spam_matrix <- as.matrix(spam_dtm)
spam_words <- sort(rowSums(spam_matrix),decreasing=TRUE)
spam_df <- data.frame(word = names(spam_words),freq = spam_words)

# Create a vector of just ham text data
ham_text <- ham$text

# Create a ham corpus  
ham_docs <- Corpus(VectorSource(ham_text))

# Clean ham text with tm library
ham_docs <- ham_docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
ham_docs <- tm_map(ham_docs, content_transformer(tolower))
ham_docs <- tm_map(ham_docs, removeWords, stopwords("english"))

# Create document ham term matrix
ham_dtm <- TermDocumentMatrix(ham_docs) 
ham_matrix <- as.matrix(ham_dtm)
ham_words <- sort(rowSums(ham_matrix),decreasing=TRUE)
ham_df <- data.frame(word = names(ham_words),freq = ham_words)

# Create spam wordcloud
wordcloud(words = spam_df$word, freq = spam_df$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

# Create ham wordcloud
wordcloud(words = ham_df$word, freq = ham_df$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

This code should give you two word clouds, spam and ham respectively, that look like the following:

Model Preprocessing

Now I need to do some preprocessing for the modeling. This includes creating a corpus and document term matrix, which is a sparse matrix containing all the words and the frequency with which they appear in each message. The preprocessing code also removes stop words (i.e. the, it, etc.), makes all words lowercase, finds the stem of the words, and removes punctuation and numbers.

df_corpus <- VCorpus(VectorSource(df$text))

df_dtm <- DocumentTermMatrix(df_corpus, control = 
                                 list(tolower = TRUE,
                                      removeNumbers = TRUE,
                                      stopwords = TRUE,
                                      removePunctuation = TRUE,
                                      stemming = TRUE))

I also need to split the data into training and test sets. This step is admittedly much less straightforward than in Python, but I was able to get it done.

# Calculate the index for an 80/20 train/test split (5,572 messages in total)
index <- floor(5572 * .8)

#Training & Test set
train <- df_dtm[1:index, ]
test <- df_dtm[(index + 1):5572, ]

#Training & Test Labels
train_labels <- df[1:index, ]$label
test_labels <- df[(index + 1):5572, ]$label

Finally, the last step of preprocessing is converting the data into categorical data, which the Naive Bayes model I am using in R requires. This step is also good practice in creating and implementing a function!

# Convert to categorical for naive bayes model
convert_values <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}

train <- apply(train, MARGIN = 2, convert_values)
test <- apply(test, MARGIN = 2, convert_values)

Modeling

Now we can model and evaluate! The following code fits the Naive Bayes model, makes predictions on the test set, and then computes a confusion matrix and the accuracy of the model on the test data.

#Create model from the training dataset
spam_classifier <- naiveBayes(train, train_labels)

#Make predictions on test set
y_hat_test <- predict(spam_classifier, test)

#Create confusion matrix
confusionMatrix(data = y_hat_test, reference = test_labels,
                positive = "spam", dnn = c("Prediction", "Actual"))

You can see the results here:

Conclusions

I was able to get a very simplified working model of my spam classifier! Although this isn’t as pretty as the model generated in Python from my previous blog, it still works and is a great success for learning R. If you want to learn R coming from Python, I encourage you to practice by recreating a Python project you have already completed. This way you know what you want as data inputs and model outputs, and all you have to figure out is the R. Googling helps!

UPDATE: Visualizing the COVID-19 Crisis Across the World and in the United States

Introduction

I wrote a blog last week concerning the COVID-19 crisis that contained some world map visualizations of metrics I find useful in analyzing the situation. This week I am updating my study to reflect this week’s changes, as well as adding visualizations to look at the data at the US state level. COVID-19 is affecting countries all over the world, and in many places the number of cases is growing exponentially every day. This blog post, with its associated Jupyter Notebook, will look at different measures of how bad the outbreak is across the world and in the United States. Each metric will be displayed in a global or US choropleth map. Additionally, this exercise sets up repeatable code to use as the crisis continues and more daily data is collected.

Disclaimer

The point of this blog is not to try to develop a model or anything of the sort to detect COVID-19, as a poorly created model could cause more harm than good. This blog is simply to generate visualizations based on publicly available data about COVID-19. These visualizations will ideally help people understand the global effect of COVID-19 and the exponential pace at which cases are developing across the world and in the United States.

Data Sources

Again, the data used in this analysis is all publicly available. The COVID-19 global daily data has been provided by the European Centre for Disease Prevention and Control. This data source is updated daily throughout the crisis and can be used to update this exercise regularly going forward. The US State level COVID-19 data has been made publicly available by the New York Times in a public GitHub Repository. In addition to the COVID-19 data, global and US state population data was used to provide per capita metrics. The global data is from The World Bank, while the US State level population data is from The United States Census Bureau.

Python Code Access

If you are interested in seeing the code used to generate these visualizations, the python code and Jupyter Notebook can be found on GitHub.

Results

To begin, global results as of 3/20/20 can be found in a previous blog.

As a reminder, the five metrics I will be viewing at both a country level and US state level are the following:

  • Number of 2020 Cumulative Cases
  • Number of 2020 Cumulative Deaths
  • 2020 Cases per Capita
  • 2020 Deaths per Capita
  • 2020 Death Rate

In this blog post, the global results are as of 3/27/20, while the US state level results are as of 3/25/20.

Global Results – 3/27/20

US State Level Results – 3/25/20

Conclusions

As you can see by looking at the various metrics, certain countries are handling the virus better than others. China and the United States have many cases, but in comparison to their overall populations, their case counts are not that high. European countries like Iceland, Spain, and Italy have a high number of cases per capita. Unfortunately, when looking at the death rates, places with fewer medical resources seem to have higher death rates, such as Sudan, Zimbabwe, or Guyana, though these rates should be taken with caution given the very low number of cases in those countries so far. European countries, on the other hand, also have high death rates along with high numbers of cases.

In the United States, certain states are facing worse COVID-19 circumstances than others. New York, Washington, and California have a lot of cases. States like Louisiana, Vermont, Washington, and New York have a lot of deaths per capita. Death rates seem to be fairly evenly spread throughout the states.

Visualizing the COVID-19 Crisis Across the World

Introduction

The COVID-19 crisis is affecting countries all over the world. This blog post with the associated Jupyter Notebook will look at different measures of how bad the outbreak is across the world. Each metric will be displayed in a global choropleth map. Additionally, this exercise sets up repeatable code to use as the crisis continues and more daily data is collected.

Data Sources

The data used in this analysis is all open source. The COVID-19 daily data has been provided by the European Centre for Disease Prevention and Control. This data source is updated daily throughout the crisis and can be used to update this exercise regularly. In addition to the COVID-19 data, global population data was used to provide per capita metrics. This data is from The World Bank.

Python Code Access

The python code and Jupyter Notebook used to generate these results can be found here.

Results

The main goal of this exercise was to create visualizations showing metrics for different countries across the globe. Therefore, each of the five metrics is shown as a global choropleth map. The five metrics displayed are:

  • Number of 2020 Cumulative Cases
  • Number of 2020 Cumulative Deaths
  • 2020 Cases per Capita
  • 2020 Deaths per Capita
  • 2020 Death Rate

The maps shown here represent cases through 3/20/20, although the code can be used to generate results for any date in 2020 prior to 3/20/20.

Conclusion

As you can see by looking at the various metrics, certain countries are handling the virus better than others. China has many cases, but in comparison to its overall population, the number of cases is not that high. Countries like Iceland and Italy have a high number of cases per capita. Unfortunately, when looking at the death rates, places with fewer resources seem to have higher rates, such as Sudan or Guyana.

Creating a Spam Classifier Model with NLP and Naive Bayes

Introduction

Have you ever had an email you needed end up in your spam folder? Or had too much spam get into your inbox?

This is a problem that almost everyone faces. Based on this simple spam classifier model example, you’ll be able to see why this problem exists. Most spam classifiers simply take into account which words appear in the email and how many times they appear. Spam creators have gotten clever and add hidden words to trick a classifier.

To better understand a simple classifier model, I’ll show you how to make one using Natural Language Processing (NLP) and a Multinomial Naive Bayes classification model in Python.

Loading Data

I got my dataset from the UCI Machine Learning Repository. This dataset includes messages that are labeled as spam or ham (not spam).

To begin, start by importing some necessary packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

and load in your data:

df = pd.read_csv('smsspamcollection/SMSSpamCollection.txt', sep = '\t', header = None)
df.columns = ['label', 'text']

and preview your DataFrame:

df.head()

You should see the following table:

Data Visualization

Let’s start by looking at our data in Word Clouds based on spam or not spam (ham). First import more useful packages:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from wordcloud import WordCloud
import string

Now create the spam word cloud:

spamwords = ' '.join(list(df[df.label == 'spam']['text']))
spam_wc = WordCloud(width = 800, height = 512, max_words = 100, random_state = 14).generate(spamwords)
plt.figure(figsize = (10, 6), facecolor = 'white')
plt.imshow(spam_wc)
plt.axis('off')
plt.title('Spam Wordcloud', fontsize = 20)
plt.tight_layout()
plt.show()

and now create the ham word cloud:

hamwords = ' '.join(list(df[df.label == 'ham']['text']))
ham_wc = WordCloud(width = 800, height = 512, max_words = 100, random_state = 14).generate(hamwords)
plt.figure(figsize = (10, 6), facecolor = 'white')
plt.imshow(ham_wc)
plt.axis('off')
plt.title('Ham Wordcloud', fontsize = 20)
plt.tight_layout()
plt.show()

You should get the following two word clouds if you use the same random_state:

Text Preprocessing

Now we’ll have to create a text preprocessing function that we will use later on in our CountVectorizer function. This function will standardize words (lowercase, remove punctuation), generate word tokens, remove stop words (words that have no descriptive meaning), create bigrams (combinations of two words, e.g. "not good"), and find the stem of each word.

def message_processor(message, bigrams = True):
    
    # Make all words lowercase
    message = message.lower()
    
    # Remove punctuation
    punc = set(string.punctuation)
    message = ''.join(ch for ch in message if ch not in punc)
    
    # Generate word tokens
    message_words = word_tokenize(message)
    message_words = [word for word in message_words if len(word) >= 3]
    
    # Remove stopwords
    message_words = [word for word in message_words if word not in stopwords.words('english')]
    
    # Create bigrams
    # Add grams to word list
    if bigrams == True:
        gram_words = []
        for i in range(len(message_words) + 1):
            gram_words += [' '.join(message_words[i:(i + 2)])]
    
    # Stem words
    stemmer = PorterStemmer()
    message_words = [stemmer.stem(word) for word in message_words if (len(word.split(' ')) == 1)]
    
    # Add grams back to list
    if bigrams == True:
        message_words += gram_words
    
    return message_words[:-1]

Now use CountVectorizer to create a sparse matrix of every word that is in the dataset after applying the text processing function created above:

from sklearn.feature_extraction.text import CountVectorizer
X_vectorized = CountVectorizer(analyzer = message_processor).fit_transform(df.text)

Train Test Split

Now the data needs to be split into train and test sets for fitting and evaluating the model. I’ve chosen to set aside 20% of the data for testing and have used a random_state for reproducibility.

X_train, X_test, y_train, y_test = train_test_split(X_vectorized, df.label, test_size = .20, random_state = 72)

Fitting the Naive Bayes Model

Now it’s time to fit the spam classifier model. In this case I will be using a Multinomial Naive Bayes model. The Naive Bayes model here looks at the probability of a message being spam given the words that show up in the message. Looking back at the generated word clouds, a message with the word "FREE" will have a high probability of being spam.
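
As a rough sketch of the idea (this is the standard Multinomial Naive Bayes formulation, not code from the notebook), the model scores a message by combining the spam prior with the likelihood of each word appearing in spam:

P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})

The class with the higher score wins, which is why a word like "FREE", far more common in spam messages, pushes the prediction toward spam.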

from sklearn.naive_bayes import MultinomialNB
MNB_Classifier = MultinomialNB()
model = MNB_Classifier.fit(X_train, y_train)
y_hat_test = MNB_Classifier.predict(X_test)

Evaluating the Model

Finally, we can evaluate the model by looking at the classification report, accuracy, and a confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
print(classification_report(y_test, y_hat_test))
print('Accuracy: ', accuracy_score(y_test, y_hat_test))
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(y_test, y_hat_test, figsize = (9,6))
plt.ylim([1.5, -.5])
plt.title('Confusion Matrix for Multinomial Naive Bayes Spam Classifier', fontsize = 15)
plt.tight_layout()
plt.show()

Conclusion

We can see here that a Naive Bayes model works very well as a spam classifier. This is a very simple spam classifier, yet it still achieves high metrics. However, the model is vulnerable to spammers who think a little more creatively. If a spammer were to include a lot of words (maybe even just hidden in the background) that typically appear in non-spam messages, it could trick the model.

TensorFlow or Keras: Which is better for Neural Network Models in Python?

Introduction

As I finished up my Data Science program at the Flatiron School, I wanted to create a convolutional neural network for an image classification problem I was attempting for my capstone project. However, I wasn’t sure at first if I should use TensorFlow or Keras to build my neural network model. This led me to write this blog to give a general overview of the differences between TensorFlow and Keras. Hopefully this will help others avoid the same dilemma and research I went through when starting to craft my model.

TensorFlow and Keras

Both TensorFlow and Keras are frameworks to use in Python Data Science programming. Specifically, both can be applied to Deep Learning problems. Let’s look at TensorFlow first.

TensorFlow

TensorFlow is an open source project designed to help in Machine Learning. It provides a "toolbox" of resources to help craft workflows using high level APIs. You can use different levels of these APIs to accomplish different Machine Learning tasks. TensorFlow in general creates a framework that allows for easy model building and training, as well as model deployment for all of the machine learning models generated. My personal take, however, is that TensorFlow’s model building code structure is more customizable, but more labor intensive and confusing compared to Keras.

Keras

Keras runs on top of TensorFlow and is, in fact, a Deep Learning wrapper designed specifically for TensorFlow. It allows for quick, easy model architecture and building. Keras is also set up to run models seamlessly on both CPUs and GPUs. If you use Python, Keras has another advantage, as it was built in Python, which makes for easier debugging. Like TensorFlow, Keras creates a framework that allows for easy model building. The code in Keras is also very consistent across different types of neural networks, which is a big advantage. Since Keras’s model building code acts like a series of building blocks, models can be improved and extended with ease. One personal take is that Keras also provides much clearer error messages than TensorFlow, so it is very easy to debug your code when you run into errors.
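
To make the "building blocks" point concrete, here is a minimal sketch of a small convolutional network in Keras; the layer sizes and the 28x28 grayscale input shape are made up for illustration and are not from my capstone model:

from tensorflow import keras
from tensorflow.keras import layers

# Each layer is a building block stacked in order
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Compile with an optimizer, loss, and metric, then inspect the architecture
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()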

Recommendation

If you are someone who has very strong coding skills and wants to build an extremely customized Deep Learning model, then TensorFlow is probably the right framework to use. In all other cases, I would recommend the Keras framework. It provides easier coding using building blocks of code, easier debugging, and quick run times when using GPUs.

Conclusion

Both Keras and TensorFlow are awesome frameworks to get to know for your Machine Learning needs. Remember that Keras is a framework built as a wrapper on top of TensorFlow, specifically for Deep Learning needs. TensorFlow has much more overall Machine Learning capabilities than Keras, but when it comes to building neural networks, Keras is a great framework to take advantage of.

Predicting Drug Use with Logistic Regression

Introduction

This project began as a search for classification problems to tackle. I wanted to put my classification algorithm skills to the test, but wasn’t sure what data to use for this. Luckily, after digging around on the internet for a while, I came across the UC Irvine Machine Learning Data Repository. This repository has loads of datasets and even has a feature, "Default Task", that you can toggle in order to find the common machine learning task that would be applied to a given dataset. When looking through the Classification datasets I found a dataset regarding drug use that looked interesting. The dataset was originally collected by Elaine Fehrman, Vincent Egan, and Evgeny M. Mirkes in 2015.

This dataset provides 1885 data rows, each of which represents a person. Each person has a set of demographic and personality traits alongside a set of drug use responses. Data features included things like age, ethnicity, education level, country, extraversion score, openness to new experiences score, etc… The dataset looks at 17 common drugs (including chocolate??) and how recently someone has used each of the drugs. I decided to make drug use a binary outcome by grouping no usage ever and usage over a decade ago into "non user" and all other outcomes into "user", meaning that the person has used the drug within the last decade.

At this point I decided I would use Logistic Regression to predict whether or not a person was a user of each of the 17 drugs given in the dataset. This model information could help to determine which types of people are more susceptible to certain types of drug use given their personality type and demographic traits. This information could then be used to give treatment and assistance to those who require it.

ETL

After importing the necessary libraries and loading the dataset into a Pandas DataFrame, the raw data looks like this:

You can see that the data comes in a very weird format, but luckily there are descriptions on the data webpage. Using these descriptions, I transformed the data into a more coherent DataFrame.

Dummy Variables and Checks for Multicollinearity

After getting the data into a useful format, I needed to deal with my categorical variables. I created dummy variables for all of them and made sure to drop one of the resulting columns for each. I also needed to check whether any of my numeric features had multicollinearity. As you can see in the seaborn pairplot below, the only two metrics that appear to have any multicollinearity are neuroticism and extraversion, but the correlation level is below .75, so I kept both scores in the DataFrame.
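
The dummy coding and pairplot code isn't reproduced in this text, so here is a sketch of the idea, assuming pandas, seaborn, and matplotlib are imported and that categorical_cols and numeric_cols are hypothetical lists of the categorical and numeric column names:

# Dummy-code the categorical features, dropping one level of each to avoid redundancy
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Visual multicollinearity check on the numeric personality scores
sns.pairplot(df[numeric_cols])
plt.show()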

DataFrame Splitting

After a little more data manipulation (changing the column order), I needed to create separate DataFrames for each drug. I decided to store each of these in a dictionary, with each key being a name for the drug DataFrame and the value being the corresponding DataFrame. This was accomplished with the chunk of code below:
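
The original code chunk isn't reproduced in this text, so here is a sketch of the idea; the 34-column feature split comes from the note below, while the drug_dfs dictionary name and key naming scheme are my own assumptions:

# Features live in the first 34 columns; the remaining columns are the binary drug targets
feature_df = df.iloc[:, :34]
drug_columns = df.columns[34:]

# One DataFrame per drug, keyed by a generated name
drug_dfs = {}
for drug in drug_columns:
    drug_dfs[drug + '_df'] = pd.concat([feature_df, df[drug]], axis=1)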

Note: The above portion of the code with df.iloc[:, :34] isolates the portion of the DataFrame corresponding to the features. After column 34 are the binary drug use columns.

Helper Functions for Model Fitting

First things first, I need to import the proper libraries and functions for all the tools I’ll need for my Logistic Regression process.
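
The original import cell isn't reproduced here, but based on the tools described in the rest of this section it likely looked something like the following (the exact list is an assumption):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scikitplot as skplt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE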

A quick note on the scikitplot library. This library has many awesome functions for calculating metrics and plotting them in one step. You’ll see a use of this for confusion matrices later on.

In order to loop through my new dictionary and fit a Logistic Regression model to each DataFrame, I needed to define a few helper functions. The first function will split the DataFrame into the features and the target, X and y respectively.
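
A sketch of that first helper, assuming the drug's binary label sits in the last column of each DataFrame:

def split_X_y(drug_df):
    # Everything except the last column is a feature; the last column is the drug label
    X = drug_df.iloc[:, :-1]
    y = drug_df.iloc[:, -1]
    return X, y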

The next helper function will perform standard scaling on my numeric features after I have performed a train/test split on my data.
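
A sketch of the scaling helper; numeric_cols is a hypothetical list of the numeric feature names, and the scaler is fit on the training split only so no test information leaks in:

def scale_features(X_train, X_test, numeric_cols):
    # Fit the scaler on the training data, then apply the same transform to the test data
    scaler = StandardScaler()
    X_train, X_test = X_train.copy(), X_test.copy()
    X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
    X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
    return X_train, X_test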

An additional helper function I wrote performs SMOTE on the datasets. Many of the drug user datasets are very unbalanced, so in order to train my model on a balanced dataset, I need to use SMOTE to synthetically create training data points for my minority class. This function is created below.
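
A sketch of the SMOTE helper using the imblearn library; the random_state is an arbitrary choice for reproducibility:

def smote_balance(X_train, y_train):
    # Synthetically oversample the minority class in the training data only
    smote = SMOTE(random_state=42)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
    return X_train_balanced, y_train_balanced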

Another useful step in Logistic Regression is using a grid search for the best performing hyperparameters C and Penalty type. These control regularization of the models and this function will output the optimized hyperparameters for each drug use model.
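
A sketch of the grid search helper; the candidate values of C, the scoring metric, and the fold count are assumptions, and the liblinear solver is used because it supports both l1 and l2 penalties:

def grid_search_hyperparams(X_train, y_train):
    # Search over regularization strength C and penalty type
    param_grid = {'C': [0.01, 0.1, 1, 10, 100],
                  'penalty': ['l1', 'l2']}
    grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                        param_grid, scoring='f1_macro', cv=5)
    grid.fit(X_train, y_train)
    return grid.best_params_['C'], grid.best_params_['penalty']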

Finally, one last helper function. This is the function that actually fits the Logistic Regression model and creates output metrics and visualizations. This function takes in the train/test split data and the optimized hyperparameters C and Penalty type. When applied, it creates visualizations for the confusion matrix and ROC Curve. It also outputs the test sample’s class balance, which helps to show how the test data is distributed.
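
A sketch of that final helper; the exact metrics printed are assumptions, and the scikitplot functions take care of the confusion matrix and ROC Curve plots:

def fit_and_evaluate(X_train, X_test, y_train, y_test, C, penalty):
    # Fit the tuned Logistic Regression model
    model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
    model.fit(X_train, y_train)
    y_hat_test = model.predict(X_test)

    # Test sample class balance and a summary metric
    print(y_test.value_counts(normalize=True))
    print('F1 Score:', f1_score(y_test, y_hat_test, average='macro'))

    # Confusion matrix and ROC Curve visualizations
    skplt.metrics.plot_confusion_matrix(y_test, y_hat_test)
    skplt.metrics.plot_roc(y_test, model.predict_proba(X_test))
    plt.show()
    return model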

Fitting the Model

Finally, it is time to fit all of the models and view the results. I wrote one last function which combines all of the above functions into one simple function that can be applied to a single DataFrame.
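
A sketch of how the helpers chain together into one function per drug; the 80/20 split, the random_state, and the numeric_cols argument are assumptions:

def run_drug_model(drug_df, numeric_cols):
    # Split, hold out a test set, scale, balance, tune, then fit and evaluate
    X, y = split_X_y(drug_df)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_test = scale_features(X_train, X_test, numeric_cols)
    X_train, y_train = smote_balance(X_train, y_train)
    C, penalty = grid_search_hyperparams(X_train, y_train)
    return fit_and_evaluate(X_train, X_test, y_train, y_test, C, penalty)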

This function is then applied to each DataFrame in my previously created dictionary with a for loop to generate all of the necessary model results and visualizations.
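
Applied to the dictionary from the sketch above, the loop itself is only a couple of lines:

# Fit and evaluate one Logistic Regression model per drug DataFrame
for name, drug_df in drug_dfs.items():
    print(name)
    run_drug_model(drug_df, numeric_cols)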

Results

As there were 17 different drugs in question, I will only highlight a few of the results below. The Logistic Regression model performed well in most cases, although those with extremely unbalanced datasets performed slightly worse despite the use of SMOTE to balance the training sets.

Alcohol

Alcohol had an extremely unbalanced dataset. You can see that 96% of the test group had used alcohol. The alcohol model did produce a large number of false negatives; however, the F1 Score and AUC metrics signaled that this was a decent model.

Cannabis

Cannabis had a more balanced dataset than alcohol, with 67% of the test group having used the drug. The Logistic Regression for Cannabis use was able to generate both a high F1 Score and high AUC.

Nicotine

Nicotine also had a more balanced dataset than alcohol, with 67% of the test group having used Nicotine. Similarly to Cannabis, the Logistic Regression model for Nicotine was able to produce both a high F1 Score and a high AUC.

Meth

Finally, I’ll look at a drug that has an unbalanced dataset in the other direction. Obviously not a ton of people use Meth, although more than I imagined, with 24% of the test group having used Meth before. The Logistic Regression for Meth use produced a fair number of false positives. Additionally, the metrics, including the F1 Score, are fairly low. The ROC Curve and AUC show decent results, but in the case of an unbalanced dataset, I would trust the F1 Score over AUC.

Conclusions

I was able to fit a Logistic Regression model to each of the drug use DataFrames; however, for some of the drugs, Logistic Regression may not be the best choice of classifier model. Since many of the datasets are unbalanced, this caused some issues with the metrics.

Overall, the results were very good and could help to inform us about an individual’s drug use given information about their demographic traits and personality traits.

Just for Curiosity

Here is a bar chart of drug use percentage for all 1885 people in the dataset.