Rayan A. Althonian

Metis Data Science Bootcamp - Project 3

2019-10-12T00:00:00+00:00

In the fifth week of the bootcamp, we were given the instructions for the third project. In this project, students were asked to create a classification model that predicts labels from a set of features. For instance, given data on different crimes, one can train a model to classify the type of crime based on, for example, the neighborhood name, time and date. This post illustrates my approach to tackling this problem.

Introduction

One of the most important pre-processing steps in some image processing applications, like face detection and gesture recognition, is skin segmentation. Skin segmentation is challenging for several reasons, such as changing light intensity. My goal in this project is to take the RGB values of individual pixels and classify each pixel as a skin pixel or a non-skin pixel.

Methodology

The approach that I followed is summarized in the following figure:

These steps will be discussed in detail in the following subsections.

Step 1: Data Collecting

The dataset, the Skin Segmentation dataset, was taken from the UCI Machine Learning Repository. This dataset includes the RGB values of pixels that were taken from face pictures of people of different ages, ethnic groups and genders. Each pixel is labeled as either a skin pixel or a non-skin pixel.

Step 2: Data Cleaning

In this step, null values, NaN values and outliers were handled.
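As a rough illustration of this step, the sketch below drops rows with missing values and rows whose channel values fall outside the valid 0-255 range. The row layout (R, G, B, label), the helper name clean_pixels and the choice to treat out-of-range values as the outliers are all assumptions made for this example, not the exact procedure used in the project.

```python
import math

def clean_pixels(rows):
    """Keep only rows whose R, G, B values are present and within 0-255.

    Each row is assumed to be a tuple (r, g, b, label); this layout is
    hypothetical and chosen only for illustration.
    """
    cleaned = []
    for r, g, b, label in rows:
        channels = (r, g, b)
        # Skip rows with missing values (None or NaN).
        if any(c is None or (isinstance(c, float) and math.isnan(c)) for c in channels):
            continue
        # Skip rows with values outside the valid 8-bit range.
        if any(c < 0 or c > 255 for c in channels):
            continue
        cleaned.append((r, g, b, label))
    return cleaned

# Made-up rows, not taken from the real dataset.
raw = [
    (200, 150, 120, "skin"),
    (None, 80, 90, "non-skin"),        # missing value
    (float("nan"), 10, 20, "skin"),    # NaN value
    (300, 40, 50, "non-skin"),         # out-of-range value
    (30, 60, 90, "non-skin"),
]
print(clean_pixels(raw))
```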

Step 3: Exploratory Data Analysis (EDA)

The pair plot for the different features is shown below.

Clearly, there is an overlap between the features, but the overlap is only partial, which means we can continue with the same features without worrying about removing any of them. Also, note that since we have just three features, the R, G and B values, we cannot remove any of them even if the features overlapped completely. Additionally, the scatter plots show that a nonlinear model must be used.

Now, let’s have a look at the histograms of the skin pixels and non-skin pixels.

Skin Pixels histograms:

Non-Skin Pixels histograms:

One interesting thing to notice in the histogram of the red color is that any pixel with a red value below 100 can be classified as a non-skin pixel. This suggests, on initial thought, that red will be the most important feature in the classification model. The following figure illustrates this observation better than the histogram does.
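The red-channel observation can be written down as a one-line rule. The sketch below is only an illustration of that observation: the function name and the "undecided" outcome are assumptions for this example, and the rule is not one of the models trained later in the post.

```python
def red_threshold_rule(r, g, b, threshold=100):
    """Return 'non-skin' when the red value is below the threshold.

    The rule can only rule pixels out: a red value at or above the
    threshold is left as 'undecided', because the histograms alone do
    not settle that case.
    """
    return "non-skin" if r < threshold else "undecided"

# Made-up pixel values.
print(red_threshold_rule(60, 120, 200))
print(red_threshold_rule(210, 160, 140))
```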

Another interesting observation is that the number of skin pixels and non-skin pixels is different, so we have two options:

Oversampling
Undersampling

The dataset is not large, so oversampling is a better option. The oversampling is done using SMOTE.
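SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that core idea in plain Python. It is not the library implementation that was used in the project, and the function name smote_like, the value of k, the seed and the toy pixel values are all assumptions for illustration.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy data: 6 majority pixels and 3 minority pixels (made-up RGB values).
majority = [(30, 60, 90), (40, 70, 100), (20, 50, 80),
            (35, 65, 95), (25, 55, 85), (45, 75, 105)]
minority = [(200, 150, 120), (210, 160, 130), (190, 140, 110)]

new_points = smote_like(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already covered by the minority class.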

Step 4: Modeling

Before creating any model, the data were split into training and test sets. Three things must be noted before discussing the baseline model and the rest of the models. First, the training set is over-sampled before the training score is computed, and this is done for every model. Second, during cross validation, only the 60% training portion is over-sampled, and the model is validated on the remaining 20% in each fold. Last, the best hyper-parameters were chosen using cross validation applied through a pipeline. The following figure illustrates how typical cross validation is done.
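The important detail in the second point is that over-sampling touches only the training part of each fold, so the validation part is scored on untouched data. The sketch below illustrates this with four folds, which matches a 60%/20% split of the non-test data. To keep it short it uses random duplication rather than SMOTE, and the helper names, labels and seed are hypothetical.

```python
import random

def kfold_indices(n, k):
    """Split range(n) into k contiguous validation folds."""
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = n if i == k - 1 else start + fold_size
        val = list(range(start, stop))
        train = [j for j in range(n) if j < start or j >= stop]
        yield train, val

def oversample(indices, labels, rng):
    """Duplicate minority-class indices at random until both classes match."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    result = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        result.extend(members + extra)
    return result

labels = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # 1 = skin (minority), toy data
rng = random.Random(0)
for train, val in kfold_indices(len(labels), k=4):
    train_balanced = oversample(train, labels, rng)
    # The validation fold is never resampled, so no duplicated or synthetic
    # sample leaks into the data used for scoring.
    print(len(train_balanced), sorted(val))
```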

The Baseline Model: Logistic Regression Model

Logistic regression was chosen as the baseline model because it typically performs well in binary classification problems. The parameters for the model were as follows:

solver='saga'
penalty='elasticnet'
l1_ratio=0.1
C=100
max_iter=1000

Training F1-score	Cross-Validation F1-Score
0.94	0.939
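For reference, the F1-score reported in these tables is the harmonic mean of precision and recall. The sketch below computes it from scratch on made-up labels. It is not the scoring code used in the project, and the label vectors are invented for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 1 = skin, 0 = non-skin.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(round(f1_score(y_true, y_pred), 3))
```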

The last thing to note about this model is the coefficients. A simple bar chart of the coefficients is shown below:

Well, this is strange. We expected red to have the highest impact on the model, but it did not.

The Second Model: Decision Trees Model

The performance of logistic regression was not bad, but we can always do better. The decision tree model was tuned using the following parameters, which were obtained by the pipeline:

criterion='entropy'
max_depth=7

Training F1-score	Cross-Validation F1-Score	AUC
1	0.9958	0.9972

The last thing to note about this model is the feature importances. The following list shows them in descending order (most important first):

Red (score = 0.65)
Green (score = 0.177)
Blue (score = 0.175)

Now, these results confirm our initial assessment of the data, since red is the most important feature.

The Final Model: Random Forest

The decision tree model provided excellent results, but can we do even better? Yes! Random forest is the best performing model, and its training and cross validation results are shown below:

Training F1-score	Cross-Validation F1-Score	AUC
1	0.9996	0.9999

The last thing to note about this model is the feature importances. A simple bar chart of the feature importances is shown below:

Now, these results confirm our initial assessment of the data, as the previous model did. However, this model gives green a higher score than the previous model did. The testing results of this model will be illustrated in the following section.

Step 5: Final Model Testing

Final Thoughts

Metis Data Science Bootcamp - Project 2

2019-09-22T00:00:00+00:00

At the beginning of the second week of the bootcamp, we were given a project, due on the last day of the third week, about scraping the web and then creating a linear regression model that predicts a target from features that might be relevant to it. For instance, you can predict the return on investment (ROI) of a certain movie given the number of ratings, reviews, run-time, etc. This post illustrates my approach to tackling this problem.

Introduction

Let’s say that you are a publisher or a writer who is eager to know whether people are going to be really engaged with your book or not. Now this is an interesting issue!

How can we give such a client what they want?

Well, we can rely on book-related websites like Goodreads, a massive social network for book enthusiasts in which they can review, share, create lists, etc. The data from this website can be scraped, and then we can manipulate it using Python to get some insights and build models that can predict engagement. The following sections illustrate my approach in this project.

Methodology

The approach that I followed is summarized in the following figure:

These steps will be discussed in detail in the following subsections.

Step 1: Web Scraping

In this step, Selenium and BeautifulSoup were used to extract data from Goodreads. Data for 5000 books from 5 main genres were collected: Art, Business, Fiction, Science and Self-help. Selenium was used only to log in to the website, since the lists of books were hidden from guests. BeautifulSoup was used to extract the following features from each book:

Number of reviews
Number of pages
Average rating
Category
ISBN

The number of ratings was extracted as well, since it is the target that I am trying to predict from these features.

Step 2: Data Cleaning and Exploratory Data Analysis (EDA)

In the cleaning stage, some books were removed because some of their features did not make sense. For example, books with zero pages or with an unusually high number of pages (over 1500) were removed.
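The page-count filter described above can be sketched in a couple of lines. The dictionary layout, the field name "pages" and the sample books are assumptions for illustration; only the two cut-offs (zero pages and over 1500 pages) come from the text.

```python
def clean_books(books, max_pages=1500):
    """Drop books with zero pages or with more than max_pages pages."""
    return [b for b in books if 0 < b["pages"] <= max_pages]

# Made-up records, not taken from the scraped data.
books = [
    {"title": "A", "pages": 320},
    {"title": "B", "pages": 0},
    {"title": "C", "pages": 2100},
    {"title": "D", "pages": 1500},
]
print([b["title"] for b in clean_books(books)])
```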

After cleaning the data, EDA was done to get initial insights on the data. To elaborate, take a look at the following heatmap:

We can clearly see from the heatmap that overall we have low multicollinearity, so we do not need to worry about perfect collinearity. Also, the map tells us that the number of reviews and the Fiction category will probably have the highest effect on the model.

Now, we can also get more insights from the pair plot below:

This pair plot shows that the number of reviews has an almost linear relationship with the number of ratings, which means that it is of great value to any model that we are going to create. Also, the histograms of the number of ratings and the number of reviews tell us that we most likely need to apply some kind of transformation, such as a logarithmic or power transformation.

Step 3: Models Creation

In this stage, a set of models was created based on the observations from EDA. Some models involved variable transformations, others involved interaction terms, and others used a subset of the features without any transformation or interaction terms. Also, it should be noted that any feature with a p-value > 0.05 was removed from the model. Only three of the models that were created will be discussed in the following section.

Step 4: Models Training and Validation

The First Model (Baseline Model)

In this model, only the numerical data were considered without any additional modifications.

R-squared = 0.648, cross-validation score = 0.639. The R-squared is not bad. However, the residual plot is absolutely horrible.

The Second Model

In this model, only the number of reviews was included, and a logarithmic transformation was applied to both the number of reviews and the number of ratings.

R-squared = 0.879, cross-validation score = 0.878. The residual plot is better than the previous one. However, the Q-Q plot is not so great.
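To make the log-log setup of this model concrete, the sketch below fits a one-feature least-squares line to the logarithm of both variables and computes R-squared. It is a from-scratch illustration under assumed data: the counts are made up, the helper names are hypothetical, and the project itself used a regression library rather than this code.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Made-up counts; not taken from the scraped data.
reviews = [12, 40, 150, 900, 3200, 15000]
ratings = [130, 520, 1700, 11000, 41000, 160000]

# Apply the logarithmic transformation to both the feature and the target.
log_reviews = [math.log(v) for v in reviews]
log_ratings = [math.log(v) for v in ratings]
intercept, slope = fit_line(log_reviews, log_ratings)
print(round(slope, 3), round(r_squared(log_reviews, log_ratings, intercept, slope), 3))
```

Note that a model fitted this way predicts the logarithm of the number of ratings, so a prediction has to be passed through math.exp to get back to the original scale.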

The Final Model

Like the previous model, a logarithmic transformation was applied, but the rest of the features were also included, except Philosophy.

R-squared = 0.889, cross-validation score = 0.888. The residual plot is better than the previous one, and the Q-Q plot is also better.

Step 5: Final Model Testing and Results

The cross validation result for the final model is 0.888 and the testing result is 0.875, so R-squared and the residual plot suggest that we have a very good model. However, the condition number in the stats table shown below is a warning sign: a large condition number points to multicollinearity, which makes the coefficients unstable, so it turns out that we may have been tricked by over-fitting despite the good scores.

Now, that’s quite an issue. How can we solve it?

Well, regularization can help us avoid over-fitting by providing a model that sits somewhere between the complex model and the simple model.
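As a small illustration of how regularization pulls a model toward the simpler end, the sketch below uses ridge (L2) regression with a single feature, where the penalty shrinks the slope toward zero. Ridge is an assumption here, since the post does not say which kind of regularization would be used, and the data and alpha values are made up.

```python
def ridge_slope(x, y, alpha):
    """Slope of a one-feature ridge regression with an unpenalised intercept.

    After centring x and y, minimising sum((y - slope * x)^2) + alpha * slope^2
    gives slope = sxy / (sxx + alpha). alpha = 0 recovers ordinary least squares.
    """
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / (sxx + alpha)

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
for alpha in (0.0, 1.0, 10.0):
    # A larger alpha means a stronger penalty and a smaller slope.
    print(alpha, round(ridge_slope(x, y, alpha), 3))
```

In practice the strength of the penalty would be chosen by cross validation, in the same way the other models in this post were validated.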

Final Thoughts and Recommendations

In essence, a few modifications to the model are needed to make it perform better. First, and most importantly, regularization must be used to address the issue of over-fitting. Furthermore, the ISBN can be used to obtain a potentially significant feature, like the price of the book, which would likely enhance the prediction of the model. Lastly, gathering additional data should also improve the accuracy of the prediction.

Metis Data Science Bootcamp - Project 1

2019-09-07T00:00:00+00:00

In the first week of the bootcamp, Metis instructors teach basic data science libraries like NumPy, Matplotlib, pandas and Seaborn.

The first project is about doing exploratory data analysis (EDA) on NYC subway data to meet the needs of a specific client, using the tools that were taught in the first week of the bootcamp.

The client is an NGO called WomenTechWomenYes (WTWY), which aims to empower women in the technology sector. The client wants to invite people to a gala in support of women in tech at the beginning of summer, so they asked us to determine the best times, days and stations at which their street team is most likely to sign up the maximum number of people and hence invite them to the gala.

Our Approach in Analyzing the Data

We followed certain steps in analyzing the data. First, we determined the analysis period, which is the 5 weeks prior to the beginning of June, since June is the first month of summer. Second, we extracted the data from the MTA website. Third, we cleaned the data by omitting outliers and negative values in the counts of people flowing through the turnstiles. Lastly, we created visuals, deduced insights from them, and provided the client with the best times, stations and days.
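The cleaning step can be sketched as follows, assuming the raw turnstile readings are cumulative counters, so that the flow in an interval is the difference between consecutive readings. Negative differences and implausibly large jumps are dropped. The function name, the max_per_interval cut-off and the sample readings are hypothetical and not the exact values we used.

```python
def interval_counts(readings, max_per_interval=10000):
    """Turn cumulative counter readings into per-interval counts.

    Negative differences (for example after a counter reset) and
    implausibly large jumps are dropped rather than corrected.
    """
    counts = []
    for previous, current in zip(readings, readings[1:]):
        diff = current - previous
        if 0 <= diff <= max_per_interval:
            counts.append(diff)
    return counts

# Made-up cumulative entry readings for one turnstile.
entries = [1000, 1250, 1700, 120, 380, 5000380, 5000900]
print(interval_counts(entries))
```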

The Visuals and the Corresponding Insights

The first visual that we obtained shows the best days over the whole analysis period.

We notice from the figure that Thursday, Wednesday and Friday are the top three days in terms of average daily inflows and outflows, so we can say on initial thought that these three days are more likely to bring more people to the gala. Also, we notice that the inflows and outflows are not equal. That is because some of the exits do not have turnstiles, which means that people leaving through them are not counted.

The second figure shows the top five stations in terms of volume for the whole analysis period.

The last figure shows the average daily volume for each day with respect to the time.

We notice one issue here: on Friday between 5 and 7 am, a sudden spike occurs. This garbage data is due to inaccuracy in cleaning the data.

Recommendations

Based on the analysis and visuals, we recommend that the client focus on the following days and stations:

Days	Stations
Thursday	34 ST-PENN STA
Wednesday	GRD CNTRL-42 ST
Friday	34 ST-HERALD SQ

Final Thoughts

I think the first project is relatively easy, but it might be challenging for those who do not have prior exposure to pandas. Overall, it was an interesting project and a great experience to test out what we have learned in the first week.