Deep Learning Project: Image Classifier deployed to Mobile from Scratch

Part 1: The Hotdog CNN Classifier

Kieran McCarthy
11 min read · Jun 18, 2022
AI-generated image of AI and developers working together

This year I set myself the goal of learning a new skill that I hope will be a stepping stone toward a career in Deep Learning someday. I started the journey by investing in myself and completing the Udacity Deep Learning Nanodegree. After graduating, I was hungry to put my skills to the test and build new projects that would broaden my DL skills while also being interesting, fun, and a little wacky.

For my first Deep Learning project, I wanted to build something end-to-end: from the idea, through gathering and preprocessing data and building the classification models, to packaging the model up for mobile devices. The idea I had in mind has of course been done many times, as it follows the great gag from the show Silicon Valley of an app called SeeFood. For those who don't know it: in the show, a developer builds an app that is meant to classify any food from a photo taken with a mobile device's camera and show recipes for that food. Instead, it turns out to be nothing more than a hotdog classifier. Definitely check it out here: SeeFood.

I want this article to be a record of my work and findings, not a tutorial. You are free to use the code as you see fit, and you can even work through the project as I have documented it and have a hotdog classifier of your own by the end, but a tutorial is not the intention. I hope you enjoy reading through the journey.

So let's go!

Gathering and Cleaning the data

To start the journey, I needed a dataset with pre-defined labels, as I didn't want to spend a large amount of time labeling images myself. It also had to be large enough for the model to learn and generalize.

On Kaggle I was able to find this dataset: Deep Learning for Hot Dog Recognition. It turned out to be a great fit, with a handful of classes already attached to the images! I had previously tried another hotdog dataset, but at around 400 images it was far too small and a disaster to work with, even with K-fold cross-validation. Below are some examples from the dataset I did end up using.

=== Example Labels ===
['frankfurter', 'chili-dog', 'hotdog', 'food', 'furniture', 'people', 'pets']
Some examples of the images from the training dataset

After connecting Colab to Kaggle using my Kaggle API key and pulling the dataset into the session environment, I needed to relabel each image as either hotdog or not hotdog. This let me organize the datasets being passed into the model around those two classes.
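My original notebook cells aren't embedded here, so below is a minimal sketch of this step. It assumes kaggle.json has already been uploaded to the Colab session, that the source files are named <label>_<id>.jpg, and that the Kaggle dataset slug is a placeholder rather than the real one.

import os
from pathlib import Path

# Point the Kaggle CLI at the uploaded API key (hypothetical location).
os.environ["KAGGLE_CONFIG_DIR"] = "/content"
# In Colab: !kaggle datasets download -d <owner>/<hotdog-dataset> --unzip -p train_kaggle

# Labels in the dataset that count as a hotdog; everything else becomes "nothotdog".
HOTDOG_LABELS = {"hotdog", "frankfurter", "chili-dog"}

def relabel_images(src_dir="train_kaggle"):
    counters = {"hotdog": 0, "nothotdog": 0}
    for path in sorted(Path(src_dir).glob("*.jpg")):
        label = path.name.rsplit("_", 1)[0]   # files look like "<label>_<id>.jpg"
        new_label = "hotdog" if label in HOTDOG_LABELS else "nothotdog"
        new_path = path.with_name(f"{new_label}.{counters[new_label]}.jpg")
        counters[new_label] += 1
        print(f"{path} => {new_path}")
        path.rename(new_path)

relabel_images()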

=== Output example ===
train_kaggle/people_5176.jpg => train_kaggle/nothotdog.0.jpg
train_kaggle/food_6124.jpg => train_kaggle/nothotdog.1.jpg
train_kaggle/hotdog_9419.jpg => train_kaggle/hotdog.0.jpg
train_kaggle/hotdog_9810.jpg => train_kaggle/hotdog.1.jpg
train_kaggle/frankfurter_6525.jpg => train_kaggle/hotdog.2.jpg
train_kaggle/food_5443.jpg => train_kaggle/nothotdog.2.jpg
train_kaggle/pets_1474.jpg => train_kaggle/nothotdog.3.jpg
train_kaggle/food_6018.jpg => train_kaggle/nothotdog.4.jpg

Now that the labels were corrected, the images could be split into their respective subsets, so it was time to write a function to copy the files into those subset directories.
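The exact split sizes aren't shown here, so the counts below are placeholders; the shape of the function follows the usual copy-into-subsets pattern.

import shutil
from pathlib import Path

def make_subset(subset_name, start, end, src="train_kaggle", dst="hotdog_subsets"):
    # Copy files <category>.<start>.jpg .. <category>.<end-1>.jpg into
    # hotdog_subsets/<subset_name>/<category>/ for both categories.
    for category in ("hotdog", "nothotdog"):
        subset_dir = Path(dst) / subset_name / category
        subset_dir.mkdir(parents=True, exist_ok=True)
        for i in range(start, end):
            fname = f"{category}.{i}.jpg"
            shutil.copyfile(Path(src) / fname, subset_dir / fname)

# Illustrative split; the real counts depend on the dataset sizes above.
make_subset("train", 0, 1000)
make_subset("validation", 1000, 1200)
make_subset("test", 1200, 1400)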

The subdirectories after being set up.

Now that the data was set up and ready for training, I could start building out my classification model.

Building out the Classification Models

I chose TensorFlow as the framework to build my Convolutional Neural Network with. Since I did my Nanodegree in PyTorch, I felt using another library would help me better understand the foundations of both.

I started building the model with a few Conv2D and MaxPooling2D layers. I was targeting an accuracy higher than 60% as a good baseline to improve on with later changes to the model. The input layer reshapes and rescales incoming images to 256x256.
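The model-building cell isn't embedded, but the architecture can be read straight off the summary that follows, so this sketch reconstructs it (RMSprop as the optimizer is inferred from the optimizer experiments mentioned below):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(256, 256, 3))
x = layers.Rescaling(1.0 / 255)(inputs)             # scale pixels to [0, 1]
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(128, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(256, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(256, 3, activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dense(516, activation="relu")(x)         # 516 units, matching the summary
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary: hotdog vs. not hotdog

model = keras.Model(inputs, outputs)
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.summary()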

Model: "model_4"
_________________________________________________________________
 Layer (type)                     Output Shape             Param #
=================================================================
 input_5 (InputLayer)             [(None, 256, 256, 3)]    0
 rescaling_4 (Rescaling)          (None, 256, 256, 3)      0
 conv2d_20 (Conv2D)               (None, 254, 254, 32)     896
 max_pooling2d_16 (MaxPooling2D)  (None, 127, 127, 32)     0
 conv2d_21 (Conv2D)               (None, 125, 125, 64)     18496
 max_pooling2d_17 (MaxPooling2D)  (None, 62, 62, 64)       0
 conv2d_22 (Conv2D)               (None, 60, 60, 128)      73856
 max_pooling2d_18 (MaxPooling2D)  (None, 30, 30, 128)      0
 conv2d_23 (Conv2D)               (None, 28, 28, 256)      295168
 max_pooling2d_19 (MaxPooling2D)  (None, 14, 14, 256)      0
 conv2d_24 (Conv2D)               (None, 12, 12, 256)      590080
 flatten_4 (Flatten)              (None, 36864)            0
 dense_14 (Dense)                 (None, 516)              19022340
 dense_15 (Dense)                 (None, 1)                517
=================================================================
Total params: 20,001,353
Trainable params: 20,001,353
Non-trainable params: 0
_________________________________________________________________

I had previously experimented with some changes here that affected the predictions: changing the input shape to 300x300 increased the validation accuracy, while changing the optimizer from RMSprop to Adam decreased it, which I found very interesting as Adam is generally known to perform well on CNNs.

With the first version of the model set up, I now needed to load and preprocess the data so it was suitable for the model, for both performance and efficiency.
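The loading code isn't shown either; here's a sketch using Keras' image_dataset_from_directory, which produces the "Found ... files" messages below. The directory names are the placeholder ones from the subset step, and the sizes and order in the output come from my original notebook run.

from tensorflow.keras.utils import image_dataset_from_directory

train_ds = image_dataset_from_directory(
    "hotdog_subsets/train", image_size=(256, 256), batch_size=32)
test_ds = image_dataset_from_directory(
    "hotdog_subsets/test", image_size=(256, 256), batch_size=32)
# Holding back half of a directory prints the "Using ... files for validation" line.
val_ds = image_dataset_from_directory(
    "hotdog_subsets/validation", image_size=(256, 256), batch_size=32,
    validation_split=0.5, subset="validation", seed=42)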

=== Output example ===
Found 1544 files belonging to 2 classes.
Found 2000 files belonging to 2 classes.
Found 400 files belonging to 2 classes.
Using 200 files for validation.

After training for 30 epochs, I plotted the validation loss and accuracy to see whether the model was overfitting. I noticed that after around epoch 10, the validation accuracy started to diverge from the training accuracy, confirming the model was indeed overfitting (which is a fine start for any DL model, as it can now be tuned to do better).
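The training and plotting code was along these lines (a sketch; the plotting details are assumptions):

import matplotlib.pyplot as plt

history = model.fit(train_ds, epochs=30, validation_data=val_ds)

# Compare training vs. validation accuracy to spot overfitting.
epochs = range(1, len(history.history["accuracy"]) + 1)
plt.plot(epochs, history.history["accuracy"], label="training accuracy")
plt.plot(epochs, history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()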

And just to put this version of the model to the test, it gets evaluated against the test set.

49/49 [==============================] - 4s 75ms/step - loss: 0.7180 - accuracy: 0.5285
Test Acc: 0.528

52% accuracy in classifying images as hotdogs! In other words, it misclassifies 48% of the images; for a binary classifier, that's barely better than guessing. To improve the model, we can add a data augmentation layer, which perturbs the training images so the model generalizes better and the error rate comes down.

Model v2 with Data Augmentation

To improve the model's classification predictions, my first move was to augment the images in the datasets. This lets the model generalize better, since it never sees exactly the same image twice. I added a simple augmentation layer that randomly flipped, rotated, or zoomed the images it was given.
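The augmentation layer itself is a small Keras Sequential of preprocessing layers; the exact factors below are assumptions:

data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),   # random horizontal flips
    layers.RandomRotation(0.1),        # rotate by up to ±10% of a full turn
    layers.RandomZoom(0.2),            # zoom in/out by up to 20%
])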

Image of a cat augmented for the model (took me a while but now I can’t unsee the cat 😅)

Now with the new data augmentation layer built, it can be added as a layer in the new CNN model.
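Version 2 is the same stack as version 1 with the augmentation applied to the inputs, sketched below under the same assumptions as before:

def build_model_v2():
    inputs = keras.Input(shape=(256, 256, 3))
    x = data_augmentation(inputs)          # only active during training
    x = layers.Rescaling(1.0 / 255)(x)
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, 3, activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(256, 3, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(516, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(loss="binary_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model

model_v2 = build_model_v2()
history = model_v2.fit(train_ds, epochs=50, validation_data=val_ds)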

After running this model for 50 epochs, I again plotted the validation loss and accuracy to see whether it was performing any better. This time the model generalized a bit better to the data, but there were some sharp swings in the validation metrics at certain epochs.

49/49 [==============================] - 4s 75ms/step - loss: 0.4427 - accuracy: 0.7876
Test accuracy: 0.788

After adding the augmentation layer and training on the same dataset, accuracy rose from 52% to 78%, an increase of 26 percentage points. The model still gets around 22% of predictions wrong; not bad, but certainly something I can improve again.

The next bet was to reach higher accuracy with a pre-trained model: VGG16 with weights trained on ImageNet.

Using a pre-trained VGG16 model with transfer learning

For the model to make meaningfully better predictions, we would have to gather more data and train for days or sometimes even weeks. I don't have a powerful rig with the storage space to cut that time down, but with the leaps Deep Learning has made over the years, I don't need one. I can use a pre-trained model that has already spent days learning from a huge corpus of images and already knows how to identify useful features in them. By extracting those features from the pre-trained model, we can reuse them for this problem. This process is called Transfer Learning.

Transfer learning consists of taking features learned on one problem, and leveraging them on a new, similar problem. For instance, features from a model that has learned to identify raccoons may be useful to kick-start a model meant to identify tanukis.

Transfer learning is usually done for tasks where your dataset has too little data to train a full-scale model from scratch.

https://keras.io/guides/transfer_learning/

With that description, let's build the base to work from.
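This is the standard Keras call, and the summary that follows confirms the settings: ImageNet weights, no classifier head, 256x256 inputs.

from tensorflow.keras.applications import VGG16

conv_base = VGG16(
    weights="imagenet",      # pre-trained ImageNet weights
    include_top=False,       # drop the original 1000-class classifier head
    input_shape=(256, 256, 3),
)
conv_base.summary()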

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
58892288/58889256 [==============================] - 0s 0us/step
58900480/58889256 [==============================] - 0s 0us/step
Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_6 (InputLayer)        [(None, 256, 256, 3)]     0
 block1_conv1 (Conv2D)       (None, 256, 256, 64)      1792
 block1_conv2 (Conv2D)       (None, 256, 256, 64)      36928
 block1_pool (MaxPooling2D)  (None, 128, 128, 64)      0
 block2_conv1 (Conv2D)       (None, 128, 128, 128)     73856
 block2_conv2 (Conv2D)       (None, 128, 128, 128)     147584
 block2_pool (MaxPooling2D)  (None, 64, 64, 128)       0
 block3_conv1 (Conv2D)       (None, 64, 64, 256)       295168
 block3_conv2 (Conv2D)       (None, 64, 64, 256)       590080
 block3_conv3 (Conv2D)       (None, 64, 64, 256)       590080
 block3_pool (MaxPooling2D)  (None, 32, 32, 256)       0
 block4_conv1 (Conv2D)       (None, 32, 32, 512)       1180160
 block4_conv2 (Conv2D)       (None, 32, 32, 512)       2359808
 block4_conv3 (Conv2D)       (None, 32, 32, 512)       2359808
 block4_pool (MaxPooling2D)  (None, 16, 16, 512)       0
 block5_conv1 (Conv2D)       (None, 16, 16, 512)       2359808
 block5_conv2 (Conv2D)       (None, 16, 16, 512)       2359808
 block5_conv3 (Conv2D)       (None, 16, 16, 512)       2359808
 block5_pool (MaxPooling2D)  (None, 8, 8, 512)         0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________

Now we just need to run our datasets through the VGG16 base to extract the features, keeping the labels associated with them.
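A sketch following the standard feature-extraction recipe: push every batch through the VGG16 base once and keep the outputs and labels as arrays.

import numpy as np

def get_features_and_labels(dataset):
    all_features, all_labels = [], []
    for images, labels in dataset:
        # VGG16 expects its own preprocessing (channel mean subtraction).
        preprocessed = keras.applications.vgg16.preprocess_input(images)
        all_features.append(conv_base.predict(preprocessed))
        all_labels.append(labels)
    return np.concatenate(all_features), np.concatenate(all_labels)

train_features, train_labels = get_features_and_labels(train_ds)
val_features, val_labels = get_features_and_labels(val_ds)
test_features, test_labels = get_features_and_labels(test_ds)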

With the features and labels extracted, version 3 of the model can be built. But because the features are computed from the pre-trained weights in a single pass before training, the data augmentation layer can't be used with this version; each image's features are extracted once, so the model sees identical inputs every epoch.
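Version 3 is then just a small densely connected classifier over the extracted (8, 8, 512) feature maps; the head size and dropout below are assumptions:

inputs = keras.Input(shape=(8, 8, 512))    # shape of VGG16's block5_pool output
x = layers.Flatten()(inputs)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_v3 = keras.Model(inputs, outputs)
model_v3.compile(loss="binary_crossentropy", optimizer="rmsprop",
                 metrics=["accuracy"])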

We don't have to run this version of the model for a large number of epochs, because we are only training the last few layers on top of the already-extracted features. So I ran it for just 20 epochs.

49/49 [==============================] - 0s 4ms/step - loss: 24.1662 - accuracy: 0.9139
Test Acc: 0.914

91% accuracy! That's a greatly improved result; the model now misclassifies only 9% of images. I wanted to fine-tune this model one last time by freezing the conv_base and placing it directly inside the model, which allows the augmentation layer created two versions back to be used again!
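Sketched below: the frozen base goes directly into the model, so the images (and therefore the augmentation layer) flow through it end to end. The head mirrors version 3 and, like the epoch count, is an assumption.

conv_base.trainable = False                 # freeze the pre-trained weights

inputs = keras.Input(shape=(256, 256, 3))
x = data_augmentation(inputs)               # the augmentation layer returns!
x = keras.applications.vgg16.preprocess_input(x)
x = conv_base(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_v4 = keras.Model(inputs, outputs)
model_v4.compile(loss="binary_crossentropy", optimizer="rmsprop",
                 metrics=["accuracy"])
model_v4.fit(train_ds, epochs=20, validation_data=val_ds)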

49/49 [==============================] - 10s 195ms/step - loss: 1.9261 - accuracy: 0.9456
Test Acc: 0.946

Results

Up another 3 percentage points! I'm quite happy about this. It's a small increase, but an increase nonetheless! To be fair, the previous 91% accuracy was already pretty good for version 3 of the classifier. Collecting more images in the future and training a new version of the model on them should improve it further and bring the accuracy up. Tuning the learning rate might help too; from the look of the validation accuracy across epochs, I suspect it's too coarse to pick up small changes in the weights.

But for now, I’m happy with 94% :)

Let's test the predictions :)

Testing the predictions of the CNN

Below is a function I used to test the CNN on images it hadn't seen yet and check the accuracy of its predictions.

The function downloads an image from a given URL and attempts to predict whether it shows a hotdog or not. Below are some examples of the predictions it made.
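A sketch of the idea (the URL handling and the 0.5 threshold are assumptions; with image_dataset_from_directory's alphabetical class ordering, "hotdog" maps to class 0, so low scores mean hotdog):

import numpy as np
import requests
from io import BytesIO
from PIL import Image

def predict_hotdog(url, model, image_size=(256, 256)):
    response = requests.get(url, timeout=10)
    img = Image.open(BytesIO(response.content)).convert("RGB").resize(image_size)
    batch = np.expand_dims(np.array(img, dtype="float32"), axis=0)  # (1, H, W, 3)
    score = float(model.predict(batch)[0][0])
    label = "hotdog" if score < 0.5 else "not hotdog"   # class 0 == hotdog
    print(f"{url} -> {label} (score={score:.3f})")

predict_hotdog("https://example.com/some_image.jpg", model_v4)  # placeholder URL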

Great! The predictions are very accurate and classify the images correctly (depending on how you view The Rock 😅). Now all that's left is to package up the CNN model for use on a mobile device with TensorFlow Lite.

Converting the TensorFlow model to a TensorFlow Lite model
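The conversion itself is only a few lines with TFLiteConverter; the output filename is a placeholder:

import tensorflow as tf

# Convert the trained Keras model to a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model_v4)
tflite_model = converter.convert()

with open("hotdog_classifier.tflite", "wb") as f:   # placeholder filename
    f.write(tflite_model)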

We now have a TensorFlow Lite model ready to be imported into a mobile application to identify hotdogs in images! Clearly very important and groundbreaking stuff here 🤯😅.

I hope this write-up comes in handy for anyone interested in getting started in Deep Learning and working with Convolutional Neural Networks. This project was a lot of fun to build, and it took quite a bit of research to finish.

All my findings and work are available in this article, along with the full Colab codebase on GitHub, to be used as you like. If you found this interesting and would like to see more, give this post a 👏.

In the meantime, I will be writing about the mobile deployment side of the project, which uses Flutter, and will update this post once it's ready.

Part 2 is now available: https://kieran-mcc91.medium.com/deep-learning-project-image-classifier-deployed-to-mobile-from-scratch-c3aa063fbe37
