Food classification sounds like a straightforward problem: identify the fruit, vegetable, or meal in front of you. For humans, this is easy, yet for a phone, it is quite a complex task. One reason is that a large degree of visual variation exists between images of the same dish. It’s not only a problem of comparing apples and oranges, but also one of comparing apples and apples, and identifying them as the same. In this post, we will talk about how we improved our food detection algorithm by taking a similarity-based approach.
Classifiers are great. Take a dataset of cars, for example: train a classifier on it, and it will be able to tell you which brand of car is in front of you. The same goes for dog breeds, types of flowers or anything, really.
All this greatness comes at a cost. These datasets consist of thousands of images, all of which have to go through quality assessment and be labelled. This work is usually done by people visually inspecting the images to ensure that they are of a certain quality and that the label assigned to them is correct.
Where’s the Food?
In our case, our general deep-learning-based food classifier works for most foods. It covers most of the day-to-day variety of food items one comes across but struggles with more complex compositions.
Additionally, if we wanted to add new food categories to our classifier, we would require a large dataset of images representing each of them. That is the crux of our problem. Those categories are usually less well represented and will not have as many images available, making it harder to assemble a dataset large enough to retrain the classifier.
Therefore, we require a different kind of approach that doesn’t need as many images and potentially doesn’t need to be retrained every time new data is available.
This is where our new approach based on embeddings comes in. An embedding can be seen as a simpler description of an image: a fixed-length vector of numbers in place of the raw pixels. These descriptions can be generated by pre-trained general-purpose classifiers. Based on these embeddings, we can evaluate and quantify how similar two images are.
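To make "quantifying similarity" concrete, here is a minimal sketch that compares embeddings with cosine similarity. The choice of cosine similarity is an assumption for illustration (the post does not name the metric we use), and the three-dimensional vectors are made-up toy values; real embeddings typically have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors.

    Values close to 1.0 mean the vectors point in nearly the same direction.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values, not produced by a real model).
apple_red = [0.9, 0.1, 0.3]
apple_green = [0.8, 0.2, 0.35]
orange = [0.1, 0.9, 0.2]

apples_vs_apples = cosine_similarity(apple_red, apple_green)
apples_vs_oranges = cosine_similarity(apple_red, orange)
```

With these toy values, the two apple images score much closer to each other than the apple and the orange do, which is exactly the property the similarity search relies on.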
When we get a novel food image, we can search through all the images in our database and find the most similar ones. This allows us to propose their labels to the user as suggestions. With this approach, we can utilise smaller datasets with fewer images (even a single image) to improve our suggestions for categories our classifier would otherwise fail to identify.
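The lookup described above can be sketched as a brute-force nearest-neighbour search. Everything here is illustrative: `suggest_labels`, the `DATABASE` list and its toy vectors are hypothetical names and values, cosine similarity is an assumed metric, and a production system would likely use a dedicated vector index rather than sorting the whole database.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def suggest_labels(
    query: Sequence[float],
    database: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> List[str]:
    """Return up to k distinct labels, ordered by their most similar image."""
    ranked = sorted(
        database,
        key=lambda entry: cosine_similarity(query, entry[0]),
        reverse=True,
    )
    suggestions: List[str] = []
    for _, label in ranked:
        if label not in suggestions:  # skip labels we already suggested
            suggestions.append(label)
        if len(suggestions) == k:
            break
    return suggestions


# Toy database of (embedding, label) pairs; "shakshuka" has only one image.
DATABASE = [
    ([0.9, 0.1, 0.3], "apple"),
    ([0.8, 0.2, 0.35], "apple"),
    ([0.1, 0.9, 0.2], "orange"),
    ([0.2, 0.3, 0.9], "shakshuka"),
]

new_photo = [0.25, 0.25, 0.85]  # hypothetical embedding of a new user photo
top_suggestions = suggest_labels(new_photo, DATABASE, k=2)
```

Note that the rare category is suggested first even though the database holds just one example of it, which is the case a conventional classifier struggles with.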
Does it Get Any Better Than That?
Yes. Yes, it does! Every time new data becomes available, it can be embedded and added to the database instantly. No cumbersome retraining is needed anymore, which keeps the feedback loop as short as possible and means the model never goes out of date.
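As a rough sketch of why no retraining is needed: adding a new image amounts to appending its embedding and label to the searchable collection. The `EmbeddingIndex` class below is a hypothetical, in-memory stand-in for our database, again assuming cosine similarity and toy vectors.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class EmbeddingIndex:
    """In-memory collection of (embedding, label) pairs (illustrative only)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[List[float], str]] = []

    def add(self, embedding: Sequence[float], label: str) -> None:
        # Adding data is a plain append: no model weights change.
        self._entries.append((list(embedding), label))

    def most_similar_label(self, query: Sequence[float]) -> Optional[str]:
        """Return the label of the closest stored embedding, or None if empty."""
        if not self._entries:
            return None
        best = max(self._entries, key=lambda e: cosine_similarity(query, e[0]))
        return best[1]


index = EmbeddingIndex()
index.add([0.9, 0.1, 0.3], "apple")
index.add([0.1, 0.9, 0.2], "orange")

query = [0.25, 0.25, 0.85]  # hypothetical embedding of an unseen dish
before = index.most_similar_label(query)

# A single labelled image of the new dish arrives and is usable immediately.
index.add([0.2, 0.3, 0.9], "shakshuka")
after = index.most_similar_label(query)
```

Before the new entry is added, the best the index can offer is its closest existing category; right after the `add` call, the same query returns the new label.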
Are You Telling Me I’m Getting a Free Lunch?
No, sadly not. As with all machine-learning-based approaches, the model is only as good as the data we feed it. Therefore, we still need some form of quality assessment before we embed new images, or other mechanisms that prune bad data later.
What Did We Gain?
We are able to offer our users specific suggestions instantaneously and can utilise all our images from the get-go without having to tediously retrain our classifier.