!!! abstract "Abstract"

    This chapter covers the execution of the work based on the definitions made previously, in terms of developing, testing, and validating the solutions against the agreed acceptance criteria. (10 min read)
Once the repository was prepared and the initial requirements were defined, the next step was to analyze the available data to check the following aspects:
The first data source implemented was Spoonacular, through a paid subscription on RapidAPI. Since the recipes involved were in English, a translation service was also implemented to obtain the data in Spanish.
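As an illustration of how such a source is typically consumed, here is a minimal sketch of a Spoonacular search through RapidAPI. The host, endpoint, and parameters follow the usual RapidAPI conventions and may differ from the project's actual calls:

```python
import requests

# Illustrative sketch of fetching recipes from Spoonacular through RapidAPI.
# The host and endpoint below follow the usual RapidAPI conventions and may
# differ from the ones actually used in the project.
RAPIDAPI_HOST = "spoonacular-recipe-food-nutrition-v1.p.rapidapi.com"

def fetch_recipes(api_key: str, query: str, number: int = 10) -> dict:
    """Search recipes matching `query`, returning the raw JSON response."""
    response = requests.get(
        f"https://{RAPIDAPI_HOST}/recipes/complexSearch",
        headers={
            "X-RapidAPI-Key": api_key,       # standard RapidAPI auth headers
            "X-RapidAPI-Host": RAPIDAPI_HOST,
        },
        params={"query": query, "number": number},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```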
In order to ensure that the database of recipes and ingredients has proper quality for the given approach, a Data Quality Control was conducted. Check the How-to: Data Quality Control document for details. The analysis carried out in that notebook showed that, in general terms, the data was fine. However, at that point and in later iterations, data issues like the following were found:
Therefore, the data source was marked as unreliable, and extra work was required to identify a new source and scrape recipe data from it.
After reviewing several potential data sources (looking for the fields required by the StarvApp, and with no legal restrictions), a new one was found. Therefore, a scraper script was developed to get datasets from it with a defined configuration. Here are some important limitations to know about the scraper:
The resulting datasets, in CSV format, were checked in a new data quality control conducted in this notebook, and then shared with the Backend team for proper upload to the StarvApp database. In addition, snapshots of the obtained datasets were stored in OneDrive for availability.
!!! info

    This data extraction process was part of a workaround task handled after finding the data issues in Spoonacular. Automating data extraction from these sources will be required in later steps.
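The source is intentionally not named here, but as a purely illustrative sketch of the kind of scraper script described above: read a small configuration, fetch pages, parse them, and write the result as CSV. All URLs and selectors below are hypothetical:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Purely illustrative: the real source, URLs, and CSS selectors differ.
CONFIG = {
    "base_url": "https://example.com/recipes?page={page}",  # hypothetical
    "pages": 5,                  # how many listing pages to visit
    "output": "recipes.csv",
}

def scrape() -> None:
    rows = []
    for page in range(1, CONFIG["pages"] + 1):
        html = requests.get(CONFIG["base_url"].format(page=page), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Hypothetical selectors: adapt to the real source's markup.
        for card in soup.select(".recipe-card"):
            rows.append({
                "title": card.select_one(".title").get_text(strip=True),
                "ingredients": card.select_one(".ingredients").get_text(strip=True),
            })
    # Write the dataset snapshot in the CSV format shared with the Backend team.
    with open(CONFIG["output"], "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "ingredients"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    scrape()
```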
It is important to mention that the ingredients in the recipes extracted from this source were not normalized as in Spoonacular; instead, they came as free-text descriptions that had to be processed and classified, as described below.
Given a free-text description of an ingredient, the challenge is to give it a normalized name to make the retrieval of recipes easier for a query of entered ingredients. For example, for an ingredient with the description “2 cucharadas de aceite de oliva” (2 tablespoons of olive oil), you expect to classify it as “Aceite de oliva”. However, this is not an easy task due to aspects like the following that may appear in the text:
Therefore, the approach taken was to train a classifier that processes a text description and tells which ingredient it corresponds to. To tackle the given complexity in the text, a FastText model was chosen, as it is easy to tune yet powerful, being a neural network designed for this purpose. The first step was to prepare a labeled dataset to train and validate the model, with two fields per record (a text description, and a “label” indicating the expected ingredient classification), which was built as follows:
The resulting labeled dataset was stored in OneDrive and then used to train a FastText model. A script was developed for this purpose, which performs the following steps:
Once the model was obtained, it was ready to be used to classify the extracted recipes and build the database. As a result, we have datasets with proper fields and normalized ingredients, ready to be used by the application.
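As a sketch of how such a FastText model can be trained and used (file names and hyperparameters below are illustrative, not the project's actual values), note that FastText's supervised mode expects one example per line, with the label prefixed by `__label__`:

```python
import fasttext

# Training data in FastText supervised format, one example per line, e.g.:
#   __label__aceite-de-oliva 2 cucharadas de aceite de oliva
# File names and hyperparameters below are illustrative.
model = fasttext.train_supervised(
    input="ingredients.train.txt",
    epoch=25,        # passes over the training data
    lr=0.5,          # learning rate
    wordNgrams=2,    # capture short phrases like "aceite de oliva"
)

# Validate on a held-out set: returns (n_examples, precision@1, recall@1).
print(model.test("ingredients.valid.txt"))

# Classify a free-text ingredient description.
labels, probs = model.predict("2 cucharadas de aceite de oliva")
print(labels[0], probs[0])

model.save_model("ingredient_classifier.bin")
```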
Given an input from the user with ingredients, and a database with recipes, the idea was to have a recommender system that decides, based on some criteria, the top recipes to return. In this project, two approaches were explored, in this order:
Both of them were intended for a “cold start”, where no user history is available. Therefore, recommendations are obtained based only on the recipe information, not on user behavior.
!!! tip

    Following the KISS principle, it is suggested to always start with a very simple approach (even if it is not Machine Learning); once it is implemented, you can collect feedback and data about usage and then refine it.
Based on the available data, the decision was to start with a simple approach of basic scoring, in order to accelerate integration and collect early feedback that could justify the need for a more complex approach.
In this “naive” approach, the idea is to get the recipes with the best score based on the following aspects:
By tuning the score functions for these aspects, as well as the way they are aggregated, we get a very quick way to select the best recipes to recommend, making it possible to start the integration with other services and have the structure needed to later try a better approach.
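As an illustration of this kind of basic scoring, the following is a minimal sketch that assumes a single aspect, the fraction of a recipe's ingredients covered by the user's input; the project's actual score functions and aggregation are not reproduced here:

```python
def score_recipe(input_ingredients: set[str], recipe_ingredients: set[str]) -> float:
    """Toy score in [0, 1]: fraction of the recipe covered by the input.

    Illustrative only; the project's real score aggregates several aspects.
    """
    if not recipe_ingredients:
        return 0.0
    matched = input_ingredients & recipe_ingredients
    return len(matched) / len(recipe_ingredients)

def recommend(input_ingredients, recipes, top_n=5):
    """Return the `top_n` recipes with the highest score."""
    ranked = sorted(
        recipes,
        key=lambda r: score_recipe(set(input_ingredients), set(r["ingredients"])),
        reverse=True,
    )
    return ranked[:top_n]
```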
In order to assess whether a set of recommended recipes is appropriate for a set of entered ingredients, these are the aspects to be evaluated:
The problem with this approach is that it does not take into account any notion of relevance for the ingredients involved in a recommended recipe. As a result, some recipes with “uncommon” ingredients may be selected when there are few ingredient matches for a given input, an undesired behavior that was analyzed in order to propose a new solution.
More details about this approach can be found in this Jupyter notebook.
To tackle the problem of recommendations with uncommon ingredients when there are few matched ingredients, a new approach was analyzed to “weight” ingredients, so that the recommendations delivered select “feasible” recipes.
Given that context, we can compute “weights” for the ingredients in the database, where the idea is to express with a score in the range [0, 1] how “common” an ingredient is. Therefore, when recommendations select additional ingredients that are not part of the input, they can prefer “common” ingredients over “uncommon” ones.
So, through an analysis of the available data, ingredients were processed to get their frequency and scale it into the range [0, 1] on a logarithmic scale. The result was saved into a YAML file, and the approach implemented was simply to recommend recipes that:
!!! info

    Notice that the YAML file indicates the list of supported ingredients and the relevance they have in the given database.
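A minimal sketch of how such weights could be derived is shown below. The exact scaling used in the project may differ; here the assumption is weight = log(1 + count) / log(1 + max_count):

```python
import math
from collections import Counter

import yaml  # PyYAML

def compute_weights(recipes: list[dict]) -> dict[str, float]:
    """Log-scaled 'how common' score in [0, 1] per ingredient.

    Illustrative scaling: weight = log(1 + count) / log(1 + max_count).
    """
    counts = Counter(
        ingredient for recipe in recipes for ingredient in recipe["ingredients"]
    )
    max_count = max(counts.values())
    return {
        name: math.log1p(count) / math.log1p(max_count)
        for name, count in counts.items()
    }

if __name__ == "__main__":
    # Made-up sample data, just to show the output format.
    sample = [
        {"ingredients": ["aceite de oliva", "sal"]},
        {"ingredients": ["sal", "azafrán"]},
    ]
    with open("ingredient_weights.yaml", "w", encoding="utf-8") as f:
        yaml.safe_dump(compute_weights(sample), f, allow_unicode=True)
```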
Finally, this was the approach selected to go to production.
Once the recommendation approach was defined and developed, the final step was to develop and deploy a service that receives requests from the StarvApp with the user's input and returns recipe recommendations. Therefore, a script was developed with asyncio to perform the following steps:
For more details about the integration between services in the application, check the architecture shown in Chapter 2.
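Since the broker used by the application is RabbitMQ (see the deployment notes below), a sketch of such an asyncio consumer could look like the following, here written with the aio-pika client; the library choice, URL, and queue names are assumptions, not the project's actual values:

```python
import asyncio
import json

import aio_pika  # assumed client; the project may use a different RabbitMQ library

def recommend(ingredients: list[str]) -> list[dict]:
    """Stub standing in for the recommendation logic sketched earlier."""
    return []

async def main() -> None:
    # Placeholder URL and queue names: the real values live in the service config.
    connection = await aio_pika.connect_robust("amqp://guest:guest@localhost/")
    async with connection:
        channel = await connection.channel()
        queue = await channel.declare_queue("recommendation_requests")

        async with queue.iterator() as messages:
            async for message in messages:
                async with message.process():  # acknowledge on success
                    request = json.loads(message.body)
                    recipes = recommend(request["ingredients"])
                    await channel.default_exchange.publish(
                        aio_pika.Message(body=json.dumps(recipes).encode()),
                        routing_key="recommendation_responses",  # placeholder
                    )

asyncio.run(main())
```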
To test the service before deployment, another script was developed to emulate requests with random input and ensure the service responds correctly.
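A sketch of such an emulator, under the same assumptions as the service sketch above (the ingredient pool is made up):

```python
import asyncio
import json
import random

import aio_pika  # same assumed client as in the service sketch above

# Made-up pool of ingredients to build random requests from.
SAMPLE_INGREDIENTS = ["aceite de oliva", "sal", "arroz", "pollo", "tomate"]

async def send_random_request() -> None:
    connection = await aio_pika.connect_robust("amqp://guest:guest@localhost/")
    async with connection:
        channel = await connection.channel()
        payload = {"ingredients": random.sample(SAMPLE_INGREDIENTS, k=3)}
        await channel.default_exchange.publish(
            aio_pika.Message(body=json.dumps(payload).encode()),
            routing_key="recommendation_requests",  # placeholder queue name
        )

asyncio.run(send_random_request())
```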
To deploy this service, a Dockerfile was developed to install the system and run the service as the entrypoint. At the same time, another Dockerfile was defined to run the RabbitMQ service for the application. Notice that both services are managed by the DevOps team on their own infrastructure.
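As an illustration, a minimal Dockerfile for this kind of service could look like the following; the base image, file names, and entrypoint are assumptions, not the project's actual configuration:

```dockerfile
# Illustrative only: base image, paths, and entrypoint are assumptions.
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
ENTRYPOINT ["python", "service.py"]
```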
For this solution, there were two types of testing implemented:

- Unit tests, implemented with `pytest`.
- System tests, which run the solution against a real dataset (see the note below).
!!! info

    To run the system tests, a dataset must be available, with a valid path configured.
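As an illustration of the unit-test side, a toy pytest module targeting the scoring sketch from earlier could look like this (the function is copied inline so the file runs standalone; the real test suite lives in the repository):

```python
import pytest

def score_recipe(input_ingredients: set, recipe_ingredients: set) -> float:
    """Copied from the scoring sketch above, so this file runs standalone."""
    if not recipe_ingredients:
        return 0.0
    return len(input_ingredients & recipe_ingredients) / len(recipe_ingredients)

def test_full_match_scores_one():
    assert score_recipe({"sal", "arroz"}, {"sal", "arroz"}) == 1.0

def test_empty_recipe_scores_zero():
    assert score_recipe({"sal"}, set()) == 0.0

def test_partial_match_is_fraction():
    assert score_recipe({"sal"}, {"sal", "arroz"}) == pytest.approx(0.5)
```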
The StarvApp has a Continuous Integration (CI) pipeline configured with Azure Pipelines and handled by the DevOps team, so CI is already managed for this work, and the files explained in the Environment section are the contract that ensures correct integration. However, a GitHub Actions workflow was additionally configured for this repository to perform the following checks:
You can check the configuration and current status in the project’s README.