
Conscious Chef

So I recently started the remote version of the Insight Data Science fellowship program. Insight asks you to come up with an idea for an app, then create and deploy it in 3 weeks!

Coming up with a non-academic and fun idea that was feasible to implement in 3 weeks was tougher than I imagined. Fail, learn and iterate was my mantra for the first three weeks. I finally decided to tackle a major challenge I encounter every summer: using up the produce from my CSA box.

I knew from my own experience and that of other CSA users that using up the CSA produce, and not throwing away any of that gorgeous produce, is no trivial task. Then comes the challenge of figuring out what to do with vegetables like kohlrabi, romanesco broccoli, rutabagas, and those insane varieties of squashes. Then there's the deluge of cukes, tomatoes, and lettuce. And some vegetables spoil incredibly quickly. With these limitations, a CSA box can quickly take over your life, which is sad because I love the access to fresh vegetables.

So I had the idea of creating a weekly meal planner app, where the CSA user could input the produce, quantities, and the number of people they normally feed in a week, and my app would recommend recipes optimized for shelf life while minimizing leftover quantity. When I pitched the idea to my cohort and the program directors, they immediately told me it would be challenging. The question of how I would ever validate it was raised time and time again.

So I quickly pivoted, ditching the weekly meal planner in favor of simply recommending recipes optimized for shelf life and minimal leftover quantity. This turned out to be an excellent call, because getting the information I needed to build even this simpler app was more challenging than I expected.

Challenge 1: Cleaning text

As you can imagine, for every recipe I needed the names of the vegetables used, their quantities, the units of measurement, and the serving size. I didn't find any ready-made database that provided all of that along with pictures, star ratings, reviews, etc. (who is ever going to use a recipe app without pictures?). Those that provided some of these things charged money, and a cheap, never-stopped-being-a-grad-student person like me didn't want to spend it. So I had to resort to creating my own database.

Epicurious.com was a great resource because they didn't mind web scrapers and had everything I needed to create a recipe database. The JavaScript in their web pages was easily handled with Selenium WebDriver. Since the deluge of video ads on epicurious.com slowed my scraping program considerably and the data ate away at my memory, I just ran my scraper at night, and I ended up with a database of 8,000 recipes.
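A minimal sketch of how Selenium can render those JavaScript-heavy pages before parsing; the URL list and browser choice here are placeholders, not my exact setup:

from selenium import webdriver

# placeholder list of recipe page URLs, not real links
recipe_urls = ['https://www.epicurious.com/recipes/food/views/example-recipe']

driver = webdriver.Firefox()          # any browser Selenium supports works here
pages = []
for url in recipe_urls:
    driver.get(url)                   # lets the page's JavaScript finish rendering
    pages.append(driver.page_source)  # keep the rendered HTML to parse later
driver.quit()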

Once I had all this text data from the web pages, I then had to do a lot of cleaning. Because most of the content on epicurious.com is generated by users, the text data tended to be rather messy, and there was no clean way to extract information like ingredients, quantities, and units of measurement. Ingredient lists for recipes tend to read like this:

* 1 pound tomatoes 
* 1 large onion finely chopped
* 1 cup basil
...

How then do I extract the information I need? I could use natural language processing, but training such models would need a lot of data, not to mention time, neither of which I had. Thankfully, when people write recipes, they tend to follow a particular structure. I could exploit this structured nature of the text to parse the quantity, measurement unit, and produce name without training a model.

I could potentially use regular expressions, but given the irregularity in the structured text, they would not be an easy tool. Enter the pyparsing library. Its advantage over regular expressions is that it lets you define a grammar, so I could match certain text to targets. I could define numbers as 'quantity', words like 'pounds', 'cups', and 'tablespoons' as 'measurement units', and 'tomatoes' and 'onions' as 'ingredients'. I could also specify the order these targets appear in. So in my case, the first number would always be the quantity, the second target would always be a unit of measurement, and the third target would always be an ingredient name.

This snippet demonstrates the idea for a simple phrase, '1 pound tomatoes':

import pyparsing as pp

measurements = ['pound', 'cup']
ingredients = ['tomatoes', 'onion']

# a quantity is a run of digits
quantityGrammar = pp.Word(pp.nums)
# a unit is any of the known measurement words
unitGrammar = pp.Or(measurements)
# an ingredient is any of the known produce names
ingredientGrammar = pp.Or(ingredients)

# expect quantity, then unit, then ingredient, in that order
grammar = quantityGrammar + unitGrammar + ingredientGrammar

ingredientPhrase = '1 pound tomatoes'
a = list(grammar.parseString(ingredientPhrase))
print(a)

This results in

 ['1', 'pound', 'tomatoes']

Recipe authors don't always follow the rules, though. You encounter quantities such as 1-1/2 or 1 1/2. Sometimes the author omits the unit and just says '1 tomato', which would break the code above. The author could also use adjectives, as in '1 large chopped tomato', which needs to be taken into consideration as well. Then we also have to account for non-alphanumeric characters.

Also, plurals. We don't really need to differentiate between 'tomatoes' and 'tomato', or 'pounds' and 'pound', so we can lemmatize these words using the NLTK package. My more rigorous code now reads:

import pyparsing as pp
from nltk.stem.wordnet import WordNetLemmatizer   # requires nltk.download('wordnet')

measurements = ['pound', 'cup']
ingredients = ['tomatoes', 'onion']
adjectives = ['chopped', 'large']

lemmatizer = WordNetLemmatizer()

def lemmatize_all(words):
    # reduce each word to its singular base form, so 'pounds' also matches 'pound'
    return [lemmatizer.lemmatize(word.split()[0].lower()) for word in words if word]

# quantities may be whole numbers or fractions such as '1-1/2' or '1 1/2'
quantityGrammar = pp.ZeroOrMore(pp.Word(pp.nums + ' /-'))
# units, adjectives, and ingredients each match any of the known (lemmatized) words
unitGrammar = pp.originalTextFor(pp.ZeroOrMore(pp.Or(lemmatize_all(measurements))))
adjectiveGrammar = pp.originalTextFor(pp.ZeroOrMore(pp.Or(lemmatize_all(adjectives))))
ingredientGrammar = pp.originalTextFor(pp.ZeroOrMore(pp.Or(lemmatize_all(ingredients))))

# expected order: quantity, unit, adjective, ingredient; each piece may be absent
grammar = (pp.Optional(quantityGrammar) + pp.Optional(unitGrammar) +
           adjectiveGrammar + ingredientGrammar)

ingredientPhrase = '1-1/2 pound large tomatoes'
a = list(grammar.parseString(ingredientPhrase))
print(a)

This results in the quantity, unit, adjective, and produce extracted as:

['1-1/2 ', 'pound', 'large', 'tomato']

There are other things a recipe author could include that would break the code above, but catching every exception would make the grammar much more complicated. This code extracted the recipe info accurately for more than 70% of recipes, which gave me a database of over 6,000 recipes, adequate for my project.

Because there are a lot of vegetables and adjectives, I didn't want to define those lists manually, so I scraped them using the BeautifulSoup package.
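A minimal sketch of that kind of scrape, assuming a page that lists vegetable names in list items; the URL below is just a placeholder:

import requests
from bs4 import BeautifulSoup

# placeholder URL; any page with a simple list of vegetable names works
url = 'https://example.com/list-of-vegetables'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# pull the text out of each list item and normalize it
vegetables = [li.get_text().strip().lower() for li in soup.find_all('li')]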

I still needed units like 'pounds', 'teaspoons', 'bunch', and 'cups' converted to one common unit. As a person with no sense of pounds, kilograms, and the like, I would want an app that simply counts the number of tomatoes, so I converted all of these units into counts that would make sense to a clueless end user. Luckily, a lot of cooking websites publish that kind of conversion info, and because there were only a few units, I defined these conversions manually.
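A sketch of that lookup, with illustrative (not authoritative) conversion factors mapping a unit of a given produce to an approximate count:

# approximate counts per unit of produce; the numbers here are only illustrative
UNIT_TO_COUNT = {
    ('pound', 'tomato'): 3,    # roughly 3 medium tomatoes per pound
    ('pound', 'onion'): 2,     # roughly 2 medium onions per pound
    ('bunch', 'basil'): 1,     # treat a bunch as one item
}

def to_count(quantity, unit, produce):
    """Convert a (quantity, unit) pair into an approximate item count."""
    return quantity * UNIT_TO_COUNT.get((unit, produce), 1)

print(to_count(1.5, 'pound', 'tomato'))   # -> 4.5, i.e. about 4-5 tomatoes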

To complete my database I needed shelf life info. Luckily, there is a wealth of information on this on the interwebs, so I ended up scraping that too.

Once I got all this info in a clean form, the optimization algorithm seemed trivial in comparison. I will talk about that in my next post.

Check out my app at www.consciouschef.us.