There are currently well over 100,000 votes on What to Brew, meaning it’s a treasure trove of data. But data is only as useful as its analysis. This article looks at my work to find groupings of similar homebrew additions based on how well they work with each style.
I decided to use k-means clustering to try to group the additions. In some ways, this data is not the best fit for the method: the data is noisy, and there aren’t discrete clusters. However, there were some interesting findings.
K-means clustering takes the What to Brew data and tries to find which additions are most similar to each other, based on the styles they do and don’t work with, then groups them into a set number of clusters. In other words, I ask the computer: “If you were to group these additions into this many clusters, how would you do it?”
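To make the idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is an illustration rather than the script behind the results in this article: the `scores` table, its three style columns, and every number in it are made up, and the assumption is simply that each addition can be represented as a list of per-style scores.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means. points maps a name to a list of numbers."""
    rng = random.Random(seed)
    names = sorted(points)
    # Start from k distinct additions chosen at random as the initial centroids.
    centroids = [list(points[n]) for n in rng.sample(names, k)]
    assignment = {}
    for _ in range(iterations):
        # Assign every addition to its nearest centroid (squared distance).
        new_assignment = {}
        for n in names:
            dists = [sum((a - b) ** 2 for a, b in zip(points[n], c))
                     for c in centroids]
            new_assignment[n] = dists.index(min(dists))
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has settled
        assignment = new_assignment
        # Move each centroid to the mean of the additions assigned to it.
        for i in range(k):
            members = [points[n] for n in names if assignment[n] == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical scores per style (stout, wheat beer, IPA), from -1 to 1.
scores = {
    "coffee":     [0.9, -0.6, -0.4],
    "chocolate":  [0.8, -0.5, -0.5],
    "lemon peel": [-0.7, 0.8, 0.5],
    "grapefruit": [-0.8, 0.5, 0.9],
}

labels = kmeans(scores, k=2)
for name, cluster in sorted(labels.items(), key=lambda item: item[1]):
    print(cluster, name)
```

The cluster numbers themselves carry no meaning; only which additions share a number matters, which is also why the numbering in the listings in this article shifts from run to run.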
Determining the correct number of clusters is tricky: too few, and the groupings are too big to be useful; too many, and many of the additions end up in clusters by themselves, which isn’t useful either. I decided to run the script for every cluster count from 2 to 30 and compare the results. Some of the significant groupings follow.
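The trade-off can be watched directly by sweeping the cluster count and counting how many clusters end up holding a single addition. The sketch below is an assumption-laden illustration, not the original script: `kmeans` is a compact stand-in, `singleton_count` is a hypothetical helper, and the two-column scores (say, stout and wheat beer) are invented.

```python
import random
from collections import Counter

def kmeans(points, k, seed=0, max_iter=100):
    """Compact Lloyd's k-means; returns {name: cluster index}."""
    names = sorted(points)
    centroids = [points[n] for n in random.Random(seed).sample(names, k)]
    labels = {}
    for _ in range(max_iter):
        # Nearest centroid by squared distance for every addition.
        new = {
            n: min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(points[n], centroids[i])))
            for n in names
        }
        if new == labels:
            break
        labels = new
        # Recompute each non-empty cluster's centroid as the mean of its members.
        for i in range(k):
            members = [points[n] for n in names if labels[n] == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def singleton_count(labels):
    """Number of clusters that contain exactly one addition."""
    sizes = Counter(labels.values())
    return sum(1 for size in sizes.values() if size == 1)

# Hypothetical scores for two styles; all values are made up.
points = {
    "coffee":     (0.9, -0.8),
    "chocolate":  (0.8, -0.9),
    "hazelnut":   (0.85, -0.7),
    "raspberry":  (-0.9, 0.8),
    "cherry":     (-0.8, 0.7),
    "blackberry": (-0.7, 0.9),
}

singles = {k: singleton_count(kmeans(points, k))
           for k in range(2, len(points) + 1)}
for k, count in singles.items():
    print(f"{k} clusters: {count} with a single addition")
```

Once the cluster count reaches the number of additions, every cluster is a singleton, which is the extreme form of the "too many" problem.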
With just 2 clusters, there was 1 large cluster, and 1 smaller one that seems to group additions in a promising way, listed below. You could probably mix any 2 of these successfully.
rye, maple, raspberry, cherry, oak, bourbon, cinnamon, coffee, chocolate, hazelnut, orange peel, blackberry, smoke, pecan, vanilla, whiskey
This grouping was identical to the list above, except that maple was put into its own cluster, where it remains for much of the rest of the analysis. I’m not sure whether this is because it is significantly different from other additions, or because it has fewer votes, as it was added to the database after the other additions. (Go vote!)
As we get into more, smaller clusters, we see some more interesting patterns emerge. Weird, unpopular additions cluster in #5. There seems to be a summer flavor grouping in #6. I’m intrigued by the grouping in #3: it seems like a mix of herbal and fruit ingredients.
1: rye, raspberry, cherry, oak, orange peel, blackberry
3: lemon grass, apple, ginger, peach, grapefruit, seeds of paradise, blueberry, lemon peel, pear, elderflower, juniper berries, coriander, rose hips, apricot, cranberry
4: bourbon, cinnamon, coffee, chocolate, hazelnut, smoke, pecan, whiskey
5: piña colada, coconut, chai, bacon, mint, anise, chicory, cardamom, peppercorn, sweet potato, pumpkin, peanut butter
6: watermelon, hibiscus, chamomile, rhubarb, basil, cucumber, green tea, lemon pepper, plum, strawberry
At 12 clusters, smaller groupings are emerging. Rye and oak (#0) are no longer grouped with the berry grouping (#4). The weird additions are still sticking together in #7. I’m curious why sweet potato and pumpkin aren’t clustering together. Other surprising pairs that aren’t clustering together are chicory and coffee, and apricot and peach.
0: rye, oak
1: watermelon, rhubarb, cucumber
3: coconut, chai, anise, chicory, cinnamon, cardamom, pumpkin
4: raspberry, cherry, orange peel, blackberry
5: bourbon, coffee, chocolate, hazelnut, smoke, pecan, whiskey
6: lemon grass, grapefruit, lemon peel, apricot
7: piña colada, bacon, mint, basil, peanut butter
9: sweet potato
10: apple, lavender, hibiscus, chamomile, green tea, peppercorn, lemon pepper, rose hips
11: ginger, peach, seeds of paradise, blueberry, pear, elderflower, juniper berries, coriander, plum, strawberry, cranberry
At this point, we’re seeing more and more clusters with single additions. We’re also seeing some clusters being refined. Compare #3 from 12 clusters above with #16 below: coconut, pumpkin and cinnamon aren’t quite as similar to the other ingredients. We still have 2 larger clusters: #1 seems to be more tart/acidic, and #5 seems to be more citrusy/spicy.
0: raspberry, cherry, orange peel, blackberry
1: apple, watermelon, hibiscus, chamomile, rhubarb, green tea, peppercorn, lemon pepper, rose hips, strawberry
2: bacon, peanut butter
4: cinnamon, pecan
5: lemon grass, ginger, peach, grapefruit, seeds of paradise, blueberry, lemon peel, pear, elderflower, juniper berries, coriander, apricot, cranberry
6: sweet potato
12: bourbon, coffee, whiskey
13: piña colada, mint, cucumber
14: chocolate, hazelnut
16: chai, anise, chicory, cardamom
At 30 clusters, there are 18 clusters with single ingredients, which I’ve removed below. The ones that remain are likely quite similar. Note that #4 has stayed steady since 12 clusters. Bourbon and whiskey (#1) aren’t a surprise to see together. There are some really solid groupings still at this point.
0: hibiscus, strawberry
1: bourbon, whiskey
4: raspberry, cherry, orange peel, blackberry
5: piña colada, cucumber
6: bacon, peanut butter
9: apple, rhubarb, blueberry, pear, cranberry
15: lemon grass, grapefruit
16: peppercorn, lemon pepper
24: ginger, peach, lemon peel, elderflower, apricot
26: coffee, chocolate, hazelnut
27: seeds of paradise, juniper berries, rose hips
29: chai, chamomile, anise, chicory
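Which pairings hold steady as the cluster count grows can also be checked mechanically. The sketch below takes a subset of the groupings listed in this article at 12 and at 30 clusters and keeps only the pairs that share a cluster in both runs; `co_clustered_pairs` is a hypothetical helper name, and only some of the listed clusters are included to keep the example short.

```python
from itertools import combinations

def co_clustered_pairs(groups):
    """Every unordered pair of additions that shares a cluster."""
    pairs = set()
    for members in groups:
        pairs.update(frozenset(pair) for pair in combinations(members, 2))
    return pairs

# Subset of the groupings listed at 12 clusters (#4, #5, #6).
at_12 = [
    ["raspberry", "cherry", "orange peel", "blackberry"],
    ["bourbon", "coffee", "chocolate", "hazelnut", "smoke", "pecan", "whiskey"],
    ["lemon grass", "grapefruit", "lemon peel", "apricot"],
]

# Subset of the groupings listed at 30 clusters (#1, #4, #15, #24, #26).
at_30 = [
    ["bourbon", "whiskey"],
    ["raspberry", "cherry", "orange peel", "blackberry"],
    ["lemon grass", "grapefruit"],
    ["ginger", "peach", "lemon peel", "elderflower", "apricot"],
    ["coffee", "chocolate", "hazelnut"],
]

# Pairs that are together in both runs.
stable = co_clustered_pairs(at_12) & co_clustered_pairs(at_30)
for pair in sorted(sorted(p) for p in stable):
    print(", ".join(pair))
```

Pairs such as bourbon and coffee drop out because they are together at 12 clusters but have been split apart by 30.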
I also ran a few tests on high numbers of clusters to see what still remained close. At 50 clusters, the following were still paired:
lemon grass, grapefruit
I wanted to see which two of the 54 additions were the closest, so I ran it with 53 clusters, and wasn’t surprised to see that the two closest additions were:
Conclusion of homebrew addition clusters:
K-means clustering provides some interesting insights into the data. While there are some surprising connections I wouldn’t have thought about, the groupings make sense in a way that validates the data. Further analysis could be done with other clustering algorithms.
If you’d like to see the full results, you can download all the iterations here.