LATEST posts

BFS (Big Freaking Storage)

by Anton Mazhurin

on September 20, 2018

Introduction: One of the basic concepts in machine learning is the idea of splitting datasets into training and testing parts. However, in any production environment, another important split is almost unavoidable: the split between training and scoring execution flows. Training models is CPU-intensive while scoring is usually not. Fortunately, we don’t have to retrain models every time because a trained model would most likely be good enough for some period of time depending on the nature of the application. …


(Image: Trees in Blossom, Piet Mondrian, 1912) A few weeks ago in our research seminar, Brian proposed a discussion about Mondrian trees as implemented by skgarden (a scikit-learn extension specializing in tree-based models). In their implementation, these are Bayesian decision tree objects with a nice scheme for propagating uncertainty, and they seemed fast to train. When collected in ensembles, they seem to become a serious contender to random forests. Due to the nature of our products, we are always looking for scalable …


Bootstrapping Data for Significance Testing

by Tawsif Khan

on July 27, 2018

This is a question I came across on Quora – Why does it seem as if there is a disproportionate number of socially inept mathematicians/mathematics students compared to the average population? To validate whether math graduates really suffer from low social skills, we can set up an experiment to compare the differences in the social skills of people who studied math and those who didn’t. Technically, this is an example of a test for statistical significance. In the retail industry, …


So this is what happens: you begin building machine learning systems to solve problems, to provide predictions and estimates of variables whose knowledge improves operations within a certain business context. They evolve and gain substance, become better, sharper, and more precise. Complexity arises and needs to be tamed. Your systems grow as your models need to be more flexible, scalable, agnostic, etc., and then there comes a time when, for your own sanity, a more structured approach becomes a necessity. …


AI: The Next Evolution of Automation

by Brian Keng

on July 12, 2018

At Rubikloud, we believe that AI is causing a monumental shift in the way retailers run their businesses. However, you might be surprised that it’s not their fleet of flying delivery drones, nor their checkout-less stores that are the most impactful to their business but rather their uses of AI in their core business. We think that the real impact of AI for businesses will be beneath the surface, not with talking robots or self-driving cars, but rather for the …


How to Improve Spark SQL Query Times

by Rakesh Thakoordyal

on July 5, 2018

Superior Machine Learning and Artificial Intelligence solutions thrive in their ability to find those “golden nuggets” of value-added insights inside deep lakes of heterogeneous data. To do this, you often have to cruise through terabytes or petabytes of data, running complicated SQL queries with aggregations, analytical functions, subselects, and numerous table joins – just to name a few. Running such queries against big data sets can be frustrating when it comes to run-time. It’s not uncommon to hear the “start …


At Rubikloud, we are big fans of iterative development for our software, for our ML models and especially for our table tennis ranking system! Last year I wrote about Building A Table Tennis Ranking Model using the Bradley-Terry model and Google Sheets. The model gives a single rating for each player (not unlike the Elo rating system in chess). One drawback of this model is that it has no measure of “confidence”. For example, a new player could have a …


What to Ask Before Joining an AI Startup

by Brian Keng

on April 3, 2018

A lot of the folks I’ve been meeting recently have been so eager to join an AI startup that they forget that not all startups are created equal. In this post, I want to share with you what to look out for when joining an AI startup. To start off, there are a bunch of questions relating to the founders, team, funding, vision, business model, learning opportunities, etc. that you should ask; these questions will probably cover 80-90% of what you …


LSTM, RFM, LMFAO – Making Sense of Data Science Acronyms with a Deep Dive

by Alexander Hinton & Chris Dulhanty

on February 26, 2018

Hello, world! We are Alexander Hinton and Chris Dulhanty, two University of Guelph students who have recently gone down the machine learning rabbit hole through enrollment in Professor Graham Taylor’s Introduction to Machine Learning class at the U of G last fall. The course provided us with a fantastic overview of the world of machine learning (ML) algorithms and techniques. In addition, we had the opportunity to get our hands dirty in a real-world machine learning problem via a unique …


The Hard Thing about Machine Learning

by Brian Keng

on August 21, 2017

Much of the buzz around machine learning lately has been around novel applications of deep learning models. They have captured our imagination by anthropomorphizing them, allowing them to dream, play games at superhuman levels, and read x-rays better than physicians. While these deep learning models are incredibly powerful with incredible ingenuity built into them, they are not humans, nor are they much more than “sufficiently large parametric models trained with gradient descent on sufficiently many examples.” In my experience, this is …


Building A Table Tennis Ranking Model

by Brian Keng

on July 4, 2017

At Rubikloud, our wonderful Operations team regularly plans fun activities that the whole company can participate in, such as movie nights, ceramic painting, and curling, to name a few. However, my favourite activity by far is visible as soon as you get off the elevator. (Image: Championship match at our annual ping pong tournament) Many of our Rubikrew are big fans of table tennis. In fact, we’ve held an annual table tennis tournament for all the employees for three years running …


Data Science at Rubikloud

by Brian Keng

on April 26, 2017

Over the last three years, Rubikloud has had some tremendous growth, going from a team of fewer than a dozen to a fast-growing venture-backed startup with more than 80 people. In this short time, we’ve assembled a team of talented engineers, retail experts and, of course, incredibly bright data scientists. With access to huge amounts of retail data spanning 10 countries and over $100 billion in retail transactional data, Rubikloud is leveraging data science to automate the thousands of …


What makes a good recommender system?

by Anton Mazhurin

on March 15, 2017

“I think you should move to Australia. You will be a lot happier there!” How do you measure the quality of such a recommendation? In our tongue-in-cheek example, the basic approach would be to let a recommender system choose a large number of people, say 1,000, who, from the recommender system’s perspective, will be happier in Australia. Then split them in half, relocate the first half to Australia, and ask all of them: “Are you happier now?”, and …


Gradient Boosting to the Xtreme – Part 2

by Rob Chin

on February 2, 2017

In this second blog post in this series on Extreme Gradient Boosting, we will be focusing on how to solve the immediate issue of overfitting that can occur when we have a single decision tree classifier. Please check out part 1, which covers in detail the principles of decision tree classifiers and some of the challenges that Rubikloud faces from a machine learning point of view. As alluded to in part 1, overfitting can be resolved by pruning of the tree. …


Gradient Boosting to the Xtreme – Part I

by Rob Chin

on January 23, 2017

A key element of Rubikloud’s philosophy around software is that machine learning should be embedded into business software, not necessarily to replace human intuition, but rather, to augment and enhance it. For the reasons why we believe that to be true, you can refer to Kerry’s posts here: HOW TO USE MACHINE LEARNING TO FURTHER RETAIL ANALYTIC CAPACITY and AN INSIDER’S VIEW OF RETAIL: PART ONE. From a data science perspective, the challenges that we face at Rubikloud on a day …


Definition of Done

by Raheel Govindji

on January 23, 2017

At Rubikloud, we focus heavily on shipping well-engineered products that are driven by data science and machine learning. That means we spend a lot of time prototyping and working through very large datasets and iterating over performance and feature considerations. This process requires many teams to work together to achieve a complete and shippable product. However, as any Product Manager knows, not everyone can tell when something is complete and shippable. Many teams struggle to clearly establish their DoD, or …