One of the basic concepts in machine learning is splitting a dataset into training and testing parts. In any production environment, however, another important split is almost unavoidable: the split between training and scoring execution flows. Training models is CPU-intensive, while scoring usually is not. Fortunately, we don't have to retrain models for every request: a trained model is usually good enough for some period of time, depending on the nature of the application. An obvious solution is to train models periodically, save them in binary storage, and use those pre-trained models for instant scoring whenever needed. In other words, we need to store different versions of models trained at different timestamps. In addition, these models might depend on other models, and our storage should be flexible enough to store only one binary copy of a shared model, even though many other models might refer to it.
The diagram below shows an example of a graph describing model dependencies as well as the magnitude of the number of models at each level for every training cycle. We need storage that not only stores the models but also supports these inter-model dependencies.
This post is about BFS: an enhanced key-value binary storage solution we use at Rubikloud for storing pre-trained models.
What is BFS?
BFS is a generic key-value storage with some additional functionality.
The key features:
- Every value has optional metadata. This might be model version, error messages, context information, etc.
- Every key has an optional timestamp. During read operation, BFS returns either the most recent value or the latest value before the specified datetime.
- All the values are stored in binary format in blob storage. BFS can also cache the values in a local folder.
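The timestamped-read rule above can be sketched in a few lines. This is an illustrative model of the behaviour, not BFS's actual code: given all stored versions of a key, return the most recent value, or the latest value written at or before a requested datetime.

```python
from datetime import datetime

def resolve_version(versions, as_of=None):
    """versions: list of (timestamp, value) pairs, in any order.

    Returns the most recent value, or, if as_of is given, the latest
    value written at or before as_of. Returns None if nothing matches.
    """
    candidates = versions if as_of is None else [
        (ts, v) for ts, v in versions if ts <= as_of
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]

versions = [
    (datetime(2018, 1, 1), "model-v1"),
    (datetime(2018, 2, 1), "model-v2"),
    (datetime(2018, 3, 1), "model-v3"),
]
resolve_version(versions)                         # most recent: "model-v3"
resolve_version(versions, datetime(2018, 2, 15))  # latest before: "model-v2"
```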
Why did we choose the name BFS?
At the time, we were inspired by SpaceX. The company was choosing a name for its next rocket for the Mars expedition and decided to call it Big F… Rocket. We had no better idea than to name our new storage project Big F… Storage.
The BFS interface is simple. It has only two functions: read and write.
To write, you provide a list of BFSKey instances and a matching list of BFSValue instances.
To read, you provide a list of BFSKey instances and receive a list of BFSValue instances in return.
- BFSKey contains subkeys (a list of strings) and an optional timestamp.
- BFSValue contains the value to store and optional metadata.
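The interface described above might be sketched as follows. The class and field names come from the post, but their exact signatures are assumptions; the in-memory dictionary stands in for the real database-plus-blob-storage backend (subkeys is shown as a tuple rather than a list only so it can serve as a dictionary key).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class BFSKey:
    subkeys: tuple                    # e.g. ("store_123", "demand_model")
    timestamp: Optional[datetime] = None

@dataclass
class BFSValue:
    value: bytes                      # serialized model
    metadata: dict = field(default_factory=dict)

class BFS:
    """Toy stand-in: keys map straight to values in memory."""

    def __init__(self):
        self._store = {}

    def write(self, keys, values):
        for key, value in zip(keys, values):
            self._store[key.subkeys] = value

    def read(self, keys):
        return [self._store[key.subkeys] for key in keys]

bfs = BFS()
key = BFSKey(subkeys=("store_123", "demand_model"))
bfs.write([key], [BFSValue(value=b"model-bytes", metadata={"version": 1})])
result = bfs.read([key])
```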
A BFS reference is an additional feature which can be extremely useful when many models share a reference to another, common model. In Figure 1, for example, model C references both model B2 and model A. The problem is that during serialization this shared model would be duplicated inside every model that contains a reference to it.
The BFS solution is to mark a shared object as a BFS reference. A BFS reference is stored only once, and each object that references it stores only a key for accessing the referenced object when needed. This key is used when reading and deserializing the referencing object. This gives us huge savings in both space and write times, because the referenced models are often several MB large and there are many thousands of models referencing them.
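The storage-side effect of BFS references can be illustrated with a minimal sketch (all names here are illustrative, not BFS's actual API): the shared model is written once under its own key, and every referencing model carries only that key.

```python
class RefStore:
    """Toy store: shared payloads are kept once, addressed by key."""

    def __init__(self):
        self.blobs = {}              # key -> serialized bytes, stored once

    def put_shared(self, key, payload):
        self.blobs[key] = payload
        return key                   # referencing objects keep only this key

    def resolve(self, key):
        return self.blobs[key]

store = RefStore()
shared_key = store.put_shared("model_A", b"...large serialized model...")

# Two models reference model A; each stores the short key, not the blob.
model_C = {"weights": b"c-weights", "base_model": shared_key}
model_D = {"weights": b"d-weights", "base_model": shared_key}

store.resolve(model_C["base_model"])   # resolves to the single stored copy
```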
Under the hood
BFS uses a database to store the storage index and the metadata; the binary payloads themselves live in blob storage.
Every value BFS downloads from blob storage is cached in a local subfolder. The caching mechanism is robust enough to handle race conditions when multiple BFS instances in different processes download the same model.
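One common way to make such a download cache safe against races is the write-then-atomic-rename pattern (this is an assumption about the approach, not BFS's actual code): each process writes to its own temporary file and then renames it into place, so concurrent downloads may duplicate work but readers never see a partially written file.

```python
import os
import tempfile

def cache_blob(cache_dir, name, fetch):
    """Return the cached path for `name`, downloading via fetch() if needed."""
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        return path                      # already cached by some process
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch())             # download from blob storage
        os.replace(tmp, path)            # atomic rename into place
    finally:
        if os.path.exists(tmp):          # clean up if fetch failed
            os.remove(tmp)
    return path

cache_dir = tempfile.mkdtemp()
path = cache_blob(cache_dir, "model_A.bin", lambda: b"model-bytes")
# A second call finds the cached file and never re-downloads.
path_again = cache_blob(cache_dir, "model_A.bin", lambda: b"other-bytes")
```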
- Custom pickling
To support BFS references, BFS uses a custom pickling mechanism.
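Python's pickle module provides a hook well suited to this: persistent IDs. The sketch below (an assumption about the mechanism, not BFS's actual code) replaces a shared object with its key during serialization and resolves the key back to the object during deserialization, so the shared model never enters the pickle stream.

```python
import io
import pickle

SHARED = {}   # stand-in for BFS: key -> shared model object

class SharedModel:
    def __init__(self, name):
        self.name = name

class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        if isinstance(obj, SharedModel):
            SHARED[obj.name] = obj       # stored once, outside the pickle
            return obj.name              # only the key enters the stream
        return None                      # pickle everything else normally

class RefUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        return SHARED[pid]               # resolve the key on read

base = SharedModel("model_A")
models = [{"id": i, "base": base} for i in range(3)]

buf = io.BytesIO()
RefPickler(buf).dump(models)             # `base` serialized as a key, once
buf.seek(0)
restored = RefUnpickler(buf).load()
```

All three restored models end up pointing at the same single shared object.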
BFS, in a nutshell, is a key-value storage with a number of additional features designed specifically to serve our particular need: storing models with wide dependencies between them. It is usually not an easy decision to implement something from scratch rather than adopt or extend an existing solution, but in the case of BFS it worked very well. Using an existing solution like BerkeleyDB, Dynamo, or Riak would have required us to implement the additional features we needed (model timestamps and model references) ourselves. BFS is now an essential part of many Rubikloud deployments.