Which tool should we use to version our data? And to compare results between experiments? How to deploy our machine learning models? As a lead data scientist at Sicara, I've seen a huge diversity of projects, and such questions frequently echoed in my mind when navigating the complex landscape of Machine Learning. Through trial and error, I've homed in on a select set of technologies and best practices that now form the bedrock of all my projects. To ease the integration of all these tools and share related best practices, I’ve created Sicarator: a Machine Learning Project Generator.
You run it in your terminal, it asks you some questions about your project and then generates an initial codebase with all recommended tools, boilerplates and configurations.
One key aspect to understand is that Sicarator is opinionated: it has been designed with some precise convictions on how Machine Learning projects should be led by experienced engineers. Whether you agree with the opinions or not, understanding them is a start. And hey, that’s precisely what this read is for!
The Machine Learning Ecosystem
A visual representation of the ML ecosystem is enough to highlight its enormity:
Yes, it can be overwhelming… let’s dissect it together! We can identify several core categories within this landscape:
- Infrastructure technologies: The foundation upon which everything runs.
- Analytics/Visualization solutions: Visualizing data or metrics is fundamental in any ML project, and a lot of different technologies have been designed for that purpose.
- Open-source MLOps libraries and frameworks: A bunch of tools that help Data Scientists at many stages: from building data pipelines to training and deploying models.
- ML Platforms: Think of these as the all-inclusive resorts of ML: many features out of the box, managed services, and infrastructure, all bundled up in one environment. Amazon SageMaker and Google Vertex AI are two examples, with hundreds more out there.
One Tool To Rule Them All?
One-size-fits-all ML platforms seem attractive, right? These solutions indeed promise several advantages:
- A complete environment with many features
- Ease of Use, even without strong coding skills
- Quick Setup
However, in addition to not always keeping these promises, we can identify several disadvantages to using them:
- Lack of Standard Development Practices: with no real code versioning in online notebooks, code reviews or rollbacks become complex - and I am not even talking about Continuous Integration, Linters, or other code quality best practices.
- Lack of Customization: you may find yourself constrained by the available infrastructure resources, deployment configurations, or the information displayed when comparing experiments, for instance.
- Poor Interoperability: integrating with other tools or systems can be more challenging compared to open-source alternatives.
- High Costs: initial development cost savings are often outweighed by high long-term running expenses.
- Vendor Lock-in Risk: switching solutions later could mean a significant overhead of recreating pipelines and transferring data.
This got me pondering — how could I get the benefits offered by ML platforms without compromising on good development practices, customization, and the flexibility of not being tethered to a single vendor? With this goal in mind, I set out to cherry-pick tools from the open-source realm and combine them to suit all my ML projects’ needs. Here were my 4 criteria:
- Active community: many stars on GitHub and frequent releases
- Great DX (Developer Experience): Prioritizing my team’s satisfaction and productivity
- Unix Philosophy: “Do one thing and do it well”
- Code-centric: Code and CLI (Command-Line Interface) rather than GUI (Graphical User Interface)
ML platforms tend to move the data scientist away from code, but I strongly believe that it’s the wrong direction. Code is just so powerful: it can be modified to your needs, versioned, reviewed, factorized, commented, unit tested, generated with scaffolding tools, or even with AI… what a pity it would be to miss out on it! For this reason, I also believe that solid programming skills are essential for data scientists.
However, even with strong development expertise, setting up a bunch of ML tools can take some time. That’s one of the reasons I created Sicarator: to speed up the initial setup without compromising the quality of the project.
Now, to fully grasp the philosophy behind Sicarator and the deliberate choices made during its development, it's crucial to delve into some specific challenges my coworkers and I encountered on our projects, along with the solutions we adopted.
Efficient ML Experimentation
When you're working on machine learning projects, you're constantly trying different things. Maybe you change up your data, tweak your model, or try a new post-processing algorithm. Experiment tracking is like keeping a detailed diary of all these changes. Why bother? Because in a field where small changes can make a big difference, remembering what you did and the results it yielded - either 5 minutes ago or 5 months ago - is crucial. It helps you and your team stay on the same page, makes sure you can reproduce your best work in the future, and saves you from repeating past mistakes.
Besides end-to-end ML platforms that all include experiment tracking, there's a variety of specialized open-source tools available. Names like MLflow or Weights & Biases might ring a bell. Each offers unique features, but when you're faced with a plethora of options, selecting the right one for your needs becomes a daunting task. This abundance of choice is a prime example of the dilemmas we often encounter in the ML ecosystem.
This is where our journey with DVC and Streamlit begins. At Sicara, we've been using DVC (Data Version Control) for data versioning and to craft reproducible ML pipelines for over four years. Similarly, Streamlit has become our go-to for whipping up interactive web applications to visualize data and results.
For experiment tracking, we looked for a solution that fit naturally with the tools and processes we were already using. Back in 2022, I penned an article detailing how one might marry DVC with MLflow to achieve data versioning coupled with experiment tracking. While the integration of DVC and MLflow served its purpose, it felt somewhat cobbled together, with overlapping functionalities that didn't quite mesh seamlessly. Both tools offered features for experiment tracking, creating a redundancy that could be confusing. It dawned on us that a more harmonious and flexible solution was already within our grasp: pairing DVC with Streamlit!
This combo felt more natural, with each tool playing to its strengths without stepping on the other's toes. DVC's robust data versioning meets Streamlit's interactive visualizations, yielding a user-friendly, flexible interface for monitoring our experiments.
The key idea here is flexibility. The way you explore and compare experiments can vary significantly across projects and will almost certainly shift as a project evolves. Streamlit caters to these fluid needs: a tweak in the Python script, and voilà, your experiment tracking interface reflects the changes instantly.
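To make this more concrete, here is a minimal sketch of what such a comparison page could look like. It is not the code Sicarator generates: the `experiments/<name>/metrics.json` layout, the flat numeric metrics, and the helper names `load_experiments` and `comparison_rows` are all assumptions made for illustration. The point is only that the comparison logic is plain Python, so changing what the page shows is a matter of editing a few lines.

```python
import json
from pathlib import Path


def load_experiments(root: Path) -> dict[str, dict[str, float]]:
    """Read one metrics.json per experiment folder (hypothetical layout)."""
    experiments = {}
    for metrics_file in sorted(root.glob("*/metrics.json")):
        experiments[metrics_file.parent.name] = json.loads(metrics_file.read_text())
    return experiments


def comparison_rows(experiments: dict[str, dict[str, float]], baseline: str) -> list[dict]:
    """Build one row per experiment, with each metric's delta against the baseline."""
    reference = experiments[baseline]
    rows = []
    for name, metrics in experiments.items():
        row = {"experiment": name}
        for metric, value in metrics.items():
            row[metric] = value
            if metric in reference:
                row[f"{metric}_delta"] = round(value - reference[metric], 4)
        rows.append(row)
    return rows


def main() -> None:
    # Streamlit is imported lazily so the helpers above work without it.
    try:
        import streamlit as st
    except ImportError:
        print("Install streamlit to render the comparison page.")
        return

    st.title("Experiment comparison")
    experiments = load_experiments(Path("experiments"))  # assumed folder name
    if not experiments:
        st.info("No metrics found under ./experiments")
        return
    baseline = st.selectbox("Baseline experiment", list(experiments))
    st.dataframe(comparison_rows(experiments, baseline))


if __name__ == "__main__":
    main()
```

Saved as, say, `app.py`, this would be launched with `streamlit run app.py`. Showing a different delta, adding a filter, or plotting a metric only requires touching `comparison_rows` or `main`.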
Now, you might be wondering why we've delved so deep into this DVC and Streamlit discussion, right? Well, it's because their powerhouse combo as an experiment tracking system is not just a preferred choice at Sicara; it's also a key feature I’ve embedded right into Sicarator!
Robust and Flexible Model Serving Infrastructure
Beyond the crucial step of tracking experiments, another key phase in the machine learning lifecycle is deploying your models on the right infrastructure. In this stage, too, our compass pointed us towards a solution that was both code-centric and endowed with the flexibility to adapt to various needs and environments.
This code-centric approach to infrastructure, far from a novel concept, reflects a well-established best practice in software engineering known as Infrastructure as Code (IaC). This principle advocates for managing and provisioning computing infrastructure through machine-readable script files, instead of relying on error-prone and time-consuming configuration GUIs. We opted for Terraform, a prominent figure in the IaC domain, to manage our cloud services in a streamlined manner.
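The idea behind Infrastructure as Code can be illustrated with a toy example in Python. This is neither Terraform nor its actual algorithm, and the resource names and settings are made up. It only shows the general principle: you describe the infrastructure you want as data, and a program works out the steps needed to get there from the current state.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the (action, resource) steps that turn `actual` into `desired`."""
    steps = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            steps.append(("create", name))
        elif name not in desired:
            steps.append(("delete", name))
        elif desired[name] != actual[name]:
            steps.append(("update", name))
    return steps


# Hypothetical desired state, as it would be written and reviewed in code.
desired = {
    "model_api_service": {"min_instances": 1, "max_instances": 4},
    "request_logs": {"retention_days": 30},
}
# Hypothetical state currently running in the cloud account.
actual = {
    "model_api_service": {"min_instances": 1, "max_instances": 2},
    "old_notebook_server": {"size": "large"},
}

for action, resource in plan(desired, actual):
    print(action, resource)
```

Because the desired state lives in a file, a change such as raising `max_instances` goes through the same versioning and review process as any other code change.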
Specifically, when deploying projects within the AWS ecosystem, we combined services such as API Gateway, Auto Scaling Groups (ASG), Elastic Container Service (ECS), and CloudWatch to deploy and monitor our models. This approach plays to the strengths of each service and yields an auto-scaled, resilient, and cost-efficient infrastructure. More importantly, thanks to IaC, it remains easily and securely adaptable to meet distinct demands.
We’ve integrated and documented this Terraform-powered AWS infrastructure right into Sicarator. For those using GCP, we offer a serverless deployment variant via Cloud Run. Azure's integration is still on the horizon, but we're all for community contributions. So, if Azure is your cloud of choice, I'll be happy to integrate your Pull Request to add a new cloud option!
Sicarator is yours, make good use of it!
In essence, Sicarator is more than a tool; it's a crystallization of our vision for efficient, streamlined machine learning projects, emphasizing a code-centric philosophy and best practices born from our hands-on experiences. The synergy of DVC with Streamlit for adaptable experiment tracking, and the strategic employment of Terraform for flexible infrastructure management, are just glimpses into the comprehensive environment Sicarator provides.
As I look ahead, I’m excited about the road unfolding before us. The field of machine learning is anything but static, and as new challenges arise, so do opportunities for evolution and improvement. Sicarator is my contribution to this ever-progressing journey, and I’m committed to nurturing its growth, expanding its capabilities, and, most importantly, listening to the community for feedback and insights.
So, I encourage you to dive into Sicarator, explore its capabilities, and push its boundaries. Your experiences, feedback - through GitHub issues - and contributions are crucial for its growth and improvement. The road ahead is ours to travel together!