Adopt

25

Airflow

Managing data workflows is essential for data scientists and involves processes such as data preparation and model building pipelines. The complexity of such management has highlighted the inadequacies of traditional orchestration tools like CRON. To address these challenges, Airbnb developed Airflow in 2014. Airflow is an open-source Python library designed for task orchestration, enabling the creation, deployment, and monitoring of complex workflows.

 

Airflow represents complex data workflows as directed acyclic graphs (DAGs) of tasks. It acts as an orchestrator, scheduling tasks based on their interdependencies while offering a user-friendly web interface for workflow visualization. The library's flexibility in handling various task types simplifies the automation of data processing tasks, contributing to Airflow's popularity in contemporary data management.

 

For data scientists, setting up workflows with steps like data preprocessing, model training, and performance evaluation can become cumbersome with intricate Bash scripts, which are hard to maintain. Airflow provides a more maintainable solution with its built-in monitoring and error handling capabilities.
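As an illustration, here is a minimal sketch of such a pipeline expressed as an Airflow DAG; the task names and function bodies are placeholders, and the `schedule` argument assumes a recent Airflow 2.x version (older versions use `schedule_interval`).

```python
# Minimal sketch of an ML training pipeline as an Airflow DAG (placeholder tasks).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess_data():
    ...  # e.g. load raw data, clean it, and write features to storage


def train_model():
    ...  # e.g. fit a model on the prepared features and save the artifact


def evaluate_model():
    ...  # e.g. compute metrics and fail the task if they regress


with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    preprocess = PythonOperator(task_id="preprocess_data", python_callable=preprocess_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    # The dependency chain below defines the DAG: preprocess -> train -> evaluate.
    preprocess >> train >> evaluate
```

Once this file is placed in the DAGs folder, the scheduler runs the tasks in order and the web interface displays the graph, its runs, and the status of each task.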

 

While Airflow is a popular choice, there are alternatives suited to specific needs. For instance, Dagster facilitates direct data communication between tasks without the need for an external storage service. Kubeflow Pipelines offers specialized ML operators and is geared towards Kubernetes deployment but has a narrower community due to its ML focus. Meanwhile, DVC caters to the experimental phase, providing pipeline definitions and integration with experiment tracking, though it may not be ideal for production environments.

 

OUR PERSPECTIVE

We recommend Airflow for the robust orchestration of diverse tasks, including production-level Machine Learning pipelines. For developmental stages and model iteration, tools like DVC are preferable due to their superior experiment tracking features.

26

Infrastructure as Code

Developing machine learning solutions necessitates the allocation of resources such as databases and compute clusters. Traditionally, setting up these resources was done manually, leading to a higher risk of human error and making it more difficult to redeploy infrastructure quickly.

Infrastructure as Code (IaC) offers a method to create and manage a project's infrastructure resources. With infrastructure defined in files, its setup is automated and version-controlled. This approach minimizes errors and enables environments to be replicated quickly and infrastructure to evolve seamlessly.

 

Although widely adopted in web and data engineering, IaC is less prevalent in machine learning projects. Using IaC to define data storage services, model training environments, and the infrastructure serving model predictions gives teams control over these components and their costs, and, importantly, keeps them scalable and adaptable.

Using IaC effectively requires proficiency with tools like Terraform and adherence to their best practices. Infrastructure as code should be maintained with the same attention to detail and quality as application code.
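Terraform configurations are written in HCL rather than Python; purely as an illustration of the IaC principle, the sketch below uses Pulumi, an IaC tool with a Python SDK, to declare a storage bucket for model artifacts. The resource name and tags are assumptions.

```python
# Illustrative Pulumi program (Python SDK): declares an S3 bucket for model
# artifacts so the resource definition is versioned alongside the project's code.
# "ml-artifacts" is a hypothetical resource name.
import pulumi
import pulumi_aws as aws

artifacts_bucket = aws.s3.Bucket(
    "ml-artifacts",
    tags={"project": "ml-demo", "managed-by": "pulumi"},
)

# Expose the generated bucket name as a stack output.
pulumi.export("artifacts_bucket_name", artifacts_bucket.id)
```

Running `pulumi up` (or `terraform apply` for an HCL configuration) then creates or updates the resources to match the declared state, which is what makes environments reproducible.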

 

OUR PERSPECTIVE

We advocate for the use of Infrastructure as Code in machine learning projects. This method offers a more agile, scalable, and efficient way to manage infrastructure, facilitating quicker deployments and enhanced consistency. IaC also improves security and maintenance, which are crucial in ML projects.

27

Poetry

In Python projects, dependency management has traditionally been handled by Pip, through requirements.txt files, or by Conda, both of which manage only first-level (direct) dependencies. This approach often leads to compatibility issues and version discrepancies across environments (e.g., development, production).

 

Poetry is a tool designed to address these challenges in Python dependency management and packaging. Its key features include:

  • Robust Dependency Resolution: Poetry employs a sophisticated algorithm for dependency resolution to prevent conflicts between libraries, streamlining the installation process and minimizing the need for manual conflict resolution.
  • Version Locking: Poetry ensures consistency by locking library versions across all developers' environments. This eliminates the common problem of "it works on my machine," where discrepancies in dependency versions lead to inconsistent behavior.
  • Ease of Use: With an intuitive command-line interface and a unified configuration file, Poetry simplifies the management of dependencies and project settings, making it user-friendly for data scientists who might be less experienced in Python package management.

 

While Poetry offers capabilities for managing Python virtual environments, these features have limitations, such as the lack of automatic activation of the appropriate virtual environment. To overcome this, we suggest using Poetry in conjunction with Pyenv and its pyenv-virtualenv plugin, which enhances the overall development workflow.

 

OUR PERSPECTIVE

We strongly advocate for the use of Poetry as an indispensable tool for modern Python dependency management. Its effective approach to solving dependency resolution issues, along with the simplicity it brings to synchronizing development environments, makes it a superior choice for Python projects.

 

Trial

28

LangChain

Applications based on Large Language Models (LLMs) often share many common functional components. LangChain is an open-source framework aimed at simplifying their setup and orchestration. It provides a high-level interface for defining application logic while remaining agnostic of the LLM and/or vector store used. However, despite LangChain's promise of easy transitions between LLMs, migrating from one model to another in practice still requires adjusting prompts and parameters to maintain performance.

Currently in active development, LangChain still has some drawbacks:

  • Complex and sometimes unintuitive documentation.
  • The different components lack stability, although this issue can be mitigated with the recent isolation of key abstractions in the langchain-core module.
  • The intertwining of its code and the use of callbacks can complicate code navigation and debugging.

Other LLM frameworks like LlamaIndex or Haystack exist and present similar advantages and disadvantages. Alternatively, it is possible to not use a framework and call the different components (LLMs, vector databases, etc.) with custom code.

 

OUR PERSPECTIVE

We recommend using LangChain for prototyping an LLM project. By abstracting the calls to LLMs and embedding models, it allows a RAG pipeline to be developed in just a few lines of code. Beyond the prototype stage, the framework's added value must be weighed against the complexity it introduces, a trade-off that often depends on the developers' expertise.
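As a hedged illustration of that point, the sketch below builds a toy RAG pipeline; the module paths (LangChain 0.1.x-style imports), the OpenAI integrations, and the example documents are assumptions that depend on the chosen providers and framework version.

```python
# Toy RAG pipeline with LangChain: index a few texts, then answer a question
# with retrieval-augmented generation. Requires the faiss and OpenAI packages.
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA

# Index a handful of documents in an in-memory vector store.
texts = [
    "Airflow orchestrates data workflows as DAGs of tasks.",
    "Poetry locks Python dependency versions across environments.",
]
vector_store = FAISS.from_texts(texts, OpenAIEmbeddings())

# Build a retrieval-augmented QA chain on top of the retriever.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=vector_store.as_retriever(),
)

print(qa_chain.invoke({"query": "What does Airflow do?"}))
```

Swapping the vector store or the LLM mostly means changing these constructors, although, as noted above, prompts and parameters usually still need adjustment.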

LangChain is also a useful tool for training on LLMs, allowing exploration of different application types, models, prompt engineering strategies, or vector databases.

29

Qarnot

Reducing carbon footprint is becoming a crucial issue for businesses. However, machine learning model training is a major emitter of greenhouse gases.

Qarnot aims to limit these emissions by providing "low-carbon" cloud computing designed for graphics rendering jobs and deep learning model training. As of early 2024, Qarnot claims a 50% reduction in carbon footprint compared with other data centers in France, and a 90% reduction compared with those in the United States, thanks to a decentralized approach and near-complete reuse of the heat produced. For an hour of training, the carbon footprint reduction is approximately 1 kg CO2eq compared to traditional providers like AWS/GCP.

However, Qarnot has limitations compared to these providers:

  • There is only one type of GPU available, unlike the catalogs offered by AWS or GCP, which contain dozens.
  • Instances are not connected to the internet, which requires the use of Qarnot I/O buckets for data transfer. A Python SDK (open-source) is available for data transfer and task deployment on Qarnot; however, the transfer is relatively slow. Pooling resources helps address this issue but introduces new ones, such as job concurrency. Moreover, collaboration possibilities at the team/organization level are limited, with only billing being shared.

 

OUR PERSPECTIVE

In the face of the climate crisis, we support Qarnot's initiative to reduce training carbon footprint. The use of I/O buckets, task automation via Python, and workflow management with DVC have been sufficient to overcome most of its limitations. Therefore, we recommend testing the tool yourself. However, it is prudent to continue evaluating Qarnot in more diverse contexts before using it on a large scale.

30

vLLM

Managing inference and serving of Deep Learning models optimally is a complex issue, especially for Large Language Models (LLM), which are resource-intensive due to their size.

vLLM was created in 2023 to serve LLMs and optimize their inference. It allows a dedicated service, accessible via an API, to be deployed by configuring a Docker image. vLLM uses an efficient attention algorithm, PagedAttention, which enables up to 24x better performance than a traditional approach. Additionally, vLLM implements many features that facilitate the use of LLMs in production, such as continuous batching, which batches requests from different calls to make efficient use of the available resources. This very young tool is quickly gaining popularity, as evidenced by its 15k GitHub stars (early 2024) and the fact that Mistral provides a vLLM image for its Mixtral model.
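For illustration, here is a minimal offline-inference sketch using vLLM's Python API (the same engine backs the OpenAI-compatible HTTP server deployed via Docker); the model name is an assumption and requires a GPU with enough memory.

```python
# Minimal offline inference with vLLM's Python API; model choice is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # weights pulled from Hugging Face
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# PagedAttention and continuous batching are handled internally by the engine,
# so batching several prompts in one generate() call already benefits from them.
outputs = llm.generate(["Summarize what vLLM does in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)
```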

There are several alternatives to vLLM:

  • llama.cpp: This library enables the execution of LLMs in pure C/C++, which is particularly advantageous for performing model inference on small CPUs. However, this library was designed as an experimental tool and is not intended for production use.
  • Prior to vLLM, general-purpose tools like the Triton Inference Server and TensorFlow Serving were capable of handling various model types. Like vLLM, these tools are configurable via Docker and implement continuous batching. Although these older tools are well tested and validated across numerous scenarios, they are not as optimized as vLLM for serving LLMs.

OUR PERSPECTIVE

We have been following this technology since its release and have used it to run LLM inference on internal projects. Use it without fear in an experimental context to cap hardware costs (especially thanks to PagedAttention). For production use, we also recommend it, but keep in mind that this technology is still young and use it with caution.

31

LangSmith

The specificities of Large Language Models (LLM) have brought about specific needs for monitoring and data collection. LangSmith, developed by LangChain, aims to address these challenges by logging the various interactions with LLMs.

It incorporates numerous features around this logging, such as the ability to modify and rerun prompts via a "playground" interface, track performance and costs, or create datasets from the logs. Initially developed to integrate with LangChain, it can also be used independently via an SDK.
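As an example of the SDK-only usage, the sketch below logs a function's inputs and outputs as a LangSmith run via the `traceable` decorator; it assumes the `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY` environment variables are set, and the function body is a placeholder.

```python
# Minimal standalone LangSmith logging sketch (no LangChain required).
# Assumes LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY are set in the environment.
from langsmith import traceable


@traceable(name="answer_question")  # each call is recorded as a run in LangSmith
def answer_question(question: str) -> str:
    # Replace with a real LLM call; inputs and outputs are captured automatically.
    return "placeholder answer to: " + question


answer_question("What does LangSmith log?")
```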

However, LangSmith is still in beta version, and the user has no control over the versions (the tool is automatically updated): hence, instabilities are sometimes noted (for example, a regression on the playground feature). Furthermore, the transition to a potentially expensive paid model could restrict its accessibility for some users or organizations. Faced with these limitations, LangFuse, offering similar functionalities, emerges as an interesting open-source alternative.

 

OUR PERSPECTIVE

We have used LangSmith (via its SDK) on a Retrieval Augmented Generation project. Its integration was seamless, and we will use it on future projects (we will consider migrating to LangFuse if it becomes paid). The "playground" feature notably made it easier to involve the product owner and their domain expertise in prompt engineering iterations. We therefore recommend using LangSmith while taking the limitations above into account.

Assess

32

Guidance

With the recent emergence of Large Language Models (LLMs), the need for tooling to integrate these solutions robustly into our applications has grown. The open-source framework Guidance was created by Microsoft to enable complex templating of prompts. It allows interfacing with multiple LLMs (open-source or otherwise) and provides a toolbox for inference at the token level (grammars, token healing, etc.).

It differentiates itself from the LangChain framework (which has a much larger community) in the following points:

  • A particular emphasis on controlling the model output, i.e., the ability to constrain it to a specific output structure (or "grammar"). However, this key functionality is only available for open-source models, as models available via API do not provide the necessary details to apply this post-processing.
  • It is very easy to define functions that can be called by the model (similar to the "Function Calling" feature of OpenAI's Assistants API).
  • The framework is much less focused on using templates and pre-built chains like LangChain (the vision seems more oriented towards customization).

The formulation of prompts aims to be very accessible, but we find that the recently introduced syntax, which relies on an overloaded addition operator, can be quite difficult to read and write.
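To illustrate, here is a hedged sketch of that syntax with a constrained output; the model name is an assumption, and, as noted above, grammar constraints only apply to locally hosted open-source models.

```python
# Illustrative Guidance snippet using the overloaded "+" syntax; the model
# choice is hypothetical and must be a locally hosted open-source model.
from guidance import models, select, gen

lm = models.Transformers("mistralai/Mistral-7B-v0.1")

# Each "+" appends to the prompt; select() constrains the output to a fixed set,
# while gen() produces free text captured under the given name.
lm = lm + "Is Python an interpreted language? Answer: " + select(["yes", "no"], name="answer")
lm = lm + "\nOne-sentence justification: " + gen("justification", max_tokens=30)

print(lm["answer"], lm["justification"])
```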

Guidance's GitHub repository has several example notebooks to follow, but you will find fewer than for a more dominant framework like LangChain.

 

OUR PERSPECTIVE

Guidance is particularly powerful for templating and prompt generation. It is still in beta and its community remains small, so frequent major changes are to be expected.

For more general use cases, we recommend LangChain. More plug-and-play, it offers more mature features for memory management and integrations with vector stores.

33

End-to-end ML platform

Until the mid-2010s, Machine Learning projects involved a series of manual operations. The MLOps tools that have gradually emerged since then have allowed for the structuring and streamlining of an increasing portion of these tasks, leading to complete platforms that address the entire lifecycle of models.

Databricks, available since 2015, can be cited as one of the precursors to these platforms, as well as the ML services of the three main cloud providers: Google VertexAI, Amazon Sagemaker, and Azure ML.

These platforms enable rapid implementation of the main components of a project: infrastructure for training and predictions, a model registry, pipeline orchestration, experiment tracking, etc.

Without going into details about each platform, several common problem types can be identified:

  • Resource costs: They are always higher than with less managed solutions. For example, training on Vertex AI is roughly 15% more expensive than running it directly on Compute Engine.
  • Rigidity: All-in-one ML services are limited in terms of customization and integration with tools outside their ecosystem. For example, it may be difficult to use DVC to version data or launch low-carbon training on Qarnot.
  • Vendor lock-in: Dependence on a specific provider can be even more frustrating in the field of AI, where technologies evolve very rapidly.

The main alternative is to combine more specialized, less managed technologies such as DVC, Streamlit, and Airflow, which represents a significant investment in setup costs.

 

OUR PERSPECTIVE

At Sicara, our default stack does not use an end-to-end ML platform. We prefer a combination of open-source tools tailored to our customers’ needs, minimizing costs and avoiding vendor lock-in. Sicarator, our open-source ML project generator, accelerates their implementation.

An end-to-end solution remains preferable if one does not have the time or necessary skills for a custom stack, or for companies planning to undertake only a limited number of ML projects in the near future.

Hold

34

Dataiku for industrialized tech teams

Creating Data Science pipelines takes a lot of time and involves multiple teams to process data and make it available. Dataiku, a proprietary platform, positions itself as an accelerator in the Data and Machine Learning (ML) world, relying on a low-code interface to simplify these steps.

Dataiku offers a fairly wide toolbox, interfaced with numerous data sources. The platform enables the automation of data transformation pipelines, model training, and the deployment of models into production. It is also possible to visualize data through personalized dashboards.

Although the low-code approach makes it easier to create ML projects in the short term, it also creates new maintainability challenges for sophisticated projects. For example, involving multiple teams on the same pipeline is difficult: simultaneous modification is impossible, and change versioning is still at an early stage (making code reviews, a standard practice at Sicara, difficult). Moreover, while it is tempting to broaden access to the platform because it is intuitive, doing so may require significant licensing costs.

 

OUR PERSPECTIVE

Dataiku is an excellent solution for quickly exploring and deploying Data Science use cases. However, we found that it complicates collaboration and quality control in industrialization contexts, where teams need to work closely together and have a solid technical background. In those cases, we recommend a modular stack (Streamlit, etc.): it requires a larger initial investment but allows for better scalability and flexibility.