Adopt

12

DINOv2 as image embedding model

A common need in computer vision is to obtain good vector representations (embeddings) of images that can be used for downstream tasks such as clustering, classification, detection, segmentation, etc. For example, a simple classification task may consist of the following steps:

  1. Calculating representations of a set of images
  2. Storing and indexing them in a vector database
  3. Determining the class using a simple algorithm like k-NN or a linear model

It's not always possible to build a model that computes good representations for a given problem, because of training-data constraints: annotated images may be unavailable, expensive to produce, or simply nonexistent at the start of a project. It is therefore worth considering generic models pre-trained on large, general-purpose datasets. For a long time, our reference models at Sicara were CNNs (typically, and in chronological order: VGG, ResNet, EfficientNet) pre-trained on ImageNet-1K (~1.2M images).

Over the past 5 years, several innovations have changed the landscape:

  • Advancements in self-supervised training techniques, which eliminate the need for annotations during training and therefore make it possible to leverage much larger image datasets (e.g., 1.2B images for DINOv2).
  • The emergence of transformers, a new architecture based on attention mechanisms (first in NLP with BERT and then in vision with ViT) that surpass CNNs in terms of performance, flexibility, and scalability to large datasets. Meta's DINO (2021) combines and benefits from both approaches.
  • Meta has made the pre-trained weights public, which brings several advantages: commercial use under the Apache 2.0 license, availability on PyTorch Hub, a range of model sizes from 21M to 1.1B parameters, and the ability to build highly performing classifiers without fine-tuning.
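
To give an idea of the integration effort, here is a minimal sketch of computing an image embedding with a DINOv2 backbone loaded from PyTorch Hub (the image path and the choice of the ViT-S/14 variant are illustrative):

```python
import torch
from PIL import Image
from torchvision import transforms

# Load the small DINOv2 backbone (~21M parameters) from PyTorch Hub
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Standard ImageNet-style preprocessing; 224 is a multiple of the ViT patch size (14)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical image path
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

with torch.no_grad():
    embedding = model(batch)  # shape: (1, 384) for the ViT-S/14 variant

# The embedding can now be indexed in a vector database or fed to a k-NN / linear classifier.
```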

 

OUR PERSPECTIVE

We recommend using the DINOv2 models (architecture and weights) as off-the-shelf embedding models, without retraining. Fine-tuning or fully training DINOv2 is best reserved for experienced teams with access to substantial computational power and large volumes of data.

13

GPT-4

GPT-4, the most powerful Large Language Model (LLM) currently developed by OpenAI, is an autoregressive text generation model. According to several leaks, it is likely a "Mixture of Experts" of several models, each with over 200 billion parameters. GPT-4 can also accept images as input and handles most text or image processing tasks such as translation, summarization, comprehension, and writing, as can be seen by using ChatGPT Plus. GPT-4 finds its utility in various domains such as information retrieval, automation of customer support workflows, and automated document processing. In nearly all tasks, GPT-4 significantly outperforms open-source alternatives and clearly surpasses other proprietary models (Claude, PaLM, etc.).

The Function Calling feature integrated into GPT-4 allows the use of predefined “functions” or templates in model calls. This facilitates the generation of structured outputs, rather than being limited to raw text. Functions also enable interaction with external tools such as executing code, accessing APIs, or running shell commands.
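
As an illustration, here is a minimal sketch of Function Calling with the OpenAI Python client (the `get_order_status` function and its schema are hypothetical, and exact model names and parameters evolve quickly):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Describe a hypothetical business function the model may decide to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Where is my order A1234?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    # The model returns structured JSON arguments instead of raw text
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```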

Recently, OpenAI released GPT-4 Turbo, an iteration of GPT-4 with approximately 3x lower costs, faster inference, and a maximum context window of 128k tokens (roughly 96,000 words). However, it remains relatively slow at inference compared to models like GPT-3.5, which is better suited to real-time applications, and it is expensive ($0.10 for a call with 4k tokens of input and 2k tokens of output). Finally, be cautious about the reliability of the Turbo version, which is still in preview and may produce less consistent responses. In contexts where output quality is particularly critical, we recommend sticking to the stable version of GPT-4 for now.

 

OUR PERSPECTIVE

GPT-4 surpasses all publicly accessible models and is particularly resilient to hallucinations. The integration of Function Calling and OpenAI Assistants API, along with the recent cost reduction with GPT-4 Turbo, further strengthens its position as a leader.

If reducing cost or latency is more important than raw performance, then prefer GPT-3.5 or another "smaller" LLM. For specific security or control constraints, there are also open-source LLMs available.

14

PyTorch

The widespread adoption of Machine Learning in the 2010s owes much to the advent of open-source libraries designed for tensor and gradient calculations, notably PyTorch in 2016 and TensorFlow in 2015.

PyTorch, a brainchild of Meta AI initially focused on research, has broadened its reach to industry applications. It facilitates the setup and training of Machine Learning models, including neural networks, and supports the crucial data preprocessing needed for model training. PyTorch is celebrated for its balance between detailed control over training loops and user-friendly syntax. Additions like PyTorch Lightning have further eased the syntax and streamlined deployment processes.
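
To illustrate this balance between control and simplicity, here is a minimal training-loop sketch on synthetic data (architecture and hyperparameters are arbitrary):

```python
import torch
from torch import nn

# A tiny classifier trained on synthetic data
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, 10)          # dummy features
y = torch.randint(0, 2, (256,))   # dummy labels

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # forward pass
    loss.backward()               # autograd computes gradients
    optimizer.step()              # gradient descent step
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```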

A standout feature of PyTorch is its extensive model library, courtesy of the active community. Platforms like Papers With Code and HuggingFace showcase a predominance of PyTorch-based models over TensorFlow. PyTorch also shines in backward compatibility, offering a safer environment for updates—a critical aspect for production deployments.

OUR PERSPECTIVE

We strongly advocate for PyTorch, praising its versatility and the vibrant community supporting it. However, for projects involving embedded AI, caution is advised as PyTorch Mobile is still under development. Despite this, PyTorch stands as a robust option for a broad range of applications, providing a well-rounded toolkit for developing and training Machine Learning models.

15

Retrieval Augmented Generation

Large Language Models (LLMs) have risen to prominence for their ability to perform zero-shot tasks and serve as a dynamic universal knowledge repository. However, their knowledge is confined to publicly available data up until their last training update.

Retrieval Augmented Generation (RAG) enhances LLMs by integrating them with an external knowledge base, such as Notion, Confluence, PDFs, and internal documents. This integration allows for natural language queries directly to these databases.

The RAG approach operates in two phases:

  1. Retrieval: Identifies the documents most relevant to a query by comparing the query's vector embedding with document representations in a database. This can be done using standard or dedicated vector databases (see Vector Search with Standard Databases and Dedicated Vector Databases).
  2. Generation: Leverages an LLM to formulate responses based on the information retrieved in the first step.
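
As a minimal illustration of these two phases, here is a sketch using OpenAI embeddings and an in-memory index (the document snippets and model names are illustrative, and the cosine-similarity lookup stands in for a real vector database):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

documents = [  # hypothetical knowledge-base snippets
    "Our refund policy allows returns within 30 days.",
    "Support is available Monday to Friday, 9am-6pm CET.",
]

def embed(texts):
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)  # in a real system these live in a vector database

query = "When can I return an item?"
query_vector = embed([query])[0]

# Retrieval: cosine similarity between the query and each document
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
context = documents[int(np.argmax(scores))]

# Generation: the LLM answers using only the retrieved context
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": f"Answer using only this context: {context}"},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
```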

This methodology offers several benefits, including the avoidance of high GPU costs associated with LLMs, improved explainability through traceable data sources for answers, the ability to stay current with updates to the knowledge base without retraining, enhanced security by restricting sensitive data access, and reduced instances of model-generated inaccuracies by anchoring responses in sourced documents.

However, RAG setups are more complex and potentially costlier than using a single model, especially if one opts for supervised fine-tuning of the LLM. The absence of fine-tuning can also limit the system's understanding of domain-specific terminology, and the initial retrieval step may become a precision bottleneck: if it fails to surface the relevant information, the LLM's response accuracy is compromised.

OUR PERSPECTIVE

Despite these challenges, RAG systems are often favored over direct fine-tuning methods when it's crucial to manage the data underpinning generated responses, such as for security reasons or when dealing with frequently updated information (like daily document revisions).

Implementing an initial version of a RAG system to explore its potential is quite feasible without extensive technical expertise, thanks to tools like ChatGPT+ GPTs, OpenAI's Assistants API, and frameworks like LangChain. Nevertheless, optimizing the system to fully leverage its capabilities typically necessitates further customization and iteration.

16

SHAP

Complex machine learning models are often viewed as inscrutable "black boxes" due to their low interpretability. To build trust in these models, various interpretability methods have been developed. SHAP (SHapley Additive exPlanations) stands out as a method that demystifies the decision-making process of any model, making it particularly valuable for models using structured data.

Before SHAP, interpretation of complex ML models largely depended on simpler approaches like feature importance. SHAP differentiates itself by employing cooperative game theory to assess the contribution of each feature to the model's prediction on a local level.

One of SHAP's strengths is its local focus, enabling detailed examination of how each variable affects individual predictions or those within a specific dataset cluster. Unlike basic interpretive methods that might only offer a generalized view, SHAP values can be aggregated and visualized, providing a deeper insight into the model's workings beyond what a simple bar chart could convey.
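
As an illustration, here is a minimal sketch of SHAP applied to a tree-based model on a public structured dataset (the XGBoost model and the dataset are chosen arbitrarily):

```python
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

# Train a simple gradient-boosted model on a public structured dataset
data = load_breast_cancer(as_frame=True)
model = xgboost.XGBClassifier().fit(data.data, data.target)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)

# Local explanation: contribution of each feature to one individual prediction
shap.force_plot(explainer.expected_value, shap_values[0], data.data.iloc[0], matplotlib=True)

# Aggregated view: distribution of contributions across the whole dataset
shap.summary_plot(shap_values, data.data)
```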

However, SHAP's major drawback is its computational demand, especially with models that have numerous parameters or when applied to extensive datasets.

OUR PERSPECTIVE

In our projects, especially in sectors where understanding model decisions is critical—such as finance and healthcare—we consistently employ SHAP for its precise interpretative power with structured data. It not only aids in ensuring compliance and transparency but also helps in refining our models through a meticulous analysis of input contributions.

Given its benefits, we advocate for the adoption of SHAP, particularly in scenarios where detailed model interpretability is essential.

17

SpaCy

SpaCy stands out as a production-oriented open-source Python library for natural language processing (NLP), prized for its efficiency and focus on delivering "black box" models for swift deployment and dependable outcomes. In contrast to research-focused NLP tools like NLTK or Stanford NLP, SpaCy aims to offer practical, high-performance solutions across a broad spectrum of NLP tasks, including named entity recognition and text classification.

One of SpaCy's significant advantages is its simplicity. The library facilitates rapid model training through configuration files and integrates command-line interface (CLI) commands as Python functions, streamlining SpaCy's incorporation into various development workflows. Its object-oriented design enhances usability, making it accessible even to those relatively new to NLP. Additionally, SpaCy supports a comprehensive range of models from the HuggingFace Transformers library and multiple proprietary and open-source Large Language Models (LLMs) through APIs like OpenAI and HuggingFace, whether fine-tuned or not.
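
As a quick illustration of this simplicity, here is a minimal sketch using a small pretrained English pipeline (the example sentence is arbitrary, and the `en_core_web_sm` model must be downloaded beforehand):

```python
import spacy

# Small pretrained English pipeline, installed via: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Sicara, founded in Paris, builds NLP pipelines with SpaCy and PyTorch.")

# Named entity recognition out of the box
for ent in doc.ents:
    print(ent.text, ent.label_)

# The same object-oriented API exposes tokens, part-of-speech tags, etc.
for token in doc[:5]:
    print(token.text, token.pos_)
```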

However, it's worth noting that while SpaCy allows for iterative training data and hyperparameter adjustments, it offers limited flexibility for deep modifications to the underlying models. This design choice reflects SpaCy's commitment to achieving quick, reliable results and simplifying the development process.

OUR PERSPECTIVE

SpaCy has established itself as a dependable tool in the NLP domain, making it a strong candidate for traditional NLP projects where robustness and performance are key considerations.

As we haven't yet utilized SpaCy with LLMs in a production environment, we refrain from offering guidance on its application in such contexts.

18

YOLO

The YOLO (You Only Look Once) series represents a groundbreaking approach in the realm of real-time object detection algorithms. Debuting in 2016, the inaugural YOLO model set itself apart from predecessors like R-CNN and Fast R-CNN by its remarkable inference speed. This leap in performance is largely attributed to YOLO's unique architecture, which merges the region proposal and classification steps into a single pass through the neural network, hence its name.

YOLO has seen multiple iterations, each improving upon the last. Notably, YOLOv4, introduced in 2020, achieved significant advancements over its contemporaries, such as Faster RCNN-FPN+, boasting a fivefold increase in speed and a 9.4-point improvement in Box AP on the MS COCO benchmark, a standard for measuring object detection precision.

Following YOLOv4, Ultralytics assumed development, releasing subsequent versions under the AGPL-3.0 license, complicating their commercial application. Despite only modest performance gains, these iterations are praised for their high code quality, enhancing usability. The most recent YOLO versions continue to lead in real-time object detection benchmarks and have expanded their application to tasks like segmentation, object tracking, and pose estimation.

OUR PERSPECTIVE

For projects demanding swift object detection, YOLO models are a solid choice, offering both speed and precision. However, alternatives that trade off some speed for increased accuracy exist.

Our experience primarily involves YOLOv4, which has proven effective in various settings, including on embedded systems after conversion to TFLite. Although the open-source implementations of these models have occasionally made training challenging, those hurdles were surmountable. Due to licensing constraints, we advise against using the Ultralytics iterations (v5 and onwards).

Trial

19

Boruta

Training Machine Learning models on structured data often encounters the challenge of excessive, unhelpful features that can dilute model effectiveness, prolong training and inference times, and complicate the model's development and refinement.

The Boruta method is a robust feature selection technique designed to sift through the noise, identifying only the most predictive features for a model. It operates on a unique principle:

  1. "Shadow features" are generated by randomly shuffling the values of actual features, creating a set of decoy data.
  2. A random forest model is then trained on both real and shadow features, with each feature's importance evaluated within the model. Only those features deemed statistically more significant than the shadow features are kept.
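
As an illustration, here is a minimal sketch of this procedure using the BorutaPy implementation on synthetic data (dataset and hyperparameters are arbitrary):

```python
import numpy as np
from boruta import BorutaPy  # pip install Boruta
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset: 5 informative features hidden among 20
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

forest = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)

# Boruta repeatedly compares real features against shuffled "shadow" copies
selector = BorutaPy(forest, n_estimators="auto", random_state=0)
selector.fit(X, y)

print("Confirmed features:", np.where(selector.support_)[0])
print("Tentative features:", np.where(selector.support_weak_)[0])
```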

While other feature selection techniques exist, such as Principal Component Analysis (PCA) or those integrated into libraries like scikit-learn, Boruta distinguishes itself by its comprehensive evaluation of all variables. This includes detecting interactions among variables that are complex and non-linear. Additionally, the Boruta-Shap variant leverages SHAP values for feature ranking, providing a more nuanced and reliable analysis of feature importance.

Nonetheless, Boruta's approach can be computationally intensive, particularly with large datasets and a high number of features. It also requires careful adjustment of hyperparameters, like the significance threshold for importance testing.

OUR PERSPECTIVE

Boruta is recommended for early-stage feature analysis to swiftly pinpoint the most impactful features. However, it's crucial to validate the model's performance with and without these identified features, ensuring decisions on feature exclusion are grounded in actual model efficacy and not solely on algorithmic suggestions.

20

Causal Impact

A recurring challenge in time series analysis is quantifying the effect of specific interventions, such as a marketing campaign's influence on sales volume. Traditional methodologies like A/B testing and double-blind randomized tests hinge on comparing intervention outcomes against a control group unaffected by the intervention. However, pinpointing a suitable control group that mirrors the test group's characteristics can be complex and sometimes unfeasible.

Causal Impact emerges as a sophisticated approach to assess the causal influence of an intervention on a time series, circumventing the need for a control group. Developed by Google researchers, this methodology employs causal inference—specifically, a "Bayesian structural time-series" model—to construct a synthetic control. This synthetic control provides a baseline by predicting the trajectory of the time series had the intervention not occurred, using data from periods unaffected by the intervention for model training. The difference between the actual post-intervention data and this synthetic control reflects the intervention's impact.

The model's accuracy is evaluated through the Bayesian model's output, which produces a probability distribution, allowing for the estimation of the likelihood that observed impacts are coincidental.

Identifying the optimal model and predictive features to isolate the intervention's effect can be challenging. A recommended strategy is to assess the impact using only pre-intervention data; an effective model-feature combination should indicate no impact in this scenario. Once established, this combination can accurately measure the intervention's actual effect on the analyzed time series.
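
As an illustration, here is a minimal sketch with the tfcausalimpact library on simulated data (the series, the covariate, and the intervention date are fabricated for the example):

```python
import numpy as np
import pandas as pd
from causalimpact import CausalImpact  # pip install tfcausalimpact

# Simulated daily sales (y) with a correlated covariate (x) unaffected by the campaign
rng = np.random.default_rng(0)
x = 100 + np.cumsum(rng.normal(0, 1, 120))
y = 1.5 * x + rng.normal(0, 2, 120)
y[90:] += 10  # artificial lift after the intervention on day 90

data = pd.DataFrame({"y": y, "x": x})
pre_period = [0, 89]     # training window, before the intervention
post_period = [90, 119]  # evaluation window, after the intervention

# A Bayesian structural time-series model builds the synthetic control from x
ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())
```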

OUR PERSPECTIVE

We advocate for the Causal Impact method as a viable option for measuring interventions in time series analyses lacking a control group. Its statistical rigor offers a solid foundation for interpreting post-intervention changes. Given the original Google implementation is in R, we suggest exploring Willian Fuks' tfcausalimpact library for those seeking a more accessible Python alternative. This library facilitates implementing the Causal Impact methodology, broadening its applicability to various time series analysis projects.

21

OpenAI Assistants API

Language models, in generating responses, draw upon the data they were trained on. Yet, the quest for specific, nuanced answers often requires incorporating targeted knowledge, leading to the adoption of techniques like Retrieval Augmented Generation (RAG). Setting up such a system, with components like LangChain for orchestration and Qdrant as a vector database, is necessary for many applications but incurs significant implementation costs. In response, OpenAI's Assistants API offers a streamlined, integrated solution designed to facilitate and expedite these scenarios, enabling language models to interface with external resources, such as knowledge bases, APIs, and computational tools, with ease.

Currently in its beta phase, this API is versatile, supporting functionalities like Code interpretation, Retrieval (to augment the model's base knowledge), and Function calling (initiating actions based on model prompts). It introduces abstract objects (Assistants, Thread, Message, and Run) for interaction, abstracting away complexities such as context window management and chat history, effectively removing any constraints on the length of a conversation thread.
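
As an illustration, here is a minimal sketch of these objects with the OpenAI Python client (the API is in beta and may change; the assistant's instructions and model name are illustrative, and knowledge-base files would still need to be uploaded and attached):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# An Assistant with the built-in retrieval tool; knowledge-base files would be
# uploaded separately and attached via file_ids
assistant = client.beta.assistants.create(
    name="Docs assistant",
    instructions="Answer questions using the attached documents.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
)

# A Thread holds the conversation; context-window and history management are server-side
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Summarize our refund policy."
)

# A Run executes the Assistant on the Thread
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
print(run.status)
```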

One must, however, consider the costs associated with the storage of knowledge base files, priced at $0.20/GB per day. This pricing model implies that storing data in formats beyond plain text could become expensive. Additionally, using a managed, high-level API like the Assistants API introduces certain limitations, including potential dependency on the service (lock-in) and reduced control over model operations (data ingestion, segmentation, search logic, etc.).

OUR PERSPECTIVE

Traditionally, we've leaned towards developing our RAG solutions in-house. However, the Assistants API, with its promise of seamless integration of language models with knowledge bases, external tools, and Python environments, presents an attractive alternative. We're currently observing its performance and utility in larger-scale productions before issuing a definitive stance on its adoption.

22

Supervised LLM Fine-Tuning on Questions/Answers

Adapting state-of-the-art Large Language Models (LLMs) to proprietary data presents a notable challenge, as these generalist models are primarily trained on public datasets. Fine-tuning LLMs in a supervised manner—where a pre-trained model (such as Mixtral, Llama, GPT-3.5) is further trained on specific questions and answers—enhances the model's ability to produce tailored responses. This approach is crucial when prompt engineering falls short, such as for ensuring responses adhere to a particular output format or incorporate unique terminologies.

In 2023, fine-tuning became significantly more accessible and cost-efficient with advancements like QLoRA (quantization of the model's weights combined with Low-Rank Adaptation), which trains only a small number of additional low-rank weights on top of a frozen, quantized base model. This method requires only a few thousand high-quality examples to achieve meaningful improvements, with costs remaining relatively low (in the range of a few tens of euros for a model with 7 billion parameters).
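
As an illustration, here is a minimal sketch of preparing a 7B model for QLoRA fine-tuning with the HuggingFace transformers and peft libraries (the base model and LoRA hyperparameters are illustrative):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base_model = "mistralai/Mistral-7B-v0.1"  # any 7B-class base model

# Quantize the frozen base weights to 4 bits (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)

# Attach small trainable low-rank adapters (the "LoRA" part)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# From here, train on question/answer pairs with a standard Trainer or trl's SFTTrainer.
```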

Nonetheless, fine-tuning is not without its challenges. It demands time and resources to prepare an appropriate dataset—often generated through models like GPT-4—and undergo several iterations to refine the outcomes. Additionally, a fine-tuned model might lose some of its general capabilities in favor of task-specific enhancements.

Alternatives to supervised fine-tuning include:

  • Methods without retraining: Such as Retrieval Augmented Generation and prompt engineering, which, while simpler, may necessitate larger prompts and are not as precise in understanding specialized terminology.
  • Unsupervised fine-tuning: Tailors the model for text completion tasks but may diminish its question-answering proficiency.
  • Reinforcement Learning from Human Feedback (RLHF): Although potentially more effective, RLHF is expensive and complex, requiring extensive human input and iterations.
  • Direct Preference Optimization (DPO): Offers a more straightforward, stable, and efficient approach than RLHF but still incurs higher costs than supervised fine-tuning.

OUR PERSPECTIVE

We advocate starting with prompt engineering using advanced models like GPT-4 during initial development phases. This strategy allows for rapid prototyping without the complexities of fine-tuning. When specific limitations arise or there's a need to adapt the model to unique data, consider utilizing Retrieval Augmented Generation or supervised fine-tuning based on specific questions and answers. As the project progresses, engaging test users can provide invaluable feedback and data, aiding in the preparation of datasets for potential fine-tuning, thereby enhancing the model's applicability and performance in real-world scenarios.

Assess

23

Open-Source LLMs

State-of-the-art language models have shifted from being largely open-source, as seen with BERT and GPT-2 up until 2018/19, to the closed-source paradigm initiated by GPT-3 in 2020, when OpenAI chose not to release the model weights. This trend continues with GPT-4, which, accessible only via API, offers performance that far surpasses available open-source alternatives. The introduction of Mixtral (8x7B), which slightly outperforms GPT-3.5, has sparked renewed interest and reignited discussions on the potential of open-source models to match or exceed the capabilities of their closed-source counterparts in the foreseeable future.

Despite the performance edge of closed-source models like GPT-4, open-source alternatives present several compelling advantages:

  • Model Availability: Owning the model ensures direct control over its deployment, mitigating potential latencies or downtimes—an issue sometimes encountered with OpenAI's API services.
  • Flexibility: Having direct access to the model allows for more creative use of its output, such as implementing specific guidance mechanisms.
  • Efficiency: Tailored, smaller models can be more computationally and energetically efficient compared to larger, general-purpose ones, particularly for specialized tasks.
  • Data Security and IP Control: Utilizing open-source models for fine-tuning offers enhanced data security and safeguards intellectual property.
  • Cost Considerations: While leveraging an open-source LLM may initially seem costly compared to the per-call pricing model of API-based solutions, the overall expense can become more favorable at higher usage volumes. Moreover, advancements in tools like vLLM are simplifying the deployment and reducing the costs of serving open-source LLMs.
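
As an illustration, here is a minimal sketch of serving an open-source model locally with vLLM (the model choice and sampling parameters are illustrative, and a GPU with sufficient memory is assumed):

```python
from vllm import LLM, SamplingParams

# Load an open-source model locally (a GPU with enough memory is assumed)
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(
    ["Explain retrieval augmented generation in two sentences."], params
)
print(outputs[0].outputs[0].text)
```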

OUR PERSPECTIVE

For entities equipped with the requisite resources and expertise, the adoption of an open-source LLM presents a strategic opportunity to capitalize on the benefits of flexibility and control. However, for endeavors where reliability and peak performance are non-negotiable, GPT-4 currently stands unrivaled. In broader terms, particularly during the proof-of-concept phase aimed at validating product value and technical viability, we recommend starting with the most advanced model available before exploring open-source alternatives, balancing the trade-offs between performance, control, and cost.

Hold

24

Classic Few-shot learning

The challenge of training machine learning models with a scant number of annotated examples per class is a well-documented hurdle. The straightforward solution of annotating more data often clashes with constraints such as budget, data scarcity, or the unpredictability of all classes involved.

Few-Shot Learning (FSL) emerges as a specialized field within machine learning, designed to equip models to learn new tasks effectively using only a handful of examples. Among the prominent strategies in FSL are:

  1. Metric Learning: This technique focuses on crafting a representation space where examples belonging to the same class cluster closely together while maintaining distance from those of different classes. Siamese networks, which utilize a contrastive loss function to differentiate between positive (same class) and negative (different class) examples, exemplify this approach.
  2. Meta-Learning: Meta-learning, or "learning to learn," aims at enabling algorithms to rapidly acclimatize to new tasks using minimal data. A notable example is Model-Agnostic Meta-Learning (MAML), which optimizes the model's initial weight configuration for quick adaptation to new tasks.

Recent analyses, such as the insights from "A closer look at few-shot classification," suggest that achieving cutting-edge performance may be less about specialized FSL techniques and more about leveraging supervised pre-training followed by task-specific fine-tuning. Furthermore, the advent of transformer-based models since 2019 has introduced a new era of "universal" representations that sufficiently address a broad spectrum of use cases without the need for intricate FSL methodologies.
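
As an illustration of this pre-trained-representation approach, here is a minimal sketch of few-shot classification using frozen DINOv2 embeddings and a simple k-NN (the data here is random and purely illustrative):

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

# A frozen foundation model (here DINOv2, see the dedicated entry) provides the representations
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def embed(images):
    # images: preprocessed tensors of shape (N, 3, 224, 224)
    with torch.no_grad():
        return backbone(images).numpy()

# Dummy few-shot episode (random tensors stand in for real, preprocessed images):
# 2 classes x 5 support images, 4 query images
support_images = torch.randn(10, 3, 224, 224)
support_labels = [0] * 5 + [1] * 5
query_images = torch.randn(4, 3, 224, 224)

# A simple k-NN on frozen embeddings is often a strong few-shot baseline
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(embed(support_images), support_labels)
print(classifier.predict(embed(query_images)))
```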

OUR PERSPECTIVE

Our experience with FSL techniques, including siamese networks and contrastive training, has evolved alongside the advancements in foundation models. Because these models now produce such expressive representations, tuning only the end of the pipeline, such as the choice of class representatives or the metric used, is increasingly sufficient. Consequently, we advise against initiating projects with a primary focus on optimizing representations specifically for FSL. Instead, leveraging the robust, pre-trained foundations provided by modern transformers offers a more straightforward and effective starting point.