Adopt

1

Apache Parquet

In the realm of data science, the CSV format is often the go-to choice for storing datasets due to its widespread acceptance. Yet, this format comes with notable drawbacks:

  • CSV files can become cumbersome and slow to process when dealing with substantial data volumes.
  • The lack of inherent data types in CSVs means software must interpret the data, potentially leading to errors. For instance, complex data structures like lists or dictionaries are typically read as strings.
  • The ease of manually editing CSV files can also introduce mistakes, such as renaming labels in an annotation file without checking the corresponding images, which can introduce errors that go unnoticed.

Apache Parquet, an open-source column-oriented data file format, emerges as a superior alternative, particularly for large datasets. Its architecture is optimized for efficient data reading and writing operations.

Parquet directly addresses the limitations of CSV by:

  • Being roughly ten times smaller than the equivalent CSV file on average, and considerably faster to read and write.
  • Accurately managing various data types, including more complex structures, without the manual interpretation required by CSV.
  • Discouraging manual edits, instead promoting the use of scripts or specific graphical tools (like labeling tools or custom Streamlit interfaces) for safer data modifications.

Additionally, Parquet supports advanced optimizations like file partitioning and compression. Popular DataFrame libraries such as Polars and Pandas seamlessly work with Parquet, offering straightforward methods (read_parquet and to_parquet) to transition from CSV, simplifying the migration process.
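
As an illustration, here is a minimal sketch of the migration from CSV to Parquet with Pandas and Polars (the file names are hypothetical; Pandas' to_parquet requires an engine such as pyarrow or fastparquet to be installed):

```python
import pandas as pd
import polars as pl

# Pandas: read the legacy CSV once, then persist it as Parquet
df = pd.read_csv("annotations.csv")       # hypothetical file
df.to_parquet("annotations.parquet")      # requires pyarrow or fastparquet

# From then on, load the typed, compressed Parquet file instead
df = pd.read_parquet("annotations.parquet")

# The same round trip with Polars
pl_df = pl.read_csv("annotations.csv")
pl_df.write_parquet("annotations.parquet")
pl_df = pl.read_parquet("annotations.parquet")
```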

One limitation of Parquet is its lesser accessibility for non-technical users who might find opening CSV files in spreadsheet software more straightforward.

Our Perspective

Although Parquet is commonly used by Data Engineers, Data Scientists largely underuse it.

We advocate for Data Scientists to transition their tabular data storage from CSV to Parquet files. This switch not only improves performance and reduces the risk of bugs, it also pairs well with tools designed for effective data visualization and editing, such as Streamlit, enhancing overall data management practices.

2

DVC

In the realm of Machine Learning (ML) projects, code versioning has become an indispensable best practice. However, the practice of data versioning has not seen the same widespread adoption, often leading to the frustration of not being able to retrieve a specific dataset or replicate the success of a high-performing model. To bridge this gap, the Data Version Control (DVC) tool was introduced. Launched in 2017 by Iterative, DVC is an open-source Python library designed to prevent such setbacks.

DVC stands out by allowing for the versioning of data files in conjunction with Git, a popular version control system. It achieves this by storing the actual data in a chosen remote storage solution, like Google Cloud Storage or Amazon S3, while the metadata is versioned through Git. This approach ensures that large data files are handled efficiently without clogging the Git repository.
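
For instance, once a dataset has been added with `dvc add` and pushed to the remote, a specific version can be retrieved programmatically through DVC's Python API. A minimal sketch, in which the repository URL, file path, and tag are hypothetical:

```python
import dvc.api

# Read a specific version of a dataset, identified by any Git revision
with dvc.api.open(
    "data/annotations.parquet",             # path tracked by DVC (hypothetical)
    repo="https://github.com/org/project",  # Git repository holding the .dvc metadata
    rev="v1.2.0",                           # tag, branch, or commit
) as f:
    content = f.read()
```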

Furthermore, DVC facilitates the creation, execution, and versioning of data pipelines. This feature is crucial for tracking the progression of datasets and models, offering clarity on the steps taken to produce each outcome and the ability to replicate them.

While there are other data versioning and pipeline systems available, such as Pachyderm, DVC distinguishes itself with its ease of setup and user-friendly nature. Additionally, though tools like MLflow and Weights & Biases exist for ML experiment tracking, DVC integrates more seamlessly into the Git workflow, allowing for a unified tracking of code, data, and experiment iterations. This integration simplifies the exploration of project histories and avoids the need to embed tracking operations directly into the code, a common requirement with other platforms.

For those seeking to visually explore their experiments, several options complement DVC:

  • Iterative Studio offers a comprehensive web app solution developed by DVC's creators, though it starts at $50 per user per month beyond two users.
  • A DVC extension for Visual Studio Code provides a free, albeit less extensive, alternative for collaboration within the VSCode environment.
  • A custom dashboard, created by integrating DVC with a visualization tool such as Streamlit, allows for a tailored exploration of project experiments, presenting precise information as needed.

Our Perspective

Adopting DVC can be likened to the transition to using Git: initially daunting, but soon becoming indispensable. At Sicara, we recommend leveraging DVC with Streamlit for a flexible, cost-effective approach to experiment visualization, as demonstrated by our Sicarator tool.

It's important to note, though, that while DVC excels in managing experimentation flows, it's not designed for operational pipelines in production environments. For such cases, a more specialized tool like Airflow is recommended, highlighting DVC's role as a powerful companion for data scientists in the experimentation and development stages.

3

Polars

Pandas has long been the preferred toolkit for data manipulation and analysis in Python. Its intuitive interface and comprehensive set of features have made it indispensable for data practitioners. However, Pandas encounters performance bottlenecks when handling very large datasets, primarily because its operations are not designed to be parallelized, and it requires data to be loaded entirely into memory.

Enter Polars, a modern DataFrame library that leverages the Rust programming language, introduced to the public in 2021. Polars is engineered to overcome the scalability issues of Pandas by supporting multi-threaded computations. This capability, combined with lazy evaluation strategies, enables Polars to efficiently manage and transform datasets that exceed available memory, enhancing performance significantly.
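
As a sketch of that lazy, multi-threaded execution model (the file and column names are hypothetical; recent Polars versions use group_by, older releases used groupby):

```python
import polars as pl

# Build a lazy query: nothing is loaded or executed yet
lazy_df = (
    pl.scan_parquet("events.parquet")       # scan instead of read: streaming-friendly
    .filter(pl.col("duration_ms") > 100)    # predicate pushed down to the scan
    .group_by("user_id")                    # group_by in recent versions (groupby before 0.19)
    .agg(pl.col("duration_ms").mean().alias("mean_duration_ms"))
)

# collect() triggers the optimized, multi-threaded execution
result = lazy_df.collect()
```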

Polars is designed with an intuitive syntax that mirrors Pandas, making the transition between the two libraries smooth for users. This design choice ensures that data professionals can apply their existing knowledge of Pandas to Polars with minimal learning curve, facilitating adoption.

Despite its advantages, Polars is comparatively newer and thus may not offer the same breadth of functionality as Pandas. However, Polars integrates seamlessly with the Arrow data format, which simplifies the process of converting data between Polars and Pandas. This compatibility allows users to leverage Polars for performance-critical tasks while still accessing Pandas' extensive feature set for specific operations.
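
Thanks to the shared Arrow format, switching between the two libraries is a one-liner (both directions require pyarrow to be installed), so Pandas-only features remain within reach:

```python
import polars as pl

pl_df = pl.read_parquet("events.parquet")  # hypothetical file

# Convert to Pandas for a feature only available there, then back to Polars
pd_df = pl_df.to_pandas()
pl_df = pl.from_pandas(pd_df)
```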

Our Perspective

Given the performance benefits and ease of use, we advocate for adopting Polars in new projects that involve DataFrame manipulation, reserving Pandas primarily for maintaining existing codebases. This strategy allows for leveraging the strengths of both libraries—utilizing Polars for its efficiency and scalability, and Pandas for its established ecosystem and rich functionality.

4

Prodigy

Annotating datasets, especially with textual data, involves complexities like streamlining the annotation process, storing annotations efficiently, and enabling simultaneous work by multiple annotators. Prodigy emerges as a powerful annotation tool tailored for tasks like classification, named entity recognition (NER), and sentence analysis.

Prodigy stands out for several reasons:

  • It boasts a user-friendly interface that simplifies the annotation task.
  • Its active learning component enhances annotation efficiency by prioritizing examples that will most improve the model.
  • The tool supports pre-annotation using Large Language Models (LLMs) for tasks such as NER and text classification, although users should be mindful of data privacy when using third-party LLMs.
  • Prodigy is scriptable, allowing for customized annotation workflows to suit various project needs (see the recipe sketch after this list).
  • At $490 for a lifetime license, its pricing is competitive.
  • Being developed by the same team as SpaCy, Prodigy integrates seamlessly with the popular NLP library, although alternatives like Doccano or Label Studio offer open-source options.
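
Below is a minimal sketch of a custom recipe built on Prodigy's recipe decorator; the recipe name, labels, and loader import path are illustrative and may differ across Prodigy versions, so check the current documentation:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("sentiment-binary")
def sentiment_binary(dataset: str, source: str):
    """Accept or reject a POSITIVE label for each text in a JSONL file."""
    def add_label(stream):
        for example in stream:
            example["label"] = "POSITIVE"   # label proposed for accept/reject
            yield example

    return {
        "dataset": dataset,                  # Prodigy dataset that stores the annotations
        "stream": add_label(JSONL(source)),  # examples to annotate
        "view_id": "classification",         # binary accept/reject interface
    }
```

Such a recipe is then launched from the prodigy command line, pointing to the Python file that defines it (via the -F flag).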

Our Perspective

In our textual annotation efforts, we particularly value Prodigy's active learning feature, which streamlines the annotation process. We recommend Prodigy for those working with SpaCy, leveraging its capabilities to enhance annotation efficiency and model training.

5

Pydantic

Given Python's nature as an interpreted language, it often lacks the strict data type assurances found in compiled languages. This flexibility, while a strength in many contexts, can introduce uncertainty, especially when dealing with outputs from language models, which may not always conform to expected structures.

Pydantic, an open-source library, offers a solution by enabling developers to define expected data structures explicitly and validate them automatically using Python's type annotations. This feature is particularly useful for enhancing the reliability and precision of language model outputs, ensuring they meet predefined specifications before further processing.
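
For instance, a minimal sketch of validating an LLM's JSON output against an expected schema, assuming Pydantic v2 (the schema and raw output below are illustrative):

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class ExtractedPerson(BaseModel):
    name: str
    age: int
    email: Optional[str] = None  # optional field

raw_llm_output = '{"name": "Ada Lovelace", "age": 36}'

try:
    person = ExtractedPerson.model_validate_json(raw_llm_output)  # Pydantic v2 API
except ValidationError as error:
    # The output does not match the expected schema: retry, repair, or reject it
    print(error)
```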

The utility of Pydantic is becoming more evident within the community, with emerging tools like Marvin, Instructor, or Outlines utilizing it to establish strong interfaces with language models. Moreover, Pydantic's role in LangChain's output parsers underscores its importance in handling complex data structures.

To maximize the benefits of Pydantic, a thorough understanding of Python's data type system and its limitations is essential. Knowledge of how to effectively use and transform these types to model intricate data is crucial for leveraging Pydantic's full potential.

Our Perspective

Our experience with Pydantic in Data Engineering projects over the years has proven its value, particularly with the rise of large language models (LLMs) in AI projects. Adopting Pydantic is a strategic move towards ensuring robust data management and system integrity. Additionally, its application in our projects has helped clarify data models and make component interfaces explicit and well documented, thereby enhancing overall project clarity. We recommend Pydantic to teams well-versed in Python, as it stands as an indispensable tool for ensuring data precision and reliability.

6

Streamlit

Data scientists often require visual tools to effectively communicate findings with business stakeholders and peers. Rapid development of custom applications is key to this communication.

Streamlit, a Python library, democratizes web application creation for data scientists without necessitating web development skills. It offers a streamlined way to share technical findings via web applications, presenting a more accessible and visual format than traditional Jupyter notebooks for business audiences.
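
A complete app can fit in a handful of lines; here is a minimal sketch (the Parquet file and column names are hypothetical), launched with `streamlit run app.py`:

```python
import pandas as pd
import streamlit as st

st.title("Model error analysis")

df = pd.read_parquet("predictions.parquet")  # hypothetical results file

threshold = st.slider("Confidence threshold", 0.0, 1.0, 0.5)
errors = df[(df["confidence"] >= threshold) & (df["label"] != df["prediction"])]

st.metric("Misclassified examples", len(errors))
st.dataframe(errors)
```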

The integration of Streamlit with DVC (Data Version Control) enhances data science workflows, enabling effortless comparison of various model performances and outcomes. Moreover, Streamlit’s expansive community has contributed a wealth of components, extending its functionality to cover a broad range of needs.

Publishing applications on the Streamlit Community Cloud is straightforward and free, although private applications require manual deployment.

Streamlit excels in creating simple web apps where immediacy outweighs performance. For more complex applications demanding greater control and customization, traditional web development frameworks are recommended.

Our Perspective

At Sicara, Streamlit is a staple for early-stage project development, including proof of concept and investigative work. While we opt for bespoke solutions for production, Streamlit's low barrier to entry and efficiency make it a recommended tool for quickly validating ideas and facilitating communication in the data science workflow.

Trial

7

Dedicated vector database

Semantic information from text or images is typically encoded in the form of fixed-size vectors, called embeddings. Manipulating and querying such vectors require specific tools. Starting in the 2010s, vector search libraries began to appear, but they did not provide an associated storage layer.

Dedicated vector databases emerged in 2019 with Milvus and Pinecone. Compared to vector search in standard databases (such as PostgreSQL or Elasticsearch), dedicated vector databases are generally more performant and offer more specialized features: vector quantization, storing vectors on disk to reduce RAM usage and cost, or even searching with several query vectors to move toward some examples and away from others.

This gain in performance and available functionalities is particularly useful for use cases with large amounts of data (from several hundred thousand vectors to several billion). For example, Grok (X's LLM) relies on the Qdrant vector database for its RAG system.
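
As an illustration, querying such a database takes only a few lines; a minimal sketch with the qdrant-client Python package (the collection name, vector size, and payloads are illustrative, and recent client versions also expose a newer query API):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-memory mode for experimentation; use a URL in production

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"source": "faq"}),
        PointStruct(id=2, vector=[0.9, 0.1, 0.1, 0.2], payload={"source": "blog"}),
    ],
)

hits = client.search(collection_name="documents", query_vector=[0.1, 0.2, 0.3, 0.35], limit=1)
```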

Despite these advantages, vector databases can be complex to deploy and maintain. For example, Milvus is designed with a microservices architecture in which each service is complex to understand and debug (with little explanatory documentation), even with a dedicated team. Additionally, compared to vector search with standard databases, this requires adding another component to the technical stack, which is not a trivial decision for complex infrastructures. It can also pose transactional problems, such as serialization errors, if the vector database needs to interact in real time with a standard database.

 

Our Perspective

We recommend dedicated vector databases for any project requiring search over many vectors. In particular, we often use Qdrant, which stands out for its flexibility, performance, and integration with DVC.

This choice depends on the existing technical stack, the complexity of integrating a new tool, and anticipated transactional issues. A standard database with vector search can be a good alternative.

8

Presidio

Data privacy, essential for protecting users and complying with regulations like GDPR and CCPA, faces new challenges with the rise of external LLM APIs and shadow AI, which increase the risk of exposing Personally Identifiable Information (PII). To address this, solutions for anonymizing PII are essential.

Presidio, an open-source Python library by Microsoft, excels in detecting and anonymizing personal data within text such as emails, IP addresses, and phone numbers, a vital capability for maintaining data privacy.
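
A minimal sketch of the detection-then-anonymization flow (Presidio's default NLP engine expects a spaCy model such as en_core_web_lg to be installed; the text is illustrative):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Jane Doe, reach me at jane.doe@example.com or +1 212 555 0199."

# Detect PII entities (names, emails, phone numbers, ...)
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace the detected spans with placeholders such as <PERSON> or <EMAIL_ADDRESS>
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
```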

What sets Presidio apart is its flexibility in detection and in the anonymization process, allowing users to choose or retrain the entity detection model to meet their specific needs. However, it requires clear definitions of what data to anonymize, which can be a nuanced challenge. For instance, differentiating between a bank's name, which should remain unaltered, and a user's company name, which requires anonymization, illustrates the complexity of applying such tools effectively, for example when anonymizing a user's interactions with a chatbot.

Our Perspective

We have successfully leveraged Presidio in projects involving LLMs, valuing its ease of integration and the minimal setup required. This makes Presidio a practical choice for projects needing efficient PII anonymization, even with limited resources.

9

Segments.ai

In data science projects, the significance of accurately labeled data cannot be overstated, yet the process of obtaining such data is labor-intensive. Segments.ai, a startup established in 2020, aims to revolutionize image annotation by minimizing costs and simplifying the process. Their platform has been recognized for its efficiency and user-friendliness in image annotation tasks, offering several advantages:

  1. MLOps Pipeline Integration: Segments.ai enhances dataset management within MLOps pipelines through versioning capabilities and a Python SDK, ensuring seamless integration into Python-based workflows (a minimal SDK sketch follows this list).
  2. Team Collaboration: It supports various roles for managing annotation lifecycles, such as review, validation, and rejection, enhancing team collaboration. Automation of these processes is possible through the Python SDK.
  3. Annotation Speed: The platform accelerates the segmentation task by enabling the use of pre-trained deep learning models for initial image annotation, significantly reducing manual effort.
  4. Technological Agility: Demonstrating agility in adopting new technologies, Segments.ai integrated SAM (Segment Anything Model from Meta AI) promptly, showcasing their commitment to providing cutting-edge tools.
  5. User-Friendly Interface: Despite a slight learning curve, the platform's intuitive design facilitates user adoption, making it a strong contender in the annotation tool market.
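
As a rough sketch of the SDK-based workflow: the client class and methods below follow the segments-ai package's documented pattern, but the dataset identifier, sample name, and attribute format are illustrative and should be checked against the current SDK documentation:

```python
from segments import SegmentsClient

client = SegmentsClient("YOUR_API_KEY")

# Add a new image to an existing dataset (identifier and URL are hypothetical)
client.add_sample(
    "my-org/street-scenes",
    name="frame_0001.jpg",
    attributes={"image": {"url": "https://example.com/frame_0001.jpg"}},
)

# Retrieve samples, e.g. to export annotations into an MLOps pipeline
samples = client.get_samples("my-org/street-scenes")
```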

However, the platform does exhibit limitations in customization and permission management, such as a fixed annotation workflow and less detailed permission settings compared to some alternatives.

CVAT stands as an open-source alternative, offering a different balance of features and flexibility, albeit with a narrower scope in comparison to Segments.ai's offerings.

10

TensorBoard Embedding Projector

Visualizing the results of an embedding model can be complex. The TensorBoard Embedding Projector addresses this issue by projecting embeddings into a 3D space via dimensionality reduction methods (e.g., t-SNE). This makes it possible to visualize potential clusters and to zoom in to inspect specific similarities. It implements features that simplify visualization, such as the ability to associate each point with an image (useful in the case of image embeddings) or to color points by metadata.

The projector can be created in a few lines of code with PyTorch.
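
For instance, a minimal sketch using PyTorch's SummaryWriter (the embeddings and labels below are random placeholders):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

embeddings = torch.randn(1000, 256)                 # placeholder: 1000 vectors of dimension 256
labels = [f"class_{i % 10}" for i in range(1000)]   # placeholder metadata used to color points

writer = SummaryWriter("runs/projector_demo")
writer.add_embedding(embeddings, metadata=labels)   # optionally pass label_img=... for image thumbnails
writer.close()

# Then run: tensorboard --logdir runs  and open the "Projector" tab
```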

However, the projection remains an approximation of reality (two points that appear close in the 3D projection may be far apart in the original embedding space). Furthermore, the tool's user experience is relatively poor when navigating through the point cloud, making it cumbersome to zoom in on a particular area.

Several alternatives offer embedding projections, although in 2D only:

  • The embedding projection directly integrated into the dashboard of the dedicated vector database Qdrant.
  • The embedding projection of the FiftyOne tool.

Hold

11

Pandas with NumPy backend

Before 2009, data processing in Python was limited to tools like NumPy or native Python. Pandas, built on top of NumPy, quickly became an industry and research standard for analyzing and manipulating tabular data more quickly and efficiently through its abstraction layer. NumPy is efficient for matrix operations and achieves performance unattainable in "pure" Python for numerical work because its core routines are implemented in a compiled language (C).

Despite this, the library shows its limitations in handling large datasets or complex tasks because it was only designed for in-memory analytics and not big data.

Today, the Python ecosystem for DataFrames has been enriched, notably with alternatives to Pandas' NumPy backend, such as the PyArrow backend, and with alternative DataFrame libraries such as Polars. Both PyArrow and Polars (the latter written in Rust) are designed for multi-threaded execution and rely on the Arrow data format to optimize memory usage and computation, offering significantly better performance.
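
For example, opting into the PyArrow backend when loading data, available since Pandas 2.0 with pyarrow installed (the file names are hypothetical):

```python
import pandas as pd

# Arrow-backed dtypes instead of the default NumPy-backed ones
df = pd.read_parquet("events.parquet", dtype_backend="pyarrow")

# The pyarrow engine can also parse CSV files in a multi-threaded way
df_csv = pd.read_csv("events.csv", engine="pyarrow", dtype_backend="pyarrow")
```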

 

Our Perspective

We recommend moving away from Pandas' NumPy backend. Although this technology is comprehensive and well-integrated with other tools like Matplotlib, current options like PyArrow or Polars offer similar capabilities with more advanced optimizations. As dataset sizes increase, prioritizing performance becomes essential.

Pandas developers also recognize this trend: version 1.4.0 (January 2022) added a multi-threaded PyArrow engine for reading CSVs, and version 2.0 (2023) introduced an optional PyArrow backend for DataFrames alongside the default NumPy one.