
How To Build A Successful AI PoC

Written by Arnault Chazareix

Turn Your Artificial Intelligence Ideas Into Working Software

Building an AI PoC is hard. In this post, I will explain my thought process to make my Artificial Intelligence PoCs succeed.

“What if my alarm could use traffic information to wake me up just in time to go to work?” We’ve all thought of resorting to AI to solve one of our problems. The goal of a Proof of Concept (PoC) is to test whether such an idea is worth investing time in. Building a PoC is hard, but it’s even harder to build an AI PoC because it requires a large set of skills.

When building an AI PoC, data science only represents a small fragment of the work, but it’s one of the most important skills. It’s easy to find great tutorials on how to solve a specific task, like how to build a detection algorithm to park your car or how to deploy a Flask app to the cloud. But it’s a lot harder to design a solution to your specific problem, mainly because you need the hindsight to reformulate your problem into a standardized task.

In this post, I will explain my method for achieving this.

First, I will start with a review of what an AI system looks like. Then, I will describe my 3-step process to design an Artificial Intelligence. Finally, we’ll see 2 examples: a simple one, and a complete one with a Python implementation.

Overview of an Artificial Intelligence System

As an example, I will take a system which classifies documents. It answers the question “What kind of document is this?” with classes like “electric invoice” or “to-do list”.

AI workflows consist of 5 steps:

  • receiving the question: “What kind of document is this?”
  • adding complementary data on the user or the context: “What type of documents does the user have?”
  • using the data to answer the question: “Which type does this document belong to?” is answered with “This is an energy invoice”
  • storing the result: adding the new documents to the database
  • answering the client’s question: “This is an energy invoice”

You can break this down into 3 tasks, or semantic blocks:

  • Handling the client: receiving the question, making them wait… Example: an HTTP server
  • Data conciliation: communication with the “company knowledge base” to add or receive relevant data. Example: communication with a database
  • AI Block: the AI itself, which answers the question with a context. Example: expert system, SVM, neural networks…

Answering the question “What kind of document is this?”
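
To make the three blocks concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the in-memory dictionary stands in for a real database, the word-overlap rule stands in for a real AI block, and names like `handle_request` are hypothetical.

```python
# Hypothetical in-memory stand-ins for the "company knowledge base".
KNOWLEDGE_BASE = {"joe": ["energy invoice", "to-do list"]}
STORED_RESULTS = []


def fetch_context(user):
    """Data conciliation: which document types does this user already have?"""
    return KNOWLEDGE_BASE.get(user, [])


def ai_block(document, known_types):
    """AI block (placeholder logic): pick the known type that shares
    the most words with the document."""
    words = set(document.lower().split())
    return max(
        known_types,
        key=lambda doc_type: len(words & set(doc_type.split())),
        default="unknown",
    )


def handle_request(user, document):
    """Handling the client: receive the question, add context,
    ask the AI block, store the result, and answer."""
    known_types = fetch_context(user)
    answer = ai_block(document, known_types)
    STORED_RESULTS.append((user, document, answer))
    return answer


print(handle_request("joe", "Energy invoice for March"))
```

In a real PoC, `handle_request` would typically sit behind an HTTP route and `fetch_context` would query a database.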

You can find great tutorials on how to architect your server or your data conciliation layer on the web. The simplest solution for an AI PoC in Python is using Flask and a SQL database, but it highly depends on your needs and what you already have. Here is a tutorial on using Flask with SQLAlchemy. We are going to focus on designing the AI itself.

Designing The AI Block

AI tasks can involve multiple heterogeneous inputs: for example, the age and the location of a user, or a whole email discussion.

AI Outputs depend on the task: the question we want to answer. There are a lot of different tasks in AI. You can see some of the usual tasks in computer vision in the image below.

Various computer vision tasks from a post about image segmentation

Thinking of ways to build an AI seems complicated as soon as you venture out of the standardized inputs and tasks.

To wrap my mind around the complexity of building an AI, I use a 3-step process.

Step 1: Browsing the relevant inputs

First, gather all the inputs you suspect are capable of answering the task at hand and select those that are self-sufficient in the majority of cases.

When testing an AI idea, it’s easy to get greedy and think about solutions that include a lot of inputs: the location of the user may give me an insight into what their next e-mail will be, for example. The truth is: it’s just so easy to get lost in mixing various inputs with different meanings or nature and end up delivering nothing.

Stick to simple, self-sufficient inputs when building your AI.

Step 2: Vectorizing the data

The second step is to preprocess those inputs, to make those usable for various algorithms. In a way, every AI process passes through a bunch of steps to obtain a vector representation.

Text to vector: Vectorization based on word counting

This process can be really simple, like counting how frequently words appear in a document or directly using the values of an image’s pixels. It can also become really complex, with multiple layers of preprocessing.
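
As a minimal sketch of the word-counting case, assuming a fixed vocabulary chosen in advance (the function name and the vocabulary are hypothetical):

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Turn a text into a vector: one count per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["the", "cat", "bird"]
print(count_vector("The cat and the dog", vocabulary))
```

Every text, whatever its length, ends up as a vector of the same size as the vocabulary, which is what makes it usable by downstream algorithms.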

Image to vector: Vectorization of a PNG image to a 48x48 grayscale vector based on pixel values. Who is Lenna?

Inputs can be really different: different sizes, color scales or formats for images. Keep in mind that the idea here is to build a meaningful, normalized representation of all inputs.
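
For images, a possible sketch of such a normalization is shown below. It assumes the image is already loaded as rows of (red, green, blue) tuples with values from 0 to 255. The nearest-neighbor resizing and the luminance weights are my choices for illustration, not necessarily what was used for the figure above.

```python
def image_to_vector(pixels, size=48):
    """Resize an RGB image to size x size with nearest-neighbor sampling,
    convert it to grayscale, and scale values to the [0, 1] range."""
    height = len(pixels)
    width = len(pixels[0])
    vector = []
    for y in range(size):
        source_y = y * height // size
        for x in range(size):
            source_x = x * width // size
            red, green, blue = pixels[source_y][source_x]
            # Common luminance weights for RGB to grayscale conversion.
            gray = 0.299 * red + 0.587 * green + 0.114 * blue
            vector.append(gray / 255)
    return vector


# Two images of different sizes give vectors of the same length.
small = [[(255, 255, 255)] * 4 for _ in range(4)]
large = [[(0, 0, 0)] * 100 for _ in range(60)]
print(len(image_to_vector(small)), len(image_to_vector(large)))
```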

Build a normalized and meaningful representation of your inputs.

Step 3: Processing the vectors

The third step is the moment to think about the output and how to get to it.

Like the input, the output needs to be “vectorized”. For classification, it’s straightforward: one field per class.
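
A minimal sketch of this “one field per class” encoding, with hypothetical class names and helper functions:

```python
def one_hot(label, classes):
    """Encode a class label as a vector with one field per class."""
    return [1 if name == label else 0 for name in classes]


def decode(vector, classes):
    """Map an output vector back to the class with the highest value."""
    return classes[vector.index(max(vector))]


classes = ["electric invoice", "to-do list", "contract"]
print(one_hot("to-do list", classes))
print(decode([0.1, 0.7, 0.2], classes))
```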

Then, we need to find a way to get from the input vector to the output vector. In the end, this is the first thing we learn about when we start looking into AI. It can range from simple operations, like finding the closest vector or the highest value, to more complex ones, like using huge neural network architectures.

Most tasks, like regression, classification, or recommendation, are highly documented. For a PoC, the simplest approach is to use a library of pre-implemented algorithms like scikit-learn and try them out.

Vectorization of the output of a classification task

Look for simple and pre-implemented algorithms.

A Straightforward Example

Task: Is a text in French or English?

A solution: 

Step 1: Browsing the relevant inputs. The text is the only input possible if we don’t have any origin or other metadata.

Step 2: Vectorizing the data. A simple way to vectorize it is to count the occurrences of English words and French words. We are going to use the most frequent words of each language. They are called stop words. In English, for example: the, he, him, his, himself, she, her…

Step 3: Processing the vectors. Then, we can just choose to classify with the highest of the two values in order to obtain a binary output: True or False.
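
The three steps above can be sketched in a few lines of Python. The two stop word lists are short, hand-picked samples chosen for illustration, not complete lists, and the function names are hypothetical.

```python
import re

# Small illustrative samples; real stop word lists are much longer.
ENGLISH_STOP_WORDS = {"the", "of", "and", "to", "in", "is", "it", "he", "she", "his", "her"}
FRENCH_STOP_WORDS = {"le", "la", "les", "de", "des", "et", "un", "une", "est", "dans", "il", "elle"}


def stop_word_vector(text):
    """Step 2: vectorize the text as (English count, French count)."""
    words = re.findall(r"\w+", text.lower())
    english = sum(word in ENGLISH_STOP_WORDS for word in words)
    french = sum(word in FRENCH_STOP_WORDS for word in words)
    return english, french


def is_french(text):
    """Step 3: pick the language with the highest count."""
    english, french = stop_word_vector(text)
    return french > english


print(is_french("Le chat est dans la maison et il dort"))
print(is_french("The cat is in the house and it sleeps"))
```

Note that a tie, including a text with no stop words at all, is reported as not French here. That is an arbitrary choice of this sketch.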

Random French and English Wikipedia pages separated by their stopword ratio. The blue outlier is the French page on Ferroplasmaceae which sadly contains more English references than French sentences.

Building an AI is often a mix of Human Expertise (Business Knowledge) and Computer Intelligence (Machine Learning). In this example, I used Human Expertise to choose how to build my vector thanks to French and English stop words. I could have also used Machine Learning to train a model to either build a corresponding vector (Step 2) or learn the classification out of more complex vectors (Step 3).

A More Complex Problem

At a meetup, I was talking with someone working on a digital safe project. He told me he wanted to help his users classify and sort their personal documents: contracts, bills, papers… He noticed that as more content is stored and the folder tree becomes more complex, people tend to misclassify their documents. It also becomes harder to find the content they are looking for. A search engine only “patches” the problem without addressing the root cause: documents can be found only if precise information is known, and folders remain messy.

So how can we solve this issue?

Note: I actually developed it: check this GitHub Repository: digital-safe-document-classification.

Clarifying the idea and defining the scope of the PoC

We are going to design a user experience (UX) where the user can upload a document and is shown the best-matching folders for it. We want to support these types of files: txt, text, markdown, and pdf.

Using the service with “A Formal Definition of Context for Concept Learning”. “data_to_read” is the folder where I put articles I want to read. Work is the folder containing my old school reports (mainly Data Science projects). 2 folders are selected among 15. You can find the implementation here.

We want to suggest folders that the user currently has, not older ones nor ones from other people: the answer has to be user-specific and time-specific.

Step 1: Browsing the relevant inputs

First, we need to know the user’s folders, otherwise we won’t be able to answer. To make our choice, we can use:

  • the content of the documents
  • the time they were added: some bills could be monthly or some tasks could be performed mostly at certain hours
  • the filename and type: “energy_invoice_joe_march.pdf”, “pdf”

In our case, the most reliable input is probably the content of the document. We are going to use the uploaded document and the content of the user’s folders as a comparison. Let’s focus on that.

Step 2: Vectorizing the inputs

Right now, we have different input formats: pdf, markdown, text, txt… We can directly work on the file content for markdown and other text formats. But we will have to process the pdf files to be able to use them in the same way as the others.

Converting pdf to text using Linux command line pdftotext from poppler-utils, link to the source code

I found pdftotext, the tool used here, through a Google search. It’s effective but has a major drawback: it does not perform optical character recognition (OCR). This means that it will read most PDFs, but not the ones created from an image or a scan. To solve this, I could use alternatives like Tesseract, but I am not going to bother for this example.
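
A minimal sketch of calling pdftotext from Python could look like this. It assumes poppler-utils is installed and `pdftotext` is on the PATH. The function names are hypothetical and this is not necessarily how the repository does it.

```python
import subprocess


def build_pdftotext_command(pdf_path):
    """Build the command line; "-" as the output file asks pdftotext
    to write the extracted text to standard output."""
    return ["pdftotext", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text.
    Raises CalledProcessError if the conversion fails."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example (requires pdftotext and an existing file):
# text = pdf_to_text("energy_invoice_joe_march.pdf")
```

For a scanned PDF, this would typically return little or no text, since no OCR is performed.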

We want to transform our text into vectors. Let’s have a look at scikit-learn. If we look for vectorizers, we find a feature extraction package for text. This is exactly what we are looking for. It has two vectorizers: one based on word counting, another one called TfidfVectorizer, which we are going to use.

Transforming an invoice pdf first into text then into a vector

Tfidf stands for Term Frequency and Inverse Document Frequency. It’s basically word counting, but smarter. Rather than just counting the words, we assess the importance of a term in a document by comparing its count to the number of words in the document: Term Frequency (TF). We then weight it by how many documents in the collection contain the term. The fewer documents it appears in, the more specific it is to this document: Inverse Document Frequency (IDF).
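
To see the idea at work, here is a minimal sketch using the textbook formulas tf = count / document length and idf = log(N / number of documents containing the term). This is an illustration only: scikit-learn’s TfidfVectorizer uses a smoothed variant and normalizes the vectors, so its numbers will differ. The function name and sample documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per document.
    Assumes every document contains at least one word."""
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    # Document frequency: in how many documents does each word appear?
    document_frequency = Counter(
        word for words in tokenized for word in set(words)
    )
    idf = {
        word: math.log(len(tokenized) / document_frequency[word])
        for word in vocabulary
    }
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append(
            [counts[word] / len(words) * idf[word] for word in vocabulary]
        )
    return vocabulary, vectors


documents = ["energy invoice march", "energy invoice april", "todo list"]
vocabulary, vectors = tfidf_vectors(documents)
for word, weight in zip(vocabulary, vectors[0]):
    print(f"{word}: {weight:.3f}")
```

In the first document, “march” gets a higher weight than “energy” because it appears in only one document, while “energy” appears in two.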

Step 3: Processing the vectors

We want a list of the best folders as our final output. It’s simple to map a folder name to a number. But we won’t be able to have a simple normalized output vector, because the size of the output vector will change. Indeed, the number of folders depends heavily on the user and their current folders. For this reason, we can’t use a normal classification algorithm with a fixed number of classes. We would need to retrain the model every time, and build either one model for every user or a huge one for all users.

But we have already included “Intelligence” in the vectorization process. So we are going to take another approach, closer to what search engines do: vectorize the uploaded document and the documents already in the folders, then compare the resulting vectors.

To find the best folders, we look for documents matching our uploaded document best

We find the documents whose vector is most similar to the uploaded document and link them back to their original folder.

Finding the best folders using Cosine Similarity, link to the source code
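
Here is a self-contained sketch of this comparison step. To keep it short, it uses raw word counts instead of Tfidf vectors, and the folder names, sample texts, and function names are made up for illustration. The repository linked above remains the reference for the actual implementation.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse word-count vector (a stand-in for a Tfidf vector)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_folders(uploaded_text, documents, top_n=2):
    """documents is a list of (folder, text) pairs. Rank stored documents
    by similarity to the upload and return their folders, best first."""
    query = vectorize(uploaded_text)
    scored = sorted(
        ((cosine_similarity(query, vectorize(text)), folder)
         for folder, text in documents),
        reverse=True,
    )
    folders = []
    for _, folder in scored:
        if folder not in folders:
            folders.append(folder)
    return folders[:top_n]


documents = [
    ("bills", "energy invoice march amount due"),
    ("work", "data science report on concept learning"),
    ("data_to_read", "a formal definition of context"),
]
print(best_folders("energy invoice april amount", documents, top_n=1))
```

Because folders are only looked up from the stored documents, adding or removing a folder requires no retraining.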

To summarize, solving an AI problem can be reduced to these 3 steps:

  • First, Browsing the relevant inputs
  • Second, Vectorizing the data
  • Third, Processing the vectors

I hope it will help you make your AI Ideas real. :)

Thanks to Clara Vinh, Florian Carra, Clément Walter, Antoine Toubhans, Martin Müller, and Alexandre Sapet.

