June 8, 2023 • 4 min read

NLP for preprocessing text data at scale with GCP Dataproc

Written by Chloé Adnet

Web scraping is about extracting large amounts of data from websites. You are probably familiar with its use cases, such as unlocking data sources that add value to the business or making smart data-driven decisions. Scraped data must be preprocessed before it can be analyzed and, done well, preprocessing improves the performance of NLP (Natural Language Processing) tasks. Spark NLP and Spark ML, two libraries built on Spark, are well suited to preprocessing scraped data in a distributed way.

Let’s explore how to use NLP to preprocess text data at scale with GCP Dataproc, a service that lets users create managed clusters that can scale. We will go through each ingredient separately before presenting the miracle recipe!

NLP pre-processing

Common NLP pre-processing tasks

Preprocessing text data extracted from web files means cleaning and standardizing it, and it is the most important step before any analysis task. NLP is the branch of data science dedicated to text, and preprocessing is part of it. Some common preprocessing tasks are:

  • Lower casing
  • Removal of Stopwords
  • Removal of Frequent words
  • Stemming
  • Lemmatization
  • Removal of URLs

The choice and order of preprocessing steps in the pipeline differ for each dataset and depend on the target analysis task.
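To make a few of the tasks listed above concrete, here is a minimal single-machine sketch in plain Python, not the distributed version discussed later. The stopword list, the regular expressions, and the sample pages are illustrative assumptions. Stemming and lemmatization are left out because they need a dedicated library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real pipelines use a full list.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> list[str]:
    """Remove URLs, lowercase the text, and split it into word tokens."""
    without_urls = URL_PATTERN.sub(" ", text)
    return TOKEN_PATTERN.findall(without_urls.lower())


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def remove_frequent_words(docs: list[list[str]], top_n: int) -> list[list[str]]:
    """Drop the top_n most frequent tokens counted across the whole corpus."""
    counts = Counter(token for doc in docs for token in doc)
    frequent = {word for word, _ in counts.most_common(top_n)}
    return [[token for token in doc if token not in frequent] for doc in docs]


# Hypothetical scraped snippets.
pages = [
    "The price of the laptop is low, see https://example.com/laptop",
    "Price alert: a phone and a laptop are in stock",
]
docs = [remove_stopwords(tokenize(page)) for page in pages]
cleaned = remove_frequent_words(docs, top_n=2)
```

Here the order matters: URLs are removed before tokenization so that their fragments do not end up as tokens, which illustrates why the sequence of steps has to be chosen per dataset.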

Spark NLP library

Since we are dealing with a huge amount of text data, Spark is a suitable analytics engine. You have to try it! It is a distributed computing system that handles big data at scale. It supports multiple languages and libraries, keeps complex data pipelines coherent, runs ETL (Extract, Transform, Load) jobs on large datasets, and offers easy-to-use APIs for data processing. Spark NLP is a library that lets users run NLP tasks such as preprocessing at scale. Spark ML also provides machine learning tools.

Let’s implement stopwords removal and lemmatization scripts with Spark NLP and Spark ML in Python:

Lemmatization script
Stopwords removal script

Both scripts take data frames containing web file content as input and produce a data frame as output. In the first script, a processing pipeline prepares the data before lemmatization. Lemmatization then maps the inflected forms of a word to their common dictionary entry, the lemma: for example, "running" and "ran" both map to "run".
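As a rough idea of what such scripts can look like, here is a minimal sketch and not the original scripts of this article. It assumes a data frame with a `text` column. The column names and function names are hypothetical, and `LemmatizerModel.pretrained()` downloads the library's default English model, which requires network access from the cluster.

```python
def build_stopwords_pipeline(input_col="text", output_col="filtered_tokens"):
    """Spark ML pipeline: split text into lowercase tokens, then drop stopwords."""
    # Imported lazily so this module can be loaded where pyspark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

    # RegexTokenizer lowercases by default and splits on non-word characters here.
    tokenizer = RegexTokenizer(inputCol=input_col, outputCol="tokens", pattern="\\W+")
    # StopWordsRemover uses its default English stopword list.
    remover = StopWordsRemover(inputCol="tokens", outputCol=output_col)
    return Pipeline(stages=[tokenizer, remover])


def build_lemmatization_pipeline(input_col="text", output_col="lemmas"):
    """Spark NLP pipeline: document -> tokens -> normalized tokens -> lemmas."""
    from pyspark.ml import Pipeline
    from sparknlp.annotator import LemmatizerModel, Normalizer, Tokenizer
    from sparknlp.base import DocumentAssembler, Finisher

    document = DocumentAssembler().setInputCol(input_col).setOutputCol("document")
    tokens = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Normalizer cleans tokens with its default patterns and lowercases them.
    normalized = (
        Normalizer().setInputCols(["token"]).setOutputCol("normalized").setLowercase(True)
    )
    lemmas = (
        LemmatizerModel.pretrained().setInputCols(["normalized"]).setOutputCol("lemma")
    )
    # Finisher converts Spark NLP annotations back into a plain column of strings.
    finisher = Finisher().setInputCols(["lemma"]).setOutputCols([output_col])
    return Pipeline(stages=[document, tokens, normalized, lemmas, finisher])


def apply_pipeline(pipeline, df):
    """Fit the pipeline on the data frame and return the transformed data frame."""
    return pipeline.fit(df).transform(df)
```

With a session created by `sparknlp.start()` and a data frame `pages` holding the scraped text, `apply_pipeline(build_lemmatization_pipeline(), pages)` would return the same rows with an extra `lemmas` column.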

Dataproc pipeline

GCP Service

Dataproc is a GCP* managed service for the Spark framework, useful for big data processing and ETL. It provides ephemeral or persistent clusters that are auto-scaling and customizable, as well as job orchestration. Dataproc connects easily with other GCP services (Compute Engine, Cloud Storage, BigQuery) and is well suited to distributed computing with Spark on clusters. Once instantiated, a cluster lets users run data transformations. These transformations are described in jobs (specific files), and each cluster has its own configuration. Dataproc follows a job-scoped cluster model.

There are several ways to create a cluster with Dataproc and run a pipeline on it. The most common is to create a persistent cluster with a specific configuration. Once the cluster is created, users can submit jobs. The solution we will explore uses the Dataproc Workflow Templates API, in which clusters are ephemeral.

*Google Cloud Platform

Dataproc Workflow Template API

The Dataproc Workflow Templates API enables users to configure and execute pipelines called workflows. A workflow is a DAG (Directed Acyclic Graph) in which each step is a Spark job, as the Dataproc pipeline schema illustrates. Instantiating a template starts a workflow: it creates the specified cluster, runs the jobs, and deletes the cluster when the jobs are finished.

Dataproc pipeline

You need a GCP account to instantiate a workflow template. The first step is to write a YAML template file that describes the set of jobs along with the cluster properties, configuration, and dependencies. With the prerequisiteStepIds property, a job starts only after its dependencies complete successfully.
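To see how prerequisites shape the execution order of a DAG, here is a small local sketch built on Python's standard `graphlib` module. It only models the ordering logic and is not Dataproc's scheduler. The step names are hypothetical, and each step maps to the list it would declare in prerequisiteStepIds.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: step ID -> IDs of the steps it depends on.
steps = {
    "load_pages": [],
    "remove_stopwords": ["load_pages"],
    "remove_urls": ["load_pages"],
    "lemmatize": ["remove_stopwords", "remove_urls"],
}


def execution_waves(step_prerequisites):
    """Group steps into waves; a step is ready once all its prerequisites are done."""
    sorter = TopologicalSorter(step_prerequisites)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(steps)
```

In this model, steps in the same wave do not depend on each other, so nothing prevents them from running side by side, while `lemmatize` has to wait for both of its prerequisites.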

Stepped template

In the workflow template, notice the managed cluster in the placement section and the two jobs. Notice also the imageVersion, which determines the OS of the cluster, and the properties for Spark jobs. The configuration also specifies the machine type, which affects the cost of the service, as Dataproc pricing is based on cluster size, configuration, and duration.

Then instantiate the template from the command line:
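The same kind of template can also be expressed as a Python dictionary and instantiated inline with the `google-cloud-dataproc` client library. The sketch below is an illustration under assumptions, not the template used in this article: the cluster name, zone, machine types, image version, step IDs, and script paths are all hypothetical. The client expects snake_case keys such as `prerequisite_step_ids` and `image_version`, which correspond to prerequisiteStepIds and imageVersion in the YAML file.

```python
def build_workflow_template(bucket, zone="europe-west1-b"):
    """Return a workflow template dict with a managed cluster and two PySpark jobs."""
    machine = {"machine_type_uri": "n1-standard-4"}  # assumed machine type
    return {
        "placement": {
            "managed_cluster": {
                "cluster_name": "nlp-preprocessing",  # hypothetical name
                "config": {
                    "gce_cluster_config": {"zone_uri": zone},
                    "master_config": {"num_instances": 1, **machine},
                    "worker_config": {"num_instances": 2, **machine},
                    "software_config": {"image_version": "2.1-debian11"},
                },
            }
        },
        "jobs": [
            {
                "step_id": "remove_stopwords",
                "pyspark_job": {
                    "main_python_file_uri": f"gs://{bucket}/jobs/remove_stopwords.py"
                },
            },
            {
                "step_id": "lemmatize",
                # This step starts only after remove_stopwords succeeds.
                "prerequisite_step_ids": ["remove_stopwords"],
                "pyspark_job": {
                    "main_python_file_uri": f"gs://{bucket}/jobs/lemmatize.py"
                },
            },
        ],
    }


def instantiate(project_id, region, template):
    """Create the cluster, run the jobs, delete the cluster, and wait for the end."""
    # Imported lazily so the template can be built without the client installed.
    from google.cloud import dataproc_v1

    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.instantiate_inline_workflow_template(
        request={
            "parent": f"projects/{project_id}/regions/{region}",
            "template": template,
        }
    )
    operation.result()  # blocks until the workflow has finished
```

The zone has to belong to the region passed to `instantiate`. Changing the machine type or the number of workers in this dictionary is also where the cluster cost would be tuned.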

gcloud dataproc workflow-templates instantiate-from-file --file=template.yaml 
Dataproc workflow template deployment

We can follow the pipeline execution logs from the Google Cloud Dataproc Console. Below we see that the two jobs completed successfully on the managed cluster.

Waiting on operation 
WorkflowTemplate [workflow-templates] RUNNING
Creating cluster: Operation ID
Created cluster: cluster.
Job ID first_job RUNNING
Job ID last_job RUNNING
Job ID first_job COMPLETED
Job ID last_job COMPLETED
Deleting cluster: Operation ID
WorkflowTemplate [workflow-templates] DONE
Deleted cluster: cluster.

In the output, you can see the cluster being created, both jobs running and completing, and the cluster being deleted.

Dataproc pipeline with Spark jobs

Compute full pipeline

Now that all the ingredients have been presented, let's move on to the recipe itself. We will preprocess a large volume of scraped web data with GCP Dataproc, as illustrated.

The scraped web data must be stored in a bucket in Google Cloud Storage (GCS). A script must then be written to load this data into relational tables. For processing, you have to write Python scripts that transform the data within the tables (lemmatization, tokenization, punctuation removal, standardization) using the Spark libraries.

The next step is to write a .yaml file that launches a processing pipeline on a Spark cluster once instantiated. In this file, we specify references to the job scripts, their associated libraries, and their dependencies. You can check that the jobs ran correctly in the console and through the logging services on GCP. That’s it!
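The loading step could look like the following PySpark job, given as a sketch under assumptions: the bucket layout, the argument names, and the choice of Parquet as the table format are hypothetical, since the article does not specify them. Each scraped file becomes one row with its content and its source path.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output locations passed to the job."""
    parser = argparse.ArgumentParser(description="Load scraped pages from GCS.")
    parser.add_argument("--input", required=True, help="gs:// path of the raw pages")
    parser.add_argument("--output", required=True, help="gs:// path of the output table")
    return parser.parse_args(argv)


def load_pages(spark, input_uri):
    """Read every file as a single row holding its full text and its path."""
    # Imported lazily so argument parsing can be tested without pyspark.
    from pyspark.sql.functions import input_file_name

    return (
        spark.read.text(input_uri, wholetext=True)
        .withColumnRenamed("value", "text")
        .withColumn("source", input_file_name())
    )


def main(argv=None):
    """Entry point: load the pages and store them as a Parquet table."""
    from pyspark.sql import SparkSession

    args = parse_args(argv)
    spark = SparkSession.builder.appName("load_pages").getOrCreate()
    load_pages(spark, args.input).write.mode("overwrite").parquet(args.output)
    spark.stop()
```

When deployed as a job file, the script would end with a call to `main()`, and the two arguments would be supplied through the job's argument list in the workflow template.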

Complete pipeline

Things to keep in mind

As you'd expect, the methods explored here have strengths and weaknesses, which this final section summarizes.

Method Weaknesses

On the one hand, some disadvantages are linked to the GCP Dataproc service. Cluster deployment is slow, which increases iteration time and makes testing more complex. Infrastructure maintenance can also be painful: with Dataproc, the user remains responsible for cluster provisioning and monitoring, and keeping operations running smoothly takes effort. The last weakness is linked to the pipeline itself: if the data to process, such as scraped web pages, is stored in a different environment, the user must transfer it to GCS before processing it.

Method Strengths

On the other hand, Spark's distributed processing capabilities allow NLP preprocessing tasks to be parallelized across multiple worker nodes. This leads to faster execution times and improved performance when dealing with large datasets. Spark also provides libraries for NLP tasks such as text tokenization and word embedding. Users can call these libraries within Spark jobs to perform complex NLP preprocessing operations.

Dataproc integrates well with other Google Cloud services like GCS, seamlessly incorporating them into your NLP pipeline. Users can read data from and write data to these services, enabling efficient data ingestion and storage. Dataproc also offers cost advantages through autoscaling, allowing it to scale clusters up or down based on workload demands.

So, when deciding whether to use Dataproc for your NLP preprocessing pipeline, it is crucial to weigh these strengths and weaknesses against your specific requirements and team expertise! If you have an NLP project, contact Sicara!
