The aim of this post is to discuss the different approaches that can be taken when several models need to be transferred from experimental research into production. As a dynamic company, we adapt our approach to each customer's individual needs, offering maximum flexibility and customisation; this in turn requires us to stay adaptable to new technologies in this ever-evolving field.
The language of choice for our Data Science team is Python and we use Jupyter Notebooks in all our projects. Therefore, a natural thought was: “We should use Python and Jupyter Notebooks in production!”.
The issue arises when the volume of data processed in production is far higher than the amount of data normally used in research to build models or obtain insights. In that case, a different technological framework is needed. To enter the world of Big Data, we need a framework that allows us to scale and parallelise the different elements of the machinery used to gain insights and obtain reliable results. Something as simple as transforming data to feed an algorithm can be very difficult and memory-expensive when our customers have millions of users. Since these requirements aren't met by the usual Python libraries (NumPy, Pandas, Modin, etc.), the solution for big datasets lies in the use of a trusted cluster computing system.
In addition to that, there is also a concern when choosing the framework, as any results obtained by our algorithms need to be served to different APIs and components in our company’s environment i.e. the framework has to allow us to integrate our machinery in their environment easily.
Preferably, we would like a framework that lets us keep coding in Python; that would make it the ideal candidate for us.
Two cluster computing systems rise above the rest: Apache Spark and Dask. Below are summaries from, and links to, the main webpage of each:
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimised engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
- Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimiser, and a physical execution engine.
- Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.
- It can access diverse data sources. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
Dask is a flexible library for parallel computing in Python and it is composed of two parts:
- Dynamic task scheduling optimised for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimised for interactive computational workloads.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
Dask emphasises the following virtues:
- Familiar: Provides parallelised NumPy array and Pandas DataFrame objects.
- Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
- Native: Enables distributed computing in pure Python with access to the PyData stack.
- Fast: Operates with low overhead, low latency, and minimal serialisation necessary for fast numerical algorithms.
- Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans.
In the community, the opinions are mixed and the choice can be biased.
How did we decide?
Easy, we tried both and decided to use PySpark (Python API for Apache Spark).
Our choice is aligned with the opinion of Wes McKinney as expressed in one of his blog posts (http://wesmckinney.com/blog/apache-arrow-pandas-internals/, Sep 2017), where he explains the disadvantages of Pandas and how these affect Dask:
“One issue with the Dask model is that it’s using pandas as a black box. dask.dataframe does not solve pandas’s inherent performance and memory use problems, but it spreads them out across multiple processes and helps mitigate them by being careful to not work with too large pieces of data all at once, which can result in an unpleasant MemoryError.
Some problems don’t fit the Dask partition-parallel distributed task execution model. Also, pandas’s memory management and IO challenges make Dask jobs a lot slower than they could be with a more efficient in-memory runtime.”
We did not find any advantage when using Dask over Apache Spark. Notable aspects were the following:
- The Apache Software Foundation has many projects and technologies that you can integrate in a native way. Our company already uses software from the Apache ecosystem.
- When trying Dask, we found serious issues in crucial parts of our workflow. For example, Dask does not have multi-index sorting implemented. This is because some operations, like set_index and merge/join, are harder to do in a parallel or distributed setting than in-memory on a single machine. In particular, the shuffling operations that rearrange data become much more communication-intensive, especially on a cluster.
- Note that, despite parallelism, Dask DataFrames may not always be faster than Pandas. It is even recommended to stay with Pandas for as long as possible before switching to dask.dataframe. On the other hand, PyArrow promises to solve most of these problems in the near future, and it is making huge strides toward doing just that. Also, integrating PyArrow into Apache Spark is a breeze.
- On Apache Spark we can use different Machine Learning libraries such as MLlib, H2O (pysparkling.ml), TensorFlow, etc. At the time of writing, Dask has some support for TensorFlow and XGBoost, but the main library it uses is Scikit-Learn. In our case, because we use H2O and TensorFlow to build our models, we like the native H2O Sparkling Water option with Apache Spark.
- Apache Spark (PySpark) gave us more capabilities and the freedom to change approaches easily. The use of DataFrames in Apache Spark instead of RDDs was a good improvement (as of Spark 2.x, the Dataset API is encouraged). Hence, we agree with the Apache community that Spark is a general-purpose cluster computing system and an established, trusted solution for business.
For extra details, Dask provides its own comparison between Spark and Dask; it can be found at https://docs.dask.org/en/latest/spark.html.
We continue to use Jupyter Notebooks for tasks that can be translated into PySpark, because most tasks are manual, i.e. they need human intervention. Many parts of Data Science aren't plug-and-play. For the automatic processes, we use Python scripts. In many situations, we embed chunks of SQL code using the sqlContext capabilities. With Apache Spark we can write applications in Java, Scala, Python, R, and SQL. The possibilities are endless.
Hands-on Deployment of a Decision Tree Model
Next, we show a simple demonstration of the research-to-deployment process. In the research stage, we explored the data and implemented a decision tree using Python and Scikit-Learn. Once the model was obtained, we “productised” it in PySpark. A simple dataset was chosen for demonstration purposes.
Using the notebooks, we wanted to show a hands-on example of each approach, but our focus was not on obtaining the best model. For instance, we did not tune the algorithms to get the best results in each approach/library. Nevertheless, we used almost all the parameters available in every library, even when they were not the same between libraries; when you compare the approaches, you may find some differences in performance because of this. Experience is the best tool when deploying a model starting from a research base: you also need to master the technologies and have a good grasp of the process. Otherwise, you could get what seems to be a great model in research that, in reality, is almost impossible to reproduce in production.
The code presented runs on: Apache Spark Version: 2.4.0, Python Version: 3.7, Scala Version: 2.11.12, Java Version: 1.8.0_172, Sparkling Water Version: 2.4 and H2O Version: 188.8.131.52.
This section is divided into 5 parts:
- The Dataset
- Decision Trees Theory (link)
- Scikit-Learn / Python
- PySpark + MLlib
- Pysparkling (H2O)
1. The Dataset
We analysed the data of the business premises census of Barcelona from 2016. You can find the data and more details in the following link:
“Inventory of premises in the city of Barcelona with the aim of identifying all those on the ground floor with or without economic activity.”
- ID_BCN: Unique ID given for each business premises.
- ID_PRINCIP: Code to identify the state of the business premises (Active, Non-active, Without information).
- N_PRINCIP: Description of the code 'ID_PRINCIP'.
- ID_SECTOR: Code of the business sector.
- N_SECTOR: Name of the business sector.
- ID_GRUPACT: Code of the activity group where the business premises belongs.
- N_GRUPACT: Name of the activity group.
- ID_ACT: Code of the activity of the business premises.
- N_ACT: Name of the activity of the business premises.
- N_LOCAL: Name of the business premises.
- SN_CARRER: Binary code to know if the business premises is at street level.
- SN_MERCAT: Binary code to know if the business premises is in a market.
- ID_MERCAT: Code of the market.
- N_MERCAT: Name of the market.
- SN_GALERIA: Binary code to know if the business premises is in a galleria.
- N_GALERIA: Name of the galleria.
- SN_CCOMERC: Binary code to know if the business premises is in a mall.
- ID_CCOMERC: Code of the mall.
- N_CCOMERC: Name of the mall.
- N_CARRER: Name of the street where the business premises is located.
- NUM_POLICI: Number of the street where the business premises is located.
- REF_CAD: Register number of the business premises.
- DATA: Date of registration of the business premises in the dataset.
- Codi_Barri: Number of neighbourhood.
- Nom_Barri: Name of the neighbourhood.
- Codi_Districte: Number of district.
- N_DISTRI: Name of the district.
- SN_EIX: Binary code to know if the business premises is in a commercial hub.
- N_EIX: Name of the commercial hub.
- SEC_CENS: Code of the census section.
- Y_UTM_ETRS: Coordinate Y of the UTM.
- X_UTM_ETRS: Coordinate X of the UTM.
- LATITUD: Latitude.
- LONGITUD: Longitude.
In summary, all the ID_ features are the numeric conversion of the N_ features. The SN_ features are binary features that indicate when the observation in the N_ feature is missing or not.
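This ID_/N_/SN_ convention can be sketched in pandas with a hypothetical slice of the schema (the values below are invented for illustration):

```python
import pandas as pd

# A toy slice of the census schema: a name column and its numeric code.
df = pd.DataFrame({
    "N_MERCAT": ["Mercat de la Boqueria", None, "Mercat de Sant Antoni"],
    "ID_MERCAT": [1, 0, 2],
})

# The SN_ feature flags whether the corresponding N_ value is present:
# 1 if the premises has a market name, 0 if it is missing.
df["SN_MERCAT"] = df["N_MERCAT"].notna().astype(int)
print(df[["ID_MERCAT", "SN_MERCAT"]])
```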
2. Decision Trees Theory
Decision Tree algorithms involve stratifying or segmenting the predictor space into a number of simple regions. The set of splitting rules used to segment the predictor space can be summarised in a tree, which can be used for both regression and classification.
If you’re interested in delving deeper into the decision tree theory, have a look at the link below:
3. Python + Scikit-Learn
First, we present the analysis and results obtained using Python libraries. All the details are included in the notebook. It should be noted that, since some processes used in the notebooks are stochastic, you may not get the same results, and the models might need retuning. See an example of Python/Scikit-Learn here.
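The essential research-stage steps can be sketched as follows. This is not the census notebook itself: it uses the iris toy dataset as a stand-in, but the API calls (split, fit, evaluate) are the same ones a Scikit-Learn decision tree workflow uses:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in data for the census classifier.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# max_depth keeps the tree small; a fixed random_state makes it reproducible.
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_tr, y_tr)

acc = accuracy_score(y_te, clf.predict(X_te))
print(acc)
```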
4. PySpark + MLlib
In this section, we “productise” the previous results and model using PySpark and MLlib. See an example of Pyspark/MLlib Decision Tree here.
5. Pysparkling (H2O)
As in the previous section, we “productise” the model, but using Pysparkling (H2O) instead. See the H2O Decision Tree example here.
Here, we have discussed the different options and issues that arise when attempting to productise an AI algorithm based on experimental research. By trying out the most promising candidates, we finally decided to use Apache Spark instead of Dask to deploy our models to our customers.
To show the approaches that can be used, we explored the data of the business premises census of Barcelona to build a classifier that labels a business as active/non-active, as could occur in the research stage. We used a Decision Tree and then translated the pipeline into PySpark with MLlib, and into Pysparkling (H2O), i.e. we “productised” it! Both methodologies produced very similar results, and hence we concluded that both are valid for production.
Thanks to Mireia Ballestà and Charalambos Kallepitis for all the support and effort to make this post possible.
The original version of this post can be found on Strands Tech Corner on Medium.