Showing posts with label Deep learning. Show all posts
Showing posts with label Deep learning. Show all posts

Thursday, June 1, 2023

Exploratory Notebooks for Deep Learning, AI, and Data Tools: A Beginner's Guide

Are you looking at some resources to get you up to speed with popular Deep Learning and Data processing frameworks? This blog entry provides a curated collection of notebooks that will help you kickstart your journey.

You can find the notebooks at this link. See also the SWAN gallery.

CERN users can run the notebooks on the SWAN platform, using GPU resources. SWAN

Other options for running the notebooks in the cloud with a GPU include Google's Colab.  Open in Colab


Getting started with Deep Learning

These notebook showcase a digit recognition classifier using the MNIST dataset, which serves as a "Hello World!" for Deep Learning. Choose from the following options to get started:


Deep Learning and basic Data pipelines

Learn how to integrate Deep Learning frameworks with basic data pipelines using Pandas to feed data into the DL training step. These notebooks implement a Particle classifier using various DL frameworks. The data is stored in Parquet format, offering efficient data reading. 






More advanced Data pipelines

Take your data processing skills to the next level with these notebooks, which demonstrate advanced data pipelines suitable for large datasets. Discover how to leverage the Petastorm library to read data from Parquet files with TensorFlow and PyTorch, as well as utilizing the TFRecord format with TensorFlow.


Additional complexity with models and data

Building upon the previous examples, these notebooks introduce more complex models and larger datasets for the Particle classifier. Explore the capabilities of TensorFlow, GRU, Transformer, and TFRecord with:



AI Tools Examples

This section contains Jupyter notebook examples of AI tools, including LLMs, Transformers, vector databases. The notebooks are intended to be run using GPU resources.

Transformers library

Explore the powerful Transformers library from Hugging Face, widely used for LLM, Natural Language Processing (NLP), image, and speech tasks.




Large language models

These notebooks provide examples of how to use LLMs in notebook environments for tests and prototyping


Semantic search with Vector Databases and LLM

Semantic search allows to query a set of documents. This examples shows how to create vector embeddings, store them in a vector database, and perform semantic queries enhanced with LLM.



Data Tools Examples

This section offers example notebooks featuring popular frameworks and libraries for handling data. Please note that it does not cover scale-out data solutions such as Spark and Dask.

For Apache Spark see SparkTraining

If you require access to relational databases for testing, CERN users can reach out to Oracle and DBOD services. You can also set up test databases using container technology. Here's how:

Running a test Oracle instance on a container:

  • Run Oracle Free on a container from gvenzl dockerhub repo https://github.com/gvenzl/oci-oracle-free
    • see also https://github.com/gvenzl/oci-oracle-free
    • docker run -d --name mydb1 -e ORACLE_PASSWORD=oracle -p 1521:1521 gvenzl/oracle-free:latest
    • Wait until the DB is started (this may take a few minutes). Check progress with: docker logs -f mydb1
    • install the Python library for connecting to Oracle: pip install oracledb

Setting up a PostgreSQL instance for testing using a Docker image:

  • docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
  • wait till the DB is started, check logs at: docker logs -f some-postgres
  • install the Python library for connecting to PostgreSQL: pip install psycopg2-binary

Pandas and numpy examples





Conclusions and acknowledgments

This blog entry provides a valuable collection of exploratory notebooks for individuals who are new to deep learning and data processing. With a focus on popular frameworks and libraries, these notebooks cover a range of topics including digit recognition, transformers for various tasks, integrating deep learning with data pipelines, advanced data processing techniques, and examples of data tools. Whether you are a CERN user or prefer cloud-based platforms like Google's Colab, these notebooks will help you quickly grasp the fundamentals and get started on your deep learning and data processing journey.

I would like to express my sincere gratitude to my colleagues at CERN for their invaluable assistance and insightful suggestions, in particular I'd like to acknowledge the CERN data analytics and web notebook services and ATLAS database and data engineering teams. Their expertise and support have played a crucial role in making this collection of notebooks possible. Thank you for your contributions and dedication.


Thursday, April 25, 2019

Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo

Topic: This post describes a data pipeline for a machine learning task of interest in high energy physics: building a particle classifier to improve event selection at the particle detectors. The pipeline is built using tools from the "Big Data ecosystem", notably Apache Spark, BigDL and Analytics Zoo. The work is accompanied by a set of notebooks with code and sample data.

The Physics use case

High Energy Physics (HEP) is a data-intensive domain: particle detectors and their sensors collect vast amounts of data. Data pipelines that move data from detectors to analysis engines are key to the experiments' success and operate at high throughput. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Most of the data thus collected has to be thrown away as it would be too costly to store the raw data. This is performed by fast data processing by a trigger system organized in multiple levels. The rate of data stored for later analysis is currently around 1-10 GB/sec.
Every improvement in the accuracy of the event filtering system is welcome as it can provide large savings in terms of compute and storage resources needed for data analysis. The work "Topology classification  with  deep  learning  to  improve  real-time event  selection  at  the  LHC" by Nguyen et al. provides a promising solution to this problem by using neural networks. The paper details how to build a classifier to used to improve the purity of data samples selected at trigger level, by distinguishing three different event topologies of interest ("W + jet", "QCD", "t tbar"). The classifier is intended to be used in the online filtering system to improve accuracy and in particular reduce the number of false positives compared to current systems, often implemented as complex rule-based systems.
The primary goal of this work is to reproduce the classification performance results of Nguyen et al., showing that the proposed pipeline makes a more efficient usage of computing resources and/or provides a more productive interface for the data scientist/physicists, along all the steps of the processing pipeline.


Figure 1: Event data collected from the particle detector (CMS experiment) contains different types of event topologies of interest. A particle classifier built with neural networks can be used as an event filter, improving state of the art in accuracy.

Data pipelines

Data pipelines are of paramount importance to make machine learning projects successful, by integrating multiple components and APIs used across the entire data processing chain. A good data pipeline implementation can accelerate and improve the productivity of the work around the core machine learning tasks. In particular, data pipelines are expected to provide solid tools for data processing, a task that ends up being one of the most time-consuming for data scientists/physicists approaching data analysis problems.
Traditionally, HEP has developed custom tools for data processing, which have been successfully used for decades. Recently, a large range of solutions for data processing and machine learning have become available from open source communities. The maturity and adoption of such solutions continue to grow both in industry and academia. Using software from open source communities comes with several advantages, including the reduced cost of development and maintenance and the possibility of sharing solutions and expertise with a large user and expert base.

In this work we build the machine learning pipeline detailed in the paper on topology classification with DL mentioned above, using tools from the "Big Data" ecosystem.
One of the key objectives for the machine learning data pipeline is to transform raw data into more valuable information used to train the ML/DL models. Apache Spark provides the backbone of the pipeline, from the task of fetching data from the storage system to feature processing and feeding training data into a DL engine (BigDL and Analytics Zoo are used in this work). The four steps of the pipeline we built are:

  • Data Ingestion: where we read data from ROOT format and from the CERN-EOS storage system, into a Spark DataFrame and save the results as a table stored in Apache Parquet files
  • Feature Engineering and Event Selection: where the Parquet files containing all the events details processed in Data Ingestion are filtered, and datasets with new  features are produced
  • Parameter Tuning: where the best set of hyperparameters for each model architecture are found by performing grid search
  • Training: where the best models found in the previous step are trained on the entire dataset

Figure 2: This illustrates the data processing pipeline used for training the event topology classifier.

Data source 

Data used for this work have been generated using software simulators to generate events and to calculate the detector response. The dataset is the result of a Monte Carlo event generation, where three different processes (categories) have been simulated: the inclusive production of a leptonically decaying W+/- boson, the pair production of a top-antitop pair (t, t_bar) and hadronic production of multijet events. Variables of low and high level are included in the dataset. See the paper on topology classification with DL for details.
For this exercise, the generated training data amounts to 4.5 TB, for a total of 54 million events, divided in 3 classes: "W + jet", "QCD", "t tbar" events. The generated training data is stored using the ROOT format, as it is a very common format for HEP. Data are originally stored in the CERN EOS storage system as it is the case for the majority of HEP data at CERN at present. The authors of the paper have kindly shared the training data for the purpose of this work. Each event of the dataset consists of a list of reconstructed particles. Each particle is associated with features providing information on the particle cinematic (position and momentum) and on the type of particle. Follow this link for additional details on the datasets and links to download data samples made available to reproduce this work.

Data ingestion

Data ingestion is the first step of the pipeline, where we read ROOT files from the CERN EOS storage system into a Spark DataFrame. For this, we use a dedicated library able to ingest ROOT data into Spark DataFrames: spark-root, an Apache Spark data source for the ROOT file format. It is based on a Java implementation of the ROOT I/O libraries, which offers the ability to translate ROOT files into Spark DataFrames and RDDs. In order to access the files stored in the CERN EOS storage system from Spark applications, another library has been developed: the Hadoop-XRootD connector. The Hadoop-XRootD connector is a Java library, it extends the HADOOP Filesystem and makes its capable of accessing files in EOS via the XRootD protocol. This allows Spark to read directly from EOS. which is convenient for our use case as it avoids the need for copying data in HDFS or other storage compatible with Spark/Hadoop libraries.
At the end of the data ingestion step the result is that data, with the same structure as the original ROOT files, are made available as a Spark DataFrame on which we can perform event selection and feature engineering.

Event selection and feature engineering

In this step of the pipeline, we process the dataset by applying relevant filters, by computing derived features and by applying data normalization techniques.
The first part of the processing requires domain-specific knowledge in HEP to simulate trigger selection: this is emulated by requiring all the events to include one isolated electron or muon with transverse momentum (p_T) above a given threshold, p_T >= 23 GeV. All particle are then ranked in decreasing order of p_T. For each event, the isolated lepton is the first entry of the list of particles. Together with the isolated lepton, the first 450 charged particles, the first 150 photons, and the first 200 neutral hadrons have been considered, for a total of 801 particles with 19 features each. The result is, that each event passing the filter is associated to a matrix with shape 801 x 19. This defines the Low Level Features (LLF) dataset.
Starting from the LLF, an additional set of 14 High Level Features (HLF) is computed. These additional features are motivated by known physics and data processing steps and will be of great help for improving the neural network model in later steps.
LLF and HLF datasets, computed as described above, are saved in Apache Parquet format: the amount of training data is reduced at this point from the original 4.4 TB of ROOT files to 950 GB of snappy-compressed Parquet files.

Additional processing steps are performed and include operations frequently found when preparing training data for classifiers, notably:
  • Data undersampling. This is done to work around the class imbalance in the original training data. After this step, we have the same number of events per each of the three classes, that is 1.4 million events for each of the 3 classes.
  • Data shuffling
  • All the features present in the datasets are scaled to take values between 0 and 1 or normalized, as needed for the different classifiers.
  • The datasets (for HLF and LLF features) are split in training and test datasets (80% and 20% respectively) and saved in two sets of Parquet files. Smaller datasets are also created as samples of the full train and test datasets, for development purposes,
Data are saved in snappy-compressed Parquet format at the end of this stage and amount to about 310 GB. The decrease in size from the previous step is mostly due to the undersampling step and the fact that the population on the 3 topology classes in the original training data used for this work are not balanced.


Code for this step: notebook for data ingestion and for feature preparation

Neural network models

We have tested three different neural network models, following the guiding research paper:

1. The first and simplest model is the "HLF Classifier". It is a fully connected feed-forward deep neural network taking as input the 14 high level features. The chosen architecture consists of three hidden layers with 50, 20, 10 nodes activated by Rectified Linear Units (ReLU). The output layer consists of 3 nodes, activated by the Softmax activation function.

2. The "Particle Sequence Classifier" is trained using recursive layers, taking as input the 801 particles in the Low Level Features dataset. Prior to "feeding the particles" into a recurrent neural network, particles are ordered by decreasing distance from the isolated lepton (see the original article for details).
Gated Recurrent Units (GRU) have been used to aggregate the input sequence of particles with a recurrent layer width of 50. The output of the GRU layer is fed into a fully connected layer with 3 Softmax-activated nodes.

3. The "Inclusive Classifier" is the most complex of the 3 models tested. In this classifier, some domain knowledge is injected into the particle-sequence classifier. In order to do this, the architecture of the particle sequence classifier has been modified by concatenating the 14 High Level Features to the output of the GRU layer after a dropout layer. An additional dense layer of 25 nodes is introduced before the final output layer consisting of 3 nodes, activated by the Softmax activation function.



Figure 3: (left) Code snippet of the inclusive classifier model with Analytics zoo, using Keras-API. (right) Neural network model diagram from Nguyen et al. 

Parameter tuning

Hyperparameter tuning is a common step for improving machine learning pipelines. In this work, we have used Spark to speed up this step. Spark allows training multiple models with different parameters concurrently on a cluster, with the result of speeding up the hyperparameter tuning step. We used the Area Under the ROC curve as the performance metric to compare different classifiers. When performing a grid search each run is independent of the others, hence this process can be easily parallelized and scales very well.
For example, we tested the High Level Features classifier, a feed forward neural network, taking as input the 14 High Level Features. For this model, we tested changing the number of layers and units per layer, the activation function, the optimizer, etc. As an experiment, we ran grid search on the small dataset containing 100K events using a grid of ~200 hyper-parameters sets (models).
Hyperparameter tuning can similarly be repeated for all three classifiers described above. For the following work, we have decided to use the same models as the ones presented in the paper, as they were offering the best trade-off between performances and complexity.
To run grid search in parallel, we used Spark with spark-sklearn and Tensorflow/Keras wrapper for scikit_learn.

Code: notebook with hyperparameter tuning

Figure 4: The hyperparameter tuning step consists of running multiple training jobs with the goal of finding an optimal set of parameters. Spark has been used to speed-up this step by running multiple Keras training jobs in parallel on a cluster.


Distributed training with Spark, BigDL and Analytics Zoo

There are many suitable software and platform solutions for deep learning training at present, and it is an exciting time in this respect with many paths to explore. However, choosing among them is not straightforward, as many products are available with different characteristics and optimizations for different areas of application.
For this work, we wanted to use a solution that easily integrates with the Spark service at CERN, running on YARN/Hadoop cluster (more recently also running Spark on Kubernetes using cloud resources). GPUs and other HW accelerators are not yet available for Spark deployments at CERN, so we also wanted a solution that could scale on CPUs. Moreover, we wanted to use Python/PySpark and standard APIs for processing the neural network: notably Keras API in this case.
Those reasons combined with our ongoing collaboration with Intel in the context of CERN openlab has led us to successfully test and use BigDL and Analytics Zoo for distributed model training in this work. BigDL and Analytics Zoo are open source projects distributed under the Apache 2.0 license. They provide a distributed deep learning framework for Big Data platforms and are implemented as libraries on top of Apache Spark. They allow users to write large-scale deep learning applications as standard Spark programs, running on existing Spark clusters. Notably, they allow users to work with models defined with Keras and Tensorflow APIs.
BigDL provides data-parallelism to train models in a distributed fashion across a cluster using synchronous mini-batch Stochastic Gradient Descent. Data are automatically partitioned across Spark executors as an RDD of Sample: an RDD of N-Dimensional array containing the input features and label. Distributed training is implemented as an iterative process, thus there will be multiple iterations over the same data. Reading data from disk multiple times is slow, for this reason, BigDL exploits the in-memory capabilities of Spark to cache the train RDD in the memory of each worker allowing faster access during the iterations (this also means that sufficient memory needs to be allocated for training large datasets). More information on BigDL architecture can be found in the BigDL white paper.
Notebooks with the code used for training the DL models are:



Figure 5: Snippets of code illustrating key operations for deploying distributed model training in Spark using BigDL and Analytics Zoo.

The resulting classifiers show very good performance figures for the three models tested, and reproduce the findings of the original research paper. The fact that the HLF classifier performs well, despite its simplicity, can be justified by the fact that we are putting a considerable physics knowledge into the the definition of the 14 high level features. This conclusion is further reinforced by additional tests, where we have built a topology classifier using a "random decision forest" and a "gradient boosting" (XGBoost) model, trained using the high level features dataset. In both cases we have obtained very good performance of the classifiers, just close to what has been achieved with the HLF classifier model.
In contrast, the fact that the particle sequence classifier performs better than the HLF classifier is remarkable, because we are not putting any a priori knowledge into the particle sequence classifier: the model is taking just a list of particles as input. Further improvements are seen by combining the HLF with the particle sequence classifier. In some sense, the GRU layer is identifying important features and physics quantities out of the training data, rather than using knowledge injected via feature engineering.

Figure 6: (Right) ROC curve and AUC for the three models tested. Good results are obtained that are consistent with the findings in the original research paper. (Left) Training and validation loss (categorical cross entropy) plotted as a function of iteration number for HLF classifier. Training convergence is obtained at around 50 epochs, depending on the model.


Figure 7: The confusion matrix for the "Inclusive classifier" model after training. Note that the off-diagonal elements are below 5% and the diagonal elements are close to 95%.
  

Workload and performance

Two Spark clusters have been used to develop and run the workload, the first is a development/test cluster, the second is a production consisting of 52 nodes, with the following resources:  1800 vcores, 14 TB of RAM, 9 PB of storage. It is a shared general-purpose YARN/Hadoop cluster build on commodity hardware, installed with Linux (CentOS 7). Only a fraction of the resources of the cluster was used when executing the pipeline, moreover jobs of other nature were running on the system currently.

Data ingestion and event filtering is a very resource-demanding step of the pipeline. Processing the original data set of 4.4 TB took ~3 hours, using 50 executors with 8 cores each, for a total of 400 cores allocated to the Spark job.
Monitoring of the workload showed that the job was almost completely CPU-bound. The large majority of the CPU cycles are spent executing Python code, that is outside of the Spark JVM. The fact that we chose to process the bulk of the training data using Python UDF functions mapped on RDDs is the main cause of the use of such large amount of CPU resources, as this a well-known "performance gotcha" in current versions of Spark. Notably, many CPU cycles are "wasted" in data serialization and deserialization operations, going back and forth from JVM and Python, in the current implementation.
Additional work on improving the performance of the event filtering has introduced the use of Spark SQL to implement some of the filters. Notably, we have made use of Spark SQL Higher Order Functions, a specialized category of SQL functions, introduced in Spark from version 2.4.0, that allow to improve processing for nested data (arrays). For our workload this has introduced the benefit of running a significant fraction of the filtering operations fully inside the JVM, optimized by the Spark code for DataFrame operations. The result is that the data ingestion step, optimized with Spark SQL and higher order functions, runs in ~2 hours (was 3 hours in the implementation that uses only Python UDF).
Future work on the pipeline may further address the issue of reducing the time spent in serialization and deserialization when using Python UDF, for example, we might decide to re-write the critical parts of the ingestion code in Scala. However, with the current training data size (4.4 TB), the performance of the data ingestion step, of just a couple of hours, is acceptable.

Performance of hyperparameter tuning with grid search: grid search runs multiple training jobs in parallel, therefore it is expected to scale well when run in parallel, as it was the case in our work (see also the paragraph on hyperparameter tuning). This is confirmed by measurements, as shown in Figure 8, by adding executors, i.e. the number of models trained in parallel, the time required to scan all the parameters decreases, as expected.

Figure 8: Performance of the hyperparameter tuning with grid search shows near-linear scalability as expected by this embarrassingly parallel process.

Performance of distributed training with BigDL and Analytics Zoo is also important for this exercise, as faster training time means improved productivity of the data scientist/physicists who will typically perform many experiments on the models and data. Figure 9 shows some preliminary results of the training speed for the HLF classifier, tested on the development cluster, using batch size of 32 per worker. You can notice there very good scalability behavior at the scale tested. Additional measurements, using 20 executors, with 6 cores each (for a total of 120 allocated cores), using batch size of 128 per worker, have shown training speed for the HLF classifier of the order of 100K rows/sec, sustained for the training of the full dataset. We have been able to train the model in less than 5 minutes, running for 12 epochs (each epoch processing 3.4 million training events). The batch size has an impact on the execution time and on the accuracy of the training. We found that a batch size of 128 for the HLF classifier is a good compromise for speed, which a batch size of 32 is slower but gives slightly better results, for example for producing publication plots. Future work will address more thorough investigation of the performance and scalability of the training step.

Figure 9: Performance of the training step for the HLF classifier on the development cluster. Very good scalability has been found at this scale.

An important point to keep in mind when training models with BigDL, is that the RDDs containing the features and labels datasets need to be cached in the executors' JVM memory. This can be a significant amount of memory. Figure 10 shows the RDDs cached by Spark block manager when training the model "Inclusive classifier" using BigDL.


Figure 10: Datasets cached using Spark block manager when training the "Inclusive Classifier" model using BigDL. More than 200 GB of RAM are used across the Spark cluster to cache the training and test data.

The training dataset size and the model complexity of the HLF classifier are of relatively small scale, making the HLF classifier suitable for training also on desktop computers. However, the Particle sequence classifier and the Inclusive classifier have a much higher complexity and require processing hundreds of GB of training data. Figure 11 shows the amount of CPU consumed during the training of the Particle Sequence classifier using BigDL/Analytics-zoo on a Spark cluster, using 70 executor and training with a batch size of 32. Notably, Figure 11 illustrates the fact that the training has lasted for 9 hours and has utilized the equivalent of 200 CPUs for the whole duration of the training. The executor CPU measurement has been performed using Spark monitoring instrumentation feeding into an InfluxDB backed and visualized with Grafana, as described in this post.


Figure 11: Measured aggregated CPU utilization of Spark executors during the training of the Particle Sequence classifier.

Conclusions, lessons learned and plans


We have successfully implemented the full data pipeline for training a "topology classifier" for event filtering at LHC experiments, using deep neural networks as detailed in an original research paper using tools and platforms from the Big Data and machine learning ecosystem in the High Energy Physics domain.
Apache Spark has been used as the main engine for the pipeline, we have appreciated the flexibility and the ease of use of Spark's API in the Python environment.
The integration of PySpark with Jupyter notebooks provides a user-friendly environment for data processing and exploration.
Additional key integrations with Spark have made this work possible, in particular, the spark-root library allows Spark to read data in ROOT format and the Hadoop-XRootD connector, which integrates Spark with CERN EOS storage system.
BigDL and Analytics Zoo have allowed us to easily scale out training using familiar Keras APIs and using the Spark clusters currently available at CERN.
We have identified areas for future performance improvements, notably in the data ingestion and filtering step where we chose to use Python functions on Spark RDDs for the current implementation.
Future work and plans include further exploration of the workload at scale, including running on cloud resources and investigating the use of HW accelerators for DL training. Another area we plan to explore in future work is to implement a streaming data pipeline with model inference, as the natural next step that follows this work.

Code for this work: https://github.com/cerndb/SparkDLTrigger
Presentation at Spark Summit 2019: Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies
 

Acknowledgments

This work has been performed in the context of the Spark, Hadoop and Streaming services at CERN IT and has profited of the support and consultancy of the engineers in the service. The primary author of this work is Matteo Migliorini, who worked on it during his stay at CERN in 2018 under the supervision of Luca Canali. Additional credits go to Riccardo Castellotti, Viktor Khristenko and Maria Girone of CERN openlab, to Marco Zanetti of Padua University and to the authors of the research paper Topology classification  with  deep  learning  to  improve  real-time event  selection  at  the  LHC", in particular to Maurizio Pierini and Thong Nguyen. Credits to the BigDL and Analytics Zoo team at Intel, in particular, many thanks to Jiao (Jennie) Wang and Sajan Govindan.


Tuesday, July 26, 2016

How to Build a Neural Network Scoring Engine in PL/SQL

Topic: In this post, you will find an example of how to build and deploy a basic artificial neural network scoring engine using PL/SQL for recognizing handwritten digits. This post is intended for learning purposes, in particular for Oracle practitioners who want a hands-on introduction to neural networks.


Introduction

Machine learning and neural networks in particular, are currently hot topics in data processing. Many tools and platform are now easily available to work and experiment with neural networks and deep learning (see also the links at the end of this post)Recognizing hand-written digits, in particular using the MNIST database by Yann LeCun et al., is currently the "hello world" example for neural networks.
In this post, you will see how to build and deploy a simple neural network scoring engine to recognize handwritten digits using Oracle and PL/SQL. The final result is a short PL/SQL package with an accuracy of about 98%. The neural network is built and trained using TensorFlow and then transferred to Oracle for serving it.

One of the ideas that this post wants to illustrate is that scoring neural networks is much easier than training them: the operations required for serving a trained network can be implemented relatively easily on many computing languages/environments. Discussions on these topics normally are centered around platforms for "Big Data" (see for example Spark and MLlib). I find interesting to note that neural networks can also be successfully applied to the RDBMS world. This can be useful as large quantities of valuable data are currently stored in relational databases. In the case of Oracle, the implementation of a scoring engine is also made easier by the availability of a mature the PL/SQL environment with a package for linear algebra: UTL_NLA.


Let's start from the end: how to deploy the PL/SQL package MNIST and recognize handwritten digits using Oracle

One short PL/SQL package and two tables is all you need to replay the following example (you can find the details of the code on Github). The tables are:
  • TENSORS_ARRAY: this table contains the numerical values for the vectors and matrices (tensors) that constitute the neural network. There is a total of 79510 floating point numbers encoded into four tensors using the data type UTL_NLA_ARRAY_FLT.
  • TESTDATA_ARRAY: this table contains the test images. There are 10K images, each composed of 28x28 = 784 pixels. Image data is also encoded using the data type UTL_NLA_ARRAY_FLT.
The engine for scoring the example neural network is in a package called MNIST. It has a procedure called INIT that loads the components of the neural network from the table tensors_array into PL/SQL variables and a function called SCORE that takes an image as input and return a number, the predicted value of the digit. 
Here is an example of its usage, where the first image in the table testdata_array is examined and correctly predicted to represent the number 7 (the image label agrees with the prediction by MNIST.SCORE):

SQL> exec mnist.init

PL/SQL procedure successfully completed.

SQL> select mnist.score(image_array), label from testdata_array where rownum=1;

MNIST.SCORE(IMAGE_ARRAY)      LABEL

------------------------ ----------
                       7          7

Figure 1: This is a bitmap display of the test image used in the example. This confirms that the prediction of MNIST.SCORE is correct and indeed the image is a representation of the number 7 handwritten and encoded in a grid of 28x28 gray-scale pixels.


Processing all the test images is also a matter of a simple SQL command. In the example of Figure 2 it takes 2 minutes to process 10000 test images, that is about 12 ms per image on average. The accuracy of the scoring function is about 98%. It is calculated as follows: out of 10000 images, 9787 are scored correctly according to the data labels. Note also that the set of test images is disjoint from the images used to train the neural network. Therefore we can expect that the MNIST package has an accuracy of about 98% for recognizing digits also when used on generic input (additional evaluations of the quality of the MNIST package as a classifier are beyond the scope of this post).


The full PL/SQL code and the datapump dump file with the relevant tables can be found on Github. In the following paragraphs, you can read how to build and train the neural network.


Figure 2: The accuracy of the PL/SQL scoring function MNIST.SCORE on the test set of 10K images is about 98%. Processing takes about 12 ms per image.


The neural network

The neural network used in this post is composed of three layers (see also Figure 3): one input layer, one hidden layer and one output layer. If this topic is new to you, I recommend to do some additional reading (see references) and in particular to read Michael Nielsen's "Neural Networks and Deep Learning" which provides an excellent introduction to the topic and a series of step-by-step examples on the problem of recognizing handwritten digits.


Figure 3: The artificial neural network used in this post is composed of three layers. The input layer has 784 neurons, one per pixel of the input image. A hidden layer of 100 neurons is added to improve the accuracy. The output layer has 10 neurons, one per each possible output value (that is digits from 0 to 9).


Get the training and test data, build and train the neural network using TensorFlow

Another important step for deploying neural networks is training. For this you need data, lots of it if possible. You also need an engine to do the necessary computation. Luckily there are many platforms available for working with neural networks, that that are free and relatively use to deploy (see references). In this post, you will see how to use Google's TensorFlow and the Python environment. TensorFlow comes with a tutorial for recognizing handwritten digits in the MNIST database. Included in the tutorial are training and test data with labels and also example code.

You can find the code I used for training the neural network on Github. Some highlights and code snippets are discussed in the following.

Importing the data: The example dataset that comes with TensorFlow provides 55000 images for training and 10000 images for testing. These originally come from the work of Yann LeCun and coworkers. Having large amounts of high-quality data is very important to the success of the process. Moreover, the images come with labels: the labels tell which number each image is intended to depict and provide a very important piece of information as the exercise is to do supervised learning.

Defining the neural network: there are four tensors (vector and matrices in this case) in the network: W0, W1, b0 and b1. They are defined in the following snippet of code. To better understand their role and the key role that the cross entropy and the gradient descent optimizer play in training the network see the references, in particular "Neural Networks and Deep Learning" and TensorFlow tutorial.



Training the neural network: training proceeds with multiple steps of optimization. Training is performed using 55000 images with labels. It runs over 30000 iterations using "mini-batch" size of 100 images. At each step, the gradient descent algorithm computes an update of the weights and biases (W0, W1, bo and b1) with the goal of minimizing the loss function (cross_entropy). The relevant snippet of the code is:


Result: as a result, the trained network has the accuracy of about 98% in recognizing the images in the test set. Note that the test set is composed of 10000 images and is disjoint from the set of images used for training (the training set contains 55000 images).
It is possible to get higher accuracy with more advanced neural network configurations (see references for details), but that is beyond the scope of this post.


Manually scoring the neural network, a Python example

The main result of the training operations is that the tensors (matrices and vectors in this case) that make the neural network are now populated with useful values. I believe that a good way to understand how all this works is to "run the network manually", that is run as an example of how to go from an image of a handwritten digit to the prediction of its value by the trained neural network. As a first step we extract the values of the trained tensors in our model into numpy arrays for later processing:


An example of "manually" operating the network in Python is as follows:



W0_matrix, b0_array, W1_matrix and b1_array are the tensors that constitute the neural network after training, "testimage" is the input, sigmoid() is used as activation function, "hidden_layer" represents the hidden layer of the network, "predicted" is the output layer and softmax() is a function used to normalize the output as a probability distribution. At the end of the calculation, the array predicted[n] contains the prediction that the input image represents the digit "n". The function argmax() finds the value of "n" where predicted[n] is maximized.
The code shown above predicts the value 7 for a test image. The prediction is confirmed as correct by the value of the label and can also be visually confirmed by the bitmap display of the test image (see Figure 1).


Move test data and neural network tensors to an Oracle database

The example in the previous paragraph on how to manually run a the scoring engine illustrates that serving a neural network can be straightforward, in some cases it is just a matter of performing some basic computations with matrices. This contrasts with the complexity of training neural network models, where often one needs a specialized engine, large quality of training data and in the more complex cases also specialized hardware, such as GPU cards.
The discussion of the previous paragraph has also prepared the terrain for the following development: that is moving the neural network tensors and test data to Oracle and implement a serving engine there.
There are many ways to export Python's numpy arrays. One way is to save them in a text format. Here you will see instead a method targeted to exporting directly into Oracle using cx_Oracle, the Python library to interact with Oracle. See also the notebook "Oracle and Python with cx_Oracle" for additional examples and references on how to use cx_Oracle.

You can find the code on Github, here are some relevant snippets:

- Create the tables to host the tensor definition and test data:

SQL> create table tensors (name varchar2(20), val_id number, val binary_float, primary key(name, val_id));

SQL> create table testdata (image_id number, label number, val_id number, val binary_float, primary key(image_id, val_id));

- From Python, open a connection to Oracle:

import cx_Oracle
ora_conn = cx_Oracle.connect('mnist/mnist@ORCL')
cursor = ora_conn.cursor()

- Example of how to transfer the matrix W0 into the Oracle table "tensors"

i=0
sql="insert into tensors values ('W0', :val_id, :val)"
for column in W0_matrix:
    array_values = []
    for element in column:
        array_values.append((i, float(element)))
        i += 1
    cursor.executemany(sql, array_values)
ora_conn.commit()


Oracle's optimizations for linear algebra using UTL_NLA

From Oracle documentation: "The UTL_NLA package exposes a subset of the BLAS and LAPACK (Version 3.0) operations on vectors and matrices represented as VARRAYS". This is very useful for implementing the calculations needed to serve the neural network of this post.
A snippet of the MNIST code to get the gist of this works in practice is reported below. The code performs the calculation v_Y0 = v_Y0 + g_W0_matrix * p_testimage_array, there g_W0_matrix is a 784x100 matrix, p_testimage_array is a vector of 784 elements (encoding the 28x28 images) and v_Y0 is a vector of 100 elements.



utl_nla.blas_gemv(
                      trans => 'N',
                      m => 100,
                      n => 784,
                      alpha => 1.0,
                      a => g_W0_matrix,
                      lda => 100,
                      x => p_testimage_array,
                      incx => 1,
                      beta => 1.0,
                      y => v_Y0,
                      incy => 1,
                      pack => 'C'
        );


In order to use UTL_NLA the tensors that make the neural network and the test images need to be stored in varrays of binary_float, or rather be declared of data type UTL_NLA_ARRAY.
For this reason it is also convenient to post-process the tables "tensors" and "testdata" as follows:

SQL> create table testdata_array as
select a.image_id, a.label, 
cast(multiset(select val from testdata where image_id=a.image_id order by val_id) as utl_nla_array_flt) image_array 
from (select distinct image_id, label from testdata) a order by image_id;

SQL> create table tensors_array as
select a.name, cast(multiset(select val from tensors where name=a.name order by val_id) as utl_nla_array_flt) tensor_vals 
from (select distinct name from tensors) a;

Finally, you can export the tables for later use. In the Github repository you can find a dump file obtained with the following command (run as Oracle):

$ expdp mnist/mnist tables=testdata,tensors directory=DATA_PUMP_DIR dumpfile=MNIST_tables.dmp

The final step, which brings you back to the discussion in the paragraph "let's start from the end: how to test the PL/SQL package MNIST",  is to create the PL/SQL package MNIST that loads the tensors and performs the operations needed to score the neural network, See the details of the code on Github.


Conclusions and comments

This post describes an example of how to implement a scoring engine for an artificial neural network using the Oracle RDBMS and PL/SQL. The discussion is about a simple implementation of the "hello world" example of neural networks: recognizing handwritten digits of the MNIST database. The network is trained using TensorFlow and later exported into Oracle. The final result is a short PL/SQL package which provides digit recognition with an accuracy of about 98%.

We can expect in the near future to find increasing deployments of neural networks close to data sources and data stores. The example in this post of how to implement a neural network serving engine on an Oracle database shows that this is not only possible but also easy to implement.
Serving neural networks is much simpler than training them. While training requires specialized software/platforms and domain knowledge and large amounts of training data, trained networks can be imported into target systems and executed there, in many cases requiring low usage of computing resources.
This post is intended as learning material: a simple feed forward neural network has been used instead of the more performing convolutional network (see references). Moreover, data movement from TensorFlow to Oracle and the implementation of the serving engine in PL/SQL is a sort of a hack in the present state and it is not intended for production usage.

The code accompanying this post is available on Github.


Notes on how to build the test environment

The main components and tools for testing the scripts in this post are:
the Python environment (on Linux with Centos 7) installed using Anaconda 4.1: Python 2.7, Jupyter Ipython notebook.
TensorFlow, version 0.9 (the latest as I write this), installed following the instructions at https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html
Oracle RDBMS running on Linux. The Oracle scripts have been tested on Oracle 11.2.0.4 and 12.1.0.2


References and acknowledgments

An excellent introduction to neural networks and an inspiration for this blog post is Michael Nielsen's book "Neural Networks and Deep Learning".
The code for neural network training used in this post is an extension of Google's TensorFlow MNIST tutorial.
See also: tutorial on TensorFlow by Martin Gorner
Basic techniques for TensorFlow by Aaron Schumacher
Visualizing MNIST by Christopher Olah
Python Machine Learning by Sebastian Raschka
Other popular frameworks for working with neural networks and deep learning besides TensorFlow include Theano and Torch among many others, see also this page on Wikipedia.