Enterprise Data Science is Largely an Engineering Challenge

Hear me out; maybe, just maybe, I might win some of you over.

There is no taking away from the fact that building machine learning and AI algorithms is a hard and challenging endeavor. However, in the grand scheme of things, the larger ecosystem around those algorithms is just as critical to the success of the effort. In a paper from Google researchers titled “Hidden Technical Debt in Machine Learning Systems,” the authors presented the following visual:

That’s right: somewhat along the lines of “missing the forest for the trees,” model development is only a small piece of a much larger puzzle. The success of enterprise data science depends on all of the other pieces of that puzzle working together. From my experience, there are some key areas that require attention to make data science successful in enterprises:

Software Engineering Principles Apply

Notebooks are great, but they don’t represent software best practices, not yet at least. They are great for data scientists to collaborate and explore. Modularization, unit testing, bootstrap code, shared frameworks, and identical environments across development and production are all important considerations for robust and scalable ML development and deployment.
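To make this concrete, here is a minimal sketch of moving feature logic out of a notebook cell into an importable, unit-tested function. The function name, schema, and test cases are hypothetical, purely for illustration:

```python
# A hypothetical feature-transformation module with unit tests (pytest).
# Moving logic like this out of notebook cells makes it testable and reusable.
import math

import pandas as pd
import pytest


def add_log_income(df: pd.DataFrame) -> pd.DataFrame:
    """Add a log-transformed income feature; fail loudly on bad input."""
    if "income" not in df.columns:
        raise ValueError("expected an 'income' column")
    out = df.copy()
    out["log_income"] = out["income"].apply(
        lambda x: math.log1p(x) if x >= 0 else float("nan")
    )
    return out


def test_handles_negative_income():
    df = pd.DataFrame({"income": [0.0, 100.0, -5.0]})
    result = add_log_income(df)
    assert result.loc[0, "log_income"] == 0.0
    assert pd.isna(result.loc[2, "log_income"])


def test_requires_income_column():
    with pytest.raises(ValueError):
        add_log_income(pd.DataFrame({"salary": [1.0]}))
```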

ModelOps: DevOps Principles Apply

Model development cannot succeed in isolation, and model deployment is not independent of IT infrastructure. ModelOps is therefore gaining popularity in production ML environments as a way of addressing the “last mile” issues of model deployment and consumption. ModelOps refers to the tools and processes deployed to operationalize and manage all models in use in production systems (statistical, ML, and AI models). ModelOps tools and processes strive to provide dashboards, reporting, and information for stakeholders. One important aspect of ModelOps, specifically for machine learning models, is the comprehensive monitoring of ML pipelines, which we will cover in some detail below.

ML Pipeline Monitoring is Critical

The reality on the ground is that ML lifecycles are long and complex. Once you have what you might call a “candidate algorithm,” something that has performed well on the training dataset and meets the initial requirements expected of it, you still have to build a production pipeline environment that includes data integration pipelines, access to production datasets, robust transformation logic, feature stores, feedback loops, server and storage infrastructure, high availability, and load balancing. And it’s not enough to build this pipeline: you may have contractual SLAs that you need to maintain and report on. So building metrics capabilities all along the pipeline is critical. In general, metrics can be classified into model metrics and operational metrics.

Model metrics capture important performance characteristics of the model. More importantly, metrics are needed to understand and monitor two important concepts that are unique to ML pipelines: model decay and drift. Depending on the type of algorithm, examples of metrics include the following (a short computation sketch follows the list):

  • AUC-ROC
  • Confusion Matrix
  • RMSE
  • Rand Index
  • KS Statistic
  • Gini Coefficient
  • Others
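As a small illustration, here is how a few of these metrics might be computed with scikit-learn on toy labels and scores; logging the same numbers on every scoring run is what turns them into monitoring signals:

```python
# Toy example of computing AUC-ROC, a confusion matrix, and RMSE with scikit-learn.
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                       # ground-truth class labels
y_score = [0.10, 0.40, 0.80, 0.65, 0.90, 0.30]    # model probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded predictions

print("AUC-ROC:", roc_auc_score(y_true, y_score))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

# For a regression model, RMSE would be tracked instead:
y_reg_true = [3.0, 5.5, 7.2]
y_reg_pred = [2.8, 6.0, 7.0]
print("RMSE:", mean_squared_error(y_reg_true, y_reg_pred) ** 0.5)
```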

Model decay is a term used to capture the reality that models degrade over time. The reasons for that happening are closely associated with our next concept, “drift.” At a high level, there are two types of drift: concept drift and data drift. Concept drift refers to a change in the relationship between the inputs and outputs of the model. For example, customer buying behavior may have changed because of economic factors, or completely new products may have been introduced in the market (remember how phones were no longer just phones when the iPhone debuted?).

Data drift, on the other hand, is when the distribution of the data changes. Imagine you are a bank trying to offer personalized service. Maybe your training data was based largely on branch walk-ins, with only a small amount of data from online channels. Then, right at launch, the pandemic hit and a very large amount of data started coming in through online banking, a channel the model had not actually done so well on during training.

Operational metrics are another important aspect. Examples include the following (an instrumentation sketch follows the list):

  • Data load start and end times
  • Number of model calls
  • Counters
  • Resource utilization
  • Run completion times
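As a sketch of what capturing these can look like, here is a small decorator that counts calls to a pipeline step and logs run completion times as structured records. The step and metric names are illustrative, not tied to any particular monitoring product:

```python
# A minimal instrumentation sketch: count calls to a pipeline step and log
# its duration as a structured (JSON) record that a monitoring system could scrape.
import json
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(message)s")
_call_counts = {}


def instrumented(step_name):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            _call_counts[step_name] = _call_counts.get(step_name, 0) + 1
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                logging.info(json.dumps({
                    "step": step_name,
                    "calls": _call_counts[step_name],
                    "duration_sec": round(time.time() - start, 3),
                }))
        return wrapper
    return decorator


@instrumented("score_batch")
def score_batch(records):
    time.sleep(0.05)  # stand-in for an actual model call
    return [0.5 for _ in records]


score_batch([{"id": 1}, {"id": 2}])
```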

Capturing metrics is great, but taking action based on them is crucial. This brings us to the topic of model management.

Model Management

Model management covers the following important aspects of a model deployed in a production pipeline (a small versioning sketch follows the list):

  • Model version control: essential for failure management and continuity of operations
  • Rollback strategy: if a failure happens, how will rollback be performed?
  • Re-training
  • Model governance: tracking model and feature lineage
  • Data quality and drift management
  • Dependency management: libraries, frameworks, services, APIs
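To illustrate the version-control and rollback items above, here is a minimal file-based sketch: each model artifact is stored under its version with a small metadata record, and a pointer file marks the active version so a rollback is just a pointer change. Real deployments would typically use a model registry; the paths and fields here are assumptions for illustration:

```python
# A minimal, file-based sketch of model version control and rollback.
# Paths and field names are illustrative assumptions, not a real registry API.
import json
import pickle
from pathlib import Path

REGISTRY = Path("model_registry")


def publish(model, version: str, metrics: dict) -> None:
    """Serialize the model under its version and mark it as active."""
    version_dir = REGISTRY / version
    version_dir.mkdir(parents=True, exist_ok=True)
    with open(version_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    (version_dir / "metadata.json").write_text(
        json.dumps({"version": version, "metrics": metrics})
    )
    (REGISTRY / "active.json").write_text(json.dumps({"active_version": version}))


def rollback(to_version: str) -> None:
    """Point the pipeline back to a previously published version."""
    if not (REGISTRY / to_version / "model.pkl").exists():
        raise FileNotFoundError(f"no published model for version {to_version}")
    (REGISTRY / "active.json").write_text(json.dumps({"active_version": to_version}))
```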

I am going to dig a bit deeper into drift and data quality management.

Data Quality and Drift Management

Let me use this cliché one more time: “garbage in, garbage out.” Data quality needs to be addressed early on and in different parts of the pipeline. Data quality is probably the single largest determinant of good model performance.

Quality of input data: missing values, null fields, noise, and bad labels all influence model training.

Quality of features: Feature stores are beginning to play a more prominent role in ML quality and performance. Centrally managed features are important for scale and quality.

To this effect, it may be time to explore the concept of “data unit tests”: expressing expectations of the data upfront, such as value ranges, expected values, and allowed nulls. A minimal sketch follows.
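Here is what such data unit tests might look like: a function that checks a batch of incoming records against explicit expectations before it reaches training or scoring. The column names and ranges are hypothetical; dedicated tools such as Great Expectations offer much richer versions of the same idea:

```python
# A minimal "data unit test" sketch: validate a batch of data against explicit
# expectations (nulls, ranges, allowed values) before training or scoring.
import pandas as pd


def check_expectations(df: pd.DataFrame) -> list:
    """Return human-readable violations; an empty list means the batch passes."""
    violations = []
    if df["age"].isnull().any():
        violations.append("age contains nulls")
    if not df["age"].between(18, 120).all():
        violations.append("age outside expected range 18-120")
    if not df["channel"].isin({"branch", "online"}).all():
        violations.append("unexpected values in channel")
    return violations


batch = pd.DataFrame({"age": [25, 130], "channel": ["online", "kiosk"]})
print(check_expectations(batch))
# ['age outside expected range 18-120', 'unexpected values in channel']
```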

We touched on the topic of drift earlier. Monitoring and managing drift is a key aspect of model management. The metrics we capture can give us serious insight into drift patterns. For instance, a change in the number of calls to a model may be an indicator that something is wrong. Visualizations of the distribution of features are also useful; observing those distributions over time can help spot changes that may not be normal. For example, the range of values may have shifted (maybe an error in downstream processing, or maybe a real change). Or you may be able to spot the dreaded “flat-lining” case, where you are expecting a continuously changing value but suddenly, for an extended time frame, you see the value flat-lining.
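As one concrete sketch of drift monitoring, the distribution of a single numeric feature at training time can be compared with a recent production window using a two-sample Kolmogorov-Smirnov test (the same KS statistic listed among the model metrics earlier). The threshold and the synthetic data below are illustrative only:

```python
# A minimal data-drift check: compare a feature's training distribution with a
# recent production window using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=50, scale=10, size=5_000)  # training-time distribution
live_feature = rng.normal(loc=58, scale=10, size=5_000)   # shifted production window

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # illustrative threshold; tune per feature and sample size
    print(f"Possible drift: KS statistic={stat:.3f}, p-value={p_value:.2e}")
```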

Performance and Scale Considerations

There are so many factors that influence the performance and scale of machine learning pipelines that they deserve an article of their own. To give you an idea, consider the following:

  • Data file formats: File formats define the structure and encoding of the data stored in them. There are many file formats; some have been popular for years, yet many are surprisingly not well suited for machine learning. File formats can also play a significant role in performance and scale. In fact, data scaling issues led to the rise of distributed file systems and the large block sizes used on them. Columnar file formats have risen in popularity because they perform very well for analytical use cases, and they work well for feature engineering and building feature stores. Parquet is an example of such a format (a small Parquet sketch follows this list). Petastorm, which builds on Parquet and supports multi-dimensional data, has become popular for machine learning. TFRecord is another format, natively supported by TensorFlow and based on protocol buffers.
  • Model file formats: Similar to data, models can be serialized into a format for transport. Some of the formats used for data files are also useful for models, such as protocol buffer files. Python models are typically serialized into .pkl files; .pb and .h5 are other file formats used by popular frameworks.
  • Data pipeline scale and performance: Data pipelines in production need special attention. The design and architecture of data pipelines can make or break the performance and stability of the entire pipeline. When large volumes of data are involved, it may be prudent to develop parallel pipelines and set them up for failover. I have seen cases where production data suddenly increases and causes “clogging.” Suddenly the data is stale and incorrect leading to bizarre model performance.
  • Compute and storage hardware: ML models can be both compute- and memory-intensive. GPUs and ASICs are the newer evolution of processors, specifically geared towards ML performance.
  • Feature store scalability: Serving a model requires features and possibly other reference lookups. This means your feature store database and infrastructure need to be production-grade and tuned for performance. Do you know the call load on your feature store? Is it a few calls per second or tens of thousands of calls per second? Do you need to join other data with feature store data at runtime? The selection of feature store technologies and infrastructure will vary widely based on these factors.
  • Parallelism and distribution: This is less of a problem today than a few years ago, as many frameworks and solutions have addressed it. Until recently, however, it was not uncommon for algorithms built in dev to be rewritten in order to scale in a distributed environment; some algorithms just don’t parallelize and distribute well. A few years ago, we were tasked with operationalizing a hierarchical clustering algorithm our client had built, which plotted a dendrogram based on psychometric data. We found that the algorithm did not scale beyond 25,000 records. We looked at various options, including Spark MLlib for distributed hierarchical clustering, but found inconsistencies as the number of input variables increased. Our solution, based on a paper published by the University of Minnesota, was to rewrite the algorithm using a divisive bisecting k-means approach that scaled well and produced the desired output. We have come a long way since then!
  • Multi-tenancy: An aspect that is often overlooked, at considerable peril, is multi-tenancy. Have you accounted for multi-tenant environments where the data is siloed? Is your model trained on a per-user basis?
  • Containerization: Deploying ML models in containers is generally a good strategy, as it helps with scale. However, it does impact the ML workflow, because you need to plan for containerized deployments. For example, you will need to decide what goes into the containers (frameworks, dependencies, training artifacts, etc.).
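To make the file-format point above a bit more tangible, here is a small sketch of writing feature data to Parquet and reading back only the columns needed at serve time, one of the properties that makes columnar formats attractive for feature stores. The file and column names are illustrative, and the example assumes pyarrow (or fastparquet) is installed:

```python
# A minimal Parquet sketch: write a feature table, then read back only the
# columns required at serve time (a benefit of columnar formats).
import pandas as pd

features = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "avg_txn_amount": [42.5, 13.0, 87.2],
    "days_since_last_login": [3, 40, 1],
})
features.to_parquet("customer_features.parquet", index=False)

# At serve time, read just the columns the model needs.
serving_view = pd.read_parquet(
    "customer_features.parquet",
    columns=["customer_id", "avg_txn_amount"],
)
print(serving_view)
```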

So, food for thought. I hope this article highlights the critical broader ecosystem around machine learning and why all of its pieces matter, even the ones that are not related to algorithm development. And just maybe, I have won some converts to my controversial title.

Raj Nair

VP, Intelligence and Analytics