Artifacts, Metadata, and Data Lineage in MLOps

Akshay Kapoor
4 min read · Apr 24, 2022

Data-centric AI has the capacity to realize the true potential of AI for real-life problems because it focuses on building the most scalable and maintainable solution for the given data. This approach requires automated data pipelines along with tools to keep track of experiments and data flow as the data evolves throughout the ML process lifecycle.

In this post, I will describe what the commonly used keywords in MLOps mean.

Let’s start with MLOps :)

MLOps, i.e. Machine Learning Operations, is similar to DevOps but with data.

Basically, what it means is that we still need all the CI/CD (continuous integration / continuous delivery) practices for the software part (e.g. automated tests, robust orchestrated pipelines), with the ADDITIONAL ability to experiment and iterate over the whole process with data (conceptualizing, preprocessing, feature engineering, model training, model serving and monitoring … the list goes on).

Now, the major difference comes from the data and how it is handled at each of these ML pipeline stages.

Why does it all Matter?

In Data-centric AI, your Data along with the scripts that perform transformation, validation, and deployment need to be versioned.

Why versioned?

Because that way we can do these important things:

1. Manage and Understand multiple experiments featuring similar data.

2. Reproduce experimental results.

3. Moreover, GDPR policies require each organization to show that it is not misusing PII (Personally Identifiable Information).
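To make this concrete, here is a minimal sketch (not any particular tool's API) of what versioning a dataset could look like: hash the raw file, and record that hash together with the version of the transformation script, so any experiment can be traced back to the exact data it saw. The paths, field names, and commit value below are placeholders.

```python
import hashlib
import json
import time
from pathlib import Path

def version_dataset(data_path: str, script_commit: str,
                    registry_path: str = "data_registry.json") -> str:
    """Hash a dataset file and append a version record to a simple JSON registry."""
    data_bytes = Path(data_path).read_bytes()
    data_hash = hashlib.sha256(data_bytes).hexdigest()

    record = {
        "data_path": data_path,          # where the raw data lives
        "data_hash": data_hash,          # identifies this exact snapshot of the data
        "script_commit": script_commit,  # e.g. the git commit of the transformation code
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }

    registry = Path(registry_path)
    records = json.loads(registry.read_text()) if registry.exists() else []
    records.append(record)
    registry.write_text(json.dumps(records, indent=2))
    return data_hash

# Usage (path and commit are hypothetical):
# version_dataset("data/heart_train.csv", script_commit="3f2a1bc")
```

In practice you would reach for a dedicated tool for this, but the core idea stays the same: every experiment records exactly which data snapshot and which code version produced it.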

Let's think of an example:

Let's say you have training data where you predict the health of a person's heart using attributes like their weight, height, body fat, and so on. Imagine getting this data from an external website through a connector that you run.

Now, one day they change from storing weight in kgs to lbs, and your model starts behaving weirdly.

This scenario is an example of covariate shift, and your pipeline should be able to catch it before the data gets used for either (re)training or serving.
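Here's a hedged sketch of the kind of check a pipeline could run before the data gets used: compare incoming feature statistics against a stored training baseline and flag large shifts. A silent kg-to-lbs switch roughly doubles the mean of the weight column, so even this crude check would catch it. The baseline numbers, column name, and threshold are illustrative only.

```python
import statistics

# Baseline captured when the model was trained (illustrative numbers, in kg)
baseline = {"weight": {"mean": 78.0, "stdev": 14.0}}

def check_covariate_shift(column: str, new_values: list, max_z: float = 3.0) -> bool:
    """Return True if the new batch's mean is suspiciously far from the training baseline."""
    ref = baseline[column]
    new_mean = statistics.mean(new_values)
    z = abs(new_mean - ref["mean"]) / ref["stdev"]
    return z > max_z

# A new batch that silently switched to lbs (~2.2x larger values)
new_batch_weights = [165.0, 180.5, 150.2, 172.3, 190.1]
if check_covariate_shift("weight", new_batch_weights):
    print("Covariate shift detected: block this batch before (re)training or serving.")
```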

The key takeaway here is the importance of having version-controlled data produced by a robust data pipeline, which enables you to understand issues faster and reproduce your work efficiently.

I will write more about the different ways in which data can manifest (and create issues) that test the robustness of your pipeline; for now, let's move on to the remaining important data pipeline keywords.

ML Metadata: ML metadata is a high-level term used to describe the metadata produced by the components of the ML pipeline.

What are these components, you may ask?

e.g. gathering data statistics, gathering the data schema, data transformation, feature engineering, and the list goes on.

ML metadata means the metadata associated with each of these components, along with information about the different runs (say development/experimentation, model training, validation, serving) that execute these components.
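As a concrete (hedged) illustration, the ml-metadata (MLMD) library that backs TFX exposes a metadata store where you can register artifact types and record what each component run produced. The sketch below follows the library's getting-started pattern; the type name, properties, and paths are placeholders, and the exact API may vary across versions.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to (or create) a local SQLite-backed metadata store
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "metadata.sqlite"   # placeholder path
connection_config.sqlite.connection_mode = 3                # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(connection_config)

# Register an artifact type once, e.g. a "DataSet" produced by an ingestion component
data_type = metadata_store_pb2.ArtifactType()
data_type.name = "DataSet"
data_type.properties["version"] = metadata_store_pb2.STRING
data_type_id = store.put_artifact_type(data_type)

# Record an artifact produced by one execution of that component
data_artifact = metadata_store_pb2.Artifact()
data_artifact.type_id = data_type_id
data_artifact.uri = "path/to/heart_data"                    # placeholder URI
data_artifact.properties["version"].string_value = "v1"
[data_artifact_id] = store.put_artifacts([data_artifact])
```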

This brings us to another important keyword: Artifacts.

Artifacts in ML pipeline:

Artifacts are created as the components of the ML pipeline get executed; they include all the data and objects produced by the pipeline components.

You can think of artifacts as similar to logging in software. They simply provide a way to store relevant information about the data, for example the time of the run, the data version, the data path, the execution graph corresponding to the series of transformations applied, etc.

Pipeline ML metadata featuring artifacts associated with multiple ML pipeline components
Exploring the StatisticsGen artifact

The above images show pipeline metadata and what an artifact stores. In this case, you can see the pipeline has multiple components featuring artifacts (CSVExampleGen, ExampleValidator, etc.). Moreover, as we explore the StatisticsGen artifact, you can see it features a .pb (protocol buffer) file, which is a language-neutral serialized object storage format.
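To get a feel for what a StatisticsGen-style artifact contains, here is a hedged sketch using TensorFlow Data Validation (the library behind TFX's StatisticsGen/SchemaGen components): generate statistics from a CSV, infer a schema, and write both out. Both objects are protocol-buffer messages of the kind described above. The file names are placeholders, and the visualize/display calls render inside a notebook.

```python
import tensorflow_data_validation as tfdv

# Compute dataset statistics (roughly what StatisticsGen does inside a TFX pipeline)
stats = tfdv.generate_statistics_from_csv(data_location="heart_data.csv")  # placeholder file

# Infer a schema from those statistics (roughly what SchemaGen does)
schema = tfdv.infer_schema(statistics=stats)

# In a notebook, these render interactive views of the statistics and schema
tfdv.visualize_statistics(stats)
tfdv.display_schema(schema)

# Both are protocol-buffer messages and can be persisted, e.g. as text protos
tfdv.write_stats_text(stats, "stats.pbtxt")
tfdv.write_schema_text(schema, "schema.pbtxt")
```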

You can read more about ML metadata and artifact concepts here: https://www.tensorflow.org/tfx/guide/mlmd, and understand more about .pb files here: https://developers.google.com/protocol-buffers

So what does the .pb file store here? It stores the series of transformations required to generate the data output as it passes through a given pipeline component. This information is stored in an artifact as Data Provenance / Data Lineage.

Data Provenance & Lineage:

Data Provenance and Data Lineage are usually used as synonyms. They are the steps that led to the creation of a particular artifact. Basically, we can think of it as a TensorFlow graph (if you're thinking of neural networks, or a Spark graph if you fancy Big Data).
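As an illustration only (not any particular tool), lineage can be pictured as an ordered record of the transformations applied on the way to an artifact, with each step linked to a fingerprint of its output so you can walk backwards from any artifact to its origin. The transformation step names below are hypothetical.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """A tiny content hash so each lineage step can point at the exact data it produced."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]

# Hypothetical transformation steps in a small pipeline
def drop_missing(rows):
    return [r for r in rows if r.get("weight") is not None]

def to_lbs(rows):
    return [{**r, "weight": r["weight"] * 2.205} for r in rows]

lineage = []   # ordered provenance: which step ran and what it produced
data = [{"weight": 70.0}, {"weight": None}, {"weight": 82.5}]

for step in (drop_missing, to_lbs):
    data = step(data)
    lineage.append({"step": step.__name__, "output_fingerprint": fingerprint(data)})

print(json.dumps(lineage, indent=2))
```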

That’s all for this post, hope it helped you gain some understanding!

PS: I am currently reading through the ML in Production specialization on Coursera, and I am making easy-to-read notes for myself and for anyone who wants to get a basic idea of these methods.
