Red Hat OpenShift AI Intro / Data Science and MLOps Fundamentals

This is the third article in this series on OpenShift AI. Before diving in, let's quickly recap what we've done so far.

Part 1 - Workbenches

In the first article, I introduced the Red Hat OpenShift AI (RHOAI) dashboard and walked through some of the core concepts, including a high-level overview of the menus and workspace layout. From there, we created a workbench, cloned a sample Git repository, and ran a notebook to load and display some sample event data in a pandas DataFrame.

Part 2 - Pipelines

In this article, we continued working with the Git repository we originally cloned. At a high level, we installed the pipeline server and connected it to our existing S3-compatible storage. Next, we used the Elyra pipeline editor to piece together two key steps: (1) loading the event data, and (2) processing it to perform frequency analysis on key words and phrases. Finally, we exported the pipeline to KFP (Kubeflow Pipelines) format to support reproducibility and compatibility with modern MLOps workflows.

A Teaching Moment

This article is going to be a little different from the rest of the series. So far, we've been building things hands-on in OpenShift AI, but now I want to step back and talk through some of the foundational theory behind data science and MLOps—particularly how it relates to what we've built.

Up to this point, the code we've written isn't actually a machine learning model. And that's okay—this is a common place to start. I realized this might be a good teaching moment: to clarify what a model is, what makes data usable for modeling, and how we can take unstructured or unlabeled data and prepare it for the next step in the MLOps journey.

Here are some common questions you might have at this point in the exercise.

Why isn't this a model?

Counting words or phrases in the event data isn't considered a model because it doesn't perform any kind of prediction or inference.

What we've done so far—word counts or frequency analysis—is a form of exploratory data analysis (EDA). It helps summarize or visualize the data, but it doesn’t make any decisions or predictions. It's purely descriptive, not predictive.

A machine learning model, by contrast, learns patterns in data to make predictions or classify new, unseen data points. In order to do that, it needs labeled data.

What is labeled data?

In the context of our event data example, labeled data means adding a new column (since we've already converted the data into tabular format) that indicates whether a given message or phrase represents a “Normal” or “Bad” condition.

There are a couple of ways we could do this:

  • Manual labeling: A human could read each message and assign a label like "Normal" or "Bad" (or "Warning").
  • Automatic labeling: We could use an existing field—like the "Type" field in our data, which already contains values such as "Normal" or "Warning"—to programmatically generate a label (see the sketch after this list).
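To make the automatic approach concrete, here's a minimal sketch in pandas, assuming the events have already been loaded into a DataFrame with "type" and "message" columns (as in our earlier notebook):

```python
import pandas as pd

# A tiny, hypothetical sample mirroring the fields in our event data.
events = pd.DataFrame({
    "type": ["Normal", "Warning"],
    "message": [
        "install strategy completed with no errors",
        "waiting for deployment rhods-operator to become ready",
    ],
})

# Automatic labeling: derive a new "label" column from the existing "type" field.
events["label"] = events["type"].map({"Normal": "Normal", "Warning": "Bad"})
print(events[["message", "label"]])
```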

📄 Example: Labeled Event Messages

Let’s look at a few sample event messages and how we might label them:

| Event Message | Label Source | Label |
| --- | --- | --- |
| install strategy completed with no errors | Type = Normal | Normal |
| waiting for deployment rhods-operator to become ready: deployment "rhods-operator" not available: Deployment does not have minimum availability | Type = Warning | Bad |

In this case, the "Type" field in the original event log helps us infer the label. This is now a supervised learning setup, where each input (event message) is paired with an expected outcome (label).

💡 Why This Matters

Once you have labeled data, you can begin training a machine learning model to classify new messages as "Normal" or "Bad" automatically—without relying on the "Type" field being present. This is the foundation of many practical ML applications in DevOps, IT monitoring, and anomaly detection.

What’s the difference between supervised and unsupervised learning?

In the last section, I introduced the idea of labeled data. Whether labels are generated automatically (e.g., using the Type field in our dataset) or added manually, they form the foundation for how a model can learn from data.

In supervised learning, the data includes both the input (like an event message) and the desired output or label (such as "Normal" or "Bad"). The model learns patterns from this labeled data to make predictions on new, unseen inputs. In our case, this could mean training a model to classify future messages based on past labeled ones.

In unsupervised learning, the data has no labels. Instead of learning from examples with known outcomes, the model tries to discover patterns or groupings on its own. This might include clustering similar event messages based on word usage or detecting unusual patterns that don’t fit the norm.
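To illustrate the unsupervised side, here's a small sketch (assuming scikit-learn is available in the workbench image) that clusters messages by word usage without any labels:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages: no labels, just raw text.
messages = [
    "install strategy completed with no errors",
    "install completed successfully",
    "waiting for deployment to become ready",
    "deployment does not have minimum availability",
]

# Turn word usage into numeric features, then ask for two clusters.
X = CountVectorizer().fit_transform(messages)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # e.g. [0 0 1 1]: similar messages grouped together on their own
```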

What do terms like “feature matrix” (X) and “label vector” (y) actually mean?

Think of the feature matrix (X) like the columns in a spreadsheet, or in the pandas DataFrame from the notebook we’ve been working with. Each row represents one data point (such as a single event), and each column is a measurable attribute or feature—like the number of times certain words appear, message length, or type of event.

The label vector (y) represents the outcome we want the model to learn or predict. In supervised learning, this is typically a single column where each entry corresponds to the correct label for the matching row in X.

In our example, if we label the data based on the Type field or some manual rule, the y values would be either "Normal" or "Bad".

When making predictions, we feed new, unlabeled feature data (X) into the model. The model then uses what it learned from past labeled data to output a predicted label (y). This prediction could be done inside a notebook—or through a deployed model API, which we'll explore when we get to model serving.
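Putting those pieces together, here's a hedged sketch (again assuming scikit-learn) that builds X from word counts, uses our derived labels as y, trains a simple classifier, and predicts a label for a new message:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled examples: inputs (messages) paired with outcomes (labels).
messages = [
    "install strategy completed with no errors",
    "waiting for deployment rhods-operator to become ready",
    "successfully pulled image",
    "deployment does not have minimum availability",
]
y = ["Normal", "Bad", "Normal", "Bad"]  # the label vector

# Feature matrix X: one row per event, one column per word, values are counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

model = LogisticRegression().fit(X, y)

# Feed new, unlabeled feature data into the model to get a predicted label.
new_X = vectorizer.transform(["install completed with no errors"])
print(model.predict(new_X))  # likely ['Normal'] on this toy data
```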

What is structured, semi-structured, or unstructured data?

Structured data refers to data that fits neatly into rows and columns, like a spreadsheet or SQL table. Each row represents a single record, and each column has a clearly defined data type and meaning.

Semi-structured data includes some defined structure—such as fields in a JSON object—but also contains free-form or variable elements. In our example, Kubernetes event data is semi-structured: fields like type and timestamp are consistent and well-defined, while the message field contains free-text descriptions that are less predictable and require extra processing.
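For instance, a simplified (and purely hypothetical) event record might look like this: the fixed fields parse cleanly, while the message field is free text:

```python
import json

# Simplified, hypothetical Kubernetes event: structured fields plus free text.
raw = """
{
  "type": "Warning",
  "timestamp": "2024-01-01T12:00:00Z",
  "message": "waiting for deployment rhods-operator to become ready"
}
"""
event = json.loads(raw)
print(event["type"])     # well-defined, consistent field
print(event["message"])  # free-form text that needs extra processing
```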

Unstructured data would be fully free-form—like a plain text log file or a blob of text—without any consistent format or key-value structure. In our case, if the event messages weren’t wrapped in JSON or labeled fields, we’d be working with unstructured data.

Why is data cleaning and formatting often the most time-consuming part of the entire process?

As a data scientist, one of the most tedious and time-consuming parts of the job often involves reformatting data into a more structured and consistent format—especially when working with logs, metrics, or free-form messages like the ones in our dataset.

This process, commonly referred to as data cleaning or normalization, can involve several tasks, including:

🔄 Normalizing Data

For example, when working with free-form text like our event messages, normalization might include the following (a short sketch follows the list):

  • Removing unnecessary details, such as project or namespace names, if you're focusing on overall cluster health rather than a specific application. This helps reduce noise in the data.
  • Standardizing capitalization by converting all text to lowercase. This is especially useful when working with logs from different sources—some systems might log "Error" while others use "error" or "ERROR". By converting everything to lowercase, you ensure consistent matching of words and phrases.
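Here's a minimal sketch of both steps; the quoted-name pattern is an assumption based on how resource names appear in our sample messages:

```python
import re

def normalize_message(msg: str) -> str:
    """Lowercase an event message and strip instance-specific details."""
    msg = msg.lower()  # "Error", "ERROR", and "error" all match afterwards
    # Assumption: resource names appear in double quotes,
    # e.g. deployment "rhods-operator".
    msg = re.sub(r'"[^"]+"', "", msg)
    # Collapse any leftover runs of whitespace.
    return re.sub(r"\s+", " ", msg).strip()

print(normalize_message('waiting for deployment "rhods-operator" to become ready'))
# -> waiting for deployment to become ready
```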

🧱 Formatting Data

Formatting data may involve transforming free-form text into a more structured format. In the case of event messages, one approach is to break each phrase or sentence into individual words and create new columns to represent the position of each word. This lets us analyze not just the presence of certain terms, but where they appear in the message—helping us identify consistent patterns across events.
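As a rough sketch of that idea in pandas, splitting each message and expanding the words into positional columns might look like this:

```python
import pandas as pd

messages = pd.Series([
    "install strategy completed with no errors",
    "waiting for deployment to become ready",
])

# Split each message into words and expand into one column per word position.
words = messages.str.split(expand=True)
words.columns = [f"word_{i}" for i in range(words.shape[1])]
print(words)  # word_0 holds each message's first word, word_1 the second, and so on
```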

When we combine this structural transformation with labeled data, we can start to uncover relationships between specific words and message types. In our dataset, certain words tend to appear more frequently in either "Normal" or "Warning" messages. For example:

  • Words and phrases like "completed", "successful", or "no errors" are often found in Normal messages.
  • Words like "failed", "unavailable", or "waiting" tend to appear in Warning messages.

By analyzing these patterns—whether through word frequency or word position—we begin to create meaningful features that a model can learn from. This process is an essential first step in supervised text classification, where the words in a message form the input features (X), and the message type (Normal or Warning) becomes the label (y).
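One simple way to surface those patterns is to count word frequencies per label. A rough sketch using only the Python standard library:

```python
from collections import Counter

# Hypothetical messages grouped by their label.
labeled = {
    "Normal": [
        "install strategy completed with no errors",
        "successfully pulled image",
    ],
    "Warning": [
        "waiting for deployment to become ready",
        "deployment does not have minimum availability",
    ],
}

# Count how often each word appears within each label.
for label, msgs in labeled.items():
    counts = Counter(word for m in msgs for word in m.lower().split())
    print(label, counts.most_common(3))
```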

Where do we go from here?

Now that we've explored the foundational concepts behind labeling, formatting, and structuring data for machine learning, the next step is to put it all into action. In the next article, we’ll build a notebook that applies these ideas—adding labels, cleaning and formatting the messages, and transforming our event data into a structured dataset that’s ready for modeling.

From there, we’ll convert that notebook into a new pipeline that includes metrics, supports testing with new data, and prepares us for the final stage: deploying a model that can take in raw JSON event data and return a real-time classification.