AI and Data Plus

AI and Data Plus Weekly #2

suchismita sahu — Tue, 01 Apr 2025 10:43:43 GMT

Understanding Reasoning LLMs

Raschka outlines four main approaches to building and improving reasoning models:

Pre-training on Reasoning Data: Incorporating datasets that emphasize logical reasoning during the initial training phase to instill foundational reasoning abilities.
Supervised Fine-Tuning: Refining pre-trained models on specific reasoning tasks using labeled data to enhance performance in targeted areas.
Reinforcement Learning from Human Feedback (RLHF): Utilizing human evaluations to guide the model's responses, promoting outputs that align with human reasoning patterns.
Prompt Engineering and Few-Shot Learning: Crafting prompts that encourage the model to produce reasoned responses and leveraging few-shot learning techniques to improve reasoning without extensive retraining.

Ref: https://magazine.sebastianraschka.com/p/understanding-reasoning-llms?utm_source=chatgpt.com

How To Make AI Agents Ready For Your Enterprise

The article "How To Make AI Agents Ready For Your Enterprise" emphasizes the importance of cautious evaluation when integrating AI agents into enterprise environments. It advises leaders to approach AI demonstrations with skepticism and underscores the necessity of reliable, supervised deployment to meet enterprise-grade standards. The piece highlights that, while AI agents hold significant potential, ensuring their dependability and alignment with organizational requirements is crucial for successful implementation.

Ref: https://substack.com/home/post/p-159509538?source=queue

The Semantic Layer Movement: The Rise & Current State

While the semantic layer enhances data discoverability and context, its effectiveness is amplified when integrated within a comprehensive data stack that includes data products, an all-purpose catalog, and application layers. This holistic approach ensures seamless discovery, rich context, and purpose-driven data utilization.

Ref: https://medium.com/@community_md101/the-semantic-layer-movement-the-rise-current-state-f8dbbb989b2e

Promptim: an experimental library for prompt optimization

"Promptim" is an experimental open-source library designed to automate and enhance prompt optimization for AI systems. By providing an initial prompt, dataset, and custom evaluators (with optional human feedback), Promptim iteratively refines prompts to improve performance on specific tasks.

The core algorithm involves running the initial prompt over a dataset to establish a baseline score, then iteratively suggesting and evaluating prompt modifications using a metaprompt. If the updated prompt shows improved metrics, it is retained; otherwise, the original is kept. This process can be repeated multiple times to achieve optimal results.
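The loop described above can be sketched in a few lines. This is a minimal illustration, not Promptim's actual API: `optimize_prompt`, `toy_evaluate`, and `toy_propose` are hypothetical names, and the toy evaluator and proposer stand in for the dataset-based scoring and the metaprompt call so that the sketch runs without an LLM.

```python
from typing import Callable, Sequence


def optimize_prompt(
    initial_prompt: str,
    dataset: Sequence[dict],
    evaluate: Callable[[str, Sequence[dict]], float],
    propose: Callable[[str, float], str],
    rounds: int = 3,
) -> tuple[str, float]:
    """Keep a candidate prompt only if it beats the best score so far."""
    best_prompt = initial_prompt
    best_score = evaluate(best_prompt, dataset)  # baseline score
    for _ in range(rounds):
        # In a real system this step would call an LLM with a metaprompt.
        candidate = propose(best_prompt, best_score)
        score = evaluate(candidate, dataset)
        if score > best_score:  # retain only improvements
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


def toy_evaluate(prompt: str, dataset: Sequence[dict]) -> float:
    # Hypothetical metric: fraction of examples whose keyword is in the prompt.
    return sum(ex["keyword"] in prompt for ex in dataset) / len(dataset)


def toy_propose(prompt: str, score: float) -> str:
    # Hypothetical proposer: append the first instruction not yet present.
    for extra in ("cite sources", "be concise"):
        if extra not in prompt:
            return prompt + "; " + extra
    return prompt


dataset = [{"keyword": "concise"}, {"keyword": "cite"}]
best, score = optimize_prompt("Answer the question", dataset, toy_evaluate, toy_propose)
print(best, score)
```

Because a candidate is kept only when its score strictly improves, the loop never regresses below the baseline, which matches the "otherwise, the original is kept" rule in the description.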

Promptim integrates with LangSmith for dataset and prompt management, result tracking, and optional human labeling. While it automates prompt optimization, incorporating human oversight ensures the quality and relevance of the final prompts. This approach balances automation with human judgment, facilitating efficient and effective prompt engineering.

Ref: https://blog.langchain.dev/promptim/

The Fallacy of Data-Driven Strategy

In "The Fallacy of Data-Driven Strategy," Collin Prather critiques the overreliance on data in formulating business strategies. He draws a parallel between traditional math education, which emphasizes rote memorization, and the data profession's focus on extracting insights from existing data. Prather argues that while data is invaluable for understanding current conditions and informing decisions, it cannot independently generate innovative strategies. He emphasizes that effective strategy development requires creativity and contextual understanding, elements that data alone cannot provide. Prather warns against the dilution of the term "insights" in the data industry, suggesting that an overemphasis on data can lead to a false sense of security and hinder the development of truly innovative strategies.

Ref: https://locallyoptimistic.com/post/the-fallacy-of-data-driven-strategy/

Embedding-Based Retrieval for Airbnb Search

Key components of the EBR system include:

Training Data Construction: Utilizing contrastive learning, Airbnb's model was trained to map both de-identified search queries and home listings into numerical vectors. This involved creating pairs of positive (booked) and negative (viewed but not booked) listings based on user interactions, capturing the multi-stage journey users undertake before making a booking.
Model Architecture: The EBR model employs a dual-encoder setup with two neural networks: one for encoding search queries and another for encoding home listings. Both networks share the same architecture and are initialized with pre-trained embeddings. This design allows the model to effectively learn and represent the semantic similarities between queries and listings.
Online Serving Strategy: To facilitate real-time retrieval, the system uses Approximate Nearest Neighbor (ANN) search techniques. This approach enables efficient matching of user queries with relevant listings by quickly identifying listings whose embeddings are closest to the query embedding.
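To make the retrieval step concrete, here is a small sketch of matching a query embedding against listing embeddings. It uses exact brute-force cosine similarity, which is the result an ANN index approximates at much larger scale; it is not Airbnb's implementation, and the three-dimensional vectors and listing names are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, listing_vecs, k=2):
    """Return the ids of the k listings most similar to the query."""
    ranked = sorted(
        listing_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [listing_id for listing_id, _ in ranked[:k]]


# Hypothetical embeddings; a real system would get these from the two encoders.
listing_vecs = {
    "beach_house": [0.9, 0.1, 0.0],
    "city_loft": [0.1, 0.9, 0.1],
    "ski_cabin": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]
results = top_k(query_vec, listing_vecs, k=2)
print(results)
```

Listing embeddings can be computed ahead of time, so only the query has to be encoded at request time.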

Ref: https://medium.com/airbnb-engineering/embedding-based-retrieval-for-airbnb-search-aabebfc85839

The Ascending Arc of AI Agents

In "The Ascending Arc of AI Agents," Ananth Packkildurai explores the evolution of artificial intelligence from simple language models to sophisticated, autonomous agents capable of reasoning, learning, and interacting with their environment.

The article walks through the following milestones: Chain-of-Thought Reasoning, Reasoning and Acting (ReAct), Open-Ended Skill Acquisition (VOYAGER), Managing Infinite Memory (MemGPT), Real-World Benchmarks (SWE-bench), and Model Distillation.

Packkildurai concludes that these advancements collectively signal a paradigm shift towards Artificial General Intelligence (AGI). By integrating reasoning, acting, continual learning, and effective memory management, AI agents are evolving into adaptive problem-solvers capable of navigating complex, real-world scenarios.

Ref: https://www.dataengineeringweekly.com/p/the-ascending-arc-of-ai-agents

Scalable data pipelines from dagster with pyspark

In his article "Scalable Data Pipelines from Dagster with PySpark," Georg Heiler explores the integration of PySpark into data pipelines using Dagster, an open-source data orchestrator.

Key Insights:

Prudent Use of Distributed Systems: Heiler emphasizes that distributed systems should be employed only when necessary. While big data technologies have advanced, not all datasets require distributed processing. Tools like Pandas, Dask, Vaex, and Polars offer scalable solutions that can handle substantial data volumes on a single machine.
Dagster and Spark Integration: The article details how Dagster can orchestrate PySpark jobs, providing a unified framework for managing data workflows. This integration allows for scalable data processing, accommodating both small-scale and large-scale data tasks.
Separation of Business Logic and Execution Resources: By decoupling business logic from execution resources, Dagster enhances testability and maintainability of data pipelines. This approach enables developers to test business logic independently of the execution environment, improving pipeline reliability.
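The last point can be illustrated without Dagster or Spark. The sketch below is plain Python, not Dagster's resource API: `Storage`, `InMemoryStorage`, and `revenue_job` are hypothetical names chosen to show how a pure function can be tested against an in-memory stand-in while production code would inject a different backend.

```python
from dataclasses import dataclass
from typing import Protocol


class Storage(Protocol):
    """Anything that can return the rows of a named table."""

    def read(self, name: str) -> list[dict]: ...


@dataclass
class InMemoryStorage:
    """Test double; a production resource might read from Spark instead."""

    tables: dict

    def read(self, name: str) -> list[dict]:
        return self.tables[name]


def total_revenue(orders: list[dict]) -> float:
    """Pure business logic: no knowledge of where the rows came from."""
    return sum(o["quantity"] * o["unit_price"] for o in orders)


def revenue_job(storage: Storage) -> float:
    """Glue code: the execution resource is injected by the caller."""
    return total_revenue(storage.read("orders"))


test_storage = InMemoryStorage(
    {"orders": [
        {"quantity": 2, "unit_price": 5.0},
        {"quantity": 1, "unit_price": 3.5},
    ]}
)
revenue = revenue_job(test_storage)
print(revenue)
```

Swapping `InMemoryStorage` for a cluster-backed implementation changes where the job runs, not what `total_revenue` computes.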

Ref: https://georgheiler.com/post/dagster-series-5-scalability/

The Future of Reliable Data + AI—Observing the Data, System, Code, and Model

In "The Future of Reliable Data + AI—Observing the Data, System, Code, and Model," Lior Gavish discusses the complexities of ensuring reliability in AI applications. He emphasizes that AI systems can fail due to issues in four key areas:

Data: AI models depend on vast amounts of structured and unstructured data. Errors such as missing values or format changes can lead to inaccurate outputs. Ensuring data quality is foundational for reliable AI applications.
System: The infrastructure supporting AI, including data pipelines and orchestration tools, must function seamlessly. Failures in these systems can disrupt data flow, leading to incomplete or delayed AI outputs.
Code: The software components that process data and interact with AI models need to be robust. Bugs or inefficiencies in code can introduce errors, affecting the performance and reliability of AI applications.
Model: The AI models themselves can degrade over time or may not generalize well to new data. Continuous monitoring and updating of models are necessary to maintain their accuracy and relevance.

Gavish advocates for a comprehensive observability approach that monitors these four components collectively. By integrating intelligent monitoring, diagnosis, and resolution tailored to specific business contexts, organizations can transition from reactive problem-solving to proactive reliability management. This holistic strategy ensures that AI systems deliver consistent and trustworthy results.

Ref: https://www.montecarlodata.com/blog-the-future-of-reliable-data-ai-data-system-code-model/

2026 Will Be The Year of Data + AI Observability

In the article "2026 Will Be The Year of Data + AI Observability," Barr Moses predicts that 2026 will mark a significant shift towards integrating data observability with AI, emphasizing the importance of combining first-party data with large language models (LLMs) to unlock unique insights and automate processes.

Key Points:

Data + AI Integration: Organizations are increasingly merging their internal data with AI technologies to enhance decision-making and operational efficiency.
Observability Challenges: As AI applications become more complex, ensuring their reliability necessitates comprehensive observability across data, systems, code, and models.
Historical Context: Drawing parallels with past technological advancements, the article suggests that widespread adoption of data + AI will occur once enterprise-level reliability is achieved.

Moses concludes that embracing data + AI observability is crucial for organizations aiming to leverage these technologies effectively and maintain trustworthiness in their AI applications.

Ref: https://www.montecarlodata.com/blog-2026-will-be-the-year-of-data-ai-observability/

Happy Learning! Stay Tuned :)

AI and Data Plus Weekly #1

suchismita sahu — Thu, 27 Mar 2025 07:22:32 GMT

Journey of next generation control plane for data systems

LinkedIn developed Nuage, a control plane framework, to streamline resource provisioning and management across its data infrastructure, reducing the manual coordination previously required between developers and infrastructure teams. Initially offering self-service capabilities for over 30 platforms, including storage solutions like Espresso and Venice, and streaming services like Kafka, Nuage evolved to manage the entire resource lifecycle with features such as resource discoverability, access control, and policy enforcement. This transformation enhanced operational efficiency and scalability within LinkedIn's data systems.

https://www.linkedin.com/blog/engineering/infrastructure/journey-of-next-generation-control-plane-for-data-systems?utm_source=substack&utm_medium=email

How AI will disrupt data engineering as we know it

Tristan Handy discusses the profound impact artificial intelligence is expected to have on the field of data engineering over the next few years. He suggests that AI will significantly enhance efficiency in tasks such as data ingestion, transformation, and pipeline maintenance, potentially improving productivity by 50% or more. Handy emphasizes that while AI will automate many routine tasks, the role of data engineers will evolve towards more strategic responsibilities, focusing on areas like architecture design, stakeholder collaboration, and ensuring data quality. This shift is anticipated to benefit both data engineers, by elevating their roles, and organizations, by providing more effective and accessible data systems.

https://www.getdbt.com/blog/how-ai-will-disrupt-data-engineering

Critical role of effective log analysis in Complex system

Host Jesse Anderson interviews Cliff Crosland, CEO of Scanner.dev, discussing the critical role of effective log analysis in complex systems. Crosland shares insights from his experience with distributed systems, including graph creation and entity resolution, and explores the implications of Generative AI and Large Language Models (LLMs) for current and future programmers. The conversation also addresses challenges in transitioning from batch to real-time systems in security, perspectives on containerization and Kubernetes consolidation leading to the microservices paradigm, and an in-depth look at Scanner.dev's approach to utilizing lambda functions for creating a performant yet cost-efficient map/reduce-style distributed system.

Event Streaming: What It Is, How It Works, and Why You Should Use It

Event streaming is a data processing technique that captures and processes data in real time, allowing businesses to analyze and respond to events as they occur. Unlike traditional batch processing, which handles data in intervals, event streaming enables immediate action on data points such as user logins, page views, or transactions. This approach enhances operational efficiency and agility across various industries.
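The difference between the two styles can be shown with a toy example. This is a sketch in plain Python rather than a real streaming platform: the event shapes are made up, and a generator stands in for a consumer reading from a stream.

```python
from collections import Counter


def stream_counts(events):
    """Streaming style: emit an updated count after every single event."""
    counts = Counter()
    for event in events:
        counts[event["type"]] += 1
        yield event["type"], counts[event["type"]]


def batch_counts(events):
    """Batch style: wait for the whole interval, then compute once."""
    return Counter(event["type"] for event in events)


# Hypothetical events; in practice these would arrive over time.
events = [{"type": "login"}, {"type": "page_view"}, {"type": "login"}]

updates = list(stream_counts(events))
totals = batch_counts(events)
print(updates)
print(dict(totals))
```

Both approaches end with the same totals; the streaming version simply makes every intermediate state available as soon as the triggering event arrives.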

https://www.rudderstack.com/blog/event-streaming/

Protecting user data through source code analysis at scale

Meta's Anti-Scraping team outlines their proactive approach to combating unauthorized data scraping. They have integrated static analysis tools, such as Zoncolan for Hack and Pysa for Python, into their development workflow to detect and address potential scraping vulnerabilities across platforms like Facebook, Instagram, and parts of Reality Labs. By defining specific sources (e.g., user-controlled parameters) and sinks (e.g., data returned to users), these tools automatically identify and flag code paths that could be exploited for data scraping, allowing engineers to remediate issues before deployment.
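The source-to-sink idea can be sketched as reachability over a data-flow graph. This toy example is not how Zoncolan or Pysa are implemented or configured; the node names and the `flows` graph are hypothetical, and real tools derive such flows from the code itself.

```python
from collections import deque


def find_tainted_paths(flows, sources, sinks):
    """Return every cycle-free path that leads from a source to a sink."""
    paths = []
    for source in sorted(sources):
        queue = deque([[source]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in sinks:
                paths.append(path)  # data from a source reaches a sink
                continue
            for nxt in flows.get(node, []):
                if nxt not in path:  # avoid looping forever on cycles
                    queue.append(path + [nxt])
    return paths


# Hypothetical data-flow edges: "value at key flows into each listed node".
flows = {
    "request.user_id": ["lookup_profile"],
    "lookup_profile": ["profile_record"],
    "profile_record": ["http_response", "audit_log"],
    "config.page_size": ["paginate"],
}

paths = find_tainted_paths(
    flows,
    sources={"request.user_id"},  # user-controlled parameter
    sinks={"http_response"},      # data returned to users
)
print(paths)
```

Each reported path is a candidate for review: an engineer would check whether rate limits or access checks sit somewhere along it.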

https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/

Scale Unstructured Text Analytics with Efficient Batch LLM Inference

Snowflake discusses how organizations can leverage Snowflake Cortex AI to process and analyze large volumes of unstructured text data efficiently. By integrating batch processing capabilities with Large Language Models (LLMs), teams can perform tasks such as sentiment analysis, summarization, and translation directly within the Snowflake environment. This approach enables customer intelligence teams to analyze reviews and forum comments to identify sentiment trends, while support teams can process tickets to uncover product issues and inform gaps in a product roadmap.

https://www.snowflake.com/en/blog/batch-llm-inference-text-analytics-cortex/

PydanticAI: Advancing Generative AI Agent Development through Intelligent Framework Design

PydanticAI is a Python agent framework developed by the creators of Pydantic to streamline the development of production-grade applications utilizing Generative AI. It integrates Pydantic's robust data validation and parsing capabilities, ensuring strict data integrity in AI-driven applications.

https://www.marktechpost.com/2025/03/25/pydanticai-advancing-generative-ai-agent-development-through-intelligent-framework-design/

Embedding-Based Retrieval for Airbnb Search

Key Components of Airbnb's EBR System:

Training Data Construction:
- The team employed contrastive learning to train models that map search queries and home listings into numerical vectors.
- They constructed positive and negative pairs by analyzing user behavior, considering booked listings as positives and non-booked but viewed or wishlisted listings as negatives.
Model Architecture:
- The EBR system uses a dual-encoder architecture, with separate encoders for queries and listings.
- This design allows for efficient retrieval by precomputing listing embeddings, enabling real-time matching during user searches.
Online Serving Strategy:
- To facilitate quick retrieval, the system employs Approximate Nearest Neighbor (ANN) search methods.
- This approach balances retrieval accuracy with computational efficiency, ensuring timely responses to user queries.
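As an illustration of the training data construction step, the sketch below turns one search journey into (query, positive, negative) triples. It is a simplification under stated assumptions, not Airbnb's pipeline: the event format, the `build_training_triples` name, and the sample journey are all hypothetical.

```python
def build_training_triples(query, journey):
    """Pair each booked listing with every viewed/wishlisted, unbooked one."""
    booked = [e["listing"] for e in journey if e["action"] == "booked"]
    negatives = []
    for event in journey:
        listing = event["listing"]
        is_candidate = event["action"] in {"viewed", "wishlisted"}
        if is_candidate and listing not in booked and listing not in negatives:
            negatives.append(listing)
    return [(query, pos, neg) for pos in booked for neg in negatives]


# Hypothetical journey: the user looks at three listings and books one.
journey = [
    {"listing": "A", "action": "viewed"},
    {"listing": "B", "action": "wishlisted"},
    {"listing": "C", "action": "viewed"},
    {"listing": "C", "action": "booked"},
]
triples = build_training_triples("lake cabin for 4", journey)
print(triples)
```

A contrastive objective would then push the query embedding closer to the positive listing than to the negative one in each triple.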

https://medium.com/airbnb-engineering/embedding-based-retrieval-for-airbnb-search-aabebfc85839

Foundation Model for Personalized Recommendation

Key Developments in Foundation Models for Personalized Recommendation:

360Brew: Developed by LinkedIn, 360Brew is a 150-billion-parameter decoder-only model trained to handle over 30 predictive tasks across the platform. It eliminates the need for extensive feature engineering by utilizing a textual interface, streamlining the recommendation process and reducing technical debt.
VIP5 (Visual P5): This multimodal foundation model integrates visual, textual, and personalization modalities under a unified framework. By employing multimodal personalized prompts, VIP5 processes diverse data types, enhancing recommendation accuracy across various content forms.
Graph Foundation Models: These models leverage graph structures to capture complex relationships within data, improving the understanding of user-item interactions. They have become instrumental in advancing recommender systems by effectively modeling intricate data dependencies.

https://netflixtechblog.medium.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39

Building Data Platforms: The Mistake Organisations Make

The article "Building Data Platforms: The Mistake" from Modern Data 101 discusses common pitfalls encountered during the development of data platforms. It also emphasizes the importance of a balanced approach that integrates technological innovation with strategic planning, user-centric design, robust data governance, and scalability considerations to successfully build effective data platforms.

https://moderndata101.substack.com/p/building-data-platforms-the-mistake

Happy Learning! Stay Tuned!

Deployment of an LLM model with AWS EKS

suchismita sahu — Mon, 26 Aug 2024 10:00:21 GMT

Generative AI technology involves tuning and deploying Large Language Models (LLMs) and giving developers access to those models to execute prompts and conversations. Platform teams that standardize on Kubernetes can tune and deploy LLMs on Amazon Elastic Kubernetes Service (Amazon EKS).

Use case

Company ABC uses Stable Diffusion to generate contextualized images of a subject (e.g., a dog) in different scenes. ABC binds a unique identifier to the subject (e.g., a photo of [v]dog) in order to synthesize photorealistic images of that subject based on the input prompt (e.g., a photo of [v]dog on the moon).

Commercial applications of ABC may include:

Generating images from text descriptions for social media platforms, e-commerce sites, and other online platforms.
Creating personalized avatars or profile pictures for users.
Generating product images for online stores.
Creating marketing materials and educational content that uses visual aids, etc.

Why AWS EKS

Amazon EKS clusters can scale to support tens of thousands of active containers, which makes it ideal for intensive AI workloads.
Beyond scalability, Amazon EKS offers a high degree of customization, which allows users to fine-tune configurations to match specific requirements.
Amazon EKS incorporates robust built-in safeguards to protect both your AI models and the data.

There is a vast ecosystem of tools available to build and run models, even within the Kubernetes landscape. One emerging stack on Kubernetes combines JupyterHub, Argo Workflows, Ray, and Kubernetes, and is called the JARK stack.

JARK Architecture

JupyterHub provides a shared platform for running notebooks that are popular in business, education, and research. It promotes interactive computing where users can execute code, visualize results, and work together. In the realm of GenAI, JupyterHub accelerates the experimentation process, especially in the feedback loop. It’s also where Data Engineers collaborate on models for Prompt Engineering.

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. It provides a structured and automated pipeline tailored for the fine-tuning of models.

The Argo workflow pipeline consists of the following stages:

Data Preparation: Organize and preprocess training datasets.
Model Configuration: Define the architecture and hyper-parameters for LLM fine-tuning.
Fine-tuning: Execute the training regimen.
Validation: Gauge the performance of the fine-tuned model.
Hyperparameter Tuning: Optimize settings for peak performance.
Model Evaluation: Assess the model’s efficacy using separate test data.
Deployment: Host the model to cater to inference requests.

Ray is an open-source distributed computing framework that makes it easy to scale applications and to use state-of-the-art machine learning libraries. Ray is used to distribute the training of generative models across multiple nodes, which accelerates the training process and allows for the handling of larger datasets.

Ray Serve is a powerful model serving library that facilitates online inference application programming interface (API) creation. Notably, it’s compatible with major frameworks like PyTorch, Keras, and TensorFlow. It’s optimized for serving LLMs with features like response streaming, dynamic request batching, and multi-node/multi-GPU (graphics processing unit) support. Beyond just model serving, Ray Serve allows the integration of multiple models and business rules into a single service. Built on Ray, it’s designed for scalability across machines and offers resource-efficient scheduling.

Kubernetes is a powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications. Kubernetes provides the infrastructure to run and scale GenAI models in containers, which ensures high availability, fault tolerance, and efficient resource utilization.

In addition to the JARK stack, we also take advantage of the following two libraries from Hugging Face that give us the tools to personalize the Stable Diffusion model: Accelerate and Diffusers.

Accelerate is an open-source library specifically designed to simplify and optimize the process of training and fine-tuning deep learning models. For our purpose, it provides a high-level API that makes it easy to experiment with different hyper-parameters and training configurations without rewriting the training loop each time, and it makes efficient use of the available hardware resources.

Diffusers is the go-to library for state-of-the-art pre-trained diffusion models for generating images, audio, and even 3D structures of molecules. It ships easy-to-use training examples as a collection of scripts that demonstrate how to use the diffusers library for a variety of personalization tasks, such as Unconditional Training, Text-to-Image Training, Dreambooth, ControlNet, and Custom Diffusion.

Steps to deploy Stable Diffusion Model on Amazon EKS

Pre-requisites

AWS Command Line Interface (AWS CLI) v2 – the CLI for AWS services
kubectl – the Kubernetes CLI
Terraform 1.5 – an infrastructure as code tool
Hugging Face token with write scope
jq – a lightweight and flexible command line JSON processor

Step 1: Clone the GitHub repository

git clone https://github.com/awslabs/data-on-eks.git

Step 2: Deploy the sample blueprint

Navigate to the ai-ml/jark-stack blueprint directory and run the ./install.sh script. This script runs the terraform init and terraform apply commands. By default, the solution is configured for the us-west-2 Region; update the variables.tf file to deploy it to another AWS Region. The deployment takes approximately 30 minutes to complete.

cd data-on-eks/ai-ml/jark-stack/terraform
export TF_VAR_huggingface_token=hf_XXXXXXXXXX

./install.sh  

Initializing ...
Initializing the backend...
Initializing modules...

Initializing provider plugins...
Terraform has been successfully initialized!
...
SUCCESS: Terraform apply of all modules completed successfully

The Terraform based blueprint provisions the following:

Amazon Virtual Private Cloud (Amazon VPC) with subnets, route tables, NAT gateway
Amazon EKS cluster (version 1.27)
Amazon EKS core-managed node group used to host some of the add-ons that we’ll provision on the cluster.
Another EKS gpu-managed node group used to provision GPU-based instances. For the purpose of the post, we chose Amazon Elastic Compute Cloud (Amazon EC2) G5 instances, which are based on NVIDIA A10G Tensor Core GPUs with 24 GB of memory per GPU.
A Kubernetes secret for the Hugging Face token and the configmap containing our sample IPython notebook that will be mounted on the notebook pod.
Several add-ons, which we discuss in the next section.

Add-ons

Let’s take a look at the add-ons, which are the operational software pods that are deployed as a part of the stack.

aws eks update-kubeconfig --name jark-stack --region us-west-2

kubectl get deployments -A
                                                                
NAMESPACE              NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
ingress-nginx          ingress-nginx-controller                             1/1     1            1           36h
jupyterhub             hub                                                  1/1     1            1           36h
jupyterhub             proxy                                                1/1     1            1           36h
kube-system            aws-load-balancer-controller                         2/2     2            2           36h
kube-system            coredns                                              2/2     2            2           2d5h
kube-system            ebs-csi-controller                                   2/2     2            2           2d5h
kuberay-operator       kuberay-operator                                     1/1     1            1           36h
nvidia-device-plugin   nvidia-device-plugin-node-feature-discovery-master   1/1     1            1           36h

Amazon EBS CSI Driver

The Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver allows Amazon Elastic Kubernetes Service (Amazon EKS) clusters to manage the lifecycle of Amazon EBS volumes for persistent volumes.

Set the default StorageClass to gp3.

AWS Load Balancer Controller

The AWS Load Balancer Controller manages AWS Elastic Load Balancers for a Kubernetes cluster. You need one Network Load Balancer to access the Jupyter notebooks and, eventually, another Network Load Balancer that provides an ingress for the self-hosted inference endpoint, which is discussed later in the post.

NVIDIA Device Plugin

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically expose the number of GPUs to Kubernetes thus allowing us to run GPU enabled containers on our cluster.

If you look under the data-on-eks/ai-ml/jark-stack/terraform/helm-values folder, you will see three Helm values files. In this example, we pass a minimal values.yaml to the Helm chart that enables the gpu-feature-discovery and node-feature-discovery features of the chart, as well as a toleration that allows the node-feature-discovery pods to run on the GPU nodes we created via the blueprint. We’ll dive deeper into advanced configuration of the NVIDIA Device Plugin/NVIDIA GPU Operator in another post.

gfd:
  enabled: true
nfd:
  enabled: true
  worker:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - operator: "Exists"

JupyterHub

Similarly, you’ll see a jupyterhub-values.yaml. The Terraform script installed JupyterHub. In this example, we passed a values.yaml to the Helm chart that configures JupyterHub to use a load balancer for access, specifies a GPU requirement on the resource, and adds an Amazon EBS-based storage volume for persistence. Note that we show basic user authentication based on username and password for the notebooks for demonstration purposes only. For a real-world setup, consider using an identity provider.

...
proxy:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
      service.beta.kubernetes.io/aws-load-balancer-type: external
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'
      service.beta.kubernetes.io/aws-load-balancer-ip-address-type: ipv4
singleuser:
  image:
    name: public.ecr.aws/h3o5n2r0/gpu-jupyter
    tag: v1.5_cuda-11.6_ubuntu-20.04_python-only
    pullPolicy: Always
..
  extraResource:
    limits:
      nvidia.com/gpu: "1"
  extraEnv:
    HUGGING_FACE_HUB_TOKEN:
      valueFrom:
        secretKeyRef:
          name: hf-token
          key: token
  storage:
    capacity: 100Gi
...
      - name: notebook
        configMap:
          name: notebook
...

The Dockerfile for the container image that we use for the notebook is provided in the repository at src/notebook/Dockerfile.

Ingress-Nginx

Ingress-nginx allows us to use path rewrite rules to expose both the Ray dashboard and the inference endpoint through the same load balancer. This model also allows us to run multiple Ray Serve endpoints behind that load balancer and use path-based routing to serve, for example, different model versions.
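The routing idea can be sketched as a longest-prefix match followed by a path rewrite. This is a conceptual illustration in Python, not ingress-nginx configuration: the path prefixes and backend names are hypothetical, while 8265 and 8000 are the default ports of the Ray dashboard and Ray Serve.

```python
def route(path, routes):
    """Pick the backend with the longest matching prefix and strip the prefix."""
    for prefix in sorted(routes, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            rewritten = path[len(prefix):] or "/"
            return routes[prefix], rewritten
    return None  # no rule matched


# Hypothetical rules: two services behind one load balancer.
routes = {
    "/dogbooth/dashboard": "ray-dashboard:8265",
    "/dogbooth/serve": "ray-serve:8000",
}

print(route("/dogbooth/serve/imagine", routes))
print(route("/dogbooth/dashboard", routes))
print(route("/unknown", routes))
```

Adding a second model version would only require another prefix that points at a different Ray Serve backend.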

Kuberay-Operator

The KubeRay Operator makes deploying and managing Ray clusters on top of Kubernetes painless. Clusters are defined as a custom RayCluster resource and managed by a fault-tolerant Ray controller. The KubeRay Operator automates Ray cluster lifecycle management, autoscaling, and other critical functions.

Later in the post, we describe how to create an inference service for dogbooth using the RayService custom resource definition on the cluster.

Step 3: Fine-Tune Stable Diffusion Model

You are now ready to start experimenting with the model and prepare a notebook that helps you personalize it for your needs. First, get the Load Balancer DNS name.

kubectl get svc proxy-public -n jupyterhub --output jsonpath='{.status.loadBalancer.ingress[0].hostname}'
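For readers less familiar with jsonpath, the sketch below performs the same lookup in Python on the JSON form of a Service (as produced by `kubectl get svc proxy-public -n jupyterhub -o json`). The sample document is trimmed to the fields the lookup touches, and the helper name `load_balancer_hostname` is made up for this example.

```python
import json


def load_balancer_hostname(service_json: str):
    """Mirror of .status.loadBalancer.ingress[0].hostname; None if not ready."""
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if not ingress:
        return None  # the load balancer has not been provisioned yet
    return ingress[0].get("hostname")


# Trimmed sample using the placeholder hostname from this post.
sample = json.dumps({
    "kind": "Service",
    "status": {
        "loadBalancer": {
            "ingress": [
                {"hostname": "k8s-jupyterh-proxypub-xxx.elb.us-west-2.amazonaws.com"}
            ]
        }
    },
})

hostname = load_balancer_hostname(sample)
print(hostname)
```

If the function returns `None`, the Service exists but the load balancer is still being created, which is also when the kubectl command prints nothing.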

Open the returned DNS hostname (e.g., k8s-jupyterh-proxypub-xxx.elb.us-west-2.amazonaws.com) in the web browser.

This triggers a pod jupyter-user1 to be provisioned on the g5 instance. You can see the pod if you issue kubectl get pods -n jupyterhub

Upon successful launch, you should be redirected to the notebook console on the browser.

Start the provided python notebook under the dogbooth directory in the notebook user interface (UI’s) file browser as shown in the following figure.

You can then step through the notebook’s cells as shown in the following figure. The first cell runs the NVIDIA System Management Interface (nvidia-smi) to verify our notebook instance is correctly provisioned on the GPU node and it sees the underlying NVIDIA A10G GPU.

The next four cells set up our development environment by cloning the Hugging Face diffusers GitHub repository and installing some Python dependencies that diffusers needs. Additionally, install xFormers in order to enable memory-efficient attention, as described in the dreambooth example.

Once you have stepped through those tasks, install bitsandbytes so that you use the 8-bit optimizer to reduce memory requirements further.

Upon successful installation of bitsandbytes, set up the requirements for running the dreambooth training script. This includes installing some additional dependencies, setting up a default configuration for accelerate, logging into Hugging Face, and downloading a sample dataset from Hugging Face.

Now, you can launch training after setting up environment variables for the location of the input model, dataset directory, and output directory of the tuned model. Hugging Face accelerate does all the heavy lifting to help us experiment with the model. The hyper-parameters used for the following sample are optimized for the training to run successfully on 1 NVIDIA A10G GPU with 24 GB memory.

This takes about 1.5 hours to complete, which is a perfect time to grab some food. You can reduce the training time by changing some of the hyper-parameters (e.g., --max_train_steps=400), but this comes at the expense of the model’s performance and accuracy.
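As a rough sketch of what the launch step might look like, the following Python snippet assembles an accelerate launch command for the diffusers train_dreambooth.py script. The flags are the ones documented in the diffusers dreambooth example, but the values are illustrative assumptions rather than the notebook's tuned hyper-parameters. MODEL_NAME and INSTANCE_DIR are assumed environment variable names (only OUTPUT_DIR is named in this post), and the fallback values are placeholders.

```python
import os
import shlex


def build_dreambooth_command(model_name, instance_dir, output_dir, max_train_steps=800):
    """Return the argument list for an `accelerate launch` dreambooth run.

    All numeric values below are illustrative, not the notebook's settings.
    """
    return [
        "accelerate", "launch", "train_dreambooth.py",
        f"--pretrained_model_name_or_path={model_name}",  # input model
        f"--instance_data_dir={instance_dir}",            # dataset directory
        f"--output_dir={output_dir}",                     # tuned model output
        "--instance_prompt=a photo of [v]dog",
        "--resolution=512",
        "--train_batch_size=1",
        "--gradient_accumulation_steps=1",
        "--learning_rate=2e-6",
        "--lr_scheduler=constant",
        "--lr_warmup_steps=0",
        f"--max_train_steps={max_train_steps}",           # fewer steps: faster, less accurate
        "--use_8bit_adam",                                # bitsandbytes 8-bit optimizer
        "--enable_xformers_memory_efficient_attention",   # xFormers attention
        "--push_to_hub",                                  # upload the result to Hugging Face
    ]


# Placeholder fallbacks; replace them with your own model id and paths.
cmd = build_dreambooth_command(
    model_name=os.environ.get("MODEL_NAME", "<base-model-id>"),
    instance_dir=os.environ.get("INSTANCE_DIR", "./dog"),
    output_dir=os.environ.get("OUTPUT_DIR", "./dogbooth"),
)
print(shlex.join(cmd))
```

The snippet only prints the command. Passing the list to subprocess.run would start the actual training, which requires the GPU environment prepared in the earlier cells.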

After the training script completes, you can verify the model has been created and run a sample inference to check how it performs.

Open the dog-bucket.png file. This picture is stored under the /home/jovyan/diffusers/examples/dreambooth folder.

Since accelerate uploads the model to Hugging Face as well, you can even test a sample inference on their Hosted Inference API. You’ll find it if you navigate to https://huggingface.co/spaces//dogbooth or the value that you provided for $OUTPUT_DIR.


If the model overfits or underfits, then please refer to an in-depth analysis of dreambooth performed by Hugging Face to help you adjust the hyper-parameters to improve model performance. Those recommendations are beyond the scope of this post.

Step 4: Serving the Stable Diffusion Model

Now that you have fine-tuned the model, host an inference endpoint for dogbooth on our Amazon EKS cluster.

You can use the RayService custom resource definition (CRD) to deploy a RayCluster with a Ray Serve application. The application pulls the dogbooth model from Hugging Face, where the accelerate training script pushed it as an output of the fine-tuning experiment.

Define Entrypoint for RayService

The Ray Serve Python application is packaged in a container image that the RayCluster pulls during deployment. The Ray documentation provides sample code for creating an inference application using Ray Serve and FastAPI. We tweak the provided Python code to load our custom dogbooth model, which was pushed to Hugging Face, as model_id by passing an environment variable MODEL_ID in the RayService configuration, as shown in the following steps. Review the Python application under src/service/dogbooth.py. To inspect the Dockerfile used to build the container image for the RayCluster head and worker nodes, see src/service/Dockerfile.

Advanced configuration of the RayService is left as an exercise to the reader.

Define RayService

You are now ready to deploy the RayService. We have provided a ray-service.yaml in the data-on-eks/ai-ml/jark-stack/terraform/src/service directory. Note the following about the ray-service.yaml manifest:

It creates a namespace called dogbooth where we deploy the RayCluster.
It creates an Ingress so that you can expose the RayService endpoint via ingress-nginx out to the AWS Network Load Balancer, with path-based routing for the dashboard and the inference services.
You should edit MODEL_ID under env_vars in runtimeEnv to change the model repository to the one you created during fine-tuning.

---
apiVersion: v1
kind: Namespace
metadata:
  name: dogbooth
---
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: dogbooth
  namespace: dogbooth
spec:
...
  serveConfig:
    importPath: dogbooth:entrypoint
    runtimeEnv: |
      env_vars: {"MODEL_ID": "askulkarni2/dogbooth"}
  rayClusterConfig:
    rayVersion: '2.6.0'
    headGroupSpec:
...
      template:
        spec:
          containers:
            - name: ray-head
              image: $SERVICE_REPO:0.0.1-gpu
              resources:
                limits:
                  cpu: 2
                  memory: 16Gi
                  nvidia.com/gpu: 1
...
    workerGroupSpecs:
      - replicas: 1
...
        template:
          spec:
            containers:
              - name: ray-worker
                image: $SERVICE_REPO:0.0.1-gpu
...
                resources:
                  limits:
                    cpu: "2"
                    memory: "16Gi"
                    nvidia.com/gpu: 1
...
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dogbooth
  namespace: dogbooth
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: "/\$1"
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
        - path: /dogbooth/(.*)
          pathType: ImplementationSpecific
          backend:
            service:
              name: dogbooth-head-svc
              port:
                number: 8265
        - path: /dogbooth/serve/(.*)
          pathType: ImplementationSpecific
          backend:
            service:
              name: dogbooth-head-svc
              port:
                number: 8000

Run kubectl apply -f src/service/ray-service.yaml to create the RayService in the dogbooth namespace.

Once applied, the RayCluster’s head node and worker node are scheduled on the GPU nodes, and the inference endpoint becomes available via the load balancer’s DNS hostname. Because of the large size of the GPU-based rayproject/ray-ml:2.6.0-gpu base image, this can take up to eight minutes to complete.

Run kubectl get pods -n dogbooth --watch and wait for the pods to be up. Then get the load balancer DNS hostname and explore the Ray Dashboard in the browser.

kubectl get ingress dogbooth -n dogbooth --output jsonpath='{.status.loadBalancer.ingress[0].hostname}'

Open this URL in a browser: http://k8s-ingressn-ingressn-xxx.elb.us-east-1.amazonaws.com/dogbooth/

You can now view the RayService under the Serve tab on the dashboard.

Finally, verify our dogbooth model deployment with a prompt such as:

 http://k8s-ingressn-ingressn-xxx.elb.us-east-1.amazonaws.com/dogbooth/serve/imagine?prompt=a photo of [v]dog on the beach
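The prompt contains spaces and brackets, so a client that is not a browser should URL-encode it first. Here is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the assumption that the endpoint responds with raw image bytes follows the Ray documentation sample rather than anything confirmed in this post.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_imagine_url(base_url, prompt):
    """Build the /imagine URL with a percent-encoded prompt query parameter."""
    query = urlencode({"prompt": prompt}, quote_via=quote)
    return f"{base_url.rstrip('/')}/imagine?{query}"


def save_image(url, out_path, timeout=120):
    """Fetch the URL and write the response body to a file.

    Assumes the service returns the generated image as raw bytes.
    """
    with urlopen(url, timeout=timeout) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())


url = build_imagine_url(
    "http://k8s-ingressn-ingressn-xxx.elb.us-east-1.amazonaws.com/dogbooth/serve",
    "a photo of [v]dog on the beach",
)
print(url)
```

Calling save_image(url, "dog.png") against your own load balancer hostname would then store the generated picture locally.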

Conclusion

In this post, we discussed the advent of GenAI models, their advantages, key use cases, and the resource-intensive nature of producing robust outputs. We also covered the key advantages that Amazon EKS provides for these use cases through its built-in scalability, resiliency, and repeatable deployments across environments, while giving customers more control and flexibility and helping drive cost effectiveness. We then walked you through the steps to deploy a GenAI model on EKS using the JARK stack.

Sagemaker vs. JupyterHub vs. Vertex AI

suchismita sahu — Mon, 26 Aug 2024 09:44:55 GMT

JupyterHub

JupyterHub provides a multi-user platform for Jupyter that your team members can log in to and run notebooks on. There are two JupyterHub distribution options: The Littlest JupyterHub for small-scale JupyterHub instances and a Kubernetes-based deployment for larger-scale deployments with a hundred or more users. Both options can be used with the most common cloud providers or as bare-metal installations.

One concern you may have about self-hosting JupyterHub is security: You will need to set up HTTPS and authentication (JupyterHub uses PAM authentication by default). JupyterHub supports integrating with any OAuth identity provider, such as GitHub, GitLab, and Google, and even supports LDAP and Active Directory authentication.

JupyterHub Pros:

You are not locked into any particular cloud vendor.
You can monitor costs closely and keep costs down by avoiding the extra costs associated with managed notebook services.
You have full control over configuration and can add any integrations or add-ons you want.

Cons:

You will need to do some low-level setup and maintenance, including security and authentication.
You will need to manage connections to external datasets.
You will need to set up integrations with version control systems for notebook sharing & collaboration.

Google Vertex AI Workbench

Google’s Jupyter offering is Vertex AI, a suite of machine learning functionality that includes feature stores, training pipelines, model registries, and endpoints, all available within the Google Cloud Platform (GCP). Vertex AI Workbench is the enterprise edition, which can be either user-managed or fully managed. The user-managed Vertex AI Workbench is a simple JupyterLab instance with a choice of kernels. The fully managed option includes extra functionality and integrations.

The fully managed Vertex AI Workbench option offers convenient, built-in integrations with Google Cloud Storage and Google Cloud BigQuery, and is easy to integrate with GitHub, all from within your JupyterLab environment. You can control compute on a per-notebook level and configure automated shutdown for idle instances. The Vertex AI Workbench notebook executor feature allows you to schedule notebook runs and save the output to Google Cloud Storage, where it can be shared.

The screenshot below shows a managed notebook instance. Notice the compute details drop-down at the top right, where you can modify the compute for your notebook. You can also see the Google Cloud Storage navigation panel on the left, where you can navigate around stored files as if they were local. On the navigation bar at the top of the notebook you can see a git option: if you have set up a GitHub repo, this button lets you run an nbdiff comparison against the current git HEAD. The Execute command allows you to set up an execution schedule for your notebook.

Access to Vertex AI Workbench is managed through GCP with authentication and authorization controls provided as part of the GCP platform. If you want multiple users to access the same Workbench JupyterLab instance, you can set up permissions with a service account that multiple users can be given access to. These users can view and edit the same running notebook.

The cost of using Vertex AI Workbench is the cost of the compute and storage resources your notebooks use, plus management fees. Management fees for the fully managed option are about 10x higher than the user-managed option, and Vertex AI Workbench only supports the more expensive on-demand compute options, not spot compute. However, when you create a Jupyter instance, the Vertex AI Workbench UI provides an estimate of the cost based on the parameters you choose.

Pros:

Authentication through GCP.
Lower maintenance overhead.
Scalable compute, including GPU options.
With the managed option, integration with GCP storage options and GitHub, and the ability to schedule Notebook runs.
Notebooks can be shared.
Cost estimate on creation.
Automated shutdown of notebooks.

Cons:

Management fees make it more expensive, with higher costs for the managed option.
Third-party JupyterLab extensions such as Jupytext are not supported.

Amazon SageMaker

SageMaker is Amazon Web Services’ machine learning product, which has a similar suite of machine learning functionality to Google’s Vertex AI. SageMaker has two Jupyter Notebook products:

SageMaker Notebook Instance, the more straightforward, cloud-based notebook service.
SageMaker Studio, a more sophisticated platform that extends JupyterLab.

SageMaker Studio comes with many plug-ins and extensions that allow for easy integration with the rest of the SageMaker suite. The SageMaker Pipelines extension, for example, provides a SageMaker Studio-exclusive UI that allows you to watch your pipeline running. Third-party Jupyter extensions such as Jupytext are included.

The screenshot below shows some of the plug-in options the SageMaker Studio console offers. Although SageMaker Studio doesn’t offer native notebook scheduling, you can set up scheduling manually by adding the third-party jupyter-scheduler extension.

Both SageMaker Notebook Instance and SageMaker Studio can be accessed via the AWS console, and handle authorization through the AWS IAM authentication mechanisms. SageMaker Studio also offers authentication via SSO, so that users can log in to SageMaker Studio without having to go through the AWS console.

Both SageMaker options include built-in GitHub integration and can display Git diffs for notebooks against the current Git HEAD. You can see the git toolbar option in the screenshot above.

SageMaker Studio offers scalable compute and the option of using different compute types for different notebooks. Data is stored in elastic file storage, where it can be used by multiple notebooks.

SageMaker Studio notebooks can be shared, but what is shared is a copy of the notebook, so subsequent changes won’t be reflected in the shared copy. This means you need a version control system such as GitHub for collaborative work.

Pros:

Authentication through AWS or SSO.
Lower maintenance overhead.
Scalable compute, including GPU options.
Integration with GitHub and built-in support for GitHub diffs.
Many plug-ins and integrations with both other components in the SageMaker suite and external Jupyter extensions.

Cons:

Only copies of notebooks can be shared.
Cost of SageMaker instances can be 20% to 40% more than an equivalent EC2 instance.
Customising your JupyterLab environment with add-ons & extensions can be tricky.

LLM Serving frameworks: LLMOps

suchismita sahu — Mon, 26 Aug 2024 08:59:06 GMT

The launch of GPT-3 and DALL-E ushered in the age of Generative AI and Large Language Models (LLMs). With 175 billion parameters and training on 45 TB of text data, GPT-3 had over 100x the 1.5 billion parameters of its predecessor, GPT-2. The next 18 months saw a cascade of innovation, with ever larger models, capped by the launch of ChatGPT at the tail end of 2022.

The basic workflow is as follows.

Generative AI therefore needs an operationalized workflow to accelerate adoption, which is where the term LLMOps comes into the picture.

Key Components of LLMOps

Model Fine-Tuning: Adapting pre-trained LLMs for specific tasks by fine-tuning on domain-specific data.
Infrastructure Management: Handling the extensive computational resources needed for deploying and running LLMs, often involving GPUs or TPUs.
Latency & Performance Optimization: Ensuring that LLMs respond within acceptable timeframes, especially when deployed in real-time applications.
Scalability: Deploying LLMs across distributed systems to handle large-scale inference workloads.
Security & Privacy: Managing risks related to the potential misuse of LLMs, ensuring data privacy, and protecting intellectual property.
Bias & Fairness: Monitoring LLMs for biased outputs and implementing strategies to mitigate these biases.
Ethical Considerations: Ensuring responsible AI practices are followed, especially considering the powerful capabilities of LLMs.
Inference Cost Management: Optimizing the costs associated with running large models, including infrastructure and energy consumption.

Considering all the above points for serving an LLM application, we need to evaluate multiple frameworks to find those that meet our business needs.

Here is a comparison of them:

Use vLLM when maximum speed is required for batched prompt delivery.
Opt for Text generation inference if you need native HuggingFace support and don’t plan to use multiple adapters for the core model.
Consider CTranslate2 if speed is important to you and if you plan to run inference on the CPU.
Choose OpenLLM if you want to connect adapters to the core model and utilize HuggingFace Agents, especially if you are not solely relying on PyTorch.
Consider Ray Serve for a stable pipeline and flexible deployment. It is best suited for more mature projects.
Utilize MLC LLM if you want to natively deploy LLMs on the client-side (edge computing), for instance, on Android or iPhone platforms.
Use DeepSpeed-MII if you already have experience with the DeepSpeed library and wish to continue using it for deploying LLMs.

Gen AI for different personas of Data Platform

suchismita sahu — Mon, 19 Aug 2024 15:17:36 GMT

Here, I will not describe what a Data Platform is. Instead, I will explain the different personas involved in building a Data Platform internal to an organisation, with their roles, responsibilities, and pain points, so that we can come up with Generative AI solutions for those challenges.

Product Manager
1. Role: Oversees the development and lifecycle of the data platform as a product.
2. Responsibility: Define product requirements, prioritize features, manage the product roadmap, and align the platform with business goals.
3. Use cases
  1. Domain model generation from various sources
  2. Auto documentation of different processes, data dictionary, business glossary, user manuals, training slides audio or text etc...
  3. Summarize lengthy legal documents or case files, helping legal professionals quickly grasp key points and make informed decisions
  4. Draft product strategy, roadmap, acceptance criteria, UAT test scenarios & cases, and release notes
  5. Figure out the product Risks and suggest risk mitigation strategies.
  6. Suggest 3rd party tools for assessments.
  7. Identifying trends in customer needs
  8. Creating new product designs, drafting technical documents, and personalised marketing
  9. Forecasting service trends

Data Steward
1. Role: Maintains the integrity and usability of data.
2. Responsibilities: Monitor data quality, manage metadata, and ensure that data is accessible and understandable to users across the organization.
3. Use cases
  1. Automated data validation & clean-up
  2. Data classification based on content, sensitivity and confidentiality
  3. Data lineage generation
Data Engineer
1. Role: Constructs and maintains the data infrastructure.
2. Responsibility: Develops and manages ETL (Extract, Transform, Load) processes, optimizes data storage, and ensures efficient data flow between systems.
3. Use cases
  1. automated schema generation
  2. automated data mapping & transformation
  3. suggest required configurations
  4. synthetic data generation
  5. data augmentation
  6. Code co-pilot
Business Users
1. Role: Natural language query answering
2. Responsibility:
3. Use cases
  1. Data discovery, answering business queries through natural language queries.
Data Architect
1. Role: Designs the overall data architecture, ensuring that the data platform supports current and future business needs.
2. Responsibility: Define data models, establish data governance policies, design data pipelines, and ensure data quality and security.
3. Use cases for Data Catalog
  1. Automated Asset Data Discovery
  2. Simplified Data Accessibility
  3. Quality Asset Data
  4. Scaled Data Intelligence
Data Governance Lead
1. Role: Establishes and enforces data governance policies.
2. Responsibility: Define data ownership, ensure data quality, manage data access controls, and enforce compliance with data regulations.
3. Use cases
  1. Create and enforce data governance policies by analyzing data usage patterns and regulatory requirements. It can automatically flag non-compliance issues and suggest corrective actions, ensuring the organization adheres to legal and ethical standards.
  2. Flag attempts to escalate privileges or gain unauthorized root access.
  3. Flag downloads of remote files containing malicious code or tools.
  4. Flag reverse shells being established to create backdoors for unauthorized access.
  5. Flag other actions in your system that mimic the behavior of an administrator and might otherwise go unnoticed.
QA Engineer
1. Role: Test the pipelines
2. Responsibility: Verify and certify the data platform
3. Use cases
  1. Create a variety of test datasets with different characteristics.
  2. Generate test scripts that simulate various contrived real-world conditions such as sporadic API operation or significantly late arriving data.
  3. Monitor the pipeline’s behavior, logging inputs, outputs, and intermediate results.
  4. Automatically compare the pipeline’s output to expected outcomes.
  5. Alert data engineers if any anomalies or deviations are detected.
Cloud Engineer
1. Role: Manages cloud infrastructure for the data platform.
2. Responsibility: Deploy and manage cloud resources, optimize cloud costs, and ensure the scalability and reliability of cloud services.
3. Use cases
  1. Analyze cloud resource usage and automatically recommend optimizations, such as resizing instances or shifting workloads to more cost-effective resources, ensuring efficient cloud infrastructure management.
Data Analyst
1. Role: Interprets data and provides actionable insights
2. Responsibility: Create reports and dashboards, perform exploratory data analysis (EDA), and support business decision-making with data-driven insights.
3. Use cases
  1. Automatically generate monthly performance reports based on sales data, providing managers with key insights and recommendations without manual analysis.
Data Scientist
1. Role: Analyzes data to extract insights and build predictive models.
2. Responsibility: Clean and preprocess data, build and validate machine learning models, and collaborate with data engineers to deploy models.
3. Use cases
  1. Suggest a suitable chart and generate the visualisation based on the kind of analysis (bivariate or multivariate) and the data types involved (e.g., numeric).
DBA
1. Role: Manages and maintains database systems.
2. Responsibility: Ensure database performance, implement backup and recovery strategies, and manage database security and user access.
3. Use cases
  1. Analyze query patterns and database usage to automatically suggest or apply optimizations, such as indexing strategies or query restructuring, leading to improved database performance.
DevOps Engineer
1. Role: Facilitates the continuous integration and continuous deployment (CI/CD) of the data platform
2. Responsibility: Automate deployment processes, manage infrastructure as code, and monitor the performance and availability of data services
3. Use cases
  1. Analyze historical deployment data and suggest optimizations for the CI/CD pipeline. Generative AI can predict potential deployment issues and recommend adjustments to configurations, improving deployment success rates and reducing downtime.
MLE
1. Role: Focuses on deploying and maintaining machine learning models within the data platform
2. Responsibility: Integrate ML models into production, monitor model performance, and manage model retraining and updates
3. Use cases
  1. hyperparameter tuning
  2. pipeline orchestration automation
  3. anomaly detection
UI/UX Engineer
1. Role: Designs user interfaces for interacting with the data platform.
2. Responsibility: Ensure that dashboards, reports, and data exploration tools are intuitive and user-friendly.
3. Use cases
  1. User interface prototyping: Generative AI can create UI prototypes based on user behavior data and design principles. It can generate multiple design options, allowing designers to quickly iterate and select the most user-friendly interfaces.
Security Officer
1. Role: Ensures the security and compliance of the data platform.
2. Responsibility: Implement data encryption, monitor for security breaches, and ensure compliance with regulations like GDPR.
3. Use cases
  1. threat scenario simulation

Generative AI for Product Manager persona in Data Platform

suchismita sahu — Mon, 19 Aug 2024 14:14:13 GMT

Why Generative AI

Reduce development effort by optimizing development processes, such as code generation, testing, and deployment, reducing the time required to bring a product to market.
Speed up time to market by automating many of the processes involved in a data pipeline, which we will discuss in this article.
Enhance customer experience by reviewing and summarising customer feedback, analysing sentiment, and personalising the product wherever necessary.
Increase efficiency by mitigating risks.
Increase revenue by making the product compliant with regulatory guidelines, continuously improving product quality, providing a competitive product with an adaptive pricing structure, and improving communication and collaboration.

The applications of Generative AI in Product Management are vast, and a huge number of use cases can be conceptualised for any one industry. In this article, however, we will discuss the use cases for the Data Platform Technical Product Manager persona, which is an internal-stakeholder-facing role.

Making data-driven decisions: analysing various data sources and pipelines helps product managers evaluate potential outcomes and choose the best course of action.
Document Generation
- Auto documentation of different processes, data dictionary, business glossary, user manuals, training slides audio or text etc.
- Draft product strategy, roadmap, acceptance criteria, UAT test scenarios & cases, and release notes.
- Creating new product designs, drafting technical documents.
Document Summarisation
- Summarize lengthy legal documents or case files, helping legal professionals quickly grasp key points and make informed decisions
Information Retrieval
- Figure out the product risks and suggest risk mitigation strategies.
- Suggest 3rd party tools for assessments in specific business scenarios.
- Identifying trends in internal stakeholders’ needs, which change based on the main business line product’s objectives, market trends, and end users’ needs.
- Forecasting service trends.

What are the different open source models for these tasks?

1. GPT-Neo (EleutherAI)

Use Cases:
- Document Summarization: Can summarize long texts by generating concise summaries.
- Text Generation: Used for generating creative writing, content creation, and automated text completion.
- Information Retrieval: Can answer questions based on provided documents.
Trained Dataset Size:
- About 800GB of diverse text data from the Pile dataset (e.g., books, websites, and Wikipedia).
Number of Parameters:
- Available in different sizes: 1.3 billion and 2.7 billion parameters.
Cost:
- Free to use as open source.
- Running costs depend on the cloud provider or local infrastructure. Running larger models requires more compute power, which can be costly.

2. GPT-J (EleutherAI)

Use Cases:
- Document Summarization: Efficiently summarizes texts into key points.
- Text Generation: Generates coherent and contextually relevant content.
- Information Retrieval: Can retrieve information and generate responses based on documents.
Trained Dataset Size:
- 800GB of diverse text data.
Number of Parameters:
- 6 billion parameters.
Cost:
- Free to use as open source.
- Similar to GPT-Neo, running costs depend on infrastructure and usage.

3. T5 (Text-to-Text Transfer Transformer by Google)

Use Cases:
- Document Summarization: Summarizes articles and papers.
- Text Generation: Generates a wide range of texts from inputs.
- Information Retrieval: Can be adapted for answering questions and retrieving information.
Trained Dataset Size:
- Trained on the Colossal Clean Crawled Corpus (C4) dataset, containing hundreds of gigabytes of clean text.
Number of Parameters:
- Various sizes: Small (60 million), Base (220 million), Large (770 million), 3B (3 billion), and 11B (11 billion) parameters.
Cost:
- Open-source model; usage cost is dependent on the infrastructure.
- Larger versions require more resources, making them more expensive to run.

4. BART (Facebook AI)

Use Cases:
- Document Summarization: Summarizes long texts effectively.
- Text Generation: Generates diverse and creative content.
- Information Retrieval: Useful for generating answers to document-based questions.
Trained Dataset Size:
- Trained on a mixture of text datasets including books, Wikipedia, and web crawls.
Number of Parameters:
- Available in Base (139 million) and Large (406 million) versions.
Cost:
- Open-source and free to use.
- Running costs are relatively lower compared to larger models due to fewer parameters.

5. DistilBERT (Hugging Face)

Use Cases:
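- Document Summarization: Can be fine-tuned for extractive summarization tasks.
- Text Generation: Not designed for generation, since it is an encoder-only model like BERT; better suited to understanding tasks, with less computational demand.
- Information Retrieval: Efficient for question answering and retrieving information from documents.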
Trained Dataset Size:
- Trained on the same dataset as BERT (BookCorpus + English Wikipedia).
Number of Parameters:
- 66 million parameters (distilled version of BERT).
Cost:
- Free to use as open-source.
- Lower computational costs due to fewer parameters.

6. BERT (Bidirectional Encoder Representations from Transformers by Google)

Use Cases:
- Document Summarization: Can be fine-tuned for summarization tasks.
- Text Generation: Less commonly used for generation, more for understanding tasks.
- Information Retrieval: Highly effective for question answering and retrieving information from documents.
Trained Dataset Size:
- Trained on BookCorpus and English Wikipedia.
Number of Parameters:
- Available in Base (110 million) and Large (340 million) configurations.
Cost:
- Open-source; cost depends on fine-tuning and usage scenarios.

7. GPT-2 (OpenAI)

Use Cases:
- Document Summarization: Can summarize texts.
- Text Generation: Used for content creation, conversational agents, and more.
- Information Retrieval: Can be adapted for retrieval and summarization tasks.
Trained Dataset Size:
- Trained on 40GB of Internet text.
Number of Parameters:
- Multiple versions: 124 million, 355 million, 774 million, and 1.5 billion parameters.
Cost:
- Free to use as open-source.
- Costs vary depending on model size and infrastructure.

Factors to be considered while choosing a model

Cost: Running these models typically involves cloud costs (compute, storage), and costs scale with the model size. Open-source models are free to use, but infrastructure costs need to be considered.
Model Size: Larger models generally provide better accuracy and capabilities but come with higher costs and resource demands.
Dataset Size: Bigger and more diverse datasets contribute to the model's generalization ability, but also require more compute power during training.
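To make the link between model size and resource demands concrete, here is a back-of-the-envelope sketch that estimates the memory needed just to hold each model's weights, using the parameter counts listed above. It assumes 4 bytes per parameter in fp32 and 2 bytes in fp16, and it ignores activations, optimizer state, and framework overhead, so real requirements are higher. The function and dictionary names are hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="fp16"):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as quoted in the model list above (largest variant each).
MODELS = {
    "DistilBERT": 66e6,
    "BERT Large": 340e6,
    "BART Large": 406e6,
    "GPT-2 (largest)": 1.5e9,
    "GPT-Neo (largest)": 2.7e9,
    "GPT-J": 6e9,
    "T5 11B": 11e9,
}

for name, params in MODELS.items():
    fp16 = weight_memory_gb(params, "fp16")
    fp32 = weight_memory_gb(params, "fp32")
    print(f"{name:18s} fp16 ~ {fp16:6.2f} GB   fp32 ~ {fp32:6.2f} GB")
```

By this estimate, GPT-J's 6 billion parameters need about 12 GB in fp16 for the weights alone, which already takes half of a 24 GB GPU before any inference or training overhead.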

Is it done?

No, the actual work starts now.

These models are trained on generic data, so they need further training on our product-specific data to produce the expected output for our use cases. This process is called fine-tuning: the model specializes in particular tasks or domains, transforming from a general language learner into a task-specific expert.

The following are the three major aspects of fine-tuning a foundation model

Transfer learning

Fine-tuning employs a strategy known as transfer learning. The model takes the understanding it gained during pre-training (e.g. learning grammar and syntax) and tailors it to the specific task at hand. This accelerates learning and makes the model more efficient in tackling new challenges.

Task-specific data

Imagine the LLM as a student studying for an exam. Fine-tuning involves providing the model with task-specific study material. For instance, if it's learning to categorize news articles, it's given a dataset of labeled articles. This targeted information equips the model with the domain expertise needed to excel in that task.

Gradient-based optimization

As the model processes task-specific data, it measures the difference between its predictions and the actual outcomes. This difference is the loss, and the gradient of the loss with respect to the model's parameters indicates how each parameter should change to reduce it. Optimization techniques then use this gradient information to iteratively adjust the model's parameters, which minimizes prediction errors and enhances the LLM's task-specific expertise.
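
To make the optimization loop concrete, here is a minimal sketch on a toy problem. It is not how an LLM is actually fine-tuned: it assumes a two-parameter linear model, a mean squared error loss and a hand-picked learning rate, and every name in it is illustrative. The mechanics are the same in spirit: measure the error, compute the gradient of the loss, and step the parameters against it.

```python
# Toy illustration of gradient-based optimization (not an actual LLM).
# The "model" is y = w * x + b; all names and numbers are illustrative.

def mse_loss(w, b, data):
    """Mean squared error between predictions and actual outcomes."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradients(w, b, data):
    """Partial derivatives of the MSE loss with respect to w and b."""
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return grad_w, grad_b

def fine_tune(w, b, data, learning_rate=0.05, steps=2000):
    """Repeatedly move the parameters a small step against the gradient."""
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b, data)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# A "pre-trained" starting point and a small task-specific dataset (y = 3x + 1).
task_data = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]
w0, b0 = 1.0, 0.0
w, b = fine_tune(w0, b0, task_data)

print("loss before:", mse_loss(w0, b0, task_data))
print("loss after: ", mse_loss(w, b, task_data))
```

Starting from the "pre-trained" values rather than from scratch is the toy counterpart of transfer learning: the optimizer only has to close the gap between what the model already knows and what the task needs.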

Once these models are trained, they should be deployed to production so that they can serve the use cases and meet their objectives.

So, what does the architecture of the complete pipeline look like, from training to production?

At a very high level, the workflow can be divided into three stages:

Data preprocessing / embedding: This stage involves storing private data (data pipeline related documents, in our example) to be retrieved later. Typically, the documents are broken into chunks, passed through an embedding model, then stored in a specialized database called a vector database.
Prompt construction / retrieval: When a user submits a query, the application constructs a series of prompts to submit to the language model. A compiled prompt typically combines a prompt template hard-coded by the developer; examples of valid outputs called few-shot examples; any necessary information retrieved from external APIs; and a set of relevant documents retrieved from the vector database.
Prompt execution / inference: Once the prompts have been compiled, they are submitted to a pre-trained LLM for inference—including both proprietary model APIs and open-source or self-trained models. Some developers also add operational systems like logging, caching, and validation at this stage.
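
The three stages above can be sketched end to end in a few lines. This is a toy under stated assumptions, not a production design: the `embed` function is a bag-of-words stand-in for a real embedding model, `InMemoryVectorStore` is a hypothetical replacement for a vector database, and the final LLM call is left out entirely.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=12):
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # list of (embedding, chunk_text)

    def add(self, text):
        self.rows.append((embed(text), text))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]

PROMPT_TEMPLATE = (
    "Answer using only the context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

def build_prompt(store, question, k=2):
    """Combine the hard-coded template with the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in store.search(question, k))
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Stage 1: data preprocessing / embedding
store = InMemoryVectorStore()
docs = [
    "The nightly orders pipeline loads data from Kafka into the warehouse.",
    "The billing pipeline exports invoices to object storage every hour.",
]
for doc in docs:
    for piece in chunk(doc):
        store.add(piece)

# Stage 2: prompt construction / retrieval
prompt = build_prompt(store, "Where does the orders pipeline load data from?", k=1)

# Stage 3: prompt execution / inference would send `prompt` to an LLM.
print(prompt)
```

Swapping the toy pieces for real ones (an embedding model, a vector database, a model API) changes the implementations but not the shape of the flow.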

Image Source: MS Azure

Contextual data for LLM apps includes text documents, PDFs, and even structured formats like CSV or SQL tables. Data-loading and transformation solutions for this data vary widely across developers we spoke with. Most use traditional ETL tools like Databricks or Airflow. Some also use document loaders built into orchestration frameworks like LangChain (powered by Unstructured) and LlamaIndex (powered by Llama Hub).

For embeddings, most developers use the OpenAI API; models provided by Hugging Face are also used.

The most important piece of the preprocessing pipeline, from a systems standpoint, is the vector database. It’s responsible for efficiently storing, comparing, and retrieving up to billions of embeddings (i.e., vectors). The most common choice we’ve seen in the market is Pinecone. Some also use ChromaDB or pgvector, a PostgreSQL extension.

Prompt construction and retrieval is where orchestration frameworks like LangChain and LlamaIndex shine. They abstract away many of the details of prompt chaining; interfacing with external APIs (including determining when an API call is needed); retrieving contextual data from vector databases; and maintaining memory across multiple LLM calls. They also provide templates for many of the common applications mentioned above. Their output is a prompt, or series of prompts, to submit to a language model. These frameworks are widely used among hobbyists and startups looking to get an app off the ground, with LangChain the leader.

Today, OpenAI is the leader among language models. Nearly every developer we spoke with starts new LLM apps using the OpenAI API, usually with the gpt-4 or gpt-4-32k model. This gives a best-case scenario for app performance and is easy to use, in that it operates on a wide range of input domains and usually requires no fine-tuning or self-hosting.

When projects go into production and start to scale, a broader set of options come into play. Some of the common ones we heard include:

Switching to gpt-3.5-turbo: It’s ~50x cheaper and significantly faster than GPT-4. Many apps don’t need GPT-4-level accuracy, but do require low latency inference and cost effective support for free users.
Experimenting with other proprietary vendors (especially Anthropic’s Claude models): Claude offers fast inference, GPT-3.5-level accuracy, more customization options for large customers, and up to a 100k context window (though we’ve found accuracy degrades with the length of input).

Open-source models trail proprietary offerings right now, but the gap is starting to close. The LLaMa models from Meta set a new bar for open source accuracy and kicked off a flurry of variants. Since LLaMa was licensed for research use only, a number of new providers have stepped in to train alternative base models (e.g., Together, Mosaic, Falcon, Mistral).

Caching is relatively common—usually based on Redis—because it improves application response times and cost. Tools like Weights & Biases and MLflow (ported from traditional machine learning) or PromptLayer and Helicone (purpose-built for LLMs) are also fairly widely used. They can log, track, and evaluate LLM outputs, usually for the purpose of improving prompt construction, tuning pipelines, or selecting models. There are also a number of new tools being developed to validate LLM outputs (e.g., Guardrails) or detect prompt injection attacks (e.g., Rebuff). Most of these operational tools encourage use of their own Python clients to make LLM calls, so it will be interesting to see how these solutions coexist over time.
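
As a rough illustration of the caching idea, the sketch below keys each response on a hash of the model name and prompt, so a repeated request never reaches the model. It deliberately uses a plain dictionary instead of Redis to stay self-contained; `PromptCache` and `fake_llm` are hypothetical names, and a real setup would also need expiry and an eviction policy.

```python
import hashlib
import json

class PromptCache:
    """Dict-backed stand-in for a Redis-style cache of LLM responses."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Hash model + prompt so keys have a fixed size whatever the prompt length.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call_llm):
        """Return the cached response, or call the model and cache the result."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_llm(model, prompt)
        self._store[key] = response
        return response

def fake_llm(model, prompt):
    """Placeholder for a real model call."""
    return f"[{model}] echo: {prompt}"

cache = PromptCache()
first = cache.get_or_call("demo-model", "What is a vector database?", fake_llm)
second = cache.get_or_call("demo-model", "What is a vector database?", fake_llm)
print(cache.hits, cache.misses)
```

Exact-match caching like this only helps when users repeat the same prompt; it saves nothing on paraphrases.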

Model Evaluation Metrics

Although models and pipelines can be evaluated along many dimensions, here we cover only those relevant to the use cases under discussion. LLM evaluation can be broadly categorized into these dimensions:

Alignment Metrics: Evaluate how well the model aligns with human preferences in the given use-case, in aspects such as fairness, robustness, and privacy.

Perplexity: Measures how well the LLM predicts a sample of text. Lower perplexity values indicate better performance.
Human Evaluation: Involves human evaluators assessing the quality of the model's output based on criteria such as relevance, fluency, coherence, and overall quality.
BLEU (Bilingual Evaluation Understudy): Compares the LLM-generated output with a reference answer by measuring n-gram overlap. Higher BLEU scores signify better performance.
Diversity: Measures the variety and uniqueness of generated LLM responses, including metrics like n-gram diversity or semantic similarity. Higher diversity scores indicate more diverse and unique outputs.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Evaluates the quality of LLM-generated text by comparing it with a reference text. It assesses how well the generated text captures the key information present in the reference, and reports precision, recall, and F1-score to describe the overlap between the two.
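
Two of these metrics are simple enough to compute by hand, which helps build intuition. The sketch below is a simplified illustration, not a replacement for an evaluation library: perplexity is computed from a made-up list of per-token probabilities, and ROUGE-1 uses plain whitespace tokenization with no stemming.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 (whitespace tokenization)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# A model that gives each correct token probability 0.5 is less "surprised"
# than one that gives it 0.1, so its perplexity is lower.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
unsure = perplexity([0.1, 0.1, 0.1, 0.1])

p, r, f1 = rouge_1("the pipeline failed at midnight", "the pipeline failed")
print(confident, unsure, p, r, f1)
```

In the ROUGE example the candidate contains every reference word, so recall is perfect, but two of its five words are extra, which pulls precision down.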

Model Deployment

Since these are open-source models, we can register them with a model registry such as MLflow, installed inside Minikube, and then initiate a continuous deployment pipeline through vLLM and Kubernetes. I will cover vLLM in a different post.

Data Platform

suchismita sahu — Wed, 14 Aug 2024 16:31:50 GMT

Let's build a Data Platform.

What is a Data Platform

It is an ecosystem in the data stack, built on network effects between publishers and consumers, that provides an improved developer experience, a sustainable marketplace and a business model, thereby increasing the organization’s revenue. (Please refer to the article for this terminology.)

So, a data platform is not a data storage layer; it is a centralised metadata storage layer where the required data governance, access control and security can be provided and maintained. This can be achieved through a Data Catalog.

Objectives

1. Centralized Data Management

Create a unified platform that centralizes data from various sources across the organization, which facilitates better data governance, improves data accessibility, and reduces data silos. Mention the number of data sources with types of data.

2. Scalability

Design the platform to scale with the growing volume, variety, and velocity of data, which ensures that the platform can handle increased data loads and support future data initiatives without performance degradation. Mention the data volume, latency and throughput of data availability for data products.

3. Data Quality and Consistency

Implement mechanisms to ensure the accuracy, completeness, and consistency of data across the platform, which improves decision-making by providing reliable data and reduces the risk of errors in analysis. Mention the percentage of data accuracy and completeness required to build high-quality data products from this data.
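
As a rough illustration of how a completeness target could be checked, the sketch below measures the share of required fields that are actually filled in. The records, the field names and the 95% threshold are all made up for the example; the right definition and target depend on your data products.

```python
def completeness(records, required_fields):
    """Share of required field values that are present (not None or empty)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total

# Hypothetical sample of order records with two missing values.
orders = [
    {"order_id": 1, "customer_id": "c-1", "amount": 20.0},
    {"order_id": 2, "customer_id": None, "amount": 35.5},
    {"order_id": 3, "customer_id": "c-3", "amount": None},
    {"order_id": 4, "customer_id": "c-4", "amount": 12.0},
]

score = completeness(orders, ["order_id", "customer_id", "amount"])
meets_target = score >= 0.95  # 0.95 is an illustrative threshold
print(f"completeness: {score:.1%}, meets target: {meets_target}")
```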

4. Real-time Data Processing

Enable the platform to process and analyze data in real time or near real time, which supports timely decision-making and allows for immediate insights, crucial for applications like monitoring and alerting. Mention the use cases that need real-time and near-real-time data access.

5. Interoperability

Ensure the platform can integrate seamlessly with various tools, technologies, and systems used within the organization, which provides flexibility in adopting new technologies and integrating with existing systems, enhancing the overall data ecosystem. Prepare the data architecture and mention different technologies, third party tools, cloud infra to support a robust and scalable data platform.

6. Data Security and Compliance

Implement robust security measures and ensure compliance with relevant regulations and standards, which protects sensitive data from unauthorized access and ensures the platform meets legal and regulatory requirements. Mention different standards such as HIPAA, GDPR as per your industry.

7. Self-service Analytics

Empower users across the organization to access, analyze, and visualize data without requiring extensive technical expertise, which increases data-driven decision-making across departments and reduces the burden on IT teams. Mention how many data teams or products and growth rate of these on an annual basis.

8. Cost-efficiency

Optimize the platform’s architecture and operations to minimize costs while maximizing performance and capabilities, which ensures that the data platform is sustainable and delivers value within budget constraints. Mention target cost savings on infrastructure.

9. Support for Advanced Analytics and AI/ML

Provide the necessary infrastructure and tools to support advanced analytics, machine learning, and AI applications, which enables the organization to leverage data for predictive analytics, automation, and other AI-driven initiatives. Mention different types of data products to be supported and their needs.

10. Data Governance and Compliance

Implement policies, procedures, and technologies that ensure proper data management, usage, and compliance, which maintains data integrity, ensures compliance with regulations, and aligns with corporate governance policies.

11. Enhancing Customer Experience

Use the data platform to gather insights that improve customer interactions and satisfaction, which leads to better customer retention, personalized services, and a stronger competitive edge.

12. Operational Efficiency

Streamline data operations and reduce the time and effort required to manage and analyze data, which increases productivity, reduces operational costs, and speeds up time-to-insight.

Centralized vs Decentralized vs Domain Driven (Data Mesh) Data Platform

Centralized Data: Consolidation for Efficiency

Centralized data refers to the practice of storing and managing all data in a single, central repository. Here, data is collected from various sources and consolidated into one system, commonly referred to as a data warehouse. Let’s delve into the advantages and challenges associated with this approach.

Advantages

1. Efficient data management:

Centralizing data allows for streamlined data management processes. With a single data repository, businesses can easily organize, update, and maintain data integrity.

2. Improved data analysis:

A central data repository facilitates comprehensive data analysis, enabling businesses to derive meaningful insights and make data-driven decisions more efficiently.

3. Enhanced security:

Centralized data often benefits from robust security measures. Implementing stringent access controls and encryption mechanisms becomes more manageable, reducing the risk of unauthorized data breaches.

Challenges of Centralized Data

1. Data silos:

While centralization aims to consolidate data, it can inadvertently lead to the creation of data silos. Different departments or teams within an organization might hoard data, hindering cross-functional collaboration and diminishing the potential for holistic insights.

2. Single point of failure:

Relying solely on a central data repository introduces a single point of failure. If the centralized system encounters issues, such as technical glitches or cyber-attacks, it can significantly disrupt operations and potentially compromise the entire dataset.

3. Privacy concerns:

Centralized data raises privacy concerns, especially when dealing with sensitive user or customer information. Organizations must implement robust privacy protocols to ensure compliance with data protection regulations and maintain the trust of their users.

Decentralized Data: Empowering Autonomy

Decentralized data, on the other hand, promotes the distribution of data across multiple locations or systems. Rather than relying on a single central repository, data is stored in diverse nodes, often interconnected via a network. Let’s explore the advantages and challenges associated with this approach.

Advantages

1. Enhanced data ownership:

Decentralization empowers individuals or departments within an organization to own and manage their data. This autonomy fosters innovation, as it allows teams to tailor their data management practices to their specific needs.

2. Improved scalability:

Decentralized systems are inherently scalable, as data can be distributed across multiple nodes. This flexibility enables businesses to expand their operations without facing the limitations of a centralized infrastructure.

3. Resilience and fault tolerance:

Decentralized data architecture provides resilience against system failures. Even if one node encounters issues, other nodes can continue to function independently, ensuring business continuity and data availability.

Challenges of Decentralized Data

1. Data consistency:

Maintaining data consistency across multiple decentralized nodes can be challenging. Synchronization and version control mechanisms must be in place to ensure that data remains accurate and up-to-date across the network.

2. Complex data integration:

Integrating data from multiple decentralized sources can be complex and time-consuming. Data interoperability and compatibility become critical considerations to ensure seamless data exchange between different nodes.

3. Increased security risks:

With data dispersed across multiple nodes, securing decentralized data becomes more intricate. Each node must be adequately protected to prevent unauthorized access or tampering. Robust encryption, access controls, and authentication mechanisms are essential to mitigate security risks effectively.

Data Mesh proposes a paradigm shift by advocating for a domain-oriented decentralized approach to data management. Instead of relying on a central data team, Data Mesh advocates for data ownership and governance distributed across different domains or business units within an organization.

In a Data Mesh architecture, each domain or business unit becomes responsible for its data products, including data collection, storage, processing, and analysis. This approach promotes autonomy, scalability, and agility by allowing teams closest to the data to make decisions and derive value from it. Data Mesh emphasizes the importance of clear data product ownership, well-defined APIs, and data quality monitoring to ensure the reliability and usability of the data products across the organization.
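
One way to picture clear ownership, a well-defined API and quality monitoring is as a small descriptor that each domain publishes for its data product. The sketch below is purely illustrative: `DataProduct`, `MeshRegistry`, the field names and the endpoint path are hypothetical and do not come from any Data Mesh standard or tool.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes for a data product."""
    name: str
    domain: str
    owner: str      # accountable team or person
    endpoint: str   # the well-defined access API
    schema: dict    # column name -> type
    quality_checks: dict = field(default_factory=dict)  # check name -> passed?

    def is_publishable(self):
        """Needs an owner and an endpoint, and no failing quality check."""
        return bool(self.owner and self.endpoint) and all(self.quality_checks.values())

class MeshRegistry:
    """Minimal index so that other domains can discover published products."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        if not product.is_publishable():
            raise ValueError(f"{product.name} is not ready to publish")
        self._products[product.name] = product

    def by_domain(self, domain):
        return [p.name for p in self._products.values() if p.domain == domain]

registry = MeshRegistry()
registry.publish(
    DataProduct(
        name="orders_daily",
        domain="sales",
        owner="sales-data-team",
        endpoint="/data-products/orders_daily",
        schema={"order_id": "int", "amount": "float"},
        quality_checks={"freshness": True, "no_null_ids": True},
    )
)
print(registry.by_domain("sales"))
```

The point of the design is that the registry never owns the data; it only refuses to list products that lack an owner or fail their own checks.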

Data Mesh recognizes the complexity and diversity of data in modern organizations and acknowledges that a centralized or purely decentralized approach may not effectively address these challenges. By embracing the principles of Data Mesh, organizations can foster a culture of data collaboration, where teams work together to build and leverage data products that align with their specific domain expertise.

It is worth noting that implementing a Data Mesh architecture requires careful planning, coordination, and a shift in organizational mindset. However, for organizations seeking a more distributed and flexible approach to data management, exploring the principles and practices of Data Mesh can offer new insights and opportunities.

Key Principles of Data Mesh Architecture

Data Mesh is an organizational approach to managing distributed data architecture. It advocates for domain-oriented decentralized data ownership and architecture, treating data as a product and applying principles of product thinking to data management.
Key Characteristics:
- Domain-Oriented Teams: Data mesh aligns with the principles of domain-driven design (DDD), using concepts such as bounded contexts, ubiquitous language, aggregates, entities, and value objects to model and organize data around business domains.
- Federated Data Ownership: Data mesh advocates a distributed, federated data architecture in which data products are treated as first-class citizens and are discoverable, accessible, interoperable, and reusable across domains, teams, and organizational boundaries.
- Self-serve Data Infrastructure: It encourages standardized data products, APIs, and interfaces that encapsulate data capabilities and services, enabling seamless integration with and consumption of data assets.
- Data as a Product: Data itself is treated as a product, for example offered in a marketplace where a third-party vendor can use it to train ML models.
Use Cases:
- Decentralized data ownership
- Cross-functional collaboration
- Scalable and agile data architecture

Personas

Business Users
Data Scientists/Analysts/Machine Learning Engineers
Security and Compliance Officers

Evaluation Metrics

Data Ownership

Scale out data sharing and value generation from data in step with the organization’s growth

Increased number of domains that provide analytical data
Increased number of domains that consume analytical data
Increased peer-to-peer data sharing
Data business truthfulness: increased alignment between dev, business and operations

Data as a Product

Increase the efficiency and effectiveness of data sharing within and across the organisation’s domains

Increased usage
Growth of active users
User satisfaction
- User conversion rate from search & discovery to read & use.
Usability
Quality & Security
- Data availability
- Data risk
- Change fail ratio
User Confidence & Trust
- Timeliness, completeness, integrity standards compliance
Interoperability
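
Two of these metrics reduce to simple ratios once the underlying events are logged. The sketch below assumes a hypothetical event log with `search` and `use` actions and a list of change records with a `succeeded` flag; real platforms will have their own schemas.

```python
def conversion_rate(events):
    """Share of users who searched for a data product and then read or used it."""
    searched = {e["user"] for e in events if e["action"] == "search"}
    used = {e["user"] for e in events if e["action"] == "use"}
    return len(searched & used) / len(searched) if searched else 0.0

def change_fail_ratio(changes):
    """Share of data product changes that failed."""
    failed = sum(1 for c in changes if not c["succeeded"])
    return failed / len(changes) if changes else 0.0

# Hypothetical logs: four users searched, two of them went on to use the data.
events = [
    {"user": "ana", "action": "search"},
    {"user": "ana", "action": "use"},
    {"user": "raj", "action": "search"},
    {"user": "li", "action": "search"},
    {"user": "li", "action": "use"},
    {"user": "sam", "action": "search"},
]
changes = [{"succeeded": True}] * 9 + [{"succeeded": False}]

print(conversion_rate(events), change_fail_ratio(changes))
```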

Self Serve
Increase domain autonomy with lower cognitive load and lower cost of data ownership

Increase domain autonomy with self-serve
- Coverage of automated tasks
- Platform users net promoter score
- Backlog and release dependencies from domain teams to platform teams
Increase services coverage
- Rate of platform product services usage
- Number of active users in the platform and per platform service
Abstract complexity
- Cost of data product life cycle management
- Change fail ratio of data products
- Number of data products using the platform
- Lead time to build, test, deploy and use data products

Federated Computational Governance
Generate higher-order intelligence securely and consistently, in step with organisational growth

Active engagement of domains in global governance operation
- Domains and data product owners who are active members in global federated governance
- Rate of new global policies established and adopted by domains.
Mesh wide interoperability, reliability and consistency
- Ratio of data products implementing latest versions of policies
Reduce governance friction through automation
- Lead time to detect and resolve new data policy breaches
- Number of active users of data products complying with policies.

Data Catalog

A Data Governance journey involves three major components: people, processes and technologies. Some companies choose to launch an enterprise program and start with people (e.g. organisational structures, ownership, etc.) and processes (e.g. policies, standard operating procedures, etc.); others create a small, enthusiastic data management group and start a data democratisation initiative that promotes offensive Data Governance in a practical way, through a Data Catalog implementation. Each of these styles has its own challenges, advantages and disadvantages.

Roughly, there are four main categories of Data Catalogs:

Stand-alone solutions offer key and additional data cataloging components within a single tool. Commercial and open source offerings are available and examples include Alation, Atlan, data.world, Zeenea, Amundsen, DataHub.
Platform solutions offer key data cataloguing functions with modules providing additional capabilities like Data Quality, Data Privacy, some even MDM. Examples include Ataccama, Collibra, IBM, Informatica, Precisely, Talend.
Cloud native Data Catalogs which provide key components mostly limited within the cloud service provider environment. Use cases such as orchestration and ETL-processes are the main focus. Examples include AWS Glue, Azure Purview, Google Data Catalog (part of Dataplex).
Tool-specific Data Catalogs (add-ons), which support a specific tool, for example within the area of business intelligence, by providing key components as well as purpose-related additional cataloguing features. A good example would be Tableau Catalog.

Looking at the last two categories, Databricks Unity Catalog, which is gaining traction at the speed of light, is an interesting case: it could initially be considered a tool-specific catalog, but with all the latest developments it is now closer to the cloud native ones or even the stand-alone ones.

Data Catalog maturity levels

This is an indicative way of dividing catalogs into maturity levels, and the borders can be blurred. However, in practice these four main levels have been observed.

L1: Technical metadata hub. It is a metadata registry for data available in the data platform, with ad-hoc curation based on crowdsourcing by advanced users. It mostly performs metadata ingestion from various on-prem and cloud data sources, with ad-hoc data modelling, and is used by advanced users (e.g. data analysts) to find data for building advanced analytics applications. Sometimes it can be a good start for enabling data democratisation, especially in agile environments following the “from chaos to structure” implementation approach, which carries certain risks (see below).

L2: Curated data inventory. It is a curated data registry with foundational governance capabilities, data classification and user collaboration. Metadata can be fetched from various places including other data catalogs (e.g. cloud native). Integration with communication systems (e.g. Slack) is possible via API and plays a key role in data curation. Since data becomes more structured, data development can leverage that for data search and understanding context. Data Lineage becomes more important and should be provided up to the level of analytics applications.

L3: Data Governance Platform. It is a catalog integrated with Data Governance processes, where tasks are automated and the catalog becomes a single point for data onboarding, assessment and metrics collection. Data Governance brings several new requirements such as Data Quality, Data Classification and executing workflows. These features can either belong to the catalog itself or be provided by 3rd party tools via API integration. Since data is curated and governed, it can be used in business applications consumed by business users.

L4: Enterprise Data Marketplace. It is a single point of data discovery and access in the enterprise for all categories of data users. A Data Marketplace can be either internal only or span multiple external data consumers and providers, so API integration with external systems is required.

Moving from one level to another might require additional capabilities to enable growth and sustainable adoption. Let’s look into core and additional data catalog capabilities and define what is necessary for each level.

Data Catalog capabilities

Data Management capabilities provided by a Data Catalog can be divided into these major categories, each containing capabilities which might be required at different levels of maturity.

Data Inventory (L1+) lets users register data sources and organise and describe data by ingesting and curating business, technical and operational metadata. This capability includes Data source connectivity, Data sampling, Business Glossary, Data Dictionary, Metadata Management and Data Lineage.
Data Assessment (L1+) performs the evaluation of data with fitness for use, which includes Data profiling, measuring data risk via classification, PII detection and tracking of data usage to understand how popular datasets are or perform audits. Data Quality assessment also falls into this capability though is likely to be either provided by an additional module of a platform type catalog (e.g. Collibra, Informatica) or sourced from a 3rd party tool via API integration. Either way it is critical to have Data Quality information in the Data Catalog to complete fitness for use assessment.
Data Discovery (L1+) enables users to locate the data asset they need via Google-like search, exploration and recommendations. This capability is key to the success of Data Catalog adoption and sustainable growth of the user community. It is important to highlight that some Data Catalog solutions separate this capability into a Marketplace add-on that not only combines external and internal datasets, but also offers an online shop experience with the option of requesting access via a shopping cart.
Data Governance (L3+) enables data curation activities via defining roles and responsibilities, rules (fullness of asset curation), policies (e.g. data retention or archiving), tasks automation and standardisation via workflows (e.g. change asset metadata or request access to a dataset) and manual or automated tagging including sensitive data definition.
Data Collaboration (L2+) enables communication and metadata crowdsourcing via tagging, rating, reviewing, sharing and texting. This is a key capability to facilitate data curation. Combined with a reasonable amount of non-invasive governance, it can boost tool adoption and metadata quality.
AI automation and assistance (L2+) facilitates data curation by supporting users and taking over manual tasks, enabling data catalogs to scale. Most of the capabilities potentially can be supported by AI functions to a certain extent, e.g. in the area of data ingestion, data labelling, classification and search.
Adoption tracking and Audit (L3+) makes it possible to monitor and measure data catalog performance, analyse user behavior to track changes, and log user activity to analyse tool adoption progress. Some solutions have embedded and customisable dashboards to make this task a pleasant experience.
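
To give a feel for what the Data Assessment capability does under the hood, here is a minimal sketch of column profiling and PII detection on a sample of values. It is an illustration only: the two regular expressions are deliberately simplistic, the 80% match threshold is an arbitrary assumption, and commercial catalogs use far richer classifiers.

```python
import re

# Simplistic, illustrative patterns; real PII detection is much more thorough.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{7,14}$"),
}

def profile_column(values):
    """Basic profile: row count, share of nulls and distinct non-null values."""
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "null_share": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
    }

def detect_pii(values, threshold=0.8):
    """Return the PII type matched by at least `threshold` of non-null values."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return None
    for label, pattern in PII_PATTERNS.items():
        matches = sum(1 for v in non_null if pattern.match(v))
        if matches / len(non_null) >= threshold:
            return label
    return None

def assess(table):
    """Profile every column of a sampled table and flag likely PII."""
    return {col: {**profile_column(vals), "pii": detect_pii(vals)} for col, vals in table.items()}

# Hypothetical data sample pulled from a registered source.
sample = {
    "contact": ["ana@example.com", "raj@example.org", None, "li@example.net"],
    "country": ["IN", "IN", "DE", None],
}
report = assess(sample)
print(report)
```

Running the assessment on a sample rather than the full dataset is what keeps this kind of check cheap enough to run on every ingested source.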

The maturity indication above is not strict, and some features might be relevant to different levels. What is important to understand is that maturity level growth means scaling up, with growth of the user community and of curation demand, which in turn will require more automation and AI augmentation.

MVP for Data Catalog

As mentioned above, a Data Catalog can be implemented at different stages of a Data Governance program and play various roles. These are the three approaches observed in practice, each with its own advantages and risks.

Iterative governed approach based on data sources/data domains with planned governance enhancements starts with the awareness creation plan, prioritised data domains, key roles available from the start. It enables fast and safe business user onboarding thus maximising business value.

What to consider:

High upfront planning and alignment efforts
Minimum viable training should be provided to key roles
Data Catalog tool should be carefully selected based on detailed requirements
Limited collaboration at the start and more centralised control

When it might not work:

Agile end-user community of advanced data professionals might not need upfront highly governed data catalog and can do curation via crowdsourcing and organic stewardship efforts
Open-source or cloud data catalog with limited capabilities and unfriendly UI

From chaos to structure aims to bring all the metadata in, let users collaborate on curation, and let data governance evolve gradually. An agile end-user community of advanced data professionals doesn’t need an upfront, highly governed data catalog and can do curation via crowdsourcing and organic stewardship efforts. Bringing all metadata in at once can help reveal duplicate datasets and provide a comprehensive picture of the initial Data Quality state via profiling.

What to consider:

Training should be provided to all advanced catalog users
Data Catalog tool should be carefully selected based on detailed requirements
License/Usage costs should be carefully considered, as some data catalog solutions charge by the number of datasets profiled and the volume of metadata loaded

When it might not work:

Open-source or cloud data catalog with limited collaboration, profiling and sharing capabilities
Highly regulated data environment with sensitive data
Governance-first approach to data management

Mixed approach, with different parts of the catalog each following their own approach and view permissions applied to restrict access. This fits user communities with mixed skill levels and prioritised data domains. It is possible to start adding business value immediately for some of the domains and grow other domains organically via crowdsourced curation. Some key roles should be available from the start and others emerge organically. Advanced users are not limited to highly curated datasets.

What to consider:

High user access security set-up effort
Minimum viable training should be provided to all catalog users
Data Catalog tool should be carefully selected based on detailed requirements (especially security)
Highly depends on DG operating model type (centralised vs federated)

When it might not work:

Open-source or cloud data catalog with limited security capabilities
Centralised DG Operating model with limited representation within data domains

Which approach to take depends on multiple things, including but not limited to Data Governance strategy, business goals, company culture, DataOps practices and user community.

Most likely, whichever approach is chosen, the following high-level steps should be taken to enable a successful data catalog implementation and adoption:

Assess your needs and goals to map them to Data Catalog capabilities and create an efficient enablement plan
Review your data processes and tech landscape to define required integrations and customisations
Review your Data Governance model or create one to enable Data Catalog adoption and operational efficiency
Create a thorough implementation plan including an MVP phase and ensure smooth execution to streamline value generation

Before starting the MVP, take some time to prepare and think through the following aspects of the future solution:

What would be the initial Critical Data Elements, data domains and data sources?
Who will be your data domain champions and data stewards? Can these key people allocate time to support the initiative?
What level of Data Catalog are you planning to build during the MVP?
What key Data Catalog capabilities would you like to start with?

5 Terms a Platform Product Manager must know!

suchismita sahu — Wed, 31 Jul 2024 15:32:51 GMT

A platform is a foundation that allows others to build upon it, extending and customizing its functionality to meet diverse customer needs. But there's more to platforms than just their expandable nature.

The platform space is complex, with many angles that need to be considered in order to create successful customer solutions on top. You need to understand and master:

Network Effect
Ecosystem
Developer Experience
App Marketplace
A Platform Business Model

Network Effect

Definition: Network effect refers to the phenomenon where the value of a product or service increases as more people use it. In a platform context, this means that as more users join and interact with the platform, it becomes more valuable for everyone involved.

Explanation:

Direct Network Effects: These occur when the value of the platform increases directly with the number of users. For example, on social media platforms, the more users there are, the more connections can be made, enhancing the overall user experience.
Indirect Network Effects: These occur when the value of the platform increases due to the growth of complementary products or services. For example, a smartphone operating system becomes more valuable as more apps are developed for it.
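
To make the direct network effect concrete, the sketch below counts how many distinct user-to-user connections are possible as a platform grows. Using the number of possible pairs as a stand-in for platform value is a simplifying assumption (a Metcalfe-style approximation), not something the definitions above prescribe, and the function name is hypothetical.

```python
def possible_connections(users: int) -> int:
    """Number of distinct pairs that can be formed among `users` people."""
    return users * (users - 1) // 2


# Each tenfold increase in users yields roughly a hundredfold
# increase in possible connections.
for n in (10, 100, 1000):
    print(f"{n} users -> {possible_connections(n)} possible connections")
```

The count grows much faster than the user base itself, which is one way to see why each new user can make the platform more valuable for everyone already on it.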