Amazon Bedrock Agent Samples: DevOps Multi-Agent Collaboration - [Content Improvement] #25

Open
robbieowens15 opened this issue Jan 30, 2025 · 0 comments

I really like the promise of this demo. It is an excellent idea for an agentic solution that adds value for DevOps engineers by saving them time when diagnosing the cause of poorly performing systems. If revised, I can see this demo being a valuable demonstration of the art of the possible with multi-agent collaboration for many AWS customers.

Describe the improvement request

Please improve the setup instructions for examples/multi-agent-collaboration/devops_agent, specifically for Sub Agent 1 - Grafana Assistant.

  • The current example assumes, but does not explicitly state, that the user already has preconfigured Grafana alerts.
  • Add setup instructions for creating a sample Grafana project (see Additional context for what I attempted).

Please document how to create and configure a) a Grafana instance, b) the example telemetry data assumed by this demo, and c) the Grafana alert configurations.

What are your suggestions?

Provide steps or reference materials on how to create an example Grafana project for those who wish to try this example but do not have an existing Grafana project to integrate with.

  • Focus on the simplest possible set of instructions to create a working example that can demonstrate the art of the possible.

Describe alternatives you've considered

  1. User responsibility to bring your own project - A disclaimer that this example is intended to be configured with an existing project that uses GitHub Pull Requests and Grafana as an observability tool.
  2. Additional complementary AWS Samples repo - Create a second GitHub repo that implements a toy/demo application. As part of the prerequisite steps for devops_agent, instruct the user to clone this second repo and set up a project that launches the toy/demo application.

Additional context

To set up this example (examples/multi-agent-collaboration/devops_agent), I attempted the following:

First - How to host Grafana

Two choices:

  1. Amazon Managed Grafana product link

Pros:
+ Secure managed environment of Grafana. Undifferentiated heavy lifting of hosting (namely compute & secure networking) is eliminated.
+ Easily deployable via CDK Construct - aws-cdk-lib.aws_grafana module
--> see example stack below:

Cons:
- Depends on IAM Identity Center or SAML for user authentication (see "User Authentication - Amazon Managed Grafana"). Great for production, cumbersome for standing up a quick demo.
- Enrollment in IAM Identity Center is not something that can or should be automated in the setup instructions.
2. Self Hosted Grafana

Key requirement - the instance must be accessible over the public internet, since the action group Lambda (examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/lambda_function.py) invokes the Grafana API URL over the public internet.

Best example repository - https://github.com/aws-samples/aws-cdk-grafana
Issue - Requires a Route53 domain and its configuration. Great for production, cumbersome for standing up a quick demo.

I attempted to deploy aws-samples/aws-cdk-grafana without HTTPS configuration (bypassing the need for Route53). The deployment was flagged as insecure and terminated by my company's AWS account security scanners.

TLDR - what is your recommendation?
Deploy Amazon Managed Grafana, which is the better of two bad choices in my opinion. Downside: the demo is not usable until SAML or IAM Identity Center configuration has been completed beforehand.

Second - Sample/Toy Application Observed by Grafana

I created a basic AWS Lambda function that writes pseudo-random data to CloudWatch metrics. The function is invoked every minute on a schedule (an EventBridge rule). See the code snippets below.

This writes fake data for {CPU, Memory, Response Time, SuccessfulTestHits}, published separately for four apps {app1, app2, app3, app4} through an App dimension. I did this because the "Testing the Agent" section of examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/01_create_grafana_assistant.ipynb has example queries """can you get me alert history of memory alert for the app app1""" and """when was the last Normal state for the response time for app4""".

Third - Configuring Grafana

Preface - This is my first experience with Grafana. Others more experienced with this tool will likely be able to create a correct configuration.

For each app, I attempted to create one alert rule that is always firing (in the alarm state) and another that alternates between on and off. I cannot understand why Image 1 shows an alarm state of "NORMAL" when all 4 apps are "firing" for the alert condition.

TLDR - Please add instructions on how to implement basic Grafana alarms for this demo.

IMAGE 1 - High Response Time:
Image
Why does this alarm not enter the "ALERTING" state?
IMAGE 2 - High CPU Utilization:
Image
How can this alarm be reconfigured so that it is in the "ALERTING" state for the specific apps that breach the threshold condition?

Fourth - Integrating

The agent in examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/01_create_grafana_assistant.ipynb built correctly 🎉 (unlike related issue #4). However, since no alerts ever enter the Alerting state, the action group Lambda function throws an error: [ERROR] KeyError: 'alerts' at examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/lambda_function.py line #91 - rules_dimensions_df_full = rules_df.explode('alerts').

I know this is because my Grafana alerts were not configured correctly.

Side note - This action group could benefit from better error handling. It should return "No Alerts Exist" or "No Alerts in Alerting State".
It is also entirely possible that this would not be an issue with correct Grafana alerting rules in place.
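
To illustrate the kind of guard I mean, here is a minimal sketch. It is not the sample's actual code. It assumes the rules arrive as a list of dicts parsed from the Grafana response, and that a rule with no alert instances simply omits the 'alerts' key. That assumption is consistent with the KeyError above, but I have not verified it against the Grafana API. The names flatten_alerts, normalize_rules and NO_ALERTS_MESSAGE are hypothetical.

```python
from typing import Any, Dict, List, Union

# Hypothetical message; the wording follows the suggestion in this issue.
NO_ALERTS_MESSAGE = "No alerts exist or no alerts are in the Alerting state"


def normalize_rules(rules: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rules in which an 'alerts' list is always present."""
    return [{**rule, "alerts": rule.get("alerts") or []} for rule in rules]


def flatten_alerts(rules: List[Dict[str, Any]]) -> Union[List[Dict[str, Any]], str]:
    """Return one row per (rule, alert) pair, or a message when there are none.

    This plays the role of the explode('alerts') step, but it tolerates
    rules that carry no 'alerts' key instead of raising KeyError.
    """
    rows = [
        {"rule_name": rule.get("name"), **alert}
        for rule in normalize_rules(rules)
        for alert in rule["alerts"]
    ]
    return rows if rows else NO_ALERTS_MESSAGE
```

With something like this, the Lambda could return the message to the agent when the result is a string, and only build the DataFrame when rows come back.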

Code Snippets

My implementation is incomplete. Here are key code snippets that were moving in the right direction.

Amazon Managed Grafana Stack

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as grafana from 'aws-cdk-lib/aws-grafana';
import * as iam from 'aws-cdk-lib/aws-iam';

export class GraphanaStack extends cdk.Stack {
  private static readonly RESOURCE_PREFIX = 'devops-agent';
  
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const grafanaWorkspaceRole = this.createGrafanaRoles();
    const workspace = this.createGrafanaWorkspace(grafanaWorkspaceRole);
    
    this.createOutputs(workspace);
  }

  private createGrafanaRoles(): iam.Role {
    // Create the CloudWatch access role (Role 2)
    const cloudWatchAccessRole = new iam.Role(this, 'GrafanaCloudWatchRole', {
      assumedBy: new iam.ServicePrincipal('grafana.amazonaws.com'),
      description: 'IAM role with CloudWatch access permissions',
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonGrafanaCloudWatchAccess')
      ]
    });
  
    // Create the Grafana workspace default role (Role 1)
    const workspaceRole = new iam.Role(this, 'GrafanaWorkspaceRole', {
      assumedBy: new iam.ServicePrincipal('grafana.amazonaws.com'),
      description: 'Default IAM role for Grafana workspace'
    });
  
    // Allow the workspace role to assume the CloudWatch access role
    workspaceRole.addToPolicy(new iam.PolicyStatement({
      effect: iam.Effect.ALLOW,
      actions: ['sts:AssumeRole'],
      resources: [cloudWatchAccessRole.roleArn]
    }));
  
    // Allow the CloudWatch access role to be assumed by the workspace role
    cloudWatchAccessRole.assumeRolePolicy?.addStatements(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        principals: [new iam.ArnPrincipal(workspaceRole.roleArn)],
        actions: ['sts:AssumeRole']
      })
    );
  
    // Return the workspace role to be used as the default role
    return workspaceRole;
  }

  private createGrafanaWorkspace(role: iam.Role): grafana.CfnWorkspace {
    const workspace = new grafana.CfnWorkspace(this, 'ManagedGrafana', {
      accountAccessType: 'CURRENT_ACCOUNT',
      authenticationProviders: ['AWS_SSO'],
      roleArn: role.roleArn,
      permissionType: 'SERVICE_MANAGED',
      name: `${GraphanaStack.RESOURCE_PREFIX}-workspace`,
      dataSources: ['CLOUDWATCH'],
      notificationDestinations: ['SNS'],
      description: 'Amazon Managed Grafana Workspace for DevOps Agent Demo',
    });

    workspace.node.addDependency(role);
    return workspace;
  }

  private createOutputs(workspace: grafana.CfnWorkspace): void {
    new cdk.CfnOutput(this, 'WorkspaceEndpoint', {
      value: workspace.attrEndpoint,
      description: 'Grafana Workspace URL',
      exportName: `${GraphanaStack.RESOURCE_PREFIX}-workspace-url`
    });

    new cdk.CfnOutput(this, 'WorkspaceId', {
      value: workspace.ref,
      description: 'Grafana Workspace ID',
      exportName: `${GraphanaStack.RESOURCE_PREFIX}-workspace-id`
    });
  }
}

Self Hosted Grafana

Best Example - https://github.com/aws-samples/aws-cdk-grafana

Toy Application - pseudo-random data to CloudWatch metrics

LAMBDA FUNCTION CODE

const { CloudWatch } = require('@aws-sdk/client-cloudwatch');

const cloudwatch = new CloudWatch();

const CLOUDWATCH_NAMESPACE = "multi-agent-collaboration-devops-poc";
const APPS = ['app1', 'app2', 'app3', 'app4'];

// Helper function to generate random number
const randomInt = (min, max) => {
  return Math.floor(Math.random() * (max - min + 1) + min);
};

// Function to generate metrics for a single app
const generateMetricsForApp = () => {
  return [
    {
      MetricName: 'CPU',
      Value: randomInt(1, 100),
      Unit: 'Percent'
    },
    {
      MetricName: 'MEM',
      Value: randomInt(40, 100),
      Unit: 'Percent'
    },
    {
      MetricName: 'ResponseTime',
      Value: randomInt(1, 10),
      Unit: 'Seconds'
    },
    {
      MetricName: 'SuccessfulTestHits',
      Value: randomInt(1, 12),
      Unit: 'Count'
    }
  ];
};

exports.handler = async () => {
  // Create promises for all apps, each with their own random metrics
  const promises = APPS.flatMap(appName => {
    const appMetrics = generateMetricsForApp();
    return appMetrics.map(metric => {
      return cloudwatch.putMetricData({
        Namespace: CLOUDWATCH_NAMESPACE,
        MetricData: [{
          ...metric,
          Dimensions: [{
            Name: 'App',
            Value: appName
          }]
        }]
      });
    });
  });

  await Promise.all(promises);
  return { statusCode: 200, body: 'Metrics published for all apps' };
};

CDK STACK

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as path from 'path';

export class DummyApplicationStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create Lambda function
    const metricGenerator = new lambda.Function(this, 'MetricGenerator', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda/dummy-application/metric-generator')),
      timeout: cdk.Duration.seconds(30),
      environment: {
        // Add any environment variables if needed
      }
    });

    // Add CloudWatch permissions
    metricGenerator.addToRolePolicy(
      new iam.PolicyStatement({
        actions: ['cloudwatch:PutMetricData'],
        resources: ['*']
      })
    );

    // Create EventBridge rule to trigger Lambda every minute
    new events.Rule(this, 'MetricGeneratorSchedule', {
      schedule: events.Schedule.rate(cdk.Duration.minutes(1)),
      targets: [new targets.LambdaFunction(metricGenerator)]
    });
  }
}

How to Help:

Looking for advice on, and validation of, the simplest way to set up this demo and document that setup. Also looking for input from others on creative solutions to simplify and document the setup for this example.
