I really like the promise of this demo. It is an excellent idea for an agentic solution that adds value for DevOps engineers by saving them time diagnosing the cause of poorly performing systems. If revised, I can see this demo being a valuable demonstration of the art of the possible with multi-agent collaboration for many AWS customers.
Describe the improvement request
Please improve the setup instructions for examples/multi-agent-collaboration/devops_agent, specifically for Sub Agent 1 - Grafana Assistant.
The current example assumes, but does not explicitly state, that the user already has preconfigured Grafana alerts.
Add setup instructions for creating a sample Grafana project.
----> Read Additional Context for what I attempted.
Please document how to create and configure a) a Grafana instance, b) the example telemetry data assumed by this demo, and c) the Grafana alert configurations.
What are your suggestions?
Provide steps or reference materials on how to create an example Grafana project for those who wish to try this example but do not have an existing Grafana project to integrate with.
Focus on the simplest possible set of instructions to create a working example that can demonstrate the art of the possible.
Describe alternatives you've considered
User responsibility to bring your own project - A disclaimer that this example is intended to be configured with an existing project that uses GitHub Pull Requests and Grafana as an observability tool.
Additional complementary AWS Samples repo - Create a second GitHub repo that implements a toy/demo application. As part of the prerequisite steps for devops_agent, instruct the user to clone the second GitHub repo and set up a project that launches a toy/demo application.
Additional context
To set up this example (examples/multi-agent-collaboration/devops_agent) I attempted the following:
First - How to host Grafana
Two choices:
1. Amazon Managed Grafana
Pros:
+ Secure managed environment of Grafana. Undifferentiated heavy lifting of hosting (namely compute & secure networking) is eliminated.
+ Easily deployable via CDK Construct - aws-cdk-lib.aws_grafana module
--> see example stack below:
Cons:
- Dependency on IAM Identity Center or SAML for authorization (see User Authentication - Amazon Managed Grafana). Great for production, cumbersome for standing up a quick demo.
- Enrollment in IAM Identity Center is not something that can or should be automated in the setup instructions.
2. Self Hosted Grafana
Key - It must be accessible over the public internet, since the action group (examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/lambda_function.py) invokes the API URL over the public internet.
Best example repository - https://github.com/aws-samples/aws-cdk-grafana
Issue - Requires a Route53 domain and configuration. Great for production, cumbersome for standing up a quick demo.
I attempted to deploy aws-samples/aws-cdk-grafana without HTTPS configuration (bypassing the need for Route53). This deployment was identified as insecure and terminated by my company's AWS account security scanners.
TLDR - what is your recommendation?
Deploy Amazon Managed Grafana - the better of two bad choices in my opinion. Downside: the demo is not usable unless SAML or IAM Identity Center configuration is completed beforehand.
Second - Sample/Toy Application Observed by Grafana
I created a basic AWS Lambda function which writes pseudo-random data to CloudWatch metrics. This Lambda function is invoked by a cron schedule every minute. See code snippets below.
This writes fake data for {CPU, Memory, Response Time, SuccessfulTestHits} with 4 different values of an App dimension - {app1, app2, app3, app4}. I did this because the "Testing the Agent" section of examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/01_create_grafana_assistant.ipynb has example queries """can you get me alert history of memory alert for the app app1""" & """when was the last Normal state for the response time for app4""".
Third - Configuring Grafana
Preface - This is my first experience with Grafana. Others with more experience with this tool will likely be able to create a correct configuration.
I attempted to create one alert rule which will always be firing (in the alarm state) and another that will alternate between on and off for each app. I cannot understand why Image 1 has an alarm state of "NORMAL" when all 4 apps are "firing" for the alert condition.
TLDR - Please add instructions on how to implement basic Grafana alarms for this demo.
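One way to make the two rule types described above easier to verify is to publish deterministic values instead of random ones. The following is only a minimal sketch: the thresholds and the helper name `synthetic_metrics` are hypothetical (the issue does not state the thresholds used in the alert rules), while the metric names `CPU` and `ResponseTime` match the toy Lambda function.

```python
# Hypothetical thresholds; the real alert rule thresholds are not given in this issue.
RESPONSE_TIME_THRESHOLD_S = 5
CPU_THRESHOLD_PCT = 80


def synthetic_metrics(minute: int) -> dict:
    """Return deterministic demo values for one app at a given minute.

    ResponseTime always breaches its threshold (rule should always fire).
    CPU alternates above/below its threshold in 5-minute blocks.
    """
    cpu_high = (minute // 5) % 2 == 0
    return {
        "ResponseTime": RESPONSE_TIME_THRESHOLD_S + 2,
        "CPU": CPU_THRESHOLD_PCT + 10 if cpu_high else CPU_THRESHOLD_PCT - 30,
    }
```

With values like these, an always-firing rule and an alternating rule have a known expected state at any minute, which may make it easier to tell a rule misconfiguration apart from unlucky random data.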
IMAGE 1 - High Response Time:
Why does this alarm not enter the "ALERTING state"?
IMAGE 2 - High CPU Utilization:
How can this alarm be reconfigured so that it is in the "ALERTING" state for specific apps that breach the threshold condition?
Fourth - Integrating
The agent (examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/01_create_grafana_assistant.ipynb) built correctly 🎉 (unlike related issue #4). However, since no alerts ever enter the Alerting state, the action group Lambda function throws an error: `[ERROR] KeyError: 'alerts'` at examples/multi-agent-collaboration/devops_agent/01_Create_Grafana_Assistant_Agent/lambda_function.py line #91 - `rules_dimensions_df_full = rules_df.explode('alerts')`.
I know this is because my Grafana alerts were not configured correctly.
Side note - This action group could benefit from better error handling. It should return "No Alerts Exist" or "No Alerts in Alerting State".
It is also completely possible that, with correct Grafana alerting rules implemented, this would not be an issue.
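To illustrate the kind of error handling suggested above, here is a minimal sketch in plain Python. It is not the code in lambda_function.py: the function names, the message text, and the assumed shape of the rule dictionaries (an optional `alerts` list per rule, each alert with a `state`) are assumptions for illustration only.

```python
from typing import Any

# Hypothetical message; wording taken from the suggestion in this issue.
NO_ALERTS_MESSAGE = "No alerts exist or no alerts are in the Alerting state"


def flatten_alerts(rules: list) -> list:
    """Return one row per (rule, alert) pair, skipping rules without alerts."""
    rows = []
    for rule in rules:
        # .get() avoids a KeyError when a rule carries no 'alerts' key.
        for alert in rule.get("alerts") or []:
            rows.append({"rule": rule.get("name"), **alert})
    return rows


def summarize_alerts(rules: list) -> str:
    """Return a readable summary, or a clear message when nothing is alerting."""
    rows = flatten_alerts(rules)
    if not rows:
        return NO_ALERTS_MESSAGE
    return "\n".join(f"{row['rule']}: {row.get('state', 'unknown')}" for row in rows)
```

The same idea would apply to the DataFrame version: check that an `alerts` column is present before calling `explode('alerts')`, and return the message instead of raising.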
Code Snippets
My implementation is incomplete. Here are key code snippets that were moving in the right direction.
Amazon Managed Grafana Stack

    import * as cdk from 'aws-cdk-lib';
    import { Construct } from 'constructs';
    import * as grafana from 'aws-cdk-lib/aws-grafana';
    import * as iam from 'aws-cdk-lib/aws-iam';

    export class GraphanaStack extends cdk.Stack {
      private static readonly RESOURCE_PREFIX = 'devops-agent';

      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);
        const grafanaWorkspaceRole = this.createGrafanaRoles();
        const workspace = this.createGrafanaWorkspace(grafanaWorkspaceRole);
        this.createOutputs(workspace);
      }

      private createGrafanaRoles(): iam.Role {
        // Create the CloudWatch access role (Role 2)
        const cloudWatchAccessRole = new iam.Role(this, 'GrafanaCloudWatchRole', {
          assumedBy: new iam.ServicePrincipal('grafana.amazonaws.com'),
          description: 'IAM role with CloudWatch access permissions',
          managedPolicies: [
            iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonGrafanaCloudWatchAccess')
          ]
        });

        // Create the Grafana workspace default role (Role 1)
        const workspaceRole = new iam.Role(this, 'GrafanaWorkspaceRole', {
          assumedBy: new iam.ServicePrincipal('grafana.amazonaws.com'),
          description: 'Default IAM role for Grafana workspace'
        });

        // Allow the workspace role to assume the CloudWatch access role
        workspaceRole.addToPolicy(new iam.PolicyStatement({
          effect: iam.Effect.ALLOW,
          actions: ['sts:AssumeRole'],
          resources: [cloudWatchAccessRole.roleArn]
        }));

        // Trust policy: allow the CloudWatch access role to be assumed by the workspace role
        cloudWatchAccessRole.assumeRolePolicy?.addStatements(new iam.PolicyStatement({
          effect: iam.Effect.ALLOW,
          principals: [new iam.ArnPrincipal(workspaceRole.roleArn)],
          actions: ['sts:AssumeRole']
        }));

        // Return the workspace role to be used as the default role
        return workspaceRole;
      }

      private createGrafanaWorkspace(role: iam.Role): grafana.CfnWorkspace {
        const workspace = new grafana.CfnWorkspace(this, 'ManagedGrafana', {
          accountAccessType: 'CURRENT_ACCOUNT',
          authenticationProviders: ['AWS_SSO'],
          roleArn: role.roleArn,
          permissionType: 'SERVICE_MANAGED',
          name: `${GraphanaStack.RESOURCE_PREFIX}-workspace`,
          dataSources: ['CLOUDWATCH'],
          notificationDestinations: ['SNS'],
          description: 'Amazon Managed Grafana Workspace for DevOps Agent Demo',
        });
        workspace.node.addDependency(role);
        return workspace;
      }

      private createOutputs(workspace: grafana.CfnWorkspace): void {
        new cdk.CfnOutput(this, 'WorkspaceEndpoint', {
          value: workspace.attrEndpoint,
          description: 'Grafana Workspace URL',
          exportName: `${GraphanaStack.RESOURCE_PREFIX}-workspace-url`
        });
        new cdk.CfnOutput(this, 'WorkspaceId', {
          value: workspace.ref,
          description: 'Grafana Workspace ID',
          exportName: `${GraphanaStack.RESOURCE_PREFIX}-workspace-id`
        });
      }
    }

Toy Application - pseudo-random data to CloudWatch metrics
LAMBDA FUNCTION CODE

    const { CloudWatch } = require('@aws-sdk/client-cloudwatch');

    const cloudwatch = new CloudWatch();
    const CLOUDWATCH_NAMESPACE = "multi-agent-collaboration-devops-poc";
    const APPS = ['app1', 'app2', 'app3', 'app4'];

    // Helper function to generate random number
    const randomInt = (min, max) => {
      return Math.floor(Math.random() * (max - min + 1) + min);
    };

    // Function to generate metrics for a single app
    const generateMetricsForApp = () => {
      return [
        { MetricName: 'CPU', Value: randomInt(1, 100), Unit: 'Percent' },
        { MetricName: 'MEM', Value: randomInt(40, 100), Unit: 'Percent' },
        { MetricName: 'ResponseTime', Value: randomInt(1, 10), Unit: 'Seconds' },
        { MetricName: 'SuccessfulTestHits', Value: randomInt(1, 12), Unit: 'Count' }
      ];
    };

    exports.handler = async () => {
      // Create promises for all apps, each with their own random metrics
      const promises = APPS.flatMap(appName => {
        const appMetrics = generateMetricsForApp();
        return appMetrics.map(metric => {
          return cloudwatch.putMetricData({
            Namespace: CLOUDWATCH_NAMESPACE,
            MetricData: [{
              ...metric,
              Dimensions: [{ Name: 'App', Value: appName }]
            }]
          });
        });
      });

      await Promise.all(promises);
      return { statusCode: 200, body: 'Metrics published for all apps' };
    };

CDK STACK

    import * as cdk from 'aws-cdk-lib';
    import { Construct } from 'constructs';
    import * as lambda from 'aws-cdk-lib/aws-lambda';
    import * as events from 'aws-cdk-lib/aws-events';
    import * as targets from 'aws-cdk-lib/aws-events-targets';
    import * as iam from 'aws-cdk-lib/aws-iam';
    import * as path from 'path';

    export class DummyApplicationStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // Create Lambda function
        const metricGenerator = new lambda.Function(this, 'MetricGenerator', {
          runtime: lambda.Runtime.NODEJS_18_X,
          handler: 'index.handler',
          code: lambda.Code.fromAsset(path.join(__dirname, 'lambda/dummy-application/metric-generator')),
          timeout: cdk.Duration.seconds(30),
          environment: {
            // Add any environment variables if needed
          }
        });

        // Add CloudWatch permissions
        metricGenerator.addToRolePolicy(new iam.PolicyStatement({
          actions: ['cloudwatch:PutMetricData'],
          resources: ['*']
        }));

        // Create EventBridge rule to trigger Lambda every minute
        new events.Rule(this, 'MetricGeneratorSchedule', {
          schedule: events.Schedule.rate(cdk.Duration.minutes(1)),
          targets: [new targets.LambdaFunction(metricGenerator)]
        });
      }
    }

How to Help:
Looking for advice and validation on the simplest way to set up this demo and document that setup. Also looking for input from others on creative solutions to simplify and document the setup for this example.