Commit
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry changes * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry changes * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry changes * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * changes for mdm metrics * changes * changes * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * workaround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we don't need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding environment variable collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making an api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we don't flush > 1m payloads * increasing rotate wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinitely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making an api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we don't flush > 1m payloads * increasing rotate wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinitely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310)