Summary
After you install OverOps, all exceptions and errors coming from your application show up in your OverOps dashboard. The amount of information can be overwhelming, which makes it hard to find the events you need to focus on. This best practice describes five steps to reduce the noise, prioritize issues, and notify the right teams about critical events.
Step 1: Model your code using code filters (if needed)
OverOps allows you to either blacklist or whitelist packages and classes so that it monitors and tracks the variable state and call stack of the events you are interested in. Properly blacklisting and whitelisting third-party code reduces benign noise and ensures that important events are captured.
The call stack tracks the method and variable state for the event, from the location where the error occurred all the way back to its entry point in the code, including the parameters passed into each method. The first method in the stack is the last method executed within your application code. Typically, you are only interested in your own code, so OverOps blacklists by default the classes and packages it considers standard third-party code (see Java 3rd Party Packages). These classes are not monitored by default and therefore do not appear in your stack trace.
It is therefore important to do one of two things. Either whitelist the packages and classes from the default third-party blacklist that you need to monitor or that help identify entry points into a particular section of code, or blacklist the packages and classes that cause noise or prevent you from seeing the code where the real issues occur.
Keep in mind that OverOps uses a sophisticated algorithm based on unique fingerprints of the bytecode to deduplicate the events reported by the JVM. This provides details such as:
- Where in the application code the events are coming from
- How often they happen and how many calls are made into that specific code
- Which transactions in the code are affected and at what rate
- When they were first seen and last seen
- On which JVMs and machines they occur
- In which deployment they were introduced
By blacklisting certain parts of your code, you can further improve deduplication by focusing on the areas that are relevant and important.
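The kind of aggregation described above can be illustrated with a simplified sketch. This is not the OverOps algorithm, which fingerprints bytecode. Here the fingerprint is just a hash of the exception type and code location, and the `Occurrence` fields and function names are hypothetical, chosen only for illustration.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """One raw error occurrence (hypothetical fields, for illustration only)."""
    exception_type: str
    location: str    # e.g. "com.shop.payment.CheckoutService.pay"
    jvm: str
    deployment: str
    timestamp: int   # seconds since epoch


def fingerprint(occ: Occurrence) -> str:
    # Assumption: a stand-in for a real bytecode fingerprint.
    raw = f"{occ.exception_type}@{occ.location}".encode()
    return hashlib.sha256(raw).hexdigest()[:12]


def deduplicate(occurrences):
    """Group occurrences by fingerprint and aggregate summary details."""
    groups = defaultdict(list)
    for occ in occurrences:
        groups[fingerprint(occ)].append(occ)

    events = []
    for key, occs in groups.items():
        first = min(occs, key=lambda o: o.timestamp)
        events.append({
            "fingerprint": key,
            "location": first.location,
            "count": len(occs),
            "first_seen": first.timestamp,
            "last_seen": max(o.timestamp for o in occs),
            "jvms": sorted({o.jvm for o in occs}),
            "introduced_in": first.deployment,
        })
    return events
```

Each resulting event carries the count, first and last seen times, affected JVMs, and the deployment of the earliest occurrence, mirroring the details listed above.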
In the end, this allows you to structure your views not only by the team responsible but also by the severity of the issue. Routing, prioritizing, and communicating issues based on these views then becomes easy.
For example:
During checkout of an order in your shopping cart app, an error is thrown in your SSO package.
The SSO package is very common code and is called from various areas of your application, including the payment package, shipping package, and order entry package. In this case, the SSO package is not the cause of the problem. The issue occurred further up in the stack but is hidden by the SSO package. Because the error is thrown in the SSO class, you also cannot route the problem to the team that is really responsible, here the team responsible for the payment processes. To show where the issue really happens and route it to the responsible team, you need to blacklist the SSO package. The next time the issue occurs, the class or package that comes next in the stack (the payment processes) is shown as the class that threw the error. You can now build a view that shows the issues originating in the payment classes and route it to the correct team. Because these issues prevent customers from completing their orders, you would also classify them at a much higher severity/priority than issues occurring elsewhere in your application.
Another scenario is one where errors are thrown in a class but you do not know the actual entry point of the issue, because the entry point is in a third-party package that is blacklisted by default. As in the earlier example, this makes it difficult to route the issue to the correct team. To show the package as your entry point, you have to whitelist it. Going forward, this exposes the entry point of your error and allows you to route the issue to the correct team.
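The two scenarios above can be sketched conceptually as follows. This is a minimal illustration of how blacklisting and whitelisting change which frames are visible. It is not OverOps code or configuration. The package names, the default blacklist contents, and the rule that a whitelist entry overrides any blacklist entry are all assumptions made for the example.

```python
# Hypothetical default third-party blacklist (illustrative prefixes only).
DEFAULT_THIRD_PARTY = ("org.springframework.", "org.apache.")


def is_monitored(frame, blacklist, whitelist):
    """A frame is visible if whitelisted, or if no blacklist prefix matches."""
    if any(frame.startswith(prefix) for prefix in whitelist):
        return True  # assumption: whitelist wins over any blacklist entry
    return not any(frame.startswith(prefix) for prefix in blacklist)


def visible_stack(stack, blacklist=(), whitelist=()):
    """Return the frames left after applying default and custom filters.

    `stack` is ordered from the frame that threw the error (first)
    down to the entry point (last).
    """
    effective = tuple(DEFAULT_THIRD_PARTY) + tuple(blacklist)
    return [f for f in stack if is_monitored(f, effective, whitelist)]


def event_location(stack, **filters):
    """First visible frame: shown as the class that threw the error."""
    frames = visible_stack(stack, **filters)
    return frames[0] if frames else None


def entry_point(stack, **filters):
    """Last visible frame: shown as the entry point of the event."""
    frames = visible_stack(stack, **filters)
    return frames[-1] if frames else None


# Hypothetical stack for the shopping cart example.
stack = [
    "com.shop.sso.TokenValidator.validate",
    "com.shop.payment.CheckoutService.pay",
    "com.shop.web.CartController.checkout",
    "org.springframework.web.servlet.DispatcherServlet.doDispatch",
]
```

With no custom filters, `event_location(stack)` reports the SSO frame. Passing `blacklist=("com.shop.sso.",)` moves the event location to the payment frame, and passing `whitelist=("org.springframework.web.",)` exposes the third-party frame as the entry point.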
You might have to turn on "Show 3rd party/utility methods" to see the full stack and determine which third-party class you need to whitelist so that it is monitored.
Step 2: Set up views for the relevant classifications
Now that you have modeled your code using these code filters, you can set up views accordingly.
In this example, the teams are organized by code, so payment package issues go to the payment development team.
Example:
Views for
- Payment package issues Priority 1
  Issues that occurred in the payment package and have the highest customer/business impact.
- Payment package issues Priority 2
- Payment package issues Priority 3
- etc.
- Shipping package issues Priority 1
- etc.
- Authentication issues Priority 1
- etc.
The important part is to structure these views based on:
- Customer/Business impact
- Team/person responsible
- Internal engineering/business processes
Setting up views in this manner allows you and your engineering teams to quickly see which issues occurred and how to prioritize them for resolution. You define what counts as priority 1 or priority 2 based on where the issue occurs in your code, or based on the fixed or relative thresholds you set for your code in Step 3 below.
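The classification of events into team and priority views can be pictured as an ordered list of rules. The sketch below is only a mental model, not how views are stored or configured in OverOps. The `ViewRule` type, the package prefixes, and the team names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ViewRule:
    """Hypothetical rule: events under a code location belong to a view."""
    view: str
    location_prefix: str
    team: str
    priority: int


# Most specific prefixes first, so the first match wins.
RULES = [
    ViewRule("Payment package issues Priority 1",
             "com.shop.payment.checkout.", "payments", 1),
    ViewRule("Payment package issues Priority 2",
             "com.shop.payment.", "payments", 2),
    ViewRule("Shipping package issues Priority 1",
             "com.shop.shipping.", "shipping", 1),
    ViewRule("Authentication issues Priority 1",
             "com.shop.sso.", "identity", 1),
]


def assign_view(location: str,
                rules: Sequence[ViewRule] = RULES) -> Optional[ViewRule]:
    """Return the first rule whose prefix matches the event location."""
    for rule in rules:
        if location.startswith(rule.location_prefix):
            return rule
    return None  # no view matches; the event stays unclassified
```

Ordering the rules from most to least specific lets a narrow, high-impact area (here the checkout classes) take a higher priority than the rest of the same package.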
How do I set up views:
In your dashboard, filter the events by Entry Point, Event Location, and Event, then save the result as a View. Details can be found here: Event List
If some columns are not shown, click the + sign in the header of the grid view on the right of the screen and add or remove columns as relevant to you.
Step 3: Create alerting thresholds (1%, 5%, new error,…)
Now that your views are set up, the next step is to set up the alerting thresholds, which essentially identify the priority and severity of the issues in each view. Make sure your views represent the priority or severity of the issues collected in them. Learn more about setting up passive and dynamic thresholds in this article.
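To make the difference between fixed thresholds, relative thresholds, and "new error" alerts concrete, here is a small sketch of the decision logic. The function name and its parameters are hypothetical and do not correspond to OverOps settings. The sketch only shows how the three kinds of conditions could be evaluated for one view.

```python
def should_alert(event_count, invocation_count, *,
                 max_rate=None, max_count=None,
                 is_new=False, alert_on_new=False):
    """Decide whether a view should raise an alert.

    max_rate  -- relative threshold, e.g. 0.01 for 1% of invocations
    max_count -- fixed threshold on the absolute number of events
    is_new    -- whether the event was never seen before
    All parameters are illustrative assumptions.
    """
    if alert_on_new and is_new:
        return True
    if max_count is not None and event_count >= max_count:
        return True
    if max_rate is not None and invocation_count > 0:
        return event_count / invocation_count >= max_rate
    return False
```

For instance, a Priority 1 view might use `max_rate=0.01` together with `alert_on_new=True`, while a Priority 3 view might use only `max_rate=0.05`.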
Step 4: Set up routing rules - email, slack, etc…
Set up your routing rules to notify the respective teams via the selected communication channels, such as email, Slack, Jira, or others. Details can be found in our documentation.
We strongly recommend first routing events to a less structured channel such as Slack or email while you fine-tune the filters and thresholds to improve the signal-to-noise ratio. Once you are comfortable that you have achieved a "clean" alert channel, we recommend routing the events to a more structured system such as PagerDuty or Jira to improve your team's detection/identification metrics.
Finally, we highly recommend sharing the information with your NOC / DevOps teams. You can do that by creating views within OverOps or you can publish the metrics to dashboards via Grafana, Splunk, or AppDynamics. To learn more about publishing metrics please see here.
Step 5: Focus (only) on NEW events!
Once you have gone through all the steps above, you have reduced the amount of traffic you receive and set up alerting channels and thresholds to notify your teams accordingly. You may also have noticed that OverOps can identify when NEW issues (also called events) were introduced by newly deployed versions.
To help your dev teams focus on a few important events, focus only on events that were introduced with new deployments. To facilitate this, you need to name your deployments (see Naming your Applications and Deployments). Out of the box, OverOps provides a list of NEW events introduced with every newly deployed version. These events did NOT exist in any earlier deployment. You can set up alerts as well as automated Jira ticket creation as described in the steps above.
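The notion of a NEW event, one that did not exist in any earlier deployment, amounts to a set difference. The sketch below illustrates that idea only. The function, the deployment names, and the event identifiers are hypothetical, and OverOps computes this for you once deployments are named.

```python
def new_events(events_by_deployment, deployment_order, target):
    """Return events seen in `target` but in no earlier deployment.

    events_by_deployment -- maps deployment name to a set of event IDs
    deployment_order     -- deployment names, oldest first
    Both structures are assumptions made for this illustration.
    """
    idx = deployment_order.index(target)
    seen_before = set()
    for name in deployment_order[:idx]:
        seen_before |= events_by_deployment.get(name, set())
    return events_by_deployment.get(target, set()) - seen_before


# Hypothetical data: event IDs observed per named deployment.
observed = {
    "v1.0": {"evt-a", "evt-b"},
    "v1.1": {"evt-b", "evt-c"},
    "v1.2": {"evt-a", "evt-c", "evt-d"},
}
order = ["v1.0", "v1.1", "v1.2"]
```

Here `new_events(observed, order, "v1.2")` contains only `evt-d`, because `evt-a` and `evt-c` already appeared in earlier deployments.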