Designing Batch Apps for Better Troubleshooting
A framework for categorizing logs in background applications into three types — detailed, audit, and business — each serving different audiences.
With background applications, whether a Windows service or a Spring Batch application, troubleshooting is one of the most challenging aspects because nobody interacts with the application directly. In this article I group the logging that supports troubleshooting into three broad types. The purpose of this split is to share experience that can improve the troubleshooting and monitoring of your application.
1. Detailed Logs
Audience: Developers, Production Support Teams, Infrastructure teams
Detail Level: Code flows, errors, information, warnings etc.
Logs of this kind are very common and easy to produce. Historically we have written them either to log files or to a database. They are detailed and complex, and they are hard to analyze and interpret from the raw sources. This is where log aggregation comes in. There are several log aggregation solutions available; some of the well-known ones are:
- Splunk
- Papertrail
- Logz.io
- LogRhythm
These tools ingest logs from different sources (log files, XML, databases, event systems, and so on) and provide fast search and alerting.
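Aggregators generally have an easier time searching logs that arrive as structured records rather than free text. The following is a minimal sketch, using only Python's standard `logging` module, of emitting one JSON object per log line. The field names, the logger name `batch.file_import`, and the `job_name` attribute are assumptions made for illustration; they are not a format required by any of the tools listed above.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # job_name is a hypothetical field supplied through `extra=`.
            "job_name": getattr(record, "job_name", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream, name="batch.file_import"):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log.info("Started processing %s", "orders.csv",
             extra={"job_name": "nightly_import"})
```

Because every line is a self-contained JSON object, an aggregator can typically index fields such as `level` or `job_name` without custom parsing rules.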
2. Audit Logs
Audience: Audit Teams, Bizops teams
Detail Level: Actions
Audit logs are maintained where you need to store records mainly for compliance and audit purposes. Suppose you are processing files. The files can be in different locations (archive, error, processed) and in different states, and you probably need to persist every action that happens on a file, because that record can be very useful during an audit or investigation. Consider the following schema.
| Column | Description |
|---|---|
| FileId | Unique identifier of the file |
| FileName | Name of file |
| Action | What happened |
| ActionDate | When it happened |
| ProcessName | Which process did it |
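As a rough illustration, the schema above could be backed by a table like the one below. This sketch uses SQLite through Python's standard `sqlite3` module only to stay self-contained. The table name `FileAudit`, the column types, and the helper names `create_audit_table`, `record_action`, and `file_history` are assumptions, not part of the original design. `FileId` is deliberately not declared unique, since one file accumulates many action rows.

```python
import sqlite3
from datetime import datetime, timezone

# Column names follow the schema above; the types are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS FileAudit (
    FileId      TEXT NOT NULL,
    FileName    TEXT NOT NULL,
    Action      TEXT NOT NULL,
    ActionDate  TEXT NOT NULL,
    ProcessName TEXT NOT NULL
)
"""


def create_audit_table(conn):
    conn.execute(SCHEMA)
    conn.commit()


def record_action(conn, file_id, file_name, action, process_name, when=None):
    """Append one audit row; rows are only inserted, never updated."""
    when = when or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO FileAudit "
        "(FileId, FileName, Action, ActionDate, ProcessName) "
        "VALUES (?, ?, ?, ?, ?)",
        (file_id, file_name, action,
         when.isoformat(timespec="microseconds"), process_name),
    )
    conn.commit()


def file_history(conn, file_id):
    """Return (Action, ActionDate, ProcessName) rows in chronological order."""
    cursor = conn.execute(
        "SELECT Action, ActionDate, ProcessName FROM FileAudit "
        "WHERE FileId = ? ORDER BY ActionDate, rowid",
        (file_id,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_audit_table(conn)
    record_action(conn, "F-1", "orders.csv", "RECEIVED", "FileWatcher")
    record_action(conn, "F-1", "orders.csv", "ARCHIVED", "FileImportJob")
    for row in file_history(conn, "F-1"):
        print(row)
```

Keeping the table append-only means the full trail of a file can be reconstructed later with a single query on `FileId`.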
3. Business Logs
Audience: Business teams
Detail Level: Events, functional errors, data issues
Detailed logs are usually complex to extract and interpret. Tools like Splunk make this easier, but business users still cannot be expected to scroll through these logs. What matters to the business can also be very different. For example, suppose you are processing a data file and a field called ProductConfiguration is not configured for some records. Such notifications should go to the business operations teams, either through targeted email notifications to the relevant business groups or through a dedicated user interface for reviewing and acting on business errors.
Applications should persist these events in databases and generate notifications at job completion rather than relying on technical log aggregators.
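The ProductConfiguration example can be sketched as follows: the job collects business events while it runs and builds one digest per business group when it completes. Everything here is illustrative. The `BusinessEvent` fields, the `owner_group` routing key, the group name `product-ops`, and the record layout are assumptions, and persisting the events to a database and actually sending the email are left out to keep the sketch short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BusinessEvent:
    record_id: str
    field: str
    issue: str
    owner_group: str  # hypothetical key used to route the notification


def validate_records(records):
    """Collect a business event for every record missing ProductConfiguration."""
    events = []
    for rec in records:
        if not rec.get("ProductConfiguration"):
            events.append(BusinessEvent(
                record_id=rec["id"],
                field="ProductConfiguration",
                issue="not configured",
                owner_group="product-ops",  # assumed group name
            ))
    return events


def build_digests(events):
    """Group events by owner and build one plain-text digest per group."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.owner_group].append(event)
    digests = {}
    for group, items in grouped.items():
        lines = [f"{len(items)} record(s) need attention:"]
        lines += [f"- record {e.record_id}: {e.field} {e.issue}" for e in items]
        digests[group] = "\n".join(lines)
    return digests


if __name__ == "__main__":
    records = [
        {"id": "1", "ProductConfiguration": "A"},
        {"id": "2", "ProductConfiguration": ""},
        {"id": "3"},
    ]
    # At job completion, each digest would be emailed to its group.
    for group, body in build_digests(validate_records(records)).items():
        print(f"To: {group}\n{body}")
```

Sending one digest per group at the end of the job, rather than one message per record, keeps the notification readable for business users.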
Key Takeaway
Audit and business alerts need very specific data, so you should not rely on production support teams to create those alerts for you. Instead, the application itself should be able to process and distribute such logs.