I just finished reading two great books: The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations and The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win. These awesome books made me think about my past projects: what we did wrong, how we could have avoided some issues, and what to do better in the next projects.
I have always had logs, but only in a few cases did we have proper alerting. In a few cases the logging was also not detailed enough, so tracking down an issue was a real pain. Moreover, one of the issues was caused by high resource consumption, which we did not monitor. So, a few points to recap:
- log as much as possible from both the application and the resources it runs on
- check logs
- set alerts
Always start with a single alert or a few alerts and metrics, and then add more as the need arises. Otherwise you may set up too many of them, which may result in waking people up unnecessarily. Such a situation leads to alerts being ignored. The worst case, though, is learning about the issues with your software from your customers.
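To make the "start small" advice concrete, here is a minimal sketch of a single alert rule. It only fires when a metric stays above a threshold for several samples in a row, so that one short spike does not wake anybody up. The names AlertRule and should_alert and the numbers used are my own assumptions for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float  # the metric must be above this value...
    consecutive: int  # ...for this many samples in a row (assumed >= 1)


def should_alert(rule, samples):
    """Return True when the last `rule.consecutive` samples all exceed the threshold."""
    if rule.consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < rule.consecutive:
        return False  # not enough data yet to decide
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)


# Hypothetical rule: CPU usage above 90% for three samples in a row.
cpu_rule = AlertRule(name="high_cpu", threshold=90.0, consecutive=3)

print(should_alert(cpu_rule, [50, 95, 60, 92]))  # single spikes only
print(should_alert(cpu_rule, [50, 91, 95, 99]))  # sustained high usage
```

Only the second call reports an alert, because only there do the last three samples all stay above the threshold. Starting with one rule like this and tuning it is easier than silencing dozens of noisy ones later.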
Continuous integration and continuous delivery
Continuous integration and continuous delivery are very basic concepts of modern software development. The dark age when a developer published a package manually via FTP or a similar solution is gone (in most cases). Unfortunately, there are still projects that do not even have a source control system. The lack of awareness of the consequences is dramatic. You cannot track what was changed, when, or by whom. There is no option for code review. You can imagine how error-prone such a process is. So the very first and basic step is having a source control system.
Peer code reviews are also very valuable. They help us learn about various approaches to the code, discover new libraries, correct mistakes, and evaluate or get familiar with the implemented logic.
Once all code is committed to source control, you can verify it with continuous integration, which consists of at least two parts: a build and a test run. The build ensures that the code is in a buildable state, and the test run ensures that some minimum quality check has been performed. Under continuous delivery we may have some more actions: for example adding tags to the code repository and deployments to various environments, including deployment to production with manual approval.
When all steps are automated, we may call it continuous deployment. An important outcome of both approaches is fast delivery of code to production, and also the fact that the team is prepared for deploying code: nobody is afraid of the results.
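The idea of a pipeline (build first, then tests, and later stages only when the earlier ones succeed) can be sketched in a few lines. This is an illustration only, not how any real CI server is implemented: run_pipeline is a hypothetical helper, and the stage commands are placeholders that just start the Python interpreter instead of a real build tool or test runner.

```python
import subprocess
import sys


def run_pipeline(stages):
    """Run (name, command) stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, command in stages:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"stage '{name}' failed with exit code {result.returncode}")
            break  # later stages (e.g. deploy) must not run
        completed.append(name)
    return completed


# Placeholder commands: a real pipeline would call the build tool,
# the test runner and the deployment script here.
stages = [
    ("build", [sys.executable, "-c", "print('building')"]),
    ("test", [sys.executable, "-c", "import sys; sys.exit(1)"]),  # simulated failing test run
    ("deploy", [sys.executable, "-c", "print('deploying')"]),
]

print(run_pipeline(stages))
```

Because the simulated test run exits with a non-zero code, the deploy stage is never started. That is exactly the safety net that makes frequent deployments less scary.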
It is critical from a testing (behaviour and performance) perspective to have identical environments. Let's consider several scenarios.
The first case is a different database engine version used in various environments. For example, if we use a compact SQL version versus standard SQL Server, performance will be different. Moreover, some commands may behave differently. Even a different set of settings may bite us. In one of the projects we had MultipleActiveResultSets set to True in the dev environment, while in every other environment it was set to False. That caused a critical path failure. Fortunately, it was discovered in a lower environment and immediately fixed. However, you can imagine what would happen if such a difference existed only in the production environment.
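A difference like the MultipleActiveResultSets one is easy to catch automatically if the settings of two environments are compared before a release. Below is a minimal sketch, assuming the settings have already been loaded into plain dictionaries. The function name config_differences and the ConnectTimeout value are made up for illustration.

```python
MISSING = "<missing>"  # marker for a key that exists in only one environment


def config_differences(reference, other):
    """Return {key: (reference_value, other_value)} for every setting that differs."""
    differences = {}
    for key in sorted(reference.keys() | other.keys()):
        left = reference.get(key, MISSING)
        right = other.get(key, MISSING)
        if left != right:
            differences[key] = (left, right)
    return differences


# Hypothetical settings, modelled on the situation described above.
dev = {"MultipleActiveResultSets": True, "ConnectTimeout": 30}
test = {"MultipleActiveResultSets": False, "ConnectTimeout": 30}

print(config_differences(dev, test))
# {'MultipleActiveResultSets': (True, False)}
```

Running such a check as a pipeline step and failing on any unexpected difference turns a nasty runtime surprise into a boring build error.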
Another example of a different setup is having a single compute instance in lower environments and multiple instances behind a load balancer in production. If the application is stateful, it may behave unpredictably unless we make sure the state is handled properly across instances. Stateless applications may also run into unexpected problems (e.g. invalid balancing, where one node is overloaded while another is not used, or a cache misconfigured for access from multiple concurrent instances). If the application runs in the cloud in multiple regions, invalid routing or high traffic between regions may become a problem due to latency and additional costs (in most clouds traffic between regions is paid).
Yet another problem that may appear is a different version of the runtime libraries (e.g. the .NET SDK version). It may result in different performance or different behaviour.
The same approach is part of the twelve-factor app methodology (the dev/prod parity factor).
Test automation is a very important step and a very often ignored one: time pressure on delivery is usually very high, and the easiest part to cut is testing. The results of such cuts are frightening: multiple bugs, an unstable application, low performance, security gaps, etc.
So what do we get by automating the tests? First of all, automated regression tests. This is especially important when the system is getting bigger and bigger. Regression testing may be composed of many types of tests, but two of them are really important: unit tests and integration tests, as developers can run them before they start deploying their code.
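As a small illustration of what such a developer-run regression check looks like, here is a sketch of a unit test written with Python's built-in unittest module. The apply_discount function is a made-up example of business logic, not code from any of the projects mentioned here.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical business rule: reduce the price by the given percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


class ApplyDiscountTests(unittest.TestCase):
    def test_reduces_price_by_percentage(self):
        self.assertAlmostEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_percent_keeps_price(self):
        self.assertAlmostEqual(apply_discount(80.0, 0), 80.0)

    def test_rejects_percent_out_of_range(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)


# Run the tests explicitly so the script can be started locally
# or from a pipeline step.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTests)
result = unittest.TextTestRunner().run(suite)
```

If somebody later changes apply_discount and breaks one of these rules, the test run fails within seconds instead of the bug being found by a customer.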
Some people say that unit tests document how the system should work. I would like to meet at least one person who learnt that from them. Most solutions have no unit tests or very few.
Security testing gives us quick feedback about potential gaps in the system. Obviously it won’t fully eliminate the risk or find everything that is wrong in the system, but it is a good start before penetration tests.
All of these actions speed up delivery and minimize the risk of a release.
You should always aim for the highest possible level of automation, including tests, deployments and monitoring. It will save time and money in the future. It is especially important in big projects: the bigger the project, the faster the return on investment becomes visible in various areas. High quality metrics lead to easy maintenance and product stability, and that results in smiling customers.