Covers |
Is the cloud infrastructure healthy, performing and efficient?
|
Application Performance Monitoring (APM) and transaction tracing.
Server-side code instrumentation for performance and errors.
|
APIs and Blackbox. Are our underlying APIs up, performing well (globally) and returning the right data?
|
Are web pages up, performing well (globally) and returning the right content?
|
APM for client-side code: instrumentation for performance and errors.
|
Has our code or infrastructure been compromised, or does it have vulnerabilities?
|
Examples |
- Databases
- Disks
- Compute
- Lambda Functions
- Networks (VPCs)
|
- DB Queries
- API Queries
- 3rd party API invocations
- Custom instrumentation markers
- Errors
- Function invocation rate/frequency
- Lambda Invocations
- Distributed Tracing
|
- Uptime
- (Global) Latency
- Contract Testing
- SLO Monitoring
- 3rd Party Contract/SLA Monitoring
|
- Uptime
- Latency
- Accuracy
- Synthetic user monitoring
- Real user monitoring
- Transaction monitoring
|
- Web page errors
- Web page speed
- Mobile errors
- Mobile speed
|
- Unauthorised Access
- Intrusion Detection
- Compromised "Supply Chain" (libraries)
- DDoS
- SIEM
|
Example Tools
Green -> yellow -> teal: Current implementation level
|
- AWS CloudWatch
- CloudHealth
- NewRelic
- DataDog
|
- NewRelic
- Sysdig
- DataDog
- OpenTracing
- Logz.io
- AWS X-Ray
- Dashbird
|
|
- Pingdom
- Catchpoint
- StatusCake
- NewRelic
- Logz.io
|
- NewRelic
- NewRelic
- Rollbar
- Mobile: Firebase Monitoring
- Mobile: Crittercism
- Mobile: Crashlytics
|
- Incapsula
- Snyk
- AWS Security Hub
- AWS GuardDuty
- AWS Macie
- Logz.io
- Sysdig & Falco
|
Responsible and Accountable roles/functions (RACI)
|
|
|
|
- QA
- Service Delivery
- Product Owners
|
- Devs
- QA
- Service Delivery
- Product Owners
|
|
Current overall maturity
|
low to medium
|
low to medium
|
very low
|
medium to high
|
low to medium
|
very low
|
Maturity criteria
What does good look like?
|
- Can you pick up infra issues ahead of time?
- Do you have detailed load stats on the underlying infra?
- Do you have enough information to make good infra rightsizing decisions?
- Can you spot underlying infra issues?
- Can you easily visualise all your data?
|
- Can you pick up code issues ahead of time?
- Do you have detailed stats on load and app load profiles?
- Do you have custom StatsD-type metrics to show behaviours, e.g. total articles served today?
- Do you have detailed stats on application behaviour under load?
- Do you have a comprehensive view of 3rd party integrations?
- Do you have a handle on how each deployment affects application performance?
- Can you detect runtime errors very quickly?
- Do you have detailed info that enables you to make the right optimisations?
- Can you easily visualise all your data?
|
- Do you have a comprehensive view of 3rd party integrations?
- Do you have a handle on how each deployment affects application performance?
- Can you detect API errors before they ripple too far up the stack?
- Can you detect schema changes/breaks early (contract monitoring)?
- Do you have detailed info on global API performance?
- Can you easily visualise all your data?
|
- Can you detect, monitor and audit website uptime?
- Do you have detailed global data on website performance?
- Can you ensure that website content is consistently accurate?
- Can you easily visualise all your data?
|
- Can you pick up code issues ahead of time?
- Do you have detailed stats on load and app load profiles?
- Do you have detailed stats on application behaviour under load?
- Do you have a comprehensive view of 3rd party integrations?
- Do you have a handle on how each deployment affects application performance?
- Can you detect runtime errors very quickly?
- Do you have detailed info that enables you to make the right optimisations?
- Do you have data on user behaviours?
- Do you have data on the platforms your users are using?
- Can you easily visualise all your data?
|
- Can you pick up security issues ahead of time?
- Do you get regular alerts and remedies on new vulnerabilities?
- Do you get heuristic detection of suspicious behaviour on your infra and apps?
- Do you have constant data on the current threat/exposure level?
- Do you get best practice recommendations automatically?
- Can you easily visualise all your data?
|
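The APM maturity criteria above mention custom StatsD-type metrics such as "total articles served today". As a minimal sketch of what emitting such a counter could look like, the snippet below sends one increment in the plain-text StatsD line format over UDP. The metric name `articles.served`, the helper names, and the agent address (`127.0.0.1:8125`, the conventional StatsD port) are assumptions for illustration; most of the tools listed in the table ship their own client libraries, which would normally be used instead.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment to a StatsD-compatible agent over UDP.

    UDP is connectionless, so this does not wait for any reply from the agent.
    The host and port are assumed defaults, not values taken from the table.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (host, port))


if __name__ == "__main__":
    # Hypothetical metric: called once each time an article is served.
    send_counter("articles.served")
```

Summing these increments per day on the monitoring side would give the "total articles served today" figure, which can then be charted next to the load and deployment data the criteria ask for.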