Observability Coverage

v2.2-2022-07-14

By following this coverage strategy, we can increase the observability of each layer of our systems and make the systems easier to manage.

Coverage Strategy
	Metal	Server Side Code (APM)	API	Website	Client Side Code (APM)	Security
Covers	Is the cloud infrastructure healthy, performant and efficient?	Application Performance Monitoring (APM) and transaction tracing. Server-side code instrumentation for performance and errors.	APIs and Blackbox. Are our underlying APIs up, performing well (globally) and returning the right data?	Are web pages up, performing well (globally) and returning the right content?	APM for client-side code. Instrumentation for performance and errors.	Has our code or infrastructure been compromised, or does it have vulnerabilities?
Examples	Databases, Disks, Compute, Lambda Functions, Networks (VPCs)	DB Queries, API Queries, 3rd party API invocations, Custom instrumentation markers, Errors, Function invocation rate/frequency, Lambda Invocations, Distributed Tracing	Uptime (Global), Latency, Contract Testing, SLO Monitoring, 3rd Party Contract/SLA Monitoring	Uptime, Latency, Accuracy, Synthetic user monitoring, Real user monitoring, Transaction monitoring	Web page errors, Web page speed, Mobile errors, Mobile speed	Unauthorised Access, Intrusion Detection, Compromised "Supply Chain" (libraries), DDoS, SIEM
Example Tools (Green -> yellow -> teal: current implementation level)	AWS Cloudwatch, CloudHealth, NewRelic, DataDog	NewRelic, Sysdig, DataDog, Opentracing, Logz.io, AWS Xray, Dashbird	BlackBox, Runscope	Pingdom, Catchpoint, StatusCake, NewRelic, Logz.io	NewRelic, Rollbar, Mobile - Firebase Monitoring, Mobile - Crittercism, Mobile - Crashlytics	Incapsula, Snyk, AWS Security hub, AWS GuardDuty, AWS Macie, Logz.io, Sysdig & Falco
Responsible and Accountable roles/functions (RACI)	Devs	Devs	Devs, QA, Service Delivery	QA, Service Delivery, Product Owners	Devs, QA, Service Delivery, Product Owners	Devs, QA, Service Delivery
Current overall maturity	low to medium	low to medium	very low	medium to high	low to medium	very low
Maturity criteria: What does good look like?	Can you pick up infra issues ahead of time? Do you have detailed load stats on underlying infra? Do you have enough information to make good infra rightsizing decisions? Can you spot underlying infra issues? Can you easily visualise all your data?	Can you pick up code issues ahead of time? Do you have detailed stats on load and app load profiles? Do you have custom StatsD-type metrics to show behaviours, e.g. total articles served today? Do you have detailed stats on application behaviour under load? Do you have a comprehensive view of 3rd party integrations? Do you have a handle on how each deployment affects application performance? Can you detect runtime errors very quickly? Do you have detailed info that enables you to make the right optimisations? Can you easily visualise all your data?	Do you have a comprehensive view of 3rd party integrations? Do you have a handle on how each deployment affects application performance? Can you detect API errors before they ripple too far up the stack? Can you detect schema changes/breaks early (contract monitoring)? Do you have detailed info on global API performance? Can you easily visualise all your data?	Can you detect, monitor and audit website uptime? Do you have detailed global data on website performance? Can you ensure that website content is consistently accurate? Can you easily visualise all your data?	Can you pick up code issues ahead of time? Do you have detailed stats on load and app load profiles? Do you have detailed stats on application behaviour under load? Do you have a comprehensive view of 3rd party integrations? Do you have a handle on how each deployment affects application performance? Can you detect runtime errors very quickly? Do you have detailed info that enables you to make the right optimisations? Do you have data on user behaviours? Do you have data on the platforms your users are using? Can you easily visualise all your data?	Can you pick up security issues ahead of time? Do you get regular alerts and remedies on new vulnerabilities? Do you get heuristic pickup of suspicious behaviour on your infra and apps? Do you have constant data on current threat/exposure level? Do you get best practice recommendations automatically? Can you easily visualise all your data?
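
The Server Side Code criteria mention custom StatsD type metrics such as total articles served today. As a rough illustration only, the sketch below emits a counter in the plain-text StatsD format over UDP using the Python standard library. The metric name `articles.served`, the host `127.0.0.1` and the port `8125` (the usual StatsD default) are assumptions for the example, not values taken from our setup, and a real service would more likely use the client library of whichever APM tool is chosen.

```python
import socket


def format_counter(name: str, value: int = 1) -> str:
    """Build a StatsD counter line of the form '<name>:<value>|c'."""
    return f"{name}:{value}|c"


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send a counter increment to a StatsD agent over UDP (fire-and-forget).

    host and port are assumed values; point them at your own agent.
    """
    payload = format_counter(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric: call this once each time an article is served.
    send_counter("articles.served")
```

Because UDP is connectionless, the send does not wait for the agent to respond, so instrumenting a hot code path this way adds very little overhead. The trade-off is that metrics can be dropped silently when the agent is down.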