Operational Best practices
This article provides DIGIT Infra overview, Guidelines for Operational Excellence while DIGIT is deployed on SDC, NIC or any commercial clouds along with the recommendations and segregation of duties (SoD). It helps to plan the procurement and build the necessary capabilities to deploy and implement DIGIT.
In a shared control, the state program team/partners can consider these guidelines and must provide their own control implementation to the state’s cloud infrastructure and partners for a standard and smooth operational excellence.
DIGIT strongly recommends Site reliability engineering (SRE) principles as a key means to bridge between development and operations by applying a software engineering mindset to system and IT administration topics. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Monitoring Tools Recommendations: Commercial clouds like AWS, Azure and GCP offer sophisticated monitoring solutions across various infra levels like CloudWatch and StackDriver, in the absence of such managed services to monitor we can look at various best practices and tools listed below which helps debugging and troubleshooting efficiently.
- Segregation of duties and responsibilities.
- Monitoring dashboards at various levels like Infrastructure, Network and applications.
- Transparency of monitoring data and collaboration between teams.
- Periodic remote sync up meetings, acceptance and attendance to the meeting.
- Ability to see stakeholders availability of calendar time to schedule meetings.
- Periodic (weekly, monthly) summary reports of the various infra, operations incident categories.
- Communication channels and synchronization on a regular basis and also upon critical issues, changes, upgrades, releases etc.
While DIGIT is deployed at state cloud infrastructure, it is essential to identify and distinguish the responsibilities between Infrastructure, Operations and Implementation partners. Identify these teams and assign SPOC, define responsibilities and Incident management is followed to visualize, track issues and manage dependencies between teams. Essentially these are monitored through dashboards and alerts are sent to the stakeholders proactively. eGov team can provide consultation and training on the need basis depending any of the below categories.
- State program team - Refers to the owner for the whole DIGIT implementation, application rollouts, capacity building. Responsible for identifying and synchronizing the operating mechanism between the below teams.
- Implementation partner - Refers to the DIGIT Implementation, application performance monitoring for errors, logs scrutiny, TPS on peak load, distributed tracing, DB queries analisis, etc.
- Operations team - this team could be an extension of the implementation team who is responsible for DIGIT deployments, configurations, CI/CD, change management, traffic monitoring and alerting, log monitoring and dashboard, application security, DB Backups, application uptime, etc.
- **State IT/Cloud team -**Refers to state infra team for the Infra, network architecture, LAN network speed, internet speed, OS Licensing and upgrade, patch, compute, memory, disk, firewall, IOPS, security, access, SSL, DNS, data backups/recovery, snapshots, capacity monitoring dashboard.