Automatic commit: D2.2 Update of the Operations Management Policy (OMP) of RESILIENCE

Michele Carraglia 2025-12-02 13:38:18 +01:00
parent ab5b49dbb8
commit 51bb9229bf
37 changed files with 967 additions and 0 deletions

@ -0,0 +1,57 @@
---
config:
layout: elk
---
flowchart TD
U[["Users and requestors"]] --> A@{ label: "OMSP Service Desk<br><span style=\"font-size:12px\">Single point of contact · triage · ticketing</span>" }
A -- Log and acknowledge --> B@{ label: "IT Support Office<br><span style=\"font-size:12px\">Daily operations · monitoring · fix</span>" }
B1@{ label: "Operations Administrator<br><span style=\"font-size:12px\">Member of ITSO</span>" } --> B
B -- Ops triage --> DX{{"Infrastructure platform issue"}}
DX -- Yes --> D@{ label: "D4Science Support Team<br><span style=\"font-size:12px\">VRE · StorageHub · gCat · SocialService · CCP · IAM</span>" }
DX -- No --> DY{{"External integrated service issue"}}
DY -- Yes --> E@{ label: "External Service Provider<br><span style=\"font-size:12px\">Local admin for integrated apps</span>" }
DY -- No --> DZ{{"Product or content or feature request"}}
DZ -- Yes --> C@{ label: "Product Owner or Project Leader<br><span style=\"font-size:12px\">Backlog · UAT · documentation</span>" }
DZ -- No --> B
D -- Restore or workaround or RCA --> B
E -- Fix or configuration --> B
C -- Backlog and acceptance --> B
B -- Assess impact and SLOs --> DP{{"Priority one or SLO breach or security incident"}}
DP -- Yes --> F@{ label: "CTO Chief Technical Officer<br><span style=\"font-size:12px\">Tactical decisions · risk and SLO ownership</span>" }
DP -- No --> B
F -- Coordination and approvals --> DM{{"Strategic impact or major change"}}
DM -- Yes --> G@{ label: "Board<br><span style=\"font-size:12px\">Strategic oversight · major approvals</span>" }
DM -- No --> B
G -- Policy and direction --> F
F -- Directives and standards --> B
A -- Acknowledgements and updates --> N1[("User and stakeholder updates")]
B -- Status and closure --> N1
F -- Major incident communications --> N1
G -- Executive communications --> N1
A@{ shape: subroutine}
B@{ shape: subroutine}
B1@{ shape: rect}
F@{ shape: subroutine}
G@{ shape: subroutine}
D@{ shape: subroutine}
E@{ shape: subroutine}
C@{ shape: subroutine}
U:::notify
A:::ext
B:::int
B1:::int
DX:::decision
DY:::decision
DZ:::decision
DP:::decision
DM:::decision
F:::int
G:::int
D:::ext
E:::ext
C:::int
N1:::notify
classDef int fill:#e7f5ff,stroke:#1c7ed6,stroke-width:1px,color:#1c7ed6
classDef ext fill:#fff4e6,stroke:#d9480f,stroke-width:1px,color:#d9480f
classDef decision fill:#ffffff,stroke:#495057,stroke-dasharray:3 3,color:#495057
classDef notify fill:#f8f9fa,stroke:#adb5bd,color:#495057,stroke-dasharray:2 2
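
The decision chain in the flowchart above (DX, then DY, then DZ, followed by the DP and DM escalation checks) can be read as an ordered set of rules. The following is a minimal Python sketch of that reading, not part of the committed policy. The `Ticket` class, its field names, and the two function names are hypothetical; only the team names and the order of the decisions come from the diagram.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    # Hypothetical flags, one per decision node in the flowchart.
    infrastructure_issue: bool = False        # DX
    external_service_issue: bool = False      # DY
    product_or_feature_request: bool = False  # DZ
    priority_one: bool = False                # DP
    slo_breach: bool = False                  # DP
    security_incident: bool = False           # DP
    strategic_impact: bool = False            # DM


def route(ticket: Ticket) -> str:
    """Walk the DX -> DY -> DZ chain in order; the first match wins."""
    if ticket.infrastructure_issue:
        return "D4Science Support Team"
    if ticket.external_service_issue:
        return "External Service Provider"
    if ticket.product_or_feature_request:
        return "Product Owner or Project Leader"
    # No decision matched: the ticket stays with the IT Support Office.
    return "IT Support Office"


def escalation_path(ticket: Ticket) -> list[str]:
    """Return the governance roles involved, following DP and then DM."""
    path: list[str] = []
    if ticket.priority_one or ticket.slo_breach or ticket.security_incident:
        path.append("CTO")
        # The Board is only reached through the CTO in the diagram.
        if ticket.strategic_impact:
            path.append("Board")
    return path
```

Because the checks run in diagram order, a ticket flagged as both an infrastructure and an external-service issue goes to the D4Science Support Team first, which matches the Yes branch of DX.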

@ -0,0 +1,53 @@
---
config:
layout: elk
---
flowchart TD
U["User community"] --> SD[["Service Desk and Support"]]
MON[["Monitoring and Control Service<br>Event Management"]] -- alerts and events --> SD & INC[["Incident Management"]]
SD -- log and classify --> D1{"Is this an incident"}
D1 -- Yes --> INC
D1 -- No --> D2{"Is this an access request"}
D2 -- Yes --> ACC[["Access Management"]]
D2 -- No --> REQ[["Request Fulfilment"]]
INC -- diagnose and route --> D3{"Infrastructure or application"}
D3 -- Infrastructure --> TECH[["Technical Management"]]
D3 -- Application --> APP[["Application Management"]]
TECH -- work orders and fixes --> OPS[["IT Operations Management"]]
APP -- fixes and releases --> OPS
OPS -- logs and status --> INC
TECH -- diagnostics and fixes --> INC
APP -- diagnostics and fixes --> INC
INC -- restore service --> SD
SD -- user communication and closure --> U
APP -- engage external teams --> EXT[["External Service Integration"]]
EXT -- third party actions --> APP
INC -- recurring or unknown root cause --> PM[["Problem Management"]]
PM -- root cause and permanent fix --> TECH & APP
PM -- publish workarounds --> KEDB[["Known error database"]]
KEDB --> SD & INC
PM -- update thresholds and correlations --> MON
ACC -- grant or revoke roles --> REQ
ACC -- security breach notifications --> INC
OPS -- telemetry and logs --> MON
TECH -- infrastructure metrics --> MON
APP -- application metrics --> MON
U:::aux
SD:::svc
MON:::svc
INC:::svc
D1:::decide
D2:::decide
ACC:::svc
REQ:::svc
D3:::decide
TECH:::svc
APP:::svc
OPS:::svc
EXT:::ext
PM:::svc
KEDB:::aux
classDef svc fill:#eef6ff,stroke:#1d4ed8,stroke-width:1px,color:#0f172a
classDef decide fill:#ffffff,stroke:#64748b,stroke-dasharray:3 3,color:#0f172a
classDef aux fill:#f8fafc,stroke:#94a3b8,color:#334155
classDef ext fill:#fff7ed,stroke:#f97316,color:#7c2d12

File diff suppressed because one or more lines are too long

Image added (size: 156 KiB)

@ -0,0 +1,31 @@
sequenceDiagram
autonumber
participant User as End User
participant L1 as OMSP Service Desk (L1)
participant AP as IT Service/Application Provider (L2/Service)
participant D4S as D4Science Support Team (L2/Infra)
participant PL as Project Leader (L2/Project)
participant CTO as Chief Technical Officer
User->>L1: Submit request (incident/defect/info/project)
L1->>L1: Log ticket, ACK to user (ID assigned)
L1->>L1: Classify (infra vs app vs info vs project)
alt Infrastructure issue
L1->>D4S: Escalate with priority, evidence, context
D4S-->>L1: Diagnostic update / Fix / Workaround
else Application defect
L1->>AP: Create bug record + assign (warranty/maintenance)
AP-->>L1: Fix/Workaround + resolution notes
else Information request
L1->>L1: Resolve via KB/docs or SME
else Project-related request
L1->>PL: Redirect with full context
PL-->>L1: Guidance / Action / Next steps
end
L1-->>User: Status updates / Resolution note
opt Major/High Impact or Policy/SLA issue
L1->>CTO: Notify & summarize (for governance)
CTO-->>L1: Direction / Escalation policy
end

File diff suppressed because one or more lines are too long

Image added (size: 122 KiB)

@ -0,0 +1,64 @@
sequenceDiagram
autonumber
%% === Actors ===
participant PL as PO/PL (Requestor)
participant OMSP as OMSP-OA (Ops Admin)
participant CTO as Chief Technical Officer (CAB)
participant ITSO as IT Support Office
participant D4S as D4S-ST (VRE/Core Infra)
participant ESP as ESP (External App Provider)
participant UAT as UAT Testers (PO/PL Team)
participant MON as OMSP (Monitoring/Observability)
%% 1) Intake & Classification (Service Request Mgmt)
PL->>OMSP: Service Request / RFC (scope, NFRs, AAI/IAM, SLOs)
OMSP-->>PL: ACK + Ticket ID
OMSP->>CTO: Register RFC & visibility
OMSP->>ITSO: Share intake context (for coordination)
%% 2) Feasibility & Risk (Change Enablement / Design Coord)
OMSP->>D4S: Feasibility (new vs existing VRE, core services, capacity)
OMSP->>ESP: Compatibility & integration constraints (if applicable)
OMSP->>OMSP: Feasibility Note + Risk Log + IAM/RBAC outline
OMSP-->>CTO: Submit feasibility package (Go/Adjust/No-Go)
%% 3) Plan & Approval (Change Enablement / CAB)
CTO-->>OMSP: CAB decision (approve/adjust)
OMSP->>OMSP: Deployment Plan (envs, UAT, cutover/rollback, monitoring, backup, comms)
OMSP->>ITSO: Align comms templates, inventory placeholders, runbook skeleton
%% 4) Build & Integrate (Release & Deployment)
alt New VRE
OMSP->>D4S: Provision VRE + core (StorageHub, gCat, CCP, SocialService)
else Existing VRE
OMSP->>D4S: Extend VRE (resources, policies, quotas)
end
OMSP->>OMSP: Configure IAM (AAI/OIDC), RBAC, ELK logs, Prom/Grafana alerts
opt External integration required
OMSP->>ESP: API/connector setup, secrets, mapping, routing
end
OMSP->>ITSO: Draft/Update runbook & KB
note over OMSP,D4S: Build complete, integration smoke tests pass
%% 5) Functional & Non-functional Testing (Service Validation & Testing)
OMSP->>OMSP: Functional, security, basic performance checks
OMSP->>D4S: Validate backup/restore & gCat metadata mapping
OMSP-->>CTO: Test summary + UAT readiness
%% 6) User Acceptance Testing (Service Validation & Testing)
OMSP->>UAT: UAT kickoff (test scripts, data)
UAT->>OMSP: Defects/feedback
OMSP->>OMSP: Fix & retest cycles
UAT-->>OMSP: UAT Sign-off
%% 7) Production Cutover (Release & Deployment)
OMSP->>PL: Maintenance window notice (≥24h) + user comms
OMSP->>D4S: Execute cutover + smoke tests
OMSP->>ITSO: Publish Service Catalogue entry & SLOs
OMSP-->>PL: Go-live confirmation & support path
%% 8) Early Life Support & Handover (Service Operation)
MON->>OMSP: Heightened monitoring & alert verification
OMSP->>ITSO: Finalize runbook/KB, capture early-life metrics
OMSP-->>CTO: Early-life summary, schedule first service review
OMSP-->>PL: Ticket closure (links to docs, SLAs, escalation paths)

File diff suppressed because one or more lines are too long

Image added (size: 135 KiB)

@ -0,0 +1,54 @@
sequenceDiagram
autonumber
participant Req as Requestor (End User / Project Leader)
participant L1 as Service Desk (ITSO/OMSP - L1)
participant ITSO as IT Support Office (IAM Ops)
participant CTO as Chief Technical Officer (Governance)
participant D4S as D4Science IAM/Infra
participant EXT as External Service Provider (Integrated App)
participant AUD as Monitoring/Audit (ELK)
%% 1) Intake
Req->>L1: Access request / new role / role change
L1->>ITSO: Create IAM ticket + full context
%% 2) Role design & modelling
ITSO->>ITSO: Map business role → RBAC / claims (SoD check)
%% (4) Returned to Service Desk before approval
ITSO-->>L1: Role model package (for approval routing)
%% 3) Approval (initiated by Service Desk)
%% (5) L1 initiates the approval step to CTO
L1->>CTO: Submit role model for approval (new roles/sensitive auth)
CTO-->>L1: Approve/Adjust
%% 4) Provisioning (initiated by Service Desk)
%% (7,8) L1 triggers provisioning calls
L1->>D4S: Create/Update IAM groups (OIDC/RBAC)
L1->>EXT: Sync roles/entitlements (if integrated)
D4S-->>L1: Provisioning confirmation
EXT-->>L1: Sync confirmation
L1-->>Req: Access granted notice
%% 5) De-provisioning (initiated by Service Desk)
Req->>L1: Exit / role removal / transfer
%% (13) L1 opens de-provisioning work
L1->>ITSO: De-provision ticket (record/log)
%% (14,15) L1 triggers actual revocation
L1->>D4S: Revoke membership / tokens
L1->>EXT: Revoke app tokens (≤24h)
D4S-->>L1: Revocation confirmation
EXT-->>L1: Token revocation confirmation
%% (18) L1 informs requester
L1-->>Req: Access revoked confirmation
%% 6) Periodic reviews & compliance (initiated by Service Desk)
%% (19) L1 requests data for review
L1->>D4S: Extract audit logs & membership lists
%% (20) L1 instructs cleanup & recert workflow
L1->>ITSO: Quarterly recert + dormant cleanup
ITSO-->>L1: Recert/cleanup completion report
%% (22) L1 submits compliance report to CTO
L1->>CTO: Compliance report + exceptions/remediations
CTO-->>L1: Approved corrective actions
L1->>AUD: Persist logs & evidence (ELK)

File diff suppressed because one or more lines are too long

Image added (size: 127 KiB)

@ -0,0 +1,62 @@
sequenceDiagram
autonumber
%% Actors
participant MON as Automated Monitoring (ELK + Prom/Grafana)
participant OMSP as Monitoring (OMSP)
participant U as End User
participant L1 as Service Desk (OMSP-L1)
participant ITSO as IT Support Office
participant D4S as D4Science Support (Infra-L2)
participant ESP as External Service Provider (SW-L2)
%% 1) Health monitoring & observability (ITIL: Event Management)
MON-->>L1: Alert: health check/deep probe failure or SLO non-compliance vs SLR
U->>L1: Incident ticket (symptoms, impact)
L1->>L1: Triage & classify (infra vs software, severity, business impact)
L1-->>ITSO: Notification (for coordination and possible escalation)
%% 2) Incident response (ITIL: Incident Management)
alt Infra suspected/confirmed
L1->>D4S: Attach logs/metrics, open infrastructure incident (if infra)
L1->>D4S: Escalate if needed, with priority, evidence, timeframe
D4S-->>L1: Diagnostic update / Fix / Workaround
else Software malfunction suspected/confirmed
L1->>ESP: Attach logs/metrics, open application incident (if app)
L1->>ESP: Escalate if needed, with priority, replication steps, logs, versions, timeframe
ESP-->>L1: Patch/workaround/ETA & notes
L1->>L1: Implement workaround if needed, monitor impact
end
ITSO-->>L1: Resolution summary (restore confirmation/next steps)
L1-->>U: Status updates and final resolution note
L1->>ITSO: Detailed Incident Information
L1-->>ITSO: Post-Incident Review (PIR) with root cause & actions
L1->>L1: Update runbook and KB with lessons learned
note over OMSP,D4S: Patch/Release Management
%% 3) Patch & release management (ITIL: Change Enablement / Release & Deployment)
ESP-->>L1: Vendor advisory / patch announcement
L1->>L1: Evaluate advisory risk, select candidate patches
L1->>L1: Test patches in test/pre-prod environment
L1->>ITSO: Draft Release Plan (scope, risk, smoke tests, rollback)
L1-->>ITSO: Maintenance notice (≥72h) with scope/impact
L1->>U: Maintenance notice (≥24h) with scope/impact
L1->>L1: Pre-patch backup & restore test (evidence)
L1->>L1: Deploy in approved change window, execute smoke tests
L1-->>ITSO: Change closure (success/rollback) + evidence
L1-->>U: Completion communication (outcome/next steps)
note over OMSP,D4S: Configuration Items/CMDB Documentation
%% 4) Configuration & documentation (ITIL: Knowledge Management)
L1->>L1: Update CMDB (versions, CIs, relationships)
L1->>L1: Update config inventory & user guidance (versioned)
L1->>OMSP: Request dashboard/alert threshold adjustments as needed
OMSP->>MON: Adjust dashboard/alert thresholds as needed
note over OMSP,D4S: Capacity/Availability/Performance Monitoring
%% 5) Capacity, performance & cost (ITIL: Capacity Management)
OMSP->>MON: Monthly trigger: pull utilization (capacity/availability) trends
OMSP->>ITSO: Review hot spots / anomalies (CPU, RAM, I/O, latency, cost)
ITSO->>ITSO: Analysis
OMSP->>D4S: Request infra tuning / scaling options where needed
D4S->>D4S: Implementation of the optimisation plan
D4S->>OMSP: Optimisation plan executed
OMSP->>OMSP: Testing
OMSP->>ITSO: Record scaling plan / optimizations and publish summary
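
The patch and release section of the diagram above sets two minimum notice periods before a maintenance window: at least 72 hours for the ITSO and at least 24 hours for users. A minimal Python sketch of how such a check could be automated follows. The function name, the `MIN_NOTICE` table, and its keys are hypothetical and only illustrate the two thresholds stated in the diagram labels.

```python
from datetime import datetime, timedelta

# Minimum lead times copied from the diagram labels; the keys are hypothetical.
MIN_NOTICE = {
    "itso": timedelta(hours=72),
    "users": timedelta(hours=24),
}


def notice_is_on_time(audience: str, sent_at: datetime, window_start: datetime) -> bool:
    """Return True if the notice was sent early enough for the given audience."""
    return window_start - sent_at >= MIN_NOTICE[audience]


# Example: a change window opening on 10 March at 08:00.
window = datetime(2025, 3, 10, 8, 0)
print(notice_is_on_time("itso", datetime(2025, 3, 7, 8, 0), window))    # exactly 72h before
print(notice_is_on_time("users", datetime(2025, 3, 9, 12, 0), window))  # only 20h before
```

The sketch treats the thresholds as inclusive, so a notice sent exactly 72 hours ahead passes. Whether the policy intends an inclusive bound is an assumption.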

File diff suppressed because one or more lines are too long

Image added (size: 138 KiB)

@ -0,0 +1,34 @@
sequenceDiagram
autonumber
participant MON as Monitoring/Observability (SLAs/ELK/Prom-Grafana)
participant L1 as OMSP - Service Desk (tickets)
participant ITSO as IT Support Office
participant D4S as D4Science Support (Infra/VRE)
participant ESP as External Providers (integrations)
participant SM as OMSP - Service Manager
participant CTO as CTO
%% (One-off or yearly) Template agreement
SM->>CTO: Propose/agree reporting template (KPIs, risks, escalations, actions)
%% Monthly data collection
MON-->>SM: SLA/availability/MTTR/exported metrics
L1-->>SM: Ticket stats (incidents/requests), trends
ITSO-->>SM: Problems, changes, PIR highlights, patch outcomes
D4S-->>SM: Infra SLAs, capacity notes, major events
ESP-->>SM: External dependency incidents/escalations
SM->>SM: Normalize & consolidate dataset
%% Draft report
SM->>SM: Draft Monthly Service Report (KPIs, major issues, risks, CSI items)
SM->>ITSO: Internal review & factual check
ITSO-->>SM: Edits/confirmations
%% Submission & review
SM->>CTO: Submit report (≥5 days before month-end)
CTO-->>SM: Acknowledge & share agenda points
SM->>CTO: Review meeting (clarify escalations, agree corrective actions)
%% CSI follow-up
SM->>SM: Create/Update CSI register (owners, due dates)
SM->>CTO: Circulate meeting minutes & action log
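
The reporting diagram above requires the monthly report to reach the CTO at least 5 days before month-end. As a minimal sketch, the latest acceptable submission date can be computed as shown below. It assumes that "month-end" means the last calendar day of the month and that the 5 days are calendar days rather than working days; the diagram states neither, and the function name is hypothetical.

```python
import calendar
from datetime import date, timedelta


def latest_submission_date(year: int, month: int, lead_days: int = 5) -> date:
    """Return the last date on which the report still meets the lead time."""
    # calendar.monthrange returns (weekday of first day, number of days in month).
    days_in_month = calendar.monthrange(year, month)[1]
    month_end = date(year, month, days_in_month)
    return month_end - timedelta(days=lead_days)


print(latest_submission_date(2025, 12))  # 2025-12-26
print(latest_submission_date(2024, 2))   # 2024-02-24 (leap year)
```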

File diff suppressed because one or more lines are too long

Image added (size: 122 KiB)