Optical SLAs, Availability, and Operational Reality
Five-nines availability is a number a salesperson writes on a contract; what an engineer sees is a budget of roughly five minutes of unavailability per year, distributed across components with vastly different failure characteristics, and a Mean-Time-To-Repair dominated not by component reliability but by how fast a fibre crew can find and splice a cut. This chapter quantifies the math behind common SLA tiers, walks the MTBF / MTTR breakdown of an end-to-end circuit, addresses the common pitfalls of multi-cause failure modelling (independent vs correlated, shared SRLG), and explains how TCM evidence is used in carrier-to-carrier SLA disputes.
| Concept | What it says |
|---|---|
| Five-nines arithmetic | 99.999 % availability = ~5.26 minutes of unavailability per year. Four-nines is ~52.6 minutes; three-nines is ~8.76 hours. Cost roughly doubles with each additional “nine”, so there is no commercial reason to over-target. |
| MTBF vs MTTR | Component reliability sets how often a failure occurs (Mean Time Between Failures); operational response sets how long it lasts (Mean Time To Repair). For fibre, MTTR dominates — cuts are roughly Poisson-rare per km but each takes hours to fix. |
| SRLG — Shared Risk Link Group | A set of links that share a physical risk (same duct, same conduit, same landing station, same power feed). Two paths that look topologically diverse may collapse to a single failure at the SRLG layer. |
| TCM-based SLA enforcement | Tandem Connection Monitoring data is contractual currency in carrier-to-carrier disputes. Per-segment BIP-8 error counters and unavailable-second logs unambiguously assign fault to a specific operator’s segment. |
Five-Nines Math
Annual unavailability budgets follow directly from the percentage:
| SLA target | Annual unavailability | Monthly (30-day) | Weekly | Common contractual usage |
|---|---|---|---|---|
| 99 % | 87.6 hours | 7.2 hours | 1.68 hours | Best-effort consumer |
| 99.9 % (3-nines) | 8.76 hours | 43.2 minutes | 10.1 minutes | Business broadband |
| 99.95 % | 4.38 hours | 21.6 minutes | 5.04 minutes | Mid-tier business |
| 99.99 % (4-nines) | 52.6 minutes | 4.32 minutes | 1.01 minutes | Premium business / leased lambda |
| 99.999 % (5-nines) | 5.26 minutes | 25.9 seconds | 6.05 seconds | Carrier core, financial connectivity |
| 99.9999 % (6-nines) | 31.5 seconds | 2.59 seconds | 0.605 seconds | Mostly aspirational; some critical SS7 / control |
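The budgets in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 365-day year and a 30-day month (both are assumptions; a real contract defines its own windows), and the function name `downtime_budget` is illustrative only.

```python
# Window lengths in seconds. The 365-day year and 30-day month are assumptions.
WINDOW_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}

def downtime_budget(availability_pct: float) -> dict:
    """Allowed unavailability, in seconds, per window for an SLA target."""
    unavailability = 1.0 - availability_pct / 100.0
    return {name: unavailability * secs for name, secs in WINDOW_SECONDS.items()}

for target in (99.0, 99.9, 99.95, 99.99, 99.999, 99.9999):
    budget = downtime_budget(target)
    print(f"{target:>8} %  "
          f"year={budget['year'] / 60:9.2f} min  "
          f"month={budget['month']:10.2f} s  "
          f"week={budget['week']:9.3f} s")
```

Changing the window lengths (for example to a 365.25-day year or a calendar month) shifts the figures slightly, which is one reason contracts state the window explicitly.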
Key Insight
An SLA tier is meaningful only when the measurement window and the fault definition are specified. “99.999 % over a calendar month”, with any event longer than 1 second counted as unavailable time, is very different from “99.999 % over a rolling calendar year, with events shorter than 1 minute excluded”. The difference can be a factor of 10 in remedy payouts.
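To make the window and exclusion effect concrete, the sketch below evaluates the same set of outages against two contract wordings. The outage durations, the 30-day month, and all function names are hypothetical, chosen only for illustration.

```python
MONTH_S = 30 * 24 * 3600   # assumed 30-day month
YEAR_S = 365 * 24 * 3600   # assumed non-leap year

def counted_unavailability(events_s, exclusion_s=0.0):
    """Sum outage durations, skipping events shorter than the exclusion."""
    return sum(d for d in events_s if d >= exclusion_s)

def budget_s(availability_pct, window_s):
    """Unavailability allowed in the window, in seconds."""
    return (1.0 - availability_pct / 100.0) * window_s

def breaches(events_s, availability_pct, window_s, exclusion_s=0.0):
    """True if the counted outage time exceeds the budget for the window."""
    counted = counted_unavailability(events_s, exclusion_s)
    return counted > budget_s(availability_pct, window_s)

# Hypothetical outage durations (seconds) recorded in a single month.
events = [20.0, 45.0, 90.0]

# Monthly window, every event counted: 155 s against a ~25.9 s budget.
print(breaches(events, 99.999, MONTH_S))
# Annual window, events under 60 s excluded: 90 s against a ~315 s budget.
print(breaches(events, 99.999, YEAR_S, exclusion_s=60.0))
```

The same three events breach the first wording and pass the second, even though both are described as five-nines.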
Allocating the Budget Across Components
The 5.26 min/year five-nines budget is the end-to-end target for an optical circuit. Internally it is allocated across span, amplifier, ROADM, and transponder layers — each given a sub-budget that, summed in unavailability domain, totals the 5.26 minutes.
| Layer | Typical sub-budget (5-nines target) | Dominant failure mechanism |
|---|---|---|
| Fibre span (per 80 km) | ~1.5 min / yr | Cable cut (rare per km, high MTTR) |
| EDFA / OLA (per amp) | ~0.3 min / yr | Pump-laser failure, AGC fault |
| ROADM (per node) | ~0.5 min / yr | WSS failure, control-card reboot |
| Coherent transponder (per end) | ~0.5 min / yr | DSP / line-side optic failure |
| Power, environmental (per site) | ~1 min / yr | Power glitch, cooling event |
| Software / config error | ~1 min / yr | The most under-budgeted item — often dominates real-world outage |
| Sum | ~5 min / yr | |
Summing unavailabilities is conservative: it treats every outage as a separate, non-overlapping event, so it slightly overstates the downtime of elements in series. It is also a useful sanity check: if any single layer’s design budget exceeds the end-to-end target, the design is broken.
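The allocation can be checked numerically. The sketch below sums the sub-budgets from the table, counting one instance of each layer (an assumption; a real circuit has several spans, amplifiers, and two transponder ends), and compares the sum with the exact series-availability product.

```python
import math

MIN_PER_YEAR = 365 * 24 * 60  # 525 600 minutes, assuming a 365-day year

# Sub-budgets in minutes per year, one instance of each layer (assumption).
sub_budgets_min = {
    "fibre span (80 km)": 1.5,
    "EDFA / OLA": 0.3,
    "ROADM": 0.5,
    "coherent transponder": 0.5,
    "power / environmental": 1.0,
    "software / config": 1.0,
}

# Sum in the unavailability domain.
total_min = sum(sub_budgets_min.values())

# Exact figure for independent elements in series: 1 - product of availabilities.
series_availability = math.prod(
    1.0 - minutes / MIN_PER_YEAR for minutes in sub_budgets_min.values()
)
exact_min = (1.0 - series_availability) * MIN_PER_YEAR

five_nines_min = (1.0 - 0.99999) * MIN_PER_YEAR

print(f"sum of sub-budgets : {total_min:.4f} min/yr")
print(f"exact series figure: {exact_min:.4f} min/yr")
print(f"five-nines target  : {five_nines_min:.4f} min/yr")
```

At these magnitudes the two figures differ only in the fifth decimal place, so the simple sum is a safe working approximation.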
MTBF and MTTR
MTBF (Mean Time Between Failures) describes how often a component fails; MTTR (Mean Time To Repair) describes how long the failure persists. Availability is then:
A = MTBF / (MTBF + MTTR)
For component-driven failures (transponders, ROADMs, EDFAs), MTBF dominates because repair is fast (hot-swap, sometimes minutes with an in-service spare). For fibre cuts, MTTR dominates because the fibre is in the ground, far from the operator, and physical locate, splice, and retest take hours.
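A short worked example of the availability formula follows. The MTBF and MTTR values are illustrative assumptions, not measured figures: a line card with a 300 000-hour MTBF swapped in 2 hours, and a fibre route that is cut about twice a year and takes 6 hours to repair.

```python
HOURS_PER_YEAR = 8760  # assuming a 365-day year

def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

def downtime_min_per_year(mtbf_h: float, mttr_h: float) -> float:
    """Expected unavailable minutes per year for the same inputs."""
    return (1.0 - availability(mtbf_h, mttr_h)) * HOURS_PER_YEAR * 60

# Illustrative (assumed) inputs.
cases = {
    "transponder, hot-swap": (300_000, 2),
    "fibre route, ~2 cuts/yr": (4_380, 6),
}

for name, (mtbf_h, mttr_h) in cases.items():
    print(f"{name:26s} A={availability(mtbf_h, mttr_h):.6f}  "
          f"downtime={downtime_min_per_year(mtbf_h, mttr_h):8.1f} min/yr")
```

With these inputs the card contributes a few minutes per year while the fibre route contributes hours, which is the asymmetry the paragraph above describes.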
Component MTBF Reference
| Component | Typical MTBF | Notes |
|---|---|---|
| EDFA / OLA | 100 000 – 300 000 hours (~11–34 years) | Pump laser is the dominant failure mode; modern dual-pump designs improve substantially |
| Coherent transponder (line card) | 250 000 – 500 000 hours | DSP ASIC and line-side optics; pluggable form factors trend lower than embedded |
| WSS module (in ROADM) | 200 000 – 500 000 hours | LCoS or MEMS — both robust, settling time more often the operational issue |
| ROADM control card | 150 000 – 300 000 hours | Software is the riskier factor, not hardware |
| OTN switch fabric card | 200 000 – 400 000 hours | Industry-standard reliability |
| Optical fibre cable (per km) | ~150 000 – 250 000 km·hours between cuts (divide by route length in km for the route MTBF) | Highly geography-dependent; urban routes with heavy construction activity are worse |
| Coherent pluggable (CFP2-DCO, QSFP-DD ZR/ZR+) | 300 000 – 500 000 hours | Improving as silicon photonics matures |
| Submarine wet-plant repeater | 25-year design life, ~99 % chain-survival target | Cannot be repaired — extreme over-engineering, dual-redundant pumps |
Warning
MTBF figures from datasheets are typically calculated under controlled lab conditions (25 °C, no thermal cycling, no humidity excursion). Field MTBF is often 50–80 % of datasheet, especially for pluggable optics deployed in unconditioned outdoor cabinets. Always derate vendor MTBF when budgeting against an SLA.
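Derating is simple to apply in a budget spreadsheet or script. The sketch below assumes constant failure rates (so failure rates of components in series add) and uses hypothetical helper names; the 0.5 field factor is the pessimistic end of the 50–80 % range quoted above.

```python
HOURS_PER_YEAR = 8760

def derated_mtbf(datasheet_mtbf_h: float, field_factor: float) -> float:
    """Scale a datasheet MTBF by a field factor in the range (0, 1]."""
    if not 0.0 < field_factor <= 1.0:
        raise ValueError("field_factor must be in (0, 1]")
    return datasheet_mtbf_h * field_factor

def chain_mtbf(component_mtbf_h: float, count: int) -> float:
    """MTBF of `count` identical components in series.

    Assumes constant, independent failure rates, so the rates add.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return component_mtbf_h / count

# Example: 200 000 h datasheet EDFA, derated to 50 %, ten amplifiers in the chain.
field = derated_mtbf(200_000, 0.5)
chain = chain_mtbf(field, 10)
print(f"per-amplifier field MTBF: {field:,.0f} h")
print(f"ten-amplifier chain MTBF: {chain:,.0f} h "
      f"(~{chain / HOURS_PER_YEAR:.2f} years between failures)")
```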
MTTR Breakdown for Fibre Cuts
A fibre cut MTTR is not a single number — it is the sum of a sequence of operational steps, each variable in duration:
| Step | Typical duration | Dominant variable |
|---|---|---|
| 1. Alarm to dispatch decision | 5–30 min | NOC procedure, on-call response |
| 2. Crew dispatch + travel to suspect area | 30 min – 4 h | Distance, traffic, time of day |
| 3. Physical locate (OTDR-from-each-end + visual) | 30 min – 2 h | Cable length, accessibility, conduit complexity |
| 4. Excavate / access (if buried) | 1 h – many hours | Burial depth, surface (asphalt vs grass), permits |
| 5. Splice (per fibre, on a high-count cable) | ~5 min per fibre | Fibre count — a 144-fibre cable is hours of splicing |
| 6. Retest (OTDR pre/post) and document | 30 min | |
| 7. Service-restoration verification, alarms cleared | 15 min | |
| Total typical MTTR | 4–12 h for accessible terrestrial; 24+ h for difficult terrain | |
The 4-hour fast case assumes a daytime cut on an accessible suburban cable with a nearby crew; a cut on a buried cable at 03:00 in a winter blizzard, at a remote excavation site, takes 24+ hours.
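The step table above can be summed to bracket the MTTR for a given repair. In this sketch the open-ended excavation time is capped at an assumed 8 hours, and `fibres_to_splice` is the number of fibres that must be spliced before service is restored (an assumption about crew practice, not something the table specifies).

```python
# (step, fastest minutes, slowest minutes), taken from the step table.
# The "many hours" excavation upper bound is capped at an ASSUMED 8 h.
FIXED_STEPS = [
    ("alarm to dispatch decision", 5, 30),
    ("crew dispatch + travel", 30, 240),
    ("physical locate", 30, 120),
    ("excavate / access", 60, 8 * 60),
    ("retest and document", 30, 30),
    ("restoration verification", 15, 15),
]

def mttr_range_min(fibres_to_splice: int, splice_min_per_fibre: float = 5):
    """Return (fastest, slowest) total repair time in minutes."""
    splice = fibres_to_splice * splice_min_per_fibre
    fastest = sum(lo for _, lo, _ in FIXED_STEPS) + splice
    slowest = sum(hi for _, _, hi in FIXED_STEPS) + splice
    return fastest, slowest

for fibres in (12, 48, 144):
    fastest, slowest = mttr_range_min(fibres)
    print(f"{fibres:3d} fibres: {fastest / 60:5.1f} h to {slowest / 60:5.1f} h")
```

With 12 fibres the fast case comes out just under 4 hours, in line with the table; at 144 fibres the splicing step alone adds 12 hours.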
Rule of Thumb
Fibre cuts dominate annual unavailability for a circuit that is unprotected at the optical layer. Two fibre cuts per year × 6 hours each = 12 hours unavailable, or about 99.86 % availability, which misses even three-nines (99.9 %) without protection. Optical protection (1+1 or OUPSR) is what turns fibre cuts with hours of MTTR into a five-nines product.
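The rule-of-thumb arithmetic, and the effect of protection, can be sketched as follows. The protected case assumes two fully independent paths with no shared risk and a perfect, instantaneous switch; those are idealising assumptions, so the result is a best case rather than a prediction.

```python
HOURS_PER_YEAR = 8760
MIN_PER_YEAR = HOURS_PER_YEAR * 60

def unprotected_unavailability(cuts_per_year: float, mttr_h: float) -> float:
    """Fraction of the year a single path is down due to fibre cuts."""
    return cuts_per_year * mttr_h / HOURS_PER_YEAR

def protected_unavailability(u_working: float, u_protect: float) -> float:
    """Both paths down at once, ASSUMING independence and perfect switching."""
    return u_working * u_protect

u_single = unprotected_unavailability(cuts_per_year=2, mttr_h=6)
u_pair = protected_unavailability(u_single, u_single)

print(f"unprotected: {100 * (1 - u_single):.3f} % "
      f"({u_single * HOURS_PER_YEAR:.1f} h/yr down)")
print(f"1+1 ideal  : {100 * (1 - u_pair):.5f} % "
      f"({u_pair * MIN_PER_YEAR:.2f} min/yr down)")
```

Under these assumptions the protected pair lands inside the five-nines budget; any shared risk between the two paths erodes that margin quickly.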
Dual-Vendor Diversity
A common availability strategy is to deploy two vendors’ equipment in parallel — vendor A as primary, vendor B as protection. The intent is to immunise against systemic faults: a software defect in vendor A’s release, a firmware bug, an EOL/deprecation cascade, a supply-chain disruption, or even a shared design flaw across vendor A’s product line.
When does it pay off?
| Scenario | Dual-vendor diversity payoff |
|---|---|
| Vendor A has a recurrent control-plane software defect | High — vendor B keeps the network up while A debugs |
| Single vendor announces critical security issue | High — vendor B is unaffected (typically) |
| Hardware fleet recall | High |
| Routine card failure | None — single-vendor + spare is sufficient |
| Routine fibre cut | None — physical layer is vendor-agnostic |
| Long-term cost, training, spares logistics | Negative — dual-vendor doubles operational complexity |
Warning
Dual-vendor is not a no-brainer. The added operational complexity (two NMS, two configuration languages, two on-call rotations, two parts inventories) easily eats more time per year than the systemic-fault avoidance saves. Only deploy dual-vendor where the regulatory / customer requirement is explicit, or where past systemic faults have justified the discipline empirically. For most operators, single-vendor with strong vendor escalation is cheaper and equally available.
Multi-Cause Failure Modelling
Two paths are only as redundant as their shared risks allow. Six sources of correlated failure routinely catch designs that look diverse on a topology map:
| Correlation source | Concrete example | Mitigation |
|---|---|---|
| Shared duct / conduit | Two “diverse” fibres travelling in the same physical duct or under the same bridge | SRLG-aware route engineering; physical site survey |
| Shared landing station | Both submarine cables enter through the same beach manhole | Geographic diversity, dual landing stations on different beaches |
| Shared power feed | Both ROADM nodes share a single A/B utility feed | Independent A/B feeds + UPS + generator on each |
| Shared software / firmware | Both transponders running the same image with the same defect | Staggered deployments, canary testing, dual-vendor |
| Shared atmospheric / environmental | Both microwave backups in the same storm cell | Diverse media (fibre + microwave is true diversity, fibre + fibre on the same path is not) |
| Shared operations team | Both paths reconfigured by the same engineer in the same maintenance window | Change-management discipline, separate windows for the two paths |
Fault-Tree Analysis
Fault-tree analysis (FTA) decomposes a top-level event (“circuit X is down”) into contributing leaves connected by AND/OR gates. The diagram below illustrates a simplified FTA for a protected optical circuit:
flowchart TD
    TOP["Circuit DOWN"]
    AND1["AND<br/>both paths fail"]
    OR_A["OR<br/>working path failure"]
    OR_B["OR<br/>protection path failure"]
    PROT_FAIL["AND<br/>protection mechanism<br/>also fails"]
    FIBER_A["Fibre cut A"]
    AMP_A["EDFA fail A"]
    TX_A["Transponder fail A"]
    CONFIG["Config error"]
    FIBER_B["Fibre cut B"]
    AMP_B["EDFA fail B"]
    TX_B["Transponder fail B"]
    SELECTOR["Selector / APS<br/>fail"]
    SRLG_HIT["SRLG cut<br/>both A and B<br/>same duct"]
    TOP --> SRLG_HIT
    TOP --> AND1
    TOP --> CONFIG
    TOP --> PROT_FAIL
    AND1 --> OR_A
    AND1 --> OR_B
    OR_A --> FIBER_A
    OR_A --> AMP_A
    OR_A --> TX_A
    OR_B --> FIBER_B
    OR_B --> AMP_B
    OR_B --> TX_B
    PROT_FAIL --> OR_A
    PROT_FAIL --> SELECTOR
    style TOP fill:#E24B4A,stroke:#A32D2D,color:#fff
    style AND1 fill:#D85A30,stroke:#993C1D,color:#fff
    style PROT_FAIL fill:#D85A30,stroke:#993C1D,color:#fff
    style SRLG_HIT fill:#D85A30,stroke:#993C1D,color:#fff
    style CONFIG fill:#BA7517,stroke:#854F0B,color:#fff
    style OR_A fill:#7F77DD,stroke:#534AB7,color:#fff
    style OR_B fill:#7F77DD,stroke:#534AB7,color:#fff
    style FIBER_A fill:#1D9E75,stroke:#0F6E56,color:#fff
    style FIBER_B fill:#1D9E75,stroke:#0F6E56,color:#fff
    style AMP_A fill:#1D9E75,stroke:#0F6E56,color:#fff
    style AMP_B fill:#1D9E75,stroke:#0F6E56,color:#fff
    style TX_A fill:#378ADD,stroke:#185FA5,color:#fff
    style TX_B fill:#378ADD,stroke:#185FA5,color:#fff
    style SELECTOR fill:#7F77DD,stroke:#534AB7,color:#fff
The top event “Circuit DOWN” can occur from independent simultaneous failures on both paths (AND), from an SRLG hit affecting both paths simultaneously (a single event), from a configuration error, or from a protection-mechanism failure during a single-path event.
The FTA’s value is exposing the OR branches that look like ANDs: a fibre cut in an SRLG-shared duct is a single event that takes both paths down, even though the topology shows two paths. An SRLG-shared duct is not the AND of (cut A) and (cut B); it is its own independent leaf in the tree.
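A fault tree of this shape can be evaluated numerically once each leaf has an unavailability. The sketch below uses entirely hypothetical leaf values and treats the branches as independent (a rare-event approximation, since the working-path failure feeds two branches); it is meant only to show how a single SRLG leaf can outweigh the AND of two path failures.

```python
import math

def or_gate(*probs: float) -> float:
    """Probability that at least one independent input event occurs."""
    return 1.0 - math.prod(1.0 - p for p in probs)

def and_gate(*probs: float) -> float:
    """Probability that all independent input events occur together."""
    return math.prod(probs)

# Hypothetical leaf unavailabilities (fractions of time), for illustration only.
leaf = {
    "fibre_a": 1.37e-3, "amp_a": 1e-5, "tx_a": 1e-5,
    "fibre_b": 1.37e-3, "amp_b": 1e-5, "tx_b": 1e-5,
    "selector": 1e-5,   # chance the selector / APS fails when called upon
    "srlg": 1e-4,       # shared-duct cut taking both paths at once
    "config": 2e-6,
}

path_a = or_gate(leaf["fibre_a"], leaf["amp_a"], leaf["tx_a"])
path_b = or_gate(leaf["fibre_b"], leaf["amp_b"], leaf["tx_b"])

both_paths = and_gate(path_a, path_b)             # independent double failure
selector_miss = and_gate(path_a, leaf["selector"])  # single failure, no switch
top = or_gate(leaf["srlg"], leaf["config"], both_paths, selector_miss)

print(f"both paths fail independently: {both_paths:.3e}")
print(f"SRLG single event            : {leaf['srlg']:.3e}")
print(f"top event (circuit down)     : {top:.3e}")
```

With these inputs the SRLG leaf is roughly fifty times larger than the independent double failure, so it sets the top-event figure almost on its own.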
TCM Evidence in Carrier-to-Carrier Disputes
When a multi-segment circuit underperforms its SLA, every operator in the chain is incentivised to claim the fault is somewhere else. TCM (see 13-optical-test-measurement-and-commissioning) provides per-segment performance data that turns a multi-party dispute into a fact-finding exercise.
The contractual mechanism:
- Each operator’s TCM endpoints log per-second and per-15-min counters: BIP-8 errors, errored-seconds, severely-errored-seconds, unavailable-seconds.
- Master Service Agreements between operators specify TCM data export formats (typically G.7710 PM, NETCONF telemetry, or vendor-specific exports normalised to a shared schema) and SLA-test windows.
- When the customer escalates, all operators provide TCM logs for the disputed window. The operator whose TCM segment shows the matching error excursion owns the fault.
- SLA credits flow from the at-fault operator(s) according to the per-segment SLA contract.
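The attribution step in the list above can be sketched as a simple window match. The data layout here (a mapping from segment name to a list of unavailable-second timestamps) and the function names are hypothetical; real exports follow whatever schema the Master Service Agreement specifies, and a real dispute would also verify timestamp alignment and data integrity.

```python
def uas_in_window(uas_timestamps, start, end):
    """Count unavailable-seconds with start <= timestamp < end."""
    return sum(1 for t in uas_timestamps if start <= t < end)

def attribute_fault(segment_logs, start, end):
    """Return per-segment UAS counts and the segments showing any UAS.

    segment_logs: dict mapping a segment name to a list of timestamps
    (seconds) at which that segment logged an unavailable-second.
    """
    counts = {seg: uas_in_window(ts, start, end)
              for seg, ts in segment_logs.items()}
    suspects = [seg for seg, n in counts.items() if n > 0]
    return counts, suspects

# Hypothetical logs from three operators for one circuit.
logs = {
    "operator-A": [],
    "operator-B": list(range(1000, 1042)),  # 42 consecutive unavailable-seconds
    "operator-C": [5000],                   # unrelated event outside the window
}

counts, suspects = attribute_fault(logs, start=900, end=1100)
print(counts)
print(suspects)
```

Only the segment whose counters show an excursion inside the disputed window is flagged; events outside the window do not count against an operator.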
Key Insight
Without TCM, an SLA dispute is “the customer’s word vs each operator’s word, and every operator denies fault”. With TCM, the disagreement is reduced to verifying the timestamps and data integrity of the TCM exports — a far more bounded problem. This is why every wholesale optical contract above ~10 Gbit/s now mandates TCM provisioning.
Availability vs Reliability vs Performance
Three closely-related metrics that customers and engineers routinely conflate:
| Metric | What it measures | When to choose |
|---|---|---|
| Availability | Fraction of time the service is “up” by some criterion | Customer SLA, billing, headline |
| Reliability | Probability of operating without failure for a stated duration | Component selection, lifecycle planning |
| Performance | Quality of the service while it is up (BER, latency, packet loss) | Application QoE, premium-service tiering |
A circuit can be five-nines available but with 5 % packet loss for the available time — high availability, low performance. A circuit can be perfectly performant when up but fail twice a year for two hours each — good performance, mediocre availability. Premium SLAs increasingly specify all three, with separate remedies for breaches of each.
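The difference between reliability and availability shows up clearly in numbers. The sketch below assumes a constant failure rate (exponential model), which is a common simplification rather than a property of any particular component; the MTBF and MTTR values are illustrative.

```python
import math

def reliability(duration_h: float, mtbf_h: float) -> float:
    """Probability of no failure over duration_h, ASSUMING a constant failure rate."""
    return math.exp(-duration_h / mtbf_h)

def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state fraction of time the item is up."""
    return mtbf_h / (mtbf_h + mttr_h)

mtbf_h = 300_000          # illustrative component MTBF
five_years_h = 5 * 8760   # 43 800 hours

print(f"reliability over 5 years : {reliability(five_years_h, mtbf_h):.3f}")
print(f"availability (MTTR = 4 h): {availability(mtbf_h, 4):.7f}")
```

The same component is better than four-nines available, yet has a noticeable chance of failing at least once in five years, which is why component selection and customer SLAs use different metrics.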
SLA-Tier Examples
| Tier | Availability target | Restoration-time target | Protection architecture | Typical credit / remedy |
|---|---|---|---|---|
| Bronze | 99.9 % (8.76 h/yr) | <30 min restoration after fibre cut | Single-path, unprotected | 5 % monthly credit per outage hour beyond budget |
| Silver | 99.95 % (4.38 h/yr) | <5 min restoration | Optical-layer protected (1+1 or OUPSR) | 10 % credit per breach hour, capped at 100 % monthly fee |
| Gold | 99.99 % (52.6 min/yr) | <50 ms switch | Hard-wired 1+1 + diverse routing + SRLG-validated | 25 % credit per breach event |
| Platinum | 99.999 % (5.26 min/yr) | <50 ms switch | Dual-vendor diversity + 1+1 + SRLG-validated + diverse landings | 50 % credit per breach event, contract exit option after repeat |
| Six-nines | 99.9999 % (31.5 s/yr) | <50 ms with sub-50 ms re-converge | Triple-redundant, multi-domain | Bespoke; usually only critical control-plane or financial |
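As an illustration of how a credit clause turns into arithmetic, the sketch below models the Silver row: 10 % credit per breach hour, capped at 100 % of the monthly fee. The monthly budget of 0.36 h (99.95 % over a 30-day month) and the rounding of partial breach hours up to the next whole hour are assumptions; an actual contract defines both.

```python
import math

def silver_credit_pct(outage_h: float,
                      monthly_budget_h: float = 0.36,
                      pct_per_breach_hour: float = 10.0,
                      cap_pct: float = 100.0) -> float:
    """Credit as a percentage of the monthly fee for one month's outage time.

    ASSUMPTIONS: budget is 99.95 % of a 30-day month (0.36 h), and each
    started breach hour counts as a full hour.
    """
    breach_h = max(0.0, outage_h - monthly_budget_h)
    return min(cap_pct, pct_per_breach_hour * math.ceil(breach_h))

for outage in (0.2, 3.0, 20.0):
    print(f"{outage:5.1f} h outage -> {silver_credit_pct(outage):5.1f} % credit")
```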
Rule of Thumb
The cost of each additional “nine” roughly doubles. Going from 99.9 % to 99.99 % requires optical-layer protection; from 99.99 % to 99.999 % requires SRLG-validated diversity and rigorous SLA-test discipline; from 99.999 % to 99.9999 % requires dual-vendor diversity and dual-domain routing. Customers asking for “six-nines” almost always mean “as available as you can make it” and will accept a five-nines product with the right pricing — most operators publish only up to five-nines in their service catalogue.
See Also
- 04-otn-sdh-and-network-design
- 11-optical-layer-protection-and-restoration
- 13-optical-test-measurement-and-commissioning
References
Standards (ITU-T)
- ITU-T G.709/Y.1331 — Interfaces for the Optical Transport Network (06/2020). https://www.itu.int/rec/T-REC-G.709
- ITU-T G.7710/Y.1701 — Common equipment management function requirements (10/2020). https://www.itu.int/rec/T-REC-G.7710
- ITU-T G.911 — Parameters and calculation methodologies for reliability and availability of fibre optic systems. https://www.itu.int/rec/T-REC-G.911
- ITU-T G.827 — Availability performance parameters and objectives for end-to-end international constant bit-rate digital paths. https://www.itu.int/rec/T-REC-G.827
- ITU-T M.2110 — Bringing into service of international multi-operator paths, sections and transmission systems. https://www.itu.int/rec/T-REC-M.2110
Standards (IEC / IEEE)
- IEC 61703 — Mathematical expressions for reliability, availability, maintainability and maintenance support terms. https://webstore.iec.ch/
- IEC 61025 — Fault Tree Analysis (FTA). https://webstore.iec.ch/
- IEEE 493 — Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems (Gold Book). https://standards.ieee.org/
Standards (IETF)
- RFC 4427 — Recovery (Protection and Restoration) Terminology for Generalized Multi-Protocol Label Switching (GMPLS) (03/2006). https://www.rfc-editor.org/rfc/rfc4427
- RFC 4428 — Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based Recovery Mechanisms (03/2006). https://www.rfc-editor.org/rfc/rfc4428
Books
- J.-P. Vasseur, M. Pickavet, P. Demeester, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS, Morgan Kaufmann, 2004.
- R. Ramaswami, K. N. Sivarajan, G. H. Sasaki, Optical Networks: A Practical Perspective, 3rd ed., Morgan Kaufmann, 2009.
- P. D. T. O’Connor, A. Kleyner, Practical Reliability Engineering, 5th ed., Wiley, 2012.