Optical SLAs, Availability, and Operational Reality

Five-nines availability is a number a salesperson writes on a contract; what an engineer sees is a budget of roughly five minutes of unavailability per year, distributed across components with vastly different failure characteristics, and a Mean-Time-To-Repair dominated not by component reliability but by how fast a fibre crew can find and splice a cut. This chapter quantifies the math behind common SLA tiers, walks the MTBF / MTTR breakdown of an end-to-end circuit, addresses the common pitfalls of multi-cause failure modelling (independent vs correlated, shared SRLG), and explains how TCM evidence is used in carrier-to-carrier SLA disputes.

Concept	What it says
Five-nines arithmetic	99.999 % availability = ~5.26 minutes of unavailability per year. Four-nines is ~52.6 minutes; three-nines is ~8.76 hours. Cost roughly doubles with each additional “nine”, so there is no commercial reason to target more availability than the service needs.
MTBF vs MTTR	Component reliability sets how often a failure occurs (Mean Time Between Failures); operational response sets how long it lasts (Mean Time To Repair). For fibre, MTTR dominates: cuts are rare per km (arrivals are roughly Poisson), but each takes hours to fix.
SRLG — Shared Risk Link Group	A set of links that share a physical risk (same duct, same conduit, same landing station, same power feed). Two paths that look topologically diverse may collapse to a single failure at the SRLG layer.
TCM-based SLA enforcement	Tandem Connection Monitoring data is contractual currency in carrier-to-carrier disputes. Per-segment BIP-8 error counters and unavailable-second logs unambiguously assign fault to a specific operator’s segment.

Five-Nines Math

Annual unavailability budgets follow directly from the percentage:

SLA target	Annual unavailability	Monthly	Weekly	Common contractual usage
99 %	87.6 hours	7.2 hours	1.68 hours	Best-effort consumer
99.9 % (3-nines)	8.76 hours	43.2 minutes	10.1 minutes	Business broadband
99.95 %	4.38 hours	21.6 minutes	5.04 minutes	Mid-tier business
99.99 % (4-nines)	52.6 minutes	4.32 minutes	1.01 minutes	Premium business / leased lambda
99.999 % (5-nines)	5.26 minutes	25.9 seconds	6.05 seconds	Carrier core, financial connectivity
99.9999 % (6-nines)	31.5 seconds	2.59 seconds	0.605 seconds	Mostly aspirational; some critical SS7 / control
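
The arithmetic behind the table can be reproduced in a few lines. The following is a minimal sketch, not a reference implementation: the function name `downtime_budget` is hypothetical, and the window lengths (365-day year, 30-day month, 7-day week) are assumptions that reproduce the table's three-nines to six-nines rows.

```python
# Window lengths in seconds (assumed: 365-day year, 30-day month, 7-day week).
WINDOWS_S = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def downtime_budget(availability_pct: float) -> dict:
    """Return the allowed unavailability in seconds for each window."""
    unavailability = 1.0 - availability_pct / 100.0
    return {name: unavailability * seconds for name, seconds in WINDOWS_S.items()}


for pct in (99.9, 99.99, 99.999):
    budget = downtime_budget(pct)
    print(f"{pct} %: {budget['year'] / 60:.2f} min/yr, "
          f"{budget['month']:.1f} s/month, {budget['week']:.2f} s/week")
```

Changing the assumed month length (for example to 365/12 days) shifts the monthly column slightly, which is one more reason a contract should state its measurement window explicitly.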

Key Insight

An SLA tier is meaningful only when the measurement window and fault definition are specified. “99.999 % over a calendar month”, with faults defined as “more than 1 second of unavailability”, is very different from “99.999 % over a rolling calendar year, excluding outages shorter than 1 minute”; the difference can be a factor of 10 in remedy payouts.
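
To see how much the fault definition matters, the sketch below counts the same outage log under two different exclusion thresholds. The outage durations are invented for illustration, and `counted_downtime` is a hypothetical helper; whether the threshold comparison is inclusive or exclusive is itself a contract detail, assumed inclusive here.

```python
def counted_downtime(outages_s, min_outage_s):
    """Sum only the outages long enough to count under the SLA (threshold inclusive)."""
    return sum(d for d in outages_s if d >= min_outage_s)


# Hypothetical outage log for one measurement window, in seconds.
outages_s = [0.8, 2.0, 45.0, 50.0, 300.0]

strict = counted_downtime(outages_s, min_outage_s=1.0)    # 1 s fault definition
lenient = counted_downtime(outages_s, min_outage_s=60.0)  # outages under 1 min excluded
print(f"1 s threshold:  {strict:.0f} s counted")
print(f"60 s threshold: {lenient:.0f} s counted")
```

With this made-up log, the short protection-switch style events account for roughly a quarter of the counted downtime under the strict definition and for none of it under the lenient one.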

Allocating the Budget Across Components

The 5.26 min/year five-nines budget is the end-to-end target for an optical circuit. Internally it is allocated across the span, amplifier, ROADM, and transponder layers, each given a sub-budget; the sub-budgets, added together as unavailabilities, must total no more than the 5.26 minutes.

Layer	Typical sub-budget (5-nines target)	Dominant failure mechanism
Fibre span (per 80 km)	~1.5 min / yr	Cable cut (rare per km, high MTTR)
EDFA / OLA (per amp)	~0.3 min / yr	Pump-laser failure, AGC fault
ROADM (per node)	~0.5 min / yr	WSS failure, control-card reboot
Coherent transponder (per end)	~0.5 min / yr	DSP / line-side optic failure
Power, environmental (per site)	~1 min / yr	Power glitch, cooling event
Software / config error	~1 min / yr	The most under-budgeted item — often dominates real-world outage
Sum	~5 min / yr

Summing unavailabilities is slightly conservative: it treats failures as uncorrelated and counts any overlapping outages twice. It is also a useful sanity check: if any single layer’s design budget exceeds the end-to-end target, the design is broken.
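
A budget check of this kind is easy to script. The sketch below uses the per-element figures from the table; the element counts for the two example circuits are hypothetical, and the names `SUB_BUDGET_MIN` and `total_unavailability` are illustrative only.

```python
TARGET_MIN_PER_YEAR = 5.26  # five-nines end-to-end budget

# Per-element sub-budgets in minutes per year, taken from the table above.
SUB_BUDGET_MIN = {
    "fibre_span_80km": 1.5,
    "edfa": 0.3,
    "roadm": 0.5,
    "transponder": 0.5,
    "power_env_site": 1.0,
    "software_config": 1.0,
}


def total_unavailability(counts: dict) -> float:
    """Sum per-element budgets weighted by how many of each the circuit contains."""
    return sum(SUB_BUDGET_MIN[name] * n for name, n in counts.items())


# Hypothetical circuits: the element counts are illustrative, not from the source.
one_of_each = {name: 1 for name in SUB_BUDGET_MIN}
three_spans = {**one_of_each, "fibre_span_80km": 3, "edfa": 2, "roadm": 2, "transponder": 2}

for label, counts in (("one of each", one_of_each), ("three spans", three_spans)):
    total = total_unavailability(counts)
    verdict = "within" if total <= TARGET_MIN_PER_YEAR else "exceeds"
    print(f"{label}: {total:.1f} min/yr ({verdict} the {TARGET_MIN_PER_YEAR} min target)")
```

The second circuit shows why per-element budgets cannot be reused unchanged as a route grows: with more spans and nodes, each element has to be held to a tighter figure, or the circuit needs protection.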

MTBF and MTTR

MTBF (Mean Time Between Failures) describes how often a component fails; MTTR (Mean Time To Repair) describes how long the failure persists. Availability is then:

A = MTBF / (MTBF + MTTR)

For component-driven failures (transponders, ROADMs, EDFAs), MTBF dominates because MTTR is fast (hot-swap, sometimes minutes for an in-service spare). For fibre cuts, MTTR dominates because the fibre is in the ground, far from the operator, and physical-locate + splice + retest is hours.
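
The formula can be exercised directly. In the sketch below, the transponder MTBF comes from the low end of the reference range given later in this chapter, while the 0.5 h hot-swap MTTR and the fibre figures (two cuts per year, 6 h repair) are illustrative assumptions; the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)


def downtime_min_per_year(mtbf_h: float, mttr_h: float) -> float:
    """Expected unavailable minutes per year implied by A."""
    return (1.0 - availability(mtbf_h, mttr_h)) * HOURS_PER_YEAR * 60.0


# Transponder: long MTBF, short MTTR (assumed 0.5 h hot-swap).
print(f"transponder: {downtime_min_per_year(250_000, 0.5):.2f} min/yr")

# Fibre route: assumed two cuts per year (MTBF = 4380 h) and 6 h repair.
print(f"fibre route: {downtime_min_per_year(4_380, 6) / 60:.1f} h/yr")
```

The two results differ by almost three orders of magnitude, which is the quantitative form of the statement that MTTR dominates for fibre.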

Component MTBF Reference

Component	Typical MTBF	Notes
EDFA / OLA	100 000 – 300 000 hours (~11–34 years)	Pump laser is the dominant failure mode; modern dual-pump designs improve substantially
Coherent transponder (line card)	250 000 – 500 000 hours	DSP ASIC and line-side optics; pluggable form factors trend lower than embedded
WSS module (in ROADM)	200 000 – 500 000 hours	LCoS or MEMS — both robust, settling time more often the operational issue
ROADM control card	150 000 – 300 000 hours	Software is the riskier factor, not hardware
OTN switch fabric card	200 000 – 400 000 hours	Industry-standard reliability
Optical fibre cable (per km)	~150 000 – 250 000 hours for a 1 km length (cut rate); MTBF falls in proportion to route length	Highly geography-dependent; urban with high construction is worse
Coherent pluggable (CFP2-DCO, QSFP-DD ZR/ZR+)	300 000 – 500 000 hours	Improving as silicon photonics matures
Submarine wet-plant repeater	25-year design life, ~99 % chain-survival target	Cannot be repaired — extreme over-engineering, dual-redundant pumps

Warning

MTBF figures from datasheets are typically calculated under controlled lab conditions (25 °C, no thermal cycling, no humidity excursion). Field MTBF is often 50–80 % of datasheet, especially for pluggable optics deployed in unconditioned outdoor cabinets. Always derate vendor MTBF when budgeting against an SLA.
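
A derating step might look like the following sketch. The 50–80 % factors are the range quoted above; the 400 000 h datasheet value and the 1000-unit fleet are hypothetical, and the expected-failure estimate assumes a constant failure rate.

```python
HOURS_PER_YEAR = 8760.0


def derated_mtbf(datasheet_mtbf_h: float, field_factor: float) -> float:
    """Scale a datasheet MTBF by a field derating factor in (0, 1]."""
    if not 0.0 < field_factor <= 1.0:
        raise ValueError("field_factor must be in (0, 1]")
    return datasheet_mtbf_h * field_factor


def expected_failures_per_year(mtbf_h: float, units: int) -> float:
    """Expected failures per year across a fleet, assuming a constant failure rate."""
    return units * HOURS_PER_YEAR / mtbf_h


datasheet_h = 400_000  # hypothetical pluggable datasheet figure
for factor in (1.0, 0.8, 0.5):
    mtbf = derated_mtbf(datasheet_h, factor)
    failures = expected_failures_per_year(mtbf, units=1000)
    print(f"factor {factor:.1f}: MTBF {mtbf:,.0f} h, ~{failures:.1f} failures/yr per 1000 units")
```

At the pessimistic end of the range the expected failure count doubles, which affects spares holdings as much as it affects the SLA budget.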

MTTR Breakdown for Fibre Cuts

A fibre cut MTTR is not a single number — it is the sum of a sequence of operational steps, each variable in duration:

Step	Typical duration	Dominant variable
1. Alarm to dispatch decision	5–30 min	NOC procedure, on-call response
2. Crew dispatch + travel to suspect area	30 min – 4 h	Distance, traffic, time of day
3. Physical locate (OTDR-from-each-end + visual)	30 min – 2 h	Cable length, accessibility, conduit complexity
4. Excavate / access (if buried)	1 h – many hours	Burial depth, surface (asphalt vs grass), permits
5. Splice (per fibre, on a high-count cable)	~5 min per fibre	Fibre count — a 144-fibre cable is hours of splicing
6. Retest (OTDR pre/post) and document	30 min
7. Service-restoration verification, alarms cleared	15 min
Total typical MTTR	4–12 h for accessible terrestrial; 24+ h for difficult terrain

The 4-hour fast case assumes a daytime cut on an accessible suburban cable with a nearby crew; a buried cable in a winter blizzard at 0300 with a remote excavation is 24+ hours.
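
The step table can be turned into a rough MTTR range. This is a sketch under stated assumptions: the 8 h upper bound for excavation stands in for the table's "many hours", and `fibres_to_splice` is a hypothetical parameter counting only the fibres spliced before service is restored.

```python
# (step, min_minutes, max_minutes) from the table above. The excavation upper
# bound of 480 min is an assumption standing in for "many hours".
STEPS = [
    ("alarm to dispatch", 5, 30),
    ("dispatch and travel", 30, 240),
    ("physical locate", 30, 120),
    ("excavate / access", 60, 480),
    ("retest and document", 30, 30),
    ("restoration verification", 15, 15),
]
SPLICE_MIN_PER_FIBRE = 5


def mttr_range_h(fibres_to_splice: int) -> tuple:
    """Return (best-case, worst-case) MTTR in hours for a given splice count."""
    splice_min = fibres_to_splice * SPLICE_MIN_PER_FIBRE
    best = sum(step[1] for step in STEPS) + splice_min
    worst = sum(step[2] for step in STEPS) + splice_min
    return best / 60.0, worst / 60.0


for fibres in (12, 144):
    best, worst = mttr_range_h(fibres)
    print(f"{fibres} fibres: {best:.1f} h to {worst:.1f} h")
```

Under these assumptions a full 144-fibre splice adds 12 hours on its own, so the fast end of the quoted range is only reachable when the affected service's fibres are spliced first.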

Rule of Thumb

Fibre cuts dominate annual unavailability for a circuit that is unprotected at the optical layer. Two fibre cuts per year × 6 hours each = 12 hours unavailable, or about 99.86 % availability, which falls short of even three-nines without protection. Optical protection (1+1 or OUPSR) is what turns fibre cuts with hours of MTTR into a five-nines product.
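
The effect of protection on this arithmetic can be sketched as follows. The calculation assumes two fully independent paths (no shared SRLG), identical cut statistics on each, and an ideal protection switch; all three are simplifying assumptions, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 8760.0
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60.0


def path_unavailability(cuts_per_year: float, mttr_h: float) -> float:
    """Fraction of the year a single path is down due to fibre cuts."""
    return cuts_per_year * mttr_h / HOURS_PER_YEAR


u_single = path_unavailability(cuts_per_year=2, mttr_h=6)

# With 1+1 protection the circuit is down only while BOTH paths are down.
# For independent paths that probability is the product of the two.
u_protected = u_single ** 2

print(f"unprotected: {100 * (1 - u_single):.3f} % ({u_single * HOURS_PER_YEAR:.0f} h/yr)")
print(f"1+1 protected: {100 * (1 - u_protected):.5f} % "
      f"({u_protected * MINUTES_PER_YEAR:.2f} min/yr)")
```

Under these idealised assumptions the protected circuit sits at roughly one minute per year from fibre cuts, which leaves room in a five-nines budget for the other layers. Any correlation between the two paths erodes that margin quickly.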

Dual-Vendor Diversity

A common availability strategy is to deploy two vendors’ equipment in parallel — vendor A as primary, vendor B as protection. The intent is to immunise against systemic faults: a software defect in vendor A’s release, a firmware bug, an EOL/deprecation cascade, a supply-chain disruption, or even a shared design flaw across vendor A’s product line.

When does it pay off?

Scenario	Dual-vendor diversity payoff
Vendor A has a recurrent control-plane software defect	High — vendor B keeps the network up while A debugs
Single vendor announces critical security issue	High — vendor B is unaffected (typically)
Hardware fleet recall	High
Routine card failure	None — single-vendor + spare is sufficient
Routine fibre cut	None — physical layer is vendor-agnostic
Long-term cost, training, spares logistics	Negative — dual-vendor doubles operational complexity

Warning

Dual-vendor is not a no-brainer. The added operational complexity (two NMS, two configuration languages, two on-call rotations, two parts inventories) easily eats more time per year than the systemic-fault avoidance saves. Only deploy dual-vendor where the regulatory / customer requirement is explicit, or where past systemic faults have justified the discipline empirically. For most operators, single-vendor with strong vendor escalation is cheaper and equally available.

Multi-Cause Failure Modelling

Two paths are only as redundant as their shared risks allow. Six categories of correlated failure routinely catch designs that look diverse on a topology map:

Correlation source	Concrete example	Mitigation
Shared duct / conduit	Two “diverse” fibres travelling in the same physical duct or under the same bridge	SRLG-aware route engineering; physical site survey
Shared landing station	Both submarine cables enter through the same beach manhole	Geographic diversity, dual landing stations on different beaches
Shared power feed	Both ROADM nodes share a single A/B utility feed	Independent A/B feeds + UPS + generator on each
Shared software / firmware	Both transponders running the same image with the same defect	Staggered deployments, canary testing, dual-vendor
Shared atmospheric / environmental	Both microwave backups in the same storm cell	Diverse media (fibre + microwave is true diversity, fibre + fibre on the same path is not)
Shared operations team	Both paths reconfigured by the same engineer in the same maintenance window	Change-management discipline, separate windows for the two paths
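
An SRLG check reduces to a set intersection once each link is tagged with its risk groups. The sketch below is illustrative only: the link names, SRLG identifiers, and the `LINK_SRLGS` mapping are invented, and a real inventory would come from a route-engineering database or site survey.

```python
from itertools import chain

# Hypothetical mapping from link name to the SRLG identifiers it belongs to.
LINK_SRLGS = {
    "A1": {"duct-17", "site-north-power"},
    "A2": {"bridge-river"},
    "B1": {"duct-42", "site-south-power"},
    "B2": {"bridge-river"},  # shares the river crossing with A2
}


def path_srlgs(path):
    """Union of all SRLGs touched by the links of a path."""
    return set(chain.from_iterable(LINK_SRLGS[link] for link in path))


def shared_srlgs(path_a, path_b):
    """SRLGs common to both paths; an empty set means SRLG-diverse."""
    return path_srlgs(path_a) & path_srlgs(path_b)


working = ["A1", "A2"]
protection = ["B1", "B2"]
print(sorted(shared_srlgs(working, protection)))
```

The two paths share no link, yet the check flags the common bridge crossing, which is exactly the kind of collapse a topology map hides.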

Fault-Tree Analysis

Fault-tree analysis (FTA) decomposes a top-level event (“circuit X is down”) into contributing leaves connected by AND/OR gates. The diagram below illustrates a simplified FTA for a protected optical circuit:

flowchart TD
    TOP["Circuit DOWN"]
    AND1["AND<br/>both paths fail"]
    OR_A["OR<br/>working path failure"]
    OR_B["OR<br/>protection path failure"]
    PROT_FAIL["AND<br/>protection mechanism<br/>also fails"]
    FIBER_A["Fibre cut A"]
    AMP_A["EDFA fail A"]
    TX_A["Transponder fail A"]
    CONFIG["Config error"]
    FIBER_B["Fibre cut B"]
    AMP_B["EDFA fail B"]
    TX_B["Transponder fail B"]
    SELECTOR["Selector / APS<br/>fail"]
    SRLG_HIT["SRLG cut<br/>both A and B<br/>same duct"]
    TOP --> SRLG_HIT
    TOP --> AND1
    TOP --> CONFIG
    AND1 --> OR_A
    AND1 --> OR_B
    OR_A --> FIBER_A
    OR_A --> AMP_A
    OR_A --> TX_A
    OR_B --> FIBER_B
    OR_B --> AMP_B
    OR_B --> TX_B
    TOP --> PROT_FAIL
    PROT_FAIL --> OR_A
    PROT_FAIL --> SELECTOR
    style TOP fill:#E24B4A,stroke:#A32D2D,color:#fff
    style AND1 fill:#D85A30,stroke:#993C1D,color:#fff
    style PROT_FAIL fill:#D85A30,stroke:#993C1D,color:#fff
    style SRLG_HIT fill:#D85A30,stroke:#993C1D,color:#fff
    style CONFIG fill:#BA7517,stroke:#854F0B,color:#fff
    style OR_A fill:#7F77DD,stroke:#534AB7,color:#fff
    style OR_B fill:#7F77DD,stroke:#534AB7,color:#fff
    style FIBER_A fill:#1D9E75,stroke:#0F6E56,color:#fff
    style FIBER_B fill:#1D9E75,stroke:#0F6E56,color:#fff
    style AMP_A fill:#1D9E75,stroke:#0F6E56,color:#fff
    style AMP_B fill:#1D9E75,stroke:#0F6E56,color:#fff
    style TX_A fill:#378ADD,stroke:#185FA5,color:#fff
    style TX_B fill:#378ADD,stroke:#185FA5,color:#fff
    style SELECTOR fill:#7F77DD,stroke:#534AB7,color:#fff

The top event “Circuit DOWN” can occur from independent simultaneous failures on both paths (AND), from an SRLG hit affecting both paths simultaneously (a single event), from a configuration error, or from a protection-mechanism failure during a single-path event.

The FTA’s value is exposing the OR branches that look like ANDs: a fibre cut in an SRLG-shared duct is a single event that takes both paths down, even though the topology shows two paths. An SRLG-shared duct is not the AND of (cut A) and (cut B); it is its own independent leaf in the tree.
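
A fault tree of this shape can be evaluated numerically by combining leaf unavailabilities through AND and OR gates. The sketch below uses invented leaf values and treats all branches as independent, which is only an approximation (the both-paths branch and the protection-failure branch share the working path); the names `p_or`, `p_and`, and `LEAF` are hypothetical.

```python
def p_or(*probs):
    """Probability that at least one of several independent events occurs."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none


def p_and(*probs):
    """Probability that all of several independent events occur together."""
    result = 1.0
    for p in probs:
        result *= p
    return result


# Hypothetical steady-state unavailabilities per leaf (illustrative values).
LEAF = {"fibre": 1.4e-3, "edfa": 1e-5, "transponder": 2e-6,
        "selector": 1e-6, "srlg": 1e-5, "config": 2e-6}

path_a = p_or(LEAF["fibre"], LEAF["edfa"], LEAF["transponder"])
path_b = p_or(LEAF["fibre"], LEAF["edfa"], LEAF["transponder"])
both_paths = p_and(path_a, path_b)             # independent double failure
prot_fail = p_and(path_a, LEAF["selector"])    # single-path event, selector fails too
top = p_or(both_paths, LEAF["srlg"], LEAF["config"], prot_fail)

print(f"both paths: {both_paths:.2e}, SRLG: {LEAF['srlg']:.2e}, top: {top:.2e}")
```

With these made-up numbers the single SRLG leaf contributes several times more unavailability than the independent double failure, which illustrates why it has to appear as its own leaf.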

TCM Evidence in Carrier-to-Carrier Disputes

When a multi-segment circuit underperforms its SLA, every operator in the chain is incentivised to claim the fault is somewhere else. TCM (see 13-optical-test-measurement-and-commissioning) provides per-segment performance data that turns a multi-party dispute into a fact-finding exercise.

The contractual mechanism:

Each operator’s TCM endpoints log per-second and per-15-min counters: BIP-8 errors, errored-seconds, severely-errored-seconds, unavailable-seconds.
Master Service Agreements between operators specify TCM data export formats (typically G.7710 PM, NETCONF telemetry, or vendor-specific exports normalised to a shared schema) and SLA-test windows.
When the customer escalates, all operators provide TCM logs for the disputed window. The operator whose TCM segment shows the matching error excursion owns the fault.
SLA credits flow from the at-fault operator(s) according to the per-segment SLA contract.
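
The attribution step can be sketched as a simple aggregation over per-segment logs. Everything in this example is hypothetical: the `TcmRecord` fields, the epoch-second bin format, and the operator names are invented for illustration and do not correspond to any specific export schema.

```python
from dataclasses import dataclass


@dataclass
class TcmRecord:
    operator: str
    interval_start: int        # start of the 15-minute bin, epoch seconds (assumed format)
    unavailable_seconds: int
    bip8_errors: int


def attribute_fault(records, window_start, window_end):
    """Total unavailable seconds per operator for bins starting in [window_start, window_end)."""
    totals = {}
    for rec in records:
        if window_start <= rec.interval_start < window_end:
            totals[rec.operator] = totals.get(rec.operator, 0) + rec.unavailable_seconds
    # Keep only the operators whose segment actually logged unavailability.
    return {op: uas for op, uas in totals.items() if uas > 0}


# Hypothetical exports from three operators covering two 15-minute bins.
records = [
    TcmRecord("carrier-a", 0, 0, 0),
    TcmRecord("carrier-b", 0, 420, 18_350),
    TcmRecord("carrier-c", 0, 0, 0),
    TcmRecord("carrier-b", 900, 0, 0),
]
print(attribute_fault(records, window_start=0, window_end=1800))
```

In practice the hard part is upstream of this function: agreeing on clock synchronisation, bin boundaries, and the integrity of each operator's export.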

Key Insight

Without TCM, an SLA dispute is “the customer’s word vs each operator’s word, and every operator denies fault”. With TCM, the disagreement is reduced to verifying the timestamps and data integrity of the TCM exports — a far more bounded problem. This is why every wholesale optical contract above ~10 Gbit/s now mandates TCM provisioning.

Availability vs Reliability vs Performance

Three closely-related metrics that customers and engineers routinely conflate:

Metric	What it measures	When to choose
Availability	Fraction of time the service is “up” by some criterion	Customer SLA, billing, headline
Reliability	Probability of operating without failure for a stated duration	Component selection, lifecycle planning
Performance	Quality of the service while it is up (BER, latency, packet loss)	Application QoE, premium-service tiering

A circuit can be five-nines available but with 5 % packet loss for the available time — high availability, low performance. A circuit can be perfectly performant when up but fail twice a year for two hours each — good performance, mediocre availability. Premium SLAs increasingly specify all three, with separate remedies for breaches of each.
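
The difference between reliability and availability shows up clearly in numbers. The sketch below assumes a constant failure rate, so reliability follows the exponential model R(t) = exp(-t / MTBF); the 200 000 h MTBF and 4 h MTTR are illustrative values, not figures from a datasheet.

```python
import math

HOURS_PER_YEAR = 8760.0


def reliability(t_h: float, mtbf_h: float) -> float:
    """Probability of surviving t_h hours without failure (constant failure rate)."""
    return math.exp(-t_h / mtbf_h)


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state fraction of time the component is up."""
    return mtbf_h / (mtbf_h + mttr_h)


mtbf_h, mttr_h = 200_000.0, 4.0  # illustrative values
print(f"reliability over one year: {reliability(HOURS_PER_YEAR, mtbf_h):.4f}")
print(f"availability:              {availability(mtbf_h, mttr_h):.6f}")
```

The same component has roughly a 4 % chance of failing at least once in a year yet is available more than 99.99 % of the time, because each failure is short relative to the interval between failures.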

SLA-Tier Examples

Tier	Availability target	Restoration-time target	Protection architecture	Typical credit / remedy
Bronze	99.9 % (8.76 h/yr)	<30 min restoration after fibre cut	Single-path, unprotected	5 % monthly credit per outage hour beyond budget
Silver	99.95 % (4.38 h/yr)	<5 min restoration	Optical-layer protected (1+1 or OUPSR)	10 % credit per breach hour, capped at 100 % monthly fee
Gold	99.99 % (52.6 min/yr)	<50 ms switch	Hard-wired 1+1 + diverse routing + SRLG-validated	25 % credit per breach event
Platinum	99.999 % (5.26 min/yr)	<50 ms switch	Dual-vendor diversity + 1+1 + SRLG-validated + diverse landings	50 % credit per breach event, contract exit option after repeat
Six-nines	99.9999 % (31.5 s/yr)	<50 ms with sub-50 ms re-converge	Triple-redundant, multi-domain	Bespoke; usually only critical control-plane or financial
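
As an illustration of how a remedy clause turns into arithmetic, the sketch below computes a Silver-tier credit. The 10 % per breach hour and 100 % cap come from the table; the 21.6-minute monthly budget follows from 99.95 % over a 30-day month. Rounding partial breach hours up is an assumption, since real contracts define this differently, and `silver_credit_pct` is a hypothetical name.

```python
import math


def silver_credit_pct(downtime_min: float, budget_min: float = 21.6,
                      pct_per_hour: float = 10.0, cap_pct: float = 100.0) -> float:
    """Monthly credit as a percentage of the fee for a Silver-tier circuit."""
    excess_min = max(0.0, downtime_min - budget_min)
    # Assumption: any started hour beyond the budget counts as a full breach hour.
    breach_hours = math.ceil(excess_min / 60.0)
    return min(cap_pct, breach_hours * pct_per_hour)


for downtime in (20, 90, 2000):
    print(f"{downtime} min down: {silver_credit_pct(downtime):.0f} % credit")
```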

Rule of Thumb

The cost of each additional “nine” roughly doubles. Going from 99.9 % to 99.99 % requires optical-layer protection; from 99.99 % to 99.999 % requires SRLG-validated diversity and rigorous SLA-test discipline; from 99.999 % to 99.9999 % requires dual-vendor diversity and dual-domain routing. Customers asking for “six-nines” almost always mean “as available as you can make it” and will accept a five-nines product with the right pricing — most operators publish only up to five-nines in their service catalogue.

References

Standards (ITU-T)

ITU-T G.709/Y.1331 — Interfaces for the Optical Transport Network (06/2020). https://www.itu.int/rec/T-REC-G.709
ITU-T G.7710/Y.1701 — Common equipment management function requirements (10/2020). https://www.itu.int/rec/T-REC-G.7710
ITU-T G.911 — Parameters and calculation methodologies for reliability and availability of fibre optic systems. https://www.itu.int/rec/T-REC-G.911
ITU-T G.827 — Availability performance parameters and objectives for end-to-end international constant bit-rate digital paths. https://www.itu.int/rec/T-REC-G.827
ITU-T M.2110 — Bringing into service of international multi-operator paths, sections and transmission systems. https://www.itu.int/rec/T-REC-M.2110

Standards (IEC / IEEE)

IEC 61703 — Mathematical expressions for reliability, availability, maintainability and maintenance support terms. https://webstore.iec.ch/
IEC 61025 — Fault Tree Analysis (FTA). https://webstore.iec.ch/
IEEE 493 — Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems (Gold Book). https://standards.ieee.org/

Standards (IETF)

RFC 4427 — Recovery (Protection and Restoration) Terminology for Generalized Multi-Protocol Label Switching (GMPLS) (03/2006). https://www.rfc-editor.org/rfc/rfc4427
RFC 4428 — Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based Recovery Mechanisms (03/2006). https://www.rfc-editor.org/rfc/rfc4428

Books

A. Farrel, J.-P. Vasseur, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS, Morgan Kaufmann, 2004.
R. Ramaswami, K. N. Sivarajan, G. H. Sasaki, Optical Networks: A Practical Perspective, 3rd ed., Morgan Kaufmann, 2009.
P. D. T. O’Connor, A. Kleyner, Practical Reliability Engineering, 5th ed., Wiley, 2012.
