Optical SLAs, Availability, and Operational Reality

Five-nines availability is a number a salesperson writes on a contract; what an engineer sees is a budget of roughly five minutes of unavailability per year, distributed across components with vastly different failure characteristics, and a Mean-Time-To-Repair dominated not by component reliability but by how fast a fibre crew can find and splice a cut. This chapter quantifies the math behind common SLA tiers, walks the MTBF / MTTR breakdown of an end-to-end circuit, addresses the common pitfalls of multi-cause failure modelling (independent vs correlated, shared SRLG), and explains how TCM evidence is used in carrier-to-carrier SLA disputes.

| Concept | What it says |
| --- | --- |
| Five-nines arithmetic | 99.999 % availability = ~5.26 minutes of unavailability per year. Four-nines is ~52.6 minutes; three-nines is ~8.76 hours. Cost roughly doubles with each additional “nine”, so there is no commercial reason to over-target. |
| MTBF vs MTTR | Component reliability sets how often a failure occurs (Mean Time Between Failures); operational response sets how long it lasts (Mean Time To Repair). For fibre, MTTR dominates: cuts are rare per km (roughly a Poisson process), but each takes hours to fix. |
| SRLG (Shared Risk Link Group) | A set of links that share a physical risk (same duct, same conduit, same landing station, same power feed). Two paths that look topologically diverse may collapse to a single failure at the SRLG layer. |
| TCM-based SLA enforcement | Tandem Connection Monitoring data is contractual currency in carrier-to-carrier disputes. Per-segment BIP-8 error counters and unavailable-second logs unambiguously assign fault to a specific operator’s segment. |

Five-Nines Math

Annual unavailability budgets follow directly from the percentage:

| SLA target | Annual unavailability | Monthly (30 days) | Weekly | Common contractual usage |
| --- | --- | --- | --- | --- |
| 99 % | 87.6 hours | 7.2 hours | 1.68 hours | Best-effort consumer |
| 99.9 % (3-nines) | 8.76 hours | 43.2 minutes | 10.1 minutes | Business broadband |
| 99.95 % | 4.38 hours | 21.6 minutes | 5.04 minutes | Mid-tier business |
| 99.99 % (4-nines) | 52.6 minutes | 4.32 minutes | 1.01 minutes | Premium business / leased lambda |
| 99.999 % (5-nines) | 5.26 minutes | 25.9 seconds | 6.05 seconds | Carrier core, financial connectivity |
| 99.9999 % (6-nines) | 31.5 seconds | 2.59 seconds | 0.605 seconds | Mostly aspirational; some critical SS7 / control |
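The arithmetic behind these budgets can be reproduced in a few lines. This is a minimal sketch, assuming a 365-day year, a 30-day month, and a 7-day week; the function name `downtime_budget_seconds` is illustrative and does not come from any standard.

```python
SECONDS_PER_DAY = 86_400


def downtime_budget_seconds(availability_pct: float, window_days: float) -> float:
    """Allowed unavailable seconds for an availability target over a window."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailability = 1.0 - availability_pct / 100.0
    return unavailability * window_days * SECONDS_PER_DAY


for target in (99.0, 99.9, 99.95, 99.99, 99.999, 99.9999):
    yearly = downtime_budget_seconds(target, 365)
    monthly = downtime_budget_seconds(target, 30)
    weekly = downtime_budget_seconds(target, 7)
    print(
        f"{target:>8} %  year={yearly / 60:10.2f} min  "
        f"month={monthly / 60:8.2f} min  week={weekly:10.2f} s"
    )
```

Changing `window_days` is the quickest way to see how much the choice of measurement window moves the budget.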

Key Insight

An SLA tier is meaningful only when the measurement window and fault definition are specified. “99.999 % over a calendar month”, with every outage longer than 1 second counted, is very different from “99.999 % over a rolling calendar year”, with outages shorter than 1 minute excluded. The difference can be a factor of 10 in remedy payouts.

Allocating the Budget Across Components

The 5.26 min/year five-nines budget is the end-to-end target for an optical circuit. Internally it is allocated across the span, amplifier, ROADM, and transponder layers, plus site power and software. Each gets a sub-budget, and the sub-budgets, added together as unavailabilities, must total no more than the 5.26 minutes.

| Layer | Typical sub-budget (5-nines target) | Dominant failure mechanism |
| --- | --- | --- |
| Fibre span (per 80 km) | ~1.5 min / yr | Cable cut (rare per km, high MTTR) |
| EDFA / OLA (per amp) | ~0.3 min / yr | Pump-laser failure, AGC fault |
| ROADM (per node) | ~0.5 min / yr | WSS failure, control-card reboot |
| Coherent transponder (per end) | ~0.5 min / yr | DSP / line-side optic failure |
| Power, environmental (per site) | ~1 min / yr | Power glitch, cooling event |
| Software / config error | ~1 min / yr | The most under-budgeted item; often dominates real-world outages |
| Sum | ~5 min / yr | |

The sum of unavailabilities is a conservative estimate for a series chain: it counts every layer’s outages separately and ignores any overlap between them. It is also a useful sanity check: if any single layer’s design budget exceeds the end-to-end target, the design is broken.
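The allocation check can be written down directly. The sketch below uses the sub-budgets from the table; the instance counts in the second list (three amplifiers, two ROADM nodes, two transponder ends) are a hypothetical circuit chosen only to show how per-instance budgets add up.

```python
FIVE_NINES_BUDGET_MIN = 5.26  # end-to-end target, minutes per year


def total_unavailability(rows):
    """Sum per-instance sub-budgets (min/yr), weighted by instance count."""
    return sum(per_instance * count for _name, per_instance, count in rows)


def check_allocation(rows, budget_min=FIVE_NINES_BUDGET_MIN):
    """Return (total, fits_budget, layers_that_alone_exceed_the_budget)."""
    total = total_unavailability(rows)
    offenders = [
        name for name, per_instance, count in rows
        if per_instance * count > budget_min
    ]
    return total, total <= budget_min, offenders


# One instance of each layer, as in the table.
single = [
    ("Fibre span (80 km)", 1.5, 1),
    ("EDFA / OLA", 0.3, 1),
    ("ROADM", 0.5, 1),
    ("Coherent transponder", 0.5, 1),
    ("Power / environmental", 1.0, 1),
    ("Software / config", 1.0, 1),
]

# Hypothetical circuit with more equipment on the same span.
longer = [
    ("Fibre span (80 km)", 1.5, 1),
    ("EDFA / OLA", 0.3, 3),
    ("ROADM", 0.5, 2),
    ("Coherent transponder", 0.5, 2),
    ("Power / environmental", 1.0, 1),
    ("Software / config", 1.0, 1),
]

print(check_allocation(single))
print(check_allocation(longer))
```

The second circuit overshoots the budget even though no single layer does, which is why the per-instance counts belong in the allocation rather than in a footnote.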

MTBF and MTTR

MTBF (Mean Time Between Failures) describes how often a component fails; MTTR (Mean Time To Repair) describes how long the failure persists. Availability is then:

A = MTBF / (MTBF + MTTR)

For component-driven failures (transponders, ROADMs, EDFAs), MTBF is the controlling term because MTTR is short (hot-swap, sometimes minutes with an in-service spare). For fibre cuts, MTTR is the controlling term: the fibre is in the ground, far from the operator, and locating, splicing, and retesting take hours.
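A short numeric comparison makes the contrast concrete. The inputs are assumptions for illustration: a line card with a 250 000-hour MTBF and a 2-hour swap, and a fibre route that is cut about twice a year with a 6-hour repair.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)


def annual_downtime_hours(mtbf_h: float, mttr_h: float) -> float:
    """Expected unavailable hours per year for one element."""
    return (1.0 - availability(mtbf_h, mttr_h)) * HOURS_PER_YEAR


# Assumed values, not vendor data.
card = annual_downtime_hours(mtbf_h=250_000, mttr_h=2)
fibre = annual_downtime_hours(mtbf_h=4_380, mttr_h=6)

print(f"line card:   {card * 60:6.2f} min/yr")
print(f"fibre route: {fibre:6.2f} h/yr")
```

Under these assumptions the card costs a few minutes a year and the fibre route costs about half a day, even though the card's repair is only three times faster.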

Component MTBF Reference

| Component | Typical MTBF | Notes |
| --- | --- | --- |
| EDFA / OLA | 100 000 – 300 000 hours (~11–34 years) | Pump laser is the dominant failure mode; modern dual-pump designs improve substantially |
| Coherent transponder (line card) | 250 000 – 500 000 hours | DSP ASIC and line-side optics; pluggable form factors trend lower than embedded |
| WSS module (in ROADM) | 200 000 – 500 000 hours | LCoS or MEMS; both are robust, and settling time is more often the operational issue |
| ROADM control card | 150 000 – 300 000 hours | Software is the riskier factor, not hardware |
| OTN switch fabric card | 200 000 – 400 000 hours | Industry-standard reliability |
| Optical fibre cable (per km) | ~150 000 – 250 000 hours between cuts for 1 km of cable; divide by route length in km to get the MTBF of a whole route | Highly geography-dependent; urban areas with heavy construction are worse |
| Coherent pluggable (CFP2-DCO, QSFP-DD ZR/ZR+) | 300 000 – 500 000 hours | Improving as silicon photonics matures |
| Submarine wet-plant repeater | 25-year design life, ~99 % chain-survival target | Cannot be repaired in service; extreme over-engineering, dual-redundant pumps |
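These per-component figures combine into a chain figure as follows. This is a sketch assuming independent failures, an unprotected series chain, and illustrative MTTR values (2 to 4 hours) that are not taken from the table; it also compares the exact product with the simpler sum-of-unavailabilities approximation.

```python
import math


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability of one element."""
    return mtbf_h / (mtbf_h + mttr_h)


def series_availability(components) -> float:
    """A series chain is up only if every element is up (independence assumed)."""
    return math.prod(availability(mtbf, mttr) for mtbf, mttr in components)


def summed_unavailability(components) -> float:
    """First-order approximation: add the individual unavailabilities."""
    return sum(1.0 - availability(mtbf, mttr) for mtbf, mttr in components)


# (MTBF hours, assumed MTTR hours) for a hypothetical unprotected chain.
chain = (
    [(250_000, 2)] * 2     # two transponders
    + [(200_000, 4)] * 2   # two WSS modules
    + [(100_000, 4)] * 3   # three EDFAs
)

exact = 1.0 - series_availability(chain)
approx = summed_unavailability(chain)
print(f"exact  unavailability: {exact:.9f}")
print(f"summed unavailability: {approx:.9f}")
print(f"annual downtime (summed): {approx * 8760 * 60:.1f} min")
```

The summed figure is never smaller than the exact one, and at these magnitudes the two differ only in the second-order term.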

Warning

MTBF figures from datasheets are typically calculated under controlled lab conditions (25 °C, no thermal cycling, no humidity excursion). Field MTBF is often 50–80 % of datasheet, especially for pluggable optics deployed in unconditioned outdoor cabinets. Always derate vendor MTBF when budgeting against an SLA.
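Derating is a one-line multiplication, but its effect on spares planning is easy to underestimate. The sketch below assumes a hypothetical fleet of 400 pluggables with a 300 000-hour datasheet MTBF and applies field factors of 80 % and 50 %.

```python
HOURS_PER_YEAR = 8760.0


def derated_mtbf(datasheet_mtbf_h: float, field_factor: float) -> float:
    """Scale a datasheet MTBF by a field factor in the range (0, 1]."""
    if not 0.0 < field_factor <= 1.0:
        raise ValueError("field_factor must be in (0, 1]")
    return datasheet_mtbf_h * field_factor


def expected_failures_per_year(mtbf_h: float, units: int) -> float:
    """Expected annual failures across a fleet of identical units."""
    return units * HOURS_PER_YEAR / mtbf_h


FLEET_SIZE = 400            # hypothetical fleet
DATASHEET_MTBF_H = 300_000  # hypothetical datasheet value

for factor in (1.0, 0.8, 0.5):
    mtbf = derated_mtbf(DATASHEET_MTBF_H, factor)
    failures = expected_failures_per_year(mtbf, FLEET_SIZE)
    print(f"factor {factor:.1f}: MTBF {mtbf:>9.0f} h, ~{failures:.1f} failures/yr")
```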

MTTR Breakdown for Fibre Cuts

A fibre cut MTTR is not a single number — it is the sum of a sequence of operational steps, each variable in duration:

| Step | Typical duration | Dominant variable |
| --- | --- | --- |
| 1. Alarm to dispatch decision | 5–30 min | NOC procedure, on-call response |
| 2. Crew dispatch + travel to suspect area | 30 min – 4 h | Distance, traffic, time of day |
| 3. Physical locate (OTDR-from-each-end + visual) | 30 min – 2 h | Cable length, accessibility, conduit complexity |
| 4. Excavate / access (if buried) | 1 h – many hours | Burial depth, surface (asphalt vs grass), permits |
| 5. Splice (per fibre, on a high-count cable) | ~5 min per fibre | Fibre count: a 144-fibre cable is hours of splicing |
| 6. Retest (OTDR pre/post) and document | 30 min | |
| 7. Service-restoration verification, alarms cleared | 15 min | |
| Total typical MTTR | 4–12 h for accessible terrestrial; 24+ h for difficult terrain | |

The 4-hour fast case assumes a daytime cut on an accessible suburban cable with a nearby crew. A buried cable cut at 03:00 in a winter blizzard, with excavation at a remote site, takes 24 hours or more.
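Adding the step ranges gives a best-case and worst-case MTTR for a specific cable. In this sketch the step durations follow the table, while the 24-fibre cable and the 6-hour upper bound on excavation are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    min_minutes: float
    max_minutes: float


def splice_step(fibre_count: int, minutes_per_fibre: float = 5.0) -> Step:
    """Splicing time scales linearly with fibre count."""
    total = fibre_count * minutes_per_fibre
    return Step(f"Splice {fibre_count} fibres", total, total)


def mttr_range_hours(steps):
    """Return (best case, worst case) MTTR in hours."""
    best = sum(step.min_minutes for step in steps) / 60.0
    worst = sum(step.max_minutes for step in steps) / 60.0
    return best, worst


steps = [
    Step("Alarm to dispatch decision", 5, 30),
    Step("Dispatch and travel", 30, 240),
    Step("Physical locate", 30, 120),
    Step("Excavate / access", 60, 360),  # upper bound is an assumption
    splice_step(24),                     # hypothetical 24-fibre cable
    Step("Retest and document", 30, 30),
    Step("Restoration verification", 15, 15),
]

best, worst = mttr_range_hours(steps)
print(f"MTTR between {best:.1f} h and {worst:.1f} h")
```

Swapping in `splice_step(144)` shows how quickly fibre count alone pushes the total past the typical range.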

Rule of Thumb

Fibre cuts dominate annual unavailability for a circuit that is unprotected at the optical layer. Two fibre cuts per year × 6 hours each = 12 hours unavailable = about 99.86 % availability, which misses even three-nines (8.76 hours per year). Optical protection (1+1 or OUPSR) is what turns fibre cuts with hours of MTTR into a five-nines product.
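The same arithmetic, extended to a 1+1 pair, shows where the improvement comes from. This sketch assumes the two paths fail independently (no shared SRLG) and that the protection switch itself never fails; both assumptions are optimistic.

```python
HOURS_PER_YEAR = 8760.0


def unavailability(events_per_year: float, hours_per_event: float) -> float:
    """Fraction of the year lost to outages of a single path."""
    return events_per_year * hours_per_event / HOURS_PER_YEAR


def protected_unavailability(u_working: float, u_protection: float) -> float:
    """Both paths down at once, assuming independent failures and an ideal switch."""
    return u_working * u_protection


u_path = unavailability(events_per_year=2, hours_per_event=6)
u_pair = protected_unavailability(u_path, u_path)

print(f"unprotected:   {100 * (1 - u_path):.3f} % available")
print(f"1+1 protected: {u_pair * HOURS_PER_YEAR * 60:.2f} min/yr with both paths down")
```

Under these assumptions the overlap is under one minute per year, which leaves the rest of a five-nines budget for equipment, software, and correlated failures.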

Dual-Vendor Diversity

A common availability strategy is to deploy two vendors’ equipment in parallel — vendor A as primary, vendor B as protection. The intent is to immunise against systemic faults: a software defect in vendor A’s release, a firmware bug, an EOL/deprecation cascade, a supply-chain disruption, or even a shared design flaw across vendor A’s product line.

When does it pay off?

| Scenario | Dual-vendor diversity payoff |
| --- | --- |
| Vendor A has a recurrent control-plane software defect | High: vendor B keeps the network up while A debugs |
| Single vendor announces critical security issue | High: vendor B is unaffected (typically) |
| Hardware fleet recall | High |
| Routine card failure | None: single-vendor + spare is sufficient |
| Routine fibre cut | None: physical layer is vendor-agnostic |
| Long-term cost, training, spares logistics | Negative: dual-vendor doubles operational complexity |

Warning

Dual-vendor is not a no-brainer. The added operational complexity (two NMS, two configuration languages, two on-call rotations, two parts inventories) easily eats more time per year than the systemic-fault avoidance saves. Only deploy dual-vendor where the regulatory / customer requirement is explicit, or where past systemic faults have justified the discipline empirically. For most operators, single-vendor with strong vendor escalation is cheaper and equally available.

Multi-Cause Failure Modelling

Two paths are only as redundant as their shared risks allow. Six categories of correlated failure routinely catch designs that look diverse on a topology map:

| Correlation source | Concrete example | Mitigation |
| --- | --- | --- |
| Shared duct / conduit | Two “diverse” fibres travelling in the same physical duct or under the same bridge | SRLG-aware route engineering; physical site survey |
| Shared landing station | Both submarine cables enter through the same beach manhole | Geographic diversity, dual landing stations on different beaches |
| Shared power feed | Both ROADM nodes share a single A/B utility feed | Independent A/B feeds + UPS + generator on each |
| Shared software / firmware | Both transponders running the same image with the same defect | Staggered deployments, canary testing, dual-vendor |
| Shared atmospheric / environmental | Both microwave backups in the same storm cell | Diverse media (fibre + microwave is true diversity, fibre + fibre on the same path is not) |
| Shared operations team | Both paths reconfigured by the same engineer in the same maintenance window | Change-management discipline, separate windows for the two paths |
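For the physical categories, the check reduces to a set intersection once every link is tagged with its risk groups. The sketch below uses a hypothetical inventory (`SRLG_OF_LINK`) with made-up link and group names; a real inventory would come from the operator's route records.

```python
def srlgs_on_path(path, srlg_of_link):
    """Collect every shared-risk group touched by the links of one path."""
    groups = set()
    for link in path:
        groups |= srlg_of_link.get(link, set())
    return groups


def shared_srlgs(path_a, path_b, srlg_of_link):
    """Risk groups common to both paths; empty means SRLG-diverse."""
    return srlgs_on_path(path_a, srlg_of_link) & srlgs_on_path(path_b, srlg_of_link)


# Hypothetical inventory: link name -> set of risk groups it belongs to.
SRLG_OF_LINK = {
    "A-B": {"duct-17"},
    "B-D": {"duct-22", "bridge-east"},
    "A-C": {"duct-31"},
    "C-D": {"duct-40", "bridge-east"},
}

working = ["A-B", "B-D"]
protection = ["A-C", "C-D"]

print(shared_srlgs(working, protection, SRLG_OF_LINK))
```

The two paths share no link and no node other than their endpoints, yet both cross the same bridge, so a single event there takes down both.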

Fault-Tree Analysis

Fault-tree analysis (FTA) decomposes a top-level event (“circuit X is down”) into contributing leaves connected by AND/OR gates. The diagram below illustrates a simplified FTA for a protected optical circuit:

flowchart TD
    TOP["Circuit DOWN"]
    AND1["AND<br/>both paths fail"]
    OR_A["OR<br/>working path failure"]
    OR_B["OR<br/>protection path failure"]
    PROT_FAIL["AND<br/>protection mechanism<br/>also fails"]
    FIBER_A["Fibre cut A"]
    AMP_A["EDFA fail A"]
    TX_A["Transponder fail A"]
    CONFIG["Config error"]
    FIBER_B["Fibre cut B"]
    AMP_B["EDFA fail B"]
    TX_B["Transponder fail B"]
    SELECTOR["Selector / APS<br/>fail"]
    SRLG_HIT["SRLG cut<br/>both A and B<br/>same duct"]
    TOP --> SRLG_HIT
    TOP --> AND1
    TOP --> CONFIG
    AND1 --> OR_A
    AND1 --> OR_B
    OR_A --> FIBER_A
    OR_A --> AMP_A
    OR_A --> TX_A
    OR_B --> FIBER_B
    OR_B --> AMP_B
    OR_B --> TX_B
    TOP --> PROT_FAIL
    PROT_FAIL --> OR_A
    PROT_FAIL --> SELECTOR
    style TOP fill:#E24B4A,stroke:#A32D2D,color:#fff
    style AND1 fill:#D85A30,stroke:#993C1D,color:#fff
    style PROT_FAIL fill:#D85A30,stroke:#993C1D,color:#fff
    style SRLG_HIT fill:#D85A30,stroke:#993C1D,color:#fff
    style CONFIG fill:#BA7517,stroke:#854F0B,color:#fff
    style OR_A fill:#7F77DD,stroke:#534AB7,color:#fff
    style OR_B fill:#7F77DD,stroke:#534AB7,color:#fff
    style FIBER_A fill:#1D9E75,stroke:#0F6E56,color:#fff
    style FIBER_B fill:#1D9E75,stroke:#0F6E56,color:#fff
    style AMP_A fill:#1D9E75,stroke:#0F6E56,color:#fff
    style AMP_B fill:#1D9E75,stroke:#0F6E56,color:#fff
    style TX_A fill:#378ADD,stroke:#185FA5,color:#fff
    style TX_B fill:#378ADD,stroke:#185FA5,color:#fff
    style SELECTOR fill:#7F77DD,stroke:#534AB7,color:#fff

The top event “Circuit DOWN” can occur from independent simultaneous failures on both paths (AND), from an SRLG hit affecting both paths simultaneously (a single event), from a configuration error, or from a protection-mechanism failure during a single-path event.

The FTA’s value is exposing the OR branches that look like ANDs: a fibre cut in an SRLG-shared duct is a single event that takes both paths down, even though the topology shows two paths. An SRLG-shared duct is not the AND of (cut A) and (cut B); it is its own independent leaf in the tree.
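A tree this small can be evaluated exactly by enumerating every combination of leaf states. The sketch below follows the four causes listed above (both paths failed, SRLG hit, configuration error, selector failure during a working-path failure). It assumes statistically independent leaves, and every probability in `LEAVES` is a hypothetical placeholder rather than a measured value.

```python
from itertools import product

# Hypothetical steady-state probabilities that each leaf is in a failed state.
LEAVES = {
    "fibre_a": 1.4e-3, "amp_a": 4.0e-5, "tx_a": 8.0e-6,
    "fibre_b": 1.4e-3, "amp_b": 4.0e-5, "tx_b": 8.0e-6,
    "selector": 1.0e-4,  # protection switch fails when called on
    "srlg": 2.0e-6,      # single event that cuts both paths
    "config": 2.0e-6,
}


def circuit_down(failed):
    """Top event of the fault tree for one combination of leaf states."""
    path_a = failed["fibre_a"] or failed["amp_a"] or failed["tx_a"]
    path_b = failed["fibre_b"] or failed["amp_b"] or failed["tx_b"]
    both_paths = path_a and path_b
    switch_failure = path_a and failed["selector"]
    return failed["srlg"] or failed["config"] or both_paths or switch_failure


def top_event_probability(leaves):
    """Sum the probability of every leaf-state combination with the circuit down."""
    names = list(leaves)
    total = 0.0
    for states in product((False, True), repeat=len(names)):
        failed = dict(zip(names, states))
        if circuit_down(failed):
            weight = 1.0
            for name in names:
                weight *= leaves[name] if failed[name] else 1.0 - leaves[name]
            total += weight
    return total


p_down = top_event_probability(LEAVES)
print(f"P(circuit down) = {p_down:.3e}  (~{p_down * 525_600:.2f} min/yr)")
```

Enumeration handles the working path appearing under two gates without double counting, which simple gate-by-gate multiplication would not. Setting `srlg` to zero and rerunning shows how much of the total comes from that single leaf.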

TCM Evidence in Carrier-to-Carrier Disputes

When a multi-segment circuit underperforms its SLA, every operator in the chain is incentivised to claim the fault is somewhere else. TCM (see 13-optical-test-measurement-and-commissioning) provides per-segment performance data that turns a multi-party dispute into a fact-finding exercise.

The contractual mechanism:

  1. Each operator’s TCM endpoints log per-second and per-15-min counters: BIP-8 errors, errored-seconds, severely-errored-seconds, unavailable-seconds.
  2. Master Service Agreements between operators specify TCM data export formats (typically G.7710 PM, NETCONF telemetry, or vendor-specific exports normalised to a shared schema) and SLA-test windows.
  3. When the customer escalates, all operators provide TCM logs for the disputed window. The operator whose TCM segment shows the matching error excursion owns the fault.
  4. SLA credits flow from the at-fault operator(s) according to the per-segment SLA contract.
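Step 3 of the mechanism above can be sketched as a filter over per-second records. Everything here is hypothetical: the `PmSecond` record layout, the operator names, the 10-second threshold, and the assumption that all operators' clocks are synchronised. Real exports follow whatever schema the Master Service Agreement specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmSecond:
    operator: str      # owner of the TCM segment
    epoch_second: int  # timestamp of the one-second bin
    bip8_errors: int
    unavailable: bool


def unavailable_seconds(records, start, end):
    """Count unavailable seconds per operator inside [start, end)."""
    counts = {}
    for rec in records:
        if start <= rec.epoch_second < end and rec.unavailable:
            counts[rec.operator] = counts.get(rec.operator, 0) + 1
    return counts


def segments_at_fault(records, start, end, min_unavailable_s=10):
    """Operators whose segment logged at least the threshold of unavailable seconds."""
    counts = unavailable_seconds(records, start, end)
    return sorted(op for op, n in counts.items() if n >= min_unavailable_s)


# Hypothetical exports for a three-operator circuit and a 60-second disputed window.
WINDOW = (1_000, 1_060)
records = (
    [PmSecond("operator-a", t, 0, False) for t in range(1_000, 1_060)]
    + [
        PmSecond("operator-b", t, 900 if 1_010 <= t < 1_040 else 0, 1_010 <= t < 1_040)
        for t in range(1_000, 1_060)
    ]
    + [PmSecond("operator-c", t, 0, t >= 1_070) for t in range(1_000, 1_100)]
)

print(segments_at_fault(records, *WINDOW))
```

In this data only operator-b's segment shows an excursion inside the window; operator-c's later outage falls outside it and is ignored.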

Key Insight

Without TCM, an SLA dispute is “the customer’s word vs each operator’s word, and every operator denies fault”. With TCM, the disagreement is reduced to verifying the timestamps and data integrity of the TCM exports — a far more bounded problem. This is why every wholesale optical contract above ~10 Gbit/s now mandates TCM provisioning.

Availability vs Reliability vs Performance

Three closely-related metrics that customers and engineers routinely conflate:

| Metric | What it measures | Typical use |
| --- | --- | --- |
| Availability | Fraction of time the service is “up” by some criterion | Customer SLA, billing, headline |
| Reliability | Probability of operating without failure for a stated duration | Component selection, lifecycle planning |
| Performance | Quality of the service while it is up (BER, latency, packet loss) | Application QoE, premium-service tiering |

A circuit can be five-nines available but with 5 % packet loss for the available time — high availability, low performance. A circuit can be perfectly performant when up but fail twice a year for two hours each — good performance, mediocre availability. Premium SLAs increasingly specify all three, with separate remedies for breaches of each.
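The gap between availability and reliability shows up clearly in numbers. This sketch takes the second circuit above (two 2-hour failures a year) and assumes a constant failure rate, so that reliability over a period t is exp(-t / MTBF); that exponential model is an assumption, not something the SLA itself defines.

```python
import math

HOURS_PER_YEAR = 8760.0


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Long-run fraction of time the circuit is up."""
    return mtbf_h / (mtbf_h + mttr_h)


def reliability(mtbf_h: float, mission_h: float) -> float:
    """Probability of zero failures in mission_h, assuming a constant failure rate."""
    return math.exp(-mission_h / mtbf_h)


mttr_h = 2.0
mtbf_h = (HOURS_PER_YEAR - 2 * mttr_h) / 2  # up-time hours between the two failures

print(f"availability:        {100 * availability(mtbf_h, mttr_h):.3f} %")
print(f"P(failure-free year): {100 * reliability(mtbf_h, HOURS_PER_YEAR):.1f} %")
```

The circuit is available well over 99.9 % of the time, yet the chance of getting through a year with no failure at all is small, which is why the two metrics need separate contractual treatment.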

SLA-Tier Examples

| Tier | Availability target | Restoration-time target | Protection architecture | Typical credit / remedy |
| --- | --- | --- | --- | --- |
| Bronze | 99.9 % (8.76 h/yr) | <30 min restoration after fibre cut | Single-path, unprotected | 5 % monthly credit per outage hour beyond budget |
| Silver | 99.95 % (4.38 h/yr) | <5 min restoration | Optical-layer protected (1+1 or OUPSR) | 10 % credit per breach hour, capped at 100 % monthly fee |
| Gold | 99.99 % (52.6 min/yr) | <50 ms switch | Hard-wired 1+1 + diverse routing + SRLG-validated | 25 % credit per breach event |
| Platinum | 99.999 % (5.26 min/yr) | <50 ms switch | Dual-vendor diversity + 1+1 + SRLG-validated + diverse landings | 50 % credit per breach event, contract exit option after repeat |
| Six-nines | 99.9999 % (31.5 s/yr) | <50 ms with sub-50 ms re-converge | Triple-redundant, multi-domain | Bespoke; usually only critical control-plane or financial |
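A per-hour credit rule such as the Silver tier's can be computed as below. The sketch makes several assumptions that a real contract would spell out: a 30-day measurement month, breach time counted pro rata rather than rounded up to whole hours, and a 100 % cap applied to every tier.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def monthly_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime in a 30-day month for the given target."""
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_30_DAY_MONTH


def credit_fraction(
    availability_pct: float,
    downtime_min: float,
    rate_per_breach_hour: float,
    cap: float = 1.0,
) -> float:
    """Fraction of the monthly fee credited, pro rata per hour beyond the budget."""
    breach_min = max(0.0, downtime_min - monthly_budget_minutes(availability_pct))
    return min(cap, rate_per_breach_hour * breach_min / 60.0)


# Silver tier: 99.95 % target, 10 % credit per breach hour.
for downtime in (10.0, 201.6, 2_000.0):
    credit = credit_fraction(99.95, downtime, rate_per_breach_hour=0.10)
    print(f"{downtime:7.1f} min down -> {100 * credit:5.1f} % credit")
```

An outage inside the 21.6-minute budget earns nothing, three hours beyond it earns about 30 %, and a very long outage hits the cap.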

Rule of Thumb

The cost of each additional “nine” roughly doubles. Going from 99.9 % to 99.99 % requires optical-layer protection; from 99.99 % to 99.999 % requires SRLG-validated diversity and rigorous SLA-test discipline; from 99.999 % to 99.9999 % requires dual-vendor diversity and dual-domain routing. Customers asking for “six-nines” almost always mean “as available as you can make it” and will accept a five-nines product with the right pricing — most operators publish only up to five-nines in their service catalogue.

References

Standards (ITU-T)

  1. ITU-T G.709/Y.1331, Interfaces for the Optical Transport Network (06/2020). https://www.itu.int/rec/T-REC-G.709
  2. ITU-T G.7710/Y.1701, Common equipment management function requirements (10/2020). https://www.itu.int/rec/T-REC-G.7710
  3. ITU-T G.911, Parameters and calculation methodologies for reliability and availability of fibre optic systems. https://www.itu.int/rec/T-REC-G.911
  4. ITU-T G.827, Availability performance parameters and objectives for end-to-end international constant bit-rate digital paths. https://www.itu.int/rec/T-REC-G.827
  5. ITU-T M.2110, Bringing into service of international multi-operator paths, sections and transmission systems. https://www.itu.int/rec/T-REC-M.2110

Standards (IEC / IEEE)

  1. IEC 61703, Mathematical expressions for reliability, availability, maintainability and maintenance support terms. https://webstore.iec.ch/
  2. IEC 61025, Fault Tree Analysis (FTA). https://webstore.iec.ch/
  3. IEEE 493, Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems (Gold Book). https://standards.ieee.org/

Standards (IETF)

  1. RFC 4427, Recovery (Protection and Restoration) Terminology for Generalized Multi-Protocol Label Switching (GMPLS) (03/2006). https://www.rfc-editor.org/rfc/rfc4427
  2. RFC 4428, Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based Recovery Mechanisms (03/2006). https://www.rfc-editor.org/rfc/rfc4428

Books

  1. J.-P. Vasseur, M. Pickavet, P. Demeester, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS, Morgan Kaufmann, 2004.
  2. R. Ramaswami, K. N. Sivarajan, G. H. Sasaki, Optical Networks: A Practical Perspective, 3rd ed., Morgan Kaufmann, 2009.
  3. P. D. T. O’Connor, A. Kleyner, Practical Reliability Engineering, 5th ed., Wiley, 2012.