Non-technological barriers: the last frontier towards AI-powered intelligent optical networks

In this section, we discuss each individual non-technological barrier and elucidate our viewpoint with the help of example use-cases from real fiber-optic networks.

Legacy issues

Conventional non-ML solutions in optical networks offer a well-established ecosystem built on decades of successful deployment and operation¹³. Significant investments have also been dedicated to their development over time. This lessens the motivation to switch to alternative ML-based tools; after all, why fix something that is not broken? A 2021 study report by TM Forum (a global alliance of 850+ telecommunication companies including network operators, network equipment vendors, network management consultancies, etc.), which draws on a survey of 42 global operators, showed that respondents ranked the prevalence of legacy solutions as the topmost barrier to ML-aided autonomous networking¹⁴.

When pitted against traditional legacy tools, ML-assisted methods also suffer from reputation and trust challenges, since optical network operators sometimes argue that the ML-based solutions developed in research may have little relevance and usefulness in practical fiber-optic networks.

Example network use-case: lightpaths’ QoT estimation

A prime example of legacy issues is the QoT estimation operation in optical networks, for which ML-aided methods¹⁵,¹⁶,¹⁷ have been proposed but which has traditionally been dominated by non-ML approaches such as the Gaussian noise (GN) model¹⁸ and its variants¹⁹. ML-based QoT estimation methods, despite offering a significant advantage in scenarios involving uncertainty about link parameter values²⁰,²¹, have not yet achieved broad adoption in current fiber-optic networks, and it may take a while before these techniques are deemed suitable substitutes for their legacy counterparts.
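
For context, the legacy GN-model approach mentioned above is commonly summarized by a closed-form expression for the per-channel signal-to-noise ratio. The simplified form below is not taken from this article; the symbols P_ch (channel launch power), P_ASE (accumulated amplifier noise power) and η (nonlinear interference coefficient) are notation assumed here purely for illustration.

```latex
\mathrm{SNR} = \frac{P_{\mathrm{ch}}}{P_{\mathrm{ASE}} + \eta\, P_{\mathrm{ch}}^{3}},
\qquad
P_{\mathrm{ch}}^{\mathrm{opt}} = \left(\frac{P_{\mathrm{ASE}}}{2\eta}\right)^{1/3}
```

The optimum launch power follows from setting the derivative of the SNR with respect to P_ch to zero. Both P_ASE and η are computed from link parameters (e.g., span loss, amplifier noise figure, dispersion), so uncertainty in those parameters propagates directly into such an analytical estimate, which is the scenario where the cited ML-based methods are reported to help.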

Cost restraints

As ML models essentially rely on data, having sufficient representative network data is a prerequisite for guaranteeing good model performance. However, the data generation and analysis processes demand substantial investments, e.g., for installing monitoring equipment (like optical channel monitors (OCMs), optical time-domain reflectometers (OTDRs), optical spectrum analyzers (OSAs), etc.) ubiquitously across the network; for securely storing large amounts of network data; for procuring fast computational resources to process big data using ML algorithms; for acquiring an extended set of software tools; for employing additional skilled workforce, etc.²². What makes matters worse is the unclear business impact of such sizable investments, since network operators often find it hard to translate ML models’ statistical metrics (e.g., accuracy) into relevant business value. These cost issues pose a major hindrance to the deployment of ML-aided methods in optical networks. In a 2022 worldwide survey of 78 network operators, jointly conducted by Light Reading Inc. (a New York-based telecommunications industry information company) and 4 key optical transport network suppliers, i.e., Ciena, Fujitsu, Infinera, and Juniper Networks, the respondents envisioned several benefits of ML-powered tools for open, automated and programmable transport networks, yet 36% of them identified high costs as the main obstacle towards their adoption²³.

Example network use-case: optical performance monitoring

Consider the OPM process in fiber-optic networks, which is conventionally realized through low-cost, ubiquitously deployed OCMs that typically comprise a simple tunable band-pass filter or a diffraction grating followed by a single photodetector to monitor a few channel parameters like optical power, wavelength, and out-of-band optical signal-to-noise ratio (OSNR)²⁴. Although ML-based alternative OPM techniques²⁵,²⁶ have successfully demonstrated multi-parameter monitoring for several wavelength-division multiplexed (WDM) channels, their deployment in optical networks remains limited. The reason is that such data-driven OPM solutions necessitate new mechanisms for assembling a variety of training data as well as extra data analysis tools, all of which incur additional costs, thus deterring network operators from switching to this new paradigm despite its clear technical superiority.

Expert workforce limitations

ML-assisted operation and management of optical networks is still in its infancy. Consequently, there is currently a shortage of well-trained professionals who combine proficiency in ML techniques with domain expertise in optical communications and networking. This dearth of expert workforce has been reported by several industry stakeholders, such as Huawei Technologies²⁷ and Colt Technology Services²⁸; the former proposed an operating model to retrain and transform the skills of a certain fraction of the original workforce in a telecommunication company (e.g., converting operations personnel into data analytics engineers) to cope with the ML workforce limitations.

Another crucial aspect, as optical network operations aim to transition towards a new “human + machine” collaboration model, is that the existing workforce is accustomed to traditional non-ML tools and will thus find it quite hard to work and cooperate with network devices/equipment controlled by artificial intelligence (AI). This is problematic because many critical optical network operations, which necessitate a very high level of reliability, require the workforce to interact seamlessly with ML models and fully trust their recommendations and decisions. The lack of user-friendly ML tools that could help the unaccustomed workforce easily apply ML methods further aggravates the situation. The above-mentioned workforce challenges mean that a transition towards ML-aided or hybrid solutions in fiber-optic networks is going to be a lengthy process.

Example network use-case: end-to-end communication system optimization

Consider the fiber-optic communication system optimization process, where conventionally the individual system components like the transmitter, receiver and transmission link are separately designed and optimized in a modular fashion by different teams of engineers, often leading to suboptimal end-to-end (E2E) system performance²⁹. In contrast, an ML-assisted E2E learning and optimization approach jointly optimizes various transceiver blocks, such as coding, pulse shaping, modulation, equalization, demodulation, decoding, etc., such that the errors between the transmitted and final received bits are minimized. Over the last few years, several E2E solutions²⁹,³⁰,³¹ involving artificial neural networks, autoencoders, etc., have been experimentally demonstrated. Despite offering clear performance advantages, their application in commercial optical networks remains nonexistent due to the current scarcity of skilled professionals with multifaceted expertise in ML, digital signal processing (DSP), and optical communications, all of which are imperative for realizing ML-based E2E optimization of fiber-optic communication systems.

Data accessibility and privacy protection problems

For developing effective ML-aided solutions, it is necessary to access representative data sets from actual optical networks. However, this is hard to achieve in practice for several reasons. Firstly, the mechanisms for seamless global access to relevant sources of data are not fully established yet, e.g., due to data ownership constraints, data set sizes, bandwidth limitations in transporting large volumes of data, etc. Secondly, the terms-of-use (ToU) for such shared data are not clearly defined yet. This has resulted in various stakeholders isolating their data and setting boundaries for data sharing in order to protect their own commercial interests. Thirdly, the rules and regulations for protecting the privacy and anonymity of shared data have not yet been enacted in the current optical networking environment.

Owing to the above-mentioned constraints, there are presently only a handful of real-world optical network data sets that are openly accessible to solution developers. These include the Microsoft wide-area optical backbone network performance monitoring data³² released in 2017, the Alibaba production optical transport network QoT data³³ released in 2023, the Germany50 and pan-European GÉANT optical backbone network traffic data³⁴, etc. However, these data sets correspond to specific use-cases only (i.e., OPM, QoT estimation, and network traffic flow prediction, respectively) and are also limited in scale.

Example network use-case: physical layer security management

Consider the ML-assisted physical layer security management operation³⁵,³⁶,³⁷ in fiber-optic networks, which cannot be realized through localized actions alone and necessitates network-wide sharing of data related to security incidents between entities belonging to various network domains. Unfortunately, effective mechanisms to access and use the required sensitive data, as well as the protocols to assure data privacy in such collaborative network applications, are still missing.

Interpretability, transparency and accountability issues

The lack of a comprehensible explanation of the decisions made by ML algorithms (which typically employ a “black-box” methodology) is a big hurdle in adopting ML-based solutions for mission-critical operations in commercial fiber-optic networks, because operators are reluctant in practice to employ a solution without really understanding how and why it works. Optical network operators are particularly interested in knowing how different factors affect the prediction results and may sometimes opt for simpler and more intuitive non-ML models with inferior performance in exchange for better insights. Recently, network operators and solution developers worldwide have begun to take significant interest in incorporating explainability into their ML-aided operations and decision-making. For example, Nokia Bell Labs³⁸ proposed an ML-enabled proactive fiber break detection mechanism for optical networks that additionally provides interpretable decision-making rules. Similarly, in a 2021 policy paper³⁹, Deutsche Telekom provided guidelines to its solution developers on how to increase the degree of comprehensibility of their ML-based solutions.

On the other hand, the absence of transparency in ML-assisted tools makes it hard to scrutinize them and detect any potential discrepancies. Moreover, it leads to accountability problems, e.g., in the event of a false decision made by an ML-based solution, it is often impossible to determine whether the issue lies with the ML model, the training data sets employed, the entities which obtained the data, or the equipment used.

Example network use-case: network failures management

Consider the process of detecting faults in optical networks, where the conventional tools establish specific threshold levels and trigger alarms whenever the set levels are surpassed, thus enabling a primitive but intuitive fault detection mechanism². On the other hand, by leveraging large amounts of component data, link data, and network operational data, ML-aided fault management approaches have successfully demonstrated several advanced features including proactive fault detection, fault classification, fault localization, fault root cause analysis, and preventive maintenance²,⁴⁰,⁴¹,⁴². However, such ML-based fault management solutions have not yet received broad acceptance from optical network operators because they provide hardly any insight into how certain decisions were reached and why exactly they should be trusted.
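
To make the contrast concrete, the conventional threshold-based mechanism described above can be sketched in a few lines. This is a minimal illustration rather than any operator's actual alarm logic: the function name `threshold_alarms`, the received-power readings and the threshold values are all hypothetical.

```python
def threshold_alarms(readings, low, high):
    """Return (index, value, reason) for every reading outside [low, high].

    Each alarm carries a human-readable reason, which is what makes this
    kind of rule easy to audit: the decision is fully explained by one
    comparison against a fixed, operator-chosen level.
    """
    alarms = []
    for index, value in enumerate(readings):
        if value < low:
            alarms.append((index, value, f"below lower threshold {low}"))
        elif value > high:
            alarms.append((index, value, f"above upper threshold {high}"))
    return alarms


# Hypothetical received optical power samples in dBm (assumed values).
rx_power_dbm = [-10.0, -12.5, -19.0, -3.0]

for index, value, reason in threshold_alarms(rx_power_dbm, low=-18.0, high=-5.0):
    print(f"sample {index}: {value} dBm is {reason}")
```

Every alarm here can be traced to a single comparison, whereas an ML-based detector trained on the same readings would typically return only a score or a class label unless explainability is built in separately.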

Lack of standardization and regulatory frameworks

As ML techniques for fiber-optic networks are still evolving, there is currently a lack of consensus among the stakeholders on a range of standardization issues, including data generation processes, data set specifications for various network use-cases, data structures and formats, ML models’ performance metrics, ML models’ performance evaluation procedures, etc. Another issue closely related to standardization is the use of highly specialized but little-standardized terminology for ML-based solutions, which is problematic for novices, e.g., technicians in the field.

It is worth mentioning here that in the case of wireless networks, several global standardization organizations have taken serious steps in the past few years to address ML-related standardization challenges. For example, in 2017, the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) initiated a Focus Group on ML for Future Networks including 5G (FG-ML5G) that aimed to identify the standardization gaps of ML for 5G and beyond networks⁴³. Similarly, in 2017, China Telecom together with Huawei Technologies and other partners set up a work group named Experiential Networked Intelligence (ENI) within the European Telecommunications Standards Institute (ETSI) to facilitate the formulation of standards for ML applications in wireless networks⁴⁴. Unfortunately, for optical transport networks, the efforts to standardize data generation and processing for different ML applications are still at a nascent stage. In this direction, in 2019, the National Institute of Standards and Technology (NIST) hosted a workshop¹³ which highlighted the emerging need to standardize the relevant data sets for ML-aided fiber-optic networks.

Apart from standardization, there is also a dearth of regulatory policies governing the “data market” as well as the implementation, fair assessment, and anti-discrimination validation of ML models. Even worse, to the best of the author’s knowledge, there are not yet any regulatory bodies in place to provide expertise and oversight on forthcoming legal challenges to ML-enabled solutions in optical networks.

Example network use-case: lightpaths’ QoT estimation

A relevant example of the lack of standardization and regulatory frameworks is ML-aided lightpath QoT estimation, where the ML algorithms are trained to learn the complex mapping between the feature vectors, comprising a few selected parameters of the link/signal, and the lightpath’s chosen QoT metric¹⁵,¹⁶,¹⁷. However, there is presently no standardization of the feature vectors used, and the various proposed solutions apply dissimilar parameter sets, leading to divergent QoT estimation performance. Similarly, there is no standardization of the QoT metric itself, and several alternatives like the lightpath’s feasibility class (i.e., a binary variable), OSNR, electrical signal-to-noise ratio (ESNR), Q-factor, bit-error ratio (BER), error vector magnitude (EVM), etc., have been considered⁴⁵. Furthermore, there are currently no bodies to regulate the data sets used and the ensuing data-driven models for predicting QoT. Owing to these shortcomings, optical network operators have no real means to fairly compare different ML-based QoT estimation methods, which in turn reduces their adoption prospects.
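
The comparison problem can be illustrated with a small sketch in which two estimators report QoT using different metrics, so their outputs must be mapped onto a common one before they can be compared. Several assumptions are made here and none comes from the article: the conversion uses the textbook Gaussian-noise relation BER = 0.5·erfc(Q/√2) for binary decisions (it does not hold for every modulation format), the Q-factor in dB follows the 20·log10 convention, the pre-FEC BER threshold of 3.8e-3 is only an example value, and both estimator outputs are hypothetical.

```python
import math


def q_db_to_linear(q_db: float) -> float:
    """Convert a Q-factor in dB (20*log10 convention) to linear units."""
    return 10 ** (q_db / 20)


def q_to_ber(q_linear: float) -> float:
    """Textbook Gaussian-noise relation for binary decisions."""
    return 0.5 * math.erfc(q_linear / math.sqrt(2))


def feasibility_class(ber: float, ber_threshold: float = 3.8e-3) -> int:
    """Binary QoT metric: 1 if the BER meets the assumed threshold, else 0."""
    return 1 if ber <= ber_threshold else 0


# Two hypothetical estimators that report different QoT metrics.
estimate_a = {"metric": "q_factor_db", "value": 10.0}
estimate_b = {"metric": "ber", "value": 2.0e-2}

# Map both onto BER, then onto the binary feasibility class.
ber_a = q_to_ber(q_db_to_linear(estimate_a["value"]))
ber_b = estimate_b["value"]

print("estimator A:", ber_a, feasibility_class(ber_a))
print("estimator B:", ber_b, feasibility_class(ber_b))
```

Each conversion step embeds a modeling assumption, so without an agreed metric and agreed conversion rules, two methods can appear to disagree (or agree) for reasons unrelated to the quality of the underlying models.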

Human factors and cognitive biases

Unlike conventional analytical approaches used in fiber-optic networks, whose performance is fixed, the results of ML-based solutions depend strongly on which data points are included and which are ignored (on purpose or by accident) by the human developers. This is problematic since it may preclude an objective, human-independent performance measure. Moreover, since ML algorithms are essentially trained by humans, who naturally gain knowledge, technical skills, experience and intuition around certain processes, equipment and tools, the training may embed cognitive biases (a term in psychology that describes the tendency of people’s experiences and feelings to influence their judgment⁴⁶) and hence lead to distorted predictions.

The presence of human factors and intentional/unintentional biases makes it hard for optical network operators to completely trust ML-based prediction results, especially when making critical decisions. In a 2023 study report⁴⁷ published by the Body of European Regulators for Electronic Communications (BEREC), a survey of 7 European network operators, including Telefónica Germany, Koninklijke PTT Nederland, and Telefónica, S.A., showed that the respondents ranked undetected data biases as the topmost area of concern since they could entail misleading results.

Example network use-case: network failures management

A prime example of human factors and cognitive biases is the ML-based fault detection operation⁴⁰,⁴¹,⁴² in fiber-optic networks, where it is not always possible to completely automate the data generation process. For example, to assign a “normal” or “faulty” label to a data sample acquired from a certain network device, the involvement of a human subject matter expert who can give a higher-level interpretation is inevitable. In such cases, there is always a risk that the annotation of the data used in the learning process, and consequently the performance of the ML model, becomes vulnerable to human judgment. Similarly, the performance of ML-aided fault detection tools may vary depending upon the nature of the data employed (e.g., time/frequency/polarization domain data, optical/electrical domain data, etc.) as well as the type of ML algorithm applied, all of which are strictly the developers’ prerogative and are often dictated by their prior experience, degree of familiarity, convenience, etc. This dependence of ML-based fault management solutions’ performance on human traits makes them a less credible choice for optical network operators.
