Overall scheme design
Taking into account how people move through the library and the differing lighting needs of its spaces, the university library is divided into three major areas: an image detection area, an infrared sensing area, and other areas. The image detection area encompasses the study and rest zones; the infrared sensing area includes corridors, bookshelves, and elevator vestibules; the remaining areas consist of restrooms, stairwells, and public halls. The general framework of the designed intelligent lighting system is shown in Fig. 1.

Schematic diagram of the overall system framework.
According to the distribution of the library areas, the three regions are subdivided into N sub-regions. Each sub-region is equipped with an area controller, which independently controls N sub-lighting units. The sub-lighting units in different regions consist of LED lamps, light sensors, or motion sensors. The core of the area controller is a microcontroller, which performs sensor data acquisition, reception of human body position information, transmission of lamp control commands, and feedback of lamp status information to the main control computer.
The human body video data collected by the camera are processed using deep learning algorithms for target detection and spatial localization, allowing the current position of each person to be accurately identified and located. The identity of the lighting area in which each person is located is then sent to the area controller via Zigbee wireless communication, and the area controller decides whether to switch the lamps on or off according to the light intensity and the presence or absence of human targets in each lamp's area. In this way, lamp brightness can be adjusted to the actual situation, making full use of natural light and maximizing energy savings20,21,22.
In addition, the light sensors and human body sensors detect in real time the light intensity in each luminaire's area and whether a person has entered or left the infrared sensing area, and send the detection results to the area controller. The area controller adjusts the brightness and switching state of the lamps based on the real-time light intensity and human target information to fulfill lighting needs. Meanwhile, pulse width modulation (PWM) dimming enables the lamps to automatically increase their brightness in low-light conditions and reduce it in bright conditions, further saving energy.
Because personnel flow and lighting conditions differ across the library's areas, the deep-learning-based human target detection and human spatial localization algorithms are applied only to the image detection area. This area also occupies the largest share of the library's floor space and consumes the most power, so it is the focus of this paper.
Design of image detection region scheme
A study area in the library is selected as an example and named sub-area 1. Sub-area 1 is divided into 9 sub-lighting units; each contains an LED luminaire and a light sensor, and all of them are controlled by the area controller. The floor plan of sub-area 1 is shown in Fig. 2.

Floor plan of sub-area 1.
According to the standard for lighting design of buildings (GB50034-2013), the given illuminance value for sub-area 1 is set to 300 lx, and subsequent illuminance determinations are made against this value. The control scheme of sub-area 1 is shown in Fig. 3. The camera collects human video data within its effective range and uses the human target detection model and human spatial localization model to determine human body position information, which is then sent to the area controller. Meanwhile, the light sensor detects the light intensity in the current luminaire area in real time and uploads it to the area controller. The area controller executes the illuminance determination command, comparing the light intensity against the 300 lx threshold. If the measured light intensity is greater than or equal to 300 lx, natural light is sufficient and the lamps are deactivated, as no additional lighting is required. Conversely, if the light intensity falls below 300 lx, natural light is insufficient for the visual needs of the human eye, and the lamps are activated to provide the necessary supplementary illumination.

Sub-area 1 control scheme.
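To make this determination concrete, the following minimal Python sketch mirrors the area controller's threshold comparison; the function name and example readings are illustrative assumptions, not identifiers from the paper.

```python
# Minimal sketch of the area controller's illuminance determination.
# The 300 lx given value follows GB50034-2013 for sub-area 1; the
# function name and example readings are hypothetical.

GIVEN_ILLUMINANCE_LX = 300.0  # given value for sub-area 1

def needs_supplementary_light(measured_lx: float) -> bool:
    """True when natural light alone cannot meet the 300 lx given value."""
    return measured_lx < GIVEN_ILLUMINANCE_LX

# Example: at 220 lx the lamps are activated; at 340 lx they stay off.
for lux in (220.0, 340.0):
    state = "on" if needs_supplementary_light(lux) else "off"
    print(f"{lux:.0f} lx -> lamps {state}")
```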
This paper uses PWM dimming technology to regulate the LEDs. PWM dimming is a digital technique that adjusts the duty cycle to control the average current flowing through the LED, thereby regulating the brightness of the lamp. When the light sensor detects high light intensity, PWM dimming reduces the brightness of the LEDs; conversely, it increases their brightness when the light intensity is low. The regulation process responds in real time to changes in ambient light, such as passing cloud cover or fluctuations in natural light intensity, achieving more accurate light control. Figure 4 shows the flow chart of the dimming process.
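As an illustration, the Python sketch below adjusts the duty cycle in proportion to the illuminance error; the proportional gain and the simulated readings are assumed values, since the paper specifies the dimming flow only in Fig. 4.

```python
# Illustrative closed-loop PWM dimming: the duty cycle is nudged in
# proportion to the illuminance error so the measured level tracks the
# 300 lx given value. Gain and limits are assumptions, not from the paper.

SETPOINT_LX = 300.0
KP = 0.002  # assumed proportional gain (duty fraction per lx of error)

def update_duty(duty: float, measured_lx: float) -> float:
    """One control step: raise duty when too dark, lower it when too bright."""
    duty += KP * (SETPOINT_LX - measured_lx)
    return min(max(duty, 0.0), 1.0)  # clamp to the valid duty-cycle range

# Simulated response: as ambient light drops (cloud cover), duty rises.
duty = 0.3
for lux in (280, 250, 220, 260, 310):
    duty = update_duty(duty, lux)
    print(f"measured {lux} lx -> duty {duty:.3f}")
```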

Design of human target detection model based on improved YOLOv5
YOLOv5 algorithm improvement
YOLOv5 offers strong real-time processing capability with low hardware computational requirements and comes in four network variants. Balancing detection speed and accuracy, this paper adopts YOLOv5s, the variant with the smallest network depth and width, as the base network and optimizes it. As shown in Fig. 5, the YOLOv5 network model consists of four parts: Input, Backbone, Neck, and Prediction.

YOLOv5 network model structure.
In this paper, starting from the YOLOv5 algorithm, the backbone network, detection scales, feature extraction network, and loss function are each improved. The structure of the improved YOLOv5 network is shown in Fig. 6.

Schematic diagram of the improved YOLOv5 network structure.
Backbone network improvements
Introducing a coordinate attention mechanism lets the model focus on the target region of interest and improves the localization and recognition accuracy of the target. However, attention mechanisms typically increase the computational load of the model, which can reduce the detection rate. Simple, lightweight modules such as the coordinate attention (CA) module impose minimal additional computational overhead while enhancing model performance. The CA module can therefore improve the accuracy of the target detection model while maintaining the detection rate.
As shown in Fig. 7, the CA module comprises two steps: coordinate information embedding and coordinate attention feature map generation. In the first step, the CA module encodes each channel of the input feature map X using pooling kernels of dimensions \((H,1)\) and \((1,W)\), obtaining the outputs of channel \(c\) at height \(h\) and at width \(w\). This produces two direction-aware feature maps \(z^{h}\) and \(z^{w}\), whose sizes are \(C \times H \times 1\) and \(C \times 1 \times W\), respectively.

Schematic diagram of CA structure.
The formulas are as follows:
$$z_{c}^{h} \left( h \right) = \frac{1}{W}\sum\limits_{0 \le i < W}^{{}} {x_{c} \left( {h,i} \right)}$$
(1)
$$z_{c}^{w} \left( w \right) = \frac{1}{H}\sum\limits_{0 \le j < H}^{{}} {x_{c} \left( {j,w} \right)}$$
(2)
Next, \(z^{h}\) and \(z^{w}\) are concatenated, and a convolutional transform function \(F_{1}\) with a kernel size of 1 is applied to them to generate an intermediate feature map \(f\) that encodes spatial information in the horizontal and vertical directions:
$$f = \delta \left( {F_{1} \left( {\left[ {z^{h} ,z^{w} } \right]} \right)} \right)$$
(3)
where \(\delta\) is the nonlinear activation function. The intermediate feature map \(f\) is decomposed along the spatial dimension to obtain two tensors \(f^{h} \in R^{C/r \times H}\) and \(f^{w} \in R^{C/r \times W}\), where \(r\) denotes the downsampling ratio. Convolutional operations are performed on \(f^{h}\) and \(f^{w}\) using convolutional transform functions \(F_{h}\) and \(F_{w}\) with a convolutional kernel size of 1 to transform them into tensors with the same number of channels as the input X.
The formula is as follows:
$$g^{h} = \sigma \left( {F_{h} \left( {f^{h} } \right)} \right)$$
(4)
$$g^{w} = \sigma \left( {F_{w} \left( {f^{w} } \right)} \right)$$
(5)
where \(\sigma\) is the sigmoid activation function. Finally, the outputs \(g^{h}\) and \(g^{w}\) are expanded and used as the attention weight assignment values, respectively, and the final output equation is as follows:
$$y_{c} \left( {i,j} \right) = x_{c} \left( {i,j} \right) \times g_{c}^{h} \left( i \right) \times g_{c}^{w} \left( j \right)$$
(6)
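For concreteness, the PyTorch sketch below implements Eqs. (1)-(6); the Hardswish nonlinearity for \(\delta\) and the default reduction ratio are assumptions borrowed from common CA implementations rather than details stated here.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA module following Eqs. (1)-(6); r is the reduction ratio."""
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # F1
        self.act = nn.Hardswish()                              # delta (nonlinearity)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # Fh
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # Fw

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Eqs. (1)-(2): directional average pooling along width and height
        z_h = x.mean(dim=3, keepdim=True)                          # N x C x H x 1
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)      # N x C x W x 1
        # Eq. (3): concatenate along the spatial axis, 1x1 conv, nonlinearity
        f = self.act(self.conv1(torch.cat([z_h, z_w], dim=2)))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        # Eqs. (4)-(5): restore the channel count, sigmoid attention weights
        g_h = torch.sigmoid(self.conv_h(f_h))                      # N x C x H x 1
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # N x C x 1 x W
        # Eq. (6): reweight the input by both directional attentions
        return x * g_h * g_w

# Shape check on a dummy feature map
y = CoordinateAttention(64)(torch.randn(1, 64, 20, 20))
print(y.shape)  # torch.Size([1, 64, 20, 20])
```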
Improvement of detection scale
A detection layer with a scale of 160 × 160 is added to the original network structure, and the feature fusion part is improved to realize four-scale detection. The specific operations are as follows: after the 17th layer of the original network, a CBL layer and an up-sampling operation are added to further expand the size of the feature map; at the 20th layer, the expanded 160 × 160 feature map is concatenated with the 2nd-layer feature map from the Backbone, fusing the detail information of the shallow layer with the semantic information of the deep layer. In this way, larger-scale feature maps are obtained for detecting smaller targets. A shallow detection layer with a scale of 160 × 160 is then added at the 21st layer while the other three detection layers remain unchanged. The resulting four-scale detection effectively combines shallow feature information with the high-level semantic information of the deeper layers23, enhancing the model's ability to learn multi-scale targets.
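A minimal PyTorch sketch of the added fusion step is shown below; the channel counts and tensor sizes are illustrative assumptions rather than the exact values of the modified network.

```python
import torch
import torch.nn as nn

# Illustrative fusion step for the added 160x160 detection branch: the
# 80x80 neck feature is upsampled and concatenated with the shallow
# backbone feature of matching resolution. Channel counts are assumed.
neck_feature = torch.randn(1, 128, 80, 80)        # output of the added CBL layer
backbone_shallow = torch.randn(1, 64, 160, 160)   # 2nd-layer Backbone feature

upsample = nn.Upsample(scale_factor=2, mode="nearest")
fused = torch.cat([upsample(neck_feature), backbone_shallow], dim=1)
print(fused.shape)  # torch.Size([1, 192, 160, 160]) -> input to the new head
```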
Feature extraction network improvement
In the four-scale detection of YOLOv5, detection accuracy is enhanced by increasing the network depth and the number of model parameters, thereby achieving more accurate prediction results. However, the increased model complexity raises computational cost and reduces running speed. To address this, some ordinary convolutions in the Neck are replaced with depth separable convolutions24. This modification reduces the number of model parameters and the computation, ultimately increasing detection speed while maintaining detection accuracy. The principle of depth separable convolution is shown in Fig. 8.

Depth separable convolution schematic.
Depth separable convolution divides the standard convolution operation into two steps: depthwise convolution and pointwise convolution25. In the first step, a convolution kernel of size \(K \times K\) is applied channel by channel to the input feature map with \(M\) channels, yielding \(M\) feature maps of size \(D \times D\). Then, a pointwise convolution with \(N\) filters is performed on these feature maps to obtain an output feature map with \(N\) channels and size \(D \times D\). The computation of ordinary convolution is as follows:
$$K \times K \times M \times N \times D \times D$$
(7)
The formula for the computational volume of the depth separable convolution is as follows:
$$K \times K \times M \times D \times D + M \times N \times D \times D$$
(8)
The ratio of the computation of depth separable convolution to that of ordinary convolution is \(1/N + 1/K^{2}\). Thus, replacing the ordinary convolutions in the feature extraction network with depth separable convolutions reduces the computation and the number of parameters of the model and thereby improves its detection speed: the channel-by-channel (depthwise) convolution reduces the number of parameters, while the pointwise convolution reduces the computational effort.
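The saving can be verified numerically; the PyTorch sketch below builds both convolution types and compares their parameter counts against the \(1/N + 1/K^{2}\) ratio (bias terms, which Eqs. (7) and (8) exclude, account for the small discrepancy).

```python
import torch.nn as nn

def depthwise_separable(m: int, n: int, k: int = 3) -> nn.Sequential:
    """Depthwise (channel-by-channel) conv followed by a 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(m, m, kernel_size=k, padding=k // 2, groups=m),  # depthwise
        nn.Conv2d(m, n, kernel_size=1),                            # pointwise
    )

def params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

m, n, k = 64, 128, 3
standard = nn.Conv2d(m, n, kernel_size=k, padding=k // 2)
separable = depthwise_separable(m, n, k)
# Measured ratio approaches 1/N + 1/K^2 = 1/128 + 1/9 ~ 0.119
print(params(standard), params(separable), params(separable) / params(standard))
```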
Loss function improvement
In YOLOv5, the loss function consists of three components: bounding box loss26, objectness loss27, and classification loss28. The original YOLOv5 algorithm uses GIOU_Loss as the bounding box regression loss function, but this loss does not account for the case where the prediction box lies inside the target box with the same size29. To solve this problem, the better-performing EIoU_loss is introduced as the bounding box loss function. EIoU_loss consists of three parts: overlap loss (\(L_{IOU}\)), center-distance loss (\(L_{dis}\)), and width-height loss (\(L_{asp}\))30. It considers the overlap area and center-point distance of the bounding box regression as well as the aspect ratio of the prediction and target boxes, and by introducing the width-height loss it sets the optimization objective to minimizing the width-height difference between the prediction box and the ground-truth box31. As a result, bounding boxes are located and regressed more accurately, the model converges faster, and detection accuracy improves.
EIoU_loss is calculated as follows:
$$L_{EIOU} = L_{IOU} + L_{dis} + L_{asp} = 1 - IOU + \frac{\rho^{2}\left( b,b^{gt} \right)}{c^{2}} + \frac{\rho^{2}\left( w,w^{gt} \right)}{c_{w}^{2}} + \frac{\rho^{2}\left( h,h^{gt} \right)}{c_{h}^{2}}$$
(9)
where \(c_{w}\) and \(c_{h}\) are the width and height of the smallest enclosing rectangle covering the prediction box and the ground-truth box.
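As a cross-check of Eq. (9), the following PyTorch sketch computes EIoU_loss for corner-format boxes; the tensor layout and the epsilon guard are implementation assumptions.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of Eq. (9) for boxes given as (x1, y1, x2, y2) tensors."""
    # Intersection and union for the IoU (overlap) term
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Smallest enclosing rectangle: width c_w, height c_h, diagonal c
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Squared center distance rho^2(b, b^gt), normalized by c^2
    dx = (pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) / 2
    dy = (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) / 2
    dist = (dx ** 2 + dy ** 2) / c2
    # Width-height penalty terms
    asp = (w1 - w2) ** 2 / (cw ** 2 + eps) + (h1 - h2) ** 2 / (ch ** 2 + eps)
    return 1 - iou + dist + asp

# Example: identical boxes give (near) zero loss
b = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
print(eiou_loss(b, b).item())  # ~0.0
```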
Design of human spatial localization model based on multilayer perceptron
MLP model training
The MLP model maps the 2D center coordinates (x, y) of the human body bounding box to the 3D coordinates (x, y, 0) in the actual library; that is, the specific location of a person within the sub-region is obtained from their position in the image.
In this experiment, the YOLOv5 body detection algorithm directly provided the 2D coordinates of the human body, while the 3D coordinates were manually acquired. A total of 570 pairs of 2D and 3D coordinate samples were collected. Some of the data are shown in Table 1.
In order to train the MLP model, this paper employs the Backpropagation (BP) algorithm for supervised learning. The BP algorithm adjusts the weights of the individual connections in the MLP network layer by layer by calculating the gradient of the loss function, thus enabling the model to better fit the training data. The importance of backpropagation in MLP training is that it provides an efficient and effective way for the model to optimize the parameters and gradually find the optimal weight configuration. In this system, the accuracy of the human spatial localization model is improved32, and the experimental flow of the human spatial localization algorithm is shown in Fig. 9. The specific steps are as follows:

Flow chart of human spatial localization algorithm experiment.
The center point of the human body bounding box regressed by the YOLOv5 detection algorithm is used as the input training sample of the MLP model, and the network structure of the MLP model is constructed as in Fig. 10. The initial learning rate is set to 0.001, the batch size to 64, and training runs for a total of 100 epochs.

MLP model network structure.
During forward propagation, each layer of the neural network applies a linear transformation followed by an activation function that adds nonlinearity, and the result is passed on to the output layer. Subsequently, the error is computed via backpropagation, and the Adam optimizer is employed to update the weights and biases. This process is repeated iteratively until the error falls below a predefined threshold or the maximum number of iterations is reached, thereby ensuring convergence of the learning process.
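The training loop below is a minimal PyTorch sketch of this procedure; the learning rate, batch size, and epoch count follow the text, while the hidden-layer widths and the random placeholder data are assumptions (the actual structure is given in Fig. 10 and the measured pairs in Table 1).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Sketch of the MLP localization model and its training loop. Hidden
# widths are assumed; lr = 0.001, batch size 64, 100 epochs follow the text.
model = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),   # input: 2D bounding-box center (x, y)
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),              # output: planar (x, y); z = 0 on the floor
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Placeholder for the 570 manually measured coordinate pairs (Table 1).
xy_2d = torch.rand(570, 2)
xy_3d = torch.rand(570, 2)
loader = DataLoader(TensorDataset(xy_2d, xy_3d), batch_size=64, shuffle=True)

for epoch in range(100):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)  # forward pass and error
        loss.backward()                           # backpropagate the error
        optimizer.step()                          # Adam weight/bias update
```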
Control strategy design
This paper focuses on the image detection region, where the deep learning algorithms are applied, and proposes a lighting control strategy based on the distribution of people: under the premise of meeting the illuminance needs of the human eye, the number of luminaires turned on in the region is minimized, improving the intelligence and energy efficiency of the lighting system. Figure 11 shows the spatial mapping of the image detection area. The horizontal plane is divided into three parts: the floor, the desktop plane, and the luminaire installation plane. These three planes correspond to one another vertically, with the desktop plane designated for illuminance detection. Following the actual spatial layout of the image detection region, it is divided into N sub-regions, each containing \(L_{n}\ (n \ge 1)\) sub-illumination units; the sub-illumination region \(A_{n}\) on the desktop plane is the vertical projection of the region covered by sub-illumination unit \(L_{n}\), and \(H_{m}\ (m \ge 0)\) denotes the number of people in the sub-region.

Spatial mapping relationship of image detection region.
The flowchart of the control strategy for the image detection region is shown in Fig. 12, and the main steps are:
1. System initialization;

2. Detect the presence of human targets \(H_{i}\ (i = 1,2,\ldots,m)\) in the image using the human target detection model, and output the spatial coordinates \(A_{k}(x_{k}, y_{k})\) of each body using the human spatial localization model;

3. Determine the region containing each human target \(H_{i}\): if \(x_{a} \le x_{k} < x_{b}\) and \(y_{c} \le y_{k} < y_{d}\) (where \(x_{a}, x_{b}, y_{c}, y_{d}\) bound the sub-illumination region \(A_{n}\)), then target \(H_{i}\) belongs to sub-illumination region \(A_{n}\) and the target count for that region is incremented by 1; finally, the position distribution of all human bodies across the sub-regions is output (a minimal sketch of this step follows the flowchart caption below);

4. Send the human body position distribution information to the area controller via Zigbee wireless communication. When the number of human targets in a sub-lighting area \(A_{n}\) is greater than 0, a pending-on command is sent; when it equals 0, a light-off command is sent;

5. Based on the light intensity in each sub-lighting area, assess whether the current illuminance falls below the given value. If it is greater than or equal to the given value, stop dimming; if it is below the given value, turn on the luminaires in the pending-on area and carry out PWM dimming.

Flowchart of image detection region control strategy.
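The following Python sketch illustrates steps 2-4 of the strategy with hypothetical region bounds and detections; all identifiers and values are illustrative only.

```python
# Sketch of steps 2-4: assign each detected person to a sub-lighting
# region A_n by its localized (x_k, y_k), count occupants per region,
# and derive the per-region command. Bounds and detections are examples.
from collections import Counter
from typing import Optional

regions = {                       # A_n -> (x_a, x_b, y_c, y_d) in metres
    "A1": (0.0, 2.0, 0.0, 2.0),
    "A2": (2.0, 4.0, 0.0, 2.0),
}

def assign(xk: float, yk: float) -> Optional[str]:
    """Step 3: find the sub-illumination region whose bounds contain (xk, yk)."""
    for name, (xa, xb, yc, yd) in regions.items():
        if xa <= xk < xb and yc <= yk < yd:
            return name
    return None

detections = [(0.5, 1.1), (2.7, 0.4), (3.1, 1.8)]   # localized targets H_i
counts = Counter(assign(x, y) for x, y in detections)
for name in regions:                                 # step 4: issue commands
    cmd = "pending-on" if counts[name] > 0 else "off"
    print(f"{name}: {counts[name]} person(s) -> {cmd}")
```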
