
IEEE/CAA Journal of Automatica Sinica
Citation: | W. Jang, J. Hyun, J. An, M. Cho, and E. Kim, “A lane-level road marking map using a monocular camera,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 1, pp. 187–204, Jan. 2022. doi: 10.1109/JAS.2021.1004293 |
SELF-DRIVING has received much attention from the industry and academia because of its potential [1]. The implementation of self-driving requires various technologies such as perception [2], localization [3], and path planning [4]. A lane-level map which includes information on road markings (RMs) is the key requirement for the precise localization of the autonomous vehicle [5]–[7]. A lane-level RM map is usually built using a commercial mobile mapping system (MMS) equipped with many expensive sensors, such as 3-D LiDAR, a real-time kinematic (RTK) GPS and inertial navigation systems. The approach, however, is highly expensive and ineffective since the MMS must visit every place it could go for mapping. Considering that a monocular camera has already been installed in most of vehicles, building a lane-level RM map using a monocular camera is a very attractive alternative. Sensor data can be gathered efficiently from a number of common vehicles on the road, thus mapping can be virtually free. In this paper, a new lane-level RM mapping system using a monocular camera is proposed. The key technique for the proposed lane-level RM mapping using a monocular camera is (T1) how the RLs and SRMs are extracted from a monocular camera, and (T2) how they are used as landmarks in the graph optimization of mapping. Here, RM is defined as a union of RL and SRM.
Regarding the RM extraction in (T1): Many recent papers have reported about RL and SRM detection [8]–[15]. The biggest difference between them and our method is that recent previous papers deal with RM detection, while our paper focuses on RM segmentation. Since the RLs detected in [8]–[15] are represented in the form of curved lines, they are not suitable for pixel-level mapping. The proposed method, however, outputs the segment of RMs in a pixel level and can build maps on the pixel level. The result of this paper can be extended from lane-level RM maps to pixel-level RM maps in the future, and the pixel-level RM map of this paper will compete with the high-definition (HD) point cloud map built using a LiDAR from a high-end MMS system.
Regarding the mapping in (T2): The key difference from the visual simultaneous localization and mapping (vSLAM) developed in mobile robotics [16]–[18] is how to match or associate RM segments. Since the RM is not a point landmark but it consists of a set of points, simple associations based on Euclidian distances do not work. In this paper, a segment-to-segment (S2S) association is developed instead of a point to point (P2P) one. A RM segment is divided into front and rear parts and the pair is registered as two nodes for association.
To the best of our knowledge, this is the first attempt of building a lane-level RM map based on the RM segmentation using a monocular camera. The proposed map building method consists of two phases: first, RMs are segmented from the camera image using a semantic segmentation network. Here, RM is defined as the union of SRM and RL. A multi-branch network is developed to cope with RL and SRM separately. Second, the lane-level RM map is built by optimizing a pose graph composed of RMs. DBoW2 [19] is used to detect loop closure and g2o [20] is used for graph optimization. The contributions of this paper are as follows:
1) A lane-level RM map is built based on the RM segmentation using a monocular camera. Since RMs are extracted on per-pixel basis from a segmentation network, it can be extended to pixel-level RM map in the future.
2) RMNet was developed to train on SeRM datasets. Two losses, named class-weighted loss (CWL) and class-weighted focal loss (CWFL), are proposed to train RMNet.
3) Previous vSLAM methods used point landmarks as nodes in the graph optimization. Since the RM is not a point but a segment, the point-based association used in the classical vSLAM is not appropriate. In the proposed RM mapping, an S2S association is developed by dividing a RM segment into front and rear parts. The pair is registered as two different nodes in a graph and graph optimization is performed.
4) A new dataset, called the semantic road mark mapping (SeRM) dataset, is developed for effective RM segmentation. The set is required to train a semantic segmentation network for RM and test the lane-level RM map. The set consists of 25157 pixel-wise annotated images and 21000 position labeled images.
The rest of this paper is organized as follows: Section II introduces related work. Section III presents the SeRM dataset details. Section IV outlines the overview of the proposed map building system. Section V presents the RM segmentation network. Section VI discusses how the lane-level map consisting of RMs is built through graph optimization. Section VII gives the experimental results, and Section VIII concludes this study.
The most common way to create maps for autonomous vehicles is to use MMS, which is equipped with a suite of high accuracy sensors such as 3-D LiDAR and high accuracy GPS sensors [21]–[23]. When a commercial MMS is used, highly accurate GPS trajectories are recorded, and the point clouds obtained from the 3-D LiDAR are superimposed to the GPS trajectories, making a lane-level HD map. The major shortcoming of building lane-level map using an MMS is that building a map costs too much, and LiDAR sensor in the MMS does not contain color information; hence, colored lanes (e.g., yellow and white lanes) are quite difficult to distinguish. To solve this problem, lane-level map building using a low-cost camera has been receiving more attention within the field of autonomous vehicles [24]–[27]. To achieve this, RMs (RLs and SRMs) should be detected, and a lane-level map composed of RMs should be built.
To build a lane-level map with road marking using a monocular camera, what matters most is reliable detection of RMs (RLs and SRMs). To achieve this, various image processing methods were widely used in classical RL and SRM detection methods [28]–[31], and the most popular ones are using Top-Hat filter and hand-crafted low-level features.
Recently, with the development of deep learning, various learning-based methods have been proposed [8]–[15] to detect RMs effectively. First, let us consider the detection of RL. In [8]–[10], end-to-end segmentation networks were designed to perform a pixel-level classification into either RL and background, and line-shaped RLs were reprojected onto image using clustering and curve fitting. Then, Neven et al. extended [8]–[10] to 3D case and proposed 3D-LaneNet in [11]. In [12], Guo et al. developed GeoNet by applying geometric assumptions to 3D-LaneNet, and training the network using regression loss. In GeoNet, 3D lane attributes are predicted using anchor representation. In [13], Ghafoorian et al. proposed EL-GAN to use a generative adversarial network architecture. In the network, a discriminator is trained on both predictions and labels at the same time. Second, let us consider the detection of SRM. Lee et al. improved the SRM detection by massaging the bounding boxes from tiny-YOLO in [14]. True positives from tiny-YOLO are deblurred, whereas false positives are blurred through GAN. The third one detects RL and SRM simultaneously. Lee et al. [15] proposed VPGNet to detect RLs and SRMs simultaneously. The key idea of the work is to boost the detection performance of RL and SRM by detecting vanishing point along with them.
Here, it should be noted that the RL and SRM detection proposed in this paper is completely different from the previous papers [8]–[15]. The first and basic difference between them is that previous papers deal with RM detection, but our method deals with RM segmentation. Since the previous papers [8]–[13], [15] aim at being applied to LKS (lane keeping system), the lanes in [8]–[15] are represented in the form of curved lines, and they are not suitable for pixel-level mapping. The proposed method, however, outputs the segment of RMs in a pixel level and can build maps in the pixel level. We believe that the result of this paper can be extended from lane-level RM map to pixel-level RM map in the future, and the pixel-level RM map of this paper will compete with the high-definition (HD) point cloud map built using a LiDAR of high-end MMS system.
The second difference is more significant than the first one. The previous works and this paper have completely different goals. Since the previous works were developed aiming at LKS (lane keeping system), the RLs in [8]–[13], [15] aim to provide guidance on autonomous driving, and thus they cannot be used in RM mapping. Fig. 1 shows the training samples used in [8]–[13], [15] and SeRM dataset. They clearly show the difference in the lane detection for different two applications.
Large datasets regarding automotive driving are required to train a deep network to detect RMs using a monocular camera. Several datasets have been reported: Datasets [31]–[35] include road lane (RL), dataset [28] includes symbolic road marking (SRM), and datasets [15], [36]–[41] provide both RL and SRM labels.
First, let us consider the dataset about RL. The first dataset about RL is the Caltech lanes dataset [31]. In the set, the shape of a lane is represented using a polynomial, but its pixel level ground truth is not provided. Later, several other datasets regarding RL were reported: KITTI road dataset [32], TuSimple [33], FiveAI [34] and CULane dataset [35]. The datasets in [32], [34] provide instance level RLs in curved form, and they are beneficial for more sophisticated driving manoeuvres. Datasets [33], [35] provide information about ego-lane, and can be used in lane following tasks.
Second, let us consider the dataset about SRM. The most popular dataset about SRM is the road marking detection (RMD) dataset [28]. It provides about 28000 images with 10 categories of SRMs. The ten categories include speed limit signs “35” and “40”, “left turn”, “right turn” and “forward” arrows, and “bike”, “stop”, “ped”, “xing” and “rail” signs and they cover all kinds of SRMs commonly found on US roads.
Third, let us consider the datasets which include both RL and SRMs. The early datasets that provide both RL and SRM are ROMA [36], CamVid [37], and TRoM [38] dataset. All of them are very small. CamVid [37] is the first dataset with semantic annotated videos. The size of the dataset is small, containing 701 manually annotated images with 32 semantic classes. ROMA [36] is a dataset with 116 images with 3 RM classes. Since the set is too small, it can only be used for tests, and cannot be used for training. TRoM dataset [38] is the first dataset on semantic RMs under different weather, illumination, and traffic-load. The set provides 712 images accompanied by pixel-level annotation of 19 classes of RMs. Recently, a few large-scale pixel-level annotated datasets on semantic RMs have been reported, and examples are Mapillary vistas [39], VPGNet dataset [15], ApolloScape [40], and BDD100k [41]. Mapillary dataset [39] is a large set of images with fine annotations, and includes 25000 images with 66 object categories and 6 classes of RMs. VPGNet dataset [15] focuses only on RMs and does not consider other instances such as objects. The set consists of about 20000 images with 17 manually annotated RLs and SRMs. Vanishing point annotation is also provided in the set. However, the images in VPGNet dataset are not annotated with the location at which the images are captured. Thus, the dataset can be used for RM segmentation but not for localization or mapping. ApolloScape [40] has 140K annotated images and 27 RM classes, and is the largest datasets regarding RMs. The set covers not only RM detection, but also other tasks such as 3D reconstruction, semantic parsing, instance segmentation, etc. BDD100K dataset [41] contains 100K raw video sequences with about 120 million images. One image is selected from each video clip for annotation. 100K images are annotated on the bounding box level and 5K images are annotated on the pixel level. All the datasets related to RM are compared in Table I in terms of various aspects.
Type | Name | Year | # of frames | LIc | PLAd | SRM | Cls#e | Odf | GPS | LCg | MTh |
RLa | Caltech Lanes [31] | 2008 | 1224 | √ | – | – | – | – | – | – | – |
KITTI road [32] | 2013 | 600 | – | – | – | – | √ | √ | – | – | |
TuSimple [33] | 2017 | 6408 | √ | – | – | – | – | – | – | – | |
FiveAI [34] | 2018 | 23980 | √ | – | – | – | – | – | – | – | |
CULane [35] | 2018 | 133235 | √ | – | – | – | – | – | – | – | |
SRMb | Road Marking Detection [28] | 2012 | 28614 | – | – | √ | 10 | – | – | – | – |
RL+SRM | ROMA [36] | 2008 | 116 | √ | √ | √ | 3 | – | – | – | – |
CamVid [37] | 2008 | 701 | – | √ | √ | 2 | – | – | – | – | |
TRoM [38] | 2017 | 712 | √ | √ | √ | 19 | – | – | – | ||
Mapillary Vistas [39] | 2017 | 20000 | – | √ | √ | 6 | – | – | – | √ | |
VPGNet [15] | 2017 | 21097 | – | √ | √ | 17 | – | – | – | – | |
ApolloScape [40] | 2018 | 143K | – | √ | √ | 27 | – | √ | – | – | |
BDD100k [41] | 2020 | 120M1 | √ | √ | – | 11 | – | √ | – | √ | |
Ours | 2021 | 25157 | – | √ | √ | 16 | √ | √ | √ | √ | |
a Road Lane; b Symbolic Road Markings; c Lane Instances; d Pixel Level Annotation; e Number of Classes; f odometry of vehicle; g Loop Closure; h Multi-Trajectory 1 only 5683 images are pixel-level annotated |
After detecting RMs using a monocular camera, the absolute positions of these RMs should be recovered to make a lane-level map. In [24], a map correction system was reported in which noisy lane hypotheses from sensor inputs, noisy maps with RLs, and noisy global positioning were all combined to fine-tune a map using a probabilistic approach. In [25], pixel-level submaps comprising the RLs were made, and the entire lane map was built using submap matching. The submap matching methods were improved in [26] to build a lane-level map that contains four classes of RMs.
Interestingly, vehicle localization [5] and reference mapping [7] based on RM detection using a camera has been developed in [5], [7]. In particular, the goal of [7] looks similar to that of this paper. However, the mapping paper [7] completely differs from this paper in two aspects: The first difference between them is that the reference map developed in [7] consists of consecutive points on the lanes, thus it cannot represent the various environments such as the complex intersection shown in Fig. 2. Further, because the reference map in [7] consists only of consecutive points, the map in [7] does not provide much information about longitudinal movement, having clear limitation in correcting longitudinal error. On the contrary, our RM map consists of various RM segmentations, and it outputs not simple points but the pixel-level segments of RMs. Thus, unlike the reference map in [7], our RM map can represent various environments, and the map looks similar to high definition (HD) point cloud map built using a LiDAR, as shown in Fig. 2.
The second difference between [7] and our RM map is about the technical implementation. The reference map in [7] is built simply using only GPS or GNSS, and loops in the trajectory are not used at all. As the distance traveled by the mapping vehicle increases, more and more errors will be accumulated, degrading the map accuracy. Our RM map, however, is based on vSLAM, and it does not use GPS or GNSS. Whenever loops are closed, localization errors are corrected and are not accumulated.
Lots of research has been conducted regarding the vSLAM in the field of mobile robotics [16]–[18]. The key difference between the visual simultaneous localization and mapping (vSLAM) developed in mobile robotics [16]–[18] and the one in this paper is how to match (or associate) features, which in this paper are RM segments. Since RM is not represented by a point as in [16]–[18], but a set of points, the existing matching methods does not work. Therefore, we propose an S2S matching method which treats a RM as a pair of two nodes. The details about the registration are given in Section IV-C. To the best of our knowledge, there is no research work in which a lane-level RM map is built based on the RM segmentation using a monocular camera, and the map is built based on vSLAM.
To apply deep learning in building a lane-level map using only a low-cost camera, a large training set is needed. The training set should satisfy the following two conditions:
C1: All kinds of RMs, including lanes and SRMs, on the road should be annotated at the pixel level.
C2: For effective mapping, each image should be provided together with GPS and odometry information. The vehicle trajectories must have multiple loop closures, which can be used to reduce the mapping error.
Obviously, C1 is required to train a network which extracts the RMs from a camera input. C2 is needed to build a map using the RMs extracted from the camera image. However, none of the datasets satisfy the above-mentioned two conditions. In this study, we collected a large number of images and made a dataset that best fits affordable lane-level RM mapping. Table I shows a comparison of our SeRM dataset with other datasets.
For RM map building, the RM should be extracted from a camera image at the pixel level using a semantic segmentation network. To train the semantic segmentation network for the lane-level RM map, a large number of images were gathered in South Korea. The RMs in the images were manually annotated at the pixel level. The SeRM dataset included 25 158 images annotated with 16 classes of RMs at the pixel level. The number of images constituting the dataset was 19 998 for training and 5160 for testing. The image resolution is
CRM=CSRM∪CRL∪{background} |
where
Slow down | Go ahead | Turn right | Turn left | Ahead or turn right | Ahead or turn left |
![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Crosswalk | Numbers | Texts | Others (etc.) | ||
![]() | ![]() | ![]() | ![]() | ||
Double line (yellow) | Double line (blue) | Broken line (white) | Single line (yellow) | Single line (white) | Stop line |
![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Fig. 3 depicts some examples of the SeRM set, where the first column represents the original RGB images acquired by the monocular camera, and the second column shows the corresponding image annotated at the pixel level.
The images annotated with the experimental vehicle pose, where the images were acquired, were needed in building the lane-level RM map. To this end, 21 000 images were gathered on three routes in Seoul and Goyang-si and annotated with RTK-GPS and odometry data at the image level. Route 1 data were acquired from Sangam, Seoul. Route 1 is a complex route with many loop closures, including both narrow and wide roads. Route 2 data were obtained from Ilsan, Goyang-si, which consists of wide roads. Route 3 data were collected from very large areas in Sangam and contains data obtained while driving on the same road in opposite directions. The three routes include several loop closures, as shown in Fig. 4. The total length of our three driving routes was approximately 26 km, with each route being 4.86, 9.47, and 11.63 km, respectively.
The proposed lane-level map building system consisted of two phases: 1) RM segmentation and 2) lane-level RM mapping. Fig. 5 shows the outline of our lane-level map building system. In the first phase, a semantic segmentation network (i.e., RMNet) was developed and trained on the SeRM set. The RGB images
A lane-level RM map comprising the segmented RMs obtained from RMNet was built in the second phase. Lane-level RM map building refers to building a graph, including vehicle poses and RMs as nodes, and refining a graph-by-graph optimization. It is a graph SLAM [20] using RMs as landmarks. Lane-level RM mapping consists of four parts: inverse perspective mapping (IPM), RM correction, front-end, and back-end. In the IPM part, the RMs in the image plane are transformed to vehicle coordinates, and the observation error is corrected in the RM correction part. In the front-end, a graph composed of vehicle poses and RMs is built, whereas in the back-end, the graph is optimized.
Let us denote the original RGB image, the per-pixel ground truth, and the segmentation result from RMNet by
The proposed RMNet has encoder-decoder architecture, and it consists of shared parts and branch parts, as shown in the Fig. 6. The shared part corresponded to an encoder in encoder-decoder structure, and it generates features for both SRM and RL. Meanwhile, the branch part corresponds to a decoder and two decoding networks output
To lessen the possibility, we separate the two decoder networks for RL and SRM physically in the early phase of training. In the late phase of training, we connect two decoder networks while sharing the feature extraction encoder and fine-tune the two decoder networks.
When we train RMNet with the SeRM dataset, the simplest loss that we can use is a cross entropy (CE) loss function defined by
L(I)=−∑alli,j,cti,j,clogpi,j,c | (1) |
where
Balanced Cross Entropy Loss (BCE)
The most common method for addressing a class imbalance problem is to use balanced cross entropy (BCE) [42], which is based on cross entropy (CE) loss. For notational simplicity, we will denote per-pixel ground truth
CE(pc)=−tclog(pc). | (2) |
BCE is defined by introducing a weighting factor
BCE(pc)=−αctclog(pc) | (3) |
where a weighting factor
Focal Loss (FL)
Focal loss (FL) [42] is defined by adding a modulating factor
FL(pc)=−αc(1−pc)δtclog(pc). | (4) |
The modulating factor reduces the relative weight for well-classified examples but puts more focus on misclassified examples. When an example is misclassified and
Class-Weighted Loss (CWL)
In this paper, a new loss, named class-weighted loss (CWL), is proposed. The CWL is inspired by BCE and FL: BCE strikes a balance between classes by assigning a high weighting factor
More specifically, CWL is defined as
CWL(pc)=−αcλc,ttclog(pc) | (5) |
where
λc,t+1=λc,t+γ∑alli,j(ti,j,c−pi,j,c) | (6) |
where
For example, let us consider the case given in Fig. 8. Since the number of classes is 17,
Class-Weighted Focal Loss (CWFL)
Finally, another loss named as class-weighted focal loss (CWFL) is proposed in this paper. As stated, the notion of false negative is applied to each sample in FL, whereas the same notion is applied to each class in CWL. In CWFL, the notion of false negative is applied to both sample and class. That is, both class weight factor
CWFL(pc)=−αcλc,t(1−pc)δtclog(pc). | (7) |
The two proposed losses CWL and CWFL are applied to RMNet and they are analyzed in experimental section (Section VII).
Building a lane-level map using the RMs obtained in Section IV as landmarks requires four steps. First, the RMs on the camera image plane are transformed to a vehicle coordinate through inverse perspective mapping (IPM), enabling the RMs to be used for mapping. Second, the RM correction step is applied to correct the error from the IPM, which often occurs because of bumps on the road. A weighted iterative closest point (ICP) [43] is used to correct it in the second step. Third, a pose graph consisting of vehicle poses and RMs is built, and loop closing is considered. In loop closing, revisited locations are recognized, and the pose graph is simplified using the revisited locations. This step is actually a front-end of lane-level mapping. Fourth, the graph built in the third step is optimized in such a way that all the RM locations and vehicle odometries are respected as much as possible.
The RM segments obtained from RMNet should be back-projected into a vehicle coordinate to make a lane-level RM map. The back-projection was conducted using IPM in [44], as shown in Fig. 9. A pixel
[xy1]=M[uv1] | (8) |
assuming that the camera was installed at the height of H and tilted at the angle of
When a vehicle drives on real roads, it often encounters bumps or sunken parts and shakes up and down (Figs. 10(a) and 10(b)), leading to a location error in mapping RMs. In Fig. 10(a), the road is flat, and the RMs continue to be back-projected to the vehicle coordinate by the IPM. However, in Fig. 10(b), a part of the road is slightly sunken, and the RMs are recovered at the wrong locations by the IPM. A step, called RM correction, was conducted after the IPM to reduce the RM mapping error. In the RM correction, a set of points was sampled from the mapped RMs, and the ICP was applied to them, as shown in Fig. 10(c). From here, the word “mapped” means “re-projected into a vehicle coordinate by IPM and transformed into a world coordinate.” In Fig. 10(c), the blue points
However, based on our experience, the vanilla ICP is not enough. More specifically, the mapping error caused by the up and down shaking of the vehicle was mainly not latitudinal but longitudinal, as in Fig. 10(b). Many of the RMs were RLs (belong to
(R∗t,T∗t)=argminR,T∑cwc1NcNc∑i‖Rpci,t+T−pcj(i),t−1‖2, |
wherewc={0ifc=crosswalkλsrmifc=SRM&stoplineλo.wo.w. | (9) |
j(i)=argminj∈{1,…,Nm}‖(Rtpci,t+Tt)−pcj,t−1‖,wherei=1,…,Nc | (10) |
where
ˉPt=∪c{R∗tpci,t+T∗t|i=1,…,Nc}. | (11) |
Fig. 10(d) shows two consecutive RMs corrected by the weighted ICP. Compared with using the vanilla ICP, the alignment error between the consecutive RMs was significantly reduced by the weighted ICP. The corrected RM points
The third step is the front-end of the lane-level mapping that denotes graph building for mapping. In the front-end, the RMs obtained in the second step and the vehicle positions are registered as nodes in a pose graph. The RMs are used as landmark nodes of the graph. The position of the vehicle, from which the RMs are observed, is registered as the pose node. The odometry measurement obtained from the wheel encoder becomes the edge constraint between the pose nodes. Moreover, the distance to the RM is stored as the measurement edge between the pose and landmark nodes. After the initial graph is built, the graph is simplified or pruned whenever the same place is revisited and loop closure is detected.
1) Graph Building
The relation between a vehicle pose and the RMs observed therein is represented as an edge in the graph. When building a graph, an RM will be observed several times in consecutive times and should be registered as a single landmark node, regardless of how many times the RM was observed in consecutive times. In other words, when an RM observed at time
Let us consider the situation in which an RM turn left was observed in three consecutive times. During the transition from times
The instantaneous localization error caused by the low-cost odometry and the vehicle shaking on the bumpy road can be compensated to an extent by the weighted ICP given in Section IV-B. However, a small error will remain even after the weighted ICP. When it is accumulated over a huge environment, it will seriously degrade map accuracy. Here, the accumulated error was compensated for by detecting a revisited region and closing the loop in the graph. Let us suppose that our vehicle drives clockwise around a town as shown in Fig. 14(a), and five pose nodes are generated in the order of N1, N2,…, N5, as shown in Fig. 14(b). Here, the black solid line in Fig. 14(a) denotes the actual trajectory that our vehicle follows, and the blue solid line in Fig. 14(b) denotes the trajectory estimated based on odometry. Along the blue estimated trajectory, five pose nodes N1, N2 through N5 are generated. N1 is the pose node generated at the starting position, and N5 is the pose node generated when the vehicle revisits the starting position. Obviously, when the vehicle position is estimated using odometry, the estimation error will be accumulated during the driving, and N5 will be registered as a new node which is completely different from N1. But if the loop closure is detected (if we come to know that N5 is the same as N1), N5 will be dragged to N1 and the two pose nodes will be merged into one, as shown in Fig. 14(c). The mergence of N1 and N5 will reduce the position error not only at N5 but also errors at other nodes on the loop such as N2, N3 and N4, as shown in Fig. 14(d).
2) Loop-Closure Detection
In this paper, the DBoW2 developed in [19] was used to detect the revisited region using camera images and close the loop in the graph. Each image from the camera is represented by a bag-of-words (BoW) vector
s(vi,vj)=1−12‖vi‖vi‖2−vj‖vj‖2‖2. | (12) |
Using the similarity, when the BoW vector of the
ci=argmaxci−1<j≤i−Δts(vi,vj) | (13) |
where
When the
Here, one might think that it is quite uncommon for the same vehicle to revisit an area in the real world, but we can think of two possible scenarios in which loop closure occurs. First, in the current autonomous driving system, a mapping vehicle and autonomous vehicles are clearly separated. The role of the mapping vehicle is only mapping, and the mapping vehicle should make as many loop closures as possible. Second, as stated in Section I, a monocular camera has been installed on a number of vehicles. If the images taken by countless vehicles are collected in the cloud platform in the future, we can find a number of loop closures in the cloud.
The back-end module is the last step of our lane-level RM mapping, which is actually the graph optimization. A graph that considers loop closing was built in the front-end. The graph was then optimized in the back-end. Several off-the-shelf optimization frameworks can be used, and g2o [20] was used herein to optimize the graph. The initial vehicle pose was used as an origin in the graph coordinate. Finally, the origin was mapped into the geographic coordinate using GPS information, and the lane-level RM map was built in the geographic coordinate. The next section presents the results of the optimized vehicle trajectories and lane-level RM maps.
In this section, the validity of the proposed RMNet and the subsequent lane-level RM mapping is demonstrated by applying them to the SeRM dataset. The images in the SeRM dataset were captured in various environments under various weather conditions. The SeRM dataset also include the odometry and GPS position of the vehicle on which the camera was installed and loop closure was manually annotated.
In this subsection, the proposed RMNet and other semantic segmentation networks are applied to the RM segmentation using the SeRM dataset. Since the RMNet is developed for low-end MMS, the network is compared with small or real-time segmentation networks. Compared networks are FCN-based segmentation networks [27], [45] and state-of-the-art real-time semantic segmentation methods [46], [47]. For fair comparison, all the competing networks use the same backbone networks. ResNet-18 [48], and VGG 16 [49] are used as a backbone network. Experiments are conducted using PyTorch and python, and networks are trained and tested on GTX 1080Ti. All the competing methods are compared in terms of base model, computational and storage complexities in Table III. The unit of the computational complexity is Giga FLoating point OPerations (GFLOPs), which is widely used to describe how many operations are required to run a single instance of a given network model. The storage complexity is measured using the number of parameters (weights) which the network has. When the same backbone networks are used, all the competing networks consume almost the comparable computation and storage, as shown in Table III.
Method | Base model | Computational complexity (GFLOPs) | Storage complexity (# parameters, M) |
FCN-8s | VGG16 | 106.3 | 134.0 |
Weighted-FCN | VGG16 | 106.3 | 134.0 |
RMNet (ours) | VGG16 | 132.8 | 152.0 |
FCN-8s | Res18 | 16.5 | 42.7 |
BiSeNet | Res18 | 21.6 | 49.0 |
SwiftNet | Res18 | 20.3 | 11.8 |
RMNet (ours) | Res18 | 20.6 | 52.7 |
The performance of competing networks are compared in terms of precision, recall,
Precision | Class Acc. (Recall) | F1-score | mIoU | |
FCN [44] | 0.7681 | 0.8886 | 0.8240 | 0.7753 |
Weighted-FCN [27] | 0.7541 | 0.9170 | 0.8276 | 0.5865 |
BiSeNet [47] | 0.8117 | 0.8737 | 0.8416 | 0.5742 |
SwiftNet [48] | 0.8577 | 0.8942 | 0.8756 | 0.6156 |
RMNet | 0.8837 | 0.8546 | 0.8689 | 0.6169 |
RMNet+CWL | 0.8808 | 0.8740 | 0.8774 | 0.6466 |
RMNet+CWFL | 0.8942 | 0.9012 | 0.8977 | 0.6526 |
For further analysis, the mIoU for each class was summarized in Table V. RMNet+CWFL and RMNet+CWL take the first and second places in total mIoU. In particular, they demonstrate relatively consistent mIoU regardless of the classes. SwiftNet, however, demonstrates quite low mIoU in a few classes with small samples such as turn right or turn left).
Class | |||||||||||||||||
Networks | Slow down | Ahead | Turn right | Turn left | Ahead or Turn right | Ahead or Turn left | Cross- walk | Number | Text | Other | Double line (y) | Double line (b) | Broken line (w) | Single line (y) | Single line (w) | Stop line | Total mIoU |
FCN [44] | 0.7044 | 0.5485 | 0.6738 | 0.5455 | 0.3108 | 0.2403 | 0.7858 | 0.5902 | 0.8056 | 0.6440 | 0.4854 | 0.8029 | 0.8306 | 0.4708 | 0.3103 | 0.0487 | 0.5499 |
W-FCN [27] | 0.7868 | 0.6247 | 0.5374 | 0.6628 | 0.4714 | 0.4110 | 0.7560 | 0.6693 | 0.7783 | 0.6607 | 0.4576 | 0.7596 | 0.8912 | 0.5315 | 0.2926 | 0.0937 | 0.5865 |
BiSeNet [47] | 0.8153 | 0.7762 | 0.2591 | 0.2789 | 0.4040 | 0.1371 | 0.8587 | 0.7729 | 0.8461 | 0.7338 | 0.6060 | 0.8037 | 0.8617 | 0.5856 | 0.3869 | 0.0607 | 0.5742 |
SwiftNet [48] | 0.8622 | 0.8340 | 0.2826 | 0.2904 | 0.2891 | 0.2356 | 0.9008 | 0.8471 | 0.8855 | 0.7527 | 0.6198 | 0.8208 | 0.8991 | 0.7087 | 0.4926 | 0.1281 | 0.6156 |
RMNet | 0.8444 | 0.7036 | 0.6149 | 0.6514 | 0.5243 | 0.4418 | 0.8549 | 0.6907 | 0.7929 | 0.7122 | 0.5111 | 0.7765 | 0.8840 | 0.5332 | 0.3225 | 0.0125 | 0.6169 |
RMNet+CWL | 0.8380 | 0.7219 | 0.6915 | 0.6553 | 0.5587 | 0.4636 | 0.8624 | 0.7070 | 0.8425 | 0.7147 | 0.5201 | 0.7853 | 0.8993 | 0.6164 | 0.3789 | 0.0902 | 0.6466 |
RMNet+ CWFL | 0.8381 | 0.7214 | 0.7259 | 0.6416 | 0.6087 | 0.5866 | 0.8622 | 0.7207 | 0.8425 | 0.7199 | 0.5098 | 0.7986 | 0.9015 | 0.5761 | 0.3276 | 0.0604 | 0.6526 |
Fig. 16 shows some of the RM segmentation results. The false positives and the false negatives are indicated by the yellow and red circles, respectively. The class that each color represents is at the bottom of the figure. As shown in Fig. 16, FCN and W-FCN have frequent false predictions in some complex classes such as Text class. BiSeNet also shows false predictions for the pixels which are far from the camera such as Crosswalk in 2nd column or RLs in 4th and 5th columns. In SwiftNet, false negatives often occur for Crosswalk and for imbalanced classes such as [Turn right, Ahead or Turn right], as shown in the 3rd–6th column. The proposed RMNet, however, demonstrates good segmentation results consistently in all complex classes and imbalanced classes, and for pixels which are far from the camera. More examples of the RM segmentation results are included in a supplementary video at
Further, to see the validity of the proposed RMNet, we apply it to public datasets rather than SeRM set. ApolloScape [40] and VPGNet [15] are the public datasets which provide per-pixel GT about RM. RMNet is applied to them and the results are summarized in Table VI.
s_w_d | s_y_d | ds_y_dn | b_w_g | b_y_g | s_w_s | s_w_p | c_wy_z | a_w_t | a_w_tl | a_w_tr | a_w_l | a_w_r | a_w_lr | b_n_sr | r_wy_np | om_n_n | mIoU | |
Apollo [40] | 0.486 | 0.531 | 0.578 | 0.521 | 0.227 | 0.364 | 0.187 | 0.591 | 0.404 | 0.271 | 0.491 | 0.574 | 0.209 | 0.0001 | 0.009 | 0.361 | 0.405 | 0.4541 |
RMNet (+CWFL) | 0.739 | 0.707 | 0.788 | 0.674 | 0.667 | 0.588 | 0.101 | 0.697 | 0.605 | 0.701 | 0.742 | 0.705 | 0.213 | 0.339 | 0.015 | 0.665 | 0.529 | 0.5609 |
(a) RMNet performance on ApolloScape dataset [40] | ||||||||||||||||||
single (RL, w) | dashed (RL, w) | single (RL, y) | double (RL, y) | stopline (RL) | straight (SRM) | crosswalk (SRM) | safety zone (SRM) | mIoU | ||||||||||
RMNet (+CWFL) | 0.62 | 0.59 | 0.45 | 0.61 | 0.47 | 0.38 | 0.74 | 0.51 | 0.5462 | |||||||||
(b) RMNet performance on VPGNet dataset [15] |
When applied to ApolloScape given in Table VI(a), only 17 classes are used to evaluate our RMNet as in [40]. Here, “s_w_d” is short for solid, white, and dividing and “a_w_tl” is short for arrow, white, and thru & left turn, by combining the first letter of type, color, and usage. All of scenes 2 and 3 and half of scene 4 are used in the ApolloScape dataset for training; the other scenes and the remaining half of scene 4 are used for testing. Compared to the baseline given in [40], RMNet+CWFL demonstrates better mIoU.
Regarding VPGNet dataset [15], the set includes four categories: scene_1: daytime, no rain, scene_2: daytime, rain, scene_3: daytime, heavy rain, scene_4: night. First, we measure overall performance using all four categories. That is, we select test images from each category, making a total of 4000 test images. RMNet is trained on the remaining images in VPGNet. As in [15], RMNet is evaluated using only 8 classes. Since VPGNet dataset is quite new, we cannot find any result about RM segmentation. Only detection results were reported in [15]. Thus, RMNet+CWFL is not compared with other methods. Considering that mIoU is 0.5462, we can see that RMNet works reasonably well in VPGNet dataset. Next, we applied RMNet to more challenging scenes in the VPGNet dataset to see the limitations of our RMNet. Night scenes and rainy scenes are considered. In case of night scenes, we train RMNet on the first three categories (scene_1, scene_2, scene_3) and test on scene_4. In case of rainy scenes, we train RMNet on scene_1, scene_2, and scene_4, and test on scene_3. The results for challenging scenes are summarized in Table VII.
single (RL, w) | dashed (RL, w) | single (RL, y) | double (RL, y) | stopline (RL) | straight (SRM) | crosswalk (SRM) | safety zone (SRM) | mIoU | |
Night | 0.4407 | 0.4539 | 0.2681 | 0.6050 | 0.2867 | 0.4748 | 0.7621 | 0.3697 | 0.4576 |
Rainy | 0.5153 | 0.5035 | 0.5085 | 0.6194 | 0.4126 | 0.4365 | 0.7339 | 0.4867 | 0.5271 |
Compared to the mIoU in Table VI(b), the performance is slightly degraded under the night or rainy conditions, as shown in Table VII.The performance at night is much lower than in rainy time. In case of night scenes, illumination is very low, and the images are quite dark overall. Even humans have difficulty in recognizing the colors of the lanes due to the headlight of the approaching vehicles, which explains why mIoUs of “single (RL, w)” and “single (RL, y)” are quite low. In case of rainy scenes, illumination is much higher than in night scenes, and thus the performance is not degraded as much as in night scenes. The major difficulties in rainy scenes are either reflections from water on the road or raindrops on the windshield. These qualitative results are summarized in Fig. 17. In each condition, the first column is the example in which RMNet works relatively well but the remaining two columns are the examples in which RMNet predicts incorrect results.
In case of night scenes, the safety zone on the left of the figure is incorrectly segmented as yellow road lane in the second column, and the yellow road lane on the figure of the image is misclassified as white lane in the third column. Obviously, the misclassification is due to the low illumination in night scenes. In case of rainy scenes, the stopline in the middle of the figure is not detected in the second column due to the reflection from the water; and overall segmentation results are degraded in the third column due to the raindrops on the windshield.
One of the key factors which affects the performance of lane-level RM map is loop closure detection. In this subsection, a lane-level RM map is built by fully utilizing the loop closure detections. The three routes included in the SeRM dataset are designed to include various loop closures. For example, let us consider the frames 1940 to 11380 in “route 2”. The RMs in the frames of the interval can be depicted based on odometry as in Fig. 18(a).
Our vehicle drives clockwise and observes the RMs. When we recognize that frame 1940 looks quite similar to frame 8844, the loop closure occurs and the RMs in frame 8844 are dragged to the corresponding RMs in frame 1940, as shown in Fig. 18(b). Merging the landmark nodes in frames 1940 and 8844 aligns other RMs observed during the interval and reduces the mapping error that is accumulated through the entire route, as shown in the figure.
Then, let us consider the lane-level mapping using three routes given in Fig. 4. For an accurate evaluation of the lane-level mapping, the lane-level map built herein should be compared with the GT map of the three routes. Unfortunately, the GT position of RMs is not provided in the SeRM dataset. The lane-level mapping performance should also be evaluated based on the localization error using the lane-level map because the RTK-GPS signal is provided in the SeRM set.
Thus, the estimated trajectories are compared with the GT trajectories provided by RTK-GPS. The estimated trajectory of the proposed method was given and compared with the true trajectory (GT) in Fig. 19. On the left column of the figure, the trajectories of the proposed method, GT, and that obtained by the wheel encoder are indicated in black, red, and blue, respectively. On the right column, the deviation of the proposed trajectory from the true one is displayed in color. Blue denotes a small localization error, whereas red depicts a large error. The error was kept low in most areas when the environment was as simple as that in Fig. 19(d).
Fig. 20 shows the final lane-level RM maps built by the proposed method. The results were compared against the aerial view map. Figs. 20(a) and 20(c) depict the various complex RMs (e.g., straight-prohibition or U-turn) that were segmented and mapped well. Fig. 20(e) illustrates the lane-level map of the long road with the broken lines and the SRMs, where the long road was well mapped. The proposed method successfully worked even for a large environment. A supplementary video was submitted with this paper to demonstrate the value of the proposed method. The video is available at https://youtu.be/h4pIEwkPDd0.
The proposed mapping method is compared with the odometry and a previous method for the quantitative analysis. Since the absolution positions of RMs are unknown and thus the direct evaluation of mapping accuracy is infeasible, we compared all the competing methods in terms of localization accuracy. All trajectories are evaluated based on the GT trajectory acquired from the RTK-GPS. RMSE and max errors are defined by
RMSE=√1TT∑t=1||xt−xGTt||2 | (14) |
maxerror=maxt||xt−xGTt|| | (15) |
respectively, and are provided in Table VIII, where
Dataset | Route 1 | Route 2 | Route 3 | |||
RMSE (m) | Max (m) | RMSE (m) | Max (m) | RMSE (m) | Max (m) | |
Odometry only | 34.591 | 91.879 | 50.282 | 169.18 | 79.093 | 226.89 |
Weighted-FCN [27] | 7.758 | 28.475 | 14.646 | 30.154 | 15.998 | 56.463 |
Proposed method | 5.243 | 10.617 | 4.696 | 24.335 | 12.460 | 28.367 |
The baseline is the simple mapping using only wheel odometry. The proposed mapping system outperformed the vanilla odometry and method using weighted-FCN [27] in all measures. The table shows that the second row (weighted FCN) is an algorithm in which the semantic segmentation method reported in [27] was used, but the mapping method was the same as that developed herein. The proposed method outperformed the algorithm in the second row because the more accurately the RMs are segmented, the better these RMs are associated with each other and the better the graph obtained is.
In this subsection, an ablation study is conducted to compare the losses (CE, FL, CWL, and CWFL) used in this paper. The results are summarized in Table IX. mIoUs are evaluated while varying the parameter in each loss in Tables IX(b)–IX(d), and the best values are indicated in bold. In Table IX(a), all four losses are compared in terms of mIoU using respective best parameters. The best parameter of each loss is selected by evaluating mIoU while varying the parameter, as shown in Tables IX(b)–IX(d).
Loss function | mIoU | δ | mIoU | γ | mIoU | α | mIoU | |||
Cross Entropy (CE) | 0.6169 | 0 (= CE) | 0.6169 | 0 (= CE) | 0.6169 | 1 | 0.6502 | |||
Focal Loss (δ=1.0) | 0.6235 | 0.5 | 0.6162 | 0.1 | 0.6374 | Balanced Weight | 0.6526 | |||
Class-Weighted Loss (γ=0.2) | 0.6466 | 1.0 | 0.6235 | 0.2 | 0.6466 | |||||
Class-Weighted Focal Loss (δ=1.0&γ=0.2) | 0.6526 | 2.0 | 0.6225 | 0.5 | 0.6408 | |||||
5.0 | 0.5499 | |||||||||
(a) Comparison of various loss functions | (b) Varing δ for FL | (c) Varing γ for CWL | (d) Varing α for CWFL |
In Table IX(a), the CE is the same as a vanilla RMNet in Table V. Compared to other losses, CWFL is most effective in handling class imbalance.
In this paper, a new dataset, named SeRM, and lane-level RM mapping method have been developed. The proposed lane-level map is a landmark-based map that uses RLs and SRMs as landmarks. The SeRM dataset will be made publicly available① upon publication of this paper. The SeRM dataset can be used not only on its own, but also by being combined with other datasets, or being transformed to synthetic images as well, which lowers the need for other scholars to label data for mapping.
RMNet was also proposed and trained on a new data SeRM to extract the RMs from the images of the monocular camera. To build a lane-level map that uses the RMs as landmarks, the RM locations were corrected by the weighted ICP, and graph optimization. The proposed mapping system worked well for large environments. All the experiments and the submitted video clip showed that the proposed system can be implemented at a low price, and the lane-level map built by the proposed method is good and sufficiently accurate.
The main strength of the proposed lane-level RM map is that the final map consists of landmarks (RMs), and thus the map is much lighter than the point cloud map built by the classical MMS. Ironically, however, the trait is also the limitation of our RM map: When there are not enough RMs, the mapping and localization accuracy can be degraded. To solve the problem, further research along this direction should be conducted. For example, the possible solution could be the use of statistical sparse data handling methods such as [50]–[55], or the fusion with additional low-cost sensors such as an IMU.
[1] |
Y. F. Ma, Z. Y. Wang, H. Yang, and L. Yang, “Artificial intelligence applications in the development of autonomous vehicles: A survey,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 2, pp. 315–329, Mar. 2020. doi: 10.1109/JAS.2020.1003021
|
[2] |
J. Van Brummelen, M. O’Brien, D. Gruyer, and H. Najjaran, “Autonomous vehicle perception: The technology of today and tomorrow,” Transp. Res. Part C Emerg. Technol., vol. 89, pp. 384–406, Apr. 2018. doi: 10.1016/j.trc.2018.02.012
|
[3] |
H. Jo, W. Lee, and E. Kim, “Mixture density-PoseNet and its application to monocular camera-based global localization,” IEEE Trans. Industr. Inform., vol. 17, no. 1, pp. 388–397, Jan. 2021. doi: 10.1109/TII.2020.2986086
|
[4] |
L. Chen, X. M. Hu, W. Tian, H. Wang, D. P. Cao, and F. Y. Wang, “Parallel planning: A new motion planning framework for autonomous driving,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 236–246, Jan. 2019. doi: 10.1109/JAS.2018.7511186
|
[5] |
M. Schreiber, C. Knöppel, and U. Franke, “Laneloc: Lane marking based localization using highly accurate maps,” in Proc. IEEE Intelligent Vehicles Symp., Gold Coast, Australia, 2013, pp. 449−454.
|
[6] |
Y. Lu, J. W. Huang, Y. T. Chen, and B. Heisele, “Monocular localization in urban environments using road markings,” in Proc. IEEE Intelligent Vehicles Symp., Los Angeles, USA, 2017, pp. 468−474.
|
[7] |
R. P. D. Vivacqua, M. Bertozzi, P. Cerri, F. N. Martins, and R. F. Vassallo, “Self-localization based on visual lane marking maps: An accurate low-cost approach for autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 2, pp. 582–597, Feb. 2018. doi: 10.1109/TITS.2017.2752461
|
[8] |
D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, “Towards end-to-end lane detection: An instance segmentation approach,” in Proc. IEEE Intelligent Vehicles Symp., Changshu, China, 2018, pp. 286−291.
|
[9] |
W. Van Gansbeke, B. De Brabandere, D. Neven, M. Proesmans, and L. Van Gool, “End-to-end lane detection through differentiable least-squares fitting,” in Proc. IEEE/CVF Int. Conf. Computer Vision Workshop, Seoul, Korea (South), 2019, pp. 905−913.
|
[10] |
S. Yoo, H. S. Lee, H. Myeong, S. Yun, H. Park, J. Cho, and D. H. Kim, “End-to-end lane marker detection via row-wise classification,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2020, pp. 4335−4343.
|
[11] |
N. Garnett, R. Cohen, T. Pe’er, R. Lahav, and D. Levi, “3D-LaneNet: End-to-end 3D multiple lane detection,” in Proc. IEEE Int. Conf. Computer Vision, Seoul, Korea (South), 2019, pp. 2921−2930.
|
[12] |
Y. L. Guo, G. Chen, P. T. Zhao, W. D. Zhang, J. H. Miao, J. G. Wang, and T. E. Choe, “Gen-LaneNet: A generalized and scalable approach for 3D lane detection,” in Proc. European Conf. Computer Vision, Glasgow, UK, 2020, pp. 666−681.
|
[13] |
M. Ghafoorian, C. Nugteren, N. Baka, O. Booij, and M. Hofmann, “EL-GAN: Embedding loss driven generative adversarial networks for lane detection,” in Proc. European Conf. Computer Vision Workshop, Munich, Germany, 2019, pp. 256−272.
|
[14] |
Y. Lee, J. Lee, Y. Hong, Y. Ko, and M. Jeon, “Unconstrained road marking recognition with generative adversarial networks,” in Proc. IEEE Intelligent Vehicles Symp., Paris, France, 2019, pp. 1414−1419.
|
[15] |
S. Lee, J. Kim, J. S. Yoon, S. Shin, O. Bailo, N. Kim, T. H. Lee, H. S. Hong, S. H. Han, and I. S. Kweon, “VPGNet: Vanishing point guided network for lane and road marking detection and recognition,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 1965−1973.
|
[16] |
R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015. doi: 10.1109/TRO.2015.2463671
|
[17] |
J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in Proc. European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 834−849.
|
[18] |
J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2018. doi: 10.1109/TPAMI.2017.2658577
|
[19] |
D. Galvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, Oct. 2012. doi: 10.1109/TRO.2012.2197158
|
[20] |
R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “G2o: A general framework for graph optimization,” in Proc. Int. Conf. Robotics and Automation, Shanghai, China, 2011, pp. 3607−3613.
|
[21] |
V. Ilci and C. Toth, “High definition 3D map creation using GNSS/IMU/LiDAR sensor integration to support autonomous vehicle navigation,” Sensors, vol. 20, no. 3, Article No. 899, Feb. 2020. doi: 10.3390/s20030899
|
[22] |
R. Wan, Y. C. Huang, R. C. Xie, and P. Ma, “Combined lane mapping using a mobile mapping system,” Remote Sens., vol. 11, no. 3, pp. 305–330, Feb. 2019. doi: 10.3390/rs11030305
|
[23] |
N. Sairam, S. Nagarajan, and S. Ornitz, “Development of mobile mapping system for 3D road asset inventory,” Sensors, vol. 16, no. 3, pp. 367–386, Mar. 2016. doi: 10.3390/s16030367
|
[24] |
T. Heidenreich, J. Spehr, and C. Stiller, “LaneSLAM – Simultaneous pose and lane estimation using maps with lane-level accuracy,” in Proc. IEEE Int. Conf. Intelligent Transportation Systems, Gran Canaria, Spain, 2015, pp. 2512−2517.
|
[25] |
E. Rehder and A. Albrecht, “Submap-based SLAM for road markings,” in Proc. IEEE Intelligent Vehicles Symp., Seoul, Korea (South), 2015, pp. 1393−1398.
|
[26] |
J. Jeong, Y. Cho, and A. Kim, “Road-SLAM: Road marking based SLAM with lane-level accuracy,” in Proc. IEEE Intelligent Vehicles Symp., Los Angeles, USA, 2017, pp. 1736−1473.
|
[27] |
W. Jang, J. An, S. Lee, M. Cho, M. Sun, and E. Kim, “Road lane semantic segmentation for high definition map,” in Proc. IEEE Intelligent Vehicles Symp., Changshu, China, 2018, pp. 1001−1006.
|
[28] |
T. Wu and A. Ranganathan, “A practical system for road marking detection and recognition,” in Proc. IEEE Intelligent Vehicles Symp., Madrid, Spain, 2012, pp. 25−30.
|
[29] |
J. K. Suhr and H. G. Jung, “Fast symbolic road marking and stop-line detection for vehicle localization,” in Proc. IEEE Intelligent Vehicles Symp., Seoul, Korea (South), 2015, pp. 186−191.
|
[30] |
S. Jung, J. Youn, and S. Sull, “Efficient lane detection based on spatiotemporal images,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 1, pp. 289–295, Jan. 2016. doi: 10.1109/TITS.2015.2464253
|
[31] |
M. Aly, “Real time detection of lane markers in urban streets,” in Proc. IEEE Intelligent Vehicles Symp., Eindhoven, Netherlands, 2008, pp. 7−12.
|
[32] |
J. Fritsch, T. Kühnl, and A. Geiger, “A new performance measure and evaluation benchmark for road detection algorithms,” in Proc. IEEE Int. Conf. Intelligent Transportation Systems, The Hague, Netherlands, 2013, pp. 1693−1700.
|
[33] |
TuSimple: Lane detection challenge. 2017. [Online]. Available: https://github.com/TuSimple/tusimple-benchmark/tree/master/doc/lane_detection.
|
[34] |
B. Roberts, S. Kaltwang, S. Samangooei, M. Pender-Bare, K. Tertikas, and J. Redford, “A dataset for lane instance segmentation in urban environments,” in Proc. European Conf. Computer Vision, Munich, Germany, 2018, pp. 543−559.
|
[35] |
X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, “Spatial as deep: Spatial CNN for traffic scene understanding,” in Proc. Association for the Advancement of Artificial Intelligence, 2018.
|
[36] |
T. Veit, J. P. Tarel, P. Nicolle, and P. Charbonnier, “Evaluation of road marking feature extraction,” in Proc. IEEE Int. Conf. Intelligent Transportation Systems, Beijing, China, 2008, pp. 174−181.
|
[37] |
G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in Proc. European Conf. Computer Vision, Marseille, France, 2008, pp 44−57.
|
[38] |
X. L. Liu, Z. D. Deng, H. C. Lu, and L. L. Cao, “Benchmark for road marking detection: Dataset specification and performance baseline,” in Proc. IEEE Int. Conf. Intelligent Transportation Systems, Yokohama, Japan, 2017, pp. 1−6.
|
[39] |
G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 5000−5009.
|
[40] |
X. Y. Huang, X. J. Cheng, Q. C. Geng, B. B. Cao, D. F. Zhou, P. Wang, Y. Q. Lin, and R. G. Yang, “The ApolloScape dataset for autonomous driving,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops, Salt Lake City, USA, 2018, pp. 1067−1073.
|
[41] |
F. Yu, W. Q. Xian, Y. Y. Chen, F. C. Liu, M. K. Liao, V. Madhavan, and T. Darrell, “BDD100K: A diverse driving video database with scalable annotation tooling,” arXiv preprint arXiv: 1805.04687, 2020.
|
[42] |
P. J. Besl and N. McKay, “A method for registration of 3-D shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 239–256, Feb. 1992. doi: 10.1109/34.121791
|
[43] |
J. Wang, T. Mei, B. Kong, and H. Wei, “An approach of lane detection based on inverse perspective mapping,” in Proc. IEEE Int. Conf. Intelligent Transportation Systems, Qingdao, China, 2014, pp. 35−38.
|
[44] |
E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, 2017. doi: 10.1109/TPAMI.2016.2572683
|
[45] |
T. Y. Lin, P. Goyal, R. Girshick, K. M. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, pp. 2999−3007, 2017.
|
[46] |
K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 770−778.
|
[47] |
C. Q. Yu, J. B. Wang, C. Peng, C. X. Gao, G. Yu, and N. Sang, “BiSeNet: Bilateral segmentation network for real-time semantic segmentation,” in Proc. European Conf. Computer Vision, Munich, Germany, 2018, pp. 334−349.
|
[48] |
M. Oršic, I. Krešo, P. Bevandic, and S. Šegvic, “In defense of Pre-Trained ImageNet architectures for real-time semantic segmentation of road-driving images,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 12599−12608.
|
[49] |
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learning Representations, San Diego, USA, 2015.
|
[50] |
X. Luo, M. C. Zhou, Y. N. Xia, Q. S. Zhu, A. C. Ammari, and A. Alabdulwahab, “Generating highly accurate predictions for missing QoS data via aggregating nonnegative latent factor models,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 3, pp. 524–537, Mar. 2016. doi: 10.1109/TNNLS.2015.2412037
|
[51] |
X. Luo, M. C. Zhou, S. Li, and M. S. Shang, “An inherently nonnegative latent factor model for high-dimensional and sparse matrices from industrial applications,” IEEE Trans. Ind. Inf., vol. 14, no. 5, pp. 2011–2022, 2018. doi: 10.1109/TII.2017.2766528
|
[52] |
X. Luo, M. C. Zhou, S. Li, Y. N. Xia, Z. H. You, Q. S. Zhu, and H. Leung, “Incorporation of efficient second-order solvers into latent factor models for accurate prediction of missing QoS data,” IEEE Trans. Cybern., vol. 48, no. 4, pp. 1216–1228, Apr. 2018. doi: 10.1109/TCYB.2017.2685521
|
[53] |
X. Luo, H. Wu, H. Q. Yuan, and M. C. Zhou, “Temporal pattern-aware QoS prediction via biased non-negative latent factorization of tensors,” IEEE Trans. Cybern., vol. 50, no. 5, pp. 1798–1809, May 2020. doi: 10.1109/TCYB.2019.2903736
|
[54] |
X. Luo, M. C. Zhou, S. Li, L. Hu, and M. S. Shang, “Non-negativity constrained missing data estimation for high-dimensional and sparse matrices from industrial applications,” IEEE Trans. Cybern., vol. 50, no. 5, pp. 1844–1855, May 2020. doi: 10.1109/TCYB.2019.2894283
|
[55] |
S. C. Gao, M. C. Zhou, Y. R. Wang, J. J. Cheng, H. Yachi, and J. H. Wang, “Dendritic neuron model with effective learning algorithms for classification, approximation, and prediction,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 2, pp. 601–614, Feb. 2019. doi: 10.1109/TNNLS.2018.2846646
|
[1] | Juanjuan Li, Rui Qin, Sangtian Guan, Wenwen Ding, Fei Lin, Fei-Yue Wang. Attention Markets of Blockchain-Based Decentralized Autonomous Organizations[J]. IEEE/CAA Journal of Automatica Sinica, 2024, 11(6): 1370-1380. doi: 10.1109/JAS.2024.124491 |
[2] | Yang Yuan, Haibin Duan, Zhigang Zeng. Prescribed Performance Evolution Control for Quadrotor Autonomous Shipboard Landing[J]. IEEE/CAA Journal of Automatica Sinica, 2024, 11(5): 1151-1162. doi: 10.1109/JAS.2024.124254 |
[3] | Xiaolong Chen, Biao Xu, Manjiang Hu, Yougang Bian, Yang Li, Xin Xu. Safe Efficient Policy Optimization Algorithm for Unsignalized Intersection Navigation[J]. IEEE/CAA Journal of Automatica Sinica, 2024, 11(9): 2011-2026. doi: 10.1109/JAS.2024.124287 |
[4] | Yuanxin Lin, Zhiwen Yu, Kaixiang Yang, Ziwei Fan, C. L. Philip Chen. Boosting Adaptive Weighted Broad Learning System for Multi-Label Learning[J]. IEEE/CAA Journal of Automatica Sinica, 2024, 11(11): 2204-2219. doi: 10.1109/JAS.2024.124557 |
[5] | Luigi D’Alfonso, Francesco Giannini, Giuseppe Franzè, Giuseppe Fedele, Francesco Pupo, Giancarlo Fortino. Autonomous Vehicle Platoons In Urban Road Networks: A Joint Distributed Reinforcement Learning and Model Predictive Control Approach[J]. IEEE/CAA Journal of Automatica Sinica, 2024, 11(1): 141-156. doi: 10.1109/JAS.2023.123705 |
[6] | Tangyike Zhang, Junxiang Zhan, Jiamin Shi, Jingmin Xin, Nanning Zheng. Human-Like Decision-Making of Autonomous Vehicles in Dynamic Traffic Scenarios[J]. IEEE/CAA Journal of Automatica Sinica, 2023, 10(10): 1905-1917. doi: 10.1109/JAS.2023.123696 |
[7] | Hongyi Lin, Yang Liu, Shen Li, Xiaobo Qu. How Generative Adversarial Networks Promote the Development of Intelligent Transportation Systems: A Survey[J]. IEEE/CAA Journal of Automatica Sinica, 2023, 10(9): 1781-1796. doi: 10.1109/JAS.2023.123744 |
[8] | Jiayu Chai, Qiang Lu, Xudong Tao, Dongliang Peng, Botao Zhang. Dynamic Event-Triggered Fixed-Time Consensus Control and Its Applications to Magnetic Map Construction[J]. IEEE/CAA Journal of Automatica Sinica, 2023, 10(10): 2000-2013. doi: 10.1109/JAS.2023.123444 |
[9] | Hui Liu, Ciyun Lin, Bowen Gong, Dayong Wu. Automatic Lane-Level Intersection Map Generation using Low-Channel Roadside LiDAR[J]. IEEE/CAA Journal of Automatica Sinica, 2023, 10(5): 1209-1222. doi: 10.1109/JAS.2023.123183 |
[10] | Junjie Wang, Qichao Zhang, Dongbin Zhao. Highway Lane Change Decision-Making via Attention-Based Deep Reinforcement Learning[J]. IEEE/CAA Journal of Automatica Sinica, 2022, 9(3): 567-570. doi: 10.1109/JAS.2021.1004395 |
[11] | Wen Hu, Zejian Deng, Dongpu Cao, Bangji Zhang, Amir Khajepour, Lei Zeng, Yang Wu. Probabilistic Lane-Change Decision-Making and Planning for Autonomous Heavy Vehicles[J]. IEEE/CAA Journal of Automatica Sinica, 2022, 9(12): 2161-2173. doi: 10.1109/JAS.2022.106049 |
[12] | Guojun Wang, Jian Wu, Rui He, Bin Tian. Speed and Accuracy Tradeoff for LiDAR Data Based Road Boundary Detection[J]. IEEE/CAA Journal of Automatica Sinica, 2021, 8(6): 1210-1220. doi: 10.1109/JAS.2020.1003414 |
[13] | Chen Sun, Jean M. Uwabeza Vianney, Ying Li, Long Chen, Li Li, Fei-Yue Wang, Amir Khajepour, Dongpu Cao. Proximity Based Automatic Data Annotation for Autonomous Driving[J]. IEEE/CAA Journal of Automatica Sinica, 2020, 7(2): 395-404. doi: 10.1109/JAS.2020.1003033 |
[14] | Guangyuan Pan, Liping Fu, Qili Chen, Ming Yu, Matthew Muresan. Road Safety Performance Function Analysis With Visual Feature Importance of Deep Neural Nets[J]. IEEE/CAA Journal of Automatica Sinica, 2020, 7(3): 735-744. doi: 10.1109/JAS.2020.1003108 |
[15] | Long Chen, Xuemin Hu, Wei Tian, Hong Wang, Dongpu Cao, Fei-Yue Wang. Parallel Planning: A New Motion Planning Framework for Autonomous Driving[J]. IEEE/CAA Journal of Automatica Sinica, 2019, 6(1): 236-246. doi: 10.1109/JAS.2018.7511186 |
[16] | Zhe Chen, Jing Zhang, Dacheng Tao. Progressive LiDAR Adaptation for Road Detection[J]. IEEE/CAA Journal of Automatica Sinica, 2019, 6(3): 693-702. doi: 10.1109/JAS.2019.1911459 |
[17] | Liang Qi, Mengchu Zhou, Wenjing Luan. A Dynamic Road Incident Information Delivery Strategy to Reduce Urban Traffic Congestion[J]. IEEE/CAA Journal of Automatica Sinica, 2018, 5(5): 934-945. doi: 10.1109/JAS.2018.7511165 |
[18] | Cunxiao Miao, Jingjing Li. Autonomous Landing of Small Unmanned Aerial Rotorcraft Based on Monocular Vision in GPS-denied Area[J]. IEEE/CAA Journal of Automatica Sinica, 2015, 2(1): 109-114. |
[19] | Derong Liu, Changyin Sun, Bin Xian. Guest Editorial for Special Issue on Autonomous Control of Unmanned Aerial Vehicles[J]. IEEE/CAA Journal of Automatica Sinica, 2015, 2(1): 1-1. |
[20] | Nan Xiao, Xuehe Wang, Lihua Xie, Tichakorn Wongpiromsarn, Emilio Frazzoli, Daniela Rus. Road Pricing Design Based on Game Theory and Multi-agent Consensus[J]. IEEE/CAA Journal of Automatica Sinica, 2014, 1(1): 31-39. |
Type | Name | Year | # of frames | LIc | PLAd | SRM | Cls#e | Odf | GPS | LCg | MTh |
RLa | Caltech Lanes [31] | 2008 | 1224 | √ | – | – | – | – | – | – | – |
KITTI road [32] | 2013 | 600 | – | – | – | – | √ | √ | – | – | |
TuSimple [33] | 2017 | 6408 | √ | – | – | – | – | – | – | – | |
FiveAI [34] | 2018 | 23980 | √ | – | – | – | – | – | – | – | |
CULane [35] | 2018 | 133235 | √ | – | – | – | – | – | – | – | |
SRMb | Road Marking Detection [28] | 2012 | 28614 | – | – | √ | 10 | – | – | – | – |
RL+SRM | ROMA [36] | 2008 | 116 | √ | √ | √ | 3 | – | – | – | – |
CamVid [37] | 2008 | 701 | – | √ | √ | 2 | – | – | – | – | |
TRoM [38] | 2017 | 712 | √ | √ | √ | 19 | – | – | – | ||
Mapillary Vistas [39] | 2017 | 20000 | – | √ | √ | 6 | – | – | – | √ | |
VPGNet [15] | 2017 | 21097 | – | √ | √ | 17 | – | – | – | – | |
ApolloScape [40] | 2018 | 143K | – | √ | √ | 27 | – | √ | – | – | |
BDD100k [41] | 2020 | 120M1 | √ | √ | – | 11 | – | √ | – | √ | |
Ours | 2021 | 25157 | – | √ | √ | 16 | √ | √ | √ | √ | |
a Road Lane; b Symbolic Road Markings; c Lane Instances; d Pixel Level Annotation; e Number of Classes; f odometry of vehicle; g Loop Closure; h Multi-Trajectory 1 only 5683 images are pixel-level annotated |
Slow down | Go ahead | Turn right | Turn left | Ahead or turn right | Ahead or turn left |
![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Crosswalk | Numbers | Texts | Others (etc.) | ||
![]() | ![]() | ![]() | ![]() | ||
Double line (yellow) | Double line (blue) | Broken line (white) | Single line (yellow) | Single line (white) | Stop line |
![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Method | Base model | Computational complexity (GFLOPs) | Storage complexity (# parameters, M) |
FCN-8s | VGG16 | 106.3 | 134.0 |
Weighted-FCN | VGG16 | 106.3 | 134.0 |
RMNet (ours) | VGG16 | 132.8 | 152.0 |
FCN-8s | Res18 | 16.5 | 42.7 |
BiSeNet | Res18 | 21.6 | 49.0 |
SwiftNet | Res18 | 20.3 | 11.8 |
RMNet (ours) | Res18 | 20.6 | 52.7 |
Precision | Class Acc. (Recall) | F1-score | mIoU | |
FCN [44] | 0.7681 | 0.8886 | 0.8240 | 0.7753 |
Weighted-FCN [27] | 0.7541 | 0.9170 | 0.8276 | 0.5865 |
BiSeNet [47] | 0.8117 | 0.8737 | 0.8416 | 0.5742 |
SwiftNet [48] | 0.8577 | 0.8942 | 0.8756 | 0.6156 |
RMNet | 0.8837 | 0.8546 | 0.8689 | 0.6169 |
RMNet+CWL | 0.8808 | 0.8740 | 0.8774 | 0.6466 |
RMNet+CWFL | 0.8942 | 0.9012 | 0.8977 | 0.6526 |
Class | |||||||||||||||||
Networks | Slow down | Ahead | Turn right | Turn left | Ahead or Turn right | Ahead or Turn left | Cross- walk | Number | Text | Other | Double line (y) | Double line (b) | Broken line (w) | Single line (y) | Single line (w) | Stop line | Total mIoU |
FCN [44] | 0.7044 | 0.5485 | 0.6738 | 0.5455 | 0.3108 | 0.2403 | 0.7858 | 0.5902 | 0.8056 | 0.6440 | 0.4854 | 0.8029 | 0.8306 | 0.4708 | 0.3103 | 0.0487 | 0.5499 |
W-FCN [27] | 0.7868 | 0.6247 | 0.5374 | 0.6628 | 0.4714 | 0.4110 | 0.7560 | 0.6693 | 0.7783 | 0.6607 | 0.4576 | 0.7596 | 0.8912 | 0.5315 | 0.2926 | 0.0937 | 0.5865 |
BiSeNet [47] | 0.8153 | 0.7762 | 0.2591 | 0.2789 | 0.4040 | 0.1371 | 0.8587 | 0.7729 | 0.8461 | 0.7338 | 0.6060 | 0.8037 | 0.8617 | 0.5856 | 0.3869 | 0.0607 | 0.5742 |
SwiftNet [48] | 0.8622 | 0.8340 | 0.2826 | 0.2904 | 0.2891 | 0.2356 | 0.9008 | 0.8471 | 0.8855 | 0.7527 | 0.6198 | 0.8208 | 0.8991 | 0.7087 | 0.4926 | 0.1281 | 0.6156 |
RMNet | 0.8444 | 0.7036 | 0.6149 | 0.6514 | 0.5243 | 0.4418 | 0.8549 | 0.6907 | 0.7929 | 0.7122 | 0.5111 | 0.7765 | 0.8840 | 0.5332 | 0.3225 | 0.0125 | 0.6169 |
RMNet+CWL | 0.8380 | 0.7219 | 0.6915 | 0.6553 | 0.5587 | 0.4636 | 0.8624 | 0.7070 | 0.8425 | 0.7147 | 0.5201 | 0.7853 | 0.8993 | 0.6164 | 0.3789 | 0.0902 | 0.6466 |
RMNet+ CWFL | 0.8381 | 0.7214 | 0.7259 | 0.6416 | 0.6087 | 0.5866 | 0.8622 | 0.7207 | 0.8425 | 0.7199 | 0.5098 | 0.7986 | 0.9015 | 0.5761 | 0.3276 | 0.0604 | 0.6526 |
s_w_d | s_y_d | ds_y_dn | b_w_g | b_y_g | s_w_s | s_w_p | c_wy_z | a_w_t | a_w_tl | a_w_tr | a_w_l | a_w_r | a_w_lr | b_n_sr | r_wy_np | om_n_n | mIoU | |
Apollo [40] | 0.486 | 0.531 | 0.578 | 0.521 | 0.227 | 0.364 | 0.187 | 0.591 | 0.404 | 0.271 | 0.491 | 0.574 | 0.209 | 0.0001 | 0.009 | 0.361 | 0.405 | 0.4541 |
RMNet (+CWFL) | 0.739 | 0.707 | 0.788 | 0.674 | 0.667 | 0.588 | 0.101 | 0.697 | 0.605 | 0.701 | 0.742 | 0.705 | 0.213 | 0.339 | 0.015 | 0.665 | 0.529 | 0.5609 |
(a) RMNet performance on ApolloScape dataset [40] | ||||||||||||||||||
single (RL, w) | dashed (RL, w) | single (RL, y) | double (RL, y) | stopline (RL) | straight (SRM) | crosswalk (SRM) | safety zone (SRM) | mIoU | ||||||||||
RMNet (+CWFL) | 0.62 | 0.59 | 0.45 | 0.61 | 0.47 | 0.38 | 0.74 | 0.51 | 0.5462 | |||||||||
(b) RMNet performance on VPGNet dataset [15] |
single (RL, w) | dashed (RL, w) | single (RL, y) | double (RL, y) | stopline (RL) | straight (SRM) | crosswalk (SRM) | safety zone (SRM) | mIoU | |
Night | 0.4407 | 0.4539 | 0.2681 | 0.6050 | 0.2867 | 0.4748 | 0.7621 | 0.3697 | 0.4576 |
Rainy | 0.5153 | 0.5035 | 0.5085 | 0.6194 | 0.4126 | 0.4365 | 0.7339 | 0.4867 | 0.5271 |
Dataset | Route 1 | Route 2 | Route 3 | |||
RMSE (m) | Max (m) | RMSE (m) | Max (m) | RMSE (m) | Max (m) | |
Odometry only | 34.591 | 91.879 | 50.282 | 169.18 | 79.093 | 226.89 |
Weighted-FCN [27] | 7.758 | 28.475 | 14.646 | 30.154 | 15.998 | 56.463 |
Proposed method | 5.243 | 10.617 | 4.696 | 24.335 | 12.460 | 28.367 |
Loss function | mIoU | δ | mIoU | γ | mIoU | α | mIoU | |||
Cross Entropy (CE) | 0.6169 | 0 (= CE) | 0.6169 | 0 (= CE) | 0.6169 | 1 | 0.6502 | |||
Focal Loss (δ=1.0) | 0.6235 | 0.5 | 0.6162 | 0.1 | 0.6374 | Balanced Weight | 0.6526 | |||
Class-Weighted Loss (γ=0.2) | 0.6466 | 1.0 | 0.6235 | 0.2 | 0.6466 | |||||
Class-Weighted Focal Loss (δ=1.0&γ=0.2) | 0.6526 | 2.0 | 0.6225 | 0.5 | 0.6408 | |||||
5.0 | 0.5499 | |||||||||
(a) Comparison of various loss functions | (b) Varing δ for FL | (c) Varing γ for CWL | (d) Varing α for CWFL |
Type | Name | Year | # of frames | LIc | PLAd | SRM | Cls#e | Odf | GPS | LCg | MTh |
RLa | Caltech Lanes [31] | 2008 | 1224 | √ | – | – | – | – | – | – | – |
KITTI road [32] | 2013 | 600 | – | – | – | – | √ | √ | – | – | |
TuSimple [33] | 2017 | 6408 | √ | – | – | – | – | – | – | – | |
FiveAI [34] | 2018 | 23980 | √ | – | – | – | – | – | – | – | |
CULane [35] | 2018 | 133235 | √ | – | – | – | – | – | – | – | |
SRMb | Road Marking Detection [28] | 2012 | 28614 | – | – | √ | 10 | – | – | – | – |
RL+SRM | ROMA [36] | 2008 | 116 | √ | √ | √ | 3 | – | – | – | – |
CamVid [37] | 2008 | 701 | – | √ | √ | 2 | – | – | – | – | |
TRoM [38] | 2017 | 712 | √ | √ | √ | 19 | – | – | – | ||
Mapillary Vistas [39] | 2017 | 20000 | – | √ | √ | 6 | – | – | – | √ | |
VPGNet [15] | 2017 | 21097 | – | √ | √ | 17 | – | – | – | – | |
ApolloScape [40] | 2018 | 143K | – | √ | √ | 27 | – | √ | – | – | |
BDD100k [41] | 2020 | 120M1 | √ | √ | – | 11 | – | √ | – | √ | |
Ours | 2021 | 25157 | – | √ | √ | 16 | √ | √ | √ | √ | |
a Road Lane; b Symbolic Road Markings; c Lane Instances; d Pixel Level Annotation; e Number of Classes; f odometry of vehicle; g Loop Closure; h Multi-Trajectory 1 only 5683 images are pixel-level annotated |
Slow down | Go ahead | Turn right | Turn left | Ahead or turn right | Ahead or turn left |
![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Crosswalk | Numbers | Texts | Others (etc.) | ||
![]() | ![]() | ![]() | ![]() | ||
Double line (yellow) | Double line (blue) | Broken line (white) | Single line (yellow) | Single line (white) | Stop line |
![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Method | Base model | Computational complexity (GFLOPs) | Storage complexity (# parameters, M) |
FCN-8s | VGG16 | 106.3 | 134.0 |
Weighted-FCN | VGG16 | 106.3 | 134.0 |
RMNet (ours) | VGG16 | 132.8 | 152.0 |
FCN-8s | Res18 | 16.5 | 42.7 |
BiSeNet | Res18 | 21.6 | 49.0 |
SwiftNet | Res18 | 20.3 | 11.8 |
RMNet (ours) | Res18 | 20.6 | 52.7 |
Precision | Class Acc. (Recall) | F1-score | mIoU | |
FCN [44] | 0.7681 | 0.8886 | 0.8240 | 0.7753 |
Weighted-FCN [27] | 0.7541 | 0.9170 | 0.8276 | 0.5865 |
BiSeNet [47] | 0.8117 | 0.8737 | 0.8416 | 0.5742 |
SwiftNet [48] | 0.8577 | 0.8942 | 0.8756 | 0.6156 |
RMNet | 0.8837 | 0.8546 | 0.8689 | 0.6169 |
RMNet+CWL | 0.8808 | 0.8740 | 0.8774 | 0.6466 |
RMNet+CWFL | 0.8942 | 0.9012 | 0.8977 | 0.6526 |
Class | |||||||||||||||||
Networks | Slow down | Ahead | Turn right | Turn left | Ahead or Turn right | Ahead or Turn left | Cross- walk | Number | Text | Other | Double line (y) | Double line (b) | Broken line (w) | Single line (y) | Single line (w) | Stop line | Total mIoU |
FCN [44] | 0.7044 | 0.5485 | 0.6738 | 0.5455 | 0.3108 | 0.2403 | 0.7858 | 0.5902 | 0.8056 | 0.6440 | 0.4854 | 0.8029 | 0.8306 | 0.4708 | 0.3103 | 0.0487 | 0.5499 |
W-FCN [27] | 0.7868 | 0.6247 | 0.5374 | 0.6628 | 0.4714 | 0.4110 | 0.7560 | 0.6693 | 0.7783 | 0.6607 | 0.4576 | 0.7596 | 0.8912 | 0.5315 | 0.2926 | 0.0937 | 0.5865 |
BiSeNet [47] | 0.8153 | 0.7762 | 0.2591 | 0.2789 | 0.4040 | 0.1371 | 0.8587 | 0.7729 | 0.8461 | 0.7338 | 0.6060 | 0.8037 | 0.8617 | 0.5856 | 0.3869 | 0.0607 | 0.5742 |
SwiftNet [48] | 0.8622 | 0.8340 | 0.2826 | 0.2904 | 0.2891 | 0.2356 | 0.9008 | 0.8471 | 0.8855 | 0.7527 | 0.6198 | 0.8208 | 0.8991 | 0.7087 | 0.4926 | 0.1281 | 0.6156 |
RMNet | 0.8444 | 0.7036 | 0.6149 | 0.6514 | 0.5243 | 0.4418 | 0.8549 | 0.6907 | 0.7929 | 0.7122 | 0.5111 | 0.7765 | 0.8840 | 0.5332 | 0.3225 | 0.0125 | 0.6169 |
RMNet+CWL | 0.8380 | 0.7219 | 0.6915 | 0.6553 | 0.5587 | 0.4636 | 0.8624 | 0.7070 | 0.8425 | 0.7147 | 0.5201 | 0.7853 | 0.8993 | 0.6164 | 0.3789 | 0.0902 | 0.6466 |
RMNet+ CWFL | 0.8381 | 0.7214 | 0.7259 | 0.6416 | 0.6087 | 0.5866 | 0.8622 | 0.7207 | 0.8425 | 0.7199 | 0.5098 | 0.7986 | 0.9015 | 0.5761 | 0.3276 | 0.0604 | 0.6526 |
s_w_d | s_y_d | ds_y_dn | b_w_g | b_y_g | s_w_s | s_w_p | c_wy_z | a_w_t | a_w_tl | a_w_tr | a_w_l | a_w_r | a_w_lr | b_n_sr | r_wy_np | om_n_n | mIoU | |
Apollo [40] | 0.486 | 0.531 | 0.578 | 0.521 | 0.227 | 0.364 | 0.187 | 0.591 | 0.404 | 0.271 | 0.491 | 0.574 | 0.209 | 0.0001 | 0.009 | 0.361 | 0.405 | 0.4541 |
RMNet (+CWFL) | 0.739 | 0.707 | 0.788 | 0.674 | 0.667 | 0.588 | 0.101 | 0.697 | 0.605 | 0.701 | 0.742 | 0.705 | 0.213 | 0.339 | 0.015 | 0.665 | 0.529 | 0.5609 |
(a) RMNet performance on ApolloScape dataset [40] | ||||||||||||||||||
single (RL, w) | dashed (RL, w) | single (RL, y) | double (RL, y) | stopline (RL) | straight (SRM) | crosswalk (SRM) | safety zone (SRM) | mIoU | ||||||||||
RMNet (+CWFL) | 0.62 | 0.59 | 0.45 | 0.61 | 0.47 | 0.38 | 0.74 | 0.51 | 0.5462 | |||||||||
(b) RMNet performance on VPGNet dataset [15] |
single (RL, w) | dashed (RL, w) | single (RL, y) | double (RL, y) | stopline (RL) | straight (SRM) | crosswalk (SRM) | safety zone (SRM) | mIoU | |
Night | 0.4407 | 0.4539 | 0.2681 | 0.6050 | 0.2867 | 0.4748 | 0.7621 | 0.3697 | 0.4576 |
Rainy | 0.5153 | 0.5035 | 0.5085 | 0.6194 | 0.4126 | 0.4365 | 0.7339 | 0.4867 | 0.5271 |
Dataset | Route 1 | Route 2 | Route 3 | |||
RMSE (m) | Max (m) | RMSE (m) | Max (m) | RMSE (m) | Max (m) | |
Odometry only | 34.591 | 91.879 | 50.282 | 169.18 | 79.093 | 226.89 |
Weighted-FCN [27] | 7.758 | 28.475 | 14.646 | 30.154 | 15.998 | 56.463 |
Proposed method | 5.243 | 10.617 | 4.696 | 24.335 | 12.460 | 28.367 |
Loss function | mIoU | δ | mIoU | γ | mIoU | α | mIoU | |||
Cross Entropy (CE) | 0.6169 | 0 (= CE) | 0.6169 | 0 (= CE) | 0.6169 | 1 | 0.6502 | |||
Focal Loss (δ=1.0) | 0.6235 | 0.5 | 0.6162 | 0.1 | 0.6374 | Balanced Weight | 0.6526 | |||
Class-Weighted Loss (γ=0.2) | 0.6466 | 1.0 | 0.6235 | 0.2 | 0.6466 | |||||
Class-Weighted Focal Loss (δ=1.0&γ=0.2) | 0.6526 | 2.0 | 0.6225 | 0.5 | 0.6408 | |||||
5.0 | 0.5499 | |||||||||
(a) Comparison of various loss functions | (b) Varing δ for FL | (c) Varing γ for CWL | (d) Varing α for CWFL |