Classification of parotid gland tumors by using multimodal MRI and deep learning

Various MRI sequences have shown potential to discriminate parotid gland tumors, including but not limited to T2-weighted, postcontrast T1-weighted, and diffusion-weighted images. In this study, we present a fully automatic system for the diagnosis of parotid gland tumors by using deep learning methods trained on multimodal MRI images. We used a two-dimensional convolutional neural network, U-Net, to segment and classify parotid gland tumors. The U-Net model was trained with transfer learning, and a specific design of the batch distribution optimized the model accuracy. We also selected five combinations of MRI contrasts as the input data of the neural network and compared the classification accuracy of parotid gland tumors. The results indicated that the deep learning model with diffusion-related parameters performed better than those with structural MR images. The performance results (n = 85) of the diffusion-based model were as follows: accuracy of 0.81, 0.76, and 0.71; sensitivity of 0.83, 0.63, and 0.33; and specificity of 0.80, 0.84, and 0.87 for Warthin tumors, pleomorphic adenomas, and malignant tumors, respectively. Combining diffusion-weighted and contrast-enhanced T1-weighted images did not improve the prediction accuracy. In summary, the proposed deep learning model could classify Warthin tumors and pleomorphic adenomas but not malignant tumors.

Imaging modalities, such as MRI and computed tomography, are useful for identifying the location and size of parotid gland tumors (PGTs). Fine-needle aspiration biopsy is the primary method for identifying the tumor type, but its sensitivity is low (70%-80%) for recognizing malignant PGTs. 1,2 MRI can be useful for tumor classification. For example, T1- and T2-weighted images clearly present the texture of tumors, including areas of normal and lesion tissue. 3 High-grade malignant salivary gland tumors are distinguished on routine MR images by ill-defined borders, cystic components, low T2 signal intensity, necrosis, and invasion of surrounding tissues. However, MRI is often unable to distinguish between benign and malignant salivary tumors. 4-6 The apparent diffusion coefficient (ADC) derived from diffusion-weighted imaging (DWI) has been shown to be associated with tumor cellularity, and malignant tumors (MTs) exhibit hyperintensity in DWI. 7,8 The ADC value of a PGT region is useful for differentiating between Warthin tumors (WTs) and pleomorphic adenomas (PMAs). 9-12 However, the mean ADC values of WTs and MTs are not significantly different. 8,10,13 The ADC has a sensitivity of only 50%-60% for distinguishing MTs. 9,14 Therefore, identifying MTs through MRI remains a challenge.
Recently, deep learning methods, particularly convolutional neural network (CNN)-based models, have demonstrated effectiveness in image recognition tasks. CNN methods for pixel-wise classification, also referred to as semantic segmentation, are now widely employed in computer-vision applications, such as robotics and self-driving cars. 15,16 Semantic segmentation has also been applied to MRI. For example, in the global competition of the Multimodal Brain Tumor Segmentation Challenge (BraTS), 17,18 researchers achieved an accuracy of more than 80% for the pixel-wise classification of brain gliomas. In addition, deep-learning-based tumor segmentation and classification have been investigated for several cancers, including breast cancer, 19,20 liver tumor, 21-23 and nasopharyngeal carcinoma. 24 We hypothesized that deep learning applied to MRI data can also help detect and distinguish PGTs. In this study, we implemented a semantic segmentation method on multimodal MRI images for the segmentation of PGTs and the classification of tumor types.

| The patient cohort and MRI protocol
The Institutional Review Board of Tri-Service General Hospital approved the study and waived the requirement of written informed consent for this retrospective study. Eighty-five consecutive patients with PGT (54 men and 31 women; age 49.6 ± 15.6 years) who underwent MRI examination were enrolled. Their PGTs were of the types WT (n = 27), PMA (n = 33), and MT (n = 25) according to histologic findings. All MRI examinations were performed on a 1.5 T MRI system (Signa HDx, GE Healthcare) with an eight-channel neurovascular head-and-neck array coil. Before contrast administration, the scanning protocol included T2-weighted and DWI sequences. The acquisition parameters for T2-weighted imaging were as follows: slice orientation axial, TR 3150 ms, TE 77.3 ms, number of excitations 2, and slice number 32. Single-shot echo-planar DWI (slice orientation axial, TR 7000 ms, TE 72.2 ms, number of excitations 4, slice number 18, and fat saturated) was acquired with diffusion gradients of b = 0 and 1000 s/mm² applied along each of three orthogonal directions. After contrast administration (gadolinium-DTPA, 0.1 mmol/kg), we acquired T1-weighted images by using a fat-saturated fast spin-echo sequence with the following parameters: slice orientation axial, TR 616.7 ms, TE 12 ms, number of excitations 0.5, and slice number 32. Thus, we collected datasets containing four MRI contrasts for each patient, namely T2-weighted, contrast-enhanced T1-weighted, b0 (DWI, b = 0 s/mm²), and b1000 (DWI, b = 1000 s/mm²).

| Data conversion and image registration
After data acquisition, we collected all DICOM image files and sorted them by their DICOM tags. Next, we converted the files into NIfTI format with the dcm2niix software (https://github.com/rordenlab/dcm2niix) and transferred them to a workstation for further processing. A board-certified radiologist (CJJ), with more than 15 years of experience in head-and-neck MRI, manually outlined the tumor region on the contrast-enhanced T1-weighted images and constructed another three-dimensional (3D) volume with pixels labeled according to the histological records: 0 for nontumor (NT; including background), 1 for WT, 2 for PMA, and 3 for MT. Finally, this 3D volume of tumor labels was saved into another NIfTI file.
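The labeling convention above can be sketched as follows. This is an illustrative helper, not code from the described pipeline; `make_label_volume` and the toy mask are hypothetical, assuming the radiologist's outline is available as a boolean mask over the contrast-enhanced T1 volume:

```python
import numpy as np

# Label codes for the annotation volume, as described in the text:
# 0 = nontumor (NT, including background), 1 = WT, 2 = PMA, 3 = MT.
LABELS = {"NT": 0, "WT": 1, "PMA": 2, "MT": 3}

def make_label_volume(shape, tumor_mask, tumor_type):
    """Build a 3D label volume from a boolean tumor mask and its histologic type."""
    vol = np.zeros(shape, dtype=np.uint8)  # every voxel starts as NT (0)
    vol[tumor_mask] = LABELS[tumor_type]   # paint the manually outlined region
    return vol

# Toy example: a 4 x 4 x 2 volume with a small WT region on one slice
mask = np.zeros((4, 4, 2), dtype=bool)
mask[1:3, 1:3, 0] = True
vol = make_label_volume((4, 4, 2), mask, "WT")
```

In practice the resulting array would be written back to NIfTI (eg, with nibabel) alongside the image volumes.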
The subsequent step was to co-register the four volumes by using the Advanced Normalization Tools (ANTs) software package (http://stnava.github.io/ANTs/). We registered the T2 volume to the contrast-enhanced T1-weighted volume. Subsequently, we used deformable registration to obtain the coordinate transformation between the b0 and T2 volumes and applied that transformation to obtain the registered b0 and b1000 volumes. Using the registered diffusion-weighted volumes, we calculated ADC maps with the equation ADC = ln(SI0/SI1000)/1000, where SI0 and SI1000 are the signal intensities of the b0 and b1000 volumes, respectively. 25 The ADC maps subsequently underwent median filtering with a 3 × 3 kernel. Therefore, we had six 3D volumes per patient: the contrast-enhanced T1 (T1c), the registered T2, b0, and b1000 volumes, the ADC map, and the tumor-label volume. Figure 1A shows an example stack of the five MRI modalities. Figure 1B presents examples of the manually outlined regions of the three tumor types (red, WT; green, PMA; blue, MT). Because of the restriction of the input layer of the implemented neural network, which will be discussed later, the input stack size was fixed to a four-channel stack (256 × 256 × 4). To compare the relation between classification accuracy and MR contrasts, we generated the following types of four-channel stack: sT2, combining four identical images (T2, T2, T2, T2); sT1, combining (T1c, T1c, T1c, T1c); sT1T2, combining (T1c, T1c, T2, T2); sDWI, consisting of (zeros, b0, b1000, ADC); sALL, consisting of (T1c, T2, b0, b1000); and sALL2, consisting of (T1c, T2, b1000, ADC).
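The ADC computation and 3 × 3 median filtering can be sketched directly from the equation above; `adc_map` is a hypothetical helper, and the `eps` guard against zero-valued background voxels is our assumption, not part of the described pipeline:

```python
import numpy as np
from scipy.ndimage import median_filter

def adc_map(si_b0, si_b1000, b=1000.0, eps=1e-6):
    """ADC = ln(SI_0 / SI_1000) / b, applied voxel-wise.

    With b in s/mm^2, the result is in mm^2/s; eps avoids division by
    zero and log of zero in background voxels.
    """
    adc = np.log((si_b0 + eps) / (si_b1000 + eps)) / b
    return median_filter(adc, size=3)  # 3 x 3 median filtering, as in the text

# Toy example: uniform signals give a uniform ADC of ln(2)/1000
b0 = np.full((8, 8), 2.0)
b1000 = np.full((8, 8), 1.0)
adc = adc_map(b0, b1000)
```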

| Deep learning: U-Net and transfer learning
We used a 2D U-Net for pixel-wise tumor classification. 26 The network consisted of encoding and decoding paths composed of convolutional blocks. Each block consisted of a 3 × 3 convolution layer followed by a rectified linear unit and a dropout layer. In the encoding path, the output of each block was down-sampled with a max-pooling operation with a stride of 2. In the decoding path, the input of each block was concatenated with the corresponding feature maps obtained in the encoding path, and the output of each block was up-sampled using a transpose convolution.
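The shape bookkeeping of this encode/decode structure can be illustrated with a minimal numpy sketch; the convolution, ReLU, and dropout layers are omitted, and nearest-neighbour upsampling stands in for the learned transpose convolution, so this is a structural illustration only:

```python
import numpy as np

def maxpool2x(x):
    """2 x 2 max pooling with stride 2 (down-sampling in the encoding path)."""
    h, w, c = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbour stand-in for the transpose convolution (decoding path)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def skip_concat(decoder_feat, encoder_feat):
    """Concatenate decoder features with the matching encoder feature maps."""
    return np.concatenate([decoder_feat, encoder_feat], axis=-1)

x = np.random.rand(256, 256, 4)       # a four-channel input stack (256 x 256 x 4)
e1 = x                                # encoder level 1 (conv blocks omitted)
e2 = maxpool2x(e1)                    # down to 128 x 128
d1 = skip_concat(upsample2x(e2), e1)  # back to 256 x 256, channels doubled by the skip
```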
The final output layer of the U-Net was connected to a multiclass softmax classifier.
To initialize the U-Net, we used transfer learning, which refers to transferring network weights from a pretrained model to another model. In general, pretrained models are trained on very large datasets. When the architecture of the deep learning network is the same, the weights of the pretrained model can be used as the initial weights of the new model. Because the weights encode the process of extracting and filtering features, most deep learning models are specialized for a particular field or task. We therefore adapted a method that won third prize in BraTS 2017 to produce the pretrained model. 27 In that model, a four-channel input layer was implemented, and the U-Net was pretrained with four types of brain MR image (ie, T2, FLAIR, T1, and contrast-enhanced T1) and three tumor labels. After constructing the pretrained model, we transferred its weights to initialize the training procedure for classifying PGTs in the current study. The U-Net training parameters were as follows: optimizer, Adam; batch size, 6 or 8; loss function, cross-entropy; and beta of L2 regularization.
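The weight-transfer step can be sketched framework-agnostically; `transfer_weights` is a hypothetical helper representing models as plain `{layer_name: ndarray}` dicts (a stand-in for any framework's state dict), not the authors' actual code:

```python
import numpy as np

def transfer_weights(pretrained, target, skip=()):
    """Copy layer weights from a pretrained model into a new model.

    Layers named in `skip` (eg, a task-specific output layer) keep their
    fresh initialization; a copy occurs only when the layer exists in both
    models with matching shapes, as when reusing a BraTS-pretrained U-Net
    with the same four-channel architecture.
    """
    for name, w in pretrained.items():
        if name in skip or name not in target:
            continue
        if target[name].shape == w.shape:
            target[name] = w.copy()
    return target

# Toy example: reuse the encoder weights, keep a freshly initialized output layer
pretrained = {"enc1": np.ones((3, 3)), "out": np.ones((2, 2))}
fresh = {"enc1": np.zeros((3, 3)), "out": np.zeros((2, 2))}
model = transfer_weights(pretrained, fresh, skip=("out",))
```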

| Cross-validation, prediction, and performance assessment
We distributed the 2726 stacks into three groups by using stratified random sampling to conduct a threefold cross-validation of the U-Net model. 26 All the stacks of one patient were assigned to the same group, and every group had a proportional allocation of the three tumor types. The numbers of patients in the three groups were (WT 9, PMA 8, MT 8), (9, 8, 8), and (9, 8, 9), respectively. We performed eight trials of random sampling and U-Net training.
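The patient-level stratified sampling described above can be sketched as follows; `patient_level_folds` is a hypothetical helper, assuming each patient is represented by an identifier and a tumor-type label:

```python
import random
from collections import defaultdict

def patient_level_folds(patients, n_folds=3, seed=0):
    """Assign patients (not individual stacks) to folds.

    All stacks of one patient stay in the same fold, and each fold gets
    a near-proportional share of every tumor type (round-robin per type
    after shuffling).
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for pid, tumor_type in patients:
        by_type[tumor_type].append(pid)
    folds = [[] for _ in range(n_folds)]
    for pids in by_type.values():
        rng.shuffle(pids)
        for i, pid in enumerate(pids):
            folds[i % n_folds].append(pid)
    return folds

# Cohort sizes from the text: 27 WT, 33 PMA, 25 MT patients
patients = ([(f"WT{i}", "WT") for i in range(27)]
            + [(f"PMA{i}", "PMA") for i in range(33)]
            + [(f"MT{i}", "MT") for i in range(25)])
folds = patient_level_folds(patients)
```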

| Comparing training schemes: transfer learning and input batches
During the preliminary investigation, we evaluated four training schemes to determine the scheme that optimized tumor classification performance. In Scheme 1, the input batch size was six, and transfer learning was not applied; the training procedure randomly selected six stacks from the training stacks containing all three PGT types. Scheme 2 was the same as Scheme 1 but with transfer learning. Scheme 3 was the same as Scheme 2 except that the input batch was not a random mix of the three tumor types but comprised exactly two WT, two PMA, and two MT stacks.
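The fixed batch composition of Schemes 3 and 4 can be sketched as a simple sampler; `fixed_composition_batch` is a hypothetical helper using string placeholders for stacks, not the authors' training code:

```python
import random

def fixed_composition_batch(stacks_by_class, classes, per_class=2, seed=None):
    """Draw one training batch with a fixed per-class composition.

    Scheme 3: classes = ("WT", "PMA", "MT")        -> batch of 6
    Scheme 4: classes = ("WT", "PMA", "MT", "NT")  -> batch of 8
    """
    rng = random.Random(seed)
    batch = []
    for c in classes:
        batch += rng.sample(stacks_by_class[c], per_class)
    rng.shuffle(batch)  # avoid presenting classes in a fixed order
    return batch

# Toy example with string placeholders standing in for image stacks
stacks = {c: [f"{c}_{i}" for i in range(10)] for c in ("WT", "PMA", "MT", "NT")}
batch = fixed_composition_batch(stacks, ("WT", "PMA", "MT", "NT"), seed=0)
```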
In Scheme 4, transfer learning was applied, and all stacks, including both tumor and NT stacks, were used to train the U-Net model. The batch size was eight stacks: two WT, two PMA, two MT, and two NT stacks. To select the optimal training scheme, we trained the U-Net model with the sDWI stacks under each of the four schemes; the scheme that yielded the best results was then used to evaluate the segmentation and classification performance. Figure 3A shows the input sDWI stack. This patient had one WT in the right parotid gland. Figure 3B,C presents the predicted segmentations after different numbers of training steps for Scheme 1 and Scheme 2, respectively. The tumor was clearly outlined after 5000 training steps for Scheme 1 and after 1000 steps for Scheme 2. After 5000 steps, the Dice coefficient of the Scheme 2 result was considerably higher than that of Scheme 1, suggesting that applying transfer learning in Scheme 2 not only accelerated the convergence of the U-Net optimization but also improved the prediction accuracy.
FIGURE 3 Demonstration of the PGT segmentation of an image containing a WT on the right side. The input is an sDWI stack (A), and WT regions are predicted using models trained with Scheme 1 (B) and Scheme 2 (C). The WT regions are presented as red pixels overlaying the ADC images. After 1000, 3000, and 5000 training steps, the WT region became progressively more accurate. The number beneath each image is the Dice coefficient between the predicted and actual tumor regions. This example demonstrates the advantage of using transfer learning in Scheme 2.
For a patient with an MT, the segmentation results of all types of stack, except sT2, were close to the correct label. However, only the results obtained from sT1, sT1T2, and sDWI classified the PGT into the correct category (ie, MT, blue). Table 2 presents the group statistics of the recognition of PGTs.
For the segmentation results, the PGT region obtained using sALL produced the highest average Dice coefficient (0.48 ± 0.01). For tumor classification, the PGT types predicted using sDWI yielded the highest average accuracy. The group analysis revealed that the performance of sT2 was the poorest. Combining DWI and structural images did not improve the outcome. Among all models, classifying WTs exhibited the highest accuracy. Table 3 presents the complete classification results using sDWI. The sensitivity for WT, PMA, and MT was 0.83, 0.63, and 0.33, respectively, suggesting that the obtained U-Net model was not sensitive in classifying MTs.
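The per-class sensitivity and specificity reported here follow directly from a confusion matrix; the sketch below uses illustrative counts, not the study's actual results:

```python
import numpy as np

def sensitivity_specificity(cm, k):
    """Per-class sensitivity and specificity from a confusion matrix,
    where cm[i, j] counts cases with true class i predicted as class j."""
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp           # class-k cases predicted as something else
    fp = cm[:, k].sum() - tp           # other classes predicted as class k
    tn = cm.sum() - tp - fn - fp       # everything else
    return tp / (tp + fn), tn / (tn + fp)

# Toy 3-class example (rows: true WT, PMA, MT; columns: predicted)
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [1, 1, 2]])
sens_wt, spec_wt = sensitivity_specificity(cm, 0)
```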

| DISCUSSION
In this study, we describe a fully automatic system for the detection and classification of PGTs. We used a 2D U-Net CNN for multiclass segmentation. To identify a suitable training procedure, we proposed and compared four training schemes for the U-Net. We found that transfer learning and manipulation of the training batches progressively improved the classification accuracy. Unlike in Scheme 1, transfer learning was applied in Scheme 2, which improved performance; this suggests that the network weights pretrained with 44 175 stacks from the BraTS dataset may bring the U-Net model closer to the optimal solution, because the convolutional filters for detecting brain tumors and PGTs could be similar. In Schemes 3 and 4, the training batch for the forward path of the U-Net had a fixed structure. Each batch in Scheme 3 comprised two stacks of each tumor type, so class balance was maintained during the training stage. Although this setup improved the classification accuracy, a non-negligible number of NT pixels were misclassified as PGT (ie, false positives). In Scheme 4, we added to each batch two stacks generated from image slices without PGTs.
The performance of Scheme 4 was the best. Thus, we fixed the training procedure to that of Scheme 4 and continued exploring the model efficiency with various combinations of MRI images to identify PGT types.
We used six types of stack as input to the U-Net model and obtained the corresponding models. The Dice coefficient results revealed that the sALL and sDWI models produced better segmentation results than the other models. The sDWI model outperformed the others in classification accuracy.
Among the stacks consisting of only structural images, the sT1 model performed better than the sT2 and sT1T2 models. For sALL and sALL2, we assembled the stacks from all available MR modalities (T1, T2, and DWI) and assumed that the U-Net training procedure could derive the optimal combination of all the MR information. However, neither model was better than the sDWI model. This could be attributed to two factors: image registration and data size. For example, the four-channel sALL stack (matrix size 256 × 256 × 4) was constructed from four images (T1c, T2, b0, and b1000), under the assumption in multichannel deep learning that all channels are aligned pixel by pixel. Although we used deformable registration to correct the misalignment between the spin-echo-based structural images and the echo-planar imaging (EPI)-based DWI images, residual image distortion along the phase-encoding direction of the DWI images was inevitable, impairing the registration precision. This misregistration between channels could have reduced the accuracy of PGT recognition by the sALL-based U-Net model. We also merged all the MR information in the input layer to test whether the training procedure could select the dominant image channels and exclude the less useful ones. In theory, if the training procedure reaches the optimal solution, the sT1T2 model should be at least comparable to the sT1 model under equal computation power. However, our limited data size (2726 stacks) could have restricted the model optimization, so more information did not produce better results.
Among all the models investigated in this study, the sDWI model provided the best PGT classification results. The accuracy was 0.81, 0.76, and 0.71 for WT, PMA, and MT, respectively. The classification performance for WT is comparable to that reported in a previous study, which required sex and age information in addition to MRI images. 12 The sensitivity was 0.83, 0.63, and 0.33 for WT, PMA, and MT, respectively. These results suggest that our current model was insensitive to MTs. This critical limitation originates from the fact that malignant PGTs in humans often involve deeper structures, such as the parapharyngeal space, adjacent muscles, and bony tissues, which are not clearly presented in MRI. 28
One study limitation is the small data size, which hampers the optimization of the large segmentation network. Although transfer learning improved the classification accuracy, deep learning models trained with a larger dataset are warranted. Another limitation is the alignment of the DWI, T1-weighted, and T2-weighted images. Although we retrospectively performed deformable registration of all the images, the residual misregistration may have reduced the classification performance. Reducing the EPI distortion or acquiring all images with the same type of MR sequence, such as multishot EPI, 34 may be a solution.
In summary, we assessed the PGT classification performance of a U-Net method combined with multimodal MRI. The U-Net model based on DWI information outperformed the models based on contrast-enhanced T1- and T2-weighted images, and combining all available modalities did not improve the accuracy. The U-Net model could simultaneously outline the tumor region and identify the tumor type. It could be practical in the clinical setting for detecting WTs and PMAs, but it is not sensitive to MTs.