Enhanced machining quality, including appropriate surface roughness of machined parts, is a focus of many industries. This paper proposes and implements a transformer-based deep learning (DL) architecture that classifies machined surface roughness in end-milling operations from cutting force and machining sound data. To increase classification accuracy, two audio feature extraction techniques, the Mel-spectrogram and Mel-frequency cepstral coefficients (MFCCs), were combined with the DL model. To evaluate the proposed approach, models were trained on two data windows: 0–30 s of machining data, which includes the end mill–workpiece impact at the start of the cut, and 10–40 s of machining data. Based on the resulting accuracies, four DL models were designed, and their number of parameters, maximum number of epochs required for convergence, and training, validation, and test accuracies were compared. The DL models trained on 10–40 s machining data achieved validation and test accuracies above 90%, suggesting that cutting force and machining sound data recorded after the process reaches steady state yield better classification performance than data from the beginning of the machining operation. Confusion matrices of each model's inference results were plotted to visualize prediction accuracy. The proposed method can predict machined surface quality with high accuracy, saving time and cost in post-machining operations.
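As an illustration of the two audio features named above, the following is a minimal pure-Python sketch of how a log Mel-spectrogram and MFCCs are typically computed from a sound clip. It is not the authors' pipeline: the sample rate, frame length, hop size, number of Mel filters, number of coefficients, and the HTK-style Mel scale are all illustrative assumptions, and the helper names are hypothetical. A naive DFT is used for readability; practical code would use an FFT-based library such as librosa (`librosa.feature.melspectrogram`, `librosa.feature.mfcc`).

```python
import math

def hz_to_mel(f):
    # HTK-style Mel scale (assumed; other Mel formulas exist)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_signal(x, frame_len, hop):
    # Split the signal into overlapping frames, dropping the incomplete tail
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def power_spectrum(frame):
    # Apply a symmetric Hann window, then a naive DFT; returns |X[k]|^2 / N for k = 0..N/2
    n = len(frame)
    w = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    xw = [a * b for a, b in zip(frame, w)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(xw[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(xw[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers equally spaced on the Mel scale from 0 Hz to sr/2
    top = hz_to_mel(sr / 2)
    mel_pts = [i * top / (n_mels + 1) for i in range(n_mels + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sr)) for m in mel_pts]
    fb = []
    for j in range(1, n_mels + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):
            row[k] = (k - left) / (center - left)  # rising edge
        for k in range(center, right):
            row[k] = (right - k) / (right - center)  # falling edge
        fb.append(row)
    return fb

def log_mel_spectrogram(signal, sr=16000, frame_len=256, hop=128, n_mels=26):
    # One row of log Mel-band energies per frame
    fb = mel_filterbank(n_mels, frame_len, sr)
    out = []
    for frame in frame_signal(signal, frame_len, hop):
        ps = power_spectrum(frame)
        energies = [sum(f * p for f, p in zip(row, ps)) for row in fb]
        out.append([math.log(max(e, 1e-10)) for e in energies])  # floor avoids log(0)
    return out

def dct_ii(v, n_out):
    # Unnormalized DCT-II, keeping the first n_out coefficients
    n = len(v)
    return [sum(v[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n_out)]

def mfcc(signal, sr=16000, frame_len=256, hop=128, n_mels=26, n_mfcc=13):
    # MFCCs = DCT of each frame's log Mel energies
    return [dct_ii(row, n_mfcc)
            for row in log_mel_spectrogram(signal, sr, frame_len, hop, n_mels)]
```

For example, a synthetic 0.1 s, 1 kHz tone sampled at 16 kHz (1600 samples) yields 11 frames of 13 coefficients each with these settings. Either the Mel-spectrogram rows or the MFCC rows form a (frames × features) matrix, which is the kind of sequence input a transformer encoder can consume.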