Automatic detection and recognition of safety helmet wearing based on video analysis is an important means of ensuring production safety. However, complex environments and changeable factors make accurate detection and recognition of safety helmet wearing challenging. Helmet detection methods are generally divided into traditional machine learning methods and deep learning methods. Traditional machine learning methods rely on manually selected or statistical features and suffer from poor model stability. Deep learning-based methods are divided into "two-stage" and "one-stage" methods: the "two-stage" method has high detection accuracy but cannot achieve real-time detection, while the "one-stage" method is fast but less accurate. Achieving both accuracy and real-time performance is an important challenge in the development of video-based helmet detection, since correct and fast detection of helmets enables effective real-time monitoring of production sites. To this end, this paper proposes DS-YOLOv5, a real-time helmet detection and recognition model based on YOLOv5. It mainly addresses the following problems: first, CNN models extract insufficient global information; second, existing algorithms lack robustness to multiple targets and occlusion in video scenes; third, feature extraction for multi-scale targets is insufficient. To address these problems, the model first takes advantage of improved Deep SORT multi-target tracking to reduce the rate of missed detections under multi-target and occlusion conditions and to increase the error tolerance in video detection. Second, a simplified Transformer module (transformer block) is integrated into the backbone network to enhance the capture of global information from images and thus improve the learning of features from small targets. Finally, the Bidirectional Feature Pyramid Network (BiFPN) is applied to fuse multi-scale features, which better adapts to target scale changes caused by varying photographic distance.
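The multi-scale fusion step can be illustrated with a minimal sketch of BiFPN's fast normalized weighted fusion, in which each input feature map is scaled by a learnable non-negative weight before summation. This is a standalone NumPy illustration of the general technique, not the paper's actual implementation; the function name and fixed weights are hypothetical.

```python
import numpy as np

def weighted_fusion(features, weights, eps=1e-4):
    """Fast normalized fusion in the BiFPN style: ReLU-clipped weights are
    normalized to sum to ~1, then used to blend same-shape feature maps.
    (Hypothetical sketch; learnable weights would come from training.)"""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # keep weights non-negative
    w = w / (w.sum() + eps)  # cheap normalization instead of softmax
    return sum(wi * f for wi, f in zip(w, features))

# Example: fuse two 4x4 feature maps with raw weights 2 and 1
f1 = np.ones((4, 4))
f2 = np.zeros((4, 4))
fused = weighted_fusion([f1, f2], [2.0, 1.0])  # each entry is ~2/3
```

In a full BiFPN, such fusion nodes are stacked along both top-down and bottom-up pathways, so each output level mixes information from neighboring scales.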
The DS-YOLOv5 model was validated on the GDUT-HWD dataset through ablation and comparison experiments, and the tracking capability of the improved Deep SORT was evaluated on the public pedestrian dataset MOT. Comparative results against five "one-stage" methods and four helmet detection and recognition models demonstrate that the proposed model handles occlusion and target scale changes better. Its mAP reaches 95.5%, which is superior to that of the other helmet detection and recognition methods.