小熊奶糖(BearCandy)
Published on 2026-02-23

DFL Cloud Multithreading Crash Troubleshooting Guide

DeepFaceLab (DFL) in the Cloud: A Complete Troubleshooting Guide to Multithreading Crashes

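I. Symptoms and What the Low-Level Error Codes Mean

When you run train in a cloud GPU container (AutoDL, 恒源云, and similar platforms), the program crashes before it reaches the training interface and prints a flood of errors.

Key error signatures and what they mean at the system level: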

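  • [ERROR:0] 116: Can't spawn new thread: res = 11
    • Decoded: res = 11 is the result of the failed thread-creation call. On Linux, 11 is the POSIX error code EAGAIN, "Resource temporarily unavailable". The system is telling you that your process/thread quota is exhausted, so it refuses to create another thread.
  • OpenBLAS blas_thread_init: pthread_create failed for thread 19 of 20
    • Decoded: the underlying matrix library (OpenBLAS) tried to start a pool of 20 compute threads inside a single process, and the system refused when it reached thread 19.
  • FileNotFoundError: [Errno 2] ... SemLock
    • Decoded: Python's multiprocessing uses named semaphores and shared memory for inter-process communication (IPC). When the processes crash, these are not cleaned up properly: a child process may look for a semaphore that no longer exists, and stale entries left in shared memory can interfere with the next launch.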
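
The first decoding can be checked with the standard library. This is a minimal sketch; describe_errno is a helper name chosen here for illustration, and the numeric value 11 for EAGAIN is specific to Linux.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # On Linux, errno.EAGAIN is 11, the value shown in "res = 11".
    print(errno.EAGAIN, describe_errno(errno.EAGAIN))
```

Run it inside the container: if the first number printed is 11, the log line and the EAGAIN explanation refer to the same error.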

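II. Root Cause: Why Does the "Thread Explosion" Happen?

The problem is a conflict between the cloud platform's virtualized environment and Python's default behavior. We can call it a **"Thread Explosion"**. It comes from three misunderstandings stacked on top of each other:

1. Layer one: the hard isolation of cgroups

Cloud platforms use Docker containers to divide one large physical server among many users. To keep a single user's runaway code from freezing the whole machine, the kernel uses cgroups (control groups) to put hard quotas on your container, for example a pids.max that allows at most 256 processes and threads combined. You cannot raise this limit with ulimit, because it is enforced by the host's cgroup configuration and not by your shell's resource limits.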
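
To see the quota your own container runs under, you can read the pids controller files directly. The sketch below assumes the usual mount points for cgroup v2 and cgroup v1; these paths are assumptions and may differ on your platform, and the function names are illustrative.

```python
from pathlib import Path

# Usual locations of the pids controller files inside a container:
# cgroup v2 first, then cgroup v1. Treat these paths as assumptions.
PIDS_MAX_CANDIDATES = (
    "/sys/fs/cgroup/pids.max",
    "/sys/fs/cgroup/pids/pids.max",
)
PIDS_CURRENT_CANDIDATES = (
    "/sys/fs/cgroup/pids.current",
    "/sys/fs/cgroup/pids/pids.current",
)


def read_first(paths):
    """Return the stripped contents of the first readable file, or None."""
    for path in paths:
        try:
            return Path(path).read_text().strip()
        except OSError:
            continue
    return None


def pid_quota(paths=PIDS_MAX_CANDIDATES):
    """Return the task limit as an int, or None if unlimited or unknown."""
    raw = read_first(paths)
    if raw is None or raw == "max":
        return None
    return int(raw)


def tasks_in_use(paths=PIDS_CURRENT_CANDIDATES):
    """Return the number of tasks currently charged to the cgroup, or None."""
    raw = read_first(paths)
    return int(raw) if raw is not None else None


if __name__ == "__main__":
    print("pids.max:", pid_quota(), "pids.current:", tasks_in_use())
```

A result of None means the limit is either unlimited or could not be found at the assumed paths, so it does not prove that no quota exists.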

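2. Layer two: os.cpu_count sees through the container

When DFL prepares to load image data, it calls Python's os.cpu_count() to decide how many workers (child processes) to start. Inside a Docker container, however, os.cpu_count() does not take cgroup limits into account: it reports the host's logical CPU count (128, for example), not what your container was allotted. DFL therefore assumes it has the whole machine and starts as many as 128 child processes.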
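
For comparison, here is a rough way to estimate how many CPUs the container is really allowed to use. It is a sketch under assumptions: it only understands the cgroup v2 cpu.max file at its usual path, the function names are illustrative, and some platforms restrict CPUs by other means that this code will not detect.

```python
import math
import os
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max ("<quota> <period>" or "max <period>").

    Returns the CPU allowance as a float, or None when unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def container_cpu_count(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Estimate usable CPUs: min of host count, affinity mask and quota."""
    count = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        count = min(count, len(os.sched_getaffinity(0)))
    try:
        allowance = parse_cpu_max(Path(cpu_max_path).read_text())
    except (OSError, ValueError):
        allowance = None  # no cgroup v2 file, or unexpected format
    if allowance is not None:
        count = min(count, max(1, math.ceil(allowance)))
    return max(1, count)


if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())
    print("estimated container CPUs:", container_cpu_count())
```

If the two printed numbers differ widely, any code that sizes its worker pool from os.cpu_count() is likely to oversubscribe the container.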

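3. Layer three: nested multiplication in the underlying C libraries (the main culprit)

For image processing and matrix math, Python relies on NumPy and OpenCV, which in turn sit on top of numerical libraries written in C/C++ such as OpenBLAS or MKL. These libraries are greedy by default: each process that loads them sets up a thread pool sized to the number of CPU cores it detects.

The disastrous arithmetic:

  • DFL sees 128 cores and starts 128 child processes
  • OpenBLAS inside each child process sees 128 cores and tries to start 128 compute threads
  • Concurrent demand: 128 × 128 = 16,384 threads
  • Result: the container's quota of 256 is blown through almost immediately, the kernel refuses further threads, and res = 11 appears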
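
The same arithmetic can be turned around to ask how many workers fit inside a given quota. This is an illustrative sketch: the function names and the reserved headroom of 32 tasks are assumptions made here, not measured values.

```python
def worst_case_threads(workers, threads_per_worker):
    """Threads requested if every worker starts a full math-library pool."""
    return workers * threads_per_worker


def max_workers_within(quota, threads_per_worker, reserved=32):
    """Largest worker count whose thread pools fit inside the quota.

    `reserved` is headroom for the main process, shells and other tasks;
    the default of 32 is an illustrative guess.
    """
    usable = max(0, quota - reserved)
    return usable // threads_per_worker


if __name__ == "__main__":
    print(worst_case_threads(128, 128))  # 16384
    print(worst_case_threads(2, 1))      # 2
    print(max_workers_within(256, 128))  # 1
    print(max_workers_within(256, 1))    # 224
```

The last two lines show why limiting the per-process thread pool matters as much as limiting the worker count: with 128 threads per worker, a quota of 256 leaves room for a single worker.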

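III. The Fix (Step-by-Step, with Code)

The strategy is "clean up first, then clamp down", intercepting the problem directly in the Python source.

Step 1: Clear the battlefield (zombie processes and stale shared memory)

In the cloud platform's terminal, run the following commands in order, or simply restart the instance from the web console.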

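# Force-kill every Python process still stuck in the background to free PID quota.
# Note: -f matches the full command line, so this also kills any other Python program on the instance (a Jupyter kernel, for example).
pkill -9 -f python

# Empty the Linux shared-memory filesystem. Targets the SemLock "file not found" error and the multiarray import failure.
# Run this only after all training processes have exited, because it also deletes shared memory that running programs are using.
rm -rf /dev/shm/*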
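
If you would rather inspect the leftovers before deleting everything, the sketch below lists only the files that look like multiprocessing semaphores. It assumes the sem.mp-* naming that CPython and glibc normally produce for named semaphores on Linux; verify this against what you actually see in /dev/shm. The function names are illustrative.

```python
import glob
import os


def stale_semaphores(shm_dir="/dev/shm", pattern="sem.mp-*"):
    """List files that look like multiprocessing semaphores left behind."""
    return sorted(glob.glob(os.path.join(shm_dir, pattern)))


def remove_files(paths):
    """Delete the given files, skipping any that vanish in the meantime."""
    removed = []
    for path in paths:
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            pass
    return removed


if __name__ == "__main__":
    leftovers = stale_semaphores()
    print("candidates:", leftovers)
    # Uncomment once you have confirmed no training process is running:
    # print("removed:", remove_files(leftovers))
```

This is narrower than rm -rf /dev/shm/*, so it will not clear other kinds of shared-memory files that a crash may leave behind.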

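Step 2: Apply a monkey patch (intercept thread creation at the source level)

Open main.py in the DFL project root. The environment variables and the reported CPU count must be overridden at the very start of the program, before it has a chance to start any process.

Replace the code at the very top of main.py (the if __name__ == "__main__": line and the few lines below it) with the following structure: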

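if __name__ == "__main__":
    import os

    # [Defense line 1: pin the underlying C libraries (optional)]
    # If you enable these lines, they must run before cv2 or numpy is imported.
    # They limit every math library to 1 thread per process, which stops the
    # nested multiplication. Child processes inherit environment variables.
    # os.environ['OPENBLAS_NUM_THREADS'] = '1'
    # os.environ['OMP_NUM_THREADS'] = '1'
    # os.environ['MKL_NUM_THREADS'] = '1'
    # os.environ['NUMEXPR_NUM_THREADS'] = '1'
    # os.environ['NUMEXPR_MAX_THREADS'] = '1'
    # os.environ['VECLIB_MAXIMUM_THREADS'] = '1'

    import multiprocessing

    # [Defense line 2: override the CPU count Python reports (required)]
    # Replace the original cpu_count functions with lambdas so the program
    # believes the machine has only 2 cores and starts only 2 data-loading
    # child processes. With the "spawn" start method this patch exists only in
    # the main process; child processes do not re-run this block.
    os.cpu_count = lambda: 2
    multiprocessing.cpu_count = lambda: 2

    # Fix for Linux: multiprocessing start method (original DFL code)
    multiprocessing.set_start_method("spawn")

    # [Defense line 3: turn off OpenCV's internal threading (optional)]
    # Uncomment to make OpenCV run single-threaded so it does not contend
    # with NumPy; the aim is to avoid segmentation faults.
    # try:
    #     import cv2
    #     cv2.setNumThreads(0)
    # except Exception:
    #     pass

    # Continue with the normal DFL initialization
    from core.leras import nn
    nn.initialize_main_env()

    # Keep the original import sys, time, etc. below...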
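
To check whether the thread limits actually take effect, you can ask the kernel how many threads a process holds. The sketch below reads the Threads: field of /proc/self/status, which exists on Linux; native_thread_count is a name chosen here for illustration, and the NumPy import is only attempted if NumPy is installed.

```python
import threading


def native_thread_count(status_path="/proc/self/status"):
    """Return the kernel's thread count for this process (Linux).

    Falls back to Python-level threads where /proc is unavailable; that
    fallback cannot see threads created by C libraries.
    """
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return threading.active_count()


if __name__ == "__main__":
    before = native_thread_count()
    try:
        import numpy  # noqa: F401  (loads the BLAS library, if installed)
    except ImportError:
        pass
    print("threads before/after importing numpy:", before, native_thread_count())
```

Run it once with and once without OPENBLAS_NUM_THREADS=1 in the environment. A large jump after the import in the unrestricted run suggests the math library is sizing its pool from the host's core count.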

Complete main.py example:

if __name__ == "__main__":
    import os
    import sys
    import time
    import multiprocessing

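    # ====== Cap the worker count inside the container ======
    # Override the reported CPU core count so the underlying SubprocessGenerator starts fewer workers
    # If 4 runs cleanly but feels slow, raise it gradually to 6 or 8
    # If res = 11 still appears, lower it to 2 or even 1
    os.cpu_count = lambda: 4
    multiprocessing.cpu_count = lambda: 4
    # ===============================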

    # Fix for linux
    multiprocessing.set_start_method("spawn")

    from core.leras import nn
    nn.initialize_main_env()
    import argparse

    from core import pathex
    from core import osex
    from pathlib import Path
    from core.interact import interact as io

    if sys.version_info[0] < 3 or (sys.version_info[0] == 3 and sys.version_info[1] < 6):
        raise Exception("This program requires at least Python 3.6")

    class fixPathAction(argparse.Action):
        def __call__(self, parser, namespace, values, option_string=None):
            setattr(namespace, self.dest, os.path.abspath(os.path.expanduser(values)))

    exit_code = 0
  
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers()

    def process_extract(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Extractor
        Extractor.main( detector                = arguments.detector,
                        input_path              = Path(arguments.input_dir),
                        output_path             = Path(arguments.output_dir),
                        output_debug            = arguments.output_debug,
                        manual_fix              = arguments.manual_fix,
                        manual_output_debug_fix = arguments.manual_output_debug_fix,
                        manual_window_size      = arguments.manual_window_size,
                        face_type               = arguments.face_type,
                        max_faces_from_image    = arguments.max_faces_from_image,
                        image_size              = arguments.image_size,
                        jpeg_quality            = arguments.jpeg_quality,
                        cpu_only                = arguments.cpu_only,
                        force_gpu_idxs          = [ int(x) for x in arguments.force_gpu_idxs.split(',') ] if arguments.force_gpu_idxs is not None else None,
                      )

    p = subparsers.add_parser( "extract", help="Extract the faces from a pictures.")
    p.add_argument('--detector', dest="detector", choices=['s3fd','manual'], default=None, help="Type of detector.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--output-dir', required=True, action=fixPathAction, dest="output_dir", help="Output directory. This is where the extracted files will be stored.")
    p.add_argument('--output-debug', action="store_true", dest="output_debug", default=None, help="Writes debug images to <output-dir>_debug\\ directory.")
    p.add_argument('--no-output-debug', action="store_false", dest="output_debug", default=None, help="Don't write debug images to <output-dir>_debug\\ directory.")
    p.add_argument('--face-type', dest="face_type", choices=['half_face', 'full_face', 'whole_face', 'head', 'mark_only'], default=None)
    p.add_argument('--max-faces-from-image', type=int, dest="max_faces_from_image", default=None, help="Max faces from image.")  
    p.add_argument('--image-size', type=int, dest="image_size", default=None, help="Output image size.")
    p.add_argument('--jpeg-quality', type=int, dest="jpeg_quality", default=None, help="Jpeg quality.")  
    p.add_argument('--manual-fix', action="store_true", dest="manual_fix", default=False, help="Enables manual extract only frames where faces were not recognized.")
    p.add_argument('--manual-output-debug-fix', action="store_true", dest="manual_output_debug_fix", default=False, help="Performs manual reextract input-dir frames which were deleted from [output_dir]_debug\\ dir.")
    p.add_argument('--manual-window-size', type=int, dest="manual_window_size", default=1368, help="Manual fix window size. Default: 1368.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Extract on CPU..")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")

    p.set_defaults (func=process_extract)

    def process_sort(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Sorter
        Sorter.main (input_path=Path(arguments.input_dir), sort_by_method=arguments.sort_by_method)

    p = subparsers.add_parser( "sort", help="Sort faces in a directory.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--by', dest="sort_by_method", default=None, choices=("blur", "motion-blur", "face-yaw", "face-pitch", "face-source-rect-size", "hist", "hist-dissim", "brightness", "hue", "black", "origname", "oneface", "final-by-blur", "final-by-size", "absdiff"), help="Method of sorting. 'origname' sort by original filename to recover original sequence." )
    p.set_defaults (func=process_sort)

    def process_util(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Util

        if arguments.add_landmarks_debug_images:
            Util.add_landmarks_debug_images (input_path=arguments.input_dir)

        if arguments.recover_original_aligned_filename:
            Util.recover_original_aligned_filename (input_path=arguments.input_dir)

        if arguments.save_faceset_metadata:
            Util.save_faceset_metadata_folder (input_path=arguments.input_dir)

        if arguments.restore_faceset_metadata:
            Util.restore_faceset_metadata_folder (input_path=arguments.input_dir)

        if arguments.pack_faceset:
            io.log_info ("Performing faceset packing...\r\n")
            from samplelib import PackedFaceset
            PackedFaceset.pack( Path(arguments.input_dir) )

        if arguments.unpack_faceset:
            io.log_info ("Performing faceset unpacking...\r\n")
            from samplelib import PackedFaceset
            PackedFaceset.unpack( Path(arguments.input_dir) )
  
        if arguments.export_faceset_mask:
            io.log_info ("Exporting faceset mask..\r\n")
            Util.export_faceset_mask( Path(arguments.input_dir) )

    p = subparsers.add_parser( "util", help="Utilities.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--add-landmarks-debug-images', action="store_true", dest="add_landmarks_debug_images", default=False, help="Add landmarks debug image for aligned faces.")
    p.add_argument('--recover-original-aligned-filename', action="store_true", dest="recover_original_aligned_filename", default=False, help="Recover original aligned filename.")
    p.add_argument('--save-faceset-metadata', action="store_true", dest="save_faceset_metadata", default=False, help="Save faceset metadata to file.")
    p.add_argument('--restore-faceset-metadata', action="store_true", dest="restore_faceset_metadata", default=False, help="Restore faceset metadata to file. Image filenames must be the same as used with save.")
    p.add_argument('--pack-faceset', action="store_true", dest="pack_faceset", default=False, help="")
    p.add_argument('--unpack-faceset', action="store_true", dest="unpack_faceset", default=False, help="")
    p.add_argument('--export-faceset-mask', action="store_true", dest="export_faceset_mask", default=False, help="")

    p.set_defaults (func=process_util)

    def process_train(arguments):
        osex.set_process_lowest_prio()


        kwargs = {'model_class_name'         : arguments.model_name,
                  'saved_models_path'        : Path(arguments.model_dir),
                  'training_data_src_path'   : Path(arguments.training_data_src_dir),
                  'training_data_dst_path'   : Path(arguments.training_data_dst_dir),
                  'pretraining_data_path'    : Path(arguments.pretraining_data_dir) if arguments.pretraining_data_dir is not None else None,
                  'pretrained_model_path'    : Path(arguments.pretrained_model_dir) if arguments.pretrained_model_dir is not None else None,
                  'no_preview'               : arguments.no_preview,
                  'force_model_name'         : arguments.force_model_name,
                  'force_gpu_idxs'           : [ int(x) for x in arguments.force_gpu_idxs.split(',') ] if arguments.force_gpu_idxs is not None else None,
                  'cpu_only'                 : arguments.cpu_only,
                  'silent_start'             : arguments.silent_start,
                  'execute_programs'         : [ [int(x[0]), x[1] ] for x in arguments.execute_program ],
                  'debug'                    : arguments.debug,
                  }
        from mainscripts import Trainer
        Trainer.main(**kwargs)

    p = subparsers.add_parser( "train", help="Trainer")
    p.add_argument('--training-data-src-dir', required=True, action=fixPathAction, dest="training_data_src_dir", help="Dir of extracted SRC faceset.")
    p.add_argument('--training-data-dst-dir', required=True, action=fixPathAction, dest="training_data_dst_dir", help="Dir of extracted DST faceset.")
    p.add_argument('--pretraining-data-dir', action=fixPathAction, dest="pretraining_data_dir", default=None, help="Optional dir of extracted faceset that will be used in pretraining mode.")
    p.add_argument('--pretrained-model-dir', action=fixPathAction, dest="pretrained_model_dir", default=None, help="Optional dir of pretrain model files. (Currently only for Quick96).")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir", help="Saved models dir.")
    p.add_argument('--model', required=True, dest="model_name", choices=pathex.get_all_dir_names_startswith ( Path(__file__).parent / 'models' , 'Model_'), help="Model class name.")
    p.add_argument('--debug', action="store_true", dest="debug", default=False, help="Debug samples.")
    p.add_argument('--no-preview', action="store_true", dest="no_preview", default=False, help="Disable preview window.")
    p.add_argument('--force-model-name', dest="force_model_name", default=None, help="Forcing to choose model name from model/ folder.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Train on CPU.")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")
    p.add_argument('--silent-start', action="store_true", dest="silent_start", default=False, help="Silent start. Automatically chooses Best GPU and last used model.")
  
    p.add_argument('--execute-program', dest="execute_program", default=[], action='append', nargs='+')
    p.set_defaults (func=process_train)
  
    def process_exportdfm(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import ExportDFM
        ExportDFM.main(model_class_name = arguments.model_name, saved_models_path = Path(arguments.model_dir))

    p = subparsers.add_parser( "exportdfm", help="Export model to use in DeepFaceLive.")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir", help="Saved models dir.")
    p.add_argument('--model', required=True, dest="model_name", choices=pathex.get_all_dir_names_startswith ( Path(__file__).parent / 'models' , 'Model_'), help="Model class name.")
    p.set_defaults (func=process_exportdfm)

    def process_merge(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Merger
        Merger.main ( model_class_name       = arguments.model_name,
                      saved_models_path      = Path(arguments.model_dir),
                      force_model_name       = arguments.force_model_name,
                      input_path             = Path(arguments.input_dir),
                      output_path            = Path(arguments.output_dir),
                      output_mask_path       = Path(arguments.output_mask_dir),
                      aligned_path           = Path(arguments.aligned_dir) if arguments.aligned_dir is not None else None,
                      force_gpu_idxs         = arguments.force_gpu_idxs,
                      cpu_only               = arguments.cpu_only)

    p = subparsers.add_parser( "merge", help="Merger")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--output-dir', required=True, action=fixPathAction, dest="output_dir", help="Output directory. This is where the merged files will be stored.")
    p.add_argument('--output-mask-dir', required=True, action=fixPathAction, dest="output_mask_dir", help="Output mask directory. This is where the mask files will be stored.")
    p.add_argument('--aligned-dir', action=fixPathAction, dest="aligned_dir", default=None, help="Aligned directory. This is where the extracted of dst faces stored.")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir", help="Model dir.")
    p.add_argument('--model', required=True, dest="model_name", choices=pathex.get_all_dir_names_startswith ( Path(__file__).parent / 'models' , 'Model_'), help="Model class name.")
    p.add_argument('--force-model-name', dest="force_model_name", default=None, help="Forcing to choose model name from model/ folder.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Merge on CPU.")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")
    p.set_defaults(func=process_merge)

    videoed_parser = subparsers.add_parser( "videoed", help="Video processing.").add_subparsers()

    def process_videoed_extract_video(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.extract_video (arguments.input_file, arguments.output_dir, arguments.output_ext, arguments.fps)
    p = videoed_parser.add_parser( "extract-video", help="Extract images from video file.")
    p.add_argument('--input-file', required=True, action=fixPathAction, dest="input_file", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--output-dir', required=True, action=fixPathAction, dest="output_dir", help="Output directory. This is where the extracted images will be stored.")
    p.add_argument('--output-ext', dest="output_ext", default=None, help="Image format (extension) of output files.")
    p.add_argument('--fps', type=int, dest="fps", default=None, help="How many frames of every second of the video will be extracted. 0 - full fps.")
    p.set_defaults(func=process_videoed_extract_video)

    def process_videoed_cut_video(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.cut_video (arguments.input_file,
                           arguments.from_time,
                           arguments.to_time,
                           arguments.audio_track_id,
                           arguments.bitrate)
    p = videoed_parser.add_parser( "cut-video", help="Cut video file.")
    p.add_argument('--input-file', required=True, action=fixPathAction, dest="input_file", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--from-time', dest="from_time", default=None, help="From time, for example 00:00:00.000")
    p.add_argument('--to-time', dest="to_time", default=None, help="To time, for example 00:00:00.000")
    p.add_argument('--audio-track-id', type=int, dest="audio_track_id", default=None, help="Specify audio track id.")
    p.add_argument('--bitrate', type=int, dest="bitrate", default=None, help="Bitrate of output file in Megabits.")
    p.set_defaults(func=process_videoed_cut_video)

    def process_videoed_denoise_image_sequence(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.denoise_image_sequence (arguments.input_dir, arguments.factor)
    p = videoed_parser.add_parser( "denoise-image-sequence", help="Denoise sequence of images, keeping sharp edges. Helps to remove pixel shake from the predicted face.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory to be processed.")
    p.add_argument('--factor', type=int, dest="factor", default=None, help="Denoise factor (1-20).")
    p.set_defaults(func=process_videoed_denoise_image_sequence)

    def process_videoed_video_from_sequence(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.video_from_sequence (input_dir      = arguments.input_dir,
                                     output_file    = arguments.output_file,
                                     reference_file = arguments.reference_file,
                                     ext      = arguments.ext,
                                     fps      = arguments.fps,
                                     bitrate  = arguments.bitrate,
                                     include_audio = arguments.include_audio,
                                     lossless = arguments.lossless)

    p = videoed_parser.add_parser( "video-from-sequence", help="Make video from image sequence.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--output-file', required=True, action=fixPathAction, dest="output_file", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--reference-file', action=fixPathAction, dest="reference_file", help="Reference file used to determine proper FPS and transfer audio from it. Specify .*-extension to find first file.")
    p.add_argument('--ext', dest="ext", default='png', help="Image format (extension) of input files.")
    p.add_argument('--fps', type=int, dest="fps", default=None, help="FPS of output file. Overwritten by reference-file.")
    p.add_argument('--bitrate', type=int, dest="bitrate", default=None, help="Bitrate of output file in Megabits.")
    p.add_argument('--include-audio', action="store_true", dest="include_audio", default=False, help="Include audio from reference file.")
    p.add_argument('--lossless', action="store_true", dest="lossless", default=False, help="PNG codec.")

    p.set_defaults(func=process_videoed_video_from_sequence)

    facesettool_parser = subparsers.add_parser( "facesettool", help="Faceset tools.").add_subparsers()

    def process_faceset_enhancer(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import FacesetEnhancer
        FacesetEnhancer.process_folder ( Path(arguments.input_dir),
                                         cpu_only=arguments.cpu_only,
                                         force_gpu_idxs=arguments.force_gpu_idxs
                                       )

    p = facesettool_parser.add_parser ("enhance", help="Enhance details in DFL faceset.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory of aligned faces.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Process on CPU.")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")

    p.set_defaults(func=process_faceset_enhancer)
  
  
    p = facesettool_parser.add_parser ("resize", help="Resize DFL faceset.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory of aligned faces.")

    def process_faceset_resizer(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import FacesetResizer
        FacesetResizer.process_folder ( Path(arguments.input_dir) )
    p.set_defaults(func=process_faceset_resizer)

    def process_dev_test(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import dev_misc
        dev_misc.dev_gen_mask_files( arguments.input_dir )

    p = subparsers.add_parser( "dev_test", help="")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_dev_test)
  
    # ========== XSeg
    xseg_parser = subparsers.add_parser( "xseg", help="XSeg tools.").add_subparsers()
  
    p = xseg_parser.add_parser( "editor", help="XSeg editor.")

    def process_xsegeditor(arguments):
        osex.set_process_lowest_prio()
        from XSegEditor import XSegEditor
        global exit_code
        exit_code = XSegEditor.start (Path(arguments.input_dir))
  
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")

    p.set_defaults (func=process_xsegeditor)
  
    p = xseg_parser.add_parser( "apply", help="Apply trained XSeg model to the extracted faces.")

    def process_xsegapply(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.apply_xseg (Path(arguments.input_dir), Path(arguments.model_dir))
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir")
    p.set_defaults (func=process_xsegapply)
  
  
    p = xseg_parser.add_parser( "remove", help="Remove applied XSeg masks from the extracted faces.")
    def process_xsegremove(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.remove_xseg (Path(arguments.input_dir) )
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_xsegremove)
  
  
    p = xseg_parser.add_parser( "remove_labels", help="Remove XSeg labels from the extracted faces.")
    def process_xsegremovelabels(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.remove_xseg_labels (Path(arguments.input_dir) )
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_xsegremovelabels)
  
  
    p = xseg_parser.add_parser( "fetch", help="Copies faces containing XSeg polygons in <input_dir>_xseg dir.")

    def process_xsegfetch(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.fetch_xseg (Path(arguments.input_dir) )
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_xsegfetch)
  
    def bad_args(arguments):
        parser.print_help()
        exit(0)
    parser.set_defaults(func=bad_args)

    arguments = parser.parse_args()
    arguments.func(arguments)

    if exit_code == 0:
        print ("Done.")
  
    exit(exit_code)
  
'''
import code
code.interact(local=dict(globals(), **locals()))
'''

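Step 3: Rebuild the Conda environment (for stubborn NumPy errors)

If the program no longer reports res = 11 after the two steps above but still occasionally raises ImportError: numpy.core.multiarray failed to import, this usually points to a conflict between the installed OpenCV and NumPy builds rather than to the thread quota. The GUI build of OpenCV is also unnecessary in a headless container.

In the terminal, uninstall both and install a clean headless build: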

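pip uninstall opencv-python opencv-python-headless numpy -y
# If your DFL build pins specific versions in its requirements file, install those versions instead of the latest release.
pip install numpy
pip install opencv-python-headless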
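
Before and after reinstalling, it helps to see which OpenCV builds pip believes are installed, since having two of them side by side is a common source of conflicts. A minimal sketch, assuming Python 3.8 or newer for importlib.metadata; the list of distribution names covers the OpenCV packages published on PyPI, and the function name is illustrative.

```python
from importlib import metadata  # Python 3.8+

OPENCV_DISTRIBUTIONS = (
    "opencv-python",
    "opencv-python-headless",
    "opencv-contrib-python",
    "opencv-contrib-python-headless",
)


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    versions = installed_versions(OPENCV_DISTRIBUTIONS + ("numpy",))
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    opencv_installed = [n for n in OPENCV_DISTRIBUTIONS if versions[n]]
    if len(opencv_installed) > 1:
        print("More than one OpenCV build is installed:", opencv_installed)
```

After a clean reinstall, only opencv-python-headless and numpy should show a version.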

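IV. Tuning and Running

Once the code is patched, running python main.py train ... again should bring up the interactive menu (which GPU to use, model parameters, and so on). From here, follow these guidelines:

1. The key choice at the interactive prompt

If the console asks Use multiprocessing for data loading? (y/n), answer n (No). The patch already caps the loader at 2 cores; answering n goes one step further and falls back to single-process, sequential loading. Data preprocessing becomes slightly slower, but this is the most stable option in a cloud container.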

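2. How do you get the most out of the GPU?

Because the patch throttles how fast the CPU feeds images to the GPU (the data loader), a very powerful cloud GPU (an RTX 3090 or 4090, for example) may sit idle waiting for data.

  • Start cautiously: on the first successful run, set Batch Size to a conservative value (4 or 8, for example).
  • Loosen gradually: if the program has run stably for half an hour and GPU utilization stays below 50%, press Enter to stop training, change lambda: 2 to lambda: 4 in main.py, and raise Batch Size. Repeat in small steps to find where this cloud instance's limit lies.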
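
To read GPU utilization from a second terminal while training runs, you can wrap nvidia-smi's query mode. This is a sketch that assumes nvidia-smi is on the PATH and Python 3.7 or newer; the function names are illustrative, and the function returns None instead of raising when the tool is missing or its output is not numeric.

```python
import subprocess


def parse_utilization(output):
    """Turn nvidia-smi CSV output (one integer per line) into a list."""
    return [int(line) for line in output.splitlines() if line.strip()]


def gpu_utilization():
    """Return per-GPU utilization in percent, or None if unavailable."""
    command = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    try:
        return parse_utilization(result.stdout)
    except ValueError:
        return None


if __name__ == "__main__":
    print("GPU utilization (%):", gpu_utilization())
```

A single reading can be misleading, so sample it several times over a few minutes before deciding to raise the worker count or the batch size.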

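V. Quick Troubleshooting Reference

| Error keyword observed | Root cause | Action to take |
| --- | --- | --- |
| res = 11 / Can't spawn new thread | The process/thread quota is exhausted | Add the os.cpu_count = lambda: 2 patch at the top of main.py. |
| OpenBLAS blas_thread_init... | Nested thread growth in the underlying math libraries | Set os.environ['OPENBLAS_NUM_THREADS'] = '1' and the related environment variables at the very top. |
| SemLock / No such file or directory | Leftover files from the previous crash remain in shared memory | Run rm -rf /dev/shm/* in the terminal. |
| numpy.core.multiarray failed | Out of memory, or an OpenCV/NumPy library conflict | Uninstall the standard OpenCV build and install opencv-python-headless. |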
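
The reference table can also be applied mechanically to a saved crash log. The sketch below is illustrative only: the keywords come from the error messages discussed in this guide, while the rule list and the triage function are names invented here.

```python
# Keyword -> suggested action, following the reference table.
TRIAGE_RULES = (
    ("Can't spawn new thread", "Lower the reported CPU count in main.py."),
    ("blas_thread_init", "Set OPENBLAS_NUM_THREADS=1 and related variables."),
    ("SemLock", "Clear leftovers in /dev/shm after all processes exit."),
    ("numpy.core.multiarray failed", "Reinstall NumPy and opencv-python-headless."),
)


def triage(log_text):
    """Return the actions whose keyword appears in the log, in table order."""
    return [action for keyword, action in TRIAGE_RULES if keyword in log_text]


if __name__ == "__main__":
    sample = "[ERROR:0] 116: Can't spawn new thread: res = 11"
    for action in triage(sample):
        print(action)
```

A crash log often matches several rows at once. In that case, work through the actions from top to bottom, since the later errors are frequently side effects of the first.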

After working through this guide, your model training environment should hold up reliably.

