DeepFaceLab (DFL) Cloud-Environment Multithreading Crash: A Complete Troubleshooting Guide
1. Error Symptoms and Decoding the Underlying Error Codes
When you run train in a cloud GPU container (AutoDL, 恒源云, and the like), the program crashes before the training UI ever appears, spewing a flood of errors.
Key error signatures, translated into system-level terms:
- `[ERROR:0] 116: Can't spawn new thread: res = 11`. Decoded: `res = 11` is the low-level POSIX error code `EAGAIN` on Linux, meaning "Resource temporarily unavailable". The system is warning you: "this process has exhausted its process/thread quota, so I refuse to create another execution unit."
- `OpenBLAS blas_thread_init: pthread_create failed for thread 19 of 20`. Decoded: the underlying matrix-math library (OpenBLAS) tried to allocate 20 compute threads for a single task and was blocked by the system while creating the 19th.
- `FileNotFoundError: [Errno 2] ... SemLock`. Decoded: when Python's multiprocessing crashed, it never cleaned up the semaphore files and shared memory it uses for inter-process communication (IPC), so every later launch deadlocks on startup.
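The POSIX decoding of `res = 11` can be verified straight from Python's standard library; this is a quick sanity check, nothing DFL-specific:

```python
import errno
import os

# On Linux, errno 11 is EAGAIN ("Resource temporarily unavailable"),
# the same code behind "Can't spawn new thread: res = 11"
print(errno.EAGAIN)     # 11 on Linux
print(os.strerror(11))  # the human-readable message for errno 11
```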
2. Root-Cause Analysis: Why Does a "Thread Explosion" Happen?
This problem is a head-on collision between the cloud platform's virtualization setup and Python's default behavior; we can call it a **"thread explosion"**. It is triggered by three stacked misunderstandings:
1. Layer 1: The "hard isolation" of cloud-platform cgroups
Cloud platforms carve one huge physical server into Docker containers for many users. To keep one user's runaway loop from hanging the whole machine, the kernel uses cgroups (control groups) to impose hard quotas on your container (for example, pids.max may cap you at 256 processes/threads). This is a physical gate that ulimit cannot raise.
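You can read the quota your container is actually under. The file location differs between cgroup v1 and v2, so this sketch simply tries both; the paths are the common defaults, not guaranteed on every platform:

```python
from pathlib import Path

def read_pids_max():
    """Return the container's pids.max quota as a string, or None if not found."""
    candidates = [
        Path("/sys/fs/cgroup/pids.max"),       # cgroup v2 layout
        Path("/sys/fs/cgroup/pids/pids.max"),  # cgroup v1 layout
    ]
    for p in candidates:
        if p.is_file():
            return p.read_text().strip()       # e.g. "256", or "max" if unlimited
    return None

print(read_pids_max())
```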
2. Layer 2: os.cpu_count() sees straight through the container
When DFL prepares to load image data, it calls Python's os.cpu_count() to decide how many "workers" (subprocesses) to hire. Inside a Docker container, however, Python is oblivious to the quota: it looks straight through the container and reports the host's real physical core count (say, 128). DFL then believes it has the machine to itself and spawns 128 subprocesses.
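A quick illustration: on Linux, `os.sched_getaffinity(0)` reports the CPUs the process may actually run on, which in CPU-pinned containers is often far smaller than what `os.cpu_count()` claims (note it still will not see a pure CPU-time quota such as cgroup `cpu.max`):

```python
import os

reported = os.cpu_count()                 # host's core count; blind to cgroup limits
if hasattr(os, "sched_getaffinity"):      # Linux only
    usable = len(os.sched_getaffinity(0)) # CPUs this process may actually run on
else:
    usable = reported                     # no affinity API on this OS
print(f"os.cpu_count() says {reported}, affinity allows {usable}")
```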
3. Layer 3: Nested multiplication in the low-level C libraries (the real culprit)
Python's image processing and matrix math run on NumPy and OpenCV, whose cores are C/C++ libraries such as OpenBLAS or MKL. These math libraries are greedy by default: "for every operation, spin up as many threads as there are CPU cores."
The disastrous arithmetic:
- DFL sees 128 cores and spawns 128 subprocesses.
- OpenBLAS inside each subprocess also sees 128 cores and tries to spawn 128 compute threads.
- Instantaneous demand: 128 × 128 = 16,384 threads!
- Result: this smashes through the platform's 256-slot quota in an instant, the system pulls the plug, and you get res = 11.
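The same arithmetic, spelled out (128 and 256 are the example figures from above, not universal constants):

```python
host_cores = 128            # what os.cpu_count() reports through the container wall
workers = host_cores        # data-loader subprocesses DFL spawns
blas_threads = host_cores   # compute threads each subprocess's OpenBLAS wants
pids_max = 256              # a typical cgroup pids.max quota

demand = workers * blas_threads
print(demand)               # 16384
print(demand // pids_max)   # 64: demand overshoots the quota 64-fold
```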
3. The Fix (Step-by-Step, with Code)
The strategy is "sweep first, then suppress": clean up the wreckage, then intercept thread creation at the Python source level.
Step 1: Clean up the battlefield (zombie processes and deadlocked shared memory)
Run the following commands in the cloud platform's terminal, one after another, or simply restart the instance from the web console.
```shell
# Force-kill every half-dead Python process stuck in the background, freeing the PID quota
pkill -9 -f python
# Wipe the Linux shared-memory tmpfs; fixes both the missing SemLock files and the multiarray import failure
rm -rf /dev/shm/*
```
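If you want to see the leftovers before wiping them, the IPC debris from a crashed run shows up in `/dev/shm` as `sem.*` semaphore files and shared-memory segments (exact name patterns vary by Python version; this listing is purely illustrative):

```python
import os

shm = "/dev/shm"
if os.path.isdir(shm):
    leftovers = sorted(os.listdir(shm))
    print(leftovers)  # e.g. ['sem.mp-...'] entries after a crashed run; [] when clean
else:
    leftovers = []
    print("no /dev/shm on this system")
```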
Step 2: Plant a "monkey patch" (source-level thread interception)
Open main.py in the DFL project root. Before the program wakes up and gets a chance to spawn anything, we forcibly override its environment variables and its idea of the CPU count.
Replace the code at the very top of main.py (the if __name__ == "__main__": line and the few lines below it) with the following structure:
```python
if __name__ == "__main__":
    import os

    # [First line of defense: pin down the low-level C libraries (optional)]
    # Must be set BEFORE importing cv2 or numpy!
    # Forces every BLAS library to a single thread per process, killing the nested multiplication
    """
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    os.environ['NUMEXPR_NUM_THREADS'] = '1'
    os.environ['NUMEXPR_MAX_THREADS'] = '1'
    os.environ['VECLIB_MAXIMUM_THREADS'] = '1'
    """

    import multiprocessing

    # [Second line of defense: lie to the Python interpreter (required)]
    # Override the stock cpu_count with a constant lambda so the whole program
    # believes this machine has only 2 cores and spawns only 2 data-loader subprocesses
    os.cpu_count = lambda: 2
    multiprocessing.cpu_count = lambda: 2

    # Multiprocessing start-method fix for Linux (original DFL code)
    multiprocessing.set_start_method("spawn")

    # [Third line of defense: neuter OpenCV's own threading (optional)]
    # Disables the image library's internal multithreading, which can fight NumPy
    # for memory and cause segfaults
    """
    try:
        import cv2
        cv2.setNumThreads(0)
    except Exception:
        pass
    """

    # Continue with the normal DFL initialization
    from core.leras import nn
    nn.initialize_main_env()
    # ...keep the original import sys, time, etc. below
```
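Stripped of everything DFL-specific, the heart of the patch is just two attribute assignments; any code that consults `cpu_count` afterwards sees the fake value:

```python
import os
import multiprocessing

print(os.cpu_count())  # the real core count, whatever the host reports

# The monkey patch: replace both functions with constant lambdas
os.cpu_count = lambda: 2
multiprocessing.cpu_count = lambda: 2

print(os.cpu_count())               # 2
print(multiprocessing.cpu_count())  # 2
```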
Complete main.py example:
```python
if __name__ == "__main__":
    import os
    import sys
    import time
    import multiprocessing

    # ====== Hard cap on the container's thread count ======
    # Override the CPU core count to fool the underlying SubprocessGenerator.
    # If 4 runs cleanly but feels a little slow, raise it step by step to 6 or 8.
    # If you still get res = 11, drop it to 2 or even 1.
    os.cpu_count = lambda: 4
    multiprocessing.cpu_count = lambda: 4
    # ======================================================

    # Fix for linux
    multiprocessing.set_start_method("spawn")

    from core.leras import nn
    nn.initialize_main_env()
    import argparse
    from core import pathex
    from core import osex
    from pathlib import Path
    from core.interact import interact as io

    if sys.version_info[0] < 3 or (sys.version_info[0] == 3 and sys.version_info[1] < 6):
        raise Exception("This program requires at least Python 3.6")

    class fixPathAction(argparse.Action):
        def __call__(self, parser, namespace, values, option_string=None):
            setattr(namespace, self.dest, os.path.abspath(os.path.expanduser(values)))

    exit_code = 0

    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers()

    def process_extract(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Extractor
        Extractor.main( detector                = arguments.detector,
                        input_path              = Path(arguments.input_dir),
                        output_path             = Path(arguments.output_dir),
                        output_debug            = arguments.output_debug,
                        manual_fix              = arguments.manual_fix,
                        manual_output_debug_fix = arguments.manual_output_debug_fix,
                        manual_window_size      = arguments.manual_window_size,
                        face_type               = arguments.face_type,
                        max_faces_from_image    = arguments.max_faces_from_image,
                        image_size              = arguments.image_size,
                        jpeg_quality            = arguments.jpeg_quality,
                        cpu_only                = arguments.cpu_only,
                        force_gpu_idxs          = [ int(x) for x in arguments.force_gpu_idxs.split(',') ] if arguments.force_gpu_idxs is not None else None,
                      )

    p = subparsers.add_parser( "extract", help="Extract the faces from pictures.")
    p.add_argument('--detector', dest="detector", choices=['s3fd','manual'], default=None, help="Type of detector.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--output-dir', required=True, action=fixPathAction, dest="output_dir", help="Output directory. This is where the extracted files will be stored.")
    p.add_argument('--output-debug', action="store_true", dest="output_debug", default=None, help="Writes debug images to <output-dir>_debug directory.")
    p.add_argument('--no-output-debug', action="store_false", dest="output_debug", default=None, help="Don't write debug images to <output-dir>_debug directory.")
    p.add_argument('--face-type', dest="face_type", choices=['half_face', 'full_face', 'whole_face', 'head', 'mark_only'], default=None)
    p.add_argument('--max-faces-from-image', type=int, dest="max_faces_from_image", default=None, help="Max faces from image.")
    p.add_argument('--image-size', type=int, dest="image_size", default=None, help="Output image size.")
    p.add_argument('--jpeg-quality', type=int, dest="jpeg_quality", default=None, help="Jpeg quality.")
    p.add_argument('--manual-fix', action="store_true", dest="manual_fix", default=False, help="Enables manual extract only frames where faces were not recognized.")
    p.add_argument('--manual-output-debug-fix', action="store_true", dest="manual_output_debug_fix", default=False, help="Performs manual reextract input-dir frames which were deleted from [output_dir]_debug dir.")
    p.add_argument('--manual-window-size', type=int, dest="manual_window_size", default=1368, help="Manual fix window size. Default: 1368.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Extract on CPU.")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")
    p.set_defaults (func=process_extract)

    def process_sort(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Sorter
        Sorter.main (input_path=Path(arguments.input_dir), sort_by_method=arguments.sort_by_method)

    p = subparsers.add_parser( "sort", help="Sort faces in a directory.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--by', dest="sort_by_method", default=None, choices=("blur", "motion-blur", "face-yaw", "face-pitch", "face-source-rect-size", "hist", "hist-dissim", "brightness", "hue", "black", "origname", "oneface", "final-by-blur", "final-by-size", "absdiff"), help="Method of sorting. 'origname' sort by original filename to recover original sequence." )
    p.set_defaults (func=process_sort)

    def process_util(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Util

        if arguments.add_landmarks_debug_images:
            Util.add_landmarks_debug_images (input_path=arguments.input_dir)

        if arguments.recover_original_aligned_filename:
            Util.recover_original_aligned_filename (input_path=arguments.input_dir)

        if arguments.save_faceset_metadata:
            Util.save_faceset_metadata_folder (input_path=arguments.input_dir)

        if arguments.restore_faceset_metadata:
            Util.restore_faceset_metadata_folder (input_path=arguments.input_dir)

        if arguments.pack_faceset:
            io.log_info ("Performing faceset packing...\r\n")
            from samplelib import PackedFaceset
            PackedFaceset.pack( Path(arguments.input_dir) )

        if arguments.unpack_faceset:
            io.log_info ("Performing faceset unpacking...\r\n")
            from samplelib import PackedFaceset
            PackedFaceset.unpack( Path(arguments.input_dir) )

        if arguments.export_faceset_mask:
            io.log_info ("Exporting faceset mask..\r\n")
            Util.export_faceset_mask( Path(arguments.input_dir) )

    p = subparsers.add_parser( "util", help="Utilities.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--add-landmarks-debug-images', action="store_true", dest="add_landmarks_debug_images", default=False, help="Add landmarks debug image for aligned faces.")
    p.add_argument('--recover-original-aligned-filename', action="store_true", dest="recover_original_aligned_filename", default=False, help="Recover original aligned filename.")
    p.add_argument('--save-faceset-metadata', action="store_true", dest="save_faceset_metadata", default=False, help="Save faceset metadata to file.")
    p.add_argument('--restore-faceset-metadata', action="store_true", dest="restore_faceset_metadata", default=False, help="Restore faceset metadata to file. Image filenames must be the same as used with save.")
    p.add_argument('--pack-faceset', action="store_true", dest="pack_faceset", default=False, help="")
    p.add_argument('--unpack-faceset', action="store_true", dest="unpack_faceset", default=False, help="")
    p.add_argument('--export-faceset-mask', action="store_true", dest="export_faceset_mask", default=False, help="")
    p.set_defaults (func=process_util)

    def process_train(arguments):
        osex.set_process_lowest_prio()

        kwargs = {'model_class_name'         : arguments.model_name,
                  'saved_models_path'        : Path(arguments.model_dir),
                  'training_data_src_path'   : Path(arguments.training_data_src_dir),
                  'training_data_dst_path'   : Path(arguments.training_data_dst_dir),
                  'pretraining_data_path'    : Path(arguments.pretraining_data_dir) if arguments.pretraining_data_dir is not None else None,
                  'pretrained_model_path'    : Path(arguments.pretrained_model_dir) if arguments.pretrained_model_dir is not None else None,
                  'no_preview'               : arguments.no_preview,
                  'force_model_name'         : arguments.force_model_name,
                  'force_gpu_idxs'           : [ int(x) for x in arguments.force_gpu_idxs.split(',') ] if arguments.force_gpu_idxs is not None else None,
                  'cpu_only'                 : arguments.cpu_only,
                  'silent_start'             : arguments.silent_start,
                  'execute_programs'         : [ [int(x[0]), x[1] ] for x in arguments.execute_program ],
                  'debug'                    : arguments.debug,
                  }
        from mainscripts import Trainer
        Trainer.main(**kwargs)

    p = subparsers.add_parser( "train", help="Trainer")
    p.add_argument('--training-data-src-dir', required=True, action=fixPathAction, dest="training_data_src_dir", help="Dir of extracted SRC faceset.")
    p.add_argument('--training-data-dst-dir', required=True, action=fixPathAction, dest="training_data_dst_dir", help="Dir of extracted DST faceset.")
    p.add_argument('--pretraining-data-dir', action=fixPathAction, dest="pretraining_data_dir", default=None, help="Optional dir of extracted faceset that will be used in pretraining mode.")
    p.add_argument('--pretrained-model-dir', action=fixPathAction, dest="pretrained_model_dir", default=None, help="Optional dir of pretrain model files. (Currently only for Quick96).")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir", help="Saved models dir.")
    p.add_argument('--model', required=True, dest="model_name", choices=pathex.get_all_dir_names_startswith ( Path(__file__).parent / 'models' , 'Model_'), help="Model class name.")
    p.add_argument('--debug', action="store_true", dest="debug", default=False, help="Debug samples.")
    p.add_argument('--no-preview', action="store_true", dest="no_preview", default=False, help="Disable preview window.")
    p.add_argument('--force-model-name', dest="force_model_name", default=None, help="Forcing to choose model name from model/ folder.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Train on CPU.")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")
    p.add_argument('--silent-start', action="store_true", dest="silent_start", default=False, help="Silent start. Automatically chooses Best GPU and last used model.")
    p.add_argument('--execute-program', dest="execute_program", default=[], action='append', nargs='+')
    p.set_defaults (func=process_train)

    def process_exportdfm(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import ExportDFM
        ExportDFM.main(model_class_name = arguments.model_name, saved_models_path = Path(arguments.model_dir))

    p = subparsers.add_parser( "exportdfm", help="Export model to use in DeepFaceLive.")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir", help="Saved models dir.")
    p.add_argument('--model', required=True, dest="model_name", choices=pathex.get_all_dir_names_startswith ( Path(__file__).parent / 'models' , 'Model_'), help="Model class name.")
    p.set_defaults (func=process_exportdfm)

    def process_merge(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import Merger
        Merger.main ( model_class_name = arguments.model_name,
                      saved_models_path = Path(arguments.model_dir),
                      force_model_name = arguments.force_model_name,
                      input_path = Path(arguments.input_dir),
                      output_path = Path(arguments.output_dir),
                      output_mask_path = Path(arguments.output_mask_dir),
                      aligned_path = Path(arguments.aligned_dir) if arguments.aligned_dir is not None else None,
                      force_gpu_idxs = arguments.force_gpu_idxs,
                      cpu_only = arguments.cpu_only)

    p = subparsers.add_parser( "merge", help="Merger")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory. A directory containing the files you wish to process.")
    p.add_argument('--output-dir', required=True, action=fixPathAction, dest="output_dir", help="Output directory. This is where the merged files will be stored.")
    p.add_argument('--output-mask-dir', required=True, action=fixPathAction, dest="output_mask_dir", help="Output mask directory. This is where the mask files will be stored.")
    p.add_argument('--aligned-dir', action=fixPathAction, dest="aligned_dir", default=None, help="Aligned directory. This is where the extracted of dst faces stored.")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir", help="Model dir.")
    p.add_argument('--model', required=True, dest="model_name", choices=pathex.get_all_dir_names_startswith ( Path(__file__).parent / 'models' , 'Model_'), help="Model class name.")
    p.add_argument('--force-model-name', dest="force_model_name", default=None, help="Forcing to choose model name from model/ folder.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Merge on CPU.")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")
    p.set_defaults(func=process_merge)

    videoed_parser = subparsers.add_parser( "videoed", help="Video processing.").add_subparsers()

    def process_videoed_extract_video(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.extract_video (arguments.input_file, arguments.output_dir, arguments.output_ext, arguments.fps)

    p = videoed_parser.add_parser( "extract-video", help="Extract images from video file.")
    p.add_argument('--input-file', required=True, action=fixPathAction, dest="input_file", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--output-dir', required=True, action=fixPathAction, dest="output_dir", help="Output directory. This is where the extracted images will be stored.")
    p.add_argument('--output-ext', dest="output_ext", default=None, help="Image format (extension) of output files.")
    p.add_argument('--fps', type=int, dest="fps", default=None, help="How many frames of every second of the video will be extracted. 0 - full fps.")
    p.set_defaults(func=process_videoed_extract_video)

    def process_videoed_cut_video(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.cut_video (arguments.input_file,
                           arguments.from_time,
                           arguments.to_time,
                           arguments.audio_track_id,
                           arguments.bitrate)

    p = videoed_parser.add_parser( "cut-video", help="Cut video file.")
    p.add_argument('--input-file', required=True, action=fixPathAction, dest="input_file", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--from-time', dest="from_time", default=None, help="From time, for example 00:00:00.000")
    p.add_argument('--to-time', dest="to_time", default=None, help="To time, for example 00:00:00.000")
    p.add_argument('--audio-track-id', type=int, dest="audio_track_id", default=None, help="Specify audio track id.")
    p.add_argument('--bitrate', type=int, dest="bitrate", default=None, help="Bitrate of output file in Megabits.")
    p.set_defaults(func=process_videoed_cut_video)

    def process_videoed_denoise_image_sequence(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.denoise_image_sequence (arguments.input_dir, arguments.factor)

    p = videoed_parser.add_parser( "denoise-image-sequence", help="Denoise sequence of images, keeping sharp edges. Helps to remove pixel shake from the predicted face.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory to be processed.")
    p.add_argument('--factor', type=int, dest="factor", default=None, help="Denoise factor (1-20).")
    p.set_defaults(func=process_videoed_denoise_image_sequence)

    def process_videoed_video_from_sequence(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import VideoEd
        VideoEd.video_from_sequence (input_dir      = arguments.input_dir,
                                     output_file    = arguments.output_file,
                                     reference_file = arguments.reference_file,
                                     ext            = arguments.ext,
                                     fps            = arguments.fps,
                                     bitrate        = arguments.bitrate,
                                     include_audio  = arguments.include_audio,
                                     lossless       = arguments.lossless)

    p = videoed_parser.add_parser( "video-from-sequence", help="Make video from image sequence.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--output-file', required=True, action=fixPathAction, dest="output_file", help="Input file to be processed. Specify .*-extension to find first file.")
    p.add_argument('--reference-file', action=fixPathAction, dest="reference_file", help="Reference file used to determine proper FPS and transfer audio from it. Specify .*-extension to find first file.")
    p.add_argument('--ext', dest="ext", default='png', help="Image format (extension) of input files.")
    p.add_argument('--fps', type=int, dest="fps", default=None, help="FPS of output file. Overwritten by reference-file.")
    p.add_argument('--bitrate', type=int, dest="bitrate", default=None, help="Bitrate of output file in Megabits.")
    p.add_argument('--include-audio', action="store_true", dest="include_audio", default=False, help="Include audio from reference file.")
    p.add_argument('--lossless', action="store_true", dest="lossless", default=False, help="PNG codec.")
    p.set_defaults(func=process_videoed_video_from_sequence)

    facesettool_parser = subparsers.add_parser( "facesettool", help="Faceset tools.").add_subparsers()

    def process_faceset_enhancer(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import FacesetEnhancer
        FacesetEnhancer.process_folder ( Path(arguments.input_dir),
                                         cpu_only=arguments.cpu_only,
                                         force_gpu_idxs=arguments.force_gpu_idxs
                                       )

    p = facesettool_parser.add_parser ("enhance", help="Enhance details in DFL faceset.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory of aligned faces.")
    p.add_argument('--cpu-only', action="store_true", dest="cpu_only", default=False, help="Process on CPU.")
    p.add_argument('--force-gpu-idxs', dest="force_gpu_idxs", default=None, help="Force to choose GPU indexes separated by comma.")
    p.set_defaults(func=process_faceset_enhancer)

    p = facesettool_parser.add_parser ("resize", help="Resize DFL faceset.")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir", help="Input directory of aligned faces.")

    def process_faceset_resizer(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import FacesetResizer
        FacesetResizer.process_folder ( Path(arguments.input_dir) )
    p.set_defaults(func=process_faceset_resizer)

    def process_dev_test(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import dev_misc
        dev_misc.dev_gen_mask_files( arguments.input_dir )

    p = subparsers.add_parser( "dev_test", help="")
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_dev_test)

    # ========== XSeg
    xseg_parser = subparsers.add_parser( "xseg", help="XSeg tools.").add_subparsers()

    p = xseg_parser.add_parser( "editor", help="XSeg editor.")

    def process_xsegeditor(arguments):
        osex.set_process_lowest_prio()
        from XSegEditor import XSegEditor
        global exit_code
        exit_code = XSegEditor.start (Path(arguments.input_dir))

    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_xsegeditor)

    p = xseg_parser.add_parser( "apply", help="Apply trained XSeg model to the extracted faces.")

    def process_xsegapply(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.apply_xseg (Path(arguments.input_dir), Path(arguments.model_dir))
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.add_argument('--model-dir', required=True, action=fixPathAction, dest="model_dir")
    p.set_defaults (func=process_xsegapply)

    p = xseg_parser.add_parser( "remove", help="Remove applied XSeg masks from the extracted faces.")
    def process_xsegremove(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.remove_xseg (Path(arguments.input_dir) )
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_xsegremove)

    p = xseg_parser.add_parser( "remove_labels", help="Remove XSeg labels from the extracted faces.")
    def process_xsegremovelabels(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.remove_xseg_labels (Path(arguments.input_dir) )
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_xsegremovelabels)

    p = xseg_parser.add_parser( "fetch", help="Copies faces containing XSeg polygons in <input_dir>_xseg dir.")

    def process_xsegfetch(arguments):
        osex.set_process_lowest_prio()
        from mainscripts import XSegUtil
        XSegUtil.fetch_xseg (Path(arguments.input_dir) )
    p.add_argument('--input-dir', required=True, action=fixPathAction, dest="input_dir")
    p.set_defaults (func=process_xsegfetch)

    def bad_args(arguments):
        parser.print_help()
        exit(0)
    parser.set_defaults(func=bad_args)

    arguments = parser.parse_args()
    arguments.func(arguments)

    if exit_code == 0:
        print ("Done.")

    exit(exit_code)

'''
import code
code.interact(local=dict(globals(), **locals()))
'''
```
Step 3: Rebuild the Conda environment (for stubborn NumPy errors)
If, after the two steps above, res = 11 is gone but you still occasionally hit ImportError: numpy.core.multiarray failed to import, the GUI-enabled OpenCV build is segfaulting against the low-level drivers.
Uninstall it in the terminal and reinstall the clean headless build:
```shell
pip uninstall opencv-python opencv-python-headless numpy -y
pip install numpy
pip install opencv-python-headless
```
4. Advanced Tuning and Runtime Guidance
With the code changes in place, running python main.py train ... again should bring up the interactive menu (which GPU to use, model parameters, and so on). From here, follow these rules:
1. The key choice in the interactive menu
When the console asks Use multiprocessing for data loading? (y/n), answer n (No) without hesitation. We already capped the program at 2 cores in code; choosing n here degrades data loading to a single linear process. Preprocessing gets marginally slower, but in exchange you get absolute stability in a cloud container.
2. How do you squeeze the GPU?
Because we throttled the CPU side that feeds images (the data loader), a very fast cloud GPU (an RTX 3090 / 4090, say) may sit "starved", waiting for data.
- Probe first: on the first successful run, keep the Batch Size conservative (4 or 8).
- Loosen gradually: once the program has run stably for half an hour with GPU utilization below 50%, press Enter to stop training, change lambda: 2 in main.py to lambda: 4, raise the Batch Size, and keep probing toward this cloud host's performance ceiling.
5. Quick Troubleshooting Reference

| Error keyword observed | Root cause | Action to take |
|---|---|---|
| `res = 11` / `Can't spawn new thread` | System thread quota exhausted | Add the `os.cpu_count = lambda: 2` patch at the top of `main.py`. |
| `OpenBLAS blas_thread_init...` | Nested thread multiplication in the BLAS layer | Set `os.environ['OPENBLAS_NUM_THREADS'] = '1'` (and friends) at the very top. |
| `SemLock` / `No such file or directory` | Leftover files in shared memory from the previous crash | Run `rm -rf /dev/shm/*` in the terminal. |
| `numpy.core.multiarray failed` | Out-of-memory or an OpenCV/NumPy build conflict | Uninstall standard OpenCV and install `opencv-python-headless`. |
Follow this guide end to end, and your training environment should be rock solid.