{"paper_id":"6ececb12-8fb3-4485-b89c-08e11880dfb9","body_text":"Post-surgical Endometriosis Segmentation in Laparoscopic Videos\nAndreas Leibetseder, Klaus Schoeffmann\nInstitute of Information Technology\nKlagenfurt University\nKlagenfurt, Austria\n[aleibets,ks]@itec.aau.at\nJörg Keckstein\nMedical Faculty\nUlm University\nUlm, Germany\njoerg@keckstein.at\nSimon Keckstein\nUniversity Hospital\nLudwig-Maximilians-University\nMunich, Germany\nsimon.keckstein@med.uni-muenchen.de\nAbstract: Endometriosis is a common women’s condition exhibiting a manifold visual appearance in various body-internal locations. These properties make its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.\nIndex Terms: Endometriosis, Lesion Segmentation, Mask R-CNN\nI. INTRODUCTION\nEndoscopic surgical procedures are well established, particularly in gynecology. The exact diagnosis of various diseases takes place via an endoscopic camera system, which is inserted into the abdominal cavity through a small port. The endoscopic image is made available to the surgeon on monitors. The exploration of the abdominal cavity and especially the inner genital tract is very informative and helpful for a correct diagnosis and therapy in the case of painful conditions or pathological findings. 
One condition commonly treated this way is termed endometriosis, which refers to the abnormal growth of uterine-like tissue outside of the uterus and is diagnosed among women of child-bearing age. Affected patients exhibit lesions of varying severity – often in various locations. Complete identification and recording of all foci and their therapy (removal) is essential for improving the patient’s symptoms and quality of life. Two systems are mainly used to classify the disease: the revised American Society for Reproductive Medicine (rASRM) score [1] and the Enzian classification [2], [3]. The rASRM classification is particularly applicable to the recording of all intraperitoneal lesions, whereas the Enzian classification covers deep endometriosis. Classification is primarily carried out through the surgeon’s visual assessment, with the two systems complementing each other in quantifying a patient’s overall condition.\nComplete detection of endometriosis in the partially inaccessible area of the pelvis and the large area of the peritoneum can be limited, and is made more difficult by the differing color and appearance of the respective endometrial lesions. Due to these various manifestations of endometriosis, good training and great attention are required from the surgeon during diagnosis. A lack of experience, possibly combined with time pressure from a long operating list, carries the risk of incomplete recording of the disease. This has significant consequences for further treatment and the patient’s well-being. It is therefore necessary to prevent misdiagnosis of the disease as far as possible and at the same time to sharpen the visual perception of all lesions, especially for doctors in training. 
This could be supported intra- or post-operatively with the help of image segmentation.\nWith deep learning already heavily employed in medical imaging, it can naturally be regarded as an opportunity not only to improve the aforementioned training but also to facilitate post-surgical analysis. In order to demonstrate the feasibility of such a goal, in this work we focus on the object segmentation of a specific visual appearance of endometriosis – dark endometrial implants. Figure 1 depicts four examples taken from a custom-created ground truth dataset¹ including region-based annotations of such pathological areas. When regarding these annotations, it can be observed that, although the indicated regions appear distinctly different from their immediate surroundings, they seem quite similar to other non-pathological areas such as spots of blood or dark vessels. The dataset exclusively contains single-class implant annotations and is used to adapt and train the state-of-the-art deep object segmentation network Mask R-CNN [4], which is a region-based convolutional neural network capable of producing pixel masks for detected objects in addition to bounding boxes generated by an incorporated region proposal network (cf. Faster R-CNN [5]). Overall, we formulate our contributions as follows:\n• Adapting Mask R-CNN and providing a model for binary segmentation of endometrial implants.\n• Local and temporal visualization of endometrial implants in laparoscopic surgery videos.\n• Providing the tool source code as well as pre-trained models for academic purposes².\n¹ https://tinyurl.com/ENIDDS\n² https://tinyurl.com/EndoSegTool\narXiv:2510.13899v1 [cs.CV] 14 Oct 2025\nFig. 1: Examples of dark endometrial implants, panels (a)-(d).\nThis demonstration highlights partial results of an ongoing, more thorough study on the subject of endometriosis segmentation. 
As such, the following sections intentionally focus on describing the tool and its features rather than portraying the dataset creation and training approach in great detail.\nII. ENDOMETRIOSIS SEGMENTATION TOOL\nThe endometriosis segmentation tool can generally be described as an ensemble of technologies, combined into a series of scripts for analyzing post-surgical video archives. These scripts are used for creating annotated output videos as well as a configurable amount of metadata, which can for instance be incorporated into potential interactive systems. As mentioned above, this demo should be regarded as a showcase highlighting the feasibility of endometriosis segmentation; therefore, we reserve building a fully-fledged user interface for future versions of the tool. In the following sections we describe its architecture, usage, hardware-specific runtime analysis and implementation details.\nA. Architecture\nThe system’s overall architecture comprises three main steps: dataset creation, model training and video analysis (model application).\nWe custom-create a single-class lesion dataset by refining parts of the more extensive, multi-class Gynecologic Laparoscopy Endometriosis Dataset [6] (GLENDA). The collected base dataset comprises over 350 region-based endometrial implant annotations for 160 frames taken from more than 100 patient cases exhibiting endometriosis. In order to improve the trained segmentation model, we augment this dataset by applying various techniques including rotating, blurring, perspective transformation, desaturation as well as object tracking. For the subsequent training step we divide these various resulting datasets into two different subsets used for training, validation and testing.\nAs mentioned above, for model training we adapt the state-of-the-art object segmentation network Mask R-CNN for transfer learning a single output label. 
As a backbone network we employ ResNet-101 [7] together with an overall multi-task loss function incorporating class (log loss), bounding box (smooth L1 loss) and mask segmentation (binary cross-entropy loss) predictions as described in [4], [8]. Training is conducted for 50 epochs using a learning rate of 0.001 and stochastic gradient descent as the optimizer. The best performing model in terms of mean average precision (mAP) for mask segmentation, as employed in the MS COCO detection [9] evaluations, is achieved after 29 epochs using rotation as well as cropping for augmentation: 0.642 mAP@0.50IoU, i.e. at a threshold of 0.5 mask overlap (0.324 mAP for a threshold range of 0.50 to 0.95 with 0.05 steps). This model, together with other well-performing models from both splits, is made available for download³.\nFig. 2: Video Processing Pipeline.\nFig. 3: Video at two different points in time – raw (top row) and analyzed (bottom row), panels (a)-(d).\nFinally, we utilize such a model in our system for detecting pathologically suspicious regions with a confidence threshold of 0.50 or above. The employed core processing pipeline is depicted in Figure 2: first a user provides the tool with a raw surgery video, which is then analyzed frame by frame, extracting bounding boxes, masks and labels. Whenever results are found, the tool uses the determined segmentation masks to produce annotated frames as well as an overall detection summary in the form of an indication bar, as depicted in Figure 3. This bar indicates frame-by-frame detections over time, colored by detection confidence (yellow to dark red) – values for multiple detections are averaged. Both the segmentation results and the indication bar are integrated into the final video output, while the current video position is additionally marked with a green horizontal bar. 
This way, viewers of such annotated output videos are provided, at any point in time, with an overview of potentially important sections. All extracted data can additionally be stored in JSON format to facilitate integration into future interactive video browsing systems.\nB. Hardware and Runtime Analysis\nFor implementation, training and evaluation we used a workstation with the following specifications: Intel Core i7-5820K CPU @ 3.30GHz x 6, 32 GiB DDR3 @ 1333 MHz, Nvidia GeForce GTX 1080. On such a machine, model training required approximately 2 h to complete. The tool has been implemented using Linux Ubuntu 18.x, but has also been successfully tested on Windows 10 systems. Given the exclusive utilization of cross-platform technologies (cf. Section II-C), it is assumed to be compatible with macOS as well.\n³ https://tinyurl.com/ENIDDS\nTABLE I: Processing time comparison of 16:9 resolutions.\nresolution (pixels) | avg. processing time per frame (ms)\n640×360 | 153\n1280×720 | 158\n1920×1080 | 170\n3840×2160 | 207\nConcerning runtime performance, when using GPU processing the system requires an average of approximately 150-250 ms of processing time per frame for most videos, as outlined in Table I. Although it clearly grows with larger resolutions, the processing time essentially depends on resizing the input images, since the generated model’s input is resized to fit a restricted distinct pixel range, i.e. 800 pixels for the shortest and 1333 pixels for the longest image side. Hence, assuming a per-frame performance of 170 ms, we can approximately estimate the overall time required to process an hour of video produced by an endoscope recording in HD resolution at 25 frames per second: (170 × 25 × 60 × 60) / 1000 = 15300 s = 4 h 15 min.\nC. Installation and Usage\nThe tool requires working installations of OpenCV⁴, Python 3.x⁵, FFmpeg⁶ and Detectron2⁷. 
All further requirements can simply be installed by running:\n$ pip install -r requirements.txt\nIn its most basic use case – analyzing a single video – the tool can be executed by running:\n$ python demo.py -i <video file> -m <model file> -o <output folder>\nThe tool is also capable of multi-video and multi-model processing, and a detailed description of all available options can be produced by running the script with the ’-h’ flag.\nIII. CONCLUSION\nWe present a tool for segmenting and annotating endometrial implants in laparoscopic videos. Approaching this problem by combining video object tracking with state-of-the-art image segmentation, we achieve qualitatively good results that can be regarded as a first step towards an interactive post-surgical video archive browser, which could be of great assistance for treatment planning as well as clinical education. Finally, this work provides valuable insights into the feasibility of applying object detection methods, traditionally developed in machine learning for real-world imagery, to a practical medical use case.\nACKNOWLEDGMENTS\nThis work was funded by the FWF Austrian Science Fund under grant P 32010-N38.\nREFERENCES\n[1] M. Canis, J. Donnez, D. Guzick, J. Halme, J. Rock, R. Schenken, and M. Vernon, “Revised American Society for Reproductive Medicine classification of endometriosis: 1996,” Fertility and Sterility, vol. 67, no. 5, pp. 817–821, 1997.\n[2] J. Keckstein, U. Ulrich, M. Possover, K. Schweppe et al., “Enzian-Klassifikation der tief infiltrierenden Endometriose,” Zentralblatt für Gynäkologie, vol. 125, p. 291, 2003.\n[3] J. Keckstein and G. Hudelist, “Classification of DIE including bowel endometriosis: from r-ASRM to #Enzian-classification,” Best Practice & Research Clinical Obstetrics & Gynaecology, 2020.\n[4] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, 2020. [Online]. 
Available: https://doi.org/10.1109/TPAMI.2018.2844175\n[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, June 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2577031\n[6] A. Leibetseder, S. Kletz, K. Schoeffmann, S. Keckstein, and J. Keckstein, “GLENDA: gynecologic laparoscopy endometriosis dataset,” in MultiMedia Modeling - 26th International Conference, MMM 2020, Daejeon, South Korea, January 5-8, 2020, Proceedings, Part II, ser. Lecture Notes in Computer Science, Y. M. Ro, W. Cheng, J. Kim, W. Chu, P. Cui, J. Choi, M. Hu, and W. D. Neve, Eds., vol. 11962. Springer, 2020, pp. 439–450. [Online]. Available: https://doi.org/10.1007/978-3-030-37734-2_36\n[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.\n[8] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.\n[9] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.\n⁴ OpenCV 4.x, https://opencv.org\n⁵ Python 3.x, https://www.python.org\n⁶ https://ffmpeg.org\n⁷ https://github.com/facebookresearch/detectron2","source_license":"CC0","license_restricted":false}