Post-surgical Endometriosis Segmentation in
Laparoscopic Videos
Andreas Leibetseder, Klaus Schoeffmann
Institute of Information Technology
Klagenfurt University
Klagenfurt, Austria
[aleibets,ks]@itec.aau.at
Jörg Keckstein
Medical Faculty
Ulm University
Ulm, Germany
[email protected]
Simon Keckstein
University Hospital
Ludwig-Maximilians-University
Munich, Germany
[email protected]
Abstract—Endometriosis is a common women’s condition ex-
hibiting a manifold visual appearance in various body-internal
locations. Having such properties makes its identification very
difficult and error-prone, at least for laymen and non-specialized
medical practitioners. In an attempt to provide assistance to
gynecologic physicians treating endometriosis, this demo paper
describes a system that is trained to segment one frequently
occurring visual appearance of endometriosis, namely dark
endometrial implants. The system is capable of analyzing la-
paroscopic surgery videos, annotating identified implant regions
with multi-colored overlays and displaying a detection summary
for improved video browsing.
Index Terms—Endometriosis, Lesion Segmentation, Mask R-
CNN
I. INTRODUCTION
Endoscopic surgical procedures are well established partic-
ularly in gynecology. The exact diagnosis of various diseases
takes place via an endoscopy camera system which is inserted
into the abdominal cavity through a small port. The endoscopic
image is made available to the surgeon on monitors. The
exploration of the abdominal cavity and especially the inner
genital tract is very informative and helpful for a correct
diagnosis and therapy in the case of painful conditions or
pathological findings. One condition commonly treated this
way is termed endometriosis, which refers to the abnormal
growth of uterine-like tissue outside of the uterus and is diag-
nosed among women of child-bearing age. Affected patients
exhibit lesions of varying severity – often in various locations.
Complete identification and recording of all foci and their
therapy (removal) is essential for improving symptoms and
quality of life of the patient. Two systems are mainly used
to classify the disease: the revised American Society
for Reproductive Medicine (rASRM) score [1] and the Enzian
classification [2], [3]. The rASRM classification is particu-
larly applicable to the recording of all intraperitoneal lesions,
whereas the Enzian classification covers deep endometriosis.
Classification is primarily carried out through the surgeon’s
visual assessment, with both systems complementing each other
in quantifying a patient’s overall condition.
Complete detection of endometriosis can be limited by the
partially inaccessible area of the pelvis and the large area of
the peritoneum, and is made more difficult by the varying
color and appearance of the respective endometrial
lesions. Due to these various manifestations of endometriosis,
good training and great attention are required from the surgeon
during diagnosis. A lack of experience, possibly combined
with time pressure under a large operation list, carries the risk
of incomplete recording of the disease. This has an essential
consequence for the further treatment and the patient’s well-
being. There is a requirement to prevent misdiagnosis of the
disease as far as possible and at the same time to intensify
the visual perception of all lesions, especially for doctors in
training. This could be supported intra- or post-operatively
with the help of image segmentation.
With deep learning already heavily employed in medical
imaging, it can naturally be regarded as an opportunity
not only to improve the aforementioned educational training but
also to facilitate post-surgical analysis. In order to demonstrate
the feasibility of such a goal, in this work we focus on the
object segmentation of a specific visual appearance of
endometriosis: dark endometrial implants. Figure 1 depicts four
examples taken from a custom-created ground truth dataset 1
including region-based annotations of such pathological areas.
When regarding these annotations, it can be observed that,
although the indicated regions appear distinctly different from
their immediate surroundings, they seem quite similar to
other non-pathological areas such as spots of blood or dark
vessels. The dataset exclusively contains single-class implant
annotations and is used to adapt and train the state-of-the-art
deep object segmentation network Mask R-CNN [4], which
is a region-based convolutional neural network capable of
producing pixel masks for detected objects in addition to
bounding boxes generated by an incorporated region proposal
network (cf. Faster R-CNN [5]). Overall, we formulate our
contributions as follows:
• Adapting Mask R-CNN and providing a model for binary
segmentation of endometrial implants.
• Local and temporal visualization of endometrial implants
in laparoscopic surgery videos.
• Providing the tool source code as well as pre-trained
models for academic purposes 2.
1https://tinyurl.com/ENIDDS
2https://tinyurl.com/EndoSegTool
arXiv:2510.13899v1 [cs.CV] 14 Oct 2025
Fig. 1: Examples of dark endometrial implants, panels (a) to (d).
This demonstration highlights partial results of an ongoing,
more thorough study on the subject of endometriosis
segmentation. As such, the following sections intentionally focus on
describing the tool and its features rather than portraying the
dataset creation and training approach in great detail.
II. ENDOMETRIOSIS SEGMENTATION TOOL
The endometriosis segmentation tool can generally be de-
scribed as an ensemble of technologies combined, resulting in
a series of scripts for analyzing post-surgical video archives.
These scripts are used for creating annotated output videos
as well as a configurable amount of metadata, which can for
instance be incorporated into potential interactive systems. As
mentioned above, this demo should be regarded as a showcase
for highlighting the feasibility of endometriosis segmentation;
therefore, we reserve building a fully-fledged user interface
for future versions of the tool. In the following sections
we describe its architecture, usage, hardware-specific runtime
analysis and implementation details.
A. Architecture
The system’s overall architecture comprises three
main steps: dataset creation, model training and video analysis
(model application).
We custom-create a single-class lesion dataset by refining
parts of the more extensive and multi-class Gynecologic
Laparoscopy Endometriosis Dataset [6] (GLENDA).
The collected base dataset comprises over 350 region-based
endometrial implant annotations for 160 frames taken from
more than 100 patient cases exhibiting endometriosis. In order
to improve the trained segmentation model, we augment this
dataset by applying various techniques including rotating,
blurring, perspective transformation, desaturation as well as
object tracking. For the subsequent training step we divide
the resulting datasets into two different splits, each comprising
subsets for training, validation and testing.
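To illustrate the augmentation step, the following minimal sketch applies a rotation and a desaturation to a tiny frame and its annotation mask. It is written under assumptions: the function names (`rotate90`, `desaturate`, `augment`) and the nested-list image representation are hypothetical and are not taken from the tool's source code, which may use different libraries and parameters. The sketch shows only that geometric transformations have to be applied to frame and mask alike, whereas photometric ones leave the mask untouched.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel 0..255
Image = List[List[Pixel]]      # rows of pixels
Mask = List[List[int]]         # rows of 0/1 labels (1 = implant)


def rotate90(grid: list) -> list:
    """Rotate a 2-D grid (image or mask) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def desaturate(image: Image, amount: float) -> Image:
    """Blend each pixel towards its gray value; amount=1.0 gives grayscale."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            gray = (r + g + b) / 3.0
            new_row.append(
                tuple(round(c + (gray - c) * amount) for c in (r, g, b))
            )
        out.append(new_row)
    return out


def augment(image: Image, mask: Mask) -> List[Tuple[Image, Mask]]:
    """Return the original sample plus a rotated and a desaturated variant."""
    return [
        (image, mask),
        (rotate90(image), rotate90(mask)),  # geometric: transform both
        (desaturate(image, 1.0), mask),     # photometric: mask unchanged
    ]
```

Each annotated frame thus yields several training samples while the region annotations stay aligned with the image content.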
As mentioned above, for model training we adapt the
state-of-the-art object segmentation network Mask R-CNN for
transfer learning with a single output label. As a backbone network
we employ ResNet-101 [7] together with an overall multi-task
loss function incorporating class (log loss), bounding box
(smooth L1 loss) and mask segmentation (binary cross entropy
loss) predictions as described in [4], [8]. Training is conducted
for 50 epochs using a learning rate of 0.001 and stochastic
gradient descent as an optimizer. The best performing model in
terms of mean average precision (mAP) for mask segmentation,
as employed in the MS COCO detection [9] evaluations, is
achieved after 29 epochs using rotation as well as cropping
for augmentation: 0.642 mAP at a mask overlap threshold of 0.5
(0.324 mAP for a threshold range of 0.50 to
0.95 with 0.05 steps). This model together with other
well-performing models from both splits is made available for
download 3.
Fig. 2: Video Processing Pipeline.
Fig. 3: Video at two different points in time – raw (top row) and analyzed (bottom row), panels (a) to (d).
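The two mAP figures reported above differ only in the mask overlap thresholds at which a predicted region counts as a correct detection. The following sketch shows the overlap measure (intersection over union) and the threshold range of 0.50 to 0.95 in 0.05 steps. It is not a full mAP implementation, and the function names are hypothetical; the reported numbers stem from the MS COCO style evaluation, not from this code.

```python
from typing import List

Mask = List[List[int]]  # binary mask, rows of 0/1


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


def coco_thresholds() -> List[float]:
    """Overlap thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matches_per_threshold(iou: float) -> List[bool]:
    """For each threshold, whether a prediction with this IoU is a match."""
    return [iou >= t for t in coco_thresholds()]
```

A prediction covering three of four annotated pixels, for instance, has an overlap of 0.75 and counts as a match at the single threshold of 0.5 but at only six of the ten thresholds in the range, which is why the averaged score is necessarily the lower of the two.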
Finally, we utilize such a model in our system for detecting
pathologically suspicious regions with a confidence threshold
of 0.50 or above. The employed core processing pipeline is
depicted in Figure 2: first a user provides the tool with a
raw surgery video, which then is analyzed frame by frame
extracting bounding boxes, masks and labels. Whenever results
are found, the tool uses the determined segmentation masks
to produce annotated frames as well as an overall detection
summary in the form of an indication bar, as depicted in Figure 3.
This bar indicates frame-by-frame detections over time, colored
by detection confidence (yellow to dark red); values for
multiple detections are averaged. Both the segmentation results and
the indication bar are integrated into the final video output,
while the current video position is additionally marked with a
green horizontal bar. This way, viewers of such annotated
output videos are provided at any point in time with an
overview of potentially important sections. All extracted data
can additionally be stored in JSON format, so as to facilitate
integration into future interactive video browsing systems.
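The construction of the indication bar can be sketched as follows: detections below the confidence threshold are discarded, the remaining confidences of a frame are averaged, and the average is mapped to a color between yellow and dark red. This is a sketch under assumptions: the concrete RGB values, the linear interpolation, the JSON field names and all function names are hypothetical and may differ from the actual tool.

```python
import json
from typing import Dict, List, Optional, Tuple

CONF_THRESHOLD = 0.5      # detections below this value are ignored
YELLOW = (255, 255, 0)    # assumed RGB color for the lowest confidence
DARK_RED = (139, 0, 0)    # assumed RGB color for the highest confidence


def frame_confidence(scores: List[float]) -> Optional[float]:
    """Average the confidences of all accepted detections in one frame."""
    kept = [s for s in scores if s >= CONF_THRESHOLD]
    return sum(kept) / len(kept) if kept else None


def confidence_color(conf: float) -> Tuple[int, int, int]:
    """Linearly interpolate from yellow (threshold) to dark red (1.0)."""
    t = (conf - CONF_THRESHOLD) / (1.0 - CONF_THRESHOLD)
    t = min(max(t, 0.0), 1.0)
    return tuple(round(a + (b - a) * t) for a, b in zip(YELLOW, DARK_RED))


def build_summary(per_frame_scores: List[List[float]]) -> List[Dict]:
    """One entry per frame that contains at least one accepted detection."""
    summary = []
    for index, scores in enumerate(per_frame_scores):
        conf = frame_confidence(scores)
        if conf is not None:
            summary.append({
                "frame": index,
                "confidence": conf,
                "color": list(confidence_color(conf)),
            })
    return summary


def summary_to_json(per_frame_scores: List[List[float]]) -> str:
    """Serialize the summary, e.g. for a later interactive browser."""
    return json.dumps(build_summary(per_frame_scores))
```

Frames without accepted detections produce no entry, so the corresponding positions of the bar stay empty.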
B. Hardware and Runtime Analysis
For implementation, training and evaluation we used a
workstation with the following specifications: Intel Core
i7-5820K CPU @ 3.30GHz x 6, 32 GiB DDR3 @ 1333 MHz,
Nvidia GeForce GTX 1080. On such a machine, model training
required approximately 2h to complete. The tool has been
implemented using Linux Ubuntu 18.x, but also successfully
tested on Windows 10 systems. Given the exclusive utilization
of cross-platform technologies (cf. Section II-C), it is assumed
to be compatible with macOS as well.
3https://tinyurl.com/ENIDDS
TABLE I: Processing time comparison of 16:9 resolutions.
resolution    avg. processing time per frame (ms)
640×360       153
1280×720      158
1920×1080     170
3840×2160     207
Concerning runtime performance, when using GPU processing
the system requires an average of approximately 150-250 ms
of processing time per frame for most videos, as
outlined in Table I. Although it clearly grows with larger resolutions,
the processing time essentially depends on resizing the input
images, since the generated model’s input is resized to fit a
restricted pixel range, i.e. 800 pixels for the shortest and
1333 pixels for the longest image side. Hence, assuming a
per-frame performance of 170 ms, we can approximately estimate
the overall time required for processing an hour of video
produced by an endoscope recording in HD resolution at
25 frames per second:
$\frac{170 \times 25 \times 60 \times 60}{1000}\,\mathrm{s} = 15300\,\mathrm{s} = 4\,\mathrm{h}\,15\,\mathrm{min}$.
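Both the resizing constraint and the time estimate can be reproduced with a few lines of code. The sketch below assumes the common shortest-edge resizing rule (scale the shortest side to 800 pixels unless the longest side would then exceed 1333 pixels); the exact rounding used by the underlying framework may differ, and all function names are hypothetical.

```python
from typing import Tuple

MIN_SIDE = 800    # target length of the shortest image side
MAX_SIDE = 1333   # upper bound for the longest image side


def resized_shape(width: int, height: int) -> Tuple[int, int]:
    """Model input size (width, height) under the assumed resizing rule."""
    scale = MIN_SIDE / min(width, height)
    if max(width, height) * scale > MAX_SIDE:
        scale = MAX_SIDE / max(width, height)
    return int(width * scale + 0.5), int(height * scale + 0.5)


def processing_seconds(ms_per_frame: float, fps: float,
                       video_seconds: float) -> float:
    """Total processing time in seconds for a video of the given length."""
    return ms_per_frame * fps * video_seconds / 1000.0


def format_duration(seconds: float) -> str:
    """Format a duration as hours and minutes, rounded to full minutes."""
    total_minutes = round(seconds / 60)
    return f"{total_minutes // 60}h{total_minutes % 60:02d}m"
```

Under this assumed rule every 16:9 resolution listed in Table I is mapped to the same model input size, which would be consistent with the comparatively small differences in processing time. Calling `format_duration(processing_seconds(170, 25, 3600))` reproduces the estimate given above.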
C. Installation and Usage
The tool requires working installations of OpenCV 4, Python
3.x 5, FFmpeg 6 and Detectron2 7. All further requirements can
simply be installed by running:
$ pip install -r requirements.txt
In its most basic use case – analyzing a single video – the
tool can be executed by running:
$ python demo.py -i -m -o
The tool is also capable of multi-video and -model process-
ing and a detailed description of all available options can be
produced by running the script with the ’-h’ flag.
III. CONCLUSION
We present a tool for segmenting and annotating endometrial
implants in laparoscopic videos. Approaching this problem
by combining video object tracking with
state-of-the-art image segmentation, we achieve qualitatively
good results that can be regarded as a first step towards an
interactive post-surgical video archive browser, which could
be of great assistance for treatment planning as well as
clinical education. Finally, this work provides valuable insights
into the feasibility of applying machine learning approaches
traditionally developed for real-world object detection to a
practical medical use case.
ACKNOWLEDGMENTS
This work was funded by the FWF Austrian Science Fund
under grant P 32010-N38.
REFERENCES
[1] M. Canis, J. Donnez, D. Guzick, J. Halme, J. Rock, R. Schenken,
and M. Vernon, “Revised American Society for Reproductive Medicine
classification of endometriosis: 1996,” Fertility and Sterility, vol. 67,
no. 5, pp. 817–821, 1997.
[2] J. Keckstein, U. Ulrich, M. Possover, K. Schweppe et al.,
“Enzian-Klassifikation der tief infiltrierenden Endometriose,” Zentralblatt für
Gynäkologie, vol. 125, p. 291, 2003.
[3] J. Keckstein and G. Hudelist, “Classification of DIE including bowel
endometriosis: from r-ASRM to #Enzian-classification,” Best Practice &
Research Clinical Obstetrics & Gynaecology, 2020.
[4] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397,
2020. [Online]. Available: https://doi.org/10.1109/TPAMI.2018.2844175
[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time
object detection with region proposal networks,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp.
1137–1149, June 2017. [Online]. Available:
https://doi.org/10.1109/TPAMI.2016.2577031
[6] A. Leibetseder, S. Kletz, K. Schoeffmann, S. Keckstein, and
J. Keckstein, “GLENDA: gynecologic laparoscopy endometriosis
dataset,” in MultiMedia Modeling - 26th International Conference,
MMM 2020, Daejeon, South Korea, January 5-8, 2020, Proceedings,
Part II, ser. Lecture Notes in Computer Science, Y. M. Ro, W. Cheng,
J. Kim, W. Chu, P. Cui, J. Choi, M. Hu, and W. D. Neve,
Eds., vol. 11962. Springer, 2020, pp. 439–450. [Online]. Available:
https://doi.org/10.1007/978-3-030-37734-2_36
4OpenCV 4.x, https://opencv.org
5Python 3.x, https://www.python.org
6https://ffmpeg.org
7https://github.com/facebookresearch/detectron2
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[8] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 1440–1448.
[9] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in
context,” in European conference on computer vision. Springer, 2014,
pp. 740–755.