Abstract
Endometriosis diagnosis via laparoscopy remains challenging due to subtle lesion appearances and inter-observer variability. While artificial intelligence shows promise for surgical video analysis, the potential of Vision Transformers (ViTs) specifically for endometriosis detection remains unexplored. This study applied a SWOT framework to evaluate ViTs for automated endometriosis diagnosis in laparoscopic videos. Analysis of 10 studies from PubMed, IEEE Xplore, and Scopus identified key findings. Strengths included (1) global attention for lesion detection, (2) better performance than CNNs/RNNs in surgical tasks (91-97% accuracy), and (3) multimodal data integration. Weaknesses were (1) dependence on annotated datasets that are not yet available, (2) high computational needs, (3) limited local feature sensitivity, and (4) annotation variability issues. Opportunities involved (1) self-supervised learning from unlabeled videos and (2) explainable attention maps. Threats comprised (1) performance variability across surgical settings, (2) a lack of regulatory standards, and (3) data privacy concerns. Crucially, no studies directly tested ViTs for endometriosis diagnosis despite their potential. For clinical implementation, three requirements emerged: (1) collaborative dataset creation, (2) optimized hybrid architectures, and (3) ethical guidelines for surgical AI. This structured analysis provides a roadmap for developing ViT-based diagnostic tools while addressing current limitations in data, technology, and clinical integration.
Keywords: Vision Transformers (ViTs), Endometriosis, Laparoscopic Surgery, SWOT Analysis, Ethical AI.
1. Department of Gynecology, Babol University of Medical Sciences, Babol, Iran; 2*. Corresponding author: Department of Surgery, Shahid Beheshti University of Medical Sciences, Tehran, Iran. Email: [email protected]; 3. Faculty of Medicine, Mazandaran University of Medical Sciences, Mazandaran, Iran; 4. Student Research Committee, Semnan University of Medical Sciences, Semnan, Iran; 5. Student Research Committee, Babol University of Medical Sciences, Babol, Iran.
Open Access. © 2025 the author(s), published by InfoPub. This work is licensed under the Creative Commons Attribution 4.0 International License. (Journal homepage: https://www.isjtrend.com)
https://doi.org/10.61186/ist.202502.05.02
InfoScience Trends ǁ (2025) NO 05; VOL 02: 11-22 ǁ ǁ https://doi.org/10.61186/ist.202502.05.02ǁ InfoPub
www.isjtrend.com
Introduction
Automated analysis of laparoscopic videos is emerging as a valuable tool for improving the detection of diseases such as endometriosis, offering the potential to assist surgeons with complex intraoperative decisions and to standardize diagnostic accuracy [1, 2]. Traditional deep learning approaches for surgical video analysis have relied heavily on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which excel at capturing local spatial features and short-term temporal patterns but often struggle with the nuanced and diffuse presentations typical of gynecologic pathologies like endometriosis [3, 4]. Advances in Vision Transformers (ViTs) and transformer-inspired architectures have introduced a new paradigm capable of modeling global spatial and temporal relationships in video data, resulting in superior performance over conventional models for tool detection, workflow analysis, and related tasks within laparoscopy [2, 5].
Despite this progress, the application of ViT-based models to direct disease diagnosis from laparoscopic video, and specifically to endometriosis identification, remains largely uncharted in the published literature [6, 7]. The feasibility of deploying such models is challenged by factors including data scarcity, high annotation costs, and significant computational demands required for real-time clinical integration [4, 8]. Moreover, ethical issues such as annotation burden, data privacy, interpretability, and regulatory acceptance have not been systematically explored in the context of transformer-based surgical video analysis [7, 9].
To systematically assess these complex dimensions, the SWOT (Strengths, Weaknesses, Op-
portunities, Threats) analysis framework is increasingly used in technology evaluation [10].
SWOT provides a structured approach to examining both the internal capabilities and limitations
of a technology, as well as the external factors that may facilitate or hinder its clinical adoption
[11]. By identifying strengths and weaknesses intrinsic to ViT architectures, and analyzing oppor-
tunities and threats rooted in clinical, ethical, or regulatory contexts, a SWOT analysis can offer a
comprehensive perspective for guiding future research and practical implementation.
The purpose of this study is to bridge these critical gaps by providing a structured SWOT anal-
ysis of Vision Transformer models for the automated diagnosis of endometriosis from laparo-
scopic videos. This review aims to evaluate the current state of technical feasibility and to high-
light the ethical challenges involved, offering practical considerations for future research and clin-
ical translation.
Methods
Study Design
This study employs a qualitative, structured SWOT (Strengths, Weaknesses, Opportunities,
Threats) analysis to systematically evaluate the feasibility and ethical challenges of deploying Vi-
sion Transformer (ViT) architectures for automated diagnosis of endometriosis from laparoscopic
video data.
Literature Search and Data Sources
A comprehensive literature search was conducted using PubMed, IEEE Xplore and Scopus to
identify primary research articles, systematic reviews, and technical reports relevant to:
• Vision Transformer (ViT) or transformer-inspired models in medical image/video analy-
sis
• Automated disease diagnosis in laparoscopy, including but not limited to endometriosis
• Feasibility, clinical integration, and ethical aspects of AI in surgical video analysis
Search terms included combinations of the following groups, together with related synonyms:
• Technology terms: "vision transformer" OR "ViT" OR "visual transformer" OR "transformer architecture" OR "attention mechanism"
• Clinical application terms: "laparoscopic video" OR "surgical video" OR "endoscopic video" OR "minimally invasive surgery"
• Diagnosis terms: "disease diagnosis" OR "pathology detection" OR "lesion identification" OR "medical diagnosis"
• Analysis framework terms: "technology assessment" OR "clinical implementation" OR "adoption barriers"
• Ethical considerations: "ethical challenges" OR "AI ethics" OR "algorithmic bias"
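As a rough illustration, the term groups listed above could be assembled into a single Boolean query string. The sketch below assumes that terms within a group are joined with OR and that the groups themselves are joined with AND; the text does not state how the groups were combined. The name `build_query` and the choice of three groups are illustrative only.

```python
# Illustrative sketch: assemble the listed search terms into one Boolean query.
# Assumption (not stated in the text): OR within a group, AND between groups.

TERM_GROUPS = {
    "technology": ["vision transformer", "ViT", "visual transformer",
                   "transformer architecture", "attention mechanism"],
    "clinical": ["laparoscopic video", "surgical video", "endoscopic video",
                 "minimally invasive surgery"],
    "diagnosis": ["disease diagnosis", "pathology detection",
                  "lesion identification", "medical diagnosis"],
}


def build_query(groups):
    """Return '("a" OR "b") AND ("c" OR "d") ...' with every term double-quoted."""
    clauses = []
    for terms in groups.values():
        quoted = " OR ".join(f'"{t}"' for t in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


query = build_query(TERM_GROUPS)
print(query)
```

Each database has its own field tags and syntax, so a string like this would normally be adapted per source rather than reused verbatim.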
Inclusion and Exclusion Criteria
Inclusion criteria
• Peer-reviewed articles or preprints published in English
• Studies describing ViT or transformer-based machine learning approaches in laparoscopic video analysis
• Papers addressing automated disease diagnosis, particularly for endometriosis, or dis-
cussing general feasibility and ethical considerations for AI in surgical video
Exclusion criteria
• Studies focusing solely on surgical phase recognition, tool detection, or workflow analysis
without disease/pathology diagnosis
• Articles without explicit discussion of model strengths, weaknesses, opportunities, or
threats
SWOT Framework Development
Drawing on the included literature and expert domain knowledge, a SWOT matrix was con-
structed to capture:
• Strengths: Inherent capabilities and advantages of ViT models for disease detection in
laparoscopic video
• Weaknesses: Technical and practical limitations, including data and clinical constraints
• Opportunities: External factors and future directions that may facilitate clinical adop-
tion, innovation, or improved outcomes
• Threats: Risks, barriers, and ethical concerns associated with real-world deployment and broader societal implications
Evaluation Process
Each article was independently reviewed by two researchers. Factual statements regarding
ViT architectures, feasibility, and ethical issues were extracted and categorized into the SWOT
components. Discrepancies were resolved through discussion.
Results
Systematic Literature Search and Screening Process
Our search across PubMed (n=45), IEEE Xplore (n=32), and Scopus (n=28) initially identified 105 articles published between 2018 and 2023. After removal of 23 duplicates, 82 unique records underwent title/abstract screening. Of these, 52 were excluded for irrelevance to laparoscopic video analysis or lack of focus on Vision Transformers (ViTs).
The remaining 30 full-text articles were assessed against the predefined inclusion/exclusion criteria. A further 20 studies were excluded for the following reasons:
• 12 studies focused exclusively on surgical tool detection or workflow segmentation without disease diagnosis.
• 7 studies employed CNNs/RNNs only, omitting ViT or transformer-based architectures.
• 1 study was not published in English.
The final 10 articles (Table 1) met all criteria, including:
• Use of ViT or hybrid transformer models in laparoscopic/endoscopic video analysis.
• Relevance to automated disease diagnosis (though none addressed endometriosis
directly).
• Discussion of feasibility, clinical integration, or ethical challenges.
Notably, PubMed contributed 6 studies, IEEE Xplore 3, and Scopus 1, reflecting the interdis-
ciplinary nature of the field (computer science vs. clinical journals).
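The screening counts reported above can be checked with a few lines of arithmetic. The sketch below only re-derives the numbers given in the text; the variable names are illustrative.

```python
# Screening counts as reported in the text above.
identified = {"PubMed": 45, "IEEE Xplore": 32, "Scopus": 28}
duplicates = 23
excluded_title_abstract = 52
excluded_full_text = {"tool/workflow only": 12, "CNN/RNN only": 7, "non-English": 1}
included_by_source = {"PubMed": 6, "IEEE Xplore": 3, "Scopus": 1}

total = sum(identified.values())                          # records identified
unique = total - duplicates                               # after de-duplication
full_text = unique - excluded_title_abstract              # assessed in full text
included = full_text - sum(excluded_full_text.values())   # final set

print(total, unique, full_text, included)  # 105 82 30 10

# The per-database breakdown of included studies matches the final count.
assert included == sum(included_by_source.values())
```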
Table 1. Summary of Included Studies on Vision Transformers (ViTs) in Surgical Video Analysis.
| Study | Purpose | Type | Method & Materials | Results | Conclusion |
|---|---|---|---|---|---|
| [9] | Develop and assess a multimodal transformer model for analyzing laparoscopic surgical videos, aiming to reduce negative outcomes and improve patient safety. | Original research (empirical) | Multimodal model inspired by Video-Audio-Text Transformer; uses ViT for images and BERT for text; trained on Cholec80 LC videos with various complexities. | Mean accuracy 91.0%, precision 81%, recall 83% on 30/80 test videos (Cholec80 dataset). | Shows the model can extract hidden and distinct features, helping to create safer surgery systems, but the overall role and advantages of AI models in surgery remain uncertain. |
| [12] | Propose dataset and method for event recognition in laparoscopic gynecology videos; evaluate hybrid transformer for detecting critical intra- and post-operative events. | Original research (empirical) | Introduces annotated event dataset; compares several CNN-RNN models; develops hybrid transformer architecture for event recognition; uses frame sampling. | The hybrid transformer improves recognition accuracy, counteracts occlusion/motion blur, and yields high temporal resolution in event recognition. | The proposed hybrid transformer approach is superior to existing CNN-RNN methods for event recognition in laparoscopic surgery videos. |
| [2] | Detect surgical tool presence in laparoscopic video using a transformer architecture (LapFormer). | Original research (empirical) | LapFormer model: feed-forward transformer with attention for inter-frame correlation; evaluated on Cholec80 dataset. | Outperforms CNN and RNN baselines by 20.3 and 17.3 points, respectively, in macro-F1 score; includes ablation studies. | Transformer architecture is more effective than previous methods for surgical tool detection in laparoscopic videos. |
| [7] | Explore use of pure vision transformers for classifying single- and multi-label surgical tool frames in laparoscopic surgery. | Original research (empirical) | Pure ViT models for SL/ML tool classification; 5-fold cross-validation on Cholec80 dataset. | Mean average precision (mAP) = 95.8%, outperforming conventional multi-label models. | Results suggest promise for ViT models in surgical tool detection, warranting further research. |
| [13] | Present a bidirectional transformer with sparse attention for GI disease recognition from endoscopy images. | Original research (empirical) | Bidirectional Transformer with Sparse Attention (BTSA); trained and tested on large-scale GI endoscopy datasets. | BTSA achieves outstanding performance, surpasses existing models in GI disease recognition, and is efficient. | Model shows significant potential, but further research and validation are needed for clinical utility. |
| [8] | Benchmark deep learning models (convolutional, transformer, hybrid) for surgical workflow/phase recognition in endoscopic videos. | Original research (empirical) | Compares fully convolutional, fully transformer, and hybrid models; workflow recognition from gastric bypass surgery videos. | Hybrid model achieves 93% frame-level accuracy and 85 segmental edit distance; fully transformer also gets state-of-the-art results. | Hybrid models effectively capture workflow and yield best results in surgical video analysis. |
| [14] | Develop a CNN with a new attention module (P-CSEM) to improve laparoscopic tool detection. | Original research (empirical) | CNN with P-CSEM attention modules at various layers; trained and tested on Cholec80 dataset. | Attention-enhanced model achieves mean average precision of 93.14%; feature relevance improved. | Custom attention modules inside CNNs can enhance tool relevance in laparoscopic tool classification. |
| [15] | Use a transformer-based, multi-view approach to improve surgical phase recognition in laparoscopic cholecystectomy. | Original research (empirical) | Multi-view phase recognition using transformer-based model with late fusion of laparoscopic and in-room camera data. | Performance is mixed; real-world data collection introduced challenges; model performance decreased with poor data. | Integration of multi-view data is complex; real-world diversity and better data are needed for optimal models. |
| [16] | Review and summarize deep learning methods (incl. transformers) for phase and step recognition in surgical workflow analysis. | Systematic review (review article) | Systematic review; searched databases for studies post-2018 on deep learning methods for surgical workflow; 44 studies reviewed. | Temporal context (RNN, CNN, Transformers) is key; lack of diverse datasets is a major challenge in workflow recognition. | The field is advancing, but robust, generalizable models are hampered by limited, supervised datasets. |
| [4] | Compare performance of ViT, CNN, and hybrid CNN-ViT architectures for GI disease classification from endoscopy images. | Original research (empirical) | ViT, hybrid CNN-ViT, CNN architectures; trained on GI endoscopy images (WCE, etc.); classified 6 GI disease classes. | Hybrid model achieves test accuracy of 97.91%, F1 97.91%, precision 98.01%; compared against CNN and ViT models. | Hybrid CNN-ViT models can accurately classify GI diseases from endoscopy images. |
Vision Transformer (ViT) or Transformer-Inspired Models in Medical Image/Video Analysis
Recent years have seen increased application of Vision Transformer (ViT) and transformer-inspired architectures in medical image and video analysis. Multiple studies have focused on leveraging the global attention mechanisms of ViTs and their hybrid forms for diverse tasks, including tool detection, workflow and phase recognition, and disease diagnosis in gastrointestinal endoscopy images. These models have shown performance improvements over traditional CNNs or RNNs, especially for spatio-temporal data in laparoscopic and endoscopic domains. However, the predominant use cases reported in the literature relate to surgical tool or workflow recognition, with relatively few studies targeting disease diagnosis, and none specifically focusing on endometriosis in laparoscopic videos [4, 7-9, 12] (Table 2).
Table 2. Applications of ViTs in Laparoscopic/Endoscopic Video Tasks.
| Study | Task | Data Type | ViT Usage | Key Outcome |
|---|---|---|---|---|
| [3] | Surgical tool detection | Laparoscopic video | Pure ViT | Superior macro-F1 to CNN/RNN (+20%) |
| [7] | Tool classification (SL/ML) | Laparoscopic video | Pure ViT | mAP 95.8%; outperforming previous CNN models |
| [8] | Workflow/phase recognition | Laparoscopic video | ViT/CNN/Hybrid | Transformers match SOTA convolutional results |
| [4] | GI disease classification | Capsule endoscopy images | ViT/Hybrid CNN | ViT and hybrids outperform CNN only |
| [13] | GI disease recognition | GI endoscopy images | Bidirectional transformer | Sparse attention improves efficiency |
Automated Disease Diagnosis in Laparoscopy, Including but Not Limited to Endometriosis
Automated disease diagnosis in laparoscopy remains a relatively underexplored application compared to tool detection or workflow recognition. Across the reviewed literature, there are no studies directly addressing endometriosis detection with ViT-based methods from laparoscopic videos. While several references evaluate transformers for disease classification in gastrointestinal endoscopy images, these studies neither use laparoscopic video data nor focus on gynecological pathologies such as endometriosis. The only disease diagnosis applications identified relate to GI pathologies from endoscopy, not laparoscopy [4]. Tool and event recognition studies in laparoscopy do not provide systematic assessment of disease classification performance [7-9, 12] (Table 3).
Table 3. Gap Analysis: Disease Diagnosis in Laparoscopy vs. Endoscopy.
| Study | Disease Task | Method | Laparoscopic Video | Outcome/Limitations |
|---|---|---|---|---|
| [13] | GI disease (general) | Bidirectional transformer | No | Applies to endoscopy, not laparoscopy |
| [4] | GI disease (6 types) | ViT/Hybrid CNN | No | Only WCE images, not laparoscopy video |
| [2, 7-9, 12] | Tool/phase/event | ViT/Hybrid/CNN | Yes | No disease/pathology classification task |
Feasibility, Clinical Integration, and Ethical Aspects of AI in Surgical Video Analysis
Feasibility and clinical integration of AI models, especially Vision Transformers, in surgical video analysis are discussed peripherally in several empirical studies, though not as primary outcomes. The main technical feasibility concerns center on data scarcity, the need for large and granularly annotated training datasets, and high computational demands for real-time application. For clinical integration, general requirements include high interpretability (attention-based heatmaps), robust performance across hardware and sites, and alignment with surgical workflow. Ethical aspects, such as annotation burden, trust in AI, and concerns over robust deployment, are noted in generic terms in the literature, with no detailed frameworks specific to ViT-based disease diagnosis in laparoscopy [4, 7-9]. Safety, explainability, and regulatory barriers are highlighted as ongoing challenges (Table 4).
Table 4. Key Feasibility and Ethical Challenges for ViT Deployment.
| Aspect | Noted Issues | Studies Citing the Issue |
|---|---|---|
| Data hunger | Requires large annotated datasets | [2, 4, 7-9] |
| Computation | Resource-intensive; real-time limits | [2, 7-9] |
| Interpretation | Needs explainable attention maps | [2, 9] |
| Annotation cost | High labeling burden | [2, 7, 8] |
| Ethical/safety | Clinical trust, regulatory hurdles | [2, 4, 7, 8] |
SWOT Analysis
Strengths of ViT for Automated Diagnosis of Endometriosis from Laparoscopic Videos
Vision Transformers provide a global receptive field allowing for comprehensive aggregation
of visual information across each frame and, in extended forms, across video sequences. This spa-
tial (and potentially temporal) attention facilitates detection of diffuse, subtle, or irregularly
shaped disease lesions commonly seen in endometriosis. ViTs are also highly adaptable to multi-
modal fusion, supporting the integration of additional context such as surgical reports or instru-
ment data. Empirical studies in tool and workflow recognition indicate ViTs can outperform tra-
ditional CNNs, particularly when applied to well-annotated surgical video datasets [2, 7-9].
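To make the idea of a global receptive field concrete, the following toy sketch computes single-head scaled dot-product self-attention over a handful of patch embeddings. It is a simplification under stated assumptions: the learned query/key/value projections and multiple heads of a real ViT are omitted, and the four two-dimensional "patch embeddings" are hypothetical values chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with Q = K = V = tokens.

    Simplification: a real ViT applies learned query/key/value projections and
    several heads; they are omitted here to keep the global mixing visible.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Four hypothetical patch embeddings (e.g. a 2x2 grid of image patches).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out, attn = self_attention(patches)
for row in attn:
    print([round(w, 3) for w in row])
```

Every row of the attention matrix assigns a non-zero weight to every patch, so each output token mixes information from the whole frame in a single layer. A convolution, by contrast, only sees a local neighborhood per layer.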
Weaknesses of ViT for Automated Diagnosis of Endometriosis from Laparoscopic Videos
A primary weakness is the requirement for large quantities of diverse, accurately annotated video data for effective model training. Such resources are scarce in laparoscopic disease diagnosis, especially for conditions like endometriosis. ViTs are computationally demanding, posing challenges for real-time operating room integration without significant hardware support or model optimization. Their limited inherent inductive bias for local features can reduce sensitivity to small or localized lesions unless they are modified with hybrid architectures. Issues with label quality and interobserver variability in disease annotation may also affect reliability [2, 7-9].
Opportunities of ViT for Automated Diagnosis of Endometriosis from Laparoscopic Videos
Emerging opportunities include leveraging self-supervised pretraining on large, unannotated laparoscopic video archives to reduce manual labeling requirements and improve model robustness. Federated learning across multiple centers can help build generalizable models while preserving patient privacy. Explainable AI techniques leveraging visual attention maps may facilitate clinical acceptance by enhancing model interpretability. Additionally, advances in spatio-temporal transformer variants and multimodal approaches could further enhance disease detection capabilities [2, 4, 7-9].
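The federated learning idea mentioned above can be sketched with a weighted parameter average in the style of federated averaging (FedAvg). This is one common aggregation scheme, shown here as an assumption; the reviewed studies do not specify a particular method. The three sites, their parameter vectors, and their video counts are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Average per-site parameter vectors, weighted by local dataset size."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical example: three hospitals, each holding a locally trained copy of
# the model (flattened to two parameters) and a different number of videos.
site_weights = [[0.2, 1.0], [0.4, 0.0], [0.6, 2.0]]
site_sizes = [100, 300, 100]

global_weights = federated_average(site_weights, site_sizes)
print(global_weights)
```

Only parameter vectors are exchanged in this scheme; the laparoscopic videos themselves stay at each site, which is the privacy argument made in the text.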
Threats of ViT for Automated Diagnosis of Endometriosis from Laparoscopic Videos
Key threats include vulnerability to domain shifts caused by different surgical equipment, camera systems, lighting conditions, and operator techniques, all of which could degrade ViT model performance if not addressed. The lack of standardized datasets and high-quality labeled examples for endometriosis remains a significant barrier. Further, regulatory approval for clinical deployment may be hindered by the complexity and opacity of transformer models, challenges with real-time guarantees, and persistent concerns about trust and accountability in automated decision-making [2, 4, 7, 8].
SWOT Matrix for ViT-Based Endometriosis Diagnosis from Laparoscopic Videos
The SWOT matrix synthesizes internal (Strengths/Weaknesses) and external (Opportunities/Threats) factors influencing ViT-based endometriosis diagnosis. Strengths highlight ViTs’ global attention for detecting diffuse lesions, while weaknesses address data and computational demands. Opportunities include federated learning and explainable AI, whereas threats encompass domain shifts and regulatory barriers. This framework guides balanced evaluation for clinical adoption (Figure 1).
Figure 1. SWOT Matrix for ViT-Based Endometriosis Diagnosis from Laparoscopic Videos.
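For readers working from the text alone, the four quadrants described in the SWOT sections can be written down as a small data structure. The entries below paraphrase factors named in this article; the dictionary layout and the `render` helper are illustrative and are not a reproduction of the figure.

```python
# SWOT factors paraphrased from the sections above; the layout is illustrative.
SWOT = {
    "Strengths": [
        "Global attention suited to diffuse or subtle lesions",
        "Outperforms CNN/RNN baselines in tool and workflow tasks",
        "Adaptable to multimodal fusion",
    ],
    "Weaknesses": [
        "Needs large, accurately annotated video datasets",
        "High computational demands for real-time use",
        "Limited inductive bias for local features",
        "Label quality and interobserver variability",
    ],
    "Opportunities": [
        "Self-supervised pretraining on unannotated video",
        "Federated learning across centers",
        "Explainable attention maps",
        "Spatio-temporal and multimodal transformer variants",
    ],
    "Threats": [
        "Domain shift across equipment, lighting, and operators",
        "Lack of standardized endometriosis datasets",
        "Regulatory hurdles for opaque models",
        "Trust and accountability concerns",
    ],
}


def render(matrix):
    """Return the matrix as indented plain text, marking internal vs external."""
    lines = []
    for quadrant, items in matrix.items():
        scope = "internal" if quadrant in ("Strengths", "Weaknesses") else "external"
        lines.append(f"{quadrant} ({scope})")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)


print(render(SWOT))
```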
Feasibility and Ethical Challenges
The use of Vision Transformer (ViT) architectures for automated disease diagnosis in laparoscopic videos offers significant promise but also faces considerable challenges. Moreover, integrating ViT-based diagnostic models into clinical laparoscopic workflows introduces several ethical concerns, many of which remain unresolved in the literature. A key challenge is the annotation burden: developing large, high-quality labeled datasets for surgical videos demands substantial time from medical experts and raises potential risks regarding patient data privacy [2, 8] (Table 5).
Table 5. Feasibility and Ethical Challenges for ViT-based Automated Diagnosis of Endometriosis from Laparoscopic
Videos.
| Domain | Challenges/Considerations | References |
|---|---|---|
| Feasibility | Large, annotated laparoscopic video datasets are scarce, especially with disease/pathology labels (e.g., endometriosis). | [2, 4, 7, 8] |
| Feasibility | ViTs require substantial computational resources; real-time deployment in the clinical operating room (OR) is challenging. | [7, 8] |
| Feasibility | Annotation requires expert clinicians, leading to high costs and annotation burden. | [2, 7, 8] |
| Feasibility | Self-supervised pretraining and federated learning are proposed to address data scarcity and improve generalization but are not yet standard in this context. | [4, 9] |
| Feasibility | Hybrid or sparse attention architectures may partially alleviate computational loads; further validation is needed. | [8, 13] |
| Feasibility | No direct empirical evidence exists for ViT feasibility in endometriosis diagnosis; existing evidence is from adjacent tasks (tool/phase detection). | [2, 4, 7, 9] |
| Ethical Challenges | High annotation burden and privacy concerns associated with creation and use of surgical video datasets. | [2, 7, 8] |
| Ethical Challenges | Potential for bias (class imbalance, label noise, interobserver variability) may compromise fairness and reliability. | [4, 8, 16] |
| Ethical Challenges | "Black-box" nature of ViTs limits interpretability; attention maps help but have limits. | [2, 9] |
| Ethical Challenges | Regulatory, accountability, and liability issues are more complex due to ViT model opacity and performance uncertainty. | [7, 8] |
| Ethical Challenges | Consent, secure data storage, and transparency are unresolved in current frameworks for AI in surgical video analysis. | [2, 4, 7, 8] |
Implications
The absence of direct evidence for ViT-based endometriosis diagnosis underscores a critical gap in both the research landscape and clinical translation efforts for advanced AI models in gynecologic laparoscopy. The strengths of transformers (global attention, flexibility for integrating multiple data modalities, and state-of-the-art performance) present clear opportunities for improving disease detection, but significant weaknesses and external threats must be addressed. These include the need for large-scale, high-quality annotated datasets, computational barriers to real-time deployment, vulnerability to domain shift, and the lack of explainability required for surgeon trust and regulatory approval. The practical integration of ViT-based diagnostic AI in surgical settings therefore depends not only on further technological innovation, but also on robust clinical validation, ethical guidelines, and cross-disciplinary collaboration.
Recommendations for Practice
1. Data Infrastructure and Sharing: Collaborative efforts should be initiated to develop, standardize, and share large, annotated laparoscopic video datasets, specifically including cases of endometriosis, to enable model training and robust benchmarking [2, 7, 8].
2. Model Development and Validation: Researchers should explore hybrid and spatio-temporal transformer architectures, leveraging approaches such as self-supervised and federated learning to address data scarcity and improve cross-site generalizability [2, 4, 7-9].
3. Interpretability and Trust: Development of explainable AI techniques, such as attention heatmaps, should be prioritized to support clinical interpretation and facilitate trust among surgeons and regulatory bodies [2, 9].
4. Ethical and Regulatory Oversight: Early integration of ethical considerations, including annotation burden, bias, and informed consent, is essential. Engagement with regulatory agencies should guide the development of compliance-ready AI systems [2, 4, 7, 8].
5. Clinical Integration: Pilot implementation of ViT-based diagnostic support tools should be closely monitored in controlled settings, ensuring that workflow integration, computational requirements, and real-time performance meet clinical and patient safety standards [8].
Discussion
This study provides the first structured SWOT analysis of Vision Transformer (ViT) models for automated diagnosis of endometriosis from laparoscopic videos, a task that remains unaddressed directly in the current literature. While recent advances in deep learning have established ViT and transformer-inspired architectures as state-of-the-art tools for complex visual tasks, a comparative examination of their performance, limitations, and broader implications in surgical video analysis is essential for informed development and adoption.
Compared to prior convolutional and recurrent approaches, ViT models show clear empirical strengths in related laparoscopic video analysis tasks. For instance, in tool detection and workflow/phase recognition, ViT-based and hybrid transformer models achieved superior accuracy and macro-F1 scores over conventional CNN and RNN baselines [2, 7, 8]. These gains stem largely from the global attention mechanisms that allow ViTs to model spatial dependencies and long-range context more efficiently than convolutional filters or sequential RNN layers. Additionally, ViT architectures are highly adaptable: studies show their applicability to single- and multi-label classification [7], as well as their extension to multimodal fusion, incorporating text and audio data for richer surgical scene understanding [9].
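Because several of the comparisons above are reported as macro-F1, a short sketch of how that metric is computed may help. Macro-F1 is the unweighted mean of per-class F1 scores, so a rarely used tool counts as much as a frequent one. The per-tool counts below are hypothetical and are not taken from any reviewed study.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1(*c) for c in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts per tool class.
counts = {
    "grasper": (90, 10, 10),
    "hook": (40, 5, 15),
    "clipper": (5, 5, 10),
}

for name, c in counts.items():
    print(name, round(f1(*c), 3))
print("macro-F1", round(macro_f1(counts), 3))
```

In this example the rare class pulls the macro average well below the score of the most frequent class, which is why macro-F1 is informative for imbalanced tool or lesion labels.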
However, automated disease diagnosis, particularly for conditions like endometriosis, is far less developed for ViT-based models than the adjacent domains of surgical tool recognition or workflow segmentation. While transformer models have been applied with promising results to GI disease diagnosis from endoscopic images [4], these studies do not use laparoscopic video data, nor do they address gynecological pathologies specifically. Tool detection and workflow studies in laparoscopy [2, 7-9, 14, 15] consistently exclude disease classification as an explicit outcome. This gap is striking, as it suggests that the advantages observed for ViT models in instrument and phase tasks may not translate automatically to disease recognition in the more heterogeneous and visually subtle context of endometriosis.
Feasibility constraints form a recurrent theme across the literature. Data scarcity and annotation costs are reported as the primary bottlenecks for transformer training in laparoscopy [2, 7]. ViT models require large datasets, often orders of magnitude greater than those needed for CNNs, to generalize well and avoid overfitting. This demand is particularly challenging in endometriosis research, where annotated laparoscopic video collections are uncommon. Computational barriers also persist, with studies highlighting the significant memory and processing requirements of self-attention mechanisms, especially for long surgical videos [4, 6]. In comparative terms, these challenges exceed those faced by most CNN/RNN-based pipelines and necessitate innovations such as hybrid architectures, sparse attention mechanisms, or pretraining with large-scale unlabeled data [4, 9, 13].
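The memory pressure of self-attention on long videos follows from its quadratic cost in the number of tokens. The sketch below illustrates this under stated assumptions: 224x224 frames split into 16x16 patches (typical ViT settings, not values taken from the reviewed studies), full attention across all frames jointly, 4-byte values, and a single head in a single layer. It ignores the class token and any memory-efficient attention implementation.

```python
def num_tokens(image_size, patch_size, frames=1):
    """Patch tokens for `frames` square frames split into square patches."""
    per_frame = (image_size // patch_size) ** 2
    return per_frame * frames


def attention_matrix_mib(tokens, bytes_per_value=4):
    """Memory of one tokens x tokens attention matrix (one head, one layer), in MiB."""
    return tokens * tokens * bytes_per_value / 2**20


# Assumed settings: 224x224 frames, 16x16 patches, joint attention over all frames.
for frames in (1, 16, 64):
    n = num_tokens(224, 16, frames)
    print(f"{frames:>3} frames -> {n:>6} tokens, {attention_matrix_mib(n):.2f} MiB")
```

Going from 1 to 16 frames multiplies the token count by 16 but the attention matrix by 256, which is the motivation for the sparse and hybrid designs mentioned above.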
Ethically, the transformer literature underlines issues that are not unique to ViT models but are amplified by their complexity. Class imbalance, label noise, and interobserver variability, all ubiquitous in medical annotation, pose risks for the fairness and reliability of automated disease diagnosis [4, 8]. Prior CNN/RNN approaches have already drawn scrutiny for their "black-box" characteristics, and ViTs, while capable of some explainability through attention maps [2, 9], still struggle to provide fully interpretable and trustworthy outputs. This complicates regulatory approval, clinical acceptance, and potential liability discussions. Furthermore, data-sharing innovations such as federated learning are only nascent and untested in the context of disease detection in laparoscopic video [4, 9].
A comparison of ViT adoption in laparoscopic disease diagnosis with its success in endoscopic GI disease tasks [4, 16] makes clear that pathologies with strong, localized visual signatures (e.g., certain GI lesions) align more easily with patch-based transformer reasoning. For heterogeneous, poorly demarcated diseases such as endometriosis in laparoscopy, future research will need to consider domain-adapted attention mechanisms or hybrid models that reintroduce local structural biases [2, 8]. Moreover, workflow and tool recognition studies demonstrate the potential of training on larger, multi-institutional datasets [2, 8], a lesson directly transferable to the endometriosis context if privacy and labeling barriers can be overcome.
In summary, while there is strong comparative evidence that ViT models can match and
sometimes exceed the performance of existing deep learning approaches in laparoscopy for tasks
such as tool and workflow recognition [7, 8], there remains a marked lack of direct research on
their application to automated disease diagnosis, and none at all on endometriosis detection in
laparoscopic videos. The feasibility and ethical challenges identified here exceed those of previous
generations of AI models and must be systematically addressed before ViT-based diagnostics
can be safely and successfully integrated into surgical practice.
Limitations
This study has several limitations. First, the literature search was restricted to English-language
articles, potentially omitting relevant non-English research. Second, the lack of direct studies
on ViTs for endometriosis diagnosis necessitated extrapolation from adjacent tasks (e.g., tool
detection), which may not fully represent disease-specific challenges. Third, the SWOT analysis,
while systematic, remains qualitative and could benefit from quantitative validation through
stakeholder surveys or multicenter data. Finally, rapid advances in transformer architectures
may render some of the technical feasibility assessments outdated.
Conclusion
Vision Transformers (ViTs) demonstrate potential for automating endometriosis diagnosis in lap-
aroscopic videos, leveraging their ability to model global spatial relationships. However, this ap-
plication remains understudied compared to tool or workflow recognition. Key barriers include
data scarcity, computational costs, and unresolved ethical concerns. Future work must prioritize
curated datasets, hybrid architectures for efficiency, and rigorous clinical validation to bridge the
gap between technical promise and real-world utility.
Abbreviations
ViT: Vision Transformer; CNN: Convolutional Neural Network; RNN: Recurrent Neural Network;
SWOT: Strengths, Weaknesses, Opportunities, Threats; GI: Gastrointestinal.
Ethical approval
This study analyzed published literature and did not involve human participants, primary data
collection, or patient interactions. As a text-based review, it required no institutional ethics
approval or compliance with declarations such as the Declaration of Helsinki.
Availability of data and materials
Please contact the corresponding author if you would like access to the datasets used and/or
analyzed during this study.
Funding
This research was not funded or supported by any organizations.
Authors’ Contribution
P.H.: Conceptualization, literature review, manuscript drafting. A.A.: Data curation, table/figure
design, ethical analysis. F.R.: Methodology, literature screening, SWOT framework development.
K.GH.: Abstract/keywords, limitations/conclusion drafting. H.GH.: References formatting,
technical validation, final editing.
Acknowledgment
Not applicable.
Consent for publication
The authors provided their consent for the publication of the study results.
Competing interests
The authors declare no competing interests.
References
[1]. Maheux-Lacroix S, Belanger M, Pinard L, Lemyre M, Laberge P, Boutin A. Diagnostic Accuracy of Intraoperative
Tools for Detecting Endometriosis: A Systematic Review and Meta-analysis. Journal of minimally invasive
gynecology. 2020;27(2):433-40.e1. https://doi.org/10.1016/j.jmig.2019.11.010
[2]. Kondo S. Lapformer: surgical tool detection in laparoscopic surgical video using transformer architecture.
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization. 2021;9(3):302-7.
https://doi.org/10.1080/21681163.2020.1835550
[3]. Thakur GK, Thakur A, Kulkarni S, Khan N, Khan S. Deep learning approaches for medical image analysis and
diagnosis. Cureus. 2024;16(5). https://doi.org/10.7759/cureus.59507
[4]. Yadav S, Aparna P, editors. Performance Comparison of Transformers and Convolutional Neural Networks
(CNNs) Based Architecture on Endoscopy Images. 2024 IEEE International Conference on Electronics,
Computing and Communication Technologies (CONECCT); 2024: IEEE.
https://doi.org/10.1109/CONECCT62155.2024.10677209
[5]. Alijani S, Fayyad J, Najjaran H. Vision transformers in domain adaptation and domain generalization: a study of
robustness. Neural Computing and Applications. 2024;36(29):17979-8007. https://doi.org/10.1007/s00521-
024-10353-5
[6]. Gratton SM, Choudhry AJ, Vilos GA, Vilos A, Baier K, Holubeshen S, et al. Diagnosis of Endometriosis at
Laparoscopy: A Validation Study Comparing Surgeon Visualization with Histologic Findings. Journal of
obstetrics and gynaecology Canada : JOGC = Journal d'obstetrique et gynecologie du Canada : JOGC.
2022;44(2):135-41. https://doi.org/10.1016/j.jogc.2021.08.013
[7]. El Moaqet H, Janini R, Abdulbaki Alshirbaji T, Aldeen Jalal N, Möller K, editors. Using Vision Transformers for
Classifying Surgical Tools in Computer Aided Surgeries. Current Directions in Biomedical Engineering; 2024:
De Gruyter. https://doi.org/10.1515/cdbme-2024-2056
[8]. Zhang B, Abbing J, Ghanem A, Fer D, Barker J, Abukhalil R, et al. Towards accurate surgical workflow
recognition with convolutional networks and transformers. Computer Methods in Biomechanics and
Biomedical Engineering: Imaging & Visualization. 2022;10(4):349-56.
https://doi.org/10.1080/21681163.2021.2002191
[9]. Abiyev RH, Altabel MZ, Darwish M, Helwan A. A Multimodal Transformer Model for Recognition of Images from
Complex Laparoscopic Surgical Videos. Diagnostics. 2024;14(7):681.
https://doi.org/10.3390/diagnostics14070681
[10]. Puyt RW, Lie FB, Wilderom CPM. The origins of SWOT analysis. Long Range Planning. 2023;56(3):102304.
https://doi.org/10.1016/j.lrp.2023.102304
[11]. Rony MKK, Akter K, Debnath M, Rahman MM, Johra FT, Akter F, et al. Strengths, weaknesses, opportunities and
threats (SWOT) analysis of artificial intelligence adoption in nursing care. Journal of Medicine, Surgery, and
Public Health. 2024;3:100113. https://doi.org/10.1016/j.glmedi.2024.100113
[12]. Nasirihaghighi S, Ghamsarian N, Husslein H, Schoeffmann K, editors. Event Recognition in Laparoscopic
Gynecology Videos with Hybrid Transformers. International Conference on Multimedia Modeling; 2024:
Springer. https://doi.org/10.1007/978-3-031-56435-2_7
[13]. Cao X, Guan H. Bidirectional transformer with sparse attention for gastrointestinal disease recognition. In: 2023
4th International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE);
2023 Aug 25. p. 357-361. IEEE. https://doi.org/10.1109/ICBAIE59714.2023.10281350
[14]. Arabian H, Abdulbaki Alshirbaji T, Jalal NA, Krueger-Ziolek S, Moeller K. P-CSEM: An Attention Module for
Improved Laparoscopic Surgical Tool Detection. Sensors. 2023;23(16):7257.
https://doi.org/10.3390/s23167257
[15]. Bajraktari F, Pott PP, editors. Multi-view surgical phase recognition during laparoscopic cholecystectomy.
Current Directions in Biomedical Engineering; 2024: De Gruyter. https://doi.org/10.1515/cdbme-2024-2011
[16]. Demir KC, Schieber H, Weise T, Roth D, May M, Maier A, et al. Deep learning in surgical workflow analysis: a
review of phase and step recognition. IEEE Journal of Biomedical and Health Informatics. 2023;27(11):5405-
17. https://doi.org/10.1109/jbhi.2023.3311628
Publisher’s Note
© 2025 The Author(s). Published by InfoPub.
Publisher homepage: https://infopub.ir/
Disclaimer: The views, opinions, and data presented in this article are solely those of the author(s)
and do not necessarily reflect the official policy or position of InfoScience Trends or its editorial team.
InfoScience Trends and its editors disclaim any liability for errors, consequences, or damages arising
from the use of information contained in this publication.