An Integrated Vision-Audio Architecture for Responsive Humanoid Robot Interaction with Dynamic Attention Switching

doi:10.21203/rs.3.rs-8054965/v1

2025 · doi:10.21203/rs.3.rs-8054965/v1

preprint OA: closed

M Tanseer Ali, Mahidul Islam Nihad, Shafayet Ullah, Md Radowan Sikder, and 3 more

Research Article. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8054965/v1
This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Human-Robot Interaction (HRI) is a rapidly growing field that enables socially meaningful communication between humans and robotic systems. Most current robotic platforms rely heavily on either visual or auditory cues, but few achieve seamless integration of both in a dynamic, context-aware manner.
Motivated by the need for more natural, human-like interaction, this paper presents the development of an AI-based humanoid chat robot designed as a stationary robotic face capable of real-time multimodal interaction. The work integrates face and mouth-movement detection using the MediaPipe framework, auditory direction detection using the Fast Fourier Transform (FFT) on multi-microphone input, and rule-based voice interaction using a dynamic CSV dataset. A core switching logic governs attention shifts between vision and audio based on environmental cues, ensuring robust and adaptive interaction. The robot's 3D design features a natural, human-like facial structure, including servo-controlled eyes, jaw, and neck, which offer expressive motion to reinforce engagement. Evaluation across varied single-user and multi-user interaction scenarios demonstrates accurate speaker tracking, reliable audio localization, and smooth servo actuation. The system provides a low-cost, modular platform suitable for HRI research, educational applications, and experimental social robotics.

Keywords: Computing Methodologies; Embedded systems; Human–computer interaction (HCI); Natural Language Interfaces

1. Introduction

In modern robotics, Human-Robot Interaction (HRI) is an essential field of research. Ongoing research focuses on making the responses and behavior of humanoid robots feel more natural to people, thereby improving interaction and making the robots more human-like and socially acceptable [1–3]. Robots that can engage smoothly with people are becoming increasingly valuable across various sectors, including education, healthcare, service industries, and hazardous environments where human presence is risky or impractical [4], [5]. The key challenge in this domain is designing robots that can perceive and respond to the external environment in ways that replicate natural human behavior.
This involves not only responding accurately to visual and auditory stimuli from the surroundings but also expressing the corresponding responses through lifelike movements [6], [7], [20].

This paper presents the development of an AI-based humanoid chat robot, designed as a robotic face capable of engaging in natural, human-like interaction through both visual and auditory modules. The system aims to be low-cost and modular, making it accessible for research, education, and HRI experimentation [2]. The robot features a facial structure modeled on human anatomy, equipped with servo motors that control the eyes, jaw, and neck to simulate realistic facial expressions and gestures. The proposed architecture integrates three core Python modules (vision, audio, and interaction), which operate independently yet communicate via a shared control framework. This paper details the complete system design, the module functionalities, and their integration. It also discusses the implementation process and evaluates performance in real-world scenarios.

2. Related Research Works and Findings

The development of expressive robotic heads and faces for HRI has a rich history. Sophia’s facial realism [3], Furhat’s projection-based head [4], and Kismet’s biomimetic emotional model [5] remain essential milestones. Breazeal’s work on sociable robots emphasized the role of emotion in natural interaction [1]. Affective and non-verbal communication play a central role, with studies showing that robots capable of expressing emotions through facial or bodily cues increase engagement and trust [9], [14]. Research on gaze and eye contact highlights its critical role in joint attention and multiparty interaction [10], [17]. Furthermore, Okuno et al. demonstrated the importance of integrating audio and visual channels for attention systems in robots [18]. Attitudes toward robots also influence HRI outcomes, as demonstrated by Nomura et al. [7]. Additionally, Walters et al. investigated spatial preferences during interaction with robots [11]. Together, these works illustrate the evolution from ultra-realistic, high-cost systems to affordable, socially responsive robots that strike a balance between embodiment and accessibility.

2.1 Sophia (Hanson Robotics):

Sophia is arguably the most publicly recognized humanoid robot head. Its primary contribution lies in its highly realistic facial expressions, achieved through a sophisticated patented elastomer skin and numerous (typically 30+) servo motors [3]. This allows for nuanced emotional displays. However, Sophia's interaction paradigm is often scripted or remotely operated. While it can perform face and voice tracking, its proprietary software architecture, combined with its extremely high cost and lack of modularity, places it out of reach for most research labs. In contrast, our work prioritizes a transparent, low-cost, modular, and open-source architecture, focusing on a fully autonomous, integrated audio-visual attention system rather than on achieving ultra-realism.

2.2 Furhat (Furhat Robotics):

Furhat takes a different approach by back-projecting a facial image onto a 3D-printed mask. This enables great flexibility in character and gender representation without requiring hardware changes [4]. Furhat excels in social signal processing and multi-party interaction, using advanced microphone arrays and software for speaker localization and turn-taking. It is a powerful commercial research platform. However, like Sophia, it is a high-cost system. Furthermore, its projection-based face, while versatile, lacks the physical, embodied presence and mechanical motion that our servo-actuated face provides. Our work demonstrates that core HRI behaviors such as gaze following and speaker tracking can be achieved effectively with a low-cost physical platform.
2.3 Kismet (MIT AI Lab):

As a foundational platform in social robotics, Kismet used a combination of visual, auditory, and proprioceptive cues to drive expressive motors and create emotionally resonant interactions [5]. Kismet's architecture was biomimetic, with subsystems for emotion, motivation, and behavior that competed for expression. While groundbreaking, Kismet's hardware and software are now dated. Our system builds on Kismet's core philosophy of integrating perception and action, but does so with modern, accessible tools (Python, MediaPipe, FFT) and a more lightweight, rule-based switching logic rather than a complex cognitive architecture. This makes our system easier for the broader research community to replicate, modify, and extend.

2.4 Recent Research Platforms:

Recent years have seen a surge in low-cost, 3D-printed platforms. For instance, the "Mario" robot uses a similar 3D-printed face with servos for jaw and neck movement but focuses primarily on voice interaction without integrated visual tracking for attention [6]. Other platforms utilize off-the-shelf smart speakers for audio processing but lack a physical embodiment altogether [7]. A common limitation of many low-cost systems is their reliance on a single modality (e.g., only vision or only audio) or on a static priority system in which one modality always takes precedence [8].

2.5 Novelty of Our Approach:

Our AI-based humanoid chat robot distinguishes itself within this landscape through its tightly integrated, dynamic, and context-aware fusion of low-cost visual and auditory perception. Unlike high-cost platforms (Sophia, Furhat), we provide a fully open and accessible system. Unlike many low-cost platforms, we implement a dynamic switching logic that allows the robot to shift its attention between visual cues (faces) and auditory cues (sound direction) based on environmental context, a feature crucial for natural interaction that is often absent in comparable low-cost systems.
This work fills a niche between high-performance commercial platforms and simplistic single-modality prototypes, offering a robust, multimodal, and replicable platform for HRI research.

3. Methodology

The architecture of the proposed humanoid chat robot emphasizes synchronized multimodal processing. This aligns with theories of joint action, which posit that humans and agents coordinate their bodies and minds [8]. Designing robot motion to be predictable and legible is essential for smooth interactions, as shown in prior work [12], [13]. Our system also synthesizes verbal and non-verbal behavior, inspired by Aly and Tapus’ model of personality-driven multimodal robots [15]. Beyond technical perception-action integration, social cognition models such as those proposed by Lemaignan et al. [16] are highly relevant. Finally, the dataset-driven learning approach aligns with Thomaz and Breazeal’s studies on teachable robots, which show that human guidance improves robot adaptability [19].

3.1 System Architecture

The proposed AI-based humanoid robot system presents a robotic face designed to simulate human-like attention and interaction through integrated visual and audio processing. The architecture consists of three core modules: a vision-based face and speech detection system, an audio direction estimation unit using the Fast Fourier Transform (FFT), and an interaction logic module that generates contextual verbal responses. These modules operate in parallel using Python's multiprocessing module. The resulting commands drive the servos through an Arduino microcontroller interfaced with a PCA9685 driver, enabling synchronized movement of the eyes, neck, and jaw.

3.2 Core Module

The proposed system comprises three interdependent modules that operate in parallel to achieve responsive, human-like interaction. Each module runs independently but communicates with the others via multiprocessing queues, enabling asynchronous decision-making and input prioritization.
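The queue-based arrangement described above can be illustrated with a minimal sketch. It assumes a single shared queue, a hypothetical `(source, payload, timestamp)` message format, and stand-in worker functions (`vision_worker`, `audio_worker`); none of these names come from the paper, and the real modules would publish camera and microphone results instead of fixed values.

```python
import multiprocessing as mp
import time
from typing import Dict, Iterable, Tuple

# Assumed message format: (source module, payload, timestamp).
Message = Tuple[str, dict, float]


def vision_worker(out_q, n_frames: int = 3) -> None:
    """Stand-in for the vision process: reports one result per frame."""
    for i in range(n_frames):
        out_q.put(("vision", {"face_x": 0.1 * i, "speaking": i % 2 == 1}, time.time()))
    out_q.put(("vision", {"done": True}, time.time()))  # end-of-stream marker


def audio_worker(out_q, n_windows: int = 3) -> None:
    """Stand-in for the FFT process: reports one coarse direction per window."""
    for direction in ("left", "center", "right")[:n_windows]:
        out_q.put(("audio", {"direction": direction}, time.time()))
    out_q.put(("audio", {"done": True}, time.time()))  # end-of-stream marker


def latest_by_source(messages: Iterable[Message]) -> Dict[str, dict]:
    """Keep only the newest payload received from each module."""
    newest: Dict[str, Message] = {}
    for msg in messages:
        source, _, stamp = msg
        if source not in newest or stamp >= newest[source][2]:
            newest[source] = msg
    return {source: msg[1] for source, msg in newest.items()}


def run_demo() -> Dict[str, dict]:
    """Start both workers, drain the shared queue, then join the processes."""
    q = mp.Queue()
    workers = [mp.Process(target=vision_worker, args=(q,)),
               mp.Process(target=audio_worker, args=(q,))]
    for w in workers:
        w.start()
    received, finished = [], 0
    while finished < len(workers):
        msg = q.get(timeout=5)
        if msg[1].get("done"):
            finished += 1
        else:
            received.append(msg)
    for w in workers:
        w.join()
    return latest_by_source(received)


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the controller drains the queue before joining the workers, and only the most recent message per module is kept, so a slow consumer acts on current rather than stale observations. Whether the original system uses one queue or one queue per module is not stated.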
3.2.1 Vision Module

The Vision module uses the MediaPipe [9] Face Mesh framework to detect and track multiple faces in real time. For each frame, it extracts facial landmarks, assigns unique identifiers to individuals, and calculates a Mouth Aspect Ratio (MAR) to determine speaking status based on lip landmark displacement. Horizontal face position is also recorded to support servo-based directional tracking for the eyes. In multi-person contexts, the module selects the primary subject based on the highest MAR and closest proximity to the center. A fallback strategy ensures equitable visibility by alternating focus between active speakers every 5 seconds. If no speech is detected, the system enters an idle scanning state. This module serves as the primary input channel whenever at least one face is in the camera's field of view.

3.2.2 FFT Module

The FFT module estimates the direction of the dominant sound source using a three-microphone array whose microphones cover the left, center, and right audio fields. Each microphone captures incoming audio signals, which are processed using the Fast Fourier Transform (FFT) to extract frequency-domain representations. The system computes the energy distribution and amplitude differences across the three microphone channels, allowing it to localize the most prominent sound source based on comparative intensity. Direction estimation is achieved by identifying the microphone channel with the highest energy within the frequency range relevant to speech. This directional index (left, center, or right) is sent to the central control process for further decision-making. The module operates independently, in parallel with the other subsystems.

3.4 Interaction Module

The interaction module handles voice-based communication between the system and users. It operates in three stages: speech recognition, response matching, and speech synthesis.
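The middle stage, response matching, can be sketched as a fuzzy lookup over a question/answer table. This is a minimal illustration under stated assumptions: the column names `question` and `answer`, the sample rows, the 0.6 similarity cutoff, and the use of Python's standard `difflib` for approximate matching are all hypothetical choices, not details taken from the paper.

```python
import csv
import difflib
import io
from typing import Dict, Optional

# Hypothetical question/answer table; the real dataset's columns are not given.
SAMPLE_CSV = """question,answer
what is your name,I am a chat robot.
how are you,I am doing well.
what can you do,I can look at you and answer simple questions.
"""


def load_table(csv_text: str) -> Dict[str, str]:
    """Parse CSV text into a lowercase question -> answer mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["question"].strip().lower(): row["answer"].strip() for row in reader}


def match_response(transcript: str, table: Dict[str, str],
                   cutoff: float = 0.6) -> Optional[str]:
    """Return the answer for an exact or approximate question match, else None."""
    query = transcript.strip().lower()
    if query in table:
        return table[query]
    # Assumed fuzzy-matching rule: closest stored question above the cutoff.
    close = difflib.get_close_matches(query, list(table), n=1, cutoff=cutoff)
    return table[close[0]] if close else None


table = load_table(SAMPLE_CSV)
print(match_response("What is your name", table))
print(match_response("whats your name", table))
print(match_response("zzzz qqqq", table))
```

Returning `None` for an unmatched transcript leaves the choice of fallback behavior (silence, a default phrase, or a request to repeat) to the caller.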
Audio input is captured from a primary microphone and transcribed using the VOSK offline speech recognition engine. The transcribed text is compared against a predefined dataset of questions and responses stored in CSV format in the central control system. If a match or approximate match is found, the corresponding reply is selected. This text is then passed to the gTTS (Google Text-to-Speech) engine, which generates the speech audio that is played back to the user. The entire pipeline is designed to work offline, ensuring robust operation without requiring internet connectivity.

3.5 Multimodal Switching Module

The AI-based humanoid robot system is designed to operate in a context-aware manner, dynamically switching between visual and audio modes. This adaptive control ensures efficient focus management and supports natural, human-like interactions [10]. The switching logic determines whether the system should prioritize visual cues or rely on audio input, based on environmental context and user presence. The overall operational logic of the system is summarized in Fig. 1.

4. Hardware Implementation

The hardware subsystem is responsible for translating the software's perceptual and decision-making commands into physical, expressive movements. The design prioritizes low cost, modularity, and replicability, using widely available components and 3D printing technology. The core of the actuation system consists of five servo motors, controlled by a microcontroller, enabling precise control of the robot's eyes, jaw, and neck.

The speech recognition component uses Whisper-Tiny, a compact variant of the Whisper model developed by OpenAI [5], known for its robustness to noise and multilingual capabilities. Audio input captured via a microphone is first passed through a spectral gating filter to suppress background noise, thereby enhancing transcription accuracy with minimal computational overhead.
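The decision-making commands that this hardware executes originate in the switching rule of Section 3.5. A minimal sketch of that rule is given below, assuming that vision takes priority whenever a face is visible, that audio direction is used only when no face is in view, and that ties between equally active speakers are broken by distance from the image center. The `Face` fields, the normalized `x_offset` scale, and the returned dictionary layout are hypothetical, and the 5-second alternation between speakers is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Face:
    face_id: int
    x_offset: float  # assumed scale: -1.0 (far left) to 1.0 (far right)
    mar: float       # mouth aspect ratio
    speaking: bool


def pick_primary(faces: List[Face]) -> Optional[Face]:
    """Choose the speaking face with the highest MAR; prefer the more central one on ties."""
    speakers = [f for f in faces if f.speaking]
    if not speakers:
        return None
    return max(speakers, key=lambda f: (f.mar, -abs(f.x_offset)))


def decide_attention(faces: List[Face], sound_direction: Optional[str]) -> dict:
    """Vision is primary when any face is visible; otherwise fall back to audio."""
    if faces:
        target = pick_primary(faces)
        if target is None:
            return {"mode": "idle_scan"}  # faces present, nobody speaking
        return {"mode": "vision", "face_id": target.face_id, "x_offset": target.x_offset}
    if sound_direction in ("left", "center", "right"):
        return {"mode": "audio", "direction": sound_direction}
    return {"mode": "idle_scan"}


faces = [Face(1, -0.5, 0.30, True), Face(2, 0.1, 0.45, True), Face(3, 0.0, 0.90, False)]
print(decide_attention(faces, "left"))
print(decide_attention([], "right"))
print(decide_attention([], None))
```

Keeping the rule as a pure function of the latest vision and audio observations makes it straightforward to test without a camera, microphones, or servos attached.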
4.1 Mechanical Design and Fabrication

The robotic face is constructed on a lightweight, rigid frame designed using CAD software and fabricated via Fused Deposition Modeling (FDM) 3D printing with Polylactic Acid (PLA) filament. The face structure was sourced from an open-source repository and modified to include custom mounting points for the servo motors and mechanical linkages. This biomimetic design approximates the key articulatory features of a human face, providing a foundation for expressive non-verbal communication. The final assembly, shown in Fig. 2, presents an abstract yet recognizable humanoid visage.

4.2 Actuation System

The primary actuators are five standard hobbyist servo motors (e.g., SG90 or MG90S analogues), chosen for their low cost, adequate torque for this application, and ubiquity. Each servo provides positional control over a specific degree of freedom (DoF), as detailed in Table 1.

Table 1 Servo Channel Mapping

PCA9685 Channel | Assigned Component | Movement Type | DoF Description
0 | Left Eye | Horizontal | Controls the panning motion of the left eyeball.
1 | Right Eye | Horizontal | Controls the panning motion of the right eyeball.
2 | Left Jaw | Vertical | Controls the vertical displacement of the left side of the jaw.
3 | Right Jaw | Vertical | Controls the vertical displacement of the right side of the jaw.
4 | Neck | Horizontal | Controls the panning (left-right rotation) of the entire head.

4.4 Jaw Actuation (Channels 2 & 3):

Two servos actuate the jaw along a single pivot axis, ensuring balanced, stable movement. They are activated synchronously only during speech synthesis playback. The jaw's range of motion is calibrated to open proportionally to the amplitude of the audio signal or a pre-defined sequence, effectively simulating syllabic movement during speech (Fig. 4).

4.5 Neck Actuation (Channel 4):

The neck servo provides the most extensive range of motion, enabling the entire head to rotate horizontally.
This is primarily used for gross orientation towards a new speaker, identified either by the Vision Module (turning towards a face) or by the FFT Module (turning towards a sound source). Its movement is damped and speed-controlled to avoid sudden, jerky motions.

4.6 Control Electronics

The control architecture is designed for scalability and low-latency communication between the high-level software and the low-level hardware.

Microcontroller: An Arduino Uno board serves as a simple, robust intermediary between the Python-based control system and the actuators.

Servo Driver: The Arduino is connected to a PCA9685 16-channel, 12-bit PWM driver via the I²C bus. This dedicated driver manages the precise timing pulses required for multiple servos simultaneously without overloading the Arduino's CPU or its limited number of PWM pins.

Communication Protocol: The central software system, running on a host computer (e.g., a laptop or Raspberry Pi), communicates with the Arduino via a serial UART connection at a standard baud rate of 9600 bps. The software sends high-level commands (e.g., EYE, 45 or JAW, 10), which the Arduino parses and executes by setting the appropriate PWM signals on the PCA9685.

5. Result Analysis

Our evaluation showed accurate audio localization and lifelike servo actuation. These results are consistent with earlier findings that movement design impacts how humans perceive robot behavior [13]. Moreover, tactile and spatial aspects of interaction, such as physiological arousal during contact [4] and proxemics in interaction zones [11], suggest that our system could be extended to richer behavioral experiments.

5.1 Vision Module Performance

Figures 5 and 6 show the vision module's output under different scenarios. In Fig. 5, the system correctly identifies a single person and detects mouth movement to determine the speaking state. When multiple individuals are present, as shown in Fig. 6, the module identifies each of them and detects which of them are speaking. Although all faces are tracked simultaneously, no servo output is generated at this stage; only identification and detection data are pushed to the interaction controller. The system exhibited stable face tracking, with accurate bounding-box alignment and mouth-movement classification in well-lit indoor environments.

5.2 Audio Localization via FFT

Figure 7 illustrates the FFT spectrum produced by the audio module. The plot shows clear frequency peaks corresponding to the active speaker's position relative to the microphone array. During the tests, as shown in Fig. 7, the FFT module identified the dominant speech direction when no face was visible to the vision module. It consistently classified the input into left, center, or right zones. The module functioned well in quiet and moderately noisy conditions but showed reduced accuracy in environments with heavier background noise.

5.3 Servo Response under Module Control

Servo motor movements were logged and plotted separately under each module's interaction triggers. As the left and right eye servos mirror each other’s motion with identical degrees of freedom, only Channel 0 (left eye) was analyzed. As shown in Fig. 8, eye movement responds directly to the horizontal face position detected by the vision module. The plot confirms that the eye servos follow the face smoothly rather than jumping from one angle to another, indicating synchronized gaze direction.

Jaw Servos (Channels 2–3)

Fig. 9 shows the behavior of the jaw servos during active interaction. Servo pulses were activated only during response playback, confirming proper integration with the speech module. Motion was brief and repetitive, simulating mouth motion effectively without overextension or jitter.

Neck Servo (Channel 4)

Fig. 10 illustrates the behavior of the neck servo when triggered by the FFT module.
The servo rotates smoothly toward the estimated angle of the sound source. Transitions were visibly consistent with the audio input classification (left, center, right).

6. Limitations and Future Work

While this paper presented a technical evaluation, future work will include a user study using validated metrics [11] to quantitatively assess the perceived naturalness and likeability of the interaction. In close-range scenarios, silent facial gestures were occasionally misclassified as speech activity. This led to incorrect speaker designation, causing the system to shift attention unnecessarily. FFT-based localization was highly susceptible to environmental noise, resulting in occasional misdirection in cluttered auditory environments. The interaction module relies on a pre-structured dataset of response templates. As a result, several responses were contextually inaccurate, especially when user input deviated from expected phrases or was distorted by background noise.

7. Conclusion

This paper presents the development of an AI-based humanoid chat robot, a stationary robot capable of engaging in human-like interactions. The system integrates real-time face and mouth tracking using the MediaPipe framework, FFT-based audio direction detection, and a CSV-driven voice interaction mechanism. The modular architecture enables the vision, audio, and interaction modules to operate independently while communicating via shared control logic. The proposed multimodal switching strategy lets the robot shift attention between modalities in a natural-looking way. Outcomes demonstrate stable face tracking, accurate audio localization, and smooth servo action across multiple scenarios, including single-person and multiple-person interactions. This work offers a low-cost, scalable platform suitable for HRI research, educational applications, and interactive robotics experimentation.
Future improvements for this project may include emotion recognition, dynamic speech generation, and enhanced person-tracking and machine learning to make human-robot interaction more natural.

Declarations

Data Availability Statement: Data sets generated during the current study are available from the corresponding author on reasonable request.

Ethics Approval: This is an engineering experimental study. The AIUB Research Ethics Committee has confirmed that no ethical approval is required.

Funding declarations: Self-funded.

Ethics, Consent to Participate, and Consent to Publish declarations: Not applicable.

Competing Interest declaration: No competing interests.

Author contributions: M Tanseer Ali: Conceptualization, Methodology, Writing - Original; Mahidul Islam Nihad: Team Lead, Formal analysis and Software; Shafayet Ullah: Validation and Data Curation; Md Radowan Sikder: 3D design and mechanical integration; Navid Rahman Nadvi: Control mechanism and electronics integration; Ishmam Newaz: Software implementation and ML integration; Abir Ahmed: Funding acquisition and Coordination.

References

Admoni, H., & Scassellati, B. (2017). Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction, 6(1), 25-63.
Paetzel, M., et al. (2020). Towards a robust, affordable and open-source humanoid robot platform. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (HRI).
D. Hanson, et al., "Upending the Uncanny Valley," in Proc. AAAI Conf. Artif. Intell., 2017.
S. Al Moubayed, et al., "Furhat: A back-projected human-like robot head for multiparty human-machine interaction," in Cognitive Behavioural Systems. Springer, 2012.
C. Breazeal, "Designing Sociable Robots," MIT Press, 2002.
Grondin, F., et al. (2020). Recurrent neural networks for broadband source localization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Huang, S., Zhao, R., Bjorling, L., Lin, Z., Zhong, H., Yuan, Y., Zhu, H., … & Abbeel, P. (2025). Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot. arXiv preprint arXiv:2504.17249. https://arxiv.org/abs/2504.17249
Haresamudram, K., Gill, S., Ryoo, J., & Short, E. (2024). The impact of body and voice on perception of agents. Frontiers in Robotics and AI, 11, 1456613. https://doi.org/10.3389/frobt.2024.1456613
Lugaresi, C., et al. (2019). MediaPipe: A Framework for Building Perception Pipelines. arXiv preprint arXiv:1906.08172.
Garcia, F., et al. (2020). A multimodal architecture for human-robot communication. IEEE Transactions on Cognitive and Developmental Systems, 12(3), 541-554.
Bartneck, C., et al. (2020). Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International Journal of Social Robotics.
C. Breazeal, “Emotion and sociable humanoid robots,” Int. J. Hum.-Comput. Stud., vol. 59, no. 1–2, pp. 119–155, 2003.
T. Fong, I. Nourbakhsh, and K. Dautenhahn, “A survey of socially interactive robots,” Robot. Auton. Syst., vol. 42, no. 3–4, pp. 143–166, 2003.
K. Dautenhahn, “Socially intelligent robots: Dimensions of human–robot interaction,” Philos. Trans. R. Soc. B Biol. Sci., vol. 362, no. 1480, pp. 679–704, 2007.
J. Li, W. Ju, and B. Reeves, “Touching a mechanical body: Tactile contact with a humanoid robot is physiologically arousing,” J. Hum.-Robot Interact., vol. 6, no. 3, pp. 118–130, 2017.
N. Mavridis, “A review of verbal and non-verbal human–robot interactive communication,” Robot. Auton. Syst., vol. 63, pp. 22–35, 2015.
B. Scassellati, “Foundations for a theory of mind for a humanoid robot,” Artif. Intell., vol. 107, no. 1, pp. 85–121, 2001.
T. Nomura, T. Kanda, and T. Suzuki, “Experimental investigation into influence of negative attitudes toward robots on human–robot interaction,” AI Soc., vol. 20, no. 2, pp. 138–150, 2006.
N. Sebanz, H. Bekkering, and G. Knoblich, “Joint action: Bodies and minds moving together,” Trends Cogn. Sci., vol. 10, no. 2, pp. 70–76, 2006.
G. Castellano, S. Paiva, and A. Pereira, “Affective expression in robots for HRI,” in Proc. IEEE Int. Symp. Robot Hum. Interact. Commun. (RO-MAN), 2010, pp. 473–480.
K. Kompatsiari, F. Ciardo, V. Tikhanoff, G. Metta, and A. Wykowska, “On the role of eye contact in gaze cueing,” Sci. Rep., vol. 9, no. 1, pp. 1–10, 2019.
M. L. Walters, K. Dautenhahn, R. te Boekhorst, K. L. Koay, C. Kaouri, S. Woods, C. L. Nehaniv, D. Lee, and I. Werry, “The influence of subjects’ personality traits on personal spatial zones in a human–robot interaction experiment,” in Proc. IEEE Int. Workshop Robot Hum. Interact. Commun. (RO-MAN), 2008, pp. 347–352.
A. D. Dragan, K. C. Lee, and S. S. Srinivasa, “Legibility and predictability of robot motion,” in Proc. 8th ACM/IEEE Int. Conf. Hum.-Robot Interact. (HRI), 2013, pp. 301–308.
G. Hoffman and W. Ju, “Designing robots with movement in mind,” J. Hum.-Robot Interact., vol. 3, no. 1, pp. 89–122, 2014.
T. Kanda and H. Ishiguro, Human–Robot Interaction in Social Robotics. CRC Press, 2013.
A. Aly and A. Tapus, “A model for synthesizing a combined verbal and nonverbal behavior based on personality traits in human–robot interaction,” in Proc. Int. Conf. Adv. Robot. (ICAR), 2013, pp. 1–6.
S. Lemaignan, R. Ros, E. A. Sisbot, R. Alami, and M. Beetz, “Artificial cognition for social human–robot interaction: An implementation,” Artif. Intell., vol. 247, pp. 45–69, 2017.
B. Mutlu, J. Forlizzi, and J. Hodgins, “A storytelling robot: Modeling and evaluation of human-like gaze behavior,” in Proc. 6th IEEE-RAS Int. Conf. Humanoid Robots, 2006, pp. 518–523.
H. G. Okuno, K. Nakadai, and H. Kitano, “Realizing audio-visual integration in humanoid robots,” in Proc. Nat. Conf. Artif. Intell. (AAAI), 2002, pp. 1248–1253.
A. L. Thomaz and C. Breazeal, “Teachable robots: Understanding human teaching behavior to build more effective robot learners,” Artif. Intell., vol. 172, no. 6–7, pp. 716–737, 2008.
D. Vernon, G. Metta, and G. Sandini, “A survey of artificial cognitive systems: Implications for the autonomous development of mental capabilities in computational agents,” IEEE Trans. Evol. Comput., vol. 11, no. 2, pp. 151–180, 2007.

Additional Declarations: No competing interests reported.
Author affiliation (all seven authors): American International University – Bangladesh. Corresponding author: M Tanseer Ali.
14:07:56","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":607467,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/a0cf7c41585336b84a510868.png"},{"id":97139351,"identity":"33e3662a-627a-4ae2-a4ed-cb631b7e29bf","added_by":"auto","created_at":"2025-12-01 10:00:08","extension":"jpeg","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":46368,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/50d3d85832c05aa47b93d53e.jpeg"},{"id":97140267,"identity":"d5b81098-881e-4a5f-81d3-0dab496ed854","added_by":"auto","created_at":"2025-12-01 10:04:24","extension":"jpeg","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":46673,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/92b90d86e5b872f832842eb3.jpeg"},{"id":97003372,"identity":"942c7f7d-8eef-40b5-9e23-34ebb95553cf","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":468176,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/518585c3624fdff63557f0c2.png"},{"id":97003363,"identity":"1d1013de-22a0-49bb-8389-6480fab0750a","added_by":"auto","created_at":"2025-11-28 
14:07:56","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":534772,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/725a3d5a12c35334ec875867.png"},{"id":97139317,"identity":"9ffcd3b8-5aca-475d-aa19-15a9d805edc3","added_by":"auto","created_at":"2025-12-01 10:00:00","extension":"jpeg","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":29143,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage7.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/0353bbb9bc8de08aee73a82b.jpeg"},{"id":97003352,"identity":"ad3b2f3e-0010-427e-83a5-61eb4f0265e2","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"jpeg","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":57082,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage8.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/13f437eac5efb53ac4c96212.jpeg"},{"id":97139786,"identity":"cc1336c8-55cb-4983-875c-4062cfc3cf8b","added_by":"auto","created_at":"2025-12-01 10:02:31","extension":"jpeg","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":134616,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage9.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/3dc1afcecd7f8781ea4922c3.jpeg"},{"id":97139363,"identity":"bdf9bf88-de7f-41bd-8e86-67f76300c25b","added_by":"auto","created_at":"2025-12-01 
10:00:09","extension":"png","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":13606,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/54da3a99d43c74696f20c15f.png"},{"id":97003371,"identity":"eea289a9-2cd1-43ef-9a24-041adf304f3a","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":21740,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/adda16626c0a5ad17bdc75d0.png"},{"id":97003369,"identity":"6e6e8c24-91f1-4d7f-8a1d-d747b5e1dde8","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":54701,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/8544387efa5fa35a1f2a0030.png"},{"id":97003373,"identity":"3422465e-a736-4ada-87f9-5c627ac8c731","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":16,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":53687,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/8f67a1ed0f59f7c662fb701f.png"},{"id":97003361,"identity":"3ea9b72d-26db-4bd5-9fac-f70a1b999cb2","added_by":"auto","created_at":"2025-11-28 
14:07:56","extension":"png","order_by":17,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":52745,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/b13b755a955b5cc4a6717a9c.png"},{"id":97138079,"identity":"efcf565a-16d0-45ba-b751-11533ab8a68e","added_by":"auto","created_at":"2025-12-01 09:58:28","extension":"png","order_by":18,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":47176,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/d56140f59eb308b9f74bfb1c.png"},{"id":97139015,"identity":"497ef76c-47a3-4f30-8afe-e1cc74bcdda6","added_by":"auto","created_at":"2025-12-01 09:59:32","extension":"png","order_by":19,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":54826,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/594e6ac6691b88678c643778.png"},{"id":97003370,"identity":"908dcd62-3950-4ea8-96bc-4437a76b7326","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":20,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":19135,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/13c082ff2ee625e5e8432f22.png"},{"id":97003362,"identity":"1da19009-789f-4795-8239-24c0580e6e74","added_by":"auto","created_at":"2025-11-28 
14:07:56","extension":"png","order_by":21,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":21914,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/4265a5a01d2e9c3327b633c7.png"},{"id":97138315,"identity":"a7c02b04-9d92-40df-96b0-2ba9181a40fe","added_by":"auto","created_at":"2025-12-01 09:58:45","extension":"png","order_by":22,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":23547,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/cf90e1dd98fefe0979731a61.png"},{"id":97003367,"identity":"c80075bf-c10f-4f02-9b00-142b87201c68","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"xml","order_by":23,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":67264,"visible":true,"origin":"","legend":"","description":"","filename":"9ea00fcf0e0f45c789ff91ad8fce0b061structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/fadab7e6ba90e95777733d14.xml"},{"id":97003374,"identity":"cd87dba8-8da9-4c2a-b7f4-9ab2c13e4937","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"html","order_by":24,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":79345,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/2c5935b93dafd6faa1efe420.html"},{"id":97003342,"identity":"f072f67b-769b-46e6-9c0b-c424cb992932","added_by":"auto","created_at":"2025-11-28 14:07:55","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":41293,"visible":true,"origin":"","legend":"\u003cp\u003eOperational Flowchart of AI-Based Humanoid Chat 
Robot.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/46d9044289b4b860d28bb709.png"},{"id":97003343,"identity":"02354828-2dec-492d-ba51-cf37b8905e7c","added_by":"auto","created_at":"2025-11-28 14:07:55","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":607467,"visible":true,"origin":"","legend":"\u003cp\u003eAI-Based Humanoid Chat Robot (Front View).\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/57f8a3eb8b0e7513b5e8b48a.png"},{"id":97003344,"identity":"68c496e5-cb8d-4f89-8345-f6af92480875","added_by":"auto","created_at":"2025-11-28 14:07:55","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":84145,"visible":true,"origin":"","legend":"\u003cp\u003ePlacement of Servo Motor for Eye Movements.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/55f948b310545da16bdaedce.png"},{"id":97003349,"identity":"280f123b-0c94-468f-9e3e-ed71f13ae6d5","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":82275,"visible":true,"origin":"","legend":"\u003cp\u003ePlacement of Jaw servos and Neck Servo and its Rotational Mechanism.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/4c31bfff73a3a48b9c823f8f.png"},{"id":97003350,"identity":"a4167434-d057-42ba-9775-ae21d3a79f8c","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":468176,"visible":true,"origin":"","legend":"\u003cp\u003eVision Module Detecting Speaking Person (One 
Person).\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/d2a9b862ef0859fd6b1b385a.png"},{"id":97003356,"identity":"f0d633ad-cef4-4c5d-ae37-cef76edb5a2a","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":534772,"visible":true,"origin":"","legend":"\u003cp\u003eVision Module Detecting Speaking Person (Multiple Person).\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/0b89444f5c8dd61fe44b210e.png"},{"id":97138510,"identity":"2cc4f765-20ab-4598-b6b1-22bf0585a208","added_by":"auto","created_at":"2025-12-01 09:58:58","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":246414,"visible":true,"origin":"","legend":"\u003cp\u003eFFT Module Spectrum.\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/6f4cbdfc5f59ba4765280ff8.png"},{"id":97003358,"identity":"696600b0-54b3-4d7b-9abc-967db1baea9b","added_by":"auto","created_at":"2025-11-28 14:07:56","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":188283,"visible":true,"origin":"","legend":"\u003cp\u003eEye Servo Movement.\u003c/p\u003e","description":"","filename":"8.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/9ff50ecadfa2e5fa1cb90f8c.png"},{"id":97138672,"identity":"6b629483-9818-4348-9e57-9aa8b580c4a3","added_by":"auto","created_at":"2025-12-01 09:59:10","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":297419,"visible":true,"origin":"","legend":"\u003cp\u003eJaw Servo 
Movements.\u003c/p\u003e","description":"","filename":"9.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/1e712eb4eb6ca0a5850d08f1.png"},{"id":97138235,"identity":"c07036dd-6721-4385-8b5c-c43df1e4cd62","added_by":"auto","created_at":"2025-12-01 09:58:40","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":151288,"visible":true,"origin":"","legend":"\u003cp\u003eNeck Servo Movements.\u003c/p\u003e","description":"","filename":"10.png","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/cb7cd54761e0b358cf1844a9.png"},{"id":99310952,"identity":"940f04e5-a421-4f1b-ae82-a0bbd727ac51","added_by":"auto","created_at":"2025-12-31 16:13:36","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3722918,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8054965/v1/ea81faa1-a4cb-41cb-a2b5-ca913254b124.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"An Integrated Vision-Audio Architecture for Responsive Humanoid Robot Interaction with Dynamic Attention Switching","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eIn modern robotics, Human-Robot Interaction (HRI) is an essential field of research. The ongoing research focuses on how to make the responses and behavior of humanoid robots more natural to people, thereby enhancing their interactions and making them more human-like and socially acceptable [1\u0026ndash;3]. 
Robots that can engage smoothly with people are becoming increasingly valuable across various sectors, including education, healthcare, service industries, and hazardous environments where human presence is risky or impractical [4], [5].\u003c/p\u003e\u003cp\u003eThe key challenge in this domain is designing robots that can perceive and respond to the external environment in ways that replicate natural human behavior. This involves not only responding accurately to visual and auditory stimuli from the surroundings but also expressing the corresponding responses through lifelike movements [6], [7], [20].\u003c/p\u003e\u003cp\u003eThis paper presents the development of an AI-based humanoid chat robot, designed as a robotic face capable of engaging in natural, human-like interaction through both visual and auditory modules. The system aims to be low-cost and modular, making it accessible for research, education, and HRI experimentation [2]. The robot features a facial structure modeled on human anatomy, equipped with servo motors that control the eyes, jaw, and neck to simulate realistic facial expressions and gestures. The proposed architecture integrates three core Python modules: vision, audio, and interaction, which operate independently yet communicate via a shared control framework. This paper details the complete system design, module functionalities, and their integration. It also discusses the implementation process and evaluates performance in real-world scenarios.\u003c/p\u003e"},{"header":"2. Related Research Works and Findings","content":"\u003cp\u003eThe development of expressive robotic heads and faces for HRI has a rich history. Sophia\u0026rsquo;s facial realism [3], Furhat\u0026rsquo;s projection-based head [4], and Kismet\u0026rsquo;s biomimetic emotional model [5] remain essential milestones. 
Breazeal\u0026rsquo;s work on sociable robots emphasized the role of emotion in natural interaction [1].\u003c/p\u003e\u003cp\u003eAffective and non-verbal communication play a central role, with studies showing that robots capable of expressing emotions through facial or bodily cues increase engagement and trust [9], [14]. Research on gaze and eye contact highlights its critical role in joint attention and multiparty interaction [10], [17]. Furthermore, Okuno et al. demonstrated the importance of integrating audio and visual channels for attention systems in robots [18].\u003c/p\u003e\u003cp\u003eAttitudes toward robots also influence HRI outcomes, as demonstrated by Nomura et al. [7]. Additionally, Walters et al. investigated spatial preferences during interaction with robots [11]. Together, these works illustrate the evolution from ultra-realistic, high-cost systems to affordable, socially responsive robots that strike a balance between embodiment and accessibility.\u003c/p\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1 Sophia (Hanson Robotics):\u003c/h2\u003e\u003cp\u003eSophia is arguably the most publicly recognized humanoid robot head. Its primary contribution lies in its highly realistic facial expressions, achieved through sophisticated patented elastomer skin and numerous (typically 30+) servo motors [3]. This allows for nuanced emotional displays. However, Sophia's interaction paradigm is often scripted or remotely operated. While it can perform face and voice tracking, its proprietary software architecture, combined with its extremely high cost and lack of modularity, places it out of reach for most research labs. 
In contrast, our work prioritizes a transparent, low-cost, modular, and open-source architecture, focusing on a fully autonomous, integrated audio-visual attention system rather than on achieving ultra-realism.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2 Furhat (Furhat Robotics):\u003c/h2\u003e\u003cp\u003eFurhat takes a different approach by back-projecting an animated face onto a 3D-printed mask. This enables considerable flexibility in character and gender representation without requiring hardware changes [4]. Furhat excels in social signal processing and multi-party interaction, using advanced microphone arrays and software for speaker localization and turn-taking. It is a powerful commercial research platform. However, like Sophia, it is a high-cost system. Furthermore, its projection-based face, while versatile, lacks the physical, embodied presence and mechanical motion that our servo-actuated face provides. Our work demonstrates that core HRI behaviors such as gaze following and speaker tracking can be achieved effectively with a low-cost physical platform.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.3 Kismet (MIT AI Lab):\u003c/h2\u003e\u003cp\u003eAs a foundational platform in social robotics, Kismet used a combination of visual, auditory, and proprioceptive cues to drive expressive motors and create emotionally resonant interactions [5]. Kismet's architecture was biomimetic, with subsystems for emotion, motivation, and behavior that competed for expression. While groundbreaking, Kismet's hardware and software are now dated. Our system builds on Kismet's core philosophy of integrating perception and action, but does so with modern, accessible tools (Python, MediaPipe, FFT) and a more lightweight, rule-based switching logic rather than a complex cognitive architecture. 
This makes our system easier for the broader research community to replicate, modify, and extend.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e2.4 Recent Research Platforms:\u003c/h2\u003e\u003cp\u003eRecent years have seen a surge in low-cost, 3D-printed platforms. For instance, the \u0026ldquo;Mario\u0026rdquo; robot uses a similar 3D-printed face with servos for jaw and neck movement but focuses primarily on voice interaction without integrated visual tracking for attention [6]. Other platforms utilize off-the-shelf smart speakers for audio processing but lack a physical embodiment altogether [7]. A common limitation in many low-cost systems is their reliance on a single modality (e.g., only vision or audio) or on a static priority system in which one modality always takes precedence [8].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e2.5 Novelty of Our Approach:\u003c/h2\u003e\u003cp\u003eOur AI-based humanoid chat robot distinguishes itself within this landscape through its tightly integrated, dynamic, and context-aware fusion of low-cost visual and auditory perception. Unlike high-cost platforms (Sophia, Furhat), we provide a fully open and accessible system. Unlike many low-cost platforms, we implement a dynamic switching logic that allows the robot to seamlessly shift its attention between visual cues (faces) and auditory cues (sound direction) based on environmental context, a feature crucial for natural interaction that is often absent in comparable low-cost systems. This work fills a niche between high-performance commercial platforms and simplistic single-modality prototypes, offering a robust, multimodal, and replicable platform for HRI research.\u003c/p\u003e\u003c/div\u003e"},{"header":"3. Methodology","content":"\u003cp\u003eThe architecture of the proposed humanoid chat robot emphasizes synchronized multimodal processing. 
This aligns with theories of joint action, which posit that humans and agents coordinate their bodies and minds [8]. Designing robot motion to be predictable and legible is essential for smooth interactions, as shown in prior work [12], [13].\u003c/p\u003e\u003cp\u003eOur system also synthesizes verbal and non-verbal behavior, inspired by Aly and Tapus\u0026rsquo; model of personality-driven multimodal robots [15]. Beyond technical perception-action integration, social cognition models such as those proposed by Lemaignan et al. [16] are highly relevant. Finally, the dataset-driven learning approach aligns with Thomaz and Breazeal\u0026rsquo;s studies on teachable robots, which show that human guidance improves robot adaptability [19].\u003c/p\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.1 System Architecture\u003c/h2\u003e\u003cp\u003eThe proposed AI-based humanoid robot system presents a robotic face designed to simulate human-like attention and interaction through integrated visual and audio processing. The architecture consists of three core modules: a vision-based face and speech detection system, an audio direction estimation unit using the Fast Fourier Transform (FFT), and an interaction logic module that generates contextual verbal responses. These modules operate in parallel using Python's multiprocessing module. The resulting commands drive servo actuations through an Arduino microcontroller interfaced with a PCA9685 driver, enabling synchronized movement of the eyes, neck, and jaw.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.2 Core Module\u003c/h2\u003e\u003cp\u003eThe proposed system comprises three interdependent modules that operate in parallel to achieve responsive, human-like interaction. 
Each module operates independently but communicates with the others via multiprocessing queues, enabling asynchronous decision-making and input prioritization.\u003c/p\u003e\u003cdiv id=\"Sec11\" class=\"Section3\"\u003e\u003ch2\u003e3.2.1 Vision Module\u003c/h2\u003e\u003cp\u003eThe Vision module uses the MediaPipe [9] Face Mesh framework to detect and track multiple faces in real time. For each frame, it extracts facial landmarks, assigns unique identifiers to individuals, and calculates a Mouth Aspect Ratio (MAR) to determine speaking status based on lip landmark displacement. Horizontal face position is also recorded to support servo-based directional tracking for the eyes. In multi-person contexts, the module selects the primary subject based on the highest MAR and closest proximity to the center. A fallback strategy ensures equitable visibility by alternating focus between active speakers every 5 seconds. If no speech is detected, the system enters an idle scanning state. This module serves as the primary input channel whenever at least one face is in the camera's field of view.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section3\"\u003e\u003ch2\u003e3.2.2 FFT Module\u003c/h2\u003e\u003cp\u003eThe FFT module estimates the direction of the dominant sound source using a three-microphone array, with one microphone each covering the left, center, and right audio fields. Each microphone captures incoming audio signals, which are processed using the Fast Fourier Transform (FFT) to extract frequency-domain representations. The system computes the energy distribution and amplitude differences across the three microphone channels, allowing it to localize the most prominent sound source based on comparative intensity. Direction estimation is achieved by identifying the microphone channel with the highest energy within the relevant frequency range of speech. 
This directional index (left, center, or right) is sent to the central control process for further decision-making. The module operates independently in parallel with the other subsystems.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e3.4 Interaction Module\u003c/h2\u003e\u003cp\u003eThe interaction module handles voice-based communication between the system and users. It has three stages of operation: speech recognition, response matching, and speech synthesis. Audio input is captured from a primary microphone and transcribed using the VOSK offline speech recognition engine. The transcribed text is compared against a predefined dataset of questions and responses stored in CSV format in the central control system. If a match or approximate match is found, the corresponding reply is selected. This text is then passed to the gTTS (Google Text-to-Speech) engine, which generates the speech audio that is played back to the user. The entire pipeline is designed to work offline, ensuring robust operation without requiring internet connectivity.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e3.5 Multimodal Switching Module\u003c/h2\u003e\u003cp\u003eThe AI-based humanoid robot system is designed to operate in a context-aware manner, dynamically switching between visual and audio modes. This adaptive control ensures efficient focus management and supports natural, human-like interactions [10]. The switching logic determines whether the system should prioritize visual cues or rely on audio input, based on environmental context and user presence. The overall operational logic of the system is summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"4. 
Hardware Implementation","content":"\u003cp\u003eThe hardware subsystem is responsible for translating the software's perceptual and decision-making commands into physical, expressive movements. The design prioritizes low cost, modularity, and replicability, using widely available components and 3D printing technology. The core of the actuation system consists of five servo motors, controlled by a microcontroller, enabling precise control of the robot's eyes, jaw, and neck.\u003c/p\u003e\u003cp\u003eThe speech recognition component uses Whisper-Tiny, a compact variant of the Whisper model developed by OpenAI [5], known for its robustness to noise and multilingual capabilities. Audio input captured via a microphone is first passed through a spectral gating filter to suppress background noise, thereby enhancing transcription accuracy with minimal computational overhead.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003e4.1 Mechanical Design and Fabrication\u003c/h2\u003e\u003cp\u003eThe robotic face is constructed on a lightweight, rigid frame designed using CAD software and fabricated via Fused Deposition Modeling (FDM) 3D printing with Polylactic Acid (PLA) filament. The face structure was sourced from an open-source repository and modified to include custom mounting points for the servo motors and mechanical linkages. This biomimetic design approximates the key articulatory features of a human face, providing a foundation for expressive non-verbal communication. 
The final assembly, shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, presents an abstract yet recognizable humanoid visage.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Actuation System\u003c/h2\u003e\u003cp\u003eThe primary actuators are five standard hobbyist servo motors (e.g., SG90 or MG90S analogues), chosen for their low cost, adequate torque for this application, and ubiquity. Each servo provides positional control over a specific degree of freedom (DoF), as detailed in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eServo Channel Mapping\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePCA9685 Channel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAssigned Component\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMovement Type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eDoF Description\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLeft Eye\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eHorizontal\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eControls the panning motion of the left eyeball.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRight Eye\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eHorizontal\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eControls the panning motion of the right eyeball.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLeft Jaw\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eVertical\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eControls the vertical displacement of the left side of the jaw.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRight Jaw\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eVertical\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eControls the vertical displacement of the right side of the jaw.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNeck\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eHorizontal\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eControls the panning (left-right rotation) of the entire head.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e4.4 Jaw Actuation (Channels 2 \u0026amp; 3):\u003c/h2\u003e\u003cp\u003eTwo servos actuate the jaw along a single pivot axis, ensuring balanced, stable movement. They are activated synchronously only during speech synthesis playback. The jaw is calibrated to open in proportion to the amplitude of the audio signal or according to a pre-defined sequence, simulating syllabic movement during speech (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003e4.5 Neck Actuation (Channel 4):\u003c/h2\u003e\u003cp\u003eThe neck servo provides the most extensive range of motion, enabling the entire head to rotate horizontally. This is primarily used for gross orientation towards a new speaker, either identified by the Vision Module (turning towards a face) or the FFT Module (turning towards a sound source). 
Its movement is dampened and speed-controlled to avoid sudden, jerky motions.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\u003ch2\u003e4.6 Control Electronics\u003c/h2\u003e\u003cp\u003eThe control architecture is designed for scalability and low-latency communication between the high-level software and low-level hardware.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eMicrocontroller: An Arduino Uno microcontroller serves as the robust and simple intermediary between the Python-based control system and the actuators.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eServo Driver: The Arduino is connected to a PCA9685 16-channel 12-bit PWM driver via the I\u0026sup2;C bus. This dedicated driver is essential for managing the precise timing pulses required for multiple servos simultaneously without overloading the Arduino's CPU or its limited number of PWM pins.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eCommunication Protocol: The central software system, running on a host computer (e.g., a laptop or Raspberry Pi), communicates with the Arduino via a serial UART connection at a standard baud rate of 9600 bps. The software sends high-level commands (e.g., EYE, 45 or JAW, 10) which are parsed and executed by the Arduino, which in turn sets the appropriate PWM signals on the PCA9685.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"5. Result Analysis","content":"\u003cp\u003eOur evaluation showed accurate audio localization and lifelike servo actuation. These results are consistent with earlier findings that movement design impacts how humans perceive robot behavior [13]. 
Moreover, tactile and spatial aspects of interaction, such as physiological arousal during contact [4] and proxemics in interaction zones [11], suggest that our system could be extended to richer behavioral experiments.\u003c/p\u003e\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\u003ch2\u003e5.1 Vision Module Performance\u003c/h2\u003e\u003cp\u003eFigures \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e and \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e show the vision module's output under different scenarios.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eIn Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, the system correctly identifies a single person and detects mouth movement to determine the speaking state. When multiple individuals are present, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e, the module identifies both individuals and detects which of them are speaking. Although both are tracked simultaneously, no servo output is generated at this stage; only identification and detection data are pushed to the interaction controller. The system exhibited stable face tracking, with accurate bounding-box alignment and mouth-movement classification in well-lit indoor environments.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec23\" class=\"Section2\"\u003e\u003ch2\u003e5.2 Audio Localization via FFT\u003c/h2\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e illustrates the FFT spectrum produced by the audio module. The plot shows clear frequency peaks corresponding to the active speaker's position relative to the microphone array. During the tests, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e, the FFT module identified the dominant speech direction when no face was visible to the vision module. 
It consistently classified the input into left, center, or right zones based on the FFT module's output. The module functioned well in quiet and moderately noisy conditions but showed reduced accuracy in environments with background noise.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\u003ch2\u003e5.3 Servo Response under Module Control\u003c/h2\u003e\u003cp\u003eServo motor movements were logged and plotted separately under each module's interaction triggers.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAs the left and right eye servos mirror each other\u0026rsquo;s motion with identical degrees of freedom, only Channel 0 (left eye) was analyzed. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e, eye movement responds directly to the horizontal face position detected by the vision module. The plot confirms that the eye servos follow the face smoothly rather than jumping from one angle to another, indicating synchronized gaze direction.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eJaw Servos (Channels 2\u0026ndash;3): Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e shows the behavior of the jaw servos during active interaction. Servo pulses were activated only during response playback, confirming proper integration with the speech module. Motion was brief and repetitive, simulating mouth motion effectively without overextension or jitter.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eNeck Servo (Channel 4): Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e illustrates the behavior of the neck servo when triggered by the FFT module. The servo smoothly rotates toward the estimated angle of the sound source. Transitions were visibly consistent with the audio input classification (left, center, right).\u003c/p\u003e\u003c/div\u003e"},{"header":"6. 
Limitations and Future Work","content":"\u003cp\u003eWhile this paper presented a technical evaluation, future work will include a user study using validated metrics [11] to assess the interaction's perceived naturalness and likeability quantitatively. In close-range scenarios, silent facial gestures were occasionally misclassified as speech activity. This led to incorrect speaker designation, causing the system to shift attention unnecessarily. FFT-based localization was highly susceptible to environmental noise, resulting in occasional misdirection in cluttered auditory environments. The interaction module relies on a pre-structured dataset of response templates. As a result, several responses were contextually inaccurate, especially when user input deviated from expected phrases or was distorted by background noise.\u003c/p\u003e"},{"header":"7. Conclusion","content":"\u003cp\u003eThis paper presents the development of an AI-based humanoid chat Robot, a stationary robot capable of engaging in human-like interactions. The system integrates real-time face and mouth tracking using the MediaPipe framework, FFT-based audio direction detection, and a CSV-driven voice interaction mechanism. The modular architecture enables the vision, audio, and interaction modules to operate independently while communicating via shared control logic. The proposed multimodal switching strategy ensures that robots exhibit natural behavior. Outcomes demonstrate stable face tracking, accurate audio localization, and smooth servo action across multiple scenarios, including single-person and multiple-person interactions. This work offers a low-cost, scalable platform suitable for HRI research, educational applications, and interactive robotics experimentation. 
Future improvements for this project may include emotion recognition, dynamic speech generation, and enhanced person-tracking and machine learning to make human-robot interaction more natural.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData Availability Statement:\u003c/strong\u003e Data sets generated during the current study are available from the corresponding author on reasonable request. \u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eEthics Approval:\u003c/strong\u003e This is an engineering experimental study. The AIUB Research Ethics Committee has confirmed that no ethical approval is required.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eFunding declarations:\u003c/strong\u003e Self-Funded\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eEthics, Consent to Participate, and Consent to Publish declarations:\u003c/strong\u003e not applicable\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eCompeting Interest declaration:\u003c/strong\u003e no Competing Interests\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions:\u003c/strong\u003e \u003cstrong\u003eM Tanseer Ali\u003c/strong\u003e: Conceptualization, Methodology, Writing - Original; \u003cstrong\u003eMahidul Islam Nihad\u003c/strong\u003e: Team Lead, Formal analysis and Software; \u003cstrong\u003eShafayet Ullah\u003c/strong\u003e: Validation and Data Curation; \u003cstrong\u003eMd Radowan Sikder\u003c/strong\u003e: 3D design and mechanical integration; \u003cstrong\u003eNavid Rahman Nadvi\u003c/strong\u003e: Control mechanism and electronics integration; \u003cstrong\u003eIshmam Newaz\u003c/strong\u003e: Software implementation and ML integration; and \u003cstrong\u003eAbir Ahmed:\u003c/strong\u003e Funding acquisition and Coordination.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAdmoni, H., \u0026amp; Scassellati, B. (2017). Social eye gaze in human-robot interaction: a review. 
Journal of Human-Robot Interaction, 6(1), 25-63.\u003c/li\u003e\n\u003cli\u003ePaetzel, M., et al. (2020). Towards a robust, affordable and open-source humanoid robot platform. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (HRI)\u003c/li\u003e\n\u003cli\u003eD. Hanson, et al., \u0026quot;Upending the Uncanny Valley,\u0026quot; in Proc. AAAI Conf. Artif. Intell., 2017.\u003c/li\u003e\n\u003cli\u003eS. Al Moubayed, et al., \u0026quot;Furhat: A back-projected human-like robot head for multiparty human-machine interaction,\u0026quot; in Cognitive Behavioural Systems. Springer, 2012.\u003c/li\u003e\n\u003cli\u003eC. Breazeal, \u0026quot;Designing Sociable Robots,\u0026quot; MIT Press, 2002.\u003c/li\u003e\n\u003cli\u003eGrondin, F., et al. (2020). Recurrent neural networks for broadband source localization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\u003c/li\u003e\n\u003cli\u003eHuang, S., Zhao, R., Bjorling, L., Lin, Z., Zhong, H., Yuan, Y., Zhu, H., \u0026hellip; \u0026amp; Abbeel, P. (2025). Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot. arXiv preprint arXiv:2504.17249. https://arxiv.org/abs/2504.17249.\u003c/li\u003e\n\u003cli\u003eHaresamudram, K., Gill, S., Ryoo, J., \u0026amp; Short, E. (2024). The impact of body and voice on perception of agents. Frontiers in Robotics and AI, 11, 1456613. https://doi.org/10.3389/frobt.2024.1456613\u003c/li\u003e\n\u003cli\u003eLugaresi, C., et al. (2019). MediaPipe: A Framework for Building Perception Pipelines. arXiv preprint arXiv:1906.08172.\u003c/li\u003e\n\u003cli\u003eGarcia, F., et al. (2020). A multimodal architecture for human-robot communication. IEEE Transactions on Cognitive and Developmental Systems, 12(3), 541-554.\u003c/li\u003e\n\u003cli\u003eBartneck, C., et al. (2020). 
Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International Journal of Social Robotics\u003c/li\u003e\n\u003cli\u003eC. Breazeal, \u0026ldquo;Emotion and sociable humanoid robots,\u0026rdquo; Int. J. Hum.-Comput. Stud., vol. 59, no. 1\u0026ndash;2, pp. 119\u0026ndash;155, 2003.\u003c/li\u003e\n\u003cli\u003eT. Fong, I. Nourbakhsh, and K. Dautenhahn, \u0026ldquo;A survey of socially interactive robots,\u0026rdquo; Robot. Auton. Syst., vol. 42, no. 3\u0026ndash;4, pp. 143\u0026ndash;166, 2003.\u003c/li\u003e\n\u003cli\u003eK. Dautenhahn, \u0026ldquo;Socially intelligent robots: Dimensions of human\u0026ndash;robot interaction,\u0026rdquo; Philos. Trans. R. Soc. B Biol. Sci., vol. 362, no. 1480, pp. 679\u0026ndash;704, 2007.\u003c/li\u003e\n\u003cli\u003eJ. Li, W. Ju, and B. Reeves, \u0026ldquo;Touching a mechanical body: Tactile contact with a humanoid robot is physiologically arousing,\u0026rdquo; J. Hum.-Robot Interact., vol. 6, no. 3, pp. 118\u0026ndash;130, 2017.\u003c/li\u003e\n\u003cli\u003eN. Mavridis, \u0026ldquo;A review of verbal and non-verbal human\u0026ndash;robot interactive communication,\u0026rdquo; Robot. Auton. Syst., vol. 63, pp. 22\u0026ndash;35, 2015.\u003c/li\u003e\n\u003cli\u003eB. Scassellati, \u0026ldquo;Foundations for a theory of mind for a humanoid robot,\u0026rdquo; Artif. Intell., vol. 107, no. 1, pp. 85\u0026ndash;121, 2001.\u003c/li\u003e\n\u003cli\u003eT. Nomura, T. Kanda, and T. Suzuki, \u0026ldquo;Experimental investigation into influence of negative attitudes toward robots on human\u0026ndash;robot interaction,\u0026rdquo; AI Soc., vol. 20, no. 2, pp. 138\u0026ndash;150, 2006.\u003c/li\u003e\n\u003cli\u003eN. Sebanz, H. Bekkering, and G. Knoblich, \u0026ldquo;Joint action: Bodies and minds moving together,\u0026rdquo; Trends Cogn. Sci., vol. 10, no. 2, pp. 70\u0026ndash;76, 2006.\u003c/li\u003e\n\u003cli\u003eG. Castellano, S. Paiva, and A. 
Pereira, \u0026ldquo;Affective expression in robots for HRI,\u0026rdquo; in Proc. IEEE Int. Symp. Robot Hum. Interact. Commun. (RO-MAN), 2010, pp. 473\u0026ndash;480.\u003c/li\u003e\n\u003cli\u003eK. Kompatsiari, F. Ciardo, V. Tikhanoff, G. Metta, and A. Wykowska, \u0026ldquo;On the role of eye contact in gaze cueing,\u0026rdquo; Sci. Rep., vol. 9, no. 1, pp. 1\u0026ndash;10, 2019.\u003c/li\u003e\n\u003cli\u003eM. L. Walters, K. Dautenhahn, R. te Boekhorst, K. L. Koay, C. Kaouri, S. Woods, C. L. Nehaniv, D. Lee, and I. Werry, \u0026ldquo;The influence of subjects\u0026rsquo; personality traits on personal spatial zones in a human\u0026ndash;robot interaction experiment,\u0026rdquo; in Proc. IEEE Int. Workshop Robot Hum. Interact. Commun. (RO-MAN), 2008, pp. 347\u0026ndash;352.\u003c/li\u003e\n\u003cli\u003eA. D. Dragan, K. C. Lee, and S. S. Srinivasa, \u0026ldquo;Legibility and predictability of robot motion,\u0026rdquo; in Proc. 8th ACM/IEEE Int. Conf. Hum.-Robot Interact. (HRI), 2013, pp. 301\u0026ndash;308.\u003c/li\u003e\n\u003cli\u003eG. Hoffman and W. Ju, \u0026ldquo;Designing robots with movement in mind,\u0026rdquo; J. Hum.-Robot Interact., vol. 3, no. 1, pp. 89\u0026ndash;122, 2014.\u003c/li\u003e\n\u003cli\u003eT. Kanda and H. Ishiguro, Human\u0026ndash;Robot Interaction in Social Robotics. CRC Press, 2013.\u003c/li\u003e\n\u003cli\u003eA. Aly and A. Tapus, \u0026ldquo;A model for synthesizing a combined verbal and nonverbal behavior based on personality traits in human\u0026ndash;robot interaction,\u0026rdquo; in Proc. Int. Conf. Adv. Robot. (ICAR), 2013, pp. 1\u0026ndash;6.\u003c/li\u003e\n\u003cli\u003eS. Lemaignan, R. Ros, E. A. Sisbot, R. Alami, and M. Beetz, \u0026ldquo;Artificial cognition for social human\u0026ndash;robot interaction: An implementation,\u0026rdquo; Artif. Intell., vol. 247, pp. 45\u0026ndash;69, 2017.\u003c/li\u003e\n\u003cli\u003eB. Mutlu, J. Forlizzi, and J. 
Hodgins, \u0026ldquo;A storytelling robot: Modeling and evaluation of human-like gaze behavior,\u0026rdquo; in Proc. 6th IEEE-RAS Int. Conf. Humanoid Robots, 2006, pp. 518\u0026ndash;523.\u003c/li\u003e\n\u003cli\u003eH. G. Okuno, K. Nakadai, and H. Kitano, \u0026ldquo;Realizing audio-visual integration in humanoid robots,\u0026rdquo; in Proc. Nat. Conf. Artif. Intell. (AAAI), 2002, pp. 1248\u0026ndash;1253.\u003c/li\u003e\n\u003cli\u003eA. L. Thomaz and C. Breazeal, \u0026ldquo;Teachable robots: Understanding human teaching behavior to build more effective robot learners,\u0026rdquo; Artif. Intell., vol. 172, no. 6\u0026ndash;7, pp. 716\u0026ndash;737, 2008.\u003c/li\u003e\n\u003cli\u003eD. Vernon, G. Metta, and G. Sandini, \u0026ldquo;A survey of artificial cognitive systems: Implications for the autonomous development of mental capabilities in computational agents,\u0026rdquo; IEEE Trans. Evol. Comput., vol. 11, no. 2, pp. 151\u0026ndash;180, 2007.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Computing Methodologies, Embedded systems, Human–computer interaction (HCI), Natural Language 
Interfaces","lastPublishedDoi":"10.21203/rs.3.rs-8054965/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8054965/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eHuman-Robot Interaction (HRI) is a rapidly growing field that enables socially meaningful communication between humans and robotic systems. Most current robotic platforms rely heavily on visual or auditory cues, but few achieve seamless integration of both in a dynamic, context-aware manner. Motivated by the need for more natural, human-like interaction, this paper presents the development of an AI-based humanoid Chat Robot designed as a stationary robotic face capable of real-time multimodal interaction. The presented work integrates facial and mouth movement detection using the MediaPipe framework, auditory direction detection using Fast Fourier Transform (FFT) on multi-microphone input, and rule-based voice interaction using a dynamic CSV dataset. A core switching logic governs attention shifts between vision and audio based on environmental cues, ensuring robust and adaptive interaction. The robot's 3D design features a natural, humane facial structure, including servo-controlled eyes, jaw, and neck, which offer expressive motion to reinforce engagement. Evaluation across varied single and multi-user interaction scenarios demonstrates accurate speaker tracking, reliable audio localization, and smooth servo actuation. 
The system provides a low-cost, modular platform suitable for HRI research, educational applications, and experimental Social Robotics.\u003c/p\u003e","manuscriptTitle":"An Integrated Vision-Audio Architecture for Responsive Humanoid Robot Interaction with Dynamic Attention Switching","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-28 14:07:50","doi":"10.21203/rs.3.rs-8054965/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"6f30a6fa-5f1a-4375-9c9e-f9e24315490a","owner":[],"postedDate":"November 28th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-12-24T12:24:51+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-28 14:07:50","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8054965","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8054965","identity":"rs-8054965","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
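
The serial command format described in Section 4.6 (Control Electronics) can be illustrated with a short sketch. It is not the authors' firmware, which runs on the Arduino; it only models, in Python, how a text command such as `EYE, 45` might be parsed and converted into a 12-bit PCA9685 tick count. The channel groups follow Table 1. The `NECK` keyword, the 50 Hz refresh rate, the 500 to 2500 microsecond pulse range, and all function names are assumptions made for illustration, and sending the same value to both channels of a pair is a simplification.

```python
# Hypothetical sketch of the command path in Section 4.6; not the authors' code.
PWM_FREQ_HZ = 50        # assumed servo refresh rate
MIN_PULSE_US = 500      # assumed pulse width at 0 degrees
MAX_PULSE_US = 2500     # assumed pulse width at 180 degrees
RESOLUTION = 4096       # the PCA9685 is a 12-bit driver (stated in the text)

# Channel groups follow Table 1; the NECK keyword is an assumption.
CHANNELS = {"EYE": (0, 1), "JAW": (2, 3), "NECK": (4,)}


def parse_command(line):
    """Split a command such as 'EYE, 45' into a name and an angle in degrees."""
    name, _, value = line.partition(",")
    name = name.strip().upper()
    if name not in CHANNELS:
        raise ValueError(f"unknown command: {name!r}")
    angle = float(value.strip())
    if not 0 <= angle <= 180:
        raise ValueError(f"angle out of range: {angle}")
    return name, angle


def angle_to_ticks(angle):
    """Map an angle in [0, 180] to the PWM on-time in 12-bit ticks."""
    pulse_us = MIN_PULSE_US + (MAX_PULSE_US - MIN_PULSE_US) * angle / 180.0
    period_us = 1_000_000 / PWM_FREQ_HZ
    return round(pulse_us * RESOLUTION / period_us)


def plan(line):
    """Return (channel, ticks) pairs that one command would set on the driver."""
    name, angle = parse_command(line)
    ticks = angle_to_ticks(angle)
    return [(channel, ticks) for channel in CHANNELS[name]]


if __name__ == "__main__":
    print(plan("EYE, 45"))
    print(plan("JAW, 10"))
```

Keeping the parsing and the angle-to-tick conversion in separate functions makes it easy to substitute calibrated pulse limits for a particular servo model without touching the command format.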
